Bug 66142 - Loop is not vectorized because not sufficient support for GOMP_SIMD_LANE
Summary: Loop is not vectorized because not sufficient support for GOMP_SIMD_LANE
Status: REOPENED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 6.0
: P3 normal
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on: 63916 66379
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2015-05-14 14:18 UTC by Yuri Rumyantsev
Modified: 2021-10-01 02:56 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2015-05-26 00:00:00


Attachments
test-case to reproduce (539 bytes, text/x-csrc)
2015-05-14 14:19 UTC, Yuri Rumyantsev
Details
gcc6-pr66142.patch (427 bytes, patch)
2015-05-22 11:54 UTC, Jakub Jelinek
Details | Diff
test-case to reproduce (609 bytes, text/x-csrc)
2016-03-11 15:14 UTC, Yuri Rumyantsev
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Yuri Rumyantsev 2015-05-14 14:18:04 UTC
The attached test-case compiled with "-Ofast -fopenmp -march=core-avx2" options contains loop marked with pragma omp simd which is not vectorized:
t1.cpp:66:20: note: not vectorized: data ref analysis failed MEM[(struct vec_ *)_9] = 1.0e+0;
where index is defined as
  _12 = GOMP_SIMD_LANE (simduid.0_11(D));
  _9 = &D.4231[_12].org;

Note that this test was extracted from important benchmark and loop in question is vectorized if we revert fix for 59984.
Comment 1 Yuri Rumyantsev 2015-05-14 14:19:34 UTC
Created attachment 35541 [details]
test-case to reproduce

Must be compiled with -Ofast -fopenmp -march=core-avx2 options.
Comment 2 Jakub Jelinek 2015-05-22 10:09:12 UTC
On this testcase, I believe it is most important to change FRE and SRA.
Without -fopenmp, we have before fre2:
  MEM[(struct ray_ *)&ray] = 1.0e+0;
  MEM[(struct ray_ *)&ray + 4B] = _8;
  ray.dir.x = x_10;
  ray.dir.y = y_11;
...
  _23 = ray.org.x;
  _26 = ray.org.y;
  _29 = ray.dir.x;
  _31 = ray.dir.y;
and fre2 optimizes this into the values stored in there.  With -fopenmp, we have the SIMD_LANE based array references:
  _12 = GOMP_SIMD_LANE (simduid.0_11(D));
  _9 = &D.2881[_12].org;
  MEM[(struct vec_ *)_9] = 1.0e+0;
  MEM[(struct vec_ *)_9 + 4B] = _8;
  D.2881[_12].dir.x = x_13;
  D.2881[_12].dir.y = y_14;
...
  _26 = MEM[(const struct Ray *)&D.2881][_12].org.x;
  _29 = MEM[(const struct Ray *)&D.2881][_12].org.y;
  _32 = MEM[(const struct Ray *)&D.2881][_12].dir.x;
  _34 = MEM[(const struct Ray *)&D.2881][_12].dir.y;
that fre2 isn't able to optimize.  Doing something here on the vectorizer side is too late, we really need to scalarize those or remove completely (through fre2).
Comment 3 Jakub Jelinek 2015-05-22 11:12:40 UTC
Ah, but fre2 is able to optimize those:
  D.2881[_12].dir.x = x_13;
...
  _32 = MEM[(const struct Ray *)&D.2881][_12].dir.x;
cases to
  _32 = x_13;
The problem in FRE2 is only with the MEM_REFs.
If in the testcase
-      ray.org = p;
+      ray.org.x = p.x;
+      ray.org.y = p.y;
then FRE2 handles it completely and the problem is the same between -fopenmp and -fno-openmp - ifcvt not ifconverting the loop.
Comment 4 Jakub Jelinek 2015-05-22 11:40:15 UTC
Ah, and the reason for the ifcvt failure is that we have an unoptimized
virtual PHI in the loop that ifcvt gives up on:
  # .MEM_55 = PHI <.MEM_55(7), .MEM_5(D)(2)>
Only phicprop2 pass is able to optimize that away (and replace all uses
of .MEM_55 with .MEM_5(D)).
Comment 5 Jakub Jelinek 2015-05-22 11:54:33 UTC
Created attachment 35596 [details]
gcc6-pr66142.patch

Fix for the ifcvt issue.  With this the modified testcase is vectorized.
Comment 6 Jakub Jelinek 2015-05-22 14:51:47 UTC
As for the FRE (actually SCCVN) issue, seems this affects more than GOMP_SIMD_LANE, seems there is insufficient canonicalization of the references performed by SCCVN, so the same references, but in slightly different form, aren't considered the same.
Please see e.g.
struct A { float x, y; };
struct B { struct A u, v; };
void bar (struct A *);

float
f1 (struct B *x, int y)
{
  struct A p;
  p.x = 1.0f;
  p.y = 2.0f;
  struct A *q = &x[y].u;
  *q = p;
  x[y].v.x = 3.0f;
  x[y].v.y = 4.0f;
  float f = x[y].u.x + x[y].u.y + x[y].v.x + x[y].v.y;
  bar (&p);
  return f;
}

float
f2 (struct B *x, int y)
{
  struct A p;
  p.x = 1.0f;
  p.y = 2.0f;
  x[y].u = p;
  x[y].v.x = 3.0f;
  x[y].v.y = 4.0f;
  float f = x[y].u.x + x[y].u.y + x[y].v.x + x[y].v.y;
  bar (&p);
  return f;
}

float
f3 (struct B *x, int y)
{
  struct A p;
  p.x = 1.0f;
  p.y = 2.0f;
  struct A *q = &x[y].u;
  __builtin_memcpy (&q, &p.x, sizeof (float));
  __builtin_memcpy (&q->y, &p.y, sizeof (float));
  *q = p;
  x[y].v.x = 3.0f;
  x[y].v.y = 4.0f;
  float f = x[y].u.x + x[y].u.y + x[y].v.x + x[y].v.y;
  bar (&p);
  return f;
}

float
f4 (struct B *x, int y)
{
  struct A p;
  p.x = 1.0f;
  p.y = 2.0f;
  __builtin_memcpy (&x[y].u.x, &p.x, sizeof (float));
  __builtin_memcpy (&x[y].u.y, &p.y, sizeof (float));
  x[y].v.x = 3.0f;
  x[y].v.y = 4.0f;
  float f = x[y].u.x + x[y].u.y + x[y].v.x + x[y].v.y;
  bar (&p);
  return f;
}

at -O2, FRE is able to optimize f2 and f4, but not f1 and f3, despite all 4 functions doing the same thing.
Comment 7 Jakub Jelinek 2015-05-25 09:34:57 UTC
Slightly improved variant of #c6 testcase:
struct A { float x, y; };
struct B { struct A u; };
void bar (struct A *);

float
f1 (struct B *x, int y)
{
  struct A p;
  p.x = 1.0f;
  p.y = 2.0f;
  struct A *q = &x[y].u;
  *q = p;
  float f = x[y].u.x + x[y].u.y;
  bar (&p);
  return f;
}

float
f2 (struct B *x, int y)
{
  struct A p;
  p.x = 1.0f;
  p.y = 2.0f;
  x[y].u = p;
  float f = x[y].u.x + x[y].u.y;
  bar (&p);
  return f;
}

float
f3 (struct B *x, int y)
{
  struct A p;
  p.x = 1.0f;
  p.y = 2.0f;
  struct A *q = &x[y].u;
  __builtin_memcpy (&q->x, &p.x, sizeof (float));
  __builtin_memcpy (&q->y, &p.y, sizeof (float));
  *q = p;
  float f = x[y].u.x + x[y].u.y;
  bar (&p);
  return f;
}

float
f4 (struct B *x, int y)
{
  struct A p;
  p.x = 1.0f;
  p.y = 2.0f;
  __builtin_memcpy (&x[y].u.x, &p.x, sizeof (float));
  __builtin_memcpy (&x[y].u.y, &p.y, sizeof (float));
  float f = x[y].u.x + x[y].u.y;
  bar (&p);
  return f;
}
Comment 8 Jakub Jelinek 2015-05-25 12:48:48 UTC
I guess it depends on what exactly SCCVN uses the operand vectors created/optimized by copy_reference_ops_from_ref/valueize_refs_1 for.
If it is used only for tracking what is the value of the reference (the value stored into that piece of memory), or also for canonicalization of references to certain more canonical form, or both.  If only the value of the reference, then
supposedly it would be nice to get far more canonicalization in the references (e.g. turn (perhaps multiple nested) COMPONENT_REFs into MEM_REFs with offset (except perhaps for bitfields, perhaps even do something about ARRAY_REFs, so that say even:
struct A { float x, y; };
struct B { struct A u; };
void bar (struct A *);

float
f2 (struct B *x, int y)
{
  struct A p;
  p.x = 1.0f;
  p.y = 2.0f;
  char *z = (char *) x;
  z += y * sizeof (struct B);
  z += __builtin_offsetof (struct B, u.y);
  x[y].u = p;
  float f = *(float *) z;
  bar (&p);
  return f;
}
is handled).  But, supposedly it would be undesirable to canonicalize all the COMPONENT_REFs to the MEM_REFs (at least early on, e.g. for __builtin_object_size (, 1) purposes, or aliasing etc.).
Richi, any thoughts on this?
Comment 9 rguenther@suse.de 2015-05-26 08:28:03 UTC
On Mon, 25 May 2015, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66142
> 
> Jakub Jelinek <jakub at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |rguenth at gcc dot gnu.org
> 
> --- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> --- I guess 
> it depends on what exactly SCCVN uses the operand vectors 
> created/optimized by copy_reference_ops_from_ref/valueize_refs_1 for. If 
> it is used only for tracking what is the value of the reference (the 
> value stored into that piece of memory), or also for canonicalization of 
> references to certain more canonical form, or both.  If only the value 

It's only for tracking the value (and to get a canonical form for
its internal hash-tables and compare function).  I think this bug
may be related to PR63916.

> of the reference, then supposedly it would be nice to get far more 
> canonicalization in the references (e.g. turn (perhaps multiple nested) 
> COMPONENT_REFs into MEM_REFs with offset (except perhaps for bitfields, 
> perhaps even do something about ARRAY_REFs, so that say even: struct A { 
> float x, y; }; struct B { struct A u; }; void bar (struct A *);
> 
> float
> f2 (struct B *x, int y)
> {
>   struct A p;
>   p.x = 1.0f;
>   p.y = 2.0f;
>   char *z = (char *) x;
>   z += y * sizeof (struct B);
>   z += __builtin_offsetof (struct B, u.y);
>   x[y].u = p;
>   float f = *(float *) z;
>   bar (&p);
>   return f;
> }
> is handled).  But, supposedly it would be undesirable to canonicalize all the
> COMPONENT_REFs to the MEM_REFs (at least early on, e.g. for
> __builtin_object_size (, 1) purposes, or aliasing etc.).
> Richi, any thoughts on this?

On the f1 testcase it probably fails to look through the aggregate
copy in vn_reference_lookup_3.  I'll have a short look later.
Comment 10 Richard Biener 2015-05-26 08:42:33 UTC
Testing

Index: gcc/tree-ssa-sccvn.c
===================================================================
--- gcc/tree-ssa-sccvn.c        (revision 223574)
+++ gcc/tree-ssa-sccvn.c        (working copy)
@@ -1894,7 +1894,12 @@ vn_reference_lookup_3 (ao_ref *ref, tree
       size2 = lhs_ref.size;
       maxsize2 = lhs_ref.max_size;
       if (maxsize2 == -1
-         || (base != base2 && !operand_equal_p (base, base2, 0))
+         || (base != base2
+             && (TREE_CODE (base) != MEM_REF
+                 || TREE_CODE (base2) != MEM_REF
+                 || TREE_OPERAND (base, 0) != TREE_OPERAND (base2, 0)
+                 || !tree_int_cst_equal (TREE_OPERAND (base, 1),
+                                         TREE_OPERAND (base2, 1))))
          || offset2 > offset
          || offset2 + size2 < offset + maxsize)
        return (void *)-1;

which fixes the comment#7 testcase.
Comment 11 Richard Biener 2015-05-26 13:56:11 UTC
Author: rguenth
Date: Tue May 26 13:55:40 2015
New Revision: 223697

URL: https://gcc.gnu.org/viewcvs?rev=223697&root=gcc&view=rev
Log:
2015-05-26  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/66142
	* tree-ssa-sccvn.c (vn_reference_lookup_3): Manually compare
	MEM_REFs for the same base address.

	* gcc.dg/tree-ssa/ssa-fre-44.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-44.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-sccvn.c
Comment 12 Richard Biener 2015-05-26 14:43:19 UTC
Fixed(?).
Comment 13 Yuri Rumyantsev 2015-05-26 15:44:29 UTC
Original test-case is not vectorized yet with Richard patch for sccvn.
Comment 14 Richard Biener 2015-05-27 07:56:49 UTC
As said in comment #9, this is PR63916 I think.  We have

  _9 = &D.3665[_11].org;
  MEM[(struct vec_ *)_9] = 1.0e+0;
  MEM[(struct vec_ *)_9 + 4B] = _8;
...
  _24 = MEM[(const struct Ray *)&D.3665][_11].org.x;
  _27 = MEM[(const struct Ray *)&D.3665][_11].org.y;

which ultimately is from inlining of

  D.3665[D.3663].org = p;
  D.3665[D.3663].dir.x = x;
  D.3665[D.3663].dir.y = y;
  D.3664[D.3663].t = 1.0e+10;
  D.3664[D.3663].h = 0;
  D.3667 = &D.3665[D.3663];
  D.3668 = &D.3664[D.3663];
  bar (D.3668, D.3667, &spheres[0]);

to

  p.x = 1.0e+0;  
  p.y = _9;
...
  D.3665[_15].org = p;
  D.3665[_15].dir.x = x_16;
  D.3665[_15].dir.y = y_17;
  D.3664[_15].t = 1.0e+10;
  D.3664[_15].h = 0;
  _23 = &D.3665[_15];
  _24 = &D.3664[_15];
  _32 = MEM[(const struct Ray *)_23].org.x;

forwprop cannot forward &D.3665[_15] into the dereference (it would create
a bogus access paths even if types were matching and TBAA would be fine).
And SCCVN isn't able to "canonicalize" the accesses (because it doesn't
look at stmt defs when building the canonical address of an access).
Note that PRE uses the canonical form for expression insertion so we have
to be careful what we put in the canonical form (or separate the address
compute from the actual dereference).

Yes, lowering the access more in the SCCVN machinery would be possible
(say, use the affine combination machinery with the powers of
affine-comb-expand (and making that use the VN lattice...)), but the
issue with this is always PRE insertion and us _not_ wanting to have
all PRE inserted loads be fully lowered to address-compute plus indirection.

I hope to get to PR63916 this stage1.
Comment 15 Richard Biener 2015-05-28 13:25:25 UTC
Author: rguenth
Date: Thu May 28 13:24:53 2015
New Revision: 223816

URL: https://gcc.gnu.org/viewcvs?rev=223816&root=gcc&view=rev
Log:
2015-05-28  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/66142
	* tree-ssa-sccvn.c (vn_reference_lookup_3): Handle non-GIMPLE
	values better in memcpy destination handling.  Handle non-aliasing
	we discover here.

	* gcc.dg/tree-ssa/ssa-fre-44.c: Fixup.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-44.c
    trunk/gcc/tree-ssa-sccvn.c
Comment 16 Jakub Jelinek 2015-05-29 13:06:55 UTC
Author: jakub
Date: Fri May 29 13:06:23 2015
New Revision: 223863

URL: https://gcc.gnu.org/viewcvs?rev=223863&root=gcc&view=rev
Log:
	PR tree-optimization/66142
	* tree-if-conv.c (if_convertible_phi_p): Don't give up on
	virtual phis that feed themselves.

	* gcc.dg/vect/pr66142.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr66142.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-if-conv.c
Comment 17 Richard Biener 2015-06-02 13:18:24 UTC
Ok, so even with PR63916 rudimentary fixed we hit the issue that in

  _9 = &D.3665[_11].org;
  MEM[(struct vec_ *)_9] = 1.0e+0;
  MEM[(struct vec_ *)_9 + 4B] = _8;
...
  _24 = MEM[(const struct Ray *)&D.3665][_11].org.x;

the store to MEM[(struct vec_ *)_9 + 4B] aliases MEM[(const struct Ray *)&D.3665][_11].org.x.

And the rudimentary fix for PR63916 doesn't allow for the forwarding to succeed
into MEM[(struct vec_ *)_9 + 4B] either.

Without SRA we get

  D.3665[_11].org = p;
  D.3665[_11].dir.x = x_12;
  D.3665[_11].dir.y = y_13;
  D.3664[_11].t = 1.0e+10;
  D.3664[_11].h = 0;
  _25 = MEM[(const struct Ray *)&D.3665][_11].org.x;

and the look-through-aggregate-copy code bails out at

      /* See if the assignment kills REF.  */
      base2 = ao_ref_base (&lhs_ref);
      offset2 = lhs_ref.offset;
      size2 = lhs_ref.size;
      maxsize2 = lhs_ref.max_size;
      if (maxsize2 == -1
          || (base != base2
              && (TREE_CODE (base) != MEM_REF
                  || TREE_CODE (base2) != MEM_REF
                  || TREE_OPERAND (base, 0) != TREE_OPERAND (base2, 0)
                  || !tree_int_cst_equal (TREE_OPERAND (base, 1),
                                          TREE_OPERAND (base2, 1))))
          || offset2 > offset
          || offset2 + size2 < offset + maxsize)
        return (void *)-1;

stmt_kills_ref_p fails to compare
MEM[(const struct Ray *)&D.3665][_15].org and D.3665[_15].org
equal via operand_equal_p.
Comment 18 Richard Biener 2015-06-02 13:34:40 UTC
Let's re-open this for the original testcase.
Comment 19 Igor Zamyatin 2015-09-07 15:26:20 UTC
Richard, would you be able to look at this in some time?
Comment 20 Richard Biener 2015-09-14 14:10:39 UTC
Yes, I plan to come back to this during stage3.
Comment 21 Richard Biener 2015-09-17 12:36:45 UTC
Ok, so if we fix SCCVN further we hit its (designed) limitation with stmt
walking not following backedges (but the use is in a loop and the def outside).
We avoid doing the work to determine if sth is loop invariant here.
This was trying to improve things with SRA disabled - I'm going to still push
those changes if testing works out fine.

Now with SRA enabled we have a better situation in this regard but face

  _9 = &D.4309[_12].org;
  MEM[(struct vec_ *)_9] = 1.0e+0;
  MEM[(struct vec_ *)_9 + 4B] = _8;
  D.4309[_12].dir.x = x_13;
  D.4309[_12].dir.y = y_14;
  D.4308[_12].t = 1.0e+10;
  D.4308[_12].h = 0;
  _20 = &D.4308[_12];
  _26 = MEM[(const struct Ray *)&D.4309][_12].org.x;

which as said isn't friendly enough for the simplistic address forward
propagation we have.  The issue here is that we can't disambiguate
against

  MEM[(struct vec_ *)_9 + 4B] = _8;

when looking up MEM[(const struct Ray *)&D.4309][_12].org.x; because
vn_reference_maybe_forwprop_address doesn't forward into MEM[(struct vec_ *)_9 + 4B] as its offset is not zero:

      /* If that didn't work because the address isn't invariant propagate
         the reference tree from the address operation in case the current
         dereference isn't offsetted.  */
      if (!addr_base
          && *i_p == ops->length () - 1
          && off == 0
          /* This makes us disable this transform for PRE where the
             reference ops might be also used for code insertion which
             is invalid.  */
          && default_vn_walk_kind == VN_WALKREWRITE)
        {
          auto_vec<vn_reference_op_s, 32> tem;
          copy_reference_ops_from_ref (TREE_OPERAND (addr, 0), &tem);
          ops->pop ();
          ops->pop ();
          ops->safe_splice (tem);
          --*i_p;
          return true;

it also likely wouldn't help vn_reference_lookup_3 to do the disambiguation
as that performs just

      lhs_ops = valueize_refs_1 (lhs_ops, &valueized_anything);
      if (valueized_anything)
        {
          lhs_ref_ok = ao_ref_init_from_vn_reference (&lhs_ref,
                                                      get_alias_set (lhs),
                                                      TREE_TYPE (lhs), lhs_ops);
          if (lhs_ref_ok
              && !refs_may_alias_p_1 (ref, &lhs_ref, true))
            return NULL;

but there is still the variable offset to cope with.  We'd need to perform
some disambiguation on lhs_ref vs. vr.

But it's not clear how to represent &D.4309[_12].org forwarded into
MEM[(struct vec_ *)_9 + 4B].  It's kind-of a COMPONENT_REF with just
a offset added.

Let's collect the improvements I have sofar in my local tree and see if they
otherwise work.
Comment 22 Richard Biener 2015-09-18 07:57:32 UTC
Author: rguenth
Date: Fri Sep 18 07:57:00 2015
New Revision: 227896

URL: https://gcc.gnu.org/viewcvs?rev=227896&root=gcc&view=rev
Log:
2015-09-18  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/66142
	* fold-const.c (operand_equal_p): When OEP_ADDRESS_OF
	treat MEM[&x] and x the same.
	* tree-ssa-sccvn.h (vn_reference_fold_indirect): Remove.
	* tree-ssa-sccvn.c (vn_reference_fold_indirect): Return true
	when we simplified sth.
	(vn_reference_maybe_forwprop_address): Likewise.
	(valueize_refs_1): When we simplified through
	vn_reference_fold_indirect or vn_reference_maybe_forwprop_address
	set valueized_anything to true.
	(vn_reference_lookup_3): Use stmt_kills_ref_p to see whether
	one ref kills the other instead of just a offset-based test.
	* tree-ssa-alias.c (stmt_kills_ref_p): Use OEP_ADDRESS_OF
	for the operand_equal_p test to compare bases and also compare
	sizes.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/fold-const.c
    trunk/gcc/tree-ssa-alias.c
    trunk/gcc/tree-ssa-sccvn.c
    trunk/gcc/tree-ssa-sccvn.h
Comment 23 Yuri Rumyantsev 2015-12-08 09:25:11 UTC
Richard,

Do we have any chance to vectorize attached test-case using GCC6 compiler?
Comment 24 Yuri Rumyantsev 2015-12-08 09:25:53 UTC
Richard,

Do we have any chance to vectorize attached test-case using GCC6 compiler?
Comment 25 rguenther@suse.de 2015-12-08 12:12:17 UTC
On Tue, 8 Dec 2015, ysrumyan at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66142
> 
> --- Comment #23 from Yuri Rumyantsev <ysrumyan at gmail dot com> ---
> Richard,
> 
> Do we have any chance to vectorize attached test-case using GCC6 compiler?

No, I don't see any.
Comment 26 Yuri Rumyantsev 2016-03-11 15:13:17 UTC
If we convert copy structures to copy structure fields test will be vectorized and all mentions of GOMP_SIMD_LANE will be deleted. But if we slightly modify test by introducing new function vdot and insert its call:
   b = r.x * ray->dir.x + r.y * ray->dir.y;
                     |
                     v
   b = vdot (r, ray->dir);
test won't be vectorized:
test2.cpp:70:9: note: not vectorized: not suitable for scatter store D.6062[_9].org.x = 1.0e+0;

test2.cpp is attached.
Comment 27 Yuri Rumyantsev 2016-03-11 15:14:51 UTC
Created attachment 37940 [details]
test-case to reproduce

Need to be compiled with -Ofast -mavx2 -fopenmp options.