Bug 59984 - OpenMP pragma makes loop incorrect
Summary: OpenMP pragma makes loop incorrect
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.9.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: openmp, wrong-code
Depends on:
Blocks:
 
Reported: 2014-01-29 14:41 UTC by Stupachenko Evgeny
Modified: 2024-09-21 02:55 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2014-02-05 00:00:00


Attachments
test case (436 bytes, application/x-gzip)
2014-01-29 14:42 UTC, Stupachenko Evgeny
Details
Untested fix (470 bytes, patch)
2014-02-07 15:20 UTC, Jakub Jelinek
Details | Diff
test case where pragma simd disable vectorization (299 bytes, text/x-csrc)
2014-11-13 23:51 UTC, Stupachenko Evgeny
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Stupachenko Evgeny 2014-01-29 14:41:22 UTC
GCC generates incorrect code for the SIMD loop (#pragma [omp] simd) with function call that have pointer parameters.

(test case attached)

gcc test.c foo.c -fcilkplus -O3
./a.out ; echo $?
1
gcc test.c foo.c -fcilkplus -O0
./a.out ; echo $?                                                                 
0

the same for -fopenmp -DENABLE_OMP instead of -fcilkplus

Moreover g++ crashes on the test:

g++ test.c -fcilkplus -O3 -c
test.c: In function 'int main()':
test.c:10:5: error: dead STMT in EH table
 int main()
     ^
# .MEM_11 = VDEF <.MEM_39>
foo (_10, &v1, &v2);
test.c:10:5: internal compiler error: verify_gimple failed
0xb774b3 verify_gimple_in_cfg(function*)
        /export/users/mstester/stability/svn/trunk/gcc/tree-cfg.c:4853
0xa9fc52 execute_function_todo
        /export/users/mstester/stability/svn/trunk/gcc/passes.c:1853
0xaa05a3 execute_todo
        /export/users/mstester/stability/svn/trunk/gcc/passes.c:1887
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.
Comment 1 Stupachenko Evgeny 2014-01-29 14:42:22 UTC
Created attachment 31981 [details]
test case
Comment 2 Marek Polacek 2014-02-05 18:42:16 UTC
Confirmed.
Comment 3 Igor Zamyatin 2014-02-06 12:59:47 UTC
Vectorizer dump snippet for main:

  foo.simdclone.0 (vect__12.7_3, vect_cst_.8_53, vect_cst_.8_53, vect_cst_.9_51, vect_cst_.9_51);
  GIMPLE_NOP
  vect_v1.12_37 = MEM[(int *)vectp_v1.10_39];   (1)
  v1.0_14 = v1;
  vect_v2.16_60 = MEM[(int *)vectp_v2.14_58];   (2)
  v2.1_15 = v2;
  vect__16.18_63 = vect_cst_.13_7 * vect_cst_.17_62;  <--- constants instead of 
  _16 = v1.0_14 * v2.1_15;                                 vect_v1.12_37 and
  MEM[(int *)vectp_a.19_65] = vect__16.18_63;              vect_v2.16_60 


Then DCE destroys (1) and (2) and later LIM hoists the multiplication away from the loop.
Comment 4 Igor Zamyatin 2014-02-07 08:33:58 UTC
vect details show that v1.0_14 = v1 and v2.1_15 = v2 are treated as invariants:

test.c:24:14: note: ------>vectorizing statement: v1.0_14 = v1;
 
test.c:24:14: note: transform statement.
test.c:24:14: note: transform load. ncopies = 1
test.c:24:14: note: create vector_type-pointer variable to type: vector(4) int  vectorizing a pointer ref: v1
test.c:24:14: note: created vectp_v1.11_40
test.c:24:14: note: add new stmt: vect_v1.12_37 = MEM[(int *)vectp_v1.10_39];
 
test.c:24:14: note: hoisting out of the vectorized loop: v1.0_14 = v1;
 
test.c:24:14: note: created new init_stmt: vect_cst_.13_7 = {v1.0_8, v1.0_8, v1.0_8, v1.0_8};
 
test.c:24:14: note: ------>vectorizing statement: v2.1_15 = v2;
 
test.c:24:14: note: transform statement.
test.c:24:14: note: transform load. ncopies = 1
test.c:24:14: note: create vector_type-pointer variable to type: vector(4) int  vectorizing a pointer ref: v2
test.c:24:14: note: created vectp_v2.15_1
test.c:24:14: note: add new stmt: vect_v2.16_60 = MEM[(int *)vectp_v2.14_58];
 
test.c:24:14: note: hoisting out of the vectorized loop: v2.1_15 = v2;
 
test.c:24:14: note: created new init_stmt: vect_cst_.17_62 = {v2.1_61, v2.1_61, v2.1_61, v2.1_61};


Step for both loads determined as 0. Seems support for such case should be explicitly added
Comment 5 Jakub Jelinek 2014-02-07 14:34:50 UTC
I guess during gimplification when gimplifying #pragma omp simd body we should add all the local variables (declared in the body) that aren't static and are address taken to private clause on the #pragma omp simd, so that they will be handled like if you in your testcase do instead:
    int v1, v2;
#pragma omp simd private (v1, v2)
    for(i = 0; i < 1000; i++)
    {
      foo(a[i], &v1, &v2);
      a[i] = v1 * v2;
    }

We don't vectorize that right now, but that is something that is likely fixable.
Comment 6 Jakub Jelinek 2014-02-07 15:20:12 UTC
Created attachment 32077 [details]
Untested fix

Untested fix.  Likely I'll need to do something more, so that we don't emit
extra decls for the debug info outside of the #pragma omp simd (flag on the private clause?) and add a testcase.

And, the reason why with this patch (or the adjusted testcase with explicit private clause) isn't vectorized is that we'd need to teach vectorizer how to vectorize SSA_NAME = &array[SSA_NAME]; where the second SSA_NAME is GOMP_SIMD_LANE () result and array is an "omp simd array" array, as a vector
of addresses (initialized before the loop), containing { &array[0], &array[1], ..., &array[vf-1] }.
Comment 7 Jakub Jelinek 2014-02-08 09:10:46 UTC
Author: jakub
Date: Sat Feb  8 09:10:14 2014
New Revision: 207629

URL: http://gcc.gnu.org/viewcvs?rev=207629&root=gcc&view=rev
Log:
	PR c/59984
	* gimplify.c (gimplify_bind_expr): In ORT_SIMD region
	mark local addressable non-static vars as GOVD_PRIVATE
	instead of GOVD_LOCAL.
	* omp-low.c (lower_omp_for): Move gimple_bind_vars
	and BLOCK_VARS of gimple_bind_block to new_stmt rather
	than copying them.

	* gcc.dg/vect/pr59984.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr59984.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gimplify.c
    trunk/gcc/omp-low.c
    trunk/gcc/testsuite/ChangeLog
Comment 8 Jakub Jelinek 2014-02-08 09:29:16 UTC
The wrong-code issue is now fixed, but we still don't vectorize this.
See http://gcc.gnu.org/ml/gcc-patches/2014-02/msg00514.html
for details.  Note that without making the two pointer parameters linear on the
elemental function it actually will not be successfully vectorized unless you have CPU that can do scatter (Skylake, anything else?).

Keeping this PR open so that I don't forget to address this.
Comment 9 Jakub Jelinek 2014-04-22 11:37:03 UTC
GCC 4.9.0 has been released
Comment 10 Jakub Jelinek 2014-07-16 13:30:49 UTC
GCC 4.9.1 has been released.
Comment 11 Jakub Jelinek 2014-10-30 10:40:49 UTC
GCC 4.9.2 has been released.
Comment 12 Stupachenko Evgeny 2014-11-13 23:51:25 UTC
Created attachment 33963 [details]
test case where pragma simd disable vectorization

The following test case compiled with "-Ofast" vectorize the loop in the GetXsum function.
Adding "-fopenmp" leads to failed vectorization due to:

simd_issue.cpp:26:18: note: not vectorized: data ref analysis failed D.2329[_7].x = _12;

It looks like before the patch in this Bug loop was vectorized with -fopenmp.
Comment 13 Jakub Jelinek 2014-11-14 15:03:53 UTC
(In reply to Stupachenko Evgeny from comment #12)
> Created attachment 33963 [details]
> test case where pragma simd disable vectorization
> 
> The following test case compiled with "-Ofast" vectorize the loop in the
> GetXsum function.
> Adding "-fopenmp" leads to failed vectorization due to:
> 
> simd_issue.cpp:26:18: note: not vectorized: data ref analysis failed
> D.2329[_7].x = _12;
> 
> It looks like before the patch in this Bug loop was vectorized with -fopenmp.

The testcase is invalid, you need reduction(+:sim) clause, otherwise the loop has invalid inter-iteration dependencies.

That said, even with that, with C it vectorizes fine, while with C++ it doesn't.

In *.einline the C -> C++ difference is (before that I don't see such):
-  D.1856[_19].x = _24;
-  _26 = &D.1856[_19];
-  _27 = MEM[(const struct XY *)_26].x;
+  D.2352[_19].x = _24;
+  _26 = &D.2352[_19];
+  _40 = MEM[(float *)_26];

In *.ealias the C -> C++ difference is:
-  D.1856[_19].x = _24;
-  _27 = MEM[(const struct XY *)&D.1856][_19].x;
+  D.2352[_19].x = _24;
+  _26 = &D.2352[_19];
+  _40 = MEM[(float *)_26];

and apparently FRE1 handles the former but not the latter.  Richard?
As the struct contains float at that offset, I don't see why FRE1 shouldn't
optimize that to _40 = _24.

Shorter testcase for the FRE1 missed-optimization:
struct S { float a, b; };

float
foo (int x, float y)
{
  struct S z[1024];
  z[x].a = y;
  struct S *p = &z[x];
  float *q = (float *) p;
  return *q;
}

(dunno why the inliner handles things differently between C and C++ on the #c12 testcase).  Now, as for vectorizing it even if FRE isn't able to optimize it,
we currently don't support interleaved accesses to the "omp simd array" attributed arrays, perhaps we could at least some easy cases thereof, and supposedly we should teach SRA about those too (like, if the arrays aren't addressable and aren't accesses as whole, but just individual fields, split it into separate "omp simd array" accesses instead.  In this particular case due to the FRE missed optimization it is addressable though.
Or perhaps teach fold to gimple folding to fold that:
  q_5 = &z[x_2(D)];
  _6 = *q_5;
back into:
  _6 = z[x_2(D)].x;
?
Comment 14 rguenther@suse.de 2014-11-17 09:02:48 UTC
On Fri, 14 Nov 2014, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59984
> 
> Jakub Jelinek <jakub at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|ASSIGNED                    |NEW
>                  CC|                            |jamborm at gcc dot gnu.org,
>                    |                            |rguenth at gcc dot gnu.org
>            Assignee|jakub at gcc dot gnu.org           |unassigned at gcc dot gnu.org
> 
> --- Comment #13 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> (In reply to Stupachenko Evgeny from comment #12)
> > Created attachment 33963 [details]
> > test case where pragma simd disable vectorization
> > 
> > The following test case compiled with "-Ofast" vectorize the loop in the
> > GetXsum function.
> > Adding "-fopenmp" leads to failed vectorization due to:
> > 
> > simd_issue.cpp:26:18: note: not vectorized: data ref analysis failed
> > D.2329[_7].x = _12;
> > 
> > It looks like before the patch in this Bug loop was vectorized with -fopenmp.
> 
> The testcase is invalid, you need reduction(+:sim) clause, otherwise the loop
> has invalid inter-iteration dependencies.
> 
> That said, even with that, with C it vectorizes fine, while with C++ it
> doesn't.
> 
> In *.einline the C -> C++ difference is (before that I don't see such):
> -  D.1856[_19].x = _24;
> -  _26 = &D.1856[_19];
> -  _27 = MEM[(const struct XY *)_26].x;
> +  D.2352[_19].x = _24;
> +  _26 = &D.2352[_19];
> +  _40 = MEM[(float *)_26];
> 
> In *.ealias the C -> C++ difference is:
> -  D.1856[_19].x = _24;
> -  _27 = MEM[(const struct XY *)&D.1856][_19].x;
> +  D.2352[_19].x = _24;
> +  _26 = &D.2352[_19];
> +  _40 = MEM[(float *)_26];
> 
> and apparently FRE1 handles the former but not the latter.  Richard?
> As the struct contains float at that offset, I don't see why FRE1 shouldn't
> optimize that to _40 = _24.
> 
> Shorter testcase for the FRE1 missed-optimization:
> struct S { float a, b; };
> 
> float
> foo (int x, float y)
> {
>   struct S z[1024];
>   z[x].a = y;
>   struct S *p = &z[x];
>   float *q = (float *) p;
>   return *q;
> }

I will have a look - it's designed to handle that fine.

> (dunno why the inliner handles things differently between C and C++ on the #c12
> testcase).  Now, as for vectorizing it even if FRE isn't able to optimize it,
> we currently don't support interleaved accesses to the "omp simd array"
> attributed arrays, perhaps we could at least some easy cases thereof, and
> supposedly we should teach SRA about those too (like, if the arrays aren't
> addressable and aren't accesses as whole, but just individual fields, split it
> into separate "omp simd array" accesses instead.  In this particular case due
> to the FRE missed optimization it is addressable though.
> Or perhaps teach fold to gimple folding to fold that:
>   q_5 = &z[x_2(D)];
>   _6 = *q_5;
> back into:
>   _6 = z[x_2(D)].x;
> ?

No, that's generally invalid (forwprop does that if types match
closely enough which appearantly they don't?)
Comment 15 Richard Biener 2014-11-17 14:57:52 UTC
Ok, so SCCVN only can forward constant addresses (because variable ones
wouldn't fit in the same refs vector without "inserting" ops).  See
vn_reference_maybe_forwprop_address.  It's also required so that we can
re-materialize the expression in PRE for example and not wind up with
invalid access paths there (similar to the reason why forwprop doesn't do
this transform).

I think it might be possible to teach SCCVN to do this, but it might be
not trivial.
Comment 16 Jakub Jelinek 2015-06-26 19:58:01 UTC
GCC 4.9.3 has been released.
Comment 17 Paolo Carlini 2017-12-04 10:51:09 UTC
Cilk Plus, deprecated for 7.x, will not be in 8.x.