Bug 63844 - [4.8 Regression] open mp parallelization prevents vectorization
Summary: [4.8 Regression] open mp parallelization prevents vectorization
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.9.3
: P2 normal
Target Milestone: 4.8.5
Assignee: Richard Biener
URL:
Keywords: missed-optimization, openmp
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2014-11-13 04:20 UTC by Andi Kleen
Modified: 2015-02-24 15:09 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work: 4.8.5, 4.9.3, 5.0
Known to fail: 4.8.4, 4.9.2
Last reconfirmed: 2014-11-17 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Andi Kleen 2014-11-13 04:20:07 UTC
#define N 10000000
int a[N], b[N], c[N];

main()
{

        int i;

#pragma omp parallel num_threads(4)
        for (i = 0; i < N; i++) {
                a[i] = b[i] + c[i];
        }
        for (i = 0; i < N; i++) {
                a[i] += b[i] + c[i];
        }
}


compiled with gcc -O3 -fopenmp

The first loop gets parallelized by openmp, the second loop gets vectorized.
But why does the parallelized loop not get vectorized too?
Comment 1 Richard Biener 2014-11-13 09:16:13 UTC
We have a duplicate (or related) bug somewhere.
Comment 2 Andi Kleen 2014-11-17 18:52:09 UTC
Regression, doesn't happen on 4.8
Comment 3 Jakub Jelinek 2014-11-17 21:10:18 UTC
I certainly see the same thing in 4.8, it seems this regressed in r161655.  Anyway, I will have a look tomorrow.
Comment 4 Andi Kleen 2014-11-17 21:49:52 UTC
I had a typo in the test case (remove += to make the loops identical)

#define N 10000000
int a[N], b[N], c[N];

main()
{

        int i;

#pragma omp parallel num_threads(4)
        for (i = 0; i < N; i++) {
                a[i] = b[i] + c[i];
        }
        for (i = 0; i < N; i++) {
                a[i] = b[i] + c[i];
        }
}

The case I saw vectorized on 4.8 (opensuse 13.1 compiler), but not on 5.0, was slightly different, auto parallelized

#define N 10000000
int a[N], b[N], c[N];

main()
{

        int i;
        for (i = 0; i < N; i++) {
                a[i] = b[i] + c[i];
        }
}
With -O3 -mtree-parallelize-loops=4
I understand this will just internally generate openmp
Comment 5 Richard Biener 2014-11-18 10:59:23 UTC
The issue here is the induction variable is shared?

  <bb 5>:
  # _17 = PHI <0(2), _11(4)>
  _6 = b[_17];
  _7 = c[_17];
  _8 = _6 + _7;
  a[_17] = _8;
  _10 = .omp_data_i_3(D)->i;
  _11 = _10 + 1;
  .omp_data_i_3(D)->i = _11;
  if (_11 <= 9999999)
    goto <bb 4>;
  else
    goto <bb 3>;

and we didn't apply store-motion to it:

Memory reference 1: b[_17]
Memory reference 2: c[_17]
Memory reference 3: a[_17]
Memory reference 4: .omp_data_i_3(D)->i
...
Querying dependency of refs 4 and 1: dependent.
Querying dependencies of ref 4 in loop 1: dependent

I think the static chain never points to global decls thus we could
special-case this.  We could also treat members of the frame as
non-allocated and thus not worry about trailing arrays and such.
Unfortunately the .omp_data_i is a regular function parameter and
not a static chain :/
Comment 6 Jakub Jelinek 2014-11-18 11:03:12 UTC
(In reply to Richard Biener from comment #5)
> The issue here is the induction variable is shared?

Ah right, the testcase is clearly invalid.
There is a data race on i, as well as data races on all the a[i] stores (because 4 different threads all do the same thing).
Andi, did you mean
#pragma omp parallel for num_threads(4)
instead?
Comment 7 Richard Biener 2014-11-18 11:13:53 UTC
Ok, I think Tom had a similar patch for this but passing the .omp_data_i by
effective "reference" (restrict qualified pointer and DECL_BY_REFERENCE)
fixes this.

Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c       (revision 217692)
+++ gcc/omp-low.c       (working copy)
@@ -1517,7 +1517,8 @@ fixup_child_record_type (omp_context *ct
       layout_type (type);
     }
 
-  TREE_TYPE (ctx->receiver_decl) = build_pointer_type (type);
+  TREE_TYPE (ctx->receiver_decl)
+    = build_qualified_type (build_pointer_type (type), TYPE_QUAL_RESTRICT);
 }
 
 /* Instantiate decls as necessary in CTX to satisfy the data sharing
@@ -2006,6 +2007,7 @@ create_omp_child_function (omp_context *
   DECL_NAMELESS (t) = 1;
   DECL_ARG_TYPE (t) = ptr_type_node;
   DECL_CONTEXT (t) = current_function_decl;
+  DECL_BY_REFERENCE (t) = 1;
   TREE_USED (t) = 1;
   if (cilk_for_count)
     DECL_CHAIN (t) = DECL_ARGUMENTS (decl);

there are more similar objects built, so the above may not fully optimize
all cases.
Comment 8 Richard Biener 2014-11-18 11:34:53 UTC
(In reply to Richard Biener from comment #7)
> Ok, I think Tom had a similar patch for this but passing the .omp_data_i by
> effective "reference" (restrict qualified pointer and DECL_BY_REFERENCE)
> fixes this.
> 
> Index: gcc/omp-low.c
> ===================================================================
> --- gcc/omp-low.c       (revision 217692)
> +++ gcc/omp-low.c       (working copy)
> @@ -1517,7 +1517,8 @@ fixup_child_record_type (omp_context *ct
>        layout_type (type);
>      }
>  
> -  TREE_TYPE (ctx->receiver_decl) = build_pointer_type (type);
> +  TREE_TYPE (ctx->receiver_decl)
> +    = build_qualified_type (build_pointer_type (type), TYPE_QUAL_RESTRICT);
>  }
>  
>  /* Instantiate decls as necessary in CTX to satisfy the data sharing
> @@ -2006,6 +2007,7 @@ create_omp_child_function (omp_context *
>    DECL_NAMELESS (t) = 1;
>    DECL_ARG_TYPE (t) = ptr_type_node;
>    DECL_CONTEXT (t) = current_function_decl;
> +  DECL_BY_REFERENCE (t) = 1;
>    TREE_USED (t) = 1;
>    if (cilk_for_count)
>      DECL_CHAIN (t) = DECL_ARGUMENTS (decl);
> 
> there are more similar objects built, so the above may not fully optimize
> all cases.

For the invalid testcase the DECL_BY_REFERENCE setting isn't needed and
its positive effect on PTA could also be seen if using build_reference_type
instead of build_pointer_type in the first hunk.
Comment 9 Richard Biener 2014-11-18 11:38:39 UTC
Valid testcase (well, assign sth useful to g...)

#define N 10000000
int a[N], b[N], c[N];

main()
{

  int i, g;

#pragma omp parallel for num_threads(4)
  for (i = 0; i < N; i++) {
      a[i] = b[i] + c[i] + g;
  }
}

in the OMP fn we fail to hoist the load of .omp_child_i->g out of the
loop and thus generate inferior code.
Comment 10 Jakub Jelinek 2014-11-18 11:43:39 UTC
(In reply to Richard Biener from comment #9)
> Valid testcase (well, assign sth useful to g...)
> 
> #define N 10000000
> int a[N], b[N], c[N];
> 
> main()
> {
> 
>   int i, g;
> 
> #pragma omp parallel for num_threads(4)
>   for (i = 0; i < N; i++) {
>       a[i] = b[i] + c[i] + g;
>   }
> }
> 
> in the OMP fn we fail to hoist the load of .omp_child_i->g out of the
> loop and thus generate inferior code.

And does making receiver_decl restrict and/or reference type help with that?
BTW, we probably should add some IPA omp pass that would try to constant propagate across from GOMP_parallel*/GOMP_task* callers to the *.omp_fn.* functions (or teach IPA-SRA to do that?).
Comment 11 rguenther@suse.de 2014-11-18 12:09:20 UTC
On Tue, 18 Nov 2014, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63844
> 
> --- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #9)
> > Valid testcase (well, assign sth useful to g...)
> > 
> > #define N 10000000
> > int a[N], b[N], c[N];
> > 
> > main()
> > {
> > 
> >   int i, g;
> > 
> > #pragma omp parallel for num_threads(4)
> >   for (i = 0; i < N; i++) {
> >       a[i] = b[i] + c[i] + g;
> >   }
> > }
> > 
> > in the OMP fn we fail to hoist the load of .omp_child_i->g out of the
> > loop and thus generate inferior code.
> 
> And does making receiver_decl restrict and/or reference type help with that?

Yes.

> BTW, we probably should add some IPA omp pass that would try to constant
> propagate across from GOMP_parallel*/GOMP_task* callers to the *.omp_fn.*
> functions (or teach IPA-SRA to do that?).

I think aggregate IPA-CP does that, IPA-SRA cannot as the function has
its address taken.

Richard.
Comment 12 Andi Kleen 2014-11-18 13:42:03 UTC
Yes should have been omp parallel for
Comment 13 Andi Kleen 2014-11-18 13:43:58 UTC
>I think aggregate IPA-CP does that, IPA-SRA cannot as the function has
>its address taken.

Perhaps that case (only passing address to gomp runtime) could be special cased in the escape analysis.
Comment 14 Richard Biener 2014-11-19 09:47:51 UTC
Author: rguenth
Date: Wed Nov 19 09:47:19 2014
New Revision: 217757

URL: https://gcc.gnu.org/viewcvs?rev=217757&root=gcc&view=rev
Log:
2014-11-19  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63844
	* omp-low.c (fixup_child_record_type): Use a restrict qualified
	referece type for the receiver parameter.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/omp-low.c
Comment 15 Richard Biener 2014-11-19 09:48:15 UTC
Fixed on trunk(?)
Comment 16 Jakub Jelinek 2014-12-19 13:29:43 UTC
GCC 4.8.4 has been released.
Comment 17 Richard Biener 2015-02-23 11:14:57 UTC
Author: rguenth
Date: Mon Feb 23 11:14:25 2015
New Revision: 220912

URL: https://gcc.gnu.org/viewcvs?rev=220912&root=gcc&view=rev
Log:
2015-02-23  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2014-11-19  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63844
	* omp-low.c (fixup_child_record_type): Use a restrict qualified
	referece type for the receiver parameter.

	2014-11-27  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/61634
	* tree-vect-slp.c: Include gimple-walk.h.
	(vect_detect_hybrid_slp_stmts): Rewrite to propagate hybrid
	down the SLP tree for one scalar statement.
	(vect_detect_hybrid_slp_1): New walker function.
	(vect_detect_hybrid_slp_2): Likewise.
	(vect_detect_hybrid_slp): Properly handle pattern statements
	in a pre-scan over all loop stmts.

	* gcc.dg/vect/pr61634.c: New testcase.

	2015-01-14  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/59354
	* tree-vect-slp.c (vect_build_slp_tree_1): Treat loads from
	groups larger than the slp group size as having gaps.

	* gcc.dg/vect/pr59354.c: New testcase.

	2015-02-10  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/64909
	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Properly
	pass a scalar-stmt count estimate to the cost model.
	* tree-vect-data-refs.c (vect_peeling_hash_get_lowest_cost): Likewise.

	* gcc.dg/vect/costmodel/x86_64/costmodel-pr64909.c: New testcase.

Added:
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr64909.c
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/pr59354.c
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/pr61634.c
Modified:
    branches/gcc-4_9-branch/gcc/ChangeLog
    branches/gcc-4_9-branch/gcc/omp-low.c
    branches/gcc-4_9-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_9-branch/gcc/tree-vect-data-refs.c
    branches/gcc-4_9-branch/gcc/tree-vect-loop.c
    branches/gcc-4_9-branch/gcc/tree-vect-slp.c
Comment 18 Richard Biener 2015-02-24 15:09:22 UTC
Fixed.
Comment 19 Richard Biener 2015-02-24 15:09:32 UTC
Author: rguenth
Date: Tue Feb 24 15:09:00 2015
New Revision: 220941

URL: https://gcc.gnu.org/viewcvs?rev=220941&root=gcc&view=rev
Log:
2015-02-24  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2014-11-19  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63844
	* omp-low.c (fixup_child_record_type): Use a restrict qualified
	referece type for the receiver parameter.

Modified:
    branches/gcc-4_8-branch/gcc/ChangeLog
    branches/gcc-4_8-branch/gcc/omp-low.c