Bug 66873 - fortran variant of outer-1.c not parallelized by autopar
Summary: fortran variant of outer-1.c not parallelized by autopar
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 6.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2015-07-14 22:39 UTC by Tom de Vries
Modified: 2015-12-27 18:38 UTC

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
Tentative patch (1.45 KB, patch), 2015-07-14 23:23 UTC, Tom de Vries
Updated tentative patch (5.85 KB, patch), 2015-07-15 10:00 UTC, Tom de Vries
Updated tentative patch (1.60 KB, patch), 2015-07-26 17:12 UTC, Tom de Vries
autopar/outer-7.c (338 bytes, text/plain), 2015-07-26 21:39 UTC, Tom de Vries

Description Tom de Vries 2015-07-14 22:39:23 UTC
Consider this example, a fortran version of autopar/outer-1.c:
...
program main
  implicit none
  integer, parameter         :: n = 500
  integer, dimension (0:n-1, 0:n-1) :: x
  integer                    :: i, j, ii, jj

  do ii = 0, n - 1
     do jj = 0, n - 1
        x(ii, jj) = ii + jj + 3
     end do
  end do

  do i = 0, n - 1
     do j = 0, n - 1
        if (x(i, j) .ne. i + j + 3) call abort
     end do
  end do

end program main
...

When trying to parallelize this using -O2 -ftree-parallelize-loops=2, it fails on the dependencies:
...
(Data Dep:
#(Data Ref:
#  bb: 4
#  stmt: x[_12] = _14;
#  ref: x[_12];
#  base_object: x;
#  Access function 0: {{0, +, 1}_3, +, 500}_4
#)
#(Data Ref:
#  bb: 4
#  stmt: x[_12] = _14;
#  ref: x[_12];
#  base_object: x;
#  Access function 0: {{0, +, 1}_3, +, 500}_4
#)
  access_fn_A: {{0, +, 1}_3, +, 500}_4
  access_fn_B: {{0, +, 1}_3, +, 500}_4

 (subscript
  iterations_that_access_an_element_twice_in_A: [0]
  last_conflict: scev_not_known
  iterations_that_access_an_element_twice_in_B: [0]
  last_conflict: scev_not_known
  (Subscript distance: 0 ))
  inner loop index: 0
  loop nest: (3 4 )
  distance_vector:   0   0
  distance_vector: 500  -1
  direction_vector:     =    =
  direction_vector:     +    -
)
  FAILED: data dependencies exist across iterations
...
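
A command along these lines should reproduce the dump above (the source file name here is illustrative, not part of the report):
...
gfortran -O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-all outer-1.f90
...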
Comment 1 Tom de Vries 2015-07-14 22:53:19 UTC
One noticeable difference with outer-1.c is that pass_iv_canon makes the inner and outer loop ivs run downwards (from 500 to 0).

Removing pass_iv_canon from the pass list fixes that, but doesn't change anything about the dependency analysis in parloops:
...
(Data Dep:
#(Data Ref:
#  bb: 4
#  stmt: x[_12] = _14;
#  ref: x[_12];
#  base_object: x;
#  Access function 0: {{0, +, 1}_3, +, 500}_4
#)
#(Data Ref:
#  bb: 4
#  stmt: x[_12] = _14;
#  ref: x[_12];
#  base_object: x;
#  Access function 0: {{0, +, 1}_3, +, 500}_4
#)
  access_fn_A: {{0, +, 1}_3, +, 500}_4
  access_fn_B: {{0, +, 1}_3, +, 500}_4

 (subscript
  iterations_that_access_an_element_twice_in_A: [0]
  last_conflict: scev_not_known
  iterations_that_access_an_element_twice_in_B: [0]
  last_conflict: scev_not_known
  (Subscript distance: 0 ))
  inner loop index: 0
  loop nest: (3 4 )
  distance_vector:   0   0
  distance_vector: 500  -1
  direction_vector:     =    =
  direction_vector:     +    -
)
  FAILED: data dependencies exist across iterations
...
Comment 2 Tom de Vries 2015-07-14 23:02:39 UTC
Another obvious difference is that the fortran 2-dimensional array access is rewritten into a one-dimensional array access:
...
  <bb 3>:
  # ii_7 = PHI <0(2), ii_16(7)>
  pretmp_52 = (integer(kind=8)) ii_7;
  pretmp_53 = pretmp_52 * 500;

  <bb 4>:
  # jj_10 = PHI <0(3), jj_15(5)>
  _11 = (integer(kind=8)) jj_10;
  _12 = _11 + pretmp_53;
  _13 = ii_7 + jj_10;
  _14 = _13 + 3;
  x[_12] = _14;
  jj_15 = jj_10 + 1;
  if (jj_10 == 499)
    goto <bb 6>;
  else
    goto <bb 5>;
...

While the outer-1.c 2-dimensional array access is still 2-dimensional:
...
  <bb 9>:
  # i_34 = PHI <0(3), i_15(8)>
  goto <bb 5>;

  <bb 5>:
  # j_36 = PHI <0(9), j_14(4)>
  _11 = i_34 + j_36;
  _12 = _11 + 3;
  x[i_34][j_36] = _12;
  j_14 = j_36 + 1;
  if (N_9(D) > j_14)
    goto <bb 4>;
  else
    goto <bb 6>;
...

This results in different access functions, and the dependence analysis succeeds:
...
(Data Dep:
#(Data Ref:
#  bb: 5
#  stmt: x[i_34][j_36] = _12;
#  ref: x[i_34][j_36];
#  base_object: x;
#  Access function 0: {0, +, 1}_4
#  Access function 1: {0, +, 1}_1
#)
#(Data Ref:
#  bb: 5
#  stmt: x[i_34][j_36] = _12;
#  ref: x[i_34][j_36];
#  base_object: x;
#  Access function 0: {0, +, 1}_4
#  Access function 1: {0, +, 1}_1
#)
  access_fn_A: {0, +, 1}_4
  access_fn_B: {0, +, 1}_4

 (subscript
  iterations_that_access_an_element_twice_in_A: [0]
  last_conflict: scev_not_known
  iterations_that_access_an_element_twice_in_B: [0]
  last_conflict: scev_not_known
  (Subscript distance: 0 ))
  access_fn_A: {0, +, 1}_1
  access_fn_B: {0, +, 1}_1

 (subscript
  iterations_that_access_an_element_twice_in_A: [0]
  last_conflict: scev_not_known
  iterations_that_access_an_element_twice_in_B: [0]
  last_conflict: scev_not_known
  (Subscript distance: 0 ))
  inner loop index: 0
  loop nest: (1 4 )
  distance_vector:   0   0
  direction_vector:     =    =
)
  SUCCESS: may be parallelized
...
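
To make the contrast concrete, here is a minimal C sketch (illustrative only, not taken from any attachment) of the two shapes the store takes: a linearized single-subscript access roughly like what the Fortran front end produces for the column-major x(ii, jj), versus the two-subscript access that outer-1.c keeps. Folding both induction variables into one subscript with strides 1 and 500 is what yields the composite access function {{0, +, 1}_3, +, 500}_4 above.
...
/* Illustrative sketch, not from the bug's attachments.  */

#define N 500

int x_lin[N * N];   /* linearized, roughly the Fortran case */
int x_2d[N][N];     /* two-dimensional, roughly the outer-1.c case */

void
store_linearized (void)
{
  for (int ii = 0; ii < N; ii++)
    for (int jj = 0; jj < N; jj++)
      /* Single subscript combining both ivs: one composite access
         function, as in the dumps above.  */
      x_lin[ii + jj * N] = ii + jj + 3;
}

void
store_2d (void)
{
  for (int ii = 0; ii < N; ii++)
    for (int jj = 0; jj < N; jj++)
      /* One subscript per dimension: one simple access function per
         subscript, as in the outer-1.c dump above.  */
      x_2d[ii][jj] = ii + jj + 3;
}
...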
Comment 3 Tom de Vries 2015-07-14 23:18:22 UTC
The fortran example succeeds when -floop-parallelize-all is used.

Even though the access function in graphite seems the same:
...
        Access function 0: {{0, +, 1}_3, +, 500}_4
...
Comment 4 Tom de Vries 2015-07-14 23:23:08 UTC
Created attachment 35983 [details]
Tentative patch

Using this tentative patch, we use graphite analysis (if available) by default for parloops. That way, we manage to parallelize the fortran example using just -ftree-parallelize-loops=2.
Comment 5 kargls 2015-07-14 23:39:54 UTC
(In reply to vries from comment #0)
> Consider this example, a fortran version of autopar/outer-1.c:
> ...
> program main
>   implicit none
>   integer, parameter         :: n = 500
>   integer, dimension (0:n-1, 0:n-1) :: x
>   integer                    :: i, j, ii, jj
> 
>   do ii = 0, n - 1
>      do jj = 0, n - 1
>         x(ii, jj) = ii + jj + 3
>      end do
>   end do
> 
>   do i = 0, n - 1
>      do j = 0, n - 1
>         if (x(i, j) .ne. i + j + 3) call abort
>      end do
>   end do
> 
> end program main
> ...
> 
> When trying to parallelize this using -O2 -ftree-parallelize-loops=2, it
> fails on the dependencies:

Does the loop ordering matter?  Fortran is a column major language,
so your nested loops are backwards.  One would normally write.

  do jj = 0, n - 1
      do ii = 0, n - 1
         x(ii, jj) = ii + jj + 3
      end do
   end do

where the first loop index varies most rapidly.
Comment 6 Tom de Vries 2015-07-15 06:12:15 UTC
(In reply to kargl from comment #5)
> Does the loop ordering matter?  Fortran is a column major language,
> so your nested loops are backwards.  One would normally write.
> 
>   do jj = 0, n - 1
>       do ii = 0, n - 1
>          x(ii, jj) = ii + jj + 3
>       end do
>    end do
> 
> where the first loop index varies most rapidly.

Thanks for letting me know. I'm obviously not fluent in Fortran.

Interchanging ii and jj in the array access of the example, and again disabling pass_iv_canon, gives:
...
(Data Dep:
#(Data Ref:
#  bb: 4
#  stmt: x[_12] = _14;
#  ref: x[_12];
#  base_object: x;
#  Access function 0: {{0, +, 500}_3, +, 1}_4
#)
#(Data Ref:
#  bb: 4
#  stmt: x[_12] = _14;
#  ref: x[_12];
#  base_object: x;
#  Access function 0: {{0, +, 500}_3, +, 1}_4
#)
  access_fn_A: {{0, +, 500}_3, +, 1}_4
  access_fn_B: {{0, +, 500}_3, +, 1}_4

 (subscript
  iterations_that_access_an_element_twice_in_A: [0]
  last_conflict: scev_not_known
  iterations_that_access_an_element_twice_in_B: [0]
  last_conflict: scev_not_known
  (Subscript distance: 0 ))
  inner loop index: 0
  loop nest: (3 4 )
  distance_vector:   0   0
  distance_vector:   1 -500
  direction_vector:     =    =
  direction_vector:     +    -
)
  FAILED: data dependencies exist across iterations
...

Again, using -floop-parallelize-all allows the outer loop to be parallelized.
Comment 7 Tom de Vries 2015-07-15 10:00:57 UTC
Created attachment 35986 [details]
Updated tentative patch

I found that always doing graphite before parloops resulted in failures to parallelize reduction testcases.

I've split things up now:
- first we do parloopsred, a parloops variant in which we only handle reductions
- then we do graphite
- then we do the normal parloops

This seems to combine the best of graphite and parloops.

The only gotcha is that I had to disable pass_iv_canon when tree_parallelize_loops > 1. It seems to interfere with graphite. I did not observe any failures to parallelize due to not running pass_iv_canon.
Comment 9 Tom de Vries 2015-07-24 16:45:37 UTC
(In reply to vries from comment #7)
> Created attachment 35986 [details]
> Updated tentative patch
> 
> I found that always doing graphite before parloops resulted in failures to
> parallelize reduction testcases.
> 

That was due to not using -ffast-math.

With this approved patch ( https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01863.html ), we no longer have that problem.

However, we then run into PR66997 - outer loop reduction fails to parallelize with graphite.

And the runtime of graphite is excessive in some cases, which prevents us from using graphite by default, f.i. Bug 53852 - [4.9/5/6 Regression] -ftree-loop-linear: large compile time / memory usage.
Comment 10 Tom de Vries 2015-07-24 16:56:39 UTC
Retracting the patch. It doesn't make sense to use graphite by default before addressing the compile-time-hog problem.
Comment 11 Tom de Vries 2015-07-26 17:12:37 UTC
Created attachment 36057 [details]
Updated tentative patch
Comment 12 Tom de Vries 2015-07-26 21:39:49 UTC
Created attachment 36063 [details]
autopar/outer-7.c

C example to reproduce the same problem
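
The attachment itself is not reproduced here; a hypothetical sketch of such a C reproducer, using a manually linearized subscript so the dependence analysis sees a single composite access function like the Fortran case, might look like this (the actual attached outer-7.c may differ):
...
/* Hypothetical sketch of a C reproducer; not the actual attachment.  */

extern void abort (void);

#define N 500

int x[N * N];

int
main (void)
{
  int i, j;

  /* Linearized store: both ivs folded into one subscript.  */
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      x[i + N * j] = i + j + 3;

  /* Check loop.  */
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      if (x[i + N * j] != i + j + 3)
        abort ();

  return 0;
}
...
Compiled with -O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-all, the first loop nest would be expected to show the same kind of composite access function and the same FAILED message as the Fortran example.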