Bug 34265 - Missed optimizations
Summary: Missed optimizations
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: unknown
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on: 31738 34223 36034
Blocks:
 
Reported: 2007-11-28 15:27 UTC by Dominique d'Humieres
Modified: 2011-09-26 08:36 UTC
CC: 9 users

See Also:
Host: i686-apple-darwin9
Target: i686-apple-darwin9
Build: i686-apple-darwin9
Known to work:
Known to fail:
Last reconfirmed:


Attachments
Diffs between the original file and the simplest variants (1.56 KB, text/plain)
2007-11-28 15:30 UTC, Dominique d'Humieres
patch for early complete unrolling of inner loops (1.07 KB, text/plain)
2007-11-28 16:33 UTC, Richard Biener
result of -fdump-tree-optimized with patch #5 (1.80 KB, text/plain)
2007-12-03 14:33 UTC, Dominique d'Humieres
result of -fdump-tree-optimized without patch #5 (781 bytes, text/plain)
2007-12-03 14:34 UTC, Dominique d'Humieres
induct.f90 variants and their diff with the original file (30.40 KB, application/octet-stream)
2008-04-23 21:26 UTC, Dominique d'Humieres
reduced tests (9.61 KB, application/octet-stream)
2011-05-22 12:06 UTC, Dominique d'Humieres

Description Dominique d'Humieres 2007-11-28 15:27:05 UTC
I have had a closer look at the optimization of the polyhedron test induct.f90.
3/4 of the runtime is spent in the subroutine 'mutual_ind_quad_cir_coil' and 1/4 in 'mutual_ind_quad_rec_coil'.
The two subroutines contain two main loops with the following structure:

do i = 1, 2*m
    ...
    do j = 1, 9
	...
	do k = 1, 9
	    q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
	    q_vector(2) = 0.5_longreal * b1 * (y2gauss(k) - 1.0_longreal)
	    q_vector(3) = 0.0_longreal
!
!       rotate quad vector into the global coordinate system
!
	    rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
	    rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
	    rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
!
!       compute and add in quadrature term
!
	    numerator = w1gauss(j) * w2gauss(k) *                         &
			dot_product(coil_current_vec,current_vector)
	    denominator = sqrt(dot_product(rot_c_vector-rot_q_vector,     &
					   rot_c_vector-rot_q_vector))
	    l12_lower = l12_lower + numerator/denominator
	end do
    end do
end do

where the first six lines of code in the k loop depend on neither i nor j
and can be computed in a 'do k = 1, 9' loop outside the main loops by
replacing the length-three vector 'rot_q_vector' with a nine-by-three array.
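For instance, a minimal sketch of this transformation (the names are those of the original code, except for the new nine-by-three temporary rot_qk_vector, the name used in the attached hand-optimized variants) would be:

do k = 1, 9
    q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
    q_vector(2) = 0.5_longreal * b1 * (y2gauss(k) - 1.0_longreal)
    q_vector(3) = 0.0_longreal
    rot_qk_vector(k,1) = dot_product(rotate_quad(1,:),q_vector(:))
    rot_qk_vector(k,2) = dot_product(rotate_quad(2,:),q_vector(:))
    rot_qk_vector(k,3) = dot_product(rotate_quad(3,:),q_vector(:))
end do

do i = 1, 2*m
    ...
    do j = 1, 9
        ...
        do k = 1, 9
            rot_q_vector(:) = rot_c_vector(:) - rot_qk_vector(k,:)
            numerator = w1gauss(j) * w2gauss(k) *                         &
                        dot_product(coil_current_vec,current_vector)
            denominator = sqrt(dot_product(rot_q_vector,rot_q_vector))
            l12_lower = l12_lower + numerator/denominator
        end do
    end do
end do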

The original code induct.f90 gives the following timing (all with -O3 
-ffast-math -funroll-loops):

93.227u 0.094s 1:33.32 99.9%	0+0k 0+0io 0pf+0w

and output:

...
Maximum wand/quad abs rel mutual inductance =   5.95379428444656467E-002
Minimum resq coil/quad abs rel mutual inductance =    0.0000000000000000     
Maximum resq coil/quad abs rel mutual inductance =   9.63995250242061230E-002
...

Unrolling 'numerator' and 'denominator' by hand gives (see 
http://gcc.gnu.org/ml/fortran/2007-11/msg00231.html):

65.563u 0.092s 1:05.66 99.9%	0+0k 0+0io 0pf+0w

Looking at the assembly, I can see that for the original code the inner loops in k are not unrolled, as guessed by Paul Thomas (only the implied vector loops are unrolled).

QUESTION 1: Should the frontend do the unrolling for small vectors itself, or should the middle-end be more aggressive for nested loops with small, known iteration counts?

Moving the computations in the k loops that are invariant in i and j outside the main loops gives:

80.313u 0.074s 1:20.39 99.9%	0+0k 0+1io 0pf+0w

Combining the two hand optimizations gives:

35.925u 0.040s 0:35.97 99.9%	0+0k 0+0io 0pf+0w

(without -ffast-math the timing is
59.263u 0.067s 0:59.33 99.9%	0+0k 0+1io 0pf+0w)

but the results change to:

Maximum wand/quad abs rel mutual inductance =   5.95379428444656675E-002
Minimum resq coil/quad abs rel mutual inductance =    0.0000000000000000     
Maximum resq coil/quad abs rel mutual inductance =   9.63995250242059842E-002

( Maximum wand/quad abs rel mutual inductance =   5.95379428444659520E-002
  Minimum resq coil/quad abs rel mutual inductance =    0.0000000000000000     
  Maximum resq coil/quad abs rel mutual inductance =   9.63995250242060675E-002
without -ffast-math).

The attached file gives the differences between the original code and the three variants.

This is to be compared with further optimizations (induct.v2.f90), leading to:

35.376u 0.062s 0:35.44 99.9%	0+0k 0+1io 0pf+0w

or, after merging the two loops (induct.v3.f90), to:

34.452u 0.041s 0:34.49 100.0%	0+0k 0+0io 0pf+0w

I have counted the number of sqrt() calls in the assembly code and found 9 of them in the slow codes, while I found only 5 (10 for induct.v3.f90) in the fast codes. I checked that this was not due to common subexpressions I may have missed. Looking more closely at the assembly, I saw that the slow codes used 'sqrtsd' while the fast ones used 'sqrtpd' along with several other packed operations. Now 65.66*5/9 = 36.48, which explains most of the speedup. Note that 5 = 9/2+1, i.e., four packed computations followed by a scalar one. Owing to the same structure in the two loops, the two scalar computations could be merged into a packed one, but this is missed by the optimizer.

I have tried without success to trigger this "vectorization" by making some variables vectors in k.

QUESTION 2: Why is the optimizer able to vectorize in some cases and not in others? Can the frontend help with vectorization?
Comment 1 Dominique d'Humieres 2007-11-28 15:30:27 UTC
Created attachment 14654 [details]
Diffs between the original file and the simplest variants

In induct.v1.f90, 'numerator' and 'denominator' are unrolled by hand. In induct.v2.f90 the invariants are moved outside the main loop, and induct.v3.f90 combines the two.
Comment 2 Richard Biener 2007-11-28 16:06:06 UTC
GCC doesn't have a facility to split the inner loop and move it out of the outer loops by introducing an array temporary.

As for complete unrolling, this only happens for innermost loops(?), and you
can tune the heuristics with --param max-completely-peeled-insns=N (defaults to
400) and --param max-completely-peel-times (defaults to 16).  Use -funroll-loops
to enable this.

Note that complete unrolling happens too late to help LIM or vectorization.
Comment 3 Dominique d'Humieres 2007-11-28 16:14:14 UTC
> Note that complete unrolling happens too late to help LIM or vectorization.

Could this be translated as a YES to my first question, i.e., that the Fortran frontend should unroll computations for short vectors?

Comment 4 Richard Biener 2007-11-28 16:17:53 UTC
I would in principle say no - we can instead improve the middle-end here.  But
it may pay off not to generate a loop for short vectors if, for example, the
resulting IL is smaller.  Of course that would duplicate logic in the frontend
if it is not already available, so from a middle-end point of view we should
fix it there instead (the same problems happen for C and C++).  See PR34223.
Comment 5 Richard Biener 2007-11-28 16:33:59 UTC
Created attachment 14655 [details]
patch for early complete unrolling of inner loops

For example with a patch like this.
Comment 6 Dominique d'Humieres 2007-11-28 18:18:22 UTC
Subject: Re:  Missed optimizations

> For example with a patch like this.

You also need

--- ../_gcc_clean/gcc/tree-flow.h       2007-11-16 16:17:46.000000000 +0100
+++ ../gcc-4.3-work/gcc/tree-flow.h     2007-11-28 18:56:42.000000000 +0100
@@ -980,7 +980,7 @@
 void tree_ssa_lim (void);
 unsigned int tree_ssa_unswitch_loops (void);
 unsigned int canonicalize_induction_variables (void);
-unsigned int tree_unroll_loops_completely (bool);
+unsigned int tree_unroll_loops_completely (bool, bool);
 unsigned int tree_ssa_prefetch_arrays (void);
 unsigned int remove_empty_loops (void);
 void tree_ssa_iv_optimize (void);

Still building.

Comment 7 Dominique d'Humieres 2007-11-28 18:48:49 UTC
Subject: Re:  Missed optimizations

With your patch the runtime went from

93.670u 0.103s 1:33.85 99.9%    0+0k 0+0io 32pf+0w

to

38.741u 0.038s 0:38.85 99.7%    0+0k 0+1io 32pf+0w

Pretty impressive!

Note that with gfortran 4.2.2 the timing is

72.451u 0.046s 1:12.59 99.8%    0+0k 1+0io 33pf+0w

I'll run the full polyhedron suite.

Thanks
Comment 8 Janne Blomqvist 2007-11-28 20:48:07 UTC
The vectorization of dot products is covered by PR31738, I suppose
Comment 9 Tobias Burnus 2007-11-28 21:27:05 UTC
> With your patch the runtime went from
> 93.670u 0.103s 1:33.85 99.9%    0+0k 0+0io 32pf+0w
> to
> 38.741u 0.038s 0:38.85 99.7%    0+0k 0+1io 32pf+0w

Thus: 59% faster. Here, it "only" went ~30% down from 49.89s to ~35.2s. (AMD64/Linux, -m64). Still quite impressive!
Comment 10 Richard Biener 2007-11-28 22:05:42 UTC
Indeed - unexpectedly impressive ;)  The patch has (obviously) received no tuning
with respect to the placement of the early unrolling in the pass pipeline, and early
unrolling is only done if it doesn't increase code size (by the metric
of the unrolling pass, of course), unless you specify -O3.
Comment 11 Dominique d'Humieres 2007-11-28 22:35:48 UTC
Here are the timings before and after the patch for the polyhedron tests and some variants:

                   Before patch                   After patch 

  Benchmark   Ave Run  Number   Estim    :   Ave Run  Number   Estim
       Name    (secs) Repeats   Err %    :    (secs) Repeats   Err %
  ---------   ------- -------  ------    :   ------- -------  ------
         ac     16.92       5  0.0183    :     16.16       5  0.0056
     aermod     36.82       5  0.0082    :     36.92       5  0.0106
        air     11.38      10  0.0479    :     11.43      11  0.0494
   capacita     62.21       5  0.0036    :     61.97       5  0.0343
    channel      4.04      12  0.0333    :      4.04       5  0.0160
      doduc     58.07       5  0.0257    :     57.56       5  0.0164
    fatigue     14.94       5  0.0338    :     14.33       5  0.0184
    gas_dyn     11.78      17  0.0448    :     11.89      18  0.0349
     induct     93.27       5  0.0093    :     36.93       5  0.0205
      linpk     28.15       5  0.0099    :     28.21       5  0.0259
       mdbx     16.80       5  0.0112    :     16.83       5  0.0051
         nf     32.45       5  0.0388    :     32.63      10  0.0495
    protein     55.63       5  0.0069    :     54.86       5  0.0305
     rnflow     45.88       5  0.0366    :     46.06       5  0.0230
   test_fpu     14.64       5  0.0115    :     14.46       5  0.0207
       tfft      3.04       5  0.0380    :      3.06      20  0.0284
      ac_v1     16.15       5  0.0197    :     15.16       5  0.0109
     air_v1     10.81       5  0.0411    :     10.88      10  0.0471
 capacita_8     69.45       5  0.0136    :     69.41       5  0.0091
capacita_10    113.20       5  0.0200    :    112.44       5  0.0290
    chan_v1      2.23       5  0.0183    :      2.24       7  0.0471
 channel_10     16.61       5  0.0351    :     16.64      14  0.0492
 fatigue_v1     13.26       5  0.0071    :     12.05       5  0.0148
 fatigue_10     20.54      15  0.0312    :     21.73       5  0.0117
  induct_v2     35.07       5  0.0007    :     60.08       5  0.0236
  induct_v3     34.40       5  0.0189    :     58.64       5  0.0249
  induct_vm    262.95       2  0.0000    :    253.62       2  0.0197
  induct_10    100.12       5  0.0053    :     84.65       5  0.0008
     kepler     22.73       5  0.0123    :     26.11       5  0.0069
  kepler_10     69.59       5  0.0047    :     61.42       5  0.0110
      nf_10     58.00       5  0.0413    :     58.36       5  0.0388
 protein_10     57.04       5  0.0167    :     56.38       5  0.0486
test_fpu_v1     15.15       5  0.0195    :     14.98       5  0.0104
test_fpu_10     34.75       5  0.0408    :     34.68       5  0.0120
     tfft_8      6.81       5  0.0110    :      6.83       5  0.0371
    tfft_10     14.36       5  0.0373    :     14.40       6  0.0496

                 Before patch                After patch 

  Benchmark   Compile  Executable     :  Compile  Executable
       Name    (secs)     (bytes)     :   (secs)     (bytes)
  ---------   -------  ----------     :  -------  ----------
         ac      4.52       50628     :     4.60       50628
     aermod     96.22     1288460     :   106.72     1288460
        air      6.57       80956     :     6.68       80956
   capacita      3.18       60140     :     3.34       64236
    channel      1.55       38532     :     1.65       38532
      doduc     13.52      183264     :    14.02      191456
    fatigue      5.69       84564     :     5.83       80468
    gas_dyn      5.36      695776     :     5.50      695776
     induct     11.65      160132     :    12.02      168324
      linpk      1.67       46512     :     1.71       46512
       mdbx      3.76       72672     :     3.85       72672
         nf      4.45       87644     :     4.46       87644
    protein     11.18      113900     :    11.45      113900
     rnflow     11.58      187316     :    11.74      187316
   test_fpu     11.23      182544     :    11.09      178448
       tfft      1.30       34420     :     1.33       34420
      ac_v1      4.51       50628     :     4.60       50628
     air_v1      6.63       80956     :     6.75       80956
 capacita_8      3.20       60136     :     3.33       64232
capacita_10      3.19       64216     :     3.42       68312
    chan_v1      1.85       38500     :     1.85       38500
 channel_10      1.32       34392     :     1.40       34392
 fatigue_v1      5.75       84524     :     5.77       80428
 fatigue_10      4.91       76352     :     4.91       76352
  induct_v2     11.72      168324     :    12.19      172420
  induct_v3     11.72      164228     :    12.12      172420
  induct_vm     11.44      160132     :    11.73      164228
  induct_10     11.47      159964     :    11.89      159964
     kepler      0.34       17652     :     0.35       17652
  kepler_10      0.33       17632     :     0.34       17632
      nf_10      2.03       46684     :     2.07       46684
 protein_10      7.01       93400     :     7.18       93400
test_fpu_v1     11.23      182592     :    11.17      178496
test_fpu_10      6.54      117056     :     6.40      117056
     tfft_8      1.25       30348     :     1.31       30348
    tfft_10      1.15       30328     :     1.18       30328

The only timings significantly changed by the patch are those of the induct variants, with the strange result that the variants which previously missed the vectorization are now vectorized, while those previously vectorized no longer are (also true for the variants of the first attachment). So a little bit of tuning is probably needed.

I also have to regtest and do some further investigation.

Comment 12 Steven Bosscher 2007-11-28 22:49:25 UTC
The only timings significantly changed are actually the compile times, which go up  significantly.
Comment 13 kargl 2007-11-28 23:06:49 UTC
(In reply to comment #12)
> The only timings significantly changed are actually the compile times, which go
> up  significantly.
> 

Look at the kepler execution time.  22.73 s without the patch and
26.11 s with the patch.  That's a 15% decrease in execution speed
(i.e., it runs slower!).
Comment 14 Steven Bosscher 2007-11-28 23:17:55 UTC
Yes, that too.  It was more a sarcastic addendum to your remark that there were so few significantly changed numbers.  It seemed to me you should not look at just the execution times ;-)
Comment 15 Dominique d'Humieres 2007-11-28 23:57:17 UTC
If I am allowed to be sarcastic too, I'll say that the increase in compile time (worst case 11%, arithmetic average 5%) is not against the current trend one can see, for instance, in

http://www.suse.de/~gcctest/c++bench/polyhedron/polyhedron-summary.txt-1-0.html

for no gain at all in execution time (see also the thread

http://gcc.gnu.org/ml/fortran/2007-07/msg00276.html).

Now I do expect that there will never be a patch committed that is worse than Richard's!

It came very fast: about one hour after my post.

It did not break anything so far.

It did the optimizations it was supposed to do on the intended code and some variants, even if it broke the vectorization of some other variants and increased the execution time of kepler by 15%.

At least it confirmed that the bottleneck for induct was both loop unrolling and vectorization. It still remains to understand why vectorization is no longer applied to codes for which it was applied before the patch.

To be clear, I think it is a mistake to use the f90 array features on small vectors, but I have seen it more often than I'd like. So this is the kind of optimization that can find its place in real-life codes, not only benchmarks.

Comment 16 Dominique d'Humieres 2007-11-29 08:06:24 UTC
A quick report on the comparison between the regression results for revision 130500 + the patch in comment #5 + Tobias' patch for pr34262, and revision 130489 + some patches applied to rev. 130500. I have the following new failures:

ERROR: gcc.dg/tree-ssa/loop-1.c: error executing dg-final: no files matched glob pattern "loop-1.c.[0-9][0-9][0-9]t.cunroll"
UNRESOLVED: gcc.dg/tree-ssa/loop-1.c: error executing dg-final: no files matched glob pattern "loop-1.c.[0-9][0-9][0-9]t.cunroll"
FAIL: gcc.dg/tree-ssa/loop-17.c scan-tree-dump sccp "set_nb_iterations_in_loop = 1"
ERROR: gcc.dg/tree-ssa/loop-23.c: error executing dg-final: no files matched glob pattern "loop-23.c.[0-9][0-9][0-9]t.cunroll"
UNRESOLVED: gcc.dg/tree-ssa/loop-23.c: error executing dg-final: no files matched glob pattern "loop-23.c.[0-9][0-9][0-9]t.cunroll"
FAIL: gcc.dg/vect/vect-105.c scan-tree-dump-times vect "vectorized 1 loops" 1
ERROR: gcc.dg/vect/vect-11a.c: error executing dg-final: no files matched glob pattern "vect-11a.c.[0-9][0-9][0-9]t.vect"
UNRESOLVED: gcc.dg/vect/vect-11a.c: error executing dg-final: no files matched glob pattern "vect-11a.c.[0-9][0-9][0-9]t.vect"
FAIL: gcc.dg/vect/vect-66.c scan-tree-dump-times vect "vectorized 3 loops" 1
FAIL: gcc.dg/vect/vect-76.c scan-tree-dump-times vect "vectorized 3 loops" 1
FAIL: gcc.dg/vect/vect-92.c scan-tree-dump-times vect "vectorized 1 loops" 3
FAIL: gcc.dg/vect/vect-92.c scan-tree-dump-times vect "Alignment of access forced using peeling" 3
FAIL: gcc.dg/vect/vect-outer-1.c scan-tree-dump-times vect "strided access in outer loop" 1
FAIL: gcc.dg/vect/vect-outer-6.c scan-tree-dump-times vect "OUTER LOOP VECTORIZED" 1
FAIL: gcc.dg/vect/vect-outer-6.c scan-tree-dump-times vect "zero step in outer loop." 1
ERROR: gcc.dg/vect/vect-shift-1.c: error executing dg-final: no files matched glob pattern "vect-shift-1.c.[0-9][0-9][0-9]t.vect"
UNRESOLVED: gcc.dg/vect/vect-shift-1.c: error executing dg-final: no files matched glob pattern "vect-shift-1.c.[0-9][0-9][0-9]t.vect"
FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect "vectorized 3 loops" 1
FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect "Alignment of access forced using peeling" 1
FAIL: gcc.dg/vect/no-section-anchors-vect-69.c scan-tree-dump-times vect "vectorized 4 loops" 1
FAIL: gcc.dg/vect/no-section-anchors-vect-69.c scan-tree-dump-times vect "Alignment of access forced using peeling" 2
ERROR: gcc.target/i386/vectorize1.c: error executing dg-final: no files matched glob pattern "vectorize1.c.[0-9][0-9][0-9]t.vect"
UNRESOLVED: gcc.target/i386/vectorize1.c: error executing dg-final: no files matched glob pattern "vectorize1.c.[0-9][0-9][0-9]t.vect"

FAIL: gfortran.dg/array_1.f90  -O3 -fomit-frame-pointer  execution test
FAIL: gfortran.dg/array_1.f90  -O3 -fomit-frame-pointer -funroll-loops  execution test
FAIL: gfortran.dg/array_1.f90  -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
FAIL: gfortran.dg/array_1.f90  -O3 -g  execution test

I am waiting for directions on how I can investigate these problems further.

Comment 17 Richard Biener 2007-11-29 10:11:50 UTC
Doh, not only did I miss diffing the chunk mentioned in comment #6, but I also
added the original unrolling pass, not the one that is only supposed to unroll inner
loops #)

So, change the passes.c hunk to

Index: gcc/passes.c
===================================================================
--- gcc/passes.c        (revision 130511)
+++ gcc/passes.c        (working copy)
@@ -570,6 +570,9 @@ init_optimization_passes (void)
       NEXT_PASS (pass_merge_phi);
       NEXT_PASS (pass_vrp);
       NEXT_PASS (pass_dce);
+      NEXT_PASS (pass_tree_loop_init);
+      NEXT_PASS (pass_complete_unrolli);
+      NEXT_PASS (pass_tree_loop_done);
       NEXT_PASS (pass_cselim);
       NEXT_PASS (pass_dominator);
       /* The only const/copy propagation opportunities left after


that should fix some of the testsuite failures.  Something else to experiment
with (to maybe fix some of the compile-time problems) is, in the tree-ssa-loop-ivcanon.c hunk, to change the condition to

           if (!unroll_outer
               && loop->inner)
             continue;

so that only innermost loops are unrolled, not all-but-outermost loops.

As for pass placement, another thing to look at is whether it works as part of
the early optimizations around

          NEXT_PASS (pass_early_inline);
          NEXT_PASS (pass_cleanup_cfg);
          NEXT_PASS (pass_rename_ssa_copies);
.... here
          NEXT_PASS (pass_ccp);
          NEXT_PASS (pass_forwprop);
          NEXT_PASS (pass_update_address_taken);
.... or here
          NEXT_PASS (pass_simple_dse);
          NEXT_PASS (pass_sra_early);

because this may enable SRA of variables in the loop body.

Most of the compile-time impact is actually from doing loop discovery, but
as we preserve loops now, maybe we no longer need pass_tree_loop_done after
the early unrolling, nor pass_tree_loop_init before the rest of the loop
optimizations?  Zdenek, can you confirm this?
Comment 18 Dominique d'Humieres 2007-11-29 10:22:26 UTC
I have had a look at what's happening for kepler.f90 (from the 2004 polyhedron test suite?) and it looks like another missed vectorization: if I count the mulpd in the kepler.s files, I find 24 before the patch and none after.

Comment 19 Dominique d'Humieres 2007-11-29 10:40:35 UTC
Richard,

I am not sure I understand your patch in comment #17. I already have the following in gcc/passes.c (after your patch in comment #5):

      NEXT_PASS (pass_merge_phi);
      NEXT_PASS (pass_vrp);
      NEXT_PASS (pass_dce);
      NEXT_PASS (pass_tree_loop_init);
      NEXT_PASS (pass_complete_unroll);
      NEXT_PASS (pass_tree_loop_done);
      NEXT_PASS (pass_cselim);
      NEXT_PASS (pass_dominator);
      /* The only const/copy propagation opportunities left after

do you mean that I should change

      NEXT_PASS (pass_complete_unroll);

to

      NEXT_PASS (pass_complete_unrolli);

? I am assuming my interpretation is correct and am rebuilding gcc right now.


Comment 20 Dominique d'Humieres 2007-11-29 11:00:45 UTC
I have applied my interpretation of the first two changes in comment #17. gfortran.dg/array_1.f90 still aborts and induct.v3.f90 is still not vectorized. The good news is that induct.f90 is still properly unrolled and vectorized.

Comment 21 rguenther@suse.de 2007-11-29 11:13:06 UTC
Subject: Re:  Missed optimizations

On Thu, 29 Nov 2007, dominiq at lps dot ens dot fr wrote:

> Richard,
> 
> I am not sure to understand your patch in comment #17. I have already in
> gcc/passes.c (after your patch in comment #5):
> 
>       NEXT_PASS (pass_merge_phi);
>       NEXT_PASS (pass_vrp);
>       NEXT_PASS (pass_dce);
>       NEXT_PASS (pass_tree_loop_init);
>       NEXT_PASS (pass_complete_unroll);
>       NEXT_PASS (pass_tree_loop_done);
>       NEXT_PASS (pass_cselim);
>       NEXT_PASS (pass_dominator);
>       /* The only const/copy propagation opportunities left after
> 
> do you mean that I should change
> 
>       NEXT_PASS (pass_complete_unroll);
> 
> to
> 
>       NEXT_PASS (pass_complete_unrolli);
> 
> ? I am assuming my interpretation is correct and rebuild gcc right now.

Yes, that's correct - I did too much copy&paste there :)

Richard.
Comment 22 Dominique d'Humieres 2007-11-29 11:16:21 UTC
On top of the first two changes from comment #17, I have MOVED

+      NEXT_PASS (pass_tree_loop_init);
+      NEXT_PASS (pass_complete_unrolli);
+      NEXT_PASS (pass_tree_loop_done);

to the first suggested place. Now gfortran.dg/array_1.f90 passes the test, induct is no longer unrolled/vectorized, but induct.v3 is: back to the pre-patch behavior, at least on these quick tests.

Comment 23 Dominique d'Humieres 2007-11-29 12:24:53 UTC
On top of the first two changes from comment #17, I have MOVED

+      NEXT_PASS (pass_tree_loop_init);
+      NEXT_PASS (pass_complete_unrolli);
+      NEXT_PASS (pass_tree_loop_done);

to the second suggested place. Same result as in comment #22.

Comment 24 Dominique d'Humieres 2007-11-29 15:49:06 UTC
I think I now have a partial understanding of what is happening for the induct variants that do not vectorize with the patch in comment #5: they do not contain any loop inside the k loop. If I replace

                  rot_q_vector(1) = rot_c_vector(1) - rot_qk_vector(k,1)
                  rot_q_vector(2) = rot_c_vector(2) - rot_qk_vector(k,2)
                  rot_q_vector(3) = rot_c_vector(3) - rot_qk_vector(k,3)

by

                  rot_q_vector(:) = rot_c_vector(:) - rot_qk_vector(k,:)

then the loop is vectorized again.

Comment 25 Uroš Bizjak 2007-11-30 21:38:00 UTC
(In reply to comment #24)

> Then the loop is vectorized again.

IMO, SLP should vectorize the sequence.
Comment 26 Dominique d'Humieres 2007-12-03 14:08:41 UTC
> IMO, SLP should vectorize the sequence.

Uros,

What is the meaning of the above sentence?

Comment 27 Dominique d'Humieres 2007-12-03 14:32:05 UTC
I have had a look at the failure of gfortran.dg/array_1.f90 with patch #5. The following reduced code gives the same failure:

! { dg-do run }
! PR 15553 : the array used to be filled with garbage
! this problem disappeared between 2004-05-20 and 2004-09-15
program arrpack
  implicit none
  
  double precision x(10,10), tmp(6,5)
  integer i, j

  x = -1
  do i=1,6
     do j=1,5
        x(i,j) = i+j*10
     end do
  end do
  tmp(:,:) = x(1:6, 1:5)
  print '(6F8.2)', tmp
  
end program arrpack

With -O3 and patch #5, the output is

   11.00   12.00   13.00   14.00   15.00   16.00
   -1.00   -1.00   -1.00   -1.00   21.00   22.00
   23.00   24.00   25.00   26.00   -1.00   -1.00
   -1.00   -1.00   31.00   32.00   33.00   34.00
   35.00   36.00   -1.00   -1.00   -1.00   -1.00

instead of

   11.00   12.00   13.00   14.00   15.00   16.00
   21.00   22.00   23.00   24.00   25.00   26.00
   31.00   32.00   33.00   34.00   35.00   36.00
   41.00   42.00   43.00   44.00   45.00   46.00
   51.00   52.00   53.00   54.00   55.00   56.00

I am amazed that this is the only failure of this kind among the several thousand tests I have run!

I'll attach the results of -fdump-tree-optimized with and without patch #5.

I have also looked at the gcc failures. Most of them are missed vectorizations or new ones, so this is already reported. Is *.[0-9][0-9][0-9]t.vect supposed to exist if the vectorization is missed? If yes, this explains a few failures. Concerning the failures with *.[0-9][0-9][0-9]t.cunroll, I see *cunroll1/2, but no cunroll.

Comment 28 Dominique d'Humieres 2007-12-03 14:33:52 UTC
Created attachment 14691 [details]
result of -fdump-tree-optimized with patch #5
Comment 29 Dominique d'Humieres 2007-12-03 14:34:39 UTC
Created attachment 14692 [details]
result of -fdump-tree-optimized without patch #5
Comment 30 Uroš Bizjak 2007-12-03 16:30:33 UTC
(In reply to comment #26)
> > IMO, SLP should vectorize the sequence.
> What is the meaning of the above sentence?

Uh, sorry for being terse. If there are no loops, then "straight-line parallelization" [SLP] should vectorize your manually unrolled sequence in comment #24. You can look into testsuite/gcc.dg/vect/slp-*.c for some C code examples.
Comment 31 Dominique d'Humieres 2007-12-03 18:58:56 UTC
> If there are no loops, then "straight-line parallelization" [SLP] should vectorize 
> your manually unrolled sequence in comment #24.

Yes, it should, but it does not after patch #5.  The unanswered question so far is why it does not, and then how to change the patch so that it does. Anyhow, the "good" vectorization should be along the k loop (length 9 instead of 3). My understanding of my tests is, first, that 5/9 < 2/3 and, more importantly, that the packing/unpacking overhead is a smaller penalty when it is shared, as in the k vectorization.

Comment 32 Ira Rosen 2007-12-04 06:56:57 UTC
(In reply to comment #30)
> Uh, sorry for being terse. If there are no loops, then "straight-line
> parallelization" [SLP] should vectorize your manually unrolled sequence in
> comment #24. 

Currently only loop-aware SLP is implemented, meaning the code must be enclosed in a loop to get vectorized.

> You can look into testsuite/gcc.dg/vect/slp-*.c for some c code
> examples.
> 

Right. For example,

  for (i = 0; i < N; i++)
    {
      out[i*4] = a0;
      out[i*4 + 1] = a1;
      out[i*4 + 2] = a2;
      out[i*4 + 3] = a3;
    }

will be transformed into:

  for (i = 0; i < N; i++)
    {
      out[i*4:i*4+3] = {a0, a1, a2, a3};
    }

Ira

Comment 33 Dominique d'Humieres 2008-04-23 21:26:21 UTC
Created attachment 15523 [details]
induct.f90 variants and their diff with the original file

The original diffs have space problems.
Comment 34 Dominique d'Humieres 2011-05-22 12:06:20 UTC
Created attachment 24325 [details]
reduced tests

The attached bzipped tar contains the file induct_red.f90 with all the
infrastructure to provide a realistic framework for running a reduced version of
the subroutine mutual_ind_quad_cir_coil contained in induct_qc_x.F90
(reduced to only one critical nested loop).

When the macro XPA is defined the original rotate code

		  rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
		  rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
		  rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))

is unrolled as follows (exploiting q_vector(3)==0) if the macro FLD is not defined

		  rot_q_vector(1) = rotate_quad(1,1) * q_vector(1) + &
				    rotate_quad(1,2) * q_vector(2)
		  rot_q_vector(2) = rotate_quad(2,1) * q_vector(1) + &
				    rotate_quad(2,2) * q_vector(2)
		  rot_q_vector(3) = rotate_quad(3,1) * q_vector(1) + &
				    rotate_quad(3,2) * q_vector(2)

Otherwise it is folded as

		  rot_q_vector(:) = rotate_quad(:,1) * q_vector(1) + &
				    rotate_quad(:,2) * q_vector(2)

When the macro XPB is defined the original numerator

		  numerator = w1gauss(j) * w2gauss(k) *               &
			      dot_product(coil_current_vec,current_vector)

is unrolled as

		  numerator = w1gauss(j) * w2gauss(k) *               &
			     (coil_current_vec(1)*current_vector(1) + &
			      coil_current_vec(2)*current_vector(2) + &
			      coil_current_vec(3)*current_vector(3))

When the macro XPC is defined the original denominator

		  denominator = sqrt(dot_product(rot_c_vector-rot_q_vector, &
						 rot_c_vector-rot_q_vector))

is unrolled as
		  denominator = sqrt((rot_c_vector(1)-rot_q_vector(1))**2 + &
				     (rot_c_vector(2)-rot_q_vector(2))**2 + &
				     (rot_c_vector(3)-rot_q_vector(3))**2)

				     
It also contains a script to run the twelve cases plus one case with
graphite, and the raw results for revisions 167530, 167531, and 173917
(original; with r167531 reverted: 173917r1; and with /* NEXT_PASS
(pass_complete_unrolli); */ : 173917n, since I think this is related to
revision 134730).

See also pr49006.
Comment 35 Dominique d'Humieres 2011-09-16 15:42:15 UTC
This pr (as well as pr49006) seems to have been fixed between revisions 176696 and 177649. I am closing 
pr49006 as fixed and I'll use this pr to track the remaining issues.
Comment 36 Dominique d'Humieres 2011-09-17 17:09:59 UTC
PRs 34265 and 49006 have been fixed by revision 176984:

Author:	wschmidt
Date:	Sun Jul 31 18:58:06 2011 UTC (6 weeks, 5 days ago)
Changed paths:	2
Log Message:	
2011-07-29  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	PR tree-optimization/49749
	* tree-ssa-reassoc.c (get_rank): New forward declaration.
	(PHI_LOOP_BIAS): New macro.
	(phi_rank): New function.
	(loop_carried_phi): Likewise.
	(propagate_rank): Likewise.
	(get_rank): Add calls to phi_rank and propagate_rank.

The following table compares the execution time and the line at which the vectorizer reports the vectorization for the twelve variants described in comment #34, showing that they are now all vectorized:

revision           176983          176984          178905
num den rot      time  line      time  line      time  line
 o   o   o      2.154s  243     2.154s  243     2.133s  243 
 u   o   o      1.973s  243     1.970s  243     2.035s  243 
 o   u   o      2.024s  243     2.023s  243     1.977s  243 
 u   u   o      3.053s          1.817s  234     1.831s  234 
 o   o   u      3.015s          1.839s  234     1.841s  234 
 u   o   u      3.030s          1.828s  234     1.816s  234 
 o   u   u      3.049s          1.818s  234     1.834s  234 
 u   u   u      3.059s          1.820s  234     1.818s  234 
 o   o   f      3.010s          1.825s  234     1.822s  234 
 u   o   f      3.033s          1.836s  234     1.826s  234 
 o   u   f      3.061s          1.814s  234     1.828s  234 
 u   u   f      3.058s          1.825s  234     1.812s  234 
graphite        1.937s  243     1.938s  243     1.912s  243 

(num, den, and rot stand for numerator, denominator, and rotate, respectively; o, u, and f stand for original, unrolled, and folded.)

Since gcc now has (at least for this class of code ;-) the three properties I expect from a good optimizer:
(1) it does not destroy the hand optimizations I have done;
(2) it optimizes the original code;
(3) it behaves consistently across variants;
I think a test should be added to the test suite to check that none of these properties are lost in future revisions.
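A compile-time, scan-based test could check at least the vectorization part. The sketch below is only an illustration, not a verified testcase: the subroutine is hypothetical, and the dejagnu directives, options, kind parameter, and scan pattern are assumptions; the kernel is reduced from the hoisted k loop of mutual_ind_quad_cir_coil shown in the description.

! { dg-do compile }
! { dg-options "-O3 -ffast-math -fdump-tree-vect-details" }
subroutine red_kernel (rotate_quad, x2gauss, y2gauss, a, b1, rot_qk_vector)
  implicit none
  ! illustrative kind; the original induct.f90 defines its own longreal
  integer, parameter :: longreal = selected_real_kind(15,90)
  real (kind=longreal), intent(in) :: rotate_quad(3,3), x2gauss(9), y2gauss(9), a, b1
  real (kind=longreal), intent(out) :: rot_qk_vector(9,3)
  real (kind=longreal) :: q_vector(3)
  integer :: k
  ! the hoisted k loop: no dependence on i or j
  do k = 1, 9
     q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
     q_vector(2) = 0.5_longreal * b1 * (y2gauss(k) - 1.0_longreal)
     q_vector(3) = 0.0_longreal
     rot_qk_vector(k,1) = dot_product(rotate_quad(1,:), q_vector(:))
     rot_qk_vector(k,2) = dot_product(rotate_quad(2,:), q_vector(:))
     rot_qk_vector(k,3) = dot_product(rotate_quad(3,:), q_vector(:))
  end do
end subroutine red_kernel
! { dg-final { scan-tree-dump "vectorized" "vect" } }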
Comment 37 Richard Biener 2011-09-26 08:36:26 UTC
Testcases for runtime properties are not supported.