Bug 88497 - Improve Accumulation in Auto-Vectorized Code
Summary: Improve Accumulation in Auto-Vectorized Code
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 9.0
Importance: P3 normal
Target Milestone: ---
Assignee: Kewen Lin
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-12-14 13:58 UTC by kelvin
Modified: 2020-03-13 12:18 UTC
CC: 4 users

See Also:
Host:
Target: powerpc*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-01-23 00:00:00


Attachments

Description kelvin 2018-12-14 13:58:05 UTC

    
Comment 1 kelvin 2018-12-14 14:12:08 UTC
Consider the following loop:

double sacc = 0.00;
extern double x[], y[];
for (unsigned long long int i = 0; i < N; i++)
  sacc += x[i] * y[i];

Auto-vectorization turns the body of the loop into something close to the following function foo:

double foo (double accumulator, vector double arg2[], vector double arg3[])
{
  vector double temp;

  temp = arg2[0] * arg3[0];
  accumulator += temp[0] + temp[1];
  temp = arg2[1] * arg3[1];
  accumulator += temp[0] + temp[1];
  temp = arg2[2] * arg3[2];
  accumulator += temp[0] + temp[1];
  temp = arg2[3] * arg3[3];
  accumulator += temp[0] + temp[1];
  return accumulator;
}

Compiled with -O3 -mcpu=power9 -ffast-math, this translates into 25 instructions:

foo:
.LFB11:
        .cfi_startproc
        lxv 6,0(5)
        lxv 10,0(4)
        lxv 7,16(5)
        lxv 11,16(4)
        lxv 8,32(5)
        lxv 12,32(4)
        lxv 9,48(5)
        lxv 0,48(4)
        xvmuldp 10,10,6
        xvmuldp 11,11,7
        xvmuldp 12,12,8
        xvmuldp 0,0,9
        xxpermdi 7,10,10,3
        xxpermdi 8,11,11,3
        fadd 10,7,10
        xxpermdi 9,12,12,3
        fadd 11,8,11
        xxpermdi 6,0,0,3
        fadd 12,9,12
        fadd 0,6,0
        fadd 10,10,1
        fadd 11,11,10
        fadd 1,12,11
        fadd 1,0,1
        blr
 

If auto-vectorization were to transform this loop into the following equivalent code, the resulting translation would be only 18 instructions:

double foo (double accumulator, vector double arg2[], vector double arg3[])
{
  vector double temp;

  temp[0] = accumulator;
  temp[1] = 0.0;
  temp += arg2[0] * arg3[0];
  temp += arg2[1] * arg3[1];
  temp += arg2[2] * arg3[2];
  temp += arg2[3] * arg3[3];
  return temp[0] + temp[1];
}

foo:                                                                            
.LFB11:                                                                         
        .cfi_startproc                                                          
        li 9,0                                                                  
        lxv 10,0(4)                                                             
        lxv 6,0(5)                                                              
        lxv 11,16(4)                                                            
        lxv 7,16(5)                                                             
        mtvsrd 0,9                                                              
        lxv 12,32(4)                                                            
        lxv 8,32(5)                                                             
        lxv 9,48(5)                                                             
        xxpermdi 1,0,1,0                                                        
        lxv 0,48(4)                                                             
        xvmaddadp 1,10,6                                                        
        xvmaddadp 1,11,7                                                        
        xvmaddadp 1,12,8                                                        
        xvmaddadp 1,0,9                                                         
        xxpermdi 0,1,1,3                                                        
        fadd 1,0,1                                                              
        blr                  

I have also experimented with trunk's treatment of x86 targets, and the same optimization is relevant there:

x86 -O3 -ffast-math optimized translation of the "original" source is:

_foo:
LFB1:
        ;; 17 insns in original code                                                
        ;;  movapd:     4
        ;;  mulpd:      4                                                       
        ;;  addsd:      4                                                       
        ;;  haddpd:     4                                                       
        ;;  ret:        1                                                       
        ;;  total:     17                                                       

        movapd  32(%rdi), %xmm2 ; load arg2[2]                                  
        mulpd   32(%rsi), %xmm2	; multiply arg2[2] * arg3[2]                    
        movapd  (%rdi), %xmm1	; load arg2[0]                                  
        movapd  16(%rdi), %xmm3	; load arg2[1]                                  
        mulpd   (%rsi), %xmm1	; multiply arg2[0] * arg3[0]                    
        mulpd   16(%rsi), %xmm3	; multiply arg2[1] * arg3[1]                    
        haddpd  %xmm2, %xmm2	; sum args[2] products                          
        haddpd  %xmm1, %xmm1	; sum args[0] products                          
        haddpd  %xmm3, %xmm3	; sum args[1] products                          
        addsd   %xmm0, %xmm2	; add args[2] sum into accumulator              
        movapd  48(%rdi), %xmm0	; load arg2[3]                                  
        mulpd   48(%rsi), %xmm0	; multiply arg2[3] * arg3[3]                    
        addsd   %xmm3, %xmm1	; add args[0] and args[1] products              
        addsd   %xmm2, %xmm1	; accumulate args[0..2] products                
        haddpd  %xmm0, %xmm0	; sum args[3] products
        addsd   %xmm1, %xmm0	; accumulate args[0..3]
        ret

The optimized translation of the "improved" source on x86 is:

_foo:
LFB1:
        ;; 15 insns in translation of improved source code                                                
        ;;  movq:       1                                                       
        ;;  movapd:     4       (load vectors)
        ;;  mulpd:      4                                                       
        ;;  addpd:      4                                                       
        ;;  haddpd:     1                                                       
        ;;  ret:        1                                                       
        ;;  total:     15                                                       

        movapd  (%rdi), %xmm1   ; load arg2[0]                                  
        movq    %xmm0, %xmm0    ; move quad word: I don't understand this.
        mulpd   (%rsi), %xmm1	; multiply arg2[0] * arg3[0]                    
        movapd  16(%rdi), %xmm2	; load arg2[1]                                  
        mulpd   16(%rsi), %xmm2	; multiply arg2[1] * arg3[1]                    
        movapd  48(%rdi), %xmm3	; load arg2[3]                                  
        mulpd   48(%rsi), %xmm3	; multiply arg2[3] * arg3[3]                    
        addpd   %xmm2, %xmm1	; add args[0] and args[1] products              
        movapd  32(%rdi), %xmm2	; load arg2[2]
        mulpd   32(%rsi), %xmm2	; multiply arg2[2] * arg3[2]
        addpd   %xmm3, %xmm2	; add args[2] and args[3] products
        addpd   %xmm2, %xmm1	; add all args products
        addpd   %xmm1, %xmm0	; add sums to accumulator
        haddpd  %xmm0, %xmm0	; add double elements of vector                 
        ret
Comment 2 Jakub Jelinek 2018-12-14 20:01:41 UTC
Is that though something the vectorizer should do, or better something some post-vectorization pass like reassoc which already has such infrastructure should do?
Comment 3 Jakub Jelinek 2018-12-14 20:02:12 UTC
Obviously, in either case for -ffast-math only.
Comment 4 Bill Schmidt 2018-12-14 20:54:28 UTC
Yes, reassociation sounds like the right place to look at this.
Comment 5 Richard Biener 2018-12-17 08:15:39 UTC
I think it is reassoc doing it "wrong" based on the target's reassoc-width?
Because the vectorizer generates exactly the code you are proposing.

Though you didn't even provide a fully compilable testcase.  I guessed N to
be 16 here and your ideal examples use 'accumulator' which I assume to be 0.0.

Your x86 code-gen examples are also from GCC 8, I assume (plus some -march/-mtune
flag you didn't expose, given that it uses haddpd).
Comment 6 Bill Schmidt 2018-12-17 14:51:52 UTC
Reassociation width should be 4 for this case per the target hook.  Kelvin, you can experiment with rs6000_reassociation_width to see if larger values give you what you expect.
Comment 7 kelvin 2019-01-22 19:12:36 UTC
Here is the original program that motivated my simplified reproducer:

extern void first_dummy ();
extern void dummy (double sacc, int n);
extern void other_dummy ();

extern float opt_value;
extern char *opt_desc;

#define M 128
#define N 512

double x [N];
double y [N];

int main (int argc, char *argv []) {
  double sacc;

  first_dummy ();
  for (int j = 0; j < M; j++) {

    sacc = 0.00;
    for (unsigned long long int i = 0; i < N; i++) {
      sacc += x[i] * y[i];
    }
    dummy (sacc, N);
  }
  opt_value = ((float) N) * 2 * ((float) M);
  opt_desc = "flops";
  other_dummy ();
}
Comment 8 kelvin 2019-01-23 17:36:50 UTC
In revisiting this problem report, I have confirmed that I get the desired optimization if I specify -ffast-math when compiling the original loop. I believe my original discovery of this optimization opportunity had omitted the -ffast-math option.

I have confirmed that reassociation does not produce the desired translation of the loop body in isolation, even if -ffast-math is specified on the command line, and even if I experiment with very large values of reassociation_width for the rs6000 target.

Apparently, the reassociation pass is not clever enough to recognize opportunities to transform multiply-add accumulations into expressions that favor use of the xvmaddadp instruction.  Reassociation may change the order in which the sums computed for different vector element products are combined with the accumulator.  But in my experience, reassociation does not discover the opportunity to accumulate the products from different vectors using vector sum instructions or even better, the vector multiply-add instruction.

Since the code produced for -ffast-math auto-vectorization of multiply-add accumulation loops is "optimal", I am recommending future effort on this issue be treated as low priority.
Comment 9 Kewen Lin 2019-03-05 07:09:47 UTC
As Kelvin mentioned in the last comment, there is something we can teach reassoc in order to generate better code for the case below, although it's low priority.

double foo (double accumulator, vector double arg2[], vector double arg3[])
{
  vector double temp;

  temp = arg2[0] * arg3[0];
  accumulator += temp[0] + temp[1];
  temp = arg2[1] * arg3[1];
  accumulator += temp[0] + temp[1];
  temp = arg2[2] * arg3[2];
  accumulator += temp[0] + temp[1];
  temp = arg2[3] * arg3[3];
  accumulator += temp[0] + temp[1];
  return accumulator;
}

Confirmed.
Comment 10 Kewen Lin 2019-07-15 05:12:37 UTC
Author: linkw
Date: Mon Jul 15 05:12:05 2019
New Revision: 273490

URL: https://gcc.gnu.org/viewcvs?rev=273490&root=gcc&view=rev
Log:
gcc/ChangeLog

2019-07-15  Kewen Lin  <linkw@gcc.gnu.org>

    PR tree-optimization/88497
    * tree-ssa-reassoc.c (reassociate_bb): Swap the positions of
    GIMPLE_BINARY_RHS check and gimple_visited_p check, call new
    function undistribute_bitref_for_vector.
    (undistribute_bitref_for_vector): New function.
    (cleanup_vinfo_map): Likewise.
    (sort_by_mach_mode): Likewise.

gcc/testsuite/ChangeLog

2019-07-15  Kewen Lin  <linkw@gcc.gnu.org>

    PR tree-optimization/88497
    * gcc.dg/tree-ssa/pr88497-1.c: New test.
    * gcc.dg/tree-ssa/pr88497-2.c: Likewise.
    * gcc.dg/tree-ssa/pr88497-3.c: Likewise.
    * gcc.dg/tree-ssa/pr88497-4.c: Likewise.
    * gcc.dg/tree-ssa/pr88497-5.c: Likewise.
    * gcc.dg/tree-ssa/pr88497-6.c: Likewise.
    * gcc.dg/tree-ssa/pr88497-7.c: Likewise.


Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr88497-1.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr88497-2.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr88497-3.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr88497-4.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr88497-5.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr88497-6.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr88497-7.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-reassoc.c
Comment 11 Kewen Lin 2019-07-15 05:23:39 UTC
For the code in comment 9, it now generates the code below with the committed fix:

0000000000000000 <foo>:
   0:   11 00 04 f4     lxv     vs0,16(r4)
   4:   11 00 c5 f4     lxv     vs6,16(r5)
   8:   31 00 44 f5     lxv     vs10,48(r4)
   c:   31 00 e5 f4     lxv     vs7,48(r5)
  10:   01 00 84 f5     lxv     vs12,0(r4)
  14:   01 00 05 f5     lxv     vs8,0(r5)
  18:   21 00 64 f5     lxv     vs11,32(r4)
  1c:   21 00 25 f5     lxv     vs9,32(r5)
  20:   80 33 00 f0     xvmuldp vs0,vs0,vs6
  24:   80 3b 4a f1     xvmuldp vs10,vs10,vs7
  28:   08 43 0c f0     xvmaddadp vs0,vs12,vs8
  2c:   90 54 8a f1     xxlor   vs12,vs10,vs10
  30:   08 4b 8b f1     xvmaddadp vs12,vs11,vs9
  34:   00 63 00 f0     xvadddp vs0,vs0,vs12
  38:   90 00 80 fd     fmr     f12,f0
  3c:   50 03 00 f0     xxspltd vs0,vs0,1
  40:   2a 00 0c fc     fadd    f0,f12,f0
  44:   2a 08 20 fc     fadd    f1,f0,f1
  48:   20 00 80 4e     blr
Comment 12 Segher Boessenkool 2019-07-15 16:38:57 UTC
It still does some weird register moves (the xxlor and the fmr), but
those are totally different problems ;-)