[Bug c++/69132] New: AVX optimization bug: extra unnecessary code insertion?

Mon Jan 4 05:03:00 GMT 2016

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69132

            Bug ID: 69132
           Summary: AVX optimization bug: extra unnecessary code
                    insertion?
           Product: gcc
           Version: 4.8.4
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: xuancong84 at gmail dot com
  Target Milestone: ---

It is great that gcc compiler 4.8 comes with AVX optimization with -Ofast
option. However, I found an interesting but stupid bug, that it adds additional
computations which are unnecessary. Maybe I am wrong so can someone give me a
reason?

The original C++ source code is as follows:

    #define N 1000007

    float a[N],b[N],c[N],d[N],e[N];

    int main(int argc, char *argv[]){
        cout << a << ' ' << b << ' ' << c << endl;
        for(int x=0; x<N; ++x){
            c[x] = 1/sqrt((a[x]+b[x]-c[x])*d[x]/e[x]);
        }
        return  0;
    }

The code is compiled using g++ 4.8.4 in Ubuntu 14.04.3 x86_64:
g++ -mavx avx.cpp -masm=intel -c -g -Wa,-ahl=avx.asm -Ofast

The assembly source code is as follows:

      90                    .LVL10:
      91 006b C5FC2825              vmovaps ymm4, YMMWORD PTR .LC0[rip]
      91      00000000 
      92 0073 31C0                  xor     eax, eax
      93 0075 C5FC281D              vmovaps ymm3, YMMWORD PTR .LC1[rip]
      25:avx.cpp       ****         for(int x=0; x<N; ++x){
      26:avx.cpp       ****                 c[x] =
1/sqrt((a[x]+b[x]-c[x])*d[x]/e[x]);
     101                            .loc 1 26 0 discriminator 2
     102 0080 C5FC2890              vmovaps ymm2, YMMWORD PTR b[rax]
     102      00000000 
     103 0088 4883C020              add     rax, 32
     104 008c C5FC2888              vmovaps ymm1, YMMWORD PTR e[rax-32]
     104      00000000 
     105 0094 C5EC5890              vaddps  ymm2, ymm2, YMMWORD PTR a[rax-32]
     105      00000000 
     106 009c C5FC53C1              vrcpps  ymm0, ymm1
     107 00a0 C5FC59C9              vmulps  ymm1, ymm0, ymm1
     108 00a4 C5FC59C9              vmulps  ymm1, ymm0, ymm1
     109 00a8 C5EC5C90              vsubps  ymm2, ymm2, YMMWORD PTR c[rax-32]
     109      00000000 
     110 00b0 C5FC58C0              vaddps  ymm0, ymm0, ymm0
     111 00b4 C5EC5990              vmulps  ymm2, ymm2, YMMWORD PTR d[rax-32]
     111      00000000 
     112 00bc C5FC5CC9              vsubps  ymm1, ymm0, ymm1
     113 00c0 C5EC59C1              vmulps  ymm0, ymm2, ymm1
     118 00c4 C5FC52C8              vrsqrtps        ymm1, ymm0
     119 00c8 C5F459C0              vmulps  ymm0, ymm1, ymm0
     120 00cc C5FC59C1              vmulps  ymm0, ymm0, ymm1
     121 00d0 C5F459CB              vmulps  ymm1, ymm1, ymm3
     122 00d4 C5FC58C4              vaddps  ymm0, ymm0, ymm4
    ^LGAS LISTING /tmp/ccJtIFtg.s                   page 21

     123 00d8 C5FC59C9              vmulps  ymm1, ymm0, ymm1
     124                    .LBE45:
     125                    .LBE44:
     126                            .loc 1 26 0 discriminator 2
     127 00dc C5FC2988              vmovaps YMMWORD PTR c[rax-32], ymm1
     127      00000000 
     128 00e4 483D0009              cmp     rax, 4000000
     128      3D00
     129 00ea 7594                  jne     .L3

Now look at line 106, 107, 108, 110, 112 and 113.

The compiler computes the division by e[x] using the multiplication by its
inverse. So Line 106 computes 1/e[x], which is correct. After that it can
directly multiply this with the final product of (a[x]+b[x]-c[x])*d[x], which
is stored in ymm2, Line 111. However, instead of doing this, the compiler did
something interesting and ridiculous:

 1. it first multiplies the computed reciprocal 1/e[x] with e[x] to
    obtain 1 (Line 107)

 2. then multiply this 1 with 1/e[x] to obtain back 1/e[x] (Line 108)

 3. then it adds 1/e[x] to itself to obtain 2/e[x] (Line 110)

 4. then it subtracts 2/e[x] by 1/e[x] to obtain back 1/e[x] (Line 112)

After that, the compiler is ingenious to use the vrsqrtps instruction to
compute 1/sqrt(). However, after that, what happens? Instead of extracting the
output in ymm1 (Line 118), it did something even more fanciful again:

 1. it first multiplies 1/sqrt(x) by x to obtain sqrt(x), (Line 119)

 2. it then multiplies the sqrt(x) by 1/sqrt(x) to obtain back 1, (Line 120)

 3. it then multiplies 1/sqrt(x) by 1 (pre-stored in ymm3) to obtain the same
1/sqrt(x), (Line 121)

 4. it then adds the obtained 1 by 0 (pre-stored in ymm4) to obtain 1, (Line
122)

 5. it then multiplies 1/sqrt(x) with the obtained 1 to obtain back the same
1/sqrt(x), (Line 123)

The above two redundancies show that whenever 1/x is required, the compiler
tends to multiply the already obtained output with the original number to
obtain back 1, and then multiply this 1 with the already obtained output to
obtain back the same output. Is there any reason for doing this? Or it is just
a bug?