This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug ipa/65701] r221530 makes 187.facerec drop with -Ofast -flto


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65701

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Ganesh.Gopalasubramanian@amd.com

--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
I notice some (obvious) differences just from glancing at the -fopt-info
output. The loops at

graphRoutines.f90:393
graphRoutines.f90:359

are not peeled for alignment when vectorized in the good case.

But that seems OK (well, we're peeling too much for alignment IMHO...).
In the fast variant we vectorize strided loads, while in the slow variant
we can use a vector load for one of the loads (and we made sure it is an
aligned load by peeling).

  1.11 │3682:   mov    0x60(%rsp),%rdx
  9.32 │3687:   vmovss (%rax,%r12,2),%xmm5
  1.44 │        vmovss (%rax),%xmm6
  4.46 │        inc    %rdi
  0.01 │        add    $0x10,%rcx
  1.17 │        vinser $0x10,(%rax,%r13,1),%xmm5,%xmm0
  1.92 │        vinser $0x10,(%rax,%r12,1),%xmm6,%xmm1
  0.28 │        add    %r14,%rax
  0.07 │        vmovlh %xmm0,%xmm1,%xmm0
  2.48 │        vfmadd %xmm3,-0x10(%rcx),%xmm0,%xmm3
  5.15 │        cmp    %rdi,%rdx
  0.01 │        ja     3687

so maybe the vfmadd with a memory operand is just bad for the pipeline
(I suspect bad for the decoder at least).
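
For readers following the listing: a minimal C sketch of what the loop above
appears to compute, namely a reduction that multiplies one strided operand by
one contiguous operand and accumulates the products. This is only a reading of
the assembly, not the Fortran source from graphRoutines.f90. The function and
parameter names are hypothetical, and it assumes %r13 holds three times the
stride in %r12.

```c
#include <stddef.h>

/* Hypothetical stand-in for the hot loop: not the 187.facerec source.
   a is read with a stride (the four vmovss/vinser loads per iteration),
   b is read contiguously (the memory operand of the vfmadd). */
float strided_dot(const float *a, size_t stride, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i * stride] * b[i];   /* one multiply-add per element */
    return sum;
}
```

When such a loop is vectorized, the strided operand has to be assembled from
scalar loads, while the contiguous operand can be fetched with a single vector
load, which is what the memory operand of the vfmadd does here.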

To me it really looks like trunk generates better code but we run into
a very odd bdver2 architectural issue (if the above loop is really the issue).
You could try disabling peeling for alignment with --param
vect-max-peeling-for-alignment=0 (so you get an unaligned load and a vfmadd
without a memory operand).
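
For context, a hand-written C sketch of what peeling for alignment amounts to:
a scalar prologue runs until the pointer reaches a 16-byte boundary, so the
main loop can use aligned vector loads. This only illustrates the idea and is
not what the vectorizer literally emits. The function name and the 16-byte,
4-float vector width are assumptions chosen to match the %xmm code above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum n floats with a peeled prologue. */
float sum_peeled(const float *x, size_t n)
{
    size_t i = 0;
    float sum = 0.0f;

    /* Prologue (the "peeled" iterations): scalar steps until x + i is
       16-byte aligned. */
    while (i < n && ((uintptr_t)(x + i) % 16) != 0) {
        sum += x[i];
        i++;
    }

    /* Main loop: four floats per iteration, starting on an aligned
       address, so a vectorizer could use aligned vector loads here. */
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (; i + 4 <= n; i += 4)
        for (int k = 0; k < 4; k++)
            acc[k] += x[i + k];
    sum += acc[0] + acc[1] + acc[2] + acc[3];

    /* Epilogue: leftover elements that do not fill a whole vector. */
    for (; i < n; i++)
        sum += x[i];
    return sum;
}
```

With peeling disabled, the prologue goes away and the main loop has to cope
with a possibly unaligned starting address, hence the unaligned load.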

I don't think this is a RA issue.

Ganesh?
