This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Serious performance regression -- some tree optimizer questions


Hello,

we've started to do some performance analysis of current GCC mainline
on zSeries, and while in general results are promising, we've found a
couple of serious performance regressions.  The most extreme one was
the SPECfp2000 test case mgrid -- on s390x optimized for z990, this
test case takes about *twice* the run time with current mainline as
it did with GCC 3.4 ...

The reason for this appears to be a combination of the new Fortran
front end generating different code, and the new loop optimizers not
quite being able to handle that code.  The hot spot of the test case
is this single loop:

      DO 600 I3=2,N-1
      DO 600 I2=2,N-1
      DO 600 I1=2,N-1
 600  R(I1,I2,I3)=V(I1,I2,I3)
     >      -A(0)*( U(I1,  I2,  I3  ) )
     >      -A(1)*( U(I1-1,I2,  I3  ) + U(I1+1,I2,  I3  )
     >                 +  U(I1,  I2-1,I3  ) + U(I1,  I2+1,I3  )
     >                 +  U(I1,  I2,  I3-1) + U(I1,  I2,  I3+1) )
     >      -A(2)*( U(I1-1,I2-1,I3  ) + U(I1+1,I2-1,I3  )
     >                 +  U(I1-1,I2+1,I3  ) + U(I1+1,I2+1,I3  )
     >                 +  U(I1,  I2-1,I3-1) + U(I1,  I2+1,I3-1)
     >                 +  U(I1,  I2-1,I3+1) + U(I1,  I2+1,I3+1)
     >                 +  U(I1-1,I2,  I3-1) + U(I1-1,I2,  I3+1)
     >                 +  U(I1+1,I2,  I3-1) + U(I1+1,I2,  I3+1) )
     >      -A(3)*( U(I1-1,I2-1,I3-1) + U(I1+1,I2-1,I3-1)
     >                 +  U(I1-1,I2+1,I3-1) + U(I1+1,I2+1,I3-1)
     >                 +  U(I1-1,I2-1,I3+1) + U(I1+1,I2-1,I3+1)
     >                 +  U(I1-1,I2+1,I3+1) + U(I1+1,I2+1,I3+1) )

The key to optimizing this loop is to remove the redundancies
in address arithmetic, choose proper induction variables, and
at the same time avoid excessive register pressure.  (There is
a 'perfect' solution to this on zSeries that interestingly enough
GCC 2.95.3 was able to find, but no version since ;-/)

However, current mainline does rather badly at all three of
these tasks.  I'm not quite sure which pass is at fault here;
some optimizations that seem quite straightforward to me are
not being performed by any pass.

For example, the gimple code contains sequences of the form:

 s1 = a + b
 s2 = s1 + c
 t1 = a + c
 t2 = t1 + b

but no pass recognizes that s2 == t2 ...
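
To make the missed equivalence concrete, here is a minimal C sketch of
the two sequences.  This is not GCC code; the function names sum_s and
sum_t are made up for illustration.  It assumes wrapping unsigned
arithmetic, for which addition is associative and commutative, so both
functions must return the same value for any inputs.  A pass that put
the operands of such chains into a canonical order could in principle
let common subexpression elimination see that s2 and t2 are the same.

```c
/* Illustration only: mirrors the first GIMPLE sequence (a + b) + c. */
static unsigned long sum_s(unsigned long a, unsigned long b, unsigned long c)
{
    unsigned long s1 = a + b;
    unsigned long s2 = s1 + c;
    return s2;
}

/* Illustration only: mirrors the second GIMPLE sequence (a + c) + b. */
static unsigned long sum_t(unsigned long a, unsigned long b, unsigned long c)
{
    unsigned long t1 = a + c;
    unsigned long t2 = t1 + b;
    return t2;
}
```

Both functions compute a + b + c; only the order in which the operands
are combined differs, which is exactly what hides the redundancy from a
purely syntactic comparison.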

Likewise, no pass recognizes that in a sequence:

 i = a + b
 x = array[i]

where both the base 'array' and the value 'a' are loop-invariant,
but the value 'b' isn't, 'array' and 'a' can be combined into
a new base address 'array + a*stride' which can be moved out of
the loop.  (The old RTL loop optimizer was able to do this.)
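
As a rough sketch of the transformation I mean, written by hand in C
rather than taken from any compiler dump: the names sum_naive and
sum_hoisted are hypothetical, the element type is assumed to be double,
and the loop simply sums elements so that there is something to
compare.  In the first version the index a + b is formed on every
iteration; in the second, the invariant part is folded into a new base
pointer once, before the loop.

```c
#include <stddef.h>

/* Naive form: the index i = a + b is recomputed on every iteration. */
static double sum_naive(const double *array, size_t a, size_t n)
{
    double total = 0.0;
    for (size_t b = 0; b < n; b++) {
        size_t i = a + b;
        total += array[i];
    }
    return total;
}

/* Hoisted form: 'array' and 'a' are loop-invariant, so they are
   combined once into a new base address.  In byte terms this is
   array + a*stride with stride == sizeof(double). */
static double sum_hoisted(const double *array, size_t a, size_t n)
{
    const double *base = array + a;
    double total = 0.0;
    for (size_t b = 0; b < n; b++)
        total += base[b];
    return total;
}
```

Both versions read the same n elements starting at array[a]; the
second one just leaves a single varying term in the address
computation inside the loop.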


Any suggestions on how to investigate this further?  Is this just
the way things are with the current tree optimizers, or is this
supposed to work and we just need to find the bug?

Thanks,
Ulrich

-- 
  Dr. Ulrich Weigand
  Linux on zSeries Development
  Ulrich.Weigand@de.ibm.com

