This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



FDO regression in Skidmarks component ParseVid


Using: gcc version 3.5.0 20040430 (experimental) for PowerPC

Here is an example of a performance regression when using FDO
(feedback-directed optimization).  The "ParseVid" component of Skidmarks
runs about 5% slower with FDO.  I traced it to a loop that is unrolled in
the non-FDO compilation but not in the FDO version.  Hand-unrolling that
loop recovered the full 5% plus an additional 2%, so the missed unrolling
not only introduces a penalty but also negates other beneficial
transformations.
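To make "hand unrolled" concrete, here is a minimal sketch of the kind of
rewrite involved.  It is not the Skidmarks source: the function names, the
macro, and the array shape are invented for illustration, and the trip count
of 16 with two 32-bit stores per iteration is only inferred from the assembly
listings in this message.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NPAIRS 16  /* assumed trip count, inferred from the listings */

/* Rolled form: one pair of 32-bit stores per iteration. */
void fill_rolled(uint32_t dst[NPAIRS][2], uint32_t w0, uint32_t w1)
{
    assert(dst != NULL);
    for (int i = 0; i < NPAIRS; i++) {
        dst[i][0] = w0;
        dst[i][1] = w1;
    }
}

/* Hand-unrolled form: the same 32 stores written out with constant
 * indices, so no induction variable or loop branch remains. */
#define STORE_PAIR(i) do { dst[(i)][0] = w0; dst[(i)][1] = w1; } while (0)

void fill_unrolled(uint32_t dst[NPAIRS][2], uint32_t w0, uint32_t w1)
{
    assert(dst != NULL);
    STORE_PAIR(0);  STORE_PAIR(1);  STORE_PAIR(2);  STORE_PAIR(3);
    STORE_PAIR(4);  STORE_PAIR(5);  STORE_PAIR(6);  STORE_PAIR(7);
    STORE_PAIR(8);  STORE_PAIR(9);  STORE_PAIR(10); STORE_PAIR(11);
    STORE_PAIR(12); STORE_PAIR(13); STORE_PAIR(14); STORE_PAIR(15);
}

/* Returns 1 when both forms leave identical bytes in memory. */
int fills_match(uint32_t w0, uint32_t w1)
{
    uint32_t a[NPAIRS][2], b[NPAIRS][2];
    memset(a, 0x00, sizeof a);
    memset(b, 0xff, sizeof b);
    fill_rolled(a, w0, w1);
    fill_unrolled(b, w0, w1);
    return memcmp(a, b, sizeof a) == 0;
}
```

Because every index in the unrolled form is a compile-time constant, each
store can use a fixed displacement from one base register, which is the
shape visible in the non-FDO listing.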

To reproduce given the attached files:

gcc -O2 -funroll-loops -m32 -c parse.i

In the code generated for procedure "ParseVideoSegment", the loop in
question has been completely unrolled.  Note also that the loop in which it
is nested has become a bct (branch-on-count) loop:

.L8:
      lhz 11,0(4)
      li 0,12
      lhz 8,2(4)
      slwi 9,30,3
      slwi 11,11,16
      sth 0,8(5)
      or 11,11,8
      addi 9,9,-12
      srwi 10,11,20
      sth 9,10(5)
      rlwinm 0,10,0,30,31
      rlwinm 8,11,10,31,31
      add 0,29,0
      stw 3,0(5)
      slwi 0,0,1
      stw 11,4(5)
      lhzx 9,24,0
      stw 6,120(12)
      stw 7,124(12)
      sth 9,128(12)
      stb 8,130(12)
      stb 25,131(12)
      stw 6,0(12)
      stw 7,4(12)
      stw 6,8(12)
      stw 7,12(12)
      stw 6,16(12)
      stw 7,20(12)
      stw 6,24(12)
      stw 7,28(12)
      stw 6,32(12)
      stw 7,36(12)
      stw 6,40(12)
      stw 7,44(12)
      stw 6,48(12)
      stw 7,52(12)
      stw 6,56(12)
      stw 7,60(12)
      stw 6,64(12)
      stw 7,68(12)
      stw 6,72(12)
      stw 7,76(12)
      stw 6,80(12)
      stw 7,84(12)
      stw 6,88(12)
      stw 7,92(12)
      stw 6,96(12)
      stw 7,100(12)
      stw 6,104(12)
      stw 7,108(12)
      stw 6,112(12)
      stw 7,116(12)
      rlwinm 10,10,20,0,8
      srawi 10,10,18
      add 4,4,30
      sth 10,0(12)
      addi 5,5,16
      addi 12,12,132
      addi 31,31,1
      bdz .L119
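The store displacements in the listing above hint at the shape of the data
being written.  The following is a rough sketch of a record layout that is
consistent with them; it is a guess, not the declaration from parse.i.  All
type, field, and function names are hypothetical.  It also ignores the final
"sth 10,0(12)", which overwrites the first halfword of the record after the
word stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record; every name here is invented for illustration. */
struct record {
    uint32_t pairs[16][2]; /* displacements 0..124: the stw 6 / stw 7 pairs */
    uint16_t half;         /* displacement 128: sth 9,128(12) */
    uint8_t  flag;         /* displacement 130: stb 8,130(12) */
    uint8_t  kind;         /* displacement 131: stb 25,131(12) */
};

/* Byte displacement of word w (0 or 1) of pair i within one record. */
size_t pair_disp(size_t i, size_t w)
{
    assert(i < 16 && w < 2);
    return offsetof(struct record, pairs) + i * 8 + w * 4;
}

/* Returns 1 when the sketch agrees with the listing: a 132-byte stride
 * (addi 12,12,132) and trailing fields at displacements 128, 130, 131. */
int layout_matches_listing(void)
{
    return sizeof(struct record) == 132
        && offsetof(struct record, half) == 128
        && offsetof(struct record, flag) == 130
        && offsetof(struct record, kind) == 131;
}
```

Under this reading, the sixteen word pairs at displacements 0 through 124
are the fully unrolled inner loop, and the advance of register 12 by 132
steps the outer loop to the next record.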

If the same code is compiled using FDO, the loop is no longer unrolled,
presumably because the unroller can no longer determine the iteration
count.  The inner loop itself becomes a bct loop and, since PowerPC has only
one count register (CTR), that prevents the enclosing loop from becoming
one.  The indexed stores and their associated address computations also form
a very undesirable dependence chain within the loop, as well as within the
two peeled iterations.

gcc -O2 -fprofile-use -funroll-loops -m32 -c parse.i

Generates:

.L111:
      addi 9,27,1
      slwi 0,27,3
      addi 11,9,1
      slwi 9,9,3
      cmplwi 7,11,15
      stwx 4,6,0
      la 6,4(6)
      stwx 5,6,0
      la 6,-4(6)
      stwx 4,9,6
      la 9,4(9)
      stwx 5,9,6
      la 9,-4(9)
      bgt- 7,.L112
      addi 0,11,1
      cmplwi 7,0,16
      subfic 0,11,16
      mtctr 0
      bgt- 7,.L128
.L94:
      slwi 0,11,3
      addi 11,11,1
      stwx 4,6,0
      la 6,4(6)
      stwx 5,6,0
      la 6,-4(6)
      bdnz .L94

I've attached the files needed to reproduce this:
(See attached file: parse.i)(See attached file: parse.gcda)(See attached
file: parse.gcno)


Pete

Attachment: parse.i
Description: Binary data

Attachment: parse.gcda
Description: Binary data

Attachment: parse.gcno
Description: Binary data

