[Bug target/56200] queens benchmark is faster with -O0 than with any other optimization level

Tue Feb 5 23:51:00 GMT 2013

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56200

H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |areg.melikadamyan at gmail
                   |                            |dot com

--- Comment #5 from H.J. Lu <hjl.tools at gmail dot com> 2013-02-05 23:50:35 UTC ---
Optimized alignments are enabled at -O2 and above.  At -O2, the loop is emitted as:

        .p2align 4,,10
        .p2align 3
.L19:
        cmpl    file(,%rbx,4), %ebp
        jg      .L18
        cmpl    0(%r13,%rbx,4), %ebp
        jg      .L18
        cmpl    (%r12), %ebp
        jle     .L22
        .p2align 4,,10
        .p2align 3
.L18:

These directives assemble to:

  400ab6:       66 2e 0f 1f 84 00 00 00 00 00   nopw   %cs:0x0(%rax,%rax,1)
  400ac0:       3b 2c 9d a0 1a 60 00    cmp    0x601aa0(,%rbx,4),%ebp
  400ac7:       7f 17                   jg     400ae0 <find+0x70>
  400ac9:       41 3b 6c 9d 00          cmp    0x0(%r13,%rbx,4),%ebp
  400ace:       7f 10                   jg     400ae0 <find+0x70>
  400ad0:       41 3b 2c 24             cmp    (%r12),%ebp
  400ad4:       7e 32                   jle    400b08 <find+0x98>
  400ad6:       66 2e 0f 1f 84 00 00 00 00 00   nopw   %cs:0x0(%rax,%rax,1)
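
As a rough check of where the two 10-byte nopw instructions come from, the
sketch below models the `.p2align 4,,10` / `.p2align 3` pair.  It assumes
the documented GNU as behaviour: the third operand is the maximum number of
padding bytes, and the directive is skipped when more would be needed.  The
helper names are hypothetical and only the addresses are taken from the
disassembly.

```python
def p2align_padding(addr, power, max_skip=None):
    """Padding bytes that `.p2align power,,max_skip` would insert at addr."""
    boundary = 1 << power
    pad = (-addr) % boundary
    if max_skip is not None and pad > max_skip:
        return 0  # the directive is ignored when too much padding is needed
    return pad


def pad_pair(addr):
    """Apply `.p2align 4,,10` followed by `.p2align 3`, as in the listing."""
    addr += p2align_padding(addr, 4, 10)  # try 16-byte alignment first
    addr += p2align_padding(addr, 3)      # otherwise settle for 8 bytes
    return addr


# Addresses of the two nopw instructions in the disassembly above.
for start in (0x400ab6, 0x400ad6):
    end = pad_pair(start)
    print(f"{start:#x} -> {end:#x} ({end - start} bytes of padding)")
```

Both locations sit 10 bytes below a 16-byte boundary, so the first directive
applies and the second has nothing left to do.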

The branch prediction unit fetches 32 bytes at a time.  Here 3 back-to-back
fused cmp/jcc pairs fall in one 32-byte window (0x400ac0-0x400adf), which
causes mispredictions.  We can add a nop after the first cmp/jcc pair so
that the cmp/jcc pairs are no longer back-to-back.
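
The window arithmetic can be checked with a small script.  This is only a
sketch: the instruction addresses and sizes are copied from the disassembly,
the 32-byte width is the figure quoted in the comment, and the helper names
are hypothetical.  It does not model the predictor itself; it only shows the
layout before and after a one-byte nop is placed after the first cmp/jcc
pair.

```python
from collections import Counter

WINDOW = 32  # fetch width quoted in the comment

# (address, size in bytes, mnemonic), copied from the disassembly
INSNS = [
    (0x400ac0, 7, "cmp"),
    (0x400ac7, 2, "jg"),
    (0x400ac9, 5, "cmp"),
    (0x400ace, 2, "jg"),
    (0x400ad0, 4, "cmp"),
    (0x400ad4, 2, "jle"),
]


def branches_per_window(insns, window=WINDOW):
    """Count conditional jumps by the aligned window holding their first byte."""
    return Counter(addr - addr % window
                   for addr, _size, op in insns if op.startswith("j"))


def insert_nop_after(insns, index, size=1):
    """Return a new layout with a `size`-byte nop placed after insns[index]."""
    addr, length, _op = insns[index]
    nop = (addr + length, size, "nop")
    shifted = [(a + size, n, op) for a, n, op in insns[index + 1:]]
    return insns[:index + 1] + [nop] + shifted


print({hex(base): n for base, n in branches_per_window(INSNS).items()})
for addr, size, op in insert_nop_after(INSNS, 1):
    print(f"{addr:#x}  {op}")
```

All three conditional jumps start in the window based at 0x400ac0.  A
one-byte nop leaves them in that window and only separates the first pair
from the second, which is the change the comment proposes.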