GCC 4.0 and 4.1 generate .p2align before a jump instruction. minloc1_8_r8.o in libgfortran contains code like:

        movl    $1, 12(%ecx)
        .p2align 4,,2
        jmp     .L19
Created attachment 9522: a testcase for gcc 4.0. x.s is generated with "-O2". x86-64 has a similar problem.
Not a bug; it is aligning the loop:

.L5:
        incl    %edx
        cmpl    %edx, %ecx
        je      .L6
        incl    %edx
        cmpl    %edx, %ecx
        .p2align 4,,5
        jne     .L5
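For reference, ".p2align 4,,5" asks the assembler to pad up to the next 16-byte (2^4) boundary, but only when that takes at most 5 bytes; if more padding would be needed, none is emitted. A minimal C sketch of that rule follows. The function name p2align_padding is made up for illustration and is not part of gas or GCC.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: number of padding bytes that `.p2align pow2,,max_skip`
   would emit at address `addr`.  Returns 0 when the location is already
   aligned or when the padding needed exceeds max_skip.  */
static unsigned p2align_padding(uint64_t addr, unsigned pow2, unsigned max_skip)
{
    assert(pow2 < 64);                       /* keep the shift well defined */
    uint64_t align = (uint64_t)1 << pow2;    /* e.g. pow2 = 4 gives 16 bytes */
    /* Distance to the next boundary, 0 if addr is already on one. */
    uint64_t need = (align - (addr & (align - 1))) & (align - 1);
    return need <= max_skip ? (unsigned)need : 0;
}
```

So with ".p2align 4,,2" the jump is moved only when it sits one or two bytes short of a 16-byte boundary; everywhere else the directive emits nothing.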
And next time don't attach a tar file as it is much harder to get at the testcase.
Are you suggesting that

.L5:
        incl    %edx
        cmpl    %edx, %ecx
        je      .L6
        incl    %edx
        cmpl    %edx, %ecx
        jne     .L5

is slower? Where does this information come from?
(note 81 50 85 NOTE_INSN_LOOP_END)

(note 85 81 105 [bb 6] NOTE_INSN_BASIC_BLOCK)

(insn 105 85 91 (unspec_volatile [
            (const_int 4 [0x4])
        ] 68) -1 (nil)
    (nil))
  if (TARGET_FOUR_JUMP_LIMIT && optimize && !optimize_size)
    ix86_avoid_jump_misspredicts ();

  /* Some CPU cores are not able to predict more than 4 branch instructions in
     the 16 byte window.  */
  const int x86_four_jump_limit = m_PPRO | m_ATHLON_K8 | m_PENT4 | m_NOCONA;

So this is not a bug.
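To make the limit in that comment concrete, here is a rough C sketch that counts how many jumps start in each aligned 16-byte window of a small instruction list. This is not how ix86_avoid_jump_misspredicts works internally; struct insn and max_jumps_per_window are hypothetical names, and treating windows as aligned 16-byte blocks keyed on the first byte of each jump is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical instruction record, for illustration only. */
struct insn {
    uint64_t addr;   /* address of the first byte of the instruction */
    int is_jump;     /* nonzero for branch instructions */
};

/* Largest number of jumps that start inside any single aligned 16-byte
   window.  `insns` must be sorted by ascending address.  */
static unsigned max_jumps_per_window(const struct insn *insns, size_t n)
{
    unsigned best = 0, count = 0;
    uint64_t window = 0;

    for (size_t i = 0; i < n; i++) {
        assert(i == 0 || insns[i - 1].addr <= insns[i].addr);
        uint64_t w = insns[i].addr >> 4;     /* index of the 16-byte window */
        if (i == 0 || w != window) {         /* entered a new window */
            window = w;
            count = 0;
        }
        if (insns[i].is_jump && ++count > best)
            best = count;
    }
    return best;
}
```

If the result is above 4 for some window, padding in front of the last jump pushes it into the next window, which is the effect the .p2align before the jump has.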
The alignment is there so that the stupid processor (yes, stupid) will not mispredict the jump.
From gcc-patches (since the archives look broken): Looking at a recent copy of the Intel optimization manual, it has the same hint as the AMD manual about 4 jumps per cache line. I did a SPEC run on the P4 and there is no change except for bzip2, which improves by about 3%. That is quite expected, as the scenario where 5 jumps happen to be in the same window is very rare.