gcc version 8.0.1 20180207 (experimental) [trunk revision 257435] (Debian 8-20180207-2): Test case: void test(void (*fn)(void), unsigned long i) { while (i--) fn(); } Generates code with -O2 -mcpu=power9: 2c: 18 00 41 f8 std r2,24(r1) 30: 00 00 00 60 nop 34: 00 00 00 60 nop 38: 00 00 00 60 nop 3c: 00 00 42 60 ori r2,r2,0 40: 78 f3 cc 7f mr r12,r30 44: a6 03 c9 7f mtctr r30 48: ff ff ff 3b addi r31,r31,-1 4c: 21 04 80 4e bctrl 50: 18 00 41 e8 ld r2,24(r1) 54: ff ff bf 2f cmpdi cr7,r31,-1 58: e8 ff 9e 40 bne cr7,40 <test+0x40> On power9, I wonder if nops and ori should be avoided? Aligning to quad word boundary is reasonable for fetch, but is there any advantage for dispatch to add the extra padding?
Is this little endian or big-endian? 32bit or 64bit?
Ah sorry, target is powerpc64le
The nops are to align the following code (a jump target) to the correct alignment. GAS used an ori 2,2,0 for the last nop, to make sure a new dispatch group starts after it, for p8. Does your GAS support power9? Does binutils trunk behave this way too? Plain nop is ever so slightly more efficient, so it is prefered. The generated code is std 2,24(1) .p2align 5,,31 .L3: mr 12,30 mtctr 30 addi 31,31,-1 bctrl so there is nothing GCC can do about this. Please follow up to binutils bugzilla if necessary. Thanks!
After some more discussion and testing, it was determined that for POWER9, 16 bytes of padding is sufficient for these small loops and was adjusted. https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=258260 Author: segher Date: Mon Mar 5 19:11:54 2018 UTC (3 hours, 27 minutes ago) Log Message: rs6000: Don't align tiny loops to 32 bytes for POWER9 For POWER4..POWER8 we align loops of 5..8 instructions to 32 bytes (instead of to 16 bytes) because that executes faster. This is no longer the case on POWER9, so we can just as well only align to 16 bytes. * config/rs6000/rs6000.c (rs6000_loop_align): Don't align tiny loops to 32 bytes when compiling for POWER9.