GCC Bugzilla – Bug 18560
better optimalization of EOR/MOV block.
Last modified: 2007-11-09 20:40:45 UTC
/* This function reverses the bytes in a word. The method was discovered in 1986 following a competition between ARM programmers; it requires just 4 instructions and 1 work register. A method using only 3 instructions per word reversed was also found, but it has some set-up overhead and uses a 2nd register. */ unsigned long reverse(unsigned long v) { unsigned long t; t = v ^ ((v << 16) | (v >> 16)); // EOR r1,r0,r0,ROR #16 [1] t &= ~0xff0000; // BIC r1,r1,#&ff0000 v = (v << 24) | (v >> 8); // MOV r0,r0,ROR #8 return v ^ (t >> 8); // EOR r0,r0,r1,LSR #8 } The gcc-3.4.3 with -O2 produces: reverse: mov r3, r0 [2] mov r0, r0, ror #16 [2] eor r0, r0, r3 [2] bic r0, r0, #16711680 mov r3, r3, ror #8 eor r0, r3, r0, lsr #8 mov pc, lr Itc doesn't seem to produce [2] optimal code [1].
Confirmed, the mainline produces better code already: mov r3, r0 eor r0, r0, r0, ror #16 bic r0, r0, #16711680 mov r0, r0, lsr #8 eor r0, r0, r3, ror #8 @ lr needed for prologue bx lr
As Andrew pointed out, the merge of the eor and the rotate is now done on mainline in 4.0. The initial redundant MOV is a register allocation artifact. This particular testcase compiles optimally with the new register allocator: reverse: eor r3, r0, r0, ror #16 bic r3, r3, #16711680 mov r3, r3, lsr #8 eor r0, r3, r0, ror #8 bx lr but for other reasons, I wouldn't recommend you use that option (too many other bugs).
(In reply to comment #2) > As Andrew pointed out, the merge of the eor and the rotate is now done on > mainline in 4.0. Hmm, it doesn't work on my gcc. # arm-pld-linux-gcc reversing_the_bytes_in_word.c -s -S -O2 .file "reversing_the_bytes_in_word.c" .text .align 2 .global reverse .type reverse, %function reverse: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. mov r3, r0 eor r0, r0, r0, ror #16 bic r0, r0, #16711680 mov r0, r0, lsr #8 eor r0, r0, r3, ror #8 @ lr needed for prologue mov pc, lr .size reverse, .-reverse .ident "GCC: (GNU) 4.0.0 20050130 (experimental)" > The initial redundant MOV is a register allocation artifact. > This particular testcase compiles optimally with the new register allocator: Is there a special option I need to set?
The new register allocator (new-ra) has been removed because it was buggy and there were no plans to fix it. I was using it to show that the initial MOV was an unrelated issue. Your code snippet shows that the rotate and the EOR are being merged, which is what this PR was about.
(In reply to comment #4) > The new register allocator (new-ra) has been removed because it was buggy and > there were no plans to fix it. > I was using it to show that the initial MOV was an unrelated issue. > Your code snippet shows that the rotate and the EOR are being merged, > which is what this PR was about. The PR is about initial eor/mov opt., not eor/rotate ;-) At this moment the comercial IAR compiler generates good code. The GCC should be good too :) reverse: 0x00000D04 E0201860 EOR R1, R0, R0, ROR #16 0x00000D08 E3C118FF BIC R1, R1, #0xFF0000 0x00000D0C E1A01421 MOV R1, R1, LSR #8 0x00000D10 E0210460 EOR R0, R1, R0, ROR #8 0x00000D14 E12FFF1E BX LR
I've linked this to the register-allocator meta-bug
This has been fixed for more than a year: reverse: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. eor r3, r0, r0, ror #16 @ 12 *arith_shiftsi [length = 4] bic r3, r3, #16711680 @ 13 *arm_andsi3_insn/2 [length = 4] mov r0, r0, ror #8 @ 15 *arm_shiftsi3 [length = 4] eor r0, r0, r3, lsr #8 @ 23 *arith_shiftsi [length = 4] @ lr needed for prologue @ 36 prologue_use [length = 4] bx lr @ 39 return [length = 12] .size reverse, .-reverse .ident "GCC: (GNU) 4.2.0 20060729 (experimental)" 4.3.0 (revision 129967) generates the same code.