I'm looking at the code: char * foo (char *s1, char *s2, unsigned int n) { char c1, c2; c2 = *(char *)s2++; do { do { c1 = *s1++; if (c1 == 0) return 0; } while (c1 != c2); } while (__builtin_strncmp (s1, (char *)s2, n) != 0); return --s1; } On aarch64 at -O2 this generates some bad code for the inner loop: ... .L9: cmp w21, w1 beq .L8 .L4: mov x20, x19 // (1) mov x19, x20 // (2) ldrb w1, [x19], 1 (3) cbnz w1, .L9 ... Note the moves between x20 and x19, (1) and (2) Looking at the RTL dumps the two moves were in separate basic blocks but were moved together at some point. Instructions (2) and (3) were transformed by autoincdec from: add x19, x20, #1 ldrb w1, [x19, #-1] but nothing eliminated (2) because x19 is later used as an argument to strncmp and in another basic block. Doesn't the peephole2 pass try to eliminate such back-to-back copies?
I think ivopts also contributes here. Before ivopts the memory access and address are: s1_10 = s1_3 + 1; c1_11 = *s1_3; but ivopts transforms them to: s1_10 = s1_3 + 1; c1_11 = MEM[base: s1_10, offset: -1B]; So the memory address is now dependent on s1_10 rather than s1_3 and introduces an offset where there was none before
(In reply to ktkachov from comment #1) > I think ivopts also contributes here. > Before ivopts the memory access and address are: > s1_10 = s1_3 + 1; > c1_11 = *s1_3; > > but ivopts transforms them to: > s1_10 = s1_3 + 1; > c1_11 = MEM[base: s1_10, offset: -1B]; > > So the memory address is now dependent on s1_10 rather than s1_3 and > introduces an offset where there was none before Thanks for reporting, I will have a look. Thanks.
On the 9.3+ we get: .L4: ldrb w0, [x19], 1 cbz w0, .L6 .L3: sub x20, x19, #1 cmp w21, w0 bne .L4 mov x2, x23 mov x1, x22 mov x0, x19 bl strncmp cbnz w0, .L4 GCC 7 and 8 got us: .L9: beq .L8 .L4: mov x20, x19 mov x19, x20 ldrb w1, [x19], 1 cmp w21, w1 cbnz w1, .L9 ... .L8: .cfi_restore_state mov x2, x22 mov x1, x23 mov x0, x19 bl strncmp cbnz w0, .L4 Which had the moves. GCC 9+ ch does: Analyzing loop 1 Loop 1 is not do-while loop: latch has multiple predecessors. Will duplicate bb 7 Not duplicating bb 1: both sucessors are in loop. Duplicating header of the loop 1 up to edge 7->4, 4 insns. Loop 1 is do-while loop Loop 1 is now do-while loop. While 8- ch has: Analyzing loop 1 Loop 1 is do-while loop Which means this was fixed by r9-23.