[Bug tree-optimization/100922] New: CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy
peter at cordes dot ca
gcc-bugzilla@gcc.gnu.org
Sat Jun 5 07:58:59 GMT 2021
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922
Bug ID: 100922
Summary: CSE leads to fully redundant (back to back)
zero-extending loads of the same thing in a loop, or a
register copy
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Created attachment 50948
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50948&action=edit
redundant_zero_extend.c
It's rarely a good idea to load the same thing twice; generally better to copy
a register. Or to read the same register twice when a copy isn't needed. So
the following asm should never happen, but it does with current trunk, and
similar with GCC as old as 4.5
movzbl (%rax), %edx
movzbl (%rax), %ecx # no branch target between these instructions
or
ldrb w4, [x2]
ldrb w3, [x2], 1 # post-indexed *x2++
(Happens at -O3. With -O2 we have a redundant register copy, so either way
still a wasted instruction. And there are other differences earlier in the
function with -O2 vs. -O3.)
https://godbolt.org/z/jT7WaWeK8 - minimal test case. x86-64 and AArch64 trunk
show basically identical code structure. x86-64 gcc (Compiler-Explorer-Build)
12.0.0 20210603 and aarch64-unknown-linux-gnu-gcc (GCC) 12.0.0 20210524
void remove_chars_inplace(char *str, const unsigned char keep_lut[256])
{
while(keep_lut[(unsigned char)*str]){ // can be an if() and still repro
str++; // keep_lut[0] is false
}
char *out = str;
unsigned char c; /* must be unsigned char for correctness. */
do {
c = *str++;
unsigned char inc = keep_lut[c]; // unsigned long doesn't help
*out = c;
out += inc; // inc=0 or 1 to let next char overwrite or not
} while(c);
}
x86-64 asm:
remove_chars_inplace:
jmp .L8
.L3: # top of search loop for first char to remove
addq $1, %rdi
.L8: # loop entry point
movzbl (%rdi), %eax
cmpb $0, (%rsi,%rax) # un-laminates and doesn't macro-fuse ...
jne .L3
cmpb $0, (%rdi) # 2nd loop body can be skipped if *str == 0
# should be test %al,%al - this char was
already loaded.
leaq 1(%rdi), %rax # even -march=znver2 fails to move this
earlier or later to allow cmp/je fusion. (Intel won't macro-fuse cmp imm,mem /
jcc)
je .L1
.L5: # TOP OF 2ND LOOP
movzbl (%rax), %edx
movzbl (%rax), %ecx # redundant load of *str
addq $1, %rax
movzbl (%rsi,%rdx), %edx # inc = lut[c]
movb %cl, (%rdi)
addq %rdx, %rdi # out += inc
testb %cl, %cl
jne .L5 # }while(c != 0)
.L1:
ret
IDK if it's interesting or not that the cmpb $0, (%rdi) is also a redundant
load. The first loop left *str, i.e. (%rdi), in EAX. Putting the LEA between
cmp and je (even with -march=znver2) is a separate missed optimization.
(unless that's working around Intel's JCC erratum)
With only -O2 instead of -O3, we get better asm in that part: it takes
advantage of having the char in AL, and jumps into the middle of the next loop
after xor-zeroing the `inc` variable.
Replacing c = *str++; with
c = *str;
str++;
results in a wasted register copy with trunk, instead of a 2nd load (on x86-64
and arm64). Still a missed opt, but less bad. GCC7 and earlier still do an
extra load with either way of writing that.
Removing the first loop, or making its loop condition something like *str &&
keep_lut[*str], removes the problem entirely. The CSE possibility is gone.
(Same even if we use lut[*(unsigned char*)str] - type-pun the pointer to
unsigned char instead of casting the signed char value to unsigned char, on x86
where char is signed, but not on arm64 where char is unsigned.)
---
I didn't find any clear duplicates; the following are barely worth mentioning:
* pr94442 looks like extra spilling, not just redundant loading.
* pr97366 is due to vectors of different types, probably.
* pr64319 needs runtime aliasing detection to avoid, unlike this.
The AArch64 version of this does seem to demo pr71942 (a useless and x4, x2,
255 on an LDRB result) when you get it to copy a register instead of doing a
2nd load.
More information about the Gcc-bugs
mailing list