This is GCC Bugzilla
> /usr/local/gcc43/bin/g++ -v
Using built-in specs.
Target: i386-apple-darwin8.10.1
Configured with: ../gcc/configure --prefix=/usr/local/gcc43 --disable-multilib --with-arch=pentium-m --with-tune=nocona --enable-target-optspace --disable-bootstrap --with-gmp=/sw --with-system-zlib --enable-languages=c,c++,objc,obj-c++
Thread model: posix
gcc version 4.3.0 20070702 (experimental)

> /usr/local/gcc43/bin/gcc -Os -fno-pic -fomit-frame-pointer -S sub.c

"i= 7 - ff_h264_norm_shift[x>>(CABAC_BITS-1)];" generates:

        movl    $7, %ecx
        subl    %eax, %ecx
        sall    %cl, %edx

It would be better if it generated:

        negl    %eax
        addl    $7, %eax
        sall    %al, %edx

which would leave a register free (which helps if this function is inlined); this is safe since eax isn't used again later.

You can do this by transforming 'y = constant - x' into 'y = -x + constant'. It looks like gcc actually does the reverse; if I change the source to match the best output, it generates the same thing.
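The source-level rewrite described above can be sketched in C as follows. This only illustrates the algebraic identity and is not taken from the attached testcase: the names `shift_sub`, `shift_neg_add` and `check_shift_forms` are hypothetical, and the count `n` is assumed to lie in 0..7 so that the shift is well defined.

```c
#include <assert.h>

/* Hypothetical helpers, not from the testcase; n is assumed to be in 0..7. */

/* Form as written in the source: y = constant - x. */
static unsigned shift_sub(unsigned v, int n)
{
    int i = 7 - n;
    return v << i;
}

/* Form suggested in the report: y = -x + constant. */
static unsigned shift_neg_add(unsigned v, int n)
{
    int i = -n;
    i += 7;
    return v << i;
}

/* Asserts that both forms agree for every count in the assumed range. */
static void check_shift_forms(void)
{
    for (int n = 0; n <= 7; n++)
        assert(shift_sub(0x1234u, n) == shift_neg_add(0x1234u, n));
}
```

Both functions compute the same value, so any difference in the generated code comes from how the compiler canonicalizes the subtraction, which matches the observation that changing the source has no effect.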
Created an attachment (id=13827): testcase
This is a target issue (or a semi-generic one for 2-operand targets).
Also, 'x >> 32 - y' can be transformed into 'x >> -y', since x86 only uses the lowest 5 bits of the shift count. I'm not sure about other targets.

Messing with the backend doesn't seem very popular these days. I guess I should figure out how those parts work.
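A small C model of why the two shift counts are interchangeable on such a target, offered as a sketch. The helper `shr_masked` is a hypothetical stand-in for a 32-bit hardware shift that keeps only the low 5 bits of the count; in plain C a negative or out-of-range shift count is undefined, which is why the mask is written out explicitly here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 32-bit shift that uses only the low 5 count bits. */
static uint32_t shr_masked(uint32_t x, uint32_t count)
{
    return x >> (count & 31u);
}

/* x >> (32 - y), with the count reduced modulo 32 by the model above. */
static uint32_t shr_32_minus_y(uint32_t x, uint32_t y)
{
    return shr_masked(x, 32u - y);
}

/* x >> -y, with the count reduced modulo 32 by the model above. */
static uint32_t shr_neg_y(uint32_t x, uint32_t y)
{
    return shr_masked(x, 0u - y);
}

/* Asserts that both counts select the same shift for a range of y. */
static void check_shift_counts(void)
{
    for (uint32_t y = 0; y < 64; y++)
        assert(shr_32_minus_y(0x80000000u, y) == shr_neg_y(0x80000000u, y));
}
```

Since 32 is a multiple of 32, `32 - y` and `-y` have the same low 5 bits. A target that does not mask the count this way would need the original form.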
Causes silly code on i386 with this:

void pred8x8l_vertical_add_c(unsigned char *pix, const short *block, int stride){
    int i;
    for(i=0; i<8; i++){
        int j;
        for (j=0; j<8; j++){
            pix[j] = pix[j-stride] + block[j];
        }
        pix+= stride;
        block+= 8;
    }
}

where it calculates and then spills each of [0-7] - stride to the stack, instead of just being able to keep -stride in a register and incrementing it.
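The code shape being asked for can be written by hand as a sketch. `pred_ref` repeats the loop from the comment above; `pred_neg_stride` and `check_pred_variants` are hypothetical names, and the 16-byte stride and fill pattern in the check are arbitrary assumptions chosen only to exercise both versions.

```c
#include <assert.h>
#include <string.h>

/* The loop as posted: the index j - stride is recomputed for every element. */
static void pred_ref(unsigned char *pix, const short *block, int stride)
{
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++)
            pix[j] = pix[j - stride] + block[j];
        pix += stride;
        block += 8;
    }
}

/* Hypothetical hand-written variant: start from -stride and increment it. */
static void pred_neg_stride(unsigned char *pix, const short *block, int stride)
{
    for (int i = 0; i < 8; i++) {
        int k = -stride;                 /* index of the pixel one row up */
        for (int j = 0; j < 8; j++) {
            pix[j] = pix[k] + block[j];
            k++;                         /* k stays equal to j - stride */
        }
        pix += stride;
        block += 8;
    }
}

/* Runs both variants on identical buffers; returns 1 if the results match. */
static int check_pred_variants(void)
{
    enum { STRIDE = 16, ROWS = 9 };      /* assumed sizes for the check only */
    unsigned char a[ROWS * STRIDE], b[ROWS * STRIDE];
    short block[64];

    for (int n = 0; n < ROWS * STRIDE; n++)
        a[n] = b[n] = (unsigned char)(n * 7);
    for (int n = 0; n < 64; n++)
        block[n] = (short)(n - 32);

    pred_ref(a + STRIDE, block, STRIDE);
    pred_neg_stride(b + STRIDE, block, STRIDE);

    int same = memcmp(a, b, sizeof a) == 0;
    assert(same);
    return same;
}
```

Whether the compiler produces better code for the second form is a separate question; the sketch only shows that one negation and an increment compute the same indices.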
Confirmed, the code for -O2 -funroll-loops includes things such as

        movzwl  2(%eax), %esi
        movl    $1, -44(%ebp)
        subl    %ecx, -44(%ebp)
        movl    -44(%ebp), %edi
        movzbl  (%edx,%edi), %ebx
        addl    %ebx, %esi
        movl    %esi, %ebx
        movb    %bl, 1(%edx)
...
        movl    -44(%ebp), %ebx
        movzwl  2(%esi), %edi
        movzbl  (%edx,%ebx), %ebx
        addl    %ebx, %edi
        movl    %edi, %ebx
        movb    %bl, 1(%edx)

The unrolling is done on the tree level, so it's CSE who'd need to know this.
On the tree level we see (after FRE):

  D.1996_50 = pix_1 + 1;
  D.1997_51 = 1 - stride_11(D);
  D.1998_52 = (unsigned int) D.1997_51;
  D.1999_53 = pix_1 + D.1998_52;
  D.2000_54 = *D.1999_53;
  D.2003_57 = block_2 + 2;
  D.2004_58 = *D.2003_57;
  D.2005_59 = (unsigned char) D.2004_58;
  D.2006_60 = D.2000_54 + D.2005_59;
  *D.1996_50 = D.2006_60;

etc. which again also shows the weakness of POINTER_PLUS_EXPR and the conversions it causes for the offset operand. IVOPTs cannot cope with it and PRE/LIM make a mess out of the code as well.
Oh, there's no loop. Then the issue is that strength reduction on scalar code is not implemented. In theory, strength reduction could be integrated into our global value-numbering / PRE code, but nobody has done that.
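What strength reduction on straight-line code would mean here can be sketched in C. This shows the intended result at the source level, not a GCC pass: the names `offsets_naive`, `offsets_reduced` and `check_offsets` are hypothetical, and four offsets stand in for the eight in the unrolled loop.

```c
#include <assert.h>

/* Straight-line form, as after unrolling: each offset is k - stride. */
static void offsets_naive(int stride, int off[4])
{
    off[0] = 0 - stride;
    off[1] = 1 - stride;
    off[2] = 2 - stride;
    off[3] = 3 - stride;
}

/* Strength-reduced form: one negation, then each offset is previous + 1. */
static void offsets_reduced(int stride, int off[4])
{
    int t = -stride;
    off[0] = t;
    t += 1;
    off[1] = t;
    t += 1;
    off[2] = t;
    t += 1;
    off[3] = t;
}

/* Returns 1 if both forms produce the same four offsets for this stride. */
static int check_offsets(int stride)
{
    int a[4], b[4];
    offsets_naive(stride, a);
    offsets_reduced(stride, b);
    for (int k = 0; k < 4; k++)
        assert(a[k] == b[k]);
    return 1;
}
```

In the reduced form only one value derived from `stride` is live at a time, which is the property that would avoid the spills seen in the i386 output.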