gcc correctly optimize int divu(uint a, uint b) { return a / (1<<b); } to divu: mov r0, r0, lsr r1 mov pc, lr but it fails to optimize int divu3(uint a, uint b) { return a / ((1U<<b) / 4); } gcc generate (arm-linux-gnueabi-gcc -Os p.c -march=armv4 -mno-thumb-interwork -S) divu3: stmfd sp!, {r3, lr} mov r3, #1 mov r1, r3, asl r1 mov r1, r1, lsr #2 bl {{{__}}}aeabi_uidiv ldmfd sp!, {r3, pc} or (gcc p.c -S -O3 -fomit-frame-pointer -mregparm=3) divu3: pushl %ebx movl %edx, %ecx movl $1, %ebx xorl %edx, %edx sall %cl, %ebx shrl $2, %ebx divl %ebx popl %ebx ret but ((1U<<b) / 4) is 0 or a power of 2. Div by 0 is undefined in C ( C99 6.5.5p5) So why can we generate : mov r3, #1 mov r1, r3, asl r1 mov r1, r1, lsr #2 mov r0, r0, lsr r1 ? Note that gcc correctly optimize int divu5(uint a, uint b) { return a / ((1U<<b) * 4); }
We do not exploit the fact that shifts bigger than the width of the type are undefined (in fact we even try to preserve the fact that some CPUs truncate the shift count when constant folding ...). We also have to make sure the shift count does not get negative, which we can't in this case. Thus (1U<<(b-2)) is not equivalent to (1U<<b) / 4.
> We also have to make sure the shift count does not get negative, which we can't in this case. Thus (1U<<(b-2)) is not equivalent to (1U<<b) / 4. yes, but a / c is undefined for c = 0 if c = (1U<<b) / 4 is not 0, then (1U<<b) / 4 is equivalent to (1U<<(b-2)) Then a / ((1U<<b) / 4) is equivalent to a / (1U<<(b-2)) Then a / ((1U<<b) / 4) is equivalent to (a >> (b-2)) But I agree it is not trivial optimisation
Confirmed.