In the following code, where 'short' is 16 bits, on 32-bit processors (ARM/MIPS/SPARC targets), the code generated to truncate the value of the variable 'ret' to 16 bits (typically a shift left then right) appears in each iteration of the loop.

    short func(short *a, int y)
    {
        short ret;
        int i;
        for(i = 0; i < y; i++)
            ret += a[i];
        return ret;
    }

E.g. on ARM at -O2/3:

    .L3:
        ldrh    r2, [r0], #2
        add     r3, r2, r3
        mov     r3, r3, asl #16
        cmp     r0, r1
        mov     r3, r3, lsr #16
        bne     .L3

Should these not be hoisted out of the loop and only executed once before the return?
The truncation cannot be moved out of the loop because that would affect the overflow behaviour of ret. That is, we need to truncate ret on each iteration of the loop because the truncation may affect the value of ret used in the addition during the next iteration. This is not the same as doing one truncation at the end in cases where ret overflows the maximum value of a short type.
Hi Steve. I'm not sure I follow your explanation. As I understand it, signed overflow is undefined behaviour (http://www.airs.com/blog/archives/120), so I'm not sure why we need to worry about changing the overflow behaviour (as the 16 LSBs should be the same). Even if not, -fstrict-overflow should be enabled at -O2, so the compiler should be able to assume that overflow will not occur anyway.
My understanding (I don't have a C/C++ standard handy) is that the addition done by 'ret + a[i]' is done in integer mode (not as short). This results in an integer value that may be outside the range of a short, but in the range of a normal integer. So this is not really an overflow. Then the integer result is assigned to ret, which is short. I believe that the truncation of an integer value (with a value outside the range of a short) to a short is not undefined by the C and C++ standards but has a specific way that it needs to work (truncate off the higher bits). This is the truncation that needs to be done on each loop iteration.
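A minimal sketch of the promotion described above, assuming a 16-bit 'short' and a 32-bit 'int' (true for the ARM/MIPS/SPARC targets mentioned, but not guaranteed by the standard). The function names are made up for illustration. The sum is formed in 'int', so the addition itself does not overflow. Converting an out-of-range result back to 'short' is implementation-defined rather than undefined, and GCC documents it as reduction modulo 2^16.

```c
/* Assumption: short is 16 bits and int is 32 bits. */

/* Both operands are promoted to int before the addition,
   so the result can lie outside the range of short. */
int sum_in_int(short x, short y)
{
    return x + y;
}

/* Same addition, then narrowed back to short as 'ret += a[i]' does.
   The narrowing is implementation-defined; GCC keeps the low 16 bits. */
short sum_stored_in_short(short x, short y)
{
    return (short)(x + y);
}
```

With GCC on such a target, sum_in_int(32767, 1) gives 32768, while sum_stored_in_short(32767, 1) gives -32768.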
Well, if it is just truncating the higher bits, why can't it be done at the end of the loop? What do you think will be different if it is done at the end of the loop? Can you think of an example where the value of ret will differ? The MSBs in an add don't affect the LSBs.
If we did not truncate ret on each loop iteration then ret could get large enough to overflow the maximum integer value before we truncate it at the end, leading to undefined results. But if we truncate ret on each loop iteration then ret will not overflow and the result is defined.
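To put a number on the scenario above, here is a rough sketch, again assuming a 16-bit 'short' and a 32-bit 'int'. The helper name is hypothetical. It computes how many elements, all equal to SHRT_MAX, are needed before an untruncated 'int' accumulator would go past INT_MAX.

```c
#include <limits.h>

/* Hypothetical helper: smallest number of elements, each equal to SHRT_MAX,
   whose untruncated sum no longer fits in an int. The arithmetic is done in
   long long so the sketch itself cannot overflow. */
long long min_elements_to_exceed_int(void)
{
    return (long long)INT_MAX / SHRT_MAX + 1;
}
```

Under those assumptions the result is 65539, so the case can only arise for arrays of at least that many elements.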
-fstrict-overflow (which is the default at -O2) tells us that we can assume it will not overflow. Even if it did, on most targets it makes no difference to the result.
I am still unconvinced but I will change it back to unconfirmed and leave it there in case someone else wants to look at it and/or propose a patch.
Couldn't it be optimized as:

    short func(short *a, int y)
    {
        short ret = 0;
        unsigned int tmp = 0;
        int i;
        for(i = 0; i < y; i++)
            tmp += (unsigned int)(int)a[i];
        return (short)tmp;
    }

Such that the addition happens in unsigned (so there is only wrapping, which is well defined) and only one truncation happens at the end of the loop.
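A small sketch for comparing the two forms side by side, assuming a 16-bit 'short', a 32-bit 'int', and GCC's modulo-2^16 narrowing. The function names are invented for this example, and 'ret' is initialised to 0 here so the result is well defined. Because 2^16 divides 2^32, wrapping the accumulator modulo 2^32 and truncating once should leave the same low 16 bits as truncating on every iteration.

```c
#include <stddef.h>

/* Truncates to short on every iteration, like the original loop. */
short sum_truncate_each(const short *a, size_t n)
{
    short ret = 0;
    for (size_t i = 0; i < n; i++)
        ret = (short)(ret + a[i]);   /* int addition, then narrowing */
    return ret;
}

/* Accumulates in unsigned int (wrapping is well defined) and truncates once. */
short sum_truncate_once(const short *a, size_t n)
{
    unsigned int tmp = 0;
    for (size_t i = 0; i < n; i++)
        tmp += (unsigned int)(int)a[i];
    return (short)tmp;               /* single narrowing at the end */
}
```

For the input {30000, 30000, 30000, -5, 12345}, both functions return -28732 under the stated assumptions.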
Confirmed. Some targets have this fixed on the trunk though. For aarch64, GCC 14 produces:

    .L3:
        ldrh    w1, [x2], 2
        add     w0, w0, w1
        sxth    w0, w0
        cmp     x3, x2
        bne     .L3

while the trunk now does:

    .L3:
        ldrh    w0, [x2], 2
        add     w3, w3, w0
        cmp     x1, x2
        bne     .L3

But arm still produces:

    .L3:
        ldrsh   r2, [r3], #2
        add     r0, r0, r2
        cmp     r1, r3
        sxth    r0, r0
        bne     .L3

But if I change the code to be:
```
void func(short *a, int y, int *c)
{
    short ret;
    int i;
    for(i = 0; i < y; i++)
        ret += a[i];
    *c = ret;
}
```
Then aarch64 still has the extend inside the loop. That is, ext-dce is able to remove the extension from the loop if it is not needed at all, but it does not move the extension outside of the loop. So ext-dce can remove the extension in this case:
```
void func(short *a, int y, short *c)
{
    short ret;
    int i;
    for(i = 0; i < y; i++)
        ret += a[i];
    *c = ret;
}
```
But not if the store is int.