66248 – subreg truncation not hoisted from loop

Bug 66248 - subreg truncation not hoisted from loop

Summary: subreg truncation not hoisted from loop

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization (show other bugs)
Version:	7.1.0

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2015-05-21 16:39 UTC by Jon Beniston
Modified:	2025-01-08 17:56 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:	--*
Build:
Known to work:
Known to fail:	15.0, 4.9.1, 5.1.0, 6.1.0, 7.1.0
Last reconfirmed:	2025-01-07 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jon Beniston 2015-05-21 16:39:06 UTC

In the following code, where 'short' is 16-bits, on 32-bit processors (ARM/MIPS/SPARC targets), the code that is generated to truncate the value of the variable 'ret' to 16-bits (typically a shift left then right), appears in each iteration of the loop.

short func(short *a, int y)
{
 short ret;
 int i;
 for(i = 0; i < y; i++) 
   ret += a[i];
   
 return ret;
}

E.g. on ARM at -O2/3:

.L3:
        ldrh    r2, [r0], #2
        add     r3, r2, r3
        mov     r3, r3, asl #16
        cmp     r0, r1
        mov     r3, r3, lsr #16
        bne     .L3


Should these not be hoisted out of the loop and only executed once before the return?

Comment 1 Steve Ellcey 2015-12-14 23:36:10 UTC

The truncate can not be moved out of the loop because that would affect the overflow behaviour of ret.  I.e. we need to truncate ret on each iteration of the loop because doing the truncation may affect the value of ret used in the
addition during the next iteration of the loop.  This is not the same as
doing one truncation at the end in cases where ret overflows the maximum value
of a short type.

Comment 2 Jon Beniston 2015-12-15 09:57:33 UTC

Hi Steve. I'm not sure I'm follow your explanation. 

As I understand it, signed overflow is undefined behaviour (http://www.airs.com/blog/archives/120), so I'm not sure why we need to worry about changing the overflow behaviour (as the 16 LSBs should be the same). Even if not, -fstrict-overflow should be enabled at -O2, so the compiler should be able to assume that overflow will not occur anyway.

Comment 3 Steve Ellcey 2015-12-15 17:21:17 UTC

My understanding (I don't have a C/C++ standard handy) is that the addition
done by 'ret + a[i]' is done in integer mode (not as short).  This results in
an integer value that may be outside the range of a short, but in the
range of a normal integer.  So this is not really an overflow.   Then the
integer result is assigned to ret, which is short.  I believe that the
truncation of a integer value (with a value outside the range of a short)
to a short is not undefined by the C and C++ standards but has a specific
way that it needs to work (truncate off the higher bits).  This is the
truncation that needs to be done on each loop iteration.

Comment 4 Jon Beniston 2015-12-15 17:31:39 UTC

Well if it is just truncating the higher bits, why can't it be done at the end of the loop?

What do you think will be different if it is done at the end of the loop? Can you think of an example where the value of ret will differ?

The MSBs in an add don't effect the LSBs.

Comment 5 Steve Ellcey 2015-12-15 17:46:23 UTC

If we did not truncate ret on each loop iteration then ret could get large enough to overflow the maximum integer value before we truncate it at the end, leading to undefined results.  But if we truncate ret on each loop iteration then ret will not overflow and the result is defined.

Comment 6 Jon Beniston 2015-12-15 17:53:59 UTC

-fstrict-overflow (which is the default at -O2) tells us that we can assume it will not overflow.

Even if it did, on most targets it makes no difference to the result.

Comment 7 Steve Ellcey 2015-12-15 17:56:44 UTC

I am still unconvinced but I will change it back to unconfirmed and leave it there in case someone else wants to look at it and/or propose a patch.

Comment 8 Andrew Pinski 2015-12-15 18:28:53 UTC

Couldn't it be optimized as:
short func(short *a, int y)
{
 short ret = 0;
 unsigned int tmp = 0;
 int i;
 for(i = 0; i < y; i++) 
   tmp += (unsigned int)(int)a[i];
   
 return (short)tmp;
}

Such that the addition happens in unsigned (so there is only wrapping and is well defined) and only one truncatation happens at the end of the loop.

Comment 9 Andrew Pinski 2025-01-07 04:36:35 UTC

Confirmed.

Some targets have this fixed on the trunk though.

For aarch64, GCC 14 produces:
.L3:
        ldrh    w1, [x2], 2
        add     w0, w0, w1
        sxth    w0, w0
        cmp     x3, x2
        bne     .L3

while the trunk now does:
.L3:
        ldrh    w0, [x2], 2
        add     w3, w3, w0
        cmp     x1, x2
        bne     .L3

But arm still produces:
.L3:
        ldrsh   r2, [r3], #2
        add     r0, r0, r2
        cmp     r1, r3
        sxth    r0, r0
        bne     .L3


But if change the code to be:
```
void func(short *a, int y, int *c)
{
 short ret;
 int i;
 for(i = 0; i < y; i++) 
   ret += a[i];
   
 *c = ret;
}
```

Then aarch64 still has the extend inside the loop.

That is ext-dce is able to remove the extension from the loop if it is not needed at all but it does not move the extension to outside of the loop.

That is ext-dce can remove the extension in this case:
```
void func(short *a, int y, short *c)
{
 short ret;
 int i;
 for(i = 0; i < y; i++) 
   ret += a[i];
   
 *c = ret;
}
```
But not if the store is int.