Bug 66511 - [avr] whole-byte shifts not optimized away for uint64_t
Summary: [avr] whole-byte shifts not optimized away for uint64_t
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.9.2
Importance: P5 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2015-06-11 14:49 UTC by Matthijs Kooijman
Modified: 2023-04-16 17:27 UTC
CC: 3 users

See Also:
Host:
Target: avr
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
proposed patch (918 bytes, patch)
2023-04-16 12:51 UTC, Roger Sayle

Description Matthijs Kooijman 2015-06-11 14:49:47 UTC
When doing whole-byte shifts, gcc usually optimizes away the shifts and
ends up moving data between registers instead. However, it seems this
doesn't happen when uint64_t is used.

Here's an example (assembler output slightly trimmed of unrelated
comments and annotations):

matthijs@grubby:~$ cat test.cpp
#include <stdint.h>

uint8_t foo64_8(uint64_t a) {
        return a >> 8;
}

uint16_t foo64_16(uint64_t a) {
        return a >> 8;
}

uint8_t foo32_8(uint32_t a) {
        return a >> 8;
}

uint16_t foo32_16(uint32_t a) {
        return (a >> 8);
}

matthijs@grubby:~$ avr-gcc -Os test.cpp -S -o -
_Z7foo64_8y:
        push r16
        ldi r16,lo8(8)
        rcall __lshrdi3
        mov r24,r18
        pop r16
        ret

_Z8foo64_16y:
        push r16
        ldi r16,lo8(8)
        rcall __lshrdi3
        mov r24,r18
        mov r25,r19
        pop r16
        ret


_Z7foo32_8m:
        mov r24,r23
        ret

_Z8foo32_16m:
        clr r27
        mov r26,r25
        mov r25,r24
        mov r24,r23
        ret

        .ident  "GCC: (GNU) 4.9.2 20141224 (prerelease)"

The output is identical for 4.8.1 on Debian, and the above 4.9.2 on
Arch. I haven't found a readily available 5.x package yet to test.

As you can see, the versions operating on 64 bit values preserve the
8-bit shift (which is very inefficient on AVR), while the versions
running on 32 bit values simply copy the right registers.

The foo32_16 function still contains some useless instructions (r27 and
r26 are set even though they are not part of the return value), but that
is probably an unrelated problem.

I've marked this with component "target", since I think these
optimizations are avr-specific (or at least not applicable to bigger
architectures).
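The whole-byte equivalence the report relies on can be checked in portable C. The sketch below is illustrative only (the helper name is made up, and it assumes a little-endian target such as AVR): a right shift by 8 truncated to uint8_t is just a copy of byte 1.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the report: extract byte n (0 = LSB)
   of a 64-bit value using the target's memory byte order.  On a
   little-endian machine this is exactly the single register move that
   could replace a shift by 8*n. */
static uint8_t byte_of(uint64_t v, unsigned n)
{
    uint8_t bytes[8];
    memcpy(bytes, &v, sizeof v);  /* reinterpret in memory byte order */
    return bytes[n];
}
```

On a little-endian host, `(uint8_t)(a >> 8)` and `byte_of(a, 1)` agree for every `a`, which is why foo64_8 could in principle compile down to a single `mov` plus `ret`, just like foo32_8 does.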
Comment 1 Georg-Johann Lay 2015-06-27 18:07:20 UTC
(In reply to Matthijs Kooijman from comment #0)
> I haven't found a readily available 5.x package yet to test.

It's the same.

> As you can see, the versions operating on 64 bit values preserve the
> 8-bit shift (which is very inefficient on AVR), while the versions
> running on 32 bit values simply copy the right registers.

Lib functions are used because users complained about bloated 64-bit arithmetic.

Notice that inside these 64-bit shift functions, byte shifts are used.

> The foo32_16 function still has some useless instructions (r27 and r26
> are not part of the return value, not sure why these are set) but that
> is probably an unrelated problem.

Yes.

> I've marked this with component "target", since I think these
> optimizations are avr-specific (or at least not applicable to bigger
> architectures).
Comment 2 Matthijs Kooijman 2015-08-02 11:42:46 UTC
So, IIUC, this is quite hard to fix? Either you use lib functions, which prevents the optimizer from just relabeling or copying registers to apply shifting, or you don't and then more complex operations will become very verbose and messy?

Would it make sense (and be possible) to add a special case to not use lib functions for shifts by a constant number of bits that is also a multiple of 8? At first glance, that would make a lot of common cases (where an integer is decomposed into separate bytes or other parts) a lot faster, while still keeping the lib functions for more complex operations?
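The special case suggested here can be stated in portable C: a logical right shift of a 64-bit value by a constant 8*k is the same as copying bytes k..7 down to offset 0 and zero-filling the top (on a little-endian target). A minimal sketch with a hypothetical name, not the actual GCC transformation:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration of the proposed special case: shift right
   by 8*k expressed purely as whole-byte moves, no bit shifting.
   Assumes a little-endian target (AVR included); valid for 0 <= k <= 7. */
static uint64_t shr_bytes(uint64_t v, unsigned k)
{
    uint8_t src[8], dst[8] = {0};
    memcpy(src, &v, sizeof v);
    memcpy(dst, src + k, 8 - k);  /* move bytes k..7 down to offset 0 */
    uint64_t r;
    memcpy(&r, dst, sizeof r);
    return r;
}
```

For k = 1 with a subsequent truncation, this degenerates into the one or two `mov` instructions the 32-bit versions already get.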
Comment 3 Georg-Johann Lay 2015-08-13 20:56:33 UTC
(In reply to Matthijs Kooijman from comment #2)
> So, IIUC, this is quite hard to fix? Either you use lib functions, which
> prevents the optimizer from just relabeling or copying registers to apply
> shifting, or you don't and then more complex operations will become very
> verbose and messy?

avr-gcc uses lib functions, but not as libcalls; they are emitted as transparent calls instead.  You'll find the implementation in avr-dimode.md.  avr-gcc has no proper DImode support; for the reasoning, cf. the comments at the head of that file.
Comment 4 Roger Sayle 2023-04-16 12:51:58 UTC
Created attachment 54871 [details]
proposed patch

Proposed patch, using a peephole2 in avr-dimode.md to inline calls to __lshrdi3 that require only one or two instructions (due to truncation).  For truncations to char, this is smaller and faster; for truncations to unsigned short, it is the same size but faster.  The drawback is that performing this late (in peephole2) doesn't eliminate the stack frame prologue/epilogue.  Thoughts?
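For illustration only (not taken from the attached patch): assuming the usual avr-gcc convention of passing the 64-bit argument in r18..r25 with the least significant byte in r18, such a peephole could reduce foo64_8 to a single move:

```
_Z7foo64_8y:
        mov r24,r19   ; byte 1 of the 64-bit argument is the entire result
        ret
```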
Comment 5 Wilhelm M 2023-04-16 17:27:33 UTC
(In reply to Roger Sayle from comment #4)
> Created attachment 54871 [details]
> proposed patch
> 
> Proposed patch, using a peephole2 in avr-dimode.md to inline calls to
> __lshrdi3 that require only one or two instructions (due to truncation).
> For truncations to char, this is smaller and faster; for truncations to
> unsigned short, it is the same size but faster.  The drawback is that
> performing this late (in peephole2) doesn't eliminate the stack frame
> prologue/epilogue.  Thoughts?

Looks good to me.
Many thanks!