Bug 117395 - missed SRA opportunity with extracting subpart of type
Summary: missed SRA opportunity with extracting subpart of type
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 15.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2024-11-01 11:54 UTC by Tamar Christina
Modified: 2025-02-22 05:26 UTC
CC: 4 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2025-02-21 00:00:00


Description Tamar Christina 2024-11-01 11:54:57 UTC
The following example:

#include <arm_neon.h>
#include <string.h>

int16x4_t foo(const int16_t *src, int16x8_t c) {
    int16x8_t s[8];
    memcpy (&s[0], src, sizeof(s));

    return vget_low_s16(s[2]) + vget_high_s16(s[2]);
}

compiled with -march=armv9-a -O3

produces:

foo(short const*, __Int16x8_t):
        ldr     q31, [x0, 32]
        sub     sp, sp, #128
        add     sp, sp, 128
        umov    x0, v31.d[0]
        umov    x1, v31.d[1]
        fmov    d30, x0
        fmov    d31, x1
        add     v0.4h, v31.4h, v30.4h
        ret

which is a bit silly, but happens because reload has to spill the subreg that takes half of the register holding s[2] while changing modes.

We cleaned up the stack usage, but not the stack allocation itself.

In GIMPLE we have:

  memcpy (&s[0], src_4(D), 128);
  _1 = BIT_FIELD_REF <s[2], 64, 0>;
  _2 = BIT_FIELD_REF <s[2], 64, 64>;
  _6 = _1 + _2;

but SRA rejects it, reporting:

Candidate (26131): s
Allowed ADDR_EXPR of s because of memcpy (&s[0], src_4(D), 128);

! Disqualifying s - No scalar replacements to be created.
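
(That dump is SRA's -details output; assuming the pass names haven't
changed, -fdump-tree-esra-details / -fdump-tree-sra-details should
reproduce it for the early and late SRA passes respectively.)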

The manually scalarized version:

int16x4_t bar(const int16_t *src, int16x8_t c) {
    int16x8_t s;
    memcpy (&s, src, sizeof(s));

    return vget_low_s16(s) + vget_high_s16(s);
}

does however generate the right code.

Does SRA support BIT_FIELD_REFs today? A comment in analyze_access_subtree seems to indicate it may punt on them.

I also vaguely remember that SRA has a limit on the size of the objects it scalarizes?
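
If it is a size limit that's kicking in, a quick way to check would be
to bump it; I believe the relevant knob is the total-scalarization
param (name from memory, may differ between versions):

  gcc -O3 -march=armv9-a --param sra-max-scalarization-size-Ospeed=1024 test.c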
Comment 1 Andrew Pinski 2024-11-01 21:35:58 UTC
It is a slight regression from GCC 14 though.

Which produced:
```
foo:
        ldr     q31, [x0, 32]
        sub     sp, sp, #128
        add     sp, sp, 128
        dup     d0, v31.d[1]
        add     v0.4h, v0.4h, v31.4h
        ret
```

But that is only because vget_low_s16/vget_high_s16 didn't expand to using BIT_FIELD_REF before.

The bigger issue is GCC does not have a copy propagation for aggregates pass (see PR 14295 and others).
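
For illustration, the kind of thing such a pass would catch (reduced
made-up example, not from this PR):
```
struct S { int a[32]; };

int f (struct S *p)
{
  struct S tmp = *p;    /* aggregate copy #1 */
  struct S tmp2 = tmp;  /* aggregate copy #2 */
  /* With aggregate copy-prop this would fold to p->a[2] and both
     temporaries (and their stack slots) would go away.  */
  return tmp2.a[2];
}
```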
Comment 2 Andrew Pinski 2024-11-01 21:37:18 UTC
(In reply to Andrew Pinski from comment #1)
> 
> The bigger issue is GCC does not have a copy propagation for aggregates pass
> (see PR 14295 and others).

Basically we depend on SRA to do most, if not all, of the `copy propagation for aggregates`, which might not be a good idea.
Comment 3 Richard Biener 2024-11-03 12:11:41 UTC
Having memcpy in the IL is preventing SRA.  There's probably no type
suitable for the single load/store memcpy inlining done by
gimple_fold_builtin_memory_op.
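
(For contrast, a small fixed-size memcpy does get folded when a
suitable type exists.  From memory, the IL for a 16-byte copy ends up
roughly as

  _1 = MEM <__int128 unsigned> [(char * {ref-all})src_4(D)];
  MEM <__int128 unsigned> [(char * {ref-all})&s] = _1;

while there is no scalar or vector mode covering 128 bytes here.)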

We could try to fold all memcpy to aggregate char[] array assignments,
at least when a decl is involved on either side with the idea to
eventually elide TREE_ADDRESSABLE.  But we need to make sure this
doesn't pessimize RTL expansion or other code dealing with memcpy but
not aggregate array copy.
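
Sketch of that fold (alias/ref-all details omitted):

  memcpy (&s[0], src_4(D), 128);

would become something like

  MEM <unsigned char[128]> [(char *)&s] = MEM <unsigned char[128]> [(char *)src_4(D)];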

SRA could handle memcpy and friends transparently iff it were to locally
compute its own idea of TREE_ADDRESSABLE.
Comment 4 Tamar Christina 2024-11-03 21:07:45 UTC
(In reply to Andrew Pinski from comment #1)
> It is a slight regression from GCC 14 though.
> 
> Which produced:
> ```
> foo:
>         ldr     q31, [x0, 32]
>         sub     sp, sp, #128
>         add     sp, sp, 128
>         dup     d0, v31.d[1]
>         add     v0.4h, v0.4h, v31.4h
>         ret
> ```
> 
> But that is only because vget_low_s16/vget_high_s16 didn't expand to using
> BIT_FIELD_REF before.
> 

That's a good point; lowering it in RTL as we did before prevented the subreg inlining, so reload didn't have to spill.

I wonder if, instead of using BIT_FIELD_REF, we should use VEC_PERM_EXPR + VIEW_CONVERT.  This would get us the right rotate again and recover the regression.

We'd still need SRA for optimal codegen though without the stack allocations.


(In reply to Richard Biener from comment #3)
> Having memcpy in the IL is preventing SRA.  There's probably no type
> suitable for the single load/store memcpy inlining done by
> gimple_fold_builtin_memory_op.
> 

Yeah, the original loop doesn't have a memcpy, but it's being idiom-recognized into one.

> We could try to fold all memcpy to aggregate char[] array assignments,
> at least when a decl is involved on either side with the idea to
> eventually elide TREE_ADDRESSABLE.  But we need to make sure this
> doesn't pessimize RTL expansion or other code dealing with memcpy but
> not aggregate array copy.
> 
> SRA could handle memcpy and friends transparently iff it were to locally
> compute its own idea of TREE_ADDRESSABLE.

I suppose the second option is better in general? Does SRA have the same issue with memset? Would it be possible to get a rough sketch of what this would entail?
Comment 5 Richard Biener 2024-11-04 09:10:32 UTC
(In reply to Tamar Christina from comment #4)
> [...]
> I suppose the second option is better in general? Does SRA have the same
> issue with memset? Would it be possible to get a rough sketch of what this
> would entail?

Yes, same with memset.  memset can be replaced with aggregate init from {}.
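
I.e., assuming the memset covers the whole object, something like

  memset (&s, 0, 128);  -->  s = {};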

SRA really wants to see whether all accesses to a decl are "direct" and its
address does not escape.  TREE_ADDRESSABLE is a conservative approximation
here and works for early disqualifying of candidates.

SRA already walks all stmts; it could instead update the disqualified
bitmap using gimple_ior_addresses_taken, like execute_update_addresses_taken
does, and explicitly handle memset and memcpy - it then needs to
explicitly handle those calls during stmt rewriting and access analysis
as well, of course.
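
A very rough sketch of that stmt walk, reusing the existing helpers
(not actual patch code; in particular a real patch has to look at the
other operands of such calls too):

  bitmap addresses_taken = BITMAP_ALLOC (NULL);
  basic_block bb;
  FOR_EACH_BB_FN (bb, cfun)
    for (gimple_stmt_iterator gsi = gsi_start_bb (bb);
         !gsi_end_p (gsi); gsi_next (&gsi))
      {
        gimple *stmt = gsi_stmt (gsi);
        /* A memcpy/memset to or from a candidate is recorded as an
           aggregate access by access analysis instead of making the
           decl address-taken.  */
        if (gimple_call_builtin_p (stmt, BUILT_IN_MEMCPY)
            || gimple_call_builtin_p (stmt, BUILT_IN_MEMSET))
          continue;
        gimple_ior_addresses_taken (addresses_taken, stmt);
      }
  /* ... and use addresses_taken instead of TREE_ADDRESSABLE when
     disqualifying candidates.  */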