The following example:

```
#include <arm_neon.h>
#include <string.h>

int16x4_t foo(const int16_t *src, int16x8_t c)
{
    int16x8_t s[8];
    memcpy (&s[0], src, sizeof(s));
    return vget_low_s16(s[2]) + vget_high_s16(s[2]);
}
```

compiled with -march=armv9-a -O3 produces:

```
foo(short const*, __Int16x8_t):
        ldr     q31, [x0, 32]
        sub     sp, sp, #128
        add     sp, sp, 128
        umov    x0, v31.d[0]
        umov    x1, v31.d[1]
        fmov    d30, x0
        fmov    d31, x1
        add     v0.4h, v31.4h, v30.4h
        ret
```

which is a bit silly, but happens because reload has to spill the subreg that takes half the register of s[2] while changing modes.  We cleaned up the stack usage, but not the stack allocation itself.

In GIMPLE we have:

```
  memcpy (&s[0], src_4(D), 128);
  _1 = BIT_FIELD_REF <s[2], 64, 0>;
  _2 = BIT_FIELD_REF <s[2], 64, 64>;
  _6 = _1 + _2;
```

but SRA has rejected it, reporting:

```
Candidate (26131): s
Allowed ADDR_EXPR of s because of memcpy (&s[0], src_4(D), 128);
! Disqualifying s - No scalar replacements to be created.
```

The manually scalarized version:

```
int16x4_t bar(const int16_t *src, int16x8_t c)
{
    int16x8_t s;
    memcpy (&s, src, sizeof(s));
    return vget_low_s16(s) + vget_high_s16(s);
}
```

does however generate the right code.

Does SRA support BIT_FIELD_REFs today?  A comment in analyze_access_subtree seems to indicate it may punt on them.  I also seem to remember vaguely that SRA has a limit on the size of the object it scalarizes?
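For reference, here is my own variation on bar (not in the report, and the expected codegen is an assumption on my part): copying only the element that foo actually uses avoids the large stack object in the same way, since the single-vector memcpy should get folded to a plain load just like in bar.  It reuses the includes from foo above.

```
/* Hypothetical variant, same shape as bar but reading the bytes of s[2]
   (byte offset 32, i.e. 16 int16_t elements into src).  */
int16x4_t baz(const int16_t *src, int16x8_t c)
{
    int16x8_t s2;
    memcpy (&s2, src + 16, sizeof(s2));
    return vget_low_s16(s2) + vget_high_s16(s2);
}
```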
It is a slight regression from GCC 14 though, which produced:

```
foo:
        ldr     q31, [x0, 32]
        sub     sp, sp, #128
        add     sp, sp, 128
        dup     d0, v31.d[1]
        add     v0.4h, v0.4h, v31.4h
        ret
```

But that is only because vget_low_s16/vget_high_s16 didn't expand to using BIT_FIELD_REF before.

The bigger issue is GCC does not have a copy propagation for aggregates pass (see PR 14295 and others).
(In reply to Andrew Pinski from comment #1)
> The bigger issue is GCC does not have a copy propagation for aggregates pass
> (see PR 14295 and others).

Basically we depend on SRA to do most if not all of the `copy propagation for aggregates`, which might not be a good idea.
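To make that concrete, here is an illustrative example of my own (not from PR 14295) of the kind of redundant aggregate copy such a pass would target; today eliminating it mostly relies on SRA scalarizing both temporaries, which stops working once the objects get large or their address escapes:

```
/* Made-up example: a dedicated aggregate copy propagation pass could
   forward *p straight to the final load and drop both temporaries.  */
#include <string.h>

struct S { int a[32]; };

int f (const struct S *p)
{
  struct S tmp, tmp2;
  memcpy (&tmp, p, sizeof tmp);   /* first aggregate copy */
  tmp2 = tmp;                     /* second aggregate copy */
  return tmp2.a[3];
}
```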
Having memcpy in the IL is preventing SRA.  There's probably no type suitable for the single load/store memcpy inlining done by gimple_fold_builtin_memory_op.

We could try to fold all memcpy to aggregate char[] array assignments, at least when a decl is involved on either side, with the idea to eventually elide TREE_ADDRESSABLE.  But we need to make sure this doesn't pessimize RTL expansion or other code dealing with memcpy but not aggregate array copy.

SRA could handle memcpy and friends transparently iff it were to locally compute its own idea of TREE_ADDRESSABLE.
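A hedged source-level sketch of what the first option would buy (the real transform would be a GIMPLE aggregate assignment using a char[] array type; the vec_block wrapper below is made up purely to express an aggregate copy in C, and whether SRA then scalarizes the 128-byte object still depends on the size limits questioned in comment #0):

```
#include <arm_neon.h>

/* Hypothetical wrapper so the whole block can be copied as one aggregate
   instead of through memcpy, leaving s's address untaken.  */
typedef struct { int16x8_t v[8]; } vec_block;

int16x4_t foo_aggr (const vec_block *src, int16x8_t c)
{
  vec_block s = *src;   /* aggregate copy, no &s escapes into a call */
  return vget_low_s16 (s.v[2]) + vget_high_s16 (s.v[2]);
}
```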
(In reply to Andrew Pinski from comment #1)
> It is a slight regression from GCC 14 though, which produced:
> ```
> foo:
>         ldr     q31, [x0, 32]
>         sub     sp, sp, #128
>         add     sp, sp, 128
>         dup     d0, v31.d[1]
>         add     v0.4h, v0.4h, v31.4h
>         ret
> ```
> 
> But that is only because vget_low_s16/vget_high_s16 didn't expand to using
> BIT_FIELD_REF before.

That's a good point: lowering it in RTL as we did before prevented the subreg inlining, so reload didn't have to spill.

I wonder if instead of using BIT_FIELD_REF we should instead use VEC_PERM_EXPR + VIEW_CONVERT.  This would get us the right rotate again and recover the regression.

We'd still need SRA for optimal codegen without the stack allocations, though.

(In reply to Richard Biener from comment #3)
> Having memcpy in the IL is preventing SRA.  There's probably no type
> suitable for the single load/store memcpy inlining done by
> gimple_fold_builtin_memory_op.

Yeah, the original loop doesn't have memcpy, but it's being idiom recognized.

> We could try to fold all memcpy to aggregate char[] array assignments,
> at least when a decl is involved on either side, with the idea to
> eventually elide TREE_ADDRESSABLE.  But we need to make sure this
> doesn't pessimize RTL expansion or other code dealing with memcpy but
> not aggregate array copy.
> 
> SRA could handle memcpy and friends transparently iff it were to locally
> compute its own idea of TREE_ADDRESSABLE.

I suppose the second option is better in general?  Does SRA have the same issue with memset?  Would it be possible to get a rough sketch of what this would entail?
(In reply to Tamar Christina from comment #4)
> I suppose the second option is better in general?  Does SRA have the same
> issue with memset?  Would it be possible to get a rough sketch of what this
> would entail?

Yes, same with memset.  memset can be replaced with aggregate init from {}.

SRA really wants to see whether all accesses to a decl are "direct" and its address does not escape.  TREE_ADDRESSABLE is a conservative approximation here and works for early disqualifying of candidates.

SRA already walks all stmts; it could instead update the disqualified bitmap using gimple_ior_addresses_taken, like execute_update_addresses_taken does, and explicitly handle memset and memcpy - it then needs to explicitly handle those calls during stmt rewriting and access analysis as well, of course.
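To make the memset point concrete, an illustration of my own (not from the PR): the memset analogue of foo, and what the suggested folding to an aggregate init from {} would correspond to at the source level (the empty-brace initializer relies on the GNU C extension).

```
#include <arm_neon.h>
#include <string.h>

/* Same shape as foo, but zero-initialized; per comment #5 this hits the
   same SRA limitation as the memcpy.  */
int16x4_t zero_case (void)
{
  int16x8_t s[8];
  memset (&s[0], 0, sizeof (s));
  return vget_low_s16 (s[2]) + vget_high_s16 (s[2]);
}

/* What the proposed folding would make it behave like: an aggregate
   init from {}, with no address of s escaping into a call.  */
int16x4_t zero_case_folded (void)
{
  int16x8_t s[8] = {};
  return vget_low_s16 (s[2]) + vget_high_s16 (s[2]);
}
```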