[Bug tree-optimization/106106] SRA scalarizes structure copies

Tue Jun 28 07:51:18 GMT 2022

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106106

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
SRA is eliding 'v' by doing what it does, so it essentially changes

  D.22939 = __builtin_aarch64_ld2v2sf (p1_2(D));
  v = D.22939;
  __b = v;
  D.22937 = __builtin_aarch64_ld2_lanev2sf (p2_3(D), __b, 1); [tail call]

to

  D.22939 = __builtin_aarch64_ld2v2sf (p1_2(D));
  __b = D.22939;
  D.22937 = __builtin_aarch64_ld2_lanev2sf (p2_3(D), __b, 1); [tail call]

but due to how it works overall it cannot do this without exposing the
scalar pieces and "re-materializing" __b.
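
For anyone wanting to look at the copy chain without the NEON headers, the
shape of the code can be sketched in plain C.  Everything below is a
hypothetical stand-in and not the real builtins: pair2f plays the role of
float32x2x2_t, load2 the role of the ld2 builtin and load2_lane the role of
the ld2_lane builtin.  It only mimics the sequence of aggregate copies.

```c
#include <assert.h>

/* Hypothetical stand-in for float32x2x2_t: two 2-lane float vectors. */
typedef struct { float val[2][2]; } pair2f;

/* Hypothetical stand-in for the ld2 builtin: de-interleave four floats. */
static pair2f load2(const float *p)
{
    pair2f r;
    for (int i = 0; i < 2; i++) {
        r.val[0][i] = p[2 * i];
        r.val[1][i] = p[2 * i + 1];
    }
    return r;
}

/* Hypothetical stand-in for the ld2_lane builtin: overwrite one lane of
   each vector in b with two consecutive floats from p. */
static pair2f load2_lane(const float *p, pair2f b, int lane)
{
    assert(lane >= 0 && lane < 2);
    b.val[0][lane] = p[0];
    b.val[1][lane] = p[1];
    return b;
}

/* Same shape as the GIMPLE above: load, copy into v, copy into b,
   then the lane load. */
pair2f copy_chain(const float *p1, const float *p2)
{
    pair2f v = load2(p1);
    pair2f b = v;               /* the aggregate copy SRA tries to remove */
    return load2_lane(p2, b, 1);
}
```

Compiling a file like this with and without -fno-tree-sra and adding
-fdump-tree-sra is one way to see what the pass does to the copies, though
the result for this stand-in need not match the NEON testcase.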

__extension__ extern __inline float32x2x2_t
__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
vld2_lane_f32 (const float32_t * __a, float32x2x2_t __b, const int __c)
{
  union { float32x2x2_t __i; __builtin_neon_ti __o; } __bu = { __b };
  union { float32x2x2_t __i; __builtin_neon_ti __o; } __rv;
  __rv.__o = __builtin_neon_vld2_lanev2sf ((const __builtin_neon_sf *) __a,
                                           __bu.__o, __c);
  return __rv.__i;
}

It looks like giving __builtin_neon_vld2_lanev2sf a float32x2x2_t
argument and return type might avoid one copy.
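
As a rough illustration of that idea, here is a plain C sketch comparing a
wrapper that goes through unions, in the style of the one quoted above, with
one that passes the structure type straight through.  All names are
hypothetical: pair2f stands in for float32x2x2_t, opaque16 for
__builtin_neon_ti, and the two lane_load_* functions for an opaque-typed and
a struct-typed builtin.  It shows where the extra aggregate copies come
from, not what the real builtin does.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for float32x2x2_t and __builtin_neon_ti. */
typedef struct { float val[2][2]; } pair2f;
typedef struct { unsigned char bytes[16]; } opaque16;

_Static_assert(sizeof(pair2f) == sizeof(opaque16), "stand-ins must match in size");

/* Hypothetical builtin that only understands the opaque type. */
static opaque16 lane_load_opaque(const float *p, opaque16 b, int lane)
{
    pair2f t;
    assert(lane >= 0 && lane < 2);
    memcpy(&t, &b, sizeof t);
    t.val[0][lane] = p[0];
    t.val[1][lane] = p[1];
    memcpy(&b, &t, sizeof b);
    return b;
}

/* Hypothetical builtin that takes and returns the structure type. */
static pair2f lane_load_typed(const float *p, pair2f b, int lane)
{
    assert(lane >= 0 && lane < 2);
    b.val[0][lane] = p[0];
    b.val[1][lane] = p[1];
    return b;
}

/* Wrapper in the style quoted above: the argument and the result each
   pass through a union, which adds aggregate copies. */
pair2f wrap_via_union(const float *a, pair2f b, int c)
{
    union { pair2f i; opaque16 o; } bu = { b };
    union { pair2f i; opaque16 o; } rv;
    rv.o = lane_load_opaque(a, bu.o, c);
    return rv.i;
}

/* Wrapper with a struct-typed builtin: no union temporaries needed. */
pair2f wrap_direct(const float *a, pair2f b, int c)
{
    return lane_load_typed(a, b, c);
}
```

Both wrappers compute the same value; the difference is only in how many
temporaries of aggregate type the optimizers have to see through.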

In any case, improving register allocation, or massaging the RTL before
register allocation, is the way to go here.  How does the RTL IL fed to RA
differ with and without SRA?