[Bug middle-end/104151] [10/11/12/13 Regression] x86: excessive code generated for 128-bit byteswap

Wed Sep 7 08:18:55 GMT 2022

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #16 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Barnabás Pőcze from comment #15)
> Sorry, I haven't found a better issue. But I think the example below
> exhibits the same or a very similar issue.
> 
> I would expect the following code
> 
> void f(unsigned char *p, std::uint32_t x, std::uint32_t y)
> {
>     p[0] = x >> 24;
>     p[1] = x >> 16;
>     p[2] = x >>  8;
>     p[3] = x >>  0;
> 
>     p[4] = y >> 24;
>     p[5] = y >> 16;
>     p[6] = y >>  8;
>     p[7] = y >>  0;
> }
> 
> to be compiled to something along the lines of
> 
> f(unsigned char*, unsigned int, unsigned int):
>         bswap   esi
>         bswap   edx
>         mov     DWORD PTR [rdi], esi
>         mov     DWORD PTR [rdi+4], edx
>         ret
> 
> however, I get scores of bitwise operations instead if `-fno-tree-vectorize`
> is not specified.
> 
> https://gcc.godbolt.org/z/z51K6qorv

Yes, here we vectorize the store:

  <bb 2> [local count: 1073741824]:
  _1 = x_15(D) >> 24;
  _2 = (unsigned char) _1;
  _3 = x_15(D) >> 16;
  _4 = (unsigned char) _3;
  _5 = x_15(D) >> 8;
  _6 = (unsigned char) _5;
  _7 = (unsigned char) x_15(D);
  _8 = y_22(D) >> 24;
  _9 = (unsigned char) _8;
  _10 = y_22(D) >> 16;
  _11 = (unsigned char) _10;
  _12 = y_22(D) >> 8;
  _13 = (unsigned char) _12;
  _14 = (unsigned char) y_22(D);
  _35 = {_2, _4, _6, _7, _9, _11, _13, _14};
  vectp.4_36 = p_17(D);
  MEM <vector(8) unsigned char> [(unsigned char *)vectp.4_36] = _35;

but without vectorization, the store-merging pass (which runs after
vectorization) is able to detect two SImode bswaps.
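
For reference, a minimal sketch of the two forms involved.  f_bytes is
the testcase from comment #15; f_bswap is a hand-written equivalent of
the two SImode bswaps plus two 32-bit stores.  The names f_bytes and
f_bswap are made up for illustration, and the equivalence assumes a
little-endian target and the GCC/Clang __builtin_bswap32 builtin.

```cpp
#include <cstdint>
#include <cstring>

// Byte-wise big-endian stores, as written in the testcase.
void f_bytes(unsigned char *p, std::uint32_t x, std::uint32_t y)
{
    p[0] = x >> 24;
    p[1] = x >> 16;
    p[2] = x >>  8;
    p[3] = x >>  0;

    p[4] = y >> 24;
    p[5] = y >> 16;
    p[6] = y >>  8;
    p[7] = y >>  0;
}

// Hand-written equivalent: one byte swap and one 32-bit store per input.
// Assumption: little-endian target, so storing the swapped value puts the
// most significant byte of x (resp. y) first in memory.
void f_bswap(unsigned char *p, std::uint32_t x, std::uint32_t y)
{
    std::uint32_t bx = __builtin_bswap32(x);
    std::uint32_t by = __builtin_bswap32(y);
    std::memcpy(p, &bx, sizeof bx);      // unaligned-safe 32-bit store
    std::memcpy(p + 4, &by, sizeof by);
}
```

Both functions should write the same eight bytes for any x and y on
such a target.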

Basically we fail to consider "generic" vectorization as an option here
and generic vectorization fails to consider using bswap for permutes
of "existing vectors".  Likewise we fail to consider _1, _3, etc.
as element accesses of the existing "vectors" x and y.  That would
work iff the shift + truncates were canonicalized as BIT_FIELD_REF,
but it's certainly possible to work with the existing IL here.
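
To illustrate the "element access" view, here is a rough sketch at the
source level (not what the vectorizer does internally).  It treats x
and y as one 8-byte "vector" and expresses the stored value as a
permute of its elements.  The helper names and the selector array are
hypothetical, and the selector {3,2,1,0,7,6,5,4} assumes a
little-endian target.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

using bytes8 = std::array<unsigned char, 8>;

// View x and y as a single 8-element byte "vector" in memory order.
bytes8 as_bytes(std::uint32_t x, std::uint32_t y)
{
    bytes8 v{};
    std::memcpy(v.data(), &x, sizeof x);
    std::memcpy(v.data() + 4, &y, sizeof y);
    return v;
}

// The stored value as a permute of that vector.  On little-endian,
// element 3 of x is its most significant byte, i.e. x >> 24.
bytes8 permuted(std::uint32_t x, std::uint32_t y)
{
    static constexpr int sel[8] = {3, 2, 1, 0, 7, 6, 5, 4};
    const bytes8 v = as_bytes(x, y);
    bytes8 r{};
    for (int i = 0; i < 8; ++i)
        r[i] = v[sel[i]];
    return r;
}

// The same value built from shift + truncate, as in the GIMPLE above.
bytes8 shifted(std::uint32_t x, std::uint32_t y)
{
    return {{
        static_cast<unsigned char>(x >> 24),
        static_cast<unsigned char>(x >> 16),
        static_cast<unsigned char>(x >> 8),
        static_cast<unsigned char>(x),
        static_cast<unsigned char>(y >> 24),
        static_cast<unsigned char>(y >> 16),
        static_cast<unsigned char>(y >> 8),
        static_cast<unsigned char>(y),
    }};
}
```

The per-half selector {3,2,1,0} is exactly a 32-bit byte swap, which
is why a permute of the existing "vectors" could be done with bswap.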

Note this issue is probably better tracked in a separate bug report.