Bug 96135 - [9/10/11/12 regression] bswap not detected by bswap pass, unexpected results between optimization levels
Summary: [9/10/11/12 regression] bswap not detected by bswap pass, unexpected results ...
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 11.0
: P2 normal
Target Milestone: 9.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2020-07-09 14:01 UTC by Tobias Schlüter
Modified: 2021-09-08 10:58 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-07-10 00:00:00


Attachments
WIP (682 bytes, patch)
2021-03-11 09:07 UTC, Richard Biener
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Tobias Schlüter 2020-07-09 14:01:07 UTC
This is an odd one, and it seems different from the other bswap bugs that I could find in bugzilla.  This is on x64.

Compiler Explorer link is here: https://godbolt.org/z/arTf5T

Full source code:
==============================================================
constexpr long long bswap64(long long in) // unsigned long long behaves the same
{
    union {
        long long v;
        char c[8];
    } u{in};
    union {
        char c[8];
        long long v;
    } v{ u.c[7], u.c[6], u.c[5], u.c[4], u.c[3], u.c[2], u.c[1], u.c[0]};
    return v.v;
}

long long f(long long i)
{
    return bswap64(i);
}

constexpr long long bswapD(double x)
{
    return bswap64(*(long long*)&x);
}

long long g(double x)
{
    return bswapD(x);
}
===============================================================

There are three observations / bugs:
1) bswapD is never recognized as byte-swapping
2) bswap64 is optimized to bswap at -O2 but not at -O3
3) 131t.bswap never shows bswap, apparently the pass doesn't detect this way of writing bswap, leaving it to the RTL optimizers.  Hence I classified this as tree-optimization bug.

Verified at -O2 with 9.3, 10.1 and trunk on the compiler explorer.

I'm flagging this as a regression because at -O2 gcc 8.3 detects bswap in both cases, but I'm guessing that this is by some accident.  In 7.5 neither function is compiled as bswap.
Comment 1 Richard Biener 2020-07-10 07:19:35 UTC
The bswap pass doesn't handle these patterns at all (it doesn't look at stores).
What does handle this case is store-merging which - on trunk - figures
bswap in f() but not in g() likely because of the BIT_FIELD_REF ->
cast folding which makes the stores appear inhomogenous:

  _3 = VIEW_CONVERT_EXPR<long long int>(x_2(D));
  _4 = BIT_FIELD_REF <x_2(D), 8, 56>;
  v.c[0] = _4;
...
  _11 = (char) _3;
  v.c[7] = _11;

store-merging already handles the cast vs. BIT_FIELD_REF case for f() but
it appearantly doesn't consider to look through a VIEW_CONVERT.

With -O3 we vectorize this in an inconvenient way and fully elide the store
so store-merging isn't the correct pass to handle this:

  _3 = BIT_FIELD_REF <i_2(D), 8, 56>;
  _4 = BIT_FIELD_REF <i_2(D), 8, 48>;
  _5 = BIT_FIELD_REF <i_2(D), 8, 40>;
  _6 = BIT_FIELD_REF <i_2(D), 8, 32>;
  _7 = BIT_FIELD_REF <i_2(D), 8, 24>;
  _8 = BIT_FIELD_REF <i_2(D), 8, 16>;
  _9 = BIT_FIELD_REF <i_2(D), 8, 8>;
  _10 = (char) i_2(D);
  _21 = {_3, _4, _5, _6, _7, _8, _9, _10};
  _18 = VIEW_CONVERT_EXPR<long long int>(_21);
  v ={v} {CLOBBER};
  return _18;

the vectorizer is also confused about BIT_FIELD_REF vs. cast here
(I repeatedly thought of removing that simplification ... but the
user could have written it as well :/).  And it would look for
a vector function argument but that's something that could be fixed.

The above is all for GCC 10.  GCC 8 possibly was lucky and did not have
that BIT_FIELD_REF -> cast simplification.
Comment 2 Richard Biener 2021-03-11 09:07:45 UTC
Created attachment 50361 [details]
WIP

On current trunk at -O3 f() again works via store-merging / vectorizing:

-  _21 = {_3, _4, _5, _6, _7, _8, _9, _10};
+  bswapsrc_22 = (long unsigned int) i_2(D);
+  bswapdst_19 = __builtin_bswap64 (bswapsrc_22);
+  _21 = VIEW_CONVERT_EXPR<vector(8) char>(bswapdst_19);

but g() does not, because init_symbolic_number doesn't like non-integral types.
Fixing that generates

_Z1gd:
.LFB2:
        .cfi_startproc
        movq    %xmm0, %rax
        bswap   %rax
        ret

but with -m32 it has the issue that we bswap only the lower part since
vectorizing produced two vector CTORs.  So we'd need to use a BIT_FIELD_REF
to extract the integer representation.

WIP patch attached.
Comment 3 Richard Biener 2021-03-31 09:13:52 UTC
See also PR96573
Comment 4 Richard Biener 2021-06-01 08:18:01 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
Comment 5 Andrew Pinski 2021-09-08 10:48:13 UTC
Note on the trunk for f and g at -O3 -msse4 (and -O3 on aarch64), GCC produces:
  _21 = VIEW_CONVERT_EXPR<vector(8) char>(i_2(D));
  _22 = VEC_PERM_EXPR <_21, _21, { 7, 6, 5, 4, 3, 2, 1, 0 }>;
  _18 = VIEW_CONVERT_EXPR<long long int>(_22);

But that VEC_PERM_EXPR is a bswap :).

So to fix this at -O3 -msse4, maybe we could just do:
(simplify
 (view_convert (vec_perm @0 @0 vector_cst_byteswap_p @1))
 (if (INTERGAL_TYPE_P (type)))
  (convert (bswap (view_convert @1))))

Note I don't think we want to do the byteswap in the integer registers if we are going back to the floating point registers.
Comment 6 Andrew Pinski 2021-09-08 10:58:22 UTC
(In reply to Andrew Pinski from comment #5)
> Note on the trunk for f and g at -O3 -msse4 (and -O3 on aarch64), GCC
> produces:
>   _21 = VIEW_CONVERT_EXPR<vector(8) char>(i_2(D));
>   _22 = VEC_PERM_EXPR <_21, _21, { 7, 6, 5, 4, 3, 2, 1, 0 }>;
>   _18 = VIEW_CONVERT_EXPR<long long int>(_22);
> 
> But that VEC_PERM_EXPR is a bswap :).
> 
> So to fix this at -O3 -msse4, maybe we could just do:
> (simplify
>  (view_convert (vec_perm @0 @0 vector_cst_byteswap_p @1))
>  (if (INTERGAL_TYPE_P (type)))
>   (convert (bswap (view_convert @1))))
> 
> Note I don't think we want to do the byteswap in the integer registers if we
> are going back to the floating point registers.

Or we could just support that in vect-lowering.