Bug 96239 - Failure to recognize __builtin_bswap16 pattern
Summary: Failure to recognize __builtin_bswap16 pattern
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 11.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
Reported: 2020-07-18 00:39 UTC by Gabriel Ravier
Modified: 2023-09-03 12:10 UTC (History)
2 users (show)

See Also:
Known to work:
Known to fail:
Last reconfirmed: 2020-07-20 00:00:00

gcc11-pr96239.patch (2.22 KB, patch)
2020-12-14 19:54 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Gabriel Ravier 2020-07-18 00:39:19 UTC
union BytesOverlay
    uint8_t u8[2];
    uint16_t u16;

uint16_t f(uint16_t from)
    BytesOverlay overlay;
    overlay.u16 = from;
    uint8_t byte1 = overlay.u8[0];
    uint8_t byte2 = overlay.u8[1];
    overlay.u8[0] = byte2;
    overlay.u8[1] = byte1;
    return overlay.u16;

This can be transformed to `return __builtin_bswap16(from);`. This transformation is done by LLVM, but not by GCC.
Comment 1 Richard Biener 2020-07-20 06:15:39 UTC
I see at -O2

Coalescing successful!
Merged into 1 stores
16 bit bswap implementation found at: _9
New sequence of 1 stores to replace old one of 2 stores

  _9 = from_2(D) r>> 8;
  MEM[(union BytesOverlay *)&overlay] = _9;


        movl    %edi, %eax
        rolw    $8, %ax

with -O3 the vectorizer gets in the way and elides overlay:

  _7 = (unsigned char) from_2(D);
  _8 = BIT_FIELD_REF <from_2(D), 8, 8>;
  _9 = {_8, _7};
  _3 = VIEW_CONVERT_EXPR<short unsigned int>(_9);
  overlay ={v} {CLOBBER};
  return _3;

now we could fold the V_C_E of the vector ctor to a bswap on from_2 though
that would be quite a big special pattern.  Maybe vector-ctor folding can
consider this case to convert it to a V_C_E to a vector type from the
bswap result.  OTOH "fixing" the vectorizer to emit a bswap would be
even nicer.
Comment 2 Jakub Jelinek 2020-12-14 14:41:11 UTC
Doing it in the vectorizer wouldn't optimize it when users write it by hand, like:

typedef char V __attribute__((vector_size (2)));
typedef char W __attribute__((vector_size (8)));

foo (unsigned short x)
  return (V) { x >> 8, x };

bar (unsigned long long x)
  return (W) { x >> 56, x >> 48, x >> 40, x >> 32, x >> 24, x >> 16, x >> 8, x };

and forwprop doesn't have the infrastructure needed for this.
Wouldn't the bswap pass be the right spot to handle this?
Comment 3 Jakub Jelinek 2020-12-14 19:54:55 UTC
Created attachment 49766 [details]

So, I wrote this untested patch which fixes my testcases, but doesn't fix due to pass ordering issue the original one, as the bswap pass is done before vectorization only (both loop and slp).
I guess some of the options are schedule another bswap pass instance later, or include just the vector CONSTRUCTOR handling in e.g. the store-merging pass (yes, it would be related just because the bswap code lives in the store-merging source file and performs also bswap optimizations upon stores).
Comment 4 GCC Commits 2020-12-16 12:11:08 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:


commit r11-6115-gcd676dfa57e643a4f7d8445e6ebad0f21cf3fd84
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Wed Dec 16 13:08:07 2020 +0100

    bswap: Handle vector CONSTRUCTORs [PR96239]
    The following patch teaches the bswap pass to handle for small (2/4/8 byte
    long) vectors a CONSTRUCTOR by determining if the bytes of the constructor
    come from non-vector sources and are either nop or bswap and changing the
    CONSTRUCTOR in that case to VIEW_CONVERT_EXPR from scalar integer to
    the vector type.
    Unfortunately, as I found after the patch was written, due to pass ordering
    this doesn't really fix the original testcase, just the one I wrote,
    because both loop and slp vectorization is done only after the bswap pass.
    A possible way out of that would be to perform just this particular bswap
    optimization (i.e. for CONSTRUCTOR assignments with integral vector types
    call find_bswap_or_nop and bswap_replace if successful) also during the
    store merging pass, it isn't really a store, but the store merging pass
    already performs bswapping when handling store, so it wouldn't be that big
    hack.  What do you think?
    2020-12-16  Jakub Jelinek  <jakub@redhat.com>
            PR tree-optimization/96239
            * gimple-ssa-store-merging.c (find_bswap_or_nop): Handle a vector
            (bswap_replace): Likewise.
            * gcc.dg/pr96239.c: New test.
Comment 5 GCC Commits 2021-01-05 15:17:25 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:


commit r11-6470-ga7553ad60bebc419d510564b8b9f9e5e03725ff5
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Jan 5 16:16:06 2021 +0100

    store-merging: Handle vector CONSTRUCTORs using bswap [PR96239]
    I've tried to add such helper, but handling over just analysis and letting
    each pass handle it differently seems complicated given the limitations of
    the bswap infrastructure.
    So, this patch just hooks the optimization also into store-merging so that
    the original testcase from the PR can be fixed.
    2021-01-05  Jakub Jelinek  <jakub@redhat.com>
            PR tree-optimization/96239
            * gimple-ssa-store-merging.c (maybe_optimize_vector_constructor): New
            (get_status_for_store_merging): Don't return BB_INVALID for blocks
            with potential bswap optimizable CONSTRUCTORs.
            (pass_store_merging::execute): Optimize vector CONSTRUCTORs with bswap
            if possible.
            * gcc.dg/tree-ssa/pr96239.c: New test.
Comment 6 Jakub Jelinek 2021-01-05 15:24:55 UTC
Should be fixed now.