union BytesOverlay { uint8_t u8[2]; uint16_t u16; }; uint16_t f(uint16_t from) { BytesOverlay overlay; overlay.u16 = from; uint8_t byte1 = overlay.u8[0]; uint8_t byte2 = overlay.u8[1]; overlay.u8[0] = byte2; overlay.u8[1] = byte1; return overlay.u16; } This can be transformed to `return __builtin_bswap16(from);`. This transformation is done by LLVM, but not by GCC.
I see at -O2 Coalescing successful! Merged into 1 stores 16 bit bswap implementation found at: _9 New sequence of 1 stores to replace old one of 2 stores _9 = from_2(D) r>> 8; MEM[(union BytesOverlay *)&overlay] = _9; and movl %edi, %eax rolw $8, %ax ret with -O3 the vectorizer gets in the way and elides overlay: _7 = (unsigned char) from_2(D); _8 = BIT_FIELD_REF <from_2(D), 8, 8>; _9 = {_8, _7}; _3 = VIEW_CONVERT_EXPR<short unsigned int>(_9); overlay ={v} {CLOBBER}; return _3; now we could fold the V_C_E of the vector ctor to a bswap on from_2 though that would be quite a big special pattern. Maybe vector-ctor folding can consider this case to convert it to a V_C_E to a vector type from the bswap result. OTOH "fixing" the vectorizer to emit a bswap would be even nicer.
Doing it in the vectorizer wouldn't optimize it when users write it by hand, like: typedef char V __attribute__((vector_size (2))); typedef char W __attribute__((vector_size (8))); V foo (unsigned short x) { return (V) { x >> 8, x }; } W bar (unsigned long long x) { return (W) { x >> 56, x >> 48, x >> 40, x >> 32, x >> 24, x >> 16, x >> 8, x }; } and forwprop doesn't have the infrastructure needed for this. Wouldn't the bswap pass be the right spot to handle this?
Created attachment 49766 [details] gcc11-pr96239.patch So, I wrote this untested patch which fixes my testcases, but doesn't fix due to pass ordering issue the original one, as the bswap pass is done before vectorization only (both loop and slp). I guess some of the options are schedule another bswap pass instance later, or include just the vector CONSTRUCTOR handling in e.g. the store-merging pass (yes, it would be related just because the bswap code lives in the store-merging source file and performs also bswap optimizations upon stores).
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>: https://gcc.gnu.org/g:cd676dfa57e643a4f7d8445e6ebad0f21cf3fd84 commit r11-6115-gcd676dfa57e643a4f7d8445e6ebad0f21cf3fd84 Author: Jakub Jelinek <jakub@redhat.com> Date: Wed Dec 16 13:08:07 2020 +0100 bswap: Handle vector CONSTRUCTORs [PR96239] The following patch teaches the bswap pass to handle for small (2/4/8 byte long) vectors a CONSTRUCTOR by determining if the bytes of the constructor come from non-vector sources and are either nop or bswap and changing the CONSTRUCTOR in that case to VIEW_CONVERT_EXPR from scalar integer to the vector type. Unfortunately, as I found after the patch was written, due to pass ordering this doesn't really fix the original testcase, just the one I wrote, because both loop and slp vectorization is done only after the bswap pass. A possible way out of that would be to perform just this particular bswap optimization (i.e. for CONSTRUCTOR assignments with integral vector types call find_bswap_or_nop and bswap_replace if successful) also during the store merging pass, it isn't really a store, but the store merging pass already performs bswapping when handling store, so it wouldn't be that big hack. What do you think? 2020-12-16 Jakub Jelinek <jakub@redhat.com> PR tree-optimization/96239 * gimple-ssa-store-merging.c (find_bswap_or_nop): Handle a vector CONSTRUCTOR. (bswap_replace): Likewise. * gcc.dg/pr96239.c: New test.
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>: https://gcc.gnu.org/g:a7553ad60bebc419d510564b8b9f9e5e03725ff5 commit r11-6470-ga7553ad60bebc419d510564b8b9f9e5e03725ff5 Author: Jakub Jelinek <jakub@redhat.com> Date: Tue Jan 5 16:16:06 2021 +0100 store-merging: Handle vector CONSTRUCTORs using bswap [PR96239] I've tried to add such helper, but handling over just analysis and letting each pass handle it differently seems complicated given the limitations of the bswap infrastructure. So, this patch just hooks the optimization also into store-merging so that the original testcase from the PR can be fixed. 2021-01-05 Jakub Jelinek <jakub@redhat.com> PR tree-optimization/96239 * gimple-ssa-store-merging.c (maybe_optimize_vector_constructor): New function. (get_status_for_store_merging): Don't return BB_INVALID for blocks with potential bswap optimizable CONSTRUCTORs. (pass_store_merging::execute): Optimize vector CONSTRUCTORs with bswap if possible. * gcc.dg/tree-ssa/pr96239.c: New test.
Should be fixed now.