While examining some code with the bloaty tool, I found that a function for deinterleaving Super Magic Drive ROM images was taking up ~5 KB when it should have been less than 1 KB. On examining the disassembly, there appeared to be a lot of unnecessary instructions; compiling with clang and MSVC resulted in significantly fewer instructions.

Either removing __restrict from the function parameters (two pointers), or specifying -fno-tree-vectorize to disable auto-vectorization, fixes this issue with gcc-12. The generated code isn't buggy as far as I can tell, and it benchmarks around the same as the non-bloated version.

I've narrowed it down to the following minimal test case:

#include <stdint.h>

#define SMD_BLOCK_SIZE 16384

void decodeBlock_cpp(uint8_t *__restrict pDest, const uint8_t *__restrict pSrc)
{
	// First 8 KB of the source block is ODD bytes.
	const uint8_t *pSrc_end = pSrc + (SMD_BLOCK_SIZE / 2);
	for (uint8_t *pDest_odd = pDest + 1; pSrc < pSrc_end; pDest_odd += 2, pSrc += 1) {
		pDest_odd[0] = pSrc[0];
	}
}

Assembly output with `g++ -O2 -fno-tree-vectorize` (or with the __restrict qualifiers removed):

decodeBlock_cpp(unsigned char*, unsigned char const*):
        xor     eax, eax
.L2:
        movzx   edx, BYTE PTR [rsi+rax]
        mov     BYTE PTR [rdi+1+rax*2], dl
        add     rax, 1
        cmp     rax, 8192
        jne     .L2
        ret

Assembly output with `g++ -O2` (which implies -ftree-vectorize with gcc-12) and the __restrict qualifiers:

decodeBlock_cpp(unsigned char*, unsigned char const*):
        push    r15
        lea     rax, [rsi+8192]
        add     rdi, 1
        push    r14
        push    r13
        push    r12
        push    rbp
        push    rbx
        mov     QWORD PTR [rsp-8], rax
.L2:
        movzx   ecx, BYTE PTR [rsi+10]
        movzx   eax, BYTE PTR [rsi+14]
        add     rsi, 16
        add     rdi, 32
        movzx   edx, BYTE PTR [rsi-3]
        movzx   r15d, BYTE PTR [rsi-1]
        movzx   r11d, BYTE PTR [rsi-10]
        movzx   ebx, BYTE PTR [rsi-11]
        mov     BYTE PTR [rsp-11], cl
        movzx   ecx, BYTE PTR [rsi-16]
        movzx   ebp, BYTE PTR [rsi-12]
        mov     BYTE PTR [rsp-9], al
        movzx   r12d, BYTE PTR [rsi-13]
        movzx   eax, BYTE PTR [rsi-4]
        mov     BYTE PTR [rsp-10], dl
        movzx   r13d, BYTE PTR [rsi-14]
        movzx   edx, BYTE PTR [rsi-5]
        movzx   r14d, BYTE PTR [rsi-15]
        movzx   r8d, BYTE PTR [rsi-7]
        movzx   r9d, BYTE PTR [rsi-8]
        movzx   r10d, BYTE PTR [rsi-9]
        mov     BYTE PTR [rdi-32], cl
        movzx   ecx, BYTE PTR [rsp-11]
        mov     BYTE PTR [rdi-10], dl
        mov     BYTE PTR [rdi-30], r14b
        mov     BYTE PTR [rdi-28], r13b
        mov     BYTE PTR [rdi-26], r12b
        mov     BYTE PTR [rdi-24], bpl
        mov     BYTE PTR [rdi-22], bl
        mov     BYTE PTR [rdi-20], r11b
        mov     BYTE PTR [rdi-18], r10b
        mov     BYTE PTR [rdi-16], r9b
        mov     BYTE PTR [rdi-14], r8b
        mov     BYTE PTR [rdi-12], cl
        mov     BYTE PTR [rdi-8], al
        movzx   eax, BYTE PTR [rsp-9]
        movzx   edx, BYTE PTR [rsp-10]
        mov     BYTE PTR [rdi-2], r15b
        mov     BYTE PTR [rdi-4], al
        mov     rax, QWORD PTR [rsp-8]
        mov     BYTE PTR [rdi-6], dl
        cmp     rsi, rax
        jne     .L2
        pop     rbx
        pop     rbp
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        ret

$ gcc --version
gcc (Gentoo Hardened 12.2.1_p20221008 p1) 12.2.1 20221008
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
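For reference, the real function also fills the even destination bytes; a rough scalar sketch of the full decode is below (decodeBlock_full is just an illustrative name, and the layout of the second half of the source block is an assumption, not part of the reduced test case):

#include <stdint.h>

#define SMD_BLOCK_SIZE 16384

// Sketch only: the reduced test case above copies just the ODD destination
// bytes; this assumes the second 8 KB of the source block holds the EVEN
// destination bytes.
void decodeBlock_full(uint8_t *__restrict pDest, const uint8_t *__restrict pSrc)
{
	const uint8_t *pSrc_even = pSrc + (SMD_BLOCK_SIZE / 2);
	for (unsigned int i = 0; i < (SMD_BLOCK_SIZE / 2); i++) {
		pDest[i*2 + 1] = pSrc[i];       // odd bytes from the first half
		pDest[i*2]     = pSrc_even[i];  // even bytes from the second half (assumed)
	}
}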
Some quick testing with Compiler Explorer at godbolt.org shows that this behavior started with gcc-8.1, and it doesn't happen with gcc-7.x or earlier.
This is a cost model issue with x86_64. On aarch64, this is not vectorized unless you use -fno-vect-cost-model.
With '-fdisable-tree-forwprop4 -msse4.1' you see what the vectorizer perhaps wanted to achieve.
The vectorizer vectorizes this with a strided store, costing

*pSrc_16 1 times unaligned_load (misalign -1) costs 12 in body
_1 16 times scalar_store costs 192 in body
_1 16 times vec_to_scalar costs 64 in body
t.c:8:44: note: operating only on full vectors.
t.c:8:44: note: Cost model analysis:
  Vector inside of loop cost: 268
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 24
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 0

Later, forwprop figures out it can replace the element extracts from the vector load with scalar loads, which then results in effective unrolling of the loop by a factor of 16. The vectorizer misses the fact that without SSE 4.1 it cannot do efficient lane extracts. With SSE 4.1 and forwprop disabled you'd get

.L3:
        movdqu  (%rsi), %xmm0
        addq    $16, %rsi
        addq    $32, %rax
        pextrb  $0, %xmm0, -32(%rax)
        pextrb  $1, %xmm0, -30(%rax)
        pextrb  $2, %xmm0, -28(%rax)
        pextrb  $3, %xmm0, -26(%rax)
        pextrb  $4, %xmm0, -24(%rax)
        pextrb  $5, %xmm0, -22(%rax)
        pextrb  $6, %xmm0, -20(%rax)
        pextrb  $7, %xmm0, -18(%rax)
        pextrb  $8, %xmm0, -16(%rax)
        pextrb  $9, %xmm0, -14(%rax)
        pextrb  $10, %xmm0, -12(%rax)
        pextrb  $11, %xmm0, -10(%rax)
        pextrb  $12, %xmm0, -8(%rax)
        pextrb  $13, %xmm0, -6(%rax)
        pextrb  $14, %xmm0, -4(%rax)
        pextrb  $15, %xmm0, -2(%rax)
        cmpq    %rdx, %rsi
        jne     .L3

which is what the vectorizer thinks is going to be generated. But with just SSE2 we are spilling to memory for the lane extracts.

For the case at hand, loading two vectors from the destination and then punpck{h,l}bw and storing them again might be the most efficient thing to do here.

On the cost model side, 'vec_to_scalar' is ambiguous; the x86 backend tries to compensate with

      /* If we do elementwise loads into a vector then we are bound by
	 latency and execution resources for the many scalar loads
	 (AGU and load ports).  Try to account for this by scaling the
	 construction cost by the number of elements involved.  */
      if ((kind == vec_construct || kind == vec_to_scalar)
	  && stmt_info
	  && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
	      || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
	  && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
	  && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
	{
	  stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
	  stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
	}

but that doesn't trigger here because the step is the constant two.

RTL expansion will eventually use the vec_extract optab, and that succeeds even for SSE2 by spilling, so it isn't useful to query support:

void
ix86_expand_vector_extract (bool mmx_ok, rtx target, rtx vec, int elt)
{
...
  if (use_vec_extr)
    {
...
    }
  else
    {
      rtx mem = assign_stack_temp (mode, GET_MODE_SIZE (mode));
      emit_move_insn (mem, vec);
      tmp = adjust_address (mem, inner_mode, elt*GET_MODE_SIZE (inner_mode));
      emit_move_insn (target, tmp);
    }
}

The fallback is eventually done by RTL expansion anyway. Note that the vectorizer doesn't query vec_extract support at all - it relies on expansion's fallback here - but it could do better costing, and it could also generate a single spill slot rather than one for each extract.
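Something like the following hand-written SSE2 sketch illustrates that read-modify-write idea (hypothetical code, my reading of the suggestion, not GCC output): the even-position bytes are read back out of the destination and re-interleaved with the 16 new odd-position bytes via punpck{l,h}bw.

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>

/* Hypothetical sketch of the RMW approach; assumes n is a multiple of 16
   (here n == 8192).  */
static void store_odd_lanes_rmw(uint8_t *pDest, const uint8_t *pSrc, size_t n)
{
  const __m128i lo_mask = _mm_set1_epi16(0x00FF);
  for (size_t i = 0; i < n; i += 16, pDest += 32) {
    __m128i src = _mm_loadu_si128((const __m128i *) (pSrc + i));
    __m128i d0  = _mm_loadu_si128((const __m128i *) pDest);
    __m128i d1  = _mm_loadu_si128((const __m128i *) (pDest + 16));
    /* Gather the untouched even-position destination bytes into one vector.  */
    __m128i evens = _mm_packus_epi16(_mm_and_si128(d0, lo_mask),
                                     _mm_and_si128(d1, lo_mask));
    /* punpcklbw / punpckhbw: interleave the kept even bytes with the new
       odd bytes, then store both halves back.  */
    _mm_storeu_si128((__m128i *) pDest,        _mm_unpacklo_epi8(evens, src));
    _mm_storeu_si128((__m128i *) (pDest + 16), _mm_unpackhi_epi8(evens, src));
  }
}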
(In reply to Richard Biener from comment #4)
> For the case at hand loading two vectors from the destination and then
> punpck{h,l}bw and storing them again might be the most efficient thing
> to do here.

I think such read-modify-write on the destination introduces a data race for bytes that are not accessed in the original program, so that would be okay only under -fallow-store-data-races?
On Tue, 10 Jan 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322
>
> --- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #4)
> > For the case at hand loading two vectors from the destination and then
> > punpck{h,l}bw and storing them again might be the most efficient thing
> > to do here.
>
> I think such read-modify-write on the destination introduces a data race for
> bytes that are not accessed in the original program, so that would be okay only
> under -fallow-store-data-races?

Yes, obviously.
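For what it's worth, if the full block decode writes every destination byte (both the odd half and the assumed even half), the punpck approach needs no destination loads at all, so the store-data-race concern does not arise. A hypothetical SSE2 sketch, under the same layout assumption as above:

#include <stdint.h>
#include <emmintrin.h>

#define SMD_BLOCK_SIZE 16384

/* Hypothetical sketch, not from the original report: decode a whole block by
   interleaving the two source halves, so every destination byte is written
   and nothing is read back from the destination.  Assumes the second 8 KB of
   the source holds the EVEN destination bytes.  */
static void decodeBlock_sse2(uint8_t *pDest, const uint8_t *pSrc)
{
  const uint8_t *pSrc_odd  = pSrc;                         /* -> pDest[1], pDest[3], ... */
  const uint8_t *pSrc_even = pSrc + (SMD_BLOCK_SIZE / 2);  /* -> pDest[0], pDest[2], ... */

  for (unsigned int i = 0; i < SMD_BLOCK_SIZE / 2; i += 16, pDest += 32) {
    __m128i odd  = _mm_loadu_si128((const __m128i *) (pSrc_odd + i));
    __m128i even = _mm_loadu_si128((const __m128i *) (pSrc_even + i));
    /* punpcklbw / punpckhbw produce the interleaved even/odd byte stream.  */
    _mm_storeu_si128((__m128i *) pDest,        _mm_unpacklo_epi8(even, odd));
    _mm_storeu_si128((__m128i *) (pDest + 16), _mm_unpackhi_epi8(even, odd));
  }
}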