Compile this code with GCC 4.6.2 on an x86-64 machine with -O3:

#define SIZE 65536
#define WSIZE 64
unsigned short head[SIZE] __attribute__((aligned(64)));

void f(void)
{
  for (unsigned n = 0; n < SIZE; ++n)
    {
      unsigned short m = head[n];
      head[n] = (unsigned short)(m >= WSIZE ? m - WSIZE : 0);
    }
}

The result I see is this:

0000000000000000 <f>:
   0:   66 0f ef d2             pxor    %xmm2,%xmm2
   4:   b8 00 00 00 00          mov     $0x0,%eax
                        5: R_X86_64_32     head
   9:   66 0f 6f 25 00 00 00    movdqa  0x0(%rip),%xmm4        # 11 <f+0x11>
  10:   00
                        d: R_X86_64_PC32   .LC0-0x4
  11:   66 0f 6f 1d 00 00 00    movdqa  0x0(%rip),%xmm3        # 19 <f+0x19>
  18:   00
                        15: R_X86_64_PC32  .LC1-0x4
  19:   0f 1f 80 00 00 00 00    nopl    0x0(%rax)
  20:   66 0f 6f 00             movdqa  (%rax),%xmm0
  24:   66 0f 6f c8             movdqa  %xmm0,%xmm1
  28:   66 0f d9 c4             psubusw %xmm4,%xmm0
  2c:   66 0f 75 c2             pcmpeqw %xmm2,%xmm0
  30:   66 0f fd cb             paddw   %xmm3,%xmm1
  34:   66 0f df c1             pandn   %xmm1,%xmm0
  38:   66 0f 7f 00             movdqa  %xmm0,(%rax)
  3c:   48 83 c0 10             add     $0x10,%rax
  40:   48 3d 00 00 00 00       cmp     $0x0,%rax
                        42: R_X86_64_32S   head+0x20000
  46:   75 d8                   jne     20 <f+0x20>
  48:   f3 c3                   repz retq

There is a lot of unnecessary code here. The psubusw instruction alone is sufficient; its whole purpose is to implement saturated subtraction. Why does gcc create all this extra code? The loop body should just be

    movdqa  (%rax), %xmm0
    psubusw %xmm1, %xmm0
    movdqa  %xmm0, (%rax)

where %xmm1 holds WSIZE in each of the 16-bit lanes.
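For comparison, here is a minimal sketch of the same loop written with SSE2 intrinsics, which is roughly what the vectorizer should be able to produce; _mm_subs_epu16 maps directly to psubusw (the function name f_intrin is made up, and the defines just repeat the test case above):

#include <emmintrin.h>

#define SIZE 65536
#define WSIZE 64
extern unsigned short head[SIZE];

/* Sketch only: same semantics as f(), but using the SSE2 unsigned
   saturating subtract so each iteration is just load/psubusw/store.  */
void f_intrin(void)
{
  const __m128i wsize = _mm_set1_epi16(WSIZE);
  for (unsigned n = 0; n < SIZE; n += 8)
    {
      __m128i m = _mm_load_si128((__m128i *) &head[n]);
      _mm_store_si128((__m128i *) &head[n], _mm_subs_epu16(m, wsize));
    }
}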
It's vectorized as

  vect_var_.11_17 = MEM[base: D.1616_5, offset: 0B];
  vect_var_.12_19 = vect_var_.11_17 + { 65472, 65472, 65472, 65472, 65472, 65472, 65472, 65472 };
  vect_var_.14_22 = VEC_COND_EXPR <vect_var_.11_17 > { 63, 63, 63, 63, 63, 63, 63, 63 }, vect_var_.12_19, { 0, 0, 0, 0, 0, 0, 0, 0 }>;
  MEM[base: D.1616_5, offset: 0B] = vect_var_.14_22;

GCC doesn't have the notion that this is a "saturated subtraction". If targets support saturated arithmetic, but only for vectors, then the vectorizer pattern recognition would need to be enhanced and the targets would eventually have to support expanding saturated arithmetic. OTOH, middle-end support for saturated arithmetic needs to be improved anyway; scalar code could also benefit from this optimization. On the RTL level we have [us]s_{plus,minus}, which the vectorizer could use (if implemented on the target for vector types).
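For reference, these are the scalar idioms such pattern recognition would presumably have to match; a sketch of the two usual ways of spelling an unsigned saturating subtract in C, not something taken from the tree dumps:

/* Typical source-level spellings of unsigned saturating subtraction
   that a middle-end pattern matcher would need to recognize.  */
unsigned short
sat_sub_cmp(unsigned short x, unsigned short c)
{
  return x >= c ? x - c : 0;      /* the form used in the test case */
}

unsigned short
sat_sub_wrap(unsigned short x, unsigned short c)
{
  unsigned short d = x - c;       /* wraps around if x < c ...          */
  return d <= x ? d : 0;          /* ... which the comparison detects   */
}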
Note, this code appears in gzip and therefore IIRC in SPEC CPU (in deflate.c:fill_window). Although when I compile gzip myself, with that code embedded in a larger function, I cannot get the optimization to apply at all. If this bug is fixed and the optimization does apply, the SPEC numbers could go up, if SPEC CPU is testing unzipping...
Link to vectorizer missed-optimization meta-bug.
AArch64 could produce SQSUB instead of the following code:

        add     v1.8h, v0.8h, v3.8h
        cmhi    v0.8h, v0.8h, v2.8h
        and     v0.16b, v1.16b, v0.16b
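For what it's worth, a hand-written NEON version of the loop is a single saturating subtract per vector as well; a sketch, assuming the same SIZE/WSIZE/head definitions as the original test case (note that for the unsigned short case it is the unsigned variant, vqsubq_u16/UQSUB, that matches the semantics):

#include <arm_neon.h>

#define SIZE 65536
#define WSIZE 64
extern unsigned short head[SIZE];

/* Sketch only: NEON version of the loop; vqsubq_u16 lowers to a single
   unsigned saturating subtract (UQSUB) per 8 elements.  */
void f_neon(void)
{
  const uint16x8_t wsize = vdupq_n_u16(WSIZE);
  for (unsigned n = 0; n < SIZE; n += 8)
    vst1q_u16(&head[n], vqsubq_u16(vld1q_u16(&head[n]), wsize));
}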
We do slightly better now, but it is still not close:

        movdqa  (%rax), %xmm0
        addq    $16, %rax
        psubusw %xmm1, %xmm0
        paddw   %xmm1, %xmm0
        paddw   %xmm2, %xmm0
        movaps  %xmm0, -16(%rax)

which is expanded from:

  vect__1.6_15 = MAX_EXPR <vect_m_6.5_3, { 64, 64, 64, 64, 64, 64, 64, 64 }>;
  vect__2.7_17 = vect__1.6_15 + { 65472, 65472, 65472, 65472, 65472, 65472, 65472, 65472 };

With -mavx2 we get:

        vpmaxuw (%rax), %ymm2, %ymm0
        addq    $32, %rax
        vpaddw  %ymm1, %ymm0, %ymm0
        vmovdqa %ymm0, -32(%rax)

Just note that 65472 is -64. This shouldn't be too hard to detect, and we could even lower it back to MAX_EXPR/PLUS_EXPR if us_minus does not exist on the target.
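To spell out the equivalence that lowering relies on (a sketch for illustration, not taken from the dumps): for an unsigned 16-bit value, MAX_EXPR <m, 64> followed by adding 65472 (i.e. -64, wrapping mod 2^16) computes exactly the saturated m - 64.

/* Sketch: the MAX_EXPR/PLUS_EXPR form and the saturating subtract agree
   for every unsigned 16-bit input.  */
unsigned short
via_max(unsigned short m)
{
  unsigned short t = m >= 64 ? m : 64;   /* MAX_EXPR <m, 64>            */
  return (unsigned short)(t + 65472);    /* PLUS_EXPR, wraps mod 2^16   */
}

unsigned short
via_satsub(unsigned short m)
{
  return m >= 64 ? m - 64 : 0;           /* saturated subtraction       */
}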
https://gcc.gnu.org/pipermail/gcc/2021-May/236015.html