Bug 51492 - vectorizer does not support saturated arithmetic patterns
Summary: vectorizer does not support saturated arithmetic patterns
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.6.2
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
 
Reported: 2011-12-10 00:59 UTC by Ulrich Drepper
Modified: 2021-08-25 03:54 UTC (History)
0 users

See Also:
Host:
Target:
Build: x86_64-linux aarch64-linux-gnu
Known to work:
Known to fail: 6.0
Last reconfirmed: 2021-08-24 00:00:00


Attachments

Description Ulrich Drepper 2011-12-10 00:59:22 UTC
Compile this code with 4.6.2 on a x86-64 machine with -O3:

#define SIZE 65536
#define WSIZE 64
unsigned short head[SIZE] __attribute__((aligned(64)));

void
f(void)
{
  for (unsigned n = 0; n < SIZE; ++n) {
    unsigned short m = head[n];
    head[n] = (unsigned short)(m >= WSIZE ? m-WSIZE : 0);
  }
}

The result I see is this:

0000000000000000 <f>:
   0:	66 0f ef d2          	pxor   %xmm2,%xmm2
   4:	b8 00 00 00 00       	mov    $0x0,%eax
			5: R_X86_64_32	head
   9:	66 0f 6f 25 00 00 00 	movdqa 0x0(%rip),%xmm4        # 11 <f+0x11>
  10:	00 
			d: R_X86_64_PC32	.LC0-0x4
  11:	66 0f 6f 1d 00 00 00 	movdqa 0x0(%rip),%xmm3        # 19 <f+0x19>
  18:	00 
			15: R_X86_64_PC32	.LC1-0x4
  19:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  20:	66 0f 6f 00          	movdqa (%rax),%xmm0
  24:	66 0f 6f c8          	movdqa %xmm0,%xmm1
  28:	66 0f d9 c4          	psubusw %xmm4,%xmm0
  2c:	66 0f 75 c2          	pcmpeqw %xmm2,%xmm0
  30:	66 0f fd cb          	paddw  %xmm3,%xmm1
  34:	66 0f df c1          	pandn  %xmm1,%xmm0
  38:	66 0f 7f 00          	movdqa %xmm0,(%rax)
  3c:	48 83 c0 10          	add    $0x10,%rax
  40:	48 3d 00 00 00 00    	cmp    $0x0,%rax
			42: R_X86_64_32S	head+0x20000
  46:	75 d8                	jne    20 <f+0x20>
  48:	f3 c3                	repz retq 


There is a lot of unnecessary code.  The psubusw instruction alone is sufficient.  The purpose of this instruction is to implement saturated subtraction.  Why does gcc create all this extra code?  The code should just be

   movdqa (%rax), %xmm0
   psubusw %xmm1, %xmm0
   movdqa %xmm0, (%rax)

where %xmm1 has WSIZE in the 16-bit values.
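For comparison, the desired codegen can be written by hand with SSE2 intrinsics, where `_mm_subs_epu16` maps directly to `psubusw`. This sketch is illustrative and not part of the original report; the array name `head2` is a stand-in for `head` above:

```c
#include <emmintrin.h>   /* SSE2 intrinsics: _mm_subs_epu16 -> psubusw */

#define SIZE 65536
#define WSIZE 64
unsigned short head2[SIZE] __attribute__((aligned(64)));

/* Saturating head2[n] -= WSIZE, eight uint16 lanes at a time. */
void f_by_hand(void)
{
  const __m128i w = _mm_set1_epi16(WSIZE);
  for (unsigned n = 0; n < SIZE; n += 8) {
    __m128i m = _mm_load_si128((const __m128i *)&head2[n]);
    _mm_store_si128((__m128i *)&head2[n], _mm_subs_epu16(m, w));
  }
}
```

This is one load, one saturating subtract, and one store per eight elements, matching the three-instruction loop body sketched above.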
Comment 1 Richard Biener 2011-12-12 10:23:20 UTC
It's vectorized as

  vect_var_.11_17 = MEM[base: D.1616_5, offset: 0B];
  vect_var_.12_19 = vect_var_.11_17 + { 65472, 65472, 65472, 65472, 65472, 65472, 65472, 65472 };
  vect_var_.14_22 = VEC_COND_EXPR <vect_var_.11_17 > { 63, 63, 63, 63, 63, 63, 63, 63 }, vect_var_.12_19, { 0, 0, 0, 0, 0, 0, 0, 0 }>;
  MEM[base: D.1616_5, offset: 0B] = vect_var_.14_22;

GCC doesn't have the idea that this is a "saturated subtraction".  If targets
have saturated arithmetic support, but only for vectors, then the vectorizer's
pattern recognition would need to be enhanced, and the targets should
eventually support expanding saturated arithmetic.

OTOH, middle-end support for saturated arithmetic needs to be improved;
scalar code could also benefit from this optimization.  On the RTL level
we have [us]s_{plus,minus} which the vectorizer could use (if implemented
on the target for vector types).
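For reference, the scalar idioms such pattern recognition would need to canonicalize all compute the same unsigned saturating subtraction. This is a sketch of the source-level forms, not of GCC's internal matching:

```c
/* Branchy form, as written in the reported loop. */
unsigned short sat_sub_branch(unsigned short a, unsigned short b)
{
  return (unsigned short)(a >= b ? a - b : 0);
}

/* Branch-free form: the (promoted) difference is masked to zero
   whenever the subtraction would wrap below zero. */
unsigned short sat_sub_mask(unsigned short a, unsigned short b)
{
  return (unsigned short)((a - b) & -(a >= b));
}
```

Both forms would ideally expand to the target's `us_minus` pattern when it exists, for scalars as well as vectors.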
Comment 2 Ulrich Drepper 2012-01-08 18:56:48 UTC
Note, this code appears in gzip and therefore IIRC in specCPU (in deflate.c:fill_window).  Although when compiling gzip myself with that code embedded in a larger function I cannot get the optimization to apply at all.

If this bug is fixed and the optimization is applied, the spec numbers could go up if specCPU is testing unzipping...
Comment 3 Richard Biener 2012-07-13 08:39:43 UTC
Link to vectorizer missed-optimization meta-bug.
Comment 4 Andrew Pinski 2016-01-04 23:22:48 UTC
AARCH64 could produce SQSUB instead of the following code:
        add     v1.8h, v0.8h, v3.8h
        cmhi    v0.8h, v0.8h, v2.8h
        and     v0.16b, v1.16b, v0.16b
Comment 5 Andrew Pinski 2021-08-24 23:44:48 UTC
We now do slightly better, but not close:
        movdqa  (%rax), %xmm0
        addq    $16, %rax
        psubusw %xmm1, %xmm0
        paddw   %xmm1, %xmm0
        paddw   %xmm2, %xmm0
        movaps  %xmm0, -16(%rax)

Which is expanded from:
  vect__1.6_15 = MAX_EXPR <vect_m_6.5_3, { 64, 64, 64, 64, 64, 64, 64, 64 }>;
  vect__2.7_17 = vect__1.6_15 + { 65472, 65472, 65472, 65472, 65472, 65472, 65472, 65472 };

With -mavx2 we get:
        vpmaxuw (%rax), %ymm2, %ymm0
        addq    $32, %rax
        vpaddw  %ymm1, %ymm0, %ymm0
        vmovdqa %ymm0, -32(%rax)

Just note 65472 is -64.

This shouldn't be too hard to detect and add, and even to lower back to MAX_EXPR/PLUS_EXPR if us_minus does not exist on the target.
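The MAX_EXPR/PLUS_EXPR lowering works because, in 16-bit modular arithmetic, max(m, 64) + 65472 equals the saturated difference (65472 being -64 mod 2^16). A quick check of that identity, written out for illustration:

```c
/* The lowered form GCC emits: MAX_EXPR followed by a wrapping add. */
unsigned short sat_sub64_maxform(unsigned short m)
{
  unsigned short t = m >= 64 ? m : 64;            /* MAX_EXPR <m, 64> */
  return (unsigned short)(t + 65472);             /* PLUS_EXPR, wraps mod 2^16 */
}

/* The saturating subtraction it is equivalent to. */
unsigned short sat_sub64_direct(unsigned short m)
{
  return (unsigned short)(m >= 64 ? m - 64 : 0);
}
```

When m < 64, the max clamps to 64 and 64 + 65472 wraps to 0; otherwise the add is just m - 64, so the two functions agree for all 65536 inputs.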