Bug 101434 - vector-by-vector left shift expansion for char/short is not optimal
Summary: vector-by-vector left shift expansion for char/short is not optimal
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 12.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2021-07-13 10:55 UTC by Uroš Bizjak
Modified: 2021-08-25 03:27 UTC (History)
0 users

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-08-24 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Uroš Bizjak 2021-07-13 10:55:08 UTC
Following testcase:

--cut here--
short r[8], a[8], b[8];

void f1 (void)
{
  int i;

  for (i = 0; i < 8; i++)
    r[i] = a[i] << b[i];
}
--cut here--

compiles with -O2 -ftree-vectorize -mxop to:

        vmovdqa a(%rip), %xmm0
        vmovdqa b(%rip), %xmm1
        vpmovsxwd       %xmm0, %xmm2
        vpsrldq $8, %xmm0, %xmm0
        vpmovsxwd       %xmm1, %xmm3
        vpsrldq $8, %xmm1, %xmm1
        vpshad  %xmm3, %xmm2, %xmm2
        vpmovsxwd       %xmm0, %xmm0
        vpmovsxwd       %xmm1, %xmm1
        vpshad  %xmm1, %xmm0, %xmm0
        vpperm  .LC0(%rip), %xmm0, %xmm2, %xmm2
        vmovdqa %xmm2, r(%rip)
        ret

SImode vpshad is used together with lots of other instructions, but a HImode vpshaw should be emitted instead.

Similar testcase:

--cut here--
short r[8], a[8], b[8];

void f2 (void)
{
  int i;

  for (i = 0; i < 8; i++)
    r[i] = a[i] >> b[i];
}
--cut here--

results in expected HImode vect-by-vect shift insn:

        vpxor   %xmm0, %xmm0, %xmm0
        vpsubw  b(%rip), %xmm0, %xmm0
        vpshaw  %xmm0, a(%rip), %xmm0
        vmovdqa %xmm0, r(%rip)
        ret

(do not bother with vpxor and vpsubw, these are just one of XOP peculiarities.)
Comment 1 Richard Biener 2021-07-13 12:15:54 UTC
Probably low priority if not doable nicely w/o XOP.

Note this is mainly due to integer promotion rules (we see shifts of int by int)
and fear of introducing undefined behavior (the int by int shift has larger
valid ranges for the RHS than a truncated one).

There must be a duplicate bugreport.

IMHO we might consider to make shifts of smaller than int types with
out of bound shift amounts well-defined.  I think there's no way to
rewrite types to avoid the undefined behavior like we can do with
signed arithmetic -> unsigned arithmetic (besides division by -1 where
the sign matters).
Comment 2 Richard Biener 2021-07-13 12:20:11 UTC
So technically 

  (int)short-var << a

->  short-var << (min (a, 15))

we know a is <= 31 because of the int shift (and >= 0) but we cannot simply
emit short-var << a because how the target behaves is not well-defined
(SHIFT_COUNT_TRUNCATED) but the behavior is well-defined for the int << int
shift.  Pattern recog has code to deal with this in theory but it gives up
here and does not bother to emit a min ().
Comment 3 Uroš Bizjak 2021-07-13 12:23:11 UTC
(In reply to Richard Biener from comment #1)
> Probably low priority if not doable nicely w/o XOP.

-mxop can be substituted with -mavx512bw -mavx512vl for the same effect.
Comment 4 Andrew Pinski 2021-08-25 03:27:33 UTC
(In reply to Uroš Bizjak from comment #3)
> (In reply to Richard Biener from comment #1)
> > Probably low priority if not doable nicely w/o XOP.
> 
> -mxop can be substituted with -mavx512bw -mavx512vl for the same effect.

or -mavx2.