Bug 65084 - Lack of type narrowing/widening inhibits good vectorization
Summary: Lack of type narrowing/widening inhibits good vectorization
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: unknown
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer 65964
 
Reported: 2015-02-17 05:21 UTC by Jeffrey A. Law
Modified: 2023-08-22 04:35 UTC
CC: 4 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2015-02-17 00:00:00


Description Jeffrey A. Law 2015-02-17 05:21:53 UTC
These are testcases extracted from PR 47477.

short a[1024], b[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      short c = (char) a[i] + 5;
      long long d = (long long) b[i] + 12;
      a[i] = c + d;
    }
}

Compiled with -O3 -mavx2 we ought to get something similar to the following (the result is stored back into the 16-bit array a, so only the low 16 bits of the wider intermediate values matter):

short a[1024], b[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      unsigned short c = ((short)(a[i] << 8) >> 8) + 5U;
      unsigned short d = b[i] + 12U;
      a[i] = c + d;
    }
}


Though even in this case I still couldn't get the sign extension to actually be performed as a 16-bit left shift followed by an arithmetic (signed) right shift, which I guess would lead to even better code.
Or look at how we vectorize:
short a[1024], b[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      unsigned char e = a[i];
      short c = e + 5;
      long long d = (long long) b[i] + 12;
      a[i] = c + d;
    }
}
(note that here the forwprop pass already performs type promotion: instead of converting a[i] to unsigned char and back to short, it computes a[i] & 255 in short mode) and how we could instead vectorize it with type demotions:
short a[1024], b[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      unsigned short c = (a[i] & 0xff) + 5U;
      unsigned short d = b[i] + 12U;
      a[i] = c + d;
    }
}

These are all admittedly artificial testcases, but I've seen tons of loops where multiple types were vectorized and I think in some portion of those loops we could either use just a single type size, or at least decrease the number of conversions and different type sizes in the vectorized loops.
Comment 1 Jakub Jelinek 2015-02-17 08:50:57 UTC
Note that for masked loads/stores we already have the ability to clone a loop and perform changes only on the copy that will be vectorized, not on the scalar code that is used when the loop isn't vectorized.  So, if this is done during or after ifcvt, we can still perform such changes there even when they would be desirable only for vectorization and not for scalar code in general.
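
A rough sketch of that idea in hand-written C (a hypothetical illustration only, not actual ifcvt output; the selection condition is left abstract):

short a[1024], b[1024];

void
foo_versioned (int vectorizable)
{
  int i;
  if (vectorizable)
    {
      /* Copy intended for the vectorizer: everything in one 16-bit type.  */
      for (i = 0; i < 1024; i++)
        {
          unsigned short c = ((short)(a[i] << 8) >> 8) + 5U;
          unsigned short d = b[i] + 12U;
          a[i] = c + d;
        }
    }
  else
    {
      /* Scalar fallback keeps the original types untouched.  */
      for (i = 0; i < 1024; i++)
        {
          short c = (char) a[i] + 5;
          long long d = (long long) b[i] + 12;
          a[i] = c + d;
        }
    }
}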
Comment 2 Richard Biener 2015-02-17 10:47:49 UTC
Confirmed.  Btw, this is just the lack of shorten_* on the GIMPLE level.  For
your match.pd pattern it is a matter of simplifying/changing it slightly:

Index: match.pd
===================================================================
--- match.pd    (revision 220751)
+++ match.pd    (working copy)
@@ -1031,31 +1031,22 @@ (define_operator_list CBRT BUILT_IN_CBRT
    operation and convert the result to the desired type.  */
 (for op (plus minus)
   (simplify
-    (convert (op (convert@2 @0) (convert@3 @1)))
+    (convert (op:c@4 (convert@2 @0) (convert?@3 @1)))
     (if (INTEGRAL_TYPE_P (type)
-        /* We check for type compatibility between @0 and @1 below,
-           so there's no need to check that @1/@3 are integral types.  */
         && INTEGRAL_TYPE_P (TREE_TYPE (@0))
-        && INTEGRAL_TYPE_P (TREE_TYPE (@2))
+        && INTEGRAL_TYPE_P (TREE_TYPE (@4))
         /* The precision of the type of each operand must match the
            precision of the mode of each operand, similarly for the
            result.  */
         && (TYPE_PRECISION (TREE_TYPE (@0))
             == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (@0))))
-        && (TYPE_PRECISION (TREE_TYPE (@1))
-            == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (@1))))
-        && TYPE_PRECISION (type) == GET_MODE_PRECISION (TYPE_MODE (type))
         /* The inner conversion must be a widening conversion.  */
         && TYPE_PRECISION (TREE_TYPE (@2)) > TYPE_PRECISION (TREE_TYPE (@0))
-        && ((GENERIC 
-             && (TYPE_MAIN_VARIANT (TREE_TYPE (@0))
-                 == TYPE_MAIN_VARIANT (TREE_TYPE (@1)))
-             && (TYPE_MAIN_VARIANT (TREE_TYPE (@0))
-                 == TYPE_MAIN_VARIANT (type)))
-            || (GIMPLE
-                && types_compatible_p (TREE_TYPE (@0), TREE_TYPE (@1))
-                && types_compatible_p (TREE_TYPE (@0), type))))
+        /* The final precision should match that of operand @0.  */
+        && TYPE_PRECISION (type) == TYPE_PRECISION (TREE_TYPE (@0))
+        /* Make sure the wide operation is dead after the transform.  */
+        && (TREE_CODE (@4) != SSA_NAME || has_single_use (@4)))
       (if (TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
-       (convert (op @0 @1)))
+       (convert (op @0 (convert @1))))
       (with { tree utype = unsigned_type_for (TREE_TYPE (@0)); }
        (convert (op (convert:utype @0) (convert:utype @1)))))))

to allow for constant @1 (where the convert would be folded into the constant
itself).  And we then get before vectorization:

  <bb 3>:
  # i_17 = PHI <i_13(4), 0(2)>
  # ivtmp_30 = PHI <ivtmp_29(4), 1024(2)>
  _5 = a[i_17];
  _6 = (char) _5;
  _7 = (unsigned short) _6;
  _9 = b[i_17];
  _3 = (unsigned short) _9;
  _14 = _3 + 17;
  _10 = _7 + _14;
  _11 = (short int) _10;
  a[i_17] = _11;
  i_13 = i_17 + 1;
  ivtmp_29 = ivtmp_30 - 1;
  if (ivtmp_29 != 0)
    goto <bb 4>;

and much better code.

I've added a single-use restriction to the arithmetic result and allowed
the commutated form to match so we also handle (short)((long long)x + ((long long)y + 12)).
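
In source terms, the adjusted pattern roughly corresponds to the rewrite below (a hand-written illustration, not compiler output; the unsigned short intermediates mirror the unsigned_type_for branch of the pattern, which avoids introducing signed overflow in the narrower type; C's integer promotions still make the additions int here, but on GIMPLE they stay in unsigned short):

short x, y;

/* Before: the additions are carried out in long long and the result
   is truncated to short.  */
short
wide (void)
{
  return (short) ((long long) x + ((long long) y + 12));
}

/* After: the same value via 16-bit unsigned intermediates, because
   only the low 16 bits survive the final truncation.  */
short
narrow (void)
{
  return (short) ((unsigned short) x + ((unsigned short) y + 12));
}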
Comment 3 Jack Howarth 2015-02-17 15:34:09 UTC
I assume that https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65088 is a duplicate?
Comment 4 Jeffrey A. Law 2015-02-17 15:41:50 UTC
It's related.  There are enough of these BZs that it might make sense to create a meta-bug to track all of them.  I'm particularly interested in the narrowing case, as replacing shorten_binary_op/shorten_compare in the C/C++ front ends with match.pd patterns seems like a long-term maintenance win.
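
For reference, the kind of narrowing meant here looks roughly like this (illustrative only):

unsigned char p, q, r;

void
narrow_add (void)
{
  /* The usual arithmetic conversions nominally compute
     (int) p + (int) q and truncate the sum back to unsigned char.
     Since only the low 8 bits survive, the addition can be done
     directly in unsigned char.  That is the narrowing
     shorten_binary_op performs in the front ends and that match.pd
     patterns would perform on GIMPLE, where the vectorizer also
     benefits.  */
  r = p + q;
}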
Comment 5 Jeffrey A. Law 2015-02-20 23:24:09 UTC
Some of the stuff you're doing in that patch matches what I was poking at as well (for a different BZ).  There's clearly much room for improvement here and if we weren't in stage4, I'd be pushing harder on expanding what these patterns do now rather than waiting.

I'm already of the mind that we're going to want to factor some of that type testing so we're not repeating it in a half-dozen patterns.
Comment 6 rguenther@suse.de 2015-02-23 09:16:57 UTC
On Fri, 20 Feb 2015, law at redhat dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65084
> 
> --- Comment #5 from Jeffrey A. Law <law at redhat dot com> --- Some of 
> the stuff you're doing in that patch matches what I was poking at as 
> well (for a different BZ).  There's clearly much room for improvement 
> here and if we weren't in stage4, I'd be pushing harder on expanding 
> what these patterns do now rather than waiting.

Note that for aggressive narrowing (which eventually makes sense for
early GIMPLE) and for widening (which makes sense towards RTL expansion
with the target's word_mode / PROMOTE_MODE in mind) having a separate
pass that deals with the actual transform step (but re-using the
matching from the match.pd code) makes sense - triggering the transform
from forwprop via fold_stmt creates garbage that can confuse single-use
tests (yeah, it's on my list to investigate whether we can
improve here in a somewhat generic way for GCC 6).

As in this bug's summary - what are the cases where vectorization
is bad because of the lack of widening?

> I'm already of the mind that we're going to want to factor some of that 
> type testing so we're not repeating it in a half-dozen patterns.

Yeah, that makes sense.
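
To illustrate the two directions mentioned above (a hand-written approximation; C's integer promotions blur the picture a bit, and the interesting part is the GIMPLE-level type in which the arithmetic is carried out):

short s[1024];

/* Shape an early-GIMPLE narrowing pass would aim for: the whole
   computation chain stays in 16-bit types.  */
void
narrow_form (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      unsigned short t = (unsigned short) (s[i] + 3);
      t = (unsigned short) (t * 5);
      s[i] = (short) t;
    }
}

/* Shape preferred close to RTL expansion on a target whose word_mode /
   PROMOTE_MODE favor int-sized operations: the intermediates stay in
   int, so no extensions are needed between the operations, and there
   is a single truncation at the store.  */
void
widened_form (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      int t = s[i] + 3;
      t = t * 5;
      s[i] = (short) t;
    }
}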
Comment 7 Andrew Pinski 2021-08-19 01:20:46 UTC
I see the first testcase produces:
.L2:
        vpand   a+32(%rax), %ymm4, %ymm1
        vpand   a(%rax), %ymm4, %ymm0
        addq    $64, %rax
        vpaddw  b-32(%rax), %ymm2, %ymm5
        vpackuswb       %ymm1, %ymm0, %ymm0
        vpermq  $216, %ymm0, %ymm0
        vextracti128    $0x1, %ymm0, %xmm1
        vpmovsxbw       %xmm0, %ymm0
        vpmovsxbw       %xmm1, %ymm1
        vpaddw  %ymm3, %ymm0, %ymm0
        vpaddw  %ymm3, %ymm1, %ymm1
        vpaddw  %ymm5, %ymm1, %ymm1
        vpaddw  b-64(%rax), %ymm2, %ymm5
        vmovdqa %ymm1, a-32(%rax)
        vpaddw  %ymm5, %ymm0, %ymm0
        vmovdqa %ymm0, a-64(%rax)
        cmpq    $2048, %rax
        jne     .L2

While the second one produces:
.L6:
        vmovdqa a(%rax), %ymm3
        vpaddw  b(%rax), %ymm2, %ymm1
        addq    $32, %rax
        vpsllw  $8, %ymm3, %ymm0
        vpsraw  $8, %ymm0, %ymm0
        vpaddw  %ymm1, %ymm0, %ymm0
        vmovdqa %ymm0, a-32(%rax)
        cmpq    $2048, %rax
        jne     .L6
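
For comparison with the sources above, my reading of what each 16-bit lane of the second loop computes (hand-decoded from the asm, so treat it as an approximation; the constant 17 is the folded 5 + 12, matching the GIMPLE dump in comment 2):

short a[1024], b[1024];

void
foo_second_loop_scalar (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    /* vpsllw/vpsraw sign-extend the low byte of a[i] within its 16-bit
       lane, vpaddw adds b[i] plus the folded constant, vmovdqa stores.  */
    a[i] = (short) (((short) (a[i] << 8) >> 8) + b[i] + 17);
}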