These are testcases extracted from PR 47477.

short a[1024], b[1024];
void foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      short c = (char) a[i] + 5;
      long long d = (long long) b[i] + 12;
      a[i] = c + d;
    }
}

Compiled with -O3 -mavx2 we ought to get something similar to:

short a[1024], b[1024];
void foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      unsigned short c = ((short) (a[i] << 8) >> 8) + 5U;
      unsigned short d = b[i] + 12U;
      a[i] = c + d;
    }
}

though even in this case I still couldn't get the sign extension to actually be performed as a 16-bit left + right (signed) shift, while I guess that would lead to even better code.

Or look at how we vectorize:

short a[1024], b[1024];
void foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      unsigned char e = a[i];
      short c = e + 5;
      long long d = (long long) b[i] + 12;
      a[i] = c + d;
    }
}

(note, here the forwprop pass already performs type promotion: instead of converting a[i] to unsigned char and back to short, it computes a[i] & 255 in short mode) and how we could instead, with type demotions:

short a[1024], b[1024];
void foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      unsigned short c = (a[i] & 0xff) + 5U;
      unsigned short d = b[i] + 12U;
      a[i] = c + d;
    }
}

These are all admittedly artificial testcases, but I've seen tons of loops where multiple types were vectorized, and I think in some portion of those loops we could either use just a single type size, or at least decrease the number of conversions and different type sizes in the vectorized loops.
Note, for masked loads/stores we already have the ability to version a loop and apply changes only to the copy that will be vectorized, leaving the scalar fallback untouched. So, if this is performed during or after ifcvt, we can still do such changes there even when they wouldn't be generally desirable for scalar code, only for vectorization.
Confirmed. Btw, this is just the lack of shorten_* on the GIMPLE level. For your match.pd pattern it is a matter of simplifying/changing it slightly:

Index: match.pd
===================================================================
--- match.pd	(revision 220751)
+++ match.pd	(working copy)
@@ -1031,31 +1031,22 @@ (define_operator_list CBRT BUILT_IN_CBRT
    operation and convert the result to the desired type.  */
 (for op (plus minus)
   (simplify
-   (convert (op (convert@2 @0) (convert@3 @1)))
+   (convert (op:c@4 (convert@2 @0) (convert?@3 @1)))
    (if (INTEGRAL_TYPE_P (type)
-	/* We check for type compatibility between @0 and @1 below,
-	   so there's no need to check that @1/@3 are integral types.  */
	&& INTEGRAL_TYPE_P (TREE_TYPE (@0))
-	&& INTEGRAL_TYPE_P (TREE_TYPE (@2))
+	&& INTEGRAL_TYPE_P (TREE_TYPE (@4))
	/* The precision of the type of each operand must match the
	   precision of the mode of each operand, similarly for the
	   result.  */
	&& (TYPE_PRECISION (TREE_TYPE (@0))
	    == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (@0))))
-	&& (TYPE_PRECISION (TREE_TYPE (@1))
-	    == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (@1))))
-	&& TYPE_PRECISION (type) == GET_MODE_PRECISION (TYPE_MODE (type))
	/* The inner conversion must be a widening conversion.  */
	&& TYPE_PRECISION (TREE_TYPE (@2)) > TYPE_PRECISION (TREE_TYPE (@0))
-	&& ((GENERIC
-	     && (TYPE_MAIN_VARIANT (TREE_TYPE (@0))
-		 == TYPE_MAIN_VARIANT (TREE_TYPE (@1)))
-	     && (TYPE_MAIN_VARIANT (TREE_TYPE (@0))
-		 == TYPE_MAIN_VARIANT (type)))
-	    || (GIMPLE
-		&& types_compatible_p (TREE_TYPE (@0), TREE_TYPE (@1))
-		&& types_compatible_p (TREE_TYPE (@0), type))))
+	/* The final precision should match that of operand @0.  */
+	&& TYPE_PRECISION (type) == TYPE_PRECISION (TREE_TYPE (@0))
+	/* Make sure the wide operation is dead after the transform.  */
+	&& (TREE_CODE (@4) != SSA_NAME || has_single_use (@4)))
    (if (TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
-     (convert (op @0 @1)))
+     (convert (op @0 (convert @1))))
    (with { tree utype = unsigned_type_for (TREE_TYPE (@0)); }
      (convert (op (convert:utype @0) (convert:utype @1)))))))

to allow for constant @1 (where the convert would be folded into the constant itself). And we then get before vectorization:

<bb 3>:
  # i_17 = PHI <i_13(4), 0(2)>
  # ivtmp_30 = PHI <ivtmp_29(4), 1024(2)>
  _5 = a[i_17];
  _6 = (char) _5;
  _7 = (unsigned short) _6;
  _9 = b[i_17];
  _3 = (unsigned short) _9;
  _14 = _3 + 17;
  _10 = _7 + _14;
  _11 = (short int) _10;
  a[i_17] = _11;
  i_13 = i_17 + 1;
  ivtmp_29 = ivtmp_30 - 1;
  if (ivtmp_29 != 0)
    goto <bb 4>;

and much better code. I've added a single-use restriction on the arithmetic result and allowed the commutated form to match, so we also catch (short)((long long)x + ((long long)y + 12)).
I assume that https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65088 is a duplicate?
It's related. There are enough of these BZs that it might make sense to create a meta-bug to track all of them. I'm particularly interested in the narrowing case, as replacing shorten_binary_op/shorten_compare from the C/C++ front-ends with match.pd patterns seems like a long-term maintenance win.
Some of the stuff you're doing in that patch matches what I was poking at as well (for a different BZ). There's clearly much room for improvement here and if we weren't in stage4, I'd be pushing harder on expanding what these patterns do now rather than waiting. I'm already of the mind that we're going to want to factor some of that type testing so we're not repeating it in a half-dozen patterns.
On Fri, 20 Feb 2015, law at redhat dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65084
>
> --- Comment #5 from Jeffrey A. Law <law at redhat dot com> ---
> Some of the stuff you're doing in that patch matches what I was poking
> at as well (for a different BZ).  There's clearly much room for
> improvement here and if we weren't in stage4, I'd be pushing harder on
> expanding what these patterns do now rather than waiting.

Note that for aggressive narrowing (which eventually makes sense for early GIMPLE) and for widening (which makes sense towards RTL expansion, with the target's word_mode / PROMOTE_MODE in mind), having a separate pass that does the actual transform step (but re-uses the matching from the match.pd code) makes sense - triggering the transform from forwprop via fold_stmt creates garbage that can confuse single-use tests (yeah, it's on my list of things to investigate whether we can improve in a somewhat generic way for GCC 6).

As in this bug's summary - what are the cases where vectorization is bad because of the lack of widening?

> I'm already of the mind that we're going to want to factor some of that
> type testing so we're not repeating it in a half-dozen patterns.

Yeah, that makes sense.
I see the first testcase produces:

.L2:
	vpand	a+32(%rax), %ymm4, %ymm1
	vpand	a(%rax), %ymm4, %ymm0
	addq	$64, %rax
	vpaddw	b-32(%rax), %ymm2, %ymm5
	vpackuswb	%ymm1, %ymm0, %ymm0
	vpermq	$216, %ymm0, %ymm0
	vextracti128	$0x1, %ymm0, %xmm1
	vpmovsxbw	%xmm0, %ymm0
	vpmovsxbw	%xmm1, %ymm1
	vpaddw	%ymm3, %ymm0, %ymm0
	vpaddw	%ymm3, %ymm1, %ymm1
	vpaddw	%ymm5, %ymm1, %ymm1
	vpaddw	b-64(%rax), %ymm2, %ymm5
	vmovdqa	%ymm1, a-32(%rax)
	vpaddw	%ymm5, %ymm0, %ymm0
	vmovdqa	%ymm0, a-64(%rax)
	cmpq	$2048, %rax
	jne	.L2

while the second one produces:

.L6:
	vmovdqa	a(%rax), %ymm3
	vpaddw	b(%rax), %ymm2, %ymm1
	addq	$32, %rax
	vpsllw	$8, %ymm3, %ymm0
	vpsraw	$8, %ymm0, %ymm0
	vpaddw	%ymm1, %ymm0, %ymm0
	vmovdqa	%ymm0, a-32(%rax)
	cmpq	$2048, %rax
	jne	.L6