If you have something like this: inline unsigned qPremultiply(unsigned x) { const unsigned a = x >> 24; if (a == 255) return x; unsigned t = (x & 0xff00ff) * a; t = (t + ((t >> 8) & 0xff00ff) + 0x800080) >> 8; t &= 0xff00ff; x = ((x >> 8) & 0xff) * a; x = (x + ((x >> 8) & 0xff) + 0x80); x &= 0xff00; return x | t | (a << 24); } Gcc will vectorize it so that the longer calculation is always performed and with an added blend in the end to merge the two different return values. This is however unnecessary as the calculation will give the same result, and thus the blend can be saved. Also in any case it is actually a bit unsafe to vectorize as the performance difference between the two branches is substantial, and it happens that in this case the short-cut is likely to be valid most of the time, so a nonvectorized loop might be faster than a vectorized one by doing a lot less. The latter can be fixed, if the short-cut was also vectorized, for instance making the test for 4 values at a time and skip the long route if none of them need it.
Note it might be hard to figure out for the compiler that the result for a==255 will leave the input unchanged, but you can observe the same if you instead test for a == 0 (and return 0). In that case the compiler should have enough math deduction to be able to tell that the result of a==0 is always 0.
I don't see any vectorization being done on the testcase. Please specify the GCC version you tested as well as the command-line flags and eventually complete the testcase (there's no loop here?).
You need to add the loop around it void test(unsigned *buffer, int count) { for (int i = 0; i < count; ++i) buffer[i] = qPremultiply(buffer[i]); }
Created attachment 43995 [details] gccbug85406.cpp This version compiles with a pcmpeqd and pandn instead of a blend, but the principle is the same. Though the last of a ptest in the beginning is worse, as that risks a performance regression compared to non-vectorized.
I think it's too much asked from GCC to prove that if a == 255 the computation in the else path would result in x. The simplest fix would be to remove the conditional in the source. It's true that GCC doesn't evaluate costs of the vectorization properly as it looks at the if-converted copy when calculating the cost of the scalar loop. OTOH any estimate on how often the shortcut triggers compared to the computation might be off and a conservative estimate may cause us to not vectorize and thus slow down the code.
Yeah, the a==255 was actually not a case I would expect the compiler to solve, which is why I changed the example to the a==0 case, which should be solveable using existing constant propagation. Note you can put both short-cuts in, though as it standards only gcc 7 and 8 can vectorize it with two conditions, so we cant use that in general code as we need it to be fast elsewhere too.
I Noticed clang/LLVM does not do this either nor ICC.