Bug 85406 - Unnecessary blend when vectorizing short-cutted calculations
Summary: Unnecessary blend when vectorizing short-cutted calculations
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: unknown
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2018-04-15 15:05 UTC by Allan Jensen
Modified: 2021-09-05 02:58 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-04-18 00:00:00


Attachments
gccbug85406.cpp (238 bytes, text/plain)
2018-04-20 08:28 UTC, Allan Jensen
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Allan Jensen 2018-04-15 15:05:42 UTC
If you have something like this:

inline unsigned qPremultiply(unsigned x)
{
    const unsigned a = x >> 24;
    if (a == 255)
      return x;

    unsigned t = (x & 0xff00ff) * a;
    t = (t + ((t >> 8) & 0xff00ff) + 0x800080) >> 8;
    t &= 0xff00ff;

    x = ((x >> 8) & 0xff) * a;
    x = (x + ((x >> 8) & 0xff) + 0x80);
    x &= 0xff00;
    return x | t | (a << 24);

}

Gcc will vectorize it so that the longer calculation is always performed and with an added blend in the end to merge the two different return values. This is however unnecessary as the calculation will give the same result, and thus the blend can be saved.

Also in any case it is actually a bit unsafe to vectorize as the performance difference between the two branches is substantial, and it happens that in this case the short-cut is likely to be valid most of the time, so a nonvectorized loop might be faster than a vectorized one by doing a lot less.

The latter can be fixed, if the short-cut was also vectorized, for instance making the test for 4 values at a time and skip the long route if none of them need it.
Comment 1 Allan Jensen 2018-04-15 15:14:04 UTC
Note it might be hard to figure out for the compiler that the result for a==255 will leave the input unchanged, but you can observe the same if you instead test for a == 0 (and return 0). In that case the compiler should have enough math deduction to be able to tell that the result of a==0 is always 0.
Comment 2 Richard Biener 2018-04-18 09:56:11 UTC
I don't see any vectorization being done on the testcase.  Please specify the GCC version you tested as well as the command-line flags and eventually complete the testcase (there's no loop here?).
Comment 3 Allan Jensen 2018-04-20 08:05:38 UTC
You need to add the loop around it

void test(unsigned *buffer, int count)
{
    for (int i = 0; i < count; ++i)
        buffer[i] = qPremultiply(buffer[i]);
}
Comment 4 Allan Jensen 2018-04-20 08:28:41 UTC
Created attachment 43995 [details]
gccbug85406.cpp

This version compiles with a pcmpeqd and pandn instead of a blend, but the principle is the same.

Though the last of a ptest in the beginning is worse, as that risks a performance regression compared to non-vectorized.
Comment 5 Richard Biener 2018-04-20 08:43:58 UTC
I think it's too much asked from GCC to prove that if a == 255 the computation
in the else path would result in x.  The simplest fix would be to remove the
conditional in the source.

It's true that GCC doesn't evaluate costs of the vectorization properly
as it looks at the if-converted copy when calculating the cost of the
scalar loop.  OTOH any estimate on how often the shortcut triggers
compared to the computation might be off and a conservative estimate may
cause us to not vectorize and thus slow down the code.
Comment 6 Allan Jensen 2018-04-20 17:09:22 UTC
Yeah, the a==255 was actually not a case I would expect the compiler to solve, which is why I changed the example to the a==0 case, which should be solveable using existing constant propagation.

Note you can put both short-cuts in, though as it standards only gcc 7 and 8 can vectorize it with two conditions, so we cant use that in general code as we need it to be fast elsewhere too.
Comment 7 Andrew Pinski 2021-09-05 02:58:27 UTC
I Noticed clang/LLVM does not do this either nor ICC.