Bug 117735 - SLP dot_prod opportunity in 525.x264_r
Summary: SLP dot_prod opportunity in 525.x264_r
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 15.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: spec vectorizer
Reported: 2024-11-22 10:23 UTC by Richard Biener
Modified: 2024-11-28 10:56 UTC
CC List: 5 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Richard Biener 2024-11-22 10:23:19 UTC
525.x264_r has

typedef unsigned char uint8_t;
void mc_chroma( uint8_t *dst, int i_dst_stride,
                       uint8_t *src, int i_src_stride,
                       int mvx, int mvy,
                       int i_width, int i_height )
{
    uint8_t *srcp;

    int d8x = mvx&0x07;
    int d8y = mvy&0x07;
    int cA = (8-d8x)*(8-d8y);
    int cB = d8x    *(8-d8y);
    int cC = (8-d8x)*d8y;
    int cD = d8x    *d8y;

    src += (mvy >> 3) * i_src_stride + (mvx >> 3);
    srcp = &src[i_src_stride];

    for( int y = 0; y < i_height; y++ )
      {
        for( int x = 0; x < i_width; x++ )
          dst[x] = ( cA*src[x]  + cB*src[x+1] + cC*srcp[x] + cD*srcp[x+1] + 32 ) >> 6;
        dst  += i_dst_stride;
        src   = srcp;
        srcp += i_src_stride;
      }
}

where the inner loop could use two dot_prodvNhiv2Nqi - iff we had an SLP
pattern recognizing this and iff we'd narrow the invariants to [us]char
(pattern recog demotes the multiply to HImode; range info on c[ABCD]
indicates they fit in QImode).

And iff we'd nail down which lanes get summed for dot_prod (other related
summing optabs are equally unspecified here, which makes them useful only
for reductions where we end up summing the result lanes anyway).
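
To make the arithmetic concrete, here is a minimal scalar sketch of one row of the inner loop rewritten as two 2-element dot products with narrowed coefficients. The names row_ref, row_dot, dot2_u8 and count_mismatches are hypothetical and exist only for illustration; dot2_u8 models a dot product that sums adjacent lane pairs into a 16-bit accumulator, which is an assumption about lane ordering and not what the optab currently guarantees.

```c
#include <assert.h>
#include <stdint.h>

/* One row of the inner loop exactly as written in mc_chroma. */
static void row_ref(uint8_t *dst, const uint8_t *src, const uint8_t *srcp,
                    int cA, int cB, int cC, int cD, int width)
{
    for (int x = 0; x < width; x++)
        dst[x] = (cA*src[x] + cB*src[x+1] + cC*srcp[x] + cD*srcp[x+1] + 32) >> 6;
}

/* Hypothetical model of a 2 x QImode -> HImode dot product with accumulate.
   Assumes adjacent lanes are the ones that get summed. */
static inline uint16_t dot2_u8(const uint8_t c[2], const uint8_t s[2], uint16_t acc)
{
    return (uint16_t)(acc + c[0]*s[0] + c[1]*s[1]);
}

/* Same row, expressed as two dot products with unsigned char coefficients. */
static void row_dot(uint8_t *dst, const uint8_t *src, const uint8_t *srcp,
                    int cA, int cB, int cC, int cD, int width)
{
    /* With d8x, d8y in [0, 7] the coefficients lie in [0, 64] and sum to 64,
       so the whole expression is at most 64*255 + 32 = 16352 (fits 16 bits). */
    assert(cA >= 0 && cB >= 0 && cC >= 0 && cD >= 0);
    assert(cA + cB + cC + cD == 64);
    const uint8_t top[2] = { (uint8_t)cA, (uint8_t)cB };
    const uint8_t bot[2] = { (uint8_t)cC, (uint8_t)cD };

    for (int x = 0; x < width; x++) {
        uint16_t acc = 32;                 /* rounding term */
        acc = dot2_u8(top, &src[x], acc);  /* cA*src[x]  + cB*src[x+1]  */
        acc = dot2_u8(bot, &srcp[x], acc); /* cC*srcp[x] + cD*srcp[x+1] */
        dst[x] = (uint8_t)(acc >> 6);
    }
}

/* Compares both variants for all 64 fractional offsets; returns the number
   of differing output bytes. */
static int count_mismatches(void)
{
    uint8_t src[17], srcp[17], a[16], b[16];
    for (int i = 0; i < 17; i++) {
        src[i]  = (uint8_t)(i * 37 + 11);
        srcp[i] = (uint8_t)(255 - i * 13);
    }
    src[3] = 255;  /* make sure the upper bound is exercised */
    src[4] = 255;

    int bad = 0;
    for (int d8y = 0; d8y < 8; d8y++)
        for (int d8x = 0; d8x < 8; d8x++) {
            int cA = (8-d8x)*(8-d8y), cB = d8x*(8-d8y);
            int cC = (8-d8x)*d8y,     cD = d8x*d8y;
            row_ref(a, src, srcp, cA, cB, cC, cD, 16);
            row_dot(b, src, srcp, cA, cB, cC, cD, 16);
            for (int i = 0; i < 16; i++)
                bad += (a[i] != b[i]);
        }
    return bad;
}
```

The sketch only shows that narrowing c[ABCD] to unsigned char and accumulating in 16 bits preserves the result; it says nothing about how an SLP pattern would actually be matched.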
Comment 1 Richard Biener 2024-11-22 10:29:58 UTC
Another "failure" for this loop nest is choosing an alias versioning check
that is invariant in the outer loop, so we could version the outer loop
instead of just the inner loop.  We also fail to implement (or schedule)
unswitching of the outer loop on the inner-loop niter checks that the
vectorizer inserts for costing and for jumping around the vector loop and
epilog.  For perfect nests (and outer loops without actual code), the
vectorizer could consider unswitching the initial epilog entry conditions
itself (at the expense of code size, of course).
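
As a rough source-level illustration of the unswitching described above, the sketch below hoists an inner-loop niter check out of the outer loop by hand. Everything in it is an assumption made for illustration: VF = 16 is an arbitrary vectorization factor, and row, mc_chroma_unswitched and identity_case_mismatches are hypothetical names. The "vector" part is still scalar code standing in for what the vectorizer would emit, and no alias check is modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { VF = 16 };  /* assumed vectorization factor, illustrative only */

/* Computes dst[from..to) for one row; stands in for either the vector
   body or the scalar epilog. */
static void row(uint8_t *dst, const uint8_t *src, const uint8_t *srcp,
                const int c[4], int from, int to)
{
    for (int x = from; x < to; x++)
        dst[x] = (c[0]*src[x] + c[1]*src[x+1]
                  + c[2]*srcp[x] + c[3]*srcp[x+1] + 32) >> 6;
}

void mc_chroma_unswitched(uint8_t *dst, int i_dst_stride,
                          const uint8_t *src, int i_src_stride,
                          int mvx, int mvy, int i_width, int i_height)
{
    int d8x = mvx & 0x07, d8y = mvy & 0x07;
    const int c[4] = { (8-d8x)*(8-d8y), d8x*(8-d8y), (8-d8x)*d8y, d8x*d8y };

    src += (mvy >> 3) * i_src_stride + (mvx >> 3);
    const uint8_t *srcp = &src[i_src_stride];

    /* i_width does not change in the outer loop, so the niter check is
       tested once here instead of once per row. */
    if (i_width >= VF) {
        int main_end = i_width - i_width % VF;
        for (int y = 0; y < i_height; y++) {
            row(dst, src, srcp, c, 0, main_end);        /* "vector" loop */
            row(dst, src, srcp, c, main_end, i_width);  /* epilog */
            dst += i_dst_stride;
            src = srcp;
            srcp += i_src_stride;
        }
    } else {
        for (int y = 0; y < i_height; y++) {
            row(dst, src, srcp, c, 0, i_width);         /* scalar only */
            dst += i_dst_stride;
            src = srcp;
            srcp += i_src_stride;
        }
    }
}

/* Sanity check: with mvx = mvy = 0 only the first coefficient (64) is
   nonzero, so each output byte must equal the source byte. Returns the
   number of bytes that differ. */
int identity_case_mismatches(int width)
{
    enum { STRIDE = 40, ROWS = 3 };
    uint8_t s[STRIDE * (ROWS + 1)], d[STRIDE * ROWS];
    assert(width >= 0 && width < STRIDE);
    for (int i = 0; i < STRIDE * (ROWS + 1); i++)
        s[i] = (uint8_t)(i * 7 + 3);
    memset(d, 0, sizeof d);

    mc_chroma_unswitched(d, STRIDE, s, STRIDE, 0, 0, width, ROWS);

    int bad = 0;
    for (int y = 0; y < ROWS; y++)
        for (int x = 0; x < width; x++)
            bad += (d[y*STRIDE + x] != s[y*STRIDE + x]);
    return bad;
}
```

The duplicated outer loop is the code size cost mentioned above; in exchange, each row no longer re-tests a condition that cannot change between rows.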