Bug 91460 - gcc -mprefer-vector-width=256 is slower than -mprefer-vector-width=128 for some loops
Status: RESOLVED DUPLICATE of bug 88915
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 10.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: vectorizer
Reported: 2019-08-15 16:51 UTC by Sunil Pandey
Modified: 2019-08-16 09:21 UTC
CC List: 2 users

See Also:
Host:
Target: x86_64-*-*, i?86-*-*, aarch64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-08-15 00:00:00


Description Sunil Pandey 2019-08-15 16:51:14 UTC
1 static inline void pixel_avg( uint8_t *dst,  int     i_dst_stride,
2                              uint8_t *src1, int i_src1_stride,
3                              uint8_t *src2, int i_src2_stride,
4                               int i_width, int i_height )
5 {
6     for( int y = 0; y < i_height; y++ )
7     {
8         for( int x = 0; x < i_width; x++ )
9             dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
10         dst  += i_dst_stride;
11         src1 += i_src1_stride;
12         src2 += i_src2_stride;
13     }
14 }

Suppose the code above is in a hot loop.

If i_width is between 16 and 32, -mprefer-vector-width=128 gives about a 6% performance improvement over -mprefer-vector-width=256.

i_width must be at least 16 to trigger 128-bit vectorization at line 8.

i_width must be at least 32 to trigger 256-bit vectorization at line 8.
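
The two thresholds follow from the number of uint8_t lanes per vector: 16 in a 128-bit vector and 32 in a 256-bit one. Below is a minimal sketch of that arithmetic, assuming a simple "vector main loop plus scalar tail" model. The helper names lanes and vector_elems are hypothetical, and the real vectorizer may also emit a vectorized epilogue.

```c
#include <assert.h>

/* Lanes per vector for a given vector width (bits) and element size (bytes). */
static int lanes(int vector_bits, int elem_bytes)
{
    assert(vector_bits > 0 && elem_bytes > 0);
    return vector_bits / (8 * elem_bytes);
}

/* Elements covered by the vector main loop for a trip count of n;
   the remaining n - vector_elems(...) elements fall to the tail. */
static int vector_elems(int n, int vector_bits, int elem_bytes)
{
    int vf = lanes(vector_bits, elem_bytes);
    return n < 0 ? 0 : (n / vf) * vf;
}
```

Under this model, with i_width = 24 the 128-bit main loop covers 16 elements, while the 256-bit main loop covers none and only its entry check is paid.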
Comment 1 H.J. Lu 2019-08-15 19:17:54 UTC
Consider this testcase:

---
int block[9][9][9];
void foo(int row, int k, int h)
{
  /* Variable nrow ranges from 4 to 9.  */
  int nrow = ((row - 1)/3 + 1)*3 + 1;

  for (int i = nrow; i < 9; i++)
    block[k][h][i] = block[k][h][i] - 10;
}
---

Since nrow ranges from 4 to 9, the 256-bit vector loop is never
executed (the trip count is always less than 8 elements), so 256-bit
vectorization amounts to no vectorization plus additional branch cost.
Even with epilogue vectorization, the 256-bit version still has more
overhead.  When this is a hot function, 256-bit vectors can reduce
performance by 6%.
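
The trip-count argument can be checked with a small sketch. It reuses the nrow expression from foo() and assumes, purely for illustration, that row lies in [1, 9]; the helper names are hypothetical. A 256-bit vector holds 8 ints, so the 256-bit loop body runs only if the trip count reaches 8.

```c
#include <assert.h>

/* Same expression as in foo() above. */
static int nrow_for(int row)
{
    return ((row - 1) / 3 + 1) * 3 + 1;
}

/* Iterations of "for (int i = nrow; i < 9; i++)". */
static int trip_count(int row)
{
    int n = 9 - nrow_for(row);
    return n > 0 ? n : 0;
}

/* Largest trip count over the (assumed) row range [lo, hi]. */
static int max_trip_count(int lo, int hi)
{
    assert(lo <= hi);
    int best = 0;
    for (int row = lo; row <= hi; row++)
        if (trip_count(row) > best)
            best = trip_count(row);
    return best;
}
```

For the assumed range, max_trip_count(1, 9) stays below 8, so a loop body with a vectorization factor of 8 is never entered.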
Comment 2 H.J. Lu 2019-08-15 19:35:12 UTC
When the loop trip count is known, the vectorizer won't select 256-bit
vectors when 256-bit vectors can't be used.  When the loop trip count is
unknown, 256-bit vectors can be slower than 128-bit vectors, depending on
the workload.  In the case of SPEC CPU 2017, 128-bit vectors are much
faster than 256-bit vectors for a couple of benchmarks.  For most
benchmarks, there is no performance difference between 128-bit and
256-bit vectors.
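
As an illustration of the known-trip-count case, here is a hedged sketch of a single-row variant of the pixel average with the width fixed at compile time. FIXED_WIDTH and pixel_avg_row16 are hypothetical names that do not appear in the original code, and whether GCC actually chooses 128-bit vectors here depends on the compiler version and target flags.

```c
#include <assert.h>
#include <stdint.h>

enum { FIXED_WIDTH = 16 };  /* hypothetical compile-time width */

/* One row of the rounded average with a constant trip count, so the
   compiler can see that a 32-lane (256-bit) body could never run. */
static void pixel_avg_row16(uint8_t *dst, const uint8_t *src1,
                            const uint8_t *src2)
{
    assert(dst && src1 && src2);
    for (int x = 0; x < FIXED_WIDTH; x++)
        dst[x] = (uint8_t)((src1[x] + src2[x] + 1) >> 1);
}
```

With a runtime i_width, as in the original pixel_avg, the compiler has no such information and has to guard the vector loop with a trip-count check.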
Comment 3 Richard Biener 2019-08-16 07:44:09 UTC
I believe we have a duplicate PR for exactly this case and Andre is working on
this.
Comment 4 avieira 2019-08-16 09:21:46 UTC
Yes, this looks like a duplicate of PR 88915. I'll mark it as such.

*** This bug has been marked as a duplicate of bug 88915 ***