Bug 91374 - [Missed optimization] Versioning opportunities to improve performance
Summary: [Missed optimization] Versioning opportunities to improve performance
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: ipa (show other bugs)
Version: 10.0
: P3 enhancement
Target Milestone: ---
Assignee: fxue
URL:
Keywords: missed-optimization
Depends on:
Blocks: spec
  Show dependency treegraph
 
Reported: 2019-08-06 06:33 UTC by Hao Liu
Modified: 2019-12-16 01:41 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-08-07 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Hao Liu 2019-08-06 06:33:21 UTC
Consider the following code

=== begin code ===

#define LENGTH 512
#define STRIDE 32

char src[LENGTH];
char dst[LENGTH];
static const int height[2] = { 32, 16 };
static const int width[2] = { 16, 8 };

volatile int result;

void foo(int height, int width) {
    char * ptr_src = src;
    char * ptr_dst = dst;

    for( int y = 0; y < height; y++ )
    {
        for( int x = 0; x < width; x++ )
            ptr_dst[x] = ptr_src[x] + ptr_src[x];
        ptr_dst += STRIDE;
        ptr_src += STRIDE;
    }
}

int main(int argc, char *argv[]) {
    for (int i = 0; i < LENGTH; i++) {
        src[i] = i % 256;
    }

    int idx = argc % 2;
    int h = height[idx];
    int w = width[idx];
    foo(h, w);

    result = dst[argc];
}

=== end code ===

The inner loop boundary "width" can be 16 or 8. Compiled with "-O3", gcc generates nested loops for foo.

But if we can make use of 16 and 8 to versioning the code (e.g. 3 versions for 8, 16, general), the inner loop can be removed and fully vectorized (for both AArch64 and X86_64 architecture), the performance will be much better as there is single loop. 

The case is simplified from real benchmark and foo is very hot.
Comment 1 Richard Biener 2019-08-07 07:49:03 UTC
So you ask for main to be converted to

 if (idx == 0)
   foo_32_16 ();
 else /* idx == 1 */
   foo_16_8 ();

correct?  It shoulds like an interesting idea for an IPA-CP cloning
opportunity.  But I wonder how much real-world testcases exists
that are not benchmarks ;)  One could tell by implementing the
analysis into the param analysis stage (look for defs that are
loads from a constant initialized entity, possibly multiple of them
and (some of them) reusing the same memory index).  A related thing
would be to value-profile the indexes since when the array doesn't
only contain two values it isn't feasible to clone for all possible
values.
Comment 2 Hao Liu 2019-08-07 08:53:48 UTC
(In reply to Richard Biener from comment #1)
> So you ask for main to be converted to
> 
>  if (idx == 0)
>    foo_32_16 ();
>  else /* idx == 1 */
>    foo_16_8 ();
> 
> correct?  It shoulds like an interesting idea for an IPA-CP cloning
> opportunity.  But I wonder how much real-world testcases exists
> that are not benchmarks ;)  One could tell by implementing the
> analysis into the param analysis stage (look for defs that are
> loads from a constant initialized entity, possibly multiple of them
> and (some of them) reusing the same memory index).  A related thing
> would be to value-profile the indexes since when the array doesn't
> only contain two values it isn't feasible to clone for all possible
> values.

Proberbly. Actually we can just versioning "width", as "height" may be independent to "width" (my case misleads by using the same array index) and "height" doesn't affect performance too much. So the expected result could be:

 if (width == 16)
   foo_w16 ();
 else /* width == 8 */
   foo_w8 ();