Consider the following code === begin code === #define LENGTH 512 #define STRIDE 32 char src[LENGTH]; char dst[LENGTH]; static const int height[2] = { 32, 16 }; static const int width[2] = { 16, 8 }; volatile int result; void foo(int height, int width) { char * ptr_src = src; char * ptr_dst = dst; for( int y = 0; y < height; y++ ) { for( int x = 0; x < width; x++ ) ptr_dst[x] = ptr_src[x] + ptr_src[x]; ptr_dst += STRIDE; ptr_src += STRIDE; } } int main(int argc, char *argv[]) { for (int i = 0; i < LENGTH; i++) { src[i] = i % 256; } int idx = argc % 2; int h = height[idx]; int w = width[idx]; foo(h, w); result = dst[argc]; } === end code === The inner loop boundary "width" can be 16 or 8. Compiled with "-O3", gcc generates nested loops for foo. But if we can make use of 16 and 8 to versioning the code (e.g. 3 versions for 8, 16, general), the inner loop can be removed and fully vectorized (for both AArch64 and X86_64 architecture), the performance will be much better as there is single loop. The case is simplified from real benchmark and foo is very hot.
So you ask for main to be converted to if (idx == 0) foo_32_16 (); else /* idx == 1 */ foo_16_8 (); correct? It shoulds like an interesting idea for an IPA-CP cloning opportunity. But I wonder how much real-world testcases exists that are not benchmarks ;) One could tell by implementing the analysis into the param analysis stage (look for defs that are loads from a constant initialized entity, possibly multiple of them and (some of them) reusing the same memory index). A related thing would be to value-profile the indexes since when the array doesn't only contain two values it isn't feasible to clone for all possible values.
(In reply to Richard Biener from comment #1) > So you ask for main to be converted to > > if (idx == 0) > foo_32_16 (); > else /* idx == 1 */ > foo_16_8 (); > > correct? It shoulds like an interesting idea for an IPA-CP cloning > opportunity. But I wonder how much real-world testcases exists > that are not benchmarks ;) One could tell by implementing the > analysis into the param analysis stage (look for defs that are > loads from a constant initialized entity, possibly multiple of them > and (some of them) reusing the same memory index). A related thing > would be to value-profile the indexes since when the array doesn't > only contain two values it isn't feasible to clone for all possible > values. Proberbly. Actually we can just versioning "width", as "height" may be independent to "width" (my case misleads by using the same array index) and "height" doesn't affect performance too much. So the expected result could be: if (width == 16) foo_w16 (); else /* width == 8 */ foo_w8 ();