I am observing a 28% runtime regression for 456.hmmer on Zen4 with -Ofast -march=native when compared to GCC 14.2.
Samples: 530K of event 'cycles:Pu', Event count (approx.): 680879110118
  Overhead  Samples  Command          Shared Object                  Symbol
    51.45%   273953  hmmer_peak.amd6  hmmer_peak.amd64-m64-gcc42-nn  [.] P7Viterbi
    38.49%   202968  hmmer_base.amd6  hmmer_base.amd64-m64-gcc42-nn  [.] P7Viterbi

     71 │4135c0┌─ vmovd   (%r11,%rdi,4),%xmm3
   1361 │4135c6│  vpaddd  %xmm3,%xmm0,%xmm0
  29411 │4135ca│  mov     %rdi,%r8
     15 │4135cd│  vmovd   %xmm0,0x4(%rdx,%rdi,4)
   5826 │4135d3│  vmovd   (%rax,%rdi,4),%xmm4
    981 │4135d8│  vmovd   (%r10,%rdi,4),%xmm3
    725 │4135de│  vpaddd  %xmm3,%xmm4,%xmm3
   3186 │4135e2│  vmovdqa 0x47346(%rip),%xmm4
    787 │4135ea│  vpmaxsd %xmm4,%xmm3,%xmm3
   3801 │4135ef│  vpmaxsd %xmm0,%xmm3,%xmm0
  28932 │4135f4│  vmovd   %xmm0,0x4(%rdx,%rdi,4)
   2073 │4135fa│  inc     %rdi
   3464 │4135fd├── cmp    %r8,%r9
     11 │413600└── jne    4135c0 <P7Viterbi+0x1100>

vs.

        │413aa0┌─ vmovd   (%r11,%rdi,4),%xmm3
    208 │413aa6│  mov     %rdi,%r8
    393 │413aa9│  vpaddd  %xmm3,%xmm0,%xmm0
  11199 │413aad│  vmovd   %xmm0,0x4(%rdx,%rdi,4)
  11840 │413ab3│  vmovd   (%rax,%rdi,4),%xmm5
   3889 │413ab8│  vmovd   (%r10,%rdi,4),%xmm3
    340 │413abe│  vpaddd  %xmm3,%xmm5,%xmm3
   2829 │413ac2│  vmovdqa 0x48656(%rip),%xmm5
    720 │413aca│  vpmaxsd %xmm5,%xmm3,%xmm3
   1047 │413acf│  vpmaxsd %xmm0,%xmm3,%xmm0
  10698 │413ad4│  vmovd   %xmm0,0x4(%rdx,%rdi,4)
  12478 │413ada│  inc     %rdi
   2966 │413add├── cmp    %r8,%r9
      1 │413ae0└── jne    413aa0 <P7Viterbi+0x1760>

That's the scalar epilogue; -mtune-ctrl=^avx512_two_epilogues does not help. The regression also shows up on Icelake. For some reason we're dealing with branch misses here: the above loop has none for BASE but plenty with PEAK. This seems to be related to loop splitting - for PEAK we have two iterating loops while for BASE there's simply fallthrough code before. -fno-split-loops fixes this. We do not seem to realize that splitting

for (k = 1; k <= M; k++)
  {
    if (k < M)
      {
      }
  }

has the k == M loop run only once. That causes us to vectorize the epilogue loop as well. A simplified testcase looks like

int a[1024], b[1024];
void foo (int M)
{
  for (int k = 1; k <= M; ++k)
    {
      a[k] = a[k] + 1;
      if (k < M)
        b[k] = b[k] + 1;
    }
}

This was likely "caused" by the loop splitting improvements, though for the simplified testcase above the generated code is the same.

I'll note that with GCC 14 we get

fast_algorithms.c:145:10: optimized: loop split
fast_algorithms.c:133:19: optimized: Loop 3 distributed: split to 3 loops and 0 library calls.
fast_algorithms.c:133:19: optimized: Loop 5 distributed: split to 2 loops and 0 library calls.
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized: loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized: loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized: loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop with 6 iterations completely unrolled (header execution count 7100547)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:133:19: optimized: loop with 6 iterations completely unrolled (header execution count 20163246)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:133:19: optimized: loop with 6 iterations completely unrolled (header execution count 16089390)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops

while trunk does

fast_algorithms.c:145:10: optimized: loop split
fast_algorithms.c:133:19: optimized: Loop 3 distributed: split to 3 loops and 0 library calls.
fast_algorithms.c:133:19: optimized: Loop 5 distributed: split to 2 loops and 0 library calls.
fast_algorithms.c:165:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:165:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:165:19: optimized: loop vectorized using 16 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized: loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 16 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized: loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 16 byte vectors
fast_algorithms.c:133:19: optimized: loop with 2 iterations completely unrolled (header execution count 21835320)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:133:19: optimized: loop with 2 iterations completely unrolled (header execution count 13974604)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops

which is mostly the same, but neither GCC 14 nor trunk realizes that the loop resulting from splitting doesn't iterate. The loop splitting is quite pointless (though it does elide the condition).
Manually applying the desired loop splitting in the source also fixes the observed regression, though it's not yet known exactly what regressed.
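For reference, the manual split I mean looks roughly like the following on the simplified testcase - a sketch only, the point being that the k == M iteration becomes straight-line code after the loop rather than a second loop:

int a[1024], b[1024];

void foo (int M)
{
  int k;
  /* Iterations 1 .. M-1 execute both statements.  */
  for (k = 1; k < M; ++k)
    {
      a[k] = a[k] + 1;
      b[k] = b[k] + 1;
    }
  /* The final k == M iteration only updates a[], as straight-line code.  */
  if (M >= 1)
    a[M] = a[M] + 1;
}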
I saw this on LNT testers too. Unfortunately hmmer was not working for a while, so there is a relatively big interval of revisions in which this could have happened. Last year I did some work on hmmer and loop splitting. It was very sensitive to the profile update. I added logic to splitting so that if the first loop has a constant number of iterations, we update it correctly, but I did not implement the pattern matching needed to check that the second loop has a constant number of iterations. The problem is that without an intermediate FRE, IV analysis is not able to analyze it from the code we produce for the second loop. It takes the output of the first loop and it is not obviously clear what the values are... I guess what is new is that we manage to vectorize the second loop and completely lose track of the resulting code before working out it is a non-loop :)
Note the branch and exit condition are nicely present in the IL, but indeed there doesn't seem to be any infrastructure for special-casing a known constant number of iterations, and there's no cleanup done before more loop opts mess things up. I'm also not sure we'd handle this case with VRP's relation support. niter analysis tries to simplify computed nonzero conditions, but only with its weak simplify_using_initial_conditions, which isn't even able to compute M_9(D) < k_24 (aka the nonzero condition) as true. We don't simplify the computed niter expression itself. The niter expression is (unsigned int) M_9(D) - (unsigned int) k_24. I'm not sure what ranger has at its disposal for us here when trying to simplify GENERIC expressions - we could possibly try to compute its range and, if that's a singleton, use that?

<bb 13> [local count: 105119324]:
  if (M_9(D) > 1)
    goto <bb 14>; [64.00%]
  else
    goto <bb 15>; [36.00%]

<bb 14>

<bb 3> [local count: 611603345]:   // first loop
  # k_15 = PHI <k_12(8), 1(14)>
  ...
  k_12 = k_15 + 1;
  if (k_12 < M_9(D))
    goto <bb 8>; [89.00%]   // latch
  else
    goto <bb 16>; [11.00%]

<bb 16> [local count: 67276368]:
  # k_29 = PHI <k_12(3)>
  if (M_9(D) >= k_29)
    goto <bb 15>; [99.95%]
  else
    goto <bb 6>; [0.05%]

<bb 15> [local count: 105119324]:
  # k_24 = PHI <1(13), k_29(16)>   // 1 is from the loop-around branch

<bb 9> [local count: 343854870]:   // second loop
  # k_17 = PHI <k_24(15), k_23(12)>
  ...
  k_23 = k_17 + 1;
  if (M_9(D) >= k_23)
I tried get_range_query (cfun)->range_of_expr (vr, niter->niter, stmt) with (unsigned int) M_9(D) - (unsigned int) k_24 and an enabled ranger, but that indeed returns

  [irange] unsigned int [0, 1024][4294966273, +INF]

and not a singleton as expected. It seems to compute the ranges of M_9 and k_24 and fold_range them with the minus op rather than trying to use relations to simplify the subtraction. The ranges of the first and second operand are [1, 1025] and [1, 1024] respectively, which is basically [1, +INF] for our purpose (the constant array bound bounds them).

We'd also miss a way to inject niter->assumptions and niter->may_be_zero as conditions known to be true for the purpose of simplifying (in this case those don't add anything).
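The check I had in mind is roughly the following fragment (a sketch only, not a complete patch; vr is an int_range_max and niter/stmt come from the surrounding niter analysis):

  /* Sketch: compute the range of the GENERIC niter expression at the
     exit statement and check whether it is a compile-time constant.  */
  int_range_max vr;
  if (get_range_query (cfun)->range_of_expr (vr, niter->niter, stmt)
      && vr.singleton_p ())
    {
      /* Here the split-off loop would be known to iterate a constant
         number of times; for this testcase one would hope for [0, 0].  */
    }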
(In reply to Richard Biener from comment #5)
> I tried get_range_query (cfun)->range_of_expr (vr, niter->niter, stmt) with
> (unsigned int) M_9(D) - (unsigned int) k_24 and an enabled ranger,
> but that indeed returns [irange] unsigned int [0, 1024][4294966273, +INF]
> and not a singleton as expected.
> It seems to compute the ranges of M_9 and k_24 and fold_range them with
> the minus op rather than trying to use relations to simplify the
> subtraction. The ranges of the first and second operand are [1, 1025]
> and [1, 1024] respectively, which is basically [1, +INF] for our purpose
> (the constant array bound bounds them).
>
> We'd also miss a way to inject niter->assumptions and niter->may_be_zero
> as conditions known to be true for the purpose of simplifying (in this
> case those don't add anything).

It will always use ranges, and utilize any known relation in addition to that. range-op.cc : minus_op1_op2_relation_effect () is used for that purpose, but it can only adjust something to a known subrange, i.e. if M_9 > k_24, then it will limit whatever the calculated range is to [1, +INF], so we can trim out any negatives.

What singleton are you expecting? (and where?)

I hacked a VRP pass right after loop interchange and after the loop I see:

=========== BB 16 ============
M_9(D)  [irange] int [2, 1024]
Relational : (k_12 >= M_9(D))
<bb 16> [local count: 67276368]:
goto <bb 15>; [100.00%]

but I'm not sure how we use relations to come up with a singleton constant? The loop backedge has:

=========== BB 8 ============
k_12  [irange] int [2, 1024]
Relational : (M_9(D) > k_15)
Relational : (k_12 < M_9(D))
<bb 8> [local count: 544326978]:
goto <bb 3>; [100.00%]

so we always also know M_9 > either k, I'm just not sure how that helps. OR are we talking about the second loop? I'm having difficulty seeing where I might find a singleton?
I'm talking about the number of iterations of the second loop (after loop splitting); the niter expression is (unsigned int) M_9(D) - (unsigned int) k_24. We know k_29 == M_9(D) when exiting the first loop, so on that entry the expression computes zero. For the case where the first loop is short-cut we know M_9(D) == 1 and thus the difference is zero as well.

So I expect the range of (unsigned int) M_9(D) - (unsigned int) k_24 to be [0, 0]. But maybe I'm missing something?
(In reply to Richard Biener from comment #7)
> I'm talking about the number of iterations of the second loop (after loop
> splitting); the niter expression is (unsigned int) M_9(D) - (unsigned int)
> k_24.
> We know k_29 == M_9(D) when exiting the first loop, so on that entry the
> expression computes zero. For the case where the first loop is short-cut
> we know M_9(D) == 1 and thus the difference is zero as well.
>
> So I expect the range of (unsigned int) M_9(D) - (unsigned int) k_24 to be
> [0, 0]. But maybe I'm missing something?

I guess the issue is that with

  # k_24 = PHI <1(13), k_29(16)>

to easily see this we'd have to compute the range of (unsigned int) M_9(D) - 1 and the range of (unsigned int) M_9(D) - (unsigned) k_29 and then see that those are the same singleton. I don't think we can arrive at that when using the range of k_24 itself, so maybe I'm asking too much of VRP here.
> > But maybe I'm missing something?
>
> I guess the issue is that with
>
>   # k_24 = PHI <1(13), k_29(16)>
>
> to easily see this we'd have to compute the range of
> (unsigned int) M_9(D) - 1 and the range of (unsigned int) M_9(D) -
> (unsigned) k_29 and then see that those are the same singleton. I don't
> think we can arrive at that when using the range of k_24 itself, so maybe
> I'm asking too much of VRP here.

Yes, I don't see how ranger can easily figure out the singleton range, since the final value of the induction variable is variable. Last year I looked into it and I kind of remember implementing special matching code, but I cannot find the patch anymore. The code checking loop1->any_estimate can be generalized to check for an estimate of one of the loops and adjust the other accordingly.

It is the scev_cprop pass that replaces uses of an IV value outside of the loop by its bound. As in:

#include <stdio.h>
int test(int n)
{
  int i;
  if (n < 0)
    return -1;
  for (i = 0; i < n; i++)
    printf ("%i\n", i);
  return i;
}

Maybe we want to be able to run it on a specific loop, and run it on loop1 between the point where splitting finishes updating the IL and the profile update?
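For the example above the effect is, roughly, that the out-of-loop use of i gets rewritten to the loop bound - a sketch of the expected result, not an actual dump:

#include <stdio.h>
int test(int n)
{
  int i;
  if (n < 0)
    return -1;
  for (i = 0; i < n; i++)
    printf ("%i\n", i);
  return n;  /* the use of i after the loop replaced by its final value */
}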
On Thu, 5 Dec 2024, hubicka at ucw dot cz wrote:
> Yes, I don't see how ranger can easily figure out the singleton range,
> since the final value of the induction variable is variable.
> Last year I looked into it and I kind of remember implementing special
> matching code, but I cannot find the patch anymore. The code checking
> loop1->any_estimate can be generalized to check for an estimate of one
> of the loops and adjust the other accordingly.
>
> It is the scev_cprop pass that replaces uses of an IV value outside of
> the loop by its bound. As in:
> #include <stdio.h>
> int test(int n)
> {
>   int i;
>   if (n < 0)
>     return -1;
>   for (i = 0; i < n; i++)
>     printf ("%i\n", i);
>   return i;
> }
>
> Maybe we want to be able to run it on a specific loop, and run it on
> loop1 between the point where splitting finishes updating the IL and the
> profile update?

That can be done already; I refactored the code to be run on a single loop a while back. Just call final_value_replacement_loop (loop). Special-casing it to a single PHI might be possible as well, of course.
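Roughly like the following - just a sketch, glossing over exactly the loop-closed-SSA requirement discussed below ('loop1' denoting the loop that runs first after splitting):

  /* Sketch: after splitting has created the second loop, express the
     final value of loop1's IV directly so that the entry value (and
     thus the niter expression) of the second loop becomes analyzable.
     Requires loop-closed SSA form for loop1's exit PHIs.  */
  if (final_value_replacement_loop (loop1))
    {
      /* The second loop's niter expression may now fold to a constant.  */
    }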
(In reply to Richard Biener from comment #8)
> (In reply to Richard Biener from comment #7)
> > I'm talking about the number of iterations of the second loop (after loop
> > splitting); the niter expression is (unsigned int) M_9(D) - (unsigned int)
> > k_24.
> > We know k_29 == M_9(D) when exiting the first loop, so on that entry the
> > expression computes zero. For the case where the first loop is short-cut
> > we know M_9(D) == 1 and thus the difference is zero as well.
> >
> > So I expect the range of (unsigned int) M_9(D) - (unsigned int) k_24 to
> > be [0, 0]. But maybe I'm missing something?
>
> I guess the issue is that with
>
>   # k_24 = PHI <1(13), k_29(16)>
>
> to easily see this we'd have to compute the range of
> (unsigned int) M_9(D) - 1 and the range of (unsigned int) M_9(D) -
> (unsigned) k_29 and then see that those are the same singleton. I don't
> think we can arrive at that when using the range of k_24 itself, so maybe
> I'm asking too much of VRP here.

Is there any place smart enough to recognize that, because the first loop is a bump by one with the core

  # k_15 = PHI <k_12(8), 1(14)>
  <...>
  k_12 = k_15 + 1;
  if (k_12 < M_9(D))

the exit condition could be rewritten to

  if (k_12 == M_9(D))

? The loop is only entered by the branch if M_9(D) > 1 feeding the 1(14) initializer, so the relation k_15 < M_9 is always true. (We could know that with a little work.) That might enable a couple of things. (Ranger will also have a more precise relation on that exit edge, which may also help in other unexpected ways.) I'm coincidentally looking at PHI argument relations for PR 116753 (we currently do nothing other than basic equality if all the arguments are in the same equivalence set).

  k_12 = k_15 + 1;
  if (k_12 < M_9(D))
    goto <bb 8>; [89.00%]   // latch
  else
    goto <bb 16>; [11.00%]

<bb 16> [local count: 67276368]:
  # k_29 = PHI <k_12(3)>
  if (M_9(D) >= k_29)
    goto <bb 15>; [99.95%]
  else
    goto <bb 6>; [0.05%]    X

<bb 15> [local count: 105119324]:
  # k_24 = PHI <1(13), k_29(16)>

The second branch gets eliminated anyway, but feeding into the k_24 PHI we'll be able to recognize k_29 == M_9(D) with the condition rewritten. It's also not particularly difficult to determine that the constant 1 is actually a rewritten M_9(D) if we tried hard enough - ranger knows that M_9(D) is [1, 1] on that edge. With that info, it would not be a stretch to be able to put M_9(D) and k_24 in the same equivalency set... That could give you the 0 result you are looking for. If the two operands are in an equivalency set, operator_minus will DTRT:

  if (rel == VREL_EQ)
    rel_range = int_range<2> (type, wi::zero (prec), wi::zero (prec));
I tried final_value_replacement_loop on a simplified testcase where the second loop has a known number of iterations:

void foo(int *a, int *b, int n)
{
  if (n > 3 && n < 10)
    for (int i = 0; i <= n; i++)
      a[i] += b[i < n - 3 ? i - 1 : i];
}

(The testcase is truly artificial, but lsplit disables itself very happily on more natural candidates.)

The situation is not so easy, since final_value_replacement_loop needs loop-closed SSA form, which is rebuilt only after the pass is done. Also, even with LCSSA it does not seem to expose the invariant...
I see about a 34% regression on AArch64 as well.
On LoongArch, comparing r15-5894-gaf9a3fe6a52974 against r15-5315-g5cf7ffe047681d, the performance of 456.hmmer drops by 27.8%.
For aarch64, the regression is between r15-5312-g9676da9fc26c9b and r15-5729-g56029c91dcadcf.
(In reply to Jan Hubicka from comment #3)
> I saw this on LNT testers too. Unfortunately hmmer was not working for a
> while, so there is a relatively big interval of revisions in which this
> could have happened.

Looking into why it was failing: the default C standard changed to C23, so the SPEC config had to be changed to include -std=gnu17 - GCC itself wasn't broken. So in theory someone could do a bisection with a decent SPEC config.