Bug 117875 - [15 Regression] 28% regression for 456.hmmer on Zen4 with -Ofast -march=native
Summary: [15 Regression] 28% regression for 456.hmmer on Zen4 with -Ofast -march=native
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 15.0
: P3 normal
Target Milestone: 15.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, needs-bisection
Depends on:
Blocks: spec
  Show dependency treegraph
 
Reported: 2024-12-02 09:05 UTC by Richard Biener
Modified: 2024-12-28 09:15 UTC (History)
9 users (show)

See Also:
Host:
Target: x86_64-*-*, aarch64*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-12-03 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Richard Biener 2024-12-02 09:05:24 UTC
I am observing a 28% runtime regression for 456.hmmer on Zen4 with -Ofast -march=native when compared to GCC 14.2.
Comment 1 Richard Biener 2024-12-02 10:46:04 UTC
Samples: 530K of event 'cycles:Pu', Event count (approx.): 680879110118                                                   
Overhead       Samples  Command          Shared Object                  Symbol                                            
  51.45%        273953  hmmer_peak.amd6  hmmer_peak.amd64-m64-gcc42-nn  [.] P7Viterbi
  38.49%        202968  hmmer_base.amd6  hmmer_base.amd64-m64-gcc42-nn  [.] P7Viterbi

    71 │4135c0┌─  vmovd         (%r11,%rdi,4),%xmm3        ▒
  1361 │4135c6│   vpaddd        %xmm3,%xmm0,%xmm0          ▒
 29411 │4135ca│   mov           %rdi,%r8                   ▒
    15 │4135cd│   vmovd         %xmm0,0x4(%rdx,%rdi,4)     ▒
  5826 │4135d3│   vmovd         (%rax,%rdi,4),%xmm4        ▒
   981 │4135d8│   vmovd         (%r10,%rdi,4),%xmm3        ◆
   725 │4135de│   vpaddd        %xmm3,%xmm4,%xmm3          ▒
  3186 │4135e2│   vmovdqa       0x47346(%rip),%xmm4        ▒
   787 │4135ea│   vpmaxsd       %xmm4,%xmm3,%xmm3          ▒
  3801 │4135ef│   vpmaxsd       %xmm0,%xmm3,%xmm0          ▒
 28932 │4135f4│   vmovd         %xmm0,0x4(%rdx,%rdi,4)     ▒
  2073 │4135fa│   inc           %rdi                       ▒
  3464 │4135fd├── cmp           %r8,%r9                    ▒
    11 │413600└── jne           4135c0 <P7Viterbi+0x1100>

vs.

       │413aa0┌─  vmovd         (%r11,%rdi,4),%xmm3        ▒
   208 │413aa6│   mov           %rdi,%r8                   ▒
   393 │413aa9│   vpaddd        %xmm3,%xmm0,%xmm0          ▒
 11199 │413aad│   vmovd         %xmm0,0x4(%rdx,%rdi,4)     ▒
 11840 │413ab3│   vmovd         (%rax,%rdi,4),%xmm5        ▒
  3889 │413ab8│   vmovd         (%r10,%rdi,4),%xmm3        ▒
   340 │413abe│   vpaddd        %xmm3,%xmm5,%xmm3          ▒
  2829 │413ac2│   vmovdqa       0x48656(%rip),%xmm5        ▒
   720 │413aca│   vpmaxsd       %xmm5,%xmm3,%xmm3          ▒
  1047 │413acf│   vpmaxsd       %xmm0,%xmm3,%xmm0          ▒
 10698 │413ad4│   vmovd         %xmm0,0x4(%rdx,%rdi,4)     ◆
 12478 │413ada│   inc           %rdi                       ▒
  2966 │413add├── cmp           %r8,%r9                    ▒
     1 │413ae0└── jne           413aa0 <P7Viterbi+0x1760>

that's the scalar epilog, -mtune-ctrl=^avx512_two_epilogues does not help.
The regression also shows up on Icelake.

For some reason we're dealing with branch misses here which we have none
for BASE for the above loop but plenty with PEAK.

This seems to be related to loop splitting - for PEAK we have two
iterating loops while for BASE there's simply fallthru code before.
-fno-split-loops fixes this.

We do not seem to realize that splitting

    for (k = 1; k <= M; k++) {
      if (k < M) {
      }
    }

has the k == M loop run only once.  That causes us to vectorize the
epilog loop as well.

A simplified testcase looks like

int a[1024], b[1024];
void foo (int M)
{
  for (int k = 1; k <= M; ++k)
    {
      a[k] = a[k] + 1;
      if (k < M)
        b[k] = b[k] + 1;
    }
}

likely "caused" by the loop splitting improvements, though for the simplified
testcase above the generated code is the same.

I'll note that with GCC 14 we do

fast_algorithms.c:145:10: optimized: loop split
fast_algorithms.c:133:19: optimized: Loop 3 distributed: split to 3 loops and 0 library calls.
fast_algorithms.c:133:19: optimized: Loop 5 distributed: split to 2 loops and 0 library calls.
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized:  loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized:  loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized:  loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop with 6 iterations completely unrolled (header execution count 7100547)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:133:19: optimized: loop with 6 iterations completely unrolled (header execution count 20163246)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:133:19: optimized: loop with 6 iterations completely unrolled (header execution count 16089390)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops

while trunk does

fast_algorithms.c:145:10: optimized: loop split
fast_algorithms.c:133:19: optimized: Loop 3 distributed: split to 3 loops and 0 library calls.
fast_algorithms.c:133:19: optimized: Loop 5 distributed: split to 2 loops and 0 library calls.
fast_algorithms.c:165:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:165:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:165:19: optimized: loop vectorized using 16 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized:  loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 16 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized:  loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 16 byte vectors
fast_algorithms.c:133:19: optimized: loop with 2 iterations completely unrolled (header execution count 21835320)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:133:19: optimized: loop with 2 iterations completely unrolled (header execution count 13974604)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops

Which is mostly the same (but all do not realize the loop from splitting
doesn't iterate).

The loop splitting is quite pointless (but it elides the condition).
Comment 2 Richard Biener 2024-12-02 12:16:40 UTC
Manually applying the desired loop splitting in the source also fixes the observed regression.  It's not exactly known what regressed though.
Comment 3 Jan Hubicka 2024-12-03 12:46:42 UTC
I saw this on LNT testers too. Unforutnately hmmer was not working for a while, so there is relatively big interval about when this can happen.

Last year I did some work on hmmer and loop splitting. It is was very sensitive about profile update.  I added logic to splitting so if the first loop has constant number of iterations, we update it correctly, but did not implement pattern matching needed to check that the second loop has constant number of iteration.  The problem is that w/o intermediate FRE iv analysis is not able to analyze it from the code we produce in second loop.  It takes output from the first loop and it is not obviously clear what the values are...

I guess what is new is that we manage to vectorize the second loop and completely lose track of resulting code before working out it is a non-loop :)
Comment 4 Richard Biener 2024-12-04 10:13:36 UTC
Note the branch and exit condition are nicely present in the IL but indeed there doesn't seem to be any infrastructure for special-casing known constant amount of iteration and there's no cleanup done before more loop opts mess things up.
I'm also not sure we'd handle this case with VRPs relation support,
niter analysis tries to simplify computed nonzero conditions, but just
with its weak simplify_using_initial_conditions which isn't even able
to compute M_9(D) < k_24 (aka the nonzero condition) as true.  We don't
simplify the computed niter expression itself.  The niter expression
is (unsigned int) M_9(D) - (unsigned int) k_24.

I'm not sure what ranger has at its disposal for us here when trying to
simplify GENERIC expressions, we could possibly try to compute it's
range and if that's singleton use that?

<bb 13> [local count: 105119324]:
if (M_9(D) > 1)
  goto <bb 14>; [64.00%]
else
  goto <bb 15>; [36.00%]

<bb 14>

<bb 3> [local count: 611603345]:  // first loop
# k_15 = PHI <k_12(8), 1(14)>
...
k_12 = k_15 + 1;
if (k_12 < M_9(D))
  goto <bb 8>; [89.00%] // latch
else
  goto <bb 16>; [11.00%]

<bb 16> [local count: 67276368]:
# k_29 = PHI <k_12(3)>
if (M_9(D) >= k_29)
  goto <bb 15>; [99.95%]
else
  goto <bb 6>; [0.05%]

<bb 15> [local count: 105119324]:
# k_24 = PHI <1(13), k_29(16)>   // 1 is from loop-around branch

<bb 9> [local count: 343854870]:  // second loop
# k_17 = PHI <k_24(15), k_23(12)>
...
k_23 = k_17 + 1;
if (M_9(D) >= k_23)
Comment 5 Richard Biener 2024-12-04 10:39:56 UTC
I tried get_range_query (cfun)->range_of_expr (vr, niter->niter, stmt) with
(unsigned int) M_9(D) - (unsigned int) k_24 and an enabled ranger
but that indeed returns [irange] unsigned int [0, 1024][4294966273, +INF]
and not a singleton as expected.
It seems to look for ranges of M_9 and k_24 and fold_range with the
minus op rather than trying to use relations to simplify the subtraction.
The ranges of the first and 2nd op are [1, 1025] and [1, 1024] respectively,
basically [1, INF] for our purpose (the constant array bound bounds them).

We'd also miss a way to inject niter->assumptions and niter->may_be_zero
as conditions known true for the purpose of simplifying (in this case
those don't add anything).
Comment 6 Andrew Macleod 2024-12-04 16:00:37 UTC
(In reply to Richard Biener from comment #5)
> I tried get_range_query (cfun)->range_of_expr (vr, niter->niter, stmt) with
> (unsigned int) M_9(D) - (unsigned int) k_24 and an enabled ranger
> but that indeed returns [irange] unsigned int [0, 1024][4294966273, +INF]
> and not a singleton as expected.
> It seems to look for ranges of M_9 and k_24 and fold_range with the
> minus op rather than trying to use relations to simplify the subtraction.
> The ranges of the first and 2nd op are [1, 1025] and [1, 1024] respectively,
> basically [1, INF] for our purpose (the constant array bound bounds them).
> 
> We'd also miss a way to inject niter->assumptions and niter->may_be_zero
> as conditions known true for the purpose of simplifying (in this case
> those don't add anything).

It will always use ranges, and utilize any known relation in addition to that.
range-op.cc :  minus_op1_op2_relation_effect () is used for that purpose, but if can only adjust something to a known subrange, ie  if m_9 > k_24, then it will limit whatever the calculated range is to  [1, +INF], so we can trim out any negatives.

What singleton are you expecting?  (and where?)    I hacked a VRP pass right after loop interchange and after the loop I see:

=========== BB 16 ============
M_9(D)  [irange] int [2, 1024]
Relational : (k_12 >= M_9(D))
    <bb 16> [local count: 67276368]:
    goto <bb 15>; [100.00%]

but IM not sure how we use relations to come up with a singleton constant?
The loop backedge has:

=========== BB 8 ============
k_12    [irange] int [2, 1024]
Relational : (M_9(D) > k_15)
Relational : (k_12 < M_9(D))
    <bb 8> [local count: 544326978]:
    goto <bb 3>; [100.00%]

so we always know also M_9 > either k, 'mm just not sure how that helps  OR are we talking about the second loop?  Im having difficulty seeing where I might find a singleton?
Comment 7 Richard Biener 2024-12-05 08:13:54 UTC
I'm talking about the number of iterations of the second loop (after loop splitting), the niter expression is (unsigned int) M_9(D) - (unsigned int) k_24.
We know k_29 == M_9(D) when exiting the first loop, so on that entry the
expression computes zero.  For the case the first loop is short-cut we know
M_9(D) == 1 and thus the difference is zero as well.

So I expect the range of (unsigned int) M_9(D) - (unsigned int) k_2 to be [0,0].

But maybe I'm missing something?
Comment 8 Richard Biener 2024-12-05 08:19:09 UTC
(In reply to Richard Biener from comment #7)
> I'm talking about the number of iterations of the second loop (after loop
> splitting), the niter expression is (unsigned int) M_9(D) - (unsigned int)
> k_24.
> We know k_29 == M_9(D) when exiting the first loop, so on that entry the
> expression computes zero.  For the case the first loop is short-cut we know
> M_9(D) == 1 and thus the difference is zero as well.
> 
> So I expect the range of (unsigned int) M_9(D) - (unsigned int) k_2 to be
> [0,0].
> 
> But maybe I'm missing something?

I guess the issue is that with

# k_24 = PHI <1(13), k_29(16)>

to easily see this we'd have to compute the range of
(unsigned int) M_9(D) - 1 and the range of (unsigned int) M_9(D) - (unsigned) k_29 and then see those are the same singleton.  I don't think we can
arrive here when using the range of k_24 itself, so maybe I'm asking too
much of VRP here.
Comment 9 Jan Hubicka 2024-12-05 08:56:54 UTC
> > But maybe I'm missing something?
> 
> I guess the issue is that with
> 
> # k_24 = PHI <1(13), k_29(16)>
> 
> to easily see this we'd have to compute the range of
> (unsigned int) M_9(D) - 1 and the range of (unsigned int) M_9(D) - (unsigned)
> k_29 and then see those are the same singleton.  I don't think we can
> arrive here when using the range of k_24 itself, so maybe I'm asking too
> much of VRP here.

Yes, I don't see how ranger can figure out easily the singleton range,
since final value of the iteration is variable.
Last year I looked into it and I kind of remember implementing special
matching code, but can not find the patch anymore. The code checking
loop1->any_estimate can be generalized to check for estimate of one of
the loop and adjust the other accordingly.

It is scev_cprop pass that does replace the use of IV value use out of
loops by its bound. As in:
#include <stdio.h>
int test(int n)
{
        int i;
        if (n < 0)
                return -1;
        for (i = 0; i < n; i++)
                printf ("%i\n",i);
        return i;
}

Maybe we want to be able to run it on a specific loop and run it on
loop1 between split finishes updating IL and before profile update?
Comment 10 rguenther@suse.de 2024-12-05 09:36:59 UTC
On Thu, 5 Dec 2024, hubicka at ucw dot cz wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117875
> 
> --- Comment #9 from Jan Hubicka <hubicka at ucw dot cz> ---
> > > But maybe I'm missing something?
> > 
> > I guess the issue is that with
> > 
> > # k_24 = PHI <1(13), k_29(16)>
> > 
> > to easily see this we'd have to compute the range of
> > (unsigned int) M_9(D) - 1 and the range of (unsigned int) M_9(D) - (unsigned)
> > k_29 and then see those are the same singleton.  I don't think we can
> > arrive here when using the range of k_24 itself, so maybe I'm asking too
> > much of VRP here.
> 
> Yes, I don't see how ranger can figure out easily the singleton range,
> since final value of the iteration is variable.
> Last year I looked into it and I kind of remember implementing special
> matching code, but can not find the patch anymore. The code checking
> loop1->any_estimate can be generalized to check for estimate of one of
> the loop and adjust the other accordingly.
> 
> It is scev_cprop pass that does replace the use of IV value use out of
> loops by its bound. As in:
> #include <stdio.h>
> int test(int n)
> {
>         int i;
>         if (n < 0)
>                 return -1;
>         for (i = 0; i < n; i++)
>                 printf ("%i\n",i);
>         return i;
> }
> 
> Maybe we want to be able to run it on a specific loop and run it on
> loop1 between split finishes updating IL and before profile update?

That can be done already, I refactored the code to be run on a loop
a while back.  Just call final_value_replacement_loop (loop),
special casing it to a single PHI might be possible as well, of course.
Comment 11 Andrew Macleod 2024-12-05 18:17:31 UTC
(In reply to Richard Biener from comment #8)
> (In reply to Richard Biener from comment #7)
> > I'm talking about the number of iterations of the second loop (after loop
> > splitting), the niter expression is (unsigned int) M_9(D) - (unsigned int)
> > k_24.
> > We know k_29 == M_9(D) when exiting the first loop, so on that entry the
> > expression computes zero.  For the case the first loop is short-cut we know
> > M_9(D) == 1 and thus the difference is zero as well.
> > 
> > So I expect the range of (unsigned int) M_9(D) - (unsigned int) k_2 to be
> > [0,0].
> > 
> > But maybe I'm missing something?
> 
> I guess the issue is that with
> 
> # k_24 = PHI <1(13), k_29(16)>
> 
> to easily see this we'd have to compute the range of
> (unsigned int) M_9(D) - 1 and the range of (unsigned int) M_9(D) -
> (unsigned) k_29 and then see those are the same singleton.  I don't think we
> can
> arrive here when using the range of k_24 itself, so maybe I'm asking too
> much of VRP here.


Is there any place smart enough to be able to recognize that because the first loop is a bump by one and with the core:

# k_15 = PHI <k_12(8), 1(14)>
 <...>
  k_12 = k_15 + 1;
  if (k_12 < M_9(D))

the exit condition could be rewritten to 
  if (k_12 == M_9(D))  ?

THe loop is only entered by the branch if M_9(D) > 1 feeding the 1(14) initializer, so the relation k_15 < M_9 is always true.  (We could know that with a little work)

That might enable a couple of things.  (Ranger will also have a more precise relation on that exit edge, which may also help inn other unexpected ways.)

Im coincidentally looking at PHI argument relations for 116753 (we currently do nothing other than basic equality if all the arguments are in the same equivalence set).

k_12 = k_15 + 1;
if (k_12 < M_9(D))
  goto <bb 8>; [89.00%] // latch
else
  goto <bb 16>; [11.00%]

<bb 16> [local count: 67276368]:
# k_29 = PHI <k_12(3)>
if (M_9(D) >= k_29)
  goto <bb 15>; [99.95%]
else
  goto <bb 6>; [0.05%]
X

<bb 15> [local count: 105119324]:
# k_24 = PHI <1(13), k_29(16)> 


the second branch gets eliminated anyway, but feeding into the k_24 PHI, we'll be able to recognize k_29 == M_9(D) with the condition rewritten.    It also not particularly difficult to determine that the constant 1 is actually a rewritten  M_9(D) if we tried hard enough.  

ranger knows that M_9(D) is [1, 1] on that edge.  with that info, it would not be a stretch to be able to put M_9(D) and k_24 in the same equivalency set...

That could give you the 0 result you are looking for.  If the 2 operands are in an equivalency set, operator_minus will DTRT:
  if (rel == VREL_EQ)
    rel_range = int_range<2> (type, wi::zero (prec), wi::zero (prec));
Comment 12 Jan Hubicka 2024-12-10 17:16:55 UTC
I tried final_value_replacement_loop on simplified testcase where second loop has known number of iterations:

void foo(int *a, int *b, int n)
{
        if (n > 3 && n < 10)
                for (int i = 0; i <= n;i++)
                        a[i]+=b[i<n-3? i-1 : i];
}

(testcase is truly artificial, but lsplit disables itself very happily on more natural candidates)

Situation is not so easy, since final_value_replacement_loop needs loop closed ssa form that is rebuilt only after the pass is done.  Also even with LCSSA it does not seem to expose the invariant...
Comment 13 Tamar Christina 2024-12-18 10:36:59 UTC
I see about a 34% regression on AArch64 as well.
Comment 14 chenglulu 2024-12-28 08:15:25 UTC
On LoongArch, r15-5894-gaf9a3fe6a52974 compared to r15-5315-g5cf7ffe047681d, the performance of 456.hmmer drops by 27.8%.
Comment 15 Andrew Pinski 2024-12-28 09:06:49 UTC
For aarch64, it is between r15-5312-g9676da9fc26c9b and r15-5729-g56029c91dcadcf .
Comment 16 Andrew Pinski 2024-12-28 09:15:47 UTC
(In reply to Jan Hubicka from comment #3)
> I saw this on LNT testers too. Unforutnately hmmer was not working for a
> while, so there is relatively big interval about when this can happen.

So looking into the reasons why it was failing is because the default C level changed to C23. So the spec config changed to include -std=gnu17 now rather than GCC being broken. So in theory someone could do a bisection with a decent spec config.