LNT has detected an 18% regression of the SPECFP 2006 benchmark 470.lbm when it is compiled with -Ofast -march=native on a Zen2 machine:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=421.240.0&plot.1=301.240.0&

...and similarly a 6% regression when it is compiled with just -Ofast and run on the same machine:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=450.240.0&plot.1=24.240.0&

I have bisected both on another Zen2 machine to commit r12-897-gde56f95afaaa22 (Run pass_sink_code once more before store_merging).

A Zen1 machine has also seen a similar -march=native regression in the same time frame:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=450.240.0&plot.1=24.240.0&

Zen1 with -march=generic seems to be unaffected, which is also the case for the Intel machines we track.

Although lbm has been known to show weird regressions caused entirely by code layout, where the compiler was not really at fault, the fact that both generic code-gen and Zen1 are affected indicates this is not the case here.
This may be related to PR102008; see the comment I made there.

Martin, maybe you can try moving the late sink pass to before the last phiopt pass.
Verified that 470.lbm doesn't show the regression on Power8 with -Ofast: runtime is 141 sec for r12-897, and without that patch it is 142 sec.
(In reply to Richard Biener from comment #1)
> Martin, maybe you can try moving the late sink pass to before the last
> phiopt pass.

If you mean the following, then unfortunately it has not helped.

diff --git a/gcc/passes.def b/gcc/passes.def
index d7a1f8c97a6..5eb70cd2cd8 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -347,10 +347,10 @@ along with GCC; see the file COPYING3.  If not see
       /* After late CD DCE we rewrite no longer addressed locals into SSA
          form if possible.  */
       NEXT_PASS (pass_forwprop);
+      NEXT_PASS (pass_sink_code);
       NEXT_PASS (pass_phiopt, false /* early_p */);
       NEXT_PASS (pass_fold_builtins);
       NEXT_PASS (pass_optimize_widening_mul);
-      NEXT_PASS (pass_sink_code);
       NEXT_PASS (pass_store_merging);
       NEXT_PASS (pass_tail_calls);
       /* If DCE is not run before checking for uninitialized uses,

...I'll have a very brief look at what is actually happening, just so that I have more reasons to believe this is not a code placement issue again.
(In reply to Martin Jambor from comment #3)
> ...I'll have a very brief look at what is actually happening, just so that I
> have more reasons to believe this is not a code placement issue again.

The hot function is at the same address when compiled by both revisions, and the newer version looks sufficiently different. I even tried sprinkling it with nops and that did not help. I am not saying we are not bumping against some micro-architectural peculiarity, but it does not seem to be a code placement issue.
Analysis is missing but the regression persists. On Haswell I do not see any effect. I do suspect it's about cmov vs. non-cmov, but without a profile and a look at the affected assembly that's a wild guess.
Created attachment 52296 [details]
perf annotate before and after the revision
I see a lot more GPR <-> XMM moves in the 'after' case:

  1035 :   401c8b:  vaddsd %xmm1,%xmm0,%xmm0
  1953 :   401c8f:  vmovq  %rcx,%xmm1
   305 :   401c94:  vaddsd %xmm8,%xmm1,%xmm1
  3076 :   401c99:  vmovq  %xmm0,%r14
   590 :   401c9e:  vmovq  %r11,%xmm0
   267 :   401ca3:  vmovq  %xmm1,%r8
   136 :   401ca8:  vmovq  %rdx,%xmm1
   448 :   401cad:  vaddsd %xmm1,%xmm0,%xmm1
  1703 :   401cb1:  vmovq  %xmm1,%r9    (*)
   834 :   401cb6:  vmovq  %r8,%xmm1
  1719 :   401cbb:  vmovq  %r9,%xmm0    (*)
  2782 :   401cc0:  vaddsd %xmm0,%xmm1,%xmm1
 22135 :   401cc4:  vmovsd %xmm1,%xmm1,%xmm0
  1261 :   401cc8:  vmovq  %r14,%xmm1
   646 :   401ccd:  vaddsd %xmm0,%xmm1,%xmm0
 18136 :   401cd1:  vaddsd %xmm2,%xmm5,%xmm1
   629 :   401cd5:  vmovq  %xmm1,%r8
   142 :   401cda:  vaddsd %xmm6,%xmm3,%xmm1
   177 :   401cde:  vmovq  %xmm0,%r14
   288 :   401ce3:  vmovq  %xmm1,%r9
   177 :   401ce8:  vmovq  %r8,%xmm1
   174 :   401ced:  vmovq  %r9,%xmm0

Those look like RA / spilling artifacts - IIRC Hongtao has been posting patches in this area, to regcprop I think? The above is definitely bad; for example (*) seems to swap %xmm0 and %xmm1 via %r9.

The function is LBM_performStreamCollide. The sinking pass does nothing wrong; it moves the unconditionally executed

-  _948 = _861 + _867;
-  _957 = _944 + _948;
-  _912 = _861 + _873;
...
-  _981 = _853 + _865;
-  _989 = _977 + _981;
-  _916 = _853 + _857;
-  _924 = _912 + _916;

into a conditionally executed block. But that increases register pressure by 5 FP regs (if I counted correctly) in that area. So this would be the usual issue of GIMPLE transforms not being register-pressure aware. -fschedule-insns -fsched-pressure seems to be able to somewhat mitigate this (though I think EBB scheduling cannot undo such movement).

In postreload I see transforms like

-(insn 466 410 411 7 (set (reg:DF 0 ax [530])
-        (mem/u/c:DF (symbol_ref/u:DI ("*.LC10") [flags 0x2]) [0  S8 A64])) "lbm.c":241:5 141 {*movdf_internal}
-     (expr_list:REG_EQUAL (const_double:DF 9.939744999999999830464503247640095651149749755859375e-1 [0x0.fe751ce28ed5fp+0])
-        (nil)))
-(insn 411 466 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
+(insn 411 410 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
         (reg:DF 0 ax [530])) "lbm.c":241:5 141 {*movdf_internal}
     (nil))

which seems like we could have reloaded %xmm5 from .LC10 directly. But the spilling to GPRs is present already after LRA, and cprop_hardreg doesn't do anything bad either.

The differences can be seen on trunk with -Ofast -march=znver2 [-fdisable-tree-sink2]. We have X86_TUNE_INTER_UNIT_MOVES_TO_VEC/X86_TUNE_INTER_UNIT_MOVES_FROM_VEC, and the interesting thing is that when I disable them I do see some spilling to the stack but also quite some re-materialized constants (loads from .LC*, as seen from the opportunity above).

It might be interesting to benchmark with -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec and to find a way to make the costs such that IRA/LRA prefer re-materialization of constants from the constant pool over spilling to GPRs (if that's possible at all - Vlad?)
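For illustration, here is a minimal, hypothetical C sketch (not the LBM source) of the shape of the problem: sums whose only use is inside a conditional block get sunk into that block, so their individual addends have to stay live across the branch instead of the far fewer combined sums.

/* Hypothetical sketch of the register-pressure effect of sinking
   described above; whether pass_sink_code actually fires on this toy
   depends on the exact flags and pass pipeline.
   Unsunk, only s0/s1/s2 (3 FP values) are live across the branch.
   After sinking them into the if (), their six addends are live
   across it instead, i.e. pressure goes up by 3 FP regs - the same
   kind of increase (+5 in LBM) counted above.  */
double
sink_pressure (double a, double b, double c, double d,
               double e, double h, int cond)
{
  double s0 = a + b;   /* only used inside the conditional block */
  double s1 = c + d;   /* likewise */
  double s2 = e + h;   /* likewise */
  double r = 1.0;
  if (cond)
    r = s0 * s1 + s2;
  return r;
}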
So w/ -Ofast -march=znver2 I get a runtime of 130 seconds, when I add -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec then this improves to 114 seconds, with sink2 disabled I get 108 seconds and with the tune-ctrl ontop I get 113 seconds. Note that Zen2 is quite special in that it has the ability to handle load/store from the stack by mapping it to a register, effectively making them zero latency (zen3 lost this ability). So while moves between GPRs and XMM might not be bad anymore _spilling_ to a GPR (and I suppose XMM, too) is still a bad idea and the stack should be preferred. Not sure if it's possible to do that though. Doing the same experiment as above on a Zen3 machine would be nice, too.
  1703 :   401cb1:  vmovq  %xmm1,%r9    (*)
   834 :   401cb6:  vmovq  %r8,%xmm1
  1719 :   401cbb:  vmovq  %r9,%xmm0    (*)

It looks like %r9 is dead after the second (*), and this can be optimized to

  1703 :   401cb1:  vmovq  %xmm1,%xmm0
   834 :   401cb6:  vmovq  %r8,%xmm1
(In reply to Richard Biener from comment #8)
> So w/ -Ofast -march=znver2 I get a runtime of 130 seconds; when I add
> -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec this
> improves to 114 seconds, with sink2 disabled I get 108 seconds, and with
> the tune-ctrl on top of that I get 113 seconds.
>
> Note that Zen2 is quite special in that it has the ability to handle
> loads/stores from the stack by mapping them to a register, effectively
> making them zero latency (Zen3 lost this ability).
>
> So while moves between GPRs and XMMs might not be bad anymore, _spilling_
> to a GPR (and I suppose to an XMM, too) is still a bad idea and the stack
> should be preferred.

According to znver2_cost the cost of sse_to_integer is a little bit less than fp_store; maybe increasing the sse_to_integer cost (to more than fp_store) can help the RA choose memory instead of a GPR.
(In reply to Richard Biener from comment #8)
> So w/ -Ofast -march=znver2 I get a runtime of 130 seconds; when I add
> -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec this
> improves to 114 seconds, with sink2 disabled I get 108 seconds, and with
> the tune-ctrl on top of that I get 113 seconds.

With -Ofast -march=znver2 -fschedule-insns -fsched-pressure I get 113 seconds.
(In reply to Hongtao.liu from comment #10)
> (In reply to Richard Biener from comment #8)
> > So w/ -Ofast -march=znver2 I get a runtime of 130 seconds; when I add
> > -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec this
> > improves to 114 seconds, with sink2 disabled I get 108 seconds, and with
> > the tune-ctrl on top of that I get 113 seconds.
> >
> > Note that Zen2 is quite special in that it has the ability to handle
> > loads/stores from the stack by mapping them to a register, effectively
> > making them zero latency (Zen3 lost this ability).
> >
> > So while moves between GPRs and XMMs might not be bad anymore, _spilling_
> > to a GPR (and I suppose to an XMM, too) is still a bad idea and the stack
> > should be preferred.
>
> According to znver2_cost the cost of sse_to_integer is a little bit less
> than fp_store; maybe increasing the sse_to_integer cost (to more than
> fp_store) can help the RA choose memory instead of a GPR.

That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm, but GPR<->xmm should be more expensive than GPR/xmm<->stack. As said above, Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat special, but modeling that doesn't seem to be necessary.

We seem to have store costs of 8 and load costs of 6, I'll try bumping the gpr<->xmm move cost to 8.
> > According to znver2_cost the cost of sse_to_integer is a little bit less
> > than fp_store; maybe increasing the sse_to_integer cost (to more than
> > fp_store) can help the RA choose memory instead of a GPR.
>
> That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm, but
> GPR<->xmm should be more expensive than GPR/xmm<->stack. As said above,
> Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat
> special, but modeling that doesn't seem to be necessary.
>
> We seem to have store costs of 8 and load costs of 6, I'll try bumping the
> gpr<->xmm move cost to 8.

I was simply following latencies here, so indeed the reg<->mem bypass is not really modelled. I recall doing a few experiments, which were kind of inconclusive.
(In reply to Hongtao.liu from comment #9)
>   1703 :   401cb1:  vmovq  %xmm1,%r9    (*)
>    834 :   401cb6:  vmovq  %r8,%xmm1
>   1719 :   401cbb:  vmovq  %r9,%xmm0    (*)
>
> It looks like %r9 is dead after the second (*), and this can be optimized to
>
>   1703 :   401cb1:  vmovq  %xmm1,%xmm0
>    834 :   401cb6:  vmovq  %r8,%xmm1

Yep, we also have code like

-       movabsq $0x3ff03db8fde2ef4e, %r8
...
-       vmovq   %r8, %xmm11

or

        movq    .LC11(%rip), %rax
        vmovq   %rax, %xmm14

which is extremely odd to see ... (I didn't check how we arrive at that)

When I do

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 017ffa69958..4c51358d7b6 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1585,7 +1585,7 @@ struct processor_costs znver2_cost = {
                                            in 32,64,128,256 and 512-bit.  */
   {8, 8, 8, 8, 16},                     /* cost of storing SSE registers
                                            in 32,64,128,256 and 512-bit.  */
-  6, 6,                                 /* SSE->integer and integer->SSE
+  8, 8,                                 /* SSE->integer and integer->SSE
                                            moves.  */
   8, 8,                         /* mask->integer and integer->mask moves */
   {6, 6, 6},                            /* cost of loading mask register

performance improves from 128 seconds to 115 seconds. The result is a lot more stack spilling in the code, but there are still cases like

        movq    .LC8(%rip), %rax
        vmovsd  .LC13(%rip), %xmm6
        vmovsd  .LC16(%rip), %xmm11
        vmovsd  .LC14(%rip), %xmm3
        vmovsd  .LC12(%rip), %xmm14
        vmovq   %rax, %xmm2
        vmovq   %rax, %xmm0
        movq    .LC9(%rip), %rax

See how we load .LC8 to %rax just to move it to xmm2 and xmm0, instead of at least moving xmm2 to xmm0 (maybe that's now up to cprop_hardreg) or loading directly to xmm0.

In the end register pressure is the main issue, but how we deal with it is bad. It's likely caused by a combination of PRE & hoisting & sinking, which together exploit

  if( ((*((unsigned int*) ((void*) (&((((srcGrid)[((FLAGS)+N_CELL_ENTRIES*((0)+
       (0)*(1*(100))+(0)*(1*(100))*(1*(100))))+(i)]))))))) & (ACCEL))) {
    ux = 0.005;
    uy = 0.002;
    uz = 0.000;
  }

which makes the following computations partly compile-time resolvable.

I still think the above code generation issues need to be analyzed, and we should figure out why we emit this weird code under register pressure. I'll attach a testcase that has the function in question split out for easier analysis.
Created attachment 52300 [details]
LBM_performStreamCollide testcase

This is the relevant function.
> Yep, we also have code like
>
> -       movabsq $0x3ff03db8fde2ef4e, %r8
> ...
> -       vmovq   %r8, %xmm11

It is loading a random constant to xmm11. Since reg<->xmm moves are relatively cheap, it looks OK to me that we generate this. Is it faster to load the constant from memory?

>         movq    .LC11(%rip), %rax
>         vmovq   %rax, %xmm14

This is odd indeed, and it is even more odd that we do both a movabs and a memory load... The i386 backend plays some games with allowing some constants in SSE instructions (to allow simplification and combining) and splits them out to memory later. It may be a consequence of this.
So in .reload we have (with unpatched trunk)

  401: NOTE_INSN_BASIC_BLOCK 6
  462: ax:DF=[`*.LC0']
      REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
  407: xmm2:DF=ax:DF
  463: ax:DF=[`*.LC0']
      REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
  408: xmm4:DF=ax:DF

why??!  We can load .LC0 into xmm4 directly.  IRA sees

  401: NOTE_INSN_BASIC_BLOCK 6
  407: r118:DF=r482:DF
  408: r119:DF=r482:DF

Now, I cannot really decipher the IRA or LRA dumps, but my guess would be that inheritance (causing us to load from .LC0) interferes badly with register class assignment?

         Changing pseudo 482 in operand 1 of insn 407 on equiv 9.850689999999999724167309977929107844829559326171875e-1
...
          alt=21,overall=9,losers=1,rld_nregs=1
         Choosing alt 21 in insn 407:  (0) v  (1) r {*movdf_internal}
      Creating newreg=525, assigning class GENERAL_REGS to r525
  407: r118:DF=r525:DF
    Inserting insn reload before:
  462: r525:DF=[`*.LC0']
      REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1

We should have preferred alt 14, I think - (0) v (1) m - but that has

          alt=14,overall=13,losers=1,rld_nregs=0
            0 Spill pseudo into memory: reject+=3
            Using memory insn operand 0: reject+=3
            0 Non input pseudo reload: reject++
            1 Non-pseudo reload: reject+=2
            1 Non input pseudo reload: reject++
          alt=15,overall=28,losers=3 -- refuse
            0 Costly set: reject++
          alt=16: Bad operand -- refuse
            0 Costly set: reject++
            1 Costly loser: reject++
            1 Non-pseudo reload: reject+=2
            1 Non input pseudo reload: reject++
          alt=17,overall=17,losers=2 -- refuse
            0 Costly set: reject++
            1 Spill Non-pseudo into memory: reject+=3
            Using memory insn operand 1: reject+=3
            1 Non input pseudo reload: reject++
          alt=18,overall=14,losers=1 -- refuse
            0 Spill pseudo into memory: reject+=3
            Using memory insn operand 0: reject+=3
            0 Non input pseudo reload: reject++
            1 Costly loser: reject++
            1 Non-pseudo reload: reject+=2
            1 Non input pseudo reload: reject++
          alt=19,overall=29,losers=3 -- refuse
            0 Non-prefered reload: reject+=600
            0 Non input pseudo reload: reject++
          alt=20,overall=607,losers=1 -- refuse
            1 Non-pseudo reload: reject+=2
            1 Non input pseudo reload: reject++

I'm not sure I can decipher the reasoning, but it doesn't seem to anticipate the cost of reloading the GPR in the alternative it chooses?

Vlad?
For the case of LBM, what also helps is disabling PRE or using PGO (which sees the useless PRE), given that the path on which the expressions become partially compile-time computable is never taken at runtime. In theory we could isolate that path completely via -ftracer, but the pass pipeline setup doesn't look optimal for that.
On Thu, 27 Jan 2022, hubicka at kam dot mff.cuni.cz wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
>
> --- Comment #13 from hubicka at kam dot mff.cuni.cz ---
> > > According to znver2_cost the cost of sse_to_integer is a little bit less
> > > than fp_store; maybe increasing the sse_to_integer cost (to more than
> > > fp_store) can help the RA choose memory instead of a GPR.
> >
> > That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm, but
> > GPR<->xmm should be more expensive than GPR/xmm<->stack. As said above,
> > Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat
> > special, but modeling that doesn't seem to be necessary.
> >
> > We seem to have store costs of 8 and load costs of 6, I'll try bumping the
> > gpr<->xmm move cost to 8.
>
> I was simply following latencies here, so indeed the reg<->mem bypass is
> not really modelled. I recall doing a few experiments, which were kind of
> inconclusive.

Yes, I think xmm->gpr->xmm vs. xmm->mem->xmm isn't really the issue here; it's mem->gpr->xmm vs. mem->xmm with all the constant pool remats.

Agner lists a latency of 3 for gpr<->xmm and a latency of 4 for mem<->xmm, but then there is forwarding (and the clever renaming trick) which likely makes xmm->mem->xmm cheaper than 4 + 4, while xmm->gpr->xmm will really be 3 + 3 latency. gpr<->xmm also seems to be more resource constrained.

In any case, for moving xmm to gpr it doesn't make sense to go through memory, but it doesn't seem worthwhile to spill to xmm or gpr when we only use gpr / xmm later. That is leaving aside the odd and bogus code we generate for the .LC0 re-materialization, which we should fix - and fixing it will likely fix LBM.
On Thu, 27 Jan 2022, hubicka at kam dot mff.cuni.cz wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
>
> --- Comment #16 from hubicka at kam dot mff.cuni.cz ---
> > Yep, we also have code like
> >
> > -       movabsq $0x3ff03db8fde2ef4e, %r8
> > ...
> > -       vmovq   %r8, %xmm11
>
> It is loading a random constant to xmm11. Since reg<->xmm moves are
> relatively cheap, it looks OK to me that we generate this. Is it faster
> to load the constant from memory?

I would say so. It saves code size and also uop space, unless the two can magically fuse to an immediate-to-%xmm move (I doubt that).

> >         movq    .LC11(%rip), %rax
> >         vmovq   %rax, %xmm14
>
> This is odd indeed, and it is even more odd that we do both a movabs and
> a memory load... The i386 backend plays some games with allowing some
> constants in SSE instructions (to allow simplification and combining) and
> splits them out to memory later. It may be a consequence of this.

I've pasted the LRA dump pieces I think are relevant, but I don't understand them. The constant load isn't visible originally but is introduced by LRA, so that may be the key to the mystery here.
> I would say so. It saves code size and also uop space, unless the two
> can magically fuse to an immediate-to-%xmm move (I doubt that).

I made a simple benchmark:

double a=10;
int main()
{
  long int i;
  double sum = 0, val1, val2, val3, val4;
  for (i=0;i<1000000000;i++)
  {
#if 1
#if 1
    asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val1): :"r8","xmm11");
    asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val2): :"r8","xmm11");
    asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val3): :"r8","xmm11");
    asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val4): :"r8","xmm11");
#else
    asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val1):"m"(a) :"r8","xmm11");
    asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val2):"m"(a) :"r8","xmm11");
    asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val3):"m"(a) :"r8","xmm11");
    asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val4):"m"(a) :"r8","xmm11");
#endif
#else
    asm __volatile__("vmovq %1, %0": "=x"(val1):"m"(a) :"r8","xmm11");
    asm __volatile__("vmovq %1, %0": "=x"(val2):"m"(a) :"r8","xmm11");
    asm __volatile__("vmovq %1, %0": "=x"(val3):"m"(a) :"r8","xmm11");
    asm __volatile__("vmovq %1, %0": "=x"(val4):"m"(a) :"r8","xmm11");
#endif
    sum+=val1+val2+val3+val4;
  }
  return sum;
}

and indeed the third variant runs in 1.2s while the first two both take 2.4s on my Zen2 laptop.
Is this related to PR 104059? Can you try: https://gcc.gnu.org/pipermail/gcc-patches/2022-January/589209.html
(In reply to H.J. Lu from comment #22)
> Is this related to PR 104059? Can you try:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/589209.html

I verified that hardreg cprop does nothing on the testcase, so no.
For

        vmovq   %rdi, %xmm7     # 503   [c=4 l=4]  *movdf_internal/21
..
        vmulsd  %xmm7, %xmm4, %xmm5     # 320   [c=12 l=4]  *fop_df_comm/2
..
        movabsq $0x3fef85af6c69b5a6, %rdi       # 409   [c=5 l=10]  *movdf_internal/11

and

(insn 320 319 322 8 (set (reg:DF 441)
        (mult:DF (reg:DF 166 [ _323 ])
            (reg:DF 249 [ _900 ]))) "../test.c":87:218 1072 {*fop_df_comm}
     (expr_list:REG_DEAD (reg:DF 249 [ _900 ])
        (nil)))

the RA allocates rdi for r249 because the cost of a general reg is cheaper than mem:

  a66(r249,l1) costs: AREG:5964,5964 DREG:5964,5964 CREG:5964,5964 BREG:5964,5964 SIREG:5964,5964 DIREG:5964,5964 AD_REGS:5964,5964 CLOBBERED_REGS:5964,5964 Q_REGS:5964,5964 NON_Q_REGS:5964,5964 TLS_GOTBASE_REGS:5964,5964 GENERAL_REGS:5964,5964 FP_TOP_REG:19546,19546 FP_SECOND_REG:19546,19546 FLOAT_REGS:19546,19546 SSE_FIRST_REG:0,0 NO_REX_SSE_REGS:0,0 SSE_REGS:0,0 FLOAT_SSE_REGS:19546,19546 FLOAT_INT_REGS:19546,19546 INT_SSE_REGS:19546,19546 FLOAT_INT_SSE_REGS:19546,19546 MEM:6294,6294

  r249: preferred SSE_REGS, alternative GENERAL_REGS, allocno INT_SSE_REGS

  Disposition:
    66:r249 l1     5

With -mtune=alderlake the cost of general regs for r249 is more expensive than mem, the RA allocates mem for it, and then no more movabsq/vmovq is needed:

  a66(r249,l1) costs: AREG:5964,5964 DREG:5964,5964 CREG:5964,5964 BREG:5964,5964 SIREG:5964,5964 DIREG:5964,5964 AD_REGS:5964,5964 CLOBBERED_REGS:5964,5964 Q_REGS:5964,5964 NON_Q_REGS:5964,5964 TLS_GOTBASE_REGS:5964,5964 GENERAL_REGS:5964,5964 FP_TOP_REG:14908,14908 FP_SECOND_REG:14908,14908 FLOAT_REGS:14908,14908 SSE_FIRST_REG:0,0 NO_REX_SSE_REGS:0,0 SSE_REGS:0,0 FLOAT_SSE_REGS:14908,14908 FLOAT_INT_REGS:14908,14908 INT_SSE_REGS:14908,14908 FLOAT_INT_SSE_REGS:14908,14908 MEM:5632,5632

  r249: preferred SSE_REGS, alternative NO_REGS, allocno SSE_REGS

    66:r249 l1   mem

        vmulsd  -80(%rsp), %xmm2, %xmm3 # 320   [c=29 l=6]  *fop_df_comm/2

Guess we need to let the RA know that the mem cost is cheaper than GPR for r249.
> Guess we need to let the RA know that the mem cost is cheaper than GPR for
> r249.

Reduce the sse_store cost?
(In reply to Richard Biener from comment #7)
> make the costs such that IRA/LRA prefer re-materialization of constants
> from the constant pool over spilling to GPRs (if that's possible at all -
> Vlad?)

LRA rematerialization cannot rematerialize a constant value from the constant pool. It can only rematerialize the value of an expression consisting of other pseudos (currently assigned to hard regs) and constants.

I guess the rematerialization pass could be extended to work for constants from the constant memory pool. It is a pretty doable project, as opposed to rematerialization of arbitrary memory, which would require a lot of analysis (including aliasing) and complicated cost/benefit calculation. Maybe somebody could pick this project up.
(In reply to Richard Biener from comment #17)
> So in .reload we have (with unpatched trunk)
>
>   401: NOTE_INSN_BASIC_BLOCK 6
>   462: ax:DF=[`*.LC0']
>       REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
>   407: xmm2:DF=ax:DF
>   463: ax:DF=[`*.LC0']
>       REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
>   408: xmm4:DF=ax:DF
>
> why??!  We can load .LC0 into xmm4 directly.  IRA sees
>
>   401: NOTE_INSN_BASIC_BLOCK 6
>   407: r118:DF=r482:DF
>   408: r119:DF=r482:DF
>
> Now, I cannot really decipher the IRA or LRA dumps, but my guess would be
> that inheritance (causing us to load from .LC0) interferes badly with
> register class assignment?
>
>          Changing pseudo 482 in operand 1 of insn 407 on equiv
> 9.850689999999999724167309977929107844829559326171875e-1
> ...
>           alt=21,overall=9,losers=1,rld_nregs=1
>          Choosing alt 21 in insn 407:  (0) v  (1) r {*movdf_internal}
>       Creating newreg=525, assigning class GENERAL_REGS to r525
>   407: r118:DF=r525:DF
>     Inserting insn reload before:
>   462: r525:DF=[`*.LC0']
>       REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
>
> We should have preferred alt 14, I think - (0) v (1) m - but that has
>
>           alt=14,overall=13,losers=1,rld_nregs=0
>             0 Spill pseudo into memory: reject+=3
>             Using memory insn operand 0: reject+=3
>             0 Non input pseudo reload: reject++
>             1 Non-pseudo reload: reject+=2
>             1 Non input pseudo reload: reject++
>           alt=15,overall=28,losers=3 -- refuse
>             0 Costly set: reject++
>           alt=16: Bad operand -- refuse
>             0 Costly set: reject++
>             1 Costly loser: reject++
>             1 Non-pseudo reload: reject+=2
>             1 Non input pseudo reload: reject++
>           alt=17,overall=17,losers=2 -- refuse
>             0 Costly set: reject++
>             1 Spill Non-pseudo into memory: reject+=3
>             Using memory insn operand 1: reject+=3
>             1 Non input pseudo reload: reject++
>           alt=18,overall=14,losers=1 -- refuse
>             0 Spill pseudo into memory: reject+=3
>             Using memory insn operand 0: reject+=3
>             0 Non input pseudo reload: reject++
>             1 Costly loser: reject++
>             1 Non-pseudo reload: reject+=2
>             1 Non input pseudo reload: reject++
>           alt=19,overall=29,losers=3 -- refuse
>             0 Non-prefered reload: reject+=600
>             0 Non input pseudo reload: reject++
>           alt=20,overall=607,losers=1 -- refuse
>             1 Non-pseudo reload: reject+=2
>             1 Non input pseudo reload: reject++
>
> I'm not sure I can decipher the reasoning, but it doesn't seem to anticipate
> the cost of reloading the GPR in the alternative it chooses?
>
> Vlad?

All this diagnostic output is just a description of the voodoo from the old reload pass. LRA chooses alternatives the same way the old reload pass did (I doubt any other approach would avoid breaking all existing targets); the old reload pass simply does not report its decisions in a dump.

The LRA code choosing insn alternatives (lra-constraints.cc::process_alt_operands), like the old reload pass, does not use any memory or register move costs. Instead, the alternative is chosen by heuristics and insn constraint hints (like ? and !). The only case where these costs are used is when we have reg:=reg and the register move cost for it is 2; in that case LRA (reload) does not bother to check the insn constraints.
Could somebody benchmark the following patch on Zen2 470.lbm?

diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 9cee17479ba..76619aca8eb 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -5084,7 +5089,9 @@ lra_constraints (bool first_p)
 		     (x, lra_get_allocno_class (i)) == NO_REGS))
 		|| contains_symbol_ref_p (x))))
 	  ira_reg_equiv[i].defined_p = false;
-	if (contains_reg_p (x, false, true))
+	if (contains_reg_p (x, false, true)
+	    || (CONST_DOUBLE_P (x)
+		&& maybe_ge (GET_MODE_SIZE (GET_MODE (x)), 8)))
 	  ira_reg_equiv[i].profitable_p = false;
 	if (get_equiv (reg) != reg)
 	  bitmap_ior_into (equiv_insn_bitmap, &lra_reg_info[i].insn_bitmap);

If it improves the performance, I'll commit this patch.

The expander unconditionally uses the constant pool for double constants; I think the analogous treatment could be done for equiv double constants in LRA.

To my knowledge, only x86_64 permits 64-bit constants as immediates for moves into general regs. As double fp operations are not done in general regs in most cases, the values then have to be moved into fp regs, and this is costly, as Jan wrote. So it makes sense to prohibit using equiv double constant values in LRA unconditionally. If in the future we have a target which can move a double immediate into fp regs, we can introduce a target hook to deal with equiv double constants, but right now I think there is no need for a new hook.
(In reply to Vladimir Makarov from comment #28)
> Could somebody benchmark the following patch on Zen2 470.lbm?
>
> diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
> index 9cee17479ba..76619aca8eb 100644
> --- a/gcc/lra-constraints.cc
> +++ b/gcc/lra-constraints.cc
> @@ -5084,7 +5089,9 @@ lra_constraints (bool first_p)
>  		     (x, lra_get_allocno_class (i)) == NO_REGS))
>  		|| contains_symbol_ref_p (x))))
>  	  ira_reg_equiv[i].defined_p = false;
> -	if (contains_reg_p (x, false, true))
> +	if (contains_reg_p (x, false, true)
> +	    || (CONST_DOUBLE_P (x)
> +		&& maybe_ge (GET_MODE_SIZE (GET_MODE (x)), 8)))
>  	  ira_reg_equiv[i].profitable_p = false;
>  	if (get_equiv (reg) != reg)
>  	  bitmap_ior_into (equiv_insn_bitmap, &lra_reg_info[i].insn_bitmap);
>
> If it improves the performance, I'll commit this patch.

Code generation changes quite a bit; with the patch the offending function is 16 bytes larger. I see no large immediate moves to GPRs anymore, but there is still a lot of spilling of XMMs to GPRs. Performance is unchanged by the patch:

470.lbm         13740        128       107 S    13740        128       107 S
470.lbm         13740        128       107 *    13740        128       107 S
470.lbm         13740        128       107 S    13740        128       107 *

I've used

diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 9cee17479ba..a0ec608c056 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -5084,7 +5084,9 @@ lra_constraints (bool first_p)
 		     (x, lra_get_allocno_class (i)) == NO_REGS))
 		|| contains_symbol_ref_p (x))))
 	  ira_reg_equiv[i].defined_p = false;
-	if (contains_reg_p (x, false, true))
+	if (contains_reg_p (x, false, true)
+	    || (CONST_DOUBLE_P (x)
+		&& maybe_ge (GET_MODE_SIZE (GET_MODE (x)), UNITS_PER_WORD)))
 	  ira_reg_equiv[i].profitable_p = false;
 	if (get_equiv (reg) != reg)
 	  bitmap_ior_into (equiv_insn_bitmap, &lra_reg_info[i].insn_bitmap);

note UNITS_PER_WORD vs. the literal 8.

Without knowing much of the code, I wonder if we can check whether the move will be to a reg in GENERAL_REGS? That is, do we know whether there are (besides some special constants like zero) immediate moves to the destination register class?

That said, given the result on LBM I'd not change this at this point.

Honza wanted to look at the move pattern to try to mitigate the GPR spilling of XMMs.

I do think that we need to take costs into account at some point and get rid of the reload-style hand-waving with !?* in the move patterns.
(In reply to Richard Biener from comment #29)
> (In reply to Vladimir Makarov from comment #28)
> > Could somebody benchmark the following patch on Zen2 470.lbm?
>
> Code generation changes quite a bit; with the patch the offending function
> is 16 bytes larger. I see no large immediate moves to GPRs anymore, but
> there is still a lot of spilling of XMMs to GPRs. Performance is unchanged
> by the patch:
>
> 470.lbm         13740        128       107 S    13740        128       107 S
> 470.lbm         13740        128       107 *    13740        128       107 S
> 470.lbm         13740        128       107 S    13740        128       107 *

Thank you very much for testing the patch, Richard. The results mean no go for the patch to me.

> Without knowing much of the code, I wonder if we can check whether the move
> will be to a reg in GENERAL_REGS? That is, do we know whether there are
> (besides some special constants like zero) immediate moves to the
> destination register class?

There is no such info from the target code. Ideally we would need the cost of loading a *particular* immediate value into a register class, on the same cost basis as loads/stores. Still, to use this info efficiently, choosing alternatives should be based on costs, not on hints and machine-independent general heuristics (as now).

> That said, given the result on LBM I'd not change this at this point.
>
> Honza wanted to look at the move pattern to try to mitigate the
> GPR spilling of XMMs.
>
> I do think that we need to take costs into account at some point and get
> rid of the reload-style hand-waving with !?* in the move patterns.

In general I agree with the direction, but it will be quite hard to do. I know this well from my experience changing the register class cost calculation algorithm in IRA (the experimental code can be found on the ira-select branch). I expect a huge number of test failures and some benchmark performance degradation on practically any target, and a big involvement of target maintainers to fix them. Although it is possible to try to do this one target at a time.
For the testcase I still see 58 GPR <-> XMM vmovq moves at -Ofast -march=znver2, resulting from spilling of XMMs. And there are still cases like

        movq    .LC0(%rip), %rax
...
        vmovq   %rax, %xmm2
        vmovq   %rax, %xmm4

Honza - you said you wanted to have a look here. As I understand Vlad, costing isn't taken into account when choosing alternatives for reloading.
So the bad "head" can be fixed via

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index c74edd1aaef..8f9f26e0a82 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3580,9 +3580,9 @@
 ;; Possible store forwarding (partial memory) stall in alternatives 4, 6 and 7.
 (define_insn "*movdf_internal"
   [(set (match_operand:DF 0 "nonimmediate_operand"
-    "=Yf*f,m ,Yf*f,?r ,!o,?*r ,!o,!o,?r,?m,?r,?r,v,v,v,m,*x,*x,*x,m ,r ,v,r  ,o ,r  ,m")
+    "=Yf*f,m ,Yf*f,?r ,!o,?*r ,!o,!o,?r,?m,?r,?r,v,v,v,m,*x,*x,*x,m ,!r,!v,r  ,o ,r  ,m")
 	(match_operand:DF 1 "general_operand"
-    "Yf*fm,Yf*f,G   ,roF,r ,*roF,*r,F ,rm,rC,C ,F ,C,v,m,v,C ,*x,m ,*x,v,r ,roF,rF,rmF,rC"))]
+    "Yf*fm,Yf*f,G   ,roF,r ,*roF,*r,F ,rm,rC,C ,F ,C,v,m,v,C ,*x,m ,*x,!v,!r,roF,rF,rmF,rC"))]
   "!(MEM_P (operands[0]) && MEM_P (operands[1]))
    && (lra_in_progress || reload_completed
        || !CONST_DOUBLE_P (operands[1])

which adds ! to the r<->v alternatives. That should eventually be done by duplicating the alternatives and enabling one set via some enable attribute based on a tunable. I see those alternatives already have

   (set (attr "preferred_for_speed")
     (cond [(eq_attr "alternative" "3,4")
	      (symbol_ref "TARGET_INTEGER_DFMODE_MOVES")
	    (eq_attr "alternative" "20")
	      (symbol_ref "TARGET_INTER_UNIT_MOVES_FROM_VEC")
	    (eq_attr "alternative" "21")
	      (symbol_ref "TARGET_INTER_UNIT_MOVES_TO_VEC")
	   ]
	   (symbol_ref "true")))

but I am not sure why it's preferred_for_speed here - shouldn't that be enabled for size if !TARGET_INTER_UNIT_MOVES_{TO,FROM}_VEC and otherwise disabled? Not sure if combining enabled and preferred_for_speed is reasonably possible, but we have a preferred_for_size attribute here.

The diff with ! added is quite short; I've yet to measure any effect on LBM:

--- streamcollide.s.orig	2022-04-25 11:37:01.638733951 +0200
+++ streamcollide.s2	2022-04-25 11:35:54.885849296 +0200
@@ -33,28 +33,24 @@
 	.p2align 4
 	.p2align 3
 .L12:
-	movq	.LC0(%rip), %rax
-	vmovsd	.LC4(%rip), %xmm6
+	vmovsd	.LC0(%rip), %xmm2
+	vmovsd	.LC1(%rip), %xmm13
+	movabsq	$0x3ff01878b7a1c25d, %rax
 	movabsq	$0x3fef85af6c69b5a6, %rdi
+	vmovsd	.LC2(%rip), %xmm12
+	vmovsd	.LC3(%rip), %xmm14
 	movabsq	$0x3ff03db8fde2ef4e, %r8
+	movabsq	$0x3fefcea39c51dabe, %r9
+	vmovsd	.LC4(%rip), %xmm6
 	vmovsd	.LC5(%rip), %xmm7
 	movq	.LC8(%rip), %r11
-	movabsq	$0x3fefcea39c51dabe, %r9
 	movq	.LC6(%rip), %rdx
 	movq	.LC7(%rip), %rcx
-	vmovq	%rax, %xmm2
-	vmovq	%rax, %xmm4
-	movq	.LC1(%rip), %rax
 	movq	%r11, %rsi
 	movq	%r11, %r12
-	vmovq	%rax, %xmm13
-	vmovq	%rax, %xmm8
-	movq	.LC2(%rip), %rax
-	vmovq	%rax, %xmm12
-	vmovq	%rax, %xmm5
-	movq	.LC3(%rip), %rax
-	vmovq	%rax, %xmm14
-	movabsq	$0x3ff01878b7a1c25d, %rax
+	vmovsd	%xmm2, %xmm2, %xmm4
+	vmovsd	%xmm13, %xmm13, %xmm8
+	vmovsd	%xmm12, %xmm12, %xmm5
 	vmovsd	%xmm14, -16(%rsp)
 .L5:
 	vmulsd	.LC9(%rip), %xmm0, %xmm3
(In reply to Richard Biener from comment #32)
> The diff with ! added is quite short; I've yet to measure any effect on LBM:
>
> --- streamcollide.s.orig	2022-04-25 11:37:01.638733951 +0200
> +++ streamcollide.s2	2022-04-25 11:35:54.885849296 +0200
> @@ -33,28 +33,24 @@
>  	.p2align 4
>  	.p2align 3
>  .L12:
> -	movq	.LC0(%rip), %rax
> -	vmovsd	.LC4(%rip), %xmm6
> +	vmovsd	.LC0(%rip), %xmm2
> +	vmovsd	.LC1(%rip), %xmm13
> +	movabsq	$0x3ff01878b7a1c25d, %rax
>  	movabsq	$0x3fef85af6c69b5a6, %rdi
> +	vmovsd	.LC2(%rip), %xmm12
> +	vmovsd	.LC3(%rip), %xmm14
>  	movabsq	$0x3ff03db8fde2ef4e, %r8
> +	movabsq	$0x3fefcea39c51dabe, %r9
> +	vmovsd	.LC4(%rip), %xmm6
>  	vmovsd	.LC5(%rip), %xmm7
>  	movq	.LC8(%rip), %r11
> -	movabsq	$0x3fefcea39c51dabe, %r9
>  	movq	.LC6(%rip), %rdx
>  	movq	.LC7(%rip), %rcx
> -	vmovq	%rax, %xmm2
> -	vmovq	%rax, %xmm4
> -	movq	.LC1(%rip), %rax
>  	movq	%r11, %rsi
>  	movq	%r11, %r12
> -	vmovq	%rax, %xmm13
> -	vmovq	%rax, %xmm8
> -	movq	.LC2(%rip), %rax
> -	vmovq	%rax, %xmm12
> -	vmovq	%rax, %xmm5
> -	movq	.LC3(%rip), %rax
> -	vmovq	%rax, %xmm14
> -	movabsq	$0x3ff01878b7a1c25d, %rax
> +	vmovsd	%xmm2, %xmm2, %xmm4
> +	vmovsd	%xmm13, %xmm13, %xmm8
> +	vmovsd	%xmm12, %xmm12, %xmm5
>  	vmovsd	%xmm14, -16(%rsp)
>  .L5:
>  	vmulsd	.LC9(%rip), %xmm0, %xmm3

Huh, and the net effect is that the + code is 9% _slower_!?
As noted, the effect of

  if(...) {
    ux = 0.005;
    uy = 0.002;
    uz = 0.000;
  }

is PRE of most(!) of the dependent instructions, creating

  # prephitmp_1099 = PHI <_1098(6), 6.49971724999999889149648879538290202617645263671875e-1(5)>
  # prephitmp_1111 = PHI <_1110(6), 1.089805708333333178483570691241766326129436492919921875e-1(5)>
...

We successfully coalesce the non-constant incoming register with the result, but have to emit copies for all the constants on the other edge, where we have quite a number of duplicate constants to deal with.

I've experimented with ensuring we get _full_ PRE of the dependent expressions by more aggressively re-associating (giving PHIs with a constant incoming operand on at least one edge a rank similar to constants, 1). This increases the number of PHIs further but reduces the follow-up computations more.

We still fail to simply tail-duplicate the merge block - another possibility to eventually save some of the overhead; our tail duplication code (gimple-ssa-split-paths.cc) doesn't handle this case since the diamond is not the one immediately preceding the loop exit/latch.

The result of "full PRE" is a little bit worse than the current state (so it's not a full solution here).
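To make the mechanism concrete, here is a minimal, hypothetical C sketch (not the LBM source) of the shape being described; whether PRE fires on this exact toy depends on the flags and pass pipeline:

/* Hypothetical sketch of the PRE effect described above.  On the
   'accel' path ux/uy are constants, so the value of u2 is fully
   computable at compile time there; GIMPLE PRE can then insert the
   computation into the non-constant predecessor, fold it to a
   constant (4.35e-05) on the constant predecessor, and replace the
   use after the merge with something like
     prephitmp = PHI <computed(non-accel edge), 4.35e-05(accel edge)>
   (in practice one PHI per intermediate operation).  Each such PHI
   needs a register, or a constant copy on the edge - the pressure
   increase seen in LBM.  */
double
pre_shape (double x, int accel)
{
  double ux = x, uy = x * 0.5;
  if (accel)
    {
      ux = 0.005;
      uy = 0.002;
    }
  double u2 = 1.5 * (ux * ux + uy * uy);
  return u2;
}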
(In reply to Richard Biener from comment #34)
> As noted, the effect of
>
>   if(...) {
>     ux = 0.005;
>     uy = 0.002;
>     uz = 0.000;
>   }
>
> is PRE of most(!) of the dependent instructions, creating
>
>   # prephitmp_1099 = PHI <_1098(6), 6.49971724999999889149648879538290202617645263671875e-1(5)>
>   # prephitmp_1111 = PHI <_1110(6), 1.089805708333333178483570691241766326129436492919921875e-1(5)>
> ...
>
> We successfully coalesce the non-constant incoming register with the result,
> but have to emit copies for all the constants on the other edge, where we
> have quite a number of duplicate constants to deal with.
>
> I've experimented with ensuring we get _full_ PRE of the dependent
> expressions by more aggressively re-associating (giving PHIs with a constant
> incoming operand on at least one edge a rank similar to constants, 1). This
> increases the number of PHIs further but reduces the follow-up computations
> more.
>
> We still fail to simply tail-duplicate the merge block - another possibility
> to eventually save some of the overhead; our tail duplication code
> (gimple-ssa-split-paths.cc) doesn't handle this case since the diamond is
> not the one immediately preceding the loop exit/latch.
>
> The result of "full PRE" is a little bit worse than the current state (so
> it's not a full solution here).

Btw, looking at coverage, the constant case is only an unimportant fraction of the runtime, so the register-pressure increase from the PRE dominates (but the branch is predicted to be 50/50):

3562383000:  241:	if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
  55296000:  242:		ux = 0.005;
  55296000:  243:		uy = 0.002;
  55296000:  244:		uz = 0.000;
         -:  245:	}

We can also see that PGO notices this, and with it we do _not_ perform the PRE.

So the root cause is nothing we can fix for GCC 12; tuning to avoid spilling to GPRs can recover parts of the regression but will definitely have effects elsewhere. Re-targeting to GCC 13.
GCC 13.1 is being released, retargeting bugs to GCC 13.2.
GCC 13.2 is being released, retargeting bugs to GCC 13.3.
GCC 13.3 is being released, retargeting bugs to GCC 13.4.