Bug 97127 - FMA3 code transformation leads to slowdown on Skylake
Summary: FMA3 code transformation leads to slowdown on Skylake
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 10.2.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2020-09-20 20:36 UTC by Michael_S
Modified: 2020-09-30 12:09 UTC
5 users

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Michael_S 2020-09-20 20:36:06 UTC
The following clever gcc transformation leads to generation of slower code than the non-transformed original:
 a = *mem;
 a = a + b * c;
where both b and c are reused further down is transformed to:
 a = b
 a = *mem + a * c;

Or, expressing the same in asm terms
 vmovuxx      (mem), %ymmA
 vfnmadd231xx %ymmB, %ymmC, %ymmA
transformed to
 vmovaxx      %ymmB, %ymmA
 vfnmadd213xx (mem), %ymmC, %ymmA

You may ask "why is the transformed variant slower?" and I can try my best to answer (my guess is that the performance bottleneck is in the rename stage rather than in the execution stage, and the transformed code occupies 3 rename slots vs. 2 rename slots for the original), but it would be mostly pointless. What matters is that on Skylake the transformed variant is slower, and I can prove it with a benchmark. BTW, on Haswell too.

You can see comparison of two variants at https://github.com/already5chosen/others/tree/master/cholesky_solver/gcc-badopt-fma3
The interesting spot is starting at line 367 in file chol.cpp.
Or starting two lines below .L21: in asm generated by gcc 10.2.0 (chol_a.s).
Run 's_chol_a 100' vs 's_chol_b 100' and see the difference in favor of the second (de-transformed) variant.
The difference, in this particular case, is small, on the order of 2-4 percent, but very consistent.
In tighter loops I would expect a bigger difference.
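For context, here is a minimal C sketch of the source pattern being described; the function and variable names are made up for illustration, and the real code is in chol.cpp at the link below:

```c
#include <stddef.h>

/* Hypothetical reduction of the loop in chol.cpp (names are invented):
   the accumulator starts from a memory load, and both multiplicands
   stay live across the statement -- the pattern that invites the
   231 -> 213 rewrite shown above. */
static void fnma_row(double *acc, const double *mem,
                     const double *b, double c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        acc[i] = mem[i] - b[i] * c;   /* a = *mem; a = a - b*c */
}

/* Tiny self-checking driver. */
double fnma_row_sum3(void)
{
    double mem[3] = {10.0, 20.0, 30.0};
    double b[3]   = {1.0, 2.0, 3.0};
    double acc[3];
    fnma_row(acc, mem, b, 2.0, 3);
    return acc[0] + acc[1] + acc[2];  /* (10-2) + (20-4) + (30-6) */
}
```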
Comment 1 Richard Biener 2020-09-21 06:54:45 UTC
GCC does not model CPU pipelines in such detail (not to mention that documentation on the CPU side is insufficient).  In principle the vmovaxx %ymmB, %ymmA should be handled at the rename stage and be 'free', and both cases should end up with the same number of uops.

Did you try to throw Intel iaca on it? (not that it is very reliable)
Comment 2 Alexander Monakov 2020-09-21 10:35:20 UTC
Richard, though register moves are resolved by renaming, they still occupy a uop in all stages except execution, and since renaming is one of the narrowest points in the pipeline (only up to 4 uops/cycle on Intel), reducing number of uops generally helps.

In Michael's code the memory operand actually has two address components (base and index):

< 	vmovapd	%ymm1, %ymm10
< 	vmovapd	%ymm1, %ymm11
< 	vfnmadd213pd	(%rdx,%rax), %ymm9, %ymm10
< 	vfnmadd213pd	(%rcx,%rax), %ymm7, %ymm11
---
> 	vmovupd	(%rdx,%rax), %ymm10
> 	vmovupd	(%rcx,%rax), %ymm11
> 	vfnmadd231pd	%ymm1, %ymm9, %ymm10
> 	vfnmadd231pd	%ymm1, %ymm7, %ymm11

The "uop" that carries operands of vfnmadd213pd gets "unlaminated" before renaming (because otherwise there would be too many operands to handle). Hence the original code has 4 uops after decoding, 6 uops before renaming, and the transformed code has 4 uops before renaming. Execution handles 4 uops in both cases.
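The uop arithmetic in the paragraph above can be tallied explicitly; the per-instruction counts below are assumptions following the unlamination rule for indexed memory operands, not measurements:

```c
/* Per-pair uop counts before renaming for the two variants above. */
int uops_before_rename_gcc_variant(void)
{
    int copy = 1;            /* vmovapd %ymm1, %ymmN */
    int fma_indexed_mem = 2; /* vfnmadd213pd (%rdx,%rax),...: unlaminates
                                into a load uop plus an fma uop */
    return 2 * (copy + fma_indexed_mem);   /* 3 per fma, 6 for the pair */
}

int uops_before_rename_detransformed(void)
{
    int load = 1;            /* vmovupd (%rdx,%rax), %ymmN */
    int fma_reg_only = 1;    /* vfnmadd231pd %ymm1,...: stays one uop */
    return 2 * (load + fma_reg_only);      /* 2 per fma, 4 for the pair */
}
```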

FMA unlamination is mentioned in https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes

Michael, you can probably measure it for yourself with

   perf stat -e cycles,instructions,uops_retired.all,uops_retired.retire_slots
Comment 3 Michael_S 2020-09-21 13:40:47 UTC
(In reply to Alexander Monakov from comment #2)
> Richard, though register moves are resolved by renaming, they still occupy a
> uop in all stages except execution, and since renaming is one of the
> narrowest points in the pipeline (only up to 4 uops/cycle on Intel),
> reducing number of uops generally helps.
> 
> In Michael's the actual memory address has two operands:
> 
> < 	vmovapd	%ymm1, %ymm10
> < 	vmovapd	%ymm1, %ymm11
> < 	vfnmadd213pd	(%rdx,%rax), %ymm9, %ymm10
> < 	vfnmadd213pd	(%rcx,%rax), %ymm7, %ymm11
> ---
> > 	vmovupd	(%rdx,%rax), %ymm10
> > 	vmovupd	(%rcx,%rax), %ymm11
> > 	vfnmadd231pd	%ymm1, %ymm9, %ymm10
> > 	vfnmadd231pd	%ymm1, %ymm7, %ymm11
> 
> The "uop" that carries operands of vfnmadd213pd gets "unlaminated" before
> renaming (because otherwise there would be too many operands to handle).
> Hence the original code has 4 uops after decoding, 6 uops before renaming,
> and the transformed code has 4 uops before renaming. Execution handles 4
> uops in both cases.
> 
> FMA unlamination is mentioned in
> https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-
> modes
> 

That's pretty much what I assumed.
More than that, the gcc variant occupies 2 reservation-station entries (2 fused uOps) vs. 4 entries for the de-transformed sequence. But the capacity of the reservation stations is not a bottleneck (and generally is very rarely a bottleneck in tight FP loops), while the rename stage is one of the bottlenecks, supposedly together with decode and L2 read throughput, but the rename limits are more severe than the other two.

Anyway, as I mentioned in the original post, the question of *why* the clever gcc transformation makes execution slower is of lesser interest. What matters is the fact that it makes execution slower.

> Michael, you can probably measure it for yourself with
> 
>    perf stat -e
> cycles,instructions,uops_retired.all,uops_retired.retire_slots

I am playing with this code on msys2. I strongly suspect that 'perf stat' is not going to work here. Also, repeating myself, I am not too interested in *why* the clever code is slower than the simple one. For me it's enough to know that it *is* slower.

[generalization mode on]
As a rule of thumb, I'd say that in inner loops [both on Skylake and on Haswell/Broadwell] FMA3 with a memory source operand is almost never a measurable win relative to a RISCy load+FMA3 sequence, even when the number of uOPs going into the rename stage is the same. Most of the time it's a wash, and sometimes a loss.
So a compiler that works overtime to produce such complex instructions could spend its effort better elsewhere. IMHO.
Outside of inner loops it's different. There [in SSE/AVX code] most of the time the main bottleneck is instruction fetch. So the optimizer should try to emit the shortest sequence, as measured in bytes, and does not have to pay much attention to anything else.
[generalization mode off]
Comment 4 Alexander Monakov 2020-09-21 15:17:41 UTC
> More so, gcc variant occupies 2 reservation station entries (2 fused uOps) vs
> 4 entries by de-transformed sequence.

I don't think this is true for the test at hand. With a base+index memory operand the renaming stage already sees two separate uops for each fma, so the reservation station etc. should also see two for each fma, 4 uops in total. And they will not be fused.

It would be true if the memory operands required just one register (and then pressure on the renaming stage would be the same for both variants).


> For me it's enough to know that it *is* slower.

Understood, but I hope GCC developers want to understand the nature of the slowdown before attempting to fix it.
Comment 5 Hongtao.liu 2020-09-22 08:10:36 UTC
(In reply to Michael_S from comment #3)
> (In reply to Alexander Monakov from comment #2)
> > Richard, though register moves are resolved by renaming, they still occupy a
> > uop in all stages except execution, and since renaming is one of the
> > narrowest points in the pipeline (only up to 4 uops/cycle on Intel),
> > reducing number of uops generally helps.
> > 
> > In Michael's the actual memory address has two operands:
> > 
> > < 	vmovapd	%ymm1, %ymm10
> > < 	vmovapd	%ymm1, %ymm11
> > < 	vfnmadd213pd	(%rdx,%rax), %ymm9, %ymm10
> > < 	vfnmadd213pd	(%rcx,%rax), %ymm7, %ymm11
> > ---
> > > 	vmovupd	(%rdx,%rax), %ymm10
> > > 	vmovupd	(%rcx,%rax), %ymm11
> > > 	vfnmadd231pd	%ymm1, %ymm9, %ymm10
> > > 	vfnmadd231pd	%ymm1, %ymm7, %ymm11
> > 

We can add a peephole2 pattern for this particular situation (assuming the transformation won't hurt performance when the instructions are outside of inner loops), but I'm not sure GCC could handle it with a *global view* (handling them differently inside/outside of a loop).
Comment 6 Michael_S 2020-09-22 10:01:38 UTC
Why do you see it as the addition of a peephole pattern?
I see it as a removal. Like, "do what's written in the source and don't try to be tricky".
Probably I am too removed from how compilers work :(

Or, maybe, handle it at the level of instruction costs?
I don't know how gcc works internally, but it seems that currently the cost of a register move [on Haswell and Skylake] is underestimated.
Although it is true that a register move has no cost in terms of execution ports and latency, it still has the same cost as, say, an integer ALU instruction, in terms of the front end and the renamer.

Also, as pointed out above by Alexander, the cost of FMA3 with (base+index) or (index*scale) memory operands could also be underestimated.

Unlike Alexander, I am not sure that the difference between (base+index) and (base) is really what matters. IMO the cost of FMA3 with *any* memory operand is underestimated, but I am not going to insist on that.


In an ideal world the compiler should reason as I do myself when coding in asm:
estimate which resource is critical in the given loop, and then try to reduce pressure on this particular resource. In this particular loop the critical resource appears to be the renamer, so the cost of instructions should be seen as the cost at the renamer. In other situations the critical resource could be the throughput of a particular issue port. In yet another situation it could be the latency of instructions that form a dependency chain across multiple iterations of the loop.
The latter is especially common in "reduction" algorithms, of which dot product is the most common example.
A single instruction-cost value simply can't cover all these different cases in a satisfactory manner.
Maybe gcc is already there; I don't know.
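The argument above can be caricatured as a back-of-the-envelope throughput model; the resource names and the max-of-bounds rule below are illustrative assumptions, not anything gcc implements:

```c
/* Steady-state cycles per loop iteration are bounded by the most
   contended resource, not by a sum of per-instruction costs. */
double cycles_per_iter(double rename_uops, double rename_width,
                       double port_uops, double port_width,
                       double loop_carried_latency)
{
    double t = rename_uops / rename_width;   /* rename bound */
    double p = port_uops / port_width;       /* issue-port bound */
    if (p > t) t = p;
    if (loop_carried_latency > t)            /* cross-iteration dep chain */
        t = loop_carried_latency;
    return t;
}
```

For example, going from 4 to 6 uops per iteration at a rename width of 4 moves the rename bound from 1.0 to 1.5 cycles per iteration, even though the execution ports see the same amount of work.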
Comment 7 Hongtao.liu 2020-09-23 01:38:31 UTC
(In reply to Michael_S from comment #6)
> Why do you see it as addition of peephole pattern?
> I see it as removal. Like, "do what's written in the source and don't try to
> be tricky".
> Probably, I am too removed from how compilers work :(
> 
> Or, may be, handle it at the level of cost of instructions?

Since the costs inside/outside of a loop are different, it's too rough to use a unified cost to do the estimation.

> I don't know how gcc works internally, but it seems that currently the cost
> of register move [on Haswell and Skylake] is underestimated.
> Although it is true that register move has no cost in terms of execution
> ports and latency, it still has the same cost as, say, integer ALU
> instructions, in terms of the front end and renamer.
> 
> Also, as pointed out above by Alexander, the cost of FMA3 with (base+index)
> or (index*scale) memory operands could also be underestimated.
> 
> Unlike Alexander, I am not sure that the difference between (base+index) and
> (base) is really what matters. IMO, the cost of FMA3 with *any* memory
> operands is underestimated, but I am not going to insist on that.
> 

We can increase the FMA3 cost when it has a memory operand, but I'm not sure it would benefit all cases, especially those not in an inner loop.

> 
> In the ideal world compiler should reason as I do it myself when coding in
> asm:
> estimate which resource is critical in the given loop and then try to reduce
> pressure on this particular resource. In this particular loop a critical
> resource appears to be renamer, so the cost of instructions should be seen
> as cost at renamer. In other situations critical resource could be
> throughput of particular issue port. In yet another situation it could be
> latency of instructions that form dependency chain across multiple

AFAIK, resources like issue ports are only considered in the software scheduler, but the scheduler won't do an instruction transformation like that; even the scheduler itself is of limited benefit when the target supports out-of-order execution.

> iterations of the loop.
> The later is especially common in "reduction" algorithms of which dot
> product is the most common example.
> A single value of instruction cost simply can't cover all these different
> cases in satisfactory manner.
> May be, gcc it's already here, I don't know.
Comment 8 Michael_S 2020-09-23 17:49:27 UTC
What are values of gcc "loop" cost of the relevant instructions now?
1. AVX256 Load
2. FMA3 ymm,ymm,ymm
3. AVX256 Regmove
4. FMA3 mem,ymm,ymm
Comment 9 Hongtao.liu 2020-09-24 03:23:41 UTC
(In reply to Michael_S from comment #8)
> What are values of gcc "loop" cost of the relevant instructions now?
> 1. AVX256 Load
> 2. FMA3 ymm,ymm,ymm
> 3. AVX256 Regmove
> 4. FMA3 mem,ymm,ymm

For Skylake, outside of register allocation, they are:
1. AVX256 Load  ---- 10
2. FMA3 ymm,ymm,ymm --- 16
3. AVX256 Regmove  --- 2
4. FMA3 mem,ymm,ymm --- 32

In RA there is no direct cost for FMA instructions, but we can disparage the memory alternative in FMA instructions; then again, it may hurt performance in some cases.

1. AVX256 Load  ---- 10
3. AVX256 Regmove  --- 2

BTW: we have done a lot of experiments with different cost models and saw no significant performance impact on SPEC2017.
Comment 10 Michael_S 2020-09-24 08:28:21 UTC
(In reply to Hongtao.liu from comment #9)
> (In reply to Michael_S from comment #8)
> > What are values of gcc "loop" cost of the relevant instructions now?
> > 1. AVX256 Load
> > 2. FMA3 ymm,ymm,ymm
> > 3. AVX256 Regmove
> > 4. FMA3 mem,ymm,ymm
> 
> For skylake, outside of register allocation.
> 
> they are
> 1. AVX256 Load  ---- 10
> 2. FMA3 ymm,ymm,ymm --- 16
> 3. AVX256 Regmove  --- 2
> 4. FMA3 mem,ymm,ymm --- 32
> 
> In RA, no direct cost for fma instrcutions, but we can disparage memory
> alternative in FMA instructions, but again, it may hurt performance in some
> cases.
> 
> 1. AVX256 Load  ---- 10
> 3. AVX256 Regmove  --- 2
> 
> BTW: we have done a lot of experiments with different cost models and no
> significant performance impact on SPEC2017.

Thank you.
With relative costs like these, gcc should generate 'FMA3 mem,ymm,ymm' only under heavy register pressure. So why does it generate it in my loop, where register pressure in the innermost loop is light, and even at the next outer level the pressure isn't heavy?
What am I missing?
Comment 11 Hongtao.liu 2020-09-24 10:06:54 UTC
(In reply to Michael_S from comment #10)
> (In reply to Hongtao.liu from comment #9)
> > (In reply to Michael_S from comment #8)
> > > What are values of gcc "loop" cost of the relevant instructions now?
> > > 1. AVX256 Load
> > > 2. FMA3 ymm,ymm,ymm
> > > 3. AVX256 Regmove
> > > 4. FMA3 mem,ymm,ymm
> > 
> > For skylake, outside of register allocation.
> > 
> > they are
> > 1. AVX256 Load  ---- 10
> > 2. FMA3 ymm,ymm,ymm --- 16
> > 3. AVX256 Regmove  --- 2
> > 4. FMA3 mem,ymm,ymm --- 32
> > 
> > In RA, no direct cost for fma instrcutions, but we can disparage memory
> > alternative in FMA instructions, but again, it may hurt performance in some
> > cases.
> > 
> > 1. AVX256 Load  ---- 10
> > 3. AVX256 Regmove  --- 2
> > 
> > BTW: we have done a lot of experiments with different cost models and no
> > significant performance impact on SPEC2017.
> 
> Thank you.
> With relative costs like these gcc should generate 'FMA3 mem,ymm,ymm' only
> in conditions of heavy registers pressure. So, why it generates it in my
> loop, where registers pressure in the innermost loop is light and even in
> the next outer level the pressure isn't heavy?
> What am I missing?

the actual transformation gcc did is

vmovuxx (mem1), %ymmA     pass_combine     
vmovuxx (mem), %ymmD         ---->     vmovuxx   (mem1), %ymmA
vfmadd213 %ymmD,%ymmC,%ymmA            vfmadd213 (mem),%ymmC,%ymmA

then RA works like
                            RA
vmovuxx (mem1), %ymmA      ---->  %vmovaps %ymmB, %ymmA
vfmadd213 (mem),%ymmC,%ymmA       vfmadd213 (mem),%ymmC,%ymmA

It looks like the following, but is actually not this one:

 vmovuxx      (mem), %ymmA
 vfnmadd231xx %ymmB, %ymmC, %ymmA
transformed to
 vmovaxx      %ymmB, %ymmA
 vfnmadd213xx (mem), %ymmC, %ymmA

ymmB is allocated for (mem1), not for (mem).
Comment 12 Hongtao.liu 2020-09-24 10:46:04 UTC
Correcting the AVX256 load cost outside of register allocation and the vectorizer:

> they are
> 1. AVX256 Load  ---- 16
> 2. FMA3 ymm,ymm,ymm --- 16
> 3. AVX256 Regmove  --- 2
> 4. FMA3 mem,ymm,ymm --- 32

That's why pass_combine would combine an *AVX256 load* and *FMA3 ymm,ymm,ymm* into *FMA3 mem,ymm,ymm*.
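With the corrected load cost, the cost comparison that makes pass_combine fire can be sketched like this; the explicit rule below is an assumption about how the costs are weighed, not gcc's actual code:

```c
/* Costs quoted in this comment: a separate load plus a register-only FMA
   total 32, the same as the memory-operand FMA, so by these costs the
   merged form is never more expensive and it saves an instruction. */
int combine_is_not_worse(void)
{
    int avx256_load = 16, fma3_reg = 16, fma3_mem = 32;
    return fma3_mem <= avx256_load + fma3_reg;  /* nonzero: combine */
}
```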
Comment 13 Michael_S 2020-09-24 12:38:49 UTC
(In reply to Hongtao.liu from comment #11)
> (In reply to Michael_S from comment #10)
> > (In reply to Hongtao.liu from comment #9)
> > > (In reply to Michael_S from comment #8)
> > > > What are values of gcc "loop" cost of the relevant instructions now?
> > > > 1. AVX256 Load
> > > > 2. FMA3 ymm,ymm,ymm
> > > > 3. AVX256 Regmove
> > > > 4. FMA3 mem,ymm,ymm
> > > 
> > > For skylake, outside of register allocation.
> > > 
> > > they are
> > > 1. AVX256 Load  ---- 10
> > > 2. FMA3 ymm,ymm,ymm --- 16
> > > 3. AVX256 Regmove  --- 2
> > > 4. FMA3 mem,ymm,ymm --- 32
> > > 
> > > In RA, no direct cost for fma instrcutions, but we can disparage memory
> > > alternative in FMA instructions, but again, it may hurt performance in some
> > > cases.
> > > 
> > > 1. AVX256 Load  ---- 10
> > > 3. AVX256 Regmove  --- 2
> > > 
> > > BTW: we have done a lot of experiments with different cost models and no
> > > significant performance impact on SPEC2017.
> > 
> > Thank you.
> > With relative costs like these gcc should generate 'FMA3 mem,ymm,ymm' only
> > in conditions of heavy registers pressure. So, why it generates it in my
> > loop, where registers pressure in the innermost loop is light and even in
> > the next outer level the pressure isn't heavy?
> > What am I missing?
> 
> the actual transformation gcc did is
> 
> vmovuxx (mem1), %ymmA     pass_combine     
> vmovuxx (mem), %ymmD         ---->     vmovuxx   (mem1), %ymmA
> vfmadd213 %ymmD,%ymmC,%ymmA            vfmadd213 (mem),%ymmC,%ymmA
> 
> then RA works like
>                             RA
> vmovuxx (mem1), %ymmA      ---->  %vmovaps %ymmB, %ymmA
> vfmadd213 (mem),%ymmC,%ymmA       vfmadd213 (mem),%ymmC,%ymmA
> 
> it "look like" but actually not this one.
> 
>  vmovuxx      (mem), %ymmA
>  vfnmadd231xx %ymmB, %ymmC, %ymmA
> transformed to
>  vmovaxx      %ymmB, %ymmA
>  vfnmadd213xx (mem), %ymmC, %ymmA
> 
> ymmB is allocate for (mem1) not (mem)

Thank you.
Now the compiler's reasoning is starting to make more sense.
Still, I don't understand why the compiler does not compare the cost of the full loop body after combining with the cost before combining, and does not come to the conclusion that combining increased the cost.
Comment 14 Hongtao.liu 2020-09-25 05:24:21 UTC
> Still I don't understand why compiler does not compare the cost of full loop
> body after combining to the cost before combining and does not come to
> conclusion that combining increased the cost.

As Richard says, GCC does not model CPU pipelines in such detail.
Comment 15 Michael_S 2020-09-25 13:21:27 UTC
(In reply to Hongtao.liu from comment #14)
> > Still I don't understand why compiler does not compare the cost of full loop
> > body after combining to the cost before combining and does not come to
> > conclusion that combining increased the cost.
> 
> As Richard says, GCC does not model CPU pipelines in such detail.

I don't understand what "details" you have in mind.
The costs of the instructions that you quoted above look fine.
But for a reason I don't understand, the compiler chose the more costly "combined" code sequence over the less costly (according to its own cost model) "RISCy" sequence.
Comment 16 Alexander Monakov 2020-09-25 14:02:15 UTC
Mostly because prior to register allocation the compiler does not naturally see that x = *mem + a*b will need an extra mov when both 'a' and 'b' are live (as in that case registers allocated for them cannot be immediately reused for 'x').

(this is why RTL combine merges the two instructions, but in the isolated testcase below we already have the combined form thanks to tree-TER)

Isolated testcase, by the way (gcc -O2 -mfma):

double f(double t, double a, double b, double c[], long i)
{
    t = __builtin_fma(a, b, c[i]);
    asm("" :: "x"(t), "x"(a), "x"(b));
    return t;
}
Comment 17 Alexander Monakov 2020-09-25 15:55:12 UTC
To me this suggests that in fact it's okay to carry the combined form in RTL up to register allocation, but RA should decompose it to load+fma instead of inserting a register copy that preserves the live operand.

(not a fan of "RA should deal with it" theme, but here it does look like the most appropriate resolution)
Comment 18 Richard Sandiford 2020-09-30 12:09:20 UTC
Not sure whether this is a good idea or not, but: combine could look at the constraints of the instruction it matches, compare that with the information it has about whether the inputs remain live after the instruction, and cost in an extra move if all alternatives require a reload.