Created attachment 42246 [details]
r250854 vs r250855 generated code comparison

Compilation options that affect the regression: "-Ofast -march=core-avx2 -mfpmath=sse"

The regression appeared after r250855, although it looks like this commit is not guilty by itself but reveals something in later stages. Changes in the 123t.reassoc1 stage lead to slightly different code generation in the stages that follow it.

The place of interest is in the "inl1130" subroutine (file "innerf.f") - it is part of a big loop with 9 similar expressions on 4-byte float variables:
---------------
y1 = 1.0/sqrt(x1)
y2 = 1.0/sqrt(x2)
y3 = 1.0/sqrt(x3)
y4 = 1.0/sqrt(x4)
y5 = 1.0/sqrt(x5)
y6 = 1.0/sqrt(x6)
y7 = 1.0/sqrt(x7)
y8 = 1.0/sqrt(x8)
y9 = 1.0/sqrt(x9)
---------------

When compiled with "-ffast-math", 1/sqrt is computed with a "vrsqrtss" instruction followed by a Newton-Raphson step consisting of four "vmulss", one "vaddss", and two constants. Like here (part of the r250854 code):
---------------
vrsqrtss xmm12, xmm12, xmm7
vmulss   xmm7, xmm12, xmm7
vmulss   xmm0, xmm12, DWORD PTR .LC2[rip]
vmulss   xmm8, xmm7, xmm12
vaddss   xmm5, xmm8, DWORD PTR .LC1[rip]
vmulss   xmm1, xmm5, xmm0
---------------

The input values (x1-x9) are mostly in xmm registers (x2 and x7 in memory); the output values (y1-y9) are in xmm registers. After r250855 the .LC2 constant goes into xmm7, and x7 also goes into an xmm register. This leads to a lack of temporary registers and, as a result, worse instruction interleaving. See the attached picture with parts of the assembly listings, where corresponding y=1/sqrt sections are painted the same color.

In the end, these 9 lines of code execute about twice as slowly, which leads to a ~10% performance regression for the whole test.

I made two independent attempts to change the code in order to verify the above.

1. To be sure that we lose performance exactly in this part of the loop, I pasted ~60 assembly instructions from the previous revision into the new one (after proper register renaming, of course). This restored performance.

2. To be sure that the problem is due to a lack of temporary registers, I moved the 1/sqrt calculation for the last line into a function call. Like here:
---------------
... in another module, to disable inlining:
      function myrsqrt(x)
      implicit none
      real*4 x
      real*4 myrsqrt
      myrsqrt = 1.0/sqrt(x)
      return
      end function myrsqrt
...
y1 = 1.0/sqrt(x1)
y2 = 1.0/sqrt(x2)
y3 = 1.0/sqrt(x3)
y4 = 1.0/sqrt(x4)
y5 = 1.0/sqrt(x5)
y6 = 1.0/sqrt(x6)
y7 = 1.0/sqrt(x7)
y8 = 1.0/sqrt(x8)
y9 = myrsqrt(x9)
---------------
Even with the call/ret overhead this also restored performance, since it freed some registers.
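For reference, here is a minimal C sketch (not from the bug, just an illustration) of what the six-instruction sequence above computes; the values of .LC1 and .LC2 are assumed to be -3.0 and -0.5, as inferred from the operation order:
---------------
#include <xmmintrin.h>

/* e = vrsqrtss(x): ~12-bit approximation of 1/sqrt(x),
   followed by one Newton-Raphson step to roughly double the precision. */
static inline float rsqrt_nr(float x)
{
    float e = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    /* (x*e*e - 3.0) * (e * -0.5) == e * (1.5 - 0.5*x*e*e),
       matching the vmulss/vaddss/vmulss chain above. */
    return (x * e * e - 3.0f) * (e * -0.5f);
}
---------------
Each refinement keeps x, e, and the two constants live at once, which is why losing even one temporary register hurts the interleaving of the nine independent chains.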
Created attachment 42252 [details]
RSQRT disable patch

Performance can be fixed on newer architectures by disabling the Newton-Raphson step. Unfortunately, that causes a degradation on Haswell.
Hi, do you have any hints on how we should approach this issue? Fix it in the middle end, or disable the affected cases in the backend?
So the Newton-Raphson step causes register pressure to increase, and post-Haswell this makes the code slower than not using rsqrt (thus using sqrtf and a division)?

I wonder whether it would be profitable to SLP vectorize this (of course we're not considering this because SLP vectorization is looking for stores). SLP vectorization would need to do 4 (or 8 with avx256) vector inserts and extracts, but could then do the rsqrt and the Newton-Raphson step together. The argument computation feeding the sqrt is also loop vectorizable, and the ultimate operands even come from contiguous memory. One of the tricky parts would be to see that only the first rsqrt arg is re-used, so taking rinv21 to rinv33 (8 rsqrts) for the vectorization is probably best.

rinv11 = 1.0/sqrt(rsq11)
rinv21 = 1.0/sqrt(rsq21)
rinv31 = 1.0/sqrt(rsq31)
rinv12 = 1.0/sqrt(rsq12)
rinv22 = 1.0/sqrt(rsq22)
rinv32 = 1.0/sqrt(rsq32)
rinv13 = 1.0/sqrt(rsq13)
rinv23 = 1.0/sqrt(rsq23)
rinv33 = 1.0/sqrt(rsq33)
r11 = rsq11*rinv11

What does ICC do to this loop?

I can confirm the regression on our tester (a Haswell machine btw).
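To make the idea concrete, here is a hypothetical C-intrinsics sketch of what such an SLP-vectorized sequence could look like (the function name, the contiguous loads/stores, and the constants are my assumptions, not what any pass currently emits; an actual SLP version would use vector inserts/extracts when the operands sit in scalar registers):
---------------
#include <immintrin.h>

/* One vrsqrtps plus one packed Newton-Raphson step handles
   8 of the 9 values at once instead of 8 scalar chains. */
static void rsqrt8_nr(const float *rsq, float *rinv)
{
    __m256 x = _mm256_loadu_ps(rsq);   /* 8 arguments */
    __m256 e = _mm256_rsqrt_ps(x);     /* ~12-bit 1/sqrt(x) */
    __m256 t = _mm256_mul_ps(_mm256_mul_ps(x, e), e);  /* x*e*e */
    /* 0.5*e*(3 - x*e*e) == e*(1.5 - 0.5*x*e*e) */
    __m256 r = _mm256_mul_ps(_mm256_mul_ps(_mm256_set1_ps(0.5f), e),
                             _mm256_sub_ps(_mm256_set1_ps(3.0f), t));
    _mm256_storeu_ps(rinv, r);
}
---------------
This is essentially the vrsqrtps form ICC reportedly chooses (see the ICC observation below), and it keeps the whole refinement in a few vector temporaries instead of nine scalar chains competing for xmm registers.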
Created attachment 42344 [details]
testcase
(In reply to Richard Biener from comment #3)
> What does ICC do to this loop?

ICC generates vrsqrtps in this case.
Our tester has rebounded early this year - not 100% to previous levels, but to an extent that looks sufficient to close this issue. Alexander, can you confirm that on your side?
Yes, I've checked it - current performance is about at the previous level, and this piece of code takes the same amount of time to execute.
So fixed.