Created attachment 42246 [details]
r250854 vs r250855 generated code comparison

Compilation options that affect the regression: "-Ofast -march=core-avx2 -mfpmath=sse"

The regression appeared after r250855, although it looks like this commit is not guilty by itself but reveals something in later stages. Changes in the 123t.reassoc1 stage lead to slightly different code generation in the stages that follow it.

The place of interest is in the "inl1130" subroutine (file "innerf.f") - it is part of a big loop with 9 similar expressions on 4-byte float variables:
---------------
y1 = 1.0/sqrt(x1)
y2 = 1.0/sqrt(x2)
y3 = 1.0/sqrt(x3)
y4 = 1.0/sqrt(x4)
y5 = 1.0/sqrt(x5)
y6 = 1.0/sqrt(x6)
y7 = 1.0/sqrt(x7)
y8 = 1.0/sqrt(x8)
y9 = 1.0/sqrt(x9)
---------------

When compiled with "-ffast-math", 1/sqrt is computed with a "vrsqrtss" instruction followed by a Newton-Raphson step consisting of four "vmulss", one "vaddss", and two constants. Like here (part of the r250854 code):
---------------
vrsqrtss xmm12, xmm12, xmm7
vmulss   xmm7, xmm12, xmm7
vmulss   xmm0, xmm12, DWORD PTR .LC2[rip]
vmulss   xmm8, xmm7, xmm12
vaddss   xmm5, xmm8, DWORD PTR .LC1[rip]
vmulss   xmm1, xmm5, xmm0
---------------

The input values (x1-x9) are mostly in xmm registers (x2 and x7 in memory); the output values (y1-y9) are in xmm registers. After r250855 the .LC2 constant goes into xmm7, and x7 also goes into an xmm register. This leads to a lack of temporary registers and, as a result, worse instruction interleaving. See the attached picture with parts of the assembly listings, where corresponding y=1/sqrt sections are painted the same color.

In the end, these 9 lines of code execute about twice as slowly, which leads to a ~10% performance regression for the whole test.

I made two independent attempts to change the code in order to verify the above.

1. To be sure that we lose performance exactly in this part of the loop, I pasted ~60 assembly instructions from the previous revision into the new one (after proper register renaming, of course). This restored performance.

2. To be sure that the problem is due to a lack of temporary registers, I moved the 1/sqrt calculation for the last line into a function call. Like here:
---------------
... in another module, to disable inlining:
      function myrsqrt(x)
      implicit none
      real*4 x
      real*4 myrsqrt
      myrsqrt = 1.0/sqrt(x)
      return
      end function myrsqrt
...
y1 = 1.0/sqrt(x1)
y2 = 1.0/sqrt(x2)
y3 = 1.0/sqrt(x3)
y4 = 1.0/sqrt(x4)
y5 = 1.0/sqrt(x5)
y6 = 1.0/sqrt(x6)
y7 = 1.0/sqrt(x7)
y8 = 1.0/sqrt(x8)
y9 = myrsqrt(x9)
---------------
Even with the call/ret overhead this also restored performance, since it freed some registers.
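For reference, here is a minimal C sketch (not from the bug, just an illustration) of what the six-instruction sequence above computes; the values of .LC1 and .LC2 are assumed to be -3.0 and -0.5, as inferred from the operation order:
---------------
#include <xmmintrin.h>

/* e = vrsqrtss(x): ~12-bit approximation of 1/sqrt(x),
   followed by one Newton-Raphson step to roughly double the precision. */
static inline float rsqrt_nr(float x)
{
    float e = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    /* (x*e*e - 3.0) * (e * -0.5) == e * (1.5 - 0.5*x*e*e),
       matching the vmulss/vaddss/vmulss chain above. */
    return (x * e * e - 3.0f) * (e * -0.5f);
}
---------------
Each refinement keeps x, e, and the two constants live at once, which is why losing even one temporary register hurts the interleaving of the nine independent chains.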
Created attachment 42252 [details]
RSQRT disable patch

Performance can be fixed on newer architectures by disabling the Newton-Raphson step. Unfortunately, that causes a degradation on Haswell.
Hi, do you have any hints on how we should approach this issue? Fix it in the middle end, or disable the affected cases in the backend?
So the Newton-Raphson step causes register pressure to increase, and post-Haswell this makes the code slower than not using rsqrt (thus using sqrtf and a division)?

I wonder whether it would be profitable to SLP vectorize this (of course we're not considering this because SLP vectorization is looking for stores). SLP vectorization would need to do 4 (or 8 with avx256) vector inserts and extracts, but could then do the rsqrt and the Newton-Raphson step together. The argument computation feeding the sqrt is also loop vectorizable, and the ultimate operands even come from contiguous memory. One of the tricky parts would be to see that only the first rsqrt arg is re-used, so taking rinv21 to rinv33 (8 rsqrts) for the vectorization is probably best.

rinv11 = 1.0/sqrt(rsq11)
rinv21 = 1.0/sqrt(rsq21)
rinv31 = 1.0/sqrt(rsq31)
rinv12 = 1.0/sqrt(rsq12)
rinv22 = 1.0/sqrt(rsq22)
rinv32 = 1.0/sqrt(rsq32)
rinv13 = 1.0/sqrt(rsq13)
rinv23 = 1.0/sqrt(rsq23)
rinv33 = 1.0/sqrt(rsq33)
r11 = rsq11*rinv11

What does ICC do to this loop?

I can confirm the regression on our tester (a Haswell machine btw).
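To make the idea concrete, here is a hypothetical C-intrinsics sketch of what such an SLP-vectorized sequence could look like (the function name, the contiguous loads/stores, and the constants are my assumptions, not what any pass currently emits; an actual SLP version would use vector inserts/extracts when the operands sit in scalar registers):
---------------
#include <immintrin.h>

/* One vrsqrtps plus one packed Newton-Raphson step handles
   8 of the 9 values at once instead of 8 scalar chains. */
static void rsqrt8_nr(const float *rsq, float *rinv)
{
    __m256 x = _mm256_loadu_ps(rsq);   /* 8 arguments */
    __m256 e = _mm256_rsqrt_ps(x);     /* ~12-bit 1/sqrt(x) */
    __m256 t = _mm256_mul_ps(_mm256_mul_ps(x, e), e);  /* x*e*e */
    /* 0.5*e*(3 - x*e*e) == e*(1.5 - 0.5*x*e*e) */
    __m256 r = _mm256_mul_ps(_mm256_mul_ps(_mm256_set1_ps(0.5f), e),
                             _mm256_sub_ps(_mm256_set1_ps(3.0f), t));
    _mm256_storeu_ps(rinv, r);
}
---------------
This is essentially the vrsqrtps form ICC reportedly chooses (see the ICC observation below), and it keeps the whole refinement in a few vector temporaries instead of nine scalar chains competing for xmm registers.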
Created attachment 42344 [details]
testcase
(In reply to Richard Biener from comment #3)
> What does ICC do to this loop?

ICC generates vrsqrtps in this case.
Our tester has rebounded early this year - not 100% to previous levels, but to an extent that looks sufficient to close this issue. Alexander, can you confirm that on your side?
Yes, I've checked it - current performance is about at the previous level, and this piece of code takes the same amount of time to execute.
So fixed.