On x86, r265398 caused:

FAIL: gcc.target/i386/pr57193.c scan-assembler-times movdqa 2

	movdqa	(%rdi), %xmm2
	pavgb	(%rsi), %xmm2
	movdqa	%xmm0, %xmm3	<<<?
	movdqa	%xmm2, %xmm0	<<<?
	punpckhbw	%xmm1, %xmm2
	punpcklbw	%xmm1, %xmm0
A slightly older compiler gave

test1:
	movdqa	(%rdi), %xmm2
	pavgb	(%rsi), %xmm2
	movdqa	%xmm2, %xmm3
	punpckhbw	%xmm1, %xmm2
	punpcklbw	%xmm1, %xmm3
	pmulhuw	%xmm0, %xmm2
	pmulhuw	%xmm0, %xmm3
	packuswb	%xmm2, %xmm3
	movaps	%xmm3, (%rdx)
	ret

What is so super strange about the current generated code?
We currently generate:

test1:
	movdqa	(%rdi), %xmm2
	pavgb	(%rsi), %xmm2
	movdqa	%xmm0, %xmm3
	movdqa	%xmm2, %xmm0
	punpckhbw	%xmm1, %xmm2
	punpcklbw	%xmm1, %xmm0
	pmulhuw	%xmm3, %xmm2
	pmulhuw	%xmm3, %xmm0
	packuswb	%xmm2, %xmm0
	movaps	%xmm0, (%rdx)
	ret

One of

	movdqa	%xmm0, %xmm3
	movdqa	%xmm2, %xmm0

is redundant. We should generate

	movdqa	%xmm2, %xmm3
(and swap xmm0 and xmm3 in all later instructions). Yes. But it seems IRA doesn't figure this out.
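To make the failing check concrete, here is a minimal sketch that counts `movdqa` occurrences in the two listings quoted above. It only approximates what `scan-assembler-times` does (a regex match count over the assembler output); the helper name `scan_assembler_times` and the string constants are illustrative, not part of DejaGnu or the GCC testsuite.

```python
import re

# Listings copied from the comments above (older compiler vs. current trunk).
OLDER_ASM = """\
test1:
	movdqa	(%rdi), %xmm2
	pavgb	(%rsi), %xmm2
	movdqa	%xmm2, %xmm3
	punpckhbw	%xmm1, %xmm2
	punpcklbw	%xmm1, %xmm3
	pmulhuw	%xmm0, %xmm2
	pmulhuw	%xmm0, %xmm3
	packuswb	%xmm2, %xmm3
	movaps	%xmm3, (%rdx)
	ret
"""

CURRENT_ASM = """\
test1:
	movdqa	(%rdi), %xmm2
	pavgb	(%rsi), %xmm2
	movdqa	%xmm0, %xmm3
	movdqa	%xmm2, %xmm0
	punpckhbw	%xmm1, %xmm2
	punpcklbw	%xmm1, %xmm0
	pmulhuw	%xmm3, %xmm2
	pmulhuw	%xmm3, %xmm0
	packuswb	%xmm2, %xmm0
	movaps	%xmm0, (%rdx)
	ret
"""


def scan_assembler_times(asm: str, pattern: str) -> int:
    """Hypothetical stand-in for the testsuite check: count regex matches."""
    return len(re.findall(pattern, asm))


older_count = scan_assembler_times(OLDER_ASM, "movdqa")
current_count = scan_assembler_times(CURRENT_ASM, "movdqa")
print(f"older: {older_count}, current: {current_count}, expected by the test: 2")
```

The older listing has exactly the two `movdqa` instructions the test expects; the current one has a third, which is the extra register-to-register move discussed here.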
I don't think it can be easily fixed.

We have the following code in IRA (here - means a removed insn, pref means the preferred hard reg for the destination pseudo, a hard reg in () means the assigned hard reg, and copy and constrain mean a preference for two pseudos to have the same hard reg):

-28: r109(di)=di; REG_DEAD di;pref di
-29: r110(si)=si; REG_DEAD si;pref si
-30: r111(dx)=dx; REG_DEAD dx;pref dx
-31: r112(xmm0)=xmm0; REG_DEAD xmm0;pref xmm0
  5: r100(xmm3)=r112(xmm0); REG_DEAD r112 ->copy(100,112)
-32: r113(xmm1)=xmm1; REG_DEAD xmm1;pref xmm1
 -6: r101(xmm1)=r113(xmm1); REG_DEAD r113 ->copy(101,113)
 10: r103(xmm2)=[r109(di)]; REG_DEAD r109
 11: r102(xmm2)=trunc(zero_extend(r103(xmm2))+zero_extend([r110(si)])+const_vector 0>>0x1);REG_DEAD r110,r103->constrain(102,103)
 14: r104(xmm0)=vec_select(vec_concat(r102(xmm2),r101(xmm1)),parallel)
 16: r105(xmm2)=vec_select(vec_concat(r102(xmm2),r101(xmm1)),parallel); REG_DEAD r102, r101->constrain(102,105)
 19: r106(xmm0)=trunc(zero_extend(r104(xmm0))*zero_extend(r100(xmm3)) 0>>0x10); REG_DEAD r104->constrain(106,104)
 21: r107(xmm2)=trunc(zero_extend(r105(xmm2))*zero_extend(r100(xmm3)) 0>>0x10); REG_DEAD r105, r100->constrain(107,105)(107,100)
 23: r108(xmm0)=vec_concat(us_truncate(r106(xmm0)),us_truncate(r107(xmm2))); REG_DEAD r107, r106->constrain(108,106)
 25: [r111(dx)]=r108(xmm0); REG_DEAD r111, r108

We form threads of pseudos to have the same hard reg:

Threads:
  1. freq 9000: a2r107(2000) a5r105(2000) a8r102(3000) a10r103(2000)
  2. freq 6000: a1r108(2000) a3r106(2000) a6r104(2000)
  3. freq 5000: a4r100(3000) a13r112(2000); pref xmm0
  4. freq 5000: a7r101(3000) a12r113(2000); pref xmm1

Then the coloring algorithm prefers pushing pseudos onto the coloring stack by threads when the other priorities are the same.
In this case we assign by threads basically:

  r102 -- assign reg 22(xmm2)
  r107 -- assign reg 22(xmm2)
  r105 -- assign reg 22(xmm2)
  r103 -- assign reg 22(xmm2)
  r108 -- assign reg 20(xmm0)
  r106 -- assign reg 20(xmm0)
  r104 -- assign reg 20(xmm0)
  r100 -- assign reg 23(xmm3)
  r112 -- assign reg 20(xmm0)
  r101 -- assign reg 21(xmm1)
  r113 -- assign reg 21(xmm1)
  r111 -- assign reg 1(dx)
  r110 -- assign reg 4(si)
  r109 -- assign reg 5(di)

We assign xmm2 (the first SSE reg after xmm0 and xmm1) to pseudos in the 1st thread because threads 3 and 4 prefer xmm0 and xmm1.

In LRA:

As insn 14 requires r104 and r102 to be in the same hard reg, we generate an additional insn:

  r114(xmm0) = r102(xmm2)

We could get the desired allocation if we started assignment with pseudos from lower-priority threads (in order to assign xmm3 to pseudos from the first thread). But it would worsen performance in the common case.

RA is all about heuristic solutions. In some cases they work, in some cases they don't. We should see the whole picture. Actually, in this case RA removes 5 copies out of 6 and satisfies 5 out of 6 2-op constraints without additional moves.

Probably an additional RA subpass which swaps pseudo-register assignments in order to improve the allocation could help. But right now I don't see how to implement this effectively or whether it is really worth doing.
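The 2-op constraint count can be re-derived from the assignment listed above with a small sketch. This is not IRA or LRA code: the table layout and the helper name `unsatisfied` are illustrative, and the choice of which source operand is tied to the destination in each insn is an assumption (the first register source, which is what insn 14 needs according to the LRA note).

```python
# Hard registers assigned to the SSE pseudos, as listed in the comment above.
ASSIGNMENT = {
    "r100": "xmm3", "r101": "xmm1", "r102": "xmm2", "r103": "xmm2",
    "r104": "xmm0", "r105": "xmm2", "r106": "xmm0", "r107": "xmm2",
    "r108": "xmm0",
}

# (insn number, destination pseudo, source pseudo tied to the destination).
# Assumption: the tied source is the first register source of each insn.
# Insn 21 could tie to r100 instead of r105, but r105 already matches.
TWO_OP_INSNS = [
    (11, "r102", "r103"),
    (14, "r104", "r102"),
    (16, "r105", "r102"),
    (19, "r106", "r104"),
    (21, "r107", "r105"),
    (23, "r108", "r106"),
]


def unsatisfied(insns, assignment):
    """Return the insns whose destination and tied source got different hard regs."""
    return [insn for insn, dst, src in insns if assignment[dst] != assignment[src]]


need_move = unsatisfied(TWO_OP_INSNS, ASSIGNMENT)
satisfied = len(TWO_OP_INSNS) - len(need_move)
print(f"{satisfied} of {len(TWO_OP_INSNS)} 2-op constraints satisfied; "
      f"extra move needed for insns {need_move}")
```

Under these assumptions only insn 14 mismatches (r104 in xmm0, r102 in xmm2), which corresponds to the single extra move r114(xmm0) = r102(xmm2) generated in LRA.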
GCC 9.1 has been released.
Hi,

If this test has been failing for quite some time and the fix seems complex to write, should the test be marked as XFAIL for now?

Cheers,
Romain
GCC 9.2 has been released.
GCC 9.3.0 has been released, adjusting target milestone.
The master branch has been updated by Martin Liska <marxin@gcc.gnu.org>:

https://gcc.gnu.org/g:291aa50a63194245ad3ab8bd584f9c0286c5b44c

commit r10-7459-g291aa50a63194245ad3ab8bd584f9c0286c5b44c
Author: Martin Liska <mliska@suse.cz>
Date:   Mon Mar 30 17:49:27 2020 +0200

    XFAIL pr57193.c test-case.

            PR rtl-optimization/87716
            * gcc.target/i386/pr57193.c: XFAIL a test-case.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed.
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.
GCC 11 branch is being closed.