Bug 87716 - [11/12/13/14 Regression] FAIL: gcc.target/i386/pr57193.c scan-assembler-times movdqa 2
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 9.0
Importance: P2 normal
Target Milestone: 11.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra, testsuite-fail, xfail
Depends on:
Blocks:
 
Reported: 2018-10-24 00:09 UTC by H.J. Lu
Modified: 2023-07-07 10:34 UTC

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2023-05-30 00:00:00


Description H.J. Lu 2018-10-24 00:09:46 UTC
On x86, r265398 caused:

FAIL: gcc.target/i386/pr57193.c scan-assembler-times movdqa 2

	movdqa	(%rdi), %xmm2
	pavgb	(%rsi), %xmm2
	movdqa	%xmm0, %xmm3 <<<?
	movdqa	%xmm2, %xmm0 <<<?
	punpckhbw	%xmm1, %xmm2
	punpcklbw	%xmm1, %xmm0
Comment 1 Segher Boessenkool 2018-10-24 01:21:35 UTC
A slightly older compiler gave

test1:
        movdqa  (%rdi), %xmm2
        pavgb   (%rsi), %xmm2
        movdqa  %xmm2, %xmm3
        punpckhbw       %xmm1, %xmm2
        punpcklbw       %xmm1, %xmm3
        pmulhuw %xmm0, %xmm2
        pmulhuw %xmm0, %xmm3
        packuswb        %xmm2, %xmm3
        movaps  %xmm3, (%rdx)
        ret

What is so super strange about the current generated code?
Comment 2 H.J. Lu 2018-10-24 01:37:15 UTC
We currently generate:

test1:
	movdqa	(%rdi), %xmm2
	pavgb	(%rsi), %xmm2
	movdqa	%xmm0, %xmm3
	movdqa	%xmm2, %xmm0
	punpckhbw	%xmm1, %xmm2
	punpcklbw	%xmm1, %xmm0
	pmulhuw	%xmm3, %xmm2
	pmulhuw	%xmm3, %xmm0
	packuswb	%xmm2, %xmm0
	movaps	%xmm0, (%rdx)
	ret

One of

	movdqa	%xmm0, %xmm3
	movdqa	%xmm2, %xmm0

is redundant. We should generate

        movdqa  %xmm2, %xmm3
Comment 3 Segher Boessenkool 2018-10-24 22:17:06 UTC
(and swap xmm0 and xmm3 in all later instructions).

Yes.  But it seems IRA doesn't figure this out.
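
For reference, the renaming described above can be checked mechanically. This is only a small sketch over the two listings quoted in Comments 1 and 2 (whitespace normalized); the helper names are made up for illustration and none of this is GCC code. It drops the `movdqa %xmm0, %xmm3` copy, swaps xmm0 and xmm3 in every later instruction, and compares the result with the older output.

```python
import re

# Output quoted in Comment 2 (current compiler).
current = [
    "movdqa (%rdi), %xmm2",
    "pavgb (%rsi), %xmm2",
    "movdqa %xmm0, %xmm3",
    "movdqa %xmm2, %xmm0",
    "punpckhbw %xmm1, %xmm2",
    "punpcklbw %xmm1, %xmm0",
    "pmulhuw %xmm3, %xmm2",
    "pmulhuw %xmm3, %xmm0",
    "packuswb %xmm2, %xmm0",
    "movaps %xmm0, (%rdx)",
    "ret",
]

# Output quoted in Comment 1 (slightly older compiler).
older = [
    "movdqa (%rdi), %xmm2",
    "pavgb (%rsi), %xmm2",
    "movdqa %xmm2, %xmm3",
    "punpckhbw %xmm1, %xmm2",
    "punpcklbw %xmm1, %xmm3",
    "pmulhuw %xmm0, %xmm2",
    "pmulhuw %xmm0, %xmm3",
    "packuswb %xmm2, %xmm3",
    "movaps %xmm3, (%rdx)",
    "ret",
]


def swap_regs(insn, a, b):
    """Exchange register names a and b within one instruction string."""
    mapping = {a: b, b: a}
    pattern = re.compile("|".join(re.escape(r) for r in (a, b)))
    return pattern.sub(lambda m: mapping[m.group(0)], insn)


def drop_copy_and_rename(insns, copy, a, b):
    """Remove the first occurrence of `copy`, then swap a/b in all later insns."""
    i = insns.index(copy)
    return insns[:i] + [swap_regs(x, a, b) for x in insns[i + 1:]]


def count_movdqa(insns):
    """Count movdqa mnemonics, as scan-assembler-times would for this listing."""
    return sum(1 for x in insns if x.split()[0] == "movdqa")


rewritten = drop_copy_and_rename(current, "movdqa %xmm0, %xmm3", "%xmm0", "%xmm3")
print(rewritten == older)                                # True
print(count_movdqa(current), count_movdqa(rewritten))    # 3 2
```

The rewritten listing matches the older output exactly and has the two movdqa instructions the test expects, against three in the current output.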
Comment 4 Vladimir Makarov 2019-03-01 22:38:22 UTC
  I don't think it can be easily fixed.  We have the following code in
IRA (here "-" marks a removed insn, "pref" is the preferred hard reg for
the destination pseudo, a hard reg in () is the assigned hard reg, and
"copy" and "constrain" mean that two pseudos prefer the same hard reg):

  -28: r109(di)=di; REG_DEAD di;pref di
  -29: r110(si)=si; REG_DEAD si;pref si
  -30: r111(dx)=dx; REG_DEAD dx;pref dx
  -31: r112(xmm0)=xmm0; REG_DEAD xmm0;pref xmm0
    5: r100(xmm3)=r112(xmm0); REG_DEAD r112 ->copy(100,112)
  -32: r113(xmm1)=xmm1; REG_DEAD xmm1;pref xmm1
   -6: r101(xmm1)=r113(xmm1); REG_DEAD r113 ->copy(101,113)
   10: r103(xmm2)=[r109(di)]; REG_DEAD r109
   11: r102(xmm2)=trunc(zero_extend(r103(xmm2))+zero_extend([r110(si)])+const_vector 0>>0x1);REG_DEAD r110,r103->constrain(102,103)
   14: r104(xmm0)=vec_select(vec_concat(r102(xmm2),r101(xmm1)),parallel)
   16: r105(xmm2)=vec_select(vec_concat(r102(xmm2),r101(xmm1)),parallel); REG_DEAD r102, r101->constrain(102,105)
   19: r106(xmm0)=trunc(zero_extend(r104(xmm0))*zero_extend(r100(xmm3)) 0>>0x10); REG_DEAD r104->constrain(106,104)
   21: r107(xmm2)=trunc(zero_extend(r105(xmm2))*zero_extend(r100(xmm3)) 0>>0x10); REG_DEAD r105, r100->constrain(107,105)(107,100)
   23: r108(xmm0)=vec_concat(us_truncate(r106(xmm0)),us_truncate(r107(xmm2))); REG_DEAD r107, r106->constrain(108,106)
   25: [r111(dx)]=r108(xmm0); REG_DEAD r111, r108

We form threads of pseudos that should get the same hard reg:

Threads:
  1. freq 9000: a2r107(2000) a5r105(2000) a8r102(3000) a10r103(2000)
  2. freq 6000: a1r108(2000) a3r106(2000) a6r104(2000)
  3. freq 5000: a4r100(3000) a13r112(2000); pref xmm0
  4. freq 5000: a7r101(3000) a12r113(2000); pref xmm1
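
A toy model of how these four threads could fall out of the copy/constrain pairs in the dump. This is not IRA's code. It assumes (my assumptions, not stated above) that pairs are processed in dump order, that two threads are merged only when no pseudo of one is live at the same time as a pseudo of the other, and that a pseudo dying in an insn does not conflict with the pseudo defined by that insn. Live ranges are read off the def and REG_DEAD notes; frequencies are taken from the thread listing.

```python
# Insn UIDs in the order they appear in the dump.
order = [28, 29, 30, 31, 5, 32, 6, 10, 11, 14, 16, 19, 21, 23, 25]
pos = {uid: i for i, uid in enumerate(order)}

# SSE pseudo -> (defining insn, insn carrying its REG_DEAD note).
live = {
    112: (31, 5), 100: (5, 21), 113: (32, 6), 101: (6, 16),
    103: (10, 11), 102: (11, 16), 104: (14, 19), 105: (16, 21),
    106: (19, 23), 107: (21, 23), 108: (23, 25),
}

# Per-pseudo frequencies as printed in the "Threads:" listing.
freq = {107: 2000, 105: 2000, 102: 3000, 103: 2000, 108: 2000, 106: 2000,
        104: 2000, 100: 3000, 112: 2000, 101: 3000, 113: 2000}

# copy(...) and constrain(...) pairs, in dump order.
pairs = [(100, 112), (101, 113), (102, 103), (102, 105), (106, 104),
         (107, 105), (107, 100), (108, 106)]


def conflicts(a, b):
    """True if the live ranges of a and b overlap (def-at-death is no conflict)."""
    (def_a, dead_a), (def_b, dead_b) = live[a], live[b]
    return pos[def_a] < pos[dead_b] and pos[def_b] < pos[dead_a]


def form_threads(pairs):
    """Greedily merge pairs into threads, skipping merges that would conflict."""
    thread_of = {p: [p] for p in live}
    for a, b in pairs:
        ta, tb = thread_of[a], thread_of[b]
        if ta is tb or any(conflicts(x, y) for x in ta for y in tb):
            continue
        ta.extend(tb)
        for p in tb:
            thread_of[p] = ta
    unique = {id(t): t for t in thread_of.values()}.values()
    # Highest total frequency first, like the numbered listing.
    return sorted(unique, key=lambda t: -sum(freq[p] for p in t))


threads = [(sum(freq[p] for p in t), sorted(t)) for t in form_threads(pairs)]
for total, members in threads:
    print(total, members)
```

Under these assumptions the pair (107,100) is the only one rejected, because r100 is still live while r102 and r105 are, which is why thread 3 stays separate from thread 1.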

Then the coloring algorithm prefers pushing pseudos onto the coloring
stack thread by thread when the other priorities are the same.  In this
case we basically assign by threads:

      r102  -- assign reg 22(xmm2)
      r107  -- assign reg 22(xmm2)
      r105  -- assign reg 22(xmm2)
      r103  -- assign reg 22(xmm2)
      r108  -- assign reg 20(xmm0)
      r106  -- assign reg 20(xmm0)
      r104  -- assign reg 20(xmm0)
      r100  -- assign reg 23(xmm3)
      r112  -- assign reg 20(xmm0)
      r101  -- assign reg 21(xmm1)
      r113  -- assign reg 21(xmm1)
      r111  -- assign reg 1(dx)
      r110  -- assign reg 4(si)
      r109  -- assign reg 5(di)

We assign xmm2 (first sse reg after xmm0 and xmm1) to pseudos in the
1st thread because threads 3 and 4 prefer xmm0 and xmm1.

In LRA:

  As insn 14 requires r104 and r102 to be in the same hard reg, we generate an additional insn:
      r114(xmm0) = r102(xmm2)

We could get the desired allocation if we started assignments with
pseudos from threads with lower priority (in order to assign xmm3 to
pseudos from the first thread).  But it would worsen performance in
the common case.

RA is all about heuristic solutions.  In some cases they work, in some
cases they don't.  We should look at the whole picture.  Actually, in
this case RA removes 5 copies out of 6 and satisfies 5 out of 6 2-op
constraints without additional moves.
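
The 2-op tally can be reproduced from the assignment list with a short script. This is a sketch only: the tie list is transcribed from the insn dump (insn 14's output/input tie is taken from the LRA note), and the alternative assignment is inferred from the older output in Comment 1 rather than from an actual IRA dump. Each unsatisfied pair is assumed to cost one register-to-register move.

```python
XMM0, XMM1, XMM2, XMM3 = 20, 21, 22, 23  # hard reg numbers from the list above

# (insn, output pseudo, input pseudo) for the two pseudo-to-pseudo copies.
copies = [(5, 100, 112), (6, 101, 113)]

# (insn, output pseudo, tied input pseudo) for the 2-op insns.
two_op = [(11, 102, 103), (14, 104, 102), (16, 105, 102),
          (19, 106, 104), (21, 107, 105), (23, 108, 106)]

# Assignment chosen by IRA, as listed above (SSE pseudos only).
current = {100: XMM3, 101: XMM1, 102: XMM2, 103: XMM2, 104: XMM0,
           105: XMM2, 106: XMM0, 107: XMM2, 108: XMM0, 112: XMM0, 113: XMM1}

# Hypothetical alternative: r100 stays in xmm0, thread 2 moves to xmm3.
alternative = {**current, 100: XMM0, 104: XMM3, 106: XMM3, 108: XMM3}


def unsatisfied(pairs, assignment):
    """Insns whose two pseudos ended up in different hard regs."""
    return [insn for insn, out, inp in pairs if assignment[out] != assignment[inp]]


for name, assignment in (("current", current), ("alternative", alternative)):
    bad_copies = unsatisfied(copies, assignment)
    bad_two_op = unsatisfied(two_op, assignment)
    print(name,
          "kept copies:", bad_copies,
          "unsatisfied 2-op:", bad_two_op,
          "reg-reg moves:", len(bad_copies) + len(bad_two_op))
```

With the current assignment, insn 5 survives as a real copy and insn 14 needs the extra LRA move, so there are two register-to-register moves. With the alternative only insn 14 needs a move, because r102 is still live after insn 14 and cannot share a register with r104 in either case.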

Probably an additional RA subpass which swaps pseudo-register
assignments in order to improve allocation could help.  But right now
I don't see how to implement this effectively or whether it is really
worth doing.
Comment 5 Jakub Jelinek 2019-05-03 09:16:39 UTC
GCC 9.1 has been released.
Comment 6 Romain Geissler 2019-05-10 21:39:11 UTC
Hi,

If this test has been failing for quite some time and the fix seems complex to write, should the test be marked as XFAIL for now?

Cheers,
Romain
Comment 7 Jakub Jelinek 2019-08-12 08:55:40 UTC
GCC 9.2 has been released.
Comment 8 Jakub Jelinek 2020-03-12 11:58:49 UTC
GCC 9.3.0 has been released, adjusting target milestone.
Comment 9 GCC Commits 2020-03-30 15:49:36 UTC
The master branch has been updated by Martin Liska <marxin@gcc.gnu.org>:

https://gcc.gnu.org/g:291aa50a63194245ad3ab8bd584f9c0286c5b44c

commit r10-7459-g291aa50a63194245ad3ab8bd584f9c0286c5b44c
Author: Martin Liska <mliska@suse.cz>
Date:   Mon Mar 30 17:49:27 2020 +0200

    XFAIL pr57193.c test-case.
    
            PR rtl-optimization/87716
            * gcc.target/i386/pr57193.c: XFAIL a test-case.
Comment 10 Richard Biener 2021-06-01 08:12:16 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
Comment 11 Richard Biener 2022-05-27 09:39:40 UTC
GCC 9 branch is being closed
Comment 12 Jakub Jelinek 2022-06-28 10:35:58 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Comment 13 Richard Biener 2023-07-07 10:34:22 UTC
GCC 10 branch is being closed.