Bug 57193 - [7 Regression] suboptimal register allocation for SSE registers
Summary: [7 Regression] suboptimal register allocation for SSE registers
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 4.9.0
: P2 normal
Target Milestone: 8.0
Assignee: Vladimir Makarov
URL:
Keywords: missed-optimization, ra
Depends on:
Blocks:
 
Reported: 2013-05-07 09:36 UTC by Wouter Vermaelen
Modified: 2019-11-14 09:09 UTC (History)
7 users (show)

See Also:
Host:
Target:
Build:
Known to work: 4.4.6
Known to fail: 4.5.3, 4.6.4, 4.9.0, 5.1.1
Last reconfirmed: 2018-02-03 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Wouter Vermaelen 2013-05-07 09:36:06 UTC
This bug _might_ be related to PR56339, although that report talks about a regression compared to 4.7, while this bug seems to be a regression compared to 4.4.

I was converting some hand-written asm code to SSE-intrinsics, but unfortunately the version using intrinsics generates worse code. It contains two unnecessary 'movdqa' instructions.

I managed to reduce my test to this routine:

//--------------------------------------------------------------
#include <emmintrin.h>

void test1(const __m128i* in1, const __m128i* in2, __m128i* out,
           __m128i f, __m128i zero)
{
	__m128i c = _mm_avg_epu8(*in1, *in2);
	__m128i l = _mm_unpacklo_epi8(c, zero);
	__m128i h = _mm_unpackhi_epi8(c, zero);
	__m128i m = _mm_mulhi_epu16(l, f);
	__m128i n = _mm_mulhi_epu16(h, f);
	*out = _mm_packus_epi16(m, n);
}
//--------------------------------------------------------------

A (few days old) gcc snapshot generates the following code. Versions 4.5, 4.6 and 4.7 generate similar code:

   0:   66 0f 6f 17             movdqa (%rdi),%xmm2
   4:   66 0f e0 16             pavgb  (%rsi),%xmm2
   8:   66 0f 6f da             movdqa %xmm2,%xmm3
   c:   66 0f 68 d1             punpckhbw %xmm1,%xmm2
  10:   66 0f 60 d9             punpcklbw %xmm1,%xmm3
  14:   66 0f e4 d0             pmulhuw %xmm0,%xmm2
  18:   66 0f 6f cb             movdqa %xmm3,%xmm1
  1c:   66 0f e4 c8             pmulhuw %xmm0,%xmm1
  20:   66 0f 6f c1             movdqa %xmm1,%xmm0
  24:   66 0f 67 c2             packuswb %xmm2,%xmm0
  28:   66 0f 7f 02             movdqa %xmm0,(%rdx)
  2c:   c3                      retq

Gcc version 4.3 and 4.4 (and clang) generate the following optimal(?) code:
   0:   66 0f 6f 17             movdqa (%rdi),%xmm2
   4:   66 0f e0 16             pavgb  (%rsi),%xmm2
   8:   66 0f 6f da             movdqa %xmm2,%xmm3
   c:   66 0f 68 d1             punpckhbw %xmm1,%xmm2
  10:   66 0f 60 d9             punpcklbw %xmm1,%xmm3
  14:   66 0f e4 d8             pmulhuw %xmm0,%xmm3
  18:   66 0f e4 c2             pmulhuw %xmm2,%xmm0
  1c:   66 0f 67 d8             packuswb %xmm0,%xmm3
  20:   66 0f 7f 1a             movdqa %xmm3,(%rdx)
  24:   c3                      retq
Comment 1 Richard Biener 2013-05-07 11:47:25 UTC
Confirmed.
Comment 2 H.J. Lu 2013-05-07 18:17:11 UTC
It is caused by revision 156641:

http://gcc.gnu.org/ml/gcc-cvs/2010-02/msg00222.html
Comment 3 Richard Biener 2013-10-30 12:18:30 UTC
Re-confirmed on trunk.
Comment 4 Richard Henderson 2014-02-12 20:03:09 UTC
It seems like incomplete reload inheritance:

(insn 19 16 21 2 (set (reg:V8HI 107)
  (truncate:V8HI
    (lshiftrt:V8SI
      (mult:V8SI (zero_extend:V8SI (subreg:V8HI (reg:V16QI 105) 0))
                 (zero_extend:V8SI (subreg:V8HI (reg/v:V2DI 101 [ f ]) 0)))
      (const_int 16 [0x10]))))
  include/emmintrin.h:1362 2134 {*umulv8hi3_highpart}
  (expr_list:REG_DEAD (reg:V16QI 105) (nil)))

      Creating newreg=111 from oldreg=107, assigning class SSE_REGS to r111
   19: r111:V8HI=trunc(zero_extend(r111:V8HI)*zero_extend(r101:V2DI#0) 0>>0x10)
      REG_DEAD r105:V16QI
    Inserting insn reload before:
   31: r111:V8HI=r105:V16QI#0
    Inserting insn reload after:
   32: r107:V8HI=r111:V8HI

The new register r111 does wind up inheriting from r107, but not
transitively to r105.  Thus we wind up leaving the copy insn 31.
Comment 5 Richard Biener 2014-06-12 13:41:36 UTC
The 4.7 branch is being closed, moving target milestone to 4.8.4.
Comment 6 Jakub Jelinek 2014-12-19 13:31:29 UTC
GCC 4.8.4 has been released.
Comment 7 Richard Biener 2015-06-23 08:13:33 UTC
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
Comment 8 Jakub Jelinek 2015-06-26 19:56:11 UTC
GCC 4.9.3 has been released.
Comment 9 Bernd Schmidt 2016-01-14 16:39:23 UTC
It looks like the situation is as follows (X is the LRA-created reload reg)

X = a
op on X
b = X

where a and b are different registers already allocated by IRA, hence we can avoid one copy at most. I'm not very familiar with LRA yet, but I see no code to rethink such register allocation choices.

-frename-registers gets rid of one unnecessary copy, it was enhanced to detect such situations for gcc-6. Maybe we should finally enable that for -O2 and higher?
Comment 10 Jeffrey A. Law 2016-01-14 17:04:58 UTC
Look in lra-coalesce, if we have code to eliminate those copies, that's where I'd expect to find it.
Comment 11 Bernd Schmidt 2016-04-26 12:44:14 UTC
Author: bernds
Date: Tue Apr 26 12:43:42 2016
New Revision: 235442

URL: https://gcc.gnu.org/viewcvs?rev=235442&root=gcc&view=rev
Log:
Enable -frename-registers at -O2.

	PR rtl-optimization/57193
	* opts.c (default_options_table): Add OPT_frename_registers at -O2
	and above.
	* doc/invoke.texi (-frename-registers, -O2): Update documentation.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/doc/invoke.texi
    trunk/gcc/opts.c
Comment 12 Bernd Schmidt 2016-05-03 22:48:44 UTC
Author: bernds
Date: Tue May  3 22:48:03 2016
New Revision: 235848

URL: https://gcc.gnu.org/viewcvs?rev=235848&root=gcc&view=rev
Log:
	PR rtl-optimization/57193
	* opts.c (default_options_table): Revert OPT_frename_registers change.
	* doc/invoke.texi (-frename-registers, -O2): Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/doc/invoke.texi
    trunk/gcc/opts.c
Comment 13 Richard Biener 2016-08-03 11:00:24 UTC
GCC 4.9 branch is being closed
Comment 14 Jakub Jelinek 2017-10-10 13:24:55 UTC
GCC 5 branch is being closed
Comment 15 Aldy Hernandez 2018-02-03 10:34:24 UTC
Reconfirmed.
Comment 16 Vladimir Makarov 2018-02-09 18:24:30 UTC
Author: vmakarov
Date: Fri Feb  9 18:23:58 2018
New Revision: 257537

URL: https://gcc.gnu.org/viewcvs?rev=257537&root=gcc&view=rev
Log:
2018-02-09  Vladimir Makarov  <vmakarov@redhat.com>

	PR rtl-optimization/57193
	* ira-color.c (struct allocno_color_data): Add member
	conflict_allocno_hard_prefs.
	(update_conflict_allocno_hard_prefs): New.
	(bucket_allocno_compare_func): Add a preference based on
	conflict_allocno_hard_prefs.
	(push_allocno_to_stack): Update conflict_allocno_hard_prefs.
	(color_allocnos): Remove a dead code.  Initiate
	conflict_allocno_hard_prefs.  Call update_costs_from_prefs.

2018-02-09  Vladimir Makarov  <vmakarov@redhat.com>

	PR rtl-optimization/57193
	* gcc.target/i386/57193.c: New.


Added:
    trunk/gcc/testsuite/gcc.target/i386/pr57193.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ira-color.c
    trunk/gcc/testsuite/ChangeLog
Comment 17 Jakub Jelinek 2018-10-26 10:06:36 UTC
GCC 6 branch is being closed
Comment 18 Xi Ruoyao 2019-05-07 07:55:52 UTC
It's back in GCC 9.1.0 and trunk:

https://gcc.godbolt.org/z/KrrZW7
Comment 19 Xi Ruoyao 2019-05-07 08:14:32 UTC
(In reply to Xi Ruoyao from comment #18)
> It's back in GCC 9.1.0 and trunk:
> 
> https://gcc.godbolt.org/z/KrrZW7

Wrong link.  Should be:

https://gcc.godbolt.org/z/bxAeCR
Comment 20 Romain Geissler 2019-05-10 21:36:47 UTC
Hi,

It looks like the GCC 9/10 re-occurence is being tracked in this bug 87716.

Cheers,
Romain
Comment 21 Richard Biener 2019-11-14 09:09:20 UTC
Fixed.