Bug 91498 - [12/13/14/15 Regression] STV change in r274481 causes 300.twolf regression on Haswell
Summary: [12/13/14/15 Regression] STV change in r274481 causes 300.twolf regression on...
Status: WAITING
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 10.0
: P2 normal
Target Milestone: 12.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra
Depends on:
Blocks: spec
  Show dependency treegraph
 
Reported: 2019-08-20 08:03 UTC by Richard Biener
Modified: 2024-07-19 13:05 UTC (History)
6 users (show)

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-01-27 00:00:00


Attachments
Patch for 32-bit Solaris/x86 failures (470 bytes, patch)
2019-08-21 09:26 UTC, Rainer Orth
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Richard Biener 2019-08-20 08:03:26 UTC
Split out from PR91154 comment#25

Biggest changes when benchmarking -mno-stv (base) against -mstv (peak):

   7.28%         37789  twolf_peak.none  twolf_peak.none   [.] ucxx2 
   4.21%         25709  twolf_base.none  twolf_base.none   [.] ucxx2        
   3.72%         22584  twolf_base.none  twolf_base.none   [.] new_dbox
   2.48%         22364  twolf_peak.none  twolf_peak.none   [.] new_dbox
   1.49%          8270  twolf_base.none  twolf_base.none   [.] sub_penal
   1.12%          7576  twolf_peak.none  twolf_peak.none   [.] sub_penal
   1.36%          9314  twolf_peak.none  twolf_peak.none   [.] old_assgnto_new2
   1.11%          5257  twolf_base.none  twolf_base.none   [.] old_assgnto_new2

and in ucxx2 I see

  0.17 │       mov    %eax,(%rsp)
  3.55 │       vpmins (%rsp),%xmm0,%xmm1   
       │       test   %eax,%eax
  0.22 │       vmovd  %xmm1,%r8d              
  0.80 │       cmovs  %esi,%r8d

...

Testcase:

extern int numBins;
extern int binOffst;
extern int binWidth;
extern int Trybin;
void foo (int);

void bar (int aleft, int axcenter)
{
  int a1LoBin = (((Trybin=((axcenter + aleft)-binOffst)/binWidth)<0)
                 ? 0 : ((Trybin>numBins) ? numBins : Trybin));
  foo (a1LoBin);
}

where combine eliminates the reg-reg copies STV adds to split live-ranges
between GPR and SSE uses (currently one plain move and one set via
vec_merge/duplicate).

Making STV of SI/DImode chains always happen after combine (in STV2) fixes
the testcase above but regresses gcc.target/i386/minmax-6.c which ran into
a very similar issue and was reduced from a SPEC CPU 2006 regression observed.

In the end the issue is that as soon as the RA decides it needs to spill
for a dual-use pseudo it ends up going through the stack because of, as
HJ notices, the minimum cost of moves between SSE and integer units is 8:

  /* Moves between SSE and integer units are expensive.  */
  if (SSE_CLASS_P (class1) != SSE_CLASS_P (class2))

    /* ??? By keeping returned value relatively high, we limit the number
       of moves between integer and SSE registers for all targets.
       Additionally, high value prevents problem with x86_modes_tieable_p(),
       where integer modes in SSE registers are not tieable
       because of missing QImode and HImode moves to, from or between
       MMX/SSE registers.  */
    return MAX (8, SSE_CLASS_P (class1)
                ? ix86_cost->hard_register.sse_to_integer
                : ix86_cost->hard_register.integer_to_sse);
Comment 1 Richard Biener 2019-08-20 08:07:43 UTC
Patch that causes gcc.target/i386/minmax-6.c to spill to the stack:
https://gcc.gnu.org/ml/gcc-patches/2019-08/msg01283.html

This makes SI/DImode chains handled by STV2 only (after combine).  We can
take it from there if we think that's what we need to do anyways.  In the
end it's STV/RA interaction, I guess if we can take combine out of the
equation it might simplify things.
Comment 2 Uroš Bizjak 2019-08-20 08:17:40 UTC
(In reply to Richard Biener from comment #1)
> Patch that causes gcc.target/i386/minmax-6.c to spill to the stack:
> https://gcc.gnu.org/ml/gcc-patches/2019-08/msg01283.html
> 
> This makes SI/DImode chains handled by STV2 only (after combine).  We can
> take it from there if we think that's what we need to do anyways.  In the
> end it's STV/RA interaction, I guess if we can take combine out of the
> equation it might simplify things.

Yes, this is the intended way SI/DImode STV should be used.
Comment 3 Richard Biener 2019-08-20 08:46:27 UTC
Author: rguenth
Date: Tue Aug 20 08:45:56 2019
New Revision: 274694

URL: https://gcc.gnu.org/viewcvs?rev=274694&root=gcc&view=rev
Log:
2019-08-20  Richard Biener  <rguenther@suse.de>

	PR target/91498
	* config/i386/i386-features.c (general_scalar_chain::convert_op):
	Use (vec_merge (vec_duplicate..)) style vector from scalar move.
	(convert_scalars_to_vector): Add timode_p parameter and use it
	to guard TImode-only operation.
	(pass_stv::gate): Adjust so STV runs twice for TARGET_64BIT.
	(pass_stv::execute): Pass down timode_p.

	* gcc.target/i386/minmax-7.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.target/i386/minmax-7.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386-features.c
    trunk/gcc/testsuite/ChangeLog
Comment 4 Uroš Bizjak 2019-08-20 11:21:05 UTC
Following patch

--cut here--
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 49ab50ea41bf..11c75be113e0 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -18601,9 +18601,9 @@ ix86_register_move_cost (machine_mode mode, reg_class_t class1_i,
        where integer modes in SSE registers are not tieable
        because of missing QImode and HImode moves to, from or between
        MMX/SSE registers.  */
-    return MAX (8, SSE_CLASS_P (class1)
-               ? ix86_cost->hard_register.sse_to_integer
-               : ix86_cost->hard_register.integer_to_sse);
+    return (SSE_CLASS_P (class1)
+           ? ix86_cost->hard_register.sse_to_integer
+           : ix86_cost->hard_register.integer_to_sse);
 
   if (MAYBE_FLOAT_CLASS_P (class1))
     return ix86_cost->hard_register.fp_move;
--cut here--

fixes the regression.

(BTW: QI/HImodes are not "missing" at all. We can use %k modifier to move QImode and HImode values using MOVD instruction between register sets.)
Comment 5 Uroš Bizjak 2019-08-20 12:23:38 UTC
(In reply to Uroš Bizjak from comment #4)
> Following patch

HJ, can you please put the patch through some benchmarks? (I have no access to SPEC).
Comment 6 H.J. Lu 2019-08-20 15:14:23 UTC
(In reply to Uroš Bizjak from comment #5)
> (In reply to Uroš Bizjak from comment #4)
> > Following patch
> 
> HJ, can you please put the patch through some benchmarks? (I have no access
> to SPEC).

Sure. We will measure it.  Thanks.
Comment 7 H.J. Lu 2019-08-20 15:44:50 UTC
Our SPEC CPU 2017 failed with

 39 util.c:205:1: error: invalid rtl sharing found in the insn
 40   205 | }
 41       | ^
 42 (insn 29 28 3 2 (set (subreg:V2DI (reg:DI 91) 0)
 43         (vec_concat:V2DI (mem/c:DI (plus:DI (reg/f:DI 19 frame)
 44                     (const_int -8 [0xfffffffffffffff8])) [0  S8 A64])
 45             (const_int 0 [0]))) "util.c":134:1 -1
 46      (nil))
 47 util.c:205:1: error: shared rtx
 48 (mem/c:DI (plus:DI (reg/f:DI 19 frame)
 49         (const_int -8 [0xfffffffffffffff8])) [0  S8 A64])
 50 during RTL pass: stv

We have an invalid shared rtx.
Comment 8 Rainer Orth 2019-08-20 17:52:39 UTC
The new testcase FAILs on both i386-pc-solaris2.11 and x86_64-pc-linux-gnu with
-m32:

+FAIL: gcc.target/i386/minmax-7.c scan-assembler pminsd
Comment 9 Arseny Solokha 2019-08-20 18:14:12 UTC
(In reply to H.J. Lu from comment #7)
> Our SPEC CPU 2017 failed with
> 
>  39 util.c:205:1: error: invalid rtl sharing found in the insn
>  40   205 | }
>  41       | ^
>  42 (insn 29 28 3 2 (set (subreg:V2DI (reg:DI 91) 0)
>  43         (vec_concat:V2DI (mem/c:DI (plus:DI (reg/f:DI 19 frame)
>  44                     (const_int -8 [0xfffffffffffffff8])) [0  S8 A64])
>  45             (const_int 0 [0]))) "util.c":134:1 -1
>  46      (nil))
>  47 util.c:205:1: error: shared rtx
>  48 (mem/c:DI (plus:DI (reg/f:DI 19 frame)
>  49         (const_int -8 [0xfffffffffffffff8])) [0  S8 A64])
>  50 during RTL pass: stv
> 
> We have an invalid shared rtx.

And the testcase is as simple as:

int
ff (int jn)
{
  return jn > 0 ? jn : 0;
}

% x86_64-unknown-linux-gnu-gcc-10.0.0-alpha20190818 -march=nano-3000 -Os -c f9272pqv.c
f9272pqv.c: In function 'ff':
f9272pqv.c:5:1: error: invalid rtl sharing found in the insn
    5 | }
      | ^
(insn 18 17 3 2 (set (subreg:V4SI (reg:SI 87) 0)
        (vec_merge:V4SI (vec_duplicate:V4SI (mem/c:SI (plus:DI (reg/f:DI 19 frame)
                        (const_int -4 [0xfffffffffffffffc])) [0  S4 A32]))
            (const_vector:V4SI [
                    (const_int 0 [0]) repeated x4
                ])
            (const_int 1 [0x1]))) "f9272pqv.c":3:1 -1
     (nil))
f9272pqv.c:5:1: error: shared rtx
(mem/c:SI (plus:DI (reg/f:DI 19 frame)
        (const_int -4 [0xfffffffffffffffc])) [0  S4 A32])
during RTL pass: stv
f9272pqv.c:5:1: internal compiler error: internal consistency failure
0x9f3eca verify_rtx_sharing
	/var/tmp/portage/sys-devel/gcc-10.0.0_alpha20190818/work/gcc-10-20190818/gcc/emit-rtl.c:2927
0x9f3dba verify_rtx_sharing
	/var/tmp/portage/sys-devel/gcc-10.0.0_alpha20190818/work/gcc-10-20190818/gcc/emit-rtl.c:2942
0x9f3dba verify_rtx_sharing
	/var/tmp/portage/sys-devel/gcc-10.0.0_alpha20190818/work/gcc-10-20190818/gcc/emit-rtl.c:2942
0x9f3dba verify_rtx_sharing
	/var/tmp/portage/sys-devel/gcc-10.0.0_alpha20190818/work/gcc-10-20190818/gcc/emit-rtl.c:2942
0x9f4178 verify_insn_sharing
	/var/tmp/portage/sys-devel/gcc-10.0.0_alpha20190818/work/gcc-10-20190818/gcc/emit-rtl.c:3013
0x9f79ed verify_rtl_sharing()
	/var/tmp/portage/sys-devel/gcc-10.0.0_alpha20190818/work/gcc-10-20190818/gcc/emit-rtl.c:3036
0xc7a8f3 execute_function_todo
	/var/tmp/portage/sys-devel/gcc-10.0.0_alpha20190818/work/gcc-10-20190818/gcc/passes.c:2004
0xc7b536 execute_todo
	/var/tmp/portage/sys-devel/gcc-10.0.0_alpha20190818/work/gcc-10-20190818/gcc/passes.c:2037
Comment 10 rguenther@suse.de 2019-08-21 07:10:39 UTC
On Tue, 20 Aug 2019, ro at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91498
> 
> Rainer Orth <ro at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |ro at gcc dot gnu.org
> 
> --- Comment #8 from Rainer Orth <ro at gcc dot gnu.org> ---
> The new testcase FAILs on both i386-pc-solaris2.11 and x86_64-pc-linux-gnu with
> -m32:
> 
> +FAIL: gcc.target/i386/minmax-7.c scan-assembler pminsd

Not for me.  I've used -march=haswell in all of these testcases to
rule out ISA and cost issues.  So I really don't know what's the issue
with these FAILs.
Comment 11 rguenther@suse.de 2019-08-21 07:21:11 UTC
On Tue, 20 Aug 2019, asolokha at gmx dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91498
> 
> Arseny Solokha <asolokha at gmx dot com> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |asolokha at gmx dot com
> 
> --- Comment #9 from Arseny Solokha <asolokha at gmx dot com> ---
> (In reply to H.J. Lu from comment #7)
> > Our SPEC CPU 2017 failed with
> > 
> >  39 util.c:205:1: error: invalid rtl sharing found in the insn
> >  40   205 | }
> >  41       | ^
> >  42 (insn 29 28 3 2 (set (subreg:V2DI (reg:DI 91) 0)
> >  43         (vec_concat:V2DI (mem/c:DI (plus:DI (reg/f:DI 19 frame)
> >  44                     (const_int -8 [0xfffffffffffffff8])) [0  S8 A64])
> >  45             (const_int 0 [0]))) "util.c":134:1 -1
> >  46      (nil))
> >  47 util.c:205:1: error: shared rtx
> >  48 (mem/c:DI (plus:DI (reg/f:DI 19 frame)
> >  49         (const_int -8 [0xfffffffffffffff8])) [0  S8 A64])
> >  50 during RTL pass: stv
> > 
> > We have an invalid shared rtx.
> 
> And the testcase is as simple as:
> 
> int
> ff (int jn)
> {
>   return jn > 0 ? jn : 0;
> }
> 
> % x86_64-unknown-linux-gnu-gcc-10.0.0-alpha20190818 -march=nano-3000 -Os -c
> f9272pqv.c
> f9272pqv.c: In function 'ff':
> f9272pqv.c:5:1: error: invalid rtl sharing found in the insn
>     5 | }
>       | ^
> (insn 18 17 3 2 (set (subreg:V4SI (reg:SI 87) 0)
>         (vec_merge:V4SI (vec_duplicate:V4SI (mem/c:SI (plus:DI (reg/f:DI 19
> frame)
>                         (const_int -4 [0xfffffffffffffffc])) [0  S4 A32]))
>             (const_vector:V4SI [
>                     (const_int 0 [0]) repeated x4
>                 ])
>             (const_int 1 [0x1]))) "f9272pqv.c":3:1 -1
>      (nil))

I have a patch.
Comment 12 Dmitry G. Dyachenko 2019-08-21 07:23:56 UTC
(In reply to Arseny Solokha from comment #9)
> (In reply to H.J. Lu from comment #7)
> > Our SPEC CPU 2017 failed with
> > 
> >  39 util.c:205:1: error: invalid rtl sharing found in the insn
> >  40   205 | }
> >  41       | ^
> >  42 (insn 29 28 3 2 (set (subreg:V2DI (reg:DI 91) 0)
> >  43         (vec_concat:V2DI (mem/c:DI (plus:DI (reg/f:DI 19 frame)
> >  44                     (const_int -8 [0xfffffffffffffff8])) [0  S8 A64])
> >  45             (const_int 0 [0]))) "util.c":134:1 -1
> >  46      (nil))
> >  47 util.c:205:1: error: shared rtx
> >  48 (mem/c:DI (plus:DI (reg/f:DI 19 frame)
> >  49         (const_int -8 [0xfffffffffffffff8])) [0  S8 A64])
> >  50 during RTL pass: stv
> > 
> > We have an invalid shared rtx.
> 
> And the testcase is as simple as:
> 
> int
> ff (int jn)
> {
>   return jn > 0 ? jn : 0;
> }
> 
> % x86_64-unknown-linux-gnu-gcc-10.0.0-alpha20190818 -march=nano-3000 -Os -c
> f9272pqv.c
> f9272pqv.c: In function 'ff':
> f9272pqv.c:5:1: error: invalid rtl sharing found in the insn
>     5 | }
>       | ^
> (insn 18 17 3 2 (set (subreg:V4SI (reg:SI 87) 0)
>         (vec_merge:V4SI (vec_duplicate:V4SI (mem/c:SI (plus:DI (reg/f:DI 19
> frame)
>                         (const_int -4 [0xfffffffffffffffc])) [0  S4 A32]))
>             (const_vector:V4SI [
>                     (const_int 0 [0]) repeated x4
>                 ])
>             (const_int 1 [0x1]))) "f9272pqv.c":3:1 -1
>      (nil))
> f9272pqv.c:5:1: error: shared rtx
> (mem/c:SI (plus:DI (reg/f:DI 19 frame)
>         (const_int -4 [0xfffffffffffffffc])) [0  S4 A32])
> during RTL pass: stv
> f9272pqv.c:5:1: internal compiler error: internal consistency failure
> 0x9f3eca verify_rtx_sharing
> 	/var/tmp/portage/sys-devel/gcc-10.0.0_alpha20190818/work/gcc-10-20190818/
> gcc/emit-rtl.c:2927
> 0x9f3dba verify_rtx_sharing
> 	/var/tmp/portage/sys-devel/gcc-10.0.0_alpha20190818/work/gcc-10-20190818/
> gcc/emit-rtl.c:2942
> 0x9f3dba verify_rtx_sharing
> 	/var/tmp/portage/sys-devel/gcc-10.0.0_alpha20190818/work/gcc-10-20190818/
> gcc/emit-rtl.c:2942
> 0x9f3dba verify_rtx_sharing
> 	/var/tmp/portage/sys-devel/gcc-10.0.0_alpha20190818/work/gcc-10-20190818/
> gcc/emit-rtl.c:2942
> 0x9f4178 verify_insn_sharing
> 	/var/tmp/portage/sys-devel/gcc-10.0.0_alpha20190818/work/gcc-10-20190818/
> gcc/emit-rtl.c:3013
> 0x9f79ed verify_rtl_sharing()
> 	/var/tmp/portage/sys-devel/gcc-10.0.0_alpha20190818/work/gcc-10-20190818/
> gcc/emit-rtl.c:3036
> 0xc7a8f3 execute_function_todo
> 	/var/tmp/portage/sys-devel/gcc-10.0.0_alpha20190818/work/gcc-10-20190818/
> gcc/passes.c:2004
> 0xc7b536 execute_todo
> 	/var/tmp/portage/sys-devel/gcc-10.0.0_alpha20190818/work/gcc-10-20190818/
> gcc/passes.c:2037

sounds like PR91503
Comment 13 Uroš Bizjak 2019-08-21 07:29:18 UTC
(In reply to rguenther@suse.de from comment #10) 
> > +FAIL: gcc.target/i386/minmax-7.c scan-assembler pminsd
> 
> Not for me.  I've used -march=haswell in all of these testcases to
> rule out ISA and cost issues.  So I really don't know what's the issue
> with these FAILs.

IIRC 32bit Solaris doesn't guarantee 16byte stack alignment, so STV is disabled.
Comment 14 Richard Biener 2019-08-21 08:45:31 UTC
Author: rguenth
Date: Wed Aug 21 08:44:59 2019
New Revision: 274792

URL: https://gcc.gnu.org/viewcvs?rev=274792&root=gcc&view=rev
Log:
2019-08-21  Richard Biener  <rguenther@suse.de>

	PR target/91498
	PR target/91503
	* config/i386/i386-features.c
	(general_scalar_chain::make_vector_copies): Copy stack temporary
	rtx when using it multiple times.
	(general_scalar_chain::convert_reg): Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386-features.c
Comment 15 Rainer Orth 2019-08-21 09:26:52 UTC
Created attachment 46737 [details]
Patch for 32-bit Solaris/x86 failures
Comment 16 ro@CeBiTec.Uni-Bielefeld.DE 2019-08-21 09:29:21 UTC
> --- Comment #13 from Uroš Bizjak <ubizjak at gmail dot com> ---
> (In reply to rguenther@suse.de from comment #10) 
>> > +FAIL: gcc.target/i386/minmax-7.c scan-assembler pminsd
>> 
>> Not for me.  I've used -march=haswell in all of these testcases to
>> rule out ISA and cost issues.  So I really don't know what's the issue
>> with these FAILs.
>
> IIRC 32bit Solaris doesn't guarantee 16byte stack alignment, so STV is
> disabled.

Indeed: adding -mno-stackrealign (as is done in a couple of other STV
tests already) fixes this failure and the ones reported in PR
rtl-optimization/91154.

Tested with the appropriate runtest invocation on i386-pc-solaris2.11
and x86_64-pc-linux-gnu (both multilibs each).
Comment 17 Martin Liška 2020-01-27 14:02:25 UTC
Can the issue be closed as resolved?
Comment 18 Richard Biener 2020-01-27 14:25:19 UTC
(In reply to Martin Liška from comment #17)
> Can the issue be closed as resolved?

I've fixed a -fno-common issue on the tester and will report back after the next run.
Comment 19 Richard Biener 2020-01-28 09:05:23 UTC
Doesn't seem fixed.
Comment 20 Richard Biener 2020-03-20 08:46:54 UTC
The patch from comment#4 got installed so re-analysis is necessary I think.
Comment 21 Jakub Jelinek 2020-05-07 11:56:30 UTC
GCC 10.1 has been released.
Comment 22 Richard Biener 2020-07-23 06:51:54 UTC
GCC 10.2 is released, adjusting target milestone.
Comment 23 Richard Biener 2021-04-08 12:01:59 UTC
GCC 10.3 is being released, retargeting bugs to GCC 10.4.
Comment 24 Jakub Jelinek 2022-06-28 10:38:11 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Comment 25 Richard Biener 2023-07-07 10:35:51 UTC
GCC 10 branch is being closed.
Comment 26 Richard Biener 2024-07-19 13:05:36 UTC
GCC 11 branch is being closed.