[PATCH, i386]: Do not limit the cost of moves to/from XMM register to minimum 8.

Uros Bizjak ubizjak@gmail.com
Thu Sep 5 08:54:00 GMT 2019


On Thu, Sep 5, 2019 at 7:47 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Wed, Sep 4, 2019 at 9:44 AM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > On Wed, Sep 4, 2019 at 12:50 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> > >
> > > On Tue, Sep 3, 2019 at 1:33 PM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > >
> > > > > > Note:
> > > > > > Removing the cost limit would introduce lots of regressions in SPEC2017, as follows:
> > > > > > --------------------------------
> > > > > > 531.deepsjeng_r  -7.18%
> > > > > > 548.exchange_r  -6.70%
> > > > > > 557.xz_r -6.74%
> > > > > > 508.namd_r -2.81%
> > > > > > 527.cam4_r -6.48%
> > > > > > 544.nab_r -3.99%
> > > > > >
> > > > > > Tested on skylake server.
> > > > > > -------------------------------------
> > > > > > How about changing the cost from 2 to 8 until we figure out a better number?
> > > > >
> > > > > Certainly works for me.  Note the STV code uses the "other" sse_to_integer
> > > > > number and the numbers in question here are those for the RA.  There's
> > > > > a multitude of values used in the tables here, including some a lot larger.
> > > > > So the overall bumping to 8 certainly was the wrong thing to do and instead
> > > > > individual numbers should have been adjusted (didn't look at the history
> > > > > of that bumping).
> > > >
> > > > For reference:
> > > >
> > > > r125951 | uros | 2007-06-22 19:51:06 +0200 (Fri, 22 Jun 2007) | 6 lines
> > > >
> > > >     PR target/32413
> > > >     * config/i386/i386.c (ix86_register_move_cost): Rise the cost of
> > > >     moves between MMX/SSE registers to at least 8 units to prevent
> > > >     ICE caused by non-tieable SI/HI/QImodes in SSE registers.
> > > >
> > > > should probably have been "twice the cost of X" or something like that
> > > > instead where X be some reg-reg move cost.
> > >
> > > Thanks for the reference. It looks like the patch fixed the issue in
> > > the wrong place; this should be solved in
> > > inline_secondary_memory_needed:
> > >
> > >       /* Between SSE and general, we have moves no larger than word size.  */
> > >       if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2))
> > >            || GET_MODE_SIZE (mode) < GET_MODE_SIZE (SImode)
> > >            || GET_MODE_SIZE (mode) > UNITS_PER_WORD)
> > >         return true;
> > >
> > > by alternatively implementing QImode and HImode moves between SSE
> > > and integer registers as SImode moves. We have
> > > ix86_secondary_memory_needed_mode, which extends QImode and HImode
> > > secondary memory to SImode, so this should solve PR32413.
> > >
> > > Other than that, what to do with the bizarre property of direct moves
> > > that benchmark far worse than indirect moves? I was expecting that
> > > keeping the cost of direct inter-regset moves just a bit below the
> > > cost of int<->mem<->xmm, but (much ?) higher than intra-regset moves,
> > > would prevent unwanted wandering of values between register sets
> > > while still generating the direct move when needed. While this almost
> >
> > I've not tested it yet.
> > So I'll start a test of this patch (changing the cost from 2 to 6)
> > together with Richard's change.
> > I'll keep you informed when the test finishes.
> >
> > > fixes the runtime regression, it is not clear to me from Hongtao Liu's
> > > message whether Richard's 2019-08-27 patch fixes the remaining
> > > regression or not. Liu, can you please clarify?
> > >
> > --------------------------------
> > 531.deepsjeng_r  -7.18%
> > 548.exchange_r  -6.70%
> > 557.xz_r -6.74%
> > 508.namd_r -2.81%
> > 527.cam4_r -6.48%
> > 544.nab_r -3.99%
> >
> > Tested on skylake server.
> > -------------------------------------
> > Those regressions compare gcc10_20190830 with gcc10_20190824 and are
> > mainly caused by removing the limit of 8.
> >
> > > > >  For example Pentium4 has quite high bases for move
> > > > > costs, like xmm <-> xmm move costing 12 and SSE->integer costing 20
> > > > > while the opposite 12.
> > > > >
> > > > > So yes, we want to revert the patch by applying its effect to the
> > > > > individual cost tables so we can revisit this for the still interesting
> > > > > micro-architectures.
> > >
> > > Uros.
> >
> >
> >
> > --
> > BR,
> > Hongtao
>
> Change cost from 2->6 got
> -------------
> 531.deepsjeng_r   9.64%
> 548.exchange_r   10.24%
> 557.xz_r          7.99%
> 508.namd_r        1.08%
> 527.cam4_r        6.91%
> 544.nab_r         3.06%
> ------------
>
> For 531, 548, 557 and 527, results are even better compared with the
> version before the regression.
> For 508 and 544, small regressions remain compared with the version
> before the regression.

Good, that brings us into the "noise" region.

Based on these results and other findings, I propose the following solution:

- The inter-regset move costs of architectures that were defined before
r125951 remain the same. These are: size, i386, i486, pentium,
pentiumpro, geode, k6, athlon, k8, amdfam10, pentium4 and nocona.
- bdver, btver1 and btver2 have costs higher than 8, so they are not affected.
- lakemont, znver1, znver2, atom, slm, intel and generic have
inter-regset costs above the intra-regset cost and below or equal to the
memory load/store cost, so they should remain as they are. Additionally,
the intel and generic costs are regularly re-tuned.
- Only the skylake and core costs remain problematic.

So, I propose to raise the XMM<->intreg costs of the skylake and core
cost tables to 6 to solve the regression. These can be fine-tuned
later; we are now able to change the cost for the RA independently of
the RTX costs. Also, the RA cost can be asymmetrical.

Attached patch implements the proposal. If there are no other
proposals or discussions, I plan to commit it on Friday.

Uros.
-------------- next part --------------
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 3381b8bf143c..00edece3eb68 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1610,7 +1610,7 @@ struct processor_costs skylake_cost = {
 					   in 32,64,128,256 and 512-bit */
   {8, 8, 8, 12, 24},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  2, 2,					/* SSE->integer and integer->SSE moves */
+  6, 6,					/* SSE->integer and integer->SSE moves */
   /* End of register allocator costs.  */
   },
 
@@ -2555,7 +2555,7 @@ struct processor_costs core_cost = {
 					   in 32,64,128,256 and 512-bit */
   {6, 6, 6, 6, 12},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  2, 2,					/* SSE->integer and integer->SSE moves */
+  6, 6,					/* SSE->integer and integer->SSE moves */
   /* End of register allocator costs.  */
   },
 

