[PATCH] Fix PR69274, 435.gromacs performance regression due to RA

Mon Feb 8 09:09:00 GMT 2016

On Fri, 5 Feb 2016, Vladimir Makarov wrote:

> On 02/05/2016 04:25 AM, Richard Biener wrote:
> > The following patch fixes the performance regression for 435.gromacs
> > on x86_64 with AVX2 (Haswell or bdver2) caused by
> > 
> > 2015-12-18  Andreas Krebbel  <krebbel@linux.vnet.ibm.com>
> > 
> > 	* ira.c (ira_setup_alts): Move the scan for commutative modifier
> > 	to the first loop to make it work even with disabled alternatives.
> > 
> > which in itself is a desirable change giving the RA more freedom.
> > 
> > It turns out the fix makes an existing issue more severe in detecting
> > more swappable alternatives and thus exiting ira_setup_alts with
> > operands swapped in recog_data.  This seems to give a slight preference
> > to choose alternatives with the operands swapped (I didn't try to
> > investigate how IRA handles the "merged" alternative mask and
> > operand swapping in its further processing).
> Alternative mask excludes alternatives which will be definitely rejected in
> LRA.  This approach is to speed up LRA (a lot was done to speed up RA but
> still it consumes a big chunk of compiler time which is unusual for all
> compilers).
> 
> LRA and reload prefer insns without commutative operands swap when all other
> costs are the same.

Ok, so leaving operands swapped in ira_setup_alts will prefer the
swapped variants in case other costs are the same.

> > Of course previous RTL optimizers and canonicalization rules as well
> > as backend patterns are tuned towards the not swapped variant and thus
> > it happens doing more swaps ends up in slower code (I didn't closely
> > investigate).
> > 
> > So I tested the following patch which simply makes sure that
> > ira_setup_alts does not alter recog_data.
> > 
> > On a Intel Haswell machine I get (base is with the patch, peak is with
> > the above change reverted):
> > 
> >                                    Estimated
> > Estimated
> >                  Base     Base       Base        Peak     Peak       Peak
> > Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
> > -------------- ------  ---------  ---------    ------  ---------
> > ---------
> > 435.gromacs      7140        264       27.1 S    7140        270
> > 26.5 S
> > 435.gromacs      7140        264       27.1 *    7140        269
> > 26.6 S
> > 435.gromacs      7140        263       27.1 S    7140        269
> > 26.5 *
> > ==============================================================================
> > 435.gromacs      7140        264       27.1 *    7140        269
> > 26.5 *
> > 
> > which means the patched result is even better than before Andreas
> > change.  Current trunk homes in at a Run Time of 321s (which is
> > the regression to fix).
>   Thanks for working on this, Richard.  It is not easy to find reasons for
> worse code on modern processors after such small change.  As RA is based on
> heuristics it hard to predict the change for a specific benchmark.  I remember
> I checked  Andreas patch on SPEC2000 in a hope that it also improves x86-64
> code but I did not see a difference.
> 
> It is even hard to say sometimes how a specific (non-heuristic) optimization
> will affect a specific benchmark performance when a lot of unknown (from
> alignments to CPU internals are involved).  An year ago I tried to use ML to
> choose best options.  I used a set of about 100 C benchmarks (and even more
> functions).  For practically every benchmark, I had an option modification to
> -Ofast resulting in faster code but ML prediction did not work at all.
> > Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk?
> > 
> OK.  Thanks again.

Thanks.  Over the weekend I did a full 3-run SPEC 2k6 with the following
result.  Base is r231814 while peak is r231814 patched, flags are
-Ofast -march=native (on a Haswell machine).

                                  Estimated                       
Estimated
                Base     Base       Base        Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
-------------- ------  ---------  ---------    ------  ---------  
---------
400.perlbench    9770        255       38.4 *    9770        251       
39.0 S
400.perlbench    9770        258       37.8 S    9770        250       
39.0 *
400.perlbench    9770        253       38.6 S    9770        250       
39.0 S
401.bzip2        9650        407       23.7 *    9650        400       
24.1 *
401.bzip2        9650        412       23.4 S    9650        417       
23.1 S
401.bzip2        9650        396       24.4 S    9650        398       
24.3 S
403.gcc          8050        252       31.9 S    8050        245       
32.9 S
403.gcc          8050        240       33.6 S    8050        244       
32.9 *
403.gcc          8050        242       33.2 *    8050        241       
33.4 S
429.mcf          9120        243       37.6 S    9120        245       
37.3 S
429.mcf          9120        224       40.7 S    9120        241       
37.8 *
429.mcf          9120        225       40.5 *    9120        229       
39.9 S
445.gobmk       10490        394       26.6 S   10490        393       
26.7 S
445.gobmk       10490        394       26.6 *   10490        392       
26.8 *
445.gobmk       10490        393       26.7 S   10490        392       
26.8 S
456.hmmer        9330        340       27.4 S    9330        340       
27.5 *
456.hmmer        9330        340       27.5 S    9330        340       
27.5 S
456.hmmer        9330        340       27.5 *    9330        340       
27.5 S
458.sjeng       12100        407       29.7 *   12100        407       
29.8 *
458.sjeng       12100        406       29.8 S   12100        406       
29.8 S
458.sjeng       12100        408       29.6 S   12100        407       
29.8 S
462.libquantum  20720        300       69.0 S   20720        300       
69.0 *
462.libquantum  20720        300       69.1 *   20720        300       
69.2 S
462.libquantum  20720        300       69.1 S   20720        301       
68.9 S
464.h264ref     22130        444       49.8 S   22130        444       
49.9 S
464.h264ref     22130        443       50.0 S   22130        442       
50.1 S
464.h264ref     22130        444       49.9 *   22130        443       
50.0 *
471.omnetpp      6250        326       19.1 S    6250        328       
19.1 S
471.omnetpp      6250        305       20.5 *    6250        324       
19.3 *
471.omnetpp      6250        296       21.1 S    6250        305       
20.5 S
473.astar        7020        313       22.4 *    7020        316       
22.2 S
473.astar        7020        314       22.3 S    7020        309       
22.7 S
473.astar        7020        308       22.8 S    7020        310       
22.7 *
483.xalancbmk    6900        193       35.7 S    6900        193       
35.7 S
483.xalancbmk    6900        189       36.5 *    6900        189       
36.5 *
483.xalancbmk    6900        185       37.4 S    6900        189       
36.6 S
==============================================================================
400.perlbench    9770        255       38.4 *    9770        250       
39.0 *
401.bzip2        9650        407       23.7 *    9650        400       
24.1 *
403.gcc          8050        242       33.2 *    8050        244       
32.9 *
429.mcf          9120        225       40.5 *    9120        241       
37.8 *
445.gobmk       10490        394       26.6 *   10490        392       
26.8 *
456.hmmer        9330        340       27.5 *    9330        340       
27.5 *
458.sjeng       12100        407       29.7 *   12100        407       
29.8 *
462.libquantum  20720        300       69.1 *   20720        300       
69.0 *
464.h264ref     22130        444       49.9 *   22130        443       
50.0 *
471.omnetpp      6250        305       20.5 *    6250        324       
19.3 *
473.astar        7020        313       22.4 *    7020        310       
22.7 *
483.xalancbmk    6900        189       36.5 *    6900        189       
36.5 *
 Est. SPECint(R)_base2006              32.8
 Est. SPECint2006                                                      
32.5

                                  Estimated                       
Estimated
                Base     Base       Base        Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
-------------- ------  ---------  ---------    ------  ---------  
---------
410.bwaves      13590        196       69.4 S   13590        202       
67.1 S
410.bwaves      13590        197       69.0 *   13590        197       
69.1 S
410.bwaves      13590        199       68.3 S   13590        197       
69.0 *
416.gamess      19580        592       33.1 *   19580        589       
33.2 S
416.gamess      19580        592       33.1 S   19580        588       
33.3 S
416.gamess      19580        593       33.0 S   19580        588       
33.3 *
433.milc         9180        350       26.2 S    9180        347       
26.5 S
433.milc         9180        335       27.4 *    9180        334       
27.5 S
433.milc         9180        334       27.5 S    9180        337       
27.2 *
434.zeusmp       9100        231       39.3 S    9100        232       
39.2 S
434.zeusmp       9100        233       39.1 S    9100        233       
39.1 *
434.zeusmp       9100        232       39.2 *    9100        235       
38.8 S
435.gromacs      7140        316       22.6 S    7140        264       
27.1 S
435.gromacs      7140        314       22.7 S    7140        263       
27.2 S
435.gromacs      7140        316       22.6 *    7140        263       
27.1 *
436.cactusADM   11950        201       59.3 *   11950        211       
56.6 S
436.cactusADM   11950        214       55.7 S   11950        205       
58.3 *
436.cactusADM   11950        198       60.4 S   11950        201       
59.5 S
437.leslie3d     9400        218       43.1 S    9400        219       
42.9 S
437.leslie3d     9400        219       42.9 *    9400        219       
42.9 *
437.leslie3d     9400        220       42.8 S    9400        220       
42.7 S
444.namd         8020        301       26.7 S    8020        302       
26.5 S
444.namd         8020        300       26.7 *    8020        302       
26.5 *
444.namd         8020        300       26.7 S    8020        302       
26.6 S
447.dealII      11440        248       46.2 S   11440        245       
46.7 S
447.dealII      11440        248       46.0 S   11440        246       
46.4 S
447.dealII      11440        248       46.2 *   11440        246       
46.5 *
450.soplex       8340        216       38.6 S    8340        223       
37.5 S
450.soplex       8340        215       38.8 *    8340        215       
38.7 S
450.soplex       8340        214       38.9 S    8340        216       
38.6 *
453.povray       5320        121       44.1 S    5320        119       
44.8 *
453.povray       5320        120       44.2 *    5320        117       
45.3 S
453.povray       5320        120       44.3 S    5320        121       
44.0 S
454.calculix     8250        277       29.8 S    8250        277       
29.8 S
454.calculix     8250        277       29.8 *    8250        276       
29.9 *
454.calculix     8250        277       29.7 S    8250        276       
29.9 S
459.GemsFDTD    10610        326       32.5 S   10610        333       
31.8 S
459.GemsFDTD    10610        316       33.6 S   10610        318       
33.4 S
459.GemsFDTD    10610        317       33.5 *   10610        320       
33.2 *
465.tonto        9840        421       23.4 S    9840        419       
23.5 S
465.tonto        9840        419       23.5 S    9840        419       
23.5 *
465.tonto        9840        420       23.5 *    9840        418       
23.5 S
470.lbm         13740        253       54.2 *   13740        254       
54.1 S
470.lbm         13740        251       54.6 S   13740        251       
54.8 S
470.lbm         13740        254       54.0 S   13740        251       
54.7 *
481.wrf         11170        291       38.4 S   11170        293       
38.2 S
481.wrf         11170        288       38.8 S   11170        288       
38.8 S
481.wrf         11170        289       38.6 *   11170        289       
38.6 *
482.sphinx3     19490        398       49.0 S   19490        406       
48.0 S
482.sphinx3     19490        406       48.1 S   19490        401       
48.6 S
482.sphinx3     19490        399       48.9 *   19490        401       
48.6 *
==============================================================================
410.bwaves      13590        197       69.0 *   13590        197       
69.0 *
416.gamess      19580        592       33.1 *   19580        588       
33.3 *
433.milc         9180        335       27.4 *    9180        337       
27.2 *
434.zeusmp       9100        232       39.2 *    9100        233       
39.1 *
435.gromacs      7140        316       22.6 *    7140        263       
27.1 *
436.cactusADM   11950        201       59.3 *   11950        205       
58.3 *
437.leslie3d     9400        219       42.9 *    9400        219       
42.9 *
444.namd         8020        300       26.7 *    8020        302       
26.5 *
447.dealII      11440        248       46.2 *   11440        246       
46.5 *
450.soplex       8340        215       38.8 *    8340        216       
38.6 *
453.povray       5320        120       44.2 *    5320        119       
44.8 *
454.calculix     8250        277       29.8 *    8250        276       
29.9 *
459.GemsFDTD    10610        317       33.5 *   10610        320       
33.2 *
465.tonto        9840        420       23.5 *    9840        419       
23.5 *
470.lbm         13740        253       54.2 *   13740        251       
54.7 *
481.wrf         11170        289       38.6 *   11170        289       
38.6 *
482.sphinx3     19490        399       48.9 *   19490        401       
48.6 *
 Est. SPECfp(R)_base2006               38.0
 Est. SPECfp2006                                                       
38.4

So overall the patch is a loss for SPEC CPU 2006 INT due to
the 429.mcf and 471.omnetpp regressions and a win on SPEC FP.
(I didn't test SPEC INT previously, only FP)

But as you noted the patch only changes allocation preference in case
of equal costs.

I've also looked at stage2 binary sizes and patched we see a growth
of cc1 from 35450264 bytes to 35459552 (executable size).  There are
both ups and downs for individual .o files though.

The gcc.target/i386/addr-sel-1.c (for PR28940) seems to just started
working at some point past in time and thus it was added and the
bug closed.  You could say RA does a better job after the patch
as it uses 1 less register but that restricts the followup
postreload combine attempts.  Though I wonder about what's "better"
RA here - isn't the best allocation one that avoids spills but
uses as many registers as possible (at least when targeting a CPU
that cannot to register renaming)?  regrename doesn't help this
testcase either (it runs too late and does a renaming that doesn't help).

With all of the above I'm not sure what to do for GCC 6 (even though
you just approved the patch).  Going with the patch alternative
(just revert swapping parts of the commutative operands) looks
like completely bogus though it works for fixing the regression as well.

So I have applied the patch now, giving us a few days to get other
peoples (and my own) auto-testers the chance to pick up results with
it and after that we can consider reverting or going with the (IMHO
bogus) half-way variant.

I've XFAILed half of gcc.target/i386/addr-sel-1.c.

Thanks,
Richard.

> > 2016-02-05   Richard Biener  <rguenther@suse.de>
> > 
> > 	PR rtl-optimization/69274
> > 	* ira.c (ira_setup_alts): Do not change recog_data.operand
> > 	order.
> > 
> > Index: gcc/ira.c
> > ===================================================================
> > --- gcc/ira.c	(revision 231814)
> > +++ gcc/ira.c	(working copy)
> > @@ -1888,10 +1888,11 @@ ira_setup_alts (rtx_insn *insn, HARD_REG
> >   	}
> >         if (commutative < 0)
> >   	break;
> > -      if (curr_swapped)
> > -	break;
> > +      /* Swap forth and back to avoid changing recog_data.  */
> >         std::swap (recog_data.operand[commutative],
> >   		 recog_data.operand[commutative + 1]);
> > +      if (curr_swapped)
> > +	break;
> >       }
> >   }
> >   
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)