[PATCH] Fix PR69274, 435.gromacs performance regression due to RA
Richard Biener
rguenther@suse.de
Mon Feb 8 09:09:00 GMT 2016
On Fri, 5 Feb 2016, Vladimir Makarov wrote:
> On 02/05/2016 04:25 AM, Richard Biener wrote:
> > The following patch fixes the performance regression for 435.gromacs
> > on x86_64 with AVX2 (Haswell or bdver2) caused by
> >
> > 2015-12-18 Andreas Krebbel <krebbel@linux.vnet.ibm.com>
> >
> > * ira.c (ira_setup_alts): Move the scan for commutative modifier
> > to the first loop to make it work even with disabled alternatives.
> >
> > which in itself is a desirable change giving the RA more freedom.
> >
> > It turns out the fix makes an existing issue more severe in detecting
> > more swappable alternatives and thus exiting ira_setup_alts with
> > operands swapped in recog_data. This seems to give a slight preference
> > to choose alternatives with the operands swapped (I didn't try to
> > investigate how IRA handles the "merged" alternative mask and
> > operand swapping in its further processing).
> The alternative mask excludes alternatives which will definitely be rejected in
> LRA. This approach is meant to speed up LRA (a lot was done to speed up RA but
> it still consumes a big chunk of compiler time, which is unusual for all
> compilers).
>
> LRA and reload prefer insns without a commutative operand swap when all other
> costs are the same.
Ok, so leaving operands swapped in ira_setup_alts will prefer the
swapped variants in case other costs are the same.
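
To illustrate the point with a toy model (this is not the ira.c code; ToyRecogData and toy_setup_alts are made-up names, and the alternative scan itself is left out), the only thing that matters is whether the loop exits before or after the second swap:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for recog_data: only the operand array matters here.
struct ToyRecogData {
  std::array<int, 3> operand{10, 20, 30};
};

// Toy version of the scan loop. `commutative` is the index of the first
// operand of a commutative pair, or -1 if the insn has none.
// `restore == false` models the old ordering (test curr_swapped, then swap);
// `restore == true` models the patched ordering (swap, then test).
void toy_setup_alts(ToyRecogData &rd, int commutative, bool restore) {
  assert(commutative < 0 ||
         static_cast<std::size_t>(commutative) + 1 < rd.operand.size());
  for (bool curr_swapped = false;; curr_swapped = true) {
    // ... the real code would scan the alternatives at this point ...
    if (commutative < 0)
      break;
    if (!restore && curr_swapped)
      break;  // old order: exit with the operands still swapped
    std::swap(rd.operand[commutative], rd.operand[commutative + 1]);
    if (restore && curr_swapped)
      break;  // patched order: the second swap has undone the first
  }
}
```

Calling toy_setup_alts on a fresh ToyRecogData with commutative == 0 leaves operands 0 and 1 exchanged when restore is false and leaves them in their original order when restore is true.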
> > Of course previous RTL optimizers and canonicalization rules as well
> > as backend patterns are tuned towards the non-swapped variant, and thus
> > doing more swaps happens to end up in slower code (I didn't closely
> > investigate).
> >
> > So I tested the following patch which simply makes sure that
> > ira_setup_alts does not alter recog_data.
> >
> > On an Intel Haswell machine I get (base is with the patch, peak is with
> > the above change reverted):
> >
> >                                   Estimated                       Estimated
> >                 Base     Base       Base         Peak     Peak       Peak
> > Benchmarks      Ref.   Run Time     Ratio        Ref.   Run Time     Ratio
> > -------------- ------  ---------  ---------     ------  ---------  ---------
> > 435.gromacs      7140        264       27.1 S     7140        270       26.5 S
> > 435.gromacs      7140        264       27.1 *     7140        269       26.6 S
> > 435.gromacs      7140        263       27.1 S     7140        269       26.5 *
> > ==============================================================================
> > 435.gromacs      7140        264       27.1 *     7140        269       26.5 *
> >
> > which means the patched result is even better than before Andreas'
> > change. Current trunk comes in at a Run Time of 321s (which is
> > the regression to fix).
> Thanks for working on this, Richard. It is not easy to find reasons for
> worse code on modern processors after such a small change. As RA is based on
> heuristics, it is hard to predict the change for a specific benchmark. I remember
> I checked Andreas' patch on SPEC2000 in the hope that it would also improve
> x86-64 code, but I did not see a difference.
>
> It is sometimes even hard to say how a specific (non-heuristic) optimization
> will affect the performance of a specific benchmark when a lot of unknowns (from
> alignments to CPU internals) are involved. A year ago I tried to use ML to
> choose the best options. I used a set of about 100 C benchmarks (and even more
> functions). For practically every benchmark, I had an option modification to
> -Ofast resulting in faster code, but ML prediction did not work at all.
> > Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk?
> >
> OK. Thanks again.
Thanks. Over the weekend I did a full 3-run SPEC 2k6 with the following
result. Base is r231814 while peak is r231814 patched, flags are
-Ofast -march=native (on a Haswell machine).
                                  Estimated                       Estimated
                Base     Base       Base         Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio        Ref.   Run Time     Ratio
-------------- ------  ---------  ---------     ------  ---------  ---------
400.perlbench    9770        255       38.4 *     9770        251       39.0 S
400.perlbench    9770        258       37.8 S     9770        250       39.0 *
400.perlbench    9770        253       38.6 S     9770        250       39.0 S
401.bzip2        9650        407       23.7 *     9650        400       24.1 *
401.bzip2        9650        412       23.4 S     9650        417       23.1 S
401.bzip2        9650        396       24.4 S     9650        398       24.3 S
403.gcc          8050        252       31.9 S     8050        245       32.9 S
403.gcc          8050        240       33.6 S     8050        244       32.9 *
403.gcc          8050        242       33.2 *     8050        241       33.4 S
429.mcf          9120        243       37.6 S     9120        245       37.3 S
429.mcf          9120        224       40.7 S     9120        241       37.8 *
429.mcf          9120        225       40.5 *     9120        229       39.9 S
445.gobmk       10490        394       26.6 S    10490        393       26.7 S
445.gobmk       10490        394       26.6 *    10490        392       26.8 *
445.gobmk       10490        393       26.7 S    10490        392       26.8 S
456.hmmer        9330        340       27.4 S     9330        340       27.5 *
456.hmmer        9330        340       27.5 S     9330        340       27.5 S
456.hmmer        9330        340       27.5 *     9330        340       27.5 S
458.sjeng       12100        407       29.7 *    12100        407       29.8 *
458.sjeng       12100        406       29.8 S    12100        406       29.8 S
458.sjeng       12100        408       29.6 S    12100        407       29.8 S
462.libquantum  20720        300       69.0 S    20720        300       69.0 *
462.libquantum  20720        300       69.1 *    20720        300       69.2 S
462.libquantum  20720        300       69.1 S    20720        301       68.9 S
464.h264ref     22130        444       49.8 S    22130        444       49.9 S
464.h264ref     22130        443       50.0 S    22130        442       50.1 S
464.h264ref     22130        444       49.9 *    22130        443       50.0 *
471.omnetpp      6250        326       19.1 S     6250        328       19.1 S
471.omnetpp      6250        305       20.5 *     6250        324       19.3 *
471.omnetpp      6250        296       21.1 S     6250        305       20.5 S
473.astar        7020        313       22.4 *     7020        316       22.2 S
473.astar        7020        314       22.3 S     7020        309       22.7 S
473.astar        7020        308       22.8 S     7020        310       22.7 *
483.xalancbmk    6900        193       35.7 S     6900        193       35.7 S
483.xalancbmk    6900        189       36.5 *     6900        189       36.5 *
483.xalancbmk    6900        185       37.4 S     6900        189       36.6 S
==============================================================================
400.perlbench    9770        255       38.4 *     9770        250       39.0 *
401.bzip2        9650        407       23.7 *     9650        400       24.1 *
403.gcc          8050        242       33.2 *     8050        244       32.9 *
429.mcf          9120        225       40.5 *     9120        241       37.8 *
445.gobmk       10490        394       26.6 *    10490        392       26.8 *
456.hmmer        9330        340       27.5 *     9330        340       27.5 *
458.sjeng       12100        407       29.7 *    12100        407       29.8 *
462.libquantum  20720        300       69.1 *    20720        300       69.0 *
464.h264ref     22130        444       49.9 *    22130        443       50.0 *
471.omnetpp      6250        305       20.5 *     6250        324       19.3 *
473.astar        7020        313       22.4 *     7020        310       22.7 *
483.xalancbmk    6900        189       36.5 *     6900        189       36.5 *
Est. SPECint(R)_base2006               32.8
Est. SPECint2006                                                        32.5

                                  Estimated                       Estimated
                Base     Base       Base         Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio        Ref.   Run Time     Ratio
-------------- ------  ---------  ---------     ------  ---------  ---------
410.bwaves      13590        196       69.4 S    13590        202       67.1 S
410.bwaves      13590        197       69.0 *    13590        197       69.1 S
410.bwaves      13590        199       68.3 S    13590        197       69.0 *
416.gamess      19580        592       33.1 *    19580        589       33.2 S
416.gamess      19580        592       33.1 S    19580        588       33.3 S
416.gamess      19580        593       33.0 S    19580        588       33.3 *
433.milc         9180        350       26.2 S     9180        347       26.5 S
433.milc         9180        335       27.4 *     9180        334       27.5 S
433.milc         9180        334       27.5 S     9180        337       27.2 *
434.zeusmp       9100        231       39.3 S     9100        232       39.2 S
434.zeusmp       9100        233       39.1 S     9100        233       39.1 *
434.zeusmp       9100        232       39.2 *     9100        235       38.8 S
435.gromacs      7140        316       22.6 S     7140        264       27.1 S
435.gromacs      7140        314       22.7 S     7140        263       27.2 S
435.gromacs      7140        316       22.6 *     7140        263       27.1 *
436.cactusADM   11950        201       59.3 *    11950        211       56.6 S
436.cactusADM   11950        214       55.7 S    11950        205       58.3 *
436.cactusADM   11950        198       60.4 S    11950        201       59.5 S
437.leslie3d     9400        218       43.1 S     9400        219       42.9 S
437.leslie3d     9400        219       42.9 *     9400        219       42.9 *
437.leslie3d     9400        220       42.8 S     9400        220       42.7 S
444.namd         8020        301       26.7 S     8020        302       26.5 S
444.namd         8020        300       26.7 *     8020        302       26.5 *
444.namd         8020        300       26.7 S     8020        302       26.6 S
447.dealII      11440        248       46.2 S    11440        245       46.7 S
447.dealII      11440        248       46.0 S    11440        246       46.4 S
447.dealII      11440        248       46.2 *    11440        246       46.5 *
450.soplex       8340        216       38.6 S     8340        223       37.5 S
450.soplex       8340        215       38.8 *     8340        215       38.7 S
450.soplex       8340        214       38.9 S     8340        216       38.6 *
453.povray       5320        121       44.1 S     5320        119       44.8 *
453.povray       5320        120       44.2 *     5320        117       45.3 S
453.povray       5320        120       44.3 S     5320        121       44.0 S
454.calculix     8250        277       29.8 S     8250        277       29.8 S
454.calculix     8250        277       29.8 *     8250        276       29.9 *
454.calculix     8250        277       29.7 S     8250        276       29.9 S
459.GemsFDTD    10610        326       32.5 S    10610        333       31.8 S
459.GemsFDTD    10610        316       33.6 S    10610        318       33.4 S
459.GemsFDTD    10610        317       33.5 *    10610        320       33.2 *
465.tonto        9840        421       23.4 S     9840        419       23.5 S
465.tonto        9840        419       23.5 S     9840        419       23.5 *
465.tonto        9840        420       23.5 *     9840        418       23.5 S
470.lbm         13740        253       54.2 *    13740        254       54.1 S
470.lbm         13740        251       54.6 S    13740        251       54.8 S
470.lbm         13740        254       54.0 S    13740        251       54.7 *
481.wrf         11170        291       38.4 S    11170        293       38.2 S
481.wrf         11170        288       38.8 S    11170        288       38.8 S
481.wrf         11170        289       38.6 *    11170        289       38.6 *
482.sphinx3     19490        398       49.0 S    19490        406       48.0 S
482.sphinx3     19490        406       48.1 S    19490        401       48.6 S
482.sphinx3     19490        399       48.9 *    19490        401       48.6 *
==============================================================================
410.bwaves      13590        197       69.0 *    13590        197       69.0 *
416.gamess      19580        592       33.1 *    19580        588       33.3 *
433.milc         9180        335       27.4 *     9180        337       27.2 *
434.zeusmp       9100        232       39.2 *     9100        233       39.1 *
435.gromacs      7140        316       22.6 *     7140        263       27.1 *
436.cactusADM   11950        201       59.3 *    11950        205       58.3 *
437.leslie3d     9400        219       42.9 *     9400        219       42.9 *
444.namd         8020        300       26.7 *     8020        302       26.5 *
447.dealII      11440        248       46.2 *    11440        246       46.5 *
450.soplex       8340        215       38.8 *     8340        216       38.6 *
453.povray       5320        120       44.2 *     5320        119       44.8 *
454.calculix     8250        277       29.8 *     8250        276       29.9 *
459.GemsFDTD    10610        317       33.5 *    10610        320       33.2 *
465.tonto        9840        420       23.5 *     9840        419       23.5 *
470.lbm         13740        253       54.2 *    13740        251       54.7 *
481.wrf         11170        289       38.6 *    11170        289       38.6 *
482.sphinx3     19490        399       48.9 *    19490        401       48.6 *
Est. SPECfp(R)_base2006                38.0
Est. SPECfp2006                                                         38.4
So overall the patch is a loss for SPEC CPU 2006 INT due to
the 429.mcf and 471.omnetpp regressions and a win on SPEC FP.
(I didn't test SPEC INT previously, only FP)
But as you noted the patch only changes allocation preference in case
of equal costs.
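
For anyone wanting to recompute the deltas from the selected (*) rows, a small sketch follows. The helper names spec_ratio, runtime_change_percent and round1 are made up for this mail; the only thing taken from SPEC is that a ratio is the reference time divided by the run time. The report computes its ratios from unrounded run times, so values recomputed from the whole-second figures can differ in the last digit.

```cpp
#include <cassert>
#include <cmath>

// SPEC ratio: reference time divided by measured run time (both in seconds).
double spec_ratio(double ref_seconds, double run_seconds) {
  assert(run_seconds > 0.0);
  return ref_seconds / run_seconds;
}

// Run-time change of peak relative to base in percent; negative means
// peak (here: r231814 patched) is faster than base (plain r231814).
double runtime_change_percent(double base_seconds, double peak_seconds) {
  assert(base_seconds > 0.0);
  return (peak_seconds - base_seconds) / base_seconds * 100.0;
}

// Round to one decimal place, matching the precision of the Ratio columns.
double round1(double x) { return std::round(x * 10.0) / 10.0; }
```

With the figures above, runtime_change_percent(316, 263) for 435.gromacs is roughly -17%, while 429.mcf (225 to 241) and 471.omnetpp (305 to 324) come out at roughly +7% and +6%.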
I've also looked at stage2 binary sizes, and with the patch we see cc1
grow from 35450264 bytes to 35459552 bytes (executable size, +9288 bytes).
There are both ups and downs for individual .o files though.
The gcc.target/i386/addr-sel-1.c testcase (for PR28940) seems to have just
started working at some point in the past, and thus it was added and the
bug closed. You could say RA does a better job after the patch
as it uses one register fewer, but that restricts the follow-up
postreload combine attempts. Though I wonder what "better"
RA means here - isn't the best allocation one that avoids spills but
uses as many registers as possible (at least when targeting a CPU
that cannot do register renaming)? regrename doesn't help this
testcase either (it runs too late and does a renaming that doesn't help).
With all of the above I'm not sure what to do for GCC 6 (even though
you just approved the patch). Going with the patch alternative
(just revert the swapping parts of the commutative operands) looks
completely bogus, though it works for fixing the regression as well.
So I have applied the patch now, giving other people's (and my own)
auto-testers a few days to pick up results with it; after that we can
consider reverting or going with the (IMHO bogus) half-way variant.
I've XFAILed half of gcc.target/i386/addr-sel-1.c.
Thanks,
Richard.
> > 2016-02-05 Richard Biener <rguenther@suse.de>
> >
> > PR rtl-optimization/69274
> > * ira.c (ira_setup_alts): Do not change recog_data.operand
> > order.
> >
> > Index: gcc/ira.c
> > ===================================================================
> > --- gcc/ira.c (revision 231814)
> > +++ gcc/ira.c (working copy)
> > @@ -1888,10 +1888,11 @@ ira_setup_alts (rtx_insn *insn, HARD_REG
> > }
> > if (commutative < 0)
> > break;
> > - if (curr_swapped)
> > - break;
> > + /* Swap forth and back to avoid changing recog_data. */
> > std::swap (recog_data.operand[commutative],
> > recog_data.operand[commutative + 1]);
> > + if (curr_swapped)
> > + break;
> > }
> > }
> >
>
>
--
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)