[PATCH v3 2/3] Add predict_doloop_p target hook

Kewen.Lin linkw@linux.ibm.com
Tue Jun 11 02:39:00 GMT 2019


>> If my understanding on this question is correct, IMHO we should try to make
>> IVOPTs conservative than optimistic, since once the predict is wrong from
>> too optimistic decision, the costing on the doloop use is wrong, it's very
>> possible to affect the global optimal set.  It looks we don't have any ways
>> to recover it in RTL then?  (otherwise, there should be better place to fix
>> the PR).  Although it's also possible to miss some good cases, it's at least
>> as good as before, I'm inclined to make it conservative.
> 
> I wonder if you could simply benchmark what happens if you make
> IVOPTs _always_ create a doloop IV (if possible)?  I doubt the
> cases where a doloop IV is bad (calls, etc.) are too common and
> that in those cases the extra simple IV hurts.
> 

Hi Richard and all,

With these different settings:
	Base) without any changes
	A) having predict_doloop and enable all checks
	B) A + disable check on "too few iterations" (0 <= est_niter < 3)
	C) A + disable costly niter check
	D) A + disable invalid stmt check (call/computed_goto/switch)

I collected some runtime performance data with SPEC2017 as following:

			Avs.Base   Bvs.A  Cvs.A  Dvs.A
500.perlbench_r		0.00%	0.38%	0.00%	-0.19%
502.gcc_r		0.00%	0.38%	0.00%	0.00%
505.mcf_r		0.89%	0.00%	0.00%	0.00%
520.omnetpp_r		-0.41%	-1.25%	0.00%	0.00%
523.xalancbmk_r		-0.36%	-0.36%	-0.36%	-0.73%
525.x264_r		1.14%	0.00%	0.00%	0.00%
531.deepsjeng_r		-0.26%	0.26%	0.00%	0.00%
541.leela_r		0.00%	0.37%	0.00%	0.00%
548.exchange2_r		0.85%	-0.21%	0.00%	0.00%
557.xz_r		-0.77%	0.52%	-0.26%	0.00%
503.bwaves_r		0.00%	0.00%	0.36%	0.00%
507.cactuBSSN_r		-0.57%	0.00%	0.00%	0.00%
508.namd_r		-0.69%	0.35%	0.00%	0.35%
510.parest_r		0.17%	-0.17%	0.00%	-0.17%
511.povray_r		-1.31%	-0.44%	0.15%	0.15%
519.lbm_r		0.00%	0.00%	0.00%	0.00%
521.wrf_r		0.33%	-0.44%	-0.33%	-0.33%
526.blender_r		0.26%	0.26%	0.00%	0.00%
527.cam4_r		0.59%	-0.59%	0.00%	-0.39%
538.imagick_r		0.45%	0.00%	0.00%	0.00%
544.nab_r		0.23%	0.00%	0.00%	0.00%
549.fotonik3d_r		1.80%	-0.29%	0.00%	0.00%
554.roms_r		0.00%	0.00%	0.00%	0.00%
geomean			0.10%	-0.05%	-0.02%	-0.06%

As above, the difference is very small, looks like caused by noise and can 
be ignored. 

I also ran partial of test suite with some explicit statistics dumping 
(on gcc/g++/gfortran etc.). No regressions found. The unique files number 
with predicted doloop found are:

  A) 3297
  B) 3416
  C) 3297
  D) 3858

Some observations:
  * Based on A) and C), we can see the checking on costly niter is useless, 
    I plan to give the check up or replace it with one existing interface 
    expression_expensive_p as Richard mentioned.  (Correct me if you have 
    any concerns.)
  * B) does filter some cases, I checked a few different cases, they are 
    written with small iteration count indeed.
  * The delta number isn't small between A) and D).

I ran some filtering by compiling C/C++ files at -O2 with A) and D) (-O2 is 
probably not the actual option used in each testing, may not cause the 
difference, but just for simplicity), obtained dumping assembly file and did
further comparison, then I got 60 files finally.

I looked into all 60 cases (I assumed these are typical enough, don't need
to go through the Fortran etc.).  Most of wi/wo differences are expected, 
55 of them use more insns to update biv, but 5 of them are trivial since they
have to use original biv later so it's kept wi/wo the scanning.  Some of 55 
takes one more register in prologue/epilogue, some of them have fewer setup 
insns (which are to calculate the original value and bound for selected iv).

Some typical case looks like:

 Replacing exit test: if (ivtmp_2 != 0)                                        \  Replacing exit test: if (ivtmp_2 != 0)
  xxx ()                                                                       \  xxx ()
  {                                                                            \  {
    unsigned long ivtmp.9;                                                     \    unsigned long ivtmp.9;
    int iter;                                                                  \    int iter;
    int _1;                                                                    \    int _1;
  -----------------------------------------------------------------------------\    unsigned int ivtmp_2;
  -----------------------------------------------------------------------------\    unsigned int ivtmp_3;
    struct bla[100] * _13;                                                     \    struct bla[100] * _13;
    void * _14;                                                                \    void * _14;
    unsigned long _15;                                                         \  -----------------------------------------------------------------------------
    unsigned long _16;                                                         \  -----------------------------------------------------------------------------
                                                                               \
    <bb 2> [local count: 10737418]:                                            \    <bb 2> [local count: 10737418]:
    _13 = &arr_base + 188;                                                     \    _13 = &arr_base + 188;
    ivtmp.9_8 = (unsigned long) _13;                                           \    ivtmp.9_8 = (unsigned long) _13;
    _15 = (unsigned long) &arr_base;                                           \  -----------------------------------------------------------------------------
    _16 = _15 + 44988;                                                         \  -----------------------------------------------------------------------------
                                                                               \
    <bb 3> [local count: 1063004407]:                                          \    <bb 3> [local count: 1063004407]:
  -----------------------------------------------------------------------------\    # ivtmp_3 = PHI <100(2), ivtmp_2(5)>
    # ivtmp.9_10 = PHI <ivtmp.9_8(2), ivtmp.9_9(5)>                            \    # ivtmp.9_10 = PHI <ivtmp.9_8(2), ivtmp.9_9(5)>
    _1 = foo ();                                                               \    _1 = foo ();
    _14 = (void *) ivtmp.9_10;                                                 \    _14 = (void *) ivtmp.9_10;
    MEM[base: _14, offset: 0B] = _1;                                           \    MEM[base: _14, offset: 0B] = _1;
  -----------------------------------------------------------------------------\    ivtmp_2 = ivtmp_3 - 1;
    ivtmp.9_9 = ivtmp.9_10 + 448;                                              \    ivtmp.9_9 = ivtmp.9_10 + 448;
    if (ivtmp.9_9 != _16)                                                      \    if (ivtmp_2 != 0)
      goto <bb 5>; [98.99%]                                                    \      goto <bb 5>; [98.99%]
    else                                                                       \    else
      goto <bb 4>; [1.01%]                                                     \      goto <bb 4>; [1.01%]


  >-------addis 31,2,.LC0@toc@ha                                               \  >-------li 31,100
  >-------ld 31,.LC0@toc@l(31)                                                 \  >-------addis 30,2,.LC0@toc@ha
  >-------addis 30,31,0x1                                                      \  >-------ld 30,.LC0@toc@l(30)
  >-------std 0,16(1)                                                          \  >-------std 0,16(1)
  >-------stdu 1,-48(1)                                                        \  >-------stdu 1,-48(1)
  >-------.cfi_def_cfa_offset 48                                               \  >-------.cfi_def_cfa_offset 48
  >-------.cfi_offset 65, 16                                                   \  -----------------------------------------------------------------------------
  >-------addi 30,30,-20736                                                    \  >-------.cfi_offset 65, 16
  >-------.p2align 5                                                           \  >-------.p2align 5
  .L2:                                                                         \  .L2:
  >-------bl foo                                                               \  >-------bl foo
  >-------nop                                                                  \  >-------nop
  >-------addi 31,31,448                                                       \  >-------addi 9,31,-1
  >-------stw 3,-448(31)                                                       \  >-------addi 30,30,448
  >-------cmpld 0,31,30                                                        \  >-------rldicl. 31,9,0,32
  -----------------------------------------------------------------------------\  >-------stw 3,-448(30)
  >-------bne 0,.L2                                                            \  >-------bne 0,.L2
  >-------addi 1,1,48                                                          \  >-------addi 1,1,48
  >-------.cfi_def_cfa_offset 0                                                \  >-------.cfi_def_cfa_offset 0
  >-------ld 0,16(1)                                                           \  >-------ld 0,16(1)
  >-------ld 30,-16(1)                                                         \  >-------ld 30,-16(1)
  >-------ld 31,-8(1)                                                          \  >-------ld 31,-8(1)


As shown above, P8 SPEC2017 performance evaluation shows to disable 
invalid_stmt scanning doesn't have any impacts.  It looks it's fine
not to force it.  The test suite small cases show it's possible to
have sub optimal instruction sequence with eliminated BIV and its update. 

I guess the concern on the scanning is compilation time cost, is it
worth to doing according to current data?

What do you think of this? 

Thanks a lot in advance!


Kewen



More information about the Gcc-patches mailing list