PR80155: Code hoisting and register pressure

Fri May 25 09:58:00 GMT 2018

On Fri, 25 May 2018, Bin.Cheng wrote:

> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> > On 23 May 2018 at 18:37, Jeff Law <law@redhat.com> wrote:
> >> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
> >>> On 23 May 2018 at 13:58, Richard Biener <rguenther@suse.de> wrote:
> >>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
> >>>>
> >>>>> Hi,
> >>>>> I am trying to work on PR80155, which exposes a problem with code
> >>>>> hoisting and register pressure on a leading embedded benchmark for ARM
> >>>>> cortex-m7, where code-hoisting causes an extra register spill.
> >>>>>
> >>>>> I have attached two test-cases which (hopefully) are representative of
> >>>>> the original test-case.
> >>>>> The first one (trans_dfa.c) is bigger and somewhat similar to the
> >>>>> original test-case and trans_dfa_2.c is hand-reduced version of
> >>>>> trans_dfa.c. There's 2 spills caused with trans_dfa.c
> >>>>> and one spill with trans_dfa_2.c due to lesser amount of cases.
> >>>>> The test-cases in the PR are probably not relevant.
> >>>>>
> >>>>> Initially I thought the spill was happening because of "too many
> >>>>> hoistings" taking place in original test-case thus increasing the
> >>>>> register pressure, but it seems the spill is possibly caused because
> >>>>> expression gets hoisted out of a block that is on loop exit.
> >>>>>
> >>>>> For example, the following hoistings take place with trans_dfa_2.c:
> >>>>>
> >>>>> (1) Inserting expression in block 4 for code hoisting:
> >>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
> >>>>>
> >>>>> (2) Inserting expression in block 4 for code hoisting: {plus_expr,_4,1} (0006)
> >>>>>
> >>>>> (3) Inserting expression in block 4 for code hoisting:
> >>>>> {pointer_plus_expr,s_33,1} (0023)
> >>>>>
> >>>>> (4) Inserting expression in block 3 for code hoisting:
> >>>>> {pointer_plus_expr,s_33,1} (0023)
> >>>>>
> >>>>> The issue seems to be hoisting of (*tab + 1) which consists of first
> >>>>> two hoistings in block 4
> >>>>> from blocks 5 and 9, which causes the extra spill. I verified that by
> >>>>> disabling hoisting into block 4,
> >>>>> which resulted in no extra spills.
> >>>>>
> >>>>> I wonder if that's because the expression (*tab + 1) is getting
> >>>>> hoisted from blocks 5 and 9,
> >>>>> which are on loop exit ? So the expression that was previously
> >>>>> computed in a block on loop exit, gets hoisted outside that block
> >>>>> which possibly makes the allocator more defensive ? Similarly
> >>>>> disabling hoisting of expressions which appeared in blocks on loop
> >>>>> exit in original test-case prevented the extra spill. The other
> >>>>> hoistings didn't seem to matter.
> >>>>
> >>>> I think that's simply co-incidence.  The only thing that makes
> >>>> a block that also exits from the loop special is that an
> >>>> expression could be sunk out of the loop and hoisting (commoning
> >>>> with another path) could prevent that.  But that isn't what is
> >>>> happening here and it would be a pass ordering issue as
> >>>> the sinking pass runs only after hoisting (no idea why exactly
> >>>> but I guess there are cases where we want to prefer CSE over
> >>>> sinking).  So you could try if re-ordering PRE and sinking helps
> >>>> your testcase.
> >>> Thanks for the suggestions. Placing sink pass before PRE works
> >>> for both these test-cases! Sadly it still causes the spill for the benchmark -:(
> >>> I will try to create a better approximation of the original test-case.
> >>>>
> >>>> What I do see is a missed opportunity to merge the successors
> >>>> of BB 4.  After PRE we have
> >>>>
> >>>> <bb 4> [local count: 159303558]:
> >>>> <L1>:
> >>>> pretmp_123 = *tab_37(D);
> >>>> _87 = pretmp_123 + 1;
> >>>> if (c_36 == 65)
> >>>>   goto <bb 5>; [34.00%]
> >>>> else
> >>>>   goto <bb 8>; [66.00%]
> >>>>
> >>>> <bb 5> [local count: 54163210]:
> >>>> *tab_37(D) = _87;
> >>>> _96 = MEM[(char *)s_57 + 1B];
> >>>> if (_96 != 0)
> >>>>   goto <bb 7>; [89.00%]
> >>>> else
> >>>>   goto <bb 6>; [11.00%]
> >>>>
> >>>> <bb 8> [local count: 105140348]:
> >>>> *tab_37(D) = _87;
> >>>> _56 = MEM[(char *)s_57 + 1B];
> >>>> if (_56 != 0)
> >>>>   goto <bb 10>; [89.00%]
> >>>> else
> >>>>   goto <bb 9>; [11.00%]
> >>>>
> >>>> here at least the stores and loads can be hoisted.  Note this
> >>>> may also point at the real issue of the code hoisting which is
> >>>> tearing apart the RMW operation?
> >>> Indeed, this possibility seems much more likely than block being on loop exit.
> >>> I will try to "hardcode" the load/store hoists into block 4 for this
> >>> specific test-case to check
> >>> if that prevents the spill.
> >> Even if it prevents the spill in this case, it's likely a good thing to
> >> do.  The statements prior to the conditional in bb5 and bb8 should be
> >> hoisted, leaving bb5 and bb8 with just their conditionals.
> > Hi,
> > It seems disabling forwprop somehow works for causing no extra spills
> > on the original test-case.
> >
> > For instance,
> > Hoisting without forwprop:
> >
> > bb 3:
> > _1 = tab_1(D) + 8
> > pretmp_268 = MEM[tab_1(D) + 8B];
> > _2 = pretmp_268 + 1;
> > goto <bb 4> or <bb 5>
> >
> > bb 4:
> >  *_1 = _ 2
> >
> > bb 5:
> > *_1 = _2
> >
> > Hoisting with forwprop:
> >
> > bb 3:
> > pretmp_164 = MEM[tab_1(D) + 8B];
> > _2 = pretmp_164 + 1
> > goto <bb 4> or <bb 5>
> >
> > bb 4:
> > MEM[tab_1(D) + 8] = _2;
> >
> > bb 5:
> > MEM[tab_1(D) + 8] = _2;
> >
> > Although in both cases, we aren't hoisting stores, the issues with forwprop
> > for this case seems to be the folding of
> > *_1 = _2
> > into
> > MEM[tab_1(D) + 8] = _2  ?
> 
> This isn't an issue, right?  IIUC, tab_1(D) used all over the loop
> thus propagating _1 using (tab_1(D) + 8) actually removes one live
> range.
> 
> >
> > Disabling folding to mem_ref[base + offset] in forwprop "works" in the
> > sense it created same set of hoistings as without forwprop, however it
> > still results in additional spills (albeit different registers).
> >
> > That's because forwprop seems to be increasing live range of
> > prephitmp_217 by substituting
> > _221 + 1 with prephitmp_217 + 2 (_221 is defined as prephitmp_217 + 1).
> Hmm, it's hard to discuss private benchmarks, not sure which dump
> shall I find prephitmp_221/prephitmp_217 stuff.
> 
> > On the other hand, Bin pointed out to me in private that forwprop also
> > helps to restrict register pressure by propagating "tab + const_int"
> > for same test-case.
> >
> > So I am not really sure if there's an easier fix than having
> > heuristics for estimating register pressure at TREE level ? I would be
> Easy fix, maybe not.  OTOH, I am more convinced passes like
> forwprop/sink/hoisting can be improved by taking live range into
> consideration.  Specifically, to direct such passes when moving code
> around different basic blocks, because inter-block register pressure
> is hard to resolve afterwards.
> 
> As suggested by Jeff and Richi, I guess the first step would be doing
> experiments, collecting more benchmark data for reordering sink before
> pre?  It enables code sink as well as decreases register pressure in
> the original reduced cases IIRC.

Note sinking also doesn't have a cost model that takes into account
register pressure.  But yes, I can't think of a reason to do sinking
after PRE (well, apart from PRE performing value-numbering and that
of course might expose sinking opportunities).

So yes, perform some experiments - you should be able to use
-fdump-statistics[-stats] and look at the number of sunk stmts
reported (and also the number of PRE eliminations ultimatively
performed).

Richard.

> Thanks,
> bin
> > grateful for suggestions on how to proceed from here.
> > Thanks!
> >
> > Regards,
> > Prathamesh
> >>
> >> Jeff
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)