This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: PR80155: Code hoisting and register pressure


On Sat, 26 May 2018, Bin.Cheng wrote:

> On Fri, May 25, 2018 at 5:54 PM, Richard Biener <rguenther@suse.de> wrote:
> > On May 25, 2018 6:57:13 PM GMT+02:00, Jeff Law <law@redhat.com> wrote:
> >>On 05/25/2018 03:49 AM, Bin.Cheng wrote:
> >>> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni
> >>> <prathamesh.kulkarni@linaro.org> wrote:
> >>>> On 23 May 2018 at 18:37, Jeff Law <law@redhat.com> wrote:
> >>>>> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
> >>>>>> On 23 May 2018 at 13:58, Richard Biener <rguenther@suse.de> wrote:
> >>>>>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>> I am trying to work on PR80155, which exposes a problem with code
> >>>>>>>> hoisting and register pressure on a leading embedded benchmark for ARM
> >>>>>>>> Cortex-M7, where code hoisting causes an extra register spill.
> >>>>>>>>
> >>>>>>>> I have attached two test-cases which (hopefully) are representative of
> >>>>>>>> the original test-case.
> >>>>>>>> The first one (trans_dfa.c) is bigger and somewhat similar to the
> >>>>>>>> original test-case, and trans_dfa_2.c is a hand-reduced version of
> >>>>>>>> trans_dfa.c.  There are two spills with trans_dfa.c and one spill
> >>>>>>>> with trans_dfa_2.c due to the smaller number of cases.
> >>>>>>>> The test-cases in the PR are probably not relevant.
> >>>>>>>>
> >>>>>>>> Initially I thought the spill was happening because of "too many
> >>>>>>>> hoistings" taking place in the original test-case, thus increasing the
> >>>>>>>> register pressure, but it seems the spill is possibly caused because the
> >>>>>>>> expression gets hoisted out of a block that is on a loop exit.
> >>>>>>>>
> >>>>>>>> For example, the following hoistings take place with trans_dfa_2.c:
> >>>>>>>>
> >>>>>>>> (1) Inserting expression in block 4 for code hoisting:
> >>>>>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
> >>>>>>>>
> >>>>>>>> (2) Inserting expression in block 4 for code hoisting:
> >>>>>>>> {plus_expr,_4,1} (0006)
> >>>>>>>>
> >>>>>>>> (3) Inserting expression in block 4 for code hoisting:
> >>>>>>>> {pointer_plus_expr,s_33,1} (0023)
> >>>>>>>>
> >>>>>>>> (4) Inserting expression in block 3 for code hoisting:
> >>>>>>>> {pointer_plus_expr,s_33,1} (0023)
> >>>>>>>>
> >>>>>>>> The issue seems to be the hoisting of (*tab + 1), which consists of the
> >>>>>>>> first two hoistings into block 4 from blocks 5 and 9, and which causes
> >>>>>>>> the extra spill.  I verified that by disabling hoisting into block 4,
> >>>>>>>> which resulted in no extra spills.
> >>>>>>>>
> >>>>>>>> I wonder if that's because the expression (*tab + 1) is getting
> >>>>>>>> hoisted from blocks 5 and 9, which are on loop exits?  So the
> >>>>>>>> expression that was previously computed in a block on a loop exit gets
> >>>>>>>> hoisted outside that block, which possibly makes the allocator more
> >>>>>>>> defensive?  Similarly, disabling the hoisting of expressions which
> >>>>>>>> appeared in blocks on loop exits in the original test-case prevented
> >>>>>>>> the extra spill.  The other hoistings didn't seem to matter.
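
For concreteness, a minimal hypothetical C sketch of the kind of pattern being
described - the same (*tab + 1) read-modify-write appearing on two branches of
a scanning loop - might look like the following; the function and names are
illustrative only, not the actual attached trans_dfa_2.c:

  /* Hypothetical sketch, not the attached test-case: (*tab + 1) is
     computed and stored on two separate paths, so hoisting can compute
     *tab + 1 once in the common predecessor while the stores remain in
     the two successor blocks.  */
  int scan (const char *s, int *tab)
  {
    int state = 0;
    for (; *s; s++)
      {
        if (*s == 'A')
          {
            *tab = *tab + 1;  /* first occurrence of the RMW */
            state = 1;
          }
        else if (*s == 'B')
          {
            *tab = *tab + 1;  /* same RMW on a second path */
            state = 2;
          }
      }
    return state;
  }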
> >>>>>>>
> >>>>>>> I think that's simply coincidence.  The only thing that makes
> >>>>>>> a block that also exits from the loop special is that an
> >>>>>>> expression could be sunk out of the loop and hoisting (commoning
> >>>>>>> with another path) could prevent that.  But that isn't what is
> >>>>>>> happening here and it would be a pass ordering issue as
> >>>>>>> the sinking pass runs only after hoisting (no idea why exactly
> >>>>>>> but I guess there are cases where we want to prefer CSE over
> >>>>>>> sinking).  So you could try if re-ordering PRE and sinking helps
> >>>>>>> your testcase.
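
For reference, the ordering being discussed is the one hard-coded in
gcc/passes.def.  A rough sketch of the experiment, assuming the usual
NEXT_PASS entries (quoted from memory; the exact neighbouring passes vary
between releases), would simply be to move the sinking pass ahead of PRE:

  /* gcc/passes.def (excerpt, neighbouring passes omitted).
     Current order: PRE (which also does the hoisting) runs first,
     then sinking.  */
  NEXT_PASS (pass_pre);
  NEXT_PASS (pass_sink_code);

  /* Experimental order for the testcase: sink before PRE/hoisting.  */
  NEXT_PASS (pass_sink_code);
  NEXT_PASS (pass_pre);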
> >>>>>> Thanks for the suggestions.  Placing the sink pass before PRE works
> >>>>>> for both of these test-cases!  Sadly it still causes the spill for the
> >>>>>> benchmark :-(
> >>>>>> I will try to create a better approximation of the original test-case.
> >>>>>>>
> >>>>>>> What I do see is a missed opportunity to merge the successors
> >>>>>>> of BB 4.  After PRE we have
> >>>>>>>
> >>>>>>> <bb 4> [local count: 159303558]:
> >>>>>>> <L1>:
> >>>>>>> pretmp_123 = *tab_37(D);
> >>>>>>> _87 = pretmp_123 + 1;
> >>>>>>> if (c_36 == 65)
> >>>>>>>   goto <bb 5>; [34.00%]
> >>>>>>> else
> >>>>>>>   goto <bb 8>; [66.00%]
> >>>>>>>
> >>>>>>> <bb 5> [local count: 54163210]:
> >>>>>>> *tab_37(D) = _87;
> >>>>>>> _96 = MEM[(char *)s_57 + 1B];
> >>>>>>> if (_96 != 0)
> >>>>>>>   goto <bb 7>; [89.00%]
> >>>>>>> else
> >>>>>>>   goto <bb 6>; [11.00%]
> >>>>>>>
> >>>>>>> <bb 8> [local count: 105140348]:
> >>>>>>> *tab_37(D) = _87;
> >>>>>>> _56 = MEM[(char *)s_57 + 1B];
> >>>>>>> if (_56 != 0)
> >>>>>>>   goto <bb 10>; [89.00%]
> >>>>>>> else
> >>>>>>>   goto <bb 9>; [11.00%]
> >>>>>>>
> >>>>>>> here at least the stores and loads can be hoisted.  Note this
> >>>>>>> may also point at the real issue with the code hoisting, which is
> >>>>>>> tearing apart the RMW operation?
> >>>>>> Indeed, this possibility seems much more likely than the block being
> >>>>>> on a loop exit.
> >>>>>> I will try to "hardcode" the load/store hoists into block 4 for this
> >>>>>> specific test-case to check if that prevents the spill.
> >>>>> Even if it prevents the spill in this case, it's likely a good thing to
> >>>>> do.  The statements prior to the conditional in bb5 and bb8 should be
> >>>>> hoisted, leaving bb5 and bb8 with just their conditionals.
> >>>> Hi,
> >>>> It seems that disabling forwprop somehow results in no extra spills
> >>>> on the original test-case.
> >>>>
> >>>> For instance,
> >>>> Hoisting without forwprop:
> >>>>
> >>>> bb 3:
> >>>> _1 = tab_1(D) + 8
> >>>> pretmp_268 = MEM[tab_1(D) + 8B];
> >>>> _2 = pretmp_268 + 1;
> >>>> goto <bb 4> or <bb 5>
> >>>>
> >>>> bb 4:
> >>>>  *_1 = _2
> >>>>
> >>>> bb 5:
> >>>> *_1 = _2
> >>>>
> >>>> Hoisting with forwprop:
> >>>>
> >>>> bb 3:
> >>>> pretmp_164 = MEM[tab_1(D) + 8B];
> >>>> _2 = pretmp_164 + 1
> >>>> goto <bb 4> or <bb 5>
> >>>>
> >>>> bb 4:
> >>>> MEM[tab_1(D) + 8] = _2;
> >>>>
> >>>> bb 5:
> >>>> MEM[tab_1(D) + 8] = _2;
> >>>>
> >>>> Although in both cases we aren't hoisting stores, the issue with forwprop
> >>>> for this case seems to be the folding of
> >>>> *_1 = _2
> >>>> into
> >>>> MEM[tab_1(D) + 8] = _2  ?
> >>>
> >>> This isn't an issue, right?  IIUC, tab_1(D) is used all over the loop,
> >>> thus propagating _1 using (tab_1(D) + 8) actually removes one live
> >>> range.
> >>>
> >>>>
> >>>> Disabling folding to mem_ref[base + offset] in forwprop "works" in the
> >>>> sense that it creates the same set of hoistings as without forwprop;
> >>>> however, it still results in additional spills (albeit in different registers).
> >>>>
> >>>> That's because forwprop seems to be increasing the live range of
> >>>> prephitmp_217 by substituting _221 + 1 with prephitmp_217 + 2
> >>>> (_221 is defined as prephitmp_217 + 1).
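
The effect being described is generic to forward propagation.  A tiny
hypothetical C illustration (the names here are unrelated to the benchmark
dumps): rewriting b + 1 as a + 2 keeps 'a' live further down while 'b' is
still live for its remaining use, so both values are live at once.

  extern int use (int);

  /* Before propagation: only 'b' is live across the call.  */
  int before (int a)
  {
    int b = a + 1;        /* 'a' can die here */
    int c = use (b + 1);
    return b + c;
  }

  /* After propagating b + 1 -> a + 2: 'a' and 'b' are both live across
     the call, which is the kind of live-range extension that raises
     register pressure.  */
  int after (int a)
  {
    int b = a + 1;
    int c = use (a + 2);
    return b + c;
  }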
> >>> Hmm, it's hard to discuss private benchmarks; I am not sure in which dump
> >>> I should find the prephitmp_221/prephitmp_217 stuff.
> >>>
> >>>> On the other hand, Bin pointed out to me in private that forwprop also
> >>>> helps to restrict register pressure by propagating "tab + const_int"
> >>>> for the same test-case.
> >>>>
> >>>> So I am not really sure if there's an easier fix than having
> >>>> heuristics for estimating register pressure at the TREE level?  I would be
> >>> Easy fix, maybe not.  OTOH, I am more convinced that passes like
> >>> forwprop/sink/hoisting can be improved by taking live ranges into
> >>> consideration, specifically to direct such passes when moving code
> >>> around different basic blocks, because inter-block register pressure
> >>> is hard to resolve afterwards.
> >>>
> >>> As suggested by Jeff and Richi, I guess the first step would be doing
> >>> experiments, collecting more benchmark data for reordering sink before
> >>> pre?  It enables code sinking as well as decreasing register pressure in
> >>> the original reduced cases IIRC.
> >>We might even consider re-evaluating Bernd's work on what is effectively
> >>a gimple scheduler to minimize register pressure.
> Could you please point me to Bernd's work?  Does it schedule across
> different basic blocks to minimize register pressure?  Like in this
> case, various optimizers extend live ranges to different basic blocks.
> I once prototyped a single-block gimple scheduler; it isn't very
> useful IMHO.

Possibly https://gcc.gnu.org/ml/gcc-patches/2017-05/msg01058.html.

Quickly re-skimming the patch, it seems to work on the set of TER-able
stmts, scheduling them and, if scheduled, removing them from the TER-able
set (because TER will undo scheduling later).  So it's partly a
register-pressure-driven cost model for TER and partly doing scheduling
on GIMPLE.

Of course it suffers from the issue I mentioned - we rely on TER
(too much) to do combine-like work during RTL expansion.

> It will be a huge amount of work to take register pressure into
> consideration in various optimizers.  OTOH, having a single
> inter-block live-range shrinking pass before expanding would be great, but I
> think it would be at least equally difficult.

Yes, I've always wanted to do some GIMPLE-level scheduling before
expanding.  And yes, it's equally difficult because it means we have
to do all the instruction selection that TER enables on GIMPLE - which is
of course also a good thing.  And yes, the difficulty is to actually
see all the expand-time benefits we get from TER - most of them
are _not_ places that look at def stmts during expansion but are
those that benefit from being fed "complex" RTL when expanding
a stmt's operands.
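
A hypothetical C example of that second kind of benefit (illustrative only):

  /* With TER the expander sees  d[i] = s[i] + 1  as one tree and can
     directly pick indexed addresses and a memory operand for the add,
     instead of first expanding every single-use SSA temporary (the
     address arithmetic and the load) as its own instruction and
     relying on RTL combine to clean it up.  */
  void inc_copy (int *d, const int *s, long i)
  {
    d[i] = s[i] + 1;
  }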

> > Sure.  The main issue I see here is the interaction with TER, which we unfortunately still rely on.  Enough GIMPLE instruction selection might help to get rid of the remaining pieces...
> 
> If the scheduler is designed to focus on inter-block live-range
> shrinking, it won't interfere with TER, which only replaces single uses
> within each basic block.
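
A hypothetical illustration of that single-use restriction (names are made up):

  extern int g (int, int);

  /* 't' is defined and used exactly once in the block, so TER can
     forward the whole expression a * b into its use at expansion time;
     'u' has two uses and is left as a separate temporary.  */
  int f (int a, int b)
  {
    int t = a * b;
    int u = a + b;
    return g (t, u) + u;
  }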

True.

Richard.

