[PATCH 1/4] unroll: Add middle-end unroll factor estimation

Kewen.Lin linkw@linux.ibm.com
Mon Mar 1 02:45:28 GMT 2021


on 2021/1/27 5:43 PM, Kewen.Lin via Gcc-patches wrote:
> on 2021/1/26 6:53 PM, Richard Biener wrote:
>> On Tue, 26 Jan 2021, Kewen.Lin wrote:
>>
>>> Hi Segher/Richard B./Richard S.,
>>>
>>> Many thanks for all your help and comments on this!
>>>
>>> on 2021/1/25 3:56 PM, Richard Biener wrote:
>>>> On Fri, 22 Jan 2021, Segher Boessenkool wrote:
>>>>
>>>>> On Fri, Jan 22, 2021 at 02:47:06PM +0100, Richard Biener wrote:
>>>>>> On Thu, 21 Jan 2021, Segher Boessenkool wrote:
>>>>>>> What is holding up this patch still?  Ke Wen has pinged it every month
>>>>>>> since May, and there has still not been a review.
>>>>>
>>>>> Richard Sandiford wrote:
>>>>>> FAOD (since I'm on cc:), I don't feel qualified to review this.
>>>>>> Tree-level loop stuff isn't really my area.
>>>>>
>>>>> And Richard Biener wrote:
>>>>>> I don't like it, it feels wrong but I don't have a good suggestion
>>>>>> that had positive feedback.  Since a reviewer / approver is indirectly
>>>>>> responsible for at least the design I do not want to ack this patch.
>>>>>> Bin made forward progress on the other parts of the series but clearly
>>>>>> there's somebody missing with the appropriate privileges who feels
>>>>>> positive about the patch and its general direction.
>>>>>>
>>>>>> Sorry to be of no help here.
>>>>>
>>>>> How unfortunate :-(
>>>>>
>>>>> So, first off, this will then have to work for next stage 1 to make any
>>>>> progress.  Rats.
>>>>>
>>>>> But what could have been done differently that would have helped?  Of
>>>>> course Ke Wen could have written a better patch (aka one that is more
>>>>> acceptable); either of you could have made your current replies earlier,
>>>>> so that it is clear help needs to be sought elsewhere; and I could have
>>>>> pushed people earlier, too.  No one really did anything wrong, I'm not
>>>>> seeking who to blame, I'm just trying to find out how to prevent
>>>>> deadlocks like this in the future (where one party waits for replies
>>>>> that will never come).
>>>>>
>>>>> Is it just that we have a big gaping hole in reviewers with experience
>>>>> in such loop optimisations?
>>>>
>>>> Maybe.  But what I think is the biggest problem is that we do not
>>>> have a good way to achieve what the patch tries (if you review the
>>>> communications you'll see many ideas tossed around) first and foremost
>>>> because IV selection is happening early on GIMPLE and unrolling
>>>> happens late on RTL.  Both need a quite accurate estimate of costs
>>>> but unrolling has an even harder time than IV selection where we've
>>>> got along with throwing dummy RTL at costing functions.
>>>>
>>>
>>> Yeah, exactly.
>>>
>>>> IMHO the patch is the wrong "start" to try fixing the issue and my
>>>> fear is that wiring this kind of "feature" into the current
>>>> (fundamentally broken) state will make it much harder to rework
>>>> that state without introducing regressions on said features (I'm
>>>> there with trying to turn the vectorizer upside down - for three
>>>> years now, struggling to not regress any of the "features" we've
>>>> accumulated for various targets, where most of them feel
>>>> "bolted-on" rather than well-designed ;/).
>>>>
>>>
>>> OK, understandable.
>>>
>>>> I think IV selection and unrolling (and scheduling FWIW) need to move
>>>> closer together.  I do not have a good idea how that can work out
>>>> though but I very much believe that this "most wanted" GIMPLE unroller
>>>> will not be a good way of progressing here.  Maybe taking the bullet
>>>> and moving IV selection back to RTL is the answer.
>>>>
>>>
>>> I haven't looked into loop-iv.c, but IVOPTS on gimple can leverage
>>> SCEV analysis for iv detection; if we moved it to RTL, wouldn't it
>>> be much more expensive to detect the full iv set there?
>>>
>>>> For a "short term" solution I still think that trying to perform
>>>> unrolling and IV selection (for the D-form case you're targeting)
>>>> at the same time is a better design, even if it means complicating
>>>> the IV selection pass (and yeah, it'll still be at GIMPLE and w/o
>>>> any good idea about scheduling).  There are currently 20+ GIMPLE
>>>> optimization passes and 10+ RTL optimization passes between
>>>> IV selection and unrolling; the idea that you can have the transform
>>>> decision and the transform application this far apart looks scary.
>>>>
>>>
>>> I have some questions in mind for this part.  "Perform unrolling
>>> and IV selection at the same time" can be interpreted in two
>>> different implementation ways to me:
>>>
>>> 1) Run one gimple unrolling pass just before IVOPTS, probably using
>>>    the same gate as IVOPTS.  The unrolling factor is computed by the
>>>    same method as in RTL unrolling.  But this sounds very much like
>>>    the "most wanted gimple unrolling", which is what we want to avoid.
>>>
>>>    The positive aspect here is that what IVOPTS sees is already an
>>>    unrolled loop, so it can see the whole picture and get the optimal
>>>    IV set.  The downside/question is how we position the gimple
>>>    unrolling and RTL unrolling passes, and whether we still need RTL
>>>    unrolling.  If not, it's doubtful that a gimple unroller can fully
>>>    replace the RTL unroller, probably lacking some actual target
>>>    information/instructions.  If yes, it's still possible to end up
>>>    with inconsistent unrolling factors between what IVOPTS optimizes
>>>    for and what the late RTL unrolling pass ends up with.
>>>
>>> 2) Make IVOPTS determine the unrolling factor by considering the
>>>    reg-offset addressing (D-form), unroll the loop and do the rest.
>>
>> Yes, that's what I meant.
>>
>>>    I don't think you meant this one, though.  Compared to reg-reg
>>>    addressing, reg-offset addressing only saves some bumps, which
>>>    seems too weak to be a deciding factor for the unrolling factor.
>>>    Unlike the vectorizer or modulo scheduling, there are most likely
>>>    more important factors in the unrolling factor computation than
>>>    this one.
>>
>> But you _are_ using that bump cost to compute the unroll estimate
>> (and IIRC you are doing the D-form IV selection then).
> 
> The patch series works in the opposite direction: it uses the estimated
> unrolling factor (UF for short below) to adjust the bump costs.  The
> process looks like:
>   - estimate the UF following a similar UF determination to the RTL
>     unroller's.
>   - identify the iv use groups for which reg-offset (D-form) addressing
>     is available, and mark some reg-offset iv cands.
>   - scale up all pair costs if needed; for reg-offset iv cands the bump
>     cost (step cost) is counted just once, for the others UF-1 times.
>   - run the IV selection algorithm as usual.
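> 
> Purely to illustrate the cost scaling step (this is not the actual
> patch code; the type and field names below are made up):
> 
>   struct cand_info { bool reg_offset_p; unsigned cost_step; };
> 
>   /* Charge the iv bump (step) cost once per unrolled copy, except for
>      reg-offset (D-form) candidates, which fold the bump into the
>      memory accesses and so only need one explicit bump.  */
>   static void
>   scale_step_costs (cand_info *cands, unsigned n, unsigned est_uf)
>   {
>     for (unsigned i = 0; i < n; i++)
>       if (!cands[i].reg_offset_p)
>         cands[i].cost_step *= (est_uf - 1);
>   }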
> 
> From the perspective of bump cost, we can compute one UF (let's call it
> ivopts_UF here), meaning:
>   - when the actual UF > ivopts_UF, the reg-offset iv selection wins
>     with fewer bump costs.
>   - when the actual UF <= ivopts_UF, the reg-reg (indexed addressing)
>     selection wins.
> 
> It looks like we can get it with the below:
> 
>   G for group cost, B for basic iv cost, S for step iv cost, N for iv count.
> 
>     G1 = (gcost<oiv, grp1> + gcost<oiv, grp2> + ...) * ivopts_UF;
>     B1 = bcost<oiv>;
>     S1 = scost<oiv> * (ivopts_UF - 1);
>     N1 = 1;
> 
>     G2 = (gcost<iv1, grp1> + gcost<iv2, grp2> + ...) * ivopts_UF;
>     B2 = bcost<iv1> + bcost<iv2> + ...;
>     S2 = scost<iv1> + scost<iv2> + ...;
>     N2 = count of (iv1, iv2, ...);
> 
>   Let G1 + B1 + S1 + N1 = G2 + B2 + S2 + N2, solve for ivopts_UF, and
>   then round the result up to the next power of two (2^n).
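> 
>   As a made-up numeric example of the evaluation: assume the group
>   costs are equal on both sides so G1 and G2 cancel out, bcost<oiv> = 4,
>   scost<oiv> = 4 and N1 = 1, while the other side uses two ivs with
>   bcost = scost = 4 each, so B2 + S2 + N2 = 8 + 8 + 2 = 18.  Then
>   4 + 4 * (ivopts_UF - 1) + 1 = 18 gives ivopts_UF = 4.25, which is
>   rounded up to 8 as the next power of two.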
> 
> This should have an upper bound derived from the range of offsets the
> target supports (let's call it offset_UF_bound).
> 
> I have some concerns about the ivopts_UF computed this way, since we
> only compute it from the address type iv uses and don't consider the
> other iv uses.  It's possible that some iv uses prefer iv cands which
> lead to indexed addressing, and we could miss that.  If we want to
> consider all iv uses globally, it amounts to running the IV selection
> itself to conclude the ivopts_UF.  That looks very hard to me in the
> current framework, but we could probably take a range from the lower
> bound to the upper bound and iterate the iv selection over it until we
> find an acceptable UF or exhaust the range (it could even be a binary
> search); that still seems unattractive because of the possible compile
> time cost.  // (A)
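> 
> Just to sketch the iteration idea in (A) (all of the helpers named
> here are invented, none of them exists today):
> 
>   /* Try candidate UFs from the lower bound up to the upper bound
>      (powers of two), stopping at the first one whose optimal iv set
>      actually prefers the reg-offset candidates.  */
>   unsigned uf;
>   for (uf = low_uf; uf <= high_uf; uf *= 2)
>     {
>       scale_costs_for_uf (data, uf);
>       iv_set *set = find_optimal_iv_set (data);
>       if (set_uses_reg_offset_p (set))
>         break;
>     }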
> 
> Apart from this, let's assume we have determined an ivopts_UF (with
> offset_UF_bound taken into account): do we need to take care of some
> other possible upper bounds, especially the ones applied in the RTL
> unroller?
> 
> This question comes from the concern that when we determine ivopts_UF
> we only focus on the iv cands and iv uses (whether all of them or just
> the address type ones), and it's likely that some or even most
> statements of the loop don't qualify to show up as iv uses.  We could
> then have calculated an ivopts_UF of 4 while the loop body is actually
> big and the UF would be 2 according to the RTL unroller handlings;
> since IVOPTS runs early, we would unroll the loop 4 times first and
> degrade performance (causing more spilling etc.).  The ivopts_UF
> computation here focuses on saving bump costs, which may not be the
> critical thing for the loop, compared with the passes that want to
> drive unrolling and truly gain more from it (typically more
> parallelism, as with vectorization and SMS).  // (B)
> 
> So I think we may need to take the RTL unroller's UF determination into
> account somehow, for example by respecting the unrolling parameters and
> the loop_unroll_adjust hook if that works at the gimple stage.  I assume
> there is some tuning work behind those parameters and hooks.
> 
> Given the above, the approach would look like (a rough sketch follows
> below):
>   1. compute ivopts_UF.
>   2. adjust it according to the allowable reg-offset range.
>   3. adjust it to respect the RTL UF determination. // for (B)
>   4. adjust the costs according to this ivopts_UF.
>   5. run iv selection as usual.
>   6. check that the iv set is as expected. // 4,5,6 for (A)
>   7. unroll by ivopts_UF in a post pass.
> 
> For 5-7, an alternative is to halt the ivopts processing for the loop,
> unroll it by ivopts_UF and rerun ivopts, but then I think we need to
> rerun the analysis on the unrolled loop.  Since we have already decided
> the UF, that is no different from unrolling in a post pass, or at the
> end of the pass, as long as we make sure the unrolling happens.
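> 
> Put together, a rough sketch of steps 1-7 (again, every function name
> here is invented just for illustration):
> 
>   unsigned uf = estimate_unroll_factor (loop);           /* step 1 */
>   uf = MIN (uf, offset_uf_bound (loop));                 /* step 2 */
>   uf = MIN (uf, rtl_unroller_uf_limit (loop));           /* step 3 */
>   scale_costs_for_uf (data, uf);                         /* step 4 */
>   iv_set *set = find_optimal_iv_set (data);              /* step 5 */
>   if (!iv_set_as_expected_p (set, uf))                   /* step 6 */
>     uf = 1;
>   loop->estimated_unroll = uf;  /* step 7: unroll in a post pass */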
> 

Hi Richard(s)/Segher,

I'd like to ping this to keep the "boat" from sinking again. :-)

What do you think of the above proposal and the related concerns?


BR,
Kewen

>>
>>>    For IVOPTS, it's more that it doesn't care what the unrolling
>>>    factor should be, but it needs to know what the unrolling factor
>>>    will be, and then do optimal IV selection based on that.  So it's
>>>    not a good fit for it to decide the unrolling factor.
>>
>> Well, currently with the stupid RTL unroller you will know it will
>> be unrolled 8 times unless it is cold or doesn't iterate enough.
>> There's not much costing going on in the RTL unroller (unless you
>> go wild with target hooks).
> 
> But I guess there was some tuning work behind those things?
> Like some condition checks, parameters and hooks to adjust the value.
> 
> 
> BR,
> Kewen
>

