This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: On-Demand range technology [2/5] - Major Components : How it works


On 6/4/19 1:07 PM, Richard Biener wrote:
On June 4, 2019 6:50:07 PM GMT+02:00, Andrew MacLeod <amacleod@redhat.com> wrote:
On 6/4/19 11:37 AM, Richard Biener wrote:

where the single(!) query of the range of i on the alloca call
will populate N BBs caches with the Nth having ranges for
all SSA defs of i running into quadraticness in N.
well, not really.   its still linear since the 'i' used in each foo()
will be dead, so the next block will contain no range for it


   try{
       long i_2 = _1;
       i_3 += foo(i_2);
       i_4 += foo(i_3);
       i_5 += foo(i_4);
       i_6 += foo(i_5);
... repeat N times ...
      alloca(i_6); // query range of i

once i_2 is 'dead' after the first call to foo(), it will no longer be
in the on-entry cache to any other block.
The walk back never see's i_2 until the first call to foo(), so none of
It's actually

    Utemp_2 = foo(i_1);

Bb:
    i_3 = i_1 + (long) utemp_2;
    Utemp_4 = foo(i_3);

Eliding the separate stmt for the conversion. From you description of the cache it sounded like getting incoming values is a take-all operation. Otherwise you'd need negative entries as well (I thought the cache might not contain varying ranges for example).

ok,  Utemp2 and i_1 will be live, so there are 2 ranges on entry . the next block will have only utemp_4 and i_3 live on entry.  So it'll be 2 ranges per block.


It currently does have varying in the table, but the varying value is cached for the entire ssa_name in the cache. so 500 BB's with varying range will only consume a single range.

By negative entries I assume you mean something like the bottom of a diamond?

if (x_3 > 20)
  blah1 (x_3)
else
  blah 2(x)_3;
blah3 (x_3);

It does nothing special to come up with the range of x_3 at blah3 ().  It simply unions all the incoming ranges which is [21,MAX] union [MIN,20]  to produce the value [MIN, MAX] which is then put into the cache.  We don't use anything special to represent varying.   varying_p()  merely checks if the range is [MIN, MAX]. And when putting a value into the cache, if it is varying_p() the cache simply uses it's unique copy of [MIN, MAX] rather than create a new one every time it is needed.


I had originally considered not caching ranges spanning blocks in which the range never changes...  the trade off is you have an increase in calculation time as you walk the CFG to find the last block in which it did change.  First I figured we'd first see if the less error prone full cache causes any real problems or not.   Its always an option to implement later since the cache is self contained and can resolve its queries internally however it wants.


blocks after that have a need for its range, so it will not in their
caches.

The on-entry cache will only get a single range entry for each of those

blocks.. the incoming value of i_x  and thats it. The implicitly known
live-on-exit attribute is handy here.

Now, that's not to say we cant construct cases where there is a lot of
ranges being populated.. If we find the cache is really a problem, we
can start caching ranges.. so that if all these i_x's were live
somehow,
there would only be one range for each in this case, and the cache's
would simply all point to the same range.
I suppose using i in the catch block would create sth like this, a phi with a large in degree and thus the switch case you already know about?

It would be a PHI. None of the ranges coming into the block are considered live on entry, so they dont get into the cache.  The PHI calculation asks for the incoming range on the edge for each parameter, and we get a global range for the PHI definition calculated, and that is it.

However, Thats not actually the switch problem :-)  the switch problem isn't due to a large degree of incoming edges, but rather the difficulty in calculating case ranges on the outgoing edges from the switch itself.

stepping back slightly,  Branches are handled generically the same way any expression is, the outgoing edge you want information about provides a constant range for the implicit LHS of the branch.

so for an if
  c_2 = x_3 > 20
  if (c_2 != 0)

the TRUE edge  has a range of bool [1,1] , fed back as the implicit LHS of the branch expression [1,1] = (c_2 != 0)   which range-ops for '!=' says c_2 must be [1,1] in order to satisfy the equation. [1,1] = x_3 > 20  which range-ops for '>' says x_3 must have a range of [21, MAX]

if you want the range on the FALSE edge, the edge starts with the range [0,0] into the equation solver, and out pops [MIN, 20].

switch (i_3)

for a switch, it works the same way, except the edge has an integral value, not a boolean one.  This range is fed back into the implicit branch expression: [edge range] = i_3    ,    so i_3  has whatever value the edge has since it amounts to a copy.  Calculating this edge constant value is the problem.

BB2:
switch (i_3) {
BB3:
  case 4:
  case 5:
  case 9:
  case 10:
     foo (i_3)

In order to evaluate the case values on  the edge from BB2 to BB3 as  [4,5][9,10],  we have to loop over all  the cases that have a label at the beginning of the block, and union them together. Calculating the default value is even worse, start with varying and intersect out each and every case.

The gimple representation does not make evaluating this easy nor cheap since it merely has a list of case LOW/HIGHS with labels... so you have to loop thru that entire list for each edge being evaluated to see if the label is associated with this edge and then union together all those that match.   This is linear and cant be shortcut. multiple it by the number of edges since you probably want a range on each edge and its suddenly exponential.    The result is it can be *very* time consuming if the switch is very, very large.

And it's a constant value that never changes, so it really could be cheap. We've just never needed to ask this exact question before. We've considered multiple ways to address this,  but we wont worry about any of them just yet.  All in time :-)





so far we havent really run across anything that appears to trigger
significant concerns.
Sure. It's still worth thinking about worst case behavior and how you'd deal with that given once ranger is actually used we will eventually run into these cases. When
A forward evrp walk is then faster than a single on demand query that would be concerning.

Richard.
Absolutely.  I've been concerned about worst case since the get go. Thats how its evolved to this design.  We're trying to make sure the worst case scenario is no worse than doing a normal walk.

We can currently run the ranger version of RVRP across blocks in dominator order, reverse dominator order,  linearly BB1 -> BBN, and finally  BBN-> BB1   There are fluctuations in timings, but nothing of too much significance.  They all run in similar times and pass all the tests.

I have thoughts on how the various pieces can fit together with EVRP,  I'll stick that in another email.

Andrew


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]