Ranger Integration With Trunk
The on-demand GORI and path range engines are completely self-contained components. They depend solely on the irange implementation, and consist of a few files which can be added to GCC. If they are not called, they have no impact on compile time or code generation. This makes them a low-risk package to integrate with trunk.
Within the current context of this project, I think it is best to keep class irange a component of the Ranger, allowing inter-operability with value_range for a few reasons.
- We desire a single code base to perform operations on ranges. Other than symbolics, irange provides a super-set of the operations available on value_range. We plan to modify the non-symbolic parts of VRP and EVRP to call routines which will, under the covers, utilize the range-ops routines to perform their operations, and then return a value_range structure. This should be transparent to them, and allow consolidation of the range operation code in one place. While doing this work we've noticed that EVRP appears to have diverged from VRP in places already. It would be good to have a single place for all operations.
- Last year Aldy proposed integrating class irange into the compiler, replacing the global range table and all uses of value_range throughout the compiler. Many places (other than VRP and EVRP) where value_range was used would be replaced by calls to the ranger and make use of class irange directly. Altering all these locations before converting to the Ranger would generate unnecessary effort and noise. Especially considering the next point.
- Given the new on-demand model of the Ranger, it is unclear whether we wish to maintain a global range table as we have today anyway. Anyone who requires a range might be better off simply asking a Ranger for the range at the point they care about, getting a contextually sensitive range instead. The ranger itself will likely maintain some sort of global table to avoid recreating reduced ranges, but this would be of limited value to anyone else.
Ideally, once the range-ops are fully implemented providing coverage over all the operations VRP and EVRP provide, this would reduce the code within EVRP to be just the algorithm itself with calls to the generic handling of the various operations. In theory, EVRP could then be easily converted to utilize class irange natively instead of using the wrapper code for value_range. That would leave the current VRP as the only remaining consumer of value_range which utilized symbolic endpoints, and even there all the non-symbolic code would be standardized in the range-op code.
As mentioned in the earlier Ranger document, we have every reason to believe that in time we can implement an iterative pass using the ranger which will perform all the same operations as the ASSERT_EXPR and symbolic lattice approach that VRP uses. At that point we believe value_range can be eliminated. But there is no rush on that front; this is not intended to be a VRP replacement, but rather a tool which can also be used to perform most if not all VRP operations.
Pass Conversions
Aldy has already converted 4 passes in the compiler to use the new Ranger model. These are his notes about them:
pass_sprintf_length
This was very straightforward and mechanical. It was basically changing all calls into the EVRP engine into calls to on_demand_get_range_on_stmt() and getting rid of a dominator walk in favor of the cheaper FOR_EACH_BB_FN.
We have no regressions wrt the old implementation, and we now get conditional things we couldn't before:
  if (n == 3)
    retval = __builtin_snprintf ((buffer + sizeof buffer - n), n, "%i%i",
				 value_range (1, 99),
				 value_range (1, 99)); /* { dg-warning "output may be truncated" } */
Another interesting aspect of this pass is that it doesn't actually "own" a ranger object. It simply makes calls to an "on-demand" range function:
  static inline bool
  on_demand_get_range_on_stmt (irange &r, tree ssa, gimple *stmt)
  {
    path_ranger ranger;
    return ranger.path_range_on_stmt (r, ssa, stmt);
  }
Note this routine creates a new ranger from scratch, makes the call, and then destroys that ranger before returning the result. These lookups will not benefit from any caching, and demonstrate how lightweight the Ranger is. Yet the run time performance is still quite remarkable as you will see in the performance section below. This is of particular use to places which share common code with other parts of the compiler that we don't want to complicate by having to pass a ranger object around. Due to the lack of caching, it works best when the ranges being examined are disjoint. Performance would slow down a bit if the same ranges were being looked up over and over.
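The create-query-destroy pattern above can be illustrated with a small self-contained sketch. The class and function names below (toy_ranger, on_demand_range, and friends) are hypothetical stand-ins, not GCC's actual API; the point is only that a per-instance cache provides no benefit when each query builds a fresh ranger.

```cpp
#include <cassert>
#include <map>

// Toy stand-ins for the ranger API (hypothetical names; the real
// path_ranger lives inside GCC and is not reproduced here).
struct toy_range { long lo, hi; bool valid; };

struct toy_ranger
{
  // Per-instance cache: it only helps repeated queries made through
  // the *same* ranger object.
  std::map<int, toy_range> cache;

  toy_range range_of (int ssa_id)
  {
    auto it = cache.find (ssa_id);
    if (it != cache.end ())
      return it->second;
    toy_range r = { 0, 99, true };   // Pretend range computation.
    cache[ssa_id] = r;
    return r;
  }
};

// Mirrors the shape of on_demand_get_range_on_stmt: a fresh ranger
// per call, so nothing is cached across calls.
static inline bool
on_demand_range (toy_range &r, int ssa_id)
{
  toy_ranger ranger;              // Created from scratch...
  r = ranger.range_of (ssa_id);   // ...queried once...
  return r.valid;                 // ...and destroyed on return.
}
```

This mirrors why the pattern works best for disjoint lookups: each call pays the full computation cost, since the cache dies with the ranger.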
pass_data_wrestrict
This involved mostly mechanical changes, but I didn't want to change the notion of upper and lower bounds, so there are some hacks to make sure the pass still gets what it wants in terms of anti ranges:
  +  if (maybe_get_range_on_stmt (r, offset, call))
  +    {
  +      irange anti;
  +      if (anti_range_p (r, anti))
  +	{
  +	  rng = VR_ANTI_RANGE;
  +	  r = anti;
  +	}
  +      else
  +	rng = VR_RANGE;
  +      min = r.lower_bound ();
  +      max = r.upper_bound ();
  +    }
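The anti-range conversion can be sketched in isolation. The helper below is a hypothetical stand-in (the real anti_range_p works on class irange); it only illustrates the underlying identity: a union [TYPE_MIN, a][b, TYPE_MAX] is the complement of the single "hole" [a+1, b-1], which is exactly what VR_ANTI_RANGE expresses.

```cpp
#include <cassert>
#include <climits>
#include <vector>
#include <utility>

typedef std::pair<long, long> subrange;

// Hypothetical sketch: decide whether a union of subranges spanning
// the whole type is really the complement of one hole, i.e. an
// anti range ~[lo, hi] in value_range terms.
static bool
anti_range_p (const std::vector<subrange> &r, subrange &anti)
{
  // [TYPE_MIN, a][b, TYPE_MAX] == ~[a + 1, b - 1]
  if (r.size () == 2
      && r[0].first == LONG_MIN
      && r[1].second == LONG_MAX
      && r[0].second + 1 <= r[1].first - 1)
    {
      anti = subrange (r[0].second + 1, r[1].first - 1);
      return true;
    }
  return false;
}
```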
We have no regressions wrt the old implementation, and we now get conditional things we couldn't before:
  unsigned n;
  void test_memcpy_warn (char *d)
  {
    if (n > 10 && n < 20)
      __builtin_memcpy (d, d + 2, n); // { dg-warning "overlaps between" }
  }
pass_data_walloca
The conversion of this pass involved deleting a lot of code that was making up for the fact that we had poor global range information. The previous patch chased DEF chains and made assumptions about cast results.
The entire patch is here. Mostly it removes code, and the core is now simply
  irange r;
  if (TREE_CODE (len) == SSA_NAME
      && ranger.path_range_on_stmt (r, len, stmt)
      && !r.range_for_type_p ())
    {
      irange invalid_range (size_type_node, 0, max_size, irange::INVERSE);
      if (r.intersect (invalid_range).empty_p ())
	return alloca_type_and_arg (ALLOCA_OK);
      return alloca_type_and_arg (ALLOCA_BOUND_MAYBE_LARGE);
    }
  return alloca_type_and_arg (ALLOCA_UNBOUNDED);
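The core test above intersects the computed range with the *inverse* of the valid size range; an empty intersection means no possible value of len can exceed the bound. A minimal self-contained sketch of that idea, with hypothetical names (interval, check_alloca_size) standing in for irange and the walloca logic:

```cpp
#include <cassert>
#include <climits>
#include <algorithm>

// Minimal interval mimicking the intersect/empty_p test
// (hypothetical stand-in for class irange).
struct interval
{
  long lo, hi;			// Empty when lo > hi.
  bool empty_p () const { return lo > hi; }
  interval intersect (const interval &o) const
  {
    return interval { std::max (lo, o.lo), std::min (hi, o.hi) };
  }
};

enum alloca_verdict { ALLOCA_OK, ALLOCA_BOUND_MAYBE_LARGE };

// Valid sizes are [0, max_size]; anything that can exceed max_size
// intersects the "invalid" range [max_size + 1, +inf).
static alloca_verdict
check_alloca_size (const interval &len, long max_size)
{
  interval invalid = { max_size + 1, LONG_MAX };
  if (len.intersect (invalid).empty_p ())
    return ALLOCA_OK;
  return ALLOCA_BOUND_MAYBE_LARGE;
}
```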
Rewriting this pass was less cutting and pasting, and more thinking about how things work in the path ranger world, but it was still relatively painless. (This could also have to do with the fact that I wrote the original annoying hoops that probably inspired some of the ranger work :)).
There are no regressions from the previous code.
pass_data_thread_jumps
The work on the backwards threader was not a conversion but adding new functionality to the threader.
The current threader chases SSA copies to definitions, which when ultimately constant, can be threaded and expose further optimization opportunities. The problem is that right now we only use actual definitions of NAME to determine if there's a path where NAME has a constant value. We don't make use of "implicit assignments".
What we want to do is use other statements in the IL to find constant values for NAME. So if we take some simple pseudocode:
  x = <whatever>
  if (x == 24)
    {
      code1;
      if (x == 42)
	code2;
    }
we would know that x cannot be 42 and we can thread past code2. Thanks to the range information, the new code can do even better than this, doing the same thing with ranges rather than just constants:
  if (x < 24)
    {
      bob ();
      if (x == 42)
	not_uncle ();
    }
Results in code like:
  <bb 2> :
  if (x_2(D) <= 23)
    goto <bb 3>; [INV]
  else
    goto <bb 4>; [INV]

  <bb 3> :
  bob ();

  <bb 4> :
  return;
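The range reasoning behind this can be sketched with a trivial helper. This is not the threader's real API, just an illustration: on the true edge of x < 24, x is confined to [LONG_MIN, 23], while the later test x == 42 requires [42, 42]; since the two do not intersect, the branch is provably dead on that path and can be threaded past.

```cpp
#include <cassert>
#include <climits>
#include <algorithm>

// Hypothetical helper: [lo1, hi1] is what the path tells us about x;
// [lo2, hi2] is the range a later condition requires.  An empty
// intersection proves the later branch cannot be taken on this path.
static bool
ranges_intersect (long lo1, long hi1, long lo2, long hi2)
{
  return std::max (lo1, lo2) <= std::min (hi1, hi2);
}
```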
This pass work took a long time, but that had nothing to do with the ranger. It was mostly due to the fact that we were designing this new threading world on the fly. We also added an additional API to the ranger to support calculating ranges on specified paths. The threader now looks for new sets of paths which may be profitable and checks the ranges on those.
With the range based threader work, we now handle things like:
  /* Test that we thread 2 -> 3 -> 5 in the early threader.  */
  int no;
  int yes;

  void __GIMPLE (startwith ("ethread"))
  foo (int x)
  {
  bb_2:
    if (x == 222)
      goto bb_3;
    else
      goto bb_5;
  bb_3:
    if (x == 444)
      goto bb_4;
    else
      goto bb_5;
  bb_4:
    no = 123;
  bb_5:
    yes = 456;
  }
Results in code from the threader:
  <bb 2> :
  bb_2:
  yes = 456;
  return;
Performance analysis
Aldy hacked the compiler up to get some times in microseconds on his machine, to give us a ballpark idea of how long the ranger is taking relative to other passes. He ran the compiler through 45 various .ii files he has collected, and this is the total number of microseconds taken for each of the above pass conversions:
  Pass name              | Original time | Ranger version time | Change  | Comments
  -----------------------+---------------+---------------------+---------+---------------------------------------------------------------------
  pass_sprintf_length    |        887807 |               12497 |  -98.6% | No dominator walk, no longer wired into EVRP.  Only need a few ranges.
  pass_data_walloca      |          8587 |                9208 |   +7.2% | Replaces accessing the global range variable with calls that get accurate ranges.
  pass_data_wrestrict    |         44132 |               23924 |  -45.8% | No dominator walk *
  pass_data_thread_jumps |       1282495 |             3701278 | +188.6% | All new path walking functionality.
* Aldy removed the dominator walk from wrestrict. Not sure why it was there unless it was in preparation for wiring into the EVRP pass.
As yet, there has been no attempt to tune the Ranger. It appears to perform relatively well, even though we suspect there are areas which should be investigated.
- irange and its operations.
- Path ranger SSA cache - maybe switch to a hash table? It may end up being too dense to matter; I'm not sure, since we haven't looked at it yet.
- A full-on performance analysis should be undertaken before any integration with mainline to look for obvious hot-spots. I used to use gprof back in the old days to get a count of how many times each BB in a file was executed during a big run. Look for the areas where the most time was spent in the pass and determine whether that makes sense, or whether we're doing something stupid and fix it. It catches a lot of dumb things, especially in C++ if you are inadvertently passing something large by value instead of by reference like you meant :-).
In any case, I'm sure there are more modern tools than gprof to look for where a pass is spending most of its time, but we should use something to do this analysis, if for nothing else than to assure ourselves the Ranger is performing in line with expectations. Our expectation is that any additional costs accumulated for the dynamic lookup should be offset by the lack of calculating dominators, SSA renaming, and any other expensive prerequisites.
Code duplication
Probably the biggest concern is duplication of code. VRP currently has code to extract ranges from various expressions. In some respects range-op.c duplicates some of this code. The VRP code is cluttered by the need to deal with symbolic ranges, and it also deals with trees in the primary interface rather than wide-ints.
There was some initial work done to extract the non-symbolic aspects of the range extractions, and port it to the range-op framework. This provided the equivalent operations in a generic wide_int environment. This work involves splitting out some of the existing routines into wide_int and tree wrappers allowing the range-ops code to utilize the wide_int variations. The initial cut of this can be found in the file wide-int-aux.[ch]. It contains some code from fold-const.c, some from tree-vrp.c and some from the ranger.
I'm not particularly happy with the way that went. Instead, what I'd rather do is implement the range-ops code, and then have both EVRP and VRP call wrapper functions which take value_ranges, convert them to an irange, invoke the range-ops routine to do the processing, then convert the irange back to a value_range. We'll prototype doing that for some basic operators and see how that works.
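The shape of such a wrapper can be sketched with toy types. The structs and functions below (to_irange, vr_plus, etc.) are hypothetical stand-ins for GCC's value_range and irange; only the convert / compute / convert-back pattern is the point.

```cpp
#include <cassert>

// Toy stand-ins (hypothetical): the real types are GCC's value_range
// and irange; only the conversion shape is sketched here.
struct value_range { long min, max; };
struct irange { long lo, hi; };

static irange
to_irange (const value_range &vr)
{
  return irange { vr.min, vr.max };
}

static value_range
to_value_range (const irange &r)
{
  return value_range { r.lo, r.hi };
}

// A range-op routine working purely on iranges...
static irange
range_op_plus (const irange &a, const irange &b)
{
  return irange { a.lo + b.lo, a.hi + b.hi };
}

// ...and the wrapper EVRP/VRP would call with value_ranges.  The
// conversion is transparent to the caller, so the actual range math
// lives in exactly one place.
static value_range
vr_plus (const value_range &a, const value_range &b)
{
  irange r = range_op_plus (to_irange (a), to_irange (b));
  return to_value_range (r);
}
```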
Early VRP utilizes the same VRP code but has no need for symbolics, so it may be possible to move the EVRP extraction code to use the range-op interface directly and avoid the wrappers.
RVRP
There is also the new RVRP pass. It is mostly a path ranger equivalent of an Early VRP pass, as proof of concept that it is both powerful and fast enough to become the general purpose range engine in the compiler. This pass can be found in ssa-range-vrp.c. It's not a complete version of EVRP, since it does not walk the IL folding statements within the block. I have considered writing an alternative RVRP which would do that, to give a better apples-to-apples comparison, but I simply haven't gotten to it yet.
As it stands, the ranger VRP pass is running immediately before the Early VRP pass, so anything Early VRP still finds is something Ranger VRP missed. Running in the other order, with RVRP after EVRP, would show any cases which Ranger VRP gets that EVRP does not. This also provides a test bed for a compile time performance comparison between the two approaches.
That said, the ranger itself is not intended to be a replacement for VRP, but rather a range generating tool that can be easily utilized anywhere with little infrastructure requirements. It just happens that VRP makes heavy use of ranges to do its job, and thus makes for an obvious candidate. The original driving force was for other optimizations, such as sprintf and alloca warnings, the backwards threader, and any place which can utilize contextual accurate range information if it were easily available. The theory being that any place in the compiler that asks for the global range of an ssa_name, can now get a context-sensitive accurate range for that same ssa_name with a simple call.
We compared the run time for EVRP and the new RVRP just to see how much room we had to cover the difference between the two passes. We get a lot of things EVRP does, but we don't get them all yet. So take it with a grain of salt. Run over the same 45 .ii files:
  PASS | Microseconds
  -----+-------------
  RVRP |      1007206
  EVRP |      2020806
This shows RVRP running in about half the time EVRP takes.
Check the bottom of the project page to see how straightforward the Ranger's VRP pass is.
Ideally we can meet some entry criteria, check the code into trunk, and allow other optimizations to start using it. Ranger VRP (RVRP) could be disabled initially while work continues on fleshing out the range-op functionality to at least equal what VRP currently does, but I would prefer if we could leave it running so we can get good test coverage.
We have started a set of regression tests. They are found in gcc.dg/tree-ssa/rvrp*.c. We have had to modify a number of vrp1 tests to disable rvrp and sometimes threading in order to continue testing vrp with them.
The code is currently on the SVN branch ‘ssa-range’. When run with -fdump-tree-all-details, the pass dump *.rvrp will contain a listing like the one below, which shows the ranges found on the various edges and blocks.
A final simple example showing a case ranger VRP gets which EVRP does not (caveat: with threading turned off; early threading now uses the ranger and eliminates some of this even earlier). The listing shows any ranges observed on entry to each block, those defined within that block, as well as those on each outgoing edge. Also note that the range of x_5 on the edge from 3->5 is []. A NIL range means that there is no possible value, and that means the edge cannot be taken!
  int f2(int x)
  {
    return x == 1 || x == 3 || x == 1;
  }

  ========= BB2 =========
  <bb 2> :
  _1 = x_5(D) == 1;
  _2 = x_5(D) == 3;
  _3 = _1 | _2;
  if (_3 != 0)
    goto <bb 5>; [INV]
  else
    goto <bb 3>; [INV]

  2->5 (T) _3      [1, 1] _Bool
  2->5 (T) x_5(D)  [1, 1][3, 3] int
  2->3 (F) _1      [0, 0] _Bool
  2->3 (F) _2      [0, 0] _Bool
  2->3 (F) _3      [0, 0] _Bool
  2->3 (F) x_5(D)  [0x80000000, 0][2, 2][4, 0x7fffffff] int

  ========= BB3 =========
  Ranges on Entry :
  _1 : [0, 0] _Bool
  _2 : [0, 0] _Bool
  _3 : [0, 0] _Bool
  x_5(D) : [0x80000000, 0][2, 2][4, 0x7fffffff] int

  <bb 3> :
  if (x_5(D) == 1)
    goto <bb 5>; [INV]
  else
    goto <bb 4>; [INV]

  3->5 (T) x_5(D)  [] int
  3->4 (F) x_5(D)  [0x80000000, 0][2, 2][4, 0x7fffffff] int

  ========= BB4 =========
  Ranges on Entry :
  _1 : [0, 0] _Bool
  _2 : [0, 0] _Bool
  _3 : [0, 0] _Bool
  x_5(D) : [0x80000000, 0][2, 2][4, 0x7fffffff] int

  <bb 4> :

  ========= BB5 =========
  <bb 5> :
  # iftmp.0_4 = PHI <1(3), 0(4), 1(2)>
  return iftmp.0_4;

  -- Ranges Defined -- :
  iftmp.0_4 : [0, 1] int

  RVRP: Considering BB 2: if (_3 != 0)
  RVRP: Considering BB 3: if (x_5(D) == 1)
    Expression evaluates to range: [0, 0] _Bool
    RVRP: Branch rewritten to: if (0 != 0)
  Removing basic block 3
  ;; basic block 3, loop depth 0
  ;;  pred:
  ;;  succ:       4
  fix_loop_structure: fixing up loops

  f2 (int x)
  {
    _Bool _1;
    _Bool _2;
    _Bool _3;
    int iftmp.0_4;

    <bb 2> :
    _1 = x_5(D) == 1;
    _2 = x_5(D) == 3;
    _3 = _1 | _2;
    if (_3 != 0)
      goto <bb 4>; [INV]
    else
      goto <bb 3>; [INV]

    <bb 3> :

    <bb 4> :
    # iftmp.0_4 = PHI <1(2), 0(3)>
    return iftmp.0_4;
  }