Performance

We have done extensive performance analysis to help address concerns about the nature of an on-demand model. LLVM attempted something similar, but ran into significant performance issues they could not solve with their approach. Our approach is not the same, and so far we have seen no sign of pathological cases.

I have trawled bugzilla looking for large problematic test cases, and have tried a couple of test cases from LLVM which dberlin pointed out as having been problematic for them. To date, I have found nothing that shows any kind of performance problem. I am more than happy to entertain whatever cases y’all might want to throw my way.

For a test of robustness, we have built a complete Fedora distribution consisting of 9174 packages with the ranger branch. All but 3 packages build successfully and pass the relevant regression tests; 2 of those failures appear to be related to RVRP and are still under analysis.

Our primary performance testbase to date has been the compiler itself. We compile 242 .ii files from a stage1 compiler build and compare times against EVRP. VRP itself is quite a bit slower, and although we do have an iterative update infrastructure, comparisons with VRP aren’t entirely fair since we don’t yet do all of the equivalence and bitmask operations it does.

Before going into the numbers, I would like to visit a minor issue we have with switches. RVRP works from the bottom up, so in order to evaluate a range, it begins by getting the constant range for the LHS of the branch from the edge. For a conditional this is trivially [1,1] or [0,0], depending on whether it is the TRUE or FALSE edge.

For a switch, it turns out the GCC GIMPLE representation has no simple way to figure this out. As a result, we need to loop over every edge of the switch and union together all the cases which share that edge, or, for the default edge, intersect out all the other cases. This turns out to be *very* time consuming in test cases with very large switches, somewhere in the vicinity of O(n^3). Ugh. So the ranger incurs a fair amount of overhead evaluating, and re-evaluating, these constant edge ranges. The shape of both computations is sketched below.
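
To make the asymmetry concrete, here is a minimal, self-contained sketch of the two situations. Everything in it (irange, case_label, the helper names) is a hypothetical stand-in for illustration, not the actual GIMPLE or ranger classes:

{{{
#include <algorithm>
#include <cstdint>
#include <vector>

struct irange { std::int64_t lo, hi; };  // one [lo, hi] interval

using edge_id = int;

struct case_label      // one "case lo ... hi:" label and its target edge
{
  std::int64_t lo, hi;
  edge_id target;
};

// Conditional branch: the range of the controlling expression on an
// outgoing edge is known trivially from the edge itself.
irange range_on_cond_edge (bool true_edge)
{
  return true_edge ? irange{ 1, 1 } : irange{ 0, 0 };
}

// Switch, case edge: GIMPLE gives us no per-edge summary, so we must scan
// every label and union together the ones sharing this edge.
std::vector<irange> range_on_case_edge (const std::vector<case_label> &cases,
                                        edge_id e)
{
  std::vector<irange> r;
  for (const case_label &c : cases)
    if (c.target == e)
      r.push_back (irange{ c.lo, c.hi });
  return r;
}

// Switch, default edge: start from the full range of the index type and
// subtract every explicit case.  Redoing this work per edge, per query,
// is what blows up on very large switches.
std::vector<irange> range_on_default_edge (std::vector<case_label> cases,
                                           irange type_range)
{
  std::sort (cases.begin (), cases.end (),
             [] (const case_label &a, const case_label &b)
             { return a.lo < b.lo; });
  std::vector<irange> r;
  std::int64_t next = type_range.lo;
  for (const case_label &c : cases)
    {
      if (c.lo > next)
        r.push_back (irange{ next, c.lo - 1 });
      next = std::max (next, c.hi + 1);
    }
  if (next <= type_range.hi)
    r.push_back (irange{ next, type_range.hi });
  return r;
}
}}}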

There are ways to address this… but for now we present performance numbers for each of the 5 configurations listed here:

All times are with a release-configured compiler. Out of the 242 files, pretty much across the board in all 4 sets of figures, RVRP was faster in about 90% of the cases and slower in the other 10%, resulting in the following cumulative totals.

Raw RVRP: 22% slower

Calculating ranges outright from the stock branch.

No edge calculation (1): 4.5% slower

Timers adjusted to exclude the switch edge calculation code (i.e. pretend the range is available on the edge, as it is with the TRUE and FALSE edges).

No switch processing (2): 9.5% faster

Do not process switches at all. We spend extra time on switches because we always attempt to calculate ranges very precisely, as if we had infinite precision. There is room for trimming the outcomes here, but we have made no attempt yet.

No dominators (3): 16% faster

Just like EVRP, RVRP currently builds dominators and calls into SCEV at the beginning of each block to look for loop range refinements. The ranger has no need for this to operate, and many passes will not care, so we produce a 4th number for RVRP where we don’t build any of the infrastructure it doesn’t need.

Conditionals (including switches) only: 4% faster

RVRP can also run in a conditional-only mode. Rather than walking the entire IL trying to resolve every range, it simply looks at the last statement of every block and asks whether the branch can be folded. This gets a lot of what VRP gets that affects the CFG, and could be utilized either at lower optimization levels, or as VRP itself if we can push all the other activities VRP does into appropriate optimizations (such as making CCP range-aware). **NOTE**: this mode *still* calculates switch ranges, so it includes that slowdown. A sketch of the mode follows.
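
Below is a rough, self-contained sketch of what conditional-only mode amounts to. The types and the range_oracle callback are hypothetical stand-ins; the real implementation works on basic blocks and GIMPLE statements:

{{{
#include <functional>
#include <optional>
#include <vector>

struct irange { long lo, hi; bool singleton () const { return lo == hi; } };

struct block
{
  bool ends_in_branch;   // true when the last statement is a conditional
};

// 'range_oracle' stands in for the on-demand ranger: given a block, return
// the range of the expression controlling its final branch, if computable.
using range_oracle = std::function<std::optional<irange> (const block &)>;

// Visit only the block-ending branches.  If the condition's range collapses
// to a single value, the branch is foldable.  No other statement in the
// block is ever examined, which is where the time savings come from.
int count_foldable_branches (const std::vector<block> &cfg,
                             const range_oracle &oracle)
{
  int folded = 0;
  for (const block &b : cfg)
    if (b.ends_in_branch)
      if (auto r = oracle (b); r && r->singleton ())
        ++folded;   // condition is a known constant: fold to the taken edge
  return folded;
}
}}}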

These numbers indicate that large switches are the primary cause of the cases where RVRP is slower, and we have various approaches available to address this. Removing the time spent building unnecessary dominators shows a significant improvement as well.

We also have the results for the time spent in the passes we converted:

wrestrict: 19% faster

The dominator walk has been removed since it is no longer needed; the pass simply calls into a ranger to get the range.

wprintf: 95% faster

The dominator build has been removed, as well as the EVRP range walk. This pass benefits greatly from very few ranges being requested… the on-demand approach is a big win here since we only calculate what we actually need in order to answer a small number of questions.

walloca: 1% slower

A touch slower because we are doing more work. The original pass simply queried the global range information; we replaced the calls to SSA_NAME_RANGE_INFO() with calls to the ranger to get accurate, context-sensitive ranges, so this overhead is the cost of calculating those ranges. The walloca pass now handles more things (for instance, we fix gcc.dg/Walloca-6.c), and we catch more issues on a typical bootstrap; for example, we find a possible out-of-bounds VLA in libgomp/task.c. The trade-off is sketched below.
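
The trade-off can be sketched as follows. All the types here are hypothetical stand-ins; the only real name is SSA_NAME_RANGE_INFO(), which is what the old pass used:

{{{
#include <optional>

struct irange { long lo, hi; };

struct ssa_name
{
  std::optional<irange> global_info;  // precomputed, context-insensitive
};

// Old query: essentially free, but ignores any conditions that dominate
// the alloca site, so it can miss provable bounds.
std::optional<irange> global_range (const ssa_name &name)
{
  return name.global_info;            // ~ SSA_NAME_RANGE_INFO (name)
}

// New query: pays to walk the defs and edges relevant to this statement,
// which is the ~1% overhead measured above, but returns a range that is
// valid *at this use*, letting the pass catch more cases.
std::optional<irange> contextual_range (const ssa_name &name,
                                        const void *stmt /* use site */)
{
  // Placeholder: a real implementation triggers the on-demand range walk;
  // here we just fall back to the global information.
  (void) stmt;
  return name.global_info;
}
}}}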

We are also working on integrating the on-demand ranges with the backwards threader. It queries ranges over each potential path to see if a collapsible branch can be found, then uses that information to find better threading opportunities than is currently possible. Results thus far have been very promising. A sketch of the path query follows.
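
A rough sketch of the path query, again with hypothetical stand-in types: walk a candidate path, intersecting the constraint each edge implies on the variable controlling the final branch; a singleton (or empty) result means the branch collapses on that path:

{{{
#include <algorithm>
#include <optional>
#include <vector>

struct irange
{
  long lo, hi;
  bool singleton () const { return lo == hi; }
};

// Intersect two ranges; an empty result means the path is unreachable.
static std::optional<irange> intersect (irange a, irange b)
{
  irange r = { std::max (a.lo, b.lo), std::min (a.hi, b.hi) };
  if (r.lo > r.hi)
    return std::nullopt;
  return r;
}

struct path_edge
{
  std::optional<irange> implies;  // constraint this edge puts on the var
};

// Fold the edge constraints along a candidate path into the variable's
// starting range.  A singleton result means the branch at the end of the
// path collapses on this path (a threading opportunity); an empty result
// means the path itself is unreachable.
std::optional<irange> range_on_path (irange var_range,
                                     const std::vector<path_edge> &path)
{
  std::optional<irange> r = var_range;
  for (const path_edge &e : path)
    {
      if (!r)
        break;                       // already proven unreachable
      if (e.implies)
        r = intersect (*r, *e.implies);
    }
  return r;
}
}}}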

All of these numbers indicate that this approach is viable when compared with the existing one, and is quite frequently faster. It already has iterative back-edge resolution integrated, and it handles cases that the existing VRP and EVRP approaches have difficulty with.
