This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PING] Patch to implement "Switch statement fdo reordering"


Hi Mark.

I completed all the changes discussed below. Attached is a new
version of the patch.
Regression tests are under way now.

The answer to the last question is this:
In the second-stage compilation, the size of the histogram is read
from the gcda file and compared to the current value in the second
compilation. If they differ, an error is issued. This is part of the
profile feedback infrastructure; that is how all value histograms
are processed.
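
As a rough sketch of that check (hypothetical names; the real comparison lives in the profile-reading code):

  /* On the second compilation, the histogram size recorded in the
     .gcda file must match what this compilation expects; otherwise
     the histogram read would be malformed, so error out.  */
  if (gcda_histogram_size != expected_histogram_size)
    error ("corrupted profile info: switch histogram size mismatch");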

Regards
Edmar


On Thu, Jul 10, 2008 at 3:04 PM, Edmar Wienskoski <edmarwjr@gmail.com> wrote:
>
> Hi Mark.
>
> Sorry for the delay, we had an email server glitch right on the day you posted this, and I missed the
> original list posting, as well as the direct copy you sent to me.
>
> Thanks for the comments, I will address each question below.
>
> Regards
> Edmar
>
>
> Edmar Wienskoski-RA8797 wrote:
>
> http://gcc.gnu.org/ml/gcc-patches/2008-04/msg02197.html
>
> This implements the optimization presented at the GCC Summit 2006:
> http://www.gccsummit.org/2006/view_abstract.php?content_key=18
>
>
>
>
> For those following along at home, the optimization here is to use profiling feedback to balance binary trees of if-statements generated for switch statements.
>
> If I understand correctly, the trees are balanced according to the frequency of the values in the switch condition; the goal is to get the common values processed quickly. So, for example, if the value of the switch condition is "7" 75% of the time, then the first if will be:
>
>   if (val == 7) {
>   } else {
>      ..
>   }
>
> so that we quickly handle the common cases.
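>
> As a fuller sketch (hypothetical frequencies; the shape of the emitted chain, not GCC's actual output):
>
>   /* Profile says: val == 7 seen 75% of the time, val == 3 seen 20%,
>      everything else rare.  The balanced form tests in frequency
>      order, so the expected number of compares tracks the profile
>      rather than the textual case order.  */
>   int classify (int val)
>   {
>     if (val == 7)         /* 75%: one compare on the hot path */
>       return 0;
>     else if (val == 3)    /* 20%: two compares                */
>       return 1;
>     else                  /* rare values handled last         */
>       return 2;
>   }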
>
> I have some questions:
>
> * What is the performance improvement from this patch?
>
> All optimization improvements should come with performance numbers. These don't have to be comprehensive, or measured on a standard benchmark, but there should be some evidence from something like real-world code showing what the impact is. I'm not doubting the benefit, but optimization is a quantitative game, so we should see some quantitative information.
>
> Of course.
> In the paper we showed several results. Here is one of them:
> On a G5, the perl benchmark from spec2k improved from 233.61 sec to 210.12 sec, a speedup
> of 11.2% on the whole test. That was using the provided training set.
>
> Because the training set in this benchmark is not quite representative of the reference set, I also ran the
> benchmark using the reference set as the training set, just to show the potential of this optimization on real
> applications with a good training set. The time for that was 197.04 sec, an 18.5% speedup.
>
> In the paper I also showed results for spec95 m88ksim, 7450, and e500v2.
> All other spec2k benchmarks showed no difference.
>
>
> * Why is this target-specific?
>
> There's a new hook here that allows a backend to say that it's preferable to generate the balanced conditional form rather than a jump table. (Small point: the patch contains code that tests if the hook is NULL. The preferred style is to have the default value of the hook be a function which always returns true/false so that users don't have to remember to test for NULL. In Ian's brave new world, the default values are in the base class; the target machine is the derived class overriding the hook. And, we don't like pure virtual functions.)
>
> As you pointed out below, under the right conditions a jump table might win, so the hook
> allows the target to calculate the cost and decide whether or not to use the balanced tree.
>
> I provided a very simple "profitable_case_rebalance_p" implementation for the G5;
> in this case it seems safe enough.
> But what I really wanted was to not provide any at all, and let the target maintainers
> choose the cost function that they think fits best. Below I have more on this.
>
> (PS: I put your "small point", in my TODO list.)
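>
> A minimal sketch of that default style (the hook name comes from this patch; the signature here is simplified and hypothetical):
>
>   /* Default used when a target does not override the hook: never
>      request rebalancing, so callers can invoke it through targetm
>      unconditionally instead of checking for NULL.  */
>   static bool
>   default_profitable_case_rebalance_p (void)
>   {
>     return false;
>   }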
>
> It looks like you're inserting the new logic before the:
>
> /* If range of values is much bigger than number of values,
>    make a sequence of conditional branches instead of a dispatch.
>    If the switch-index is a constant, do it this way
>    because we can optimize it. */
>
> and before the version that uses a jump table. If we have profiling data, that seems clearly better than using the assumption that all case nodes are equally likely, which is what we do by default. But, how do we know when it's better than the jump table?
>
> That is related to the answer above.
> Final results depend on the case frequency distribution, the quality of the training set, architectural differences between table lookup and conditional branch, etc.
> The target hook has access to the actual measured frequencies and is aware of the relevant architecture details, so it can make the final decision.
>
> It seems to me that this is something we should be able to guess based on some information from the back end about branch costs, indirect jump costs, etc. It doesn't seem like something that's absolutely true for some CPUs, and not for others. Even for a given CPU, if the histogram says all values are equally likely, and there are 256 possible values, a jump table might win -- even though on the same CPU, if there is 1 value that is used 99.99% of the time, then the balanced tree would win.
>
> 100% right.
>
> So, I think the logic should be tuned to measure the expected-case outcome in both cases, and pick the one that's likely to win. If we need a new hook, I think the hook would be something like "is it a win to take N branches instead of indirecting through a jump table?".
>
> The provided target hook can do this.
> (I wish the FDO optimization could be considered separately from the various target-specific cost functions...)
> Well, this is what I have sketched out:
>
> There are 2 options here:
> 1)
> /*
>   We have to decide between a tree of compare/branch pairs and a table
>   lookup.
>
>   We only care for the relative costs. So we set the cost of a
>   compare/branch to 1.
>
>   Note that the table lookup scheme consists of specific code (5
>   instructions in the ppc case) plus 1 or 2 compare/branches
>   (depending on how the lower bound of the table was handled).
>
>   The input to the cost function is a histogram that contains the
>   frequencies for all case statements. At this point the final
>   re-balanced tree branch hasn't been computed yet.
>
>   So, the following is an approximation of the actual costs:
>
>   Sort the histogram frequencies in descending order, and use the
>   depth of an ideal tree as the weight; for the tree to win, the
>   total weighted sum must be less than or equal to the full
>   frequency times the table-jump cost. Example:
>
>   50% 10% 10% 10% 5% 5% 5% 2% 2% 1%
>   50 * 1 + (10 + 10) * 2 + (10 + 5 + 5 + 5) * 3 + (2 + 2 + 1) * 4
>   => 50 + 40 + 75 + 20 => 185
>
>   Let's assume the table lookup cost is 3, and the lower bound
>   test was optimized away.
>   (3 + 1) * 100 => 400
>
>   In this example, the compare/branch tree is a clear win.
> */
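>
> As a C sketch of option (1) (hypothetical names; frequencies in percent, depths assigned from an ideal binary tree after a descending sort):
>
>   #include <stdbool.h>
>   #include <stdlib.h>
>
>   /* Descending order for qsort.  */
>   static int
>   cmp_desc (const void *a, const void *b)
>   {
>     return *(const int *) b - *(const int *) a;
>   }
>
>   /* Weight each case frequency by its depth in an ideal binary
>      tree and compare against the jump table: a compare/branch
>      costs 1; the table costs TABLE_COST plus one residual bounds
>      check, paid on every execution.  */
>   static bool
>   tree_cheaper_than_table_p (int *freq, int n, int table_cost)
>   {
>     long tree_cost = 0, total = 0;
>     int depth = 1, slots = 1, used = 0;
>
>     qsort (freq, n, sizeof *freq, cmp_desc);
>     for (int i = 0; i < n; i++)
>       {
>         tree_cost += (long) freq[i] * depth;
>         total += freq[i];
>         if (++used == slots)   /* this tree level is full */
>           {
>             slots *= 2;
>             used = 0;
>             depth++;
>           }
>       }
>     return tree_cost <= (long) (table_cost + 1) * total;
>   }
>
> On the histogram above this yields 185 <= 400, a win for the tree; with 256 equally likely values the average depth is about 7 compares against a table cost of 4, so the table wins, matching the observation earlier in this thread.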
>
> 2)
> After the tree is computed, a BFS of the tree is used to obtain the depth weights; the rest is as above.
> This is more precise, but I don't think it is worth the trouble because of the mess it would cause
> in the implementation:
> expand_case in stmt.c is structured like this:
>
> if (situation 1)
>   ...
> else if (our situation && tgt_hook ())
>   (A) /* among other things, build the tree */
>   evaluate the cost using a BFS; if not good, must return control to L1
>   (B)
> L1: else if (situation 3)
>   ...
>
> So, (A) has to be rewritten as a function that undoes its side effects (which will probably be re-done at (B))
> and returns an integer with the BFS cost evaluated as above.
> After that, the condition test would be rewritten as
> else if (our situation && tgt_hook (f_A ()))
>
> Can you think of a more structured way to implement this?
> If you don't have strong feelings about it, I would go with (1) above.
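>
> (One way to keep the else-if chain intact, sketched with hypothetical names: let the condition helper build the tree into a scratch structure, so there are no side effects to undo and a false return simply falls through to L1:)
>
>   /* Build once with no side effects on the expansion state, cost
>      it with a BFS, and cache it for the body at (A)+(B) to emit.  */
>   static struct case_tree *cached_tree;
>
>   static bool
>   want_balanced_tree_p (struct case_list *cases)
>   {
>     cached_tree = build_balanced_tree (cases);   /* scratch only */
>     if (tgt_hook (bfs_cost (cached_tree)))
>       return true;                               /* emit at (A)  */
>     free_case_tree (cached_tree);
>     cached_tree = NULL;
>     return false;                                /* falls to L1  */
>   }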
>
>
> Please, let me know your opinion on all this...
>
>
> * Code is cut-and-pasted from the standard decision-tree implementation.
>
> The version in the current compiler assumes equal probability for all nodes. The new optimization uses profiling feedback to balance things given the feedback data. But, the other logic is identical, so it should be factored out. The bits like:
>
> +	  /* If the index is a short or char that we do not have
> +	     an insn to handle comparisons directly, convert it to
> +	     a full integer now, rather than letting each comparison
> +	     generate the conversion.  */
>
> should not be cut-and-pasted.
>
> It seems like an obvious abstraction, and an oversight on my part.
> I will fix that.
>
> * I suspect you can simplify the handling of case ranges and single-case nodes. After all, a single-case node is just a case range with exactly one element.
>
> I figure it can be done, but I think it will make the source code harder to read.
> I will try it, and get back to you on this.
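>
> (A sketch of the idea with a hypothetical node layout: a single-value node is just a range with low == high, so one test covers both:)
>
>   struct case_node
>   {
>     long long low, high;   /* high == low for a single-value case */
>     int label;
>   };
>
>   /* VAL matches N iff low <= val <= high; for a single-value node
>      this degenerates to val == low.  */
>   static int
>   node_matches_p (const struct case_node *n, long long val)
>   {
>     return n->low <= val && val <= n->high;
>   }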
>
> * I suspect that balance_case_nodes should just become a call to fdo_balance_case_nodes, with a parameter that says all the nodes have equal probability. In other words, the current if-generation code should just be a special case of the profile-directed feedback code.
>
> Ok. I will fix that.
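>
> (Roughly, with the histogram parameter hypothetical:)
>
>   /* The non-FDO path becomes the uniform special case: a null
>      histogram means "treat every case as equally likely".  */
>   static void
>   balance_case_nodes (case_node_ptr *head, case_node_ptr parent)
>   {
>     fdo_balance_case_nodes (head, parent, /*histogram=*/NULL);
>   }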
>
> * If I understand correctly, we're depending on the value of PARAM_SWITCH_HISTOGRAM_SIZE being the same at both the initial compilation and the profiled compilation. Is there any way to detect that the values are different across runs and issue an error? Otherwise, I think we're at risk of crashing if the user lies to us about the histogram size.
>
> The parameter is used only in the instrumentation. The second compilation relies on whatever is loaded from the gcda file.
> Actually, now that you mention it, I have some doubts; I will double-check.
>
> * The documentation needs improvement. I'll be happy to help with this, once the rest of the patch is all set.
>
> Ok. Thanks !!
>
> Thanks,
>
> --
> Mark Mitchell
> CodeSourcery
> mark@codesourcery.com
> (650) 331-3385 x713

Attachment: patch_fdo-case-reorder_20080715
Description: Binary data

