This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



branch optimizations


So, I have a machine that has many styles of branches; among them, a normal one and a short version.  The short version is sometimes cheaper.  The regular one costs 1 cycle when correctly predicted and 7 when mispredicted, and the cost of a misprediction can be substantially higher depending on what is in the cache.  The short branch form only applies to forward branches over 1 to n instructions; its cost can be higher than a regular branch's, but it is usually faster.  Another benefit of the short version is that it doesn't pollute the branch cache, thus improving the prediction rates of other branches.  To obtain the cost of each style, I need to run some math taking into consideration the expected misprediction rates, the execution counts (predicted or from gcov), and the distance skipped over.  I also determine whether the short branch style applies; if it doesn't, I just say the regular form is cheaper.  A rough sketch of the kind of comparison I mean follows below.
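To make that concrete, here is a minimal sketch of the cost comparison I have in mind; the function names, the cycle numbers, and the misprediction-rate parameter are all invented for illustration, not taken from the patch:

  /* Sketch only.  Expected cost of the regular conditional branch:
     1 cycle when correctly predicted, 7 when mispredicted, weighted
     by an estimated misprediction rate P_MIS in [0, 1].  */
  static double
  regular_branch_cost (double p_mis)
  {
    return (1.0 - p_mis) * 1.0 + p_mis * 7.0;
  }

  /* Sketch only.  The short form skips forward over DIST insns;
     assume its cost grows with the distance skipped, and that it
     does not occupy a branch-cache slot.  Returns a negative value
     when the form does not apply, in which case the caller falls
     back to the regular branch.  */
  static double
  short_branch_cost (int dist, int max_dist, double per_insn_cost)
  {
    if (dist < 1 || dist > max_dist)
      return -1.0;
    return 1.0 + dist * per_insn_cost;
  }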

We can see evidence of the lack of completeness in the current code:

  if ((flag_reorder_blocks || flag_reorder_blocks_and_partition)
      /* Don't reorder blocks when optimizing for size because extra jump insns may
         be created; also barrier may create extra padding.

         More correctly we should have a block reordering mode that tried to
         minimize the combined size of all the jumps.  This would more or less
         automatically remove extra jumps, but would also try to use more short
         jumps instead of long jumps.  */
      && optimize_function_for_speed_p (cfun))
    {
      reorder_basic_blocks ();

Essentially, I think this optimization could help other ports if they have a well-balanced implementation, are limited by branch prediction, or have a non-zero branch cost.  Chips that have lots of resources for prediction and/or can follow multiple paths aren't likely to see any benefit.

So, here is the patch in progress.  It should be close, but the outstanding questions are: do any other ports benefit from it?  If not, I'd probably say it's not suitable for the tree in this form.  Are there other, better ways of doing this?


If you prefer something concrete, consider this:

	jump equal, L5  predicted equal
	nop
L5:
	nop2
	ret

versus:

	jump not equal, L6  predicted equal
L7:
	nop2
	ret

L6:	nop
	b L7

So, if an unconditional branch costs 10 cycles and a conditional branch 12 cycles, and T is the number of times the branch executes with the condition true and F the number of times with it false, the branch costs would be (T+F)*12 in the first case and (T+F)*12 + F*10 in the second.  The latter is _always_ more expensive if F is non-zero.  Further, this holds for any such non-zero costs so long as F is at least one, ignoring cache issues.  Arguably, we could examine the cost of a branch and always perform this optimization if the cost is high enough, even without a short branch form.  That isn't included in the patch.
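Plugging made-up counts into that formula (a sketch; the 12- and 10-cycle figures are just the ones assumed above):

  /* Cost of the first layout: the conditional branch executes on
     every pass, whether the condition is true or false.  */
  static long
  first_layout_cost (long t, long f)
  {
    return (t + f) * 12;
  }

  /* Cost of the second layout: the same conditional branch, plus
     the extra unconditional "b L7" on each false pass.  */
  static long
  second_layout_cost (long t, long f)
  {
    return (t + f) * 12 + f * 10;
  }

  /* E.g. t = 900, f = 100: 12000 vs. 13000 cycles; the second
     layout loses by exactly f * 10 whenever f > 0.  */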

A better patch would have the backend figure out misprediction rates, do the math, and decide based purely on the costs of all the branches.
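Something along these lines, where the function name and parameters are invented for the sake of discussion rather than drawn from any existing target interface:

  #include <stdbool.h>

  /* Sketch of such a backend decision.  Given an execution count,
     an estimated misprediction rate for the regular form, and the
     per-execution cycle costs, prefer the short form when its
     expected total cost is lower.  */
  static bool
  prefer_short_branch_p (long count, double p_mis,
                         int predicted_cost, int mispredicted_cost,
                         int short_cost)
  {
    double regular = count * ((1.0 - p_mis) * predicted_cost
                              + p_mis * mispredicted_cost);
    return count * (double) short_cost < regular;
  }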

Thoughts?

Attachment: jump.diffs.txt

