This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: Optimization is removing wanted branches


> Date: Fri, 31 Aug 2001 17:42:29 +0200 (CEST)
> From: Gabriel Paubert <paubert@iram.es>
> cc: "Smith, Stephen P  (AZ75)" <smith_stephen@htc.honeywell.com>,
>         gcc@gcc.gnu.org
> X-OriginalArrivalTime: 03 Sep 2001 11:04:18.0378 (UTC) FILETIME=[2D4452A0:01C13468]
> 
> On 30 Aug 2001, Geoff Keating wrote:
> 
> > "Smith, Stephen P  (AZ75)" <smith_stephen@htc.honeywell.com> writes:
> > 
> > > Several of us are using gcc 2.95.2-6 (from the cygnus distribution) built as
> > > a cross compiler for the powerpc.
> > > 
> > > The other day one of us noticed that the following code is emitted from cc1
> > > without a branch when optimization is turned on.  Is there a way to use
> > > optimization (-O1, -O2, -O3) and have the assembly branch instuction in the
> > > ouput?
> > 
> > Why?
> > 
> > It would be slower, at least in general.
> 
> At least in the -Os case, this is wrong. But properly implementing -Os on
> PPC is a big project. 

Yup.

> And it would not be necessarily slower (or at least measureably slower for
> infrequently executed code). As soon as you have to fetch the code from
> memory or even L2 cache, bigger code is often a loss, besides the fact
> that using the carry stalls the pipelines while branch prediction will
> help executing the correct code path... right at least 50% of the time.

This very much depends on precisely how the code works.  For instance,
if the branch can be predicted successfully with only a little over
50% probability, then usually a branch-free sequence will be much
faster---a mispredicted branch will cost several times more than a
correctly predicted one.  By comparison, if branch prediction is 99%
accurate, then a not-taken branch will always be cheaper than a
branch-free sequence (and hopefully GCC would reorder code so that the
popular path is the not-taken-branch one).

The cost of larger code is somewhat variable.  The actual metric is
'cache lines of code fetched', which is not necessarily related to
'number of instructions executed' or 'number of instructions in
program'.  A code sequence like

	   beq+ 0f
	   addi r0,r0,5
0:	   ...

may, depending on the processor, cause two cache lines to be fetched
if the prediction is correct, or three if not correct---one for the
beq, one for the addi which is then discarded, and one for the
instruction at label 0.  Of course, assuming all this code actually
lives in the same physical cache line, one would hope that the last
two fetches come from L1 cache!

There was another comment about using carry instructions for these
eliminations; yes, that's probably a bad idea, it's likely that a
sequence using cntlzw or shifts would be faster.  That code dates from
the time when carry instructions didn't cause a pipeline flush.

> In addition, in a bigger function, the test might be moved far enough
> ahead of the branch to greatly reduce the loss due to mispredictions.

This can help a lot, yes.

-- 
- Geoffrey Keating <geoffk@geoffk.org>


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]