Bug 66520 - bad code generated code for branches with single &
Summary: bad code generated code for branches with single &
Status: RESOLVED WORKSFORME
Alias: None
Product: gcc
Classification: Unclassified
Component: c++ (show other bugs)
Version: unknown
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-06-12 11:52 UTC by Fisnik
Modified: 2015-06-14 12:26 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Fisnik 2015-06-12 11:52:20 UTC
consider the following code snippet (c++):

void ampamp(int x, int y) {
  if (x < 3 && y > 10 ) 
    printf("%d%d", x, y);
}

void amp(int x, int y) {
  if ((x < 3) & (y > 10) ) 
    printf("%d%d", x, y);
}


the assembly code generated by g++ (all versions I tested with  optimization flag `-O3'), is not optimal (see the link on the bottom of this message).
Basically, for both methods, the generated assembly code is identical. For the method "amp" (with single `&') there should be generated only one jump, as branch misprediction penalty is very high otherwise!

As a side note: the code by intel's  compiler (ICC) is however generating optimal code for such scenarios, at least for versions icc13, and icc15 that I've tested.

See: http://goo.gl/oiTPX5
Comment 1 Eric Botcazou 2015-06-13 10:13:17 UTC
> consider the following code snippet (c++):
> 
> void ampamp(int x, int y) {
>   if (x < 3 && y > 10 ) 
>     printf("%d%d", x, y);
> }
> 
> void amp(int x, int y) {
>   if ((x < 3) & (y > 10) ) 
>     printf("%d%d", x, y);
> }
> 
> 
> the assembly code generated by g++ (all versions I tested with  optimization
> flag `-O3'), is not optimal (see the link on the bottom of this message).
> Basically, for both methods, the generated assembly code is identical.

An optimizing compiler should generate the same code in both cases, either with or without branches, whatever form is deemed the fastest for the target, since
there are no side-effects involved in the evaluation of the comparisons.

> As a side note: the code by intel's  compiler (ICC) is however generating
> optimal code for such scenarios, at least for versions icc13, and icc15 that
> I've tested.

One of the forms is necessarily not optimal (unless they are equivalent).
Comment 2 Fisnik 2015-06-13 23:21:01 UTC
(In reply to Eric Botcazou from comment #1)
> > consider the following code snippet (c++):
> > 
> > void ampamp(int x, int y) {
> >   if (x < 3 && y > 10 ) 
> >     printf("%d%d", x, y);
> > }
> > 
> > void amp(int x, int y) {
> >   if ((x < 3) & (y > 10) ) 
> >     printf("%d%d", x, y);
> > }
> > 
> > 
> > the assembly code generated by g++ (all versions I tested with  optimization
> > flag `-O3'), is not optimal (see the link on the bottom of this message).
> > Basically, for both methods, the generated assembly code is identical.
> 
> An optimizing compiler should generate the same code in both cases, either
> with or without branches, whatever form is deemed the fastest for the
> target, since
> there are no side-effects involved in the evaluation of the comparisons.
> 
> > As a side note: the code by intel's  compiler (ICC) is however generating
> > optimal code for such scenarios, at least for versions icc13, and icc15 that
> > I've tested.
> 
> One of the forms is necessarily not optimal (unless they are equivalent).

I do not quite agree. If the first operand is "always" true, than the version with double ampersand `&&' is faster. But if probability of the first operand being true is around 0.5, then the version with single `&' is faster, given that the compiler generates the correct assembly, i.e., a single jump. The reason for this is the branch misprediction penalty, which is severe at probabilities around 0.5.

I already did experiments with both methods `&', and `&&', and varying probabilities for predicates being true.
The difference (in speed) is around factor 2.3, as the selectivity of the first predicate approaches 0.5, (i.e., the probability of the first operand being true -> 0.5).
That is, the method with single ampersand with beat clearly the method with double ampersand.

The predicates I used were simple, but for more expensive predicates the difference is even worse.
If still not convinced, read the paper: "Conjunctive selection conditions in main memory" by K.Ross.

There should be at least some flag available, such that we can set such a flag and have the compiler generate only a single jump for the method with single ampersand.



[reply] [-] Comment 4
Comment 3 Eric Botcazou 2015-06-14 07:22:55 UTC
> There should be at least some flag available, such that we can set such a
> flag and have the compiler generate only a single jump for the method with
> single ampersand.

You missed the point though, the compiler should generate the same code in both cases.  Whether or not it's the branchy form is another debate, which cannot be settled with toy examples like this one but by using real-life benchmarks.
Comment 4 Fisnik 2015-06-14 10:44:16 UTC
(In reply to Eric Botcazou from comment #3)
> > There should be at least some flag available, such that we can set such a
> > flag and have the compiler generate only a single jump for the method with
> > single ampersand.
> 
> You missed the point though, the compiler should generate the same code in
> both cases.  Whether or not it's the branchy form is another debate, which
> cannot be settled with toy examples like this one but by using real-life
> benchmarks.

Compiler should not generate the same code, and should listen to the developer, when she/he connects predicates with single `&'. I already wrote that I did benchmarks, and also mentioned a paper from a top computer science conference (which I see you didn't even bother to read it) which has all the evidence required.
The branch misprediction penalty for the assembly code with two jumps is simply too severe when the selectivity of a predicate is around 0.5. 
And I also mentioned it that Intel's compiler is catching this very well. The example was given just to point the problem, I cannot possibly write you here 10k lines of code of my project.
Comment 5 Eric Botcazou 2015-06-14 10:57:04 UTC
> Compiler should not generate the same code, and should listen to the
> developer, when she/he connects predicates with single `&'. I already wrote
> that I did benchmarks, and also mentioned a paper from a top computer
> science conference (which I see you didn't even bother to read it) which has
> all the evidence required.

You're still missing the point!  If the non-branchy form is the fastest, then the compiler should generate it also with the short-circuit form &&.

> And I also mentioned it that Intel's compiler is catching this very well.

Nope, the Intel compiler is inconsistent here (unlike GCC and Clang).
Comment 6 Andreas Schwab 2015-06-14 10:58:15 UTC
If a and b are side-effect-free, pure-boolean expressions then `a && b' and `a & b' are completely equivalent and there is no reason to generate different code for them.
Comment 7 Fisnik 2015-06-14 11:26:06 UTC
(In reply to Eric Botcazou from comment #5)
> > Compiler should not generate the same code, and should listen to the
> > developer, when she/he connects predicates with single `&'. I already wrote
> > that I did benchmarks, and also mentioned a paper from a top computer
> > science conference (which I see you didn't even bother to read it) which has
> > all the evidence required.
> 
> You're still missing the point!  If the non-branchy form is the fastest,
> then the compiler should generate it also with the short-circuit form &&.
> 
> > And I also mentioned it that Intel's compiler is catching this very well.
> 
> Nope, the Intel compiler is inconsistent here (unlike GCC and Clang).

As I said, I would leave that to the developer. If I connect predicates with &, the compiler should generate the non-branchy code. So this issue is definatively an optimization bug.
Comment 8 Fisnik 2015-06-14 11:34:41 UTC
(In reply to Andreas Schwab from comment #6)
> If a and b are side-effect-free, pure-boolean expressions then `a && b' and
> `a & b' are completely equivalent and there is no reason to generate
> different code for them.

I said it is an optimization bug. 
The branchy code is very expensive when the first operand has probability of around 0.5 of being true, therefore it ruins the branch prediction. Now you can imagine a much more complex scenario, such as  e.g. (A & B & ... & Z), and if all predicates have probabilities of 0.5, then I'll end here with very inefficient code (the way the g++ is generating assembly). This does not happen for e.g. with icc.

To this end, the compiler should respect the code written by the developer. I'm already having trouble with g++, as for the code with single &, it is generating inefficient code. Or I have to remove completely the optimization flag '-O3', but then I have a very slow code, and your comments are not helping me at all solve the problem!
Comment 9 Jonathan Wakely 2015-06-14 12:26:27 UTC
(In reply to Fisnik from comment #8)
> To this end, the compiler should respect the code written by the developer.

As far as the C++ standard is concerned the built-in && and & operators for type bool are equivalent if evaluating the operands has no side-effects. C++ is not a high-level assembly language.

> I'm already having trouble with g++, as for the code with single &, it is
> generating inefficient code. Or I have to remove completely the optimization
> flag '-O3', but then I have a very slow code

Have you tried just using -O3 -fno-xxx for the relevant pass?