15187 – Inefficient if optimization with -O2 -ffast-math

Bug 15187 - Inefficient if optimization with -O2 -ffast-math

Summary: Inefficient if optimization with -O2 -ffast-math

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization (show other bugs)
Version:	4.0.0

Importance:	P2 enhancement
Target Milestone:	4.0.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2004-04-28 06:47 UTC by Uroš Bizjak
Modified:	2006-03-29 14:08 UTC (History)
CC List:	2 users (show)

See Also:
Host:	i686-pc-linux-gnu
Target:	i686-pc-linux-gnu
Build:	i686-pc-linux-gnu
Known to work:
Known to fail:
Last reconfirmed:	2004-04-28 11:41:56

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Uroš Bizjak 2004-04-28 06:47:20 UTC

This testcase compiles with '-O2 -ffast-math' into extremely inefficient asm code:
double test(double x) {
        if (x > 0.0)
                return cos(x);
        else
                return sin(x);
}

--cut here--
test:
        pushl   %ebp
        movl    %esp, %ebp
        fldl    8(%ebp)
        fcoml   .LC1
        fnstsw  %ax
        fld     %st(0)
        fcos
        sahf
        ja      .L6
        fstp    %st(0)
        fsin
        jmp     .L1
        .p2align 4,,7
.L6:
        fstp    %st(1)
.L1:
        popl    %ebp
        ret
--cut here--

It will _always_ call fcos instruction, and - depending on input - overwrite
output of fcos with output of fsin instruction. This problem is not fsin/fcos
specific.

The problem is in ifcvt.c, find_if_case_1() function. Around line 2889, there is
a condition:
  /* THEN is small.  */
  if (count_bb_insns (then_bb) > BRANCH_COST)
    return FALSE;

This condition would prevent moving 'else' BB before 'if', if then_bb is not
small. 'Small' means a couple of instructions with default BRANCH_COST (= 1).

However, if an instruction is UNSPEC_*, this instruction can last hundred of
cycles (as it is case with fsin or fcos), but it is still _one_ RTL instruction.
So the case above is not triggered, and fcos is moved before 'if'. 

The situation is even worser with '-O2 -ffast-math -march=i686'. fsin and fcos
are called every time...

test:
        pushl   %ebp
        movl    %esp, %ebp
        fldl    8(%ebp)
        fldz
        fld     %st(1)
        fld     %st(2)
        fxch    %st(1)
        fcos
        fxch    %st(3)
        popl    %ebp
        fcomip  %st(2), %st
        fstp    %st(1)
        fsin
        fcmovnbe        %st(1), %st
        fstp    %st(1)
        ret

Comment 1 Zack Weinberg 2004-04-28 07:44:07 UTC

Subject: Re:  New: Inefficient if optimization with
 -O2 -ffast-math

 
Ideal code here would be to call fsincos and then pick one of the
outputs, yes?

zw

Comment 2 Uroš Bizjak 2004-04-28 09:52:21 UTC

Yes, fsincos would be the best solution in particular case, when both sin() and
cos() are involved. However with more general testcase, for example:

double test(double x) {
      if (x > 0.0)
        return x + 1.0;
      else
        return sqrt(x); // or sin(x), cos(x), etc...
}

square root is _always_ called, even if x > 0.0. If we return sin(x) [which
could not be combine with sqrt in any way] instead of (x + 1.0), both sin() and
sqrt() will be called when x > 0.0. In this case, -march=i686 (which has cmov
instruction) will produce code, which will _always_ call sin() and sqrt() and
then cmov will pick one of the results. When fsqrt x87 instruction takes 70
cycles and fsin x87 insn takes 65-100 cycles to execute, this is not a win in
any case.

Comment 3 Andrew Pinski 2004-04-28 11:41:54 UTC

Confirmed.  I think this has to do with the branch probability of taking it being 50% so it is merging 
both of them together so if you use profiled feedback that GCC has it will work correctly if it is not 
taken 50% of the time.

Comment 4 Uroš Bizjak 2004-04-28 12:04:19 UTC

Andrew: I don't think that profiled feedback will help. fcos will be computed
every time, but in 50% that result will be thrown away, fsin will be calculated
and its result will be used. With -mbranch-cost=0, generated code looks a lot
better, exactly what would one expect (and with only one conditional jump):

--cut here--
test:
      pushl     %ebp
      movl      %esp, %ebp
      fldl      8(%ebp)
      fcoml     .LC1
      fnstsw    %ax
      sahf
      jbe       .L2
      fcos
      popl      %ebp
      ret
      .p2align 4,,7
.L2:
      fsin
      popl      %ebp
      ret
--cut here--

-mbranch-cost=0 does not help with -march=i686, resulting code is the same as
without -mbranch-cost.

Comment 5 Andrew Pinski 2004-04-28 12:52:13 UTC

Note branch cost and branch probabilities are two seperate things.  I tried to find what flag is causing 
this but I could not I would need to look into the RTL dups to see what is going on.

Comment 6 Falk Hueffner 2004-04-28 13:35:20 UTC

It seems to me that cases where this is a bad decision are really rare. Basically
it is only bad if both branches use the same functional unit, which additionally
isn't fully pipelined. This seems like a rare situation, and in the future CPUs
will have even more pipelined units, which will usually not have to be
represented with UNSPEC, and even more expensive branches. IMHO we should leave
it just as it is and maybe try to mitigate it with target specific hacks if
it is really an issue in real programs.

Comment 7 roger 2004-07-07 13:34:55 UTC

There's a relevant discussion of these issues posted to gcc-patches at:
http://gcc.gnu.org/ml/gcc-patches/2004-07/msg00597.html

Comment 8 GCC Commits 2004-09-17 05:32:39 UTC

Subject: Bug 15187

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	uros@gcc.gnu.org	2004-09-17 05:32:37

Modified files:
	gcc            : ChangeLog ifcvt.c 

Log message:
	PR rtl-optimization/15187
	* ifcvt.c (noce_try_cmove_arith): Exit early if total
	insn_rtx_cost of both branches > BRANCH_COST

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.5487&r2=2.5488
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ifcvt.c.diff?cvsroot=gcc&r1=1.164&r2=1.165

Comment 9 Uroš Bizjak 2004-09-17 05:41:25 UTC

Fixed by:
http://gcc.gnu.org/ml/gcc-patches/2004-07/msg00654.html
http://gcc.gnu.org/ml/gcc-patches/2004-07/msg01107.html

(TERGET_CMOV case by):
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01667.html

Comment 10 uros 2004-09-17 05:47:05 UTC

I forgot to mark this bug as RESOLVED FIXED in previous comment.

Comment 11 Pawel Sikora 2006-03-28 10:34:06 UTC

it looks like 4.1.1 and 4.2.0 still produce unoptimal code.

#include <math.h>
double test(double x)
{
        if (x > 0.0)
                return cos(x);
        else
                return sin(x);
}

[ 4.1.1-0.20060322 / rev.112277 /x86-64 ]

$ gcc bug.c -O2 -ffast-math -m32 -march=i686 -S

test:   pushl   %ebp
        movl    %esp, %ebp
        fldl    8(%ebp)
        fldz
        fcomip  %st(1), %st
        jae     .L2
        popl    %ebp
        fcos
        ret

.L2:    popl    %ebp
        fsin
        ret

[ 4.2.0-20060323 / rev.112317 / x86-64 ]

$ ./xgcc -B. bug.c -O2 -ffast-math -m32 -march=i686 -S
bug.c: In function &#8216;test&#8217;:
bug.c:7: internal compiler error: in bsi_last, at tree-flow-inline.h:760

Comment 12 Uroš Bizjak 2006-03-29 14:08:28 UTC

(In reply to comment #11)
> it looks like 4.1.1 and 4.2.0 still produce unoptimal code.

> test:   pushl   %ebp
>         movl    %esp, %ebp
>         fldl    8(%ebp)
>         fldz
>         fcomip  %st(1), %st
>         jae     .L2
>         popl    %ebp
>         fcos
>         ret
> 
> .L2:    popl    %ebp
>         fsin
>         ret

No, this code is optimal. Please compare the code above to the code in description, where fcos is calculated even if x <= 0.0