This testcase compiles with '-O2 -ffast-math' into extremely inefficient asm code: double test(double x) { if (x > 0.0) return cos(x); else return sin(x); } --cut here-- test: pushl %ebp movl %esp, %ebp fldl 8(%ebp) fcoml .LC1 fnstsw %ax fld %st(0) fcos sahf ja .L6 fstp %st(0) fsin jmp .L1 .p2align 4,,7 .L6: fstp %st(1) .L1: popl %ebp ret --cut here-- It will _always_ call fcos instruction, and - depending on input - overwrite output of fcos with output of fsin instruction. This problem is not fsin/fcos specific. The problem is in ifcvt.c, find_if_case_1() function. Around line 2889, there is a condition: /* THEN is small. */ if (count_bb_insns (then_bb) > BRANCH_COST) return FALSE; This condition would prevent moving 'else' BB before 'if', if then_bb is not small. 'Small' means a couple of instructions with default BRANCH_COST (= 1). However, if an instruction is UNSPEC_*, this instruction can last hundred of cycles (as it is case with fsin or fcos), but it is still _one_ RTL instruction. So the case above is not triggered, and fcos is moved before 'if'. The situation is even worser with '-O2 -ffast-math -march=i686'. fsin and fcos are called every time... test: pushl %ebp movl %esp, %ebp fldl 8(%ebp) fldz fld %st(1) fld %st(2) fxch %st(1) fcos fxch %st(3) popl %ebp fcomip %st(2), %st fstp %st(1) fsin fcmovnbe %st(1), %st fstp %st(1) ret
Subject: Re: New: Inefficient if optimization with -O2 -ffast-math Ideal code here would be to call fsincos and then pick one of the outputs, yes? zw
Yes, fsincos would be the best solution in particular case, when both sin() and cos() are involved. However with more general testcase, for example: double test(double x) { if (x > 0.0) return x + 1.0; else return sqrt(x); // or sin(x), cos(x), etc... } square root is _always_ called, even if x > 0.0. If we return sin(x) [which could not be combine with sqrt in any way] instead of (x + 1.0), both sin() and sqrt() will be called when x > 0.0. In this case, -march=i686 (which has cmov instruction) will produce code, which will _always_ call sin() and sqrt() and then cmov will pick one of the results. When fsqrt x87 instruction takes 70 cycles and fsin x87 insn takes 65-100 cycles to execute, this is not a win in any case.
Confirmed. I think this has to do with the branch probability of taking it being 50% so it is merging both of them together so if you use profiled feedback that GCC has it will work correctly if it is not taken 50% of the time.
Andrew: I don't think that profiled feedback will help. fcos will be computed every time, but in 50% that result will be thrown away, fsin will be calculated and its result will be used. With -mbranch-cost=0, generated code looks a lot better, exactly what would one expect (and with only one conditional jump): --cut here-- test: pushl %ebp movl %esp, %ebp fldl 8(%ebp) fcoml .LC1 fnstsw %ax sahf jbe .L2 fcos popl %ebp ret .p2align 4,,7 .L2: fsin popl %ebp ret --cut here-- -mbranch-cost=0 does not help with -march=i686, resulting code is the same as without -mbranch-cost.
Note branch cost and branch probabilities are two seperate things. I tried to find what flag is causing this but I could not I would need to look into the RTL dups to see what is going on.
It seems to me that cases where this is a bad decision are really rare. Basically it is only bad if both branches use the same functional unit, which additionally isn't fully pipelined. This seems like a rare situation, and in the future CPUs will have even more pipelined units, which will usually not have to be represented with UNSPEC, and even more expensive branches. IMHO we should leave it just as it is and maybe try to mitigate it with target specific hacks if it is really an issue in real programs.
There's a relevant discussion of these issues posted to gcc-patches at: http://gcc.gnu.org/ml/gcc-patches/2004-07/msg00597.html
Subject: Bug 15187 CVSROOT: /cvs/gcc Module name: gcc Changes by: uros@gcc.gnu.org 2004-09-17 05:32:37 Modified files: gcc : ChangeLog ifcvt.c Log message: PR rtl-optimization/15187 * ifcvt.c (noce_try_cmove_arith): Exit early if total insn_rtx_cost of both branches > BRANCH_COST Patches: http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.5487&r2=2.5488 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ifcvt.c.diff?cvsroot=gcc&r1=1.164&r2=1.165
Fixed by: http://gcc.gnu.org/ml/gcc-patches/2004-07/msg00654.html http://gcc.gnu.org/ml/gcc-patches/2004-07/msg01107.html (TERGET_CMOV case by): http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01667.html
I forgot to mark this bug as RESOLVED FIXED in previous comment.
it looks like 4.1.1 and 4.2.0 still produce unoptimal code. #include <math.h> double test(double x) { if (x > 0.0) return cos(x); else return sin(x); } [ 4.1.1-0.20060322 / rev.112277 /x86-64 ] $ gcc bug.c -O2 -ffast-math -m32 -march=i686 -S test: pushl %ebp movl %esp, %ebp fldl 8(%ebp) fldz fcomip %st(1), %st jae .L2 popl %ebp fcos ret .L2: popl %ebp fsin ret [ 4.2.0-20060323 / rev.112317 / x86-64 ] $ ./xgcc -B. bug.c -O2 -ffast-math -m32 -march=i686 -S bug.c: In function ‘test’: bug.c:7: internal compiler error: in bsi_last, at tree-flow-inline.h:760
(In reply to comment #11) > it looks like 4.1.1 and 4.2.0 still produce unoptimal code. > test: pushl %ebp > movl %esp, %ebp > fldl 8(%ebp) > fldz > fcomip %st(1), %st > jae .L2 > popl %ebp > fcos > ret > > .L2: popl %ebp > fsin > ret No, this code is optimal. Please compare the code above to the code in description, where fcos is calculated even if x <= 0.0