GCC 7.3 and 8.2 (from Ubuntu 18.04 and MinGW-w64) seems to generate wrong code for sqrtf() -- and sqrt() -- for multiple platforms (tested under x86-64 and ARM Arch64, gcc 7.3, and Arch32 with gcc 6.3). Try to compile this simple function: /* test.c */ #include <math.h> float f( float x ) { return sqrtf( x ); } And I've got, for x86-64 using SSE: # Compiled with gcc -O2 -S test.c f: pxor %xmm2,%xmm2 sqrtss %xmm0,%xmm1 ucomiss %xmm0,%xmm2 ja .L8 movaps %xmm1,%xmm0 ret .L8: subq $24,%rsp movss %xmm1, 12(%rsp) # save xmm1 from sqrtss call sqrtf@PLT movss 12(%rsp),%xmm1 # restore xmm1. addq $24,%rsp movaps %xmm1,%xmm0 # use xmm1 anyway?! ret Notice, when 0 > x sqrt@PLT is called, but the result is ignored. A similar code is created by GCC for ARM. As expected, -ffast-math, creates: f: sqrtss %xmm0,%xmm0 ret Which is correct. My environment: $ gcc --version gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0 $ gcc-8 --version gcc-8 (Ubuntu 8.2.0-1ubuntu2~18.04) 8.2.0 $ arm-none-eabi-gcc --version arm-none-eabi-gcc (15:6.3.1+svn253039-1build1) 6.3.1 20170620 $ aarch64-linux-gnu-gcc --version aarch64-linux-gnu-gcc (Ubuntu/Linaro 7.3.0-27ubuntu1~18.04) 7.3.0 $ uname -srvp Linux 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 $ lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 18.04.2 LTS Release: 18.04 Codename: bionic
It is <0 when sqrt is called and it is due setting errno. Not a bug.
(In reply to Andrew Pinski from comment #1) > It is <0 when sqrt is called and it is due setting errno. Not a bug. Hummmm... interesting. But why the inefficient code?
(In reply to Frederico Lamberti Pissarra from comment #2) > (In reply to Andrew Pinski from comment #1) > > It is <0 when sqrt is called and it is due setting errno. Not a bug. > > Hummmm... interesting. But why the inefficient code? What do you mean by inefficient? What would you suggest instead? We generate a fast sqrtss and only in the rare case where there is an error we call the libc sqrt function to let it report errors the way it wants. -fno-math-errno is the way to say we don't care about this kind of error reporting.
My suggestion is to do a simple jmp after .L8 label and test the condition before sqrtss (or fsqrt, or sqrtsd...): f: pxor %xmm2,%xmm2 ucomiss %xmm0,%xmm2 ja .L8 sqrtss %xmm0,%xmm0 ret .L8: jmp sqrtf@PLT
CLANG 6 creates a similar code: f: xorps %xmm1,%xmm1 ucomiss %xmm1,%xmm0 jb .L8 # more intutive test... sqrtss ret .L8: jmp sqrtf@PLT
Reopening and confirming, GCC's code looks less efficient than possible for no good reason. CDCE does y = sqrt (x); ==> y = IFN_SQRT (x); if (__builtin_isless (x, 0)) sqrt (x); but it could do y = IFN_SQRT (x); if (__builtin_isless (x, 0)) y = sqrt (x); (note two assignments to y) or to mimic LLVM's approach: if (__builtin_isless (x, 0)) y = sqrt (x); else y = IFN_SQRT (x);
yes, the transformation in CDEC prevent the tail call optimization. let's check the return stmt in CDEC pass.
(In reply to Alexander Monakov from comment #6) > Reopening and confirming, GCC's code looks less efficient than possible for > no good reason. > > CDCE does > > y = sqrt (x); > ==> > y = IFN_SQRT (x); > if (__builtin_isless (x, 0)) > sqrt (x); > > but it could do > > y = IFN_SQRT (x); > if (__builtin_isless (x, 0)) > y = sqrt (x); > > (note two assignments to y) > what is the difference between this and LLVM's approach ? > or to mimic LLVM's approach: > > if (__builtin_isless (x, 0)) > y = sqrt (x); > else > y = IFN_SQRT (x); I have finished a patch which do as same as LLVM in cdce pass, and test with case below: #include <math.h> int main () { float x = 1.0; float y; for (int i=0; i<100000000; i++) { y += sqrtf (x+i); } return y; } And I've got, for x86-64 with O2: # original asm of IFN_SQRT part .L4: pxor %xmm0, %xmm0 cvtsi2ssl %ebx, %xmm0 addss %xmm3, %xmm0 ucomiss %xmm0, %xmm4 movaps %xmm0, %xmm2 sqrtss %xmm2, %xmm2 ja .L7 and perf stat : 1,423,652,277 cycles # 2.180 GHz (83.31%) 1,121,862,980 stalled-cycles-frontend # 78.80% frontend cycles idle (83.31%) 634,957,413 stalled-cycles-backend # 44.60% backend cycles idle (66.62%) 1,102,109,423 instructions # 0.77 insn per cycle # 1.02 stalled cycles per insn (83.31%) 200,400,940 branches # 306.873 M/sec (83.44%) 7,734 branch-misses # 0.00% of all branches (83.44%) #transformed asm : .L4: pxor %xmm0, %xmm0 cvtsi2ssl %ebx, %xmm0 addss %xmm3, %xmm0 ucomiss %xmm0, %xmm2 ja .L8 sqrtss %xmm0, %xmm0 and perf stat: 1,418,560,722 cycles # 2.180 GHz (83.25%) 1,116,732,674 stalled-cycles-frontend # 78.72% frontend cycles idle (83.25%) 674,837,417 stalled-cycles-backend # 47.57% backend cycles idle (66.81%) 1,003,067,037 instructions # 0.71 insn per cycle # 1.11 stalled cycles per insn (83.41%) 200,619,151 branches # 308.272 M/sec (83.40%) 5,637 branch-misses # 0.00% of all branches (83.28%) The transformed case has less instructions and gets better performance which looks good to me. However, one thing that I noticed is the original case gets less 'stalled-cycles-backend', since its code has better ILP. I'm not sure which approach is better. Environment: gcc version: gcc trunk@270488 OS: centos7.2 HW: Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz
(In reply to JunMa from comment #7) > yes, the transformation in CDEC prevent the tail call optimization. let's > check the return stmt in CDEC pass. Sorry for the confused comment. As the discussion above, The cdce pass looks for calls to built-in functions that set errno and whose result is used. It tries to transform these calls into conditionally executes calls with a simple range check on the arguments which can detect most cases and the errno does not need to be set. The transform looks like: y = sqrt (x); ==> y = IFN_SQRT (x); if (__builtin_isless (x, 0)) sqrt (x); However when the call is in tail position, this transformation breaks tailcall optimizations, since the conditionally call does not have return value. This is what this PR tries to explain and fix. Alexander gives two suggestions: first: y = IFN_SQRT (x); if (__builtin_isless (x, 0)) y = sqrt (x); second(LLVM's approach): if (__builtin_isless (x, 0)) y = sqrt (x); else y = IFN_SQRT (x); So what I want to do here is looking for tailcall and transforming as first one. I did some hacks locally, but then I found gcc generated even worse code in 'y = IFN_SQRT' part: f: pxor %xmm1, %xmm1 movaps %xmm0, %xmm2 ucomiss %xmm0, %xmm1 sqrtss %xmm2, %xmm2 ja .L4 movaps %xmm2, %xmm0 ret .L4: jmp sqrtf Then I used LLVM's approach no matter call is in tail position or not, and it gives: f: pxor %xmm1, %xmm1 ucomiss %xmm0, %xmm1 ja .L4 sqrtss %xmm0, %xmm0 ret .L4: jmp sqrtf Also in comment 6, I did some test for LLVM's approach. Sorry for the confused comment again.
Author: junma Date: Thu May 16 08:21:17 2019 New Revision: 271281 URL: https://gcc.gnu.org/viewcvs?rev=271281&root=gcc&view=rev Log: PR tree-optimization/90106 * tree-call-cdce.c (shrink_wrap_one_built_in_call_with_conds): Add new parameter as new internal function call, also move it to new basic block. (use_internal_fn): Pass internal function call to shrink_wrap_one_built_in_call_with_conds. gcc/testsuite * gcc.dg/cdce1.c: Check tailcall code generation after cdce pass. * gcc.dg/cdce2.c: Likewise. Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/cdce1.c trunk/gcc/testsuite/gcc.dg/cdce2.c trunk/gcc/tree-call-cdce.c
Author: junma Date: Fri May 17 10:13:29 2019 New Revision: 271319 URL: https://gcc.gnu.org/viewcvs?rev=271319&root=gcc&view=rev Log: PR tree-optimization/90106 * gcc.dg/cdce3.c: New test. Added: trunk/gcc/testsuite/gcc.dg/cdce3.c Modified: trunk/gcc/testsuite/ChangeLog
(In reply to junma from comment #11) > Author: junma > Date: Fri May 17 10:13:29 2019 > New Revision: 271319 > > URL: https://gcc.gnu.org/viewcvs?rev=271319&root=gcc&view=rev > Log: > PR tree-optimization/90106 > * gcc.dg/cdce3.c: New test. > > > Added: > trunk/gcc/testsuite/gcc.dg/cdce3.c > Modified: > trunk/gcc/testsuite/ChangeLog This new test fails on arm: FAIL: gcc.dg/cdce3.c scan-tree-dump cdce "cdce3.c:9: [^\n\r]* function call is shrink-wrapped into error conditions."
(In reply to Christophe Lyon from comment #12) > This new test fails on arm: > FAIL: gcc.dg/cdce3.c scan-tree-dump cdce "cdce3.c:9: [^\n\r]* function call > is shrink-wrapped into error conditions." I don't have arm environment. Would you please attach the dump file of cdce pass?
Sure, here is the contents of cdce3.c.105t.cdce: ;; Function foo (foo, funcdef_no=0, decl_uid=4197, cgraph_uid=1, symbol_order=0) foo (float x) { float _4; <bb 2> [local count: 1073741824]: _4 = sqrtf (x_2(D)); return _4; }
(In reply to Christophe Lyon from comment #14) > Sure, here is the contents of cdce3.c.105t.cdce: > > ;; Function foo (foo, funcdef_no=0, decl_uid=4197, cgraph_uid=1, > symbol_order=0) > > foo (float x) > { > float _4; > > <bb 2> [local count: 1073741824]: > _4 = sqrtf (x_2(D)); > return _4; > > } The contents are as same as cdce3.c.104t.stdarg. Would you please dump it whit -fdump-tree-cdce-details? Then we can see(in x86_64) Found conditional dead call: _4 = sqrtf (x_2(D)); cdce3.c:9: note: function call is shrink-wrapped into error conditions. It seems that gcc thinks arm doesn't support direct internal function (maybe vsqrt instruction ?) of sqrtf .
That's what I did... (use -fdump-tree-cdce-details). The assembler code is: .arm .fpu softvfp .type foo, %function foo: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. b sqrtf which is a tail call ('b' is a jump instruction on arm, not a call).
(In reply to Christophe Lyon from comment #16) > That's what I did... (use -fdump-tree-cdce-details). > > The assembler code is: > .arm > .fpu softvfp > .type foo, %function > foo: > @ args = 0, pretend = 0, frame = 0 > @ frame_needed = 0, uses_anonymous_args = 0 > @ link register save eliminated. > b sqrtf > > which is a tail call ('b' is a jump instruction on arm, not a call). Hmm... I think the testcase works with -mfpu=vfp -mfloat-abi=hard.
(In reply to JunMa from comment #17) > (In reply to Christophe Lyon from comment #16) > > That's what I did... (use -fdump-tree-cdce-details). > > > > The assembler code is: > > .arm > > .fpu softvfp > > .type foo, %function > > foo: > > @ args = 0, pretend = 0, frame = 0 > > @ frame_needed = 0, uses_anonymous_args = 0 > > @ link register save eliminated. > > b sqrtf > > > > which is a tail call ('b' is a jump instruction on arm, not a call). > > Hmm... I think the testcase works with -mfpu=vfp -mfloat-abi=hard. Indeed, sorry I wasn't specific enough when I said "arm". The test fails on arm-none-linux-gnueabi, and passes on arm-none-linux-gnueabihf.
we can skip the target by adding /* { dg-skip-if "need hardfp abi" { *-*-* } { "-mfloat-abi=soft" } { "" } } */ to testcase.
Author: clyon Date: Mon May 20 15:01:46 2019 New Revision: 271424 URL: https://gcc.gnu.org/viewcvs?rev=271424&root=gcc&view=rev Log: [testsuite] PR90106 Fix cdce3.c testcase 2019-05-20 Christophe Lyon <christophe.lyon@linaro.org> PR tree-optimization/90106 * gcc.dg/cdce3.c: Add hard_float effective target. Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/cdce3.c
Author: jakub Date: Fri May 24 10:15:16 2019 New Revision: 271598 URL: https://gcc.gnu.org/viewcvs?rev=271598&root=gcc&view=rev Log: PR tree-optimization/90106 PR testsuite/90517 * gcc.dg/cdce1.c: Don't scan-assembler, instead -fdump-tree-optimized and scan-tree-dump for tail call. * gcc.dg/cdce2.c: Likewise. Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/cdce1.c trunk/gcc/testsuite/gcc.dg/cdce2.c
Fixed so closing as such.