90106 – builtin sqrt() ignoring libm's sqrt call result

Bug 90106 - builtin sqrt() ignoring libm's sqrt call result

Summary: builtin sqrt() ignoring libm's sqrt call result

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	7.3.0

Importance:	P3 normal
Target Milestone:	10.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2019-04-16 00:21 UTC by Frederico Lamberti Pissarra
Modified:	2021-08-10 18:31 UTC (History)
CC List:	1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2019-04-16 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Frederico Lamberti Pissarra 2019-04-16 00:21:59 UTC

GCC 7.3 and 8.2 (from Ubuntu 18.04 and MinGW-w64) seems to generate wrong code for sqrtf() -- and sqrt() -- for multiple platforms (tested under x86-64 and ARM Arch64, gcc 7.3, and Arch32 with gcc 6.3). Try to compile this simple function:

  /* test.c */
  #include <math.h>
  float f( float x ) { return sqrtf( x ); }

And I've got, for x86-64 using SSE:

  # Compiled with gcc -O2 -S test.c
  f:
    pxor %xmm2,%xmm2
    sqrtss %xmm0,%xmm1
    ucomiss %xmm0,%xmm2
    ja .L8
    movaps %xmm1,%xmm0
    ret
  .L8:
    subq $24,%rsp
    movss %xmm1, 12(%rsp)    # save xmm1 from sqrtss
    call  sqrtf@PLT
    movss 12(%rsp),%xmm1     # restore xmm1.
    addq $24,%rsp
    movaps %xmm1,%xmm0       # use xmm1 anyway?!
    ret

Notice, when 0 > x sqrt@PLT is called, but the result is ignored.
A similar code is created by GCC for ARM.

As expected, -ffast-math, creates:

f:
  sqrtss %xmm0,%xmm0
  ret

Which is correct.

My environment:

  $ gcc --version
  gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0

  $ gcc-8 --version
  gcc-8 (Ubuntu 8.2.0-1ubuntu2~18.04) 8.2.0

  $  arm-none-eabi-gcc --version
  arm-none-eabi-gcc (15:6.3.1+svn253039-1build1) 6.3.1 20170620

  $  aarch64-linux-gnu-gcc --version
  aarch64-linux-gnu-gcc (Ubuntu/Linaro 7.3.0-27ubuntu1~18.04) 7.3.0

  $  uname -srvp
  Linux 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64

  $  lsb_release -a
  No LSB modules are available.
  Distributor ID:	Ubuntu
  Description:	Ubuntu 18.04.2 LTS
  Release:	18.04
  Codename:	bionic

Comment 1 Andrew Pinski 2019-04-16 00:27:10 UTC

It is <0 when sqrt is called and it is due setting errno.  Not a bug.

Comment 2 Frederico Lamberti Pissarra 2019-04-16 00:30:29 UTC

(In reply to Andrew Pinski from comment #1)
> It is <0 when sqrt is called and it is due setting errno.  Not a bug.

Hummmm... interesting. But why the inefficient code?

Comment 3 Marc Glisse 2019-04-16 06:44:47 UTC

(In reply to Frederico Lamberti Pissarra from comment #2)
> (In reply to Andrew Pinski from comment #1)
> > It is <0 when sqrt is called and it is due setting errno.  Not a bug.
> 
> Hummmm... interesting. But why the inefficient code?

What do you mean by inefficient? What would you suggest instead? We generate a fast sqrtss and only in the rare case where there is an error we call the libc sqrt function to let it report errors the way it wants. -fno-math-errno is the way to say we don't care about this kind of error reporting.

Comment 4 Frederico Lamberti Pissarra 2019-04-16 12:30:23 UTC

My suggestion is to do a simple jmp after .L8 label and test the condition before sqrtss (or fsqrt, or sqrtsd...):

  f:
    pxor %xmm2,%xmm2
    ucomiss %xmm0,%xmm2
    ja .L8
    sqrtss %xmm0,%xmm0
    ret
  .L8:
    jmp sqrtf@PLT

Comment 5 Frederico Lamberti Pissarra 2019-04-16 12:37:54 UTC

CLANG 6 creates a similar code:

  f:
    xorps %xmm1,%xmm1
    ucomiss %xmm1,%xmm0
    jb .L8   # more intutive test...
    sqrtss
    ret
  .L8:
    jmp sqrtf@PLT

Comment 6 Alexander Monakov 2019-04-16 13:04:29 UTC

Reopening and confirming, GCC's code looks less efficient than possible for no good reason.

CDCE does

        y = sqrt (x);
     ==>
        y = IFN_SQRT (x);
        if (__builtin_isless (x, 0))
            sqrt (x);

but it could do

        y = IFN_SQRT (x);
        if (__builtin_isless (x, 0))
            y = sqrt (x);

(note two assignments to y)

or to mimic LLVM's approach:

        if (__builtin_isless (x, 0))
            y = sqrt (x);
        else
            y = IFN_SQRT (x);

Comment 7 JunMa 2019-04-17 08:57:54 UTC

yes, the transformation in CDEC prevent the tail call optimization. let's check the return stmt in CDEC pass.

Comment 8 JunMa 2019-04-25 02:41:44 UTC

(In reply to Alexander Monakov from comment #6)
> Reopening and confirming, GCC's code looks less efficient than possible for
> no good reason.
> 
> CDCE does
> 
>         y = sqrt (x);
>      ==>
>         y = IFN_SQRT (x);
>         if (__builtin_isless (x, 0))
>             sqrt (x);
> 
> but it could do
> 
>         y = IFN_SQRT (x);
>         if (__builtin_isless (x, 0))
>             y = sqrt (x);
> 
> (note two assignments to y)
> 

what is the difference between this and LLVM's approach ? 

> or to mimic LLVM's approach:
> 
>         if (__builtin_isless (x, 0))
>             y = sqrt (x);
>         else
>             y = IFN_SQRT (x);

I have finished a patch which do as same as LLVM in cdce pass, and test with case below:

 #include <math.h>
  int main () {
    float x = 1.0;
    float y;
    for (int i=0; i<100000000; i++) {
      y += sqrtf (x+i);
    }
    return y;
  }

And I've got, for x86-64 with O2:

  # original asm of IFN_SQRT part
.L4:
  pxor  %xmm0, %xmm0
  cvtsi2ssl  %ebx, %xmm0
  addss  %xmm3, %xmm0
  ucomiss %xmm0, %xmm4
  movaps %xmm0, %xmm2
  sqrtss %xmm2, %xmm2
  ja  .L7

and perf stat : 
     1,423,652,277      cycles                    #    2.180 GHz                      (83.31%)
     1,121,862,980      stalled-cycles-frontend   #   78.80% frontend cycles idle     (83.31%)
       634,957,413      stalled-cycles-backend    #   44.60% backend cycles idle      (66.62%)
     1,102,109,423      instructions              #    0.77  insn per cycle         
                                                  #    1.02  stalled cycles per insn  (83.31%)
       200,400,940      branches                  #  306.873 M/sec                    (83.44%)
             7,734      branch-misses             #    0.00% of all branches          (83.44%)



#transformed asm : 
.L4:
  pxor  %xmm0, %xmm0
  cvtsi2ssl  %ebx, %xmm0
  addss  %xmm3, %xmm0
  ucomiss %xmm0, %xmm2
  ja   .L8
  sqrtss %xmm0, %xmm0

and perf stat:
     1,418,560,722      cycles                    #    2.180 GHz                      (83.25%)
     1,116,732,674      stalled-cycles-frontend   #   78.72% frontend cycles idle     (83.25%)
       674,837,417      stalled-cycles-backend    #   47.57% backend cycles idle      (66.81%)
     1,003,067,037      instructions              #    0.71  insn per cycle         
                                                  #    1.11  stalled cycles per insn  (83.41%)
       200,619,151      branches                  #  308.272 M/sec                    (83.40%)
             5,637      branch-misses             #    0.00% of all branches          (83.28%)


The transformed case has less instructions and gets better performance which looks good to me. However, one thing that I noticed is the original case gets less 'stalled-cycles-backend', since its code has better ILP.

I'm not sure which approach is better.

Environment:
gcc version:  gcc trunk@270488 
OS: centos7.2
HW: Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz

Comment 9 JunMa 2019-04-25 05:05:45 UTC

(In reply to JunMa from comment #7)
> yes, the transformation in CDEC prevent the tail call optimization. let's
> check the return stmt in CDEC pass.

Sorry for the confused comment. 

As the discussion above, The cdce pass looks for calls to built-in functions that set errno and whose result is used. It tries to transform these calls into conditionally executes calls with a simple range check on the arguments which can detect most cases and the errno does not need to be set. The transform looks like:

        y = sqrt (x);
     ==>
        y = IFN_SQRT (x);
        if (__builtin_isless (x, 0))
            sqrt (x);

However when the call is in tail position, this transformation breaks  tailcall optimizations, since the conditionally call does not have return value. This is what this PR tries to explain and fix.

Alexander gives two suggestions:
first:
        y = IFN_SQRT (x);
        if (__builtin_isless (x, 0))
            y = sqrt (x);

second(LLVM's approach):

        if (__builtin_isless (x, 0))
            y = sqrt (x);
        else
            y = IFN_SQRT (x);


So what I want to do here is looking for tailcall and transforming as first one.

I did some hacks locally, but then I found gcc generated even worse code in 'y = IFN_SQRT' part:

f:
pxor  %xmm1, %xmm1
movaps %xmm0, %xmm2
ucomiss %xmm0, %xmm1
sqrtss %xmm2, %xmm2
ja   .L4
movaps %xmm2, %xmm0
ret
.L4:
jmp  sqrtf

Then I used LLVM's approach no matter call is in tail position or not, and it gives:

f:
  pxor  %xmm1, %xmm1
  ucomiss %xmm0, %xmm1
  ja   .L4
  sqrtss %xmm0, %xmm0
  ret
.L4:
  jmp  sqrtf 

Also in comment 6, I did some test for LLVM's approach.

Sorry for the confused comment again.

Comment 10 junma 2019-05-16 08:21:49 UTC

Author: junma
Date: Thu May 16 08:21:17 2019
New Revision: 271281

URL: https://gcc.gnu.org/viewcvs?rev=271281&root=gcc&view=rev
Log:
	PR tree-optimization/90106
	* tree-call-cdce.c (shrink_wrap_one_built_in_call_with_conds): Add
	new parameter as new internal function call, also move it to new
	basic block.
	(use_internal_fn): Pass internal function call to
	shrink_wrap_one_built_in_call_with_conds.

	gcc/testsuite
	* gcc.dg/cdce1.c: Check tailcall code generation after cdce pass.
	* gcc.dg/cdce2.c: Likewise.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/cdce1.c
    trunk/gcc/testsuite/gcc.dg/cdce2.c
    trunk/gcc/tree-call-cdce.c

Comment 11 junma 2019-05-17 10:14:01 UTC

Author: junma
Date: Fri May 17 10:13:29 2019
New Revision: 271319

URL: https://gcc.gnu.org/viewcvs?rev=271319&root=gcc&view=rev
Log:
	PR tree-optimization/90106
	* gcc.dg/cdce3.c: New test.


Added:
    trunk/gcc/testsuite/gcc.dg/cdce3.c
Modified:
    trunk/gcc/testsuite/ChangeLog

Comment 12 Christophe Lyon 2019-05-20 06:26:47 UTC

(In reply to junma from comment #11)
> Author: junma
> Date: Fri May 17 10:13:29 2019
> New Revision: 271319
> 
> URL: https://gcc.gnu.org/viewcvs?rev=271319&root=gcc&view=rev
> Log:
> 	PR tree-optimization/90106
> 	* gcc.dg/cdce3.c: New test.
> 
> 
> Added:
>     trunk/gcc/testsuite/gcc.dg/cdce3.c
> Modified:
>     trunk/gcc/testsuite/ChangeLog

This new test fails on arm:
FAIL: gcc.dg/cdce3.c scan-tree-dump cdce "cdce3.c:9: [^\n\r]* function call is shrink-wrapped into error conditions."

Comment 13 JunMa 2019-05-20 07:32:19 UTC

(In reply to Christophe Lyon from comment #12)
> This new test fails on arm:
> FAIL: gcc.dg/cdce3.c scan-tree-dump cdce "cdce3.c:9: [^\n\r]* function call
> is shrink-wrapped into error conditions."

I don't have arm environment. Would you please attach the dump file of cdce pass?

Comment 14 Christophe Lyon 2019-05-20 09:16:55 UTC

Sure, here is the contents of cdce3.c.105t.cdce:

;; Function foo (foo, funcdef_no=0, decl_uid=4197, cgraph_uid=1, symbol_order=0)

foo (float x)
{
  float _4;

  <bb 2> [local count: 1073741824]:
  _4 = sqrtf (x_2(D));
  return _4;

}

Comment 15 JunMa 2019-05-20 09:56:25 UTC

(In reply to Christophe Lyon from comment #14)
> Sure, here is the contents of cdce3.c.105t.cdce:
> 
> ;; Function foo (foo, funcdef_no=0, decl_uid=4197, cgraph_uid=1,
> symbol_order=0)
> 
> foo (float x)
> {
>   float _4;
> 
>   <bb 2> [local count: 1073741824]:
>   _4 = sqrtf (x_2(D));
>   return _4;
> 
> }

The contents are as same as cdce3.c.104t.stdarg. Would you please dump it whit -fdump-tree-cdce-details?

Then we can see(in x86_64)
   Found conditional dead call: _4 = sqrtf (x_2(D));

   cdce3.c:9: note: function call is shrink-wrapped into error conditions.

It seems that gcc thinks arm doesn't support direct internal function (maybe vsqrt instruction ?) of sqrtf .

Comment 16 Christophe Lyon 2019-05-20 10:02:17 UTC

That's what I did... (use -fdump-tree-cdce-details).

The assembler code is:
        .arm
        .fpu softvfp
        .type   foo, %function
foo:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        b       sqrtf

which is a tail call ('b' is a jump instruction on arm, not a call).

Comment 17 JunMa 2019-05-20 10:18:10 UTC

(In reply to Christophe Lyon from comment #16)
> That's what I did... (use -fdump-tree-cdce-details).
> 
> The assembler code is:
>         .arm
>         .fpu softvfp
>         .type   foo, %function
> foo:
>         @ args = 0, pretend = 0, frame = 0
>         @ frame_needed = 0, uses_anonymous_args = 0
>         @ link register save eliminated.
>         b       sqrtf
> 
> which is a tail call ('b' is a jump instruction on arm, not a call).

Hmm... I think the testcase works with -mfpu=vfp -mfloat-abi=hard.

Comment 18 Christophe Lyon 2019-05-20 10:22:15 UTC

(In reply to JunMa from comment #17)
> (In reply to Christophe Lyon from comment #16)
> > That's what I did... (use -fdump-tree-cdce-details).
> > 
> > The assembler code is:
> >         .arm
> >         .fpu softvfp
> >         .type   foo, %function
> > foo:
> >         @ args = 0, pretend = 0, frame = 0
> >         @ frame_needed = 0, uses_anonymous_args = 0
> >         @ link register save eliminated.
> >         b       sqrtf
> > 
> > which is a tail call ('b' is a jump instruction on arm, not a call).
> 
> Hmm... I think the testcase works with -mfpu=vfp -mfloat-abi=hard.

Indeed, sorry I wasn't specific enough when I said "arm".

The test fails on arm-none-linux-gnueabi, and passes on arm-none-linux-gnueabihf.

Comment 19 JunMa 2019-05-20 10:33:50 UTC

we can skip the target by adding 
/* { dg-skip-if "need hardfp abi" { *-*-* } { "-mfloat-abi=soft" } { "" } } */
to testcase.

Comment 20 Christophe Lyon 2019-05-20 15:02:17 UTC

Author: clyon
Date: Mon May 20 15:01:46 2019
New Revision: 271424

URL: https://gcc.gnu.org/viewcvs?rev=271424&root=gcc&view=rev
Log:
[testsuite] PR90106 Fix cdce3.c testcase

2019-05-20  Christophe Lyon  <christophe.lyon@linaro.org>

	PR tree-optimization/90106
	* gcc.dg/cdce3.c: Add hard_float effective target.


Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/cdce3.c

Comment 21 Jakub Jelinek 2019-05-24 10:16:08 UTC

Author: jakub
Date: Fri May 24 10:15:16 2019
New Revision: 271598

URL: https://gcc.gnu.org/viewcvs?rev=271598&root=gcc&view=rev
Log:
	PR tree-optimization/90106
	PR testsuite/90517
	* gcc.dg/cdce1.c: Don't scan-assembler, instead -fdump-tree-optimized
	and scan-tree-dump for tail call.
	* gcc.dg/cdce2.c: Likewise.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/cdce1.c
    trunk/gcc/testsuite/gcc.dg/cdce2.c

Comment 22 Andrew Pinski 2021-08-10 18:31:40 UTC

Fixed so closing as such.