[Bug target/103008] poor inlined builtin_fmod on x86_64

Mon Feb 14 07:12:58 GMT 2022

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

--- Comment #16 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 11 Feb 2022, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008
> 
> --- Comment #13 from Uroš Bizjak <ubizjak at gmail dot com> ---
> (In reply to Richard Biener from comment #12)
> > Just as data-point on znver2 Uros testcase shows
> > 
> > rguenther@ryzen:/tmp> gcc-11 t.c -Ofast -lm -march=znver2
> > rguenther@ryzen:/tmp> numactl --physcpubind=3 /usr/bin/time ./a.out 
> > 19.18user 0.00system 0:19.18elapsed 99%CPU (0avgtext+0avgdata
> > 1528maxresident)k
> > 0inputs+0outputs (0major+76minor)pagefaults 0swaps
> > rguenther@ryzen:/tmp> gcc-11 t.c -Ofast -lm -march=znver2 -fno-builtin-fmod
> 
> You should use -fno-builtin-fmodf in the above compile flags.

Oops, yes.  With -fno-builtin-fmodf the glibc version takes

22.53user 0.00system 0:22.53elapsed 99%CPU (0avgtext+0avgdata 
1600maxresident)k
0inputs+0outputs (0major+77minor)pagefaults 0swaps

so indeed for float the x87 inline version is faster (19.18s vs. 22.53s)
when benchmarked this way.  For double it's

19.31user 0.00system 0:19.31elapsed 99%CPU (0avgtext+0avgdata 
1536maxresident)k
0inputs+0outputs (0major+76minor)pagefaults 0swaps

vs.

18.47user 0.00system 0:18.47elapsed 99%CPU (0avgtext+0avgdata 
1600maxresident)k
0inputs+0outputs (0major+77minor)pagefaults 0swaps

so glibc is a bit faster here (18.47s vs. 19.31s), while the x87 inline
version takes about the same time as for float.  Avoiding the libcall
can of course also avoid spilling SSE regs around the call.

So what remains is really the special case in blender doing
fmod (x, 1.), which can possibly be optimized with SSE4.