[Bug target/103008] poor inlined builtin_fmod on x86_64
rguenther at suse dot de
gcc-bugzilla@gcc.gnu.org
Mon Feb 14 07:12:58 GMT 2022
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008
--- Comment #16 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 11 Feb 2022, ubizjak at gmail dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008
>
> --- Comment #13 from Uroš Bizjak <ubizjak at gmail dot com> ---
> (In reply to Richard Biener from comment #12)
> > Just as data-point on znver2 Uros testcase shows
> >
> > rguenther@ryzen:/tmp> gcc-11 t.c -Ofast -lm -march=znver2
> > rguenther@ryzen:/tmp> numactl --physcpubind=3 /usr/bin/time ./a.out
> > 19.18user 0.00system 0:19.18elapsed 99%CPU (0avgtext+0avgdata
> > 1528maxresident)k
> > 0inputs+0outputs (0major+76minor)pagefaults 0swaps
> > rguenther@ryzen:/tmp> gcc-11 t.c -Ofast -lm -march=znver2 -fno-builtin-fmod
>
> You should use -fno-builtin-fmodf in the above compile flags.
Oops, yes. Then the glibc version is
22.53user 0.00system 0:22.53elapsed 99%CPU (0avgtext+0avgdata
1600maxresident)k
0inputs+0outputs (0major+77minor)pagefaults 0swaps
so indeed for float the x87 inline version is faster (19.18s vs.
22.53s) when benchmarked this way. For double it's
19.31user 0.00system 0:19.31elapsed 99%CPU (0avgtext+0avgdata
1536maxresident)k
0inputs+0outputs (0major+76minor)pagefaults 0swaps
vs.
18.47user 0.00system 0:18.47elapsed 99%CPU (0avgtext+0avgdata
1600maxresident)k
0inputs+0outputs (0major+77minor)pagefaults 0swaps
so glibc is a bit faster here (18.47s vs. 19.31s), while the x87
version is of course similar to the float case. Avoiding the libcall
can also avoid spilling SSE regs around the call.
So what remains is really the special case in blender doing
fmod (x, 1.), which could possibly be optimized with SSE4.