[Bug target/86819] Set min_divisions_for_recip_mul to 2
amonakov at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Wed Aug 1 20:29:00 GMT 2018
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86819
--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Marc Glisse from comment #4)
> But unless your FPU can do 2 divisions in parallel, you have to take into
> account the delay before a second division can start (related to
> throughput), which is often larger than the latency of a multiplication.
Yep - Agner's tables indicate that starting with Ivy Bridge, divss is partially
pipelined, and on Skylake-X it has a reciprocal throughput of just 3 cycles,
which is smaller than the mulss latency (4). On Ryzen it's similar.
> To try your example:
[snip]
> On skylake, I am getting 1s for the 2 divisions and .75s for the
> inverse+multiplication. With float, both are .75s.
Note that your code compares throughput. A microbenchmark for comparing latency
would chain dependent computations, e.g. like this:
int main(){
  float a=3, b=7;
  for(int i=0;i<100000000;++i) {
    float c = a+b;
    float d = 1/c;
#if 0
    a /= c;
    b /= c;
#else
    a *= d;
    b *= d;
#endif
  }
  __builtin_printf("%g %g\n", a, b);
}
> Maybe the right choice is clearer for double than for float? I would still
> go with an unconditional 2, for simplicity.
Ack. I just want to point out that it's not so clear-cut given the trend for
improved pipelining of division in the latest CPU generations.