[Bug target/86819] Set min_divisions_for_recip_mul to 2

Wed Aug 1 20:29:00 GMT 2018

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86819

--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Marc Glisse from comment #4)
> But unless your FPU can do 2 divisions in parallel, you have to take into
> account the delay before a second division can start (related to
> throughput), which is often larger than the latency of a multiplication.

Yep - Agner Fog's instruction tables indicate that starting with Ivy Bridge,
divss is partially pipelined, and on Skylake-X it has a reciprocal throughput of
just 3 cycles, which is smaller than the mulss latency (4). On Ryzen it's similar.
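
To make that comparison concrete, here is a rough critical-path model. It is
only a sketch under stated assumptions, not a measurement: the function names
are made up for illustration, every cycle count is a parameter, and the model
ignores everything except the divider and the multiplier. It assumes a single
partially pipelined divider and that both multiplications can issue in the
same cycle.

```c
/* Back-of-the-envelope critical-path model for computing x/c and y/c.
   Hypothetical helper names; all cycle counts are passed in by the caller. */

/* Two divisions issued back to back on one partially pipelined divider:
   the second one can only start div_rthroughput cycles after the first,
   and then needs the full division latency to complete. */
int two_divisions_cycles(int div_latency, int div_rthroughput)
{
    return div_rthroughput + div_latency;
}

/* d = 1/c followed by x*d and y*d: both multiplications wait for the
   reciprocal. Assumes the two multiplications issue in the same cycle. */
int recip_then_mul_cycles(int div_latency, int mul_latency)
{
    return div_latency + mul_latency;
}
```

Plugging in the figures above (reciprocal throughput 3, mulss latency 4) and
assuming a divss latency of 11 cycles, which is my assumption and not a number
from this thread, the model gives 14 cycles for the two divisions and 15 for
reciprocal plus multiplications. Whatever the division latency, the difference
in this model is just reciprocal throughput of division versus latency of
multiplication.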

> To try your example:
[snip]
> On skylake, I am getting 1s for the 2 divisions and .75s for the
> inverse+multiplication. With float, both are .75s.

Note that your code compares throughput. A microbenchmark for comparing latency
would chain dependent computations, e.g. like this:

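int main(){
  float a=3, b=7;
  for(int i=0;i<100000000;++i) {
    /* c depends on a and b from the previous iteration, so iterations
       cannot overlap and the loop time reflects latency. */
    float c = a+b;
    float d = 1/c;
#if 0
    /* Variant 1: two divisions by the same divisor (d is then unused). */
    a /= c;
    b /= c;
#else
    /* Variant 2: one reciprocal followed by two multiplications. */
    a *= d;
    b *= d;
#endif
  }
  /* Print the results so the loop is not optimized away. */
  __builtin_printf("%g %g\n", a, b);
}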

> Maybe the right choice is clearer for double than for float? I would still
> go with an unconditional 2, for simplicity.

Ack. I just want to point out that it's not so clear-cut given the trend toward
better pipelining of division in the latest CPU generations.