[Bug middle-end/37443] fast 64-bit divide by constant 32-bit

ajrobb at bigfoot dot com gcc-bugzilla@gcc.gnu.org
Tue Sep 16 00:18:00 GMT 2008



------- Comment #1 from ajrobb at bigfoot dot com  2008-09-16 00:17 -------
I have tested my initial routine with more values, and it turns out that the
short-cut of omitting the least significant multiplication is wrong. I added
the 4th multiply to generate the full 128-bit product. I also took the
multiplier used by gcc 4.2.3 on x86_64-linux-gnu (Ubuntu), 4056481920730334085,
and a (28+64)-bit shift to achieve the division by 1220703125u (5^13, the
largest power of 5 that fits in 32 bits):

#include <stdint.h>

uint64_t fast(uint64_t x) {
  /* 4056481920730334085 == ceil(2^92 / 1220703125) */
  static const uint64_t fact = 4056481920730334085ull;
  /* carry out of the low 32 bits of the 128-bit product x * fact */
  uint32_t c = ((x & 0xffffffffu) * (fact & 0xffffffffu)) >> 32;
  uint64_t a = ((x & 0xffffffffu) * (fact >> 32) + c);
  /*
   * Split the addition of the two middle 64-bit intermediate results
   * into 2 stages to avoid possible overflow.
   * (not actually an issue with this value of 'fact')
   */
  uint32_t b = ((x >> 32) * (fact & 0xffffffffu) + (a & 0xffffffffu)) >> 32;
  /* high 64 bits of x * fact, shifted a further 28: (x * fact) >> 92 */
  return ((x >> 32) * (fact >> 32) + (a >> 32) + b) >> 28;
}

After hand-coding the assembler on 32-bit x86, the extra multiply increased the
run time from 15s to 17s, still much faster than 'slow' (a 64-bit divide API
call).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37443


