Bug 37443

Summary: fast 64-bit divide by constant on 32-bit platform
Product: gcc Reporter: Andrew Robb <ajrobb57>
Component: middle-endAssignee: Not yet assigned to anyone <unassigned>
Status: NEW ---    
Severity: enhancement CC: gcc-bugs
Priority: P3 Keywords: missed-optimization
Version: 4.2.3   
Target Milestone: ---   
Host: Target: i686-pc-cygwin
Build: Known to work:
Known to fail: Last reconfirmed: 2021-07-25 00:00:00
Attachments: proposed 32-bit API calls for 64-bit constant divison
proposed 32-bit API calls for 64-bit constant divison

Description Andrew Robb 2008-09-09 13:57:38 UTC
consider the code fragment:

uint64_t slow(uint64_t x) {
  return x / 1220703125u;
}

This can be replaced by:

uint64_t fast(uint64_t x) {
  uint32_t a = ((x >> 32) * 1270091284u) >> 32;
  uint32_t b = ((x & 0xffffffffu) * 3777893186u) >> 32;
  return ((x >> 32) * 3777893186ull + a + b) >> 30;
}

The 'fast' code runs 50% faster than 'slow'. However, removing the redundant multiplies (see my earlier bug - fixed in 4.4 trunk) and tidying up storage, I can use the following assembler to run nearly 100% faster than 'slow':

        .p2align 4,,15
.globl _fast
        .def    _fast;  .scl    2;      .type   32;     .endef
_fast:
        pushl   %ebx
        movl    $1270091284, %eax
        mull    12(%esp)
        movl    $-517074110, %eax
        movl    %edx, %ebx
        mull    8(%esp)
        movl    $-517074110, %eax
        movl    %edx, %ecx
        mull    12(%esp)
        addl    %ebx, %ecx
        popl    %ebx
        adcl    $0, %edx
        addl    %ecx, %eax
        adcl    $0, %edx
        shrdl   $30, %edx, %eax
        shrl    $30, %edx
        ret

NOTE: the 2 multipliers are derived using 96-bit arithmetic:

d = 1220703125u

-517074110 = 0xffe12e1342u = (((1u << 94) + d - 1) / d) >> 32

1270091284 = (((1u << 94) + d - 1) / d) & 0xffffffffu
Comment 1 Andrew Robb 2008-09-16 00:17:33 UTC
I have tested my initial routine with more values and it turns out that the short-cut of omitting the least significant multiplication is wrong. I added the 4th multiply to generate the full 128-bit precision. I also took the multiplier used by gcc 4.2.3 on x86_64-linux-gnu (ubuntu) (4056481920730334085) and (28+64) bit shift to achieve the division by 1220703125u (largest 32-bit power of 5):

uint64_t fast(uint64_t x) {
  static const uint64_t fact = 4056481920730334085ull;
  uint32_t c = ((x & 0xffffffffu) * (fact & 0xffffffffu)) >> 32;
  uint64_t a = ((x & 0xffffffffu) * (fact >> 32) + c);
  /*
   * Split the addition of the two middle 64-bit intermediate results
   * into 2 stages to avoid possible overflow.
   * (not actually an issue with this value of 'fact')
   */
  uint32_t b = ((x >> 32) * (fact & 0xffffffffu) + (a & 0xffffffffu)) >> 32;
  return ((x >> 32) * (fact >> 32) + (a >> 32) + b) >> 28;
}

After hand coding the assembler on 32-bit x86, the extra multiply increased the run time from 15s to 17s - still much faster than 'slow' (64-bit divide API call).
Comment 2 Andrew Robb 2008-09-16 07:15:11 UTC
I changed the summary as it is equally applicable to all 64-bit constant divides on 32-bit x86 that are already compiled as multiplies on 64-bit x64. It might be worth implementing as another API call, replacing the 64-bit divisor with a 64-bit multiplier and a shift count.
Comment 3 Andrew Robb 2008-09-20 17:32:49 UTC
Created attachment 16369 [details]
proposed 32-bit API calls for 64-bit constant divison

I have attached a sample assembler code for a API calls for the 2 types of 64-bit division through multiplication:

mul0shift - for multiplier with 64 or fewer significant bits

mul1shift - for multiplier with 65 significant bits (implicit high bit)
Comment 4 Andrew Robb 2008-09-21 02:39:53 UTC
Created attachment 16370 [details]
proposed 32-bit API calls for 64-bit constant divison

The original attachment did not handle shift greater than 31.
Comment 5 Andrew Pinski 2021-08-15 11:28:58 UTC
Must be a cost model issue because I can get /10u working but not /1220703125u (in GCC 11+).