Consider the code fragment:

uint64_t slow(uint64_t x)
{
    return x / 1220703125u;
}

This can be replaced by:

uint64_t fast(uint64_t x)
{
    uint32_t a = ((x >> 32) * 1270091284u) >> 32;
    uint32_t b = ((x & 0xffffffffu) * 3777893186u) >> 32;
    return ((x >> 32) * 3777893186ull + a + b) >> 30;
}

The 'fast' code runs 50% faster than 'slow'.

However, removing the redundant multiplies (see my earlier bug - fixed in
4.4 trunk) and tidying up storage, I can use the following assembler to run
nearly 100% faster than 'slow':

        .p2align 4,,15
        .globl  _fast
        .def    _fast;  .scl    2;      .type   32;     .endef
_fast:
        pushl   %ebx
        movl    $1270091284, %eax
        mull    12(%esp)
        movl    $-517074110, %eax
        movl    %edx, %ebx
        mull    8(%esp)
        movl    $-517074110, %eax
        movl    %edx, %ecx
        mull    12(%esp)
        addl    %ebx, %ecx
        popl    %ebx
        adcl    $0, %edx
        addl    %ecx, %eax
        adcl    $0, %edx
        shrdl   $30, %edx, %eax
        shrl    $30, %edx
        ret

NOTE: the 2 multipliers are derived using 96-bit arithmetic:

    d = 1220703125u
    3777893186 = 0xe12e1342u = (((1u << 94) + d - 1) / d) >> 32
                 (written as the signed immediate -517074110 in the assembler)
    1270091284 = (((1u << 94) + d - 1) / d) & 0xffffffffu
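
For reference, the derivation in the NOTE can be reproduced directly. The
following standalone program is my own sketch, using GCC's unsigned __int128
for the 96-bit arithmetic, and prints both multiplier words:

#include <inttypes.h>
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t d = 1220703125u;   /* 5^13 */
    /* ceil(2^94 / d), computed in 128-bit arithmetic */
    unsigned __int128 m = (((unsigned __int128)1 << 94) + d - 1) / d;

    printf("high word: %" PRIu32 " (0x%" PRIx32 ")\n",
           (uint32_t)(m >> 32), (uint32_t)(m >> 32));
    printf("low word:  %" PRIu32 " (0x%" PRIx32 ")\n",
           (uint32_t)m, (uint32_t)m);
    return 0;
}

The high word prints as 3777893186 (0xe12e1342), which is the value the
assembler spells as the signed immediate -517074110.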
I have tested my initial routine with more values, and it turns out that the
short-cut of omitting the least significant multiplication is wrong. I added
the 4th multiply to generate the full 128-bit precision. I also took the
multiplier used by gcc 4.2.3 on x86_64-linux-gnu (ubuntu), 4056481920730334085,
and a (28+64)-bit shift to achieve the division by 1220703125u (the largest
power of 5 that fits in 32 bits):

uint64_t fast(uint64_t x)
{
    static const uint64_t fact = 4056481920730334085ull;
    uint32_t c = ((x & 0xffffffffu) * (fact & 0xffffffffu)) >> 32;
    uint64_t a = ((x & 0xffffffffu) * (fact >> 32) + c);
    /*
     * Split the addition of the two middle 64-bit intermediate results
     * into 2 stages to avoid possible overflow.
     * (not actually an issue with this value of 'fact')
     */
    uint32_t b = ((x >> 32) * (fact & 0xffffffffu) + (a & 0xffffffffu)) >> 32;
    return ((x >> 32) * (fact >> 32) + (a >> 32) + b) >> 28;
}

After hand coding the assembler on 32-bit x86, the extra multiply increased
the run time from 15s to 17s - still much faster than 'slow' (a 64-bit divide
API call).
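
As a sanity check on the fixed routine, something along these lines (my own
throwaway harness, not part of this report) exercises random inputs plus the
quotient-bucket boundaries, which is where a truncated multiplier first goes
wrong:

#include <stdio.h>
#include <stdint.h>

extern uint64_t fast(uint64_t x);   /* the routine above */

/* Marsaglia xorshift64 -- any quick PRNG with full 64-bit output will do */
static uint64_t rng = 88172645463325252ull;
static uint64_t next(void)
{
    rng ^= rng << 13;
    rng ^= rng >> 7;
    rng ^= rng << 17;
    return rng;
}

int main(void)
{
    const uint64_t d = 1220703125u;
    for (long i = 0; i < 10000000L; i++) {
        uint64_t x = next();
        /* also probe the quotient step either side of x; the wraparound
           to UINT64_MAX when x < d is harmless, just another test value */
        uint64_t p[3] = { x, (x / d) * d, (x / d) * d - 1 };
        for (int j = 0; j < 3; j++)
            if (fast(p[j]) != p[j] / d) {
                printf("FAIL at %llu\n", (unsigned long long)p[j]);
                return 1;
            }
    }
    printf("ok\n");
    return 0;
}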
I changed the summary, as this applies equally to all 64-bit divisions by a
constant on 32-bit x86 that are already compiled as multiplies on 64-bit
x86_64. It might be worth implementing this as another API call, replacing the
64-bit divisor with a 64-bit multiplier and a shift count.
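
To make the proposed interface concrete, here is one possible C rendering of
such a call (the name __umuldi3_shift and the portable body are my own
illustration, not the actual proposal). The compiler would precompute 'mult'
and 'shift' from the constant divisor, as it already does on x86_64, and emit
a call to this instead of the generic divide:

#include <stdint.h>

/* Hypothetical helper: returns the high 64 bits of the 128-bit product
   x * mult, shifted right by the residual count 'shift'. */
uint64_t __umuldi3_shift(uint64_t x, uint64_t mult, unsigned shift)
{
    uint32_t xlo = (uint32_t)x,    xhi = (uint32_t)(x >> 32);
    uint32_t mlo = (uint32_t)mult, mhi = (uint32_t)(mult >> 32);

    /* 4-multiply schoolbook high word, as in the C routine above */
    uint64_t mid = (uint64_t)xlo * mhi + (((uint64_t)xlo * mlo) >> 32);
    uint32_t b   = (uint32_t)(((uint64_t)xhi * mlo + (uint32_t)mid) >> 32);

    return ((uint64_t)xhi * mhi + (mid >> 32) + b) >> shift;
}

For the running example, fast(x) above is then
__umuldi3_shift(x, 4056481920730334085ull, 28).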
Created attachment 16369 [details]
proposed 32-bit API calls for 64-bit constant division

I have attached sample assembler code for API calls for the 2 types of 64-bit
division through multiplication:

mul0shift - for multipliers with 64 or fewer significant bits
mul1shift - for multipliers with 65 significant bits (implicit high bit)
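
For clarity, this is my reading of the two entry points' contracts, written as
portable C against GCC's unsigned __int128 (so it only compiles on 64-bit
hosts); the bodies below are a specification of the intended semantics, not
the attached 32-bit assembler:

#include <stdint.h>

/* multiplier has 64 or fewer significant bits:
   q = (x * mult) >> (64 + shift) */
uint64_t mul0shift(uint64_t x, uint64_t mult, unsigned shift)
{
    return (uint64_t)(((unsigned __int128)x * mult) >> 64) >> shift;
}

/* multiplier has 65 significant bits, i.e. is really 2^64 + mult:
   q = (x * (2^64 + mult)) >> (64 + shift).  The sum x + mulhi(x, mult)
   can carry out of 64 bits, so halve before the final shift; this is
   exact because mulhi(x, mult) <= x, and the 65-bit case always has
   shift >= 1. */
uint64_t mul1shift(uint64_t x, uint64_t mult, unsigned shift)
{
    uint64_t h = (uint64_t)(((unsigned __int128)x * mult) >> 64);
    return (((x - h) >> 1) + h) >> (shift - 1);
}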
Created attachment 16370 [details]
proposed 32-bit API calls for 64-bit constant division

The original attachment did not handle shifts greater than 31.
Must be a cost-model issue, because (in GCC 11+) I can get /10u converted to a
multiply but not /1220703125u.
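
A minimal pair to reproduce this (my test case, not from the report): compile
with -m32 -O2 and compare the generated code for the two functions. Per the
observation above, GCC 11+ expands the first division via a multiply but emits
a call to the libgcc divide helper (__udivdi3) for the second.

#include <stdint.h>

uint64_t div_10(uint64_t x)    { return x / 10u; }
uint64_t div_5to13(uint64_t x) { return x / 1220703125u; }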