Summary: | fast 64-bit divide by constant on 32-bit platform | ||
---|---|---|---|
Product: | gcc | Reporter: | Andrew Robb <ajrobb57> |
Component: | middle-end | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | NEW --- | ||
Severity: | enhancement | CC: | gcc-bugs |
Priority: | P3 | Keywords: | missed-optimization |
Version: | 4.2.3 | ||
Target Milestone: | --- | ||
Host: | Target: | i686-pc-cygwin | |
Build: | Known to work: | ||
Known to fail: | Last reconfirmed: | 2021-07-25 00:00:00 | |
Attachments: |
proposed 32-bit API calls for 64-bit constant divison
proposed 32-bit API calls for 64-bit constant divison |
Description
Andrew Robb
2008-09-09 13:57:38 UTC
I have tested my initial routine with more values and it turns out that the short-cut of omitting the least significant multiplication is wrong. I added the 4th multiply to generate the full 128-bit precision. I also took the multiplier used by gcc 4.2.3 on x86_64-linux-gnu (ubuntu) (4056481920730334085) and (28+64) bit shift to achieve the division by 1220703125u (largest 32-bit power of 5): uint64_t fast(uint64_t x) { static const uint64_t fact = 4056481920730334085ull; uint32_t c = ((x & 0xffffffffu) * (fact & 0xffffffffu)) >> 32; uint64_t a = ((x & 0xffffffffu) * (fact >> 32) + c); /* * Split the addition of the two middle 64-bit intermediate results * into 2 stages to avoid possible overflow. * (not actually an issue with this value of 'fact') */ uint32_t b = ((x >> 32) * (fact & 0xffffffffu) + (a & 0xffffffffu)) >> 32; return ((x >> 32) * (fact >> 32) + (a >> 32) + b) >> 28; } After hand coding the assembler on 32-bit x86, the extra multiply increased the run time from 15s to 17s - still much faster than 'slow' (64-bit divide API call). I changed the summary as it is equally applicable to all 64-bit constant divides on 32-bit x86 that are already compiled as multiplies on 64-bit x64. It might be worth implementing as another API call, replacing the 64-bit divisor with a 64-bit multiplier and a shift count. Created attachment 16369 [details]
proposed 32-bit API calls for 64-bit constant divison
I have attached a sample assembler code for a API calls for the 2 types of 64-bit division through multiplication:
mul0shift - for multiplier with 64 or fewer significant bits
mul1shift - for multiplier with 65 significant bits (implicit high bit)
Created attachment 16370 [details]
proposed 32-bit API calls for 64-bit constant divison
The original attachment did not handle shift greater than 31.
Must be a cost model issue because I can get /10u working but not /1220703125u (in GCC 11+). |