Bug 64308 - Missed optimization: 64-bit divide used when 32-bit divide would work
Summary: Missed optimization: 64-bit divide used when 32-bit divide would work
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.9.2
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2014-12-14 18:39 UTC by Tavian Barnes
Modified: 2022-01-28 00:52 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
Test case (183 bytes, text/plain)
2014-12-14 18:39 UTC, Tavian Barnes
Details

Description Tavian Barnes 2014-12-14 18:39:05 UTC
Created attachment 34280 [details]
Test case

The following is a fairly typical implementation of exponentiation modulo m:

$ cat ipowm.c
// Computes (b**e) % m
unsigned int
ipowm(unsigned int b, unsigned int e, unsigned int m)
{
  unsigned int ret;
  b %= m;
  for (ret = m > 1; e; e >>= 1) {
    if ((e & 1) == 1) {
      ret = (unsigned long long)ret * b % m;
    }
    b = (unsigned long long)b * b % m;
  }
  return ret;
}

Unfortunately, GCC emits a 64-bit multiply and divide for both "... * b % m" expressions on x86 and x86-64, where a 32-bit multiply and divide would be equivalent and faster.

$ gcc -std=c11 -O3 -Wall -S -save-temps ipowm.c
$ cat ipowm.s
...
	imulq	%rdi, %rax
	divq	%rcx
...
	imulq	%rdi, %rax
	divq	%rcx
...

The pattern

	mull	%edi
	divl	%ecx

would be much faster.  They're equivalent because b is always reduced mod m, so b < m and therefore (for any unsigned int x), x * b / m <= x * m / m == x, thus the quotient will always fit in 32 bits.
Comment 1 Richard Biener 2014-12-15 08:55:45 UTC
Not sure if your analysis is correct, but value-range propagation computes

ret_4: VARYING
...
_12: [0, 4294967295]
_13: [0, 4294967295]
_14: [0, 18446744065119617025]
_15: [0, 4294967295]
_16: [0, +INF]

for

Folding statement: _12 = (long long unsigned int) ret_4;
Not folded
Folding statement: _13 = (long long unsigned int) b_1;
Not folded
Folding statement: _14 = _12 * _13;
Not folded
Folding statement: _15 = (long long unsigned int) m_6(D);
Not folded
Folding statement: _16 = _14 % _15;

so its analysis needs to be improved (didn't investigate exactly where it
fails, but IIRC modulo is not handled here because m may be zero).
Comment 2 Marc Glisse 2014-12-15 10:34:16 UTC
You would need symbolic ranges, b and ret are in [0,m-1]. And then you are using that very specific x86 instruction that divides 64 bits by 32 bits but only works if the result fits in 32 bits. It works here because (m-1)*(m-1)/m<m is small enough, but that's very hard for the compiler to prove, and I don't know of another architecture with a similar instruction.

(Related: PR58897, PR53100)
Comment 3 Tavian Barnes 2014-12-16 23:04:51 UTC
@Richard Biener: Yes, the range for _16 could be [0, 4294967294].  Why can't VRP assume division by zero doesn't occur?  If it can, then it could say anything mod [a, b] fits in [0, b - 1].

That's a reasonable improvement by itself, but it's not enough to optimize this PR, because to use divl for (ret * b % m) you also need (ret * b / m) to fit in [0, 4294967295].  And to know that, as Marc Glisse suggests, you'd need symbolic ranges.

@Marc Glisse: Is there currently no support at all for symbolic ranges?  If you can infer that b < m is an invariant then that's all you need.  Formally it's something like this:

If x, y, and z are 32-bit unsigned integers, and x <= z || y <= z, then

    (uint64_t)x * (uint64_t)y % z

can be computed with mull and divl because x * y / z is always <= max(x, y) which fits in 32 bits.
Comment 4 Oleg Endo 2014-12-20 01:29:28 UTC
(In reply to Marc Glisse from comment #2)
> You would need symbolic ranges, b and ret are in [0,m-1]. And then you are
> using that very specific x86 instruction that divides 64 bits by 32 bits but
> only works if the result fits in 32 bits. It works here because
> (m-1)*(m-1)/m<m is small enough, but that's very hard for the compiler to
> prove, and I don't know of another architecture with a similar instruction.

On SH a 64/32 -> 32 bit unsigned division can be done by repeating rotcl,div1 32 times (once for each result bit).  It's one of the examples for the 1-step division insn 'div1' in the manuals.  Effectively it's the same as the x86 divl insn.