Bug 28417 - suboptimal 'division by constant' optimization
Summary: suboptimal 'division by constant' optimization
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.1.1
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, patch
Depends on:
Blocks: 16996
  Show dependency treegraph
 
Reported: 2006-07-18 08:34 UTC by Denis Vlasenko
Modified: 2010-07-09 10:35 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2009-06-06 15:08:07


Attachments
Use alternative algorithm if it gives better results (2.36 KB, patch)
2006-07-30 13:35 UTC, Denis Vlasenko
Details | Diff
Test program (876 bytes, text/plain)
2006-07-30 13:37 UTC, Denis Vlasenko
Details
Random tester (5.07 KB, text/plain)
2006-07-30 13:43 UTC, Denis Vlasenko
Details
Alternative algorithm v. 2 (2.47 KB, patch)
2006-08-02 19:05 UTC, Denis Vlasenko
Details | Diff
Alternative algorithm v. 3 (2.35 KB, patch)
2006-08-03 17:06 UTC, Denis Vlasenko
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Denis Vlasenko 2006-07-18 08:34:01 UTC
It looks like expmed.c::choose_multiplier(), which is responsible for finding parameters needed for replacing division by constant with mul+shift, sometimes fails to find optimal parameters.

One example: A/1577682821 can be calculated by executing A*365384439 >> (27+32), for any 32-bit unsigned A. However, gcc-4.1.1 and all previous versions generate much more elaborate code.

The below program demonstrates tis and also two more similar examples. It also checks that mul+shift works for any unsigned A, by testing all possibe values of A.

#include <stdint.h>
#include <stdio.h>

unsigned v;
void f9188_mul365384439_shift27(unsigned A) { v = A/(unsigned)1577682821; }
void f9188_mul365384439_shift27_prime(unsigned A) { v = (((uint64_t)A)*(unsigned)365384439) >> (27+32); }
void f8399_mul2283243215_shift29(unsigned A) { v = A/(unsigned)1009898111; }
void f8399_mul2283243215_shift29_prime(unsigned A) { v = (((uint64_t)A)*(unsigned)2283243215) >> (29+32); }
void f8267_mul2482476753_shift30(unsigned A) { v = A/(unsigned)1857695551; }
void f8267_mul2482476753_shift30_prime(unsigned A) { v = (((uint64_t)A)*(unsigned)2482476753) >> (30+32); }

/* Generated code is suboptimal (gcc 4.1.1):
void f9188_mul365384439_shift27(unsigned A) { v = A/1577682821; }
f9188_mul365384439_shift27:
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	movl	16(%esp), %ebx
	movl	$1551183727, %ecx
	movl	%ebx, %eax
	mull	%ecx
	subl	%edx, %ebx
	shrl	%ebx
	leal	(%ebx,%edx), %eax
	shrl	$30, %eax
	movl	%eax, v
	popl	%ebx
	popl	%esi
	popl	%edi
	ret
void f8399_mul2283243215_shift29(unsigned A) { v = A/1009898111; }
f8399_mul2283243215_shift29:
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	movl	16(%esp), %ebx
	movl	$271519133, %ecx
	movl	%ebx, %eax
	mull	%ecx
	subl	%edx, %ebx
	shrl	%ebx
	leal	(%ebx,%edx), %eax
	shrl	$29, %eax
	movl	%eax, v
	popl	%ebx
	popl	%esi
	popl	%edi
	ret
void f8267_mul2482476753_shift30(unsigned A) { v = A/1857695551; }
f8267_mul2482476753_shift30:
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	movl	16(%esp), %ebx
	movl	$669986209, %ecx
	movl	%ebx, %eax
	mull	%ecx
	subl	%edx, %ebx
	shrl	%ebx
	leal	(%ebx,%edx), %eax
	shrl	$30, %eax
	movl	%eax, v
	popl	%ebx
	popl	%esi
	popl	%edi
	ret

These operations can be done like this:
f9188_mul365384439_shift27_prime:
	movl	$365384439, %eax
	mull	4(%esp)
	movl	%edx, %eax
	shrl	$27, %eax
	movl	%eax, v
	ret
f8399_mul2283243215_shift29_prime:
	movl	$-2011724081, %eax
	mull	4(%esp)
	movl	%edx, %eax
	shrl	$29, %eax
	movl	%eax, v
	ret
f8267_mul2482476753_shift30_prime:
	movl	$-1812490543, %eax
	mull	4(%esp)
	movl	%edx, %eax
	shrl	$30, %eax
	movl	%eax, v
	ret

(Why there is that silly movl %edx, %eax - that's another good question...)

The verification program is below. Was compiled with -Os
(in order to force gcc to use div insn for a/b - we want to do true
division for the purpose of correctness check!)
and didn't report a failure.
*/

int main()
{
	unsigned A=0,u,v;
	while(++A) {
		(A & 0xffff) || printf("A=%u\r",A);

		u = (((uint64_t)A)*(unsigned)365384439) >> (27+32);
		v = A / (unsigned)1577682821;
		if(u!=v) { printf("1: A=%u u=%u v=%u\n", A,u,v); return 0; }

		u = (((uint64_t)A)*(unsigned)2283243215) >> (29+32);
		v = A / (unsigned)1009898111;
		if(u!=v) { printf("2: A=%u u=%u v=%u\n", A,u,v); return 0; }

		u = (((uint64_t)A)*(unsigned)2482476753) >> (30+32);
		v  =A / (unsigned)1857695551;
		if(u!=v) { printf("3: A=%u u=%u v=%u\n", A,u,v); return 0; }
	}
	puts("");
	return 0;
}
Comment 1 Denis Vlasenko 2006-07-18 08:40:23 UTC
I didn't look too close at choose_multiplier(), yet.
If anybody will work upon it in order to make it better,
this is the routine which prints values 'K' and 'bits'
so that
    (A*K)>>bits == A/B
is true for given fixed B and all A's in range [0...max_A].
You may want to add similar code to choose_multiplier().

Mathematical explanation is in the comment.

/*
[below: 'div' is unsigned integer division]
['/' is real division with infinite precision]
[A,B,C... - integers, a,b,c... - reals]

Theory: we want to calculate A div B, fast.
B is const > 2 and is not a power of 2.

In fp: A/B = A*(L/B)/L. (If L is a large power of 2,
say 2^32, then it could be done really fast).
Let k := L/B, K := L div B + 1.
Then A/B = A*k/L.

Then this is true:

A div B <= A * K div L.

For small enough A: A div B = A * K div L.
Lets find first A for which it is not true.

Lets compare k/L and K/L. K/L is larger by a small value d:

d := K/L - k/L = (L div B + 1) / L - L/B/L =
= (L div B * B + B) / (L*B) - L/(L*B) =
= (L div B * B + B - L) / (L*B)

A*K/L is larger than A*k/L by A*d.

A*k/L is closest to 'overflowing into next integer'
when A = N*B-1. The 'deficit' with such A is exactly 1/B.
If A*d >= 1/B, then A*K/L will 'overflow'.

Thus bad_A >= 1/B / d = (1/B) * (L*B)/(L div B * B + B - L) =
= L/(L div B * B + B - L). Then you need to 'walk up' to next
A representable as N*B-1: bad_A2 = (bad_A div B) * B + B-1
This is the first A which will have wrong result
(i.e. for which (A*K div L) = (A div B)+1, not (A div B).

Practical use.

Suppose that A and B are 32-bit unsigned integers.

Unfortunately, there exist only two B values in 3..2^32-1 range
for which _any_ 32-bit unsigned A can be fast divided using L=2^32
(because bad_A=2^32 and any A is less than that):

B=641 K=6700417 d=1/2753074036736 bad_A=4294967296 A=unlimited
B=6700417 K=641 d=1/28778071884562432 bad_A=4294967296 A=unlimited

We need to use larger powers of 2 for L if we need to handle
many more B's.
*/

void fastdiv_params(unsigned B, unsigned max_A)
{
	unsigned K, d_LB, bits, max_bits;
	uint64_t L, KL, mask, bad_A, max_bad_A;

	if (!B || ((B-1)&B) == 0) { // B is a power of 2
//if(0)
		printf("B=%u: power of 2, division by shift\n", B);
		return;
	}
	L = (max_ull/max_unsigned - 1) * B;
	bits = 63;
	mask = 1ULL << 63;
	while( !(L & mask))
		bits--, mask >>= 1;
	L = mask;

	while ( (KL = L/B + 1) > max_unsigned)
		bits--, L >>= 1;
	K = KL;
	// There is not much point in multiplying by even number
	// and then shifting right. Reduce K & L to avoid it:
	while (!(K & 1) && bits > 32)
		bits--, L >>= 1, K = L/B + 1;

	d_LB = ((L/B) * B + B - L);
	bad_A = L / d_LB;
	bad_A = (bad_A / B) * B + B-1;
	if (bad_A <= max_A) {
		printf("B=%u,A<=%u: no suitable fastdiv params found\n", B, max_A);
		return;
	}
	max_bits = bits;
	max_bad_A = bad_A;
	while(1) {
		uint64_t last_L = L;
		uint64_t last_bad_A = bad_A;
		unsigned last_K = K;
		bits--;
		L >>= 1;
		K = L/B + 1;
		d_LB = ((L/B) * B + B - L);
		bad_A = L / d_LB;
		bad_A = (bad_A / B) * B + B-1;
		if (bad_A <= max_A || bits < 32) {
			// new params are bad, report last ones and bail out
//if(0)
			printf("B=%u,A<=%u: L=%llx(bits=%u) K=%u  (bad_A=%llu, bad_A=%llu at bits=%u)\n",
				B, max_A, last_L, bits+1, last_K, last_bad_A,  max_bad_A, max_bits);
			return;
		}
	}
}
Comment 2 Denis Vlasenko 2006-07-18 12:06:56 UTC
Pseudo-code to help in reading above code:

void fastdiv_params(unsigned B, unsigned max_A)
{
        L = (max_unsigned_long_long/max_unsigned - 1) * B;
        L = largest_pow2_smaller_or_eq_to(L);
        bits = log2(L);
        while ((K = L/B + 1) > max_unsigned)
                bits--, L >>= 1;
        // There is not much point in multiplying by even 'K'
        // and then shifting right (dividing by 2*n). Reduce K & L to avoid it:
        while (!(K & 1) && bits > 32)
                bits--, L >>= 1, K = L/B + 1;

        d_LB = ((L/B) * B + B - L);
        bad_A = L / d_LB;
        bad_A = (bad_A / B) * B + B-1;
        if (bad_A <= max_A)
                bail_out_no_params_exist;

        // 'K' and 'bits' params are already determined, can return them...
        /* return_params(K,bits); */

        // ...but if max_A is smaller then max_unsigned, we can
        // optionally use smaller 'K' and 'bits'. Find smallest which
        // still work:
        max_bits = bits;
        max_bad_A = bad_A;
        while(1) {
                uint64_t last_L = L;
                unsigned last_K = K;
                bits--;
                L >>= 1;
                K = L/B + 1;
                d_LB = ((L/B) * B + B - L);
                bad_A = L / d_LB;
                bad_A = (bad_A / B) * B + B-1;
                if (bad_A <= max_A || bits < 32)
                        // new params are bad, report last ones and bail out
                        return_params(last_K,bits+1);
        }
}
Comment 3 Denis Vlasenko 2006-07-19 16:24:26 UTC
I instrumented choose_multiplier(). When called like this:
choose_multiplier(d=1577682821,n=32,precision=32),
it returns *post_shift_ptr=31,multiplier=5846151023

whereas optimal one is *post_shift_ptr=27,multiplier=365384439
In the code: "////" comments show variable values 
if d=1577682821.


The instrumented version is below.
Too bad there are no comments how the code works, what is mlow and mhigh.

Note that the good value, if shifted left 4 times,
365384439<<4, is equal to 5846151024, which is mhigh+1!

With mhigh=5846151024 "Reduce to lowest terms" part of code would reduce
mhigh/mlow pair exactly to correct value of 365384439, I think...

Isn't this a reason why choose_multiplier() doesn't consider 365384439
to be a valid multiplier? Maybe mhigh calcucation is off-by-one?

unsigned HOST_WIDE_INT

choose_multiplier (unsigned HOST_WIDE_INT d, int n, int precision,

		   rtx *multiplier_ptr, int *post_shift_ptr, int *lgup_ptr)

{

  HOST_WIDE_INT mhigh_hi, mlow_hi;

  unsigned HOST_WIDE_INT mhigh_lo, mlow_lo;

  int lgup, post_shift;

  int pow, pow2;

  unsigned HOST_WIDE_INT nl, dummy1;

  HOST_WIDE_INT nh, dummy2;



static int divdebug = -1;



if (divdebug == -1)

    divdebug = NULL!=getenv("DIV_DEBUG");



divdebug && printf("choose_multiplier(d=%llu,n=%i,precision=%i)\n",(unsigned long long)d,n,precision);

//// d=1577682821,n=32,precision=32

  /* lgup = ceil(log2(divisor)); */

  lgup = ceil_log2 (d);	//// 31

  gcc_assert (lgup <= n);



  pow = n + lgup; //// 63

  pow2 = n + lgup - precision; //// 31



  /* We could handle this with some effort, but this case is much

     better handled directly with a scc insn, so rely on caller using

     that.  */

  gcc_assert (pow != 2 * HOST_BITS_PER_WIDE_INT);



  /* mlow = 2^(N + lgup)/d */

 if (pow >= HOST_BITS_PER_WIDE_INT) //// yes

    {

      nh = (HOST_WIDE_INT) 1 << (pow - HOST_BITS_PER_WIDE_INT); //// 1<<31

      nl = 0;

    }

  else

    {

      nh = 0;

      nl = (unsigned HOST_WIDE_INT) 1 << pow;

    }

//// 1<<63 / d: mlow=5846151022

  div_and_round_double (TRUNC_DIV_EXPR, 1, nl, nh, d, (HOST_WIDE_INT) 0,

			&mlow_lo, &mlow_hi, &dummy1, &dummy2);

divdebug && printf("choose_multiplier: mlow=%llu\n",((unsigned long long)mlow_hi<<32)+mlow_lo);



  /* mhigh = (2^(N + lgup) + 2^N + lgup - precision)/d */

  if (pow2 >= HOST_BITS_PER_WIDE_INT) //// no

    nh |= (HOST_WIDE_INT) 1 << (pow2 - HOST_BITS_PER_WIDE_INT);

  else

    nl |= (unsigned HOST_WIDE_INT) 1 << pow2; //// 1<<31

//// (1<<63+1<<31) / d: mhigh=5846151023

  div_and_round_double (TRUNC_DIV_EXPR, 1, nl, nh, d, (HOST_WIDE_INT) 0,

			&mhigh_lo, &mhigh_hi, &dummy1, &dummy2);

divdebug && printf("choose_multiplier: mhigh=%llu\n",((unsigned long long)mhigh_hi<<32)+mhigh_lo);



  gcc_assert (!mhigh_hi || nh - d < d);

  gcc_assert (mhigh_hi <= 1 && mlow_hi <= 1);

  /* Assert that mlow < mhigh.  */

  gcc_assert (mlow_hi < mhigh_hi

	      || (mlow_hi == mhigh_hi && mlow_lo < mhigh_lo));



  /* If precision == N, then mlow, mhigh exceed 2^N

     (but they do not exceed 2^(N+1)).  */



  /* Reduce to lowest terms.  */

  for (post_shift = lgup; post_shift > 0; post_shift--)

    {

      unsigned HOST_WIDE_INT ml_lo = (mlow_hi << (HOST_BITS_PER_WIDE_INT - 1)) | (mlow_lo >> 1);

      unsigned HOST_WIDE_INT mh_lo = (mhigh_hi << (HOST_BITS_PER_WIDE_INT - 1)) | (mhigh_lo >> 1);

      if (ml_lo >= mh_lo) //// if mlow/2 == mhigh/2 - bail out

	break;



      mlow_hi = 0;

      mlow_lo = ml_lo;

      mhigh_hi = 0;

      mhigh_lo = mh_lo;

    }



  *post_shift_ptr = post_shift;

  *lgup_ptr = lgup;

  if (n < HOST_BITS_PER_WIDE_INT)

    {

      unsigned HOST_WIDE_INT mask = ((unsigned HOST_WIDE_INT) 1 << n) - 1;

      *multiplier_ptr = GEN_INT (mhigh_lo & mask);

divdebug && printf("choose_multiplier: *post_shift_ptr=%i,*lgup_ptr=%i,multiplier=%llu\n",*post_shift_ptr,*lgup_ptr,(unsigned long long)(mhigh_lo & mask) + ((unsigned long long)(mhigh_lo >= mask) << 32));

      return mhigh_lo >= mask;

    }

  else

    {

      *multiplier_ptr = GEN_INT (mhigh_lo);

divdebug && printf("choose_multiplier: *post_shift_ptr=%i,*lgup_ptr=%i,multiplier=%llu\n",*post_shift_ptr,*lgup_ptr,(unsigned long long)(mhigh_lo) + ((unsigned long long)(mhigh_hi) << 32));

      return mhigh_hi;

    }

}
Comment 4 Denis Vlasenko 2006-07-30 13:35:18 UTC
Created attachment 11971 [details]
Use alternative algorithm if it gives better results

New algorithm lives in separate function gcc_fastdiv_params() which is called from choose_multiplier(). Then we run old algorithm. Then we compare results and if new algorithm gives different one (new result is always better or same than old), we use it.
Comment 5 Denis Vlasenko 2006-07-30 13:37:33 UTC
Created attachment 11972 [details]
Test program

Compile with -O2 -S to see the assembly.
Compile with -Os and run to check the correctness.
Comment 6 Denis Vlasenko 2006-07-30 13:38:19 UTC
Patched versus stock gcc-4.1.1 on this code

unsigned v;
void f9188_mul365384439_shift27(unsigned A) { v = A/(unsigned)1577682821; }

is:

# diff -u suboptimal-411-O2.s suboptimal-411-O2-fixed.s
--- suboptimal-411-O2.s 2006-07-29 19:34:09.000000000 +0200
+++ suboptimal-411-O2-fixed.s   2006-07-29 19:37:14.000000000 +0200
@@ -19,21 +19,10 @@
 f9188_mul365384439_shift27:
        pushl   %ebp
        movl    %esp, %ebp
-       pushl   %edi
-       pushl   %esi
-       pushl   %ebx
-       movl    8(%ebp), %ebx
-       movl    $1551183727, %ecx
-       movl    %ebx, %eax
-       mull    %ecx
-       subl    %edx, %ebx
-       shrl    %ebx
-       leal    (%ebx,%edx), %eax
-       shrl    $30, %eax
-       movl    %eax, v
-       popl    %ebx
-       popl    %esi
-       popl    %edi
+       movl    $365384439, %eax
+       mull    8(%ebp)
+       shrl    $27, %edx
+       movl    %edx, v
        leave
        ret
...
Comment 7 Denis Vlasenko 2006-07-30 13:43:10 UTC
Created attachment 11973 [details]
Random tester

Randomly chooses a/b and compares "gcc patch code" results with code which was previously verified to be ok. Just my internal tool to check that I did not screw up while making a patch.
Comment 8 Jorn Wolfgang Rennecke 2006-07-31 14:39:08 UTC
(In reply to comment #3)
> Too bad there are no comments how the code works, what is mlow and mhigh.

Tege has written a paper: http://swox.se/~tg/divcnst-pldi94.pdf

Also, you might consider adding him to the CC list of this bug.
Comment 9 Denis Vlasenko 2006-08-02 19:05:42 UTC
Thanks for the link to .pdf! Who's Tege? And his email address is ...?

Ok. This new version of the patch is working for signed divisions too, and is giving the results which are always same or better than "old" code.

If it would be deemed acceptable for mainline, it can be simply be integrated into choose_multiplier().

Since new code needs to know maximum possible dividend instead of (n,prec) tuple, and it can generate better code if dividend is known to be small enough
(example: (n & 0xffff) / 10 will need only a mul, not mul+shift), it will make sense to propagate possible dividend range into choose_multiplier() intead of (n,prec). TODO later...
Comment 10 Denis Vlasenko 2006-08-02 19:05:47 UTC
Created attachment 12000 [details]
Alternative algorithm v. 2
Comment 11 Jorn Wolfgang Rennecke 2006-08-02 19:34:35 UTC
(In reply to comment #9)
> Thanks for the link to .pdf! Who's Tege? And his email address is ...?

Torbjoern Granlund; homepage:  http://swox.se/~tg/ .  From MAINTAINERS:
*synthetic multiply     Torbjorn Granlund       tege@swox.com
Comment 12 Denis Vlasenko 2006-08-03 17:06:19 UTC
Created attachment 12005 [details]
Alternative algorithm v. 3

Formatting fixes mostly
Comment 13 tege 2006-08-04 11:57:33 UTC
Subject: Re:  suboptimal 'division by constant' optimization

I looked briefly into the suggested change, without trying to
assess its correctness.  But since these sort of transformations
are of a very mathematical nature, I trust the author of the
suggested changes has proven the algorithms?

I am aware of that Peter Montgomery's and my algorithms do not
generate optimal sequences for v/C where C is close to the
largest possible value of v.  We deemed it to have very little
practical use to optimize these cases.

If it is decided that these cases are indeed important enough to
require optimization, the choose_multiplier function should
course be amended, of course.

Comment 14 Denis Vlasenko 2006-08-04 18:14:04 UTC
>But since these sort of transformations are of
>a very mathematical nature, I trust the author of the
>suggested changes has proven the algorithms?

I did not write a paper about it, but I think I have
the proof and it is documented in the patch.

I will happily ament that comment if it has any unclear spots.
I don't want anydoby to have headaches over it :)

> If it is decided that these cases are indeed important enough to
require optimization, the choose_multiplier function should
course be amended, of course.

I think that we can get more gains if we will use new algorithm
_and_ will propagate dividend value range info into
choose_multiplier(). New algorithm (of should I say "modified",
it is not that new) takes max_A parameter already, so it's trivial.

For example, with value range propagation (n & 0xffff) / 10
will generate code with just a mul without shift, because there exist
suitable K for L=2^32 -> we can just take higher half of multiply.