It looks like expmed.c::choose_multiplier(), which is responsible for finding parameters needed for replacing division by constant with mul+shift, sometimes fails to find optimal parameters. One example: A/1577682821 can be calculated by executing A*365384439 >> (27+32), for any 32-bit unsigned A. However, gcc-4.1.1 and all previous versions generate much more elaborate code. The below program demonstrates tis and also two more similar examples. It also checks that mul+shift works for any unsigned A, by testing all possibe values of A. #include <stdint.h> #include <stdio.h> unsigned v; void f9188_mul365384439_shift27(unsigned A) { v = A/(unsigned)1577682821; } void f9188_mul365384439_shift27_prime(unsigned A) { v = (((uint64_t)A)*(unsigned)365384439) >> (27+32); } void f8399_mul2283243215_shift29(unsigned A) { v = A/(unsigned)1009898111; } void f8399_mul2283243215_shift29_prime(unsigned A) { v = (((uint64_t)A)*(unsigned)2283243215) >> (29+32); } void f8267_mul2482476753_shift30(unsigned A) { v = A/(unsigned)1857695551; } void f8267_mul2482476753_shift30_prime(unsigned A) { v = (((uint64_t)A)*(unsigned)2482476753) >> (30+32); } /* Generated code is suboptimal (gcc 4.1.1): void f9188_mul365384439_shift27(unsigned A) { v = A/1577682821; } f9188_mul365384439_shift27: pushl %edi pushl %esi pushl %ebx movl 16(%esp), %ebx movl $1551183727, %ecx movl %ebx, %eax mull %ecx subl %edx, %ebx shrl %ebx leal (%ebx,%edx), %eax shrl $30, %eax movl %eax, v popl %ebx popl %esi popl %edi ret void f8399_mul2283243215_shift29(unsigned A) { v = A/1009898111; } f8399_mul2283243215_shift29: pushl %edi pushl %esi pushl %ebx movl 16(%esp), %ebx movl $271519133, %ecx movl %ebx, %eax mull %ecx subl %edx, %ebx shrl %ebx leal (%ebx,%edx), %eax shrl $29, %eax movl %eax, v popl %ebx popl %esi popl %edi ret void f8267_mul2482476753_shift30(unsigned A) { v = A/1857695551; } f8267_mul2482476753_shift30: pushl %edi pushl %esi pushl %ebx movl 16(%esp), %ebx movl $669986209, %ecx movl %ebx, %eax mull %ecx subl %edx, %ebx shrl %ebx leal (%ebx,%edx), %eax shrl $30, %eax movl %eax, v popl %ebx popl %esi popl %edi ret These operations can be done like this: f9188_mul365384439_shift27_prime: movl $365384439, %eax mull 4(%esp) movl %edx, %eax shrl $27, %eax movl %eax, v ret f8399_mul2283243215_shift29_prime: movl $-2011724081, %eax mull 4(%esp) movl %edx, %eax shrl $29, %eax movl %eax, v ret f8267_mul2482476753_shift30_prime: movl $-1812490543, %eax mull 4(%esp) movl %edx, %eax shrl $30, %eax movl %eax, v ret (Why there is that silly movl %edx, %eax - that's another good question...) The verification program is below. Was compiled with -Os (in order to force gcc to use div insn for a/b - we want to do true division for the purpose of correctness check!) and didn't report a failure. */ int main() { unsigned A=0,u,v; while(++A) { (A & 0xffff) || printf("A=%u\r",A); u = (((uint64_t)A)*(unsigned)365384439) >> (27+32); v = A / (unsigned)1577682821; if(u!=v) { printf("1: A=%u u=%u v=%u\n", A,u,v); return 0; } u = (((uint64_t)A)*(unsigned)2283243215) >> (29+32); v = A / (unsigned)1009898111; if(u!=v) { printf("2: A=%u u=%u v=%u\n", A,u,v); return 0; } u = (((uint64_t)A)*(unsigned)2482476753) >> (30+32); v =A / (unsigned)1857695551; if(u!=v) { printf("3: A=%u u=%u v=%u\n", A,u,v); return 0; } } puts(""); return 0; }

I didn't look too close at choose_multiplier(), yet. If anybody will work upon it in order to make it better, this is the routine which prints values 'K' and 'bits' so that (A*K)>>bits == A/B is true for given fixed B and all A's in range [0...max_A]. You may want to add similar code to choose_multiplier(). Mathematical explanation is in the comment. /* [below: 'div' is unsigned integer division] ['/' is real division with infinite precision] [A,B,C... - integers, a,b,c... - reals] Theory: we want to calculate A div B, fast. B is const > 2 and is not a power of 2. In fp: A/B = A*(L/B)/L. (If L is a large power of 2, say 2^32, then it could be done really fast). Let k := L/B, K := L div B + 1. Then A/B = A*k/L. Then this is true: A div B <= A * K div L. For small enough A: A div B = A * K div L. Lets find first A for which it is not true. Lets compare k/L and K/L. K/L is larger by a small value d: d := K/L - k/L = (L div B + 1) / L - L/B/L = = (L div B * B + B) / (L*B) - L/(L*B) = = (L div B * B + B - L) / (L*B) A*K/L is larger than A*k/L by A*d. A*k/L is closest to 'overflowing into next integer' when A = N*B-1. The 'deficit' with such A is exactly 1/B. If A*d >= 1/B, then A*K/L will 'overflow'. Thus bad_A >= 1/B / d = (1/B) * (L*B)/(L div B * B + B - L) = = L/(L div B * B + B - L). Then you need to 'walk up' to next A representable as N*B-1: bad_A2 = (bad_A div B) * B + B-1 This is the first A which will have wrong result (i.e. for which (A*K div L) = (A div B)+1, not (A div B). Practical use. Suppose that A and B are 32-bit unsigned integers. Unfortunately, there exist only two B values in 3..2^32-1 range for which _any_ 32-bit unsigned A can be fast divided using L=2^32 (because bad_A=2^32 and any A is less than that): B=641 K=6700417 d=1/2753074036736 bad_A=4294967296 A=unlimited B=6700417 K=641 d=1/28778071884562432 bad_A=4294967296 A=unlimited We need to use larger powers of 2 for L if we need to handle many more B's. */ void fastdiv_params(unsigned B, unsigned max_A) { unsigned K, d_LB, bits, max_bits; uint64_t L, KL, mask, bad_A, max_bad_A; if (!B || ((B-1)&B) == 0) { // B is a power of 2 //if(0) printf("B=%u: power of 2, division by shift\n", B); return; } L = (max_ull/max_unsigned - 1) * B; bits = 63; mask = 1ULL << 63; while( !(L & mask)) bits--, mask >>= 1; L = mask; while ( (KL = L/B + 1) > max_unsigned) bits--, L >>= 1; K = KL; // There is not much point in multiplying by even number // and then shifting right. Reduce K & L to avoid it: while (!(K & 1) && bits > 32) bits--, L >>= 1, K = L/B + 1; d_LB = ((L/B) * B + B - L); bad_A = L / d_LB; bad_A = (bad_A / B) * B + B-1; if (bad_A <= max_A) { printf("B=%u,A<=%u: no suitable fastdiv params found\n", B, max_A); return; } max_bits = bits; max_bad_A = bad_A; while(1) { uint64_t last_L = L; uint64_t last_bad_A = bad_A; unsigned last_K = K; bits--; L >>= 1; K = L/B + 1; d_LB = ((L/B) * B + B - L); bad_A = L / d_LB; bad_A = (bad_A / B) * B + B-1; if (bad_A <= max_A || bits < 32) { // new params are bad, report last ones and bail out //if(0) printf("B=%u,A<=%u: L=%llx(bits=%u) K=%u (bad_A=%llu, bad_A=%llu at bits=%u)\n", B, max_A, last_L, bits+1, last_K, last_bad_A, max_bad_A, max_bits); return; } } }

Pseudo-code to help in reading above code: void fastdiv_params(unsigned B, unsigned max_A) { L = (max_unsigned_long_long/max_unsigned - 1) * B; L = largest_pow2_smaller_or_eq_to(L); bits = log2(L); while ((K = L/B + 1) > max_unsigned) bits--, L >>= 1; // There is not much point in multiplying by even 'K' // and then shifting right (dividing by 2*n). Reduce K & L to avoid it: while (!(K & 1) && bits > 32) bits--, L >>= 1, K = L/B + 1; d_LB = ((L/B) * B + B - L); bad_A = L / d_LB; bad_A = (bad_A / B) * B + B-1; if (bad_A <= max_A) bail_out_no_params_exist; // 'K' and 'bits' params are already determined, can return them... /* return_params(K,bits); */ // ...but if max_A is smaller then max_unsigned, we can // optionally use smaller 'K' and 'bits'. Find smallest which // still work: max_bits = bits; max_bad_A = bad_A; while(1) { uint64_t last_L = L; unsigned last_K = K; bits--; L >>= 1; K = L/B + 1; d_LB = ((L/B) * B + B - L); bad_A = L / d_LB; bad_A = (bad_A / B) * B + B-1; if (bad_A <= max_A || bits < 32) // new params are bad, report last ones and bail out return_params(last_K,bits+1); } }

I instrumented choose_multiplier(). When called like this: choose_multiplier(d=1577682821,n=32,precision=32), it returns *post_shift_ptr=31,multiplier=5846151023 whereas optimal one is *post_shift_ptr=27,multiplier=365384439 In the code: "////" comments show variable values if d=1577682821. The instrumented version is below. Too bad there are no comments how the code works, what is mlow and mhigh. Note that the good value, if shifted left 4 times, 365384439<<4, is equal to 5846151024, which is mhigh+1! With mhigh=5846151024 "Reduce to lowest terms" part of code would reduce mhigh/mlow pair exactly to correct value of 365384439, I think... Isn't this a reason why choose_multiplier() doesn't consider 365384439 to be a valid multiplier? Maybe mhigh calcucation is off-by-one? unsigned HOST_WIDE_INT choose_multiplier (unsigned HOST_WIDE_INT d, int n, int precision, rtx *multiplier_ptr, int *post_shift_ptr, int *lgup_ptr) { HOST_WIDE_INT mhigh_hi, mlow_hi; unsigned HOST_WIDE_INT mhigh_lo, mlow_lo; int lgup, post_shift; int pow, pow2; unsigned HOST_WIDE_INT nl, dummy1; HOST_WIDE_INT nh, dummy2; static int divdebug = -1; if (divdebug == -1) divdebug = NULL!=getenv("DIV_DEBUG"); divdebug && printf("choose_multiplier(d=%llu,n=%i,precision=%i)\n",(unsigned long long)d,n,precision); //// d=1577682821,n=32,precision=32 /* lgup = ceil(log2(divisor)); */ lgup = ceil_log2 (d); //// 31 gcc_assert (lgup <= n); pow = n + lgup; //// 63 pow2 = n + lgup - precision; //// 31 /* We could handle this with some effort, but this case is much better handled directly with a scc insn, so rely on caller using that. */ gcc_assert (pow != 2 * HOST_BITS_PER_WIDE_INT); /* mlow = 2^(N + lgup)/d */ if (pow >= HOST_BITS_PER_WIDE_INT) //// yes { nh = (HOST_WIDE_INT) 1 << (pow - HOST_BITS_PER_WIDE_INT); //// 1<<31 nl = 0; } else { nh = 0; nl = (unsigned HOST_WIDE_INT) 1 << pow; } //// 1<<63 / d: mlow=5846151022 div_and_round_double (TRUNC_DIV_EXPR, 1, nl, nh, d, (HOST_WIDE_INT) 0, &mlow_lo, &mlow_hi, &dummy1, &dummy2); divdebug && printf("choose_multiplier: mlow=%llu\n",((unsigned long long)mlow_hi<<32)+mlow_lo); /* mhigh = (2^(N + lgup) + 2^N + lgup - precision)/d */ if (pow2 >= HOST_BITS_PER_WIDE_INT) //// no nh |= (HOST_WIDE_INT) 1 << (pow2 - HOST_BITS_PER_WIDE_INT); else nl |= (unsigned HOST_WIDE_INT) 1 << pow2; //// 1<<31 //// (1<<63+1<<31) / d: mhigh=5846151023 div_and_round_double (TRUNC_DIV_EXPR, 1, nl, nh, d, (HOST_WIDE_INT) 0, &mhigh_lo, &mhigh_hi, &dummy1, &dummy2); divdebug && printf("choose_multiplier: mhigh=%llu\n",((unsigned long long)mhigh_hi<<32)+mhigh_lo); gcc_assert (!mhigh_hi || nh - d < d); gcc_assert (mhigh_hi <= 1 && mlow_hi <= 1); /* Assert that mlow < mhigh. */ gcc_assert (mlow_hi < mhigh_hi || (mlow_hi == mhigh_hi && mlow_lo < mhigh_lo)); /* If precision == N, then mlow, mhigh exceed 2^N (but they do not exceed 2^(N+1)). */ /* Reduce to lowest terms. */ for (post_shift = lgup; post_shift > 0; post_shift--) { unsigned HOST_WIDE_INT ml_lo = (mlow_hi << (HOST_BITS_PER_WIDE_INT - 1)) | (mlow_lo >> 1); unsigned HOST_WIDE_INT mh_lo = (mhigh_hi << (HOST_BITS_PER_WIDE_INT - 1)) | (mhigh_lo >> 1); if (ml_lo >= mh_lo) //// if mlow/2 == mhigh/2 - bail out break; mlow_hi = 0; mlow_lo = ml_lo; mhigh_hi = 0; mhigh_lo = mh_lo; } *post_shift_ptr = post_shift; *lgup_ptr = lgup; if (n < HOST_BITS_PER_WIDE_INT) { unsigned HOST_WIDE_INT mask = ((unsigned HOST_WIDE_INT) 1 << n) - 1; *multiplier_ptr = GEN_INT (mhigh_lo & mask); divdebug && printf("choose_multiplier: *post_shift_ptr=%i,*lgup_ptr=%i,multiplier=%llu\n",*post_shift_ptr,*lgup_ptr,(unsigned long long)(mhigh_lo & mask) + ((unsigned long long)(mhigh_lo >= mask) << 32)); return mhigh_lo >= mask; } else { *multiplier_ptr = GEN_INT (mhigh_lo); divdebug && printf("choose_multiplier: *post_shift_ptr=%i,*lgup_ptr=%i,multiplier=%llu\n",*post_shift_ptr,*lgup_ptr,(unsigned long long)(mhigh_lo) + ((unsigned long long)(mhigh_hi) << 32)); return mhigh_hi; } }

Created attachment 11971 [details] Use alternative algorithm if it gives better results New algorithm lives in separate function gcc_fastdiv_params() which is called from choose_multiplier(). Then we run old algorithm. Then we compare results and if new algorithm gives different one (new result is always better or same than old), we use it.

Created attachment 11972 [details] Test program Compile with -O2 -S to see the assembly. Compile with -Os and run to check the correctness.

Patched versus stock gcc-4.1.1 on this code unsigned v; void f9188_mul365384439_shift27(unsigned A) { v = A/(unsigned)1577682821; } is: # diff -u suboptimal-411-O2.s suboptimal-411-O2-fixed.s --- suboptimal-411-O2.s 2006-07-29 19:34:09.000000000 +0200 +++ suboptimal-411-O2-fixed.s 2006-07-29 19:37:14.000000000 +0200 @@ -19,21 +19,10 @@ f9188_mul365384439_shift27: pushl %ebp movl %esp, %ebp - pushl %edi - pushl %esi - pushl %ebx - movl 8(%ebp), %ebx - movl $1551183727, %ecx - movl %ebx, %eax - mull %ecx - subl %edx, %ebx - shrl %ebx - leal (%ebx,%edx), %eax - shrl $30, %eax - movl %eax, v - popl %ebx - popl %esi - popl %edi + movl $365384439, %eax + mull 8(%ebp) + shrl $27, %edx + movl %edx, v leave ret ...

Created attachment 11973 [details] Random tester Randomly chooses a/b and compares "gcc patch code" results with code which was previously verified to be ok. Just my internal tool to check that I did not screw up while making a patch.

(In reply to comment #3) > Too bad there are no comments how the code works, what is mlow and mhigh. Tege has written a paper: http://swox.se/~tg/divcnst-pldi94.pdf Also, you might consider adding him to the CC list of this bug.

Thanks for the link to .pdf! Who's Tege? And his email address is ...? Ok. This new version of the patch is working for signed divisions too, and is giving the results which are always same or better than "old" code. If it would be deemed acceptable for mainline, it can be simply be integrated into choose_multiplier(). Since new code needs to know maximum possible dividend instead of (n,prec) tuple, and it can generate better code if dividend is known to be small enough (example: (n & 0xffff) / 10 will need only a mul, not mul+shift), it will make sense to propagate possible dividend range into choose_multiplier() intead of (n,prec). TODO later...

Created attachment 12000 [details] Alternative algorithm v. 2

(In reply to comment #9) > Thanks for the link to .pdf! Who's Tege? And his email address is ...? Torbjoern Granlund; homepage: http://swox.se/~tg/ . From MAINTAINERS: *synthetic multiply Torbjorn Granlund tege@swox.com

Created attachment 12005 [details] Alternative algorithm v. 3 Formatting fixes mostly

Subject: Re: suboptimal 'division by constant' optimization I looked briefly into the suggested change, without trying to assess its correctness. But since these sort of transformations are of a very mathematical nature, I trust the author of the suggested changes has proven the algorithms? I am aware of that Peter Montgomery's and my algorithms do not generate optimal sequences for v/C where C is close to the largest possible value of v. We deemed it to have very little practical use to optimize these cases. If it is decided that these cases are indeed important enough to require optimization, the choose_multiplier function should course be amended, of course.

>But since these sort of transformations are of >a very mathematical nature, I trust the author of the >suggested changes has proven the algorithms? I did not write a paper about it, but I think I have the proof and it is documented in the patch. I will happily ament that comment if it has any unclear spots. I don't want anydoby to have headaches over it :) > If it is decided that these cases are indeed important enough to require optimization, the choose_multiplier function should course be amended, of course. I think that we can get more gains if we will use new algorithm _and_ will propagate dividend value range info into choose_multiplier(). New algorithm (of should I say "modified", it is not that new) takes max_A parameter already, so it's trivial. For example, with value range propagation (n & 0xffff) / 10 will generate code with just a mul without shift, because there exist suitable K for L=2^32 -> we can just take higher half of multiply.