First Last Prev Next    No search results available      Search page      Enter new bug
Bug#: 28417
Product:  
Component:  
Status: NEW
Resolution:
Assigned To: Not yet assigned to anyone <unassigned@gcc.gnu.org>
Host:
Reported against  
Priority:  
Severity:  
Target Milestone:  
 
 
Target:
Reporter: Denis Vlasenko <vda.linux@googlemail.com>
Add CC:
CC:
Remove selected CCs
Build:
URL:
Summary:
Keywords:
Known to work:
Known to fail:

Attachment Description Type Created Size Actions
gcc-4.1.1.new.diff Use alternative algorithm if it gives better results patch 2006-07-30 13:35 2.36 KB Edit | Diff
suboptimal.c Test program text/plain 2006-07-30 13:37 876 bytes Edit
find_fast_div_random_compare_with_gcc_patch.c Random tester text/plain 2006-07-30 13:43 5.07 KB Edit
gcc-4.1.1.new.diff Alternative algorithm v. 2 patch 2006-08-02 19:05 2.47 KB Edit | Diff
gcc-4.1.1.new2.diff Alternative algorithm v. 3 patch 2006-08-03 17:06 2.35 KB Edit | Diff
Create a New Attachment (proposed patch, testcase, etc.) View All

Bug 28417 depends on: Show dependency tree
Show dependency graph
Bug 28417 blocks: 16996

Additional Comments:





Mark bug as waiting for feedback
Mark bug as suspended




View Bug Activity   |   Format For Printing   |   Clone This Bug


Description:   Last confirmed: 2009-06-06 15:08 Opened: 2006-07-18 08:34
It looks like expmed.c::choose_multiplier(), which is responsible for finding
parameters needed for replacing division by constant with mul+shift, sometimes
fails to find optimal parameters.

One example: A/1577682821 can be calculated by executing A*365384439 >>
(27+32), for any 32-bit unsigned A. However, gcc-4.1.1 and all previous
versions generate much more elaborate code.

The below program demonstrates tis and also two more similar examples. It also
checks that mul+shift works for any unsigned A, by testing all possibe values
of A.

#include <stdint.h>
#include <stdio.h>

unsigned v;
void f9188_mul365384439_shift27(unsigned A) { v = A/(unsigned)1577682821; }
void f9188_mul365384439_shift27_prime(unsigned A) { v =
(((uint64_t)A)*(unsigned)365384439) >> (27+32); }
void f8399_mul2283243215_shift29(unsigned A) { v = A/(unsigned)1009898111; }
void f8399_mul2283243215_shift29_prime(unsigned A) { v =
(((uint64_t)A)*(unsigned)2283243215) >> (29+32); }
void f8267_mul2482476753_shift30(unsigned A) { v = A/(unsigned)1857695551; }
void f8267_mul2482476753_shift30_prime(unsigned A) { v =
(((uint64_t)A)*(unsigned)2482476753) >> (30+32); }

/* Generated code is suboptimal (gcc 4.1.1):
void f9188_mul365384439_shift27(unsigned A) { v = A/1577682821; }
f9188_mul365384439_shift27:
        pushl   %edi
        pushl   %esi
        pushl   %ebx
        movl    16(%esp), %ebx
        movl    $1551183727, %ecx
        movl    %ebx, %eax
        mull    %ecx
        subl    %edx, %ebx
        shrl    %ebx
        leal    (%ebx,%edx), %eax
        shrl    $30, %eax
        movl    %eax, v
        popl    %ebx
        popl    %esi
        popl    %edi
        ret
void f8399_mul2283243215_shift29(unsigned A) { v = A/1009898111; }
f8399_mul2283243215_shift29:
        pushl   %edi
        pushl   %esi
        pushl   %ebx
        movl    16(%esp), %ebx
        movl    $271519133, %ecx
        movl    %ebx, %eax
        mull    %ecx
        subl    %edx, %ebx
        shrl    %ebx
        leal    (%ebx,%edx), %eax
        shrl    $29, %eax
        movl    %eax, v
        popl    %ebx
        popl    %esi
        popl    %edi
        ret
void f8267_mul2482476753_shift30(unsigned A) { v = A/1857695551; }
f8267_mul2482476753_shift30:
        pushl   %edi
        pushl   %esi
        pushl   %ebx
        movl    16(%esp), %ebx
        movl    $669986209, %ecx
        movl    %ebx, %eax
        mull    %ecx
        subl    %edx, %ebx
        shrl    %ebx
        leal    (%ebx,%edx), %eax
        shrl    $30, %eax
        movl    %eax, v
        popl    %ebx
        popl    %esi
        popl    %edi
        ret

These operations can be done like this:
f9188_mul365384439_shift27_prime:
        movl    $365384439, %eax
        mull    4(%esp)
        movl    %edx, %eax
        shrl    $27, %eax
        movl    %eax, v
        ret
f8399_mul2283243215_shift29_prime:
        movl    $-2011724081, %eax
        mull    4(%esp)
        movl    %edx, %eax
        shrl    $29, %eax
        movl    %eax, v
        ret
f8267_mul2482476753_shift30_prime:
        movl    $-1812490543, %eax
        mull    4(%esp)
        movl    %edx, %eax
        shrl    $30, %eax
        movl    %eax, v
        ret

(Why there is that silly movl %edx, %eax - that's another good question...)

The verification program is below. Was compiled with -Os
(in order to force gcc to use div insn for a/b - we want to do true
division for the purpose of correctness check!)
and didn't report a failure.
*/

int main()
{
        unsigned A=0,u,v;
        while(++A) {
                (A & 0xffff) || printf("A=%u\r",A);

                u = (((uint64_t)A)*(unsigned)365384439) >> (27+32);
                v = A / (unsigned)1577682821;
                if(u!=v) { printf("1: A=%u u=%u v=%u\n", A,u,v); return 0; }

                u = (((uint64_t)A)*(unsigned)2283243215) >> (29+32);
                v = A / (unsigned)1009898111;
                if(u!=v) { printf("2: A=%u u=%u v=%u\n", A,u,v); return 0; }

                u = (((uint64_t)A)*(unsigned)2482476753) >> (30+32);
                v  =A / (unsigned)1857695551;
                if(u!=v) { printf("3: A=%u u=%u v=%u\n", A,u,v); return 0; }
        }
        puts("");
        return 0;
}

------- Comment #1 From Denis Vlasenko 2006-07-18 08:40 -------
I didn't look too close at choose_multiplier(), yet.
If anybody will work upon it in order to make it better,
this is the routine which prints values 'K' and 'bits'
so that
    (A*K)>>bits == A/B
is true for given fixed B and all A's in range [0...max_A].
You may want to add similar code to choose_multiplier().

Mathematical explanation is in the comment.

/*
[below: 'div' is unsigned integer division]
['/' is real division with infinite precision]
[A,B,C... - integers, a,b,c... - reals]

Theory: we want to calculate A div B, fast.
B is const > 2 and is not a power of 2.

In fp: A/B = A*(L/B)/L. (If L is a large power of 2,
say 2^32, then it could be done really fast).
Let k := L/B, K := L div B + 1.
Then A/B = A*k/L.

Then this is true:

A div B <= A * K div L.

For small enough A: A div B = A * K div L.
Lets find first A for which it is not true.

Lets compare k/L and K/L. K/L is larger by a small value d:

d := K/L - k/L = (L div B + 1) / L - L/B/L =
= (L div B * B + B) / (L*B) - L/(L*B) =
= (L div B * B + B - L) / (L*B)

A*K/L is larger than A*k/L by A*d.

A*k/L is closest to 'overflowing into next integer'
when A = N*B-1. The 'deficit' with such A is exactly 1/B.
If A*d >= 1/B, then A*K/L will 'overflow'.

Thus bad_A >= 1/B / d = (1/B) * (L*B)/(L div B * B + B - L) =
= L/(L div B * B + B - L). Then you need to 'walk up' to next
A representable as N*B-1: bad_A2 = (bad_A div B) * B + B-1
This is the first A which will have wrong result
(i.e. for which (A*K div L) = (A div B)+1, not (A div B).

Practical use.

Suppose that A and B are 32-bit unsigned integers.

Unfortunately, there exist only two B values in 3..2^32-1 range
for which _any_ 32-bit unsigned A can be fast divided using L=2^32
(because bad_A=2^32 and any A is less than that):

B=641 K=6700417 d=1/2753074036736 bad_A=4294967296 A=unlimited
B=6700417 K=641 d=1/28778071884562432 bad_A=4294967296 A=unlimited

We need to use larger powers of 2 for L if we need to handle
many more B's.
*/

void fastdiv_params(unsigned B, unsigned max_A)
{
        unsigned K, d_LB, bits, max_bits;
        uint64_t L, KL, mask, bad_A, max_bad_A;

        if (!B || ((B-1)&B) == 0) { // B is a power of 2
//if(0)
                printf("B=%u: power of 2, division by shift\n", B);
                return;
        }
        L = (max_ull/max_unsigned - 1) * B;
        bits = 63;
        mask = 1ULL << 63;
        while( !(L & mask))
                bits--, mask >>= 1;
        L = mask;

        while ( (KL = L/B + 1) > max_unsigned)
                bits--, L >>= 1;
        K = KL;
        // There is not much point in multiplying by even number
        // and then shifting right. Reduce K & L to avoid it:
        while (!(K & 1) && bits > 32)
                bits--, L >>= 1, K = L/B + 1;

        d_LB = ((L/B) * B + B - L);
        bad_A = L / d_LB;
        bad_A = (bad_A / B) * B + B-1;
        if (bad_A <= max_A) {
                printf("B=%u,A<=%u: no suitable fastdiv params found\n", B,
max_A);
                return;
        }
        max_bits = bits;
        max_bad_A = bad_A;
        while(1) {
                uint64_t last_L = L;
                uint64_t last_bad_A = bad_A;
                unsigned last_K = K;
                bits--;
                L >>= 1;
                K = L/B + 1;
                d_LB = ((L/B) * B + B - L);
                bad_A = L / d_LB;
                bad_A = (bad_A / B) * B + B-1;
                if (bad_A <= max_A || bits < 32) {
                        // new params are bad, report last ones and bail out
//if(0)
                        printf("B=%u,A<=%u: L=%llx(bits=%u) K=%u  (bad_A=%llu,
bad_A=%llu at bits=%u)\n",
                                B, max_A, last_L, bits+1, last_K, last_bad_A, 
max_bad_A, max_bits);
                        return;
                }
        }
}

------- Comment #2 From Denis Vlasenko 2006-07-18 12:06 -------
Pseudo-code to help in reading above code:

void fastdiv_params(unsigned B, unsigned max_A)
{
        L = (max_unsigned_long_long/max_unsigned - 1) * B;
        L = largest_pow2_smaller_or_eq_to(L);
        bits = log2(L);
        while ((K = L/B + 1) > max_unsigned)
                bits--, L >>= 1;
        // There is not much point in multiplying by even 'K'
        // and then shifting right (dividing by 2*n). Reduce K & L to avoid it:
        while (!(K & 1) && bits > 32)
                bits--, L >>= 1, K = L/B + 1;

        d_LB = ((L/B) * B + B - L);
        bad_A = L / d_LB;
        bad_A = (bad_A / B) * B + B-1;
        if (bad_A <= max_A)
                bail_out_no_params_exist;

        // 'K' and 'bits' params are already determined, can return them...
        /* return_params(K,bits); */

        // ...but if max_A is smaller then max_unsigned, we can
        // optionally use smaller 'K' and 'bits'. Find smallest which
        // still work:
        max_bits = bits;
        max_bad_A = bad_A;
        while(1) {
                uint64_t last_L = L;
                unsigned last_K = K;
                bits--;
                L >>= 1;
                K = L/B + 1;
                d_LB = ((L/B) * B + B - L);
                bad_A = L / d_LB;
                bad_A = (bad_A / B) * B + B-1;
                if (bad_A <= max_A || bits < 32)
                        // new params are bad, report last ones and bail out
                        return_params(last_K,bits+1);
        }
}

------- Comment #3 From Denis Vlasenko 2006-07-19 16:24 -------
I instrumented choose_multiplier(). When called like this:
choose_multiplier(d=1577682821,n=32,precision=32),
it returns *post_shift_ptr=31,multiplier=5846151023

whereas optimal one is *post_shift_ptr=27,multiplier=365384439
In the code: "////" comments show variable values 
if d=1577682821.


The instrumented version is below.
Too bad there are no comments how the code works, what is mlow and mhigh.

Note that the good value, if shifted left 4 times,
365384439<<4, is equal to 5846151024, which is mhigh+1!

With mhigh=5846151024 "Reduce to lowest terms" part of code would reduce
mhigh/mlow pair exactly to correct value of 365384439, I think...

Isn't this a reason why choose_multiplier() doesn't consider 365384439
to be a valid multiplier? Maybe mhigh calcucation is off-by-one?

unsigned HOST_WIDE_INT

choose_multiplier (unsigned HOST_WIDE_INT d, int n, int precision,

                   rtx *multiplier_ptr, int *post_shift_ptr, int *lgup_ptr)

{

  HOST_WIDE_INT mhigh_hi, mlow_hi;

  unsigned HOST_WIDE_INT mhigh_lo, mlow_lo;

  int lgup, post_shift;

  int pow, pow2;

  unsigned HOST_WIDE_INT nl, dummy1;

  HOST_WIDE_INT nh, dummy2;



static int divdebug = -1;



if (divdebug == -1)

    divdebug = NULL!=getenv("DIV_DEBUG");



divdebug && printf("choose_multiplier(d=%llu,n=%i,precision=%i)\n",(unsigned
long long)d,n,precision);

//// d=1577682821,n=32,precision=32

  /* lgup = ceil(log2(divisor)); */

  lgup = ceil_log2 (d); //// 31

  gcc_assert (lgup <= n);



  pow = n + lgup; //// 63

  pow2 = n + lgup - precision; //// 31



  /* We could handle this with some effort, but this case is much

     better handled directly with a scc insn, so rely on caller using

     that.  */

  gcc_assert (pow != 2 * HOST_BITS_PER_WIDE_INT);



  /* mlow = 2^(N + lgup)/d */

 if (pow >= HOST_BITS_PER_WIDE_INT) //// yes

    {

      nh = (HOST_WIDE_INT) 1 << (pow - HOST_BITS_PER_WIDE_INT); //// 1<<31

      nl = 0;

    }

  else

    {

      nh = 0;

      nl = (unsigned HOST_WIDE_INT) 1 << pow;

    }

//// 1<<63 / d: mlow=5846151022

  div_and_round_double (TRUNC_DIV_EXPR, 1, nl, nh, d, (HOST_WIDE_INT) 0,

                        &mlow_lo, &mlow_hi, &dummy1, &dummy2);

divdebug && printf("choose_multiplier: mlow=%llu\n",((unsigned long
long)mlow_hi<<32)+mlow_lo);



  /* mhigh = (2^(N + lgup) + 2^N + lgup - precision)/d */

  if (pow2 >= HOST_BITS_PER_WIDE_INT) //// no

    nh |= (HOST_WIDE_INT) 1 << (pow2 - HOST_BITS_PER_WIDE_INT);

  else

    nl |= (unsigned HOST_WIDE_INT) 1 << pow2; //// 1<<31

//// (1<<63+1<<31) / d: mhigh=5846151023

  div_and_round_double (TRUNC_DIV_EXPR, 1, nl, nh, d, (HOST_WIDE_INT) 0,

                        &mhigh_lo, &mhigh_hi, &dummy1, &dummy2);

divdebug && printf("choose_multiplier: mhigh=%llu\n",((unsigned long
long)mhigh_hi<<32)+mhigh_lo);



  gcc_assert (!mhigh_hi || nh - d < d);

  gcc_assert (mhigh_hi <= 1 && mlow_hi <= 1);

  /* Assert that mlow < mhigh.  */

  gcc_assert (mlow_hi < mhigh_hi

              || (mlow_hi == mhigh_hi && mlow_lo < mhigh_lo));



  /* If precision == N, then mlow, mhigh exceed 2^N

     (but they do not exceed 2^(N+1)).  */



  /* Reduce to lowest terms.  */

  for (post_shift = lgup; post_shift > 0; post_shift--)

    {

      unsigned HOST_WIDE_INT ml_lo = (mlow_hi << (HOST_BITS_PER_WIDE_INT - 1))
| (mlow_lo >> 1);

      unsigned HOST_WIDE_INT mh_lo = (mhigh_hi << (HOST_BITS_PER_WIDE_INT - 1))
| (mhigh_lo >> 1);

      if (ml_lo >= mh_lo) //// if mlow/2 == mhigh/2 - bail out

        break;



      mlow_hi = 0;

      mlow_lo = ml_lo;

      mhigh_hi = 0;

      mhigh_lo = mh_lo;

    }



  *post_shift_ptr = post_shift;

  *lgup_ptr = lgup;

  if (n < HOST_BITS_PER_WIDE_INT)

    {

      unsigned HOST_WIDE_INT mask = ((unsigned HOST_WIDE_INT) 1 << n) - 1;

      *multiplier_ptr = GEN_INT (mhigh_lo & mask);

divdebug && printf("choose_multiplier:
*post_shift_ptr=%i,*lgup_ptr=%i,multiplier=%llu\n",*post_shift_ptr,*lgup_ptr,(unsigned
long long)(mhigh_lo & mask) + ((unsigned long long)(mhigh_lo >= mask) << 32));

      return mhigh_lo >= mask;

    }

  else

    {

      *multiplier_ptr = GEN_INT (mhigh_lo);

divdebug && printf("choose_multiplier:
*post_shift_ptr=%i,*lgup_ptr=%i,multiplier=%llu\n",*post_shift_ptr,*lgup_ptr,(unsigned
long long)(mhigh_lo) + ((unsigned long long)(mhigh_hi) << 32));

      return mhigh_hi;

    }

}

------- Comment #4 From Denis Vlasenko 2006-07-30 13:35 -------
Created an attachment (id=11971) [edit]
Use alternative algorithm if it gives better results

New algorithm lives in separate function gcc_fastdiv_params() which is called
from choose_multiplier(). Then we run old algorithm. Then we compare results
and if new algorithm gives different one (new result is always better or same
than old), we use it.

------- Comment #5 From Denis Vlasenko 2006-07-30 13:37 -------
Created an attachment (id=11972) [edit]
Test program

Compile with -O2 -S to see the assembly.
Compile with -Os and run to check the correctness.

------- Comment #6 From Denis Vlasenko 2006-07-30 13:38 -------
Patched versus stock gcc-4.1.1 on this code

unsigned v;
void f9188_mul365384439_shift27(unsigned A) { v = A/(unsigned)1577682821; }

is:

# diff -u suboptimal-411-O2.s suboptimal-411-O2-fixed.s
--- suboptimal-411-O2.s 2006-07-29 19:34:09.000000000 +0200
+++ suboptimal-411-O2-fixed.s   2006-07-29 19:37:14.000000000 +0200
@@ -19,21 +19,10 @@
 f9188_mul365384439_shift27:
        pushl   %ebp
        movl    %esp, %ebp
-       pushl   %edi
-       pushl   %esi
-       pushl   %ebx
-       movl    8(%ebp), %ebx
-       movl    $1551183727, %ecx
-       movl    %ebx, %eax
-       mull    %ecx
-       subl    %edx, %ebx
-       shrl    %ebx
-       leal    (%ebx,%edx), %eax
-       shrl    $30, %eax
-       movl    %eax, v
-       popl    %ebx
-       popl    %esi
-       popl    %edi
+       movl    $365384439, %eax
+       mull    8(%ebp)
+       shrl    $27, %edx
+       movl    %edx, v
        leave
        ret
...

------- Comment #7 From Denis Vlasenko 2006-07-30 13:43 -------
Created an attachment (id=11973) [edit]
Random tester

Randomly chooses a/b and compares "gcc patch code" results with code which was
previously verified to be ok. Just my internal tool to check that I did not
screw up while making a patch.

------- Comment #8 From Jorn Wolfgang Rennecke 2006-07-31 14:39 -------
(In reply to comment #3)
> Too bad there are no comments how the code works, what is mlow and mhigh.

Tege has written a paper: http://swox.se/~tg/divcnst-pldi94.pdf

Also, you might consider adding him to the CC list of this bug.

------- Comment #9 From Denis Vlasenko 2006-08-02 19:05 -------
Thanks for the link to .pdf! Who's Tege? And his email address is ...?

Ok. This new version of the patch is working for signed divisions too, and is
giving the results which are always same or better than "old" code.

If it would be deemed acceptable for mainline, it can be simply be integrated
into choose_multiplier().

Since new code needs to know maximum possible dividend instead of (n,prec)
tuple, and it can generate better code if dividend is known to be small enough
(example: (n & 0xffff) / 10 will need only a mul, not mul+shift), it will make
sense to propagate possible dividend range into choose_multiplier() intead of
(n,prec). TODO later...

------- Comment #10 From Denis Vlasenko 2006-08-02 19:05 -------
Created an attachment (id=12000) [edit]
Alternative algorithm v. 2

------- Comment #11 From Jorn Wolfgang Rennecke 2006-08-02 19:34 -------
(In reply to comment #9)
> Thanks for the link to .pdf! Who's Tege? And his email address is ...?

Torbjoern Granlund; homepage:  http://swox.se/~tg/ .  From MAINTAINERS:
*synthetic multiply     Torbjorn Granlund       tege@swox.com

------- Comment #12 From Denis Vlasenko 2006-08-03 17:06 -------
Created an attachment (id=12005) [edit]
Alternative algorithm v. 3

Formatting fixes mostly

------- Comment #13 From tege@swox.com 2006-08-04 11:57 -------
Subject: Re:  suboptimal 'division by constant' optimization

I looked briefly into the suggested change, without trying to
assess its correctness.  But since these sort of transformations
are of a very mathematical nature, I trust the author of the
suggested changes has proven the algorithms?

I am aware of that Peter Montgomery's and my algorithms do not
generate optimal sequences for v/C where C is close to the
largest possible value of v.  We deemed it to have very little
practical use to optimize these cases.

If it is decided that these cases are indeed important enough to
require optimization, the choose_multiplier function should
course be amended, of course.

------- Comment #14 From Denis Vlasenko 2006-08-04 18:14 -------
>But since these sort of transformations are of
>a very mathematical nature, I trust the author of the
>suggested changes has proven the algorithms?

I did not write a paper about it, but I think I have
the proof and it is documented in the patch.

I will happily ament that comment if it has any unclear spots.
I don't want anydoby to have headaches over it :)

> If it is decided that these cases are indeed important enough to
require optimization, the choose_multiplier function should
course be amended, of course.

I think that we can get more gains if we will use new algorithm
_and_ will propagate dividend value range info into
choose_multiplier(). New algorithm (of should I say "modified",
it is not that new) takes max_A parameter already, so it's trivial.

For example, with value range propagation (n & 0xffff) / 10
will generate code with just a mul without shift, because there exist
suitable K for L=2^32 -> we can just take higher half of multiply.

First Last Prev Next    No search results available      Search page      Enter new bug