Bug 82853 - Optimize x % 3 == 0 without modulo
Summary: Optimize x % 3 == 0 without modulo
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 8.0
: P3 normal
Target Milestone: ---
Assignee: Jakub Jelinek
URL:
Keywords: missed-optimization
: 49552 87203 87232 (view as bug list)
Depends on:
Blocks: 12849
  Show dependency treegraph
 
Reported: 2017-11-05 19:57 UTC by Andi Kleen
Modified: 2019-06-03 11:11 UTC (History)
3 users (show)

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-11-05 00:00:00


Attachments
gcc9-pr82853-wip.patch (3.02 KB, patch)
2018-09-03 18:04 UTC, Jakub Jelinek
Details | Diff
gcc9-pr82853.patch (4.16 KB, patch)
2018-09-04 12:24 UTC, Jakub Jelinek
Details | Diff
pr82853-tests.tar.xz (71.73 KB, application/x-xz)
2018-09-04 12:30 UTC, Jakub Jelinek
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Andi Kleen 2017-11-05 19:57:22 UTC
Ralph Levien pointed out as part of FizzBuzz optimization:

Turns out you can compute x%3 == 0 with even fewer steps, it's (x*0xaaaaaaaab) < 0x55555556 (assuming wrapping unsigned 32 bit arithmetic).

gcc currently generates the full modulo and then checks.

Could be done in match.pd I suppose.

Test case

unsigned mod3(unsigned a) { return 0==(a%3); }
Comment 1 Andrew Pinski 2017-11-05 20:11:09 UTC
Works for me on aarch64:
f:
        mov     w1, 21846
        movk    w1, 0x5555, lsl 16
        smull   x1, w0, w1
        lsr     x1, x1, 32
        sub     w1, w1, w0, asr 31
        add     w1, w1, w1, lsl 1
        sub     w0, w0, w1
        ret

Most likely a target cost issue.
Comment 2 Andrew Pinski 2017-11-05 20:13:13 UTC
>Could be done in match.pd I suppose.


NO!  There are a lot of code in the expander to handle division and modulus by a constant already.  If the division cost is high enough, then it is not used directly and the multiply method is used as shown by what happens on aarch64:
f:
        mov     w1, 21846
        movk    w1, 0x5555, lsl 16
        smull   x1, w0, w1
        lsr     x1, x1, 32
        sub     w1, w1, w0, asr 31
        add     w1, w1, w1, lsl 1
        sub     w0, w0, w1
        ret

int f(int a)
{
  return a%3;
}
Comment 3 Andrew Pinski 2017-11-05 20:15:27 UTC
For your code aarch64 produces:
mod3:
        mov     w1, 43691
        movk    w1, 0xaaaa, lsl 16
        umull   x1, w0, w1
        lsr     x1, x1, 32
        lsr     w1, w1, 1
        add     w1, w1, w1, lsl 1
        cmp     w0, w1
        cset    w0, eq
        ret


Oh you mean not to produce a%3 followed by == part :).  Anyways this code should go into the middle-end expanders and not be part of match.pd.
Comment 4 Andi Kleen 2017-11-05 20:26:48 UTC
Right it's about special casing the complete expression
Comment 5 Andi Kleen 2017-11-05 20:32:15 UTC
Also I'm not sure why you would want it in the middle end. It should all work at the tree level
Comment 6 Andrew Pinski 2017-11-05 20:34:42 UTC
(In reply to Andi Kleen from comment #5)
> Also I'm not sure why you would want it in the middle end. It should all
> work at the tree level

Because we lose meaning and there is already code to handle the %N (where N is a constant) to a multiplication in the middle-end so reusing that code would be easy.  Plus 0 is simpler than 0x55555556.
Comment 7 Marc Glisse 2017-11-05 21:01:34 UTC
Is that a special case of a more generic transformation, which might apply for other values of 3, 0, == etc, or is this meant only literally for x%3==0?
Comment 8 Andi Kleen 2017-11-05 21:05:15 UTC
I'm not sure if it works with other numbers too.

(need to dig through Hacker's delight & Matters Computational to see if they have anything on it)

But it could be extended for other word lengths at least

BTW there are some other cases, will file a bug shortly on those too.
Comment 9 Wilco 2017-11-06 12:28:51 UTC
(In reply to Andi Kleen from comment #8)
> I'm not sure if it works with other numbers too.
> 
> (need to dig through Hacker's delight & Matters Computational to see if they
> have anything on it)
> 
> But it could be extended for other word lengths at least
> 
> BTW there are some other cases, will file a bug shortly on those too.

It works for any C where (divisor*C) MOD 2^32 == 1 (or -1). You can support any kind of comparison, it doesn't need to be with 0 (but zero is the easiest). I forgot whether I made it work for signed too, but it's certainly possible to skip the sign handling in x % 4 == 0 even if x is signed.

While this is faster in general than computing the full modulo result, it wouldn't be if you need the division result as well. So this works best on single-use modulos used in comparisons, at the same time as general divisions are expanded.
Comment 10 Richard Biener 2017-11-06 12:33:42 UTC
Might be even more special-cased with x % 3 == 0 ? C1 : C2 aka in if-conversion context.
Comment 11 Marc Glisse 2017-11-06 13:08:33 UTC
(In reply to Wilco from comment #9)
> It works for any C where (divisor*C) MOD 2^32 == 1 (or -1).

For x%3==0, i.e. z==0 for x==3*y+z with 0<=y<55555556 and 0<=z<3. 
Indeed, x*0xaaaaaaab==y+z*0xaaaaaaab is in the right range precisely for z==0 and the same can be done for any odd number.

> You can support any kind of comparison, it doesn't need to be with 0 (but zero is the easiest).

Any ==cst will yield a range test. It is less obvious that inequalities are transformed to a contiguous range... (try x%7<3 maybe)

> I forgot whether I made it work for signed too, but it's certainly
> possible to skip the sign handling in x % 4 == 0 even if x is signed.

4 is a completely different story, as a power of 2.
Comment 12 Wilco 2017-11-06 15:19:48 UTC
(In reply to Marc Glisse from comment #11)
> (In reply to Wilco from comment #9)
> > It works for any C where (divisor*C) MOD 2^32 == 1 (or -1).
> 
> For x%3==0, i.e. z==0 for x==3*y+z with 0<=y<55555556 and 0<=z<3. 
> Indeed, x*0xaaaaaaab==y+z*0xaaaaaaab is in the right range precisely for
> z==0 and the same can be done for any odd number.

You can make it work for even numbers as well, eg. (x % 10) == 0 -> (x % 5) == 0 && (x & 1) == 0. If you can avoid branches then it may still be faster.

> > You can support any kind of comparison, it doesn't need to be with 0 (but zero is the easiest).
> 
> Any ==cst will yield a range test. It is less obvious that inequalities are
> transformed to a contiguous range... (try x%7<3 maybe)

Indeed, expanding into x % 7 == 0 || x % 7 == 1 || x % 7 == 2 is unlikely to be a good idea. However you can change x % 7 >= 1 into x % 7 != 0 or x % 7 < 6 into x % 7 != 6 which are single ranges.
Comment 13 Jakub Jelinek 2018-09-03 11:38:23 UTC
*** Bug 87203 has been marked as a duplicate of this bug. ***
Comment 14 Jakub Jelinek 2018-09-03 11:41:41 UTC
http://duriansoftware.com/joe/Optimizing-is-multiple-checks-with-modular-arithmetic.html

Do we want to do this at GIMPLE time ignoring costs, or during expansion time?
Doing it later has the benefit that we can compare costs and could avoid breaking say divmod recognition, or finding out multiple uses of the modulo, etc.
Doing it earlier has the benefit for vectorization I guess, otherwise we need a pattern recognizer that will work like the expansion.
Comment 15 ktkachov 2018-09-03 12:02:45 UTC
I tried to do something similar at https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02005.html and the feedback was that we should do this at expand time
Comment 16 Jakub Jelinek 2018-09-03 13:05:37 UTC
For unsigned x % y == z if y is odd constant we can handle it for any constant z, by computing m = mul_inv (y, 2^prec) and d = (2^prec / y) and using x * m - (z * m) < d .
For even y, not sure if it can work for anything but z == 0; for z == 0 we can do
s = ctz (y); y_ = y >> s; m = mul_inv (y_, 2^prec); d = (2^prec / y_);
and use ((x * m) r>> s) < d .
Comment 17 Alexander Monakov 2018-09-03 13:35:22 UTC
(In reply to Jakub Jelinek from comment #16)
> For unsigned x % y == z if y is odd constant we can handle it for any
> constant z, by computing m = mul_inv (y, 2^prec) and d = (2^prec / y) and
> using x * m - (z * m) < d .

Is that preferable to testing (x - z) % y == 0?  Why?
Comment 18 Wilco 2018-09-03 14:59:44 UTC
(In reply to Alexander Monakov from comment #17)
> (In reply to Jakub Jelinek from comment #16)
> > For unsigned x % y == z if y is odd constant we can handle it for any
> > constant z, by computing m = mul_inv (y, 2^prec) and d = (2^prec / y) and
> > using x * m - (z * m) < d .
> 
> Is that preferable to testing (x - z) % y == 0?  Why?

That would require checking that it when x - z underflows if returns the correct answer (or adding a && x >= z).
Comment 19 Jakub Jelinek 2018-09-03 18:04:04 UTC
Created attachment 44656 [details]
gcc9-pr82853-wip.patch

Untested WIP patch which does this during expansion if it is cheaper according to target costs.
As the FIXMEs say, I'm not yet trying Alex' proposal with doing the subtraction on X (can't be used unconditionally, need to prove that values from -C2 to -1 u% C1 aren't 0), and am not trying even C1 when rotates aren't available (either could just expand the rotate anyway using 2 shifts + or, or comparison + mask + comparison + bitwise and (though that would complicate the callers equite a bit).
Comment 20 Jakub Jelinek 2018-09-03 18:04:55 UTC
Testcase I've been eyeballing so far:
unsigned f1 (unsigned x) { return (x % 679U) == 0; }
unsigned f2 (unsigned x, unsigned *y) { *y = x / 679U; return (x % 679U) == 0; }
unsigned f3 (unsigned x) { return (x % 1738U) == 0; }
void bar (void);
void f4 (unsigned x) { if (x % 3 == 0) bar (); }
void f5 (unsigned x) { if (x % 3 == 1) bar (); }
void f6 (unsigned x) { if (x % 3 == 2) bar (); }
Comment 21 Jakub Jelinek 2018-09-03 19:19:27 UTC
Probably should punt early if integer_onep (treeop1), that case should have been optimized earlier, but if it isn't, we shouldn't miscompile.

Another thing is if *arg1 is < 0 or >= treeop1, again, I'd hope we have optimized that already to false (or true for NE_EXPR), but if not, we shouldn't miscompile.

Another thing is that C4 needs to be adjusted, if *arg1 is bigger than all_ones%treeop1, then we need to subtract 1 from C4 - e.g. for x % 3 == 0
we want to expand it as x * 0xaaaaaaab <= 0x55555555, while for x % 3 == 1
we should expand as x * 0xaaaaaaab - 1 * 0xaaaaaaab <= 0x55555555 - 1, because
0xffffffff % 3 == 0 and thus there is one more case where x % 3 == 0 compared to x % 3 == 1 and x % 3 == 2 cases.

Hacker's Delight 10-17 mentions <= (2^prec - 1) / c1 rather than <= (2^prec) / c1, probably should do that too.
Comment 22 Jakub Jelinek 2018-09-03 20:37:09 UTC
So, while it isn't correct to replace x % 3U == 1 by (x - 1) % 3U == 0, because
for x == 0 the test will yield a different value, as 0xffffffffU % 3U is 0 and
0 % 3U is also 0, x % 3U == 1 is equivalent to (x - 1) * 0xaaaaaaabU <= 0x55555554U, but x % 3U == 0 is equivalent to x * 0xaaaaaaabU <= 0x55555555U.
Now to see if something useful can be used also for the even divisors.
Comment 23 Wilco 2018-09-03 20:52:13 UTC
(In reply to Jakub Jelinek from comment #22)
> So, while it isn't correct to replace x % 3U == 1 by (x - 1) % 3U == 0,
> because
> for x == 0 the test will yield a different value, as 0xffffffffU % 3U is 0
> and
> 0 % 3U is also 0, x % 3U == 1 is equivalent to (x - 1) * 0xaaaaaaabU <=
> 0x55555554U, but x % 3U == 0 is equivalent to x * 0xaaaaaaabU <= 0x55555555U.
> Now to see if something useful can be used also for the even divisors.

Yes for this case it is safe to do (x - 1) * 0xaaaaaaab < 0x55555555, but you can also do x * 0xaaaaaaab >= 0xaaaaaaab which is even simpler. Basically (x % C) == N is simpler for 2 values of N.
Comment 24 Jakub Jelinek 2018-09-03 20:58:45 UTC
(In reply to Wilco from comment #23)
> (In reply to Jakub Jelinek from comment #22)
> > So, while it isn't correct to replace x % 3U == 1 by (x - 1) % 3U == 0,
> > because
> > for x == 0 the test will yield a different value, as 0xffffffffU % 3U is 0
> > and
> > 0 % 3U is also 0, x % 3U == 1 is equivalent to (x - 1) * 0xaaaaaaabU <=
> > 0x55555554U, but x % 3U == 0 is equivalent to x * 0xaaaaaaabU <= 0x55555555U.
> > Now to see if something useful can be used also for the even divisors.
> 
> Yes for this case it is safe to do (x - 1) * 0xaaaaaaab < 0x55555555, but
> you can also do x * 0xaaaaaaab >= 0xaaaaaaab which is even simpler.

Yeah, a special case, todo++.

> Basically (x % C) == N is simpler for 2 values of N.

I meant that for C odd it works for any value, x % C == D for C odd and D <= -1 % C then (x - D) * mod_inv (C) <= -1 / C, otherwise if C is odd and D > -1 % C then
(x - D) * mod_inv (C) < -1 / C.
Just if C is even it is more complicated.  
For C positive odd and signed modulo with D == 0 it can be done easily too (will implement tomorrow).
Comment 25 Jakub Jelinek 2018-09-04 12:24:59 UTC
Created attachment 44657 [details]
gcc9-pr82853.patch

Full untested patch.  For x % C1 == C2 it handles all unsigned cases where C1 is odd, if C1 is even, just cases where C2 <= -1U % C1, if signed modulo, just x % C1 == 0 cases (where C1 is not INT_MIN).
Comment 26 Jakub Jelinek 2018-09-04 12:30:34 UTC
Created attachment 44658 [details]
pr82853-tests.tar.xz

A test generator for x % c1 == c2 expansion for unsigned, int, unsigned long
long, long long, unsigned int128 and int128 types (assuming ilp32 or lp64)
plus tests for those types.  Takes about 2 minutes to compile + run on a fast
box and uses random (), so not really sure the tests should go into the
testsuite.  Thoughts on that?  After all, the generator isn't extra smart and
doesn't try to find problematic corner cases.
Comment 27 Richard Earnshaw 2018-09-04 12:36:29 UTC
(In reply to Jakub Jelinek from comment #26)

> A test generator for x % c1 == c2 expansion for unsigned, int, unsigned long
> long, long long, unsigned int128 and int128 types (assuming ilp32 or lp64)
> plus tests for those types.  Takes about 2 minutes to compile + run on a fast
> box and uses random (), so not really sure the tests should go into the
> testsuite.  Thoughts on that?  After all, the generator isn't extra smart and
> doesn't try to find problematic corner cases.

If you work on the 'guess' of a 10x slowdown when using simulators, that's probably too long particularly if it's not delivering high value. 

Ultimately, it will depend on how much of that 2 minutes is during generate/compile and how much during run time.
Comment 28 Jakub Jelinek 2018-09-04 12:45:26 UTC
On i9-7960X I get (cc1 is -O0 checking build, so bootstrapped compiler might be much faster), will repeat that with bootstrapped compiler if it succeeds.  The __int128 and unsigned __int128 tests are clearly too expensive.

for i in 1 2 3 4 5 6; do time ./cc1 -quiet -O2 -o pr82853-$i.{s,c}; gcc -o pr82853-$i{,.s}; time ./pr82853-$i; echo $?; done
real	0m11.273s
user	0m11.182s
sys	0m0.039s

real	0m2.997s
user	0m2.993s
sys	0m0.001s
0

real	0m8.145s
user	0m8.082s
sys	0m0.026s

real	0m2.166s
user	0m2.165s
sys	0m0.000s
0

real	0m11.683s
user	0m11.597s
sys	0m0.033s

real	0m5.315s
user	0m5.312s
sys	0m0.000s
0

real	0m7.972s
user	0m7.903s
sys	0m0.032s

real	0m3.801s
user	0m3.798s
sys	0m0.001s
0

real	0m12.846s
user	0m12.762s
sys	0m0.028s

real	0m17.471s
user	0m17.458s
sys	0m0.001s
0

real	0m8.546s
user	0m8.486s
sys	0m0.022s

real	0m13.738s
user	0m13.728s
sys	0m0.000s
Comment 29 Jakub Jelinek 2018-09-05 17:12:38 UTC
*** Bug 87232 has been marked as a duplicate of this bug. ***
Comment 30 Jakub Jelinek 2018-09-12 18:28:52 UTC
Author: jakub
Date: Wed Sep 12 18:28:20 2018
New Revision: 264248

URL: https://gcc.gnu.org/viewcvs?rev=264248&root=gcc&view=rev
Log:
	PR middle-end/82853
	* expr.h (maybe_optimize_mod_cmp): Declare.
	* expr.c (mod_inv): New function.
	(maybe_optimize_mod_cmp): New function.
	(do_store_flag): Use it.
	* cfgexpand.c (expand_gimple_cond): Likewise.

	* gcc.target/i386/pr82853-1.c: New test.
	* gcc.target/i386/pr82853-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr82853-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr82853-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cfgexpand.c
    trunk/gcc/expr.c
    trunk/gcc/expr.h
    trunk/gcc/testsuite/ChangeLog
Comment 31 Jakub Jelinek 2018-09-13 08:18:44 UTC
Fixed on the trunk.  We might still try to improve some == non-zero cases with conditional code, and certainly should repeat this in the vectorizer's pattern recognizer.
Comment 32 Alexander Monakov 2019-02-09 09:42:31 UTC
*** Bug 49552 has been marked as a duplicate of this bug. ***
Comment 33 Orr Shalom Dvory 2019-02-26 22:36:26 UTC
(In reply to Jakub Jelinek from comment #25)
> Created attachment 44657 [details]
> gcc9-pr82853.patch
> 
> Full untested patch.  For x % C1 == C2 it handles all unsigned cases where
> C1 is odd, if C1 is even, just cases where C2 <= -1U % C1, if signed modulo,
> just x % C1 == 0 cases (where C1 is not INT_MIN).

Hi, I will try to help. I've got a more accurate formula. look at my comment over here https://reviews.llvm.org/D50222
I'm copying it here for the sake of all. 
Hi guys, I found the magical formula for unsigned integers that works also with even numbers without the need to check for overflows with any remainder:
from divisor d and reminder r, I calculate 4 constants.
any d!=0 should fit.

void calculate_constants64(uint64_t d, uint64_t r, uint64_t &k,uint64_t &mmi, uint64_t &s,uint64_t& u)
{
	k=__builtin_ctzll(d);/* the power of 2 */
	uint64_t d_odd=d>>k;
	mmi=find_mod_mul_inverse(d_odd,64);
	/* 64 is word size*/
	s=(r*mmi);
	u=(ULLONG_MAX-r)/d;/* note that I divide by d, not d_odd */
}
A little bit background: the constant (u +1) is the count of the possible values in the range of 64 bit number that will yield the correct modulo.
The constant s should zero the first k bits if the given value have modulo of r. it will also zero the modulo of d_odd.

then the compiler should generate the following code with the given constants:

int checkrem64(uint64_t k,uint64_t mmi, uint64_t s,uint64_t u,uint64_t x)
{
    uint64_t o=((x*mmi)-s);
    o= (o>>k)|(o<<(64-k));/*ROTR64(o,k)*/
    return o<=u;
}
this replace the following:

/* d is the divisor, r is the remainder */
int checkrem64(uint64_t x)
{
  return x%d==r;
}
this is the code to find modular inverse..

uint64_t find_mod_mul_inverse(uint64_t x, uint64_t bits)
{
      if (bits > 64 || ((x&1)==0))
              return 0;// invalid parameters
      uint64_t mask;
      if (bits == 64)
              mask = -1;
      else
      {                
              mask = 1;
              mask<<=bits;
              mask--;
      }
      x&=mask;
      uint64_t result=1, state=x, ctz=0;
      while(state!=1ULL)
      {
              ctz=__builtin_ctzll(state^1);
              result|=1ULL<<ctz;
              state+=x<<ctz;
              state&=mask;
      }
      return result;
}
good luck!
I tested this on all the cases of 10bit word size, and it passed.

*Edit:* I looked for something that will work for signed integers. I came up with something that would work with negative numbers if the following assumption was correct:

(-21)%10==9
but this assumption is not correct because (-21)%10 equals to -1.
anyway, the idea is like that, you shift the range and change s and accordingly:

void calculate_constants64(uint64_t d, uint64_t r, uint64_t &k,uint64_t &mmi, uint64_t &s,uint64_t& u)
{
	k=__builtin_ctzll(d);/* the power of 2 */
	uint64_t d_odd=d>>k;
	mmi=find_mod_mul_inverse(d_odd,64);
	/* 64 is word size*/
     //this is the added line to make it work with signed integers
      r+=0x8000 0000 0000 0000% d;
	s=(r*mmi);
	u=(ULLONG_MAX-r)/d;
}

int checkrem64(uint64_t k,uint64_t mmi, uint64_t s,uint64_t u,uint64_t x)
{
     //this is the added line to make it work with signed integers
//x came as signed number but was casted to unsigned
    x^=0x8000 0000 0000 0000;// this is addition simplified to xor, spaces for clarity.
    uint64_t o=((x*mmi)-s);
    o= (o>>k)|(o<<(64-k));/*ROTR64(o,k)*/
    return o<=u;
}
There is a way to tweak u and s to make it work on negative only numbers or positive only numbers, when r is negative or positive... but for r=0, this should work. please tell me the exact requirements, and I will do the math.
Comment 34 Trass3r 2019-05-30 08:30:57 UTC
Also fixes the duplicate https://gcc.gnu.org/bugzilla/show_bug.cgi?id=12849.
Can't close it though.
Comment 35 Orr Shalom Dvory 2019-06-02 21:15:53 UTC
Hi, thanks for your respond. can someone mark this bug as need to be improved?
Does anyone agree/disagree with my new proposed method?
Comment 36 Wilco 2019-06-03 11:11:04 UTC
(In reply to Orr Shalom Dvory from comment #35)
> Hi, thanks for your respond. can someone mark this bug as need to be
> improved?
> Does anyone agree/disagree with my new proposed method?

It's best to create a new bug if there are still easy cases that could be optimized. Also it seems the constants it uses are quite complex - it may be feasible to simplify them. Eg. long f(long x) { return x % 35 == 0; }