Ralph Levien pointed out as part of FizzBuzz optimization: Turns out you can compute x%3 == 0 with even fewer steps, it's (x*0xaaaaaaaab) < 0x55555556 (assuming wrapping unsigned 32 bit arithmetic). gcc currently generates the full modulo and then checks. Could be done in match.pd I suppose. Test case unsigned mod3(unsigned a) { return 0==(a%3); }

Works for me on aarch64: f: mov w1, 21846 movk w1, 0x5555, lsl 16 smull x1, w0, w1 lsr x1, x1, 32 sub w1, w1, w0, asr 31 add w1, w1, w1, lsl 1 sub w0, w0, w1 ret Most likely a target cost issue.

>Could be done in match.pd I suppose. NO! There are a lot of code in the expander to handle division and modulus by a constant already. If the division cost is high enough, then it is not used directly and the multiply method is used as shown by what happens on aarch64: f: mov w1, 21846 movk w1, 0x5555, lsl 16 smull x1, w0, w1 lsr x1, x1, 32 sub w1, w1, w0, asr 31 add w1, w1, w1, lsl 1 sub w0, w0, w1 ret int f(int a) { return a%3; }

For your code aarch64 produces: mod3: mov w1, 43691 movk w1, 0xaaaa, lsl 16 umull x1, w0, w1 lsr x1, x1, 32 lsr w1, w1, 1 add w1, w1, w1, lsl 1 cmp w0, w1 cset w0, eq ret Oh you mean not to produce a%3 followed by == part :). Anyways this code should go into the middle-end expanders and not be part of match.pd.

Right it's about special casing the complete expression

Also I'm not sure why you would want it in the middle end. It should all work at the tree level

(In reply to Andi Kleen from comment #5) > Also I'm not sure why you would want it in the middle end. It should all > work at the tree level Because we lose meaning and there is already code to handle the %N (where N is a constant) to a multiplication in the middle-end so reusing that code would be easy. Plus 0 is simpler than 0x55555556.

Is that a special case of a more generic transformation, which might apply for other values of 3, 0, == etc, or is this meant only literally for x%3==0?

I'm not sure if it works with other numbers too. (need to dig through Hacker's delight & Matters Computational to see if they have anything on it) But it could be extended for other word lengths at least BTW there are some other cases, will file a bug shortly on those too.

(In reply to Andi Kleen from comment #8) > I'm not sure if it works with other numbers too. > > (need to dig through Hacker's delight & Matters Computational to see if they > have anything on it) > > But it could be extended for other word lengths at least > > BTW there are some other cases, will file a bug shortly on those too. It works for any C where (divisor*C) MOD 2^32 == 1 (or -1). You can support any kind of comparison, it doesn't need to be with 0 (but zero is the easiest). I forgot whether I made it work for signed too, but it's certainly possible to skip the sign handling in x % 4 == 0 even if x is signed. While this is faster in general than computing the full modulo result, it wouldn't be if you need the division result as well. So this works best on single-use modulos used in comparisons, at the same time as general divisions are expanded.

Might be even more special-cased with x % 3 == 0 ? C1 : C2 aka in if-conversion context.

(In reply to Wilco from comment #9) > It works for any C where (divisor*C) MOD 2^32 == 1 (or -1). For x%3==0, i.e. z==0 for x==3*y+z with 0<=y<55555556 and 0<=z<3. Indeed, x*0xaaaaaaab==y+z*0xaaaaaaab is in the right range precisely for z==0 and the same can be done for any odd number. > You can support any kind of comparison, it doesn't need to be with 0 (but zero is the easiest). Any ==cst will yield a range test. It is less obvious that inequalities are transformed to a contiguous range... (try x%7<3 maybe) > I forgot whether I made it work for signed too, but it's certainly > possible to skip the sign handling in x % 4 == 0 even if x is signed. 4 is a completely different story, as a power of 2.

(In reply to Marc Glisse from comment #11) > (In reply to Wilco from comment #9) > > It works for any C where (divisor*C) MOD 2^32 == 1 (or -1). > > For x%3==0, i.e. z==0 for x==3*y+z with 0<=y<55555556 and 0<=z<3. > Indeed, x*0xaaaaaaab==y+z*0xaaaaaaab is in the right range precisely for > z==0 and the same can be done for any odd number. You can make it work for even numbers as well, eg. (x % 10) == 0 -> (x % 5) == 0 && (x & 1) == 0. If you can avoid branches then it may still be faster. > > You can support any kind of comparison, it doesn't need to be with 0 (but zero is the easiest). > > Any ==cst will yield a range test. It is less obvious that inequalities are > transformed to a contiguous range... (try x%7<3 maybe) Indeed, expanding into x % 7 == 0 || x % 7 == 1 || x % 7 == 2 is unlikely to be a good idea. However you can change x % 7 >= 1 into x % 7 != 0 or x % 7 < 6 into x % 7 != 6 which are single ranges.

*** Bug 87203 has been marked as a duplicate of this bug. ***

http://duriansoftware.com/joe/Optimizing-is-multiple-checks-with-modular-arithmetic.html Do we want to do this at GIMPLE time ignoring costs, or during expansion time? Doing it later has the benefit that we can compare costs and could avoid breaking say divmod recognition, or finding out multiple uses of the modulo, etc. Doing it earlier has the benefit for vectorization I guess, otherwise we need a pattern recognizer that will work like the expansion.

I tried to do something similar at https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02005.html and the feedback was that we should do this at expand time

For unsigned x % y == z if y is odd constant we can handle it for any constant z, by computing m = mul_inv (y, 2^prec) and d = (2^prec / y) and using x * m - (z * m) < d . For even y, not sure if it can work for anything but z == 0; for z == 0 we can do s = ctz (y); y_ = y >> s; m = mul_inv (y_, 2^prec); d = (2^prec / y_); and use ((x * m) r>> s) < d .

(In reply to Jakub Jelinek from comment #16) > For unsigned x % y == z if y is odd constant we can handle it for any > constant z, by computing m = mul_inv (y, 2^prec) and d = (2^prec / y) and > using x * m - (z * m) < d . Is that preferable to testing (x - z) % y == 0? Why?

(In reply to Alexander Monakov from comment #17) > (In reply to Jakub Jelinek from comment #16) > > For unsigned x % y == z if y is odd constant we can handle it for any > > constant z, by computing m = mul_inv (y, 2^prec) and d = (2^prec / y) and > > using x * m - (z * m) < d . > > Is that preferable to testing (x - z) % y == 0? Why? That would require checking that it when x - z underflows if returns the correct answer (or adding a && x >= z).

Created attachment 44656 [details] gcc9-pr82853-wip.patch Untested WIP patch which does this during expansion if it is cheaper according to target costs. As the FIXMEs say, I'm not yet trying Alex' proposal with doing the subtraction on X (can't be used unconditionally, need to prove that values from -C2 to -1 u% C1 aren't 0), and am not trying even C1 when rotates aren't available (either could just expand the rotate anyway using 2 shifts + or, or comparison + mask + comparison + bitwise and (though that would complicate the callers equite a bit).

Testcase I've been eyeballing so far: unsigned f1 (unsigned x) { return (x % 679U) == 0; } unsigned f2 (unsigned x, unsigned *y) { *y = x / 679U; return (x % 679U) == 0; } unsigned f3 (unsigned x) { return (x % 1738U) == 0; } void bar (void); void f4 (unsigned x) { if (x % 3 == 0) bar (); } void f5 (unsigned x) { if (x % 3 == 1) bar (); } void f6 (unsigned x) { if (x % 3 == 2) bar (); }

Probably should punt early if integer_onep (treeop1), that case should have been optimized earlier, but if it isn't, we shouldn't miscompile. Another thing is if *arg1 is < 0 or >= treeop1, again, I'd hope we have optimized that already to false (or true for NE_EXPR), but if not, we shouldn't miscompile. Another thing is that C4 needs to be adjusted, if *arg1 is bigger than all_ones%treeop1, then we need to subtract 1 from C4 - e.g. for x % 3 == 0 we want to expand it as x * 0xaaaaaaab <= 0x55555555, while for x % 3 == 1 we should expand as x * 0xaaaaaaab - 1 * 0xaaaaaaab <= 0x55555555 - 1, because 0xffffffff % 3 == 0 and thus there is one more case where x % 3 == 0 compared to x % 3 == 1 and x % 3 == 2 cases. Hacker's Delight 10-17 mentions <= (2^prec - 1) / c1 rather than <= (2^prec) / c1, probably should do that too.

So, while it isn't correct to replace x % 3U == 1 by (x - 1) % 3U == 0, because for x == 0 the test will yield a different value, as 0xffffffffU % 3U is 0 and 0 % 3U is also 0, x % 3U == 1 is equivalent to (x - 1) * 0xaaaaaaabU <= 0x55555554U, but x % 3U == 0 is equivalent to x * 0xaaaaaaabU <= 0x55555555U. Now to see if something useful can be used also for the even divisors.

(In reply to Jakub Jelinek from comment #22) > So, while it isn't correct to replace x % 3U == 1 by (x - 1) % 3U == 0, > because > for x == 0 the test will yield a different value, as 0xffffffffU % 3U is 0 > and > 0 % 3U is also 0, x % 3U == 1 is equivalent to (x - 1) * 0xaaaaaaabU <= > 0x55555554U, but x % 3U == 0 is equivalent to x * 0xaaaaaaabU <= 0x55555555U. > Now to see if something useful can be used also for the even divisors. Yes for this case it is safe to do (x - 1) * 0xaaaaaaab < 0x55555555, but you can also do x * 0xaaaaaaab >= 0xaaaaaaab which is even simpler. Basically (x % C) == N is simpler for 2 values of N.

(In reply to Wilco from comment #23) > (In reply to Jakub Jelinek from comment #22) > > So, while it isn't correct to replace x % 3U == 1 by (x - 1) % 3U == 0, > > because > > for x == 0 the test will yield a different value, as 0xffffffffU % 3U is 0 > > and > > 0 % 3U is also 0, x % 3U == 1 is equivalent to (x - 1) * 0xaaaaaaabU <= > > 0x55555554U, but x % 3U == 0 is equivalent to x * 0xaaaaaaabU <= 0x55555555U. > > Now to see if something useful can be used also for the even divisors. > > Yes for this case it is safe to do (x - 1) * 0xaaaaaaab < 0x55555555, but > you can also do x * 0xaaaaaaab >= 0xaaaaaaab which is even simpler. Yeah, a special case, todo++. > Basically (x % C) == N is simpler for 2 values of N. I meant that for C odd it works for any value, x % C == D for C odd and D <= -1 % C then (x - D) * mod_inv (C) <= -1 / C, otherwise if C is odd and D > -1 % C then (x - D) * mod_inv (C) < -1 / C. Just if C is even it is more complicated. For C positive odd and signed modulo with D == 0 it can be done easily too (will implement tomorrow).

Created attachment 44657 [details] gcc9-pr82853.patch Full untested patch. For x % C1 == C2 it handles all unsigned cases where C1 is odd, if C1 is even, just cases where C2 <= -1U % C1, if signed modulo, just x % C1 == 0 cases (where C1 is not INT_MIN).

Created attachment 44658 [details] pr82853-tests.tar.xz A test generator for x % c1 == c2 expansion for unsigned, int, unsigned long long, long long, unsigned int128 and int128 types (assuming ilp32 or lp64) plus tests for those types. Takes about 2 minutes to compile + run on a fast box and uses random (), so not really sure the tests should go into the testsuite. Thoughts on that? After all, the generator isn't extra smart and doesn't try to find problematic corner cases.

(In reply to Jakub Jelinek from comment #26) > A test generator for x % c1 == c2 expansion for unsigned, int, unsigned long > long, long long, unsigned int128 and int128 types (assuming ilp32 or lp64) > plus tests for those types. Takes about 2 minutes to compile + run on a fast > box and uses random (), so not really sure the tests should go into the > testsuite. Thoughts on that? After all, the generator isn't extra smart and > doesn't try to find problematic corner cases. If you work on the 'guess' of a 10x slowdown when using simulators, that's probably too long particularly if it's not delivering high value. Ultimately, it will depend on how much of that 2 minutes is during generate/compile and how much during run time.

On i9-7960X I get (cc1 is -O0 checking build, so bootstrapped compiler might be much faster), will repeat that with bootstrapped compiler if it succeeds. The __int128 and unsigned __int128 tests are clearly too expensive. for i in 1 2 3 4 5 6; do time ./cc1 -quiet -O2 -o pr82853-$i.{s,c}; gcc -o pr82853-$i{,.s}; time ./pr82853-$i; echo $?; done real 0m11.273s user 0m11.182s sys 0m0.039s real 0m2.997s user 0m2.993s sys 0m0.001s 0 real 0m8.145s user 0m8.082s sys 0m0.026s real 0m2.166s user 0m2.165s sys 0m0.000s 0 real 0m11.683s user 0m11.597s sys 0m0.033s real 0m5.315s user 0m5.312s sys 0m0.000s 0 real 0m7.972s user 0m7.903s sys 0m0.032s real 0m3.801s user 0m3.798s sys 0m0.001s 0 real 0m12.846s user 0m12.762s sys 0m0.028s real 0m17.471s user 0m17.458s sys 0m0.001s 0 real 0m8.546s user 0m8.486s sys 0m0.022s real 0m13.738s user 0m13.728s sys 0m0.000s

*** Bug 87232 has been marked as a duplicate of this bug. ***

Author: jakub Date: Wed Sep 12 18:28:20 2018 New Revision: 264248 URL: https://gcc.gnu.org/viewcvs?rev=264248&root=gcc&view=rev Log: PR middle-end/82853 * expr.h (maybe_optimize_mod_cmp): Declare. * expr.c (mod_inv): New function. (maybe_optimize_mod_cmp): New function. (do_store_flag): Use it. * cfgexpand.c (expand_gimple_cond): Likewise. * gcc.target/i386/pr82853-1.c: New test. * gcc.target/i386/pr82853-2.c: New test. Added: trunk/gcc/testsuite/gcc.target/i386/pr82853-1.c trunk/gcc/testsuite/gcc.target/i386/pr82853-2.c Modified: trunk/gcc/ChangeLog trunk/gcc/cfgexpand.c trunk/gcc/expr.c trunk/gcc/expr.h trunk/gcc/testsuite/ChangeLog

Fixed on the trunk. We might still try to improve some == non-zero cases with conditional code, and certainly should repeat this in the vectorizer's pattern recognizer.

*** Bug 49552 has been marked as a duplicate of this bug. ***

(In reply to Jakub Jelinek from comment #25) > Created attachment 44657 [details] > gcc9-pr82853.patch > > Full untested patch. For x % C1 == C2 it handles all unsigned cases where > C1 is odd, if C1 is even, just cases where C2 <= -1U % C1, if signed modulo, > just x % C1 == 0 cases (where C1 is not INT_MIN). Hi, I will try to help. I've got a more accurate formula. look at my comment over here https://reviews.llvm.org/D50222 I'm copying it here for the sake of all. Hi guys, I found the magical formula for unsigned integers that works also with even numbers without the need to check for overflows with any remainder: from divisor d and reminder r, I calculate 4 constants. any d!=0 should fit. void calculate_constants64(uint64_t d, uint64_t r, uint64_t &k,uint64_t &mmi, uint64_t &s,uint64_t& u) { k=__builtin_ctzll(d);/* the power of 2 */ uint64_t d_odd=d>>k; mmi=find_mod_mul_inverse(d_odd,64); /* 64 is word size*/ s=(r*mmi); u=(ULLONG_MAX-r)/d;/* note that I divide by d, not d_odd */ } A little bit background: the constant (u +1) is the count of the possible values in the range of 64 bit number that will yield the correct modulo. The constant s should zero the first k bits if the given value have modulo of r. it will also zero the modulo of d_odd. then the compiler should generate the following code with the given constants: int checkrem64(uint64_t k,uint64_t mmi, uint64_t s,uint64_t u,uint64_t x) { uint64_t o=((x*mmi)-s); o= (o>>k)|(o<<(64-k));/*ROTR64(o,k)*/ return o<=u; } this replace the following: /* d is the divisor, r is the remainder */ int checkrem64(uint64_t x) { return x%d==r; } this is the code to find modular inverse.. uint64_t find_mod_mul_inverse(uint64_t x, uint64_t bits) { if (bits > 64 || ((x&1)==0)) return 0;// invalid parameters uint64_t mask; if (bits == 64) mask = -1; else { mask = 1; mask<<=bits; mask--; } x&=mask; uint64_t result=1, state=x, ctz=0; while(state!=1ULL) { ctz=__builtin_ctzll(state^1); result|=1ULL<<ctz; state+=x<<ctz; state&=mask; } return result; } good luck! I tested this on all the cases of 10bit word size, and it passed. *Edit:* I looked for something that will work for signed integers. I came up with something that would work with negative numbers if the following assumption was correct: (-21)%10==9 but this assumption is not correct because (-21)%10 equals to -1. anyway, the idea is like that, you shift the range and change s and accordingly: void calculate_constants64(uint64_t d, uint64_t r, uint64_t &k,uint64_t &mmi, uint64_t &s,uint64_t& u) { k=__builtin_ctzll(d);/* the power of 2 */ uint64_t d_odd=d>>k; mmi=find_mod_mul_inverse(d_odd,64); /* 64 is word size*/ //this is the added line to make it work with signed integers r+=0x8000 0000 0000 0000% d; s=(r*mmi); u=(ULLONG_MAX-r)/d; } int checkrem64(uint64_t k,uint64_t mmi, uint64_t s,uint64_t u,uint64_t x) { //this is the added line to make it work with signed integers //x came as signed number but was casted to unsigned x^=0x8000 0000 0000 0000;// this is addition simplified to xor, spaces for clarity. uint64_t o=((x*mmi)-s); o= (o>>k)|(o<<(64-k));/*ROTR64(o,k)*/ return o<=u; } There is a way to tweak u and s to make it work on negative only numbers or positive only numbers, when r is negative or positive... but for r=0, this should work. please tell me the exact requirements, and I will do the math.

Also fixes the duplicate https://gcc.gnu.org/bugzilla/show_bug.cgi?id=12849. Can't close it though.

Hi, thanks for your respond. can someone mark this bug as need to be improved? Does anyone agree/disagree with my new proposed method?

(In reply to Orr Shalom Dvory from comment #35) > Hi, thanks for your respond. can someone mark this bug as need to be > improved? > Does anyone agree/disagree with my new proposed method? It's best to create a new bug if there are still easy cases that could be optimized. Also it seems the constants it uses are quite complex - it may be feasible to simplify them. Eg. long f(long x) { return x % 35 == 0; }