Bug 12849

Summary: testing divisibility by constant
Product: gcc Reporter: user42
Component: middle-endAssignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED FIXED    
Severity: enhancement CC: gcc-bugs, jakub
Priority: P3 Keywords: missed-optimization
Version: 3.3.2   
Target Milestone: ---   
See Also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853
Host: i486-pc-linux-gnu Target: i486-pc-linux-gnu
Build: i486-pc-linux-gnu Known to work:
Known to fail: Last reconfirmed: 2005-12-24 19:46:25
Bug Depends on: 82853    
Bug Blocks:    

Description user42 2003-10-30 22:06:22 UTC
I believe tests of divisibility by a constant (ie. n % d == 0, where d
is a compile-time constant) can be improved by the technique described
in section 9 of "Division by Invariant Integers using Multiplication"
by Granlund and Montgomery.

	ftp://ftp.cwi.nl/pub/pmontgom/divcnst.psa4.gz

(This section of this paper is what's currently used in gcc for exact
division by a constant arising from pointer subtraction, but the
divisibility tests managed to escape implementation. :-)

Environment:
System: Linux blah 2.2.15 #1 Tue Apr 25 17:13:48 EST 2000 i586 unknown unknown GNU/Linux
Architecture: i586
	<machine, os, target, libraries (multiple lines)>
host: i486-pc-linux-gnu
build: i486-pc-linux-gnu
target: i486-pc-linux-gnu
configured with: ../src/configure -v --enable-languages=c,c++,java,f77,pascal,objc,ada,treelang --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-gxx-include-dir=/usr/include/c++/3.3 --enable-shared --with-system-zlib --enable-nls --without-included-gettext --enable-__cxa_atexit --enable-clocale=gnu --enable-debug --enable-java-gc=boehm --enable-java-awt=xlib --enable-objc-gc i486-linux

How-To-Repeat:
A little function like,

	void
	bar (unsigned n)
	{
	  if (n % 3 == 0)
	    true ();
	  else
	    false ();
	}

compiled with

	gcc -march=athlon -mcpu=athlon -O3 -fomit-frame-pointer -S dive.c

currently comes out with

        movl    4(%esp), %ecx
        movl    $1639179445, %edx
        movl    %ecx, %eax
        mull    %edx
        shrl    $16, %edx
        imull   $171717, %edx, %edx
        cmpl    %edx, %ecx
	...

I believe instead this could be something like

        movl    4(%esp), %ecx
	imull	$0x4B45300D, %ecx
	cmpl	$0x61B3, %ecx
	...

Notice there's only one low-half multiply, as opposed to a high-half
plus a low-half above.

An even divisor will require an extra rotl (or a separate test of some
low bits), but still ought to be smaller and faster in all cases.


If the actual value of the remainder is wanted then this technique
wouldn't help, it's only for a test for == 0 (or particular values).

When the input does prove to be divisible, then the mul shown gives
the quotient.  Maybe that could be noted on the corresponding side of
the branch, in case it allowed for some CSE.
Comment 1 Falk Hueffner 2003-10-30 22:42:58 UTC
If anybody wants to look into this, the material at http://www.hackersdelight.org/
might be useful.
Comment 2 Andrew Pinski 2003-10-31 01:03:18 UTC
Nice optimizations.
Comment 3 user42 2003-10-31 22:00:44 UTC
Subject: Re:  testing divisibility by constant

I wrote:
>
> 	  if (n % 3 == 0)

Actually, the asm code I included was for a divisor of 171717.  Small
values get strength reduction on the multiplies, obscuring what gets
done.
Comment 4 Cassio Neri 2018-03-14 10:11:09 UTC
A simple mathematical proof that the algorithm works is found here:

http://clomont.com/efficient-divisibility-testing/

See also https://stackoverflow.com/a/49264279/1137388.
Comment 5 Trass3r 2018-11-09 04:21:35 UTC
Should be fixed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853
Comment 6 Wilco 2019-05-30 10:44:40 UTC
Fixed in GCC9.
Comment 7 Cassio Neri 2019-05-31 21:10:25 UTC
Thanks for implementing the modular inverse algorithm in gcc. However, the implementation has an issue. In some cases, for no obvious reason, the compiler falls back to the old algorithm. For instance,

    bool f1(unsigned n) { return n % 10 == 5; }

as expected, uses the modular inverse algorithm and translates to

    f1(unsigned int):
      imull $-858993459, %edi, %edi
      subl $1, %edi
      rorl %edi
      cmpl $429496729, %edi
      setbe %al
      ret

whereas

    bool f2(unsigned n) { return n % 10 == 6; }

doesn't use the modular inverse algorithm and is the same as in older versions of gcc:

    f2(unsigned int):
      movl %edi, %eax
      movl $3435973837, %edx
      imulq %rdx, %rax
      shrq $35, %rax
      leal (%rax,%rax,4), %eax
      addl %eax, %eax
      subl %eax, %edi
      cmpl $6, %edi
      sete %al
      ret

See on godbolt: https://godbolt.org/z/u-C54I

I would like make another observation. For some divisors (e.g. 7, 19, 21) the modular inverse algorithm seems to be faster than the traditional one even when the remainder r (in n % d == r) is not a compile time constant. In general this happens in cases where the "magic number" M used by the traditional algorithm to replace the division "n / d" with "n * M >> k" is such that M doesn't fit in a register and extra operations are required to overcome this problem. In other words, these are the divisors for which '"Add" indicator' in https://www.hackersdelight.org/magic.htm shows 1.

I made some measurements and I hope to make my results available for your consideration soon.