
Bug 46186

Summary: Clang creates code running 1600 times faster than gcc's
Product: gcc
Component: tree-optimization
Version: 4.5.1
Status: NEW
Severity: minor
Priority: P3
Keywords: missed-optimization
Reporter: Julian Andres Klode <jak>
Assignee: Not yet assigned to anyone <unassigned>
CC: jakub, lucas, mehturt, spop, tg
Target Milestone: ---
Last reconfirmed: 2010-10-26 16:59:00
Attachments: C file; Clang's assembler; 0001-Fix-PR46186-improve-chrec_apply.patch

Description Julian Andres Klode 2010-10-26 14:27:00 UTC
Created attachment 22161 [details]
C file

The attached code, compiled with gcc, produces an executable that takes 1.6 seconds to run; compiled with clang, it takes 0.001 seconds (both with -O2 as the only compiler option).
Comment 1 Julian Andres Klode 2010-10-26 14:30:24 UTC
Created attachment 22162 [details]
Clang's assembler

Attaching the assembler output from clang; it should help show which optimizations clang performs (so gcc can be improved to do them as well).
Comment 2 Paolo Carlini 2010-10-26 14:31:47 UTC
;)
Comment 3 Julian Andres Klode 2010-10-26 14:32:27 UTC
System information:

Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.4.5-5' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.4 --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --with-arch-32=i586 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.4.5 (Debian 4.4.5-5)
Comment 4 Jonathan Wakely 2010-10-26 14:47:12 UTC
GCC's output is significantly faster at -O3 or without the noinline attribute
Comment 5 Julian Andres Klode 2010-10-26 14:53:24 UTC
(In reply to comment #4)
> GCC's output is significantly faster at -O3 or without the noinline attribute

I just tested and at -O3, gcc-4.4 creates slow code and gcc-4.5 fast code. At -O2, both are equally fast. The clang 1.1 code at -O2 is as fast as GCC 4.5's code at -O3.
Comment 6 Dominique d'Humieres 2010-10-26 14:59:18 UTC
You get this kind of speedup if the compiler knows that the result of the loop is

sum=(b*(b-1)-a*(a-1))/2

In which case the timing is meaningless (it is 0.000s on my laptop), and so is the ratio against actually executing the loop.

The basic question is: how much of the user's ignorance should be repaired by the optimizer? (A colleague of mine told me that he once audited a CFD code and found that \int_a^b dx/x was evaluated numerically instead of using log(b)-log(a)!-)
Comment 7 Julian Andres Klode 2010-10-26 15:00:37 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > GCC's output is significantly faster at -O3 or without the noinline attribute
> 
> I just tested and at -O3, gcc-4.4 creates slow code and gcc-4.5 fast code. At
> -O2, both are equally fast. The clang 1.1 code at -O2 is as fast as GCC 4.5's
> code at -O3.

Using gcc 4.5 with -O3 may work for the C code, but it does not optimize the C++ code posted at http://lwn.net/Articles/411776/:

    g++-4.5 -O3: real 0m1.608s
    clang++ -O2: real 0m0.003s
    clang++ -Os: real 0m0.003s

But I guess the C++ problem might be a different one.
Comment 8 Julian Andres Klode 2010-10-26 15:25:56 UTC
(In reply to comment #6)
> You get this kind of speedup if the compiler knows that the result of the loop
> is
> 
> sum=(b*(b-1)-a*(a-1))/2
> 
> In which case the timing is meaningless (it is 0.000s on my laptop), so is the
> ratio with the execution of the loop.
> 
> The basic question is: how much the user's ignorance should be repaired by the
> optimizer? (A colleague of mine told me that he once audited a CFD code and
> found that \int_a^b dx/x was evaluated numerically instead of using
> log(b)-log(a)!-)

Since the optimization seems to be mostly there in -O3, it's just a matter of enabling it in -O2. 

I just found out that gcc does not optimize if you call f() via a global function pointer: it still takes 1.6 seconds even at -O3, whereas clang can optimize it down to 0.001s.
Comment 9 Jonathan Wakely 2010-10-26 15:28:51 UTC
(In reply to comment #8)
> 
> Since the optimization seems to be mostly there in -O3, it's just a matter of
> enabling it in -O2. 

Or if you want all optimisations, it's just a matter of using -O3

Expecting all optimisations to be done below the maximum optimisation level is ... unexpected.
Comment 10 Jakub Jelinek 2010-10-26 15:37:04 UTC
-O2 -fipa-cp-clone should in theory be enough: f would then normally be cloned, assuming gcc thinks the call is from a hot spot.  But this is not a real-world testcase; the call from main, which happens just once per process, is not considered a hot spot.
Comment 11 Paolo Carlini 2010-10-26 15:42:58 UTC
Can we please stop talking about nano and giga numbers like kids? If an optimization like complete loop unrolling is involved, of course very small or very large numbers can appear; talking about the exact figure doesn't really contribute anything to the problem.
Comment 12 pinskia@gmail.com 2010-10-26 15:56:20 UTC
This multiplication transformation is incorrect if the loop wraps (unsigned always wraps; it never overflows). GCC's slower code is correct. -O3 is faster, but not because of the multiplication transformation; rather because of constant propagation. So this bug is invalid and someone should report a bug to the LLVM folks about this.

Comment 13 Dominique d'Humieres 2010-10-26 16:36:05 UTC
> This multiplication transformation is incorrect if the loop wraps  
> (unsigned always wraps; never overflows).

I think this is wrong: wrapping is nothing but a modulo 2^n operation (n=64 here) which "works" for additions and multiplications, so if there is wrapping, the result is sum=(b*(b-1)-a*(a-1))/2 modulo 2^n, i.e. correctly wrapped.

On my Core2duo 2.53GHz with -Ofast the run time is ~1.2s for 2*10^9 elementary loop iterations, i.e. ~0.6 ns/iteration or ~1.5 clock cycles per iteration. With perfect vectorization and no loop overhead, I would expect a minimum of 0.5 clock cycle per iteration. If you get anything below this number, it means that the loop

    for (; a < b; a++)
        sum += a;

is replaced with sum=(b*(b-1)-a*(a-1))/2 (you can confirm this by checking whether the timing behaves as O(len) or not). Apparently clang does this (valid) transformation, while gcc doesn't for any of the options I have tried.

Note that if I write such a loop, it is because I am interested in the timing of the loop, not in a result I have known for more than 40 years!
Comment 14 Jakub Jelinek 2010-10-26 16:59:00 UTC
For sum += 2 or sum += b sccp handles this, so I wonder whether it couldn't handle even the sum += a case.  Sebastian?
Comment 15 Dominique d'Humieres 2010-10-26 17:15:31 UTC
> For sum += 2 or sum += b sccp handles this, so I wonder whether it couldn't
> handle even the sum += a case.

2 and b are constants, while a is not. For constants you only have to know that the sum is cst*nloop; if a is incremented you have to know that it is related to nloop*(nloop+1)/2 (and if you use sum += a*a, to nloop*(nloop+1)*(2*nloop+1)/6; if sum += a*a*a, to nloop^2*(nloop+1)^2/4; and so on).

However, is it worth the work?
Comment 16 Jakub Jelinek 2010-10-26 18:43:40 UTC
chrec_apply is called with
{a_4(D), +, {a_4(D) + 1, +, 1}_1}_1
chrec and ~a_4(D) + b_5(D) in x.
I wonder if this can be fixed just by recognizing such special cases in chrec_apply (after checking e.g. that it is unsigned computation instead of signed etc.).
Comment 17 Dominique d'Humieres 2010-10-26 18:53:49 UTC
Note that clang seems to know the general result: \sum_{i=a}^b p(i)=P(b), where p(i) is a given polynomial of degree n and P(x) a polynomial of degree n+1 such that P(x)=P(x-1)+p(x) and P(a)=p(a). At least clang is able to remove the loop for

sum5 += a*a*a*a*a + 2*a*a*a +5*a;
Comment 18 Jakub Jelinek 2010-10-26 19:11:59 UTC
I guess you mean LLVM instead of clang; I'm pretty sure the FE doesn't perform this optimization.

Anyway, given:

#define F(n, exp) \
unsigned long                                   \
f##n (unsigned long a, unsigned long b)         \
{                                               \
  unsigned long sum = 0;                        \
  for (; a < b; a++)                            \
    exp;                                        \
  return sum;                                   \
}

F (1, sum += a)
F (2, sum += 2)
F (3, sum += b)
F (4, sum += a * a)
F (5, sum += a * a * a)
F (6, a * a * a * a * a + 2 * a * a * a + 5 * a)

only the f1/f2/f3 cases make it into chrec_apply (and only f2/f3 are currently handled there).
Comment 19 joseph@codesourcery.com 2010-10-26 20:29:56 UTC
On Tue, 26 Oct 2010, dominiq at lps dot ens.fr wrote:

> --- Comment #13 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2010-10-26 16:36:05 UTC ---
> > This multiplication transformation is incorrect if the loop wraps  
> > (unsigned always wraps; never overflows).
> 
> I think this is wrong: wrapping is nothing but a modulo 2^n operation (n=64
> here) which "works" for additions and multiplications, so if there is wrapping,
> the result is sum=(b*(b-1)-a*(a-1))/2 modulo 2^n, i.e. correctly wrapped.

It's a bit more complicated than that, in that you can't just compute 
(b*(b-1)-a*(a-1)) mod 2^n, then divide by 2, as that will lose the top 
bit.  (I haven't checked exactly what the generated code is doing here.)
Comment 20 Jakub Jelinek 2010-10-26 21:00:11 UTC
If I translate the assembly back to C, it seems it is performing part of the arithmetic in TImode:

unsigned long f (unsigned long a, unsigned long b)
{
  if (a >= b)
    return 0;
  else
    return (a + 1) * (b - 1 - a) + a + (unsigned long)(((unsigned __int128) (b - 1 - a) * (b - 2 - a)) >> 1);
}
Comment 21 Dominique d'Humieres 2010-10-26 21:06:48 UTC
> I guess you mean LLVM instead of clang,

Yes, if you prefer. I was referring to the command I used.

> F (6, a * a * a * a * a + 2 * a * a * a + 5 * a)

you probably mean

F (6, sum +=a * a * a * a * a + 2 * a * a * a + 5 * a)

> It's a bit more complicated than that, in that you can't just compute 
> (b*(b-1)-a*(a-1)) mod 2^n, then divide by 2, as that will lose the top 
> bit.  (I haven't checked exactly what the generated code is doing here.)

This is right if you do the multiplication before the division. However, either b or b-1 can be divided exactly by 2, so you have to compute (b/2)*(b-1) if b is even and b*((b-1)/2) if b is odd. The same approach applies to the sums of squares and cubes, although you may need one more trick if 2*b+1 wraps and has to be divided exactly by 3.

I have tested the following variant


[macbook] f90/bug% cat pr46186_db.c
#include <stdio.h>

#define len 1000000000L

unsigned long f(unsigned long a, unsigned long b) __attribute__((noinline));

int main()
{    
    unsigned long res;
    res = f(2, 2*len);
/*    printf("%lu\n", res); */
    return 0;
}

unsigned long f(unsigned long a, unsigned long b)
{
    unsigned long sum = 0, sum0 = 0, sum2 = 0, sum5 = 0;
    for (; a < b; a++)
      {
	sum0 += 2;
        sum += a;
	sum2 += a*a;
	sum5 += a*a*a*a*a + 2*a*a*a +5*a;
      }
    printf("%lu\n", sum0);
    printf("%lu\n", sum);
    printf("%lu\n", sum2);
    printf("%lu\n", sum5);
    return sum;
}
which gives

[macbook] f90/bug% gcc46 -Ofast -ftree-vectorizer-verbose=2 pr46186_db.c

pr46186_db.c:18: note: LOOP VECTORIZED.
pr46186_db.c:15: note: vectorized 1 loops in function.
[macbook] f90/bug% time a.out
3999999996
1999999998999999999
10262176583330622975
11833798674328980984
18.114u 0.012s 0:18.15 99.8%	0+0k 0+0io 0pf+0w
[macbook] f90/bug% clang -O pr46186_db.c
[macbook] f90/bug% time a.out
3999999996
1999999998999999999
10262176583330622975
11833798674328980984
0.000u 0.000s 0:00.00 0.0%	0+0k 0+0io 0pf+0w

So the wrapping seems properly handled for the loop replacement.

Now I have also tested the following variant


[macbook] f90/bug% cat pr46186_db_1.c
#include <stdio.h>

#define len 1000000000L

unsigned long f(unsigned long a, unsigned long b) __attribute__((noinline));

int main()
{    
    unsigned long res;
    res = f(2, 2*len);
/*    printf("%lu\n", res); */
    return 0;
}

unsigned long f(unsigned long a, unsigned long b)
{
    unsigned long sum = 0, sum0 = 0, sum2 = 0, sum5 = 0;
    unsigned long a2;
    for (; a < b; a++)
      {
	sum0 += 2;
        sum += a;
	a2 = a*a;
	sum2 += a2;
	sum5 += a*(a2*(a2 + 2) +5);
      }
    printf("%lu\n", sum0);
    printf("%lu\n", sum);
    printf("%lu\n", sum2);
    printf("%lu\n", sum5);
    return sum;
}

and the timings are

[macbook] f90/bug% gcc46 -Ofast -ftree-vectorizer-verbose=2 pr46186_db_1.c

pr46186_db_1.c:19: note: LOOP VECTORIZED.
pr46186_db_1.c:15: note: vectorized 1 loops in function.
[macbook] f90/bug% time a.out
3999999996
1999999998999999999
10262176583330622975
11833798674328980984
12.227u 0.016s 0:12.36 98.9%	0+0k 0+0io 0pf+0w  <== was 18.114u 0.012s 0:18.15 without hand optimization
[macbook] f90/bug% clang -O pr46186_db_1.c
[macbook] f90/bug% time a.out
3999999996
1999999998999999999
10262176583330622975
11833798674328980984
0.000u 0.000s 0:00.00 0.0%	0+0k 0+0io 0pf+0w

So clearly gcc is missing some generic optimizations on products.
Comment 22 Thorsten Glaser 2010-10-29 21:06:44 UTC
Created attachment 22201 [details]
0001-Fix-PR46186-improve-chrec_apply.patch

(In reply to comment #19)
> It's a bit more complicated than that, in that you can't just compute 
> (b*(b-1)-a*(a-1)) mod 2^n, then divide by 2, as that will lose the top 
> bit.  (I haven't checked exactly what the generated code is doing here.)

Haven’t checked either, but in i386 asm, using RCR instead of SHR (or SAR) would work, provided the carry flag is correct after the sequence calculating the dividend.
Comment 23 sebpop@gmail.com 2010-10-29 21:44:09 UTC
Hi,

here is a preliminary patch (not yet tested on anything other than the PR testcase).
This patch improves chrec_apply to also handle these very uncommon
cases that some like to make big titles about (I wonder if the guy who
submitted this bug report is part of some marketing division... anyways)

Note that for these cases
F (4, sum += a * a)
F (5, sum += a * a * a)
F (6, sum += a * a * a * a * a + 2 * a * a * a + 5 * a)

although GCC with this patch knows how to transform these into end
of loop values, GCC won't change them, because of this heuristic:

	      /* Do not emit expensive expressions.  The rationale is that
		 when someone writes a code like

		 while (n > 45) n -= 45;

		 he probably knows that n is not large, and does not want it
		 to be turned into n %= 45.  */
	      || expression_expensive_p (def))

one needs to also set the --param scev-max-expr-size to a pretty big
value for f6 to pass the fold steps...

Sebastian
Comment 24 joseph@codesourcery.com 2010-10-29 21:58:46 UTC
On Fri, 29 Oct 2010, sebpop at gmail dot com wrote:

> here is a preliminary patch (not tested yet other that the PR testcase).

How does this patch deal with needing to compute in a wider type to avoid 
losing high bits before the division - is that covered by some existing 
code?  It could certainly do with execution testcases that verify cases 
where computing everything naively in the original type would yield an 
incorrect result (and those were computing in a type of double the width 
would still have overflow, etc.).