Bug 46186 - Clang creates code running 1600 times faster than gcc's
Status: NEW
Classification: Unclassified
Product: gcc
Component: tree-optimization
Version: 4.5.1
Priority: P3
Severity: minor
Assignee: not yet assigned to anyone
Keywords: missed-optimization
Blocks: 80852

Reported: 2010-10-26 14:27 UTC by Julian Andres Klode
Modified: 2017-11-27 17:50 UTC
CC: 7 users (drraph, hiraditya, jakub, lucas, mehturt, spop, tg)
Last reconfirmed: 2010-10-26 16:59:00

Attachments:
- C file (185 bytes, text/plain), 2010-10-26 14:27 UTC, Julian Andres Klode
- Clang's assembler (712 bytes, text/plain), 2010-10-26 14:30 UTC, Julian Andres Klode
- 0001-Fix-PR46186-improve-chrec_apply.patch (1.89 KB, text/x-patch), 2010-10-29 21:44 UTC, sebpop@gmail.com

Julian Andres Klode 2010-10-26 14:27:00 UTC
Created attachment 22161 [details] C file
The attached code compiles into an executable that takes 1.6 seconds to run when compiled with gcc; compiled with clang, it takes 0.001 seconds (both with -O2 as the only compiler option).

Julian Andres Klode 2010-10-26 14:30:24 UTC
Created attachment 22162 [details] Clang's assembler
Attaching the assembler output from clang; it should help in understanding which optimizations clang performs (and in improving gcc to do them as well).

Paolo Carlini 2010-10-26 14:31:47 UTC
;)

Julian Andres Klode 2010-10-26 14:32:27 UTC
System information:
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.4.5-5' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.4 --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --with-arch-32=i586 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.4.5 (Debian 4.4.5-5)

Jonathan Wakely 2010-10-26 14:47:12 UTC
GCC's output is significantly faster at -O3 or without the noinline attribute.

Julian Andres Klode 2010-10-26 14:53:24 UTC
(In reply to comment #4)
> GCC's output is significantly faster at -O3 or without the noinline attribute
I just tested: at -O3, gcc-4.4 creates slow code and gcc-4.5 fast code. At -O2, both are equally fast. The clang 1.1 code at -O2 is as fast as GCC 4.5's code at -O3.
Dominique d'Humieres 2010-10-26 14:59:18 UTC
You get this kind of speedup if the compiler knows that the result of the loop is

  sum = (b*(b-1) - a*(a-1))/2

in which case the timing is meaningless (it is 0.000s on my laptop), and so is the ratio with the execution of the loop. The basic question is: how much of the user's ignorance should be repaired by the optimizer? (A colleague of mine told me that he once audited a CFD code and found that \int_a^b dx/x was evaluated numerically instead of using log(b)-log(a)!-)

Julian Andres Klode 2010-10-26 15:00:37 UTC
(In reply to comment #5)
> I just tested: at -O3, gcc-4.4 creates slow code and gcc-4.5 fast code. At
> -O2, both are equally fast. The clang 1.1 code at -O2 is as fast as GCC 4.5's
> code at -O3.
Using gcc 4.5 with -O3 may work for the C code, but it does not optimize the C++ code posted at http://lwn.net/Articles/411776/:

  g++-4.5 -O3:  real 0m1.608s
  clang++ -O2:  real 0m0.003s
  clang++ -Os:  real 0m0.003s

But I guess the C++ problem might be a different one.

Julian Andres Klode 2010-10-26 15:25:56 UTC
(In reply to comment #6)
> You get this kind of speedup if the compiler knows that the result of the loop
> is sum=(b*(b-1)-a*(a-1))/2 [...]
Since the optimization seems to be mostly there at -O3, it's just a matter of enabling it at -O2. I also found out that gcc does not optimize the loop if f() is called via a global function pointer: it still takes 1.6 seconds despite being compiled at -O3, whereas clang optimizes it down to 0.001s.
Jonathan Wakely 2010-10-26 15:28:51 UTC
(In reply to comment #8)
> Since the optimization seems to be mostly there at -O3, it's just a matter of
> enabling it at -O2.
Or, if you want all optimisations, it's just a matter of using -O3. Expecting all optimisations to be done below the maximum optimisation level is ... unexpected.

Jakub Jelinek 2010-10-26 15:37:04 UTC
-O2 -fipa-cp-clone should in theory be enough; f would then normally be cloned, assuming gcc thinks the call is from a hot spot. But this is not a real-world testcase: the call from main, which happens just once for the process, is not considered a hot spot.

Paolo Carlini 2010-10-26 15:42:58 UTC
Can we please stop talking about nano and giga numbers like kids? If an optimization like complete loop unrolling is involved, of course very small or large numbers can appear; talking about the exact figure doesn't really contribute anything to the problem.

pinskia@gmail.com 2010-10-26 15:56:20 UTC
On Oct 26, 2010, at 7:30 AM, "jak@jak-linux.org" wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186
> Created attachment 22162 [details] Clang's assembler
This multiplication transformation is incorrect if the loop wraps (unsigned always wraps; never overflows). Gcc is correct in its speed. -O3 is faster, but not because of a multiplication transformation; rather because of constant propagation. So this bug is invalid, and someone should report a bug to the llvm folks about this.

Dominique d'Humieres 2010-10-26 16:36:05 UTC
> This multiplication transformation is incorrect if the loop wraps
> (unsigned always wraps; never overflows).
I think this is wrong: wrapping is nothing but a modulo 2^n operation (n=64 here), which "works" for additions and multiplications. So if there is wrapping, the result is sum=(b*(b-1)-a*(a-1))/2 modulo 2^n, i.e. correctly wrapped.

On my Core2duo 2.53GHz with -Ofast the run time is ~1.2s for 2*10^9 elementary loop iterations, or 0.6ns per iteration, or ~1.5 clock cycles per iteration. For perfect vectorization and no loop overhead, I would expect a minimum of 0.5 clock cycles per iteration. If you get anything below this number, it means that the loop

  for (; a < b; a++) sum += a;

is replaced with sum=(b*(b-1)-a*(a-1))/2 (you can confirm this by checking whether the timing behaves as O(len) or not). Apparently clang does this (valid) transformation while gcc doesn't, for any options I have tried. Note that if I write such a loop, it is because I am interested in the timing of the loop, not in the result, which I have known for more than 40 years!

Jakub Jelinek 2010-10-26 16:59:00 UTC
For sum += 2 or sum += b, sccp handles this, so I wonder whether it couldn't handle even the sum += a case. Sebastian?

Dominique d'Humieres 2010-10-26 17:15:31 UTC
> For sum += 2 or sum += b, sccp handles this, so I wonder whether it couldn't
> handle even the sum += a case.
2 and b are constants while a is not. For constants you have to know that the sum is cst*nloop, while if a is incremented you have to know that it is related to nloop*(nloop+1)/2 (and if you use sum += a*a, to nloop*(nloop+1)*(2*nloop+1)/6; if sum += a*a*a, to nloop^2*(nloop+1)^2/4; and so on). However, is it worth the work?

Jakub Jelinek 2010-10-26 18:43:40 UTC
chrec_apply is called with {a_4(D), +, {a_4(D) + 1, +, 1}_1}_1 as the chrec and ~a_4(D) + b_5(D) in x. I wonder if this can be fixed just by recognizing such special cases in chrec_apply (after checking, e.g., that it is an unsigned computation instead of a signed one, etc.).
Dominique d'Humieres 2010-10-26 18:53:49 UTC
Note that clang seems to know the general result: \sum_{i=a}^b p(i) = P(b), where p(i) is a given polynomial of degree n and P(x) a polynomial of degree n+1 such that P(x) = P(x-1) + p(x) and P(a) = p(a). At least clang is able to remove the loop for

  sum5 += a*a*a*a*a + 2*a*a*a + 5*a;

Jakub Jelinek 2010-10-26 19:11:59 UTC
I guess you mean LLVM instead of clang; I'm pretty sure the FE doesn't perform this optimization. Anyway, given:

#define F(n, exp) \
unsigned long \
f##n (unsigned long a, unsigned long b) \
{ \
  unsigned long sum = 0; \
  for (; a < b; a++) \
    exp; \
  return sum; \
}
F (1, sum += a)
F (2, sum += 2)
F (3, sum += b)
F (4, sum += a * a)
F (5, sum += a * a * a)
F (6, a * a * a * a * a + 2 * a * a * a + 5 * a)

only the f1/f2/f3 cases make it into chrec_apply (and only f2/f3 are currently handled there).

joseph@codesourcery.com 2010-10-26 20:29:56 UTC
On Tue, 26 Oct 2010, dominiq at lps dot ens.fr wrote:
> I think this is wrong: wrapping is nothing but a modulo 2^n operation (n=64
> here), which "works" for additions and multiplications. So if there is
> wrapping, the result is sum=(b*(b-1)-a*(a-1))/2 modulo 2^n, i.e. correctly
> wrapped.
It's a bit more complicated than that, in that you can't just compute (b*(b-1)-a*(a-1)) mod 2^n and then divide by 2, as that will lose the top bit. (I haven't checked exactly what the generated code is doing here.)
Jakub Jelinek 2010-10-26 21:00:11 UTC
If I translate the assembly back to C, it seems it is performing part of the arithmetic in TImode:

unsigned long
f (unsigned long a, unsigned long b)
{
  if (a >= b)
    return 0;
  else
    return (a + 1) * (b - 1 - a) + a
           + (unsigned long) (((unsigned __int128) (b - 1 - a)
                               * (b - 2 - a)) >> 1);
}

Dominique d'Humieres 2010-10-26 21:06:48 UTC
> I guess you mean LLVM instead of clang;
Yes, if you prefer. I was referring to the command I used.

> F (6, a * a * a * a * a + 2 * a * a * a + 5 * a)
You probably mean F (6, sum += a * a * a * a * a + 2 * a * a * a + 5 * a).

> It's a bit more complicated than that, in that you can't just compute
> (b*(b-1)-a*(a-1)) mod 2^n and then divide by 2, as that will lose the top
> bit. (I haven't checked exactly what the generated code is doing here.)
This is right if you do the multiply before the division. However, either b or b-1 can be divided exactly by 2, so you have to do (b/2)*(b-1) if b is even and b*((b-1)/2) if b is odd. The same applies to the sums of squares and cubes, although you may need one more trick if 2*b+1 wraps and can be divided exactly by 3. I have tested the following variant:

[macbook] f90/bug% cat pr46186_db.c
#include <stdio.h>
#define len 1000000000L
unsigned long f(unsigned long a, unsigned long b) __attribute__((noinline));
int main() {
  unsigned long res;
  res = f(2, 2*len);
  /* printf("%lu\n", res); */
  return 0;
}
unsigned long f(unsigned long a, unsigned long b) {
  unsigned long sum = 0, sum0 = 0, sum2 = 0, sum5 = 0;
  for (; a < b; a++) {
    sum0 += 2;
    sum += a;
    sum2 += a*a;
    sum5 += a*a*a*a*a + 2*a*a*a + 5*a;
  }
  printf("%lu\n", sum0);
  printf("%lu\n", sum);
  printf("%lu\n", sum2);
  printf("%lu\n", sum5);
  return sum;
}

which gives

[macbook] f90/bug% gcc46 -Ofast -ftree-vectorizer-verbose=2 pr46186_db.c
pr46186_db.c:18: note: LOOP VECTORIZED.
pr46186_db.c:15: note: vectorized 1 loops in function.
[macbook] f90/bug% time a.out
3999999996
1999999998999999999
10262176583330622975
11833798674328980984
18.114u 0.012s 0:18.15 99.8% 0+0k 0+0io 0pf+0w
[macbook] f90/bug% clang -O pr46186_db.c
[macbook] f90/bug% time a.out
3999999996
1999999998999999999
10262176583330622975
11833798674328980984
0.000u 0.000s 0:00.00 0.0% 0+0k 0+0io 0pf+0w

So the wrapping seems properly handled by the loop replacement. Now I have also tested the following variant:

[macbook] f90/bug% cat pr46186_db_1.c
#include <stdio.h>
#define len 1000000000L
unsigned long f(unsigned long a, unsigned long b) __attribute__((noinline));
int main() {
  unsigned long res;
  res = f(2, 2*len);
  /* printf("%lu\n", res); */
  return 0;
}
unsigned long f(unsigned long a, unsigned long b) {
  unsigned long sum = 0, sum0 = 0, sum2 = 0, sum5 = 0;
  unsigned long a2;
  for (; a < b; a++) {
    sum0 += 2;
    sum += a;
    a2 = a*a;
    sum2 += a2;
    sum5 += a*(a2*(a2 + 2) + 5);
  }
  printf("%lu\n", sum0);
  printf("%lu\n", sum);
  printf("%lu\n", sum2);
  printf("%lu\n", sum5);
  return sum;
}

and the timings are

[macbook] f90/bug% gcc46 -Ofast -ftree-vectorizer-verbose=2 pr46186_db_1.c
pr46186_db_1.c:19: note: LOOP VECTORIZED.
pr46186_db_1.c:15: note: vectorized 1 loops in function.
[macbook] f90/bug% time a.out
3999999996
1999999998999999999
10262176583330622975
11833798674328980984
12.227u 0.016s 0:12.36 98.9% 0+0k 0+0io 0pf+0w  <== was 18.114u 0.012s 0:18.15 without hand optimization
[macbook] f90/bug% clang -O pr46186_db_1.c
[macbook] f90/bug% time a.out
3999999996
1999999998999999999
10262176583330622975
11833798674328980984
0.000u 0.000s 0:00.00 0.0% 0+0k 0+0io 0pf+0w

So clearly gcc is missing some generic optimizations in products.

Thorsten Glaser 2010-10-29 21:06:44 UTC
(In reply to comment #19)
> It's a bit more complicated than that, in that you can't just compute
> (b*(b-1)-a*(a-1)) mod 2^n, then divide by 2, as that will lose the top
> bit.
> (I haven't checked exactly what the generated code is doing here.)
Haven't checked either, but in i386 asm, using RCR instead of SHR (or SAR) would work, if the carry flag is correct after the sequence calculating that.

sebpop@gmail.com 2010-10-29 21:44:09 UTC
Created attachment 22201 [details] 0001-Fix-PR46186-improve-chrec_apply.patch
Hi, here is a preliminary patch (not tested yet other than on the PR testcase). This patch improves chrec_apply to also handle these very uncommon cases that some like to make big titles about (I wonder if the guy who submitted this bug report is part of some marketing division... anyways). Note that for these cases:

F (4, sum += a * a)
F (5, sum += a * a * a)
F (6, sum += a * a * a * a * a + 2 * a * a * a + 5 * a)

although GCC with this patch knows how to transform these into end-of-loop values, GCC won't change them, because of this heuristic:

  /* Do not emit expensive expressions.  The rationale is that
     when someone writes a code like

     while (n > 45) n -= 45;

     he probably knows that n is not large, and does not want it
     to be turned into n %= 45.  */
  || expression_expensive_p (def))

One also needs to set --param scev-max-expr-size to a pretty big value for f6 to pass the fold steps.

Sebastian

joseph@codesourcery.com 2010-10-29 21:58:46 UTC
On Fri, 29 Oct 2010, sebpop at gmail dot com wrote:
> here is a preliminary patch (not tested yet other than on the PR testcase).
How does this patch deal with needing to compute in a wider type to avoid losing high bits before the division; is that covered by some existing code? It could certainly do with execution testcases that verify cases where computing everything naively in the original type would yield an incorrect result (and ones where computing in a type of double the width would still overflow, etc.).
Raphael C 2017-05-24 13:36:31 UTC
If I understand this PR correctly, this simpler code shows the same issue:

unsigned long f(unsigned long a)
{
  unsigned long sum = 0;
  for (; a < 1000000000; a++)
    sum += a;
  return sum;
}

With gcc 7.1 and -O3 -march=native I get:

f:
        cmp     rdi, 999999999
        ja      .L7
        mov     eax, 999999999
        mov     ecx, 1000000000
        sub     rax, rdi
        sub     rcx, rdi
        cmp     rax, 7
        jbe     .L8
        vmovq   xmm3, rdi
        mov     rdx, rcx
        vpxor   xmm0, xmm0, xmm0
        xor     eax, eax
        vpbroadcastq ymm1, xmm3
        vmovdqa ymm2, YMMWORD PTR .LC1[rip]
        vpaddq  ymm1, ymm1, YMMWORD PTR .LC0[rip]
        shr     rdx, 2
.L4:
        add     rax, 1
        vpaddq  ymm0, ymm0, ymm1
        vpaddq  ymm1, ymm1, ymm2
        cmp     rax, rdx
        jb      .L4
        vpxor   xmm1, xmm1, xmm1
        mov     rdx, rcx
        vperm2i128 ymm2, ymm0, ymm1, 33
        and     rdx, -4
        vpaddq  ymm0, ymm0, ymm2
        add     rdi, rdx
        vperm2i128 ymm1, ymm0, ymm1, 33
        vpalignr ymm1, ymm1, ymm0, 8
        vpaddq  ymm0, ymm0, ymm1
        vmovq   rax, xmm0
        cmp     rcx, rdx
        je      .L33
        vzeroupper
.L3:
        lea     rdx, [rdi+1]
        add     rax, rdi
        cmp     rdx, 1000000000
        je      .L31
        add     rax, rdx
        lea     rdx, [rdi+2]
        cmp     rdx, 1000000000
        je      .L31
        add     rax, rdx
        lea     rdx, [rdi+3]
        cmp     rdx, 1000000000
        je      .L31
        add     rax, rdx
        lea     rdx, [rdi+4]
        cmp     rdx, 1000000000
        je      .L31
        add     rax, rdx
        lea     rdx, [rdi+5]
        cmp     rdx, 1000000000
        je      .L31
        add     rax, rdx
        lea     rdx, [rdi+6]
        cmp     rdx, 1000000000
        je      .L31
        add     rax, rdx
        add     rdi, 7
        lea     rdx, [rax+rdi]
        cmp     rdi, 1000000000
        cmovne  rax, rdx
        ret
.L7:
        xor     eax, eax
.L31:
        ret
.L33:
        vzeroupper
        ret
.L8:
        xor     eax, eax
        jmp     .L3

However with clang I get:

f:                              # @f
        cmp     rdi, 999999999
        ja      .LBB0_1
        mov     eax, 999999999
        sub     rax, rdi
        lea     rcx, [rdi + 1]
        imul    rcx, rax
        mov     edx, 999999998
        sub     rdx, rdi
        mul     rdx
        shl     rdx, 63
        shr     rax
        or      rax, rdx
        add     rcx, rdi
        add     rcx, rax
        mov     rax, rcx
        ret
.LBB0_1:
        xor     ecx, ecx
        mov     rax, rcx
        ret

which is much simpler and avoids looping altogether. What is the current status of this (very old) PR? Do people think it is worth addressing?

AK 2017-11-27 17:50:20 UTC
Seems PR65855 is related to this.
btw, it may be worthwhile to try the patch posted by Sebastian https://gcc.gnu.org/bugzilla/attachment.cgi?id=22201