| Summary: | gcc doesn't unroll nested loops | | |
|---|---|---|---|
| Product: | gcc | Reporter: | Benoit Jacob <jacob> |
| Component: | middle-end | Assignee: | Not yet assigned to anyone <unassigned> |
| Status: | NEW | | |
| Severity: | enhancement | CC: | bangerth, fang, gcc-bugs, rakdver, rguenth |
| Priority: | P3 | Keywords: | missed-optimization |
| Version: | 4.1.1 | | |
| Target Milestone: | --- | | |
| Host: | | Target: | |
| Build: | | Known to work: | |
| Known to fail: | | Last reconfirmed: | 2006-12-13 16:33:34 |
Description
Benoit Jacob
2006-12-13 16:22:32 UTC
Actually there are two issues with this bug. One issue is that we change (float)(CMP) into CMP ? 1.0 : 0.0 early on, which in the end causes us to produce three different versions of the inner loop. If we write the inner loop as:

for( int i = 0; i < 3; i++ )
    for( int j = 0; j < 3; j++ )
    {
        bool a = (i == j);
        (*this)(i, j) = a * factor;
    }

and use -O3 -funroll-loops, we get the correctly unrolled loop (though the multiply is still there because of signed zeros). The second issue is that unrolling the outer loop does not happen at the tree level.

Not understanding much in compiler stuff, I tried what you said, namely replaced the loop with

for( int i = 0; i < 3; i++ )
    for( int j = 0; j < 3; j++ )
    {
        bool a = (i == j);
        (*this)(i, j) = a * factor;
    }

and compiled with -O3 -funroll-loops, and toto ran in 1.1 seconds. That's better than the original 1.9 seconds, but still far from the 0.3 seconds I get manually unrolling the loop. Do you think there is a chance to have this bug fixed in 4.2.x or 4.3.x?

(In reply to comment #2)
> than the original 1.9 seconds, but still far from the 0.3 seconds I get
> manually unrolling the loop.

Try with -ffast-math; you should get the same speed. Again, the problem is with signed zeros: they must be ignorable to get down to what you have with the manually unrolled loop.

Subject: Re: gcc doesn't unroll nested loops
> Not understanding much in compiler stuff, I tried what you said, namely replace
> the loop with
>
> for( int i = 0; i < 3; i++ )
> for( int j = 0; j < 3; j++ )
> {
> bool a = (i == j);
> (*this)(i, j) = a * factor;
> }
>
> and compile with -O3 -funroll-loops, and toto ran in 1.1 seconds. That's better
> than the original 1.9 seconds, but still far from the 0.3 seconds I get
> manually unrolling the loop.
>
> Do you think there is a chance to have this bug fixed in 4.2.x or 4.3.x ?
most likely not in 4.2. I think something could be done in 4.3 timeframe.
Nope... with -O3 -ffast-math I get 1.9 seconds on average (this is a laptop with CPU frequency scaling, so it's difficult to get precise numbers). Adding -funroll-loops on top of -ffast-math doesn't seem to make a difference. We're very far from the 0.3 seconds I get with -DUNROLL.

Also, trying -O3 -funroll-loops again, I get 1.9 seconds again, so I think -funroll-loops didn't make any difference and I had been fooled by CPU frequency scaling.

The problem with the multiplication is not important to me; it's just something I used in this example. I could as well have written

for( int i = 0; i < 3; i++ )
    for( int j = 0; j < 3; j++ )
        (*this)(i, j) = (i == j) ? factor : 0;

But this turns out to be even slower. I presume that's because, as the loops don't both get unrolled, the test (i == j) ? : produces branches at run-time.

Anyway, thanks for being supportive and having looked into my problem. May I ask again: can I hope for a fully-unrolling-nested-loops g++ in the near future?

(In reply to comment #4)
> most likely not in 4.2. I think something could be done in 4.3 timeframe.

Ah, ok. Thanks for the information.

I've reduced the code a bit by stripping all C++isms and collapsing everything into a single function:

-----------------------------------
double data[9];

int main()
{
  for( int factor = 0; factor < 1000000000; factor++ )
    {
#ifdef UNROLL
      data[ 0 + 3 * 0 ] = factor;
      data[ 1 + 3 * 0 ] = 0;
      data[ 2 + 3 * 0 ] = 0;
      data[ 0 + 3 * 1 ] = 0;
      data[ 1 + 3 * 1 ] = factor;
      data[ 2 + 3 * 1 ] = 0;
      data[ 0 + 3 * 2 ] = 0;
      data[ 1 + 3 * 2 ] = 0;
      data[ 2 + 3 * 2 ] = factor;
#else
      for( int i = 0; i < 3; i++ )
        for( int j = 0; j < 3; j++ )
          data[i + 3 * j] = (i == j) * factor;
#endif
    }
}
-----------------------------------

For this, we get these results:

g/x> gcc x.c -O3 -funroll-loops -std=c99
g/x> time ./a.out
real    0m9.380s
user    0m9.381s
sys     0m0.000s

g/x> gcc x.c -O3 -funroll-loops -std=c99 -DUNROLL
g/x> time ./a.out
real    0m0.420s
user    0m0.420s
sys     0m0.000s

W.
Here is an analysis of the assembler code we get when using the first command line in my previous comment, i.e. no hand unrolling. I'm using 4.1.0, btw. The main loop looks like this:

--------------------------
.L2:
    pushl   %edx            // push 'factor'
    xorl    %eax, %eax      // eax = 0
    fildl   (%esp)          // st(0) = (double)factor
    addl    $1, %edx        // ++factor
    fstl    data            // data[0] = factor
    movl    %eax, (%esp)    // push 0
    fildl   (%esp)          // st(0) = 0
    addl    $4, %esp
    cmpl    $1000000000, %edx
    fstl    data+24         // data[3] = 0
    fstl    data+48         // data[6] = 0
    fstl    data+8          // data[1] = 0
    fxch    %st(1)          // st(0) = factor
    fstl    data+32         // data[4] = factor
    fxch    %st(1)          // st(0) = 0
    fstl    data+56         // data[7] = 0
    fstl    data+16         // data[2] = 0
    fstpl   data+40         // data[5] = 0; st(0) = factor
    fstpl   data+64         // data[8] = factor
    jne     .L2
--------------------------

I can find several things wrong with this:

a/ The sequence

    xorl    %eax, %eax
    movl    %eax, (%esp)
    fildl   (%esp)

could certainly be made more efficient by using fldz.

b/ I find the use of fstpl at the end of the loop quite ingenious, since it avoids another fxch. However, the two uses of fxch in the middle could nevertheless be avoided if we managed to realize that we can reorder all those stores.

So, in summary, it is not that gcc doesn't realize that it can unroll these loops -- it actually does; the slowdown comes from other places.

W.

Hm, now, with -DUNROLL, gcc realizes that it writes to data[] all the time with no reads, and moves all the writes out of the loop. I suppose it's no surprise that the code is fast (the loop itself looks pretty atrocious, though, and is not collapsed to anything reasonable). Of course, this is what should have happened with the original code as well.

W.
I get on the trunk with -O2 -funroll-loops:

---------------------
main:
.LFB2:
    xorl    %eax, %eax
    .p2align 4,,7
.L2:
    cvtsi2sd %eax, %xmm0
    addl    $1, %eax
    cmpl    $1000000000, %eax
    movq    $0, data+24(%rip)
    movq    $0, data+48(%rip)
    movq    $0, data+8(%rip)
    movq    $0, data+56(%rip)
    movq    $0, data+16(%rip)
    movq    $0, data+40(%rip)
    movsd   %xmm0, data(%rip)
    movsd   %xmm0, data+32(%rip)
    movsd   %xmm0, data+64(%rip)
    jne     .L2
    xorl    %eax, %eax
    ret
---------------------

It doesn't look like we can do better without cheating and moving the benchmark-loop-invariant movq $0 stores out of the loop ;) The UNROLL variant looks like:

---------------------
.L2:
    leal    8(%rdx), %ecx
    addl    $9, %edx
    movq    $0, data+8(%rip)
    cmpl    $1000000000, %edx
    movq    $0, data+16(%rip)
    movq    $0, data+24(%rip)
    cvtsi2sd %ecx, %xmm0
    movq    $0, data+40(%rip)
    movq    $0, data+48(%rip)
    movq    $0, data+56(%rip)
    movsd   %xmm0, data(%rip)
    movsd   %xmm0, data+32(%rip)
    movsd   %xmm0, data+64(%rip)
    jne     .L2
---------------------

Very interesting, thanks... so does it mean that gcc did loop unrolling after all? (Sorry, I'm a newbie when it comes to compiler/assembler stuff.) And was the speed difference only caused by the compiler understanding that it didn't need to loop 1,000,000,000 times? What happened to the idea that gcc didn't unroll nested loops? So does it unroll them after all?

- Mr. Newbie

Subject: Re: gcc doesn't unroll nested loops

> Very interesting, thanks... so does it mean that gcc did loop unrolling after
> all? (Sorry, I'm a newbie when it comes to compiler/assembler stuff.)

Yes.

> And was the speed difference only caused by the compiler understanding that it
> didn't need to loop 1,000,000,000 times?

Yes. However, all this is only for my reduced testcase without the use of the C++ class. For the original testcase, the issues Andrew P. identified are still true.

W.
-------------------------------------------------------------------------
Wolfgang Bangerth                email:            bangerth@math.tamu.edu
                                 www: http://www.math.tamu.edu/~bangerth/

(In reply to comment #12)
> Yes. However, all this is only for my reduced testcase without the use of
> the C++ class.
> For the original testcase, the issues Andrew P. identified are still true.

OK, so if I understand correctly, the problem that "gcc doesn't unroll nested loops" exists only for C++, not for C? That seems strange to me, but then again I don't know much about compilers.

So now I need to make a decision for my C++ project. I am planning to unfold nested loops into normal loops, for instance replacing

for( col = 0; col < size(); col++ )
    for( row = 0; row < size(); row++ )
    {
        ...
    }

with

for( int foo = 0; foo < size() * size(); foo++ )
{
    col = foo / size();
    row = foo % size();
    ...
}

Do you think this will help solve my problem? Do you see a better solution?

I forgot to say that size() is an inline function returning a constant that is known at compile time (a template parameter). Otherwise, of course, I wouldn't expect these loops to get unrolled...

If the loop bounds are compile-time constants you can use template metaprogramming to unroll them.

(In reply to comment #15)
> If the loop bounds are compile-time constants you can use template
> metaprogramming to unroll them.

That is true; I will think about that. I was also mentioning my loop-unnesting idea to see if you immediately saw something wrong with it, but since you seem to agree that loop-unnesting is the way to go, I'll go ahead with it.

(In reply to comment #15)
> If the loop bounds are compile-time constants you can use template
> metaprogramming to unroll them.

Ah right, I initially misread "unroll" as "unnest", but it is true that I can unroll loops with template metaprogramming. Then there is no need to unnest.

(In reply to comment #15)
> If the loop bounds are compile-time constants you can use template
> metaprogramming to unroll them.

I shouldn't elaborate on this as it is not the subject of this bug report, but anyway...
I can't do template unrolling, because my template classes are actually part of a complicated CRTP ("curiously recurring template pattern") where size() is a wrapper around a method in a template parameter, and depending on this parameter, size() may or may not be known at compile time.

The idea is that when you write a library for matrices/vectors, it's always a dilemma whether the size (dimension) of the matrices/vectors should be a template parameter (hence fixed at compile time) or a variable. In the first approach the coords of the matrix/vector are stored in a T array[Size]; in the second approach they are dynamically allocated with array = new T[Size]. Both approaches have their pros and cons, and the "killer feature" of my library is that it allows both with the same underlying C++ code. This is where the CRTP is useful. So size() returns a compile-time constant in the first approach, and a variable in the second one.

And I would like the loops to get unrolled only in the first approach (and, if possible, only if the size is not too large). So template metaprogramming for that, while probably possible, would be much hassle.