Bug 30201

Summary: gcc doesn't unroll nested loops
Product: gcc Reporter: Benoit Jacob <jacob>
Component: middle-endAssignee: Not yet assigned to anyone <unassigned>
Status: NEW ---    
Severity: enhancement CC: bangerth, fang, gcc-bugs, rakdver, rguenth
Priority: P3 Keywords: missed-optimization
Version: 4.1.1   
Target Milestone: ---   
Host: Target:
Build: Known to work:
Known to fail: Last reconfirmed: 2006-12-13 16:33:34

Description Benoit Jacob 2006-12-13 16:22:32 UTC
While developing a Free C++ library, I am facing what I think is a bug in gcc: it doesn't unroll nested loops. Namely, in the example program that I paste below, there is a nested loop like

for( int i = 0; i < 3; i++ )
  for( int j = 0; j < 3; j++ )
    do_something( i, j );

and only the inner loop gets completely unrolled (the loop on j), the outer loop (on i) is only partially unrolled (this is according to I. L. Taylor on gcc@gcc.gnu.org, I don't have the skill to read the binary code).

This is a huge problem for me, not only a detail, as the performance of my code is about 15% of what it would be if the loops got unrolled.

I cannot unroll loops by hand because this is a template library and the loops depend on template parameters.

I have made a minimal standalone example program. I paste it below (toto.cpp).

This program does a nested loop if UNROLL is not defined, and does the same thing but with the loops unrolled by hand if UNROLL is defined. On my machine, the speed difference is huge:

g++ -DUNROLL -O3 toto.cpp -o toto   ---> toto runs in 0.3 seconds
g++ -O3 toto.cpp -o toto            ---> toto runs in 1.9 seconds

Again, this is not an academic example but something found in a real library, Eigen (http://eigen.tuxfamily.org). Granted, it's a math library, but still one that is used in real apps, so fixing this gcc bug would benefit real apps.

So here is the example program toto.cpp:

-----------------------------------------------------------------------

#include<iostream>

class Matrix
{
public:
    double data[9];
    double & operator()( int i, int j )
    {
        return data[i + 3 * j];
    }
    void loadScaling( double factor );
};

void Matrix::loadScaling( double factor)
{
#ifdef UNROLL
    (*this)( 0, 0 ) = factor;
    (*this)( 1, 0 ) = 0;
    (*this)( 2, 0 ) = 0;
    (*this)( 0, 1 ) = 0;
    (*this)( 1, 1 ) = factor;
    (*this)( 2, 1 ) = 0;
    (*this)( 0, 2 ) = 0;
    (*this)( 1, 2 ) = 0;
    (*this)( 2, 2 ) = factor;
#else
    for( int i = 0; i < 3; i++ )
        for( int j = 0; j < 3; j++ )
            (*this)(i, j) = (i == j) * factor;
#endif
}

int main( int argc, char *argv[] )
{
    Matrix m;
    for( int i = 0; i < 100000000; i++ )
        m.loadScaling( i );
    std::cout << "m(0,0) = " << m(0,0) << std::endl;
    return 0;
}
Comment 1 Andrew Pinski 2006-12-13 16:33:33 UTC
Actually there are two issues with this bug, one issue is that we change (float)(CMP) into CMP?1.0:0.0 early on which in the end causes us to produce three different versions of the inner loop.  If we write the inner loop as:
    for( int i = 0; i < 3; i++ )
        for( int j = 0; j < 3; j++ )
        {
            bool a = (i == j);
            (*this)(i, j) = a * factor;
        }

And use -O3 -funroll-loops, we get the correct unrolled looped (though the multiply is still there because of signed zeros).

And the second issue is unrolling the outer loop does not happen on the tree level.
Comment 2 Benoit Jacob 2006-12-13 18:54:06 UTC
Not understanding much in compiler stuff, I tried what you said, namely replace the loop with

    for( int i = 0; i < 3; i++ )
        for( int j = 0; j < 3; j++ )
        {
            bool a = (i == j);
            (*this)(i, j) = a * factor;
        }

and compile with -O3 -funroll-loops, and toto ran in 1.1 seconds. That's better than the original 1.9 seconds, but still far from the 0.3 seconds I get manually unrolling the loop.

Do you think there is a chance to have this bug fixed in 4.2.x or 4.3.x ?
Comment 3 Andrew Pinski 2006-12-13 18:56:12 UTC
(In reply to comment #2)
> than the original 1.9 seconds, but still far from the 0.3 seconds I get
> manually unrolling the loop.

Try with -ffast-math, you should get the same speed.  Again the problem is with signed zeros to be able to get down to what you have with the manually unrolled loop.

Comment 4 Zdenek Dvorak 2006-12-13 20:22:06 UTC
Subject: Re:  gcc doesn't unroll nested loops

> Not understanding much in compiler stuff, I tried what you said, namely replace
> the loop with
> 
>     for( int i = 0; i < 3; i++ )
>         for( int j = 0; j < 3; j++ )
>         {
>             bool a = (i == j);
>             (*this)(i, j) = a * factor;
>         }
> 
> and compile with -O3 -funroll-loops, and toto ran in 1.1 seconds. That's better
> than the original 1.9 seconds, but still far from the 0.3 seconds I get
> manually unrolling the loop.
> 
> Do you think there is a chance to have this bug fixed in 4.2.x or 4.3.x ?

most likely not in 4.2.  I think something could be done in 4.3 timeframe.
Comment 5 Benoit Jacob 2006-12-13 20:22:41 UTC
Nope... with -O3 -ffast-math I get 1.9 seconds in average (this is a laptop with CPU frequency scaling, so it's difficult to get precise numbers). Adding -funroll-loops in addition to -ffast-math doesn't seem to make a difference. We're very far from the 0.3 seconds I get with -DUNROLL.

Also, trying again -O3 -funroll-loops, I get again 1.9 seconds, so I think -funroll-loops didn't make any difference and I had been fooled by CPU frequency scaling.

The problem with the multiplication is not important to me, it's just something I used in this example. I could as well have written

    for( int i = 0; i < 3; i++ )
        for( int j = 0; j < 3; j++ )
            (*this)(i, j) = (i == j) ? factor : 0;

But this turns out to be even slower. I presume that's because, as the loops don't get both unrolled, the test i==j ?: makes branches at run-time.

Anyway thanks for being supportive and having looked into my problem. May I ask again, can I hope for a fully-unrolling-nested-loops g++ in the near future?
Comment 6 Benoit Jacob 2006-12-13 20:24:13 UTC
(In reply to comment #4)
> most likely not in 4.2.  I think something could be done in 4.3 timeframe.

ah, ok. Thanks for the information. 

Comment 7 Wolfgang Bangerth 2006-12-14 15:16:00 UTC
I've reduced the code a bit by stripping all C++isms and collapsing
everything into a single function:
-----------------------------------
double data[9];

int main()
{
  for( int factor = 0; factor < 1000000000; factor++ )
    {
#ifdef UNROLL
      data[ 0 + 3 * 0 ] = factor;
      data[ 1 + 3 * 0 ] = 0;
      data[ 2 + 3 * 0 ] = 0;
      data[ 0 + 3 * 1 ] = 0;
      data[ 1 + 3 * 1 ] = factor;
      data[ 2 + 3 * 1 ] = 0;
      data[ 0 + 3 * 2 ] = 0;
      data[ 1 + 3 * 2 ] = 0;
      data[ 2 + 3 * 2 ] = factor;
#else
      for( int i = 0; i < 3; i++ )
        for( int j = 0; j < 3; j++ )
	  data[i + 3 * j] = (i == j) * factor;
#endif
    }
}
-----------------------------

For this, we get these results:
g/x> gcc x.c -O3 -funroll-loops -std=c99
g/x> time ./a.out 

real    0m9.380s
user    0m9.381s
sys     0m0.000s

g/x> gcc x.c -O3 -funroll-loops -std=c99 -DUNROLL
g/x> time ./a.out 

real    0m0.420s
user    0m0.420s
sys     0m0.000s

W.
Comment 8 Wolfgang Bangerth 2006-12-14 15:35:50 UTC
Here is an analysis of the assembler code we get when using
my first command line in my previous comment, i.e. no hand unrolling.
I'm using 4.1.0, btw.

The main loop looks like this:
--------------------------
.L2:
	pushl	%edx            // push 'factor'
	xorl	%eax, %eax      // eax=0
	fildl	(%esp)          // st(0)=(double)factor
	addl	$1, %edx        // ++factor
	fstl	data            // data[0]=factor
	movl	%eax, (%esp)    // push 0
	fildl	(%esp)          // st(0)=0
	addl	$4, %esp
	cmpl	$1000000000, %edx
	fstl	data+24         // data[3]=0
	fstl	data+48         // data[6]=0
	fstl	data+8          // data[1]=0
	fxch	%st(1)          // st(0)=factor
	fstl	data+32         // data[4]=factor
	fxch	%st(1)          // st(0)=0
	fstl	data+56         // data[7]=0
	fstl	data+16         // data[2]=0
	fstpl	data+40         // data[5]=0; st(0)=factor
	fstpl	data+64         // data[8]=factor
	jne	.L2
---------------------

I can find several things wrong with this:
a/ the sequence
    xorl        %eax, %eax
    movl	%eax, (%esp)
    fildl	(%esp)
   could certainly be made more efficient by using fldz.
b/ I find the use of fstpl at the end of the loop quite ingenious, since
   it avoids another fxch. However, the two uses of fxch in the middle
   may nevertheless be avoided if we manage to realize that we can
   reorder all those stores. 

So, in summary, it is not that gcc doesn't realize that it can unroll
these loops -- it actually does that, the slowdown comes from other places.

W.
Comment 9 Wolfgang Bangerth 2006-12-14 15:41:32 UTC
Hm, now, with -DUNROLL, gcc realizes that it writes to data[] all the
times with no reads and moves all the writes out of the loop. I suppose
it's no surprise that the code is fast (the loop looks pretty atrocious,
though, and is not collapsed to anything reasonable, though).

Of course, this is what should have happened with the original code as
well.

W.
Comment 10 Richard Biener 2006-12-14 15:46:41 UTC
I get on the trunk with -O2 -funroll-loops

main:
.LFB2:
        xorl    %eax, %eax
        .p2align 4,,7
.L2:
        cvtsi2sd        %eax, %xmm0
        addl    $1, %eax
        cmpl    $1000000000, %eax
        movq    $0, data+24(%rip)
        movq    $0, data+48(%rip)
        movq    $0, data+8(%rip)
        movq    $0, data+56(%rip)
        movq    $0, data+16(%rip)
        movq    $0, data+40(%rip)
        movsd   %xmm0, data(%rip)
        movsd   %xmm0, data+32(%rip)
        movsd   %xmm0, data+64(%rip)
        jne     .L2
        xorl    %eax, %eax
        ret

it doesn't look like we can do better w/o cheating and moving the
benchmark-loop-invariant movq $0 out of the loop ;)

The UNROLL variant looks like

.L2:
        leal    8(%rdx), %ecx
        addl    $9, %edx
        movq    $0, data+8(%rip)
        cmpl    $1000000000, %edx
        movq    $0, data+16(%rip)
        movq    $0, data+24(%rip)
        cvtsi2sd        %ecx, %xmm0
        movq    $0, data+40(%rip)
        movq    $0, data+48(%rip)
        movq    $0, data+56(%rip)
        movsd   %xmm0, data(%rip)
        movsd   %xmm0, data+32(%rip)
        movsd   %xmm0, data+64(%rip)
        jne     .L2
Comment 11 Benoit Jacob 2006-12-14 15:56:17 UTC
Very interesting, thanks... so does it mean that gcc did loop unrolling after all? (sorry, I'm a newbie when it comes to compiler/assembler stuff).

And the speed difference was only caused by the compiler understanding that it didn't need to loop 100,000,000 times?

What happened to the idea that gcc didn't unroll nested loops? So does it unroll them after all?

- Mr. Newbie
Comment 12 bangerth@math.tamu.edu 2006-12-14 16:08:37 UTC
Subject: Re:  gcc doesn't unroll nested loops


> Very interesting, thanks... so does it mean that gcc did loop unrolling after
> all? (sorry, I'm a newbie when it comes to compiler/assembler stuff).

Yes.


> And the speed difference was only caused by the compiler understanding that it
> didn't need to loop 100,000,000 times?

Yes. However, all this is only for my reduced testcase without the use of
the C++ class. For the original testcase, the issues Andrew P. identified
are still true.

W.

-------------------------------------------------------------------------
Wolfgang Bangerth                email:            bangerth@math.tamu.edu
                                 www: http://www.math.tamu.edu/~bangerth/

Comment 13 Benoit Jacob 2006-12-14 16:22:16 UTC
(In reply to comment #12)
> Yes. However, all this is only for my reduced testcase without the use of
> the C++ class. For the original testcase, the issues Andrew P. identified
> are still true.

OK, so if I understand well the problem that "gcc doesn't unroll nested loops" exists only for C++, not for C ? That seems so strange to me, but then again I don't know much about compilers.

So now I need to make a decision for my C++ project. I am planning to unfold nested loops into normal loops, for instance replace

for( col = 0; col < size(); col++)
   for( row = 0; row < size(); row++)
   {
     ...
   }

with

for( int foo = 0; foo < size() * size(); foo++)
{
   col = foo / size();
   row = foo % size();
   ...
}

Do you think this will help solve my problem? Do you see a better solution?
Comment 14 Benoit Jacob 2006-12-14 16:24:02 UTC
I forgot to say that size() is an inline function returning a constant that is known at compile-time (a template parameter). Otherwise, of course, I wouldn't expect these loops to get unrolled...

Comment 15 Richard Biener 2006-12-14 16:24:37 UTC
If the loop bounds are compile-time constants you can use template metaprogramming to unroll them.
Comment 16 Benoit Jacob 2006-12-14 16:28:46 UTC
(In reply to comment #15)
> If the loop bounds are compile-time constants you can use template
> metaprogramming to unroll them.
> 

That is true, I will think about that.

I was also mentionning my loop-unnesting idea to see if you saw immediately something wrong with it, but since you seem to agree that loop-unnesting is the way to go, I'll go ahead with it.
Comment 17 Benoit Jacob 2006-12-14 16:34:19 UTC
(In reply to comment #15)
> If the loop bounds are compile-time constants you can use template
> metaprogramming to unroll them.
> 

ah right, I initially misread "unroll" as "unnest", but it is true that I can unroll loops with template metaprogramming. Then, no need to unnest.
Comment 18 Benoit Jacob 2006-12-14 16:47:03 UTC
(In reply to comment #15)
> If the loop bounds are compile-time constants you can use template
> metaprogramming to unroll them.
> 

I shouldn't elaborate on this as this is not the subject of this bug report, but anyway...

I can't do template unrolling, because actually my template classes are part of a complicated CRTP ("curiously recurring template pattern") where size() is a wrapper around a method in a template parameter, and depending on this parameter, size() may or may not be known at compile time.

The idea is that when you write a library for matrices/vectors, it's always a dilemma whether the size (dimension) of the matrices/vectors should be a template parameter(hence fixed at compile-time) or a variable. In the first approach the coords of the matrix/vector are stored in a T array[Size], in the second approach they are dynamically allocated with array=new T[Size]. Both approaches have their pro's and con's, and the "killer feature" of my library is that it allows both with the same underlying C++ code. This is where the CRTP is useful. So size() returns a compile-time constant in the first approach, and a variable in the second one. And I would like the loops to get unrolled only in the first approach (and, if possible, only if the size is not too large). So template metaprogramming for that, while probably possible, would be much hassle.