This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


gcc 4.4.0 loop-unrolling optimizations peculiarity observed

From: martin krastev <blu dot dark at gmail dot com>
To: gcc at gcc dot gnu dot org
Date: Wed, 28 Jan 2009 23:40:54 -0500
Subject: gcc 4.4.0 loop-unrolling optimizations peculiarity observed

gcc version: powerpc-apple-darwin8.11.0-gcc-4.4.0 (GCC) 4.4.0 20090116
(experimental)
This is a MacPorts (formerly DarwinPorts) build of gcc 4.4.0 on an
OS X 10.4.11 ppc7450 host.

The following C++ function produces different code depending on whether
the 'loop_Ai' or the 'direct_assignment_Ai' snippet is used:

float a[4][4] __attribute__ ((aligned (16)));
float b[4][4] __attribute__ ((aligned (16)));

float c[4][4] __attribute__ ((aligned (16)));

inline static void
mmul(
    float (&c)[4][4],
    const float (&a)[4][4],
    const float (&b)[4][4])
{
    // iterate by product's rows
    for (unsigned i = 0; i < 4; i++)
    {
        register float ai[4][4];

        // swizzle each element of the i-th row of A into a full-dimensional vector
        for (unsigned j = 0; j < 4; j++)

// direct_assignment_Ai:
/*          ai[j][0] = ai[j][1] = ai[j][2] = ai[j][3] = a[i][j];
*/
// loop_Ai:
            for (unsigned k = 0; k < 4; k++)
                ai[j][k] = a[i][j];

        // multiply the first element of the i-th row of A by the first row of B
        for (unsigned k = 0; k < 4; k++)
        {
            c[i][k] = ai[0][k] * b[0][k];
        }

        // multiply-add all subsequent elements of the i-th row of A by their respective rows of B
        for (unsigned j = 1; j < 4; j++)
        {
            for (unsigned k = 0; k < 4; k++)
            {
                c[i][k] += ai[j][k] * b[j][k];
            }
        }
    }
}


I observe a ~10% performance degradation when using 'loop_Ai' instead of
'direct_assignment_Ai'. From what I can tell, the differences in the
generated ppc code are mainly in instruction scheduling.
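
Since both snippets are meant to compute the same product, any timing
difference should come from code generation alone. The following is a
minimal sketch for confirming that equivalence before comparing timings.
The names mmul_loop, mmul_direct and variants_agree are hypothetical (not
part of the original test app), the 'register' keyword is dropped because
newer C++ standards reject it, and the inputs are small integers chosen so
that every product and sum is exactly representable in float:

```cpp
#include <cassert>

// 'loop_Ai' variant: a[i][j] is splatted across ai[j] with an inner loop.
static void mmul_loop(float (&c)[4][4],
                      const float (&a)[4][4],
                      const float (&b)[4][4])
{
    for (unsigned i = 0; i < 4; i++)
    {
        float ai[4][4];
        for (unsigned j = 0; j < 4; j++)
            for (unsigned k = 0; k < 4; k++)
                ai[j][k] = a[i][j];

        for (unsigned k = 0; k < 4; k++)
            c[i][k] = ai[0][k] * b[0][k];

        for (unsigned j = 1; j < 4; j++)
            for (unsigned k = 0; k < 4; k++)
                c[i][k] += ai[j][k] * b[j][k];
    }
}

// 'direct_assignment_Ai' variant: same splat, written as a chained assignment.
static void mmul_direct(float (&c)[4][4],
                        const float (&a)[4][4],
                        const float (&b)[4][4])
{
    for (unsigned i = 0; i < 4; i++)
    {
        float ai[4][4];
        for (unsigned j = 0; j < 4; j++)
            ai[j][0] = ai[j][1] = ai[j][2] = ai[j][3] = a[i][j];

        for (unsigned k = 0; k < 4; k++)
            c[i][k] = ai[0][k] * b[0][k];

        for (unsigned j = 1; j < 4; j++)
            for (unsigned k = 0; k < 4; k++)
                c[i][k] += ai[j][k] * b[j][k];
    }
}

// Returns true if both variants match each other and a plain
// row-by-column reference product on small integer-valued inputs.
static bool variants_agree()
{
    float a[4][4], b[4][4], c_loop[4][4], c_direct[4][4];

    for (unsigned i = 0; i < 4; i++)
        for (unsigned j = 0; j < 4; j++)
        {
            a[i][j] = float(i * 4 + j + 1);            // 1 .. 16
            b[i][j] = float(int((i + j) % 5) - 2);     // -2 .. 2
        }

    mmul_loop(c_loop, a, b);
    mmul_direct(c_direct, a, b);

    for (unsigned i = 0; i < 4; i++)
        for (unsigned k = 0; k < 4; k++)
        {
            float ref = 0.0f;
            for (unsigned j = 0; j < 4; j++)
                ref += a[i][j] * b[j][k];

            if (c_loop[i][k] != ref || c_direct[i][k] != ref)
                return false;
        }
    return true;
}
```

Calling variants_agree() from the test app's main() before the timing
loop would rule out a correctness difference between the two snippets.
Exact comparison is only safe here because of the integer-valued inputs;
with arbitrary data, -ffast-math and fused multiply-adds may change the
rounding.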

The following optimization-related compiler options were used for the test:

-fno-exceptions -fno-rtti -faltivec -maltivec -mtune=7450 -O3
-fmessage-length=0 -funroll-loops -ffast-math -fstrict-aliasing
-ftree-vectorize -ftree-vectorizer-verbose=3 -fvisibility=hidden
-fvisibility-inlines-hidden -fno-threadsafe-statics

Full test app code and the resulting .s files are available upon request.
For the record, the intended vectorization fails, so the resulting code is
entirely scalar, but it is rich in fused multiply-adds.

-martin
