This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



gcc 4.4.0 loop-unrolling optimizations peculiarity observed


gcc version: powerpc-apple-darwin8.11.0-gcc-4.4.0 (GCC) 4.4.0 20090116
(experimental)
This is a MacPorts (formerly DarwinPorts) build of GCC 4.4.0 on an
OS X 10.4.11 PPC 7450 host.

The following C++ function produces different code depending on whether
the 'loop_Ai' or the 'direct_assignment_Ai' snippet is used:

float a[4][4] __attribute__ ((aligned (16)));
float b[4][4] __attribute__ ((aligned (16)));

float c[4][4] __attribute__ ((aligned (16)));

inline static void
mmul(
    float (&c)[4][4],
    const float (&a)[4][4],
    const float (&b)[4][4])
{
    // iterate by product's rows
    for (unsigned i = 0; i < 4; i++)
    {
        register float ai[4][4];

        // swizzle each element of the i-th row of A into a full-dimensional vector
        for (unsigned j = 0; j < 4; j++)

// direct_assignment_Ai:
/*          ai[j][0] = ai[j][1] = ai[j][2] = ai[j][3] = a[i][j];
*/
// loop_Ai:
            for (unsigned k = 0; k < 4; k++)
                ai[j][k] = a[i][j];

        // multiply the first element of the i-th row of A by the first row of B
        for (unsigned k = 0; k < 4; k++)
        {
            c[i][k] = ai[0][k] * b[0][k];
        }

        // multiply-add all subsequent elements of the i-th row of A by their respective rows of B
        for (unsigned j = 1; j < 4; j++)
        {
            for (unsigned k = 0; k < 4; k++)
            {
                c[i][k] += ai[j][k] * b[j][k];
            }
        }
    }
}
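
The two snippets are meant to be interchangeable, so any difference should show up only in the generated code, not in the result. A minimal sketch of a check for that, assuming each snippet is split out into its own function; the names mat4, finish_row, mmul_loop_Ai, mmul_direct_Ai and variants_agree are hypothetical and not part of the original test app:

```cpp
#include <cassert>

typedef float mat4[4][4];  // hypothetical shorthand for the 4x4 arrays above

// Shared tail of the kernel: given the swizzled row 'ai', compute row i of C.
static void finish_row(mat4& c, const mat4& ai, const mat4& b, unsigned i)
{
    for (unsigned k = 0; k < 4; k++)
        c[i][k] = ai[0][k] * b[0][k];
    for (unsigned j = 1; j < 4; j++)
        for (unsigned k = 0; k < 4; k++)
            c[i][k] += ai[j][k] * b[j][k];
}

// 'loop_Ai' variant: broadcast a[i][j] across ai[j][0..3] with an inner loop.
static void mmul_loop_Ai(mat4& c, const mat4& a, const mat4& b)
{
    for (unsigned i = 0; i < 4; i++)
    {
        mat4 ai;
        for (unsigned j = 0; j < 4; j++)
            for (unsigned k = 0; k < 4; k++)
                ai[j][k] = a[i][j];
        finish_row(c, ai, b, i);
    }
}

// 'direct_assignment_Ai' variant: the same broadcast as a chained assignment.
static void mmul_direct_Ai(mat4& c, const mat4& a, const mat4& b)
{
    for (unsigned i = 0; i < 4; i++)
    {
        mat4 ai;
        for (unsigned j = 0; j < 4; j++)
            ai[j][0] = ai[j][1] = ai[j][2] = ai[j][3] = a[i][j];
        finish_row(c, ai, b, i);
    }
}

// Returns true if both variants give identical results for one sample input.
// Small integer inputs keep every product and sum exactly representable.
static bool variants_agree()
{
    mat4 a, b, c1, c2;
    for (unsigned i = 0; i < 4; i++)
        for (unsigned j = 0; j < 4; j++)
        {
            a[i][j] = float(i * 4 + j + 1);
            b[i][j] = float((i + 2 * j) % 5);
        }
    mmul_loop_Ai(c1, a, b);
    mmul_direct_Ai(c2, a, b);
    for (unsigned i = 0; i < 4; i++)
        for (unsigned k = 0; k < 4; k++)
            if (c1[i][k] != c2[i][k])
                return false;
    return true;
}

// Aborts (in builds without NDEBUG) if the two variants disagree.
static void self_check()
{
    assert(variants_agree());
}
```

Calling self_check() once before timing anything would rule out a functional difference between the two variants.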


I observed a ~10% performance degradation when using 'loop_Ai' instead of
'direct_assignment_Ai'. From what I can tell, the generated PPC code
differs mainly in instruction scheduling.
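
For anyone who wants to reproduce the comparison, here is a rough sketch of one way to time a kernel. It is not the test app mentioned in this post; the names mat4, mmul_fn, mmul_reference and seconds_for are hypothetical, and mmul_reference is only a stand-in for whichever variant of mmul is being measured:

```cpp
#include <ctime>

typedef float mat4[4][4];  // hypothetical shorthand for a 4x4 float array
typedef void (*mmul_fn)(mat4&, const mat4&, const mat4&);

// Stand-in kernel: a plain 4x4 product, c[i][k] = sum over j of a[i][j] * b[j][k].
static void mmul_reference(mat4& c, const mat4& a, const mat4& b)
{
    for (unsigned i = 0; i < 4; i++)
        for (unsigned k = 0; k < 4; k++)
        {
            float sum = 0.0f;
            for (unsigned j = 0; j < 4; j++)
                sum += a[i][j] * b[j][k];
            c[i][k] = sum;
        }
}

// Calls 'fn' 'reps' times and returns the elapsed CPU time in seconds.
// The volatile read and write are there to discourage the optimizer from
// hoisting or dropping the calls; that is an assumption, not a guarantee,
// so the generated assembly is still worth a look.
static double seconds_for(mmul_fn fn, unsigned long reps, float* checksum)
{
    mat4 a, b, c;
    for (unsigned i = 0; i < 4; i++)
        for (unsigned j = 0; j < 4; j++)
        {
            a[i][j] = float(i * 4 + j + 1);
            b[i][j] = (i == j) ? 1.0f : 0.0f;  // identity, so C equals A
        }

    volatile float seed = 1.0f;
    volatile float sink = 0.0f;

    std::clock_t t0 = std::clock();
    for (unsigned long r = 0; r < reps; r++)
    {
        a[0][0] = seed;   // volatile read: the input may differ on every pass
        fn(c, a, b);
        sink = c[0][0];   // volatile write: the result is observed
    }
    std::clock_t t1 = std::clock();

    *checksum = sink;
    return double(t1 - t0) / CLOCKS_PER_SEC;
}
```

Running seconds_for once per variant with a large repetition count, and comparing the two times, would be one way to approximate a figure like the one quoted above. std::clock is coarse, so short runs are not meaningful.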

The following optimization-related compiler options were used for the test:

-fno-exceptions -fno-rtti -faltivec -maltivec -mtune=7450 -O3
-fmessage-length=0 -funroll-loops -ffast-math -fstrict-aliasing
-ftree-vectorize -ftree-vectorizer-verbose=3 -fvisibility=hidden
-fvisibility-inlines-hidden -fno-threadsafe-statics

Full test app code and the resulting .s files are available upon request.
For the record, the intended vectorization fails, so the resulting code is
entirely scalar, but it is rich in fused multiply-adds.

-martin

