Bug 39046

Summary: gcc 4.4.0 20090116 loop unrolling messes optimization
Product: gcc
Reporter: martin krastev <blu.dark>
Component: middle-end
Assignee: Not yet assigned to anyone <unassigned>
Status: UNCONFIRMED ---
Severity: enhancement
CC: gcc-bugs, pinskia
Priority: P3
Keywords: missed-optimization
Version: 4.4.0
Target Milestone: ---
Host: powerpc-apple-darwin8.11.0
Target: powerpc-apple-darwin8.11.0
Build: powerpc-apple-darwin8.11.0
Known to work:
Known to fail:
Last reconfirmed:

Description martin krastev 2009-01-30 21:58:37 UTC
Version info:

$ powerpc-apple-darwin8.11.0-gcc-4.4.0 -v
Using built-in specs.
Target: powerpc-apple-darwin8.11.0
Configured with: ../gcc-4.4-20090116/configure --prefix=/opt/local --enable-languages=c,c++,objc,obj-c++ --libdir=/opt/local/lib/gcc44 --includedir=/opt/local/include/gcc44 --infodir=/opt/local/share/info --mandir=/opt/local/share/man --with-local-prefix=/opt/local --with-system-zlib --disable-nls --program-suffix=-mp-4.4 --with-gxx-include-dir=/opt/local/include/gcc44/c++/ --with-gmp=/opt/local --with-mpfr=/opt/local --disable-multilib
Thread model: posix
gcc version 4.4.0 20090116 (experimental) (GCC)



The above is a MacPorts (formerly DarwinPorts) build of GCC 4.4.0 on an
OS X 10.4.11 PPC 7450 host.

The following C++ function produces different code depending on whether the
'loop_assignment_ai' or the 'flat_assignment_ai' snippet is used:

#include <stdio.h>

inline static void
mmul(
    float (&c)[4][4],
    const float (&a)[4][4],
    const float (&b)[4][4])
{
    // iterate by product's rows
    for (unsigned i = 0; i < 4; i++)
    {
        register float ai[4][4];

        // swizzle each element of the i-th row of A into a full vector
        for (unsigned j = 0; j < 4; j++)

// flat_assignment_ai:
/*          ai[j][0] = ai[j][1] = ai[j][2] = ai[j][3] = a[i][j];
*/

// loop_assignment_ai:
            for (unsigned k = 0; k < 4; k++)
                ai[j][k] = a[i][j];

        // multiply the first element of the i-th row of A by the first row of B
        for (unsigned k = 0; k < 4; k++)
        {
            c[i][k] = ai[0][k] * b[0][k];
        }

        // multiply-add all subsequent elements of the i-th row of A by the respective rows of B
        for (unsigned j = 1; j < 4; j++)
        {
            for (unsigned k = 0; k < 4; k++)
            {
                c[i][k] += ai[j][k] * b[j][k];
            }
        }
    }
}


// function invoked with following parameters (statics)
float a[4][4] __attribute__ ((aligned (16)));
float b[4][4] __attribute__ ((aligned (16)));
float c[4][4] __attribute__ ((aligned (16)));


int main(int argc, char * const argv[])
{
    // omitted here is assignment of sample test values to arguments a & b

    unsigned ndz; // non-deterministic zero
    printf("enter a zero: ");
    if (1 != scanf("%u", &ndz)) // user expected to punch in a zero here
        return -1;

    const unsigned ndf = ndz ? 1 : 0; // non-deterministic const factor: it is meant to be zero, but the cc does not know that thus it can't declare our loop 'redundant'

    unsigned r = 10000000;

    do
    {
        mmul(*(&c + ndf * r), *(&a + ndf * r), *(&b + ndf * r));
    }
    while (--r);

    return r;
}


I observed a ~10% performance degradation when using 'loop_assignment_ai' instead of
'flat_assignment_ai'. The differences in the generated PPC code appear to be mainly in instruction scheduling.
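
For anyone trying to reproduce this, the two snippets can be kept side by side in one translation unit, which makes it easier to check that they compute the same product before comparing the generated code. The following is only a sketch under stated assumptions: the names mmul_variant, fill_samples, variants_agree and self_check are hypothetical and not part of the test case above, the 'register' keyword is dropped, and selecting the snippet through a template parameter may itself influence what the optimizer does, so the original single-function form remains the authoritative reproducer.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical wrapper: Flat == true selects the 'flat_assignment_ai' snippet,
// Flat == false selects the 'loop_assignment_ai' snippet.
template <bool Flat>
static void mmul_variant(
    float (&c)[4][4],
    const float (&a)[4][4],
    const float (&b)[4][4])
{
    for (unsigned i = 0; i < 4; i++)
    {
        float ai[4][4];

        // swizzle each element of the i-th row of A into a full vector
        for (unsigned j = 0; j < 4; j++)
        {
            if (Flat)
                ai[j][0] = ai[j][1] = ai[j][2] = ai[j][3] = a[i][j];
            else
                for (unsigned k = 0; k < 4; k++)
                    ai[j][k] = a[i][j];
        }

        // first element of the i-th row of A times the first row of B
        for (unsigned k = 0; k < 4; k++)
            c[i][k] = ai[0][k] * b[0][k];

        // multiply-add the remaining elements by the respective rows of B
        for (unsigned j = 1; j < 4; j++)
            for (unsigned k = 0; k < 4; k++)
                c[i][k] += ai[j][k] * b[j][k];
    }
}

// Assumed sample inputs (the report omits its own test values).
static void fill_samples(float (&a)[4][4], float (&b)[4][4])
{
    for (unsigned i = 0; i < 4; i++)
        for (unsigned j = 0; j < 4; j++)
        {
            a[i][j] = float(i * 4 + j + 1);
            b[i][j] = 0.5f * float(i + j);
        }
}

// Returns true when both snippets produce the same product, within a small
// tolerance in case options such as -ffast-math change the rounding.
static bool variants_agree()
{
    float a[4][4], b[4][4], c_loop[4][4], c_flat[4][4];
    fill_samples(a, b);
    mmul_variant<false>(c_loop, a, b);
    mmul_variant<true>(c_flat, a, b);

    for (unsigned i = 0; i < 4; i++)
        for (unsigned k = 0; k < 4; k++)
            if (std::fabs(c_loop[i][k] - c_flat[i][k]) > 1e-4f)
                return false;
    return true;
}

// Convenience check intended to be called once from main().
static void self_check()
{
    assert(variants_agree());
}
```

Calling self_check() at the start of main() would confirm that the two snippets are functionally equivalent, so any timing difference can be attributed to code generation. The assembly for each instantiation can then be inspected by compiling with -S in addition to the options listed below.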

The following optimization-related compiler options were used for the test:

-fno-exceptions -fno-rtti -faltivec -maltivec -mtune=7450 -O3
-funroll-loops -ffast-math -fstrict-aliasing
-ftree-vectorize -ftree-vectorizer-verbose=3
-fvisibility-inlines-hidden -fno-threadsafe-statics

For the record, the intended vectorization fails, so the resulting code is
entirely scalar.

-martin

Comment 1 martin krastev 2009-01-31 03:23:50 UTC
The result is not reproducible with the same compiler version and the same compile options on an OS X 10.5.6 Core 2 Duo host: the 'flat_assignment_ai' and 'loop_assignment_ai' versions generate identical code.

$ i386-apple-darwin9.6.0-gcc-4.4.0 -v       
Using built-in specs.
Target: i386-apple-darwin9.6.0
Configured with: ../gcc-4.4-20090116/configure --prefix=/opt/local --enable-languages=c,c++,objc,obj-c++ --libdir=/opt/local/lib/gcc44 --includedir=/opt/local/include/gcc44 --infodir=/opt/local/share/info --mandir=/opt/local/share/man --with-local-prefix=/opt/local --with-system-zlib --disable-nls --program-suffix=-mp-4.4 --with-gxx-include-dir=/opt/local/include/gcc44/c++/ --with-gmp=/opt/local --with-mpfr=/opt/local
Thread model: posix
gcc version 4.4.0 20090116 (experimental) (GCC)