[PATCH] Tiny predcom improvement (PR tree-optimization/59643)

Wed Jan 1 17:50:00 GMT 2014

On Tue, Dec 31, 2013 at 03:22:07PM -0500, Vladimir Makarov wrote:
> Scimark2 is always used by Phoronix to show how bad GCC in
> comparison with LLVM.  It is understandable. For some reasons
> phoronix is very biased to LLVM and, I'd say, a marketing machine
> for LLVM.
> 
> They use very small selected benchmarks.  Benchmarking is evil but I
> believe more SPEC2000/SPEC2006 which use bigger real programs.  With
> their approach, they could use whetstone instead of Scimark but the
> results would be very bad for LLVM (e.g. 64-bit code for -O3 on
> Haswell):

I'm aware of all that.  That said, putting some minimal effort into
raising obstacles for their marketing is worth it.

> It would be nice to fix also Scimark2 LU GCC-4.8 degradation to
> finally stop this nonsense from Phoronix.

But strangely I don't see any LU GCC-4.8 degradation.  This is
-O3 -march=native on i7-2600.  The Phoronix numbers were from i7-4960X,
but that is just a SandyBridge vs. IvyBridge-E difference, so at least for
GCC I don't think it makes a significant difference in what GCC generates
(verified for the patched trunk: -O3 -march=native and -O3 -march=ivybridge
generated bitwise identical code); the CPU is just faster and has bigger
caches.  I don't have an i7-4960X to test it there, unfortunately.  The
numbers Phoronix posted show an 80% improvement on LU from clang 3.3 to
clang 3.4.

LU_factor has several inner loops.  Only the last one
                for (jj=j+1; jj<N; jj++)
                  Aii[jj] -= AiiJ * Aj[jj];
(which is also the only one at loop depth 3) is vectorized, by both gcc and
clang, with versioning for alias due to the badly chosen data structure.
Vectorizing
            for (k=j+1; k<M; k++)
                A[k][j] *= recp;
would require gather+scatter (i.e. Skylake+) plus some guarantee that it
doesn't alias (#pragma omp simd?), and even then it is questionable whether
vectorization would be beneficial, and
        double t = fabs(A[j][j]);
        for (i=j+1; i<M; i++)
        {
            double ab = fabs(A[i][j]);
            if ( ab > t)
            {
                jp = i;
                t = ab;
            }
        }
is perhaps vectorizable with gather load (i.e. AVX2+) and -Ofast; I haven't
tried that, though.
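
For anyone who wants to look at these three loops in isolation, here is a
minimal standalone sketch, assuming the same array-of-row-pointers layout
(double **) that the snippets above imply.  The function names row_update,
scale_column and find_pivot are made up for illustration, and the restrict
qualifiers are an addition of this sketch, not something the benchmark
source has; they only show what kind of guarantee the alias versioning
otherwise has to check at run time.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Row update: both rows are contiguous in memory, so this is the loop
   that vectorizes.  restrict (an assumption of this sketch) promises
   the two rows do not overlap. */
void row_update(double *restrict aii, const double *restrict aj,
                double aiij, size_t j, size_t n)
{
    for (size_t jj = j + 1; jj < n; jj++)
        aii[jj] -= aiij * aj[jj];
}

/* Column scaling: consecutive iterations touch different rows through
   different row pointers, so vector code would need gather loads and
   scatter stores. */
void scale_column(double **a, size_t j, size_t m, double recp)
{
    for (size_t k = j + 1; k < m; k++)
        a[k][j] *= recp;
}

/* Pivot search: a max-reduction with an index over a column, again
   reading one element per row.  Returns the row of the largest
   absolute value in column j, starting at row j. */
size_t find_pivot(double **a, size_t j, size_t m)
{
    assert(j < m);
    size_t jp = j;
    double t = fabs(a[j][j]);
    for (size_t i = j + 1; i < m; i++) {
        double ab = fabs(a[i][j]);
        if (ab > t) {
            jp = i;
            t = ab;
        }
    }
    return jp;
}
```

Only row_update walks memory with unit stride; the other two step from
row pointer to row pointer, which is why the data structure matters more
here than the vectorizer does.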

GCC 4.7.4 20130609:

Composite Score:         1323.20
FFT             Mflops:   183.61    (N=1048576)
SOR             Mflops:   618.12    (1000 x 1000)
MonteCarlo:     Mflops:   412.98
Sparse matmult  Mflops:  1773.16    (N=100000, nz=1000000)
LU              Mflops:  3628.12    (M=1000, N=1000)

GCC 4.8.2 20131212 (Red Hat 4.8.2-7):

Composite Score:         1437.05
FFT             Mflops:   213.91    (N=1048576)
SOR             Mflops:  1139.72    (1000 x 1000)
MonteCarlo:     Mflops:   526.34
Sparse matmult  Mflops:  1713.81    (N=100000, nz=1000000)
LU              Mflops:  3591.47    (M=1000, N=1000)

GCC 4.9.0 20131230:

Composite Score:         1569.78
FFT             Mflops:   254.65    (N=1048576)
SOR             Mflops:  1131.31    (1000 x 1000)
MonteCarlo:     Mflops:   563.64
Sparse matmult  Mflops:  1780.87    (N=100000, nz=1000000)
LU              Mflops:  4118.40    (M=1000, N=1000)

GCC 4.9.0 20140101 plus the predcom patch:

Composite Score:         1692.05
FFT             Mflops:   253.90    (N=1048576)
SOR             Mflops:  1605.16    (1000 x 1000)
MonteCarlo:     Mflops:   556.34
Sparse matmult  Mflops:  1861.82    (N=100000, nz=1000000)
LU              Mflops:  4183.01    (M=1000, N=1000)

clang 3.3:

Composite Score:         1576.91
FFT             Mflops:   255.41    (N=1048576)
SOR             Mflops:  1613.61    (1000 x 1000)
MonteCarlo:     Mflops:   557.79
Sparse matmult  Mflops:  1914.02    (N=100000, nz=1000000)
LU              Mflops:  3543.74    (M=1000, N=1000)

clang 3.4 20131230:

Composite Score:         1566.95
FFT             Mflops:   255.41    (N=1048576)
SOR             Mflops:  1617.87    (1000 x 1000)
MonteCarlo:     Mflops:   528.94
Sparse matmult  Mflops:  1804.41    (N=100000, nz=1000000)
LU              Mflops:  3628.12    (M=1000, N=1000)
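
A note on reading the tables: the composite figures above are consistent
with a plain arithmetic mean of the five kernel Mflops values, so a big
move in a single kernel shifts the composite by one fifth of that amount.
A small sketch for redoing the arithmetic follows; the helper names
composite_score and percent_change are hypothetical, not part of the
benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n kernel scores (Mflops). */
double composite_score(const double *mflops, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += mflops[i];
    return sum / (double)n;
}

/* Relative change from before to after, in percent. */
double percent_change(double before, double after)
{
    assert(before != 0.0);
    return (after / before - 1.0) * 100.0;
}
```

Fed the GCC 4.7.4 kernel numbers, composite_score gives about 1323.20,
matching the table, and percent_change(1131.31, 1605.16) for SOR between
the 20131230 trunk and the patched 20140101 trunk comes out at roughly
42%.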

	Jakub