[PATCH] Tiny predcom improvement (PR tree-optimization/59643)
Jakub Jelinek
jakub@redhat.com
Wed Jan 1 17:50:00 GMT 2014
On Tue, Dec 31, 2013 at 03:22:07PM -0500, Vladimir Makarov wrote:
> Scimark2 is always used by Phoronix to show how bad GCC in
> comparison with LLVM. It is understandable. For some reasons
> phoronix is very biased to LLVM and, I'd say, a marketing machine
> for LLVM.
>
> They use very small selected benchmarks. Benchmarking is evil but I
> believe more SPEC2000/SPEC2006 which use bigger real programs. With
> their approach, they could use whetstone instead of Scimark but the
> results would be very bad for LLVM (e.g. 64-bit code for -O3 on
> Haswell):
I'm aware of all that; that said, putting in some minimal effort to raise
obstacles for their marketing is worth it.
> It would be nice to fix also Scimark2 LU GCC-4.8 degradation to
> finally stop this nonsense from Phoronix.
But strangely I don't see any LU GCC-4.8 degradation. This is
-O3 -march=native on i7-2600. The Phoronix numbers were from i7-4960X,
but that is just a SandyBridge vs. IvyBridge-E difference, so at least for
GCC I don't think it makes a significant difference in what GCC generates
(verified for the patched trunk: -O3 -march=native and -O3 -march=ivybridge
generated bitwise identical code); the CPU is just faster and has bigger caches.
I don't have an i7-4960X to test it there, unfortunately. The numbers Phoronix
posted show an 80% improvement on LU from clang 3.3 to clang 3.4.
LU_factor has several inner loops. Only the last one,
    for (jj=j+1; jj<N; jj++)
      Aii[jj] -= AiiJ * Aj[jj];
(which is also the only one with loop depth 3) is vectorized by both gcc and
clang, with versioning for alias due to the badly chosen data structure.
Vectorizing
    for (k=j+1; k<M; k++)
      A[k][j] *= recp;
would require gather+scatter (i.e. Skylake+) plus some guarantee that it
doesn't alias (#pragma omp simd?), and even then it is questionable whether
vectorization would be beneficial. Finally,
    double t = fabs(A[j][j]);
    for (i=j+1; i<M; i++)
      {
        double ab = fabs(A[i][j]);
        if ( ab > t)
          {
            jp = i;
            t = ab;
          }
      }
is perhaps vectorizable with gather load (i.e. AVX2+) and -Ofast; I haven't
tried that though.
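To illustrate the aliasing point, here is a minimal sketch of the depth-3 update written so that the non-overlap of the two rows is stated in the source. The helper names row_update and trailing_update are hypothetical and not part of Scimark2; the sketch assumes the matrix is an array of row pointers (double **A), as the quoted loops suggest, and that distinct rows never overlap. Whether a given compiler then drops the runtime alias check is not verified here.

```c
/* Hypothetical helper, not part of Scimark2: one row of the rank-1 update.
   `restrict` is the caller's promise that dst and src do not overlap. */
static void row_update(double *restrict dst, const double *restrict src,
                       double scale, int from, int n)
{
    for (int jj = from; jj < n; jj++)
        dst[jj] -= scale * src[jj];
}

/* Trailing-submatrix update for column j over an array of row pointers.
   Rows ii > j are distinct from row j, so the restrict promise holds
   as long as the row allocations themselves do not overlap. */
static void trailing_update(double **A, int M, int N, int j)
{
    for (int ii = j + 1; ii < M; ii++)
        row_update(A[ii], A[j], A[ii][j], j + 1, N);
}
```

Calling trailing_update(A, M, N, j) performs the same arithmetic as the jj loop nested inside a loop over ii; only the aliasing information visible to the compiler differs.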
GCC 4.7.4 20130609:
Composite Score: 1323.20
FFT Mflops: 183.61 (N=1048576)
SOR Mflops: 618.12 (1000 x 1000)
MonteCarlo: Mflops: 412.98
Sparse matmult Mflops: 1773.16 (N=100000, nz=1000000)
LU Mflops: 3628.12 (M=1000, N=1000)
GCC 4.8.2 20131212 (Red Hat 4.8.2-7):
Composite Score: 1437.05
FFT Mflops: 213.91 (N=1048576)
SOR Mflops: 1139.72 (1000 x 1000)
MonteCarlo: Mflops: 526.34
Sparse matmult Mflops: 1713.81 (N=100000, nz=1000000)
LU Mflops: 3591.47 (M=1000, N=1000)
GCC 4.9.0 20131230:
Composite Score: 1569.78
FFT Mflops: 254.65 (N=1048576)
SOR Mflops: 1131.31 (1000 x 1000)
MonteCarlo: Mflops: 563.64
Sparse matmult Mflops: 1780.87 (N=100000, nz=1000000)
LU Mflops: 4118.40 (M=1000, N=1000)
GCC 4.9.0 20140101 plus the predcom patch:
Composite Score: 1692.05
FFT Mflops: 253.90 (N=1048576)
SOR Mflops: 1605.16 (1000 x 1000)
MonteCarlo: Mflops: 556.34
Sparse matmult Mflops: 1861.82 (N=100000, nz=1000000)
LU Mflops: 4183.01 (M=1000, N=1000)
clang 3.3:
Composite Score: 1576.91
FFT Mflops: 255.41 (N=1048576)
SOR Mflops: 1613.61 (1000 x 1000)
MonteCarlo: Mflops: 557.79
Sparse matmult Mflops: 1914.02 (N=100000, nz=1000000)
LU Mflops: 3543.74 (M=1000, N=1000)
clang 3.4 20131230:
Composite Score: 1566.95
FFT Mflops: 255.41 (N=1048576)
SOR Mflops: 1617.87 (1000 x 1000)
MonteCarlo: Mflops: 528.94
Sparse matmult Mflops: 1804.41 (N=100000, nz=1000000)
LU Mflops: 3628.12 (M=1000, N=1000)
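For reading the numbers above: each Composite Score is consistent, to within rounding, with the arithmetic mean of the five kernel results printed below it. A small sketch of that arithmetic follows; the function name composite_score is hypothetical and chosen for illustration only.

```c
/* Hypothetical helper: arithmetic mean of the five kernel scores (Mflops),
   in the order FFT, SOR, MonteCarlo, Sparse matmult, LU. */
static double composite_score(const double kernels[5])
{
    double sum = 0.0;
    for (int i = 0; i < 5; i++)
        sum += kernels[i];
    return sum / 5.0;
}
```

For example, feeding in the five GCC 4.7.4 results (183.61, 618.12, 412.98, 1773.16, 3628.12) gives a value that rounds to the reported 1323.20. This is why a 40% SOR gain moves the composite far less than the same relative gain on LU would.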
Jakub