mkdir scimark2; cd scimark2
wget http://math.nist.gov/scimark2/scimark2_1c.zip
unzip scimark2_1c.zip
c++ -Ofast *.c; ./a.out
c++ -Ofast *.c -flto; ./a.out

with gcc 4.9.3
gcc version 4.9.3 (GCC)

c++ -Ofast *.c; ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         2462.90
FFT             Mflops:  2070.32    (N=1024)
SOR             Mflops:  1661.17    (100 x 100)
MonteCarlo:     Mflops:   813.44
Sparse matmult  Mflops:  2978.91    (N=1000, nz=5000)
LU              Mflops:  4790.64    (M=100, N=100)

[innocent@vinavx3 scimark2]$ c++ -Ofast *.c -flto; ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         2582.94
FFT             Mflops:  2064.19    (N=1024)
SOR             Mflops:  1654.04    (100 x 100)
MonteCarlo:     Mflops:  1426.90
Sparse matmult  Mflops:  2978.91    (N=1000, nz=5000)
LU              Mflops:  4790.64    (M=100, N=100)

with latest build
gcc version 6.0.0 20160129 (experimental) (GCC)

[innocent@vinavx3 scimark2]$ c++ -Ofast *.c; ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         2377.18
FFT             Mflops:  1970.89    (N=1024)
SOR             Mflops:  1654.04    (100 x 100)
MonteCarlo:     Mflops:   810.37
Sparse matmult  Mflops:  3328.81    (N=1000, nz=5000)
LU              Mflops:  4121.76    (M=100, N=100)

[innocent@vinavx3 scimark2]$ c++ -Ofast *.c -flto; ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         2136.23
FFT             Mflops:  2076.48    (N=1024)
SOR             Mflops:  1654.04    (100 x 100)
MonteCarlo:     Mflops:  1533.92
Sparse matmult  Mflops:  3266.59    (N=1000, nz=5000)
LU              Mflops:  2150.13    (M=100, N=100)
Any reason you are using the c++ driver here?  I get

> gcc-6 -Ofast -flto *.c -lm -B /abuild/rguenther/trunk3-g/gcc
> ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         1729.02
FFT             Mflops:  1247.04    (N=1024)
SOR             Mflops:  1537.70    (100 x 100)
MonteCarlo:     Mflops:   842.21
Sparse matmult  Mflops:  1657.86    (N=1000, nz=5000)
LU              Mflops:  3360.29    (M=100, N=100)

> gcc-6 -Ofast *.c -lm -B /abuild/rguenther/trunk3-g/gcc
> ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         1645.94
FFT             Mflops:  1288.61    (N=1024)
SOR             Mflops:  1471.29    (100 x 100)
MonteCarlo:     Mflops:   459.90
Sparse matmult  Mflops:  1665.91    (N=1000, nz=5000)
LU              Mflops:  3343.98    (M=100, N=100)

Ok, when using g++ to compile things I _do_ get

> g++-6 -Ofast -flto *.c -lm -B /abuild/rguenther/trunk3-g/gcc
> ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         1321.43
FFT             Mflops:  1261.86    (N=1024)
SOR             Mflops:  1533.77    (100 x 100)
MonteCarlo:     Mflops:   850.69
Sparse matmult  Mflops:  1669.90    (N=1000, nz=5000)
LU              Mflops:  1290.93    (M=100, N=100)

> g++-6 -Ofast *.c -lm -B /abuild/rguenther/trunk3-g/gcc
> ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         1492.12
FFT             Mflops:  1279.12    (N=1024)
SOR             Mflops:  1479.86    (100 x 100)
MonteCarlo:     Mflops:   433.83
Sparse matmult  Mflops:  1637.11    (N=1000, nz=5000)
LU              Mflops:  2630.71    (M=100, N=100)

So even without LTO I get a hit when using C++ to compile LU.  Interesting.
It looks like we get different BB order out of C++ than C but otherwise no real code-differences as far as I can see.
> Any reason you are using the c++ driver here?

Because I am interested in C++ performance.
I never imagined that the c++ front-end could make a difference on such code...

From my point of view it is an even more severe regression than just "lto".
(In reply to vincenzo Innocente from comment #3)
> > Any reason you are using the c++ driver here?
> Because I am interested in C++ performance.
> I never imagined that the c++ front-end could make a difference on such
> code...
> From my point of view it is an even more severe regression than just "lto"

Yeah, I didn't try to figure out whether the C vs. C++ thing is a regression.
But I suspect the change to the C++ loop lowering.  Certainly needs closer
investigation.
It is a regression.

c++ -v
gcc version 4.9.3 (GCC)

c++ -Ofast *.c; ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         2449.06
FFT             Mflops:  2046.03    (N=1024)
SOR             Mflops:  1654.04    (100 x 100)
MonteCarlo:     Mflops:   813.44
Sparse matmult  Mflops:  2962.08    (N=1000, nz=5000)
LU              Mflops:  4769.72    (M=100, N=100)

---
gcc -Ofast *.c; ./a.out
gcc -Ofast *.c -lm; ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         2475.22
FFT             Mflops:  2064.19    (N=1024)
SOR             Mflops:  1633.01    (100 x 100)
MonteCarlo:     Mflops:   810.37
Sparse matmult  Mflops:  2970.47    (N=1000, nz=5000)
LU              Mflops:  4898.06    (M=100, N=100)
Looks like since GCC 5 we wrap all loops in a while (1).  Eventually this just
confuses GCC's static prediction machinery, but who knows.

Old (not the problematic function!):

{
  int i;
  int j;

  <<cleanup_point <<< Unknown tree: expr_stmt (void) (i = 0) >>>>>;
  goto <D.3610>;
  <D.3609>:;
  <<cleanup_point <<< Unknown tree: expr_stmt (void) (j = 0) >>>>>;
  goto <D.3608>;
  <D.3607>:;
  <<cleanup_point <<< Unknown tree: expr_stmt
    (void) (*(*(lu + (sizetype) ((long unsigned int) i * 8)) + (sizetype) ((long unsigned int) j * 8)) = *(*(A + (sizetype) ((long unsigned int) i * 8)) + (sizetype) ((long unsigned int) j * 8))) >>>>>;
  <<cleanup_point (void) j++ >>;
  <D.3608>:;
  if (j < N) goto <D.3607>; else goto <D.3605>;
  <D.3605>:;
  <<cleanup_point (void) i++ >>;
  <D.3610>:;
  if (i < M) goto <D.3609>; else goto <D.3603>;
  <D.3603>:;
}

New:

{
  int i;
  int j;

  <<cleanup_point <<< Unknown tree: expr_stmt (void) (i = 0) >>>>>;
  while (1)
    {
      if (i >= M) goto <D.3704>;
      <<cleanup_point <<< Unknown tree: expr_stmt (void) (j = 0) >>>>>;
      while (1)
        {
          if (j >= N) goto <D.3706>;
          <<cleanup_point <<< Unknown tree: expr_stmt
            (void) (*(*(lu + (sizetype) ((long unsigned int) i * 8)) + (sizetype) ((long unsigned int) j * 8)) = *(*(A + (sizetype) ((long unsigned int) i * 8)) + (sizetype) ((long unsigned int) j * 8))) >>>>>;
          <<cleanup_point (void) j++ >>;
        }
      <D.3706>:;
      <<cleanup_point (void) i++ >>;
    }
  <D.3704>:;
}

Or more likely it's the increment after the exit check, and thus no longer
do-while style loops.
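In other words, for a generic loop like for (i = 0; i < M; i++) body; (a
sketch with made-up labels, not the actual LU.c dump), the two
genericizations compare roughly as:

    /* pre-GCC-5 (rotated, do-while style): test at the bottom */
            i = 0;
            goto test;
    body:   ...;
            i++;
    test:   if (i < M) goto body;

    /* GCC 5+ C++ front end: exit test at the top of a while (1),
       increment after the exit check */
            i = 0;
            while (1)
              {
                if (i >= M) break;
                ...;
                i++;
              }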
(In reply to Richard Biener from comment #4)
> Yeah, I didn't try to figure out whether the C vs. C++ thing is a
> regression.  But I suspect the change to the C++ loop lowering.

Yes, the relatively small difference between C and C++ without LTO seems to
be due to the loop inversion that the C front end still does, and the C++
front end doesn't.  But if I add loop inversion back into the C++ front end
so that the .optimized output is indistinguishable, that resolves the
difference without LTO, but LTO still makes the C++ output much slower.
Created attachment 37774 [details]
loop inversion sketch

This patch does loop inversion sufficient for scimark.  It will break
constexpr, but might be useful for looking at the LTO issue.
(In reply to Jason Merrill from comment #7)
> But if I add loop inversion back into the C++ front end
> so that the .optimized output is indistinguishable, that resolves the
> difference without LTO

Actually, no, the loop inversion patch in comment 8 doesn't improve
performance; even though the .optimized dump is the same between C and C++,
the C++ build is still slower by roughly the same amount.  I guess the RTL
optimizers must be doing something different?  Mysterious.

Unassigning myself.
One minor difference between the FEs which can be seen in the gimple dump of
LU.c is that the C++ front end doesn't transform (a < b ? a : b) into
MIN_EXPR (a, b), whereas the C front end does perform this transformation.
This is because in fold-const.c the transformation is guarded by:

      /* Avoid these transformations if the COND_EXPR may be used
         as an lvalue in the C++ front-end.  PR c++/19199.  */
      && (in_gimple_form
          || VECTOR_TYPE_P (type)
          || (! lang_GNU_CXX ()
              && strcmp (lang_hooks.name, "GNU Objective-C++") != 0)
          || ! maybe_lvalue_p (arg1)
          || ! maybe_lvalue_p (arg2)))

which deliberately inhibits the transformation for C++.

Is this lvalue-ness retention still a concern now that the FE performs
delayed folding?  By the time the transformation gets performed, the FE will
have lowered all uses of lvalue COND_EXPRs to an appropriate non-lvalue
COND_EXPR.  Removing this hunk of code passes regtesting, at least.
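For reference, the lvalue case that the guard was protecting (an illustrative
example, not from scimark):

    /* Valid C++: a conditional expression whose arms are lvalues is itself
       an lvalue (the PR c++/19199 situation), so folding it to MIN_EXPR too
       early would break this.  In C the conditional is never an lvalue, so
       the MIN_EXPR rewrite is always safe there.  */
    int a = 1, b = 2;
    (a < b ? a : b) = 0;   /* assigns 0 to a, the smaller of the two */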
Maybe. But the middle-end is able to perform that optimization later too (e.g. during phiopt pass, or ifcvt).
I don't see any difference though, neither with the fold-const.c change, nor
with the loop-invert.patch (at least on my Haswell-E, -g -Ofast x86_64,
single runs only; though, it shows the LU slowdown with C++ clearly, and also
that clang wins on SOR and Sparse matmult, we win significantly on MonteCarlo
and less significantly on FFT, C LU is comparable):

gcc trunk 20160224
Composite Score:         2482.97
FFT             Mflops:  1982.24    (N=1024)
SOR             Mflops:  1904.08    (100 x 100)
MonteCarlo:     Mflops:   677.65
Sparse matmult  Mflops:  2775.38    (N=1000, nz=5000)
LU              Mflops:  5075.48    (M=100, N=100)

g++ trunk 20160224
Composite Score:         2314.35
FFT             Mflops:  1986.77    (N=1024)
SOR             Mflops:  1903.29    (100 x 100)
MonteCarlo:     Mflops:   678.80
Sparse matmult  Mflops:  2775.33    (N=1000, nz=5000)
LU              Mflops:  4227.54    (M=100, N=100)

g++ trunk 20160224 + fold-const.c MIN/MAX change
Composite Score:         2331.88
FFT             Mflops:  1983.28    (N=1024)
SOR             Mflops:  1906.04    (100 x 100)
MonteCarlo:     Mflops:   676.53
Sparse matmult  Mflops:  2823.60    (N=1000, nz=5000)
LU              Mflops:  4269.96    (M=100, N=100)

g++ trunk 20150224 + fold-const.c MIN/MAX change + loop-invert.patch
Composite Score:         2332.00
FFT             Mflops:  1983.18    (N=1024)
SOR             Mflops:  1905.64    (100 x 100)
MonteCarlo:     Mflops:   674.50
Sparse matmult  Mflops:  2823.55    (N=1000, nz=5000)
LU              Mflops:  4273.14    (M=100, N=100)

clang 3.8
Composite Score:         2418.13
FFT             Mflops:  1583.23    (N=1024)
SOR             Mflops:  2130.27    (100 x 100)
MonteCarlo:     Mflops:   281.80
Sparse matmult  Mflops:  3026.40    (N=1000, nz=5000)
LU              Mflops:  5068.95    (M=100, N=100)

clang++ 3.8
Composite Score:         2434.04
FFT             Mflops:  1595.89    (N=1024)
SOR             Mflops:  2131.09    (100 x 100)
MonteCarlo:     Mflops:   281.63
Sparse matmult  Mflops:  3001.59    (N=1000, nz=5000)
LU              Mflops:  5159.98    (M=100, N=100)
It would still be good to get rid of that guard.  Not sure why we key off
in_gimple_form either.

What we get is branches done in different directions, and thus BB reordering
is entered with a different BB order.  Edge frequencies seem to match but BB
frequencies are off, somehow differently scaled:

-;; basic block 21, loop depth 2, count 0, freq 655, maybe hot
-;; Invalid sum of incoming frequencies 985, should be 655
+;; basic block 21, loop depth 2, count 0, freq 3, maybe hot
...
 ;; basic block 23, loop depth 2, count 0, freq 720, maybe hot
+;; basic block 23, loop depth 2, count 0, freq 2, maybe hot

etc. (- is C, + is C++).  Very few edge frequency differences, but those are
on backedges, for example:

 ;;    pred:       25 [100.0%]  (FALLTHRU,EXECUTABLE)
-;;                26 [91.0%]  (TRUE_VALUE,EXECUTABLE)
-  # ivtmp.72_13 = PHI <0(25), ivtmp.72_14(26)>
-  # ivtmp.75_226 = PHI <0(25), ivtmp.75_192(26)>
...
+;;                26 [80.0%]  (FALSE_VALUE,EXECUTABLE)
+  # ivtmp.73_13 = PHI <0(25), ivtmp.73_14(26)>
+  # ivtmp.76_226 = PHI <0(25), ivtmp.76_192(26)>

It's already like that at profile-estimate time (for LU_factor), which also
has 2 more basic blocks and 3 more edges for C++ (with the folding thing
"fixed").

Didn't check performance with the folding fixed.
With -Ofast -g -march=native -funroll-loops we actually (for C) beat clang on
LU.  The SOR difference is most likely an IVOPTs/scheduling/RA thing; there
is nothing to actually vectorize because of the dependencies.

clang 3.8 (no matter if -funroll-loops is used or not) unrolls the main SOR
loop 4 times and we get roughly:

        .align  16, 0x90
.LBB1_12:
        vmovsd  -24(%r13,%rax,8), %xmm4
        vaddsd  -24(%r12,%rax,8), %xmm4, %xmm4
        vmovsd  -24(%rbx,%rax,8), %xmm5
        vaddsd  %xmm5, %xmm3, %xmm3
        vaddsd  %xmm3, %xmm4, %xmm3
        vmulsd  %xmm0, %xmm2, %xmm2
        vfmadd231sd     %xmm3, %xmm1, %xmm2
        vmovsd  %xmm2, -32(%rbx,%rax,8)
        vmovsd  -16(%r13,%rax,8), %xmm3
        vaddsd  -16(%r12,%rax,8), %xmm3, %xmm3
        vmovsd  -16(%rbx,%rax,8), %xmm4
        vaddsd  %xmm4, %xmm3, %xmm3
        vaddsd  %xmm3, %xmm2, %xmm2
        vmulsd  %xmm0, %xmm5, %xmm3
        vfmadd231sd     %xmm2, %xmm1, %xmm3
        vmovsd  %xmm3, -24(%rbx,%rax,8)
        vmovsd  -8(%r13,%rax,8), %xmm2
        vaddsd  -8(%r12,%rax,8), %xmm2, %xmm2
        vmovsd  -8(%rbx,%rax,8), %xmm5
        vaddsd  %xmm5, %xmm2, %xmm2
        vaddsd  %xmm2, %xmm3, %xmm2
        vmulsd  %xmm0, %xmm4, %xmm3
        vfmadd231sd     %xmm2, %xmm1, %xmm3
        vmovsd  %xmm3, -16(%rbx,%rax,8)
        vmovsd  (%r13,%rax,8), %xmm2
        vaddsd  (%r12,%rax,8), %xmm2, %xmm4
        vmovsd  (%rbx,%rax,8), %xmm2
        vaddsd  %xmm2, %xmm4, %xmm4
        vaddsd  %xmm4, %xmm3, %xmm4
        vmulsd  %xmm0, %xmm5, %xmm3
        vfmadd231sd     %xmm4, %xmm1, %xmm3
        vmovsd  %xmm3, -8(%rbx,%rax,8)
        addq    $4, %rax
        cmpl    %eax, %r15d
        jne     .LBB1_12

gcc -Ofast -g -march=native unrolls just twice:

        .p2align 4,,10
        .p2align 3
.L9:
        vmovsd  -8(%r8,%rdi,8), %xmm5
        vmovsd  -16(%r9,%rdi,8), %xmm1
        vmulsd  %xmm4, %xmm2, %xmm4
        movslq  %edi, %rax
        vaddsd  -16(%r10,%rdi,8), %xmm1, %xmm1
        vaddsd  %xmm0, %xmm5, %xmm0
        vmulsd  %xmm5, %xmm2, %xmm5
        vaddsd  %xmm0, %xmm1, %xmm0
        vmovapd %xmm0, %xmm1
        vfmadd132sd     %xmm3, %xmm4, %xmm1
        vmovsd  (%r8,%rdi,8), %xmm4
        vmovsd  %xmm1, -16(%r8,%rdi,8)
        vaddsd  %xmm4, %xmm1, %xmm1
        vmovsd  -8(%r10,%rdi,8), %xmm0
        vaddsd  -8(%r9,%rdi,8), %xmm0, %xmm0
        vaddsd  %xmm1, %xmm0, %xmm0
        vfmadd132sd     %xmm3, %xmm5, %xmm0
        vmovsd  %xmm0, -8(%r8,%rdi,8)
        addq    $2, %rdi
        cmpq    %rcx, %rdi
        jne     .L9

and with additional -funroll-loops we unroll 8 times instead:

.L9:
        vmovsd  -8(%rax,%rdi,8), %xmm6
        vmovsd  -16(%r8,%rdi,8), %xmm5
        vmulsd  %xmm7, %xmm1, %xmm8
        leaq    2(%rdi), %r14
        vaddsd  -16(%r9,%rdi,8), %xmm5, %xmm9
        vmovsd  (%rax,%rdi,8), %xmm12
        leaq    4(%rdi), %r10
        vaddsd  %xmm3, %xmm6, %xmm10
        vmulsd  %xmm6, %xmm1, %xmm7
        vmulsd  %xmm12, %xmm1, %xmm0
        vaddsd  %xmm10, %xmm9, %xmm11
        vfmadd132sd     %xmm2, %xmm8, %xmm11
        vmovsd  %xmm11, -16(%rax,%rdi,8)
        vaddsd  %xmm12, %xmm11, %xmm15
        vmovsd  -8(%r9,%rdi,8), %xmm13
        vaddsd  -8(%r8,%rdi,8), %xmm13, %xmm14
        vaddsd  %xmm15, %xmm14, %xmm4
        vfmadd132sd     %xmm2, %xmm7, %xmm4
        vmovsd  %xmm4, -8(%rax,%rdi,8)
        vmovsd  -8(%rax,%r14,8), %xmm6
        vmovsd  -16(%r8,%r14,8), %xmm3
        vaddsd  -16(%r9,%r14,8), %xmm3, %xmm8
        vmovsd  (%rax,%r14,8), %xmm10
        vaddsd  %xmm4, %xmm6, %xmm5
        vmulsd  %xmm6, %xmm1, %xmm11
        vmulsd  %xmm10, %xmm1, %xmm4
        vaddsd  %xmm5, %xmm8, %xmm9
        vfmadd132sd     %xmm2, %xmm0, %xmm9
        vmovsd  %xmm9, -16(%rax,%r14,8)
        vaddsd  %xmm10, %xmm9, %xmm13
        vmovsd  -8(%r9,%r14,8), %xmm12
        vaddsd  -8(%r8,%r14,8), %xmm12, %xmm7
        vaddsd  %xmm13, %xmm7, %xmm14
        vfmadd132sd     %xmm2, %xmm11, %xmm14
        vmovsd  %xmm14, -8(%rax,%r14,8)
        vmovsd  -8(%rax,%r10,8), %xmm15
        vmovsd  -16(%r8,%r10,8), %xmm6
        vaddsd  -16(%r9,%r10,8), %xmm6, %xmm0
        vmovsd  (%rax,%r10,8), %xmm9
        vaddsd  %xmm14, %xmm15, %xmm3
        vmulsd  %xmm15, %xmm1, %xmm5
        vmulsd  %xmm9, %xmm1, %xmm14
        vaddsd  %xmm3, %xmm0, %xmm8
        vfmadd132sd     %xmm2, %xmm4, %xmm8
        vmovsd  %xmm8, -16(%rax,%r10,8)
        vaddsd  %xmm9, %xmm8, %xmm12
        vmovsd  -8(%r9,%r10,8), %xmm10
        vaddsd  -8(%r8,%r10,8), %xmm10, %xmm11
        vaddsd  %xmm12, %xmm11, %xmm7
        vfmadd132sd     %xmm2, %xmm5, %xmm7
        vmovsd  %xmm7, -8(%rax,%r10,8)
        leaq    6(%rdi), %r10
        addq    $8, %rdi
        vmovsd  -8(%rax,%r10,8), %xmm13
        vmovsd  -16(%r8,%r10,8), %xmm15
        vaddsd  -16(%r9,%r10,8), %xmm15, %xmm6
        vaddsd  %xmm7, %xmm13, %xmm4
        vmovsd  (%rax,%r10,8), %xmm7
        vmulsd  %xmm13, %xmm1, %xmm8
        vaddsd  %xmm4, %xmm6, %xmm3
        vfmadd132sd     %xmm2, %xmm14, %xmm3
        vmovsd  %xmm3, -16(%rax,%r10,8)
        vaddsd  %xmm7, %xmm3, %xmm5
        vmovsd  -8(%r9,%r10,8), %xmm0
        vaddsd  -8(%r8,%r10,8), %xmm0, %xmm9
        vaddsd  %xmm5, %xmm9, %xmm0
        vfmadd132sd     %xmm2, %xmm8, %xmm0
        vmovapd %xmm0, %xmm3
        vmovsd  %xmm0, -8(%rax,%r10,8)
        cmpq    %rcx, %rdi
        jne     .L9

The number of v*sd instructions matches across all 3 versions, proportionally
to the unroll factor, but the non-unrolled version has a weird extra move in
there (the movslq %edi, %rax) even when it isn't used anywhere in the loop,
and the -funroll-loops version is just too weird: many leaq insns when the
constant could be added into the immediates in the corresponding addresses.
Is this the same bug?

https://www.phoronix.com/scan.php?page=article&item=ubuntu-1604-compilers&num=2

In Dense LU Matrix Factorization GCC 5.3.1/6.0 is more than 2 times slower
than Clang.

G++ options: -O3 -march=native
CPU: Xeon E3-1280 v5
(In reply to Artem S. Tashkinov from comment #15)
> Is this the same bug?
>
> https://www.phoronix.com/scan.php?page=article&item=ubuntu-1604-compilers&num=2
>
> In Dense LU Matrix Factorization GCC 5.3.1/6.0 is more than 2 times slower
> than Clang.
>
> G++ options: -O3 -march=native
> CPU: Xeon E3-1280 v5

Maybe.  Though with -Ofast -march=native -funroll-loops, at least on my box,
GCC-emitted code is faster than clang's; without -funroll-loops it is the
opposite.
The following patch by itself closes the gap between the C++ and C FEs, to
make compilation with the C++ FE at least as good as with the C FE:

diff --git a/gcc/cp/cp-gimplify.c b/gcc/cp/cp-gimplify.c
index b6e1d27..8b1d020 100644
--- a/gcc/cp/cp-gimplify.c
+++ b/gcc/cp/cp-gimplify.c
@@ -237,8 +237,8 @@ genericize_cp_loop (tree *stmt_p, location_t start_locus, tree cond, tree body,
       location_t cloc = EXPR_LOC_OR_LOC (cond, start_locus);
       exit = build1_loc (cloc, GOTO_EXPR, void_type_node,
                          get_bc_label (bc_break));
-      exit = fold_build3_loc (cloc, COND_EXPR, void_type_node, cond,
-                              build_empty_stmt (cloc), exit);
+      exit = build3_loc (cloc, COND_EXPR, void_type_node, cond,
+                         build_empty_stmt (cloc), exit);
     }
 
   if (exit && cond_is_first)

Here, when building the exit condition of a for-loop, fold_build3 decides to
swap the operands of the COND_EXPR (and invert its condition), which I
suppose causes subsequent optimization passes to generate worse code for
these loops.  Avoid having this done by using build3 instead.  This means
that we won't fold degenerate COND_EXPRs, but I doubt that's a big deal in
practice.

(An alternative fix would be to adjust tree_swap_operands_p to avoid
returning true in this case -- maybe return false if the TREE_TYPEs are
VOID_TYPEs?)

I haven't checked -flto or an optimization setting besides -Ofast.
So with the patch, g++ -flto -Ofast is on par with gcc -flto -Ofast and better than g++ -Ofast. Could anyone confirm?
Patch applied to
gcc version 6.0.0 20160324 (experimental) [trunk revision 234461] (GCC)

I confirm the improvement in timing for c++ and lto.
The timing difference between gcc and c++ now seems to be within measurement
error.  I am satisfied.

Thanks Patrick!

(btw I suppose there is no hope for a backport to 5.4?)
On March 25, 2016 8:42:50 AM GMT+01:00, "vincenzo.innocente at cern dot ch"
<gcc-bugzilla@gcc.gnu.org> wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69564
>
> --- Comment #19 from vincenzo Innocente <vincenzo.innocente at cern dot ch> ---
> patch applied to
> gcc version 6.0.0 20160324 (experimental) [trunk revision 234461] (GCC)
>
> I confirm the improvement in timing for c++ and lto
> timing difference between gcc and c++ seems to be inside "errors"
> I am satisfied.
>
> Thanks Patrick!
>
> (btw I suppose no hope for a back port to 5.4?)

It needs some more understanding and a fix in the appropriate place.

Thanks Patrick for narrowing down a workaround.
To be clear, given the loop

  for (int i = 0; i < M; i++)
    {
      ...
    }

the fold_build3 in question was transforming

  if (i < M) fallthrough; else goto exit;

to

  if (i >= M) goto exit; else fallthrough;

The C FE emits

  if (i < M) goto body; else goto L;
  L:

I would guess r217669 introduced the regression.  Before this commit the two
arms of the COND_EXPR would be GOTO_EXPRs and so during folding
tree_swap_operands_p would return false.  After this commit, the true arm is
an empty statement which is TREE_CONSTANT and so during folding
tree_swap_operands_p returns true.

The tree dumps do not seem to diverge significantly with and without the
above patch.  The only difference throughout is the inversion of the
branches.
(In reply to Patrick Palka from comment #21)
> The tree dumps do not seem to diverge significantly with and without the
> above patch.  The only difference throughout is the inversion of the
> branches.

Was only looking at LU.c though.
So I tried to investigate why prediction doesn't fix this.  In fact, with the
following patch I get to a similar level of performance as with not folding
the conditional (notice append_to_statement_list_force, which isn't used in
continue genericization, which means its predict-expr is thrown away...):

Index: gcc/cp/cp-gimplify.c
===================================================================
--- gcc/cp/cp-gimplify.c        (revision 234453)
+++ gcc/cp/cp-gimplify.c        (working copy)
@@ -237,8 +237,12 @@ genericize_cp_loop (tree *stmt_p, locati
       location_t cloc = EXPR_LOC_OR_LOC (cond, start_locus);
       exit = build1_loc (cloc, GOTO_EXPR, void_type_node,
                          get_bc_label (bc_break));
+      tree pred = build_predict_expr (PRED_LOOP_EXIT, NOT_TAKEN);
+      tree stmt_list = NULL;
+      append_to_statement_list_force (pred, &stmt_list);
+      append_to_statement_list (exit, &stmt_list);
       exit = fold_build3_loc (cloc, COND_EXPR, void_type_node, cond,
-                              build_empty_stmt (cloc), exit);
+                              build_empty_stmt (cloc), stmt_list);
     }
 
   if (exit && cond_is_first)

Now the question to me is still why we don't predict the loop exit correctly
ourselves (or not as strongly as desired for this testcase).  Honza?

The loop has multiple exits and a non-constant iteration bound, which is
where predict_loops () gives up.  So the question is whether the above is a
good idea in general (likewise "fixing" the continue predict hint below).

The loop in question is

  for (j=0; j<minMN; j++)
    {
      ... loop ...

      if ( A[jp][j] == 0 )
        return 1;       /* factorization failed because of zero pivot */

      ... conditional loop ...

      ... conditional loop nest ...
    }

and we only predict with

  int probability = ((REG_BR_PROB_BASE - predictor_info [(int) PRED_LOOP_EXIT].hitrate)
                     / n_exits);

where n_exits == 2.

But the question is really why basic-block order or branch inversion makes
this kind of difference.  We _do_ predict the same with the s/fold_//g patch
from Patrick.

The following patch "fixes" continue genericization:

Index: gcc/cp/cp-gimplify.c
===================================================================
--- gcc/cp/cp-gimplify.c        (revision 234453)
+++ gcc/cp/cp-gimplify.c        (working copy)
@@ -361,7 +365,7 @@ genericize_continue_stmt (tree *stmt_p)
   tree label = get_bc_label (bc_continue);
   location_t location = EXPR_LOCATION (*stmt_p);
   tree jump = build1_loc (location, GOTO_EXPR, void_type_node, label);
-  append_to_statement_list (pred, &stmt_list);
+  append_to_statement_list_force (pred, &stmt_list);
   append_to_statement_list (jump, &stmt_list);
   *stmt_p = stmt_list;
 }
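For a rough feel for the numbers (an illustrative calculation; REG_BR_PROB_BASE
is 10000, but the PRED_LOOP_EXIT hitrate used here is an assumption): with a
hitrate of, say, 8500 (85%) and n_exits == 2, the formula above gives

  probability = (10000 - 8500) / 2 = 750     /* ~7.5% per exit edge */

so the early zero-pivot return and the normal loop exit both end up with the
same, fairly weak prediction.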
Oh, and this is also another case where we end up with different out-of-SSA
coalescing desires + outcomes.

Sorted Coalesce list:
-(16562, 0) t_6 <-> t_105
-(16562, 1) ivtmp.101_80 <-> ivtmp.101_81
-(16562, 2) jp_3 <-> jp_107
-(7908, 0) ivtmp.94_76 <-> ivtmp.94_174
-(1640, 0) ivtmp.107_29 <-> ivtmp.107_132
-(1640, 0) ivtmp.113_228 <-> ivtmp.113_229
...
+(3438, 0) ivtmp.74_16 <-> ivtmp.74_17
+(3438, 0) ivtmp.77_192 <-> ivtmp.77_226
+(3312, 0) ivtmp.67_95 <-> ivtmp.67_164
+(1716, 0) t_6 <-> t_105
+(1716, 1) ivtmp.101_80 <-> ivtmp.101_81
+(1716, 2) jp_3 <-> jp_107
+(1638, 0) ivtmp.87_104 <-> ivtmp.87_188
+(961, 0) jj_96 <-> _242
+(820, 0) ivtmp.94_76 <-> ivtmp.94_174

Interestingly, we have fewer coalesce failures with fold_build3_loc.  Also,
edge frequencies seem to be very different in the end even though we
predicted things the same...

So I still think there's some underlying issue "elsewhere" (apart from
nothing adjusting BB "order" while we still have a CFG around, making
predicted edges fallthrus dependent on whether they are backwards or forward
jumps).
I benchmarked the patch in comment #17 with a full three-iteration run of
all_cpp on a Haswell machine with -Ofast -march=native (peak is patched).

                                  Estimated                       Estimated
                Base     Base       Base        Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
-------------- ------  ---------  ---------    ------  ---------  ---------
400.perlbench                     NR                               NR
401.bzip2                         NR                               NR
403.gcc                           NR                               NR
429.mcf                           NR                               NR
445.gobmk                         NR                               NR
456.hmmer                         NR                               NR
458.sjeng                         NR                               NR
462.libquantum                    NR                               NR
464.h264ref                       NR                               NR
471.omnetpp      6250       315      19.9  *     6250       312      20.0  *
473.astar        7020       323      21.7  *     7020       323      21.7  *
483.xalancbmk    6900       193      35.8  *     6900       192      35.9  *
 Est. SPECint_base2006                 --
 Est. SPECint2006                                                      --

                                  Estimated                       Estimated
                Base     Base       Base        Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
-------------- ------  ---------  ---------    ------  ---------  ---------
410.bwaves                        NR                               NR
416.gamess                        NR                               NR
433.milc                          NR                               NR
434.zeusmp                        NR                               NR
435.gromacs                       NR                               NR
436.cactusADM                     NR                               NR
437.leslie3d                      NR                               NR
444.namd         8020       300      26.7  *     8020       300      26.8  *
447.dealII      11440       253      45.3  *    11440       257      44.5  *
450.soplex       8340       218      38.3  *     8340       218      38.2  *
453.povray       5320       120      44.2  *     5320       120      44.2  *
454.calculix                      NR                               NR
459.GemsFDTD                      NR                               NR
465.tonto                         NR                               NR
470.lbm                           NR                               NR
481.wrf                           NR                               NR
482.sphinx3                       NR                               NR
 Est. SPECfp_base2006                  --
 Est. SPECfp2006                                                       --

447.dealII might not be noise; the three runs are

447.dealII      11440       319      35.8  S    11440       257      44.5  S
447.dealII      11440       252      45.4  S    11440       257      44.5  *
447.dealII      11440       253      45.3  *    11440       256      44.7  S

and I think we can take the peak values (stupid CPU freq boosting).  I'm
re-running just 447.dealII now; results tomorrow.  Binaries do differ.
In response to comment #24:

Coalescing's primary sort key is highly dependent on the edge frequencies --
which is generally what we want.  The highest edge frequency represents the
highest cost for any particular copy implied by a PHI node.

This obviously falls down if there is more than one lower-frequency copy that
can't be coalesced due to a higher-frequency copy that does coalesce, and the
sum of the lower-frequency copy costs is greater than the cost of the single
higher-frequency copy.

One might be able to model that by looking at the conflict lists, but before
going down that path, it'd obviously be best to know if that scenario is the
root cause of the slowdown.
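A hypothetical illustration of that failure mode (made-up frequencies, not
measured from LU_factor):

  /* Copy c1 (freq 1000) conflicts with copies c2 and c3 (freq 600 each).
     Coalescing greedily by frequency keeps c1 and saves a copy cost of
     1000, but then c2 and c3 cannot be coalesced, leaving 600 + 600 = 1200
     of copy cost behind -- a net loss versus coalescing c2 and c3 and
     keeping the single c1 copy.  */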
So, the dealII slowdown is confirmed:

447.dealII      11440       252      45.4  S    11440       255      44.8  S
447.dealII      11440       254      45.1  S    11440       256      44.6  S
447.dealII      11440       253      45.2  *    11440       256      44.7  *

It's probably as artificial as the slowdown in this PR and thus might be
acceptable, given that artificially swapping all exit comparisons isn't
something that should be done.

When profiling the difference, the culprits look like:

  6.55%  dealII_base.amd  libstdc++.so.6.0.22             [.] std::_Rb_tree_increment(std::_Rb_tree_node_base const*)
  6.53%  dealII_peak.amd  dealII_peak.amd64-m64-gcc42-nn  [.] MappingQ1<3>::compute_fill(TriaIterator<3, DoFCellAccessor<3> > const&, unsigned int, QProjector<3>::D
  6.34%  dealII_peak.amd  libstdc++.so.6.0.22             [.] std::_Rb_tree_increment(std::_Rb_tree_node_base const*)
  5.95%  dealII_base.amd  dealII_base.amd64-m64-gcc42-nn  [.] MappingQ1<3>::compute_fill(TriaIterator<3, DoFCellAccessor<3> > const&, unsigned int, QProjector<3>::D
...
  1.61%  dealII_peak.amd  dealII_peak.amd64-m64-gcc42-nn  [.] SparsityPattern::operator()(unsigned int, unsigned int) const
  1.60%  dealII_peak.amd  dealII_peak.amd64-m64-gcc42-nn  [.] MappingQ<3>::transform_covariant(Tensor<1, 3>*, Tensor<1, 3>*, Tensor<1, 3> const*, Mapping<3>::Intern
  1.60%  dealII_base.amd  dealII_base.amd64-m64-gcc42-nn  [.] SparsityPattern::operator()(unsigned int, unsigned int) const
  1.56%  dealII_base.amd  dealII_base.amd64-m64-gcc42-nn  [.] MappingQ<3>::transform_covariant(Tensor<1, 3>*, Tensor<1, 3>*, Tensor<1, 3> const*, Mapping<3>::Intern

but there isn't any assembly difference for MappingQ1<3>::compute_fill ...
GCC 5.4 is being released, adjusting target milestone.
So, to bring this BZ back to the core questions (the scope seems to have
widened over the year since this was originally reported): namely, is the use
of LTO or C++ making things slower, particularly for scimark's LU
factorization test?

From my experiments, the answer is a very clear yes.  I hacked up the test a
bit to only run LU and run a fixed number of iterations.  That makes
comparisons with something like callgrind much easier.

Use of C++ adds 2-3% in terms of instruction counts.  LTO adds an additional
2-3% to the instruction counts.  These are additive; C++ with LTO is about 5%
higher than C without LTO.

The time (not surprisingly) is lost in LU_factor, and the main culprit seems
to be this pair of nested loops:

      int ii;
      for (ii=j+1; ii<M; ii++)
        {
          double *Aii = A[ii];
          double *Aj = A[j];
          double AiiJ = Aii[j];            /* Here */
          int jj;
          for (jj=j+1; jj<N; jj++)
            Aii[jj] -= AiiJ * Aj[jj];
        }

Callgrind calls out the marked line, which probably in reality means the
preheader for the inner loop.  For C w/o LTO it's ~12 million instructions.
For C++ with LTO it's ~21 million instructions (remember, I'm just running LU
and for a relatively small number of iterations).

It's a bit of a surprise as these loops are dead simple, but it appears we've
got to be doing something dumb somewhere.  Hopefully that narrows things down
a bit.
Looking at the dumps, the tests for whether or not to use the vectorized loop
are considerably more complex with LTO, with memory references as well, when
compared to the non-LTO version.  It's almost as if the program-wide
visibility provided by LTO has *hidden* redundancies and mucked up the IV
analysis.

The first block of the LTO check looks like:

  # ivtmp.66_807 = PHI <ivtmp.87_773(55), ivtmp.66_812(67)>
  Aii_206 = MEM[base: A_139, index: ivtmp.66_807, step: 8, offset: 0B];
  _208 = Aii_206 + _765;
  AiiJ_209 = *_208;
  _449 = SOR_size_15 > i_167;
  _763 = (unsigned int) i_167;
  _762 = _686 - _763;
  _444 = _762 > 8;
  _443 = _444 & _449;
  _438 = Aii_206 + _760;
  _433 = _434 >= _438;
  _425 = Aii_206 + _435;
  _424 = _425 >= _429;
  _423 = _424 | _433;
  _422 = _423 & _443;
  if (_422 != 0)
    goto <bb 57>; [80.00%]
  else
    goto <bb 65>; [20.00%]

Compared to the non-LTO block:

  _156 = ivtmp.112_242 + 16;
  _155 = prephitmp_168 + _156;
  _151 = Aii_72 + ivtmp.112_242;
  _150 = _151 >= _155;
  _146 = Aii_72 + _156;
  _142 = prephitmp_168 + ivtmp.112_242;
  _141 = _142 >= _146;
  _140 = _141 | _150;
  _139 = _140 & _160;
  if (_139 != 0)
    goto <bb 23>;
  else
    goto <bb 32>;

Then the next block in the check (LTO):

;;   basic block 57, loop depth 3, count 0, freq 66, maybe hot
;;    prev block 56, next block 58, flags: (NEW, REACHABLE, VISITED)
;;    pred:       56 [80.0%]  (TRUE_VALUE,EXECUTABLE)
  niters.2_408 = SOR_size_15 > i_167 ? _762 : 1;
  _366 = (unsigned long) _425;
  _365 = _366 >> 3;
  _364 = -_365;
  _363 = (unsigned int) _364;
  prolog_loop_niters.4_367 = _363 & 1;
  _740 = (unsigned int) ivtmp.86_769;
  _509 = _686 + 4294967294;
  _738 = _509 - _740;
  _295 = SOR_size_15 > i_167 ? _738 : 0;
  _283 = prolog_loop_niters.4_367 == 0 ? 1 : 2;
  if (_283 > _295)
    goto <bb 63>; [10.00%]
  else
    goto <bb 58>; [90.00%]

non-LTO:

;;   basic block 23, loop depth 2, count 0, freq 2, maybe hot
;;    prev block 22, next block 24, flags: (NEW, REACHABLE)
;;    pred:       22 [80.0%]  (TRUE_VALUE,EXECUTABLE)
  _119 = (unsigned long) _151;
  _118 = _119 & 15;
  _117 = _118 >> 3;
  _116 = -_117;
  _115 = (unsigned int) _116;
  _114 = _115 & 1;
  prolog_loop_niters.46_120 = MIN_EXPR <_114, ivtmp.115_244>;
  if (prolog_loop_niters.46_120 == 0)
    goto <bb 25>;
  else
    goto <bb 24>;

Egad.  No wonder LTO loses.  I don't think the loop iterates enough to make
up for this mess.
I can't see any difference in the vectorizer dumps; one notable thing is that
we (again) do peeling for alignment, which is hardly useful on x86_64 these
days, at least if you only have a single store stream.  The code after the
vectorizer looks 1:1 the same, and so do the alias checks.

IVOPTs makes different decisions for the outer loop (LU.c:87); that is
because with LTO we have more context (the function got inlined) and thus
"expansion" brings up bigger expressions (well, that's my guess).
  int ii;
  for (ii=j+1; ii<M; ii++)
    {
      double *Aii = A[ii];
      double *Aj = A[j];
      double AiiJ = Aii[j];            /* Here */
      int jj;
      for (jj=j+1; jj<N; jj++)
        Aii[jj] -= AiiJ * Aj[jj];
    }

The alias check for Aii[jj] vs. Aj[jj] should be as "simple" as
abs(Aii - Aj) >= VF * sizeof (double).  It certainly looks more complicated
than that:

  _1100 = (unsigned int) SOR_size_19;
  _1096 = (unsigned int) j_910;
  _1087 = _1100 - _1096;
  _1077 = _1087 + 4294967295;
  _1076 = _1077 > 8;
  _1075 = (sizetype) j_910;
  _1074 = _1075 + 3;
  _1073 = _1074 * 8;
  _1072 = _539 + _1073;
  _1071 = (sizetype) j_910;
  _1056 = _1071 + 1;
  _1055 = _1056 * 8;
  _1054 = Aii_557 + _1055;
  _1053 = _1054 >= _1072;
  _1052 = (sizetype) j_910;
  _1034 = _1052 + 3;
  _1033 = _1034 * 8;
  _1032 = Aii_557 + _1033;
  _1031 = (sizetype) j_910;
  _1030 = _1031 + 1;
  _1029 = _1030 * 8;
  _1028 = _539 + _1029;
  _1027 = _1028 >= _1032;
  _1026 = _1027 | _1053;
  _1025 = _1026 & _1076;
  if (_1025 != 0)

Apparently we expanded things too much.

Creating dr for *_566
analyze_innermost:
Applying pattern match.pd:84, generic-match.c:11338
success.
Applying pattern match.pd:84, generic-match.c:11338
        base_address: _539 + (sizetype) ((long unsigned int) j_910 * 8)
        offset from base address: 0
        constant offset from base address: 8
        step: 8
        aligned to: 128
        base_object: *_539 + (sizetype) ((long unsigned int) j_910 * 8)
        Access function 0: {8B, +, 8}_25
Creating dr for *_564
analyze_innermost:
Applying pattern match.pd:84, generic-match.c:11338
success.
Applying pattern match.pd:84, generic-match.c:11338
        base_address: Aii_557 + (sizetype) ((long unsigned int) j_910 * 8)
        offset from base address: 0
        constant offset from base address: 8
        step: 8
        aligned to: 128
        base_object: *Aii_557 + (sizetype) ((long unsigned int) j_910 * 8)
        Access function 0: {8B, +, 8}_25
...
LU.c:93:32: note: mark for run-time aliasing test between *_566 and *_564

It looks like we fail to split the common exprs away.  But that may be
hard(er) because data reference analysis produces

  DR_BASE_ADDRESS == Aii_81 + ((sizetype) j_94 + 1) * 8
  DR_OFFSET == 0

rather than the expected

  DR_OFFSET == ((sizetype) j_94 + 1) * 8

because of the way pointer accesses are handled (only the get_inner_reference
part provides some input to the "offset-IV" analysis).  Later,
ptr1 + X <= ptr2 + X is not folded to ptr1 <= ptr2 because of overflow
concerns.
For example, with

Index: tree-vect-loop-manip.c
===================================================================
--- tree-vect-loop-manip.c      (revision 245501)
+++ tree-vect-loop-manip.c      (working copy)
@@ -2187,6 +2187,14 @@ create_intersect_range_checks (loop_vec_
   tree addr_base_b = DR_BASE_ADDRESS (dr_b.dr);
   tree offset_a = DR_OFFSET (dr_a.dr), offset_b = DR_OFFSET (dr_b.dr);
 
+  if (TREE_CODE (addr_base_a) == POINTER_PLUS_EXPR
+      && TREE_CODE (addr_base_b) == POINTER_PLUS_EXPR
+      && operand_equal_p (TREE_OPERAND (addr_base_a, 1),
+                          TREE_OPERAND (addr_base_b, 1), 0))
+    {
+      addr_base_a = TREE_OPERAND (addr_base_a, 0);
+      addr_base_b = TREE_OPERAND (addr_base_b, 0);
+    }
   offset_a = fold_build2 (PLUS_EXPR, TREE_TYPE (offset_a),
                           offset_a, DR_INIT (dr_a.dr));
   offset_b = fold_build2 (PLUS_EXPR, TREE_TYPE (offset_b),

the versioning condition is simplified to

  _173 = (unsigned int) N_67(D);
  _162 = (unsigned int) j_94;
  _161 = _173 - _162;
  _160 = _161 + 4294967295;
  _159 = _160 > 8;
  _158 = prephitmp_172 + 24;
  _157 = Aii_81 + 8;
  _156 = _157 >= _158;
  _155 = Aii_81 + 24;
  _154 = prephitmp_172 + 8;
  _153 = _154 >= _155;
  _152 = _153 | _156;
  _151 = _152 & _159;
  if (_151 != 0)

Note that while the cost model check comes first (the > 8 compare), we do not
separate that out into a GIMPLE_COND, and RTL expansion might f*ck up
condition ordering.
But as A + 8 >= B || A >= B + 8 is the same as ABS (A - B) >= 8, we might do
better rewriting the overlap test in terms of this (of course it all really
depends on whether that, and the offset stripping, handle arrays crossing the
"signed overlap" boundary in the address space correctly...).
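A minimal sketch of what such a distance-based versioning check could look
like for the LU inner loop (the helper name, VF parameter and row pointers
are illustrative assumptions, and the signed-wrap caveat above is ignored):

  #include <stddef.h>
  #include <stdint.h>

  /* Sketch only: the vectorized inner loop reads Aj[jj] and writes Aii[jj],
     so a single distance test replaces the pair of range-end comparisons.  */
  static int
  rows_far_enough_apart (const double *Aii, const double *Aj, size_t vf)
  {
    ptrdiff_t dist = (ptrdiff_t) ((uintptr_t) Aii - (uintptr_t) Aj);
    ptrdiff_t min  = (ptrdiff_t) (vf * sizeof (double));
    return dist >= min || dist <= -min;   /* ABS (Aii - Aj) >= VF * 8 */
  }

The vectorizer would then branch to the vectorized body when this is nonzero
and to the scalar fallback otherwise.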
(In reply to Richard Biener from comment #34)
> But as A + 8 >= B || A >= B + 8 is the same as ABS (A - B) >= 8, we might do
> better rewriting the overlap test in terms of this (of course it all really
> depends on whether that, and the offset stripping, handle arrays crossing
> the "signed overlap" boundary in the address space correctly...)

I had a patch doing this; will revise and send for review.  Thanks.
GCC 5 branch is being closed
GCC 6 branch is being closed
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
Created attachment 47649 [details]
patch to revert to lowering loops like the C front end

Could someone with a better sense of loop code generation tell me if this
improves things?
GCC 8.4.0 has been released, adjusting target milestone.
GCC 8 branch is being closed.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.
GCC 11 branch is being closed.
GCC 12 branch is being closed.