Bug 69564 - [13/14/15/16 Regression] lto and/or C++ make scimark2 LU slower
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: c++
Version: 6.0
Importance: P2 normal
Target Milestone: 13.5
Assignee: Not yet assigned to anyone
Keywords: lto, missed-optimization
Blocks: 79703 78402
Reported: 2016-01-30 10:50 UTC by vincenzo Innocente
Modified: 2025-07-11 08:54 UTC
CC: 7 users

Known to work: 4.9.3
Known to fail: 5.3.0, 6.0
Last reconfirmed: 2016-02-01 00:00:00


Attachments
loop inversion sketch (517 bytes, patch), 2016-02-23 20:58 UTC, Jason Merrill
patch to revert to lowering loops like the C front end (2.20 KB, patch), 2020-01-14 03:09 UTC, Jason Merrill

Description vincenzo Innocente 2016-01-30 10:50:39 UTC
mkdir scimark2; cd scimark2
wget http://math.nist.gov/scimark2/scimark2_1c.zip
unzip scimark2_1c.zip
c++ -Ofast *.c; ./a.out
c++ -Ofast *.c -flto; ./a.out


with gcc 4.9.3
gcc version 4.9.3 (GCC) 

c++ -Ofast *.c; ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         2462.90
FFT             Mflops:  2070.32    (N=1024)
SOR             Mflops:  1661.17    (100 x 100)
MonteCarlo:     Mflops:   813.44
Sparse matmult  Mflops:  2978.91    (N=1000, nz=5000)
LU              Mflops:  4790.64    (M=100, N=100)
[innocent@vinavx3 scimark2]$ c++ -Ofast *.c -flto; ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         2582.94
FFT             Mflops:  2064.19    (N=1024)
SOR             Mflops:  1654.04    (100 x 100)
MonteCarlo:     Mflops:  1426.90
Sparse matmult  Mflops:  2978.91    (N=1000, nz=5000)
LU              Mflops:  4790.64    (M=100, N=100)


with latest build
gcc version 6.0.0 20160129 (experimental) (GCC) 

[innocent@vinavx3 scimark2]$ c++ -Ofast *.c; ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         2377.18
FFT             Mflops:  1970.89    (N=1024)
SOR             Mflops:  1654.04    (100 x 100)
MonteCarlo:     Mflops:   810.37
Sparse matmult  Mflops:  3328.81    (N=1000, nz=5000)
LU              Mflops:  4121.76    (M=100, N=100)
[innocent@vinavx3 scimark2]$ c++ -Ofast *.c -flto; ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         2136.23
FFT             Mflops:  2076.48    (N=1024)
SOR             Mflops:  1654.04    (100 x 100)
MonteCarlo:     Mflops:  1533.92
Sparse matmult  Mflops:  3266.59    (N=1000, nz=5000)
LU              Mflops:  2150.13    (M=100, N=100)
Comment 1 Richard Biener 2016-02-01 10:22:48 UTC
Any reason you are using the c++ driver here?

I get

> gcc-6 -Ofast -flto *.c -lm -B /abuild/rguenther/trunk3-g/gcc
> ./a.out 
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         1729.02
FFT             Mflops:  1247.04    (N=1024)
SOR             Mflops:  1537.70    (100 x 100)
MonteCarlo:     Mflops:   842.21
Sparse matmult  Mflops:  1657.86    (N=1000, nz=5000)
LU              Mflops:  3360.29    (M=100, N=100)

> gcc-6 -Ofast  *.c -lm -B /abuild/rguenther/trunk3-g/gcc
> ./a.out 
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         1645.94
FFT             Mflops:  1288.61    (N=1024)
SOR             Mflops:  1471.29    (100 x 100)
MonteCarlo:     Mflops:   459.90
Sparse matmult  Mflops:  1665.91    (N=1000, nz=5000)
LU              Mflops:  3343.98    (M=100, N=100)


Ok, when using g++ to compile things I _do_ get

> g++-6 -Ofast -flto *.c -lm -B /abuild/rguenther/trunk3-g/gcc
> ./a.out 
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         1321.43
FFT             Mflops:  1261.86    (N=1024)
SOR             Mflops:  1533.77    (100 x 100)
MonteCarlo:     Mflops:   850.69
Sparse matmult  Mflops:  1669.90    (N=1000, nz=5000)
LU              Mflops:  1290.93    (M=100, N=100)

> g++-6 -Ofast  *.c -lm -B /abuild/rguenther/trunk3-g/gcc
> ./a.out 
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         1492.12
FFT             Mflops:  1279.12    (N=1024)
SOR             Mflops:  1479.86    (100 x 100)
MonteCarlo:     Mflops:   433.83
Sparse matmult  Mflops:  1637.11    (N=1000, nz=5000)
LU              Mflops:  2630.71    (M=100, N=100)

So even without LTO I get a hit from using C++ to compile LU.  Interesting.
Comment 2 Richard Biener 2016-02-01 10:33:02 UTC
It looks like we get a different BB order out of C++ than out of C, but otherwise no real code differences as far as I can see.
Comment 3 vincenzo Innocente 2016-02-01 10:34:29 UTC
> Any reason you are using the c++ driver here?
Because I am interested in C++ performance;
I never imagined that the C++ front end could make a difference on such code.
From my point of view it is an even more severe regression than just "lto".
Comment 4 Richard Biener 2016-02-01 10:53:59 UTC
(In reply to vincenzo Innocente from comment #3)
> > Any reason you are using the c++ driver here?
> Because I am interested in C++ performance;
> I never imagined that the C++ front end could make a difference on such code.
> From my point of view it is an even more severe regression than just "lto".

Yeah, didn't try to figure out whether the C vs. C++ thing is a 
regression.  But I suspect the change to the C++ loop lowering.

Certainly needs closer investigation.
Comment 5 vincenzo Innocente 2016-02-01 10:59:47 UTC
It is a regression:
gcc version 4.9.3 (GCC) 
c++ -Ofast *.c; ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         2449.06
FFT             Mflops:  2046.03    (N=1024)
SOR             Mflops:  1654.04    (100 x 100)
MonteCarlo:     Mflops:   813.44
Sparse matmult  Mflops:  2962.08    (N=1000, nz=5000)
LU              Mflops:  4769.72    (M=100, N=100)
---
gcc -Ofast *.c -lm; ./a.out
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         2475.22
FFT             Mflops:  2064.19    (N=1024)
SOR             Mflops:  1633.01    (100 x 100)
MonteCarlo:     Mflops:   810.37
Sparse matmult  Mflops:  2970.47    (N=1000, nz=5000)
LU              Mflops:  4898.06    (M=100, N=100)
Comment 6 Richard Biener 2016-02-01 11:21:14 UTC
Looks like since GCC 5 we wrap all loops in a while (1).  Possibly this just confuses GCC's static prediction machinery, but who knows.

Old (not the problematic function!):

{
    int i;
    int j;
  <<cleanup_point <<< Unknown tree: expr_stmt
  (void) (i = 0) >>>>>;
  goto <D.3610>;
  <D.3609>:;
  <<cleanup_point <<< Unknown tree: expr_stmt
  (void) (j = 0) >>>>>;
  goto <D.3608>;
  <D.3607>:;
  <<cleanup_point <<< Unknown tree: expr_stmt
  (void) (*(*(lu + (sizetype) ((long unsigned int) i * 8)) + (sizetype) ((long unsigned int) j * 8)) = *(*(A + (sizetype) ((long unsigned int) i * 8)) + (sizetype) ((long unsigned int) j * 8))) >>>>>;
  <<cleanup_point (void) j++ >>;
  <D.3608>:;
  if (j < N) goto <D.3607>; else goto <D.3605>;
  <D.3605>:;
  <<cleanup_point (void) i++ >>;
  <D.3610>:;
  if (i < M) goto <D.3609>; else goto <D.3603>;
  <D.3603>:;
}

New:

{
    int i;
    int j;
  <<cleanup_point <<< Unknown tree: expr_stmt
  (void) (i = 0) >>>>>;
  while (1)
    {
      if (i >= M) goto <D.3704>;
      <<cleanup_point <<< Unknown tree: expr_stmt
  (void) (j = 0) >>>>>;
      while (1)
        {
          if (j >= N) goto <D.3706>;
          <<cleanup_point <<< Unknown tree: expr_stmt
  (void) (*(*(lu + (sizetype) ((long unsigned int) i * 8)) + (sizetype) ((long unsigned int) j * 8)) = *(*(A + (sizetype) ((long unsigned int) i * 8)) + (sizetype) ((long unsigned int) j * 8))) >>>>>;
          <<cleanup_point (void) j++ >>;
        }
      <D.3706>:;
      <<cleanup_point (void) i++ >>;
    }
  <D.3704>:;
}

Or, more likely, it's the increment coming after the exit check, so loops are
no longer in do-while style.
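
For illustration, the two lowerings correspond roughly to these hand-written C sketches (an approximation, not actual front-end output):

/* GCC 4.9 style: inverted (do-while style) lowering; the increment
   happens before the bottom exit test.  */
void
copy_old (int M, double *dst, const double *src)
{
  int i = 0;
  goto test;
body:
  dst[i] = src[i];
  i++;
test:
  if (i < M)
    goto body;
}

/* GCC 5+ style: while (1) with the exit test at the top; the
   increment comes after the exit check.  */
void
copy_new (int M, double *dst, const double *src)
{
  int i = 0;
  while (1)
    {
      if (i >= M)
        break;
      dst[i] = src[i];
      i++;
    }
}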
Comment 7 Jason Merrill 2016-02-23 20:57:29 UTC
(In reply to Richard Biener from comment #4)
> Yeah, didn't try to figure out whether the C vs. C++ thing is a 
> regression.  But I suspect the change to the C++ loop lowering.

Yes, the relatively small difference between C and C++ without LTO seems to be due to the loop inversion that the C front end still does and the C++ front end doesn't.  If I add loop inversion back into the C++ front end so that the .optimized output is indistinguishable, the difference without LTO is resolved, but LTO still makes the C++ output much slower.
Comment 8 Jason Merrill 2016-02-23 20:58:55 UTC
Created attachment 37774 [details]
loop inversion sketch

This patch does loop inversion sufficient for scimark.  It will break constexpr, but might be useful for looking at the LTO issue.
Comment 9 Jason Merrill 2016-02-24 05:33:21 UTC
(In reply to Jason Merrill from comment #7)
> But if I add loop inversion back into the C++ front end
> so that the .optimized output is indistinguishable, that resolves the
> difference without LTO

Actually, no, the loop inversion patch in comment 8 doesn't improve performance; even though the .optimized dump is the same between C and C++, the C++ build is still slower by roughly the same amount.  I guess the RTL optimizers must be doing something different?  Mysterious.  Unassigning myself.
Comment 10 Patrick Palka 2016-02-25 05:30:49 UTC
One minor difference between the FEs, visible in the gimple dump of LU.c, is that the C++ front end doesn't transform (a < b ? a : b) to MIN_EXPR (a, b) whereas the C front end does.  This is because in fold-const.c the transformation is guarded by:

      /* Avoid these transformations if the COND_EXPR may be used
         as an lvalue in the C++ front-end.  PR c++/19199.  */
      && (in_gimple_form
          || VECTOR_TYPE_P (type)
          || (! lang_GNU_CXX ()
              && strcmp (lang_hooks.name, "GNU Objective-C++") != 0)
          || ! maybe_lvalue_p (arg1)
          || ! maybe_lvalue_p (arg2)))

This guard deliberately inhibits the transformation for C++.  Is the lvalue-ness retention mentioned there still a concern now that the FE performs delayed folding?  By the time the transformation gets performed, the FE will have lowered all uses of an lvalue COND_EXPR to an appropriate non-lvalue COND_EXPR.  Removing this hunk of code passes regtesting, at least.
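
For context, the lvalue-ness concern from PR c++/19199 that this guard protects can be shown with a minimal hand-written example:

/* In C++, unlike C, a conditional expression can be an lvalue, so
   folding (a < b ? a : b) to MIN_EXPR (a, b) before such uses are
   lowered would break code that assigns through the conditional.  */
int a = 1, b = 2;

void
zero_smaller ()
{
  (a < b ? a : b) = 0;  /* valid C++: writes to whichever of a/b is smaller */
}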
Comment 11 Jakub Jelinek 2016-02-25 08:38:32 UTC
Maybe.  But the middle-end is able to perform that optimization later too (e.g. during phiopt pass, or ifcvt).
Comment 12 Jakub Jelinek 2016-02-25 12:41:53 UTC
I don't see any difference, though, with either the fold-const.c change or the loop-invert.patch (at least on my Haswell-E, -g -Ofast, x86_64, single runs only).  The numbers do show the LU slowdown with C++ clearly, and also that clang wins on SOR and Sparse matmult while we win significantly on MonteCarlo and less significantly on FFT; C LU is comparable:
gcc trunk 20160224
Composite Score:         2482.97
FFT             Mflops:  1982.24    (N=1024)
SOR             Mflops:  1904.08    (100 x 100)
MonteCarlo:     Mflops:   677.65
Sparse matmult  Mflops:  2775.38    (N=1000, nz=5000)
LU              Mflops:  5075.48    (M=100, N=100)
g++ trunk 20160224
Composite Score:         2314.35
FFT             Mflops:  1986.77    (N=1024)
SOR             Mflops:  1903.29    (100 x 100)
MonteCarlo:     Mflops:   678.80
Sparse matmult  Mflops:  2775.33    (N=1000, nz=5000)
LU              Mflops:  4227.54    (M=100, N=100)
g++ trunk 20160224 + fold-const.c MIN/MAX change
Composite Score:         2331.88
FFT             Mflops:  1983.28    (N=1024)
SOR             Mflops:  1906.04    (100 x 100)
MonteCarlo:     Mflops:   676.53
Sparse matmult  Mflops:  2823.60    (N=1000, nz=5000)
LU              Mflops:  4269.96    (M=100, N=100)
g++ trunk 20160224 + fold-const.c MIN/MAX change + loop-invert.patch
Composite Score:         2332.00
FFT             Mflops:  1983.18    (N=1024)
SOR             Mflops:  1905.64    (100 x 100)
MonteCarlo:     Mflops:   674.50
Sparse matmult  Mflops:  2823.55    (N=1000, nz=5000)
LU              Mflops:  4273.14    (M=100, N=100)
clang 3.8
Composite Score:         2418.13
FFT             Mflops:  1583.23    (N=1024)
SOR             Mflops:  2130.27    (100 x 100)
MonteCarlo:     Mflops:   281.80
Sparse matmult  Mflops:  3026.40    (N=1000, nz=5000)
LU              Mflops:  5068.95    (M=100, N=100)
clang++ 3.8
Composite Score:         2434.04
FFT             Mflops:  1595.89    (N=1024)
SOR             Mflops:  2131.09    (100 x 100)
MonteCarlo:     Mflops:   281.63
Sparse matmult  Mflops:  3001.59    (N=1000, nz=5000)
LU              Mflops:  5159.98    (M=100, N=100)
Comment 13 Richard Biener 2016-02-25 12:52:29 UTC
It would still be good to get rid of this guard.  Not sure why we key off in_gimple_form either.

What we get is branches going in different directions, and thus BB reorder is
entered with a different BB order.

Edge frequencies seem to match but BB frequencies are off, somehow differently scaled.

-;;   basic block 21, loop depth 2, count 0, freq 655, maybe hot
-;;   Invalid sum of incoming frequencies 985, should be 655
+;;   basic block 21, loop depth 2, count 0, freq 3, maybe hot
...
;;   basic block 23, loop depth 2, count 0, freq 720, maybe hot
+;;   basic block 23, loop depth 2, count 0, freq 2, maybe hot

etc. ('-' is C, '+' is C++).  There are very few edge frequency differences,
but those that differ are on backedges, for example:

 ;;    pred:       25 [100.0%]  (FALLTHRU,EXECUTABLE)
-;;                26 [91.0%]  (TRUE_VALUE,EXECUTABLE)
-  # ivtmp.72_13 = PHI <0(25), ivtmp.72_14(26)>
-  # ivtmp.75_226 = PHI <0(25), ivtmp.75_192(26)>
...
+;;                26 [80.0%]  (FALSE_VALUE,EXECUTABLE)
+  # ivtmp.73_13 = PHI <0(25), ivtmp.73_14(26)>
+  # ivtmp.76_226 = PHI <0(25), ivtmp.76_192(26)>

It's already like that at profile-estimate time (for LU_factor), which also has
2 more basic blocks and 3 more edges for C++ (with the folding thing
"fixed").  Didn't check performance with the folding fixed.
Comment 14 Jakub Jelinek 2016-02-25 14:50:03 UTC
With -Ofast -g -march=native -funroll-loops we actually (for C) beat clang on LU.
The SOR difference is most likely an IVOPTs/scheduling/RA thing; there is nothing to actually vectorize because of the dependencies.
clang 3.8 (whether or not -funroll-loops is used) unrolls the main SOR loop 4 times and we get roughly:
	.align	16, 0x90
.LBB1_12:
	vmovsd	-24(%r13,%rax,8), %xmm4
	vaddsd	-24(%r12,%rax,8), %xmm4, %xmm4
	vmovsd	-24(%rbx,%rax,8), %xmm5
	vaddsd	%xmm5, %xmm3, %xmm3
	vaddsd	%xmm3, %xmm4, %xmm3
	vmulsd	%xmm0, %xmm2, %xmm2
	vfmadd231sd	%xmm3, %xmm1, %xmm2
	vmovsd	%xmm2, -32(%rbx,%rax,8)
	vmovsd	-16(%r13,%rax,8), %xmm3
	vaddsd	-16(%r12,%rax,8), %xmm3, %xmm3
	vmovsd	-16(%rbx,%rax,8), %xmm4
	vaddsd	%xmm4, %xmm3, %xmm3
	vaddsd	%xmm3, %xmm2, %xmm2
	vmulsd	%xmm0, %xmm5, %xmm3
	vfmadd231sd	%xmm2, %xmm1, %xmm3
	vmovsd	%xmm3, -24(%rbx,%rax,8)
	vmovsd	-8(%r13,%rax,8), %xmm2 
	vaddsd	-8(%r12,%rax,8), %xmm2, %xmm2
	vmovsd	-8(%rbx,%rax,8), %xmm5 
	vaddsd	%xmm5, %xmm2, %xmm2
	vaddsd	%xmm2, %xmm3, %xmm2
	vmulsd	%xmm0, %xmm4, %xmm3
	vfmadd231sd	%xmm2, %xmm1, %xmm3
	vmovsd	%xmm3, -16(%rbx,%rax,8)
	vmovsd	(%r13,%rax,8), %xmm2   
	vaddsd	(%r12,%rax,8), %xmm2, %xmm4
	vmovsd	(%rbx,%rax,8), %xmm2   
	vaddsd	%xmm2, %xmm4, %xmm4
	vaddsd	%xmm4, %xmm3, %xmm4
	vmulsd	%xmm0, %xmm5, %xmm3
	vfmadd231sd	%xmm4, %xmm1, %xmm3
	vmovsd	%xmm3, -8(%rbx,%rax,8)
	addq	$4, %rax
	cmpl	%eax, %r15d
	jne	.LBB1_12
gcc -Ofast -g -march=native unrolls just twice:
	.p2align 4,,10
	.p2align 3
.L9:
	vmovsd	-8(%r8,%rdi,8), %xmm5
	vmovsd	-16(%r9,%rdi,8), %xmm1
	vmulsd	%xmm4, %xmm2, %xmm4
	movslq	%edi, %rax
	vaddsd	-16(%r10,%rdi,8), %xmm1, %xmm1
	vaddsd	%xmm0, %xmm5, %xmm0
	vmulsd	%xmm5, %xmm2, %xmm5
	vaddsd	%xmm0, %xmm1, %xmm0
	vmovapd	%xmm0, %xmm1
	vfmadd132sd	%xmm3, %xmm4, %xmm1
	vmovsd	(%r8,%rdi,8), %xmm4
	vmovsd	%xmm1, -16(%r8,%rdi,8)
	vaddsd	%xmm4, %xmm1, %xmm1
	vmovsd	-8(%r10,%rdi,8), %xmm0
	vaddsd	-8(%r9,%rdi,8), %xmm0, %xmm0
	vaddsd	%xmm1, %xmm0, %xmm0
	vfmadd132sd	%xmm3, %xmm5, %xmm0
	vmovsd	%xmm0, -8(%r8,%rdi,8)
	addq	$2, %rdi
	cmpq	%rcx, %rdi
	jne	.L9
and with additional -funroll-loops we unroll 8 times instead:
.L9:
	vmovsd	-8(%rax,%rdi,8), %xmm6
	vmovsd	-16(%r8,%rdi,8), %xmm5
	vmulsd	%xmm7, %xmm1, %xmm8
	leaq	2(%rdi), %r14
	vaddsd	-16(%r9,%rdi,8), %xmm5, %xmm9
	vmovsd	(%rax,%rdi,8), %xmm12
	leaq	4(%rdi), %r10
	vaddsd	%xmm3, %xmm6, %xmm10
	vmulsd	%xmm6, %xmm1, %xmm7
	vmulsd	%xmm12, %xmm1, %xmm0
	vaddsd	%xmm10, %xmm9, %xmm11
	vfmadd132sd	%xmm2, %xmm8, %xmm11
	vmovsd	%xmm11, -16(%rax,%rdi,8)
	vaddsd	%xmm12, %xmm11, %xmm15
	vmovsd	-8(%r9,%rdi,8), %xmm13
	vaddsd	-8(%r8,%rdi,8), %xmm13, %xmm14
	vaddsd	%xmm15, %xmm14, %xmm4
	vfmadd132sd	%xmm2, %xmm7, %xmm4
	vmovsd	%xmm4, -8(%rax,%rdi,8)
	vmovsd	-8(%rax,%r14,8), %xmm6
	vmovsd	-16(%r8,%r14,8), %xmm3
	vaddsd	-16(%r9,%r14,8), %xmm3, %xmm8
	vmovsd	(%rax,%r14,8), %xmm10
	vaddsd	%xmm4, %xmm6, %xmm5
	vmulsd	%xmm6, %xmm1, %xmm11
	vmulsd	%xmm10, %xmm1, %xmm4
	vaddsd	%xmm5, %xmm8, %xmm9
	vfmadd132sd	%xmm2, %xmm0, %xmm9
	vmovsd	%xmm9, -16(%rax,%r14,8)
	vaddsd	%xmm10, %xmm9, %xmm13
	vmovsd	-8(%r9,%r14,8), %xmm12
	vaddsd	-8(%r8,%r14,8), %xmm12, %xmm7
	vaddsd	%xmm13, %xmm7, %xmm14
	vfmadd132sd	%xmm2, %xmm11, %xmm14
	vmovsd	%xmm14, -8(%rax,%r14,8)
	vmovsd	-8(%rax,%r10,8), %xmm15
	vmovsd	-16(%r8,%r10,8), %xmm6
	vaddsd	-16(%r9,%r10,8), %xmm6, %xmm0
	vmovsd	(%rax,%r10,8), %xmm9
	vaddsd	%xmm14, %xmm15, %xmm3
	vmulsd	%xmm15, %xmm1, %xmm5
	vmulsd	%xmm9, %xmm1, %xmm14
	vaddsd	%xmm3, %xmm0, %xmm8
	vfmadd132sd	%xmm2, %xmm4, %xmm8
	vmovsd	%xmm8, -16(%rax,%r10,8)
	vaddsd	%xmm9, %xmm8, %xmm12
	vmovsd	-8(%r9,%r10,8), %xmm10
	vaddsd	-8(%r8,%r10,8), %xmm10, %xmm11
	vaddsd	%xmm12, %xmm11, %xmm7
	vfmadd132sd	%xmm2, %xmm5, %xmm7
	vmovsd	%xmm7, -8(%rax,%r10,8)
	leaq	6(%rdi), %r10
	addq	$8, %rdi
	vmovsd	-8(%rax,%r10,8), %xmm13
	vmovsd	-16(%r8,%r10,8), %xmm15
	vaddsd	-16(%r9,%r10,8), %xmm15, %xmm6
	vaddsd	%xmm7, %xmm13, %xmm4
	vmovsd	(%rax,%r10,8), %xmm7
	vmulsd	%xmm13, %xmm1, %xmm8
	vaddsd	%xmm4, %xmm6, %xmm3
	vfmadd132sd	%xmm2, %xmm14, %xmm3
	vmovsd	%xmm3, -16(%rax,%r10,8)
	vaddsd	%xmm7, %xmm3, %xmm5
	vmovsd	-8(%r9,%r10,8), %xmm0
	vaddsd	-8(%r8,%r10,8), %xmm0, %xmm9
	vaddsd	%xmm5, %xmm9, %xmm0
	vfmadd132sd	%xmm2, %xmm8, %xmm0
	vmovapd	%xmm0, %xmm3
	vmovsd	%xmm0, -8(%rax,%r10,8)
	cmpq	%rcx, %rdi
	jne	.L9

The number of v*sd instructions matches between all 3 versions in proportion to the unroll factor, but the non-unrolled version has a weird extra move in there (the movslq %edi, %rax) even though it isn't used anywhere in the loop, and the -funroll-loops version is just too weird: many leaq insns when the constant could be folded into the immediates of the corresponding addresses.
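
For reference, the SOR kernel whose unrolled forms are shown above is essentially the following (as in the scimark2 sources; quoted from memory, so minor details may differ).  The Gi[j-1] term is the loop-carried dependence that prevents vectorization of the inner loop:

void
SOR_execute (int M, int N, double omega, double **G, int num_iterations)
{
  double omega_over_four = omega * 0.25;
  double one_minus_omega = 1.0 - omega;
  int Mm1 = M - 1, Nm1 = N - 1;

  for (int p = 0; p < num_iterations; p++)
    for (int i = 1; i < Mm1; i++)
      {
        double *Gi = G[i], *Gim1 = G[i - 1], *Gip1 = G[i + 1];
        for (int j = 1; j < Nm1; j++)
          Gi[j] = omega_over_four * (Gim1[j] + Gip1[j] + Gi[j - 1] + Gi[j + 1])
                  + one_minus_omega * Gi[j];
      }
}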
Comment 15 Artem S. Tashkinov 2016-02-27 11:00:15 UTC
Is this the same bug?

https://www.phoronix.com/scan.php?page=article&item=ubuntu-1604-compilers&num=2

In Dense LU Matrix Factorization GCC 5.3.1/6.0 is more than 2 times slower than Clang.

G++ options: -O3 -march=native
CPU: Xeon E3-1280 v5
Comment 16 Jakub Jelinek 2016-02-27 11:33:18 UTC
(In reply to Artem S. Tashkinov from comment #15)
> Is this the same bug?
> 
> https://www.phoronix.com/scan.php?page=article&item=ubuntu-1604-
> compilers&num=2
> 
> In Dense LU Matrix Factorization GCC 5.3.1/6.0 is more than 2 times slower
> than Clang.
> 
> G++ options: -O3 -march=native
> CPU: Xeon E3-1280 v5

Maybe.  Though with -Ofast -march=native -funroll-loops, at least on my box, the GCC-emitted code is faster than clang's; without -funroll-loops it is the opposite.
Comment 17 Patrick Palka 2016-03-25 03:41:36 UTC
The following patch by itself closes the gap between the C++ and C FEs, making compilation with the C++ FE at least as good as with the C FE:

diff --git a/gcc/cp/cp-gimplify.c b/gcc/cp/cp-gimplify.c
index b6e1d27..8b1d020 100644
--- a/gcc/cp/cp-gimplify.c
+++ b/gcc/cp/cp-gimplify.c
@@ -237,8 +237,8 @@ genericize_cp_loop (tree *stmt_p, location_t start_locus, tree cond, tree body,
       location_t cloc = EXPR_LOC_OR_LOC (cond, start_locus);
       exit = build1_loc (cloc, GOTO_EXPR, void_type_node,
                         get_bc_label (bc_break));
-      exit = fold_build3_loc (cloc, COND_EXPR, void_type_node, cond,
-                             build_empty_stmt (cloc), exit);
+      exit = build3_loc (cloc, COND_EXPR, void_type_node, cond,
+                         build_empty_stmt (cloc), exit);
     }
 
   if (exit && cond_is_first)

Here, when building the exit condition of a for-loop, fold_build3 decides to swap the operands of the COND_EXPR (and invert its condition), which I suppose causes subsequent optimization passes to generate worse code for these loops.  Avoid this by using build3 instead.  This means that we won't fold degenerate COND_EXPRs, but I doubt that's a big deal in practice.  (An alternative fix would be to adjust tree_swap_operands_p to avoid returning true in this case -- maybe return false if the TREE_TYPEs are VOID_TYPEs?)

I haven't checked -flto or an optimization setting besides -Ofast.
Comment 18 Patrick Palka 2016-03-25 03:50:51 UTC
So with the patch, g++ -flto -Ofast is on par with gcc -flto -Ofast and better than g++ -Ofast.  Could anyone confirm?
Comment 19 vincenzo Innocente 2016-03-25 07:42:50 UTC
Patch applied to
gcc version 6.0.0 20160324 (experimental) [trunk revision 234461] (GCC).
I confirm the improvement in timing for C++ and LTO;
the timing difference between gcc and c++ now seems to be within the error bars.
I am satisfied.

Thanks, Patrick!

(BTW, I suppose there is no hope for a backport to 5.4?)
Comment 20 rguenther@suse.de 2016-03-25 08:39:11 UTC
On March 25, 2016 8:42:50 AM GMT+01:00, "vincenzo.innocente at cern dot ch" <gcc-bugzilla@gcc.gnu.org> wrote:
>(BTW, I suppose there is no hope for a backport to 5.4?)

It needs some more understanding and a fix in the appropriate place.  Thanks Patrick for narrowing down a workaround.
Comment 21 Patrick Palka 2016-03-25 13:11:29 UTC
To be clear, given the loop 

  for (int i = 0; i < M; i++) { ... }

The fold_build3 in question was transforming

  if (i < M)
    fallthrough;
  else
    goto exit;

to

  if (i >= M)
    goto exit;
  else
    fallthrough;

The C FE emits

   if (i < M)
     goto body;
   else
     goto L;
L:


I would guess r217669 introduced the regression.  Before this commit the two arms of the COND_EXPR would be GOTO_EXPRs and so during folding tree_swap_operands_p would return false.  After this commit, the true arm is an empty statement which is TREE_CONSTANT and so during folding tree_swap_operands_p returns true.

The tree dumps do not seem to diverge significantly with and without the above patch.  The only difference throughout is the inversion of the branches.
Comment 22 Patrick Palka 2016-03-25 13:13:09 UTC
(In reply to Patrick Palka from comment #21)
> The tree dumps do not seem to diverge significantly with and without the
> above patch.  The only difference throughout is the inversion of the
> branches.

Was only looking at LU.c though.
Comment 23 Richard Biener 2016-03-29 11:19:45 UTC
So I tried to investigate why prediction doesn't fix this.  In fact, with the
following patch I get to a similar level of performance as with not folding
the conditional (note append_to_statement_list_force, which isn't used in
continue genericization, meaning its predict-expr is thrown away...):

Index: gcc/cp/cp-gimplify.c
===================================================================
--- gcc/cp/cp-gimplify.c        (revision 234453)
+++ gcc/cp/cp-gimplify.c        (working copy)
@@ -237,8 +237,12 @@ genericize_cp_loop (tree *stmt_p, locati
       location_t cloc = EXPR_LOC_OR_LOC (cond, start_locus);
       exit = build1_loc (cloc, GOTO_EXPR, void_type_node,
                         get_bc_label (bc_break));
+      tree pred = build_predict_expr (PRED_LOOP_EXIT, NOT_TAKEN);
+      tree stmt_list = NULL;
+      append_to_statement_list_force (pred, &stmt_list);
+      append_to_statement_list (exit, &stmt_list);
       exit = fold_build3_loc (cloc, COND_EXPR, void_type_node, cond,
-                             build_empty_stmt (cloc), exit);
+                             build_empty_stmt (cloc), stmt_list);
     }
 
   if (exit && cond_is_first)


Now the question to me is still why we don't predict the loop exit correctly
ourselves (or not as strongly as desired for this testcase).  Honza?  The loop has multiple exits and a non-constant iteration bound, which is where predict_loops () gives up.  So the question is whether the above is a good
idea in general (likewise "fixing" the continue predict hint below).  The
loop in question is

    for (j=0; j<minMN; j++)
    {
... loop ...
        if ( A[jp][j] == 0 )
            return 1;       /* factorization failed because of zero pivot */
... conditional loop ...
... conditional loop nest ...
    }

and we only predict with

              int probability = ((REG_BR_PROB_BASE
                                  - predictor_info [(int) PRED_LOOP_EXIT].hitrate)
                                 / n_exits);

where n_exits == 2.  But the question is really why basic-block order
or branch inversion makes this kind of difference.  We _do_ predict
the same with the s/fold_//g patch from Patrick.


The following patch "fixes" continue genericization.

Index: gcc/cp/cp-gimplify.c
===================================================================
--- gcc/cp/cp-gimplify.c        (revision 234453)
+++ gcc/cp/cp-gimplify.c        (working copy)
@@ -361,7 +365,7 @@ genericize_continue_stmt (tree *stmt_p)
   tree label = get_bc_label (bc_continue);
   location_t location = EXPR_LOCATION (*stmt_p);
   tree jump = build1_loc (location, GOTO_EXPR, void_type_node, label);
-  append_to_statement_list (pred, &stmt_list);
+  append_to_statement_list_force (pred, &stmt_list);
   append_to_statement_list (jump, &stmt_list);
   *stmt_p = stmt_list;
 }
Comment 24 Richard Biener 2016-03-29 11:40:02 UTC
Oh, and this is also another case where we end up with different out-of-SSA
coalescing desires + outcomes.

 Sorted Coalesce list:
-(16562, 0) t_6 <-> t_105
-(16562, 1) ivtmp.101_80 <-> ivtmp.101_81
-(16562, 2) jp_3 <-> jp_107
-(7908, 0) ivtmp.94_76 <-> ivtmp.94_174
-(1640, 0) ivtmp.107_29 <-> ivtmp.107_132
-(1640, 0) ivtmp.113_228 <-> ivtmp.113_229
...
+(3438, 0) ivtmp.74_16 <-> ivtmp.74_17
+(3438, 0) ivtmp.77_192 <-> ivtmp.77_226
+(3312, 0) ivtmp.67_95 <-> ivtmp.67_164
+(1716, 0) t_6 <-> t_105
+(1716, 1) ivtmp.101_80 <-> ivtmp.101_81
+(1716, 2) jp_3 <-> jp_107
+(1638, 0) ivtmp.87_104 <-> ivtmp.87_188
+(961, 0) jj_96 <-> _242
+(820, 0) ivtmp.94_76 <-> ivtmp.94_174

Interestingly we have fewer coalesce failures with fold_build3_loc.  Also,
edge frequencies seem to be very different in the end even though we predicted
things the same...

So I still think there's some underlying issue "elsewhere" (apart from
nothing adjusting BB "order" while we still have a CFG around, which makes
whether predicted edges become fallthrus dependent on whether they are
backward or forward jumps).
Comment 25 Richard Biener 2016-03-31 13:50:07 UTC
I benchmarked the patch in comment #17 with a full three-run on all_cpp
on a Haswell machine with -Ofast -march=native (peak is patched).

                                  Estimated                       Estimated
                Base     Base       Base        Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
-------------- ------  ---------  ---------    ------  ---------  ---------
400.perlbench                               NR                              NR 
401.bzip2                                   NR                              NR 
403.gcc                                     NR                              NR 
429.mcf                                     NR                              NR 
445.gobmk                                   NR                              NR 
456.hmmer                                   NR                              NR 
458.sjeng                                   NR                              NR 
462.libquantum                              NR                              NR 
464.h264ref                                 NR                              NR 
471.omnetpp      6250        315       19.9 *    6250        312       20.0 *  
473.astar        7020        323       21.7 *    7020        323       21.7 *  
483.xalancbmk    6900        193       35.8 *    6900        192       35.9 *  
 Est. SPECint_base2006                   --
 Est. SPECint2006                                                        --

                                  Estimated                       Estimated
                Base     Base       Base        Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
-------------- ------  ---------  ---------    ------  ---------  ---------
410.bwaves                                  NR                              NR 
416.gamess                                  NR                              NR 
433.milc                                    NR                              NR 
434.zeusmp                                  NR                              NR 
435.gromacs                                 NR                              NR 
436.cactusADM                               NR                              NR 
437.leslie3d                                NR                              NR 
444.namd         8020        300       26.7 *    8020        300       26.8 *  
447.dealII      11440        253       45.3 *   11440        257       44.5 *  
450.soplex       8340        218       38.3 *    8340        218       38.2 *  
453.povray       5320        120       44.2 *    5320        120       44.2 *  
454.calculix                                NR                              NR 
459.GemsFDTD                                NR                              NR 
465.tonto                                   NR                              NR 
470.lbm                                     NR                              NR 
481.wrf                                     NR                              NR 
482.sphinx3                                 NR                              NR 
 Est. SPECfp_base2006                    --
 Est. SPECfp2006                                                         --


dealII might not be noise; the three runs are:

447.dealII      11440        319       35.8 S   11440        257       44.5 S
447.dealII      11440        252       45.4 S   11440        257       44.5 *
447.dealII      11440        253       45.3 *   11440        256       44.7 S

and I think we can take peak values (stupid CPU freq boosting).

I'm re-running just 447.dealII now, results tomorrow.  Binaries do differ.
Comment 26 Jeffrey A. Law 2016-04-01 04:32:24 UTC
In response to comment #24.  Coalescing's primary sort key is highly dependent on the edge frequencies -- which is generally what we want.

The highest edge frequency represents the highest cost for any particular copy implied by a PHI node.

This obviously falls down if there is more than one lower-frequency copy that can't be coalesced due to a higher-frequency copy that does coalesce, and the sum of the lower-frequency copy costs is greater than the cost of the single higher-frequency copy.

One might be able to model that by looking at the conflict lists, but before going down that path, it'd obviously be best to know if that scenario is the root cause of the slowdown.
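
Concretely (a made-up illustration): if a copy on an edge with frequency 100 is coalesced first and thereby blocks two conflicting copies of frequency 60 each, the greedy order saves a cost of 100 but leaves 120 behind, which is worse than coalescing the two cheaper copies instead.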
Comment 27 Richard Biener 2016-04-01 07:46:57 UTC
So, dealII slowdown is confirmed.

447.dealII      11440        252       45.4 S   11440        255       44.8 S
447.dealII      11440        254       45.1 S   11440        256       44.6 S
447.dealII      11440        253       45.2 *   11440        256       44.7 *

it's probably as artificial as the slowdown in this PR and thus might be
acceptable, given that artificially swapping all exit comparisons isn't
something that should be done.

When profiling the difference it looks like the culprit is (are)

  6.55%  dealII_base.amd  libstdc++.so.6.0.22             [.] std::_Rb_tree_increment(std::_Rb_tree_node_base const*)
  6.53%  dealII_peak.amd  dealII_peak.amd64-m64-gcc42-nn  [.] MappingQ1<3>::compute_fill(TriaIterator<3, DoFCellAccessor<3> > const&, unsigned int, QProjector<3>::D
  6.34%  dealII_peak.amd  libstdc++.so.6.0.22             [.] std::_Rb_tree_increment(std::_Rb_tree_node_base const*)
  5.95%  dealII_base.amd  dealII_base.amd64-m64-gcc42-nn  [.] MappingQ1<3>::compute_fill(TriaIterator<3, DoFCellAccessor<3> > const&, unsigned int, QProjector<3>::D
...
  1.61%  dealII_peak.amd  dealII_peak.amd64-m64-gcc42-nn  [.] SparsityPattern::operator()(unsigned int, unsigned int) const
  1.60%  dealII_peak.amd  dealII_peak.amd64-m64-gcc42-nn  [.] MappingQ<3>::transform_covariant(Tensor<1, 3>*, Tensor<1, 3>*, Tensor<1, 3> const*, Mapping<3>::Intern
  1.60%  dealII_base.amd  dealII_base.amd64-m64-gcc42-nn  [.] SparsityPattern::operator()(unsigned int, unsigned int) const
  1.56%  dealII_base.amd  dealII_base.amd64-m64-gcc42-nn  [.] MappingQ<3>::transform_covariant(Tensor<1, 3>*, Tensor<1, 3>*, Tensor<1, 3> const*, Mapping<3>::Intern

but there isn't any assembly difference for MappingQ1<3>::compute_fill ...
Comment 28 Richard Biener 2016-06-03 10:04:37 UTC
GCC 5.4 is being released, adjusting target milestone.
Comment 29 Jeffrey A. Law 2017-02-15 21:24:33 UTC
So, to bring this BZ back to the core question (the scope seems to have widened over the years since this was originally reported): namely, is the use of LTO or C++ making things slower, particularly for scimark's LU factorization test?

From my experiments, the answer is a very clear yes. I hacked up the test a bit to only run LU and run a fixed number of iterations.  That makes comparisons with something like callgrind much easier. 

Use of C++ adds 2-3% in terms of instruction counts.  LTO adds an additional 2-3% to the instruction counts.  These are additive: C++ with LTO is about 5% higher than C without LTO.

The time (not surprisingly) is lost in LU_factor; the main culprit seems to be this pair of nested loops:

           int ii;
            for (ii=j+1; ii<M; ii++)
            {
                double *Aii = A[ii];
                double *Aj = A[j];
                double AiiJ = Aii[j];       /* Here */
                int jj;
                for (jj=j+1; jj<N; jj++)
                  Aii[jj] -= AiiJ * Aj[jj];

            }

Callgrind calls out the marked line, which in reality probably means the preheader of the inner loop.  For C without LTO it's ~12 million instructions.  For C++ with LTO it's ~21 million instructions (remember, I'm running only LU, for a relatively small number of iterations).

It's a bit of a surprise as these loops are dead simple, but it appears we've got to be doing something dumb somewhere.  Hopefully that narrows things down a bit.
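
For reproducing that kind of measurement, a driver along the following lines (a hypothetical sketch, not the actual harness used above; LU_factor and LU.h come from the scimark2 sources) runs only the factorization for a fixed number of iterations, so callgrind instruction counts are directly comparable across builds:

#include <stdlib.h>
#include "LU.h"   /* from scimark2_1c.zip */

int
main (void)
{
  enum { M = 100, N = 100, ITERS = 1000 };
  double **A = (double **) malloc (M * sizeof (double *));
  int pivot[M];

  for (int i = 0; i < M; i++)
    {
      A[i] = (double *) malloc (N * sizeof (double));
      for (int j = 0; j < N; j++)
        A[i][j] = rand () / (double) RAND_MAX;
    }

  /* Re-factoring the already-factored matrix skews the numerics but
     keeps the executed instruction stream comparable across builds.  */
  for (int k = 0; k < ITERS; k++)
    LU_factor (M, N, A, pivot);

  return 0;
}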
Comment 30 Jeffrey A. Law 2017-02-15 23:21:07 UTC
Looking at the dumps, the tests for whether or not to use the vectorized loop are considerably more complex with LTO, with memory references as well, compared to the non-LTO version.  It's almost as if the program-wide visibility provided by LTO has *hidden* redundancies and mucked up the IV analysis.

The first block of the LTO check looks like:

  # ivtmp.66_807 = PHI <ivtmp.87_773(55), ivtmp.66_812(67)>
  Aii_206 = MEM[base: A_139, index: ivtmp.66_807, step: 8, offset: 0B];
  _208 = Aii_206 + _765;
  AiiJ_209 = *_208;
  _449 = SOR_size_15 > i_167;
  _763 = (unsigned int) i_167;
  _762 = _686 - _763;
  _444 = _762 > 8;
  _443 = _444 & _449;
  _438 = Aii_206 + _760;
  _433 = _434 >= _438;
  _425 = Aii_206 + _435;
  _424 = _425 >= _429;
  _423 = _424 | _433;
  _422 = _423 & _443;
  if (_422 != 0)
    goto <bb 57>; [80.00%]
  else
    goto <bb 65>; [20.00%]

Compared to the non-LTO block:

  _156 = ivtmp.112_242 + 16;
  _155 = prephitmp_168 + _156;
  _151 = Aii_72 + ivtmp.112_242;
  _150 = _151 >= _155;
  _146 = Aii_72 + _156;
  _142 = prephitmp_168 + ivtmp.112_242;
  _141 = _142 >= _146;
  _140 = _141 | _150;
  _139 = _140 & _160;
  if (_139 != 0)
    goto <bb 23>;
  else
    goto <bb 32>;


Then the next block in the check (LTO):

;;   basic block 57, loop depth 3, count 0, freq 66, maybe hot
;;    prev block 56, next block 58, flags: (NEW, REACHABLE, VISITED)
;;    pred:       56 [80.0%]  (TRUE_VALUE,EXECUTABLE)
  niters.2_408 = SOR_size_15 > i_167 ? _762 : 1;
  _366 = (unsigned long) _425;
  _365 = _366 >> 3;
  _364 = -_365;
  _363 = (unsigned int) _364;
  prolog_loop_niters.4_367 = _363 & 1;
  _740 = (unsigned int) ivtmp.86_769;
  _509 = _686 + 4294967294;
  _738 = _509 - _740;
  _295 = SOR_size_15 > i_167 ? _738 : 0;
  _283 = prolog_loop_niters.4_367 == 0 ? 1 : 2;
  if (_283 > _295)
    goto <bb 63>; [10.00%]
  else
    goto <bb 58>; [90.00%]


non-LTO:

;;   basic block 23, loop depth 2, count 0, freq 2, maybe hot
;;    prev block 22, next block 24, flags: (NEW, REACHABLE)
;;    pred:       22 [80.0%]  (TRUE_VALUE,EXECUTABLE)
  _119 = (unsigned long) _151;
  _118 = _119 & 15;
  _117 = _118 >> 3;
  _116 = -_117;
  _115 = (unsigned int) _116;
  _114 = _115 & 1;
  prolog_loop_niters.46_120 = MIN_EXPR <_114, ivtmp.115_244>;
  if (prolog_loop_niters.46_120 == 0)
    goto <bb 25>;
  else
    goto <bb 24>;


Egad.  No wonder LTO loses.  I don't think the loop iterates enough to make up for this mess.
Comment 31 Richard Biener 2017-02-16 09:05:02 UTC
I can't see any difference in the vectorizer dumps; one notable thing is that
we (again) do peeling for alignment, which is hardly useful on x86_64 these
days, at least if you only have a single store stream.

The code after the vectorizer looks 1:1 the same, so do the alias checks.

IVOPTs makes different decisions for the outer loop (LU.c:87); that is because
with LTO we have more context (the function was inlined) and thus "expansion"
brings up bigger expressions (well, that's my guess).
Comment 32 Richard Biener 2017-02-16 09:34:28 UTC
          int ii;
            for (ii=j+1; ii<M; ii++)
            {
                double *Aii = A[ii];
                double *Aj = A[j];
                double AiiJ = Aii[j];       /* Here */
                int jj;
                for (jj=j+1; jj<N; jj++)
                  Aii[jj] -= AiiJ * Aj[jj];

            }

the alias check for Aii[jj] vs. Aj[jj] should be as "simple" as
abs(Aii - Aj) >= VF * sizeof (double).  It certainly looks more complicated
than that:

  _1100 = (unsigned int) SOR_size_19;
  _1096 = (unsigned int) j_910;
  _1087 = _1100 - _1096;
  _1077 = _1087 + 4294967295;
  _1076 = _1077 > 8;
  _1075 = (sizetype) j_910;
  _1074 = _1075 + 3;
  _1073 = _1074 * 8;
  _1072 = _539 + _1073;
  _1071 = (sizetype) j_910;
  _1056 = _1071 + 1;
  _1055 = _1056 * 8;
  _1054 = Aii_557 + _1055;
  _1053 = _1054 >= _1072;
  _1052 = (sizetype) j_910;
  _1034 = _1052 + 3;
  _1033 = _1034 * 8;
  _1032 = Aii_557 + _1033;
  _1031 = (sizetype) j_910;
  _1030 = _1031 + 1;
  _1029 = _1030 * 8;
  _1028 = _539 + _1029;
  _1027 = _1028 >= _1032;
  _1026 = _1027 | _1053;
  _1025 = _1026 & _1076;
  if (_1025 != 0)

Apparently we expanded things too much.

Creating dr for *_566
analyze_innermost: Applying pattern match.pd:84, generic-match.c:11338
success.
Applying pattern match.pd:84, generic-match.c:11338
        base_address: _539 + (sizetype) ((long unsigned int) j_910 * 8)
        offset from base address: 0
        constant offset from base address: 8
        step: 8
        aligned to: 128
        base_object: *_539 + (sizetype) ((long unsigned int) j_910 * 8)
        Access function 0: {8B, +, 8}_25
Creating dr for *_564
analyze_innermost: Applying pattern match.pd:84, generic-match.c:11338
success.
Applying pattern match.pd:84, generic-match.c:11338
        base_address: Aii_557 + (sizetype) ((long unsigned int) j_910 * 8)
        offset from base address: 0
        constant offset from base address: 8
        step: 8
        aligned to: 128
        base_object: *Aii_557 + (sizetype) ((long unsigned int) j_910 * 8)
        Access function 0: {8B, +, 8}_25
...
LU.c:93:32: note: mark for run-time aliasing test between *_566 and *_564

Looks like we fail to split the common exprs away.  But that may be hard(er) because data reference analysis produces

DR_BASE_ADDRESS == Aii_81 + ((sizetype) j_94 + 1) * 8
DR_OFFSET == 0

rather than the expected DR_OFFSET == ((sizetype) j_94 + 1) * 8 because of
the way pointer accesses are handled (only the get_inner_reference part
provides some input to the "offset-IV" analysis).

Later doing ptr1 + X <= ptr2 + X is not folded to ptr1 <= ptr2 because of
overflow concerns.
Comment 33 Richard Biener 2017-02-16 09:53:07 UTC
For example with

Index: tree-vect-loop-manip.c
===================================================================
--- tree-vect-loop-manip.c      (revision 245501)
+++ tree-vect-loop-manip.c      (working copy)
@@ -2187,6 +2187,14 @@ create_intersect_range_checks (loop_vec_
   tree addr_base_b = DR_BASE_ADDRESS (dr_b.dr);
   tree offset_a = DR_OFFSET (dr_a.dr), offset_b = DR_OFFSET (dr_b.dr);
 
+  if (TREE_CODE (addr_base_a) == POINTER_PLUS_EXPR
+      && TREE_CODE (addr_base_b) == POINTER_PLUS_EXPR
+      && operand_equal_p (TREE_OPERAND (addr_base_a, 1),
+                         TREE_OPERAND (addr_base_b, 1), 0))
+    {
+      addr_base_a = TREE_OPERAND (addr_base_a, 0);
+      addr_base_b = TREE_OPERAND (addr_base_b, 0);
+    }
   offset_a = fold_build2 (PLUS_EXPR, TREE_TYPE (offset_a),
                          offset_a, DR_INIT (dr_a.dr));
   offset_b = fold_build2 (PLUS_EXPR, TREE_TYPE (offset_b),

the versioning condition is simplified to

  _173 = (unsigned int) N_67(D);
  _162 = (unsigned int) j_94;
  _161 = _173 - _162;
  _160 = _161 + 4294967295;
  _159 = _160 > 8;
  _158 = prephitmp_172 + 24;
  _157 = Aii_81 + 8;
  _156 = _157 >= _158;
  _155 = Aii_81 + 24;
  _154 = prephitmp_172 + 8;
  _153 = _154 >= _155;
  _152 = _153 | _156;
  _151 = _152 & _159;
  if (_151 != 0)

note that while the cost-model check comes first (the > 8 compare), we do
not separate it out into a GIMPLE_COND, and RTL expansion might f*ck up
the condition ordering.
Comment 34 Richard Biener 2017-02-16 09:59:18 UTC
But as A >= B + 8 || B >= A + 8 is the same as ABS (A - B) >= 8, we might do better rewriting the overlap test in terms of this (of course it all really depends on whether that and the offset stripping handle arrays crossing the "signed overlap" boundary in the address space correctly...)
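
A hand-written sketch of that ABS-based form (A and B stand for the two segment base addresses, 8 for the segment length in bytes; the wrap-around caveat above is ignored):

#include <stddef.h>

/* Byte ranges [A, A+8) and [B, B+8) do not overlap iff
   A >= B + 8 || B >= A + 8, i.e. iff |A - B| >= 8; the latter form
   needs one subtraction and one compare instead of two of each.  */
static int
no_overlap (const char *A, const char *B)
{
  ptrdiff_t d = A - B;  /* sketch: assumes the difference doesn't wrap */
  if (d < 0)
    d = -d;
  return d >= 8;
}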
Comment 35 bin cheng 2017-02-17 16:16:10 UTC
(In reply to Richard Biener from comment #34)
> But as A >= B + 8 || B >= A + 8 is the same as ABS (A - B) >= 8, we might do
> better rewriting the overlap test in terms of this (of course it all really
> depends on whether that and the offset stripping handle arrays crossing the
> "signed overlap" boundary in the address space correctly...)

I had a patch doing this, will revise and send for review.  Thanks.
Comment 36 Jakub Jelinek 2017-10-10 13:25:47 UTC
GCC 5 branch is being closed
Comment 37 Jakub Jelinek 2018-10-26 10:08:49 UTC
GCC 6 branch is being closed
Comment 38 Richard Biener 2019-11-14 07:55:24 UTC
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
Comment 39 Jason Merrill 2020-01-14 03:09:36 UTC
Created attachment 47649 [details]
patch to revert to lowering loops like the C front end

Could someone with a better sense of loop code generation tell me if this improves things?
Comment 40 Jakub Jelinek 2020-03-04 09:36:58 UTC
GCC 8.4.0 has been released, adjusting target milestone.
Comment 41 Jakub Jelinek 2021-05-14 09:47:58 UTC
GCC 8 branch is being closed.
Comment 42 Richard Biener 2021-06-01 08:07:26 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
Comment 43 Richard Biener 2022-05-27 09:36:03 UTC
GCC 9 branch is being closed
Comment 44 Jakub Jelinek 2022-06-28 10:31:57 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Comment 45 Richard Biener 2023-07-07 10:31:03 UTC
GCC 10 branch is being closed.
Comment 46 Richard Biener 2024-07-19 12:58:35 UTC
GCC 11 branch is being closed.
Comment 47 Richard Biener 2025-07-11 08:54:14 UTC
GCC 12 branch is being closed.