Bug 21485 - [11/12/13/14 Regression] missed load PRE, PRE makes i?86/9/10/11/12 suck
Summary: [11/12/13/14 Regression] missed load PRE, PRE makes i?86/9/10/11/12 suck
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.0.0
Importance: P2 normal
Target Milestone: 11.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Duplicates: 37732
Depends on: 30521 23286 32120
Blocks: 36861 37732
 
Reported: 2005-05-10 09:03 UTC by Jason Bucata
Modified: 2023-12-28 07:18 UTC (History)
19 users

See Also:
Host:
Target: i686-*-*
Build:
Known to work: 3.4.3
Known to fail: 4.0.0, 4.0.4, 4.1.0, 4.2.4, 4.3.2, 4.4.0, 4.6.0
Last reconfirmed: 2018-02-01 00:00:00


Attachments
Test case (preprocessed) (13.49 KB, application/octet-stream)
2005-05-10 09:05 UTC, Jason Bucata

Description Jason Bucata 2005-05-10 09:03:38 UTC
I've found a major performance regression in gcc 4.0.0's optimization of the
BYTEmark numsort benchmark.  I've boiled it down to a testcase that I think will
suit you... it outputs a single number representing the number of iterations run
(higher is better).  On my machine I get 900ish under 4.0.0 and around 1530 on
3.4.3.

Both were compiled and run in a Gentoo test partition, if that makes a difference:
3.4.3: gcc version 3.4.3-20050110 (Gentoo Linux 3.4.3.20050110-r2,
ssp-3.4.3.20050110-0, pie-8.7.7)
4.0.0: gcc version 4.0.0 (Gentoo Linux 4.0.0)
Comment 1 Jason Bucata 2005-05-10 09:05:21 UTC
Created attachment 8851 [details]
Test case (preprocessed)
Comment 2 Jason Bucata 2005-05-10 09:10:04 UTC
Oops, I should add that my pertinent options were: -O3 -fomit-frame-pointer
-march=athlon-xp -static
Comment 3 Steven Bosscher 2005-05-10 09:14:51 UTC
Confirmed on x86 (with and without frame pointer) and on amd64. 
Comment 4 Richard Biener 2005-05-10 09:27:10 UTC
mainline drops even lower - looks like a poor choice of addressing modes and thus
more register pressure for 4.0 and 4.1.  Note that using profile feedback
improves the numbers a lot (but we still regress).
Comment 5 Giovanni Bajo 2005-05-10 09:45:24 UTC
Jason: thanks for this! Even better would be to let the testcase do a fixed 
number of iterations (like 1000 or so), so that we can use "time" 
externally to measure performance. Maybe you can do this for other testcases 
you are going to submit, thanks!
Comment 6 Steven Bosscher 2005-05-10 09:46:29 UTC
This is the function (reindented) where we spend almost all of our time: 
 
void 
NumSift (long *array, unsigned long i, unsigned long j) 
{ 
  unsigned long k; 
  long temp; 
  while ((i + i) <= j) 
    { 
      k = i + i; 
      if (k < j) 
        if (array[k] < array[k + 1L]) 
          ++k; 
      if (array[i] < array[k]) 
        { 
          temp = array[k]; 
          array[k] = array[i]; 
          array[i] = temp; 
          i = k; 
        } 
      else 
        i = j + 1; 
    } 
  return; 
} 
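For context (my addition, not part of the report): NumSift is the sift-down step of the benchmark's heapsort, with the array treated as a 1-based max-heap so that the children of node i are 2*i and 2*i+1. A minimal sketch of a driver that exercises it, with names of my own choosing (BYTEmark's actual harness differs):

```c
/* Same NumSift as above: sift element i down a 1-based max-heap of size j. */
void
NumSift (long *array, unsigned long i, unsigned long j)
{
  unsigned long k;
  long temp;
  while ((i + i) <= j)
    {
      k = i + i;
      if (k < j)
        if (array[k] < array[k + 1L])
          ++k;
      if (array[i] < array[k])
        {
          temp = array[k];
          array[k] = array[i];
          array[i] = temp;
          i = k;
        }
      else
        i = j + 1;
    }
}

/* Hypothetical driver: heapsort of array[1..n], ascending. */
void
heap_sort_1based (long *array, unsigned long n)
{
  unsigned long i;
  long temp;
  for (i = n / 2; i >= 1; --i)   /* build the max-heap bottom-up */
    NumSift (array, i, n);
  for (i = n; i > 1; --i)        /* repeatedly move the max to the end */
    {
      temp = array[1];
      array[1] = array[i];
      array[i] = temp;
      NumSift (array, 1, i - 1);
    }
}

/* Returns 1 if the driver sorts a small sample correctly. */
int
heap_sort_smoke_test (void)
{
  long a[] = { 0 /* index 0 unused: 1-based */, 4, 1, 3, 2, 9, 5 };
  unsigned long i, n = 6;
  heap_sort_1based (a, n);
  for (i = 1; i < n; ++i)
    if (a[i] > a[i + 1])
      return 0;
  return 1;
}
```

The benchmark spends essentially all its time inside NumSift's loop, which is why the score is so sensitive to how that one function is compiled.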
 
Comment 7 Steven Bosscher 2005-05-10 09:50:33 UTC
If Richard is right in comment #4, it would be interesting to see what 
happens if one tries this with Zdenek's TARGET_MEM_REF patch. 
Comment 8 Steven Bosscher 2005-05-10 10:22:33 UTC
On AMD64 with GCC 4.0.1 (CVS 4.0 branch) I go from ~580 at -O3 
to ~930 at -O3 -fno-tree-pre. 
Comment 9 Paolo Bonzini 2005-05-10 12:58:04 UTC
Looks like a register pressure problem... but yes, TARGET_MEM_REF may help as well.

Paolo
Comment 10 Andrew Pinski 2005-05-10 20:03:48 UTC
IV-OPTS does nothing to this testcase; it does not even change the trees.  This is just an RA (register allocation) issue.
Comment 11 Mark Mitchell 2005-10-31 03:21:41 UTC
We need more analysis on these kinds of issues.

So, we're doing a worse job on register allocation.  Is that because the register allocator got worse, or because we're giving it a harder problem to solve?  If the latter, what's responsible, and is there anything we can do about it?  We need that kind of information to make an intelligent decision about whether or not we should try to fix this, or just chalk it up to the fact that there's always variability between releases.
Comment 12 Andrew Pinski 2005-10-31 03:37:32 UTC
(In reply to comment #11)
> So, we're doing a worse job on register allocation.  Is that because the
> register allocator got worse, or because we're giving it a harder problem to
> solve?  If the latter, what's responsible, and is there anything we can do
> about it?
It is the latter, and the pass responsible is tree PRE.  But if someone writes code by hand like what tree PRE produces, we will still have the same issue, so the only correct place to fix this would be in the RA.
Comment 13 Steven Bosscher 2006-01-09 22:23:17 UTC
AMD64 timings for today's GCC 4.1 (20060109) branch:

flags used                               score
for compilation                          (avg. of 3, higher is better)
-O3                                      685.2
-O3 -fprofile-generate                   636
-O3 -fprofile-use                        785.6
-O3 -fno-tree-pre                        958.4
-O3 -fno-tree-pre -fprofile-generate     673.8
-O3 -fno-tree-pre -fprofile-use          964.2
-O3 -fno-tree-pre -fmove-loop-invariants 967.2


The problem really is that we present the register allocator with a harder problem.  There is really not much that can be done about this other than crippling PRE for SMALL_REGISTER_CLASSES machines, but I don't know if we should want to do that...
Comment 14 Daniel Berlin 2006-02-16 19:46:03 UTC
So uh, has anyone figured out why we don't get 1500 with -fno-tree-pre?

Does -fno-gcse help too?
Comment 15 Mark Mitchell 2006-02-24 00:25:53 UTC
This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.
Comment 16 Andrew Pinski 2006-03-10 04:41:30 UTC
Hmm,  I see a missed optimization here (I want to say a PRE one too).
  D.1521_32 = k_4 * 4;
  D.1525_37 = (long int *) D.1521_32;
  D.1526_38 = D.1525_37 + array_9;
  D.1527_39 = D.1526_38 + 4B;
  #   VUSE <SMT.4_42>;
  D.1528_40 = *D.1527_39;
  if (prephitmp.27_35 < D.1528_40) goto <L2>; else goto <L3>;

<L2>:;
  k_41 = k_4 + 1;
  pretmp.22_23 = k_41 * 4;
  pretmp.24_6 = (long int *) pretmp.22_23;
  prephitmp.25_48 = pretmp.24_6 + array_9;
  #   VUSE <SMT.4_42>;
  prephitmp.27_51 = *prephitmp.25_48;

prephitmp.27_51 here is also the same as D.1528_40.  Though I don't know how to figure this out in PRE or any other optimization.  I almost want to say this is the real cause.  Looking at 3.4's asm, you don't have the extra load but for 4.0 and above you do.
Comment 17 Andrew Pinski 2006-03-10 13:56:03 UTC
What happens to the time if you replace that function with:
void
NumSift (long *array, unsigned long i, unsigned long j)
{
  unsigned long k;
  while ((i + i) <= j)
    {
      k = i + i;
      long t, t1;
      t = array[k];
      if (k < j)
        {
          t1 = array[k+1];
          if (t < t1)
            ++k, t = t1;
        }
      t1 = array[i];
      if (t1 < t)
        {
          array[k] = t1;
          array[i] = t;
          i = k;
        }
      else
        i = j + 1;
    }
  return;
}
-----------
The semantics should be the same, I just pulled out the PREs as much as I could.
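A quick differential check — my addition, not anything from the report — that the rewritten function matches the original on a batch of small pseudo-random inputs (the helper names are hypothetical):

```c
/* Original NumSift from comment #6. */
void
num_sift_orig (long *array, unsigned long i, unsigned long j)
{
  unsigned long k;
  long temp;
  while ((i + i) <= j)
    {
      k = i + i;
      if (k < j)
        if (array[k] < array[k + 1L])
          ++k;
      if (array[i] < array[k])
        {
          temp = array[k];
          array[k] = array[i];
          array[i] = temp;
          i = k;
        }
      else
        i = j + 1;
    }
}

/* Rewritten NumSift from comment #17, with the loads pulled out by hand. */
void
num_sift_new (long *array, unsigned long i, unsigned long j)
{
  unsigned long k;
  while ((i + i) <= j)
    {
      k = i + i;
      long t, t1;
      t = array[k];
      if (k < j)
        {
          t1 = array[k + 1];
          if (t < t1)
            ++k, t = t1;
        }
      t1 = array[i];
      if (t1 < t)
        {
          array[k] = t1;
          array[i] = t;
          i = k;
        }
      else
        i = j + 1;
    }
}

/* Run both versions on identical pseudo-random arrays and compare. */
int
num_sift_versions_agree (void)
{
  enum { N = 8 };
  unsigned long seed = 12345, i, j, n;
  long a[N], b[N];
  for (i = 0; i < 4; ++i)
    for (j = i; j < N; ++j)     /* j < N keeps every access in bounds */
      {
        for (n = 0; n < N; ++n)
          {
            seed = seed * 1103515245UL + 12345UL;  /* simple LCG */
            a[n] = b[n] = (long) (seed % 100);
          }
        num_sift_orig (a, i, j);
        num_sift_new (b, i, j);
        for (n = 0; n < N; ++n)
          if (a[n] != b[n])
            return 0;
      }
  return 1;
}
```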
Comment 18 Andrew Pinski 2006-03-10 14:10:11 UTC
(In reply to comment #17)
> What happens to the time if you replace that function with:
This helps about 5% but it does not get the score back up.

3.4.0's score for this machine is about 900.
4.1.0's score is 720.
4.1.0 + source modification is about 780.

So it helps, but there is still something else which needs to be improved.

After looking into the source, I don't think this is an RA issue any more.
Comment 19 Andrew Pinski 2006-03-10 14:13:29 UTC
(In reply to comment #18)
> (In reply to comment #17)
> > What happens to the time if you replace that function with:
> This helps about 5% but it does not get the score back up.

Also the asm for this function is about the same, at least on i686, between 3.4.0 and the source-modified version with 4.1.0.
Comment 20 Mark Mitchell 2006-05-25 02:32:56 UTC
Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.
Comment 21 Andrew Pinski 2007-05-28 07:06:54 UTC
The missed load PRE for the testcase below is fixed on the pointer_plus branch, as the addition is done in "unsigned int" for both the "++k" and the array[k+1].  If we change "++k" into "k+=2" and array[k+1] into array[k+2], we run into another missed PRE issue, which can be shown by the following testcase that we don't currently optimize:
int f(int a, int b)
{
  int c = a+2;
  int d = c*2;
  int e = a*2;
  int f = e+2;
  return d == f;
}

I will file that separately.
Comment 22 Andrew Pinski 2007-06-18 06:12:54 UTC
This is basically fixed by pointer_plus, except we still have some combinable code (though this is not PRE's fault); see http://gcc.gnu.org/ml/gcc-patches/2007-05/msg01996.html for how to fix that issue.
Comment 23 Andrew Pinski 2007-07-04 21:01:06 UTC
I am going to declare this fixed for 4.3: pointer_plus handles the original case correctly, and the secondary issues are all filed separately.
Comment 24 wbrana 2008-06-15 13:02:50 UTC
It seems not to be fixed in 4.3.1:

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          1545.8  :      39.64  :      13.02
STRING SORT         :          380.16  :     169.87  :      26.29
BITFIELD            :       6.014e+08  :     103.16  :      21.55
FP EMULATION        :          332.88  :     159.73  :      36.86
FOURIER             :           32962  :      37.49  :      21.06
ASSIGNMENT          :          57.851  :     220.13  :      57.10
IDEA                :            9372  :     143.34  :      42.56
HUFFMAN             :          3234.8  :      89.70  :      28.64
NEURAL NET          :          72.571  :     116.58  :      49.04
LU DECOMPOSITION    :          2656.5  :     137.62  :      99.37
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 117.762
FLOATING-POINT INDEX: 84.407
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual GenuineIntel Intel(R) Core(TM)2 Duo CPU     E6750  @ 2.66GHz 3200MHz
L2 Cache            : 4096 KB
OS                  : Linux 2.6.25.6
C compiler          : 4.3.1
libc                :
MEMORY INDEX        : 31.863
INTEGER INDEX       : 27.656
FLOATING-POINT INDEX: 46.816
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          2463.4  :      63.18  :      20.75
STRING SORT         :           374.8  :     167.47  :      25.92
BITFIELD            :      5.9818e+08  :     102.61  :      21.43
FP EMULATION        :          268.64  :     128.91  :      29.75
FOURIER             :           43176  :      49.10  :      27.58
ASSIGNMENT          :           58.68  :     223.29  :      57.92
IDEA                :          5413.5  :      82.80  :      24.58
HUFFMAN             :          3198.7  :      88.70  :      28.32
NEURAL NET          :           58.29  :      93.64  :      39.39
LU DECOMPOSITION    :          2380.6  :     123.33  :      89.05
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 112.599
FLOATING-POINT INDEX: 82.767
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual GenuineIntel Intel(R) Core(TM)2 Duo CPU     E6750  @ 2.66GHz 3200MHz
L2 Cache            : 4096 KB
OS                  : Linux 2.6.25.6
C compiler          : gcc version 3.4.6 (Gentoo 3.4.6-r2 p1.5, ssp-3.4.6-1.0, pie-8.7.10)
libc                :
MEMORY INDEX        : 31.806
INTEGER INDEX       : 25.604
FLOATING-POINT INDEX: 45.906
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.
Comment 25 wbrana 2008-06-29 10:55:24 UTC
I think "4.3" is missing in "Summary" and "4.3.1" in "Known to fail".
Comment 26 Joseph S. Myers 2008-07-04 16:52:42 UTC
Closing 4.1 branch.
Comment 27 wbrana 2008-09-28 18:00:07 UTC
gcc 4.3.2, -march=core2 instead of -march=nocona

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          1516.3  :      38.89  :      12.77
STRING SORT         :          381.24  :     170.35  :      26.37
BITFIELD            :      6.0164e+08  :     103.20  :      21.56
FP EMULATION        :             332  :     159.31  :      36.76
FOURIER             :           32687  :      37.17  :      20.88
ASSIGNMENT          :           57.93  :     220.44  :      57.18
IDEA                :            9380  :     143.46  :      42.60
HUFFMAN             :          3276.1  :      90.85  :      29.01
NEURAL NET          :          72.251  :     116.07  :      48.82
LU DECOMPOSITION    :          2641.5  :     136.84  :      98.81
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 117.698
FLOATING-POINT INDEX: 83.889
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual GenuineIntel Intel(R) Core(TM)2 Duo CPU     E6750  @ 2.66GHz 3200MHz
L2 Cache            : 4096 KB
OS                  : Linux 2.6.26.5
C compiler          : gcc-4.3.2
libc                :
MEMORY INDEX        : 31.912
INTEGER INDEX       : 27.598
FLOATING-POINT INDEX: 46.528
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.
Comment 28 Richard Biener 2008-10-03 23:18:32 UTC
The nbench 2.2.3 numeric sort test executes 40% fewer iterations per second
when compiled with a 4.4 snapshot than with 3.4.6.

iterations/s - version
2439 - 3.4.6
1530 - 4.4.0 20080926 (experimental)
1526 - 4.3.2 
1580 - 4.2.4

CFLAGS = -s -static -Wall -O3 -g0 -march=nocona -fomit-frame-pointer
-funroll-loops -ffast-math

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          2439.1  :      62.55  :      20.54
STRING SORT         :          373.28  :     166.79  :      25.82
BITFIELD            :      5.9879e+08  :     102.71  :      21.45
FP EMULATION        :          267.84  :     128.52  :      29.66
FOURIER             :           43702  :      49.70  :      27.92
ASSIGNMENT          :          56.657  :     215.59  :      55.92
IDEA                :          5407.7  :      82.71  :      24.56
HUFFMAN             :          3204.3  :      88.86  :      28.37
NEURAL NET          :          57.485  :      92.35  :      38.84
LU DECOMPOSITION    :          2363.5  :     122.44  :      88.41
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 111.792
FLOATING-POINT INDEX: 82.519
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual GenuineIntel Intel(R) Core(TM)2 Duo CPU     E6750  @
2.66GHz 3200MHz
L2 Cache            : 4096 KB
OS                  : Linux 2.6.26.5
C compiler          : gcc version 3.4.6 (Gentoo 3.4.6-r2 p1.5, ssp-3.4.6-1.0,
pie-8.7.10)
libc                : libc-2.8.so
MEMORY INDEX        : 31.404
INTEGER INDEX       : 25.525
FLOATING-POINT INDEX: 45.768
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.


BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          1530.9  :      39.26  :      12.89
STRING SORT         :          372.64  :     166.51  :      25.77
BITFIELD            :      6.0348e+08  :     103.52  :      21.62
FP EMULATION        :           310.4  :     148.94  :      34.37
FOURIER             :           31760  :      36.12  :      20.29
ASSIGNMENT          :          48.361  :     184.02  :      47.73
IDEA                :            9204  :     140.77  :      41.80
HUFFMAN             :          3554.3  :      98.56  :      31.47
NEURAL NET          :          73.882  :     118.69  :      49.92
LU DECOMPOSITION    :          2322.2  :     120.30  :      86.87
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 114.457
FLOATING-POINT INDEX: 80.190
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual GenuineIntel Intel(R) Core(TM)2 Duo CPU     E6750  @
2.66GHz 3200MHz
L2 Cache            : 4096 KB
OS                  : Linux 2.6.26.5
C compiler          : gcc version 4.4.0 20080926 (experimental) (GCC)
libc                : libc-2.8.so
MEMORY INDEX        : 29.851
INTEGER INDEX       : 27.632
FLOATING-POINT INDEX: 44.477
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.


BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          1526.2  :      39.14  :      12.85
STRING SORT         :          374.28  :     167.24  :      25.89
BITFIELD            :      6.0262e+08  :     103.37  :      21.59
FP EMULATION        :          320.64  :     153.86  :      35.50
FOURIER             :           33352  :      37.93  :      21.30
ASSIGNMENT          :          57.834  :     220.07  :      57.08
IDEA                :            9288  :     142.06  :      42.18
HUFFMAN             :            3211  :      89.04  :      28.43
NEURAL NET          :           74.69  :     119.98  :      50.47
LU DECOMPOSITION    :          2655.7  :     137.58  :      99.35
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 116.415
FLOATING-POINT INDEX: 85.548
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual GenuineIntel Intel(R) Core(TM)2 Duo CPU     E6750  @
2.66GHz 3200MHz
L2 Cache            : 4096 KB
OS                  : Linux 2.6.26.5
C compiler          : gcc-4.3.2
libc                : libc-2.8.so
MEMORY INDEX        : 31.716
INTEGER INDEX       : 27.199
FLOATING-POINT INDEX: 47.448
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.



BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          1580.2  :      40.52  :      13.31
STRING SORT         :          359.88  :     160.80  :      24.89
BITFIELD            :      5.9462e+08  :     102.00  :      21.30
FP EMULATION        :          291.28  :     139.77  :      32.25
FOURIER             :           32567  :      37.04  :      20.80
ASSIGNMENT          :          57.097  :     217.26  :      56.35
IDEA                :          7876.8  :     120.47  :      35.77
HUFFMAN             :          3433.1  :      95.20  :      30.40
NEURAL NET          :          72.091  :     115.81  :      48.71
LU DECOMPOSITION    :          2664.4  :     138.03  :      99.67
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 112.739
FLOATING-POINT INDEX: 83.966
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual GenuineIntel Intel(R) Core(TM)2 Duo CPU     E6750  @
2.66GHz 3200MHz
L2 Cache            : 4096 KB
OS                  : Linux 2.6.26.5
C compiler          : gcc-4.2.4
libc                : libc-2.8.so
MEMORY INDEX        : 31.032
INTEGER INDEX       : 26.138
FLOATING-POINT INDEX: 46.571
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.
Comment 29 Richard Biener 2008-10-03 23:18:41 UTC
*** Bug 37732 has been marked as a duplicate of this bug. ***
Comment 30 Richard Biener 2008-10-03 23:54:39 UTC
Comment #6 still applies.  On the trunk we do not fully exploit the partially
redundant load of array[k] in

      if (k < j) 
        if (array[k] < array[k + 1L]) 
          ++k; 
      if (array[i] < array[k]) 

but we transform the above to

      if (k < j)
        {
          tmp1 = array[k];
          tmp2 = array[k+1];
          if (tmp1 < tmp2)
            ++k;
          else
            tmp1 = array[k+1];
        }
      if (array[i] < tmp1)

missing the full redundancy of array[k+1].

Reduced testcase:

long
NumSift (long *array, unsigned long k)
{
  if (array[k] < array[k + 1L])
    ++k;
  return array[k];
}

With an integer k it gets somewhat more complicated still.

The question is whether this explains the slowdown compared to GCC 3.4.
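For illustration (my sketch, not from the report): fully exploiting the redundancy in the reduced testcase would leave exactly two loads and a select of the already-loaded values, rather than a reload of array[k] after the increment.

```c
/* Reduced testcase from comment #30, as written: three loads in the
   unoptimized reading, since array[k] is reloaded after ++k.  */
long
num_sift_reduced (long *array, unsigned long k)
{
  if (array[k] < array[k + 1L])
    ++k;
  return array[k];
}

/* What full load-PRE should effectively produce: two loads, with the
   result selected from the values already in registers.  */
long
num_sift_reduced_pre (long *array, unsigned long k)
{
  long a = array[k];
  long b = array[k + 1L];
  return a < b ? b : a;
}

/* Both must agree; returns 1 on a small sample. */
int
reduced_testcase_agrees (void)
{
  long a[4] = { 3, 1, 4, 1 };
  unsigned long k;
  for (k = 0; k < 3; ++k)
    if (num_sift_reduced (a, k) != num_sift_reduced_pre (a, k))
      return 0;
  return 1;
}
```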
Comment 31 Andrew Pinski 2008-10-04 00:13:26 UTC
The reduced testcase in comment #30 is optimized by DOM3, though not by PRE.
Running PRE again right after the first PRE still finds more PREable expressions for this testcase ...
Comment 32 Richard Biener 2008-10-04 00:57:46 UTC
As we PHI-translate k_1 * 4 we are not able to find D.1237_7 * 4 in the
SCCVN tables.  So we allocate a new value-id for it.  Oops.  This is because
we once say its type is unsigned int and once unsigned long, and
we require types to be pointer-equal in the hash comparison function.

I have a patch to fix this, which makes this run even more slowly.  -fno-tree-pre
fixes the speed regression.
Comment 33 wbrana 2008-10-04 09:22:25 UTC
results with -fno-tree-pre

1749 - 4.4.0 20080926 (experimental)
1701 - 4.3.2 
2476 - 4.2.4
Comment 34 Richard Biener 2008-10-04 15:11:56 UTC
Fastest result on an Intel Core Duo with

gcc-4.1 -O3 -fomit-frame-pointer -fno-tree-pre -fno-inline -fschedule-insns: 1273

the interesting thing is that with the above we if-convert

        if (array[k] < array[k + 1L]) 
          ++k; 

using setl, which reduces the burden on the branch predictor, which in the worst
case (trunk) sees quite a number of mispredicts.  The following is branches
retired vs. mispredicted branches retired for trunk (with PRE enabled):

 * CPU: Core Solo / Duo, speed 1833 MHz (estimated)
 * Counted BR_INST_RETIRED events (number of branch instructions retired) with a
 unit mask of 0x00 (No unit mask) count 10000
 * Counted BR_MISS_PRED_RETIRED events (number of mispredicted branches retired)
 with a unit mask of 0x00 (No unit mask) count 10000

080486d0 <NumSift>: /* NumSift total: 188708 95.2681 21424 99.9953 */
   752  0.3796     0       0   : 80486d0:       push   %ebp
                               : 80486d1:       push   %edi
                               : 80486d2:       push   %esi
   824  0.4160     0       0   : 80486d3:       push   %ebx
     5  0.0025     0       0   : 80486d4:       sub    $0xc,%esp
                               : 80486d7:       mov    %ecx,(%esp)
  1541  0.7780     0       0   : 80486da:       add    $0x1,%ecx
                               : 80486dd:       mov    %ecx,0x8(%esp)
                               : 80486e1:       lea    0x0(%esi),%esi
   709  0.3579     2  0.0093   : 80486e8:       lea    (%edx,%edx,1),%ecx
  1706  0.8613     1  0.0047   : 80486eb:       cmp    (%esp),%ecx
  3083  1.5564   924  4.3127   : 80486ee:       mov    %ecx,%edi
    92  0.0464     0       0   : 80486f0:       lea    (%eax,%edx,8),%ebp
                               : 80486f3:       mov    %ebp,%ebx
   868  0.4382    13  0.0607   : 80486f5:       ja     804871d <NumSift+0x4d>
  5732  2.8938     0       0   : 80486f7:       jb     8048728 <NumSift+0x58>
     2  0.0010     0       0   : 80486f9:       mov    (%ebx),%esi
  7789  3.9322   162  0.7561   : 80486fb:       lea    (%eax,%edx,4),%ecx
 34575 17.4550  6534 30.4971   : 80486fe:       mov    0x8(%esp),%edx
  3070  1.5499  2103  9.8156   : 8048702:       mov    (%ecx),%ebp
  8244  4.1619   134  0.6254   : 8048704:       cmp    %esi,%ebp
  2322  1.1722   155  0.7235   : 8048706:       jge    80486e8 <NumSift+0x18>
  1363  0.6881   236  1.1015   : 8048708:       mov    %edi,%edx
  3578  1.8063     0       0   : 804870a:       mov    %ebp,(%ebx)
   450  0.2272   367  1.7130   : 804870c:       lea    (%eax,%edx,8),%ebp
  3797  1.9169     0       0   : 804870f:       mov    %esi,(%ecx)
  5035  2.5419    22  0.1027   : 8048711:       lea    (%edx,%edx,1),%ecx
                               : 8048714:       mov    %ebp,%ebx
   389  0.1964     0       0   : 8048716:       cmp    (%esp),%ecx
  5885  2.9710    15  0.0700   : 8048719:       mov    %ecx,%edi
     7  0.0035     0       0   : 804871b:       jbe    80486f7 <NumSift+0x27>
   416  0.2100    24  0.1120   : 804871d:       add    $0xc,%esp
  5419  2.7357  1431  6.6791   : 8048720:       pop    %ebx
   568  0.2868   275  1.2835   : 8048721:       pop    %esi
   710  0.3584    24  0.1120   : 8048722:       pop    %edi
   334  0.1686    12  0.0560   : 8048723:       pop    %ebp
   146  0.0737    91  0.4247   : 8048724:       ret
                               : 8048725:       lea    0x0(%esi),%esi
  8706  4.3952     0       0   : 8048728:       mov    0x0(%ebp),%ebx
  1536  0.7754   379  1.7690   : 804872b:       lea    0x1(%ecx),%edi
                               : 804872e:       mov    %ebx,0x4(%esp)
 14484  7.3122     9  0.0420   : 8048732:       lea    (%eax,%edi,4),%ebx
                               : 8048735:       mov    (%ebx),%esi
  2165  1.0930     6  0.0280   : 8048737:       cmp    %esi,0x4(%esp)
 19814 10.0030     1  0.0047   : 804873b:       jl     80486fb <NumSift+0x2b>
  2585  1.3050     0       0   : 804873d:       mov    0x4(%esp),%esi
 37728 19.0468  8504 39.6919   : 8048741:       mov    %ebp,%ebx
  1511  0.7628     0       0   : 8048743:       mov    %ecx,%edi
   768  0.3877     0       0   : 8048745:       jmp    80486fb <NumSift+0x2b>
                               : 8048747:       mov    %esi,%esi
                               : 8048749:       lea    0x0(%edi),%edi

while the following is what we get for the gcc 4.1 code w/o PRE:

08048670 <NumSift>: /* NumSift total: 200781 92.9938  4738 99.9156 */
  1196  0.5539     0       0   : 8048670:       push   %ebp
     1 4.6e-04     0       0   : 8048671:       push   %edi
     2 9.3e-04     0       0   : 8048672:       mov    %eax,%edi
  2084  0.9652     0       0   : 8048674:       push   %esi
     9  0.0042     0       0   : 8048675:       push   %ebx
     1 4.6e-04     0       0   : 8048676:       mov    %edx,%ebx
  1162  0.5382     0       0   : 8048678:       sub    $0x4,%esp
     6  0.0028     0       0   : 804867b:       mov    %ecx,(%esp)
  3128  1.4488     0       0   : 804867e:       xchg   %ax,%ax
  1078  0.4993     2  0.0422   : 8048680:       lea    (%ebx,%ebx,1),%edx
   577  0.2672     0       0   : 8048683:       cmp    (%esp),%edx
   202  0.0936     6  0.1265   : 8048686:       lea    (%edi,%ebx,4),%ebp
   152  0.0704     1  0.0211   : 8048689:       ja     80486ba <NumSift+0x4a>
 44618 20.6653     0       0   : 804868b:       jae    804869c <NumSift+0x2c>
  2125  0.9842    62  1.3075   : 804868d:       mov    (%edi,%ebx,8),%eax
  2932  1.3580   322  6.7904   : 8048690:       cmp    0x4(%edi,%ebx,8),%eax
 23392 10.8342   151  3.1843   : 8048694:       setl   %al
  8331  3.8586     5  0.1054   : 8048697:       movzbl %al,%eax
 11420  5.2893     0       0   : 804869a:       add    %eax,%edx
 15985  7.4036     1  0.0211   : 804869c:       lea    (%edi,%edx,4),%esi
  5171  2.3950     6  0.1265   : 804869f:       mov    0x0(%ebp),%ecx
   109  0.0505     0       0   : 80486a2:       mov    %edx,%ebx
  1129  0.5229     0       0   : 80486a4:       mov    (%esi),%eax
 16905  7.8297     0       0   : 80486a6:       cmp    %eax,%ecx
  2442  1.1310     0       0   : 80486a8:       jge    80486c2 <NumSift+0x52>
 18994  8.7973     1  0.0211   : 80486aa:       lea    (%ebx,%ebx,1),%edx
  1134  0.5252   507 10.6917   : 80486ad:       cmp    (%esp),%edx
   136  0.0630     4  0.0844   : 80486b0:       mov    %ecx,(%esi)
 19141  8.8654     0       0   : 80486b2:       mov    %eax,0x0(%ebp)
     1 4.6e-04     0       0   : 80486b5:       lea    (%edi,%ebx,4),%ebp
    36  0.0167     0       0   : 80486b8:       jbe    804868b <NumSift+0x1b>
  3202  1.4830     0       0   : 80486ba:       add    $0x4,%esp
  4369  2.0235   618 13.0325   : 80486bd:       pop    %ebx
  1842  0.8531   680 14.3399   : 80486be:       pop    %esi
  2309  1.0694   878 18.5154   : 80486bf:       pop    %edi
    54  0.0250     1  0.0211   : 80486c0:       pop    %ebp
   500  0.2316     5  0.1054   : 80486c1:       ret    
   498  0.2307     0       0   : 80486c2:       mov    (%esp),%ebx
  4407  2.0411  1487 31.3581   : 80486c5:       add    $0x1,%ebx
     1 4.6e-04     1  0.0211   : 80486c8:       jmp    8048680 <NumSift+0x10>
                               : 80486ca:       lea    0x0(%esi),%esi
Comment 35 Richard Biener 2008-10-04 15:58:50 UTC
Another missed optimization on the tree level is hoisting of the load of
array[k*4] before the k < j condition, which is possible after the PRE
insertion:

<bb 3>:
  if (k_4 < j_5(D))
    goto <bb 4>;
  else
    goto <bb 12>;

<bb 12>:
  pretmp.12_31 = k_4 * 4;
  pretmp.13_30 = array_8(D) + pretmp.12_31;
  pretmp.14_25 = *pretmp.13_30;
  goto <bb 6>;

<bb 4>:
  D.1242_7 = k_4 * 4;
  D.1243_9 = array_8(D) + D.1242_7;
  D.1244_10 = *D.1243_9;

Maybe PRE could somehow even insert on BB3 entry instead of on BB3 exit edges in this case.  This would re-enable if-conversion of the second branch.
Comment 36 Richard Biener 2008-10-04 16:16:07 UTC
Testcase for that:

/* { dg-do compile } */
/* { dg-options "-O2 -fdump-tree-pre" } */

long
NumSift (long *array, int b, unsigned long k)
{
  if (b)
    if (array[k] < array[k + 1L])
      ++k;
  return array[k];
}

/* There should be two loads left.  */

/* { dg-final { scan-tree-dump-times "= \\\*" 2 "pre" } } */
/* { dg-final { cleanup-tree-dump "pre" } } */
Comment 37 Steven Bosscher 2008-11-22 09:13:07 UTC
Re: comment #35 and comment #36
That is code hoisting, again. See Bug 23286 and some of the bugs closed as duplicates of Bug 23286. Looks like it's time to implement tree-level hoisting :-)
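For reference, the hoisting opportunity in its simplest form (illustrative function name `pick`, not from the testcase):

```c
#include <assert.h>

/* a[k] is loaded on both arms of the branch; tree-level code hoisting
   would compute it once, before the if.  */
long pick(long *a, unsigned long k, int c)
{
    long t;
    if (c)
        t = a[k] + 1;   /* load of a[k] here ... */
    else
        t = a[k] - 1;   /* ... and here: fully redundant, hoistable */
    return t;
}
```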
Comment 38 Paolo Bonzini 2009-02-16 09:01:25 UTC
Outlandish statement: maybe no-condexec if conversion should be moved to the tree-level?!?  Doing this kind of hoisting at the same time as if conversion is simpler on GIMPLE...
Comment 39 Joseph S. Myers 2009-03-31 18:46:22 UTC
Closing 4.2 branch.
Comment 40 Rob 2009-04-21 12:30:59 UTC
(In reply to comment #0)
> I've found a major performance regression in gcc 4.0.0's optimization ...

(In reply to comment #11)
> We need more analysis on these kinds of issues.
> So, we're doing a worse job on register allocation.  Is that because 
> the register allocator got worse, or because we're giving it a harder 
> problem to solve?

(In reply to comment #12)
> It is the latter and the pass which is responsible ...

(In reply to comment #13)
> Tough, there is really not much that can be done about this ...


Is this of any use?

Register Allocation by Puzzle Solving
http://compilers.cs.ucla.edu/fernando/projects/puzzles/


"This project consists in developing a new model for register allocation that has fast compilation time, produces code of good quality, is able to address the many irregularities found in common computer architectures and is intuitive enough to be easily understood."

"Our implementation produces Pentium code that is of similar quality to the code produced by the slower, state-of-the-art iterated register coalescing algorithm of George and Appel augmented with extensions by Smith, Ramsey, and Holloway."


YT,
Rob
Comment 41 Daniel Berlin 2009-04-21 13:24:17 UTC
Subject: Re:  [4.3/4.4/4.5 Regression] missed 
	load PRE, PRE makes i?86 suck

Fernando was an intern of mine, and while his algorithm is a great
algorithm, AFAIK he hasn't gotten better code out of it than we get
with IRA right now.
I'll check with him again and see if this has changed.

On Tue, Apr 21, 2009 at 8:31 AM, rob1weld at aol dot com
<gcc-bugzilla@gcc.gnu.org> wrote:
>
>
> Is this of any use ?
>
> Register Allocation by Puzzle Solving
> http://compilers.cs.ucla.edu/fernando/projects/puzzles/
Comment 42 Richard Biener 2009-08-04 12:26:23 UTC
GCC 4.3.4 is being released, adjusting target milestone.
Comment 43 Richard Biener 2010-05-22 18:10:27 UTC
GCC 4.3.5 is being released, adjusting target milestone.
Comment 44 Jan Hubicka 2011-01-08 16:21:07 UTC
It seems that even in the IRA world we have made no progress:
jh@evans:/abuild/jh/trunk-3/build-inst2/gcc> ./xgcc -B ./ -O3 reg.i
jh@evans:/abuild/jh/trunk-3/build-inst2/gcc> time ./a.out
         1000.8

real    0m5.179s
user    0m5.176s
sys     0m0.000s
jh@evans:/abuild/jh/trunk-3/build-inst2/gcc> gcc-4.3  -O3 reg.i
jh@evans:/abuild/jh/trunk-3/build-inst2/gcc> time ./a.out
         1180.8

real    0m5.195s
user    0m5.192s
sys     0m0.000s
jh@evans:/abuild/jh/trunk-3/build-inst2/gcc>

-fno-tree-pre, -fno-inline, and -fschedule-insns make no difference.

Adding Vladimir to the CC list as it is partly an RA issue.
Comment 45 Richard Biener 2011-06-27 12:14:28 UTC
4.3 branch is being closed, moving to 4.4.7 target.
Comment 46 wbrana 2011-07-23 13:25:44 UTC
-O3 -pipe -fomit-frame-pointer -march=core2 -funroll-loops -fno-tree-pre
4.4.6 - 1730.9
4.5.2 - 2368
4.6.1 - 2205.6
Comment 47 wbrana 2011-07-23 13:53:46 UTC
-O3 -pipe -fomit-frame-pointer -march=core2 -funroll-loops
4.4.6 - 1522.8
4.5.2 - 1502.6
4.6.1 - 1418.2
Comment 48 Andrew Pinski 2012-03-03 23:51:36 UTC
I noticed PR 32120 is also involved.

  D.1716_7 = k_4 * 8;
  D.1717_9 = array_8(D) + D.1716_7;
  D.1718_10 = *D.1717_9;
  k_11 = k_4 + 1;
  D.1720_12 = k_11 * 8;


The D.1720_12 should be converted to D.1716_7 + 8.
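The rewrite asked for here is plain strength reduction of the second multiply; a minimal sketch (function names are illustrative, and element size is the 8 bytes from the dump):

```c
#include <assert.h>

/* Byte offsets for array[k] and array[k + 1] with 8-byte elements.  */
unsigned long off_naive(unsigned long k)
{
    return (k + 1) * 8;          /* the redundant second multiply */
}

unsigned long off_reduced(unsigned long k)
{
    unsigned long t = k * 8;     /* D.1716_7 in the dump above */
    return t + 8;                /* (k + 1) * 8 == k * 8 + 8 */
}
```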
Comment 49 Andrew Pinski 2012-03-05 05:29:54 UTC
PR 37242 is also needed, from what I remember of reading the IR.
Comment 50 Jakub Jelinek 2012-03-13 12:47:54 UTC
4.4 branch is being closed, moving to 4.5.4 target.
Comment 51 Richard Biener 2012-07-02 11:45:51 UTC
The 4.5 branch is being closed, adjusting target milestone.
Comment 52 wbrana 2012-08-12 12:30:21 UTC
This bug celebrated its 7th anniversary this year. Congratulations!
Comment 53 wbrana 2012-08-13 08:26:13 UTC
It seems this was improved:

4.8 20120806
NUMERIC SORT        :          1543.7  :      39.59  :      13.00
4.8 20120813
NUMERIC SORT        :          2007.8  :      51.49  :      16.91
Comment 54 Richard Biener 2013-01-15 14:40:39 UTC
(In reply to comment #48)
> I noticed PR 32120 is also involved.
> 
>   D.1716_7 = k_4 * 8;
>   D.1717_9 = array_8(D) + D.1716_7;
>   D.1718_10 = *D.1717_9;
>   k_11 = k_4 + 1;
>   D.1720_12 = k_11 * 8;
> 
> 
> The D.1720_12 should be converted to D.1716_7 + 8.

For this, the straight-line strength-reduction pass now does

  <bb 5>:
  _8 = i_6 * 8;
  _10 = array_9(D) + _8;
  _11 = *_10;
  i_12 = i_6 + 1;
  _13 = _8 + 8;
  _14 = array_9(D) + _13;
  _15 = *_14;

replacing i_12 * 8 with _8 + 8.  The association / CSE opportunity is
undetected (maybe it makes sense to schedule pass_strength_reduction
before the 2nd pass_reassoc?).
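The missed association in source terms: the IL above still builds `array + (k*8 + 8)` from scratch, although reassociating it as `(array + k*8) + 8` would expose a CSE with the first address. A sketch of the wanted form (`sum_pair` is an illustrative name):

```c
#include <assert.h>

/* The second address should be derived from the first by a single add,
   rather than recomputed as array + (k*8 + 8).  */
long sum_pair(long *array, unsigned long k)
{
    long *p = array + k;   /* first address: array + k*8 */
    long a = *p;
    long b = *(p + 1);     /* second address as p plus 8 bytes: one add */
    return a + b;
}
```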
Comment 55 Jakub Jelinek 2013-04-12 15:17:01 UTC
GCC 4.6.4 has been released and the branch has been closed.
Comment 56 Richard Biener 2014-06-12 13:46:48 UTC
The 4.7 branch is being closed, moving target milestone to 4.8.4.
Comment 57 Jakub Jelinek 2014-12-19 13:36:16 UTC
GCC 4.8.4 has been released.
Comment 58 Richard Biener 2015-06-23 08:18:59 UTC
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
Comment 59 Jakub Jelinek 2015-06-26 19:59:29 UTC
GCC 4.9.3 has been released.
Comment 60 Richard Biener 2016-07-12 14:22:25 UTC
With code hoisting we now end up with the following for the testcase in comment #6.  Over-the-weekend testing on bytemark (nbench?) didn't show
any effect, though, so this requires re-evaluation.

NumSift (long int * array, long unsigned int i, long unsigned int j)
{
  long unsigned int k;
  sizetype _4;
  sizetype _5;
  long int * _6;
  long int _7;
  long unsigned int _8;
  long int * _9;
  long int _10;
  long unsigned int _14;
  long int * _28;
  long int pretmp_29;
  long unsigned int _30;
  long int prephitmp_31;
  long int * prephitmp_34;

  <bb 2>:
  goto <bb 10>;

  <bb 3>:
  _30 = _14 * 8;
  _28 = array_22(D) + _30;
  pretmp_29 = *_28;
  if (_14 < j_20(D))
    goto <bb 4>;
  else
    goto <bb 6>;

  <bb 4>:
  _4 = _14 + 1;
  _5 = _4 * 8;
  _6 = array_22(D) + _5;
  _7 = *_6;
  if (_7 > pretmp_29)
    goto <bb 5>;
  else
    goto <bb 6>;

  <bb 5>:

  <bb 6>:
  # k_16 = PHI <_14(3), _14(4), _4(5)>
  # prephitmp_34 = PHI <_28(3), _28(4), _6(5)>
  # prephitmp_31 = PHI <pretmp_29(3), pretmp_29(4), _7(5)>
  _8 = i_15 * 8;
  _9 = array_22(D) + _8;
  _10 = *_9;
  if (_10 < prephitmp_31)
    goto <bb 7>;
  else
    goto <bb 8>;

  <bb 7>:
  *prephitmp_34 = _10;
  *_9 = prephitmp_31;
  goto <bb 9>;

  <bb 8>:
  i_26 = j_20(D) + 1;

  <bb 9>:
  # i_23 = PHI <k_16(7), i_26(8)>

  <bb 10>:
  # i_15 = PHI <i_23(9), i_18(D)(2)>
  _14 = i_15 * 2;
  if (_14 <= j_20(D))
    goto <bb 3>;
  else
    goto <bb 11>;

  <bb 11>:
  return;
Comment 61 Richard Biener 2016-08-03 10:38:01 UTC
GCC 4.9 branch is being closed
Comment 62 Jeffrey A. Law 2017-02-07 23:03:39 UTC
So one thing that seems to ever so slightly improve this code is to realize that as we come around to the top of the loop the test _14 < j_20 can be rewritten as _14 != j_20 (we already know that _14 <= j_20).

That's quite a surprise, as all that seems to do is allow propagation of j_20 for _14 in a later PHI node, and we do less work in PRE -- so it seems like a step backwards at this point.

But then sink comes along and moves two address computations and a load into a more control dependent block.  LIM4 later pulls the two address computations totally out of the loop.

But that all seems to be relatively accidental improvements based on PRE not seeing a transformation.
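The rewrite in source terms (a sketch; `pick_index` is an illustrative name): under the dominating loop guard k <= j, the strict test is equivalent to an inequality test, and on its false edge k == j becomes a known fact, which is what enables the later propagation of j for k.

```c
#include <assert.h>

/* Caller guarantees k <= j (the loop guard).  Then k < j and k != j
   have the same truth value, and on the false edge k == j holds, so
   j can be propagated for k.  */
unsigned long pick_index(unsigned long k, unsigned long j)
{
    if (k != j)     /* was: k < j */
        return k;
    return j;       /* k == j here: k can be replaced by j */
}
```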
Comment 63 Jeffrey A. Law 2017-02-14 16:52:02 UTC
So going back to the gcc-3.4 code the major change is we make much more use of the complex addressing modes on x86:

NumSift:
.LFB2:
        pushq   %rbx
.LCFI0:
        movq    %rsi, %r8
        .p2align 4,,7
.L11:
        leaq    (%r8,%r8), %rcx
        cmpq    %rdx, %rcx
        ja      .L10
.L14:
        cmpq    %rdx, %rcx
        movq    %rcx, %rsi
        jae     .L4
        movq    8(%rdi,%rcx,8), %rbx
        cmpq    %rbx, (%rdi,%rcx,8)
        leaq    1(%rcx), %rax
        cmovl   %rax, %rsi
.L4:
        movq    (%rdi,%r8,8), %rax
        movq    (%rdi,%rsi,8), %rcx
        cmpq    %rcx, %rax
        jge     .L6
        movq    %rax, (%rdi,%rsi,8)
        movq    %rcx, (%rdi,%r8,8)
        movq    %rsi, %r8
        leaq    (%r8,%r8), %rcx
        cmpq    %rdx, %rcx
        jbe     .L14
.L10:
        popq    %rbx
        ret
        .p2align 4,,7
.L6:
        leaq    1(%rdx), %r8
        jmp     .L11


Contrast the loop body to what we generate now:
NumSift:
.LFB0:
        .cfi_startproc
        pushq   %rbx
        .cfi_def_cfa_offset 16
        .cfi_offset 3, -16
        leaq    1(%rdx), %rbx
.L2:
        leaq    (%rsi,%rsi), %rcx
        cmpq    %rdx, %rcx
        ja      .L9
.L6:
        movq    %rsi, %rax
        salq    $4, %rax
        addq    %rdi, %rax
        cmpq    %rdx, %rcx
        movq    (%rax), %r8
        jnb     .L3
        leaq    1(%rcx), %r9
        leaq    (%rdi,%r9,8), %r10
        movq    (%r10), %r11
        cmpq    %r8, %r11
        jle     .L3
        movq    %r11, %r8
        movq    %r10, %rax
        movq    %r9, %rcx
.L3:
        leaq    (%rdi,%rsi,8), %r9
        movq    %rbx, %rsi
        movq    (%r9), %r10
        cmpq    %r8, %r10
        jge     .L2
        movq    %rcx, %rsi
        movq    %r10, (%rax)
        movq    %r8, (%r9)
        leaq    (%rsi,%rsi), %rcx
        cmpq    %rdx, %rcx
        jbe     .L6
.L9:
        popq    %rbx

I haven't looked deeply at the dumps, but I suspect that CSE/PRE on the address arithmetic spoils our ability to utilize the complex addressing modes.  That would tend to match the findings from comment #62, where a change in a conditional inhibits PRE and we end up with better code.
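For what it's worth, the two styles the backend ends up choosing between look like this at the source level (a sketch; the two functions are semantically identical, and within a single block the backend folds both back into one scaled load -- the suspicion above is that once PRE carries the address temporary across blocks, the fold no longer applies and separate lea/add instructions survive):

```c
#include <assert.h>

/* Index form: maps directly onto x86 scaled addressing,
   e.g. movq (%rdi,%rsi,8), %rax -- one instruction, no temporary.  */
long load_indexed(long *array, unsigned long k)
{
    return array[k];
}

/* Materialized form: the address is computed into its own value first,
   which is the shape PRE leaves behind when it commons the arithmetic.  */
long load_via_temp(long *array, unsigned long k)
{
    long *p = array + k;
    return *p;
}
```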
Comment 64 Jakub Jelinek 2017-10-10 13:28:05 UTC
GCC 5 branch is being closed
Comment 65 Aldy Hernandez 2018-02-01 18:43:53 UTC
Still a problem.  I get a score of 1750.7 on a:

model name      : Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
Comment 66 Jakub Jelinek 2018-10-26 10:13:24 UTC
GCC 6 branch is being closed
Comment 67 Eric Gallager 2019-08-20 15:02:11 UTC
(In reply to Andrew Pinski from comment #49)
> PR 37242 is also needed from what I remember reading the IR.

That's fixed now.
Comment 68 Richard Biener 2019-11-14 07:58:24 UTC
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
Comment 69 Jeffrey A. Law 2020-02-25 14:27:54 UTC
So while gcc-10 improves things relative to gcc-9, the gcc-3.4 code is still far and away better.

On my box:
gcc-10  1725-1775
gcc-3.4 3000-3050

Clearly it's not even close.
Comment 70 Jakub Jelinek 2020-03-04 09:39:28 UTC
GCC 8.4.0 has been released, adjusting target milestone.
Comment 71 Jakub Jelinek 2021-05-14 09:45:29 UTC
GCC 8 branch is being closed.
Comment 72 Richard Biener 2021-06-01 08:03:48 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
Comment 73 Richard Biener 2022-05-27 09:33:22 UTC
GCC 9 branch is being closed
Comment 74 Jakub Jelinek 2022-06-28 10:29:02 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Comment 75 Richard Biener 2023-07-07 10:28:27 UTC
GCC 10 branch is being closed.