GCC Bugzilla has been upgraded from version 4.4.9 to 5.0rc3. If you see any problem, please report it to bug 64968.
Bug 21485 - [4.8/4.9/5/6 Regression] missed load PRE, PRE makes i?86 suck
Summary: [4.8/4.9/5/6 Regression] missed load PRE, PRE makes i?86 suck
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.0.0
: P2 normal
Target Milestone: 4.8.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
: 37732 (view as bug list)
Depends on: 23286 30521 32120
Blocks: 36861 37732
  Show dependency treegraph
 
Reported: 2005-05-10 09:03 UTC by Jason Bucata
Modified: 2014-12-19 13:36 UTC (History)
14 users (show)

See Also:
Host:
Target: i686-*-*
Build:
Known to work: 3.4.3
Known to fail: 4.0.0, 4.0.4, 4.1.0, 4.2.4, 4.3.2, 4.4.0, 4.6.0
Last reconfirmed: 2011-02-06 23:54:40


Attachments
Test case (preprocessed) (13.49 KB, application/octet-stream)
2005-05-10 09:05 UTC, Jason Bucata
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jason Bucata 2005-05-10 09:03:38 UTC
I've found a major performance regression in gcc 4.0.0's optimization of the
BYTEmark numsort benchmark.  I've boiled it down to a testcase that I think will
suit you... it outputs a single number representing the number of iterations run
(higher is better).  On my machine I get 900ish under 4.0.0 and around 1530 on
3.4.3.

Both were compiled and run in a Gentoo test partition, if that makes a difference:
3.4.3: gcc version 3.4.3-20050110 (Gentoo Linux 3.4.3.20050110-r2,
ssp-3.4.3.20050110-0, pie-8.7.7)
4.0.0: gcc version 4.0.0 (Gentoo Linux 4.0.0)
Comment 1 Jason Bucata 2005-05-10 09:05:21 UTC
Created attachment 8851 [details]
Test case (preprocessed)
Comment 2 Jason Bucata 2005-05-10 09:10:04 UTC
Oops, I should add that my pertinent options were: -O3 -fomit-frame-pointer
-march=athlon-xp -static
Comment 3 Steven Bosscher 2005-05-10 09:14:51 UTC
Confirmed on x86 (with and without frame pointer) and on amd64. 
Comment 4 Richard Biener 2005-05-10 09:27:10 UTC
mainline drops even lower - looks like poor choice of addressing modes and thus
more register pressure for 4.0 and 4.1.  Note that using profile-feedback
improves numbers a lot (but still we regress).
Comment 5 Giovanni Bajo 2005-05-10 09:45:24 UTC
Jason: thanks for this! Even better would be to let the testcase do a fixed 
number of iterations (like 1000 or so), and then we'll be using "time" 
externally to measure performance. Maybe you can do this for other testcases 
you are going to submit, thanks!
Comment 6 Steven Bosscher 2005-05-10 09:46:29 UTC
This is the function (reindented) where we spend almost all of our time: 
 
void 
NumSift (long *array, unsigned long i, unsigned long j) 
{ 
  unsigned long k; 
  long temp; 
  while ((i + i) <= j) 
    { 
      k = i + i; 
      if (k < j) 
        if (array[k] < array[k + 1L]) 
          ++k; 
      if (array[i] < array[k]) 
        { 
          temp = array[k]; 
          array[k] = array[i]; 
          array[i] = temp; 
          i = k; 
        } 
      else 
        i = j + 1; 
    } 
  return; 
} 
 
Comment 7 Steven Bosscher 2005-05-10 09:50:33 UTC
If Richard is right in comment #4, it would be interesting to see what 
happens if one tries this with Zdenek's TARGET_MEM_REF patch. 
Comment 8 Steven Bosscher 2005-05-10 10:22:33 UTC
On AMD64 with GCC 4.0.1 (CVS 4.0 branch) I go from ~580 at -O3 
to ~930 at -O3 -fno-tree-pre. 
Comment 9 Paolo Bonzini 2005-05-10 12:58:04 UTC
Looks like a register pressure problem... but yes, TARGET_MEM_REF may help as well.

Paolo
Comment 10 Andrew Pinski 2005-05-10 20:03:48 UTC
IV-OPTS does nothing to this testcase, it does not even change the trees.  This is just a ra issue.
Comment 11 Mark Mitchell 2005-10-31 03:21:41 UTC
We need more analysis on these kinds of issues.

So, we're doing a worse job on register allocation.  Is that because the register allocator got worse, or because we're giving it a harder problem to solve?  If the latter, what's responsible, and is there anything we can do about it?  We need that kind of information to make an intelligent decision about whether or not we should try to fix this, or just chalk it up to the fact that there's always variability between releases.
Comment 12 Andrew Pinski 2005-10-31 03:37:32 UTC
(In reply to comment #11)
> So, we're doing a worse job on register allocation.  Is that because the
> register allocator got worse, or because we're giving it a harder problem to
> solve?  If the latter, what's responsible, and is there anything we can do
> about it?
It is the latter and the pass which is responsible is tree PRE but if someone writes the code like what tree PRE gives, we will still have the same issue so the only correct place to fix this would be in RA.
Comment 13 Steven Bosscher 2006-01-09 22:23:17 UTC
AMD64 timings for today's GCC 4.1 (20060109) branch:

flags used                               score
for compilation                          (avg. of 3, higher is better)
-O3                                      685.2
-O3 -fprofile-generate                   636
-O3 -fprofile-use                        785.6
-O3 -fno-tree-pre                        958.4
-O3 -fno-tree-pre -fprofile-generate     673.8
-O3 -fno-tree-pre -fprofile-use          964.2
-O3 -fno-tree-pre -fmove-loop-invariants 967.2


The problem really is that we present the register allocator with a harder problem.  Tough, there is really not much that can be done about this other than crippling PRE for SMALL_REGISTER_CLASSES machines, but I don't know if we should want to do that...
Comment 14 Daniel Berlin 2006-02-16 19:46:03 UTC
So uh, has anyone figured out why we don't get 1500 with -fno-tree-pre?

Does -fno-gcse help too?
Comment 15 Mark Mitchell 2006-02-24 00:25:53 UTC
This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.
Comment 16 Andrew Pinski 2006-03-10 04:41:30 UTC
Hmm,  I see a missed optimization here (I want to say a PRE one too).
  D.1521_32 = k_4 * 4;
  D.1525_37 = (long int *) D.1521_32;
  D.1526_38 = D.1525_37 + array_9;
  D.1527_39 = D.1526_38 + 4B;
  #   VUSE <SMT.4_42>;
  D.1528_40 = *D.1527_39;
  if (prephitmp.27_35 < D.1528_40) goto <L2>; else goto <L3>;

<L2>:;
  k_41 = k_4 + 1;
  pretmp.22_23 = k_41 * 4;
  pretmp.24_6 = (long int *) pretmp.22_23;
  prephitmp.25_48 = pretmp.24_6 + array_9;
  #   VUSE <SMT.4_42>;
  prephitmp.27_51 = *prephitmp.25_48;

prephitmp.27_51 here is also the same as D.1528_40.  Though I don't know how to figure this out in PRE or any other optimization.  I almost want to say this is the real cause.  Looking at 3.4's asm, you don't have the extra load but for 4.0 and above you do.
Comment 17 Andrew Pinski 2006-03-10 13:56:03 UTC
What happens to the time if you replace that function with:
void
NumSift (long *array, unsigned long i, unsigned long j)
{
  unsigned long k;
  while ((i + i) <= j)
    {
      k = i + i;
      long t, t1;
      t = array[k];
      if (k < j)
        {
          t1 = array[k+1];
          if (t < t1)
            ++k, t = t1;
        }
      t1 = array[i];
      if (t1 < t)
        {
          array[k] = t1;
          array[i] = t;
          i = k;
        }
      else
        i = j + 1;
    }
  return;
}
-----------
The semantics should be the same, I just pulled out the PREs as much as I could.
Comment 18 Andrew Pinski 2006-03-10 14:10:11 UTC
(In reply to comment #17)
> What happens to the time if you replace that function with:
This helps about 5% but it does not get the score back up.

3.4.0's score for this machine is about 900.
4.1.0's score is 720.
4.1.0+scource modification is about 780.

so it helps but there is something else which needs to be improved still.

I don't think this is a RA issue now after looking into the source.
Comment 19 Andrew Pinski 2006-03-10 14:13:29 UTC
(In reply to comment #18)
> (In reply to comment #17)
> > What happens to the time if you replace that function with:
> This helps about 5% but it does not get the score back up.

Also the asm is about the same for this function, at least on i686 between 3.4.0 and the source modified version with 4.1.0.
Comment 20 Mark Mitchell 2006-05-25 02:32:56 UTC
Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.
Comment 21 Andrew Pinski 2007-05-28 07:06:54 UTC
The missed load PRE for the testcase below is fixed on the pointer plus banch as the addition is done in "unsigned int" for both the "++k" and the array[k+1].  If we change "++k" into "k+=2" and array[k+1] into array[k+2], we run into another missed PRE issue which can be shown by the following testcase which we don't optimize currently:
int f(int a, int b)
{
  int c = a+2;
  int d = c*2;
  int e = a*2;
  int f = e+2;
  return d == f;
}

Which I will file seperately.
Comment 22 Andrew Pinski 2007-06-18 06:12:54 UTC
This is basically fixed by the pointer_plus except we still have some combinable code (though this is not PRE's fault); see http://gcc.gnu.org/ml/gcc-patches/2007-05/msg01996.html for how to fix that issue.
Comment 23 Andrew Pinski 2007-07-04 21:01:06 UTC
I am going to declare this as fixed for 4.3, pointer plus gets the original case correctly and the secondary issues are all file seperately.
Comment 24 wbrana 2008-06-15 13:02:50 UTC
It seems to not be fixed in 4.3.1:

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          1545.8  :      39.64  :      13.02
STRING SORT         :          380.16  :     169.87  :      26.29
BITFIELD            :       6.014e+08  :     103.16  :      21.55
FP EMULATION        :          332.88  :     159.73  :      36.86
FOURIER             :           32962  :      37.49  :      21.06
ASSIGNMENT          :          57.851  :     220.13  :      57.10
IDEA                :            9372  :     143.34  :      42.56
HUFFMAN             :          3234.8  :      89.70  :      28.64
NEURAL NET          :          72.571  :     116.58  :      49.04
LU DECOMPOSITION    :          2656.5  :     137.62  :      99.37
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 117.762
FLOATING-POINT INDEX: 84.407
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual GenuineIntel Intel(R) Core(TM)2 Duo CPU     E6750  @ 2.66GHz 3200MHz
L2 Cache            : 4096 KB
OS                  : Linux 2.6.25.6
C compiler          : 4.3.1
libc                :
MEMORY INDEX        : 31.863
INTEGER INDEX       : 27.656
FLOATING-POINT INDEX: 46.816
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          2463.4  :      63.18  :      20.75
STRING SORT         :           374.8  :     167.47  :      25.92
BITFIELD            :      5.9818e+08  :     102.61  :      21.43
FP EMULATION        :          268.64  :     128.91  :      29.75
FOURIER             :           43176  :      49.10  :      27.58
ASSIGNMENT          :           58.68  :     223.29  :      57.92
IDEA                :          5413.5  :      82.80  :      24.58
HUFFMAN             :          3198.7  :      88.70  :      28.32
NEURAL NET          :           58.29  :      93.64  :      39.39
LU DECOMPOSITION    :          2380.6  :     123.33  :      89.05
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 112.599
FLOATING-POINT INDEX: 82.767
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual GenuineIntel Intel(R) Core(TM)2 Duo CPU     E6750  @ 2.66GHz 3200MHz
L2 Cache            : 4096 KB
OS                  : Linux 2.6.25.6
C compiler          : gcc version 3.4.6 (Gentoo 3.4.6-r2 p1.5, ssp-3.4.6-1.0, pie-8.7.10)
libc                :
MEMORY INDEX        : 31.806
INTEGER INDEX       : 25.604
FLOATING-POINT INDEX: 45.906
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.
Comment 25 wbrana 2008-06-29 10:55:24 UTC
I think "4.3" is missing in "Summary" and "4.3.1" in "Known to fail".
Comment 26 Joseph S. Myers 2008-07-04 16:52:42 UTC
Closing 4.1 branch.
Comment 27 wbrana 2008-09-28 18:00:07 UTC
gcc 4.3.2, -march=core2 instead of -march=nocona

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          1516.3  :      38.89  :      12.77
STRING SORT         :          381.24  :     170.35  :      26.37
BITFIELD            :      6.0164e+08  :     103.20  :      21.56
FP EMULATION        :             332  :     159.31  :      36.76
FOURIER             :           32687  :      37.17  :      20.88
ASSIGNMENT          :           57.93  :     220.44  :      57.18
IDEA                :            9380  :     143.46  :      42.60
HUFFMAN             :          3276.1  :      90.85  :      29.01
NEURAL NET          :          72.251  :     116.07  :      48.82
LU DECOMPOSITION    :          2641.5  :     136.84  :      98.81
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 117.698
FLOATING-POINT INDEX: 83.889
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual GenuineIntel Intel(R) Core(TM)2 Duo CPU     E6750  @ 2.66GHz 3200MHz
L2 Cache            : 4096 KB
OS                  : Linux 2.6.26.5
C compiler          : gcc-4.3.2
libc                :
MEMORY INDEX        : 31.912
INTEGER INDEX       : 27.598
FLOATING-POINT INDEX: 46.528
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.
Comment 28 Richard Biener 2008-10-03 23:18:32 UTC
nbench 2.2.3 numeric sort test executes 40% less iterations per second
when compiled with 4.4 snapshot than with 3.4.6

iterations/s - version
2439 - 3.4.6
1530 - 4.4.0 20080926 (experimental)
1526 - 4.3.2 
1580 - 4.2.4

CFLAGS = -s -static -Wall -O3 -g0 -march=nocona -fomit-frame-pointer
-funroll-loops -ffast-math

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          2439.1  :      62.55  :      20.54
STRING SORT         :          373.28  :     166.79  :      25.82
BITFIELD            :      5.9879e+08  :     102.71  :      21.45
FP EMULATION        :          267.84  :     128.52  :      29.66
FOURIER             :           43702  :      49.70  :      27.92
ASSIGNMENT          :          56.657  :     215.59  :      55.92
IDEA                :          5407.7  :      82.71  :      24.56
HUFFMAN             :          3204.3  :      88.86  :      28.37
NEURAL NET          :          57.485  :      92.35  :      38.84
LU DECOMPOSITION    :          2363.5  :     122.44  :      88.41
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 111.792
FLOATING-POINT INDEX: 82.519
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual GenuineIntel Intel(R) Core(TM)2 Duo CPU     E6750  @
2.66GHz 3200MHz
L2 Cache            : 4096 KB
OS                  : Linux 2.6.26.5
C compiler          : gcc version 3.4.6 (Gentoo 3.4.6-r2 p1.5, ssp-3.4.6-1.0,
pie-8.7.10)
libc                : libc-2.8.so
MEMORY INDEX        : 31.404
INTEGER INDEX       : 25.525
FLOATING-POINT INDEX: 45.768
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.


BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          1530.9  :      39.26  :      12.89
STRING SORT         :          372.64  :     166.51  :      25.77
BITFIELD            :      6.0348e+08  :     103.52  :      21.62
FP EMULATION        :           310.4  :     148.94  :      34.37
FOURIER             :           31760  :      36.12  :      20.29
ASSIGNMENT          :          48.361  :     184.02  :      47.73
IDEA                :            9204  :     140.77  :      41.80
HUFFMAN             :          3554.3  :      98.56  :      31.47
NEURAL NET          :          73.882  :     118.69  :      49.92
LU DECOMPOSITION    :          2322.2  :     120.30  :      86.87
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 114.457
FLOATING-POINT INDEX: 80.190
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual GenuineIntel Intel(R) Core(TM)2 Duo CPU     E6750  @
2.66GHz 3200MHz
L2 Cache            : 4096 KB
OS                  : Linux 2.6.26.5
C compiler          : gcc version 4.4.0 20080926 (experimental) (GCC)
libc                : libc-2.8.so
MEMORY INDEX        : 29.851
INTEGER INDEX       : 27.632
FLOATING-POINT INDEX: 44.477
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.


BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          1526.2  :      39.14  :      12.85
STRING SORT         :          374.28  :     167.24  :      25.89
BITFIELD            :      6.0262e+08  :     103.37  :      21.59
FP EMULATION        :          320.64  :     153.86  :      35.50
FOURIER             :           33352  :      37.93  :      21.30
ASSIGNMENT          :          57.834  :     220.07  :      57.08
IDEA                :            9288  :     142.06  :      42.18
HUFFMAN             :            3211  :      89.04  :      28.43
NEURAL NET          :           74.69  :     119.98  :      50.47
LU DECOMPOSITION    :          2655.7  :     137.58  :      99.35
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 116.415
FLOATING-POINT INDEX: 85.548
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual GenuineIntel Intel(R) Core(TM)2 Duo CPU     E6750  @
2.66GHz 3200MHz
L2 Cache            : 4096 KB
OS                  : Linux 2.6.26.5
C compiler          : gcc-4.3.2
libc                : libc-2.8.so
MEMORY INDEX        : 31.716
INTEGER INDEX       : 27.199
FLOATING-POINT INDEX: 47.448
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.



BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          1580.2  :      40.52  :      13.31
STRING SORT         :          359.88  :     160.80  :      24.89
BITFIELD            :      5.9462e+08  :     102.00  :      21.30
FP EMULATION        :          291.28  :     139.77  :      32.25
FOURIER             :           32567  :      37.04  :      20.80
ASSIGNMENT          :          57.097  :     217.26  :      56.35
IDEA                :          7876.8  :     120.47  :      35.77
HUFFMAN             :          3433.1  :      95.20  :      30.40
NEURAL NET          :          72.091  :     115.81  :      48.71
LU DECOMPOSITION    :          2664.4  :     138.03  :      99.67
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 112.739
FLOATING-POINT INDEX: 83.966
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : Dual GenuineIntel Intel(R) Core(TM)2 Duo CPU     E6750  @
2.66GHz 3200MHz
L2 Cache            : 4096 KB
OS                  : Linux 2.6.26.5
C compiler          : gcc-4.2.4
libc                : libc-2.8.so
MEMORY INDEX        : 31.032
INTEGER INDEX       : 26.138
FLOATING-POINT INDEX: 46.571
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.
Comment 29 Richard Biener 2008-10-03 23:18:41 UTC
*** Bug 37732 has been marked as a duplicate of this bug. ***
Comment 30 Richard Biener 2008-10-03 23:54:39 UTC
Comment #6 still applies.  On the trunk we do not fully exploit the partial
redundant load of array[k] in

      if (k < j) 
        if (array[k] < array[k + 1L]) 
          ++k; 
      if (array[i] < array[k]) 

but we transform the above to

      if (k < j)
        {
          tmp1 = array[k];
          tmp2 = array[k+1];
          if (tmp1 < tmp2)
            ++k;
          else
            tmp1 = array[k+1];
        }
      if (array[i] < tmp1)

missing the full redundancy of array[k+1].

Reduced testcase:

long
NumSift (long *array, unsigned long k)
{
  if (array[k] < array[k + 1L])
    ++k;
  return array[k];
}

with integer k it gets somewhat more complicated even.

The question is whether this explains the slowdown compared to GCC 3.4.
Comment 31 Andrew Pinski 2008-10-04 00:13:26 UTC
the reduced testcase in comment #30 is optimized by DOM3 though not by PRE.
Running PRE again right after the first PRE still founds more PREable expressions for this testcase ...
Comment 32 Richard Biener 2008-10-04 00:57:46 UTC
As we PHI-translate k_1 * 4 we are not able to find D.1237_7 * 4 in the
SCCVN tables.  So we allocate a new value-id for it.  Oops.  This is because
once we say its type is unsigned int and once it's unsigned long.  And
we require types to be pointer-equal in the hash comparison fn.

I have a patch to fix this which makes this run even more slow.  -fno-tree-pre
fixes the speed regression.
Comment 33 wbrana 2008-10-04 09:22:25 UTC
results with -fno-tree-pre

1749 - 4.4.0 20080926 (experimental)
1701 - 4.3.2 
2476 - 4.2.4
Comment 34 Richard Biener 2008-10-04 15:11:56 UTC
Fastest result on a Intel Core Duo with

gcc-4.1 -O3 -fomit-frame-pointer -fno-tree-pre -fno-inline -fschedule-insns: 1273

the interesting thing is that with the above we if-convert

        if (array[k] < array[k + 1L]) 
          ++k; 

using setl which reduces the burden of the branch predictor which in the worst
case (trunk) has quite a number of mispredicts.  The following is branches
retired vs. mispredicted branches retired for trunk (with PRE enabled)

 * CPU: Core Solo / Duo, speed 1833 MHz (estimated)
 * Counted BR_INST_RETIRED events (number of branch instructions retired) with a
 unit mask of 0x00 (No unit mask) count 10000
 * Counted BR_MISS_PRED_RETIRED events (number of mispredicted branches retired)
 with a unit mask of 0x00 (No unit mask) count 10000

080486d0 <NumSift>: /* NumSift total: 188708 95.2681 21424 99.9953 */
   752  0.3796     0       0   : 80486d0:       push   %ebp
                               : 80486d1:       push   %edi
                               : 80486d2:       push   %esi
   824  0.4160     0       0   : 80486d3:       push   %ebx
     5  0.0025     0       0   : 80486d4:       sub    $0xc,%esp
                               : 80486d7:       mov    %ecx,(%esp)
  1541  0.7780     0       0   : 80486da:       add    $0x1,%ecx
                               : 80486dd:       mov    %ecx,0x8(%esp)
                               : 80486e1:       lea    0x0(%esi),%esi
   709  0.3579     2  0.0093   : 80486e8:       lea    (%edx,%edx,1),%ecx
  1706  0.8613     1  0.0047   : 80486eb:       cmp    (%esp),%ecx
  3083  1.5564   924  4.3127   : 80486ee:       mov    %ecx,%edi
    92  0.0464     0       0   : 80486f0:       lea    (%eax,%edx,8),%ebp
                               : 80486f3:       mov    %ebp,%ebx
   868  0.4382    13  0.0607   : 80486f5:       ja     804871d <NumSift+0x4d>
  5732  2.8938     0       0   : 80486f7:       jb     8048728 <NumSift+0x58>
     2  0.0010     0       0   : 80486f9:       mov    (%ebx),%esi
  7789  3.9322   162  0.7561   : 80486fb:       lea    (%eax,%edx,4),%ecx
 34575 17.4550  6534 30.4971   : 80486fe:       mov    0x8(%esp),%edx
  3070  1.5499  2103  9.8156   : 8048702:       mov    (%ecx),%ebp
  8244  4.1619   134  0.6254   : 8048704:       cmp    %esi,%ebp
  2322  1.1722   155  0.7235   : 8048706:       jge    80486e8 <NumSift+0x18>
  1363  0.6881   236  1.1015   : 8048708:       mov    %edi,%edx
  3578  1.8063     0       0   : 804870a:       mov    %ebp,(%ebx)
   450  0.2272   367  1.7130   : 804870c:       lea    (%eax,%edx,8),%ebp
  3797  1.9169     0       0   : 804870f:       mov    %esi,(%ecx)
  5035  2.5419    22  0.1027   : 8048711:       lea    (%edx,%edx,1),%ecx
                               : 8048714:       mov    %ebp,%ebx
   389  0.1964     0       0   : 8048716:       cmp    (%esp),%ecx
  5885  2.9710    15  0.0700   : 8048719:       mov    %ecx,%edi
     7  0.0035     0       0   : 804871b:       jbe    80486f7 <NumSift+0x27>
   416  0.2100    24  0.1120   : 804871d:       add    $0xc,%esp
  5419  2.7357  1431  6.6791   : 8048720:       pop    %ebx
   568  0.2868   275  1.2835   : 8048721:       pop    %esi
   710  0.3584    24  0.1120   : 8048722:       pop    %edi
   334  0.1686    12  0.0560   : 8048723:       pop    %ebp
   146  0.0737    91  0.4247   : 8048724:       ret
                               : 8048725:       lea    0x0(%esi),%esi
  8706  4.3952     0       0   : 8048728:       mov    0x0(%ebp),%ebx
  1536  0.7754   379  1.7690   : 804872b:       lea    0x1(%ecx),%edi
                               : 804872e:       mov    %ebx,0x4(%esp)
 14484  7.3122     9  0.0420   : 8048732:       lea    (%eax,%edi,4),%ebx
                               : 8048735:       mov    (%ebx),%esi
  2165  1.0930     6  0.0280   : 8048737:       cmp    %esi,0x4(%esp)
 19814 10.0030     1  0.0047   : 804873b:       jl     80486fb <NumSift+0x2b>
  2585  1.3050     0       0   : 804873d:       mov    0x4(%esp),%esi
 37728 19.0468  8504 39.6919   : 8048741:       mov    %ebp,%ebx
  1511  0.7628     0       0   : 8048743:       mov    %ecx,%edi
   768  0.3877     0       0   : 8048745:       jmp    80486fb <NumSift+0x2b>
                               : 8048747:       mov    %esi,%esi
                               : 8048749:       lea    0x0(%edi),%edi

while the following is what we get for the gcc 4.1 code w/o PRE:

08048670 <NumSift>: /* NumSift total: 200781 92.9938  4738 99.9156 */
  1196  0.5539     0       0   : 8048670:       push   %ebp
     1 4.6e-04     0       0   : 8048671:       push   %edi
     2 9.3e-04     0       0   : 8048672:       mov    %eax,%edi
  2084  0.9652     0       0   : 8048674:       push   %esi
     9  0.0042     0       0   : 8048675:       push   %ebx
     1 4.6e-04     0       0   : 8048676:       mov    %edx,%ebx
  1162  0.5382     0       0   : 8048678:       sub    $0x4,%esp
     6  0.0028     0       0   : 804867b:       mov    %ecx,(%esp)
  3128  1.4488     0       0   : 804867e:       xchg   %ax,%ax
  1078  0.4993     2  0.0422   : 8048680:       lea    (%ebx,%ebx,1),%edx
   577  0.2672     0       0   : 8048683:       cmp    (%esp),%edx
   202  0.0936     6  0.1265   : 8048686:       lea    (%edi,%ebx,4),%ebp
   152  0.0704     1  0.0211   : 8048689:       ja     80486ba <NumSift+0x4a>
 44618 20.6653     0       0   : 804868b:       jae    804869c <NumSift+0x2c>
  2125  0.9842    62  1.3075   : 804868d:       mov    (%edi,%ebx,8),%eax
  2932  1.3580   322  6.7904   : 8048690:       cmp    0x4(%edi,%ebx,8),%eax
 23392 10.8342   151  3.1843   : 8048694:       setl   %al
  8331  3.8586     5  0.1054   : 8048697:       movzbl %al,%eax
 11420  5.2893     0       0   : 804869a:       add    %eax,%edx
 15985  7.4036     1  0.0211   : 804869c:       lea    (%edi,%edx,4),%esi
  5171  2.3950     6  0.1265   : 804869f:       mov    0x0(%ebp),%ecx
   109  0.0505     0       0   : 80486a2:       mov    %edx,%ebx
  1129  0.5229     0       0   : 80486a4:       mov    (%esi),%eax
 16905  7.8297     0       0   : 80486a6:       cmp    %eax,%ecx
  2442  1.1310     0       0   : 80486a8:       jge    80486c2 <NumSift+0x52>
 18994  8.7973     1  0.0211   : 80486aa:       lea    (%ebx,%ebx,1),%edx
  1134  0.5252   507 10.6917   : 80486ad:       cmp    (%esp),%edx
   136  0.0630     4  0.0844   : 80486b0:       mov    %ecx,(%esi)
 19141  8.8654     0       0   : 80486b2:       mov    %eax,0x0(%ebp)
     1 4.6e-04     0       0   : 80486b5:       lea    (%edi,%ebx,4),%ebp
    36  0.0167     0       0   : 80486b8:       jbe    804868b <NumSift+0x1b>
  3202  1.4830     0       0   : 80486ba:       add    $0x4,%esp
  4369  2.0235   618 13.0325   : 80486bd:       pop    %ebx
  1842  0.8531   680 14.3399   : 80486be:       pop    %esi
  2309  1.0694   878 18.5154   : 80486bf:       pop    %edi
    54  0.0250     1  0.0211   : 80486c0:       pop    %ebp
   500  0.2316     5  0.1054   : 80486c1:       ret    
   498  0.2307     0       0   : 80486c2:       mov    (%esp),%ebx
  4407  2.0411  1487 31.3581   : 80486c5:       add    $0x1,%ebx
     1 4.6e-04     1  0.0211   : 80486c8:       jmp    8048680 <NumSift+0x10>
                               : 80486ca:       lea    0x0(%esi),%esi
Comment 35 Richard Biener 2008-10-04 15:58:50 UTC
Another missed optimization on the tree level is hoisting of the load of
array[k*4] before the k < j condition which is possible after the PRE
insertion:

<bb 3>:
  if (k_4 < j_5(D))
    goto <bb 4>;
  else
    goto <bb 12>;

<bb 12>:
  pretmp.12_31 = k_4 * 4;
  pretmp.13_30 = array_8(D) + pretmp.12_31;
  pretmp.14_25 = *pretmp.13_30;
  goto <bb 6>;

<bb 4>:
  D.1242_7 = k_4 * 4;
  D.1243_9 = array_8(D) + D.1242_7;
  D.1244_10 = *D.1243_9;

maybe PRE could somehow even insert on BB3 entry instead of BB3 exit edges in this case.  This would re-enable if-conversion of the second branch.
Comment 36 Richard Biener 2008-10-04 16:16:07 UTC
Testcase for that:

/* { dg-do compile } */
/* { dg-options "-O2 -fdump-tree-pre" } */

long
NumSift (long *array, int b, unsigned long k)
{
  if (b)
    if (array[k] < array[k + 1L])
      ++k;
  return array[k];
}

/* There should be two loads left.  */

/* { dg-final { scan-tree-dump-times "= \\\*" 2 "pre" } } */
/* { dg-final { cleanup-tree-dump "pre" } } */
Comment 37 Steven Bosscher 2008-11-22 09:13:07 UTC
Re: comment #35 and comment #36
That is code hoisting, again. See Bug 23286 and some the bugs closed as a duplicate of Bug 23286. Looks like it's time to implement tree-level hoisting :-)
Comment 38 Paolo Bonzini 2009-02-16 09:01:25 UTC
Outlandish statement: maybe no-condexec if conversion should be moved to the tree-level?!?  Doing this kind of hoisting at the same time as if conversion is simpler on GIMPLE...
Comment 39 Joseph S. Myers 2009-03-31 18:46:22 UTC
Closing 4.2 branch.
Comment 40 Rob 2009-04-21 12:30:59 UTC
(In reply to comment #0)
> I've found a major performance regression in gcc 4.0.0's optimization ...

(In reply to comment #11)
> We need more analysis on these kinds of issues.
> So, we're doing a worse job on register allocation.  Is that because 
> the register allocator got worse, or because we're giving it a harder 
> problem to solve?

(In reply to comment #12)
> It is the latter and the pass which is responsible ...

(In reply to comment #13)
> Tough, there is really not much that can be done about this ...


Is this of any use ?

Register Allocation by Puzzle Solving
http://compilers.cs.ucla.edu/fernando/projects/puzzles/


"This project consists in developing a new model for register allocation that has fast compilation time, produces code of good quality, is able to address the many irregularities found in common computer architectures and is intuitive enough to be easily understood."

"Our implementation produces Pentium code that is of similar quality to the code produced by the slower, state-of-the-art iterated register coalescing algorithm of George and Appel augmented with extensions by Smith, Ramsey, and Holloway."


YT,
Rob
Comment 41 Daniel Berlin 2009-04-21 13:24:17 UTC
Subject: Re:  [4.3/4.4/4.5 Regression] missed 
	load PRE, PRE makes i?86 suck

Fernando was an intern of mine, and while his algorithm is a great
algorithm, AFAIK he hasn't gotten better code out of it than we get
with IRA right now.
I'll check with him again and see if this has changed.

On Tue, Apr 21, 2009 at 8:31 AM, rob1weld at aol dot com
<gcc-bugzilla@gcc.gnu.org> wrote:
>
>
> Is this of any use ?
>
> Register Allocation by Puzzle Solving
> http://compilers.cs.ucla.edu/fernando/projects/puzzles/
Comment 42 Richard Biener 2009-08-04 12:26:23 UTC
GCC 4.3.4 is being released, adjusting target milestone.
Comment 43 Richard Biener 2010-05-22 18:10:27 UTC
GCC 4.3.5 is being released, adjusting target milestone.
Comment 44 Jan Hubicka 2011-01-08 16:21:07 UTC
Seems that even in IRA world we made no progress
jh@evans:/abuild/jh/trunk-3/build-inst2/gcc> ./xgcc -B ./ -O3 reg.i
jh@evans:/abuild/jh/trunk-3/build-inst2/gcc> tim e./a.out
If 'tim' is not a typo you can run the following command to lookup the package that contains the binary:
    command-not-found tim
bash: tim: command not found
jh@evans:/abuild/jh/trunk-3/build-inst2/gcc> time ./a.out
         1000.8

real    0m5.179s
user    0m5.176s
sys     0m0.000s
jh@evans:/abuild/jh/trunk-3/build-inst2/gcc> gcc-4.3  -O3 reg.i
jh@evans:/abuild/jh/trunk-3/build-inst2/gcc> time ./a.out
         1180.8

real    0m5.195s
user    0m5.192s
sys     0m0.000s
jh@evans:/abuild/jh/trunk-3/build-inst2/gcc>

-fno-tree-pre -fno-inline -fschedule-insns makes no difference

Adding vladimir to list as it is partly RA issue
Comment 45 Richard Biener 2011-06-27 12:14:28 UTC
4.3 branch is being closed, moving to 4.4.7 target.
Comment 46 wbrana 2011-07-23 13:25:44 UTC
-O3 -pipe -fomit-frame-pointer -march=core2 -funroll-loops -fno-tree-pre
4.4.6 - 1730.9
4.5.2 - 2368
4.6.1 - 2205.6
Comment 47 wbrana 2011-07-23 13:53:46 UTC
-O3 -pipe -fomit-frame-pointer -march=core2 -funroll-loops
4.4.6 - 1522.8
4.5.2 - 1502.6
4.6.1 - 1418.2
Comment 48 Andrew Pinski 2012-03-03 23:51:36 UTC
I noticed PR 32120 is also involved.

  D.1716_7 = k_4 * 8;
  D.1717_9 = array_8(D) + D.1716_7;
  D.1718_10 = *D.1717_9;
  k_11 = k_4 + 1;
  D.1720_12 = k_11 * 8;


The D.1720_12 should be converted to D.1716_7 + 8.
Comment 49 Andrew Pinski 2012-03-05 05:29:54 UTC
PR 37242 is also needed from what I remember reading the IR.
Comment 50 Jakub Jelinek 2012-03-13 12:47:54 UTC
4.4 branch is being closed, moving to 4.5.4 target.
Comment 51 Richard Biener 2012-07-02 11:45:51 UTC
The 4.5 branch is being closed, adjusting target milestone.
Comment 52 wbrana 2012-08-12 12:30:21 UTC
This bug celebrated 7th anniversary this year. Congratulations!
Comment 53 wbrana 2012-08-13 08:26:13 UTC
It seems it was improved.

4.8 20120806
NUMERIC SORT        :          1543.7  :      39.59  :      13.00
4.8 20120813
NUMERIC SORT        :          2007.8  :      51.49  :      16.91
Comment 54 Richard Biener 2013-01-15 14:40:39 UTC
(In reply to comment #48)
> I noticed PR 32120 is also involved.
> 
>   D.1716_7 = k_4 * 8;
>   D.1717_9 = array_8(D) + D.1716_7;
>   D.1718_10 = *D.1717_9;
>   k_11 = k_4 + 1;
>   D.1720_12 = k_11 * 8;
> 
> 
> The D.1720_12 should be converted to D.1716_7 + 8.

For this the straight-line strenght-reduction pass now does

  <bb 5>:
  _8 = i_6 * 8;
  _10 = array_9(D) + _8;
  _11 = *_10;
  i_12 = i_6 + 1;
  _13 = _8 + 8;
  _14 = array_9(D) + _13;
  _15 = *_14;

replacing i_12 * 8 with _8 + 8.  The association / CSE opportunity is
undetected (maybe it makes sense to schedule pass_strength_reduction
before the 2nd pass_reassoc?).
Comment 55 Jakub Jelinek 2013-04-12 15:17:01 UTC
GCC 4.6.4 has been released and the branch has been closed.
Comment 56 Richard Biener 2014-06-12 13:46:48 UTC
The 4.7 branch is being closed, moving target milestone to 4.8.4.
Comment 57 Jakub Jelinek 2014-12-19 13:36:16 UTC
GCC 4.8.4 has been released.