here's openssl speed results when it's compiled with 3.3 (original debian unstable package):

options:bn(64,32) md2(int) rc4(idx,int) des(ptr,risc1,16,long) aes(partial) blowfish(idx)
compiler: gcc -fPIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DOPENSSL_NO_KRB5 -DOPENSSL_NO_IDEA -DOPENSSL_NO_MDC2 -DOPENSSL_NO_RC5 -DL_ENDIAN -DTERMIO -O3 -march=i686 -mcpu=i686 -fomit-frame-pointer -Wall -DSHA1_ASM -DMD5_ASM -DRMD160_ASM
available timing options: TIMES TIMEB HZ=100 [sysconf value]
timing function used: times
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md2                510.80k     1064.79k     1486.96k     1641.83k     1702.87k
mdc2                  0.00         0.00         0.00         0.00         0.00
md4               4999.47k    17746.97k    51392.88k    97451.59k   131711.89k
md5               4405.95k    15208.16k    43027.34k    77946.11k   101040.96k
hmac(md5)         4951.58k    16851.67k    46126.90k    81002.65k   101700.77k
sha1              3892.54k    12223.89k    29586.19k    45767.99k    54082.03k
rmd160            3715.14k    10397.52k    23079.49k    33148.87k    37651.83k
rc4              58941.98k    66899.63k    71733.39k    72572.54k    72476.92k
des cbc          13353.92k    13897.80k    14067.26k    14088.53k    14107.61k
des ede3          4887.63k     5039.28k     5083.63k     5116.70k     5086.58k
idea cbc              0.00         0.00         0.00         0.00         0.00
rc2 cbc           5257.37k     5534.13k     5560.97k     5610.12k     5582.42k
rc5-32/12 cbc         0.00         0.00         0.00         0.00         0.00
blowfish cbc     21054.83k    22340.34k    22704.49k    22895.90k    22860.91k
cast cbc         14478.39k    15882.31k    16400.99k    16570.03k    16585.01k
aes-128 cbc      13612.33k    14364.39k    14382.68k    14404.12k    14440.26k
aes-192 cbc      12075.70k    12370.43k    12530.49k    12518.63k    12559.92k
aes-256 cbc      10806.91k    11093.65k    11179.27k    11185.67k    11205.97k
                   sign     verify    sign/s  verify/s
rsa  512 bits   0.0023s   0.0002s     438.5    4928.2
rsa 1024 bits   0.0109s   0.0006s      91.6    1746.1
rsa 2048 bits   0.0646s   0.0019s      15.5     527.6
rsa 4096 bits   0.4317s   0.0066s       2.3     152.0
                   sign     verify    sign/s  verify/s
dsa  512 bits   0.0018s   0.0022s     546.0     460.7
dsa 1024 bits   0.0054s   0.0065s     186.6     154.8
dsa 2048 bits   0.0179s   0.0220s      55.7      45.5

and here's the same package compiled with gcc 4.0, gcc-4.0 (GCC) 4.0.0 20050212 (experimental):

compiler: gcc -fPIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DOPENSSL_NO_KRB5 -DOPENSSL_NO_IDEA -DOPENSSL_NO_MDC2 -DOPENSSL_NO_RC5 -DL_ENDIAN -DTERMIO -O3 -march=i686 -mcpu=i686 -fomit-frame-pointer -Wall -DSHA1_ASM -DMD5_ASM -DRMD160_ASM
available timing options: TIMES TIMEB HZ=100 [sysconf value]
timing function used: times
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md2                361.81k      781.01k     1103.19k     1231.36k     1278.84k
mdc2                  0.00         0.00         0.00         0.00         0.00
md4               3103.64k    11338.88k    36135.04k    79292.67k   123123.36k
md5               2758.32k    10084.74k    31863.54k    66522.25k    98860.02k
hmac(md5)         4581.08k    15784.49k    43771.66k    78227.60k   101959.42k
sha1              2638.72k     8889.12k    24063.88k    41890.99k    53462.15k
rmd160            2477.15k     7918.19k    19696.52k    31106.04k    37317.88k
rc4              60284.27k    67543.46k    71379.34k    72455.38k    72581.12k
des cbc          13547.77k    13876.64k    14049.67k    14102.25k    14020.78k
des ede3          4950.20k     5050.99k     5068.80k     5111.00k     5088.06k
idea cbc              0.00         0.00         0.00         0.00         0.00
rc2 cbc           5814.75k     6060.45k     6150.37k     6169.60k     6196.13k
rc5-32/12 cbc         0.00         0.00         0.00         0.00         0.00
blowfish cbc     20941.23k    22373.68k    22868.43k    22822.28k    23014.29k
cast cbc         12790.60k    14102.95k    14514.24k    14494.77k    14622.21k
aes-128 cbc      13030.43k    13549.49k    13653.51k    13694.85k    13696.33k
aes-192 cbc      11257.66k    11517.92k    11545.25k    11604.32k    11568.43k
aes-256 cbc      10065.01k    10296.48k    10403.82k    10332.02k    10382.25k
                   sign     verify    sign/s  verify/s
rsa  512 bits   0.0024s   0.0002s     418.5    4201.7
rsa 1024 bits   0.0112s   0.0006s      89.5    1550.7
rsa 2048 bits   0.0650s   0.0020s      15.4     504.9
rsa 4096 bits   0.4311s   0.0068s       2.3     147.9
                   sign     verify    sign/s  verify/s
dsa  512 bits   0.0019s   0.0023s     521.4     441.9
dsa 1024 bits   0.0055s   0.0067s     182.9     148.3
dsa 2048 bits   0.0181s   0.0222s      55.2      45.1

as you can see, almost every test is worse with 4.0. Not sure why. The same test on ultrasparc and amd64 shows 4.0 as a clear winner. (Although it still crashes on amd64... ;) )
We need a self-contained example.
No feedback in 3 months.
When we ran 'openssl speed md2', we did see that gcc-4.0 was slower than earlier versions, so we created a minimal test case, which we will attach. Here is how long it took to run a 34 megabyte file through the test program when compiled with various compilers and options:

gcc-2.95.3 -fPIC -O1   4.940s
gcc-4.0.0  -fPIC -O1   3.510s
gcc-3.4.3  -fPIC -O1   5.190s
gcc-2.95.3 -fPIC -O2   3.470s
gcc-3.4.3  -fPIC -O2   3.460s
gcc-4.0.0  -fPIC -O2   4.050s
gcc-2.95.3 -fPIC -O3   3.400s
gcc-3.4.3  -fPIC -O3   3.740s
gcc-4.0.0  -fPIC -O3   4.010s

This test was done on a Pentium 4 workstation, and no smoothing was done on the resulting times, but they seemed to be repeatable. We also tried without -fPIC, but did not see as large a regression there.
Created attachment 9010 [details]
small preprocessed standalone test based on openssl md2
I would not be surprised if this is just a failure to use the i386 addressing modes.
Confirmed. The regression appears only with -fPIC, and it's pretty evident. The core is md2_block, the inner loop:

GCC 3.4
=============================================================
.L29:
        xorl $0, %edx
        .p2align 2,,3
.L28:
        movl S@GOTOFF(%ebx,%eax,4), %esi
        xorl -216(%ebp,%edx,4), %esi
        movl S@GOTOFF(%ebx,%esi,4), %eax
        xorl -212(%ebp,%edx,4), %eax
        movl S@GOTOFF(%ebx,%eax,4), %edi
        xorl -208(%ebp,%edx,4), %edi
        movl %esi, -216(%ebp,%edx,4)
        movl S@GOTOFF(%ebx,%edi,4), %esi
        xorl -204(%ebp,%edx,4), %esi
        movl %eax, -212(%ebp,%edx,4)
        movl S@GOTOFF(%ebx,%esi,4), %eax
        xorl -200(%ebp,%edx,4), %eax
        movl %edi, -208(%ebp,%edx,4)
        movl S@GOTOFF(%ebx,%eax,4), %edi
        xorl -196(%ebp,%edx,4), %edi
        movl %esi, -204(%ebp,%edx,4)
        movl S@GOTOFF(%ebx,%edi,4), %esi
        xorl -192(%ebp,%edx,4), %esi
        movl %eax, -200(%ebp,%edx,4)
        movl S@GOTOFF(%ebx,%esi,4), %eax
        xorl -188(%ebp,%edx,4), %eax
        movl %edi, -196(%ebp,%edx,4)
        movl %esi, -192(%ebp,%edx,4)
        movl %eax, -188(%ebp,%edx,4)
        addl $8, %edx
        cmpl $47, %edx
        jle .L28
        addl %ecx, %eax
        incl %ecx
        andl $255, %eax
        cmpl $17, %ecx
        jle .L29
=============================================================

GCC 4.0
=============================================================
.L16:
        movl -384(%ebp), %eax
        movl -208(%ebp), %esi
        incl -384(%ebp)
        addl %esi, %eax
        movl -456(%ebp), %esi
        andl $255, %eax
        movl (%edi,%eax,4), %ecx
        movl -464(%ebp), %eax
        xorl %ecx, %esi
        movl (%edi,%esi,4), %edx
        movl %esi, -368(%ebp)
        movl %esi, -456(%ebp)
        movl -488(%ebp), %esi
        xorl %edx, %eax
        movl -472(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl (%edi,%eax,4), %ecx
        movl %eax, -364(%ebp)
        movl %eax, -464(%ebp)
        xorl %ecx, %edx
        movl -480(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -360(%ebp)
        movl %edx, -472(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -356(%ebp)
        movl %ecx, -480(%ebp)
        xorl %eax, %esi
        movl -496(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -352(%ebp)
        movl %esi, -488(%ebp)
        xorl %edx, %eax
        movl -504(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl %eax, -348(%ebp)
        movl %eax, -496(%ebp)
        xorl %ecx, %edx
        movl -512(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -344(%ebp)
        movl %edx, -504(%ebp)
        xorl %eax, %ecx
        movl %ecx, -340(%ebp)
        movl (%edi,%ecx,4), %eax
        movl -520(%ebp), %esi
        movl %ecx, -512(%ebp)
        xorl %eax, %esi
        movl -528(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -336(%ebp)
        movl %esi, -520(%ebp)
        movl -552(%ebp), %esi
        xorl %edx, %eax
        movl -536(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl %eax, -332(%ebp)
        movl %eax, -528(%ebp)
        xorl %ecx, %edx
        movl -544(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -328(%ebp)
        movl %edx, -536(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -324(%ebp)
        movl %ecx, -544(%ebp)
        xorl %eax, %esi
        movl -556(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -320(%ebp)
        movl %esi, -552(%ebp)
        movl -568(%ebp), %esi
        xorl %edx, %eax
        movl -560(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl %eax, -316(%ebp)
        movl %eax, -556(%ebp)
        xorl %ecx, %edx
        movl -564(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -312(%ebp)
        movl %edx, -560(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -308(%ebp)
        movl %ecx, -564(%ebp)
        xorl %eax, %esi
        movl %esi, -304(%ebp)
        movl (%edi,%esi,4), %edx
        movl -572(%ebp), %eax
        movl %esi, -568(%ebp)
        movl -396(%ebp), %esi
        xorl %edx, %eax
        movl -576(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl %eax, -300(%ebp)
        movl %eax, -572(%ebp)
        xorl %ecx, %edx
        movl -580(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -296(%ebp)
        movl %edx, -576(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -292(%ebp)
        movl %ecx, -580(%ebp)
        xorl %eax, %esi
        movl -400(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -288(%ebp)
        movl %esi, -396(%ebp)
        movl -412(%ebp), %esi
        xorl %edx, %eax
        movl -404(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl %eax, -284(%ebp)
        movl %eax, -400(%ebp)
        xorl %ecx, %edx
        movl -408(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -280(%ebp)
        movl %edx, -404(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -276(%ebp)
        movl %ecx, -408(%ebp)
        xorl %eax, %esi
        movl -416(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -272(%ebp)
        movl %esi, -412(%ebp)
        xorl %edx, %eax
        movl %eax, -268(%ebp)
        movl (%edi,%eax,4), %ecx
        movl -420(%ebp), %edx
        movl %eax, -416(%ebp)
        movl -428(%ebp), %esi
        xorl %ecx, %edx
        movl -424(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -264(%ebp)
        movl %edx, -420(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -260(%ebp)
        movl %ecx, -424(%ebp)
        xorl %eax, %esi
        movl -432(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -256(%ebp)
        movl %esi, -428(%ebp)
        movl -444(%ebp), %esi
        xorl %edx, %eax
        movl -436(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl %eax, -252(%ebp)
        movl %eax, -432(%ebp)
        xorl %ecx, %edx
        movl -440(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -248(%ebp)
        movl %edx, -436(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -244(%ebp)
        movl %ecx, -440(%ebp)
        xorl %eax, %esi
        movl -448(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -240(%ebp)
        movl %esi, -444(%ebp)
        xorl %edx, %eax
        movl -452(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl %eax, -236(%ebp)
        movl %eax, -448(%ebp)
        xorl %ecx, %edx
        movl %edx, -232(%ebp)
        movl (%edi,%edx,4), %eax
        movl -460(%ebp), %ecx
        movl -468(%ebp), %esi
        movl %edx, -452(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -228(%ebp)
        movl %ecx, -460(%ebp)
        xorl %eax, %esi
        movl -476(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -224(%ebp)
        movl %esi, -468(%ebp)
        movl -500(%ebp), %esi
        xorl %edx, %eax
        movl -484(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl %eax, -220(%ebp)
        movl %eax, -476(%ebp)
        xorl %ecx, %edx
        movl -492(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -216(%ebp)
        movl %edx, -484(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %edx, -216(%ebp)
        movl %edx, -484(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -212(%ebp)
        movl %ecx, -492(%ebp)
        xorl %eax, %esi
        movl -508(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -380(%ebp)
        movl %esi, -500(%ebp)
        xorl %edx, %eax
        movl -516(%ebp), %edx
        movl (%edi,%eax,4), %esi
        movl %eax, -376(%ebp)
        movl %eax, -508(%ebp)
        xorl %esi, %edx
        movl -524(%ebp), %esi
        movl (%edi,%edx,4), %ecx
        movl %edx, -372(%ebp)
        movl %edx, -516(%ebp)
        xorl %ecx, %esi
        movl %esi, -524(%ebp)
        movl -532(%ebp), %ecx
        movl (%edi,%esi,4), %edx
        xorl %edx, %ecx
        movl -540(%ebp), %edx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -532(%ebp)
        xorl %eax, %edx
        movl -548(%ebp), %eax
        xorl (%edi,%edx,4), %eax
        movl %edx, -540(%ebp)
        movl %eax, -584(%ebp)
        movl %eax, -548(%ebp)
        movl (%edi,%eax,4), %eax
        xorl %eax, -208(%ebp)
        cmpl $17, -384(%ebp)
        jne .L16
=============================================================

The loop was unrolled, but it's clear that the address mode selection is worse.
I wonder if this is fixed by TARGET_MEM_REF.
Subject: Re: openssl is slower when compiled with gcc 4.0 than 3.3

The assembler attributed to 4.0 was produced by mainline (or some patched version of 4.0), wasn't it? Otherwise I cannot imagine why the inner loop would be unrolled. For plain 4.0, we get the following code, which seems just fine and equivalent to the one obtained with 3.4 (one of the memory references is strength reduced, but since we still fit into registers, this is OK). I don't just now see what/whether there is some problem with the code produced by 4.1, but I also don't see anything related to addressing mode selection there.

.L21:
        movl S@GOTOFF(%ebx,%eax,4), %eax
        xorl (%edx), %eax
        movl %eax, (%edx)
        movl S@GOTOFF(%ebx,%eax,4), %eax
        xorl 4(%edx), %eax
        movl %eax, 4(%edx)
        movl S@GOTOFF(%ebx,%eax,4), %eax
        xorl 8(%edx), %eax
        movl %eax, 8(%edx)
        movl S@GOTOFF(%ebx,%eax,4), %eax
        xorl 12(%edx), %eax
        movl %eax, 12(%edx)
        movl S@GOTOFF(%ebx,%eax,4), %eax
        xorl 16(%edx), %eax
        movl %eax, 16(%edx)
        movl S@GOTOFF(%ebx,%eax,4), %eax
        xorl 20(%edx), %eax
        movl %eax, 20(%edx)
        movl S@GOTOFF(%ebx,%eax,4), %eax
        xorl 24(%edx), %eax
        movl %eax, 24(%edx)
        movl S@GOTOFF(%ebx,%eax,4), %eax
        xorl 28(%edx), %eax
        movl %eax, 28(%edx)
        addl $32, %edx
        leal -12(%ebp), %esi
        cmpl %esi, %edx
        jne .L21
Could L1 icache blow-out be the reason?
Subject: Re: openssl is slower when compiled with gcc 4.0 than 3.3

> Could L1 icache blow-out be the reason?

This is not likely with the minimized example.
Uhm, at this point, I don't believe anymore that the loop I posted is the cause of the regression. Maybe the regression is somewhere else. I'll investigate.
Looks like the culprit is this:

=========================================================================
static unsigned int S[256];
unsigned
md2_block (unsigned int *sp1, unsigned int *sp2, const unsigned char *d)
{
  register unsigned int t;
  register int i, j;
  static unsigned int state[48];

  j = sp2[16 - 1];
  for (i = 0; i < 16; i++)
    {
      state[i] = sp1[i];
      state[i + 16] = t = d[i];
      state[i + 32] = (t ^ sp1[i]);
      j = sp2[i] ^= S[t ^ j];
    }
}
=========================================================================

gcc 3.4.3 -fPIC -O2:
===================================================
.L5:
        movl 8(%ebp), %esi
        movl (%esi,%ecx,4), %eax
        movl %eax, state.0@GOTOFF(%ebx,%ecx,4)
        movl 16(%ebp), %edx
        movzbl (%edx,%ecx), %eax
        movl %eax, 64+state.0@GOTOFF(%ebx,%ecx,4)
        movl (%esi,%ecx,4), %edx
        xorl %eax, %edx
        movl -16(%ebp), %esi
        xorl -20(%ebp), %eax
        movl %edx, 128+state.0@GOTOFF(%ebx,%ecx,4)
        movl (%esi,%eax,4), %eax
        xorl (%edi,%ecx,4), %eax
        movl %eax, (%edi,%ecx,4)
        incl %ecx
        cmpl $15, %ecx
        movl %eax, -20(%ebp)
        jle .L5
===================================================

gcc 4.1.0 20050529 -fPIC -O2:
===================================================
.L2:
        movl 8(%ebp), %eax
        leal 0(,%edi,4), %ecx
        movl %ecx, -28(%ebp)
        addl %ecx, %eax
        movl 16(%ebp), %ecx
        movl %eax, %edx
        movl %eax, -24(%ebp)
        movl -4(%eax), %eax
        movl %eax, (%esi)
        movzbl -1(%ecx,%edi), %eax
        incl %edi
        movl %eax, 64(%esi)
        movl -4(%edx), %ecx
        movl 12(%ebp), %edx
        xorl %eax, %ecx
        movl %ecx, 128(%esi)
        movl -28(%ebp), %ecx
        addl $4, %esi
        addl %edx, %ecx
        movl -16(%ebp), %edx
        xorl %edx, %eax
        movl -20(%ebp), %edx
        movl (%edx,%eax,4), %eax
        movl -4(%ecx), %edx
        xorl %edx, %eax
        cmpl $17, %edi
        movl %eax, -4(%ecx)
        movl %eax, -16(%ebp)
        jne .L2
===================================================
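For anyone who wants to poke at this locally, here is a self-contained harness around the reduced loop above. Note this is my own sketch, not the attached testcase: the wrapper signature (state passed in, no unused return value), the identity contents of S, and the checked values are all hypothetical choices made so the expected results can be traced by hand.

```c
#include <assert.h>

/* S is filled with the identity permutation below purely to make the
 * expected values easy to trace; the real MD2 S-box differs. */
static unsigned int S[256];

/* Same loop body as the reduced testcase above, with the state array
 * passed in so the caller can inspect it. */
static void
md2_block_reduced (unsigned int *sp1, unsigned int *sp2,
                   const unsigned char *d, unsigned int *state)
{
  unsigned int t = 0;
  int i, j;

  j = sp2[16 - 1];
  for (i = 0; i < 16; i++)
    {
      state[i] = sp1[i];
      state[i + 16] = t = d[i];
      state[i + 32] = (t ^ sp1[i]);
      j = sp2[i] ^= S[t ^ j];
    }
}

/* Run one block with the identity "S-box", zeroed sp1/sp2 and d[i] = i,
 * then verify a few hand-traced values.  Returns 1 on success. */
static int
md2_selfcheck (void)
{
  unsigned int sp1[16] = { 0 }, sp2[16] = { 0 }, state[48];
  unsigned char d[16];
  int i;

  for (i = 0; i < 256; i++)
    S[i] = (unsigned int) i;
  for (i = 0; i < 16; i++)
    d[i] = (unsigned char) i;

  md2_block_reduced (sp1, sp2, d, state);

  /* hand trace: j starts at 0, so sp2[i] becomes d[i] ^ j */
  return state[16 + 5] == 5      /* state[i+16] = d[i]            */
      && state[32 + 7] == 7      /* state[i+32] = d[i] ^ sp1[i]   */
      && sp2[2] == 3             /* t=2, previous j=1, S[2^1] = 3 */
      && sp2[14] == 15;          /* t=14, previous j=1, 14^1 = 15 */
}
```

With an -fPIC build, the S[...] lookup inside the loop is what forces the S@GOTOFF(%ebx,...) addressing seen in the dumps above.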
Subject: Re: openssl is slower when compiled with gcc 4.0 than 3.3

> Looks like the culprit is this:
>
> =========================================================================
> static unsigned int S[256];
> unsigned
> md2_block (unsigned int *sp1, unsigned int *sp2, const unsigned char *d)
> {
>   register unsigned int t;
>   register int i, j;
>   static unsigned int state[48];
>
>   j = sp2[16 - 1];
>   for (i = 0; i < 16; i++)
>     {
>       state[i] = sp1[i];
>       state[i + 16] = t = d[i];
>       state[i + 32] = (t ^ sp1[i]);
>       j = sp2[i] ^= S[t ^ j];
>     }
> }
> =========================================================================

with the TARGET_MEM_REFs patch the result is much better. At least we avoid the multiplication by 4

> leal 0(,%edi,4), %ecx

and other results of the DOM misoptimization of addressing modes, which was one of the main motivations for TARGET_MEM_REFs. We still use one more iv than in the 3.4 case, and as a result we need one more register.

.L2:
        movl 8(%ebp), %edi
        movl -4(%edi,%ecx,4), %eax
        movl %eax, (%esi)
        movl 16(%ebp), %edx
        movzbl -1(%ecx,%edx), %eax
        movl %eax, 64(%esi)
        movl -4(%edi,%ecx,4), %edx
        xorl %eax, %edx
        movl %edx, 128(%esi)
        xorl -20(%ebp), %eax
        movl -16(%ebp), %edi
        movl (%edi,%eax,4), %eax
        movl 12(%ebp), %edx
        xorl -4(%edx,%ecx,4), %eax
        movl %eax, -4(%edx,%ecx,4)
        movl %eax, -20(%ebp)
        incl %ecx
        addl $4, %esi
        cmpl $17, %ecx
        jne .L2
We're learning more about this bug. Anthony Danalis has boiled down the testcase much further; I'll attach the reduced testcase as foo4.i.

It looks like it shows up if your /proc/cpuinfo says

vendor_id  : GenuineIntel
cpu family : 15
model      : 2
model name : Intel(R) Pentium(R) 4 CPU 2.80GHz
stepping   : 9
cpu MHz    : 2793.051
cache size : 512 KB

but not if your /proc/cpuinfo says

vendor_id  : GenuineIntel
cpu family : 15
model      : 4
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping   : 1
cpu MHz    : 3200.255
cache size : 1024 KB

But here's the fun part: on the newer CPU with the bigger cache, gcc-2.95.3 was just as slow as gcc-3.4.3/gcc-4.0.0. Go figure. We'll add more details once we've got more info.
Created attachment 9102 [details]
10x smaller testcase
(In reply to comment #14)
> We're learning more about this bug.
> Anthony Danalis has boiled down the testcase much further;
> I'll attach the reduced testcase as foo4.i.

The difference between those two is that the second one is not really a P4 but a new core entirely; Intel marketing at its best.
Looks to me like gcc-3.4.3 is known to fail, too, depending on the CPU. Anthony Danalis and I came up with a little script to run foo4.i on various processors with various values for -mtune, which I'll attach; here are the results for four different x86 variants. The last two columns are the time on gcc-3.4.3 and gcc-4.0.0 divided by the time on gcc-2.95.3, so any value above 1.0 in the last column is a performance regression. Rows are sorted by the last column. The first five rows represent performance regressions for gcc-3.4.3; the first three also represent performance regressions for gcc-4.0.0.

family,model,name              pic?   tune       [t_295, t_343, t_400]  [t_295/t_295, t_343/t_295, t_400/t_295]
6,8,  Pentium III (Coppermine), -fPIC  athlon-xp  [9.25, 16.22, 18.79]   [1.00, 1.75, 2.03]
15,2, Xeon(TM) CPU 2.60GHz,     -fPIC  pentium4   [1.91, 3.89, 3.27]     [1.00, 2.04, 1.71]
6,8,  Pentium III (Coppermine), -fPIC  pentium3   [9.15, 10.10, 13.20]   [1.00, 1.10, 1.44]
15,2, Xeon(TM) CPU 2.60GHz,     -fPIC  athlon-xp  [1.91, 2.00, 1.95]     [1.00, 1.05, 1.02]
6,8,  Pentium III (Coppermine), -fPIC  pentium4   [9.27, 10.49, 8.87]    [1.00, 1.13, 0.96]
--- ok below this line ---
6,8,  Pentium III (Coppermine),        pentium4   [14.74, 13.71, 14.12]  [1.00, 0.93, 0.96]
15,4, Athlon(tm) 64 3000+,      -fPIC  pentium4   [4.12, 3.68, 3.74]     [1.00, 0.89, 0.91]
15,4, Pentium(R) 4 CPU 3.20GHz, -fPIC  pentium4   [2.48, 2.18, 2.09]     [1.00, 0.88, 0.84]
15,4, Athlon(tm) 64 3000+,      -fPIC  athlon-xp  [4.12, 3.50, 3.20]     [1.00, 0.85, 0.78]
15,4, Pentium(R) 4 CPU 3.20GHz,        pentium4   [2.17, 1.07, 1.07]     [1.00, 0.49, 0.49]
6,8,  Pentium III (Coppermine),        pentium3   [14.22, 6.26, 6.46]    [1.00, 0.44, 0.45]
6,8,  Pentium III (Coppermine),        athlon-xp  [14.93, 6.26, 6.27]    [1.00, 0.42, 0.42]
15,4, Athlon(tm) 64 3000+,             pentium4   [3.65, 1.39, 1.39]     [1.00, 0.38, 0.38]
15,4, Athlon(tm) 64 3000+,             athlon-xp  [3.65, 1.39, 1.40]     [1.00, 0.38, 0.38]
15,2, Xeon(TM) CPU 2.60GHz,            pentium4   [6.42, 0.97, 0.98]     [1.00, 0.15, 0.15]
Created attachment 9106 [details]
script used to benchmark the problem code

We also tried -mtune=nocona and -mtune=prescott, but they gave identical results to -mtune=pentium4 on the processors we ran on (piii, p4, k8).
To be clear, here are the two most worrying rows from the above table, expanded a bit. These are the runtimes of foo4.i in seconds. The cpu family, model, and name are as shown by /proc/cpuinfo.

cpu family 15, model 2, Intel(R) Xeon(TM) CPU 2.60GHz: -fPIC -mtune=pentium4 -O3
  gcc-2.95.3: 1.91 seconds
  gcc-3.4.3:  3.89
  gcc-4.0.0:  3.27

cpu family 6, model 8, Pentium III (Coppermine): -fPIC -mtune=pentium3 -O3
  gcc-2.95.3: 9.15
  gcc-3.4.3:  10.10
  gcc-4.0.0:  13.20

gcc-4.0.0 produces code that runs 1.7 and 1.4 times slower than gcc-2.95.3 on these (fairly common!) cpus, even when the proper -mtune is used.
The above tests did not use -mcpu on gcc-2.95.3, so they were comparing apples to oranges, kind of. I reran them on a PIII with gcc-2.95.3 -mcpu=$tune -O3 and gcc-[34] -mtune=$tune -O3. The problem persists even when using the most appropriate tuning option for the CPU in question.

cpu family 6, model 8, Pentium III (Coppermine): -fPIC -mcpu=pentium -O3
  gcc-2.95.3: 7.61
  gcc-3.4.3:  27.43
  gcc-4.0.0:  17.57

cpu family 6, model 8, Pentium III (Coppermine): -fPIC -mcpu=pentiumpro -O3
  gcc-2.95.3: 9.27
  gcc-3.4.3:  10.09
  gcc-4.0.0:  13.96

cpu family 15, model 2, Intel(R) Xeon(TM) CPU 2.60GHz: -fPIC -mtune=pentium4 -O3
  gcc-2.95.3: 1.91 seconds
  gcc-3.4.3:  3.89
  gcc-4.0.0:  3.27
I asked the fellow who posted the original problem report to give me the results of 'cat /proc/cpuinfo' on the affected machine. Here it is:

vendor_id  : GenuineIntel
cpu family : 6
model      : 8
model name : Pentium III (Coppermine)
stepping   : 10
cpu MHz    : 896.153

This is the same as one of the two affected CPU types here. The slow routine appears to be the buffer cleaning routine, though I haven't verified this with oprofile yet. Here's its loop:

static char cleanse_ctr;
...
while (len--) {
    *(ptr++) = cleanse_ctr;
    cleanse_ctr += (17 + (unsigned char) ((int) ptr & 0xF));
}

and the output of -O3 -fPIC for both gcc-2.95.3 and gcc-4.0.0:

--- gcc-2.95.3 ---
.L5:
        movl cleanse_ctr@GOT(%ebx),%edi
        movb (%edi),%al
        movb %al,(%edx)
        incl %edx
        movb (%edi),%cl
        addb $17,%cl
        movb %dl,%al
        andb $15,%al
        addb %al,%cl
        movb %cl,(%edi)
        subl $1,%esi
        jnc .L5
.L4:

--- gcc-4 ---
.L4:
        movb (%esi), %al
        movb %al, (%edx)
        leal (%ecx,%edi), %eax
        andl $15, %eax
        incl %ecx
        addb (%esi), %al
        incl %edx
        addl $17, %eax
        cmpl %ecx, 12(%ebp)
        movb %al, (%esi)
        jne .L4

It's not obvious to me why the gcc-4.0.0 generated code should be slower when run on some CPUs, if in fact it is. Is it the fact that the loop condition is checked with a cmp against memory instead of a flag being set by subtracting 1 from a register? (And where's the best place to learn about how to predict how long assembly snippets like this will take to run on various modern CPUs, anyway?)
Michael Meissner looked at the code, and saw that gcc-2.95.3 converts the loop to a countdown loop, but gcc-3.x doesn't, which wastes a precious register.
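For readers who don't speak x86, the difference can be sketched in C. This is a hypothetical illustration (not OpenSSL code): an up-counting loop must keep both the index and the limit live across the body, while a countdown loop only needs the counter itself, whose decrement also sets the flags for the exit branch; on register-starved i386/PIC that saves one precious register.

```c
#include <string.h>

/* Up-counting shape (roughly what gcc-3.x/4.x generate here): both i
 * and len must stay live, and with -fPIC there may be no register
 * left for len, forcing a compare against memory each iteration. */
static void
fill_upcount (unsigned char *ptr, unsigned int len)
{
  unsigned int i;
  for (i = 0; i != len; i++)
    ptr[i] = (unsigned char) i;
}

/* Countdown shape (roughly what gcc-2.95 generates): only len stays
 * live; its decrement itself produces the loop-exit condition. */
static void
fill_countdown (unsigned char *ptr, unsigned int len)
{
  unsigned char v = 0;
  while (len--)
    *ptr++ = v++;
}

/* The two shapes are semantically identical; returns 1 if the two
 * loops produce the same buffer contents. */
static int
loops_agree (void)
{
  unsigned char a[64], b[64];
  fill_upcount (a, sizeof a);
  fill_countdown (b, sizeof b);
  return memcmp (a, b, sizeof a) == 0;
}
```

Whether the compiler actually performs this conversion is exactly the codegen question being discussed; the sketch only shows what the transformation means at the source level.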
And, for what it's worth, the latest 4.1 snapshot also suffers from this.
I don't see how the precious register would matter much. But this compare with memory is strange: cmpl %ecx, 12(%ebp) Why isn't len loaded into a register??
Subject: Re: [4.0/4.1 Regression] openssl is slower when compiled with gcc 4.0 than 3.3

> I don't see how the precious register would matter much. But this compare
> with memory is strange:
>
> cmpl %ecx, 12(%ebp)
>
> Why isn't len loaded into a register??

You answer your own question -- because there is no register free; that's why the precious register matters that much. (I guess; I may be wrong.)

Zdenek
(In reply to comment #21)
> The slow routine appears to be the buffer cleaning routine,
> though I haven't verified this with oprofile yet.
> Here's its loop:
> static char cleanse_ctr;
> ...
> while (len--) {
>     *(ptr++) = cleanse_ctr;
>     cleanse_ctr += (17 + (unsigned char) ((int) ptr & 0xF));
> }

[Not entirely related, but..] There's one obvious way to improve this loop. The compiler cannot prove that the write *(ptr++) does not alias the global variable cleanse_ctr, so it will read it from memory in each iteration. To avoid the extra memory read, just do something like:

void OPENSSL_cleanse(unsigned char *ptr, unsigned int len)
{
    unsigned char local_cleanse_ctr = cleanse_ctr;
    while (len--) {
        *(ptr++) = local_cleanse_ctr;
        local_cleanse_ctr += (17 + (unsigned char) ((int) ptr & 0xF));
    }
    local_cleanse_ctr += 63;
    cleanse_ctr = local_cleanse_ctr;
}
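As a sanity check that the suggested local-copy rewrite doesn't change what ends up in the buffer, here is a minimal, self-contained sketch of both variants. The names are made up, the `(int)` cast is replaced with `uintptr_t` for portability, and the trailing `+= 63` from the suggestion is dropped so the two versions also leave the counter in the same state, which makes the equivalence checkable.

```c
#include <stdint.h>
#include <string.h>

static unsigned char cleanse_ctr;

/* Original shape: the store through ptr may alias the global, so the
 * compiler must reload cleanse_ctr from memory every iteration. */
static void
cleanse_alias (unsigned char *ptr, unsigned int len)
{
  while (len--)
    {
      *(ptr++) = cleanse_ctr;
      cleanse_ctr += (unsigned char) (17 + ((uintptr_t) ptr & 0xF));
    }
}

/* Suggested shape: work on a local copy, write the global back once. */
static void
cleanse_local (unsigned char *ptr, unsigned int len)
{
  unsigned char c = cleanse_ctr;
  while (len--)
    {
      *(ptr++) = c;
      c += (unsigned char) (17 + ((uintptr_t) ptr & 0xF));
    }
  cleanse_ctr = c;
}

/* Run both over the SAME buffer (the counter update depends on the
 * buffer's address) and compare results.  Returns 1 on success. */
static int
cleanse_variants_agree (void)
{
  unsigned char buf[32], ref[32], ref_ctr;

  cleanse_ctr = 0;
  cleanse_alias (buf, sizeof buf);
  memcpy (ref, buf, sizeof buf);
  ref_ctr = cleanse_ctr;

  cleanse_ctr = 0;
  cleanse_local (buf, sizeof buf);

  return memcmp (ref, buf, sizeof buf) == 0 && cleanse_ctr == ref_ctr;
}
```

The local-copy version lets the compiler keep the counter in a register for the whole loop regardless of aliasing, which is the whole point of the suggestion.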
Ivopts seems to make several quite dubious decisions in this testcase.
Re. comment #25, as far as I can tell there are registers available in that loop. To quote the loop from comment #12:

.L4:
        movb (%esi), %al
        movb %al, (%edx)
        leal (%ecx,%edi), %eax
        andl $15, %eax
        incl %ecx
        addb (%esi), %al
        incl %edx
        addl $17, %eax
        cmpl %ecx, 12(%ebp)
        movb %al, (%esi)
        jne .L4

Checking off used registers in this loop:
%esi x
%edi x
%eax x
%ebx
%ecx x
%edx x

So %ebx at least is free (and iiuc, with -fomit-frame-pointer %ebp is also free, right?). Maybe the allocator thinks %ebx can't be used because it is the PIC register.

Here is what mainline today ("GCC: (GNU) 4.1.0 20050625 (experimental)") gives me (x86-64 compiler with "-m32 -march=i686 -O3 -fPIC"):

.L4:
        movzbl (%esi), %eax
        movb %al, (%ecx)
        incl %ecx
        movzbl -13(%ebp), %eax
        movzbl (%esi), %edx
        incb -13(%ebp)
        andb $15, %al
        addb $17, %dl
        addb %dl, %al
        cmpl %edi, %ecx
        movb %al, (%esi)
        jne .L4

The .optimized tree dump looks like this:

<bb 0>:
  len.23 = len - 1;
  if (len.23 != 4294967295) goto <L6>; else goto <L2>;

<L6>:;
  ivtmp.19 = (unsigned char) (signed char) (int) (ptr + 1B);
  ptr.27 = ptr;

<L0>:;
  MEM[base: ptr.27] = cleanse_ctr;
  ptr.27 = ptr.27 + 1B;
  cleanse_ctr = (unsigned char) (((signed char) ivtmp.19 & 15) + (signed char) cleanse_ctr + 17);
  ivtmp.19 = ivtmp.19 + 1;
  if (ptr.27 != (unsigned char *) (ptr + (void *) len.23 + 1B)) goto <L0>; else goto <L2>;

<L2>:;
  cleanse_ctr = (unsigned char) ((signed char) cleanse_ctr + 63);
  return;

Note how the loop test is against ptr. Also, as far as I can tell the right-hand side of the test (i.e. "(ptr + (void *) len.23 + 1B)") is loop invariant and should have been moved out. And the first two lines are also just weird; it is probably cheaper on almost any machine to do

  len.23 = len;
  if (len.23 != 0) goto <L6>; else goto <L2>;

<L6>:
  len.23 = len.23 - 1;
  (etc...)

In summary, we just produce crap code here ;-)
Subject: Re: [4.0/4.1 Regression] openssl is slower when compiled with gcc 4.0 than 3.3

> ------- Additional Comments From steven at gcc dot gnu dot org 2005-06-25 10:15 -------
> Re. comment #25, as far as I can tell there are registers available in
> that loop. To quote the loop from comment #12:
>
> .L4:
>         movb (%esi), %al
>         movb %al, (%edx)
>         leal (%ecx,%edi), %eax
>         andl $15, %eax
>         incl %ecx
>         addb (%esi), %al
>         incl %edx
>         addl $17, %eax
>         cmpl %ecx, 12(%ebp)
>         movb %al, (%esi)
>         jne .L4
>
> Checking off used registers in this loop:
> %esi x
> %edi x
> %eax x
> %ebx
> %ecx x
> %edx x
>
> So %ebx at least is free (and iiuc, with -fomit-frame-pointer %ebp is
> also free, right?). Maybe the allocator thinks %ebx can't be used
> because it is the PIC register.

yes, ebx cannot be used because of pic, and -fomit-frame-pointer is off by default.

> Here is what mainline today ("GCC: (GNU) 4.1.0 20050625 (experimental)")
> gives me (x86-64 compiler with "-m32 -march=i686 -O3 -fPIC"):
>
> .L4:
>         movzbl (%esi), %eax
>         movb %al, (%ecx)
>         incl %ecx
>         movzbl -13(%ebp), %eax
>         movzbl (%esi), %edx
>         incb -13(%ebp)
>         andb $15, %al
>         addb $17, %dl
>         addb %dl, %al
>         cmpl %edi, %ecx
>         movb %al, (%esi)
>         jne .L4
>
> The .optimized tree dump looks like this:
>
> <bb 0>:
>   len.23 = len - 1;
>   if (len.23 != 4294967295) goto <L6>; else goto <L2>;
>
> And the first two lines are
> also just weird, it is probably cheaper on almost any machine to do
>   len.23 = len;
>   if (len.23 != 0) goto <L6>; else goto <L2>;
>
> <L6>:
>   len.23 = len.23 - 1;
>   (etc...)

Not really. On i686, there should be no difference.
Created attachment 9191 [details]
a possible patch

This patch could help; I need to benchmark it before submitting it.
(In reply to comment #30)
> This patch could help; I need to benchmark it before submitting it.

Any news about this patch?
Leaving as P2 as this is a significant pessimization on a significant piece of code on relatively common processors.
Zdenek, any news about your patch from comment #30?
It behaves somewhat erratically on SPEC2000 (it increases the overall score, but there are some significant regressions). And it also causes us to produce worse code for this testcase at the moment, due to a misunderstanding between ivopts and fold; the expression

  (unsigned char) (signed char) (int) (ptr + 1B) - (unsigned char) ptr

is produced, and it is not folded to 1.
Created attachment 10263 [details]
Patch

After some playing with fold, I arrived at the following patch, which almost works. With the patch, the code for the loop is

<L0>:;
  MEM[base: ptr]{*ptr} = cleanse_ctr;
  ptr = ptr + 1B;
  cleanse_ctr = (unsigned char) (((signed char) ptr & 15) + (signed char) cleanse_ctr + 17);
  len = len - 1;
  if (len != 0) goto <L0>; else goto <L2>;

which seems just fine. The assembler is

.L3:
        movb (%edi), %al
        movb %al, (%ecx)
        incl %ecx
        movb %cl, %al
        andl $15, %eax
        movb (%edi), %dl
        addl $17, %edx
        addl %edx, %eax
        movb %al, (%edi)
        decl %esi
        jne .L3

which also seems OK to me. However, the "ugly" version we produce without the patch:

.L4:
        movb (%edi), %al
        movb %al, (%ecx)
        incl %ecx
        movb -16(%ebp), %al
        addl %esi, %eax
        andl $15, %eax
        movb (%edi), %dl
        addl $17, %edx
        addl %edx, %eax
        movb %al, (%edi)
        incl %esi
        cmpl 12(%ebp), %esi
        jne .L4

is faster by 30%, for reasons I just don't understand :-(
This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.
Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.
For the reduced testcase in comment #12, we get:

.L2:
        movl 8(%ebp), %edx
        movl -4(%edx,%ecx,4), %eax
        movl %eax, (%esi)
        movl 16(%ebp), %edi
        movzbl -1(%ecx,%edi), %edx
        movl %edx, 64(%esi)
        xorl %edx, %eax
        movl %eax, 128(%esi)
        movl 12(%ebp), %edi
        movl -4(%edi,%ecx,4), %eax
        xorl -16(%ebp), %edx
        xorl S@GOTOFF(%ebx,%edx,4), %eax
        movl %eax, -4(%edi,%ecx,4)
        movl %eax, -16(%ebp)
        addl $1, %ecx
        addl $4, %esi
        cmpl $17, %ecx
        jne .L2

Does someone know whether this is good code or not? I was never a good reader of x86 code.
I think this has been improved for 4.3; it would be nice if someone did some timings with 4.3.
Can someone please redo the timings for GCC 4.3?
OK, I'll see if I can get that done.
Created attachment 14656 [details]
Benchmarks

I ran the benchmarks for the minimal testcase (using Dan Kegel's script) on a Core 2 Duo, AMD Dual Core Opteron, and a Pentium 3 using GCC 2.95.3, 3.4.3, 4.0.1, and 4.3.0 from svn. I'm still looking for a Pentium 4 to test on.
Of the results posted above, the only interesting one that is still slower than gcc-2.95 is a Pentium 3 with -fPIC. (Happily, gcc-4.3 is an improvement there, but it's still worse than 2.95.) Oddly, this one does better with athlon-xp tuning.

/proc/cpuinfo info                                         pic?   tune            [t_34/t_295, t_401/t_295, t_43/t_295]
cpu family 6, model 11, Intel(R) Pentium(R) III CPU - S 1266MHz,  -fPIC  tune=pentium3   [1.37, 1.72, 1.20]
cpu family 6, model 11, Intel(R) Pentium(R) III CPU - S 1266MHz,  -fPIC  tune=athlon-xp  [2.16, 2.49, 1.10]

I'm not sure how important Pentium III performance is anymore.
Fixed for 4.3. WONTFIX for the branches.