here's openssl speed results when it's compiled with 3.3 (original debian unstable package):

options:bn(64,32) md2(int) rc4(idx,int) des(ptr,risc1,16,long) aes(partial) blowfish(idx)
compiler: gcc -fPIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DOPENSSL_NO_KRB5 -DOPENSSL_NO_IDEA -DOPENSSL_NO_MDC2 -DOPENSSL_NO_RC5 -DL_ENDIAN -DTERMIO -O3 -march=i686 -mcpu=i686 -fomit-frame-pointer -Wall -DSHA1_ASM -DMD5_ASM -DRMD160_ASM
available timing options: TIMES TIMEB HZ=100 [sysconf value]
timing function used: times
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md2                510.80k     1064.79k     1486.96k     1641.83k     1702.87k
mdc2                  0.00         0.00         0.00         0.00         0.00
md4               4999.47k    17746.97k    51392.88k    97451.59k   131711.89k
md5               4405.95k    15208.16k    43027.34k    77946.11k   101040.96k
hmac(md5)         4951.58k    16851.67k    46126.90k    81002.65k   101700.77k
sha1              3892.54k    12223.89k    29586.19k    45767.99k    54082.03k
rmd160            3715.14k    10397.52k    23079.49k    33148.87k    37651.83k
rc4              58941.98k    66899.63k    71733.39k    72572.54k    72476.92k
des cbc          13353.92k    13897.80k    14067.26k    14088.53k    14107.61k
des ede3          4887.63k     5039.28k     5083.63k     5116.70k     5086.58k
idea cbc              0.00         0.00         0.00         0.00         0.00
rc2 cbc           5257.37k     5534.13k     5560.97k     5610.12k     5582.42k
rc5-32/12 cbc         0.00         0.00         0.00         0.00         0.00
blowfish cbc     21054.83k    22340.34k    22704.49k    22895.90k    22860.91k
cast cbc         14478.39k    15882.31k    16400.99k    16570.03k    16585.01k
aes-128 cbc      13612.33k    14364.39k    14382.68k    14404.12k    14440.26k
aes-192 cbc      12075.70k    12370.43k    12530.49k    12518.63k    12559.92k
aes-256 cbc      10806.91k    11093.65k    11179.27k    11185.67k    11205.97k
                   sign     verify    sign/s  verify/s
rsa  512 bits   0.0023s   0.0002s     438.5    4928.2
rsa 1024 bits   0.0109s   0.0006s      91.6    1746.1
rsa 2048 bits   0.0646s   0.0019s      15.5     527.6
rsa 4096 bits   0.4317s   0.0066s       2.3     152.0
                   sign     verify    sign/s  verify/s
dsa  512 bits   0.0018s   0.0022s     546.0     460.7
dsa 1024 bits   0.0054s   0.0065s     186.6     154.8
dsa 2048 bits   0.0179s   0.0220s      55.7      45.5

and here's the same package compiled with gcc 4.0, gcc-4.0 (GCC) 4.0.0 20050212 (experimental):

compiler: gcc -fPIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DOPENSSL_NO_KRB5 -DOPENSSL_NO_IDEA -DOPENSSL_NO_MDC2 -DOPENSSL_NO_RC5 -DL_ENDIAN -DTERMIO -O3 -march=i686 -mcpu=i686 -fomit-frame-pointer -Wall -DSHA1_ASM -DMD5_ASM -DRMD160_ASM
available timing options: TIMES TIMEB HZ=100 [sysconf value]
timing function used: times
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md2                361.81k      781.01k     1103.19k     1231.36k     1278.84k
mdc2                  0.00         0.00         0.00         0.00         0.00
md4               3103.64k    11338.88k    36135.04k    79292.67k   123123.36k
md5               2758.32k    10084.74k    31863.54k    66522.25k    98860.02k
hmac(md5)         4581.08k    15784.49k    43771.66k    78227.60k   101959.42k
sha1              2638.72k     8889.12k    24063.88k    41890.99k    53462.15k
rmd160            2477.15k     7918.19k    19696.52k    31106.04k    37317.88k
rc4              60284.27k    67543.46k    71379.34k    72455.38k    72581.12k
des cbc          13547.77k    13876.64k    14049.67k    14102.25k    14020.78k
des ede3          4950.20k     5050.99k     5068.80k     5111.00k     5088.06k
idea cbc              0.00         0.00         0.00         0.00         0.00
rc2 cbc           5814.75k     6060.45k     6150.37k     6169.60k     6196.13k
rc5-32/12 cbc         0.00         0.00         0.00         0.00         0.00
blowfish cbc     20941.23k    22373.68k    22868.43k    22822.28k    23014.29k
cast cbc         12790.60k    14102.95k    14514.24k    14494.77k    14622.21k
aes-128 cbc      13030.43k    13549.49k    13653.51k    13694.85k    13696.33k
aes-192 cbc      11257.66k    11517.92k    11545.25k    11604.32k    11568.43k
aes-256 cbc      10065.01k    10296.48k    10403.82k    10332.02k    10382.25k
                   sign     verify    sign/s  verify/s
rsa  512 bits   0.0024s   0.0002s     418.5    4201.7
rsa 1024 bits   0.0112s   0.0006s      89.5    1550.7
rsa 2048 bits   0.0650s   0.0020s      15.4     504.9
rsa 4096 bits   0.4311s   0.0068s       2.3     147.9
                   sign     verify    sign/s  verify/s
dsa  512 bits   0.0019s   0.0023s     521.4     441.9
dsa 1024 bits   0.0055s   0.0067s     182.9     148.3
dsa 2048 bits   0.0181s   0.0222s      55.2      45.1

as you can see, almost every test is worse with 4.0. Not sure why. The same test on ultrasparc and amd64 shows 4.0 as a clear winner. (Although it still crashes on amd64... ;) )
We need a self-contained example.
No feedback in 3 months.
When we ran 'openssl speed md2', we did see that gcc-4.0 was slower than earlier versions, so we created a minimal test case, which we will attach. Here is how long it took to run a 34 megabyte file through the test program when compiled with various compilers and options:

gcc-2.95.3 -fPIC -O1   4.940s
gcc-4.0.0  -fPIC -O1   3.510s
gcc-3.4.3  -fPIC -O1   5.190s
gcc-2.95.3 -fPIC -O2   3.470s
gcc-3.4.3  -fPIC -O2   3.460s
gcc-4.0.0  -fPIC -O2   4.050s
gcc-2.95.3 -fPIC -O3   3.400s
gcc-3.4.3  -fPIC -O3   3.740s
gcc-4.0.0  -fPIC -O3   4.010s

This test was done on a Pentium 4 workstation, and no smoothing was done on the resulting times, but they seemed to be repeatable. We also tried without -fPIC, but did not see as large a regression there.
Created attachment 9010 [details]
small preprocessed standalone test based on openssl md2
I would not be surprised if this is just a failure to use the i386 addressing modes.
Confirmed. The regression appears only with -fPIC, and it's pretty evident. The core is md2_block, the inner loop:

GCC 3.4
=============================================================
.L29:
        xorl $0, %edx
        .p2align 2,,3
.L28:
        movl S@GOTOFF(%ebx,%eax,4), %esi
        xorl -216(%ebp,%edx,4), %esi
        movl S@GOTOFF(%ebx,%esi,4), %eax
        xorl -212(%ebp,%edx,4), %eax
        movl S@GOTOFF(%ebx,%eax,4), %edi
        xorl -208(%ebp,%edx,4), %edi
        movl %esi, -216(%ebp,%edx,4)
        movl S@GOTOFF(%ebx,%edi,4), %esi
        xorl -204(%ebp,%edx,4), %esi
        movl %eax, -212(%ebp,%edx,4)
        movl S@GOTOFF(%ebx,%esi,4), %eax
        xorl -200(%ebp,%edx,4), %eax
        movl %edi, -208(%ebp,%edx,4)
        movl S@GOTOFF(%ebx,%eax,4), %edi
        xorl -196(%ebp,%edx,4), %edi
        movl %esi, -204(%ebp,%edx,4)
        movl S@GOTOFF(%ebx,%edi,4), %esi
        xorl -192(%ebp,%edx,4), %esi
        movl %eax, -200(%ebp,%edx,4)
        movl S@GOTOFF(%ebx,%esi,4), %eax
        xorl -188(%ebp,%edx,4), %eax
        movl %edi, -196(%ebp,%edx,4)
        movl %esi, -192(%ebp,%edx,4)
        movl %eax, -188(%ebp,%edx,4)
        addl $8, %edx
        cmpl $47, %edx
        jle .L28
        addl %ecx, %eax
        incl %ecx
        andl $255, %eax
        cmpl $17, %ecx
        jle .L29
=============================================================

GCC 4.0
=============================================================
.L16:
        movl -384(%ebp), %eax
        movl -208(%ebp), %esi
        incl -384(%ebp)
        addl %esi, %eax
        movl -456(%ebp), %esi
        andl $255, %eax
        movl (%edi,%eax,4), %ecx
        movl -464(%ebp), %eax
        xorl %ecx, %esi
        movl (%edi,%esi,4), %edx
        movl %esi, -368(%ebp)
        movl %esi, -456(%ebp)
        movl -488(%ebp), %esi
        xorl %edx, %eax
        movl -472(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl (%edi,%eax,4), %ecx
        movl %eax, -364(%ebp)
        movl %eax, -464(%ebp)
        xorl %ecx, %edx
        movl -480(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -360(%ebp)
        movl %edx, -472(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -356(%ebp)
        movl %ecx, -480(%ebp)
        xorl %eax, %esi
        movl -496(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -352(%ebp)
        movl %esi, -488(%ebp)
        xorl %edx, %eax
        movl -504(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl %eax, -348(%ebp)
        movl %eax, -496(%ebp)
        xorl %ecx, %edx
        movl -512(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -344(%ebp)
        movl %edx, -504(%ebp)
        xorl %eax, %ecx
        movl %ecx, -340(%ebp)
        movl (%edi,%ecx,4), %eax
        movl -520(%ebp), %esi
        movl %ecx, -512(%ebp)
        xorl %eax, %esi
        movl -528(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -336(%ebp)
        movl %esi, -520(%ebp)
        movl -552(%ebp), %esi
        xorl %edx, %eax
        movl -536(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl %eax, -332(%ebp)
        movl %eax, -528(%ebp)
        xorl %ecx, %edx
        movl -544(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -328(%ebp)
        movl %edx, -536(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -324(%ebp)
        movl %ecx, -544(%ebp)
        xorl %eax, %esi
        movl -556(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -320(%ebp)
        movl %esi, -552(%ebp)
        movl -568(%ebp), %esi
        xorl %edx, %eax
        movl -560(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl %eax, -316(%ebp)
        movl %eax, -556(%ebp)
        xorl %ecx, %edx
        movl -564(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -312(%ebp)
        movl %edx, -560(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -308(%ebp)
        movl %ecx, -564(%ebp)
        xorl %eax, %esi
        movl %esi, -304(%ebp)
        movl (%edi,%esi,4), %edx
        movl -572(%ebp), %eax
        movl %esi, -568(%ebp)
        movl -396(%ebp), %esi
        xorl %edx, %eax
        movl -576(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl %eax, -300(%ebp)
        movl %eax, -572(%ebp)
        xorl %ecx, %edx
        movl -580(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -296(%ebp)
        movl %edx, -576(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -292(%ebp)
        movl %ecx, -580(%ebp)
        xorl %eax, %esi
        movl -400(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -288(%ebp)
        movl %esi, -396(%ebp)
        movl -412(%ebp), %esi
        xorl %edx, %eax
        movl -404(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl %eax, -284(%ebp)
        movl %eax, -400(%ebp)
        xorl %ecx, %edx
        movl -408(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -280(%ebp)
        movl %edx, -404(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -276(%ebp)
        movl %ecx, -408(%ebp)
        xorl %eax, %esi
        movl -416(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -272(%ebp)
        movl %esi, -412(%ebp)
        xorl %edx, %eax
        movl %eax, -268(%ebp)
        movl (%edi,%eax,4), %ecx
        movl -420(%ebp), %edx
        movl %eax, -416(%ebp)
        movl -428(%ebp), %esi
        xorl %ecx, %edx
        movl -424(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -264(%ebp)
        movl %edx, -420(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -260(%ebp)
        movl %ecx, -424(%ebp)
        xorl %eax, %esi
        movl -432(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -256(%ebp)
        movl %esi, -428(%ebp)
        movl -444(%ebp), %esi
        xorl %edx, %eax
        movl -436(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl %eax, -252(%ebp)
        movl %eax, -432(%ebp)
        xorl %ecx, %edx
        movl -440(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -248(%ebp)
        movl %edx, -436(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -244(%ebp)
        movl %ecx, -440(%ebp)
        xorl %eax, %esi
        movl -448(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -240(%ebp)
        movl %esi, -444(%ebp)
        xorl %edx, %eax
        movl -452(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl %eax, -236(%ebp)
        movl %eax, -448(%ebp)
        xorl %ecx, %edx
        movl %edx, -232(%ebp)
        movl (%edi,%edx,4), %eax
        movl -460(%ebp), %ecx
        movl -468(%ebp), %esi
        movl %edx, -452(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -228(%ebp)
        movl %ecx, -460(%ebp)
        xorl %eax, %esi
        movl -476(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -224(%ebp)
        movl %esi, -468(%ebp)
        movl -500(%ebp), %esi
        xorl %edx, %eax
        movl -484(%ebp), %edx
        movl (%edi,%eax,4), %ecx
        movl %eax, -220(%ebp)
        movl %eax, -476(%ebp)
        xorl %ecx, %edx
        movl -492(%ebp), %ecx
        movl (%edi,%edx,4), %eax
        movl %edx, -216(%ebp)
        movl %edx, -484(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %edx, -216(%ebp)
        movl %edx, -484(%ebp)
        xorl %eax, %ecx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -212(%ebp)
        movl %ecx, -492(%ebp)
        xorl %eax, %esi
        movl -508(%ebp), %eax
        movl (%edi,%esi,4), %edx
        movl %esi, -380(%ebp)
        movl %esi, -500(%ebp)
        xorl %edx, %eax
        movl -516(%ebp), %edx
        movl (%edi,%eax,4), %esi
        movl %eax, -376(%ebp)
        movl %eax, -508(%ebp)
        xorl %esi, %edx
        movl -524(%ebp), %esi
        movl (%edi,%edx,4), %ecx
        movl %edx, -372(%ebp)
        movl %edx, -516(%ebp)
        xorl %ecx, %esi
        movl %esi, -524(%ebp)
        movl -532(%ebp), %ecx
        movl (%edi,%esi,4), %edx
        xorl %edx, %ecx
        movl -540(%ebp), %edx
        movl (%edi,%ecx,4), %eax
        movl %ecx, -532(%ebp)
        xorl %eax, %edx
        movl -548(%ebp), %eax
        xorl (%edi,%edx,4), %eax
        movl %edx, -540(%ebp)
        movl %eax, -584(%ebp)
        movl %eax, -548(%ebp)
        movl (%edi,%eax,4), %eax
        xorl %eax, -208(%ebp)
        cmpl $17, -384(%ebp)
        jne .L16
=============================================================

The loop was unrolled, but it's clear that the address mode selection is worse.
I wonder if this is fixed by TARGET_MEM_REF.
Subject: Re: openssl is slower when compiled with gcc 4.0 than 3.3

The assembler attributed to 4.0 was produced by mainline (or some patched version of 4.0), wasn't it? Otherwise I cannot imagine why the inner loop would be unrolled. For plain 4.0, we get the following code, which seems just fine and equivalent to the one obtained with 3.4 (one of the memory references is strength reduced, but since we still fit into registers, this is OK). I don't just now see what/whether there is some problem with the code produced by 4.1, but I also don't see anything related to addressing mode selection there.

.L21:
        movl S@GOTOFF(%ebx,%eax,4), %eax
        xorl (%edx), %eax
        movl %eax, (%edx)
        movl S@GOTOFF(%ebx,%eax,4), %eax
        xorl 4(%edx), %eax
        movl %eax, 4(%edx)
        movl S@GOTOFF(%ebx,%eax,4), %eax
        xorl 8(%edx), %eax
        movl %eax, 8(%edx)
        movl S@GOTOFF(%ebx,%eax,4), %eax
        xorl 12(%edx), %eax
        movl %eax, 12(%edx)
        movl S@GOTOFF(%ebx,%eax,4), %eax
        xorl 16(%edx), %eax
        movl %eax, 16(%edx)
        movl S@GOTOFF(%ebx,%eax,4), %eax
        xorl 20(%edx), %eax
        movl %eax, 20(%edx)
        movl S@GOTOFF(%ebx,%eax,4), %eax
        xorl 24(%edx), %eax
        movl %eax, 24(%edx)
        movl S@GOTOFF(%ebx,%eax,4), %eax
        xorl 28(%edx), %eax
        movl %eax, 28(%edx)
        addl $32, %edx
        leal -12(%ebp), %esi
        cmpl %esi, %edx
        jne .L21
Could L1 icache blow-out be the reason?
Subject: Re: openssl is slower when compiled with gcc 4.0 than 3.3

> Could L1 icache blow-out be the reason?

This is not likely with the minimized example.
Uhm, at this point, I don't believe anymore that the loop I posted is the cause of the regression. Maybe the regression is somewhere else. I'll investigate.
Looks like the culprit is this:

=========================================================================
static unsigned int S[256];
unsigned
md2_block (unsigned int *sp1, unsigned int *sp2, const unsigned char *d)
{
  register unsigned int t;
  register int i, j;
  static unsigned int state[48];

  j = sp2[16 - 1];
  for (i = 0; i < 16; i++)
    {
      state[i] = sp1[i];
      state[i + 16] = t = d[i];
      state[i + 32] = (t ^ sp1[i]);
      j = sp2[i] ^= S[t ^ j];
    }
}
=========================================================================

gcc 3.4.3 -fPIC -O2:
===================================================
.L5:
        movl 8(%ebp), %esi
        movl (%esi,%ecx,4), %eax
        movl %eax, state.0@GOTOFF(%ebx,%ecx,4)
        movl 16(%ebp), %edx
        movzbl (%edx,%ecx), %eax
        movl %eax, 64+state.0@GOTOFF(%ebx,%ecx,4)
        movl (%esi,%ecx,4), %edx
        xorl %eax, %edx
        movl -16(%ebp), %esi
        xorl -20(%ebp), %eax
        movl %edx, 128+state.0@GOTOFF(%ebx,%ecx,4)
        movl (%esi,%eax,4), %eax
        xorl (%edi,%ecx,4), %eax
        movl %eax, (%edi,%ecx,4)
        incl %ecx
        cmpl $15, %ecx
        movl %eax, -20(%ebp)
        jle .L5
===================================================

gcc 4.1.0 20050529 -fPIC -O2:
===================================================
.L2:
        movl 8(%ebp), %eax
        leal 0(,%edi,4), %ecx
        movl %ecx, -28(%ebp)
        addl %ecx, %eax
        movl 16(%ebp), %ecx
        movl %eax, %edx
        movl %eax, -24(%ebp)
        movl -4(%eax), %eax
        movl %eax, (%esi)
        movzbl -1(%ecx,%edi), %eax
        incl %edi
        movl %eax, 64(%esi)
        movl -4(%edx), %ecx
        movl 12(%ebp), %edx
        xorl %eax, %ecx
        movl %ecx, 128(%esi)
        movl -28(%ebp), %ecx
        addl $4, %esi
        addl %edx, %ecx
        movl -16(%ebp), %edx
        xorl %edx, %eax
        movl -20(%ebp), %edx
        movl (%edx,%eax,4), %eax
        movl -4(%ecx), %edx
        xorl %edx, %eax
        cmpl $17, %edi
        movl %eax, -4(%ecx)
        movl %eax, -16(%ebp)
        jne .L2
===================================================
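For anyone who wants to poke at this locally, here is a self-contained harness around the reduced loop above. Note this is my own sketch, not the attached testcase: the wrapper signature (state passed in, no unused return value), the identity contents of S, and the checked values are all hypothetical choices made so the expected results can be traced by hand.

```c
#include <assert.h>

/* S is filled with the identity permutation below purely to make the
 * expected values easy to trace; the real MD2 S-box differs. */
static unsigned int S[256];

/* Same loop body as the reduced testcase above, with the state array
 * passed in so the caller can inspect it. */
static void
md2_block_reduced (unsigned int *sp1, unsigned int *sp2,
                   const unsigned char *d, unsigned int *state)
{
  unsigned int t = 0;
  int i, j;

  j = sp2[16 - 1];
  for (i = 0; i < 16; i++)
    {
      state[i] = sp1[i];
      state[i + 16] = t = d[i];
      state[i + 32] = (t ^ sp1[i]);
      j = sp2[i] ^= S[t ^ j];
    }
}

/* Run one block with the identity "S-box", zeroed sp1/sp2 and d[i] = i,
 * then verify a few hand-traced values.  Returns 1 on success. */
static int
md2_selfcheck (void)
{
  unsigned int sp1[16] = { 0 }, sp2[16] = { 0 }, state[48];
  unsigned char d[16];
  int i;

  for (i = 0; i < 256; i++)
    S[i] = (unsigned int) i;
  for (i = 0; i < 16; i++)
    d[i] = (unsigned char) i;

  md2_block_reduced (sp1, sp2, d, state);

  /* hand trace: j starts at 0, so sp2[i] becomes d[i] ^ j */
  return state[16 + 5] == 5      /* state[i+16] = d[i]            */
      && state[32 + 7] == 7      /* state[i+32] = d[i] ^ sp1[i]   */
      && sp2[2] == 3             /* t=2, previous j=1, S[2^1] = 3 */
      && sp2[14] == 15;          /* t=14, previous j=1, 14^1 = 15 */
}
```

With an -fPIC build, the S[...] lookup inside the loop is what forces the S@GOTOFF(%ebx,...) addressing seen in the dumps above.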
Subject: Re: openssl is slower when compiled with gcc 4.0 than 3.3

> Looks like the culprit is this:
>
> =========================================================================
> static unsigned int S[256];
> unsigned
> md2_block (unsigned int *sp1, unsigned int *sp2, const unsigned char *d)
> {
>   register unsigned int t;
>   register int i, j;
>   static unsigned int state[48];
>
>   j = sp2[16 - 1];
>   for (i = 0; i < 16; i++)
>     {
>       state[i] = sp1[i];
>       state[i + 16] = t = d[i];
>       state[i + 32] = (t ^ sp1[i]);
>       j = sp2[i] ^= S[t ^ j];
>     }
> }
> =========================================================================

with the TARGET_MEM_REFs patch the result is much better. At least we avoid the multiplication by 4

> leal 0(,%edi,4), %ecx

and other results of the DOM misoptimization of addressing modes, which was one of the main motivations for TARGET_MEM_REFs. We still use one more iv than in the 3.4 case, and as a result we need one more register.

.L2:
        movl 8(%ebp), %edi
        movl -4(%edi,%ecx,4), %eax
        movl %eax, (%esi)
        movl 16(%ebp), %edx
        movzbl -1(%ecx,%edx), %eax
        movl %eax, 64(%esi)
        movl -4(%edi,%ecx,4), %edx
        xorl %eax, %edx
        movl %edx, 128(%esi)
        xorl -20(%ebp), %eax
        movl -16(%ebp), %edi
        movl (%edi,%eax,4), %eax
        movl 12(%ebp), %edx
        xorl -4(%edx,%ecx,4), %eax
        movl %eax, -4(%edx,%ecx,4)
        movl %eax, -20(%ebp)
        incl %ecx
        addl $4, %esi
        cmpl $17, %ecx
        jne .L2
We're learning more about this bug. Anthony Danalis has boiled down the testcase much further; I'll attach the reduced testcase as foo4.i.

It looks like it shows up if your /proc/cpuinfo says

vendor_id  : GenuineIntel
cpu family : 15
model      : 2
model name : Intel(R) Pentium(R) 4 CPU 2.80GHz
stepping   : 9
cpu MHz    : 2793.051
cache size : 512 KB

but not if your /proc/cpuinfo says

vendor_id  : GenuineIntel
cpu family : 15
model      : 4
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping   : 1
cpu MHz    : 3200.255
cache size : 1024 KB

But here's the fun part: on the newer CPU with the bigger cache, gcc-2.95.3 was just as slow as gcc-3.4.3/gcc-4.0.0. Go figure. We'll add more details once we've got more info.
Created attachment 9102 [details]
10x smaller testcase
(In reply to comment #14)
> We're learning more about this bug.
> Anthony Danalis has boiled down the testcase much further;
> I'll attach the reduced testcase as foo4.i.

The difference between those two is that the second one is not really a P4 but a new core entirely; Intel marketing at its best.
Looks to me like gcc-3.4.3 is known to fail, too, depending on the CPU. Anthony Danalis and I came up with a little script to run foo4.i on various processors with various values for -mtune, which I'll attach; here are the results for four different x86 variants. The last two columns are the time on gcc-3.4.3 and gcc-4.0.0 divided by the time on gcc-2.95.3, so any value above 1.0 in the last column is a performance regression. Rows are sorted by the last column. The first five rows represent performance regressions for gcc-3.4.3; the first three also represent performance regressions for gcc-4.0.0.

family,model,name              pic?   tune       [t_295, t_343, t_400]  [t_295/t_295, t_343/t_295, t_400/t_295]
6,8,  Pentium III (Coppermine), -fPIC  athlon-xp  [9.25, 16.22, 18.79]   [1.00, 1.75, 2.03]
15,2, Xeon(TM) CPU 2.60GHz,     -fPIC  pentium4   [1.91, 3.89, 3.27]     [1.00, 2.04, 1.71]
6,8,  Pentium III (Coppermine), -fPIC  pentium3   [9.15, 10.10, 13.20]   [1.00, 1.10, 1.44]
15,2, Xeon(TM) CPU 2.60GHz,     -fPIC  athlon-xp  [1.91, 2.00, 1.95]     [1.00, 1.05, 1.02]
6,8,  Pentium III (Coppermine), -fPIC  pentium4   [9.27, 10.49, 8.87]    [1.00, 1.13, 0.96]
--- ok below this line ---
6,8,  Pentium III (Coppermine),        pentium4   [14.74, 13.71, 14.12]  [1.00, 0.93, 0.96]
15,4, Athlon(tm) 64 3000+,      -fPIC  pentium4   [4.12, 3.68, 3.74]     [1.00, 0.89, 0.91]
15,4, Pentium(R) 4 CPU 3.20GHz, -fPIC  pentium4   [2.48, 2.18, 2.09]     [1.00, 0.88, 0.84]
15,4, Athlon(tm) 64 3000+,      -fPIC  athlon-xp  [4.12, 3.50, 3.20]     [1.00, 0.85, 0.78]
15,4, Pentium(R) 4 CPU 3.20GHz,        pentium4   [2.17, 1.07, 1.07]     [1.00, 0.49, 0.49]
6,8,  Pentium III (Coppermine),        pentium3   [14.22, 6.26, 6.46]    [1.00, 0.44, 0.45]
6,8,  Pentium III (Coppermine),        athlon-xp  [14.93, 6.26, 6.27]    [1.00, 0.42, 0.42]
15,4, Athlon(tm) 64 3000+,             pentium4   [3.65, 1.39, 1.39]     [1.00, 0.38, 0.38]
15,4, Athlon(tm) 64 3000+,             athlon-xp  [3.65, 1.39, 1.40]     [1.00, 0.38, 0.38]
15,2, Xeon(TM) CPU 2.60GHz,            pentium4   [6.42, 0.97, 0.98]     [1.00, 0.15, 0.15]
Created attachment 9106 [details]
script used to benchmark the problem code

We also tried -mtune=nocona and -mtune=prescott, but they gave identical results to -mtune=pentium4 on the processors we ran on (piii, p4, k8).
To be clear, here are the two most worrying rows from the above table, expanded a bit. These are the runtimes of foo4.i in seconds. The cpu family, model, and name are as shown by /proc/cpuinfo.

cpu family 15, model 2, Intel(R) Xeon(TM) CPU 2.60GHz: -fPIC -mtune=pentium4 -O3
  gcc-2.95.3: 1.91 seconds
  gcc-3.4.3:  3.89
  gcc-4.0.0:  3.27

cpu family 6, model 8, Pentium III (Coppermine): -fPIC -mtune=pentium3 -O3
  gcc-2.95.3: 9.15
  gcc-3.4.3:  10.10
  gcc-4.0.0:  13.20

gcc-4.0.0 produces code that runs 1.7 and 1.4 times slower than gcc-2.95.3 on these (fairly common!) cpus, even when the proper -mtune is used.
The above tests did not use -mcpu on gcc-2.95.3, so they were comparing apples to oranges, kind of. I reran them on a PIII with gcc-2.95.3 -mcpu=$tune -O3 and gcc-[34] -mtune=$tune -O3. The problem persists even when using the most appropriate tuning option for the CPU in question.

cpu family 6, model 8, Pentium III (Coppermine): -fPIC -mcpu=pentium -O3
  gcc-2.95.3: 7.61
  gcc-3.4.3:  27.43
  gcc-4.0.0:  17.57

cpu family 6, model 8, Pentium III (Coppermine): -fPIC -mcpu=pentiumpro -O3
  gcc-2.95.3: 9.27
  gcc-3.4.3:  10.09
  gcc-4.0.0:  13.96

cpu family 15, model 2, Intel(R) Xeon(TM) CPU 2.60GHz: -fPIC -mtune=pentium4 -O3
  gcc-2.95.3: 1.91 seconds
  gcc-3.4.3:  3.89
  gcc-4.0.0:  3.27
I asked the fellow who posted the original problem report to give me the results of 'cat /proc/cpuinfo' on the affected machine. Here it is:

vendor_id  : GenuineIntel
cpu family : 6
model      : 8
model name : Pentium III (Coppermine)
stepping   : 10
cpu MHz    : 896.153

This is the same as one of the two affected CPU types here. The slow routine appears to be the buffer cleaning routine, though I haven't verified this with oprofile yet. Here's its loop:

static char cleanse_ctr;
...
while (len--) {
    *(ptr++) = cleanse_ctr;
    cleanse_ctr += (17 + (unsigned char) ((int) ptr & 0xF));
}

and the output of -O3 -fPIC for both gcc-2.95.3 and gcc-4.0.0:

--- gcc-2.95.3 ---
.L5:
        movl cleanse_ctr@GOT(%ebx),%edi
        movb (%edi),%al
        movb %al,(%edx)
        incl %edx
        movb (%edi),%cl
        addb $17,%cl
        movb %dl,%al
        andb $15,%al
        addb %al,%cl
        movb %cl,(%edi)
        subl $1,%esi
        jnc .L5
.L4:

--- gcc-4 ---
.L4:
        movb (%esi), %al
        movb %al, (%edx)
        leal (%ecx,%edi), %eax
        andl $15, %eax
        incl %ecx
        addb (%esi), %al
        incl %edx
        addl $17, %eax
        cmpl %ecx, 12(%ebp)
        movb %al, (%esi)
        jne .L4

It's not obvious to me why the gcc-4.0.0 generated code should be slower when run on some CPUs, if in fact it is. Is it the fact that the loop condition is checked with a cmp against memory instead of a flag being set by subtracting 1 from a register? (And where's the best place to learn about how to predict how long assembly snippets like this will take to run on various modern CPUs, anyway?)
Michael Meissner looked at the code, and saw that gcc-2.95.3 converts the loop to a countdown loop, but gcc-3.x doesn't, which wastes a precious register.
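For readers who don't speak x86, the difference can be sketched in C. This is a hypothetical illustration (not OpenSSL code): an up-counting loop must keep both the index and the limit live across the body, while a countdown loop only needs the counter itself, whose decrement also sets the flags for the exit branch; on register-starved i386/PIC that saves one precious register.

```c
#include <string.h>

/* Up-counting shape (roughly what gcc-3.x/4.x generate here): both i
 * and len must stay live, and with -fPIC there may be no register
 * left for len, forcing a compare against memory each iteration. */
static void
fill_upcount (unsigned char *ptr, unsigned int len)
{
  unsigned int i;
  for (i = 0; i != len; i++)
    ptr[i] = (unsigned char) i;
}

/* Countdown shape (roughly what gcc-2.95 generates): only len stays
 * live; its decrement itself produces the loop-exit condition. */
static void
fill_countdown (unsigned char *ptr, unsigned int len)
{
  unsigned char v = 0;
  while (len--)
    *ptr++ = v++;
}

/* The two shapes are semantically identical; returns 1 if the two
 * loops produce the same buffer contents. */
static int
loops_agree (void)
{
  unsigned char a[64], b[64];
  fill_upcount (a, sizeof a);
  fill_countdown (b, sizeof b);
  return memcmp (a, b, sizeof a) == 0;
}
```

Whether the compiler actually performs this conversion is exactly the codegen question being discussed; the sketch only shows what the transformation means at the source level.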
And, for what it's worth, the latest 4.1 snapshot also suffers from this.
I don't see how the precious register would matter much. But this compare with memory is strange: cmpl %ecx, 12(%ebp) Why isn't len loaded into a register??
Subject: Re: [4.0/4.1 Regression] openssl is slower when compiled with gcc 4.0 than 3.3

> I don't see how the precious register would matter much. But this compare
> with memory is strange:
>
> cmpl %ecx, 12(%ebp)
>
> Why isn't len loaded into a register??

You answer your own question -- because there is no register free; that's why the precious register matters that much. (I guess; I may be wrong.)

Zdenek
(In reply to comment #21)
> The slow routine appears to be the buffer cleaning routine,
> though I haven't verified this with oprofile yet.
> Here's its loop:
> static char cleanse_ctr;
> ...
> while (len--) {
>     *(ptr++) = cleanse_ctr;
>     cleanse_ctr += (17 + (unsigned char) ((int) ptr & 0xF));
> }

[Not entirely related, but..] There's one obvious way to improve this loop. The compiler cannot prove that the write *(ptr++) does not alias the global variable cleanse_ctr, so it will read it from memory in each iteration. To avoid the extra memory read, just do something like:

void OPENSSL_cleanse(unsigned char *ptr, unsigned int len)
{
    unsigned char local_cleanse_ctr = cleanse_ctr;
    while (len--) {
        *(ptr++) = local_cleanse_ctr;
        local_cleanse_ctr += (17 + (unsigned char) ((int) ptr & 0xF));
    }
    local_cleanse_ctr += 63;
    cleanse_ctr = local_cleanse_ctr;
}
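As a sanity check that the suggested local-copy rewrite doesn't change what ends up in the buffer, here is a minimal, self-contained sketch of both variants. The names are made up, the `(int)` cast is replaced with `uintptr_t` for portability, and the trailing `+= 63` from the suggestion is dropped so the two versions also leave the counter in the same state, which makes the equivalence checkable.

```c
#include <stdint.h>
#include <string.h>

static unsigned char cleanse_ctr;

/* Original shape: the store through ptr may alias the global, so the
 * compiler must reload cleanse_ctr from memory every iteration. */
static void
cleanse_alias (unsigned char *ptr, unsigned int len)
{
  while (len--)
    {
      *(ptr++) = cleanse_ctr;
      cleanse_ctr += (unsigned char) (17 + ((uintptr_t) ptr & 0xF));
    }
}

/* Suggested shape: work on a local copy, write the global back once. */
static void
cleanse_local (unsigned char *ptr, unsigned int len)
{
  unsigned char c = cleanse_ctr;
  while (len--)
    {
      *(ptr++) = c;
      c += (unsigned char) (17 + ((uintptr_t) ptr & 0xF));
    }
  cleanse_ctr = c;
}

/* Run both over the SAME buffer (the counter update depends on the
 * buffer's address) and compare results.  Returns 1 on success. */
static int
cleanse_variants_agree (void)
{
  unsigned char buf[32], ref[32], ref_ctr;

  cleanse_ctr = 0;
  cleanse_alias (buf, sizeof buf);
  memcpy (ref, buf, sizeof buf);
  ref_ctr = cleanse_ctr;

  cleanse_ctr = 0;
  cleanse_local (buf, sizeof buf);

  return memcmp (ref, buf, sizeof buf) == 0 && cleanse_ctr == ref_ctr;
}
```

The local-copy version lets the compiler keep the counter in a register for the whole loop regardless of aliasing, which is the whole point of the suggestion.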
Ivopts seems to make several quite dubious decisions in this testcase.
Re. comment #25, as far as I can tell there are registers available in that loop. To quote the loop from comment #12:

.L4:
        movb (%esi), %al
        movb %al, (%edx)
        leal (%ecx,%edi), %eax
        andl $15, %eax
        incl %ecx
        addb (%esi), %al
        incl %edx
        addl $17, %eax
        cmpl %ecx, 12(%ebp)
        movb %al, (%esi)
        jne .L4

Checking off used registers in this loop:
%esi x
%edi x
%eax x
%ebx
%ecx x
%edx x

So %ebx at least is free (and iiuc, with -fomit-frame-pointer %ebp is also free, right?). Maybe the allocator thinks %ebx can't be used because it is the PIC register.

Here is what mainline today ("GCC: (GNU) 4.1.0 20050625 (experimental)") gives me (x86-64 compiler with "-m32 -march=i686 -O3 -fPIC"):

.L4:
        movzbl (%esi), %eax
        movb %al, (%ecx)
        incl %ecx
        movzbl -13(%ebp), %eax
        movzbl (%esi), %edx
        incb -13(%ebp)
        andb $15, %al
        addb $17, %dl
        addb %dl, %al
        cmpl %edi, %ecx
        movb %al, (%esi)
        jne .L4

The .optimized tree dump looks like this:

<bb 0>:
  len.23 = len - 1;
  if (len.23 != 4294967295) goto <L6>; else goto <L2>;

<L6>:;
  ivtmp.19 = (unsigned char) (signed char) (int) (ptr + 1B);
  ptr.27 = ptr;

<L0>:;
  MEM[base: ptr.27] = cleanse_ctr;
  ptr.27 = ptr.27 + 1B;
  cleanse_ctr = (unsigned char) (((signed char) ivtmp.19 & 15) + (signed char) cleanse_ctr + 17);
  ivtmp.19 = ivtmp.19 + 1;
  if (ptr.27 != (unsigned char *) (ptr + (void *) len.23 + 1B)) goto <L0>; else goto <L2>;

<L2>:;
  cleanse_ctr = (unsigned char) ((signed char) cleanse_ctr + 63);
  return;

Note how the loop test is against ptr. Also, as far as I can tell the right-hand side of the test (i.e. "(ptr + (void *) len.23 + 1B)") is loop invariant and should have been moved out. And the first two lines are also just weird; it is probably cheaper on almost any machine to do

  len.23 = len;
  if (len.23 != 0) goto <L6>; else goto <L2>;

<L6>:
  len.23 = len.23 - 1;
  (etc...)

In summary, we just produce crap code here ;-)
Subject: Re: [4.0/4.1 Regression] openssl is slower when compiled with gcc 4.0 than 3.3

> ------- Additional Comments From steven at gcc dot gnu dot org 2005-06-25 10:15 -------
> Re. comment #25, as far as I can tell there are registers available in
> that loop. To quote the loop from comment #12:
>
> .L4:
>         movb (%esi), %al
>         movb %al, (%edx)
>         leal (%ecx,%edi), %eax
>         andl $15, %eax
>         incl %ecx
>         addb (%esi), %al
>         incl %edx
>         addl $17, %eax
>         cmpl %ecx, 12(%ebp)
>         movb %al, (%esi)
>         jne .L4
>
> Checking off used registers in this loop:
> %esi x
> %edi x
> %eax x
> %ebx
> %ecx x
> %edx x
>
> So %ebx at least is free (and iiuc, with -fomit-frame-pointer %ebp is
> also free, right?). Maybe the allocator thinks %ebx can't be used
> because it is the PIC register.

yes, ebx cannot be used because of pic, and -fomit-frame-pointer is off by default.

> Here is what mainline today ("GCC: (GNU) 4.1.0 20050625 (experimental)")
> gives me (x86-64 compiler with "-m32 -march=i686 -O3 -fPIC"):
>
> .L4:
>         movzbl (%esi), %eax
>         movb %al, (%ecx)
>         incl %ecx
>         movzbl -13(%ebp), %eax
>         movzbl (%esi), %edx
>         incb -13(%ebp)
>         andb $15, %al
>         addb $17, %dl
>         addb %dl, %al
>         cmpl %edi, %ecx
>         movb %al, (%esi)
>         jne .L4
>
> The .optimized tree dump looks like this:
>
> <bb 0>:
>   len.23 = len - 1;
>   if (len.23 != 4294967295) goto <L6>; else goto <L2>;
>
> And the first two lines are
> also just weird, it is probably cheaper on almost any machine to do
>   len.23 = len;
>   if (len.23 != 0) goto <L6>; else goto <L2>;
>
> <L6>:
>   len.23 = len.23 - 1;
>   (etc...)

Not really. On i686, there should be no difference.
Created attachment 9191 [details]
a possible patch

This patch could help; I need to benchmark it before submitting it.
(In reply to comment #30)
> This patch could help; I need to benchmark it before submitting it.

Any news about this patch?
Leaving as P2 as this is a significant pessimization on a significant piece of code on relatively common processors.
Zdenek, any news about your patch from comment #30?
It behaves somewhat erratically on SPEC2000 (it increases the overall score, but there are some significant regressions). And it also causes us to produce worse code for this testcase at the moment, due to a misunderstanding between ivopts and fold; the expression

  (unsigned char) (signed char) (int) (ptr + 1B) - (unsigned char) ptr

is produced, and it is not folded to 1.
Created attachment 10263 [details]
Patch

After some playing with fold, I arrived at the following patch, which almost works. With the patch, the code for the loop is

<L0>:;
  MEM[base: ptr]{*ptr} = cleanse_ctr;
  ptr = ptr + 1B;
  cleanse_ctr = (unsigned char) (((signed char) ptr & 15) + (signed char) cleanse_ctr + 17);
  len = len - 1;
  if (len != 0) goto <L0>; else goto <L2>;

which seems just fine. The assembler is

.L3:
        movb (%edi), %al
        movb %al, (%ecx)
        incl %ecx
        movb %cl, %al
        andl $15, %eax
        movb (%edi), %dl
        addl $17, %edx
        addl %edx, %eax
        movb %al, (%edi)
        decl %esi
        jne .L3

which also seems OK to me. However, the "ugly" version we produce without the patch:

.L4:
        movb (%edi), %al
        movb %al, (%ecx)
        incl %ecx
        movb -16(%ebp), %al
        addl %esi, %eax
        andl $15, %eax
        movb (%edi), %dl
        addl $17, %edx
        addl %edx, %eax
        movb %al, (%edi)
        incl %esi
        cmpl 12(%ebp), %esi
        jne .L4

is faster by 30%, for reasons I just don't understand :-(
This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.
Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.
For the reduced testcase in comment #12, we get:

.L2:
        movl 8(%ebp), %edx
        movl -4(%edx,%ecx,4), %eax
        movl %eax, (%esi)
        movl 16(%ebp), %edi
        movzbl -1(%ecx,%edi), %edx
        movl %edx, 64(%esi)
        xorl %edx, %eax
        movl %eax, 128(%esi)
        movl 12(%ebp), %edi
        movl -4(%edi,%ecx,4), %eax
        xorl -16(%ebp), %edx
        xorl S@GOTOFF(%ebx,%edx,4), %eax
        movl %eax, -4(%edi,%ecx,4)
        movl %eax, -16(%ebp)
        addl $1, %ecx
        addl $4, %esi
        cmpl $17, %ecx
        jne .L2

Does someone know whether this is good code or not? I was never a good reader of x86 code.
I think this has been improved for 4.3; it would be nice if someone did some timings with 4.3.
Can someone please redo the timings for GCC 4.3?
OK, I'll see if I can get that done.
Created attachment 14656 [details]
Benchmarks

I ran the benchmarks for the minimal testcase (using Dan Kegel's script) on a Core 2 Duo, AMD Dual Core Opteron, and a Pentium 3 using GCC 2.95.3, 3.4.3, 4.0.1, and 4.3.0 from svn. I'm still looking for a Pentium 4 to test on.
Of the results posted above, the only interesting one that is still slower than gcc-2.95 is a Pentium 3 with -fPIC. (Happily, gcc-4.3 is an improvement there, but it's still worse than 2.95.) Oddly, this one does better with athlon-xp tuning.

/proc/cpuinfo info                                         pic?   tune            [t_34/t_295, t_401/t_295, t_43/t_295]
cpu family 6, model 11, Intel(R) Pentium(R) III CPU - S 1266MHz,  -fPIC  tune=pentium3   [1.37, 1.72, 1.20]
cpu family 6, model 11, Intel(R) Pentium(R) III CPU - S 1266MHz,  -fPIC  tune=athlon-xp  [2.16, 2.49, 1.10]

I'm not sure how important Pentium III performance is anymore.
Fixed for 4.3. WONTFIX for the branches.