Summary: | 70% slowdown with SSE enabled | ||
---|---|---|---|
Product: | gcc | Reporter: | Matteo Croce <rootkit85> |
Component: | target | Assignee: | Uroš Bizjak <ubizjak> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | gcc-bugs |
Priority: | P3 | Keywords: | ssemmx |
Version: | 4.2.3 | ||
Target Milestone: | 4.3.0 | ||
URL: | http://gcc.gnu.org/ml/gcc-patches/2008-01/msg00254.html | ||
Host: | Target: | ||
Build: | Known to work: | ||
Known to fail: | Last reconfirmed: | 2008-01-07 14:02:46 | |
Bug Depends on: | 23322 | ||
Bug Blocks: | |||
Attachments: |
the source
the source compiled with -mfpmath=387
the source compiled with -mfpmath=sse
minimal testcase
minimal testcase, compiled with -mfpmath=387
minimal testcase, compiled with -mfpmath=sse |
Description
Matteo Croce
2008-01-05 21:29:04 UTC
Created attachment 14882 [details]
the source
Created attachment 14883 [details]
the source compiled with -mfpmath=387
Created attachment 14884 [details]
the source compiled with -mfpmath=sse
Please narrow down the particular loop in your testcase that gets slower. It looks like the testsuite measures several things.

Confirmed by the following testcase:

--cut here--
#include <stdio.h>

void __attribute__((noinline))
dtime (void)
{
  __asm__ __volatile__ ("" : : : "memory");
}

double sa, sb, sc, sd;
double one, two, four, five;
double piref, piprg, pierr;

int
main (int argc, char *argv[])
{
  double s, u, v, w, x;
  long i, m;

  piref = 3.14159265358979324;
  one = 1.0;
  two = 2.0;
  four = 4.0;
  five = 5.0;

  m = 512000000;

  dtime();

  s = -five;
  sa = -one;

  dtime();

  for (i = 1; i <= m; i++)
    {
      s = -s;
      sa = sa + s;
    }

  dtime();

  sc = (double) m;

  u = sa;

  v = 0.0;
  w = 0.0;
  x = 0.0;

  dtime();

  for (i = 1; i <= m; i++)
    {
      s = -s;
      sa = sa + s;
      u = u + two;
      x = x + (s - u);
      v = v - s * u;
      w = w + s / u;
    }

  dtime();

  m = (long) (sa * x / sc);

  sa = four * w / five;
  sb = sa + five / v;
  sc = 31.25;

  piprg = sb - sc / (v * v * v);
  pierr = piprg - piref;

  printf ("%13.4le\n", pierr);

  return 0;
}
--cut here--

.L5:
        xorb    $-128, -17(%ebp)        #, s
        addl    $1, %eax                #, i.65
        addsd   %xmm4, %xmm1            # two.16, u
        cmpl    $512000001, %eax        #, i.65
        movsd   -24(%ebp), %xmm0        # s, tmp90
        addsd   -24(%ebp), %xmm2        # s, sa_lsm.48
        mulsd   %xmm1, %xmm0            # u, tmp90
        subsd   %xmm0, %xmm3            # tmp90, v
        movsd   -24(%ebp), %xmm0        # s, tmp91
        divsd   %xmm1, %xmm0            # u, tmp91
        addsd   -16(%ebp), %xmm0        # w, tmp91
        movsd   %xmm0, -16(%ebp)        # tmp91, w
        jne     .L5                     #,

It is somewhat tolerable that "s" and "w" are not kept in registers due to the lack of live range splitting (PR 23322). The main problem here is that the sign of "s" is changed in memory using an (unaligned) xorb insn.

The same situation occurs in the first (shorter) loop:

.L4:
        xorb    $-128, -17(%ebp)        #, s
        addl    $1, %eax                #, i
        cmpl    $512000001, %eax        #, i
        addsd   -24(%ebp), %xmm0        # s, sa_lsm.97
        jne     .L4                     #,

The performance regression is caused by a partial memory stall [1].

[1] Agner Fog: How to optimize for the Pentium family of microprocessors, section 14.7

Patch in testing.
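To make the xorb observation concrete: on little-endian x86, `xorb $-128, -17(%ebp)` flips only the top byte of the 8-byte double stored at -24(%ebp), and the movsd/addsd that follows reads all 8 bytes back. That is a narrow store followed by a wider load from the same location. The C sketch below only illustrates the bit-level equivalence of the two ways of negating. The helper names are hypothetical, it assumes a little-endian IEEE 754 double, and a compiler is free to turn either function into any instruction sequence, so it does not reproduce the stall by itself.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: negate a double by XORing only its top byte with
   0x80, mirroring what `xorb $-128, -17(%ebp)` does to "s" in memory.
   Assumes a little-endian IEEE 754 double (sign bit in the last byte). */
static double negate_via_byte_xor(double s)
{
    unsigned char bytes[sizeof s];

    memcpy(bytes, &s, sizeof s);
    bytes[sizeof s - 1] ^= 0x80;   /* 1-byte write into an 8-byte value */
    memcpy(&s, bytes, sizeof s);   /* full 8-byte read of the same value */
    return s;
}

/* Hypothetical helper: negate a double by XORing the whole 64-bit value
   with a sign mask, so the value is written and read at full width. */
static double negate_via_mask(double s)
{
    uint64_t bits;

    memcpy(&bits, &s, sizeof bits);
    bits ^= UINT64_C(1) << 63;     /* flip the IEEE 754 sign bit */
    memcpy(&s, &bits, sizeof s);
    return s;
}

/* Returns 1 when both helpers agree with the built-in negation. */
static int negations_agree(double s)
{
    return negate_via_byte_xor(s) == -s && negate_via_mask(s) == -s;
}
```

Both helpers map 5.0 to -5.0; they differ only in how much of the stored value is written before it is read back.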
Patched gcc:

387:

   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1     -8.1208e-11      0.0128    1094.6170
     2     -1.5485e-13      0.0061    1145.7086

SSE:

   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1      4.0146e-13      0.0114    1227.3206
     2     -1.4166e-13      0.0050    1399.9125
 [   2     -1.4166e-13      0.0269     260.2975 ]

So, 5.36x faster.

Created attachment 14895 [details]
minimal testcase
Created attachment 14896 [details]
minimal testcase, compiled with -mfpmath=387
Created attachment 14897 [details]
minimal testcase, compiled with -mfpmath=sse
very very minimal testcase added

Subject: Bug 34682

Author: uros
Date: Mon Jan 7 20:06:34 2008
New Revision: 131381

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=131381
Log:
        PR target/34682
        * config/i386/i386.md (neg<mode>2): Rename from negsf2, negdf2 and
        negxf2.  Macroize expander using X87MODEF mode iterator.  Change
        predicates of op0 and op1 to register_operand.
        (abs<mode>2): Rename from abssf2, absdf2 and negxf2.  Macroize
        expander using X87MODEF mode iterator.  Change predicates of op0
        and op1 to register_operand.
        ("*absneg<mode>2_mixed", "*absneg<mode>2_sse"): Rename from
        corresponding patterns and macroize using MODEF macro.  Change
        predicates of op0 and op1 to register_operand and remove "m"
        constraint.  Disparage "r" alternative with "!".
        ("*absneg<mode>2_i387"): Rename from corresponding patterns and
        macroize using X87MODEF macro.  Change predicates of op0 and op1
        to register_operand and remove "m" constraint.  Disparage "r"
        alternative with "!".
        (absneg splitter with memory operands): Remove.
        ("*neg<mode>2_1", "*abs<mode>2_1"): Rename from corresponding
        patterns and macroize using X87MODEF mode iterator.
        * config/i386/sse.md (negv4sf2, absv4sf2, neg2vdf2, absv2df2):
        Change predicate of op1 to register_operand.
        * config/i386/i386.c (ix86_expand_fp_absneg_operator): Remove support
        for memory operands.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.md
    trunk/gcc/config/i386/sse.md

Fixed in SVN.