Summary: [4.3 Regression] slow compilation on ia64 (postreload scheduling)

Product: gcc
Component: rtl-optimization
Version: 4.3.0
Target: ia64-linux-gnu
Status: RESOLVED FIXED
Severity: normal
Priority: P2
Keywords: compile-time-hog
Reporter: Martin Michlmayr <tbm>
Assignee: Not yet assigned to anyone <unassigned>
CC: gcc-bugs, mkuvyrkov, pinskia, rguenth, sje, wilson
Target Milestone: 4.3.0
Known to work: 4.2.2
Last reconfirmed: 2007-11-03 10:48:04

Attachments:
  rws_insn.patch
  gcc43-ia64-rws-speedups.patch
Description
Martin Michlmayr 2007-10-27 15:40:36 UTC

Compile times for the testcase (attached below):

20070303 0m25.928s
20070422 0m8.723s
20070515 0m7.345s
20070613 0m8.996s
20070811 0m8.172s
20070916 0m24.503s
20071020 0m34.445s

-ftime-report output please?

(In reply to comment #1)
> -ftime-report output please?

(sid)tbm@coconut0:~/x$ /usr/lib/gcc-snapshot/bin/gcc -c -O3 -ftime-report slow.c

Execution times (seconds)
 garbage collection    : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.32 ( 1%) wall    0 kB ( 0%) ggc
 callgraph construction: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall   13 kB ( 0%) ggc
 callgraph optimization: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    2 kB ( 0%) ggc
 CFG verifier          : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 df live regs          : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall    0 kB ( 0%) ggc
 df live&initialized regs: 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  0 kB ( 0%) ggc
 df reg dead/unused notes: 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 142 kB ( 1%) ggc
 register information  : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 alias analysis        : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall  224 kB ( 2%) ggc
 rebuild jump labels   : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.15 ( 1%) wall    0 kB ( 0%) ggc
 parser                : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.07 ( 0%) wall   83 kB ( 1%) ggc
 tree gimplify         : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall   14 kB ( 0%) ggc
 tree CFG construction : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall   23 kB ( 0%) ggc
 tree CFG cleanup      : 0.00 ( 0%) usr 0.00 ( 2%) sys 0.02 ( 0%) wall 1018 kB ( 8%) ggc
 tree VRP              : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall  132 kB ( 1%) ggc
 tree reassociation    : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    0 kB ( 0%) ggc
 tree PRE              : 0.39 ( 2%) usr 0.00 ( 4%) sys 0.41 ( 2%) wall 1052 kB ( 8%) ggc
 tree conservative DCE : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    0 kB ( 0%) ggc
 predictive commoning  : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 tree SSA to normal    : 0.06 ( 0%) usr 0.00 ( 4%) sys 0.06 ( 0%) wall 1010 kB ( 8%) ggc
 tree SSA verifier     : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall   10 kB ( 0%) ggc
 tree STMT verifier    : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall    0 kB ( 0%) ggc
 expand                : 0.08 ( 0%) usr 0.00 ( 2%) sys 0.77 ( 3%) wall 1163 kB ( 9%) ggc
 jump                  : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    0 kB ( 0%) ggc
 CSE                   : 0.03 ( 0%) usr 0.00 ( 2%) sys 0.04 ( 0%) wall    1 kB ( 0%) ggc
 dead store elim1      : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall  129 kB ( 1%) ggc
 dead store elim2      : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.15 ( 1%) wall  267 kB ( 2%) ggc
 CPROP 2               : 0.01 ( 0%) usr 0.00 ( 2%) sys 0.01 ( 0%) wall  132 kB ( 1%) ggc
 bypass jumps          : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  130 kB ( 1%) ggc
 CSE 2                 : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall    1 kB ( 0%) ggc
 branch prediction     : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    0 kB ( 0%) ggc
 combiner              : 0.82 ( 4%) usr 0.00 ( 0%) sys 0.91 ( 3%) wall  452 kB ( 3%) ggc
 if-conversion         : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall  352 kB ( 3%) ggc
 regmove               : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 scheduling            : 1.32 ( 7%) usr 0.00 ( 2%) sys 1.55 ( 6%) wall  194 kB ( 1%) ggc
 local alloc           : 0.14 ( 1%) usr 0.00 ( 0%) sys 0.14 ( 1%) wall   50 kB ( 0%) ggc
 global alloc          : 0.54 ( 3%) usr 0.00 ( 9%) sys 0.78 ( 3%) wall 2537 kB (19%) ggc
 reload CSE regs       : 0.18 ( 1%) usr 0.00 ( 0%) sys 0.19 ( 1%) wall  584 kB ( 4%) ggc
 load CSE after reload : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall    0 kB ( 0%) ggc
 thread pro- & epilogue: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall   24 kB ( 0%) ggc
 rename registers      : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.06 ( 0%) wall    0 kB ( 0%) ggc
 scheduling 2          : 14.45 (78%) usr 0.03 (65%) sys 19.36 (74%) wall 2099 kB (16%) ggc
 machine dep reorg     : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 final                 : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.17 ( 1%) wall    0 kB ( 0%) ggc
 symout                : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    0 kB ( 0%) ggc
 TOTAL                 : 18.63          0.04           26.28           13034 kB
Extra diagnostic checks enabled; compiler may run slowly.
Configure with --enable-checking=release to disable checks.

Maybe something for Maxim to look at?

Oops, I forgot to add the testcase:

typedef enum {
  ST_TiemanStyle,
} BrailleDisplay;
static int pendingCommand;
static int currentModifiers;
typedef struct {
  int (*updateKeys) (BrailleDisplay * brl, int *keyPressed);
} ProtocolOperations;
static const ProtocolOperations *protocol;
brl_readCommand (BrailleDisplay * brl)
{
  unsigned long int keys;
  int command;
  int keyPressed;
  unsigned char routingKeys[200];
  int routingKeyCount;
  signed char rightVerticalSensor;
  if (pendingCommand != (-1))
    {
      return command;
    }
  if (!protocol->updateKeys (brl, &keyPressed))
    {
      if (rightVerticalSensor >= 0)
        keys |= 1;
      if ((routingKeyCount == 0) && keys)
        {
          if (currentModifiers)
            {
            doChord:switch (keys);
            }
          else
            {
            doCharacter:
              command = 0X2200;
              if (keys & 0X01UL) command |= 0001;
              if (keys & 0X02UL) command |= 0002;
              if (keys & 0X04UL) command |= 0004;
              if (keys & 0X08UL) command |= 0010;
              if (keys & 0X10UL) command |= 0020;
              if (keys & 0X20UL) command |= 0040;
              if (currentModifiers & (0X0010 | 0X0200)) command |= 0100;
              if (currentModifiers & 0X0040) command |= 0200;
              if (currentModifiers & 0X0100) command |= 0X020000;
              if (currentModifiers & 0X0400) command |= 0X080000;
              if (currentModifiers & 0X0800) command |= 0X040000;
            }
          unsigned char key1 = routingKeys[0];
          if (key1 == 0)
            {
            }
          if (key1 == 1)
            if (keys)
              {
                currentModifiers |= 0X0010;
                goto doCharacter;
              }
        }
    }
  return command;
}

As a comparison, here is what I get with 20070811:

(sid)tbm@coconut0:~/x$ /usr/lib/gcc-snapshot/bin/gcc -c -O3 -ftime-report slow.c

Execution times (seconds)
 garbage collection    : 0.06 ( 2%) usr 0.00 ( 0%) sys 0.43 ( 5%) wall    0 kB ( 0%) ggc
 CFG verifier          : 0.02 ( 1%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall    0 kB ( 0%) ggc
 df live regs          : 0.02 ( 1%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 df use-def / def-use chains: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 1%) wall 0 kB ( 0%) ggc
 df reg dead/unused notes: 0.01 ( 0%) usr 0.00 ( 2%) sys 0.01 ( 0%) wall 198 kB ( 2%) ggc
 register information  : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall    0 kB ( 0%) ggc
 alias analysis        : 0.03 ( 1%) usr 0.00 ( 0%) sys 0.15 ( 2%) wall  224 kB ( 2%) ggc
 parser                : 0.00 ( 0%) usr 0.00 ( 8%) sys 0.01 ( 0%) wall   81 kB ( 1%) ggc
 tree VRP              : 0.01 ( 0%) usr 0.00 ( 3%) sys 0.01 ( 0%) wall  132 kB ( 1%) ggc
 tree operand scan     : 0.01 ( 0%) usr 0.00 ( 3%) sys 0.01 ( 0%) wall  106 kB ( 1%) ggc
 tree PRE              : 0.41 (13%) usr 0.00 ( 3%) sys 1.00 (11%) wall 1052 kB ( 9%) ggc
 tree SSA to normal    : 0.08 ( 3%) usr 0.00 ( 2%) sys 0.32 ( 3%) wall 1023 kB ( 8%) ggc
 tree SSA verifier     : 0.03 ( 1%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall   10 kB ( 0%) ggc
 tree STMT verifier    : 0.04 ( 1%) usr 0.00 ( 0%) sys 0.24 ( 3%) wall    0 kB ( 0%) ggc
 expand                : 0.02 ( 1%) usr 0.01 (12%) sys 0.03 ( 0%) wall  571 kB ( 5%) ggc
 CSE                   : 0.04 ( 1%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall    1 kB ( 0%) ggc
 dead code elimination : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    0 kB ( 0%) ggc
 dead store elim2      : 0.04 ( 1%) usr 0.00 ( 0%) sys 0.05 ( 1%) wall  122 kB ( 1%) ggc
 CPROP 1               : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 1%) wall   97 kB ( 1%) ggc
 CPROP 2               : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  131 kB ( 1%) ggc
 bypass jumps          : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  130 kB ( 1%) ggc
 CSE 2                 : 0.02 ( 1%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    1 kB ( 0%) ggc
 combiner              : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall   23 kB ( 0%) ggc
 if-conversion         : 0.04 ( 1%) usr 0.00 ( 0%) sys 0.16 ( 2%) wall    0 kB ( 0%) ggc
 regmove               : 0.04 ( 1%) usr 0.00 ( 3%) sys 0.13 ( 1%) wall    0 kB ( 0%) ggc
 scheduling            : 0.40 (12%) usr 0.00 ( 5%) sys 1.17 (13%) wall   61 kB ( 1%) ggc
 local alloc           : 0.03 ( 1%) usr 0.00 ( 0%) sys 0.15 ( 2%) wall  162 kB ( 1%) ggc
 global alloc          : 0.35 (11%) usr 0.01 ( 9%) sys 1.03 (11%) wall 2694 kB (22%) ggc
 reload CSE regs       : 0.22 ( 7%) usr 0.00 ( 2%) sys 0.67 ( 7%) wall  686 kB ( 6%) ggc
 load CSE after reload : 0.07 ( 2%) usr 0.00 ( 2%) sys 0.18 ( 2%) wall    0 kB ( 0%) ggc
 rename registers      : 0.07 ( 2%) usr 0.00 ( 0%) sys 0.22 ( 2%) wall    3 kB ( 0%) ggc
 scheduling 2          : 1.02 (31%) usr 0.01 (11%) sys 2.50 (27%) wall 1192 kB (10%) ggc
 machine dep reorg     : 0.03 ( 1%) usr 0.00 ( 2%) sys 0.04 ( 0%) wall    1 kB ( 0%) ggc
 final                 : 0.02 ( 1%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 TOTAL                 : 3.24           0.06           9.11            12164 kB
Extra diagnostic checks enabled; compiler may run slowly.
Configure with --enable-checking=release to disable checks.

So scheduling 2 has gone from 2.50 to 19.36 seconds between 20070811 and 20071020 (both with checking enabled).

> Extra diagnostic checks enabled; compiler may run slowly.
> Configure with --enable-checking=release to disable checks.

We added this message for a reason; it seems like you should try that first. The release branches default to --enable-checking=release.
(In reply to comment #8)
> We added this message for a reason; it seems like you should try that first.
> The release branches default to --enable-checking=release.

Well, I showed that even with checking enabled the compiler was _much_ faster 2 months ago. But, OK, I'll try with checking disabled too.

Subject: Re: [4.3 Regression] slow compilation on ia64

On 27 Oct 2007 18:08:21 -0000, tbm at cyrius dot com <gcc-bugzilla@gcc.gnu.org> wrote:
> Well, I showed that even with checking enabled the compiler was _much_ faster
> 2 months ago. But, ok, I'll try with checking disabled too.

Well someone (maybe DF) could have added a lot of checking. -- Pinski

(In reply to comment #10)
> Well someone (maybe DF) could have added a lot of checking.

OK, good point. I'll report my findings in a few hours.

Same results without checking (actually even slower -- is that possible?):

(sid)tbm@coconut0:~/tmp/gcc/gcc-4.3-20071027-r129674-no-checking/gcc$ ./xgcc -B. -ftime-report -O3 -c ~/slow.c

Execution times (seconds)
 df live regs          : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 df live&initialized regs: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  0 kB ( 0%) ggc
 df reg dead/unused notes: 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 142 kB ( 1%) ggc
 register information  : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 0%) wall    0 kB ( 0%) ggc
 alias analysis        : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall  224 kB ( 2%) ggc
 tree VRP              : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.11 ( 0%) wall  132 kB ( 1%) ggc
 tree PRE              : 0.37 ( 2%) usr 0.00 ( 3%) sys 0.64 ( 1%) wall 1052 kB ( 8%) ggc
 tree SSA to normal    : 0.06 ( 0%) usr 0.00 ( 3%) sys 0.06 ( 0%) wall 1010 kB ( 8%) ggc
 expand                : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall 1182 kB ( 9%) ggc
 CSE                   : 0.03 ( 0%) usr 0.00 ( 7%) sys 0.14 ( 0%) wall    1 kB ( 0%) ggc
 dead store elim2      : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall  267 kB ( 2%) ggc
 CPROP 2               : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  132 kB ( 1%) ggc
 bypass jumps          : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  130 kB ( 1%) ggc
 CSE 2                 : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.28 ( 1%) wall    0 kB ( 0%) ggc
 combiner              : 0.81 ( 4%) usr 0.00 ( 3%) sys 1.77 ( 3%) wall  452 kB ( 4%) ggc
 if-conversion         : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall  352 kB ( 3%) ggc
 regmove               : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 scheduling            : 1.34 ( 6%) usr 0.00 ( 0%) sys 3.53 ( 7%) wall  194 kB ( 2%) ggc
 local alloc           : 0.14 ( 1%) usr 0.00 ( 0%) sys 0.25 ( 0%) wall   50 kB ( 0%) ggc
 global alloc          : 0.53 ( 2%) usr 0.00 ( 3%) sys 0.70 ( 1%) wall 2537 kB (20%) ggc
 reload CSE regs       : 0.17 ( 1%) usr 0.00 ( 0%) sys 0.24 ( 0%) wall  584 kB ( 5%) ggc
 load CSE after reload : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall    0 kB ( 0%) ggc
 rename registers      : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 scheduling 2          : 18.96 (83%) usr 0.02 (66%) sys 43.24 (84%) wall 1970 kB (15%) ggc
 final                 : 0.02 ( 0%) usr 0.00 ( 3%) sys 0.12 ( 0%) wall    0 kB ( 0%) ggc
 TOTAL                 : 22.83          0.03           51.54           12913 kB

What happens if you compile with -O3 -fno-tree-vectorize ?

(In reply to comment #13)
> What happens if you compile with -O3 -fno-tree-vectorize ?

It's still slow:

(sid)tbm@coconut0:~/tmp/gcc/gcc-4.3-20071027-r129674-no-checking/gcc$ ./xgcc -B. -ftime-report -O3 -fno-tree-vectorize -c ~/slow.c

Execution times (seconds)
 callgraph construction: 0.00 ( 0%) usr 0.00 ( 2%) sys 0.07 ( 0%) wall   13 kB ( 0%) ggc
 callgraph optimization: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall    2 kB ( 0%) ggc
 df reaching defs      : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.08 ( 0%) wall    0 kB ( 0%) ggc
 df live regs          : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall    0 kB ( 0%) ggc
 df live&initialized regs: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  0 kB ( 0%) ggc
 df reg dead/unused notes: 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 142 kB ( 1%) ggc
 register information  : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 alias analysis        : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall  224 kB ( 2%) ggc
 parser                : 0.00 ( 0%) usr 0.00 ( 1%) sys 0.04 ( 0%) wall   83 kB ( 1%) ggc
 inline heuristics     : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    0 kB ( 0%) ggc
 tree gimplify         : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall   14 kB ( 0%) ggc
 tree CFG construction : 0.00 ( 0%) usr 0.00 ( 1%) sys 0.02 ( 0%) wall   23 kB ( 0%) ggc
 tree CFG cleanup      : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 1018 kB ( 8%) ggc
 tree VRP              : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  132 kB ( 1%) ggc
 tree copy propagation : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall   24 kB ( 0%) ggc
 tree PRE              : 0.37 ( 2%) usr 0.00 ( 0%) sys 0.47 ( 1%) wall 1052 kB ( 8%) ggc
 tree SSA to normal    : 0.06 ( 0%) usr 0.00 ( 1%) sys 0.06 ( 0%) wall 1010 kB ( 8%) ggc
 expand                : 0.04 ( 0%) usr 0.00 ( 2%) sys 0.37 ( 1%) wall 1182 kB ( 9%) ggc
 forward prop          : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    2 kB ( 0%) ggc
 CSE                   : 0.03 ( 0%) usr 0.00 ( 1%) sys 0.03 ( 0%) wall    1 kB ( 0%) ggc
 dead store elim2      : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall  267 kB ( 2%) ggc
 CPROP 2               : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  132 kB ( 1%) ggc
 bypass jumps          : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  130 kB ( 1%) ggc
 CSE 2                 : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.14 ( 0%) wall    0 kB ( 0%) ggc
 branch prediction     : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 combiner              : 0.82 ( 3%) usr 0.00 ( 0%) sys 1.66 ( 4%) wall  452 kB ( 4%) ggc
 if-conversion         : 0.02 ( 0%) usr 0.00 ( 1%) sys 0.03 ( 0%) wall  352 kB ( 3%) ggc
 regmove               : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 scheduling            : 1.34 ( 5%) usr 0.00 ( 0%) sys 2.99 ( 7%) wall  194 kB ( 2%) ggc
 local alloc           : 0.14 ( 1%) usr 0.00 ( 0%) sys 0.34 ( 1%) wall   50 kB ( 0%) ggc
 global alloc          : 0.53 ( 2%) usr 0.00 ( 1%) sys 1.15 ( 3%) wall 2537 kB (20%) ggc
 reload CSE regs       : 0.17 ( 1%) usr 0.00 ( 0%) sys 0.36 ( 1%) wall  584 kB ( 5%) ggc
 load CSE after reload : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.09 ( 0%) wall    0 kB ( 0%) ggc
 if-conversion 2       : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.13 ( 0%) wall    0 kB ( 0%) ggc
 rename registers      : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 scheduling 2          : 20.44 (84%) usr 0.08 (84%) sys 31.73 (79%) wall 1970 kB (15%) ggc
 final                 : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall    0 kB ( 0%) ggc
 TOTAL                 : 24.34          0.10           40.40           12913 kB

Compared to 20070803 with -O3 -fno-tree-vectorize, there are now 100 times more calls to rtx_needs_barrier and 44 times more calls to safe_group_barrier_needed. The latter is horribly expensive, e.g. copying around 401 * sizeof (struct reg_write_state) == 1604 bytes several times. I haven't analyzed why exactly there are so many more safe_group_barrier_needed calls, but they are certainly much more common than direct group_barrier_needed calls on this testcase (14579701 safe_group_barrier_needed calls, 14604168 group_barrier_needed calls). But if so, the only thing that call cares about is the return value; all the state is thrown away.
From what I see, the need_barrier return value is ORed together from all the recursive calls; couldn't we gain something by returning 1 immediately whenever one of the recursive calls returns non-zero?

Subject: Re: [4.3 Regression] slow compilation on ia64 (postreload scheduling)
jakub at gcc dot gnu dot org wrote:
> ------- Comment #15 from jakub at gcc dot gnu dot org 2007-10-28 19:10 -------
> Compared to 20070803 with -O3 -fno-tree-vectorize there are now 100 times more
> calls to rtx_needs_barrier and 44 times more calls to
> safe_group_barrier_needed.
> E.g. the latter is horribly expensive, e.g. copying around 401 * sizeof (struct
> reg_write_state) == 1604 bytes several times.
The underlying problem is that the list of ready-to-schedule instructions
has become larger than it was before, and the scheduler tends to slow
down as that list grows. There is already a workaround for this problem
(limiting the ready list when it gets too large; see
PARAM_MAX_SCHED_READY_INSNS), but it doesn't seem to help enough in this case.
Created attachment 14429 [details]
rws_insn.patch
Just a side note: maintaining the rws_insn array seems horribly expensive to me. For each regno only one bit is actually used, just to check one gcc_assert, and only two regnos are actually checked in some other code. So memsetting and maintaining a 1604-byte array all the time looks like overkill; a bitmap would do just fine, or, if we simply remove that gcc_assert when not ENABLE_CHECKING, we need just 2 bits altogether instead of those 1604 bytes.
This doesn't help much on this testcase (it does not address the algorithmic issue), but the difference is already noticeable.
scheduling 2 : 10.60 (88%) usr 0.00 ( 0%) sys 10.60 (88%) wall 1970 kB (15%) ggc
went down to
scheduling 2 : 8.99 (86%) usr 0.01 (50%) sys 9.00 (86%) wall 1970 kB (15%) ggc
with this patch and --enable-checking=release, so about 14% speedup in wall time for the whole compilation of this file.
Another trivial patch that improves speed is:

--- ia64.c	(revision 129700)
+++ ia64.c	(working copy)
@@ -5310,11 +5310,11 @@ ia64_safe_type (rtx insn)
 
 struct reg_write_state
 {
-  unsigned int write_count : 2;
-  unsigned int first_pred : 16;
-  unsigned int written_by_fp : 1;
-  unsigned int written_by_and : 1;
-  unsigned int written_by_or : 1;
+  unsigned short write_count : 2;
+  unsigned short first_pred : 10;
+  unsigned short written_by_fp : 1;
+  unsigned short written_by_and : 1;
+  unsigned short written_by_or : 1;
 };
 
 /* Cumulative info for the current instruction group.  */

which cuts the size of the rws_sum and rws_saved arrays in half (1604 to 802 bytes). With both patches in I get:

scheduling 2 : 6.86 (82%) usr 0.01 (50%) sys 6.87 (82%) wall 1970 kB (15%) ggc

or a 31% wall-time speedup with both patches together. first_pred is either 0 or PR_REG(0) through PR_REG(63), so it certainly fits into a 10-bit bitfield. If needed it would even fit into 6 bits (when pred == 0, write_count will already be 2, and we could subtract PR_REG(0)), but that's still too big to squeeze into 1 byte per register. Even when this bug is fixed for real, both changes IMHO make sense anyway (the first patch could perhaps use some cleanup, nice macros to hide it or something).

Actually, we probably don't need to write to the rws_sum array at all when in safe_group_barrier_needed, and then we wouldn't need to copy it around (save and restore it) at all:

--- config/ia64/ia64.c~	2007-10-28 22:00:24.000000000 +0100
+++ config/ia64/ia64.c	2007-10-28 22:04:26.000000000 +0100
@@ -5353,6 +5353,7 @@ static int rtx_needs_barrier (rtx, struc
 static void init_insn_group_barriers (void);
 static int group_barrier_needed (rtx);
 static int safe_group_barrier_needed (rtx);
+static int in_safe_group_barrier;
 
 /* Update *RWS for REGNO, which is being written by the current instruction,
    with predicate PRED, and associated register flags in FLAGS.  */
@@ -5407,7 +5408,8 @@ rws_access_regno (int regno, struct reg_
     {
     case 0:
       /* The register has not been written yet.  */
-      rws_update (regno, flags, pred);
+      if (!in_safe_group_barrier)
+	rws_update (regno, flags, pred);
       break;
 
     case 1:
@@ -5421,7 +5423,8 @@ rws_access_regno (int regno, struct reg_
 	;
       else if ((rws_sum[regno].first_pred ^ 1) != pred)
 	need_barrier = 1;
-      rws_update (regno, flags, pred);
+      if (!in_safe_group_barrier)
+	rws_update (regno, flags, pred);
       break;
 
     case 2:
@@ -5433,8 +5436,11 @@ rws_access_regno (int regno, struct reg_
 	;
       else
 	need_barrier = 1;
-      rws_sum[regno].written_by_and = flags.is_and;
-      rws_sum[regno].written_by_or = flags.is_or;
+      if (!in_safe_group_barrier)
+	{
+	  rws_sum[regno].written_by_and = flags.is_and;
+	  rws_sum[regno].written_by_or = flags.is_or;
+	}
       break;
 
     default:
@@ -6099,17 +6105,16 @@ int safe_group_barrier_needed_cnt[5];
 static int
 safe_group_barrier_needed (rtx insn)
 {
-  struct reg_write_state rws_saved[NUM_REGS];
   int saved_first_instruction;
   int t;
 
-  memcpy (rws_saved, rws_sum, NUM_REGS * sizeof *rws_saved);
   saved_first_instruction = first_instruction;
+  in_safe_group_barrier = 1;
 
   t = group_barrier_needed (insn);
 
-  memcpy (rws_sum, rws_saved, NUM_REGS * sizeof *rws_saved);
   first_instruction = saved_first_instruction;
+  in_safe_group_barrier = 0;
 
   return t;
 }

Together with the other patches this gives (everything measured with an x86_64-linux -> ia64-linux cross; it would need to be measured on native ia64-linux):

scheduling 2 : 5.20 (78%) usr 0.01 (50%) sys 5.20 (77%) wall 1970 kB (15%) ggc

or a ~45% speedup on this testcase.

Created attachment 14433 [details]
gcc43-ia64-rws-speedups.patch
All 3 patches together, with macros.
The most important cause of the slowdown, e.g. compared to 4.2.x, is the totally insane thing -ftree-pre creates, though. For -O3 -fno-tree-vectorize -fdump-tree-all pr33922.c, wc -l shows

2361 pr33922.c.090t.sink

while for -O3 -fno-tree-vectorize -fno-tree-pre -fdump-tree-all pr33922.c

 324 pr33922.c.090t.sink

and of course the size of the assembly corresponds to this:

11400 pr33922.s	# -O3 -fno-tree-vectorize
  195 pr33922.s	# -O3 -fno-tree-vectorize -fno-tree-pre

The -O3 -fno-tree-vectorize -fdump-tree-pre-all dump contains 2081 ^Created.*value lines, and all those constants are actually created, and many PHI nodes as well. I believe this might be what nickc was trying to fix today by adding a limit, but wasn't that limit huge (131072 bits)?

Subject: Re: [4.3 Regression] slow compilation on ia64 (postreload scheduling)
On Thu, 1 Nov 2007, jakub at gcc dot gnu dot org wrote:
> ------- Comment #22 from jakub at gcc dot gnu dot org 2007-11-01 20:59 -------
> The most important cause of the slowdown e.g. compared to 4.2.x is the totally
> insane thing -ftree-pre creates though.
> For -O3 -fno-tree-vectorize -fdump-tree-all pr33922.c
> wc -l shows
> 2361 pr33922.c.090t.sink
> while for -O3 -fno-tree-vectorize -fno-tree-pre -fdump-tree-all pr33922.c
> 324 pr33922.c.090t.sink
> and of course the size of assembly corresponds to this:
> 11400 pr33922.s # -O3 -fno-tree-vectorize
> 195 pr33922.s # -O3 -fno-tree-vectorize -fno-tree-pre
>
> -O3 -fno-tree-vectorize -fdump-tree-pre-all dump contains
> 2081 ^Created.*value lines and all those constants are actually created and
> many PHI nodes as well. I believe this might be what nickc was trying to fix
> today by adding a limit, but wasn't that limit huge (131072 bits)?
The limit was to cut off exponential behavior. But yes, PRE (and even more
so partial-partial PRE) is known to increase code size. Looks like some
better heuristics are needed.
Richard.
Subject: Bug 33922

Author: spop
Date: Mon Nov  5 15:42:30 2007
New Revision: 129901

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=129901

Log:
2007-11-05  Nick Clifton  <nickc@redhat.com>
	    Sebastian Pop  <sebastian.pop@amd.com>

	PR tree-optimization/32540
	PR tree-optimization/33922
	* doc/invoke.texi: Document PARAM_MAX_PARTIAL_ANTIC_LENGTH.
	* tree-ssa-pre.c: Include params.h.
	(compute_partial_antic_aux): Use PARAM_MAX_PARTIAL_ANTIC_LENGTH
	to limit the maximum length of the PA set for a given block.
	* Makefile.in: Add a dependency upon params.h for tree-ssa-pre.c.
	* params.def (PARAM_MAX_PARTIAL_ANTIC_LENGTH): New parameter.

	* gcc.dg/tree-ssa/pr32540-1.c: New.
	* gcc.dg/tree-ssa/pr32540-2.c: New.
	* gcc.dg/tree-ssa/pr33922.c: New.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr32540-1.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr32540-2.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr33922.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/doc/invoke.texi
    trunk/gcc/params.def
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-pre.c

Fixed.