Summary: [4.3 Regression] slow compilation on ia64 (postreload scheduling)

Product: gcc
Component: rtl-optimization
Version: 4.3.0
Target: ia64-linux-gnu
Status: RESOLVED FIXED
Severity: normal
Priority: P2
Keywords: compile-time-hog
Reporter: Martin Michlmayr <tbm>
Assignee: Not yet assigned to anyone <unassigned>
CC: gcc-bugs, mkuvyrkov, pinskia, rguenth, sje, wilson
Target Milestone: 4.3.0
Known to work: 4.2.2
Last reconfirmed: 2007-11-03 10:48:04

Attachments:
  rws_insn.patch
  gcc43-ia64-rws-speedups.patch
Description
Martin Michlmayr 2007-10-27 15:40:36 UTC

Compile times for the testcase (attached below):

20070303 0m25.928s
20070422 0m8.723s
20070515 0m7.345s
20070613 0m8.996s
20070811 0m8.172s
20070916 0m24.503s
20071020 0m34.445s

-ftime-report output please?

(In reply to comment #1)
> -ftime-report output please?

(sid)tbm@coconut0:~/x$ /usr/lib/gcc-snapshot/bin/gcc -c -O3 -ftime-report slow.c

Execution times (seconds)
 garbage collection    : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.32 ( 1%) wall    0 kB ( 0%) ggc
 callgraph construction: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall   13 kB ( 0%) ggc
 callgraph optimization: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    2 kB ( 0%) ggc
 CFG verifier          : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 df live regs          : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall    0 kB ( 0%) ggc
 df live&initialized regs: 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  0 kB ( 0%) ggc
 df reg dead/unused notes: 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 142 kB ( 1%) ggc
 register information  : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 alias analysis        : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall  224 kB ( 2%) ggc
 rebuild jump labels   : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.15 ( 1%) wall    0 kB ( 0%) ggc
 parser                : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.07 ( 0%) wall   83 kB ( 1%) ggc
 tree gimplify         : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall   14 kB ( 0%) ggc
 tree CFG construction : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall   23 kB ( 0%) ggc
 tree CFG cleanup      : 0.00 ( 0%) usr 0.00 ( 2%) sys 0.02 ( 0%) wall 1018 kB ( 8%) ggc
 tree VRP              : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall  132 kB ( 1%) ggc
 tree reassociation    : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    0 kB ( 0%) ggc
 tree PRE              : 0.39 ( 2%) usr 0.00 ( 4%) sys 0.41 ( 2%) wall 1052 kB ( 8%) ggc
 tree conservative DCE : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    0 kB ( 0%) ggc
 predictive commoning  : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 tree SSA to normal    : 0.06 ( 0%) usr 0.00 ( 4%) sys 0.06 ( 0%) wall 1010 kB ( 8%) ggc
 tree SSA verifier     : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall   10 kB ( 0%) ggc
 tree STMT verifier    : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall    0 kB ( 0%) ggc
 expand                : 0.08 ( 0%) usr 0.00 ( 2%) sys 0.77 ( 3%) wall 1163 kB ( 9%) ggc
 jump                  : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    0 kB ( 0%) ggc
 CSE                   : 0.03 ( 0%) usr 0.00 ( 2%) sys 0.04 ( 0%) wall    1 kB ( 0%) ggc
 dead store elim1      : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall  129 kB ( 1%) ggc
 dead store elim2      : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.15 ( 1%) wall  267 kB ( 2%) ggc
 CPROP 2               : 0.01 ( 0%) usr 0.00 ( 2%) sys 0.01 ( 0%) wall  132 kB ( 1%) ggc
 bypass jumps          : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  130 kB ( 1%) ggc
 CSE 2                 : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall    1 kB ( 0%) ggc
 branch prediction     : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    0 kB ( 0%) ggc
 combiner              : 0.82 ( 4%) usr 0.00 ( 0%) sys 0.91 ( 3%) wall  452 kB ( 3%) ggc
 if-conversion         : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall  352 kB ( 3%) ggc
 regmove               : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 scheduling            : 1.32 ( 7%) usr 0.00 ( 2%) sys 1.55 ( 6%) wall  194 kB ( 1%) ggc
 local alloc           : 0.14 ( 1%) usr 0.00 ( 0%) sys 0.14 ( 1%) wall   50 kB ( 0%) ggc
 global alloc          : 0.54 ( 3%) usr 0.00 ( 9%) sys 0.78 ( 3%) wall 2537 kB (19%) ggc
 reload CSE regs       : 0.18 ( 1%) usr 0.00 ( 0%) sys 0.19 ( 1%) wall  584 kB ( 4%) ggc
 load CSE after reload : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall    0 kB ( 0%) ggc
 thread pro- & epilogue: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall   24 kB ( 0%) ggc
 rename registers      : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.06 ( 0%) wall    0 kB ( 0%) ggc
 scheduling 2          : 14.45 (78%) usr 0.03 (65%) sys 19.36 (74%) wall 2099 kB (16%) ggc
 machine dep reorg     : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 final                 : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.17 ( 1%) wall    0 kB ( 0%) ggc
 symout                : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    0 kB ( 0%) ggc
 TOTAL                 : 18.63          0.04           26.28           13034 kB
Extra diagnostic checks enabled; compiler may run slowly.
Configure with --enable-checking=release to disable checks.

Maybe something for Maxim to look at?

Oops, I forgot to add the testcase:

typedef enum {
  ST_TiemanStyle,
} BrailleDisplay;
static int pendingCommand;
static int currentModifiers;
typedef struct {
  int (*updateKeys) (BrailleDisplay * brl, int *keyPressed);
} ProtocolOperations;
static const ProtocolOperations *protocol;
brl_readCommand (BrailleDisplay * brl)
{
  unsigned long int keys;
  int command;
  int keyPressed;
  unsigned char routingKeys[200];
  int routingKeyCount;
  signed char rightVerticalSensor;
  if (pendingCommand != (-1))
    {
      return command;
    }
  if (!protocol->updateKeys (brl, &keyPressed))
    {
      if (rightVerticalSensor >= 0)
        keys |= 1;
      if ((routingKeyCount == 0) && keys)
        {
          if (currentModifiers)
            {
            doChord:switch (keys);
            }
          else
            {
            doCharacter:
              command = 0X2200;
              if (keys & 0X01UL) command |= 0001;
              if (keys & 0X02UL) command |= 0002;
              if (keys & 0X04UL) command |= 0004;
              if (keys & 0X08UL) command |= 0010;
              if (keys & 0X10UL) command |= 0020;
              if (keys & 0X20UL) command |= 0040;
              if (currentModifiers & (0X0010 | 0X0200)) command |= 0100;
              if (currentModifiers & 0X0040) command |= 0200;
              if (currentModifiers & 0X0100) command |= 0X020000;
              if (currentModifiers & 0X0400) command |= 0X080000;
              if (currentModifiers & 0X0800) command |= 0X040000;
            }
          unsigned char key1 = routingKeys[0];
          if (key1 == 0)
            {
            }
          if (key1 == 1)
            if (keys)
              {
                currentModifiers |= 0X0010;
                goto doCharacter;
              }
        }
    }
  return command;
}

As a comparison, here is what I get with 20070811:

(sid)tbm@coconut0:~/x$ /usr/lib/gcc-snapshot/bin/gcc -c -O3 -ftime-report slow.c

Execution times (seconds)
 garbage collection    : 0.06 ( 2%) usr 0.00 ( 0%) sys 0.43 ( 5%) wall    0 kB ( 0%) ggc
 CFG verifier          : 0.02 ( 1%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall    0 kB ( 0%) ggc
 df live regs          : 0.02 ( 1%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 df use-def / def-use chains: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 1%) wall 0 kB ( 0%) ggc
 df reg dead/unused notes: 0.01 ( 0%) usr 0.00 ( 2%) sys 0.01 ( 0%) wall 198 kB ( 2%) ggc
 register information  : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall    0 kB ( 0%) ggc
 alias analysis        : 0.03 ( 1%) usr 0.00 ( 0%) sys 0.15 ( 2%) wall  224 kB ( 2%) ggc
 parser                : 0.00 ( 0%) usr 0.00 ( 8%) sys 0.01 ( 0%) wall   81 kB ( 1%) ggc
 tree VRP              : 0.01 ( 0%) usr 0.00 ( 3%) sys 0.01 ( 0%) wall  132 kB ( 1%) ggc
 tree operand scan     : 0.01 ( 0%) usr 0.00 ( 3%) sys 0.01 ( 0%) wall  106 kB ( 1%) ggc
 tree PRE              : 0.41 (13%) usr 0.00 ( 3%) sys 1.00 (11%) wall 1052 kB ( 9%) ggc
 tree SSA to normal    : 0.08 ( 3%) usr 0.00 ( 2%) sys 0.32 ( 3%) wall 1023 kB ( 8%) ggc
 tree SSA verifier     : 0.03 ( 1%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall   10 kB ( 0%) ggc
 tree STMT verifier    : 0.04 ( 1%) usr 0.00 ( 0%) sys 0.24 ( 3%) wall    0 kB ( 0%) ggc
 expand                : 0.02 ( 1%) usr 0.01 (12%) sys 0.03 ( 0%) wall  571 kB ( 5%) ggc
 CSE                   : 0.04 ( 1%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall    1 kB ( 0%) ggc
 dead code elimination : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    0 kB ( 0%) ggc
 dead store elim2      : 0.04 ( 1%) usr 0.00 ( 0%) sys 0.05 ( 1%) wall  122 kB ( 1%) ggc
 CPROP 1               : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 1%) wall   97 kB ( 1%) ggc
 CPROP 2               : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  131 kB ( 1%) ggc
 bypass jumps          : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  130 kB ( 1%) ggc
 CSE 2                 : 0.02 ( 1%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    1 kB ( 0%) ggc
 combiner              : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall   23 kB ( 0%) ggc
 if-conversion         : 0.04 ( 1%) usr 0.00 ( 0%) sys 0.16 ( 2%) wall    0 kB ( 0%) ggc
 regmove               : 0.04 ( 1%) usr 0.00 ( 3%) sys 0.13 ( 1%) wall    0 kB ( 0%) ggc
 scheduling            : 0.40 (12%) usr 0.00 ( 5%) sys 1.17 (13%) wall   61 kB ( 1%) ggc
 local alloc           : 0.03 ( 1%) usr 0.00 ( 0%) sys 0.15 ( 2%) wall  162 kB ( 1%) ggc
 global alloc          : 0.35 (11%) usr 0.01 ( 9%) sys 1.03 (11%) wall 2694 kB (22%) ggc
 reload CSE regs       : 0.22 ( 7%) usr 0.00 ( 2%) sys 0.67 ( 7%) wall  686 kB ( 6%) ggc
 load CSE after reload : 0.07 ( 2%) usr 0.00 ( 2%) sys 0.18 ( 2%) wall    0 kB ( 0%) ggc
 rename registers      : 0.07 ( 2%) usr 0.00 ( 0%) sys 0.22 ( 2%) wall    3 kB ( 0%) ggc
 scheduling 2          : 1.02 (31%) usr 0.01 (11%) sys 2.50 (27%) wall 1192 kB (10%) ggc
 machine dep reorg     : 0.03 ( 1%) usr 0.00 ( 2%) sys 0.04 ( 0%) wall    1 kB ( 0%) ggc
 final                 : 0.02 ( 1%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 TOTAL                 : 3.24           0.06           9.11            12164 kB
Extra diagnostic checks enabled; compiler may run slowly.
Configure with --enable-checking=release to disable checks.

So scheduling 2 has gone from 2.50 to 19.36 seconds between 20070811 and 20071020 (both with checking enabled).

> Extra diagnostic checks enabled; compiler may run slowly.
> Configure with --enable-checking=release to disable checks.

We added this message for a reason; it seems like you should try that first. The release branches default to --enable-checking=release.
(In reply to comment #8)
> We added this message for a reason; it seems like you should try that first.
> The release branches default to --enable-checking=release.

Well, I showed that even with checking enabled the compiler was _much_ faster 2 months ago. But, OK, I'll try with checking disabled too.

Subject: Re: [4.3 Regression] slow compilation on ia64

On 27 Oct 2007 18:08:21 -0000, tbm at cyrius dot com <gcc-bugzilla@gcc.gnu.org> wrote:
> Well, I showed that even with checking enabled the compiler was _much_ faster
> 2 months ago. But, ok, I'll try with checking disabled too.

Well someone (maybe DF) could have added a lot of checking. -- Pinski

(In reply to comment #10)
> Well someone (maybe DF) could have added a lot of checking.

OK, good point. I'll report my findings in a few hours.

Same results without checking (actually even slower -- is that possible?):

(sid)tbm@coconut0:~/tmp/gcc/gcc-4.3-20071027-r129674-no-checking/gcc$ ./xgcc -B. -ftime-report -O3 -c ~/slow.c

Execution times (seconds)
 df live regs          : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 df live&initialized regs: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  0 kB ( 0%) ggc
 df reg dead/unused notes: 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 142 kB ( 1%) ggc
 register information  : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 0%) wall    0 kB ( 0%) ggc
 alias analysis        : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall  224 kB ( 2%) ggc
 tree VRP              : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.11 ( 0%) wall  132 kB ( 1%) ggc
 tree PRE              : 0.37 ( 2%) usr 0.00 ( 3%) sys 0.64 ( 1%) wall 1052 kB ( 8%) ggc
 tree SSA to normal    : 0.06 ( 0%) usr 0.00 ( 3%) sys 0.06 ( 0%) wall 1010 kB ( 8%) ggc
 expand                : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall 1182 kB ( 9%) ggc
 CSE                   : 0.03 ( 0%) usr 0.00 ( 7%) sys 0.14 ( 0%) wall    1 kB ( 0%) ggc
 dead store elim2      : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall  267 kB ( 2%) ggc
 CPROP 2               : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  132 kB ( 1%) ggc
 bypass jumps          : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  130 kB ( 1%) ggc
 CSE 2                 : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.28 ( 1%) wall    0 kB ( 0%) ggc
 combiner              : 0.81 ( 4%) usr 0.00 ( 3%) sys 1.77 ( 3%) wall  452 kB ( 4%) ggc
 if-conversion         : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall  352 kB ( 3%) ggc
 regmove               : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 scheduling            : 1.34 ( 6%) usr 0.00 ( 0%) sys 3.53 ( 7%) wall  194 kB ( 2%) ggc
 local alloc           : 0.14 ( 1%) usr 0.00 ( 0%) sys 0.25 ( 0%) wall   50 kB ( 0%) ggc
 global alloc          : 0.53 ( 2%) usr 0.00 ( 3%) sys 0.70 ( 1%) wall 2537 kB (20%) ggc
 reload CSE regs       : 0.17 ( 1%) usr 0.00 ( 0%) sys 0.24 ( 0%) wall  584 kB ( 5%) ggc
 load CSE after reload : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall    0 kB ( 0%) ggc
 rename registers      : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 scheduling 2          : 18.96 (83%) usr 0.02 (66%) sys 43.24 (84%) wall 1970 kB (15%) ggc
 final                 : 0.02 ( 0%) usr 0.00 ( 3%) sys 0.12 ( 0%) wall    0 kB ( 0%) ggc
 TOTAL                 : 22.83          0.03           51.54           12913 kB

What happens if you compile with -O3 -fno-tree-vectorize ?

(In reply to comment #13)
> What happens if you compile with -O3 -fno-tree-vectorize ?

It's still slow:

(sid)tbm@coconut0:~/tmp/gcc/gcc-4.3-20071027-r129674-no-checking/gcc$ ./xgcc -B. -ftime-report -O3 -fno-tree-vectorize -c ~/slow.c

Execution times (seconds)
 callgraph construction: 0.00 ( 0%) usr 0.00 ( 2%) sys 0.07 ( 0%) wall   13 kB ( 0%) ggc
 callgraph optimization: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall    2 kB ( 0%) ggc
 df reaching defs      : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.08 ( 0%) wall    0 kB ( 0%) ggc
 df live regs          : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall    0 kB ( 0%) ggc
 df live&initialized regs: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  0 kB ( 0%) ggc
 df reg dead/unused notes: 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 142 kB ( 1%) ggc
 register information  : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 alias analysis        : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall  224 kB ( 2%) ggc
 parser                : 0.00 ( 0%) usr 0.00 ( 1%) sys 0.04 ( 0%) wall   83 kB ( 1%) ggc
 inline heuristics     : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    0 kB ( 0%) ggc
 tree gimplify         : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall   14 kB ( 0%) ggc
 tree CFG construction : 0.00 ( 0%) usr 0.00 ( 1%) sys 0.02 ( 0%) wall   23 kB ( 0%) ggc
 tree CFG cleanup      : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 1018 kB ( 8%) ggc
 tree VRP              : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  132 kB ( 1%) ggc
 tree copy propagation : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall   24 kB ( 0%) ggc
 tree PRE              : 0.37 ( 2%) usr 0.00 ( 0%) sys 0.47 ( 1%) wall 1052 kB ( 8%) ggc
 tree SSA to normal    : 0.06 ( 0%) usr 0.00 ( 1%) sys 0.06 ( 0%) wall 1010 kB ( 8%) ggc
 expand                : 0.04 ( 0%) usr 0.00 ( 2%) sys 0.37 ( 1%) wall 1182 kB ( 9%) ggc
 forward prop          : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall    2 kB ( 0%) ggc
 CSE                   : 0.03 ( 0%) usr 0.00 ( 1%) sys 0.03 ( 0%) wall    1 kB ( 0%) ggc
 dead store elim2      : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall  267 kB ( 2%) ggc
 CPROP 2               : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  132 kB ( 1%) ggc
 bypass jumps          : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall  130 kB ( 1%) ggc
 CSE 2                 : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.14 ( 0%) wall    0 kB ( 0%) ggc
 branch prediction     : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 combiner              : 0.82 ( 3%) usr 0.00 ( 0%) sys 1.66 ( 4%) wall  452 kB ( 4%) ggc
 if-conversion         : 0.02 ( 0%) usr 0.00 ( 1%) sys 0.03 ( 0%) wall  352 kB ( 3%) ggc
 regmove               : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 scheduling            : 1.34 ( 5%) usr 0.00 ( 0%) sys 2.99 ( 7%) wall  194 kB ( 2%) ggc
 local alloc           : 0.14 ( 1%) usr 0.00 ( 0%) sys 0.34 ( 1%) wall   50 kB ( 0%) ggc
 global alloc          : 0.53 ( 2%) usr 0.00 ( 1%) sys 1.15 ( 3%) wall 2537 kB (20%) ggc
 reload CSE regs       : 0.17 ( 1%) usr 0.00 ( 0%) sys 0.36 ( 1%) wall  584 kB ( 5%) ggc
 load CSE after reload : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.09 ( 0%) wall    0 kB ( 0%) ggc
 if-conversion 2       : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.13 ( 0%) wall    0 kB ( 0%) ggc
 rename registers      : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall    0 kB ( 0%) ggc
 scheduling 2          : 20.44 (84%) usr 0.08 (84%) sys 31.73 (79%) wall 1970 kB (15%) ggc
 final                 : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall    0 kB ( 0%) ggc
 TOTAL                 : 24.34          0.10           40.40           12913 kB

Compared to 20070803 with -O3 -fno-tree-vectorize, there are now 100 times more calls to rtx_needs_barrier and 44 times more calls to safe_group_barrier_needed. The latter is horribly expensive, e.g. copying around 401 * sizeof (struct reg_write_state) == 1604 bytes several times. I haven't analyzed why exactly there are so many more safe_group_barrier_needed calls, but they are certainly much more common than direct group_barrier_needed calls on this testcase (14579701 safe_group_barrier_needed calls, 14604168 group_barrier_needed calls). But if so, the only thing that call cares about is the return value; all the state is thrown away.
From what I see, the need_barrier return value is ORed together from all the recursive calls; couldn't we gain something by returning 1 immediately whenever one of the recursive calls returns non-zero?

Subject: Re: [4.3 Regression] slow compilation on ia64 (postreload scheduling)
jakub at gcc dot gnu dot org wrote:
> ------- Comment #15 from jakub at gcc dot gnu dot org 2007-10-28 19:10 -------
> Compared to 20070803 with -O3 -fno-tree-vectorize there are now 100 times more
> calls to rtx_needs_barrier and 44 times more calls to
> safe_group_barrier_needed.
> E.g. the latter is horribly expensive, e.g. copying around 401 * sizeof (struct
> reg_write_state) == 1604 bytes several times.
The underlying problem is that the list of ready-to-schedule instructions
has become larger than it was before, and the scheduler tends to slow
down as that list grows. There is already a workaround for this problem
(limiting the ready list when it gets too large; see
PARAM_MAX_SCHED_READY_INSNS), but it doesn't seem to help enough in this case.
Created attachment 14429 [details]
rws_insn.patch
Just a side note: maintaining the rws_insn array seems horribly expensive to me. For each regno only one bit is actually used, just to check one gcc_assert, and only two regnos are actually checked in some other code. So memsetting and maintaining a 1604-byte array all the time looks like overkill; a bitmap would do just fine, or, if we simply remove that gcc_assert when not ENABLE_CHECKING, we need just 2 bits altogether instead of those 1604 bytes.
This doesn't help much on this testcase (it does not address the algorithmic issue), but the difference is already noticeable.
scheduling 2 : 10.60 (88%) usr 0.00 ( 0%) sys 10.60 (88%) wall 1970 kB (15%) ggc
went down to
scheduling 2 : 8.99 (86%) usr 0.01 (50%) sys 9.00 (86%) wall 1970 kB (15%) ggc
with this patch and --enable-checking=release, so about 14% speedup in wall time for the whole compilation of this file.
Another trivial patch that improves speed is:

--- ia64.c	(revision 129700)
+++ ia64.c	(working copy)
@@ -5310,11 +5310,11 @@ ia64_safe_type (rtx insn)
 
 struct reg_write_state
 {
-  unsigned int write_count : 2;
-  unsigned int first_pred : 16;
-  unsigned int written_by_fp : 1;
-  unsigned int written_by_and : 1;
-  unsigned int written_by_or : 1;
+  unsigned short write_count : 2;
+  unsigned short first_pred : 10;
+  unsigned short written_by_fp : 1;
+  unsigned short written_by_and : 1;
+  unsigned short written_by_or : 1;
 };
 
 /* Cumulative info for the current instruction group.  */

which cuts the size of the rws_sum and rws_saved arrays in half (1604 to 802 bytes). With both patches in I get:

scheduling 2 : 6.86 (82%) usr 0.01 (50%) sys 6.87 (82%) wall 1970 kB (15%) ggc

or a 31% wall-time speedup with both patches together. first_pred is either 0 or PR_REG(0) through PR_REG(63), so it certainly fits into a 10-bit bitfield. If needed it would even fit into 6 bits (when pred == 0, write_count will already be 2, and we could subtract PR_REG(0)), but that's still too big to squeeze into 1 byte per register. Even when this bug is fixed for real, both changes IMHO make sense anyway (the first patch could perhaps use some cleanup, nice macros to hide it or something).

Actually, we probably don't need to write to the rws_sum array at all when in safe_group_barrier_needed, and then we wouldn't need to copy it around (save and restore it) at all:

--- config/ia64/ia64.c~	2007-10-28 22:00:24.000000000 +0100
+++ config/ia64/ia64.c	2007-10-28 22:04:26.000000000 +0100
@@ -5353,6 +5353,7 @@ static int rtx_needs_barrier (rtx, struc
 static void init_insn_group_barriers (void);
 static int group_barrier_needed (rtx);
 static int safe_group_barrier_needed (rtx);
+static int in_safe_group_barrier;
 
 /* Update *RWS for REGNO, which is being written by the current instruction,
    with predicate PRED, and associated register flags in FLAGS.  */
@@ -5407,7 +5408,8 @@ rws_access_regno (int regno, struct reg_
     {
     case 0:
       /* The register has not been written yet.  */
-      rws_update (regno, flags, pred);
+      if (!in_safe_group_barrier)
+	rws_update (regno, flags, pred);
       break;
 
     case 1:
@@ -5421,7 +5423,8 @@ rws_access_regno (int regno, struct reg_
 	;
       else if ((rws_sum[regno].first_pred ^ 1) != pred)
 	need_barrier = 1;
-      rws_update (regno, flags, pred);
+      if (!in_safe_group_barrier)
+	rws_update (regno, flags, pred);
       break;
 
     case 2:
@@ -5433,8 +5436,11 @@ rws_access_regno (int regno, struct reg_
 	;
       else
 	need_barrier = 1;
-      rws_sum[regno].written_by_and = flags.is_and;
-      rws_sum[regno].written_by_or = flags.is_or;
+      if (!in_safe_group_barrier)
+	{
+	  rws_sum[regno].written_by_and = flags.is_and;
+	  rws_sum[regno].written_by_or = flags.is_or;
+	}
       break;
 
     default:
@@ -6099,17 +6105,16 @@ int safe_group_barrier_needed_cnt[5];
 static int
 safe_group_barrier_needed (rtx insn)
 {
-  struct reg_write_state rws_saved[NUM_REGS];
   int saved_first_instruction;
   int t;
 
-  memcpy (rws_saved, rws_sum, NUM_REGS * sizeof *rws_saved);
   saved_first_instruction = first_instruction;
+  in_safe_group_barrier = 1;
 
   t = group_barrier_needed (insn);
 
-  memcpy (rws_sum, rws_saved, NUM_REGS * sizeof *rws_saved);
   first_instruction = saved_first_instruction;
+  in_safe_group_barrier = 0;
 
   return t;
 }

Together with the other patches this gives (everything measured with an x86_64-linux -> ia64-linux cross; it would need to be measured on native ia64-linux):

scheduling 2 : 5.20 (78%) usr 0.01 (50%) sys 5.20 (77%) wall 1970 kB (15%) ggc

or a ~45% speedup on this testcase.

Created attachment 14433 [details]
gcc43-ia64-rws-speedups.patch
All 3 patches together, with macros.
The most important cause of the slowdown, e.g. compared to 4.2.x, is the totally insane thing -ftree-pre creates, though. For -O3 -fno-tree-vectorize -fdump-tree-all pr33922.c, wc -l shows

2361 pr33922.c.090t.sink

while for -O3 -fno-tree-vectorize -fno-tree-pre -fdump-tree-all pr33922.c

 324 pr33922.c.090t.sink

and of course the size of the assembly corresponds to this:

11400 pr33922.s	# -O3 -fno-tree-vectorize
  195 pr33922.s	# -O3 -fno-tree-vectorize -fno-tree-pre

The -O3 -fno-tree-vectorize -fdump-tree-pre-all dump contains 2081 ^Created.*value lines, and all those constants are actually created, and many PHI nodes as well. I believe this might be what nickc was trying to fix today by adding a limit, but wasn't that limit huge (131072 bits)?

Subject: Re: [4.3 Regression] slow compilation on ia64 (postreload scheduling)
On Thu, 1 Nov 2007, jakub at gcc dot gnu dot org wrote:
> ------- Comment #22 from jakub at gcc dot gnu dot org 2007-11-01 20:59 -------
> The most important cause of the slowdown e.g. compared to 4.2.x is the totally
> insane thing -ftree-pre creates though.
> For -O3 -fno-tree-vectorize -fdump-tree-all pr33922.c
> wc -l shows
> 2361 pr33922.c.090t.sink
> while for -O3 -fno-tree-vectorize -fno-tree-pre -fdump-tree-all pr33922.c
> 324 pr33922.c.090t.sink
> and of course the size of assembly corresponds to this:
> 11400 pr33922.s # -O3 -fno-tree-vectorize
> 195 pr33922.s # -O3 -fno-tree-vectorize -fno-tree-pre
>
> -O3 -fno-tree-vectorize -fdump-tree-pre-all dump contains
> 2081 ^Created.*value lines and all those constants are actually created and
> many PHI nodes as well. I believe this might be what nickc was trying to fix
> today by adding a limit, but wasn't that limit huge (131072 bits)?
The limit was to cut off exponential behavior. But yes, PRE (and even more
so partial-partial PRE) is known to increase code size. Looks like some
better heuristics are needed.
Richard.
Subject: Bug 33922

Author: spop
Date: Mon Nov  5 15:42:30 2007
New Revision: 129901

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=129901

Log:
2007-11-05  Nick Clifton  <nickc@redhat.com>
	    Sebastian Pop  <sebastian.pop@amd.com>

	PR tree-optimization/32540
	PR tree-optimization/33922
	* doc/invoke.texi: Document PARAM_MAX_PARTIAL_ANTIC_LENGTH.
	* tree-ssa-pre.c: Include params.h.
	(compute_partial_antic_aux): Use PARAM_MAX_PARTIAL_ANTIC_LENGTH
	to limit the maximum length of the PA set for a given block.
	* Makefile.in: Add a dependency upon params.h for tree-ssa-pre.c.
	* params.def (PARAM_MAX_PARTIAL_ANTIC_LENGTH): New parameter.

	* gcc.dg/tree-ssa/pr32540-1.c: New.
	* gcc.dg/tree-ssa/pr32540-2.c: New.
	* gcc.dg/tree-ssa/pr33922.c: New.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr32540-1.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr32540-2.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr33922.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/doc/invoke.texi
    trunk/gcc/params.def
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-pre.c

Fixed.