This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: Slowdowns in code generated by GCC>=3.3
- From: Steven Bosscher <stevenb at suse dot de>
- To: Remko Troncon <remko dot troncon at cs dot kuleuven dot ac dot be>, gcc at gcc dot gnu dot org
- Date: Wed, 20 Oct 2004 15:04:07 +0200
- Subject: Re: Slowdowns in code generated by GCC>=3.3
- Organization: SUSE Labs
- References: <20041020123432.GA31922@cs.kuleuven.ac.be>
On Wednesday 20 October 2004 14:34, Remko Troncon wrote:
> Hi,
>
> I am a developer of a bytecode emulator for the Prolog language. With the
> release of GCC-3.3, our emulator was slowed down by a factor of 3 on x86
> with -O3 turned on (we didn't measure other platforms; the optimization
> flag doesn't seem to matter).
Which x86 architecture variant?
> We were hoping this was a temporary issue,
> but the situation didn't improve in any of the newer releases :(
> I don't know whether i should file this as a bug report, so i first ask
> for advice here.
Filing a bug report is only going to be useful if you can report your
problem in a way such that we can reproduce it: test case, output of
"gcc -v", etc. See http://gcc.gnu.org/bugs.html for the details ;-)
> I'll try to explain on a high level what happens. If this isn't sufficient,
> i can try to give some code, but this will take me some time to isolate the
> code. This is the situation:
> - Since the program counter in our emulator is very crucial, we use the
> 'register' and 'asm ("bx")' hints.
Is the program counter a global variable, or local? And if you remove
those hints, does that make your code worse?
I would actually expect it to improve if you remove those hints. x86
is a register starved architecture, and as the documentation mentions:
"Defining such a register variable does not reserve the register; it
remains available for other uses in places where flow control determines
the variable's value is not live. However, these registers are made
unavailable for use in the reload pass; excessive use of this feature
leaves the compiler too few available registers to compile certain
functions."
(see "info gcc", look for "Explicit Reg Vars")
For an architecture with basically only 6 usable registers, taking up
just one is probably "excessive use" already.
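
To make that comparison easy, the hint can be put behind a compile-time
switch so that both variants are built from the same source and timed.
This is only a minimal sketch: sum_until_zero and USE_REG_HINT are
made-up names for illustration, not anything from the emulator, and the
loop merely stands in for a real interpreter loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an interpreter loop that walks a code array.
 * Build with -DUSE_REG_HINT on x86 to pin the cursor to (e)bx; without
 * it, the register allocator is free to choose. */
static int sum_until_zero(const int *code)
{
    assert(code != NULL);
#if defined(USE_REG_HINT) && (defined(__i386__) || defined(__x86_64__))
    /* __asm__ is the spelling of asm that also works with -std=c99. */
    register const int *pc __asm__("bx") = code;
#else
    const int *pc = code;
#endif
    int total = 0;
    while (*pc != 0)
        total += *pc++;
    return total;
}
```

Comparing the output of "gcc -S" for both builds shows whether the hint
changes anything other than the register chosen for pc.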
> - For each instruction in the bytecode, we store the address of the label
> of the code which has to be executed for the instruction. Therefore,
> the program counter always points to an address of code to
> be executed, and after each instruction we do a
> goto **(void **)program_counter
> Previous versions of GCC keep the program counter in ebx, and do a
> jmp *(%ebx) after the instructions (as expected). The newer GCCs seem
> to unnecessarily move the program counter around between registers, and
> don't do the jmp*(%ebx) after each instruction, but seem to jump to a
> 'common' piece of code doing this jump.
Yes. Computed jumps are incredibly expensive at compile time (each one
gets a control flow edge to every label whose address is taken), so what
the compiler does is "factor" the computed jump, i.e. given,
    goto *x;
    [ ... ]
    goto *x;
    [ ... ]
    goto *x;
    [ ... ]
the compiler factors the computed jumps, which results in the following
code sequence with a much simpler control flow graph:
    goto y;
    [ ... ]
    goto y;
    [ ... ]
    goto y;
    [ ... ]
    y:
    goto *x;
The compiler is supposed to unfactor this in the basic block reordering
pass; perhaps that is not happening for your code for some reason.
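
For reference, the two shapes can be written out at source level with
GNU C's labels-as-values extension. This is a hedged sketch, not the
emulator's code: the opcodes, the function run, and the macro
FACTORED_DISPATCH are all invented for illustration. The compiler does
its factoring on its internal representation whichever way the source is
written, so the macro only shows what each form looks like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_HALT };   /* hypothetical toy opcodes */

#ifdef FACTORED_DISPATCH
#define NEXT() goto dispatch         /* every handler shares one jump */
#else
#define NEXT() goto **pc             /* every handler has its own jump */
#endif

/* Translates opcodes into label addresses (direct threading), then runs
 * them. OP_PUSH is followed by one operand cell. */
static long run(const long *bytecode, size_t n)
{
    static void *const labels[] = { &&do_push, &&do_add, &&do_halt };
    void *threaded[64];
    long stack[16];
    size_t i = 0;

    assert(n <= 64);
    while (i < n) {
        long op = bytecode[i];
        assert(op >= OP_PUSH && op <= OP_HALT);
        threaded[i++] = labels[op];
        if (op == OP_PUSH) {         /* copy the operand cell unchanged */
            assert(i < n);
            threaded[i] = (void *)(intptr_t)bytecode[i];
            i++;
        }
    }

    void **pc = threaded;
    long *sp = stack;
#ifdef FACTORED_DISPATCH
dispatch:
#endif
    goto **pc;                       /* same idea as goto **(void **)pc */

do_push:
    assert(sp < stack + 16);
    *sp++ = (long)(intptr_t)pc[1];
    pc += 2;
    NEXT();
do_add:
    assert(sp - stack >= 2);
    sp--;
    sp[-1] += sp[0];
    pc += 1;
    NEXT();
do_halt:
    assert(sp > stack);
    return sp[-1];
}
```

Looking at the assembly for both builds (with and without
-DFACTORED_DISPATCH) is one way to check whether the indirect jump ends
up duplicated after each handler or stays in a single shared block.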
> Looking at the changelog of gcc-3.3, i can only deduce this has to do with
> the new DFA scheduler, but of course i can not tell for sure.
I can tell almost for sure that this is not the problem. In GCC 3.3,
only the pentium has a DFA scheduler description; all other architecture
variants still use the old scheduler. Besides, scheduling on i386 is
local list scheduling, and your problem seems to be control flow related.
> I don't know if any of this information is useful, but we could use some
> pointers in places to look where things are going wrong in the code
> generation. The factor 3 of slowdown is really a lot.
I would first try to remove that "register ... asm (...)" junk, and try
to optimize for something more advanced than i386 (which is the default
x86 architecture, see the manual, -march=*). If that does not help,
please file a bug report including a test case as explained on bugs.html.
Gr.
Steven