This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Some optimization thoughts (and thanks!)
- To: gcc at gcc dot gnu dot org
- Subject: Some optimization thoughts (and thanks!)
- From: At150bogomips at aol dot com
- Date: Tue, 1 May 2001 11:08:39 EDT
Here are some optimization suggestions:
* Inlining of all functions with a single caller: This would presumably
be done after short-function inlining (to allow wrapper functions to
be inlined into their callers, such wrappers are likely have condition
checks and such which can be simplified if inlined into the more basic
caller; but if the function that a wrapper calls is inlined first, the
resulting function may be too large to inline for multiple basic
callers). This removes the performance penalty of breaking code into
graspable portions. Unlike regular inlining, this reduces code size
(by removing call/return overhead). This should not be that difficult
to implement, nor add much to compile time, particularly if the
call-tree for a program is retained between compilings and partially
updated after source modification.
* Consider moving static condition check outside of simple loop.
Although branch prediction hardware would correctly predict the
direction of the branch (possibly after one or two warm-up cycles),
for a simple loop, the extra instruction could be an excessive extra
cost. This would bloat code somewhat, but might be a notable win
for simple enough, long loops. (E.g., "for (x = 0; x < limit; x++)
{ if (direction==forward) { A[x] = A[x]<<1; } else
{ A[x] = A[x]>>1;} }" --OR-- "for (x = 0; x < limit; x++) { if
( (test1==true) && (A[x] > min_A) ) { A[x]--; } }")
* Provide vector-like operations using standard registers, if conditions
allow such. (E.g., summing two arrays of 16-bit unsigned integers can
be done with in pairs with 32-bit registers if overflow is known not
to occur.) 64-bit machines would seem to offer even more benefit in
this, allowing a pair of 16-bit multiply and divide by a constant at
the cost of some shifting and logical operations. If gcc will include
support for detecting when vector-like operations can be done, it
might be useful to provide this optimization for chips that do not
have special vector instructions.
* When branches are expensive, it might be reasonable to translate if
statements with multiple conditions into merged condition and a single
branch rather than multiple branches. E.g., "if ((A==N) && (B==M))"
might be translated into "XOR Temp1 A N; XOR Temp2 B M; OR Temp1 Temp2
Temp1; BNEZ Temp1 L3; #branch over conditional code" rather than "XOR
Temp1 A N; BNEZ Temp1 L3; #branch over second condition test and
conditional code\\ XOR Temp1 B M; BNEZ Temp1 L3;" Removing a branch
might increase modality of the branch and decrease branch history
table aliasing, improving branch prediction. Such prediction
improvement might compensate for the overhead of performing both tests
in all cases. (This is what happened in a test case using a pair of
comparisons of a random number and a constant, at least performance
improved on an x86-based system.)
(I am disappointed that gcc does not surpass all other compilers in
optimization--gcc should be the best in everything (portability,
optimization, compile speed, standards conformance, etc.). It is
particularly disappointing that gcc seems to miss some common
optimizations (e.g., loop arrangement to optimize memory access
patterns). I am, of course, very happy that gcc is Free--a compiler is
a system component even if some vendors do not think so.)
Thank you for providing gcc!