Re: [patch] reload1.c: Very minor speedup.
- From: Mike Stump <mrs at apple dot com>
- To: Paolo Carlini <pcarlini at suse dot de>
- Cc: Kazu Hirata <kazu at cs dot umass dot edu>, gcc-patches at gcc dot gnu dot org
- Date: Fri, 6 Feb 2004 15:53:06 -0800
- Subject: Re: [patch] reload1.c: Very minor speedup.
On Friday, February 6, 2004, at 02:34 PM, Paolo Carlini wrote:
Attached is a patch to micro-optimize the reset of can_eliminate in
reload1.c.
- if (ep->from_rtx == x && ep->can_eliminate)
+ if (ep->from_rtx == x)
        ep->can_eliminate = 0;
If you have got two spare minutes, could you possibly explain a bit?
I mean, is it because the cost of a test (&& ep->can_eliminate) is
comparable to that of an assignment (ep->can_eliminate = 0), never
much smaller? Is it true on every architecture?
Smaller? He did say speedup, not smaller!
Modern machines are fascinating. You really want to grab a high-level
tool like Shark (a Mac developer tool) and run it on your favorite code,
then take a look at the results. Once you train up for a few days,
you'll discover just what you've been missing. Neat results like: out
of the 133,000 instructions in this one file, 4 of them, no more, no
less, account for 90% of the time.
The change assumes a load/store unit that isn't bandwidth-limited and a
conditional branch that will be slow because it is mispredicted. It
seems like it might be the right choice, though of course I'd almost
want to fire it up and watch it; still, I think this is so far down in
the noise that I'll pass up the opportunity.
[ pause ]
Ok, so I tried this, and found:
0.873 for V1 and 0.876 for V2:
/* Padded structure with the two fields of interest.  */
struct S
{
  int pad[3];
  int can_eliminate;
  int pad2[4];
  int rtx;
  int tortx;
} er[1000];

int
main (void)
{
  int i;
  struct S *s;

  for (i = 0; i < 100000; ++i)
    for (s = er; s < &er[1000]; ++s)
      {
#ifdef V1
        /* V1: keep the can_eliminate test.  */
        if (s->rtx == 12 && s->can_eliminate)
#else
        /* V2: the patched form, without the test.  */
        if (s->rtx == 12)
#endif
          s->can_eliminate = 0;
      }

  return 0;
}
and using a much larger er that might get us out of cache more:
V1: 0.448 V2: 0.459
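For anyone who wants to repeat the comparison in a single binary
instead of rebuilding with and without -DV1, a rough harness along
these lines should do; timings depend heavily on the machine, the
compiler, and the optimization level, and calling through a function
pointer isn't exactly the same experiment as the #ifdef above, so treat
it only as a sketch:

#include <stdio.h>
#include <time.h>

#define N 1000
#define ITERS 100000

/* Same layout as the test structure above.  */
struct S
{
  int pad[3];
  int can_eliminate;
  int pad2[4];
  int rtx;
  int tortx;
} er[N];

/* V1: keep the extra can_eliminate test.  */
static void
loop_v1 (void)
{
  struct S *s;
  for (s = er; s < &er[N]; ++s)
    if (s->rtx == 12 && s->can_eliminate)
      s->can_eliminate = 0;
}

/* V2: drop the test; store unconditionally on a match.  */
static void
loop_v2 (void)
{
  struct S *s;
  for (s = er; s < &er[N]; ++s)
    if (s->rtx == 12)
      s->can_eliminate = 0;
}

/* Run FN over the array ITERS times and report CPU seconds.  */
static double
time_it (void (*fn) (void))
{
  clock_t t0 = clock ();
  int i;
  for (i = 0; i < ITERS; ++i)
    fn ();
  return (double) (clock () - t0) / CLOCKS_PER_SEC;
}

int
main (void)
{
  printf ("V1: %.3f  V2: %.3f\n", time_it (loop_v1), time_it (loop_v2));
  return 0;
}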
So, on my machine, the proposed patch looks like it is slower. Kazu,
did you measure a speedup? On what machine? On the whole, changing
code for a near-zero win is probably not very interesting; on the other
hand, V2 is 12 bytes smaller, which tilts towards V2, plus it compiles
faster! :-)