This is the mail archive of the
mailing list for the GCC project.
Re: code alignment on K6-2 and Athlon
- To: Frank Klemm <pfk at fuchs dot offl dot uni-jena dot de>
- Subject: Re: code alignment on K6-2 and Athlon
- From: Jan Hubicka <jh at suse dot cz>
- Date: Sat, 1 Sep 2001 18:45:06 +0200
- Cc: Jan Hubicka <jh at suse dot cz>, gcc at gcc dot gnu dot org
- References: <email@example.com> <Pine.LNX.firstname.lastname@example.org> <20010831221749.C10296@atrey.karlin.mff.cuni.cz> <20010901152838.C1368@fuchs.offl.uni-jena.de>
> I've tested the effect of code alignment on an Athlon.
> I didn't find the slightest effect of code alignment on
> performance. The program always takes 10.287 seconds,
> sometimes a millisecond more, sometimes a millisecond less.
Athlon is less sensitive to alignment than K6 or P3, but it is still sensitive.
Definitely, on my simple benchmarks I can get fluctuations of +-20% by changing
the alignment.
I don't 100% understand some of the slowdowns, but the main issue is with tight
loops having the loopback branch very near a cache line boundary. This may
cause decoder stalls frequent enough to have a visible effect on performance.
In some loops it happens, in others it does not, depending on how the
loop is bottlenecked and whether the CPU is able to hide the stall.
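To check whether a given loop is a candidate for this, one can compute how far
the end of its loopback branch lies from the nearest cache line boundary. A
rough sketch in C follows; the helper name loopback_boundary_distance is made
up for illustration, the loop is assumed to end with its loopback branch, and
the line size is a parameter because it differs between CPUs. What counts as
"very near" is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: distance in bytes from the end of the loopback
   branch (the address just past the loop) to the nearest cache line
   boundary, looking in both directions.  0 means the branch ends
   exactly on a boundary. */
static size_t loopback_boundary_distance(uintptr_t loop_start, size_t loop_len,
                                         size_t line_size)
{
    /* line_size must be a nonzero power of two */
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);

    /* offset of the loop end within its cache line */
    size_t off = (size_t)((loop_start + loop_len) & (line_size - 1));
    size_t up = line_size - off;   /* bytes up to the next boundary */

    return off < up ? off : up;
}
```

For example, a 30-byte loop starting on a 32-byte boundary ends 2 bytes before
the next boundary, while a 16-byte loop ends in the middle of the line.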
Also, wisely used code alignment reduces the frequency of cache line misses.
This of course shows up quite drastically on any modern CPU when the given
benchmark fits in the cache almost perfectly.
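The cache footprint argument can be made concrete with a little arithmetic.
The sketch below counts how many cache lines a block of code touches depending
on where it starts; the helper name lines_touched is made up for this example,
and the 32-byte line size in the note after it is only an illustrative value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of cache lines touched by `size` bytes
   of code starting at address `start`. */
static size_t lines_touched(uintptr_t start, size_t size, size_t line_size)
{
    /* line_size must be a nonzero power of two */
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);

    if (size == 0)
        return 0;

    uintptr_t first = start / line_size;              /* line of first byte */
    uintptr_t last = (start + size - 1) / line_size;  /* line of last byte  */

    return (size_t)(last - first + 1);
}
```

With 32-byte lines, a 64-byte loop starting on a line boundary occupies 2
lines, but the same loop starting at offset 16 occupies 3.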
> IIRC the K6-2 was very sensitive to code alignment, but I didn't understand
> the system behind it. The runtime pattern repeats every 32 bytes. That is
The K6 has been limited by decoder performance.
One problem has been in pairing. When a branch destination has been one
instruction away from a cache line boundary, the second decoder stalled.
The other problem is with the prefetching logic, which always operates locally
within a cache line. When a given instruction crosses a cache line boundary in
such a way that its length cannot be determined from the part in the first
line, the second decoder may stall. Additionally, if it is not possible to
determine the opcode (in the case of FP instructions or two-byte opcodes), the
instruction is always vector decoded, causing a stall of at least 2 cycles
(similarly in the case of some addressing modes).
These stalls are expensive enough to make it worthwhile to emit .p2align
in front of each instruction with a 2-byte opcode to avoid the crossing.
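As a sketch of the arithmetic behind such padding: the gas directive
".p2align N,,MAX" pads to a 2^N boundary only when that takes at most MAX
bytes, so a MAX of (instruction length - 1) pads exactly when the instruction
would otherwise cross the line. The C helpers below model this; the names
crosses_line and padding_to_avoid_crossing are made up for illustration, and
the 32-byte line is an assumption based on the 32-byte pattern quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if an instruction of `len` bytes placed
   at `addr` crosses a cache line boundary. */
static int crosses_line(uintptr_t addr, size_t len, size_t line_size)
{
    /* line_size must be a nonzero power of two, len must fit in a line */
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    assert(len >= 1 && len <= line_size);

    return (size_t)(addr & (line_size - 1)) + len > line_size;
}

/* Hypothetical helper: padding bytes that a ".p2align N,,len-1" with
   2^N == line_size would insert at `addr`: pad to the next boundary
   only if that needs at most len-1 bytes, otherwise pad nothing. */
static size_t padding_to_avoid_crossing(uintptr_t addr, size_t len,
                                        size_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    assert(len >= 1 && len <= line_size);

    /* bytes up to the next boundary, 0 if already aligned */
    size_t to_boundary =
        (line_size - (size_t)(addr & (line_size - 1))) & (line_size - 1);
    size_t max_skip = len - 1;

    return to_boundary <= max_skip ? to_boundary : 0;
}
```

For a 3-byte instruction and 32-byte lines, offset 30 gets 2 bytes of padding,
while offset 29 gets none because the instruction still fits in the line.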
> easy to understand. But the relation between the misalignment (0...31) and
> the runtime behaviour I didn't understand. Misalignments of 29, 30, 31 were
> seldom good.
> That's also easy to understand. But the rest was a random function and an
> alignment of 0 was not always the best alignment. Often 2, 4, 7, 9 gave the
> best result. Performance differences were a few percent.
> Sorry, I can't repeat these tests from 1999, my K6-2 died in spring.
> The Pentium II was much, much less sensitive to such misalignments.
Agreed, the K6 has been kind of an extreme case.
> Frank Klemm