.p2align
Mihai Donțu
mihai.dontu@gmail.com
Tue Sep 4 10:26:00 GMT 2007
On Monday 27 August 2007, tim prince wrote:
> Mihai Donțu wrote:
> > ".p2align 4,,15" I said to myself: "good
> > to know" and did the necessary changes in my "*.S" files.
> > Indeed, what used to be nasty unaligned code is now nicely placed on a
> > 16-byte boundary. However, to my disappointment, this did not make the code
> > run faster :(. "Au contraire", it made it run slower. So why is gcc using it?
> > Or am I missing something?
> >
> > I've tested this on an AMD64 (Turion @ 2.2GHz) machine.
> >
> Did you check your object files, to see whether your linker has observed
> those alignments? Several years into the SSE era, gnu binutils for
> Windows was still configured so as to disable 16-byte alignment. I'm
> told it's still that way on Solaris.
> As the name indicates, the specific version you quote is designed for
> P-II. It won't have as remarkable an effect on other CPUs; I don't even
> know whether anyone has checked this out on Turion. In any case, it
> would normally show a significant gain only for the head of a frequently
> executed loop, and likely only in the case where it avoids an orphan
> partial instruction at the loop head.
> You didn't even say whether you are running in 64-bit mode, where there
> are more possibilities for orphans, such as where the first 2 bytes of
> an LCP instruction form an orphan. Depending on your specific
> combination of circumstances, you might be interested in trying
> variations, such as .p2align 4,,2.
Sorry for the delayed response. I've been extremely busy :(
So: I'm on a 64-bit Gentoo GNU/Linux, stable, gcc 4.1.2 with the latest
and greatest binutils :)
Since I use Gentoo, you can imagine I'm a speed freak :), thus I'm using
the following rule when building my files:
%.o: %.S
	gcc -c -g -pipe -ansi -std=gnu99 -W -Wall -Winline -Wdisabled-optimization \
	    -Wmissing-prototypes -march=athlon64 -fPIC -DPIC -DNDEBUG -DNMMUNIT -D_REENTRANT \
	    -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -I. -O3 -fno-exceptions \
	    -fomit-frame-pointer -o $@ $<
The assembler file I'm compiling contains 12 routines (stubs) which
make the 'calling-convention-switch' (WIN64->x86_64) (just like NDIS
Wrapper does). Yes, this is the _new_shiny_thing_ these days :)
Now, these stubs (or as I call them: trampolines) add a fair amount of
delay to the program's execution, which is why I turned
to '.p2align' (never ONCE did I make the association between the name and
Intel PII).
Why I need all this: I have a tool that loads a DLL, provides the basic
needs for it (a couple of kernel32.dll routines) and gives control to it.
I use this DLL to perform a certain type of analysis on some files (7447
of them to be exact).
This DLL calls HeapAlloc() and HeapFree() (along with some others) frequently, via
code like this (automatically generated at link time or when GetProcAddress() is
called):
movq <pe_address>, %r10 /* the address of the routine that needs to be called */
movq <stub_address>, %r11 /* the address of the trampoline */
jmpq *%r11 /* jump to the required trampoline */
A trampoline looks like this:
00000000000003f0 <x86_64pc5>:
 3f0:   48 89 7c 24 f8    mov    %rdi,0xfffffffffffffff8(%rsp)
 3f5:   48 89 74 24 f0    mov    %rsi,0xfffffffffffffff0(%rsp)
 3fa:   48 83 ec 10       sub    $0x10,%rsp
 3fe:   48 89 cf          mov    %rcx,%rdi
 401:   48 89 d6          mov    %rdx,%rsi
 404:   4c 89 c2          mov    %r8,%rdx
 407:   4c 89 c9          mov    %r9,%rcx
 40a:   4c 8b 44 24 38    mov    0x38(%rsp),%r8
 40f:   48 31 c0          xor    %rax,%rax
 412:   41 ff d2          callq  *%r10
 415:   48 83 c4 10       add    $0x10,%rsp
 419:   48 8b 7c 24 f8    mov    0xfffffffffffffff8(%rsp),%rdi
 41e:   48 8b 74 24 f0    mov    0xfffffffffffffff0(%rsp),%rsi
 423:   c3                retq
 424:   66                data16
 425:   66                data16
 426:   66                data16
 427:   90                nop
 428:   66                data16
 429:   66                data16
 42a:   66                data16
 42b:   90                nop
 42c:   66                data16
 42d:   66                data16
 42e:   66                data16
 42f:   90                nop
 430:   /* next trampoline (x86_64pc6) */
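For reference, the padding rule behind '.p2align 4,,15' can be sketched in C. This is only an illustration of the rule as documented for GNU as (align to 2^4 = 16 bytes, but emit nothing at all if that would take more than the given maximum number of bytes). The helper name p2align_padding is made up for this example and is not part of binutils.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns how many filler bytes ".p2align pow2,,max_skip" would emit
 * when the location counter is at addr.
 */
uint64_t p2align_padding(uint64_t addr, unsigned pow2, uint64_t max_skip)
{
    assert(pow2 < 64);                       /* keep the shift well defined */

    uint64_t align = (uint64_t)1 << pow2;    /* 4 -> 16-byte boundary */
    uint64_t pad = (align - (addr & (align - 1))) & (align - 1);

    /* If more than max_skip bytes are needed, no alignment is done. */
    return (pad > max_skip) ? 0 : pad;
}
```

In the listing above the retq ends at 0x423, so the location counter is at 0x424. With a maximum of 15 the helper gives 12, which matches the twelve filler bytes from 0x424 to 0x42f. With a maximum of 7 or 2 it gives 0, meaning the next trampoline would simply start at 0x424.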
Now, without '.p2align' the tool analyses all 7447 files in approx. 4:30 minutes.
With '.p2align 4,,15' it rises to approx. 4:50 minutes (not much, but I'm going
easy, with 790MB of files - when I'm done optimizing, this tool will "dive" into
tens of GB).
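To put the difference in proportion, here is a small C sketch of the arithmetic. It takes the two timings as 4 min 30 s and 4 min 50 s (an assumption about the notation used above). The function names are invented for this example.

```c
#include <assert.h>

/* Convert a minutes:seconds wall-clock reading to seconds. */
int to_seconds(int minutes, int seconds)
{
    assert(minutes >= 0 && seconds >= 0 && seconds < 60);
    return minutes * 60 + seconds;
}

/* Relative slowdown of after_s compared to before_s, in percent. */
double slowdown_percent(int before_s, int after_s)
{
    assert(before_s > 0);
    return 100.0 * (double)(after_s - before_s) / (double)before_s;
}
```

Under that reading, slowdown_percent(to_seconds(4, 30), to_seconds(4, 50)) comes to 20 extra seconds on 270, a little over 7%.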
I'm looking at what gcc does and it seems to believe that '.p2align 4,,15' is *the*
alignment to use for *all* functions and some jump points (jump points usually get
'.p2align 4,,7')
I've tried '.p2align 4,,2': it is *slightly* faster than '.p2align 4,,15' but not
as fast as without '.p2align'.
Anyway, my question was: why is gcc so "fond" of '.p2align' (it uses it in *all*
situations) when it does not always produce faster code? Other than that, gcc does
a great job! ;)
--
Mihai Donțu