.p2align

Tue Sep 4 10:26:00 GMT 2007

On Monday 27 August 2007, tim prince wrote:
> Mihai Donțu wrote:
> >  ".p2align 4,,15"  I said to myself: "good
> > to know" and did the necessary changes in my "*.S" files.
> > Indeed, what used to be nasty unaligned code is now nicely placed on a
> > 16-byte boundary. However, to my disappointment, this did not make the code
> > run faster :(. "Au contraire", it made it run slower. So why is gcc using it?
> > Or am I missing something?
> >
> > I've tested this on an AMD64 (Turion @ 2.2GHz) machine.
> >   
> Did you check your object files, to see whether your linker has observed 
> those alignments?  Several years into the SSE era, gnu binutils for 
> Windows was still configured so as to disable 16-byte alignment.  I'm 
> told it's still that way on Solaris.
> As the name indicates, the specific version you quote is designed for 
> P-II.  It won't have as remarkable an effect on other CPUs; I don't even 
> know whether anyone has checked this out on Turion.  In any case, it 
> would normally show a significant gain only for the head of a frequently 
> executed loop, and likely only in the case where it avoids an orphan 
> partial instruction at the loop head.
> You didn't even say whether you are running in 64-bit mode, where there 
> are more possibilities for orphans, such as where the first 2 bytes of 
> an LCP instruction form an orphan.  Depending on your specific 
> combination of circumstances, you might be interested in trying 
> variations, such as .p2align 4,,2.

  Sorry for the delayed response. I've been extremely busy :(

  So: I'm on a 64bit Gentoo GNU/Linux, stable, gcc 4.1.2 with the latest
  and greatest binutils :)

  Since I use Gentoo, you can imagine I'm a speed freak :), thus I'm using
  the following rule when building my files:
  %.o: %.S
          gcc -c -g -pipe -ansi -std=gnu99 -W -Wall -Winline -Wdisabled-optimization \
          -Wmissing-prototypes -march=athlon64 -fPIC -DPIC -DNDEBUG -DNMMUNIT -D_REENTRANT \
          -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -I. -O3 -fno-exceptions \
          -fomit-frame-pointer -o $@ $<

  The assembler file I'm compiling contains 12 routines (stubs) which
  perform the calling-convention switch (WIN64 -> x86_64 System V), just
  like NDIS Wrapper does. Yes, this is the _new_shiny_thing_ these days :)

  Now, these stubs (or as I call them: trampolines) add a fair amount of
  overhead to program execution, which is why I turned to '.p2align'
  (never ONCE did I make the association between the name and
  Intel PII).

  Why I need all this: I have a tool that loads a DLL, provides the basic
  needs for it (a couple of kernel32.dll routines) and gives control to it.
  I use this DLL to perform a certain type of analysis on some files (7447
  of them to be exact).

  This DLL calls HeapAlloc() and HeapFree() (along with some others) frequently,
  via code like this (generated automatically at link time or when
  GetProcAddress() is called):
     movq <pe_address>, %r10 /* the address of the routine that needs to be called */
     movq <stub_address>, %r11 /* the address of the trampoline */
     jmpq *%r11 /* jump to the required trampoline */

  A trampoline looks like this:
  00000000000003f0 <x86_64pc5>:
  3f0:   48 89 7c 24 f8          mov    %rdi,0xfffffffffffffff8(%rsp)
  3f5:   48 89 74 24 f0          mov    %rsi,0xfffffffffffffff0(%rsp)
  3fa:   48 83 ec 10             sub    $0x10,%rsp
  3fe:   48 89 cf                mov    %rcx,%rdi
  401:   48 89 d6                mov    %rdx,%rsi
  404:   4c 89 c2                mov    %r8,%rdx
  407:   4c 89 c9                mov    %r9,%rcx
  40a:   4c 8b 44 24 38          mov    0x38(%rsp),%r8
  40f:   48 31 c0                xor    %rax,%rax
  412:   41 ff d2                callq  *%r10
  415:   48 83 c4 10             add    $0x10,%rsp
  419:   48 8b 7c 24 f8          mov    0xfffffffffffffff8(%rsp),%rdi
  41e:   48 8b 74 24 f0          mov    0xfffffffffffffff0(%rsp),%rsi
  423:   c3                      retq
  424:   66                      data16
  425:   66                      data16
  426:   66                      data16
  427:   90                      nop
  428:   66                      data16
  429:   66                      data16
  42a:   66                      data16
  42b:   90                      nop
  42c:   66                      data16
  42d:   66                      data16
  42e:   66                      data16
  42f:   90                      nop
  430: /* next trampoline (x86_64pc6) */
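
  For reference, the padding in the listing above can be reproduced with a
  little arithmetic. The sketch below is not part of my tool; it is a small
  Python helper (the name p2align_padding is made up for illustration) that
  models the rule from the GNU as manual: '.p2align N,,MAX' advances to the
  next 2^N-byte boundary, unless that would take more than MAX bytes, in
  which case no padding is inserted at all.

```python
def p2align_padding(addr, power, max_skip=None):
    """Bytes of padding '.p2align power,,max_skip' would insert at addr.

    Hypothetical helper for illustration; it models the documented rule,
    it does not call the assembler.
    """
    boundary = 1 << power
    pad = -addr % boundary          # distance to the next boundary (0 if aligned)
    if max_skip is not None and pad > max_skip:
        return 0                    # too far: the alignment is skipped entirely
    return pad


if __name__ == "__main__":
    end_of_trampoline = 0x424       # address right after 'retq' in the listing
    for max_skip in (15, 7, 2):
        pad = p2align_padding(end_of_trampoline, 4, max_skip)
        print(f".p2align 4,,{max_skip}: {pad} bytes of padding")
```

  For the trampoline above, which ends at 0x424, '4,,15' gives the 12 bytes
  of padding seen in the dump, while '4,,7' or '4,,2' at that same address
  would insert nothing and leave the next trampoline unaligned.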

  Now, without '.p2align' the tool analyses all 7447 files in approx. 4:30 minutes.
  With '.p2align 4,,15' it rises to approx. 4:50 minutes (not much, but I'm going
  easy, with 790MB of files - when I'm done optimizing, this tool will "dive" into
  tens of GB).
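
  To put a number on "not much": reading the second figure as 4 min 50 s
  (an assumption on my part), the slowdown comes to roughly 7%. A quick
  sketch of the arithmetic, with helper names invented for illustration:

```python
def to_seconds(mm_ss):
    """Convert an 'M:SS' string to seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


def relative_slowdown(baseline, aligned):
    """Fractional slowdown of 'aligned' relative to 'baseline' (both 'M:SS')."""
    base = to_seconds(baseline)
    return (to_seconds(aligned) - base) / base


if __name__ == "__main__":
    # Approximate wall-clock figures quoted above; 4:50 is an assumed reading.
    print(f"{relative_slowdown('4:30', '4:50'):.1%}")
```

  Both timings are approximate wall-clock figures, so this is only a
  ballpark.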

  Looking at what gcc does, it seems to believe that '.p2align 4,,15' is *the*
  alignment to use for *all* functions and some jump points (jump points usually get
  '.p2align 4,,7').

  I've tried '.p2align 4,,2': it is *slightly* faster than '.p2align 4,,15' but not
  as fast as without '.p2align'.

  Anyway, my question was: why is gcc so "fond" of '.p2align' (it uses it in *all*
  situations), given that it does not always generate faster code? Other than that,
  gcc does a great job! ;)

-- 
Mihai Donțu