GCC asm block optimizations on x86_64

Darryl Miles darryl-mailinglists@netbauds.net
Wed Aug 29 11:11:00 GMT 2007


Rask Ingemann Lambertsen wrote:
> On Tue, Aug 28, 2007 at 11:02:49PM +0100, Darryl Miles wrote:
>    Peephole definitions check for cases like this and won't do the
> optimization clobbering the flags register if the flags register is live at
> that point.

So I take it the peephole pass works on the emitted instructions with 
annotations about the lifetimes of registers / flags / other useful 
state to guide it.  I was thinking it was a bit more blind to things 
than that.

If that is the case, then it leads me to believe the compiler should 
have a lot of options open to it for setting %edx to 0.
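
Off the top of my head there are several encodings for that on x86_64 
(byte counts are from my reading of the manuals, so treat these as my 
notes rather than gospel):

         xorl    %edx, %edx      # 2 bytes, clobbers EFLAGS
         subl    %edx, %edx      # 2 bytes, clobbers EFLAGS
         movl    $0, %edx        # 5 bytes, leaves EFLAGS untouched

All three also clear the high 32 bits of %rdx implicitly, which is 
point [1] again: different trade-offs, same end result.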


>>  0000000000000090 <u64_divide>:
>>    00:   49 89 d1                mov    %rdx,%r9	<<- [1] save %rdx in
>> %r9 for arg-as-return
>>    03:   48 8b 07                mov    (%rdi),%rax
>>    06:   ?? ?? ??                xor    %edx,%edx	<<- implicit zero of
>> high 32bits, would accept xorq %rdx,%rdx
> 
>    Right, that's why I suggest using "gcc -S -dp" because then it clearly
> shows if it's a 32-bit (*movsi_xxx) or a 64-bit (*movdi_xxx) instruction (as
> seen from GCC's point of view, since the actual CPU instruction is the same
> in this and several other cases).

The code you are quoting is not generated by GCC but is my ideal 
expectation of GCC; see the original u64_divide.c source comment for 
what GCC emits.  At the end of this email is the "-O6 -S -dp" version 
from GCC 4.0.2.

I did not understand the relevance of knowing whether it is (*movsi_xxx) 
or (*movdi_xxx).  From my point of view, knowing that would not alter 
the two original points I was making, [1] and [3].  Maybe there is some 
pipelining (or other complex) issue I don't know about which makes the 
emitted code better than what I'm suggesting.

Interestingly, if I change the order of my input parameters I get 
different code, but the same two suboptimal situations remain.



>>    0b:   48 f7 36                divq   (%rsi)
>>    0e:   73 02                   jae    12 <u64_divide+0x12>
>>    10:   ?? ??                   inc    %r8d
> 
>    Can't you substitute the "jae; inc %r8d" sequence with "adcl $0, %r8d"?

That's a possibility, but the u64_divide function is not actually 
functional (it can't deal with 64-bit divisors; but that's beside the 
point of what I was highlighting).  Another reason why it's not 
functional is that the processor flags on i386 are undefined after a 
DIV instruction anyway.

The carry check code was actually hijacked from my 
uint32_nowrap_add(u_int32_t *dest, u_int32_t addvalue) function, which 
does want to know about the carry for overflow purposes, and your 
suggestion looks good.
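
For the archives, here is a minimal sketch of that substitution in 
uint32_nowrap_add (my reconstruction from memory, not the original 
source, and I'm assuming it returns the carry):

        #include <sys/types.h>

        static __inline__ u_int32_t
        uint32_nowrap_add(u_int32_t *dest, u_int32_t addvalue)
        {
                u_int32_t carry = 0;

                __asm__ ("addl %[val], %[mem]\n\t" /* sets CF on unsigned overflow */
                         "adcl $0, %[c]"           /* folds CF in, no branch needed */
                         : [mem] "+m" (*dest), [c] "+r" (carry)
                         : [val] "r" (addvalue)
                         : "cc");
                return carry;
        }

The adcl version avoids the conditional jump entirely, which I take to 
be the point of your suggestion.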


>    You can use "rm" for such a constraint.

Tested and working.  Thanks.
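
In case it helps anyone searching the archives later, this is roughly 
the shape of what I ended up with (a cut-down illustration, not my 
real code):

        #include <sys/types.h>

        static __inline__ u_int64_t
        u64_div(u_int64_t num, u_int64_t divisor)
        {
                u_int64_t quot, rem;

                /* "rm" lets GCC give divq either a register or a
                   memory operand, whichever suits the surrounding
                   code.  */
                __asm__ ("divq %[d]"
                         : "=a" (quot), "=d" (rem)
                         : "0" (num), "1" (0ULL), [d] "rm" (divisor)
                         : "cc");
                return quot;   /* traps on divisor == 0, like plain '/' */
        }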



>> Another concern that occurs to me is that if the __asm__ constraints are 
>> not 100% perfect is there anyway to test/permutate every possible way 
>> for the compiler might generate the code.
> 
>    I suppose you could write a script which outputs "calls" to the asm
> construct with a constant, local variable (which we assume will end up in a
> register) or global variable for each operand in turn, then try compiling
> and assembling (i.e. -c) the resulting code.

My thinking was for GCC to facilitate some sort of automated testing, 
which would then help everyone on every platform, especially if I were 
then to try to create inline-able versions of functions using __asm__.

I would imagine that with the generated-symbol approach you could 
easily make a DLL with many versions within it, load it into a 
test-harness program, look up the symbols, and execute every 
permutation of the function and verify the result.  Couple this with, 
say, valgrind and it may even be possible to verify exactly what 
memory is read/written.

All this would add confidence and eliminate a whole lot of possible 
uncertainty.
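
A rough sketch of such a harness, assuming the permutations get 
compiled into a libperms.so under predictable (hypothetical) names 
u64_divide_v0, u64_divide_v1, ... with the four-pointer signature from 
my test function:

        #include <dlfcn.h>
        #include <stdio.h>
        #include <sys/types.h>

        typedef u_int32_t (*divide_fn)(u_int64_t *, u_int64_t *,
                                       u_int64_t *, u_int64_t *);

        int main(void)
        {
                void *lib = dlopen("./libperms.so", RTLD_NOW);
                char name[64];
                int i;

                if (lib == NULL) {
                        fprintf(stderr, "dlopen: %s\n", dlerror());
                        return 1;
                }
                for (i = 0; ; i++) {
                        divide_fn fn;
                        u_int64_t num = 100, div = 7, quot = 0, rem = 0;

                        snprintf(name, sizeof(name), "u64_divide_v%d", i);
                        fn = (divide_fn) dlsym(lib, name);
                        if (fn == NULL)
                                break;  /* no more permutations */

                        fn(&num, &div, &quot, &rem);
                        if (quot != 14 || rem != 2)
                                fprintf(stderr, "%s: wrong result\n", name);
                }
                dlclose(lib);
                return 0;
        }

(Link with -ldl; running the same binary under valgrind would then 
cover the memory-access side.)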



         xorl    %r8d, %r8d      # 44    *movdi_xor_rex64        [length = 3]
         movq    %rdx, %r9       # 8     *movdi_1_rex64/2        [length = 6]
         pushq   %rbx    # 38    *pushdi2_rex64/1        [length = 1]
.LCFI0:
         movq    (%rdi), %rax    # 16    *movdi_1_rex64/2        [length = 6]
         movl    %r8d, %edx      # 37    *movsi_1/1      [length = 3]
#APP

         xorl %ebx,%ebx
         divq (%rsi)
         jnc 1f
         incl %ebx
1:
         movq %rax,(%r9)
         movq %rdx,(%rcx)

#NO_APP
         movl    %ebx, %eax      # 36    *movsi_1/1      [length = 2]
         popq    %rbx    # 41    popdi1  [length = 1]
         ret     # 42    return_internal [length = 1]


Recapping the original issues:

[1] failure to treat setting a register to zero as a special case.  
Since there may be many ways to achieve this on a given CPU, and the 
different methods have different trade-offs (insn length, unwanted side 
effects), recognising the operation could give it a lot of freedom for 
moving / scheduling.

[3] usage of %ebx when %r8d would have been a better choice: by the 
time %ebx needed to be allocated, the lifetime of the temporary use of 
%r8d was over.  I.e. allocation of registers which form outputs but not 
inputs should take place as late as possible (at the moment of #APP); 
maybe by doing this %r8d would have been a candidate, which would 
negate the need for the push/pop of %rbx.
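
One experiment that might get at [3], assuming GCC's local register 
variable extension behaves here (an untested sketch, reusing the 
add/adc body from earlier):

        #include <sys/types.h>

        static __inline__ u_int32_t
        nowrap_add_r8(u_int32_t *dest, u_int32_t addvalue)
        {
                /* Pin the output-only temporary to %r8 by hand, so the
                   call-saved %rbx never needs a push/pop.  */
                register u_int32_t carry __asm__("r8") = 0;

                __asm__ ("addl %[val], %[mem]\n\t"
                         "adcl $0, %[c]"
                         : [mem] "+m" (*dest), [c] "+r" (carry)
                         : [val] "r" (addvalue)
                         : "cc");
                return carry;
        }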


Thanks for your thoughts.  Maybe I am just expecting too much.

Darryl


