Why does __builtin_ctz clear eax on amd64 targets
Mason
slash.tmp@free.fr
Tue Oct 3 18:59:00 GMT 2017
On 03/10/2017 19:09, David Wohlferd wrote:
> On 10/3/2017 6:53 AM, Mason wrote:
>
>> Consider the following code:
>>
>> int my_ctz(unsigned int arg) { return __builtin_ctz(arg); }
>>
>> which "gcc-7 -O -S -march=skylake" compiles to:
>>
>> my_ctz:
>>     xorl %eax, %eax
>>     tzcntl %edi, %eax
>>     ret
>>
>> I don't understand why GCC clears eax before executing tzcnt.
>> (Actually, this happens for other built-ins as well: clz, popcount.)
>>
>> tzcnt (or bsf) will write their result to eax.
>>
>> http://www.felixcloutier.com/x86/TZCNT.html
>> http://www.felixcloutier.com/x86/BSF.html
>>
>> Does it have to do with partial register write stalls?
>> Probably not, because the zero-ing remains even when the call
>> is inlined, and gcc "sees" there are no partial register writes.
>
> Quoting from the docs on tzcnt:
>
> "in the case of BSF instruction, if source operand is zero, the
> content of destination operand are undefined. On processors that do
> not support TZCNT, the instruction byte encoding is executed as BSF."
>
> So BSF leaves the contents of eax undefined, and TZCNT might execute as
> BSF. Given the trivial nature of xor eax, eax, this seems a sensible
> precaution.

Hello David,

Your explanation sounds plausible, but it does not account for the following:

As I stated, "gcc-7 -O -S -march=skylake" generates

my_ctz:
    xorl %eax, %eax
    tzcntl %edi, %eax
    ret

But "gcc-7 -O -S -march=barcelona" generates

my_ctz:
    bsfl %edi, %eax
    ret

AMD Barcelona does not support tzcnt, yet GCC doesn't clear
eax before executing bsf. The mystery remains :-)

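
For anyone who wants to reproduce the comparison, here is a minimal
sketch of a test file covering the three built-ins mentioned above.
The names my_clz and my_popcount are hypothetical, added only for
illustration; the sketch assumes GCC (or a compiler providing the same
built-ins) and a 32-bit unsigned int.

```c
/* Hypothetical test file; only my_ctz comes from the original message.
   my_clz and my_popcount are illustrative names. */

/* Count trailing zero bits. The result is undefined when arg == 0. */
int my_ctz(unsigned int arg)
{
    return __builtin_ctz(arg);
}

/* Count leading zero bits. The result is undefined when arg == 0. */
int my_clz(unsigned int arg)
{
    return __builtin_clz(arg);
}

/* Count the bits set to 1. Defined for every input, including 0. */
int my_popcount(unsigned int arg)
{
    return __builtin_popcount(arg);
}
```

Compiling this with "-O -S" once per -march value and comparing the
two assembly listings should show which functions get the extra xorl.
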
Regards.