Who cares about performance (or Intel's CPU errata)?

Stefan Kanthak stefan.kanthak@nexgo.de
Sat May 27 22:52:30 GMT 2023


"Andrew Pinski" <pinskia@gmail.com> wrote:

> On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>>
>> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:
>>
>> --- dontcare.c ---
>> int ispowerof2(unsigned __int128 argument) {
>>     return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
>> }
>> --- EOF ---
>>
>> GCC 13.3    gcc -march=haswell -O3
>>
>> https://gcc.godbolt.org/z/PPzYsPzMc
>> ispowerof2(unsigned __int128):
>>         popcnt  rdi, rdi
>>         popcnt  rsi, rsi
>>         add     esi, edi
>>         xor     eax, eax
>>         cmp     esi, 1
>>         sete    al
>>         ret
>>
>> OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output?
>
> Because the popcount is going to the same register, there is no false
> dependency ....
> The false dependency errata only applies if the result of the popcnt
> is going to a different register, the processor thinks it depends on
> the result in that register from a previous instruction but it does
> not (which is why it is called a false dependency). In this case it
> actually does depend on the previous result since the input is the
> same as the input.

OUCH, my fault; sorry for the confusion and the wrong accusation.

Nevertheless GCC fails to optimise code properly:

--- .c ---
int ispowerof2(unsigned long long argument) {
    return __builtin_popcountll(argument) == 1;
}
--- EOF ---

GCC 13.3    gcc -m32 -mpopcnt -O3

https://godbolt.org/z/fT7a7jP4e
ispowerof2(unsigned long long):
        xor     eax, eax
        xor     edx, edx
        popcnt  eax, [esp+4]
        popcnt  edx, [esp+8]
        add     eax, edx                 # eax is less than 64!
        cmp     eax, 1    ->    dec eax  # 2 bytes shorter
        sete    al
        movzx   eax, al                  # superfluous
        ret

5 bytes and 1 instruction saved; 5 bytes here and there accumulate to
kilo- or even megabytes, and they can extend code to cross a cache line
or a 16-byte alignment boundary.

JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument

JFTR: GCC is notorious for generating superfluous MOVZX instructions
      where its optimiser SHOULD be able see that the value is already
      less than 256!

Stefan


More information about the Gcc mailing list