[Bug target/103069] cmpxchg isn't optimized

Wed Nov 3 20:53:13 GMT 2021

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #1 from Thiago Macieira <thiago at kde dot org> ---
(the assembly doesn't match the source code, but we got your point)

Another possible improvement for the __atomic_fetch_{and,nand,or} functions is
that it can check whether the fetched value is already correct and branch out.
In your example, the __atomic_fetch_or with 0x40000000 can check if that bit is
already set and, if so, not execute the CMPXCHG at all.

This is a valid solution for x86 on memory orderings up to acq_rel. For other
architectures, they may still need barriers. For seq_cst, we either need a
barrier or we need to execute the CMPXCHG at least once. 

Therefore, the emitted code might want to optimistically execute the operation
once and, if it fails, enter the load loop. That's a slightly longer codegen.
Whether we want that under -Os or not, you'll have to be the judge.

Prior art: glibc/sysdeps/x86_64/nptl/pthread_spin_lock.S:
ENTRY(__pthread_spin_lock)
1:      LOCK
        decl    0(%rdi)
        jne     2f
        xor     %eax, %eax
        ret

        .align  16
2:      rep
        nop
        cmpl    $0, 0(%rdi)
        jg      1b
        jmp     2b
END(__pthread_spin_lock)

This does the atomic operation once, hoping it'll succeed. If it fails, it
enters the PAUSE+CMP+JG loop until the value is suitable.