Bug 86693 - inefficient atomic_fetch_xor
Summary: inefficient atomic_fetch_xor
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.1.1
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2018-07-27 02:47 UTC by Ruslan Nikolaev
Modified: 2021-12-28 06:23 UTC
CC: 3 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-07-27 00:00:00


Description Ruslan Nikolaev 2018-07-27 02:47:40 UTC
(Compiled with -O2 on x86-64)

Consider the following example:

void func1();

void func(unsigned long *counter)
{
        if (__atomic_fetch_xor(counter, 1, __ATOMIC_ACQ_REL) == 1) {
                func1();
        }
}

It is clear that the code can be optimized to a simple 'lock xorq' rather than a cmpxchg loop, since the xor operation can easily be inverted (1 ^ 1 = 0), i.e. the result can be tested from the flags directly (just like the similar cases with fetch_sub and fetch_add, which gcc optimizes well).
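
Put differently, the test is equivalent to checking that the new value is zero (old == 1 exactly when old ^ 1 == 0), which is precisely what the flags of a plain 'lock xorq' would report. A minimal sketch of that spelling; the name func_new and the use of __atomic_xor_fetch are chosen here only for illustration:

void func1();

void func_new(unsigned long *counter)
{
        /* the new value is zero exactly when the old value was 1 */
        if (__atomic_xor_fetch(counter, 1, __ATOMIC_ACQ_REL) == 0) {
                func1();
        }
}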

However, gcc currently generates a cmpxchg loop:
func:
.LFB0:
        .cfi_startproc
        movq    (%rdi), %rax
.L2:
        movq    %rax, %rcx
        movq    %rax, %rdx
        xorq    $1, %rcx
        lock cmpxchgq   %rcx, (%rdi)
        jne     .L2
        cmpq    $1, %rdx
        je      .L7
        rep ret

Compare this with fetch_sub instead of fetch_xor:
func:
.LFB0:
        .cfi_startproc
        lock subq       $1, (%rdi)
        je      .L4
        rep ret
Comment 1 Jakub Jelinek 2018-07-27 14:38:55 UTC
The reason why this works for sub/add is that x86 has an xadd instruction, so we expand it as xadd, and later on during combine we find out that we are actually comparing the result of lock; xadd with something we can optimize better, and do the optimization.
For __atomic_fetch_xor (ptr, x, y) == x (or != x), or __atomic_xor_fetch (ptr, x, y) == 0 (or != 0), or __atomic_or_fetch (ptr, x, y) == 0 (or != 0), we'd need to handle this specially already at expansion time, with extra special optabs, because there is no instruction that keeps the old or new value of xor or ior in a register, and once we emit a compare-and-exchange loop, it is very hard to optimize that into something different.
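
For reference, a minimal sketch of the three comparison patterns listed above as they would appear in source (added here for illustration only; the function names and the __ATOMIC_SEQ_CST order are not from this report):

int f1(unsigned long *p, unsigned long x)
{
        /* old == x, i.e. the new value old ^ x is zero */
        return __atomic_fetch_xor(p, x, __ATOMIC_SEQ_CST) == x;
}

int f2(unsigned long *p, unsigned long x)
{
        /* the new value of the xor is zero */
        return __atomic_xor_fetch(p, x, __ATOMIC_SEQ_CST) == 0;
}

int f3(unsigned long *p, unsigned long x)
{
        /* the new value of the ior is zero */
        return __atomic_or_fetch(p, x, __ATOMIC_SEQ_CST) == 0;
}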
Comment 2 Ruslan Nikolaev 2018-07-28 14:06:52 UTC
The following cases may also be (partially) related:

1.

#include <stdatomic.h>
#include <stdbool.h>

void func2();

void func(_Atomic(unsigned long) * obj, void * obj2)
{
        if (atomic_fetch_sub(obj, 1) == 1 && obj2)
                func2();
}

generates 'xadd' when 'sub' suffices:
func:
.LFB0:
        .cfi_startproc
        movq    $-1, %rax
        lock xaddq      %rax, (%rdi)
        testq   %rsi, %rsi
        je      .L1
        cmpq    $1, %rax
        je      .L10
.L1:
        rep ret
        .p2align 4,,10
        .p2align 3
.L10:
        xorl    %eax, %eax
        jmp     func2@PLT


2.
 
#include <stdatomic.h>

int func(_Atomic(unsigned long) * obj, unsigned long a)
{
        return atomic_fetch_add(obj, a) == -a;
}

generates 'xadd' when 'add' suffices:
func:
.LFB0:
        .cfi_startproc
        movq    %rsi, %rax
        lock xaddq      %rax, (%rdi)
        addq    %rsi, %rax
        sete    %al
        movzbl  %al, %eax
        ret
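
For reference, the comparison is equivalent to checking that the updated value is zero (old == -a exactly when old + a == 0 under the usual unsigned wraparound), which is what the flags of a plain 'lock addq' would report. A minimal sketch of that spelling; the name func_new is illustrative only, and whether gcc actually emits 'lock add' for it is not claimed here:

#include <stdatomic.h>

int func_new(_Atomic(unsigned long) * obj, unsigned long a)
{
        /* the new value is zero exactly when the old value was -a */
        return atomic_fetch_add(obj, a) + a == 0;
}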
Comment 3 Ruslan Nikolaev 2018-07-29 00:04:21 UTC
(In reply to Jakub Jelinek from comment #1)
> The reason why this works for sub/add is that x86 has xadd instruction, so
> we expand it as xadd and later on during combine find out we are actually
> comparing the result of lock; xadd with something we can optimize better and
> do the optimization.
> For __atomic_fetch_xor (ptr, x, y) == x (or != x), or __atomic_xor_fetch
> (ptr, x, y) == 0 (or != 0), or __atomic_or_fetch (ptr, x, y) == 0 (or != 0),
> we'd need to handle this specially already at expansion time, so with extra
> special optabs, because there is no instruction that keeps the old or new
> value of xor or ior in a register, and once we emit a compare and exchange
> loop, it is very hard to optimize that to something different.

By the way, I do not know exactly how gcc handles this... Is it possible to emit an artificial 'xxor' instruction which acts like xadd? Then, during optimization, xxor could be replaced with xor or with a cmpxchg loop depending on the circumstances.
Comment 4 Hongtao.liu 2021-12-27 02:18:16 UTC
Changing the testcase a little bit, gcc can now generate lock btc:


void func1();

void func(unsigned long *counter)
{
        if (__atomic_fetch_xor(counter, 1, __ATOMIC_ACQ_REL) & 1) {
                func1();
        }
}


func(unsigned long*):
        lock btc        QWORD PTR [rdi], 0
        jc      .L4
        ret
.L4:
        jmp     func1()
Comment 5 H.J. Lu 2021-12-27 03:03:07 UTC
(In reply to Hongtao.liu from comment #4)
> Change testcase a little bit, gcc now can generate lock btc
> 
> 
> void func1();
> 
> void func(unsigned long *counter)
> {
>         if (__atomic_fetch_xor(counter, 1, __ATOMIC_ACQ_REL) & 1) {
>                 func1();
>         }
> }
> 
> 
> func(unsigned long*):
>         lock btc        QWORD PTR [rdi], 0
>         jc      .L4
>         ret
> .L4:
>         jmp     func1()

We should rewrite the original test to the canonical form, similar to r12-5102.
Hongtao, can you do that?
Comment 6 Hongtao.liu 2021-12-28 06:23:29 UTC
(In reply to H.J. Lu from comment #5)
> (In reply to Hongtao.liu from comment #4)
> > Change testcase a little bit, gcc now can generate lock btc
> > 
> > 
> > void func1();
> > 
> > void func(unsigned long *counter)
> > {
> >         if (__atomic_fetch_xor(counter, 1, __ATOMIC_ACQ_REL) & 1) {
> >                 func1();
> >         }
> > }
> > 
> > 
> > func(unsigned long*):
> >         lock btc        QWORD PTR [rdi], 0
> >         jc      .L4
> >         ret
> > .L4:
> >         jmp     func1()
> 
> We should rewrite the original test to the canonical form, similar to
> r12-5102.
> Hongtao, can you do that?

The original testcase is not equivalent to btc: __atomic_fetch_xor(counter, 1, __ATOMIC_ACQ_REL) == 1 requires the other bits of *counter to be 0. And as #c1 said, we don't have instructions that keep the old or new value of xor or ior in a register.
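
A small standalone illustration of the difference (the initial value 3 is an arbitrary example chosen here, not taken from the report): fetch_xor returns the old value 3, so '== 1' is false while '& 1' is true; the two tests only agree when the upper bits of the counter are known to be zero.

#include <stdio.h>

int main(void)
{
        unsigned long counter = 3;
        unsigned long old = __atomic_fetch_xor(&counter, 1, __ATOMIC_ACQ_REL);
        /* prints "old == 1: 0, old & 1: 1" */
        printf("old == 1: %d, old & 1: %d\n", old == 1, (int) (old & 1));
        return 0;
}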

I can't find a way to optimize away the compare-and-exchange loop.