[Bug rtl-optimization/48986] New: Missed optimization in atomic decrement on x86/x64

Fri May 13 10:32:00 GMT 2011

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48986

           Summary: Missed optimization in atomic decrement on x86/x64
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: piotr.wyderski@gmail.com

Many uses of __sync_fetch_and_add() boil down to a
decrement followed by a check whether the result is
zero, in order to delete the pointee. The most natural
way to write this is:

bool xxx_decrement(int* p) {

   return __sync_fetch_and_add(p, -1) == 1;
}

void yyy(int* p) {

    if (xxx_decrement(p)) {

        delete p;
    }
}

Unfortunately, GCC compiles this literally:

<__Z3yyyPi>:
  40edd0:    83 ec 0c                 sub    $0xc,%esp
  40edd3:    ba ff ff ff ff           mov    $0xffffffff,%edx
  40edd8:    8b 44 24 10              mov    0x10(%esp),%eax
  40eddc:    f0 0f c1 10              lock xadd %edx,(%eax)
  40ede0:    83 fa 01                 cmp    $0x1,%edx
  40ede3:    74 0b                    je     40edf0 <__Z3yyyPi+0x20>
  40ede5:    83 c4 0c                 add    $0xc,%esp
  40ede8:    c3                       ret    
  40ede9:    8d b4 26 00 00 00 00     lea    0x0(%esi,%eiz,1),%esi
  40edf0:    89 44 24 10              mov    %eax,0x10(%esp)
  40edf4:    83 c4 0c                 add    $0xc,%esp
  40edf7:    e9 24 03 00 00           jmp    40f120 <___wrap__ZdlPv>
  40edfc:    8d 74 26 00              lea    0x0(%esi,%eiz,1),%esi 

with the gist being:

  40edd3:    ba ff ff ff ff           mov    $0xffffffff,%edx
  40eddc:    f0 0f c1 10              lock xadd %edx,(%eax)
  40ede0:    83 fa 01                 cmp    $0x1,%edx
  40ede3:    74 0b                    je     40edf0 <__Z3yyyPi+0x20>

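The optimizer should recognize this special case. Since only the
comparison result is used, and the old value equals 1 exactly when
the new value is zero, it could test ZF directly and produce:

   lock subl $0x1,(%eax)
   je ...

or:

   lock decl (%eax)
   je ...

on processors where the partial flags update of dec (it leaves CF
unchanged) incurs no dependency penalty, e.g. some AMD chips.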

Please note that this generalizes to any N:

   return __sync_fetch_and_add(p, -N) == N;

although for N != 1 the dec replacement cannot be used.
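
The equivalence the requested transformation relies on can be
sketched in C++ as follows. This is only an illustration, not a
proposed patch: the helper names (dropped_to_zero_fetch_first,
dropped_to_zero_op_first, forms_agree, check_examples) are
hypothetical, and nothing here claims that GCC emits better code for
either form. It shows that comparing the old value with N gives the
same answer as comparing the new value with zero, which is what the
ZF test after "lock sub" would check.

```cpp
#include <cassert>

// Form from the report: compare the value *before* the subtraction with N.
template <int N>
bool dropped_to_zero_fetch_first(int* p) {
    return __sync_fetch_and_add(p, -N) == N;
}

// Equivalent form: compare the value *after* the subtraction with zero.
template <int N>
bool dropped_to_zero_op_first(int* p) {
    return __sync_sub_and_fetch(p, N) == 0;
}

// Hypothetical helper: applies both forms to the same starting value and
// reports whether they return the same result and leave the same counter.
template <int N>
bool forms_agree(int start) {
    int a = start;
    int b = start;
    bool ra = dropped_to_zero_fetch_first<N>(&a);
    bool rb = dropped_to_zero_op_first<N>(&b);
    return ra == rb && a == b;
}

// Spot checks for N == 1 and N == 3, with counters that do and do not hit zero.
void check_examples() {
    assert(forms_agree<1>(1));
    assert(forms_agree<1>(2));
    assert(forms_agree<3>(3));
    assert(forms_agree<3>(7));
}
```

Both helpers use only the __sync builtins documented by GCC, so the
two forms can be compiled side by side to compare the generated code.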