Current __exchange_and_add on ia64 (was: Memory barriers..)

Alexander Terekhov alexander.terekhov@gmail.com
Fri Nov 18 09:39:00 GMT 2005


On 11/18/05, Paolo Carlini <pcarlini@suse.de> wrote:
> Hi all, hi Peter, hi Alexander,
>
> >> I filed other/24757, about this.
> >
> > Have you tried calling _S_initialize before spawning the threads? If
> > the problem disappears, then _S_initialize isn't thread safe. If it
> > persists... I'm out of ideas :-)
>
> I'm trying to figure out whether there is really something wrong in the
> assembly that mainline gcc is producing for a simple
> __sync_fetch_and_add of an int, on ia64. I'm compiling this:
>
> #include <ia64intrin.h>
>
> int
> __attribute__ ((__unused__))
> __exchange_and_add(volatile int* __mem, int __val)
> { return __sync_fetch_and_add(__mem, __val); }
>
> In fact, I'm seeing something different in mainline vs 4_0-branch (I
> think we agreed, some months ago, that the assembly produced by 4_0 was
> fine). At -O2:
>
> 4_0-branch
> ---
> 0000000000000000 <__exchange_and_add>:
>    0:   19 00 00 00 22 00       [MMB]       mf
>    6:   80 00 80 60 21 00                   ld4.acq r8=[r32]

No need for .acq here (in addition to preceding mf).

>    c:   00 00 00 20                         nop.b 0x0;;
>   10:   09 70 20 00 08 20       [MMI]       addp4 r14=r8,r0
>   16:   f0 00 20 00 42 00                   mov r15=r8
>   1c:   81 08 01 80                         add r8=r8,r33;;
>   20:   0b 00 38 40 2a 04       [MMI]       mov.m ar.ccv=r14;;
>   26:   80 40 80 22 20 00                   cmpxchg4.acq r8=[r32],r8,ar.ccv
>   2c:   00 00 04 00                         nop.i 0x0;;
>   30:   10 00 00 00 01 00       [MIB]       nop.m 0x0
>   36:   70 78 20 0c 71 03                   cmp4.eq p7,p6=r15,r8
>   3c:   e0 ff ff 4a                   (p06) br.cond.dptk.few 10
> <__exchange_and_add+0x10>
>   40:   17 00 00 00 00 08       [BBB]       nop.b 0x0
>   46:   00 00 00 00 10 80                   nop.b 0x0
>   4c:   08 00 84 00                         br.ret.sptk.many b0;;

Brr. I suppose it does

fence();
old = load_acq(__mem);
while ((result = cas_acq(__mem, old + __val, old)) != old) old = result;

Right?
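FWIW, that 4_0-branch sequence can be written out in today's C11 atomics as a rough sketch: a full fence (the mf), an acquire load, then a CAS-acquire retry loop. The function name and the use of <stdatomic.h> are illustrative only, not the actual libstdc++ source:

```c
#include <stdatomic.h>

/* Illustrative sketch of the 4_0-branch code path, NOT the real
 * implementation: full fence (mf), acquire load (ld4.acq), then a
 * compare-and-swap loop with acquire ordering (cmpxchg4.acq). */
static int
exchange_and_add_4_0(volatile atomic_int *mem, int val)
{
    atomic_thread_fence(memory_order_seq_cst);                 /* mf      */
    int old = atomic_load_explicit(mem, memory_order_acquire); /* ld4.acq */
    /* On CAS failure, 'old' is refreshed with the value actually seen,
     * which matches "old = result" in the pseudocode above. */
    while (!atomic_compare_exchange_weak_explicit(
               mem, &old, old + val,
               memory_order_acquire,                           /* cmpxchg4.acq */
               memory_order_acquire))
        ;
    return old;                                                /* old value */
}
```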

>
> mainline
> ---
> 0000000000000000 <__exchange_and_add>:
>    0:   09 78 00 40 b0 10       [MMI]       ld4.acq r15=[r32]
>    6:   00 00 00 02 00 00                   nop.m 0x0
>    c:   00 00 04 00                         nop.i 0x0;;
>   10:   09 00 3c 40 2a 04       [MMI]       mov.m ar.ccv=r15
>   16:   e0 00 3c 00 42 e0                   mov r14=r15
>   1c:   f1 08 01 80                         add r15=r15,r33;;
>   20:   09 40 00 40 22 04       [MMI]       mov.m r8=ar.ccv
>   26:   f0 78 80 62 20 00                   cmpxchg4.rel
> r15=[r32],r15,ar.ccv
>   2c:   00 00 04 00                         nop.i 0x0;;
>   30:   13 30 38 1e 07 b8       [MBB]       cmp.eq p6,p7=r14,r15
>   36:   01 f0 ff ff 25 80             (p06) br.cond.dpnt.few 10
> <__exchange_and_add+0x10>
>   3c:   08 00 84 00                         br.ret.sptk.many b0;;

old = load_acq(__mem);
while ((result = cas_rel(__mem, old + __val, old)) != old) old = result;

I suppose.
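The mainline sequence is the same loop minus the leading fence, with release instead of acquire ordering on the CAS. Again a hedged C11 sketch, with an illustrative name:

```c
#include <stdatomic.h>

/* Illustrative sketch of the mainline code path, NOT the real
 * implementation: no leading mf, acquire load (ld4.acq), then a
 * compare-and-swap loop with release ordering (cmpxchg4.rel). */
static int
exchange_and_add_mainline(volatile atomic_int *mem, int val)
{
    int old = atomic_load_explicit(mem, memory_order_acquire); /* ld4.acq */
    while (!atomic_compare_exchange_weak_explicit(
               mem, &old, old + val,
               memory_order_release,                           /* cmpxchg4.rel */
               memory_order_relaxed))
        ;
    return old;
}
```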

>
> You see, mainline doesn't emit any 'mf'. Another difference is that
> mainline uses 'cmpxchg4.rel' instead of 'cmpxchg4.acq'. Now, if I
> remember correctly an old message from Alexander, either 'mf' is emitted
> before 'cmpxchg4.acq' or after 'cmpxchg4.rel', but must be present...

Your mainline doesn't seem to provide fully-fenced semantics, in spite
of the ld.acq preceding the cas loop and the cas.rel on the same ia64
"semaphore" inside it. Subsequent (unordered) loads can be hoisted
above the cas.rel; the initial acquire on the load preceding the cas
loop doesn't help at all, because it provides no store-load fencing to
begin with... and that can break things. Not good.
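In later terminology, the fully-fenced behaviour being asked for is a seq_cst read-modify-write, which forbids exactly this hoisting: no later load may move above the RMW. A minimal sketch in C11 atomics, under the assumption that on ia64 this obliges the compiler to keep an mf next to the cmpxchg; the name is illustrative and this is not the actual GCC fix:

```c
#include <stdatomic.h>

/* Illustrative fully-fenced variant: a single seq_cst fetch-and-add.
 * memory_order_seq_cst rules out hoisting subsequent (unordered)
 * loads above the operation, which is the hazard described above. */
static int
exchange_and_add_fenced(volatile atomic_int *mem, int val)
{
    return atomic_fetch_add_explicit(mem, val, memory_order_seq_cst);
}
```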

regards,
alexander.



More information about the Libstdc++ mailing list