Current __exchange_and_add on ia64

Fri Nov 18 14:36:00 GMT 2005

Alexander Terekhov wrote:

>>4_0-branch
>>---
>>0000000000000000 <__exchange_and_add>:
>>   0:   19 00 00 00 22 00       [MMB]       mf
>>   6:   80 00 80 60 21 00                   ld4.acq r8=[r32]
>>    
>>
>No need for .acq here (in addition to preceding mf).
>  
>
I see...

>>   c:   00 00 00 20                         nop.b 0x0;;
>>  10:   09 70 20 00 08 20       [MMI]       addp4 r14=r8,r0
>>  16:   f0 00 20 00 42 00                   mov r15=r8
>>  1c:   81 08 01 80                         add r8=r8,r33;;
>>  20:   0b 00 38 40 2a 04       [MMI]       mov.m ar.ccv=r14;;
>>  26:   80 40 80 22 20 00                   cmpxchg4.acq r8=[r32],r8,ar.ccv
>>  2c:   00 00 04 00                         nop.i 0x0;;
>>  30:   10 00 00 00 01 00       [MIB]       nop.m 0x0
>>  36:   70 78 20 0c 71 03                   cmp4.eq p7,p6=r15,r8
>>  3c:   e0 ff ff 4a                   (p06) br.cond.dptk.few 10 <__exchange_and_add+0x10>
>>  40:   17 00 00 00 00 08       [BBB]       nop.b 0x0
>>  46:   00 00 00 00 10 80                   nop.b 0x0
>>  4c:   08 00 84 00                         br.ret.sptk.many b0;;
>>    
>>
>Brr. I suppose it does
>
>fence();
>old = load_acq(__mem);
>while ((result = cas_acq(__mem, old + __val, old)) != old) old = result;
>
>Right?
>  
>
Yes, I think so ;) In any case, it seems to me a pretty straightforward
way to implement the required atomic operation in terms of cas.
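
For the record, here is a minimal sketch of that loop in portable C++.
It assumes C++11 std::atomic, which is not what the code under
discussion uses, and the name exchange_and_add_sketch is made up for
illustration. The seq_cst fence stands in for 'mf'; whether a given
compiler really emits 'mf' for it on ia64 would have to be checked in
the generated code.

```cpp
#include <atomic>

// Hypothetical helper mirroring the pseudocode above:
//   fence(); old = load_acq(mem); loop on cas_acq until it succeeds.
inline int exchange_and_add_sketch(std::atomic<int>& mem, int val)
{
    // Full fence before the loop (the role played by 'mf').
    std::atomic_thread_fence(std::memory_order_seq_cst);

    // Initial load; acquire mirrors ld4.acq, although as noted above
    // the .acq is redundant right after a full fence.
    int old = mem.load(std::memory_order_acquire);

    // On failure compare_exchange_weak stores the value it observed
    // into 'old', which corresponds to 'old = result' in the pseudocode.
    while (!mem.compare_exchange_weak(old, old + val,
                                      std::memory_order_acquire,
                                      std::memory_order_relaxed))
    {
    }

    // Value seen immediately before the successful update.
    return old;
}
```

The function returns the previous value, so a caller doing reference
counting would compare the result against 1 to detect the last release.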

>>mainline
>>---
>>0000000000000000 <__exchange_and_add>:
>>   0:   09 78 00 40 b0 10       [MMI]       ld4.acq r15=[r32]
>>   6:   00 00 00 02 00 00                   nop.m 0x0
>>   c:   00 00 04 00                         nop.i 0x0;;
>>  10:   09 00 3c 40 2a 04       [MMI]       mov.m ar.ccv=r15
>>  16:   e0 00 3c 00 42 e0                   mov r14=r15
>>  1c:   f1 08 01 80                         add r15=r15,r33;;
>>  20:   09 40 00 40 22 04       [MMI]       mov.m r8=ar.ccv
>>  26:   f0 78 80 62 20 00                   cmpxchg4.rel r15=[r32],r15,ar.ccv
>>  2c:   00 00 04 00                         nop.i 0x0;;
>>  30:   13 30 38 1e 07 b8       [MBB]       cmp.eq p6,p7=r14,r15
>>  36:   01 f0 ff ff 25 80             (p06) br.cond.dpnt.few 10 <__exchange_and_add+0x10>
>>  3c:   08 00 84 00                         br.ret.sptk.many b0;;
>>    
>>
>
>old = load_acq(__mem);
>while ((result = cas_rel(__mem, old + __val, old)) != old) old = result;
>
>I suppose.
>  
>
Ok...

>>You see, mainline doesn't emit any 'mf'. Another difference is that
>>mainline uses 'cmpxchg4.rel' instead of 'cmpxchg4.acq'. Now, if I
>>remember an old message from Alexander correctly, 'mf' is emitted either
>>before 'cmpxchg4.acq' or after 'cmpxchg4.rel', but it must be present...
>>    
>>
>Your mainline doesn't seem to provide fully-fenced semantics (in spite
>of ld.acq preceding cas loop with subsequent cas.rel on the same ia64
>"semaphore" inside it). Subsequent (unordered) loads can be hoisted
>above cas.rel (initial acquire on load preceding cas loop doesn't help
>at all with respect to lack of store-load fencing, to begin with)...
>and that can break things. Not good.
>  
>
Argh!! Thanks for the analysis. I'm going to attach this info to the
audit trail of the PR (target/24757). Now all those regressions in the
threaded tests can be easily explained...
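
For reference, the other arrangement mentioned above ('mf' after
'cmpxchg4.rel') could look roughly like this in C++11 terms. Again this
is only a sketch: std::atomic is an assumption, the function name is
hypothetical, and the mapping of the seq_cst fence to 'mf' on ia64
should be confirmed by looking at the disassembly.

```cpp
#include <atomic>

// Hypothetical helper: release CAS loop followed by a full fence, so
// that later loads cannot be hoisted above the update.
inline int exchange_and_add_rel_then_fence(std::atomic<int>& mem, int val)
{
    // Mirrors the ld4.acq in the mainline listing; by itself it does
    // nothing for store-load ordering after the update.
    int old = mem.load(std::memory_order_acquire);

    // Release CAS, as in cmpxchg4.rel; 'old' is refreshed on failure.
    while (!mem.compare_exchange_weak(old, old + val,
                                      std::memory_order_release,
                                      std::memory_order_relaxed))
    {
    }

    // The piece missing from the mainline sequence: a full fence
    // after the successful update.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    return old;
}
```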

Paolo.