Tasks for the near term

Update July 29/11

Item number 1 must be done before anything else; the rest have no dependencies beyond that.

1. Change __sync_mem_compare_exchange to __sync_mem_val_compare_and_swap

  • The plan now is to implement a __sync_mem_val_compare_and_swap which takes memory model parameters; the C++ wrappers for compare_exchange can then do the other bits they need, since the rest doesn't need to be atomic.

  • All the code for __sync_mem_compare_exchange should be changed over to implement val_compare_and_swap with a memory model instead. The format will be the same as the existing val_compare_and_swap, except it will also have memory model parameters.

  • Prototype: T __sync_mem_val_compare_and_swap (T *object, T expected, T desired, mem_model success, mem_model failure)

  • When writing the C++ code for compare_exchange, it will basically do the following (the optimizer should eliminate the local storage):

  {
     T local, res;
     local = *expected;
     /* Only the CAS itself needs to be atomic.  */
     res = val_compare_and_swap (ptr, local, desired, success, failure);
     if (local == res)
        return true;
     /* On failure, report the value actually seen back through *expected.  */
     *expected = res;
     return false;
  }
  • For the moment, we are punting on implementing a weak compare_and_swap. Long term we are investigating turning this into a tree code which returns both the bool result and the old value, so that at expansion time we can see which of the values (or both) is actually used and emit the most efficient instruction; that is still in the design phase though. The basic difference between a weak and a strong compare_and_swap (CAS) is that a weak version may simply fail. Architectures that only support a weak CAS implement the strong version in a loop: do the operation, check whether it failed, and if it did, try again (much like we do in optabs.c::expand_compare_and_swap_loop; the powerpc version, config/rs6000/rs6000.c::rs6000_split_compare_and_swap, works this way). x86 only has a strong version, so no loop is needed. A sketch of such a loop appears at the end of this list.

  • New testcases will have to be written and added to gcc.dg/sync-mem-val-compare-and-swap-[1-5]*.c. Memory model restrictions are identical to compare-exchange's.
  • Basically all the new routine has to do is issue any needed barriers before the CAS (do the same thing 'expand_val_compare_and_swap' does: find the opcode and call it) and issue any needed barriers after it. Most likely this means calling expand_builtin_mem_thread_fence(model) before the CAS for SEQ_CST, ACQ_REL or RELEASE, and after the CAS for SEQ_CST, ACQ_REL or ACQUIRE, but double-check that (it was written late on a Friday before a long weekend, so it could be off :-). See the sketch just below.
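
A minimal sketch of that barrier placement, using C++11 fences around a relaxed CAS purely for illustration (the function name is made up; the real work happens in the RTL expander, and whether a plain fence of the given model is right on every target is exactly the open question above):

     #include <atomic>

     // Sketch only: shows where the barriers would go for each model.
     template <typename T>
     T val_cas_with_fences (std::atomic<T> &obj, T expected, T desired,
                            std::memory_order model)
     {
       // "Release side" barrier before the CAS: SEQ_CST, ACQ_REL or RELEASE.
       if (model == std::memory_order_seq_cst
           || model == std::memory_order_acq_rel
           || model == std::memory_order_release)
         std::atomic_thread_fence (model);

       obj.compare_exchange_strong (expected, desired,
                                    std::memory_order_relaxed,
                                    std::memory_order_relaxed);

       // "Acquire side" barrier after the CAS: SEQ_CST, ACQ_REL or ACQUIRE.
       if (model == std::memory_order_seq_cst
           || model == std::memory_order_acq_rel
           || model == std::memory_order_acquire)
         std::atomic_thread_fence (model);

       // compare_exchange_strong leaves the observed old value in 'expected',
       // which is what val_compare_and_swap returns.
       return expected;
     }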

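For reference, a minimal sketch of how a strong CAS is built on top of a weak one, in the spirit of optabs.c::expand_compare_and_swap_loop; C++11 atomics and the function name are used purely for illustration:

     #include <atomic>

     // Sketch only: retry until the weak CAS either succeeds or fails
     // because the value genuinely differs (not just spuriously).
     template <typename T>
     bool strong_cas_from_weak (std::atomic<T> &obj, T &expected, T desired)
     {
       for (;;)
         {
           T old = expected;
           if (obj.compare_exchange_weak (expected, desired))
             return true;                 // succeeded
           if (expected != old)
             return false;                // real mismatch, strong CAS fails
           // Otherwise the failure was spurious (e.g. a lost ll/sc
           // reservation on powerpc); loop and try again.
         }
     }
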
2 - Change the C++ wrappers in atomic_2.h to use the __sync_mem routines

libstdc++-v3/include/bits/atomic_2.h and atomic_base.h contain the current implementation of the C++ wrappers.

The memory_order_* enum for C++ should already align exactly with our __SYNC_MEM_* memory models, so the values should be able to be passed straight through.
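
If that alignment holds, a one-line sanity check could be dropped into atomic_base.h. This is just a sketch and assumes the __SYNC_MEM_* models are visible to the library as integer constants:

   // Sketch: memory_order is already declared in atomic_base.h.
   static_assert ((int) memory_order_seq_cst == __SYNC_MEM_SEQ_CST,
                  "memory_order must line up with the __SYNC_MEM_* models");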

The existing implementations make calls to various __sync builtins and emit fences where they think they are needed. Most of these methods can be changed to simply call the underlying __sync_mem routine, complete with memory model. This is where we'll discover what was missed in the interface planning :-)
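
As a rough illustration of the kind of change meant here (a sketch only: __sync_mem_exchange and its (pointer, value, model) signature are assumptions about the branch builtins, and the struct below stands in for the real atomic_2.h classes):

   #include <atomic>     // for std::memory_order

   struct atomic_int_sketch
   {
     int _M_i;

     int exchange (int __i, std::memory_order __m = std::memory_order_seq_cst)
     {
       // Previously: a __sync_* builtin plus whatever explicit fences the
       // wrapper thought were needed.  Now the model is passed straight
       // through, relying on the enum alignment noted above.
       return __sync_mem_exchange (&_M_i, __i, (int) __m);
     }
   };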

Eventually we'll have to deal with the fact that these __sync_mem routines aren't all lock-free/correct for every size (especially 16 byte values), and implement an atomic_1.h file which combines lock-free and locked algorithms, but we can punt on that for now.

3 - Add x86 patterns for the new __sync_mem routines. Other targets to follow

The pattern for mem_exchange has already been checked into the branch; we need to do all the rest of the operations. This should be fairly simple since x86 doesn't need much in the way of fences. The implementation details for most of the instructions can be found here. Note this page summarizes a number of targets.

As you can see, there is little in the way of fences required; most loads and stores are simply a 'mov'. For the fetch_* routines, I'm not sure whether they can be implemented by issuing a 'LOCK' prefix in front of the instruction or not (i.e., instead of add mem,reg, the pattern issues lock; add mem,reg). This needs checking into; maybe torvald or rth can answer that. If there isn't a good way to do them, let them fall back to the default generic routine, which maps to a compare_and_swap loop (i.e., don't implement a pattern).

One extra thing that will hopefully be smooth: since no targets implement these patterns yet, a minor amount of debugging of the new generic routines will have to be done to make sure patterns are actually emitted; that code hasn't really been exercised yet. Pay particular attention to the mem_flag_test_and_set, mem_flag_clear, and mem_*_fence patterns since they don't have type modifiers and go through a different code path.

4 - Add tests for atomicity of __sync_mem routines

Tests in gcc.dg/sync_mem* check the functionality of the routines (i.e., that sync_mem_exchange actually performs an exchange), but they do not test whether that exchange is performed atomically.
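
For contrast, the existing style of functionality check boils down to something like this sketch (not the actual test contents; the __sync_mem_exchange signature is assumed):

   #include <stdlib.h>

   int main ()
   {
     int v = 1;
     /* Checks the result of the operation only: the old value comes back
        and the new value is stored.  Nothing here exercises atomicity.  */
     int old = __sync_mem_exchange (&v, 2, __SYNC_MEM_SEQ_CST);
     if (old != 1 || v != 2)
       abort ();
     return 0;
   }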

Aldy added infrastructure to the test suite that lets gdb single-step through a function and use inferior function calls as a simulated second thread, to test that operations are executed atomically. There is an example testcase on the wiki, for testing an artificial ATOMIC_STORE operation, here.

Using that idea, we need to test all the __sync_mem routines to verify they are in fact atomic. I would recommend a different file for each word size, i.e. char, short, int, long long, and __int128_t, much like the gcc.dg/sync-mem-*{1-5}.c test cases. Those files contain all the necessary dejagnu preamble to only execute when a word size is supported. It is probably possible to put all the different __sync_mem functions in one file, which would be good since then we'd know at a glance whether all the __sync routines of a given size are correct. That would result in 5 test cases. By the way, I expect most 16 byte versions (except exchange and maybe store) to fail this test on x86_64, and probably most of the 8 byte ones on 32 bit x86.

If it's too much work to put them all in one test case, it's fine to split them out and simply test one function per test case. Note we don't have to worry about all the different memory models; simply make everything use __SYNC_MEM_SEQ_CST and forget the rest.

Try to test the maximum values of everything, like gcc.dg/sync-mem-fetch-and-{1-5}.c does (i.e., use masks which are either all 0s or all 1s); a sketch follows.
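
A sketch of the shape such a test body could take, assuming a __sync_mem_fetch_and builtin with a (pointer, value, model) signature; the dejagnu preamble should be copied from the existing sync-mem tests:

   #include <stdlib.h>

   /* All-zero / all-one masks make wrong answers and torn updates easy
      to spot.  */
   static long long ones = ~0LL;

   int main ()
   {
     long long v = ones;
     /* fetch_and with an all-zero mask must return the old all-ones value
        and leave all zeros behind.  */
     if (__sync_mem_fetch_and (&v, 0LL, __SYNC_MEM_SEQ_CST) != ones || v != 0)
       abort ();
     return 0;
   }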

5 - Look into parameter type checking of __sync_mem

The parameters are massaged in c-family/c-common.c::resolve_overloaded_builtin, sync_resolve_return, and sync_resolve_params.

Previous to this, all the __sync builtins had the format T __sync_* (T *, T, T, ...), so these 3 routines figured out what type T was and cast the return value and all parameters to T and then to the type of the builtin's argument, usually BT_I{1,2,4,8,16}, which is an unsigned value. I.e., if the builtin type was 1 byte, BT_I1, we'd see something like (BT_I1)(char)actual_param, where BT_I1 maps to type_for_size(1), which I think is unsigned char.

The new __sync_mem routines are similar, but the memory model parameter is a BT_INT. This means that currently, on a call with 16 byte values, the memory model parameter would be cast to an __int128_t and then back to an int, which is quite ugly and silly.

The right thing would be to change these routines to look at all the arguments and only perform these casts when the underlying type of the builtin argument is not a pointer and is the same size as the real base type T (picked up from the pointer in parameter 1). Since the memory model argument is a BT_INT, we'd then only get a cast on the memory model parameter when its size matches T (i.e., T is long or int), and in that case it's a harmless cast.
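
A sketch of that rule in terms of the tree predicates involved (illustrative only, not the actual c-common.c code; it would live inside GCC where tree, POINTER_TYPE_P and TYPE_PRECISION are available):

   /* Cast an actual argument to the base type T only when the builtin's
      formal type is a non-pointer integer of the same width as T (T being
      the pointed-to type of the first argument).  A BT_INT memory model
      argument then at most gets a harmless same-width cast.  */
   static bool
   should_cast_to_base_type (tree formal_type, tree base_type)
   {
     return (!POINTER_TYPE_P (formal_type)
             && TYPE_PRECISION (formal_type) == TYPE_PRECISION (base_type));
   }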

The reason for all this casting is to prevent a bunch of compiler warnings when passing pointers to compare_and_swap, and to avoid signed/unsigned conversion issues which could cause miscompilations.

rth knows all about this code, so questions can go to him if Highlander isn't around.
