
Tasks for the near term

Updated Sept 9, 2011

1 - Integrate compare_and_swap (CAS) as a tree code... maybe.

  • There are issues with implementing compare_exchange.
  • The C++ interface passes in the 'expected' value by reference, i.e., a pointer to the expected value. This value is then updated if the CAS fails. This is easily accomplished now in the C++ header by something like:
      {
         T local, res;
         local = *expected;     /* Copy the caller's expected value.  */
         res = val_compare_and_swap (ptr, local, desired, success, failure);
         if (local == res)      /* Old value matched: the swap happened.  */
            return true;
         *expected = res;       /* Failed: report the actual value back.  */
         return false;
      }
  • CAS really has two return values: a boolean indicating whether the operation succeeded, and the previous value of *ptr. Builtin functions don't handle returning two values well, so there are two different builtins, __sync_bool_compare_and_swap and __sync_val_compare_and_swap. As the C++ wrapper code above shows, you can fake the boolean result by comparing the value result.

  • The real problem is implementing a weak compare_exchange. For this we need a weak compare_and_swap which returns both the bool result and the old value, and that can't be faked. The basic difference between a weak and a strong CAS is that a weak version may simply fail spuriously. An architecture which supports a weak CAS implements the strong version within a loop: do the operation, check if it failed, and if it did, try again (much like we do with optabs.c::expand_compare_and_swap_loop); the powerpc version is like this (config/rs6000/rs6000.c::rs6000_split_compare_and_swap). x86 only has a strong version, so no loop is needed. See the sketch following this list.

  • The thought is to integrate CAS as a basic tree code in the compiler, with the two results, and then at expansion time see which of the values (or both) are actually used and invoke the most efficient instruction. This still requires some additional thought.
  • New testcases will have to be written and added to gcc.dg/sync-mem-val-compare-and-swap-[1-5]*.c.
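
Roughly, the strong-from-weak loop looks like this. weak_cas is a made-up stand-in for a target instruction pair such as powerpc's lwarx/stwcx. which may fail spuriously; it is not an existing builtin.

      /* Hypothetical weak CAS: may fail even when *ptr == oldval.  */
      extern bool weak_cas (int *ptr, int oldval, int newval);

      /* A strong CAS built from the weak one, much as the powerpc
         splitter and optabs.c::expand_compare_and_swap_loop do.  */
      bool strong_cas (int *ptr, int oldval, int newval)
      {
        for (;;)
          {
            if (weak_cas (ptr, oldval, newval))
              return true;      /* The swap happened.  */
            if (*ptr != oldval)
              return false;     /* Genuine failure: the values differ.  */
            /* Otherwise the failure was spurious; try again.  */
          }
      }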

2 - Create a builtin which indicates which __sync_mem routines support true atomicity.

__sync routines exist for values of up to 16 bytes, but their execution may not actually be atomic. There is a testsuite in gcc.dg/memmodel/sync* which verifies which object sizes are truly atomic.

A builtin function needs to be provided, along with some required configure bits, so that the compiler can indicate whether a type or type size is properly supported; i.e., something like __sync_size_is_lock_free (size) which returns true or false based on what the target supports. I suspect this information will have to be set up manually in one of the configure files somehow.

My best guess is that configure needs to create the same information the testsuite configures into the dg-require-effective-target settings for sync_char_short, sync_int_long, sync_long_long, and sync_int_128. The targets and such are located in testsuite/lib/target-supports.exp, in the check_effective_target_sync_* routines. I don't know whether the best way is to have configure set some predefined macros to true/false which the builtin can pick up, or something else. Once we have this information, it may be worthwhile to change the builtins themselves not to create the __sync routines if they aren't properly supported; i.e., if there is no 8-byte support, don't create __sync variants that accept long long. We currently do, and produce routines which are not atomically safe.

Perhaps we won't even need the builtin if configure predefines a set of macros like __SYNC_SUPPORTS_BYTE, __SYNC_SUPPORTS_SHORT, __SYNC_SUPPORTS_INT, __SYNC_SUPPORTS_LONG, etc. Rather than invoking the builtin, code could simply #if around the macros, as sketched below. I don't know whether that is too limiting or not. Investigate! Check with rth, jakub and bkoz.
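
For illustration, user code under the macro approach might look like the following; both the macro names and __sync_size_is_lock_free are proposals from this page, not existing interfaces.

      /* Pick the implementation at preprocessing time using the
         proposed configure-defined macros.  */
      #ifdef __SYNC_SUPPORTS_INT
      #  define INT_IS_LOCK_FREE 1
      #else
      #  define INT_IS_LOCK_FREE 0
      #endif

      /* With the builtin instead, the test could be written for an
         arbitrary type:  if (__sync_size_is_lock_free (sizeof (T)))  */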

3 - Create new C++ wrappers in atomic_1.h for hybrid locked/lock-free atomics.

Utilizing the __sync_size_is_lock_free builtin or macros, the C++ compiler will be able to detect whether there is lock-free support for a given type size.

Based on that, it can determine whether to utilize the lock-free templates in atomic_2.h or the locked versions in atomic_1.h. This will require some other fudging around, but it should be possible to generate a full, proper compilation for all atomic types, using the lock-free routines for any types which match the supported integral types. A sketch follows.
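
A rough sketch of the selection, assuming the lock-free test can be folded at compile time. locked_atomic and lock_free_atomic are hypothetical stand-ins for the templates in atomic_1.h and atomic_2.h, and the sizeof test merely approximates the proposed builtin/macros.

      #include <type_traits>

      template <typename T> struct locked_atomic;     /* atomic_1.h style  */
      template <typename T> struct lock_free_atomic;  /* atomic_2.h style  */

      /* Use the lock-free template only when the target supports an
         atomic object of this size.  */
      template <typename T>
      struct select_atomic
      {
        typedef typename std::conditional<
            sizeof (T) <= sizeof (long long),  /* stand-in for the builtin */
            lock_free_atomic<T>,
            locked_atomic<T> >::type type;
      };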

bkoz would be the primary reference here if Highlander is not around.

4 - Add x86 patterns for new __sync_mem routines. Other targets to follow

The pattern for mem_exchange has already been checked into the branch; we need to do all the rest of the operations. This is fairly simple, I think, since x86 doesn't need much in the way of fences. The implementation details for most of the instructions can be found here. Note this page summarizes a number of targets.

As you can see, there is little in the way of fences required; most loads and stores are simply a 'mov'. For the fetch_* routines, I'm not sure whether they can be implemented by issuing a 'LOCK' prefix in front of the instruction or not (i.e., instead of add mem,reg, the pattern issues lock; add mem,reg; see the sketch below). This needs checking into; maybe torvald or rth can answer it. If there isn't a good way to do them, let them resort to the default generic routine which maps to a compare_and_swap loop (i.e., don't implement a pattern).
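
For illustration, here is the sequence in question written as GNU inline asm rather than an actual .md pattern; atomic_add is made up for this sketch, not code from the branch. Note that a locked add does not produce the old value, so it can only cover calls where the result of the fetch_* builtin is unused.

      /* An atomic add to memory on x86 is just the normal add with a
         LOCK prefix; no separate fences are required.  */
      static inline void atomic_add (int *ptr, int val)
      {
        __asm__ __volatile__ ("lock; addl %1, %0"
                              : "+m" (*ptr)
                              : "ir" (val)
                              : "memory", "cc");
      }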

One extra thing that will hopefully be smooth: since there are no targets yet, a minor amount of debugging of the new generic routines will have to be done to make sure patterns are actually emitted. That code hasn't really been exercised yet. Pay special attention to the mem_flag_test_and_set, mem_flag_clear, and mem_*_fence patterns, since they don't have type modifiers and go through a different codepath.

rth is the primary contact here if Highlander is not around.

5 - Look into parameter type checking of __sync_mem

The parameters are massaged in c-family/c-common.c::resolve_overloaded_builtin, sync_resolve_return, and sync_resolve_params.

Previous to this, all the __sync builtins were of the form T __sync (T *, T, T, ...), so these three routines figured out what type T was and cast the return value and all parameters to T, and then to the type of the builtin's argument, usually BT_I{1,2,4,8,16}, which is an unsigned value. I.e., if the builtin type was 1 byte, BT_I1, we'd see something like (BT_I1)(char)actual_param, where BT_I1 maps to type_for_size(1), which I think is unsigned char.

The new __sync_mem routines are similar, but the memory model parameter is a BT_INT. This means that currently the memory model parameter on a call for 16-byte values would be cast to an __int128_t and then back to an int, which is quite ugly and silly.

The right thing would be to change these routines to look at all the arguments and only do these casts when the underlying type of the builtin argument is not a pointer and is the same size as the real base type T (picked up from the pointer in parameter 1). Since the memory_model argument is a BT_INT, we'd only get the cast on the memory model parameter when its size matches T (either long or int), and then it's a harmless cast. A sketch of the rule is below.
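
In the spirit of sync_resolve_params, the rule might look something like this; arg_type, t_type, and cast_to_type are illustrative names rather than the actual code, so treat it as pseudocode.

      /* Only cast an argument when it is a non-pointer whose size
         matches the base type T taken from parameter 1.  Anything else
         (e.g. the BT_INT memory model parameter) is left alone.  */
      if (!POINTER_TYPE_P (arg_type)
          && tree_int_cst_equal (TYPE_SIZE (arg_type), TYPE_SIZE (t_type)))
        arg = cast_to_type (arg, t_type);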

Extra parameters can also be thrown on the end and no errors are reported; this dates back to the variadic versions the IA64 stuff required. The new routines should complain if there are extra parameters, or if they don't match the appropriate types.

The reason for all this casting is to prevent a bunch of compiler warnings when passing in pointers for compare_and_swap, and to avoid signed/unsigned conversion issues which may cause miscompilations.

rth knows all about this code, so questions can go to him if Highlander isn't around.

6 - Consolidate __sync and __sync_mem functions

There is a lot of duplication of code between the expanders for __sync and __sync_mem in optabs.c. There should be some consolidation.

Currently, the __sync_mem routines will look for their rtl pattern and, if it isn't found, add any required fences and call the old __sync routines. This isn't too bad, except for the *_fetch_* variations, where the result may be wanted from before or after the operation and extra math can be required, which the new expanders are not doing yet.

Ideally, I think we could simply make all the old routines call the new __sync_mem expander with the appropriate memory model. The __sync_mem routines would then look for their pattern, and if that fails, use the original code from the old __sync expanders, wrapped in any required fences, instead of calling them as they do now. Then, of course, default down to the compare_and_swap loop. Basically, 'inline' the old expanders into the new routines where they are currently called.

I think this would mean a lot less duplication of code, be easier to understand, and allow the transparent deprecation of the old rtl patterns as they are converted to the new memory model patterns.

I.e., in pseudocode, with __sync_lock_test_and_set as an acquire operation. Originally:

expand_sync_lock_test_and_set (...)
  {
    code_to_implement_sync_lock_test_and_set;
  }

expand_sync_mem_exchange (..., enum memmodel model)
  {
     if (exists)                                 /* Target provides a pattern.  */
       emit_mem_exchange_and_return (model);
     if (model == acq_rel || model == seq_cst)
       emit_barrier ();                          /* Stronger models need a fence.  */
     expand_sync_lock_test_and_set (...);        /* Fall back to the old expander.  */
  }

would be changed to:

expand_sync_lock_test_and_set (...)
  {
    expand_sync_mem_exchange (..., acquire);     /* Old entry point just forwards.  */
  }

expand_sync_mem_exchange (..., enum memmodel model)
  {
     if (exists)                                 /* Target provides a pattern.  */
       emit_mem_exchange_and_return (model);
     if (model == acq_rel || model == seq_cst)
       emit_barrier ();                          /* Stronger models need a fence.  */
     code_to_implement_sync_lock_test_and_set;   /* Old expander body, inlined.  */
  }

This is as trivial as the examples get, but it consolidates all the code in one place, and the payoff is potentially much bigger in other cases.

Also note that with the fetch_op routines, the original __sync set also implements a 'nand' operation, which will require a new sync_mem_{fetch_op,op_fetch} variation to be created as well. Its semantics are sketched below for reference.
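
For reference, the nand variant is defined roughly as follows (the operation is performed atomically; T stands for the operand type):

      /* __sync_fetch_and_nand: as of GCC 4.4 the stored value is
         ~(old & value), and the old value is returned.  */
      T __sync_fetch_and_nand (T *ptr, T value)
      {
        T old = *ptr;
        *ptr = ~(old & value);
        return old;
      }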

7 - Consider making all the old __sync patterns full barriers.

The original __sync patterns don't all have the proper barriers to prevent optimization. __sync_lock_test_and_set, for instance, is documented as an acquire barrier, but there is nothing in the pattern to actually enforce this behaviour, meaning an optimizer can move a store to BEFORE the operation, which would be invalid.

There was a bug opened (http://libdispatch.macosforge.org/trac/ticket/35) and a follow-on conversation (http://gcc.gnu.org/ml/gcc/2011-09/msg00088.html) which, although the report itself is really invalid (a store was sunk in acquire mode, which is allowed), demonstrates the lack of optimization barriers, since there is also nothing preventing a store from being hoisted.

It is also an example of user expectations. At least __sync_lock_test_and_set needs fixing, and I think that in order to fix it, it will have to become a full optimization barrier, since there isn't really any special infrastructure for this... We should perhaps consider making all the older __sync patterns full barriers for simplicity's sake. An example of the problematic motion follows.
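
A small example of the motion in question, assuming the pattern carries no memory clobber; data and lock are made up for the illustration, not taken from the bug report.

      int data;
      int lock;

      void critical_section (void)
      {
        /* Acquire operation: nothing after it may move above it.  */
        while (__sync_lock_test_and_set (&lock, 1))
          ;
        /* With no optimization barrier in the pattern, nothing stops
           the compiler from hoisting this store ABOVE the
           test_and_set, exposing it outside the lock.  */
        data = 1;
        __sync_lock_release (&lock);
      }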
