Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release

Mon Dec 7 12:24:06 GMT 2020

Hi all,

The attached patch includes a proposed alternative implementation of
_Sp_counted_base::_M_release(). I'd appreciate your feedback.

Benefits: Saves the cost of the final atomic decrement of the use count
and of the weak count in _Sp_counted_base. Atomic read-modify-write
instructions are significantly slower than regular loads and stores on
major architectures.

How the current code works: _M_release() atomically decrements the use
count and checks whether it was 1. If so, it calls _M_dispose(), then
atomically decrements the weak count, checks whether it was 1, and if so
calls _M_destroy().
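
To make the sequence above concrete, here is a minimal sketch of the
current release path. ToyControlBlock and its members are hypothetical
stand-ins for _Sp_counted_base, not the libstdc++ source, and the
disposed/destroyed counters exist only so the behavior can be observed.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for _Sp_counted_base (illustration only).
struct ToyControlBlock {
    std::atomic<int> use_count{1};   // owning references
    std::atomic<int> weak_count{1};  // weak references, plus 1 while use_count > 0
    int disposed = 0;                // how many times dispose() ran
    int destroyed = 0;               // how many times destroy() ran

    void dispose() { ++disposed; }   // stands in for _M_dispose()
    void destroy() { ++destroyed; }  // stands in for _M_destroy()

    void release() {
        // First atomic read-modify-write: drop one use reference.
        int old_use = use_count.fetch_sub(1, std::memory_order_acq_rel);
        assert(old_use >= 1 && "release() without a matching reference");
        if (old_use == 1) {
            dispose();
            // Second atomic read-modify-write: drop the weak reference
            // held on behalf of the use references.
            if (weak_count.fetch_sub(1, std::memory_order_acq_rel) == 1)
                destroy();
        }
    }
};
```

When a control block has a single owner and no weak references, release()
performs both atomic decrements; these are the two operations the proposal
aims to skip.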

How the proposed algorithm works: _M_release() loads both use count and
weak count together atomically (assuming 8-byte alignment, discussed
later), checks if the value is equal to 0x100000001 (i.e., both counts are
equal to 1), and if so calls _M_dispose() and _M_destroy(). Otherwise, it
follows the original algorithm.
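
The fast path can be sketched as follows. This is not the code in the
attached patch: to stay within standard C++, the sketch packs both counts
into one 64-bit atomic (use count in the low half, which is my own choice),
whereas the patch reads two adjacent 4-byte counts with a single 8-byte
load. PackedControlBlock, kBothOne, and the rmw_ops counter are hypothetical
names used only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Both counts equal to 1: one in each 32-bit half, i.e. 0x100000001.
constexpr std::uint64_t kBothOne = (std::uint64_t{1} << 32) | 1;

// Hypothetical model: a single 64-bit atomic holds both 32-bit counts.
struct PackedControlBlock {
    std::atomic<std::uint64_t> counts{kBothOne};
    int disposed = 0;
    int destroyed = 0;
    int rmw_ops = 0;  // atomic read-modify-write operations done by release()

    void add_use_ref()  { counts.fetch_add(1, std::memory_order_relaxed); }
    void add_weak_ref() { counts.fetch_add(std::uint64_t{1} << 32,
                                           std::memory_order_relaxed); }

    void release() {
        // Fast path: one atomic load and one comparison, no read-modify-write.
        if (counts.load(std::memory_order_acquire) == kBothOne) {
            ++disposed;
            ++destroyed;
            return;
        }
        // Slow path: the original two-step algorithm.
        std::uint64_t old = counts.fetch_sub(1, std::memory_order_acq_rel);
        ++rmw_ops;
        assert((old & 0xffffffffu) >= 1 && "release() without a reference");
        if ((old & 0xffffffffu) == 1) {
            ++disposed;
            old = counts.fetch_sub(std::uint64_t{1} << 32,
                                   std::memory_order_acq_rel);
            ++rmw_ops;
            if ((old >> 32) == 1)
                ++destroyed;
        }
    }
};
```

In this model a sole owner releases without any atomic read-modify-write,
while a block with two owners pays one decrement on the first release and
takes the fast path on the second.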

Why it works: If the thread executing _M_release() finds that both counts
are equal to 1, then it holds the only use reference, and the only weak
reference is the one held collectively by the use references. No other
thread can hold a use or weak reference to this control block, so no other
thread can access the counts or the protected object.

There are two crucial high-level issues that I'd like to point out first:
- Atomicity of access to the counts together
- Proper alignment of the counts together

The patch is intended to apply the proposed algorithm only to the case of
64-bit mode, 4-byte counts, and 8-byte aligned _Sp_counted_base.

** Atomicity **
- The proposed algorithm depends on the mutual atomicity among 8-byte
atomic operations and 4-byte atomic operations on each of the 4-byte halves
of the 8-byte aligned 8-byte block.
- The standard does not guarantee atomicity of 8-byte operations on a pair
of 8-byte aligned 4-byte objects.
- To my knowledge this works in practice on systems that guarantee native
implementation of 4-byte and 8-byte atomic operations.
- Can we limit applying the proposed algorithm to architectures that
guarantee native implementation of atomic operations?
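
One possible shape for such a restriction, sketched under assumptions: the
C++17 member is_always_lock_free reports at build time whether an atomic
type is implemented without locks. kNativeAtomics and
check_lock_free_at_runtime are hypothetical names. Note that lock-freedom
of each size is a necessary condition only; it does not by itself
establish the mixed-size atomicity discussed above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical build-time gate (requires C++17): true only when both
// 4-byte and 8-byte atomics are always lock-free on the target.
constexpr bool kNativeAtomics =
    std::atomic<std::int32_t>::is_always_lock_free &&
    std::atomic<std::int64_t>::is_always_lock_free;

// Debug-build cross-check on concrete objects: a type that is always
// lock-free must also report lock-free for any particular object.
inline void check_lock_free_at_runtime() {
    std::atomic<std::int32_t> half{0};
    std::atomic<std::int64_t> whole{0};
    assert(!kNativeAtomics || (half.is_lock_free() && whole.is_lock_free()));
}
```

A gate like this could select between the proposed fast path and the
original algorithm with if constexpr or a preprocessor condition.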

** Alignment **
- _Sp_counted_base is an internal base class. Three internal classes are
derived from it.
- Two of these classes include a pointer as a first member. That is, the
layout (in the relevant case) is: use count (4 bytes), weak count (4
bytes), pointer (8 bytes). My understanding is that in these cases the two
4-byte counts are guaranteed to occupy an 8-byte aligned 8 byte range.
- The third class (_Sp_counted_ptr_inplace) only includes user data after
the two counts, without necessarily including any 8-byte aligned members.
For example, if the protected object is int32_t, then the
_Sp_counted_ptr_inplace object consists of three 4-byte members (the two
counts inherited from _Sp_counted_base and the user data). My understanding
is that the standard does not guarantee 8-byte alignment in such a case.
- Is 8-byte alignment guaranteed in practice in some widely-used
environments?
- Can 8-byte alignment be checked at build time in some widely-used
environments?
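
As a rough illustration of what a build-time check might look like,
alignof can be queried on the type, and the address of a particular object
can be tested at run time. The two structs below are hypothetical layouts
that mirror the cases described above; they are not the libstdc++ classes,
and the helper names are made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layouts mirroring the cases above.
struct CountsThenPointer { std::int32_t use; std::int32_t weak; void* ptr; };
struct CountsThenInt32   { std::int32_t use; std::int32_t weak; std::int32_t value; };

// Build-time check: every object of type T is at least 8-byte aligned,
// so counts placed at offset 0 occupy an 8-byte aligned 8-byte range.
template <typename T>
constexpr bool counts_8_byte_aligned_by_type() {
    return alignof(T) >= 8;
}

// Run-time check for one particular object address.
inline bool counts_8_byte_aligned(const void* p) {
    assert(p != nullptr);
    return reinterpret_cast<std::uintptr_t>(p) % 8 == 0;
}
```

On a typical 64-bit ABI I would expect the first struct to pass the
build-time check and the second to fail it, even though individual objects
of the second type may happen to land on 8-byte boundaries.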

Other points:
- The proposed algorithm can interact correctly with the current algorithm.
That is, multiple threads using different versions of the code with and
without the patch operating on the same objects should always interact
correctly. The intent for the patch is to be ABI compatible with the
current implementation.
- The proposed patch involves a performance trade-off: it saves the cost of
the two atomic decrements when both counts are 1, but adds the cost of
loading the 8-byte combined counts and comparing them with 0x100000001
before falling back to the original algorithm in all other cases.
- I noticed a big difference between the code generated by GCC vs LLVM. GCC
seems to generate noticeably more code and what seems to be redundant null
checks and branches.
- The patch has been in use (built using LLVM) in a large environment for
many months. The performance gains outweigh the losses (roughly 10 to 1)
across a large variety of workloads.

I'd appreciate your feedback on the alignment and atomicity issues as well
as any other comments.

Thank you,
Maged
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch
Type: application/octet-stream
Size: 4269 bytes
Desc: not available
URL: <https://gcc.gnu.org/pipermail/libstdc++/attachments/20201207/86d456bf/attachment.obj>