[PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

Maged Michael maged.michael@gmail.com
Mon Aug 2 13:23:38 GMT 2021


Please find attached an updated patch after incorporating Jonathan's
suggestions.

Changes from the last patch include:
- Add a TSAN macro to bits/c++config.
- Use separate constexpr bool-s for the conditions for lock-freedom,
double-width and alignment.
- Move the code in the optimized path to a separate function
_M_release_double_width_cas.

Thanks,
Maged


On Fri, Jul 16, 2021 at 11:55 AM Maged Michael <maged.michael@gmail.com>
wrote:

> Thank you, Jonathan, for the detailed comments! I'll update the patch
> accordingly.
>
> On Fri, Jul 16, 2021 at 9:55 AM Jonathan Wakely <jwakely.gcc@gmail.com>
> wrote:
>
>> On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote:
>> >
>> > Please find a proposed patch for _Sp_counted_base::_M_release to skip
>> the
>> > two atomic instructions that decrement each of the use count and the
>> weak
>> > count when both are 1. I proposed the general idea in an earlier thread
>> (
>> > https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html) and
>> got
>> > useful feedback on a draft patch and responses to related questions
>> about
>> > multi-granular atomicity and alignment. This patch is based on that
>> > feedback.
>> >
>> >
>> > I added a check for thread sanitizer to use the current algorithm in
>> that
>> > case because TSAN does not support multi-granular atomicity. I'd like to
>> > add a check of __has_feature(thread_sanitizer) for building using LLVM.
>> I
>> > found examples of __has_feature in libstdc++
>>
>> There are no uses of __has_feature in libstdc++. We do use
>> __has_builtin (which GCC also supports) and Clang's __is_identifier
>> (which GCC doesn't support) to work around some weird semantics of
>> __has_builtin in older versions of Clang.
>>
>>
>> > but it doesn't seem to be
>> > recognized in shared_ptr_base.h. Any guidance on how to check
>> > __has_feature(thread_sanitizer) in this patch?
>>
>> I think we want to do something like this in include/bits/c++config
>>
>> #if __SANITIZE_THREAD__
>> #  define _GLIBCXX_TSAN 1
>> #elif defined __has_feature
>> # if __has_feature(thread_sanitizer)
>> #  define _GLIBCXX_TSAN 1
>> # endif
>> #endif
>>
>> Then in bits/shared_ptr_base.h
>>
>> #if _GLIBCXX_TSAN
>>         _M_release_orig();
>>         return;
>> #endif
>>
>>
>>
>> > GCC generates code for _M_release that is larger and more complex than
>> that
>> > generated by LLVM. I'd like to file a bug report about that. Jonathan,
>>
>> Is this the same issue as https://gcc.gnu.org/PR101406 ?
>>
>> Partly yes. Even when using __atomic_add_dispatch I noticed that clang
> generated less code than gcc. I see in the response to the issue that the
> new glibc is expected to optimize better. So maybe this will eliminate the
> issue.
>
>
>> > would you please create a bugzilla account for me (
>> > https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.
>>
>> Done (sorry, I didn't notice the request in this mail until coming
>> back to it to review the patch properly).
>>
>> Thank you!
>
>
>>
>>
>> >
>> >
>> > Information about the patch:
>> >
>> > - Benefits of the patch: Save the cost of the last atomic decrements of
>> > each of the use count and the weak count in _Sp_counted_base. Atomic
>> > instructions are significantly slower than regular loads and stores
>> across
>> > major architectures.
>> >
>> > - How current code works: _M_release() atomically decrements the use
>> count,
>> > checks if it was 1, if so calls _M_dispose(), atomically decrements the
>> > weak count, checks if it was 1, and if so calls _M_destroy().
>> >
>> > - How the proposed patch works: _M_release() loads both use count and
>> weak
>> > count together atomically (when properly aligned), checks if the value
>> is
>> > equal to the value of both counts equal to 1 (e.g., 0x100000001), and
>> if so
>> > calls _M_dispose() and _M_destroy(). Otherwise, it follows the original
>> > algorithm.
>> >
>> > - Why it works: When the current thread executing _M_release() finds
>> each
>> > of the counts is equal to 1, then (when _lock_policy is _S_atomic) no
>> other
>> > threads could possibly hold use or weak references to this control
>> block.
>> > That is, no other threads could possibly access the counts or the
>> protected
>> > object.
>> >
>> > - The proposed patch is intended to interact correctly with current code
>> > (under certain conditions: _Lock_policy is _S_atomic, proper alignment,
>> and
>> > native lock-free support for atomic operations). That is, multiple
>> threads
>> > using different versions of the code with and without the patch
>> operating
>> > on the same objects should always interact correctly. The intent for the
>> > patch is to be ABI compatible with the current implementation.
>> >
>> > - The proposed patch involves a performance trade-off between saving the
>> > costs of two atomic instructions when the counts are both 1 vs adding
>> the
>> > cost of loading the combined counts and comparison with two ones (e.g.,
>> > 0x100000001).
>> >
>> > - The patch has been in use (built using LLVM) in a large environment
>> for
>> > many months. The performance gains outweigh the losses (roughly 10 to 1)
>> > across a large variety of workloads.
>> >
>> >
>> > I'd appreciate feedback on the patch and any suggestions for checking
>> > __has_feature(thread_sanitizer).
>>
>> N.B. gmail completely mangles patches unless you send them as attachments.
>>
>>
>> > diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h
>> > b/libstdc++-v3/include/bits/shared_ptr_base.h
>> >
>> > index 368b2d7379a..a8fc944af5f 100644
>> >
>> > --- a/libstdc++-v3/include/bits/shared_ptr_base.h
>> >
>> > +++ b/libstdc++-v3/include/bits/shared_ptr_base.h
>> >
>> > @@ -153,20 +153,78 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >
>> >         if (!_M_add_ref_lock_nothrow())
>> >
>> >           __throw_bad_weak_ptr();
>> >
>> >        }
>> >
>> >
>> >        bool
>> >
>> >        _M_add_ref_lock_nothrow() noexcept;
>> >
>> >
>> >        void
>> >
>> >        _M_release() noexcept
>> >
>> >        {
>> >
>> > +#if __SANITIZE_THREAD__
>> >
>> > +        _M_release_orig();
>> >
>> > +        return;
>> >
>> > +#endif
>> >
>> > +        if (!__atomic_always_lock_free(sizeof(long long), 0) ||
>>
>> The line break should come before the logical operator, not after.
>> This makes it easier to see which operator it is, because it's at a
>> predictable position, not off on the right edge somewhere.
>>
>> i.e.
>>
>>         if (!__atomic_always_lock_free(sizeof(long long), 0)
>>            || !__atomic_always_lock_free(sizeof(_Atomic_word), 0)
>>            || sizeof(long long) < (2 * sizeof(_Atomic_word))
>>            || sizeof(long long) > (sizeof(void*)))
>>
>> But I think I'd like to see this condition expressed differently
>> anyway, see below.
>>
>> >
>> > +            !__atomic_always_lock_free(sizeof(_Atomic_word), 0) ||
>> >
>> > +            sizeof(long long) < (2 * sizeof(_Atomic_word)) ||
>>
>> Shouldn't this be != instead of < ?
>>
>> On a big endian target where sizeof(long long) > sizeof(_Atomic_word)
>> loading two _Atomic_word objects will fill the high bits of the long
>> long, and so the (1LL + (1LL << (8 * 4))) calculation won't match what
>> got loaded into the long long.
>>
>> >
>> > +            sizeof(long long) > (sizeof(void*)))
>>
>> This is checking the alignment, right? I think we can do so more
>> reliably, and should comment it.
>>
>> I think I'd like to see this condition defined as a number of
>> constexpr booleans, with comments. Maybe:
>>
>> constexpr bool __lock_free
>>   = __atomic_always_lock_free(sizeof(long long), 0)
>>      && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
>> constexpr bool __double_word
>>   = sizeof(long long) == 2 * sizeof(_Atomic_word);
>> // The ref-count members follow the vptr, so are aligned to
>> alignof(void*).
>> constexpr bool __aligned = alignof(long long) <= alignof(void*);
>>
>> if _GLIBCXX17_CONSTEXPR
>>   (__lock_free && __double_word && __aligned)
>>   {
>>     _M_release_double_width_cas();
>>     return;
>>   }
>> else
>>   {
>>     // ... original body of _M_release();
>>   }
>>
>> > +          {
>> >
>> > +            _M_release_orig();
>> >
>> > +            return;
>> >
>> > +          }
>> >
>> > +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
>> >
>> > +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>> >
>> > +        if (__atomic_load_n((long long*)(&_M_use_count),
>> __ATOMIC_ACQUIRE)
>> >
>> > +            == (1LL + (1LL << (8 * sizeof(_Atomic_word)))))
>>
>> This should use __CHAR_BIT__ instead of 8.
>>
>>
>> >
>> > +          {
>> >
>> > +            // Both counts are 1, so there are no weak references and
>> >
>> > +            // we are releasing the last strong reference. No other
>> >
>> > +            // threads can observe the effects of this _M_release()
>> >
>> > +            // call (e.g. calling use_count()) without a data race.
>> >
>> > +            *(long long*)(&_M_use_count) = 0;
>> >
>> > +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>> >
>> > +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>> >
>> > +            _M_dispose();
>> >
>> > +            _M_destroy();
>> >
>> > +          }
>> >
>> > +        else
>> >
>> > +          {
>> >
>> > +            if ((__gnu_cxx::__exchange_and_add(&_M_use_count, -1) ==
>> 1))
>> >
>> > +              {
>> >
>> > +                _M_release_last_use();
>> >
>> > +              }
>> >
>> > +          }
>> >
>> > +      }
>> >
>> > +
>> >
>> > +      void
>> >
>> > +      __attribute__ ((noinline))
>> >
>> > +      _M_release_last_use() noexcept
>> >
>> > +      {
>> >
>> > +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>> >
>> > +        _M_dispose();
>> >
>> > +        if (_Mutex_base<_Lp>::_S_need_barriers)
>> >
>> > +          {
>> >
>> > +            __atomic_thread_fence (__ATOMIC_ACQ_REL);
>> >
>> > +          }
>> >
>> > +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>> >
>> > +        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
>> >
>> > +                                                   -1) == 1)
>> >
>> > +          {
>> >
>> > +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>> >
>> > +            _M_destroy();
>> >
>> > +          }
>> >
>> > +      }
>> >
>> > +
>> >
>> > +      void
>> >
>> > +      _M_release_orig() noexcept
>> >
>> > +      {
>> >
>> >          // Be race-detector-friendly.  For more info see
>> bits/c++config.
>> >
>> >          _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
>> >
>> >         if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1)
>> == 1)
>> >
>> >           {
>> >
>> >              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>> >
>> >             _M_dispose();
>> >
>> >             // There must be a memory barrier between dispose() and
>> > destroy()
>> >
>> >             // to ensure that the effects of dispose() are observed in
>> the
>> >
>> >             // thread that runs destroy().
>> >
>> >             // See
>> http://gcc.gnu.org/ml/libstdc++/2005-11/msg00136.html
>> >
>> > @@ -279,20 +337,27 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >
>> >      _Sp_counted_base<_S_single>::_M_release() noexcept
>> >
>> >      {
>> >
>> >        if (--_M_use_count == 0)
>> >
>> >          {
>> >
>> >            _M_dispose();
>> >
>> >            if (--_M_weak_count == 0)
>> >
>> >              _M_destroy();
>> >
>> >          }
>> >
>> >      }
>> >
>> >
>> > +  template<>
>> >
>> > +    inline void
>> >
>> > +    _Sp_counted_base<_S_mutex>::_M_release() noexcept
>> >
>> > +    {
>> >
>> > +      _M_release_orig();
>> >
>> > +    }
>> >
>> > +
>> >
>> >    template<>
>> >
>> >      inline void
>> >
>> >      _Sp_counted_base<_S_single>::_M_weak_add_ref() noexcept
>> >
>> >      { ++_M_weak_count; }
>> >
>> >
>> >    template<>
>> >
>> >      inline void
>> >
>> >      _Sp_counted_base<_S_single>::_M_weak_release() noexcept
>> >
>> >      {
>> >
>> >        if (--_M_weak_count == 0)
>>
>
-------------- next part --------------
diff --git a/libstdc++-v3/include/bits/c++config b/libstdc++-v3/include/bits/c++config
index 69ace386dd7..543fac268b4 100644
--- a/libstdc++-v3/include/bits/c++config
+++ b/libstdc++-v3/include/bits/c++config
@@ -126,20 +126,29 @@
 # define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11")))
 #endif
 
 // Macro to warn about unused results.
 #if __cplusplus >= 201703L
 # define _GLIBCXX_NODISCARD [[__nodiscard__]]
 #else
 # define _GLIBCXX_NODISCARD
 #endif
 
+// Macro for TSAN.
+#if __SANITIZE_THREAD__
+#  define _GLIBCXX_TSAN 1
+#elif defined __has_feature
+# if __has_feature(thread_sanitizer)
+#  define _GLIBCXX_TSAN 1
+# endif
+#endif
+
 
 
 #if __cplusplus
 
 // Macro for constexpr, to support in mixed 03/0x mode.
 #ifndef _GLIBCXX_CONSTEXPR
 # if __cplusplus >= 201103L
 #  define _GLIBCXX_CONSTEXPR constexpr
 #  define _GLIBCXX_USE_CONSTEXPR constexpr
 # else
diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
index 5be935d174d..3368426e417 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ b/libstdc++-v3/include/bits/shared_ptr_base.h
@@ -153,20 +153,93 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	if (!_M_add_ref_lock_nothrow())
 	  __throw_bad_weak_ptr();
       }
 
       bool
       _M_add_ref_lock_nothrow() noexcept;
 
       void
       _M_release() noexcept
       {
+#if _GLIBCXX_TSAN
+        _M_release_orig();
+        return;
+#endif
+        constexpr bool __lock_free
+          = __atomic_always_lock_free(sizeof(long long), 0)
+          && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
+        constexpr bool __double_word
+          = sizeof(long long) == 2 * sizeof(_Atomic_word);
+        // The ref-count members follow the vptr, so are aligned to
+        // alignof(void*).
+        constexpr bool __aligned = alignof(long long) <= alignof(void*);
+        if _GLIBCXX17_CONSTEXPR
+          (__lock_free && __double_word && __aligned)
+          {
+            _M_release_double_width_cas();
+            return;
+          }
+        else
+          {
+            _M_release_orig();
+          }
+      }
+
+      void
+      _M_release_double_width_cas()
+      {
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+        if (__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
+            == (1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)))))
+          {
+            // Both counts are 1, so there are no weak references and
+            // we are releasing the last strong reference. No other
+            // threads can observe the effects of this _M_release()
+            // call (e.g. calling use_count()) without a data race.
+            *(long long*)(&_M_use_count) = 0;
+            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+            _M_dispose();
+            _M_destroy();
+          }
+        else
+          {
+            if ((__gnu_cxx::__exchange_and_add(&_M_use_count, -1) == 1))
+              {
+                _M_release_last_use();
+              }
+          }
+      }
+
+      void
+      __attribute__ ((noinline))
+      _M_release_last_use() noexcept
+      {
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+        _M_dispose();
+        if (_Mutex_base<_Lp>::_S_need_barriers)
+          {
+            __atomic_thread_fence (__ATOMIC_ACQ_REL);
+          }
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
+                                                   -1) == 1)
+          {
+            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+            _M_destroy();
+          }
+      }
+
+      void
+      _M_release_orig() noexcept
+      {
         // Be race-detector-friendly.  For more info see bits/c++config.
         _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
 	if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
 	  {
             _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
 	    _M_dispose();
 	    // There must be a memory barrier between dispose() and destroy()
 	    // to ensure that the effects of dispose() are observed in the
 	    // thread that runs destroy().
 	    // See http://gcc.gnu.org/ml/libstdc++/2005-11/msg00136.html
@@ -279,20 +352,27 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
     _Sp_counted_base<_S_single>::_M_release() noexcept
     {
       if (--_M_use_count == 0)
         {
           _M_dispose();
           if (--_M_weak_count == 0)
             _M_destroy();
         }
     }
 
+  template<>
+    inline void
+    _Sp_counted_base<_S_mutex>::_M_release() noexcept
+    {
+      _M_release_orig();
+    }
+
   template<>
     inline void
     _Sp_counted_base<_S_single>::_M_weak_add_ref() noexcept
     { ++_M_weak_count; }
 
   template<>
     inline void
     _Sp_counted_base<_S_single>::_M_weak_release() noexcept
     {
       if (--_M_weak_count == 0)


More information about the Gcc-patches mailing list