[PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

Maged Michael maged.michael@gmail.com
Wed Aug 4 19:48:12 GMT 2021


On Wed, Aug 4, 2021 at 3:32 PM Jonathan Wakely <jwakely.gcc@gmail.com>
wrote:

> On Wed, 4 Aug 2021 at 18:19, Maged Michael wrote:
> >
> > Sorry. I totally missed the rest of your message and the patch. My fuzzy
> > eyesight, which usually guesses correctly 90% of the time, mistook
> > "Secondly" on a line by itself for "Sincerely" :-)
>
> :-)
>
> > The noinlining was based on looking at generated code. That was for
> > clang. It was inlining the _M_last_use function for every instance of
> > _M_release (e.g., ~shared_ptr). This optimization, with the noinline on
> > _M_release_last_use, ended up reducing massive binary text sizes by 1.5%
> > (several megabytes), which was an extra benefit. Without the noinline we
> > saw a code size increase.
>
> Wow, that is a convincing argument for making it not inline, thanks.
>
> > IIUC, we can use the following. Right?
> >
> > __attribute__((__noinline__))
>
> Right.
>
> > I didn't understand the part about programmers doing #define noinline 1.
> > I don't see code in the patch that uses noinline.
>
> This is a valid C++ program:
>
> #define noinline 1
> #include <memory>
> int main() { }
>
> But if anything in <memory> uses "noinline" then this valid program
> will not compile. Which is why we must use ((__noinline__)) instead of
> ((noinline)).
>
Thanks. Now I get it.


> >
> > How about something like this comment?
> >
> > // Noinline to avoid code size increase.
>
> Great, thanks.
>
> On Wed, 4 Aug 2021 at 18:34, Maged Michael wrote:
> > Actually I take back what I said. Sorry. I think the logic in your patch
> > is correct. I missed some of the atomic decrements.
> > But I'd be concerned about performance. If we make _M_release_last_use
> > noinline then we are adding overhead to the fast path of the original
> > logic (where both counts are 1).
>
> Oh, I see. So the code duplication serves a purpose. We want the
> _M_release_last_use() code to be non-inline for the new logic, because
> in the common case we never call it (we either detect that both counts
> are 1 and do the dispose & destroy without atomic decrements, or we do
> a single decrement and don't dispose or destroy anything). But for the
> old logic, we do want that code inlined into _M_release (or
> _M_release_orig as it was called in your patch). Have I got that right
> now?
>
Yes. Completely right.


> What if we remove the __noinline__ from _M_release_last_use() so that
> it can be inlined, but then add a noinline wrapper that can be called
> when we don't want to inline it?
>
> So:
>       // Called by _M_release() when the use count reaches zero.
>       void
>       _M_release_last_use() noexcept
>       {
>         // unchanged from previous patch, but without the attribute.
>         // ...
>       }
>
>       // As above, but 'noinline' to reduce code size on the cold path.
>       __attribute__((__noinline__))
>       void
>       _M_release_last_use_cold() noexcept
>       { _M_release_last_use(); }
>
>
> And then:
>
>   template<>
>     inline void
>     _Sp_counted_base<_S_atomic>::_M_release() noexcept
>     {
>       _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
> #if ! _GLIBCXX_TSAN
>       constexpr bool __lock_free
>         = __atomic_always_lock_free(sizeof(long long), 0)
>         && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
>       constexpr bool __double_word
>         = sizeof(long long) == 2 * sizeof(_Atomic_word);
>       // The ref-count members follow the vptr, so are aligned to
>       // alignof(void*).
>       constexpr bool __aligned = __alignof(long long) <= alignof(void*);
>       if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
>         {
>           constexpr long long __unique_ref
>             = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
>           auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);
>
>           _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>           if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE)
>               == __unique_ref)
>             {
>               // Both counts are 1, so there are no weak references and
>               // we are releasing the last strong reference. No other
>               // threads can observe the effects of this _M_release()
>               // call (e.g. calling use_count()) without a data race.
>               *(long long*)(&_M_use_count) = 0;
>               _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>               _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>               _M_dispose();
>               _M_destroy();
>               return;
>             }
>           if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1)
>               == 1)
>             {
>               _M_release_last_use_cold();
>               return;
>             }
>         }
>       else
> #endif
>       if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
>         {
>           _M_release_last_use();
>         }
>     }
>
>
> So we use the noinline version for the else branch in the new logic,
> but the can-inline version for the old logic. Would that work?
>
Yes.


> We could also consider adding __attribute__((__cold__)) to the
> _M_release_last_use_cold() function, and/or add [[__unlikely__]] to
> the 'if' that calls it, but maybe that's overkill.
>
> It seems like this will slightly pessimize the case where the last use
> count is dropped but there are weak refs, as that will take the cold
> path now. That's probably OK, given the benefits for the "both counts
> are 1" case, and that it might reduce the code size for the
> "use count > 1" case.
>

Right, this pessimizes the case where there are weak_ptrs when the last use
is released (e.g., enable_shared_from_this).
For background, when I applied this optimization there were gains and
losses but an overall net gain. Then I investigated whether the losses were
large enough to warrant developing a library of custom shared_ptrs that
we could use only in the cases that gain, keeping the standard one for the
cases where we'd lose (for example, if gains were 3x and losses 2x). I
found that overall (across hundreds of services) the gains and losses were
closer to 1.1x gains and 0.1x losses, which didn't warrant the hassle of
maintaining a custom library.
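For readers who want the shape of the optimization outside libstdc++ internals, here is a minimal standalone sketch of the combined-count check. The names (RefCounts, release) are illustrative, not the library's, and the type-punned 64-bit load relies on the same layout and lock-freedom assumptions the patch checks for:

```cpp
#include <atomic>
#include <cstdint>

// Illustrative stand-in for _Sp_counted_base: two adjacent 32-bit counters,
// aligned so both can be read with a single 64-bit atomic load.
struct alignas(std::int64_t) RefCounts {
    std::atomic<std::int32_t> use{1};   // like _M_use_count
    std::atomic<std::int32_t> weak{1};  // like _M_weak_count

    // Returns true when the caller should dispose and destroy.
    bool release() {
        // "Both counts are 1" puts a 1 in each half of the word, whichever
        // counter sits in the low half, so one compare covers both.
        constexpr std::int64_t unique_ref = 1LL + (1LL << 32);
        auto* both = reinterpret_cast<std::atomic<std::int64_t>*>(this);
        if (both->load(std::memory_order_acquire) == unique_ref) {
            // No other thread holds any reference, so a plain store
            // suffices and the atomic decrement is skipped entirely.
            use.store(0, std::memory_order_relaxed);
            return true;
        }
        // Slow path: ordinary atomic decrement; the last use returns true
        // (this is the case that is pessimized when weak refs exist).
        return use.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
```

Strictly speaking the reinterpreted 64-bit load is outside standard C++; the real patch uses __atomic_load_n and guards the whole path with __atomic_always_lock_free plus size and alignment checks for exactly this reason.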


>
> It's past 8pm here so I'll come back to this tomorrow and do some code
> size measurements of my own, now that I have a clearer understanding
> of what your original patch was optimizing for.
>

Thank you very much, Jonathan!

