[Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait

rodgertq at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Wed Sep 28 23:40:36 GMT 2022


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #25 from Thomas Rodgers <rodgertq at gcc dot gnu.org> ---
(In reply to Mkkt Bkkt from comment #24)
> (In reply to Thomas Rodgers from comment #22)
> > Your example of '100+ core' systems, especially on NUMA, is certainly a
> > valid one. I would ask: at what point do those collisions and the
> > resulting cache invalidation traffic swamp the cost of just making the
> > syscall? I do plan to put these tests together, because there is another
> > algorithm I am exploring that I believe will reduce the likelihood of
> > spurious wakeups and achieve the same result as this particular approach,
> > without generating the same invalidation traffic. At this point, I don't
> > anticipate doing that work until after GCC 13 stage 1 closes.
> 
> Let me try to explain:
> 
> Syscall overhead is roughly constant, commonly around 10-30ns (a futex
> syscall can be more expensive, around 100ns as in your example).
> 
> But core counts keep growing, and ARM is becoming more popular
> (fetch_add/sub cost more there compared to x86).
> People have already run into situations where a fetch_add costs more than
> the syscall overhead:
> 
> https://pkolaczk.github.io/server-slower-than-a-laptop/
> https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html
> 
> I don't think we will hit problems as bad as in those links with
> atomic::wait/notify in real code, but I'm pretty sure that in some cases it
> can be more expensive than the syscall part of atomic::wait/notify.
> 
> Of course it would be better to prove it; maybe someday I will :(
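
To make the comparison you are describing concrete, here is a rough sketch of
the kind of micro-benchmark I have in mind (hypothetical and Linux-only, not
libstdc++ code; the thread and iteration counts are arbitrary assumptions).
It times a contended fetch_add against a futex wake that finds no waiters,
i.e. just the syscall overhead:

  // Hypothetical micro-benchmark sketch, Linux-only.  Not libstdc++ code.
  #include <algorithm>
  #include <atomic>
  #include <chrono>
  #include <cstdio>
  #include <thread>
  #include <vector>
  #include <linux/futex.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static std::atomic<unsigned> counter{0};
  static unsigned futex_word = 0;

  // FUTEX_WAKE with no waiters queued: measures only the syscall overhead.
  static void futex_wake_none()
  {
    syscall(SYS_futex, &futex_word, FUTEX_WAKE_PRIVATE, 1,
            nullptr, nullptr, 0);
  }

  // Run `op` from `threads` threads, `iters` times each; return ns per op.
  template <typename F>
  static double ns_per_op(F op, int threads, long iters)
  {
    std::vector<std::thread> pool;
    auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < threads; ++t)
      pool.emplace_back([&] { for (long i = 0; i < iters; ++i) op(); });
    for (auto& th : pool)
      th.join();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count()
           / (double(threads) * iters);
  }

  int main()
  {
    const int threads = int(std::max(2u, std::thread::hardware_concurrency()));
    const long iters = 1'000'000;
    std::printf("contended fetch_add:   %.1f ns/op\n",
                ns_per_op([] { counter.fetch_add(1); }, threads, iters));
    std::printf("futex wake, no waiter: %.1f ns/op\n",
                ns_per_op([] { futex_wake_none(); }, threads, iters));
  }

On a machine with enough cores, the gap between those two numbers (in either
direction) is exactly the trade-off being argued about here.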

So, to your previous comment, I don't think the discussion is at all
pointless. I plan to raise some of these issues at the next SG1 meeting in
November. Sure, that doesn't help *you* or any developer with your specific
intent until C++26, and maybe Boost's implementation is a better choice; I
also get how unsatisfying an answer that is.

I'm well aware of the potential scalability problems, and I have a
longer-term plan to get concrete data on how different implementation
choices affect scalability. The barrier implementation (which uses the same
algorithm as libc++), for example, spreads this traffic over 64 individual
atomic_refs for this very reason, and that implementation has been shown to
scale quite well on ORNL's Summit. But not all users of libstdc++ have those
sorts of problems either.
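
To sketch what "spreads this traffic" means in practice (this is only an
illustration of the striping idea, not the actual barrier or waiter-pool
code; the slot count and per-thread hashing are assumptions), the trick is
to give contended updates several cache-line-sized slots instead of one
shared counter:

  // Illustration of striping a contended counter across cache lines.
  // Hypothetical sketch, not the libstdc++ or libc++ implementation.
  #include <atomic>
  #include <cstddef>
  #include <cstdint>
  #include <cstdio>
  #include <functional>
  #include <thread>
  #include <vector>

  struct alignas(64) slot   // one counter per (assumed 64-byte) cache line
  {
    std::atomic<std::uint64_t> value{0};
  };

  static slot slots[64];    // 64 slots, echoing the "64 individual
                            // atomic_refs" mentioned above

  static void add(std::uint64_t v)
  {
    // Derive the slot from the thread identity, so different threads tend
    // to update different cache lines rather than one shared counter.
    std::size_t i =
      std::hash<std::thread::id>{}(std::this_thread::get_id()) % 64;
    slots[i].value.fetch_add(v, std::memory_order_relaxed);
  }

  static std::uint64_t total()
  {
    // Reads pay for the striping: they have to sum every slot.
    std::uint64_t sum = 0;
    for (auto& s : slots)
      sum += s.value.load(std::memory_order_relaxed);
    return sum;
  }

  int main()
  {
    std::vector<std::thread> pool;
    for (int t = 0; t < 8; ++t)
      pool.emplace_back([] { for (int i = 0; i < 100000; ++i) add(1); });
    for (auto& th : pool)
      th.join();
    std::printf("total = %llu\n", (unsigned long long) total());
  }

The writers mostly stay out of each other's cache lines, which is what keeps
the invalidation traffic down on large machines; the cost is that any reader
has to touch all 64 lines.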

