Bug 115955 - atomic<T>::wait _S_for uses a poor hash function
Summary: atomic<T>::wait _S_for uses a poor hash function
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++ (show other bugs)
Version: 15.0
: P3 normal
Target Milestone: 17.0
Assignee: Jonathan Wakely
URL:
Keywords:
Depends on:
Blocks: c++20-lib
  Show dependency treegraph
 
Reported: 2024-07-16 11:20 UTC by Jakub Lopuszanski
Modified: 2026-04-21 15:58 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2025-01-10 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jakub Lopuszanski 2024-07-16 11:20:39 UTC
Summary:
========

The way I understand the code in 
https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/atomic_wait.h#L251-L258
and the article at
https://developers.redhat.com/articles/2022/12/06/implementing-c20-atomic-waiting-libstdc#
is that for std::atomic<uint64_t>::wait the implementation uses 16 elements and maps the address of the actual atomic to the element by a very simple hash function:
```
      static __waiter_pool_base&
      _S_for(const void* __addr) noexcept
      {
	constexpr uintptr_t __ct = 16;
	static __waiter_pool_base __w[__ct];
	auto __key = (uintptr_t(__addr) >> 2) % __ct;
	return __w[__key];
      }
```
This seems to be a very poor choice in my experience, as it only depends on (4 of) the lowest 6 bits, because data structures used in multi-threaded apps, and in particular, data structures which use atomics, often use cache-line alignment to avoid "false sharing", and because a cache line usually has 64 bytes, thus the __addr%64 is usually the same value for all instances of a given class. Also, quite often: 0.
This is a performance problem, as it then leads to spurious wake ups.

Background:
===========

I am working on InnoDB engine for MySQL, which uses a very old `struct os_event` which among other things lets people call epoch=reset(), set(), wait_low(epoch). This structure is used a lot in other old classes such as ib_mutex_t or rw_lock_t to notify threads waiting for a latch that it got released. As each page in Buffer Pool has several such latches, and BP size can be in TBs, one can imagine we have a lot of os_event-s. They are implemented as a combination of boolean flag, uint64_t counter, a native mutex, and a native cond var, leading to a huge sizeof(os_event) (upwards of 96 bytes on Linux) and a lot of non-trivial code to maintain. In 2020 [Facebook proposed](https://bugs.mysql.com/bug.php?id=102045) to use an idea similar to [WebKit's Parking Lot](https://db.in.tum.de/~boettcher/locking.pdf), which is to have a pool of such structs, and somehow reuse/share them. This would make the logic even more complicated, but would save a lot of memory (their claim: 1GB).

Recently MySQL's codebase moved to C++20, and I saw an opportunity to replace all of this custom logic with std::atomic<uint64_t> as there is a natural mapping:
```
epoch=reset() --> counter=load(), 
set() --> fetch_add(1)+notify_all(), 
wait_low(counter) --> wait(epoch).
```
This is not exactly the same thing, but close enough semantically, so I've implemented the change and run sysbench OLTP-RW test suite, and saw...disaster:
```
$num_of_runs $median $median_abs_diff ($min < $avg < $max) "$version"
pareto 64
6 27800 +- 113 ( 27423 < 27739 < 27943 ) "use std::atomic instead of os_event_t + 32-bit"
6 27676 +- 255 ( 27401 < 27696 < 28019 ) "use std::atomic instead of os_event_t"
6 27527 +- 163 ( 27171 < 27545 < 27806 ) "baseline"

pareto 1024
9 32234 +- 113 ( 31930 < 32244 < 32555 ) "use std::atomic instead of os_event_t + 32-bit"
9 32080 +- 134 ( 31946 < 32137 < 32460 ) "baseline"
9 21791 +- 677 ( 19965 < 21408 < 22486 ) "use std::atomic instead of os_event_t"

uniform 64
9 27326 +- 170 ( 26799 < 27287 < 27666 ) "use std::atomic instead of os_event_t + 32-bit"
9 27264 +- 98 ( 26677 < 27162 < 27453 ) "use std::atomic instead of os_event_t"
9 27246 +- 195 ( 26758 < 27206 < 27561 ) "baseline"

uniform 1024
9 35766 +- 125 ( 35246 < 35671 < 36173 ) "use std::atomic instead of os_event_t + 32-bit"
9 35623 +- 146 ( 35446 < 35669 < 36150 ) "baseline"
9 23631 +- 313 ( 22835 < 23603 < 24399 ) "use std::atomic instead of os_event_t"
```
This is a machine with 96 CPUs, and we can see that for 1024 threads the impact of replacing the default os_event_t with atomic<uint64_t> is changing the number of transactions per second from 35.6K to 23.6K.

A flame graphs' diff showed a lot of time spent in `syscall` during latch acquistions, i.e. when calling "wait".
I guess this is not so much about the duration, but about frequency of these calls, as there's a retry `while` loop in atomic_wait.h which calls futex syscall again whenever a spurious wake up occurs.
At any rate, the problem is somewhere in the handling of 64-bit atomics, because changing the counter to std::atomic<uint32_t> made the problem go away as you can see.
I'd rather use uint64_t to make ABA almost impossible.

Proposed fix:
=============
How about:
a) using %17 instead of %16? 
b) looking at the next 4 bits?
c) pre-multiply the __addr by a random odd constant and take the top 4 bits?
d) use a real hash function, such as _mm_crc32_u64?
e) something more galaxy brainded like using a lock-free hashmap from actively used addr values to a pool of the size equal to number of threads?

Personally, I'd give (a) a try - division by a contexpr value shouldn't hurt much, and we are about to do a costly syscall anyway.
Comment 1 Jonathan Wakely 2024-07-16 11:22:42 UTC
Yes, this is a known limitation that will get fixed with the planned refactoring of atomic::wait.
Comment 2 Jonathan Wakely 2025-01-10 11:09:05 UTC
std::barrier does:

	std::hash<std::thread::id> __hasher;
	size_t __current = __hasher(std::this_thread::get_id());

Maybe this could be improved too, although it doesn't have exactly the same issue (not all thread IDs are separated by a constant power-of-two value).
Comment 3 Jonathan Wakely 2025-05-30 09:32:43 UTC
(In reply to Jonathan Wakely from comment #1)
> Yes, this is a known limitation that will get fixed with the planned
> refactoring of atomic::wait.

The function that maps an address to a mutex has been moved out of the headers for gcc 16, so can be changed to use a better hash function without causing an ABI break.

That makes GCC 16 incompatible with GCC 15, of course, but that was always going to be true because we need ABI changes in order to get C++20 stable and non-experimental.