This is the mail archive of the
libstdc++@gcc.gnu.org
mailing list for the libstdc++ project.
mt allocator documentation (draft)
- From: Stefan Olsson <stefan at xapa dot se>
- To: libstdc++ <libstdc++ at gcc dot gnu dot org>
- Date: Sun, 01 Feb 2004 11:49:48 +0100
- Subject: mt allocator documentation (draft)
Hi all,
here's a draft of the documentation for the mt allocator. I know, it's a
bit lengthy but I have been trying to draw that "magic 3D picture that
explains it all at once" for too long now so a little "story" will have
to do instead =)
Looking forward to all comments - I've probably missed some things that
I just "see as obvious" due to the amount of time spent with this piece
of code lately...
Brgds
/Stefan
Introduction
------------
The mt allocator [hereinafter referred to simply as "the allocator"] is a fixed
size (power of two) allocator that was initially developed specifically to suit
the needs of multi threaded applications [hereinafter referred to as a MT
application]. Over time the allocator has evolved and been improved in many
ways, one of the being that it now also does a good job in single threaded
applications [hereinafter referred to as a ST application].
(Note: In this document, when referring to single threaded applications this
also includes applications that are compiled with gcc without thread support
enabled. This is accomplished using ifdef's on __GTHREADS)
The aim of this document is to - from a application point of view - describe
the "inner workings" of the allocator.
Initialization
--------------
The static variables (pointers to freelists, tuning parameters etc)
are initialized to their default values at file scope, i.e.:
template<typename _Tp> size_t
__mt_alloc<_Tp>::_S_freelist_headroom = 10;
The very first allocate() call will always call the _S_init() function.
In order to make sure that this function is called exactly once we make use
of a __gthread_once (with _S_once_mt and _S_init as arguments) call in MT
applications and check a static bool (_S_initialized) in ST applications.
The _S_init() function:
- If the GLIBCXX_FORCE_NEW enviorment variable is set, it sets the bool
_S_force_new to true and then returns. This will cause subsequent calls to
allocate() to return memory directly from a new() call, and deallocate will
only do a delete() call.
- In the GLIBCXX_FORCE_NEW enviorment variable is not set, both ST and MT
applications will:
- Calculate the number of bins needed. A bin is a specific power of two size
of bytes. I.e. By default the allocator will deal with requests of up to
128 bytes (or whatever the value of _S_max_bytes is when _S_init() is
called). This means that there will be bins of the following sizes
(in bytes): 1, 2, 4, 8, 16, 32, 64, 128.
- Create the _S_binmap array. All requests are rounded up to the next
"large enough" bin. I.e. A request for 29 bytes will cause a block from
the "32 byte bin" to be returned to the application. The purpose of
_S_binmap is to speed up the process of finding out which bin to use.
I.e. The value of _S_binmap[ 29 ] is initialized to 5 (bin 5 = 32 bytes).
- Create the _S_bin array. This array consists of bin_records. There will be
as many bin_records in this array as the number of bins that we calculated
erlier. I.e. If _S_max_bytes = 128 there will be 8 entries.
Each bin_record is then initialized:
- bin_record->first = An array of pointers to block_records. There will be
as many block_records pointers as there are maximum number of threads
(in a ST application there is only 1 thread, in a MT application there
are _S_max_threads).
This holds the pointer to the first free block for each thread in this
bin. I.e. If we would like to know where the first free block of size 32
for thread number 3 is we would look this up by: _S_bin[ 5 ].first[ 3 ]
- bin_record->last = See above, the only difference being that this points
to the last record on the same freelist.
The above created block_record pointers members are now initialized to
their initial values. I.e. _S_bin[ n ].first[ n ] = NULL;
- Additionally a MT application will:
- Create a list of free thread id's. The pointer to the first entry
is stored in _S_thread_freelist_first. The reason for this approach is
that the __gthread_self() call will not return a value that corresponds to
the maximum number of threads allowed but rather a process id number or
something else. So what we do is that we create a list of thread_records.
This list is _S_max_threads long and each entry holds a size_t thread_id
which is initialized to 1, 2, 3, 4, 5 and so on up to _S_max_threads.
Each time a thread calls allocate() or deallocate() we call
_S_get_thread_id() which looks at the value of _S_thread_key which is a
thread local storage pointer. If this is NULL we know that this is a newly
created thread and we pop the first entry from this list and saves the
pointer to this record in the _S_thread_key variable. The next time
we will get the pointer to the thread_record back and we use the
thread_record->thread_id as identification. I.e. The first thread that calls
allocate will get the first record in this list and thus be thread
number 1 and will then find the pointer to its first free 32 byte block
in _S_bin[ 5 ].first[ 1 ]
When we create the _S_thread_key we also define a destructor
(_S_thread_key_destr) which means that when the thread dies, this
thread_record is returned to the front of this list and the thread id
can then be reused if a new thread is created.
This list is protected by a mutex (_S_thread_freelist_mutex) which is only
locked when records are removed/added to the list.
- Initialize the free and used counters of each bin_record:
- bin_record->free = An array of size_t. This keeps track of the number
of blocks on a specific thread's freelist in each bin. I.e. If a thread
has 12 32-byte blocks on it's freelists and allocates one of these, this
couter would be decreased to 11.
- bin_record->used = An array of size_t. This keeps track of the number
of blocks currently in use of this size by this thread. I.e. If a thread
has made 678 requests (and no deallocations...) of 32-byte blocks this
counter will read 678.
The above created arrays are now initialized with their initial values.
I.e. _S_bin[ n ].free[ n ] = 0;
- Initialize the mutex of each bin_record:
The bin_record->mutex is used to protect the global freelist. This concept
of a global freelist is explained in more detail in the section
"A multi threaded example", but basically this mutex is locked whenever
a block of memory is retrieved or returned to the global freelist for this
specific bin. This only occurs when a number of blocks are grabbed from the
global list to a thread specific list or when a thread decides to return
some blocks to the global freelist.
A single threaded example (and a primer for the multi threaded example!)
------------------------------------------------------------------------
Let's start by describing how the data on a freelist is layed out in memory.
This is the first two blocks in freelist for thread id 3 in bin 3 (8 bytes):
+----------------+
| next* ---------|--+ (_S_bin[ 3 ].first[ 3 ] points here)
| | |
| | |
| | |
+----------------+ |
| thread_id = 3 | |
| | |
| | |
| | |
+----------------+ |
| DATA | | (A pointer to here is what is returned to the
| | | the application when needed)
| | |
| | |
| | |
| | |
| | |
| | |
+----------------+ |
+----------------+ |
| next* |<-+ (If next == NULL it's the last one on the list and
| | then the _S_bin[ 3 ].last[ 3 ] pointer points to
| | here as well)
| |
+----------------+
| thread_id = 3 |
| |
| |
| |
+----------------+
| DATA |
| |
| |
| |
| |
| |
| |
| |
+----------------+
With this in mind we simplify things a bit for a while and say that there is
only one thread (a ST application). In this case all operations are made to
what is referred to as the global pool - thread id 0 (No thread may be
assigned this id since they span from 1 to _S_max_threads in a MT application).
When the application requests memory (calling allocate()) we first look at the
requested size and if this is > _S_max_bytes we call new() directly and return.
If the requested size is within limits we start by finding out from which
bin we should serve this request by looking in _S_binmap.
A quick look at _S_bin[ bin ].first[ 0 ] tells us if there are any blocks of
this size on the freelist (0). If this is not NULL - fine, just remove the
block that _S_bin[ bin ].first[ 0 ] points to from the list,
update _S_bin[ bin ].first[ 0 ] and return a pointer to that blocks data.
If the freelist is empty (the pointer is NULL) we must get memory from the
system and build us a freelist within this memory. All requests for new memory
is made in chunks of _S_chunk_size. Knowing the size of a block_record and
the bytes that this bin stores we then calculate how many blocks we can create
within this chunk, build the list, remove the first block, update the pointers
(_S_bin[ bin ].first[ 0 ] and _S_bin[ bin ].last[ 0 ]) and return a pointer
to that blocks data.
Deallocation is equally simple; the pointer is casted back to a block_record
pointer, lookup which bin to use based on the size, add the block to the end
of the global freelist (with the next pointer set to NULL) and update the
pointers as needed (_S_bin[ bin ].first[ 0 ] and _S_bin[ bin ].last[ 0 ]).
A multi threaded example
------------------------
In the ST example we never used the thread_id variable present in each block.
Let's start by explaining the purpose of this in a MT application.
The concept of "ownership" was introduced since many MT applications allocate
and deallocate memory to shared containers from different threads (such as a
cache shared amongst all threads). This introduces a problem if the allocator
only returns memory to it's own freelist (I.e. there might be one thread doing
all the allocation and thus obtaining ever more memory from the system and
another thread that is getting a longer and longer freelist - this will in
the end consume all available memory).
Each time a block is moved from the global list (where ownership is irrelevant),
to a threads freelist (or when a new freelist is built from a chunk directly
onto a threads freelist or when a deallocation occurs on a block which was not
allocated by the same thread id as the one doing the deallocation) the thread id
is set to the current one.
What's the use? Well, when a deallocation occurs we can now look at the thread
id and find out if it was allocated by another thread id and decrease the used
counter of that thread instead, thus keeping the free and used counters
correct. And keeping the free and used counters corrects is very important
since the relationship between these two variables decides if memory should
be returned to the global pool or not when a deallocation occurs.
When the application requests memory (calling allocate()) we first look at the
requested size and if this is > _S_max_bytes we call new() directly and return.
If the requested size is within limits we start by finding out from which
bin we should serve this request by looking in _S_binmap.
A call to _S_get_thread_id() returns the thread id for the calling thread
(and if no value has been set in _S_thread_key, a new id is assigned and
returned).
A quick look at _S_bin[ bin ].first[ thread_id ] tells us if there are any
blocks of this size on the current threads freelist. If this is not NULL -
fine, just remove the block that _S_bin[ bin ].first[ thread_id ] points to
from the list, update _S_bin[ bin ].first[ thread_id ], update the free and
used counters and return a pointer to that blocks data.
If the freelist is empty (the pointer is NULL) we start by looking at the global
freelist (0). If there are blocks available on the global freelist we lock
this bins mutex and move up to block_count (the number of blocks of this bins
size that will fit into a _S_chunk_size) or until end of list - whatever comes
first - to the current threads freelist and at the same time change the
thread_id ownership and update the counters and pointers. When the bins mutex
has been unlocked, we remove the block that _S_bin[ bin ].first[ thread_id ]
points to from the list, update _S_bin[ bin ].first[ thread_id ], update the
free and used counters, and return a pointer to that blocks data.
However if there isn't any memory on the global pool we need to get memory from
the system - this is done in exactly the same way as in a single threaded
application with one major difference; the list built in the newly allocated
memory (of _S_chunk_size size) is added to the current threads freelist instead
of to the global.
The basic process of a deallocation call is simple; always add the block to the
end of the current threads freelist and update the counters and pointers (as
described erlier with the specific check of ownership that causes the used
counter of the thread that originally allocated the block to be decreased
instead of the current threads counter).
And here comes the free and used counters to service. Each time a deallocation()
call is made, the length of the current threads freelist is compared to the
amount memory in use by this thread. Let's go back to the example of an
application that has one thread that does all the allocations and one that
deallocates. Both these threads use say 516 32-byte blocks that was allocated
during thread creation for example. Their used counters will both say 516 at
this point. The allocation thread now grabs 1000 32-byte blocks and puts them
in a shared container. The use counter for this thread is now 1516.
The deallocation thread now deallocates 500 of these blocks. For each
deallocation made the used counter of the allocating thread is decreased and
the freelist of the deallocation thread gets longer and longer. But the
calculation made in deallocate() will limit the length of the freelist in the
deallocation thread to _S_freelist_headroom % of it's used counter. In this
case, when the freelist (given that the _S_freelist_headroom is at it's default
value of 10%) exceeds 52 (516/10) blocks will be returned to the global pool
where the allocating thread may pick them up and reuse them. In order to
reduce lock contention (since this requires this bins mutex to be locked) this
operation is also made in chunks of blocks (just like when chunks of blocks
are moved from the global freelist to a threads freelist mentioned above).