This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

mt allocator documentation (draft)


Hi all,

here's a draft of the documentation for the mt allocator. I know, it's a bit lengthy but I have been trying to draw that "magic 3D picture that explains it all at once" for too long now so a little "story" will have to do instead =)

Looking forward to all comments - I've probably missed some things that I just "see as obvious" due to the amount of time spent with this piece of code lately...

Brgds

/Stefan
Introduction
------------
The mt allocator [hereinafter referred to simply as "the allocator"] is a fixed
size (power of two) allocator that was initially developed specifically to suit
the needs of multi threaded applications [hereinafter referred to as a MT 
application]. Over time the allocator has evolved and been improved in many 
ways, one of the being that it now also does a good job in single threaded 
applications [hereinafter referred to as a ST application].
(Note: In this document, when referring to single threaded applications this 
also includes applications that are compiled with gcc without thread support 
enabled. This is accomplished using ifdef's on __GTHREADS)

The aim of this document is to - from a application point of view - describe
the "inner workings" of the allocator.

Initialization
--------------
The static variables (pointers to freelists, tuning parameters etc)
are initialized to their default values at file scope, i.e.:
  template<typename _Tp> size_t
  __mt_alloc<_Tp>::_S_freelist_headroom = 10;

The very first allocate() call will always call the _S_init() function. 
In order to make sure that this function is called exactly once we make use
of a __gthread_once (with _S_once_mt and _S_init as arguments) call in MT 
applications and check a static bool (_S_initialized) in ST applications.

The _S_init() function:
- If the GLIBCXX_FORCE_NEW enviorment variable is set, it sets the bool
  _S_force_new to true and then returns. This will cause subsequent calls to
  allocate() to return memory directly from a new() call, and deallocate will
  only do a delete() call.

- In the GLIBCXX_FORCE_NEW enviorment variable is not set, both ST and MT 
  applications will:
  - Calculate the number of bins needed. A bin is a specific power of two size
    of bytes. I.e. By default the allocator will deal with requests of up to 
    128 bytes (or whatever the value of _S_max_bytes is when _S_init() is 
    called). This means that there will be bins of the following sizes 
    (in bytes): 1, 2, 4, 8, 16, 32, 64, 128. 

  - Create the _S_binmap array. All requests are rounded up to the next 
    "large enough" bin. I.e. A request for 29 bytes will cause a block from 
    the "32 byte bin" to be returned to the application. The purpose of 
    _S_binmap is to speed up the process of finding out which bin to use. 
    I.e. The value of _S_binmap[ 29 ] is initialized to 5 (bin 5 = 32 bytes).

  - Create the _S_bin array. This array consists of bin_records. There will be
    as many bin_records in this array as the number of bins that we calculated
    erlier. I.e. If _S_max_bytes = 128 there will be 8 entries.
    Each bin_record is then initialized:
    - bin_record->first = An array of pointers to block_records. There will be
      as many block_records pointers as there are maximum number of threads 
      (in a ST application there is only 1 thread, in a MT application there 
      are _S_max_threads).
      This holds the pointer to the first free block for each thread in this
      bin. I.e. If we would like to know where the first free block of size 32
      for thread number 3 is we would look this up by: _S_bin[ 5 ].first[ 3 ]
    - bin_record->last = See above, the only difference being that this points
      to the last record on the same freelist.

    The above created block_record pointers members are now initialized to 
    their initial values. I.e. _S_bin[ n ].first[ n ] = NULL;

- Additionally a MT application will:
  - Create a list of free thread id's. The pointer to the first entry
    is stored in _S_thread_freelist_first. The reason for this approach is 
    that the __gthread_self() call will not return a value that corresponds to 
    the maximum number of threads allowed but rather a process id number or
    something else. So what we do is that we create a list of thread_records.
    This list is _S_max_threads long and each entry holds a size_t thread_id
    which is initialized to 1, 2, 3, 4, 5 and so on up to _S_max_threads.
    Each time a thread calls allocate() or deallocate() we call 
    _S_get_thread_id() which looks at the value of _S_thread_key which is a
    thread local storage pointer. If this is NULL we know that this is a newly
    created thread and we pop the first entry from this list and saves the
    pointer to this record in the _S_thread_key variable. The next time 
    we will get the pointer to the thread_record back and we use the 
    thread_record->thread_id as identification. I.e. The first thread that calls
    allocate will get the first record in this list and thus be thread
    number 1 and will then find the pointer to its first free 32 byte block
    in _S_bin[ 5 ].first[ 1 ]
    When we create the _S_thread_key we also define a destructor 
    (_S_thread_key_destr) which means that when the thread dies, this
    thread_record is returned to the front of this list and the thread id
    can then be reused if a new thread is created.
    This list is protected by a mutex (_S_thread_freelist_mutex) which is only
    locked when records are removed/added to the list.

  - Initialize the free and used counters of each bin_record:
    - bin_record->free = An array of size_t. This keeps track of the number
      of blocks on a specific thread's freelist in each bin. I.e. If a thread
      has 12 32-byte blocks on it's freelists and allocates one of these, this
      couter would be decreased to 11.

    - bin_record->used = An array of size_t. This keeps track of the number
      of blocks currently in use of this size by this thread. I.e. If a thread
      has made 678 requests (and no deallocations...) of 32-byte blocks this
      counter will read 678.

    The above created arrays are now initialized with their initial values. 
    I.e. _S_bin[ n ].free[ n ] = 0;

  - Initialize the mutex of each bin_record:
    The bin_record->mutex is used to protect the global freelist. This concept
    of a global freelist is explained in more detail in the section
    "A multi threaded example", but basically this mutex is locked whenever 
    a block of memory is retrieved or returned to the global freelist for this
    specific bin. This only occurs when a number of blocks are grabbed from the
    global list to a thread specific list or when a thread decides to return 
    some blocks to the global freelist.

A single threaded example (and a primer for the multi threaded example!)
------------------------------------------------------------------------
Let's start by describing how the data on a freelist is layed out in memory.
This is the first two blocks in freelist for thread id 3 in bin 3 (8 bytes):
+----------------+
| next* ---------|--+  (_S_bin[ 3 ].first[ 3 ] points here)
|                |  |
|                |  |
|                |  |
+----------------+  |
| thread_id = 3  |  |
|                |  |
|                |  |
|                |  |
+----------------+  |
| DATA           |  |  (A pointer to here is what is returned to the
|                |  |   the application when needed)
|                |  |
|                |  |
|                |  |
|                |  |
|                |  |
|                |  |
+----------------+  |
+----------------+  |
| next*          |<-+  (If next == NULL it's the last one on the list and
|                |      then the _S_bin[ 3 ].last[ 3 ] pointer points to
|                |      here as well)
|                |
+----------------+
| thread_id = 3  |
|                |
|                |
|                |
+----------------+
| DATA           |
|                |
|                |
|                |
|                |
|                |
|                |
|                |
+----------------+

With this in mind we simplify things a bit for a while and say that there is
only one thread (a ST application). In this case all operations are made to 
what is referred to as the global pool - thread id 0 (No thread may be
assigned this id since they span from 1 to _S_max_threads in a MT application).

When the application requests memory (calling allocate()) we first look at the
requested size and if this is > _S_max_bytes we call new() directly and return.

If the requested size is within limits we start by finding out from which 
bin we should serve this request by looking in _S_binmap.

A quick look at _S_bin[ bin ].first[ 0 ] tells us if there are any blocks of
this size on the freelist (0). If this is not NULL - fine, just remove the
block that _S_bin[ bin ].first[ 0 ] points to from the list, 
update _S_bin[ bin ].first[ 0 ] and return a pointer to that blocks data.

If the freelist is empty (the pointer is NULL) we must get memory from the 
system and build us a freelist within this memory. All requests for new memory
is made in chunks of _S_chunk_size. Knowing the size of a block_record and 
the bytes that this bin stores we then calculate how many blocks we can create 
within this chunk, build the list, remove the first block, update the pointers
(_S_bin[ bin ].first[ 0 ] and _S_bin[ bin ].last[ 0 ]) and return a pointer 
to that blocks data. 

Deallocation is equally simple; the pointer is casted back to a block_record
pointer, lookup which bin to use based on the size, add the block to the end 
of the global freelist (with the next pointer set to NULL) and update the 
pointers as needed (_S_bin[ bin ].first[ 0 ] and _S_bin[ bin ].last[ 0 ]).

A multi threaded example
------------------------
In the ST example we never used the thread_id variable present in each block. 
Let's start by explaining the purpose of this in a MT application. 

The concept of "ownership" was introduced since many MT applications allocate
and deallocate memory to shared containers from different threads (such as a 
cache shared amongst all threads). This introduces a problem if the allocator
only returns memory to it's own freelist (I.e. there might be one thread doing
all the allocation and thus obtaining ever more memory from the system and 
another thread that is getting a longer and longer freelist - this will in 
the end consume all available memory).

Each time a block is moved from the global list (where ownership is irrelevant),
to a threads freelist (or when a new freelist is built from a chunk directly 
onto a threads freelist or when a deallocation occurs on a block which was not
allocated by the same thread id as the one doing the deallocation) the thread id
is set to the current one.

What's the use? Well, when a deallocation occurs we can now look at the thread
id and find out if it was allocated by another thread id and decrease the used 
counter of that thread instead, thus keeping the free and used counters
correct. And keeping the free and used counters corrects is very important 
since the relationship between these two variables decides if memory should 
be returned to the global pool or not when a deallocation occurs.

When the application requests memory (calling allocate()) we first look at the
requested size and if this is > _S_max_bytes we call new() directly and return.

If the requested size is within limits we start by finding out from which 
bin we should serve this request by looking in _S_binmap.

A call to _S_get_thread_id() returns the thread id for the calling thread 
(and if no value has been set in _S_thread_key, a new id is assigned and
returned).

A quick look at _S_bin[ bin ].first[ thread_id ] tells us if there are any 
blocks of this size on the current threads freelist. If this is not NULL - 
fine, just remove the block that _S_bin[ bin ].first[ thread_id ] points to 
from the list, update _S_bin[ bin ].first[ thread_id ], update the free and 
used counters and return a pointer to that blocks data.

If the freelist is empty (the pointer is NULL) we start by looking at the global
freelist (0). If there are blocks available on the global freelist we lock
this bins mutex and move up to block_count (the number of blocks of this bins
size that will fit into a _S_chunk_size) or until end of list - whatever comes
first - to the current threads freelist and at the same time change the 
thread_id ownership and update the counters and pointers. When the bins mutex 
has been unlocked, we remove the block that _S_bin[ bin ].first[ thread_id ] 
points to from the list, update _S_bin[ bin ].first[ thread_id ], update the 
free and used counters, and return a pointer to that blocks data.

However if there isn't any memory on the global pool we need to get memory from
the system - this is done in exactly the same way as in a single threaded
application with one major difference; the list built in the newly allocated
memory (of _S_chunk_size size) is added to the current threads freelist instead
of to the global.

The basic process of a deallocation call is simple; always add the block to the
end of the current threads freelist and update the counters and pointers (as
described erlier with the specific check of ownership that causes the used 
counter of the thread that originally allocated the block to be decreased
instead of the current threads counter).

And here comes the free and used counters to service. Each time a deallocation()
call is made, the length of the current threads freelist is compared to the 
amount memory in use by this thread. Let's go back to the example of an
application that has one thread that does all the allocations and one that
deallocates. Both these threads use say 516 32-byte blocks that was allocated
during thread creation for example. Their used counters will both say 516 at
this point. The allocation thread now grabs 1000 32-byte blocks and puts them
in a shared container. The use counter for this thread is now 1516.
The deallocation thread now deallocates 500 of these blocks. For each 
deallocation made the used counter of the allocating thread is decreased and
the freelist of the deallocation thread gets longer and longer. But the 
calculation made in deallocate() will limit the length of the freelist in the
deallocation thread to _S_freelist_headroom % of it's used counter. In this
case, when the freelist (given that the _S_freelist_headroom is at it's default
value of 10%) exceeds 52 (516/10) blocks will be returned to the global pool
where the allocating thread may pick them up and reuse them. In order to 
reduce lock contention (since this requires this bins mutex to be locked) this
operation is also made in chunks of blocks (just like when chunks of blocks 
are moved from the global freelist to a threads freelist mentioned above).


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]