This is the mail archive of the
libstdc++@gcc.gnu.org
mailing list for the libstdc++ project.
Thoughts on memory allocation...
- From: Stefan Olsson <stefan at noname4us dot com>
- To: libstdc++ at gcc dot gnu dot org
- Date: Fri, 28 Jun 2002 00:42:23 +0200
- Subject: Thoughts on memory allocation...
- Organization: Noname4us
Hi all!
Remember the questions raised a few weeks ago about the libstdc++
allocators and ideas for improvements?
Back then I was thinking of just writing down all the ideas, so I set
out to read up on the subject and try to understand what others have
done and what might be useful, and ended up with a nice little heap
of paper (yes, I still like to read some things on paper...) on my
desk. Just writing it all down into a new document seemed redundant
with what others have done so many times, so I made a few design
decisions:
- Try to do as few system (i.e. malloc) calls as possible, to gain
similar performance across various platforms.
- Since most of our applications are long-running ones (application
servers and such), keep heap growth (due to fragmentation etc.) to a
minimum.
- Don't worry about memory consumption if that is what it takes to
accomplish the previous goal (I don't like to waste memory, but on the
targets we are working on, memory is not much of an issue).
- Improve thread concurrency.
- "Solve" the "one thread allocates and another one frees" memory issue
that I discussed earlier (i.e. what happens when keeping a cache of data
in an MT application).
After figuring out how to implement this (in terms of integrating with
libstdc++) I decided to make a header and an implementation file that is
linked into the application, to solve the "one instance" issue of having
actual code in .h files (I'm sure there is a much better way of doing
much of the implementation here, but once again: this is posted for
discussion...), and then tell the containers to use this by wrapping it
through the __allocator<> adapter available in stl_alloc.h (i.e. vector<
int, __allocator< void, nn4us::nn4us_alloc > > foo;).
OK, the source is obviously the best documentation (?), but here are
some comments...
nn4us_alloc.hpp:
- There seems to be something "missing" in the __allocator<> code, since
I had to add operator == and != in order to get basic_string<> to
compile. This does not happen, however, if I compile with
malloc_allocator, which is also part of stl_alloc.h - why, I don't know...
- There are definitions for new and delete as well...
...as I said above, this piece of code is here for discussion, and the
best way I could think of to get an entire application to behave
similarly on several platforms, as well as to minimize heap growth, was
to have new and delete use the same allocator as the containers.
nn4us_alloc.cpp:
- The allocator gets chunks (CHUNK_SIZE) of memory from the system.
- Each size (up to the predetermined MAX_BYTES) is dealt with by the
allocator; the rest is sent to malloc()/free() directly.
- All requests are rounded up to powers of two - even the ones that are
larger than MAX_BYTES - in order to minimize fragmentation in the
malloc() freelists.
- Each size (power of two) has its own list of records and its own
mutex. This is done to improve concurrency (i.e. two requests of
different sizes can be dealt with without locking if the threads already
own a record). What might seem counterproductive to this is the
malloc_mutex, which actually synchronizes all malloc() calls...
...once again this is done for testing, and was added after reading a
paper on the Linux malloc() implementation which states "To reduce lock
contention, ptmalloc searches for the first unlocked subheap and grabs
memory from it to fulfill a malloc() request. If ptmalloc doesn't find an
unlocked heap, it creates a new one." (A very interesting read:
http://www.citi.umich.edu/techreports/reports/citi-tr-00-5.pdf). My
interpretation, and the goal of this experiment, was to see whether
"helping" malloc() to avoid creating new heaps would improve
overall/long-term performance.
- All records are initially "owned" by a fictitious thread id 0 (each
thread is assigned a unique id by setting a key. BTW, does anyone know
what the POSIX standard says about how the id returned by
pthread_self() is assigned? I have tried to find this, but with no luck,
and the source code did not make any sense to me...). When a thread
requests memory, the owner of the record is set to this thread's id,
which means that subsequent deallocate calls can be done without locking
the list, and the record can later be reused without locking. The
"trick" here is that we can now see whether the thread that is
deallocating the memory is also the one that allocated it - if not, we
lock and return the record to thread id 0, which effectively reduces the
"unbounded heap growth" problem in the current pthread_alloc
implementation.
- In order to use the same set of methods, as well as the same memory,
the new and delete calls rely on the size being stored as part of the
actual data...
...see the implementation - it does not make much sense to write more
about it here ;-)
NOTE! Some of the comments in the source may be a bit confusing since
some parts have been rewritten a couple of times for testing ;-)
So where did this leave us? Well, we have done quite a few tests, both
on our real applications and on small test programs. It seems that the
goals were achieved, even though this rough implementation leaves a lot
to be tuned (see below) and fixed - but our main problem (fragmentation
over time and the inability to use the __default_alloc_template due to
concatenation) has more or less completely vanished, and a small test
program (which creates 10 threads, each of which inserts values into a
map<int,int> and then clears the map, over and over) shows that, with
malloc_allocator as the baseline (index 1), the default
(__default_alloc_template) allocator gains less than 10% on both single
and dual processor x86 machines, whilst this implementation improves
performance by well over 100% on single processor machines, and even
more on dual processor machines...
Things that need to be looked into are obviously cache-line behaviour
etc., but I thought I'd post this now and see if anyone finds it of
interest at all. I feel that some of the ideas (which aren't new, but
still...) could be added to the existing allocators to reduce
fragmentation (i.e. fixed size/power of two) in libstdc++ based
applications...
Looking forward to your comments!
Brgds
/Stefan
--
Life is like a 10 speed bicycle. Most of us have gears we never use.
#ifndef __nn4us_alloc_hpp__
#define __nn4us_alloc_hpp__
#include <stdio.h>
#include <string.h>
#include <stdlib.h> /* malloc()/free() */
#include <pthread.h>
namespace nn4us
{
class nn4us_alloc
{
public:
static void* allocate( size_t __n );
static void deallocate( void* __p, size_t __n );
};
inline bool operator==( const nn4us_alloc& __a1, const nn4us_alloc& __a2)
{
return true;
}
inline bool operator!=( const nn4us_alloc& __a1, const nn4us_alloc& __a2)
{
return false;
}
}
void* operator new( size_t __n );
void* operator new[]( size_t __n );
void operator delete( void* __p );
void operator delete[]( void* __p );
#endif
#include "nn4us_alloc.hpp"
namespace nn4us
{
/*
* Static parameters
*/
static const int BITS = 10;
static const unsigned int MAX_BYTES = 1024;
static const int CHUNK_SIZE = 32768;
static const int MAX_THREADS = 256;
/*
* Initialization
*/
static pthread_once_t init_once = PTHREAD_ONCE_INIT;
/*
* Thread # assignment, thread_id is increased before used
*/
static int thread_id = 0;
static pthread_mutex_t thread_id_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_key_t thread_id_key;
/*
* This is the "main" struct which is "part of" the actual data
*/
struct data_record
{
data_record* next;
int owner;
};
/*
* This is a fixed size allocator and each size (from 1 to MAX_BYTES in power of 2's)
* lives in it's own array offset
*/
struct freelist_record
{
/*
* Since we have a freelist per thread we need to keep track of MAX_THREADS pointers...
*/
data_record* first[ MAX_THREADS + 1 ];
data_record* last[ MAX_THREADS + 1 ];
/*
* Each thread keeps a counter of the number of records in the list
* This is used to increase performance and decide wether to return free'd memory to
* the thread specific freelist or the global one
*/
int count[ MAX_THREADS + 1 ];
/*
* When we have to change owner we need to lock the entire list (for this specific size)
*/
pthread_mutex_t* list_mutex;
};
static freelist_record volatile freelist[ BITS + 1 ];
/*
* Some (i.e. Linux) malloc() implementations create new freelists if the previous ones are locked
* In an attempt to minimize this we syncronize all our malloc() calls with this mutex
*/
static pthread_mutex_t malloc_mutex = PTHREAD_MUTEX_INITIALIZER;
static void thread_id_key_destructor( void* id_ptr )
{
/*
 * Free the per-thread id that assign_thread_id() malloc()'d
 */
free( id_ptr );
}
/*
* Initialize all data structures once
*/
void init()
{
pthread_key_create( &thread_id_key, thread_id_key_destructor );
size_t round_up = 1;
for( int bit_count = 0; bit_count <= BITS; bit_count++ )
{
for( int i = 0; i <= MAX_THREADS; i++ )
{
freelist[ bit_count ].first[ i ] = NULL;
freelist[ bit_count ].last[ i ] = NULL;
freelist[ bit_count ].count[ i ] = 0;
}
/*
* We always allocate CHUNK_SIZE bytes, figure out how many records we can create
*/
int tmp_record_count = CHUNK_SIZE / ( round_up + sizeof( data_record ) ) - 1;
/*
* All records initially are owned by thread 0
*/
freelist[ bit_count ].count[ 0 ] = tmp_record_count;
pthread_mutex_lock( &malloc_mutex );
void* p = malloc( CHUNK_SIZE );
pthread_mutex_unlock( &malloc_mutex );
freelist[ bit_count ].first[ 0 ] = (data_record*)p;
data_record* tmp_record = freelist[ bit_count ].first[ 0 ];
while( tmp_record_count > 0 )
{
tmp_record->next = (data_record*)( (char*)tmp_record + ( round_up + sizeof( data_record ) ) );
tmp_record = tmp_record->next;
tmp_record_count--;
}
tmp_record->next = NULL;
freelist[ bit_count ].last[ 0 ] = tmp_record;
freelist[ bit_count ].list_mutex = (pthread_mutex_t*)malloc( sizeof( pthread_mutex_t ) );
pthread_mutex_init( freelist[ bit_count ].list_mutex, NULL );
round_up = round_up << 1;
}
}
/*
* If pthread_getspecific returns NULL in allocate() we have not assigned a unique id to this thread - that's what this does...
*/
int assign_thread_id()
{
int self;
int *tmp;
pthread_mutex_lock( &thread_id_mutex );
self = ++thread_id;
pthread_mutex_unlock( &thread_id_mutex );
tmp = (int*)malloc( sizeof( int ) );
*tmp = self;
pthread_setspecific( thread_id_key, (void*)tmp );
return self;
}
void* nn4us_alloc::allocate( size_t __n )
{
/*
* Initialize all data structures once
*/
pthread_once( &init_once, init );
/*
* Round the requested size to a power of 2
*/
size_t round_up = 1;
int bit_count = 0;
while( __n > round_up )
{
round_up = round_up << 1;
bit_count++;
}
/*
* Requests larger than MAX_BYTES is sent to malloc directly
*/
if( round_up > MAX_BYTES )
{
pthread_mutex_lock( &malloc_mutex );
void* p = malloc( round_up );
pthread_mutex_unlock( &malloc_mutex );
return p;
}
/*
* Who are we?
*/
int self;
int *tmp;
if( ( tmp = (int*)pthread_getspecific( thread_id_key ) ) == NULL )
{
self = assign_thread_id();
}
else
{
self = *tmp;
}
data_record* tmp_result = NULL;
/*
* Check if count for this thread is > 0, if so grab the first from this list, decrease count for this thread and return
*/
if( freelist[ bit_count ].count[ self ] > 0 )
{
tmp_result = freelist[ bit_count ].first[ self ];
freelist[ bit_count ].count[ self ]--;
/*
 * Only advance the first pointer while records remain; when count
 * reaches 0, deallocate() resets first before it is read again
 */
if( freelist[ bit_count ].count[ self ] > 0 )
{
freelist[ bit_count ].first[ self ] = tmp_result->next;
}
return (void*)( (char*)tmp_result + sizeof( data_record ) );
}
pthread_mutex_lock( freelist[ bit_count ].list_mutex );
/*
* Check if count for thread 0 > 0, if so lock, change owner, decrease count for global and return
*/
if( freelist[ bit_count ].count[ 0 ] > 0 )
{
tmp_result = freelist[ bit_count ].first[ 0 ];
freelist[ bit_count ].count[ 0 ]--;
if( freelist[ bit_count ].count[ 0 ] > 0 )
{
freelist[ bit_count ].first[ 0 ] = tmp_result->next;
}
tmp_result->owner = self;
pthread_mutex_unlock( freelist[ bit_count ].list_mutex );
return (void*)( (char*)tmp_result + sizeof( data_record ) );
}
/*
* If we get here we need to lock, allocate memory, increase count for global - 1, change owner on first record and return
*/
/*
* We always allocate CHUNK_SIZE bytes, figure out how many records we can create
*/
int tmp_record_count = CHUNK_SIZE / ( round_up + sizeof( data_record ) ) - 1;
/*
* All records initially are owned by thread 0
*/
freelist[ bit_count ].count[ 0 ] = tmp_record_count;
pthread_mutex_lock( &malloc_mutex );
void* p = malloc( CHUNK_SIZE );
pthread_mutex_unlock( &malloc_mutex );
freelist[ bit_count ].first[ 0 ] = (data_record*)p;
data_record* tmp_record = freelist[ bit_count ].first[ 0 ];
while( tmp_record_count > 0 )
{
tmp_record->next = (data_record*)( (char*)tmp_record + ( round_up + sizeof( data_record ) ) );
tmp_record = tmp_record->next;
tmp_record_count--;
}
tmp_record->next = NULL;
freelist[ bit_count ].last[ 0 ] = tmp_record;
tmp_result = freelist[ bit_count ].first[ 0 ];
freelist[ bit_count ].count[ 0 ]--;
if( freelist[ bit_count ].count[ 0 ] > 0 )
{
freelist[ bit_count ].first[ 0 ] = tmp_result->next;
}
tmp_result->owner = self;
pthread_mutex_unlock( freelist[ bit_count ].list_mutex );
return (void*)( (char*)tmp_result + sizeof( data_record ) );
}
void nn4us_alloc::deallocate( void* __p, size_t __n )
{
/*
* Round the requested size to a power of 2
*/
size_t round_up = 1;
int bit_count = 0;
while( __n > round_up )
{
round_up = round_up << 1;
bit_count++;
}
/*
* Requests larger than MAX_BYTES is sent to malloc directly
*/
if( round_up > MAX_BYTES )
{
free( __p );
return;
}
/*
* Who are we?
*/
int self;
int *tmp;
if( ( tmp = (int*)pthread_getspecific( thread_id_key ) ) == NULL )
{
self = assign_thread_id();
}
else
{
self = *tmp;
}
data_record* tmp_record = NULL;
tmp_record = (data_record*)( (char*)__p - sizeof( data_record ) );
/*
* If self == owner, add this to the back of the freelist for this thread id and increase count for this thread
*/
if( self == tmp_record->owner )
{
if( freelist[ bit_count ].count[ self ] == 0 )
{
freelist[ bit_count ].first[ self ] = tmp_record;
}
/*
* If it's the first dealloc for this thread, the last pointer is set to NULL
* else we have to set the last pointer's next pointer and last pointer to the new "not-in-use" data_record
*/
if( freelist[ bit_count ].count[ self ] > 0 )
{
freelist[ bit_count ].last[ self ]->next = tmp_record;
}
freelist[ bit_count ].last[ self ] = tmp_record;
freelist[ bit_count ].count[ self ]++;
}
else
{
/*
* If self != owner, lock and add this to the back of the global freelist and increase count for global
*/
pthread_mutex_lock( freelist[ bit_count ].list_mutex );
if( freelist[ bit_count ].count[ 0 ] == 0 )
{
freelist[ bit_count ].first[ 0 ] = tmp_record;
}
if( freelist[ bit_count ].count[ 0 ] > 0 )
{
freelist[ bit_count ].last[ 0 ]->next = tmp_record;
}
freelist[ bit_count ].last[ 0 ] = tmp_record;
freelist[ bit_count ].count[ 0 ]++;
pthread_mutex_unlock( freelist[ bit_count ].list_mutex );
}
}
}
static nn4us::nn4us_alloc nn4us_alloc_inst;
void* operator new( size_t __n )
{
int tmp = __n + sizeof( int );
/*
 * We allocate so that we have space enough to store the size at the beginning
 */
void* p = nn4us_alloc_inst.allocate( tmp );
memcpy( p, &tmp, sizeof( int ) );
return (void*)( (char*)p + sizeof( int ) );
}
void* operator new[]( size_t __n )
{
int tmp = __n + sizeof( int );
/*
 * We allocate so that we have space enough to store the size at the beginning
 */
void* p = nn4us_alloc_inst.allocate( tmp );
memcpy( p, &tmp, sizeof( int ) );
return (void*)( (char*)p + sizeof( int ) );
}
void operator delete( void* __p )
{
/*
 * Deleting a null pointer must be a no-op
 */
if( __p == NULL )
return;
int tmp;
void* p = (void*)( (char*)__p - sizeof( int ) );
memcpy( &tmp, p, sizeof( int ) );
nn4us_alloc_inst.deallocate( p, tmp );
}
void operator delete[]( void* __p )
{
/*
 * Deleting a null pointer must be a no-op
 */
if( __p == NULL )
return;
int tmp;
void* p = (void*)( (char*)__p - sizeof( int ) );
memcpy( &tmp, p, sizeof( int ) );
nn4us_alloc_inst.deallocate( p, tmp );
}