
Thoughts on memory allocation...


Hi all!

Remember the questions raised a few weeks ago about the libstdc++ allocators and the ideas for improving them?

Back then I was thinking of just writing down all the ideas, so I set out to read up on the subject and try to understand what others have done and what might or might not be useful, and ended up with a nice little heap of paper on my desk (yes, I still like to read some things on paper...). Just writing it all down in yet another document seemed redundant given what others have done so many times, so instead I made a few design decisions:

- Try to make as few system (i.e. malloc()) calls as possible, to get similar performance across various platforms.

- Since most of our applications are long-running ones (application servers and such), keep heap growth (due to fragmentation etc.) to a minimum.

- Don't worry about memory consumption if that is what it takes to accomplish the previous goal (I don't like to waste memory, but on the targets we are working on, memory is not that much of an issue).

- Improve thread concurrency.

- "Solve" the "one thread allocates and anotherone freeing" memory issue that I discussed earlier (i.e. what happens when keeping a cache of data in a MT application)

After figuring out how to implement this (in terms of integrating with libstdc++), I decided to make a header and an implementation file that is linked into the application, to solve the "one instance" issue of having actual code in .h files (I'm sure there is a much better way of doing much of the implementation here, but once again: this is posted for discussion...). The containers are then told to use it by wrapping it through the __allocator<> adapter available in stl_alloc.h (i.e. vector< int, __allocator< void, nn4us::nn4us_alloc > > foo;).
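
For illustration, a minimal usage sketch (this mirrors the example above; whether the adapter's first parameter should be the element type or void depends on the stl_alloc.h version at hand, so treat the exact spelling as an assumption rather than tested code):

#include <vector>
#include "nn4us_alloc.hpp"

/*
 * Wrap the static nn4us_alloc through the __allocator<> adapter so that
 * the container routes all of its memory requests through it
 */
typedef __allocator< void, nn4us::nn4us_alloc >	alloc_type;

vector< int, alloc_type >	foo;	// every node/buffer now comes from nn4us_alloc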

OK, the source is obviously the best documentation (?), but here are some comments...
nn4us_alloc.hpp:

- There seems to be something "missing" in the __allocator<> code, since I had to add operator == and != in order to get basic_string<> to compile. This does not happen if I compile with malloc_allocator, which is also part of stl_alloc.h - why, I don't know...

- There are definitions for new and delete as well...
...as I said above, this code is posted for discussion, and the best way I could think of to get an entire application to behave similarly on several platforms, as well as to minimize heap growth, was to have new and delete use the same allocator as the containers.

nn4us_alloc.cpp
- The allocator gets chunks of CHUNK_SIZE bytes of memory from the system.

- Each size (up to the predetermined MAX_BYTES) is dealt with by the allocator; the rest is sent to malloc()/free() directly.

- All requests are rounded up to powers of two - even the ones larger than MAX_BYTES - in order to minimize fragmentation in the malloc() freelists (a small worked example follows after this list).

- Each size (power of two) has its own list of records and its own mutex. This is done to improve concurrency (i.e. two requests of different sizes can be dealt with without locking, provided the threads already own a record). What might seem counterproductive to this is the malloc_mutex, which actually serializes all malloc() calls...
...once again, this is done for testing, and was added after reading a paper on the Linux malloc() implementation which states: "To reduce lock contention, ptmalloc searches for the first unlocked subheap and grabs memory from it to fulfill a malloc() request. If ptmalloc doesn't find an unlocked heap, it creates a new one." (A very interesting read: http://www.citi.umich.edu/techreports/reports/citi-tr-00-5.pdf). My interpretation, and the goal of this experiment, was to see whether "helping" malloc() avoid creating new heaps would improve overall/long-term performance.

- All records are initially "owned" by a fictitious thread id 0. (Each thread is assigned a unique id by setting a key. BTW, does anyone know what the POSIX standard says about how the id returned by pthread_self() is assigned? I have tried to find this out, but with no luck, and the source code did not make any sense to me...) When a thread requests memory, the owner of the record is set to that thread's id, which means that subsequent deallocate calls can be done without locking the list, and the record can later be reused without locking. The "trick" here is that we can now see whether the thread deallocating the memory is also the one that allocated it - if not, we lock and return the record to thread id 0, which effectively reduces the "unbounded heap growth" problem of the current pthread_alloc implementation.

- In order to use the same set of methods, as well as the same memory, the new and delete calls rely on the size being stored as part of the actual data...
...see the implementation - it does not make much sense to write more about it here ;-)
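
To make the rounding concrete, here is a small standalone sketch (it duplicates the loop from nn4us_alloc::allocate() below; the 600-byte request is just a worked example):

#include <stdio.h>

int main()
{
	/*
	 * Round a request up to the next power of two, exactly as
	 * nn4us_alloc::allocate() does: a 600 byte request ends up
	 * in the 1024 byte freelist (index 10)
	 */
	size_t n = 600;
	size_t round_up = 1;
	int bit_count = 0;

	while( n > round_up )
	{
		round_up = round_up << 1;
		bit_count++;
	}

	printf( "%u -> %u (freelist index %d)\n",
		(unsigned)n, (unsigned)round_up, bit_count );

	return 0;
}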

NOTE! Some of the comments in the source may be a bit confusing, since some parts have been rewritten a couple of times during testing ;-)

So where did this leave us? Well, we have run quite a few tests, both on our real applications and on small test programs. It seems the goals were achieved, even though this rough implementation leaves a lot to be tuned (see below) and fixed - but our main problem (fragmentation over time and the inability to use __default_alloc_template due to concatenation) has more or less completely vanished. A small test program (which creates 10 threads, each of which inserts values into a map<int,int> and then clears the map, over and over) shows that, with malloc_allocator as the baseline (index 1), the default allocator (__default_alloc_template) gains less than 10% on both single and dual processor x86 machines, whilst this implementation improves performance by well over 100% on single processor machines, and by even more on dual processor machines...
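
For reference, the test program looked roughly like the sketch below (the thread count matches the description; the iteration counts and the per-thread map are my assumptions, and the timing code is omitted):

#include <map>
#include <pthread.h>
#include "nn4us_alloc.hpp"

/*
 * Each of the 10 threads repeatedly fills a map<int,int> and clears it.
 * With operator new/delete replaced as in nn4us_alloc.cpp, the map nodes
 * are served by the allocator above
 */
extern "C" void* worker( void* )
{
	std::map< int, int > m;

	for( int pass = 0; pass < 1000; pass++ )
	{
		for( int i = 0; i < 1000; i++ )
			m[ i ] = i;

		m.clear();
	}

	return NULL;
}

int main()
{
	pthread_t t[ 10 ];

	for( int i = 0; i < 10; i++ )
		pthread_create( &t[ i ], NULL, worker, NULL );

	for( int i = 0; i < 10; i++ )
		pthread_join( t[ i ], NULL );

	return 0;
}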

Things that need to be looked into include, obviously, cache-line behaviour etc., but I thought I'd post this now and see if anyone finds it of interest at all. I feel that some of the ideas (which aren't new, but still...) could be added to the existing allocators to reduce fragmentation (i.e. fixed size/power of two) in libstdc++ based applications...

Looking forward to your comments!

Brgds

/Stefan

--
Life is like a 10 speed bicycle. Most of us have gears we never use.


#ifndef __nn4us_alloc_hpp__
#define __nn4us_alloc_hpp__


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <pthread.h>


namespace nn4us
{
	class nn4us_alloc
	{
		public:
			static void* allocate( size_t __n );
			static void deallocate( void* __p, size_t __n );

	};

	inline bool operator==( const nn4us_alloc& __a1, const nn4us_alloc& __a2)
	{
		return true;
	}

	inline bool operator!=( const nn4us_alloc& __a1, const nn4us_alloc& __a2)
	{
		return false;
	}
}

void* operator new( size_t __n );
void* operator new[]( size_t __n );
void operator delete( void* __p );
void operator delete[]( void* __p );

#endif

#include "nn4us_alloc.hpp"


namespace nn4us
{
	/*
	 * Static parameters
	 */
	static const int			BITS        = 10;
	static const unsigned int		MAX_BYTES   = 1024;
	static const int			CHUNK_SIZE  = 32768;
	static const int			MAX_THREADS = 256;

	/*
	 * Initialization
	 */
	static pthread_once_t			init_once = PTHREAD_ONCE_INIT;

	/*
	 * Thread # assignment, thread_id is increased before used
	 */
	static int						thread_id = 0;
	static pthread_mutex_t			thread_id_mutex = PTHREAD_MUTEX_INITIALIZER;
	static pthread_key_t			thread_id_key;

	/*
	 * This is the "main" struct which is "part of" the actual data
	 */
	struct data_record
	{
		data_record*			next;
		int				owner;
	};

	/*
	 * This is a fixed-size allocator; each size (from 1 to MAX_BYTES, in powers of 2)
	 * lives in its own array slot
	 */
	struct freelist_record
	{
		/*
		 * Since we have a freelist per thread, plus the global one at index 0, we need to keep track of MAX_THREADS + 1 pointers...
		 */
		data_record*			first[ MAX_THREADS + 1 ];
		data_record*			last[ MAX_THREADS + 1 ];

		/*
		 * Each thread keeps a counter of the number of records in its list.
		 * This is used to improve performance and to decide whether to return
		 * freed memory to the thread-specific freelist or the global one
		 */
		int				count[ MAX_THREADS + 1 ];

		/*
		 * When we have to change owner we need to lock the entire list (for this specific size)
		 */
		pthread_mutex_t*		list_mutex;
	};

	static freelist_record volatile		freelist[ BITS + 1 ];

	/*
	 * Some (e.g. Linux) malloc() implementations create new freelists if the existing ones are locked.
	 * In an attempt to minimize this, we synchronize all our malloc() calls with this mutex
	 */
	static pthread_mutex_t			malloc_mutex = PTHREAD_MUTEX_INITIALIZER;

	static void thread_id_key_destructor( void* thread_id )
	{
		/*
		 * The per-thread id was malloc()'d in assign_thread_id();
		 * release it when the thread exits
		 */
		free( thread_id );
	}

	/*
	 * Initialize all data structures once
	 */
	void init()
	{
		pthread_key_create( &thread_id_key, thread_id_key_destructor );

		size_t round_up = 1;

		for( int bit_count = 0; bit_count <= BITS; bit_count++ )
		{
			for( int i = 0; i <= MAX_THREADS; i++ )
			{
				freelist[ bit_count ].first[ i ] = NULL;
				freelist[ bit_count ].last[ i ] = NULL;
				freelist[ bit_count ].count[ i ] = 0;
			}

			/*
			 * We always allocate CHUNK_SIZE bytes, figure out how many records we can create
			 */
			int tmp_record_count = CHUNK_SIZE / ( round_up + sizeof( data_record ) ) - 1;

			/*
			 * All records initially are owned by thread 0
			 */
			freelist[ bit_count ].count[ 0 ] = tmp_record_count;

			pthread_mutex_lock( &malloc_mutex );
			void* p = malloc( CHUNK_SIZE );
			pthread_mutex_unlock( &malloc_mutex );

			freelist[ bit_count ].first[ 0 ] = (data_record*)p;

			data_record* tmp_record = freelist[ bit_count ].first[ 0 ];

			while( tmp_record_count > 0 )
			{
				tmp_record->next = (data_record*)( (char*)tmp_record + ( round_up + sizeof( data_record ) ) );

				tmp_record = tmp_record->next;

				tmp_record_count--;
			}

			tmp_record->next = NULL;

			freelist[ bit_count ].last[ 0 ] = tmp_record;

			freelist[ bit_count ].list_mutex = (pthread_mutex_t*)malloc( sizeof( pthread_mutex_t ) );
			pthread_mutex_init( freelist[ bit_count ].list_mutex, NULL );

			round_up = round_up << 1;
		}
	}

	/*
	 * If pthread_getspecific() returns NULL in allocate()/deallocate(), this thread has not yet been assigned a unique id - that's what this does...
	 */
	int assign_thread_id()
	{
		int self;
		int *tmp;

		pthread_mutex_lock( &thread_id_mutex );
		self = ++thread_id;
		pthread_mutex_unlock( &thread_id_mutex );

		tmp = (int*)malloc( sizeof( int ) );
		*tmp = self;
		pthread_setspecific( thread_id_key, (void*)tmp );

		return self;
	}

	void* nn4us_alloc::allocate( size_t __n )
	{
		/*
		 * Initialize all data structures once
		 */
		pthread_once( &init_once, init );

		/*
		 * Round the requested size to a power of 2
		 */
		size_t round_up = 1;
		int bit_count = 0;

		while( __n > round_up )
		{
			round_up = round_up << 1;
			bit_count++;
		}

		/*
		 * Requests larger than MAX_BYTES are sent to malloc() directly
		 */
		if( round_up > MAX_BYTES )
		{
			pthread_mutex_lock( &malloc_mutex );
			void* p = malloc( round_up );
			pthread_mutex_unlock( &malloc_mutex );

			return p;
		}

		/*
		 * Who are we?
		 */
		int self;
		int *tmp;
		if( ( tmp = (int*)pthread_getspecific( thread_id_key ) ) == NULL )
		{
			self = assign_thread_id();
		}
		else
		{
			self = *tmp;
		}

		data_record* tmp_result = NULL;

		/*
		 * Check if count for this thread is > 0, if so grab the first from this list, decrease count for this thread and return 
		 */
		if( freelist[ bit_count ].count[ self ] > 0 )
		{
			tmp_result = freelist[ bit_count ].first[ self ];

			freelist[ bit_count ].count[ self ]--;

			if( freelist[ bit_count ].count[ self ] > 0 )
			{
				freelist[ bit_count ].first[ self ] = tmp_result->next;
			}

			return (void*)( (char*)tmp_result + sizeof( data_record ) );
		}

		pthread_mutex_lock( freelist[ bit_count ].list_mutex );

		/*
		 * Check if the count for thread 0 is > 0 (the list lock is already held); if so, change the owner, decrease the global count and return
		 */
		if( freelist[ bit_count ].count[ 0 ] > 0 )
		{
			tmp_result = freelist[ bit_count ].first[ 0 ];

			freelist[ bit_count ].count[ 0 ]--;

			if( freelist[ bit_count ].count[ 0 ] > 0 )
			{
				freelist[ bit_count ].first[ 0 ] = tmp_result->next;
			}

			tmp_result->owner = self;

			pthread_mutex_unlock( freelist[ bit_count ].list_mutex );

			return (void*)( (char*)tmp_result + sizeof( data_record ) );
		}

		/*
	 * If we get here the list lock is already held; allocate a new chunk, set the global count, change the owner of the first record and return it
		 */

		/*
		 * We always allocate CHUNK_SIZE bytes, figure out how many records we can create
		 */
		int tmp_record_count = CHUNK_SIZE / ( round_up + sizeof( data_record ) ) - 1;

		/*
		 * All records initially are owned by thread 0
		 */
		freelist[ bit_count ].count[ 0 ] = tmp_record_count;

		pthread_mutex_lock( &malloc_mutex );
		void* p = malloc( CHUNK_SIZE );
		pthread_mutex_unlock( &malloc_mutex );

		freelist[ bit_count ].first[ 0 ] = (data_record*)p;

		data_record* tmp_record = freelist[ bit_count ].first[ 0 ];

		while( tmp_record_count > 0 )
		{
			tmp_record->next = (data_record*)( (char*)tmp_record + ( round_up + sizeof( data_record ) ) );

			tmp_record = tmp_record->next;

			tmp_record_count--;
		}

		tmp_record->next = NULL;

		freelist[ bit_count ].last[ 0 ] = tmp_record;

		tmp_result = freelist[ bit_count ].first[ 0 ];

		freelist[ bit_count ].count[ 0 ]--;

		if( freelist[ bit_count ].count[ 0 ] > 0 )
		{
			freelist[ bit_count ].first[ 0 ] = tmp_result->next;
		}

		tmp_result->owner = self;

		pthread_mutex_unlock( freelist[ bit_count ].list_mutex );

		return (void*)( (char*)tmp_result + sizeof( data_record ) );
	}

	void nn4us_alloc::deallocate( void* __p, size_t __n )
	{
		/*
		 * Round the requested size to a power of 2
		 */
		size_t round_up = 1;
		int bit_count = 0;

		while( __n > round_up )
		{
			round_up = round_up << 1;
			bit_count++;
		}

		/*
		 * Requests larger than MAX_BYTES were allocated by malloc() directly, so hand them back to free()
		 */
		if( round_up > MAX_BYTES )
		{
			free( __p );
			return;
		}

		/*
		 * Who are we?
		 */
		int self;
		int *tmp;
		if( ( tmp = (int*)pthread_getspecific( thread_id_key ) ) == NULL )
		{
			self = assign_thread_id();
		}
		else
		{
			self = *tmp;
		}

		data_record* tmp_record = NULL;

		tmp_record = (data_record*)( (char*)__p - sizeof( data_record ) );


		/*
		 * If self == owner, add this to the back of the freelist for this thread id and increase count for this thread
		 */
		if( self == tmp_record->owner )
		{
			if( freelist[ bit_count ].count[ self ] == 0 )
			{
				freelist[ bit_count ].first[ self ] = tmp_record;
			}

			/*
			 * If this is the first record on this thread's list, first was set above;
			 * otherwise link the new "not-in-use" data_record after the current last one
			 */ 
			if( freelist[ bit_count ].count[ self ] > 0 )
			{
				freelist[ bit_count ].last[ self ]->next = tmp_record;
			}
			
			freelist[ bit_count ].last[ self ] = tmp_record;
			freelist[ bit_count ].count[ self ]++;
		}
		else
		{
			/*
			 * If self != owner, lock and add this to the back of the global freelist and increase count for global
			 */
			pthread_mutex_lock( freelist[ bit_count ].list_mutex );

			if( freelist[ bit_count ].count[ 0 ] == 0 )
			{
				freelist[ bit_count ].first[ 0 ] = tmp_record;
			}

			if( freelist[ bit_count ].count[ 0 ] > 0 )
			{
				freelist[ bit_count ].last[ 0 ]->next = tmp_record;
			}
		
			freelist[ bit_count ].last[ 0 ] = tmp_record;
			freelist[ bit_count ].count[ 0 ]++;

			pthread_mutex_unlock( freelist[ bit_count ].list_mutex );
		}
	}
}

static nn4us::nn4us_alloc nn4us_alloc_inst;

void* operator new( size_t __n )
{
	int tmp = __n + sizeof( int );

	/*
	 * Allocate enough extra space to store the size at the beginning
	 */
	void* p = nn4us_alloc_inst.allocate( tmp );

	memcpy( p, &tmp, sizeof( int ) );

	return (void*)( (char*)p + sizeof( int ) );
}

void* operator new[]( size_t __n )
{
	int tmp = __n + sizeof( int );

	/*
	 * Allocate enough extra space to store the size at the beginning
	 */
	void* p = nn4us_alloc_inst.allocate( tmp );

	memcpy( p, &tmp, sizeof( int ) );

	return (void*)( (char*)p + sizeof( int ) );
}

void operator delete( void* __p )
{
	int tmp;

	/*
	 * delete of a NULL pointer must be a no-op
	 */
	if( __p == NULL )
		return;

	void* p = (void*)( (char*)__p - sizeof( int ) );

	memcpy( &tmp, p, sizeof( int ) );

	nn4us_alloc_inst.deallocate( p, tmp );
}

void operator delete[]( void* __p )
{
	int tmp;

	/*
	 * delete[] of a NULL pointer must be a no-op
	 */
	if( __p == NULL )
		return;

	void* p = (void*)( (char*)__p - sizeof( int ) );

	memcpy( &tmp, p, sizeof( int ) );

	nn4us_alloc_inst.deallocate( p, tmp );
}
