This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.



Our proposal for a MATDS


Hi all!

As mentioned earlier on this list (http://gcc.gnu.org/ml/libstdc++/2003-02/msg00098.html), we (that's me and Ola Rönnerup) have been working on an allocator suitable for MT applications for quite some time now. The initial version was written mid last year and has been running in production environments since then with good results. We have now taken that code, rewritten it as a patch against stl_alloc.h, solved the issues that we've come across, and added some new ideas - mostly related to how thread ids are assigned (to support both thread pool and producer/consumer models), how memory allocated by one thread and then deallocated by another is handled, and how heap growth over time is minimized by dynamically limiting the freelists.

While looking at the patch, bear in mind that this is our first attempt at a patch against libstdc++, so there are most likely formatting, naming and other issues. The patch does, however, work with our CVS version of GCC (g++ (GCC) 3.4 20030202 (experimental)), both with and without thread support (posix) enabled.

First we would like to give you some results from various performance tests using the simple test applications attached (the same ones as in the post mentioned above, but with __mt_alloc added).

NOTES:
- All "single" tests are compiled with -static -O2
- All "mt" tests are compiled with -static -pthread -O2
- All tests are done twice on otherwise unloaded systems to filter out any real flukes - another rather interesting "mt" test is described further down.
- All tests are done on Linux systems with kernel 2.4.18 and libc 2.2.5


alloc_perf_comp_single.cc, no threads, single 1GHz CPU machine:
Target run-time: 10s
Calibrating test_ints... DONE! Will do 5500000 iterations.
__pool_alloc... DONE! Time 9.910733
__malloc_alloc... DONE! Time 15.141536
__mt_alloc... DONE! Time 12.615181
Relative performance to __pool_alloc:
__malloc_alloc 65.453947 %
__mt_alloc 78.561956 %

alloc_perf_comp_single.cc, no threads, dual 1GHz CPU machine:
Target run-time: 10s
Calibrating test_ints... DONE! Will do 5600000 iterations.
__pool_alloc... DONE! Time 10.092421
__malloc_alloc... DONE! Time 15.412243
__mt_alloc... DONE! Time 12.789449
Relative performance to __pool_alloc:
__malloc_alloc 65.483142 %
__mt_alloc 78.912086 %

alloc_perf_comp_single.cc, posix threads, single 1GHz CPU machine:
Target run-time: 10s
Calibrating test_ints... DONE! Will do 5400000 iterations.
__pool_alloc... DONE! Time 9.888981
__malloc_alloc... DONE! Time 14.917552
__mt_alloc... DONE! Time 14.302427
Relative performance to __pool_alloc:
__malloc_alloc 66.290910 %
__mt_alloc 69.141978 %

alloc_perf_comp_single.cc, posix threads, dual 1GHz CPU machine:
Target run-time: 10s
Calibrating test_ints... DONE! Will do 5500000 iterations.
__pool_alloc... DONE! Time 10.065420
__malloc_alloc... DONE! Time 15.163035
__mt_alloc... DONE! Time 14.692080
Relative performance to __pool_alloc:
__malloc_alloc 66.381302 %
__mt_alloc 68.509156 %

alloc_perf_comt_mt.cc, posix threads, single 1GHz CPU machine:
Target run-time: 10s
Calibrating test_ints... DONE! Will do 2700000 iterations.
__pool_alloc... DONE! Time 116.988530
__malloc_alloc... DONE! Time 97.573828
__mt_alloc... DONE! Time 77.698650
Relative performance to __pool_alloc:
__malloc_alloc 119.897448 %
__mt_alloc 150.567005 %

alloc_perf_comt_mt.cc, posix threads, dual 1GHz CPU machine:
Target run-time: 10s
Calibrating test_ints... DONE! Will do 2700000 iterations.
__pool_alloc... DONE! Time 86.109538
__malloc_alloc... DONE! Time 98.029429
__mt_alloc... DONE! Time 77.588871
Relative performance to __pool_alloc:
__malloc_alloc 87.840497 %
__mt_alloc 110.981816 %

The conclusion so far is that in non-threaded applications this allocator is slower than the default, but then it's not aimed at that use case either, and it still beats __malloc_alloc. (There is probably quite a lot that can be optimized in single-threaded mode - that has not been our focus; we just made it compile and work.)

As expected, it's faster than __pool_alloc in MT applications, and the more threads there are the better it performs =) However, we could not figure out how to reproduce the numbers (from the same test cases/machines) that we initially posted and that we have seen in real-world applications using this allocator. It turns out that an unloaded system and/or this specific way of testing does not tell the whole story (but then again, what does ;-).

While running the tests one can observe the excessive context switching caused by the global lock in __pool_alloc (alloc_perf_comt_mt.cc, posix threads, dual 1GHz CPU machine, vmstat 5):
  procs                      memory    swap          io     system         cpu
r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
8  0  0      0  10280   3128 425984   0   0     0     0  101 18100  92   8   0
8  0  0      0  10280   3128 425984   0   0     0     0  101 17606  92   8   0
8  0  0      0  10280   3128 425984   0   0     0     0  101 23502  90  10   0

The applications that we have been working on have primarily been network services. Simulating "real world" network traffic is not easy, but to demonstrate the effect we have run various applications at the same time as the mt tests, and found that the more loaded the system is, the better both __malloc_alloc and __mt_alloc perform. Here's a sample from my workstation (alloc_perf_comt_mt.cc, posix threads, single 1GHz CPU machine) with xmms playing a 128kbit stream in the background:
Target run-time: 10s
Calibrating test_ints... DONE! Will do 2500000 iterations.
__pool_alloc... DONE! Time 143.066890
__malloc_alloc... DONE! Time 93.877978
__mt_alloc... DONE! Time 74.594824
Relative performance to __pool_alloc:
__malloc_alloc 152.396646 %
__mt_alloc 191.791980 %

In "real world applications" the use of per-bin mutexes also reduces contention quite radically, and the fixed block sizes - even though they do consume more memory - virtually eliminate fragmentation.

There are a few things in the patch that need to be discussed and solved if it's even going to be a candidate for inclusion in libstdc++ in one form or another:
- There is no __gthread_mutex_init () call in gthr.h. Right now we do a pthread_mutex_init () call to get it working...
- There's a minor issue/ugly workaround related to a suspected parser bug (see PR c++/9779 for more details).
- We don't check malloc () calls for errors - it's not clear to us what the libstdc++ "rules" are for internal error handling and how errors should be reported back to the application.

As I wrote earlier, most of our ideas are in this implementation, but future development would include a similar approach to customizing the allocator as described by Phil (see http://gcc.gnu.org/ml/libstdc++/2003-02/msg00319.html), using template arguments etc.

Ciao!

/Stefan & Ola

--
Experience is what you get when you were expecting something else.

/*
 * 2003-02-05 Stefan Olsson <stefan at snon dot net>
 *
 * The goal with this application is to compare the performance
 * between different STL allocators relative to the default
 * __pool_alloc.
 *
 * The container used for the tests is vector which, as stated by
 * SGI, is "the simplest of the STL container classes, and in
 * many cases the most efficient".
 *
 * NOTE! The vector<> container does some "caching" of its own and
 * therefore we redeclare the variable in each iteration (forcing the
 * constructor/destructor to be called and thus memory to be freed).
 *
 * NOTE! The use of gettimeofday is unwanted since it's not POSIX,
 * however I have not found a more portable system call to use -
 * ideas are greatly appreciated!
 *
 * NOTE! This version only does operations on vector<int>. More/other
 * data types should perhaps also be tested - ideas are always welcome!
 *
 * I assume that the glibc/underlying malloc() implementation has been
 * compiled with -O2, so this application should also be compiled
 * with -O2 in order to get relevant results.
 */

#include <cstdio>
#include <vector>
#include <sys/time.h>

using namespace std;

/*
 * To prevent the time it takes for the application to start up and
 * shut down from affecting the end result, we define a target
 * duration (in seconds) for which all tests should run.
 * Each test is responsible for "calibrating" its # of iterations
 * to come as close as possible to this target, based on the time
 * it takes to complete the test using the default __pool_alloc.
 */
int target_run_time = 10;

/*
 * The number of iterations to be performed in order to figure out
 * the "speed" of this computer and adjust the number of iterations
 * needed to come close to target_run_time.
 */
int calibrate_iterations = 100000;

/*
 * The number of values to insert in the vector; 32 will cause
 * 5 (re)allocations to be performed (sizes 4, 8, 16, 32 and 64).
 * This means that all allocations are within _MAX_BYTES = 128
 * as defined in stl_alloc.h for __pool_alloc.
 * Whether or not this value is relevant in "the real world"
 * I don't know - it should probably be investigated in
 * more detail.
 */
int insert_values = 32;

static struct timeval _tstart, _tend;
static struct timezone tz;

void
tstart (void)
{
  gettimeofday (&_tstart, &tz);
}
void
tend (void)
{
  gettimeofday (&_tend, &tz);
}

double
tval()
{
  double t1, t2;

  t1 = (double)_tstart.tv_sec + (double)_tstart.tv_usec/(1000*1000);
  t2 = (double)_tend.tv_sec + (double)_tend.tv_usec/(1000*1000);
  return t2 - t1;
}

int
calibrate_test_ints (void)
{
  tstart ();
  for (int i = 0; i < calibrate_iterations; i++)
  {
    vector<int> v1;

    for (int j = 0; j < insert_values; j++)
    {
      v1.push_back (1);
    }
  }
  tend ();

  return (int)((double)target_run_time / tval ()) * calibrate_iterations;
}

double
test_ints_pool_alloc (int iterations)
{
  tstart ();
  for (int i = 0; i < iterations; i++)
  {
    vector<int> v1;

    for (int j = 0; j < insert_values; j++)
    {
      v1.push_back (1);
    }
  }
  tend ();

  return tval ();
}

double
test_ints_malloc_alloc (int iterations)
{
  tstart ();
  for (int i = 0; i < iterations; i++)
  {
    vector<int, __malloc_alloc<0> > v1;

    for (int j = 0; j < insert_values; j++)
    {
      v1.push_back (1);
    }
  }
  tend ();

  return tval ();
}

double
test_ints_mt_alloc (int iterations)
{
  tstart ();
  for (int i = 0; i < iterations; i++)
  {
    vector<int, __mt_alloc<0> > v1;

    for (int j = 0; j < insert_values; j++)
    {
      v1.push_back (1);
    }
  }
  tend ();

  return tval ();
}

int
main (void)
{
  printf ("Target run-time: %ds\n", target_run_time);

  printf ("Calibrating test_ints... ");
  int iterations = calibrate_test_ints ();
  printf ("DONE! Will do %d iterations.\n", iterations);

  double results[3];

  printf ("__pool_alloc... ");
  results[0] = test_ints_pool_alloc (iterations);
  printf ("DONE! Time %f\n", results[0]);

  printf ("__malloc_alloc... ");
  results[1] = test_ints_malloc_alloc (iterations);
  printf ("DONE! Time %f\n", results[1]);

  printf ("__mt_alloc... ");
  results[2] = test_ints_mt_alloc (iterations);
  printf ("DONE! Time %f\n", results[2]);

  printf ("Relative performance to __pool_alloc:\n");
  printf ("__malloc_alloc %f %%\n", (results[0] / results[1])*100);
  printf ("__mt_alloc %f %%\n", (results[0] / results[2])*100);

  return 0;
}
/*
 * 2003-02-05 Stefan Olsson <stefan at snon dot net>
 *
 * The goal with this application is to compare the performance
 * between different STL allocators relative to the default
 * __pool_alloc.
 *
 * The container used for the tests is vector which, as stated by
 * SGI, is "the simplest of the STL container classes, and in
 * many cases the most efficient".
 *
 * NOTE! The vector<> container does some "caching" of its own and
 * therefore we redeclare the variable in each iteration (forcing the
 * constructor/destructor to be called and thus memory to be freed).
 *
 * NOTE! The use of gettimeofday is unwanted since it's not POSIX,
 * however I have not found a more portable system call to use -
 * ideas are greatly appreciated!
 *
 * NOTE! This version only does operations on vector<int>. More/other
 * data types should perhaps also be tested - ideas are always welcome!
 *
 * I assume that the glibc/underlying malloc() implementation has been
 * compiled with -O2, so this application should also be compiled
 * with -O2 in order to get relevant results.
 */

#include <cstdio>
#include <pthread.h>
#include <vector>
#include <sys/time.h>

/*
 * Make sure that the compiler has thread support...
 */
#if __GTHREADS

using namespace std;

/*
 * To prevent the time it takes for the application to start up and
 * shut down from affecting the end result, we define a target
 * duration (in seconds) for which all tests should run.
 * Each test is responsible for "calibrating" its # of iterations
 * to come as close as possible to this target, based on the time
 * it takes to complete the test using the default __pool_alloc.
 *
 * NOTE! In an ideal world with enough CPU and no contention the
 * application would complete on time; however, the design of the
 * __pool_alloc allocator and the fact that we don't have enough
 * CPU available (normally at least :) will make the actual runtime
 * much longer (at least # of threads * this value).
 */
int target_run_time = 10;

/*
 * The number of iterations to be performed in order to figure out
 * the "speed" of this computer and adjust the number of iterations
 * needed to come close to target_run_time.
 */
int calibrate_iterations = 100000;

/*
 * The number of values to insert in the vector; 32 will cause
 * 5 (re)allocations to be performed (sizes 4, 8, 16, 32 and 64).
 * This means that all allocations are within _MAX_BYTES = 128
 * as defined in stl_alloc.h for __pool_alloc.
 * Whether or not this value is relevant in "the real world"
 * I don't know - it should probably be investigated in
 * more detail.
 */
int insert_values = 32;

/*
 * Number of threads to create
 */
int threads = 8;

static struct timeval _tstart, _tend;
static struct timezone tz;

void
tstart (void)
{
  gettimeofday (&_tstart, &tz);
}
void
tend (void)
{
  gettimeofday (&_tend, &tz);
}

double
tval()
{
  double t1, t2;

  t1 = (double)_tstart.tv_sec + (double)_tstart.tv_usec/(1000*1000);
  t2 = (double)_tend.tv_sec + (double)_tend.tv_usec/(1000*1000);
  return t2 - t1;
}

int
calibrate_test_ints (void)
{
  tstart ();
  for (int i = 0; i < calibrate_iterations; i++)
  {
    vector<int> v1;

    for (int j = 0; j < insert_values; j++)
    {
      v1.push_back (1);
    }
  }
  tend ();

  return (int)((double)target_run_time / tval ()) * calibrate_iterations;
}

void*
test_ints_pool_alloc (void* it)
{
  int iterations = *(int*)it;

  for (int i = 0; i < iterations; i++)
  {
    vector<int> v1;

    for (int j = 0; j < insert_values; j++)
    {
      v1.push_back (1);
    }
  }

  return NULL;
}

void*
test_ints_malloc_alloc (void* it)
{
  int iterations = *(int*)it;

  for (int i = 0; i < iterations; i++)
  {
    vector<int, __malloc_alloc<0> > v1;

    for (int j = 0; j < insert_values; j++)
    {
      v1.push_back (1);
    }
  }

  return NULL;
}

void*
test_ints_mt_alloc (void* it)
{
  int iterations = *(int*)it;

  for (int i = 0; i < iterations; i++)
  {
    vector<int, __mt_alloc<0> > v1;

    for (int j = 0; j < insert_values; j++)
    {
      v1.push_back (1);
    }
  }

  return NULL;
}

int
main (void)
{
  printf ("Target run-time: %ds\n", target_run_time);

  printf ("Calibrating test_ints... ");
  int iterations = calibrate_test_ints ();
  printf ("DONE! Will do %d iterations.\n", iterations);

  double results[3];
  pthread_t workers[threads];

  printf ("__pool_alloc... ");
  tstart ();
  for( int i = 0; i < threads; i++ )
  {
    pthread_create (&workers[i], NULL, test_ints_pool_alloc, (void*)&iterations);
  }

  for( int i = 0; i < threads; i++ )
  {
    pthread_join (workers[i], NULL);
  }
  tend ();
  results[0] = tval ();
  printf ("DONE! Time %f\n", results[0]);

  printf ("__malloc_alloc... ");
  tstart ();
  for( int i = 0; i < threads; i++ )
  {
    pthread_create (&workers[i], NULL, test_ints_malloc_alloc, (void*)&iterations);
  }

  for( int i = 0; i < threads; i++ )
  {
    pthread_join (workers[i], NULL);
  }
  tend ();
  results[1] = tval ();
  printf ("DONE! Time %f\n", results[1]);

  printf ("__mt_alloc... ");
  tstart ();
  for( int i = 0; i < threads; i++ )
  {
    pthread_create (&workers[i], NULL, test_ints_mt_alloc, (void*)&iterations);
  }

  for( int i = 0; i < threads; i++ )
  {
    pthread_join (workers[i], NULL);
  }
  tend ();
  results[2] = tval ();
  printf ("DONE! Time %f\n", results[2]);

  printf ("Relative performance to __pool_alloc:\n");
  printf ("__malloc_alloc %f %%\n", (results[0] / results[1])*100);
  printf ("__mt_alloc %f %%\n", (results[0] / results[2])*100);

  return 0;
}

#else

int 
main (void) 
{
  printf ("No thread support, exiting!\n");

  return 0;
}

#endif

Index: stl_alloc.h
===================================================================
RCS file: /cvsroot/gcc/gcc/libstdc++-v3/include/bits/stl_alloc.h,v
retrieving revision 1.31
diff -r1.31 stl_alloc.h
83a84
> #include <cassert>
110a112,780
>   /*
>    * MT Allocator That Doesn't Suck?
>    */
>   template<int __inst>
>     class __mt_alloc
>     {
>     private:
>       /*
>        * We need to create the initial lists and set up some variables
>        * before we can answer the first request for memory.
>        * The initialization of these variables is done at file scope
>        * below the class declaration.
>        */
> #ifdef __GTHREADS
>       static __gthread_once_t _S_once_mt;
> #endif
>       static bool _S_once_single;
>       static void _S_init ();
> 
>       /*
>        * Variables used to "tune" the behavior of the allocator, assigned
>        * and explained in detail below.
>        */
>       static size_t _S_max_bytes;
>       static size_t _S_chunk_size;
>       static size_t _S_max_threads;
>       static size_t _S_no_of_bins;
>       static size_t _S_freelist_headroom;
> 
>       /*
>        * Each requesting thread is assigned an id ranging from 1 to 
>        * _S_max_threads. Thread id 0 is used as a global memory pool.
>        * In order to get constant-time thread id assignment, we keep
>        * a list of free ids. When a thread first requests memory we
>        * remove the first record from this list and store its address
>        * in a __gthread_key. When initializing the __gthread_key we
>        * specify a destructor; when this destructor is called (i.e.
>        * when the thread dies), we return the thread id to the back
>        * of this list.
>        */
> #ifdef __GTHREADS
>       struct thread_record
>       {
>         /*
>          * Points to next free thread id record. NULL if last record in list.
>          */
>         thread_record* next;
> 
>         /*
>          * Thread id ranging from 1 to _S_max_threads.
>          */
>         size_t id;
>       };
> 
>       static thread_record* _S_thread_freelist_first;
>       static thread_record* _S_thread_freelist_last;
>       static __gthread_mutex_t _S_thread_freelist_mutex;
>       static void _S_thread_key_destr (void* freelist_pos);
>       static __gthread_key_t _S_thread_key;
>       static size_t _S_get_thread_id (size_t bin);
> #endif
> 
>       struct block_record
>       {
>         /*
>          * Points to the next block_record for its thread_id.
>          */
>         block_record* next;
> 
>         /*
>          * The thread id of the thread which has requested this block.
>          * All blocks are initially "owned" by global pool thread id 0.
>          */
>         size_t thread_id;
>       };
> 
>       struct bin_record
>       {
>         /*
>          * An "array" of pointers to the first/last free block for each 
>          * thread id. Memory to these "arrays" is allocated in _S_init () 
>          * for _S_max_threads + global pool 0.
>          */
>         block_record** first;
>         block_record** last;
> 
>         /*
>          * An "array" of counters used to keep track of the amount of blocks
>          * that are on the freelist/used for each thread id.
>          * Memory to these "arrays" is allocated in _S_init () 
>          * for _S_max_threads + global pool 0.
>          */
>         size_t* free;
>         size_t* used;
> 
>         /*
>          * Each bin has its own mutex which is used to ensure data integrity 
>          * while changing "ownership" on a block.
>          * The mutex is initialized in _S_init ().
>          */
> #ifdef __GTHREADS
>         __gthread_mutex_t* mutex; 
> #endif
>       };
> 
>       /*
>        * An "array" of bin_records each of which represents a specific 
>        * power of 2 size. Memory to this "array" is allocated in _S_init ().
>        */
>       static bin_record* _S_bin;
> 
>     public:
>       static void*
>       allocate (size_t __n)
>       {
>         /*
>          * Requests larger than _S_max_bytes are handled by malloc/free directly
>          */
>         if (__n > _S_max_bytes)
>           return malloc (__n);
> 
> #ifdef __GTHREADS
>         if (__gthread_active_p ())
>           __gthread_once (&_S_once_mt, _S_init);
>         else
> #endif
>           {
>             if (!_S_once_single)
>               {
>                 _S_max_threads = 0;
>                 _S_init ();
>                 _S_once_single = true;
>               }
>           }
> 
>         /*
>          * Round up to power of 2 and figure out which bin to use
>          */
>         size_t bin = 0;
>         size_t bin_t = 1;
>         while (__n > bin_t)
>           {
>             bin_t = bin_t << 1;
>             bin++;
>           }
> 
> #ifdef __GTHREADS
>         size_t thread_id = _S_get_thread_id (bin);
> #else
>         size_t thread_id = 0;
> #endif
> 
>         block_record* block;
> 
>         /*
>          * Find out if we have blocks on our freelist.
>          * If so, go ahead and use them directly without
>          * having to lock anything.
>          */
>         if (_S_bin[bin].first[thread_id] == NULL)
>           {
>             /*
>              * Are we using threads?
>              * - Yes, lock and check if there are free blocks on the global 
>              *   list (and if not add new ones), get the first one 
>              *   and change owner.
>              * - No, all operations are made directly to global pool 0 
>              *   no need to lock or change ownership but check for free 
>              *   blocks on global list (and if not add new ones) and 
>              *   get the first one.
>              */
> #ifdef __GTHREADS
>             if (__gthread_active_p ())
>               {
>                 __gthread_mutex_lock (_S_bin[bin].mutex);
> 
>                 if (_S_bin[bin].first[0] == NULL)
>                   {
>                     _S_bin[bin].first[0] = (block_record*)malloc (_S_chunk_size);
> 
>                     size_t block_count = _S_chunk_size / (bin_t + sizeof (block_record));
> 
>                     _S_bin[bin].free[0] = block_count;
> 
>                     block_count--;
>                     block = _S_bin[bin].first[0];
> 
>                     while (block_count > 0)
>                       {
>                         block->next = (block_record*)((char*)block + (bin_t + sizeof (block_record)));
>                         block = block->next;
> 
>                         block_count--;
>                       }
> 
>                     block->next = NULL;
>                     _S_bin[bin].last[0] = block;
>                   }
> 
>                 block = _S_bin[bin].first[0];
> 
>                 /*
>                  * Remove from list and count down the available counter on
>                  * global pool 0.
>                  */
>                 _S_bin[bin].first[0] = _S_bin[bin].first[0]->next;
>                 _S_bin[bin].free[0]--;
> 
>                 __gthread_mutex_unlock (_S_bin[bin].mutex);
> 
>                 /*
>                  * Now that we have removed the block from the global
>                  * freelist we can change owner and update the used counter
>                  * for this thread without locking.
>                  */
>                 block->thread_id = thread_id;
>                 _S_bin[bin].used[thread_id]++;
>               }
>             else
> #endif
>               {
>                 _S_bin[bin].first[0] = (block_record*)malloc (_S_chunk_size);
> 
>                 size_t block_count = _S_chunk_size / (bin_t + sizeof (block_record));
> 
>                 _S_bin[bin].free[0] = block_count;
> 
>                 block_count--;
>                 block = _S_bin[bin].first[0];
> 
>                 while (block_count > 0)
>                   {
>                     block->next = (block_record*)((char*)block + (bin_t + sizeof (block_record)));
>                     block = block->next;
> 
>                     block_count--;
>                   }
> 
>                 block->next = NULL;
>                 _S_bin[bin].last[0] = block;
> 
>                 block = _S_bin[bin].first[0];
> 
>                 /*
>                  * Remove from list, count down the available counter on
>                  * global pool 0 and increase its used counter.
>                  */
>                 _S_bin[bin].first[0] = _S_bin[bin].first[0]->next;
>                 _S_bin[bin].free[0]--;
>                 _S_bin[bin].used[0]++;
>               }
>           }  
>         else
>           {
>             /*
>              * "Default" operation - we have blocks on our own freelist
>              * grab the first record and update the counters.
>              */
>             block = _S_bin[bin].first[thread_id];
> 
>             _S_bin[bin].first[thread_id] = _S_bin[bin].first[thread_id]->next;
>             _S_bin[bin].free[thread_id]--;
>             _S_bin[bin].used[thread_id]++;
>           }
> 
>         return (void*)((char*)block + sizeof (block_record));
>       }
> 
>       static void
>       deallocate (void* __p, size_t __n)
>       {
>         /*
>          * Requests larger than _S_max_bytes are handled by malloc/free directly
>          */
>         if (__n > _S_max_bytes)
>           {
>             free (__p);
>             return;
>           }
> 
>         /*
>          * Round up to power of 2 and figure out which bin to use
>          */
>         size_t bin = 0;
>         size_t bin_t = 1;
>         while (__n > bin_t)
>           {
>             bin_t = bin_t << 1;
>             bin++;
>           }
> 
> #ifdef __GTHREADS
>         size_t thread_id = _S_get_thread_id (bin);
> #else
>         size_t thread_id = 0;
> #endif
> 
>         block_record* block = (block_record*)((char*)__p - sizeof (block_record));
> 
>         /*
>          * This block will always be at the back of a list and thus
>          * we set its next pointer to NULL.
>          */
>         block->next = NULL;
> 
> #ifdef __GTHREADS
>         if (__gthread_active_p ())
>           {
>             /*
>              * Calculate the number of records to remove from our freelist
>              */
>             int remove = _S_bin[bin].free[thread_id] - (_S_bin[bin].used[thread_id] / _S_freelist_headroom);
> 
>             /*
>              * The calculation above will almost always tell us to remove one
>              * or two records at a time, but this creates too much contention
>              * when locking and therefore we wait until the number of records
>              * to be removed exceeds _S_freelist_headroom % of the freelist.
>              */
>             if (remove > (int)(_S_bin[bin].free[thread_id] / _S_freelist_headroom))
>               {
>                 __gthread_mutex_lock (_S_bin[bin].mutex);
> 
>                 while (remove > 0)
>                   {
>                     if (_S_bin[bin].first[0] == NULL)
>                       _S_bin[bin].first[0] = _S_bin[bin].first[thread_id];
>                     else
>                       _S_bin[bin].last[0]->next = _S_bin[bin].first[thread_id];
> 
>                     _S_bin[bin].last[0] = _S_bin[bin].first[thread_id];
> 
>                     _S_bin[bin].first[thread_id] = _S_bin[bin].first[thread_id]->next;
> 
>                     _S_bin[bin].free[0]++;
>                     _S_bin[bin].free[thread_id]--;
> 
>                     remove--;
>                   }
> 
>                 _S_bin[bin].last[0]->next = NULL;
> 
>                 __gthread_mutex_unlock (_S_bin[bin].mutex);
>               }
> 
>             /*
>              * Did we allocate this block?
>              * - Yes, return it to our freelist
>              * - No, return it to global pool
>              */
>             if (thread_id == block->thread_id)
>               {
>                 if (_S_bin[bin].first[thread_id] == NULL)
>                   _S_bin[bin].first[thread_id] = block;
>                 else
>                   _S_bin[bin].last[thread_id]->next = block;
> 
>                 _S_bin[bin].last[thread_id] = block;
> 
>                 _S_bin[bin].free[thread_id]++;
>                 _S_bin[bin].used[thread_id]--;
>               }
>             else
>               {
>                 __gthread_mutex_lock (_S_bin[bin].mutex);
> 
>                 if (_S_bin[bin].first[0] == NULL)
>                   _S_bin[bin].first[0] = block;
>                 else
>                   _S_bin[bin].last[0]->next = block;
> 
>                 _S_bin[bin].last[0] = block;
> 
>                 _S_bin[bin].free[0]++;
>                 _S_bin[bin].used[block->thread_id]--;
> 
>                 __gthread_mutex_unlock (_S_bin[bin].mutex);
>               }
>           }
>         else
> #endif
>           {
>             /*
>              * Single threaded application - return to global pool
>              */
>             if (_S_bin[bin].first[0] == NULL)
>               _S_bin[bin].first[0] = block;
>             else
>               _S_bin[bin].last[0]->next = block;
> 
>             _S_bin[bin].last[0] = block;
> 
>             _S_bin[bin].free[0]++;
>             _S_bin[bin].used[0]--;
>           }
>       }
>     };
> 
>   template<int __inst>
>     void
>     __mt_alloc<__inst>::
>     _S_init ()
>     {
>       /*
>        * Calculate the number of bins required based on _S_max_bytes,
>        * _S_no_of_bins is initialized to 1 below.
>        */
>       size_t bin_t = 1;
>       while (_S_max_bytes > bin_t)
>         {
>           bin_t = bin_t << 1;
>           _S_no_of_bins++;
>         }
> 
>       /*
>        * If __gthread_active_p () create and initialize the list of free 
>        * thread ids. Single threaded applications use thread id 0 directly 
>        * and have no need for this. 
>        */
> #ifdef __GTHREADS
>       if (__gthread_active_p ())
>         {
>           _S_thread_freelist_first = (thread_record*)malloc (sizeof (thread_record) * _S_max_threads);
> 
>           /*
>            * NOTE! The first assignable thread id is 1 since the global 
>            * pool uses id 0
>            */
>           size_t i;
>           for (i = 1; i < _S_max_threads; i++)
>             {
>               _S_thread_freelist_first[i - 1].next = &_S_thread_freelist_first[i];
>               _S_thread_freelist_first[i - 1].id = i;
>             }
> 
>           /*
>            * Set last record and pointer to this
>            */
>           _S_thread_freelist_first[i - 1].next = NULL;
>           _S_thread_freelist_first[i - 1].id = i;
>           _S_thread_freelist_last = &_S_thread_freelist_first[i - 1];
> 
>           /*
>            * Initialize per thread key to hold pointer to _S_thread_freelist
>            * NOTE! Here's an ugly workaround - if _S_thread_key_destr is 
>            * not explicitly called at least once it won't be linked into 
>            * the application. This is due to how template methods behave:
>            * __gthread_key_create () takes only a pointer to the function
>            * and does not cause the compiler to create an instance.
>            */
>           _S_thread_key_destr( NULL );
>           __gthread_key_create (&_S_thread_key, _S_thread_key_destr);
>         }
> #endif
> 
>       /*
>        * Initialize _S_bin and its members
>        */
>       _S_bin = (bin_record*)malloc (sizeof (bin_record) * _S_no_of_bins);
> 
>       for (size_t bin = 0; bin < _S_no_of_bins; bin++)
>         {
>           _S_bin[bin].first = (block_record**)malloc (sizeof (block_record*) * (_S_max_threads + 1));
>           _S_bin[bin].last = (block_record**)malloc (sizeof (block_record*) * (_S_max_threads + 1));
>           _S_bin[bin].free = (size_t*)malloc (sizeof (size_t) * (_S_max_threads + 1));
>           _S_bin[bin].used = (size_t*)malloc (sizeof (size_t) * (_S_max_threads + 1));
> 
>           /*
>            * Ugly workaround of what at the time of writing seems to be
>            * a parser problem - see PR c++/9779 for more info.
>            */
> #ifdef __GTHREADS
>           size_t s = sizeof (__gthread_mutex_t); 
>           _S_bin[bin].mutex = (__gthread_mutex_t*)malloc (s);
> 
>           /*
>            * This is not only ugly - it's extremely non-portable!
>            * However gthr.h does not currently provide a __gthread_mutex_init ()
>            * call. The correct solution to this problem needs to be discussed.
>            */
>           pthread_mutex_init (_S_bin[bin].mutex, NULL);
> #endif
> 
>           for (size_t thread = 0; thread <= _S_max_threads; thread++)
>             {
>               _S_bin[bin].first[thread] = NULL;
>               _S_bin[bin].last[thread] = NULL;
>               _S_bin[bin].free[thread] = 0;
>               _S_bin[bin].used[thread] = 0;
>             }
>         }
>     }
> 
> #ifdef __GTHREADS
>   template<int __inst>
>     void
>     __mt_alloc<__inst>::
>     _S_thread_key_destr (void* freelist_pos)
>     {
>       /*
>        * This is due to the ugly workaround mentioned in _S_init ()
>        */
>       if (freelist_pos == NULL)
>         return;
> 
>       /*
>        * Return this thread id record to thread_freelist
>        */
>       __gthread_mutex_lock (&_S_thread_freelist_mutex);
> 
>       _S_thread_freelist_last->next = (thread_record*)freelist_pos;
>       _S_thread_freelist_last = (thread_record*)freelist_pos;
>       _S_thread_freelist_last->next = NULL;
> 
>       __gthread_mutex_unlock (&_S_thread_freelist_mutex);
> 
>       /*
>        * If the thread - when it dies - still has records on its freelist
>        * we return them to the global pool here.
>        */
>       for (size_t bin = 0; bin < _S_no_of_bins; bin++)
>         {
>           block_record* block = _S_bin[bin].first[((thread_record*)freelist_pos)->id];
> 
>           if (block != NULL)
>             {
>               __gthread_mutex_lock (_S_bin[bin].mutex);
> 
>               while (block != NULL)
>                 {
>                   if (_S_bin[bin].first[0] == NULL)
>                     _S_bin[bin].first[0] = block;
>                   else
>                     _S_bin[bin].last[0]->next = block;
> 
>                   _S_bin[bin].last[0] = block;
> 
>                   block = block->next;
> 
>                   _S_bin[bin].free[0]++;
>                 }
> 
>               _S_bin[bin].last[0]->next = NULL;
> 
>               __gthread_mutex_unlock (_S_bin[bin].mutex);
>             }
>         }
>     }
> 
>   template<int __inst>
>     size_t
>     __mt_alloc<__inst>::
>     _S_get_thread_id (size_t bin)
>     {
>       /*
>        * If we have thread support and it's active, we check the thread key
>        * value and return its id; if it's not set, we take the first record
>        * from _S_thread_freelist, set the key and return its id.
>        */
>       if (__gthread_active_p ())
>         {
>           thread_record* freelist_pos;
> 
>           if ((freelist_pos = (thread_record*)__gthread_getspecific (_S_thread_key)) == NULL)
>             {
>               __gthread_mutex_lock (&_S_thread_freelist_mutex);
> 
>               /*
>                * Since _S_max_threads must be larger than the theoretical 
>                * max number of threads of the OS, the list can never be empty.
>                */
>               freelist_pos = _S_thread_freelist_first;
>               _S_thread_freelist_first = _S_thread_freelist_first->next;
> 
>               __gthread_mutex_unlock (&_S_thread_freelist_mutex);
> 
>               __gthread_setspecific (_S_thread_key, (void*)freelist_pos);
> 
>               /*
>                * Since thread_ids may/will be reused (especially in
>                * producer/consumer applications) we make sure that
>                * the list pointers and free counter are reset. BUT
>                * as the "old" thread may still own some memory
>                * (which is referred to by other threads and 
>                * thus not freed) we don't reset the used counter.
>                */
>               _S_bin[bin].first[freelist_pos->id] = NULL;
>               _S_bin[bin].last[freelist_pos->id] = NULL;
>               _S_bin[bin].free[freelist_pos->id] = 0;
>             } 
> 
>           return freelist_pos->id;
>         }
> 
>       /*
>        * Otherwise (no thread support or inactive) all requests are served from
>        * the global pool 0.
>        */
>       return 0;
>     }
> 
>   template<int __inst> __gthread_once_t
>   __mt_alloc<__inst>::_S_once_mt = __GTHREAD_ONCE_INIT;
> #endif
> 
>   template<int __inst> bool
>   __mt_alloc<__inst>::_S_once_single = false;
> 
>   /*
>    * Allocation requests (after round-up to power of 2) below this value 
>    * will be handled by the allocator. A raw malloc/free () call will be used
>    * for requests larger than this value. 
>    */
>   template<int __inst> size_t
>   __mt_alloc<__inst>::_S_max_bytes = 128;
> 
>   /*
>    * In order to avoid fragmentation and minimize the number of malloc ()
>    * calls we always request new memory using this value. Based on 
>    * previous discussions we have chosen the value 4096, which on Linux
>    * equals getpagesize ().
>    */
>   template<int __inst> size_t
>   __mt_alloc<__inst>::_S_chunk_size = 4096;
> 
>   /*
>    * The maximum number of supported threads. Our Linux 2.4.18 reports
>    * 4070 in /proc/sys/kernel/threads-max
>    */
>   template<int __inst> size_t
>   __mt_alloc<__inst>::_S_max_threads = 4096;
> 
>   /*
>    * Actual value calculated in _S_init ()
>    */
>   template<int __inst> size_t
>   __mt_alloc<__inst>::_S_no_of_bins = 1;
> 
>   /*
>    * Each time a deallocation occurs in a threaded application, we make
>    * sure that no more than _S_freelist_headroom % of the memory in use
>    * sits on the freelist. Any surplus records beyond that limit are
>    * moved back to the global pool.
>    */
>   template<int __inst> size_t
>   __mt_alloc<__inst>::_S_freelist_headroom = 10;
> 
>   /*
>    * Actual initialization in _S_init ()
>    */
> #ifdef __GTHREADS
>   template<int __inst> typename __mt_alloc<__inst>::thread_record*
>   __mt_alloc<__inst>::_S_thread_freelist_first = NULL;
> 
>   template<int __inst> typename __mt_alloc<__inst>::thread_record*
>   __mt_alloc<__inst>::_S_thread_freelist_last = NULL;
> 
>   template<int __inst> __gthread_mutex_t
>   __mt_alloc<__inst>::_S_thread_freelist_mutex = __GTHREAD_MUTEX_INIT;
> 
>   /*
>    * Actual initialization in _S_init ()
>    */
>   template<int __inst> __gthread_key_t
>   __mt_alloc<__inst>::_S_thread_key;
> #endif
> 
>   template<int __inst> typename __mt_alloc<__inst>::bin_record*
>   __mt_alloc<__inst>::_S_bin = NULL;
219c889,892
<    *  arguments for debugging.
---
>    *  arguments for debugging.  Errors are reported using assert; these
>    *  checks can be disabled via NDEBUG, but the space penalty is still
>    *  paid, therefore it is far better to just use the underlying allocator
>    *  by itself when no checking is desired.
248,249c921
<         if (*(size_t*)__real_p != __n)
<           abort();
---
>         assert(*(size_t*)__real_p == __n);
351a1024,1025
> 	    // Trust but verify...
> 	    assert(_S_force_new != 0);
751c1425
<   template<int inst>
---
>   template<int __inst>
753,754c1427,1440
<     operator==(const __malloc_alloc<inst>&,
<                const __malloc_alloc<inst>&)
---
>     operator==(const __mt_alloc<__inst>&,
>                const __mt_alloc<__inst>&)
>     { return true; }
> 
>   template<int __inst>
>     inline bool
>     operator!=(const __mt_alloc<__inst>&,
>                const __mt_alloc<__inst>&)
>     { return false; }
> 
>   template<int __inst>
>     inline bool
>     operator==(const __malloc_alloc<__inst>&,
>                const __malloc_alloc<__inst>&)
835a1522,1529
>     struct _Alloc_traits<_Tp, __mt_alloc<__inst> >
>     {
>       static const bool _S_instanceless = true;
>       typedef __simple_alloc<_Tp, __mt_alloc<__inst> > _Alloc_type;
>       typedef __allocator<_Tp, __mt_alloc<__inst> > allocator_type;
>     };
> 
>   template<typename _Tp, int __inst>
864a1559,1567
>   template<typename _Tp, typename _Tp1, int __inst>
>     struct _Alloc_traits<_Tp,
>                          __allocator<_Tp1, __mt_alloc<__inst> > >
>     {
>       static const bool _S_instanceless = true;
>       typedef __simple_alloc<_Tp, __mt_alloc<__inst> > _Alloc_type;
>       typedef __allocator<_Tp, __mt_alloc<__inst> > allocator_type;
>     };
> 
