
Allocator performance testing


Hi!

Measuring performance and comparing different solutions/approaches is, as always, tricky. However, we felt we needed some _simple_ tools to make sure that the allocator (aimed at MT applications) we are working on performs as expected.

The two attached files contain our current approach to comparing performance relative to the current __pool_alloc allocator. The main reason for this post is to get much-needed input on other cases that should be tested and/or other - perhaps already existing - tools that we are not aware of.

Below are some figures from two different machines (single and dual CPU, both running Linux 2.4.18). The overall conclusion is that the current __pool_alloc generates an awful amount of context switching (as expected, due to its single global lock) and thus performs worse and worse as the number of threads increases, while the Linux malloc() implementation actually performs better and better - relatively speaking - the more threads there are (as does our proposed allocator :)
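
As a side note: one way to verify the context-switching claim on Linux is getrusage(), which reports voluntary and involuntary context-switch counts. A minimal sketch (an illustrative helper, not part of the attached tests):

#include <sys/resource.h>
#include <cstdio>

/*
 * Illustrative helper: print the context-switch counters for the
 * calling thread. Under LinuxThreads (Linux 2.4) each thread is a
 * separate process, so each worker should sample its own counters.
 */
void
report_context_switches (void)
{
  struct rusage ru;

  if (getrusage (RUSAGE_SELF, &ru) == 0)
    printf ("voluntary: %ld, involuntary: %ld\n",
            ru.ru_nvcsw, ru.ru_nivcsw);
}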

For specifics on the number of threads, the approach etc., have a look at the code :)

NOTE! The numbers from the alloc_perf_comp_mt.cc version include figures from testing a crude version of our allocator, which will result in a patch for review soon. So far it's looking good, and there is quite a bit in there addressing issues that this simple test can't reveal (long-term fragmentation, heap growth etc.).

For your reference - this is the compiler used:
Reading specs from /root/projects/alloc/local/lib/gcc-lib/i686-pc-linux-gnu/3.4/specs
Configured with: ../gcc/configure --prefix=/root/projects/alloc/local --enable-threads
Thread model: posix
gcc version 3.4 20030202 (experimental)

g++ alloc_perf_comp_single.cc -O2 -static

On single 1GHz PIII laptop:
Target run-time: 10s
Calibrating test_ints... DONE! Will do 5400000 iterations.
__pool_alloc... DONE! Time 9.888797
__malloc_alloc... DONE! Time 14.835390
Relative performance to __pool_alloc:
__malloc_alloc 66.656805 %

On dual 1GHz PIII server:
Target run-time: 10s
Calibrating test_ints... DONE! Will do 5500000 iterations.
__pool_alloc... DONE! Time 10.030389
__malloc_alloc... DONE! Time 15.172492
Relative performance to __pool_alloc:
__malloc_alloc 66.109041 %
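
(The percentages here and below are computed as time(__pool_alloc) / time(allocator) * 100, so a figure below 100% means the allocator is slower than __pool_alloc and a figure above 100% means it is faster.)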

g++ alloc_perf_comp_mt.cc -O2 -static -pthread

On single 1GHz PIII laptop:
Target run-time: 10s
Calibrating test_ints... DONE! Will do 2700000 iterations.
__pool_alloc... DONE! Time 115.985574
__malloc_alloc... DONE! Time 97.495789
__mt_alloc... DONE! Time 55.502588
Relative performance to __pool_alloc:
__malloc_alloc 118.964701 %
__mt_alloc 208.973272 %

On dual 1GHz PIII server:
Target run-time: 10s
Calibrating test_ints... DONE! Will do 2700000 iterations.
__pool_alloc... DONE! Time 86.065636
__malloc_alloc... DONE! Time 97.874511
__mt_alloc... DONE! Time 55.050467
Relative performance to __pool_alloc:
__malloc_alloc 87.934678 %
__mt_alloc 156.339520 %

Brgds

/Stefan

--
Always yield to temptation-because it may not pass your way again.

/*
 * 2003-02-05 Stefan Olsson <stefan@snon.net>
 *
 * The goal with this application is to compare the performance
 * between different STL allocators relative to the default
 * __pool_alloc.
 *
 * The container used for the tests is vector, which, as stated by
 * SGI, is "the simplest of the STL container classes, and in many
 * cases the most efficient".
 *
 * NOTE! The vector<> container does some "caching" of its own and
 * therefore we redeclare the variable in each iteration (forcing the
 * constructor/destructor to be called and thus the memory to be freed).
 *
 * NOTE! The use of gettimeofday is unfortunate since it's not
 * universally available; however, I have not found a more generic
 * system call to use - ideas are greatly appreciated!
 *
 * NOTE! This version only does operations on vector<int>. More/other
 * data types should perhaps also be tested - ideas are always welcome!
 *
 * I assume that the glibc/underlying malloc() implementation has been
 * compiled with -O2; thus this application should also be compiled
 * with -O2 in order to get relevant results.
 */

#include <cstdio>
#include <vector>
#include <sys/time.h>

using namespace std;

/*
 * To avoid the time it takes for the application to start up and
 * shut down affecting the end result, we define a target duration
 * (in seconds) for which all tests should run.
 * Each test is responsible for "calibrating" its # of iterations
 * to come as close as possible to this target, based on the time
 * it takes to complete the test using the default __pool_alloc.
 */
int target_run_time = 10;

/*
 * The number of iterations to be performed in order to figure out
 * the "speed" of this computer and adjust the number of iterations
 * needed to come close to target_run_time.
 */
int calibrate_iterations = 100000;

/*
 * The number of values to insert in the vector. Inserting 32 ints
 * makes the vector grow through allocations of 4, 8, 16, 32, 64
 * and 128 bytes (the initial allocation plus five reallocations).
 * This means that all allocations are within _MAX_BYTES = 128
 * as defined in stl_alloc.h for __pool_alloc.
 * Whether or not this value is relevant in "the real world" I don't
 * know; it should probably be investigated in more detail.
 */
int insert_values = 32;

static struct timeval _tstart, _tend;
static struct timezone tz;

void
tstart (void)
{
  gettimeofday (&_tstart, &tz);
}
void
tend (void)
{
  gettimeofday (&_tend, &tz);
}

double
tval()
{
  double t1, t2;

  t1 = (double)_tstart.tv_sec + (double)_tstart.tv_usec/(1000*1000);
  t2 = (double)_tend.tv_sec + (double)_tend.tv_usec/(1000*1000);
  return t2 - t1;
}

int
calibrate_test_ints (void)
{
  tstart ();
  for (int i = 0; i < calibrate_iterations; i++)
  {
    vector<int> v1;

    for (int j = 0; j < insert_values; j++)
    {
      v1.push_back (1);
    }
  }
  tend ();

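  /*
   * Scale the iteration count so that each timed run lasts roughly
   * target_run_time seconds on this machine.
   */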
  return (int)((double)target_run_time / tval ()) * calibrate_iterations;
}

double
test_ints_pool_alloc (int iterations)
{
  tstart ();
  for (int i = 0; i < iterations; i++)
  {
    vector<int> v1;

    for (int j = 0; j < insert_values; j++)
    {
      v1.push_back (1);
    }
  }
  tend ();

  return tval ();
}

double
test_ints_malloc_alloc (int iterations)
{
  tstart ();
  for (int i = 0; i < iterations; i++)
  {
    vector<int, __malloc_alloc<0> > v1;

    for (int j = 0; j < insert_values; j++)
    {
      v1.push_back (1);
    }
  }
  tend ();

  return tval ();
}

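/*
 * Placeholder for the upcoming __mt_alloc test; it measures nothing
 * yet, and the corresponding call in main() is commented out until
 * the allocator is available.
 */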
double
test_ints_mt_alloc (int iterations)
{
  tstart ();
  tend ();

  return tval ();
}

int
main (void)
{
  printf ("Target run-time: %ds\n", target_run_time);

  printf ("Calibrating test_ints... ");
  int iterations = calibrate_test_ints ();
  printf ("DONE! Will do %d iterations.\n", iterations);

  double results[3];

  printf ("__pool_alloc... ");
  results[0] = test_ints_pool_alloc (iterations);
  printf ("DONE! Time %f\n", results[0]);

  printf ("__malloc_alloc... ");
  results[1] = test_ints_malloc_alloc (iterations);
  printf ("DONE! Time %f\n", results[1]);

/*
  printf ("__mt_alloc... ");
  results[2] = test_ints_mt_alloc (iterations);
  printf ("DONE! Time %f\n", results[2]);
*/

  printf ("Relative performance to __pool_alloc:\n");
  printf ("__malloc_alloc %f %%\n", (results[0] / results[1])*100);
//  printf ("__mt_alloc %f %%\n", (results[0] / results[2])*100);

  return 0;
}
/*
 * 2003-02-05 Stefan Olsson <stefan@snon.net>
 *
 * The goal with this application is to compare the performance
 * between different STL allocators relative to the default
 * __pool_alloc.
 *
 * The container used for the tests is vector, which, as stated by
 * SGI, is "the simplest of the STL container classes, and in many
 * cases the most efficient".
 *
 * NOTE! The vector<> container does some "caching" of its own and
 * therefore we redeclare the variable in each iteration (forcing the
 * constructor/destructor to be called and thus the memory to be freed).
 *
 * NOTE! The use of gettimeofday is unfortunate since it's not
 * universally available; however, I have not found a more generic
 * system call to use - ideas are greatly appreciated!
 *
 * NOTE! This version only does operations on vector<int>. More/other
 * data types should perhaps also be tested - ideas are always welcome!
 *
 * I assume that the glibc/underlying malloc() implementation has been
 * compiled with -O2; thus this application should also be compiled
 * with -O2 in order to get relevant results.
 */

#include <cstdio>
#include <vector>
#include <pthread.h>
#include <sys/time.h>

/*
 * Make sure that the compiler has thread support...
 */
#if __GTHREADS

using namespace std;

/*
 * To avoid the time it takes for the application to start up and
 * shut down affecting the end result, we define a target duration
 * (in seconds) for which all tests should run.
 * Each test is responsible for "calibrating" its # of iterations
 * to come as close as possible to this target, based on the time
 * it takes to complete the test using the default __pool_alloc.
 *
 * NOTE! In an ideal world with enough CPUs and no contention, the
 * application would complete on time; however, the design of the
 * __pool_alloc allocator and the fact that we don't have enough
 * CPUs available (normally at least :) make the actual runtime much
 * longer (at least # of threads * this value).
 */
int target_run_time = 10;

/*
 * The number of iterations to be performed in order to figure out
 * the "speed" of this computer and adjust the number of iterations
 * needed to come close to target_run_time.
 */
int calibrate_iterations = 100000;

/*
 * The number of values to insert in the vector. Inserting 32 ints
 * makes the vector grow through allocations of 4, 8, 16, 32, 64
 * and 128 bytes (the initial allocation plus five reallocations).
 * This means that all allocations are within _MAX_BYTES = 128
 * as defined in stl_alloc.h for __pool_alloc.
 * Whether or not this value is relevant in "the real world" I don't
 * know; it should probably be investigated in more detail.
 */
int insert_values = 32;

/*
 * Number of threads to create
 */
int threads = 8;

static struct timeval _tstart, _tend;
static struct timezone tz;

void
tstart (void)
{
  gettimeofday (&_tstart, &tz);
}
void
tend (void)
{
  gettimeofday (&_tend, &tz);
}

double
tval()
{
  double t1, t2;

  t1 = (double)_tstart.tv_sec + (double)_tstart.tv_usec/(1000*1000);
  t2 = (double)_tend.tv_sec + (double)_tend.tv_usec/(1000*1000);
  return t2 - t1;
}

int
calibrate_test_ints (void)
{
  tstart ();
  for (int i = 0; i < calibrate_iterations; i++)
  {
    vector<int> v1;

    for (int j = 0; j < insert_values; j++)
    {
      v1.push_back (1);
    }
  }
  tend ();

  return (int)((double)target_run_time / tval ()) * calibrate_iterations;
}

void*
test_ints_pool_alloc (void* it)
{
  int iterations = *(int*)it;

  for (int i = 0; i < iterations; i++)
  {
    vector<int> v1;

    for (int j = 0; j < insert_values; j++)
    {
      v1.push_back (1);
    }
  }

  return NULL;
}

void*
test_ints_malloc_alloc (void* it)
{
  int iterations = *(int*)it;

  for (int i = 0; i < iterations; i++)
  {
    vector<int, __malloc_alloc<0> > v1;

    for (int j = 0; j < insert_values; j++)
    {
      v1.push_back (1);
    }
  }

  return NULL;
}

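/*
 * Placeholder thread body for the upcoming __mt_alloc test; the
 * corresponding section in main() is commented out until the
 * allocator is available.
 */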
void*
test_ints_mt_alloc (void* it)
{
  return NULL;
}

int
main (void)
{
  printf ("Target run-time: %ds\n", target_run_time);

  printf ("Calibrating test_ints... ");
  int iterations = calibrate_test_ints ();
  printf ("DONE! Will do %d iterations.\n", iterations);

  double results[3];
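  /* NOTE: "threads" is not a compile-time constant, so this array
     relies on the g++ variable-length array extension. */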
  pthread_t workers[threads];

  printf ("__pool_alloc... ");
  tstart ();
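  /*
   * All worker threads read the same iteration count; it is not
   * written after calibration, so sharing the pointer is safe.
   */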
  for( int i = 0; i < threads; i++ )
  {
    pthread_create (&workers[i], NULL, test_ints_pool_alloc, (void*)&iterations);
  }

  for( int i = 0; i < threads; i++ )
  {
    pthread_join (workers[i], NULL);
  }
  tend ();
  results[0] = tval ();
  printf ("DONE! Time %f\n", results[0]);

  printf ("__malloc_alloc... ");
  tstart ();
  for( int i = 0; i < threads; i++ )
  {
    pthread_create (&workers[i], NULL, test_ints_malloc_alloc, (void*)&iterations);
  }

  for( int i = 0; i < threads; i++ )
  {
    pthread_join (workers[i], NULL);
  }
  tend ();
  results[1] = tval ();
  printf ("DONE! Time %f\n", results[1]);

/*
  printf ("__mt_alloc... ");
  tstart ();
  for( int i = 0; i < threads; i++ )
  {
    pthread_create (&workers[i], NULL, test_ints_mt_alloc, (void*)&iterations);
  }

  for( int i = 0; i < threads; i++ )
  {
    pthread_join (workers[i], NULL);
  }
  tend ();
  results[2] = tval ();
  printf ("DONE! Time %f\n", results[2]);
*/

  printf ("Relative performance to __pool_alloc:\n");
  printf ("__malloc_alloc %f %%\n", (results[0] / results[1])*100);
//  printf ("__mt_alloc %f %%\n", (results[0] / results[2])*100);

  return 0;
}

#else

int 
main (void) 
{
  printf ("No thread support, exiting!\n");

  return 0;
}

#endif

