This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.



Re: Change default allocator?


Thanks a lot for this detailed study - I am sure that this will help a lot of
people, since this is "kind of undocumented" even though it might seem obvious
to you guys ;-)

In our case, this type of string usage arises because each thread builds up
query strings that are then sent to the SQL server. We then iterate through
the result set, where our MySQL wrapper returns data as strings, and so on...

We will go with the solution of defining a specific allocator each time.

The crashes that we have seen are most likely due to the fact that we are
currently running with __USE_MALLOC, and that this causes so much memory
fragmentation that the application just gives up...

Once again thanks!

/Stefan

Loren James Rittle wrote:

> OK, let's look at the test program given below.  It is basically your
> version 1 when LOCAL_MEMORY is not defined and version 2 when
> LOCAL_MEMORY is defined.  I fixed the uninitialized pointer bug
> previously mentioned.  [My earlier remark was directed at the OS that
> failed to detect the wild pointer not in your coding ability.  Yes, I
> understood this was a quick hack not your production code.]  I also
> needed to account for the fact that, right or wrong, my local system
> headers have pthread_attr_default defined in <pthread.h>.  Then I
> changed the test program to do a known amount of logical work before
> terminating.  If NO_THREADS is defined then we see the amount of CPU
> time required to do the productive work (i.e. we can see the resource
> overhead of threading - even when used properly, it will always
> increase total CPU-time).  If we want to force an allocator that
> directly uses malloc, then define MALLOC_MEMORY.
>
> #include <unistd.h>
> #include <pthread.h>
> #include <string>
>
> #if defined (MALLOC_MEMORY)
> typedef std::basic_string <char, std::char_traits<char>,
>   std::__allocator<char, std::__malloc_alloc_template<0> > > my_string;
> #elif defined (LOCAL_MEMORY)
> // ATTENTION: stl_pthread_alloc.h is out of date on mainline and requires
> // this line until updated:
> // typedef std::__malloc_alloc_template<0> malloc_alloc;
> #include <bits/stl_pthread_alloc.h>
> typedef std::__allocator<char, std::pthread_alloc> std_char_pthread_alloc;
> typedef std::basic_string <char, std::char_traits<char>,
>   std_char_pthread_alloc > pthread_string;
> typedef pthread_string my_string;
> #else
> typedef std::string my_string;
> #endif
>
> using namespace std;
>
> void*
> worker (void *)
> {
>   for (int i = 0; i < 10000; i++)
>     {
>       my_string s = "jala";
>       s += "foo2";
>       s += "foo2";
>       s += "foo2";
>       s += "foo2";
>       s += "foo2";
>       s += "foo2";
>     }
>   return NULL;
> }
>
> #ifndef NTHR
> #define NTHR 16
> #endif
>
> #ifndef NO_THREADS
> int
> main ()
> {
>   pthread_t thread[NTHR];
>   pthread_attr_t pthread_attr_default_x;
>
>   pthread_attr_init (&pthread_attr_default_x);
>
>   for (int i = 0; i < NTHR; i++)
>     pthread_create (&thread[i], &pthread_attr_default_x, worker, NULL);
>
>   for (int i = 0; i < NTHR; i++)
>     pthread_join (thread[i], NULL);
>
>   return 0;
> }
> #else
> int
> main ()
> {
>   for (int i = 0; i < NTHR; i++)
>     worker (NULL);
> }
> #endif
>
> Gross results for my system with mainline compiler labeled 20011101 (I
> had some other load so we will focus only on CPU-time, the number
> labeled with a 'u'):
>
> 1. Base-line just to do the logical work load.
>
> S rittle@latour; g++  -O2  -g -DNO_THREADS x.C
> S rittle@latour; time a.out
>      3r     2.1u     0.0s       a.out
>
> 2. Base-line plus overhead for actually locking the allocator mutex
>    the number of times required for thread-safety (NOTE: you will only
>    see a real difference here from case 1 if your platform supports
>    weak symbols and your port is otherwise setup properly to take
>    advantage of that fact - Don't ask as I have no idea what ports are
>    setup so or where that information is well-documented).
>
> S rittle@latour; g++  -O2  -g -DNO_THREADS -pthread x.C
> S rittle@latour; time a.out
>     12r     7.2u     0.0s       a.out
>
> With the debugger, I note that your example locks the allocator mutex
> four independent times for each addition of "foo2".  I guess that
> looks right since, as coded, you have two new string reps being
> created and two string reps being destroyed per line.  Yikes!
>
> 3. Base-line plus overhead for locking the allocator mutex, thread
>    scheduling and mutex lock contention.
>
> S rittle@latour; g++  -O2  -g -pthread x.C
> S rittle@latour; time a.out
>     65r    31.9u     0.2s       a.out
>
> This looks horrible but matches my experience with fine-grain locking.
>
> 4. Verify linear scaling when varying total logical work load with no
>    threading.
>
> S rittle@latour; g++  -O2  -g -pthread -DNO_THREADS -DNTHR\=1 x.C
> S rittle@latour; time a.out
>      1r     0.4u     0.0s       a.out
> S rittle@latour; g++  -O2  -g -pthread -DNO_THREADS -DNTHR\=2 x.C
> S rittle@latour; time a.out
>      2r     0.9u     0.0s       a.out
> S rittle@latour; g++  -O2  -g -pthread -DNO_THREADS -DNTHR\=3 x.C
> S rittle@latour; time a.out
>      3r     1.3u     0.0s       a.out
> S rittle@latour; g++  -O2  -g -pthread -DNO_THREADS -DNTHR\=4 x.C
> S rittle@latour; time a.out
>      2r     1.8u     0.0s       a.out
>
> And 1.8 * 4 == 7.2, thus looks like linear scaling to me without
> further checking.
>
> 5. Investigate scaling as threading increases (look for a knee):
>
> S rittle@latour; g++  -O2  -g -pthread  -DNTHR\=1 x.C
> S rittle@latour; time a.out
>      1r     0.4u     0.0s       a.out
> S rittle@latour; g++  -O2  -g -pthread  -DNTHR\=2 x.C
> S rittle@latour; time a.out
>      4r     2.1u     0.0s       a.out
> S rittle@latour; g++  -O2  -g -pthread  -DNTHR\=3 x.C
> S rittle@latour; time a.out
>      7r     3.7u     0.0s       a.out
> S rittle@latour; g++  -O2  -g -pthread  -DNTHR\=4 x.C
> S rittle@latour; time a.out
>     10r     5.2u     0.0s       a.out
> S rittle@latour; g++  -O2  -g -pthread  -DNTHR\=5 x.C
> S rittle@latour; time a.out
>     14r     7.4u     0.0s       a.out
> S rittle@latour; g++  -O2  -g -pthread  -DNTHR\=6 x.C
> S rittle@latour; time a.out
>     16r     9.3u     0.0s       a.out
> S rittle@latour; g++  -O2  -g -pthread  -DNTHR\=7 x.C
> S rittle@latour; time a.out
>     19r    10.8u     0.0s       a.out
> S rittle@latour; g++  -O2  -g -pthread  -DNTHR\=8 x.C
> S rittle@latour; time a.out
>     28r    14.0u     0.1s       a.out
>
> There is one, but even the second thread injects a lot of overhead for
> this worker code path on this platform.
>
> 6. Look at LOCAL_MEMORY path.  Note that I needed to add the commented
>    line of code with the mainline compiler compared to 3.0.X.
>
> S rittle@latour; g++  -O2  -g -pthread    -DNO_THREADS -DLOCAL_MEMORY x.C
> S rittle@latour; time a.out
>      4r     2.3u     0.0s       a.out
> S rittle@latour; g++  -O2  -g -pthread    -DLOCAL_MEMORY x.C
> S rittle@latour; time a.out
>      5r     2.3u     0.0s       a.out
>
> NOTE: Even if the code in stl_pthread_alloc.h were ported to use the
> gthr.h abstraction layer, making it the default allocator is not a
> general solution, due to its memory-leaking behavior - we would have
> to look at a hybrid allocator that could cache memory per-thread but
> would release any build-up of memory in its per-thread pool back
> to a central, non-per-thread pool.  I have no idea how this could be
> made to work to handle all cases as well as or better than the default
> allocator.
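One way such a hybrid might be structured is sketched below. This is entirely hypothetical - not libstdc++ code - and it simplifies aggressively: a single fixed block size, one pool instance per program, and the assumption that the pool outlives all worker threads:

```cpp
#include <cstddef>
#include <mutex>
#include <new>
#include <vector>

// Hypothetical hybrid pool: each thread caches freed blocks locally
// (no locking on the fast path) and spills surplus blocks back to a
// mutex-protected central pool once the local cache grows too large.
class HybridPool {
    static constexpr std::size_t kBlockSize = 64;
    static constexpr std::size_t kSpillThreshold = 32;

    std::mutex central_mutex_;
    std::vector<void*> central_;       // shared pool, locked access only

    struct LocalCache {
        std::vector<void*> blocks;     // per-thread, no locking needed
        HybridPool* owner = nullptr;
        ~LocalCache() {                // thread exit: return everything
            if (owner) owner->spill(blocks, blocks.size());
        }
    };

    void spill(std::vector<void*>& from, std::size_t n) {
        std::lock_guard<std::mutex> lock(central_mutex_);
        while (n-- && !from.empty()) {
            central_.push_back(from.back());
            from.pop_back();
        }
    }

    LocalCache& local() {
        thread_local LocalCache cache;  // assumes one pool per program
        cache.owner = this;
        return cache;
    }

public:
    void* allocate() {
        LocalCache& c = local();
        if (!c.blocks.empty()) {        // fast path: no lock taken
            void* p = c.blocks.back();
            c.blocks.pop_back();
            return p;
        }
        {   // slow path: try to refill from the central pool
            std::lock_guard<std::mutex> lock(central_mutex_);
            if (!central_.empty()) {
                void* p = central_.back();
                central_.pop_back();
                return p;
            }
        }
        return ::operator new(kBlockSize);
    }

    void deallocate(void* p) {
        LocalCache& c = local();
        c.blocks.push_back(p);
        if (c.blocks.size() > kSpillThreshold)  // cap per-thread build-up
            spill(c.blocks, kSpillThreshold / 2);
    }
};
```

The spill step is exactly the "release back to a central non-per-thread pool" behavior the note above asks for; the hard, unsolved part is choosing thresholds that beat the default allocator across all workloads.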
>
> 7. Look at MALLOC_MEMORY path.
>
> S rittle@latour; g++  -O2  -g -DMALLOC_MEMORY -DNO_THREADS  x2.C
> S rittle@latour; time a.out
>      4r     3.5u     0.0s       a.out
> S rittle@latour; g++  -O2  -g -DMALLOC_MEMORY -pthread   x2.C
> S rittle@latour; time a.out
>      5r     4.3u     0.0s       a.out
>
> This case shows that, even for string<>, __USE_MALLOC was not a very
> good configuration choice for single-threaded cases.  (I must confess
> that, reviewing the data I posted at the time of the configuration
> change, I never looked at string<> performance; at the time, I may not
> have even known that string<> shared the memory allocators with the STL.)
>
> However, it is a win for *this* multi-threaded case over the default
> allocator.  It does not beat the per-thread pool.
>
> Conclusions, based on my read of all the data: threading *this*
> CPU-bound problem, which extensively uses string<> with the default
> allocator as this example code does, has some major performance issues,
> but it is not an implementation bug as far as I'm concerned.  The
> performance profile matched my expectation.  You need to ask yourself:
> why are you threading this type of problem?  Unless you have a
> latency/blocking issue, threading buys you little and can be costly, as
> your example code displays well.  And: which allocator do I need to
> use for this problem?  Ideally, you would make your entire
> program use indirectly specified types and then performance-test
> various configurations of type mappings to pick the best selections
> for a given architecture and run profile.
>
> I would still be interested in any example that crashes...  However, I
> must refrain from these detailed performance analyses in the future.
> I have now done various ones for the list, and they are all archived
> and referenced in the FAQ.
>
> Regards,
> Loren

--
Military intelligence is a contradiction in terms.



