This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.

Re: Performance regression with libstdc++-v3

To: libstdc++ at gcc dot gnu dot org
Subject: Re: Performance regression with libstdc++-v3
From: Loren James Rittle <rittle at latour dot rsch dot comm dot mot dot com>
Date: Thu, 17 May 2001 23:44:59 -0500 (CDT)
Organization: Networks and Infrastructure Lab (IL02/2240), Motorola Labs

[Phil and Benjamin, please read this analysis if you have some time.
 Anyone who currently thinks that __USE_MALLOC is the right
 configuration should read this as well.  I make a recommendation for
 the 3.0 release cycle at the end.]

In article <200105161310.PAA47400@numa6.igpm.rwth-aachen.de> you write:

[... fairly straightforward code using only <list> removed...]

> This is a regression in performance of about 65% that came with the
> introduction of libstdc++-v3! (I also tested several snapshots between
> December 2000 and May 2001, but the results are the same. Compiling
> with "-fno-exceptions" doesn't help either.)

> That leads to a couple of questions:
> * Is that a known problem?

Yes, this performance issue is known in general to at least a few of
us.  I had been meaning to look closely at it for some time now.

> * Any ideas what causes this performance breakdown
>   (maybe some debugging code, or is it a problem of the optimizer)?

If you have the right tools installed (gprof will work fine and the
rest of these comments assume gprof): Compile your example with -pg.
After running the built program, run gprof.  I have done that for your
example code.

It appears to me that the main performance degradation arises because
malloc() is called far more often with the new library than with the
old.  These entries account for well over half the total running
time on my platform (after excluding .mcount):

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    

 22.4      17.56     5.30 10000003     0.00     0.00  malloc_bytes [4]
  6.0      21.02     1.43 10000001     0.00     0.00  ifree [6]
  4.0      21.97     0.96 10000001     0.00     0.00  malloc [2]
  3.2      22.74     0.76 10000001     0.00     0.00  free [5]
  2.9      23.42     0.69 10000003     0.00     0.00  imalloc [3]

(Note: malloc_bytes, ifree and imalloc are all internal libc functions
which support malloc() and free() on this particular platform.)

These functions contribute near-zero time in the program compiled
against libstdc++-v2, where malloc() is called 160 times instead of
10000000 + 1.

Recompiling the example against libstdc++-v3 without -O3 and setting a
breakpoint on malloc(), we see that the library gets one list element
at a time with each call to malloc().  The v2 library allocates an
exponentially increasing number of list elements with each call to
malloc().  Both are correct behavior according to the standard AFAIK
(I did some research but nothing worth citing).
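A rough way to observe the one-allocation-per-element behavior
without gprof or a debugger is to replace the global operator new and
count how often it runs.  This is only a sketch, not the code from the
original report: g_new_calls and count_list_allocations are
hypothetical names, and it assumes a C++11 compiler whose default
allocator asks the heap for one list node at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <list>
#include <new>

// Hypothetical counter: number of calls to the replaced operator new.
static std::size_t g_new_calls = 0;

// Replacement global allocation function that counts each request
// before forwarding it to malloc().
void* operator new(std::size_t size) {
    ++g_new_calls;
    if (void* p = std::malloc(size ? size : 1)) return p;
    throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Returns how many times operator new ran while n ints were appended
// to a std::list that uses the default allocator.
std::size_t count_list_allocations(std::size_t n) {
    std::list<int> l;
    const std::size_t before = g_new_calls;
    for (std::size_t i = 0; i < n; ++i) l.push_back(static_cast<int>(i));
    assert(l.size() == n);
    return g_new_calls - before;
}
```

With a library that fetches one node per heap request, the returned
count grows linearly with n; with a pooling allocator behind
operator new's caller it would grow far more slowly.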

> * Are there any compiler options that I can use to increase performance?

Since the problem is the core algorithm being selected, this problem
can't be solved by a mere compiler switch.  This will require some
human tuning.

> * Can somebody improve speed again, *please*?

There are two ways to proceed:

(1) You can provide an alternate allocator in your code to ensure that
list elements are allocated in a way that is efficient for your problem.

This line:

    std::list<int> List;

must be changed to (e.g.):

    std::list<int, exponential_pool_allocator<int> > List;

And you must find/write code to encode that element allocation policy.

Stroustrup gives an example user-provided pool allocator
(non-exponential) in TC++PL-SE.
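For illustration, option (1) might look roughly like the following.
This is a minimal sketch and not Stroustrup's allocator or the v2
library's: the name exponential_pool_allocator is only the placeholder
used above, ExponentialPool, g_chunk_requests and heap_requests_for are
hypothetical helpers, and the code assumes a C++11 compiler,
single-threaded use, and element types that need no more than
max_align_t alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <new>
#include <vector>

static std::size_t g_chunk_requests = 0;  // chunks the pool took from the heap

// Hands out fixed-size slots carved from heap chunks that double in size.
class ExponentialPool {
public:
    explicit ExponentialPool(std::size_t slot_size) {
        const std::size_t a = alignof(std::max_align_t);
        if (slot_size < sizeof(void*)) slot_size = sizeof(void*);
        slot_size_ = (slot_size + a - 1) / a * a;  // keep every slot aligned
    }
    ~ExponentialPool() { for (void* c : chunks_) ::operator delete(c); }
    ExponentialPool(const ExponentialPool&) = delete;
    ExponentialPool& operator=(const ExponentialPool&) = delete;

    void* allocate() {
        if (free_list_) {                     // reuse a freed slot first
            void* p = free_list_;
            free_list_ = *static_cast<void**>(p);
            return p;
        }
        if (remaining_ == 0) grow();
        void* p = next_;
        next_ += slot_size_;
        --remaining_;
        return p;
    }
    void deallocate(void* p) {                // push the slot on the free list
        *static_cast<void**>(p) = free_list_;
        free_list_ = p;
    }

private:
    void grow() {
        chunks_.push_back(nullptr);           // reserve the bookkeeping entry
        chunks_.back() = ::operator new(next_chunk_slots_ * slot_size_);
        ++g_chunk_requests;
        next_ = static_cast<char*>(chunks_.back());
        remaining_ = next_chunk_slots_;
        next_chunk_slots_ *= 2;               // exponential growth policy
    }
    std::size_t slot_size_ = 0;
    std::size_t next_chunk_slots_ = 16;
    std::size_t remaining_ = 0;
    char* next_ = nullptr;
    void* free_list_ = nullptr;
    std::vector<void*> chunks_;
};

template <class T>
struct exponential_pool_allocator {
    using value_type = T;
    exponential_pool_allocator() noexcept {}
    template <class U>
    exponential_pool_allocator(const exponential_pool_allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n == 1) return static_cast<T*>(pool().allocate());
        return static_cast<T*>(::operator new(n * sizeof(T)));  // arrays bypass the pool
    }
    void deallocate(T* p, std::size_t n) noexcept {
        if (n == 1) pool().deallocate(p);
        else ::operator delete(p);
    }

private:
    static ExponentialPool& pool() {
        static ExponentialPool p(sizeof(T));  // one pool per element type
        return p;
    }
};
template <class T, class U>
bool operator==(const exponential_pool_allocator<T>&,
                const exponential_pool_allocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const exponential_pool_allocator<T>&,
                const exponential_pool_allocator<U>&) noexcept { return false; }

// Returns how many chunks the pool requested while n ints were appended.
std::size_t heap_requests_for(std::size_t n) {
    const std::size_t before = g_chunk_requests;
    std::list<int, exponential_pool_allocator<int> > l;
    for (std::size_t i = 0; i < n; ++i) l.push_back(static_cast<int>(i));
    assert(l.size() == n);
    return g_chunk_requests - before;
}
```

Because each chunk holds twice as many slots as the previous one, the
number of heap requests grows roughly with the logarithm of the
element count rather than linearly.  The pool is never shrunk and is
destroyed at program exit, which may or may not suit a real
application.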

(2) You could remove this line of libstdc++-v3/include/bits/c++config:

#define __USE_MALLOC

and rebuild libstdc++-v3.  Then, you will get about the same allocator
algorithm that was used in libstdc++-v2.  However, there might be a
*really* good reason why Benjamin selected the __USE_MALLOC
configuration...  Unfortunately, spelunking the digital caves we know
as the libstdc++-v3/ChangeLog files has not yielded any information.
There were 11 hits on __USE_MALLOC on our mailing list over the past
few years but most were buried in diffs.  Phil most recently mentioned
__USE_MALLOC in response to someone that thought libstdc++-v3 was
leaking memory.

I attempted the second course of action with your example code:

; /usr/bin/g++ -O3 P.C # 2.95.3 with libstdc++-v2
; time a.out
     4r     3.6u     0.7s       a.out

; /usr/local/beta-gcc/bin/g++ -O3 P.C # 3.0 pre-release with libstdc++-v3
; time a.out
     3r     2.3u     1.3s       a.out

Now, the example you posted looks like it would be completely
different than real application code.  However, when I tried the
example I posted about a week ago today (see
http://gcc.gnu.org/ml/libstdc++/2001-05/msg00053.html) involving heavy
use of STL code, it now runs 33% faster with libstdc++-v3 and g++ 3.0
pre-release than libstdc++-v2 and g++ 2.95.3 instead of about twice as
slow.  Sweet!

Benjamin/Phil, do you remember off-hand why the __USE_MALLOC
configuration of the allocator was selected?  If not, we might want to
seriously reconsider the selection before the 3.0 release.  (I know it
is *very* late to be suggesting this change, but STL code running
about three times slower than it could will be an important turnoff to
people we are trying to convince that g++ 3.0 and libstdc++-v3 are
ready for prime-time.)

BTW, I reran the testsuite; at least for non-threaded code, zero
regressions were spotted on mainline.  From commentary on the SGI STL
WWW site, I see that the __USE_MALLOC configuration is recommended to
help find memory leaks in the application.  From studying
libstdc++-v3/include/bits/stl_alloc.h, I see that the default path
(used when __USE_MALLOC is not defined) codes a mutex lock when
__STL_THREADS is defined.  The user can avoid actually paying the
price of locking the mutex in application code by defining the
_NOTHREADS macro on the compiler command line.

After making sure all this works the way it is supposed to, I see zero
downside to making this configuration change for the release.

Regards,
Loren

References:
- Performance regression with libstdc++-v3
  - From: Reichelt
