This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.



Tackling library IO performance


Hi all.  I've been investigating the source of slowdowns in PR 8761.
On my machine, gcc 2.95.4 -O2 runs the testcase in about 4 seconds.
gcc 3.3 -O2 runs it in about 25 seconds.

I wasn't able to get profiling to work on the standard library, so I
built a version of the test program with about half the library in a
separate namespace.  This program runs a bit slower (27s) than the one built
with the standard library, but I think it should be close enough.

The first thing I did was to implement a specialization for
__convert_from_v<long>.  I replaced the snprintf code with an inline
implementation pulled from gcc 2.95.  Since the C locale is always
used, I could rip out and ignore the locale.  This change brought my
testcase runtime from 27s to 22s.
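The inline replacement for snprintf is the classic digits-in-reverse loop. A minimal sketch of that idea (this is my illustration, not the actual gcc 2.95 code; the function name is invented):

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: format a long without snprintf by writing
// digits right-to-left into a scratch buffer, as itoa-style code does.
static std::string format_long(long v)
{
    char buf[24];                       // enough for 64-bit long + sign
    char* p = buf + sizeof(buf);
    // Negate via unsigned to avoid overflow on the most negative value.
    unsigned long u = (v < 0) ? 0UL - static_cast<unsigned long>(v)
                              : static_cast<unsigned long>(v);
    do {
        *--p = static_cast<char>('0' + u % 10);  // emit lowest digit
        u /= 10;
    } while (u != 0);
    if (v < 0)
        *--p = '-';
    return std::string(p, buf + sizeof(buf));
}
```

Because the C locale applies no grouping, the digits can go straight into the output with no further locale processing.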

Next, I tried to bypass the call to grouping() in
num_put<char>::_M_widen_int.  It turns out that the virtual call to
numpunct<char>::grouping() is expensive.  Bypassing this call brings
the runtime from 22s down to 11s!

After some digging in the archives, I found that Nathan Myers
originally added a mechanism to deal with this by caching some of the
formatting strings, paying for the virtual calls once per locale.
However, the format cache was removed in
http://gcc.gnu.org/ml/libstdc++/2001-11/msg00279.html, and I couldn't
find a reason for its removal.  My suspicion is that it is tied to
threading issues, but I don't know.

I added the old format cache back in to my test program and got about
the same runtime improvement.  But instead of using xalloc() to get an
index, I reserved index 0 for the format cache.  This makes a number
of things simpler.  Also, it appears to me that xalloc() doesn't
return a number less than 5, unless I'm misreading the code.
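To illustrate the caching idea (the names and layout here are invented for the sketch, not the original cache code), the numpunct data can be fetched once per stream through a reserved pword() slot, paying the virtual grouping() call a single time:

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical per-stream cache of numpunct<char> data.
struct __num_cache {
    std::string grouping;
    char        thousands_sep;
};

// Index 0 assumed reserved for the cache instead of using xalloc().
const int __cache_index = 0;

static __num_cache* __get_cache(std::ios_base& io)
{
    void*& p = io.pword(__cache_index);   // null on first use
    if (!p) {
        const std::numpunct<char>& np =
            std::use_facet<std::numpunct<char> >(io.getloc());
        __num_cache* c = new __num_cache;  // leaks in this sketch; real
        c->grouping = np.grouping();       // code must hook imbue() and
        c->thousands_sep = np.thousands_sep();  // stream destruction
        p = c;
    }
    return static_cast<__num_cache*>(p);
}
```

Subsequent insertions then read the cached strings directly instead of making virtual calls into the facet on every number.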

The third thing I did was to try to further improve _M_widen_int().
For the test case, ctype<char>.widen translates into a useless memcpy.
Removing the call reduces the runtime by another 1s.  I saw a
reference in the archives and bug database saying that
ctype<char>.do_is() may bypass the virtual call for efficiency.  If
the same is true of widen, then we could safely skip the call to
ctype<char>.widen.
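The reason the copy is useless is that in the classic "C" locale, ctype<char>::widen is the identity mapping on the characters involved. A quick check of that property (my own illustration):

```cpp
#include <locale>

// In the classic "C" locale, widen() maps each narrow digit to itself,
// so widening the formatted digits is just a redundant copy.
static bool widen_is_identity_for_digits()
{
    const std::ctype<char>& ct =
        std::use_facet<std::ctype<char> >(std::locale::classic());
    for (char c = '0'; c <= '9'; ++c)
        if (ct.widen(c) != c)
            return false;
    return true;
}
```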

Assuming that it is possible to safely skip the widen call and we have
the format cache, we can skip making a local copy of the locale.  This
tweak bought me another 1s or so.

Finally, I wanted to improve the num_put<char>::_M_insert function.
This function inserts using the iterator by default, which translates
into repeated calls to streambuf::sputc.  I replaced this with a
single call to streambuf::sputn.  This decreased the runtime by
another 1s.  To do this, I had to add an _M_xxx accessor to the
ostreambuf_iterator so that I could access the underlying streambuf.
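The shape of that change, reduced to its essentials (a sketch, not the actual patch): once the underlying streambuf is reachable, the whole formatted buffer goes out in one call rather than character by character.

```cpp
#include <sstream>
#include <cstring>

// Sketch of the _M_insert idea: one sputn call for the whole buffer
// instead of a loop of per-character sputc calls through the iterator.
static std::streamsize put_all(std::streambuf* sb, const char* s)
{
    return sb->sputn(s, static_cast<std::streamsize>(std::strlen(s)));
}
```

sputn lets the streambuf copy the run in bulk (typically a single memcpy into its put area), which is where the saving comes from.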

All told, I was able to chisel my test program from 27 seconds down to
8.  It's not quite as fast as gcc 2.95.4 on this program, but it's
much closer.

In addition, once I dropped using the snprintf call, the format string
is no longer necessary.  Changing __convert_from_v (actually, creating
another function), and passing in the fmtflags instead wins another
0.5s or so.  The bulk of that win is from removing the call to
S_format_int().


-------

C locale integer printing is pretty common and I'd really like to get
it sped up.  I want to send in patches for all these changes,
especially to 3.3, since this is a regression and we're going to live
with this compiler for a while.

I wanted to get some feedback before I crank out patches that people
don't want, especially since the format cache had been removed once.
Are there things I should be careful of or avoid doing in putting
these patches together?

Thanks,
Jerry Quinn

