Next, I tried to bypass the call to grouping() in
num_put<char>::_M_widen_int. It turns out that the virtual call to
numpunct<char>::grouping() is expensive. Bypassing this call brings
the runtime from 22s down to 11s!
After some digging in the archives, I found that Nathan Myers
originally added a mechanism to deal with this by caching some of the
formatting strings, paying for the virtual calls once per locale.
However, the format cache was removed in
http://gcc.gnu.org/ml/libstdc++/2001-11/msg00279.html, and I couldn't
find a reason given for its removal. My suspicion is that it was
tied to threading issues, but I don't know.
I added the old format cache back into my test program and got about
the same runtime improvement as from bypassing the call.