This is the mail archive of the mailing list for the libstdc++ project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: libstdc++/7445: poor performance of std::locale::classic() in multi-threaded applications

> I'm not excited by this patch. There has got to be a better way. Loren,
> can you commet on this? I should have just waited for you before I
> checked this in anyway. 

Hi, I am sorry Benjamin but, unless I have reported otherwise and am
just now forgetting it, I have never reviewed any of the threading
issues/code in the locale part of the library.  I have now looked at
this one tight issue (but not all the locale code).

I agree that src/ 1.62 is flawed.  I agree with Andrew
Pollard that your proposed patch (dated 2 Aug 2002 16:20:20 -0000
against 1.62) is flawed.  Andrew has implemented the so-called
Double-Checked Lock Pattern.  This is a performance speedup first
proposed/widely written about by Dr. Douglas C. Schmidt and his group.
I have read that (and I think Doug's paper itself discussed these
issues), for some CPU/memory architectures, it is possible that the
Double-Checked Lock Pattern will not work.  E.g. on any system that
reorders writes especially in light of multiple CPUs.

The POSIX thread system has the property that when *all* accesses of a
particular memory region (which may be disjointed and need not be
expressed in code) are guarded by a particular mutex, the memory
region is guaranteed to appear consistent in a threaded,
multiprocessor system.  In particular, to make this happen the mutex
implementation may force a flush of all pending writes at the mutex
lock/unlock points (or, they may especially mark all reads and writes
between those boundary points).  The DC locking violates this rule
(but can usually by made to work for any one particular set of
hardware - the trick is making it widely portable).  So does the code
segment in 1.62 (there is a reference to _S_classic outside the
otherwise mutex-protected region).  So does your first proposed patch
of 1.62.

To shift focus for a moment, this PR smacks of a problem found by
"code review" not actual profiling.  The code as written before the PR
was right (if non-optimal).  The patch provided with the PR is flawed
(for same reason as above).  Is std::locale::classic() actually called
so often in nominal cases as to matter?  A mutex lock/unlock operation
takes a certain amount of time.  At quick glance, the guarded init
code would appear to require *well* over that amount of time for
modern platforms.  Thus (and I can't judge this myself), if
std::locale::classic() is nominally called zero or just a few times,
the code as written before the PR patch was probably near optimal in
handling various tradeoffs.  If std::locale::classic() will be called
many times in nominal cases, then I agree that we might be non-optimal
here.  Back to the optimization discussion...

OK, does this mean we can't optimize this situation at all?  Well, it
turns out that POSIX makes a very similar guarantee related to memory
consistency whenever a new thread is started.  Thus, if we initialize
something which will later be read-only when there is exactly one
thread running, then we are completely within the rules to avoid the
cost of a mutex acquisition once we go threaded (I hereby state my
assumption that other thread models supported by our abstraction layer
were written for simpler hardware and/or support POSIX semantics since
it is near impossible to reason about the system without this rule).

Is there a reason why locale::c_locale should not be produced while
the system is still running library init code in a single thread?  I
assume the answer will be either: (a) ``if a particular program run
wouldn't have called locale::classic(), then we are not allowed to do
related setup work per standard section X.X.X.X'' or, more likely, (b)
``as an optimization, we didn't want to do the work or allocate memory
in any case where locale::classic() would not be called by the program''.

If (a) then we might be stuck with a slow locale::classic()
implementation (i.e. ``well-wrapped'' ;-) unless we want to sort out
all the issues related to DC locking for all environments to which
libstdc++-v3 ports.

If (b) then perhaps we could, as an optimization for platforms that
support WEAK (I already hear the moaning from some quarters ;-),
convert the init code for c_locale to run while we are single-threaded
guarded by a check of a weak symbol ``&locale::classic()''.  Does the
standard say anything about making such symbols weak?  And/or abusing
their reference or lack thereof to support optimizing library init

I hereby propose that the PR patch and all follow-on patches be
unrolled until someone produces a real test case that performs poorly.
Performance patches produced by code inspection should hold no merit
with us.

Loren J. Rittle, Principal Staff Engineer, Motorola Labs (IL02/2240), KeyID: 2048/ADCE34A5, FDC0292446937F2A240BC07D42763672

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]