basic_string

Fri Oct 20 11:36:00 GMT 2000

Here's my impression of the the state of basic_string in the standard and in
libstdc++.  I'd appreciate corrections.  My conclusion is that reference
counted implementations are not likely to result in reliable code.  I would
like to see the one in libstdc++ replaced by a non-reference-counted
(vector-like) one, which has much simpler correctness and performance
charcteristics. 

1) The standard specifies a complicated set of rules ostensibly to allow
reference counted implementations (21.3, paragraph 5, at least in my
slightly dated copy).

2) Those rules are slightly ambiguous.  In particular, I'm not quite sure
about the intended interpretation of the last clause.  "Any of the above
uses" excludes the exceptions in the preceding clause?  "Subsequent to ...
the first call" refers to the first call of any of the functions in the
second set after any call in the first set?  (I.e. if I call data(); at();
begin(), then only at() is meant?  Or is the first call to begin() after
data() also included?)

3) I have trouble making intelligent guesses about resolving thes
ambiguities because I don't understand the motivation.  It seems to me that
in reality a standard reference counted implementation may need to
invalidate iterators before returning the reference from operator[] in
		t = s; ...s[0]...

I don't see how this is covered by any reasonable interpretation of the
standard.

4) The only real reference counted solution is to mark a string unsharable
if a reference may ever have escaped?  This arguably doesn't allow for much
sharing, since a reference could escape from any call to non-const
operator[].

5) If I interpret it correctly, the standard disallows expressions like f(s,
s[0]) == s[1] where the internals of f are unknown to the client, since f
may invalidate the reference to s[1] without modifying s. I find this a bit
unintutitive. 

6) There is no definition anywhere of what constitutes a correct use of
basic_string in a multithreaded environment.  The rules talk about sequences
of operations.  In a multithreaded context, this means I need to lock around
sequences of string operations to make sure no dangerous operation sneaks in
from another thread?  The usual definition of thread-safety doesn't seem to
apply, since multiple readers can interfere.

7) libstdc++v2 basic_string, as shipped in RedHat 7, is clearly
thread-unsafe, no matter what the definition.  I claim this is due to
oversights related to the complexity of reference counting in basic_string.

8) As far as I can tell, the current libstdc++v3 still has a race between a
call of operator[] on s, and an assignment of s to t.  (Note that neitherr
of these conceptually modify s.  I didn't read the code as carefully as I
should have, so there may be other problems.  I also claim the code is
highly nontrivial.)  I see no evidence that libstdc++v3 basic_string could
be made thread-safe in short order, even if we had a reasonable definition
of what that meant.  Nor am I sure that the problem is solvable with
reasonable overhead, i.e. without locking.

9) Based on various newsgroup postings, there seems to be little evidence
that reference counting is a performance win for short strings, presumably
the intended domain for basic_string.

So:
Is there any good reason not to

1) Ignore 21.3 paragraph 5 of the standard.  (It is perfectly conforming to
ignore it, and to publish the fact that those restrictions are not present
in libstdc++.)

2) Adopt the otherwise standard definition of thread safety, i.e. concurrent
accesses are safe so long as none of them modifies the string, with the
intuitive definition of "modify"?

3) Switch to a non-reference-counted implementation, such as Matt Austern's
from the SGI STL?

Hans