Segment Register thread descriptor (was Re: Jv_AllocBytesChecked)

Ulrich Drepper drepper@redhat.com
Fri Jan 5 15:33:00 GMT 2001


"Boehm, Hans" <hans_boehm@hp.com> writes:

> I've needed it recently for Java locks and for faster access to thread-local
> data.  (There's some argument that the latter should be done by speeding up
> pthread_getspecific(), but I suspect the interface may be too constrained
> and the clients too diverse for that to be easily feasible.)

I agree with the latter.  But you haven't mentioned any of these
functions in the last mail.  There is a whole bunch of functions you
want to add.

This introduces a major problem.  Currently the thread library is
largely self-contained.  You can replace the existing implementation
with a new one (which is actually done).  Once you expose things like
access to thread-local data this becomes much harder, since replacement
implementations have to conform to the same binary layout.

> > As for the different functions, it's certainly possible to signal the
> > user which functions are natively available and which would have to be
> > emulated.  This would allow a user to write code perhaps in two
> > different ways, depending on what functionality is available.
> Agreed.  But in addition I would like to see the core functions emulated on
> all platforms.  You should be able to do something like atomically update a
> bit field in a way that doesn't require more than a few lines of user code.

This is in line with the guidelines for glibc.  But it has a major
problem: people are not aware of the problems introduced by the
emulations.  The documentation would have to go to great lengths to
explain why somebody might not want to use the emulations and instead
provide an alternative implementation of the code which needs this
functionality.
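
To illustrate what such an emulation typically looks like (the names
are hypothetical; this is not an existing glibc interface):

    #include <pthread.h>
    #include <stdint.h>

    /* Emulated compare-and-swap for a platform without a native
       instruction: one process-wide lock serializes every "atomic"
       update.  Functionally correct, but with very different
       performance and progress properties than the real instruction,
       and that is exactly what the documentation would have to spell
       out.  */
    static pthread_mutex_t emul_lock = PTHREAD_MUTEX_INITIALIZER;

    static int
    emul_compare_and_swap (volatile uintptr_t *mem,
                           uintptr_t oldval, uintptr_t newval)
    {
      int swapped = 0;
      pthread_mutex_lock (&emul_lock);
      if (*mem == oldval)
        {
          *mem = newval;
          swapped = 1;
        }
      pthread_mutex_unlock (&emul_lock);
      return swapped;
    }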

> I haven't thought about this enough.  I do need to be able to perform a
> compare-and-swap operation on pointer-sized or size_t fields in
> semi-portable code.  

There are also the intptr_t and uintptr_t types.  I forgot to mention them.

Basically, using long, int, short, etc. is wrong.
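
As a sketch (these declarations are hypothetical, just to show the
types involved):

    #include <stdint.h>

    /* Spelling the operands as uintptr_t/intptr_t keeps the same
       prototypes correct on 32-bit and 64-bit configurations; int, long
       or short would hard-code an assumption about the word size.  */
    extern int  generic_compare_and_swap (volatile uintptr_t *mem,
                                          uintptr_t oldval,
                                          uintptr_t newval);
    extern void generic_atomic_add (volatile intptr_t *mem, intptr_t delta);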

> On second thought, the best solution might be to try to specify a
> few variants of known utility and generality (e.g.  compare-and-swap
> with a full barrier or no barrier, store with a release barrier,
> store with a <release or write> barrier), and then add others if
> they turn out to be essential.  Since it should be easy to cause the
> new ones to default to the next most general one, that should be
> reasonably manageable.

Basically what I suggested: come up with a list of problems (i.e.,
situations in which you need the primitives) and see how they can be
implemented on the different architectures.

> In general, I think you have to give the user control of the granularity of
> the lock, which seems to imply that the user needs to declare it somehow.

This is a different issue.  If the critical region is used to do more
than modify one variable, explicit locking (a mutex or spinlock) should
be used.  In the case of a single object, as in

> struct {
>     size_t refcount;
>     DECL_LOCK(rc_lock);
>     ...
> };

this is a problem since it means that programmers can make errors
without seeing them.  If somebody has access only to platforms where
the spinlock is not needed, errors might slip in unnoticed.  These
kinds of things must be automated.

> 
> ...
>     __atomic_add(&p->refcount, LOCK_ARG(&p->rc_lock));

This is interesting.  You want to rely on statement expressions?  I
have no problems with defining a gcc-specific interface...
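
Something along these lines, say (HAVE_NATIVE_ATOMIC_ADD,
acquire_spinlock and release_spinlock are assumptions for
illustration, not an existing interface):

    /* LOCK is expected to be the address of the lock word declared
       with DECL_LOCK.  */
    extern void acquire_spinlock (volatile int *);
    extern void release_spinlock (volatile int *);

    /* This sketch increments *MEM by one.  On a platform with the
       native instruction the LOCK argument is never evaluated, so no
       spinlock has to exist there at all; on the others the gcc
       statement expression keeps the locked update a single
       expression.  The asm is i386-specific.  */
    #ifdef HAVE_NATIVE_ATOMIC_ADD
    # define __atomic_add(mem, lock) \
        __asm__ __volatile__ ("lock; addl $1, %0" : "+m" (*(mem)))
    #else
    # define __atomic_add(mem, lock) \
        ({ acquire_spinlock (lock); ++*(mem); release_spinlock (lock); })
    #endif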

> I think it's still easy for the user to do this with the above.

Easy to get something done wrong.  One thing we've learned in Free
Software over the years is that interfaces have to be fool-proof.
Otherwise somebody will trip over them.

> But I really want something more general.  In particular, I need to
> be able to atomically update any entry in a large array, without
> allocating a lock for every array entry.

Not a problem: associate a lock with the whole array.  But of course
you can say this is not good and that you want to lock individual
blocks of the array.  This again runs into the problem of not wanting
to define too many interfaces; everybody wants something different.
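
For example, a small fixed table of locks with entries mapped onto it
by index works without a lock per entry; a rough sketch (table size
and mapping are arbitrary choices here):

    #include <pthread.h>
    #include <stddef.h>

    #define LOCK_TABLE_SIZE 64

    /* One lock protects all array entries that map to the same slot;
       the memory overhead stays constant no matter how large the array
       is.  The range initializer is a GNU C extension.  */
    static pthread_mutex_t lock_table[LOCK_TABLE_SIZE] = {
      [0 ... LOCK_TABLE_SIZE - 1] = PTHREAD_MUTEX_INITIALIZER
    };

    static void
    atomic_increment_entry (volatile size_t *array, size_t idx)
    {
      pthread_mutex_t *l = &lock_table[idx % LOCK_TABLE_SIZE];
      pthread_mutex_lock (l);
      ++array[idx];
      pthread_mutex_unlock (l);
    }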


> >      BEGIN_CR
> > 
> >        STORE_REL (a, ...some value...)
> > 
> >        if (foo)
> >           STORE_REL (b, ...another one...)
> > 
> >        WMB
> > 
> >      END_CR
> > 
> > But this will suck on architectures with both, st.rel and memory
> > barriers.  Which one to use depends on the actual processor
> > implementation.  So you'll have to introduce special STORE_REL and WMB
> > macro variants which are used if they are used together as in the
> > example above.
> > 
> > There are probably more such cases.
> I don't understand this example.  In my world, the STORE_REL would probably
> be used as part of the END_CR implementation, ensuring that all writes in
> the critical section become visible before the lock is released.

This is possible.  But perhaps there are systems where the critical
section is not protected by a memory object (parallel system design
has seen lots of specialized hardware).  I.e., nobody can rely on
END_CR acting as a memory barrier in general.  So there have to be
variants with and without memory barriers (read and write, and all
combinations).  It must also be possible to use STORE_REL directly.
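
A sketch of what the variants could look like on i386 (the names and
the instruction choice are assumptions for illustration):

    /* The plain variant is an ordinary store and relies on whatever
       barrier the surrounding END_CR supplies; the _WMB variant carries
       its own barrier so it can be used stand-alone.  "lock; addl
       $0,(%esp)" was the usual way to force store ordering on
       i386-class processors before dedicated fence instructions
       existed.  */
    #define STORE_REL(mem, val) \
      (*(mem) = (val))

    #define STORE_REL_WMB(mem, val) \
      do {                                                                    \
        *(mem) = (val);                                                       \
        __asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory", "cc"); \
      } while (0)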

> I was referring only to X86.  Sorry about the confusion.

Even for y86 this changes.

> I've heard claims both ways about whether STORE_REL needs a barrier on X86.
> Based on some fairly heavy duty tests here it doesn't seem to for what I'm
> doing, i.e. stores don't seem to be reordered.  But I'm willing to be
> corrected.  (And the answer may be chipset specific.)

Try P4 processors.  Intel has done something.

> Isn't that largely an orthogonal issue?  On X86, you may well want to use a
> library call to implement compare-and-exchange by default, though probably
> not some of the others.  But all of these things seem to be issues that
> applications currently have to sort out themselves.  Fundamentally, there is
> no way to avoid these problems without the library either.

I don't say that I know a better way.  I just point out the problems.
I personally would definitely want to compile my binaries for my own
machine and therefore sacrifice compatibility with i386 (which I don't
use).  I.e., many configurations are necessary.  For x86 alone you'd
have

   i386, i486, i586, i686, PII, PIII, P4

and all of them in an inline version and a library version.  Each of
these revisions introduces one or another new feature relevant to
multi-processor handling.
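
A sketch of what that selection looks like for compare-and-swap (the
macro and function names here are assumptions):

    #include <stdint.h>

    /* When the binary may still run on an i386, which has no cmpxchg
       instruction, the operation has to go through a library function;
       only when compiling for i486 or newer can it be emitted inline.  */
    extern int __library_compare_and_swap (volatile uintptr_t *mem,
                                           uintptr_t oldval,
                                           uintptr_t newval);

    #if defined __i486__ || defined __i586__ || defined __i686__
    static inline int
    compare_and_swap (volatile uintptr_t *mem,
                      uintptr_t oldval, uintptr_t newval)
    {
      unsigned char result;
      __asm__ __volatile__ ("lock; cmpxchgl %3, %1\n\tsete %0"
                            : "=q" (result), "+m" (*mem), "+a" (oldval)
                            : "r" (newval)
                            : "cc");
      return result;
    }
    #else
    # define compare_and_swap(mem, old, new) \
        __library_compare_and_swap (mem, old, new)
    #endif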

> My assumption is that this would usually be used to implement a set of C++
> primitives, which would be more appropriate for the C++ style.  I would
> first try to reduce or eliminate the machine dependent synchronization stuff
> in libstdc++, which already seems to be somewhat of a goal. 

This is possible, but any such accepted set of primitives should
probably be implemented as compiler primitives (if possible).
Implementing the C++ versions on top of the C builtins is less optimal
than defining appropriate C++ builtin primitives.


-- 
---------------.                          ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------

