Segment Register thread descriptor (was Re: Jv_AllocBytesChecked)

Fri Jan 5 16:28:00 GMT 2001

> From: Ulrich Drepper [ mailto:drepper@redhat.com ]
> 
> "Boehm, Hans" <hans_boehm@hp.com> writes:
> 
> > I've needed it recently for Java locks and for faster access to
> > thread-local data.  (There's some argument that the latter should be
> > done by speeding up pthread_getspecific(), but I suspect the interface
> > may be too constrained and the clients too diverse for that to be
> > easily feasible.)
> 
> I agree with the latter.  But you haven't mentioned any of these
> functions in the last mail.  There is a whole bunch of functions you
> want to add.
I'm not proposing to add a replacement for pthread_getspecific, since I'm
not sure there is a sufficiently general one.  I'm proposing to add enough
so that I can hash on a thread id myself, since that often seems to be
appreciably faster than the current scheme.  The motivating observation here
was that this is the same facility that's needed by custom implementations
of any sort of recursive locking, e.g. in libgcj, so we know there are
multiple (well at least 2) clients.
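
To make the recursive-locking case concrete, here is a minimal sketch of
the kind of lock a client ends up writing by hand.  The names (rec_lock,
rec_lock_acquire, and so on) are hypothetical and this is not the libgcj
code; it uses only standard pthread calls.  Every acquire has to ask "who
am I?" via pthread_self(), which is exactly the step a cheap thread
identifier would speed up.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical recursive lock layered on two plain mutexes. */
typedef struct {
    pthread_mutex_t mu;     /* the lock clients actually contend on */
    pthread_mutex_t guard;  /* protects owner and depth */
    pthread_t owner;        /* meaningful only while depth > 0 */
    int depth;              /* nesting count held by the owner */
} rec_lock;

void rec_lock_init(rec_lock *l) {
    pthread_mutex_init(&l->mu, NULL);
    pthread_mutex_init(&l->guard, NULL);
    l->depth = 0;
}

void rec_lock_acquire(rec_lock *l) {
    pthread_t me = pthread_self();  /* the thread-id lookup on every call */

    pthread_mutex_lock(&l->guard);
    if (l->depth > 0 && pthread_equal(l->owner, me)) {
        l->depth++;                 /* re-entry by the current owner */
        pthread_mutex_unlock(&l->guard);
        return;
    }
    pthread_mutex_unlock(&l->guard);

    pthread_mutex_lock(&l->mu);     /* first acquisition: may block */
    pthread_mutex_lock(&l->guard);
    l->owner = me;
    l->depth = 1;
    pthread_mutex_unlock(&l->guard);
}

void rec_lock_release(rec_lock *l) {
    int remaining;

    pthread_mutex_lock(&l->guard);
    assert(l->depth > 0 && pthread_equal(l->owner, pthread_self()));
    remaining = --l->depth;
    pthread_mutex_unlock(&l->guard);

    if (remaining == 0)
        pthread_mutex_unlock(&l->mu);  /* outermost release */
}
```

The guard mutex is there only to keep the sketch obviously correct; a
tuned version would want to compare thread ids without taking any lock,
which again depends on having a fast way to get the id.
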
> 
> This introduces a major problem.  Currently the thread library is
> largely self contained.  You can replace the existing implementation
> with a new one (which is actually done).  Once you expose things like
> the access to TLD this is much harder since replacing implementations
> have to conform in their binary layout.
At a minimum, you could expose the thread register when it exists.  That
would largely preserve thread library independence, since the register
pretty much has to be defined as part of the ABI, and any thread library
that doesn't use it would be sacrificing performance.

This is clearly not the most important part of the proposed library, since
it's not the one that's regularly replicated. I should have put it later in
the list ...
> 
> > > As for the different functions, it's certainly possible to signal the
> > > user which functions are natively available and which would have to be
> > > emulated.  This would allow a user to write code perhaps in two
> > > different ways, depending on what functionality is available.
> > Agreed.  But in addition I would like to see the core functions
> > emulated on all platforms.  You should be able to do something like
> > atomically update a bit field in a way that doesn't require more than
> > a few lines of user code.
> 
> This is in line with the guidelines for glibc.  But it has a major
> problem: people are not aware of the problems introduced by the
> emulations.  The documentation would have to go to great length to
> explain why somebody might not want to use the emulations and instead
> provide alternative implementations of the code which needs this
> functionality.
Right.  But isn't that a fact of life for almost anything related to
threads?
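
For reference, the bit-field case is the sort of thing that fits in a few
lines once a compare-and-swap is available.  The sketch below uses C11
<stdatomic.h> purely as a stand-in for whatever primitive the library
would provide (that header is an assumption here, not part of this
proposal), and atomic_update_bits is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Atomically replace the bits of *word selected by mask with the
 * corresponding bits of value.  Returns the word as it was before
 * the update. */
uint32_t atomic_update_bits(_Atomic uint32_t *word,
                            uint32_t mask, uint32_t value) {
    uint32_t old, desired;

    assert(word != NULL);
    old = atomic_load_explicit(word, memory_order_relaxed);
    do {
        /* Keep the bits outside mask, splice in the new field. */
        desired = (old & ~mask) | (value & mask);
        /* On failure, old is reloaded with the current contents. */
    } while (!atomic_compare_exchange_weak_explicit(
                 word, &old, desired,
                 memory_order_acq_rel, memory_order_relaxed));
    return old;
}
```

On a platform without a native compare-and-swap the loop body stays the
same and only the primitive underneath is emulated, which is where the
hidden cost mentioned above would come from.
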
>
... 
> 
> > On second thought, the best solution might be to try to specify a
> > few variants of known utility and generality (e.g.  compare-and-swap
> > with a full barrier or no barrier, store with a release barrier,
> > store with a <release or write> barrier), and then add others if
> > they turn out to be essential.  Since it should be easy to cause the
> > new ones to default to the next most general one, that should be
> > reasonably manageable.
> 
> Basically what I suggested: come up with the list of problems (means
> situations in which you need the primitives) and see how they can be
> implemented on the different architectures.
> 
> > In general, I think you have to give the user control of the
> > granularity of the lock, which seems to imply that the user needs to
> > declare it somehow.
> 
> This is a different issue.  If the critical region is used to do more
> than modify one variable, explicit locking (mutex or spinlock) should
> be used.  In the case of a single object as in
> 
> > struct {
> >     size_t refcount;
> >     DECL_LOCK(rc_lock);
> >     ...
> > };
> 
> this is a problem since it means that programmers can make errors
> without seeing them.  If somebody has access only to platforms where
> the spinlock is not needed errors might slip in unnoticed.  These kinds
> of things must be automated.
Again, it seems to me this is unavoidable.  You need to test on different
kinds of platforms.  We already have greatly varying memory models
(apparently even within the X86 line), which require that.  And you can omit
volatile all over the place if you have the right compiler.  And byte writes
are atomic until you find an old Compaq Alpha. For this one you could at
least provide an option to test with generic code, even if the hardware
doesn't require it.
> 
> > 
> > ...
> >     __atomic_add(&p->refcount, LOCK_ARG(&p->rc_lock));
> 
> This is interesting.  You want to rely on statement expressions?  I
> have no problems with defining a gcc-specific interface...
I didn't think I was relying on them.  LOCK_ARG should be _LOCK_ARG, and
the call needs another argument, e.g. 1.  The macro is defined to return
either 0 or its argument.  I wasn't intending to do anything profound.
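
A minimal sketch of what that could look like, using plain macros and an
ordinary function rather than statement expressions.  Everything here is
an assumption made for illustration: SKETCH_NATIVE_ATOMICS is a
hypothetical configuration switch, sketch_atomic_add stands in for
__atomic_add, the extra argument is taken to be the amount to add, and
the leading-underscore spelling is dropped because such names are
reserved for the implementation.  The native path uses the GCC/Clang
__atomic_add_fetch builtin; the generic path uses a pthread mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#ifdef SKETCH_NATIVE_ATOMICS            /* hypothetical switch */
/* Hardware does the work: the lock argument collapses to 0.  A one-byte
 * placeholder avoids a stray semicolon in the struct. */
# define DECL_LOCK(name)  char name##_placeholder
# define LOCK_INIT(l)     ((void)0)
# define LOCK_ARG(l)      ((pthread_mutex_t *)0)
#else
/* Generic code: the user-declared lock is real and is passed through. */
# define DECL_LOCK(name)  pthread_mutex_t name
# define LOCK_INIT(l)     pthread_mutex_init((l), NULL)
# define LOCK_ARG(l)      (l)
#endif

struct counted {
    size_t refcount;
    DECL_LOCK(rc_lock);
};

void counted_init(struct counted *p) {
    assert(p != NULL);
    p->refcount = 0;
    LOCK_INIT(&p->rc_lock);
}

/* Add n to *p and return the new value.  A null lock means "use the
 * native primitive"; otherwise the addition is done under the lock. */
size_t sketch_atomic_add(size_t *p, size_t n, pthread_mutex_t *lock) {
    size_t result;

    if (lock == NULL)
        return __atomic_add_fetch(p, n, __ATOMIC_SEQ_CST);

    pthread_mutex_lock(lock);
    result = (*p += n);
    pthread_mutex_unlock(lock);
    return result;
}
```

A call then reads sketch_atomic_add(&p->refcount, 1,
LOCK_ARG(&p->rc_lock)).  Building without the switch forces the generic
path even on hardware that doesn't need it, which is one way to get the
testing option mentioned earlier.
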
> 
> > I think it's still easy for the user to do this with the above.
> 
> Easy to get something done wrong.  One thing we've learned in Free
> Software over the years is that interfaces have to be defined
> fool-proof.  Otherwise somebody will trip.
As much as possible certainly.  But currently most people have to reinvent
this stuff, and the chance of getting it right the first time is far smaller
than that of using the interface correctly.
> 
> > But I really want something more general.  In particular, I need to
> > be able to atomically update any entry in a large array, without
> > allocating a lock for every array entry.
> 
> Not a problem.  Associate a lock with the whole array.  But of course
> you can say this is not good and you want to lock individual blocks in
> the array.  This plays again in the area of not having to define too
> many interfaces.  Everybody wants something different.
Which is why I would prefer to let the user declare the locks.
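
As an illustration of one granularity a user might pick, here is a sketch
in which a small fixed pool of locks covers a large array, with the entry
index choosing the lock.  The type and function names and the pool size
of 16 are hypothetical; only standard pthread calls are used.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define N_STRIPES 16   /* illustrative pool size chosen by the user */

typedef struct {
    long *data;                          /* caller-owned storage */
    size_t len;
    pthread_mutex_t stripes[N_STRIPES];  /* user-declared locks */
} striped_array;

void striped_init(striped_array *a, long *data, size_t len) {
    size_t i;

    a->data = data;
    a->len = len;
    for (i = 0; i < N_STRIPES; i++)
        pthread_mutex_init(&a->stripes[i], NULL);
}

/* Add delta to entry i under the lock that covers it and return the new
 * value.  Entries whose indices are equal modulo N_STRIPES share a lock. */
long striped_add(striped_array *a, size_t i, long delta) {
    pthread_mutex_t *m;
    long result;

    assert(i < a->len);
    m = &a->stripes[i % N_STRIPES];
    pthread_mutex_lock(m);
    result = (a->data[i] += delta);
    pthread_mutex_unlock(m);
    return result;
}
```

Someone else might want one lock for the whole array or one per block;
with user-declared locks that is a change to the index-to-lock mapping,
not to the interface.
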
> 
> > I was referring only to X86.  Sorry about the confusion.
> 
> Even for x86 this changes.
> 
> > I've heard claims both ways about whether STORE_REL needs a barrier on
> > X86.  Based on some fairly heavy duty tests here it doesn't seem to for
> > what I'm doing, i.e. stores don't seem to be reordered.  But I'm
> > willing to be corrected.  (And the answer may be chipset specific.)
> 
> Try P4 processors.  Intel has done something.
> 
Interesting. Have they published/admitted to anything?  This would certainly
affect correctness of the garbage collector in libgcj.

Hans