This is the mail archive of the java-discuss@sources.redhat.com mailing list for the Java project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
RE: Segment Register thread descriptor (was Re: Jv_AllocBytesChec ked)

To: "'drepper at cygnus dot com'" <drepper at cygnus dot com>, "Boehm, Hans" <hans_boehm at hp dot com>
Subject: RE: Segment Register thread descriptor (was Re: Jv_AllocBytesChec ked)
From: "Boehm, Hans" <hans_boehm at hp dot com>
Date: Fri, 5 Jan 2001 14:51:00 -0800
Cc: tromey at redhat dot com, "'Bryce McKinlay'" <bryce at albatross dot co dot nz>, "'java-discuss at sources dot redhat dot com'" <java-discuss at sources dot redhat dot com>, "MOSBERGER, DAVID (HP-PaloAlto,unix3)" <davidm at hpl dot hp dot com>, Benjamin Kosnik <bkoz at redhat dot com>
> From: Ulrich Drepper [mailto:drepper@redhat.com]
> > 2) An inline function for retrieving such a thread id as 
> fast as possible.
> 
> And where do you want to use this?
I've needed it recently for Java locks and for faster access to thread-local
data.  (There's some argument that the latter should be done by speeding up
pthread_getspecific(), but I suspect the interface may be too constrained
and the clients too diverse for that to be easily feasible.  The thread
local allocation facility in recent versions of my collector could use it,
though it currently gets by with the high order bits of the stack pointer.
Malloc may eventually want to do the same.  See
http://www.hpl.hp.com/techreports/2000/HPL-2000-165.html for some details.)
> 
> > 3) Some collection of atomic update primitives.  I could live with
> > compare-and-swap as the only one.  There's a performance 
> advantage to be
> > gained from providing more, but it's much smaller than the 
> difference
> > between compare-and-swap and explicit locking.
> 
> This is already horribly complicated.  Since it would also have to
> work on architectures (revisions) which don't have such primitives
> built in it means you'll have to be able to emulate this.  But this
> means you have to have special ways to define such atomically updated
> objects.  In some case it'll not only be the object, you might have to
> define a spinlock with it.  (Sometimes not even a spinlock is
> sufficient.)
Right.  But I think this is unavoidable complexity.  And having the
complexity in one place with a documented interface is much better than
reproducing it in all the clients, which is what we have now.  Plus the
client-based implementations tend to be worse, since they were designed
incrementally on an as-needed basis.  (I'm definitely guilty of that, with
both the implementation in the GC and my part of stl_threads.h.)
> 
> We have this problem in the thread library and it's ugly.  Exporting
> this means putting this ugliness in stone.
But I think the "ugly" interface is far cleaner than what real clients are
actually using now.  And I don't see how this problem will go away.
> 
> As for the different functions, it's certainly possible to signal the
> user which functions are natively available and which would have to be
> emulated.  This would allow a user to write code perhaps in two
> different ways, depending on what functionality is available.
Agreed.  But in addition I would like to see the core functions emulated on
all platforms.  You should be able to do something like atomically update a
bit field in a way that doesn't require more than a few lines of user code.
> 
> >  	- There should be variants for at least int and long.
> 
> This is wrong.  All operations should be available only for intX_t and
> uintX_t types (perhaps intmax_t and uintmax_t).  Additionally perhaps
> the same for the int_fastX_t and uint_fastX_t types if this makes a
> difference.
I haven't thought about this enough.  I do need to be able to perform a
compare-and-swap operation on pointer-sized or size_t fields in
semi-portable code.  
> 
> > 	- There should be variants that add different memory barrier
> > properties.  The ones
> > 	  I've found useful for compare-and-swap are:
> > 		- none
> > 		- acquire (later memory operations cannot move forward)
> > 		- release (earlier memory operations cannot 
> move backwards
> > past the atomic op.)
> 
> The question is whether this is really all and whether it's useful to
> make the differentiation at this high level at all.  There might be
> new processor implementation which introduce even more variants.
David Mosberger points out that Alpha makes the distinction between read and
write barriers, as opposed to acquire and release barriers, and there are
certainly others.  On second thought, the best solution might be to try to
specify a few variants of known utility and generality (e.g.
compare-and-swap with a full barrier or no barrier, store with a release
barrier, store with a <release or write> barrier), and then add others if
they turn out to be essential.  Since it should be easy to cause the new
ones to default to the next most general one, that should be reasonably
manageable.
> 
> > 	- They have a lock parameter, of a type declared in the 
> header file.
> > This parameter
> > 	  is unused if the hardware provides the operation.  If 
> the hardware
> > fails to provide
> > 	  the operation, the call attempts to acquire the lock for the
> > duration of the operation.
> 
> This is very cumbersome.  The definitions should take care of this
> themselves.  Again, look at what we have done in glibc.
How is it done in glibc?  It looks like the generic implementation of
atomicity.h implements the primitives in a way that are not atomic?  Am I
looking in the wrong place?

In general, I think you have to give the user control of the granularity of
the lock, which seems to imply that the user needs to declare it somehow.

It doesn't seem that cumbersome to me.  In the typical case, you might have

struct {
    size_t refcount;
    DECL_LOCK(rc_lock);
    ...
};

...
    __atomic_add(&p -> refcount, LOCK_ARG(&p ->rc_lock));

or perhaps

   while(!__atomic_add(&p -> refcount, LOCK_ARG(&p ->rc_lock));

if __atomic_add returns false on lock contention (which may not be the right
choice).
> 
> 
> > 4) Suitable macro definitions to allow locks for atomic 
> operations to be
> > introduced and used so
> > that they disappear when they are not needed, e.g. a 
> DECL_LOCK(name) macro
> > to declare the lock,
> > and a LOCK_ARG(name) macro to pass it if it's needed, or to 
> pass zero
> > otherwise.
> 
> Maybe this is what I meant above.  I am not sure, though.  In short,
> one would need macros like
> 
> #define DEFVAR(type, var) \
>    type var; \
>    lock_type_t var##_lock
> 
> for an architecture without support for atomic operations and just
> 
> #define DEFVAR(type, var) \
>    type var
> 
> otherwise.  The use macro can be
> 
> #define UPDATE(var, val) \
>    lock (var##_lock);
>    var = val;
>    unlock (var##_lock)
> 
I think it's still easy for the user to do this with the above.  But I
really want something more general.  In particular, I need to be able to
atomically update any entry in a large array, without allocating a lock for
every array entry.  (I may get some contention by allocating a single lock
for the entire array, but in many cases even that may still be optimal.
E.g. the garbage collector needs to atomically set a bit in a vector of mark
bits.  I really don't want to allocate a lock per word in the vector, and I
wouldn't easily work with the above.)
> 
> > 5) A way to write to a memory location with release 
> semantics, e.g. at the
> > end of critical section entered with compare-and-swap.  (My 
> impression is
> > that on X86 this is just an assignment, but it requires 
> some magic to tell
> > the compiler not to move other operations past it.
> 
> This is actually showing quite dramatically how difficult it is to
> have definitions which match everywhere.
> 
> There are generally two solutions:
> 
> - have instructions which explicitly provide the semantic (IA-64, ...)
> 
> - use memory barriers
> 
> Depending on what is available you write the code differently.  If the
> critical region contains more than one store operation the question is
> what is best to use.  If only memory barriers are available you put
> exactly one write memory barrier before the end of the section.  If
> you have release semantic in the instructions you'll use it for all
> stores which can be the last ones in the critical region.  What do you
> do for something like this
> 
>      BEGIN_CR
> 
>        a = ...some value...
> 
>        if (foo)
>           b = ...another one...
> 
>      END_CR
> 
> Beside the case that one should probably care for architectures which
> have no memory barriers but do have the special store instructions
> there is the case where both types are available.  So you will have to
> write the sources like this
> 
>      BEGIN_CR
> 
>        STORE_REL (a, ...some value...)
> 
>        if (foo)
>           STORE_REL (b, ...another one...)
> 
>        WMB
> 
>      END_CR
> 
> But this will suck on architectures with both, st.rel and memory
> barriers.  Which one to use depends on the actual processor
> implementation.  So you'll have to introduce special STORE_REL and WMB
> macro variants which are used if they are used together as in the
> example above.
> 
> There are probably more such cases.
I don't understand this example.  In my world, the STORE_REL would probably
be used as part of the END_CR implementation, ensuring that all writes in
the critical section become visible before the lock is released.  I
typically don't care about write reordering within the critical section,
since no other processors should be looking at that stuff until I'm out of
the critical section.

I'm assuming that on architectures without a hardware implementation of
STORE_REL, the software implementation will perform a memory barrier
followed by a store.
> 
> 
> > An empty volatile asm seems to do the trick with gcc, but may be
> > overkill.
> 
> No, an empty volatile has no consequences on the processor.  Memory
> barriers and st.rel instructions ensure that the memory write/read
> reordering taking place during execution in the processor.  The asm
> volatiles only have effects on the compiler.  This is by far not
> enough for modern architectures.  Even on x86 we have now explicit
> memory barrier instructions.
I was referring only to X86.  Sorry about the confusion.

I've heard claims both ways about whether STORE_REL needs a barrier on X86.
Based on some fairly heavy duty tests here it doesn't seem to for what I'm
doing, i.e. stores don't seem to be reordered.  But I'm willing to be
corrected.  (And the answer may be chipset specific.)

It would be nice to encode such obscure information in one place, so that we
don't all have to rediscover that.
> 
> 
> Which brings me to the next point: using these definitions makes
> applications either less portable (e.g., using the atomic operations
> require new revisions of the hardware and therefore the programs
> cannot be used on old hardware) or the programs are very slow (e.g.,
> x86 applications would have to take i386 compatibility into account
> which prevents almost all atomic operations the processor knows about
> from being used).
> 
> If you use function calls for everything we would not have to start
> thinking about this.  The consequence is that unless you are willing
> to accept that binaries have to be compiled for the specific
> architecture, you will restrict the platforms benefitting from these
> new definitions to the very modern ones.  All architectures with
> history (which includes even Alpha) have to use blended code which is
> slow.
Isn't that largely an orthogonal issue?  On X86, you may well want to use a
library call to implement compare-and-exchange by default, though probably
not some of the others.  But all of these things seem to be issues that
applications currently have to sort out themselves.  Fundementally, there is
no way to avoid these problems without the library either.
> 
> 
> > 6) Everything needs to be in the system reserved part of 
> the C namespace,
> > since it should be usable from C, but also from standard 
> C++ headers.
> 
> How this has to be implemented is a different story.  I wouldn't make
> it a requirement that the same definitions are used. ...
My assumption is that this would usually be used to implement a set of C++
primitives, which would be more appropriate for the C++ style.  I would
first try to reduce or eliminate the machine dependent synchronization stuff
in libstdc++, which already seems to be somewhat of a goal. 

> It is not that I have never giving these things any thought.  In fact,
> I'm trying to find a solution for these problems for years.  But it is
> hard to do it right and whenever you think that the current set is
> sufficient somebody comes up with something when cannot be handled
> efficiently or the processor guys come up with yet another fancy ways
> to make themselves look good.
> 
> 
> > If there's some consensus that this would be a step in the 
> right direction,
> > I'm willing to generate such a header file for the 
> platforms I have easy
> > access to.  But we would need to get at least some 
> agreement that this also
> > makes sense for C++ .  (I can also put in fetch and add 
> explicitly, if that
> > would help convince people.)
> 
> I think it is wrong to start right away with proposing interfaces.
> First describe the problems and give examples (and make sure you are
> covering only the limited range of problems associated with Java;
> e.g., an OpenMP compliant FORTRAN implementation has some other
> problems).  This will help people who know a lot about a specific
> processor to decide how this can be best implemented given the
> architectures features.
> 
I think between us we have a reasonable collection of examples (reference
counting in C++ strings and ropes, other synchronization in the C++ library,
fast Java synchronization,  synchronization in linuxthreads, allocation
locking in the garbage collector, various subproblems of parallel garbage
collection, etc.)  A bunch of these seem to have resulted in rather similar
machine dependent code, and I would like to factor that out somehow,
especially since systems like libgcj are starting to contain copies of
several versionbs of this code.

It's likely that the result won't be universal.  But it should cover some
fundamental recurring problems (reference count updates, updates of bit
fields, faster access to thread-local stuff, visibility of initializing
writes before the object itself, ...)

It may be that this should be a project distinct from libc.  But it would be
nice if it were shared.

Hans
Follow-Ups:
- Re: Segment Register thread descriptor (was Re: Jv_AllocBytesChec ked)
  - From: Ulrich Drepper
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]