Bug 71744 - Concurrently throwing exceptions is not scalable
Summary: Concurrently throwing exceptions is not scalable
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: libgcc
Version: 5.3.1
Importance: P3 normal
Target Milestone: ---
Assignee: Florian Weimer
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-07-03 22:11 UTC by Nadav Har'El
Modified: 2023-01-09 07:30 UTC
CC: 5 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-11-19 00:00:00


Attachments
gcc7-pr71744.patch (975 bytes, patch)
2016-07-07 11:30 UTC, Jakub Jelinek

Description Nadav Har'El 2016-07-03 22:11:04 UTC
Multiple threads on multiple cores should be able to throw exceptions concurrently without bothering one another. But unfortunately, in the current implementation of libstdc++ and/or glibc, the stack unwinding process takes a global lock (while getting the list of shared objects, and perhaps at other points), which serializes these parallel throws and can dramatically slow down the program.

This problem has been reported in the past by users - see for example 

http://stackoverflow.com/questions/26257343/does-stack-unwinding-really-require-locks
https://blogs.oracle.com/dave/entry/synchronization_horror_story_of_the

Some might dismiss this inefficiency with the standard "exceptions should be rare" excuse. They should be rare. But sometimes they are not, leading to a catastrophic collapse in performance.

We saw an illustrative example of an "exception storm" in a Seastar (http://www.seastar-project.org/) application. This application can handle over a million requests per second on many cores. Some unexpected circumstance caused the application to slow down somewhat, which led to some of the requests timing out. The timeout was implemented as an exception, so now we had thousands of exceptions being thrown on all cores in parallel. This led to the application's threads starting to hang, once in a while, on the lock(s) inside "throw". This in turn made the application even slower, and created even more timeouts, which in turn resulted in even more exceptions. In this way the number of exceptions per second escalated, until the point where most of the work the application was doing was fighting over the "throw" locks, and no useful work was being done.
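
Illustrative only (this is not the Seastar application, and every name in it is made up for the example): each thread throws and catches a trivial exception in a loop, so every throw goes through the unwinder and, on an affected libstdc++/glibc combination, through the serialized dl_iterate_phdr path; the wall-clock time per throw then grows sharply with the thread count instead of staying roughly flat.

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

int main(int argc, char **argv) {
  const unsigned nthreads =
      argc > 1 ? static_cast<unsigned>(std::atoi(argv[1]))
               : std::thread::hardware_concurrency();
  const long iterations = 100000;

  const auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < nthreads; ++t)
    threads.emplace_back([iterations] {
      for (long i = 0; i < iterations; ++i) {
        try { throw 0; } catch (int) { /* discarded: only the throw cost matters */ }
      }
    });
  for (auto &th : threads)
    th.join();
  const std::chrono::duration<double, std::nano> elapsed =
      std::chrono::steady_clock::now() - start;

  std::printf("%u threads: %.1f ns of wall clock per throw\n", nthreads,
              elapsed.count() / (double(iterations) * nthreads));
  return 0;
}
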
Comment 1 Andrew Pinski 2016-07-03 23:09:07 UTC
The unwinder uses malloc, which has a lock too.
Comment 2 Andrew Pinski 2016-07-03 23:12:22 UTC
Also, there is a cache in the unwinding code.  This cache is where the lock really is.  Changing this cache to be lockless might be harder to do.
Comment 3 Andrew Pinski 2016-07-03 23:18:42 UTC
This cache:
/* The unseen_objects list contains objects that have been registered
   but not yet categorized in any way.  The seen_objects list has had
   its pc_begin and count fields initialized at minimum, and is sorted
   by decreasing value of pc_begin.  */
static struct object *unseen_objects;
static struct object *seen_objects;

In "unwind-dw2-fde.c".

So there is no single global lock for everything, by the way, just one around the code which accesses the above two lists.
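
For orientation, this is the rough shape of the locking around those two lists. It is a paraphrase, not verbatim libgcc source; the `fde` typedef and the two search helpers are stand-ins. The point is that every throw which reaches this registry funnels through the same object_mutex (the mutex the later btree commit also names).

#include <pthread.h>
#include <stddef.h>

typedef const void fde;   /* stand-in for libgcc's dwarf_fde */
static fde *search_seen_objects (void *pc) { (void) pc; return NULL; }    /* placeholder */
static fde *search_unseen_objects (void *pc) { (void) pc; return NULL; }  /* placeholder */

static pthread_mutex_t object_mutex = PTHREAD_MUTEX_INITIALIZER;  /* __gthread_mutex_t in libgcc */

fde *
unwind_find_fde_sketch (void *pc)
{
  pthread_mutex_lock (&object_mutex);     /* every concurrently throwing thread meets here */
  fde *f = search_seen_objects (pc);      /* list sorted by pc_begin, searched first */
  if (f == NULL)
    f = search_unseen_objects (pc);       /* may classify objects and move them to seen_objects */
  pthread_mutex_unlock (&object_mutex);
  return f;
}
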
Comment 4 Andrew Pinski 2016-07-03 23:23:59 UTC
Also, if you are using C++ features, you need to understand that scalability issues might not be under your control, so you should stop using those features.
Comment 5 Nadav Har'El 2016-07-04 05:53:27 UTC
Some replies to Andrew's comments above:

1. As the number of cores continues to grow, libraries will also fix malloc scalability issues, so concurrent malloc will no longer present a problem. The code which led me to this report (http://www.seastar-project.org/) already replaces malloc() with a lock-free implementation (Seastar has per-thread memory allocation), so malloc() was never a problem there.

2. With the phrase "global lock" I just meant that there is just one lock that all the unrelated threads which try to throw any exception of any type need to take.

3. As far as I can tell, the C++ standard doesn't say anything about exceptions necessarily being very slow. Conventional wisdom indeed has it that exceptions are slower than normal function returns. But nothing suggests that C++ exceptions cannot be scalable -- i.e., that throwing unrelated exceptions in unrelated threads at the same time has to be impossible, or orders of magnitude slower than the normal cost of exceptions on one thread. That is nothing more than an implementation bug. It can be argued how urgent fixing this bug is, but as the number of cores continues to grow, and the performance of C++ software continues to improve (e.g., http://www.seastar-project.org/), "exception storms" of the type I outlined in my original report will become a very real risk.

Fixing this bug completely might require fixes in glibc too (dl_iterate_phdr, and as you said malloc). But I think it definitely needs fixing.
Comment 6 Andrew Pinski 2016-07-04 06:31:40 UTC
(In reply to Nadav Har'El from comment #5)
> That is nothing more than an implementation bug. 

Is it?  The problem here is you need to start redesigning it to use lock-free algorithms.  I don't see a way of doing this easily.  But maybe I am missing something here.
Comment 7 Andrew Pinski 2016-07-04 06:34:49 UTC
Note the kernel has many of these locks too.  Have you reported those cases to them?

And yes, I know about the number of cores increasing, trust me; I use a 96 (or 48, depending on whether I am on a single socket or two) core machine for most of my work now, so I know all about this issue.  Plus I will be using a 104 core machine soon too (2x54).  I have been looking into lock issues too.
Comment 8 Jakub Jelinek 2016-07-04 07:18:54 UTC
There are 2 locks.  One is internal to libgcc* and used to guard the old-style FDE registry, which should only rarely be used.  This one could perhaps be transformed into an rwlock, and/or the common case where nothing is registered through the old registry could somehow be handled without the lock, just with some atomic comparison.
The second one is inside of glibc, in the dl_iterate_phdr call; to change anything here, synchronized changes between glibc and gcc would be required, or some changes internal to glibc only.
Right now glibc uses a recursive global lock here; dunno if it would be possible to use a recursive rwlock instead and whether it wouldn't slow down dlopen/dlclose way too much.
The dl_iterate_phdr API these days provides (non-atomic) long long counters counting the number of dlopens and the number of dlcloses, so the unwind code can determine whether it can use its caches (if no dlopen/dlclose happened since last time) or not.  But these are computed inside of the locked code.  The problem with e.g. making these counters atomically updated (not possible on all arches, even i486 can't do that, and many others) and adding some API to query them is that that still doesn't guarantee that in between checking these counters and looking things up the library isn't dlclosed by some other thread.  Though, for unwinding, if some code address is seen in the backtrace, then it had better not be dlclosed until the unwinding is done.
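
For reference, the counters in question are dlpi_adds/dlpi_subs in struct dl_phdr_info (available since glibc 2.4, subject to the usual size check). A simplified sketch of the kind of cache validation they allow; this is illustrative, not the actual unwind-dw2-fde-dip.c code, and the phdr_cache type is made up for the example:

#define _GNU_SOURCE 1
#include <link.h>
#include <stdbool.h>
#include <stddef.h>

struct phdr_cache
{
  unsigned long long adds, subs;   /* counter values captured when the cache was filled */
  bool valid;
  /* ... the cached pc-range -> unwind-data mapping would live here ... */
};

static int
check_counters (struct dl_phdr_info *info, size_t size, void *data)
{
  struct phdr_cache *cache = (struct phdr_cache *) data;
  if (size < offsetof (struct dl_phdr_info, dlpi_subs) + sizeof (info->dlpi_subs))
    cache->valid = false;     /* counters not provided by this libc: never trust the cache */
  else if (info->dlpi_adds != cache->adds || info->dlpi_subs != cache->subs)
    cache->valid = false;     /* a dlopen or dlclose happened since the cache was filled */
  return 1;                   /* the counters are global, so one callback call is enough */
}

bool
phdr_cache_still_valid (struct phdr_cache *cache)
{
  dl_iterate_phdr (check_counters, cache);   /* note: still serialized by glibc's loader lock */
  return cache->valid;
}
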
Comment 9 Jakub Jelinek 2016-07-07 11:30:34 UTC
Created attachment 38852 [details]
gcc7-pr71744.patch

Untested patch for the old-style registry.  But as long as glibc still takes a global lock it probably doesn't help that much.
Comment 10 Gleb Natapov 2016-07-25 14:46:44 UTC
I've just sent https://sourceware.org/ml/libc-alpha/2016-07/msg00613.html to the mailing list; it should solve this issue (or at least make it much more scalable). I did not notice that Jakub had already proposed a fix for the old-style registry and implemented it differently, but in the end both of them fix the same problem in a similar way.
Comment 11 Gleb Natapov 2016-09-14 08:53:11 UTC
Jakub, can you apply your patch please? As you've said, glibc will also have to be fixed and I am working on that, but without your patch fixing glibc will not help either.
Comment 12 Jakub Jelinek 2016-09-14 08:56:50 UTC
I'll bootstrap/regtest it today.  That said, have you done any scalability benchmarking with/without that patch?
Comment 13 Gleb Natapov 2016-09-14 09:01:38 UTC
I've done it with my version of the patch, which does basically the same thing. The patch by itself has no effect on scalability, but in conjunction with this fix for glibc
https://patchwork.ozlabs.org/patch/652301/ I get very good results.
Comment 14 Jakub Jelinek 2016-09-14 09:23:22 UTC
Looking at that patchset, I just want to say that the use of TLS in libgcc is very much undesirable: it affects every program, even those not using exceptions (a significant portion of them), especially when it uses that much memory (6 * sizeof(void *) * 8 + sizeof(void *) + 2 * 8 = 408 bytes per thread is significant), plus it uses the slow local-dynamic TLS model.
Similarly, the 64 recursive locks in libc mean, again, a significant amount of memory
per process that doesn't care about exceptions, and especially a significant
slowdown on dlopen/dlclose.  While some programs abuse exceptions for normal
control flow, others abuse dlopen/dlclose.
An rwlock ought to be better, but I don't think we have a recursive rwlock in libc
yet (all we need is being able to recursively lock it for writing; for reading
we might not need to allow recursion in this case).  Plus the current rwlock
implementation still uses a per-process lock to guard the inner operations
on the rwlock, so while an rwlock helps with longer critical sections, for very
short ones it likely doesn't scale all that well.
Comment 15 torvald 2016-09-14 10:19:50 UTC
(In reply to Jakub Jelinek from comment #14)
> Looking at that patchset, I just want to say that the use of TLS in libgcc
> is very much undesirable, it affects every program even when not using
> exceptions (significant portion of them), especially when it uses that much
> (6 * sizeof(void *) * 8 + sizeof(void *) + 2 * 8 = 408 bytes per thread is
> significant, plus it uses the slow local dynamic TLS model.

I agree.

> Similarly, the 64 recursive locks in libc, again, significant amount of
> memory
> per process that doesn't care about exceptions,

That might be reducible if we build custom locks and don't use pthread ones, but we'll at least have 64b per lock.

> and especially a significant
> slowdown on dlopen/dlclose.  While some programs abuse exceptions for normal
> control flow, others abuse dlopen/dlclose.                                  
> 
> A rwlock ought to be better, but I don't think we have recursive rwlock in
> libc

rdlock can be recursive, rwlock cannot.  It is something we could add with not much effort in a custom rwlock, I think.

> yet (all we need is being able to recursively lock it for writing, for
> reading
> we might not need to allow recursion in this case).

Not having to deal with recursive rdlock can make things easier for the rwlock implementation.

> Plus the current rwlock
> implementation still uses a per-process lock to guard the inner
> operations
> on the rwlock, so while rwlock helps with longer critical sections, for very
> short ones it likely doesn't scale all that well.

This will change as soon as my new rwlock is upstream.  Especially short rdlock sections will be much faster.
Comment 16 Gleb Natapov 2016-09-14 10:25:01 UTC
Can you suggest an alternative to libgcc patch? Use other TLS model? Allocate per thread storage dynamically somehow?

About the lock array: I tried to use an rwlock, and the current one is no better than a
regular lock since it takes a lock internally. There is a patch series to improve
this, but even with the improved rwlock the array approach generates much better results for my test case, since it avoids lock contention in most cases. The array approach can be improved/generalized by having a lock per thread, but I think the array is a much simpler approach that delivers good results.

About programs abusing dlopen/dlclose: while I am sure there are those, it is a reasonable expectation that dlopen/dlclose calls on different threads will be serialized, but throwing an exception, while expected to be a heavy operation, is not expected to make threads execute in lockstep. Taking some extra lock should be negligible compared to the other work dlopen has to do.

About memory considerations: are those numbers really so significant, especially considering the overhead each thread already has in the kernel?

It would be nice to continue this discussion on the mailing list.

Regardless, let's apply your patch since it is essential to the final solution.
Comment 17 Jakub Jelinek 2016-09-14 10:30:45 UTC
(In reply to torvald from comment #15)
> > Similarly, the 64 recursive locks in libc, again, significant amount of
> > memory
> > per process that doesn't care about exceptions,
> 
> That might be reducible if we build custom locks and don't use pthread ones,
> but we'll at least have 64b per lock.

Right now the lock is I think 40 bytes per lock, which is 64 * 40 bytes per process.  That is just too much (but of course the 64x locking a recursive lock is even worse).

> > and especially a significant
> > slowdown on dlopen/dlclose.  While some programs abuse exceptions for normal
> > control flow, others abuse dlopen/dlclose.                                  
> > 
> > A rwlock ought to be better, but I don't think we have recursive rwlock in
> > libc
> 
> rdlock can be recursive, rwlock cannot.  It is something we could add with
> not much effort in a custom rwlock, I think.

The dl_load_lock is I believe private, so it could use custom special rwlock.

> > Plus the current rwlock
> > implementation still uses a per-process lock to guard the inner
> > operations
> > on the rwlock, so while rwlock helps with longer critical sections, for very
> > short ones it likely doesn't scale all that well.
> 
> This will change as soon as my new rwlock is upstream.  Especially short
> rdlock sections will be much faster.

Nice.
Comment 18 Jakub Jelinek 2016-09-14 10:41:56 UTC
(In reply to Gleb Natapov from comment #16)
> Can you suggest an alternative to libgcc patch? Use other TLS model?
> Allocate per thread storage dynamically somehow?

If we want to use TLS (which I hope we don't), then e.g. a single __thread pointer with some registered destructor that would free it on process exit could do the job, and on the first exception it would try to allocate memory for the cache and other stuff and use that (otherwise, if memory allocation fails, just take a lock and be non-scalable).

Another alternative, perhaps much better, if Torvald is going to improve rwlocks sufficiently, would be to use rwlock to guard writes to the cache etc. too, and perhaps somewhat enlarge the cache (either statically, or allow extending it through allocation).
I'd expect that usually these apps that use exceptions too much only care about a couple of shared libraries, so writes to the cache ought to be rare.
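
A sketch of what the single-__thread-pointer variant could look like; all names here are illustrative, not libgcc's, and the pthread key exists only so a destructor runs at thread exit, which keeps pthread_getspecific off the hot path:

#include <pthread.h>
#include <stdlib.h>

struct unwind_tcache
{
  /* the per-thread FDE lookup cache entries would live here */
  void *entries[8];
};

static __thread struct unwind_tcache *tcache;   /* one pointer per thread, initially NULL */
static pthread_key_t tcache_key;
static pthread_once_t tcache_once = PTHREAD_ONCE_INIT;

static void tcache_free (void *p) { free (p); }
static void tcache_key_init (void) { pthread_key_create (&tcache_key, tcache_free); }

/* Returns the per-thread cache, or NULL if it cannot be allocated.  A NULL
   result means "fall back to the global lock and do the slow, serialized lookup".  */
static struct unwind_tcache *
get_thread_cache (void)
{
  if (tcache == NULL)
    {
      pthread_once (&tcache_once, tcache_key_init);
      tcache = (struct unwind_tcache *) calloc (1, sizeof *tcache);
      if (tcache != NULL)
        pthread_setspecific (tcache_key, tcache);   /* only so the destructor runs at thread exit */
    }
  return tcache;
}

If the allocation fails, callers simply keep using the existing locked path, so correctness never depends on the per-thread memory being available.
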
Comment 19 Gleb Natapov 2016-09-14 11:04:35 UTC
(In reply to Jakub Jelinek from comment #18)
> (In reply to Gleb Natapov from comment #16)
> > Can you suggest an alternative to libgcc patch? Use other TLS model?
> > Allocate per thread storage dynamically somehow?
> 
> If we want to use TLS (which I hope we don't), then e.g. a single __thread
> pointer with some registered destructor that would free it on process exit
> could do the job, and on the first exception it would try to allocate memory
> for the cache and other stuff and use that (otherwise, if memory allocation
> fails, just take a lock and be non-scalable).
>
I see that sjlj uses __gthread_setspecific/__gthread_getspecific. Can we do the same here?

> 
> Another alternative, perhaps much better, if Torvald is going to improve
> rwlocks sufficiently, would be to use rwlock to guard writes to the cache
> etc. too, and perhaps somewhat enlarge the cache (either statically, or
> allow extending it through allocation).
> I'd expect that usually these apps that use exceptions too much only care
> about a couple of shared libraries, so writes to the cache ought to be rare.
>
As I said in my previous reply, I tested the new rwlock, and in the congested case
it still slows down the system significantly; it is not the implementation's fault, the CPU just does not like locked instructions much. Not having a lock will be significantly better.
Comment 20 Jakub Jelinek 2016-09-14 11:27:14 UTC
(In reply to Gleb Natapov from comment #19)
> (In reply to Jakub Jelinek from comment #18)
> > (In reply to Gleb Natapov from comment #16)
> > > Can you suggest an alternative to libgcc patch? Use other TLS model?
> > > Allocate per thread storage dynamically somehow?
> > 
> > If we want to use TLS (which I hope we don't), then e.g. a single __thread
> > pointer with some registered destructor that would free it on process exit
> > could do the job, and on the first exception it would try to allocate memory
> > for the cache and other stuff and use that (otherwise, if memory allocation
> > fails, just take a lock and be non-scalable).
> >
> I see that sjlj uses __gthread_setspecific/__gthread_getspecific. Can we do
> the same here?

Can? Yes.  Want?  Nope.  It is worse than TLS.

> > Another alternative, perhaps much better, if Torvald is going to improve
> > rwlocks sufficiently, would be to use rwlock to guard writes to the cache
> > etc. too, and perhaps somewhat enlarge the cache (either statically, or
> > allow extending it through allocation).
> > I'd expect that usually these apps that use exceptions too much only care
> > about a couple of shared libraries, so writes to the cache ought to be rare.
> >
> As I said in my previous reply, I tested the new rwlock and in congested case
> it still slows down the system significantly, not the implementation fault,
> cpu just does not like locked instruction much. Not having a lock will be
> significantly better.

You still need at least one lock, the array of locks is definitely a bad idea.
Perhaps if you are worried about using 2 different rwlocks, it would be possible to just use the glibc internal one, by adding dl_iterate_phdr alternate entrypoint - dl_iterate_phdr would then be documented to only allow a single thread in the callback, which it satisfies now and in newer libc could wrlock _dl_load_lock, and then dl_iterate_phdr alternate entrypoint would be documented to allow multiple threads in the callback (i.e. it could rdlock _dl_load_lock).  On the libgcc side then it would call dl_iterate_phdr_rd (or whatever name it would have) first, and perform only read-only lookup in the cache, and if it wouldn't find anything, it would call dl_iterate_phdr afterwards and tweak the cache.
Comment 21 torvald 2016-09-14 11:31:38 UTC
(In reply to Jakub Jelinek from comment #17)
> (In reply to torvald from comment #15)
> > > Similarly, the 64 recursive locks in libc, again, significant amount of
> > > memory
> > > per process that doesn't care about exceptions,
> > 
> > That might be reducable if we build custom locks and don't use pthread ones,
> > but we'll at least have 64b per lock.
> 
> Right now the lock is I think 40 bytes per lock, which is 64 * 40 bytes per
> process.  That is just too much (but of course the 64x locking a recursive
> lock is even worse).

Remembering more of the discussion we had about this in the past: even with the improved rwlock, Gleb reports that there is a slowdown in his tests because of cache contention -- which would mean that we may have to use one cache line per lock.
Comment 22 Gleb Natapov 2016-09-14 11:36:51 UTC
(In reply to torvald from comment #21)
> (In reply to Jakub Jelinek from comment #17)
> > (In reply to torvald from comment #15)
> > > > Similarly, the 64 recursive locks in libc, again, significant amount of
> > > > memory
> > > > per process that doesn't care about exceptions,
> > > 
> > > That might be reducable if we build custom locks and don't use pthread ones,
> > > but we'll at least have 64b per lock.
> > 
> > Right now the lock is I think 40 bytes per lock, which is 64 * 40 bytes per
> > process.  That is just too much (but of course the 64x locking a recursive
> > lock is even worse).
> 
> Remembering more of the discussion we had about this in the past, then even
> with the improved rwlock, Gleb reports that there is a slowdown in Gleb's
> tests because of cache contention -- which would mean that we may have to
> use one cacheline per lock.

Right, my patch lacks it, but cache alignment will definitely be an improvement.
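
For illustration, roughly the shape of the cache-line-aligned lock array being discussed (a sketch, not the actual glibc patch, and all names are made up): readers hash onto one of 64 padded locks, so concurrent readers rarely share a lock or a cache line, while a dlopen/dlclose-style writer takes every slot to exclude them all.

#include <functional>
#include <mutex>
#include <thread>

constexpr int NLOCKS = 64;                // roughly one page of locks, as noted in comment 23

struct alignas(64) PaddedLock             // one lock per cache line (comments 21/22)
{
  std::mutex m;
};
static PaddedLock lock_array[NLOCKS];

static unsigned
my_slot ()
{
  return std::hash<std::thread::id>{}(std::this_thread::get_id ()) % NLOCKS;
}

// Read side (e.g. the phdr walk during unwinding): threads on different slots
// never contend and never touch the same lock's cache line.
void read_side (const std::function<void ()> &walk)
{
  std::lock_guard<std::mutex> g (lock_array[my_slot ()].m);
  walk ();
}

// Write side (e.g. dlopen/dlclose): take every slot, in a fixed order, to
// exclude all readers.  Writes become proportionally more expensive; that is
// the trade-off comment 24 objects to.
void write_side (const std::function<void ()> &update)
{
  for (int i = 0; i < NLOCKS; ++i)
    lock_array[i].m.lock ();
  update ();
  for (int i = NLOCKS - 1; i >= 0; --i)
    lock_array[i].m.unlock ();
}
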
Comment 23 Gleb Natapov 2016-09-14 11:51:31 UTC
(In reply to Jakub Jelinek from comment #20)
> (In reply to Gleb Natapov from comment #19)
> > (In reply to Jakub Jelinek from comment #18)
> > > (In reply to Gleb Natapov from comment #16)
> > > > Can you suggest an alternative to libgcc patch? Use other TLS model?
> > > > Allocate per thread storage dynamically somehow?
> > > 
> > > If we want to use TLS (which I hope we don't), then e.g. a single __thread
> > > pointer with some registered destructor that would free it on process exit
> > > could do the job, and on the first exception it would try to allocate memory
> > > for the cache and other stuff and use that (otherwise, if memory allocation
> > > fails, just take a lock and be non-scalable).
> > >
> > I see that sjlj uses __gthread_setspecific/__gthread_getspecific. Can we do
> > the same here?
> 
> Can? Yes.  Want?  Nope.  It is worse than TLS.
Got it. If a __thread pointer is an acceptable solution, I will be happy to implement it.

> 
> > > Another alternative, perhaps much better, if Torvald is going to improve
> > > rwlocks sufficiently, would be to use rwlock to guard writes to the cache
> > > etc. too, and perhaps somewhat enlarge the cache (either statically, or
> > > allow extending it through allocation).
> > > I'd expect that usually these apps that use exceptions too much only care
> > > about a couple of shared libraries, so writes to the cache ought to be rare.
> > >
> > As I said in my previous reply, I tested the new rwlock and in congested case
> > it still slows down the system significantly, not the implementation fault,
> > cpu just does not like locked instruction much. Not having a lock will be
> > significantly better.
> 
> You still need at least one lock, the array of locks is definitely a bad
> idea.
>
I am not sure I agree. 64 locks will take one page of memory, which is a negligible amount nowadays, and we can drop the array if compiled for a single-threaded machine. Breaking one big lock into many smaller ones is a common technique to achieve scalability.

An alternative is to make the lock per thread; then the memory consumed is proportional to the number of threads.
 
> Perhaps if you are worried about using 2 different rwlocks, it would be
> possible to just use the glibc internal one, by adding dl_iterate_phdr
> alternate entrypoint - dl_iterate_phdr would then be documented to only
> allow a single thread in the callback, which it satisfies now and in newer
> libc could wrlock _dl_load_lock, and then dl_iterate_phdr alternate
> entrypoint would be documented to allow multiple threads in the callback
> (i.e. it could rdlock _dl_load_lock).  On the libgcc side then it would call
> dl_iterate_phdr_rd (or whatever name it would have) first, and perform only
> read-only lookup in the cache, and if it wouldn't find anything, it would
> call dl_iterate_phdr afterwards and tweak the cache.
>
Such an interface will make the new dl_iterate_phdr_rd libgcc-specific; also, scalability will depend on cache efficiency, so while a benchmark will show a much better result, a real application will not benefit. Complex C++ applications tend to have deep call chains.
Comment 24 Jakub Jelinek 2016-09-14 12:03:37 UTC
(In reply to Gleb Natapov from comment #23)
> I am not sure I agree. 64 lock will take one page of memory, which is
> negligible amount nowadays and we can drop the array if compiled for single
> threaded machine. 

It is perhaps negligible for your app, but libc has lots of users with different needs.  And dl_load_lock isn't the only widely used lock in libc; are we going to use a page-sized array of locks in every case where such a lock has scalability issues with certain use cases?

> Such interface will make new dl_iterate_phdr_rd to libgcc specific, also
> scalability will depend on cache efficiency, so while benchmark will show
> much better result, real application will not benefit. Complex C++
> applications tend to have deep call chains.

Why would it be libgcc-specific?  It would be another libc-supported API, with clear documentation on what it does, and any user could just use it.  As I said, the number of entries in the cache could be extended.
Comment 25 Gleb Natapov 2016-09-14 12:28:34 UTC
(In reply to Jakub Jelinek from comment #24)
> (In reply to Gleb Natapov from comment #23)
> > I am not sure I agree. 64 lock will take one page of memory, which is
> > negligible amount nowadays and we can drop the array if compiled for single
> > threaded machine. 
> 
> It is perhaps negligible for your app, but libc has lots of users with
> different needs.  And dl_load_lock isn't the only widely used lock in libc,
> are we going to use page of array locks in each case such lock has
> scalability issues with certain use cases?
>
That's a fair point. I think the severity of the issue should be taken into account. I can tell from our experience (and, searching the web, we are not alone) that for exception-throwing languages like C++ the issue is very serious. And no, we do not use exceptions as flow control, but when errors happen they tend to happen in bunches, and when the first bunch slows the system to a crawl it causes even more errors. The only workaround is to not use exceptions, which for us is not acceptable, so fixing the issue at its root is the only option.

Using Torvald's rwlock would definitely be better than the current state, but not as good as a per-thread lock.

> 
> > Such interface will make new dl_iterate_phdr_rd to libgcc specific, also
> > scalability will depend on cache efficiency, so while benchmark will show
> > much better result, real application will not benefit. Complex C++
> > applications tend to have deep call chains.
> 
> Why would it be libgcc specific?  It would be another libc supported API,
> with clear documentation on what it does and any user could just use it.
>  
I think I misunderstood what you propose. My patch essentially does what you suggest already; it calls the function dl_iterate_phdr_parallel instead of dl_iterate_phdr_rd, but otherwise it is the same: it can run multiple callbacks in parallel, so we only disagree on how the _parallel_ part is achieved internally.
On the glibc list there were some concerns about widening the interface, though. They may prefer to use symbol versioning to change the dl_iterate_phdr semantics (not sure if and how this can be done yet).


>                                                                        As
> I said, the number of entries in the cache could be extended.
>
Unless it extends dynamically, it would be hard to guess a proper size, and the price of under-guessing is too high. Finding a proper size dynamically
would require a lot of cache-management code, which I do not think belongs here.
Comment 26 Jakub Jelinek 2016-09-14 12:40:27 UTC
(In reply to Gleb Natapov from comment #25)
> Using Torvald's rwlock would be definitely better that current state, but
> not as good as per thread lock.

In theory there could be e.g. an option to keep using the recursive lock by default, and if e.g. dl_iterate_phdr determines it has been called many times already, or detects contention in some way, it could then allocate the page with the locks and switch to using that from that point on, or something similar.
Or start with an rwlock and switch to the array of locks, whatever.

I'm not a glibc developer anymore, so I'll leave it to its maintainers, just wanted to express my fears.

> I think I misunderstood what you propose. My patch essentially does what you
> suggest already, it calls the function dl_iterate_phdr_parallel instead of
> dl_iterate_phdr_rd, but otherwise it is the same: it can run multiple
> callback in parallel, so we only disagree on how _parallel_ part is achieved
> internally.
> On glibc list there were some concerns about widening the interface though.
> They may prefer to use symbol versioning to change dl_iterate_phdr semantics
> (not sure if and how this can be done yet).

Using symbol versioning to change the number of arguments of a function is IMHO just wrong; using a different name is cleaner.  And dl_iterate_phdr isn't used these days just in glibc, where I originally added it (for use in the unwinder), but on various other OSes, some BSDs, Solaris, ..., and I'm not sure if all of them use symbol versioning, especially the GNU one.
Comment 27 Gleb Natapov 2016-09-14 13:06:02 UTC
(In reply to Jakub Jelinek from comment #26)
> (In reply to Gleb Natapov from comment #25)
> > Using Torvald's rwlock would be definitely better that current state, but
> > not as good as per thread lock.
> 
> In theory there could be e.g. an option to keep using the recursive lock by
> default and if e.g. dl_iterate_phdr determines it has been called already
> many times, or detects contention some way, it could under that allocate the
> page with the locks and switch to using that from that point on or something
> similar.
> Or start with rwlock and switch to array of locks, whatever.
>
This is an interesting idea; I will try to play with it. On the surface it will be hard to make threads agree on which method to use without some kind of synchronization, which may, by itself, introduce a non-trivial slowdown.

> 
> I'm not a glibc developer anymore, so I'll leave it to its maintainers, just
> wanted to express my fears.
Your input is appreciated, especially as cooperation between two projects is required to solve the issue.

> 
> > I think I misunderstood what you propose. My patch essentially does what you
> > suggest already, it calls the function dl_iterate_phdr_parallel instead of
> > dl_iterate_phdr_rd, but otherwise it is the same: it can run multiple
> > callback in parallel, so we only disagree on how _parallel_ part is achieved
> > internally.
> > On glibc list there were some concerns about widening the interface though.
> > They may prefer to use symbol versioning to change dl_iterate_phdr semantics
> > (not sure if and how this can be done yet).
> 
> Using symbol versioning to change the number of arguments of a function IMHO
> is just wrong, using a different name is cleaner, and dl_iterate_phdr isn't
> used these days just in glibc where I've originally added it (for the use in
> the unwinder), but various other OSes, some BSDs, Solaris, ..., and not sure
> if all of them use symbol versioning, especially the GNU one.
>
Why would the new function have a different number of arguments? My proposed dl_iterate_phdr_parallel has the same argument list, just no global lock around the callback invocation.

Good point about other OSes. I will bring this to the glibc guys' attention as an argument in favor of a new function.
Comment 28 torvald 2016-09-14 19:00:47 UTC
(In reply to Jakub Jelinek from comment #20)
> (In reply to Gleb Natapov from comment #19)
> > (In reply to Jakub Jelinek from comment #18)
> > > (In reply to Gleb Natapov from comment #16)
> > > > Can you suggest an alternative to libgcc patch? Use other TLS model?
> > > > Allocate per thread storage dynamically somehow?
> > > 
> > > If we want to use TLS (which I hope we don't), then e.g. a single __thread
> > > pointer with some registered destructor that would free it on process exit
> > > could do the job, and on the first exception it would try to allocate memory
> > > for the cache and other stuff and use that (otherwise, if memory allocation
> > > fails, just take a lock and be non-scalable).
> > >
> > I see that sjlj uses __gthread_setspecific/__gthread_getspecific. Can we do
> > the same here?
> 
> Can? Yes.  Want?  Nope.  It is worse than TLS.
> 
> > > Another alternative, perhaps much better, if Torvald is going to improve
> > > rwlocks sufficiently, would be to use rwlock to guard writes to the cache
> > > etc. too, and perhaps somewhat enlarge the cache (either statically, or
> > > allow extending it through allocation).
> > > I'd expect that usually these apps that use exceptions too much only care
> > > about a couple of shared libraries, so writes to the cache ought to be rare.
> > >
> > As I said in my previous reply, I tested the new rwlock and in congested case
> > it still slows down the system significantly, not the implementation fault,
> > cpu just does not like locked instruction much. Not having a lock will be
> > significantly better.
> 
> You still need at least one lock, the array of locks is definitely a bad
> idea.

I don't think it is necessarily a bad idea -- spreading out the data used for synchronization to avoid cache conflicts when there is an asymmetrical workload is a commonly used technique.  But we need to find a balance between compute and space overhead that works for us in this case.  Also, spreading out the data should ideally happen in a NUMA-aware way, so there may be more space overhead if we consider that too.

I've been thinking about using more scalable ways of locking in glibc which would use per-thread data, and which could be useful for more than one lock.  So if we had that, we could just use it to deal with the exception scalability problem, without having space overhead just for exceptions.
Comment 29 Jakub Jelinek 2016-09-16 19:18:19 UTC
Author: jakub
Date: Fri Sep 16 19:17:47 2016
New Revision: 240193

URL: https://gcc.gnu.org/viewcvs?rev=240193&root=gcc&view=rev
Log:
	PR libgcc/71744
	* unwind-dw2-fde.c (ATOMIC_FDE_FAST_PATH): Define if __register_frame*
	is not the primary registry and atomics are available.
	(any_objects_registered): New variable.
	(__register_frame_info_bases, __register_frame_info_table_bases):
	Atomically store 1 to any_objects_registered after registering first
	unwind info.
	(_Unwind_Find_FDE): Return early if any_objects_registered is 0.

Modified:
    trunk/libgcc/ChangeLog
    trunk/libgcc/unwind-dw2-fde.c
Comment 30 Martin Liška 2018-11-19 14:28:43 UTC
Jakub: Can the bug be marked as resolved?
Comment 31 Florian Weimer 2018-11-19 14:33:43 UTC
(In reply to Martin Liška from comment #30)
> Jakub: Can the bug be marked as resolved?

I'm working on eliminating the loader lock relied upon in unwind-dw2-fde-dip.c.  Jakub's commit does not eliminate the main bottleneck for unwinding.
Comment 32 Florian Weimer 2023-01-09 07:30:59 UTC
Fixed via r12-6210-g790854ea7670f1 for dynamically linked code:

commit 790854ea7670f11c14d431c102a49181d2915965
Author: Florian Weimer <fweimer@redhat.com>
Date:   Tue Jan 4 15:47:30 2022 +0100

    libgcc: Use _dl_find_object in _Unwind_Find_FDE
    
    libgcc/ChangeLog:
    
            * unwind-dw2-fde-dip.c (_Unwind_Find_FDE): Call _dl_find_object
            if available.

And r13-2706-g6e80a1d164d1f9 for the run-time code registration interface:

commit 6e80a1d164d1f996ad08a512c000025a7c2ca893
Author: Thomas Neumann <tneumann@users.sourceforge.net>
Date:   Tue Mar 1 21:57:35 2022 +0100

    eliminate mutex in fast path of __register_frame
    
    The __register_frame/__deregister_frame functions are used to register
    unwinding frames from JITed code in a sorted list. That list itself
    is protected by object_mutex, which leads to terrible performance
    in multi-threaded code and is somewhat expensive even if single-threaded.
    There was already a fast-path that avoided taking the mutex if no
    frame was registered at all.
    
    This commit eliminates both the mutex and the sorted list from
    the atomic fast path, and replaces it with a btree that uses
    optimistic lock coupling during lookup. This allows for fully parallel
    unwinding and is essential to scale exception handling to large
    core counts.
    
    libgcc/ChangeLog:
    
            * unwind-dw2-fde.c (release_registered_frames): Cleanup at shutdown.
            (__register_frame_info_table_bases): Use btree in atomic fast path.
            (__deregister_frame_info_bases): Likewise.
            (_Unwind_Find_FDE): Likewise.
            (base_from_object): Make parameter const.
            (classify_object_over_fdes): Add query-only mode.
            (get_pc_range): Compute PC range for lookup.
            * unwind-dw2-fde.h (last_fde): Make parameter const.
            * unwind-dw2-btree.h: New file.
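
The "optimistic lock coupling" mentioned above boils down to version-validated, store-free reads. A conceptual sketch on a single node follows; the real unwind-dw2-btree.h applies this per node while descending the tree and is considerably more involved, and a single writer at a time is assumed here (registration/deregistration being serialized separately).

#include <atomic>
#include <cstdint>

struct Node
{
  std::atomic<std::uint64_t> version{0};     // even: unlocked, odd: writer active
  std::atomic<std::uintptr_t> key{0};        // atomics keep the optimistic read race-free
  std::atomic<const void *> payload{nullptr};
};

// Optimistic read: no stores to shared memory, so concurrent readers never
// contend with each other; they only retry if a writer intervened.
const void *
lookup (const Node &n, std::uintptr_t key)
{
  for (;;)
    {
      std::uint64_t v1 = n.version.load (std::memory_order_acquire);
      if (v1 & 1)
        continue;                                        // writer in progress, retry
      std::uintptr_t k = n.key.load (std::memory_order_relaxed);
      const void *p = n.payload.load (std::memory_order_relaxed);
      std::atomic_thread_fence (std::memory_order_acquire);
      if (n.version.load (std::memory_order_relaxed) == v1)
        return k == key ? p : nullptr;                   // version unchanged: data was consistent
    }
}

// Writer (one at a time assumed): make the version odd, update the payload,
// then make the version even again to publish the change.
void
update (Node &n, std::uintptr_t key, const void *payload)
{
  std::uint64_t v = n.version.load (std::memory_order_relaxed);
  n.version.store (v + 1, std::memory_order_relaxed);    // mark busy (odd)
  std::atomic_thread_fence (std::memory_order_release);
  n.key.store (key, std::memory_order_relaxed);
  n.payload.store (payload, std::memory_order_relaxed);
  n.version.store (v + 2, std::memory_order_release);    // publish (even again)
}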