Bug 47960 - dlopen call during DSO initialization breaks C++ RTTI
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: c++
Version: 4.3.2
Importance: P3 normal
Target Milestone: 4.6.0
Assignee: Not yet assigned to anyone
Reported: 2011-03-02 17:58 UTC by Andy
Modified: 2013-11-10 05:15 UTC

Attachments
test case (1.01 KB, application/x-compressed)
2011-03-02 17:58 UTC, Andy

Description Andy 2011-03-02 17:58:11 UTC
Hi, 

I am debugging a complex problem with our Linux-based applications, which sometimes crash in mysterious ways. It is the usual kind of exception RTTI problem, where an exception thrown in one DSO is not correctly recognized in another DSO. We know well that DSOs and C++ RTTI do not always mix, but we follow all the standard advice on how to build the apps so that RTTI works correctly, and it still breaks.

Our apps are a mixture of the Python interpreter and many C++ shared libraries loaded from Python (using dlopen). Some C++ libs in turn use dlopen to load other shared libraries. Everything is linked with the correct flags (no symbol hiding) and all dlopen calls use the RTLD_GLOBAL flag, so we expect things to work correctly. Things do work correctly, but only when we link the DSOs together with a C++ main(), thus eliminating the top-level dlopen call (the other dlopen calls remain). With LD_DEBUG I was able to confirm that in that case all typeinfo instances are resolved correctly and bound to a single instance in the library linked to the main app. When Python calls dlopen on the same library, LD_DEBUG shows that typeinfo resolution fails and there are two instances of the typeinfo object for the Exception type in question.

I tried to reproduce the problem with a simple example involving just a couple of DSOs, and after some hair pulling I managed to do it. The peculiarity of the case (which I did not recognize initially) is that one of the dlopen() calls happens from the constructor of a global object, that is, during the initialization of the corresponding DSO. If all dlopen calls happen in the regular way (after main() starts), there is no problem at all. But if dlopen() happens during the DSO init call, then that DSO is somehow not used in the symbol lookup for the dlopen'ed library, even though the DSO has RTLD_GLOBAL set.
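For readers without the attachment, the pattern described above can be sketched roughly as follows. This is not the code from the test case: the names `open_global` and `EarlyLoader`, the path `./libb.so`, and the use of `RTLD_NOW` are assumptions made for illustration. Only the TEST_GLOBAL_INIT variable and the RTLD_GLOBAL flag come from this report.

```cpp
#include <dlfcn.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: open a shared library with global symbol visibility.
// Returns nullptr (and prints the dlerror() text) if the library cannot be opened.
void* open_global(const char* path) {
    void* handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    if (!handle) {
        const char* msg = dlerror();
        std::fprintf(stderr, "dlopen(%s) failed: %s\n", path, msg ? msg : "unknown error");
    }
    return handle;
}

// A global object whose constructor runs while the DSO that contains it is
// still being initialized. Calling dlopen() here is the situation that
// triggers the problem in this report.
struct EarlyLoader {
    void* handle = nullptr;
    EarlyLoader() {
        // Assumed to mirror the test case: only load early when the
        // environment variable is set.
        if (std::getenv("TEST_GLOBAL_INIT")) {
            handle = open_global("./libb.so");  // hypothetical path
        }
    }
};

static EarlyLoader early_loader;  // constructed during DSO initialization
```

In a real reproduction this would be compiled into liba.so, so that the constructor runs while the outer dlopen of liba.so is still in progress.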

The example code that I attach here demonstrates exactly this. To build the example app just do (should work on Linux without patching):

% tar zxf example.tgz
% make

This will build the main app called 'main' and two DSOs: liba.so and libb.so. The main app calls dlopen on liba.so and calls the run() function from it. liba.so calls dlopen on libb.so, either from its run() function or from the DSO init code depending on an environment variable, and then calls the run() function from libb. libb's run() throws an exception that liba's run() tries to catch and analyze.
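The top-level step (open a library, look up run(), call it) might look something like the sketch below. It is only an approximation of what the attached main does: the helper name `load_and_run`, the return codes, and the assumption that run() is exported with C linkage and takes no arguments are all mine.

```cpp
#include <dlfcn.h>
#include <cstdio>

// Assumed signature of the exported entry point (extern "C" void run()).
using run_fn = void (*)();

// Hypothetical helper. Returns 0 on success, 1 if the library cannot be
// opened, and 2 if the symbol cannot be found.
int load_and_run(const char* lib_path, const char* symbol) {
    void* handle = dlopen(lib_path, RTLD_NOW | RTLD_GLOBAL);
    if (!handle) {
        const char* msg = dlerror();
        std::fprintf(stderr, "dlopen failed: %s\n", msg ? msg : "unknown error");
        return 1;
    }

    dlerror();  // clear any stale error state before dlsym
    void* sym = dlsym(handle, symbol);
    if (!sym) {
        const char* msg = dlerror();
        std::fprintf(stderr, "dlsym failed: %s\n", msg ? msg : "unknown error");
        return 2;
    }

    // The handle is intentionally kept open so the library stays loaded.
    reinterpret_cast<run_fn>(sym)();
    return 0;
}
```

A caller would use it as `load_and_run("./liba.so", "run")`, where the relative path is again an assumption.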

To show default correct behavior with dlopen called only from inside main():

% ./main
As expected:
&typeid(ex):            0x2b594ce6e600
&typeid(Exception):     0x2b594ce6e600
typeid(ex).name:        9Exception
typeid(Exception).name: 9Exception
typeid(Exception)==typeid(ex): true

To see what happens when dlopen is called from liba init code:

% TEST_GLOBAL_INIT=1 ./main
*** Not expected:
&typeid(ex):            0x2b4532ad2050
&typeid(Exception):     0x2b45328d0600
typeid(ex).name:        9Exception
typeid(Exception).name: 9Exception
typeid(Exception)==typeid(ex): false

In this case the exception cannot be caught with its real type (it is caught as std::exception), so RTTI is totally broken. The name in the exception typeinfo is still correct, but the addresses of the typeinfo in liba and libb are different.
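The three checks printed above (typeinfo address, typeinfo name, and operator==) can be reproduced with a small diagnostic along these lines. It is a single-binary sketch, not the code from the attachment: `Exception` here is a stand-in type, and `TypeinfoReport`, `compare_typeinfo` and `throw_and_inspect` are hypothetical names.

```cpp
#include <cstdio>
#include <cstring>
#include <exception>
#include <typeinfo>

// Stand-in for the exception type used in the test case.
struct Exception : std::exception {};

struct TypeinfoReport {
    bool same_object;  // both references denote the same type_info object
    bool same_name;    // the mangled names have identical contents
    bool equal;        // result of std::type_info::operator==
};

// Compare two type_info objects in the three ways shown in the output above.
TypeinfoReport compare_typeinfo(const std::type_info& thrown,
                                const std::type_info& expected) {
    TypeinfoReport r;
    r.same_object = (&thrown == &expected);
    r.same_name = (std::strcmp(thrown.name(), expected.name()) == 0);
    r.equal = (thrown == expected);

    std::printf("&typeid(ex):            %p\n", static_cast<const void*>(&thrown));
    std::printf("&typeid(Exception):     %p\n", static_cast<const void*>(&expected));
    std::printf("typeid(ex).name:        %s\n", thrown.name());
    std::printf("typeid(Exception).name: %s\n", expected.name());
    std::printf("typeid(Exception)==typeid(ex): %s\n", r.equal ? "true" : "false");
    return r;
}

// Throw the stand-in type, catch it through the base class, and inspect
// the dynamic type of the caught object.
TypeinfoReport throw_and_inspect() {
    TypeinfoReport report{false, false, false};
    try {
        throw Exception();
    } catch (const std::exception& ex) {
        report = compare_typeinfo(typeid(ex), typeid(Exception));
    }
    return report;
}
```

Within a single binary the names match and operator== returns true. The failure in this report only shows up when the throw site and the catch site live in different DSOs that ended up with separate typeinfo objects.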

From what I gather, the C++ code in the example should be legal: global object initialization should not have restrictions on which functions it can call. But it seems that the implementation of RTTI in gcc relies on features that do not always work.

Is there any way to fix the situation or at least to produce some kind of diagnostics when this situation happens?

Regards,
Andy
Comment 1 Andy 2011-03-02 17:58:58 UTC
Created attachment 23518
test case
Comment 2 Jonathan Wakely 2011-03-02 18:12:13 UTC
works as expected with gcc 4.5, possibly due to the change to __GXX_MERGED_TYPEINFO_NAMES
Comment 3 Andy 2011-03-02 18:50:56 UTC
(In reply to comment #2)
> works as expected with gcc 4.5, possibly due to the change to
> __GXX_MERGED_TYPEINFO_NAMES

Hi Jonathan,

sorry, I do not watch closely the progress, do you mean that gcc 4.6 has __GXX_MERGED_TYPEINFO_NAMES disabled?

Andy
Comment 4 Andy 2011-03-02 18:51:49 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > works as expected with gcc 4.5, possibly due to the change to
> > __GXX_MERGED_TYPEINFO_NAMES
> 
> Hi Jonathan,
> 
> sorry, I do not watch closely the progress, do you mean that gcc 4.6 has
> __GXX_MERGED_TYPEINFO_NAMES disabled?
> 
> Andy

Sorry, that should have been 4.5, not 4.6.
Comment 5 Jonathan Wakely 2011-03-02 19:43:17 UTC
Yes, from http://gcc.gnu.org/gcc-4.5/changes.html

"The default behavior for comparing typeinfo names has changed, so in <typeinfo>, __GXX_MERGED_TYPEINFO_NAMES now defaults to zero."

I think the changes were made to the trunk around October 2009
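Why the default matters here can be illustrated with a simplified model of the two comparison strategies. This is not the libstdc++ source, which handles more cases. The function names and the two arrays below are hypothetical and only model a process in which two DSOs each carry their own copy of the same mangled name.

```cpp
#include <cstring>

// Strategy that assumes merged names: every type has exactly one name
// string in the process, so comparing pointers is enough.
bool equal_by_pointer(const char* a, const char* b) {
    return a == b;
}

// Strategy that does not assume merged names: fall back to comparing the
// string contents when the pointers differ.
bool equal_by_content(const char* a, const char* b) {
    return a == b || std::strcmp(a, b) == 0;
}

// Two separate copies of the same mangled name, modeling the duplicated
// typeinfo names in liba and libb. They are deliberately non-const named
// arrays so that they are guaranteed to be distinct objects.
char name_in_liba[] = "9Exception";
char name_in_libb[] = "9Exception";
```

With duplicated names, the pointer comparison reports two different types while the content comparison reports the same type, which would be consistent with the test passing on gcc 4.5.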
Comment 7 Andrew Pinski 2013-11-10 05:15:30 UTC
Fixed so closing as such.