Hi,

I am debugging a complex problem where our Linux-based applications sometimes crash in mysterious ways. It is the usual kind of exception RTTI problem, where an exception thrown in one DSO is not correctly recognized in another DSO. We know well that DSOs and C++ RTTI do not always mix, but we follow all the standard advice about how to build the apps so that RTTI works correctly, and it still breaks.

Our apps are a mixture of the Python interpreter and many C++ shared libraries loaded from Python (using dlopen). Some C++ libs in turn use dlopen to load other shared libraries. Everything is linked with the correct flags (no symbol hiding) and all dlopen calls use the RTLD_GLOBAL flag, so we expect things to work correctly. Things do work correctly, but only when we link the DSOs together with the C++ main(), thus eliminating the top-level dlopen call (the other dlopen calls remain). With LD_DEBUG I was able to confirm that in that case all typeinfo instances are resolved correctly and bound to one instance in the library linked to the main app. When Python calls dlopen on the same library, LD_DEBUG shows that typeinfo resolution fails and there are two instances of the typeinfo object for the Exception type in question.

I tried to reproduce the problem with a simple example involving just a couple of DSOs, and after some hair pulling I managed to do it. The peculiarity of the case (which I did not recognize initially) is that one of the dlopen() calls happens from the constructor of a global object (that is, during the initialization of the corresponding DSO). If all dlopen calls happen in the regular way (after main() starts) then there is no problem at all. But if dlopen() happens during DSO initialization, then that DSO is somehow not used in the lookup for the dlopen'ed library's symbols, even though the DSO has RTLD_GLOBAL set. The example code that I attach here demonstrates exactly this.
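For readers without the attachment, the pattern described above can be sketched roughly as follows. This is a minimal illustration and not the code from the test case: `PluginLoader`, `lazy_libb` and the `libb.so` path are hypothetical names, and deferring the load to first use is only one possible way to keep dlopen out of DSO initialization, based on the observation that calls made after main() starts work. On older glibc versions this may need linking with -ldl.

```cpp
#include <dlfcn.h>
#include <cstdio>

// Hypothetical helper: opens a shared library with RTLD_GLOBAL and
// closes it again when the object is destroyed.
class PluginLoader {
public:
    explicit PluginLoader(const char* path)
        : handle_(dlopen(path, RTLD_NOW | RTLD_GLOBAL)) {
        if (!handle_) {
            std::fprintf(stderr, "dlopen(%s) failed: %s\n", path, dlerror());
        }
    }
    ~PluginLoader() {
        if (handle_) {
            dlclose(handle_);
        }
    }
    PluginLoader(const PluginLoader&) = delete;
    PluginLoader& operator=(const PluginLoader&) = delete;

    bool loaded() const { return handle_ != nullptr; }

private:
    void* handle_;
};

// The pattern described in the report would be a namespace-scope object,
// whose constructor runs while the enclosing DSO is still initializing:
//
//   static PluginLoader eager("libb.so");
//
// Assumed alternative: a function-local static is constructed on the
// first call, so the dlopen happens when the caller asks for it rather
// than during DSO initialization.
PluginLoader& lazy_libb() {
    static PluginLoader loader("libb.so");  // hypothetical library name
    return loader;
}
```

With this shape, liba's run() would call `lazy_libb()` before looking up symbols, so the load is triggered from a regular call rather than from init code.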
To build the example app just do (should work on Linux without patching):

  % tar zxf example.tgz
  % make

This will build the main app called 'main' and two DSOs: liba.so and libb.so. The main app calls dlopen for liba.so and calls a run() function from it. liba.so calls dlopen on libb.so either from its run() function or from the DSO init code, depending on a particular envvar, and then calls the run() function from libb. libb's run() throws an exception that liba's run() tries to catch and analyze.

To show the default correct behavior, with dlopen called only after main() has started:

  % ./main
  As expected:
  &typeid(ex): 0x2b594ce6e600
  &typeid(Exception): 0x2b594ce6e600
  typeid(ex).name: 9Exception
  typeid(Exception).name: 9Exception
  typeid(Exception)==typeid(ex): true

To see what happens when dlopen is called from the liba init code:

  % TEST_GLOBAL_INIT=1 ./main
  *** Not expected:
  &typeid(ex): 0x2b4532ad2050
  &typeid(Exception): 0x2b45328d0600
  typeid(ex).name: 9Exception
  typeid(Exception).name: 9Exception
  typeid(Exception)==typeid(ex): false

In this case the exception cannot be caught with its real type (it is caught as std::exception), so RTTI is totally broken. The name in the exception typeinfo is still correct, but the addresses of the typeinfo in liba and libb are different.

From what I gather, the C++ code in the example should be legal; global object initialization should not have restrictions on what functions it can call. But it seems that the implementation of RTTI in gcc relies on features that do not always work. Is there any way to fix the situation, or at least to produce some kind of diagnostics when this situation happens?

Regards,
Andy
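The diagnostics printed above can be approximated by a check like the one below. It is a sketch, not the attached code: the `Exception` class defined here, `TypeinfoReport`, `inspect` and `throw_and_inspect` are illustrative names. Built as a single binary, the name and operator== comparisons should agree; reproducing the mismatch needs the two-DSO setup from the test case.

```cpp
#include <cstdio>
#include <cstring>
#include <exception>
#include <typeinfo>

// Illustrative stand-in for the exception type shared between the DSOs.
class Exception : public std::exception {
public:
    const char* what() const noexcept override { return "Exception"; }
};

// Result of comparing the dynamic type of a caught exception with
// typeid(Exception) in three different ways.
struct TypeinfoReport {
    bool same_address;  // &typeid(ex) == &typeid(Exception)
    bool same_name;     // the two name() strings have identical text
    bool equal;         // typeid(ex) == typeid(Exception)
};

TypeinfoReport inspect(const std::exception& ex) {
    const std::type_info& dynamic_type = typeid(ex);
    const std::type_info& static_type = typeid(Exception);

    std::printf("&typeid(ex): %p\n",
                static_cast<const void*>(&dynamic_type));
    std::printf("&typeid(Exception): %p\n",
                static_cast<const void*>(&static_type));
    std::printf("typeid(ex).name: %s\n", dynamic_type.name());
    std::printf("typeid(Exception).name: %s\n", static_type.name());
    std::printf("typeid(Exception)==typeid(ex): %s\n",
                static_type == dynamic_type ? "true" : "false");

    TypeinfoReport report;
    report.same_address = (&dynamic_type == &static_type);
    report.same_name =
        std::strcmp(dynamic_type.name(), static_type.name()) == 0;
    report.equal = (static_type == dynamic_type);
    return report;
}

// Throws the exception and inspects it through the base-class handler,
// which is the handler that still matches when RTTI is broken.
TypeinfoReport throw_and_inspect() {
    TypeinfoReport report = {false, false, false};
    try {
        throw Exception();
    } catch (const std::exception& ex) {
        report = inspect(ex);
    }
    return report;
}
```

A report with `same_name` true but `equal` false corresponds to the "Not expected" output: the names match while the typeinfo objects are treated as different types.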
Created attachment 23518: test case
works as expected with gcc 4.5, possibly due to the change to __GXX_MERGED_TYPEINFO_NAMES
(In reply to comment #2)
> works as expected with gcc 4.5, possibly due to the change to
> __GXX_MERGED_TYPEINFO_NAMES

Hi Jonathan,

sorry, I do not watch the progress closely, do you mean that gcc 4.6 has __GXX_MERGED_TYPEINFO_NAMES disabled?

Andy
(In reply to comment #3)
> (In reply to comment #2)
> > works as expected with gcc 4.5, possibly due to the change to
> > __GXX_MERGED_TYPEINFO_NAMES
>
> Hi Jonathan,
>
> sorry, I do not watch the progress closely, do you mean that gcc 4.6 has
> __GXX_MERGED_TYPEINFO_NAMES disabled?
>
> Andy

Sorry, that should have been 4.5, not 4.6.
Yes, from http://gcc.gnu.org/gcc-4.5/changes.html:

"The default behavior for comparing typeinfo names has changed, so in <typeinfo>, __GXX_MERGED_TYPEINFO_NAMES now defaults to zero."

I think the changes were made to the trunk around October 2009.
http://gcc.gnu.org/viewcvs?view=revision&revision=153768
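To illustrate why the default for __GXX_MERGED_TYPEINFO_NAMES matters here, the two comparison strategies can be modelled as below. This is a simplified model and not libstdc++ source: `ModelTypeInfo`, both comparison functions and the two name arrays are hypothetical, with the arrays standing in for the duplicate typeinfo names that ended up in liba.so and libb.so.

```cpp
#include <cstring>

// Hypothetical stand-in for a typeinfo object; only the pointer to the
// mangled name is modelled.
struct ModelTypeInfo {
    const char* name;
};

// Strategy that assumes the dynamic linker merged all copies of a type's
// name into one, so comparing the pointers is enough.
bool equal_if_names_merged(const ModelTypeInfo& a, const ModelTypeInfo& b) {
    return a.name == b.name;
}

// Strategy that does not assume merging: identical pointers are a fast
// path, otherwise the text of the two names is compared.
bool equal_by_name_text(const ModelTypeInfo& a, const ModelTypeInfo& b) {
    return a.name == b.name || std::strcmp(a.name, b.name) == 0;
}

// Two separate copies of the same mangled name, as if each DSO had kept
// its own typeinfo for Exception.
const char name_in_liba[] = "9Exception";
const char name_in_libb[] = "9Exception";

const ModelTypeInfo exception_in_liba = {name_in_liba};
const ModelTypeInfo exception_in_libb = {name_in_libb};
```

In this model `equal_if_names_merged(exception_in_liba, exception_in_libb)` reports a mismatch, much like the "false" result in the original report, while `equal_by_name_text` treats the two copies as the same type, which would be consistent with the test case working on gcc 4.5.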
Fixed, so closing as such.