Bug 53646 - Surprising effects of cxx11 vs cxx98 ABI compatibility
Surprising effects of cxx11 vs cxx98 ABI compatibility
Status: NEW
Product: gcc
Classification: Unclassified
Component: c++
4.8.0
: P3 normal
: ---
Assigned To: Not yet assigned to anyone
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-06-12 13:18 UTC by Michael Matz
Modified: 2012-06-21 15:32 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-06-13 00:00:00


Attachments
tarball containing testcase (655 bytes, application/x-tgz)
2012-06-12 13:18 UTC, Michael Matz
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Matz 2012-06-12 13:18:07 UTC
Created attachment 27611 [details]
tarball containing testcase

As is long known mixing cxx11 and cxx98 code isn't supported.  I'm reporting
this anyway because I'm not sure that the full consequences of this were
realized.  Unpack the tarball, then:

# cd cxxabi-incompat
# make CXX=<path-to-trunk-g++>
...
# ./app
Segmentation fault
# ./app2
Segmentation fault

In this specific case the problem is the Rb_tree::equal_range function,
which returns a pair, under cxx98 that's POD (returned via registers), under 
cxx11 it's not POD and returned via invisible reference.  I.e. calls from
cxx98 to cxx11 version of that function will segfault and vice verso.

The testcase consists of two libraries (lib1 lib2), compiled with cxx11
and cxx98 respectively.  Both libs don't depend or interact with each other.
The app stands for a random application making use of random libraries,
calling simple functions in them, which themself don't exhibit any ABI
problem.  I.e. the testcase tries to show a typical situation for
application developers using several different 3rd party libraries not under
control of that developer.

The problem happens because both libs contains a (weak) definition of
the equal_range function.  As the libs don't control their exports this
symbol also is exported.  Hence the calls will be resolved to whatever
version comes first in dynamic linker search order (that's the reason
for the two apps built, once with "-l1 -l2", once with "-l2 -l1").

So, whichever library is first in search order, the other library will
resolve its own equal_range call (stemming from the inlined erase) to that
first library, and thereby crash because that version was compiled with
different c++ ABI.

Now, this might all be as designed, but what this "you can't mix c++11 and
c++98" means is that an application can't even _link_ against two
libraries compiled with different settings, when the libs don't tightly
control their exports (which might not be possible on all targets).
That's even prohibited if the library authors made sure that the API
itself doesn't contain any problematic constructs, for instance just a
C API, and c++ is used only internally.

In effect this means, that if library authors don't control all their
consumers (which is the case for most libraries) they must expect that
those will also link against some c++98 libraries, and hence can't make use
of c++11 constructs even internally.

Furthermore, if one wants to link against a c++11 compiled library (which
the typical user won't even know, especially if only used in the internal
implementation) the whole stack also needs to be c++11, effectively requiring
two versions of every library that somehow uses c++ to be installed.
The dynamic linker (or games via separate paths) would have to resolve to the
right variants.  No distribution is going to do something like this (and not
all libraries might even be compilable in c++11 mode), and this effectively
prevents c++11 to be used _at all_ in development.

This report is a result from a real issue we had: our package management
is written in C++, and the library author started using c++11.  The
applications linking against that component were still at c++98 leading
to crashes.  I had to advise them to control their exports or stop using
c++11.  Luckily we were in the position to do the former, but if it had
been a third party lib there would have been no way out (rewriting the
apps isn't possible because we don't control all of them).
Comment 1 Jonathan Wakely 2012-06-12 13:38:01 UTC
(In reply to comment #0)
> In this specific case the problem is the Rb_tree::equal_range function,
> which returns a pair, under cxx98 that's POD (returned via registers), under 
> cxx11 it's not POD and returned via invisible reference.

Ah, so that explains it.  Same issue as http://gcc.gnu.org/ml/gcc/2012-05/msg00409.html
Comment 2 Jonathan Wakely 2012-06-12 14:19:33 UTC
N.B. std::pair is not a POD in c++98 or c++11, so I don't know what libstdc++ could have done to cause the FE to change how it returns a std::pair.
Comment 3 Jonathan Wakely 2012-06-12 14:27:58 UTC
I don't think this is a libstdc++ issue, precompiling the code with 4.7 and then compiling with 4.8 still segfaults, so it's a FE change not a libstdc++ change.
Comment 4 Jonathan Wakely 2012-06-12 14:29:03 UTC
(In reply to comment #3)
> I don't think this is a libstdc++ issue, precompiling ...

Sorry, brainfart, I meant preprocessing
Comment 5 Michael Matz 2012-06-12 15:36:01 UTC
(In reply to comment #2)
> N.B. std::pair is not a POD in c++98 or c++11, so I don't know what libstdc++
> could have done to cause the FE to change how it returns a std::pair.

I don't know if it's PODness (but I believe that it is), but the calling
convention is changed by the existence of this c++11 construct in std::pair:

      pair(pair&& __p)
      noexcept(__and_<is_nothrow_move_constructible<_T1>,
               is_nothrow_move_constructible<_T2>>::value)
      : first(std::forward<first_type>(__p.first)),
        second(std::forward<second_type>(__p.second)) { }

It's some sort of copy-ctor, right?  In any case when it's there (and it's
only there when compiling/preprocessing in c++11 mode), then the frontend
makes this type be TREE_ADDRESSABLE, aggregate_value_p will return true,
and that makes the ABI use an invisible reference.
Comment 6 Michael Matz 2012-06-12 15:41:49 UTC
FWIW, it's finish_struct_bits setting TREE_ADDRESSABLE, because type_has_nontrivial_copy_init returns true for pair with that ctor.
I think this indeed makes pair non-POD.
Comment 7 Jonathan Wakely 2012-06-12 15:50:05 UTC
Trivially copyable is just one small part of the POD requirements. std::pair has always been non-POD, even in c++98, but in c++98 it is trivially copyable, in c++11 that move constructor is non-trivial.

It can be made trivial by changing it to:

      pair(pair&& __p)
      noexcept(__and_<is_nothrow_move_constructible<_T1>,
               is_nothrow_move_constructible<_T2>>::value)
      = default;

That might fix this problem, could you test it?
Comment 8 Jonathan Wakely 2012-06-12 15:53:15 UTC
Ah, the stl_pair.h header has a comment (I think from Paolo) saying that defaulting that move ctor breaks std::map:

      // XXX Defaulted?!? Breaks std::map!!!
Comment 9 Jonathan Wakely 2012-06-12 15:57:16 UTC
Defaulting that move-ctor fixes the issue referred to in comment 1 too.

I think we need to find out if that comment is still relevant and fix it if it is, so we can default the move-ctor.
Comment 10 Michael Matz 2012-06-12 16:02:28 UTC
Yep, defaulting that ctor changes the ABI back to register passing.
If we could change that in libstdc++, all the better, but I still think the
issue is larger than just this specific case.
Comment 11 Paolo Carlini 2012-06-12 16:04:58 UTC
Daniel should have all the details. It might be possible to do the change *together* with changing the constraining in the various container::insert to use is_constructible instead of is_convertible (we got PRs about this, but please also get details from Daniel). Actually, we may be close to being able to do this, mainline only of course, but really run the testsuite to completion and be ready for fallouts in unexpected places (I didn't feel brave enough to try for 4.7 ;)
Comment 12 Paolo Carlini 2012-06-12 16:12:55 UTC
If I remember correctly, the last time I tried, default + is_constructible worked pretty well modulo testcases sensitive to access control under sfinae. But the latter we are going to implment anyway for 4.8, thus temporary FIXME or something in the testcases would be fine.
Comment 13 Paolo Carlini 2012-06-13 09:38:44 UTC
Daniel, if you have two free minutes, can you update me about the famous issue with is_constructible vs is_convertible for constraining container::insert? I seemed to remember that at some point we agreed on the reflector that this is the way to go (and I badly need is_constructible if I want to try again defaulting pair' move constructor) but now I have trouble finding that recorded in an actual PR.
Comment 14 Jonathan Wakely 2012-06-13 10:22:32 UTC
Michael, would you prefer a separate PR for the specific pair(pair&&) issue, to keep this for the general problem of mxing incompatible libs?
Comment 15 Michael Matz 2012-06-13 12:07:38 UTC
I think so, yes.  I initially really reported this as general c++ problem,
with the testcase of course being about a concrete instance of the problem
but not meaning to specifically constrain it to libstdc++ only (that was
reclassified by Jakub).
Comment 16 Jonathan Wakely 2012-06-13 14:54:23 UTC
OK, the pair(pair&&) issue is now PR 53657 and I've changed component back to c++ for this PR
Comment 17 Jason Merrill 2012-06-14 06:32:19 UTC
I don't see the general problem.  C++98 and C++11 code should be ABI-compatible in general; the incompatibility in this case is a bug.
Comment 18 Michael Matz 2012-06-21 14:36:29 UTC
Just today I had to debug another crash in our package management code.
This time a c++98 library picked up a std::list operator= from a c++11
library resulting in a crash.  See
  https://bugzilla.novell.com/show_bug.cgi?id=767666

It was all quite non-obvious, as many libraries and packages are involved
and it's not easy to find out what the problematic symbol resolution is.
This may be an intended ABI breakage in c++11 mode (the std::list change I
mean), but it has all the same problems as the unintended bug in std::pair.
Comment 19 Jonathan Wakely 2012-06-21 14:46:22 UTC
Do you have an opinion on PR 53673 ?
Would having those symbols make debugging such problems easier, or did you already know there was a mix of c++98 and c++11 objects and the difficult part was identifying exactly where the problem happened?
Comment 20 Michael Matz 2012-06-21 15:25:59 UTC
As I stumbled over this problem complex recently already I had a hunch that it
might again be a 11/98 mixture.  Having the symbols would have made that a
certainty.  But it wouldn't have helped me that much further.  I still would 
have had to find out which functions were causing the wild write (after all,
an 11/98 mixture in itself is not the problem, the library authors for
instance might have controlled their exports) in order to say with confidence
that this is really the cause for the problem, and not something like a
normal program error.

Of course if the implementation of PR 53673 would make it so, that such
programs can't even be link edited or if so, then at least not be run (the
dynamic linker could throw an error for instance) then this problem wouldn't
have turned up.  OTOH, that's fairly unforgiving and could still be worked
around by the user by for instance specifically hiding such magic symbol.
(Although, if he's able to hide that magic symbol he should also be able to
hide all the other reexported libstdc++ symbols).

I btw. wonder if all these weak symbols of libstdc++, that right now
are reexported by default, shouldn't get hidden visibility.  Would make
(non-inlined) calls faster, and avoid the cross-c++-std-domain resolving
leading to this problem.  At least on targets that support visibility.
Comment 21 Michael Matz 2012-06-21 15:32:07 UTC
(In reply to comment #20)
> certainty.  But it wouldn't have helped me that much further.  I still would 
> have had to find out which functions were causing the wild write in order
> to say with confidence that this is really the cause for the problem, and
> not something like a normal program error.

To expand on that: I had to do so because this particular std::list isn't part
of any external API of the libraries in question, it's not passed around
or returned, it's purely an internal implementation detail.  Our wiki page
makes it sound as if only using the problematic types in the external API
is an issue.  That is not the case, it's already just _using_ these types
that's problematic.