User account creation filtered due to spam.

Bug 77493 - [6/7/8 Regression] -fcrossjumping (-O2) on ppc64le causes segfaults (jump to 0x0) (first bad r230091)
Summary: [6/7/8 Regression] -fcrossjumping (-O2) on ppc64le causes segfaults (jump to ...
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 6.2.0
: P2 normal
Target Milestone: 8.0
Assignee: Not yet assigned to anyone
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2016-09-05 18:38 UTC by David Abdurachmanov
Modified: 2017-04-19 21:50 UTC (History)
7 users (show)

See Also:
Host:
Target: powerpc64le-unknown-linux-gnu
Build:
Known to work: 5.3.0
Known to fail: 6.2.0
Last reconfirmed: 2016-12-06 00:00:00


Attachments
pre-processed source code (302.17 KB, application/x-bzip2)
2016-09-05 18:38 UTC, David Abdurachmanov
Details
generated assembly with last good commit (33.40 KB, application/x-bzip2)
2016-09-05 18:39 UTC, David Abdurachmanov
Details
generated assembly with the first bad commit (33.40 KB, application/x-bzip2)
2016-09-05 18:40 UTC, David Abdurachmanov
Details
generated assembly with the first bad commit + pragma no-crossjumping (34.41 KB, application/x-bzip2)
2016-09-05 18:40 UTC, David Abdurachmanov
Details

Note You need to log in before you can comment on or make changes to this bug.
Description David Abdurachmanov 2016-09-05 18:38:46 UTC
Created attachment 39568 [details]
pre-processed source code

Finally I moved our PPC64LE builds to GCC 6.2.0 from 5.3.0 and quickly found that majority of workflows are segfaulting in a similar strange manner.

Quickly noticed that this starts happening with -O2, works fine with -O1. Started bisection of GCC -O2 flags and it -fcrossjumping is the causing it. Found that most issues are caused by MuonTrajectoryUpdator.o. Then started hunting for the function causing regression. Found that it was boost/smart_ptr/intrusive_ptr.hpp (from 1.57.0 version).

Doing the following resolves the issue:

165 #pragma GCC push_options
166 #pragma GCC optimize ("no-crossjumping")
167
168     T & operator*() const
169     {
170         BOOST_ASSERT( px != 0 );
171         return *px;
172     }
173
174 #pragma GCC pop_options

Started GCC bisection and did it twice with the same result.

First bad commit r230091 or b873d7f7eba25120074b3d74add21224e860df43
Last good commit r230090 or 79a77fee9f8eebf5bdd2b4868a0ba18b3ad3b74d

Does not happen on x86_64 and I still didn't move AArch64 to GCC 6.2.0.

This is quick look into GDB:

Dump of assembler code for function Chi2MeasurementEstimator::estimate(TrajectoryStateOnSurface const&, TrackingRecHit const&) const:
   0x00003fff985a7e40 <+0>:     addis   r2,r12,4
   0x00003fff985a7e44 <+4>:     addi    r2,r2,-320
=> 0x00003fff985a7e48 <+8>:     mflr    r0
   0x00003fff985a7e4c <+12>:    std     r27,-40(r1)
   0x00003fff985a7e50 <+16>:    std     r28,-32(r1)
   0x00003fff985a7e54 <+20>:    mr      r28,r4
   0x00003fff985a7e58 <+24>:    std     r30,-16(r1)
   0x00003fff985a7e5c <+28>:    std     r31,-8(r1)
   0x00003fff985a7e60 <+32>:    mr      r27,r3
   0x00003fff985a7e64 <+36>:    mr      r3,r5
   0x00003fff985a7e68 <+40>:    std     r29,-24(r1)
   0x00003fff985a7e6c <+44>:    mr      r31,r5
   0x00003fff985a7e70 <+48>:    std     r0,16(r1)
   0x00003fff985a7e74 <+52>:    stdu    r1,-288(r1)
   0x00003fff985a7e78 <+56>:    ld      r9,0(r5) // beginning of TrackingRecHit
   0x00003fff985a7e7c <+60>:    ld      r4,88(r9) // we load 0x0 (loading function address from vtable?)
   0x00003fff985a7e80 <+64>:    std     r2,24(r1)
   0x00003fff985a7e84 <+68>:    mtctr   r4         // we load CTR with 0x0
   0x00003fff985a7e88 <+72>:    mr      r12,r4    
   0x00003fff985a7e8c <+76>:    bctrl              // ucondition branch to CTR with is 0x0
   0x00003fff985a7e90 <+80>:    ld      r2,24(r1)
   0x00003fff985a7e94 <+84>:    li      r5,21
   0x00003fff985a7e98 <+88>:    mr      r30,r3

Code looks like:

34 std::pair<bool,double>
 35 Chi2MeasurementEstimator::estimate(const TrajectoryStateOnSurface& tsos,
 36                                    const TrackingRecHit& aRecHit) const {
 37     switch (aRecHit.dimension()) {
 38         case 1: return returnIt(lestimate<1>(tsos,aRecHit));
 39         case 2: return returnIt(lestimate<2>(tsos,aRecHit));
 40         case 3: return returnIt(lestimate<3>(tsos,aRecHit));
 41         case 4: return returnIt(lestimate<4>(tsos,aRecHit));
 42         case 5: return returnIt(lestimate<5>(tsos,aRecHit));
 43     }
 44     throw cms::Exception("RecHit of invalid size (not 1,2,3,4,5)");
 45 }

We are failing trying to make this call: aRecHit.dimension() [line 37].
Comment 1 David Abdurachmanov 2016-09-05 18:39:21 UTC
Created attachment 39569 [details]
generated assembly with last good commit
Comment 2 David Abdurachmanov 2016-09-05 18:40:13 UTC
Created attachment 39570 [details]
generated assembly with the first bad commit
Comment 3 David Abdurachmanov 2016-09-05 18:40:48 UTC
Created attachment 39571 [details]
generated assembly with the first bad commit + pragma no-crossjumping
Comment 4 Andrew Pinski 2016-09-05 18:51:40 UTC
Have you tried to see if there is any undefined code here?  Like say -fsantizied=undefined ?
Comment 5 David Abdurachmanov 2016-09-05 19:47:36 UTC
I tried running valgrind on both x86_64 and ppc64le. It was running for 2 days and didn't report any invalid reads / writes to the memory. My first guess was that we are corrupting memory.

I added -g -fsanitize=undefined -fno-omit-frame-pointer [-static-libubsan] on involved pieces and it did produce some errors. None of that seems to touch troublesome part of code, but with these flags code works again.

So far I am running builds with -fno-crossjumping and that seems to solve all issues.

Also, trying to bring back ASan builds, but that will take time thus I will keep looking into this.
Comment 6 Andrew Pinski 2016-09-05 19:51:35 UTC
Try a few more options then:
-fno-delete-null-pointer-checks 
-fwrapv

-fno-devirtualize
Comment 7 David Abdurachmanov 2016-09-05 19:54:24 UTC
I did try -fno-devirtualize and -fno-delete-null-pointer-checks without success, but will re-try again now.
Comment 8 David Abdurachmanov 2016-09-05 20:09:07 UTC
Can confirm that '-fno-delete-null-pointer-checks -fwrapv -fno-devirtualize' don't change the outcode, fails the same.
Comment 9 Richard Biener 2016-09-06 07:09:21 UTC
The bisection result is odd as that only changes vectorization (whether it happens or not), and vectorization is _not_ enabled at -O2.

The attached assembly doesn't contain Chi2MeasurementEstimator::estimate as far as I can see.

Note that doing

165 #pragma GCC push_options
166 #pragma GCC optimize ("no-crossjumping")
167
168     T & operator*() const
169     {
170         BOOST_ASSERT( px != 0 );
171         return *px;
172     }
173
174 #pragma GCC pop_options

probably has no effect on the operator itself but it prevents inlining it.
Are BOOST_ASSERT assertions enabled?  Can you enable them _and_ compile
with -fno-delete-null-pointer-checks?

Btw, another not mentioned issue people hit with GCC 6 (see also https://gcc.gnu.org/gcc-6/porting_to.html) is dead code elimination around
constructors/destructors.  Use -flifetime-dse=1 to restore GCC 5 behavior
(or -fno-lifetime-dse to pre-GCC 5 behavior).
Comment 10 David Abdurachmanov 2016-09-06 22:30:27 UTC
We have been bitten by DSE with TBB already on GCC 6.2.0.

I decided to try -fno-lifetime-dse on the first bad commit and got some surprising results. Now it fails with:

pure virtual method called
terminate called without an active exception

In this case we have a standard library container with 7 elements, std::shared_ptr<TrackingRecHit>, where TrackingRecHit is abstract class, which has:

  virtual int dimension() const = 0;

If I dump the actual type: typeid(*(x.get())).name() the actual object is of MuonTransientTrackingRecHit type.

Then the code does std::stable_sort on the collection and now in some cases MuonTransientTrackingRecHit changes to TrackingRecHit (abstract class). Thus we are really calling pure virtual method.

This does not happen on x86_64.

I guess, this could be caused by actual MuonTransientTrackingRecHit being destroyed if the last std::shared_ptr point to it goes away.

DUMP_BEFORE: std::shared_ptr<TrackingRecHit const>
DUMP_BEFORE_TYPE: MuonTransientTrackingRecHit
DUMP_BEFORE: std::shared_ptr<TrackingRecHit const>
DUMP_BEFORE_TYPE: MuonTransientTrackingRecHit
DUMP_BEFORE: std::shared_ptr<TrackingRecHit const>
DUMP_BEFORE_TYPE: MuonTransientTrackingRecHit
DUMP_BEFORE: std::shared_ptr<TrackingRecHit const>
DUMP_BEFORE_TYPE: MuonTransientTrackingRecHit
DUMP_BEFORE: std::shared_ptr<TrackingRecHit const>
DUMP_BEFORE_TYPE: MuonTransientTrackingRecHit
DUMP_BEFORE: std::shared_ptr<TrackingRecHit const>
DUMP_BEFORE_TYPE: MuonTransientTrackingRecHit
DUMP_BEFORE: std::shared_ptr<TrackingRecHit const>
DUMP_BEFORE_TYPE: MuonTransientTrackingRecHit

// after std::stable_sort

DUMP: std::shared_ptr<TrackingRecHit const>
DUMP_TYPE: MuonTransientTrackingRecHit
DUMP: std::shared_ptr<TrackingRecHit const>
DUMP_TYPE: MuonTransientTrackingRecHit
DUMP: std::shared_ptr<TrackingRecHit const>
DUMP_TYPE: TrackingRecHit
DUMP: std::shared_ptr<TrackingRecHit const>
DUMP_TYPE: TrackingRecHit
DUMP: std::shared_ptr<TrackingRecHit const>
DUMP_TYPE: MuonTransientTrackingRecHit
DUMP: std::shared_ptr<TrackingRecHit const>
DUMP_TYPE: TrackingRecHit
DUMP: std::shared_ptr<TrackingRecHit const>
DUMP_TYPE: TrackingRecHit

SIZE: 7
Comment 11 Richard Biener 2016-09-07 07:19:42 UTC
CCing libstdc++ people -- not sure if std::stable_sort (on which kind of collection?) is safe for std::shared_ptr.

Would be nice if you could track down where the object gets destroyed btw
(maybe use a watchpoint on the vtable slot?)
Comment 12 David Abdurachmanov 2016-09-07 07:32:47 UTC
Looks to be std::vector<std::shared_ptr<TrackingRecHit const>> (aka ConstRecHitContainer).

Here is the class and typedef for the containers: https://github.com/cms-sw/cmssw/blob/CMSSW_8_1_X/DataFormats/TrackingRecHit/interface/TrackingRecHit.h#L35

Here we call sort which is a member function (at the end of the file): https://github.com/cms-sw/cmssw/blob/CMSSW_8_1_X/RecoMuon/TrackingTools/src/MuonTrajectoryUpdator.cc#L119

That member function calls stable_sort.

stable_sort is called here: https://github.com/cms-sw/cmssw/blob/CMSSW_8_1_X/RecoMuon/TrackingTools/src/MuonTrajectoryUpdator.cc#L260

This is the comparison function used: https://github.com/cms-sw/cmssw/blob/CMSSW_8_1_X/RecoMuon/TrackingTools/interface/MuonTrajectoryUpdator.h#L98

Will try to check where dtor is happening.
Comment 13 Jonathan Wakely 2016-09-07 09:31:22 UTC
(In reply to Richard Biener from comment #11)
> CCing libstdc++ people -- not sure if std::stable_sort (on which kind of
> collection?) is safe for std::shared_ptr.

It's required to work correctly. It should be equivalent to sorting TrackingRecHit* pointers.
Comment 14 Richard Biener 2016-11-22 08:03:23 UTC
Actually P2, we've released 6.2 with this bug so it doesn't block a 6.3 release.  It's also still UNCONFIRMED (we prefer to leave those at P3).
Comment 15 Aldy Hernandez 2016-12-06 09:58:33 UTC
David, could you verify that this is still happening on trunk?

I can't reproduce on a ppc64le box with a stage1 build.  Compilation succeeds:

$ ~/gcc/configure --disable-multilib --disable-bootstrap 
$ make all-gcc
$ cd gcc
$ ./xg++ -B./ -O2 ~/MuonTrajectoryUpdator2.ii -c -w
/home/aldyh/MuonTrajectoryUpdator2.ii: In instantiation of ‘class Basic2DVector<long double>’:
/home/aldyh/MuonTrajectoryUpdator2.ii:47851:11:   required from here
/home/aldyh/MuonTrajectoryUpdator2.ii:47371:7: note: the layout of aggregates containing vectors with 32-byte alignment has changed in GCC 5
 class Basic2DVector {
       ^~~~~~~~~~~~~
$

But perhaps what you are mean is that there is a segfault executing your program, not compiling it.  In which case, could you reduce your testcase (a lot :)), and provide a standalone testcase that runs, that we could use to reproduce the problem?
Comment 16 Jakub Jelinek 2016-12-21 10:57:26 UTC
GCC 6.3 is being released, adjusting target milestone.
Comment 17 Jakub Jelinek 2017-03-09 15:01:09 UTC
David, can you reproduce with latest trunk?  Any chance to have some reduced self-contained testcase (including main that invokes it)?
Comment 18 Jeffrey A. Law 2017-04-19 21:50:14 UTC
I really don't see anything we can do here without more information.

As Richi noted, the assembly codes don't include  Chi2MeasurementEstimator::estimate.

There's no runnable testcase.

The included .s files aren't particularly useful.  

I'm going to keep this open for now, but we really need David to engage here so that we can dig into the issue.  If we can't get more information from David, we'll have to close with insufficient data.