User account creation filtered due to spam.
Created attachment 39568 [details] pre-processed source code Finally I moved our PPC64LE builds to GCC 6.2.0 from 5.3.0 and quickly found that majority of workflows are segfaulting in a similar strange manner. Quickly noticed that this starts happening with -O2, works fine with -O1. Started bisection of GCC -O2 flags and it -fcrossjumping is the causing it. Found that most issues are caused by MuonTrajectoryUpdator.o. Then started hunting for the function causing regression. Found that it was boost/smart_ptr/intrusive_ptr.hpp (from 1.57.0 version). Doing the following resolves the issue: 165 #pragma GCC push_options 166 #pragma GCC optimize ("no-crossjumping") 167 168 T & operator*() const 169 { 170 BOOST_ASSERT( px != 0 ); 171 return *px; 172 } 173 174 #pragma GCC pop_options Started GCC bisection and did it twice with the same result. First bad commit r230091 or b873d7f7eba25120074b3d74add21224e860df43 Last good commit r230090 or 79a77fee9f8eebf5bdd2b4868a0ba18b3ad3b74d Does not happen on x86_64 and I still didn't move AArch64 to GCC 6.2.0. This is quick look into GDB: Dump of assembler code for function Chi2MeasurementEstimator::estimate(TrajectoryStateOnSurface const&, TrackingRecHit const&) const: 0x00003fff985a7e40 <+0>: addis r2,r12,4 0x00003fff985a7e44 <+4>: addi r2,r2,-320 => 0x00003fff985a7e48 <+8>: mflr r0 0x00003fff985a7e4c <+12>: std r27,-40(r1) 0x00003fff985a7e50 <+16>: std r28,-32(r1) 0x00003fff985a7e54 <+20>: mr r28,r4 0x00003fff985a7e58 <+24>: std r30,-16(r1) 0x00003fff985a7e5c <+28>: std r31,-8(r1) 0x00003fff985a7e60 <+32>: mr r27,r3 0x00003fff985a7e64 <+36>: mr r3,r5 0x00003fff985a7e68 <+40>: std r29,-24(r1) 0x00003fff985a7e6c <+44>: mr r31,r5 0x00003fff985a7e70 <+48>: std r0,16(r1) 0x00003fff985a7e74 <+52>: stdu r1,-288(r1) 0x00003fff985a7e78 <+56>: ld r9,0(r5) // beginning of TrackingRecHit 0x00003fff985a7e7c <+60>: ld r4,88(r9) // we load 0x0 (loading function address from vtable?) 0x00003fff985a7e80 <+64>: std r2,24(r1) 0x00003fff985a7e84 <+68>: mtctr r4 // we load CTR with 0x0 0x00003fff985a7e88 <+72>: mr r12,r4 0x00003fff985a7e8c <+76>: bctrl // ucondition branch to CTR with is 0x0 0x00003fff985a7e90 <+80>: ld r2,24(r1) 0x00003fff985a7e94 <+84>: li r5,21 0x00003fff985a7e98 <+88>: mr r30,r3 Code looks like: 34 std::pair<bool,double> 35 Chi2MeasurementEstimator::estimate(const TrajectoryStateOnSurface& tsos, 36 const TrackingRecHit& aRecHit) const { 37 switch (aRecHit.dimension()) { 38 case 1: return returnIt(lestimate<1>(tsos,aRecHit)); 39 case 2: return returnIt(lestimate<2>(tsos,aRecHit)); 40 case 3: return returnIt(lestimate<3>(tsos,aRecHit)); 41 case 4: return returnIt(lestimate<4>(tsos,aRecHit)); 42 case 5: return returnIt(lestimate<5>(tsos,aRecHit)); 43 } 44 throw cms::Exception("RecHit of invalid size (not 1,2,3,4,5)"); 45 } We are failing trying to make this call: aRecHit.dimension() [line 37].
Created attachment 39569 [details] generated assembly with last good commit
Created attachment 39570 [details] generated assembly with the first bad commit
Created attachment 39571 [details] generated assembly with the first bad commit + pragma no-crossjumping
Have you tried to see if there is any undefined code here? Like say -fsantizied=undefined ?
I tried running valgrind on both x86_64 and ppc64le. It was running for 2 days and didn't report any invalid reads / writes to the memory. My first guess was that we are corrupting memory. I added -g -fsanitize=undefined -fno-omit-frame-pointer [-static-libubsan] on involved pieces and it did produce some errors. None of that seems to touch troublesome part of code, but with these flags code works again. So far I am running builds with -fno-crossjumping and that seems to solve all issues. Also, trying to bring back ASan builds, but that will take time thus I will keep looking into this.
Try a few more options then: -fno-delete-null-pointer-checks -fwrapv -fno-devirtualize
I did try -fno-devirtualize and -fno-delete-null-pointer-checks without success, but will re-try again now.
Can confirm that '-fno-delete-null-pointer-checks -fwrapv -fno-devirtualize' don't change the outcode, fails the same.
The bisection result is odd as that only changes vectorization (whether it happens or not), and vectorization is _not_ enabled at -O2. The attached assembly doesn't contain Chi2MeasurementEstimator::estimate as far as I can see. Note that doing 165 #pragma GCC push_options 166 #pragma GCC optimize ("no-crossjumping") 167 168 T & operator*() const 169 { 170 BOOST_ASSERT( px != 0 ); 171 return *px; 172 } 173 174 #pragma GCC pop_options probably has no effect on the operator itself but it prevents inlining it. Are BOOST_ASSERT assertions enabled? Can you enable them _and_ compile with -fno-delete-null-pointer-checks? Btw, another not mentioned issue people hit with GCC 6 (see also https://gcc.gnu.org/gcc-6/porting_to.html) is dead code elimination around constructors/destructors. Use -flifetime-dse=1 to restore GCC 5 behavior (or -fno-lifetime-dse to pre-GCC 5 behavior).
We have been bitten by DSE with TBB already on GCC 6.2.0. I decided to try -fno-lifetime-dse on the first bad commit and got some surprising results. Now it fails with: pure virtual method called terminate called without an active exception In this case we have a standard library container with 7 elements, std::shared_ptr<TrackingRecHit>, where TrackingRecHit is abstract class, which has: virtual int dimension() const = 0; If I dump the actual type: typeid(*(x.get())).name() the actual object is of MuonTransientTrackingRecHit type. Then the code does std::stable_sort on the collection and now in some cases MuonTransientTrackingRecHit changes to TrackingRecHit (abstract class). Thus we are really calling pure virtual method. This does not happen on x86_64. I guess, this could be caused by actual MuonTransientTrackingRecHit being destroyed if the last std::shared_ptr point to it goes away. DUMP_BEFORE: std::shared_ptr<TrackingRecHit const> DUMP_BEFORE_TYPE: MuonTransientTrackingRecHit DUMP_BEFORE: std::shared_ptr<TrackingRecHit const> DUMP_BEFORE_TYPE: MuonTransientTrackingRecHit DUMP_BEFORE: std::shared_ptr<TrackingRecHit const> DUMP_BEFORE_TYPE: MuonTransientTrackingRecHit DUMP_BEFORE: std::shared_ptr<TrackingRecHit const> DUMP_BEFORE_TYPE: MuonTransientTrackingRecHit DUMP_BEFORE: std::shared_ptr<TrackingRecHit const> DUMP_BEFORE_TYPE: MuonTransientTrackingRecHit DUMP_BEFORE: std::shared_ptr<TrackingRecHit const> DUMP_BEFORE_TYPE: MuonTransientTrackingRecHit DUMP_BEFORE: std::shared_ptr<TrackingRecHit const> DUMP_BEFORE_TYPE: MuonTransientTrackingRecHit // after std::stable_sort DUMP: std::shared_ptr<TrackingRecHit const> DUMP_TYPE: MuonTransientTrackingRecHit DUMP: std::shared_ptr<TrackingRecHit const> DUMP_TYPE: MuonTransientTrackingRecHit DUMP: std::shared_ptr<TrackingRecHit const> DUMP_TYPE: TrackingRecHit DUMP: std::shared_ptr<TrackingRecHit const> DUMP_TYPE: TrackingRecHit DUMP: std::shared_ptr<TrackingRecHit const> DUMP_TYPE: MuonTransientTrackingRecHit DUMP: std::shared_ptr<TrackingRecHit const> DUMP_TYPE: TrackingRecHit DUMP: std::shared_ptr<TrackingRecHit const> DUMP_TYPE: TrackingRecHit SIZE: 7
CCing libstdc++ people -- not sure if std::stable_sort (on which kind of collection?) is safe for std::shared_ptr. Would be nice if you could track down where the object gets destroyed btw (maybe use a watchpoint on the vtable slot?)
Looks to be std::vector<std::shared_ptr<TrackingRecHit const>> (aka ConstRecHitContainer). Here is the class and typedef for the containers: https://github.com/cms-sw/cmssw/blob/CMSSW_8_1_X/DataFormats/TrackingRecHit/interface/TrackingRecHit.h#L35 Here we call sort which is a member function (at the end of the file): https://github.com/cms-sw/cmssw/blob/CMSSW_8_1_X/RecoMuon/TrackingTools/src/MuonTrajectoryUpdator.cc#L119 That member function calls stable_sort. stable_sort is called here: https://github.com/cms-sw/cmssw/blob/CMSSW_8_1_X/RecoMuon/TrackingTools/src/MuonTrajectoryUpdator.cc#L260 This is the comparison function used: https://github.com/cms-sw/cmssw/blob/CMSSW_8_1_X/RecoMuon/TrackingTools/interface/MuonTrajectoryUpdator.h#L98 Will try to check where dtor is happening.
(In reply to Richard Biener from comment #11) > CCing libstdc++ people -- not sure if std::stable_sort (on which kind of > collection?) is safe for std::shared_ptr. It's required to work correctly. It should be equivalent to sorting TrackingRecHit* pointers.
Actually P2, we've released 6.2 with this bug so it doesn't block a 6.3 release. It's also still UNCONFIRMED (we prefer to leave those at P3).
David, could you verify that this is still happening on trunk? I can't reproduce on a ppc64le box with a stage1 build. Compilation succeeds: $ ~/gcc/configure --disable-multilib --disable-bootstrap $ make all-gcc $ cd gcc $ ./xg++ -B./ -O2 ~/MuonTrajectoryUpdator2.ii -c -w /home/aldyh/MuonTrajectoryUpdator2.ii: In instantiation of ‘class Basic2DVector<long double>’: /home/aldyh/MuonTrajectoryUpdator2.ii:47851:11: required from here /home/aldyh/MuonTrajectoryUpdator2.ii:47371:7: note: the layout of aggregates containing vectors with 32-byte alignment has changed in GCC 5 class Basic2DVector { ^~~~~~~~~~~~~ $ But perhaps what you are mean is that there is a segfault executing your program, not compiling it. In which case, could you reduce your testcase (a lot :)), and provide a standalone testcase that runs, that we could use to reproduce the problem?
GCC 6.3 is being released, adjusting target milestone.
David, can you reproduce with latest trunk? Any chance to have some reduced self-contained testcase (including main that invokes it)?
I really don't see anything we can do here without more information. As Richi noted, the assembly codes don't include Chi2MeasurementEstimator::estimate. There's no runnable testcase. The included .s files aren't particularly useful. I'm going to keep this open for now, but we really need David to engage here so that we can dig into the issue. If we can't get more information from David, we'll have to close with insufficient data.