As described in <http://gcc.gnu.org/ml/gcc-bugs/2003-06/threads.html#02040>, gcc-3.3/3.4 miscompiles mozilla-1.4 leading to a segfault at startup. Due to the kind help of Kevin Hendricks there is now a simplified testcase that shows the bug if compiled with -O2 -fno-exceptions. It is not clear yet if the code itself is aliasing safe.
Created attachment 4304 [details] Full version of testcase The miscompiled routine is inmContainerGIF::DecodingComplete. This version needs to be linked against the mozilla libs -lnspr -lplc4 to be executable.
Created attachment 4305 [details] Simplified testcase
The bug cannot be reproduced on i686-pc-cygwin. Anyway, I have some comments on the code: - Are the specializations of nsCOMPtr<> and nsGetterAddRefs<> really necessary to trigger the bug? They look to me just manual replacements of the template argument. If the are needed, did you try removing the definition of the general templates? - Are all the member functions of the templates really used by the program? - This: nsDerivedSafe<T>* get() const { return reinterpret_cast< nsDerivedSafe<T>* >(mRawPtr); } is not typesafe and generates undefined behaviour if you do anything with the returned pointer BUT casting it back to its original type (nsISupports*).
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled Hi, > - Are the specializations of nsCOMPtr<> and nsGetterAddRefs<> really > necessary to trigger the bug? They look to me just manual replacements of > the template argument. If the are needed, did you try removing the > definition of the general templates? I believe so yes. I did try removing the general templates but ran into problems trying to compile this. All of my attempts to recreate this problem with a general type did not succeed. The nsCOMPtr template is some sort of smart pointer used in libxpcom that I am not sure I understand. The specialization seems to be on handling the nsISupports xpcom interface types. The real test problem is huge and involved numerous libraries. I just tried to help out by narrowing down the problem to a standalone testcase and I had to bascially cut and paste the nsCOMPtr templates in order to see the problem at all. I was afraid if I did too much hacking away of that template class and the getter routines (since I did not udnerstand them well) I would introduce a new problem. > - Are all the member functions of the templates really used by the program? I am simply not sure. I will try to hack more away when I have more time later today but I was afraid to do so until I understood what was actaully going on and how the variaous templates interact. This is pretty much out of my league since I just dabble in C++. My single goal was to recreate the problem in a standalone test case and try not to introduce any new problems when doing so. > - This: > > nsDerivedSafe<T>* > get() const > { > return reinterpret_cast< nsDerivedSafe<T>* >(mRawPtr); > } > is not typesafe and generates undefined behaviour if you do anything with > the returned pointer BUT casting it back to its original type > (nsISupports*). Kevin
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled Hi, > > - Are the specializations of nsCOMPtr<> and nsGetterAddRefs<> really > > necessary to trigger the bug? I just rechecked this and they are not. I was able to remove both of those specializations and I still see the same problem. Kevin
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled Hi, Here is an even simpler test case with the specialized templates removed. cat tcase2a.cxx #include <cstdlib> #include <cstdio> typedef unsigned int nsrefcnt; typedef unsigned int nsresult; class nsISupports { public: virtual nsrefcnt AddRef(void) = 0; virtual nsrefcnt Release(void) = 0; }; class mc0 : public nsISupports { public: int hello; int setHello(int j) { hello = j; } int SetMutable(int k) { setHello(k); } nsrefcnt AddRef(void) { hello++; } nsrefcnt Release(void) { hello--; } }; template <class T> class nsDerivedSafe : public T { private: nsrefcnt AddRef(void); nsrefcnt Release(void); protected: nsDerivedSafe(); }; template <class T> nsrefcnt nsDerivedSafe<T>::AddRef() { return 0; } template <class T> nsrefcnt nsDerivedSafe<T>::Release() { return 0; } class nsCOMPtr_base { public: nsCOMPtr_base( nsISupports* rawPtr = 0 ) : mRawPtr(rawPtr) { } ~nsCOMPtr_base(); void assign_with_AddRef( nsISupports* ); void** begin_assignment(); protected: nsISupports* mRawPtr; void assign_assuming_AddRef( nsISupports* newPtr ) { nsISupports* oldPtr = mRawPtr; mRawPtr = newPtr; if ( oldPtr ) (oldPtr)->Release(); } }; nsCOMPtr_base::~nsCOMPtr_base() { if ( mRawPtr ) (mRawPtr)->Release(); } void nsCOMPtr_base::assign_with_AddRef( nsISupports* rawPtr ) { if ( rawPtr ) (rawPtr)->AddRef(); assign_assuming_AddRef(rawPtr); } void** nsCOMPtr_base::begin_assignment() { assign_assuming_AddRef(0); return reinterpret_cast< void** >(&mRawPtr); } template <class T> class nsCOMPtr : private nsCOMPtr_base { public: typedef T element_type; nsCOMPtr() : nsCOMPtr_base(0) { } nsCOMPtr( const nsCOMPtr<T>& aSmartPtr ) : nsCOMPtr_base(aSmartPtr.mRawPtr) { if ( mRawPtr ) (mRawPtr)->AddRef(); } nsCOMPtr( T* aRawPtr ) : nsCOMPtr_base(aRawPtr) { if ( mRawPtr ) (mRawPtr)->AddRef(); } nsCOMPtr<T>& operator=( const nsCOMPtr<T>& rhs ) { assign_with_AddRef(rhs.mRawPtr); return *this; } nsCOMPtr<T>& operator=( T* rhs ) { assign_with_AddRef(rhs); return *this; } nsDerivedSafe<T>* get() const { return reinterpret_cast< nsDerivedSafe<T>* >(mRawPtr); } operator nsDerivedSafe<T>*() const { return get(); } nsDerivedSafe<T>* operator->() const { return get(); } nsCOMPtr<T>* get_address() { return this; } const nsCOMPtr<T>* get_address() const { return this; } public: nsDerivedSafe<T>& operator*() const { return *get(); } T** StartAssignment() { return reinterpret_cast< T** >(begin_assignment()); } }; template <class T> inline nsCOMPtr<T>* address_of( nsCOMPtr<T>& aPtr ) { return aPtr.get_address(); } template <class T> inline const nsCOMPtr<T>* address_of( const nsCOMPtr<T>& aPtr ) { return aPtr.get_address(); } template <class T> class nsGetterAddRefs { public: explicit nsGetterAddRefs( nsCOMPtr<T>& aSmartPtr ) : mTargetSmartPtr(aSmartPtr) { } operator void**() { return reinterpret_cast< void** >(mTargetSmartPtr.StartAssignment()); } operator nsISupports**() { return reinterpret_cast< nsISupports** >(mTargetSmartPtr.StartAssignment()); } operator T**() { return mTargetSmartPtr.StartAssignment(); } T*& operator*() { return *(mTargetSmartPtr.StartAssignment()); } private: nsCOMPtr<T>& mTargetSmartPtr; }; template <class T> inline nsGetterAddRefs<T> getter_AddRefs( nsCOMPtr<T>& aSmartPtr ) { return nsGetterAddRefs<T>(aSmartPtr); } int main(int argc, char** argv) { nsISupports* aFrames[1]; nsCOMPtr<mc0> myframe; myframe = new mc0(); aFrames[0]=static_cast<nsISupports*> (myframe); nsCOMPtr<mc0> currentFrame; mc0** retval = getter_AddRefs(currentFrame); nsISupports* _elem = aFrames[0]; *retval = static_cast< mc0* >(_elem); currentFrame->SetMutable(0); fprintf(stderr,"done\n"); fflush(stderr); return 0; } Hope this helps. Kevin
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled Hi, Okay I took a shot at removing all of the functions/methods that I felt were not involved and have a simpler test case. cat tcase2b.cxx #include <cstdlib> #include <cstdio> typedef unsigned int nsrefcnt; typedef unsigned int nsresult; class nsISupports { public: virtual nsrefcnt AddRef(void) = 0; virtual nsrefcnt Release(void) = 0; }; class mc0 : public nsISupports { public: int hello; int setHello(int j) { hello = j; } int SetMutable(int k) { setHello(k); } nsrefcnt AddRef(void) { hello++; } nsrefcnt Release(void) { hello--; } }; template <class T> class nsDerivedSafe : public T { private: nsrefcnt AddRef(void); nsrefcnt Release(void); protected: nsDerivedSafe(); }; template <class T> nsrefcnt nsDerivedSafe<T>::AddRef() { return 0; } template <class T> nsrefcnt nsDerivedSafe<T>::Release() { return 0; } class nsCOMPtr_base { public: nsCOMPtr_base( nsISupports* rawPtr = 0 ) : mRawPtr(rawPtr) { } ~nsCOMPtr_base(); void assign_with_AddRef( nsISupports* ); void** begin_assignment(); protected: nsISupports* mRawPtr; void assign_assuming_AddRef( nsISupports* newPtr ) { nsISupports* oldPtr = mRawPtr; mRawPtr = newPtr; if ( oldPtr ) (oldPtr)->Release(); } }; nsCOMPtr_base::~nsCOMPtr_base() { if ( mRawPtr ) (mRawPtr)->Release(); } void nsCOMPtr_base::assign_with_AddRef( nsISupports* rawPtr ) { if ( rawPtr ) (rawPtr)->AddRef(); assign_assuming_AddRef(rawPtr); } void** nsCOMPtr_base::begin_assignment() { assign_assuming_AddRef(0); return reinterpret_cast< void** >(&mRawPtr); } template <class T> class nsCOMPtr : private nsCOMPtr_base { public: typedef T element_type; nsCOMPtr() : nsCOMPtr_base(0) { } nsCOMPtr( T* aRawPtr ) : nsCOMPtr_base(aRawPtr) { if ( mRawPtr ) (mRawPtr)->AddRef(); } nsCOMPtr<T>& operator=( const nsCOMPtr<T>& rhs ) { assign_with_AddRef(rhs.mRawPtr); return *this; } nsDerivedSafe<T>* get() const { return reinterpret_cast< nsDerivedSafe<T>* >(mRawPtr); } operator nsDerivedSafe<T>*() const { return get(); } nsDerivedSafe<T>* operator->() const { return get(); } public: T** StartAssignment() { return reinterpret_cast< T** >(begin_assignment()); } }; template <class T> class nsGetterAddRefs { public: explicit nsGetterAddRefs( nsCOMPtr<T>& aSmartPtr ) : mTargetSmartPtr(aSmartPtr) { } operator T**() { return mTargetSmartPtr.StartAssignment(); } private: nsCOMPtr<T>& mTargetSmartPtr; }; template <class T> inline nsGetterAddRefs<T> getter_AddRefs( nsCOMPtr<T>& aSmartPtr ) { return nsGetterAddRefs<T>(aSmartPtr); } int main(int argc, char** argv) { nsISupports* aFrames[1]; nsCOMPtr<mc0> myframe; myframe = new mc0(); aFrames[0]=static_cast<nsISupports*> (myframe); nsCOMPtr<mc0> currentFrame; mc0** retval = getter_AddRefs(currentFrame); nsISupports* _elem = aFrames[0]; *retval = static_cast< mc0* >(_elem); currentFrame->SetMutable(0); fprintf(stderr,"done\n"); fflush(stderr); return 0; } [kbhend@base1 huh]$ gcc -dumpversion 3.3 [kbhend@base1 huh]$ g++ -O2 -fno-exceptions -o tcase2b tcase2b.cxx [kbhend@base1 huh]$ ./tcase2b Segmentation fault [kbhend@base1 huh]$ g++ -O2 -o tcase2b tcase2b.cxx [kbhend@base1 huh]$ ./tcase2b Segmentation fault [kbhend@base1 huh]$ g++ -O2 -fno-exceptions -fno-strict-aliasing -o tcase2b tcase2b.cxx [kbhend@base1 huh]$ ./tcase2b done Hope this helps, Kevin
I cannot reproduce this bug with the 20030701 versions of 3.3.1 or 3.4 using the last test case in the audit trail on a Red Hat 8 system. Can anyone else?
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled At 04:08 02.07.2003, mmitchel at gcc dot gnu dot org wrote: >PLEASE REPLY TO gcc-bugzilla@gcc.gnu.org ONLY, *NOT* gcc-bugs@gcc.gnu.org. > >http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11376 > > > >------- Additional Comments From mmitchel at gcc dot gnu dot >org 2003-07-02 02:08 ------- >I cannot reproduce this bug with the 20030701 versions of 3.3.1 or 3.4 >using the >last test case in the audit trail on a Red Hat 8 system. Can anyone else? Mark, so far it has only be reproducible on powerpc-linux, can you try it on entropy? It's up again and there I can reproduce it with both the installed gcc-3_3-rhl-branch compiler and the current gcc-3_3-branch: [fsirl@entropy:~]$ g++ -O2 -fno-exceptions ~fsirl/bug-mozilla-5.cxx [fsirl@entropy:~]$ ./a.out Segmentation fault [fsirl@entropy:~]$ ~fsirl/obj/gcc33/gcc/g++ -B ~fsirl/obj/gcc33/gcc/ -isystem /usr/include/c++/3.3/ -isystem /usr/include/c++/3.3/ppc-linux/ -L ~fsirl/obj/gcc33/ppc-linux/libstdc++-v3/src/.libs/ -O2 -fno-exceptions ~fsirl/bug-mozilla-5.cxx [fsirl@entropy:~]$ ./a.out Segmentation fault Franz.
I believe the code is not legal. In main(), getter_addRefs calls nsGetterAddRefs<mc0> constructor, which calls nsComPtr<mc0>::StartAssignment. This function calls begin_assignment which casts &mRawPtr (whose type is nsiSupports**) to void** through reinterpret_cast. Then, this void** is casted to mc0** through reinterpret_cast. Now, mc0 is derived from nsiSupports. In short, this looks like: struct A {}; struct B : A {}; A* a; void** pv = reinterpret_cast<void**>(&a); B** pb = reinterpret_cast<B**>(pv); *pb = new B; assert (typeid(*a) == typeid(B)); I believe this violates ISO C++ aliasing rules, but I don't have a standard handy to double-check now. Kevin, can you please confirm that my snippet follows the code correctly? Nathan, do you think the code is legal?
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled Hi, This is basically what my first shot at a simpler test program looked like but I could never recreate the problem with code like this and ended up having to pull in the nsCOMPtr template and related support code to recreate the problem. I will try building this case up to see if it can recreate with this approach. Kevin > This function calls begin_assignment which casts &mRawPtr (whose type is > nsiSupports**) to void** through reinterpret_cast. Then, this void** is > casted to mc0** through reinterpret_cast. Now, mc0 is derived from > nsiSupports. In short, this looks like: > > struct A {}; > struct B : A {}; > A* a; > > void** pv = reinterpret_cast<void**>(&a); > B** pb = reinterpret_cast<B**>(pv); > *pb = new B; > assert (typeid(*a) == typeid(B)); > > I believe this violates ISO C++ aliasing rules, but I don't have a standard > handy to double-check now. > > Kevin, can you please confirm that my snippet follows the code correctly? > Nathan, do you think the code is legal? > > > > > ------- You are receiving this mail because: ------- > You are on the CC list for the bug, or are watching someone who is.
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled Hi, I could not recreate the bug with anything like the code you showed. So I went back to the last testcase I posted and removed all of the templates to make it easier to see what is going on and to further simplify things. I found something very strange (to me). [kbhend@base1 huh]$ g++ -O2 -o tcase3 tcase3.cxx [kbhend@base1 huh]$ ./tcase3 Segmentation fault [kbhend@base1 huh]$ g++ -O2 -DALL_INTERNAL -o tcase3 tcase3.cxx [kbhend@base1 huh]$ ./tcase3 done All ALL_INTERNAL does is control whether one method is defined internal to the class defintion or external to it. This makes no sense to me but I hope it might for someone who knows C++ well. Here is the tcase3.cxx #include <cstdlib> #include <cstdio> class Iface { public: virtual int Add(void) = 0; }; class M : public Iface { public: int i; int Add(void) { i++; return i; } int SetMutable(int k) { i = k; } }; class Safe : public M { private: int Add(void) { return 0; }; protected: Safe (); }; class Bbase { public: Bbase( Iface * rawPtr = 0 ) { a = rawPtr; } void assign( Iface * q ) { a = q; } #ifdef ALL_INTERNAL void ** begin_assign() { assign(0); return reinterpret_cast< void **> (&a); } #else void ** begin_assign(); #endif protected: Iface * a; }; #ifndef ALL_INTERNAL void ** Bbase::begin_assign() { assign(0); return reinterpret_cast< void **> (&a); } #endif class B : private Bbase { public: B(): Bbase(0) {} B( M* ptr ) : Bbase(ptr) {} M** StartAssign() { return reinterpret_cast< M**> (begin_assign()); } Safe* get () const { return reinterpret_cast< Safe* > (a); } operator Safe*() const { return get(); } Safe* operator->() const { return get(); } }; class C { public: C( B& bp) : mTarget(bp) {}; operator M** () { return mTarget.StartAssign(); } private: B& mTarget; }; inline C getter( B & bp) { return C(bp); } int main(int argc, char** argv) { Iface * p; B b1; b1 = new M(); p=static_cast< Iface* >(b1); B b2; M** retval = getter(b2); *retval = static_cast< M* >(p); b2->SetMutable(0); fprintf(stderr,"done\n"); fflush(stderr); return 0; } Hope this tells someone something. Kevin
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled On Monday, Jul 7, 2003, at 20:14 US/Eastern, kevin dot hendricks at sympatico dot ca wrote: > PLEASE REPLY TO gcc-bugzilla@gcc.gnu.org ONLY, *NOT* > gcc-bugs@gcc.gnu.org. > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11376 > > > > ------- Additional Comments From kevin dot hendricks at sympatico dot > ca 2003-07-08 00:13 ------- > Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled > > Hi, > > I could not recreate the bug with anything like the code you showed. > > So I went back to the last testcase I posted and removed all of > the templates to make it easier to see what is going on and to further > simplify things. > > I found something very strange (to me). > > [kbhend@base1 huh]$ g++ -O2 -o tcase3 tcase3.cxx > [kbhend@base1 huh]$ ./tcase3 > Segmentation fault > > > [kbhend@base1 huh]$ g++ -O2 -DALL_INTERNAL -o tcase3 tcase3.cxx > [kbhend@base1 huh]$ ./tcase3 > done > > > > All ALL_INTERNAL does is control whether one method is defined > internal to the class defintion or external to it. > > This makes no sense to me but I hope it might for someone > who knows C++ well. It makes sense to me, gcc is inlining begin_assign into StartAssign and StartAssign into operator M** and operator M** into main under ALL_INTERNAL while it under notdef ALL_INTERNAL, it does not inline begin_assign. This still looks like an aliasing problem. Thanks, Andrew Pinski > > Here is the tcase3.cxx > > #include <cstdlib> > #include <cstdio> > > class Iface { > public: > virtual int Add(void) = 0; > }; > > > class M : public Iface { > public: > int i; > int Add(void) { i++; return i; } > int SetMutable(int k) { i = k; } > }; > > > class Safe : public M { > private: > int Add(void) { return 0; }; > protected: > Safe (); > }; > > > class Bbase { > public: > Bbase( Iface * rawPtr = 0 ) { a = rawPtr; } > void assign( Iface * q ) { a = q; } > #ifdef ALL_INTERNAL > void ** begin_assign() { assign(0); return reinterpret_cast< void > **> (&a); } > #else > void ** begin_assign(); > #endif > protected: > Iface * a; > }; > > #ifndef ALL_INTERNAL > void ** Bbase::begin_assign() > { > assign(0); > return reinterpret_cast< void **> (&a); > } > #endif > > > class B : private Bbase { > public: > B(): Bbase(0) {} > B( M* ptr ) : Bbase(ptr) {} > M** StartAssign() { return reinterpret_cast< M**> (begin_assign()); } > Safe* get () const { return reinterpret_cast< Safe* > (a); } > operator Safe*() const { return get(); } > Safe* operator->() const { return get(); } > }; > > > class C { > public: > C( B& bp) : mTarget(bp) {}; > operator M** () { return mTarget.StartAssign(); } > private: > B& mTarget; > }; > > > inline C > getter( B & bp) { return C(bp); } > > > int > main(int argc, char** argv) > { > > Iface * p; > B b1; > b1 = new M(); > p=static_cast< Iface* >(b1); > B b2; > M** retval = getter(b2); > *retval = static_cast< M* >(p); > b2->SetMutable(0); > fprintf(stderr,"done\n"); fflush(stderr); > return 0; > } > > > > Hope this tells someone something. > > Kevin > >
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled Hi, Thanks for explaining that. Now that I am aware of which parts of the program the code generation is sensitive to, I have been able to take the test case and hack away the C and Safe classes and remove the getter() function and still see the problem. Here is yet again a smaller testcase and my last so I hope this helps. I am still not sure if this is an aliasing problem or not. Perhaps the developer who wrote the nsCOMPtr template class for Netscape / Mozilla can provide some insight into this problem? cat tcase4.cxx #include <cstdlib> #include <cstdio> class Iface { public: virtual int Add(void) = 0; }; class M : public Iface { public: int i; int Add(void) { i++; return i; } int SetMutable(int k) { i = k; } }; class Bbase { public: Bbase( Iface * rawPtr = 0) { a = rawPtr; } void assign( Iface * q ) { a = q; } #ifdef ALL_INTERNAL void ** begin_assign() { assign(0); return reinterpret_cast< void **> (&a); } #else void ** begin_assign(); #endif protected: Iface * a; }; #ifndef ALL_INTERNAL void ** Bbase::begin_assign() { assign(0); return reinterpret_cast< void **> (&a); } #endif class B : private Bbase { public: B(): Bbase(0) {} B( M* ptr ) : Bbase(ptr) {} M** StartAssign() { return reinterpret_cast< M**> (begin_assign()); } M* get () const { return reinterpret_cast< M* > (a); } operator M*() const { return get(); } M* operator->() const { return get(); } }; int main(int argc, char** argv) { B b1 = new M(); Iface * p=static_cast< Iface* >(b1); B b2; M** retval = b2.StartAssign(); *retval = static_cast< M* >(p); b2->SetMutable(0); fprintf(stderr,"done\n"); fflush(stderr); return 0; }
As Andrew said, inlining might be different (functions defined in-class have an implicit "inline" attribute, which is not the case if a function is defined out-of-class). You may see whether you can trigger the abort in the out-of-class case when you pass -finline-functions to gcc, which makes all functions eligible for inlining. Alternatively, use -O3, which implies -finline-functions. As a hint for further testcase reduction: make all classes structs, and remove the public/private/protected markers. Then try to do some of the inlining by hand, by replacing the use of accessor functions by direct accesses of the respective values, and subsequent removal of the accessor functions. By looking at the code, it's obvious that you are very close to the hailed one-page-limit, at which everything suddenly becomes clear. (I'd like to help here, but can't for lack of a respective platform...) W.
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled Hi, > As a hint for further testcase reduction: make all classes structs, and > remove the public/private/protected markers. Then try to do some of > the inlining by hand, by replacing the use of accessor functions by > direct accesses of the respective values, and subsequent removal of > the accessor functions. Okay- following along those lines we now get tcase5.cxx #include <cstdio> struct A { virtual int Setit(int k) = 0; }; struct B : A { int i; int Setit(int k) { i = k; } }; struct SP { A * a; void ** begin_assign(); }; void ** SP::begin_assign() { return reinterpret_cast< void **> (&a); } int main(int argc, char** argv) { SP p; B** retval = reinterpret_cast<B**> (p.begin_assign()); *retval = new B(); (p.a)->Setit(0); fprintf(stderr,"done\n"); fflush(stderr); return 0; } [kbhend@base1 huh]$ g++ -O2 -o tcase5 tcase5.cxx [kbhend@base1 huh]$ ./tcase5 Segmentation fault [kbhend@base1 huh]$ g++ -O2 -fno-strict-aliasing -o tcase5 tcase5.cxx [kbhend@base1 huh]$ ./tcase5 done [kbhend@base1 huh]$ g++ -O2 -finline-functions -o tcase5 tcase5.cxx [kbhend@base1 huh]$ ./tcase5 done Kevin
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled Hi, Just to add insult to injury here ... [kbhend@base1 huh]$ cat tcase6.cxx struct A { virtual int Setit(int k) = 0; }; struct B : A { int Setit(int k) { int i; i = k; } }; A* a; void ** begin_assign() { return reinterpret_cast< void **> (&a); } int main(int argc, char** argv) { B** ppB = reinterpret_cast<B**> (begin_assign()); *ppB = new B(); a->Setit(0); return 0; } [kbhend@base1 huh]$ g++ -O2 -o tcase6 tcase6.cxx [kbhend@base1 huh]$ ./tcase6 Segmentation fault [kbhend@base1 huh]$ g++ -O2 -finline-functions -o tcase6 tcase6.cxx [kbhend@base1 huh]$ ./tcase6 [kbhend@base1 huh]$ [kbhend@base1 huh]$ g++ -O2 -fno-strict-aliasing -o tcase6 tcase6.cxx [kbhend@base1 huh]$ ./tcase6 [kbhend@base1 huh]$
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled On Mon, 2003-07-07 at 19:33, kevin dot hendricks at sympatico dot ca wrote: > PLEASE REPLY TO gcc-bugzilla@gcc.gnu.org ONLY, *NOT* gcc-bugs@gcc.gnu.org. > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11376 > > > > ------- Additional Comments From kevin dot hendricks at sympatico dot ca 2003-07-08 02:33 ------- > Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled > > Hi, > > Just to add insult to injury here ... Thanks for cutting this down; now I can see easily that this test case is invalid C++. > [kbhend@base1 huh]$ cat tcase6.cxx > struct A { > virtual int Setit(int k) = 0; > }; > > struct B : A { > int Setit(int k) { int i; i = k; } > }; > > A* a; > void ** begin_assign() { return reinterpret_cast< void **> (&a); } > > int > main(int argc, char** argv) > { > B** ppB = reinterpret_cast<B**> (begin_assign()); > *ppB = new B(); > a->Setit(0); > return 0; > } The problem is that "*ppB" modifies a "B*". Unfortunately, the location pointed to by "ppB" is actually an "A*". That's invalid. (It's true that in C++ you can assign a "B*" to an "A*", but that's not the same as using a "B**" to point at at an "A*" and then writing into the "B**".) Simplifying your code even further gives: A* a; B** ppB = (B**) &a; *ppB = new B; which makes it clearer that the code is invalid. This code is very much analogous to: long l; int *ip = (int *) &l; *ip = 3; on a machine where "int" and "long" are the same width. It seems like it should work, but it's not valid.
The regression in PR 11376 was introduced or exposed by this patch: --- gcc/gcc/cp/ChangeLog --- 2002-03-11 Dan Nicolaescu <dann@ics.uci.edu> Daniel Berlin <dan@dberlin.org> * cp-lang.c (ok_to_generate_alias_set_for_type): New function. (cxx_get_alias_set): Use it. --- gcc/gcc/ChangeLog --- 2002-03-11 Dan Nicolaescu <dann@ics.uci.edu> Daniel Berlin <dan@dberlin.org> C++ alias analysis improvement. * alias.c (record_component_aliases): Record aliases for base classes too. The regression hunt used the test case from Additional Comment #7 in the PR, compiled with -O2 on powerpc-unknown-linux-gnu. A lot has happened with this PR since I started the hunt yesterday, but perhaps this information is still useful.
This is kind of humorous. Mark first points out it's an aliasing violation, *then* we discover it was the patch that added TBAA for simple C++ classes that broke the code. Usually it's the other way around. However, to add something useful, the fact that it was exposed by this patch lends credence to Mark's point that it's an aliasing violation. Unless someone doesn't think Mark is correct, this bug should be closed, as the code is invalid C++.
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled Hi, > Unless someone doesn't think Mark is correct, this bug should be > closed, as the > code is invalid C++. If it is an aliasing issue, changing the test case in this way struct A { virtual int Setit(int k) = 0; }; struct B : A { int Setit(int k) { int i; i = k; } }; A* a; void ** begin_assign() { return reinterpret_cast< void **> (&a); } int main(int argc, char** argv) { #if 0 B** ppB = reinterpret_cast<B**> (begin_assign()); *ppB = new B(); #else A** ppB = reinterpret_cast<A**> (begin_assign()); *ppB = new B(); #endif a->Setit(0); return 0; } Since *ppB will be a A* which can hold a B* legally, it should prevent any aliasing problem. I tested this case and sure enough it compiles and runs at all optimization levels just fine. So it sure looks like an aliasing issue to me. I agree this should be closed and a bug against Mozilla nsCOMPtr<> code should be opened. Franz? What do you think? Kevin
Well, since the generated assembly looks now quite different than for the full testcase, I would just like to hear another opinion whether the full testcase was correctly minimized. If yes, I agree that this bug shoud be closed and one against mozilla should be opened. Hmm, Kevin, could you create a "fixed" (like you just did for the simplified one) full testcase (diff?)? If that fixes the full testcase I'll be happy :-).
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled Hi Franz, Please try changing inlinedGetFrameAt() in your imgContainerGIF.ii To look like the following (a rough hack job) inline nsresult inlinedGetFrameAt(PRUint32 index, gfxIImageFrame **_retval) { nsISupports ** rp = (nsISupports**) _retval; nsISupports *_elem = mFrames.ElementAt(index); if (!_elem) return ((nsresult) 0x80004005L); *rp = _elem; return 0; } This casts back the gfxIImageFrame** _retval to what is really is which is a nsISupports** I tried this with your test case under gcc 3.3 and get the following: [kbhend@base1 huh]$ g++ -O2 -fno-exceptions -o junk imgContainerGIF.ii -L/src1/mozilla-1.3/obj-ppc/dist/bin -lxpcom -lgkgfx [kbhend@base1 huh]$ ./junk rv 0 done [kbhend@base1 huh]$ gcc -dumpversion 3.3 So it fixes the problem (no segfault!) This is not the right fix. The nsCOMPtr<> template type and the many of the methods should be passing around nsISupports** instead of gfxIImageFrame** in many places. But this supports Mark's claim that the code is not aliasing safe. Kevin On Tuesday, July 8, 2003, at 01:50 PM, sirl at gcc dot gnu dot org wrote: > PLEASE REPLY TO gcc-bugzilla@gcc.gnu.org ONLY, *NOT* > gcc-bugs@gcc.gnu.org. > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11376 > > > > ------- Additional Comments From sirl at gcc dot gnu dot org > 2003-07-08 17:50 ------- > Well, since the generated assembly looks now quite different than for > the full > testcase, I would just like to hear another opinion whether the full > testcase > was correctly minimized. If yes, I agree that this bug shoud be closed > and one > against mozilla should be opened. > Hmm, Kevin, could you create a "fixed" (like you just did for the > simplified > one) full testcase (diff?)? If that fixes the full testcase I'll be > happy :-). > > > > > > ------- You are receiving this mail because: ------- > You are on the CC list for the bug, or are watching someone who is. > > ---- Kevin B. Hendricks Associate Professor of Operations and Information Technology Richard Ivey School of Business, University of Western Ontario London Ontario, CANADA N6A 3K7 khendricks@ivey.uwo.ca
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled Hi, One question ... if the original mozilla code is not strict-aliasing safe as seems to be the case, why does this problem only impact ppc linux gcc 3.3 and not other ppc platforms nor it seems any other platform? In particular x86 seems to be immune to most strict-aliasing violations (the JDK, OpenOffice.,org, and now Mozilla are not aliasing safe) but the bug always seems to miss x86 but ends up biting ppc. Does that mean that fewer instructions are actually reschedule-able based on this optimization under x86? If so, what benefit actually occurs from assuming strict-aliasing on most platforms? Is there any measurable improvement on any platform? Finding these types of violations is extremely time consuming and it is hard to convince a non-impacted developer that his/her code is incorrect (they keep saying it runs fine on x86 so it must be a ppc specific problem). It is simply hard at first glance to see that: if B's parent is A then A* can hold a B* and be completely type legal B* can hold an A* (via a downcast) Both A* and B* can hold a B* (and be completely type) But *(B**) is not the same as a *(A**) from the point of view of aliasing. Unfortunately, without some form of warning about when the optimization is used it is next to impossible to find and fix all of the occurrences. Thanks, Kevin
Confirmed by inspecting the generated assembly.
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled > Unfortunately, without some form of warning about when the > optimization is used it is next to impossible to find and fix all > of the occurrences. There's no easy way to do that. The C/C++ standards say that casting "X**" to "Y**" is OK -- the problem is *using* that pointer. And by that point we've lost track of the fact that it was cast. I used to be an engineer at CenterLine; our tools (CodeCenter/ObjectCenter/AcquaProva) could catch these kinds of bugs dynamically, and give you a sensible "your code isn't right" message when the program ran. Unfortunately, CenterLine is largely defunct; I'm not sure if you can still get any of these tools, and, AFAIK, there's no open-source equivalent -- yet.
I would like to posit one additional comment here that I hope you don't take personal: When you say > Finding these types of violations is extremely time consuming and > it is hard to convince a non-impacted developer that his/her code is > incorrect [...] > Unfortunately, without some form of warning about when the > optimization is used it is next to impossible to find and fix all > of the occurrences. I would like to disagree. You have certainly noticed that even those of us who are well-trained in these questions tip-toed around the code and dared not say whether it is legal or not until it was really down to about one page. Rather than putting the blame on gcc, I would like to put it on the original authors of the code. They chose a wicked code design involving evil constructs such as casting to void** and using reinterpret_cast. It is certainly not only my personal view that this obfuscated the code to a degree that none of us could easily see what was going on. It is not up to me to judge how this might affect the mozilla developers community. However, I think there are more pressing things to do in gcc than to invent ways to track the course pointers take just to catch problems like this one. After all, it is not only time consuming to track down such bugs, but also to implement tracking them inside the compiler. Putting it somewhat bluntly (please take this with a grain of salt): If you choose to shoot yourself in the foot, gcc shouldn't interfere. W.
Filed as a mozilla bug http://bugzilla.mozilla.org/show_bug.cgi?id=212082
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled Hi, > They chose a wicked code > design involving evil constructs such as casting to void** and using > reinterpret_cast. It is certainly not only my personal view that this > obfuscated the code to a degree that none of us could easily see what > was going on. Since it is not my code, I completely agree with you! ;-) Give me a dumb pointer over a "smart pointer" any day. I am a big fan of obviousness and really love C compared to C++ because I hate having anything done "implicitly" and anything that impacts following the clear flow of the code (such as templates, multiple inheritance, implicit constructors and destructors, etc) gives me a headache. That was why it took me so long to trim down the testcase to something small, I could not figure out why nsDerivedsafe and nsGetterType templates were being passed around to begin with. > It is not up to me to judge how this might affect the > mozilla developers community. Unfortunately, it is just not them. Almost every major piece of code I have played with closely (the JDK, OpenOffice.org, and now Mozilla) seems to have problems. I build all of OOo, the JDK, and even Mozilla with -fno-strict-aliasing just to avoid having to track down these problems. > However, I think there are more pressing > things to do in gcc than to invent ways to track the course pointers > take > just to catch problems like this one. After all, it is not only time > consuming to track down such bugs, but also to implement tracking them > inside the compiler. I am sure there are other more important issues but the improvements in the strictness of the C++ parser in gcc 3.3 plus heavy use of -Wall has certainly helped OpenOffice.org find and fix serious bugs. So having a similar warning for aliasing violations really would help improve code in the wild. > Putting it somewhat bluntly (please take this with a grain of salt): > If you choose to shoot yourself in the foot, gcc shouldn't interfere. Agreed! ... but it might be nice if it told you the gun was using teflon coated steel jacketed shells when you thought your gun was actually loaded with blanks. ---- Kevin B. Hendricks Associate Professor of Operations and Information Technology Richard Ivey School of Business, University of Western Ontario London Ontario, CANADA N6A 3K7 khendricks@ivey.uwo.ca
replying to comment 24: The reason why x86 does not have this problem is because the schedular does not move around load and stores as much as the PPC schedular does, look at the gcc which is SPEC, it contains non alias safe code and it passes comparison test using the pentium 3 schedular but not with the G4 PPC one, it fails. Another reason why it effects PPC linux only and not PPC darwin is that the PIC code generation is different so you get different code for the PIC code, it might effect PPC darwin (in fact the simplified testcase does seg fault on PPC darwin). So the scheduling of the asm causes the load at different times.
Subject: Re: [3.3/3.4 regression] mozilla-1.4 miscompiled Hi, If it helps with the Mozilla bug report. The exact same bug bites with Apple's gcc 3.3 under Mac OSX 10.2 if you add -fstrict-aliasing (Apple's gcc 3.3 disables this by default). Kevin On Tuesday, July 8, 2003, at 03:42 PM, sirl at gcc dot gnu dot org wrote: > PLEASE REPLY TO gcc-bugzilla@gcc.gnu.org ONLY, *NOT* > gcc-bugs@gcc.gnu.org. > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11376 > > > > ------- Additional Comments From sirl at gcc dot gnu dot org > 2003-07-08 19:42 ------- > Filed as a mozilla bug > http://bugzilla.mozilla.org/show_bug.cgi?id=212082 > > > > ------- You are receiving this mail because: ------- > You are on the CC list for the bug, or are watching someone who is. > > ---- Kevin B. Hendricks Associate Professor of Operations and Information Technology Richard Ivey School of Business, University of Western Ontario London Ontario, CANADA N6A 3K7 khendricks@ivey.uwo.ca
> I am a big fan of obviousness and really love C compared to C++ It's not about C vs C++. C has the same aliasing rules, basically. In short, they read: if you've got an object of some type, don't try to fool everybody else (including the compiler) by pretending it has another type when you access it. Type safety is not so hard to achieve, and there are big C++ projects out there that have banned the use of void* and reinterpret_cast. I would certainly ask my collaborators to change their code if they tried to introduce such concepts. There's no need to do this. > So having a similar warning for aliasing > violations really would help improve code in the wild. This has been extensively discussed over and over again. I'd say for code like the one in question here, there's little hope a compiler can track what a certain pointer points to. (If it can, then just pack some accessor functions into another translation unit, and things become an order of magnitude harder). gcc has, between dot releases, switched off -fstrict-aliasing when it was first introduced, to give developers more time to fix their code, and because nobody could come up with a way to warn of invalid constructs. But this was years ago. Besides, in most code, these problems are not so hard to find -- as long as you avoid explicit casts, in particular reinterpret_cast, there are not so many possibilities to introduce aliasing violations. It is only if you try to outsmart C/C++'s type system that you are in danger. But I don't see why people should do that, templates offer them plenty of ways to always stay on the safe side, and the compiler will tell you about illegal casts -- unless you use void* and reinterpret_cast. So look out for these two constructs. Some perl-fu is more powerful than gcc for this... W.
I've been asked to add a summary of this discussion to the mozilla bug http://bugzilla.mozilla.org/show_bug.cgi?id=212082 , but I'm not a good enough at c++ speak to do that. Would someone here please be so kind to do this?
I'll take care of that in a minute. W.
I don't think such a summary is needed. The people who would actually fix the bug already understand the issues involved, even if the bug triagers who aren't programmers don't.
In response to comment 32: > It is only if you try to outsmart C/C++'s type system that you > are in danger. But I don't see why people should do that, templates offer them > plenty of ways to always stay on the safe side, and the compiler will tell > you about illegal casts -- unless you use void* and reinterpret_cast. Just to offer a little perspective from the Mozilla side of things, I think there are two main reasons why people working on Mozilla don't do as you think they should: 1. Portability to compilers that exist now and that we want to support generally ends up winning over portability to compilers that don't exist. Programmers generally don't understand all the rules (especially cases where behavior is undefined) of the language standards. Until late 2001, Mozilla supported gcc 2.7.2.3 (because it was, until then, the only way to use Purify on any Unix platform (Solaris), and valgrind didn't exist), which made using templates very difficult. Some of the other compilers that we still support also cause problems, so introducing any new templates into Mozilla still requires extensive testing and takes much more work than alternatives. (This is even more true for many web standards. While http://gcc.gnu.org/ is valid HTML (I'm impressed), it's still violating a "should not"-level requirement of the HTML specification since it uses tables for something other than the presentation of tabular data. Not that I blame you, since tables remain the only portable way of achieving many types of presentation.) 2. Compilers often don't optimize code the way the programmer wants. (My most recent experience with this was http://bugzilla.mozilla.org/show_bug.cgi?id=206682#c25 , and I've been meaning to file a gcc bug.) nsCOMPtr is very heavily used in the Mozilla codebase and has been hand-optimized for a number of different compilers. (You miss all the |#ifdef|s by looking at the preprocessor output.) In this case, the aliasing rules are being broken because of hand-optimization that I suspect was to improve the size of the generated code (although one that's being used on all platforms on which it compiles, which may not be necessary). Don't take any of that to mean that we don't want to fix our aliasing bugs. We do.
I sympathize with your comments (since we have the same problem with different compilers on different platforms, etc, although we don't have to go back to gcc2.7...). Just one remark: > Programmers generally don't understand all the rules (especially cases where > behavior is undefined) of the language standards. I certainly agree. On the other hand, note that it is not actually trivial to trip over aliasing rules -- you have to actively cheat on the compiler, by using reinterpret_casts etc. We get quite a number of reports with invalid code like that, so type safety doesn't seem to be very high on the list for quite a number of programmers (how sad). I wonder what cases these people are working around when you say you hand optimize code that way. We'd certainly like to see such code in our "optimization" section of bugzilla. And don't forget to submit you gcc bug report :-) Thanks Wolfgang
Howdy. I'm the original author of the evil code in question :-) Just checking in to say I understand and appreciate the analysis. The problem will be extremely hard to fix within Mozilla in any `correct' way. There are a couple of different paths that will stop the symptoms from hurting the users at run-time. I want to follow up to dbaron's earlier comment as well. The purpose of |nsCOMPtr| is to provide sugar and glue to a clone of COM. Originally, |nsCOMPtr| was designed as a (almost) drop in replacement for raw pointers to these reference counted COM objects. The XPCOM inventors at the time strongly resisted the introduction of a smart pointer class on the grounds that it wasn't needed, that it couldn't work, that it would be a waste of space and processing power, and that templates were broken in general. The design had to fit within the XPCOM rules without forcing itself on anyone. That led to the the primary hitch. XPCOM functions that return pointers do so by assigning into a |T**| result parameter. This direct assignment interfered with any attempt to manage refcounts automatically. The answer was the ugly (and it turns out, illegal) |getter_AddRefs( nsCOMPtr& )|, which |Releases| the current referent, if any, and then provides a |T**| as expected by the called function for assignment of the result. This machinery allowed the same functions to be called with either a |T*| or an |nsCOMPtr| in which to return their answer. |nsCOMPtr|s began to dominate the code base because the people who didn't use them pretty much always had leaks. The people who did use them had far fewer. Had XPCOM's originators been more open minded about C++, we could have just legislated that XPCOM object results are always returned through |nsCOMPtr&|s, and no aliasing problem would exist; the |getter_AddRefs| glue could all be thrown away. I would like nothing more. Such a change now would be very large. Not impossible, but certainly daunting. Just a note of history to show how we got here, and to mention in passing that I found nothing at which to take offense in any of the analyses and comments above. In fact, exactly the opposite: I really appreciate how deeply you guys dug into the problem. I hope I can come up with an answer that isn't just another work-around on top of functionality that is, itself, a work-around.
*** Bug 11846 has been marked as a duplicate of this bug. ***
*** Bug 21450 has been marked as a duplicate of this bug. ***
Reopening to ...
Mark as a dup of bug 21920. *** This bug has been marked as a duplicate of 21920 ***