Bug 48750 - for_each_template_random_access_ed has unbalanced new/delete[]
Summary: for_each_template_random_access_ed has unbalanced new/delete[]
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++ (show other bugs)
Version: 4.4.4
: P3 normal
Target Milestone: 4.6.1
Assignee: Paolo Carlini
URL:
Keywords:
: 48751 (view as bug list)
Depends on:
Blocks:
 
Reported: 2011-04-24 11:45 UTC by Seth Heeren
Modified: 2011-05-03 17:55 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2011-04-24 12:16:35


Attachments
minimal example (97.93 KB, application/zip)
2011-04-24 11:51 UTC, Seth Heeren
Details
Draft patch (2.50 KB, patch)
2011-04-27 01:05 UTC, Paolo Carlini
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Seth Heeren 2011-04-24 11:45:40 UTC
thread_results is allocated using scalar new, yet deleted with array delete[]

Version of libs (ubuntu package): libstdc++6-4.4-dev through libstdc++6-4.6-dev
Version of compiler: g++ (Ubuntu/Linaro 4.4.4-14ubuntu5) 4.4.5
Also with: g++ (Debian 4.5.2-8) 4.5.2

Snippets:
/usr/include/c++/4.4/parallel/par_loop.h:87
#   pragma omp single
{
        num_threads = omp_get_num_threads();
        thread_results = static_cast<Result*>(
                            ::operator new(num_threads * sizeof(Result)));
        constructed = new bool[num_threads];
}

but line 127:

    delete[] thread_results;

This is unbalanced, and leads to the expected valgrind errors. Attached is a snippet (originally from http://stackoverflow.com/questions/5769908/unexpected-segfault-with-gnu-parallelaccumulate) that will segfault.

I witnessed the segfault went away when fixing the `delete[]` into `delete`. However, I think it should be evaluated whether the optimized (uninitializing) allocation of thread_results is safe to do (valgrind keeps reporting uninitialized memory references even if the attached example doesn't segfault anymore).

---- 

I'll attach the sources in 10 minutes due to time restriction and browser flakiness
Comment 1 Seth Heeren 2011-04-24 11:51:17 UTC
Created attachment 24089 [details]
minimal example

Compiled with


g++ -g -O0 -fopenmp minimal.cpp  -o minimal -save-temps
Comment 2 Paolo Carlini 2011-04-24 12:11:29 UTC
This has been fixed a lot of time ago: 4_5-branch, 4_6-branch and mainline are all fine.
Comment 3 Paolo Carlini 2011-04-24 12:16:35 UTC
Never mind, now I see.
Comment 4 Paolo Carlini 2011-04-24 12:21:32 UTC
*** Bug 48751 has been marked as a duplicate of this bug. ***
Comment 5 Paolo Carlini 2011-04-24 12:26:17 UTC
Let's add Johannes in CC. Indeed, something is fishy here and yes I agree the correct fix doesn't seem just using ::operator delete(__thread_results);
Comment 6 Seth Heeren 2011-04-24 12:54:15 UTC
That _is quick response. Thank you. I'll keep an eye on this one
Comment 7 Paolo Carlini 2011-04-24 21:51:51 UTC
I had a quick look to this code and something really bad is going on, like, eg, memory deallocated before calling the destructor of the object constructed in it via placement new. And much more.

The patchlet below, which I'd ask the submitters to test, gives me something much more sane, wrt Valgrind too, but Jahannes should really have a look asap.

(Note the patch is versus mainline first, as usual, thus doesn't apply cleanly as-is to 4.4 for various trivial reasons, like uglified names)

/////////////////

Index: par_loop.h
===================================================================
--- par_loop.h	(revision 172920)
+++ par_loop.h	(working copy)
@@ -91,8 +91,7 @@
 	_ThreadIndex __iam = omp_get_thread_num();
 
 	// Neutral element.
-	_Result* __reduct = static_cast<_Result*>
-	  (::operator new(sizeof(_Result)));
+	_Result* __reduct;
 
 	_DifferenceType
 	  __start = equally_split_point(__length, __num_threads, __iam),
@@ -100,7 +99,7 @@
 
 	if (__start < __stop)
 	  {
-	    new(__reduct) _Result(__f(__o, __begin + __start));
+	    __reduct = new _Result(__f(__o, __begin + __start));
 	    ++__start;
 	    __constructed[__iam] = true;
 	  }
@@ -110,18 +109,26 @@
 	for (; __start < __stop; ++__start)
 	  *__reduct = __r(*__reduct, __f(__o, __begin + __start));
 
-	__thread_results[__iam] = *__reduct;
+	if (__constructed[__iam])
+	  {
+	    ::new(&__thread_results[__iam]) _Result(*__reduct);
+	    delete __reduct;
+	  }
       } //parallel
 
       for (_ThreadIndex __i = 0; __i < __num_threads; ++__i)
 	if (__constructed[__i])
-	  __output = __r(__output, __thread_results[__i]);
+	  {
+	    __output = __r(__output, __thread_results[__i]);
+	    (&__thread_results[__i])->~_Result();
+	  }
 
       // Points to last element processed (needed as return value for
       // some algorithms like transform).
       __f._M_finish_iterator = __begin + __length;
 
-      delete[] __thread_results;
+      ::operator delete(__thread_results);
+
       delete[] __constructed;
 
       return __o;
Comment 8 Paolo Carlini 2011-04-24 22:13:24 UTC
Johannes, can you please also audit the other uses of operator new, placement new, etc, and double check that:

  1- We are not assigning to an object before constructing.
  2- We destruct each object before deleting its memory.

Thanks!
Comment 9 Paolo Carlini 2011-04-26 15:19:12 UTC
Seth, Tom, if you get a chance to test the changes I propose in Comment 7 (suitably, trivially tweaked for 4.4.x), please let me know as soon as possible.

If Johannes cannot provide feedback on this issue over the next few days I mean to go ahead with my patch, likely for 4.6.x too, and work on a other similar tweaks in the same spirit elsewhere in the parallel code (luckily we have only an handful of operator new / operator delete pairs needing attention).
Comment 10 Tom 2011-04-26 18:09:05 UTC
Hi Paulo,

Yes, this certainly fixes my segfault.

Valgrind complains a little - although I am not sure if this is important (I
am very new to the helgrind tool).  In case it is useful, I attach the
output I get,

So the line numbers are meaningful, I include my modified par_loop.h and my
test file.


Here is my g++ -v

Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro
4.4.4-14ubuntu5' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--program-suffix=-4.4 --enable-shared --enable-multiarch
--enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix
--with-gxx-include-dir=/usr/include/c++/4.4 --libdir=/usr/lib --enable-nls
--with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug
--enable-objc-gc --disable-werror --with-arch-32=i686 --with-tune=generic
--enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5)

running on Ubuntu 10-10


Many thanks for your assistance with this. My gratitude to the FSF grows
daily.

Tom



On 26 April 2011 16:20, paolo.carlini at oracle dot com <
gcc-bugzilla@gcc.gnu.org> wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48750
>
> --- Comment #9 from Paolo Carlini <paolo.carlini at oracle dot com>
> 2011-04-26 15:19:12 UTC ---
> Seth, Tom, if you get a chance to test the changes I propose in Comment 7
> (suitably, trivially tweaked for 4.4.x), please let me know as soon as
> possible.
>
> If Johannes cannot provide feedback on this issue over the next few days I
> mean
> to go ahead with my patch, likely for 4.6.x too, and work on a other
> similar
> tweaks in the same spirit elsewhere in the parallel code (luckily we have
> only
> an handful of operator new / operator delete pairs needing attention).
>
> --
> Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug.
>
Comment 11 Paolo Carlini 2011-04-26 18:38:26 UTC
(In reply to comment #10)
> Yes, this certainly fixes my segfault.

For me, it only fixes the segfault, also avoids all valgrind and glibc errors messages at runtime and drastically improves the valgring report in terms of memory leaks, to 0 definitely and 0 indirectly lost (try not ding 'delete __reduct', as was effectively happening before the patch, for example, to see what I mean). The report is still not perfect, indeed, there may be other leaks somewhere else, I'm going to audit other bits of the parallel infrastructure as I said. This is the valgrind report I get here on an i7-980X I have at hand:

==7206== LEAK SUMMARY:
==7206==    definitely lost: 0 bytes in 0 blocks
==7206==    indirectly lost: 0 bytes in 0 blocks
==7206==      possibly lost: 3,344 bytes in 11 blocks
==7206==    still reachable: 2,984 bytes in 3 blocks
==7206==         suppressed: 0 bytes in 0 blocks

> Valgrind complains a little - although I am not sure if this is important (I
> am very new to the helgrind tool).  In case it is useful, I attach the
> output I get,

Sure it would be useful. Note, in general, a few memory issues have been fixed already in 4.5.x and 4.6.x, thus at some point, if you are serious about parallel mode, updating GCC will be certainly recommended.
Comment 12 Paolo Carlini 2011-04-26 18:40:17 UTC
Forgot: my numbers, etc, are all about the exaxt testcase in PR48751.
Comment 13 Paolo Carlini 2011-04-27 01:05:36 UTC
Created attachment 24108 [details]
Draft patch

This is my work ìn progress patch. I'm for example seeing very good to perfect improvements in terms of valgrind reports for sort / stable_sort / random_shuffle too, which, before the patch, when the ValueType has a non trivial destructor leak a lot.

I'm going to the run the patch through the testsuite in parallel-mode and get it closer to the final shape.
Comment 14 Seth Heeren 2011-04-27 09:48:57 UTC
I just built a debian SID chroot and applied the diff to it's libstdc++6 (4.6.0-2), and tested your patch with the minimal.cpp from this bug report

patch -p1 < ~sehe/attachment.cgi?id=24108

    patching file parallel/multiway_merge.h
    patching file parallel/losertree.h
    patching file parallel/par_loop.h
    Hunk #1 succeeded at 91 with fuzz 1.
    patching file parallel/quicksort.h
    patching file parallel/random_shuffle.h
    patching file parallel/multiway_mergesort.h
    patching file parallel/partial_sum.h

g++-4.6 -g -O0 -fopenmp minimal.cpp  -o minimal -save-temps
valgrind --leak-check=full ./minimal

    ==5960== LEAK SUMMARY:
    ==5960==    definitely lost: 0 bytes in 0 blocks
    ==5960==    indirectly lost: 0 bytes in 0 blocks
    ==5960==      possibly lost: 456 bytes in 3 blocks
    ==5960==    still reachable: 1,748 bytes in 3 blocks
    ==5960==         suppressed: 0 bytes in 0 blocks

The allocation possibly lost are attributed to GOMP_parallel_start from numeric:99). 

Helgrind gives quite the array of possible data races. I don't know whether I need to worry about them (as I don't know what kind of thread primitives GOMP uses internally and if helgrind supports them).

If you want I can attach precompiled sources + val/helgrind outputs
Comment 15 Seth Heeren 2011-04-27 09:51:26 UTC
Probably redundant info, but the compiler used in my prior post is

sehe@sid:~$ g++-4.6 -v       
Using built-in specs.
COLLECT_GCC=g++-4.6
COLLECT_LTO_WRAPPER=/usr/lib/gcc/i486-linux-gnu/4.6.1/lto-wrapper
Target: i486-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.6.0-2' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++,go --prefix=/usr --program-suffix=-4.6 --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-plugin --enable-gold --enable-ld=default --with-plugin-ld --enable-objc-gc --enable-targets=all --with-arch-32=i586 --with-tune=generic --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu --target=i486-linux-gnu
Thread model: posix
gcc version 4.6.1 20110329 (prerelease) (Debian 4.6.0-2)
Comment 16 singler@gcc.gnu.org 2011-04-27 19:59:29 UTC
I have reviewed the patch, it looks mostly fine.  Thanks!
Only line 606 of losertree.h might not yet be fully consistent with line 597.  I'm not sure all loser tree members are correctly constructed in the first place.

We might have to add

        for (unsigned int __i = 0; __i < _M_k; ++__i)
          {
            ::new(&(_M_losers[__i]._M_key)) _Tp(__sentinel);
            _M_losers[__i]._M_source = -1;
          }

I will test that...
 
We have used these ::operator new/placement new construction to avoid calling the default constructor, which might not be available for the ValueType.  Interference with the actual parallelism should be minimal.
Comment 17 Paolo Carlini 2011-04-27 20:23:25 UTC
Thanks Johannes. Indeed, I was still a little nervous about those bits of losertree.h. If you can, please make sure that (even if it costs a bit of performance):

  1- After memory allocation, the first time each objects is constructed, via placement new, not assigned.
  2- All the following times are assigned, not constructed.
  3- All the object are destructed, once, by calling explicitly the destructor, in the destructor of _LoserTree* class, before deallocating the memory via operator delete. Careful not trying to destruct objects never constructed in the first place.

In general, I recommend preparing a tiny snippet code exercising the parallel code at issue and running it under valgrind (doing that yesterday for the elements of a vector<string>, for sort, stable_sort, etc, turned out to be rather "interesting" ;)

Please, keep me up to date, we want to resolve this for 4.6.1, I think.
Comment 18 paolo@gcc.gnu.org 2011-05-03 14:20:49 UTC
Author: paolo
Date: Tue May  3 14:20:45 2011
New Revision: 173309

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=173309
Log:
2011-05-03  Paolo Carlini  <paolo.carlini@oracle.com>

	PR libstdc++/48750
	* include/parallel/multiway_merge.h: Run _ValueType destructors.
	* include/parallel/multiway_mergesort.h: Likewise.
	* include/parallel/quicksort.h: Likewise.
	* include/parallel/random_shuffle.h: Likewise.
	* include/parallel/partial_sum.h: Likewise.
	* include/parallel/losertree.h: Run destructors; minor tweaks.
	* include/parallel/par_loop.h: Run destructors, fix memory
	allocations and deallocations.
	* testsuite/26_numerics/accumulate/48750.cc: New.

	* testsuite/ext/profile/mutex_extensions_neg.cc: Do not run in
	parallel-mode to avoid spurious multiple errors.


Added:
    trunk/libstdc++-v3/testsuite/26_numerics/accumulate/48750.cc
Modified:
    trunk/libstdc++-v3/ChangeLog
    trunk/libstdc++-v3/include/parallel/losertree.h
    trunk/libstdc++-v3/include/parallel/multiway_merge.h
    trunk/libstdc++-v3/include/parallel/multiway_mergesort.h
    trunk/libstdc++-v3/include/parallel/par_loop.h
    trunk/libstdc++-v3/include/parallel/partial_sum.h
    trunk/libstdc++-v3/include/parallel/quicksort.h
    trunk/libstdc++-v3/include/parallel/random_shuffle.h
    trunk/libstdc++-v3/testsuite/ext/profile/mutex_extensions_neg.cc
Comment 19 Seth Heeren 2011-05-03 15:47:28 UTC
Cheers!
Comment 20 paolo@gcc.gnu.org 2011-05-03 17:54:41 UTC
Author: paolo
Date: Tue May  3 17:54:35 2011
New Revision: 173335

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=173335
Log:
2011-05-03  Paolo Carlini  <paolo.carlini@oracle.com>

	PR libstdc++/48750
	* include/parallel/multiway_merge.h: Run _ValueType destructors.
	* include/parallel/multiway_mergesort.h: Likewise.
	* include/parallel/quicksort.h: Likewise.
	* include/parallel/random_shuffle.h: Likewise.
	* include/parallel/partial_sum.h: Likewise.
	* include/parallel/losertree.h: Run destructors; minor tweaks.
	* include/parallel/par_loop.h: Run destructors, fix memory
	allocations and deallocations.
	* testsuite/26_numerics/accumulate/48750.cc: New.

	* testsuite/ext/profile/mutex_extensions_neg.cc: Do not run in
	parallel-mode to avoid spurious multiple errors.

Added:
    branches/gcc-4_6-branch/libstdc++-v3/testsuite/26_numerics/accumulate/48750.cc
Modified:
    branches/gcc-4_6-branch/libstdc++-v3/ChangeLog
    branches/gcc-4_6-branch/libstdc++-v3/include/parallel/losertree.h
    branches/gcc-4_6-branch/libstdc++-v3/include/parallel/multiway_merge.h
    branches/gcc-4_6-branch/libstdc++-v3/include/parallel/multiway_mergesort.h
    branches/gcc-4_6-branch/libstdc++-v3/include/parallel/par_loop.h
    branches/gcc-4_6-branch/libstdc++-v3/include/parallel/partial_sum.h
    branches/gcc-4_6-branch/libstdc++-v3/include/parallel/quicksort.h
    branches/gcc-4_6-branch/libstdc++-v3/include/parallel/random_shuffle.h
    branches/gcc-4_6-branch/libstdc++-v3/testsuite/ext/profile/mutex_extensions_neg.cc
Comment 21 Paolo Carlini 2011-05-03 17:55:39 UTC
Fixed for 4.6.1.