Bug 58437

Summary: [4.7/4.8/4.9 Regression] Sorting value in reverse order is much slower compare to gcc44
Product: gcc Reporter: Tammy Hsu <tammy>
Component: libstdc++Assignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED FIXED    
Severity: normal CC: chris, daniel.kruegler, jmbnyc, paolo.carlini
Priority: P3    
Version: 4.8.1   
Target Milestone: 4.7.4   
Host: Target:
Build: Known to work:
Known to fail: Last reconfirmed: 2013-09-17 00:00:00
Attachments: Sort patch
Performance tests for sort

Description Tammy Hsu 2013-09-16 20:37:55 UTC
For the following testcase:

=======================================================
#include <algorithm>
#include <vector>

using namespace std;

int main()
{
    const size_t num = 10000000;
    vector<int> v(num);

    generate(v.begin(), v.end(), &rand);
    sort(v.begin(), v.end());
    sort(v.rbegin(), v.rend());
}
======================================================

If we compile it with "g++ -O3", it takes 1 sec to sort random
values with either gcc445 or gcc481. Gcc445 takes 0.33 sec to sort
value in reverse order, but gcc 4.7/4.8 takes 2 secs.

Is there any way to improve the run-time performance?
Comment 1 Marc Glisse 2013-09-16 21:00:53 UTC
I can't reproduce this. I tried compiling with -O3 (with 4.4, 4.7 and 4.8) after commenting out either of the 2 sort lines, and I see roughly the same execution times (maybe a few % difference, but nowhere near the factor 6 you mention, which seems quite suspicious to me).

You need to at least describe your platform more in details...
Comment 2 Marc Glisse 2013-09-16 21:06:27 UTC
Ah forget my last message, I understand now you are really interested in how long it takes to reverse-sort an already sorted vector. Indeed it does take much longer with 4.6+ than with 4.4.
Comment 3 Marc Glisse 2013-09-16 21:18:31 UTC
Less confusing testcase:

#include <algorithm>
#include <vector>

using namespace std;

int main()
{
  const int num = 10000000;
  vector<int> v; v.reserve(num);
  for(int i=0;i!=num;++i) v.push_back(-i);
  sort(v.begin(), v.end());
}
Comment 4 Paolo Carlini 2013-09-17 00:03:16 UTC
Eh, it's *a lot* of time! I guess we have ask again Chris' help, because I don't think much happened after the move semantics work.
Comment 5 Paolo Carlini 2013-09-17 01:02:43 UTC
*** Bug 58440 has been marked as a duplicate of this bug. ***
Comment 6 Paolo Carlini 2013-09-17 01:19:22 UTC
In particular I'm thinking this change:

  http://gcc.gnu.org/ml/libstdc++/2009-08/msg00073.html
Comment 7 Chris Jefferson 2013-09-17 09:02:50 UTC
I will look at this next week. Quicksorts are always suboptimal for mostly sorted (forwards or backwards) data, and it would be nice to fix that.
Comment 8 Paolo Carlini 2013-09-17 09:06:27 UTC
Thanks Chris. Note that, according at least to the Dup bug, some slow down is noticeable in other less special cases too.
Comment 9 Tammy Hsu 2013-09-17 16:44:21 UTC
Marc/Paolo/Chris,

Thanks a lot for your help. It's awesome to see so many activities within hours after submitting the bug.

Thanks, Tammy
Comment 10 Jeffrey M. Birnbaum 2013-09-17 16:53:54 UTC
Tammy,

Something must have been in the air because your timestamp on the bug submission means that right at the time you were reporting the bug I was actively looking into why I was seeing horrible sort performance. I am working on an algo that requires a post sort of something 1B entries so I wanted to see worst case for std::sort on 500M and was stunned to see how poor it was. My CPU was Haswell so wondered if there was an issue so I tried another bug (that happened to have gcc 4.4.7) and realized it appeared to be a compiler or std lib regression. 

Anyhow, given the length of time that this has been out there it is weird that we reported the same bug within hours of each other.

Best,
/JMB
Comment 11 Tammy Hsu 2013-09-17 17:24:14 UTC
Jeffrey,

It's weird indeed. We are still using gcc445 for our production releases and are in the process of evaluating gcc481/gcc473. The performance tests show slowness in some tests....

I also don't have gcc45x available in house, so don't know if this bug exists in gcc45. BTW, it's really nice to know someone else encountered the same issue.

Tammy
Comment 12 Jeffrey M. Birnbaum 2013-09-17 17:32:50 UTC
Tammy,

We tested gcc 4.7.2, 4.6.2 and 4.4.3/5 (the bug is not in either 4.4.3/5). I have gcc 4.8.1 on my laptop but have not tried it yet. I confirmed the issue by compiling my test (almost identical to the one you submitted but using 500M elements) on 4.4.5 and then moving the executable over to a box with 4.7.2 installed. the native compiled program performed poorly compared to the one compiled with 4.4.5 (this also ruled out chip issues, i.e. haswell vs sandybridge).

I knew something was wrong when my own single threaded merge sort that produces a gradient instead of sorting the data in place was outperforming the std::sort using -D_GLIBCXX_PARALLEL, i.e. parallel sort of 500M entries. 

Best,
/JMB
Comment 13 Chris Jefferson 2013-09-19 07:28:21 UTC
Created attachment 30861 [details]
Sort patch

Wow, this an embarrassing bug to get through testing. Obviously not enough profiling done!

This patch (I hope) gets performance back to g++-4.4 levels, and in my experiments slightly (but not noticeably) better. I would be interested in anyone trying out this patch on their own data.

For the interested, here is a description of the bug. I will assume you understand quicksort!

Quicksort requires taking a pivot, then partitioning the array by that pivot. Getting O(n log n) behaviour requires that you choose a "good" pivot. Like most people, in g++ we choose the median of the first, middle, and last value in the array as the pivot.

In the C++03 world, after we chose the pivot we took a copy of it. In C++11, we cannot copy values in std::sort any more, only move and swap. 

Trying to keep track of the pivot as it moves around during the partitioning is horrible and inefficient. So, I had the "bright idea" to do swap(*first, *pivot). Now the pivot is at the first location in the array, and we can just partition [first+1, end) and the partitioning is fine. See the mistake yet?

The bug comes when we recursively partition the first cell. We again choose the median from the first, mid and last of this cell, but *first is always the biggest thing in the cell (as we pivoted on it one level above), meaning our median calculation is unbalanced and often chooses terrible values, leading to us often hitting the O(n^2) behaviour of quicksort (actually we technically get o(n log n), as we switch to heapsort when things are going badly, but the time is still bad).

There are a selection of ways of fixing this. This patch switches the median calculation from considering then first, mid, last locations to on first+1, mid, last-1 (there is no reason in this bug to consider last-1 instead of last, but I thought it wouldn't do any harm, and might avoid other issues of accidentally choosing a bad pivot.
Comment 14 Marc Glisse 2013-09-19 08:18:34 UTC
Hi Chris,

(detail: could you pass -u10, or at least -p, to diff to make the patches easier to read? It isn't required so you don't have to)

I don't really understand why the pivot is still considered in lower levels of the recursion at all. So, first we move the pivot to the first position with a swap:
(pivot, rest...)
Then we partition:
(pivot, small..., big...)
Then we know what the right position is for the pivot (this isn't a stable sort), so we could put it there with another swap:
(small..., pivot, big...)
And now we can recurse on small and big, and no one needs to look at the pivot ever again.

From what I understand of the code, we are instead relying on recursive calls to eventually sort pivot to the right position, which seems strange to me.
Comment 15 Paolo Carlini 2013-09-19 09:13:44 UTC
Thanks a lot guys, I appreciate all the help you are providing. While fixing this, let's remember that this regressed even in the old 4.7.x branch. Thus, if we are sure we are minimally restoring the performance of the C++03 code, we could also envisage a very minimal but *very* safe fix for 4.7.4 and 4.8.2 and something a tad more aggressive for 4.9.0.
Comment 16 Chris Jefferson 2013-09-19 10:03:01 UTC
Indeed, if std::sort had never used lower-level partitioning to get the pivot in the correct location, we would never have had this problem in the first place!

This is not too serious a problem performance-wise (as the pivot ends up in the right place fairly quickly), but it would certainly be better to do that in future. Unfortunately doing this requires changing quite a few functions, to make all code which uses partition skip over the pivot value.

I would suggest my current patch (which is simpler than it looks from the diff) for previous versions, then investigate pivot. In general I would like to modernise the sort to match more modern thoughts on sorting such as making use of partly sorted data (which would include reverse-ordered). The only problem there is satisfying all the requirements of std::sort).

(detail: -p doesn't help, you just get things like " -110,16 +112,18 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION". Neither -u 10, -x -u 10 or -x "-u 10" seem to work, I just get errors. If you can give me the exact svn diff command to run to get nicer output, I am happy to do so).
Comment 17 Paolo Carlini 2013-09-19 10:10:02 UTC
Can I ask you also a rather simple test for testsuite/performance/25_algorithms?
Comment 18 Marc Glisse 2013-09-19 11:28:50 UTC
[Ugh, bugzilla seems half broken at the moment]

> I would suggest my current patch (which is simpler than it looks from the diff) for previous versions, then investigate pivot.

Well, you're the one doing all the work ;-)

(svn diff --diff-cmd diff -x -u10 (actually I have diff-cmd = diff-for-svn in ~/.subversion/config where diff-for-svn is a script where I put the options I like, so svn diff does it by default), too bad -p often doesn't work well with C++)
Comment 19 Paolo Carlini 2013-09-19 14:10:20 UTC
I ran some quick tests and indeed the performance seems equal of better of those of the old C++03 code. Note that the patch is very small and can be installed even on top of the installed headers, without rebuilding anything, thus I would really encourage the bug submitters to test it and report back.
Comment 20 Chris Jefferson 2013-09-19 14:46:36 UTC
Created attachment 30867 [details]
Performance tests for sort

This is some performance tests for performance checking. Sorry for tar rather than patch, just pop them in performance/25_algorithms.

I decided to check stable_sort and heap_sort as well as just sort, just to check nothing else got broken at the same time (turns out, nothing else did).

Not sure exactly what the appropriate memory / time tradeoffs are for the performance test suite, so any comments welcome.
Comment 21 Jeffrey M. Birnbaum 2013-09-19 14:50:57 UTC
(In reply to Paolo Carlini from comment #19)
> I ran some quick tests and indeed the performance seems equal of better of
> those of the old C++03 code. Note that the patch is very small and can be
> installed even on top of the installed headers, without rebuilding anything,
> thus I would really encourage the bug submitters to test it and report back.

Excellent. Will test and report back.
Comment 22 Tammy Hsu 2013-09-19 17:09:54 UTC
Thanks a lot. We will test it out with our real application.
Just want to do things right, the patch is the one described in Attachment 30861 [details]. And I just need to patch the <gcc481_root>/include/c++/4.8.1/bits/stl_algo.h, don't need to rebuild gcc481.
Comment 23 Paolo Carlini 2013-09-19 17:35:46 UTC
If this is a question, the answer is YES.
Comment 24 Tammy Hsu 2013-09-19 17:37:06 UTC
Yes, it is a question. And thank you for the answer....... :-)
Comment 25 Marc Glisse 2013-09-19 18:35:46 UTC
Note that naively doing what I am proposing in comment #14 (it's just an iter_swap and a +-1) also makes reverse-sorted arrays a bad case, because of the way we do partitioning, so it isn't an alternative to Chris's first+1 approach, more of an orthogonal optimization.
Comment 26 Tammy Hsu 2013-09-23 20:45:03 UTC
Thanks a lot for the patch. It addresses the issue on the standalone test and shows also in our real tool test results. With this patch, the major degradations are gone. Compared to gcc445, there is a slight overall improvement for this tool now.

Really appreciate all the comments and the creation of a patch in such a short of time!!!!
Comment 27 paolo@gcc.gnu.org 2013-09-30 17:42:33 UTC
Author: paolo
Date: Mon Sep 30 17:42:31 2013
New Revision: 203035

URL: http://gcc.gnu.org/viewcvs?rev=203035&root=gcc&view=rev
Log:
2013-09-30  Chris Jefferson  <chris@bubblescope.net>

	PR libstdc++/58437
	* include/bits/stl_algo.h (__move_median_first): Rename to
	__move_median_to_first, change to take an addition argument.
	(__unguarded_partition_pivot): Adjust.
	* testsuite/performance/25_algorithms/sort.cc: New.
	* testsuite/performance/25_algorithms/sort_heap.cc: Likewise.
	* testsuite/performance/25_algorithms/stable_sort.cc: Likewise.

Added:
    trunk/libstdc++-v3/testsuite/performance/25_algorithms/sort.cc
    trunk/libstdc++-v3/testsuite/performance/25_algorithms/sort_heap.cc
    trunk/libstdc++-v3/testsuite/performance/25_algorithms/stable_sort.cc
Modified:
    trunk/libstdc++-v3/ChangeLog
    trunk/libstdc++-v3/include/bits/stl_algo.h
Comment 28 paolo@gcc.gnu.org 2013-09-30 17:42:54 UTC
Author: paolo
Date: Mon Sep 30 17:42:52 2013
New Revision: 203036

URL: http://gcc.gnu.org/viewcvs?rev=203036&root=gcc&view=rev
Log:
2013-09-30  Chris Jefferson  <chris@bubblescope.net>

	PR libstdc++/58437
	* include/bits/stl_algo.h (__move_median_first): Rename to
	__move_median_to_first, change to take an addition argument.
	(__unguarded_partition_pivot): Adjust.
	* testsuite/performance/25_algorithms/sort.cc: New.
	* testsuite/performance/25_algorithms/sort_heap.cc: Likewise.
	* testsuite/performance/25_algorithms/stable_sort.cc: Likewise.

Added:
    branches/gcc-4_8-branch/libstdc++-v3/testsuite/performance/25_algorithms/sort.cc
    branches/gcc-4_8-branch/libstdc++-v3/testsuite/performance/25_algorithms/sort_heap.cc
    branches/gcc-4_8-branch/libstdc++-v3/testsuite/performance/25_algorithms/stable_sort.cc
Modified:
    branches/gcc-4_8-branch/libstdc++-v3/ChangeLog
    branches/gcc-4_8-branch/libstdc++-v3/include/bits/stl_algo.h
Comment 29 paolo@gcc.gnu.org 2013-09-30 17:43:07 UTC
Author: paolo
Date: Mon Sep 30 17:43:05 2013
New Revision: 203037

URL: http://gcc.gnu.org/viewcvs?rev=203037&root=gcc&view=rev
Log:
2013-09-30  Chris Jefferson  <chris@bubblescope.net>

	PR libstdc++/58437
	* include/bits/stl_algo.h (__move_median_first): Rename to
	__move_median_to_first, change to take an addition argument.
	(__unguarded_partition_pivot): Adjust.
	* testsuite/performance/25_algorithms/sort.cc: New.
	* testsuite/performance/25_algorithms/sort_heap.cc: Likewise.
	* testsuite/performance/25_algorithms/stable_sort.cc: Likewise.

Added:
    branches/gcc-4_7-branch/libstdc++-v3/testsuite/performance/25_algorithms/sort.cc
    branches/gcc-4_7-branch/libstdc++-v3/testsuite/performance/25_algorithms/sort_heap.cc
    branches/gcc-4_7-branch/libstdc++-v3/testsuite/performance/25_algorithms/stable_sort.cc
Modified:
    branches/gcc-4_7-branch/libstdc++-v3/ChangeLog
    branches/gcc-4_7-branch/libstdc++-v3/include/bits/stl_algo.h
Comment 30 Paolo Carlini 2013-09-30 17:44:27 UTC
Fixed for 4.7.4/4.8.2/4.9.0.