nth_element in 4.3.0 fails to meet its complexity requirements.
This is basically introselect by David Musser.
The question is "What algorithm should you switch to when quickselect fails?"
In the codebase __heapselect() is called. But this is O(N * log N). __heapselect() is O(middle - first) + O((last - middle) * log(last - first)), which is O(N * log N), while the C++ standard requires nth_element to be O(N). __heapselect() is no use at all here.
Better still is median-of-medians of groups of 5 (there are plenty of proofs that it is O(N)), or approximate median finding, which is also O(N).
In the best or average case, introselect is
O(N) partitioning + O(1) finding a suitable partition element.
In the worst case, introselect is
O(N) partitioning + O(N) finding a suitable partition element.
Roger, can you have a look? Thanks in advance.
Yep, now that we're back in stage1, it's about time I got around to submitting the O(n) worst case nth_element implementation that I mentioned last year. For
Steven's benefit, the implementation I've already coded up uses the median-of-medians in groups of five strategy as a fallback to a modified quickselect.
[Though I'll need to quickly read the paper you cite]
The trick for libstdc++ is to make the typical case as fast as or
faster than the existing implementation. Whilst the standard now requires
O(n) worst case, the perceived performance for g++'s users is the average
case, and changing to an O(n) implementation with a large constant coefficient may upset some folks.
Many thanks Roger for your further help on nth_element, excellent news.
Yes. You want a pivot selection that is O(1) and that, each time round, eliminates N/2 elements (bearing in mind nth_element's post-condition: iterators greater than the kth iterator have elements that are >= the kth element, and iterators less than the kth iterator have elements that are <= the kth element).
So median-of-3, or a larger sample for large N, is a must. And this is run for 3/2 * log2(N) steps.
If it has not converged by the end of those steps (so it has been a bad run), or the new N is below some constant (making binary insertion sort worthwhile), then at that point you want the cheapest approximate-median algorithm that is _guaranteed_ O(N). The algorithm overall is still O(N), as the median choosing and the partitioning occur in series. In this case, it is minimising the constant factor that matters.
The median-of-medians of 5 is well known,
but this approximate median is less well known.
So it is the combination of
(i) guaranteed O(N) partitioning
(ii) cheapest constant factor (so minimising iterator traversal, copy ctor, swapping, comparing etc)
I have not yet checked median-of-medians of 5 against this, but I believe (ii) works out cheaper.
And if it turns out that there exists an even cheaper guaranteed O(N) partitioning that finds an approximate median, gcc should use that. It does not matter whether the exact median is found each time round, just as long as N/2 elements are eliminated each time round.
I have also been alerted to an introselect variation of nth_element() that looks promising. It is this:
If you want the iterator that is 20% in from the start and you sample elements to choose a partition element, then instead of choosing an approximate median, choose the element that is 20% into the sample. So if the sample is 5 elements, choose the 2nd element. Now even if partitioning is lop-sided, you may have drastically reduced the number of elements left as candidates, massively shrinking the problem. This is brand new research on nth_element, and I have not yet seen much analysis.
Well, I've now had time to read the Battiato, Hofri et al. 2002 paper, and the
bad news is that such an approximate median selection algorithm can't be used
to guarantee an O(N) worst-case std::nth_element implementation. It could be
used in an implementation to guess a good pivot, but the quality of this median, i.e. how approximate it is, doesn't meet the necessary criterion to ensure an O(N) worst case. You'd still need a fallback method with guaranteed bounds or an exact median in order to achieve O(N). i.e. it could help improve the average case performance, but doesn't help with the worst case.
For the mathematically inclined: in order to achieve O(N) worst-case performance, you need to guarantee that a constant fraction of elements can be eliminated at each level of the recursion. In comment #4, Steven fixates
on "just as long as N/2 elements are reduced each time round", but the
equations for the sum of a geometric series show that eliminating any constant fraction each round guarantees an O(N) worst case. Hence even if you can only guarantee
eliminating 10% each round, you still achieve O(N) worst case.
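The geometric-series bound can be written out explicitly (a sketch of the standard argument): if each round eliminates at least a constant fraction f of the remaining elements, the total work over all rounds is

```latex
\sum_{i \ge 0} N(1-f)^i \;=\; \frac{N}{f} \;=\; O(N),
\qquad\text{e.g. } f = \tfrac{1}{10}:\quad N + 0.9N + 0.81N + \cdots = 10N.
```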
Hence you need a method that provides an approximate median that worst-case
can guarantee elimination of say 10% of elements from consideration. This is why approximate medians offer some utility over exact medians if they can be found faster. Unfortunately, the method of Battiato referenced in comment #1
doesn't provide such a constant-fraction guarantee. Analysis shows that at each round it can only eliminate (2^n - 1)/3^n of the elements in its worst case, where n is log_3(N).
By hand, naming the ranks 0..N-1: when N=3, the true median at rank 1 is selected. For N=9, the element at rank 3, 4 or 5 may be returned as the median, i.e. 1/3 eliminated. For N=27, the elements between ranks 8 and 20 may be returned as the median, i.e. 7/27 eliminated. In the limit, as N (and hence n) tends to infinity, the eliminated fraction (2^n - 1)/3^n tends to zero, i.e. the larger the input size, the less useful the worst-case median.
The poor quality of the median is lamented by the authors in the penultimate paragraph of section 4.1 of the paper. They then go on to show that statistically such a worst case is rare, but unfortunately even a rare worst case breaks the C++ standard library's O(N) constraint.
This Achilles heel is already well documented in the algorithmic complexity community. The Blum, Floyd, Pratt, Rivest and Tarjan paper [BFPRT73] and the Floyd and Rivest paper [FR75] analyse the issues with median-of-k-medians, and show that k >= 5 is the lowest value capable of a guaranteed fractional worst case; i.e. they already considered and rejected the algorithm given in the cited work
(k=3) for the purpose of exact median finding.
Anyway, I hope you find this interesting. There will always be efficient
methods for finding approximate medians. The question is how efficient vs.
how approximate. Many quicksort implementations select the first element as the pivot, an O(1) method for selecting an extremely approximate median! Statistically, over all possible input orders, this first element will on average partition the input array at the median, with some variance. It's not that the paper is wrong or incorrect; it does what it describes, finding a statistically good approximate median very efficiently and with excellent worst-case running time. Unfortunately, for the problem we need to solve, which is not the problem the paper's authors were attempting to solve, we need a better approximation, perhaps using a more complex implementation.
Anyway, thanks again for the reference. I'd not come across it before and
really enjoyed reading it. Let me know if you spot a flaw in my reasoning
Dr Roger Sayle,
Ph.D. Computer Science
Can I re-classify this as an enhancement?
I agree with your analysis. I am slightly crestfallen, as I was suspicious that Battiato, Hofri et al.'s paper never mentioned "worst case", only "approximate case". And I have now seen the BFPRT73 paper, where it says of the number of columns (c): "Note, however, that we must have c >= 5 for PICK to run in linear time". So yes, median-of-medians of 5 seems the best you can do for the worst case.
The only point in shifting to a different algorithm for the worst case is if
(i) it is cheaper in terms of total comparisons, swaps, assignments, etc.
(ii) it is still linear in N in the worst case
Roger, are you still actively working on this?
Roger told me in private that he doesn't plan to actively work on this issue any time soon. Unassigning.
Worth noting, perhaps, that the C++ standard does _not_ require O(n) worst-case behavior for nth_element. The exact wording (C++11 section 25.4.2 [alg.nth.element] paragraph (3)) says:
Complexity: Linear on average.
It is not obvious (to me) what distribution the "on average" refers to. All permutations of input with equal probability, I suppose (?)
Anyway, just because GCC falls back to an O(N log N) algorithm under some circumstances does not necessarily mean it violates the C++ specification, as long as that fallback only happens in log N of the cases or thereabouts.
Not that I am up for actually performing this complexity analysis.
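For what it's worth, the averaging argument can be sketched (assuming the fallback costs O(N log N) and the quickselect path costs O(N)): if the fallback is triggered with probability p over the input distribution, the expected cost is

```latex
E[\text{cost}] \;=\; (1-p)\,O(N) \;+\; p\,O(N \log N),
```

which remains O(N) on average provided p = O(1 / log N).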