This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[Bug libstdc++/35968] nth_element fails to meet its complexity requirements



------- Comment #5 from roger at eyesopen dot com  2008-04-24 15:01 -------
Well, I've now had time to read the Barriato, Hofri et al. 2002 paper, and the
bad news is that such an approximate median selection algorithm can't be used
to guarantee an O(N) worst-case std::nth_element implementation.  It could be
used in an implementation to guess a good pivot, but the quality of this
median, i.e. how approximate it is, doesn't meet the necessary criterion to
ensure an O(N) worst case.  You'd still need a fallback method with guaranteed
bounds or an exact median in order to achieve O(N).  i.e. it could help improve
the average case performance, but doesn't help with the worst case.

For the mathematically inclined, in order to achieve an O(N) worst-case
performance, you need to guarantee a constant fraction of elements can be
eliminated at each level of the recursion.  In comment #4, Steven fixates
on "just as long as N/2 elements are reduced each time round", but the
equations for sum of geometric series show that doing better than any constant
fraction guarantees O(N) worst case.  Hence even if you only guarantee that
you can eliminate 10% each round, you still achieve O(N) worst-case.

Hence you need a method that provides an approximate median that worst-case
can guarantee elimination of say 10% of elements from consideration.  This is
why approximate medians offer some utility over exact medians if they can be
found faster.  Unfortunately, the method of Battiato referenced in comment #1
doesn't provide such a constant fraction guarantee.  An analysis shows that at
each round, it can only eliminate (2^n-1)/3^n of the elements in its worst
case, where n is log_3(N).

By hand, naming the ranks 0..N-1, when N=3, the true median at rank 1 is
selected.  For N=9, the elements at rank 3,4 or 5 may be considered as a
median, i.e. 1/3 eliminated.  For N=27, the elements between ranks 8 and 20 may
be returned as the median, i.e. 7/27 eliminated.  In the limit, as N tends
towards infinity (and n tends to infinity), the eliminated fraction (2^n-1)/3^n
tends to zero unbounded.  i.e. the larger the input size the less useful is the
worst-case median.

The poor quality of the median is lamented by the authors in the penultimate
paragraph of section 4.1 of the paper.  They then go on to show that
statistically such a worst-case is rare, but unfortunately even a rare worst
case breaks the C++ standard libraries O(N) constraint.

This Achilles heel is already well documented in the algorithmic complexity
community.  The Blum, Floyd, Pratt, Rivest and Trarjan paper [BFRT73] and the
Floyd and Rivest paper [FR75] analyse the issues with median-of-k-medians, and
show that k>=5 is the lowest value capable of guaranteed fractional worst case.
i.e. they already consider and reject the algorithm given in the cited work
(k=3) for the purpose of exact median finding.

Anyway, I hope you find this interesting.  There will always be efficient
methods for finding approximate medians.  The question is how efficient vs.
how approximate.  Many quicksort implementation select the first element as a
pivot, an O(1) method for selecting an extremely approximate median! 
Statistically over all possible input orders, this first element will on
average partition the input array at the median, with some variance.  It's not
that the paper is wrong or incorrect; it does what it describes in finding a
statistically good approximate median very efficiently with excellent worst
case performance.  Unfortunately for the problem we need to solve, which is not
the problem the paper's authors were attempting to solve, we need a better
approximation perhaps using a more complex implementation.

Anyway, thanks again for the reference.  I'd not come across it before and
really enjoyed reading it.  Let me know if you spot a flaw in my reasoning
above.

Dr Roger Sayle,
Ph.D. Computer Science
--


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35968


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]