find
with Small Secondary-to-Primary Key Ratios
find
with Large Secondary-to-Primary Key Ratios
insert
with Small
Secondary-to-Primary Key Ratios
insert
with Small
Secondary-to-Primary Key Ratios
insert
with Small
Secondary-to-Primary Key Ratios Memory Use
insert
with Small
Secondary-to-Primary Key Ratios Memory Use
Figure A shows a tree whose each node contains two entries: a floating-point key, and some size-type metadata (in bold beneath it) that is the number of nodes in the sub-tree. (The root has key 0.99, and has 5 nodes (including itself) in its sub-tree.) A container based on this data structure can obviously answer efficiently whether 0.3 is in the container object, but it can also answer what is the order of 0.3 among all those in the container object: see [biblio.clrs2001].
As another example, Figure B shows a tree whose each node
contains two entries: a half-open geometric line interval,
and a number metadata (in bold beneath
it) that is the largest endpoint of all intervals in its
sub-tree. (The root describes the interval [20,
36)
, and the largest endpoint in its sub-tree is
99.) A container based on this data structure can obviously
answer efficiently whether [3, 41)
is
in the container object, but it can also answer efficiently
whether the container object has intervals that intersect
[3, 41)
. These types of queries are
very useful in geometric algorithms and lease-management
algorithms.
It is important to note, however, that as the trees are modified, their internal structure changes. To maintain these invariants, one must supply some policy that is aware of these changes. Without this, it would be better to use a linked list (in itself very efficient for these purposes).
The figure below shows the different underlying data structures currently supported in this library.
A shows a collision-chaining hash-table, B shows a probing hash-table, C shows a red-black tree, D shows a splay tree, E shows a tree based on an ordered vector(implicit in the order of the elements), F shows a PATRICIA trie, and G shows a list-based container with update policies.
Each of these data structures has some performance benefits, in terms of speed, size or both. For now, note that vector-based trees and probing hash tables manipulate memory more efficiently than red-black trees and collision-chaining hash tables, and that list-based associative containers are very useful for constructing "multimaps".
Now consider a function manipulating a generic associative container,
template<class Cntnr> int some_op_sequence(Cntnr &r_cnt) { ... }
Ideally, the underlying data structure
of Cntnr
would not affect what can be
done with r_cnt
. Unfortunately, this is not
the case.
For example, if Cntnr
is std::map
, then the function can
use
std::for_each(r_cnt.find(foo), r_cnt.find(bar), foobar)
in order to apply foobar
to all
elements between foo
and
bar
. If
Cntnr
is a hash-based container,
then this call's results are undefined.
Also, if Cntnr
is tree-based, the type
and object of the comparison functor can be
accessed. If Cntnr
is hash based, these
queries are nonsensical.
There are various other differences based on the container's underlying data structure. For one, they can be constructed by, and queried for, different policies. Furthermore:
Containers based on C, D, E and F store elements in a
meaningful order; the others store elements in a meaningless
(and probably time-varying) order. By implication, only
containers based on C, D, E and F can
support erase
operations taking an
iterator and returning an iterator to the following element
without performance loss.
Containers based on C, D, E, and F can be split and joined efficiently, while the others cannot. Containers based on C and D, furthermore, can guarantee that this is exception-free; containers based on E cannot guarantee this.
Containers based on all but E can guarantee that erasing an element is exception free; containers based on E cannot guarantee this. Containers based on all but B and E can guarantee that modifying an object of their type does not invalidate iterators or references to their elements, while containers based on B and E cannot. Containers based on C, D, and E can furthermore make a stronger guarantee, namely that modifying an object of their type does not affect the order of iterators.
A unified tag and traits system (as used for the C++ standard library iterators, for example) can ease generic manipulation of associative containers based on different underlying data structures.
std::for_each(c.find(1), c.find(5), foo);
In our opinion, this problem is not caused just because
red-black trees are order preserving while
collision-chaining hash tables are (generally) not - it
is more fundamental. Most of the standard's containers
order sequences in a well-defined manner that is
determined by their interface:
calling insert
on a tree-based
container modifies its sequence in a predictable way, as
does calling push_back
on a list or
a vector. Conversely, collision-chaining hash tables,
probing hash tables, priority queues, and list-based
containers (which are very useful for "multimaps") are
self-organizing data structures; the effect of each
operation modifies their sequences in a manner that is
(practically) determined by their
implementation.
Consequently, applying an algorithm to a sequence obtained from most containers may or may not make sense, but applying it to a sub-sequence of a self-organizing container does not.
Suppose c
is some collision-chaining
hash-based container object, and one calls
c.find(3)
Then what composes the returned iterator?
It should be noted that containers based on collision-chaining hash-tables are not the only ones with this type of behavior; many other self-organizing data structures display it as well.
Consider the following snippet:
it = c.find(3); c.erase(5);
Erasing 5 from A1 yields A2. Clearly, an iterator to 3 can be de-referenced and incremented. The sequence of iterators changed, but in a way that is well-defined by the interface.
Erasing 5 from B1 yields B2. Clearly, an iterator to 3 is not valid at all - it cannot be de-referenced or incremented; the order of iterators changed in a way that is (practically) determined by the implementation and not by the interface.
Erasing 5 from C1 yields C2. Here the situation is more
complicated. On the one hand, there is no problem in
de-referencing it
. On the other hand,
the order of iterators changed in a way that is
(practically) determined by the implementation and not by
the interface.
So in the standard library containers, it is not always possible
to express whether it
is valid or not. This
is true also for insert
. Again, the
iterator concept seems overloaded.
The design of the functional overlay to the underlying data structures differs slightly from some of the conventions used in the C++ standard. A strict public interface of methods that comprise only operations which depend on the class's internal structure; other operations are best designed as external functions. (See [biblio.meyers02both]).With this rubric, the standard associative containers lack some useful methods, and provide other methods which would be better removed.
Order-preserving standard associative containers provide the method
iterator erase(iterator it)
typename C::iterator it = c.begin(); typename C::iterator e_it = c.end(); while(it != e_it) it = pred(*it)? c.erase(it) : ++it;
All associative containers include a conditional-erase method
template< class Pred> size_type erase_if (Pred pred)
The standard associative containers provide methods for multiple-item erase of the form
size_type erase(It b, It e)
c.erase(c.find(2), c.find(5))
It is well-known that tree-based and trie-based container objects can be efficiently split or joined (See [biblio.clrs2001]). Externally splitting or joining trees is super-linear, and, furthermore, can throw exceptions. Split and join methods, consequently, seem good choices for tree-based container methods, especially, since as noted just before, they are efficient replacements for erasing sub-sequences.
The standard associative containers provide methods of the form
template<class It> size_type insert(It b, It e);
Many graph algorithms (see [biblio.clrs2001]) require increasing a value in a priority queue (again, in the sense of the container's comparison functor), or joining two priority-queue objects.
The return type of priority_queue
's
push
method is a point-type iterator, which can
be used for modifying or erasing arbitrary values. For
example:
priority_queue<int> p; priority_queue<int>::point_iterator it = p.push(3); p.modify(it, 4);
These types of cross-referencing operations are necessary for making priority queues useful for different applications, especially graph applications.
It is sometimes necessary to erase an arbitrary value in a
priority queue. For example, consider
the select
function for monitoring
file descriptors:
int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *errorfds, struct timeval *timeout);
then, as the select documentation states:
“ The nfds argument specifies the range of file descriptors to be tested. The select() function tests file descriptors in the range of 0 to nfds-1.”
It stands to reason, therefore, that we might wish to
maintain a minimal value for nfds
, and
priority queues immediately come to mind. Note, though, that
when a socket is closed, the minimal file description might
change; in the absence of an efficient means to erase an
arbitrary value from a priority queue, we might as well
avoid its use altogether.
The standard containers typically support iterators. It is
somewhat unusual
for std::priority_queue
to omit them
(See [biblio.meyers01stl]). One might
ask why do priority queues need to support iterators, since
they are self-organizing containers with a different purpose
than abstracting sequences. There are several reasons:
Iterators (even in self-organizing containers) are useful for many purposes: cross-referencing containers, serialization, and debugging code that uses these containers.
The standard library's hash-based containers support iterators, even though they too are self-organizing containers with a different purpose than abstracting sequences.
In standard-library-like containers, it is natural to specify the
interface of operations for modifying a value or erasing
a value (discussed previously) in terms of a iterators.
It should be noted that the standard
containers also use iterators for accessing and
manipulating a specific value. In hash-based
containers, one checks the existence of a key by
comparing the iterator returned by find
to the
iterator returned by end
, and not by comparing a
pointer returned by find
to NULL.
This is indeed an elegant example of the adapter concept and the algorithm/container/iterator decomposition. (See [biblio.nelson96stlpq]). There are several reasons why a binary-heap priority queue may be better implemented as a container instead of a sequence adapter:
std::priority_queue
cannot erase values
from its adapted sequence (irrespective of the sequence
type). This means that the memory use of
an std::priority_queue
object is always
proportional to the maximal number of values it ever contained,
and not to the number of values that it currently
contains. (See performance/priority_queue_text_pop_mem_usage.cc
.)
This implementation of binary heaps acts very differently than
other underlying data structures (See also pairing heaps).
Some combinations of adapted sequences and value types
are very inefficient or just don't make sense. If one uses
std::priority_queue<std::vector<std::string>
> >
, for example, then not only will each
operation perform a logarithmic number of
std::string
assignments, but, furthermore, any
operation (including pop
) can render the container
useless due to exceptions. Conversely, if one uses
std::priority_queue<std::deque<int> >
>
, then each operation uses incurs a logarithmic
number of indirect accesses (through pointers) unnecessarily.
It might be better to let the container make a conservative
deduction whether to use the structure in the graphic above, labels A1 or A2.
There does not seem to be a systematic way to determine what exactly can be done with the priority queue.
If p
is a priority queue adapting an
std::vector
, then it is possible to iterate over
all values by using &p.top()
and
&p.top() + p.size()
, but this will not work
if p
is adapting an std::deque
; in any
case, one cannot use p.begin()
and
p.end()
. If a different sequence is adapted, it
is even more difficult to determine what can be
done.
If p
is a priority queue adapting an
std::deque
, then the reference return by
p.top()
will remain valid until it is popped,
but if p
adapts an std::vector
, the
next push
will invalidate it. If a different
sequence is adapted, it is even more difficult to
determine what can be done.
Sequence-based binary heaps can still implement
linear-time erase
and modify
operations.
This means that if one needs to erase a small
(say logarithmic) number of values, then one might still
choose this underlying data structure. Using
std::priority_queue
, however, this will generally
change the order of growth of the entire sequence of
operations.
[biblio.abrahams97exception] STL Exception Handling Contract . 1997. ISO SC22/WG21 .
[biblio.austern01htprop] A Proposal to Add Hashtables to the Standard Library . 2001 . ISO SC22/WG21 .
[biblio.dawestimer] Boost Timer Library . Boost .
[biblio.clearypool] Boost Pool Library . Boost .
[biblio.mscom] COM: Component Model Object Technologies . Microsoft .
[biblio.nelson96stlpq] Priority Queues and the STL . January 1996 . Dr. Dobbs Journal .
[biblio.wickland96thirty] Thirty Years Among the Dead . 1996 . National Psychological Institute .