This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: Customized Allocator (on disk)

From: Reza Roboubi <reza at linisoft dot com>
To: Phil Edwards <pedwards at disaster dot jaj dot com>, Benjamin Kosnik <bkoz at redhat dot com>
Cc: libstdc++ at gcc dot gnu dot org
Date: Mon, 03 Dec 2001 21:19:44 -0800
Subject: Re: Customized Allocator (on disk)
Newsgroups: comp.lang.c++,comp.std.c++
Organization: Linisoft
References: <3BF4FE3A.D7116417@linisoft.com> <20011203000210.A13647@disaster.jaj.com> <3C0B50A2.EAA108A5@linisoft.com> <20011203184225.A20200@disaster.jaj.com>

Phil Edwards wrote:

> It's lacking from the standard because the standard doesn't say anything
> about disks, hashes, or how to access them.  It doesn't even assume that
> "files" exist.

I thought that the STL standard has template objects like "map," "hash," and  the
like.  Why do you say that STL says nothing about hashes??

With respect to disks you are correct.  The standard has nothing to do with disks,
files etc.  I am not at all proposing a change to this tradition.

> What exactly do you mean by "on the disk"?  Just stored as a normal file,
> or accessing the disk sectors directly, or via mmap(), or...?  Most of
> the major databases do something like the second option, bypassing the
> filesystem entirely to gain speed.  The library can't do much more than
> the first option without getting ourselves into trouble.

I think our friend Benjamin understands me better:
Benjamin Kosnik wrote:

>  I think the subject under discussion is a persistent allocator.

Let me clarify what I mean a  little better:

--------------------
FIRST SOME MOTIVATION:
Imagine you want to write something like the google search engine.  You need a lot of
the basic objects that STL offers.  (I don't know how efficient the STL
implementations are, but you need those kinds of algorithms, such as the GNU AVL tree
implementation.)

The only database _library_ that I know, which offers direct (non SQL) calls to C
functions,  is Berkeley DB.  It appears good, but has many limitations.  Essentially,
it is _far_ from a generic library, offering all the basic algorithms a database
engineer would need.  You can just "dump" things into a hash, and that's it.  But a lot
of functions for true low level database work need _separation_ of components:  for
example you want a hash,  on one partition, which stores the keys and only _pointers_
to the actual values.  The values are then "vector" objects stored on _another_
partition.

Berkeley DB is great,  but it does not offer such separation and low level surgical
power to the programmer.

END MOTIVATION
-----------------------

I have looked at the Dinkum STL Library Reference manual on the web.  I do not have the
STL standard in front of me, so the line between the Dinkum Library and STL is a bit
blurry.  Please bare with me.

This is not about writing any routines for the GNU implementation to access the disk or
any other media directly.  Although it is about making it possible for the _user_ to do
this.  The user should be able to create "persistent allocators" for any media (disk,
raw partition, ...), exactly as I believe Benjamin is pointing out.  Eventually, I am
thinking of disks, and if I write such a "persistent allocator" (for disks) myself, it
will be submitted for addition to the GNU STL.  But the main QUESTION is:  what are the
additions that one must make to the actual objects(hash, map, etc) to make this
possible.

(I apologize: I have not had the time to read the GNU STL code yet, so I will rely on
the Dinkum manual)

The Dinkum library is designed as follows (whether this is required by the standard or
not):
I believe that the Allocator objects are required to provide an abstract "Pointer"
type.  When we pass an Allocator object to a template object like the Hash, the Hash
template is instantiated into a real hash object which depends on the abstract
"Pointer" type supplied by the Allocator.  In other words whenever the hash wants to
write somewhere, it does something like:
p[i + 4] = 55; where the Pointer Type p overrides the [] operator of course.

I believe that is how it is implemented in Dinkum.  But  again, I have not yet read the
GNU code.

So far so good.  This is exactly what we need.  Now the Allocator might create a cach
in RAM, which maps to a raw partition,  and whenever the "Hash" writes into p[i], the
data goes to the cache that the Allocator manages.  An operation like p++ will
"increment" the pointer, if needed it will load the relevant page from the "raw
partition" and put it into the RAM cache, making sure that *p points to that part of
the RAM.  Great!  All of this works for Dinkum, and hopefully also for the STL standard
and GNU.    Now here is what I believe is _lacking_ from the STL standard:

FIRST WEAKNESS:
It all boils down to the notion of what is an "integer?"  To truly generalize the STL
template objects, what is required is a generalized "Allocator,"  a generalized
"Pointer," and a generalized "Integer."  As long as a "Hash" object depends on the 32
bit integer, it is impossible to place it into a 50 GB partition and make proper use of
the whole 50 GB.  So the main issue is that the actual __template_objects__ (Hash, Map,
etc) in the GNU implementation of the STL  must be changed.  They must depend on an
abstract notion of "Integer" which is not limited to the 32 bit range.

SECOND WEAKNESS:
The second limitation inherent to the STL standard is the fact that it does not specify
object constructors  that take abstract "Pointers" as their argument:  what if you have
created this great "map" or "hash" on your 50 GB raw partition, closed your
application, then re-opened your application,  and the "hash" needs to be constructed
using the data that is stored on the disk?  There are (to my knowledge) no constructors
for the "Hash" object that use an abstract "Pointer" object, which "points" to the
"disk." So the hash cannot be _constructed_ using the data on the disk, and the data is
lost.  Too bad.  This is an STL limitation.

Benjamin Kosnik wrote:

>  I'd be
>  interested in somebody volunteering to make this available to GNU C++
>  users.

I am very interested to implement this into GNU, but right now I am very pressed for
time (trying to finish a project for newsgroup data gathering, reading on OS design and
thinking how "safe" kernel modules can be implemented, the list goes on.)  I will
probably get back to this issue again.  If I can implement this, I truly believe that
the STL can become a goldmine compared to the commercial implementations that are out
there.  It's applications will be much more far reaching, and it potentially ends the
need for the Berkeley  DB type database libraries.  I am _really_ interested in
contributing.  However, I also need some advice from the people who have been working
on the GNU implementation so far.

Your comments on the subject are much appreciated.  Please let me know how this can be
implemented into the GNU STL implementation and whether my understanding of the STL
Standard weaknesses is correct.  Tell me who implemented "Hash" or "Vector" or
whatever, so that I can debate this issue with them if possible.

--
Reza Roboubi
Software Development Consultant
www.linisoft.com

Follow-Ups:
- Re: Customized Allocator (on disk)
  - From: Benjamin Kosnik
- Re: Customized Allocator (on disk)
  - From: Phil Edwards

References:
- Re: Customized Allocator (on disk)
  - From: Phil Edwards
- Re: Customized Allocator (on disk)
  - From: Reza Roboubi
- Re: Customized Allocator (on disk)
  - From: Phil Edwards

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]