This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Search algorithms in __gnu_cxx::


Dhruv Matani wrote:

Nice. I'm also thinking that probably the design could be improved: as I
know this kind of algorithms the entire data set we are searching in has
to remain around, intact, and in therefore would have to be duplicated
and stuffed inside __bm_struct. That seems stupid, it's a lot of memory
and a lot of work. We have to find a way to always use iterators and
additionally keep on a side only the output of the preprocessing proper...

I think you are right. Also, if we take into consideration say just the BM search algo, then the preprocessing needs to be done on just the shorter string(one whose occurance we need to find).

As a matter of fact, I think this is of course the *most* common case, I was wrong, swapping in my mind m and n.


Thus, we
should be able to provide(for random access iterators) an interface
similar to:

Iter
__internal_bm_search(
/* The source iterator pair. */
Iter start, Iter end,
/* The string to be searched for(preprocessed). */
__bm_struct str1);


Additionally, provide a wrapper: Iter __bm_search(Iter start, Iter end, Iter str1, Iter str2) { __bm_struct __bms = __bm_preprocess(str1, str2); return __internal_bm_search(start, end, __bms); }

So, every call of the _good_ looking function would incur an
additional penalty of preprocessing the string at every stage if we
want to keep searching the whole document for the same string over and
over again.

Definitely. We can work exactly like that and enjoy an interface of __bm_search very similar to std::search. Very nice. I would probably call __internal_bm_search also __bm_search and overload.


For BM, the constraint on the iterator Iter would be such that
typeof(*Iter) should be a hashable type, so that if the user makes a
UDT, then he/she would require to override/provide a hash function for
it. An alternative strategy would be to require an operator< defied
for the type so that we can use binary search or std::map, etc....

Fine. Basically no differences wrt the type of an hashed container, same constraints, same requirements. Note, we are agreeing that, as soon as we start using iterators everywhere, there is no reason to restrict ourselves to strings, or sequences of PODs.


One more thing that i would like to do is create better test cases for
string search algo. performance comparisons. Typically, there would be
3 broad categories of source strings namely:
[1] small strings < 128bytes.
[2] Medium, size strings <= 64KB, and
[3] Large strings > 64kB.

Each of them should have a test case where:
[1] The string occurs relatively early.
[2] The string occurs relatively late.
[3] There are many strings in the source string such that there is a
longish common preffix. Something like search for 'hello' in 'in thelo
was being he is hel In hErE onhe hello world.'
[4] Search for a no-match string.
[5] Search for a string with no match such that the intersection of
the set of characters constituting the source and
to-be-searched-string is a NULL set(empty set). Something like search
for: 'hello' in 'Did it fax jump undar ytx tabbx.'

Good.


Paolo.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]