Dhruv Matani wrote:
>> Nice. I'm also thinking that probably the design could be improved: as I
>> know this kind of algorithms the entire data set we are searching in has
>> to remain around, intact, and in therefore would have to be duplicated
>> and stuffed inside __bm_struct. That seems stupid, it's a lot of memory
>> and a lot of work. We have to find a way to always use iterators and
>> additionally keep on a side only the output of the preprocessing
>> proper...
>
> I think you are right. Also, if we take into consideration say just
> the BM search algo, then the preprocessing needs to be done on just
> the shorter string(one whose occurance we need to find).
As a matter of fact, I think this is of course the *most* common case, I
was wrong, swapping in my mind m and n.
> Thus, we
> should be able to provide(for random access iterators) an interface
> similar to:
>
> Iter
> __internal_bm_search(
> /* The source iterator pair. */
> Iter start, Iter end,
> /* The string to be searched for(preprocessed). */
> __bm_struct str1);
>
>
> Additionally, provide a wrapper:
> Iter
> __bm_search(Iter start, Iter end,
> Iter str1, Iter str2)
> {
> __bm_struct __bms = __bm_preprocess(str1, str2);
> return __internal_bm_search(start, end, __bms);
> }
>
> So, every call of the _good_ looking function would incur an
> additional penalty of preprocessing the string at every stage if we
> want to keep searching the whole document for the same string over and
> over again.
Definitely. We can work exactly like that and enjoy an interface of
__bm_search very similar to std::search. Very nice. I would probably
call __internal_bm_search also __bm_search and overload.
> For BM, the constraint on the iterator Iter would be such that
> typeof(*Iter) should be a hashable type, so that if the user makes a
> UDT, then he/she would require to override/provide a hash function for
> it. An alternative strategy would be to require an operator< defied
> for the type so that we can use binary search or std::map, etc....
Fine. Basically no differences wrt the type of an hashed container, same
constraints, same requirements. Note, we are agreeing that, as soon as
we start using iterators everywhere, there is no reason to restrict
ourselves to strings, or sequences of PODs.
> One more thing that i would like to do is create better test cases for
> string search algo. performance comparisons. Typically, there would be
> 3 broad categories of source strings namely:
> [1] small strings < 128bytes.
> [2] Medium, size strings <= 64KB, and
> [3] Large strings > 64kB.
>
> Each of them should have a test case where:
> [1] The string occurs relatively early.
> [2] The string occurs relatively late.
> [3] There are many strings in the source string such that there is a
> longish common preffix. Something like search for 'hello' in 'in thelo
> was being he is hel In hErE onhe hello world.'
> [4] Search for a no-match string.
> [5] Search for a string with no match such that the intersection of
> the set of characters constituting the source and
> to-be-searched-string is a NULL set(empty set). Something like search
> for: 'hello' in 'Did it fax jump undar ytx tabbx.'
Good.
Paolo.