I would recommend that you first concentrate on improving
the "generic" performance of your core algorithm and
test it intensively. Then test it again with a variety
of iterators, a variety of data types and a variety of
data distributions. Compare it extensively against my
implementation. You can download my performance tests
from CodeProject and use them to compare the two
algorithms.
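
To make the "variety of iterators, data types and data distributions" point concrete, here is a minimal sketch of the kind of harness I mean. It is not my CodeProject test suite; the names Distribution, make_input and time_ms are made up for illustration, the four input shapes are only a starting point, and the algorithm is assumed to take an iterator range (first, last).

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <deque>
#include <random>
#include <vector>

// Hypothetical input shapes; extend with whatever matters for your algorithm.
enum class Distribution { Random, Sorted, Reversed, FewUnique };

// Build n ints following the requested shape (fixed seed, so runs are repeatable).
inline std::vector<int> make_input(Distribution d, std::size_t n, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::vector<int> v(n);
    switch (d) {
    case Distribution::Random: {
        std::uniform_int_distribution<int> dist(0, 1000000);
        for (auto& x : v) x = dist(gen);
        break;
    }
    case Distribution::Sorted:
        for (std::size_t i = 0; i < n; ++i) v[i] = static_cast<int>(i);
        break;
    case Distribution::Reversed:
        for (std::size_t i = 0; i < n; ++i) v[i] = static_cast<int>(n - i);
        break;
    case Distribution::FewUnique: {
        std::uniform_int_distribution<int> dist(0, 3);  // only four distinct values
        for (auto& x : v) x = dist(gen);
        break;
    }
    }
    return v;
}

// Time a single run of `algo` on a copy of `input` stored in `Container`.
// Changing Container changes the iterator type the algorithm sees.
template <class Container, class Algo>
double time_ms(const std::vector<int>& input, Algo algo) {
    Container c(input.begin(), input.end());
    const auto t0 = std::chrono::steady_clock::now();
    algo(c.begin(), c.end());
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Example use, with std::sort purely as a stand-in for the algorithm under test:
//   auto in = make_input(Distribution::FewUnique, 1000000);
//   auto algo = [](auto first, auto last) { std::sort(first, last); };
//   double a = time_ms<std::vector<int>>(in, algo);
//   double b = time_ms<std::deque<int>>(in, algo);
```

A single timed run is noisy, so in practice you would repeat each measurement several times and compare the best or median time of both implementations on exactly the same input.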
Once you have ensured that your core algorithm is at least
as fast as mine (there is no sense in re-implementing a
slower version of a piece of free code that already
exists!), then (and only then) is it time to worry
about backward prefetchers on "exotic" targets and to write
specializations for N=2, N=3, N=4, etc. In my opinion,
it is essential that you first spend a lot more time on
your generic/core algorithm.