Vectorized _cpp_clean_line

Thu Aug 12 22:32:00 GMT 2010

On 08/12/2010 03:07 PM, Andi Kleen wrote:
> At least for SSE 4.2 I'm not sure the table lookup
> for alignment is worth it. The unaligned loads are quite
> cheap on current microarchitectures with SSE 4.2
> and the page-end test is also not that expensive.

Perhaps.  That's something else that will want testing, since
the table lookup is all of a dozen instructions.

At minimum, the page-end test should not be performed inside
the loop.  We can adjust END before beginning the loop so
that a vector load never reads across a page boundary.
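
Roughly what I mean by adjusting END, as a sketch only.  The
4 KiB page size, the 16-byte load width, and the helper name
clamp_vector_end are assumptions for illustration, not anything
from the patch.  The vector loop would run while S is below the
clamped bound, and a scalar loop would finish the remaining
bytes up to the real END.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: 4 KiB pages and 16-byte (SSE) loads.  */
#define PAGE_BYTES 4096u
#define VEC_BYTES 16u

/* Return a bound VEND such that a VEC_BYTES-wide load starting at
   any address in [s, VEND) does not read past the end of the page
   holding the last valid byte, END - 1.  Bytes in [VEND, END) are
   left for a scalar tail loop.  Addresses are passed as integers
   so the arithmetic is explicit.  */
static uintptr_t
clamp_vector_end (uintptr_t s, uintptr_t end)
{
  assert (s <= end);
  if (s == end)
    return end;

  /* First address past the page containing the last valid byte.  */
  uintptr_t page_limit = ((end - 1) | (uintptr_t) (PAGE_BYTES - 1)) + 1;

  /* One past the highest start address whose load ends at or
     before page_limit.  */
  uintptr_t vend = page_limit - VEC_BYTES + 1;

  if (vend > end)
    vend = end;   /* Never scan beyond the real end.  */
  if (vend < s)
    vend = s;     /* No room for a vector load at all.  */
  return vend;
}
```

With this the only per-iteration test left in the vector loop
is the comparison against the clamped bound.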

> I originally avoided the indirect call because I was worried
> about the effect on CPUs with indirect branch predictor.

WithOUT the indirect branch predictor, you mean?  Which ones
don't have that?  Surely we have to be going back pretty far...

Since the call goes to the same destination every time, it
matches up well with the indirect branch predictor, AFAIK.  If
we're worried about CPUs that lack one, we could write

static inline bool
search_line_fast (s, end, out)
{
  if (fast_impl == 0)
    return search_line_sse42 (s, end, out);
  else if (fast_impl == 1)
    return search_line_sse2 (s, end, out);
  else
    return search_line_acc_char (s, end, out);
}

where FAST_IMPL is set up appropriately by init_vectorized_lexer.
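
For anyone who wants to time the two shapes against each other,
here is a self-contained sketch with both the compare chain and
the indirect call.  Everything beyond the names used above is an
assumption: the parameter types, the meaning of the bool result,
the set of bytes searched for, the select_impl helper (a stand-in
for whatever init_vectorized_lexer would do), and the three
scanners, which are portable stubs rather than real SSE code.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned char uchar;
typedef bool (*search_fn) (const uchar *, const uchar *, const uchar **);

/* Portable stand-in scanner.  Assumed contract: stop at the first
   '\n', '\r', '\\' or '?', store its address in *OUT and return
   true; otherwise store END and return false.  */
static bool
scan_bytes (const uchar *s, const uchar *end, const uchar **out)
{
  for (; s < end; s++)
    if (*s == '\n' || *s == '\r' || *s == '\\' || *s == '?')
      {
        *out = s;
        return true;
      }
  *out = end;
  return false;
}

/* Hypothetical stubs; the real versions would use SSE 4.2, SSE2,
   or word-at-a-time tricks respectively.  */
static bool
search_line_sse42 (const uchar *s, const uchar *end, const uchar **out)
{ return scan_bytes (s, end, out); }

static bool
search_line_sse2 (const uchar *s, const uchar *end, const uchar **out)
{ return scan_bytes (s, end, out); }

static bool
search_line_acc_char (const uchar *s, const uchar *end, const uchar **out)
{ return scan_bytes (s, end, out); }

static int fast_impl = 2;
static search_fn fast_fn = search_line_acc_char;

/* Hypothetical selector: 0 = SSE 4.2, 1 = SSE2, 2 = fallback.  */
static void
select_impl (int impl)
{
  assert (impl >= 0 && impl <= 2);
  fast_impl = impl;
  fast_fn = (impl == 0 ? search_line_sse42
             : impl == 1 ? search_line_sse2
             : search_line_acc_char);
}

/* Variant 1: compare chain, direct calls only.  */
static inline bool
search_line_fast (const uchar *s, const uchar *end, const uchar **out)
{
  if (fast_impl == 0)
    return search_line_sse42 (s, end, out);
  else if (fast_impl == 1)
    return search_line_sse2 (s, end, out);
  else
    return search_line_acc_char (s, end, out);
}

/* Variant 2: a single indirect call through a pointer set once.  */
static inline bool
search_line_indirect (const uchar *s, const uchar *end, const uchar **out)
{
  return fast_fn (s, end, out);
}
```

Both variants return the same result for a given selection, so
a benchmark only has to swap which wrapper the caller uses.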

The question is whether three predicted jumps are faster than
one indirect jump on a processor without a proper indirect
branch predictor.

r~