Vectorized _cpp_clean_line

Fri Aug 13 07:26:00 GMT 2010

On Thu, Aug 12, 2010 at 03:23:04PM -0700, Richard Henderson wrote:
> On 08/12/2010 03:07 PM, Andi Kleen wrote:
> > At least for sse 4.2 I'm not sure the table lookup
> > for alignment is worth it. The unaligned loads are quite
> > cheap on current micro architectures with sse 4.2 
> > and the page end test is also not that expensive.
> 
> Perhaps.  That's something else that will want testing, as
> it's all of a dozen instructions.
> 
> At minimum the page end test should not be performed inside
> the loop.  We can adjust END before beginning the loop so
> that we never cross a page.

The test runs in parallel with the match on an out-of-order (OOO) CPU.
It would only be a problem if you were decoder limited.

Moving it out would require special-case tail code. glibc used a lot
of switches for that in its code, and I didn't like that.

The best option would probably be to ensure the caller always provides
a tail pad, but that is presumably difficult if you mmap() the input
file.
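
For reference, a minimal sketch of the two forms of the page end test
being discussed. It assumes 4 KiB pages and 16-byte SSE loads; the
names PAGE_BYTES, VEC_BYTES, load_crosses_page and last_safe_load_addr
are made up for illustration and are not from the patch.

```c
#include <stdint.h>

/* Assumptions for this sketch: 4 KiB pages, 16-byte (SSE) loads.  */
#define PAGE_BYTES 4096u
#define VEC_BYTES  16u

/* In-loop form: nonzero if an unaligned VEC_BYTES load starting at
   ADDR would spill into the following page.  */
static int
load_crosses_page (uintptr_t addr)
{
  return (addr & (PAGE_BYTES - 1)) > PAGE_BYTES - VEC_BYTES;
}

/* Hoisted form: the highest address at which a VEC_BYTES load still
   stays inside the page holding the last valid byte (END - 1).
   Loads above this address need separate tail code.  Requires
   END > 0.  */
static uintptr_t
last_safe_load_addr (uintptr_t end)
{
  uintptr_t page_end = ((end - 1) | (PAGE_BYTES - 1)) + 1;
  return page_end - VEC_BYTES;
}
```

The first function is what runs on every iteration today; the second
is computed once before the loop, at the cost of the tail handling
mentioned above.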

> > I originally avoided the indirect call because I was worried
> > about the effect on CPUs with indirect branch predictor.
> 
> WithOUT the indirect branch predictor, you mean?  Which ones

Yes without.

> don't have that?  Surely we have to be going back pretty far...

Nope.  They're a relatively recent invention: a lot of x86 CPUs still
being used don't have them.

> 
> Since the call is the same destination every time, that matches
> up well with the indirect branch predictor, AFAIK.  If we're
> worried about the indirect branch predictor, we could write

Yes, if you have an indirect branch predictor you're fine, assuming
the rest of the compiler didn't thrash the buffers.

Or maybe profile feedback will fix it and do the necessary inlining
(but you have to fix PR45227 first :-)  Also, when I tested this last
time it didn't seem to work very well.

And it would only help if you run it on the same type of system as the
end host.

Or maybe it's in the wash because it's only once per line.

> 
> static inline bool
> search_line_fast (s, end, out)
> {
>   if (fast_impl == 0)
>     return search_line_sse42 (s, end, out);
>   else if (fast_impl == 1)
>     return search_line_sse2 (s, end, out);
>   else
>     return search_line_acc_char (s, end, out);
> }
> 
> where FAST_IMPL is set up appropriately by init_vectorized_lexer.
> 
> The question being, are three predicted jumps faster than one
> indirect jump on a processor without the proper predictor?

Yes, usually, especially if you don't have to go through all three
on average.
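
To make the comparison concrete, here is a rough sketch of both
dispatch styles side by side. The three *_stub functions are scalar
placeholders, not the real SSE 4.2 / SSE2 / accumulate-char routines,
and the signature (bool result, position stored through OUT) is an
assumption, as are the names search_line_fn, fast_fn, init_dispatch
and search_line_indirect.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed signature: true if a character of interest was found,
   with its position stored in *OUT.  */
typedef bool (*search_line_fn) (const unsigned char *s,
                                const unsigned char *end,
                                const unsigned char **out);

/* Scalar placeholder standing in for all three real searches.  */
static bool
scan_scalar (const unsigned char *s, const unsigned char *end,
             const unsigned char **out)
{
  for (; s < end; s++)
    if (*s == '\n' || *s == '\r' || *s == '\\' || *s == '?')
      {
        *out = s;
        return true;
      }
  *out = end;
  return false;
}

static bool
search_line_sse42_stub (const unsigned char *s, const unsigned char *end,
                        const unsigned char **out)
{ return scan_scalar (s, end, out); }

static bool
search_line_sse2_stub (const unsigned char *s, const unsigned char *end,
                       const unsigned char **out)
{ return scan_scalar (s, end, out); }

static bool
search_line_acc_char_stub (const unsigned char *s, const unsigned char *end,
                           const unsigned char **out)
{ return scan_scalar (s, end, out); }

static int fast_impl = 2;
static search_line_fn fast_fn = search_line_acc_char_stub;

/* Hypothetical stand-in for what init_vectorized_lexer would set up.  */
static void
init_dispatch (int impl)
{
  fast_impl = impl;
  fast_fn = (impl == 0 ? search_line_sse42_stub
             : impl == 1 ? search_line_sse2_stub
             : search_line_acc_char_stub);
}

/* Style 1: one indirect call through a function pointer.  */
static inline bool
search_line_indirect (const unsigned char *s, const unsigned char *end,
                      const unsigned char **out)
{
  return fast_fn (s, end, out);
}

/* Style 2: a chain of conditional branches to direct calls.  */
static inline bool
search_line_fast (const unsigned char *s, const unsigned char *end,
                  const unsigned char **out)
{
  if (fast_impl == 0)
    return search_line_sse42_stub (s, end, out);
  else if (fast_impl == 1)
    return search_line_sse2_stub (s, end, out);
  else
    return search_line_acc_char_stub (s, end, out);
}
```

With the branch chain, putting the most common implementation first
means the average call resolves after one well-predicted compare.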

-Andi

P.S.: I wonder if there's more to be gained from larger changes
in cpplib.  The clang preprocessor doesn't use vectorization, yet it
still seems to be faster?