This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH, libcpp]: Use asm flag outputs in search_line_sse42 main loop


On Mon, Jun 29, 2015 at 09:07:22PM +0200, Uros Bizjak wrote:
> Hello!
> 
> Attached patch introduces asm flag outputs in search_line_sse42 main
> loop to handle carry flag value from pcmpestri insn. Slightly improved
> old code that uses asm loop compiles to:
>
Using sse4.2 here is a bit dubious, as pcmpistri has horrible latency,
and four checks are near the boundary where replacing it with an sse2
sequence is faster.

So I looked closer and wrote a program that counts the lines of a source
file, to measure this.

I found that there is almost no difference between the sse2 code, the
sse4.2 code, and just calling strpbrk.
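
For reference, a strpbrk-based baseline could look roughly like the
sketch below. This is not the code from the attached line.c; the helper
name count_newlines_strpbrk is made up for illustration, and the set of
stop characters (newline, carriage return, backslash, question mark) is
an assumption based on the four checks mentioned above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from line.c: walk a NUL-terminated
   buffer with strpbrk, stopping at each of the four assumed stop
   characters, and count how many of the stops are newlines.  */
static size_t
count_newlines_strpbrk (const char *buf)
{
  size_t lines = 0;
  const char *p = buf;

  while ((p = strpbrk (p, "\n\r\\?")) != NULL)
    {
      if (*p == '\n')
        lines++;
      p++;   /* Step past the stop character and keep scanning.  */
    }
  return lines;
}
```

Each strpbrk call plays the role of one search_line call: it returns the
next position the lexer would have to look at more closely.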

But there were significant performance mistakes in the sse2 code. The
first one concerns this comment:

  /* Create a mask for the bytes that are valid within the first
     16-byte block.  The Idea here is that the AND with the mask
     within the loop is "free", since we need some AND or TEST
     insn in order to set the flags for the branch anyway.  */

The first claim, that the AND is free, is false, as gcc repeats setting
the mask to 1 in each iteration instead of only in the first one.

Then there is the problem that jumping directly into the loop is a bad
idea due to branch misprediction. It's better to use a header when it's
likely that the loop ends in the first iteration.

The worst problem is that using an aligned load and masking makes the
loop unpredictable: depending on alignment, it could check only one byte.
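
To make the arithmetic concrete, here is a small sketch (helper names are
made up for illustration) of how many bytes of the first aligned 16-byte
block are actually useful for a given start address. With misalignment 15
only one byte survives the mask, and over the 16 possible misalignments
the total is 136 bytes, i.e. about 8.5 per load if alignment is uniformly
random.

```c
#include <stdint.h>

/* Bytes of the first aligned 16-byte block that lie at or after the
   start address; the rest are masked off.  */
static unsigned
useful_bytes_aligned (uintptr_t addr)
{
  return 16 - (unsigned) (addr & 15);
}

/* Sum of useful bytes over all 16 possible misalignments; divide by
   16 to get the mean per first load.  */
static unsigned
total_useful_bytes (void)
{
  unsigned sum = 0;

  for (uintptr_t misalign = 0; misalign < 16; misalign++)
    sum += useful_bytes_aligned (misalign);
  return sum;
}
```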

A correct approach here is to check whether the load would cross a page
boundary and, if it does not, use an unaligned load. That always checks
16 bytes, instead of 8 on average when alignment is completely random.
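
A minimal sketch of that idea, assuming x86 with SSE2 and a 4096-byte
page (both assumptions, as are all names below; this is not the code
from line.c). The page test decides whether a 16-byte unaligned load
starting at s is safe, and the mask function does the four compares on
whatever 16 bytes were loaded.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Assumed page size for this sketch.  */
#define SKETCH_PAGE_SIZE 4096

/* Nonzero if a 16-byte load starting at s would run into the next
   page.  */
static int
crosses_page (const void *s)
{
  return ((uintptr_t) s & (SKETCH_PAGE_SIZE - 1)) > SKETCH_PAGE_SIZE - 16;
}

/* Bit i of the result is set if s[i] is one of the four assumed stop
   characters.  The caller must make sure all 16 bytes are readable,
   e.g. by testing crosses_page first.  */
static unsigned
stop_mask_unaligned (const unsigned char *s)
{
  __m128i data = _mm_loadu_si128 ((const __m128i *) s);
  __m128i t = _mm_cmpeq_epi8 (data, _mm_set1_epi8 ('\n'));

  t = _mm_or_si128 (t, _mm_cmpeq_epi8 (data, _mm_set1_epi8 ('\r')));
  t = _mm_or_si128 (t, _mm_cmpeq_epi8 (data, _mm_set1_epi8 ('\\')));
  t = _mm_or_si128 (t, _mm_cmpeq_epi8 (data, _mm_set1_epi8 ('?')));
  return (unsigned) _mm_movemask_epi8 (t);
}
```

When crosses_page is true, one could fall back to the aligned load with
masking for that first block only, so the slow path is rare.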

That improved the sse2 code to be around 5% faster than the sse4.2 code.

A second optimization is that most lines are less than 80 characters
long, so don't bother with the loop; just do the checks in the header.
That gives another 5%.
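
Roughly, the header could be shaped like the sketch below. It assumes
x86 with SSE2 and GCC's __builtin_ctz, uses a 64-byte header as an
arbitrary choice, and guards the loads with a length test rather than a
page test to keep the example short. All names are hypothetical, and the
fallback is a plain byte loop standing in for the real vector loop.

```c
#include <emmintrin.h>
#include <stddef.h>

/* Pointer to the first assumed stop character in the 16 bytes at s,
   or NULL if there is none.  All 16 bytes must be readable.  */
static const unsigned char *
check16 (const unsigned char *s)
{
  __m128i data = _mm_loadu_si128 ((const __m128i *) s);
  __m128i t = _mm_cmpeq_epi8 (data, _mm_set1_epi8 ('\n'));
  unsigned mask;

  t = _mm_or_si128 (t, _mm_cmpeq_epi8 (data, _mm_set1_epi8 ('\r')));
  t = _mm_or_si128 (t, _mm_cmpeq_epi8 (data, _mm_set1_epi8 ('\\')));
  t = _mm_or_si128 (t, _mm_cmpeq_epi8 (data, _mm_set1_epi8 ('?')));
  mask = (unsigned) _mm_movemask_epi8 (t);
  return mask ? s + __builtin_ctz (mask) : NULL;
}

/* First stop character in [s, end), or end if there is none.  */
static const unsigned char *
find_stop (const unsigned char *s, const unsigned char *end)
{
  const unsigned char *r;

  /* Header: four straight-line checks, so short lines never reach
     the loop.  */
  if (end - s >= 64)
    {
      if ((r = check16 (s)) != NULL) return r;
      if ((r = check16 (s + 16)) != NULL) return r;
      if ((r = check16 (s + 32)) != NULL) return r;
      if ((r = check16 (s + 48)) != NULL) return r;
      s += 64;
    }

  /* Fallback for long lines and buffer tails, bytewise for brevity.  */
  for (; s < end; s++)
    if (*s == '\n' || *s == '\r' || *s == '\\' || *s == '?')
      return s;
  return end;
}
```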

The benchmark is a bit ugly; usage is

./benchmark file function repeat
where you need to supply a source file named file, which will be scanned
repeat times. The functions tested are the following:
./benchmark foo.c 1 100000 # strpbrk
./benchmark foo.c 2 100000 # current sse2
./benchmark foo.c 3 100000 # current sse4.2
./benchmark foo.c 4 100000 # improved sse2 with unaligned check of 16 bytes.
./benchmark foo.c 5 100000 # improved sse2 with unaligned check of 16 bytes.


I will send a patch later; do you have any comments on these improvements?

Attachment: line.c
Description: Text document

