[PATCH, x86] Use vector moves in memmove expanding

Wed Apr 10 22:24:00 GMT 2013

On Wed, Apr 10, 2013 at 09:53:09PM +0400, Michael Zolotukhin wrote:
> > Hi, I am writing memcpy for libc. It avoids computed jumps and is
> > much faster on small strings (variant for Sandy Bridge attached).
> 
> I'm not sure I get what you meant - could you please explain what
> computed jumps are?
A computed goto. See Duff's device; it works almost exactly the same way.
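
For reference, a minimal sketch of Duff's device in C. The function name
duff_copy is hypothetical and this is not the code GCC emits; it only shows
the kind of computed jump meant here, where the switch jumps into the middle
of an unrolled loop depending on the size:

```c
#include <stddef.h>

/* Hypothetical illustration: copy n bytes with an 8-way unrolled loop,
   entering the loop body at a position computed from n (Duff's device). */
static void duff_copy(unsigned char *dst, const unsigned char *src, size_t n)
{
    if (n == 0)
        return;
    size_t rounds = (n + 7) / 8;   /* number of passes through the loop */
    switch (n % 8) {               /* the switch acts as the computed jump */
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--rounds > 0);
    }
}
```

The jump target depends on the size, which is known only at run time, so it
is an indirect branch the CPU has to predict.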
> 
> > You must also check performance with cold instruction cache.
> > Now memcpy(x,y,128) takes 126 bytes which is too much.
> 
> > Do not align for small sizes. The dependency this causes erases any gains
> > that you might get. Keep in mind that in 55% of cases the data are already
> > aligned.
> 
> Other algorithms are still available and we can use them for small
> sizes. E.g. for sizes <128 we could emit a loop with GPR moves and not
> use vector instructions in it.

128 is about the upper bound you can expand with SSE moves.
Tuning did not take code size into account and measured only the case where
the code is in a tight loop.
For GPR moves the limit is around 64.

What matters is which code has the best performance/size ratio.
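
As an illustration of a small-size copy that needs neither an alignment
prologue nor a computed jump, here is a minimal sketch of the "copy the last
16 bytes first, then loop" idea from the quoted message below. It assumes
n >= 16 and non-overlapping buffers; copy_tail_first is a hypothetical name,
and the fixed-size memcpy calls stand in for unaligned 16-byte vector moves:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: copy n >= 16 bytes between non-overlapping buffers. */
static void copy_tail_first(unsigned char *dst, const unsigned char *src,
                            size_t n)
{
    /* Copy the last 16 bytes first. They may overlap the final block
       copied by the loop, which is harmless because the same data is
       written twice. */
    memcpy(dst + n - 16, src + n - 16, 16);

    /* Copy the 16-byte blocks that end before the end of the buffer;
       whatever remains is already covered by the tail copy above. */
    for (size_t i = 0; i + 16 < n; i += 16)
        memcpy(dst + i, src + i, 16);
}
```

The loop trip count depends only on n, so no remainder handling by byte
count is needed after it.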
> But that's tuning and I haven't worked on it yet - I'm going to
> measure the performance of all algorithms on all sizes and thus determine
> which algorithm is preferable for which sizes.
> What I did in this patch is introduce some infrastructure to allow
> emitting vector moves in movmem expansion - tuning is certainly
> possible and needed, but that's out of the scope of the patch.
> 
> On 10 April 2013 21:43, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Wed, Apr 10, 2013 at 08:14:30PM +0400, Michael Zolotukhin wrote:
> >> Hi,
> >> This patch adds a new algorithm for expanding movmem on x86 and
> >> refactors the existing implementation a bit. This is a reincarnation of
> >> the patch that was sent but wasn't checked in a couple of years ago - I
> >> have now reworked it from scratch and divided it into several more
> >> manageable parts.
> >>
> > Hi, I am writing memcpy for libc. It avoids computed jumps and is
> > much faster on small strings (variant for Sandy Bridge attached).
> >
> >> For now this algorithm isn't used, because cost_models are tuned to
> >> use existing ones. I believe the new algorithm will give better
> >> performance, but I'll leave cost-models tuning for a separate patch.
> >>
> > You must also check performance with cold instruction cache.
> > Now memcpy(x,y,128) takes 126 bytes which is too much.
> >
> >> Also, I changed get_mem_align_offset to make it handle MEM_REFs as
> >> well. Probably, there is another way of getting info about alignment -
> >> if so, please let me know.
> >>
> > Do not align for small sizes. The dependency this causes erases any gains
> > that you might get. Keep in mind that in 55% of cases the data are already
> > aligned.
> >
> > Also, in my tests the best way to handle the prologue is to first copy
> > the last 16 bytes and then loop.
> >
> >> Similar improvements could be done in expanding of memset, but that's
> >> in progress now and I'm going to proceed with it if this patch is ok.
> >>
> >> Bootstrap/make check/Specs2k are passing on i686 and x86_64.
> >>
> >> Is it ok for trunk?
> >>
> >> Changelog entry:
> >>
> >> 2013-04-10  Michael Zolotukhin  <michael.v.zolotukhin@gmail.com>
> >>
> >>         * config/i386/i386-opts.h (enum stringop_alg): Add vector_loop.
> >>         * config/i386/i386.c (expand_set_or_movmem_via_loop): Use
> >>         adjust_address instead of change_address to keep info about alignment.
> >>         (emit_strmov): Remove.
> >>         (emit_memmov): New function.
> >>         (expand_movmem_epilogue): Refactor to properly handle bigger sizes.
> >>         (expand_movmem_epilogue): Likewise and return updated rtx for
> >>         destination.
> >>         (expand_constant_movmem_prologue): Likewise and return updated rtx for
> >>         destination and source.
> >>         (decide_alignment): Refactor, handle vector_loop.
> >>         (ix86_expand_movmem): Likewise.
> >>         (ix86_expand_setmem): Likewise.
> >>         * config/i386/i386.opt (Enum): Add vector_loop to option stringop_alg.
> >>         * emit-rtl.c (get_mem_align_offset): Compute alignment for MEM_REF.
> >>
> >>
> >> --
> >> ---
> >> Best regards,
> >> Michael V. Zolotukhin,
> >> Software Engineer
> >> Intel Corporation.
> >
> 
> 
> 
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.

-- 

Traffic jam on the Information Superhighway.