This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH, x86] Use vector moves in memmove expanding
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Michael Zolotukhin <michael dot v dot zolotukhin at gmail dot com>
- Cc: Jan Hubicka <hubicka at ucw dot cz>, "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>
- Date: Fri, 12 Apr 2013 10:54:16 +0200
- Subject: Re: [PATCH, x86] Use vector moves in memmove expanding
- References: <CANtU07_xUQHqFVhc=xXcXC1T0c37FhW+F9O8BgHtnoq2LNsEYw at mail dot gmail dot com> <20130410174302 dot GA9599 at domone dot kolej dot mff dot cuni dot cz> <CANtU07-JZECds2sVZ7Pb8i5ySJs-H2ndkCZSPmJHbW0CK9Pmzw at mail dot gmail dot com> <20130410185355 dot GB9786 at domone dot kolej dot mff dot cuni dot cz> <CANtU07_PKB9thB7q7CAGVgQ3PveHs8GpBfTULSd+BhAcKeHEoQ at mail dot gmail dot com>
On Thu, Apr 11, 2013 at 04:32:30PM +0400, Michael Zolotukhin wrote:
> > 128 is about the upper bound you can expand with SSE moves.
> > Tuning did not take code size into account and measured only when the
> > code is in a tight loop.
> > For GPR moves the limit is around 64.
> Thanks for the data - I've not performed measurements with this
> implementation yet, but we surely should adjust thresholds to avoid
> performance degradations on small sizes.
>
I did some profiling of the builtin implementation; download
http://kam.mff.cuni.cz/~ondra/memcpy_profile_builtin.tar.bz2
and see the files results_rand/result.html and results_rand_noicache/result.html.
A memcpy_new_builtin for sizes x0,x1...x5 calls the builtin, and calls
memcpy_new otherwise.
I did the same for memcpy_glibc to see the variance.
memcpy_new does not call the builtin.
To regenerate the graphs on another arch, run the benchmarks script.
To use another builtin, change in the Makefile how variant/builtin.c is
compiled.
The builtin numbers are faster by the cost of the inlined function call
they avoid; I did not add that in, as I do not have an estimate of this
cost.
> Michael
>
> > On 10 April 2013 22:53, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Wed, Apr 10, 2013 at 09:53:09PM +0400, Michael Zolotukhin wrote:
> >> > Hi, I am writing a memcpy for libc. It avoids a computed jump and is
> >> > much faster on small strings (variant for Sandy Bridge attached).
> >>
> >> I'm not sure I get what you meant - could you please explain what is
> >> computed jumps?
> > A computed goto. See Duff's device; it works almost exactly the same way.
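For readers unfamiliar with the term, a Duff's-device-style dispatch looks roughly like the sketch below. The switch on the byte remainder compiles to an indirect jump through a table, and that indirect branch is the "computed jump" the new glibc variant tries to avoid on small sizes. The name copy_tail and the 8-byte granularity are illustrative assumptions, not glibc's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of a computed-jump tail copy in the style of Duff's device.
 * The switch on (n & 7) becomes an indirect jump to one of 8 targets. */
static void copy_tail(unsigned char *d, const unsigned char *s, size_t n)
{
    size_t i = n & ~(size_t)7;   /* bulk part, 8 bytes at a time */
    memcpy(d, s, i);
    d += i;
    s += i;
    switch (n & 7) {             /* computed jump: one of 8 targets */
    case 7: d[6] = s[6]; /* fall through */
    case 6: d[5] = s[5]; /* fall through */
    case 5: d[4] = s[4]; /* fall through */
    case 4: d[3] = s[3]; /* fall through */
    case 3: d[2] = s[2]; /* fall through */
    case 2: d[1] = s[1]; /* fall through */
    case 1: d[0] = s[0]; /* fall through */
    case 0: break;
    }
}
```

On short strings the indirect branch in the switch is hard to predict, which is why a branch-free or straight-line variant can win there.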
> >>
> >> > You must also check performance with cold instruction cache.
> >> > Now memcpy(x,y,128) takes 126 bytes which is too much.
> >>
> >> > Do not align for small sizes. The dependency this causes erases any
> >> > gains you might get. Keep in mind that in 55% of cases the data are
> >> > already aligned.
> >>
> >> Other algorithms are still available and we can use them for small
> >> sizes. E.g. for sizes <128 we could emit a loop with GPR moves and not
> >> use vector instructions in it.
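Expressed in C rather than RTL, such a GPR-move loop would look roughly like this. The name copy_gpr_loop and the fixed 8-byte width are illustrative assumptions; the patch emits this shape as RTL during movmem expansion, not as C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a small-size copy loop using only general-purpose registers:
 * one 64-bit load/store pair per iteration, plus a byte tail. */
static void copy_gpr_loop(unsigned char *d, const unsigned char *s, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);   /* memcpy avoids unaligned-access UB; */
        memcpy(d + i, &w, 8);   /* compilers lower it to one load + one store */
    }
    for (; i < n; i++)          /* remaining 0-7 bytes */
        d[i] = s[i];
}
```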
> >
> > 128 is about the upper bound you can expand with SSE moves.
> > Tuning did not take code size into account and measured only when the
> > code is in a tight loop.
> > For GPR moves the limit is around 64.
> >
> > What matters is which code has the best performance/size ratio.
> >> But that's tuning and I haven't worked on it yet - I'm going to
> >> measure the performance of all algorithms on all sizes and thus
> >> determine on which sizes which algorithm is preferable.
> >> What I did in this patch is introduce some infrastructure to allow
> >> emitting vector moves in movmem expanding - tuning is certainly
> >> possible and needed, but that's out of the scope of this patch.
> >>
> >> On 10 April 2013 21:43, Ondřej Bílka <neleai@seznam.cz> wrote:
> >> > On Wed, Apr 10, 2013 at 08:14:30PM +0400, Michael Zolotukhin wrote:
> >> >> Hi,
> >> >> This patch adds a new algorithm of expanding movmem in x86 and slightly
> >> >> refactors the existing implementation. This is a reincarnation of a patch
> >> >> that was sent a couple of years ago but was never checked in - now I have
> >> >> reworked it from scratch and divided it into several more manageable parts.
> >> >>
> >> > Hi, I am writing a memcpy for libc. It avoids a computed jump and is
> >> > much faster on small strings (variant for Sandy Bridge attached).
> >> >
> >> >> For now this algorithm isn't used, because cost_models are tuned to
> >> >> use existing ones. I believe the new algorithm will give better
> >> >> performance, but I'll leave cost-models tuning for a separate patch.
> >> >>
> >> > You must also check performance with cold instruction cache.
> >> > Now memcpy(x,y,128) takes 126 bytes which is too much.
> >> >
> >> >> Also, I changed get_mem_align_offset to make it handle MEM_REFs as
> >> >> well. Probably, there is another way of getting info about alignment -
> >> >> if so, please let me know.
> >> >>
> >> > Do not align for small sizes. The dependency this causes erases any
> >> > gains you might get. Keep in mind that in 55% of cases the data are
> >> > already aligned.
> >> >
> >> > Also, in my tests the best way to handle the prologue is to first copy
> >> > the last 16 bytes and then loop.
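The scheme described above can be sketched as follows: store the final 16 bytes first, then run a plain 16-byte loop from the start; the loop's last iteration may overlap bytes the first store already wrote, which is harmless for a forward copy between distinct buffers, and no separate remainder epilogue is needed. This is illustrative only; it assumes n >= 16 and non-overlapping buffers, which a real expansion would have to guard, and copy_tail_first is an invented name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch: copy the last 16 bytes up front, then loop in 16-byte
 * (SSE-register-sized) chunks; overlapping stores replace the epilogue.
 * Assumes n >= 16 and that src and dst do not overlap. */
static void copy_tail_first(unsigned char *d, const unsigned char *s, size_t n)
{
    memcpy(d + n - 16, s + n - 16, 16);   /* last 16 bytes first */
    for (size_t i = 0; i + 16 <= n; i += 16)
        memcpy(d + i, s + i, 16);         /* may re-store a few tail bytes */
}
```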
> >> >
> >> >> Similar improvements could be done in expanding of memset, but that's
> >> >> in progress now and I'm going to proceed with it if this patch is ok.
> >> >>
> >> >> Bootstrap/make check/Specs2k are passing on i686 and x86_64.
> >> >>
> >> >> Is it ok for trunk?
> >> >>
> >> >> Changelog entry:
> >> >>
> >> >> 2013-04-10 Michael Zolotukhin <michael.v.zolotukhin@gmail.com>
> >> >>
> >> >> * config/i386/i386-opts.h (enum stringop_alg): Add vector_loop.
> >> >> * config/i386/i386.c (expand_set_or_movmem_via_loop): Use
> >> >> adjust_address instead of change_address to keep info about alignment.
> >> >> (emit_strmov): Remove.
> >> >> (emit_memmov): New function.
> >> >> (expand_movmem_epilogue): Refactor to properly handle bigger sizes.
> >> >> (expand_setmem_epilogue): Likewise and return updated rtx for
> >> >> destination.
> >> >> (expand_constant_movmem_prologue): Likewise and return updated rtx for
> >> >> destination and source.
> >> >> (decide_alignment): Refactor, handle vector_loop.
> >> >> (ix86_expand_movmem): Likewise.
> >> >> (ix86_expand_setmem): Likewise.
> >> >> * config/i386/i386.opt (Enum): Add vector_loop to option stringop_alg.
> >> >> * emit-rtl.c (get_mem_align_offset): Compute alignment for MEM_REF.
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
--
Spider infestation in warm case parts