Use of vector instructions in memmov/memset expanding

Wed Sep 28 12:29:00 GMT 2011

Attached is part 1 of the patch that enables the use of vector
instructions in memset and memcpy expansion (the middle-end part).
Most of the changes are in the functions
move_by_pieces/set_by_pieces. The move-mode selection algorithm has
changed: it now checks whether the alignment is known at compile time
and uses cost models to choose between aligned and unaligned moves,
in vector or non-vector modes.
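
As a rough illustration of that selection (not the code from the
patch), the choice can be sketched in plain C as below. The cost
structure, the function names and the greedy move count are all
assumptions made for this example; the real implementation works on
GCC machine modes and target cost tables.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-move costs; these are NOT GCC's real cost tables. */
struct move_costs {
    unsigned aligned;    /* cost of one aligned move */
    unsigned unaligned;  /* cost of one unaligned move */
};

enum move_strategy { MOVE_ALIGNED, MOVE_UNALIGNED };

/* Number of moves needed to cover len bytes when the widest move is
   `width` bytes and narrower power-of-two moves handle the tail. */
static size_t count_moves(size_t len, size_t width)
{
    size_t n = 0;
    assert(width != 0 && (width & (width - 1)) == 0);
    while (len > 0) {
        while (width > len)
            width /= 2;          /* fall back to a narrower move */
        n += len / width;
        len %= width;
    }
    return n;
}

/* known_align: alignment in bytes known at compile time (1 if unknown).
   max_width: widest move available, e.g. 16 for a 128-bit vector.
   Aligned moves can be no wider than the known alignment; unaligned
   moves can use the full width but cost more per move. */
static enum move_strategy
choose_strategy(size_t len, size_t known_align, size_t max_width,
                const struct move_costs *c)
{
    size_t aligned_width = known_align < max_width ? known_align : max_width;
    size_t aligned_cost = count_moves(len, aligned_width) * c->aligned;
    size_t unaligned_cost = count_moves(len, max_width) * c->unaligned;
    return aligned_cost <= unaligned_cost ? MOVE_ALIGNED : MOVE_UNALIGNED;
}
```

With assumed costs of 1 (aligned) and 2 (unaligned), a 32-byte block
with known 16-byte alignment would use aligned moves, while the same
block with only 4-byte alignment would use unaligned 16-byte moves.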

The build and 'make check' were tested. 'make check' shows a failure
that will be fixed once the complete patch is applied.

On 27 September 2011 18:44, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> I divided the patch into three smaller ones:
>
> 1) Patch with target-independent changes (see attached file memfunc-mid.patch).
> Most of the changes are in the functions
> move_by_pieces/set_by_pieces. The move-mode selection algorithm has
> changed: it now checks whether the alignment is known at compile time
> and uses cost models to choose between aligned and unaligned moves,
> in vector or non-vector modes.
>
> 2) Patch with target-dependent changes (memfunc-be.patch).
> Most of the changes are in the functions
> ix86_expand_setmem/ix86_expand_movmem. The other changes are only
> needed to support them.
> The changes mostly affect the unrolled_loop strategy, which can now
> use vector move modes. That resulted in large epilogues and
> prologues, so their generation was also modified.
> This patch contains some middle-end changes (to make the build
> possible), but all of them are also present in the first patch, so
> there is no need to review them here.
>
> 3) Patch with all new tests (memfunc-tests.patch).
> This patch contains many small tests for different memset and memcpy cases.
>
> Applied separately, these patches won't give any performance gain.
> The positive effect will be noticeable only when they are applied
> together (I also attach the complete patch, see file
> memfunc-complete.patch).
>
>
> If you have any questions regarding these changes, please don't
> hesitate to ask them.
>
>
> On 18 July 2011 15:00, Michael Zolotukhin
> <michael.v.zolotukhin@gmail.com> wrote:
>> Here is a summary - probably, it doesn't cover every single piece in
>> the patch, but I tried to describe the major changes. I hope this will
>> help you a bit - and of course I'll answer your further questions if
>> they appear.
>>
>> The changes can be logically divided into two parts (though these
>> parts have something in common).
>> The first part is the changes in the target-independent code, in the
>> functions move_by_pieces() and store_by_pieces(), mostly located in expr.c.
>> The second part touches ix86_expand_movmem() and ix86_expand_setmem(),
>> mostly located in config/i386/i386.c.
>>
>> Changes in i386.c (target-dependent part):
>> 1) Strategies for cases with known and unknown alignment are separated
>> from each other.
>> When the alignment is known at compile time, we can generate optimized
>> code without libcalls.
>> When it's unknown, we can sometimes, but not always, emit runtime
>> checks to reach the desired alignment.
>> Strategies for atom, generic_32 and generic_64 were chosen based on
>> a set of experiments; strategies in the other
>> cost models are unchanged (strategies for unknown alignment are copied
>> from the existing strategies).
>> 2) The unrolled_loop algorithm was modified: it now uses SSE move
>> modes if they're available.
>> 3) The amount of data moved in one iteration increased greatly and
>> epilogues became bigger, so some changes were needed in epilogue
>> generation. In some cases a special (not unrolled) loop is generated
>> in the epilogue to avoid slow byte-by-byte copying (the changes in
>> expand_set_or_movmem_via_loop() and the introduction of
>> expand_set_or_movmem_via_loop_with_iter() are made for these cases).
>> 4) As a bigger alignment might be needed than previously, prologue
>> generation was also modified.
>>
>> Changes in expr.c (target-independent part):
>> There are now two possible strategies: aligned moves and unaligned
>> moves. A cost model was implemented for each of them, and the choice
>> is made according to the cost of each option. The move mode is chosen
>> by the functions widest_mode_for_unaligned_mov() and
>> widest_mode_for_aligned_mov().
>> Cost estimation is implemented in the functions compute_aligned_cost()
>> and compute_unaligned_cost().
>> The choice between these two strategies and the generation of the
>> moves themselves are in the function move_by_pieces().
>>
>> Function store_by_pieces() calls set_by_pieces_1() instead of
>> store_by_pieces_1() in the memset case (I needed to introduce
>> set_by_pieces_1() to separate the memset case from the others:
>> store_by_pieces_1() is sometimes called for strcpy and some other
>> functions, not only for memset).
>>
>> set_by_pieces_1() estimates the costs of the aligned and unaligned
>> strategies (as in move_by_pieces()) and generates moves for memset.
>> A single move is generated via generate_move_with_mode(). The first
>> time it is called, a promoted value (a register filled with copies of
>> the one-byte memset argument) is generated; later calls reuse this
>> value.
>>
>> Changes in MD-files:
>> To generate promoted values, I made some changes in
>> promote_duplicated_reg() and promote_duplicated_reg_to_size(). Expands
>> for vec_dup4si and vec_dupv2di were also introduced for this (these
>> expands differ from the corresponding define_insns: the existing
>> define_insns work only with registers, while the new expands can also
>> process a memory operand).
>>
>> Some code was added to allow generation of MOVQ (with SSE registers).
>> Such moves are unusual because they use only half of
>> an xmm register.
>> There was a need to generate such moves explicitly, so I added a
>> simple expand to sse.md.
>>
>>
>> On 16 July 2011 03:24, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>> > A new algorithm for move-mode selection is implemented for move_by_pieces
>>>> > and store_by_pieces.
>>>> > The x86-specific ix86_expand_movmem and ix86_expand_setmem are also changed
>>>> > in a similar way, and the x86 cost-model parameters are slightly changed to
>>>> > support this. This implementation checks whether the array's alignment is
>>>> > known at compile time and chooses the expansion algorithm and move mode
>>>> > accordingly.
>>>
>>> Can you give some summary of the changes you made?  It would make it a lot easier to
>>> review if it was broken up into the generic changes (with a rationale for why they are
>>> needed) and the i386 backend changes that I could then review.
>>>
>>> From a first pass through the patch I don't quite see the need for, e.g., adding
>>> new move patterns when we can output all kinds of SSE moves already.  Will look
>>> more into the patch to see if I can come up with useful comments.
>>>
>>> Honza
>>>
>>
>
>
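
To make the "promoted value" and the not-unrolled epilogue loop from
the quoted summary more concrete, here is a minimal sketch in plain C.
It is only an analogy: the function names are hypothetical, a 64-bit
word stands in for a vector register, and memcpy() is used to express
stores that are safe at any alignment. It is not the code the patch
emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replicate one byte into every byte of a 64-bit word. */
static uint64_t promote_byte(unsigned char b)
{
    return (uint64_t)b * UINT64_C(0x0101010101010101);
}

/* Hypothetical memset sketch: unrolled wide stores, then a narrower
   loop that is not unrolled, then single bytes. */
static void *sketch_memset(void *dst, int value, size_t len)
{
    unsigned char *p = dst;
    /* The promoted value is computed once and reused by every store. */
    uint64_t wide = promote_byte((unsigned char)value);

    /* Main loop: 32 bytes per iteration as four 8-byte stores. */
    while (len >= 32) {
        memcpy(p, &wide, 8);
        memcpy(p + 8, &wide, 8);
        memcpy(p + 16, &wide, 8);
        memcpy(p + 24, &wide, 8);
        p += 32;
        len -= 32;
    }
    /* Epilogue loop: one 8-byte store per iteration, so at most
       7 bytes are left for the byte-by-byte tail. */
    while (len >= 8) {
        memcpy(p, &wide, 8);
        p += 8;
        len -= 8;
    }
    /* Tail: remaining bytes one at a time. */
    while (len > 0) {
        *p++ = (unsigned char)wide;
        len--;
    }
    return dst;
}
```

Without the middle loop, a 32-byte main loop could leave up to 31
bytes to be stored one at a time, which is the slow case the epilogue
changes are meant to avoid.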

-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.