[PATCH AARCH64] load store pair optimization using sched_fusion pass.

Bin.Cheng amker.cheng@gmail.com
Mon Dec 8 03:26:00 GMT 2014


On Mon, Dec 8, 2014 at 11:16 AM, Andrew Pinski <pinskia@gmail.com> wrote:
> On Sun, Dec 7, 2014 at 5:47 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Sun, Dec 7, 2014 at 6:24 PM, Andrew Pinski <pinskia@gmail.com> wrote:
>>> On Sat, Dec 6, 2014 at 6:35 PM, Andrew Pinski <pinskia@gmail.com> wrote:
>>>> On Sat, Dec 6, 2014 at 5:54 PM, Andrew Pinski <pinskia@gmail.com> wrote:
>>>>> On Fri, Dec 5, 2014 at 9:08 AM, Marcus Shawcroft
>>>>> <marcus.shawcroft@gmail.com> wrote:
>>>>>> On 18 November 2014 at 08:34, Bin Cheng <bin.cheng@arm.com> wrote:
>>>>>>
>>>>>>> 2014-11-18  Bin Cheng  <bin.cheng@arm.com>
>>>>>>>
>>>>>>>         * config/aarch64/aarch64.md (load_pair<mode>): Split to
>>>>>>>         load_pairsi, load_pairdi, load_pairsf and load_pairdf.
>>>>>>>         (load_pairsi, load_pairdi, load_pairsf, load_pairdf): Split
>>>>>>>         from load_pair<mode>.  New alternative to support int/fp
>>>>>>>         registers in fp/int mode patterns.
>>>>>>>         (store_pair<mode>): Split to store_pairsi, store_pairdi,
>>>>>>>         store_pairsf and store_pairdf.
>>>>>>>         (store_pairsi, store_pairdi, store_pairsf, store_pairdf): Split
>>>>>>>         from store_pair<mode>.  New alternative to support int/fp
>>>>>>>         registers in fp/int mode patterns.
>>>>>>>         (*load_pair_extendsidi2_aarch64): New pattern.
>>>>>>>         (*load_pair_zero_extendsidi2_aarch64): New pattern.
>>>>>>>         (aarch64-ldpstp.md): Include.
>>>>>>>         * config/aarch64/aarch64-ldpstp.md: New file.
>>>>>>>         * config/aarch64/aarch64-protos.h (aarch64_gen_adjusted_ldpstp):
>>>>>>> New.
>>>>>>>         (extract_base_offset_in_addr): New.
>>>>>>>         (aarch64_operands_ok_for_ldpstp): New.
>>>>>>>         (aarch64_operands_adjust_ok_for_ldpstp): New.
>>>>>>>         * config/aarch64/aarch64.c (enum sched_fusion_type): New enum.
>>>>>>>         (TARGET_SCHED_FUSION_PRIORITY): New hook.
>>>>>>>         (fusion_load_store): New function.
>>>>>>>         (extract_base_offset_in_addr): New function.
>>>>>>>         (aarch64_gen_adjusted_ldpstp): New function.
>>>>>>>         (aarch64_sched_fusion_priority): New function.
>>>>>>>         (aarch64_operands_ok_for_ldpstp): New function.
>>>>>>>         (aarch64_operands_adjust_ok_for_ldpstp): New function.
>>>>>>>
>>>>>>> 2014-11-18  Bin Cheng  <bin.cheng@arm.com>
>>>>>>>
>>>>>>>         * gcc.target/aarch64/ldp-stp-1.c: New test.
>>>>>>>         * gcc.target/aarch64/ldp-stp-2.c: New test.
>>>>>>>         * gcc.target/aarch64/ldp-stp-3.c: New test.
>>>>>>>         * gcc.target/aarch64/ldp-stp-4.c: New test.
>>>>>>>         * gcc.target/aarch64/ldp-stp-5.c: New test.
>>>>>>>         * gcc.target/aarch64/lr_free_1.c: Disable scheduling fusion
>>>>>>>         and peephole2 pass.
>>>>>>
>>>>>> Committed, thanks. /Marcus
>>>>>
>>>>>
>>>>> This patch has a bug in how it handles volatile mems.  It will
>>>>> peephole two volatile mems into a pair, which is not correct: the
>>>>> order of the two accesses within a pair is not defined by the
>>>>> architecture, so the load/store order might even be swapped
>>>>> incorrectly.
>>>>> I noticed this when I was merging in my changes for an improved
>>>>> peephole, which already has a check for volatile.
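>>>>> For illustration, a minimal sketch of the kind of code affected
>>>>> (read_two_regs and the regs pointer are hypothetical, not taken
>>>>> from a real testcase):
>>>>>
>>>>>   unsigned int
>>>>>   read_two_regs (volatile unsigned int *regs)
>>>>>   {
>>>>>     unsigned int a = regs[0];  /* first device read */
>>>>>     unsigned int b = regs[1];  /* second device read */
>>>>>     return a + b;
>>>>>   }
>>>>>
>>>>> If the two loads here are fused into a single ldp, the architecture
>>>>> no longer guarantees that regs[0] is read before regs[1], which
>>>>> breaks the ordering a volatile device mapping relies on.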
>>>>
>>>> Something like:
>>>> @@ -10555,6 +10863,10 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
>>>>        reg_2 = operands[3];
>>>>      }
>>>>
>>>> +  /* The mems cannot be volatile.  */
>>>> +  if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
>>>> +    return false;
>>>> +
>>>>    /* Check if the addresses are in the form of [base+offset].  */
>>>>    extract_base_offset_in_addr (mem_1, &base_1, &offset_1);
>>>>    if (base_1 == NULL_RTX || offset_1 == NULL_RTX)
>>>>
>>>>
>>>> ---- CUT ----
>>>> gcc.target/aarch64/rev16_1.c is the testcase where two volatile
>>>> loads were being merged into one load pair.
>>>
>>>
>>> And this one:
>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>>> index 40881e7..da36dc8 100644
>>> --- a/gcc/config/aarch64/aarch64.c
>>> +++ b/gcc/config/aarch64/aarch64.c
>>> @@ -10974,6 +10974,10 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
>>>    if (!MEM_P (mem_1) || aarch64_mem_pair_operand (mem_1, mode))
>>>      return false;
>>>
>>> +  if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2)
>>> +      || MEM_VOLATILE_P (mem_3) || MEM_VOLATILE_P (mem_4))
>>> +    return false;
>>> +
>>>    /* Check if the addresses are in the form of [base+offset].  */
>>>    extract_base_offset_in_addr (mem_1, &base_1, &offset_1);
>>>    if (base_1 == NULL_RTX || offset_1 == NULL_RTX)
>>> --- CUT ---
>>> This one happens with c-c++-common/torture/complex-sign-add.c.
>>> I will post a full patch tomorrow.
>>>
>>> Thanks,
>>> Andrew
>>>
>>>>
>>>> Thanks,
>>>> Andrew
>>>>
>>>>>
>>>>> Thanks,
>>>>> Andrew Pinski
>> Hi Andrew,
>> Thanks for reporting this.  When I faced this volatile problem, I
>> thought it was fine as long as the memory accesses weren't removed
>> accidentally.  Could you please explain more about why the order of
>> volatile accesses also matters?
>
> Because volatile accesses cannot be reordered across a sequence point.
> See https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html .
> Mainly:
> " The minimum requirement is that at a sequence point all previous
> accesses to volatile objects have stabilized and no subsequent
> accesses have occurred. Thus an implementation is free to reorder and
> combine volatile accesses that occur between sequence points, but
> cannot do so for accesses across a sequence point".
>
>> Is it related to some barriers between the two
>> accesses?
>
> Volatile accesses are not allowed to be merged or reordered, as I
> pointed out above, which gives an implicit barrier between the two
> accesses.
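> As a concrete sketch (flag and data are hypothetical volatile
> variables, not from the testcases):
>
>   volatile int flag;
>   volatile int data;
>
>   int
>   wait_then_read (void)
>   {
>     while (flag == 0)  /* each iteration ends at a sequence point */
>       ;
>     return data;       /* must not move before the final read of flag */
>   }
>
> The sequence point after each read of flag means the read of data
> cannot be combined with, or hoisted above, those reads.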
Ah, yes, it is an implicit barrier that can't be crossed.
Thanks very much for the explanation.

Thanks,
bin
>
>> If that's the case, does it imply that dependence created
>> by GCC isn't correct?
>
> It is correct, just not the peepholes you created.
>
>> I am still not sure why the case passed in my
>> regression test, or how I missed it.
>
> Well, that is because the ICE only shows up with my improved
> peephole2/load_pair/store_pair patterns.  The load_pair/store_pair
> patterns I had created were rejecting volatile memory locations, to
> make sure nobody used these patterns with volatile memory.  Once I get
> some more time, I will post a clean-up patch of my
> peephole2/load_pair/store_pair patterns, which I think are slightly
> easier to understand.
>
> Thanks,
> Andrew Pinski
>
>>
>> Anyway, volatile cases should be handled conservatively here.  Thanks again.
>>
>> Thanks,
>> bin


