[PATCH AARCH64] Load/store pair optimization using the sched_fusion pass.

Bin.Cheng amker.cheng@gmail.com
Mon Dec 8 03:35:00 GMT 2014


On Mon, Dec 8, 2014 at 11:26 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Mon, Dec 8, 2014 at 11:16 AM, Andrew Pinski <pinskia@gmail.com> wrote:
>> On Sun, Dec 7, 2014 at 5:47 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Sun, Dec 7, 2014 at 6:24 PM, Andrew Pinski <pinskia@gmail.com> wrote:
>>>> On Sat, Dec 6, 2014 at 6:35 PM, Andrew Pinski <pinskia@gmail.com> wrote:
>>>>> On Sat, Dec 6, 2014 at 5:54 PM, Andrew Pinski <pinskia@gmail.com> wrote:
>>>>>> On Fri, Dec 5, 2014 at 9:08 AM, Marcus Shawcroft
>>>>>> <marcus.shawcroft@gmail.com> wrote:
>>>>>>> On 18 November 2014 at 08:34, Bin Cheng <bin.cheng@arm.com> wrote:
>>>>>>>
>>>>>>>> 2014-11-18  Bin Cheng  <bin.cheng@arm.com>
>>>>>>>>
>>>>>>>>         * config/aarch64/aarch64.md (load_pair<mode>): Split to
>>>>>>>>         load_pairsi, load_pairdi, load_pairsf and load_pairdf.
>>>>>>>>         (load_pairsi, load_pairdi, load_pairsf, load_pairdf): Split
>>>>>>>>         from load_pair<mode>.  New alternative to support int/fp
>>>>>>>>         registers in fp/int mode patterns.
>>>>>>>>         (store_pair<mode>): Split to store_pairsi, store_pairdi,
>>>>>>>>         store_pairsf and store_pairdf.
>>>>>>>>         (store_pairsi, store_pairdi, store_pairsf, store_pairdf): Split
>>>>>>>>         from store_pair<mode>.  New alternative to support int/fp
>>>>>>>>         registers in fp/int mode patterns.
>>>>>>>>         (*load_pair_extendsidi2_aarch64): New pattern.
>>>>>>>>         (*load_pair_zero_extendsidi2_aarch64): New pattern.
>>>>>>>>         (aarch64-ldpstp.md): Include.
>>>>>>>>         * config/aarch64/aarch64-ldpstp.md: New file.
>>>>>>>>         * config/aarch64/aarch64-protos.h
>>>>>>>>         (aarch64_gen_adjusted_ldpstp): New.
>>>>>>>>         (extract_base_offset_in_addr): New.
>>>>>>>>         (aarch64_operands_ok_for_ldpstp): New.
>>>>>>>>         (aarch64_operands_adjust_ok_for_ldpstp): New.
>>>>>>>>         * config/aarch64/aarch64.c (enum sched_fusion_type): New enum.
>>>>>>>>         (TARGET_SCHED_FUSION_PRIORITY): New hook.
>>>>>>>>         (fusion_load_store): New function.
>>>>>>>>         (extract_base_offset_in_addr): New function.
>>>>>>>>         (aarch64_gen_adjusted_ldpstp): New function.
>>>>>>>>         (aarch64_sched_fusion_priority): New function.
>>>>>>>>         (aarch64_operands_ok_for_ldpstp): New function.
>>>>>>>>         (aarch64_operands_adjust_ok_for_ldpstp): New function.
>>>>>>>>
>>>>>>>> 2014-11-18  Bin Cheng  <bin.cheng@arm.com>
>>>>>>>>
>>>>>>>>         * gcc.target/aarch64/ldp-stp-1.c: New test.
>>>>>>>>         * gcc.target/aarch64/ldp-stp-2.c: New test.
>>>>>>>>         * gcc.target/aarch64/ldp-stp-3.c: New test.
>>>>>>>>         * gcc.target/aarch64/ldp-stp-4.c: New test.
>>>>>>>>         * gcc.target/aarch64/ldp-stp-5.c: New test.
>>>>>>>>         * gcc.target/aarch64/lr_free_1.c: Disable scheduling fusion
>>>>>>>>         and peephole2 pass.
>>>>>>>
>>>>>>> Committed, thanks. /Marcus
>>>>>>
>>>>>>
>>>>>> This patch has a bug in its handling of volatile mems.  It will
>>>>>> peephole two volatile mems into a pair, which is not correct: the
>>>>>> architecture does not define the order of the two accesses within
>>>>>> a pair, so it might even swap the order of the loads/stores
>>>>>> incorrectly.
>>>>>> I noticed this when I was merging in my changes for an improved
>>>>>> peephole, which already has a check for volatile.
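>>>>>>
>>>>>> A reduced illustration of the hazard (hypothetical code, not the
>>>>>> failing testcase):
>>>>>>
>>>>>>   unsigned int
>>>>>>   read_status_then_data (volatile unsigned int *dev)
>>>>>>   {
>>>>>>     unsigned int status = dev[0];  /* must happen first */
>>>>>>     unsigned int data = dev[1];    /* must happen second */
>>>>>>     return status + data;
>>>>>>   }
>>>>>>
>>>>>> With the buggy peephole the two loads can become a single
>>>>>> "ldp w0, w1, [x0]", and the architecture does not define in which
>>>>>> order the two words of the pair are accessed.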
>>>>>
>>>>> Something like:
>>>>> @@ -10555,6 +10863,10 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
>>>>>        reg_2 = operands[3];
>>>>>      }
>>>>>
>>>>> +  /* The mems cannot be volatile.  */
>>>>> +  if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
>>>>> +    return false;
>>>>> +
>>>>>    /* Check if the addresses are in the form of [base+offset].  */
>>>>>    extract_base_offset_in_addr (mem_1, &base_1, &offset_1);
>>>>>    if (base_1 == NULL_RTX || offset_1 == NULL_RTX)
>>>>>
>>>>>
>>>>> ---- CUT ----
>>>>> gcc.target/aarch64/rev16_1.c is the testcase where two volatile
>>>>> loads are merged into one load pair.
>>>>
>>>>
>>>> And this one:
>>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>>>> index 40881e7..da36dc8 100644
>>>> --- a/gcc/config/aarch64/aarch64.c
>>>> +++ b/gcc/config/aarch64/aarch64.c
>>>> @@ -10974,6 +10974,10 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
>>>>    if (!MEM_P (mem_1) || aarch64_mem_pair_operand (mem_1, mode))
>>>>      return false;
>>>>
>>>> +  if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2)
>>>> +      || MEM_VOLATILE_P (mem_3) || MEM_VOLATILE_P (mem_4))
>>>> +    return false;
>>>> +
>>>>    /* Check if the addresses are in the form of [base+offset].  */
>>>>    extract_base_offset_in_addr (mem_1, &base_1, &offset_1);
>>>>    if (base_1 == NULL_RTX || offset_1 == NULL_RTX)
>>>> --- CUT ---
>>>> This one happens with c-c++-common/torture/complex-sign-add.c.
>>>> I will post a full patch tomorrow.
>>>>
>>>> Thanks,
>>>> Andrew
>>>>
>>>>>
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Andrew Pinski
>>> Hi Andrew,
>>> Thanks for reporting this.  When I was facing this volatile problem,
>>> I thought it was fine as long as the memory access wasn't removed
>>> accidentally.  Could you please explain more about why the order of
>>> volatile accesses also matters?
>>
>> Because volatile accesses across a sequence point cannot be reordered.
>> See https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html .
>> Mainly:
>> " The minimum requirement is that at a sequence point all previous
>> accesses to volatile objects have stabilized and no subsequent
>> accesses have occurred. Thus an implementation is free to reorder and
>> combine volatile accesses that occur between sequence points, but
>> cannot do so for accesses across a sequence point".
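>>
>> For example (a reduced illustration, not from the testsuite):
>>
>>   volatile int a, b;
>>
>>   int
>>   f (void)
>>   {
>>     int x = a;  /* sequence point at the end of this statement */
>>     int y = b;  /* so this access may not be combined with, or
>>                    reordered against, the one above */
>>     return x + y;
>>   }
>>
>> By contrast, in a single expression such as "a + b" there is no
>> intervening sequence point, so combining the two reads would be
>> permitted.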
>>
>>> Is it related to some barriers between the two
>>> accesses?
>>
>> As I pointed out above, volatile accesses are not allowed to be merged
>> or reordered across a sequence point, which gives an implicit barrier
>> between the two accesses.
> Ah, yes, it is the implicit barrier that can't be crossed.
> Thanks very much for the explanation.
>
> Thanks,
> bin
>>
>>> If that's the case, does it imply that the dependence created
>>> by GCC isn't correct?
>>
>> The dependence is correct; it's just the peepholes you created that aren't.
>>
>>> I am still not sure why the case passed my
>>> regression test, or how I missed it.
>>
>> Well, that is because the ICE only shows up with my improved
>> peephole2/load_pair/store_pair patterns.  The load_pair/store_pair
>> patterns I had created were rejecting volatile memory locations, to
>> make sure nobody used these patterns with volatile mems.  Once I get
>> some more time, I will post a clean-up patch of my
>> peephole2/load_pair/store_pair changes, which I think are slightly
>> easier to understand.
>>
>> Thanks,
>> Andrew Pinski
>>
>>>
>>> Anyway, volatile cases should be handled conservatively here.  Thanks again.
>>>
>>> Thanks,
>>> bin

As for the fix, I think it is TARGET_SCHED_FUSION_PRIORITY that needs
to be fixed.  The scheduler reorders loads/stores according to the
priorities set by that hook function.  After fixing the hook, even
volatile accesses can be paired, provided they are next to each other
(or a volatile access is next to a normal one) in the first place,
which means there is no sequence point between the volatile accesses.
Does that make sense?
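
Concretely, the direction could look like the sketch below (illustrative
only, not a tested patch).  fusion_load_store and enum sched_fusion_type
are from the committed patch; SCHED_FUSION_NONE is assumed to be the
enum's "not a candidate" value, and the volatile bail-out is the
hypothetical addition:

  static enum sched_fusion_type
  fusion_load_store (rtx_insn *insn, rtx *base, rtx *offset)
  {
    /* Only simple single-set insns can be load/store fusion candidates.  */
    rtx x = single_set (insn);
    if (x == NULL_RTX)
      return SCHED_FUSION_NONE;

    rtx src = SET_SRC (x);
    rtx dest = SET_DEST (x);

    /* A volatile access acts as an implicit barrier at the surrounding
       sequence points, so never classify it as a fusion candidate; the
       scheduler then never assigns it a fusion priority and never moves
       another access next to it.  */
    if ((MEM_P (src) && MEM_VOLATILE_P (src))
        || (MEM_P (dest) && MEM_VOLATILE_P (dest)))
      return SCHED_FUSION_NONE;

    /* ... existing classification of loads/stores from the patch ...  */
    return SCHED_FUSION_NONE;
  }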

Thanks,
bin
