This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [GCC RFC]A new and simple pass merging paired load store instructions

From: "Bin.Cheng" <amker dot cheng at gmail dot com>
To: Steven Bosscher <stevenb dot gcc at gmail dot com>
Cc: "bin.cheng" <bin dot cheng at arm dot com>, GCC Patches <gcc-patches at gcc dot gnu dot org>
Date: Fri, 16 May 2014 18:07:24 +0800
Subject: Re: [GCC RFC]A new and simple pass merging paired load store instructions
Authentication-results: sourceware.org; auth=none
References: <004d01cf700e$ef1e30e0$cd5a92a0$ at arm dot com> <CABu31nMY6zapHfhr5x4BjZ3kvFuEKuhagBfx2cYbD4bbSwybTg at mail dot gmail dot com>

On Fri, May 16, 2014 at 12:57 AM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Thu, May 15, 2014 at 9:26 AM, bin.cheng wrote:
>> Hi,
>> Targets like ARM and AARCH64 support double-word load store instructions,
>> and these instructions are generally faster than the corresponding two
>> load/stores.  GCC currently uses peephole2 to merge paired load/store into
>> one single instruction which has a disadvantage.  It can only handle simple
>> cases like the two instructions actually appear sequentially in instruction
>> stream, and is too weak to handle cases in which the two load/store are
>> intervened by other irrelevant instructions.
>>
>> Here comes up with a new GCC pass looking through each basic block and
>> merging paired load store even they are not adjacent to each other.  The
>> algorithm is pretty simple:
>> 1) In initialization pass iterating over instruction stream it collects
>> relevant memory access information for each instruction.
>> 2) It iterates over each basic block, tries to find possible paired
>> instruction for each memory access instruction.  During this work, it checks
>> dependencies between the two possible instructions and also records the
>> information indicating how to pair the two instructions.  To avoid quadratic
>> behavior of the algorithm, It introduces new parameter
>> max-merge-paired-loadstore-distance and set the default value to 4, which is
>> large enough to catch major part of opportunities on ARM/cortex-a15.
>> 3) For each candidate pair, it calls back-end's hook to do target dependent
>> check and merge the two instructions if possible.
>>
>> Though the parameter is set to 4, for miscellaneous benchmarks, this pass
>> can merge numerous opportunities except ones already merged by peephole2
>> (same level numbers of opportunities comparing to peepholed ones).  GCC
>> bootstrap can also confirm this finding.
>>
>> Yet there is an open issue about when we should run this new pass.  Though
>> register renaming is disabled by default now, I put this pass after it,
>> because renaming can resolve some false dependencies thus benefit this pass.
>> Another finding is, it can capture a lot more opportunities if it's after
>> sched2, but I am not sure whether it will mess up with scheduling results in
>> this way.
>>
>> So, any comments about this?

Hi Steven,
Thanks for reviewing this.  Here are some answers to the general questions.

>
> First off: Why does this need a target hook? We're getting more and
> more of them -- too many IMHO. There should be really good reasons for
> adding even more new ones.

Yes, I think this one does have a good reason.  The target independent
pass just makes sure that two consecutive memory access instructions
are free of data-dependency with each other, then feeds it to back-end
hook.  It's back-end's responsibility to generate correct instruction.
 It's not about modifying an existing insn then recognize it, it's
about creating new instruction sometimes.  For example, we can
generate a simple move insn in Arm mode, while have to generate a
parallel instruction in Thumb mode.  Target independent part has no
idea how to generate an expected insn.  Moreover, back-end may check
some special conditions too.

>
> Does this pass have to run after scheduling? The way you describe it,
No, I just meant there is more opportunities after regrenaming, and
even more opportunities after sched2, I haven't investigated reason
for the latter one yet,  but this pass doesn't depend on sched2 to
work.

> this sounds like a scheduling problem, where you don't need regrename
> to resolve false dependencies. Your sched2 pass should be able to
> prioritize mergeable loads/stores to schedule them adjacent. Of if you
> must do this before scheduling, then at least do it before sched2. Now
> you're really ruining the order of the scheduled instructions, which
> seems bad.
Yes, I agree it should NOT disturb scheduling results, that's why I
put it before sched2 and after register renaming right now.

>
> I don't see how regrename will help resolve [base+offset] false
> dependencies. Can you explain? I'd expect effects from
> hardreg-copyprop "commoning" a base register.
It's the register operand's false dependency, rather than the base's
one.  Considering below simple case:
    mov r1,  #const1
    store r1, [base+offset1]
    mov r1, #const2
    store r1, [base_offset2]
It should be renamed into:
    mov r1,  #const1
    store r1, [base+offset1]
    mov r2, #const2
    store r2, [base_offset2]
Then caught by the patch.

I will leave other comments for a moment after more discussion here.

Thanks,
bin
>
> ChangeLog is missing the entry for arm.c.
>
> Your pass should make those peephole2's redundant, so you should
> remove the relevant define_peephole2's.
>
>
>>  +   generated by spilling during reigster allocation.  To catch all
>
> s/reigster/register/
>
>
>> +   whose address expression is in the form of "[base_offset]".  It
>
> s/[base_offset]/[base+offset]/
>
>
>> +   only guarantee that the two consecutive memory access instructions
>
> s/guarantee/guarantees/
>
>
>> +   has no data dependency issues.  After a pair is found it calls to
>
> s/has/have/
>
>
>> +/* Maximal number of memory references we support in single isntruction.  */
>
> s/Maximal/Maximum/
> s/isntruction/instruction/
>
>
>> +/* Check instruction recorded in PINSN_INFO to see if it is the
>> +   first instruction of load store pair.  If yes, record the
>> +   information in PPAIR_INFO and return TRUE.  */
>> +
>> +static bool
>> +find_pair_x_insn (struct pair_info_t *ppair_info,
>
> The function description doesn't seem to match the function
> implementation. There is nothing in find_pair_x_insn that looks for a
> load/store pair.
>
>
>> +  /* Don't bother if base is a register and we are modifying it.  */
>> +  if (REG_P (base) && reg_modified_p (base, insn))
>> +    return false;
>
> You can shortcut this if you're looking at a load.
>
>
>> +  for (; *ref_rec; ref_rec++)
>
> Nit: "while (*ref_rec)"
>
>
>> +  /* Can't be paired if memory refs have different mode.  */
>> +  if (GET_MODE (mem) != mode)
>> +    return false;
>
> Not even if same GET_MODE_SIZE?
>
>
>> +  /* Clear PAIR_SINK if the memory unit of the first instruction
>> +     is clobbered between.  */
>> +  if ((pinfo->pair_kind & PAIR_SINK) != 0
>> +      && MEM_CLOBBERED_P (pinfo->mem_clob, MEM_CLOB_SELF))
>> +    pinfo->pair_kind &= (~PAIR_SINK);
>
> This has the smell of architecture-specific behavior. Maybe you can't
> even pair instructions if there is a MEM_CLOB_SELF involved. Likewise
> for some of the checks in valid_insn_pair_p following the one quoted
> above.
>
>
>> +      HOST_WIDE_INT offset_t;
>
> We generally reserve "_t" for types, so I'd recommend a different name
> for this variable.
>
>
>
>> +find_pair_y_insn
>
> You're using the fixed distance here. Can't you instead walk the
> entire basic block once and build dependence chains?
>
> A distance of 4 seems rather arbitrary. Have you tried other maximum
> distances to see how many extra cases you catch/miss?
>
>
>> +  FOR_EACH_BB_FN (bb, cfun)
>> +    {
>> +      if (bb->index < NUM_FIXED_BLOCKS)
>> + continue;
>
> Redundant check. You only run into ENTRY/EXIT if you use
> FOR_ALL_BB_FN. With FOR_EACH_BB you only visit blocks between
> ENTRY/EXIT.
>
>
>> +      for (; insn != NULL_RTX && insn != BB_END (bb); insn = next)
>
> Off-by-one: If you stop at BB_END like this, then you're overlooking
> the last instruction in the basic block for pair matching. Stop at
> NEXT_INSN (BB_END (bb)), like FOR_BB_INSNS.
> (In a perfect world, we'd have proper insn iterators...)
>
> That's all for now! ;-)
>
> Ciao!
> Steven



-- 
Best Regards.

Follow-Ups:
- Re: [GCC RFC]A new and simple pass merging paired load store instructions
  - From: Jeff Law
- Re: [GCC RFC]A new and simple pass merging paired load store instructions
  - From: Mike Stump

References:
- [GCC RFC]A new and simple pass merging paired load store instructions
  - From: bin.cheng
- Re: [GCC RFC]A new and simple pass merging paired load store instructions
  - From: Steven Bosscher

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]