[PATCH][v6] GIMPLE store merging pass

Mon Oct 24 09:41:00 GMT 2016

On 21/10/16 19:37, Richard Biener wrote:
> On October 21, 2016 3:29:15 PM GMT+02:00, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
>> Hi Richard,
>>
>> On 21/10/16 13:37, Richard Biener wrote:
>>> On Tue, 18 Oct 2016, Kyrill Tkachov wrote:
>>>
>>>> Hi Richard,
>>>>
>>>> This patch is a merge of [1] and [2] and implements the manual
>> merging of
>>>> bitfields
>>>> as outlined in [1] but actually makes it work on BYTES_BIG_ENDIAN
>> too.
>>>> It caused me a lot of headeache because the bit offset is counted
> >from the
>>>> most significant bit
>>>> in the byte, even though BITS_BIG_ENDIAN was 0 (BITS_BIG_ENDIAN
>> looks
>>>> irrelevant for store merging
>>>> anyway as it's just used to described RTL extract operations).
>>>> I've included ASCII diagrams of the steps in the merging algorithm.
>>> Heh, thanks.
>>>
>>>> Bootstrapped and tested on arm-none-linux-gnueabihf,
>> aarch64-none-linux-gnu,
>>>> x86_64-unknown-linux-gnu.
>>>> Also tested on aarch64_be-none-elf.
>>>>
>>>> How does this version look now?
>>> Mostly good.  For
>>>
>>> +bool
>>> +pass_store_merging::terminate_all_aliasing_chains (tree dest, tree
>> base,
>>> +                                                  gimple *stmt)
>>> +{
>>> ...
>>> +  /* Check if the assignment destination (BASE) is part of a store
>> chain.
>>> +     This is to catch non-constant stores to destinations that may
>> be
>>> part
>>> +     of a chain.  */
>>> +  if (base)
>>> +    {
>>> +      chain_info = m_stores.get (base);
>>> +      if (chain_info)
>>> +       {
>>> +         struct store_immediate_info *info;
>>> +         unsigned int i;
>>> +         FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
>>> +           {
>>> +             if (refs_may_alias_p (info->dest, dest))
>>> +               {
>>>
>>> I suppose the chain is not yet sorted in any way?
>>>
>>> At least for 'dest' which do not have a known constant offset we
>>> could do
>>>
>>>      if (base)
>>>        terminate_and_release_chain (base);
>> Do you mean when get_inner_reference returns non-NULL for POFFSET?
> Yes.
>
>> Or do you think we should try to look into dest in this function?
>>
>>> to speed things up?  IIRC we do not terminate chains early in
>>> this phase when we have enough stores to form a group, so
>>> writing a testcase that triggers quadraticness would be as simple
>>> as having
>>>
>>> char a[1000000];
>>>
>>> void foo ()
>>> {
>>>    a[0] = 1;
>>>    a[1] = 2;
>>>    ....
>>>    a[9999999] = 3;
>>> }
>>>
>>> ?
>>>
>>> so I think you probably want to limit the number of stores you
>>> ever put onto a chain and if you reach that limit, terminate
>>> and release it?  Like just choose 16 or 64?  (and experiment
>>> with the above kind of testcases)
>> I was initially thinking of imposing such a limit as well but
>> later I thought we'd want to extend the output code to be able to emit
>> a memcpy (or memset) call for large regions, so detecting the largest
>> possible
>> regions would be needed.
> But then we need a better data structure here to avoid the quadraticness.

For the testcase:

char a[1000000];

void foo ()
{
   a[0] = 1;
   a[1] = 2;
   ....
   a[9999999] = 3;
}

and similar we don't have quadratic behaviour because constant stores to a constant offset
from base (i.e. stores valid for merging) get automatically inserted into the chain without
doing the alias checking (we keep track of their program order so that overlapping stores
are merged properly). Since pushing into a vec is not a linear operation we're fine.

Once I teach it to quickly terminate chains when encountering stores to base with a variable
offset (like you suggest above) there is only one case of quadraticness:
We walk the stores in the chain (causing quadratic behaviour) when among
the valid mergeable stores we have interspersed stores to a constant offset from base but that
do not store a constant value e.g. a[5] = x;
So the input needs to be something like:

char a[16000];

void
foo (char x)
{
   a[0] = 1; a[8000 + 0] = x;
   a[1] = 1; a[8000 + 1] = x;
   a[2] = 1; a[8000 + 2] = x;
   a[3] = 1; a[8000 + 3] = x;
   a[4] = 1; a[8000 + 4] = x;
   ...
   a[7999] = 1; a[8000 + 7999] = x;
}

I've reproduced a slowdown (with store merging taking a disproportional
time in -ftime-report) with this case and I've verified that limiting the
maximum number of statements in the chain fixes it.

>
>   But that is not implemented yet (though I have
>> experimented
>> with it) so I can add a limit here. Should I just hardcode a limit or
>> should I make it
>> into a --param (MAX_STMTS_IN_STORE_MERGING_CHAIN or something)?
> A param is preferred.

Thanks. I'll introduce the limit but, as mentioned above, the simple case
of inserting multiple stores is not quadratic.

Kyrill

>
>>> +                 bit_off = byte_off << LOG2_BITS_PER_UNIT;
>>> +                 if (!wi::neg_p (bit_off) && wi::fits_shwi_p
>> (bit_off))
>>> +                   {
>>> +                     bitpos += bit_off.to_shwi ();
>>> +
>>>
>>> I think you want bit_off += bitpos before the fits_shwi check
>>> otherwise this add may still overflow.
>>>
>>> +                     base_addr = copy_node (base_addr);
>>> +                     TREE_OPERAND (base_addr, 1)
>>> +                       = build_zero_cst (TREE_TYPE (TREE_OPERAND (
>>> +                                                      base_addr,
>> 1)));
>>> I'd prefer
>>>
>>>         base_addr = build2 (MEM_REF, ...);
>>>
>>> here.
>> Thanks for the feedback,
>> Kyrill
>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> Kyrill
>>>>
>>>> [1] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg00573.html
>>>> [2] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg00572.html
>>>>
>>>> 2016-10-18  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>
>>>>       PR middle-end/22141
>>>>       * Makefile.in (OBJS): Add gimple-ssa-store-merging.o.
>>>>       * common.opt (fstore-merging): New Optimization option.
>>>>       * opts.c (default_options_table): Add entry for
>>>>       OPT_ftree_store_merging.
>>>>       * fold-const.h (can_native_encode_type_p): Declare prototype.
>>>>       * fold-const.c (can_native_encode_type_p): Define.
>>>>       * params.def (PARAM_STORE_MERGING_ALLOW_UNALIGNED): Define.
>>>>       * passes.def: Insert pass_tree_store_merging.
>>>>       * tree-pass.h (make_pass_store_merging): Declare extern
>>>>       prototype.
>>>>       * gimple-ssa-store-merging.c: New file.
>>>>       * doc/invoke.texi (Optimization Options): Document
>>>>       -fstore-merging.
>>>>
>>>> 2016-10-18  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>               Jakub Jelinek  <jakub@redhat.com>
>>>>               Andrew Pinski  <pinskia@gmail.com>
>>>>
>>>>       PR middle-end/22141
>>>>       PR rtl-optimization/23684
>>>>       * gcc.c-torture/execute/pr22141-1.c: New test.
>>>>       * gcc.c-torture/execute/pr22141-2.c: Likewise.
>>>>       * gcc.target/aarch64/ldp_stp_1.c: Adjust for -fstore-merging.
>>>>       * gcc.target/aarch64/ldp_stp_4.c: Likewise.
>>>>       * gcc.dg/store_merging_1.c: New test.
>>>>       * gcc.dg/store_merging_2.c: Likewise.
>>>>       * gcc.dg/store_merging_3.c: Likewise.
>>>>       * gcc.dg/store_merging_4.c: Likewise.
>>>>       * gcc.dg/store_merging_5.c: Likewise.
>>>>       * gcc.dg/store_merging_6.c: Likewise.
>>>>       * gcc.dg/store_merging_7.c: Likewise.
>>>>       * gcc.target/i386/pr22141.c: Likewise.
>>>>       * gcc.target/i386/pr34012.c: Add -fno-store-merging to
>> dg-options.
>>>>       * g++.dg/init/new17.C: Likewise.
>>>>
>