[PATCH/RFC] Add a new memory gathering optimization for loop (PR98598)

Sat May 8 01:49:43 GMT 2021

On Fri, Apr 30, 2021 at 1:20 PM Feng Xue OS via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> >> This patch implements a new loop optimization according to the proposal
> >> in RFC given at
> >> https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html.
> >> So do not repeat the idea in this mail. Hope your comments on it.
> >
> > With the caveat that I'm not an optimization expert (but no one else
> > seems to have replied), here are some thoughts.
> >
> > [...snip...]
> >
> >> Subject: [PATCH 1/3] mgo: Add a new memory gathering optimization for loop
> >>  [PR98598]
> >
> > BTW, did you mean to also post patches 2 and 3?
> >
>
> Not yet, but they are ready. Since this is kind of special optimization that uses
> heap as temporary storage, not a common means in gcc, we do not know
> basic attitude of the community towards it. So only the first patch was sent
> out for initial comments, in that it implements a generic MGO framework, and
> is complete and self-contained. Other 2 patches just composed some
> enhancements for specific code pattern and dynamic alias check. If possible,
> this proposal would be accepted principally, we will submit other 2 for review.
>
> >
> >> In nested loops, if scattered memory accesses inside inner loop remain
> >> unchanged in outer loop, we can sequentialize these loads by caching
> >> their values into a temporary memory region at the first time, and
> >> reuse the caching data in following iterations. This way can improve
> >> efficiency of cpu cache subsystem by reducing its unpredictable activies.
> >
> > I don't think you've cited any performance numbers so far.  Does the
> > optimization show a measurable gain on some benchmark(s)?  e.g. is this
> > ready to run SPEC yet, and how does it do?
>
> Yes, we have done that. Minor improvement about several point percentage
> could gain for some real applications. And to be specific, we also get major
> improvement as more than 30% for certain benchmark in SPEC2017.
Hi Feng Xue,
Could you help point out which bench it is?  I failed to observe
improvement in spec2017 on local x86 machine.  I am running with O3
level.

Thanks,
bin
>
> >
> >> To illustrate what the optimization will do, two pieces of pseudo code,
> >> before and after transformation, are given. Suppose all loads and
> >> "iter_count" are invariant in outer loop.
> >>
> >> From:
> >>
> >>   outer-loop ()
> >>     {
> >>       inner-loop (iter, iter_count)
> >>         {
> >>           Type1 v1 = LOAD (iter);
> >>           Type2 v2 = LOAD (v1);
> >>           Type3 v3 = LOAD (v2);
> >>           ...
> >>           iter = NEXT (iter);
> >>         }
> >>     }
> >>
> >> To:
> >>
> >>   typedef struct cache_elem
> >>     {
> >>       bool   init;
> >>       Type1  c_v1;
> >>       Type2  c_v2;
> >>       Type3  c_v3;
> >
> > Putting the "bool init;" at the front made me think "what about
> > packing?" but presumably the idea is that every element is accessed in
> > order, so it presumably benefits speed to have "init" at the top of the
> > element, right?
>
> Yes, layout of the struct layout could be optimized in terms of size by
> some means, such as:
>   o. packing "init" into a padding hole after certain field
>   o. if certain field is a pointer type, the field can take the role of "init"
>       (Non-NULL implies "initialized")
> Now this simple scheme is straightforward, and would be enhanced
> in various aspects later.
>
> >>     } cache_elem;
> >>
> >>   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));
>
> > What if the allocation fails at runtime?  Do we keep an unoptimized
> > copy of the nested loops around as a fallback and have an unlikely
> > branch to that copy?
>
> Yes, we should. But in a different way, a flag is added into original
> nested loop to control runtime switch between optimized and
> unoptimized execution. This definitely incurs runtime cost, but
> avoid possible code size bloating. A better handling, as a TODO is
> to apply dynamic-switch for large loop, and loop-clone for small one.
>
> > I notice that you're using calloc, presumably to clear all of the
> > "init" flags (and the whole buffer).
> >
> > FWIW, this feels like a case where it would be nice to have a thread-
> > local heap allocation, perhaps something like an obstack implemented in
> > the standard library - but that's obviously scope creep for this.
>
> Yes, that's good, specially for many-thread application.
>
> > Could it make sense to use alloca for small allocations?  (or is that
> > scope creep?)
>
> We did consider using alloca as you said.  But if we could not determine
> up limit for a non-constant size, we have to place alloca inside a loop that
> encloses the nested loop. Without a corresponding free operation, this
> kind of alloca-in-loop might cause stack overflow. So it becomes another
> TODO.
>
> >>   outer-loop ()
> >>     {
> >>       size_t cache_idx = 0;
> >>
> >>       inner-loop (iter, iter_count)
> >>         {
> >>           if (!cache_arr[cache_idx]->init)
> >>             {
> >>               v1 = LOAD (iter);
> >>               v2 = LOAD (v1);
> >>               v3 = LOAD (v2);
> >>
> >>               cache_arr[cache_idx]->init = true;
> >>               cache_arr[cache_idx]->c_v1 = v1;
> >>               cache_arr[cache_idx]->c_v2 = v2;
> >>               cache_arr[cache_idx]->c_v3 = v3;
> >>             }
> >>         else
> >>             {
> >>               v1 = cache_arr[cache_idx]->c_v1;
> >>               v2 = cache_arr[cache_idx]->c_v2;
> >>               v3 = cache_arr[cache_idx]->c_v3;
> >>             }
> >>           ...
> >>           cache_idx++;
> >>           iter = NEXT (iter);
> >>         }
> >>     }
> >>
> >>   free (cache_arr);
> >
> > I see that the pass puts a "free" on every CFG exit from the loop.
> >
> > What about "longjmp" and C++ exceptions?  Are we guaranteed that "free"
> > will happen for those ways of exiting the loop?  (should have test
> > cases for this, I think)
> >
>
> We simply disable the optimization if a loop contains an abnormal or
> eh exit.
>
> > [...]
> >
> >
> >> 2020-12-25  Feng Xue  <fxue@os.amperecomputing.com>
> >>
> >> gcc/
> >>       PR tree-optimization/98598
> >>       * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
> >>       * common.opt (-ftree-loop-mgo): New option.
> >>       * opts.c (default_options_table): Enable -ftree-loop-mgo at -O3+.
> >>       * params.opt (mgo-max-dep-load-level): New parameter.
> >>       (mgo-max-cache-elem-size): Likewise.
> >>       (mgo-min-cache-array-length): Likewise.
> >>       (mgo-max-cache-array-length): Likewise.
> >>       * doc/invoke.texi (mgo-max-dep-load-level): Document new parameter.
> >>       (mgo-max-cache-elem-size): Likewise.
> >>       (mgo-min-cache-array-length): Likewise.
> >>       (mgo-max-cache-array-length): Likewise.
> >
> > Nit: also need to document "-ftree-loop-mgo".
>
> OK.
>
> >
> >>       * passes.def: Add new pass_loop_mgo pass.
> >>       * timevar.def (TV_LOOP_MGO): New timevar.
> >>       * tree-pass.h (make_pass_loop_mgo): New declaration.
> >>       * tree-ssa-loop-mgo.c: New file.
> >
> > Nit: new C++ source files should have a .cc extension, rather than .c
>
> OK.
>
> >
> >> gcc/testsuite/
> >>       PR tree-optimization/98598
> >>       * gcc.dg/tree-ssa/mgo/mgo.exp: New file.
> >>       * gcc.dg/tree-ssa/mgo/mgo-common.h: New test header.
> >>       * gcc.dg/tree-ssa/mgo/list/list-1.c: New test.
> >>       * gcc.dg/tree-ssa/mgo/array/simple-1.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/simple-2.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/simple-3.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/param-1.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/param-2.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/param-3.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/dep-load-1.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/dep-load-2.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/dep-load-3.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/dep-load-4.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/dep-load-5.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/dep-load-6.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/dep-load-7.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/dep-load-8.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/dep-load-9.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/dep-load-10.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/indx-iter-1.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/indx-iter-2.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/indx-iter-3.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/indx-iter-4.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/indx-iter-5.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/indx-iter-6.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/outer-loop-1.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/outer-loop-2.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/outer-loop-3.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/outer-loop-4.c: Likewise.
> >>       * gcc.dg/tree-ssa/mgo/array/outer-loop-5.c: Likewise.
> >
> > [...]
> >
> > Presumably the optimization can't happen if the underlying data gets
> > modified during the iteration.
>
> Yes. Any data to be gathered should be read-only in the loop.
>
> > I briefly looked through the test cases, but I didn't see any/(much?)
> > test coverage for loops that write back to the data, or call functions
> > that might do that.  Sorry if I missed some, but clearly we need to
> > test that we don't incorrectly optimize those cases.
>
> If we could not statically deduce whether data might be written, we
> add dynamic alias-check to detect write hazard, which has been
> implemented in other patch. Those test cases covering data-write
> and function call are also included in it.
>
> > What happens in a multithreaded program in which another thread might
> > be writing to the data?  What are we allowed to optimize?
>
> For data access in multi-thread, we assume programmer know how to
> avoid data race using synchronization primitive call or  "volatile" specifier
> to annotate related data. If non-pure call or "volatile" memory access
> is found, the optimization will be suppressed.
>
> > Are there some C++ examples?  Fortran?
>
> OK. Will add some.
>
> > It would be great to be able to inject calloc failures when testing
> > this, but obviously that's out of scope for such a patch.
>
> OK. This is a TODO.
>
> > Hope this is constructive
>
> It definitely is, thanks for your comments.
>
> Feng