This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
- From: Bill Schmidt <wschmidt at linux dot vnet dot ibm dot com>
- To: Richard Biener <richard dot guenther at gmail dot com>
- Cc: Yufeng Zhang <Yufeng dot Zhang at arm dot com>, Jeff Law <law at redhat dot com>, "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>
- Date: Tue, 03 Dec 2013 16:04:13 -0600
- Subject: Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
- Authentication-results: sourceware.org; auth=none
- References: <5277EA58 dot 5020303 at arm dot com> <1384189786 dot 8213 dot 28 dot camel at gnopaine> <5282ACE5 dot 8020304 at arm dot com> <1384365888 dot 8213 dot 65 dot camel at gnopaine> <5283D3B5 dot 1040300 at arm dot com> <1384373498 dot 8213 dot 76 dot camel at gnopaine> <1384376098 dot 8213 dot 94 dot camel at gnopaine> <52840A69 dot 1020308 at arm dot com> <CAFiYyc189fgJ6Ytov4xfi6tyAacKrhcivYdY38O+M66jTAvbQA at mail dot gmail dot com> <5294B7F4 dot 4020106 at arm dot com> <529CABA5 dot 1040003 at arm dot com> <529D7ED6 dot 6060408 at redhat dot com> <529DD39F dot 9080405 at arm dot com> <CAFiYyc28DNZRn2=LbjYfPbmOxxQy1q6nKrtz-x9LPfXv_j8VCg at mail dot gmail dot com> <529DFE32 dot 7040600 at arm dot com> <c34076fc-d283-44f7-99fa-c678b5dda692 at email dot android dot com>
On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
> Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
> >On 12/03/13 14:20, Richard Biener wrote:
> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
> >wrote:
> >>> On 12/03/13 06:48, Jeff Law wrote:
> >>>>
> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
> >>>>>
> >>>>> Ping~
> >>>>>
> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
> >>>>
> >>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Yufeng
> >>>>>
> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
> >>>>>>
> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
> >>>>>>>
> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
> >>>>>>> Zhang<Yufeng.Zhang@arm.com> wrote:
> >>>>>>>>
> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
> >>>>>>>>>
> >>>>>>>>> The second version of your original patch is ok with me with
> >the
> >>>>>>>>> following changes. Sorry for the little side adventure into
> >the
> >>>>>>>>> next-interp logic; in the end that's going to hurt more than
> >it
> >>>>>>>>> helps in
> >>>>>>>>> this case. Thanks for having a look at it, anyway. Thanks
> >also for
> >>>>>>>>> cleaning up this version to be less intrusive to common
> >interfaces; I
> >>>>>>>>> appreciate it.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks a lot for the review. I've attached an updated patch
> >with the
> >>>>>>>> suggested changes incorporated.
> >>>>>>>>
> >>>>>>>> For the next-interp adventure, I was quite happy to do the
> >>>>>>>> experiment; it's
> >>>>>>>> a good chance of gaining insight into the pass. Many thanks
> >for
> >>>>>>>> your prompt
> >>>>>>>> replies and patience in guiding!
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Everything else looks OK to me. Please ask Richard for final
> >>>>>>>>> approval,
> >>>>>>>>> as I'm not a maintainer.
> >>>>
> >>>> First a note, I need to check on voting for Bill as the slsr
> >maintainer
> >>>> from the steering committee. Voting was in progress just before
> >the
> >>>> close of stage1 development so I haven't tallied the results :-)
> >>>
> >>>
> >>> Looking forward to some good news! :)
> >>>
> >>>
> >>>>>>
> >>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
> >shared.
> >>>>>> The cached is introduced mainly because get_alternative_base
> >() may
> >>>>>> be
> >>>>>> called twice on the same 'base' tree, once in the
> >>>>>> find_basis_for_candidate () for look-up and the other time in
> >>>>>> alloc_cand_and_find_basis () for record_potential_basis (). I'm
> >happy
> >>>>>> to leave out the cache if you think the benefit is trivial.
> >>>>
> >>>> Without some sense of how expensive the lookups are vs how often
> >the
> >>>> cache hits it's awful hard to know if the cache is worth it.
> >>>>
> >>>> I'd say take it out unless you have some sense it's really saving
> >time.
> >>>> It's a pretty minor implementation detail either way.
> >>>
> >>>
> >>> I think the affine tree routines are generally expensive; it is
> >worth having
> >>> a cache to avoid calling them too many times. I run the slsr-*.c
> >tests
> >>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
> >from
> >>> 55.6% to 90%, with 73.5% as the average. The samples may not well
> >represent
> >>> the real world scenario, but they do show the fact that the 'base'
> >tree can
> >>> be shared to some extent. So I'd like to have the cache in the
> >patch.
> >>>
> >>>
> >>>>
> >>>>>>
> >>>>>>> +/* { dg-do compile } */
> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> >>>>>>> +
> >>>>>>> +typedef int arr_2[50][50];
> >>>>>>> +
> >>>>>>> +void foo (arr_2 a2, int v1)
> >>>>>>> +{
> >>>>>>> + int i, j;
> >>>>>>> +
> >>>>>>> + i = v1 + 5;
> >>>>>>> + j = i;
> >>>>>>> + a2 [i-10] [j] = 2;
> >>>>>>> + a2 [i] [j++] = i;
> >>>>>>> + a2 [i+20] [j++] = i;
> >>>>>>> + a2 [i-3] [i-1] += 1;
> >>>>>>> + return;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
> >>>>>>>
> >>>>>>> scanning for 5 MEMs looks non-sensical. What transform do
> >>>>>>> you expect? I see other slsr testcases do similar non-sensical
> >>>>>>> checking which is bad, too.
> >>>>>>
> >>>>>>
> >>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
> >to
> >>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
> >MEM_REFs
> >>>>>> is an effective check. Alternatively, I can add a follow-up
> >patch to
> >>>>>> add some dumping facility in replace_ref () to print out the
> >replacing
> >>>>>> actions when -fdump-tree-slsr-details is on.
> >>>>
> >>>> I think adding some details to the dump and scanning for them would
> >be
> >>>> better. That's the only change that is required for this to move
> >forward.
> >>>
> >>>
> >>> I've updated to patch to dump more details when
> >-fdump-tree-slsr-details is
> >>> on. The tests have also been updated to scan for these new dumps
> >instead of
> >>> MEMs.
> >>>
> >>>
> >>>>
> >>>> I suggest doing it quickly. We're well past stage1 close at this
> >point.
> >>>
> >>>
> >>> The bootstrapping on x86_64 is still running. OK to commit if it
> >succeeds?
> >>
> >> I still don't like it. It's using the wrong and too expensive tools
> >to do
> >> stuff. What kind of bases are we ultimately interested in? Browsing
> >> the code it looks like we're having
> >>
> >> /* Base expression for the chain of candidates: often, but not
> >> always, an SSA name. */
> >> tree base_expr;
> >>
> >> which isn't really too informative but I suppose they are all
> >> kind-of-gimple_val()s? That said, I wonder if you can simply
> >> use get_addr_base_and_unit_offset in place of get_alternative_base
> >(),
> >> ignoring the returned offset.
> >
> >'base_expr' is essentially the base address of a handled_component_p,
> >e.g. ARRAY_REF, COMPONENT_REF, etc. In most case, it is the address of
> >
> >the object returned by get_inner_reference ().
> >
> >Given a test case like the following:
> >
> >typedef int arr_2[20][20];
> >
> >void foo (arr_2 a2, int i, int j)
> >{
> > a2[i+10][j] = 1;
> > a2[i+10][j+1] = 1;
> > a2[i+20][j] = 1;
> >}
> >
> >The IR before SLSR is (on x86_64):
> >
> > _2 = (long unsigned int) i_1(D);
> > _3 = _2 * 80;
> > _4 = _3 + 800;
> > _6 = a2_5(D) + _4;
> > *_6[j_8(D)] = 1;
> > _10 = j_8(D) + 1;
> > *_6[_10] = 1;
> > _12 = _3 + 1600;
> > _13 = a2_5(D) + _12;
> > *_13[j_8(D)] = 1;
> >
> >The base_expr for the 1st and 2nd memory reference are the same, i.e.
> >_6, while the base_expr for a2[i+20][j] is _13.
> >
> >_13 is essentially (_6 + 800), so all of the three memory references
> >essentially share the same base address. As their strides are also the
> >
> >same (MULT_EXPR (j, 4)), the three references can all be lowered to
> >MEM_REFs. What this patch does is to use the tree affine tools to help
> >
> >recognize the underlying base address expression; as it requires
> >looking
> >into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
> >won't help here.
> >
> >Bill has helped me exploit other ways of achieving this in SLSR, but so
> >
> >far we think this is the best way to proceed. The use of tree affine
> >routines has been restricted to CAND_REFs only and there is the
> >aforementioned cache facility to help reduce the overhead.
> >
> >Thanks,
> >Yufeng
> >
> >P.S. some more details what the patch does:
> >
> >The CAND_REF for the three memory references are:
> >
> > 6 [2] *_6[j_8(D)] = 1;
> > REF : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> > basis: 0 dependent: 8 sibling: 0
> > next-interp: 0 dead-savings: 0
> >
> > 8 [2] *_6[_10] = 1;
> > REF : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
> > basis: 6 dependent: 11 sibling: 0
> > next-interp: 0 dead-savings: 0
> >
> > 11 [2] *_13[j_8(D)] = 1;
> > REF : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> > basis: 8 dependent: 0 sibling: 0
> > next-interp: 0 dead-savings: 0
> >
> >Before the patch, the strength reduction candidate chains for the three
> >
> >CAND_REFs are:
> >
> > _6 -> 6 -> 8
> > _13 -> 11
> >
> >i.e. SLSR recognizes the first two references share the same basis,
> >while the last one is on it own.
> >
> >With the patch, an extra candidate chain can be recognized:
> >
> > a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
> >
> >i.e. all of the three references are found to have the same basis
> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
> >_6
> >or _13, with the immediate offset removed. The pass is now able to
> >lower all of the three references, instead of the first two only, to
> >MEM_REFs.
>
> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)
I think that's overstating SLSR's current capabilities a bit. :) We do
use get_inner_reference to come up with the base expression for
reference candidates (based on some of your suggestions a couple of
years back). However, in the case of multiple levels of array
references, we miss opportunities because get_inner_reference stops at
an SSA name that could be further expanded by following its definition
back to a more fundamental base expression.
Part of the issue here is that reference candidates are basis for a more
specific optimization than the mult and add candidates. The latter have
a more general framework for building up a recording of simple affine
expressions that can be strength-reduced. Ultimately we ought to be
able to do something similar for reference candidates, building up
simple affine expressions from base expressions, so that everything is
done in a forward order and the tree-affine interfaces aren't needed.
But that will take some more fundamental design changes, and since this
provides some good improvements for important cases, I feel it's
reasonable to get this into the release.
Thanks,
Bill
>
> Richard.
>
>