This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH] Fix PRE of TARGET_MEM_REF


Richard Guenther <rguenther@suse.de> wrote on 31/05/2009 21:32:06:

> On Sun, 31 May 2009, Dorit Nuzman wrote:
>
> > > On Tue, May 26, 2009 at 1:12 PM, Revital1 Eres <ERES@il.ibm.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I wonder if there is any objection to also schedule predictive commoning
> > > > after the vectorizer, like in the following patch.
> > > > Scheduling predcom and PRE passes after the vectorizer could help to solve
> > > > PR39300, among others.
> > > > I am planning to do SPEC runs with this patch for testing.
> > >
> >
> > great. this is the main thing that has held back scheduling predcom after
> > vectorization
> >
> > > What we should do for predictive commoning is to run its analysis
> > > phase before (or inside?) the vectorizer and just handle it in the
> > > vectorizer cost model.  I expect that for small vector sizes
> > > (for example v2df) predictive commoning is often more effective
> > > than vectorization, especially if the loop is memory bound.
> > >
> >
> > I don't think it's the vectorizer's job to evaluate the impact of
> > predictive commoning or any other optimization... hopefully if loop-count
> > is too small the cost model will figure out that we shouldn't vectorize
> > regardless of other optimization alternatives. Also, we could try to
> > vectorize efficiently, taking advantage of the data reuse while we
> > vectorize (we do it in other situations in the vectorizer).
>

(sorry for slow response - on a business trip)

> Well, I think we should have a centralized cost-model for the various
> loop optimizations.

Yes, I agree, and I think the natural place for such a centralized cost
model for loop transformations is Graphite... We have already incorporated
vectorizer-related cost model considerations into Graphite, and Konrad is
working on turning the initial implementation into a presentable patch.

> Running
>
>   predcom-analysis
>   vectorizer-analysis
>   cost-model
>   vectorizer / predcom
>
> is one way to make sure we do not disable predictive commoning
> by vectorizing or the other way around if the other transformation
> is more profitable.
>

I agree in principle, but I think you may not be taking into account that
we could generate efficient vectorized code for loops that are also
candidates for predcom. This can be done either by detecting the data reuse
during vectorization (which is something we already do in the vectorizer
anyhow, in the context of efficient realignment) or by extending other
passes to also optimize vectorized code. Here's a quick sketch of a
simplified example:

(1) original:
    for (i=0; i<n; i++)
      c[i] = a[i] * a[i+2]

(2) after predcom (assuming n is even):
    a0 = a[0]
    a1 = a[1]
    for (i=0; i<n; i+=2){
      a2 = a[i+2]
      a3 = a[i+3]
      c[i] = a0 * a2
      c[i+1] = a1 * a3
      a0 = a2
      a1 = a3
    }

(3) naive vectorization of (1):
      - for simplicity, assuming a,c are aligned, VF=4 and n%4=0
      - each dataref handled independently (so no reuse across DRs)
      - realignment optimized for each DR independently

    vector vpa = a;
    vtmp0 = vload(vpa)
    for (i=0; i<n; i+=4){
      vtmp1 = vload(vpa+4)
      v0 = vload(vpa)   //redundant
      v2 = realign(vtmp0,vtmp1,2)
      vc = v0 * v2
      *vpc++ = vc
      vpa += 4
      vtmp0 = vtmp1
    }

(4) optimized vectorization of (1) (i.e. reusing loads across DRs):
    vector vpa = a;
    vtmp0 = vload(vpa)
    for (i=0; i<n; i+=4){
      vtmp1 = vload(vpa+4)
      v0 = vtmp0;     //reused
      v2 = realign(vtmp0,vtmp1,2)
      vc = v0 * v2
      *vpc++ = vc
      vpa += 4
      vtmp0 = vtmp1
    }

So in (4) you have one vector load in each vector iteration... I think it's
easier to get to (4) from (1) (original) than from (2) (transformed by
predcom). Vectorizing after predcom basically means extra analysis to
detect the pattern and undo it. It's easier to have all DRs explicitly in
the loop, without the loop-carried dependence, and to detect reuse between
them.

I don't see what we gain if predcom is scheduled before vectorization, but
I do see advantages the other way around (simpler vectorization, potential
to optimize vectorized code). I think our first step should be to (1) move
predcom to after vectorization, and our immediate next steps should be to
(2) enhance the vectorizer to detect this reuse, and to (3) work on the
Graphite-based centralized cost-model for loop optimizations (in the works
for a subset of optimizations that include vectorization), and in the
future also (4) enhance predcom to work on vectorized code.

dorit

> After some more thoughts I would consider doing PRE before ivopts,
> because the extra redundancy removal should improve its decisions.
>
> We should probably do PRE on scalar code parts (simply not
> phi-translate over back-edges) early, during the first FRE run.
>
> Richard.

