This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: [RFC] Make vectorizer to skip loops with small iteration estimate

From: Richard Guenther <rguenther at suse dot de>
To: Jan Hubicka <hubicka at ucw dot cz>
Cc: gcc-patches at gcc dot gnu dot org
Date: Tue, 2 Oct 2012 11:07:52 +0200 (CEST)
Subject: Re: [RFC] Make vectorizer to skip loops with small iteration estimate
References: <20120930111610.GA7097@kam.mff.cuni.cz> <alpine.LNX.2.00.1210011409560.4063@zhemvz.fhfr.qr> <20121001175757.GB11298@kam.mff.cuni.cz>

On Mon, 1 Oct 2012, Jan Hubicka wrote:

> > > 
> > >      So the unvectorized cost is
> > >      SIC * niters
> > > 
> > >      The vectorized path is
> > >      SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
> > >      The scalar path of vectorizer loop is
> > >      SIC * niters + SOC
> > 
> > Note that 'th' is used for the runtime profitability check which is
> > done at the time the setup cost has already been taken (yes, we
> 
> Yes, I understand that.
> > probably should make it more conservative but then guard the whole
> > set of loops by the check, not only the vectorized path).
> > See PR53355 for the general issue.
> 
> Yep, we may reduce the cost of SOC by emitting an early guard for the
> non-vectorized path better than we do now. However...
> > >    Of course this is a very simple benchmark; in reality the vectorization can be
> > >    a lot more harmful by complicating more complex control flows.
> > >
> > >    So I guess we have two options
> > >     1) go with the new formula and try to make cost model a bit more realistic.
> > >     2) stay with original formula that is quite close to reality, but I think
> > >        more by an accident.
> > 
> > I think we need to improve it as whole, thus I'd prefer 2).
> 
> ... I do not see why.
> Even if we make the check cheaper we will only distribute part of SOC to vector
> prologues/epilogues.
> 
> Still I think the formula is wrong, i.e. it accounts for SOC where it should not.
> 
> The cost of scalar path without vectorization is 
>   niters * SIC
> while with vectorization we have scalar path
>   niters * SIC + SOC
> and vector path
>   SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
> 
> So SOC cancels out in the runtime check.
> I still think we need two formulas - one determining if vectorization is
> profitable, other specifying the threshold for scalar path at runtime (that
> will generally give lower values).

True, we want two values.  But part of the scalar path right now
is all the computation required for alias and alignment runtime checks
(because of the way all the conditions are combined).

I'm not much into the details of what we account for in SOC (I suppose
it's everything we insert in the preheader of the vector loop).
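
To make the "two values" point concrete, here is a minimal C sketch of the three cost formulas quoted above and the two thresholds they imply. The struct, the function names and the linear search are illustrative assumptions, not code from the vectorizer; the formulas are taken as written in the thread (with the vector iteration count clamped at zero), so they ignore any scalar cost of prologue and epilogue iterations.

```c
#include <assert.h>

/* Hypothetical container for the cost terms named in the thread. */
struct vect_costs {
  long sic;      /* SIC: cost of one scalar iteration */
  long vic;      /* VIC: cost of one vector iteration */
  long soc;      /* SOC: overhead paid on both paths once the loop is versioned */
  long voc;      /* VOC: overhead paid only on the vector path */
  long pl_iters; /* PL_ITERS: prologue iterations */
  long ep_iters; /* EP_ITERS: epilogue iterations */
  long vf;       /* VF: vectorization factor */
};

/* Loop left alone: SIC * niters.  */
static long cost_unvectorized(const struct vect_costs *c, long niters) {
  return c->sic * niters;
}

/* Scalar fallback of a vectorized loop: SIC * niters + SOC.  */
static long cost_scalar_path(const struct vect_costs *c, long niters) {
  return c->sic * niters + c->soc;
}

/* Vector path: SOC + VIC * ((niters - PL_ITERS - EP_ITERS) / VF) + VOC.  */
static long cost_vector_path(const struct vect_costs *c, long niters) {
  assert(c->vf > 0);
  long main_iters = niters - c->pl_iters - c->ep_iters;
  if (main_iters < 0)
    main_iters = 0;
  return c->soc + c->vic * (main_iters / c->vf) + c->voc;
}

/* Smallest niters at which the vector path beats the scalar fallback.
   SOC appears on both sides, so it cancels.  Returns -1 if none <= limit.  */
static long runtime_threshold(const struct vect_costs *c, long limit) {
  for (long n = 0; n <= limit; n++)
    if (cost_vector_path(c, n) < cost_scalar_path(c, n))
      return n;
  return -1;
}

/* Smallest niters at which vectorizing beats not vectorizing at all.
   Here SOC does not cancel.  Returns -1 if none <= limit.  */
static long profitability_threshold(const struct vect_costs *c, long limit) {
  for (long n = 0; n <= limit; n++)
    if (cost_vector_path(c, n) < cost_unvectorized(c, n))
      return n;
  return -1;
}
```

With made-up costs such as SIC=4, VIC=6, SOC=10, VOC=20, VF=4 and no prologue or epilogue, the runtime threshold comes out lower than the profitability threshold, which is the asymmetry discussed above.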

+      if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
+        fprintf (vect_dump, "not vectorized: estimated iteration count too small.");
+      if (vect_print_dump_info (REPORT_DETAILS))
+        fprintf (vect_dump, "not vectorized: estimated iteration count smaller than "
+                 "user specified loop bound parameter or minimum "
+                 "profitable iterations (whichever is more conservative).");

this won't work anymore btw - dumping infrastructure changed.

I suppose your patch is a step in the right direction, but to really
make progress we need to re-organize the loop and predicate structure
produced by the vectorizer.

So, please update your patch, re-test and then it's ok.

> > > 2) Even when a loop iterates 2 times, it is estimated at 4 iterations by
> > >    estimated_stmt_executions_int with the profile feedback.
> > >    The reason is the loop_ch pass.  Given a rolled loop with exit probability
> > >    30%, it proceeds by duplicating the header with the original probabilities.
> > >    This makes the loop execute with 60% probability.  Because the
> > >    loop body counts remain the same (and they should), the expected number
> > >    of iterations increases as the count of the entry edge to the header decreases.
> > > 
> > >    I wonder what to do about this.  Obviously without path profiling
> > >    loop_ch cannot really do a good job.  We can artificially make the
> > >    header succeed more often, which is the reality, but that requires
> > >    non-trivial loop profile updating.
> > > 
> > >    We can also simply record the iteration bound into loop structure 
> > >    and ignore that the profile is not realistic
> > 
> > But we don't preserve loop structure from header copying ...
> 
> From what point do we keep the loop structure? In general I would like to
> eventually drop value histograms to the loop structure, specifying the number
> of iterations with profile feedback.

We preserve it from the start of the tree loop optimizers (it's easy
to preserve them from earlier points as long as you don't cross inlining,
but to lower the impact of the change I placed it where it was enough
to prevent the excessive unrolling/peeling done by RTL)

> > 
> > >    Finally we can duplicate loop headers before profiling.  I implemented
> > >    that via early_ch pass executed only with profile generation or feedback.
> > >    I guess it makes sense to do, even if it breaks the assumption that
> > >    we should do strictly -Os generation on paths where
> > 
> > Well, there are CH cases that do not increase code size and I doubt
> > that loop header copying is generally bad for -Os ... we are not
> > good at handling non-copied loop headers.
> 
> There is comment saying 
>   /* Loop header copying usually increases size of the code.  This used not to
>      be true, since quite often it is possible to verify that the condition is
>      satisfied in the first iteration and therefore to eliminate it.  Jump
>      threading handles these cases now.  */
>   if (optimize_loop_for_size_p (loop))
>     return false;
> 
> I am not sure how much backing it has. Scheduling loop_ch as part of the early
> passes just after the profile pass makes optimize_loop_for_size_p return true
> even for functions that are later found cold by profile feedback.  I do not see
> that being a big issue.

The point is that jump threading is pretty late as well.

> I tested enabling loop_ch in early passes with -fprofile-feedback and it is SPEC
> neutral.  Given that it improves loop count estimates, I would still like mainline
> to do that.  I do not like these quite important estimates being wrong most of the time.

I agree.  It also helps getting rid of once rolling loops I think.
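
The estimate skew discussed in this sub-thread can be illustrated with a small C sketch. It assumes, as a simplification of what the quoted text describes, that the iteration estimate is the loop body count divided by the count of the entry edge, and that header copying scales the entry edge count by the probability that the copied header check passes. The helper names and the numbers in the usage note are hypothetical and do not reproduce the exact figures from the benchmark.

```c
#include <assert.h>

/* Estimated iterations per loop entry: body executions divided by the
   number of times the loop is entered, rounded to nearest.  */
static long estimated_iterations(long body_count, long entry_count) {
  assert(entry_count > 0);
  return (body_count + entry_count / 2) / entry_count;
}

/* Entry edge count after header copying, assuming (hypothetically) that
   the copied header check lets enter_percent of the executions through
   while the body count is left unchanged.  */
static long entry_count_after_copy(long entry_count, int enter_percent) {
  assert(enter_percent > 0 && enter_percent <= 100);
  return entry_count * enter_percent / 100;
}
```

For instance, a body count of 200 with 100 entries gives an estimate of 2 iterations; if header copying halves the entry count to 50 while the body count stays at 200, the estimate doubles to 4 although the loop itself did not change.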

> > 
> > Btw, I added a "similar" check in vect_analyze_loop_operations:
> > 
> >   if ((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> >        && (LOOP_VINFO_INT_NITERS (loop_vinfo) < vectorization_factor))
> >       || ((max_niter = max_stmt_executions_int (loop)) != -1
> >           && (unsigned HOST_WIDE_INT) max_niter < vectorization_factor))
> >     {
> >       if (dump_kind_p (MSG_MISSED_OPTIMIZATION))
> >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >                          "not vectorized: iteration count too small.");
> >       if (dump_kind_p (MSG_MISSED_OPTIMIZATION))
> >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >                          "not vectorized: iteration count smaller than "
> >                          "vectorization factor.");
> >       return false;
> >     }
> > 
> > maybe you simply need to update that to also consider the profile?
> 
> Hmm, I am still getting familiar with the code. Later we have
>   if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>       && LOOP_VINFO_INT_NITERS (loop_vinfo) <= th)
>     {
>       if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
>         fprintf (vect_dump, "not vectorized: vectorization not "
>                  "profitable.");
>       if (vect_print_dump_info (REPORT_DETAILS))
>         fprintf (vect_dump, "not vectorized: iteration count smaller than "
>                  "user specified loop bound parameter or minimum "
>                  "profitable iterations (whichever is more conservative).");
>       return false;
>     }
> 
> where th is always greater than or equal to vectorization_factor from the cost model.
> So this test seems redundant if the max_stmt_executions_int was pushed down
> to the second conditional?

Yes, sort of.  The new check was supposed to be crystal clear, and
even with the cost model disabled we want to not vectorize in this
case.  But yes, the whole cost-model stuff needs TLC.
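
As a rough model of how the two checks relate, the following self-contained C sketch separates the unconditional "fewer iterations than VF" test from the cost-model test against th. The struct, the function name and the cost_model_enabled flag are illustrative assumptions, and the use of a profile-based estimate in the second test is only a guess at what the patch under review adds; none of this is the actual vectorizer code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what is known about a loop's trip count.  */
struct loop_facts {
  bool niters_known; /* trip count is a compile-time constant */
  long niters;       /* valid only if niters_known */
  long max_niter;    /* upper bound on iterations, -1 if unknown */
  long est_niter;    /* profile-based estimate, -1 if unknown */
};

static bool worth_vectorizing(const struct loop_facts *l, long vf, long th,
                              bool cost_model_enabled) {
  assert(vf > 0);

  /* Check 1: applies even with the cost model disabled.  A loop that can
     never run VF iterations cannot execute a single vector iteration.  */
  if ((l->niters_known && l->niters < vf)
      || (l->max_niter != -1 && l->max_niter < vf))
    return false;

  if (!cost_model_enabled)
    return true;

  /* Check 2: cost-model threshold for a known trip count.  */
  if (l->niters_known && l->niters <= th)
    return false;

  /* Assumed extension: apply the same threshold to the profile estimate.  */
  if (l->est_niter != -1 && l->est_niter <= th)
    return false;

  return true;
}
```

Since th is at least VF whenever the cost model is on, the first check only changes the outcome when the cost model is disabled or when just an upper bound is known, which matches the "yes, sort of" above.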

Richard.

> 
> Honza
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imend
