This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH PR77536]Generate correct profiling information for vectorized loop

From: Jan Hubicka <hubicka at ucw dot cz>
To: "Bin.Cheng" <amker dot cheng at gmail dot com>
Cc: Jan Hubicka <hubicka at ucw dot cz>, Richard Biener <richard dot guenther at gmail dot com>, Bin Cheng <Bin dot Cheng at arm dot com>, "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, nd <nd at arm dot com>, "pthaugen at linux dot vnet dot ibm dot com" <pthaugen at linux dot vnet dot ibm dot com>
Date: Mon, 20 Feb 2017 17:05:09 +0100
Subject: Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
Authentication-results: sourceware.org; auth=none
References: <VI1PR0802MB217672F0D45BA9EA238CA7BDE75A0@VI1PR0802MB2176.eurprd08.prod.outlook.com> <CAFiYyc2C2jr8+aJgdN5s_WEbg6Hh_1B2jzJ3TAvLxJJBZ-i70Q@mail.gmail.com> <20170220140210.GA2932@kam.mff.cuni.cz> <CAHFci2_Mq7b_-7cwxdmwvGWojgyeTg2qApedRKTFMgPcJvU-hw@mail.gmail.com> <20170220151705.GA29965@kam.mff.cuni.cz> <CAHFci294t7QDhjkKWp96pvyRXUm8gucfY6G6Lzhez1H+jb-kag@mail.gmail.com>

> BTW, if we use gcov_type in calculation from expected_loop_iterations_unbounded,
> how should we adjust the number for calling scale_loop_frequencies
> which has int type?  In extreme case, gcov_type could be out of int's
> range, we have to cap the value anyway.  But yes, 10000 in
> expect_loop_iterations is too small.

What I usually do is to use fixed point math in this case (based on REG_BR_PROB_BASE).
Just pass REG_BR_PROB_BASE as den and calculate the adjustment in gcov_type converting
to int. Because you should be just decreasing the counts, it won't overflow and because
the decarese will be in range, say 2...256 times, it should also be sufficiently
precise.

Be careful to avoid overflow of gcov type - it is not safe to multiply two
counts in 64bit math because each count can be more than 2^32.  (next stage1 I
plan to switch most of this to sreals that will make this easier)

> >> > But I guess here it is sort of safe because vectorized loops are simple.
> >> > You can't just scale down the existing counts/frequencies by vf, because the
> >> > entry edge frequency was adjusted.
> >> I am not 100% follow here, it looks the code avoids changing frequency
> >> counter for preheader/exit edge, otherwise we would need to change all
> >> counters dominated by them?
> >
> > I was just wondering what is wrong with taking the existing frequencies/counts
> > the loop body has and dividing them all by the unroll factor.  This is correct
> > if you ignore the versioning. With versioning I guess you want to scale by
> > the unroll factor and also subtract frequencies/counts that was acocunted to
> > the other versions of the loop, right?
> IIUC, for (vectorized) loop header, it's frequency is calculated by:
>           freq_header = freq_preheader + freq_latch
> and freq_latch = (niter * freq_preheader).  Simply scaling it by VF gives:
>           freq_header = (freq_preheader + freq_latch) / VF
> which is wrong.  Especially if the loop is vectorized by large VF
> (=16) and we normally assume niter (=10) without profiling
> information, it results not only mismatch, but also
> (loop->header->frequency < loop->preheader->frequency).  In fact, if
> we have accurate niter information, the counter should be:
>           freq_header = freq_preheader + niter * freq_preheader

You are right. We need to compensate for the change of probability of the loop
exit edge.
> 
> >> >
> >> > Also niter_for_unrolled_loop depends on sanity of the profile, so perhaps you
> >> > need to compute it before you start chanigng the CFG by peeling proplogue?
> >> Peeling for prologue doesn't change profiling information of
> >> vect_loop, it is the skip edge from before loop to preferred epilogue
> >> loop that will change profile counters.  I guess here exists a dilemma
> >> that niter_for_unrolled_loop is for loop after peeling for prologue?
> >
> > expected_loop_iterations_unbounded calculates number of iteations by computing
> > sum of frequencies of edges entering the loop and comparing it to the frequency
> > of loop header.  While peeling the prologue, you split the preheader edge and
> > adjust frequency of the new preheader BB of the loop to be vectorized.  I think
> > that will adjust the #of iterations estimate.
> It's not the case now I think.  one motivation of new vect_do_peeling
> is to avoid niter checks between prologue and vector loop.  Once
> prologue loop is entered or checked, the vector loop must be executed
> unconditionally.  So the preheaderof vector loop has consistent
> frequency counters now.  The niter check on whether vector loop should
> be executed is now merged with cost check before prologue, and in the
> future I think this can be further merged if loop versioning is
> needed.

Originally you have

  loop_preheader
       |
       v
   loop_header

and the ratio of the two BB frequencies is the loop iteration count. Then you
do something like:

  orig_loop_preheader
       |
       v
   loop_prologue -----> scalar_version_of_loop
       |
       v
 new_loop_preheader
       |
       v
   loop_header

At some point, you need to update new_loop_preheader frequency/count
to reflect the fact that with some probability the loop_prologue avoids
the vectorized loop.  Once you do it and if you don't scale frequency of
loop_header you will make expect_loop_iterations to return higher value
than previously.

So at the time you are calling it, you need to be sure that the loop_header
and its preheader frequences was both adjusted by same factor.  Or you need
to call it early before you start hacking on the CFG and its profile.

Pehaps currently it is safe, because your peeling code is also scaling
the loop profiles.

Honza
> 
> Thanks,
> bin
> >
> > Honza
> >>
> >> Thanks,
> >> bin
> >> >
> >> > Finally if freq_e is 0, all frequencies and counts will be probably dropped to
> >> > 0.  What about determining fraction by counts if they are available?
> >> >
> >> > Otherwise the patch looks good and thanks a lot for working on this!
> >> >
> >> > Honza
> >> >
> >> >> >
> >> >> > gcc/testsuite/ChangeLog
> >> >> > 2017-02-16  Bin Cheng  <bin.cheng@arm.com>
> >> >> >
> >> >> >         PR tree-optimization/77536
> >> >> >         * gcc.dg/vect/pr79347.c: Revise testing string.

Follow-Ups:
- Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  - From: Bin.Cheng
- Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  - From: Bin.Cheng

References:
- [PATCH PR77536]Generate correct profiling information for vectorized loop
  - From: Bin Cheng
- Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  - From: Richard Biener
- Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  - From: Jan Hubicka
- Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  - From: Bin.Cheng
- Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  - From: Jan Hubicka
- Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  - From: Bin.Cheng

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]