[google 4.7] atomic update of profile counters (issue6965050)

Fri Dec 21 09:55:00 GMT 2012

On Fri, Dec 21, 2012 at 10:13 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Thu, Dec 20, 2012 at 8:20 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >> On Wed, Dec 19, 2012 at 4:29 PM, Andrew Pinski <pinskia@gmail.com> wrote:
>> >> >
>> >> > On Wed, Dec 19, 2012 at 12:08 PM, Rong Xu <xur@google.com> wrote:
>> >> > > Hi,
>> >> > >
>> >> > > This patch adds the supprot of atomic update the profile counters.
>> >> > > Tested with google internal benchmarks and fdo kernel build.
>> >> >
>> >> > I think you should use the __atomic_ functions instead of __sync_
>> >> > functions as they allow better performance for simple counters as you
>> >> > can use __ATOMIC_RELAXED.
>> >>
>> >> You are right. I think __ATOMIC_RELAXED should be OK here.
>> >> Thanks for the suggestion.
>> >>
>> >> >
>> >> > And this would be useful for the trunk also.  I was going to implement
>> >> > this exact thing this week but some other important stuff came up.
>> >>
>> >> I'll post trunk patch later.
>> >
>> > Yes, I like that patch, too. Even if the costs are quite high (and this is why
>> > atomic updates was sort of voted down in the past) the alternative of using TLS
>> > has problems with too-much per-thread memory.
>>
>> Actually sometimes (on some processors) atomic increments are cheaper
>> than doing a regular incremental.  Mainly because there is an
>> instruction which can handle it in the L2 cache rather than populating
>> the L1.   Octeon is one such processor where this is true.
>
> One reason for large divergence may be the fact that we optimize the counter
> update code.  Perhaps declaring counters volatile will prevent load/store motion
> and reduce the racing, too.

Well, that will make it slower, too.  The best benchmark to check is tramp3d
for all this stuff.  I remember that ICC when it had a function call for each
counter update was about 100000x slower instrumented than w/o instrumentation
(that is, I never waited long enough to make it finish even one iteration ...)

Thus, it's very important that counter updates are subject to loop
invariant / store
motion (and SCEV const-prop)!  GCC does a wonderful job here at the moment,
please do not regress here.

Richard.

> Honza
>>
>> Thanks,
>> Andrew Pinski
>>
>> >
>> > While there are even more alternatives, like recording the changes and
>> > commmiting them in blocks (say at function return), I guess some solution is
>> > better than no solution.
>> >
>> > Thanks,
>> > Honza