[RFC] Speed-up -fprofile-update=atomic

Martin Liška mliska@suse.cz
Mon Oct 24 12:09:00 GMT 2016


On 10/17/2016 02:03 PM, Richard Biener wrote:
> On Mon, Oct 17, 2016 at 1:46 PM, Martin Liška <mliska@suse.cz> wrote:
>> On 10/13/2016 11:43 AM, Richard Biener wrote:
>>> On Wed, Oct 12, 2016 at 3:52 PM, Martin Liška <mliska@suse.cz> wrote:
>>>> On 10/04/2016 11:45 AM, Richard Biener wrote:
>>>>> On Thu, Sep 15, 2016 at 12:00 PM, Martin Liška <mliska@suse.cz> wrote:
>>>>>> On 09/07/2016 02:09 PM, Richard Biener wrote:
>>>>>>> On Wed, Sep 7, 2016 at 1:37 PM, Martin Liška <mliska@suse.cz> wrote:
>>>>>>>> On 08/18/2016 06:06 PM, Richard Biener wrote:
>>>>>>>>> On August 18, 2016 5:54:49 PM GMT+02:00, Jakub Jelinek <jakub@redhat.com> wrote:
>>>>>>>>>> On Thu, Aug 18, 2016 at 08:51:31AM -0700, Andi Kleen wrote:
>>>>>>>>>>>> I'd prefer to make updates atomic in multi-threaded applications.
>>>>>>>>>>>> The best proxy we have for that is -pthread.
>>>>>>>>>>>>
>>>>>>>>>>>> Is it slower, most definitely, but odds are we're giving folks
>>>>>>>>>>>> garbage data otherwise, which in many ways is even worse.
>>>>>>>>>>>
>>>>>>>>>>> It will likely be catastrophically slower in some cases.
>>>>>>>>>>>
>>>>>>>>>>> Catastrophically as in too slow to be usable.
>>>>>>>>>>>
>>>>>>>>>>> An atomic instruction is a lot more expensive than a single
>>>>>>>>>> increment. Also
>>>>>>>>>>> they sometimes are really slow depending on the state of the machine.
>>>>>>>>>>
>>>>>>>>>> Can't we just have thread-local copies of all the counters (perhaps
>>>>>>>>>> using
>>>>>>>>>> __thread pointer as base) and just atomically merge at thread
>>>>>>>>>> termination?
>>>>>>>>>
>>>>>>>>> I suggested that as well but of course it'll have its own class of issues (short lived threads, so we need to somehow re-use counters from terminated threads, large number of threads and thus using too much memory for the counters)
>>>>>>>>>
>>>>>>>>> Richard.
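
A minimal sketch of the thread-local scheme described above (hypothetical names, not GCC's actual instrumentation): each thread bumps a private copy of the counters and folds it atomically into the global table when the thread terminates.

#include <stdint.h>

#define N_COUNTERS 4

int64_t __gcov_counters[N_COUNTERS];              /* global arc counters */
static __thread int64_t tls_counters[N_COUNTERS]; /* per-thread copies */

/* Hot path: plain, non-atomic increment of the thread's private copy.  */
static inline void
count_arc (int i)
{
  tls_counters[i]++;
}

/* At thread termination (e.g. via a pthread_key destructor): merge the
   private copies into the global counters, one atomic add each.  */
static void
merge_tls_counters (void)
{
  for (int i = 0; i < N_COUNTERS; i++)
    __atomic_fetch_add (&__gcov_counters[i], tls_counters[i],
                        __ATOMIC_RELAXED);
}

The issues Richard mentions show up directly in this sketch: tls_counters must be recycled for short-lived threads, and memory use scales with thread count times counter count.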
>>>>>>>>
>>>>>>>> Hello.
>>>>>>>>
>>>>>>>> I've put the approach on my TODO list; let's see whether it is doable in a reasonable amount of time.
>>>>>>>>
>>>>>>>> I've just finished some measurements to illustrate the slow-down of the -fprofile-update=atomic approach.
>>>>>>>> All numbers are: no profile, -fprofile-generate, -fprofile-generate -fprofile-update=atomic
>>>>>>>> c-ray benchmark (utilizing 8 threads, -O3): 1.7, 15.5, 38.1s
>>>>>>>> unrar (utilizing 8 threads, -O3): 3.6, 11.6, 38s
>>>>>>>> tramp3d (1 thread, -O3): 18.0, 46.6, 168s
>>>>>>>>
>>>>>>>> So the slow-down is roughly 300% compared to -fprofile-generate. I don't have much experience with default option
>>>>>>>> selection, but these numbers can probably help.
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>
>>>>>>> Look at the generated code for an instrumented simple loop and see that for
>>>>>>> the non-atomic updates we happily apply store-motion to the counter update
>>>>>>> and thus we only get one counter update per loop exit rather than one per
>>>>>>> loop iteration.  Now see what happens for the atomic case (I suspect you
>>>>>>> get one per iteration).
>>>>>>>
>>>>>>> I'll bet this accounts for most of the slowdown.
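
A minimal sketch of the difference, with hypothetical counter names (the real counters live in per-function __gcov0.* arrays):

extern long long __gcov0_f[1];  /* stand-in for the real counter array */
extern void work (void);

/* Non-atomic instrumentation: store motion hoists the counter update,
   leaving a register increment in the loop and one store at the exit.  */
void
f_single (int n)
{
  long long local = __gcov0_f[0];
  for (int i = 0; i < n; i++)
    {
      local++;            /* cheap per-iteration update */
      work ();
    }
  __gcov0_f[0] = local;   /* single store per loop exit */
}

/* Atomic instrumentation: the update is not hoisted, so every
   iteration pays for a locked read-modify-write.  */
void
f_atomic (int n)
{
  for (int i = 0; i < n; i++)
    {
      __atomic_fetch_add (&__gcov0_f[0], 1, __ATOMIC_RELAXED);
      work ();
    }
}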
>>>>>>>
>>>>>>> Back in the day, ICC, which had atomic counter updates (but using function
>>>>>>> calls - ugh!), had a > 1000% overhead with FDO for tramp3d (they also
>>>>>>> didn't have early inlining -- removing abstraction helps reduce the number
>>>>>>> of counters significantly).
>>>>>>>
>>>>>>> Richard.
>>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> During Cauldron I discussed with Richi approaches for speeding up arc
>>>>>> profile counter updates. My first attempt is to utilize TLS storage, where
>>>>>> every function accumulates its arc counters. These are eventually added
>>>>>> (using atomic operations) to the global counters at the very end of the function.
>>>>>> Currently I rely on target support for TLS; it's questionable whether
>>>>>> -fprofile-update=atomic should have such a requirement, or whether to add
>>>>>> a new option value like -fprofile-update=atomic-tls?
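
Roughly, the per-function code would look like this sketch (hypothetical names; shown with automatic storage, which behaves like TLS here):

extern long long __gcov0_fn[3];  /* global counters for this function */

void
fn (void)
{
  long long local[3] = { 0, 0, 0 };  /* per-invocation accumulators */

  /* ... function body; arc counts are cheap local[i]++ updates ... */

  /* On every exit, flush with one atomic add per counter -- this is
     the "lock add" sequence visible in the perf report below.  */
  for (int i = 0; i < 3; i++)
    __atomic_fetch_add (&__gcov0_fn[i], local[i], __ATOMIC_RELAXED);
}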
>>>>>>
>>>>>> Running the patch on tramp3d, it takes 88s to finish, compared to the previous
>>>>>> numbers. That is about 50% of the time of the current implementation.
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> Hmm, I thought I suggested that you can simply use automatic storage
>>>>> (which effectively is TLS...) for regions that are not forked or
>>>>> abnormally left (which means SESE regions that have no calls that
>>>>> eventually terminate or throw externally).
>>>>>
>>>>> So why did you end up with TLS?
>>>>
>>>> Hi.
>>>>
>>>> Using TLS does not make sense; stupid mistake ;)
>>>>
>>>> By using SESE regions, do you mean the infrastructure that is utilized
>>>> by the Graphite machinery?
>>>
>>> No, just as "single-entry single-exit region", which means the placement of
>>> the zero-initializations of the internal counters and of the updates to the
>>> actual counters is "obvious".
>>>
>>> Note that this "optimization" isn't one if the SESE region does not contain
>>> cycle(s).  Unless there is a way to do an atomic update of a bunch of
>>> counters faster than doing them separately.  This optimization will also
>>> increase register pressure (or force the internal counters to the stack).
>>> Thus selecting which counters to "optimize" and which ones to leave in place
>>> might be necessary.
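
A tiny example of that caveat (hypothetical counters): in a straight-line SESE region the local counters add work without removing any atomic operation.

extern long long c0, c1;  /* stand-ins for two arc counters */

void
straight_line (void)
{
  long long t0 = 0, t1 = 0;  /* extra initialization */
  t0++;                      /* replaces one direct atomic add... */
  t1++;
  /* ...but each counter still needs its own atomic update here.  */
  __atomic_fetch_add (&c0, t0, __ATOMIC_RELAXED);
  __atomic_fetch_add (&c1, t1, __ATOMIC_RELAXED);
}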
>>
>> Ok, I must admit the selection of which counters to optimize is crucial. The current
>> implementation (atomic increments at places where a BB is reached) has the advantage
>> that it does not increase register pressure and does not update a global arc counter
>> when the BB is not visited. On the other hand, having local counters that are updated
>> at function exit (my current implementation) possibly updates counters for BBs that
>> were never executed, and it creates a huge memory-locking hotspot if multiple threads
>> call a function very often. This is the perf report of the c-ray benchmark:
>>
>>        │             if((d = SQ(b) - 4.0 * a * c) < 0.0) return 0;
>>   0.01 │3c0:   xor    %ecx,%ecx
>>   0.00 │3c2:   mov    0x110(%rsp),%rdx
>>   0.00 │       lock   add    %rdx,__gcov0.ray_sphere
>>   7.96 │       mov    0x118(%rsp),%rdx
>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x8
>>  11.39 │       mov    0x120(%rsp),%rdx
>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x10
>>  11.09 │       mov    0x128(%rsp),%rdx
>>   0.00 │       lock   add    %rdx,__gcov0.ray_sphere+0x18
>>  11.27 │       mov    0x130(%rsp),%rdx
>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x20
>>  11.02 │       mov    0x138(%rsp),%rdx
>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x28
>>  11.46 │       mov    0x140(%rsp),%rdx
>>   0.00 │       lock   add    %rdx,__gcov0.ray_sphere+0x30
>>  11.84 │       mov    0x148(%rsp),%rdx
>>   0.00 │       lock   add    %rdx,__gcov0.ray_sphere+0x38
>>  11.57 │       mov    0x150(%rsp),%rdx
>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x40
>>   6.86 │       mov    0x158(%rsp),%rdx
>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x48
>>
>> My current approach does an atomic increment when maybe_hot_bb_p returns false,
>> and local counters are used otherwise. The question is how to find better places
>> to initialize and store the local counter values.
> 
> Given the main reason for using local counters is loops, you can use loop
> information and put counter initializations in the loop preheader and stores
> on the loop exit edges (basically what loop store motion does w/o atomic
> updates). Store motion knows to ignore any aliasing with counters and thus
> is _very_ aggressive with doing this (including all loop nests, but of
> course not across possible (abnormal) function exits).
> 
> While profile instrumentation is quite late, it is of course still before
> IPA inlining. Thus loop store motion w/o atomic updates possibly still sees
> more opportunities than this. I wonder if we might want to leave the
> optimization to the regular optimizers by only lowering the actual counter
> kind late and starting with some IFN_UPDATE_COVERAGE_COUNTER which passes
> could handle (and merge).
> 
> Anyway, I'd go for the loop trick and simply handle counter motion from
> innermost loops only. The final update place would be the common
> post-dominator of all exits (if you want to handle multiple exits). As said,
> you need to watch out for function termination that is not reflected by the
> CFG (external throws, exits in callees, etc.).
> 
> Richard.
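
As a sketch of that placement (hypothetical names), for an innermost loop the transformation would be:

extern long long __gcov0_loop[1];
extern int keep_going (void);

void
loop_fn (void)
{
  long long tmp = 0;                /* preheader: zero the internal counter */
  while (keep_going ())
    tmp++;                          /* loop body: non-atomic update */
  /* Exit edge (common post-dominator of all exits): one atomic flush.  */
  __atomic_fetch_add (&__gcov0_loop[0], tmp, __ATOMIC_RELAXED);
}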

Hello Richard.

I've just finished my patch, where I came up with a new internal fn (UPDATE_COVERAGE_COUNTER).
The function is generated by the profile pass and is handled by the lim pass. I originally tried
to support the internal fn in the general lim machinery, but doing a specific loop motion turned
out to be easier to work with.
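
Conceptually (a hypothetical C rendering, not the actual GIMPLE), the profile pass now emits:

/* Stand-in declaration for the internal fn, shown here as if it were a
   regular call.  */
extern void UPDATE_COVERAGE_COUNTER (long long *, long long);

void
instrumented (long long *counter, int n)
{
  for (int i = 0; i < n; i++)
    UPDATE_COVERAGE_COUNTER (counter, 1);  /* one update per iteration */
}

The loop motion then merges these into a single UPDATE_COVERAGE_COUNTER on the exit edge, which is lowered late to one __atomic_fetch_add, much like the sketch above.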

With the patch applied, tramp3d runs 1.8x slower with -fprofile-update=atomic compared to -fprofile-update=single.

The patch bootstraps on ppc64le-redhat-linux and survives regression tests.
Thoughts?

Thanks,
Martin
> 
>> Ideas welcomed.
>> Thanks,
>> Martin
>>
>>>
>>> Richard.
>>>
>>>> Thanks,
>>>> Martin
>>>>
>>>>>
>>>>> Richard.
>>>>>
>>>>>> Martin
>>>>>>
>>>>>>>
>>>>>>>> Martin
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>      Jakub
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Introduce-loop-store-motion-for-UPDATE_COVERAGE_COUN.patch
Type: text/x-patch
Size: 17879 bytes
Desc: not available
URL: <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20161024/616bee78/attachment.bin>

