[PATCH] Conditional count update for fast coverage test in multi-threaded programs

Mon Nov 25 21:53:00 GMT 2013

On Mon, Nov 25, 2013 at 2:11 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Nov 22, 2013 at 10:49 PM, Rong Xu <xur@google.com> wrote:
>> On Fri, Nov 22, 2013 at 4:03 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Nov 22, 2013 at 4:51 AM, Rong Xu <xur@google.com> wrote:
>>>> Hi,
>>>>
>>>> This patch injects a condition into the instrumented code for edge
>>>> counter update. The counter value will not be updated after reaching
>>>> value 1.
>>>>
>>>> The feature is controlled by a new parameter,
>>>> --param=coverage-exec_once. It is disabled by default; set it to 1
>>>> to enable it.
>>>>
>>>> This extra check usually slows the program down. For SPEC 2006
>>>> benchmarks (all single-threaded programs), we usually see around a
>>>> 20%-35% slowdown in an -O2 coverage build. This feature, however, is
>>>> expected to improve the coverage run speed for multi-threaded
>>>> programs, because there is virtually no data race or false sharing
>>>> when updating counters. The improvement can be significant for highly
>>>> threaded programs -- we are seeing a 7x speedup in coverage test runs
>>>> for some non-trivial Google applications.
>>>>
>>>> Tested with bootstrap.
>>>
>>> Err - why not simply emit
>>>
>>>   counter = 1
>>>
>>> for the counter update itself with that --param (I don't like a --param
>>> for this either).
>>>
>>> I assume that CPUs can avoid data-races and false sharing for
>>> non-changing accesses?
>>>
>>
>> I'm not aware of any CPU having this feature. I think a write to a
>> shared cache line invalidates all the shared copies. I cannot find
>> any reference to hardware checking the value being written. Do you
>> have a pointer to the feature?
>
> I don't have any pointer - but I remember seeing this in the context
> of atomics, so it may apply only when using an xchg or cmpxchg
> instruction.  Which would make it non-portable to some extent (if
> you don't want to use atomic builtins here).
>

cmpxchg should work here -- it's a conditional write, so the data race
and false sharing can be avoided.
I'm comparing the performance of an explicit branch vs. cmpxchg and
will report back.
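
For reference, the counter-update variants discussed in this thread can
be sketched in C roughly as follows. This is only an illustration, not
the code the patch emits: counter_t and the update_* function names are
hypothetical, and the cmpxchg variant assumes C11 <stdatomic.h> rather
than whatever the instrumentation would actually generate.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the profile counter type. */
typedef long long counter_t;

/* Default coverage instrumentation: every execution writes the counter. */
static void update_increment(counter_t *c)
{
    *c += 1;
}

/* "counter = 1": still an unconditional store on every execution. */
static void update_store_one(counter_t *c)
{
    *c = 1;
}

/* Explicit branch: the store happens only while the counter is 0,
   so later executions only read the counter. */
static void update_branch(counter_t *c)
{
    if (*c == 0)
        *c = 1;
}

/* Compare-and-swap: store 1 only if the counter is still 0.
   On x86 this would typically compile to a lock cmpxchg. */
static void update_cmpxchg(_Atomic counter_t *c)
{
    counter_t expected = 0;
    atomic_compare_exchange_strong(c, &expected, 1);
}
```

All variants except update_increment leave the counter at 1 no matter
how many times the edge executes; they differ only in how often they
write to the counter's cache line, which is what the measurements are
meant to compare.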

-Rong

> Richard.
>
>> I just tested this implementation vs. simply setting the counter to
>> 1, using Google search as the benchmark.
>> This one is 4.5x faster. The test was done on Intel Westmere systems.
>>
>> I can change the parameter to an option.
>>
>> -Rong
>>
>>> Richard.
>>>
>>>> Thanks,
>>>>
>>>> -Rong