[PATCH 7/8] Model cache auto-prefetcher in scheduler

Fri Jan 16 15:34:00 GMT 2015

On Fri, Jan 16, 2015 at 3:06 PM, Maxim Kuvyrkov
<maxim.kuvyrkov@linaro.org> wrote:
> On Nov 19, 2014, at 12:27 PM, Ramana Radhakrishnan <Ramana.Radhakrishnan@arm.com> wrote:
>
>>
>
> Hi Ramana,
> Hi Vladimir,
>
> I still don't have SPEC2000/SPEC2006 benchmark numbers for this patch.  Since stage3 is about to finish, I'm going to commit the target independent part of the patch now (it was approved by Vladimir a while ago).
>
> I'll commit ARM part of the patch once Ramana is happy with it.
>
> I've addressed all review comments and updated patch is attached.
>
> Any objections?
>
>> On 14/11/14 15:12, Maxim Kuvyrkov wrote:
>>> On Nov 14, 2014, at 8:38 AM, Jeff Law <law@redhat.com> wrote:
>>>
>>>> On 10/20/14 22:06, Maxim Kuvyrkov wrote:
>>>>> Hi,
>>>>> Ramana, this change requires benchmarking, which I can't easily do
>>>>> at
>>>> the moment. I would appreciate any benchmarking results that you can
>>>> share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH
>>>> needs to be tuned/confirmed for Cortex-A15.
>>>> What were the results of that benchmarking?  IIRC I tabled reviewing this work waiting for those results (and I probably should have let you know that.  Sorry,  my bad there).
>>>
>>> I don't have the benchmarking results yet, and I was hoping for ARM to help with getting the numbers.  The arm maintainers still need to OK the arm-specific portion of the patch, which, I imagine, will happen only of benchmark scores improve.
>>
>>
>> I tried benchmarking 78f367cfcfdc9f0a422a362cd85ecc122834d96f from the trees you gave me links to against the equivalent version on trunk.
>>
>> The results with SPEC2k on A15 were in the noise with the default value for PARAM_SCHED_AUTOPREF_QUEUE_DEPTH which is 2 in the backend. I'm still waiting on results for values 0, 1 and 3 and hopefully something will come back soon for SPEC2k.
>
> After going back to the original benchmark that exposed the autoprefetcher issue (STREAMS) the required value of the parameter to fix the regression turned out to be 5.
>
> I have also discovered that Cortex-A57 also benefits from this optimization.  Running STREAMS benchmark on Juno in AArch32 mode showed increased performance by 10-15%, which is not as big as Cortex-A15's 250-300%, but still significant.
>
> In this updated patch the parameter is set to maximum -- look through all scheduling queue of instructions in search of memory ops that are relevant to L2 autoprefetcher.  In practice the setting to maximum will not change much, since there are rarely many instructions in the queue.
>
>>
>>
>>>
>>> @@ -29903,6 +29915,20 @@ arm_first_cycle_multipass_dfa_lookahead (void)
>>>   return issue_rate > 1 ? issue_rate : 0;
>>> }
>>>
>>> +/* Enable modeling of Cortex-A15 L2 auto-prefetcher.  */
>>> +static int
>>> +arm_first_cycle_multipass_dfa_lookahead_guard (rtx insn, int ready_index)
>>> +{
>>> +  switch (arm_tune)
>>> +    {
>>> +    case cortexa15:
>>> +      return autopref_multipass_dfa_lookahead_guard (insn, ready_index);
>>> +
>>> +    default:
>>> +      return 0;
>>> +    }
>>> +}
>>> +
>>
>>
>> It would be better to have this as a flag in the tuning tables rather than hardcoding for a core here. The backend has been moving in that direction for all core centric information and it is preferable that be continued.
>>
>> So this logic here should just be
>>
>> if (current_tune->multipass_lookahead)
>>  return autopref_multipass_lookahead_guard (insn, ready_index);
>> else
>>  return 0;
>>
>
> Done.  The autoprefetcher is enabled in Cortex-A15 and Cortex-A57 tuning tables.

The ARM changes look sane. As long as you've tested this with a
bootstrap, I'm ok with it - we probably need a similar change for
AArch64 and Cortex-A57 but I think we should do that for the next
stage1. Thanks for persevering with this.

Ramana

>
> --
> Maxim Kuvyrkov
> www.linaro.org
>
>