This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: [PR91598] Improve autoprefetcher heuristic in haifa-sched.c

From: Maxim Kuvyrkov <maxim dot kuvyrkov at linaro dot org>
To: Richard Guenther <richard dot guenther at gmail dot com>
Cc: gcc-patches at gcc dot gnu dot org, Alexander Monakov <amonakov at ispras dot ru>, Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
Date: Thu, 29 Aug 2019 19:43:17 +0300
Subject: Re: [PR91598] Improve autoprefetcher heuristic in haifa-sched.c
References: <D46C8D08-685F-41A7-8695-23BB65B74A87@linaro.org> <09F25146-8361-4FB0-AE6B-E13BF8CF332F@gmail.com>

> On Aug 29, 2019, at 7:29 PM, Richard Biener <richard.guenther@gmail.com> wrote:
> 
> On August 29, 2019 5:40:47 PM GMT+02:00, Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org> wrote:
>> Hi,
>> 
>> This patch tweaks autoprefetcher heuristic in scheduler to better group
>> memory loads and stores together.
>> 
>> From https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91598:
>> 
>> There are two separate changes, both related to instruction scheduler,
>> that cause the regression.  The first change in r253235 is responsible
>> for 70% of the regression.
>> ===
>>   haifa-sched: fix autopref_rank_for_schedule qsort comparator
>> 
>> * haifa-sched.c (autopref_rank_for_schedule): Order 'irrelevant' insns
>>           first, always call autopref_rank_data otherwise.
>> 
>> 
>> 
>> git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@253235
>> 138bc75d-0d04-0410-961f-82ee72b054a4
>> ===
>> 
>> After this change instead of
>> r1 = [rb + 0]
>> r2 = [rb + 8]
>> r3 = [rb + 16]
>> r4 = <math with r1>
>> r5 = <math with r2>
>> r6 = <math with r3>
>> 
>> we get
>> r1 = [rb + 0]
>> <math with r1>
>> r2 = [rb + 8]
>> <math with r2>
>> r3 = [rb + 16]
>> <math with r3>
>> 
>> which, apparently, cortex-a53 autoprefetcher doesn't recognize.  This
>> schedule happens because r2= load gets lower priority than the
>> "irrelevant" <math with r1> due to the above patch.
>> 
>> If we think about it, the fact that "r1 = [rb + 0]" can be scheduled
>> means that true dependencies of all similar base+offset loads are
>> resolved.  Therefore, for autoprefetcher-friendly schedule we should
>> prioritize memory reads before "irrelevant" instructions.
> 
> But isn't there also max number of load issues in a fetch window to consider? 
> So interleaving arithmetic with loads might be profitable. 

It appears that cores with autoprefetcher hardware prefer loads and stores bundled together rather than interspersed with other instructions that keep the remaining CPU units busy.

The autoprefetching heuristic is enabled only for cores that support it and isn't active by default.

> 
>> On the other hand, following similar logic, we want to delay memory
>> stores as much as possible to start scheduling them only after all
>> potential producers are scheduled.  I.e., for autoprefetcher-friendly
>> schedule we should prioritize "irrelevant" instructions before memory
>> writes.
>> 
>> Obvious patch to implement the above is attached.  It brings 70% of
>> regressed performance on this testcase back.
>> 
>> OK to commit?
>> 
>> Regards,
>> 
>> --
>> Maxim Kuvyrkov
>> www.linaro.org
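
The ordering described in the quoted patch description (memory reads first, then "irrelevant" instructions, then memory writes) can be illustrated with a small standalone comparator. This is only a sketch under stated assumptions, not the code in haifa-sched.c: the names insn_kind, ready_insn, compare_ready and sort_ready_list are hypothetical, instructions are assumed to be pre-classified, and the real scheduler's ready-list handling and autoprefetcher model are more involved.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical classification of a ready instruction.  The enumerator
   values encode the desired order: loads, then "irrelevant"
   instructions, then stores.  */
enum insn_kind { KIND_LOAD = 0, KIND_IRRELEVANT = 1, KIND_STORE = 2 };

/* Hypothetical stand-in for an entry in the scheduler's ready list.  */
struct ready_insn {
  enum insn_kind kind;
  long offset; /* base+offset displacement; unused for KIND_IRRELEVANT */
  int id;      /* unique id, used only as a deterministic tie-break */
};

/* Three-way compare that avoids overflow from subtraction.  */
static int cmp_long(long x, long y) { return (x > y) - (x < y); }

/* qsort comparator: order by kind first, then by ascending offset for
   memory accesses of the same kind, then by id.  */
static int compare_ready(const void *pa, const void *pb)
{
  const struct ready_insn *a = pa;
  const struct ready_insn *b = pb;

  if (a->kind != b->kind)
    return cmp_long((long) a->kind, (long) b->kind);

  if (a->kind != KIND_IRRELEVANT && a->offset != b->offset)
    return cmp_long(a->offset, b->offset);

  return cmp_long((long) a->id, (long) b->id);
}

/* Sort a (hypothetical) ready list into autoprefetcher-friendly order.  */
void sort_ready_list(struct ready_insn *insns, size_t n)
{
  assert(insns != NULL || n == 0);
  if (n > 1)
    qsort(insns, n, sizeof *insns, compare_ready);
}
```

With this comparator, a ready list holding loads at offsets 16, 0 and 8, one arithmetic instruction, and stores at offsets 8 and 0 is ordered as the three loads by ascending offset, then the arithmetic instruction, then the two stores by ascending offset. The id tie-break is there because qsort is not guaranteed to be stable.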

Follow-Ups:
- Re: [PR91598] Improve autoprefetcher heuristic in haifa-sched.c
  - From: Alexander Monakov

References:
- [PR91598] Improve autoprefetcher heuristic in haifa-sched.c
  - From: Maxim Kuvyrkov
- Re: [PR91598] Improve autoprefetcher heuristic in haifa-sched.c
  - From: Richard Biener