This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH] Atom: Scheduler improvements for better imul placement


On Tue, Apr 17, 2012 at 12:27 AM, Igor Zamyatin <izamyatin@gmail.com> wrote:
> On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev <abel@ispras.ru> wrote:
>> On 13.04.2012 14:18, Igor Zamyatin wrote:
>>>
>>> On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev <abel@ispras.ru> wrote:
>>>>
>>>> On 12.04.2012 16:38, Richard Guenther wrote:
>>>>>
>>>>>
>>>>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin <izamyatin@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
>>>>>> <richard.guenther@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov <amonakov@ispras.ru> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Can Atom execute two IMUL in parallel?  Or what exactly is the
>>>>>>>>> pipeline behavior?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> As I understand from Intel's optimization reference manual, the
>>>>>>>> behavior is as follows: if the instruction immediately following
>>>>>>>> IMUL has shorter latency, execution is stalled for 4 cycles (which
>>>>>>>> is IMUL's latency); otherwise, a 4-or-more cycles latency
>>>>>>>> instruction can be issued after IMUL without a stall.
>>>>>>>> In other words, IMUL is pipelined with respect to other long-latency
>>>>>>>> instructions, but not to short-latency instructions.
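For reference, a small C fragment shows the kind of code this matters for.
It is purely illustrative (the function and variable names are made up, and
it is not the testcase attached later in this thread):

/* Two independent multiplies: issuing the imuls back to back lets the
   second one enter the multiplier while the first is still in flight;
   placing the short-latency add between them would instead stall for
   the remainder of the imul latency.  */
int
two_muls (int a, int b, int c, int d)
{
  int x = a * b;   /* imul #1 */
  int y = c * d;   /* imul #2, independent of the first */
  return x + y;    /* short-latency add, best placed after both imuls */
}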
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> It seems to be modeled in the pipeline description though:
>>>>>>>
>>>>>>> ;;; imul insn has 5 cycles latency
>>>>>>> (define_reservation "atom-imul-32"
>>>>>>>                     "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4,
>>>>>>>                      atom-port-0")
>>>>>>>
>>>>>>> ;;; imul instruction excludes other non-FP instructions.
>>>>>>> (exclusion_set "atom-eu-0, atom-eu-1"
>>>>>>>                "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4")
>>>>>>>
>>>>>>
>>>>>> The main idea is quite simple:
>>>>>>
>>>>>> If we are about to schedule an IMUL instruction (it is at the top of
>>>>>> the ready list), we try to find the producer of another (independent)
>>>>>> IMUL instruction that is also in the ready list.  The goal is to
>>>>>> schedule that producer first, so that another IMUL appears in the
>>>>>> ready list and two successive IMUL instructions get scheduled.
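To make the mechanism concrete, the idea maps onto the TARGET_SCHED_REORDER
hook roughly as in the sketch below.  This is a simplified illustration, not
the submitted patch: the names ix86_sched_reorder_sketch and insn_is_imul_p
are made up, the heuristics are stripped down, and it assumes the usual
config/i386/i386.c context (sched-int.h and the generated insn attributes
already available).

/* Hypothetical helper: is INSN a recognizable integer multiply?  */
static bool
insn_is_imul_p (rtx insn)
{
  if (!NONDEBUG_INSN_P (insn) || recog_memoized (insn) < 0)
    return false;
  return get_attr_type (insn) == TYPE_IMUL;
}

/* Sketch of a TARGET_SCHED_REORDER hook.  READY[*PN_READY - 1] is the
   insn the scheduler would issue next; if it is an imul, look for a
   ready insn that feeds another imul and move that producer into the
   issue slot, so the sequence becomes producer, imul, imul.  */
static int
ix86_sched_reorder_sketch (FILE *dump ATTRIBUTE_UNUSED,
                           int verbose ATTRIBUTE_UNUSED,
                           rtx *ready, int *pn_ready,
                           int clock_var ATTRIBUTE_UNUSED)
{
  int n_ready = *pn_ready;
  int i;

  if (n_ready < 2 || !insn_is_imul_p (ready[n_ready - 1]))
    return ix86_issue_rate ();

  for (i = n_ready - 2; i >= 0; i--)
    {
      sd_iterator_def sd_it;
      dep_t dep;

      if (!NONDEBUG_INSN_P (ready[i]))
        continue;

      /* Does this ready insn produce an input of some other imul?
         (It cannot feed the imul at the top of the list, or that imul
         would not be ready yet.)  */
      FOR_EACH_DEP (ready[i], SD_LIST_FORW, sd_it, dep)
        if (insn_is_imul_p (DEP_CON (dep)))
          {
            /* Rotate the producer into the issue slot; the imul that
               was there is issued right after it.  */
            rtx producer = ready[i];
            memmove (&ready[i], &ready[i + 1],
                     (n_ready - i - 1) * sizeof (rtx));
            ready[n_ready - 1] = producer;
            return ix86_issue_rate ();
          }
    }

  return ix86_issue_rate ();
}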
>>>>>
>>>>>
>>>>>
>>>>> Why does that not happen without your patch?  Does it never happen
>>>>> without your patch or does it merely not happen for one EEMBC
>>>>> benchmark (can you provide a testcase?)?
>>>>
>>>>
>>>>
>>>> It does not happen because the scheduler by itself does not do such
>>>> specific reordering.  That said, it is easy to imagine cases where this
>>>> patch will make things worse rather than better.
>>>>
>>>> Igor, why not try different, subtler mechanisms like adjust_priority,
>>>> which gets called when an insn is added to the ready list?  E.g.
>>>> increase the producer's priority.
>>>>
>>>> The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out
>>>> when you have more than one imul in the ready list?  Don't you want to
>>>> bump the priority of the other imul found?
>>>
>>>
>>> Could you provide some examples of cases where this patch would harm
>>> performance?
>>
>>
>> I thought of cases where the other ready insns can fill up the hole, and
>> that would be more beneficial because e.g. they would be on more critical
>> paths than the producer of your second imul.  I don't know enough about
>> Atom to give an example -- maybe some long divisions?
>>
>>
>>>
>>> Sched_reorder was chosen since it is used in other ports and looks the
>>> most suitable for such a case, e.g. it provides access to the whole
>>> ready list.
>>> BTW, just increasing the producer's priority seems to be riskier
>>> performance-wise -- we could incorrectly start delaying some
>>> instructions.
>>
>>
>> Yes, but exactly because of the above example you can start incorrectly
>> delaying other insns, too, as you force the insn to be the first in the
>> list, whereas bumping the priority still leaves the scheduler's sorting
>> heuristics in place and actually lowers that risk.
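For comparison, the adjust_priority alternative would look roughly like the
sketch below.  Again this is hypothetical, reusing the insn_is_imul_p helper
from the earlier sketch; the +1 bonus is an arbitrary placeholder rather than
a tuned value.

/* Sketch of a TARGET_SCHED_ADJUST_PRIORITY hook: nudge producers of
   imuls up the ready list instead of forcing an order outright, so the
   scheduler's own sorting heuristics still apply.  */
static int
ix86_adjust_priority_sketch (rtx insn, int priority)
{
  sd_iterator_def sd_it;
  dep_t dep;

  if (!NONDEBUG_INSN_P (insn) || recog_memoized (insn) < 0)
    return priority;

  /* If INSN feeds an imul, give it a small bonus so the consumer imul
     becomes ready a bit sooner.  */
  FOR_EACH_DEP (insn, SD_LIST_FORW, sd_it, dep)
    if (insn_is_imul_p (DEP_CON (dep)))
      return priority + 1;

  return priority;
}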
>>
>>>
>>> I thought the ready list doesn't contain DEBUG_INSNs... Is that so?  If
>>> it does contain them, this could be added easily.
>>
>>
>> It does, but I'm not sure whether the sched_reorder hook gets them or
>> whether they are immediately removed -- I saw similar checks in one of
>> the targets' hooks.
>
> Done with the DEBUG_INSN checks; the one-imul limit was also removed.
> Patch attached.
>
>>
>> Anyway, my main thought was that it is better to test on more benchmarks
>> to alleviate the above concerns, so as long as the i386 maintainers are
>> happy, I don't see major problems here.  A good idea could be to
>> generalize the patch to handle other long-latency insns as second
>> consumers, not only imuls, if this is relevant for Atom.
>
> Yes, generalizing this approach is in the plans.  According to the Atom
> software optimization guide there is still some headroom left here.
> As for trying more benchmarks -- this particular case is indeed quite
> rare.  I attached an example where the patch helps to group imuls in
> pairs, which is profitable for Atom.  Such code is not very common,
> but hopefully this approach can help avoid this and other glass jaws.

BTW, this patch also helps some EEMBC tests when -funroll-loops is specified.
So, any feedback from the i386 maintainers about this? :)

The ChangeLog has changed slightly:


2012-04-10  Yuri Rumyantsev  <yuri.s.rumyantsev@intel.com>

	* config/i386/i386.c (x86_sched_reorder): New function to take
	advantage of Atom's pipelined IMUL execution.


>
>>
>> Andrey
>>
>>
>>>
>>> The case with more than one imul in the ready list wasn't considered
>>> simply because the goal was to catch the particular situation where
>>> there is a risk of getting the pattern "imul, producer, imul", which is
>>> less effective than "producer, imul, imul" for Atom.
>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>>> And the MD only allows us to prefer scheduling of successive IMUL
>>>>>> instructions, i.e. if an IMUL was just scheduled and the ready list
>>>>>> contains another IMUL instruction, then it will be chosen as the best
>>>>>> candidate for scheduling.
>>>>>>
>>>>>>
>>>>>>> at least from my very limited guessing of what the above does.  So,
>>>>>>> did you analyze why the scheduler does not properly schedule things
>>>>>>> for you?
>>>>>>>
>>>>>>> Richard.
>>>>>>>
>>>>>>>>
>>>>>>>> From reading the patch, I could not understand the link between
>>>>>>>> pipeline behavior and what the patch appears to do.
>>>>>>>>
>>>>>>>> Hope that helps.
>>>>>>>>
>>>>>>>> Alexander
>>>>
>>>>
>>>>
>>>
>>> Thanks,
>>> Igor
>>
>>

