This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [Patch, i386] Limit unroll factor for certain loops on Corei7


On Fri, Dec 2, 2011 at 11:36 AM, Andi Kleen <andi@firstfloor.org> wrote:
> Teresa Johnson <tejohnson@google.com> writes:
>
> Interesting optimization. I would be concerned a little bit
> about compile time, does it make a measurable difference?

I haven't measured compile time explicitly, but I don't think it
should make a measurable difference, especially after I address your
efficiency suggestion (see below), since it will just make one pass
over the instructions in innermost loops.

>
>> The attached patch detects loops containing instructions that tend to
>> incur high LCP (length changing prefix) stalls on Core i7, and limits
>> their unroll factor to try to keep the unrolled loop body small enough
>> to fit in the Core i7's loop stream detector, which can hide LCP stalls
>> in loops.
>
> One more optimization would be to optimize padding for this case,
> the LSD only works if the loop is not spread over too many 32 byte
> chunks. So if you detect the loop is LSD worthy always pad to 32 bytes
> at the beginning.

Thanks for the suggestion, I will look at doing that in follow-on tuning.

>
>> To do this I leveraged the existing TARGET_LOOP_UNROLL_ADJUST target
>> hook, which was previously only defined for s390. I added one
>> additional call to this target hook, when unrolling for constant trip
>> count loops. Previously it was only called for runtime computed trip
>> counts. Andreas, can you comment on the effect for s390 of this
>> additional call of the target hook, since I can't measure that?
>
> On Sandy-Bridge there's also the decoded icache which is much larger,
> but also has some restrictions. It would be nice if this optimization
> was general enough to handle this case too.
>
> In general I notice that the tree loop unroller is too aggressive recently:
> a lot of loops that probably shouldn't be unrolled (like containing
> function calls etc.) are unrolled at -O3. So probably a better cost
> model for unrolling would make sense anyways.

These are both good suggestions, and I will look into Sandy Bridge
heuristics in follow-on work, since we will need to tune for that
soon.

>
>> +  /* Don't reduce unroll factor in loops with floating point
>> +     computation, which tend to benefit more heavily from
>> +     larger unroll factors and are less likely to bottleneck
>> +     at the decoder. */
>> +  has_FP = loop_has_FP_comp(loop);
>
> You could cache the loop body and pass it in here.

That is a great idea, and in fact, I think I will do away with this
separate function completely for this patch. I can more efficiently
look for the FP computation while I am looking for the half word
stores. I'll do that and send a follow up with the new patch.

>
> Patch looks reasonable to me, but I cannot approve.

Thanks!
Teresa

>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

