This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
> > A libcall is not faster than a rep sequence up to 8KB, and the rep sequence is
> > better for regalloc/code cache than a full-blown function call.
>
> Be careful with this. My recollection is that REP sequence is good for
> any size -- for smaller size, the REP initial set up cost is too high
> (10s of cycles), while for large size copy, it is less efficient
> compared with library version.
Well, this is based on the data from the memtest script.
Core has a good REP implementation - it is a win from rather small blocks (16
bytes if I recall) and it does not need alignment.
The library version starts to be interesting with caching hints, but I think up to 80KB
it is still not a win on my setup (glibc-2.15).
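
To make the comparison concrete, here is a minimal sketch of the kind of
measurement involved. It is not the memtest script itself; rep_copy,
libcall_copy and time_copies are hypothetical names, and it assumes a
GCC-compatible compiler (inline asm) with the direction flag clear, as the
x86 ABIs require on function entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical helper: copy n bytes with an inline REP MOVSB on x86,
   and with a plain byte loop on any other target.  This is only an
   illustration, not the sequence GCC actually expands.  */
static void *rep_copy(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__x86_64__) || defined(__i386__)
    /* RDI/EDI = destination, RSI/ESI = source, RCX/ECX = byte count.  */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
    return ret;
}

/* The "libcall" side of the comparison: whatever memcpy the C library
   (or the compiler's own expansion) provides.  */
static void *libcall_copy(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Return the CPU seconds spent on `iters` copies of `n` bytes.  */
static double time_copies(copy_fn fn, void *dst, const void *src,
                          size_t n, long iters)
{
    clock_t start = clock();
    for (long i = 0; i < iters; i++) {
        fn(dst, src, n);
        /* Compiler barrier so the copies are not optimized away.  */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_copies with rep_copy and then libcall_copy over a range of block
sizes (say 16 bytes up to a few hundred KB) gives a rough crossover point.
The numbers depend heavily on the CPU and the glibc version, so they should
be treated as indicative only.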
> >> >
> >> > /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
> >> > * on 16-bit immediate moves into memory on Core2 and Corei7. */
> >> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
> >> > m_K6,
> >> >
> >> > /* X86_TUNE_USE_CLTD */
> >> > - ~(m_PENT | m_ATOM | m_K6),
> >> > + ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),
>
> My change was to enable CLTD for generic. Is your change intended to
> revert that?
No, it is a merge conflict, sorry. I will update it in my tree.
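
For readers following the hunk: each entry in the tuning table is a bitmask
over processors, so ~(...) means "enabled everywhere except the listed CPUs".
The sketch below is a simplified model of that, not the code in i386.c; the
shortened processor enum, the M macro and tune_enabled are hypothetical
names used only for illustration.

```c
#include <assert.h>

/* Hypothetical, shortened processor list; GCC's real enum is longer.  */
enum processor { CPU_PENT, CPU_ATOM, CPU_K6, CPU_CORE2, CPU_GENERIC, CPU_MAX };

#define M(cpu)    (1u << (cpu))
#define m_PENT    M(CPU_PENT)
#define m_ATOM    M(CPU_ATOM)
#define m_K6      M(CPU_K6)
#define m_GENERIC M(CPU_GENERIC)

/* The '-' line of the quoted hunk: CLTD is used on generic.  */
static const unsigned int use_cltd_without_generic =
    ~(m_PENT | m_ATOM | m_K6);

/* The '+' line of the quoted hunk: CLTD is no longer used on generic.  */
static const unsigned int use_cltd_with_generic =
    ~(m_PENT | m_ATOM | m_K6 | m_GENERIC);

/* A feature is on for a CPU when that CPU's bit is set in the mask.  */
static int tune_enabled(unsigned int feature_mask, enum processor cpu)
{
    assert(cpu < CPU_MAX);
    return (feature_mask & M(cpu)) != 0;
}
```

With this model, tune_enabled(use_cltd_without_generic, CPU_GENERIC) is 1 and
tune_enabled(use_cltd_with_generic, CPU_GENERIC) is 0, which is why adding
m_GENERIC inside the negated mask reads as reverting the change that enabled
CLTD for generic.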
> > Skipping inc/dec is to avoid a partial flag stall that happens on P4 only.
> >> >
>
>
> K8 and K10 partition the flags into groups. References to flags in
> the same group can still cause the stall -- not sure how that can be
> handled.
I believe the stalls happen only in quite special cases where a compare instruction
combines flags from multiple instructions. GCC doesn't generate this type of code, so
we should be safe.
Honza