This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: An unusual Performance approach using Synthetic registers


> The stack reordering pass posted by Sanjiv Gupta can do this.
> This was posted about a few days ago in gcc-patches.

Thanks for the pointer. I do not actively monitor the gcc-patches list, so I
didn't know about it. Anyway, this work shows that there seems to be at least
some benefit in rearranging the stack layout.

> What you're describing is actually bad on the Pentium, and probably
> subsequent implementations as well.
> The Pentium can dual-issue loads as long as they reference separate cache
> ways. So, manually sorting the stack so contiguous accesses are localized
> increases the probability of the loads accessing the same cache way, thus
> decreasing the probability of single-issuing.
>
> You guys really should read the processor manual instead of making
> incorrect assumptions about what features would improve the code quality.
>

I don't know whether it is bad for Pentium II or later processors. The
optimization guides from Intel for the Pentium II/III and for the Pentium 4
don't seem to mention that multiple accesses to the same cache line are bad
for performance.
http://developer.intel.com/design/pentiumii/manuals/245127.htm
http://developer.intel.com/design/pentium4/manuals/248966.htm


> > 2) Running the RA over the stack slots will cause the slots to be reused
> > when the life range of variables does not overlap. This even increases the
> > compactness that already gives the benefit of point 1. Also, overall
> > reducing stack usage will always be a small gain.
>
> The stack slots are already reused.

Are you sure about that? I thought it was mentioned on this list some time
ago that this was not the case. I guess I will write a small test program to
check this.
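
A test along these lines might look like the sketch below. It is only an
illustration under stated assumptions: the names touch() and
slots_are_shared() are hypothetical, and the result depends on the compiler
version, target and optimization level, so it shows what one build does
rather than what GCC does in general.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes to every byte of the buffer so the
   compiler has to give the buffer a real stack slot. */
static void touch(volatile char *p, size_t n)
{
    assert(p != NULL);
    for (size_t i = 0; i < n; i++)
        p[i] = (char)i;
}

/* Two buffers whose lifetimes do not overlap.  Returns 1 if both were
   placed at the same stack address (the slot was reused), 0 otherwise. */
int slots_are_shared(void)
{
    uintptr_t first, second;

    {
        volatile char a[64];
        touch(a, sizeof a);
        first = (uintptr_t)(volatile void *)a;   /* remember where a lived */
    }
    {
        volatile char b[64];
        touch(b, sizeof b);
        second = (uintptr_t)(volatile void *)b;  /* remember where b lived */
    }

    return first == second;
}
```

Calling slots_are_shared() from main() and printing the result at several
optimization levels would show whether the two buffers end up sharing a slot
in that particular build.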

> > 3) The "compact" memory access pattern and the reuse of stack slots might
> > increase the opportunity for the processor to use "shortcut" features in
> > memory access. For example successive writes to the same memory location
> > might be optimised to a single write, or read access to a memory location
> > may be fast if there is still a pending write on the same location
>
> The first feature you're describing is called "dead write elimination" and
> is already done in gcc.

There are a number of cases where the compiler can't eliminate the store.
For example, a store from within a loop followed by another store after the
loop, or a store / load / store cycle where the second store occurs before
the first one has retired.
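
One way the first pattern can show up in source code is sketched below. This
is an assumed illustration, not necessarily the case meant above: the
function name running_total() is hypothetical, and the example relies on
`total` possibly aliasing an element of `src`, which is what keeps the
compiler from dropping the store inside the loop.

```c
#include <assert.h>

/* Stores to *total inside the loop and once more after it.  Because
   `total` may point into `src`, the store in the loop has to be
   performed before the following load of src[i]. */
int running_total(int *total, const int *src, int n)
{
    int seen = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        *total = seen;     /* store from within the loop */
        seen += src[i];    /* load that may read the location just stored */
    }
    *total = seen;         /* another store after the loop */

    return seen;
}
```

Whether a given compiler removes either store here is not something the
sketch claims; it only shows why the two stores cannot be merged without
alias information.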

> The second feature you're describing is called "write FIFO snooping" and
> is a hardware feature.

Intel calls it "store-forwarding".


Marcel



