This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
It is in the 1st chapter of Intel's optimization manual. It is 4 for int and 12 for fp (up from 2 and 9 for Northwood). It is definitely designed for higher frequency (Intel's marketing drum; I wish them to solve the heat dissipation problem for Prescott/Nocona, which is much less of a problem for AMD. The governments should give a rebate for processors with less heat dissipation, as they do for appliances).

Actually my patch prevents generation of inc/dec for Nocona.
Looking at your patch I see an important difference in the latency times of
loading/storing (sometimes moving) integer/floating-point registers
(and MMX, which is not important for benchmarks). I also increased the
move ratio. I think they may be different from the latency times given by
Intel because of the OOO nature of the processor. So somebody could play
with these parameters to get a better result.

The move ratios in my patch are based on latencies; I have no idea what
the load/store latency is, so I just put in some numbers I wanted to
experiment with later. Why do you expect larger move ratios to give
better results?

I meant MOVE_RATIO. I used the same value as for K8/Athlon. The bigger the blocks, the better the behaviour for an OOO processor (but we should remember the trace cache size). There are a lot of parameters to play with; this process could be infinite, according to my experience. It is possible to play with register move/load/store times. I've made them smaller to take the OOO nature into account; in theory they should be two times larger.
I've just finished SPEC2000 and got the same results, so I confirmed the previous results. What started me on the tuning was the incredibly low Linpack result:
+++++ 5: Linpackc unrolled double precision ++++++
user 0m0.470s
user 0m0.490s
text data bss dec hex filename
15616 608 646896 663120 a1e50 ./a1.out
15528 608 646896 663032 a1df8 ./a2.out
./a1.out ./a2.out differ: byte 209, line 1
none:   Unrolled Double Precision  277.30 Mflops
nocona: Unrolled Double Precision  944.25 Mflops
x86_64 was tuned for K8. The code is probably very bad for Nocona in
64-bit mode. It would be interesting to know why there is such a
difference, but that will take time, which I do not have right now.
I see. My tests were relative to the previous -march=nocona code generation (which IMO makes more sense... :)
At my first look at Whetstone, the biggest difference is the usage of movsd for Nocona. The second is the absence of inc/dec. The third one I found is fewer multiplies.

I dug out the results relative to K8 and they are consistent with yours (though I have only the C part of SPECfp, which is not that interesting).
The main difference comes from the K8-optimized SSE reg-reg and load code
generation, which prevents almost any OOO reordering on Pentium 4 based
cores. If we get into the business of generating code that works well on
both chips, I guess we can just use the natural move instructions (movsd
for loads/reg-reg moves), which have only a moderate penalty on K8.