This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: RFA: patch - tuning gcc for Intel Nocona (64 bit).


Jan Hubicka wrote:

Actually my patch prevents generation of inc/dec for Nocona.

Looking at your patch, I see an important difference in the latency times for
loading/storing (and sometimes moving) integer/floating point registers
(and MMX, which is not important for benchmarks). I also increased the
move ratio. I think they may be different from the latency times given by


The move ratios in my patch are based on latencies. I have no idea what
the load/store latency is, so I just put in some numbers I wanted to
experiment with later.


It is in the first chapter of Intel's optimization manual. It is 4 for int and 12 for fp (up from 2 and 9 for Northwood). It is definitely designed for higher frequency (Intel's marketing drum). I wish they would solve the heat dissipation problem for Prescott/Nocona, which is much less of a problem for AMD. (Governments should give a rebate for processors with less heat dissipation, as they do for appliances.)

Why do you expect larger move ratios to give better results?


I meant MOVE_RATIO. I used the same value as for K8/Athlon. The bigger the blocks, the better the behaviour for an OOO processor (but we should remember the trace cache size). There are a lot of parameters to play with; according to my experience this process could be infinite. It is possible to play with register move/load/store times. I've made them smaller to take the OOO nature into account; in theory they should be 2 times larger.

The relation of load/store latency times affects the register allocator (register assigning). The register allocator itself uses only frequency, nrefs, and live range length, but the cost is implicitly in the register classes. So changing the relation may change the preferred and alternative classes. But I am sure you know more than I wrote; just in case.



Intel, because of the OOO nature of the processor.  So somebody could play
with these parameters to get a better result.

I've just finished SPEC2000 and got the same results, so I confirmed
the previous results.  What started me on the tuning was the incredibly
low Linpack results:

+++++ 5: Linpackc unrolled double precision ++++++
user 0m0.470s
user 0m0.490s
text data bss dec hex filename
15616 608 646896 663120 a1e50 ./a1.out
15528 608 646896 663032 a1df8 ./a2.out
./a1.out ./a2.out differ: byte 209, line 1
none:   Unrolled Double Precision 277.30 Mflops
nocona: Unrolled Double Precision 944.25 Mflops


x86_64 was tuned for K8. The code is probably very bad for Nocona in
64-bit mode. It would be interesting to know why there is such a
difference, but that will take time which I don't have right now.



I see. My tests were relative to the previous -march=nocona code generation (which IMO makes more sense... :)


I thought about this. Earlier, nocona meant only the use of SSE3. I checked the difference of -mtune=nocona vs. -mtune=nocona -mno-sse3. Only two generated test codes are different (perl and eon); all SPECfp test codes are the same. So I believe the results should be the same (at least for SPECfp).


I dug out the results relative to K8 and they are consistent with yours
(though I have only the C part of SPECfp, which is not that interesting).

The main difference comes from the K8-optimized SSE reg-reg and load code
generation, which prevents almost any OOO reordering on Pentium 4 based
cores. If we get into the business of generating code that works well on
both chips, I guess we can use just the natural move instructions (movsd
for loads/reg-reg moves), which have just a moderate penalty on K8.


At my first look at Whetstone, the biggest difference is the use of movsd for Nocona. The second is the absence of inc/dec. The third one I found is fewer multiplies.





