:REVIEWPATCH:
Janne and Tobi,
Thank you for your comments. For the reasons that I outline below, I am
withdrawing this patch for repairs. I will resubmit once I have a
satisfactory solution for switching optimally between the library and
inline versions.
> BTW, does BLAS do anything fancy for dot product that might help for
> big vectors? I mean, is it worth thinking about inlining only for
> small vectors? What happens when the vectors won't fit into cache?
> I'm not saying this as a criticism of your patch, just idle
> wondering...
I have found circumstances where the library and inline DOT_PRODUCT
execution times cross over at lengths ~32.
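To make the crossover concrete, here is a minimal C sketch of the kind of dispatch the patch could emit: inline the loop below the crossover length and fall back to the library routine above it. The threshold constant, the function names, and the stand-in library routine are all illustrative assumptions, not the actual gfortran/libgfortran entry points.

```c
#include <stddef.h>

/* Crossover length observed in the timings below; illustrative only. */
#define INLINE_CROSSOVER 32

/* Stand-in for the libgfortran dot-product routine (hypothetical name). */
static double library_dot_product(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

static double dot_product_dispatch(const double *a, const double *b, size_t n)
{
    if (n <= INLINE_CROSSOVER) {
        /* Short vectors: the inlined loop avoids the call overhead. */
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }
    /* Long vectors: the library version wins once per-element cost
       dominates the fixed call overhead. */
    return library_dot_product(a, b, n);
}
```

In the real patch the decision would be made at compile time when the length is a known constant; a runtime branch like the one above would only be needed for lengths unknown until execution.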
The timing test that I supplied with the patch is a complete aberration
on my part, as a quick examination will reveal, and its results should
be ignored. Please find attached a test which is more "realistic"
(i.e. correct):
With -O3 -ffast-math, I obtain:
DOT_PRODUCT test

array length    library      inline time(ns)
                time(ns)   -ve stride  (+ve stride)
      4           57.90       34.20      (22.00)
      8           58.70       34.90      (22.20)
     16           94.70       76.60      (67.20)
     32          140.40      133.30     (112.50)
     64          230.60      247.20     (203.70)
    128          412.80      473.50     (385.40)
    256          775.10      927.20     (748.00)
    512         1500.80     1833.40    (1472.40)
   1024         2949.30     3645.70    (2921.10)
The time for the library function does not depend on the stride of the
arguments of DOT_PRODUCT. What I learn from this is:
(i) Lengths 4 and 8 are too fast for the timer resolution.
(ii) There is an overhead of ~25ns for the library function call.
(iii) With a positive stride, the inline version retains this advantage
over the library at every vector length measured.
(iv) The more complicated scalarizer index arithmetic makes the inline
version slower per element, so that with a negative stride the advantage
is lost between vector lengths of 32 and 64.