This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.



Re: [Patch, fortran] PR24518 and PR24520 - Improvements to MOD and


Paul Thomas wrote:
:REVIEWPATCH:

Janne and Tobi,

Thank you for your comments. For the reasons outlined below, I am withdrawing this patch for repairs. I will resubmit it once I have a satisfactory way of switching between the library and inline versions at the optimal point.

BTW, does BLAS do anything fancy for dot product that might help for
big vectors? I mean, is it worth thinking about inlining only for
small vectors? What happens when the vectors won't fit into cache?
I'm not saying this as a criticism of your patch, just idle
wondering...

I have found circumstances where the library and inline DOT_PRODUCT execution times cross over at lengths ~32.

The timing test that I supplied with the patch is a complete aberration on my part, as a quick examination will reveal, and the results should be ignored completely.

Please find attached a test which is more "realistic" (i.e. correct):

With -O3 -ffast-math, I obtain:

DOT_PRODUCT test      library                  inline
 array length        time (ns)               time (ns)
                                      -ve stride  (+ve stride)
      4               57.90               34.20     (22.00)
      8               58.70               34.90     (22.20)
     16               94.70               76.60     (67.20)
     32              140.40              133.30    (112.50)
     64              230.60              247.20    (203.70)
    128              412.80              473.50    (385.40)
    256              775.10              927.20    (748.00)
    512             1500.80             1833.40   (1472.40)
   1024             2949.30             3645.70   (2921.10)

The time for the library function does not depend on the stride of the arguments of DOT_PRODUCT. What I learn from this is:

(i) Lengths 4 and 8 are too fast for the timer resolution.
(ii) There is an overhead of ~25ns for the library function call.
(iii) With a positive stride, this difference between the library and the inline versions persists at every vector length.
(iv) With a negative stride, the more complicated scalarizer index arithmetic makes the inline version slower per element, so that its advantage is lost between vector lengths of 32 and 64.
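
The arithmetic behind points (ii) to (iv) can be checked directly from the table. The following is only a sketch: the numbers are copied from the table above, while the helper names (call_overhead, crossover, per_element) are made up for illustration and are not part of the attached test. The per-element cost is estimated crudely from the two longest vectors.

```python
# Timings copied from the table above (ns per DOT_PRODUCT call).
# Columns: length, library, inline (-ve stride), inline (+ve stride)
timings = [
    (4, 57.90, 34.20, 22.00),
    (8, 58.70, 34.90, 22.20),
    (16, 94.70, 76.60, 67.20),
    (32, 140.40, 133.30, 112.50),
    (64, 230.60, 247.20, 203.70),
    (128, 412.80, 473.50, 385.40),
    (256, 775.10, 927.20, 748.00),
    (512, 1500.80, 1833.40, 1472.40),
    (1024, 2949.30, 3645.70, 2921.10),
]


def call_overhead(rows):
    """Library time minus positive-stride inline time, for each length."""
    return [(n, round(lib - pos, 2)) for n, lib, _neg, pos in rows]


def crossover(rows):
    """First length at which the negative-stride inline is slower than the library."""
    for n, lib, neg, _pos in rows:
        if neg > lib:
            return n
    return None


def per_element(rows, column):
    """Slope in ns/element between the two longest vectors.

    column: 1 = library, 2 = inline -ve stride, 3 = inline +ve stride.
    """
    r0, r1 = rows[-2], rows[-1]
    return (r1[column] - r0[column]) / (r1[0] - r0[0])


if __name__ == "__main__":
    for n, diff in call_overhead(timings):
        print(f"length {n:5d}: library - inline(+ve) = {diff:6.2f} ns")
    print("negative-stride inline first slower at length", crossover(timings))
    for name, col in (("library", 1), ("inline -ve", 2), ("inline +ve", 3)):
        print(f"{name:11s}: {per_element(timings, col):.2f} ns/element")
```

On these figures the library-minus-inline difference for positive stride stays between about 27 and 28 ns from length 16 upwards, the negative-stride inline first falls behind the library at length 64, and the slope between lengths 512 and 1024 is about 2.83 ns/element for both the library and the positive-stride inline against about 3.54 ns/element for the negative-stride inline.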


We showed that the library version of dot_product can be sped up significantly for lengths >= 8. If the compiler implements inlining for short lengths, we would have more incentive to optimize the library for longer lengths.

