:REVIEWPATCH:
Janne and Tobi,
Thank you for your comments. For the reasons that I outline below, I am
withdrawing this patch for repairs. I will resubmit once I have a
satisfactory solution for switching optimally between the library and
inline versions.
> BTW, does BLAS do anything fancy for dot product that might help for
> big vectors? I mean, is it worth thinking about inlining only for
> small vectors? What happens when the vectors won't fit into cache?
> I'm not saying this as a criticism of your patch, just idle
> wondering...
I have found circumstances where the library and inline DOT_PRODUCT
execution times cross over at lengths ~32.
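To make the crossover concrete, here is a minimal C sketch of the kind of dispatch the patch could emit: inline the loop below the crossover length and fall back to the library routine above it. The threshold constant, the function names, and the stand-in library routine are all illustrative assumptions, not the actual gfortran/libgfortran entry points.

```c
#include <stddef.h>

/* Crossover length observed in the timings below; illustrative only. */
#define INLINE_CROSSOVER 32

/* Stand-in for the libgfortran dot-product routine (hypothetical name). */
static double library_dot_product(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

static double dot_product_dispatch(const double *a, const double *b, size_t n)
{
    if (n <= INLINE_CROSSOVER) {
        /* Short vectors: the inlined loop avoids the call overhead. */
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }
    /* Long vectors: the library version wins once per-element cost
       dominates the fixed call overhead. */
    return library_dot_product(a, b, n);
}
```

In the real patch the decision would be made at compile time when the length is a known constant; a runtime branch like the one above would only be needed for lengths unknown until execution.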
The timing test that I supplied with the patch is a complete aberration
on my part, as a quick examination will reveal, and its results should
be ignored. Please find attached a test which is more "realistic"
(i.e. correct):
With -O3 -ffast-math, I obtain:
DOT_PRODUCT test

array length    library      inline time(ns)
                time(ns)   -ve stride  (+ve stride)
      4           57.90       34.20      (22.00)
      8           58.70       34.90      (22.20)
     16           94.70       76.60      (67.20)
     32          140.40      133.30     (112.50)
     64          230.60      247.20     (203.70)
    128          412.80      473.50     (385.40)
    256          775.10      927.20     (748.00)
    512         1500.80     1833.40    (1472.40)
   1024         2949.30     3645.70    (2921.10)
The time for the library function does not depend on the stride of the
arguments of DOT_PRODUCT. What I learn from this is:
(i) Lengths 4 and 8 are too fast for the timer resolution.
(ii) There is an overhead of ~25ns for the library function call.
(iii) With a positive stride, the inline version retains this advantage
over the library at every vector length measured.
(iv) The more complicated scalarizer index arithmetic makes the inline
version slower per element, so that with a negative stride the advantage
is lost between vector lengths of 32 and 64.