This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.



Re: Polyhedron shootout between g95 and gfortran


Paul,

I am puzzled with:

> I should add that your diagnosis is correct - dot_product is screwed up; 
> not just on timing grounds but also because the range of quotients that 
> it produces a good result for is quite limited. It is an occasion in 
> which fast-math is just that and better in its output.

and

> I got a bit stuck with this patch; largely because of REAL(10) 
> problems.  It is one of a very long list of TODOs that I will get back 
> to as soon as I can.

because I don't see the connection between dot_product and "quotients".
So I take it you assume I know that your patch contains two independent
improvements, one for mod(ulo) and another one for dot_product, and that
the lines above relate to the mod part.

My experiment with the mod part has shown little improvement and I agree 
with you that it does not address several issues (see below).

So I propose to split the patch into two pieces: one, to be included ASAP,
for the inlining of dot_product (as is, no threshold, just brute force),
and a second one for mod(ulo), waiting for solution(s) to the pending issues.
The reason I am pushing to get the dot_product inlining and the dependency
check included quickly is that maintaining the corresponding patches is
close to the limit of what I am able/ready to do (I don't plan to use CVS
and a daily build!). Until now the two patched files had not evolved and
I just had to copy them. In the last snapshot the dependency.c file
changed and I had to rebuild the patch (this part was easy), but
now I see the same is about to happen for the trans-intrinsic.c file.

As I said in my mail the intrinsic dot_product is not better than a plain 
do loop:

   AMD 2 GHz, 1 MB L2; timings in ns/element
           size    gfc      ifort     pgf
do loop       4   2.1355   3.1507   3.5479
intrinsic     4   4.7376   2.6445   4.1700
do loop     512   2.0508   0.8205   1.1264
intrinsic   512   2.0689   0.6687   1.1157
do loop    8192   2.4498   2.0294   2.1155
intrinsic  8192   3.3458   1.7611   2.1039

I obtained similar results on the G5 and I did not see any regression
in the timings with the patch. So I am inclined to say: apply the patch
as it is, without looking for a threshold between inlining and the
intrinsic, and let us see whether someone sees a regression on a particular
architecture. If not, everything will be fine; if so, it will be time to
look for a threshold for that particular architecture.
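
For reference, a minimal sketch of the "plain do loop" form that the table
compares against the dot_product intrinsic. It assumes double precision
arrays; the names dot_sketch and dot_loop are illustrative only and are not
taken from the patch or from the timing code.

```fortran
module dot_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Explicit loop computing the same sum as dot_product(x, y).
  ! Hypothetical helper, written only to show the loop form.
  pure function dot_loop(x, y) result(s)
    real(dp), intent(in) :: x(:), y(:)
    real(dp) :: s
    integer :: i
    s = 0.0_dp
    ! min() guards against arrays of different length.
    do i = 1, min(size(x), size(y))
      s = s + x(i) * y(i)
    end do
  end function dot_loop
end module dot_sketch
```

Calling dot_loop(x, y) and dot_product(x, y) on the same arrays in a timing
loop gives the two rows reported for each size.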

Before leaving the improvements which should be applied ASAP, I would
like to mention that with your patch there will be two mechanisms to
look for dependencies: yours and another one by Roger Sayle:

see  http://gcc.gnu.org/ml/fortran/2006-02/msg00063.html

It would probably be worth looking at possible overlap.


Now concerning the mod(ulo) part, I have seen the following problems:

(1) rounding errors for large quotients. I don't think they can be 
avoided, but in my opinion a "good" implementation should meet the 
following (ordered) criteria:
(a) 0<=mod(a,b)<b (a,b>0 and the corresponding inequalities for the other 
cases).
(b) trig(a)=trig(mod(a,2*pi)), where trig() is one of sin, cos, and tan;
note that this may imply a change in the trig functions.
(c) as many exact results as possible for mod(a,2.0), mod(a,3.0), ... when
a is a (large) integer.
(d) if the system has a trap for "loss of precision", it should be used
(why not return a NaN when the quotient is larger than 2.0**23 or 24
or 2.0d0**52 or 53?).
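
Criteria (a) and (d) can be written down as a small sketch. Everything here
is an assumption made for illustration: the names naive_mod, in_range and
guarded_mod are hypothetical, the sign rule in in_range is my reading of
"the corresponding inequalities" for the Fortran MOD intrinsic (result zero
or with the sign of a), and the 2.0d0**53 limit is just one of the candidate
thresholds mentioned in (d).

```fortran
module mod_criteria_sketch
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Textbook formula a - b*int(a/b); aint() keeps the quotient in
  ! floating point so it is not limited to the default integer range.
  elemental function naive_mod(a, b) result(r)
    real(dp), intent(in) :: a, b
    real(dp) :: r
    r = a - b * aint(a / b)
  end function naive_mod

  ! Criterion (a) for MOD: |r| < |b|, and r is zero or has the sign of a.
  elemental function in_range(r, a, b) result(ok)
    real(dp), intent(in) :: r, a, b
    logical :: ok
    ok = abs(r) < abs(b) .and. &
         (r == 0.0_dp .or. ((r > 0.0_dp) .eqv. (a > 0.0_dp)))
  end function in_range

  ! Criterion (d), in its "return a NaN" variant: refuse to answer when
  ! the quotient is too large for double precision to resolve.
  ! The 2**53 limit is an assumed choice, not a measured one.
  function guarded_mod(a, b) result(r)
    real(dp), intent(in) :: a, b
    real(dp) :: r
    if (abs(a / b) > 2.0_dp**53) then
      r = ieee_value(1.0_dp, ieee_quiet_nan)
    else
      r = mod(a, b)
    end if
  end function guarded_mod
end module mod_criteria_sketch
```

A test driver can then loop over a range of a and b, and report every pair
for which in_range(naive_mod(a, b), a, b) fails or for which naive_mod and
the intrinsic mod disagree.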

(2) speed improvement:
(a) Speed may be used as a criterion to select a subset of the above
constraints.
(b) The way to compute the quotient (integer part of a/b) is architecture
dependent.  For instance xlf is much faster than any GNU-related
compiler, starting from g77 (there is no machine instruction on the PowerPC
architecture to get the integer part).  I have disassembled the g95
implementation, which is very complicated and which I have been unable to
understand, while xlf uses a very simple mechanism: it sets the rounding
mode towards zero, then adds and subtracts 2.0d0**52 (real*8) to a/b. I
have implemented a similar mechanism directly in f*, with 'if' to avoid
fiddling with rounding modes, and it is considerably faster than the native
GNU implementations on a G5 (I did not test what happens on an AMD).
I did not go further because I did not have the time to check the impact
on the accuracy points above.
(c) I have seen few codes relying on intensive use of mod. So far I have
seen random number generators (ac.f90) and number theory codes. In these
two classes the b in mod(a,b) is unchanged in the time-consuming loops,
so there is a very simple hand optimization: compute 1/b once and then
use, say, a-b*int(a*(1/b)). I have no idea how this can be implemented
in a compiler, but it speeds up pix.f and ac.f90 by a factor of ~2 on a G5
(but not on an AMD).
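
Points (b) and (c) can be sketched as follows. This is not the code I
timed, only an illustration under stated assumptions: IEEE double
precision, the default round-to-nearest mode, and a compiler that honours
the parentheses in (ax + two52) - two52. The names trunc52 and mod_recip
are hypothetical. In round-to-nearest mode the add/subtract leaves an
integer next to |x|, and the 'if' undoes the case where it rounded up,
which replaces the switch to round-towards-zero.

```fortran
module fast_mod_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), parameter :: two52 = 2.0_dp**52
contains
  ! Integer part of x (truncation towards zero) without an int() call.
  function trunc52(x) result(t)
    real(dp), intent(in) :: x
    real(dp) :: t
    real(dp) :: ax
    ! volatile forces a store, so r is rounded to double precision
    real(dp), volatile :: r
    ax = abs(x)
    if (ax >= two52) then
      t = x                        ! no fractional bits left at this size
    else
      r = (ax + two52) - two52     ! ax rounded to a neighbouring integer
      if (r > ax) r = r - 1.0_dp   ! rounded up: step back to the floor
      t = sign(r, x)               ! restore the sign of x
    end if
  end function trunc52

  ! mod(a,b) with the reciprocal rb = 1/b computed once by the caller.
  ! a*rb can differ from a/b in the last bit, so the result is not
  ! guaranteed to satisfy criterion (a) when a/b is close to an integer.
  function mod_recip(a, b, rb) result(r)
    real(dp), intent(in) :: a, b, rb
    real(dp) :: r
    r = a - b * trunc52(a * rb)
  end function mod_recip
end module fast_mod_sketch
```

In a time-consuming loop one would set rb = 1.0_dp/b before the loop and
call mod_recip(a, b, rb) inside it; whether this is faster than the native
mod has to be measured on each architecture.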

To summarize, I don't have the capabilities (nor the time/will to acquire
them) to do the patches myself, but I am ready to do some testing and to
participate in discussions along the above lines.

Dominique

