This is the mail archive of the
fortran@gcc.gnu.org
mailing list for the GNU Fortran project.
Re: Polyhedron shootout between g95 and gfortran
- From: dominiq at lps dot ens dot fr (Dominique Dhumieres)
- To: paulthomas2 at wanadoo dot fr, dominiq at lps dot ens dot fr
- Cc: fortran at gcc dot gnu dot org
- Date: Thu, 23 Feb 2006 11:18:44 +0100
- Subject: Re: Polyhedron shootout between g95 and gfortran
- References: <43FCEE30.mailH9C11EVO4@tournesol.lps.ens.fr> <43FCF92F.308@wanadoo.fr>
Paul,
I am puzzled by:
> I should add that your diagnosis is correct - dot_product is screwed up;
> not just on timing grounds but also because the range of quotients that
> it produces a good result for is quite limited. It is an occasion in
> which fast-math is just that and better in its output.
and
> I got a bit stuck with this patch; largely because of REAL(10)
> problems. It is one of a very long list of TODOs that I will get back
> to as soon as I can.
because I don't see the connection between dot_product and "quotients".
I assume you take it for granted that I know your patch contains two
independent improvements, one for mod(ulo) and one for dot_product, so
the lines above seem to refer to the mod part.
My experiment with the mod part has shown little improvement and I agree
with you that it does not address several issues (see below).
So I propose to split the patch into two pieces: one, to be included ASAP,
for the inlining of dot_product (as such, no threshold, just brute force),
and a second one for mod(ulo), pending solution(s) to the open issues.
The reason why I am pushing for the dot_product inlining and the
dependency check to be included quickly is that the corresponding patches
are close to the threshold of what I am able/ready to do (I don't plan to
use CVS and a daily build!). Until now the two patched files had not
changed and I just had to copy them. In the last snapshot the dependency.c
file has been changed and I have had to rebuild the patch (this part was
easy), but now I see the same is about to happen for the trans-intrinsic.c
file.
As I said in my mail the intrinsic dot_product is not better than a plain
do loop:
AMD 2 GHz, 1 MB L2, ns/element

            size     gfc   ifort     pgf
do loop        4  2.1355  3.1507  3.5479
intrinsic      4  4.7376  2.6445  4.1700
do loop      512  2.0508  0.8205  1.1264
intrinsic    512  2.0689  0.6687  1.1157
do loop     8192  2.4498  2.0294  2.1155
intrinsic   8192  3.3458  1.7611  2.1039
I obtained similar results on the G5 and I did not see any regression
in the timings with the patch. So I am inclined to say: apply the patch
as it is, without looking for a threshold between inlining and the
intrinsic, and let us see whether someone sees a regression on a particular
architecture. If not, everything will be fine; if yes, it will be time to
look for a threshold for that particular architecture.
Before leaving the improvements which should be applied ASAP, I would
like to mention that with your patch there will be two mechanisms to
look for dependencies: yours and another one by Roger Sayle:
see http://gcc.gnu.org/ml/fortran/2006-02/msg00063.html
It would probably be worth looking at possible overlap.
Now concerning the mod(ulo) part, I have seen the following problems:
(1) rounding errors for large quotients. I don't think they can be
avoided, but in my opinion a "good" implementation should meet the
following (ordered) criteria:
(a) 0<=mod(a,b)<b (a,b>0 and the corresponding inequalities for the other
cases).
(b) trig(a)=trig(mod(a,2*pi)), where trig() is one of sin, cos, and tan,
note this may imply a change in the trig functions.
(c) as many exact results as possible for mod(a,2.0), mod(a,3.0), ... when
a is a (large) integer.
(d) if the system has a trap for "loss of precision", it should be used
(why not return a NaN when the quotient is larger than 2.0**23 or 24
or 2.0d0**52 or 53?).
(2) speed improvement:
(a) It may be used as a criterion to select some subset of the above
constraints.
(b) The way to compute the quotient (integer part of a/b) is architecture
dependent. For instance xlf is much faster than any GNU-related
compiler, starting from g77 (there is no machine instruction on the PowerPC
architecture to get the integer part). I have disassembled the g95
implementation, which is very complicated and which I have been unable to
understand, while xlf uses a very simple mechanism: it sets the rounding
towards zero, then adds and subtracts 2.0d0**52 (real*8) to a/b. I have
implemented a similar mechanism directly in Fortran, with 'if' to avoid
fiddling with rounding modes, and it is considerably faster than the native
GNU implementations on a G5 (I did not test what happens on an AMD).
I did not go further because I did not have the time to check the impact
on the accuracy points above.
(c) I have seen few codes relying on intensive use of mod. So far I have
seen random number generators (ac.f90) and number theory codes. In these
two classes the b in mod(a,b) is unchanged in the time-consuming loops,
so there is a very simple hand optimization: compute 1/b once and then
use, say, a-b*int(a*(1/b)). I have no idea how this can be implemented
in a compiler, but it speeds up pix.f and ac.f90 by a factor ~2 on a G5
(but not on an AMD).
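
For illustration, the ideas in (1)(d), (2)(b) and (2)(c) can be sketched
together as below. This is a minimal sketch in Python (whose floats are
IEEE doubles) rather than Fortran, and it is not the implementation
mentioned in (2)(b), whose source is not shown here. The names MAGIC,
trunc_magic and make_mod are hypothetical. It assumes the default
round-to-nearest mode and uses an 'if' correction instead of switching the
rounding mode; the NaN guard is only one possible reading of (1)(d).

```python
import math

# Assumption: IEEE double precision. Above 2**52 every double is an integer.
MAGIC = 2.0 ** 52


def trunc_magic(q):
    """Integer part of q (rounded toward zero) without an int conversion.

    Adding and subtracting 2**52 rounds |q| to the nearest integer under
    round-to-nearest; the 'if' then undoes any rounding upward.
    """
    aq = abs(q)
    if aq >= MAGIC:
        return q                    # already integer-valued
    r = (aq + MAGIC) - MAGIC        # nearest integer to aq (ties to even)
    if r > aq:                      # rounded up: step back to the floor
        r -= 1.0
    return math.copysign(r, q)      # restore the sign (truncation toward zero)


def make_mod(b, limit=MAGIC):
    """Return a function computing mod(a, b) for a fixed b.

    1/b is computed once, as in the hand optimization a-b*int(a*(1/b)).
    When the quotient is too large to carry any fractional information,
    NaN is returned instead of a meaningless value.
    """
    inv_b = 1.0 / b

    def mod_b(a):
        q = a * inv_b               # approximate quotient, no division
        if abs(q) >= limit:
            return math.nan         # "loss of precision" signalled as NaN
        return a - b * trunc_magic(q)

    return mod_b


if __name__ == "__main__":
    mod7 = make_mod(7.0)
    for a in (0.5, 10.25, 23.0, 1000.5, -23.0, 2.0 ** 60):
        print(a, mod7(a), math.fmod(a, 7.0))
```

Because a*(1/b) is only an approximation of a/b, the quotient can be off by
one when a/b is very close to an integer, so this shortcut does not by
itself guarantee criterion (1)(a); a production version would need a final
range check.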
To summarize, I don't have the capabilities (nor the time/will to acquire
them) to do the patches myself, but I am ready to do some testing and to
participate in discussions along the above lines.
Dominique