Re: Bug with g77 and -mieee on Alpha Linux


>> Perhaps, but, as I pointed out earlier, almost *any* program can generate
>> denormals if it is compiled to randomly spill 80-bit values as 32-bit
>> or 64-bit values and tries to compare them to (nearly) identically
>> computed results to see if they're approximately equal.
>
>I'm pretty sure that ours don't - in spite of the fact that they perform
>somewhere between 10^10 and 10^11 floating point operations per run.

But *yours* isn't the only code on the planet, correct?

I mean, c'mon, you *constantly* write about how people *shouldn't* (or
"don't", despite substantial evidence on these lists to the contrary)
write floating-point code, as if it applied to the entire planet's
past, present, *and* future use of Fortran.

And now you're saying "but *we* don't write code that way"?

Again, I don't know where *anyone* can find freely available information
on how they *are* supposed to write floating-point code.  I believe
such information exists as it pertains to *other* compilers, which
behave in a more predictable fashion (e.g. we've had reports that at
least some do 80-bit spills on IA32), but not as it pertains to g77.

>Why ?  Because in the problem domain we're operating in, floating point
>numbers have values that are far from FLT_MAX and FLT_MIN.  In our case
>that is because if you express the physical quantities in the earthly
>atmosphere in SI units, you *get* values that are far from these
>extremes.

How nice...for *you*.

>My point is that every physicist engaging in numerical analysis will
>choose his/her units to get this effect.  This is not surprising, as
>people like to talk about the major phenomena they study using small
>numbers.  So a cosmologist uses Mega- or Giga-parsecs, not meters, to
>measure distance.
>
>The (automatic) side effect of this "change of units" is that the
>numerical algorithms used to describe those phenomena will stay out of
>the complexities of having to deal with denormals *unless they are
>unstable or operating outside their domain of validity*.
>
>Of course one could always set up a scenario where, using the fact that
>you cannot know whether a floating point value is 80-, 64- or 32-bits
>wide, you can generate small differences (which should have been zero)
>that, suitably multiplied with each other, will generate a denormal
>(especially if you start close to FLT_MIN).  My point is that you have
>to cleverly set this up, because it will not show up in normal practice.

I am rapidly coming to the conclusion that all of my efforts on g77
are, according to your logic, legitimately applicable to only about
50 truly expert programmers (in floating-point arithmetic) worldwide.

That might explain the stunning lack of breadth in funding I've received,
especially as this "beta" version of g77 has matured.

Needless to say, I'm *also* rapidly losing interest in investing much
more of my time making g77 even *more* attractive to programmers who,
according to you, will be writing incorrect programs, especially since
they seem to have no reasonable way to find out how to write *correct*
ones, and since *I* am nowhere near having sufficient knowledge to
either document this *or* implement and maintain a compiler (which includes
responding to bug reports) that is intended for use in this domain.

So I believe I'm approaching a choice:

  -  Mandate that g77 shall, as of the rewrite (0.6), default to precise
     (e.g. 80-bit) spills of intermediate computations, to full-range support
     of the underlying floating-point *type* (-mieee), etc., even if that
     means effectively forking gcc or writing a new compiler from scratch
     (neither of which I'm likely to undertake myself as part of doing
     the rewrite, of course).  By "mandate" I mean "refuse to work on,
     maintain, or respond to bug reports concerning any product that
     incorporates the 0.6 front end but doesn't meet my standards".  Not
     quite the DJ Bernstein approach (for qmail anyway), but one that's
     entirely reasonable, I believe.

  -  Leave working on g77 to those who actually understand how to write
     code (and therefore maintain compilers) that carefully navigates the
     murky waters left by *not* choosing defaults based on
     my understanding of what it takes to achieve overall robustness.

That means it's quite likely that the only way I can truly contribute
effectively to g77 *and its user base* (beyond a tiny handful of experts,
anyway) is to convince the GCC steering committee to mandate 80-bit
spills, -mieee as a default, etc.  That seems unlikely, though at least,
while working on *that*, I can proceed with the rewrite, then effectively
abandon its incarnation as a gcc product when failure to convince becomes
clear, and offer it (since it is GPL'ed software), along with my (limited)
services, to the community to put into a new compiler project -- one that
meets the requirements I believe the *larger* number-crunching community
has, or which, at least, I can *explain* without tying myself into knots.
(Whether such an offer would be taken up, I could not control, of course.)

The choice is rather stark simply because, the more I read about your
definition of what constitutes proper floating-point programming, the
*more* confused and, frankly, "frightened" I get, regarding the viability
of actually successfully understanding it or getting other people to
understand it sufficiently to write robust code to your standard.

Whereas, throughout my life, the more I've read about stuff like designing
compilers, the *less* confused and worried I've gotten regarding the
viability of *that*.  Which certainly strongly suggests where I should
devote my efforts (e.g. continue compiler development, avoid Fortran
or other loosely-specified languages used for floating-point work).

For example, ANSI FORTRAN 77 says "The value of a [variable] does not change
until the [variable] [is changed in a way clearly visible in the source
code]".

Now, putting aside any language-lawyer hat I might otherwise be willing to
wear ("but can a value be an *approximation*, the *precise* value of which
is undefined at any given point during execution?"), it seems to me that
means, quite clearly to any *typical* reader of the standard, that

      Y = 1.
      Z = 3.
      X = Y / Z
      IF (X .LT. X) PRINT *, 'FAIL'
      END

must *never*, under *any* circumstances, print "FAIL", because X can have
only *one* value, and the .LT. operator cannot reasonably be interpreted
(again, even if the standard can be *argued* to allow it) to compare
*approximations* (two *different* ones) of the *one* value in a variable
being compared to itself.

But, according to your interpretation (from last December, IIRC), the
compiler may, at any time and in any way of its choosing, choose to
use excess precision to compute results of computations *and* randomly
chop those results down to less precision, so the IF statement could
be interpreted as if it read:

  IF ((X - roundoff) .LT. X) ...

Which is, in fact, something gcc can be easily convinced to do, with
perhaps only a bit more coding, these days.  (The changes necessary to
provoke this probably involve straight assignment.  Again, from a
language-lawyer perspective, you could perhaps argue that straight
assignment is permitted to assign yet another *approximation* -- but I'm
not even going to look that up in the standard, because, even if *allowed*,
it's too poor a quality of implementation in my view.  Of course, I'm
referring here to an assignment to the same type, since it makes sense
that conversion, especially to a less-precise type, involves another
approximation.)
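
To make the mechanism concrete, here's a minimal sketch of the failure
mode (hypothetical, of course -- whether it actually fires depends on
the register allocator and optimization level of a particular build):

      Y = 1.
      Z = 3.
      X = Y / Z
      T = X
C     If X stays in an 80-bit register while T is stored to
C     memory as a 32-bit REAL, two "equal" variables compare
C     unequal, and their difference is tiny but nonzero.
      IF (X .NE. T) PRINT *, 'SPILL CHANGED THE VALUE:', X - T
      END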

I find that increasingly unacceptable, because I can't explain to programmers
how to avoid it, especially without creating the sorts of denormals (very-
low-magnitude values) you say code should never create.

Now, your *new* justification appears to be that no floating-point
programs worth their salt ever compute values anywhere *near* the
boundaries of denormals, Infs (which I guess crash on Alphas compiled
using the defaults), etc., so this can "never happen".

Problems with *that* logic include:

  -  The fact that we don't, anywhere I can see, document that programmers
     must stay away from those limits (despite widespread advertising of
     the "supported range and precision" of the underlying types by
     hardware manufacturers), which, to encourage portability, must
     necessarily represent the *intersection* of all pertinent limits on
     all machines.

  -  The fact that we don't offer any option (let's call it `-fcheck-toon'
     ;-) programmers can use to more cleanly catch any computations their
     code does resulting in these "inappropriate" values, leaving it to
     the underlying implementation to crash, burn, or simply compute
     incorrect results due (at least partly) to the defaults we've chosen
     for how to compile code.  And since we have no real spec on those
     limits, `-fcheck-toon' seems to me almost as hard to implement as, say,
     `-fcheck-whether-my-code-does-what-I-want'.

  -  The fact that there's no clear, concise, industry-wide spec on
     programming to *these* limits.  (There is one on programming to
     IEEE 754 limits: it's called IEEE 754.  But we appear to be
     inventing our own standard, so we can't just tell people to use
     *that* standard, even if they limit their use of g77/gcc to
     machines that implement it as a native, or nearly native, type.)

  -  The fact that we are unlikely to ever be able to offer high-speed
     computation of an extended type that offers denotations of out-of-
     range values ("too small", "too large", etc.) for code that wants to
     quickly compute intermediate results that it might never use in
     final calculations, but can't effectively filter out, or which it
     might want to explicitly test for at run time.  (IEEE 754, of course,
     offers this -- via concepts like single and double precision,
     denormals, Infs, and NaNs.)

  -  The fact that, because we (the whole gcc project, driven by your
     view of what constitutes proper floating-point programming, which
     surely must pertain to other languages as well as Fortran) are marching
     to our own drummer, we have to modify all libraries that might be
     "downstream" of our product (the libm exp()) example to assure that
     they "behave" according to our "specification" (e.g. when computing
     results in an underflow, silently replace with zero, but if -fcheck-toon
     is specified, crash instead).  So we can't just grab a bunch of
     well-written, highly optimized IEEE-754-conforming code to implement
     underlying routines -- we have to, in effect, grow our own, to meet
     the performance needs of the types we're creating.

  -  The fact that, AFAICT, we "advertise" values like FLT_MIN and FLT_MAX
     which are, to some extent, outright *lies* in terms of what we
     actually support.  At least, FLT_MIN, anyway -- you cannot reliably
     compute down to near FLT_MIN without the compiler randomly computing
     values much smaller for you and then deciding to crash on them (again,
     this is based on *your* claims of what is permissible for the compiler
     over the past year or so).  But I don't see you, or those who appear
     to agree with you, proposing new values for FLT_MIN and FLT_MAX, or
     proposing values for new macros called something like FLT_TOON_MIN
     and FLT_TOON_MAX.  So programmers have to somehow just "know" what
     this magic range is, but are not able to actually look it up
     anywhere convenient.  (See the sketch just after this list.)
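
To illustrate that last point, here's a minimal sketch (the starting
constant merely approximates single-precision FLT_MIN, about 1.18E-38)
of how little arithmetic it takes to walk below the advertised minimum:

      REAL X
      X = 1.18E-38
C     Each halving takes X below the smallest normalized REAL;
C     on IEEE targets it denormalizes, then reaches zero at
C     around 1.4E-45, while on an Alpha compiled without -mieee
C     it may simply trap.
      DO 10 I = 1, 30
         X = X / 2.
         PRINT *, I, X
   10 CONTINUE
      END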

Further, you appear to not particularly care how a program, compiled
with the default options, behaves when it computes values you consider
out of range.

It was formerly suggested (by you or someone else) that *crashing* when
computing denormals was okay, because that meant the program almost
certainly had a bug.  But now we discover we likely *won't* crash,
but rather silently generate a zero, on a denormal, even though, on
most *other* targets, we'll generate the correct (IEEE) result.

Yet that (*not* crashing) *now* seems to be cause for *celebration* among
those who claim denormals "never happen anyway" in correct code -- so
*catching* bugs seems to have not been an issue in the first place, as
whether the code crashed or computed incorrect results (admittedly from
inputs declared, by you, to be incorrect) turns out to not have mattered
at all.

So, it seems to me likely that, right now, on Alphas (and again I can't
test this), there are discontinuities in important functions, such as
reading in a floating-point number expressed as decimal.  E.g. on my
Pentium II, if I do the equivalent of "READ (UNIT='1E-5000', *) R",
the value that ends up in R is 0., with no crash.  Apparently, according
to you, it's okay if that also happens on the Alpha, even if changing
the input value to 1E-40 causes a crash (since it generates a denormal),
and even though changing it to 1E-5 produces the correct result (no crash).
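
Here's a minimal sketch of that discontinuity (it uses a list-directed
internal READ, which I believe g77/libf2c accepts as an extension;
whether the middle case traps or quietly denormalizes presumably
depends on the target and on -mieee):

      CHARACTER*8 S(3)
      REAL R
      DATA S /'1E-5', '1E-40', '1E-5000'/
      DO 10 I = 1, 3
C        Text-to-FP conversion yields, in turn, a normal value,
C        a denormal, and a value far below the type's range.
         READ (S(I), *) R
         PRINT *, S(I), ' -> ', R
   10 CONTINUE
      END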

But I consider that sort of discontinuity (zero -> crash -> correct value)
to be *wholly* unacceptable, whether it occurs in the inputs to the
text-to-FP converter, or in the inputs to exp(), or whatever.

Now, maybe all that's needed is for these few discontinuities to be
*fixed*, but, from your "camp", so to speak, I see little interest
in doing so, *or* in specifying exactly what the proper behaviors
*should* be so the *rest* of us can know how to fix them.  (In particular,
as far as I can tell, the specification is "anything goes, so it can be
as fast as the hardware can possibly run", which clearly allows for
discontinuous behavior across the domain -- except, of course, in the
at-best-fuzzily-specified region *you* claim is the only one programs
should be employing anyway.  And there's no point in "fixing" such
discontinuities if the "standard" is to go as fast as possible, completely
ignoring the possibility of ill-conditioned input resulting in
discontinuous behaviors.)

So, I'm stumped as to what path to take.  I want to work only
on *robustly designed* products, which includes taking into account
how *typical* users will use them in practice, and assuring they'll not
get misleading or incorrect results.

Now, I'm not particularly interested in working on something like Java,
which offers the (IMO false) promise of "write once, run anywhere", even
though I'm sure I'd find their discussions of issues revolving around
IEEE 754, performance, and consistent behavior quite valuable in many
ways.

But, I'd like to work on a product that *at least* respects the underlying
floating-point types of the machine and *properly* implements them.  That
means respecting all the expertise that not only went into *designing* those
types but has since gone into *using* them.  If those underlying types are
IEEE 754, so be it.  It also means implementing consistent behavior across
all the intrinsics, conversions, approximations, etc.  It also means doing
80-bit spills on IA32, since that's what the *designer* of the hardware
intended (and, again, it's what the Fortran standard *clearly*, if not
*legalistically*, requires).  (It certainly means, at least, 80-bit spills
of program variables; I believe it might also mean 80-bit spills of
intermediate results.  Of course, there's the argument that what it
*really* means is always computing 32-bit or 64-bit intermediate results,
leaving IA32 performance out in the cold.)

This is all considered "quality of implementation" in the Fortran community,
I'm happy to concede, but my point is, if I can't (and I know I don't)
thoroughly understand the floating-point model the compiler offers, then
I should *not* be working on it, certainly not in a significant role
(anything approaching the role I serve now vis-a-vis g77).

And, right now, I have *serious* doubts that the floating-point model
g77/gcc offers is *anywhere* near understood by *anybody*, including
yourself.  You probably don't realize that that "IF (X .LT. X)" example
above, put into a loop that nevertheless computes the *same* pertinent
values, could result in *different* paths taken via the IF, especially if the
loop contains any other code before *or* after that sample block of code.
But it could, due to loop unrolling or other sorts of code-duplication
optimizations.
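
For instance, consider this minimal variation (whether the paths
actually diverge depends entirely on how the optimizer chooses to
duplicate code and allocate registers, so this is a sketch of a
possibility, not a guaranteed reproducer):

      Y = 1.
      Z = 3.
      DO 10 I = 1, 2
         X = Y / Z
C        After unrolling, one copy of this IF may see X spilled
C        to 32 or 64 bits while another sees the full 80-bit
C        register value, so different iterations may take
C        different paths despite computing the "same" X.
         IF (X .LT. X) PRINT *, 'FAIL', I
   10 CONTINUE
      END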

(Though you probably *do* realize that adding another identical IF to
the original sample program could legitimately result in different
paths taken in each IF even in a straight-through execution of that code.)

Certainly I don't think most of the *current* g77 audience understands
those issues, and I do not have what it takes to explain them to that
audience.

My guess is, assuming we continue on this path, the *future* g77 audience
*will* understand these issues thoroughly.

But not because we've *explained* these issues to them.

It will be because the future g77 audience (and the audience for gcc's
floating-point capabilities) will consist of maybe 50 people worldwide,
or some number so small it comes *nowhere* near justifying all the effort
we're putting into this aspect of the product.

IMO, an audience that small should either spend a few hundred thousand
dollars for a high-end compiler to be installed on all their workstations,
or use assembly, because that's what the economics of their situation
dictates.

Or, they could just use -mno-ieee, -fno-correct-spills, etc., assuming
we picked *robust* defaults for g77 (and gcc).

So, Toon, what do you suggest?  Do you insist I continue to work on a
product in which I have increasingly *less* confidence, and which is
going in a direction with which I strongly disagree (much of this
"direction" being in the realm of "revelations" about how it's actually
working, and being told not to do anything about it), resulting in my
(hopefully-still-at-least-marginally-good) name being associated with
something which I wouldn't recommend anyone actually *use*?

Or, should I work on something else, something in which I *do* have
enough confidence to believe it can become worthy of recommending
to others?

(These questions are as open to others as they are to Toon, and they
should be considered as questions *other* people working on g77/gcc
might be asking as well.  E.g. maybe Toon would be unwilling to work on
g77 if it didn't have the defaults *he* wants -- after all, convincing
floating-point experts to compile with -mno-ieee might seem, to him
anyway, quite difficult.)

        tq vm, (burley)

