Bug 108279 - Improved speed for float128 routines
Summary: Improved speed for float128 routines
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: libgcc
Version: unknown
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2023-01-03 20:55 UTC by Thomas Koenig
Modified: 2023-02-10 13:38 UTC

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
Example patch with Michael S's code just pasted over the libgcc implementation, for a test (5.21 KB, patch)
2023-01-03 21:02 UTC, Thomas Koenig
matmul_r16.i (31.64 KB, text/plain)
2023-01-14 09:21 UTC, Thomas Koenig

Description Thomas Koenig 2023-01-03 20:55:32 UTC
Our soft-float routines, which are used for the basic float128 arithmetic
(__addtf3, __subtf3, etc.), are much slower than they need to be.

Michael S has some considerably faster routines at
https://github.com/already5chosen/extfloat which he would like to
contribute to gcc.  There is a rather lengthy thread in comp.arch
starting with https://groups.google.com/g/comp.arch/c/Izheu-k00Nw .

Current status of the discussion:

The routines currently do not support rounding modes; they support only
round-to-nearest, ties-to-even. Adding such support would be feasible.

Handling of the rounding mode is currently done in libgcc by querying
the hardware, which leads to high overhead on each call. This would not
be needed if -ffast-math (or a relevant suboption) is specified.

The faster code would also be suitable as-is (under a different name)
for Fortran intrinsics such as matmul.

Fortran is a bit special because rounding modes are set to the default on
procedure entry and restored on procedure exit (which is why setting the
rounding mode in a subroutine is a no-op). This would allow keeping a
local variable that tracks the rounding mode.

The current idea would be something like this:

The current behavior of __addtf3 and friends could remain as is; its
speed could be improved, but it would still query the hardware.

There could be two additional routines for each arithmetic operation. One
of them would implement the operation for a specified rounding mode
(to be called from Fortran when the relevant IEEE module is in use).

The other one would implement round-to-nearest only, for use from
Fortran intrinsics and from all other languages when the right flags
are given. It would be good to bolt this onto some flag already used
for libgfortran, to make it accessible from C.
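
Concretely, the set of entry points per operation could look something
like this (the extra names and signatures are purely illustrative, not a
proposed ABI):

/* Current entry point: full IEEE semantics, queries the hardware.  */
__float128 __addtf3 (__float128 a, __float128 b);

/* Hypothetical: explicit rounding mode, for Fortran with the IEEE
   modules in use.  */
__float128 __addtf3_rmode (__float128 a, __float128 b, int rmode);

/* Hypothetical: fixed round-to-nearest, for Fortran intrinsics and for
   other languages under the right flags.  */
__float128 __addtf3_rn (__float128 a, __float128 b);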

Probably gcc14 material.
Comment 1 Thomas Koenig 2023-01-03 21:02:09 UTC
Created attachment 54183 [details]
Example patch with Michael S's code just pasted over the libgcc implementation, for a test

A benchmark: just pasting over the code from the GitHub repo yields an
improvement of gfortran's matmul by almost a factor of two, so
significant speedups are possible:

module tick
  interface
    function rdtsc() bind(C,name="rdtsc")
      use iso_c_binding
      integer(kind=c_long) :: rdtsc
    end function rdtsc
  end interface
end module tick

program main
  use tick
  use iso_c_binding
  implicit none
  integer, parameter :: wp = selected_real_kind(30)
  ! integer, parameter :: n=5000, p=4000, m=3666
  integer, parameter :: n = 1000, p = 1000, m = 1000
  real (kind=wp) :: c(n,p), a(n,m), b(m,p)
  character(len=80) :: line
  integer(c_long) :: t1, t2, t3
  real (kind=wp) :: fl = 2.d0*n*m*p      ! flop count of the matmul
  integer :: i,j

  print *,wp

  line = '10 10'
  call random_number(a)
  call random_number(b)
  t1 = rdtsc()
  t2 = rdtsc()
  t3 = t2-t1                             ! overhead of rdtsc itself
  print *,t3
  t1 = rdtsc()
  c = matmul(a,b)
  t2 = rdtsc()
  print *,1/(fl/(t2-t1-t3)),"Cycles per operation"
  read (unit=line,fmt=*) i,j
  write (unit=line,fmt=*) c(i,j)         ! use one element of c so the result is not discarded
end program main
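
Note: rdtsc() is an external helper that has to be supplied in C; it is
not shown in the report. A minimal x86-64 implementation (my sketch,
assumed rather than taken from the report) could be:

#include <x86intrin.h>

long rdtsc (void)
{
  return (long) __rdtsc ();   /* read the CPU time-stamp counter */
}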

Running it showed

tkoenig@gcc188:~> ./original
16
32
^C
tkoenig@gcc188:~> time ./original
16
32
90.5696151959999999999999999999999997 Cycles per operation

real 1m2,148s
user 1m2,123s
sys 0m0,008s
tkoenig@gcc188:~> time ./modified
16
32
52.8148391719999999999999999999999957 Cycles per operation

real 0m36,296s
user 0m36,278s
sys 0m0,008s 

where "original" is the current libgcc soft-float implementation, and
"modified" is with the code from the repro.

It does not handle exceptions, so this causes a few regressions, but it
certainly shows the potential.
Comment 2 Jakub Jelinek 2023-01-04 10:43:38 UTC
From what I can see, they are certainly not portable.
E.g. relying on __int128 rules out various arches (basically all 32-bit
arches: ia32 and 32-bit powerpc, among others).  Not handling exceptions
is a show stopper too.
I guess a better time investment would be to improve performance of the
soft-fp versions.
Comment 3 Thomas Koenig 2023-01-04 17:14:19 UTC
(In reply to Jakub Jelinek from comment #2)
> From what I can see, they are certainly not portable.
> E.g. relying on __int128 rules out various arches (basically all 32-bit
> arches: ia32 and 32-bit powerpc, among others).

For this kind of performance improvement on 64-bit systems, we could probably
introduce an appropriate #ifdef. Regarding x86 intrinsics, maybe they
can be replaced by gcc's vector extension.

> Not handling exceptions is a show stopper too.

Agreed, we should not be replacing the soft-fp that way.

> I guess a better time investment would be to improve performance of the
> soft-fp versions.

I'm not sure; I think we could get an appreciable benefit if we invoke
this kind of routine only behind the appropriate sub-flags of
-ffast-math.

For general-purpose code, I see no way around the bottleneck of querying
the processor status on each invocation, and that is wasted work if the
program does not care.
Comment 4 Michael_S 2023-01-04 22:19:17 UTC
(In reply to Jakub Jelinek from comment #2)
> From what I can see, they are certainly not portable.
> E.g. relying on __int128 rules out various arches (basically all 32-bit
> arches: ia32 and 32-bit powerpc, among others).  Not handling exceptions
> is a show stopper too.
> I guess a better time investment would be to improve performance of the
> soft-fp versions.

The only 32-bit gcc targets [out of the targets represented on Godbolt]
that support __float128/_Float128 right now are
- i386 (IA-32)
- SPARC
- RV32
I don't know why RV32 is supported, but I would be surprised if
__float128 on RV32 is actually used by anybody. Just too slow.

32-bit SPARC sounds like a dead platform.

i386 is more problematic. People still use it, and people still develop new programs for it. And since modern x86 processors are so fast, the performance of __float128 could be usable even if it is 5 or 10 times slower than on the same machine running x86-64.

PPC32 is not supported, so no problems from that side.
Comment 5 Michael_S 2023-01-11 23:06:49 UTC
Hi Thomas
Are you in or out?

If you are still in, I can use your help on several issues.

1. Torture tests.
See whether the Invalid Operation exception is raised properly now, and whether there are still remaining problems with NaNs.

2. Run my correctness tests on as many non-AMD64 targets as you can. Preferably with 100,000,000 iterations, but on weaker HW 10,000,000 will do.

3. Run my speed tests (tests/matmulq/mm_speed_ma) on a more diverse set of AMD64 computers than I did.
Of special interest are
- AMD Zen3 on Linux running on bare metal
- Intel Skylake, SkylakeX, Tiger/Rocket Lake and Alder Lake on Linux running on bare metal
I realize that doing speed tests is not nearly as simple as correctness tests:
we need non-busy (preferably almost idle) machines that have a stable CPU clock rate. It's not easy to find machines like that nowadays, but maybe you can find at least some from the list.

4. Run my speed tests on as many non-obsolete ARM64 computers as you can find.
Well, probably wishful thinking on my part.


Also, off topic but of interest: the preprocessed source of matmul_r16.c.
Comment 6 Thomas Koenig 2023-01-12 23:24:02 UTC
(In reply to Michael_S from comment #5)
> Hi Thomas
> Are you in or out?

Depends a bit on what exactly you want to do, and whether there is
a chance that what you want to do will be incorporated into gcc.

If you want to replace the soft-float routines, you will have to
replace them with the full functionality.

And there will have to be a decision about 32-bit targets.

> If you are still in, I can use your help on several issues.
> 
> 1. Torture tests.
> See whether the Invalid Operation exception is raised properly now, and
> whether there are still remaining problems with NaNs.

I've put your addition/subtraction routines in as a replacement
and am running a regression test.  We'll see when that finishes.

> 2. Run my correctness tests on as many non-AMD64 targets as you can.
> Preferably with 100,000,000 iterations, but on weaker HW 10,000,000 will do.

This will take some time.

> 3. Run my speed tests (tests/matmulq/mm_speed_ma) on a more diverse set
> of AMD64 computers than I did.
> Of special interest are
> - AMD Zen3 on Linux running on bare metal
> - Intel Skylake, SkylakeX, Tiger/Rocket Lake and Alder Lake on Linux running
> on bare metal
> I realize that doing speed tests is not nearly as simple as correctness
> tests: we need non-busy (preferably almost idle) machines that have a
> stable CPU clock rate. It's not easy to find machines like that nowadays,
> but maybe you can find at least some from the list.

I currently have no access to that sort of hardware (I'm just a volunteer,
and my home box is Zen 1).
 
> 4. Run my speed tests on as many non-obsolete ARM64 computers as you can
> find.
> Well, probably wishful thinking on my part.
> 
> 
> Also, off topic but of interest: the preprocessed source of matmul_r16.c.

Where should I send that to?
Comment 7 Michael_S 2023-01-13 00:34:28 UTC
Either here or to my Yahoo e-mail.
Comment 8 Michael_S 2023-01-13 01:29:43 UTC
(In reply to Thomas Koenig from comment #6)
> (In reply to Michael_S from comment #5)
> > Hi Thomas
> > Are you in or out?
> 
> Depends a bit on what exactly you want to do, and whether there is
> a chance that what you want to do will be incorporated into gcc.
> 

What about incorporation in Fortran?
What about incorporation in C under -ffast-math?

> If you want to replace the soft-float routines, you will have to
> replace them with the full functionality.
> 

Full functionality, including the Inexact exception that practically nobody uses?
Sounds wasteful of perfectly good CPU cycles.
Also, I am not so sure that the Inexact exception is fully supported in the existing soft-float library.

Almost-full functionality, with support for non-default rounding modes but without the Inexact exception?
I actually implemented that and did a few measurements. You can find the results in the /reports directory in my repo.
Summary: the architecture-neutral method causes a very serious slowdown; less so on slower machines, but a massive 2.5x on the fastest machine (Zen3 under Linux under WSL).
The AMD64-specific method causes a smaller slowdown, especially on relatively old Intel cores on Windows (I have no modern Intel cores available for testing), but Zen3/Linux still suffers a 1.45x slowdown. Again, a big waste of perfectly good CPU cycles.
Also, what about other architectures? Should they suffer an "architecture-neutral" slowdown? Even if there are faster methods on other architectures, those methods would have to be found and tested by somebody. This sort of work is time-consuming. And for what?

Also, I measured the impact of implementing non-default rounding through an additional function parameter. The impact is very small, 0 to 5%.
You said on comp.arch that at least for Fortran it could work.

What else is missing for "full functionality"?
Surely there are other things that I forgot. Maybe additional exceptions, apart from Invalid Operation (which hopefully already works) and apart from Inexact, which I find stupid? I don't think they are hard to implement or expensive in terms of speed. Just a bit of work and more than a bit of testing.

> And there will have to be a decision about 32-bit targets.
>

IMHO, 32-bit targets should be left in their current state.
People that use them probably do not care deeply about performance.
Technically, I can implement 32-bit targets in the same sources, by means of a few ifdefs and macros, but the resulting source code will look much uglier than it looks today. Still, not to the same level of horror that you have in matmul_r16.c, but certainly uglier than I like it to look.
And I am not sure at all that my implementation of 32-bit targets would be significantly faster than current soft float.
In short, it does not sound like good ROI.
BTW, do you know why the current soft float supports so few 32-bit targets?
Most likely somebody felt just like me about it: it's not too hard to support more 32-bit targets, but it's not a good ROI.
Comment 9 Thomas Koenig 2023-01-14 09:21:01 UTC
Created attachment 54273 [details]
matmul_r16.i

Here is matmul_r16.i from a relatively recent trunk.
Comment 10 Thomas Koenig 2023-01-14 10:22:24 UTC
What we would need for incorporation into gcc is to have several
functions, which would then be called depending on which floating-point
options are in force at the time of invocation.

So, let's go through the gcc options to see what would fit where,
walking down the options tree depth first.

From the gcc docs:

'-ffast-math'
     Sets the options '-fno-math-errno', '-funsafe-math-optimizations',
     '-ffinite-math-only', '-fno-rounding-math', '-fno-signaling-nans',
     '-fcx-limited-range' and '-fexcess-precision=fast'.

-fno-math-errno is irrelevant in this context, no need to look at that.

'-funsafe-math-optimizations'

     Allow optimizations for floating-point arithmetic that (a) assume
     that arguments and results are valid and (b) may violate IEEE or
     ANSI standards.  When used at link time, it may include libraries
     or startup files that change the default FPU control word or other
     similar optimizations.

     This option is not turned on by any '-O' option since it can result
     in incorrect output for programs that depend on an exact
     implementation of IEEE or ISO rules/specifications for math
     functions.  It may, however, yield faster code for programs that do
     not require the guarantees of these specifications.  Enables
     '-fno-signed-zeros', '-fno-trapping-math', '-fassociative-math' and
     '-freciprocal-math'.

'-fno-signed-zeros'
     Allow optimizations for floating-point arithmetic that ignore the
     signedness of zero.  IEEE arithmetic specifies the behavior of
     distinct +0.0 and -0.0 values, which then prohibits simplification
     of expressions such as x+0.0 or 0.0*x (even with
     '-ffinite-math-only').  This option implies that the sign of a zero
     result isn't significant.

     The default is '-fsigned-zeros'.

I don't think this option is relevant.

'-fno-trapping-math'
     Compile code assuming that floating-point operations cannot
     generate user-visible traps.  These traps include division by zero,
     overflow, underflow, inexact result and invalid operation.  This
     option requires that '-fno-signaling-nans' be in effect.  Setting
     this option may allow faster code if one relies on "non-stop" IEEE
     arithmetic, for example.

     This option should never be turned on by any '-O' option since it
     can result in incorrect output for programs that depend on an exact
     implementation of IEEE or ISO rules/specifications for math
     functions.

     The default is '-ftrapping-math'.

Relevant.

'-ffinite-math-only'
     Allow optimizations for floating-point arithmetic that assume that
     arguments and results are not NaNs or +-Infs.

     This option is not turned on by any '-O' option since it can result
     in incorrect output for programs that depend on an exact
     implementation of IEEE or ISO rules/specifications for math
     functions.  It may, however, yield faster code for programs that do
     not require the guarantees of these specifications.

This does not have further suboptions. Relevant.

'-fassociative-math'

     Allow re-association of operands in series of floating-point
     operations.  This violates the ISO C and C++ language standard by
     possibly changing computation result.  NOTE: re-ordering may change
     the sign of zero as well as ignore NaNs and inhibit or create
     underflow or overflow (and thus cannot be used on code that relies
     on rounding behavior like '(x + 2**52) - 2**52'.  May also reorder
     floating-point comparisons and thus may not be used when ordered
     comparisons are required.  This option requires that both
     '-fno-signed-zeros' and '-fno-trapping-math' be in effect.
     Moreover, it doesn't make much sense with '-frounding-math'.  For
     Fortran the option is automatically enabled when both
     '-fno-signed-zeros' and '-fno-trapping-math' are in effect.

     The default is '-fno-associative-math'.

Not relevant, I think - this influences compiler optimizations.

'-freciprocal-math'

     Allow the reciprocal of a value to be used instead of dividing by
     the value if this enables optimizations.  For example 'x / y' can
     be replaced with 'x * (1/y)', which is useful if '(1/y)' is subject
     to common subexpression elimination.  Note that this loses
     precision and increases the number of flops operating on the value.

     The default is '-fno-reciprocal-math'.

Again, not relevant.


'-frounding-math'
     Disable transformations and optimizations that assume default
     floating-point rounding behavior.  This is round-to-zero for all
     floating point to integer conversions, and round-to-nearest for all
     other arithmetic truncations.  This option should be specified for
     programs that change the FP rounding mode dynamically, or that may
     be executed with a non-default rounding mode.  This option disables
     constant folding of floating-point expressions at compile time
     (which may be affected by rounding mode) and arithmetic
     transformations that are unsafe in the presence of sign-dependent
     rounding modes.

     The default is '-fno-rounding-math'.

     This option is experimental and does not currently guarantee to
     disable all GCC optimizations that are affected by rounding mode.
     Future versions of GCC may provide finer control of this setting
     using C99's 'FENV_ACCESS' pragma.  This command-line option will be
     used to specify the default state for 'FENV_ACCESS'.

Also no further suboptions. This is relevant.

'-fsignaling-nans'
     Compile code assuming that IEEE signaling NaNs may generate
     user-visible traps during floating-point operations.  Setting this
     option disables optimizations that may change the number of
     exceptions visible with signaling NaNs.  This option implies
     '-ftrapping-math'.

     This option causes the preprocessor macro '__SUPPORT_SNAN__' to be
     defined.

     The default is '-fno-signaling-nans'.

     This option is experimental and does not currently guarantee to
     disable all GCC optimizations that affect signaling NaN behavior.

Also, no further suboptions. Relevant.

-fcx-limited-range is not relevant, and neither is -fexcess-precision=fast.

So, unless I missed something, it should be possible to select different
functions depending on the values of -ftrapping-math, -ffinite-math-only,
-frounding-math and -fsignaling-nans.

Regarding Fortran's matmul: We use -ffast-math when compiling the library
functions, so any change we make to any of the -ffast-math suboptions
would be used, as well.
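
Schematically, the selection could then look like this (illustrative
pseudologic only, not actual GCC internals; the variant names follow the
hypothetical ones sketched in the description):

#include <stdbool.h>

/* Pick the libcall variant from the flags in force at the call site.  */
const char *
pick_addtf3 (bool rounding_math, bool trapping_math, bool signaling_nans)
{
  if (!rounding_math && !trapping_math && !signaling_nans)
    return "__addtf3_rn";     /* fixed round-to-nearest, no traps */
  if (!trapping_math && !signaling_nans)
    return "__addtf3_rmode";  /* explicit rounding mode, no traps */
  return "__addtf3";          /* full IEEE behavior, queries hardware */
}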
Comment 11 Michael_S 2023-01-14 23:52:32 UTC
(In reply to Thomas Koenig from comment #9)
> Created attachment 54273 [details]
> matmul_r16.i
> 
> Here is matmul_r16.i from a relatively recent trunk.

Thank you.
Unfortunately, I was not able to link it with a main program in Fortran.
So I still can only guess why, even after replacement of __multf3 and __addtf3 by my implementations, it is still more than twice as slow (on Zen3) as it should be.
Looking at the source, and assuming that the inner loop starts at line 8944, this
loop looks very strange; but apart from bad programming style and a misunderstanding of what is optimal scheduling, there is nothing criminal about it. Maybe, because of the wrong scheduling, it is 10-15% slower than the best, but it certainly should not be the 2.4 times slower that I am seeing.


That's what I got from the linker:
/usr/bin/ld: matmul_r16.o: warning: relocation against `_gfortrani_matmul_r16_avx128_fma3' in read-only section `.text'
/usr/bin/ld: matmul_r16.o: in function `matmul_r16_avx':
matmul_r16.c:(.text+0x48): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x342): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x10e4): undefined reference to `_gfortrani_size0'
/usr/bin/ld: matmul_r16.c:(.text+0x10f1): undefined reference to `_gfortrani_xmallocarray'
/usr/bin/ld: matmul_r16.c:(.text+0x12e5): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x13a9): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x13e7): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.o: in function `matmul_r16_avx2':
matmul_r16.c:(.text+0x24c8): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x27c2): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x3564): undefined reference to `_gfortrani_size0'
/usr/bin/ld: matmul_r16.c:(.text+0x3571): undefined reference to `_gfortrani_xmallocarray'
/usr/bin/ld: matmul_r16.c:(.text+0x3765): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x3829): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x3867): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.o: in function `matmul_r16_avx512f':
matmul_r16.c:(.text+0x4948): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x4c47): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x5a32): undefined reference to `_gfortrani_size0'
/usr/bin/ld: matmul_r16.c:(.text+0x5a3f): undefined reference to `_gfortrani_xmallocarray'
/usr/bin/ld: matmul_r16.c:(.text+0x5c35): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x5cfb): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x5d35): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.o: in function `matmul_r16_vanilla':
matmul_r16.c:(.text+0x6de8): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x70e2): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x7e84): undefined reference to `_gfortrani_size0'
/usr/bin/ld: matmul_r16.c:(.text+0x7e91): undefined reference to `_gfortrani_xmallocarray'
/usr/bin/ld: matmul_r16.c:(.text+0x8085): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x8149): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x8187): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.o: in function `_gfortran_matmul_r16':
matmul_r16.c:(.text+0x92b7): undefined reference to `_gfortrani_matmul_r16_avx128_fma3'
/usr/bin/ld: matmul_r16.c:(.text+0x92ea): undefined reference to `_gfortrani_matmul_r16_avx128_fma4'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
collect2: error: ld returned 1 exit status
Comment 12 Michael_S 2023-01-15 00:20:00 UTC
(In reply to Thomas Koenig from comment #10)
> What we would need for incorporation into gcc is to have several
> functions, which would then be called depending on which floating-point
> options are in force at the time of invocation.
> [... the remainder of comment #10, quoted there in full, trimmed ...]

This set of options does not map too well onto the real difficulties of implementation.
There are only 2 things that are expensive:
1. The Inexact exception.
2. Fetching the current rounding mode.
The rest of the IEEE 754 features are so cheap that creating separate variants without them is simply not worth the effort of maintaining distinct variants, even if all the difference is a single three-line #ifdef.

BTW, the Inexact exception can be made fairly affordable with a little help from the compiler. All we need for that is the ability to say "don't remove this floating-point addition even if you don't see that it produces any effect".
Something similar to 'volatile', except that with volatile the compiler currently puts the result of the addition on the stack, which adds undesirable cost.
However, judging by Jakub's comment, the compiler maintainers are not particularly interested in this enterprise.
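
Concretely, the only portable spelling available today is the 'volatile'
one, with exactly the store cost described above (a sketch):

static inline void
force_add (double x, double y)
{
  volatile double r = x + y;  /* kept only for its FE_INEXACT side effect;
                                 'volatile' forces a store to the stack */
  (void) r;
}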
Comment 13 Thomas Koenig 2023-01-15 15:13:27 UTC
I tried compiling your tests on Apple silicon using Asahi Linux, but
without success. A first step was rather easy: replacing __float128 by
_Float128 was required.  I then bootstrapped gcc on that machine and
added the (private) include path for <quadmath.h>, and am now hitting a
missing __float128 in quadmath.h.  Not sure how to proceed from here.

The machine is gcc103.fsffrance.org, by the way, of the GCC compile farm.
Comment 14 Thomas Koenig 2023-01-15 19:17:48 UTC
It seems that libquadmath is not built on that particular Linux/CPU
variant, for whatever reason. At least I cannot find any *quadmath* files
in the build directory.

/proc/cpuinfo tells me that

processor       : 0
BogoMIPS        : 48.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint
CPU implementer : 0x61
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0x022
CPU revision    : 1

[...]

and uname -a is

Linux gcc103.fsffrance.org 6.0.0-rc5-asahi-00001-gc62bd3fe430f #1 SMP Sun Sep 18 18:07:57 CEST 2022 aarch64 GNU/Linux

So much for testing on Apple silicon.
Comment 15 Jakub Jelinek 2023-01-15 19:28:23 UTC
libquadmath is neither needed nor useful on aarch64-linux, because the long double type there is already IEEE 754 quad.
Comment 16 Michael_S 2023-01-15 22:27:55 UTC
(In reply to Jakub Jelinek from comment #15)
> libquadmath is neither needed nor useful on aarch64-linux, because the
> long double type there is already IEEE 754 quad.

That's good to know. Thank you.

If you are here, I have one more somewhat related question: is IEEE binary128 arithmetic really not supported under ARMv7, or is it, as in the case above, supported but not under one of the common names (__float128/_Float128)?

And if it is not supported, then why?
If I had to pick one 32-bit architecture to support, it would certainly be ARMv7, the most used 32-bit architecture in the world, which probably outsells the next one by an order of magnitude.
Comment 17 jsm-csl@polyomino.org.uk 2023-01-16 22:16:22 UTC
It's not part of the ABI for the Arm 32-bit Architecture (AAPCS32).

https://github.com/ARM-software/abi-aa/blob/main/aapcs32/aapcs32.rst

You can file an issue there if you want, though I don't know how 
interested the maintainers will be in that optional language feature.
Comment 18 Wilco 2023-01-18 19:02:19 UTC
(In reply to Michael_S from comment #12)

> This set of options does not map too well onto the real difficulties of
> implementation.
> There are only 2 things that are expensive:
> 1. The Inexact exception.
> 2. Fetching the current rounding mode.
> The rest of the IEEE 754 features are so cheap that creating separate
> variants without them is simply not worth the effort of maintaining
> distinct variants, even if all the difference is a single three-line
> #ifdef.

In general, reading the current rounding mode is relatively cheap, but modifying it can be expensive, so the optimized fenv implementations in GLIBC only modify the FP status if a change is required. It should be feasible to check for round-to-nearest-even and use optimized code for that case.
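
That check could look something like this on x86 (my sketch, not the
actual GLIBC code):

#include <xmmintrin.h>

/* Dispatch on the current rounding mode without modifying any FP state.
   The RC field occupies bits 13-14 of MXCSR; 00 means round-to-nearest.  */
static int
round_to_nearest_p (void)
{
  return (_mm_getcsr () & 0x6000) == 0;
}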

> BTW, the Inexact exception can be made fairly affordable with a little
> help from the compiler. All we need for that is the ability to say
> "don't remove this floating-point addition even if you don't see that
> it produces any effect".
> Something similar to 'volatile', except that with volatile the compiler
> currently puts the result of the addition on the stack, which adds
> undesirable cost.
> However, judging by Jakub's comment, the compiler maintainers are not
> particularly interested in this enterprise.

There are macros in GLIBC's math-barriers.h which do what you want - e.g. AArch64:

#define math_opt_barrier(x)                                     \
  ({ __typeof (x) __x = (x); __asm ("" : "+w" (__x)); __x; })
#define math_force_eval(x)                                              \
  ({ __typeof (x) __x = (x); __asm __volatile__ ("" : : "w" (__x)); })

The first blocks optimizations (like constant folding) across the barrier; the second forces evaluation of an expression even if it is deemed useless. These are used in many math functions in GLIBC. They are target-specific due to needing inline assembler operands, but it should be easy to add similar definitions to libgcc.
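
For comparison, the x86-64 variants would use an SSE register constraint;
something like the following (a sketch from memory, not the exact GLIBC
source):

#define math_opt_barrier(x)                                     \
  ({ __typeof (x) __x = (x); __asm ("" : "+x" (__x)); __x; })
#define math_force_eval(x)                                      \
  ({ __typeof (x) __x = (x); __asm __volatile__ ("" : : "x" (__x)); })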
Comment 19 Jakub Jelinek 2023-01-18 19:20:40 UTC
So, if stmxcsr/vstmxcsr is too slow, perhaps we should change x86 sfp-machine.h
#define FP_INIT_ROUNDMODE                                       \
  do {                                                          \
    __asm__ __volatile__ ("%vstmxcsr\t%0" : "=m" (_fcw));       \
  } while (0)
#endif
to do that only if not round-to-nearest.
E.g. the fast-float library has used this since last November:
  static volatile float fmin = __FLT_MIN__;
  float fmini = fmin; // we copy it so that it gets loaded at most once.
  return (fmini + 1.0f == 1.0f - fmini);
as a quick test of whether round-to-nearest is in effect or some other rounding mode.
Most likely, if this is done with -frounding-math, it wouldn't even need the volatile stuff.  Of course, if it isn't round-to-nearest, we would need to do the expensive {,v}stmxcsr.
Comment 20 Jakub Jelinek 2023-01-18 19:26:33 UTC
__attribute__((noinline, optimize ("rounding-math"))) static int
round_to_nearest (void) { return 1.0f - __FLT_MIN__ == 1.0f + __FLT_MIN__; }

and
  if (round_to_nearest ()) \
    _fcw = FP_RND_NEAREST; \
  else \
    __asm__ __volatile__ ("%vstmxcsr\t%0" : "=m" (_fcw)); \

Except that from _fcw we determine not just the rounding mode but also which exceptions are enabled.
Comment 21 Wilco 2023-01-18 20:10:47 UTC
(In reply to Jakub Jelinek from comment #20)
> __attribute__((noinline, optimize ("rounding-math"))) static int
> round_to_nearest (void) { return 1.0f - __FLT_MIN__ == 1.0f + __FLT_MIN__; }

Wouldn't that always set inexact?

> and
>   if (round_to_nearest ()) \
>     _fcw = FP_RND_NEAREST; \
>   else \
>     __asm__ __volatile__ ("%vstmxcsr\t%0" : "=m" (_fcw)); \
> 
> Except that from _fcw we don't determine just the rounding mode but also
> what exceptions are enabled.

Yes, that wouldn't work for fenv, but FP emulation functions don't need to read the exception flags.
Comment 22 Michael_S 2023-01-18 22:28:17 UTC
(In reply to Michael_S from comment #8)
> (In reply to Thomas Koenig from comment #6)
> > And there will have to be a decision about 32-bit targets.
> >
> 
> IMHO, 32-bit targets should be left in their current state.
> People that use them probably do not care deeply about performance.
> Technically, I can implement 32-bit targets in the same sources, by means
> of a few ifdefs and macros, but the resulting source code will look much
> uglier than it looks today. Still, not to the same level of horror that
> you have in matmul_r16.c, but certainly uglier than I like it to look.
> And I am not sure at all that my implementation of 32-bit targets would
> be significantly faster than current soft float.

I explored this path (implementing 32-bit and 64-bit targets from the same source with a few ifdefs) a little more:
Now I am even more sure that it is not the way to go. The gcc compiler does not generate good 32-bit code for this style of source. This applies especially to i386; other supported 32-bit targets (RV32, SPARC32) are affected less.

In the process I encountered a funny, illogical pessimization by the i386 code generator:
https://godbolt.org/z/En6Tredox
Routines foo32() and foo64() are semantically identical, but foo32() is written with 32-bit targets in mind, while foo64() is in the style of code that will likely be written if one wants to support 32 and 64 bits from the same source with #ifdef.

The code generated by gcc for foo32() is reasonable: as in the source, we can see 8 multiplications.
The code generated by gcc for foo64() is anything but reasonable: somehow, the compiler invented 5 more multiplications, for a total of 13.
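
To illustrate the two styles (my reconstruction for illustration only;
the actual foo32()/foo64() are in the godbolt link above), here is the
high half of a 64x64 multiplication written both ways:

typedef unsigned int u32;
typedef unsigned long long u64;

/* 32-bit style: four 32x32->64 multiplications, spelled out.  */
u64 umulh_32style (u64 a, u64 b)
{
  u32 alo = (u32)a, ahi = (u32)(a >> 32);
  u32 blo = (u32)b, bhi = (u32)(b >> 32);
  u64 p0 = (u64)alo * blo;
  u64 p1 = (u64)alo * bhi;
  u64 p2 = (u64)ahi * blo;
  u64 p3 = (u64)ahi * bhi;
  u64 mid  = p1 + (p0 >> 32);   /* cannot overflow */
  u64 mid2 = p2 + (u32)mid;     /* cannot overflow */
  return p3 + (mid >> 32) + (mid2 >> 32);
}

/* 64-bit style: leave the splitting to the compiler.  */
u64 umulh_64style (u64 a, u64 b)
{
  return (u64)(((unsigned __int128)a * b) >> 64);
}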

Maybe it deserves a separate bug report.
Comment 23 Michael_S 2023-01-18 23:31:37 UTC
(In reply to Jakub Jelinek from comment #19)
> So, if stmxcsr/vstmxcsr is too slow, perhaps we should change x86
> sfp-machine.h
> #define FP_INIT_ROUNDMODE                                       \
>   do {                                                          \
>     __asm__ __volatile__ ("%vstmxcsr\t%0" : "=m" (_fcw));       \
>   } while (0)
> #endif
> to do that only if not round-to-nearest.
> E.g. the fast-float library has used this since last November:
>   static volatile float fmin = __FLT_MIN__;
>   float fmini = fmin; // we copy it so that it gets loaded at most once.
>   return (fmini + 1.0f == 1.0f - fmini);
> as a quick test of whether round-to-nearest is in effect or some other
> rounding mode.
> Most likely if this is done with -frounding-math it wouldn't even need the
> volatile stuff.  Of course, if it isn't round-to-nearest, we would need to
> do the expensive {,v}stmxcsr.

I agree with Wilco: this trick is problematic due to its effect on the inexact flag.
Also, I don't quite understand how you got to setting the rounding mode.
I don't need to set the rounding mode; I just need to read the current rounding mode.

Doing it in a portable way, i.e. by fegetround(), is slow mostly due to various overheads.
Doing it in a non-portable way on x86-64 (by _MM_GET_ROUNDING_MODE()) is not slow on Intel, but still pretty slow on AMD Zen3, although even on Zen3 it is much faster than fegetround().
Results of measurements are here:
https://github.com/already5chosen/extfloat/blob/master/binary128/reports/rm-impact.txt
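
For reference, the two ways of reading the mode compared there are (both
are existing APIs; the wrapper function is mine, added only to make the
snippet self-contained):

#include <fenv.h>
#include <xmmintrin.h>

static int
current_mode_is_nearest (void)
{
  int rm = fegetround ();                      /* portable: FE_TONEAREST, ... */
  unsigned rm_x86 = _MM_GET_ROUNDING_MODE ();  /* x86: RC field of MXCSR */
  return rm == FE_TONEAREST && rm_x86 == _MM_ROUND_NEAREST;
}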
Anyway, I'd very much prefer a portable solution over a multitude of ifdefs.
It is a pity that gcc doesn't implement FLT_ROUNDS like other compilers do.

But then again, it is a pity that gcc doesn't implement a few other things, implemented by other compilers, that could make the life of developers of portable multiprecision routines in general, and of soft float in particular, so much easier.
Comment 24 Michael_S 2023-02-10 13:38:09 UTC
(In reply to Michael_S from comment #22)
> (In reply to Michael_S from comment #8)
> > [... quotation of comment #8 trimmed ...]
> 
> I explored this path (implementing 32-bit and 64-bit targets from the
> same source with a few ifdefs) a little more:
> Now I am even more sure that it is not the way to go. The gcc compiler
> does not generate good 32-bit code for this style of source. This
> applies especially to i386; other supported 32-bit targets (RV32,
> SPARC32) are affected less.
> 

I can't explain to myself why I am doing it, but I did continue the exploration of 32-bit targets. Well, not quite "targets": I don't have SPARC32 or RV32 to play with, so I continued the exploration of i386.
As said above, using the same code for 32-bit and 64-bit does not produce acceptable results. But a pure 32-bit source did better than I expected.
So when I wrote on 2023-01-13 "And I am not sure at all that my implementation of 32-bit targets would be significantly faster than current soft float", I was wrong. My implementation for 32-bit targets (i.e. i386) is significantly faster than the current soft float: up to 3 times faster on Zen3, and approximately 2 times faster on various oldish Intel CPUs.
Today I put the 32-bit sources into my github repository.

I am still convinced that improving the performance of IEEE binary128 on 32-bit targets is a waste of time, but since the time is already wasted, maybe the results can be used.

And may be, it can be used to bring IEEE binary128 to the Arm Cortex-M, where it can be moderately useful in some situations.