Our soft-float routines, which are used for the basic float128 arithmetic (__addtf3, __subtf3, etc.), are much slower than they need to be. Michael S has some routines which are considerably faster, at https://github.com/already5chosen/extfloat , which he would like to contribute to gcc. There is a rather lengthy thread in comp.arch starting with https://groups.google.com/g/comp.arch/c/Izheu-k00Nw .

Current status of the discussion:

The routines currently do not support rounding modes; they support round-to-nearest, ties-to-even only. Adding such support would be feasible. Handling the rounding mode is currently done in libgcc by querying the hardware, leading to a high overhead for each call. This would not be needed if -ffast-math (or a relevant suboption) is specified. The routines would also be suitable as is (under a different name) for Fortran intrinsics such as matmul.

Fortran is a bit special because rounding modes are at their default on procedure entry and are restored on procedure exit (which is why setting rounding modes in a subroutine is a no-op). This would allow keeping a local variable that tracks the rounding mode.

The current idea is something like this: The current behavior of __addtf3 and friends could remain as is; its speed could be improved, but it would still query the hardware. There could be two additional routines for each arithmetic operation. One of them would implement the operation for a specified rounding mode (to be called from Fortran when the relevant IEEE module is in use). The other one would implement round-to-nearest only, for use from Fortran intrinsics and from all other languages if the right flags are given. It would be good to bolt this onto some flag which is used for libgfortran, to make it accessible from C.

Probably gcc 14 material.
Created attachment 54183 [details]
Example patch with Michael S's code just pasted over the libgcc implementation, for a test

As a benchmark: just pasting over the code from the github repo improves gfortran's matmul by almost a factor of two, so significant speedups are possible. The benchmark

module tick
  interface
    function rdtsc() bind(C,name="rdtsc")
      use iso_c_binding
      integer(kind=c_long) :: rdtsc
    end function rdtsc
  end interface
end module tick

program main
  use tick
  use iso_c_binding
  implicit none
  integer, parameter :: wp = selected_real_kind(30)
!  integer, parameter :: n=5000, p=4000, m=3666
  integer, parameter :: n = 1000, p = 1000, m = 1000
  real (kind=wp) :: c(n,p), a(n,m), b(m, p)
  character(len=80) :: line
  integer(c_long) :: t1, t2, t3
  real (kind=wp) :: fl = 2.d0*n*m*p
  integer :: i,j
  print *,wp
  line = '10 10'
  call random_number(a)
  call random_number(b)
  t1 = rdtsc()
  t2 = rdtsc()
  t3 = t2-t1
  print *,t3
  t1 = rdtsc()
  c = matmul(a,b)
  t2 = rdtsc()
  print *,1/(fl/(t2-t1-t3)),"Cycles per operation"
  read (unit=line,fmt=*) i,j
  write (unit=line,fmt=*) c(i,j)
end program main

showed

tkoenig@gcc188:~> ./original
          16
          32
^C
tkoenig@gcc188:~> time ./original
          16
          32
   90.5696151959999999999999999999999997       Cycles per operation

real    1m2,148s
user    1m2,123s
sys     0m0,008s
tkoenig@gcc188:~> time ./modified
          16
          32
   52.8148391719999999999999999999999957       Cycles per operation

real    0m36,296s
user    0m36,278s
sys     0m0,008s

where "original" is the current libgcc soft-float implementation, and "modified" is with the code from the repo. It does not handle exceptions, so this causes a few regressions, but it certainly shows the potential.
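The C side of the rdtsc() function that the Fortran benchmark binds to is not shown above; a plausible implementation (an assumption, not the code actually used) would be:

```c
#include <stdint.h>

/* Plausible C implementation of the rdtsc() routine bound from Fortran
   via bind(C): read the x86 time-stamp counter, or return 0 on other
   architectures. */
long rdtsc (void)
{
#if defined(__x86_64__) || defined(__i386__)
  uint32_t lo, hi;
  __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
  return (long) (((uint64_t) hi << 32) | lo);
#else
  return 0;  /* no cycle counter read available on this target */
#endif
}
```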
From what I can see, they are certainly not portable. E.g. relying on __int128 rules out various arches (basically all 32-bit arches: ia32 and 32-bit powerpc, among others). Not handling exceptions is a show-stopper too. I guess a better time investment would be to improve the performance of the soft-fp versions.
(In reply to Jakub Jelinek from comment #2)
> From what I can see, they are certainly not portable.
> E.g. the relying on __int128 rules out various arches (basically all 32-bit
> arches, ia32, powerpc 32-bit among others).

For this kind of performance improvement on 64-bit systems, we could probably introduce an appropriate #ifdef. Regarding x86 intrinsics, maybe they can be replaced by gcc's vector extensions.

> Not handling exceptions is a show stopper too.

Agreed, we should not replace soft-fp that way.

> Guess better time investment would be to improve performance of the soft-fp
> versions.

I'm not sure; I think we could get an appreciable benefit if we only invoke this kind of routine behind the appropriate suboptions of -ffast-math. For general-purpose code, I see no way around the bottleneck of querying the processor status on each invocation, and that is a waste if the program does not care.
(In reply to Jakub Jelinek from comment #2)
> From what I can see, they are certainly not portable.
> E.g. the relying on __int128 rules out various arches (basically all 32-bit
> arches, ia32, powerpc 32-bit among others). Not handling exceptions is a
> show stopper too.
> Guess better time investment would be to improve performance of the soft-fp
> versions.

The only 32-bit gcc targets [of those represented on Godbolt] that support __float128/_Float128 right now are
- i386 == IA-32
- SPARC
- RV32

I don't know why it is supported there, but I would be surprised if __float128 on RV32 is actually used by anybody. It is just too slow. 32-bit SPARC sounds like a dead platform.

i386 is more problematic. People still use it, and people still develop new programs for it. And since modern i386-capable processors are so fast, the performance of __float128 could be usable, even if it is 5 or 10 times slower than on the same machine running x86-64.

PPC32 is not supported, so no problem from that side.
Hi Thomas,

Are you in or out? If you are still in, I could use your help on several issues.

1. Torture tests.
See if the Invalid Operation exception is raised properly now. Also check whether there are still remaining problems with NaN.

2. Run my correctness tests on as many non-AMD64 targets as you can. Preferably with 100,000,000 iterations, but on weaker hardware 10,000,000 will do.

3. Run my speed tests (tests/matmulq/mm_speed_ma) on a more diverse set of AMD64 computers than I did. Of special interest are:
- AMD Zen3 on Linux running on bare metal
- Intel Skylake, Skylake-X, Tiger/Rocket Lake and Alder Lake on Linux running on bare metal

I realize that doing speed tests is not nearly as simple as correctness tests. We need non-busy (preferably almost idle) machines that have a stable CPU clock rate. It's not easy to find machines like that nowadays. But maybe you can find at least some from the list.

4. Run my speed tests on as many non-obsolete ARM64 computers as you can find. Well, probably wishful thinking on my part.

Also off topic, but of interest: the postprocessed source of matmul_r16.c.
(In reply to Michael_S from comment #5)
> Hi Thomas
> Are you in or out?

Depends a bit on what exactly you want to do, and on whether there is a chance that what you want to do will be incorporated into gcc.

If you want to replace the soft-float routines, you will have to replace them with the full functionality. And there will have to be a decision about 32-bit targets.

> If you are still in, I can use your help on several issues.
>
> 1. Torture.
> See if Invalid Operand exception raised properly now. Also if there are
> still remaining problems with NaN.

I've put your addition/subtraction routines in as a replacement and am running a regression test. We'll see when that finishes.

> 2. Run my correction tests on as many non-AMD64 targets as you can.
> Preferably, with 100,000,000 iterations, but on weaker HW 10,000,000 will do.

This will take some time.

> 3. Run my speed tests (tests/matmulq/mm_speed_ma) on more diverse set of
> AMD64 computers than I did.
> Of special interest are
> - AMD Zen3 on Linux running on bare metal
> - Intel Skylake, SkylakeX, Tiger/Rocket Lake and Alder Lake on Linux running
>   on bare metal
> I realize that doing speed tests is not nearly as simple as correctness
> tests. We need non-busy (preferably almost idle) machines that have stable
> CPU clock rate.

I currently have no access to that sort of hardware (I'm just a volunteer, and my home box is Zen 1).

> 4. Run my speed tests on as many non-obsolete ARM64 computers as you can
> find.
>
> Also off topic but of interest: postprocessed source of matmul_r16.c

Where should I send that to?
Either here or my yahoo e-mail
(In reply to Thomas Koenig from comment #6)
> (In reply to Michael_S from comment #5)
> > Hi Thomas
> > Are you in or out?
>
> Depends a bit on what exactly you want to do, and if there is
> a chance that what you want to do will be incorporated into gcc.

What about incorporation into Fortran? What about incorporation into C under -ffast-math?

> If you want to replace the soft-float routines, you will have to
> replace them with the full functionality.

Full functionality including the Inexact exception that practically nobody uses? That sounds wasteful of perfectly good CPU cycles. Also, I am not so sure that the Inexact exception is fully supported in the existing soft-float library.

Almost-full functionality, with support for non-default rounding modes but without the Inexact exception? I actually implemented that and made a few measurements. You can find the results in the /reports directory in my repo. Summary: the architecture-neutral method causes a very serious slowdown. Less so on slower machines, but a massive 2.5x on the fastest machine (Zen3 under Linux under WSL). The AMD64-specific method causes a smaller slowdown, especially on relatively old Intel cores on Windows (I have no modern Intel cores available for testing). But Zen3/Linux still suffers a 1.45x slowdown. Again, a big waste of perfectly good CPU cycles.

Also, what about other architectures? Should they suffer the "architecture-neutral" slowdown? Even if there are faster methods on other architectures, those methods have to be found by somebody and tested by somebody. This sort of work is time-consuming. And for what?

I also measured the impact of implementing non-default rounding through an additional function parameter. The impact is very small, 0 to 5%. You said on comp.arch that at least for Fortran this could work.

What else is missing for "full functionality"? Surely there are other things that I forgot. Maybe additional exceptions apart from Invalid Operation (which hopefully already works) and apart from Inexact, which I find stupid?
I don't think they are hard to implement or expensive in terms of speed. Just a bit of work and more than a bit of testing.

> And there will have to be a decision about 32-bit targets.

IMHO, 32-bit targets should be left in their current state. People who use them probably do not care deeply about performance. Technically, I could implement 32-bit targets in the same sources, by means of a few ifdefs and macros, but the resulting source code would look much uglier than it does today. Still not the same level of horror that you have in matmul_r16.c, but certainly uglier than I like it to look. And I am not at all sure that my implementation for 32-bit targets would be significantly faster than the current soft-float. In short, it does not sound like a good ROI.

BTW, do you know why the current soft-float supports so few 32-bit targets? Most likely somebody felt just like I do about it: it's not too hard to support more 32-bit targets, but it's not a good ROI.
Created attachment 54273 [details] matmul_r16.i Here is matmul_r16.i from a relatively recent trunk.
What we would need for incorporation into gcc is to have several functions, which would then be called depending on which floating-point options are in force at the time of invocation.

So, let's go through the gcc options to see what would fit where, walking down the options tree depth-first. From the gcc docs:

'-ffast-math'
     Sets the options '-fno-math-errno', '-funsafe-math-optimizations',
     '-ffinite-math-only', '-fno-rounding-math', '-fno-signaling-nans',
     '-fcx-limited-range' and '-fexcess-precision=fast'.

-fno-math-errno is irrelevant in this context, no need to look at that.

'-funsafe-math-optimizations'
     Allow optimizations for floating-point arithmetic that (a) assume
     that arguments and results are valid and (b) may violate IEEE or
     ANSI standards.  When used at link time, it may include libraries
     or startup files that change the default FPU control word or other
     similar optimizations.

     This option is not turned on by any '-O' option since it can result
     in incorrect output for programs that depend on an exact
     implementation of IEEE or ISO rules/specifications for math
     functions.  It may, however, yield faster code for programs that do
     not require the guarantees of these specifications.  Enables
     '-fno-signed-zeros', '-fno-trapping-math', '-fassociative-math' and
     '-freciprocal-math'.

'-fno-signed-zeros'
     Allow optimizations for floating-point arithmetic that ignore the
     signedness of zero.  IEEE arithmetic specifies the behavior of
     distinct +0.0 and -0.0 values, which then prohibits simplification
     of expressions such as x+0.0 or 0.0*x (even with
     '-ffinite-math-only').  This option implies that the sign of a zero
     result isn't significant.

     The default is '-fsigned-zeros'.

I don't think this option is relevant.

'-fno-trapping-math'
     Compile code assuming that floating-point operations cannot
     generate user-visible traps.  These traps include division by zero,
     overflow, underflow, inexact result and invalid operation.  This
     option requires that '-fno-signaling-nans' be in effect.  Setting
     this option may allow faster code if one relies on "non-stop" IEEE
     arithmetic, for example.

     This option should never be turned on by any '-O' option since it
     can result in incorrect output for programs that depend on an exact
     implementation of IEEE or ISO rules/specifications for math
     functions.

     The default is '-ftrapping-math'.

Relevant.

'-ffinite-math-only'
     Allow optimizations for floating-point arithmetic that assume that
     arguments and results are not NaNs or +-Infs.

     This option is not turned on by any '-O' option since it can result
     in incorrect output for programs that depend on an exact
     implementation of IEEE or ISO rules/specifications for math
     functions.  It may, however, yield faster code for programs that do
     not require the guarantees of these specifications.

This does not have further suboptions.  Relevant.

'-fassociative-math'
     Allow re-association of operands in series of floating-point
     operations.  This violates the ISO C and C++ language standard by
     possibly changing computation result.  NOTE: re-ordering may change
     the sign of zero as well as ignore NaNs and inhibit or create
     underflow or overflow (and thus cannot be used on code that relies
     on rounding behavior like '(x + 2**52) - 2**52'.  May also reorder
     floating-point comparisons and thus may not be used when ordered
     comparisons are required.  This option requires that both
     '-fno-signed-zeros' and '-fno-trapping-math' be in effect.
     Moreover, it doesn't make much sense with '-frounding-math'.  For
     Fortran the option is automatically enabled when both
     '-fno-signed-zeros' and '-fno-trapping-math' are in effect.

     The default is '-fno-associative-math'.

Not relevant, I think - this influences compiler optimizations.

'-freciprocal-math'
     Allow the reciprocal of a value to be used instead of dividing by
     the value if this enables optimizations.  For example 'x / y' can
     be replaced with 'x * (1/y)', which is useful if '(1/y)' is subject
     to common subexpression elimination.  Note that this loses
     precision and increases the number of flops operating on the value.

     The default is '-fno-reciprocal-math'.

Again, not relevant.

'-frounding-math'
     Disable transformations and optimizations that assume default
     floating-point rounding behavior.  This is round-to-zero for all
     floating point to integer conversions, and round-to-nearest for all
     other arithmetic truncations.  This option should be specified for
     programs that change the FP rounding mode dynamically, or that may
     be executed with a non-default rounding mode.  This option disables
     constant folding of floating-point expressions at compile time
     (which may be affected by rounding mode) and arithmetic
     transformations that are unsafe in the presence of sign-dependent
     rounding modes.

     The default is '-fno-rounding-math'.

     This option is experimental and does not currently guarantee to
     disable all GCC optimizations that are affected by rounding mode.
     Future versions of GCC may provide finer control of this setting
     using C99's 'FENV_ACCESS' pragma.  This command-line option will be
     used to specify the default state for 'FENV_ACCESS'.

Also no further suboptions.  This is relevant.

'-fsignaling-nans'
     Compile code assuming that IEEE signaling NaNs may generate
     user-visible traps during floating-point operations.  Setting this
     option disables optimizations that may change the number of
     exceptions visible with signaling NaNs.  This option implies
     '-ftrapping-math'.

     This option causes the preprocessor macro '__SUPPORT_SNAN__' to be
     defined.

     The default is '-fno-signaling-nans'.

     This option is experimental and does not currently guarantee to
     disable all GCC optimizations that affect signaling NaN behavior.

Also, no further suboptions.  Relevant.

-fcx-limited-range is not relevant, and neither is -fexcess-precision=fast.

So, unless I missed something, it should be possible to select different functions depending on the values of -ftrapping-math, -ffinite-math-only, -frounding-math and -fsignaling-nans.
Regarding Fortran's matmul: We use -ffast-math when compiling the library functions, so any change we make to any of the -ffast-math suboptions would be used, as well.
(In reply to Thomas Koenig from comment #9)
> Created attachment 54273 [details]
> matmul_r16.i
>
> Here is matmul_r16.i from a relatively recent trunk.

Thank you. Unfortunately, I was not able to link it with a Fortran main. So I still can only guess why, even after replacing __multf3 and __addtf3 with my implementations, it is still more than twice as slow (on Zen3) as it should be.

Looking at the source, and assuming that the inner loop starts at line 8944, this loop looks very strange; but apart from bad programming style and a misunderstanding of what constitutes optimal scheduling, there is nothing criminal about it. Maybe, because of the wrong scheduling, it is 10-15% slower than the best, but it certainly should not be the 2.4 times slower that I am seeing.

Here is what I got from the linker:

/usr/bin/ld: matmul_r16.o: warning: relocation against `_gfortrani_matmul_r16_avx128_fma3' in read-only section `.text'
/usr/bin/ld: matmul_r16.o: in function `matmul_r16_avx':
matmul_r16.c:(.text+0x48): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x342): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x10e4): undefined reference to `_gfortrani_size0'
/usr/bin/ld: matmul_r16.c:(.text+0x10f1): undefined reference to `_gfortrani_xmallocarray'
/usr/bin/ld: matmul_r16.c:(.text+0x12e5): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x13a9): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x13e7): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.o: in function `matmul_r16_avx2':
matmul_r16.c:(.text+0x24c8): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x27c2): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x3564): undefined reference to `_gfortrani_size0'
/usr/bin/ld: matmul_r16.c:(.text+0x3571): undefined reference to `_gfortrani_xmallocarray'
/usr/bin/ld: matmul_r16.c:(.text+0x3765): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x3829): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x3867): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.o: in function `matmul_r16_avx512f':
matmul_r16.c:(.text+0x4948): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x4c47): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x5a32): undefined reference to `_gfortrani_size0'
/usr/bin/ld: matmul_r16.c:(.text+0x5a3f): undefined reference to `_gfortrani_xmallocarray'
/usr/bin/ld: matmul_r16.c:(.text+0x5c35): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x5cfb): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x5d35): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.o: in function `matmul_r16_vanilla':
matmul_r16.c:(.text+0x6de8): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x70e2): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x7e84): undefined reference to `_gfortrani_size0'
/usr/bin/ld: matmul_r16.c:(.text+0x7e91): undefined reference to `_gfortrani_xmallocarray'
/usr/bin/ld: matmul_r16.c:(.text+0x8085): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x8149): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x8187): undefined reference to `_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.o: in function `_gfortran_matmul_r16':
matmul_r16.c:(.text+0x92b7): undefined reference to `_gfortrani_matmul_r16_avx128_fma3'
/usr/bin/ld: matmul_r16.c:(.text+0x92ea): undefined reference to `_gfortrani_matmul_r16_avx128_fma4'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
collect2: error: ld returned 1 exit status
(In reply to Thomas Koenig from comment #10)
> What we would need for incorporation into gcc is to have several
> functions, which would then called depending on which floating point
> options are in force at the time of invocation.
>
> So, let's go through the gcc options, to see what would fit where.
[... gcc option documentation quoted in full in comment #10 snipped ...]
> So, unless I missed something, wit should be possible to select different
> functions depending on the values of -ftrapping-math, -finite-math-only,
> -frounding-math and -fsignalling-nans.
>
> Regarding Fortran's matmul: We use -ffast-math when compiling the library
> functions, so any change we make to any of the -ffast-math suboptions
> would be used, as well.

This set of options does not map too well onto the real difficulties of implementation. There are only two things that are expensive:
1. The Inexact exception
2. Fetching the current rounding mode

The rest of the IEEE-754 features are so cheap that creating separate variants without them is simply not worth the effort of maintaining distinct variants, even if the whole difference is a single three-line #ifdef.

BTW, the Inexact exception can be made fairly affordable with a little help from the compiler. All we need for that is the ability to say "don't remove this floating-point addition even if you don't see that it produces any effect". Something similar to 'volatile'; but with volatile, the compiler currently puts the result of the addition on the stack, which adds undesirable cost. However, judging by Jakub's comment, the compiler maintainers are not particularly interested in this enterprise.
I tried compiling your tests on Apple silicon using Asahi Linux, but without success. A first step was rather easy; replacing __float128 by _Float128 was required. I then bootstrapped gcc on that machine and added the (private) include path for <quadmath.h>, and am now hitting missing __float128 in quadmath.h. Not sure how to proceed from here. The machine is gcc103.fsffrance.org, by the way, of the GCC compile farm.
Seems that libquadmath is not built on that particular Linux/CPU variant, for whatever reason. At last I cannot find any '*quadmath* files in the build directory. /proc/cpuinfo tells me that processor : 0 BogoMIPS : 48.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint CPU implementer : 0x61 CPU architecture: 8 CPU variant : 0x1 CPU part : 0x022 CPU revision : 1 [...] and uname -a is Linux gcc103.fsffrance.org 6.0.0-rc5-asahi-00001-gc62bd3fe430f #1 SMP Sun Sep 18 18:07:57 CEST 2022 aarch64 GNU/Linux So much for testing on Apple silicon.
libquadmath is not needed nor useful on aarch64-linux, because long double type there is already IEEE 754 quad.
(In reply to Jakub Jelinek from comment #15) > libquadmath is not needed nor useful on aarch64-linux, because long double > type there is already IEEE 754 quad. That's good to know. Thank you. If you are here I have one more somewhat related question: is IEEE binary128 arithmetic really not supported under ARMv7 or, like in the case above, it is supported, but not under one of common names (__float128/_Float128) ? And if it is not supported then why? If I had to peek one 32-bit architecture to support then it certainly would be ARMv7, the most used 32-bit architecture in the world that probably outsells the next one by order of magnitude.
It's not part of the ABI for the Arm 32-bit Architecture (AAPCS32). https://github.com/ARM-software/abi-aa/blob/main/aapcs32/aapcs32.rst You can file an issue there if you want, though I don't know how interested the maintainers will be in that optional language feature.
(In reply to Michael_S from comment #12)
> This set of options does not map too well into real difficulties of
> implementation.
> There are only 2 things that are expensive:
> 1. Inexact Exception
> 2. Fetching of the current rounding mode.
> The rest of IEEE-754 features is so cheap that creating separate variants
> without them simply is not worth the effort of maintaining distinct
> variants, even if all difference is a single three-lines #ifdef

In general, reading the current rounding mode is relatively cheap, but modifying it can be expensive, so optimized fenv implementations in GLIBC only modify the FP status if a change is required. It should be feasible to check for round-to-even and use optimized code for that case.

> BTW, Inexact Exception can be made fairly affordable with a little help from
> compiler. All we need for that is ability to say "don't remove this floating
> point addition even if you don't see that it produces any effect".
> Something similar to 'volatile', but with volatile compiler currently puts
> result of addition on stack, which adds undesirable cost.

There are macros in GLIBC's math-barriers.h which do what you want - e.g. for AArch64:

#define math_opt_barrier(x) \
  ({ __typeof (x) __x = (x); __asm ("" : "+w" (__x)); __x; })
#define math_force_eval(x) \
  ({ __typeof (x) __x = (x); __asm __volatile__ ("" : : "w" (__x)); })

The first blocks optimizations (like constant folding) across the barrier; the second forces evaluation of an expression even if it is deemed useless. These are used in many math functions in GLIBC. They are target-specific due to needing inline assembler operands, but it should be easy to add similar definitions to libgcc.
So, if stmxcsr/vstmxcsr is too slow, perhaps we should change x86 sfp-machine.h #define FP_INIT_ROUNDMODE \ do { \ __asm__ __volatile__ ("%vstmxcsr\t%0" : "=m" (_fcw)); \ } while (0) #endif to do that only if not round-to-nearest. E.g. the fast-float library has used, since last November: static volatile float fmin = __FLT_MIN__; float fmini = fmin; // we copy it so that it gets loaded at most once. return (fmini + 1.0f == 1.0f - fmini); as a quick test of whether round-to-nearest is in effect or some other rounding mode. Most likely if this is done with -frounding-math it wouldn't even need the volatile stuff. Of course, if it isn't round-to-nearest, we would need to do the expensive {,v}stmxcsr.
__attribute__((noinline, optimize ("rounding-math"))) static int round_to_nearest (void) { return 1.0f - __FLT_MIN__ == 1.0f + __FLT_MIN__; } and if (round_to_nearest ()) \ _fcw = FP_RND_NEAREST; \ else \ __asm__ __volatile__ ("%vstmxcsr\t%0" : "=m" (_fcw)); \ Except that from _fcw we don't determine just the rounding mode but also what exceptions are enabled.
(In reply to Jakub Jelinek from comment #20) > __attribute__((noinline, optimize ("rounding-math"))) static int > round_to_nearest (void) { return 1.0f - __FLT_MIN__ == 1.0f + __FLT_MIN__; } Wouldn't that always set inexact? > and > if (round_to_nearest ()) \ > _fcw = FP_RND_NEAREST; \ > else \ > __asm__ __volatile__ ("%vstmxcsr\t%0" : "=m" (_fcw)); \ > > Except that from _fcw we don't determine just the rounding mode but also > what exceptions are enabled. Yes, that wouldn't work for fenv, but FP emulation functions don't need to read the exception flags.
(In reply to Michael_S from comment #8) > (In reply to Thomas Koenig from comment #6) > > And there will have to be a decision about 32-bit targets. > > > > IMHO, 32-bit targets should be left in their current state. > People that use them probably do not care deeply about performance. > Technically, I can implement 32-bit targets in the same sources, by means of > few ifdefs and macros, but resulting source code will look much uglier than > how it looks today. Still, not to the same level of horror that you have in > matmul_r16.c, but certainly uglier than how I like it to look. > And I am not sure at all that my implementation of 32-bit targets would be > significantly faster than current soft float. I explored this path (implementing 32-bit and 64-bit targets from the same source with a few ifdefs) a little more: Now I am even more sure that it is not the way to go. The gcc compiler does not generate good 32-bit code from this style of source. This especially applies to i386; the other supported 32-bit targets (RV32, SPARC32) are affected less. In the process I encountered a funny, illogical pessimization in the i386 code generator: https://godbolt.org/z/En6Tredox Routines foo32() and foo64() are semantically identical, but foo32() is written with 32-bit targets in mind, while foo64() is the style of code that will likely be written if one wants to support 32 and 64 bits from the same source with #ifdef. The code generated by gcc for foo32() is reasonable: as in the source, we can see 8 multiplications. The code generated by gcc for foo64() is anything but reasonable: somehow the compiler invented 5 more multiplications, for a total of 13. Maybe it deserves a separate bug report.
(In reply to Jakub Jelinek from comment #19) > So, if stmxcsr/vstmxcsr is too slow, perhaps we should change x86 > sfp-machine.h > #define FP_INIT_ROUNDMODE \ > do { \ > __asm__ __volatile__ ("%vstmxcsr\t%0" : "=m" (_fcw)); \ > } while (0) > #endif > to do that only if not round-to-nearest. > E.g. the fast-float library uses since last November: > static volatile float fmin = __FLT_MIN__; > float fmini = fmin; // we copy it so that it gets loaded at most once. > return (fmini + 1.0f == 1.0f - fmini); > as a quick test whether round-to-nearest is in effect or some other rounding > mode. > Most likely if this is done with -frounding-math it wouldn't even need the > volatile stuff. Of course, if it isn't round-to-nearest, we would need to > do the expensive {,v}stmxcsr. I agree with Wilco. This trick is problematic due to its effect on the inexact flag. Also, I don't quite understand how you got to setting the rounding mode. I don't need to set the rounding mode, I just need to read the current rounding mode. Doing it in a portable way, i.e. by fegetround(), is slow, mostly due to various overheads. Doing it in a non-portable way on x86-64 (by _MM_GET_ROUNDING_MODE()) is not slow on Intel, but still pretty slow on AMD Zen3, although even on Zen3 it is much faster than fegetround(). Results of measurements are here: https://github.com/already5chosen/extfloat/blob/master/binary128/reports/rm-impact.txt Anyway, I'd very much prefer a portable solution over a multitude of ifdefs. It is a pity that gcc doesn't implement FLT_ROUNDS like other compilers. But, then again, it is a pity that gcc doesn't implement a few other things implemented by other compilers that could make the life of developers of portable multiprecision routines in general, and of soft float in particular, so much easier.
(In reply to Michael_S from comment #22) > (In reply to Michael_S from comment #8) > > (In reply to Thomas Koenig from comment #6) > > > And there will have to be a decision about 32-bit targets. > > > > > > > IMHO, 32-bit targets should be left in their current state. > > People that use them probably do not care deeply about performance. > > Technically, I can implement 32-bit targets in the same sources, by means of > > few ifdefs and macros, but resulting source code will look much uglier than > > how it looks today. Still, not to the same level of horror that you have in > > matmul_r16.c, but certainly uglier than how I like it to look. > > And I am not sure at all that my implementation of 32-bit targets would be > > significantly faster than current soft float. > > I explored this path (implementing 32-bit and 64-bit targets from the same > source with few ifdefs) a little more: > Now I am even more sure that it is not a way to go. gcc compiler does not > generate good 32-bit code for this style of sources. This especially applies > to i386, other supported 32-bit targets (RV32, SPARC32) are affected less. > I can't explain to myself why I am doing it, but I continued exploring 32-bit targets. Well, not quite "targets": I don't have SPARC32 or RV32 to play with, so I continued exploring i386. As said above, using the same code for 32-bit and 64-bit does not produce acceptable results. But a pure 32-bit source did better than I expected. So when I wrote on 2023-01-13 that "I am not sure at all that my implementation of 32-bit targets would be significantly faster than current soft float", I was wrong. My implementation for 32-bit targets (i.e. i386) is significantly faster than the current soft float: up to 3 times faster on Zen3, and approximately 2 times faster on various oldish Intel CPUs. Today I put the 32-bit sources into my github repository.
I am still convinced that improving the performance of IEEE binary128 on 32-bit targets is a waste of time, but since the time is already wasted, maybe the results can be used. And maybe it can be used to bring IEEE binary128 to the Arm Cortex-M, where it could be moderately useful in some situations.