[PATCH] libgcc: Thumb-1 Floating-Point Library for Cortex M0

Thu Nov 26 09:14:09 GMT 2020

Hi,

On Fri, 13 Nov 2020 at 00:03, Daniel Engel <libgcc@danielengel.com> wrote:
>
> Hi,
>
> This patch adds an efficient assembly-language implementation of IEEE 754-compliant floating-point routines for the Cortex-M0 EABI (ARMv6-M, Thumb-1). This is the libgcc portion of a larger library originally described in 2018:
>
>     https://gcc.gnu.org/legacy-ml/gcc/2018-11/msg00043.html
>
> Since that time, I've separated the libm functions for submission to newlib.  The remaining libgcc functions in the attached patch have the following characteristics:
>
>     Function(s)                     Size (bytes)        Cycles          Stack   Accuracy
>     __clzsi2                        42                  23              0       exact
>     __clzsi2 (OPTIMIZE_SIZE)        22                  55              0       exact
>     __clzdi2                        8+__clzsi2          4+__clzsi2      0       exact
>
>     __umulsidi3                     44                  24              0       exact
>     __mulsidi3                      30+__umulsidi3      24+__umulsidi3  8       exact
>     __muldi3 (__aeabi_lmul)         10+__umulsidi3      6+__umulsidi3   0       exact
>     __ashldi3 (__aeabi_llsl)        22                  13              0       exact
>     __lshrdi3 (__aeabi_llsr)        22                  13              0       exact
>     __ashrdi3 (__aeabi_lasr)        22                  13              0       exact
>
>     __aeabi_lcmp                    20                  13              0       exact
>     __aeabi_ulcmp                   16                  10              0       exact
>
>     __udivsi3 (__aeabi_uidiv)       56                  72 – 385        0       < 1 lsb
>     __divsi3 (__aeabi_idiv)         38+__udivsi3        26+__udivsi3    8       < 1 lsb
>     __udivdi3 (__aeabi_uldiv)       164                 103 – 1394      16      < 1 lsb
>     __udivdi3 (OPTIMIZE_SIZE)       142                 120 – 1392      16      < 1 lsb
>     __divdi3 (__aeabi_ldiv)         54+__udivdi3        36+__udivdi3    32      < 1 lsb
>
>     __shared_float                  178
>     __shared_float (OPTIMIZE_SIZE)  154
>
>     __addsf3 (__aeabi_fadd)         116+__shared_float  31 – 76         8       <= 0.5 ulp
>     __addsf3 (OPTIMIZE_SIZE)        112+__shared_float  74              8       <= 0.5 ulp
>     __subsf3 (__aeabi_fsub)         8+__addsf3          6+__addsf3      8       <= 0.5 ulp
>     __aeabi_frsub                   8+__addsf3          6+__addsf3      8       <= 0.5 ulp
>     __mulsf3 (__aeabi_fmul)         112+__shared_float  73 – 97         8       <= 0.5 ulp
>     __mulsf3 (OPTIMIZE_SIZE)        96+__shared_float   93              8       <= 0.5 ulp
>     __divsf3 (__aeabi_fdiv)         132+__shared_float  83 – 361        8       <= 0.5 ulp
>     __divsf3 (OPTIMIZE_SIZE)        120+__shared_float  263 – 359       8       <= 0.5 ulp
>
>     __cmpsf2/__lesf2/__ltsf2        72                  33              0       exact
>     __eqsf2/__nesf2                 4+__cmpsf2          3+__cmpsf2      0       exact
>     __gesf2/__gtsf2                 4+__cmpsf2          3+__cmpsf2      0       exact
>     __unordsf2 (__aeabi_fcmpun)     4+__cmpsf2          3+__cmpsf2      0       exact
>     __aeabi_fcmpeq                  4+__cmpsf2          3+__cmpsf2      0       exact
>     __aeabi_fcmpne                  4+__cmpsf2          3+__cmpsf2      0       exact
>     __aeabi_fcmplt                  4+__cmpsf2          3+__cmpsf2      0       exact
>     __aeabi_fcmple                  4+__cmpsf2          3+__cmpsf2      0       exact
>     __aeabi_fcmpge                  4+__cmpsf2          3+__cmpsf2      0       exact
>
>     __floatundisf (__aeabi_ul2f)    14+__shared_float   40 – 81         8       <= 0.5 ulp
>     __floatundisf (OPTIMIZE_SIZE)   14+__shared_float   40 – 237        8       <= 0.5 ulp
>     __floatunsisf (__aeabi_ui2f)    0+__floatundisf     1+__floatundisf 8       <= 0.5 ulp
>     __floatdisf (__aeabi_l2f)       14+__floatundisf    7+__floatundisf 8       <= 0.5 ulp
>     __floatsisf (__aeabi_i2f)       0+__floatdisf       1+__floatdisf   8       <= 0.5 ulp
>
>     __fixsfdi (__aeabi_f2lz)        74                  27 – 33         0       exact
>     __fixunssfdi (__aeabi_f2ulz)    4+__fixsfdi         3+__fixsfdi     0       exact
>     __fixsfsi (__aeabi_f2iz)        52                  19              0       exact
>     __fixsfsi (OPTIMIZE_SIZE)       4+__fixsfdi         3+__fixsfdi     0       exact
>     __fixunssfsi (__aeabi_f2uiz)    4+__fixsfsi         3+__fixsfsi     0       exact
>
>     __extendsfdf2 (__aeabi_f2d)     42+__shared_float   38              8       exact
>     __aeabi_d2f                     56+__shared_float   54 – 58         8       <= 0.5 ulp
>     __aeabi_h2f                     34+__shared_float   34              8       exact
>     __aeabi_f2h                     84                  23 – 34         0       <= 0.5 ulp
>
> Copyright assignment is on file with the FSF.
>
> I've built the gcc-arm-none-eabi cross-compiler using the 20201108 snapshot of GCC plus this patch, and successfully compiled a test program:
>
>     extern int main (void)
>     {
>         volatile int x = 1;
>         volatile unsigned long long int y = 10;
>         volatile long long int z = x / y; // 64-bit division
>
>         volatile float a = x; // 32-bit casting
>         volatile float b = y; // 64 bit casting
>         volatile float c = z / b; // float division
>         volatile float d = a + c; // float addition
>         volatile float e = c * b; // float multiplication
>         volatile float f = d - e - c; // float subtraction
>
>         if (f != c) // float comparison
>             y -= (long long int)d; // float casting
>     }
>
> As one point of comparison, the test program links to 876 bytes of libgcc code from the patched toolchain, vs. 10276 bytes from the latest released gcc-arm-none-eabi-9-2020-q2 toolchain. That's a size reduction of more than 90%.

This looks awesome!
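
For reference, the quoted figures (10276 bytes before, 876 bytes after) work out to a reduction of roughly 91.5%. A minimal sketch of the arithmetic follows; the helper name reduction_permille is hypothetical and not part of the patch:

```c
/* Hypothetical helper (not from the patch): size reduction expressed in
 * tenths of a percent, rounded down.  Assumes before > 0, after <= before,
 * and that (before - after) * 1000 fits in an unsigned long. */
unsigned reduction_permille(unsigned long before, unsigned long after)
{
    return (unsigned)(((before - after) * 1000UL) / before);
}
```

With the numbers above, reduction_permille(10276, 876) gives 914, i.e. 91.4% after rounding down.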

>
> I have extensive test vectors, and have passed these tests on an STM32F051.  These vectors were derived from UCB [1], TestFloat [2], and IEEECC754 [3] sources, plus some of my own creation.  Unfortunately, I'm not sure how "make check" should work for a cross-compiler run-time library.
>
> Although I believe this patch can be incorporated as-is, there are at least two points that might bear discussion:
>
> * I'm not sure where or how they would be integrated, but I would be happy to provide sources for my test vectors.
>
> * The library is currently built for the ARMv6-M architecture only.  It is likely that some of the other Cortex variants would benefit from these routines.  However, I would need some guidance on this to proceed without introducing regressions.  I do not currently have a test strategy for architectures beyond Cortex-M0, and I have NOT profiled the existing Thumb-2 implementations (ieee754-sf.S) for comparison.
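
On the test-vector question, one possible shape for a self-checking test is sketched below. It is only an illustration under assumptions: the names ref_clz64 and check_clz64 are hypothetical, unsigned long long is assumed to be 64 bits wide, and the premise is that on a target without a CLZ instruction the GCC builtin __builtin_clzll is expected to lower to a libgcc call such as __clzdi2. It compares that result against a portable reference loop:

```c
/* Portable reference (hypothetical name): leading zeros of a NONZERO
 * 64-bit value, found by shifting left until the top bit is set. */
static int ref_clz64(unsigned long long x)
{
    int n = 0;
    while ((x & (1ULL << 63)) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Compare the compiler's 64-bit clz against the reference for a small
 * set of inputs; returns the number of mismatches (0 means pass). */
int check_clz64(void)
{
    int failures = 0;
    for (int bit = 0; bit < 64; bit++) {
        /* volatile keeps the compiler from folding the builtin away */
        volatile unsigned long long in =
            (1ULL << bit) | ((1ULL << bit) >> 1);
        unsigned long long v = in;
        if (__builtin_clzll(v) != ref_clz64(v))
            failures++;
    }
    return failures;
}
```

A test's main could return check_clz64() so that a nonzero exit status marks a failure; the same pattern would extend to the other integer routines in the table.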

I tried your patch, and I see many regressions in the GCC testsuite
because many tests fail to link with errors like:

    ld: /gcc/thumb/v6-m/nofp/libgcc.a(_arm_cmpdf2.o): in function `__clzdi2':
    /libgcc/config/arm/cm0/clz2.S:39: multiple definition of `__clzdi2'; /gcc/thumb/v6-m/nofp/libgcc.a(_thumb1_case_sqi.o):/libgcc/config/arm/cm0/clz2.S:39: first defined here

This happens with a toolchain configured with --target arm-none-eabi,
default cpu/fpu/mode, --enable-multilib and --with-multilib-list=rmprofile,
running the tests with
-mthumb/-mcpu=cortex-m0/-mfloat-abi=soft/-march=armv6s-m.

Does it work for you?
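
One possible reading of the error, not confirmed against the patch: __clzdi2 ends up in more than one archive member (_arm_cmpdf2.o and _thumb1_case_sqi.o above). libgcc builds each of those objects from the same assembly source with a different -DL_<name> define, so every routine is normally wrapped in a matching #ifdef and lands in exactly one object. The C sketch below only illustrates that guard pattern; the demo_* names are hypothetical, and the #define stands in for the command-line define:

```c
/* Stand-in for the -DL_clzdi2 that the build would pass for one object. */
#define L_clzdi2

#ifdef L_clzsi2
/* Emitted only into the object built with L_clzsi2 defined.
 * Assumes a 32-bit unsigned int; returns 32 for x == 0. */
int demo_clzsi2(unsigned x)
{
    int n = 0;
    while (n < 32 && (x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}
#endif

#ifdef L_clzdi2
/* Emitted only into the object built with L_clzdi2 defined.
 * Returns 64 for x == 0. */
int demo_clzdi2(unsigned long long x)
{
    int n = 0;
    while (n < 64 && (x & (1ULL << 63)) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}
#endif

/* Reports which routines this translation unit defines:
 * bit 0 for demo_clzsi2, bit 1 for demo_clzdi2. */
int demo_defined_mask(void)
{
    int mask = 0;
#ifdef L_clzsi2
    mask |= 1;
#endif
#ifdef L_clzdi2
    mask |= 2;
#endif
    return mask;
}
```

Without such a guard, a definition included into the shared source is assembled into every object built from it, which the linker then reports as a multiple definition once two of those objects are pulled in.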

Thanks,

Christophe

>
> I'm naturally hoping for some action on this patch before the Nov 16th deadline for GCC-11 stage 3.  Please review and advise.
>
> Thanks,
> Daniel Engel
>
> [1] http://www.netlib.org/fp/ucbtest.tgz
> [2] http://www.jhauser.us/arithmetic/TestFloat.html
> [3] http://win-www.uia.ac.be/u/cant/ieeecc754.html