[PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0

Wed Jan 6 17:05:47 GMT 2021

On 06/01/2021 11:20, Daniel Engel wrote:
> Hi Christophe, 
> 
> On Wed, Dec 16, 2020, at 9:15 AM, Christophe Lyon wrote:
>> On Wed, 2 Dec 2020 at 04:31, Daniel Engel <libgcc@danielengel.com> wrote:
>>>
>>> Hi Christophe,
>>>
>>> On Thu, Nov 26, 2020, at 1:14 AM, Christophe Lyon wrote:
>>>> Hi,
>>>>
>>>> On Fri, 13 Nov 2020 at 00:03, Daniel Engel <libgcc@danielengel.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> This patch adds an efficient assembly-language implementation of IEEE-
>>>>> 754 compliant floating point routines for Cortex M0 EABI (v6m, thumb-
>>>>> 1).  This is the libgcc portion of a larger library originally
>>>>> described in 2018:
>>>>>
>>>>>     https://gcc.gnu.org/legacy-ml/gcc/2018-11/msg00043.html
>>>>>
>>>>> Since that time, I've separated the libm functions for submission to
>>>>> newlib.  The remaining libgcc functions in the attached patch have
>>>>> the following characteristics:
>>>>>
>>>>>     Function(s)                     Size (bytes)        Cycles          Stack   Accuracy
>>>>>     __clzsi2                        42                  23              0       exact
>>>>>     __clzsi2 (OPTIMIZE_SIZE)        22                  55              0       exact
>>>>>     __clzdi2                        8+__clzsi2          4+__clzsi2      0       exact
>>>>>
>>>>>     __umulsidi3                     44                  24              0       exact
>>>>>     __mulsidi3                      30+__umulsidi3      24+__umulsidi3  8       exact
>>>>>     __muldi3 (__aeabi_lmul)         10+__umulsidi3      6+__umulsidi3   0       exact
>>>>>     __ashldi3 (__aeabi_llsl)        22                  13              0       exact
>>>>>     __lshrdi3 (__aeabi_llsr)        22                  13              0       exact
>>>>>     __ashrdi3 (__aeabi_lasr)        22                  13              0       exact
>>>>>
>>>>>     __aeabi_lcmp                    20                   13             0       exact
>>>>>     __aeabi_ulcmp                   16                  10              0       exact
>>>>>
>>>>>     __udivsi3 (__aeabi_uidiv)       56                  72 – 385        0       < 1 lsb
>>>>>     __divsi3 (__aeabi_idiv)         38+__udivsi3        26+__udivsi3    8       < 1 lsb
>>>>>     __udivdi3 (__aeabi_uldiv)       164                 103 – 1394      16      < 1 lsb
>>>>>     __udivdi3 (OPTIMIZE_SIZE)       142                 120 – 1392      16      < 1 lsb
>>>>>     __divdi3 (__aeabi_ldiv)         54+__udivdi3        36+__udivdi3    32      < 1 lsb
>>>>>
>>>>>     __shared_float                  178
>>>>>     __shared_float (OPTIMIZE_SIZE)  154
>>>>>
>>>>>     __addsf3 (__aeabi_fadd)         116+__shared_float  31 – 76         8       <= 0.5 ulp
>>>>>     __addsf3 (OPTIMIZE_SIZE)        112+__shared_float  74              8       <= 0.5 ulp
>>>>>     __subsf3 (__aeabi_fsub)         8+__addsf3          6+__addsf3      8       <= 0.5 ulp
>>>>>     __aeabi_frsub                   8+__addsf3          6+__addsf3      8       <= 0.5 ulp
>>>>>     __mulsf3 (__aeabi_fmul)         112+__shared_float  73 – 97         8       <= 0.5 ulp
>>>>>     __mulsf3 (OPTIMIZE_SIZE)        96+__shared_float   93              8       <= 0.5 ulp
>>>>>     __divsf3 (__aeabi_fdiv)         132+__shared_float  83 – 361        8       <= 0.5 ulp
>>>>>     __divsf3 (OPTIMIZE_SIZE)        120+__shared_float  263 – 359       8       <= 0.5 ulp
>>>>>
>>>>>     __cmpsf2/__lesf2/__ltsf2        72                  33              0       exact
>>>>>     __eqsf2/__nesf2                 4+__cmpsf2          3+__cmpsf2      0       exact
>>>>>     __gesf2/__gesf2                 4+__cmpsf2          3+__cmpsf2      0       exact
>>>>>     __unordsf2 (__aeabi_fcmpun)     4+__cmpsf2          3+__cmpsf2      0       exact
>>>>>     __aeabi_fcmpeq                  4+__cmpsf2          3+__cmpsf2      0       exact
>>>>>     __aeabi_fcmpne                  4+__cmpsf2          3+__cmpsf2      0       exact
>>>>>     __aeabi_fcmplt                  4+__cmpsf2          3+__cmpsf2      0       exact
>>>>>     __aeabi_fcmple                  4+__cmpsf2          3+__cmpsf2      0       exact
>>>>>     __aeabi_fcmpge                  4+__cmpsf2          3+__cmpsf2      0       exact
>>>>>
>>>>>     __floatundisf (__aeabi_ul2f)    14+__shared_float   40 – 81         8       <= 0.5 ulp
>>>>>     __floatundisf (OPTIMIZE_SIZE)   14+__shared_float   40 – 237        8       <= 0.5 ulp
>>>>>     __floatunsisf (__aeabi_ui2f)    0+__floatundisf     1+__floatundisf 8       <= 0.5 ulp
>>>>>     __floatdisf (__aeabi_l2f)       14+__floatundisf    7+__floatundisf 8       <= 0.5 ulp
>>>>>     __floatsisf (__aeabi_i2f)       0+__floatdisf       1+__floatdisf   8       <= 0.5 ulp
>>>>>
>>>>>     __fixsfdi (__aeabi_f2lz)        74                  27 – 33         0       exact
>>>>>     __fixunssfdi (__aeabi_f2ulz)    4+__fixsfdi         3+__fixsfdi     0       exact
>>>>>     __fixsfsi (__aeabi_f2iz)        52                  19              0       exact
>>>>>     __fixsfsi (OPTIMIZE_SIZE)       4+__fixsfdi         3+__fixsfdi     0       exact
>>>>>     __fixunssfsi (__aeabi_f2uiz)    4+__fixsfsi         3+__fixsfsi     0       exact
>>>>>
>>>>>     __extendsfdf2 (__aeabi_f2d)     42+__shared_float 38             8     exact
>>>>>     __aeabi_d2f                     56+__shared_float 54 – 58     8     <= 0.5 ulp
>>>>>     __aeabi_h2f                     34+__shared_float 34             8     exact
>>>>>     __aeabi_f2h                     84                 23 – 34         0     <= 0.5 ulp
>>>>>
>>>>> Copyright assignment is on file with the FSF.
>>>>>
>>>>> I've built the gcc-arm-none-eabi cross-compiler using the 20201108
>>>>> snapshot of GCC plus this patch, and successfully compiled a test
>>>>> program:
>>>>>
>>>>>     extern int main (void)
>>>>>     {
>>>>>         volatile int x = 1;
>>>>>         volatile unsigned long long int y = 10;
>>>>>         volatile long long int z = x / y; // 64-bit division
>>>>>
>>>>>         volatile float a = x; // 32-bit casting
>>>>>         volatile float b = y; // 64 bit casting
>>>>>         volatile float c = z / b; // float division
>>>>>         volatile float d = a + c; // float addition
>>>>>         volatile float e = c * b; // float multiplication
>>>>>         volatile float f = d - e - c; // float subtraction
>>>>>
>>>>>         if (f != c) // float comparison
>>>>>             y -= (long long int)d; // float casting
>>>>>     }
>>>>>
>>>>> As one point of comparison, the test program links to 876 bytes of
>>>>> libgcc code from the patched toolchain, vs 10276 bytes from the
>>>>> latest released gcc-arm-none-eabi-9-2020-q2 toolchain.    That's a
>>>>> 90% size reduction.
>>>>
>>>> This looks awesome!
>>>>
>>>>>
>>>>> I have extensive test vectors, and have passed these tests on an
>>>>> STM32F051.  These vectors were derived from UCB [1], Testfloat [2],
>>>>> and IEEECC754 [3] sources, plus some of my own creation.
>>>>> Unfortunately, I'm not sure how "make check" should work for a cross
>>>>> compiler run time library.
>>>>>
>>>>> Although I believe this patch can be incorporated as-is, there are
>>>>> at least two points that might bear discussion:
>>>>>
>>>>> * I'm not sure where or how they would be integrated, but I would be
>>>>>   happy to provide sources for my test vectors.
>>>>>
>>>>> * The library is currently built for the ARM v6m architecture only.
>>>>>   It is likely that some of the other Cortex variants would benefit
>>>>>   from these routines.  However, I would need some guidance on this
>>>>>   to proceed without introducing regressions.  I do not currently
>>>>>   have a test strategy for architectures beyond Cortex M0, and I
>>>>>   have NOT profiled the existing thumb-2 implementations (ieee754-
>>>>>   sf.S) for comparison.
>>>>
>>>> I tried your patch, and I see many regressions in the GCC testsuite
>>>> because many tests fail to link with errors like:
>>>> ld: /gcc/thumb/v6-m/nofp/libgcc.a(_arm_cmpdf2.o): in function
>>>> `__clzdi2':
>>>> /libgcc/config/arm/cm0/clz2.S:39: multiple definition of
>>>> `__clzdi2';/gcc/thumb/v6-m/nofp/libgcc.a(_thumb1_case_sqi.o):/libgcc/config/arm/cm0/clz2.S:39:
>>>> first defined here
>>>>
>>>> This happens with a toolchain configured with --target arm-none-eabi,
>>>> default cpu/fpu/mode,
>>>> --enable-multilib --with-multilib-list=rmprofile and running the tests with
>>>> -mthumb/-mcpu=cortex-m0/-mfloat-abi=soft/-march=armv6s-m
>>>>
>>>> Does it work for you?
>>>
>>> Thanks for the feedback.
>>>
>>> I'm afraid I'm quite ignorant as to the gcc test suite
>>> infrastructure, so I don't know how to use the options you've shared
>>> above.  I'm cross- compiling the Windows toolchain on Ubuntu.  Would
>>> you mind sharing a full command line you would use for testing?  The
>>> toolchain is built with the default options, which includes "--
>>> target arm-none-eabi".
>>>
>>
>> Why put Windows in the picture? This seems unnecessarily
>> complicated... I suggest you build your cross-toolchain on x86_64
>> ubuntu and run it on x86_64 ubuntu (of course targetting arm)
> 
> Mostly because I had not previously committed the time to understand the
> GCC regression test environment.  My company and personal computers both
> run Windows.  I created an Ubuntu virtual machine for this project, and
> I'd been trying to get by with the build scripts provided by the ARM
> toolchain.  Clearly that was insufficient.
> 
>> The above options where GCC configure options, except for the last one
>> which I used when running the tests.
>>
>> There is some documentation about how to run the GCC testsuite there:
>> https://gcc.gnu.org/install/test.html
> 
> Thanks.  I was able to take this document, plus some additional pages
> about constructing a combined tree with newlib, and put together a
> working regression test.  GDB didn't want to build cleanly at first, so
> eventually I gave up and disabled that part.
> 
>> Basically 'make check' should mostly work except for execution tests
>> for which you'll need to teach DejaGnu how to run the generated
>> programs on a real board or on a simulator.
>>
>> I didn't analyze your patch, I just submitted it to my validation
>> system:
>> https://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/r11-5993-g159b0bd9ce263dfb791eff5133b0ca0207201c84-cortex-m0-fplib-20201130.patch2/report-build-info.html
>> - the red "regressed" items indicate regressions in the testsuite. You
>>   can click on "log" to download the corresponding gcc.log
>> - the dark-red "build broken" items indicate that the toolchain build
>>   failed
>> - the orange "interrupted" items indicate an infrastructure problem,
>>   so you can ignore such cases
>> - similarly the dark red "ref build failed" indicate that the
>>   reference build failed for some infrastructure reason
>>
>> for the arm-none-eabi target, several toolchain versions fail to
>> build, some succeed. This is because I use different multilib
>> configuration flags, it looks like the ones involving --with-
>> multilib=rmprofile are broken with your patch.
>>
>> These ones should be reasonably easy to fix: no 'make check' involved.
>>
>> For instance if you configure GCC with:
>> --target arm-none-eabi --enable-multilib --with-multilib-list=rmprofile
>> you should see the build failure.
> 
> So far, I have not found a cause for the build failures you are seeing.
> The ARM toolchain script I was using before did build with the
> 'rmprofile' option.  With my current configure options, gcc builds
> 'rmprofile', 'aprofile', and even 'armeb'.  I did find a number of link
> issues with 'make check' due to incorrect usage of the 'L_'  defines in
> LIB1ASMFUNCS.  These are fixed in the new version attached.
> 
> Returning to the build failures you logged, I do consistently see this
> message in the logs [1]: "fatal error: cm0/fplib.h: No such file or
> directory".  I recognize the file, since it's one of the new files in
> my patch (the full sub-directory is libgcc/config/arm/cm0/fplib.h).
> Do I have to format patches in some different way so that new files
> get created?
> 
> Regression testing also showed that the previous patch was failing the
> "arm/divzero" test because I wasn't providing the same arguments to
> div0() as the existing implementation.  Having made that change, I think
> the patch is clean.  (I don't think there is a strict specification for
> div0(), and the changes add a non-trivial number of instructions, but
> I'll hold that discussion for another time).
> 
> Do you have time to re-check this patch on your build system?
> 
> Thanks,
> Daniel
> 
> [1] Line 36054: <https://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/r11-5993-g159b0bd9ce263dfb791eff5133b0ca0207201c84-cortex-m0-fplib-20201130.patch2/arm-none-eabi/build-rh70-arm-none-eabi-default-default-default-mthumb.-mcpu=cortex-m0.-mfloat-abi=soft.-march=armv6s-m.log.xz>
> 
>>
>> HTH
>>
>> Christophe
>>
>>> I did see similar errors once before.  It turned out then that I omitted
>>> one of the ".S" files from the build.  My interpretation at that point
>>> was that gcc had been searching multiple versions of "libgcc.a" and
>>> unable to merge the symbols.  In hindsight, that was a really bad
>>> interpretation.   I was able to reproduce the error above by simply
>>> adding a line like "volatile double m = 1.0; m += 2;".
>>>
>>> After reviewing the existing asm implementations more closely, I
>>> believe that I have not been using the function guard macros (L_arm_*)
>>> as intended.  The make script appears to compile "lib1funcs.S" dozens of
>>> times -- once for each function guard macro listed in LIB1ASMFUNCS --
>>> with the intent of generating a separate ".o" file for each function.
>>> Because they were unguarded, my new library functions were duplicated
>>> into every ".o" file, which caused the link errors you saw.
>>>
>>> I have attached an updated patch that implements the macros.
>>>
>>> However, I'm not sure whether my usage is really consistent with the
>>> spirit of the make script.  If there's a README or HOWTO, I haven't
>>> found it yet.  The following points summarize my concerns as I was
>>> making these updates:
>>>
>>> 1.  While some of the new functions (e.g. __cmpsf2) are standalone,
>>>     there is a common core in the new library shared by several related
>>>     functions.  That keeps the library small.  For now, I've elected to
>>>     group all of these related functions together in a single object
>>>     file "_arm_addsubsf3.o" to protect the short branches (+/-2KB)
>>>     within this unit.  Notice that I manually assigned section names in
>>>     the code, so there still shouldn't be any unnecessary code linked in
>>>     the final build.  Does the multiple-".o" files strategy predate "-gc-
>>>     sections", or should I be trying harder to break these related
>>>     functions into separate compilation units?
>>>
>>> 2.  I introduced a few new macro keywords for functions/groups (e.g.
>>>     "_arm_f2h" and '_arm_f2h'.  My assumption is that some empty ".o"
>>>     files compiled for the non-v6m architectures will be benign.
>>>
>>> 3.  The "t-elf" make script implies that __mulsf3() should not be
>>>     compiled in thumb mode (it's inside a conditional), but this is one
>>>     of the new functions.  Moot for now, since my __mulsf3() is grouped
>>>     with the common core functions (see point 1) and is thus currently
>>>     guarded by the "_arm_addsubsf3.o" macro.
>>>
>>> 4.  The advice (in "ieee754-sf.S") regarding WEAK symbols does not seem
>>>     to be working.  I have defined __clzsi2() as a weak symbol to be
>>>     overridden by the combined function __clzdi2().  I can also see
>>>     (with "nm") that "clzsi2.o" is compiled before "clzdi2.o" in
>>>     "libgcc.a".  Yet, the full __clzdi2() function (8 bytes larger) is
>>>     always linked, even in programs that only call __clzsi2(),  A minor
>>>     annoyance at this point.
>>>
>>> 5.  Is there a permutation of the makefile that compiles libgcc with
>>>     __OPTIMIZE_SIZE__?  There are a few sections in the patch that can
>>>     optimize either way, yet the final product only seems to have the
>>>     "fast" code.  At this optimization level, the sample program above
>>>     pulls in 1012 bytes of library code instead of 836. Perhaps this is
>>>     meant to be controlled by the toolchain configuration step, but it
>>>     doesn't follow that the optimization for the cross-compiler would
>>>     automatically translate to the target runtime libraries.
>>>
>>> Thanks again,
>>> Daniel
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Christophe
>>>>
>>>>>
>>>>> I'm naturally hoping for some action on this patch before the Nov 16th deadline for GCC-11 stage 3.  Please review and advise.
>>>>>
>>>>> Thanks,
>>>>> Daniel Engel
>>>>>
>>>>> [1] http://www.netlib.org/fp/ucbtest.tgz
>>>>> [2] http://www.jhauser.us/arithmetic/TestFloat.html
>>>>> [3] http://win-www.uia.ac.be/u/cant/ieeecc754.html
>>>>
>>
> 

Thanks for working on this, Daniel.

This is clearly stage1 material, so we've got time for a couple of
iterations to sort things out.

Firstly, the patch is very large, but contains a large number of
distinct changes, so it would really benefit from being broken down into
a number of distinct patches.  This will make reviewing the individual
changes much more straight-forward.  I'd suggest:

1) Some basic makefile cleanups to ease initial integration - in
particular where we have things like

LIB1FUNCS += <long list of functions>

that this be rewritten with one function per line (and sorted
alphabetically) - then we can see which functions are being changed in
subsequent patches.  It makes the Makefile fragments longer, but the
improvement in clarity for makes this worthwhile.

2) The changes for the existing integer functions - preferably one
function per patch.

3) The new integer functions that you're adding

4) The floating-point support.

Some more general observations:

- where functions are already in lib1funcs.asm, please leave them there.
- lets avoid having the cm0 subdirectory - in particular we should do
this when there is existing code for other targets in the same source
files.  It's OK to have any new files in the main 'arm' directory of the
source tree - just name the files appropriately if really needed.
- let's avoid the CM0 prefix - this is 'thumb1' code, for want of a
better term, and that is used widely elsewhere in the compiler.  So if
you really need a term just use THUMB1, or even T1.
- For the 64-bit shift functions, I expect the existing code to be
preferable whenever possible - I think it can even be tweaked to build
for thumb2 by inserting a suitable IT instruction.  So your code should
only be used when

 #if !__ARM_ARCH_ISA_ARM && __ARM_ARCH_ISA_THUMB == 1

- most, if not all, of your LSYM symbols should not be needed after
assembly, so should start with a captial 'L' (and no leading
underscores), the assembler will then automatically discard any that are
not needed for relocations.
- you'll need to write suitable commit messages for each patch, which
also contain a suitable ChangeLog style entry.
- finally, your popcount implementations have data in the code segment.
 That's going to cause problems when we have compilation options such as
-mpure-code.

I strongly suggest that, rather than using gcc snapshots (I'm assuming
this based on the diff style and directory naming in your patches), you
switch to using a git tree, then you'll be able to use tools such as
rebasing and the git posting tools to send the patch series for
subsequent review.

Richard.