[PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0

Thu Jan 7 16:44:34 GMT 2021

On 07/01/2021 13:27, Christophe Lyon via Gcc-patches wrote:
> On Thu, 7 Jan 2021 at 13:56, Richard Earnshaw
> <Richard.Earnshaw@foss.arm.com> wrote:
>>
>> On 07/01/2021 00:59, Daniel Engel wrote:
>>> --snip--
>>>
>>> On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
>>>
>>>>
>>>> Thanks for working on this, Daniel.
>>>>
>>>> This is clearly stage1 material, so we've got time for a couple of
>>>> iterations to sort things out.
>>>
>>> I appreciate your feedback.  I had been hoping that with no regressions
>>> this might still be eligible for stage2.  Christophe never indicated
>>> either way. but the fact that he was looking at it seemed positive.
>>> I thought I would be a couple of weeks faster with this last
>>> iteration, but holidays got in the way.
>>
>> GCC doesn't have a stage 2 any more (historical wart).  We were in
>> (late) stage3 when this was first posted, and because of the significant
>> impact this might have on not just CM0 but other targets as well, I
>> don't think it's something we should try to squeeze in at the last
>> minute.  We're now in stage 4, so that is doubly the case.
>>
>> Christophe is a very valuable member of our community, but he's not a
>> port maintainer and thus cannot really rule on what can go into the
>> tools, or when.
>>
>>>
>>> I actually think your comments below could all be addressable within a
>>> couple of days.  But, I'm not accounting for the review process.
>>>
>>>> Firstly, the patch is very large, but contains a large number of
>>>> distinct changes, so it would really benefit from being broken down into
>>>> a number of distinct patches.  This will make reviewing the individual
>>>> changes much more straight-forward.
>>>
>>> I have no context for "large" or "small" with respect to gcc.  This
>>> patch comprises about 30% of a previously-monolithic library that's
>>> been shipping since ~2016 (the rest is libm material).  Other than
>>> (1) the aforementioned change to div0(), (2) a nascent adaptation
>>> for __truncdfsf2() (not enabled), and (3) the gratuitous addition of
>>> the bitwise functions, the library remains pretty much as it was
>>> originally released.
>>
>> Large, like many other terms is relative.  For assembler file changes,
>> which this is primarily, the overall size can be much smaller and still
>> be considered 'large'.
>>
>>>
>>> The driving force in the development of this library was small size,
>>> which of course was never possible with the softfp routines.  It's not
>>> half-slow, either, for the limitations of the M0 architecture.   And,
>>> it's IEEE compliant.  But, that means that most of the functions are
>>> highly interconnected.  So, some of it can be broken up as you outline
>>> below, but that last patch is still worth more than half of the total.
>>
>> Nevertheless, having the floating-point code separated out will make
>> reviewing more straight forward.  I'll likely need to ask one of our FP
>> experts to have a specific look at that part and that will be easier if
>> it is disentangled from the other changes.
>>
>>>
>>> I also have ~70k lines of test vectors that seem mostly redundant, but
>>> not completely.  I haven't decided what to do here.  For example, I have
>>> coverage for __aeabi_u/ldivmod, while GCC does not.  If I do anything
>>> with this code it will be in a separate thread.
>>
>> Publishing the test code, even if it isn't integrated into the GCC
>> testsuite would be useful.  Perhaps someone else could then help with that.
>>
>>>
>>>> I'd suggest:
>>>>
>>>> 1) Some basic makefile cleanups to ease initial integration - in
>>>> particular where we have things like
>>>>
>>>> LIB1FUNCS += <long list of functions>
>>>>
>>>> that this be rewritten with one function per line (and sorted
>>>> alphabetically) - then we can see which functions are being changed in
>>>> subsequent patches.  It makes the Makefile fragments longer, but the
>>>> improvement in clarity for makes this worthwhile.
>>>
>>> I know next to nothing about Makefiles, particularly ones as complex as
>>> GCC's.  I was just trying to work with the existing style to avoid
>>> breaking something.  However, I can certainly adopt this suggestion.
>>>
>>>> 2) The changes for the existing integer functions - preferably one
>>>> function per patch.
>>>>
>>>> 3) The new integer functions that you're adding
>>>
>>> These wouldn't be too hard to do, but what are the expectations for
>>> testing?  A clean build of GCC takes about 6 hours in my VM, and
>>> regression testing takes about 4 hours per architecture.  You would want
>>> a full regression report for each incremental patch?  I have no idea how
>>> to target regression tests that apply to particular runtime functions
>>> without the risk of missing something.
>>>
>>
>> Most of this can be tested in a cross-compile environment using qemu as
>> a model.  A cross build shouldn't take that long (especially if you
>> restrict the compiler to just C and C++ - other languages are
>> vanishingly unlikely to pick up errors in the parts of the compiler
>> you're changing).  But build checks will be mostly sufficient for most
>> of the intermediate patches.
>>
>>>> 4) The floating-point support.
>>>>
>>>> Some more general observations:
>>>>
>>>> - where functions are already in lib1funcs.asm, please leave them there.
>>>
>>> I guess I have a different vision here.  I have had a really hard time
>>> following all of the nested #ifdefs in lib1funcs, so I thought it would
>>> be helpful to begin breaking it up into logical units.
>>
>> Agreed, it's not easy.  But the restructuring, if any, should be done
>> separately from other changes, not mixed up with them.
>>
>>>
>>> The functions removed were all functions for which I had THUMB1
>>> sequences faster/smaller than lib1funcs:  __clzsi2, __clzdi2, __ctzsi2,
>>> __ashrdi3, __lshrdi3, __ashldi3.  In fact, the new THUMB1 of __clzsi2 is
>>> the same number of instructions as the previous ARM/THUMB2 version.
>>>
>>> You will find all of the previous ARM versions of these functions merged
>>> into the new files (with attribution) and the same preprocessor
>>> selection path.  So no architecture variant should be any worse off than
>>> before this patch, and some beyond v6m should benefit.
>>>
>>> In the future, I think that my versions of __divsi3 and __divdi3 will
>>> prove faster/smaller than the existing THUMB2 versions.  I know that my
>>> float routines are less than half the compiled size of THUMB2 versions
>>> in 'ieee754-sf.S'.  However, I haven't profiled the exact performance
>>> differences so I have left all this work for future patches. (It's also
>>> quite likely that my version can be further-refined with a few judicious
>>> uses of THUMB2 alternatives.)
>>>
>>> My long-term vision would be use lib1funcs as an architectural wrapper
>>> distinct from the implementation code.
>>>
>>>> - lets avoid having the cm0 subdirectory - in particular we should do
>>>> this when there is existing code for other targets in the same source
>>>> files.  It's OK to have any new files in the main 'arm' directory of the
>>>> source tree - just name the files appropriately if really needed.
>>>
>>> Fair point on the name.  In v1 of this patch, all these files were all
>>> preprocessor-selected for v6m only.  However, as I've stumbled through
>>> the finer points of integration, that line has blurred.  Name aside,
>>> the subdirectory does still represent a standalone library.   I think
>>> I've managed to add enough integration hooks that it works well in
>>> a libgcc context, but it still has a very distinct implementation style.
>>>
>>> I don't have a strong opinion on this, just preference.  But, keeping
>>> the subdirectory with a neutral name will probably make maintenance
>>> easier in the short term.  I would suggest "lib0" (since it caters to
>>> the lowest common denominator) or "eabi" (since that was the original
>>> target).  There are precedents in other architectures (e.g. avr).
>>
>> The issue here is that the selection of code from the various
>> subdirectories is not consistent.  In some cases we might be pulling in
>> a thumb1 implementation into a thumb2 environment, so having the code in
>> a directory that doesn't reflect this makes maintaining the code harder.
>>  I don't mind too much if some new files are introduced and their names
>> reflect both their function and the architecture they support - eg
>> t1-di-shift.S would obviously contain code for di-mode shifts in thumb1.
>>
>>>
>>>> - let's avoid the CM0 prefix - this is 'thumb1' code, for want of a
>>>> better term, and that is used widely elsewhere in the compiler.  So if
>>>> you really need a term just use THUMB1, or even T1.
>>>
>>> Maybe.  The Cortex M0 includes a subset of THUMB2 instructions.  Most
>>> of this is probably THUMB1 clean, but it wasn't a design requirement.
>>
>> It's particularly the Thumb1 issue, just more the name is for a specific
>> CPU which might cause confusion later.  v6m would be preferable to that
>> if there really is a dependency on the instructions that are not in the
>> original Thumb1 ISA.
>>
>>>
>>> The CM0_FUNC_START exists so that I can specify subsections of ".text"
>>> for each function.  This was a fairly fundamental design decision that
>>> allowed me to make a number of branch optimizations between functions.
>>> The other macros are just duplicates for naming symmetry.
>>
>> This is something we'll have to get to during the main review of the
>> code - we used to have support for PE-COFF object files.  That might now
>> be obsolete, wince support is certainly deprecated - but we can't assume
>> that ELF is the only object format we'll ever have to support.
>>
>>>
>>> The existing  FUNC_START macro inserts extra conflicting ".text"
>>> directives that would break the build.  Of course, the prefix was
>>> arbitrary; I just took CM0 from the library name.  But, there's nothing
>>> architecturally significant about this macro at all, so THUMB1 and T1
>>> seems just about as wrong.  Maybe define a FUNC_START_SECTION macro with
>>> two parameters? For example:
>>>
>>>     FUNC_START_SECTION clzdi2 .text.sorted.libgcc.clz2.clzdi2
>>>
>>> Instead of:
>>>
>>>     .section .text.sorted.libgcc.clz2.clzdi2,"x"
>>>     CM0_FUNC_START clzdi2
>>>
>>>> - For the 64-bit shift functions, I expect the existing code to be
>>>> preferable whenever possible - I think it can even be tweaked to build
>>>> for thumb2 by inserting a suitable IT instruction.  So your code should
>>>> only be used when
>>>>
>>>>  #if !__ARM_ARCH_ISA_ARM && __ARM_ARCH_ISA_THUMB == 1
>>>
>>> That is the definition of NOT_ISA_TARGET_32BIT, which I am respecting.
>>> (The name doesn't seem quite right for Cortex M0, since it does support
>>> some 32 bit instructions, but that's beside the point.)
>>
>> The terms Thumb1 and Thumb2 predate the arm-v8m architecture
>> specifications - even then the term Thumb1 was interpreted as "mostly
>> 16-bit instructions" and thumb2 as "a mix of 16- and 32-bit".  Yes, the
>> 16/32-bit spilt has become more blurred and that will likely continue in
>> future since the 16-bit encoding space is pretty full.
>>
>>>
>>> The current lib1funcs ARM code path still exists, as described above. My
>>> THUMB1 implementations were 1 - 3 instructions shorter than the current
>>> versions, which is why I took the time to merge the files.
>>>
>>> Unfortunately, the Cortex M0 THUMB2 subset does not provide IT.  I don't
>>> see an advantage to eliminating the branch unless these functions were
>>> written with cryptographic side channel attacks in mind.
>>
>> On high performance cores branches are predicted - if the branch is
>> predictable then the common path will be taken and the unneeded
>> instructions will never be used.  But library functions like this tend
>> to have very unpredictable values used for calling them, so it's much
>> less likely that the hardware will predict the right path - at this
>> point conditional instructions tend to win (especially if there aren't
>> very many of them) because the cost (on average) of not executing the
>> unneeded instructions is much lower than the cost (on average) of
>> unwinding the processor state to execute the other arm of the
>> conditional branch.
>>
>>>
>>>> - most, if not all, of your LSYM symbols should not be needed after
>>>> assembly, so should start with a captial 'L' (and no leading
>>>> underscores), the assembler will then automatically discard any that are
>>>> not needed for relocations.
>>>
>>> You don't want debugging symbols for libgcc internals :) ?  I sort of
>>> understand that, but those symbols have been useful to me in the past.
>>> The "." by itself seems to keep visibility local, so the extra symbols
>>> won't cause linker issuess. Would you object to a macro variant (e.g.
>>> LLSYM) that prepends the "L" but is easier to disable?
>>
>> It is a matter of taste, but I really prefer the local symbols to
>> disappear entirely once the file is compiled - it makes things like
>> backtrace gdb show the proper call heirarchy.
>>
>>>
>>>> - you'll need to write suitable commit messages for each patch, which
>>>> also contain a suitable ChangeLog style entry.
>>>
>>> OK.
>>>
>>>> - finally, your popcount implementations have data in the code segment.
>>>>  That's going to cause problems when we have compilation options such as
>>>> -mpure-code.
>>>
>>> I am just following the precedent of existing lib1funcs (e.g. __clz2si).
>>> If this matters, you'll need to point in the right direction for the
>>> fix.  I'm not sure it does matter, since these functions are PIC anyway.
>>
>> That might be a bug in the clz implementations - Christophe: Any thoughts?
>>
> 
> Indeed that looks suspicious. I'm wondering why I saw no problem during testing.
> Is it possible that __clzsi2 is not covered by GCC's 'make check' ?

It's possible, but don't forget that modern cores have a CLZ
instruction, so most tests will not end up using the library implementation.

> 
>>>
>>>> I strongly suggest that, rather than using gcc snapshots (I'm assuming
>>>> this based on the diff style and directory naming in your patches), you
>>>> switch to using a git tree, then you'll be able to use tools such as
>>>> rebasing and the git posting tools to send the patch series for
>>>> subsequent review.
>>>
>>> Your assumption is correct.  I didn't think that I would have to get so
>>> deep into the gcc development process for this contribution.  I used
>>> this library as a bare metal alternative for libgcc/libm in the product
>>> for years, so I thought it would just drop in.  But, the libgcc compile
>>> mechanics have proved much more 'interesting'. I'm assuming this
>>> architecture was created years before the introduction of -ffunction-
>>> sections...
>>>
>>
>> I don't think I've time to write a history lesson, even if you wanted
>> it.  Suffice to say, this does date back to the days of a.out format
>> object files (with 4 relocation types, STABS debugging, and one code,
>> one data and one bss section).
>>
>>>>
>>>> Richard.
>>>>
>>>
>>> Thanks again,
>>> Daniel
>>>
>>
>> R.

R.