GCC 10 using floating-point registers to pass some 64-bit arguments on ARM Cortex-M

Wed May 20 14:10:35 GMT 2020

On 19/05/2020 21:28, Freddie Chopin wrote:
> On Tue, 2020-05-19 at 14:52 +0100, Richard Earnshaw wrote:
>> Only d7?  No, that couldn't be right.  d7 would only be used if d0-d6
>> had also been used.
> 
> I've looked at the disassembly again and my first description of
> symptoms was indeed wrong, well - partially (; I've tried looking at a
> bigger picture now and it seems that the parameters are not passed via
> FPU registers, but FPU registers are used as intermediate helper
> registers in a few places ("vldr" appears in the listing 24 times, this
> application does not use any floating point types or functions).
> 
> The most common pattern is something like this:
> 
> -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 --
> 
> 080112e0 <distortos::SoftwareTimerCommon::start(std::chrono::time_point<distortos::TickClock, std::chrono::duration<long long, std::ratio<1ll, 1000ll> > >, std::chrono::duration<long long, std::ratio<1ll, 1000ll> >)>:
> {
>  80112e0:	b500      	push	{lr}
>  80112e2:	b083      	sub	sp, #12
>  80112e4:	ed9d 7b04 	vldr	d7, [sp, #16]
> 	softwareTimerControlBlock_.start(internal::getScheduler().getSoftwareTimerSupervisor(), timePoint, period);
>  80112e8:	4904      	ldr	r1, [pc, #16]	; (80112fc <distortos::SoftwareTimerCommon::start(std::chrono::time_point<distortos::TickClock, std::chrono::duration<long long, std::ratio<1ll, 1000ll> > >, std::chrono::duration<long long, std::ratio<1ll, 1000ll> >)+0x1c>)
>  80112ea:	ed8d 7b00 	vstr	d7, [sp]
>  80112ee:	3008      	adds	r0, #8
>  80112f0:	f000 f840 	bl	8011374 <distortos::internal::SoftwareTimerControlBlock::start(distortos::internal::SoftwareTimerSupervisor&, std::chrono::time_point<distortos::TickClock, std::chrono::duration<long long, std::ratio<1ll, 1000ll> > >, std::chrono::duration<long long, std::ratio<1ll, 1000ll> >)>
> }
>  80112f4:	2000      	movs	r0, #0
>  80112f6:	b003      	add	sp, #12
>  80112f8:	f85d fb04 	ldr.w	pc, [sp], #4
>  80112fc:	20000a3c 	.word	0x20000a3c
> 
> -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 --
> 
> So it's a vldr followed by a vstr (sometimes more than one), which
> seems to be a way to load 64-bit values in one step. Such a pattern
> appears in the code several times; it mostly uses d7, but sometimes d8
> or d6 (some parts use two registers in the same block of code, d6 and
> d7).
> 
> A few times the compiler uses s16 as a scratch register like this:
> 
> -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 --
> 
>  80060fe:	f812 3b01 	ldrb.w	r3, [r2], #1
>  8006102:	9204      	str	r2, [sp, #16]
>  8006104:	ee08 3a10 	vmov	s16, r3
> 			const auto rawQueueWrapper = makeRawQueueWrapper<0>(dynamic, fifo);
>  8006108:	f816 2b01 	ldrb.w	r2, [r6], #1
>  800610c:	ee18 1a10 	vmov	r1, s16
>  8006110:	a809      	add	r0, sp, #36	; 0x24
>  8006112:	f7ff ff8b 	bl	800602c <std::unique_ptr<distortos::test::RawQueueWrapper, std::default_delete<distortos::test::RawQueueWrapper> >
> 
> -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 -- >8 --
> 
> In this case it seems to make no sense at all, why not just move from
> r3 to r1 and be done with that (s16 is not used again in this
> function), or why not load into r1 directly?
> 
> Sorry for the initial confusion, I hope that this time I'm more precise
> (;
> 
>> No, those changes are for handling of 64-bit integral values where we
>> no longer use Neon to perform those operations and have improved the
>> way code is generated to handle them using the GP registers.
> 
> I see. I'm just looking for the answer to my basic question - is this a
> bug or a feature? If it's a feature, then maybe there's a way to
> disable it somehow.
> 
>> Testcase needed.
> 
> I could try providing one if you really think that what I see here is a
> bug, not an expected behaviour.
> 
> Regards,
> FCh
> 

OK, so not a bug then.  Phew!

TL;DR:

I was playing with some changes to try to handle 64-bit copies more
efficiently last year, but it was a major can of worms and untangling it
proved infeasible before the end of the stage-1 development window.  The
issue here is that the Arm architecture is not sufficiently orthogonal
in its memory addressing modes, and GCC likes to think that all
architectures are fundamentally orthogonal in this respect.  It really
sucks.  Add to that the compiler's tendency to pun modes when it
thinks it might not matter, and you end up with a mess that's quite hard
to untangle in a reasonable way.
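
For anyone who wants to look at the 64-bit copy in isolation, the
following is a minimal sketch of a reduced test case, not the actual
distortos code.  Supervisor, ControlBlock, Timer and hostCheck are
hypothetical stand-ins, and both arguments are plain durations for
brevity.  The idea is that the second 64-bit argument arrives on the
stack and has to be copied to the outgoing stack slot, which is where
the quoted listing shows the vldr/vstr pair.  Whether a given GCC
version and option set picks an FP register for that copy is not
guaranteed.  The original post does not state the compiler options, so
something like "arm-none-eabi-g++ -O2 -mcpu=cortex-m4 -mfloat-abi=hard
-mfpu=fpv4-sp-d16 -S" is only an assumption.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-ins for the types in the quoted listing.
using Duration = std::chrono::duration<long long, std::milli>;

struct Supervisor {};

struct ControlBlock
{
    long long lastTimePoint = 0;
    long long lastPeriod = 0;

    // noinline (GCC/Clang attribute) keeps the call below a real call.
    __attribute__((noinline))
    void start(Supervisor&, Duration timePoint, Duration period)
    {
        lastTimePoint = timePoint.count();
        lastPeriod = period.count();
    }
};

Supervisor supervisor;

struct Timer
{
    long long header = 0;       // places controlBlock at a non-zero offset
    ControlBlock controlBlock;

    // On 32-bit Arm, 'period' is expected to arrive on the stack and to
    // be forwarded on the stack: a 64-bit memory-to-memory copy.
    __attribute__((noinline))
    int start(Duration timePoint, Duration period)
    {
        controlBlock.start(supervisor, timePoint, period);
        return 0;
    }
};

// Host-side sanity check that both values are forwarded intact.
void hostCheck()
{
    Timer timer;
    const int result = timer.start(Duration{5}, Duration{7});
    assert(result == 0);
    assert(timer.controlBlock.lastTimePoint == 5);
    assert(timer.controlBlock.lastPeriod == 7);
}
```

Searching the generated assembly of Timer::start for vldr/vstr shows
whether the copy went through a d register or through the GP registers.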

I might have a look at those patches again this year, but it might
depend on what else I have on my plate during the development window.

R.