106763 – Armv8.2 vmov.f16 instruction sometimes causes SIGILL

Bug 106763 - Armv8.2 vmov.f16 instruction sometimes causes SIGILL

Summary: Armv8.2 vmov.f16 instruction sometimes causes SIGILL

Status:	RESOLVED INVALID

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	12.2.1

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2022-08-28 14:07 UTC by George Pee
Modified:	2022-09-01 19:03 UTC (History)
CC List:	0 users

See Also:
Host:
Target:	arm-linux-gnueabi
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description George Pee 2022-08-28 14:07:59 UTC

First noticed when going from gcc-10 to gcc-12 and a complex function being compiled with -ftree-vectorize started to emit a vmov.f16 instruction and sometimes SIGILL on it.

It appears that this commit is when it started to emit that instruction:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=6390c5047adb75960f86d56582e6322aaa4d9281

Made a small program that also SIGILLs intermittently on vmov.f16 instruction 

Compiled with -g -mcpu=cortex-a55:

#include <vector>
#include <algorithm>
#include <arm_neon.h>
#include <stdio.h>
#include <sys/auxv.h>

float16_t rand_func()
{
    return float16_t(double(rand()) / double(RAND_MAX));
}

int main()
{
    srand( 11 );

    // just to show that vmov.f16 doesn't always SIGILL
    {
        std::vector<float16_t> floats_small(10);
        printf("float_small pre\n");
        std::generate( floats_small.begin() , floats_small.end() , rand_func );
        printf("float_small post\n");
    }

    // sometimes SIGILLS
    {
        std::vector<float16_t> floats_large(500000);
        printf("float_large pre\n");
        std::generate( floats_large.begin() , floats_large.end() , rand_func );
        printf("float_large post\n");
    }
    return 0;
}

Execution:
$ ./a.out
float_small pre
float_small post
float_large pre
float_large post

$ ./a.out
float_small pre
float_small post
float_large pre
Illegal instruction

GDB:
Program terminated with signal SIGILL, Illegal instruction.
#0  0x00010778 in rand_func () at vmov_f16_test.c:9
9           return float16_t(double(rand()) / double(RAND_MAX));
(gdb) disassemble 0x00010770,+32
Dump of assembler code from 0x10770 to 0x10790:
   0x00010770 <rand_func()+24>: vdiv.f64        d16, d17, d18
   0x00010774 <rand_func()+28>: vcvtb.f16.f64   s15, d16
=> 0x00010778 <rand_func()+32>: vmov.f16        r3, s15
   0x0001077c <rand_func()+36>: mov     r0, r3
   0x00010780 <rand_func()+40>: pop     {r11, pc}
   0x00010784 <rand_func()+44>: nop     {0}
   0x00010788 <rand_func()+48>:                 ; <UNDEFINED> instruction: 0xffc00000
   0x0001078c <rand_func()+52>: ldrshmi pc, [pc, #255]  ; 0x10893 <main()+258>  ; <UNPREDICTABLE>


I have been able to run the compiled program on multiple Cortex-A55 devices.  Initially, I though that since FP16 is optional on armv8.2-a that it was truly an illegal instruction, but if that is the case, then why would it only fail intermittently?

Comment 1 George Pee 2022-08-28 14:23:38 UTC

I was using gcc-12, but I was also able to reproduce this issue using the sample program above and gcc-10.

Comment 2 George Pee 2022-08-28 16:15:16 UTC

Forgot to mention that I'm building 32-bit.

Comment 3 Richard Earnshaw 2022-08-30 13:39:16 UTC

Programs don't generally take SIGILL intermittently - if that really is the case, then it's unlikely to be a bug in the compiler.

You haven't told us what OS you are running on, or anything else about your machine (eg, is it a big-little configuration?).  

Are there any diagnostics in the kernel logs, or can you enable any?

Comment 4 George Pee 2022-08-30 14:12:02 UTC

Yes, it's possible that this isn't a compiler bug.  I thought that it might be because the problem started showing up after upgrading the toolchain.

I wasn't sure if the compiler was failing to emit some kind of alignment or fp setting code.

I running on linux 4.9.118,
I enabled CONFIG_DEBUG_USER=y and set user_debug=31 in the kernel cmdline.

This is what the kernel reports.  I'm starting to look through it.
[   51.337524] a.out (3638): undefined instruction: pc=00010778
[   51.337536] CPU: 3 PID: 3638 Comm: a.out Tainted: P           O    4.9.118 #2
[   51.337547] task: 8572b000 task.stack: 8a002000
[   51.337555] PC is at 0x10778
[   51.337560] LR is at 0x60dc51b4
[   51.337567] pc : [<00010778>]    lr : [<60dc51b4>]    psr: 60000010
[   51.337567] sp : 72909c50  ip : 60dc51c0  fp : 72909c54
[   51.337572] r10: 60ff5000  r9 : 00000000  r8 : 00000000
[   51.337578] r7 : 00000000  r6 : 00010668  r5 : 00000000  r4 : 00003346
[   51.337583] r3 : 00000000  r2 : 00000001  r1 : 00000000  r0 : 6ff59dd5
[   51.337589] Flags: nZCv  IRQs on  FIQs on  Mode USER_32  ISA ARM  Segment user
[   51.337595] Control: 50c0383d  Table: 3e0c406a  DAC: 00000015
[   51.337605] Code: eef81be7 eddf2b05 eec10ba2 eef37b60 (ee173990)

Comment 5 Richard Earnshaw 2022-08-30 14:16:40 UTC

My guess (and it's only a guess because I'm not a kernel expert) is that the OS has disabled the FP/SIMD unit because of something like a context switch and then is failing, somehow, to recognize that the instruction is part of the VFP extension, so not re-enabling it.  That's plausible because this instruction was added as later extension.

I note that your kernel is from the linux 4 series, which is pretty ancient these days.

Comment 6 George Pee 2022-08-30 14:41:02 UTC

That explanation makes a lot of sense. Thank you!

Comment 7 George Pee 2022-08-30 21:11:15 UTC

Based on further experimentation, this does not look to be a compiler bug.

Comment 8 Richard Earnshaw 2022-08-31 16:35:29 UTC

I spoke to our kernel experts about this and they think my hypothesis is quite likely to be correct.  They also noted that kernel version 4.9.118 is about 200 releases out of date on the 4.9 LTS series.

But more importantly, they say that trying to run a 32-bit kernel on 64-bit Arm v8 hardware is not recommended and that you should seriously consider switching to a 64-bit kernel (which can still run a 32-bit userland if you need it to).

I do note that the code here:
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/kernel/entry-armv.S#n628 
does not have an entry for CP#9, which would be needed for FP16 support as all FP16 instructions are in that part of the encoding space.  I have no idea if changing that would be enough.

Comment 9 George Pee 2022-09-01 00:33:13 UTC

Thank you for following up even after I closed the ticket.

Unfortunately, I'm unable to switch to a 64-bit kernel at the moment.

Using this works around the issue by treating it via a neon path and enabling the vfp bit and retrying the instruction.

@@ -824,6 +824,9 @@ call_fpe:
        .align  6
 
 .LCneon_arm_opcodes:
+       .word   0xee000000                      @ mask
+       .word   0xee000000                      @ opcode
+
        .word   0xfe000000                      @ mask
        .word   0xf2000000                      @ opcode

 

I am now using this simplified case, which fails somewhere between 100,000 and 1,000,000 iterations:

    int c = 0;
    while(1)
    {
        c++;
        asm ( 
            "vmov.f16    r6, s18\n"
        );
        if (c % 100 == 0)
            printf("%d\n",c);
    }

It's quite odd that it is intermittent.  After instrumenting vfp enable/disable in the kernel, it seems as though there is something disabling the vfp bit in the fpexc register, but it doesn't seem to be the kernel.

I am able to reproduce this with other FP16 instructions, but not other non-FP16 VFP instructions.

Comment 10 Richard Earnshaw 2022-09-01 09:46:45 UTC

If you don't have CONFIG_SMP enabled, it looks like the kernel will do lazy context switching of the FP registers (it can save time if a process doesn't do any FP).  So another work around might be to enable that, even if you have only have a single core.

Comment 11 Richard Earnshaw 2022-09-01 09:57:11 UTC

(In reply to George Pee from comment #9)
> Using this works around the issue by treating it via a neon path and
> enabling the vfp bit and retrying the instruction.
> 
> @@ -824,6 +824,9 @@ call_fpe:
>         .align  6
>  
>  .LCneon_arm_opcodes:
> +       .word   0xee000000                      @ mask
> +       .word   0xee000000                      @ opcode
> +
>         .word   0xfe000000                      @ mask
>         .word   0xf2000000                      @ opcode
> 
>  

No, that's not going to be the right change (and wouldn't support Thumb, either).  I'd start off by trying the following, though it's completely untested:

diff --git a/arch/arm/kernel/entry-armv.S b/arch/arm/kernel/entry-armv.S
index 0ea8529a4872..df6e3c8533fa 100644
--- a/arch/arm/kernel/entry-armv.S
+++ b/arch/arm/kernel/entry-armv.S
@@ -629,11 +629,12 @@ call_fpe:
 #endif
 	ret.w	lr				@ CP#7
 	ret.w	lr				@ CP#8
-	ret.w	lr				@ CP#9
 #ifdef CONFIG_VFP
+	W(b)	do_vfp				@ CP#9  (VFP)
 	W(b)	do_vfp				@ CP#10 (VFP)
 	W(b)	do_vfp				@ CP#11 (VFP)
 #else
+	ret.w	lr				@ CP#9  (VFP)
 	ret.w	lr				@ CP#10 (VFP)
 	ret.w	lr				@ CP#11 (VFP)
 #endif

Comment 12 George Pee 2022-09-01 13:22:25 UTC

SMP is enabled.  The opcode thing was an experiment only.

Your suggestion seems to work great, but is it safe to make the change across all ARM cpus ?

Comment 13 Richard Earnshaw 2022-09-01 14:32:42 UTC

I don't think it would hurt.  With this change, a float-16 instruction that was encountered on older cores would enable the VFP unit if it wasn't previously enabled and then fault again when the retried instruction failed.  That's what likely already happens today if you have an instruction that isn't legal but still falls in the top-level decoding space for 32-bit and 64-bit FP instructions.  Perviously we would unconditionally raise a SIGILL, which is what you're seeing right now.

However, I'm not a kernel expert and I haven't tested any of the above, so all the caveats that that implies should be assumed.

Comment 14 Richard Earnshaw 2022-09-01 14:35:10 UTC

Also beware that I don't think Russel King (Arm Linux kernel maintainer) would accept this patch on its own.  You'd likely need to add some boot time detection of the additional feature and expose that through the HWCAP interface to do a complete patch.  But that's icing on the cake really.

Comment 15 George Pee 2022-09-01 14:54:58 UTC

Funny that you mention that...
https://lore.kernel.org/linux-arm-kernel/20220901141307.2361752-1-georgepee@gmail.com/T/#u

Comment 16 Richard Earnshaw 2022-09-01 14:58:30 UTC

(In reply to George Pee from comment #15)
> Funny that you mention that...
> https://lore.kernel.org/linux-arm-kernel/20220901141307.2361752-1-
> georgepee@gmail.com/T/#u

:)

Don't forget that the arm64 kernel will also need to export the same HWCAP values when supporting a 32-bit userland.

Comment 17 George Pee 2022-09-01 15:06:24 UTC

Any idea on why the issue is intermittent?

Comment 18 Richard Earnshaw 2022-09-01 15:14:11 UTC

(In reply to George Pee from comment #17)
> Any idea on why the issue is intermittent?

For SMP not really, because I think that path doesn't use lazy context switching; but perhaps the kernel is smart enough to switch into non-SMP mode if only one processor is present at boot time (thus saving a lot of kernel locking).  

The kernel can disable the FPU at times (the most common case is a context switch) and then re-enable it when another VFP/SIMD instruction is encountered.  If the first instruction encountered after it has been disabled is a FP16 operation, then the kernel fails to recognize it as such and so doesn't try to re-enable the VFP unit.  If it's some other, recognized, operation then the unit gets re-enabled and then the fp16 instructions never take a fault.

Lazy context switching can save time and energy loading/saving the VFP register state (which is relatively large) if most applications on the system use little or no FP/SIMD, but it's significantly more complicated on SMP systems because it means the state may have to be fetched from a different CPU's FPU if the process is switched to another CPU, so this is normally only done on single processor systems.

Comment 19 George Pee 2022-09-01 19:03:46 UTC

Thank you for all of your thoughts and details, it has been tremendously helpful!