Bug 96882 - Wrong assembly code generated with arm-none-eabi-gcc -flto -mfloat-abi=hard options
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 9.3.1
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: ABI, wrong-code
Depends on:
Blocks: 96939
Reported: 2020-09-01 12:57 UTC by emilie.feral
Modified: 2022-03-29 16:05 UTC
CC List: 1 user

See Also:
Host:
Target: arm
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-09-01 00:00:00


Attachments
preprocessed file triggering the bug (225 bytes, text/plain)
2020-09-01 12:57 UTC, emilie.feral

Description emilie.feral 2020-09-01 12:57:46 UTC
Created attachment 49163
preprocessed file triggering the bug

Hi,

When compiling the following code:

/********************************************************************/

typedef struct {
  double m_a;
  double m_b;
  double m_c;
  double m_d;
} AtLeast32BytesObject;

AtLeast32BytesObject __attribute__((noinline)) CalledFunction() {
  AtLeast32BytesObject result = {1.1, 2.2, 3.3, 4.4};
  return result;
}

void __attribute__((noinline)) _start() {
  volatile AtLeast32BytesObject result = CalledFunction();
  while(1) {}
}

/********************************************************************/

with "arm-none-eabi-gcc -Os -flto -mthumb -mfloat-abi=hard -mcpu=cortex-m4 -ffreestanding -nostdlib -lgcc", the assembly emitted for the symbol "CalledFunction" returns the result in r0-r7, which includes the callee-saved registers r4-r7 (see addresses 0x0000805e-0x0000806e in the disassembly below). Because r4-r7 are callee-saved, they are restored when leaving the subroutine (see address 0x00008072), which overwrites those values and corrupts the result of "CalledFunction".

Dump of assembler code for function CalledFunction:
   0x00008000 <+0>: push {r4, r5, r6, r7, lr}
   0x00008002 <+2>: ldr r5, [pc, #112] ; (0x8074 <CalledFunction+116>)
   0x00008004 <+4>: ldmia r5!, {r0, r1, r2, r3}
   0x00008006 <+6>: sub sp, #132 ; 0x84
   0x00008008 <+8>: add r4, sp, #64 ; 0x40
   0x0000800a <+10>: stmia r4!, {r0, r1, r2, r3}
   0x0000800c <+12>: ldmia.w r5, {r0, r1, r2, r3}
   0x00008010 <+16>: add r5, sp, #64 ; 0x40
   0x00008012 <+18>: stmia.w r4, {r0, r1, r2, r3}
   0x00008016 <+22>: ldmia r5!, {r0, r1, r2, r3}
   0x00008018 <+24>: add r4, sp, #96 ; 0x60
   0x0000801a <+26>: stmia r4!, {r0, r1, r2, r3}
   0x0000801c <+28>: ldmia.w r5, {r0, r1, r2, r3}
   0x00008020 <+32>: stmia.w r4, {r0, r1, r2, r3}
   0x00008024 <+36>: ldr r3, [sp, #96] ; 0x60
   0x00008026 <+38>: str r3, [sp, #0]
   0x00008028 <+40>: ldr r3, [sp, #100] ; 0x64
   0x0000802a <+42>: str r3, [sp, #4]
   0x0000802c <+44>: ldr r3, [sp, #104] ; 0x68
   0x0000802e <+46>: str r3, [sp, #8]
   0x00008030 <+48>: ldr r3, [sp, #108] ; 0x6c
   0x00008032 <+50>: str r3, [sp, #12]
   0x00008034 <+52>: ldr r3, [sp, #112] ; 0x70
   0x00008036 <+54>: str r3, [sp, #16]
   0x00008038 <+56>: ldr r3, [sp, #116] ; 0x74
   0x0000803a <+58>: ldr r7, [sp, #124] ; 0x7c
   0x0000803c <+60>: str r3, [sp, #20]
   0x0000803e <+62>: ldr r3, [sp, #120] ; 0x78
   0x00008040 <+64>: strd r3, r7, [sp, #24]
   0x00008044 <+68>: ldr r3, [sp, #0]
   0x00008046 <+70>: str r3, [sp, #32]
   0x00008048 <+72>: ldr r3, [sp, #4]
   0x0000804a <+74>: str r3, [sp, #36] ; 0x24
   0x0000804c <+76>: ldr r3, [sp, #8]
   0x0000804e <+78>: str r3, [sp, #40] ; 0x28
   0x00008050 <+80>: ldr r3, [sp, #12]
   0x00008052 <+82>: str r3, [sp, #44] ; 0x2c
   0x00008054 <+84>: ldr r3, [sp, #16]
   0x00008056 <+86>: str r3, [sp, #48] ; 0x30
   0x00008058 <+88>: ldr r3, [sp, #20]
   0x0000805a <+90>: str r3, [sp, #52] ; 0x34
   0x0000805c <+92>: ldr r3, [sp, #24]
   0x0000805e <+94>: strd r3, r7, [sp, #56] ; 0x38 // HERE, we store
   0x00008062 <+98>: ldrd r0, r1, [sp, #32] // the result
   0x00008066 <+102>: ldrd r2, r3, [sp, #40] ; 0x28 // in r0-r7
   0x0000806a <+106>: ldrd r4, r5, [sp, #48] ; 0x30 //
   0x0000806e <+110>: ldr r6, [sp, #56] ; 0x38 //
   0x00008070 <+112>: add sp, #132 ; 0x84
   0x00008072 <+114>: pop {r4, r5, r6, r7, pc} // HERE, we overwrite r4-r7
   0x00008074 <+116>: strh r0, [r5, #4]
   0x00008076 <+118>: movs r0, r0
End of assembler dump.

I have attached "main.i" to this report; it contains the preprocessed form of the code above.

The toolchain version is arm-none-eabi-gcc (GNU Arm Embedded Toolchain 9-2020-q2-update) 9.3.1 20200408 (release).
It was from the binary package gcc-arm-none-eabi-9-2020-q2-update-mac.pkg downloaded from https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads.
The host machine is a MacBook Pro with Catalina version 10.15.4 (19E287).

The command lines I used are:

arm-none-eabi-gcc main.c -Os -flto -mthumb -mfloat-abi=hard -mcpu=cortex-m4 -ffreestanding -nostdlib -lgcc -save-temps -o a.elf

arm-none-eabi-gdb -batch -ex 'file a.elf' -ex 'disassemble CalledFunction'

Thanks for your help,
Émilie
Comment 1 Richard Earnshaw 2020-09-01 13:25:00 UTC
We need to see the configuration information.  What is the output of "gcc -v" for your compiler?
Comment 2 emilie.feral 2020-09-01 13:43:29 UTC
Here they are:

arm-none-eabi-gcc -v
Using built-in specs.
COLLECT_GCC=/Applications/ARM/bin/arm-none-eabi-gcc
COLLECT_LTO_WRAPPER=/Applications/ARM/bin/../lib/gcc/arm-none-eabi/9.3.1/lto-wrapper
Target: arm-none-eabi
Configured with: /tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/src/gcc/configure --target=arm-none-eabi --prefix=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/install-native --libexecdir=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/install-native/lib --infodir=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/install-native/share/doc/gcc-arm-none-eabi/info --mandir=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/install-native/share/doc/gcc-arm-none-eabi/man --htmldir=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/install-native/share/doc/gcc-arm-none-eabi/html --pdfdir=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/install-native/share/doc/gcc-arm-none-eabi/pdf --enable-languages=c,c++ --enable-plugins --disable-decimal-float --disable-libffi --disable-libgomp --disable-libmudflap --disable-libquadmath --disable-libssp --disable-libstdcxx-pch --disable-nls --disable-shared --disable-threads --disable-tls --with-gnu-as --with-gnu-ld --with-newlib --with-headers=yes --with-python-dir=share/gcc-arm-none-eabi --with-sysroot=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/install-native/arm-none-eabi --build=x86_64-apple-darwin10 --host=x86_64-apple-darwin10 --with-gmp=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/build-native/host-libs/usr --with-mpfr=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/build-native/host-libs/usr --with-mpc=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/build-native/host-libs/usr --with-isl=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/build-native/host-libs/usr --with-libelf=/tmp/jenkins-GCC-9-pipeline-200_20200521_1590053285/build-native/host-libs/usr --with-host-libstdcxx='-static-libgcc -Wl,-lstdc++ -lm' --with-pkgversion='GNU Arm Embedded Toolchain 9-2020-q2-update' --with-multilib-list=rmprofile,aprofile
Thread model: single
gcc version 9.3.1 20200408 (release) (GNU Arm Embedded Toolchain 9-2020-q2-update)
Comment 3 Richard Earnshaw 2020-09-01 15:00:17 UTC
LTO seems to be getting confused as to the ABI.  Investigating...

In the meantime, the only work-around I can think of is to remove -flto from your build.
Comment 4 Richard Earnshaw 2020-09-01 19:36:57 UTC
typedef struct {
  double m_a;
  double m_b;
  double m_c;
  double m_d;
} AtLeast32BytesObject;

static AtLeast32BytesObject __attribute__((noinline,noclone)) CalledFunction() {
  AtLeast32BytesObject result = {1.1, 2.2, 3.3, 4.4};
  return result;
}

void __attribute__((noinline)) _start() {
  volatile AtLeast32BytesObject result = CalledFunction();
  while(1) {}
}

Will miscompile without needing LTO.
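
For anyone who wants to check an affected build at run time rather than by reading the disassembly, here is a minimal sketch of a self-check around the same test case. The names KEEP_OUT_OF_LINE, returned_struct_is_intact and check_called_function are hypothetical, and the use of assert() assumes a C library that provides it; on a bare-metal target a trap or a breakpoint could stand in for it.

```c
#include <assert.h>

typedef struct {
  double m_a;
  double m_b;
  double m_c;
  double m_d;
} AtLeast32BytesObject;

/* Hypothetical helper macro: noclone is a GCC attribute, so only request it
   when building with GCC itself. */
#if defined(__GNUC__) && !defined(__clang__)
#define KEEP_OUT_OF_LINE __attribute__((noinline, noclone))
#else
#define KEEP_OUT_OF_LINE __attribute__((noinline))
#endif

/* Same shape as the test case: private to the file and never inlined. */
static AtLeast32BytesObject KEEP_OUT_OF_LINE CalledFunction(void) {
  AtLeast32BytesObject result = {1.1, 2.2, 3.3, 4.4};
  return result;
}

/* Returns 1 if the caller sees the values the callee produced, 0 otherwise. */
int returned_struct_is_intact(void) {
  volatile AtLeast32BytesObject result = CalledFunction();
  return result.m_a == 1.1 && result.m_b == 2.2 &&
         result.m_c == 3.3 && result.m_d == 4.4;
}

/* Aborts (on a hosted build) if caller and callee disagree. */
void check_called_function(void) {
  assert(returned_struct_is_intact());
}
```

If caller and callee disagree about where the result lives, a check like this would be expected to fail on the target, but it only covers this one function and is no substitute for inspecting the generated code.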
Comment 5 emilie.feral 2020-09-02 08:10:39 UTC
When compiling without LTO using the command:
arm-none-eabi-gcc main.c -Os -mfloat-abi=hard -mthumb -mcpu=cortex-m4 -ffreestanding -nostdlib -lgcc -save-temps -o a.elf

I get the following instructions for CalledFunction:

Dump of assembler code for function CalledFunction:
   0x00008000 <+0>:	push	{r4, r5, lr}
   0x00008002 <+2>:	ldr	r5, [pc, #52]	; (0x8038 <CalledFunction+56>)
   0x00008004 <+4>:	ldmia	r5!, {r0, r1, r2, r3}
   0x00008006 <+6>:	sub	sp, #100	; 0x64
   0x00008008 <+8>:	add	r4, sp, #32
   0x0000800a <+10>:	stmia	r4!, {r0, r1, r2, r3}
   0x0000800c <+12>:	ldmia.w	r5, {r0, r1, r2, r3}
   0x00008010 <+16>:	add	r5, sp, #32
   0x00008012 <+18>:	stmia.w	r4, {r0, r1, r2, r3}
   0x00008016 <+22>:	ldmia	r5!, {r0, r1, r2, r3}
   0x00008018 <+24>:	add	r4, sp, #64	; 0x40
   0x0000801a <+26>:	stmia	r4!, {r0, r1, r2, r3}
   0x0000801c <+28>:	ldmia.w	r5, {r0, r1, r2, r3}
   0x00008020 <+32>:	stmia.w	r4, {r0, r1, r2, r3}
   0x00008024 <+36>:	vldr	d0, [sp, #64]	; 0x40
   0x00008028 <+40>:	vldr	d1, [sp, #72]	; 0x48
   0x0000802c <+44>:	vldr	d2, [sp, #80]	; 0x50
   0x00008030 <+48>:	vldr	d3, [sp, #88]	; 0x58
   0x00008034 <+52>:	add	sp, #100	; 0x64
   0x00008036 <+54>:	pop	{r4, r5, pc}
   0x00008038 <+56>:	strh	r0, [r3, #2]
   0x0000803a <+58>:	movs	r0, r0
End of assembler dump.

Which seems correct to me: the result is returned through registers d0-d3.

Interestingly, if I keep LTO but remove the -mfloat-abi=hard option:
arm-none-eabi-gcc main.c -Os -flto -mthumb -mcpu=cortex-m4 -ffreestanding -nostdlib -lgcc -save-temps -o a.elf

The compilation also seems correct: the result is written at the address given by r0 and the address is returned through r0.

Dump of assembler code for function CalledFunction:
   0x00008000 <+0>:	push	{r4, r5, r6, lr}
   0x00008002 <+2>:	ldr	r5, [pc, #20]	; (0x8018 <CalledFunction+24>)
   0x00008004 <+4>:	mov	r6, r0
   0x00008006 <+6>:	mov	r4, r0
   0x00008008 <+8>:	ldmia	r5!, {r0, r1, r2, r3}
   0x0000800a <+10>:	stmia	r4!, {r0, r1, r2, r3}
   0x0000800c <+12>:	ldmia.w	r5, {r0, r1, r2, r3}
   0x00008010 <+16>:	stmia.w	r4, {r0, r1, r2, r3}
   0x00008014 <+20>:	mov	r0, r6
   0x00008016 <+22>:	pop	{r4, r5, r6, pc}
   0x00008018 <+24>:	strh	r0, [r5, #0]
   0x0000801a <+26>:	movs	r0, r0
End of assembler dump.
Comment 6 Richard Earnshaw 2020-09-02 10:05:52 UTC
Yes, the problem is related to returning values in memory and the ABI variants we have.  If we have hardware floating-point we generally use registers to return values; if we don't, then we have to return in memory.

However, when we have a function that is not inlinable, but is private to the compilation unit we can optimize the ABI in some circumstances.  That's what is happening here.  Unfortunately, it appears that the function that decides whether the result should be returned in memory or in registers lacks important information as to whether or not the function is private, and this in turn leads to two parts of the compiler making different choices - with the disastrous consequences you've discovered.

I'm not sure if this is restricted to M-profile parts or if it's more widespread - I'm still investigating.
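
To make the two return conventions concrete, here is a rough C-level model of them. It is only an illustration, not what the compiler does internally: return_by_value stands for "result in registers" (d0-d3 in the hard-float disassembly above), and return_in_memory stands for "result in memory", where the caller passes the address of a buffer and gets it back (r0 in the soft-float disassembly above). All function names are hypothetical.

```c
#include <assert.h>

typedef struct {
  double m_a;
  double m_b;
  double m_c;
  double m_d;
} AtLeast32BytesObject;

/* Model of "returned in registers": a plain by-value return. */
static AtLeast32BytesObject return_by_value(void) {
  AtLeast32BytesObject result = {1.1, 2.2, 3.3, 4.4};
  return result;
}

/* Model of "returned in memory": the caller owns the buffer, the callee
   fills it in and hands the same address back. */
static AtLeast32BytesObject *return_in_memory(AtLeast32BytesObject *out) {
  out->m_a = 1.1;
  out->m_b = 2.2;
  out->m_c = 3.3;
  out->m_d = 4.4;
  return out;
}

/* Returns 1 when both conventions deliver the same values to the caller. */
int conventions_agree(void) {
  AtLeast32BytesObject a = return_by_value();
  AtLeast32BytesObject b;
  return_in_memory(&b);
  return a.m_a == b.m_a && a.m_b == b.m_b &&
         a.m_c == b.m_c && a.m_d == b.m_d;
}

/* Aborts (on a hosted build) if the two models diverge. */
void check_conventions(void) {
  assert(conventions_agree());
}
```

Either convention works as long as caller and callee use the same one. The failure in this report corresponds to one side following one model while the other side follows a different one.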
Comment 7 emilie.feral 2020-09-15 09:32:23 UTC
Hello,
Any news on the subject?
Would you advise in the meantime to discard the LTO (with the -fno-lto option) on the compilation unit containing the failing code?
The bug occurred for us when returning a structure of four doubles. Do you have any indication of when the bug might appear to help us track other occurrences?
Thanks for helping!
Comment 8 Richard Earnshaw 2020-09-15 13:36:43 UTC
(In reply to emilie.feral from comment #7)
> Hello,
> Any news on the subject?
> Would you advise in the meantime to discard the LTO (with the -fno-lto
> option) on the compilation unit containing the failing code?
> The bug occurred for us when returning a structure of four doubles. Do you
> have any indication of when the bug might appear to help us track other
> occurrences?
> Thanks for helping!

Sorry, I haven't had time to work on this yet.

The safest work-around for now is to add an additional attribute to force the PCS to the default for the selected ABI - I think adding 

 pcs("aapcs-vfp")

to the attributes will solve the problem.

i.e.

AtLeast32BytesObject __attribute__((noinline, pcs("aapcs-vfp"))) CalledFunction() {
  AtLeast32BytesObject result = {1.1, 2.2, 3.3, 4.4};
  return result;
}
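
If the same source also has to build for other targets or for a soft-float ABI, the attribute can be wrapped in a macro. The following is a sketch under assumptions: FORCE_DEFAULT_PCS and called_function_returns_expected_values are hypothetical names, and the guard relies on the __ARM_PCS_VFP predefine, which the compiler is expected to set when the VFP variant of the procedure call standard is the default. Whether this avoids the miscompile in every case has not been verified here.

```c
#include <assert.h>

/* Hypothetical macro: apply pcs("aapcs-vfp") only where the hard-float
   calling convention is already the default; expand to nothing elsewhere. */
#if defined(__arm__) && defined(__ARM_PCS_VFP)
#define FORCE_DEFAULT_PCS __attribute__((pcs("aapcs-vfp")))
#else
#define FORCE_DEFAULT_PCS
#endif

typedef struct {
  double m_a;
  double m_b;
  double m_c;
  double m_d;
} AtLeast32BytesObject;

AtLeast32BytesObject __attribute__((noinline)) FORCE_DEFAULT_PCS CalledFunction(void) {
  AtLeast32BytesObject result = {1.1, 2.2, 3.3, 4.4};
  return result;
}

/* Returns 1 if the caller receives the four expected values. */
int called_function_returns_expected_values(void) {
  AtLeast32BytesObject r = CalledFunction();
  return r.m_a == 1.1 && r.m_b == 2.2 && r.m_c == 3.3 && r.m_d == 4.4;
}

/* Aborts (on a hosted build) if the returned values are wrong. */
void check_pcs_workaround(void) {
  assert(called_function_returns_expected_values());
}
```

On a non-ARM host the macro expands to nothing, so the file still compiles and the check passes trivially; the attribute only takes effect in the hard-float ARM build.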
Comment 9 David Crocker 2022-03-14 15:36:52 UTC
Is there any update on this? I need to turn on LTO to keep the code size of a large application within the flash memory space of the target ARM Cortex M4F processor; but by the sound of it, doing so will be unsafe.
Comment 10 CVS Commits 2022-03-29 16:05:11 UTC
The master branch has been updated by Richard Earnshaw <rearnsha@gcc.gnu.org>:

https://gcc.gnu.org/g:1dca4ca1bf2f1b05537a1052e373d8b0ff11e53c

commit r12-7894-g1dca4ca1bf2f1b05537a1052e373d8b0ff11e53c
Author: Richard Earnshaw <rearnsha@arm.com>
Date:   Tue Mar 29 16:59:37 2022 +0100

    arm: temporarily disable 'local' pcs selection (PR96882)
    
    The arm port has an optimization used during selection of the
    function's ABI to permit deviation from the strict ABI when the
    function does not escape the current translation unit.
    
    Unfortunately, the ABI selection it makes can be unsafe if it changes
    how a result is returned because not enough information is available
    via the RETURN_IN_MEMORY hook to determine where the function gets
    used.  This can result in some parts of the compiler thinking a value
    is returned in memory while others think it is returned in registers.
    
    To mitigate this, this patch temporarily disables the optimization and
    falls back to using the default ABI for the translation.
    
    gcc/ChangeLog:
    
            PR target/96882
            * config/arm/arm.cc (arm_get_pcs_model): Disable selection of
            ARM_PCS_AAPCS_LOCAL.