87373 – Byte by byte accessing to the peripheral registers causes issues on ARM v7-m architecture

Bug 87373 - Byte by byte accessing to the peripheral registers causes issues on ARM v7-m architecture

Summary: Byte by byte accessing to the peripheral registers causes issues on ARM v7-m ...

Status:	RESOLVED WONTFIX

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	c++ (show other bugs)
Version:	unknown

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2018-09-21 06:51 UTC by Murat Ursavaş
Modified:	2018-10-01 09:54 UTC (History)
CC List:	1 user (show)

See Also:
Host:
Target:	arm
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
Minimum Test Case (207 bytes, text/x-csrc) 2018-09-21 06:51 UTC, Murat Ursavaş	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Murat Ursavaş 2018-09-21 06:51:52 UTC

Created attachment 44731 [details]
Minimum Test Case

Hi,

I've faced a weird behavior on my ARM MCU about a year ago and reported it on the launchpad page. In the meantime I tried to report the issue at here but your register process is longer than expected and here I'm reporting this issue almost a year later.

We'd discussed the issue on this page:

https://bugs.launchpad.net/gcc-arm-embedded/+bug/1738730

I think this looks like a bug, but with my limited internal GCC knowledge, I don't want to be rude and call it like that directly.

If I should summarize the discussion;

The GCC versions from 5.2 to 7, the compiler generates unexpected code while handling packed structs. Normally the behavior should be like 4.9 but somehow it got broken since then. Misaligned access should help but this does not seem to be the case.

I've attached a minimum example which reproduces the issue.

About the seriousness; I can not update my compiler and still use 4.9 for our products. This is a annoying drawback for me because I know GCC has received many improvements since then. The IDE which I use is still deployed with 4.9 and this is unacceptable from my point of view.

MCU Core: ARM Cortex M3

Comment 1 Umesh Kalappa 2018-09-21 08:57:21 UTC

with trunk i see below output for the attached case .

main:
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        sub     sp, sp, #8
        movw    r3, #1025
        str     r3, [sp]
        ldr     r0, [sp]
        ldr     r3, .L4
        uxtb    r0, r0
        str     sp, [r3]
        add     sp, sp, #8
        @ sp needed
        bx      lr

$ arm-windriver-linux-gnueabi-gcc  -mcpu=cortex-m3 -mthumb  '-DDEBUG=1' -O2 -pedantic -Wall -Wextra  -fmessage-length=0  -mno-sched-prolog -fno-builtin -fshort-enums  test.c -S

Comment 2 Murat Ursavaş 2018-09-21 08:59:47 UTC

Hi Umesh,

Could you test it with the following options:

-g3 -gdwarf-2 -mcpu=cortex-m3 -mthumb -std=c++1y '-DDEBUG=1' -O0 -pedantic -Wall -Wextra -c -fmessage-length=0 -fno-rtti -fno-exceptions -mno-sched-prolog -fno-builtin -fpack-struct -fshort-enums -ffunction-sections -fdata-sections

Thanks in advance.

Comment 3 Umesh Kalappa 2018-09-21 09:18:29 UTC

With -O0 , i see the byte load /store like

 push    {r7}
        sub     sp, sp, #12
        add     r7, sp, #0
        ldr     r2, .L3
        mov     r3, r7
        str     r3, [r2]
        ldr     r3, .L3
        ldr     r3, [r3]
        ldrb    r2, [r3]
        movs    r2, #0
        orr     r2, r2, #1
        strb    r2, [r3]
        ldrb    r2, [r3, #1]
        movs    r2, #0
        orr     r2, r2, #4
        strb    r2, [r3, #1]
        ldrb    r2, [r3, #2]
        movs    r2, #0
        strb    r2, [r3, #2]
        ldrb    r2, [r3, #3]
        movs    r2, #0
        strb    r2, [r3, #3]
        ldr     r3, .L3
        ldr     r3, [r3]
        ldr     r2, [r3]        @ unaligned
        ldr     r3, .L3
        ldr     r3, [r3]
        uxtb    r2, r2
        strb    r2, [r3, #4]
        ldr     r3, .L3
        ldr     r3, [r3]
        ldrb    r3, [r3, #4]    @ zero_extendqisi2
        mov     r0, r3
        adds    r7, r7, #12
        mov     sp, r7
        @ sp needed
        pop     {r7}
        bx      lr

but ,why you are compiling with -O0 , any reason such ?

Comment 4 Jonathan Wakely 2018-09-21 09:37:04 UTC

(In reply to Murat Ursavaş from comment #0)
> I've faced a weird behavior on my ARM MCU about a year ago and reported it
> on the launchpad page. In the meantime I tried to report the issue at here
> but your register process is longer than expected and here I'm reporting
> this issue almost a year later.

Umm, you got a bugzilla account here on Wed, 20 Dec 2017, so any delay since then is not because of our process :-)

Comment 5 Richard Earnshaw 2018-09-21 09:45:36 UTC

It's not clear what behaviour you think is 'proper' for a packed struct with a volatile member.  Since packed is a GNU extension, there's nothing in the C (or C++) standards that you can call upon to reach a reasonable expectation, and anything that is left must be down to either the GNU manuals or good luck.  In this case I don't think the GNU manuals say anything specific about this case either (since that would imply knowing how each supported architecture handles misaligned memory accesses.

Comment 6 Murat Ursavaş 2018-09-21 09:46:28 UTC

Hi Jonathan,

I just wanted a dramatic entrance :) (There was a discussion about GCC bugzilla on reddit recently) Of course it hasn't took that long. But this is like missing a call. You would answer that at that time but if you miss it, it becomes difficult to call back :)

Comment 7 Richard Earnshaw 2018-09-21 09:51:05 UTC

(In reply to Murat Ursavaş from comment #6)
> Hi Jonathan,
> 
> I just wanted a dramatic entrance :) (There was a discussion about GCC
> bugzilla on reddit recently) Of course it hasn't took that long. But this is
> like missing a call. You would answer that at that time but if you miss it,
> it becomes difficult to call back :)

Unfortunately we had to disable automatic account creation due to spammers abusing it.  This is a volunteer project and we can't have admins sitting around just waiting for real people to ask for accounts so that they can respond instantly.

Comment 8 Jonathan Wakely 2018-09-21 10:05:37 UTC

Yes, I read the discussion on reddit, it was annoying. It took 7h18m for your account to be created after you sent an email requesting it. Then it took 9 months for you to use the account. On the other hand, since we started the manual registration process we've had no spam attacks. The process is working IMHO.

Comment 9 Murat Ursavaş 2018-09-21 10:34:18 UTC

Umesh,

The reason is step-by-step debugging. I'd like to debug it first with -O0, than pack it with -Os for the release. Otherwise with a low resource MCU, things become messy really fast.

Comment 10 Murat Ursavaş 2018-09-21 10:38:06 UTC

Jonathan,

I don't blame any of you and very well aware of the volunteering effort. Please don't get me wrong. It's just me attempted multiple times to open the case but get distracted with something else.

Let's concentrate on the case, OK? :)

Thanks for your efforts again, I appreciate that and try to make more people appreciate it.

Comment 11 Murat Ursavaş 2018-09-21 10:59:10 UTC

Richard,

I don't know about the standards as you are and please accept me as a newbie. The peripheral parameters of the manufacturer library are all defined as volatile structs and accessed with pointers. This is working till 4.9 but not after 5.2.

So point me to the right direction, who should fix what and how?

I can't fix my software or use a workaround because I need packed structs. I tried many other options but none of them worked.

To understand the matter I'd like to ask, the trunk is creating the code Umesh just sent and this doesn't look like I expect. Is the code correct?

As far as I know I should see a shorter one with enabled unaligned access, which looks default on Cortex-M3 architecture. Of course I would like to have shorter image but longer image was not helping and assigning wrong values to the variables. Minimum test case has been written to show that.

Comment 12 Murat Ursavaş 2018-09-21 11:11:28 UTC

Richard,

Ok I remembered things with reading the old posts on launchpad. The compiler was generating normal code if I use the struct variable directly. But if I use a pointer to access it, it assigns not what I try to assign.

Isn't that a compiler bug?

Comment 13 Murat Ursavaş 2018-09-21 11:18:35 UTC

Richard,

Also as far as I remember GNU manual was indeed saying something on this case. It was saying that "if the struct is not packed, it would access to members word by word. But if unaligned access is disabled, it would access the variables byte by byte and create longer and slower images. With unaligned access, it should access to the struct members again word by word."

I can't show where it's written but I remember it. 

With the direct access, it is indeed accessing word by word no matter -pack-struct option is. But if I access it indirectly with a pointer, it tries to access byte by byte and still can't assign the values developer want.

Comment 14 Richard Earnshaw 2018-09-21 12:30:29 UTC

(In reply to Murat Ursavaş from comment #13)
> Richard,
> 
> Also as far as I remember GNU manual was indeed saying something on this
> case. It was saying that "if the struct is not packed, it would access to
> members word by word. But if unaligned access is disabled, it would access
> the variables byte by byte and create longer and slower images. With
> unaligned access, it should access to the struct members again word by word."
> 
> I can't show where it's written but I remember it. 
> 
> With the direct access, it is indeed accessing word by word no matter
> -pack-struct option is. But if I access it indirectly with a pointer, it
> tries to access byte by byte and still can't assign the values developer
> want.

That's because in the direct case it can see that the object is on the stack and aligned, so the value can be read simply without an unaligned object.  Once the object is accessed through the pointer the knowledge of the alignment is lost and accesses have to be done more conservatively.

Comment 15 Richard Earnshaw 2018-09-21 12:33:55 UTC

(In reply to Murat Ursavaş from comment #12)
> Richard,
> 
> Ok I remembered things with reading the old posts on launchpad. The compiler
> was generating normal code if I use the struct variable directly. But if I
> use a pointer to access it, it assigns not what I try to assign.
> 
> Isn't that a compiler bug?

I'm still not sure what you're saying is incorrect, based on your testcase.

Are you talking about the value returned from the function, or the types of memory access being used?  ie, if we lay aside for the moment that this is accessing a device, does the program behave correctly in terms of the values read and written?

Comment 16 Murat Ursavaş 2018-09-21 12:39:52 UTC

OK I understand conservative action and not wait for word by word access. But the resulting value is not 0x401 on the test case, but it should be.

In my original case this was effecting a USART peripheral register and it was activating different switches and making the peripheral useless. 

To make the case worse, if you assign some important physical pins on the same port, this bug can make them work differently. Let's say, it can pull a trigger without intention, because GPIO peripherals simply waits for bit assignments.

Comment 17 Richard Earnshaw 2018-09-21 13:15:41 UTC

(In reply to Murat Ursavaş from comment #16)
> OK I understand conservative action and not wait for word by word access.
> But the resulting value is not 0x401 on the test case, but it should be.

Is not 0x401 at what point?  This part of the sequence:

        ldr     r3, .L3        < &testStructPtr
        ldr     r3, [r3]       < testStructPtr
        ldrb    r2, [r3]       < testStructPtr->one[byte 0] (dead)
        movs    r2, #0         < 0
        orr     r2, r2, #1     < 1
        strb    r2, [r3]       < testStructPtr->one[byte 0] = 0x01
        ldrb    r2, [r3, #1]   < testStructPtr->one[byte 1] (dead)
        movs    r2, #0         < 0
        orr     r2, r2, #4     < 4
        strb    r2, [r3, #1]   < testStructPtr->one[byte 1] = 0x04
        ldrb    r2, [r3, #2]
        movs    r2, #0
        strb    r2, [r3, #2]   < testStructPtr->one[byte 2] = 0x00
        ldrb    r2, [r3, #3]
        movs    r2, #0
        strb    r2, [r3, #3]   < testStructPtr->one[byte 3] = 0x00

so successive bytes starting at &(testStructPtr->one) contain 0x01, 0x04, 0x00, 0x00 which is the value in the source code.

Comment 18 Richard Earnshaw 2018-09-21 13:35:06 UTC

BTW, are you really trying to say that your hardware has a register that isn't naturally aligned?  That's really weird if so.  If not, there are much easier ways to handle this sort of stuff.

Comment 19 Murat Ursavaş 2018-09-21 15:14:44 UTC

Hi Richard,

This source code had been designed to see word by word access and may create expected results. I'm not sure about that.

Let me use latest stable and see what happens. It wasn't plug and play last time but like I said I have to make sure I'm not the root cause. (Looks like I'm the usual suspect at this :)

Thanks for verifying that this is the expected code, just longer due to more conservative access style.

This could take a while as I'm struggling to find some extra time to pursue these.

Comment 20 Murat Ursavaş 2018-09-21 15:16:03 UTC

By the way, the hardware peripheral registers are aligned to 32bits.

Comment 21 Richard Earnshaw 2018-09-21 15:27:11 UTC

(In reply to Murat Ursavaş from comment #20)
> By the way, the hardware peripheral registers are aligned to 32bits.

So why don't you define your struct as

struct TestStructType
{
    volatile unsigned one;
    unsigned char two;
    unsigned short three __attribute__((packed));
};

And get rid of the pragma entirely?

Comment 22 Richard Earnshaw 2018-09-21 15:29:53 UTC

Or
#pragma pack(push, 1)

struct TestStructType
{
  volatile unsigned one;
    unsigned char two;
  unsigned short three;
} __attribute__((aligned(32)));

#pragma pack(pop)

Comment 23 Richard Earnshaw 2018-09-21 15:33:10 UTC

(In reply to Richard Earnshaw from comment #22)
> Or
> #pragma pack(push, 1)
> 
> struct TestStructType
> {
>   volatile unsigned one;
>     unsigned char two;
>   unsigned short three;
> } __attribute__((aligned(32)));

Err, that should be aligned(4), the alignment is in bytes!

> 
> #pragma pack(pop)

Comment 24 Eric Gallager 2018-09-21 18:47:26 UTC

(In reply to Murat Ursavaş from comment #6)
> Hi Jonathan,
> 
> I just wanted a dramatic entrance :) (There was a discussion about GCC
> bugzilla on reddit recently) 

Link to the reddit discussion? I searched and can't seem to find it.

Comment 25 Murat Ursavaş 2018-09-22 04:15:33 UTC

(In reply to Eric Gallager from comment #24)
> (In reply to Murat Ursavaş from comment #6)
> > Hi Jonathan,
> > 
> > I just wanted a dramatic entrance :) (There was a discussion about GCC
> > bugzilla on reddit recently) 
> 
> Link to the reddit discussion? I searched and can't seem to find it.

I've found it via twitter. Here;

https://twitter.com/blelbach/status/1032586866162196481

Comment 26 Murat Ursavaş 2018-09-22 04:23:59 UTC

(In reply to Richard Earnshaw from comment #21)
> (In reply to Murat Ursavaş from comment #20)
> > By the way, the hardware peripheral registers are aligned to 32bits.
> 
> So why don't you define your struct as
> 
> struct TestStructType
> {
>     volatile unsigned one;
>     unsigned char two;
>     unsigned short three __attribute__((packed));
> };
> 
> And get rid of the pragma entirely?

Richard,

Some of the structs are not under my control, since they belong to manufacturer libraries. I need pack-1 for some due to storage and communication needs.

And I didn't know that I could pack individual struct members. Please correct me if I'm wrong. This structs size is total 10 bytes, one:4, two:4 and three:2, right?

Comment 27 Richard Biener 2018-09-24 14:40:06 UTC

There doesn't seem to be any bug here just discussion belonging to gcc-help.

Comment 28 Murat Ursavaş 2018-09-29 13:57:27 UTC

Hi,

I've created the time and added 7.3.1 to my toolchain list. (It is really annoyingly hard to add a new toolchain in my configuration due to a bug in the IDE).

Anyway right now I can compare 4.9.3 and 7.3.1 side by side and my application is not working with the 7 series. That is exactly how it started at the beginning.

I've cleared some issues that could interfere with this issue but now I can reproduce the issue on my target.

I'm not sure this is due to packed structs or not but I've found a difference which should not happen. Please bear with me on this.

Here's the disassembly of a problematic part:

4.9.3

121           NVM_SPI->ROUTE = USART_ROUTE_TXPEN | USART_ROUTE_RXPEN | USART_ROUTE_CLKPEN | NVM_SPI_LOCATION;
00029e38:   ldr     r3,[pc,#0x4c] ; 0x29e84
00029e3a:   ldr     r2,[r3,#0x54]
00029e3c:   movs    r2,#0x0
00029e3e:   orr     r2,r2,#0xb
00029e42:   str     r2,[r3,#0x54]

7.3.1

121           NVM_SPI->ROUTE = USART_ROUTE_TXPEN | USART_ROUTE_RXPEN | USART_ROUTE_CLKPEN | NVM_SPI_LOCATION;
0000572e:   ldr     r3,[pc,#0x70] ; 0x579c
00005730:   ldrb.w  r2,[r3,#0x54]
00005734:   movs    r2,#0x0
00005736:   orr     r2,r2,#0xb
0000573a:   strb.w  r2,[r3,#0x54]
0000573e:   ldrb.w  r2,[r3,#0x55]
00005742:   movs    r2,#0x0
00005744:   strb.w  r2,[r3,#0x55]
00005748:   ldrb.w  r2,[r3,#0x56]
0000574c:   movs    r2,#0x0
0000574e:   strb.w  r2,[r3,#0x56]
00005752:   ldrb.w  r2,[r3,#0x57]
00005756:   movs    r2,#0x0
00005758:   strb.w  r2,[r3,#0x57]

4.9.3 sets the ROUTE register as 0xB correctly. But 7.3.1 sets it as 0x30B. The correct value is 0xB (calculated from the bit values). This maps the USART to the wrong pins and makes the peripheral physically useless and also cripples other pins.

Like I said, this may not be a bug, could be my error or vendor libraries but something doesn't look right. Please let me know if you need further info. I may need some guidance to collect more data.

P.S: I'm trying to improve GCC, otherwise I'm just fine with 4.9.3.

Comment 29 Murat Ursavaş 2018-09-29 18:31:19 UTC

And just out of curiosity, why the compiler loads zero to the register and then OR's with the value? 

00029e3c:   movs    r2,#0x0
00029e3e:   orr     r2,r2,#0xb

Why doesn't it load directly the necessary value? Like,

00029e3c:   movs    r2,#0xb

I know ARM arch needs load/store mechanism for the RAM but why this additional task for a register?

Comment 30 Murat Ursavaş 2018-09-29 18:56:07 UTC

OK, looks like it is possible like this:

ldr r2, =0x0000000b

Source: https://stackoverflow.com/questions/38689886/loading-32-bit-values-to-a-register-in-arm-assembly

Comment 31 Murat Ursavaş 2018-09-29 19:11:39 UTC

(In reply to Murat Ursavaş from comment #28)
> 
> Here's the disassembly of a problematic part:
> 
> 4.9.3
> 
> 121           NVM_SPI->ROUTE = USART_ROUTE_TXPEN | USART_ROUTE_RXPEN |
> USART_ROUTE_CLKPEN | NVM_SPI_LOCATION;
> 00029e38:   ldr     r3,[pc,#0x4c] ; 0x29e84
> 00029e3a:   ldr     r2,[r3,#0x54]
> 00029e3c:   movs    r2,#0x0
> 00029e3e:   orr     r2,r2,#0xb
> 00029e42:   str     r2,[r3,#0x54]
> 
> 7.3.1
> 
> 121           NVM_SPI->ROUTE = USART_ROUTE_TXPEN | USART_ROUTE_RXPEN |
> USART_ROUTE_CLKPEN | NVM_SPI_LOCATION;
> 0000572e:   ldr     r3,[pc,#0x70] ; 0x579c
> 00005730:   ldrb.w  r2,[r3,#0x54]
> 00005734:   movs    r2,#0x0
> 00005736:   orr     r2,r2,#0xb
> 0000573a:   strb.w  r2,[r3,#0x54]
> 0000573e:   ldrb.w  r2,[r3,#0x55]
> 00005742:   movs    r2,#0x0
> 00005744:   strb.w  r2,[r3,#0x55]
> 00005748:   ldrb.w  r2,[r3,#0x56]
> 0000574c:   movs    r2,#0x0
> 0000574e:   strb.w  r2,[r3,#0x56]
> 00005752:   ldrb.w  r2,[r3,#0x57]
> 00005756:   movs    r2,#0x0
> 00005758:   strb.w  r2,[r3,#0x57]

My limited assembler knowledge says new one is byte by byte access and should set the register correctly, but somehow it's not.

Could actual object code be different than what I see in the disassembly? I'll try to verify it via inspecting the code space.

Comment 32 Murat Ursavaş 2018-09-29 20:15:43 UTC

OK, dug down into Thumb-2 reference manual and verified. The code space shows correct instructions.

For this line;

strb.w  r2,[r3,#0x54]

It shows;

0xF883 0x2054

If I've read correctly, this is exactly what the disassembly says.

I guess from GCC perspective, this is a perfectly valid situation and this ticket should be closed.

If you have any additional ideas what could cause this, I'm all ears (Peripheral, Core, GDB). Otherwise, thanks for your time. It was enlightening for me to chase this issue.

Comment 33 Murat Ursavaş 2018-09-29 20:19:41 UTC

One thing though. Would you accept this a regression and get back to 4.9 style?

Yes, GCC is doing everything by the book but the result is not perfect (due to other undocumented issues not related to GNU team).

Comment 34 Murat Ursavaş 2018-10-01 06:35:00 UTC

I think I've got what's going on. (I know this case turned to a monologue, but I would like to improve it for the future search references.

In ARM architecture we have one simple linear address space for everything, flash, RAM and other hardware like peripherals. This makes many things quite easy.

If you would like to setup a USART, you just write some information on this address space and you get what you want, like in this case. But If you are like me, trying to make everything deterministic, you may want to enable packed-structs. No problem with that. GCC takes care of the rest. It can access the RAM unaligned anyway as default.

But, one thing can stay under the radar. We see peripheral registers as usual RAM addresses, but they are not. They may have limitations like no unaligned access.

In this case with GCC 4.9.3, if I access to the register it uses this:

ldr     r3,[pc,#0x4c]
ldr     r2,[r3,#0x54]
movs    r2,#0x0
orr     r2,r2,#0xb
str     r2,[r3,#0x54]    ;<<<< Important instruction

This part easily sets a 32bit register and everything works as expected.

But after GCC 5+, It uses byte by byte access an uses the instructions below;

ldr     r3,[pc,#0x70]
ldrb.w  r2,[r3,#0x54]
movs    r2,#0x0
orr     r2,r2,#0xb
strb.w  r2,[r3,#0x54]    ;<<<< Important instruction
ldrb.w  r2,[r3,#0x55]
movs    r2,#0x0
strb.w  r2,[r3,#0x55]
ldrb.w  r2,[r3,#0x56]
movs    r2,#0x0
strb.w  r2,[r3,#0x56]
ldrb.w  r2,[r3,#0x57]
movs    r2,#0x0
strb.w  r2,[r3,#0x57]

There is nothing wrong, if it was a normal RAM location. It would set the register as 0x0000000b. But since this is a peripheral location, and has to be accessed as aligned, it takes just the first strb.w instruction into consideration, and leaves further ones useless. 0 - 7bits are OK, but 8-31 bits are left to decide by entropy. In my case the entropy wants to move the physical pins to a different location.

I'm not sure whether this is a GCC regression, or must be taken care by the hardware manufacturer, but this is my conclusion at the end.

So what will be my workaround;

Project wide packed structs are dangerous, I'll remove it from the project settings and limit it down to necessary structs, leaving others relaxed. This should make the peripheral access aligned.

P.S: I'm reopening this record for one final evaluation by the GNU team. From my perspective, this looks like a regression, but it's up to you guys.

Comment 35 Richard Earnshaw 2018-10-01 09:54:03 UTC

I haven't researched the history of this change, but it was probably deliberate.  Device memory on processors (a memory classification that implies that the underlying contents may contain memory mapped devices that have side effects) is generally intolerant of misaligned accesses.  As such, emitting an unaligned access to it may cause the system to fault.

If that was the reason for the change to GCC, it won't be reversed.  (The fact that in your example the optimizers discover the alignment and change the code back to an aligned access does not alter this general observation.)