This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



[RL78] Questions about code-generation


Hi,

The code produced by GCC for the RL78 target is around twice as large as that produced by IAR and I've been trying to find out why.

The project I'm working on uses an RL78/F12 with 16KB of code flash.  As I have to get a bootloader and an application into that, I have to pay close attention to how large the code is becoming.

Looking at the assembler output for some simple examples, the problem seems to be 'bloated' code as opposed to not squeezing every last byte out through the use of ingenious optimization tricks.

I've managed to build GCC myself so that I could experiment a bit but as this is my first foray into compiler internals, I'm struggling to work out how things fit together and what affects what.

My initial impression is that significant gains could be made by clearing away some low-hanging fruit, but without understanding what caused that code to be generated in the first place, it's hard to do anything about it.

In particular, I'd be interested to know what is caused (or could be improved) by the RL78-specific code, and what comes from the generic part of GCC.

Here's an example extracted from one of the functions in our project:

--------

unsigned short gOrTest;
#define SOE0 (*(volatile unsigned short *)0xF012A)

void orTest()
{
   SOE0 |= 3;
   /* gOrTest |= 3; */
}

--------

This produces the following code (using -Os):

  29 0000 C9 F2 2A 01                  movw  r10, #298
  30 0004 AD F2                        movw  ax, r10
  31 0006 16                           movw  hl, ax
  32 0007 AB                           movw  ax, [hl]
  33 0008 BD F4                        movw  r12, ax
  34 000a 60                           mov   a, x
  35 000b 6C 03                        or    a, #3
  36 000d 9D F0                        mov   r8, a
  37 000f 8D F5                        mov   a, r13
  38 0011 9D F1                        mov   r9, a
  39 0013 AD F2                        movw  ax, r10
  40 0015 12                           movw  bc, ax
  41 0016 AD F0                        movw  ax, r8
  42 0018 78 00 00                     movw  [bc], ax
  43 001b D7                           ret

There's a lot of unnecessary register shuffling going on there: #298 could be loaded straight into HL, and the same value is later copied into BC even though HL hasn't been touched in between.

Commenting out the 'SOE0' line and bringing the 'gOrTest' line back in generates better code (but still worthy of optimization):

  29 0000 8F 00 00                     mov   a, !_gOrTest
  30 0003 6C 03                        or    a, #3
  31 0005 9F 00 00                     mov   !_gOrTest, a
  32 0008 8F 00 00                     mov   a, !_gOrTest+1
  33 000b 6C 00                        or    a, #0
  34 000d 9F 00 00                     mov   !_gOrTest+1, a
  35 0010 D7                           ret

What causes that code to be generated when using a variable instead of a fixed memory address?
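
The two cases differ in two respects: the SOE0 access is volatile, and it goes through a constant address rather than a symbol. A minimal sketch for building each variant from the same source follows. The macro names USE_SFR, USE_VOLATILE_VAR and TARGET, and the extra variable gOrTestVolatile, are made up for illustration and are not part of the project. Which of the two differences actually matters is something only the generated code can show; this just sets up the experiment.

```c
/* Hypothetical test file: select the operand of the OR at build time.
   USE_SFR, USE_VOLATILE_VAR, TARGET and gOrTestVolatile are
   illustrative names only. */
unsigned short gOrTest;
volatile unsigned short gOrTestVolatile;

#if defined(USE_SFR)
/* Volatile access through a constant address (the SOE0 case). */
#define TARGET (*(volatile unsigned short *)0xF012A)
#elif defined(USE_VOLATILE_VAR)
/* Volatile access to an ordinary global. */
#define TARGET gOrTestVolatile
#else
/* Non-volatile access to an ordinary global (the gOrTest case). */
#define TARGET gOrTest
#endif

void orTest(void)
{
   TARGET |= 3;
}
```

Building once with no extra defines, once with -DUSE_VOLATILE_VAR and once with -DUSE_SFR (all with -Os) gives three listings to compare.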

Even keeping the unnecessary 'or a, #0' and sticking to a 16-bit access, the same operation fits in half the space of the original (14 bytes instead of 28):


  29 0000 36 2A 01                     movw hl, #298
  30 0003 AB                           movw ax, [hl]
  31 0004 75                           mov  d, a
  32 0005 60                           mov  a, x
  33 0006 6C 03                        or   a, #3
  34 0008 70                           mov  x, a
  35 0009 65                           mov  a, d
  36 000a 6C 00                        or   a, #0
  37 000c BB                           movw [hl], ax
  38 000d D7                           ret

And, of course, that could be optimized further.
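
As a sanity check on the hand-written sequence, it can be modelled in plain C on the host. The sketch below assumes that X holds the low byte and A the high byte of AX; the function names are hypothetical. It checks that the byte-wise version, and a version with the 'or a, #0' dropped, both give the same result as a plain 16-bit OR for every possible input.

```c
#include <stdint.h>

/* Models the hand-written sequence: load 16 bits, OR each byte, store.
   Assumes X is the low byte and A the high byte of AX. */
static uint16_t or3_bytewise(uint16_t value)
{
    uint8_t x = (uint8_t)(value & 0xFFu);  /* low byte, register X */
    uint8_t a = (uint8_t)(value >> 8);     /* high byte, register A */

    x |= 3;   /* or a, #3 applied to the low byte */
    a |= 0;   /* or a, #0 applied to the high byte: changes nothing */

    return (uint16_t)(((unsigned)a << 8) | x);
}

/* Same operation with the high-byte OR left out entirely. */
static uint16_t or3_low_only(uint16_t value)
{
    return (uint16_t)((value & 0xFF00u) | ((value & 0x00FFu) | 3u));
}

/* Counts inputs for which either model disagrees with value | 3. */
static unsigned count_mismatches(void)
{
    unsigned mismatches = 0;
    for (uint32_t v = 0; v <= 0xFFFFu; ++v) {
        uint16_t expected = (uint16_t)(v | 3u);
        if (or3_bytewise((uint16_t)v) != expected) ++mismatches;
        if (or3_low_only((uint16_t)v) != expected) ++mismatches;
    }
    return mismatches;
}
```

If count_mismatches() returns 0, the high-byte OR can be dropped without changing the result, which is one of the further savings available.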

Excessive register copying and an apparent preference for R8 onwards over the B, C, D, E, H and L registers (which could save a byte on every 'mov') seem to be among the main causes of the 'bloated' code.

So, I guess my question is how much of the bloat comes from inefficiencies in the hardware-specific code?  I saw a comment in the RL78 code about performing CSE optimization but it's not clear to me where or how that would be done.  I tried to look at the code for some other processors to get an idea but it's hard to find things when you don't know what you're looking for :)

Any help would be gratefully received!

Regards,

Richard Hulme

