
Re: modifying the ARM generation behavior?



hi,

thanks for the replies ;) i'm combining replies to avoid redundancy, 
hope you don't mind.

let me start by saying my current comparison comes from a vanilla
gcc-2.95.2, targeted at the CRL Skiff board (SA-110).  i realize you
may just tell me to upgrade my compiler to 3.0.0 or 3.0.1, and if you
think that will fix part of this, i'll be happy to give it a go... i
don't think it will, personally, but let me explain why.

my knowledge of ARM is limited to what i've been picking up in the
course of my research, and that research is the source of the problem.
so to give you a better understanding of what i want to do, i've put a
bit about what exactly i'm doing at the end of this email.  sorry, it
wound up being a bit windy :(

Nick:
> r11 is already in use.  It is the frame pointer.  In fact the ARM is

i was suggesting r11 as just an exercise to think about, sorry.  i 
didn't mean it literally.  you were both right to jump on me for my 
choice - bad thinking on my part...

Richard:
> Hm, which compiler release are you using?

2.95.2, as stated above.  moving the constant pools out of the function
body to the function end doesn't change my underlying problem.

Nick:
> This is a little unfair.  GCC will normally put these constants at the
> end of the function, not in the middle of the instruction stream.

as discussed below, it may partly be a compiler-version issue as well
as an optimization-level one... so "unfair" may be "unfair for current
gcc", but the toolchain CRL gave me was 2.95.2-based, and i've been
hesitant to replace it... i don't know how many other things i'd have
to replace in the process :(

Richard:
> Hm, what you are describing is a position-independent data model (in ARM's
> ATPCS parlance, RWPI -- read-write position independent), but taken to the
> extreme that even constants are pushed into the global data tables.

i'll have to read up on this... and the percentage efficiency drop you
mention.  i'd like to know if that drop is seen with *all* compilers
targeting ARM, or just ARM's own compiler.  i note that the ARM
compiler does not generate code the way gcc does ... as is true on
every architecture i've dealt with, gcc is a good compiler in general,
but if you want completely optimized code, you have to use the
platform-specific compiler (intel c, sun workshop c, etc).

Nick:
> to know that the next generation of the ARM ABI is being developed.
> See http://www.armdevzone.com/ for more information).

ooo, more goodies to look through.  thanks, i'll spend some time browsing.

Nick:
> What happens if there is too much data to fit into the area pointed to
> by r11 ?  (or whichever register is used).  Since this may only be

Richard:
> time.  Further, any moderately large program is going to exceed the 4k
> offset range of your base register, meaning that you will either need to
> create one base value per module (= more code at the start of each module
> to set the base register up) or you will have to compile on the assumption
> that a single ldr can't load a constant, something like

you both jumped on this one rather pointedly ;)  it deserves it.  i'm not
sure how silly the idea is.  but ... to answer ...

not necessarily.  i won't put it past some people to have a buttload of
files in their projects, so the smart decision would be to make the
register an indirect index itself.  think of it this way: you have 1K
of 4-byte addresses pointing to "data tables" ... each of those entries
in turn points to a function- or file-specific data table ... and there
you are.  you set the register up once at program start, and each
function then does two loads at the top to get the right table into the
register ... that doesn't seem like much of a penalty to me.  it makes
for more overhead in the data segment... but that could be massaged a
bit to reduce the inefficiency.  (a rough C sketch of the lookup is
below.)
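
to make the two-level scheme concrete, here's a minimal sketch of the
lookup each function would do; the names, the table size, and the
module-id assignment are all hypothetical:

        extern void *master_table[1024]; /* one 4-byte entry per module */

        int load_module_var(int module_id, int slot)
        {
            /* load 1: fetch this module's private data table */
            int *mod_table = master_table[module_id];
            /* load 2: fetch the variable itself */
            return mod_table[slot];
        }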

Richard:
> [re: alternate address load model]
>
>         add     Rtmp, Rbase, #OFFSET_HIGH(offset)
>         ldr     Rx, [Rtmp, #OFFSET_LOW(offset)]

this is sort of how Sparc (and other) systems work.  they use a
different instruction pattern for it, but it's the same two-stage
load... and it's very easy to catch in the code with a parser like
mine.  i know exactly what it's doing on the sparc, because the sparc
doesn't use register-offset addressing there.  makes me wish very
fervently that the ARM had a large-immediate instruction like "bl"
or "b" -- something like "ldrhi r3, <high-imm-16bit>" with an "ldrlo"
follow-on.

Nick:
> What about shared libraries ?  Would r11 be loaded with a different
> value whenever a shared library function is called, or would the
> shared libraries' data have to be merged into the application's own data?

uhhh... good question.  i'll have to think about the shared-libraries
aspect.  i don't see it as a major obstacle given the multiple-level
pointer trick above, but it might make the fixup a little dicey.  i was
assuming a function would always know what offset path to follow, with
the path inserted by the compiler and the values stuffed in by the
linker... shared libraries are a different problem.  i'm not using
them, so i hadn't thought about it.

Richard:
> normal use.  You can't use r11 since it is already used, so you would have
> to use r9 (or for some compilations r8); that would use up 15-20% of the
> remaining call-saved registers -- that's likely to have a significant
> effect on the efficiency of the rest of your code, since the compiler will
> now have to spill more often.

maybe, but i don't think so.  (ignoring which particular rN is used.)
in the test code i've generated (prior to using -ffixed-r8), i've
looked through the assembly output quite a bit.  working through adpcm
codecs, mpeg codecs, jpeg codecs, and some custom test apps, i have yet
to see anything use *more* than r0-r5.  i have never seen an r6, r7, or
r8 reference *anywhere*.

to be honest, i find that kind of odd and don't understand why.  maybe
there's some hidden penalty in using r6-r8 that the compiler knows
about and i don't?  but given this, i see no problem in taking away
another register that doesn't seem to be used anyway.  (note, i don't
use -O2, i use -O1 - it could be an optimization issue...)
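
(for completeness: besides -ffixed-r8 on the command line, GNU C can
also reserve a register from the source side with a global register
variable.  a minimal sketch, assuming r8 really is unused elsewhere in
the build - the variable name is made up:)

        /* claim r8 for the data-table base throughout this file;
           gcc will not allocate r8 for anything else here */
        register int *data_table_base asm("r8");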

Richard:
> > we also avoid having the dozens of "ldr r3,<some-var-block>" throughout
> > the code generation.  this would make for more efficient code.
> 
> Please show me a real example where we get dozens of such accesses that
> would be avoided by your model; the existing model makes use of the PC as
> an effective base register; you would lose that benefit with your
> approach.

yeah, there are pros and cons.  i don't really know how to solve this
particular problem.  but you wanted a real code example, so here ya go:

here's a fairly simple function i was testing through the system, and
where i first noticed the behavior:

test.c:
---------
#include <stdio.h>

extern int s1( int );
extern int s2( int );
extern int s3( int );

extern int g1;

int debug( void )
{
   int g;

   printf("debugging s1/2/3...\n");
   for (g=0; g<10; g++)
      printf("s1(%d) = %d, s2(%d)=%d, s3(%d)=%d\n", g, s1(g), g, s2(g), g, s3(g) );
   printf("end debug...\n");

   g1 *= s3(g);
   printf("g1 is now %d\n", g1);

   g = s1(10) + s2(20);
   return g;
}

here's a snippet of the asm output from gcc ...

        ldr     r3, .L7
        ldr     r0, [r3, #0]
        bl      s1
        mov     r4, r0
        ldr     r3, .L7
        ldr     r0, [r3, #0]
        bl      s2
        mov     r5, r0
        ldr     r3, .L7
        ldr     r0, [r3, #0]
        bl      s3
        mov     r1, r0
        ldr     r2, .L7
        ldr     r3, .L7
        str     r5, [sp, #0]

sure looks like a lot of "ldr r3, .L7" to me ;)  however, i note that
i'm using a very long string of flags to gcc, as well as an older
version.  when i went back and undid many of the flags, and put the
optimization level at -O2 (i use -O1 at present; -O2 has some side
effects i haven't figured out how to deal with yet), i *do* get output
that looks quite different:

.LM5:
        ldr     r0, [sp, #12]
        bl      s1
        mov     r4, r0
        ldr     r0, [sp, #12]
        bl      s2
        mov     r5, r0
        ldr     r0, [sp, #12]
        bl      s3
        mov     r3, r0
        str     r5, [sp, #0]
        ldr     r2, [sp, #12]

so i'm not sure if what i'm seeing is necessarily normal behavior.  kind of hard
to tell.
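
for what it's worth, the canonical source of these literal-pool loads
is any access to a global: the compiler has to materialize the global's
address before it can load the value.  a minimal illustration (the
comments show roughly what non-PIC ARM gcc emits; the function name is
made up):

        extern int g1;

        int read_g1(void)
        {
            /* roughly:  ldr r3, .Lx      @ .Lx holds the address of g1
                         ldr r0, [r3, #0] @ then load the value itself */
            return g1;
        }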

> I think it probable that code compiled the way you suggest could be made
> to work, but I very much doubt that it would be more efficient.

i'd be willing to settle for neutral.  i could accept a very small
performance loss - if we can make this work, we'd do an actual
implementation with hardware support for our check-routines and better
granularity on page controls.  then we might actually have a viable
system...

then again, we may suck eggs.  it's always hard to tell this early on which
way it will go ;)

any other thoughts on the situation?

thanks for the feedback!

-josh fryman

[ begin research description : ignore rest of email if uninterested ]

for a research project we're looking at a different model of CPU design
that would have *no* caches.  think of a remote sensor device with a
very small (~4-8K) on-chip memory footprint and that's it - plus some
way to interface to a sensor array.  (sensor = uninteresting black box
here.)  so essentially all the space that would have been cache is now
the RAM we have available.  we want to dynamically page code and data
in and out of this space, since our little sensor uP is connected to a
powerful backend server via some link (serial, IR, ethernet, wireless,
whatever).  the uP will run a little set of "stubs" that obtain code
snippets from the server, run them, and intercept ld/st and b/bl
situations ... remapping the instruction to (a) the proper address if
the target is resident, or (b) a fetch of the proper chunk followed by
the remap - this may involve shipping code/data back to the server.
the idea is that, assuming our memory is sufficiently large for our
"hot" code (adpcm coding, whatever), we eventually reach steady-state
and no longer talk to the server except to send "cooked" sensor data
back.

this is the final goal.  the current implementation is a test prototype
running under linux... the app and server run on the same skiff board
(or not) and communicate via generic socket read/write ops.  i have a
client which receives function-sized chunks and executes them, asking
the server periodically for more code... for now, we ignore the data
segment paging by just allocating enough memory and not trying to manage
it.  managing the code is difficult enough.
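
for the curious, the client's inner loop is conceptually just this (a
very rough sketch; the wire format, names, and error handling are all
simplified away):

        #include <stdint.h>
        #include <unistd.h>

        typedef int (*chunk_fn)(void);

        /* receive one function-sized chunk into an executable buffer
           and jump into it */
        static int run_chunk(int sock, void *codebuf)
        {
            uint32_t len;
            if (read(sock, &len, sizeof len) != sizeof len)
                return -1;                    /* header: chunk size */
            if (read(sock, codebuf, len) != (ssize_t) len)
                return -1;                    /* body: the code itself */
            return ((chunk_fn) codebuf)();    /* execute the chunk */
        }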

the problem we immediately run into is that with sub-function-sized
page granularity, it's very difficult to catch, in a "nice" way, the
memory references that load data from the instruction stream.  (from
those little tables in the text segment i'm complaining about ;)

i'm explicitly compiling the "app" to run on our client such that there
is a big separation between the I- and D- sections -- I at 0x021... and
D at 0x022...  if we were running in a real embedded environment (no
linux), i could probably just set the MMU up to catch these for me,
assuming i could get it to recognize the small page size...

but here, when the server is parsing the code chunk to send to the
client, it replaces "bl myfunc" with "bl bl_intercept" ... the
intercept does the negotiation with the server.  for me to use
finer-grained chunks and break up the function (say, on any "b offs" or
"bl func"), i need to move the code chunks around in memory so they're
non-contiguous.  the problem is that i then need more knowledge of the
code: i have to be on the lookout for "ldr r3, <some ofs addr in
I-space>" and replace it with something like "bl ld_intercept", doing
all the address work for the load elsewhere.  in all probability, i'd
wind up exceeding the limited offset range of the ldr instruction if i
just remapped it...
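
spotting those loads mechanically isn't the hard part, at least -- they
are PC-relative ldr instructions, which have a fixed encoding.  a
sketch of the test the parser would apply (the function name is made
up; the mask/value follow the ARM single-data-transfer encoding with
Rn = pc):

        #include <stdint.h>

        /* matches "ldr rd, [pc, #+/-imm12]": bits 27:24 = 0101,
           Rn = 15 (pc), L = 1, B = W = 0; bit 23 (the +/- sign of
           the offset) is left as a don't-care */
        static int is_pc_relative_ldr(uint32_t insn)
        {
            return (insn & 0x0F7F0000u) == 0x051F0000u;
        }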

you may think "so what?" - i'm already taking a huge performance hit
with the bl-intercept routine, so what difference does the ld-intercept
make?  the reason is that as we page code in, we self-modify those
"bl bl_intercept" sites to become "bl <new-real-address>" ... and when
we page code out, we replace any call sites into the now-removed code
with "bl bl_intercept" so we can reload the code as needed.  so in
essence we take the hit once, and then never again, once we reach that
"hot code" steady state...
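
the patching itself is just rewriting the 24-bit offset field of the
bl.  a sketch, assuming 32-bit ARM state and a word-aligned call site
(the function name is made up; on a cached part you'd also have to
flush the i-cache afterwards, which is exactly the sort of cost our
no-cache model avoids):

        #include <stdint.h>

        /* rewrite the bl at *call_site to branch directly to target;
           the offset field is in words, relative to call_site + 8
           (the pc reads two instructions ahead on ARM) */
        static void patch_bl(uint32_t *call_site, uint32_t target)
        {
            int32_t words =
                (int32_t) (target - ((uint32_t) call_site + 8)) >> 2;
            *call_site = (*call_site & 0xFF000000u)  /* keep cond+opcode */
                       | ((uint32_t) words & 0x00FFFFFFu);
        }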

the problem is that i can't see a way to *ever* remove the
ld_intercept, because i may always exceed the offset range of the ldr
instruction, not to mention the extra complexity in server-side
book-keeping.

if i could just stuff all those tables at a fixed address in memory
that i can keep track of in some way, that would make my life much
easier.  (maybe by defining a new segment ".funcvars" and sticking
them there ... by whatever means i can make it work...)
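
gcc does give you a hook for the last part, at least: the section
attribute will drop a variable into a named section, and the linker
script can then pin that section at a fixed address.  a minimal
example, using the hypothetical section name from above:

        /* place g1 in the dedicated data-table section instead of
           .data; the linker script still has to locate ".funcvars"
           at the fixed address */
        int g1 __attribute__((section(".funcvars"))) = 0;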

hope this info helps paint the broader picture...

