Bug 43129

Summary:	Simplify global variable's address loading with option -fpic
Product:	gcc	Reporter:	Carrot <carrot>
Component:	target	Assignee:	Not yet assigned to anyone <unassigned>
Status:	NEW ---
Severity:	enhancement	CC:	carrot, gcc-bugs, rearnsha, siarhei.siamashka, stephen.clarke
Priority:	P3	Keywords:	missed-optimization
Version:	4.5.0
Target Milestone:	---
Host:	i686-linux	Target:	arm-eabi
Build:	i686-linux	Known to work:
Known to fail:		Last reconfirmed:	2010-02-20 12:41:03

Description Carrot 2010-02-20 08:28:31 UTC

Compile the following code with options -march=armv5te -mthumb -Os -fpic

extern int i;
int foo(int j)
{
  int t = i;
  i = j;
  return t;
}

GCC generates following code:

foo:
        ldr     r3, .L2        // A
        ldr     r2, .L2+4      // B
.LPIC0:
        add     r3, pc         // A
        ldr     r3, [r3, r2]   // B
        @ sp needed for prologue
        ldr     r2, [r3]
        str     r0, [r3]
        mov     r0, r2
        bx      lr
.L3:
        .align  2
.L2:
        .word   _GLOBAL_OFFSET_TABLE_-(.LPIC0+4)  // A
        .word   i(GOT)               // B

Instructions marked A compute the address of GOT table, instructions marked B load the global variables GOT entry to get actual address of the global variable. There are 4 instructions and 2 constant pool entries in total. It can be simplified by applying the fact that the offset from label .LPIC0 to any GOT entry is fixed at linking time. The result is:

foo:
        ldr     r3, .L2        // C
.LPIC0:
        add     r3, pc         // C
        ldr     r3, [r3]       // C
        @ sp needed for prologue
        ldr     r2, [r3]
        str     r0, [r3]
        mov     r0, r2
        bx      lr
.L3:
        .align  2
.L2:
        .word   ABS_ADDRESS_OF_GOT_ENTRY_FOR_i -(.LPIC0+4) // C

The instructions marked C load the actual address of a global variable. It uses only 3 instructions and 1 constant pool entry. It is both smaller and faster.

But it is not always beneficial to use instruction sequence C. If there are many global variable accesses, by using code sequence B, each one global variable need 2 extra instructions to load its address. But using code sequence C, each one global variable need 3 extra instructions to load its address.

Suppose there are n global variables, the code size needed to compute the actual addresses by instruction sequence A and B is:

code_size(A) + code_size(B) * n = 2*2 + 4 + (2*2 + 4) * n = 8n + 8         <1>

The code size needed by instruction sequence C is:

code_size(C) = (3*2 + 4) * n = 10n                                      <2>

Let <1> = <2>, we get
 
   8n + 8 = 10n
        n = 4

So if there are more than 4 global variables' access instruction sequence A and B is smaller, if there are less than 4 global variables' access instruction sequence C is smaller. If there are 4 global variables' access both methods have same code size. But code sequence C has one less memory load (the load in instructions A) and use one less register(the global register hold the GOT address). So code sequence C is still faster.

For arm instruction set, both methods have same code sequence, but with different code size, now we have
 
   4 * 2 + 4 + (4 * 2 + 4) * n = (4 * 3 + 4) * n
                             n = 3

So the threshold value of n is 3 for arm instruction set.

Now the problem is how to represent the offset from a code label to a global variable's GOT entry. Ian mentioned that arm relocation R_ARM_GOT_PREL can be used, but I can't find how to represent this relocation in gnu assembler.

Any suggestions?

Comment 1 Carrot 2010-03-16 06:23:57 UTC

This optimization uses one less register (the register hold the GOT base), to get this beneficial the ideal place for it should be before register allocation.

Usually expand pass generates instructions to load global variable's address from GOT entry for each access of the global variable. Later cse/gcse passes can remove many of them. In order to precisely model the cost, this optimization should be put after some cse/gcse passes.

So what is the best place for this optimization? Is there any existed pass can be enhanced with this optimization? Or should I add a new pass?

Comment 2 Eric Botcazou 2010-03-30 14:36:40 UTC

Doesn't this belong in the linker as a relaxation?  This would solve the reloc problem in the process.

Comment 3 Carrot 2010-03-31 08:01:18 UTC

(In reply to comment #2)
> Doesn't this belong in the linker as a relaxation?  This would solve the reloc
> problem in the process.
> 

Gnu linker has already support R_ARM_GOT_PREL. And the new relocation (GOT_PREL) has been added to trunk gas. So the offset can be represented as

      i(GOT_PREL)+(.-(.LPIC1+4))

Now my question is where is the best place to insert this enhancement?

Comment 4 Richard Henderson 2010-04-07 15:45:19 UTC

My best guess is that this optimization should be done late.
For instance, in the machine-dependant reorg pass.  I don't
see any place to hook this earlier.

The problem is that reload should be able to "spill" pseudos
containing your got addresses and re-compute them from the
given constants rather than consuming a stack slot to hold
the computed value.  Which means that the number of instances
of got address loads may vary until after reload, which means
that any size estimation calculation you do earlier can be off.

The down-side to doing it after reload is that you will have
committed to saving and restoring arm_pic_register in the prologue
and epilogue.  Given that arm uses ldm/stm this ought not impact
your code size often, but will in the extreme case of a leaf
function with no other saved registers.

I guess you'll have to experiment with your implementation to
see what gives the best results on a large body of code.

Comment 5 Carrot 2010-04-08 13:31:54 UTC

(In reply to comment #4)
> I guess you'll have to experiment with your implementation to
> see what gives the best results on a large body of code.
> 

I will experiment on CSiBE.

Comment 6 Carrot 2010-04-11 12:09:28 UTC

Some experiment results:

Compile CSiBE with options -Os -fpic -mthumb -fno-short-enums

without this optimization: 2830665
simplify-got before ra:    2825737
simplify-got after ra:     2826853

So this optimization should be done before RA.

Comment 7 Carrot 2010-04-11 12:12:51 UTC

(In reply to comment #6)
> Some experiment results:
> 
> Compile CSiBE with options -Os -fpic -mthumb -fno-short-enums
> 
> without this optimization: 2830665
> simplify-got before ra:    2825737
> simplify-got after ra:     2826853

These numbers are sum of each line in file result-size.csv.

Comment 8 Stephen Clarke 2010-10-14 16:32:56 UTC

For arm instruction set, could you fold pc into the indexing
to save an instruction?

foo:
        ldr     r3, .L2        // C
.LPIC0:
        ldr     r3, [r3,pc]    // C
        @ sp needed for prologue
        ldr     r2, [r3]
        str     r0, [r3]
        mov     r0, r2
        bx      lr
.L3:
        .align  2
.L2:
        .word   ABS_ADDRESS_OF_GOT_ENTRY_FOR_i -(.LPIC0+4) // C

Comment 9 ramana.radhakrishnan@arm.com 2010-10-14 16:39:26 UTC

On Thu, 2010-10-14 at 16:33 +0000, stephen.clarke at st dot com wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43129
> 
> Stephen Clarke <stephen.clarke at st dot com> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |stephen.clarke at st dot
>                    |                            |com
> 
> --- Comment #8 from Stephen Clarke <stephen.clarke at st dot com> 2010-10-14 16:32:56 UTC ---
> For arm instruction set, could you fold pc into the indexing
> to save an instruction?
> 
> foo:
>         ldr     r3, .L2        // C
> .LPIC0:
>         ldr     r3, [r3,pc]    // C


You'll find that the ARM-ARM thinks that PC in any of the 3 locations in this instruction form is *unpredictable*. Thus this form of the instruction should not be used. 

cheers
Ramana

Comment 10 Stephen Clarke 2010-10-14 17:01:47 UTC

OK, I can see that the ARM ARM states for Rm == PC then its unpredictable.

But for Rn == PC, I can only see that its unpredictable if W is 1
or P is 0  (I am looking at encoding A1).  So I am struggling to
understand that:

   ldr     r3, [pc,r3]

is unpredictable.  Forgive me if I made a mistake, my knowledge of
ARM is a little rusty.

Comment 11 Jackie Rosen 2014-02-16 13:14:41 UTC Comment hidden (spam)

*** Bug 260998 has been marked as a duplicate of this bug. ***
Seen from the domain http://volichat.com
Page where seen: http://volichat.com/adult-chat-rooms
Marked for reference. Resolved as fixed @bugzilla.