This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.



[RFC] Re: Immediate offset in array index


(Adding gcc@gcc.gnu.org, which is more relevant.)

After some investigation, I can answer my own question now.

On 11/11/13 16:39, Yufeng Zhang wrote:
Hi,

I'm trying to understand the following code-gen difference in the
following code:

#ifdef BEFORE
typedef int arr_2[4][5];
#else
typedef int arr_2[4][4];
#endif

void foo (arr_2 a2, int i, int j)
{
    a2[i + 2][j] = 2;
}

The code-gen using a mainline gcc -O2 -DBEFORE is:

          movslq  %esi, %rsi
          movslq  %edx, %rdx
          leaq    (%rsi,%rsi,4), %rax
          leaq    40(%rdi,%rax,4), %rax
          movl    $2, (%rax,%rdx,4)
          ret

and with plain -O2 (i.e. the [4][4] typedef) is:

          movslq  %esi, %rsi
          movslq  %edx, %rdx
          addq    $2, %rsi
          salq    $4, %rsi
          addq    %rdi, %rsi
          movl    $2, (%rsi,%rdx,4)
          ret

When the typedef is arr_2[4][5], the compiler distributes (i + 2) * 20
to i * 20 + 40, while not doing the distribution when the typedef is
arr_2[4][4].

pointer_int_sum () in c-family/c-common.c applies the distributive law to the non-pointer addend if the addend is a MULT_EXPR of the form ((a +/- c1) * c2), where both c1 and c2 are constant immediates. It claims that "this helps produce common subexpressions".

In the example above, a2[i + 2] is transformed to
  a2 + ((i * 16) + 32)
when arr_2 is typedefed as [4][4], and transformed to
  a2 + ((i * 20) + 40)
in the case of [4][5].

However, fold_binary_loc () in fold-const.c later reverts the effort with assistance from fold_plusminus_mult_expr (), which tries to undistribute the plus (or minus). For fold_plusminus_mult_expr () to carry out the undistribution, c2 has to be a power of two, which is why we see the code-gen difference (16 is a power of two, but 20 is not).
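
To illustrate the condition, here is a minimal sketch in plain C (not the actual fold-const.c logic; the helper name and the divisibility check are my own assumptions):

#include <stdbool.h>
#include <stdio.h>

/* Sketch only: undistribute a*c2 + c3 into (a + c3/c2) * c2 when c2 is
   a power of two (and, I assume, c3 is a multiple of c2).  */
static bool
can_undistribute (long c2, long c3)
{
  bool c2_pow2 = c2 > 0 && (c2 & (c2 - 1)) == 0;
  return c2_pow2 && c3 % c2 == 0;
}

int
main (void)
{
  /* i*16 + 32 -> (i + 2) * 16, but i*20 + 40 stays distributed.  */
  printf ("%d %d\n", can_undistribute (16, 32), can_undistribute (20, 40));
  return 0;
}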

The -fdump-tree-original dumps already differ, so I guess it is the
parser that does something interesting.

Can anyone enlighten me as to which part of the parser I should look
into further to get a better understanding of the code-gen difference?

Regards,
Yufeng

p.s.

diff of the -fdump-tree-original

-  (*(a2 + ((sizetype) ((long unsigned int) i * 20) + 40)))[j] = 2;
+  (*(a2 + ((sizetype) i + 2) * 16))[j] = 2;

A similar code-gen difference can be observed in the following example code:

void
foo (float * a, float * b, int n, int j)
{
  *a = b[j+50] + b[j-50];
}

You'll see the following -fdump-tree-original on x86_64 -O2:

*a = *(b + ((sizetype) j + 50) * 4) + *(b + ((sizetype) ((long unsigned int) j * 4) + 18446744073709551416));

While b[j+50] is turned into (b + (j + 50) * 4), for b[j-50] the multiplication has been distributed.

The difference is again caused by fold_plusminus_mult_expr (), which carries out the undistribution when host_integerp (200, 0) passes, but quits early when host_integerp (18446744073709551416, 0) fails.
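
The large constant is simply -200 (the -50 element offset times 4 bytes) printed as the unsigned 64-bit sizetype, i.e. 2^64 - 200. A quick self-check (my own snippet, assuming unsigned long long is 64 bits):

#include <stdio.h>

int
main (void)
{
  /* -200 wrapped to the unsigned 64-bit sizetype: 2^64 - 200.  */
  unsigned long long off = (unsigned long long) -200LL;
  printf ("%llu\n", off);  /* prints 18446744073709551416 */
  return 0;
}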

This looks like a classic example of different parts of the compiler disagreeing on which heuristic is better.

IMHO, array indexing expressions with immediate offset(s) should be distributed so that the immediates can be folded together (when there are multiple) and re-associated to the rightmost end of the addressing expression (to better utilize addressing modes with an immediate offset). For example, given the following code:

typedef struct { int x,y,a,b; } X;

int
foo (X p[][4], int x, int y)
{
  return p[x+1][y+1].a;
}

the code-gen on arm -mcpu=cortex-a15 -O2 -mthumb is

        add     r2, r2, #1
        add     r1, r1, #1
        mov     r2, r2, asl #4
        add     r2, r2, r1, asl #6
        add     r0, r0, r2
        ldr     r0, [r0, #8]

which can be optimized to:

        add     r1, r0, r1, asl #6
        add     r2, r1, r2, asl #4
        ldr     r0, [r2, #88]

if the distribution is applied. Similarly, the code-gen on x86_64 could be improved from the current, poorer sequence:

        movslq  %esi, %rsi
        addl    $1, %edx
        addq    $1, %rsi
        movslq  %edx, %rdx
        salq    $6, %rsi
        salq    $4, %rdx
        addq    %rdi, %rsi
        movl    8(%rsi,%rdx), %eax
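
For reference, the folded immediate falls out of straightforward arithmetic: on a typical target sizeof (X) is 16, a row of p[][4] is 64 bytes and the member a sits at offset 8, so (x+1)*64 + (y+1)*16 + 8 equals x*64 + y*16 + 88, which is exactly the #88 / 8(%rsi,%rdx) offset above. A small self-check (my own snippet, not generated code):

#include <assert.h>

typedef struct { int x, y, a, b; } X;

/* Assumes a typical target where sizeof (X) == 16, so a row of p[][4]
   is 64 bytes and the member a sits at byte offset 8.  */
int
check (X p[][4], int x, int y)
{
  /* Fully distributed form of &p[x+1][y+1].a.  */
  char *distributed = (char *) p + (long) x * 64 + (long) y * 16 + 88;
  assert ((char *) &p[x + 1][y + 1].a == distributed);
  return *(int *) distributed;
}

int
main (void)
{
  static X p[8][4];
  p[3][2].a = 42;
  return check (p, 2, 1) == 42 ? 0 : 1;
}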

I'm currently experimenting with a new tree-ssa pass that applies the distribution to any qualifying addend of a POINTER_PLUS expression. I deliberately avoid changing fold_binary_loc () or fold_plusminus_mult_expr (), as I think both folding functions are fairly mature and their folding can be beneficial in other cases.
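
To make the intended rewrite concrete, here is a source-level sketch of it (plain C rather than GIMPLE; the function names are mine, and overflow is ignored): an addend of the form (a + c1) * c2 gets its constant part pulled out so that it can later sink into an immediate-offset addressing mode.

/* Both functions compute the same address; the second is the
   distributed form the pass would produce.  */
int *
addr_before (int *p, long a)
{
  return p + (a + 2) * 16;          /* p + (a + c1) * c2      */
}

int *
addr_after (int *p, long a)
{
  return (p + 2 * 16) + a * 16;     /* (p + c1 * c2) + a * c2 */
}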

I'm quite keen to hear what you think about this issue. Any comments are welcome!

Thanks,
Yufeng

