This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.



[RFC] Re: Immediate offset in array index


(Adding gcc@gcc.gnu.org, which is more relevant.)

After some investigation, I can answer my own question now.

On 11/11/13 16:39, Yufeng Zhang wrote:
Hi,

I'm trying to understand the following code-gen difference in the
following code:

#ifdef BEFORE
typedef int arr_2[4][5];
#else
typedef int arr_2[4][4];
#endif

void foo (arr_2 a2, int i, int j)
{
    a2[i + 2][j] = 2;
}

The code-gen using a mainline gcc -O2 -DBEFORE is:

          movslq  %esi, %rsi
          movslq  %edx, %rdx
          leaq    (%rsi,%rsi,4), %rax
          leaq    40(%rdi,%rax,4), %rax
          movl    $2, (%rax,%rdx,4)
          ret

and with plain -O2 (i.e. the [4][4] typedef) is:

          movslq  %esi, %rsi
          movslq  %edx, %rdx
          addq    $2, %rsi
          salq    $4, %rsi
          addq    %rdi, %rsi
          movl    $2, (%rsi,%rdx,4)
          ret

When the typedef is arr_2[4][5], the compiler distributes (i + 2) * 20
to i * 20 + 40, while not doing the distribution when the typedef is
arr_2[4][4].

pointer_int_sum () in c-family/c-common.c applies the distributive law to the non-pointer addend if the addend is a MULT_EXPR of the form ((a +/- c1) * c2), where both c1 and c2 are constant immediates. It claims that "this helps produce common subexpressions".

In the example above, a2[i + 2] is transformed to
  a2 + ((i * 16) + 32)
when arr_2 is typedefed as [4][4], and transformed to
  a2 + ((i * 20) + 40)
in the case of [4][5].

However, fold_binary_loc () in fold-const.c later reverts the effort with assistance from fold_plusminus_mult_expr (), which tries to undistribute the plus (or minus). For fold_plusminus_mult_expr () to carry out the undistribution, c2 has to be a power of two, which is why we see the code-gen difference (16 is a power of two, but 20 is not).
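
To illustrate the condition, here is a minimal sketch in plain C (not the actual fold-const.c logic; the helper name and the divisibility check are my own assumptions):

#include <stdbool.h>
#include <stdio.h>

/* Sketch only: undistribute a*c2 + c3 into (a + c3/c2) * c2 when c2 is
   a power of two (and, I assume, c3 is a multiple of c2).  */
static bool
can_undistribute (long c2, long c3)
{
  bool c2_pow2 = c2 > 0 && (c2 & (c2 - 1)) == 0;
  return c2_pow2 && c3 % c2 == 0;
}

int
main (void)
{
  /* i*16 + 32 -> (i + 2) * 16, but i*20 + 40 stays distributed.  */
  printf ("%d %d\n", can_undistribute (16, 32), can_undistribute (20, 40));
  return 0;
}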

The -fdump-tree-original dumps already differ, so I guess it is the
parser that does something interesting.

Can anyone enlighten me as to which part of the parser I should look
into further to get a better understanding of the code-gen difference?

Regards,
Yufeng

p.s.

diff of the -fdump-tree-original

-  (*(a2 + ((sizetype) ((long unsigned int) i * 20) + 40)))[j] = 2;
+  (*(a2 + ((sizetype) i + 2) * 16))[j] = 2;

A similar code-gen difference can be observed in the following example code:

void
foo (float * a, float * b, int n, int j)
{
  *a = b[j+50] + b[j-50];
}

You'll see the following -fdump-tree-original on x86_64 -O2:

*a = *(b + ((sizetype) j + 50) * 4) + *(b + ((sizetype) ((long unsigned int) j * 4) + 18446744073709551416));

While b[j+50] is turned into (b + (j + 50) * 4), for b[j-50] the multiplication has been distributed.

The difference is again caused by fold_plusminus_mult_expr (), which carries out the undistribution when host_integerp (200, 0) passes, but quits early when host_integerp (18446744073709551416, 0) fails.
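
The large constant is simply -200 (the -50 element offset times 4 bytes) printed as the unsigned 64-bit sizetype, i.e. 2^64 - 200. A quick self-check (my own snippet, assuming unsigned long long is 64 bits):

#include <stdio.h>

int
main (void)
{
  /* -200 wrapped to the unsigned 64-bit sizetype: 2^64 - 200.  */
  unsigned long long off = (unsigned long long) -200LL;
  printf ("%llu\n", off);  /* prints 18446744073709551416 */
  return 0;
}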

This looks like a classic example of different parts of the compiler disagreeing on which heuristic is better.

IMHO, array indexing expressions with immediate offset(s) should be distributed so that the immediates can be folded together (when there are multiple) and re-associated to the rightmost end of the addressing expression (to better utilize addressing modes with an immediate offset). For example, given the following code:

typedef struct { int x,y,a,b; } X;

int
foo (X p[][4], int x, int y)
{
  return p[x+1][y+1].a;
}

the code-gen on arm -mcpu=cortex-a15 -O2 -mthumb is

        add     r2, r2, #1
        add     r1, r1, #1
        mov     r2, r2, asl #4
        add     r2, r2, r1, asl #6
        add     r0, r0, r2
        ldr     r0, [r0, #8]

which can be optimized to:

        add     r1, r0, r1, asl #6
        add     r2, r1, r2, asl #4
        ldr     r0, [r2, #88]

if the distribution is applied. Similarly, the code-gen on x86_64 could be improved from the current, poorer sequence:

        movslq  %esi, %rsi
        addl    $1, %edx
        addq    $1, %rsi
        movslq  %edx, %rdx
        salq    $6, %rsi
        salq    $4, %rdx
        addq    %rdi, %rsi
        movl    8(%rsi,%rdx), %eax
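
For reference, the folded immediate falls out of straightforward arithmetic: on a typical target sizeof (X) is 16, a row of p[][4] is 64 bytes and the member a sits at offset 8, so (x+1)*64 + (y+1)*16 + 8 equals x*64 + y*16 + 88, which is exactly the #88 / 8(%rsi,%rdx) offset above. A small self-check (my own snippet, not generated code):

#include <assert.h>

typedef struct { int x, y, a, b; } X;

/* Assumes a typical target where sizeof (X) == 16, so a row of p[][4]
   is 64 bytes and the member a sits at byte offset 8.  */
int
check (X p[][4], int x, int y)
{
  /* Fully distributed form of &p[x+1][y+1].a.  */
  char *distributed = (char *) p + (long) x * 64 + (long) y * 16 + 88;
  assert ((char *) &p[x + 1][y + 1].a == distributed);
  return *(int *) distributed;
}

int
main (void)
{
  static X p[8][4];
  p[3][2].a = 42;
  return check (p, 2, 1) == 42 ? 0 : 1;
}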

I'm currently experimenting with a new tree-ssa pass that applies the distribution to any qualifying addend of a POINTER_PLUS expression. I deliberately avoid changing fold_binary_loc () or fold_plusminus_mult_expr (), as I think both folding functions are fairly mature and their folding can be beneficial in other cases.
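
To make the intended rewrite concrete, here is a source-level sketch of it (plain C rather than GIMPLE; the function names are mine, and overflow is ignored): an addend of the form (a + c1) * c2 gets its constant part pulled out so that it can later sink into an immediate-offset addressing mode.

/* Both functions compute the same address; the second is the
   distributed form the pass would produce.  */
int *
addr_before (int *p, long a)
{
  return p + (a + 2) * 16;          /* p + (a + c1) * c2      */
}

int *
addr_after (int *p, long a)
{
  return (p + 2 * 16) + a * 16;     /* (p + c1 * c2) + a * c2 */
}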

I'm quite keen to hear what you think about this issue. Any comments are welcome!

Thanks,
Yufeng

