67213 – When compiling for size with -Os loops can get bigger after peeling

Bug 67213 - When compiling for size with -Os loops can get bigger after peeling

Summary: When compiling for size with -Os loops can get bigger after peeling

Status:	UNCONFIRMED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	5.2.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2015-08-14 09:21 UTC by Fredrik Hederstierna
Modified:	2017-05-06 22:05 UTC (History)
CC List:	0 users

See Also:
Host:
Target:	arm-none-eabi, arm-none-linux-gnueabi
Build:
Known to work:
Known to fail:	4.8.5, 4.9.3, 5.2.0, 5.3.0, 6.3.0, 7.1.0
Last reconfirmed:

Attachments
Example files (1.63 KB, application/zip) 2015-08-14 09:28 UTC, Fredrik Hederstierna	Details
Dump from tree-ssa-loop-ivcanon.c (1.76 KB, text/plain) 2015-08-14 11:06 UTC, Fredrik Hederstierna	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Fredrik Hederstierna 2015-08-14 09:21:03 UTC

When compiling thumb1 code for size with -Os some loops can be larger due to complete peeling.

Example code:

extern char data[10];

void test_iter_2(void)
{
  int i;
  for (i = 0; i < 2; i++) {
    data[i] = i;
  }
}

void test_iter_6(void)
{
  int i;
  for (i = 0; i < 6; i++) {
    data[i] = i;
  }
}

void test_iter_7(void)
{
  int i;
  for (i = 0; i < 7; i++) {
    data[i] = i;
  }
}

It will compile to

00000000 <test_iter_2>:
   0:   e3a02000        mov     r2, #0
   4:   e59f300c        ldr     r3, [pc, #12]   ; 18 <test_iter_2+0x18>
   8:   e5c32000        strb    r2, [r3]
   c:   e3a02001        mov     r2, #1
  10:   e5c32001        strb    r2, [r3, #1]
  14:   e12fff1e        bx      lr
  18:   00000000        .word   0x00000000

0000001c <test_iter_6>:
  1c:   e3a02000        mov     r2, #0
  20:   e59f302c        ldr     r3, [pc, #44]   ; 54 <test_iter_6+0x38>
  24:   e5c32000        strb    r2, [r3]
  28:   e3a02001        mov     r2, #1
  2c:   e5c32001        strb    r2, [r3, #1]
  30:   e3a02002        mov     r2, #2
  34:   e5c32002        strb    r2, [r3, #2]
  38:   e3a02003        mov     r2, #3
  3c:   e5c32003        strb    r2, [r3, #3]
  40:   e3a02004        mov     r2, #4
  44:   e5c32004        strb    r2, [r3, #4]
  48:   e3a02005        mov     r2, #5
  4c:   e5c32005        strb    r2, [r3, #5]
  50:   e12fff1e        bx      lr
  54:   00000000        .word   0x00000000

00000058 <test_iter_7>:
  58:   e3a03000        mov     r3, #0
  5c:   e59f2010        ldr     r2, [pc, #16]   ; 74 <test_iter_7+0x1c>
  60:   e7c33002        strb    r3, [r3, r2]
  64:   e2833001        add     r3, r3, #1
  68:   e3530007        cmp     r3, #7
  6c:   1afffffb        bne     60 <test_iter_7+0x8>
  70:   e12fff1e        bx      lr
  74:   00000000        .word   0x00000000

The unrolling of iter_6 seems to be controlled by default:

 --param max-completely-peel-times=5

if changing to

 --param max-completely-peel-times=0

code for iter_6 gets ok, but then iter_2 get larger.

00000000 <test_iter_2>:
   0:   e3a03000        mov     r3, #0
   4:   e59f2010        ldr     r2, [pc, #16]   ; 1c <test_iter_2+0x1c>
   8:   e7c33002        strb    r3, [r3, r2]
   c:   e2833001        add     r3, r3, #1
  10:   e3530002        cmp     r3, #2
  14:   1afffffb        bne     8 <test_iter_2+0x8>
  18:   e12fff1e        bx      lr
  1c:   00000000        .word   0x00000000

00000020 <test_iter_6>:
  20:   e3a03000        mov     r3, #0
  24:   e59f2010        ldr     r2, [pc, #16]   ; 3c <test_iter_6+0x1c>
  28:   e7c33002        strb    r3, [r3, r2]
  2c:   e2833001        add     r3, r3, #1
  30:   e3530006        cmp     r3, #6
  34:   1afffffb        bne     28 <test_iter_6+0x8>
  38:   e12fff1e        bx      lr
  3c:   00000000        .word   0x00000000

00000040 <test_iter_7>:
  40:   e3a03000        mov     r3, #0
  44:   e59f2010        ldr     r2, [pc, #16]   ; 5c <test_iter_7+0x1c>
  48:   e7c33002        strb    r3, [r3, r2]
  4c:   e2833001        add     r3, r3, #1
  50:   e3530007        cmp     r3, #7
  54:   1afffffb        bne     48 <test_iter_7+0x8>
  58:   e12fff1e        bx      lr
  5c:   00000000        .word   0x00000000

I guess its a trade off between number allowed unrolls and expected code size growth/decrease. Though it could maybe be detected that code size growth in this case.

Attach toolchain build script and code.

Comment 1 Fredrik Hederstierna 2015-08-14 09:28:47 UTC

Created attachment 36185 [details]
Example files

Comment 2 Richard Biener 2015-08-14 10:00:59 UTC

Btw, you can look at -fdump-tree-cunroll-details dump for why GCC thinks the loop will get smaller in all cases.

Comment 3 Fredrik Hederstierna 2015-08-14 11:06:39 UTC

Created attachment 36186 [details]
Dump from tree-ssa-loop-ivcanon.c

In the function iter_6 it seems like it will keep cost 5 when unrolling.
Maybe the weights and costs estimations could be more pessimistic when optimizing for size? I think functions tree_estimate_loop_size() and estimated_unrolled_size() uses a rough number guess of 1/3, perhaps it could be more pessimistic eg. for -Os?

Comment 4 Fredrik Hederstierna 2016-03-21 15:33:38 UTC

I've investigated this issue some further, and I believe the problem might be that we allow too many iterations when doing complete peeling of loops on ARM.

The heuristics in "tree-ssa-loop-ivcanon.c" for estimating unrolled cost/size in "estimated_unrolled_size()" is quite rough, just assuming it will be reduced in further passes to 2/3? This is not always true and can lead to larger code size I think after a complete peeling of loops (as in the example in this issue).

It seems very difficult to estimate the final size of complete peeling, also across all architectures. I've experimented with 3/4 if optimizing for size, but it became worse.

One solution that works for me is to set a lower limit for the number of times the unpeeling may use:

I did this patch and it worked.
(Same thing is done in "spu.c" for SPU architecture when they want small code size.)

In function "arm_option_override (void)":

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index c868490..2ba8244 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
+
+  /* Small loops might be completely unpeeled even at -Os.
+     Try to keep code small.  */
+  if (optimize_function_for_size_p (cfun)
+      && !flag_unroll_loops && !flag_peel_loops)
+    maybe_set_param_value (PARAM_MAX_COMPLETELY_PEEL_TIMES, 4,
+                          global_options.x_param_values,
+                          global_options_set.x_param_values);


I simply override max-completely-peel-times to be 4 instead of default 16, and this seems to work well.

I tested it with CSiBE benchmark on arm/thumb1/thumb2 and I got shorter code on all tests, no negative results on any function.

What do you think, is it a okey solution to solve this issue, even though the long-term best solution would be to be able to estimate cost/size better of unrolling, but this seems like a much more difficult problem to solve.

/Fredrik

Comment 5 Fredrik Hederstierna 2017-05-06 21:51:58 UTC

Still same in GCC-7.1.0.

I analyzed using -fdump-tree-cunroll-details

void test_iter_6(void)
{
  int i;
  for (i = 0; i < 6; i++) {
    data[i] = i;
  }
}

The function was generated "test_iter_6":

0000001c <test_iter_6>:
  1c:   e59f3030        ldr     r3, [pc, #48]   ; 54 <test_iter_6+0x38>
  20:   e3a02000        mov     r2, #0
  24:   e5c32000        strb    r2, [r3]
  28:   e3a02001        mov     r2, #1
  2c:   e5c32001        strb    r2, [r3, #1]
  30:   e3a02002        mov     r2, #2
  34:   e5c32002        strb    r2, [r3, #2]
  38:   e3a02003        mov     r2, #3
  3c:   e5c32003        strb    r2, [r3, #3]
  40:   e3a02004        mov     r2, #4
  44:   e5c32004        strb    r2, [r3, #4]
  48:   e3a02005        mov     r2, #5
  4c:   e5c32005        strb    r2, [r3, #5]
  50:   e12fff1e        bx      lr
  54:   00000000        .word   0x00000000

With "--param max-completely-peel-times=4" (instead of default 5) it became

0000001c <test_iter_6>:
  1c:   e59f2014        ldr     r2, [pc, #20]   ; 38 <test_iter_6+0x1c>
  20:   e3a03000        mov     r3, #0
  24:   e7c33002        strb    r3, [r3, r2]
  28:   e2833001        add     r3, r3, #1
  2c:   e3530006        cmp     r3, #6
  30:   1afffffb        bne     24 <test_iter_6+0x8>
  34:   e12fff1e        bx      lr
  38:   00000000        .word   0x00000000

It seems like "try_unroll_loop_completely()" in file "tree-ssa-loop-ivcanon.c" think it could fold counting variable, but maybe its not possible since its used as both index and as RHS value?

;; Function test_iter_6 (test_iter_6, funcdef_no=1, decl_uid=4067, cgraph_uid=1)

Analyzing # of iterations of loop 1
  exit condition [5, + , 4294967295] != 0
  bounds on difference of bases: -5 ... -5
  result:
    # of iterations 5, bounded by 5
Analyzing # of iterations of loop 1
  exit condition [5, + , 4294967295] != 0
  bounds on difference of bases: -5 ... -5
  result:
    # of iterations 5, bounded by 5
Statement (exit)if (ivtmp_7 != 0)
 is executed at most 5 (bounded by 5) + 1 times in loop 1.
Induction variable (int) 0 + 1 * iteration does not wrap in statement data[i_9] = _4;
 in loop 1.
Statement data[i_9] = _4;
 is executed at most 9 (bounded by 9) + 1 times in loop 1.
Induction variable (int) 1 + 1 * iteration does not wrap in statement i_6 = i_9 + 1;
 in loop 1.
Statement i_6 = i_9 + 1;
 is executed at most 2147483646 (bounded by 2147483646) + 1 times in loop 1.
Loop 1 iterates 5 times.
Loop 1 iterates at most 5 times.
Estimating sizes for loop 1
 BB: 3, after_exit: 0
  size:   0 _4 = (char) i_9;
   Induction variable computation will be folded away.
  size:   1 data[i_9] = _4;
  size:   1 i_6 = i_9 + 1;
   Induction variable computation will be folded away.
  size:   1 ivtmp_7 = ivtmp_1 - 1;
   Induction variable computation will be folded away.
  size:   2 if (ivtmp_7 != 0)
   Exit condition will be eliminated in peeled copies.
 BB: 4, after_exit: 1
size: 5-4, last_iteration: 5-2
  Loop size: 5
  Estimated size after unrolling: 5

Though produced code with peeling become

test_iter_6 ()
{
  int i;
  char _4;
  unsigned int ivtmp_7;
  char _12;
  unsigned int ivtmp_15;
  char _19;
  unsigned int ivtmp_22;
  char _26;
  unsigned int ivtmp_29;
  char _33;
  unsigned int ivtmp_36;
  char _40;
  unsigned int ivtmp_43;

  <bb 2>:
  _12 = 0;
  data[0] = _12;
  i_14 = 1;
  ivtmp_15 = 5;
  _19 = (char) i_14;
  data[i_14] = _19;
  i_21 = i_14 + 1;
  ivtmp_22 = ivtmp_15 + 4294967295;
  _26 = (char) i_21;
  data[i_21] = _26;
  i_28 = i_21 + 1;
  ivtmp_29 = ivtmp_22 + 4294967295;
  _33 = (char) i_28;
  data[i_28] = _33;
  i_35 = i_28 + 1;
  ivtmp_36 = ivtmp_29 + 4294967295;
  _40 = (char) i_35;
  data[i_35] = _40;
  i_42 = i_35 + 1;
  ivtmp_43 = ivtmp_36 + 4294967295;
  _4 = (char) i_42;
  data[i_42] = _4;
  i_6 = i_42 + 1;
  ivtmp_7 = ivtmp_43 + 4294967295;
  return;

}

instead of original and shorter

test_iter_6 ()
{
  int i;
  unsigned int ivtmp_1;
  char _4;
  unsigned int ivtmp_7;

  <bb 2>:

  <bb 3>:
  # i_9 = PHI <i_6(4), 0(2)>
  # ivtmp_1 = PHI <ivtmp_7(4), 6(2)>
  _4 = (char) i_9;
  data[i_9] = _4;
  i_6 = i_9 + 1;
  ivtmp_7 = ivtmp_1 - 1;
  if (ivtmp_7 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  goto <bb 3>;

  <bb 5>:
  return;

}

Could it be that somewhat since that index also is used as data that variable cannot be folded away like
  size:   1 i_6 = i_9 + 1;
   Induction variable computation will be folded away.

Comment 6 Fredrik Hederstierna 2017-05-06 22:05:44 UTC

Same thing for x86, not only ARM:


bash# gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609


bash# gcc -c test.c -Os --param max-completely-peel-times=5
bash# objdump -dath test.o


Disassembly of section .text:

000000000000000f <test_iter_6>:
   f:   c6 05 00 00 00 00 00    movb   $0x0,0x0(%rip)        # 16 <test_iter_6+0x7>
  16:   c6 05 00 00 00 00 01    movb   $0x1,0x0(%rip)        # 1d <test_iter_6+0xe>
  1d:   c6 05 00 00 00 00 02    movb   $0x2,0x0(%rip)        # 24 <test_iter_6+0x15>
  24:   c6 05 00 00 00 00 03    movb   $0x3,0x0(%rip)        # 2b <test_iter_6+0x1c>
  2b:   c6 05 00 00 00 00 04    movb   $0x4,0x0(%rip)        # 32 <test_iter_6+0x23>
  32:   c6 05 00 00 00 00 05    movb   $0x5,0x0(%rip)        # 39 <test_iter_6+0x2a>
  39:   c3                      retq   


bash# gcc -c test.c -Os --param max-completely-peel-times=4
bash# objdump -dath test.o


Disassembly of section .text:

000000000000000f <test_iter_6>:
   f:   31 c0                   xor    %eax,%eax
  11:   88 80 00 00 00 00       mov    %al,0x0(%rax)
  17:   48 ff c0                inc    %rax
  1a:   48 83 f8 06             cmp    $0x6,%rax
  1e:   75 f1                   jne    11 <test_iter_6+0x2>
  20:   c3                      retq