Bug 67213 - When compiling for size with -Os loops can get bigger after peeling
Summary: When compiling for size with -Os loops can get bigger after peeling
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 5.2.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2015-08-14 09:21 UTC by Fredrik Hederstierna
Modified: 2017-05-06 22:05 UTC (History)
0 users

See Also:
Host:
Target: arm-none-eabi, arm-none-linux-gnueabi
Build:
Known to work:
Known to fail: 4.8.5, 4.9.3, 5.2.0, 5.3.0, 6.3.0, 7.1.0
Last reconfirmed:


Attachments
Example files (1.63 KB, application/zip)
2015-08-14 09:28 UTC, Fredrik Hederstierna
Details
Dump from tree-ssa-loop-ivcanon.c (1.76 KB, text/plain)
2015-08-14 11:06 UTC, Fredrik Hederstierna
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Fredrik Hederstierna 2015-08-14 09:21:03 UTC
When compiling thumb1 code for size with -Os some loops can be larger due to complete peeling.

Example code:

extern char data[10];

void test_iter_2(void)
{
  int i;
  for (i = 0; i < 2; i++) {
    data[i] = i;
  }
}

void test_iter_6(void)
{
  int i;
  for (i = 0; i < 6; i++) {
    data[i] = i;
  }
}

void test_iter_7(void)
{
  int i;
  for (i = 0; i < 7; i++) {
    data[i] = i;
  }
}

It will compile to

00000000 <test_iter_2>:
   0:   e3a02000        mov     r2, #0
   4:   e59f300c        ldr     r3, [pc, #12]   ; 18 <test_iter_2+0x18>
   8:   e5c32000        strb    r2, [r3]
   c:   e3a02001        mov     r2, #1
  10:   e5c32001        strb    r2, [r3, #1]
  14:   e12fff1e        bx      lr
  18:   00000000        .word   0x00000000

0000001c <test_iter_6>:
  1c:   e3a02000        mov     r2, #0
  20:   e59f302c        ldr     r3, [pc, #44]   ; 54 <test_iter_6+0x38>
  24:   e5c32000        strb    r2, [r3]
  28:   e3a02001        mov     r2, #1
  2c:   e5c32001        strb    r2, [r3, #1]
  30:   e3a02002        mov     r2, #2
  34:   e5c32002        strb    r2, [r3, #2]
  38:   e3a02003        mov     r2, #3
  3c:   e5c32003        strb    r2, [r3, #3]
  40:   e3a02004        mov     r2, #4
  44:   e5c32004        strb    r2, [r3, #4]
  48:   e3a02005        mov     r2, #5
  4c:   e5c32005        strb    r2, [r3, #5]
  50:   e12fff1e        bx      lr
  54:   00000000        .word   0x00000000

00000058 <test_iter_7>:
  58:   e3a03000        mov     r3, #0
  5c:   e59f2010        ldr     r2, [pc, #16]   ; 74 <test_iter_7+0x1c>
  60:   e7c33002        strb    r3, [r3, r2]
  64:   e2833001        add     r3, r3, #1
  68:   e3530007        cmp     r3, #7
  6c:   1afffffb        bne     60 <test_iter_7+0x8>
  70:   e12fff1e        bx      lr
  74:   00000000        .word   0x00000000

The unrolling of iter_6 seems to be controlled by default:

 --param max-completely-peel-times=5

if changing to

 --param max-completely-peel-times=0

code for iter_6 gets ok, but then iter_2 get larger.

00000000 <test_iter_2>:
   0:   e3a03000        mov     r3, #0
   4:   e59f2010        ldr     r2, [pc, #16]   ; 1c <test_iter_2+0x1c>
   8:   e7c33002        strb    r3, [r3, r2]
   c:   e2833001        add     r3, r3, #1
  10:   e3530002        cmp     r3, #2
  14:   1afffffb        bne     8 <test_iter_2+0x8>
  18:   e12fff1e        bx      lr
  1c:   00000000        .word   0x00000000

00000020 <test_iter_6>:
  20:   e3a03000        mov     r3, #0
  24:   e59f2010        ldr     r2, [pc, #16]   ; 3c <test_iter_6+0x1c>
  28:   e7c33002        strb    r3, [r3, r2]
  2c:   e2833001        add     r3, r3, #1
  30:   e3530006        cmp     r3, #6
  34:   1afffffb        bne     28 <test_iter_6+0x8>
  38:   e12fff1e        bx      lr
  3c:   00000000        .word   0x00000000

00000040 <test_iter_7>:
  40:   e3a03000        mov     r3, #0
  44:   e59f2010        ldr     r2, [pc, #16]   ; 5c <test_iter_7+0x1c>
  48:   e7c33002        strb    r3, [r3, r2]
  4c:   e2833001        add     r3, r3, #1
  50:   e3530007        cmp     r3, #7
  54:   1afffffb        bne     48 <test_iter_7+0x8>
  58:   e12fff1e        bx      lr
  5c:   00000000        .word   0x00000000

I guess its a trade off between number allowed unrolls and expected code size growth/decrease. Though it could maybe be detected that code size growth in this case.

Attach toolchain build script and code.
Comment 1 Fredrik Hederstierna 2015-08-14 09:28:47 UTC
Created attachment 36185 [details]
Example files
Comment 2 Richard Biener 2015-08-14 10:00:59 UTC
Btw, you can look at -fdump-tree-cunroll-details dump for why GCC thinks the loop will get smaller in all cases.
Comment 3 Fredrik Hederstierna 2015-08-14 11:06:39 UTC
Created attachment 36186 [details]
Dump from tree-ssa-loop-ivcanon.c

In the function iter_6 it seems like it will keep cost 5 when unrolling.
Maybe the weights and costs estimations could be more pessimistic when optimizing for size? I think functions tree_estimate_loop_size() and estimated_unrolled_size() uses a rough number guess of 1/3, perhaps it could be more pessimistic eg. for -Os?
Comment 4 Fredrik Hederstierna 2016-03-21 15:33:38 UTC
I've investigated this issue some further, and I believe the problem might be that we allow too many iterations when doing complete peeling of loops on ARM.

The heuristics in "tree-ssa-loop-ivcanon.c" for estimating unrolled cost/size in "estimated_unrolled_size()" is quite rough, just assuming it will be reduced in further passes to 2/3? This is not always true and can lead to larger code size I think after a complete peeling of loops (as in the example in this issue).

It seems very difficult to estimate the final size of complete peeling, also across all architectures. I've experimented with 3/4 if optimizing for size, but it became worse.

One solution that works for me is to set a lower limit for the number of times the unpeeling may use:

I did this patch and it worked.
(Same thing is done in "spu.c" for SPU architecture when they want small code size.)

In function "arm_option_override (void)":

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index c868490..2ba8244 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
+
+  /* Small loops might be completely unpeeled even at -Os.
+     Try to keep code small.  */
+  if (optimize_function_for_size_p (cfun)
+      && !flag_unroll_loops && !flag_peel_loops)
+    maybe_set_param_value (PARAM_MAX_COMPLETELY_PEEL_TIMES, 4,
+                          global_options.x_param_values,
+                          global_options_set.x_param_values);


I simply override max-completely-peel-times to be 4 instead of default 16, and this seems to work well.

I tested it with CSiBE benchmark on arm/thumb1/thumb2 and I got shorter code on all tests, no negative results on any function.

What do you think, is it a okey solution to solve this issue, even though the long-term best solution would be to be able to estimate cost/size better of unrolling, but this seems like a much more difficult problem to solve.

/Fredrik
Comment 5 Fredrik Hederstierna 2017-05-06 21:51:58 UTC
Still same in GCC-7.1.0.

I analyzed using -fdump-tree-cunroll-details

void test_iter_6(void)
{
  int i;
  for (i = 0; i < 6; i++) {
    data[i] = i;
  }
}

The function was generated "test_iter_6":

0000001c <test_iter_6>:
  1c:   e59f3030        ldr     r3, [pc, #48]   ; 54 <test_iter_6+0x38>
  20:   e3a02000        mov     r2, #0
  24:   e5c32000        strb    r2, [r3]
  28:   e3a02001        mov     r2, #1
  2c:   e5c32001        strb    r2, [r3, #1]
  30:   e3a02002        mov     r2, #2
  34:   e5c32002        strb    r2, [r3, #2]
  38:   e3a02003        mov     r2, #3
  3c:   e5c32003        strb    r2, [r3, #3]
  40:   e3a02004        mov     r2, #4
  44:   e5c32004        strb    r2, [r3, #4]
  48:   e3a02005        mov     r2, #5
  4c:   e5c32005        strb    r2, [r3, #5]
  50:   e12fff1e        bx      lr
  54:   00000000        .word   0x00000000

With "--param max-completely-peel-times=4" (instead of default 5) it became

0000001c <test_iter_6>:
  1c:   e59f2014        ldr     r2, [pc, #20]   ; 38 <test_iter_6+0x1c>
  20:   e3a03000        mov     r3, #0
  24:   e7c33002        strb    r3, [r3, r2]
  28:   e2833001        add     r3, r3, #1
  2c:   e3530006        cmp     r3, #6
  30:   1afffffb        bne     24 <test_iter_6+0x8>
  34:   e12fff1e        bx      lr
  38:   00000000        .word   0x00000000

It seems like "try_unroll_loop_completely()" in file "tree-ssa-loop-ivcanon.c" think it could fold counting variable, but maybe its not possible since its used as both index and as RHS value?

;; Function test_iter_6 (test_iter_6, funcdef_no=1, decl_uid=4067, cgraph_uid=1)

Analyzing # of iterations of loop 1
  exit condition [5, + , 4294967295] != 0
  bounds on difference of bases: -5 ... -5
  result:
    # of iterations 5, bounded by 5
Analyzing # of iterations of loop 1
  exit condition [5, + , 4294967295] != 0
  bounds on difference of bases: -5 ... -5
  result:
    # of iterations 5, bounded by 5
Statement (exit)if (ivtmp_7 != 0)
 is executed at most 5 (bounded by 5) + 1 times in loop 1.
Induction variable (int) 0 + 1 * iteration does not wrap in statement data[i_9] = _4;
 in loop 1.
Statement data[i_9] = _4;
 is executed at most 9 (bounded by 9) + 1 times in loop 1.
Induction variable (int) 1 + 1 * iteration does not wrap in statement i_6 = i_9 + 1;
 in loop 1.
Statement i_6 = i_9 + 1;
 is executed at most 2147483646 (bounded by 2147483646) + 1 times in loop 1.
Loop 1 iterates 5 times.
Loop 1 iterates at most 5 times.
Estimating sizes for loop 1
 BB: 3, after_exit: 0
  size:   0 _4 = (char) i_9;
   Induction variable computation will be folded away.
  size:   1 data[i_9] = _4;
  size:   1 i_6 = i_9 + 1;
   Induction variable computation will be folded away.
  size:   1 ivtmp_7 = ivtmp_1 - 1;
   Induction variable computation will be folded away.
  size:   2 if (ivtmp_7 != 0)
   Exit condition will be eliminated in peeled copies.
 BB: 4, after_exit: 1
size: 5-4, last_iteration: 5-2
  Loop size: 5
  Estimated size after unrolling: 5

Though produced code with peeling become

test_iter_6 ()
{
  int i;
  char _4;
  unsigned int ivtmp_7;
  char _12;
  unsigned int ivtmp_15;
  char _19;
  unsigned int ivtmp_22;
  char _26;
  unsigned int ivtmp_29;
  char _33;
  unsigned int ivtmp_36;
  char _40;
  unsigned int ivtmp_43;

  <bb 2>:
  _12 = 0;
  data[0] = _12;
  i_14 = 1;
  ivtmp_15 = 5;
  _19 = (char) i_14;
  data[i_14] = _19;
  i_21 = i_14 + 1;
  ivtmp_22 = ivtmp_15 + 4294967295;
  _26 = (char) i_21;
  data[i_21] = _26;
  i_28 = i_21 + 1;
  ivtmp_29 = ivtmp_22 + 4294967295;
  _33 = (char) i_28;
  data[i_28] = _33;
  i_35 = i_28 + 1;
  ivtmp_36 = ivtmp_29 + 4294967295;
  _40 = (char) i_35;
  data[i_35] = _40;
  i_42 = i_35 + 1;
  ivtmp_43 = ivtmp_36 + 4294967295;
  _4 = (char) i_42;
  data[i_42] = _4;
  i_6 = i_42 + 1;
  ivtmp_7 = ivtmp_43 + 4294967295;
  return;

}

instead of original and shorter

test_iter_6 ()
{
  int i;
  unsigned int ivtmp_1;
  char _4;
  unsigned int ivtmp_7;

  <bb 2>:

  <bb 3>:
  # i_9 = PHI <i_6(4), 0(2)>
  # ivtmp_1 = PHI <ivtmp_7(4), 6(2)>
  _4 = (char) i_9;
  data[i_9] = _4;
  i_6 = i_9 + 1;
  ivtmp_7 = ivtmp_1 - 1;
  if (ivtmp_7 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  goto <bb 3>;

  <bb 5>:
  return;

}

Could it be that somewhat since that index also is used as data that variable cannot be folded away like
  size:   1 i_6 = i_9 + 1;
   Induction variable computation will be folded away.
Comment 6 Fredrik Hederstierna 2017-05-06 22:05:44 UTC
Same thing for x86, not only ARM:


bash# gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609


bash# gcc -c test.c -Os --param max-completely-peel-times=5
bash# objdump -dath test.o


Disassembly of section .text:

000000000000000f <test_iter_6>:
   f:   c6 05 00 00 00 00 00    movb   $0x0,0x0(%rip)        # 16 <test_iter_6+0x7>
  16:   c6 05 00 00 00 00 01    movb   $0x1,0x0(%rip)        # 1d <test_iter_6+0xe>
  1d:   c6 05 00 00 00 00 02    movb   $0x2,0x0(%rip)        # 24 <test_iter_6+0x15>
  24:   c6 05 00 00 00 00 03    movb   $0x3,0x0(%rip)        # 2b <test_iter_6+0x1c>
  2b:   c6 05 00 00 00 00 04    movb   $0x4,0x0(%rip)        # 32 <test_iter_6+0x23>
  32:   c6 05 00 00 00 00 05    movb   $0x5,0x0(%rip)        # 39 <test_iter_6+0x2a>
  39:   c3                      retq   


bash# gcc -c test.c -Os --param max-completely-peel-times=4
bash# objdump -dath test.o


Disassembly of section .text:

000000000000000f <test_iter_6>:
   f:   31 c0                   xor    %eax,%eax
  11:   88 80 00 00 00 00       mov    %al,0x0(%rax)
  17:   48 ff c0                inc    %rax
  1a:   48 83 f8 06             cmp    $0x6,%rax
  1e:   75 f1                   jne    11 <test_iter_6+0x2>
  20:   c3                      retq