When compiling thumb1 code for size with -Os some loops can be larger due to complete peeling. Example code: extern char data[10]; void test_iter_2(void) { int i; for (i = 0; i < 2; i++) { data[i] = i; } } void test_iter_6(void) { int i; for (i = 0; i < 6; i++) { data[i] = i; } } void test_iter_7(void) { int i; for (i = 0; i < 7; i++) { data[i] = i; } } It will compile to 00000000 <test_iter_2>: 0: e3a02000 mov r2, #0 4: e59f300c ldr r3, [pc, #12] ; 18 <test_iter_2+0x18> 8: e5c32000 strb r2, [r3] c: e3a02001 mov r2, #1 10: e5c32001 strb r2, [r3, #1] 14: e12fff1e bx lr 18: 00000000 .word 0x00000000 0000001c <test_iter_6>: 1c: e3a02000 mov r2, #0 20: e59f302c ldr r3, [pc, #44] ; 54 <test_iter_6+0x38> 24: e5c32000 strb r2, [r3] 28: e3a02001 mov r2, #1 2c: e5c32001 strb r2, [r3, #1] 30: e3a02002 mov r2, #2 34: e5c32002 strb r2, [r3, #2] 38: e3a02003 mov r2, #3 3c: e5c32003 strb r2, [r3, #3] 40: e3a02004 mov r2, #4 44: e5c32004 strb r2, [r3, #4] 48: e3a02005 mov r2, #5 4c: e5c32005 strb r2, [r3, #5] 50: e12fff1e bx lr 54: 00000000 .word 0x00000000 00000058 <test_iter_7>: 58: e3a03000 mov r3, #0 5c: e59f2010 ldr r2, [pc, #16] ; 74 <test_iter_7+0x1c> 60: e7c33002 strb r3, [r3, r2] 64: e2833001 add r3, r3, #1 68: e3530007 cmp r3, #7 6c: 1afffffb bne 60 <test_iter_7+0x8> 70: e12fff1e bx lr 74: 00000000 .word 0x00000000 The unrolling of iter_6 seems to be controlled by default: --param max-completely-peel-times=5 if changing to --param max-completely-peel-times=0 code for iter_6 gets ok, but then iter_2 get larger. 00000000 <test_iter_2>: 0: e3a03000 mov r3, #0 4: e59f2010 ldr r2, [pc, #16] ; 1c <test_iter_2+0x1c> 8: e7c33002 strb r3, [r3, r2] c: e2833001 add r3, r3, #1 10: e3530002 cmp r3, #2 14: 1afffffb bne 8 <test_iter_2+0x8> 18: e12fff1e bx lr 1c: 00000000 .word 0x00000000 00000020 <test_iter_6>: 20: e3a03000 mov r3, #0 24: e59f2010 ldr r2, [pc, #16] ; 3c <test_iter_6+0x1c> 28: e7c33002 strb r3, [r3, r2] 2c: e2833001 add r3, r3, #1 30: e3530006 cmp r3, #6 34: 1afffffb bne 28 <test_iter_6+0x8> 38: e12fff1e bx lr 3c: 00000000 .word 0x00000000 00000040 <test_iter_7>: 40: e3a03000 mov r3, #0 44: e59f2010 ldr r2, [pc, #16] ; 5c <test_iter_7+0x1c> 48: e7c33002 strb r3, [r3, r2] 4c: e2833001 add r3, r3, #1 50: e3530007 cmp r3, #7 54: 1afffffb bne 48 <test_iter_7+0x8> 58: e12fff1e bx lr 5c: 00000000 .word 0x00000000 I guess its a trade off between number allowed unrolls and expected code size growth/decrease. Though it could maybe be detected that code size growth in this case. Attach toolchain build script and code.
Created attachment 36185 [details] Example files
Btw, you can look at -fdump-tree-cunroll-details dump for why GCC thinks the loop will get smaller in all cases.
Created attachment 36186 [details] Dump from tree-ssa-loop-ivcanon.c In the function iter_6 it seems like it will keep cost 5 when unrolling. Maybe the weights and costs estimations could be more pessimistic when optimizing for size? I think functions tree_estimate_loop_size() and estimated_unrolled_size() uses a rough number guess of 1/3, perhaps it could be more pessimistic eg. for -Os?
I've investigated this issue some further, and I believe the problem might be that we allow too many iterations when doing complete peeling of loops on ARM. The heuristics in "tree-ssa-loop-ivcanon.c" for estimating unrolled cost/size in "estimated_unrolled_size()" is quite rough, just assuming it will be reduced in further passes to 2/3? This is not always true and can lead to larger code size I think after a complete peeling of loops (as in the example in this issue). It seems very difficult to estimate the final size of complete peeling, also across all architectures. I've experimented with 3/4 if optimizing for size, but it became worse. One solution that works for me is to set a lower limit for the number of times the unpeeling may use: I did this patch and it worked. (Same thing is done in "spu.c" for SPU architecture when they want small code size.) In function "arm_option_override (void)": diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index c868490..2ba8244 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c + + /* Small loops might be completely unpeeled even at -Os. + Try to keep code small. */ + if (optimize_function_for_size_p (cfun) + && !flag_unroll_loops && !flag_peel_loops) + maybe_set_param_value (PARAM_MAX_COMPLETELY_PEEL_TIMES, 4, + global_options.x_param_values, + global_options_set.x_param_values); I simply override max-completely-peel-times to be 4 instead of default 16, and this seems to work well. I tested it with CSiBE benchmark on arm/thumb1/thumb2 and I got shorter code on all tests, no negative results on any function. What do you think, is it a okey solution to solve this issue, even though the long-term best solution would be to be able to estimate cost/size better of unrolling, but this seems like a much more difficult problem to solve. /Fredrik
Still same in GCC-7.1.0. I analyzed using -fdump-tree-cunroll-details void test_iter_6(void) { int i; for (i = 0; i < 6; i++) { data[i] = i; } } The function was generated "test_iter_6": 0000001c <test_iter_6>: 1c: e59f3030 ldr r3, [pc, #48] ; 54 <test_iter_6+0x38> 20: e3a02000 mov r2, #0 24: e5c32000 strb r2, [r3] 28: e3a02001 mov r2, #1 2c: e5c32001 strb r2, [r3, #1] 30: e3a02002 mov r2, #2 34: e5c32002 strb r2, [r3, #2] 38: e3a02003 mov r2, #3 3c: e5c32003 strb r2, [r3, #3] 40: e3a02004 mov r2, #4 44: e5c32004 strb r2, [r3, #4] 48: e3a02005 mov r2, #5 4c: e5c32005 strb r2, [r3, #5] 50: e12fff1e bx lr 54: 00000000 .word 0x00000000 With "--param max-completely-peel-times=4" (instead of default 5) it became 0000001c <test_iter_6>: 1c: e59f2014 ldr r2, [pc, #20] ; 38 <test_iter_6+0x1c> 20: e3a03000 mov r3, #0 24: e7c33002 strb r3, [r3, r2] 28: e2833001 add r3, r3, #1 2c: e3530006 cmp r3, #6 30: 1afffffb bne 24 <test_iter_6+0x8> 34: e12fff1e bx lr 38: 00000000 .word 0x00000000 It seems like "try_unroll_loop_completely()" in file "tree-ssa-loop-ivcanon.c" think it could fold counting variable, but maybe its not possible since its used as both index and as RHS value? ;; Function test_iter_6 (test_iter_6, funcdef_no=1, decl_uid=4067, cgraph_uid=1) Analyzing # of iterations of loop 1 exit condition [5, + , 4294967295] != 0 bounds on difference of bases: -5 ... -5 result: # of iterations 5, bounded by 5 Analyzing # of iterations of loop 1 exit condition [5, + , 4294967295] != 0 bounds on difference of bases: -5 ... -5 result: # of iterations 5, bounded by 5 Statement (exit)if (ivtmp_7 != 0) is executed at most 5 (bounded by 5) + 1 times in loop 1. Induction variable (int) 0 + 1 * iteration does not wrap in statement data[i_9] = _4; in loop 1. Statement data[i_9] = _4; is executed at most 9 (bounded by 9) + 1 times in loop 1. Induction variable (int) 1 + 1 * iteration does not wrap in statement i_6 = i_9 + 1; in loop 1. Statement i_6 = i_9 + 1; is executed at most 2147483646 (bounded by 2147483646) + 1 times in loop 1. Loop 1 iterates 5 times. Loop 1 iterates at most 5 times. Estimating sizes for loop 1 BB: 3, after_exit: 0 size: 0 _4 = (char) i_9; Induction variable computation will be folded away. size: 1 data[i_9] = _4; size: 1 i_6 = i_9 + 1; Induction variable computation will be folded away. size: 1 ivtmp_7 = ivtmp_1 - 1; Induction variable computation will be folded away. size: 2 if (ivtmp_7 != 0) Exit condition will be eliminated in peeled copies. BB: 4, after_exit: 1 size: 5-4, last_iteration: 5-2 Loop size: 5 Estimated size after unrolling: 5 Though produced code with peeling become test_iter_6 () { int i; char _4; unsigned int ivtmp_7; char _12; unsigned int ivtmp_15; char _19; unsigned int ivtmp_22; char _26; unsigned int ivtmp_29; char _33; unsigned int ivtmp_36; char _40; unsigned int ivtmp_43; <bb 2>: _12 = 0; data[0] = _12; i_14 = 1; ivtmp_15 = 5; _19 = (char) i_14; data[i_14] = _19; i_21 = i_14 + 1; ivtmp_22 = ivtmp_15 + 4294967295; _26 = (char) i_21; data[i_21] = _26; i_28 = i_21 + 1; ivtmp_29 = ivtmp_22 + 4294967295; _33 = (char) i_28; data[i_28] = _33; i_35 = i_28 + 1; ivtmp_36 = ivtmp_29 + 4294967295; _40 = (char) i_35; data[i_35] = _40; i_42 = i_35 + 1; ivtmp_43 = ivtmp_36 + 4294967295; _4 = (char) i_42; data[i_42] = _4; i_6 = i_42 + 1; ivtmp_7 = ivtmp_43 + 4294967295; return; } instead of original and shorter test_iter_6 () { int i; unsigned int ivtmp_1; char _4; unsigned int ivtmp_7; <bb 2>: <bb 3>: # i_9 = PHI <i_6(4), 0(2)> # ivtmp_1 = PHI <ivtmp_7(4), 6(2)> _4 = (char) i_9; data[i_9] = _4; i_6 = i_9 + 1; ivtmp_7 = ivtmp_1 - 1; if (ivtmp_7 != 0) goto <bb 4>; else goto <bb 5>; <bb 4>: goto <bb 3>; <bb 5>: return; } Could it be that somewhat since that index also is used as data that variable cannot be folded away like size: 1 i_6 = i_9 + 1; Induction variable computation will be folded away.
Same thing for x86, not only ARM: bash# gcc --version gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609 bash# gcc -c test.c -Os --param max-completely-peel-times=5 bash# objdump -dath test.o Disassembly of section .text: 000000000000000f <test_iter_6>: f: c6 05 00 00 00 00 00 movb $0x0,0x0(%rip) # 16 <test_iter_6+0x7> 16: c6 05 00 00 00 00 01 movb $0x1,0x0(%rip) # 1d <test_iter_6+0xe> 1d: c6 05 00 00 00 00 02 movb $0x2,0x0(%rip) # 24 <test_iter_6+0x15> 24: c6 05 00 00 00 00 03 movb $0x3,0x0(%rip) # 2b <test_iter_6+0x1c> 2b: c6 05 00 00 00 00 04 movb $0x4,0x0(%rip) # 32 <test_iter_6+0x23> 32: c6 05 00 00 00 00 05 movb $0x5,0x0(%rip) # 39 <test_iter_6+0x2a> 39: c3 retq bash# gcc -c test.c -Os --param max-completely-peel-times=4 bash# objdump -dath test.o Disassembly of section .text: 000000000000000f <test_iter_6>: f: 31 c0 xor %eax,%eax 11: 88 80 00 00 00 00 mov %al,0x0(%rax) 17: 48 ff c0 inc %rax 1a: 48 83 f8 06 cmp $0x6,%rax 1e: 75 f1 jne 11 <test_iter_6+0x2> 20: c3 retq