The following code

void f(unsigned *_bss_start, unsigned *_bss_end)
{
  unsigned *p;
  for (p = _bss_start; p < _bss_end; p++)
    *p = 0;
}

when compiled with

arm-elf-gcc -S -o - -fomit-frame-pointer -mcpu=arm7tdmi-s -Os t.c

produces (GCC 4.3.0 20071107)

f:
        mov     r3, #0
        b       .L2
.L3:
        str     r3, [r0], #4
.L2:
        cmp     r0, r1
        bcc     .L3
        bx      lr

It could be further optimized for both space and speed by emitting

f:
        mov     r3, #0
.L1:
        cmp     r0, r1
        strcc   r3, [r0], #4
        bcc     .L1
        bx      lr
-Os disables the copy-loop-header pass, which would have performed part of this optimization.
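For readers unfamiliar with loop header copying, here is a rough source-level sketch of the transformation. It is not GCC's output, only a hand-written illustration: the exit test is duplicated once as a guard in front of the loop, and the loop itself tests at the bottom, so the initial jump to the test (the "b .L2" in the -Os output) is no longer needed. The function names and the check_equivalence helper are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Original shape: the exit test sits at the top of the loop, so the
   generated code enters the loop by jumping to the test. */
void clear_top_test(unsigned *start, unsigned *end)
{
    unsigned *p;
    for (p = start; p < end; p++)
        *p = 0;
}

/* Hand-written equivalent after copying the loop header: the test is
   duplicated once as a guard, and the loop body is followed by the test. */
void clear_bottom_test(unsigned *start, unsigned *end)
{
    unsigned *p = start;
    if (p < end) {
        do {
            *p++ = 0;
        } while (p < end);
    }
}

/* Hypothetical self-check: both shapes must clear exactly the same words. */
void check_equivalence(void)
{
    unsigned a[6] = {1, 2, 3, 4, 5, 6};
    unsigned b[6] = {1, 2, 3, 4, 5, 6};
    size_t i;

    clear_top_test(a + 1, a + 4);
    clear_bottom_test(b + 1, b + 4);
    for (i = 0; i < 6; i++)
        assert(a[i] == b[i]);
    assert(a[0] == 1 && a[1] == 0 && a[3] == 0 && a[4] == 5);

    /* Empty range: neither version may write anything. */
    clear_top_test(a + 5, a + 5);
    clear_bottom_test(b + 5, b + 5);
    assert(a[5] == 6 && b[5] == 6);
}
```

The guard costs one extra copy of the comparison, which is presumably why the pass is skipped when optimizing for size. The conditional-store form proposed in the report avoids both the extra comparison and the entry branch.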
On the trunk we get:

        movs    r3, #0
.L2:
        cmp     r0, r1
        bcc     .L3
        @ sp needed
        bx      lr
.L3:
        stmia   r0!, {r3}
        b       .L2