I have noticed this on SH; maybe it also applies to other targets (checked on 4.9 r208241).

The following simple loop (a simple strlen implementation):

unsigned int test (const char* s0)
{
  const char* s1 = s0;

  while (*s1)
    s1++;

  return s1 - s0;
}

With -O2 -m4 gets compiled to:

        mov.b   @r4,r1
        tst     r1,r1
        bt/s    .L4
        mov     r4,r1
        add     #1,r1
        .align 2
.L3:
        mov     r1,r0
        mov.b   @r0,r2
        tst     r2,r2
        bf/s    .L3
        add     #1,r1
        rts
        sub     r4,r0
        .align 1
.L4:
        rts
        mov     #0,r0

With -Os -m4 it is basically just the inner loop:

        mov     r4,r1
.L2:
        mov     r1,r0
        mov.b   @r0,r2
        tst     r2,r2
        bf/s    .L2
        add     #1,r1
        rts
        sub     r4,r0

The additional loop test in the loop header in the -O2 version seems a bit pointless. If the loop exits at the first iteration, it simply falls through. The additional test and the jump around the loop don't gain anything in this case; they just increase code size unnecessarily.
At -O2 we do this to enable loop optimizations, almost all of which require do { } while style loops. This canonicalization can sometimes peel an entire iteration, as you can see here. It is not done at -Os unless the loop is determined to be hot (so with -Os and profile feedback some loops may get this treatment).

It's hard to undo this transform, but that's what would be needed here ... (or make more passes deal with number-of-iterations == n or zero)
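
As a rough source-level illustration of the canonicalization described above (usually called loop header copying), the sketch below writes both shapes by hand. The function names len_while and len_guarded are made up for this example, and the second function only approximates what the compiler produces internally; it is not actual compiler output.

```c
/* Original shape: the exit test sits at the top of the loop. */
unsigned int len_while(const char *s0)
{
    const char *s1 = s0;

    while (*s1)
        s1++;

    return (unsigned int)(s1 - s0);
}

/* Approximate shape after the canonicalization: the exit test is
   copied in front of the loop as a guard, and the loop itself
   becomes a do { } while loop with the test at the bottom.  */
unsigned int len_guarded(const char *s0)
{
    const char *s1 = s0;

    if (*s1) {              /* copied header test (the extra tst/bt at -O2) */
        do {
            s1++;
        } while (*s1);      /* bottom test (the tst/bf in the inner loop) */
    }

    return (unsigned int)(s1 - s0);
}
```

Both functions return the same result for every input. The guard only matters when the string is empty, which is the case where the -O2 code branches around the loop to .L4.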