This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug middle-end/70359] [6/7/8 Regression] Code size increase for x86/ARM/others compared to gcc-5.3.0
- From: "aldyh at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Thu, 15 Mar 2018 11:25:51 +0000
- Subject: [Bug middle-end/70359] [6/7/8 Regression] Code size increase for x86/ARM/others compared to gcc-5.3.0
- Auto-submitted: auto-generated
- References: <bug-70359-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
--- Comment #37 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
Hi Richi.
(In reply to rguenther@suse.de from comment #31)
> I'd have not restricted the out-of-loop IV use to IV +- CST but
> instead did the transform
>
> + LOOP:
> + # p_8 = PHI <p_16(2), p_INC(3)>
> + ...
> + p_INC = p_8 - 1;
> + goto LOOP;
> + ... p_8 uses ...
>
> to
>
> + LOOP:
> + # p_8 = PHI <p_16(2), p_INC(3)>
> + ...
> + p_INC = p_8 - 1;
> + goto LOOP;
> newtem_12 = p_INC + 1; // undo IV increment
> ... p_8 out-of-loop p_8 uses replaced with newtem_12 ...
>
> so it would always work if we can undo the IV increment.
>
> The disadvantage might be that we then rely on RTL optimizations
> to combine the original out-of-loop constant add with the
> newtem computation but I guess that's not too much to ask ;)
It looks like RTL optimizations have a harder time optimizing things when I
take the above approach.
Doing what you suggest, we end up optimizing this (simplified for brevity):
<bb 3>
# p_8 = PHI <p_16(2), p_19(3)>
p_19 = p_8 + 4294967295;
if (ui_7 > 9)
goto <bb 3>; [89.00%]
...
<bb 5>
p_22 = p_8 + 4294967294;
MEM[(char *)p_19 + 4294967295B] = 45;
into this:
<bb 3>:
# p_8 = PHI <p_16(2), p_19(3)>
p_19 = p_8 + 4294967295;
if (ui_7 > 9)
...
<bb 4>:
_25 = p_19 + 1; ;; undo the increment
...
<bb 5>:
p_22 = _25 + 4294967294;
MEM[(char *)_25 + 4294967294B] = 45;
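For orientation, here is a rough C sketch of the kind of source that would produce dumps of this shape. It is a reconstruction from the GIMPLE above, not a quote of the testcase: the function name, parameter names and the digit computation are assumptions. What it is meant to show is the pre-decrement store inside the loop (p_19 = p_8 - 1) followed by one more pre-decrement store of 45 ('-') after the loop.

```c
/* Hypothetical reconstruction: int_to_str, buf and len are assumed names.
   Writes the decimal form of i backwards from the end of buf and returns
   a pointer to the first character.  buf must hold at least len bytes. */
char *int_to_str(int i, char *buf, int len)
{
    /* Unsigned magnitude; 0u - x avoids overflow for the most negative int. */
    unsigned int ui = (i > 0) ? (unsigned int)i : 0u - (unsigned int)i;
    char *p = buf + len - 1;

    *p = '\0';
    do {
        /* In-loop IV update and store: p_19 = p_8 - 1; *p_19 = digit. */
        *--p = (char)('0' + ui % 10);
    } while ((ui /= 10) != 0);

    if (i < 0) {
        /* Out-of-loop use: one more decrement, then store '-' (45). */
        *--p = '-';
    }
    return p;
}
```

With a 16-byte buffer, int_to_str(-123, buf, 16) would return a pointer to the string "-123" placed just before the terminating NUL at buf[15].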
I haven't dug into the RTL optimizations, but the end result is that we only
get one auto-dec inside the loop, and some -2 indexing outside of it:
strb r1, [r4, #-1]!
lsr r3, r3, #3
bhi .L4
cmp r6, #0
movlt r2, #45
add r3, r4, #1
strblt r2, [r3, #-2]
sublt r4, r4, #1
as opposed to mine:
<bb 3>:
# p_8 = PHI <p_16(2), p_19(3)>
p_19 = p_8 + 4294967295;
if (ui_7 > 9)
...
<bb 5>:
p_22 = p_19 + 4294967295;
*p_22 = 45;
which gives us two auto-decs and much tighter code:
strb r1, [r4, #-1]!
lsr r3, r3, #3
bhi .L4
cmp r6, #0
movlt r3, #45
strblt r3, [r4, #-1]!
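As a sanity check on the constants in the dumps, the sketch below models the pointer arithmetic with uint32_t, assuming 32-bit pointers (which is what offsets like 4294967295 suggest). The helper names are hypothetical. It shows that 4294967295 and 4294967294 are just -1 and -2 modulo 2^32, and that both variants store to the same address, p_19 - 1; they differ only in how that address is computed.

```c
#include <stdint.h>

/* Hypothetical helper: interpret an offset from the dump as a signed value,
   assuming offsets wrap modulo 2^32. */
static int64_t signed_offset(uint32_t off)
{
    return off > (uint32_t)INT32_MAX ? (int64_t)off - 4294967296LL
                                     : (int64_t)off;
}

/* "Undo the increment" variant:
   _25 = p_19 + 1;  MEM[_25 + 4294967294B] = 45; */
static uint32_t store_addr_undo(uint32_t p_19)
{
    uint32_t t25 = p_19 + 1u;       /* undo the in-loop decrement */
    return t25 + 4294967294u;       /* then step back by 2 */
}

/* Direct variant:
   p_22 = p_19 + 4294967295;  *p_22 = 45; */
static uint32_t store_addr_direct(uint32_t p_19)
{
    return p_19 + 4294967295u;      /* step back by 1 from p_19 */
}
```

Since the addresses are identical, the difference in the generated code comes from the base register: the direct form is a decrement-by-one off the same register the loop already updates, which maps onto the pre-indexed writeback store.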
Would it be OK to go with my approach, or is it worth looking into the RTL
optimizers and seeing what can be done (boo! :))?
Thanks.