This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/70048] [6 Regression][AArch64] Inefficient local array addressing
- From: "jiwang at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Thu, 10 Mar 2016 16:40:37 +0000
- Subject: [Bug target/70048] [6 Regression][AArch64] Inefficient local array addressing
- Auto-submitted: auto-generated
- References: <bug-70048-4 at http dot gcc dot gnu dot org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048
--- Comment #16 from Jiong Wang <jiwang at gcc dot gnu.org> ---
(In reply to Richard Henderson from comment #13)
> Created attachment 37911 [details]
> aggressive patch
>
Cool! Thanks very much for experimenting with this thoughtful new aggressive
direction.
But there is a performance issue, as described at
https://gcc.gnu.org/ml/gcc-patches/2016-02/msg00281.html:
"this patch forces register scaling expression out of memory ref, so that
RTL CSE pass can handle common register scaling expressions"
This is particularly performance-critical when a group of instructions
uses the same "scaled register" inside a hot loop, where CSE can remove
the redundant calculations.
r233136 fixed this by forcing the scaled part into a register, turning
something like:
ldr dst1, [reg_base1, reg_index, lsl #1]
ldr dst2, [reg_base2, reg_index, lsl #1]
ldr dst3, [reg_base3, reg_index, lsl #1]
into
reg_index = reg_index << 1;
ldr dst1, [reg_base1, reg_index]
ldr dst2, [reg_base2, reg_index]
ldr dst3, [reg_base3, reg_index]
For the aggressive version of the patch at #c13:
===
In TARGET_LEGITIMATE_ADDRESS_P, aarch64_classify_index allows a
register-scaled offset through its "MUL" checks. That is correct, but
then an address like [reg, reg, lsl #1] is judged valid and stays hidden
behind the MEM, invisible to RTL passes until quite a late stage.
For the second patch at #c10:
===
I found that op1 is *not* forced into a register when op0 is a virtual
eliminable register. Given the performance issue mentioned above, I think
we should always force op1 into a register to expose the scaled
calculation to GCC.
I can confirm the following:
* Both the patches at #c13 and #c10 cause a performance regression and
reopen the issue r233136 fixed.
* For the aggressive patch at #c13, if we explicitly reject
(plus reg, (mul reg, imm)) in aarch64_classify_index, then there is no
performance regression.
* For the second patch at #c10, if we always do
"op1 = force_operand (op1, NULL_RTX);"
whether or not op0 is virtual & eliminable, then there is likewise no
performance regression. Even better, one benchmark shows a significant
performance improvement over current trunk.
With the regression fixed, I feel the second patch at #c10 looks better if
the decision is to fix this for GCC 6, as I believe RTL details should be
exposed to RTL passes as early as possible.