This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/39423] [4.6/4.7/4.8 Regression] [SH] performance regression: lost mov @(disp,Rn)
- From: "olegendo at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Fri, 13 Jul 2012 09:36:36 +0000
- Subject: [Bug target/39423] [4.6/4.7/4.8 Regression] [SH] performance regression: lost mov @(disp,Rn)
- Auto-submitted: auto-generated
- References: <bug-39423-4@http.gcc.gnu.org/bugzilla/>
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39423
--- Comment #21 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-07-13 09:36:36 UTC ---
(In reply to comment #20)
> I like your mem,plus,plus combine; it looks better than my folding. There are
> just 2 little details that I noticed while playing with it: there was a
> loss of generality with the hardcoding of the shift constant,
Yeah, sure. The hardcoded '4' was just to see if/how it could work at all.
> and I'm not sure
> if there is a risk of clobbering a live operands[1]. For those reasons I
> modified it a bit as follows.
>
> (define_insn_and_split "*movsi_disp"
>   [(set (match_operand:SI 0 "arith_reg_dest" "=r")
>         (mem:SI
>           (plus:SI
>             (plus:SI (mult:SI (match_operand:SI 1 "arith_reg_operand" "r")
>                               (match_operand:SI 2 "const_int_operand" "i"))
>                      (match_operand:SI 3 "arith_reg_operand" "K06"))
>             (match_operand:SI 4 "const_int_operand" "i"))))]
>   "TARGET_SH1 && satisfies_constraint_K06 (operands[4])
>    && exact_log2 (INTVAL (operands[2])) > 0"
> {
>   gcc_unreachable ();
>   return "#";
> }
>   "&& 1"
>   [(set (match_dup 1) (ashift:SI (match_dup 1) (match_dup 2)))
>    (set (match_dup 1) (plus:SI (match_dup 1) (match_dup 3)))
>    (set (match_dup 0) (mem:SI (plus:SI (match_dup 1) (match_dup 4))))]
> {
>   int n = exact_log2 (INTVAL (operands[2]));
>   rtx res = gen_reg_rtx (SImode);
>   emit_move_insn (res, operands[1]);
>
>   operands[1] = res;
>   operands[2] = GEN_INT (n);
> })
>
Uhm, I think it would be enough to do the following in the split preparation...
(define_insn_and_split "*movsi_disp"
  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
        (mem:SI
          (plus:SI
            (plus:SI (mult:SI (match_operand:SI 1 "arith_reg_operand" "r")
                              (match_operand:SI 2 "const_int_operand" "i"))
                     (match_operand:SI 3 "arith_reg_operand" "K06"))
            (match_operand:SI 4 "const_int_operand" "i"))))]
  "TARGET_SH1 && satisfies_constraint_K06 (operands[4])
   && exact_log2 (INTVAL (operands[2])) > 0"
{
  gcc_unreachable ();
  return "#";
}
  "&& can_create_pseudo_p ()"
  [(set (match_dup 5) (ashift:SI (match_dup 1) (match_dup 2)))
   (set (match_dup 6) (plus:SI (match_dup 5) (match_dup 3)))
   (set (match_dup 0) (mem:SI (plus:SI (match_dup 6) (match_dup 4))))]
{
  operands[5] = gen_reg_rtx (SImode);
  operands[6] = gen_reg_rtx (SImode);
  operands[2] = GEN_INT (exact_log2 (INTVAL (operands[2])));
})
> (define_insn "ashlsi3_k"
>   [(set (match_operand:SI 0 "arith_reg_dest" "=r,r,r")
>         (ashift:SI (match_operand:SI 1 "arith_reg_operand" "0,0,0")
>                    (match_operand:SI 2 "const_int_operand" "M,P27,r")))]
>   "TARGET_SH1"
>   "@
>    add %0,%0
>    shll%O2 %0
>    shld %2,%0"
>   [(set_attr "type" "arith")])
>
The shld insn is only available on SH3 and above, or on SH2A.
Allowing arbitrary shifts definitely makes sense here, but then maybe
we should use the existing shift expansion, although it sometimes
causes problems (e.g. it ends up generating function calls for shll8 on
SH2..).
Another option might be to do these combine patterns on < SH3 only
when the resulting shift costs do not exceed a certain threshold.
> Using those changes, a snippet like
>
> int
> foo4 (long long tab[], int index)
> {
>   return (int)tab [index+4];
> }
>
> now compiles as:
>
> mov #3,r1
> shld r1,r5
> add r4,r5
> rts
> mov.l @(32,r5),r0
>
If I recall correctly, a sequence such as "mov #3,r1; shld r1,r5" will be
slower than "shll r5; shll2 r5" due to pipelining issues. But that's another
story.
> 2) plus_mulsi
>
> I see some interactions with the movsi_disp pattern, due to different ordering
> of the matching in the combiner. I'll need to play more with them activated
> together.
>
> So I think you can go ahead with your combiner movsi_disp pattern and propose
> it to Kaz when it's ready, to close this defect. Feel free to take my
> suggestions or not.
>
> I'll then follow up with my plus_mulsi combiner later, making clear that
> it provides additional optimization opportunities, without mixing the impacts.
>
OK, I'll brush up the 'movsi_disp' stuff a little, taking your input into
account. It might take a couple of days though.
> On a related thread, for further work, I'm thinking of adding support for the
> MAC instruction, now that we have the multiply and add. But this requires
> exposing the MACL/MACH registers to reload. Has anyone had a thought on this?
> I'd like to give it a try pretty soon.
>
I have created PR 53949 for this.