This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/39423] [4.6/4.7/4.8 Regression] [SH] performance regression: lost mov @(disp,Rn)
- From: "olegendo at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Fri, 13 Jul 2012 09:36:36 +0000
- Subject: [Bug target/39423] [4.6/4.7/4.8 Regression] [SH] performance regression: lost mov @(disp,Rn)
- Auto-submitted: auto-generated
- References: <bug-39423-4@http.gcc.gnu.org/bugzilla/>
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39423
--- Comment #21 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-07-13 09:36:36 UTC ---
(In reply to comment #20)
> I like your mem,plus,plus combine; it looks better than my folding. There are
> just 2 little details that I noticed while playing with it: there was a
> loss of generality with the hardcoding of the shift constant,
Yeah, sure. The hardcoded '4' was just to see if/how it could work at all.
> and I'm not sure
> if there is a risk of clobbering a live operands[1]. For those reasons I
> modified it a bit as follows.
>
> (define_insn_and_split "*movsi_disp"
>   [(set (match_operand:SI 0 "arith_reg_dest" "=r")
>         (mem:SI
>           (plus:SI
>             (plus:SI (mult:SI (match_operand:SI 1 "arith_reg_operand" "r")
>                               (match_operand:SI 2 "const_int_operand" "i"))
>                      (match_operand:SI 3 "arith_reg_operand" "K06"))
>             (match_operand:SI 4 "const_int_operand" "i"))))]
>   "TARGET_SH1 && satisfies_constraint_K06 (operands[4])
>    && exact_log2 (INTVAL (operands[2])) > 0"
> {
>   gcc_unreachable ();
>   return "#";
> }
>   "&& 1"
>   [(set (match_dup 1) (ashift:SI (match_dup 1) (match_dup 2)))
>    (set (match_dup 1) (plus:SI (match_dup 1) (match_dup 3)))
>    (set (match_dup 0) (mem:SI (plus:SI (match_dup 1) (match_dup 4))))]
> {
>   int n = exact_log2 (INTVAL (operands[2]));
>   rtx res = gen_reg_rtx (SImode);
>   emit_move_insn (res, operands[1]);
>
>   operands[1] = res;
>   operands[2] = GEN_INT (n);
> })
>
Uhm, I think it would be enough to do the following in the split preparation...
(define_insn_and_split "*movsi_disp"
  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
        (mem:SI
          (plus:SI
            (plus:SI (mult:SI (match_operand:SI 1 "arith_reg_operand" "r")
                              (match_operand:SI 2 "const_int_operand" "i"))
                     (match_operand:SI 3 "arith_reg_operand" "K06"))
            (match_operand:SI 4 "const_int_operand" "i"))))]
  "TARGET_SH1 && satisfies_constraint_K06 (operands[4])
   && exact_log2 (INTVAL (operands[2])) > 0"
{
  gcc_unreachable ();
  return "#";
}
  "&& can_create_pseudo_p ()"
  [(set (match_dup 5) (ashift:SI (match_dup 1) (match_dup 2)))
   (set (match_dup 6) (plus:SI (match_dup 5) (match_dup 3)))
   (set (match_dup 0) (mem:SI (plus:SI (match_dup 6) (match_dup 4))))]
{
  operands[5] = gen_reg_rtx (SImode);
  operands[6] = gen_reg_rtx (SImode);
  operands[2] = GEN_INT (exact_log2 (INTVAL (operands[2])));
})
> (define_insn "ashlsi3_k"
>   [(set (match_operand:SI 0 "arith_reg_dest" "=r,r,r")
>         (ashift:SI (match_operand:SI 1 "arith_reg_operand" "0,0,0")
>                    (match_operand:SI 2 "const_int_operand" "M,P27,r")))]
>   "TARGET_SH1"
>   "@
>    add %0,%0
>    shll%O2 %0
>    shld %2,%0"
>   [(set_attr "type" "arith")])
>
The shld insn is only available on SH3 and above, or on SH2A.
Allowing arbitrary shifts definitely makes sense here, but then maybe
we should use the existing shift expansion, although it sometimes
causes problems (e.g. it ends up generating function calls for shll8 on
SH2..).
Another option might be to do these combine patterns on < SH3 only
when the resulting shift costs do not exceed a certain threshold.
> Using those changes, a snippet like
>
> int
> foo4 (long long tab[], int index)
> {
>   return (int)tab [index+4];
> }
>
> now compiles as:
>
> mov #3,r1
> shld r1,r5
> add r4,r5
> rts
> mov.l @(32,r5),r0
>
If I recall correctly, a sequence such as "mov #3,r1; shld r1,r5" will be
slower than "shll r5; shll2 r5" due to pipelining issues. But that's another
story.
> 2) plus_mulsi
>
> I see some interactions with the movsi_disp pattern, due to different ordering
> of the matching in the combiner. I'll need to play more with them activated
> together.
>
> So I think you can go ahead with your combiner movsi_disp pattern and propose
> it to Kaz when it's ready, to close this defect. Feel free to take my
> suggestions or not.
>
> I'll then follow up with my plus_mulsi combiner later, making clear that
> it provides additional optimization opportunities, without mixing the impacts.
>
OK, I'll brush up the 'movsi_disp' stuff a little, taking your input into
account. It might take a couple of days though.
> On a related thread, for further work, I'm thinking of adding support for the
> MAC instruction, now that we have the multiply and add. But this requires
> exposing the MACL/MACH registers to reload. Has anyone had a thought on this?
> I'd like to give it a try pretty soon.
>
I have created PR 53949 for this.