[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

rguenth at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Fri Mar 5 12:27:54 GMT 2021


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #29 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Uroš Bizjak from comment #27)
> (In reply to Richard Biener from comment #26)
> > but that doesn't seem to match for some unknown reason.
> 
> Try this:
> 
> (define_peephole2
>   [(match_scratch:DI 5 "Yv")
>    (set (match_operand:DI 0 "sse_reg_operand")
>         (match_operand:DI 1 "general_reg_operand"))
>    (set (match_operand:V2DI 2 "sse_reg_operand")
>         (vec_concat:V2DI (match_operand:DI 3 "sse_reg_operand")
>                          (match_operand:DI 4 "nonimmediate_gr_operand")))]
>   ""
>   [(set (match_dup 0)
>         (match_dup 1))
>    (set (match_dup 5)
>         (match_dup 4))
>    (set (match_dup 2)
>        (vec_concat:V2DI (match_dup 3)
>                         (match_dup 5)))])

Ah, I messed up the operands.  The following works (with match_scratch
in the position above, the scratch happily gets a register matching
operand 0):

;; Further split pinsrq variants of vec_concatv2di with two GPR sources,
;; one already reloaded, to hide the latency of one GPR->XMM transition.
(define_peephole2
  [(set (match_operand:DI 0 "sse_reg_operand")
        (match_operand:DI 1 "general_reg_operand"))
   (match_scratch:DI 2 "Yv")
   (set (match_operand:V2DI 3 "sse_reg_operand")
        (vec_concat:V2DI (match_dup 0)
                         (match_operand:DI 4 "nonimmediate_gr_operand")))]
  "reload_completed && optimize_insn_for_speed_p ()"
  [(set (match_dup 0)
        (match_dup 1))
   (set (match_dup 2)
        (match_dup 4))
   (set (match_dup 3)
        (vec_concat:V2DI (match_dup 0)
                         (match_dup 2)))])

but for some reason it again doesn't work for the important loop.  There
we have

  389: xmm0:DI=cx:DI
      REG_DEAD cx:DI
  390: dx:DI=[sp:DI+0x10]
   56: {dx:DI=dx:DI 0>>0x3f;clobber flags:CC;}
      REG_UNUSED flags:CC
   57: xmm0:V2DI=vec_concat(xmm0:DI,dx:DI)

I suppose the reason is that there are two unrelated insns between the
xmm0 = cx:DI move and the vec_concat.  That would hint that we need to
match this GPR->XMM move not in the peephole pattern but somehow in
the condition instead (can we use DF there?).
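
To illustrate the adjacency problem, here is a toy model in Python (not
GCC code; the matcher, the predicates and the insn strings are made up
for this sketch).  It assumes only that a peephole2 pattern is tried
against a window of consecutive insns, so two unrelated insns in
between defeat a two-insn pattern while a single-insn pattern still
finds the vec_concat:

```python
# Toy model of peephole2-style matching over consecutive insns.
# Everything here is hypothetical illustration, not GCC internals.

def match_window(insns, pattern):
    """Return the index of the first run of consecutive insns in which
    each insn satisfies the corresponding predicate, or None."""
    n = len(pattern)
    for start in range(len(insns) - n + 1):
        window = insns[start:start + n]
        if all(pred(insn) for pred, insn in zip(pattern, window)):
            return start
    return None

def is_gpr_to_xmm_move(insn):
    # Crude stand-in for "set of an SSE reg from a general reg".
    return insn.startswith("xmm0:DI=")

def is_vec_concat(insn):
    # Crude stand-in for the vec_concat:V2DI set.
    return "vec_concat" in insn

# The four insns from the important loop, notes dropped.
loop_insns = [
    "xmm0:DI=cx:DI",
    "dx:DI=[sp:DI+0x10]",
    "dx:DI=dx:DI 0>>0x3f",
    "xmm0:V2DI=vec_concat(xmm0:DI,dx:DI)",
]

two_insn_pattern = [is_gpr_to_xmm_move, is_vec_concat]
one_insn_pattern = [is_vec_concat]

# The move and the vec_concat are not adjacent, so this finds nothing.
two_insn_hit = match_window(loop_insns, two_insn_pattern)
# Matching only the vec_concat succeeds regardless of what precedes it.
one_insn_hit = match_window(loop_insns, one_insn_pattern)
# With the two unrelated insns removed, the two-insn pattern matches.
adjacent_hit = match_window([loop_insns[0], loop_insns[3]], two_insn_pattern)
```

The model says nothing about how the source of xmm0 could be checked
from the condition; it only shows why the two-insn form cannot see
across the intervening insns.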

The simplified variant below works but IMHO matches cases we do not
want to transform.  I can't find any example of how to restrict it via
the condition, though.

;; Further split pinsrq variants of vec_concatv2di with two GPR sources,
;; one already reloaded, to hide the latency of one GPR->XMM transition.
(define_peephole2
  [(match_scratch:DI 3 "Yv")
   (set (match_operand:V2DI 0 "sse_reg_operand")
        (vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
                         (match_operand:DI 2 "nonimmediate_gr_operand")))]
  "reload_completed && optimize_insn_for_speed_p ()"
  [(set (match_dup 3)
        (match_dup 2))
   (set (match_dup 0)
        (vec_concat:V2DI (match_dup 1)
                         (match_dup 3)))])

