[RFC/PATCH v3] ira: Support more matching constraint forms with param [PR100328]

Hongtao Liu crazylht@gmail.com
Wed Jun 30 10:18:48 GMT 2021


On Wed, Jun 30, 2021 at 5:42 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> on 2021/6/30 4:53 PM, Hongtao Liu wrote:
> > On Mon, Jun 28, 2021 at 3:27 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> on 2021/6/28 3:20 PM, Hongtao Liu wrote:
> >>> On Mon, Jun 28, 2021 at 3:12 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >>>>
> >>>> On Mon, Jun 28, 2021 at 2:50 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>>>
> >>>>> Hi!
> >>>>>
> >>>>> on 2021/6/9 1:18 PM, Kewen.Lin via Gcc-patches wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> PR100328 has the details of this issue; I will summarize it
> >>>>>> here.  In the hottest function LBM_performStreamCollideTRT
> >>>>>> of SPEC2017 bmk 519.lbm_r, there are many FMA style expressions
> >>>>>> (27 FMA, 19 FMS, 11 FNMA).  On rs6000, this kind of FMA style
> >>>>>> insn has two flavors: FLOAT_REG and VSX_REG.  The VSX_REG reg
> >>>>>> class has 64 registers, the first 32 of which make up the
> >>>>>> whole of FLOAT_REG.  The two flavors differ in some ways;
> >>>>>> take "*fma<mode>4_fpr" as an example:
> >>>>>>
> >>>>>> (define_insn "*fma<mode>4_fpr"
> >>>>>>   [(set (match_operand:SFDF 0 "gpc_reg_operand" "=<Ff>,wa,wa")
> >>>>>>       (fma:SFDF
> >>>>>>         (match_operand:SFDF 1 "gpc_reg_operand" "%<Ff>,wa,wa")
> >>>>>>         (match_operand:SFDF 2 "gpc_reg_operand" "<Ff>,wa,0")
> >>>>>>         (match_operand:SFDF 3 "gpc_reg_operand" "<Ff>,0,wa")))]
> >>>>>>
> >>>>>> // wa => A VSX register (VSR), vs0…vs63, aka. VSX_REG.
> >>>>>> // <Ff> (f/d) => A floating point register, aka. FLOAT_REG.
> >>>>>>
> >>>>>> So for VSX_REG we only have the destructive form: when a VSX_REG
> >>>>>> alternative is used, operand 2 or operand 3 is required
> >>>>>> to be the same as operand 0.  reload has to take care of this
> >>>>>> constraint and create some non-free register copies if required.
> >>>>>>
> >>>>>> Assuming one fma insn looks like:
> >>>>>>   op0 = FMA (op1, op2, op3)
> >>>>>>
> >>>>>> The best regclass for all of them is VSX_REG.  When op1, op2 and op3
> >>>>>> are all dead, IRA simply creates three shuffle copies for them (here
> >>>>>> the operand order matters, since with the same freq the one with the
> >>>>>> smaller number takes preference), but IMO both op2 and op3 should take
> >>>>>> higher priority in the copy queue due to the matching constraint.
> >>>>>>
> >>>>>> I noticed that there is one function, ira_get_dup_out_num, which is
> >>>>>> meant to create this kind of constraint copy, but the code below seems
> >>>>>> to refuse to create one if there is an alternative that has a valid
> >>>>>> regclass which is not likely to be spilled.
> >>>>>>
> >>>>>>       default:
> >>>>>>       {
> >>>>>>         enum constraint_num cn = lookup_constraint (str);
> >>>>>>         enum reg_class cl = reg_class_for_constraint (cn);
> >>>>>>         if (cl != NO_REGS
> >>>>>>             && !targetm.class_likely_spilled_p (cl))
> >>>>>>           goto fail;
> >>>>>>
> >>>>>>        ...
> >>>>>>
> >>>>>> I cooked up the attached patch to make ira respect this kind of
> >>>>>> matching constraint, guarded by one parameter.  As I stated in the
> >>>>>> PR, I was not sure this is on the right track.  The RFC patch checks
> >>>>>> the matching constraint in all alternatives: if there is one
> >>>>>> alternative with a matching constraint that matches the current
> >>>>>> preferred regclass (or best of allocno?), it records the output
> >>>>>> operand number and further creates one constraint copy for it.
> >>>>>> Normally this copy takes priority over shuffle copies, so the
> >>>>>> matching constraint is more likely to be satisfied and reload
> >>>>>> need not create extra copies to meet the matching constraint or
> >>>>>> the desirable register class.
> >>>>>>
> >>>>>> For FMA A,B,C,D, I think ideally copies A/B, A/C, A/D can first stay
> >>>>>> as shuffle copies.  Later, if any of A,B,C,D is assigned a
> >>>>>> hardware register that is a VSX register (VSX_REG) but not an FP
> >>>>>> register (FLOAT_REG), we have to pay costs if we can NOT
> >>>>>> go with the VSX alternatives, so at that point it becomes important
> >>>>>> to respect the matching constraint, and we can increase the freq for
> >>>>>> the remaining copies related to this (A/B, A/C, A/D).  This idea
> >>>>>> requires some side tables to record some information and seems a bit
> >>>>>> complicated in the current framework, so the proposed patch
> >>>>>> aggressively emphasizes the matching constraint at the time of
> >>>>>> creating copies.
> >>>>>>
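To make the "check the matching constraint in all alternatives" step above concrete, here is a minimal standalone sketch. It is not the GCC code and not the patch itself: the function name matching_alternatives is hypothetical, it handles only a bare single-digit matching constraint, and it assumes <Ff> expands to "d" (the source only says "f/d").

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not a GCC function): return a bitmask with bit I
   set when alternative I of CONSTRAINTS (comma-separated, as written in
   a define_insn) is a bare matching constraint naming operand OUT_OPNO.  */
unsigned matching_alternatives(const char *constraints, int out_opno)
{
    unsigned mask = 0;
    unsigned alt = 0;
    const char *p = constraints;

    assert(constraints != NULL && out_opno >= 0 && out_opno <= 9);
    for (;;) {
        assert(alt < 32); /* keep the shift below well defined */

        /* Skip modifier characters that do not name a register class.  */
        while (*p == '%' || *p == '=' || *p == '+' || *p == '&')
            p++;

        /* Simplification: a matching constraint is one digit that makes
           up the whole alternative.  */
        if (isdigit((unsigned char)*p) && *p - '0' == out_opno
            && (p[1] == ',' || p[1] == '\0'))
            mask |= 1u << alt;

        /* Advance to the next alternative, if any.  */
        while (*p != ',' && *p != '\0')
            p++;
        if (*p == '\0')
            break;
        p++;
        alt++;
    }
    return mask;
}
```

For the "*fma<mode>4_fpr" pattern quoted above, operand 2 ("d,wa,0" under the stated assumption) is tied to operand 0 only in the third alternative and operand 3 ("d,0,wa") only in the second, while operand 1 is never tied.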
> >>>>>
> >>>>> Compared with the original patch (v1), this patch v3 has
> >>>>> considered the following (this should be v2 for this mailing list,
> >>>>> but I bumped it to be consistent with the PR's):
> >>>>>
> >>>>>   - Excluding the case where, for one preferred register class,
> >>>>>     there are two or more alternatives, one of which has the
> >>>>>     matching constraint while another does not.  For the given
> >>>>>     operand, even if it is assigned a hardware reg that doesn't
> >>>>>     meet the matching constraint, it can simply use the
> >>>>>     alternative without the matching constraint, so no register
> >>>>>     move is needed.  One typical case is define_insn
> >>>>>     *mov<mode>_internal2 on rs6000.  So we shouldn't create a
> >>>>>     constraint copy for it.
> >>>>>
> >>>>>   - Disabling this when the register move within the same
> >>>>>     register class is possibly free, since the register move
> >>>>>     to meet the constraint is then considered free.
> >>>>>
> >>>>>   - Making it on by default, as suggested by Segher & Vladimir; we
> >>>>>     hope to get rid of the parameter if the benchmarking results
> >>>>>     look good on major targets.
> >>>>>
> >>>>>   - Tweaking the cost when either side of the matching
> >>>>>     constraint is a hardware register.  Before this patch, the
> >>>>>     constraint copy was simply taken as a real move insn for
> >>>>>     pref and conflict cost with one hardware register.  After
> >>>>>     this patch, several input operands are allowed to respect
> >>>>>     the same matching constraint (but in different
> >>>>>     alternatives), so in some cases we should treat it like a
> >>>>>     shuffle copy to avoid over-preferring/disparaging.
> >>>>>
> >>>>> Please check the PR comments for more details.
> >>>>>
> >>>>> This patch was bootstrapped & regtested on
> >>>>> powerpc64le-linux-gnu P9 and x86_64-redhat-linux, but has some
> >>>>> "XFAIL->XPASS" failures on aarch64-linux-gnu.  The failure list
> >>>>> is attached in the PR, and I think the new assembly looks
> >>>>> improved (expected).
> >>>>>
> >>>>> With -Ofast and unrolling, this patch improves SPEC2017
> >>>>> bmk 508.namd_r by +2.42% and 519.lbm_r by +2.43% on Power8, and
> >>>>> 508.namd_r by +3.02% and 519.lbm_r by +3.85% on Power9, without
> >>>>> any remarkable degradations.
> >
> > Here are the SPEC2017 rate results tested on AMD Milan.
> > The options are: -march=znver2 -Ofast -funroll-loops  -mfpmath=sse -flto
> >
> > fprate:
> >       503.bwaves_r                 0.01    (A)  shliclel219
> >       507.cactuBSSN_r             -0.19    (A)  shliclel219
> >       508.namd_r                   0.02    (A)  shliclel219
> >       510.parest_r                -0.68    (A)  shliclel219
> >       511.povray_r                 1.59    (A)  shliclel219
> >       521.wrf_r                    0.19    (A)  shliclel219
> >       526.blender_r                0.68    (A)  shliclel219
> >       527.cam4_r                  -0.30    (A)  shliclel219
> >       538.imagick_r               -3.81 <- (A)  shliclel219
> >       544.nab_r                    0.02    (A)  shliclel219
> >       549.fotonik3d_r              0.02    (A)  shliclel219
> >       554.roms_r                  -0.43    (A)  shliclel219
> >       997.specrand_fr             -3.80 <- (A)  shliclel219
> >                                     Geometric mean:  -0.52
> > intrate:
> >       500.perlbench_r             -1.54    (A)  shliclel219
> >       502.gcc_r                   -0.38    (A)  shliclel219
> >       505.mcf_r                   -0.10    (A)  shliclel219
> >       520.omnetpp_r               -0.24    (A)  shliclel219
> >       523.xalancbmk_r             -1.04    (A)  shliclel219
> >       525.x264_r                   0.31    (A)  shliclel219
> >       531.deepsjeng_r             -0.02    (A)  shliclel219
> >       541.leela_r                  0.95    (A)  shliclel219
> >       548.exchange2_r              0.08    (A)  shliclel219
> >       557.xz_r                    -0.40    (A)  shliclel219
> >                                     Geometric mean:  -0.24
>
>
> Roger, thanks!  The results don't look good; I think I'll disable it
> for target x86_64 in the next version.  By the way, bmk 519.lbm_r seems
> to be missing; just curious whether that is because it failed to build
> even with baseline?
519.lbm_r           0  ------    ------    BuildSame on milan

Here is fprate on CLX:
      503.bwaves_r               -0.12
      507.cactuBSSN_r            -0.02
      508.namd_r                 -0.57
      510.parest_r                0.40
      511.povray_r               -0.37
      519.lbm_r                   0.10
      521.wrf_r                   0.61
      526.blender_r              -0.50
      527.cam4_r                 -0.45
      538.imagick_r              -6.61 <-
      544.nab_r                  -0.11
      549.fotonik3d_r             0.16
      554.roms_r                  0.22
      997.specrand_fr            -0.18
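
For readers who want to summarize a list like this the way the "Geometric mean" rows of the Milan tables quoted above do, here is a minimal sketch. It assumes the columns are percent changes against the baseline (the thread does not state the units) and that the mean is taken over the ratios 1 + delta/100. The names geomean_percent, nth_root and ipow are hypothetical, and the n-th root uses a few Newton steps only so that the example needs no math library.

```c
#include <assert.h>
#include <stddef.h>

/* Milan fprate deltas copied from the quoted table, assumed to be percent.  */
static const double milan_fprate[] = {
    0.01, -0.19, 0.02, -0.68, 1.59, 0.19, 0.68,
    -0.30, -3.81, 0.02, 0.02, -0.43, -3.80
};

/* base raised to a non-negative integer power by repeated multiplication.  */
static double ipow(double base, size_t exp)
{
    double result = 1.0;
    for (size_t i = 0; i < exp; i++)
        result *= base;
    return result;
}

/* n-th root of a positive value via Newton's method on r^n - value = 0.  */
static double nth_root(double value, size_t n)
{
    assert(value > 0.0 && n > 0);
    double r = 1.0;
    for (int i = 0; i < 64; i++)
        r = ((double)(n - 1) * r + value / ipow(r, n - 1)) / (double)n;
    return r;
}

/* Geometric mean of N percent deltas, returned as a percent delta.
   Each delta must be greater than -100 so that every ratio is positive.  */
double geomean_percent(const double *delta_pct, size_t n)
{
    assert(delta_pct != NULL && n > 0);
    double product = 1.0;
    for (size_t i = 0; i < n; i++)
        product *= 1.0 + delta_pct[i] / 100.0;
    return (nth_root(product, n) - 1.0) * 100.0;
}
```

Applied to milan_fprate, this comes out close to the -0.52 reported in the quoted table, which is consistent with (but does not prove) that reading of the numbers.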

Something is broken on my local Cascade Lake machine, so the intrate test
results for CLX will come later.
>
> BR,
> Kewen



-- 
BR,
Hongtao

