This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [SVE] [fwprop] PR88833 - Redundant moves for WHILELO-based loops
- From: Richard Sandiford <richard dot sandiford at arm dot com>
- To: Prathamesh Kulkarni <prathamesh dot kulkarni at linaro dot org>
- Cc: gcc Patches <gcc-patches at gcc dot gnu dot org>
- Date: Wed, 26 Jun 2019 11:35:11 +0100
- Subject: Re: [SVE] [fwprop] PR88833 - Redundant moves for WHILELO-based loops
- References: <CAAgBjMndm8upw++qigGPQNTA4nqm+Uck=CcVBpt3cZ74fEGEwA@mail.gmail.com> <mptd0j3ib67.fsf@arm.com> <CAAgBjM=z6g3R6Ai7nRcOShAXEp2dd44kFkmZUUf-=T+jfUFboQ@mail.gmail.com> <mptv9wvgj37.fsf@arm.com> <CAAgBjMngAEz-3O6M45k_cjMaRzTqrPBXLf4aCWfipGsJ09Z=SA@mail.gmail.com> <CAAgBjMmrBvcuJiYu98PzGxSCpNbnQ92rJ39YDM=Eup-tSts2zQ@mail.gmail.com> <mpt5zotg2c9.fsf@arm.com> <CAAgBjMng2QWYv7J2eq2sZRgETtpbAk+sQArRAQGw4ox4UBYbUA@mail.gmail.com>
Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Tue, 25 Jun 2019 at 20:05, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > On Mon, 24 Jun 2019 at 21:41, Prathamesh Kulkarni
>> > <prathamesh.kulkarni@linaro.org> wrote:
>> >>
>> >> On Mon, 24 Jun 2019 at 19:51, Richard Sandiford
>> >> <richard.sandiford@arm.com> wrote:
>> >> >
>> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> > > @@ -1415,6 +1460,19 @@ forward_propagate_into (df_ref use)
>> >> > > if (!def_set)
>> >> > > return false;
>> >> > >
>> >> > > + if (reg_prop_only
>> >> > > + && !REG_P (SET_SRC (def_set))
>> >> > > + && !REG_P (SET_DEST (def_set)))
>> >> > > + return false;
>> >> >
>> >> > This should be:
>> >> >
>> >> > if (reg_prop_only
>> >> > && (!REG_P (SET_SRC (def_set)) || !REG_P (SET_DEST (def_set))))
>> >> > return false;
>> >> >
>> >> > so that we return false if either operand isn't a register.
>> >> Oops, sorry about that -:(
>> >> >
>> >> > > +
>> >> > > + /* Allow propagations into a loop only for reg-to-reg copies, since
>> >> > > + replacing one register by another shouldn't increase the cost. */
>> >> > > +
>> >> > > + if (DF_REF_BB (def)->loop_father != DF_REF_BB (use)->loop_father
>> >> > > + && !REG_P (SET_SRC (def_set))
>> >> > > + && !REG_P (SET_DEST (def_set)))
>> >> > > + return false;
>> >> >
>> >> > Same here.
>> >> >
>> >> > OK with that change, thanks.
>> >> Thanks for the review, will make the changes and commit the patch
>> >> after re-testing.
>> > Hi,
>> > Testing the patch showed following failures on 32-bit x86:
>> >
>> > Executed from: g++.target/i386/i386.exp
>> > g++:g++.target/i386/pr88152.C scan-assembler-not vpcmpgt|vpcmpeq|vpsra
>> > Executed from: gcc.target/i386/i386.exp
>> > gcc:gcc.target/i386/pr66768.c scan-assembler add*.[ \t]%gs:
>> > gcc:gcc.target/i386/pr90178.c scan-assembler-times xorl[\\t
>> > ]*\\%eax,[\\t ]*%eax 1
>> >
>> > The failure of pr88152.C is also seen on x86_64.
>> >
>> > For pr66768.c, and pr90178.c, forwprop replaces register which is
>> > volatile and frame related respectively.
>> > To avoid that, the attached patch, makes a stronger constraint that
>> > src and dest should be a register
>> > and not have frame_related or volatil flags set, which is checked in
>> > usable_reg_p().
>> > Which avoids the failures for both the cases.
>> > Does it look OK ?
>>
>> That's not the reason it's a bad transform. In both cases we're
>> propagating r2 <- r1 even though
>>
>> (a) r1 dies in the copy and
>> (b) fwprop can't replace all uses of r2, because some have multiple
>> definitions
>>
>> This has the effect of making both values live in cases where only one
>> was previously.
>>
>> In the case of pr66768.c, fwprop2 is undoing the effect of
>> cse.c:canon_reg, which tries to pick the best register to use
>> (see cse.c:make_regs_eqv). fwprop1 makes the same change,
>> and made it even before the patch, but the cse.c choice should win.
>>
>> A (hopefully) conservative fix would be to propagate the copy only if
>> both registers have a single definition, which you can test with:
>>
>> (DF_REG_DEF_COUNT (regno) == 1
>> && !bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))
>>
>> In that case, fwprop should see all uses of the destination, and should
>> be able to replace it in all cases with the source.
> Ah I see, thanks for the explanation!
> The above check works to resolve the failure.
> IIUC, !bitmap_bit_p (...) above checks that reg isn't used uninitialized ?
Right.
>> > For g++.target/i386/pr88152.C, the issue is that after the patch,
>> > forwprop1 does following propagation (in f10) which wasn't done
>> > before:
>> >
>> > In insn 10, replacing
>> > (unspec:SI [
>> > (reg:V2DF 91)
>> > ] UNSPEC_MOVMSK)
>> > with (unspec:SI [
>> > (subreg:V2DF (reg:V2DI 90) 0)
>> > ] UNSPEC_MOVMSK)
>> >
>> > This later defeats combine because insn 9 gets deleted.
>> > Without patch, the following combination takes place:
>> >
>> > Trying 7 -> 9:
>> > 7: r90:V2DI=r89:V2DI>r93:V2DI
>> > REG_DEAD r93:V2DI
>> > REG_DEAD r89:V2DI
>> > 9: r91:V2DF=r90:V2DI#0
>> > REG_DEAD r90:V2DI
>> > Successfully matched this instruction:
>> > (set (subreg:V2DI (reg:V2DF 91) 0)
>> > (gt:V2DI (reg:V2DI 89)
>> > (reg:V2DI 93)))
>> > allowing combination of insns 7 and 9
>> >
>> > and then:
>> > Trying 6, 9 -> 10:
>> > 6: r89:V2DI=const_vector
>> > 9: r91:V2DF#0=r89:V2DI>r93:V2DI
>> > REG_DEAD r89:V2DI
>> > REG_DEAD r93:V2DI
>> > 10: r87:SI=unspec[r91:V2DF] 43
>> > REG_DEAD r91:V2DF
>> > Successfully matched this instruction:
>> > (set (reg:SI 87)
>> > (unspec:SI [
>> > (lt:V2DF (reg:V2DI 93)
>> > (const_vector:V2DI [
>> > (const_int 0 [0]) repeated x2
>> > ]))
>> > ] UNSPEC_MOVMSK))
>>
>> Eh? lt:*V2DF*? Does that mean that it's 0 for false and an all-1 NaN
>> for true?
> Well looking at .optimized dump:
>
> vector(2) long int _2;
> vector(2) double _3;
> int _6;
> signed long _7;
> long int _8;
> signed long _9;
> long int _10;
>
> <bb 2> [local count: 1073741824]:
> _7 = BIT_FIELD_REF <a_4(D), 64, 0>;
> _8 = _7 < 0 ? -1 : 0;
> _9 = BIT_FIELD_REF <a_4(D), 64, 64>;
> _10 = _9 < 0 ? -1 : 0;
> _2 = {_8, _10};
> _3 = VIEW_CONVERT_EXPR<__m128d>(_2);
> _6 = __builtin_ia32_movmskpd (_3); [tail call]
> return _6;
>
> So AFAIU, we're using -1 for representing true and 0 for false
> and casting -1 (literally) to double would change it's value to -NaN ?
Exactly. And for -ffinite-math-only, we might as well then fold the
condition to false. :-)
IMO rtl condition results have to have integral modes and not folding
(subreg:V2DF (lt:V2DI ...) 0) is the correct behaviour. So I think this
is just a latent bug and shouldn't hold up the patch.
I'm not sure whether:
reinterpret_cast<__m128d> (a > ...)
is something we expect users to write, or whether it was just
included for completeness.
Thanks,
Richard
> The above insn 10 created by combine is then transformed to following insn 22:
> (insn 22 9 16 2 (set (reg:SI 0 ax [87])
> (unspec:SI [
> (reg:V2DF 20 xmm0 [93])
> ] UNSPEC_MOVMSK))
> "../../stage1-build/gcc/include/emmintrin.h":958:34 4222
> {sse2_movmskpd}
> (nil))
>
> deleting insn 10.
>
> Thanks,
> Prathamesh
>
>>
>> Looks like a bug that we manage to fold to that, and manage to match it.
>>
>> Thanks,
>> Richard
>>
>> > allowing combination of insns 6, 9 and 10
>> > original costs 4 + 8 + 4 = 16
>> > replacement cost 12
>> > deferring deletion of insn with uid = 9.
>> > deferring deletion of insn with uid = 6.
>> > which deletes insns 2, 3, 6, 7, 9.
>> >
>> > With patch, it fails to combine 7->10:
>> > Trying 7 -> 10:
>> > 7: r90:V2DI=r89:V2DI>r93:V2DI
>> > REG_DEAD r93:V2DI
>> > REG_DEAD r89:V2DI
>> > 10: r87:SI=unspec[r90:V2DI#0] 43
>> > REG_DEAD r90:V2DI
>> > Failed to match this instruction:
>> > (set (reg:SI 87)
>> > (unspec:SI [
>> > (subreg:V2DF (gt:V2DI (reg:V2DI 89)
>> > (reg:V2DI 93)) 0)
>> > ] UNSPEC_MOVMSK))
>> >
>> > and subsequently 6, 7 -> 10
>> > (attached combine dumps before and after patch).
>> >
>> > So IIUC, the issue is that the target does not have a pattern that can
>> > match the above insn ?
>> > I tried a simple workaround to "pessimize" the else condition in
>> > propagate_rtx_1 in patch, to require old_rtx and new_rtx have same
>> > rtx_code, which at least
>> > works for this test-case, but not sure if that's the correct approach.
>> > Could you suggest how to proceed ?
>> >
>> > Thanks,
>> > Prathamesh
>> >>
>> >> Thanks,
>> >> Prathamesh
>> >> >
>> >> > Richard