This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: RFC: Combine of compare & and oddity
- From: Andrew Pinski <pinskia at gmail dot com>
- To: Wilco Dijkstra <wdijkstr at arm dot com>
- Cc: Segher Boessenkool <segher at kernel dot crashing dot org>, GCC Patches <gcc-patches at gcc dot gnu dot org>
- Date: Thu, 3 Sep 2015 23:14:03 +0800
- Subject: Re: RFC: Combine of compare & and oddity
- Authentication-results: sourceware.org; auth=none
- References: <000e01d0e5a2$1e2f66b0$5a8e3410$ at com> <20150902184747 dot GA7676 at gate dot crashing dot org> <000f01d0e63d$c40686e0$4c1394a0$ at com> <20150903131809 dot GA27819 at gate dot crashing dot org> <001001d0e659$1120bb60$33623220$ at com>
On Thu, Sep 3, 2015 at 10:59 PM, Wilco Dijkstra <wdijkstr@arm.com> wrote:
>> Segher Boessenkool wrote:
>> On Thu, Sep 03, 2015 at 12:43:34PM +0100, Wilco Dijkstra wrote:
>> > > > Combine canonicalizes certain AND masks in a comparison with zero into extracts of the
>> > > > widest register type. During matching these are expanded into a very inefficient sequence
>> > > > that fails to match. For example (x & 2) == 0 is matched in combine like this:
>> > > >
>> > > > Failed to match this instruction:
>> > > > (set (reg:CC 66 cc)
>> > > > (compare:CC (zero_extract:DI (subreg:DI (reg/v:SI 76 [ xD.2641 ]) 0)
>> > > > (const_int 1 [0x1])
>> > > > (const_int 1 [0x1]))
>> > > > (const_int 0 [0])))
>> > >
>> > > Yes. Some processors even have specific instructions to do this.
>> >
>> > However there are 2 issues with this, one is the spurious subreg,
>>
>> Combine didn't make that up out of thin air; something already used
>> DImode here. It could simplify it to SImode in this case, that is
>> true, don't know why it doesn't; it isn't necessarily faster code to
>> do so, it can be slower, it might not match, etc.
>
> The relevant RTL instructions on AArch64 are:
>
> (insn 8 3 25 2 (set (reg:SI 77 [ D.2705 ])
> (and:SI (reg/v:SI 76 [ xD.2641 ])
> (const_int 2 [0x2]))) tmp5.c:122 452 {andsi3}
> (nil))
> (insn 26 25 27 2 (set (reg:CC 66 cc)
> (compare:CC (reg:SI 77 [ D.2705 ])
> (const_int 0 [0]))) tmp5.c:122 377 {*cmpsi}
> (expr_list:REG_DEAD (reg:SI 77 [ D.2705 ])
> (nil)))
>
> I don't see anything using DI...
>
>> > the other is
>> > that only a subset of legal zero_extracts are tried (only single bit and the
>> > degenerate case of zero_extract with shift of 0 - which I think should not be a
>> > zero_extract). All other AND immediate remain as AND.
>>
>> Yes. I'm happy to see this weird special case "optimisation",
>> "canonicalisation" gone.
>>
>> > So to emit an AND on targets without such specific instructions, or where such
>> > instructions are more expensive than a simple AND (*), you need now at least 3 different
>> > backend patterns for any instruction that can emit AND immediate...
>>
>> It's only a problem for AND-and-compare, no?
>
> Yes, so it looks like some other backends match the odd pattern and then have another
> pattern change it back into the canonical AND/TST form during the split phase (maybe
> the subreg confuses register allocation or blocks other optimizations). This all seems
> a lot of unnecessary complexity for a few special immediates when there is a much
> simpler solution...
>
>> > (*) I think that is another issue in combine - when both alternatives match you
>> > want to select the lowest cost one, not the first one that matches.
>>
>> That's recog, not combine. And quite a few backends rely on "first match
>> wins", because it always has been that way. It also is very easy to write
>> such patterns accidentally (sometimes even the exact same one twice in the
>> same machine description, etc.)
>>
>> > > > Failed to match this instruction:
>> > > > (set (reg:CC 66 cc)
>> > > > (compare:CC (and:DI (lshiftrt:DI (subreg:DI (reg/v:SI 76 [ xD.2641 ]) 0)
>> > > > (const_int 1 [0x1]))
>> > > > (const_int 1 [0x1]))
>> > > > (const_int 0 [0])))
>> > >
>> > > This is after r223067. Combine tests only one "final" instruction; that
>> > > revision rewrites zero_ext* if it doesn't match and tries again. This
>> > > helps for processors that can do more generic masks (like rs6000, and I
>> > > believe also aarch64?): without it, you need many more patterns to match
>> > > all the zero_ext{ract,end} cases.
>> >
>> > But there are more efficient ways to emit single bit and masks tests that apply
>> > to most CPUs rather than doing something specific that works for just one target
>> > only. For example single bit test is a simple shift into carry flag or into the
>> > sign bit, and for mask tests, you shift out all the non-mask bits.
>>
>> Most of those are quite target-specific. Some others are already done,
>> and/or done by other passes.
>
> But what combine does here is even more target-specific. Shifts and ANDs that set flags
> are universally supported on targets with a condition code register.
> Bitfield test/extract instructions are rarer, and where they are supported, they
> may well be more expensive.
>
>> > So my question is, is it combine's job to try all possible permutations that
>> > constitute a bit or mask test?
>>
>> Combine converts the merged instructions to what it thinks is the
>> canonical or cheapest form, and uses that. It does not try multiple
>> options (the zero_ext* -> and+shift rewriting is not changing the
>> semantics of the pattern at all).
>
> But the change from AND to zero_extract is already changing semantics...
>
>> > Or would it be better to let each target decide
>> > on how to canonicalize bit tests and only try that alternative?
>>
>> The question is how to write the pattern to be most convenient for all
>> targets.
>
> The obvious choice is to try the 2 original instructions merged.
>
>> > > > Neither matches the AArch64 patterns for ANDS/TST (which is just compare and AND). If the
>> > > > immediate is not a power of 2 or a power of 2 - 1 then it matches correctly as expected.
>> > > >
>> > > > I don't understand how ((x >> 1) & 1) != 0 could be a useful expansion
>> > >
>> > > It is zero_extract(x,1,1) really. This is convenient for (old and embedded)
>> > > processors that have special bit-test instructions. If we now want combine
>> > > to not do this, we'll have to update all backends that rely on it.
>> >
>> > Would any backend actually rely on this given it only does some specific masks,
>> > has a redundant shift with 0 for the mask case and the odd subreg as well?
>>
>> Such backends match the zero_extract patterns, of course. Random example:
>> the h8300 patterns for the "btst" instruction.
>>
>> > > They are common, and many processors had instructions for them, which is
>> > > (supposedly) why combine special-cased them.
>> >
>> > Yes, but that doesn't mean (x & C) != 0 shouldn't be tried as well...
>>
>> Combine does not try multiple options.
>
> I'm not following - combine tries zero_extract and shift+AND - that's 2 options.
> If that is feasible then adding a 3rd option should be possible.
>
>> > > > It's trivial to change change_zero_ext to expand extracts always into AND and remove the
>> > > > redundant subreg.
>> > >
>> > > Not really trivial... Think about comparisons...
>> > >
>> > > int32_t x;
>> > > ((x >> 31) & 1) > 0;
>> > > // is not the same as
>> > > (x & 0x80000000) > 0; // signed comparison
>> > >
>> > > (You do not easily know what the COMPARE is used for).
>> >
>> > Indeed if you had a zero_extract that included the sign-bit then you may have to adjust
>> > the compare condition. However it looks like that case can't happen - (x & 0x80000000)
>> > comparisons will have the AND optimized away much earlier.
>>
>> A long time before combine, yes. But it is *possible* to give such insns
>> to combine, and it should not do the wrong thing with them.
>
> Yes. So that suggests we'd need to block the canonicalization rather than undo it.
>
>> > > > However wouldn't it make more sense to never special case certain AND immediate in the
>> > > > first place?
>> > >
>> > > Yes, but we need to make sure no targets regress (i.e. by rewriting patterns
>> > > for those that do). We need to know the impact of such a change, at the least.
>> >
>> > The alternative would be let the target decide how to canonicalize bit tests.
>> > That seems like a better solution than trying multiple possibilities which can never
>> > match on most targets.
>>
>> You will end up with a *lot* of target hooks like this. It will also
>> make testing harder (less coverage). I am not sure that is a good idea.
>
> We certainly need a lot more target hooks in general so GCC can do the right thing
> (rather than using costs inconsistently all over the place). But that's a different
> discussion...
This is the pattern which I have in my tree (I need someone here to
submit it still):
(define_insn "*and<mode>3_ze_nr_compare0"
  [(set (reg:CC_NZ CC_REGNUM)
        (compare:CC_NZ
          (zero_extract:GPI ; loc size pos
            (match_operand:GPI 0 "register_operand" "r")
            (match_operand:GPI 1 "const_int_operand" "n")
            (match_operand:GPI 2 "const_int_operand" "n"))
          (const_int 0)))]
  "aarch64_bitmask_imm ((((HOST_WIDE_INT_1U << UINTVAL (operands[1]))
                          - 1) << UINTVAL (operands[2])), <MODE>mode)"
  {
    unsigned HOST_WIDE_INT value
      = (((HOST_WIDE_INT_1U << UINTVAL (operands[1])) - 1)
         << UINTVAL (operands[2]));
    operands[1] = GEN_INT (value);
    return "tst\\t%<w>0, %<w>1";
  }
  [(set_attr "type" "logics_reg")]
)
--- CUT ---
And then under the COMPARE case of rtx_costs:

  /* CC_NZmode supports zero extract for free.  */
  if (GET_MODE (x) == CC_NZmode && GET_CODE (op0) == ZERO_EXTRACT)
    op0 = XEXP (op0, 0);
And in aarch64_select_cc_mode:

  if ((GET_MODE (x) == SImode || GET_MODE (x) == DImode)
      && y == const0_rtx
      && (code == EQ || code == NE || code == LT || code == GE)
      && (GET_CODE (x) == PLUS || GET_CODE (x) == MINUS || GET_CODE (x) == AND
          || GET_CODE (x) == NEG || GET_CODE (x) == ZERO_EXTRACT))
    return CC_NZmode;
Note I also have the following, to handle the case where a zero_extend is
needed around the AND (yes, this shows up mostly right after returns from
functions):
(define_insn "*andsi3_compare0_uxtw"
  [(set (reg:CC_NZ CC_REGNUM)
        (compare:CC_NZ
          (and:SI (match_operand:SI 1 "register_operand" "%r,r")
                  (match_operand:SI 2 "aarch64_logical_operand" "r,K"))
          (const_int 0)))
   (set (match_operand:DI 0 "register_operand" "=r,r")
        (zero_extend:DI (and:SI (match_dup 1) (match_dup 2))))]
  ""
  "ands\\t%w0, %w1, %w2"
  [(set_attr "type" "logics_reg,logics_imm")]
)
Thanks,
Andrew Pinski
>
> Wilco
>
>