[PATCH] x86_64: Integer min/max improvements.
Roger Sayle
roger@nextmovesoftware.com
Thu Aug 6 08:40:49 GMT 2020
Hi Uros,
Many thanks for the review and feedback. Here's the final version as committed,
with both the test cases requested by Richard Biener and your suggestion/request
to use ix86_expand_clear. Tested again on x86_64-pc-linux-gnu.
Thank you again for the fantastic ix86_expand_clear pointer, which cleared up one
of two technical questions I had, and allowed this peephole2 to now also apply to
QImode and HImode MOV0s, where my original version was limited to SImode and
DImode.
My two questions were (i) why a QImode set of 0 with a flags clobber isn't a recognized
instruction? I'd assume that on some architectures "xorb dl,dl" might be an appropriate
sequence to use. This is mostly answered by the use of ix86_expand_clear, which
intelligently selects the correct form, but the lack of a *movqi_xor was previously odd.
(ii) My other question, was that despite my best efforts I couldn't seem to convince GCC
to generate/use a *movsi_or to load the constant -1. It was just a curiosity, but this
would affect/benefit the smaxm1 and sminm1 examples in the new i386/minmax-10.c
test program.
Many thanks again for your help.
Roger
--
-----Original Message-----
From: Uros Bizjak <ubizjak@gmail.com>
Sent: 03 August 2020 11:29
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] x86_64: Integer min/max improvements.
On Thu, Jul 30, 2020 at 1:23 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch tweaks the way that min and max are expanded, so that the
> semantics of these operations are visible to the early RTL
> optimization passes, until split into explicit comparison and
> conditional move instructions. The good news is that i386.md already
> contains all of the required logic (many thanks to Richard Biener and
> Uros Bizjak), but this is currently only to enable scalar-to-vector
> (STV) synthesis of min/max instructions. This change enables this
> functionality for all TARGET_CMOVE architectures for SImode, HImode and DImode.
>
> My first attempt to support "min(reg,reg)" as an instruction revealed
> why this functionality isn't already enabled: In PR rtl-optimization
> 91154 we end up generating a cmp instead of a test in
> gcc.target/i386/minmax-3.c which has poor performance on AMD Opteron.
> My solution to this is to actually support "min(reg,general_operand)"
> allowing us to inspect any immediate constant at the point this
> operation is split. This allows us to use "test" instructions for min/max against 0, 1 and -1.
> As an added benefit it allows us to CSE large immediate constants,
> reducing code size.
>
> Previously, "int foo(int x) { return max(x,12345); }" would generate:
>
> foo: cmpl $12345, %edi
> movl $12345, %eax
> cmovge %edi, %eax
> ret
>
> with this patch we instead generate:
>
> foo: movl $12345, %eax
> cmpl %eax, %edi
> cmovge %edi, %eax
> ret
>
>
> I've also included a peephole2 to avoid the "movl $0,%eax"
> instructions being produced by the register allocator. Materializing
> constants as late as possible reduces register pressure, but for
> const0_rtx on x86, it's preferable to use "xor" by moving this set
> from between a flags setting operation and its use. This should also
> help instruction macro fusion on some microarchitectures, where
> test/cmp and the following instruction can sometimes be combined.
>
> Previously, "int foo(int x) { return max(x,0); }" would generate:
>
> foo: testl %edi, %edi
> movl $0, %eax
> cmovns %edi, %eax
> ret
>
> with this patch we instead generate:
> foo: xorl %eax, %eax
> testl %edi, %edi
> cmovns %edi, %eax
> ret
>
> The benefits of having min/max explicit at the RTL level can be seen
> from compiling the following example with "-O2 -fno-tree-reassoc".
>
>
> #define max(a,b) (((a) > (b))? (a) : (b))
>
> int foo(int x)
> {
> int y = max(x,5);
> return max(y,10);
> }
>
> where GCC currently produces:
>
> foo: cmpl $5, %edi
> movl $5, %eax
> movl $10, %edx
> cmovge %edi, %eax
> cmpl $10, %eax
> cmovl %edx, %eax
> ret
>
> and with this patch it instead now produces:
>
> foo: movl $10, %eax
> cmpl %eax, %edi
> cmovge %edi, %eax
> ret
>
>
> The original motivation was from looking at a performance critical
> function in a quantum mechanics code, that performed MIN_EXPR and
> MAX_EXPR of the same arguments (effectively a two-element sort), where
> GCC was performing the comparison twice. I'd hoped that it might be
> possible to fuse these together, perhaps in combine, but this patch is
> an intermediate step towards that goal.
>
> This patch has been tested on x86_64-pc-linux-gnu with a make
> bootstrap followed by make -k check with no new regressions.
> Ok for mainline?
>
>
> 2020-07-30 Roger Sayle <roger@nextmovesoftware.com>
>
> * config/i386/i386.md (MAXMIN_IMODE): No longer needed.
> (<maxmin><mode>3): Support SWI248 and general_operand for
> second operand, when TARGET_CMOVE.
> (<maxmin><mode>3_1 splitter): Optimize comparisons against
> 0, 1 and -1 to use "test" instead of "cmp".
> (*<maxmin>di3_doubleword): Likewise, allow general_operand
> and enable on TARGET_CMOVE.
> (peephole2): Convert clearing a register after a flag setting
> instruction into an xor followed by the original flag setter.
>
>
> Many thanks in advance. Almost all of the credit goes to Uros and
> Richard for already implementing this feature, I've just copied the
> transformations from optab expansion, that allow it to be enabled
> without penalty (this late in the compilation).
LGTM, modulo the below part:
+;; Avoid clearing a register between a flags setting comparison and its
+use, ;; i.e. prefer "xorl %eax,%eax; test/cmp" over "test/cmp; movl $0, %eax".
+(define_peephole2
+ [(set (reg FLAGS_REG) (match_operand 0))
+ (set (match_operand:SWI48 1 "register_operand") (const_int 0))]
+ "peep2_regno_dead_p (0, FLAGS_REG)
+ && !reg_overlap_mentioned_p (operands[1], operands[0])"
+ [(parallel [(set (match_dup 1) (const_int 0))
+ (clobber (reg:CC FLAGS_REG))])
+ (set (match_dup 2) (match_dup 0))]
+{
+ operands[2] = gen_rtx_REG (GET_MODE (operands[0]), FLAGS_REG);
+})
Please use ix86_expand_clear (operands[1]); in the preparation statement instead of emitting RTX pattern. This will take
TARGET_USE_MOV0 and size optimizations into account.
Uros.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patche8.txt
URL: <https://gcc.gnu.org/pipermail/gcc-patches/attachments/20200806/d3d3212f/attachment-0001.txt>
More information about the Gcc-patches
mailing list