[PATCH] x86_64: Integer min/max improvements.

Mon Aug 3 10:29:00 GMT 2020

On Thu, Jul 30, 2020 at 1:23 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch tweaks the way that min and max are expanded, so that the
> semantics of these operations are visible to the early RTL optimization
> passes, until split into explicit comparison and conditional move
> instructions. The good news is that i386.md already contains all of
> the required logic (many thanks to Richard Biener and Uros Bizjak),
> but this is currently only to enable scalar-to-vector (STV) synthesis
> of min/max instructions.  This change enables this functionality for
> all TARGET_CMOVE architectures for SImode, HImode and DImode.
>
> My first attempt to support "min(reg,reg)" as an instruction revealed
> why this functionality isn't already enabled: In PR rtl-optimization 91154
> we end up generating a cmp instead of a test in gcc.target/i386/minmax-3.c
> which has poor performance on AMD Opteron.  My solution to this is to
> actually support "min(reg,general_operand)" allowing us to inspect
> any immediate constant at the point this operation is split.  This
> allows us to use "test" instructions for min/max against 0, 1 and -1.
> As an added benefit it allows us to CSE large immediate constants,
> reducing code size.
>
> Previously, "int foo(int x) { return max(x,12345); }" would generate:
>
> foo:    cmpl    $12345, %edi
>         movl    $12345, %eax
>         cmovge  %edi, %eax
>         ret
>
> with this patch we instead generate:
>
> foo:    movl    $12345, %eax
>         cmpl    %eax, %edi
>         cmovge  %edi, %eax
>         ret
>
>
> I've also included a peephole2 to avoid the "movl $0,%eax" instructions
> being produced by the register allocator.  Materializing constants as
> late as possible reduces register pressure, but for const0_rtx on x86,
> it's preferable to use "xor" by moving this set from between a flags
> setting operation and its use.  This should also help instruction macro
> fusion on some microarchitectures, where test/cmp and the following
> instruction can sometimes be combined.
>
> Previously, "int foo(int x) { return max(x,0); }" would generate:
>
> foo:    testl   %edi, %edi
>         movl    $0, %eax
>         cmovns  %edi, %eax
>         ret
>
> with this patch we instead generate:
> foo:    xorl    %eax, %eax
>         testl   %edi, %edi
>         cmovns  %edi, %eax
>         ret
>
> The benefits of having min/max explicit at the RTL level can be seen
> from compiling the following example with "-O2 -fno-tree-reassoc".
>
>
> #define max(a,b) (((a) > (b))? (a) : (b))
>
> int foo(int x)
> {
>   int y = max(x,5);
>   return max(y,10);
> }
>
> where GCC currently produces:
>
> foo:    cmpl    $5, %edi
>         movl    $5, %eax
>         movl    $10, %edx
>         cmovge  %edi, %eax
>         cmpl    $10, %eax
>         cmovl   %edx, %eax
>         ret
>
> and with this patch it instead now produces:
>
> foo:    movl    $10, %eax
>         cmpl    %eax, %edi
>         cmovge  %edi, %eax
>         ret
>
>
> The original motivation was from looking at a performance critical
> function in a quantum mechanics code, that performed MIN_EXPR and
> MAX_EXPR of the same arguments (effectively a two-element sort),
> where GCC was performing the comparison twice.  I'd hoped that it
> might be possible to fuse these together, perhaps in combine, but
> this patch is an intermediate step towards that goal.
>
> This patch has been tested on x86_64-pc-linux-gnu with a make
> bootstrap followed by make -k check with no new regressions.
> Ok for mainline?
>
>
> 2020-07-30  Roger Sayle  <roger@nextmovesoftware.com>
>
>         * config/i386/i386.md (MAXMIN_IMODE): No longer needed.
>         (<maxmin><mode>3):  Support SWI248 and general_operand for
>         second operand, when TARGET_CMOVE.
>         (<maxmin><mode>3_1 splitter): Optimize comparisons against
>         0, 1 and -1 to use "test" instead of "cmp".
>         (*<maxmin>di3_doubleword): Likewise, allow general_operand
>         and enable on TARGET_CMOVE.
>         (peephole2): Convert clearing a register after a flag setting
>         instruction into an xor followed by the original flag setter.
>
>
> Many thanks in advance.  Almost all of the credit goes to Uros and
> Richard for already implementing this feature, I've just copied the
> transformations from optab expansion, that allow it to be enabled
> without penalty (this late in the compilation).

LGTM, modulo the below part:

+;; Avoid clearing a register between a flags setting comparison and its use,
+;; i.e. prefer "xorl %eax,%eax; test/cmp" over "test/cmp; movl $0, %eax".
+(define_peephole2
+  [(set (reg FLAGS_REG) (match_operand 0))
+   (set (match_operand:SWI48 1 "register_operand") (const_int 0))]
+  "peep2_regno_dead_p (0, FLAGS_REG)
+   && !reg_overlap_mentioned_p (operands[1], operands[0])"
+  [(parallel [(set (match_dup 1) (const_int 0))
+      (clobber (reg:CC FLAGS_REG))])
+   (set (match_dup 2) (match_dup 0))]
+{
+  operands[2] = gen_rtx_REG (GET_MODE (operands[0]), FLAGS_REG);
+})

Please use ix86_expand_clear (operands[1]); in the preparation
statement instead of emitting RTX pattern. This will take
TARGET_USE_MOV0 and size optimizations into account.

Uros.