70321 – [10/11/12/13 Regression] STV generates less optimized code

Bug 70321 - [10/11/12/13 Regression] STV generates less optimized code

Summary: [10/11/12/13 Regression] STV generates less optimized code

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	6.0

Importance:	P2 normal
Target Milestone:	13.0
Assignee:	Not yet assigned to anyone

URL:	https://gcc.gnu.org/pipermail/gcc-pat...
Keywords:	deferred, missed-optimization, patch

Depends on:
Blocks:

Reported:	2016-03-20 14:40 UTC by H.J. Lu
Modified:	2024-02-01 11:49 UTC (History)
CC List:	6 users (show)

See Also:
Host:
Target:	i386
Build:
Known to work:
Known to fail:
Last reconfirmed:	2016-03-20 00:00:00

Attachments
gcc6-pr70321.patch (860 bytes, patch) 2016-03-22 12:59 UTC, Jakub Jelinek	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description H.J. Lu 2016-03-20 14:40:52 UTC

[hjl@gnu-tools-1 bitwise-1]$ cat z.i
void
foo (long long ixi)
{
  if (ixi != 14348907)
    __builtin_abort ();
}
[hjl@gnu-tools-1 bitwise-1]$ make z.s z1.s
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -O2 -m32 -S -o z.s z.i
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -O2 -m32 -mno-stv -S -o z1.s z.i
[hjl@gnu-tools-1 bitwise-1]$ cat z.s
	.file	"z.i"
	.text
	.p2align 4,,15
	.globl	foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	subl	$12, %esp
	.cfi_def_cfa_offset 16
	movl	20(%esp), %edx
	movl	16(%esp), %eax
	xorb	$0, %dh
	xorl	$14348907, %eax
	movl	%edx, %ecx
	orl	%eax, %ecx
	jne	.L5
	addl	$12, %esp
	.cfi_remember_state
	.cfi_def_cfa_offset 4
	ret
.L5:
	.cfi_restore_state
	call	abort
	.cfi_endproc
.LFE0:
	.size	foo, .-foo
	.ident	"GCC: (GNU) 6.0.0 20160318 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-tools-1 bitwise-1]$ cat z1.s
	.file	"z.i"
	.text
	.p2align 4,,15
	.globl	foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	subl	$12, %esp
	.cfi_def_cfa_offset 16
	movl	16(%esp), %eax
	xorl	$14348907, %eax
	orl	20(%esp), %eax
	jne	.L5
	addl	$12, %esp
	.cfi_remember_state
	.cfi_def_cfa_offset 4
	ret
.L5:
	.cfi_restore_state
	call	abort
	.cfi_endproc
.LFE0:
	.size	foo, .-foo
	.ident	"GCC: (GNU) 6.0.0 20160318 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-tools-1 bitwise-1]$ 

STV generates:

	movl	20(%esp), %edx
	movl	16(%esp), %eax
	xorb	$0, %dh
	xorl	$14348907, %eax
	movl	%edx, %ecx
	orl	%eax, %ecx

vs

	movl	16(%esp), %eax
	xorl	$14348907, %eax
	orl	20(%esp), %eax

Comment 1 H.J. Lu 2016-03-20 14:53:16 UTC

"and" is also less optimized:

[hjl@gnu-tools-1 bitwise-1]$ cat and.i 
extern long long x;

void
foo (long long ixi)
{
  x = ixi & 14348907;
}
[hjl@gnu-tools-1 bitwise-1]$ make and.s
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -fno-asynchronous-unwind-tables -O2 -m32 -S -o and.s and.i
[hjl@gnu-tools-1 bitwise-1]$ make and1.s
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -fno-asynchronous-unwind-tables -O2 -m32 -mno-stv -S -o and1.s and.i
[hjl@gnu-tools-1 bitwise-1]$ cat and.s
	.file	"and.i"
	.text
	.p2align 4,,15
	.globl	foo
	.type	foo, @function
foo:
	movl	4(%esp), %eax
	movl	8(%esp), %edx
	andl	$14348907, %eax
	andl	$0, %edx
	movl	%eax, x
	movl	%edx, x+4
	ret
	.size	foo, .-foo
	.ident	"GCC: (GNU) 6.0.0 20160318 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-tools-1 bitwise-1]$ cat and1.s
	.file	"and.i"
	.text
	.p2align 4,,15
	.globl	foo
	.type	foo, @function
foo:
	movl	4(%esp), %eax
	movl	$0, x+4
	andl	$14348907, %eax
	movl	%eax, x
	ret
	.size	foo, .-foo
	.ident	"GCC: (GNU) 6.0.0 20160318 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-tools-1 bitwise-1]$

Comment 2 H.J. Lu 2016-03-20 14:57:35 UTC

or and xor have the same issue:

[hjl@gnu-tools-1 bitwise-1]$ cat or.i 
extern long long x;

void
foo (long long ixi)
{
  x = ixi | 14348907;
}
[hjl@gnu-tools-1 bitwise-1]$ make or.s
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -fno-asynchronous-unwind-tables -O2 -m32 -S -o or.s or.i
[hjl@gnu-tools-1 bitwise-1]$ make or1.s
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -fno-asynchronous-unwind-tables -O2 -m32 -mno-stv -S -o or1.s or.i
[hjl@gnu-tools-1 bitwise-1]$ cat or.s
	.file	"or.i"
	.text
	.p2align 4,,15
	.globl	foo
	.type	foo, @function
foo:
	movl	4(%esp), %eax
	movl	8(%esp), %edx
	orl	$14348907, %eax
	orb	$0, %dh
	movl	%eax, x
	movl	%edx, x+4
	ret
	.size	foo, .-foo
	.ident	"GCC: (GNU) 6.0.0 20160318 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-tools-1 bitwise-1]$ cat or1.s
	.file	"or.i"
	.text
	.p2align 4,,15
	.globl	foo
	.type	foo, @function
foo:
	movl	4(%esp), %eax
	orl	$14348907, %eax
	movl	%eax, x
	movl	8(%esp), %eax
	movl	%eax, x+4
	ret
	.size	foo, .-foo
	.ident	"GCC: (GNU) 6.0.0 20160318 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-tools-1 bitwise-1]$

Comment 3 H.J. Lu 2016-03-20 16:44:43 UTC

Should STV split DImode early if STV is known to not profitable?

Comment 4 Jakub Jelinek 2016-03-22 12:59:08 UTC

Created attachment 38057 [details]
gcc6-pr70321.patch

I'm afraid for GCC 6.x we can't do much more than attached patch, which improves the generated code, but there is no combiner post RA that would improve it even more.  For GCC 6.x, I think it is out of question to move STV pass somewhere else.

For stage1, I wonder if it can't move earlier, say before the combiner.  If it could, then we could split the TARGET_STV patterns that weren't changed into vector insns, and let the combiner deal with that.  Is there a reason why it is done after the combiner now?

Comment 5 Ilya Enkovich 2016-03-22 13:42:32 UTC

(In reply to Jakub Jelinek from comment #4)
> For stage1, I wonder if it can't move earlier, say before the combiner.  If
> it could, then we could split the TARGET_STV patterns that weren't changed
> into vector insns, and let the combiner deal with that.  Is there a reason
> why it is done after the combiner now?

Yes, STV searches for ANDN pattern and also for DI comparison with zero executed via OR.  Those are produced by combiner.  I believe some STV tests would fail if we move it before combine.

Comment 6 Jakub Jelinek 2016-03-22 13:46:27 UTC

(In reply to Ilya Enkovich from comment #5)
> (In reply to Jakub Jelinek from comment #4)
> > For stage1, I wonder if it can't move earlier, say before the combiner.  If
> > it could, then we could split the TARGET_STV patterns that weren't changed
> > into vector insns, and let the combiner deal with that.  Is there a reason
> > why it is done after the combiner now?
> 
> Yes, STV searches for ANDN pattern and also for DI comparison with zero
> executed via OR.  Those are produced by combiner.  I believe some STV tests
> would fail if we move it before combine.

Why couldn't STV just "vectorize" AND and NOT patterns and let the combiner combine that in the vectorized code?

Comment 7 Ilya Enkovich 2016-03-22 14:06:50 UTC

(In reply to Jakub Jelinek from comment #6)
> Why couldn't STV just "vectorize" AND and NOT patterns and let the combiner
> combine that in the vectorized code?

I think the only thing we miss for that is corresponding NOT pattern in DI mode.

For comparison we have

{r94:SI=r89:DI#4|r89:DI#0;clobber flags:CC;}
flags:CCZ=cmp(r94:SI,0)

combined into

{flags:CCZ=cmp(r89:DI#4|r89:DI#0,0);clobber scratch;}

which is converted into PTEST by STV.

Comment 8 H.J. Lu 2016-03-22 14:45:01 UTC

(In reply to Ilya Enkovich from comment #7)
> (In reply to Jakub Jelinek from comment #6)
> > Why couldn't STV just "vectorize" AND and NOT patterns and let the combiner
> > combine that in the vectorized code?
> 
> I think the only thing we miss for that is corresponding NOT pattern in DI
> mode.

It is PR 70322.

Comment 9 Jakub Jelinek 2016-03-23 09:49:44 UTC

Author: jakub
Date: Wed Mar 23 09:49:12 2016
New Revision: 234416

URL: https://gcc.gnu.org/viewcvs?rev=234416&root=gcc&view=rev
Log:
	PR target/70321
	* config/i386/i386.md (*anddi3_doubleword, *<code>di3_doubleword):
	Optimize TARGET_STV splitters, if high or low word of last argument
	is 0 or -1.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md

Comment 10 Jakub Jelinek 2016-03-23 10:27:54 UTC

Are the changes I've made sufficient enough that we can postpone the rest of this for stage1?

Comment 11 H.J. Lu 2016-03-23 11:48:21 UTC

(In reply to Jakub Jelinek from comment #10)
> Are the changes I've made sufficient enough that we can postpone the rest of
> this for stage1?

Now we got

[hjl@gnu-6 pr70321]$ cat z.i 
void
foo (long long ixi)
{
  if (ixi != 14348907)
    __builtin_abort ();
}
[hjl@gnu-6 pr70321]$ make z.s
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -fno-asynchronous-unwind-tables -O2 -m32 -S -o z.s z.i
[hjl@gnu-6 pr70321]$ make z1.s
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -fno-asynchronous-unwind-tables -O2 -m32 -mno-stv -S -o z1.s z.i
[hjl@gnu-6 pr70321]$ cat z.s
	.file	"z.i"
	.text
	.p2align 4,,15
	.globl	foo
	.type	foo, @function
foo:
	subl	$12, %esp
	movl	16(%esp), %eax
	movl	20(%esp), %edx
	xorl	$14348907, %eax
	movl	%edx, %ecx
	orl	%eax, %ecx
	jne	.L5
	addl	$12, %esp
	ret
.L5:
	call	abort
	.size	foo, .-foo
	.ident	"GCC: (GNU) 6.0.0 20160323 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-6 pr70321]$ cat z1.s
	.file	"z.i"
	.text
	.p2align 4,,15
	.globl	foo
	.type	foo, @function
foo:
	subl	$12, %esp
	movl	16(%esp), %eax
	xorl	$14348907, %eax
	orl	20(%esp), %eax
	jne	.L5
	addl	$12, %esp
	ret
.L5:
	call	abort
	.size	foo, .-foo
	.ident	"GCC: (GNU) 6.0.0 20160323 (experimental)"
	.section	.note.GNU-stack,"",@progbits

It is an improvement.  But we should fix it properly for GCC 7.

Comment 12 Jakub Jelinek 2016-03-23 15:02:10 UTC

Ok, deferring the rest for GCC 7.

Comment 13 Jakub Jelinek 2016-12-09 14:28:49 UTC

So shall we still consider moving 32-bit stv pass before the combiner instead of after it?  The argument about andn is no longer relevant, we have a one_cmpldi3_doubleword pattern now.

Comment 14 Ilya Enkovich 2016-12-09 18:29:00 UTC

(In reply to Jakub Jelinek from comment #13)
> So shall we still consider moving 32-bit stv pass before the combiner
> instead of after it?  The argument about andn is no longer relevant, we have
> a one_cmpldi3_doubleword pattern now.

Previously moving the pass caused regressions. If no regressions appear now then we should do it. It might enable some additional optimizations.

Comment 15 Jakub Jelinek 2017-04-03 19:09:01 UTC

I think it is too late to change this for GCC7, we shall try to do that in early stage1.

Comment 16 Jakub Jelinek 2018-02-28 19:16:06 UTC

We didn't do that in stage1, stage4 is too risky for that.  Can somebody from Intel try that for GCC9 stage1?
I'm willing to help, but will have lots of work on OpenMP 5.0, so can't drive that.

Comment 17 H.J. Lu 2018-02-28 20:06:06 UTC

(In reply to Jakub Jelinek from comment #16)
> We didn't do that in stage1, stage4 is too risky for that.  Can somebody
> from Intel try that for GCC9 stage1?
> I'm willing to help, but will have lots of work on OpenMP 5.0, so can't
> drive that.

We will look into it.  Thanks.

Comment 18 Eric Gallager 2018-06-22 04:37:19 UTC

(In reply to H.J. Lu from comment #17)
> (In reply to Jakub Jelinek from comment #16)
> > We didn't do that in stage1, stage4 is too risky for that.  Can somebody
> > from Intel try that for GCC9 stage1?
> > I'm willing to help, but will have lots of work on OpenMP 5.0, so can't
> > drive that.
> 
> We will look into it.  Thanks.

It's gcc9 stage1 now...

Comment 19 Jakub Jelinek 2019-05-03 09:18:33 UTC

GCC 9.1 has been released.

Comment 20 Jakub Jelinek 2019-08-12 08:58:17 UTC

GCC 9.2 has been released.

Comment 21 Jakub Jelinek 2020-03-12 11:58:53 UTC

GCC 9.3.0 has been released, adjusting target milestone.

Comment 22 Richard Biener 2021-06-01 08:07:39 UTC

GCC 9.4 is being released, retargeting bugs to GCC 9.5.

Comment 23 Roger Sayle 2022-04-19 13:09:52 UTC

Patch proposed at https://gcc.gnu.org/pipermail/gcc-patches/2022-April/593174.html
(waiting for GCC 13 stage 1).

Comment 24 Richard Biener 2022-05-27 09:36:15 UTC

GCC 9 branch is being closed

Comment 25 GCC Commits 2022-05-30 20:21:58 UTC

The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:43201f2c2173894bf7c423cad6da1c21567e06c0

commit r13-855-g43201f2c2173894bf7c423cad6da1c21567e06c0
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Mon May 30 21:20:09 2022 +0100

    PR target/70321: Split double word equality/inequality after STV on x86.
    
    This patch resolves the last piece of PR target/70321 a code quality
    (P2 regression) affecting mainline.  Currently, for HJ's testcase:
    
    void foo (long long ixi)
    {
      if (ixi != 14348907)
        __builtin_abort ();
    }
    
    GCC with -m32 -O2 generates four instructions for the comparison:
    
            movl    16(%esp), %eax
            movl    20(%esp), %edx
            xorl    $14348907, %eax
            orl     %eax, %edx
    
    but with this patch it now requires only three, making better use of
    x86's addressing modes:
    
            movl    16(%esp), %eax
            xorl    $14348907, %eax
            orl     20(%esp), %eax
    
    The solution is to expand "doubleword" equality/inequality expressions
    using flag setting COMPARE instructions for the early RTL passes, and
    then split them during split1, after STV and before reload.
    Hence on x86_64, we now see/allow things like:
    
    (insn 11 8 12 2 (set (reg:CCZ 17 flags)
            (compare:CCZ (reg/v:TI 84 [ x ])
                (reg:TI 96))) "cmpti.c":2:43 30 {*cmpti_doubleword}
    
    This allows the STV pass to decide whether it's preferrable to perform
    this comparison using vector operations, i.e. a pxor/ptest sequence,
    or as scalar integer operations, i.e. a xor/xor/or sequence.  Alas
    this required tweaking of the STV pass to recognize the "new" form of
    these comparisons and split out the pxor operation itself.  To confirm
    this still works as expected I've added a new STV test case:
    
    long long a[1024];
    long long b[1024];
    
    int foo()
    {
      for (int i=0; i<1024; i++)
      {
        long long t = (a[i]<<8) | (b[i]<<24);
        if (t == 0)
          return 1;
      }
      return 0;
    }
    
    where with -m32 -O2 -msse4.1 the above comparison with zero should look
    like:
    
            punpcklqdq      %xmm0, %xmm0
            ptest   %xmm0, %xmm0
    
    Although this patch includes one or two minor tweaks to provide all the
    necessary infrastructure to support conversion of TImode comparisons to
    V1TImode (and SImode comparisons to V4SImode), STV doesn't yet implement
    these transformations, but this is something that can be considered after
    stage 4.  Indeed the new convert_compare functionality is split out
    into a method to simplify its potential reuse by the timode_scalar_chain
    class.
    
    2022-05-30  Roger Sayle  <roger@nextmovesoftware.com>
    
    gcc/ChangeLog
            PR target/70321
            * config/i386/i386-expand.cc (ix86_expand_branch): Don't decompose
            DI mode equality/inequality using XOR here.  Instead generate a
            COMPARE for doubleword modes (DImode on !TARGET_64BIT or TImode).
            * config/i386/i386-features.cc (gen_gpr_to_xmm_move_src): Use
            gen_rtx_SUBREG when NUNITS is 1, i.e. for TImode to V1TImode.
            (general_scalar_chain::convert_compare): New function to convert
            scalar equality/inequality comparison into vector operations.
            (general_scalar_chain::convert_insn) [COMPARE]: Refactor. Call
            new convert_compare helper method.
            (convertible_comparion_p): Update to match doubleword COMPARE
            of two register, memory or integer constant operands.
            * config/i386/i386-features.h (general_scalar_chain::convert_compare):
            Prototype/declare member function here.
            * config/i386/i386.md (cstore<mode>4): Change mode to SDWIM, but
            only allow new doubleword modes for EQ and NE operators.
            (*cmp<dwi>_doubleword): New define_insn_and_split, to split a
            doubleword comparison into a pair of XORs followed by an IOR to
            set the (zero) flags register, optimizing the XORs if possible.
            * config/i386/sse.md (V_AVX): Include V1TI and V2TI in mode
            iterator; V_AVX is (currently) only used by ptest.
            (sse4_1 mode attribute): Update to support V1TI and V2TI.
    
    gcc/testsuite/ChangeLog
            PR target/70321
            * gcc.target/i386/pr70321.c: New test case.
            * gcc.target/i386/sse4_1-stv-1.c: New test case.

Comment 26 Roger Sayle 2022-06-04 09:12:26 UTC

This should now be fixed on mainline.

Comment 27 GCC Commits 2024-02-01 11:49:29 UTC

The master branch has been updated by Rainer Orth <ro@gcc.gnu.org>:

https://gcc.gnu.org/g:ee7011eab67d8272f88e954bbab2e42bf6353441

commit r14-8690-gee7011eab67d8272f88e954bbab2e42bf6353441
Author: Rainer Orth <ro@CeBiTec.Uni-Bielefeld.DE>
Date:   Thu Feb 1 12:48:57 2024 +0100

    testsuite: i386: Fix gcc.target/i386/pr70321.c on 32-bit Solaris/x86
    
    gcc.target/i386/pr70321.c FAILs on 32-bit Solaris/x86 since its
    introduction in
    
    commit 43201f2c2173894bf7c423cad6da1c21567e06c0
    Author: Roger Sayle <roger@nextmovesoftware.com>
    Date:   Mon May 30 21:20:09 2022 +0100
    
        PR target/70321: Split double word equality/inequality after STV on x86.
    
    FAIL: gcc.target/i386/pr70321.c scan-assembler-times mov 1
    
    The failure happens because 32-bit Solaris/x86 defaults to
    -fno-omit-frame-pointer.
    
    Fixed by specifying -fomit-frame-pointer explicitly.
    
    Tested on i386-pc-solaris2.11 and i686-pc-linux-gnu.
    
    2024-01-23  Rainer Orth  <ro@CeBiTec.Uni-Bielefeld.DE>
    
            gcc/testsuite:
            * gcc.target/i386/pr70321.c: Add -fomit-frame-pointer to
            dg-options.