Bug 53346 - [4.6/4.7/4.8 Regression] Bad if conversion in cptrf2 of rnflow.f90
Summary: [4.6/4.7/4.8 Regression] Bad if conversion in cptrf2 of rnflow.f90
Status: RESOLVED DUPLICATE of bug 54073
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.8.0
: P2 normal
Target Milestone: 4.8.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: cmov
  Show dependency treegraph
 
Reported: 2012-05-14 15:40 UTC by Dominique d'Humieres
Modified: 2023-04-20 10:20 UTC (History)
8 users (show)

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-05-17 00:00:00


Attachments
slow x86_64 assembly, obtained with -O2 -ftree-loop-if-convert (38.73 KB, text/plain)
2012-05-18 17:08 UTC, Uroš Bizjak
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Dominique d'Humieres 2012-05-14 15:40:33 UTC
At revision 187457 (i.e., with pr53340 fixed) on x86_64-apple-darwin10, after

[macbook] test/dbg_rnflow% gfc -c -O3 -ffast-math -funroll-loops timctr.f90 cmpcpt.f90 cptrf2.f90 dger.f90 dgetri.f90 dswap.f90 dtrsm.f90 evlrnf.f90 idamax.f90 main.f90 mattrs.f90 cmpmat.f90 dgemm.f90 dgetf2.f90 dlaswp.f90 dtrmm.f90 dtrti2.f90 extpic.f90 ilaenv.f90 matcnt.f90 reaseq.f90 xerbla.f90 cptrf1.f90 dgemv.f90 dgetrf.f90 dscal.f90 dtrmv.f90 dtrtri.f90 gentrs.f90 lsame.f90 matsim.f90
[macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null                                                                                             23.872u 0.349s 0:24.22 99.9%	0+0k 0+0io 0pf+0w[macbook] test/dbg_rnflow% /opt/gcc/gcc4.8p-187339/bin/gfortran -c -O3 -ffast-math -funroll-loops evlrnf.f90
[macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null
22.259u 0.346s 0:22.61 99.9%	0+0k 0+0io 0pf+0w
[macbook] test/dbg_rnflow% /opt/gcc/gcc4.8p-187291/bin/gfortran -c -O3 -ffast-math -funroll-loops idamax.f90
[macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null
22.252u 0.345s 0:22.60 99.9%	0+0k 0+0io 0pf+0w
[macbook] test/dbg_rnflow% /opt/gcc/gcc4.8p-187102/bin/gfortran -c -O3 -ffast-math -funroll-loops idamax.f90
[macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null
22.121u 0.346s 0:22.47 99.9%	0+0k 0+0io 0pf+0w

(i.e., working around prpr53342 and a regression for idamax.f90, see 
below), the compilation of cptrf2.f90 (source attached to pr53340) with the following flags yiels

optimization level      4.4.6   4.5.3   4.6.3   4.7.0   r187457

-O2                      27.8    28.2    28.2    21.8    21.8
-O2 -ftree-vectorize     27.8    28.2    28.2    27.9    27.9
-O3                      22.0    21.3    25.1    25.3    25.3
-O3 -fno-tree-vectorize  22.1    21.3    21.4    21.4    21.4

Note that 4.5/4.6/4.7 vectorize two loops (lines 21 and 29), while 4.8 vectorizes only the loop at line 21 (29: not vectorized: iteration count too small.).

Looking at my archives I have found that a first regression appeared 
between revisions 162456 and 164728

optimization level      4.6-162456 4.6p-164728

-O2                             28.2    28.3
-O2 -ftree-vectorize            28.1    28.3
-O3                             21.4    29.4
-O3 -fno-tree-vectorize         21.3    21.4
-O3 -ffast-math                 21.4    22.3
-O3 -ffast-math -funroll-loops  21.9    22.4

For the record, as said above the compilation of idamax regressed between 
revisions 187102 and 187291

[macbook] test/dbg_rnflow% /opt/gcc/gcc4.8p-187291/bin/gfortran -c -O3 -ffast-math -funroll-loops idamax.f90
[macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null
22.252u 0.345s 0:22.60 99.9%	0+0k 0+0io 0pf+0w
[macbook] test/dbg_rnflow% /opt/gcc/gcc4.8p-187102/bin/gfortran -c -O3 -ffast-math -funroll-loops idamax.f90
[macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null
22.121u 0.346s 0:22.47 99.9%	0+0k 0+0io 0pf+0w

Although the regression is slightly above the noise margin at the level of 
rnflow.f90, it could be worth to investigate it because:
(1) it is a LAPACK routine (may be slightly modified),
(2) there equivalent intrinsics in F90,
(3) the slowdown may be quite significant at the level of the proc itself.
Comment 1 Richard Biener 2012-05-15 09:43:31 UTC
Do you possibly have a testcase?
Comment 2 Dominique d'Humieres 2012-05-15 12:39:50 UTC
> Do you possibly have a testcase?

I am not sure to understand what you ask for. 

The source for cptrf2.f90 has been attached to pr53340. I can provide a version of rnflow without the proc cptrf2 or an archive with the rnflow.f90 source split to one file per proc.

If you ask for a reduced test, it is much more difficult:
(1) the code is not mine and I don't know it well,
(2) optimizations may change for tiny details of the source layout.
Comment 3 Uroš Bizjak 2012-05-17 18:29:12 UTC
Confirmed, -O2 vs. -O2 -ftree-vectorize on x86_64:

-O2 -ftree-vectorize:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 43.83      9.73     9.73       64     0.15     0.15  cptrf2_
 40.68     18.76     9.03     6685     0.00     0.00  trs2a2.2054
  7.70     20.47     1.71       64     0.03     0.03  gentrs_
  1.49     20.80     0.33       64     0.01     0.01  cptrf1_
  1.40     21.11     0.31        1     0.31    12.33  matsim_
  1.40     21.42     0.31     6685     0.00     0.00  invima.2045
  1.13     21.67     0.25       64     0.00     0.00  cmpcpt_

-O2:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 55.20      9.20     9.20     6685     0.00     0.00  trs2a2.2054
 23.40     13.10     3.90       64     0.06     0.06  cptrf2_
 10.38     14.83     1.73       64     0.03     0.03  gentrs_
  2.58     15.26     0.43       64     0.01     0.01  cptrf1_
  2.34     15.65     0.39     6685     0.00     0.00  invima.2045
  1.98     15.98     0.33        1     0.33     6.58  matsim_
  1.14     16.17     0.19       64     0.00     0.00  cmpcpt_

cptrf2_ runtime increased for almost 6 seconds!

The only vectorization is in:

3530: LOOP VECTORIZED.
rnflow.f90:3510: note: vectorized 1 loops in function.

Which corresponds to:

! ______________________________________________________________________
      real, dimension (1:nxtr), intent (in)     :: xxtrt ! extrema
      integer, intent (in)                      :: nxtr  ! leur nombre
      integer, dimension (1:nxtr), intent (out) :: ixtrt ! indices
      integer, intent (out)                     :: kerr  ! code d'erreur
! ______________________________________________________________________
!
      kerr = 0
      ixtrt = 0                  <<<<<<<<<<<<<< HERE

This vectorization results in zeroing of certain memory area:

	pxor	%xmm0, %xmm0
	leaq	(%rdx,%r8,4), %r8
	xorl	%esi, %esi
	.p2align 4,,10
	.p2align 3
.L183:
	addq	$1, %rsi
	movdqa	%xmm0, (%r8)
	addq	$16, %r8
	cmpq	%rsi, %r11
	ja	.L183

And this causes 6 second difference ?!
Comment 4 Uroš Bizjak 2012-05-17 20:09:42 UTC
Instead of this:

.L228:
	movl	$0, -4(%rdx,%rax,4)
	addq	$1, %rax
	cmpq	%rax, %rsi
	jge	.L228

vectorization generates following:

	movq	%rdx, %rax
	movq	%r9, %r8
	andl	$15, %eax
	shrq	$2, %rax
	negq	%rax
	andl	$3, %eax
	cmpq	%r9, %rax
	cmovbe	%rax, %r8
	cmpq	$6, %r9
	cmovbe	%r9, %r8
	testq	%r8, %r8
	je	.L233
	leaq	1(%r8), %rsi
	movl	$1, %eax
	.p2align 4,,10
	.p2align 3
.L176:
	movl	$0, -4(%rdx,%rax,4)
	addq	$1, %rax
	cmpq	%rsi, %rax
	jne	.L176
	cmpq	%r9, %r8
	je	.L182
.L174:
	movq	%r9, %rbp
	subq	%r8, %rbp
	movq	%rbp, %r11
	shrq	$2, %r11
	leaq	0(,%r11,4), %rbx
	testq	%rbx, %rbx
	je	.L181
	pxor	%xmm0, %xmm0
	leaq	(%rdx,%r8,4), %r8
	xorl	%esi, %esi
	.p2align 4,,10
	.p2align 3
.L183:
	addq	$1, %rsi
	movdqa	%xmm0, (%r8)
	addq	$16, %r8
	cmpq	%rsi, %r11
	ja	.L183
	addq	%rbx, %rax
	cmpq	%rbx, %rbp
	je	.L182
	.p2align 4,,10
	.p2align 3
.L181:
	movl	$0, -4(%rdx,%rax,4)
	addq	$1, %rax
	cmpq	%rax, %r9
	jge	.L181

Whoa.
Comment 5 Richard Biener 2012-05-18 11:02:53 UTC
Yeah, this is sort-of related to what is observed in PR53355.  I suppose at
runtime nxtr is comparatively small.

Reduced testcase:

      subroutine cptrf2 (nxtr, ixtrt)
      integer, dimension (1:nxtr), intent (out) :: ixtrt
      ixtrt = 0
      end subroutine

we peel the loop to possibly align the stores, and we peel the loop
to possibly take care of a remaining store at the end of the array.
And of course we compute that we need at least 6 scalar iterations
to make executing the vectorized loop profitable.

And apart from all that we should have recognized the loop as memset.

Mine.
Comment 6 Richard Biener 2012-05-18 13:10:11 UTC
Author: rguenth
Date: Fri May 18 13:10:01 2012
New Revision: 187655

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=187655
Log:
2012-05-18  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/53346
	* tree-loop-distribution.c (ldist_gen): Make sure to apply
	builtin transform even when only a single partition with
	all reads/writes exists.

	* gcc.dg/tree-ssa/ldist-18.c: New testcase.
	* gcc.target/i386/incoming-10.c: Adjust.
	* gcc.target/i386/incoming-11.c: Likewise.
	* gcc.target/i386/pr46295.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-18.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/incoming-10.c
    trunk/gcc/testsuite/gcc.target/i386/incoming-11.c
    trunk/gcc/testsuite/gcc.target/i386/pr46295.c
    trunk/gcc/tree-loop-distribution.c
Comment 7 Richard Biener 2012-05-18 13:10:28 UTC
Fixed.
Comment 8 Uroš Bizjak 2012-05-18 14:46:01 UTC
(In reply to comment #7)
> Fixed.

Unfortunately, the loop in original rnflow test still gets vectorized, with no change in the runtime:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 43.46      9.69     9.69       64     0.15     0.15  cptrf2_
 40.63     18.75     9.06     6685     0.00     0.00  trs2a2.2054
  7.89     20.51     1.76       64     0.03     0.03  gentrs_
  2.02     20.96     0.45     6685     0.00     0.00  invima.2045
  1.93     21.39     0.43       64     0.01     0.01  cptrf1_
  1.17     21.65     0.26        1     0.26    12.36  matsim_
  0.99     21.87     0.22       64     0.00     0.00  cmpcpt_

GNU Fortran (GCC) version 4.8.0 20120518 (experimental) [trunk revision 187655]
Comment 9 Dominique d'Humieres 2012-05-18 14:49:22 UTC
> Unfortunately, the loop in original rnflow test still gets vectorized, with no
> change in the runtime:

Confirmed, at revision 187655 I still get

-O2                      21.8
-O2 -ftree-vectorize     27.9
-O3                      25.2
-O3 -fno-tree-vectorize  21.4

Uneducated guess: is it possible that failed attempts to vectorize may mess up further optimizations?
Comment 10 Uroš Bizjak 2012-05-18 15:11:53 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > Fixed.
> 
> Unfortunately, the loop in original rnflow test still gets vectorized, with no
> change in the runtime:

With -O2 -ftree-loop-distribute-patterns -ftree-vectorize, the runtime is still the same:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 43.76      9.70     9.70       64     0.15     0.15  cptrf2_
 40.69     18.72     9.02     6685     0.00     0.00  trs2a2.2054
  7.35     20.35     1.63       64     0.03     0.03  gentrs_
  2.21     20.84     0.49       64     0.01     0.01  cptrf1_
  1.44     21.16     0.32        1     0.32    12.32  matsim_
  1.17     21.42     0.26     6685     0.00     0.00  invima.2045
  0.81     21.60     0.18       64     0.00     0.00  cmpcpt_
  0.54     21.72     0.12        1     0.12     9.85  evlrnf_
Comment 11 Uroš Bizjak 2012-05-18 16:04:46 UTC
(In reply to comment #9)

> Uneducated guess: is it possible that failed attempts to vectorize may mess up
> further optimizations?

You are right. -ftree-vectorize implies -ftree-loop-if-convert and this option makes all the difference!

-O2 -ftree-vectorize:

real    0m24.061s
user    0m23.789s
sys     0m0.225s

-O2 -ftree-vectorize -fno-tree-loop-if-convert

real    0m18.029s
user    0m17.761s
sys     0m0.220s

We were barking up to the wrong tree. ;)
Comment 12 Uroš Bizjak 2012-05-18 16:07:45 UTC
(In reply to comment #11)

> You are right. -ftree-vectorize implies -ftree-loop-if-convert and this option
> makes all the difference!
> 
> -O2 -ftree-vectorize:
> 
> real    0m24.061s
> user    0m23.789s
> sys     0m0.225s
> 
> -O2 -ftree-vectorize -fno-tree-loop-if-convert
> 
> real    0m18.029s
> user    0m17.761s
> sys     0m0.220s

-O2 -ftree-loop-if-convert:

real    0m24.034s
user    0m23.770s
sys     0m0.218s

-O2

real    0m18.163s
user    0m17.892s
sys     0m0.233s
Comment 13 Uroš Bizjak 2012-05-18 17:08:08 UTC
Created attachment 27435 [details]
slow x86_64 assembly, obtained with -O2 -ftree-loop-if-convert

This is the slow assembly, stay tuned for the WTF part.
Comment 14 Uroš Bizjak 2012-05-18 17:17:45 UTC
Compile and execute slow assembly:

gfortran rnflow.s && time ./a.out

real    0m24.454s
user    0m24.167s
sys     0m0.231s

Apply following patch that changes cmove in very fast loops (cptrf2) to jumps:

--cut here--
--- rnflow.s    2012-05-18 19:00:22.314102061 +0200
+++ rnflow1.s   2012-05-18 19:10:59.363428625 +0200
@@ -1305,7 +1305,9 @@
        movslq  %edx, %rbx
        movss   -4(%rdi,%rbx,4), %xmm0
        ucomiss (%r9), %xmm0
-       cmova   %ecx, %edx
+       jbe     .L183x
+       movl    %ecx, %edx
+.L183x:
        subl    $1, %ecx
        subq    $4, %r9
        cmpl    %r10d, %ecx
@@ -1329,7 +1331,9 @@
        movslq  %ecx, %r10
        movss   -4(%rdi,%r10,4), %xmm0
        ucomiss (%r9), %xmm0
-       cmova   %r11d, %ecx
+       jbe     .L192x
+       movl    %r11d, %ecx
+.L192x:
        subl    $1, %r11d
        subq    $4, %r9
        cmpl    %eax, %r11d
@@ -1485,7 +1489,9 @@
        movslq  %edx, %r10
        movss   -4(%rdi,%r10,4), %xmm0
        ucomiss (%r9), %xmm0
-       cmova   %ecx, %edx
+       jbe     .L179x
+       movl    %ecx, %edx
+.L179x:
        subq    $4, %r9
        subl    $1, %ecx
        jne     .L179
--cut here--

gfortran rnflow.s && time ./a.out

real    0m18.170s
user    0m17.907s
sys     0m0.223s

WTF happened here?!

Relevant part of my /proc/cpuinfo:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 42

Adding CC.
Comment 15 Andrew Pinski 2012-05-18 17:54:16 UTC
(In reply to comment #14)
> Compile and execute slow assembly:
> real    0m18.170s
> user    0m17.907s
> sys     0m0.223s
> 
> WTF happened here?!

Are conditional moves that bad on x86?  The change which uses them more for COND_EXPR was mine but really I think this was a latent bug or a way to say chose conditional move over jumps for some targets.
Comment 16 Uroš Bizjak 2012-05-18 18:24:43 UTC
Perf confirms this findings, the first loop:

    0.02 :          401e10:       movslq %edx,%rbx
    5.04 :          401e13:       movss  -0x4(%rdi,%rbx,4),%xmm0
   24.97 :          401e19:       ucomiss (%r9),%xmm0
   14.66 :          401e1d:       cmova  %ecx,%edx
   15.37 :          401e20:       sub    $0x1,%ecx
    0.00 :          401e23:       sub    $0x4,%r9
    0.00 :          401e27:       cmp    %r10d,%ecx
    0.00 :          401e2a:       jne    401e10 <cptrf2_+0x230>

the second:

    0.00 :          401e60:       movslq %ecx,%r10
    1.69 :          401e63:       movss  -0x4(%rdi,%r10,4),%xmm0
    7.78 :          401e6a:       ucomiss (%r9),%xmm0
    4.75 :          401e6e:       cmova  %r11d,%ecx
    4.52 :          401e72:       sub    $0x1,%r11d
    0.00 :          401e76:       sub    $0x4,%r9
    0.05 :          401e7a:       cmp    %eax,%r11d
    0.00 :          401e7d:       jne    401e60 <cptrf2_+0x280>

the third:

    0.00 :          401ff8:       movslq %edx,%r10
    0.78 :          401ffb:       movss  -0x4(%rdi,%r10,4),%xmm0
    3.14 :          402002:       ucomiss (%r9),%xmm0
    2.04 :          402006:       cmova  %ecx,%edx
    1.89 :          402009:       sub    $0x4,%r9
    0.00 :          40200d:       sub    $0x1,%ecx
    0.00 :          402010:       jne    401ff8 <cptrf2_+0x418>
Comment 17 H.J. Lu 2012-05-18 18:27:21 UTC
I was told that cmov wins if branch is mispredicted, otherwise
cmov loses.  We will investigate if we can improve cmov in GCC.
Comment 18 Dominique d'Humieres 2012-05-18 18:29:06 UTC
> Are conditional moves that bad on x86?  The change which uses them more for
> COND_EXPR was mine but really I think this was a latent bug or a way to say
> chose conditional move over jumps for some targets.

As said in comment #0 the first regression appeared  between revisions 162456 (2010-07-23) and 164728 (2010-09-29), so the problem is fairly old

[macbook] test/dbg_rnflow% /opt/gcc/gcc4.6p-162456/bin/gfortran -c -O3 cptrf2.f90
[macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null
20.904u 0.345s 0:21.26 99.9%	0+0k 0+0io 0pf+0w
[macbook] test/dbg_rnflow% /opt/gcc/gcc4.6p-162456/bin/gfortran -c -O3 -fno-tree-loop-if-convert cptrf2.f90
[macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null
20.898u 0.341s 0:21.24 99.9%	0+0k 0+0io 0pf+0w
[macbook] test/dbg_rnflow% /opt/gcc/gcc4.6p-164728/bin/gfortran -c -O3 cptrf2.f90
[macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null
28.607u 0.346s 0:28.96 99.9%	0+0k 0+0io 0pf+0w
[macbook] test/dbg_rnflow% /opt/gcc/gcc4.6p-164728/bin/gfortran -c -O3 -fno-tree-loop-if-convert cptrf2.f90
[macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null
21.153u 0.342s 0:21.50 99.9%	0+0k 0+0io 0pf+0w
Comment 19 Dominique d'Humieres 2012-05-19 22:19:02 UTC
The change in timing occured at revision 163998

Author:	matz
Date:	Wed Sep 8 12:34:52 2010 UTC (20 months, 1 week ago)
Changed paths:	4
Log Message:	
	PR tree-optimization/33244
	* tree-ssa-sink.c (statement_sink_location): Don't sink into
	empty loop latches.

testsuite/
	PR tree-optimization/33244
	* gfortran.dg/vect/fast-math-vect-8.f90: New test.

[macbook] test/dbg_rnflow% /opt/gcc/gcc4.6p-163997/bin/gfortran -c -O3 cptrf2.f90
[macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null
20.881u 0.345s 0:21.37 99.2%	0+0k 3+0io 0pf+0w
[macbook] test/dbg_rnflow% /opt/gcc/gcc4.6p-163998/bin/gfortran -c -O3 cptrf2.f90
[macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null
28.545u 0.351s 0:29.06 99.4%	0+0k 3+0io 0pf+0w
Comment 20 Richard Biener 2012-09-07 11:58:31 UTC
This turned into a target bug about cmov.
Comment 21 Jan Hubicka 2012-11-14 22:18:53 UTC
Well, as I wrote to the other PR, the main problem of cmov is extension of dependency chain.  For well predicted sequence with conditional jump there is no update of rbs so the loop executes faster, because the loads/stores/comparisons executes "in parallel". The load in the next iteration can then happen speculatively before the condition from previous iteration is resolved. With cmov in it, there is dependence on rbx for all the other computations in the loop.

I guess there is no localy available information suggesting suggesting that the particular branch is well predictable, at least without profile feedback (where we won't disable the conversion anyway).

I wonder
 1) why the conversion to cmov do not happen on RTL if conversion pass
 2) whether we can do something to detect similar patterns and possibly disable cmovs on them...
Comment 22 Jan Hubicka 2012-11-14 22:38:19 UTC
OK, similar loop in C looks like:
float a[10000];
float b[10000];
t()
{
  int mi = 0,i;
  for (i=0;i<1000;i++)
    if (a[i]<b[i])
      mi = i;
  return mi;
}
and the why we do not ifconvert at RTl level is that the condition is UNLE that do not pass unordered_comparsion_operator.  This was noticed by Jakub in other PR, we do not really need to test unorderedness here since expander knows how to handle it.  So this was more by chance than by design. I am testing
Index: config/i386/i386.md
===================================================================
--- config/i386/i386.md (revision 193503)
+++ config/i386/i386.md (working copy)
@@ -964,7 +964,7 @@
        (compare:CC (match_operand:SDWIM 1 "nonimmediate_operand")
                    (match_operand:SDWIM 2 "<general_operand>")))
    (set (pc) (if_then_else
-              (match_operator 0 "ordered_comparison_operator"
+              (match_operator 0 "comparison_operator"
                [(reg:CC FLAGS_REG) (const_int 0)])
               (label_ref (match_operand 3))
               (pc)))]
@@ -982,7 +982,7 @@
        (compare:CC (match_operand:SWIM 2 "nonimmediate_operand")
                    (match_operand:SWIM 3 "<general_operand>")))
    (set (match_operand:QI 0 "register_operand")
-       (match_operator 1 "ordered_comparison_operator"
+       (match_operator 1 "comparison_operator"
          [(reg:CC FLAGS_REG) (const_int 0)]))]
   ""
 {
@@ -16120,7 +16120,7 @@
 
 (define_expand "mov<mode>cc"
   [(set (match_operand:SWIM 0 "register_operand")
-       (if_then_else:SWIM (match_operand 1 "ordered_comparison_operator")
+       (if_then_else:SWIM (match_operand 1 "comparison_operator")
                           (match_operand:SWIM 2 "<general_operand>")
                           (match_operand:SWIM 3 "<general_operand>")))]
   ""
Comment 23 Andrew Pinski 2012-12-31 09:19:50 UTC
(In reply to comment #22)
If the patch referenced in comment #22 fixes this bug, then it is a dup of bug 54073.  Can someone confirm if this has been fixed on the trunk now?
Comment 24 Andrew Pinski 2012-12-31 09:40:29 UTC
Fixed aka a dup of bug 54073.

*** This bug has been marked as a duplicate of bug 54073 ***
Comment 25 GCC Commits 2022-09-26 03:22:07 UTC
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:3db8e9c2422d924a958336fd0871b24cce3e65d1

commit r13-2843-g3db8e9c2422d924a958336fd0871b24cce3e65d1
Author: liuhongt <hongtao.liu@intel.com>
Date:   Wed Sep 21 14:56:08 2022 +0800

    Support 2-instruction vector shuffle for V4SI/V4SF in ix86_expand_vec_perm_const_1.
    
    2022-09-23  Hongtao Liu  <hongtao.liu@intel.com>
                Liwei Xu  <liwei.xu@intel.com>
    
    gcc/ChangeLog:
    
            PR target/53346
            * config/i386/i386-expand.cc (expand_vec_perm_shufps_shufps):
            New function.
            (ix86_expand_vec_perm_const_1): Insert
            expand_vec_perm_shufps_shufps at the end of 2-instruction
            expand sequence.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/i386/pr53346-1.c: New test.
            * gcc.target/i386/pr53346-2.c: New test.
            * gcc.target/i386/pr53346-3.c: New test.
            * gcc.target/i386/pr53346-4.c: New test.
Comment 26 Hongtao.liu 2022-09-26 03:24:18 UTC
(In reply to CVS Commits from comment #25)
> The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:
> 
> https://gcc.gnu.org/g:3db8e9c2422d924a958336fd0871b24cce3e65d1
> 
> commit r13-2843-g3db8e9c2422d924a958336fd0871b24cce3e65d1
> Author: liuhongt <hongtao.liu@intel.com>
> Date:   Wed Sep 21 14:56:08 2022 +0800
> 
>     Support 2-instruction vector shuffle for V4SI/V4SF in
> ix86_expand_vec_perm_const_1.
>     
>     2022-09-23  Hongtao Liu  <hongtao.liu@intel.com>
>                 Liwei Xu  <liwei.xu@intel.com>
>     
>     gcc/ChangeLog:
>     
>             PR target/53346
>             * config/i386/i386-expand.cc (expand_vec_perm_shufps_shufps):
>             New function.
>             (ix86_expand_vec_perm_const_1): Insert
>             expand_vec_perm_shufps_shufps at the end of 2-instruction
>             expand sequence.
>     
>     gcc/testsuite/ChangeLog:
>     
>             * gcc.target/i386/pr53346-1.c: New test.
>             * gcc.target/i386/pr53346-2.c: New test.
>             * gcc.target/i386/pr53346-3.c: New test.
>             * gcc.target/i386/pr53346-4.c: New test.

Sorry, it should be PR54346