This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- From: Richard Biener <rguenther at suse dot de>
- To: Uros Bizjak <ubizjak at gmail dot com>
- Cc: Martin Jambor <mjambor at suse dot cz>, "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, Jakub Jelinek <jakub at redhat dot com>, Vladimir Makarov <vmakarov at redhat dot com>
- Date: Thu, 1 Aug 2019 11:28:11 +0200 (CEST)
- Subject: Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- References: <alpine.LSU.2.20.1907231548290.30921@zhemvz.fhfr.qr> <ri6o91i32a4.fsf@suse.cz> <alpine.LSU.2.20.1907251405590.4958@zhemvz.fhfr.qr> <CAFULd4ZEAOsRSGZ5SD7kOz2NLmkvkfd6X6v=vkTm99=NgxXZiw@mail.gmail.com> <CAFULd4b9LXOZyqLSSS5mz1r7xe71LiwXFAgm5=VZFRLE57E2cw@mail.gmail.com> <alpine.LSU.2.20.1907311309240.19626@zhemvz.fhfr.qr> <CAFULd4YHaEZsYn8Ym_F61QZ52hFyG4bRwJPU6y3qtSj39xRBuw@mail.gmail.com>
On Thu, 1 Aug 2019, Uros Bizjak wrote:
> On Wed, Jul 31, 2019 at 1:21 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Sat, 27 Jul 2019, Uros Bizjak wrote:
> >
> > > On Sat, Jul 27, 2019 at 12:07 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> > >
> > > > > How would one write smaxsi3 as a splitter to be split after
> > > > > reload in the case LRA assigned the GPR alternative? Is it
> > > > > even worth doing? Even the SSE reg alternative can be split
> > > > > to remove the not needed CC clobber.
> > > > >
> > > > > Finally I'm unsure about the add where I needed to place
> > > > > the SSE alternative before the 2nd op memory one since it
> > > > > otherwise gets the same cost and wins.
> > > > >
> > > > > So - how to go forward with this?
> > > >
> > > > Sorry to come a bit late to the discussion.
> > > >
> > > > We are aware of CMOV issue for quite some time, but the issue is not
> > > > understood yet in detail (I was hoping for Intel people to look at
> > > > this). However, you demonstrated that using PMAX and PMIN instead of
> > > > scalar CMOV can bring us big gains, and this thread now deals on how
> > > > to best implement PMAX/PMIN for scalar code.
> > > >
> > > > I think that the way to go forward is with STV infrastructure.
> > > > Currently, the implementation only deals with DImode on SSE2 32bit
> > > > targets, but I see no issues on using STV pass also for SImode (on
> > > > 32bit and 64bit targets). There are actually two STV passes, the first
> > > > one (currently run on 64bit targets) is run before cse2, and the
> > > > second (which currently runs on 32bit SSE2 only) is run after combine
> > > > and before split1 pass. The second pass is interesting to us.
> > > >
> > > > The base idea of the second STV pass (for 32bit targets!) is that we
> > > > introduce a DImode _doubleword instructons that otherwise do not exist
> > > > with integer registers. Now, the passes up to and including combine
> > > > pass can use these instructions to simplify and optimize the insn
> > > > flow. Later, based on cost analysis, STV pass either converts the
> > > > _doubleword instructions to a real vector ones (e.g. V2DImode
> > > > patterns) or leaves them intact, and a follow-up split pass splits
> > > > them into scalar SImode instruction pairs. STV pass also takes care to
> > > > move and preload values from their scalar form to a vector
> > > > representation (using SUBREGs). Please note that all this happens on
> > > > pseudos, and register allocator will later simply use scalar (integer)
> > > > registers in scalar patterns and vector registers with vector insn
> > > > patterns.
> > > >
> > > > Your approach to amend existing scalar SImode patterns with vector
> > > > registers will introduce no end of problems. Register allocator will
> > > > do funny things during register pressure, where values will take a
> > > > trip to a vector register before being stored to memory (and vice
> > > > versa, you already found some of them). Current RA simply can't
> > > > distinguish clearly between two register sets.
> > > >
> > > > So, my advice would be to use STV pass also for SImode values, on
> > > > 64bit and 32bit targets. On both targets, we will be able to use
> > > > instructions that operate on vector register set, and for 32bit
> > > > targets (and to some extent on 64bit targets), we would perhaps be
> > > > able to relax register pressure in a kind of controlled way.
> > > >
> > > > So, to demonstrate the benefits of existing STV pass, it should be
> > > > relatively easy to introduce 64bit max/min pattern on 32bit target to
> > > > handle 64bit values. For 32bit values, the pass should be re-run to
> > > > convert SImode scalar operations to vector operations in a controlled
> > > > way, based on various cost functions.
> >
> > I've looked at STV before trying to use RA to solve the issue but
> > quickly stepped away because of its structure which seems to be
> > tied to particular modes, duplicating things for TImode and DImode
> > so it looked like I have to write up everything again for SImode...
>
> ATM, DImode is used exclusively for x86_32 while TImode is used
> exclusively for x86_64. Also, TImode is used for different purpose
> before combine, while DImode is used after combine. I don't remember
> the details, but IIRC it made sense for the intended purpose.
> >
> > It really should be possible to run the pass once, handling a set
> > of modes rather than re-running it for the SImode case I am after.
> > See also a recent PR about STV slowness and tendency to hog memory
> > because it seems to enable every DF problem that is around...
>
> Huh, I was not aware of implementation details...
>
> > > Please find attached patch to see STV in action. The compilation will
> > > crash due to non-existing V2DImode SMAX insn, but in the _.268r.stv2
> > > dump, you will be able to see chain building, cost calculation and
> > > conversion insertion.
> >
> > So you unconditionally add a smaxdi3 pattern - indeed this looks
> > necessary even when going the STV route. The actual regression
> > for the testcase could also be solved by turing the smaxsi3
> > back into a compare and jump rather than a conditional move sequence.
> > So I wonder how you'd do that given that there's pass_if_after_reload
> > after pass_split_after_reload and I'm not sure we can split
> > as late as pass_split_before_sched2 (there's also a split _after_
> > sched2 on x86 it seems).
> >
> > So how would you go implement {s,u}{min,max}{si,di}3 for the
> > case STV doesn't end up doing any transform?
>
> If STV doesn't transform the insn, then a pre-reload splitter splits
> the insn back to compare+cmove.
OK, that would work. But there's no way to force a jumpy sequence then
which we know is faster than compare+cmove because later RTL
if-conversion passes happily re-discover the smax (or conditional move)
sequence.
> However, considering the SImode move
> from/to int/xmm register is relatively cheap, the cost function should
> be tuned so that STV always converts smaxsi3 pattern.
Note that on both Zen and even more so bdverN the int/xmm transition
makes it no longer profitable but a _lot_ slower than the cmp/cmov
sequence... (for the loop in hmmer which is the only one I see
any effect of any of my patches). So identifying chains that
start/end in memory is important for cost reasons.
So I think the splitting has to happen after the last if-conversion
pass (and thus we may need to allocate a scratch register for this
purpose?)
> (As said before,
> the fix of the slowdown with consecutive cmov insns is a side effect
> of the transformation to smax insn that helps in this particular case,
> I think that this issue should be fixed in a general way, there are
> already a couple of PRs reported).
>
> > You could save me some guesswork here if you can come up with
> > a reasonably complete final set of patterns (ok, I only care
> > about smaxsi3) so I can have a look at the STV approach again
> > (you may remember I simply "split" at assembler emission time).
>
> I think that the cost function should always enable smaxsi3
> generation. To further optimize STV chain (to avoid unnecessary
> xmm<->int transitions) we could add all integer logic, arithmetic and
> constant shifts to the candidates (the ones that DImode STV converts).
>
> Uros.
>
> > Thanks,
> > Richard.
> >
> > > The testcase:
> > >
> > > --cut here--
> > > long long test (long long a, long long b)
> > > {
> > > return (a > b) ? a : b;
> > > }
> > > --cut here--
> > >
> > > gcc -O2 -m32 -msse2 (-mstv):
> > >
> > > _.268r.stv2 dump:
> > >
> > > Searching for mode conversion candidates...
> > > insn 2 is marked as a candidate
> > > insn 3 is marked as a candidate
> > > insn 7 is marked as a candidate
> > > Created a new instruction chain #1
> > > Building chain #1...
> > > Adding insn 2 to chain #1
> > > Adding insn 7 into chain's #1 queue
> > > Adding insn 7 to chain #1
> > > r85 use in insn 12 isn't convertible
> > > Mark r85 def in insn 7 as requiring both modes in chain #1
> > > Adding insn 3 into chain's #1 queue
> > > Adding insn 3 to chain #1
> > > Collected chain #1...
> > > insns: 2, 3, 7
> > > defs to convert: r85
> > > Computing gain for chain #1...
> > > Instruction conversion gain: 24
> > > Registers conversion cost: 6
> > > Total gain: 18
> > > Converting chain #1...
> > >
> > > ...
> > >
> > > (insn 2 5 3 2 (set (reg/v:DI 83 [ a ])
> > > (mem/c:DI (reg/f:SI 16 argp) [1 a+0 S8 A32])) "max.c":2:1 66
> > > {*movdi_internal}
> > > (nil))
> > > (insn 3 2 4 2 (set (reg/v:DI 84 [ b ])
> > > (mem/c:DI (plus:SI (reg/f:SI 16 argp)
> > > (const_int 8 [0x8])) [1 b+0 S8 A32])) "max.c":2:1 66
> > > {*movdi_internal}
> > > (nil))
> > > (note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
> > > (insn 7 4 15 2 (set (subreg:V2DI (reg:DI 85) 0)
> > > (smax:V2DI (subreg:V2DI (reg/v:DI 84 [ b ]) 0)
> > > (subreg:V2DI (reg/v:DI 83 [ a ]) 0))) "max.c":3:22 -1
> > > (expr_list:REG_DEAD (reg/v:DI 84 [ b ])
> > > (expr_list:REG_DEAD (reg/v:DI 83 [ a ])
> > > (expr_list:REG_UNUSED (reg:CC 17 flags)
> > > (nil)))))
> > > (insn 15 7 16 2 (set (reg:V2DI 87)
> > > (subreg:V2DI (reg:DI 85) 0)) "max.c":3:22 -1
> > > (nil))
> > > (insn 16 15 17 2 (set (subreg:SI (reg:DI 86) 0)
> > > (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1
> > > (nil))
> > > (insn 17 16 18 2 (set (reg:V2DI 87)
> > > (lshiftrt:V2DI (reg:V2DI 87)
> > > (const_int 32 [0x20]))) "max.c":3:22 -1
> > > (nil))
> > > (insn 18 17 12 2 (set (subreg:SI (reg:DI 86) 4)
> > > (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1
> > > (nil))
> > > (insn 12 18 13 2 (set (reg/i:DI 0 ax)
> > > (reg:DI 86)) "max.c":4:1 66 {*movdi_internal}
> > > (expr_list:REG_DEAD (reg:DI 86)
> > > (nil)))
> > > (insn 13 12 0 2 (use (reg/i:DI 0 ax)) "max.c":4:1 -1
> > > (nil))
> > >
> > > Uros.
> > >
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
> > GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)
>
--
Richard Biener <rguenther@suse.de>
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)