This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs


On Sat, 27 Jul 2019, Uros Bizjak wrote:

> On Sat, Jul 27, 2019 at 12:07 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> 
> > > How would one write smaxsi3 as a splitter to be split after
> > > reload in the case LRA assigned the GPR alternative?  Is it
> > > even worth doing?  Even the SSE reg alternative can be split
> > > to remove the not needed CC clobber.
> > >
> > > Finally I'm unsure about the add where I needed to place
> > > the SSE alternative before the 2nd op memory one since it
> > > otherwise gets the same cost and wins.
> > >
> > > So - how to go forward with this?
> >
> > Sorry to come a bit late to the discussion.
> >
> > We are aware of CMOV issue for quite some time, but the issue is not
> > understood yet in detail (I was hoping for Intel people to look at
> > this). However, you demonstrated that using PMAX and PMIN  instead of
> > scalar CMOV can bring us big gains, and this thread now deals on how
> > to best implement PMAX/PMIN for scalar code.
> >
> > I think that the way to go forward is with STV infrastructure.
> > Currently, the implementation only deals with DImode on SSE2 32bit
> > targets, but I see no issues on using STV pass also for SImode (on
> > 32bit and 64bit targets). There are actually two STV passes, the first
> > one (currently run on 64bit targets) is run before cse2, and the
> > second (which currently runs on 32bit SSE2 only) is run after combine
> > and before split1 pass. The second pass is interesting to us.
> >
> > The base idea of the second STV pass (for 32bit targets!) is that we
> > introduce a DImode _doubleword instructons that otherwise do not exist
> > with integer registers. Now, the passes up to and including combine
> > pass can use these instructions to simplify and optimize the insn
> > flow. Later, based on cost analysis, STV pass either converts the
> > _doubleword instructions to a real vector ones (e.g. V2DImode
> > patterns) or leaves them intact, and a follow-up split pass splits
> > them into scalar SImode instruction pairs. STV pass also takes care to
> > move and preload values from their scalar form to a vector
> > representation (using SUBREGs). Please note that all this happens on
> > pseudos, and register allocator will later simply use scalar (integer)
> > registers in scalar patterns and vector registers with vector insn
> > patterns.
> >
> > Your approach to amend existing scalar SImode patterns with vector
> > registers will introduce no end of problems. Register allocator will
> > do funny things during register pressure, where values will take a
> > trip to a vector register before being stored to memory (and vice
> > versa, you already found some of them). Current RA simply can't
> > distinguish clearly between two register sets.
> >
> > So, my advice would be to use STV pass also for SImode values, on
> > 64bit and 32bit targets. On both targets, we will be able to use
> > instructions that operate on vector register set, and for 32bit
> > targets (and to some extent on 64bit targets), we would perhaps be
> > able to relax register pressure in a kind of controlled way.
> >
> > So, to demonstrate the benefits of existing STV pass, it should be
> > relatively easy to introduce 64bit max/min pattern on 32bit target to
> > handle 64bit values. For 32bit values, the pass should be re-run to
> > convert SImode scalar operations to vector operations in a controlled
> > way, based on various cost functions.

I've looked at STV before trying to use RA to solve the issue but
quickly stepped away because of its structure which seems to be
tied to particular modes, duplicating things for TImode and DImode
so it looked like I have to write up everything again for SImode...

It really should be possible to run the pass once, handling a set
of modes rather than re-running it for the SImode case I am after.
See also a recent PR about STV slowness and tendency to hog memory
because it seems to enable every DF problem that is around...

> Please find attached patch to see STV in action. The compilation will
> crash due to non-existing V2DImode SMAX insn, but in the _.268r.stv2
> dump, you will be able to see chain building, cost calculation and
> conversion insertion.

So you unconditionally add a smaxdi3 pattern - indeed this looks
necessary even when going the STV route.  The actual regression
for the testcase could also be solved by turing the smaxsi3
back into a compare and jump rather than a conditional move sequence.
So I wonder how you'd do that given that there's pass_if_after_reload
after pass_split_after_reload and I'm not sure we can split
as late as pass_split_before_sched2 (there's also a split _after_
sched2 on x86 it seems).

So how would you go implement {s,u}{min,max}{si,di}3 for the
case STV doesn't end up doing any transform?

You could save me some guesswork here if you can come up with
a reasonably complete final set of patterns (ok, I only care
about smaxsi3) so I can have a look at the STV approach again
(you may remember I simply "split" at assembler emission time).

Thanks,
Richard.

> The testcase:
> 
> --cut here--
> long long test (long long a, long long b)
> {
>   return (a > b) ? a : b;
> }
> --cut here--
> 
> gcc -O2 -m32 -msse2 (-mstv):
> 
> _.268r.stv2 dump:
> 
> Searching for mode conversion candidates...
>   insn 2 is marked as a candidate
>   insn 3 is marked as a candidate
>   insn 7 is marked as a candidate
> Created a new instruction chain #1
> Building chain #1...
>   Adding insn 2 to chain #1
>   Adding insn 7 into chain's #1 queue
>   Adding insn 7 to chain #1
>   r85 use in insn 12 isn't convertible
>   Mark r85 def in insn 7 as requiring both modes in chain #1
>   Adding insn 3 into chain's #1 queue
>   Adding insn 3 to chain #1
> Collected chain #1...
>   insns: 2, 3, 7
>   defs to convert: r85
> Computing gain for chain #1...
>   Instruction conversion gain: 24
>   Registers conversion cost: 6
>   Total gain: 18
> Converting chain #1...
> 
> ...
> 
> (insn 2 5 3 2 (set (reg/v:DI 83 [ a ])
>         (mem/c:DI (reg/f:SI 16 argp) [1 a+0 S8 A32])) "max.c":2:1 66
> {*movdi_internal}
>      (nil))
> (insn 3 2 4 2 (set (reg/v:DI 84 [ b ])
>         (mem/c:DI (plus:SI (reg/f:SI 16 argp)
>                 (const_int 8 [0x8])) [1 b+0 S8 A32])) "max.c":2:1 66
> {*movdi_internal}
>      (nil))
> (note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
> (insn 7 4 15 2 (set (subreg:V2DI (reg:DI 85) 0)
>         (smax:V2DI (subreg:V2DI (reg/v:DI 84 [ b ]) 0)
>             (subreg:V2DI (reg/v:DI 83 [ a ]) 0))) "max.c":3:22 -1
>      (expr_list:REG_DEAD (reg/v:DI 84 [ b ])
>         (expr_list:REG_DEAD (reg/v:DI 83 [ a ])
>             (expr_list:REG_UNUSED (reg:CC 17 flags)
>                 (nil)))))
> (insn 15 7 16 2 (set (reg:V2DI 87)
>         (subreg:V2DI (reg:DI 85) 0)) "max.c":3:22 -1
>      (nil))
> (insn 16 15 17 2 (set (subreg:SI (reg:DI 86) 0)
>         (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1
>      (nil))
> (insn 17 16 18 2 (set (reg:V2DI 87)
>         (lshiftrt:V2DI (reg:V2DI 87)
>             (const_int 32 [0x20]))) "max.c":3:22 -1
>      (nil))
> (insn 18 17 12 2 (set (subreg:SI (reg:DI 86) 4)
>         (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1
>      (nil))
> (insn 12 18 13 2 (set (reg/i:DI 0 ax)
>         (reg:DI 86)) "max.c":4:1 66 {*movdi_internal}
>      (expr_list:REG_DEAD (reg:DI 86)
>         (nil)))
> (insn 13 12 0 2 (use (reg/i:DI 0 ax)) "max.c":4:1 -1
>      (nil))
> 
> Uros.
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]