This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: [RFC, PR target/65105] Use vector instructions for scalar 64bit computations on 32bit target

From: Jeff Law <law at redhat dot com>
To: Ilya Enkovich <enkovich dot gnu at gmail dot com>, gcc-patches at gcc dot gnu dot org
Date: Mon, 3 Aug 2015 14:52:19 -0600
Subject: Re: [RFC, PR target/65105] Use vector instructions for scalar 64bit computations on 32bit target
Authentication-results: sourceware.org; auth=none
References: <20150619132130 dot GA15263 at msticlxl57 dot ims dot intel dot com>

On 06/19/2015 07:21 AM, Ilya Enkovich wrote:

Hi,

This patch tries to improve 64bit integer computations on 32bit
target.  This is achieved by an additional i386 target pass which
searches for all conversion candidates and tries to transform them
into vector mode when profitable.

Presumably you're building a chain of related operations that could possibly run in the vector unit, then if the costing model says ok, then you convert the whole chain.

Note that there are costing issues outside the model that can be expressed in GCC. For example, you can get a significant latency spike in the AVX unit if you're not feeding it work regularly.


Out of curiosity, what does LLVM do here in terms of costing models?


Initial problem discussion had several assumptions that this
optimization should be done in RA.  But implementation of this in RA
seems really complex.  I don't believe it can be done in a
reasonable time.  And taking into account quite narrow performance
impact, I believe a separate conversion pass is a better solution.

The advantage of doing it in RA is probably the ability to accurately know if we've run out of general purpose registers and have vector registers to spare in the right spots. But with the amount of rewriting going on, it may be excessively complex to do in the allocator.


Here is a short list of changes:

1. Add insn templates for 64bit and/ior/xor/zext for 32bit target to
   avoid split on expand
2. Add new pass to convert scalar computation into vector.  The flow
   of the pass is as follows:
   a. Find all instructions we may convert
   b. Split them into chains of dependent instructions
   c. Estimate if chain conversion is profitable
   d. Convert chain if profitable
3. Add splits for not converted insns

Seems to make reasonable sense.
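
As a rough illustration of steps (a) and (b) in the list above, here is a toy model in Python rather than GCC code. It is a sketch under simplifying assumptions: instructions are plain UIDs with sets of defined and used pseudo registers, and two candidates land in the same chain whenever they touch the same register. The names build_chains, insns and candidates are hypothetical and are not taken from the patch.

```python
from collections import defaultdict


def build_chains(insns, candidates):
    """Group candidate insns into chains linked by shared registers.

    insns: dict mapping insn UID -> (set of regs defined, set of regs used)
    candidates: set of UIDs that step (a) judged convertible
    Returns a sorted list of chains, each a sorted list of UIDs.
    """
    # Union-find over candidate UIDs.
    parent = {uid: uid for uid in candidates}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Record which candidates define or use each register.
    touching = defaultdict(list)
    for uid in sorted(candidates):
        defs, uses = insns[uid]
        for reg in defs | uses:
            touching[reg].append(uid)

    # Candidates that touch the same register belong to one chain.
    for uids in touching.values():
        for other in uids[1:]:
            union(uids[0], other)

    chains = defaultdict(set)
    for uid in candidates:
        chains[find(uid)].add(uid)
    return sorted(sorted(chain) for chain in chains.values())
```

With insns 1 and 2 linked through one register and 3 and 4 through another, this yields two independent chains, each of which would then be costed and converted (or left alone) as a unit.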


Current cost model uses processor_costs table to estimate how much
gain comes from a single instruction usage vs pair of instruction +
estimate cost of scalar->vector and back conversions.  Cost
estimation doesn't actually use CFG and has a (lot of) room for
improvement.  The problem here is a lack of workloads to be used for
tuning.

Right. I'd think the tuning is probably one of the harder problems here. ISTM one of the metrics you'd want to be looking at is the register pressure for both register files across the lifetime of the chain of dependent instructions.

Note there are mechanisms to get register pressure estimates so that you can use them to help drive this kind of transformation.

From a correctness standpoint, one of the interesting tests would be to turn off all tuning -- ie, always convert if it's supposed to be possible. Then throw as much code as possible at it and see if anything breaks. Also a good time to instrument so that you can then build testcases from real-world code.

Also note that with a new pass, you may need to do some compile-time benchmarking.
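
To make the cost estimate and the "turn off all tuning" test mode discussed above concrete, here is a toy Python sketch. It assumes a chain's gain is the per-instruction saving (one vector insn instead of a scalar pair) minus the cost of moving values between scalar and vector registers. The function names and the numbers in the usage note are hypothetical, and nothing here is read from the real processor_costs table.

```python
def chain_gain(n_insns, scalar_pair_cost, vector_cost,
               n_conversions, conversion_cost):
    """Estimated gain from converting one chain to vector mode.

    Each converted insn replaces a pair of scalar insns with one vector
    insn; each value crossing the scalar/vector boundary pays a
    conversion cost.  All costs are abstract units (assumed values).
    """
    saved = n_insns * (scalar_pair_cost - vector_cost)
    return saved - n_conversions * conversion_cost


def should_convert(gain, always_convert=False):
    """Decide whether to convert a chain.

    always_convert models the stress-test mode: ignore tuning and
    convert everything that is convertible, to shake out bugs.
    """
    return always_convert or gain > 0
```

For example, chain_gain(4, 2, 1, 1, 3) is 1, so a four-insn chain with a single boundary crossing would be converted, while chain_gain(1, 2, 1, 2, 3) is -5 and would only be converted in the always-convert mode.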


Added DI insns and splits for 32bit target delay insns split until
reload_completed.  It is a potential degradation for cases when
conversion doesn't happen. Pass probably may be moved before split1
pass to allow early split of not converted insns.  Or new pass itself
may split not converted chains.

I also had to modify register constraint of movdi for sse->mem
alternative.  I understand we don't like this alternative for 64bit
target but for 32bit it is more useful.  E.g. I see mem->mem copies
go through xmm instead of GPR pair with this change.  May we have
separate xmm register alternatives for 32bit and 64bit targets in
movdi?

The patch as a whole is ultimately Uros's call since it's implemented entirely in the x86_64 backend.



A few implementation notes.

Don't use const0_rtx, use CONST0_RTX (mode) whenever possible. The vast majority of the time the right mode is available in some other operand.

For convertible_comparison_p, please include the rtx form of what you're looking for in the function comment. It looks like you're searching for


(set (Z) (compare (ior (subreg (X)) (subreg (Y))) (const_int 0)))

Where the subregs are extracting a SImode value out of a DImode X & Y respectively.

Note that you don't seem to be checking for a high vs low word, is that intentional?
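
To illustrate why the high vs low word matters, here is a small Python sketch. It assumes the pattern is meant to test a 64-bit value against zero by OR-ing its two 32-bit halves, and that both subregs take halves of the same DImode value; that reading is an assumption, not something confirmed by the patch. The helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def low_word(x):
    """Low 32 bits of a 64-bit value."""
    return x & MASK32


def high_word(x):
    """High 32 bits of a 64-bit value."""
    return (x >> 32) & MASK32


def is_zero_via_ior(x):
    """Zero test using one low and one high half: correct."""
    return (low_word(x) | high_word(x)) == 0


def is_zero_low_only(x):
    """Zero test that uses the low half twice: wrong for values
    whose only set bits are in the high half."""
    return (low_word(x) | low_word(x)) == 0
```

A value such as 1 << 32 is nonzero, yet the low-only variant reports it as zero, which is the kind of mismatch an unchecked subreg offset could let through.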

For has_non_address_hard_reg, the name of the function is somewhat odd -- what does "address" in the function name have to do with the implementation, which doesn't seem to do anything with addresses or address registers?


Does that routine DTRT for a value that is an input, but clobbered?

s/registerss/register (in comment before remove_non_convertible_regs)

remove_non_convertible_regs needs to document its parameter CANDIDATES. I figured out it's a bitmap of insn UIDs, but that should be called out in the function comment.

It also seems that routine assumes that anything set in CANDIDATES must be a single_set? If so, where is that enforced?

I don't see anything that jumps out as painfully wrong. Uros really needs to review the code as a whole though.


jeff

Follow-Ups:
- Re: [RFC, PR target/65105] Use vector instructions for scalar 64bit computations on 32bit target
  - From: Ilya Enkovich
