This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [RFC, PR target/65105] Use vector instructions for scalar 64bit computations on 32bit target


On 06/19/2015 07:21 AM, Ilya Enkovich wrote:
Hi,

This patch tries to improve 64bit integer computations on 32bit
target.  This is achieved by an additional i386 target pass which
searches for all conversion candidates and tries to transform them
into vector mode when profitable.
Presumably you're building a chain of related operations that could possibly run in the vector unit, then if the costing model says ok, then you convert the whole chain.

Note that there's costing issues outside the model that can be expressed in GCC. For example, you can get a significant latency spike in the AVX unit if you're not feeding it work regularly.

Our of curiosity, what does LLVM do here in terms of costing models?


Initial problem discussion had several assumptions that this
optimization should be done in RA.  But implementation of this in RA
seems really complex.  I don't believe it can be done in a
reasonalble time.  And taking into account quite narrow performance
impact, I believe a separate conversion pass is a better solution.
The advantage of doing it in RA is probably the ability to accurately know if we've run out of general purpose registers and have vector registers to spare in the right spots. BUt with the amount of rewriting going on, it may be excessively complex to do in the allocator.



Here is shortly a list of changes:

1. Add insn templates for 64bit and/ior/xor/zext for 32bit target to
avoid split on expand 2. Add new pass to convert scalar computation
into vector.  The flow of the pass is following: a. Find all
instructions we may convert b. Split them into chains of dependant
instructions c. Estimate if chain conversion is profitable d. Convert
chain if profitable 3. Add splits for not converted insns
Seems to make reasonable sense.


Current cost model uses processor_costs table to estimate how much
gain somes from a single instruction usage vs pair of instruction +
estimate cost of scalar->vector and back conversions.  Cost
estimation doesn't actually use CFG and have a (lot of) room for
improvement.  The problem here is a lack of workloads to be used for
tuning.
Right. I'd think the tuning is probably one of the harder problems here. ISTM one of the metrics you'd want to be looking at is the register pressure for both register files across the lifetime of the chain of dependent instructions.

Note there are mechanisms to get register pressure estimates so that you can use them to help drive this kind of transformation.

From a correctness standpoint, one of the interesting tests would be to turn off all tuning -- ie, always convert if it's supposed to be possible. Then throw as much code as possible at it and see if anything breaks. Also a good time to instrument so that you can then build testcases from real-world code.

Also note that with a new pass, you may need to do some compile-time benchmarking.



Added DI insns and splits for 32bit target delay insns split until
reload_completed.  It is a potential degradation for cases when
conversion doesn't happen. Pass probably may be moved before spli1
pass to allow early split of not converted insns.  Or new pass itself
may split not converted chains.

I also had to modify register constraint of movdi for sse->mem
alternative.  I understand we don't like this alternative for 64bit
target but for 32bit it is more useful.  E.g. I see mem->mem copies
go through xmm instead of GPR pair with this change.  May we have
separate xmm register alternatives for 32bit and bit targets in
movdi?
The patch as a whole is ultimately Uros's call since it's implemented entirely in the x86_64 backend.


A few implementation notes.

Don't use const0_rtx, use CONST0_RTX (mode) whenever possible. The vast majority of the time the right mode is available in some other operand.

For convertible_comparison_p, please include the rtx form of what you're looking for in the function comment. It looks like you're searching for

(set (Z) (compare (ior (subreg (X) (subreg Y)) (const_int 0)

Where the subregs are extracting a SImode value out of a DImode X & Y respectively.

Note that you don't seem to be checking for a high vs low word, is that intentional?

For has_non_address_hard_reg, the name of the function is somewhat odd -- what does "address" in the function name have to do with the implementation which doesn't seem to do anything with addresses or address registers.

Does that routine DTRT for a value that is an input, but clobbered?

s/registerss/register (in comment before remove_non_convertible_regs)

remove_non_convertible_regs needs to document its parameter CANDIDATES. I figured out it's a bitmap of insn UIDs, but that should be called out in the function comment.

It also seems that routine assumes that anything set in CANDIDATES must be a single_set? If so, where is that enforced?


I don't see anything that jumps out as painfully wrong. Uros really needs to review the code as a whole though.

jeff





Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]