This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


[Bug target/29969] New: should use floating point registers for block copies


In integer-dominated code, it is often useful to use floating point
registers to do block copies.  If suitable alignment is available,
64-bit loads / stores allow the copy to be done with half as many
memory operations.  If the source is loop invariant, the loads can be
hoisted out of the loop; register pressure usually makes this
infeasible for integer registers.
The destination and, if not loop invariant, the source need to be at
least 32-bit aligned for this to be profitable (or at least there must
be a known constant offset to such an alignment; at -O3,
preconditioning could be used to cover all possible offsets and select
the code at run-time).  Also, a minimum size is required.  The total
size need not be aligned, as smaller pieces can be copied in integer
registers.
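Purely to illustrate the arithmetic, here is a plain C sketch (the
function names are made up and this is not what GCC would emit; on SH4
the second loop is what we would like to become a sequence of 64-bit
fmov loads / stores):

#include <stddef.h>

/* 32-bit integer copy: eight loads and eight stores for 32 bytes.  */
static void
copy_int32 (unsigned int *dst, const unsigned int *src)
{
  size_t i;
  for (i = 0; i < 8; i++)
    dst[i] = src[i];
}

/* The same 32 bytes moved 64 bits at a time, e.g. through FP
   registers: four loads and four stores.  Both pointers are assumed
   to be 64-bit aligned; the data is only moved, never operated on.  */
static void
copy_fp64 (double *dst, const double *src)
{
  size_t i;
  for (i = 0; i < 4; i++)
    dst[i] = src[i];
}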

A testcase for this is the main loop of dhrystone, where the two
strings fit into 4 64-bit values each (after padding), and CSE allows
them to be fitted into 5 64-bit values together.
Four of these fit into the call-saved registers dr12, dr14, xd12 and
xd14, so their loads can be hoisted out of the loop.

The tree of the current function could be examined for heuristics to
determine whether using floating point registers for block copies makes
sense: look for high integer register pressure and low floating point
register pressure (specifically on call-saved registers if a loop
invariant crosses a call).  The heuristic might also take the different
integer / floating point memory latencies into account if the block is
relatively short, by checking whether there appear to be enough other
instructions nearby to hide some of the latency.
Alternatively or additionally, an option and/or parameters used in the
heuristics can be used to control the behaviour.
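To make the shape of such a heuristic concrete, here is a minimal
sketch; every name and threshold in it is hypothetical (none of these
are existing GCC interfaces), it merely restates the conditions above
as code:

#include <stdbool.h>

#define FP_COPY_MIN_SIZE        16   /* bytes; guess */
#define INT_PRESSURE_THRESHOLD   6   /* live integer regs counted as "high" */
#define FP_PRESSURE_THRESHOLD    4   /* live FP regs counted as "low" */
#define LATENCY_COVER_INSNS      8   /* insns needed to hide FP latency */

static bool
use_fp_block_copy_p (int block_size, int int_pressure, int fp_pressure,
                     bool invariant_crosses_call, bool fp_call_saved_free,
                     int nearby_insns)
{
  if (block_size < FP_COPY_MIN_SIZE)
    return false;
  /* Only worthwhile when integer registers are scarce and FP ones are not.  */
  if (int_pressure < INT_PRESSURE_THRESHOLD
      || fp_pressure > FP_PRESSURE_THRESHOLD)
    return false;
  /* A loop-invariant source that lives across a call needs call-saved
     FP registers to stay hoisted.  */
  if (invariant_crosses_call && !fp_call_saved_free)
    return false;
  /* For short blocks, require enough surrounding work to hide part of
     the FP memory latency.  */
  if (block_size < 2 * FP_COPY_MIN_SIZE
      && nearby_insns < LATENCY_COVER_INSNS)
    return false;
  return true;
}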

To increase the incidence of suitably aligned copies, constant
alignment and data alignment for block copy destinations of suitable
size which are defined in the current compilation unit should be
increased to 64 bits, and such data items should also be padded to 64
bits.  This may be controlled by an invocation option.
(If the last 64-bit item would contain no more than 32 bits, and the
register pressure is too high to hoist out all loads, padding to an
8 / 16 / 32 bit boundary is sufficient.  The latter padding is useful
for integer copies in general.)
When doing LTO, this might be extended to items which are defined in
other compilation units, and to special cases of indirect references.
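A hedged sketch of the alignment side, assuming it is routed through
the existing DATA_ALIGNMENT target macro; the helper, its name, and the
threshold are made up for illustration, and the padding of the objects
themselves would need separate handling:

/* Sketch for sh.c: raise static arrays of suitable size to 64-bit
   alignment so they can be copied with 64-bit FP loads / stores.
   Constant strings could get analogous treatment via
   CONSTANT_ALIGNMENT.  */
#define FP_COPY_ALIGN_MIN_SIZE 16  /* bytes; threshold is a guess */

static unsigned int
sh_data_alignment (tree type, unsigned int basic_align)
{
  if (TREE_CODE (type) == ARRAY_TYPE
      && int_size_in_bytes (type) >= FP_COPY_ALIGN_MIN_SIZE
      && basic_align < 64)
    return 64;
  return basic_align;
}

/* Sketch for sh.h; DATA_ALIGNMENT returns an alignment in bits.  */
#define DATA_ALIGNMENT(TYPE, ALIGN) sh_data_alignment ((TYPE), (ALIGN))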

The actual copy is best done exploiting post-increment for the loads
and pre-decrement for the stores, and is thus highly machine specific.
It therefore seems best to do this in sh.c:expand_block_move.
STORE_BY_PIECES_P and MOVE_BY_PIECES_P will then have to reject the
size and alignment combinations of copies that we want to handle this
way.
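A hedged sketch of that gatekeeping, assuming the default expressions
from expr.h are taken over with an extra rejection clause;
sh_fp_block_copy_p and its thresholds are hypothetical:

/* Hypothetical helper, e.g. in sh.c (declared in sh-protos.h): true if
   a copy of SIZE bytes with ALIGN bits of known alignment is one we
   want expand_block_move to handle with 64-bit FP loads / stores.  */
bool
sh_fp_block_copy_p (unsigned HOST_WIDE_INT size, unsigned int align)
{
  return TARGET_SH4 && align >= 32 && size >= 16;  /* guesses */
}

/* Sketch for sh.h: reject these combinations so that emit_block_move
   falls through to the movmemsi expander, i.e. expand_block_move.  */
#define MOVE_BY_PIECES_P(SIZE, ALIGN) \
  (!sh_fp_block_copy_p ((SIZE), (ALIGN)) \
   && move_by_pieces_ninsns ((SIZE), (ALIGN), MOVE_MAX_PIECES + 1) \
      < (unsigned int) MOVE_RATIO)
#define STORE_BY_PIECES_P(SIZE, ALIGN) MOVE_BY_PIECES_P ((SIZE), (ALIGN))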

Due to a quirk in the SH4 specification, we need a third fp_mode value
for 64-bit loads / stores (unless FMOVD_WORKS is true).
This mode has FPSCR.PR cleared and FPSCR.SZ set.
To get the full benefit for copies that are in a loop that makes calls,
we should fix rtl-optimization/29349 first.
When using the -m4-single ABI, the new mode can be reached from the
normal mode by issuing one fschg instruction; we can switch back with
another fschg instruction.
For -m4a or -m4-300, we need both an fpchg and an fschg; -m4 must load
the new mode from a third value in fpscr_values.
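For reference, a sketch of the FPSCR settings involved; the bit
positions follow the SH-4 programming manual, and the identifier names
are invented for this sketch only (they are not existing GCC or libgcc
identifiers):

/* FPSCR.SZ (bit 20): when set, fmov transfers 64 bits.
   FPSCR.PR (bit 19): when set, FP operations are double precision.
   Setting both at once is not defined by the SH4 architecture, hence
   the separate third mode.  fschg toggles SZ; fpchg (SH4A / SH4-300
   only) toggles PR.  */
#define FPSCR_PR (1 << 19)
#define FPSCR_SZ (1 << 20)

#define FP_MODE_SINGLE_BITS  0         /* PR=0, SZ=0: -m4-single default */
#define FP_MODE_DOUBLE_BITS  FPSCR_PR  /* PR=1, SZ=0: -m4 default        */
#define FP_MODE_FMOV64_BITS  FPSCR_SZ  /* PR=0, SZ=1: proposed third mode */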

The actual loads and stores must not look like ordinary SImode or
DImode loads and stores, because that would give - via
GO_IF_LEGITIMATE_ADDRESS - the wrong message to the optimizers about
the available addressing modes.  Moreover, POST_INC / PRE_DEC are
currently not allowed at rtl generation time.
A possible solution is to use patterns that pair the load / store with
an explicit set of the address register.  I'd prefer to use two
match_dups to keep the address register in sync, since otherwise the
optimizers can too easily hijack the pattern for something
inappropriate.
The MEMs are probably best expressed in SFmode / DFmode, but wrapped in
an SImode / DImode unspec; however, care must be taken to still get the
right alias set for the MEM.
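To make that shape concrete, here is a hedged sketch in GCC 4.x
RTL-building style of how one paired load might be emitted from
expand_block_move.  The unspec number, the function name, and the use
of a DImode FP destination are assumptions; the real implementation
would be matched by a dedicated define_insn in sh.md using match_dup on
the address register.

/* Hypothetical unspec number; it would have to be added to the unspec
   constants in sh.md.  */
#define UNSPEC_FP_BLOCK_LOAD 60

/* Sketch: emit one 64-bit load from *SRC_REG into FP_REG, paired with
   an explicit increment of SRC_REG in the same PARALLEL so the address
   update cannot be separated from the load.  FP_REG is assumed to be a
   DImode FP register, matching the DImode unspec around the DFmode MEM.
   ORIG_MEM is the source block MEM passed to the movmemsi expander;
   going through change_address keeps its alias set.  */
static void
emit_fp_block_load (rtx fp_reg, rtx src_reg, rtx orig_mem)
{
  rtx mem = change_address (orig_mem, DFmode, src_reg);
  rtx load = gen_rtx_SET (VOIDmode, fp_reg,
                          gen_rtx_UNSPEC (DImode, gen_rtvec (1, mem),
                                          UNSPEC_FP_BLOCK_LOAD));
  rtx inc = gen_rtx_SET (VOIDmode, src_reg,
                         gen_rtx_PLUS (Pmode, src_reg, GEN_INT (8)));

  emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, load, inc)));
}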


-- 
           Summary: should use floating point registers for block copies
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: amylaar at gcc dot gnu dot org
GCC target triplet: sh4*-*-*
 BugsThisDependsOn: 29349
OtherBugsDependingOnThis: 29842


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29969

