This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [rfc] multi-word subreg lowering pass
Roger Sayle wrote:
>On Mon, 29 May 2006, Daniel Jacobowitz wrote:
>> If a value is worth loading into the coprocessor, then it should be
>> treated as 64-bit. If it's not, though, it would be wonderful to split
>> it and handle the two halves separately. Will splitting 64-bit values
>> at expand time make it hard to use 64-bit registers for them later?
>It sounds as though the iWMMXt ARM is a more pronounced example of x86_64
>where you have all of the original 32-bit operations, but additionally
>64-bit registers and operations on them. In my current experiments, I've
>been completely disabling SUBREG/CONCAT lowering during RTL expansion on
>such TARGET_64BIT targets. I've been thinking more about processors that
>currently "white lie to reload" about having multiword register classes,
>and then always split them post-reload (with the associated penalties of
>not being able to allocate the pieces independently or optimize some of the
>lower-level operations away).
>I suspect the current scheme where iWMMXt-like systems present two
>register classes ("hard" and "multi") to reload which then decides where
>to put pseudos, finally splitting those deemed best to be done in 32-bit
>post-reload is the best compromise we have for now. i.e. a circumstance
>where the little white lie is justified. However, in such a scheme
>the quality of register allocation is even more critical :-(
In this case one might also try to emit DImode RTL, use late splitting
(splitting after the initial RTL passes but before register allocation), and
decide whether or not to split with simple heuristics. E.g. one might check
whether the last setter of the DImode register was a sign/zero extend
operation or the result of a previous DImode arithmetic operation. This could
probably be implemented without much difficulty using the use-def chains that
are also used by combine.
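A rough sketch of such a split/no-split heuristic, written in Python over a toy instruction list rather than real RTL. Everything here is hypothetical: the `Insn` class, the operation names, and the `last_setters`/`should_split` helpers are illustrative only and are not GCC interfaces. The message does not say which way each case should tip the decision; this sketch assumes that an extended 32-bit value is worth splitting (its high word is just zero or a sign copy) and that a value produced by 64-bit arithmetic should stay whole.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical toy IR; not GCC's RTL data structures.
@dataclass
class Insn:
    dest: str                      # register written by this insn
    op: str                        # e.g. "zero_extend", "sign_extend", "plus", "load"
    mode: str                      # result mode: "SI" (32-bit) or "DI" (64-bit)
    srcs: List[str] = field(default_factory=list)

# Assumed set of operations counted as "DImode arithmetic".
DI_ARITH_OPS = {"plus", "minus", "mult", "and", "ior", "xor", "ashift"}

def last_setters(block: List[Insn]) -> Dict[str, Insn]:
    """Map each register to the last insn that sets it in a straight-line
    block (a crude stand-in for walking real use-def chains)."""
    defs: Dict[str, Insn] = {}
    for insn in block:
        defs[insn.dest] = insn     # later definitions overwrite earlier ones
    return defs

def should_split(reg: str, defs: Dict[str, Insn]) -> bool:
    """Hypothetical heuristic: True means lower reg into two 32-bit halves."""
    setter = defs.get(reg)
    if setter is None:
        return True    # no visible setter (e.g. incoming argument): assume split
    if setter.op in ("zero_extend", "sign_extend"):
        return True    # high word is 0 or a sign copy, so splitting is cheap
    if setter.mode == "DI" and setter.op in DI_ARITH_OPS:
        return False   # result of 64-bit arithmetic: keep it in one piece
    return True        # default for a mostly 32-bit target (assumption)

block = [
    Insn("a", "zero_extend", "DI", ["x"]),
    Insn("b", "plus", "DI", ["a", "a"]),
    Insn("c", "load", "DI", ["p"]),
]
defs = last_setters(block)
for reg in ("a", "b", "c"):
    print(reg, "split" if should_split(reg, defs) else "keep DImode")
```

Run as is, this prints "a split", "b keep DImode", "c split". A real implementation would look up the reaching definition at each use rather than the last setter in the block.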
The first splitting pass would already be at the right location. The only
thing one would possibly want between splitting and allocation is a CSE pass
over the affected basic blocks.
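As an illustration of the kind of clean-up such a CSE pass could do once values are split, here is a Python sketch of local value numbering over one basic block. The toy three-address form, the register names, and `local_cse` are hypothetical; the message does not specify which CSE variant is meant, and GCC's own cse pass works on RTL, not on tuples like these.

```python
from typing import Dict, List, Tuple

# Hypothetical three-address form: (dest, op, operands).
Instr = Tuple[str, str, Tuple[str, ...]]

def local_cse(block: List[Instr]) -> List[Instr]:
    """Local value numbering over one basic block: an expression already
    computed into a still-valid register is replaced by a copy."""
    available: Dict[Tuple[str, Tuple[str, ...]], str] = {}
    out: List[Instr] = []
    for dest, op, operands in block:
        key = (op, operands)
        hit = available.get(key)
        if hit is not None:
            out.append((dest, "copy", (hit,)))
        else:
            out.append((dest, op, operands))
        # Writing dest kills expressions that read dest or were held in dest.
        available = {k: v for k, v in available.items()
                     if v != dest and dest not in k[1]}
        # Remember the new value unless it reads its own destination.
        if hit is None and dest not in operands:
            available[key] = dest
    return out

# Two 64-bit adds of the same operands after splitting into 32-bit halves
# (register names are made up for illustration).
split_block = [
    ("s_lo", "plus", ("a_lo", "b_lo")),
    ("c",    "ltu",  ("s_lo", "a_lo")),   # carry out of the low word
    ("t",    "plus", ("a_hi", "b_hi")),
    ("s_hi", "plus", ("t", "c")),
    ("u_lo", "plus", ("a_lo", "b_lo")),   # same low-word sum again
]
for insn in local_cse(split_block):
    print(insn)
```

Here the repeated low-word sum becomes ('u_lo', 'copy', ('s_lo',)). The redundancy only becomes visible once the DImode value is split; a real pass would also follow this with copy propagation.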
>But to more directly answer your question, there are a number of open
>PRs, including PR middle-end/22141, which highlight GCC's inability to
>promote multiple operations in one mode to a single operation in a
>wider mode. Given combine's limited lookahead and our inability to
>reason about concatenated registers in the RTL optimizers, I think
>keeping DImode operations in one piece is clearly to be preferred if
>there's some potential backend benefit. However, on targets such as
>x86, AVR and regular ARM where you know in advance that all operations
>must eventually be decomposed, it's better to "bite the bullet" and do
>that earlier during the transition from tree-ssa to RTL.
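To make "decomposed" concrete, here is a small Python model (not GCC code) of what lowering two DImode operations into SImode word operations amounts to on a 32-bit target. The helper names are hypothetical. Addition needs a carry passed from the low word into the high word, whereas bitwise AND splits into two fully independent halves, the kind of operation where early lowering lets the pieces be allocated and optimized separately.

```python
from typing import Tuple

MASK32 = 0xFFFF_FFFF

def split_di(value: int) -> Tuple[int, int]:
    """Split a 64-bit unsigned value into its (low, high) 32-bit words."""
    return value & MASK32, (value >> 32) & MASK32

def join_di(lo: int, hi: int) -> int:
    """Reassemble two 32-bit words into a 64-bit value."""
    return (hi << 32) | lo

def add_di_lowered(a: int, b: int) -> int:
    """64-bit add using only 32-bit word operations: an add on the low
    words, then an add-with-carry on the high words."""
    a_lo, a_hi = split_di(a)
    b_lo, b_hi = split_di(b)
    lo = (a_lo + b_lo) & MASK32
    carry = 1 if lo < a_lo else 0          # unsigned wrap of the low word
    hi = (a_hi + b_hi + carry) & MASK32
    return join_di(lo, hi)

def and_di_lowered(a: int, b: int) -> int:
    """Bitwise AND splits into two independent 32-bit ANDs; no value
    flows between the halves."""
    a_lo, a_hi = split_di(a)
    b_lo, b_hi = split_di(b)
    return join_di(a_lo & b_lo, a_hi & b_hi)

print(hex(add_di_lowered(0xFFFF_FFFF, 1)))   # 0x100000000
```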
Concerning AVR, I disagree. At the beginning I had been a strong advocate of
lowering at expand. However, after closely analyzing GCC and quite a number
of tests, I came to the conclusion that one will definitely lose more than
one gains. Compared with lowering at expand, we are better off with AVR's
present approach of lowering at assembly text output.
The question of *when* to do the lowering is, IMO, quite important for
the smaller targets. I understand that one might be reluctant to let some GCC
targets do much of their work at the RTL level (so that one could never get
rid of some of the uglier aspects of RTL). However, the only other way I see
to obtain comparable performance, e.g. for AVR or HC12, would be to insert
lots of target dependencies at the tree level.
>Prior to tree-ssa, backends had to emulate wide modes if they were to
>have any chance of being "globally" optimized at a high-level by the
>RTL optimizers. Now that that role is filled by tree-ssa, RTL can more
>accurately reflect the target hardware and the role of RTL optimizers
>is increasingly to capture machine specific idioms.
... but still, in order to capture these, you may need a fairly
high-level representation ... This seems to be the case for these ARM targets
and is definitely the case for the two ports that I know better (AVR and
HC12).
>But as you correctly
>point out, we still make critical high-level instruction selection and
>code generation decisions very late during register allocation.
>I'll give it some thought. If anyone has some good suggestions for
>handling operation lowering on WMMX-like targets I'd be interested in
>hearing their opinions.
>Thanks for pointing out this issue. I hadn't previously given it
>much consideration, and unfortunately my proposed solution isn't much
>help for tackling this problem on targets like those you describe.
>Roger