Purpose of this enhancement request is not only to resolve a missed
optimization issue for the avr port. It aims to show up ways for resolving a
general difficulty for the RTL optimizers that we are presently having for
double-set insn: A difficulty that shows up when dealing with side effects of
double-set RTL insns that produce unexpected useful results.
IMO this is a general issue that will not only show up for the divmod4 patterns
but will probably also be very important for the efficiency of code generated
after the CCmode transition for targets where 1.) arithmetic or logic
operations leave the CC register with useful information and 2.) arithmetic
operations on SImode or DImode objects need to be lowered to a sequence of
At the end of this bug report I'm about to ask for an early CSE pass or a small
change of the jump2 pass.
When having a function like
void foo (uint32_t z, uint32_t n)
d = z / n;
m = z % n;
the avr port presently does not recognize that it needs to call the udivmodsi
function only once. Assembly code generated reads, thus,
(function prologue/epilogue stripped). The issue is that the CSE passes
presently are not able to realize that the same result is calculated twice. The
optimizers do, thus, not recognize that the first divmod4 call had a useful
side-effect (the one that it calculates as well the mod).
This could be fixed by attaching appropriate REG_EQUAL register notes to the
insns that copy the results from the library call that reside in hard regs to
target pseudos. The present implementation of the avr port does not generate
such notes since it generates the calls explicitly and not by optabs.c (because
the implementation knows exactly which registers will be clobbered by the
library call). I will attach a patch that adds these required REG_EQUAL notes.
The important issue is, that the REG_EQUAL note needs to refer to the original
pseudos that were used as input parameters for the library call. Otherwise CSE
will not realize the optimization opportunity.
When using the attached patch, i.e. when adding the appropriate REG_EQUAL notes
to the RTL, I veryfied that CSE is able to realize the optimization
opportunity. I.e. one ends up with assembly output reading
. The only difficulty is, that with the present optimization setup, the jump2
pass removes the insn that contain the useful REG_EQUAL notes for the
unexpected useful side-effect because it correctly recognizes that it is
"trivially dead". For this reason, I had to place the first CSE pass before the
jump2 pass in "passes.c" in order to get above results. I.e. my "passes.c"
For this reason the attached patch could only resolve this issue IMO only if
one of the following solutions is realized:
1.) the jump2 pass is changed such that it refrains from deleting trivially
dead insn if they are containing REG_EQUAL notes because possibly it could be
necessary to make them being seen at least once by a CSE pass.
2.) an earlier CSE pass is added to the optimizer flow. Possibly controlled by
using a target-dependent flag.
Created attachment 9665 [details]
Patch adding REG_EQUAL notes to the divmod4 expanders
Or even better add support for multiple sets in CSE and forget about adding notes. That will improve
more targets than just AVR.
IMO, one would not be able to handle the issue by changing CSE. E.g. I
presently don't see how to avoid using register notes for the following
situation. Imagine a target not having DImode operations so that DImode
arithmetic needs to be lowered to a sequence of SImode operations. Imagine
further that after the sequence the condition code contains useful information
that we want to re-use.
E.g a minus:DI operation would be expanded into two parallels like
(set (subreg:SI (reg:DI operand0) 0)
minus:SI ((subreg:SI (reg:DI operand1) 0)
(subreg:SI (reg:DI operand2) 0))))
(set (reg:CCmode CC) (generate borrow))])
(set (subreg:SI (reg:DI operand0) 4)
(extract_borrow:SI (reg:CCmode CC))
(minus:SI ((subreg:SI (reg:DI operand1) 4)
(subreg:SI (reg:DI operand2) 4)))))
(set (reg:CCmode CC) (generate condition code))])
In order to describe what information is written to the CC register in the
second parallel, one needs to refer to both, input parameters of the parallel
for the lower 4 bytes and input parameters for the higher 4 bytes. E.g. the
information that CC contains the result of a compare of operand1 and operand2
could therefore not be expressed in the RTL! One could add, however, a third
instruction to the expanded sequence reading
(set (reg:CCmode CC) (reg:CCmode CC))
-> WITH ATTACHED REGISTER NOTE "is equal to compare:DI (operand1) (operand2)"
where the REG_EQUAL note gives the required information. IMO this is a general
issue relevant for all targets that are aiming to do subreg lowering of
arithmetic and logic operations at expand and that wish to recycle the
condition codes generated. Of course 8 bit targets like AVR that need to use
subreg lowering for almost everything will benefit most :-).
I agree that generally register notes are kind of ugly. But for this kind of
information, I think that they could be useful.
You most likely want to use a libcall blocks instead of the regnotes here.
Do you know of any doc on how libcall block RTL is supposed to look like
(except from e.g. code reading optabs.c)?
I have done some code reading now and come to the following conclusion:
When having an expanded sequence that provides one single result, libcall
blocks are an appropriate method for making sure that a single-set insn that
carries a REG_EQUAL note is *not* deleted too early.
Libcall notes, however, do not provide a method so far for dealing with library
calls or expanded sequences that yield *two* results. E.g. they are no solution
for both, divmod4 on the one hand and arithmetic expanders doing subreg
lowering and also yielding a CC on the other hand.
In order to keep changes as small as possible, my suggestion is to change
such that it no longer calls "delete_trivially_dead_insns ()" at it's very
beginning. (I'd have posted a patch, if savannah.gnu.org is down right now.)
IIUC that the trivially dead insn are in first line a performance issue because
they need memory and handling that would otherwise not be necessary, my
suggested change would not be too serious. "delete_trivially_dead_insns ()"
would be among the first things that is called in the pass that immediately
follows jump2: The first cse pass.
I'd appreciate comments on whether such trivially dead insn could prevent jump2
from realizing important optimization steps.
All P1 enhancements not targeted towards 4.1, moving to P5.
Created attachment 14738 [details]
My new analysis leads me to the result that the key problem of this missed optimization is a target problem: Presently we expand a divmod4 pattern that implements the support funciton calls directly. This pattern refers directly to hard registers. The resulting RTL is too complex for CSE and we don't identify this optimization.
The attached patch uses the same RTL for the support function calls. Only this RTL is now generated only after combine by a define_insn_and_split pattern. The insn part of this pattern is valid only for pseudos so that until split. In order to assure this, I have added a new predicate.
I have tested the patch against the atmega128 simulation target without regressions.
2007-12-11 Bjoern Haase <email@example.com>
* config/avr/predicates.md (pseudo_register_operand): Add new predicate
* config/avr/avr.md (divmodqi4,udivmodqi4): replace define_expand by
define_insn_and_split, delay expansion of call patterns to split pass.
Subject: Bug 23726
Date: Sun Dec 13 21:03:41 2009
New Revision: 155195
* config/avr/predicates.md (pseudo_register_operand): New predicate for pseudos.
* config/avr/avr.md (divmodqi4): Replace with define_insn_and_split to allow div/mod optimization.
Setting Target Milestone.