This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH]: GCC Scheduler support for R10000 on MIPS
Richard Sandiford wrote:
> Yeah, that's not too surprising. This model says that the pipeline
> looks 15 cycles in advance to see whether a division issued now will
> complete in 16 cycles' time, which needs a hefty number of DFA states
> to track properly. That's probably not how the pipeline works.
Yeah, I figured the pipeline is using one cycle to issue, then while the
fp-divider is doing its own thing, the fp-multiplier issue unit is moving on to
issue instructions to the next component, be that the sqrt or multiplier
directly. Assuming one cycle to issue, then in the amount of time it takes for
the fp-divider to do its work, the issue unit can pop out a good number of other
instructions. I guess DFA may not be able to handle scenarios like this without
generating a massive number of states?
> In other words, it's probably the completion stuff that's causing
> problems. Things might be better if you just model the issue and
> execution stages.
Yeah, it looks like it is. I dropped those, and it built in about 5 seconds flat.
> Also, if you model the issue stage, you should model it for all insns,
> not just the ones that use r10k_fpmpy_issue.
Hmm, model issue for all of them? I don't recall the manual stating that the
integer systems need that explicitly. In the case of insns that only have a
latency of one cycle, wouldn't factoring in an issue sequence add another cycle
on and essentially slow things down?
The manual specifically mentions shared issue/completion logic only for the
fp-divider and fp-square-root units, so, being a details person, I was taking a
shot at modeling it. But if it proves too problematic, I'll probably just put
the divider and square-root units back into their own automata and leave it at
that.
And, just for clarification, if the manual says something has a repeat rate of
say, 16, and we do model the issue and/or completion phases, we need to subtract
those cycles off the repeat rate, right?
> Well, this reserves ALU2 for 35 cycles and (immediately after that)
> reserves r10k_idiv_single for one cycle. Is that what you wanted?
Not sure. That specific example in the Processor Pipeline description seemed to
detail an integer divider that remains busy for the duration of the divide, but
I wasn't sure if I was converting it to my application properly. Does DFA
handle when a unit is already busy? I.e., if r10k_alu2 is already working on a
previous divide insn, if something else comes along (say another divide), will
gcc take this into account and know not to issue that insn until the divide is
complete? It seems that on this processor only integer divides aren't pipelined,
and it looked like re-using the running insn reservation achieved that effect.
The internals guide doesn't offer many clear-cut examples of things like
this, so I'm sort of guessing at it. It also doesn't help that the example
provided uses "div" as the name of both the insn reservation and the cpu
unit, which makes the example harder for a newcomer to follow.
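
To make the "unit already busy" question concrete, here is a minimal sketch in Python (not GCC code) of the behaviour I'm hoping for: a single non-pipelined unit that stays reserved for the whole operation, so a second divide cannot issue until the first one releases it. The function name and arguments are made up for illustration, and the 35-cycle figure is just the ALU2 reservation length mentioned above.

```python
def schedule_on_unit(ready_cycles, busy_cycles):
    """Return the cycle at which each insn issues on one non-pipelined unit.

    ready_cycles: earliest cycle each insn could issue, in program order.
    busy_cycles:  how many cycles the unit stays reserved per insn.
    """
    issue = []
    unit_free_at = 0  # first cycle at which the unit is available again
    for ready in ready_cycles:
        start = max(ready, unit_free_at)  # stall while the unit is busy
        issue.append(start)
        unit_free_at = start + busy_cycles
    return issue


# Two back-to-back divides, each holding the unit for 35 cycles.
print(schedule_on_unit([0, 1], 35))  # [0, 35]
```

With busy_cycles set to 1 the same function behaves like a fully pipelined unit, since each insn only blocks the one behind it for a single cycle.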
> That's not correct. "imul" is used for MULT, MULTU, DMULT and DMULTU.
> (The "<u>" in those patterns means "" for signed and "u" for unsigned.)
I eventually figured out what <u> was doing (still not sure on <mode> 100%, or
<su>), but I was looking more (or should say, searching more) at the actual asm
code generated. I was only seeing mult and mul asm commands being created -
but, I know very little about mips asm, so I suppose even though multu or dmultu
may not be explicitly spelled out, the operands to the asm insns can probably
take args in such a way as to become unsigned variants.
Poking around mips.md, I'm clueless about where one would start. It looks like
a ton of different insns fall under the 'imul' attribute type, so at first
glance any such split looks pretty significant.
I guess for now the choice is either to use the Lo latency for MULT/DMULT and hope
the 1-2 cycle deviation from MULTU and DMULTU doesn't degrade performance too
much, or to use the MULT Hi latency as the middle ground (MULT Lo latency is 5,
Hi is 6; MULTU Lo is 6, Hi is 7, so 6 is the compromise pick), and maybe look at
the split as a future project down the road.
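
As a quick sanity check on that compromise, here is a small Python sketch that measures how far a single modeled latency can be from the real MULT/MULTU figures quoted above. The helper name is hypothetical, and only the four MULT/MULTU numbers given here are used (no DMULT/DMULTU figures).

```python
# Latencies in cycles as (Lo, Hi), taken from the numbers quoted above.
LATENCIES = {"MULT": (5, 6), "MULTU": (6, 7)}


def worst_error(modeled):
    """Largest gap between one modeled latency and any real Lo/Hi latency."""
    return max(abs(modeled - real)
               for pair in LATENCIES.values()
               for real in pair)


for candidate in (5, 6, 7):
    print(candidate, worst_error(candidate))
```

Modeling 5 or 7 cycles can be off by up to 2 cycles, while 6 is never off by more than 1, which is why it looks like the least-bad single value.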
> FWIW, an alternative is to pick a single big file (e.g. gcc's fold-const)
> and preprocess it. You can then run cc1 on it directly, which means that
> the benchmark is a single process.
I still have to fully build gcc, right? Or is there a way to fold in the
changes to 10000.md, rebuild the pre-processor and cc1 directly (I assume it
takes -march parameters) w/o rebuilding the whole compiler? Think I can get
away with just the bootstrap compiler?
> Yes, '(eq_attr "cpu" ...)' tests the attribute defined by
> '(define_attr "cpu" ...)', so you need to remove the processor
> names from both.
Done.
At minimum, I think I'll see whether the ALU2 blocking can be modeled for
integer divides, then tweak my comments to state what we can't or won't model,
then start testing this in the testsuite and send you a final patch. I think
pulling the issue bits may be preferable in
the end, but I'll try testing them on that file you mentioned and see what
happens. I figure that's a pretty math-heavy compile, right? I know I had this
one C++ app around someplace that really whacked the system with math-intensive
tests. I may have to go digging around on my other systems and find it.
Cheers!,
--
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
"The past tempts us, the present confuses us, the future frightens us. And our
lives slip away, moment by moment, lost in that vast, terrible in-between."
--Emperor Turhan, Centauri Republic