This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH]: GCC Scheduler support for R10000 on MIPS


Richard Sandiford wrote:

Yeah, that's not too surprising. This model says that the pipeline looks 15 cycles in advance to see whether a division issued now will complete in 16 cycles' time, which needs a hefty number of DFA states to track properly. That's probably not how the pipeline works.

Yeah, I figured the pipeline is using one cycle to issue, then while the fp-divider is doing its own thing, the fp-multiplier issue unit is moving on to issue instructions to the next component, be that the sqrt or multiplier directly. Assuming one cycle to issue, then in the amount of time it takes for the fp-divider to do its work, the issue unit can pop out a good number of other instructions. I guess DFA may not be able to handle scenarios like this without generating a massive number of states?



In other words, it's probably the completion stuff that's causing
problems.  Things might be better if you just model the issue and
execution stages.

Yeah, it looks like it is. I dropped those, and it built in about 5 seconds flat.



Also, if you model the issue stage, you should model it for all insns,
not just the ones that use r10k_fpmpy_issue.

Hmm, model issue for all of them? I don't recall the manual stating that the integer systems need that explicitly. In the case of insns that only have a latency of one cycle, wouldn't factoring in an issue sequence add another cycle on and essentially slow things down?


Only the fp-divider and fp-square root units have the specific mention of sharing their issue/completion logic, so, being a details person, I was taking a shot at modeling it. But if it proves too problematic, then I'll probably just put the divider and square root units back into their own automata and leave it at that.

And, just for clarification, if the manual says something has a repeat rate of, say, 16, and we do model the issue and/or completion phases, we need to subtract those cycles from the repeat rate, right?


Well, this reserves ALU2 for 35 cycles and (immediately after that)
reserves r10k_idiv_single for one cycle.  Is that what you wanted?

Not sure. That specific example in the Processor Pipeline description seemed to detail an integer divider that remains busy for the duration of the divide, but I wasn't sure if I was converting it to my application properly. Does DFA handle the case where a unit is already busy? I.e., if r10k_alu2 is already working on a previous divide insn and something else comes along (say, another divide), will gcc take this into account and know not to issue that insn until the divide is complete? It seems that on this processor only integer divides aren't pipelined, and it looked like reserving the unit for the whole running time of the insn achieved that effect.
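To make the question concrete, here is a toy sketch in Python of the behavior I'm hoping for from a non-pipelined unit. It is not GCC's scheduler or its DFA, just an illustration; the function name `issue_cycles` and its inputs are made up for this example, and the 35-cycle figure is the ALU2 reservation mentioned above.

```python
def issue_cycles(ready_cycles, busy_cycles):
    """Return the cycle at which each insn issues on one non-pipelined unit.

    ready_cycles: earliest cycle each insn could issue, in program order
                  (hypothetical input, not taken from a real schedule).
    busy_cycles:  how many cycles the unit stays reserved per insn.
    """
    unit_free_at = 0
    issued = []
    for ready in ready_cycles:
        # An insn cannot start until the unit has finished the previous one.
        start = max(ready, unit_free_at)
        issued.append(start)
        unit_free_at = start + busy_cycles
    return issued


# Two back-to-back integer divides, each holding the unit for 35 cycles.
print(issue_cycles([0, 1], 35))  # [0, 35]
```

The second divide is ready at cycle 1 but does not issue until cycle 35, which is the stall I'd expect the reservation to express.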


The internals guide doesn't offer a lot of clear-cut examples on things like this, so I'm sort of guessing at it. It also doesn't help that the example provided uses "div" as the name of both the insn reservation and the cpu unit, which makes the example harder for a newcomer to follow.


That's not correct.  "imul" is used for MULT, MULTU, DMULT and DMULTU.
(The "<u>" in those patterns means "" for signed and "u" for unsigned.)

I eventually figured out what <u> was doing (still not sure on <mode> 100%, or <su>), but I was looking more (or should say, searching more) at the actual asm code generated. I was only seeing mult and mul asm commands being created - but, I know very little about mips asm, so I suppose even though multu or dmultu may not be explicitly spelled out, the operands to the asm insns can probably take args in such a way as to become unsigned variants.


Poking around mips.md, I'm clueless on where one would start. It looks like there's a ton of different insns that fall into the 'imul' attribute type, so at first glance any such split looks like it would be pretty significant.

I guess for now, the choice is either to just use the Lo latency for MULT/DMULT and hope the 1-2 cycle deviation from MULTU and DMULTU doesn't degrade performance too much, or to use the Hi latency as the middle ground (MULT Lo lat is 5, Hi is 6; MULTU Lo is 6, Hi is 7, so 6 is the compromise pick), and maybe look at splitting them as a future project down the road.
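As a quick sanity check on that compromise, here is a small Python sketch that computes how far off each candidate single latency would be in the worst case. The four latencies are the MULT/MULTU figures quoted above; the names `latencies` and `worst_error` are just for this illustration, and the DMULT/DMULTU numbers are not included.

```python
# Latencies (in cycles) for MULT and MULTU as quoted above.
latencies = {"MULT Lo": 5, "MULT Hi": 6, "MULTU Lo": 6, "MULTU Hi": 7}


def worst_error(candidate):
    """Largest gap between one shared latency and the four real ones."""
    return max(abs(candidate - actual) for actual in latencies.values())


for candidate in (5, 6, 7):
    print(candidate, worst_error(candidate))
```

A single latency of 6 is never off by more than one cycle, while 5 or 7 can be off by two.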


FWIW, an alternative is to pick a single big file (e.g. gcc's fold-const)
and preprocess it.  You can then run cc1 on it directly, which means that
the benchmark is a single process.

I still have to fully build gcc, right? Or is there a way to fold in the changes to 10000.md, rebuild the pre-processor and cc1 directly (I assume it takes -march parameters) w/o rebuilding the whole compiler? Think I can get away with just the bootstrap compiler?



Yes, '(eq_attr "cpu" ...)' tests the attribute defined by
'(define_attr "cpu" ...)', so you need to remove the processor
names from both.

Done.



I think what I'll do, at minimum, is see if the ALU2 blocking can be modeled for integer divides, then tweak my comments to state what we can't or won't model, then start testing this in the testsuite and send you a final patch. I think pulling the issue bits may be preferable in the end, but I'll try testing them on that file you mentioned and see what happens. I figure that's a pretty math-heavy compile, right? I know I had this one C++ app around someplace that really whacked the system with math-intensive tests. I may have to go digging around on my other systems and find it.



Cheers!,


--
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org

"The past tempts us, the present confuses us, the future frightens us. And our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic

