[PATCH 00/31] VAX: Bring the port up to date (yes, MODE_CC conversion is included)
Maciej W. Rozycki
macro@linux-mips.org
Fri Dec 11 14:54:50 GMT 2020
On Wed, 9 Dec 2020, Paul Koning wrote:
> > This all sounds great. Do you happen to know if it is cycle-accurate
> > with respect to individual hardware microarchitectures simulated? That
> > would be required for performance evaluation of compiler-generated code.
>
> No, it isn't. I believe it just charges one time unit per instruction,
> with the possible exception of CIS instructions.
Fair enough; in my experience most CPU emulators are instruction-accurate
only. Of all the generally available emulators I came across (and looked
into closely enough; maybe I missed something) only ones for the Z80 were
cycle-accurate, and I believe the MAME project has had cycle-accurate
emulation too, in both cases down to the system level and out of necessity,
as the software written for these machines was often unforgiving when it
comes to any discrepancy with respect to original hardware.
Commercially, MIPS Technologies used to have cycle-accurate MIPSsim,
actually used for hardware verification, which took into account all the
implementation details such as the TLB and caches of the individual CPU
cores supported. You could also choose the topology of these resources
according to what actual silicon could have. Some LV hardware has had such
a choice too for evaluation purposes:
YAMON> scpu
Current settings :
I-Cache bytes per way = 0x1000
I-Cache associativity = 4
D-Cache bytes per way = 0x1000
D-Cache associativity = 4
MMU = tlb
YAMON> scpu -a
Available settings :
I-Cache bytes per way : 0x1000, 0x0
I-Cache associativity : 4, 3, 2, 1
D-Cache bytes per way : 0x1000, 0x0
D-Cache associativity : 4, 3, 2, 1
MMU types : tlb, fixed
YAMON> scpu -i 0x1000 2
YAMON> scpu -d 0x1000 2
YAMON> scpu fixed
YAMON> scpu
Current settings :
I-Cache bytes per way = 0x1000
I-Cache associativity = 2
D-Cache bytes per way = 0x1000
D-Cache associativity = 2
MMU = fixed
YAMON>
But then even cycle-accurate MIPSsim would not take every parameter of a
system into account, such as the latency of peripheral components. I am
not sure about DRAM either, though with its timing being predictable I
guess it might have been simulated.
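As a small worked example of what the settings in the transcript above
amount to, total cache capacity is the way size multiplied by the number
of ways. The sketch below assumes that "bytes per way" and "associativity"
carry their usual meaning; the helper name `cache_size` is made up for
illustration and is not part of YAMON or MIPSsim.

```python
def cache_size(bytes_per_way, associativity):
    """Total cache capacity in bytes: way size times number of ways."""
    return bytes_per_way * associativity


# Values copied from the transcript: the initial configuration and the
# one selected with "scpu -i 0x1000 2" / "scpu -d 0x1000 2".
initial = cache_size(0x1000, 4)
reduced = cache_size(0x1000, 2)

print(initial // 1024, "KiB per cache initially")
print(reduced // 1024, "KiB per cache after reconfiguration")
```

So halving the associativity at the same way size halves each cache,
from 16 KiB to 8 KiB.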
> I don't know of any cycle accurate PDP-11 emulators. It's not even
> clear if it is possible to build one, given the asynchronous operation
> of the UNIBUS. It certainly would be extremely difficult since even the
> documented timing is amazingly complex, never mind the possibility that
> the reality is different from what is documented.
For the purpose of evaluating the performance of compiler-generated code,
however, I don't think we need to go down as far as the external bus, so
however UNIBUS performs should not really matter. Even with modern systems
all the pipeline descriptions and operation timings we have recorded within
GCC reflect perfect operating conditions such as hot caches, no TLB misses,
no branch mispredictions, to say nothing of disruption to all that caused
by hardware interrupts and context switches.
So I guess with cycle-accurate PDP-11 emulation it would be sufficient if
relative CPU instruction execution timings were correctly reflected, such
as the latency of, say, MOV vs DIV, as I am fairly sure they are not even
close to being equivalent. But that does come at a cost; cycle-accurate
MIPSsim was much slower than its instruction-accurate counterpart, which
also existed.
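To illustrate the difference between charging one time unit per
instruction and reflecting relative instruction timings, here is a minimal
sketch. The cycle counts in `ASSUMED_CYCLES` are invented purely for
illustration and are not taken from any PDP-11 or VAX documentation; the
function names and instruction sequences are likewise hypothetical.

```python
# Hypothetical relative costs; illustrative numbers only, NOT real
# PDP-11 or VAX timings.
ASSUMED_CYCLES = {"MOV": 1, "ADD": 1, "DIV": 10}


def instruction_accurate_cost(trace):
    """Charge one time unit per executed instruction."""
    return len(trace)


def relative_timing_cost(trace, cycles=ASSUMED_CYCLES):
    """Charge each executed instruction its assumed relative cost."""
    return sum(cycles[op] for op in trace)


# Two hypothetical instruction sequences of equal length.
seq_a = ["MOV", "DIV", "MOV"]
seq_b = ["MOV", "ADD", "MOV"]

for name, seq in (("a", seq_a), ("b", seq_b)):
    print(name, instruction_accurate_cost(seq), relative_timing_cost(seq))
```

Under the one-unit-per-instruction model both sequences cost the same, so
a compiler change that replaces a DIV with something cheaper would not show
up at all; only the weighted model distinguishes them.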
> The pdp11 back end uses a very rough approximation of the documented
> 11/70 timing, but GCC doesn't make it easy (or maybe not even possible)
> to use the full timing details. It's not something I'd expect to refine
> a whole lot further.
Understood.
> More interesting would be to tweak the optimizing machinery to improve
> parts that either have bitrotted or never actually worked. The code
> generation for auto-increment etc. isn't particularly effective and I
> think that's a known limitation. Ditto indirect addressing, since few
> other machines have that. (VAX does, of course; it might benefit too.)
> And with LRA things are more limited still, again this seems to be known
> and is caused by the focus on modern machine architectures.
Correctness absolutely has to take precedence over performance, but that
does not mean the latter has to be completely ignored either. And the
presence of tools can only help with that. We may not have the resources
that commercially significant ports have available, but that does not mean
we should decide upfront to abandon any kind of performance QA. I think we
can still act professionally and try to do our best to make the quality of
code produced as good as possible within our available resources.
FWIW,
Maciej