[PATCH 00/31] VAX: Bring the port up to date (yes, MODE_CC conversion is included)

Fri Dec 11 14:54:50 GMT 2020

On Wed, 9 Dec 2020, Paul Koning wrote:

> > This all sounds great.  Do you happen to know if it is cycle-accurate 
> > with respect to individual hardware microarchitectures simulated?  That 
> > would be required for performance evaluation of compiler-generated code.
> 
> No, it isn't.  I believe it just charges one time unit per instruction, 
> with the possible exception of CIS instructions.

 Fair enough, from experience most CPU emulators are instruction-accurate 
only.  Of all the generally available emulators I came across (and looked 
into closely enough; maybe I missed something) only ones for the Z80 were 
cycle-accurate, and I believe the MAME project has had cycle-accurate 
emulation too.  Both go down to the system level and both do so out of 
necessity, as the software written for the original machines was often 
unforgiving of any discrepancy with respect to original hardware.

 Commercially, MIPS Technologies used to have cycle-accurate MIPSsim, 
actually used for hardware verification, and taking into account all the 
implementation details such as the TLB and caches of individual CPU cores 
supported.  And you could choose the topology of these resources according 
to what actual silicon could have.  Some LV hardware has had such 
configurability too, for evaluation purposes:

YAMON> scpu
Current settings :
  I-Cache bytes per way = 0x1000
  I-Cache associativity = 4
  D-Cache bytes per way = 0x1000
  D-Cache associativity = 4
  MMU                   = tlb
YAMON> scpu -a
Available settings :
  I-Cache bytes per way : 0x1000, 0x0
  I-Cache associativity : 4, 3, 2, 1
  D-Cache bytes per way : 0x1000, 0x0
  D-Cache associativity : 4, 3, 2, 1
  MMU types             : tlb, fixed
YAMON> scpu -i 0x1000 2
YAMON> scpu -d 0x1000 2
YAMON> scpu fixed
YAMON> scpu
Current settings :
  I-Cache bytes per way = 0x1000
  I-Cache associativity = 2
  D-Cache bytes per way = 0x1000
  D-Cache associativity = 2
  MMU                   = fixed
YAMON> 

But then even cycle-accurate MIPSsim would not take every parameter of a 
system into account, such as the latency of peripheral components.  I am 
not sure about DRAM either, though with its timing being predictable I 
guess that might have been simulated.
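
 As an aside, the `scpu' settings shown above map onto cache geometry in a 
simple way.  The following is only a sketch: the 32-byte line size and the 
reading of 0x0 bytes per way as "cache absent" are assumptions, as the 
YAMON output states neither, and the function names are made up for 
illustration.

```python
# Sketch only.  LINE_BYTES is an assumption: scpu does not report it.
LINE_BYTES = 32  # hypothetical cache line size


def cache_geometry(bytes_per_way, associativity, line_bytes=LINE_BYTES):
    """Return (total_bytes, sets) for a set-associative cache."""
    if bytes_per_way == 0:
        # Assumed meaning of the 0x0 setting: no cache at all.
        return 0, 0
    total = bytes_per_way * associativity  # all ways together
    sets = bytes_per_way // line_bytes     # one line per set in each way
    return total, sets


def set_index(addr, bytes_per_way, line_bytes=LINE_BYTES):
    """Return the set an address maps to; associativity does not matter."""
    return (addr % bytes_per_way) // line_bytes


if __name__ == "__main__":
    for ways in (4, 2):
        total, sets = cache_geometry(0x1000, ways)
        print(f"{ways}-way: {total} bytes total, {sets} sets")
```

Under these assumptions going from 4 ways to 2 halves the total capacity 
while the number of sets, and so the mapping of addresses to sets, stays 
the same.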

> I don't know of any cycle accurate PDP-11 emulators.  It's not even 
> clear if it is possible to build one, given the asynchronous operation 
> of the UNIBUS.  It certainly would be extremely difficult since even the 
> documented timing is amazingly complex, never mind the possibility that 
> the reality is different from what is documented.

 For the purpose of evaluating the performance of compiler-generated code 
however I don't think we need to go down as far as the external bus, so 
however UNIBUS performs should not really matter.  Even with modern 
systems all the pipeline descriptions and operation timings we have 
recorded within GCC reflect perfect operating conditions such as hot 
caches, no TLB misses, no branch mispredictions, to say nothing of 
disruption to all that caused by hardware interrupts and context switches.

 So I guess with cycle-accurate PDP-11 emulation it would be sufficient if 
relative CPU instruction execution timings were correctly reflected, such 
as the latency of say MOV vs DIV, as I am fairly sure they are not even 
close to being equivalent.  But that does come at a cost; cycle-accurate 
MIPSsim was much slower than its instruction-accurate counterpart which 
also existed.
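
 To illustrate why relative timings matter, here is a minimal sketch of 
the two accounting models.  The cost table is entirely hypothetical (the 
figures are not taken from any PDP-11 processor handbook) and the names 
`COST' and `run' are made up; the only point is that charging one unit per 
instruction can rank two code sequences the wrong way round.

```python
# Hypothetical relative costs in arbitrary time units; NOT real PDP-11 data.
COST = {"MOV": 1, "ADD": 1, "DIV": 10}


def run(trace, cost=None):
    """Return the time units charged for a list of instruction mnemonics.

    With cost=None every instruction is charged one unit, which is the
    instruction-accurate model.  Otherwise each mnemonic is charged its
    entry from the cost table.
    """
    if cost is None:
        return len(trace)
    return sum(cost[op] for op in trace)


# Two made-up sequences a compiler might choose between.
seq_div = ["MOV", "DIV", "MOV"]
seq_add = ["MOV", "ADD", "ADD", "ADD", "MOV"]

if __name__ == "__main__":
    print("one unit each:", run(seq_div), run(seq_add))
    print("weighted:     ", run(seq_div, COST), run(seq_add, COST))
```

With one unit per instruction the shorter sequence looks cheaper; with 
the weighted table the ranking flips, which is the kind of difference a 
performance evaluation would need to see.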

> The pdp11 back end uses a very rough approximation of the documented 
> 11/70 timing, but GCC doesn't make it easy (or maybe not even possible) 
> to use the full timing details.  It's not something I'd expect to refine 
> a whole lot further.

 Understood.

> More interesting would be to tweak the optimizing machinery to improve 
> parts that either have bitrotted or never actually worked. The code 
> generation for auto-increment etc. isn't particularly effective and I 
> think that's a known limitation.  Ditto indirect addressing, since few 
> other machines have that.  (VAX does, of course; it might benefit too.)  
> And with LRA things are more limited still, again this seems to be known 
> and is caused by the focus on modern machine architectures.

 Correctness absolutely has to take precedence over performance, but that 
does not mean the latter has to be completely ignored either.  And the 
presence of tools can only help with that.  We may not have the resources 
that commercially significant ports have, but that does not mean we should 
decide upfront to abandon any kind of performance QA.  I think we can 
still act professionally and try to do our best to make the quality of 
code produced as good as possible within our available resources.

 FWIW,

  Maciej