This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: PA8000 performance oddity



  In message <374AC4C5.614AD369@americasm01.nt.com> you write:
  > The actual weirdness I was trying to address was that when I manually
  > remove the false dependency by changing register allocation to another
  > temp register, i.e., changing b = a + b to c = a + b and updating the
  > ensuing code, performance got WORSE by a little bit.  If anything, I
  > would expect it to either be improved (due to eliminating a dependency)
  > or unchanged (latency hidden by reordering, or no latency problem).
It can vary.  Predicting performance for out-of-order execution machines
is difficult at best.  I wouldn't worry too much about it right now (i.e.,
I suspect there are other things we can do to improve performance, which
will also make it easier to analyze the code later).

  > It almost suggested that there is some kind of cache effect inside the
  > processor favoring recently used registers over ones that are idle.  Is
  > this possible/reasonable?  If so, it would make correct machine modeling
  > much more complex.
Not that I'm aware of.  At least not in the PA8000 or PA8200.  I don't know
about the PA8500 or PA8700.

  > > I did some fooling around with this stuff a while back and it was worth a
  > > few more percent across specfp.
  > 
  > I saw something similar when I played with hobbling the aCC assembly
  > this way.  If I get a chance, I'll look at adding this stuff in.
I would recommend it.  It'll be a little more complex than the stuff you've
been working on, but not terribly so.

For example, some parts of the PA backend try to rewrite addresses for FP
loads and stores to make better use of the -16 .. 15 displacement.  You'll
need to tweak them (LEGITIMIZE_ADDRESS, LEGITIMIZE_RELOAD_ADDRESS).  You'll
also need to update GO_IF_LEGITIMATE_ADDRESS.  Assembler & BFD work is also
needed to handle the larger displacements for FP loads & stores.
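
As a rough illustration of the kind of check and rewrite involved (a
sketch only, not the PA backend's actual code), the helpers below test
whether a displacement fits a signed field of a given width and split one
that does not into a base adjustment plus a small residual.  The names
fits_signed_field and split_displacement are hypothetical; a 5-bit field
gives the -16 .. 15 range mentioned above, and the width of the larger
form is left as a parameter rather than assumed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper (not a GCC macro): true when DISP fits in a
   signed two's-complement field that is BITS wide.  */
static bool
fits_signed_field (long disp, int bits)
{
  assert (bits > 0 && bits < 32);
  long lo = -(1L << (bits - 1));
  long hi = (1L << (bits - 1)) - 1;
  return disp >= lo && disp <= hi;
}

/* A displacement split into a part to fold into the base register
   (HIGH) and a residual that fits the instruction field (LOW).  */
struct split_disp
{
  long high;
  long low;
};

/* Hypothetical rewrite: pick LOW so that it fits a BITS-wide signed
   field and HIGH + LOW == DISP.  */
static struct split_disp
split_displacement (long disp, int bits)
{
  assert (bits > 0 && bits < 32);
  unsigned long half = 1UL << (bits - 1);
  unsigned long mask = (1UL << bits) - 1;
  struct split_disp s;

  /* Bias, keep the low BITS bits, then remove the bias again.  */
  s.low = (long) (((unsigned long) disp + half) & mask) - (long) half;
  s.high = disp - s.low;
  return s;
}
```

With a 5-bit field, a displacement of 20 splits into a base adjustment of
32 and a residual of -12.  Widening the field accepts more displacements
unchanged, which is where the legitimate-address checks would need to
change.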

My recommendation is to get the compiler stuff working first -- use the HP
assembler to test your code (and benchmark any improvements in your app).



  > I can send the aCC assembly if you'd like to see what aCC did vs gcc.
I really wouldn't have the time to look at it.  My focus is on gcc-2.95,
not performance issues for future releases.


  > One other thing I noted.  Gcc still generated fmpysub, so I'll have to
  > update the pattern in pa.md.  Shutting off pa_combine_instructions still
  > lets this pattern operate.  There was actually a performance loss by
  > having the instruction in there.  I know you said this, but I didn't
  > actually believe it.  I'm still not sure I actually understand.  Even
  > though fmpyadd/sub takes two reorder buffer slots, it shouldn't be any
  > slower than fmpy followed by fsub, should it?  If anything I would think
  > there is still a bit of benefit by having one less instruction - at
  > least in this particular example.
An fmpysub cannot retire until both ops are finished.  Thus it holds resources
until both operations are finished.  An fmpy followed by an fsub allows either
operation to retire as soon as it is complete, returning resources to the
processor (particularly reorder buffer slots) for use by other insns.
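
A toy model of that difference, counting how many slot-cycles the two
forms tie up.  This is only a sketch: the function names are made up, the
completion times are whatever the caller passes in rather than real
PA8000 latencies, and it ignores every other retirement constraint.

```c
#include <assert.h>

/* Slot-cycles held by a combined fmpysub: it occupies two reorder
   buffer slots, and neither is released until the slower of the two
   operations has finished.  */
static int
slot_cycles_combined (int mpy_done, int sub_done)
{
  assert (mpy_done >= 0 && sub_done >= 0);
  int last = mpy_done > sub_done ? mpy_done : sub_done;
  return 2 * last;
}

/* Slot-cycles held by a separate fmpy and fsub: each insn has one
   slot and, in this model, releases it as soon as it completes.  */
static int
slot_cycles_separate (int mpy_done, int sub_done)
{
  assert (mpy_done >= 0 && sub_done >= 0);
  return mpy_done + sub_done;
}
```

With made-up completion times of 3 and 5 cycles, the combined form holds
10 slot-cycles and the separate pair holds 8.  The combined form never
holds fewer, and the gap grows as the two completion times diverge.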

You should investigate how the fmpysub was created.  That shouldn't be
happening.

  > I've also played with a couple of other things.  I tried different
  > BRANCH_COST values up to 10 and ran it on our code, plus the performance
  > test suite that Mark Lehmann set up.  There doesn't seem to be much
  > benefit to changing the value that I can see.  Perhaps a value of 2 or 3
  > is a bit better, but there were plenty of test cases that got worse as
  > well as better.
As with most optimization work, a change rarely helps in every case.  I
suspect a value of 2 is best for the PA8000.  We may need to tweak it further
for PA8200, PA8500 and PA8700.


  > Finally, in High Performance Computing 2nd Ed., there is an appendix
  > that sketches the PA8000.  One thing they mention that I haven't seen in
  > HP's publicly available papers is that branches take a slot in both the
  > memory reorder buffer and the arithmetic buffer.  I tried this and saw a
  > bit (1%?) drop in performance.  Is it possible that this comment isn't
  > correct?  I simply added branch instructions to the memory function unit
  > as well as the alu function unit.
It is correct, but I don't think trying to model it is going to work. 
Remember that the scheduler does not schedule branches.  I have no idea what
exposing this aspect of the PA architecture would do to the schedules.
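
To make the buffer accounting concrete, here is a small sketch that
counts reorder buffer slots for an insn sequence when a branch takes a
slot in both buffers.  The enum, struct and function names are
hypothetical, and this is bookkeeping only; it says nothing about how the
scheduler would use the information.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical insn classes for this sketch.  */
enum insn_kind { INSN_ALU, INSN_MEM, INSN_BRANCH };

/* Slots consumed in each of the two reorder buffers.  */
struct buffer_use
{
  int alu_slots;
  int mem_slots;
};

/* Count slots for N insns.  A branch is charged to both buffers;
   every other insn is charged to exactly one.  */
static struct buffer_use
count_buffer_slots (const enum insn_kind *insns, size_t n)
{
  struct buffer_use use = { 0, 0 };

  for (size_t i = 0; i < n; i++)
    {
      if (insns[i] == INSN_ALU || insns[i] == INSN_BRANCH)
        use.alu_slots++;
      if (insns[i] == INSN_MEM || insns[i] == INSN_BRANCH)
        use.mem_slots++;
    }
  return use;
}
```

For the sequence ALU, MEM, BRANCH, ALU this gives three arithmetic slots
and two memory slots, one more memory slot than a model that charges
branches to the arithmetic buffer alone.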

jeff

