This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Hot and Cold Partitioning (Was: GCC 4.1 Projects)



I apologize for not responding to these messages sooner; I was out of town for a few days and only
just read them.


In the first place, I am a little confused about exactly what Joern is objecting to. Joern, if I am
reading your emails correctly, you feel that the hot/cold partitioning optimization, as currently
designed, has a problem because it will sometimes increase the size of the hot section by more than
is saved by moving the cold code to another section. You also seem to be concerned that some branch
instructions will not be able to span the distance between the hot and cold sections. It sounds as
if you therefore don't want this optimization to go in at all, but in actuality it is already there,
and what I am proposing to do is fix the parts of it that are still a little bit broken.


As with all optimizations, hot/cold partitioning is an educated guess at how to improve the program.
Therefore it will on occasion make a wrong guess. By using profiling data (as other people indicated)
the number of wrong guesses will be greatly reduced, but not entirely eliminated. While most of the
time it will either have no effect or will improve program performance, it can and will occasionally
slow it down. This is one of the reasons that the optimization is controlled by a flag, and is not
turned on by default. If you find the optimization is giving you trouble, you can always turn it off.


The optimization was designed to take into account the fact that on many architectures, various
branch instructions might not be able to span the distance between hot/cold sections. As others
have indicated, this is done by adding a level of indirection to the jumps. This is conditioned on
macros that can (should) be defined by each architecture, so the indirection won't be performed on
architectures where it isn't needed.
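
To make the conditioning concrete, here is a minimal sketch in C of the decision described above. It is not the code in GCC: the struct fields stand in for the per-architecture macros, and every name in it (target_branches, jump_kind, needs_indirection) is hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-architecture macros: each field says
   whether that kind of branch can reach from one section to the other. */
struct target_branches {
    bool long_cond_branch;    /* conditional branches can span sections */
    bool long_uncond_branch;  /* unconditional branches can span sections */
};

enum jump_kind { JUMP_CONDITIONAL, JUMP_UNCONDITIONAL };

/* Return true if this jump has to be rewritten to go through an extra
   level of indirection. Jumps that stay within one section never do. */
bool needs_indirection(const struct target_branches *target,
                       enum jump_kind kind, bool crosses_sections)
{
    assert(target != NULL);
    if (!crosses_sections)
        return false;
    if (kind == JUMP_CONDITIONAL)
        return !target->long_cond_branch;
    return !target->long_uncond_branch;
}
```

On a target where both fields are true, the function never asks for indirection, which is the "won't be performed on architectures where it isn't needed" case.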


There might be some validity in the idea of modifying this optimization, in the future, to consider
the size of a basic block in addition to its "hot-ness" when deciding which partition to put it into.
I expect this would not be that difficult to implement, and would probably address your concerns.
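
One possible shape for such a size test, as a rough sketch in C only: the block_info struct, the place_in_cold_section function, and the crossing_cost_bytes parameter are all hypothetical, and the rule used (move a cold block only if it is larger than the jump sequence the hot section gains) is an assumption about what the test might look like, not a settled design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block summary used only in this sketch. */
struct block_info {
    bool   is_cold;     /* profile data says the block is rarely executed */
    size_t size_bytes;  /* encoded size of the block */
};

/* Return true if the block should be placed in the cold section.
   crossing_cost_bytes is the assumed number of bytes added to the hot
   section by the jump sequence needed to reach the cold section. */
bool place_in_cold_section(const struct block_info *bb,
                           size_t crossing_cost_bytes)
{
    assert(bb != NULL);
    if (!bb->is_cold)
        return false;               /* hot blocks always stay put */
    /* Moving a block smaller than the jump overhead grows the hot section. */
    return bb->size_bytes > crossing_cost_bytes;
}
```

With illustrative numbers, a 2-byte cold block and a 22-byte crossing cost would keep the block in the hot section, while a 64-byte cold block would still be moved.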


However, at the moment, I would first like to get the "correctness" fixes for the hot/cold partitioning
optimization into FSF mainline. But I am open to persuasion, and if the FSF community in general
feels that I really ought to add the size test as well at this time, I will do so.


What do other people think?

-- Caroline Tice
ctice@apple.com

On Feb 28, 2005, at 12:09 PM, Joern RENNECKE wrote:

Dale Johannesen wrote:


No, you should not turn on partitioning in situations where code size is important to you.


You are missing the point. In my example, with perfect profiling data, you still end up with
more code in the hot section,


Yes.

i.e. more pages are actually swapped in.


Unless the cross-section branch is actually executed, there's no reason the unconditional
jumps should get paged in, so this doesn't follow.

If you separate the unconditional jumps from the rest of the function, you just have created a
per-function cold section. Except for corner cases, there would have to be a lot of them to
save a page of working set. And if you have that many, it will mean that the condjump can't
reach. And it is still utterly pointless to put blocks into the inter-function cold section
if that only makes the intra-function cold section larger.
So we've come from 4 bytes, one cycle:


bf 0f
mov #0,rn

over 6 bytes, BR issue slot during one cycle:
bt L2
L1:

..

L2:
bra L1
mov #0,rn

to 10 bytes in hot part of the hot section, 12 bytes in cold part of the hot
section, and another 10 to 12 bytes in the cold section, while the execution
time in the hot path is now two cycles (if we manage to get a good
schedule, we might execute two other instructions in these cycles, but still,
this is no better than we started out with):


.hotsection:
bf L2
mov.w 0f,rn
braf @rn
nop
0: .word L2-0b
L1:

...

L2:
mov.l 0f,rn
jmp @rn
nop
.balign 4
0: .long L3

.coldsection
L3:
mov.l 0f,rn
jmp @rn
mov #0,rn
.balign 4
0: .long L1



