Serious code size regression from 3.0.2 to now

tm tm@mail.kloo.net
Thu Jul 18 18:56:00 GMT 2002


On Thu, 18 Jul 2002, Joern Rennecke wrote:

> tm wrote:
> > Basically, GCC is now generating HUGE groups of jump instructions which
> > are aligned to 32-byte boundaries.
> 
> Was it not generating these lone jump instructions before,
> or did it align them less?

Okay, I've done a bit more investigation, and hopefully I'm understanding
this better.

Before, GCC was generating this sequence:

	mov.l	L_label,r0
	jmp	@r0
	ins

Now, GCC is generating this sequence:

	bra	L_label
	ins

	.align	5
L_label:
	bra	L_label2
	ins

L_label2:

This is fine for isolated cases. However, in a large function you wind up
with many of these stacked relative branches, and you get:


	.align	5
L_label:
	bra	L_label2
	ins
	.align	5
L_label3:
	bra	L_label4
	ins

and each branch/delay-slot pair winds up in its own cache line, like
this:

 15756                  .L2282:
 15757 7ba0 AF19                bra     .L2275
 15758 7ba2 6013                mov     r1,r0
 15759 7ba4 00090009            .align 5
 15759      00090009
 15759      00090009
 15759      00090009
 15759      00090009
 15760                  .L2279:
 15761 7bc0 AEFF                bra     .L2887
 15762 7bc2 4011                cmp/pz  r0
 15763 7bc4 00090009            .align 5
 15763      00090009
 15763      00090009
 15763      00090009
 15763      00090009
 15764                  .L2276:
 15765 7be0 AEE5                bra     .L2888
 15766 7be2 4811                cmp/pz  r8
 15767 7be4 00090009            .align 5
 15767      00090009
 15767      00090009
 15767      00090009
 15767      00090009
...
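
To make the padding in that listing concrete, here is a small sketch (mine, not anything GCC or gas produces) that works out how many fill bytes each `.align 5` has to emit. It assumes `.align 5` means alignment to 2**5 = 32 bytes, which is what the label addresses in the listing (7ba0, 7bc0, 7be0) suggest. The function name `padding_to_boundary` is made up for illustration.

```python
def padding_to_boundary(addr: int, align_log2: int) -> int:
    """Fill bytes needed so the next item starts on a 2**align_log2 boundary."""
    boundary = 1 << align_log2
    return (-addr) % boundary


# Addresses at which each `.align 5` appears in the listing above.
for addr in (0x7BA4, 0x7BC4, 0x7BE4):
    pad = padding_to_boundary(addr, 5)
    print(f"{addr:#06x}: {pad} fill bytes, next label at {addr + pad:#06x}")
```

Each branch plus its delay slot is 4 bytes, so every one of these `.align 5` directives has to fill the remaining 28 bytes of the line with 0x0009 (the SH nop encoding).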

In map_fog.i/VideoDraw32OnlyFog32Alpha() alone I counted 94 (!) cache
lines which contain only one of the following:

1. branch + delay slot
2. literal load + branch far + delay slot instruction
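
For a rough sense of scale, the count above can be turned into bytes. This is only a back-of-the-envelope sketch: it assumes 32-byte cache lines and 2-byte SH instructions, and, since I have not broken the 94 lines down by kind, the padding figure treats every line as case 1 (branch + delay slot only). Case 2 lines hold more useful bytes, so that figure is an upper bound, not a measurement. The names below are hypothetical.

```python
CACHE_LINE_BYTES = 32   # assumed line size; matches the .align 5 spacing
INSN_BYTES = 2          # SH instructions use a fixed 16-bit encoding


def footprint(lines: int) -> int:
    """Total bytes occupied by `lines` mostly-empty cache lines."""
    return lines * CACHE_LINE_BYTES


def padding_upper_bound(lines: int) -> int:
    """Padding if every line held only a branch and its delay slot (case 1)."""
    useful = 2 * INSN_BYTES  # branch + delay-slot instruction
    return lines * (CACHE_LINE_BYTES - useful)


print(footprint(94), "bytes occupied by the 94 lines")
print(padding_upper_bound(94), "bytes of padding at most (all-case-1 assumption)")
```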

So basically there are two factors which create this situation:

1) GCC is now generating two relative branches instead of one absolute
   jump for some reason, and

2) Branch targets are cache-line aligned, so the second branch of the
   branch pair winds up occupying an entire cache line.

I don't really like this idea of generating two relative branches.
It's bad for instruction prefetching, and obviously creates a lot of
ancillary problems.

Toshi


