This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.

Re: Serious code size regression from 3.0.2 to now

From: tm <tm at mail dot kloo dot net>
To: Joern Rennecke <joern dot rennecke at superh dot com>
Cc: gcc-bugs at gcc dot gnu dot org, stephen dot clarke at superh dot com, shumpei dot kawasaki at hsa dot hitachi dot com
Date: Thu, 18 Jul 2002 15:15:00 -0700 (PDT)
Subject: Re: Serious code size regression from 3.0.2 to now

On Thu, 18 Jul 2002, Joern Rennecke wrote:

> tm wrote:
> > Basically, GCC is now generating HUGE groups of jump instructions which
> > are aligned to 32-byte boundaries.
> 
> Was it not generating these lone jump instructions before,
> or did it align them less?

Okay, I've done a bit more investigation, and hopefully I'm understanding
this better.

Before, GCC was generating this sequence:

	mov.l	L_label,r0
	jmp	@r0
	ins

Now, GCC is generating this sequence:

	bra	L_label
	ins

	.align	5
L_label:
	bra	L_label2
	ins

L_label2:

This is fine for isolated cases. However, in a large function you wind up
with many of these stacked relative branches, and you get:

	.align	5
L_label:
	bra	L_label2
	ins
	.align	5
L_label3:
	bra	L_label4
	ins

and each branch, together with its delay-slot instruction, winds up in its
own cache line, like this:

 15756                  .L2282:
 15757 7ba0 AF19                bra     .L2275
 15758 7ba2 6013                mov     r1,r0
 15759 7ba4 00090009            .align 5
 15759      00090009
 15759      00090009
 15759      00090009
 15759      00090009
 15760                  .L2279:
 15761 7bc0 AEFF                bra     .L2887
 15762 7bc2 4011                cmp/pz  r0
 15763 7bc4 00090009            .align 5
 15763      00090009
 15763      00090009
 15763      00090009
 15763      00090009
 15764                  .L2276:
 15765 7be0 AEE5                bra     .L2888
 15766 7be2 4811                cmp/pz  r8
 15767 7be4 00090009            .align 5
 15767      00090009
 15767      00090009
 15767      00090009
 15767      00090009
...
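
To put a number on the padding in the listing above, here is a minimal
sketch that recomputes it from the addresses shown. It assumes a 32-byte
cache line (which is what .align 5 implies) and that each block is exactly
one 2-byte bra plus one 2-byte delay-slot instruction, as in the listing.
The function and variable names are made up for illustration.

```python
CACHE_LINE = 32        # bytes; .align 5 means 2**5
BLOCK_CODE_BYTES = 4   # one 2-byte bra + one 2-byte delay-slot instruction

# (label, address of the bra) taken from the listing above
BLOCKS = [(".L2282", 0x7BA0), (".L2279", 0x7BC0), (".L2276", 0x7BE0)]


def padding_to_boundary(addr, align=CACHE_LINE):
    """Bytes of padding needed to move addr up to the next multiple of align."""
    return (-addr) % align


def padding_report(blocks):
    """Return (label, address after the code, padding bytes) for each block."""
    report = []
    for label, start in blocks:
        end = start + BLOCK_CODE_BYTES
        report.append((label, end, padding_to_boundary(end)))
    return report


if __name__ == "__main__":
    for label, end, pad in padding_report(BLOCKS):
        print(f"{label}: code ends at {end:#06x}, {pad} bytes of padding")
```

Under those assumptions, every one of the three blocks carries 4 bytes of
code followed by 28 bytes of padding.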

In map_fog.i/VideoDraw32OnlyFog32Alpha() alone I counted 94 (!) cache
lines that contain nothing but one of the following:

1. branch + delay slot
2. literal load + branch far + delay slot instruction
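
As a rough estimate of what that count costs, the arithmetic can be
sketched as below. This assumes 32-byte cache lines and, as a simplifying
assumption of mine, treats all 94 lines as case 1 (4 bytes of code each).
Case 2 lines hold more useful bytes, so the wasted figure is an upper
bound rather than a measurement. The helper name is hypothetical.

```python
CACHE_LINE = 32  # bytes, matching .align 5


def wasted_bytes(n_lines, useful_bytes_per_line, line_size=CACHE_LINE):
    """Padding bytes across n_lines cache lines that each hold only
    useful_bytes_per_line bytes of real code."""
    if not 0 <= useful_bytes_per_line <= line_size:
        raise ValueError("useful bytes must fit within one cache line")
    return n_lines * (line_size - useful_bytes_per_line)


N_LINES = 94  # count reported for VideoDraw32OnlyFog32Alpha()

occupied = N_LINES * CACHE_LINE      # total bytes taken up by these lines
padding = wasted_bytes(N_LINES, 4)   # assumes branch + delay slot only

if __name__ == "__main__":
    print(f"occupied: {occupied} bytes, padding (upper bound): {padding} bytes")
```

So these 94 lines occupy 3008 bytes of the function, of which up to 2632
bytes would be padding if every line were the simple branch + delay slot
case.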

So basically there are two factors which create this situation:

1) GCC is now generating two relative branches instead of one absolute
   jump for some reason, and

2) Branch targets are cache-line aligned, so the 2nd branch of the branch
   pair winds up occupying an entire cache line.

I don't really like this idea of generating two relative branches.
It's bad for instruction prefetching, and combined with the cache-line
alignment of branch targets it bloats the code with padding.

Toshi

References:
- Re: Serious code size regression from 3.0.2 to now
  - From: Joern Rennecke
