This is the mail archive of the mailing list for the GCC project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Suboptimal bb ordering with -Os on arm

Hi all,

In the course of doing some benchmarks on arm with -Os, I noticed that
some list traversal code has become significantly slower since gcc 5.3
when instruction caches are cold.

Reproducer with relevant defines copied from the Linux kernel:

  struct list_head {
  	struct list_head *next, *prev;
  };

  #define list_for_each(pos, head) \
  	for (pos = (head)->next; pos != (head); pos = pos->next)

  void g(struct list_head *l);
  void h(void);

  void f(struct list_head *head)
  {
  	struct list_head *l;

  	list_for_each(l, head)
  		g(l);

  	h();
  }

This behaviour can be traced back to
r228318 ("bb-reorder: Add -freorder-blocks-algorithm= and wire it up").

From this commit up to and including current svn trunk, the output of
the above snippet compiled with -Os looks like:

  00000000 <f>:
     0:	e92d4070 	push	{r4, r5, r6, lr}
     4:	e1a05000 	mov	r5, r0
     8:	e5904000 	ldr	r4, [r0]
     c:	e1540005 	cmp	r4, r5
    10:	1a000001 	bne	1c <f+0x1c>
    14:	e8bd4070 	pop	{r4, r5, r6, lr}
    18:	eafffffe 	b	0 <h>
    1c:	e1a00004 	mov	r0, r4
    20:	ebfffffe 	bl	0 <g>
    24:	e5944000 	ldr	r4, [r4]
    28:	eafffff7 	b	c <f+0xc>

Observe how the bb corresponding to the loop's body, i.e. the one
invoking g(), is reached via a conditional *forward* branch, which is
bad for static branch prediction and thus for prefetching.

Compare this to gcc 4.9's output with -Os or alternatively, trunk's
output with -Os -freorder-blocks-algorithm=stc:

  00000000 <f>:
     0:	e92d4070 	push	{r4, r5, r6, lr}
     4:	e1a05000 	mov	r5, r0
     8:	e5904000 	ldr	r4, [r0]
     c:	e1540005 	cmp	r4, r5
    10:	0a000003 	beq	24 <f+0x24>
    14:	e1a00004 	mov	r0, r4
    18:	ebfffffe 	bl	0 <g>
    1c:	e5944000 	ldr	r4, [r4]
    20:	eafffff9 	b	c <f+0xc>
    24:	e8bd4070 	pop	{r4, r5, r6, lr}
    28:	eafffffe 	b	0 <h>

Here, the loop body's bb is on the fallthrough path of the loop's
conditional branch, which is good.

Measurements on a memory-stressed system, i.e. one where the instruction
caches can be considered cold at the loop's first iteration, do indeed
show runtime differences of ~0.5us on average.

I believe that this drop in performance is due to the code of g() not
getting prefetched properly.

That being said, I could certainly go and submit a patch to the Linux
kernel setting -freorder-blocks-algorithm=stc for the -Os case.
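If it came to that, the kernel-side patch would presumably boil down to
something like the following (a hypothetical sketch against the
top-level Makefile; cc-option guards against compilers that don't know
the flag):

```make
# Hypothetical: restore stc block reordering when optimizing for size
ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
KBUILD_CFLAGS	+= $(call cc-option,-freorder-blocks-algorithm=stc)
endif
```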

Before I do this, I'd be glad to hear your opinions about this though as
I can't really quantify the implications.

From the discussion on gcc-patches [1] of what is now the aforementioned
r228318 ("bb-reorder: Add -freorder-blocks-algorithm= and wire it up"),
it is not clear to me whether this change can actually reduce code size
beyond those 0.1% given there for -Os.

A sloppy test with an admittedly heavily stripped-down Linux kernel
gives me exactly these 0.1%, too.

So, first question:
Do you know of any code where more significant code size savings are
achieved?

And second question:
If that isn't the case, would it make sense to partly revert gcc's
behaviour and default to -freorder-blocks-algorithm=stc at -Os?

Thank you very much!


