This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

new entry for 'optimizer inadequacies' page (attn loop experts)


Loop optimization really isn't doing its job these days.  This talks
mostly about failure to hoist loads of loop invariants, but I've seen
failure to hoist invariant calculations too, and the code being
generated for exit tests is just bizarre.

I've added an HTMLized version of this to the "optimizer inadequacies"
webpage.

---

If presented with even slightly complicated looping code, GCC may fail
to extract all the invariants from the loops, even when there are
plenty of registers.

Consider the following code, which is a trimmed down version of a real
function that does something sensible.

unsigned char *
read_and_prescan (ip, len, speccase)
     unsigned char *ip;
     unsigned int len;
     unsigned char *speccase;
{
  unsigned char *buf = malloc (len);
  unsigned char *input_buffer = malloc (4096);
  unsigned char *ibase, *op;
  int deferred_newlines;

  op = buf;
  ibase = input_buffer + 2;
  deferred_newlines = 0;

  for (;;)
    {
      unsigned int span = 0;

      if (deferred_newlines)
	{
	  while (speccase[ip[span]] == 0
		 && ip[span] != '\n'
		 && ip[span] != '\t'
		 && ip[span] != ' ')
	    span++;
	  memcpy (op, ip, span);
	  op += span;
	  ip += span;
	  if (speccase[ip[0]] == 0)
	    while (deferred_newlines)
	      deferred_newlines--, *op++ = '\r';
	  span = 0;
	}

      while (speccase[ip[span]] == 0) span++;
      memcpy (op, ip, span);
      op += span;
      ip += span;
      if (*ip == '\0')
	break;
    }
  return buf;
}


We're going to look exclusively at the code generated for the
innermost three loops.  This one is the most important:

while (speccase[ip[span]] == 0) span++;

which is compiled to

.L12:
        xorl    %esi, %esi
.L6:
        movzbl  (%esi,%ebx), %eax
        movl    16(%ebp), %edx
        cmpb    $0, (%eax,%edx)
        jne     .L19
        .p2align 4
.L20:
        incl    %esi
*       movl    16(%ebp), %edx
        movzbl  (%esi,%ebx), %eax
        cmpb    $0, (%eax,%edx)
        je      .L20
.L19:

To start, look at the line marked with a star.  There is no way to
reach label .L20 except by going through the block starting at .L6.
Register %edx is not modified inside the loop, and neither is the
stack slot it's being loaded from.  So why wasn't that load deleted?

Then there's the matter of the entire loop test being duplicated.
When the body of the loop is large, that's a good move, but here it
doubles the size of the code.  The loop optimizer should have the
brains to start the counter at -1, and emit instead

.L12:
        movl    $-1, %esi
        movl    16(%ebp), %edx
        .p2align 4
.L20:
        incl    %esi
        movzbl  (%esi,%ebx), %eax
        cmpb    $0, (%eax,%edx)
        je      .L20

The next loop is

while (deferred_newlines)
      deferred_newlines--, *op++ = '\r';

This compiles to

        movl    -20(%ebp), %ecx
        testl   %ecx, %ecx
        je      .L12
        .p2align 4
.L15:
        movb    $13, (%edi)
        incl    %edi
*       decl    -20(%ebp)
        jne     .L15


This is the same problem, but with a value that's modified.  It loaded
-20(%ebp) into a register, but then forgot about it and started doing
read-mod-write on a memory location, which is horrible.

Finally, the topmost loop:

  while (speccase[ip[span]] == 0
	 && ip[span] != '\n'
	 && ip[span] != '\t'
	 && ip[span] != ' ')
    span++;

compiles to

        movl    $0, %esi
        movzbl  (%ebx), %eax
        movl    16(%ebp), %edx
        cmpb    $0, (%eax,%edx)
        jne     .L8
        movb    (%ebx), %al
        jmp     .L22
        .p2align 4,,7
.L9:
        incl    %esi
*       movl    16(%ebp), %edx
        movzbl  (%esi,%ebx), %eax
        cmpb    $0, (%eax,%edx)
        jne     .L8
*       movb    (%esi,%ebx), %al
.L22:
        cmpb    $10, %al
        je      .L8
        cmpb    $9, %al
        je      .L8
        cmpb    $32, %al
        jne     .L9
.L8:

Exact same problem: a pointer is fetched on every trip through the
loop, despite the fact that that register is never used for anything
else.  Also, note that the value we need in %al for the comparison
sequence starting at .L22 is already there, but we fetch it again
anyway.  (This may avoid a partial register stall, in which case it's
the right thing.)  And we've got an odd split loop test, with half
duplicated and half not.

If you look at the source code carefully, you might notice another
oddity: deferred_newlines is set to zero before the outer loop, and
never modified again except inside an if block that will only be
executed if it's nonzero.  Therefore, that if block is dead code, and
should have been deleted.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]