[Bug c/31792] New: ivopts generates slightly worse code for tight loop from zlib

Wed May 2 18:35:00 GMT 2007

The loop in question is isolated from zlib 1.2.3's longest_match() function (it
s longest_match()'s inner-most loop). Given two char * pointers, scan and
match, it figures out their longest common prefix. After every 8 character
comparisons, it also tests whether scan has reached the end.

int inner_longest_match(char *scan, char *match, char *strend)
{
  char *start_scan = scan;
  do {
  } while (*++scan == *++match && *++scan == *++match &&
           *++scan == *++match && *++scan == *++match &&
           *++scan == *++match && *++scan == *++match &&
           *++scan == *++match && *++scan == *++match &&
           scan < strend);

  return scan - start_scan;
}

Verison: gcc version 4.3.0 20070425 (experimental)

Compilation: /usr/local/mygcc/bin/gcc -g -O2 -c lmatch.c -o lmatch.o . This has
ivopts turned on by default. The generated code is:

  11:   0f b6 42 01             movzbl 0x1(%edx),%eax
  15:   8d 5a 01                lea    0x1(%edx),%ebx
  18:   3a 41 01                cmp    0x1(%ecx),%al
  1b:   75 64                   jne    81 <inner_longest_match+0x81>
  1d:   0f b6 42 02             movzbl 0x2(%edx),%eax
  21:   8d 5a 02                lea    0x2(%edx),%ebx
  24:   3a 41 02                cmp    0x2(%ecx),%al
  27:   75 58                   jne    81 <inner_longest_match+0x81>
  29:   0f b6 42 03             movzbl 0x3(%edx),%eax
  2d:   8d 5a 03                lea    0x3(%edx),%ebx
  30:   3a 41 03                cmp    0x3(%ecx),%al
  33:   75 4c                   jne    81 <inner_longest_match+0x81>
  35:   0f b6 42 04             movzbl 0x4(%edx),%eax
  39:   8d 5a 04                lea    0x4(%edx),%ebx
  3c:   3a 41 04                cmp    0x4(%ecx),%al
  3f:   75 40                   jne    81 <inner_longest_match+0x81>
  41:   0f b6 42 05             movzbl 0x5(%edx),%eax
  45:   8d 5a 05                lea    0x5(%edx),%ebx
  48:   3a 41 05                cmp    0x5(%ecx),%al
  4b:   75 34                   jne    81 <inner_longest_match+0x81>
  4d:   0f b6 42 06             movzbl 0x6(%edx),%eax
  51:   8d 5a 06                lea    0x6(%edx),%ebx
  54:   3a 41 06                cmp    0x6(%ecx),%al
  57:   75 28                   jne    81 <inner_longest_match+0x81>
  59:   0f b6 42 07             movzbl 0x7(%edx),%eax
  5d:   8d 5a 07                lea    0x7(%edx),%ebx
  60:   3a 41 07                cmp    0x7(%ecx),%al
  63:   75 1c                   jne    81 <inner_longest_match+0x81>
  65:   83 c2 08                add    $0x8,%edx
  68:   8d 41 08                lea    0x8(%ecx),%eax
  6b:   89 c1                   mov    %eax,%ecx
  6d:   0f b6 02                movzbl (%edx),%eax
  70:   3a 01                   cmp    (%ecx),%al
  72:   75 04                   jne    78 <inner_longest_match+0x78>
  74:   39 d6                   cmp    %edx,%esi
  76:   77 99                   ja     11 <inner_longest_match+0x11>

Let's take a look at one "element" of the  comparison - i.e., the sixth *++scan
== *++match

  41:   0f b6 42 05             movzbl 0x5(%edx),%eax
  45:   8d 5a 05                lea    0x5(%edx),%ebx
  48:   3a 41 05                cmp    0x5(%ecx),%al
  4b:   75 34                   jne    81 <inner_longest_match+0x81>

It's doing the following:
- load the character at scan_start_loop + 5 into eax
- put scan_start_loop + 5 into ebx (if this is the element that breaks the
comparison, we need the last scan value)
- compare the lowest byte in eax to the byte at match_start_loop + 5

Ok, what's the problem? The problem is that scan  actually outlives the loop -
it's needed for the return value. So for each of the eight elements, gcc
(actually ivopts) copies scan_start_loop + n into a temporary. An extra
register is needed.

Without ivopts (i.e., -fno-ivopts), the generated code is:

  10:   83 c2 01                add    $0x1,%edx
  13:   0f b6 02                movzbl (%edx),%eax
  16:   3a 41 01                cmp    0x1(%ecx),%al
  19:   75 53                   jne    6e <inner_longest_match+0x6e>
  1b:   83 c2 01                add    $0x1,%edx
  1e:   0f b6 02                movzbl (%edx),%eax
  21:   3a 41 02                cmp    0x2(%ecx),%al
  24:   75 48                   jne    6e <inner_longest_match+0x6e>
  26:   83 c2 01                add    $0x1,%edx
  29:   0f b6 02                movzbl (%edx),%eax
  2c:   3a 41 03                cmp    0x3(%ecx),%al
  2f:   75 3d                   jne    6e <inner_longest_match+0x6e>
  31:   83 c2 01                add    $0x1,%edx
  34:   0f b6 02                movzbl (%edx),%eax
  37:   3a 41 04                cmp    0x4(%ecx),%al
  3a:   75 32                   jne    6e <inner_longest_match+0x6e>
  3c:   83 c2 01                add    $0x1,%edx
  3f:   0f b6 02                movzbl (%edx),%eax
  42:   3a 41 05                cmp    0x5(%ecx),%al
  45:   75 27                   jne    6e <inner_longest_match+0x6e>
  47:   83 c2 01                add    $0x1,%edx
  4a:   0f b6 02                movzbl (%edx),%eax
  4d:   3a 41 06                cmp    0x6(%ecx),%al
  50:   75 1c                   jne    6e <inner_longest_match+0x6e>
  52:   83 c2 01                add    $0x1,%edx
  55:   0f b6 02                movzbl (%edx),%eax
  58:   3a 41 07                cmp    0x7(%ecx),%al
  5b:   75 11                   jne    6e <inner_longest_match+0x6e>
  5d:   83 c2 01                add    $0x1,%edx
  60:   83 c1 08                add    $0x8,%ecx
  63:   0f b6 02                movzbl (%edx),%eax
  66:   3a 01                   cmp    (%ecx),%al
  68:   75 04                   jne    6e <inner_longest_match+0x6e>
  6a:   39 da                   cmp    %ebx,%edx
  6c:   72 a2                   jb     10 <inner_longest_match+0x10>

The sixth element:

  3c:   83 c2 01                add    $0x1,%edx
  3f:   0f b6 02                movzbl (%edx),%eax
  42:   3a 41 05                cmp    0x5(%ecx),%al
  45:   75 27                   jne    6e <inner_longest_match+0x6e>

The scan variable is no longer folded (it is instead incremented by one in each
element), and consequently this loop  no longer requires the ebx register for
the temporary. The match variable is still folded.   

The issue here is that folding scan, while reducing the overall dataflow
height, also requires an additional register. For register-starved
architectures ( i.e., x86-32), this can turn into additional spills. This
isolated example doesn't generate more spills. However, in real life this loop
is part of the longest_match function within  zlib's deflate.c, where it does
create more spills (compiling deflate.c with -fno-ivopts speeds up the
compression of high-redundancy content by ~1-2%). 

Finally, note that gcc 4.1.1 behaves similarly.

-- 
           Summary: ivopts generates slightly worse code for tight loop from
                    zlib
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: vlad at petric dot cc
 GCC build triplet: i686-pc-linux-gnu
  GCC host triplet: i686-pc-linux-gnu
GCC target triplet: i686-pc-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31792