[Bug c/31792] New: ivopts generates slightly worse code for tight loop from zlib
vlad at petric dot cc
gcc-bugzilla@gcc.gnu.org
Wed May 2 18:35:00 GMT 2007
The loop in question is isolated from zlib 1.2.3's longest_match() function (it
s longest_match()'s inner-most loop). Given two char * pointers, scan and
match, it figures out their longest common prefix. After every 8 character
comparisons, it also tests whether scan has reached the end.
int inner_longest_match(char *scan, char *match, char *strend)
{
char *start_scan = scan;
do {
} while (*++scan == *++match && *++scan == *++match &&
*++scan == *++match && *++scan == *++match &&
*++scan == *++match && *++scan == *++match &&
*++scan == *++match && *++scan == *++match &&
scan < strend);
return scan - start_scan;
}
Verison: gcc version 4.3.0 20070425 (experimental)
Compilation: /usr/local/mygcc/bin/gcc -g -O2 -c lmatch.c -o lmatch.o . This has
ivopts turned on by default. The generated code is:
11: 0f b6 42 01 movzbl 0x1(%edx),%eax
15: 8d 5a 01 lea 0x1(%edx),%ebx
18: 3a 41 01 cmp 0x1(%ecx),%al
1b: 75 64 jne 81 <inner_longest_match+0x81>
1d: 0f b6 42 02 movzbl 0x2(%edx),%eax
21: 8d 5a 02 lea 0x2(%edx),%ebx
24: 3a 41 02 cmp 0x2(%ecx),%al
27: 75 58 jne 81 <inner_longest_match+0x81>
29: 0f b6 42 03 movzbl 0x3(%edx),%eax
2d: 8d 5a 03 lea 0x3(%edx),%ebx
30: 3a 41 03 cmp 0x3(%ecx),%al
33: 75 4c jne 81 <inner_longest_match+0x81>
35: 0f b6 42 04 movzbl 0x4(%edx),%eax
39: 8d 5a 04 lea 0x4(%edx),%ebx
3c: 3a 41 04 cmp 0x4(%ecx),%al
3f: 75 40 jne 81 <inner_longest_match+0x81>
41: 0f b6 42 05 movzbl 0x5(%edx),%eax
45: 8d 5a 05 lea 0x5(%edx),%ebx
48: 3a 41 05 cmp 0x5(%ecx),%al
4b: 75 34 jne 81 <inner_longest_match+0x81>
4d: 0f b6 42 06 movzbl 0x6(%edx),%eax
51: 8d 5a 06 lea 0x6(%edx),%ebx
54: 3a 41 06 cmp 0x6(%ecx),%al
57: 75 28 jne 81 <inner_longest_match+0x81>
59: 0f b6 42 07 movzbl 0x7(%edx),%eax
5d: 8d 5a 07 lea 0x7(%edx),%ebx
60: 3a 41 07 cmp 0x7(%ecx),%al
63: 75 1c jne 81 <inner_longest_match+0x81>
65: 83 c2 08 add $0x8,%edx
68: 8d 41 08 lea 0x8(%ecx),%eax
6b: 89 c1 mov %eax,%ecx
6d: 0f b6 02 movzbl (%edx),%eax
70: 3a 01 cmp (%ecx),%al
72: 75 04 jne 78 <inner_longest_match+0x78>
74: 39 d6 cmp %edx,%esi
76: 77 99 ja 11 <inner_longest_match+0x11>
Let's take a look at one "element" of the comparison - i.e., the sixth *++scan
== *++match
41: 0f b6 42 05 movzbl 0x5(%edx),%eax
45: 8d 5a 05 lea 0x5(%edx),%ebx
48: 3a 41 05 cmp 0x5(%ecx),%al
4b: 75 34 jne 81 <inner_longest_match+0x81>
It's doing the following:
- load the character at scan_start_loop + 5 into eax
- put scan_start_loop + 5 into ebx (if this is the element that breaks the
comparison, we need the last scan value)
- compare the lowest byte in eax to the byte at match_start_loop + 5
Ok, what's the problem? The problem is that scan actually outlives the loop -
it's needed for the return value. So for each of the eight elements, gcc
(actually ivopts) copies scan_start_loop + n into a temporary. An extra
register is needed.
Without ivopts (i.e., -fno-ivopts), the generated code is:
10: 83 c2 01 add $0x1,%edx
13: 0f b6 02 movzbl (%edx),%eax
16: 3a 41 01 cmp 0x1(%ecx),%al
19: 75 53 jne 6e <inner_longest_match+0x6e>
1b: 83 c2 01 add $0x1,%edx
1e: 0f b6 02 movzbl (%edx),%eax
21: 3a 41 02 cmp 0x2(%ecx),%al
24: 75 48 jne 6e <inner_longest_match+0x6e>
26: 83 c2 01 add $0x1,%edx
29: 0f b6 02 movzbl (%edx),%eax
2c: 3a 41 03 cmp 0x3(%ecx),%al
2f: 75 3d jne 6e <inner_longest_match+0x6e>
31: 83 c2 01 add $0x1,%edx
34: 0f b6 02 movzbl (%edx),%eax
37: 3a 41 04 cmp 0x4(%ecx),%al
3a: 75 32 jne 6e <inner_longest_match+0x6e>
3c: 83 c2 01 add $0x1,%edx
3f: 0f b6 02 movzbl (%edx),%eax
42: 3a 41 05 cmp 0x5(%ecx),%al
45: 75 27 jne 6e <inner_longest_match+0x6e>
47: 83 c2 01 add $0x1,%edx
4a: 0f b6 02 movzbl (%edx),%eax
4d: 3a 41 06 cmp 0x6(%ecx),%al
50: 75 1c jne 6e <inner_longest_match+0x6e>
52: 83 c2 01 add $0x1,%edx
55: 0f b6 02 movzbl (%edx),%eax
58: 3a 41 07 cmp 0x7(%ecx),%al
5b: 75 11 jne 6e <inner_longest_match+0x6e>
5d: 83 c2 01 add $0x1,%edx
60: 83 c1 08 add $0x8,%ecx
63: 0f b6 02 movzbl (%edx),%eax
66: 3a 01 cmp (%ecx),%al
68: 75 04 jne 6e <inner_longest_match+0x6e>
6a: 39 da cmp %ebx,%edx
6c: 72 a2 jb 10 <inner_longest_match+0x10>
The sixth element:
3c: 83 c2 01 add $0x1,%edx
3f: 0f b6 02 movzbl (%edx),%eax
42: 3a 41 05 cmp 0x5(%ecx),%al
45: 75 27 jne 6e <inner_longest_match+0x6e>
The scan variable is no longer folded (it is instead incremented by one in each
element), and consequently this loop no longer requires the ebx register for
the temporary. The match variable is still folded.
The issue here is that folding scan, while reducing the overall dataflow
height, also requires an additional register. For register-starved
architectures ( i.e., x86-32), this can turn into additional spills. This
isolated example doesn't generate more spills. However, in real life this loop
is part of the longest_match function within zlib's deflate.c, where it does
create more spills (compiling deflate.c with -fno-ivopts speeds up the
compression of high-redundancy content by ~1-2%).
Finally, note that gcc 4.1.1 behaves similarly.
--
Summary: ivopts generates slightly worse code for tight loop from
zlib
Product: gcc
Version: 4.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vlad at petric dot cc
GCC build triplet: i686-pc-linux-gnu
GCC host triplet: i686-pc-linux-gnu
GCC target triplet: i686-pc-linux-gnu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31792
More information about the Gcc-bugs
mailing list