This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: 2.95, x86: severe performance problems with short arithmetic
- To: Richard Henderson <rth@cygnus.com>
- Subject: Re: 2.95, x86: severe performance problems with short arithmetic
- From: Zack Weinberg <zack@bitmover.com>
- Date: Wed, 01 Sep 1999 22:55:29 -0700
- cc: gcc@gcc.gnu.org
Richard Henderson wrote:
> On Thu, Aug 12, 1999 at 01:20:03PM -0700, Zack Weinberg wrote:
> > What we actually want is to do everything in SImode until the results
> > become visible outside a function - returning, or write to memory.
>
> Or comparison, or anything else that could detect overflow,
> like another addition.
Only if we test for the overflow, though.
Now seems like a good time to revisit this. Here's my test case
again.
unsigned short
cksum(unsigned char *buf1, unsigned char *buf2)
{
    unsigned short sum = 0;
    unsigned char *p, *q, c;

    p = buf1;
    q = buf2;
    for (;;) {
        c = *p++;
        if (c == '\0') break;
        sum += c;
        *q++ = c;
        if (c == '\n') break;
    }
    return (sum);
}
Here is a side-by-side diff of the 2.95.1 release (left) against CVS
right after the new_ia32 merge (right), both with -O2 -mpentiumpro
-fomit-frame-pointer. I reformatted the 2.95.1 side to match. (It's
much easier to read the new way - thanks!)
    .file "ck-s.c"                     .file "ck-s.c"
    .version "01.01"                   .version "01.01"
gcc2_compiled.:                    gcc2_compiled.:
    .text                              .text
    .align 4                     |     .align 16
    .globl cksum                       .globl cksum
    .type cksum,@function              .type cksum,@function
cksum:                             cksum:
    pushl %esi                         pushl %esi
                                 >     movw $0, %si
    pushl %ebx                         pushl %ebx
    movl 12(%esp), %ebx                movl 12(%esp), %ebx
    movl 16(%esp), %ecx                movl 16(%esp), %ecx
    xorl %esi, %esi              <
    .p2align 4                         .p2align 4
.L3:                               .L3:
    movb (%ebx), %dl                   movb (%ebx), %dl
    incl %ebx                          incl %ebx
    testb %dl, %dl                     testb %dl, %dl
    je .L4                             je .L4
    movzbw %dl, %ax              <
    addl %eax, %esi              <
    movb %dl, (%ecx)                   movb %dl, (%ecx)
                                 >     movzbw %dl, %ax
                                 >     addw %ax, %si
    incl %ecx                          incl %ecx
    cmpb $10, %dl                      cmpb $10, %dl
    jne .L3                            jne .L3
.L4:                               .L4:
    movzwl %si, %eax             <
    popl %ebx                          popl %ebx
                                 >     movzwl %si, %eax
    popl %esi                          popl %esi
    ret                                ret
.Lfe1:                             .Lfe1:
    .size cksum,.Lfe1-cksum            .size cksum,.Lfe1-cksum
    .ident "2.95.1 19990816"     |     .ident "2.96 19990901"
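Ignoring the different schedule, the difference is that new_ia32 ducks
the partial register stall (2.95.1 writes %ax with movzbw and then
reads %eax in the addl) and instead takes a slight decode penalty on
the 16-bit movw and addw, which need an operand-size prefix.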
For this function, we would like gcc to treat the 'sum' and 'c'
variables as unsigned int. On the right-hand side of this sdiff, I
made that change in the source by hand. Both sides are new_ia32 this
time.
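For reference, the source change can be sketched as below. This is a
reconstruction, not the exact contents of ck-i.c, and the names
cksum_short and cksum_int are made up here so that both versions can
sit in one file. The widening is safe because 'sum' is only ever
observed through the unsigned short return value: adding in 32 bits
and truncating at the end gives the same low 16 bits as adding in 16
bits all along, even when the sum wraps.

```c
/* Original form (ck-s.c): 'sum' is unsigned short, 'c' is unsigned char. */
unsigned short
cksum_short(unsigned char *buf1, unsigned char *buf2)
{
    unsigned short sum = 0;
    unsigned char *p, *q, c;

    p = buf1;
    q = buf2;
    for (;;) {
        c = *p++;
        if (c == '\0') break;
        sum += c;
        *q++ = c;
        if (c == '\n') break;
    }
    return (sum);
}

/* Widened form (assumed shape of ck-i.c): 'sum' and 'c' are unsigned int.
   The return statement still truncates to unsigned short, and the store
   through q still writes a single byte. */
unsigned short
cksum_int(unsigned char *buf1, unsigned char *buf2)
{
    unsigned int sum = 0, c;
    unsigned char *p, *q;

    p = buf1;
    q = buf2;
    for (;;) {
        c = *p++;
        if (c == '\0') break;
        sum += c;
        *q++ = c;
        if (c == '\n') break;
    }
    return (sum);
}
```

Both functions copy the same bytes and return the same checksum for
any input line; the only difference is the width of the intermediate
arithmetic.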
    .file "ck-s.c"               |     .file "ck-i.c"
    .version "01.01"                   .version "01.01"
gcc2_compiled.:                    gcc2_compiled.:
    .text                              .text
    .align 16                          .align 16
    .globl cksum                       .globl cksum
    .type cksum,@function              .type cksum,@function
cksum:                             cksum:
    pushl %esi                   <
    movw $0, %si                 <
    pushl %ebx                         pushl %ebx
    movl 12(%esp), %ebx          |     movl 8(%esp), %eax
    movl 16(%esp), %ecx          |     movl $0, %ebx
                                 >     movl 12(%esp), %ecx
    .p2align 4                         .p2align 4
.L3:                               .L3:
    movb (%ebx), %dl             |     movzbl (%eax), %edx
    incl %ebx                    |     incl %eax
    testb %dl, %dl               |     testl %edx, %edx
    je .L4                             je .L4
    movb %dl, (%ecx)                   movb %dl, (%ecx)
    movzbw %dl, %ax              |     addl %edx, %ebx
    addw %ax, %si                <
    incl %ecx                          incl %ecx
    cmpb $10, %dl                |     cmpl $10, %edx
    jne .L3                            jne .L3
.L4:                               .L4:
                                 >     movzwl %bx, %eax
    popl %ebx                          popl %ebx
    movzwl %si, %eax             <
    popl %esi                    <
    ret                                ret
.Lfe1:                             .Lfe1:
    .size cksum,.Lfe1-cksum            .size cksum,.Lfe1-cksum
    .ident "2.96 19990901"       |     .ident "2.96 19990901"
The result: no register stalls, no decode penalties except on the
memory fetch, and one fewer register used. Barring an unroll, I think
this is about as good as it gets.
Now, how we teach the compiler to make this transform is another
story. PROMOTE_MODE doesn't work by itself: the math is still done in
the original mode, with fixups afterward. WORD_REGISTER_OPERATIONS
(suggested by Jeff) doesn't do anything useful and may in fact break
reload.
I dinked with define_split expressions in an effort to convince the
compiler that

    (set (reg:HI 22)
         (add:HI (subreg:HI (reg:SI 20) 0)
                 (subreg:HI (reg:SI 21) 0)))
    (set (reg:SI 23) (zero_extend:SI (reg:HI 22)))

- this is the sort of RTL you get with PROMOTE_MODE - could be
rewritten to

    (set (reg:SI 23) (add:SI (reg:SI 20) (reg:SI 21)))

but this did not work; in fact, it lost the movzwl at the end of the
function, which is just wrong.
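Why losing the zero extension is wrong can be seen with plain
integers. The sketch below models the two RTL forms in C; the function
names are invented for illustration and are not anything in gcc. The
HImode add followed by zero_extend wraps at 16 bits, while the SImode
add keeps the carry in bit 16. The two agree only in their low 16
bits, so the rewrite is valid only when something later truncates the
result, which is exactly what the final movzwl does.

```c
#include <stdint.h>

/* Models (zero_extend:SI (add:HI a b)): add in 16 bits, then widen. */
uint32_t
add_hi_then_extend(uint32_t a, uint32_t b)
{
    uint16_t sum16 = (uint16_t)((uint16_t)a + (uint16_t)b);
    return (uint32_t)sum16;
}

/* Models (add:SI a b): add in 32 bits, no truncation. */
uint32_t
add_si(uint32_t a, uint32_t b)
{
    return a + b;
}

/* The two forms always agree once the SImode result is truncated. */
int
low16_agree(uint32_t a, uint32_t b)
{
    return (uint16_t)add_si(a, b) == add_hi_then_extend(a, b);
}
```

For a = 0xFFFF and b = 1 the first form yields 0 and the second yields
0x10000, so the untruncated SImode result is not a drop-in replacement.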
RTL doesn't seem to have a notion of input-language variables. I'm
wondering if this is another optimization we should be doing at the
tree level (and therefore can't do with the current front end).
zw