This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: 2.95, x86: severe performance problems with short arithmetic


Richard Henderson wrote:
> On Thu, Aug 12, 1999 at 01:20:03PM -0700, Zack Weinberg wrote:
> > What we actually want is to do everything in SImode until the results
> > become visible outside a function - returning, or write to memory.
> 
> Or comparison, or anything else that could detect overflow,
> like another addition.

Only if we test for the overflow, though.
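
A small C sketch of that distinction, with made-up function names and
values chosen only for illustration.  It assumes 16-bit short and 32-bit
int, as on x86.  Feeding a wide intermediate into another addition still
gives the right low 16 bits once the result is truncated; feeding it into
a comparison does not.

```c
#include <assert.h>
#include <stdint.h>

/* Narrow: truncate to 16 bits after every addition. */
uint16_t
sum3_narrow(uint16_t a, uint16_t b, uint16_t c)
{
	uint16_t t = (uint16_t)(a + b);
	return (uint16_t)(t + c);
}

/* Wide: keep the intermediate in 32 bits, truncate once at the end. */
uint16_t
sum3_wide(uint16_t a, uint16_t b, uint16_t c)
{
	uint32_t t = (uint32_t)a + b;
	return (uint16_t)(t + c);
}

/* Narrow comparison: sees the wrapped 16-bit value. */
int
less_narrow(uint16_t a, uint16_t b, uint16_t limit)
{
	uint16_t t = (uint16_t)(a + b);
	return t < limit;
}

/* Wide comparison: sees the carry into bit 16, so it can disagree. */
int
less_wide(uint16_t a, uint16_t b, uint16_t limit)
{
	uint32_t t = (uint32_t)a + b;
	return t < limit;
}

/* Illustrative self-check: the sums agree, the comparisons do not. */
void
overflow_selfcheck(void)
{
	assert(sum3_narrow(0xFFFF, 2, 3) == sum3_wide(0xFFFF, 2, 3));
	assert(less_narrow(0xFFFF, 2, 10) != less_wide(0xFFFF, 2, 10));
}
```

So a chain of additions can stay wide, as long as the value is narrowed
before anything compares it or otherwise looks at the high bits.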

Now seems like a good time to revisit this.  Here's my test case
again.

unsigned short
cksum(unsigned char *buf1, unsigned char *buf2)
{
	unsigned short	sum = 0;
	unsigned char	*p, *q, c;

	p = buf1;
	q = buf2;
	for (;;) {
		c = *p++;
		if (c == '\0') break;
		sum += c;
		*q++ = c;
		if (c == '\n') break;
	}
	return (sum);
}

Here is a side-by-side diff of the 2.95.1 release (left) and CVS right
after the new_ia32 merge (right), both with -O2 -mpentiumpro
-fomit-frame-pointer.  I reformatted the 2.95.1 side to match.  (It's
much easier to read the new way - thanks!)

	.file	"ck-s.c"			.file	"ck-s.c"
	.version	"01.01"			.version	"01.01"
gcc2_compiled.:				gcc2_compiled.:
.text					.text
	.align 4		      |		.align 16
.globl cksum				.globl cksum
	.type	 cksum,@function		.type	 cksum,@function
cksum:					cksum:
	pushl	%esi				pushl	%esi
				      >		movw	$0, %si
	pushl	%ebx				pushl	%ebx
	movl	12(%esp), %ebx			movl	12(%esp), %ebx
	movl	16(%esp), %ecx			movl	16(%esp), %ecx
	xorl	%esi, %esi	      <
	.p2align 4				.p2align 4
.L3:					.L3:
	movb	(%ebx), %dl			movb	(%ebx), %dl
	incl	%ebx				incl	%ebx
	testb	%dl, %dl			testb	%dl, %dl
	je	.L4				je	.L4
	movzbw	%dl, %ax	      <
	addl	%eax, %esi	      <
	movb	%dl, (%ecx)			movb	%dl, (%ecx)
				      >		movzbw	%dl, %ax
				      >		addw	%ax, %si
	incl	%ecx				incl	%ecx
	cmpb	$10, %dl			cmpb	$10, %dl
	jne	.L3				jne	.L3
.L4:					.L4:
	movzwl	%si, %eax	      <
	popl	%ebx				popl	%ebx
				      >		movzwl	%si, %eax
	popl	%esi				popl	%esi
	ret					ret
.Lfe1:					.Lfe1:
	.size	 cksum,.Lfe1-cksum		.size	 cksum,.Lfe1-cksum
	.ident	"2.95.1 19990816"     |		.ident	"2.96 19990901"

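Ignoring the different schedule, the difference is that new_ia32 ducks
the partial register stall (2.95.1 writes %ax with movzbw and then reads
%eax in the addl) and instead takes a slight decode penalty for the
16-bit movw and addw.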

For this function, we would like gcc to treat the 'sum' and 'c'
variables as unsigned int.  On the right hand side of this sdiff, I
made that change in the source.  Both sides are new_ia32 this time.
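
For reference, the hand-widened source looks roughly like the second
function below.  This is a reconstruction, not the exact ck-i.c: the
names cksum_short and cksum_int are invented so that both versions fit in
one file, and the self-check is added only for illustration.  The two
return the same value because truncating to 16 bits once at the return
gives the same result as truncating after every addition.

```c
#include <assert.h>

/* Original (ck-s.c): sum and c keep their narrow types. */
unsigned short
cksum_short(unsigned char *buf1, unsigned char *buf2)
{
	unsigned short	sum = 0;
	unsigned char	*p, *q, c;

	p = buf1;
	q = buf2;
	for (;;) {
		c = *p++;
		if (c == '\0') break;
		sum += c;
		*q++ = c;
		if (c == '\n') break;
	}
	return (sum);
}

/* Hand-widened: sum and c are unsigned int; narrowing happens only at
   the store and at the return. */
unsigned short
cksum_int(unsigned char *buf1, unsigned char *buf2)
{
	unsigned int	sum = 0, c;
	unsigned char	*p, *q;

	p = buf1;
	q = buf2;
	for (;;) {
		c = *p++;
		if (c == '\0') break;
		sum += c;
		*q++ = (unsigned char)c;	/* c is already in 0..255 */
		if (c == '\n') break;
	}
	return ((unsigned short)sum);	/* the one truncation to 16 bits */
}

/* Illustrative self-check: both versions agree on a sample buffer. */
void
cksum_selfcheck(void)
{
	unsigned char in[] = "hello\nworld";
	unsigned char out1[16] = {0}, out2[16] = {0};

	assert(cksum_short(in, out1) == cksum_int(in, out2));
}
```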

	.file	"ck-s.c"	      |		.file	"ck-i.c"
	.version	"01.01"			.version	"01.01"
gcc2_compiled.:				gcc2_compiled.:
.text					.text
	.align 16				.align 16
.globl cksum				.globl cksum
	.type	 cksum,@function		.type	 cksum,@function
cksum:					cksum:
	pushl	%esi		      <
	movw	$0, %si		      <
	pushl	%ebx				pushl	%ebx
	movl	12(%esp), %ebx	      |		movl	8(%esp), %eax
	movl	16(%esp), %ecx	      |		movl	$0, %ebx
				      >		movl	12(%esp), %ecx
	.p2align 4				.p2align 4
.L3:					.L3:
	movb	(%ebx), %dl	      |		movzbl	(%eax), %edx
	incl	%ebx		      |		incl	%eax
	testb	%dl, %dl	      |		testl	%edx, %edx
	je	.L4				je	.L4
	movb	%dl, (%ecx)			movb	%dl, (%ecx)
	movzbw	%dl, %ax	      |		addl	%edx, %ebx
	addw	%ax, %si	      <
	incl	%ecx				incl	%ecx
	cmpb	$10, %dl	      |		cmpl	$10, %edx
	jne	.L3				jne	.L3
.L4:					.L4:
				      >		movzwl	%bx, %eax
	popl	%ebx				popl	%ebx
	movzwl	%si, %eax	      <
	popl	%esi		      <
	ret					ret
.Lfe1:					.Lfe1:
	.size	 cksum,.Lfe1-cksum		.size	 cksum,.Lfe1-cksum
	.ident	"2.96 19990901"	      |		.ident	"2.96 19990901"


- no register stalls, no decode penalties except on the memory fetch,
and one fewer register used.  Barring an unroll, I think this is about
as good as it gets.

Now, how we teach the compiler to make this transform is another
story.  PROMOTE_MODE doesn't work by itself: the math is still done in
the original mode, with fixups afterward.  WORD_REGISTER_OPERATIONS
(suggested by Jeff) doesn't do anything useful and may in fact break
reload.

I dinked with define_split expressions in an effort to convince the
compiler that

(set (reg:HI 22) 
   (add:HI (subreg:HI (reg:SI 20) 0)
           (subreg:HI (reg:SI 21) 0)))
(set (reg:SI 23) (zero_extend:SI (reg:HI 22)))

- this is the sort of RTL you get with PROMOTE_MODE - could be
rewritten to

(set (reg:SI 23) (add:SI (reg:SI 20) (reg:SI 21)))

but this did not work; in fact, it lost the movzwl at the end of the
function, which is just wrong.
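
To make the problem concrete, here is a C model of the two RTL forms.
It is only a sketch of their arithmetic, not compiler code, and the
function names are invented; the parameters are named after the pseudo
registers above.  The HImode add followed by zero_extend always yields a
value below 0x10000, while the plain SImode add can carry into bit 16,
so the two agree only in the low 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Models: (set (reg:HI 22) (add:HI (subreg:HI (reg:SI 20) 0)
                                    (subreg:HI (reg:SI 21) 0)))
           (set (reg:SI 23) (zero_extend:SI (reg:HI 22))) */
uint32_t
add_hi_then_extend(uint32_t r20, uint32_t r21)
{
	uint16_t r22 = (uint16_t)((uint16_t)r20 + (uint16_t)r21);
	return (uint32_t)r22;
}

/* Models: (set (reg:SI 23) (add:SI (reg:SI 20) (reg:SI 21))) */
uint32_t
add_si(uint32_t r20, uint32_t r21)
{
	return r20 + r21;
}

/* Illustrative self-check: full values differ on a carry out of bit 15,
   but the low 16 bits match. */
void
split_selfcheck(void)
{
	uint32_t narrow = add_hi_then_extend(0xFFFF, 1);
	uint32_t wide = add_si(0xFFFF, 1);

	assert(narrow != wide);
	assert((uint16_t)wide == (uint16_t)narrow);
}
```

The rewrite is therefore valid only if some later instruction still
truncates the result, which in cksum is the job of the final movzwl.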

RTL doesn't seem to have a notion of input-language variables.  I'm
wondering if this is another optimization we should be doing at the
tree level (and which we therefore can't do with the current front end).

zw
