This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Possible improvement in inline stringop code generation


Hi Folks,

GCC 4.5.1 20100924 with "-Os -minline-all-stringops" on Core i7:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main( int argc, char *argv[] )
{
  int i, a[256], b[256];

  for( i = 0; i < 256; ++i )  // discourage optimization
	a[i] = rand();

  memcpy( b, a, argc * sizeof(int) );

  printf( "%d\n", b[rand()] );  // discourage optimization

  return 0;
}

I wonder if it's possible to improve the code generation for inline stringops
when the length is known to be a multiple of 4 bytes.

That is, instead of:

	movsx   rcx, ebp    # argc
	sal rcx, 2
	rep movsb

it would be nice to see:

	movsx   rcx, ebp    # argc
	rep movsd

Note that  memcpy( b, a, 1024 ) generates:

	mov ecx, 256
	rep movsd

The reason I think this might be possible is this:

Use -mstringop-strategy=rep_4byte to force the use of movsd.
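
For reference, the kind of command line I mean is something like this (the
file name is just for illustration):

	gcc -Os -minline-all-stringops -mstringop-strategy=rep_4byte \
	    -masm=intel -S example.c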

For memcpy( b, a, argc * sizeof(int) ) we get:

	movsx   rcx, ebp    # argc
	sal rcx, 2
	cmp rcx, 4
	jb  .L5 #,
	shr rcx, 2
	rep movsd
.L5:


For memcpy( b, a, argc ) we get:

	movsx   rax, ebp    # argc, argc
	mov rdi, rsp    # tmp76,
	lea rsi, [rsp+1024] # tmp77,
	cmp rax, 4  # argc,
	jb  .L3 #,
	mov rcx, rax    # tmp78, argc
	shr rcx, 2  # tmp78,
	rep movsd
.L3:
	xor edx, edx    # tmp80
	test    al, 2   # argc,
	je  .L4 #,
	mov dx, WORD PTR [rsi]  # tmp82,
	mov WORD PTR [rdi], dx  #, tmp82
	mov edx, 2  # tmp80,
.L4:
	test    al, 1   # argc,
	je  .L5 #,
	mov al, BYTE PTR [rsi+rdx]  # tmp85,
	mov BYTE PTR [rdi+rdx], al  #, tmp85
.L5:

In the former case (with * sizeof(int)), GCC has omitted all the code to deal
with the trailing 1, 2, and 3 bytes, so the stringop code generation has
apparently spotted that the length is a multiple of 4 bytes.

I can see that the code which expands the length expression is separate from
the stringop expansion, though the stringop expansion does do the right thing
when the length is a literal constant.
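
To spell out the simplification I am hoping for, here is the arithmetic as a
small sketch (the function and the names are mine, just for illustration, not
anything in GCC):

#include <stddef.h>

/* Sketch only: the arithmetic when the length is argc * sizeof(int). */
static void split_length( int argc, size_t *dwords, size_t *tail )
{
  size_t len = (size_t)argc * sizeof(int);  /* the "sal rcx, 2" */

  *dwords = len >> 2;  /* the "shr rcx, 2" -- this is just argc again */
  *tail   = len & 3;   /* always 0 here, so the 1-3 byte tail code is dead */
}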

Incidentally, for the second case, memcpy( b, a, argc ), the Visual Studio
compiler generates code like this:

	mov eax, ecx
	shr ecx, 2
	rep movsd
	mov ecx, eax
	and ecx, 3
	rep movsb

which seems cleaner (no jumps) than the GCC code, though knowing GCC there is
probably a good reason for its choice, as it generally has a far more
sophisticated optimizer.
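
For what it is worth, a rough C rendering of that branch-free split (my own
sketch, not what either compiler literally emits) would be:

#include <stddef.h>
#include <string.h>

/* Copy the whole dwords first, then the 0-3 leftover bytes. */
static void copy_split( void *dst, const void *src, size_t n )
{
  unsigned char       *d = dst;
  const unsigned char *s = src;
  size_t dwords = n >> 2;  /* shr ecx, 2 */
  size_t tail   = n & 3;   /* and ecx, 3 */
  size_t i;

  for( i = 0; i < dwords; ++i )            /* stands in for rep movsd */
	memcpy( d + 4 * i, s + 4 * i, 4 );
  for( i = 0; i < tail; ++i )              /* stands in for rep movsb */
	d[4 * dwords + i] = s[4 * dwords + i];
}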

Best regards,
Jeremy

