[PATCH, i386]: Do not emit "cld" instructions

Jan Hubicka jh@suse.cz
Wed Dec 6 13:02:00 GMT 2006


> On 12/5/06, Uros Bizjak <ubizjak@gmail.com> wrote:
> 
> >>>According to the guide, it applies to pentium4.
> >>>
> >>>
> >>
> >>This is pretty high.  Would it be possible for you to rerun the
> >>test_stringops script on a P4 machine after removing the CLD?  If it
> >>really is 48 cycles, it should show a difference in the preferred memcpy
> >>codegen.
> >>
> >>
> >>
> >Sure! But I think that this is an error in the optimizing guide.
> 
> ... NOT.

Funny, it is great you noticed!
> 
> --cut here--
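> #include <stdio.h>
> 
> /* "=A" returns the edx:eax pair as one 64-bit value on 32-bit x86 only */
> #define rdtsc(value) \
> asm volatile("rdtsc":"=A" (value))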
> 
> int main(int argc, char ** argv)
> {
>  unsigned long long a,b;
> 
>  rdtsc(a);
>  rdtsc(b);
>  printf("%llu\n", b-a);
> 
>  rdtsc(a);
>  asm volatile ("std; cld;");

The P4 might simply eliminate redundant CLDs, so there is still hope
that the benchmark would fare well if the std were not included.  But it
would definitely be great if you could rerun the test_stringop script I
sent you on a P4 machine.  Removing the cld ought to make rep;mov
sequences a lot more profitable, which should in turn reduce code size.
I will re-check Athlons/K8s/Centrinos, but there cld is supposed to be
cheap.

Honza
>  rdtsc(b);
>  printf("%llu\n", b-a);
> 
>  return 0;
> }
> 
> --cut here--
> 
> gcc -O2
> ./a.out
> 
> ./a.out
> 84
> 172
> ./a.out
> 84
> 172
> ./a.out
> 84
> 188
> ./a.out
> 84
> 188
> ./a.out
> 84
> 172
> 
> Uros.
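
The quoted test only times "std; cld" as a pair, so it cannot tell a
costly cld from a costly std.  A minimal sketch of the variant suggested
above, timing a lone cld next to the original pair, might look like the
following.  The names read_tsc, DEFINE_TIMER, time_empty, time_cld,
time_std_cld and report are illustrative and not from this thread;
taking the minimum over several runs is likewise an assumption made here
to reduce noise.  It assumes GCC-style inline asm on x86, and reads
edx:eax explicitly so that it also builds for x86-64.

```c
#include <stdint.h>
#include <stdio.h>

/* Read the time-stamp counter as an explicit edx:eax pair. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Define a function that times one instruction sequence between two
   rdtsc reads and returns the smallest delta seen over `runs` tries. */
#define DEFINE_TIMER(name, insns)                          \
    static uint64_t name(int runs)                         \
    {                                                      \
        uint64_t best = UINT64_MAX;                        \
        for (int i = 0; i < runs; i++) {                   \
            uint64_t a = read_tsc();                       \
            __asm__ __volatile__(insns ::: "cc");          \
            uint64_t b = read_tsc();                       \
            if (b - a < best)                              \
                best = b - a;                              \
        }                                                  \
        return best;                                       \
    }

DEFINE_TIMER(time_empty, "")            /* rdtsc overhead only      */
DEFINE_TIMER(time_cld, "cld")           /* cld alone, DF already 0  */
DEFINE_TIMER(time_std_cld, "std; cld")  /* the pair from the thread */

/* Print the three minima; subtract the first line from the others. */
static void report(int runs)
{
    printf("empty:   %llu\n", (unsigned long long)time_empty(runs));
    printf("cld:     %llu\n", (unsigned long long)time_cld(runs));
    printf("std;cld: %llu\n", (unsigned long long)time_std_cld(runs));
}
```

Calling report(1000) from main prints the three minima.  If the "cld"
figure stays close to "empty" while "std;cld" does not, the redundant
cld is cheap on that machine and the cost comes from actually changing
the direction flag.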


