Re: help interpreting gcc 4.1.1 optimisation bug

> The correct version is I think,
> void longcpy(long* _dst, long* _src, unsigned _numwords)
>  {
>     asm volatile (
>         "cld         \n\t"
>         "rep         \n\t"
>         "movsl       \n\t"
> 	// Outputs (read/write)
>         : "=S" (_src), "=D" (_dst), "=c" (_numwords)
> 	// Inputs - specify same registers as outputs
>         : "0"  (_src), "1"  (_dst), "2"  (_numwords)
> 	// Clobbers: direction flag, so "cc", and "memory"
>         : "cc", "memory"
>         );
> }

  I did not re-check with GCC-4.1.1, but I have noticed problems with this
 kind of "memory" clobber when the source you are copying from is not
 in main memory but is a structure on the stack. I have to say that
 I tend to use a form without "volatile" after the asm (one of the
 results then has to be used, otherwise the asm may be removed).

 The usual symptom is that the memcopy is done, but the *content* of the
 source structure is not updated *before* the memcopy: nothing in your
 asm says that the memory your pointer refers to has to be up-to-date.

 The "memory" says that main memory will be changed, not that it will be
 used, and if you are memcopy-ing from a structure on the stack - for
 instance a structure which fits in a register - you may have problems.
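
 One way to state the read explicitly is to add a memory input operand
 for the source. The following is only a sketch, assuming GCC extended
 asm on x86; the name wordcpy(), the fixed 32-bit word type and the
 portable fallback loop are my own illustration, not part of the quoted
 code. The array-of-unknown-size cast is the form the GCC manual uses
 for "memory of unknown length behind this pointer".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of the quoted longcpy(): the name and the
   uint32_t word type are assumptions made for this sketch. */
void wordcpy(uint32_t *dst, const uint32_t *src, size_t numwords)
{
#if defined(__GNUC__) && !defined(__clang__) && \
    (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__ (
        "cld\n\t"
        "rep movsl"
        /* "+" marks each register as both read and written. */
        : "+S" (src), "+D" (dst), "+c" (numwords)
        /* Explicit memory input: the bytes behind src are read, so
           they must be up-to-date in memory before the asm runs. */
        : "m" (*(const uint32_t (*)[]) src)
        /* The write through dst is still covered by "memory". */
        : "cc", "memory"
        );
#else
    /* Portable fallback for other compilers and targets. */
    while (numwords--)
        *dst++ = *src++;
#endif
}
```

 With that operand the dependency on the source no longer relies on
 how the "memory" clobber is interpreted.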

 That is why IMHO it is better to do type copying by directly copying
 the structure (mostly when using -fstrict-aliasing) instead of using
 memcpy() - like: struct { int a, b, c; } x, y = {0, 1, 1}; x = y;
 The main disadvantage of type copying is the relatively bad code
 that older compilers can generate for it. That bug may also appear
 (correct me if I am wrong) because, by not calling an external function
 called memcpy(), you are again not forcing the external memory to be
 updated - but it should be quicker for exactly the same reason.
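
 To make the comparison concrete, here are the two ways of copying side
 by side. This is only an illustration: the struct tag "triple" and the
 two function names are made up for the example.

```c
#include <string.h>

/* Hypothetical structure, small enough to live in registers. */
struct triple { int a, b, c; };

/* Type copying: plain assignment, the compiler sees the types. */
struct triple copy_by_assignment(void)
{
    struct triple y = {0, 1, 1};
    struct triple x;
    x = y;
    return x;
}

/* Byte copying: memcpy() moves sizeof x bytes with the types erased. */
struct triple copy_by_memcpy(void)
{
    struct triple y = {0, 1, 1};
    struct triple x;
    memcpy(&x, &y, sizeof x);
    return x;
}
```

 Both functions return the same value; what may differ is the code the
 compiler generates for each.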

 I did not really experiment with __builtin_memcpy(): is it treated
 specially, or like a standard function call? I do not know if, in:

int globint;
int fct (short *a, short *b) {
  globint = 3;
  __builtin_memcpy(a, b, sizeof(*a));
  if (globint == 3)
      return 1;
  return 0;
}

 Is the test present or optimised away like in:
int globint;
int fct (short *a, short *b) {
  globint = 3;
  *a = *b;
  if (globint == 3)
      return 1;
  return 0;
}
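
 One way to probe the first variant without reading the generated
 assembly is to make "a" point at globint on purpose. As far as I
 understand the aliasing rules, that is allowed for memcpy() because it
 copies bytes, so I would expect the test to have to stay. This is a
 sketch assuming a GCC-compatible compiler; fct_memcpy(),
 probe_aliased() and probe_distinct() are names invented for the
 experiment. The same trick must not be tried on the "*a = *b" variant,
 where writing to an int through a short pointer is undefined under
 -fstrict-aliasing.

```c
int globint;

/* Same body as the __builtin_memcpy() variant, renamed for the probe. */
int fct_memcpy(short *a, short *b)
{
    globint = 3;
    __builtin_memcpy(a, b, sizeof(*a));
    if (globint == 3)
        return 1;
    return 0;
}

/* "a" overlaps globint: the copy changes two of its bytes, so the
   comparison with 3 has to be done after the copy. */
int probe_aliased(void)
{
    short seven = 7;
    return fct_memcpy((short *)&globint, &seven);
}

/* "a" and "b" are unrelated to globint: the comparison still holds. */
int probe_distinct(void)
{
    short dst = 0, src = 7;
    return fct_memcpy(&dst, &src);
}
```

 If probe_aliased() returned 1, the test would have been optimised
 away; it should return 0 if the compiler kept it.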


