Bug 17990 - [3.4/4.0 Regression] sse used for negate without -mfpmath=sse
Summary: [3.4/4.0 Regression] sse used for negate without -mfpmath=sse
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 3.4.2
: P2 normal
Target Milestone: 3.4.4
Assignee: Richard Henderson
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2004-10-14 03:28 UTC by Eric Shattow
Modified: 2004-12-15 08:01 UTC (History)
4 users (show)

See Also:
Host:
Target: i686-pc-linux-gnu
Build:
Known to work: 3.3.3
Known to fail: 3.4.2 4.0.0
Last reconfirmed: 2004-12-09 18:01:21


Attachments
'gcc -v' output (430 bytes, text/plain)
2004-10-14 03:28 UTC, Eric Shattow
Details
OscilGen.ii (27.43 KB, text/plain)
2004-10-14 03:44 UTC, Eric Shattow
Details
OscilGen.S (39.38 KB, text/plain)
2004-10-14 03:47 UTC, Eric Shattow
Details
OscilGen.S_listing (44.83 KB, text/plain)
2004-10-14 03:51 UTC, Eric Shattow
Details
GDB session (1.54 KB, text/plain)
2004-10-14 03:55 UTC, Giovanni Bajo
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Eric Shattow 2004-10-14 03:28:10 UTC
The "ZynAddSubFX 2.1.1" C++ program happens to work with gcc 3.3.4, but crashes
strangely with gcc 3.4.2 in a specific section of the src/Synth/OscilGen.C code.
Comment 1 Eric Shattow 2004-10-14 03:28:52 UTC
Created attachment 7342 [details]
'gcc -v' output
Comment 2 Eric Shattow 2004-10-14 03:44:06 UTC
Created attachment 7343 [details]
OscilGen.ii

erisha@necro ~/bin/ZYNSUCKS-3.4.2/src/Synth $ gcc -O0 -ggdb -Wall  -DOS_LINUX
-DALSAMIDIIN -DFFTW_VERSION_3 -DASM_F2I_YES `fltk-config --cflags` 
-DJACKAUDIOOUT	`pkg-config --cflags jack`   -c -o OscilGen.o OscilGen.C
--save-temps
Comment 3 Eric Shattow 2004-10-14 03:47:36 UTC
Created attachment 7344 [details]
OscilGen.S

gcc -O0 -ggdb -Wall  -DOS_LINUX -DALSAMIDIIN -DFFTW_VERSION_3 -DASM_F2I_YES
`fltk-config --cflags`	-DJACKAUDIOOUT	`pkg-config --cflags jack`   -c -o
OscilGen.S OscilGen.C -S
Comment 4 Eric Shattow 2004-10-14 03:51:02 UTC
Created attachment 7345 [details]
OscilGen.S_listing

objdump -S OscilGen.o > OscilGen.S_listing
Comment 5 Giovanni Bajo 2004-10-14 03:55:43 UTC
Created attachment 7346 [details]
GDB session
Comment 6 Giovanni Bajo 2004-10-14 03:57:32 UTC

 for (i=0;i<OSCIL_SIZE;i++) oscilFFTfreqs[i]=0.0;
    25ad:	a1 00 00 00 00       	mov    0x0,%eax
    25b2:	31 ff                	xor    %edi,%edi
    25b4:	39 c7                	cmp    %eax,%edi
    25b6:	89 45 80             	mov    %eax,0xffffff80(%ebp)
    25b9:	7d 1c                	jge    25d7 
<_ZN8OscilGen7prepareEv+0x157>
    25bb:	8b 55 e0             	mov    0xffffffe0(%ebp),%edx
    25be:	8b 92 88 05 00 00    	mov    0x588(%edx),%edx
    25c4:	89 55 84             	mov    %edx,0xffffff84(%ebp)
    25c7:	8b 45 84             	mov    0xffffff84(%ebp),%eax
    25ca:	c7 04 b8 00 00 00 00 	movl   $0x0,(%eax,%edi,4)
    25d1:	47                   	inc    %edi
    25d2:	3b 7d 80             	cmp    0xffffff80(%ebp),%edi
    25d5:	7c f0                	jl     25c7 
<_ZN8OscilGen7prepareEv+0x147>
   if (Pcurrentbasefunc==0) {//the sine case
    25d7:	8b 55 e0             	mov    0xffffffe0(%ebp),%edx
	for (i=0;i<MAX_AD_HARMONICS;i++){
	    oscilFFTfreqs[i+1]=-hmag[i]*sin(hphase[i]*(i+1))/2.0;
	    oscilFFTfreqs[OSCIL_SIZE-i-1]=hmag[i]*cos(hphase[i]*(i+1))/2.0;
	};
   } else {
	for (j=0;j<MAX_AD_HARMONICS;j++){
    25da:	31 f6                	xor    %esi,%esi
    25dc:	80 ba 29 01 00 00 00 	cmpb   $0x0,0x129(%edx)
    25e3:	0f 85 1f 02 00 00    	jne    2808 
<_ZN8OscilGen7prepareEv+0x388>
    25e9:	8b 82 88 05 00 00    	mov    0x588(%edx),%eax
    25ef:	31 ff                	xor    %edi,%edi
    25f1:	f3 0f 10 05 00 00 00 	movss  0x0,%xmm0
    25f8:	00 
    25f9:	0f 29 45 88          	movaps %xmm0,0xffffff88(%ebp)
    25fd:	89 45 84             	mov    %eax,0xffffff84(%ebp)
    2600:	89 85 7c ff ff ff    	mov    %eax,0xffffff7c(%ebp)
    2606:	8d 77 01             	lea    0x1(%edi),%esi
    2609:	8b 55 e0             	mov    0xffffffe0(%ebp),%edx
    260c:	0f 57 c0             	xorps  %xmm0,%xmm0
    260f:	f3 0f 2a c6          	cvtsi2ss %esi,%xmm0


The problem seems that movaps is writing to memory which is not 16-byte aligned.
Comment 7 Giovanni Bajo 2004-10-14 03:58:37 UTC
Also on mainline:
        movaps  %xmm0, -104(%ebp)
Comment 8 Andrew Pinski 2004-10-14 13:31:25 UTC
Note the options needed to reproduce this is -O2 -march=athlon-xp.
Comment 9 Andrew Pinski 2004-10-14 14:01:04 UTC
Confirmed reduced testcase:
extern double sin (double __x) throw (); 
float hmag[128],hphase[128];
void prepare(float *oscilFFTfreqs)
{
  int i;
  for (i=0;i<128;i++)
    *oscilFFTfreqs=-hmag[i]*sin(hphase[i])/2.0;
}
Comment 10 Andrew Pinski 2004-10-14 14:04:28 UTC
And this is a regression. Only -O2 -msse is needed to reproduce the problem.
Comment 11 Andrew Pinski 2004-10-14 14:14:25 UTC
Related to bug 17962.
Comment 12 Wolfgang Bangerth 2004-10-15 18:40:29 UTC
I have a very similar problem in my program. It segfaults in this instruction: 
    0x0815935d <vector+203>:	movapd %xmm0,(%eax) 
where I have 
    eax            0xbfffd874	-1073751948 
and 
(gdb) p ((unsigned int)0xbfffd874) 
$1 = 3221215348 
(gdb) p ((unsigned int)0xbfffd874) % 16 
$2 = 4 
 
So indeed movapd isn't writing to 16-byte aligned storage. 
 
W. 
Comment 13 Eric Shattow 2004-10-17 06:24:53 UTC
possibly related to http://www.cygwin.com/ml/cygwin/2004-04/msg01103.html
Comment 14 Giovanni Bajo 2004-10-18 22:24:34 UTC
Uros, Jan, can you have a look at this bug? It's about an unaligned XMM 
operation to the stack. There is a small testcase and it is a pretty serious 
regression.
Comment 15 Uroš Bizjak 2004-10-19 09:55:31 UTC
Testcase from comment #9 and '-O2 -msse -fomit-frame-pointer' procduces code
that seems OK:

prepare:
        pushl  %ebp
        pushl  %edi
        movl   $hmag+512, %edi
        pushl  %esi
        movl   $hphase, %esi
        pushl  %ebx
        movl   $hmag, %ebx
        subl   $76, %esp
        movss  .LC0, %xmm0
        movl   96(%esp), %ebp
        movaps %xmm0, 32(%esp)
        .p2align 4,,15
.L2:
        flds   (%ebx)
        addl   $4, %ebx
        fchs
        flds   (%esi)
        addl   $4, %esi
        fstpl  (%esp)
        fstpl  16(%esp)
        call   sin
        fldl   16(%esp)
        cmpl   %ebx, %edi
        fmulp  %st, %st(1)
        fmull  .LC2
        fstps  (%ebp)
        jne    .L2
        addl   $76, %esp
        popl   %ebx
        popl   %esi
        popl   %edi
        popl   %ebp
        ret

BTW: when alter_reg() for reg 81 is called from reload() (at line ~836,
reload1.c) it gets right offset (-32). [This offset is calculated at line ~1969
in reload1.c]. Something changes this correct offset from (-32) to (-56) if
frame pointer is not omitted.
Comment 16 Giovanni Bajo 2004-10-19 11:42:18 UTC
So, Eric, it looks like that adding -fomit-frame-pointer to your Makefile is a 
good workaround for the time being without losing any advanced optimization. 
You were looking for one, IIRC.
Comment 17 Eric Shattow 2004-10-21 00:47:17 UTC
-fomit-frame-pointer has no effect. thanks anyways G.

 -E
Comment 18 Eric Shattow 2004-10-21 02:15:28 UTC
i tested -mno-sse and it is a functional good workaround.  also seems to
override '-march=athlon-xp' which in this case is a good thing.

-E
Comment 19 Uroš Bizjak 2004-10-25 14:35:41 UTC
The problem here is triggered in reload() function around line 950, this part
(#ifdef'd part was added by me:):

      for (i = FIRST_PSEUDO_REGISTER; i < max_regno; i++)
	if (reg_renumber[i] < 0 && reg_equiv_memory_loc[i])
	  {
	    rtx x = eliminate_regs (reg_equiv_memory_loc[i], 0, NULL_RTX);

#ifdef DEBUG
	    debug_rtx(reg_equiv_memory_loc[i]);
	    debug_rtx(x);
#endif
	    if (strict_memory_address_p (GET_MODE (regno_reg_rtx[i]),
					 XEXP (x, 0)))
	      reg_equiv_mem[i] = x, reg_equiv_address[i] = 0;
	      ...

For the testcase from comment #9 (converted to plain c), 'gcc -O2 -msse' will
produce relevant debug information:
IN:
(mem:V4SF (plus:SI (reg/f:SI 20 frame)
        (const_int -32 [0xffffffe0])) [6 S16 A8])
OUT:
(mem:V4SF (plus:SI (reg/f:SI 6 bp)
        (const_int -56 [0xffffffc8])) [6 S16 A8])

So, the problem is inside eliminate_regs() function, that unaligns otherwise
aligned address. This unaligned address is passed down and somewhere around line
1214, following code will be triggered:

  for (i = FIRST_PSEUDO_REGISTER; i < max_regno; i++)
    {
      rtx addr = 0;

      if (reg_equiv_mem[i])
	addr = XEXP (reg_equiv_mem[i], 0);
      ...

and this addr is used as new (unaligned) address on stack.

To further analyze this issue: following RTX is passed to eliminate_regs():
(mem:V4SF (plus:SI (reg/f:SI 20 frame)
        (const_int -32 [0xffffffe0])) [6 S16 A8])

After getting through MEM: case, function recurses to PLUS: case, where
following RTX is processed:
plus:SI (reg/f:SI 20 frame)
        (const_int -32 [0xffffffe0]))

and this code is triggered:
		  ...
		else
		  return gen_rtx_PLUS (Pmode, ep->to_rtx,
				       plus_constant (XEXP (x, 1),
						      ep->previous_offset));

where ep->previous_offset (when substituting frame pointer with ebp) equals
(-24). And the resulting sum is then -56. 

I'm a little lost here, what previous_offset field represents, perhaps someone
with more knowledge could find, if magic number (-24) is OK [it is not!].

BTW: the testcase from comment #9 when -fomit-frame-pointer is added to
compilation flags produces correctly aligned address, because 
ep->previous_offset, when substituting frame pointer with esp, equals to 64.
-32 + 64 = 32 in this case.

Regarding comment #17: Perhaps original testcase still uses ebp, even with
'-fomit-frame-pointer'.

Uros.
Comment 20 Uroš Bizjak 2004-10-26 07:19:28 UTC
These PRs are all duplicates:

target/12902: Invalid assembly generated when using SSE / xmmintrin.h
target/14776: -mfpmath=sse causes movapd from non-16-byte aligned address
fortran/17930: -mfpmath=sse creates illegal code (movapd with misaligned argument)

and perhaps:
target/10395: [x86] vector types are incorrectly aligned causing crash in
multi-threaded apps or when using in main
[this bug has 10 dupes!]

Uros.
Comment 21 Giovanni Bajo 2004-10-27 15:29:27 UTC
Subject: Re:  [3.4/4.0 Regression] unaligned xmm movaps on the stack with -O2 -msse because of the frame pointer

uros at kss-loka dot si wrote:

> target/12902: Invalid assembly generated when using SSE / xmmintrin.h
> target/14776: -mfpmath=sse causes movapd from non-16-byte aligned
> address fortran/17930: -mfpmath=sse creates illegal code (movapd with
> misaligned argument)
>
> and perhaps:
> target/10395: [x86] vector types are incorrectly aligned causing
> crash in multi-threaded apps or when using in main
> [this bug has 10 dupes!]

Uros, thanks for the analysys. I have just given you edit access to Bugzilla,
would you please close the dupes as such yourself, but making sure that you
bring over any attacched comments/patches/testcases that you think might
matter?

Thanks,
Giovanni Bajo


Comment 22 Mark Mitchell 2004-10-31 02:03:55 UTC
Postponed until GCC 3.4.4.
Comment 23 Giovanni Bajo 2004-12-09 18:01:21 UTC
Reconfirmed with today's mainline. This bug has about 10 dups...
Comment 24 Uroš Bizjak 2004-12-13 15:15:19 UTC
Could this patch help with unaligned stack:

http://gcc.gnu.org/ml/gcc-patches/2003-01/msg00019.html
Comment 25 Richard Henderson 2004-12-13 18:20:06 UTC
I have been completely unable to reproduce an unaligned stack, with either 3.4
or 4.0.  Can those gcc folk that claim to have reproduced it in the past still
do so?  If so, what are you trying?

I can say that the proximate cause of the existance of the SSE instruction in
the first place is a test in negsf2 using TARGET_SSE that ought to be using
TARGET_SSE_MATH.  I can say that the reason reload fails to rematerialize the
vector constant (and spills it on the stack) is due to the movss split firing
before reload and removing the obviousness of the vector constant.

For 3.4 I can make two minor tweaks that will prevent the manifestation of this
bug, but leave -mfpmath=sse less than correct.  For 4.0 I will attempt to fix
that as well.
Comment 26 Wolfgang Bangerth 2004-12-13 20:19:51 UTC
Richard,  
I have added a small testcase in f77 to PR 14776. I'll try to come up 
with something that can be used for mainline as well, but for now this 
is at least some kind of testcase for you to work with. 
 
Regards 
  Wolfgang 
Comment 27 Wolfgang Bangerth 2004-12-13 20:55:03 UTC
I have something else (although of a quality that I don't trust myself) 
in C in PR 12902. 
 
W. 
Comment 28 GCC Commits 2004-12-14 01:58:06 UTC
Subject: Bug 17990

CVSROOT:	/cvs/gcc
Module name:	gcc
Branch: 	gcc-3_4-branch
Changes by:	rth@gcc.gnu.org	2004-12-14 01:57:57

Modified files:
	gcc            : ChangeLog 
	gcc/config/i386: i386.md 

Log message:
	PR target/17990
	* config/i386/i386.md (negsf2): Fix condition for using sse.
	(negdf2, abssf2, absdf2): Likewise.
	(negsf2_if, abssf2_if): Don't disable if sse enabled.
	(movv4sf_internal splitter): Postpone til after reload.
	(movv2di_internal splitter): Likewise.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&only_with_tag=gcc-3_4-branch&r1=2.2326.2.731&r2=2.2326.2.732
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.md.diff?cvsroot=gcc&only_with_tag=gcc-3_4-branch&r1=1.502.2.10&r2=1.502.2.11

Comment 29 Uroš Bizjak 2004-12-14 06:38:11 UTC
(In reply to comment #25)
> I have been completely unable to reproduce an unaligned stack, with either 3.4
> or 4.0.  Can those gcc folk that claim to have reproduced it in the past still
> do so?  If so, what are you trying?

Testcase 1 (without threads):

--cut here--
float hmag[128],hphase[128];
void prepare(float *oscilFFTfreqs)
{
          int i;
            for (i=0;i<128;i++)
                 *oscilFFTfreqs=-hmag[i]*sin(hphase[i])/2.0;
}
--cut here--

Compile it with 'gcc -O2 msse2':

...
prepare:
        pushl   %ebp
        movl %esp, %ebp
        pushl   %esi
        pushl   %ebx
        xorl %ebx, %ebx
        subl $64, %esp
        movss   .LC0, %xmm0
        movl 8(%ebp), %esi
        movaps  %xmm0, -40(%ebp)     <- here, load from unaligned stack
        .p2align 4,,15
.L2:
...

Testcase 2: (with threads):

--cut here--
/* *** start ***/
#include <pthread.h>
#include <stdio.h>
#include <assert.h>

#include <xmmintrin.h>
#include <mmintrin.h>

#ifdef __ICC
#include <emmintrin.h>
#endif

void * f(void *p)
{
          int x = (p == NULL) ? 0 : * (int *) p;
            __m128i s;
              printf("&x = %p &s= %p\n", &x, &s);
                return NULL;
}

int main(int argc, char ** argv)
{
          pthread_t th;

            f(& argc);
              assert(pthread_create(& th, NULL, f, &argc)==0);
                assert(pthread_join(th, NULL)==0);
           return 0;
}

/* ***end ***/

gcc -O2 -msse2 -lpthread:
./a.out
&x = 0xbffff9cc &s= 0xbffff9b0
&x = 0x4085fac8 &s= 0x4085faac   <- here, s is not aligned correctly

It is a question whether ifor pthreads a stack alignment code should be present
in the function (that would be more robust), or it is a libc's job to provide
correctly alligned stack.

I think that summary of this bugreport is not correct now.
Comment 30 Uroš Bizjak 2004-12-14 07:19:29 UTC
I was too fast, testcase 1 from comment #29 is OK, it doesn't segfault anymore.
Comment 31 Giovanni Bajo 2004-12-14 12:38:45 UTC
Eric, can you doublecheck that the bug is gone for you too, with and without -
mfpmath=sse, with a today snapshot of 3.4 or 4.0?
Comment 32 Wolfgang Bangerth 2004-12-14 14:09:47 UTC
In response to comment #30: it is libpthread's responsibility to align the 
stack of subthreads properly. It doesn't do that, however, but I believe 
that we have another PR for that (this is something that we can't really 
do anything about, however). 
 
If Uros' first testcase now works, then I think we have fixed the part that 
we could fix, so I'm closing the PR. Eric, if there are more problems, 
please open another PR and have a link in it to this PR. 
 
Thanks 
  Wolfgang 
Comment 33 Jan Hubicka 2004-12-14 14:17:36 UTC
Subject: Re:  [3.4/4.0 Regression] sse used for negate without -mfpmath=sse

> 
> ------- Additional Comments From giovannibajo at libero dot it  2004-12-14 12:38 -------
> Eric, can you doublecheck that the bug is gone for you too, with and without -
> mfpmath=sse, with a today snapshot of 3.4 or 4.0?

Actually -mfpmath is about the floating point operations that might
change results depending on unit (ie whether you use extended precision
or not).  Negation is safe operation in this context so I would not say
that this is a bug.  Perhaps documentation needs to be clarified?

Honza
> 
> -- 
> 
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17990
> 
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug, or are watching someone who is.
Comment 34 GCC Commits 2004-12-14 22:45:35 UTC
Subject: Bug 17990

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	rth@gcc.gnu.org	2004-12-14 22:45:29

Modified files:
	gcc            : ChangeLog 
	gcc/config/i386: i386-protos.h i386.c i386.h i386.md 
	                 predicates.md 

Log message:
	PR target/17990
	* config/i386/i386.c (x86_use_bt): New.
	(ix86_expand_unary_operator): Use MEM_P.
	(ix86_expand_fp_absneg_operator): New.
	* config/i386/i386.h (x86_use_bt): Declare.
	(TARGET_USE_BT): New.
	* config/i386/i386-protos.h: Update.
	* config/i386/i386.md (negsf2): Use ix86_expand_fp_absneg_operator.
	(negdf2, negxf2, abssf2, absdf2, absxf2): Likewise.
	(negsf2_memory, negsf2_ifs, negsf2_if, negdf2_memory, negdf2_ifs,
	negdf2_ifs_rex64, negdf2_if, negdf2_if_rex64, negxf2_if,
	abssf2_memory, abssf2_ifs, abssf2_if, absdf2_memory, absdf2_ifs,
	absdf2_ifs_rex64, absdf2_if, absxf2_if): Remove.
	(absnegsf2_mixed, absnegsf2_sse, absnegsf2_i387, absnegdf2_mixed,
	absnegdf2_sse, absnegdf2_i387, absnegxf2_i387): New.  Merge all
	neg and abs splitters.  Handle DFmode in general regs in 64-bit mode.
	(negextendsfdf2, absextendsfdf2): Disable for non-mixed sse math.
	(btsq, btrq, btcq): New.  Add peepholes as well.
	(movv4sf_internal splitter): Postpone til after reload.
	(movv2di_internal splitter): Likewise.
	* config/i386/predicates.md (const_0_to_63_operand): New.
	(absneg_operator): New.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.6829&r2=2.6830
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386-protos.h.diff?cvsroot=gcc&r1=1.121&r2=1.122
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.c.diff?cvsroot=gcc&r1=1.752&r2=1.753
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.h.diff?cvsroot=gcc&r1=1.406&r2=1.407
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.md.diff?cvsroot=gcc&r1=1.573&r2=1.574
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/predicates.md.diff?cvsroot=gcc&r1=1.8&r2=1.9

Comment 35 Uroš Bizjak 2004-12-15 08:01:13 UTC
Just FYI: first testcase from comment #29 shows the problem with missing i386
addressing mode. First leal is not needed as array can be addressed directly
with flds. A PR 18643 is opened for this bug.

.L2:
        leal 0(,%edx,4), %eax
        flds hphase(%eax)
        fsin
        flds hmag(%eax)
        fchs
        leal 1(%edx), %eax
        cmpl $128, %eax
        movl %eax, %edx
        fmulp   %st, %st(1)
        fmul %st(1), %st
        fstps   (%ecx)
        jne .L2