The attached C++ code produces incorrect (different) results when compiled with "-O3 -ffast-math" (and "-O2 -finline-functions -ffast-math") than it does when compiled with just "-O3" or with "-O2 -ffast-math". The problem appears in both gcc 3.3.2 and the current gcc-3_3-branch (3.3.3 prerelease), but works correctly on mainline, gcc 3.3, and gcc 3.2.x. Even minor changes to this reduced testcase seem to allow the problem to disappear. Removing the "throw ()" attribute from sin and cos fixes the problem. Using, "extern "C" double sin(double);" and "extern "C" double cos(double);", i.e. GCC's builtins, fixes the problem. Even changing the alloca's to mallocs fixes the problem!! I'm filing the PR so that the regression hunters can identify both the patch that introduced this regression between 3.3 and 3.3.2, and also the patch that fixed it between 3.3 and current mainline. [Attachment to follow!]
Created attachment 5427 [details] Original testcase
This looks like more the case where the extra bits for the 80bits register, also note -msse -mfpmath=sse -O3 -finline produces wrong results though so it might be something else too.
I beleive I've managed to trace the location of the bug in GCC. The minimal change to the mark.cpp testcase to change it from functioning to non-functioning is to change the return type of MaxZDist2 from float to double. This "delta" then allowed me to compare the assembly language output between the two versions. After many hours of eliminating label numbering and register allocation differences, I managed to reduce the differences between the two .s files to just: 162c162,164 < fstps -28(%ebp) --- > fstps -40(%ebp) > movl -40(%ebp), %eax > movl %eax, -28(%ebp) Where the second version aborts and the first version doesn't! I believe that the problem is that -40(%ebp) isn't a valid temporary stack slot, which ends up corrupting the stack. Changing -40(%ebp) to -28(%ebp) resolves the failure (so it isn't an IA-32 RAW hazard :) The issue is that the function MaxZDist2 looks like: float MaxZDist2(...) { if (!n) return 0.0f; ... return max_dist2; } Which is being transformed (in the broken float form) into the sequence: ... movl $0x0, %eax // float return value allocated to %eax if (!n) goto L21 ... fstps -40(%ebp) // calculated max_dist2 calculated in st(0) movl -40(%ebp), %eax // move max_dist2 into return value .L21: movl %eax, -28(%ebp) // store result of MaxZDist2 The "double" version which works perfectly, instead uses the sequence ... fldz if (!n) goto L21 fstp %st(0) ... .L21: fstps -28(%ebp) where the contents of the ... and the rest of the function are identical. The question now arises of how does GCC manage to use an invalid stack slot, -40(%ebp)??? I suspect that the interaction is caused by the fact that MaxZDist2 is being inlined into FindBestRotation (-finline-functions or -O3 is necessary), and that FindBestRotation is calling alloca before MaxZDist2 is called (changing the alloca to malloc or auto variables makes the problem go away). Perhaps prior to inlining -40(%ebp) was known to be the top of the stack, but this was never corrected for upon inlining. Certainly, the function prologues and epilogues are identical between the working (double) and broken (float) forms, even though the float version clearly uses an additional stack slot that the double version doesn't. I've also confirmed that GNU gcc 3.3.1 is affected but that GNU gcc 3.3 isn't. The problem was originally reported using the Fedora system compiler, but I've failed to reproduce it with the system gcc-3.3.1 that comes with SuSe 9.0.
The stack slot -40(%ebp) is created during global register allocation (it appears for the first time in the .greg dump file), indicating a bug somewhere in GCC's stack spilling code.
Well, the -40(%ebp) appears to be being created legitimately (on gcc-3.3.3) by a call to assign_stack_local from get_secondary_mem in reload.c. There seems to be some poor interaction between alloca and the x86 stack frame. The prologue looks like: _Z16FindBestRotationPKfS0_jPf: pushl %ebp movl %esp, %ebp // %sp == %ebp pushl %edi // %sp == %ebp - 4 pushl %esi // %sp == %ebp - 8 pushl %ebx // %sp == %ebp - 12 subl $28, %esp // %sp == %ebp - 40 // calculate argument to alloca movl 16(%ebp), %edi // %edi = n leal (%edi,%edi,2), %edx // %edx = 3*n leal 15(,%edx,4), %esi // %esi = 3*n*sizeof(float)+15 andl $-16, %esi // round to multiple of 16 subl %esi, %esp leal 4(%esp), %edx // %edx result from alloca ... fstps -40(%ebp) // same address as %edx + %esi - 4 Changing the "subl $28, %esp" to "subl $32, %esp" resolves the failure. So it looks like there's a poor interaction between the stack slot allocation routines and alloca. i.e. the last four bytes of the first alloca allocation (where the allocation is a multiple of 16 and therefore has no padding) overlap with the last four bytes of of cfun->x_frame_offset, i.e. the last allocated stack slot. Can someone more familiar with i386's stack frames take it from here? I don't know where the x86's push is predecrement or postdecrement so I don't know whether the "%sp == %ebp - 40", i.e. having sp point to the problematic stack slot, is the cause of the problem or not. Internally, that stack_frame_size is "28", FRAME_GROWS_DOWNWARDS and the last/problematic stack slot is the last four bytes of these 28, i.e. (MEM (PLUS (REG frame_pointer_rtx) (CONST_INT -28))) which once "eliminate_regs" has has its way becomes (MEM (PLUS (REG %ebp) (CONST_INT -40))) I hope this helps.
Patch here: <http://gcc.gnu.org/ml/gcc-patches/2004-01/msg01532.html>.
Subject: Bug 13608 CVSROOT: /cvs/gcc Module name: gcc Changes by: hubicka@gcc.gnu.org 2004-01-16 18:53:52 Modified files: gcc : ChangeLog c-pretty-print.c dwarf2out.c et-forest.c haifa-sched.c ra.h gcc/config/i386: i386.c i386.md gcc/cp : ChangeLog mangle.c Log message: * i386.md (load_tp_di): Fix pasto. PR opt/13608 * i386.c (ix86_compute_frame_layout): Fix for alloca on leaf function. * c-pretty-print.c (pp_c_type_cast, pp_c_abstract_declarator, pp_c_character_constant, pp_c_floating_constant, pp_c_additive_expression, pp_c_shift_expression, pp_c_equality_expression, pp_c_and_expression, pp_c_exclusive_or_expression, pp_c_inclusive_or_expression, pp_c_logical_and_expression): Remove inline modifier. * dwarf2out.c (get_AT): Likewise. * et-forest.c (et_splay): Likewise. * ra.h (ra_alloc, ra_calloc): Likewise * mangle.c (write_mangled_name): Remove inline modifier. Patches: http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.2325&r2=2.2326 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/c-pretty-print.c.diff?cvsroot=gcc&r1=1.34&r2=1.35 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/dwarf2out.c.diff?cvsroot=gcc&r1=1.477&r2=1.478 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/et-forest.c.diff?cvsroot=gcc&r1=1.9&r2=1.10 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/haifa-sched.c.diff?cvsroot=gcc&r1=1.235&r2=1.236 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ra.h.diff?cvsroot=gcc&r1=1.8&r2=1.9 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.c.diff?cvsroot=gcc&r1=1.634&r2=1.635 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.md.diff?cvsroot=gcc&r1=1.501&r2=1.502 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/cp/ChangeLog.diff?cvsroot=gcc&r1=1.3890&r2=1.3891 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/cp/mangle.c.diff?cvsroot=gcc&r1=1.94&r2=1.95
Subject: Bug 13608 CVSROOT: /cvs/gcc Module name: gcc Branch: hammer-3_3-branch Changes by: hubicka@gcc.gnu.org 2004-01-16 18:55:53 Modified files: gcc : ChangeLog.hammer gcc/config/i386: i386.c Log message: PR opt/13608 * i386.c (ix86_compute_frame_layout): Fix for alloca on leaf function. Patches: http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.hammer.diff?cvsroot=gcc&only_with_tag=hammer-3_3-branch&r1=1.1.2.263&r2=1.1.2.264 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.c.diff?cvsroot=gcc&only_with_tag=hammer-3_3-branch&r1=1.490.2.64&r2=1.490.2.65
Subject: Bug 13608 CVSROOT: /cvs/gcc Module name: gcc Branch: gcc-3_3-branch Changes by: hubicka@gcc.gnu.org 2004-01-16 19:01:10 Modified files: gcc : ChangeLog gcc/config/i386: i386.c Log message: PR opt/13608 * i386.c (ix86_compute_frame_layout): Fix for alloca on leaf function. Patches: http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&only_with_tag=gcc-3_3-branch&r1=1.16114.2.885&r2=1.16114.2.886 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.c.diff?cvsroot=gcc&only_with_tag=gcc-3_3-branch&r1=1.495.2.30&r2=1.495.2.31
Fixed.