Bug 13608 - [3.3 regression] Incorrect code with -O3 -ffast-math
Summary: [3.3 regression] Incorrect code with -O3 -ffast-math
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 3.3.3
: P2 normal
Target Milestone: 3.3.3
Assignee: Not yet assigned to anyone
Keywords: wrong-code
Depends on:
Reported: 2004-01-07 21:15 UTC by roger
Modified: 2004-01-17 21:41 UTC (History)
1 user (show)

See Also:
Host: i686-pc-linux-gnu
Target: i686-pc-linux-gnu
Build: i686-pc-linux-gnu
Known to work:
Known to fail:
Last reconfirmed: 2004-01-08 17:11:21

Original testcase (1.99 KB, text/plain)
2004-01-07 21:17 UTC, roger

Note You need to log in before you can comment on or make changes to this bug.
Description roger 2004-01-07 21:15:39 UTC
The attached C++ code produces incorrect (different) results when compiled with
"-O3 -ffast-math" (and "-O2 -finline-functions -ffast-math") than it does when
compiled with just "-O3" or with "-O2 -ffast-math".  The problem appears in
both gcc 3.3.2 and the current gcc-3_3-branch (3.3.3 prerelease), but works
correctly on mainline, gcc 3.3, and gcc 3.2.x.

Even minor changes to this reduced testcase seem to allow the problem to
disappear.  Removing the "throw ()" attribute from sin and cos fixes the
problem.  Using, "extern "C" double sin(double);" and "extern "C" double
cos(double);", i.e. GCC's builtins, fixes the problem.  Even changing the
alloca's to mallocs fixes the problem!!

I'm filing the PR so that the regression hunters can identify both the
patch that introduced this regression between 3.3 and 3.3.2, and also
the patch that fixed it between 3.3 and current mainline.

[Attachment to follow!]
Comment 1 roger 2004-01-07 21:17:02 UTC
Created attachment 5427 [details]
Original testcase
Comment 2 Andrew Pinski 2004-01-07 23:04:20 UTC
This looks like more the case where the extra bits for the 80bits register, also note -msse 
-mfpmath=sse -O3 -finline produces wrong results though so it might be something else too.
Comment 3 roger 2004-01-08 17:11:16 UTC
I beleive I've managed to trace the location of the bug in GCC.  The minimal
change to the mark.cpp testcase to change it from functioning to non-functioning
is to change the return type of MaxZDist2 from float to double.  This "delta"
then allowed me to compare the assembly language output between the two versions.

After many hours of eliminating label numbering and register allocation
differences, I managed to reduce the differences between the two .s files
to just:

<       fstps   -28(%ebp)
>       fstps   -40(%ebp)
>       movl    -40(%ebp), %eax
>       movl    %eax, -28(%ebp)

Where the second version aborts and the first version doesn't!  I
believe that the problem is that -40(%ebp) isn't a valid temporary
stack slot, which ends up corrupting the stack.  Changing -40(%ebp)
to -28(%ebp) resolves the failure (so it isn't an IA-32 RAW hazard :)

The issue is that the function MaxZDist2 looks like:

float MaxZDist2(...)
  if (!n) return 0.0f;
  return max_dist2;

Which is being transformed (in the broken float form) into the sequence:

        movl $0x0, %eax     // float return value allocated to %eax 
        if (!n) goto L21
        fstps   -40(%ebp)   // calculated max_dist2 calculated in st(0)
        movl    -40(%ebp), %eax  // move max_dist2 into return value
        movl    %eax, -28(%ebp)  // store result of MaxZDist2

The "double" version which works perfectly, instead uses the sequence

        if (!n) goto L21
        fstp %st(0)
        fstps -28(%ebp)

where the contents of the ... and the rest of the function are identical.

The question now arises of how does GCC manage to use an invalid stack
slot, -40(%ebp)???  I suspect that the interaction is caused by the fact
that MaxZDist2 is being inlined into FindBestRotation (-finline-functions
or -O3 is necessary), and that FindBestRotation is calling alloca before
MaxZDist2 is called (changing the alloca to malloc or auto variables makes
the problem go away).  Perhaps prior to inlining -40(%ebp) was known to
be the top of the stack, but this was never corrected for upon inlining.
Certainly, the function prologues and epilogues are identical between the
working (double) and broken (float) forms, even though the float version
clearly uses an additional stack slot that the double version doesn't.

I've also confirmed that GNU gcc 3.3.1 is affected but that GNU gcc 3.3
isn't.  The problem was originally reported using the Fedora system compiler,
but I've failed to reproduce it with the system gcc-3.3.1 that comes with
SuSe 9.0.
Comment 4 roger 2004-01-11 13:55:08 UTC
The stack slot -40(%ebp) is created during global register allocation (it appears
for the first time in the .greg dump file), indicating a bug somewhere in GCC's
stack spilling code.
Comment 5 roger 2004-01-15 14:52:38 UTC
Well, the -40(%ebp) appears to be being created legitimately (on gcc-3.3.3) by
a call to assign_stack_local from get_secondary_mem in reload.c.  There seems
to be some poor interaction between alloca and the x86 stack frame.  The
prologue looks like:

        pushl   %ebp
        movl    %esp, %ebp                    // %sp == %ebp
        pushl   %edi                          // %sp == %ebp - 4
        pushl   %esi                          // %sp == %ebp - 8
        pushl   %ebx                          // %sp == %ebp - 12
        subl    $28, %esp                     // %sp == %ebp - 40

        // calculate argument to alloca
        movl    16(%ebp), %edi               // %edi = n
        leal    (%edi,%edi,2), %edx          // %edx = 3*n
        leal    15(,%edx,4), %esi            // %esi = 3*n*sizeof(float)+15
        andl    $-16, %esi                   // round to multiple of 16
        subl    %esi, %esp
        leal    4(%esp), %edx                // %edx result from alloca
        fstps   -40(%ebp)    // same address as %edx + %esi - 4

Changing the "subl $28, %esp" to "subl $32, %esp" resolves the failure.  So
it looks like there's a poor interaction between the stack slot allocation
routines and alloca.  i.e. the last four bytes of the first alloca allocation
(where the allocation is a multiple of 16 and therefore has no padding)
overlap with the last four bytes of of cfun->x_frame_offset, i.e. the last
allocated stack slot.

Can someone more familiar with i386's stack frames take it from here?
I don't know where the x86's push is predecrement or postdecrement so I
don't know whether the "%sp == %ebp - 40", i.e. having sp point to the
problematic stack slot, is the cause of the problem or not.  Internally,
that stack_frame_size is "28", FRAME_GROWS_DOWNWARDS and the last/problematic
stack slot is the last four bytes of these 28, i.e.

      (MEM (PLUS (REG frame_pointer_rtx) (CONST_INT -28)))

which once "eliminate_regs" has has its way becomes

      (MEM (PLUS (REG %ebp) (CONST_INT -40)))

I hope this helps.
Comment 6 Andrew Pinski 2004-01-16 00:04:03 UTC
Patch here: <http://gcc.gnu.org/ml/gcc-patches/2004-01/msg01532.html>.
Comment 7 CVS Commits 2004-01-16 18:53:57 UTC
Subject: Bug 13608

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	hubicka@gcc.gnu.org	2004-01-16 18:53:52

Modified files:
	gcc            : ChangeLog c-pretty-print.c dwarf2out.c 
	                 et-forest.c haifa-sched.c ra.h 
	gcc/config/i386: i386.c i386.md 
	gcc/cp         : ChangeLog mangle.c 

Log message:
	* i386.md (load_tp_di): Fix pasto.
	PR opt/13608
	* i386.c (ix86_compute_frame_layout): Fix for alloca on leaf function.
	* c-pretty-print.c (pp_c_type_cast, pp_c_abstract_declarator,
	pp_c_character_constant, pp_c_floating_constant,
	pp_c_additive_expression, pp_c_shift_expression,
	pp_c_equality_expression, pp_c_and_expression,
	pp_c_exclusive_or_expression, pp_c_inclusive_or_expression,
	pp_c_logical_and_expression): Remove inline modifier.
	* dwarf2out.c (get_AT): Likewise.
	* et-forest.c (et_splay): Likewise.
	* ra.h (ra_alloc, ra_calloc): Likewise
	* mangle.c (write_mangled_name): Remove inline modifier.


Comment 8 CVS Commits 2004-01-16 18:55:56 UTC
Subject: Bug 13608

CVSROOT:	/cvs/gcc
Module name:	gcc
Branch: 	hammer-3_3-branch
Changes by:	hubicka@gcc.gnu.org	2004-01-16 18:55:53

Modified files:
	gcc            : ChangeLog.hammer 
	gcc/config/i386: i386.c 

Log message:
	PR opt/13608
	* i386.c (ix86_compute_frame_layout): Fix for alloca on leaf function.


Comment 9 CVS Commits 2004-01-16 19:01:15 UTC
Subject: Bug 13608

CVSROOT:	/cvs/gcc
Module name:	gcc
Branch: 	gcc-3_3-branch
Changes by:	hubicka@gcc.gnu.org	2004-01-16 19:01:10

Modified files:
	gcc            : ChangeLog 
	gcc/config/i386: i386.c 

Log message:
	PR opt/13608
	* i386.c (ix86_compute_frame_layout): Fix for alloca on leaf function.


Comment 10 Jan Hubicka 2004-01-16 19:02:12 UTC