Bug 78176 - [MIPS] miscompiles ldxc1 with large pointers on 32-bits
Summary: [MIPS] miscompiles ldxc1 with large pointers on 32-bits
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 6.2.0
: P3 normal
Target Milestone: 8.0
Assignee: Doug Gilmore
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2016-11-01 12:35 UTC by James Cowgill
Modified: 2017-03-07 10:52 UTC (History)
8 users (show)

See Also:
Host:
Target: mips
Build:
Known to work: 7.0
Known to fail: 6.2.0
Last reconfirmed: 2017-01-16 00:00:00


Attachments
testcase (433 bytes, text/plain)
2016-11-01 12:35 UTC, James Cowgill
Details
testcase v2 (437 bytes, text/plain)
2016-11-01 12:49 UTC, James Cowgill
Details
Prototype change to backout r216501. (1.35 KB, patch)
2017-01-31 00:14 UTC, Doug Gilmore
Details | Diff
Tweak to adjust_setup_cost (r220473). (960 bytes, patch)
2017-01-31 00:18 UTC, Doug Gilmore
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description James Cowgill 2016-11-01 12:35:36 UTC
Created attachment 39937 [details]
testcase

Originally from the 'flint' package on Debian failing to build:
https://buildd.debian.org/status/fetch.php?pkg=flint&arch=mipsel&ver=2.5.2-10&stamp=1477705959

The attached testcase when compiled and run on a 64-bit mips processor crashes with a SIGBUS error. The same binary works when run on a 32-bit mips processor.

Compile with -O2 optimization like this:
> gcc -O2 -march=mips32r2 ldxc1.c -o ldxc1

The crash occurs at the ldxc1 instruction at the end of this exerpt from the inner loop in the ldxc1_test function:

> 8e0:   02028023        subu    s0,s0,v0
> 8e4:   afbf003c        sw      ra,60(sp)
> 8e8:   0080b025        move    s6,a0
> 8ec:   8fb3005c        lw      s3,92(sp)
> [loop begins here]
> 8f0:   02e0a825        move    s5,s7
> 8f4:   26d6ffff        addiu   s6,s6,-1
> 8f8:   0256102a        slt     v0,s2,s6
> 8fc:   10400012        beqz    v0,948 <ldxc1_test+0xd8>
> 900:   8f998054        lw      t9,-32684(gp)
> 904:   8e620000        lw      v0,0(s3)
> 908:   8ea6fff8        lw      a2,-8(s5)
> 90c:   8ee30000        lw      v1,0(s7)
> 910:   00551021        addu    v0,v0,s5
> 914:   26b5fffc        addiu   s5,s5,-4
> 918:   00511021        addu    v0,v0,s1
> 91c:   00c33023        subu    a2,a2,v1
> 920:   8c420000        lw      v0,0(v0)
> 924:   00541021        addu    v0,v0,s4
> 928:   2694fff8        addiu   s4,s4,-8
> 92c:   4e020301        ldxc1   $f12,v0(s0)

Before the ldxc1 instruction is executed, gdb reports that the values in v0 and s0 are both large integers (above 0x80000000):
(gdb) print/x $v0
$1 = 0xfffee7f8
(gdb) print/x $s0
$2 = 0x80008b50

When added together, the lower 32-bits contains the correct pointer (in this case on the stack). On a 32-bit processor this is fine.

On a 64-bit processor, we know that v0 and s0 are sign extended as the last instructions to touch them were the addu at 924 and the subu at 8e0. So the values in the registers are actually:
v0 = 0xfffffffffffee7f8
s0 = 0xffffffff80008b50

Adding these together (modulo 64-bit) gives the final pointer of 0xffffffff7fff7348 which is outside the user address space and thus results in a SIGBUS.

I think GCC is assuming that the address calculated by the ldxc1 instruction is modulo 32-bit when compiled for a 32-bit processor. However, this is not true if the code is later run on a 64-bit processor.
Comment 1 James Cowgill 2016-11-01 12:49:58 UTC
Created attachment 39938 [details]
testcase v2

Testcase actually initializes the arrays this time :)

The bug still occurs.
Comment 2 Andrew Pinski 2016-11-01 15:28:54 UTC
I think this code is undefined if you have wrapping pointers. No pointer should ever be above INT_MAX in user space on mips32 due to the memory layout on MIPS32.
Comment 3 James Cowgill 2016-11-01 15:35:37 UTC
As far as I can tell, all the pointers in the original C code are valid and do not wrap. Some of the registers wrap, but they're not pointers (until added with other registers).
Comment 4 Andrew Pinski 2016-11-01 16:16:01 UTC
Then the problem is a reassociation issue. That is we should never get an overflow happening for pointers for MIPS.
Comment 5 Matthew Fortune 2016-11-03 16:38:47 UTC
I'm struggling to see what can be fixed in GCC for this (and I'm not sure I follow why there is a problem with GCC's behaviour) but...

The issue may stem from the C front end where the dumps start off as below. Note that the '-1' in kappa-1 has ended up being represented as 1073741823 (0x3fffffff) presumably owing to the fact it will be multiplied by 4 eventually and hence be 'safe'. I can't figure out so far what creates this value but the behavior started some time between gcc 4.7 and 4.8 I believe. The questionably bad code then shows up dependent on optimisations and instruction availability. If the calculation had been done with ADDU instead of in the LDXC1 then all would be well.

Regardless of whether this is a real bug for the compiler then the Linux kernel should be making use of the status.UX bit in order to properly emulate a 32-bit environment for o32 on 64-bit kernel/hardware and had this been run using a 32-bit kernel (on 64-bit hardware) then it would run correctly. There is an issue for N32 code with the UX bit in that you also need the PX bit available to independently control 64-bit instruction availability vs address space but this is only missing in MIPSIV cores.

;; Function ldxc1_test (null)
;; enabled by -tree-original


{
  int kappa2 = kappa;
  double tmp = 0.0;

    int kappa2 = kappa;
    double tmp = 0.0;
  <D.1497>:;
  kappa-- ;
  if (zeros + 1 < kappa)
    {
      tmp = *(*(r->rows + ((sizetype) kappa + 1073741823) * 4) + ((sizetype) kappa + 536870911) * 8) * ctt;
      tmp = ldexp (tmp, *(expo + ((sizetype) kappa + 1073741823) * 4) - *(expo + (sizetype) ((unsigned int) kappa2 * 4)));
    }
  if (zeros + 1 < kappa && *(s + ((sizetype) kappa + 536870911) * 8) <= tmp) goto <D.1497>; else goto <D.1498>;
  <D.1498>:;
  return tmp;
Comment 6 Eric Botcazou 2016-11-03 17:20:43 UTC
> The issue may stem from the C front end where the dumps start off as below.
> Note that the '-1' in kappa-1 has ended up being represented as 1073741823
> (0x3fffffff) presumably owing to the fact it will be multiplied by 4
> eventually and hence be 'safe'.

All pointer calculations are done in sizetype and it is unsigned.  This kind of issues generally come from the RTL level, maybe expansion here.
Comment 7 Matthew Fortune 2016-11-04 11:15:47 UTC
(In reply to Eric Botcazou from comment #6)
> > The issue may stem from the C front end where the dumps start off as below.
> > Note that the '-1' in kappa-1 has ended up being represented as 1073741823
> > (0x3fffffff) presumably owing to the fact it will be multiplied by 4
> > eventually and hence be 'safe'.
> 
> All pointer calculations are done in sizetype and it is unsigned.  This kind
> of issues generally come from the RTL level, maybe expansion here.

The expansion looks like an acceptable transformation to me i.e. it is not introducing the overflow for the offending pointer just maintaining what is already in the tree.

I dug back to reassoc2 based on Andrew's comment which gets the following as input:

  _48 = (sizetype) kappa_6(D);
  _49 = _48 * 8;
  _50 = s_37(D) + _49;
  ivtmp.19_47 = (unsigned int) _50

and then in the loop:

  _54 = (sizetype) s_37(D);
  _55 = ivtmp.19_3 - _54;
  _56 = _55 + 4294967280;
  # RANGE [0, 4294967295] NONZERO 4294967288
  _19 = _56;
  # PT = nonlocal escaped 
  _20 = _17 + _19;
  # VUSE <.MEM_4>
  _21 = *_20

This includes a pointer subtraction of 's' (somewhat pointless but valid I believe).

reassoc2 changes it to this:

  _48 = (sizetype) kappa_6(D);
  _49 = _48 * 8;
  _50 = s_37(D) + _49;
  ivtmp.19_47 = (unsigned int) _50;

and in the loop:

  _54 = (sizetype) s_37(D);
  _38 = -_54;
  _25 = 4294967280 - _54;
  _56 = _25 + ivtmp.19_3;
  # RANGE [0, 4294967295] NONZERO 4294967288
  _19 = _56;
  # PT = nonlocal escaped 
  _20 = _17 + _19;
  # VUSE <.MEM_4>
  _21 = *_20;

And _56 is where we get the overflow.

I'm still not sure if there is really a bug. Should reassoc not be doing this for 'sizetype'? Should ivopts not have created the mess in the first place? Would changing either of these actually introduce an assurance that this situation won't occur from other optimisations?
Comment 8 Eric Botcazou 2016-11-04 11:50:35 UTC
> The expansion looks like an acceptable transformation to me i.e. it is not
> introducing the overflow for the offending pointer just maintaining what is
> already in the tree.

Wrap around for unsigned types is OK but, if expansion does implicit extensions to larger types, then things can easily go wrong.

> I'm still not sure if there is really a bug. Should reassoc not be doing
> this for 'sizetype'? Should ivopts not have created the mess in the first
> place? Would changing either of these actually introduce an assurance that
> this situation won't occur from other optimisations?

sizetype is unsigned so all the transformations looks valid.
Comment 9 Matthew Fortune 2016-11-04 13:45:58 UTC
(In reply to Eric Botcazou from comment #8)
> > The expansion looks like an acceptable transformation to me i.e. it is not
> > introducing the overflow for the offending pointer just maintaining what is
> > already in the tree.
> 
> Wrap around for unsigned types is OK but, if expansion does implicit
> extensions to larger types, then things can easily go wrong.

Sure, and in this case there are no implicit extensions to larger types. The bug does require an implicit extension to occur but this only happens at runtime when on 64-bit hardware and the core is not configured to be limited to a 32-bit address space which is always true for a 64-bit kernel as it turns out.

> > I'm still not sure if there is really a bug. Should reassoc not be doing
> > this for 'sizetype'? Should ivopts not have created the mess in the first
> > place? Would changing either of these actually introduce an assurance that
> > this situation won't occur from other optimisations?
> 
> sizetype is unsigned so all the transformations looks valid.

I'm leaning heavily towards not-a-compiler-bug but may need to offer a workaround until we can solve this in the Linux kernel.

The only workaround possible is to provide an option to disable register+index addressing i.e. remove [ls][wd]xc1 instructions.
Comment 10 Eric Botcazou 2016-11-04 17:25:27 UTC
> Sure, and in this case there are no implicit extensions to larger types. The
> bug does require an implicit extension to occur but this only happens at
> runtime when on 64-bit hardware and the core is not configured to be limited
> to a 32-bit address space which is always true for a 64-bit kernel as it
> turns out.

OK, if the compiler doesn't know that larger types are involved, then indeed it cannot do anything to prevent implicit extensions.  Such a configuration needs to define Pmode as DImode and ptr_mode as SImode for the compiler to work properly.
Comment 11 Maciej W. Rozycki 2016-11-04 19:03:19 UTC
TBH this does look like trying to rely on UB to me, as per section
6.5.6 "Additive operators" clause 8 of the C language standard, which
states (among others):

"If both the pointer operand and the result point to elements of the
same array object, or one past the last element of the array object,
the evaluation shall not produce an overflow; otherwise, the behavior
is undefined."

Here under the triggering conditions the pointer the integer is added
to with LDXC1 does not point to an element of the array operated on (or
to one past the last), so the hardware operation matches the standard's
semantics WRT overflow and I don't think we ought to be pushing it.

So it looks like a middle end bug to me and the backend is fine in
faithfully assuming its RTL pattern won't be passed operands leading to
UB.

Have I missed anything?

  Maciej
Comment 12 Doug Gilmore 2017-01-13 21:43:06 UTC
Bisected the problem to commit r216501:

commit 9a416363e99c9f2d48fa810e220bc2f7904f1788
Author: zqchen <zqchen@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Tue Oct 21 03:38:37 2014 +0000

    2014-10-21  Zhenqiang Chen  <zhenqiang.chen@arm.com>
    
    	* cfgloopanal.c (seq_cost): Delete.
    	* rtl.h (seq_cost): New prototype.
    	* rtlanal.c (seq_cost): New function.
    	* tree-ssa-loop-ivopts.c (seq_cost): Delete.
    
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@216501 138bc75d-0d04-0410-961f-82ee72b054a4

More analysis to follow.

Given the short time until the release, we plan submit a patch to
provide a target flag and build option to avoid the bug.
Comment 13 mpf 2017-01-16 17:02:24 UTC
(In reply to Maciej W. Rozycki from comment #11)
> TBH this does look like trying to rely on UB to me, as per section
> 6.5.6 "Additive operators" clause 8 of the C language standard, which
> states (among others):
> 
> "If both the pointer operand and the result point to elements of the
> same array object, or one past the last element of the array object,
> the evaluation shall not produce an overflow; otherwise, the behavior
> is undefined."
> 
> Here under the triggering conditions the pointer the integer is added
> to with LDXC1 does not point to an element of the array operated on (or
> to one past the last), so the hardware operation matches the standard's
> semantics WRT overflow and I don't think we ought to be pushing it.

Aren't language rules mostly irrelevant at this stage though?
We don't really have the concept of a pointer after generating RTL we just happen to have some values in a mode that is the same as Pmode.

mips.h comment for Pmode:

/* Specify the machine mode that pointers have.
   After generation of rtl, the compiler makes no further distinction
   between pointers and any other objects of this machine mode.  */

> So it looks like a middle end bug to me and the backend is fine in
> faithfully assuming its RTL pattern won't be passed operands leading to
> UB.

I can't see any feature/option/setting that gives this guarantee. Why do you think the backend can make this assumption?
Comment 14 mpf 2017-01-19 16:06:30 UTC
Author: mpf
Date: Thu Jan 19 16:05:59 2017
New Revision: 244640

URL: https://gcc.gnu.org/viewcvs?rev=244640&root=gcc&view=rev
Log:
MIPS: PR target/78176 add -mlxc1-sxc1.

gcc/

	PR target/78176
	* config.gcc (supported_defaults): Add lxc1-sxc1.
	(with_lxc1_sxc1): Add validation.
	(all_defaults): Add lxc1-sxc1.
	* config/mips/mips.opt (mlxc1-sxc1): New option.
	* gcc/config/mips/mips.h (OPTION_DEFAULT_SPECS): Add a default for
	mlxc1-sxc1.
	(TARGET_CPU_CPP_BUILTINS): Add builtin_define for
	__mips_no_lxc1_sxc1.
	(ISA_HAS_LXC1_SXC1): Gate with mips_lxc1_sxc1.
	* gcc/doc/invoke.texi (-mlxc1-sxc1): Document the new option.
	* doc/install.texi (--with-lxc1-sxc1): Document the new option.

gcc/testsuite/

	* gcc.target/mips/lxc1-sxc1-1.c: New file.
	* gcc.target/mips/lxc1-sxc1-2.c: Likewise.
	* gcc.target/mips/mips.exp (mips_option_groups): Add ghost option
	HAS_LXC1.
	(mips_option_groups): Add -m[no-]lxc1-sxc1.
	(mips-dg-init): Detect default -mno-lxc1-sxc1.
	(mips-dg-options): Handle HAS_LXC1 arch upgrade/downgrade.

Added:
    trunk/gcc/testsuite/gcc.target/mips/lxc1-sxc1-1.c
    trunk/gcc/testsuite/gcc.target/mips/lxc1-sxc1-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config.gcc
    trunk/gcc/config/mips/mips.h
    trunk/gcc/config/mips/mips.opt
    trunk/gcc/doc/install.texi
    trunk/gcc/doc/invoke.texi
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/mips/mips.exp
Comment 15 Doug Gilmore 2017-01-31 00:14:31 UTC
Created attachment 40631 [details]
Prototype change to backout r216501.

> Bisected the problem to commit r216501:

The review discussion of r216501 starts with message:

https://gcc.gnu.org/ml/gcc-patches/2014-10/msg00758.html

Which contains:

    The are two implementations of seq_cost. The function bodies are
    exactly the same. The patch removes one of them and make the other
    global.

This seems the patch was cleanup that shouldn't introduce a
functional change.

However implementations of seq_cost are different, per
final version of the patch:

https://gcc.gnu.org/ml/gcc-patches/2014-10/msg00896.html

cfgloopanal.c:
-	cost += set_rtx_cost (set, speed);


rtlanal.c:
+        cost += set_rtx_cost (set, speed);

tree-ssa-loop-ivopts.c:
-	cost += set_src_cost (SET_SRC (set), speed);

In general, when computing the cost of a sequence of N INSNs this
increases the cost of the sequence by N*4.  This really throws
off the costing of substituting different IVs on MIPS.

The first patch attached (just a prototype) basically reverts
this change.  The second fixes a problem with r220473, a fix
for PR62631 from Eric Botcazou:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62631#c17

    This looks a generic problem in get_shiftadd_cost to me, it ought
    to mimic the algorithms in expmed.c, something like ...

This change can lower the cost of a sequence of instruction.  However
there are situations this (lower) cost is being scaled by an estimated
iteration count will cause the adjusted cost to now become zero.  For
the example attached to the second patch the IV replacement algorithm
will determine that the cost using separate IVs for each load will be
less than then cost of one IV for all loads.

Thus, in the second patch we detect that a non-zero cost being scaled
to zero should represented by one instead, which gets us back to
IVSOPTS generating just one IV that will be used for all loads.
Comment 16 Doug Gilmore 2017-01-31 00:18:13 UTC
Created attachment 40632 [details]
Tweak to adjust_setup_cost (r220473).

Second patch associated with previous comment.
Comment 17 Doug Gilmore 2017-01-31 00:51:19 UTC
> This really throws off the costing of substituting different IVs on
> MIPS.
I forgot to mention that for MIPS the net of effect r216501 is to not
produce indexed memory OPs in simple examples where we should.  But
we also will produce problematic indexed memory OPs in situations
where address generation costing is a bit complicated (the original
issue associated with this bug report).

Applying the the two patches I just attached fixes the problem of
generating indexed memory OPs in simple examples, and also will cause
IVOPTS to select IVs that are similar to those that were made in the
past that avoids the problem executing indexed memory OPs in O32
binaries on 64-bit MIPS processors running current Linux kernels.

There is still the issue of recognizing that rewriting a "use" to use
a different IV can expose a problem with indexed memory OPs on 64-bit
MIPS processors, where an infinite cost should be associated in that
situation, that still needs to be addressed (thus the need for the
flag to turn off the generation of indexed memory OPs until this issue
is addressed).
Comment 18 Doug Gilmore 2017-01-31 04:57:42 UTC
CC author and reviewers of r216501.
Comment 19 Richard Biener 2017-02-01 15:29:28 UTC
I agree with the comments that this (if at all) needs to be fixed at RTL expansion time where we already do quite some "hacks" for sizetype
in POINTER_PLUS_EXPR context:

    case POINTER_PLUS_EXPR:
      /* Even though the sizetype mode and the pointer's mode can be different
         expand is able to handle this correctly and get the correct result out
         of the PLUS_EXPR code.  */
      /* Make sure to sign-extend the sizetype offset in a POINTER_PLUS_EXPR
         if sizetype precision is smaller than pointer precision.  */
      if (TYPE_PRECISION (sizetype) < TYPE_PRECISION (type))
        treeop1 = fold_convert_loc (loc, type,
                                    fold_convert_loc (loc, ssizetype,
                                                      treeop1));
      /* If sizetype precision is larger than pointer precision, truncate the
         offset to have matching modes.  */

but I don't see from the comments in this bug what the actual stmt is the
critical one.
Comment 20 Doug Gilmore 2017-02-02 06:07:07 UTC
I'll collect more tracing data on the costing problem.

Hopefully I post an update in the next few days.
Comment 21 mpf 2017-03-07 10:52:14 UTC
Moving target to GCC 8.0 albeit that it may not be solvable. Mitigated in GCC 7.