Bug 83201

Summary: [7/8 Regression] SPEC CPU2017 505.mcf_r produces incorrect output when built with -flto and FDO
Product: gcc Reporter: Pat Haugen <pthaugen>
Component: ltoAssignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED INVALID    
Severity: normal CC: bill.schmidt, brlcad, dje, hubicka, jeffreyalaw, lists, rguenth, segher
Priority: P3 Keywords: wrong-code
Version: 8.0   
Target Milestone: 7.3   
Host: powerpc64le-unknown-linux-gnu, x86_64-linux-gnu Target: powerpc64le-unknown-linux-gnu, x86_64-linux-gnu
Build: powerpc64le-unknown-linux-gnu, x86_64-linux-gnu Known to work:
Known to fail: Last reconfirmed: 2017-12-19 00:00:00
Attachments: Patch that removes violation of aliasing rules

Description Pat Haugen 2017-11-28 20:12:25 UTC
505.mcf_f produces incorrect output when built with both LTO/FDO. Using either option separately is fine. GCC trunk r255207 was used. Following are options used.

OPTIMIZE        = -O3 -mcpu=power8 -flto

PASS1_FLAGS   = -fprofile-generate
PASS1_LDFLAGS  = -fprofile-generate
PASS2_FLAGS   = -fprofile-use
PASS2_LDFLAGS  = -fprofile-use


Contents of inp.out.mis (miscompares).

0010:  simplex iterations         : 107102
       simplex iterations         : 107598
                                       ^
0014:  simplex iterations         : 152479
       simplex iterations         : 149876
                                     ^
0016:  erased arcs                : 995716
       erased arcs                : 995702
                                        ^
0017:  new implicit arcs          : 2995716
       new implicit arcs          : 2995702
                                         ^
0019:  simplex iterations         : 253145
       simplex iterations         : 248008
                                     ^
0020:  objective value            : 12161789395
       objective value            : 12171761765
                                       ^
0021:  erased arcs                : 2991635
       erased arcs                : 2991537
                                        ^
0022:  new implicit arcs          : 2991635
       new implicit arcs          : 2991537
                                        ^
0024:  simplex iterations         : 398127
       simplex iterations         : 385785
                                     ^
0025:  objective value            : 11729854482
       objective value            : 11769820561
                                       ^


It appears to work fine with r254943. I'll start a bisect and post results.
Comment 1 Bill Schmidt 2017-11-28 20:20:01 UTC
It may be latent for a while -- the same problem exists with GCC 7.  (Well, technically with branches/ibm/gcc-7-branch.)
Comment 2 Pat Haugen 2017-11-28 22:53:35 UTC
(In reply to Pat Haugen from comment #0)
> 
> It appears to work fine with r254943. I'll start a bisect and post results.

My bisect showed that r254946 was where it started failing on trunk. And yes, it fails with current GCC 7 branch too.
Comment 3 Richard Biener 2017-11-29 08:32:28 UTC
(In reply to Pat Haugen from comment #2)
> (In reply to Pat Haugen from comment #0)
> > 
> > It appears to work fine with r254943. I'll start a bisect and post results.
> 
> My bisect showed that r254946 was where it started failing on trunk. And
> yes, it fails with current GCC 7 branch too.

Probably a bogus bisect point, for a few revisions inlining was broken so it's probably broken a bit earlier as well.
Comment 4 Richard Biener 2017-12-14 13:04:25 UTC
Do we have an indication that it worked with any 6.x or 7.x release?
Comment 5 Pat Haugen 2017-12-15 02:30:57 UTC
Current FSF 6 branch works fine, so I have some bisect points. Will comment further as I find out.
Comment 6 Pat Haugen 2017-12-15 21:39:22 UTC
So I did a bisect of trunk during the GCC 7 development timeframe (r235035-r247017) and it pointed to r236878 as the point where the failure started.


+++ gcc/ChangeLog	(revision 236878)
@@ -1,3 +1,9 @@
+2016-05-30  Jan Hubicka  <hubicka@ucw.cz>
+
+	* tree-ssa-loop-ivcanon.c (try_peel_loop): Correctly set wont_exit
+	for peeled copies; avoid underflow when updating estimates; correctly
+	scale loop profile.
+
Comment 7 Martin Liška 2017-12-19 10:52:21 UTC
Confirmed on x86_64-linux-gnu with current trunk. I'll try to reproduce with the suspected revision.
Comment 8 Martin Liška 2017-12-19 12:09:29 UTC
But my bisection script points to r217827 as first bad revision:

2014-11-20   Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63677
	* tree-ssa-dom.c: Include gimplify.h for unshare_expr.
	(avail_exprs_stack): Make a vector of pairs.
	(struct hash_expr_elt): Replace stmt member with vop member.
	(expr_elt_hasher::equal): Simplify.
	(initialize_hash_element): Adjust.
	(initialize_hash_element_from_expr): Likewise.
	(dom_opt_dom_walker::thread_across_edge): Likewise.
	(record_cond): Likewise.
	(dom_opt_dom_walker::before_dom_children): Likewise.
	(print_expr_hash_elt): Likewise.
	(remove_local_expressions_from_table): Restore previous state
	if requested.
	(record_equivalences_from_stmt): Record &x + CST as constant
	&MEM[&x, CST] for further propagation.
	(vuse_eq): New function.
	(lookup_avail_expr): For loads use the alias oracle to see
	whether a candidate from the expr hash is usable.
	(avail_expr_hash): Do not hash VUSEs.

	* gcc.dg/tree-ssa/ssa-dom-cse-2.c: New testcase.
	* gcc.dg/tree-ssa/ssa-dom-cse-3.c: Likewise.
Comment 9 Martin Liška 2017-12-19 12:21:54 UTC
Using just a single ltrans, I see first divergence in mcf_r.ltrans0.088t.dom1.
Richi, how possible is the revision real culprit?
Comment 10 rguenther@suse.de 2017-12-19 13:05:57 UTC
On Tue, 19 Dec 2017, marxin at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83201
> 
> --- Comment #9 from Martin Liška <marxin at gcc dot gnu.org> ---
> Using just a single ltrans, I see first divergence in mcf_r.ltrans0.088t.dom1.
> Richi, how possible is the revision real culprit?

Unlikely, it probably triggers a latent issue...
Comment 11 Martin Liška 2017-12-19 13:12:01 UTC
(In reply to rguenther@suse.de from comment #10)
> On Tue, 19 Dec 2017, marxin at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83201
> > 
> > --- Comment #9 from Martin Liška <marxin at gcc dot gnu.org> ---
> > Using just a single ltrans, I see first divergence in mcf_r.ltrans0.088t.dom1.
> > Richi, how possible is the revision real culprit?
> 
> Unlikely, it probably triggers a latent issue...

That's not much positive :/ Ideas what to do with that?
Comment 12 rguenther@suse.de 2017-12-19 13:17:44 UTC
On Tue, 19 Dec 2017, marxin at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83201
> 
> --- Comment #11 from Martin Liška <marxin at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #10)
> > On Tue, 19 Dec 2017, marxin at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83201
> > > 
> > > --- Comment #9 from Martin Liška <marxin at gcc dot gnu.org> ---
> > > Using just a single ltrans, I see first divergence in mcf_r.ltrans0.088t.dom1.
> > > Richi, how possible is the revision real culprit?
> > 
> > Unlikely, it probably triggers a latent issue...
> 
> That's not much positive :/ Ideas what to do with that?

Given 505.mcf_r isn't too large figuring out the miscompiled function
shouldn't be too hard (heh, well ...).  The key is probably to somehow
reproduce without FDO ... like just do -fprofile-use?  What other flags
did you use on x86_64?
Comment 13 Richard Biener 2017-12-19 13:38:27 UTC
Likely invalid.  spec_qsort is full of alias violations.  We sort

typedef struct basket
{
    arc_t *a;
    cost_t cost;
    cost_t abs_cost;
    LONG number;
} BASKET;

and spec_qsort does stuff like

        if (n < 7) {
                for (pm = (char *)a + es; pm < (char *)a + n * es; pm += es)
                        for (pl = pm;
                             pl > (char *)a && cmp(pl - es, pl) > 0;
                             pl -= es)
                                swap(pl, pl - es);
                return;

with

#define swap(a, b)                              \
        if (swaptype_long == 0) {               \
                long t = *(long *)(a);          \
                *(long *)(a) = *(long *)(b);    \
                *(long *)(b) = t;               \
        } else if (swaptype_int == 0) {         \
                int t = *(int *)(a);            \
                *(int *)(a) = *(int *)(b);      \
                *(int *)(b) = t;                \
        } else                                  \
                swapfunc((char *)a, (char *)b, es, swaptype_long, swaptype_int)

eh... (no, swapfunc isn't any better).

Anybody up to reporting this to SPEC?
Comment 14 Richard Biener 2017-12-19 13:39:45 UTC
Copies in more benchmarks:

> find benchspec -name spec_qsort.c
benchspec/common/spec_qsort/spec_qsort.c
benchspec/CPU/505.mcf_r/src/spec_qsort/spec_qsort.c
benchspec/CPU/502.gcc_r/src/spec_qsort/spec_qsort.c
benchspec/CPU/511.povray_r/src/spec_qsort/spec_qsort.c
benchspec/CPU/527.cam4_r/src/spec_qsort/spec_qsort.c

...
Comment 15 Richard Biener 2017-12-19 13:55:52 UTC
SWAPINIT should end up with swaptype_long == 1 I think and swaptype_int == 1
for the cases in question.  Enforcing swaptype_int = swaptype_long = 2
should make it work (scratch SWAPINIT calls).

#define SWAPINIT(TYPE, a, es) swaptype_ ## TYPE =       \
        ((char *)a - (char *)0) % sizeof(TYPE) ||       \
        es % sizeof(TYPE) ? 2 : es == sizeof(TYPE) ? 0 : 1;

static INLINE void
swapfunc(char *a, char *b, int n, int swaptype_long, int swaptype_int)
{
        if (swaptype_long <= 1)
                swapcode(long, a, b, n)
        else if (swaptype_int <= 1)
                swapcode(int, a, b, n)
        else
                swapcode(char, a, b, n)
}

#define swap(a, b)                              \
        if (swaptype_long == 0) {               \
                long t = *(long *)(a);          \
                *(long *)(a) = *(long *)(b);    \
                *(long *)(b) = t;               \
        } else if (swaptype_int == 0) {         \
                int t = *(int *)(a);            \
                *(int *)(a) = *(int *)(b);      \
                *(int *)(b) = t;                \
        } else                                  \
                swapfunc((char *)a, (char *)b, es, swaptype_long, swaptype_int)

...


loop:   SWAPINIT(long, a, es);
        SWAPINIT(int, a, es);
Comment 16 Martin Liška 2017-12-19 14:00:38 UTC
(In reply to Richard Biener from comment #15)
> SWAPINIT should end up with swaptype_long == 1 I think and swaptype_int == 1
> for the cases in question.  Enforcing swaptype_int = swaptype_long = 2
> should make it work (scratch SWAPINIT calls).

I can confirm that.
Comment 17 Martin Liška 2017-12-19 14:50:33 UTC
Created attachment 42919 [details]
Patch that removes violation of aliasing rules
Comment 18 Pat Haugen 2017-12-19 16:42:07 UTC
(In reply to Martin Liška from comment #16)
> (In reply to Richard Biener from comment #15)
> > SWAPINIT should end up with swaptype_long == 1 I think and swaptype_int == 1
> > for the cases in question.  Enforcing swaptype_int = swaptype_long = 2
> > should make it work (scratch SWAPINIT calls).
> 
> I can confirm that.

Yes, that fixes the problem for me on PowerPC also. I can pass along the info to our SPEC rep.


Richi,
  I'm curious if the alias violations were apparent in a dump file, or did you just happened to spot them looking through the source?
Comment 19 rguenther@suse.de 2017-12-19 17:32:35 UTC
On December 19, 2017 5:42:07 PM GMT+01:00, "pthaugen at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org> wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83201
>
>--- Comment #18 from Pat Haugen <pthaugen at gcc dot gnu.org> ---
>(In reply to Martin Liška from comment #16)
>> (In reply to Richard Biener from comment #15)
>> > SWAPINIT should end up with swaptype_long == 1 I think and
>swaptype_int == 1
>> > for the cases in question.  Enforcing swaptype_int = swaptype_long
>= 2
>> > should make it work (scratch SWAPINIT calls).
>> 
>> I can confirm that.
>
>Yes, that fixes the problem for me on PowerPC also. I can pass along
>the info
>to our SPEC rep.
>
>
>Richi,
>I'm curious if the alias violations were apparent in a dump file, or
>did you
>just happened to spot them looking through the source?

I just looked at the source seeing *(long *) ptr and bells went off. 

Richard.
Comment 20 Jeffrey A. Law 2018-01-07 04:37:18 UTC
Based on c#18.
Comment 21 john henning 2018-05-09 16:13:49 UTC
Comments from SPEC CPU Documentation Guy:

As described at 
https://www.spec.org/cpu2017/Docs/faq.html#Miscompare.07
https://www.spec.org/cpu2017/Docs/benchmarks/505.mcf_r.html#Portability
SPEC's recommendation is to use
   -fno-strict-aliasing

The patch attached to this bug is not approved by SPEC.

SPEC generally does not change benchmarks after release (no moving targets); more detail is at the above links.

   John Henning
   SPEC CPU Subcommittee Secretary
Comment 22 Martin Liška 2018-06-06 10:06:09 UTC
(In reply to john henning from comment #21)
> Comments from SPEC CPU Documentation Guy:
> 
> As described at 
> https://www.spec.org/cpu2017/Docs/faq.html#Miscompare.07
> https://www.spec.org/cpu2017/Docs/benchmarks/505.mcf_r.html#Portability
> SPEC's recommendation is to use
>    -fno-strict-aliasing
> 
> The patch attached to this bug is not approved by SPEC.
> 
> SPEC generally does not change benchmarks after release (no moving targets);
> more detail is at the above links.
> 
>    John Henning
>    SPEC CPU Subcommittee Secretary

Note that the same function is used by other benchmarks as well: 502.gcc_r, 511.povray_r, 527.cam4_r

That means the same miscompilation can happen for them as well :/
Comment 23 Sean 2022-03-02 09:18:35 UTC
Sorry for digging up the past, but recently ran into a related implementation issue in OpenBSD's qsort implementation and came across this discussion.  While I understand the proposed patch was rejected, it looks like it also degraded the implementation to single byte swaps.

Admittedly untested, but a possible solution that I think would make LTO happy (avoids aliasing the pointer) and doesn't change the implementation profile would be to cast it differently since it's always dealing with a simple type alignment check on long/int/char.  Something like this:

#define SWAPINIT(TYPE, a, es) swaptype_ ## TYPE = (uintptr_t)a % sizeof(TYPE) || es % sizeof(TYPE) ? 2 : es == sizeof(TYPE) ? 0 : 1;

Maybe a more acceptable alternative compromise since it would have a nearly identical performance profile with it still swap-optimizing int and long.
Comment 24 Richard Biener 2022-03-02 09:43:49 UTC
(In reply to Sean from comment #23)
> Sorry for digging up the past, but recently ran into a related
> implementation issue in OpenBSD's qsort implementation and came across this
> discussion.  While I understand the proposed patch was rejected, it looks
> like it also degraded the implementation to single byte swaps.
> 
> Admittedly untested, but a possible solution that I think would make LTO
> happy (avoids aliasing the pointer) and doesn't change the implementation
> profile would be to cast it differently since it's always dealing with a
> simple type alignment check on long/int/char.  Something like this:
> 
> #define SWAPINIT(TYPE, a, es) swaptype_ ## TYPE = (uintptr_t)a %
> sizeof(TYPE) || es % sizeof(TYPE) ? 2 : es == sizeof(TYPE) ? 0 : 1;
> 
> Maybe a more acceptable alternative compromise since it would have a nearly
> identical performance profile with it still swap-optimizing int and long.

Note the issue is not the way the type is computed but the memory
accesses done via

#define swap(a, b)                              \
        if (swaptype_long == 0) {               \
                long t = *(long *)(a);          \
                *(long *)(a) = *(long *)(b);    \
                *(long *)(b) = t;               \
        } else if (swaptype_int == 0) {         \
                int t = *(int *)(a);            \
                *(int *)(a) = *(int *)(b);      \
                *(int *)(b) = t;                \

there exists non-standard ways to preserve the access pattern like

typedef int notbaa_int __attribute__((may_alias));
typedef long notbaa_long __attribute__((may_alias));
#define swap(a, b)                              \
        if (swaptype_long == 0) {               \
                long t = *(notbaa_long *)(a);          \
                *(notbaa_long *)(a) = *(notbaa_long *)(b);    \
                *(notbaa_long *)(b) = t;               \
        } else if (swaptype_int == 0) {         \
                int t = *(notbaa_int *)(a);            \
                *(notbaa_int *)(a) = *(notbaa_int *)(b);      \
                *(notbaa_int *)(b) = t;                \

or portable ones like doing

  long t;
  memcpy (&t, a, sizeof (long));
  memcpy (a, b, sizeof (long));
  memcpy (b, &t, sizeof (long));

and hoping for the compiler to optimize this well (GCC will do).

Note at time I proposed the change to the originating project (freebsd?) and
it was accepted there.
Comment 25 Sean 2022-03-02 10:25:24 UTC
Ah, that makes sense, thank you Richard.  I didn't pay as close attention to the actual swap() code and casting going on there.

Apparently unrelated, but perhaps worth noting the reason this come up on my radar is because the latest compilers now also issue a warning on the subtraction:

error: performing pointer subtraction with a null pointer has undefined behavior [-Werror,-Wnull-pointer-subtraction]