This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug rtl-optimization/50489] New: [UPC/IA64] mis-schedule of MEM ref with -ftree-vectorize and -fschedule-insns2


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50489

             Bug #: 50489
           Summary: [UPC/IA64] mis-schedule of MEM ref with
                    -ftree-vectorize and -fschedule-insns2
    Classification: Unclassified
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: gary@intrepid.com
            Target: IA64


After a change in GUPC's tree-lowering pass made a couple of months back (that
simplified the tree code being generated), we saw regressions where several
small test cases were failing on an IA64 target (SGI Altix, running SUSE).

So far, we have been unable to reduce this to a "C" only test case that
demonstrates the problem, so we are submitting this as a "UPC" bug report,
along with a script that will build the UPC compiler from the GUPC branch and
create the various bug artifacts.  Perhaps someone familiar with the
instruction scheduler will understand how this mis-scheduling happens and
either reproduce the issue as a "C" test case, or propose a patch.

We also do not know at this time whether the UPC compiler should be generating
memory barriers or some other metadata to avoid this mis-scheduling,
and would appreciate any suggestions in that regard.

The attached UPC test case works fine when compiled with "-O2 -ftree-vectorize
-fno-schedule-insns2", but demonstrates a mis-schedule when compiled with "-O2
-ftree-vectorize".

The following description is copied from the
README-ia64-upc-sched-insn2-bug.txt file that is included in the attached zip
file as well.

Background
----------

On a 64-bit target (using the "struct PTS" configuration),
the UPC compiler represents UPC pointer-to-shared values 
as 128-bit structures with three fields: (vaddr, thread, phase)
as shown in the declaration below.

typedef struct shared_ptr_struct
  {
    void     *vaddr;
    uint32_t thread;
    uint32_t phase;
  } upc_shared_ptr_t
  __attribute__ ((aligned (16)))
  ;
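
The field offsets matter later in this report, because the RTL notes
address the fields as stack offsets (+32, +40, +44).  As a minimal sketch,
the layout can be checked on a host compiler.  This assumes an LP64 host
with a GCC-compatible compiler; the function names
pts_layout_matches_report and check_pts_layout are hypothetical and are
not part of GUPC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same declaration as in the report (GCC attribute syntax). */
typedef struct shared_ptr_struct
  {
    void     *vaddr;
    uint32_t thread;
    uint32_t phase;
  } upc_shared_ptr_t
  __attribute__ ((aligned (16)))
  ;

/* Hypothetical helper: returns 1 when the layout is the 16-byte
   {vaddr @ 0, thread @ 8, phase @ 12} layout that the RTL notes rely on.
   This holds only where pointers are 8 bytes wide (assumption).  */
int
pts_layout_matches_report (void)
{
  return sizeof (upc_shared_ptr_t) == 16
         && offsetof (upc_shared_ptr_t, vaddr) == 0
         && offsetof (upc_shared_ptr_t, thread) == 8
         && offsetof (upc_shared_ptr_t, phase) == 12;
}

/* Hypothetical wrapper that aborts if the layout differs.  */
void
check_pts_layout (void)
{
  assert (pts_layout_matches_report ());
}
```

With the struct placed at r12 + 32, these offsets give r12 + 40 for the
thread field and r12 + 44 for the phase field.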

In ia64-upc-vaddr-bug.upc.143t.optimized, there is the following
sequence of tree statements:

  unsigned int D.3062;
  unsigned int D.3061;
  shared [8] struct foo * D.3060;
  shared [8] struct foo[1] * D.3059;
  struct upc_shared_ptr_t D.3058;
  unsigned int D.3057;

  D.3057_10 = D.3056_9 * 8;
  D.3058.vaddr = &_u_barray;
  MEM[(struct upc_shared_ptr_t *)&D.3058 + 8B] = { 0, 0 };
  D.3059_11 = VIEW_CONVERT_EXPR<shared [8] struct foo[1] *>(D.3058);
  D.3060_12 = (shared [8] struct foo *) D.3059_11;
  D.3061_13 = VIEW_CONVERT_EXPR<struct upc_shared_ptr_t>(D.3060_12).phase;
  D.3062_14 = D.3057_10 + D.3061_13;

D.3059_11 and D.3060_12 are UPC pointers-to-shared (PTS's);
these are 128-bit "fat" pointers with internal
{vaddr, thread, phase} fields.

D.3058 is a PTS representation struct that is initialized
to {&_u_barray, 0, 0}.  Note that D.3059_11 and D.3060_12
are copies of the PTS representation structure, D.3058
that have been recast into a UPC pointer-to-shared (PTS).

The casts above might impose inefficiencies, and there may
be ways to improve the code, but this is the current
tree code that is generated.

This assignment statement:
  D.3061_13 = VIEW_CONVERT_EXPR<struct upc_shared_ptr_t>(D.3060_12).phase;
extracts the 'phase' field from D.3060_12, which is a copy
of the value of D.3058.phase.  The value of D.3058.phase was
previously initialized to zero by the MEM[] assignment.
The fetched phase value D.3061_13 should be zero when this
assignment is executed.
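
The data flow described above can be approximated in plain C.  This is
only an illustration of the intended semantics, not a reproducer (no
"C" only test case has been found so far).  The names pts_rep_t,
barray_stand_in and fetch_phase_after_init are hypothetical, and memcpy
stands in for the single 8-byte MEM[] store and for the
VIEW_CONVERT_EXPR copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the PTS representation struct.  */
typedef struct
  {
    void     *vaddr;
    uint32_t thread;
    uint32_t phase;
  } pts_rep_t;

/* Hypothetical stand-in for _u_barray.  */
static char barray_stand_in[64];

/* Mirrors the tree sequence: set vaddr, zero {thread, phase} with one
   8-byte store, copy the struct, then read phase from the copy.  */
uint32_t
fetch_phase_after_init (void)
{
  pts_rep_t rep;                          /* plays the role of D.3058 */
  pts_rep_t copy;                         /* plays the role of D.3060_12 */
  const uint32_t zeros[2] = { 0, 0 };

  rep.vaddr = barray_stand_in;            /* D.3058.vaddr = &_u_barray */

  /* MEM[&D.3058 + 8B] = { 0, 0 }: one store covering thread and phase.  */
  memcpy ((char *) &rep + offsetof (pts_rep_t, thread), zeros, sizeof zeros);

  memcpy (&copy, &rep, sizeof copy);      /* the view-converted copy */
  return copy.phase;                      /* must observe the zero store */
}
```

Any correct ordering of the stores and the load returns zero here; the
mis-schedule corresponds to the load of phase moving ahead of the 8-byte
store.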

Bug
---

It is this latter access to D.3060_12.phase that expands
into incorrect RTL after the selective instruction scheduling
pass is run.  The access to D.3060_12.phase is scheduled
ahead of the code that sets D.3058.phase.

Valid RTL
---------

The 'ok' and 'bug' compilations share the same RTL dump output all
the way through ia64-upc-vaddr-bug.upc.213r.compgotos.
That file contains the RTL statements that are affected by the
mis-scheduling of instructions (additional notes added
as '#' comments):

# D.3058.vaddr = &_u_barray (the base address of barray).
#
# r34 was previously assigned the value of &_u_barray
# r47 = r12 + 32;
# r12 is the stack pointer and r47 points to the beginning
# of the D.3058 structure, which also happens to be the
# address of the first field, D.3058.vaddr.
# Therefore, r47 points to D.3058.vaddr

(insn 46 42 331 4 (set (mem/s/f/c:DI (reg/f:DI 47 r47 [532]) [2 D.3058.vaddr+0 S8 A128])
        (reg/f:DI 34 r34 [533])) ia64-upc-vaddr-bug.upc:11 5 {movdi_internal}
     (nil))

# This vector op assigns: {D.3058.thread = 0; D.3058.phase = 0;}
#
# This is done by using r46 as the destination address and r36 as the source.
# r46 = r12 + 40, which is the base address of D.3058.thread.
#                 (D.3058.phase is the field following the D.3058.thread)
# r36 was previously set to {0, 0}.

(insn 52 69 65 4 (set (mem/s/c:V2SI (reg/f:DI 46 r46 [534]) [3 MEM[(struct upc_shared_ptr_t *)&D.3058 + 8B]+0 S8 A64])
        (reg:V2SI 36 r36 [536])) ia64-upc-vaddr-bug.upc:11 377 {*movv2si_internal}
     (nil))

# r37 = D.3058.phase by indirecting through r45
#
# r12 is the stack pointer
# D.3058.vaddr  starts at r12 + 32
# D.3058.thread starts at r12 + 40
# D.3058.phase  starts at r12 + 44
# r45 = (r12 + 44); where r12 is the stack pointer and 44
#                   is the offset of D.3058.phase
# Therefore, r45 points to D.3058.phase.

(insn 57 59 68 4 (set (reg:DI 37 r37)
        (zero_extend:DI (mem/s/j/c:SI (reg/f:DI 45 r45 [524]) [0 VIEW_CONVERT_EXPR<struct upc_shared_ptr_t>(D.3060_12).phase+0 S4 A32]))) ia64-upc-vaddr-bug.upc:11 136 {zero_extendsidi2}
     (nil))

Although there are intervening instructions, the key thing to note
is that the first two instructions (46 and 52) initialize the
contents of D.3058, and the third instruction (57) fetches the
value of D.3058.phase.  This is a valid ordering.

Incorrect RTL: after instruction scheduling
-------------------------------------------

The file ia64-upc-vaddr-bug.upc.215r.mach dumps the RTL
*after* the selective scheduling pass has run.  Here, we
see that D.3058.phase is fetched *before* it is set by
the vector operation.

The following RTL is copied directly from the .mach dump file and
appears exactly in the order shown.

# r37 = D.3058.phase by indirecting through r45
# r45 points to D.3058.phase [r12 + 44]
#
# BUG: D.3058.phase has *not* been initialized at this point.
#

(insn:TI 57 507 506 4 (set (reg:DI 37 r37)
        (zero_extend:DI (mem/s/j/c:SI (reg/f:DI 45 r45 [524]) [0 VIEW_CONVERT_EXPR<struct upc_shared_ptr_t>(D.3060_12).phase+0 S4 A32]))) ia64-upc-vaddr-bug.upc:11 136 {zero_extendsidi2}
     (nil))

# Initialize {D.3058.thread = 0; D.3058.phase = 0} via a vector operation.
#
# BUG: this vector operation should precede the instruction that
# fetches D.3058.phase, but the instruction scheduler has incorrectly
# scheduled this vector assignment after the fetch.  It apparently
# did *not* notice that the memory vector beginning at r46 [r12 + 40] aliases
# both D.3058.thread and D.3058.phase and that r45 [r12 + 44] points
# to D.3058.phase, and therefore is being used to fetch
# the value of D.3058.phase.
#

(insn 52 501 17 4 (set (mem/s/c:V2SI (reg/f:DI 46 r46 [534]) [3 MEM[(struct upc_shared_ptr_t *)&D.3058 + 8B]+0 S8 A64])
        (reg:V2SI 36 r36 [536])) ia64-upc-vaddr-bug.upc:11 377 {*movv2si_internal}
     (nil))
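
The dependence that the scheduler appears to have missed can be stated as
a byte-range overlap: the V2SI store writes 8 bytes starting at r12 + 40,
and the SI load reads 4 bytes starting at r12 + 44.  A minimal sketch of
that check follows.  The function name byte_ranges_overlap is
hypothetical, and this is not how GCC's alias analysis is implemented; it
only spells out the arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns nonzero when the half-open byte ranges
   [off_a, off_a + size_a) and [off_b, off_b + size_b) share at least
   one byte.  Offsets are relative to the same base (here, r12).  */
int
byte_ranges_overlap (size_t off_a, size_t size_a,
                     size_t off_b, size_t size_b)
{
  return off_a < off_b + size_b && off_b < off_a + size_a;
}
```

With the offsets from the notes above, byte_ranges_overlap (40, 8, 44, 4)
is nonzero, so insn 57 has a true (read-after-write) dependence on insn 52.
By contrast, the vaddr store at offset 32 (8 bytes) overlaps neither access.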

