[Bug rtl-optimization/50489] [UPC/IA64] mis-schedule of MEM ref with -ftree-vectorize and -fschedule-insns2

gary at intrepid dot com gcc-bugzilla@gcc.gnu.org
Sun Sep 25 20:06:00 GMT 2011


--- Comment #6 from Gary Funck <gary at intrepid dot com> 2011-09-25 19:58:58 UTC ---
(In reply to comment #5)
>   D.3059_11 = VIEW_CONVERT_EXPR<shared [8] struct foo[1] *>(D.3058);
> looks like bogus IL to me.  You view D.3058, a struct of size 16, as
> a pointer (of size 8).  I suppose you want to load D.3058.vaddr here?
>   D.3060_12 = (shared [8] struct foo *) D.3059_11;
>   D.3061_13 = VIEW_CONVERT_EXPR<struct upc_shared_ptr_t>(D.3060_12).phase;
> looks bogus IL to me.  It views the pointer(!?) D.3060_12 as being a
> struct upc_shared_ptr_t and extracts a value that is not within that
> pointer.
> But maybe I'm missing something because I don't recognize that 'shared [8]'
> qualification.  [...]

The syntax (shared [8] struct foo *) above is unique to UPC.  This is a pointer
to a "shared" qualified object with a "blocking factor" (layout qualifier) of
8.  This type of pointer is called a "pointer-to-shared" (PTS) in the UPC
language definition; it is a pointer that can span nodes.  On a 64-bit machine,
using the "struct PTS" (as opposed to "packed PTS") representation, it is a
16-byte quantity.  Thus the casts back/forth between (shared *) and "struct
upc_shared_ptr_t" do not violate the size assumptions of VIEW_CONVERT_EXPR().
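
To illustrate the size argument, here is a minimal sketch of what a "struct
PTS" might look like on a 64-bit target.  The field names vaddr and phase
appear in the IL quoted above; the thread field, the field order, the field
widths, and the type name are assumptions made for illustration, not the
actual libupc definition.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout only: field order and widths are assumptions. */
typedef struct sketch_shared_ptr {
    uint64_t vaddr;   /* address within the owning thread's shared region */
    uint32_t thread;  /* thread that has affinity to the object */
    uint32_t phase;   /* position of the object within its block */
} sketch_shared_ptr_t;

/* The fat pointer occupies 16 bytes, twice the size of a plain 64-bit
   pointer, so viewing it as the representation struct preserves size. */
static_assert(sizeof(sketch_shared_ptr_t) == 16,
              "struct PTS sketch should be 16 bytes");
```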

The "blocking factor" (the [8] in "shared [8] *" above) is unique to UPC.  In
UPC, arrays are "block distributed".  This means that block 0 is on thread 0,
block 1 is on thread 1 and so on.  Thus, for a UPC program that is run with 2
threads, foo[0], foo[1] ... foo[7] are allocated on (have "affinity to") thread
0 and foo[8], foo[9] ... foo[13] are allocated on thread 1.  This blocking
factor provides for the ability to cast a pointer to a block of shared storage
into a regular "C" pointer (a "local" pointer) as long as the thread performing
the cast has affinity to the block.
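
The block distribution described above can be sketched in plain "C" as
follows.  The function names are hypothetical (they are not libupc entry
points); the arithmetic is the usual round-robin assignment of blocks to
threads.

```c
#include <stddef.h>

/* Thread with affinity to element i of an array declared
   "shared [block] T a[...]" when run with the given number of threads. */
size_t sketch_threadof(size_t i, size_t block, size_t threads)
{
    return (i / block) % threads;
}

/* Phase: position of element i within its block. */
size_t sketch_phaseof(size_t i, size_t block)
{
    return i % block;
}

/* Index of element i within the owning thread's local storage:
   one full block per completed round over all threads, plus the phase. */
size_t sketch_local_index(size_t i, size_t block, size_t threads)
{
    return (i / (block * threads)) * block + i % block;
}
```

With block = 8 and threads = 2, sketch_threadof() yields 0 for foo[0] through
foo[7] and 1 for foo[8] through foo[13], matching the description above; an
index of 16 would wrap back around to thread 0.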

What is potentially troublesome for the "middle end" tree optimizations and
"back end" RTL optimizations is that these pointers-to-shared (PTS's) are "fat"
pointers.  Note that after the lowering pass (performed in
upc/upc-genericize.c) there will be no *indirections* through a PTS.
Instead, indirections of a PTS in a value context will be converted into "get"
calls, which are implemented by the UPC runtime (libupc/smp).  Indirections
that are the targets of assignments are translated into "put" calls,
implemented by the UPC runtime. 
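
As a rough picture of what this lowering amounts to, the sketch below models
the shared regions of two threads as ordinary arrays and routes every access
through a "get" or a "put".  All names here (sketch_get, sketch_put,
sketch_pts_t, and so on) are hypothetical and chosen for illustration; they
are not the actual libupc entry points, and the real runtime does
considerably more.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_THREADS 2
#define SKETCH_BYTES_PER_THREAD 256

/* Stand-in for each thread's shared region (an assumption of this sketch). */
unsigned char sketch_shared_mem[SKETCH_THREADS][SKETCH_BYTES_PER_THREAD];

/* Simplified fat pointer; vaddr is a byte offset into the thread's region. */
typedef struct {
    size_t vaddr;
    unsigned thread;
    unsigned phase;
} sketch_pts_t;

/* Hypothetical "get": copy n bytes from shared storage into a local buffer. */
void sketch_get(void *dst, sketch_pts_t src, size_t n)
{
    memcpy(dst, &sketch_shared_mem[src.thread][src.vaddr], n);
}

/* Hypothetical "put": copy n bytes from a local buffer into shared storage. */
void sketch_put(sketch_pts_t dst, const void *src, size_t n)
{
    memcpy(&sketch_shared_mem[dst.thread][dst.vaddr], src, n);
}

/* "*p = v;" with p a pointer-to-shared becomes a "put" call. */
void sketch_store_int(sketch_pts_t p, int v)
{
    sketch_put(p, &v, sizeof v);
}

/* "v = *p;" with p a pointer-to-shared becomes a "get" call. */
int sketch_load_int(sketch_pts_t p)
{
    int v;
    sketch_get(&v, p, sizeof v);
    return v;
}
```

The point of the sketch is only that, after lowering, no load or store
dereferences the PTS itself; the PTS is passed by value to the runtime.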

The lowering pass also translates UPC pointer-to-shared arithmetic operations
into their equivalent operations which do not involve PTS's, but rather cast
the PTS's to their representation type (struct upc_shared_ptr_t) and then
operate on the component parts of the PTS.  As you can see from the description
of blocking factors above, the mapping of foo[i] to its (global) address
requires a fairly complex arrangement of division and modulo operations.
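
A simplified sketch of that arithmetic, operating only on the component parts
of the PTS, is shown below.  It handles non-negative increments only and
assumes that vaddr is a byte offset within the owning thread's shared region;
the type and function names are hypothetical, and this is not the code that
the lowering pass actually generates.

```c
#include <stddef.h>

/* Simplified fat pointer; field names and widths are assumptions. */
typedef struct {
    size_t vaddr;   /* byte offset within the owning thread's shared region */
    size_t thread;  /* thread with affinity to the element */
    size_t phase;   /* element position within the current block */
} sketch_fat_ptr_t;

/* Compute p + inc (inc >= 0) for a pointer with the given blocking factor,
   thread count and element size, using only division and modulo. */
sketch_fat_ptr_t sketch_pts_add(sketch_fat_ptr_t p, size_t inc,
                                size_t block, size_t threads,
                                size_t elem_size)
{
    size_t pos    = p.phase + inc;      /* position relative to block start */
    size_t blocks = pos / block;        /* whole blocks crossed */
    size_t tpos   = p.thread + blocks;  /* thread number before wrap-around */
    size_t rounds = tpos / threads;     /* wraps past the last thread */
    size_t start  = p.vaddr - p.phase * elem_size;  /* start of p's block */
    sketch_fat_ptr_t r;

    r.phase  = pos % block;
    r.thread = tpos % threads;
    /* Each wrap advances the local offset by one full block. */
    r.vaddr  = start + (rounds * block + r.phase) * elem_size;
    return r;
}
```

For example, with a blocking factor of 8, 2 threads and a (hypothetical)
16-byte element, adding 13 to a pointer at thread 0, phase 0, offset 0 gives
thread 1, phase 5, offset 80; adding 3 more wraps around to thread 0, phase 0,
offset 128.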

The libupc runtime is unique in that parts of it may be inlined.  Inlining of
the runtime is enabled at optimization levels greater than 0, or it can be
explicitly inlined/not-inlined via the -fupc-inline-lib switch.  The inlining
is accomplished via a pre-include of a runtime header file, implemented by the
"upc" driver.  Inlining is enabled in the test case documented in this bug
report.  Thus, a simple assignment statement involving array indexing of a UPC
shared "blocked" array expands into a rather complex assortment of tree code,
and generated RTL.  (This complexity makes it difficult to create an equivalent
"C" test case.)

After lowering, any references to "shared *" (pointers-to-shared) should only
occur in casts to/from the representation type and in moves/copies of the PTS
container.  We have run into a few places where the middle end makes some
assumptions about regular pointers and tries to apply those assumptions to a
UPC pointer-to-shared; we have been able to exclude PTS's by adding additional
checks for them -- there are not many places that we have had to do this. 
Perhaps that sort of pointer-specific logic is kicking in here.

Arguably, the UPC lowering pass should fully lower PTS typed expressions, so
that they don't end up in the tree.  Potentially, a PTS hanging around in the
tree doesn't meet the strict (or even not-so-strict) definition of GENERIC. 
Fully lowering those expressions is on our "to do" list.  When we do that,
rather than using casts, we will likely rewrite the PTS type references into
references to the PTS representation type.  We have shied away from this
because it makes the resulting tree code even more difficult to follow: it
loses the logical correspondence to the original "C" source statements.

That said, this technique of casting a PTS to its representation type and then
extracting its sub-parts has been working for quite a while on several
different target architectures.  However, maybe this recast of a
pointer-to-shared is confusing the post-reload instruction scheduler and/or the
logic that creates the MEM_REF.

We would like to find a way to make the current lowering pass approach work,
because it does work in many contexts, and doing so will allow us to make
forward progress without putting the lowering pass re-work on the critical
path.  Also, we do not know that the presence of a PTS-typed node in the tree
is actually the cause of the problem.
