This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

non-optimal constructor eval, MOVE_BY_PIECES_P() misuse?


Some work that I'm doing on GCC requires augmenting the representation of
pointers
so that they can generalize across distributed multi-processors. In some cases,
the
pointers are represented as a triple (p, t, v) where p = phase, t = thread, and
v = a virtual address. This code makes extensive use of constructors
internally,
and therefore the efficiency of the generated code depends in many cases how
well the
compiler handles constructors. This note describes a situation that I noticed
where
GCC (on a MIPS/SGI Irix target) generates lengthy code to do some
simple things. The code would be noticeably more efficient if the compiler made
the appropriate choice as to when it should output a 'static' constructor as
a constant definition in memory or not. Other languages, like Ada, might
benefit from
better handling of constructors as well.

----

Consider the following test program:

struct s {int p:24; int t:8;  void *addr;} __attribute__ ((aligned (8)));

char c = 'a';
int global = 10;

struct s xs = {1, 1, &global};

main()
{
  struct s t = {1, 1, &global};
}

----

On the MIPS/SGI target at default settings, the assignment statement above,
which assigns the constructor  {1, 1, &global} to 't', generates the following
code,

        ld      $3,16($fp)
        dli     $2,0xffffffffff         # 1099511627775
        and     $2,$3,$2
        dli     $3,0x10000000000                # 1099511627776
        or      $2,$2,$3
        sd      $2,16($fp)
        li      $2,1                    # 0x1
        sb      $2,19($fp)
        lw      $2,%got_disp(global)($28)
        sw      $2,20($fp)

That's roughly 10 instructions and 6 memory accesses on a 64-bit processor
('sb' counts for 2). Let's call it roughly 16 cycles. But if we rewrite the
assignment as follows,

  struct s t = xs;

The generated code is:

        lw      $2,%got_disp(xs)($28)
        ld      $2,0($2)
        sd      $2,16($fp)

That's 3 instructions, and roughly 6 cycles. Thus, the resulting code is almost
three times as efficient.

Delving into this a bit, the inefficiency is due to the decision that the
compiler made not store the constructor value into memory. The logic in expr.c
is:

  7159        /* All elts simple constants => refer to a constant in memory.
But
  7160           if this is a non-BLKmode mode, let it store a field at a time
  7161           since that should make a CONST_INT or CONST_DOUBLE when we
  7162           fold.  Likewise, if we have a target we can use, it is best to
  7163           store directly into the target unless the type is large enough
  7164           that memcpy will be used.  If we are making an initializer and
  7165           all operands are constant, put it in memory as well.
  7166
  7167          FIXME: Avoid trying to fill vector constructors piece-meal.
  7168          Output them with output_constant_def below unless we're sure
  7169          they're zeros.  This should go away when vector initializers
  7170          are treated like VECTOR_CST instead of arrays.
  7171        */
  7172        else if ((TREE_STATIC (exp)
  7173                  && ((mode == BLKmode
  7174                       && ! (target != 0 && safe_from_p (target, exp,
1)))
  7175                      || TREE_ADDRESSABLE (exp)
  7176                      || (host_integerp (TYPE_SIZE_UNIT (type), 1)
  7177                          && (! MOVE_BY_PIECES_P
  7178                              (tree_low_cst (TYPE_SIZE_UNIT (type), 1),
  7179                               TYPE_ALIGN (type)))
  7180                          && ((TREE_CODE (type) == VECTOR_TYPE
  7181                               && !is_zeros_p (exp))
  7182                              || ! mostly_zeros_p (exp)))))
  7183                 || (modifier == EXPAND_INITIALIZER && TREE_CONSTANT
(exp)))
  7184          {
  7185            rtx constructor = output_constant_def (exp, 1);

----

In this example, the mode being used is DImode because the entire structure
fits in 64 bits,
so this part of the 'if' doesn't apply,

  7173                  && ((mode == BLKmode
  7174                       && ! (target != 0 && safe_from_p (target, exp,
1)))

The expr is designated as 'static', but not TREE_ADDRESSABLE, thus leaving only
the following
test to be applied:

  7176                      || (host_integerp (TYPE_SIZE_UNIT (type), 1)
  7177                          && (! MOVE_BY_PIECES_P
  7178                              (tree_low_cst (TYPE_SIZE_UNIT (type), 1),
  7179                               TYPE_ALIGN (type)))
  7180                          && ((TREE_CODE (type) == VECTOR_TYPE
  7181                               && !is_zeros_p (exp))
  7182                              || ! mostly_zeros_p (exp)))))

Since the underlying type is not a vector type, the test above simplifies to
checking
for !MOVE_BY_PIECES_P() && !mostly_zeros_p(). There seem to be two problems
(from a code
generation point of view) with this test:

  1) the underlying assumption "... let it store a field at a time
     since that should make a CONST_INT or CONST_DOUBLE when we fold",
     doesn't appear to hold - because of the additional address value,
     which has an inherent relocation. Also (noted below), the folding
     seems to only occur at -O1 and above.

  2) MOVE_BY_PIECES_P() is supposed to check if the approximate number of
instructions
     required by a move is below a certain threshold, presumably because this
threshold
     is below the approximate overhead required from calling an out-of-line
move routine.
     In practice, the MOVE_BY_PIECES_P() check may be too weak, because it
seems to be
     built around the assumption that the value will be folded into a single
scalar
     quantity, but in the case above, it was not.

Here's the default MOVE_BY_PIECES_P() check:

#ifndef MOVE_RATIO
#if defined (HAVE_movstrqi) || defined (HAVE_movstrhi) || defined (HAVE_movstr
si) || defined (HAVE_movstrdi) || defined (HAVE_movstrti)
#define MOVE_RATIO 2
#else
/* If we are optimizing for space (-Os), cut down the default move ratio.  */
#define MOVE_RATIO (optimize_size ? 3 : 15)
#endif
#endif

/* This macro is used to determine whether move_by_pieces should be called
   to perform a structure copy.  */
#ifndef MOVE_BY_PIECES_P
#define MOVE_BY_PIECES_P(SIZE, ALIGN) \
  (move_by_pieces_ninsns (SIZE, ALIGN) < (unsigned int) MOVE_RATIO)
#endif

The result from move_by_pieces_ninsns() must be below the threshold,
MOVE_RATIO,
for this predicate to be true.

move_by_pieces_ninsns() says it does the following:
  "Return number of insns required to move L bytes by pieces.
  ALIGN (in bits) is maximum alignment we can assume."

Under the assumption that 'move' means to 'copy memory', it looked to
me that this function might be under-counting the actual (approximate)
number of instructions by a factor of 2 on a RISC architecture, since a move
requires a load, followed by a store. In the example above,
move_by_pieces_ninsns()
returns the value 1.  This method of calculation might be offset somewhat by
properly setting MOVE_RATIO in a machine dependent way -- it is defined for
quite a few targets:

 alpha arm avr c4x d30v h8300 i386 m68hc11 m68k
 m68k mn10200 mn10300 ns32k pa v850

Here's what the i386 does:

/* If a memory-to-memory move would take MOVE_RATIO or more simple
   move-instruction pairs, we will do a movstr or libcall instead.
   Increasing the value will always make code faster, but eventually
   incurs high cost in increased code size.

   If you don't define this, a reasonable default is used.  */

#define MOVE_RATIO (optimize_size ? 3 : ix86_cost->move_ratio)

The proper setting of MOVE_RATIO seems to depend a lot upon the underlying
assumptions of MOVE_BY_PIECES_P() and target-specific factors.  I'm still
not certain that MOVE_BY_PIECES_P() is a good choice for determining whether
to output the constructor as a constant in memory or not. And
the assumption the data will be folded into a larger combined constructor is
critical to that determination.

If the example above is changed to use only integer fields, the compiler
will fold those initializations into one value, and then use this value to set
the constructor, as the comments indicate. If we change 'addr' above to an
'int',
and initialize 't' to the value {1, 2, 3}, we get code that looks like this:

        dli     $2,0x10200000003                # 1108101562371
        sd      $2,16($sp)

which isn't bad, and is somewhat better than

        lw      $2,%got_disp(xs)($28)
        ld      $2,0($2)
        sd      $2,16($fp)

Note, however, that this folding of the smaller constructor field values into a
larger
constant occurs *only at optimization level -O1 and higher*, and thus is not
the default
case. The default code looks like this for the all integer fields case:

        ld      $3,16($fp)
        dli     $2,0xffffffffff         # 1099511627775
        and     $2,$3,$2
        dli     $3,0x10000000000                # 1099511627776
        or      $2,$2,$3
        sd      $2,16($fp)
        li      $2,2                    # 0x2
        sb      $2,19($fp)
        li      $2,3                    # 0x3
        sw      $2,20($fp)

----

Summarizing, the decision for whether to output the constructor as a memory
constant might
be improved by changing it to test for the following criteria:

  1. is the optimization level greater than 0?
  2. is folding possible/likely?
  3. is MOVE_BY_PIECES_P() and its related MOVE_RATIO tuned properly for the
target,
     (or is it even appropriate to use MOVE_BY_PIECES_P())?





Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]