This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
non-optimal constructor eval, MOVE_BY_PIECES_P() misuse?
- From: "Gary Funck" <gary at Intrepid dot Com>
- To: "GCC List" <gcc at gcc dot gnu dot org>
- Date: Sat, 3 May 2003 11:36:40 -0700
- Subject: non-optimal constructor eval, MOVE_BY_PIECES_P() misuse?
Some work that I'm doing on GCC requires augmenting the representation of
pointers
so that they can generalize across distributed multi-processors. In some cases,
the
pointers are represented as a triple (p, t, v) where p = phase, t = thread, and
v = a virtual address. This code makes extensive use of constructors
internally,
and therefore the efficiency of the generated code depends in many cases how
well the
compiler handles constructors. This note describes a situation that I noticed
where
GCC (on a MIPS/SGI Irix target) generates lengthy code to do some
simple things. The code would be noticeably more efficient if the compiler made
the appropriate choice as to when it should output a 'static' constructor as
a constant definition in memory or not. Other languages, like Ada, might
benefit from
better handling of constructors as well.
----
Consider the following test program:
struct s {int p:24; int t:8; void *addr;} __attribute__ ((aligned (8)));
char c = 'a';
int global = 10;
struct s xs = {1, 1, &global};
main()
{
struct s t = {1, 1, &global};
}
----
On the MIPS/SGI target at default settings, the assignment statement above,
which assigns the constructor {1, 1, &global} to 't', generates the following
code,
ld $3,16($fp)
dli $2,0xffffffffff # 1099511627775
and $2,$3,$2
dli $3,0x10000000000 # 1099511627776
or $2,$2,$3
sd $2,16($fp)
li $2,1 # 0x1
sb $2,19($fp)
lw $2,%got_disp(global)($28)
sw $2,20($fp)
That's roughly 10 instructions and 6 memory accesses on a 64-bit processor
('sb' counts for 2). Let's call it roughly 16 cycles. But if we rewrite the
assignment as follows,
struct s t = xs;
The generated code is:
lw $2,%got_disp(xs)($28)
ld $2,0($2)
sd $2,16($fp)
That's 3 instructions, and roughly 6 cycles. Thus, the resulting code is almost
three times as efficient.
Delving into this a bit, the inefficiency is due to the decision that the
compiler made not store the constructor value into memory. The logic in expr.c
is:
7159 /* All elts simple constants => refer to a constant in memory.
But
7160 if this is a non-BLKmode mode, let it store a field at a time
7161 since that should make a CONST_INT or CONST_DOUBLE when we
7162 fold. Likewise, if we have a target we can use, it is best to
7163 store directly into the target unless the type is large enough
7164 that memcpy will be used. If we are making an initializer and
7165 all operands are constant, put it in memory as well.
7166
7167 FIXME: Avoid trying to fill vector constructors piece-meal.
7168 Output them with output_constant_def below unless we're sure
7169 they're zeros. This should go away when vector initializers
7170 are treated like VECTOR_CST instead of arrays.
7171 */
7172 else if ((TREE_STATIC (exp)
7173 && ((mode == BLKmode
7174 && ! (target != 0 && safe_from_p (target, exp,
1)))
7175 || TREE_ADDRESSABLE (exp)
7176 || (host_integerp (TYPE_SIZE_UNIT (type), 1)
7177 && (! MOVE_BY_PIECES_P
7178 (tree_low_cst (TYPE_SIZE_UNIT (type), 1),
7179 TYPE_ALIGN (type)))
7180 && ((TREE_CODE (type) == VECTOR_TYPE
7181 && !is_zeros_p (exp))
7182 || ! mostly_zeros_p (exp)))))
7183 || (modifier == EXPAND_INITIALIZER && TREE_CONSTANT
(exp)))
7184 {
7185 rtx constructor = output_constant_def (exp, 1);
----
In this example, the mode being used is DImode because the entire structure
fits in 64 bits,
so this part of the 'if' doesn't apply,
7173 && ((mode == BLKmode
7174 && ! (target != 0 && safe_from_p (target, exp,
1)))
The expr is designated as 'static', but not TREE_ADDRESSABLE, thus leaving only
the following
test to be applied:
7176 || (host_integerp (TYPE_SIZE_UNIT (type), 1)
7177 && (! MOVE_BY_PIECES_P
7178 (tree_low_cst (TYPE_SIZE_UNIT (type), 1),
7179 TYPE_ALIGN (type)))
7180 && ((TREE_CODE (type) == VECTOR_TYPE
7181 && !is_zeros_p (exp))
7182 || ! mostly_zeros_p (exp)))))
Since the underlying type is not a vector type, the test above simplifies to
checking
for !MOVE_BY_PIECES_P() && !mostly_zeros_p(). There seem to be two problems
(from a code
generation point of view) with this test:
1) the underlying assumption "... let it store a field at a time
since that should make a CONST_INT or CONST_DOUBLE when we fold",
doesn't appear to hold - because of the additional address value,
which has an inherent relocation. Also (noted below), the folding
seems to only occur at -O1 and above.
2) MOVE_BY_PIECES_P() is supposed to check if the approximate number of
instructions
required by a move is below a certain threshold, presumably because this
threshold
is below the approximate overhead required from calling an out-of-line
move routine.
In practice, the MOVE_BY_PIECES_P() check may be too weak, because it
seems to be
built around the assumption that the value will be folded into a single
scalar
quantity, but in the case above, it was not.
Here's the default MOVE_BY_PIECES_P() check:
#ifndef MOVE_RATIO
#if defined (HAVE_movstrqi) || defined (HAVE_movstrhi) || defined (HAVE_movstr
si) || defined (HAVE_movstrdi) || defined (HAVE_movstrti)
#define MOVE_RATIO 2
#else
/* If we are optimizing for space (-Os), cut down the default move ratio. */
#define MOVE_RATIO (optimize_size ? 3 : 15)
#endif
#endif
/* This macro is used to determine whether move_by_pieces should be called
to perform a structure copy. */
#ifndef MOVE_BY_PIECES_P
#define MOVE_BY_PIECES_P(SIZE, ALIGN) \
(move_by_pieces_ninsns (SIZE, ALIGN) < (unsigned int) MOVE_RATIO)
#endif
The result from move_by_pieces_ninsns() must be below the threshold,
MOVE_RATIO,
for this predicate to be true.
move_by_pieces_ninsns() says it does the following:
"Return number of insns required to move L bytes by pieces.
ALIGN (in bits) is maximum alignment we can assume."
Under the assumption that 'move' means to 'copy memory', it looked to
me that this function might be under-counting the actual (approximate)
number of instructions by a factor of 2 on a RISC architecture, since a move
requires a load, followed by a store. In the example above,
move_by_pieces_ninsns()
returns the value 1. This method of calculation might be offset somewhat by
properly setting MOVE_RATIO in a machine dependent way -- it is defined for
quite a few targets:
alpha arm avr c4x d30v h8300 i386 m68hc11 m68k
m68k mn10200 mn10300 ns32k pa v850
Here's what the i386 does:
/* If a memory-to-memory move would take MOVE_RATIO or more simple
move-instruction pairs, we will do a movstr or libcall instead.
Increasing the value will always make code faster, but eventually
incurs high cost in increased code size.
If you don't define this, a reasonable default is used. */
#define MOVE_RATIO (optimize_size ? 3 : ix86_cost->move_ratio)
The proper setting of MOVE_RATIO seems to depend a lot upon the underlying
assumptions of MOVE_BY_PIECES_P() and target-specific factors. I'm still
not certain that MOVE_BY_PIECES_P() is a good choice for determining whether
to output the constructor as a constant in memory or not. And
the assumption the data will be folded into a larger combined constructor is
critical to that determination.
If the example above is changed to use only integer fields, the compiler
will fold those initializations into one value, and then use this value to set
the constructor, as the comments indicate. If we change 'addr' above to an
'int',
and initialize 't' to the value {1, 2, 3}, we get code that looks like this:
dli $2,0x10200000003 # 1108101562371
sd $2,16($sp)
which isn't bad, and is somewhat better than
lw $2,%got_disp(xs)($28)
ld $2,0($2)
sd $2,16($fp)
Note, however, that this folding of the smaller constructor field values into a
larger
constant occurs *only at optimization level -O1 and higher*, and thus is not
the default
case. The default code looks like this for the all integer fields case:
ld $3,16($fp)
dli $2,0xffffffffff # 1099511627775
and $2,$3,$2
dli $3,0x10000000000 # 1099511627776
or $2,$2,$3
sd $2,16($fp)
li $2,2 # 0x2
sb $2,19($fp)
li $2,3 # 0x3
sw $2,20($fp)
----
Summarizing, the decision for whether to output the constructor as a memory
constant might
be improved by changing it to test for the following criteria:
1. is the optimization level greater than 0?
2. is folding possible/likely?
3. is MOVE_BY_PIECES_P() and its related MOVE_RATIO tuned properly for the
target,
(or is it even appropriate to use MOVE_BY_PIECES_P())?