This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Add -fsection-anchors (4.2 project)


This is the main part for the 4.2 section anchor project:

    http://gcc.gnu.org/wiki/Section%20Anchor%20Optimisations

The idea is to introduce anchor symbols that can be used to access
several nearby objects.  For example, if we have:

    static int a, b, c;
    int foo (void) { return a + b + c; }

gcc will normally perform separate symbolic address calculations for
"a", "b" and "c".  The idea of this patch is to introduce a new anchor
symbol and access "a", "b" and "c" relative to that anchor.

The main motivation is to reduce GOT size, but the patch also has
the potential to make some code faster.  SPEC results are attached
and described below.


Managing the relative positions of objects
==========================================

Introduction
------------

We can only access two objects from the same anchor if we know how far
apart those two objects are.  gcc doesn't really have any infrastructure
for this at the moment; it simply writes out each static object in
isolation.

A great deal of the patch is therefore about allowing objects to
be grouped together so that their relative positions are fixed.
The aim has been to make this infrastructure as general as possible,
so that it could be used for other optimisations besides section
anchors.

Another goal was to support the placement of every kind of static
data -- decls, tree constants, and rtx constants -- and to allow
these different kinds of data to be grouped together in the same
block.  With that constraint, the best approach seemed to be to
group objects based on their SYMBOL_REF.

One of the main design decisions here was: when should we decide whether
to put a SYMBOL_REF in a block, rather than treat it as a stand-alone
entity?  The ideal answer might seem to be "whenever we know we'll need
positional information".  Unfortunately, in the case of section anchors,
that will be part way through compiling a function, and by that time we
might already have written the object out.  (Note that this is true even
in unit-at-a-time mode.)  The approach I took was therefore to put objects
into blocks if (1) we knew enough information about them to do so and
(2) the current compilation mode might make use of positional information.

In other words, we now have a compilation mode in which objects are
grouped together into blocks whenever possible.  This mode is selected
if one of the active optimisations might find it useful.


Creating block symbols
----------------------

Block symbols (i.e., symbols that are put into blocks) can be created in
three places: make_decl_rtl (for data decls), build_constant_desc (for
tree constants) and force_const_mem (for rtx constants).  A new function
-- use_blocks_for_decl_p -- says whether a decl can be put into a block,
while a new target hook -- use_blocks_for_constant_p -- says the same
about rtx constants.  Tree constants can always be grouped into blocks.

We often don't know at this stage whether the object will be needed or
not.  For example, local data decls (C statics) might be removed or
constants might not be marked.  A block symbol therefore starts with
an offset of -1 to indicate that its position in the block is not
yet known.


Writing out block symbols
-------------------------

The corresponding assembly output routines are assemble_variable
(for data decls), output_constant_def_contents (for tree constants)
and output_constant_pool (for rtx constants).  The patch makes sure
these functions do not emit the definitions of block symbols; instead,
they simply make sure that the symbols have been assigned a position.
The whole block is then written out at the end of compilation by a new
function called "output_object_blocks".


Placing block symbols
---------------------

The main function for placing an object is place_block_symbol.
This function is called by the output functions listed above,
and by any client optimisation that wants to know the position
of an unplaced object.


Data structures
---------------

The patch introduces a new structure called "object_block" for
representing a group of objects.  The block is associated with
a particular section and stores its contents as a sorted vector
of SYMBOL_REFs.

The main data structure decision here is: how should we represent
a SYMBOL_REF's position within a block?  Two considerations are:

  (1) We don't want to allocate any extra data for symbols that
      aren't going to be put in blocks.

  (2) Every object we decide to put into a block will have this
      information, and it will live as long as the object itself does.

(1) means we can't unconditionally grow the SYMBOL_REF rtx.
If the new compilation mode is not selected, the extra memory will
never be used at all, while if the mode _is_ selected, the memory will
only be useful for a subset of symbols.

IMO, (2) argues against the use of hash tables.  Hash tables would have
an overhead of at least two pointers per symbol compared to something
directly attached to the SYMBOL_REF; there would be one pointer for the
hash table chain and one to identify the symbol.  There's no temporal
advantage either since the table entry will live as long as the symbol
does.

Because we're deciding up-front whether or not to put objects in blocks,
we know when creating a SYMBOL_REF whether we want the extra information.
I therefore created an extended SYMBOL_REF structure called block_symbol:

------------------------------------------------------------------------
/* This structure remembers the position of a SYMBOL_REF within an
   object_block structure.  A SYMBOL_REF only provides this information
   if SYMBOL_REF_IN_BLOCK_P is true.  */
struct block_symbol GTY(()) {
  /* The usual SYMBOL_REF fields.  */
  rtunion GTY ((skip)) fld[3];

  /* The block that contains this object.  */
  struct object_block *block;

  /* The offset of this object from the start of its block.  It is negative
     if the symbol has not yet been assigned an offset.  */
  HOST_WIDE_INT offset;
};
------------------------------------------------------------------------

and added block_symbol to the rtx union.  A new SYMBOL_REF flag called
SYMBOL_FLAG_IN_BLOCK indicates whether this information is available.

[ The usual way of adding data to an rtx is to add new rtunion fields,
  so using a structure here might seem a little odd.  The problem is that
  rtunion no longer has a HOST_WIDE_INT field that we can use for the
  offset: it used to, but it led to unnecessarily-bloated rtxes on
  ILP32 hosts compiling for need_64bit_hwint targets. ]

With this approach, the memory overhead per block symbol is one pointer
(the block) and one HOST_WIDE_INT (the offset within the block).  The
patch adds two new accessors, SYMBOL_REF_BLOCK and SYMBOL_REF_BLOCK_OFFSET,
for getting this information.

One potentially controversial aspect of this approach is that the size
of an rtx no longer depends entirely on its code.  SYMBOL_REFs with
SYMBOL_FLAG_IN_BLOCK set are larger than those without.  This led to
the following changes:

  - RTX_SIZE becomes RTX_CODE_SIZE, to emphasise that it provides
    the base size for a particular code.

  - A new function, rtx_size, provides the size of an existing rtx.

  - shallow_copy_rtx uses rtx_size (x) instead of RTX_SIZE (GET_CODE (x)).

  - Code that allocates new rtxes uses RTX_CODE_SIZE instead of RTX_SIZE.

  - Code that copies old rtxes uses shallow_copy_rtx instead of inline
    copies.

This is pretty trivial, and I see the last point as a clean-up.
I've split out these changes to ease review, although rtx_size's use
of SYMBOL_REF_IN_BLOCK_P means that it isn't a stand-alone patch.

Also, the garbage collector must see the new SYMBOL_REF fields iff
SYMBOL_FLAG_IN_BLOCK is true.  This too is pretty trivial.


A note on rtx constant pools
----------------------------

We currently maintain a separate constant pool for each function.
There's not really any point doing this if use_blocks_for_constant_p
returns true; if the constant is going to be written out at the end of
compilation anyway, we might as well allow it to be shared between
functions.

However, the same is sometimes true even if !use_blocks_for_constant_p.
E.g. powerpc TOC entries can be reused by different functions, even though
we'd never want to put them into blocks.  The powerpc backend gets around
the current per-function pools by using aliases for duplicate TOC entries.

I plan to submit a follow-on patch that adds a shared constant pool
and that gives backends a chance to choose whether a constant goes
in this shared pool or in the function-local one.


Control of section anchors
==========================

The use of section anchors is controlled by a new switch,
-fsection-anchors.  It isn't activated by any of the -O options
since its benefit is so target-dependent.  If it does turn out
to be a consistent win for some configurations, I think the right
thing to do would be to set flag_section_anchors in the backend's
OPTIMIZATION_OPTIONS.

There are some new hooks for controlling the use of section anchors:

  /* Output the definition of a section anchor.  */
  void (*output_anchor) (rtx);

  /* The minimum and maximum byte offsets for anchored addresses.  */
  HOST_WIDE_INT min_anchor_offset;
  HOST_WIDE_INT max_anchor_offset;

  /* True if section anchors can be used to access the given symbol.  */
  bool (* use_anchors_for_symbol_p) (rtx);

The texinfo documentation describes these hooks in more detail.


Using anchored addresses
========================

A new function, use_anchored_address, converts a MEM whose address
is a block symbol into a MEM whose address uses an anchor.  The rtl
expanders call this function after reading a DECL_RTL, after calling
force_const_to_mem, or after calling output_constant_def.

Most addresses are passed through validate_mem, which acts as a
convenient (and, I hope, logical) point to call use_anchored_address.
Only a few places need to call use_anchored_address directly.

If an anchor A is used several times in a function, we would prefer
to calculate A's address once, store it in a register, and reuse that
register for all accesses involving A.  (It probably isn't worth using
-fsection-anchors on targets where this isn't true.)

However, if A is only used once in a function, and is used for a symbol
at offset X from the anchor, some targets may prefer to allow CSE or
combine to create an address of the form:

   (const (plus (symbol_ref A) (const_int X))).

Other targets (such as powerpc with -mminimal-toc) might not.

use_anchored_address therefore forces the anchor into a register,
but makes no attempt to hoist the register initialisation code,
or to reuse registers between accesses.  The rtl optimisers can then
decide whether it's worth hoisting and reusing the registers, just
as they would for ordinary symbolic accesses.


Aliasing
========

I was worried that using anchors might adversely affect the alias.c
base-analysis code.  If we have indirect accesses like the following:

    int x[4], y[4];

                            r := &anchor
    int *r1 = x;    ---->   r1 := r + (x - &anchor);
    int *r2 = y;    ---->   r2 := r + (y - &anchor);

    ...indirect accesses to x and y using r1 and r2...

we can no longer tell that accesses based on r1 are always to "x" and
those based on r2 are always to "y".

[ Note 1: I'm only talking about indirect accesses here.  Direct accesses
  aren't a problem, because the MEM_ATTRs provide the information we need.

  Note 2: We can't use the offsets to work out the underlying object.
  Consider accesses like:

      r1[10000 - i]

  If r1 is only used once, we might rewrite this access to use
  the rtl equivalent of:

      r3 := r + (x - anchor) + 10000
      r3[-i]

  But r3 has no relation to the variable at offset (x - anchor) + 10000.
  
  It would be difficult to stop this sort of transformation from
  happening, even if we wanted to. ]

Fortunately, this doesn't seem to have much effect on SPEC, and hasn't
had a noticeable impact in the examples I've looked at in detail.
Perhaps it makes more of a difference on targets with no offset addressing?


Other random points
===================

- The tree-ssa-loop-ivopts.c change was discussed here:
  http://gcc.gnu.org/ml/gcc/2005-09/msg00850.html

- I moved the rtx vector definitions to rtl.h so that the new
  structures in rtl.h can use them.

- The idea of writing out blocks of objects at the end of compilation is
  really doing things unit-at-a-time.  -fsection-anchors is therefore
  conditional on -funit-at-a-time.

- Only the powerpc backend uses section anchors at the moment.
  (I have plans to add MIPS at some point.)


SPEC results
============

I ran SPEC on powerpc64 with the options:

  -mcpu=970 -O3 -fomit-frame-pointer -fno-reorder-blocks

(-fno-reorder-blocks because block reordering gives worse code
in several cases).  The base results had "-fno-section-anchors"
and the peak ones had "-fsection-anchors".

As I said at the beginning, the main aim was to reduce GOT size,
so the results for that are as follows:

-----------------------------------------------------------------------
.got size in bytes

Benchmark                Before      After   Delta   After/Before
=========                ======      =====   =====   ============
168.wupwise                 944        392    -552    41.53%
171.swim                    328        272     -56    82.93%
172.mgrid                   352        264     -88    75.00%
173.applu                  1104        912    -192    82.61%
177.mesa                  11240      10984    -256    97.72%
178.galgel                 5264       3456   -1808    65.65%
179.art                     824        592    -232    71.84%
183.equake                 1192        920    -272    77.18%
187.facerec                1216        520    -696    42.76%
188.ammp                   5096       4880    -216    95.76%
189.lucas                   688        264    -424    38.37%
191.fma3d                 27888      10672  -17216    38.27%
200.sixtrack              53376      20120  -33256    37.69%
301.apsi                   2576       1608    -968    62.42%
164.gzip                   2688       1968    -720    73.21%
175.vpr                    7888       7400    -488    93.81%
176.gcc                   43376      33688   -9688    77.67%
181.mcf                     456        408     -48    89.47%
186.crafty                16872      16360    -512    96.97%
197.parser                 6008       4864   -1144    80.96%
252.eon                   25296      10112  -15184    39.97%
253.perlbmk               21056      20800    -256    98.78%
254.gap                   22848      21104   -1744    92.37%
255.vortex                25848      23216   -2632    89.82%
256.bzip2                  1448       1120    -328    77.35%
300.twolf                 14208      14096    -112    99.21%

Total                    300080     210992  -89088    70.31%
-----------------------------------------------------------------------

The actual SPEC results are attached below, but the summary is as
follows (the columns are reference time, run time and SPEC ratio,
first for base, then for peak):

-----------------------------------------------------------------------
   164.gzip          1400     187         747*     1400     184         762*
   175.vpr           1400     303         462*     1400     303         462*
   176.gcc           1100     138         800*     1100     127         868*
   181.mcf           1800     540         334*     1800     540         334*
   186.crafty        1000      85.3      1173*     1000      84.9      1178*
   197.parser        1800     320         563*     1800     328         550*
   252.eon           1300     102        1272*     1300      95.4      1363*
   253.perlbmk       1800     264         683*     1800     259         696*
   254.gap           1100     150         732*     1100     156         705*
   255.vortex        1900     231         822*     1900     225         843*
   256.bzip2         1500     248         606*     1500     242         619*
   300.twolf         3000     602         499*     3000     602         498*
   SPECint_base2000                       679
   SPECint2000                                                          689

   168.wupwise       1600       154      1042*     1600       157      1020*
   171.swim          3100      1174       264*     3100      1173       264*
   172.mgrid         1800       312       576*     1800       312       577*
   173.applu         2100       286       733*     2100       286       734*
   177.mesa          1400       141       993*     1400       138      1016*
   178.galgel        2900       230      1262*     2900       231      1254*
   179.art           2600       401       648*     2600       400       650*
   183.equake        1300       128      1014*     1300       127      1024*
   187.facerec       1900       234       811*     1900       237       800*
   188.ammp          2200       558       394*     2200       559       394*
   189.lucas         2000       240       833*     2000       241       830*
   191.fma3d         2100       219       959*     2100       223       943*
   200.sixtrack      1100       211       520*     1100       211       521*
   301.apsi          2600       453       574*     2600       453       574*
   SPECfp_base2000                        704
   SPECfp2000                                                           703
-----------------------------------------------------------------------

Testing CSiBE with -Os -fno-section-anchors vs. -Os -fsection-anchors
suggests it's a minor size win for -Os.  Again, the comparison is
attached below.


Testing
=======

Bootstrapped & regression tested on powerpc64-linux-gnu:

  - once as-is
  - once with -fsection-anchors enabled by default and with normal
    checking options
  - once with -fsection-anchors enabled by default and with rtl checking
    enabled (i.e. --enable-checking=yes,rtl)

The only extra failures were in objc:

    FAIL: objc/execute/class-13.m compilation
    FAIL: objc/execute/class-6.m compilation
    FAIL: objc/execute/object_is_class.m compilation
    FAIL: objc/execute/object_is_meta_class.m compilation

(options snipped).  These failures are caused by objc creating two decls
for _OBJC_METACLASS_* objects.  I think it would be better to modify the
old decl once the initialiser is known, but I don't know enough about
objc to do that.

Also bootstrapped & regression tested on i686-linux-gnu.  OK for trunk?

Richard


	* cselib.c (cselib_init): Change RTX_SIZE to RTX_CODE_SIZE.
	* emit-rtl.c (copy_rtx_if_shared_1): Use shallow_copy_rtx.
	(copy_insn_1): Likewise.  Don't copy each field individually.
	Reindent.
	* read-rtl.c (apply_macro_to_rtx): Use RTX_CODE_SIZE instead
	of RTX_SIZE.
	* reload1.c (eliminate_regs): Use shallow_copy_rtx.
	* rtl.c (rtx_size): Rename variable to...
	(rtx_code_size): ...this.
	(rtx_size): New function.
	(rtx_alloc_stat): Use RTX_CODE_SIZE instead of RTX_SIZE.
	(copy_rtx): Use shallow_copy_rtx.  Don't copy each field individually.
	Reindent.
	(shallow_copy_rtx_stat): Use rtx_size instead of RTX_SIZE.
	* rtl.h (rtx_code_size): New variable.
	(rtx_size): Change from a variable to a function.
	(RTX_SIZE): Rename to...
	(RTX_CODE_SIZE): ...this.

Attachment: rtx-code-size.diff
Description: Text document

	* doc/tm.texi (TARGET_USE_BLOCKS_FOR_CONSTANT_P): Document.
	(Anchored Addresses): New section.
	* doc/invoke.texi (-fsection-anchors): Document.
	* doc/rtl.texi (SYMBOL_REF_IN_BLOCK_P, SYMBOL_FLAG_IN_BLOCK): Likewise.
	(SYMBOL_REF_ANCHOR_P, SYMBOL_FLAG_ANCHOR): Likewise.
	(SYMBOL_REF_BLOCK, SYMBOL_REF_BLOCK_OFFSET): Likewise.
	* hooks.c (hook_bool_mode_rtx_false): New function.
	* hooks.h (hook_bool_mode_rtx_false): Declare.
	* gengtype.c (create_optional_field): New function.
	(adjust_field_rtx_def): Add the "block_sym" field for SYMBOL_REFs when
	SYMBOL_REF_IN_BLOCK_P is true.
	* target.h (output_anchor, use_blocks_for_constant_p): New hooks.
	(min_anchor_offset, max_anchor_offset): Likewise.
	(use_anchors_for_symbol_p): New hook.
	* toplev.c (compile_file): Call output_object_blocks.
	(target_supports_section_anchors_p): New function.
	(process_options): Check that -fsection-anchors is only used on
	targets that support it and when -funit-at-a-time is in effect.
	* tree-ssa-loop-ivopts.c (prepare_decl_rtl): Only create DECL_RTL
	if the decl doesn't have one.
	* dwarf2out.c: Remove instantiations of VEC(rtx,gc).
	* expr.c (emit_move_multi_word, emit_move_insn): Pass the result
	of force_const_mem through use_anchored_address.
	(expand_expr_constant): New function.
	(expand_expr_addr_expr_1): Call it.  Use the same modifier when
	calling expand_expr for INDIRECT_REF.
	(expand_expr_real_1): Pass DECL_RTL through use_anchored_address
	for all modifiers except EXPAND_INITIALIZER.  Use expand_expr_constant.
	* expr.h (use_anchored_address): Declare.
	* loop-unroll.c: Don't declare rtx vectors here.
	* explow.c: Include output.h.
	(validize_mem): Call use_anchored_address.
	(use_anchored_address): New function.
	* common.opt (-fsection-anchors): New switch.
	* varasm.c (object_block_htab, anchor_labelno): New variables.
	(hash_section, object_block_entry_eq, object_block_entry_hash)
	(use_object_blocks_p, get_block_for_section, create_block_symbol)
	(use_blocks_for_decl_p, change_symbol_section): New functions.
	(get_variable_section): New function, split out from assemble_variable.
	(make_decl_rtl): Create a block symbol if use_object_blocks_p and
	use_blocks_for_decl_p say so.  Use change_symbol_section if the
	symbol has already been created.
	(assemble_variable_contents): New function, split out from...
	(assemble_variable): ...here.  Don't output any code for
	block symbols; just pass them to place_block_symbol.
	Use get_variable_section and assemble_variable_contents.
	(get_constant_alignment, get_constant_section, get_constant_size): New
	functions, split from output_constant_def_contents.
	(build_constant_desc): Create a block symbol if use_object_blocks_p
	says so.  Or into SYMBOL_REF_FLAGS.
	(assemble_constant_contents): New function, split from...
	(output_constant_def_contents): ...here.  Don't output any code
	for block symbols; just pass them to place_block_symbol.
	Use get_constant_section and get_constant_alignment.
	(force_const_mem): Create a block symbol if use_object_blocks_p and
	use_blocks_for_constant_p say so.  Or into SYMBOL_REF_FLAGS.
	(output_constant_pool_1): Add an explicit alignment argument.
	Don't switch sections here.
	(output_constant_pool): Adjust call to output_constant_pool_1.
	Switch sections here instead.  Don't output anything for block symbols;
	just pass them to place_block_symbol.
	(init_varasm_once): Initialize object_block_htab.
	(default_encode_section_info): Keep the old SYMBOL_FLAG_IN_BLOCK.
	(default_asm_output_anchor, default_use_anchors_for_symbol_p)
	(place_block_symbol, get_section_anchor, output_object_block)
	(output_object_block_htab, output_object_blocks): New functions.
	* target-def.h (TARGET_ASM_OUTPUT_ANCHOR): New macro.
	(TARGET_ASM_OUT): Include it.
	(TARGET_USE_BLOCKS_FOR_CONSTANT_P): New macro.
	(TARGET_MIN_ANCHOR_OFFSET, TARGET_MAX_ANCHOR_OFFSET): New macros.
	(TARGET_USE_ANCHORS_FOR_SYMBOL_P): New macro.
	(TARGET_INITIALIZER): Include them.
	* rtl.c (rtl_check_failed_block_symbol): New function.
	* rtl.h: Include vec.h.  Declare heap and gc rtx vectors.
	(block_symbol, object_block): New structures.
	(rtx_def): Add a block_symbol field to the union.
	(BLOCK_SYMBOL_CHECK): New macro.
	(rtl_check_failed_block_symbol): Declare.
	(SYMBOL_FLAG_IN_BLOCK, SYMBOL_FLAG_ANCHOR): New SYMBOL_REF flags.
	(SYMBOL_REF_IN_BLOCK_P, SYMBOL_REF_ANCHOR_P): New predicates.
	(SYMBOL_FLAG_MACH_DEP_SHIFT): Bump by 2.
	(SYMBOL_REF_BLOCK, SYMBOL_REF_BLOCK_OFFSET): New accessors.
	* output.h (output_object_blocks): Declare.
	(object_block): Name structure.
	(place_block_symbol, get_section_anchor, default_asm_output_anchor)
	(default_use_anchors_for_symbol_p): Declare.
	* Makefile.in (RTL_BASE_H): Add vec.h.
	(explow.o): Depend on output.h.
	* config/rs6000/rs6000.c (TARGET_MIN_ANCHOR_OFFSET): Override default.
	(TARGET_MAX_ANCHOR_OFFSET): Likewise.
	(TARGET_USE_BLOCKS_FOR_CONSTANT_P): Likewise.
	(rs6000_use_blocks_for_constant_p): New function.

Attachment: section-anchor.diff
Description: Text document

Attachment: CINT2000.txt
Description: Text document

Attachment: CFP2000.txt
Description: Text document

Attachment: CSiBE.txt
Description: Text document

