{PATCH v3, rs6000] Replace X-form addressing with D-form addressing in new pass for Power9

Thu Oct 17 23:53:00 GMT 2019

Hi Kelvin,

On Wed, Oct 09, 2019 at 03:28:45PM -0500, Kelvin Nilsen wrote:
> This new pass scans existing rtl expressions and replaces them with rtl expressions that favor selection of the D-form instructions in contexts for which the D-form instructions are preferred.  The new pass runs after the RTL loop optimizations since loop unrolling often introduces opportunities for beneficial replacements of X-form addressing instructions.
> 
> For each of the new tests, multiple X-form instructions are replaced with D-form instructions, some addi instructions are replaced with add instructions, and some addi instructions are eliminated.  The typical improvement for the included tests is a decrease of 4.28% to 12.12% in the number of instructions executed on each iteration of the loop.  The optimization has not shown measurable improvement on specmark tests, presumably because the typical loops that are benefited by this optimization are memory bounded and this optimization does not eliminate memory loads or stores.  However, it is anticipated that multi-threaded workloads and measurements of total power and cooling costs for heavy server workloads would benefit.

My first question is, why did ivopts choose the suboptimal solution?
_Did_ it, or did something later mess things up?

This new pass can help us investigate that.  It certainly sounds like we
could do better earlier already.

I think it is a good design to make fixes late in the pass pipeline, *but*
we should try to make good choices earlier, too -- the "late tweaks" should
be just that, tweaks; 4%-12% is a bit much.

(It's not that super late here; but still, why does it help so much?)

> 2. Improved comments and added discussion of computational complexity.

It's really not good at all to have anything that is quadratic in the size
of the program, or the size of a function, or the size of a basic block:
there always show up real programs which then take approximately infinitely
long to compile.

If there are good arguments why some parameter can not be bigger than 100
in reality, or maybe even 1000, it is different of course; but things like
number of function, number of basic blocks, or number of instructions (per
function or bb or loop) are not naturally limited.

> 5. Refactored the code to divide into smaller functions and provide more descriptive commentary.

Many thanks for this :-)

> +   This pass replaces the above-matched sequences with:
> +
> +   Ai: derived_pointer = array_base + offset
> +       *(derived_pointer)
> +
> +   Aij: leave these alone.  expect that subsequent optimization deletes
> +        this code as it may become dead (since we don't use the
> +        indexing expression following our code transformations.)
> +
> +   Ai:
> +   *(derived_pointer + constant_i)
> +     (where constant_i equals sum of constant (n,j) for all n from 1
> +      to i paired with all j from 1 to Kn,

So if I understand this correctly, if the code is

  x0 = [base+8]
  x1 = [base]
  x2 = [base+16]

this pass will change it to

  p = base+8
  x0 = [p]
  x1 = [p-8]
  x2 = [p+8]

Should it always pick the first access as the new base pointer?  Should it
use the lowest offset instead?

(Maybe the code does something more advanced than picking the first; not
clear from this comment though).

> +class indexing_web_entry: public web_entry_base
> +{
> + public:
> +  rtx_insn *insn;		/* Pointer to the insn */
> +  basic_block bb;		/* Pointer to the enclosing basic block */

The rest of the fields have the comment before the declaration.  I would
just lose the comments here though: if something called "insn" is not an
insn, or something called "bb" is not a block, ... :-)

> +  /* A unique sequence number is assigned to each instruction for the
> +     purpose of simplifying domination tests.  Within each basic
> +     block, sequence numbers areassigned in strictly increasing order.
> +     Thus, for any two instructions known to reside in the same basic
> +     block, the instruction with a lower insn_sequence_no is kknown
> +     to dominate the instruction with a higher insn_sequence_no.  */
> +  unsigned int insn_sequence_no;

Many existing passes call this "luid" (for "local unique id").

(Typos: "are assigned", "known").

> +  /* If this insn is relevant, it is a load or store with a memory
> +     address that is comprised of a base pointer (e.g. the address of
> +     an array or array slice) and an index expression (e.g. an index
> +     within the array).  The original_base_use and original_index_use
> +     fields represent the numbers of the instructions that define the
> +     base and index values which are summed together with a constant
> +     value to determine the value of this instruction's memory
> +     address.  */
> +  unsigned int original_base_use;
> +  unsigned int original_index_use;

I wonder how you determine what is base and what is index?

(I'll review the rest later).

Segher