This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH] PR18002: Undo fold_single_bit_test in do_jump


Hi Roger, thanks for your efforts and explanations (annotations below).

> From: Roger Sayle <roger@eyesopen.com>
> Hi Paul,
> On Thu, 9 Dec 2004, Paul Schlie wrote:
>> Thanks for the clarification of context, but the problem introduced
>> by the initial transform of expressions of the form:
>> 
>>  if (x & <pow2-const>) ... else ... ; // which is a typical bit test
>> =>
>>  if ((x >> <log2-const>) & 1) ... else ... ;
>> 
>> wreaks havoc on targets with relatively inefficient shifts, and seems
>> generally counterproductive in circumstances where conditional jumps
>> are inherently unavoidable; it would therefore be nice, if not imperative,
>> to have it remedied. (which I had hoped the refinement was attempting to
>> address)
> 
> I think you have a fundamental misunderstanding about how and why
> code transformations are applied in modern optimizing compilers.
> Many of the transformations that a compiler makes to a program do
> not concern code generation at all and are often not optimizations.
> 
> The easiest to understand canonicalization transformation is for
> commutative binary operators, where a compiler will transform the
> tree "1 + x" into the equivalent "x + 1".  Typically this will not
> improve performance or even affect the code we generate.  Instead
> the motivation is that it greatly simplifies the task of recognizing
> a tree that represents an addition by a constant.  This improves
> compile-time performance and reduces the amount of code needed to
> write a compiler.
> 
> The next level up might be the canonicalization of "x + x" into
> the equivalent "x * 2".  This transformation is critical for things
> such as recognizing that "x + x + x + x" is the same as "x * 4".
> However, notice that on most modern processors, a multiplication
> is typically much slower than an addition.  This canonicalization
> would appear to hurt performance and be a pessimization.  The important
> thing to realize is that GCC's tree representation is an abstract
> intermediate form that represents the semantics of a program, and
> unlike RTL, it is not a faithful representation of the code that will
> be executed.  The important bit to notice is that although at the tree
> level we convert "x + x" to "x * 2", which is easier to analyse and
> manipulate, at RTL expansion time, we convert "x * 2" back into a
> single addition before the compiler's output is generated.
> 
> Internally, "if (x & C)" has the semantics "if ((x & C) != 0)" when
> interpreted by language front-ends.  This is a tree with two operators,
> a NE_EXPR and a BIT_AND_EXPR, which has identical semantics to the
> equivalent expression "(x >> C') & 1", which again has two operators,
> a BIT_AND_EXPR and a RSHIFT_EXPR.  Both are equivalent, so we can
> CSE/GCSE the value of one into the use of the other, so it makes
> sense at the conceptual level that both are internally represented
> identically.
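
(As a purely illustrative aside, the equivalences Roger describes can be
written out in plain C; the function names below are hypothetical, not
anything taken from GCC itself:)

  /* "x + x" and "x * 2" have identical semantics: per the explanation
     above, GCC may canonicalize the former into the latter at the tree
     level, and expand it back into a single addition at RTL expansion
     time.                                                             */
  int twice_by_add(int x) { return x + x; }
  int twice_by_mul(int x) { return x * 2; }

  /* likewise "(x & C) != 0" and "((x >> C') & 1)" extract the same bit
     (here C = 32, C' = 5); x is taken as unsigned only because right
     shifts of negative values are implementation-defined in C         */
  int bit_by_mask (unsigned x) { return (x & 32) != 0; }
  int bit_by_shift(unsigned x) { return (x >> 5) & 1; }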

- but unless I misunderstand, the operation if (x & C) typically does not
require an explicit test for != 0, as it's fairly common for condition codes
to be generated by all logical and arithmetic operations; whereas the
expression if ((x >> C') & 1), although semantically equivalent, then has the
analogous semantics of if (((x >> C') & 1) != 0) given your explanation
above.  That seems clearly more complex, and implies a tree of 3 operations
as opposed to 2 (and again, the != 0 is typically not required by most
targets, as it is implied by the &); therefore in most typical cases,
(x & C) implies a single explicit operation, and ((x >> C') & 1) implies
two? (what am I missing?)

> The important thing to remember is that this is a semantic issue
> and not a performance one.  Which representation gets used for
> trees is irrelevant provided that the final code we generate is
> "optimal" for its context.  In this particular case, it was historically
> preferable to canonicalize to the shift-form, as this simplified the
> task of when to undo the transformation on expansion into RTL form.
> 
> It is vital to get beyond the belief that tree-ssa form is related
> to the efficiency of the final machine code.  fold-const.c appears
> to convert expressions into more expensive forms, gimplification
> would appear to introduce large numbers of unnecessary variables and
> trivial assignments between them, splitting loop headers and critical
> blocks would appear to introduce large numbers of empty basic blocks
> and unconditional jumps.
>
> The truth is that none of the above should adversely affect a program's
> performance, and indeed the goal of applying these transformations is
> to improve the program globally, and not just tree-node by tree-node.
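
(For instance, gimplification rewrites expressions into a three-address-like
form with compiler temporaries; a purely illustrative sketch of what the
"unnecessary variables and trivial assignments" look like, written back out
as C:)

  /* source:  a = b + c * d;
     a gimplified-style equivalent introduces a temporary for the
     intermediate product, yet the final code need be no worse for it  */
  int gimplified_style(int b, int c, int d)
  {
      int t1 = c * d;   /* compiler-introduced temporary */
      int a  = b + t1;
      return a;
  }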

- I believe I understand; however, since back-ends presently look for
  particular patterns which they know the target's instruction set can
  perform more efficiently, any transforms which affect the final
  representation (although arguably semantically equivalent) aren't likely
  to be recognized if syntactically different. (It would be nice if GCC's
  envisioned idealized canonical forms were documented so that back-ends
  could be correspondingly refined, and/or so that the back-end could be
  made to map the canonical forms optimally onto the target's supported
  forms automatically?)

> Hopefully, this explains why it's not important that you're aware of
> a tree transformation that appears counter-intuitive, only that the
> code which is generated for the PRs is at least as efficient as it
> was for 3.3.1, but also demonstrably more correct.

- your patch definitely improves the code from your earlier example:

  int foo(int a) { return (a & (1L<<23)) ? 1 : 2; }

(which semantically as noted is basically a sign test, although oddly coded)

But it's not clear that it helps when the tested bit is within the range of
the original operand (although I hope it does), i.e.:
 
 char foo(char a) { return (a & (1 << 5)) ? 1 : 2; }

Which I believe 3.3.1 treats as:

 char foo(char a) { return (a & 32) ? 1 : 2; }

Where if (a & 32) is recognized by the avr .md as a conditional single
bit-test & skip:

        sbrs r24,5
        rjmp .L2
        ldi r24,lo8(1)
        ret
.L2:
        ldi r24,lo8(2)
        ret

Which would be nice to retain, but was lost as the back-end saw the
relatively more complex tree ((a >> 5) & 1)? (I guess one could argue that
just such a more complex pattern now needs to be recognized by the back
end as being equivalent to (a & 32), since the literal execution of a shift
by 5 requires that many cycles times the size of the operand, not a minor
cost.)
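
(As a purely illustrative sketch of the equivalence such a back-end pattern
would rely on: for an 8-bit operand the mask form and the shift form agree
for every possible value, so mapping one back onto the other is always
legitimate.)

  #include <stdio.h>

  /* exhaustively check that testing bit 5 via a mask and via a shift
     give the same answer for all 256 possible 8-bit operand values    */
  int main(void)
  {
      int mismatches = 0;
      for (int i = 0; i < 256; i++) {
          unsigned char a = (unsigned char)i;
          if (((a & 32) != 0) != ((a >> 5) & 1))
              mismatches++;
      }
      printf("%d mismatches\n", mismatches);   /* prints: 0 mismatches */
      return 0;
  }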

[ and although somewhat tangential, it would be nice if 4.0 didn't so often
  needlessly promote an expression's operands and operations beyond their
  true need; it should ideally consider an operation's target precision
  requirements when determining the necessity of operand/operation
  promotion. Although not explicitly spelled out in C, this is implied as a
  logical consequence of assignment semantics combined with the operation's
  semantics: i.e. (a & 32) need not be promoted beyond char, regardless of
  whether "a" or 32 are considered signed or unsigned values, if that would
  have no effect on the resulting stored value/state; correspondingly, nor
  should the operands/operations of the transformed expression be needlessly
  promoted ]
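
(The claim about (a & 32) can be checked directly; a purely illustrative
sketch, with hypothetical function names, showing that the promoted and the
narrowed computation store the same value:)

  /* "a" and 32 are promoted to int by C's usual arithmetic conversions,
     but the result is converted straight back to char, so only the low
     8 bits matter; a compiler may therefore legitimately perform the
     AND in 8-bit precision, and both functions return the same value  */
  char and_promoted(char a) { return (char)((int)a & 32); }
  char and_narrowed(char a) { return (char)((unsigned char)a & (unsigned char)32); }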

> Roger
> --


