This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: Another benefit of the new if converter: better performance for half hammocks when running the generated code on a modern high-speed CPU with write-back caching, relative to the code produced by the old if converter given the same source code

From: Richard Biener <richard dot guenther at gmail dot com>
To: Abe <abe_skolnik at yahoo dot com>
Cc: Alan Lawrence <alan dot lawrence at arm dot com>, "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>
Date: Tue, 28 Jul 2015 12:21:29 +0200
Subject: Re: Another benefit of the new if converter: better performance for half hammocks when running the generated code on a modern high-speed CPU with write-back caching, relative to the code produced by the old if converter given the same source code
Authentication-results: sourceware.org; auth=none
References: <55A961C1 dot 1070206 at yahoo dot com>

On Fri, Jul 17, 2015 at 10:12 PM, Abe <abe_skolnik@yahoo.com> wrote:
> Dear all,
>
> Another benefit of the new if converter that perhaps I neglected to
> mention/explain...
>
> TLDR: for some source code which can reasonably be expected to exist in
> "real-world code",
> when the false/true values of the condition bits of a given "if" in a given
> loop are very
> well-clustered, the code produced by the new converter runs _much_ faster
> for the same
> inputs than the code produced by the old converter when write-back caching
> is in effect.
>
> The long explanation follows.
>
>
>
> In the case of a loop with a "half-hammock" that looks something like this:
>
>   if (C[index])  A[index] = foo(bar);
>   // important: no else here
>
> ... or problem-wise-equivalently:
>
>   if (C[index])  ; // empty "then" section
>   else           B[index] = foo(bar);
>
> ... the latter of which is semantically equivalent to:
>
>   if (! C[index])  B[index] = foo(bar);
>   // important: no else here
>
>
> ... the old if converter does something that may massively damage
> performance.
>
> Basically, the old if converter converts...
>
>   if (C[index])  A[index] = foo(bar);
>   // important: no else here
>
> ... to the equivalent of:
>
>   __compiler_temp = foo(bar);
>   A[index] = C[index] ? __compiler_temp : A[index];
>
>
> For now, let`s assume the preceding conversion is valid even in the face of
> multithreading,
> since multithreading bugs introduced by an "optimization" are a whole other
> ball of wax than what this message is all about; for now, let`s assume that
> all of A[] is thread-local and no nasty, sneaky pointer-passing has
> occurred.
>
> The problem is this: the compiler cannot, in the general case, predict what
> the values of C[]
> will be at runtime.  Therefor, it cannot [again, in the general case] arrive
> at a conclusion
> "this is definitely worth it" or "this is definitely _not_ worth it".  All
> the compiler can do
> statically without profiling information is to say "I guess a probability of
> 50% on the
> elements of C[] being equivalent to true", which -- under an assumption of
> vectorization --
> means that the vectorization factor is going to make the transformation
> worthwhile.
>
> However: what if the values of C[] are mostly equivalent to false, not to
> true?  For such
> cases, the old if converter yielded code that may cause a big performance
> degradation due to
> the if conversion, even in the presence of vectorization.  If we assume that
> the CPU hardware
> is not checking to see whether writes to an address change the contents,
> then each execution
> of "A[index] = C[index] ? foo(bar) : A[index];" is causing a write to occur
> *_no matter
> what the value of "C[index]" is/was_*.  Now, instead of reading the whole
> A[] and writing
> a tiny fraction of it, the program is reading all of A[] and also
> (re)writing at least
> almost all of A[] (possibly writing all of it even when the probability of
> the
> relevant elements of C[] is _not_ 100%, due to cache-line granularity of
> writes:
> when every cache line from A[] is modified, all of A[] will be rewritten).
>
> The preceding problem could be somewhat ameliorated by profiling, providing
> that the data
> you run through your program while profiling is a good representation of the
> data run
> through the same program by "real executions", but there is no need for that
> extra work
> or extra qualification given the existence of the new if converter.  Plus,
> the profiling
> approach to "fixing" this problem with the old converter would only result
> in a binary
> decision -- "do convert this if" vs. "don`t convert this if" -- which in
> cases where the
> decision is to do/retain the conversion, the converted code is going to
> rewrite the whole
> array.  The new converter`s conversion, OTOH, can produce better performance
> than the
> conversion from the old converter in cases where the elements of C[] in our
> example are
> very clustered: in other words, the _overall_ probability can still be close
> [or =] to the
> hardest-to-deal-with challenge of 50%, but there is a large degree of
> clustering of the
> "true" values and the "false" values.  For example, all the "false" values
> come first in C[].
> In this case, if a profiling-based approach using the old converter decides
> to do/keep
> the conversion, then there are still lots of wasted writes that the new
> converter
> would avoid, assuming at least one level of write-back cache in the relevant
> data path.
>
> The key factor to understand for understanding how/why the new converter`s
> resulting code
> is better than that of the old converter is this: the new converter uses a
> scratchpad
> to "throw away" useless writes.  This not only fixes problems with
> speculative writes
> through a null pointer that the pre-conversion code never actually does, it
> also fixes
> the above-described potential performance problem, at least on architectures
> with
> write-back data cache, which AFAIK covers most/all modern high-speed CPUs.
> The new converter converts something like (same example as one of the
> above):
>
>   if (C[index])  A[index] = foo(bar);
>   // important: no else here
>
> ... into something like:
>
>   /* the type of the scalar goes here */ * pointer = C[index] ? &A[index] :
> &scratchpad;
>   __compiler_temp = foo(bar);
>   *pointer = C[index] ? __compiler_temp : scratchpad;

Note the store to *pointer can be done unconditionally

  *pointer = __compiler_temp;

and note that another performance issue of if-conversion is that foo(bar)
is executed unconditionally.  We have a bugreport that

  if (C[index]) A[index] = exp (x);

massively slows down things if C[index] is almost never true.  So it
is not only the store to A[] but in some cases even more so the
unconditional execution of foo(bar).  Of course the new if-converter
doesn't fix that.

Dependent on whether the scratchpad is shared for different sized accesses
or not the scratchpad may also introduce store hazards in the CPU
because if scratchpad-as-int = ...; follows scratchpad-as-short = ...;
the latter only partially kills the preceeding store to the scratchpad
(usually CPUs can't merge the two requests in the store buffer).

Not sure which approach to allocating the scratchpad you are using currently.
I still have to review the patch itself ... (and so many others)

> ... so that in the event of C[] being mostly false, most of the reads are
> _from_ the scratchpad
> and most of the writes are _to_ the scratchpad.  This not only probably*
> saves a lot of the
> potentially-massive number of wasted writes due to the old conversion
> rewriting all of A[]
> no matter what the values of C[]`s elements are, it also saves a lot of
> reads when there is
> enough clustering of false values in C[] that entire cache lines worth of
> A[] can be not
> read from main RAM.  Caching and the "chunking effect" of cache lines,
> besides qualifying how
> much reading is saved by the new conversion relative to the old one, also is
> the reason I needed
> to qualify "probably* saves most/all the [...] wasted writes": it depends on
> how well or
> how poorly the values in C[] are clustered.  In the case that is best for
> the new converter,
> the new converter saves at least almost all the waste of the old converter`s
> code on both
> the read path _and_ the write path [modulo at most one cache line of both
> read and write].
>
>
> I hope the above helps to explain why the new converter is worth merging to
> trunk,
> even though for now it still has performance regressions which
> we expect to fix before the first GCC 6.x is tagged for release.

How do you expect to fix the performance regressions?

Richard.

>
> Regards,
>
> Abe

Follow-Ups:
- Re: Another benefit of the new if converter: better performance for half hammocks when running the generated code on a modern high-speed CPU with write-back caching, relative to the code produced by the old if converter given the same source code
  - From: Abe

References:
- Another benefit of the new if converter: better performance for half hammocks when running the generated code on a modern high-speed CPU with write-back caching, relative to the code produced by the old if converter given the same source code
  - From: Abe

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]