This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: making the new if-converter not mangle IR that is already vectorizer-friendly
- From: Richard Biener <richard dot guenther at gmail dot com>
- To: Abe <abe_skolnik at yahoo dot com>
- Cc: Alan Lawrence <alan dot lawrence at arm dot com>, "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>, Sebastian Pop <sebpop at gmail dot com>
- Date: Thu, 9 Jul 2015 10:23:02 +0200
- Subject: Re: making the new if-converter not mangle IR that is already vectorizer-friendly
- Authentication-results: sourceware.org; auth=none
- References: <55946699 dot 4070803 at yahoo dot com> <559504BE dot 40207 at arm dot com> <5595AB07 dot 7040404 at yahoo dot com> <559657CA dot 4080706 at arm dot com> <559C3CAF dot 9040308 at yahoo dot com>
On Tue, Jul 7, 2015 at 10:55 PM, Abe <email@example.com> wrote:
> [Alan wrote:]
>> My understanding is that any decision as to whether one or both of y or z
>> is evaluated (when 'evaluation' involves doing any work,
>> e.g. a load), has already been encoded into the gimple/tree IR. Thus, if
>> we are to only evaluate one of 'y' or 'z' in your example,
>> the IR will (prior to if-conversion), contain basic blocks and control
>> flow, that means we jump around the one that's not evaluated.
>> This appears to be the case in pr61194.c: prior to if-conversion, the IR
>> for the loop in barX is
>> <bb 3>:
>> # i_16 = PHI <i_13(7), 0(2)>
>> # ivtmp_21 = PHI <ivtmp_20(7), 1024(2)>
>> _5 = x[i_16];
>> _6 = _5 > 0.0;
>> _7 = w[i_16];
>> _8 = _7 < 0.0;
>> _9 = _6 & _8;
>> if (_9 != 0)
>> goto <bb 4>;
>> goto <bb 5>;
>> <bb 4>:
>> iftmp.0_10 = z[i_16];
>> goto <bb 6>;
>> <bb 5>:
>> iftmp.0_11 = y[i_16];
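For reference, the IR above corresponds to a source loop of roughly the following shape (a hand-written sketch with illustrative names; the actual pr61194.c testcase may differ in detail):

```c
#include <stddef.h>

#define N 1024

/* Sketch of the loop shape behind the IR dump above.  The `&&` on pure
   operands becomes the non-short-circuit `_9 = _6 & _8` test, and the
   two arms of the `?:` become the conditional loads in <bb 4>/<bb 5>. */
void barX(const float *x, const float *w,
          const float *y, const float *z, float *r)
{
    for (size_t i = 0; i < N; i++)
        r[i] = (x[i] > 0.0f && w[i] < 0.0f) ? z[i] : y[i];
}
```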
> [note: the following section, until the next quoted line, is mostly a general
> discussion of if-conversion and its trade-offs; please feel free to skip it
> or to read it later]
> OK, but that's overly conservative in this case for most/all modern
> high-speed processors: "z[i]" and "y[i]" are both pure expressions and
> the possible values of 'i' here [given that the hardware doesn't
> malfunction, nobody alters the value in a debugger, etc.] can be proven
> not to overflow the bounds of the arrays. With purity established and
> bounds-validity known to be OK, all we need to do is to weigh the
> "cost" of evaluating both expressions against that of inflicting a
> conditional branch on the hardware. I believe this "cost" depends on the
> values in the arrays 'x' and 'w', but when the probability of either case --
> "((x[i]>0) & (w[i]<0))" is true or is false -- is very close to
> 50% for a long time, the branch predictor is going to find it difficult to
> make any sense of it, and speculative execution is going to speculatively
> execute the incorrect branch about 50% of the time until the data comes in
> that renders speculation no-longer-needed in the given iteration.
> On most/all modern high-speed processors, doing the loads unconditionally is
> going to be better IMO for anything even remotely resembling a "normal"
> or "average" workload. In other words, so long as the result of "((x[i]>0)
> & (w[i]<0))" changes frequently and unpredictably as 'i' is incremented,
> it should be better to do both loads: the needed data are in cache lines
> that are already in cache, possibly even in L1 data cache, so it's pretty
> "cheap" to just load them. Conditional branches based on
> difficult-to-predict data that changes frequently [as the arrays are
> traversed], OTOH, are likely to cause stalls and result in slower execution
> than the fastest possible on that CPU for the source code in question.
> In cases where the data upon which the condition depends lead to mostly-true
> or mostly-false results, the conditional branch should perform well, for
> loops with enough iterations, on any CPU with a decent branch predictor,
> but those are rare conditions IMO. I think we should assume random data in
> the arrays unless/until we have analysis that tells us otherwise, and I
> don't expect much compile-time analysis of the contents of arrays to become
> the norm in compilers anytime soon.
> These are probably extremely-infrequent corner cases anyway, with the
> possible exception of the processing of sparse matrices;
> does anybody reading this with a strong background in numerical/scientific
> computing have anything to comment on this?
> Predication of instructions can help to remove the burden of the conditional
> branch, but is not available on all relevant architectures.
> In some architectures that are typically implemented in modern high-speed
> processors -- i.e. with high core frequency, caches, speculative execution,
> etc. --
> there is not full support for predication [as there is, for example, in
> 32-bit ARM] but there _is_ support for "conditional move" [hereinafter
> "cmove"]. If/when the ISA in question supports cmove to/from main memory,
> perhaps it would be profitable to use two cmoves back-to-back with opposite
> conditions and the same register [destination for load, source for store]
> to implement e.g. "temp = c ? X[x] : Y[y]" and/or "temp = C[c] ? X[x] : Y[y]" etc.
> Even without cmove to/from main memory, two cmoves back-to-back with
> opposite conditions could still be used, e.g. [not for a real-world ISA]:
> load X[x] -> reg1
> load Y[y] -> reg2
> cmove c ? reg1 -> reg3
> cmove (!c) ? reg2 -> reg3
> Or even better if the ISA can support something like:
> load X[x] -> reg1
> load Y[y] -> reg2
> cmove (c ? reg1 : reg2) -> reg3
> However, this is a code-gen issue from my POV, and at the present time all
> the if-conversion work is occurring at the GIMPLE level.
> If anybody reading this knows how I could make the if converter generate
> GIMPLE that leads to code-gen that is better for at
> least one ISA and the change[s] do/does not negatively impact upon code-gen
> for any other ISA, I'd be glad to read about it.
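Even without ISA support for a three-operand cmove, the pattern above can be expressed branchlessly in plain C with a mask; a minimal sketch (the function name is mine, purely illustrative):

```c
#include <stdint.h>

/* Branchless select between two already-loaded values, mirroring:
       load X[x] -> reg1
       load Y[y] -> reg2
       cmove (c ? reg1 : reg2) -> reg3
   `mask` is all-ones when c is nonzero and all-zeros otherwise, so
   exactly one operand survives the AND/OR combination. */
uint32_t select_masked(uint32_t a, uint32_t b, int c)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(c != 0);
    return (a & mask) | (b & ~mask);
}
```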
>> Without -ftree-loop-if-convert-stores, if-conversion leaves this alone,
>> and vectorization fails (i.e. the vectorizer
>> bails out because the loop has >2 basic blocks). With
>> -ftree-loop-if-convert-stores, if-conversion produces IR in which the
>> conditional loads have become unconditional.
>> (Hence, "-ftree-loop-if-convert-stores" looks misnamed - it affects how
>> the if-conversion phase converts loads, too - please correct me if I
>> misunderstand (Richard?)!) This contains no
>> control flow, and so is vectorizable.
>> (This is all without your scratchpad patch, of course.)
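At the C level, the flattened form that if-conversion aims for looks roughly like this (a hand-written sketch with illustrative names, not actual if-converter output): every load happens unconditionally and the control flow collapses into a select, leaving the single basic block the vectorizer needs.

```c
/* Sketch of the if-converted loop: both arms are loaded unconditionally
   (these were the conditional loads in <bb 4>/<bb 5>) and the branch is
   replaced by a select that compilers can lower without control flow. */
void barX_flat(const float *x, const float *w,
               const float *y, const float *z, float *r, int n)
{
    for (int i = 0; i < n; i++) {
        float zt = z[i];  /* was loaded only when the condition held */
        float yt = y[i];  /* was loaded only when it did not         */
        int c = (x[i] > 0.0f) & (w[i] < 0.0f);
        r[i] = c ? zt : yt;
    }
}
```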
> I think the preceding is quite correct.
> Sebastian and I have discussed [at some length, on my part ;-)] what to do
> about the flags. We definitely want them to change.
> We might wind up with a different number of flags, i.e. not just rename one
> or both of the two we currently have.
> Does anybody know of any large codebase that depends heavily on the current
> flags? AFAIK these have been infrequently-if-ever used so far in production
> code.
The GIMPLE-level if-conversion code was written purely to make loops suitable
for vectorization. It wasn't meant to provide if-conversion of scalar code in
the end (even though it does). We've discussed enabling the versioning path
unconditionally, for example.
So if the new scheme with scratch-pads produces more "correct" code, but code
that is known to fail vectorization, then it's done in the wrong place -
because the whole purpose of GIMPLE if-conversion is to enable more
vectorization.