This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: making the new if-converter not mangle IR that is already vectorizer-friendly
- From: Abe <abe_skolnik at yahoo dot com>
- To: Alan Lawrence <alan dot lawrence at arm dot com>
- Cc: "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>, Richard Biener <richard dot guenther at gmail dot com>, Sebastian Pop <sebpop at gmail dot com>
- Date: Tue, 07 Jul 2015 15:55:11 -0500
- Subject: Re: making the new if-converter not mangle IR that is already vectorizer-friendly
- Authentication-results: sourceware.org; auth=none
- References: <55946699 dot 4070803 at yahoo dot com> <559504BE dot 40207 at arm dot com> <5595AB07 dot 7040404 at yahoo dot com> <559657CA dot 4080706 at arm dot com>
> My understanding is that any decision as to whether one or both of y or z is evaluated (when 'evaluation' involves doing any work,
> e.g. a load) has already been encoded into the gimple/tree IR. Thus, if we are to evaluate only one of 'y' or 'z' in your example,
> the IR will (prior to if-conversion) contain basic blocks and control flow that jump around the one that's not evaluated.
> This appears to be the case in pr61194.c: prior to if-conversion, the IR for the loop in barX is
>   # i_16 = PHI <i_13(7), 0(2)>
>   # ivtmp_21 = PHI <ivtmp_20(7), 1024(2)>
>   _5 = x[i_16];
>   _6 = _5 > 0.0;
>   _7 = w[i_16];
>   _8 = _7 < 0.0;
>   if (_9 != 0)
>     goto <bb 4>;
>   else
>     goto <bb 5>;
>
>   <bb 4>:
>   iftmp.0_10 = z[i_16];
>   goto <bb 6>;
>
>   <bb 5>:
>   iftmp.0_11 = y[i_16];
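[For reference, here is a hedged C reconstruction of the source loop that IR plausibly came from. The store target is not visible in the truncated dump, so 'out' is an assumed name, and the real pr61194.c may differ in details:]

```c
#include <assert.h>

#define N 1024
float x[N], w[N], y[N], z[N];
float out[N];  /* assumed store target; not visible in the dump above */

void barX(void)
{
    for (int i = 0; i < N; i++)
        /* non-short-circuit '&', matching "_9 = _6 & _8" in the IR */
        out[i] = ((x[i] > 0.0f) & (w[i] < 0.0f)) ? z[i] : y[i];
}
```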
[note: the following section, until the next quoted line, is mostly an explanation
of if-conversion and its trade-offs; please feel free to skip it or to read it later]
OK, but that's overly conservative in this case for most/all modern high-speed processors: "z[i]" and "y[i]" are both pure expressions, and
the possible values of 'i' here [given that the hardware doesn't malfunction, nobody alters the value in a debugger, etc.] can be proven
not to overflow the bounds of the arrays. With purity established and bounds-validity known to be OK, all we need to do is weigh
the "cost" of evaluating both expressions against that of inflicting a conditional branch on the hardware. I believe this "cost" depends on the
values in the arrays 'x' and 'w': when the probability of either case -- "((x[i]>0) & (w[i]<0))" true or false -- stays very close to
50% for a long time, the branch predictor is going to find it difficult to make any sense of the pattern, and speculative execution is going to
run down the incorrect branch about 50% of the time, until the data comes in that renders speculation no longer needed in the given iteration.
On most/all modern high-speed processors, doing the loads unconditionally is going to be better, IMO, for anything even remotely resembling a "normal"
or "average" workload. In other words, as long as the result of "((x[i]>0) & (w[i]<0))" changes frequently and unpredictably as 'i' is incremented,
it should be better to do both loads: the needed data are in cache lines that are already in cache, possibly even in the L1 data cache, so it's pretty
"cheap" to just load them. Conditional branches based on difficult-to-predict data that change frequently [as the arrays are traversed],
OTOH, are likely to cause stalls and result in slower execution than the fastest possible on that CPU for the source code in question.
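[To make the trade-off concrete, here is an illustrative C sketch -- the names are mine, not from GCC. Both functions compute the same value; the second performs both loads unconditionally and selects afterwards, which is the single-basic-block shape the vectorizer wants:]

```c
#include <assert.h>

/* May compile to a conditional branch per element (short-circuit '&&'). */
float sel_branchy(const float *x, const float *w,
                  const float *y, const float *z, int i)
{
    if (x[i] > 0.0f && w[i] < 0.0f)
        return z[i];
    return y[i];
}

/* Both loads happen unconditionally; the select is a data operation. */
float sel_flat(const float *x, const float *w,
               const float *y, const float *z, int i)
{
    float a = z[i];
    float b = y[i];
    int c = (x[i] > 0.0f) & (w[i] < 0.0f);  /* non-short-circuit '&' */
    return c ? a : b;
}
```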
In cases where the data upon which the condition depends lead to mostly-true or mostly-false results,
the conditional branch should perform well, for loops with enough iterations, on any CPU with a decent branch predictor,
but those conditions are rare IMO. I think we should assume random data in the arrays unless/until we have analysis that tells
us otherwise, and I don't expect much compile-time analysis of the contents of arrays to become the norm in compilers anytime soon.
These are probably extremely infrequent corner cases anyway, with the possible exception of the processing of sparse matrices;
does anybody reading this with a strong background in numerical/scientific computing have any comments on this?
Predication of instructions can help to remove the burden of the conditional branch, but it is not available on all relevant architectures.
Some architectures that are typically implemented in modern high-speed processors -- i.e. with high core frequencies, caches, speculative execution, etc. --
lack full support for predication [as exists, for example, in 32-bit ARM] but _do_ support "conditional move" [hereinafter "cmove"].
If/when the ISA in question supports cmove to/from main memory, perhaps it would be profitable to use two cmoves back-to-back with opposite conditions
and the same register [destination for a load, source for a store] to implement e.g. "temp = c ? X[x] : Y[y]" and/or "temp = C[c] ? X[x] : Y[y]", etc.
Even without cmove to/from main memory, two cmoves back-to-back with opposite conditions could still be used, e.g. [not for a real-world ISA]:

  load  X[x] -> reg1
  load  Y[y] -> reg2
  cmove c    ? reg1 -> reg3
  cmove (!c) ? reg2 -> reg3

Or, even better, if the ISA can support something like:

  load  X[x] -> reg1
  load  Y[y] -> reg2
  cmove (c ? reg1 : reg2) -> reg3
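[In C, the same register-to-register select can be written branchlessly with a mask; this is one way to express what the two opposite cmoves compute. A sketch only -- 'cmove_pair' is my name, and a compiler may well emit a single cmov/csel for the plain "?:" form anyway:]

```c
#include <assert.h>
#include <stdint.h>

/* reg3 = c ? reg1 : reg2, without a branch.
   With c nonzero, mask is all-ones, so reg1 is selected;
   with c zero, mask is all-zeros, so reg2 is selected. */
uint32_t cmove_pair(uint32_t c, uint32_t reg1, uint32_t reg2)
{
    uint32_t mask = -(uint32_t)(c != 0);   /* 0xFFFFFFFF or 0 */
    return (reg1 & mask) | (reg2 & ~mask);
}
```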
However, this is a code-gen issue from my POV, and at the present time all the if-conversion work is occurring at the GIMPLE level.
If anybody reading this knows how I could make the if-converter generate GIMPLE that leads to code-gen that is better for at
least one ISA without negatively impacting code-gen for any other ISA, I'd be glad to read about it.
> Without -ftree-loop-if-convert-stores, if-conversion leaves this alone, and vectorization fails (i.e. the vectorizer
> bails out because the loop has >2 basic blocks). With -ftree-loop-if-convert-stores, if-conversion produces
> where I have commented the conditional loads that have become unconditional.
> (Hence, "-ftree-loop-if-convert-stores" looks misnamed - it affects how the if-conversion phase converts loads,
> too - please correct me if I misunderstand (Richard?) ?!) This contains no control flow, and so is vectorizable.
> (This is all without your scratchpad patch, of course.)
I think the preceding is quite correct.
Sebastian and I have discussed [at some length, on my part ;-)] what to do about the flags. We definitely want them to change.
We might wind up with a different number of flags, i.e. not just rename one or both of the two we currently have.
Does anybody know of any large codebase that depends heavily on the current flags?
AFAIK these have been infrequently-if-ever used so far in production code.