This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: RFC: seeking insight on store_data_bypass_p (recog.c)
- From: Pat Haugen <pthaugen at linux dot vnet dot ibm dot com>
- To: Kelvin Nilsen <kdnilsen at linux dot vnet dot ibm dot com>, gcc-patches at gcc dot gnu dot org
- Date: Thu, 13 Apr 2017 08:55:21 -0500
- Subject: Re: RFC: seeking insight on store_data_bypass_p (recog.c)
- References: <be2e1935-4225-43a6-7675-299944cd4d34@linux.vnet.ibm.com>
On 04/12/2017 06:33 PM, Kelvin Nilsen wrote:
>
> 1. As input arguments, out_insn represents an rtl expression that
> potentially "produces" a store to memory and in_insn represents an rtl
> expression that potentially "consumes" a value recently stored to memory.
>
You have this reversed: the code is trying to find situations where
out_insn produces a value that in_insn will then store to memory.
> 2. If the memory store produced matches the memory fetch consumed, this
> function returns true to indicate that this sequence of two instructions
> qualifies for a special "bypass" latency that represents the fact that
> the fetch will obtain the value out of the write buffer. So, whereas
> the instruction scheduler might normally expect that this sequence of
> two instructions would experience Load-Hit-Store penalties associated
> with cache coherency hardware costs, since these two instructions qualify
> for the store_data_bypass optimization, the instruction scheduler counts
> the latency as only 1 or 2 cycles (potentially). [This is what I
> understand, but I may be wrong, so please correct me if so.]
>
In general, yes, if the function returns true then the sequence has been
identified and the target will take appropriate action (adjusting
latency or whatever).
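For context, the usual wiring is through a define_bypass in the target's machine description: store_data_bypass_p serves as the guard function, and when it returns true the scheduler uses the bypass latency instead of the default latency from the producer's define_insn_reservation. The reservation names below are hypothetical, not from any actual .md file:

```lisp
;; Hypothetical fragment: ALU results normally carry their reservation's
;; default latency, but when the result feeds the data operand of a
;; store, use a 1-cycle latency instead.
(define_bypass 1 "cpu-alu" "cpu-store" "store_data_bypass_p")
```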
As for the remainder below dealing with PARALLELs, I don't have any
history on that, so hopefully others can chime in. For the rs6000 port, I
don't see the handling of multiple SET operations making much sense, but
then again I don't know whether that case actually occurs, based on the
places where store_data_bypass_p is used.
-Pat
> 3. Actually, what I described above is only the "simple" case. It may
> be that the rtl for either out_insn or in_insn is really a parallel
> clause with multiple rtl trees beneath it. In this case, we compare the
> subtrees in a "similar" way to see if the compound expressions qualify
> for the store_data_bypass_p "optimization". (I've got some questions
> about how this is done below) As currently implemented, special
> handling is given to a CLOBBER subtree as part of either PARALLEL
> expression: we ignore it. This is because CLOBBER does not represent
> any real machine instructions. It just represents semantic information
> that might be used by the compiler.
>
> In addition to seeking confirmation of my existing understanding of the
> code as outlined above, the specific questions that I am seeking help
> with are:
>
> 1. In the current implementation (as I understand it), near the top of
> the function body, we handle the case that the consumer (in_insn) rtl is
> a single SET expression and the producer (out_insn) rtl is a PARALLEL
> expression containing multiple sets. The way I read this code, we are
> requiring that every one of the producer's parallel SET instructions
> produce the same value that is to be consumed in order to qualify this
> sequence as a "store data bypass". That seems wrong to me. I would
> expect that we only need "one" of the produced values to match the
> consumed value in order to qualify for the "store data bypass"
> optimization. Please explain. (The same confusing behavior happens
> below in the same function, in the case that the consumer rtl is a
> PARALLEL expression of multiple SETs: we require that every producer's
> stored value match every consumer's fetched value.)
>
> 2. A "bigger" concern is that any time any SETs are buried within a
> PARALLEL tree, I'm not sure the answer produced by this function, as
> currently implemented, is at all reliable:
>
> a) PARALLEL does not necessarily mean all of its subtrees happen in
> parallel on hardware. It just means that there is no sequencing imposed
> by the source code, so the final order in which the multiple subtrees
> beneath the PARALLEL node will be emitted is not known at this stage of
> compilation.
>
> b) It seems to me that it doesn't really make sense to speak of whether
> a whole bunch of producers combined with a whole bunch of consumers
> qualify for an optimized store data bypass latency. If we say that they
> do qualify (as a group), which pair(s) of producer and consumer machine
> instructions qualify? It seems we need to know which producer matches
> with which consumer in order to know where the bypass latencies "fit"
> into the schedule.
>
> c) Furthermore, if it turns out that the "arbitrary" order in which the
> producer instructions and consumer instructions are emitted places too
> much "distance" between a producer and the matching consumer, then it is
> possible that by the time the hardware executes the consumer, the stored
> value is no longer in the write buffer, so even though we might have
> "thought" two PARALLEL rtl expressions qualified for the store bypass
> optimization, we really should have returned false.
>
> Can someone help me understand this better?
>
> Thanks much.
>
>