This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: RFC: seeking insight on store_data_bypass_p (recog.c)


Pat Haugen <pthaugen@linux.vnet.ibm.com> writes:
> On 04/12/2017 06:33 PM, Kelvin Nilsen wrote:
>> 
>> 1. As input arguments, out_insn represents an rtl expression that
>> potentially "produces" a store to memory and in_insn represents an rtl
>> expression that potentially "consumes" a value recently stored to memory.
>> 
> You have this reversed, the code is trying to find situations where
> out_insn is producing a value that in_insn will be storing to memory.
>
>> 2. If the memory store produced matches the memory fetch consumed, this
>> function returns true to indicate that this sequence of two instructions
>> qualifies for a special "bypass" latency that represents the fact that
>> the fetch will obtain the value out of the write buffer.  So, whereas
>> the instruction scheduler might normally expect that this sequence of
>> two instructions would experience Load-Hit-Store penalties associated
>> with cache coherency hardware costs, since these two instructions qualify
>> for the store_data_bypass optimization, the instruction scheduler counts
>> the latency as only 1 or 2 cycles (potentially).  [This is what I
>> understand, but I may be wrong, so please correct me if so.]
>> 
> In general, yes, if the function returns true then the sequence has been
> identified and the target will take appropriate action (adjusting
> latency or whatever).
>
>
> As for the remainder below dealing with PARALLELs, I don't have any
> history on that so hopefully others can chime in. For the rs6000 port, I
> don't see the handling of multiple SET operations making much sense, but
> then again I don't know whether that case actually occurs, based on the
> places where store_data_bypass_p is used.
>
> -Pat
>
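
Such bypasses are declared in the machine description's pipeline model
via define_bypass, with store_data_bypass_p as the guard function.
A hypothetical sketch (the reservation names are invented, not taken
from any real port):

  ;; Charge only 1 cycle, instead of the default latency, when the
  ;; producer/consumer pair satisfies store_data_bypass_p.
  (define_bypass 1 "fake-alu" "fake-store" "store_data_bypass_p")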

Reordering the message, sorry:

>> 2. A "bigger" concern is that any time any SETs are buried within a
>> PARALLEL tree, I'm not sure the answer produced by this function, as
>> currently implemented, is at all reliable:
>> 
>>  a) PARALLEL does not necessarily mean all of its subtrees happen in
>> parallel on hardware.  It just means that there is no sequencing imposed
>> by the source code, so the final order in which the multiple subtrees
>> beneath the PARALLEL node are emitted is not known at this stage of
>> compilation.
>>
>>  b) It seems to me that it doesn't really make sense to speak of whether
>> a whole bunch of producers combined with a whole bunch of consumers
>> qualify for an optimized store data bypass latency.  If we say that they
>> do qualify (as a group), which pair(s) of producer and consumer machine
>> instructions qualify?  It seems we need to know which producer matches
>> with which consumer in order to know where the bypass latencies "fit"
>> into the schedule.
>> 
>>  c) Furthermore, if it turns out that the "arbitrary" order in which the
>> producer instructions and consumer instructions are emitted places too
>> much "distance" between a producer and the matching consumer, then it is
>> possible that by the time the hardware executes the consumer, the stored
>> value is no longer in the write buffer, so even though we might have
>> "thought" two PARALLEL rtl expressions qualified for the store bypass
>> optimization, we really should have returned false.

The bypass function only operates on individual rtl instructions:
i.e. one producer and one consumer.  Usually rtl instructions
map to single machine instructions, with PARALLELs being used if
the machine instruction does more than one thing (more below).
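
As an example of the latter, a store-with-update instruction such as
PowerPC stwu both stores a value and updates the base register, so its
pattern is a PARALLEL of two SETs (registers shown schematically):

  (parallel [(set (mem:SI (plus:SI (reg:SI base) (const_int 4)))
                  (reg:SI src))
             (set (reg:SI base)
                  (plus:SI (reg:SI base) (const_int 4)))])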

.md files do have the option of using a single rtl instruction to
represent a sequence of several machine instructions but:

(a) they're then effectively asking the target-independent code to
    treat the sequence as if it were a single indivisible instruction.

(b) that hampers scheduling in lots of ways, so should be avoided
    unless there's really no alternative.  One problem is that it
    stops other machine instructions from being scheduled in the
    sequence.  Another is that it makes it harder to describe the
    microarchitecture effects of the sequence, since more than one
    instruction is going through the pipeline.

So yeah, if a target does put several machine instructions into
a single rtl instruction, and one of those instructions is a store,
using store_data_bypass_p on it is going to give poor results.
But it would give poor results in general, even without the bypass.
I think it's a case of "don't do that".

Sometimes it is (or was) useful to treat multiple machine
instructions as single rtl instructions during early rtl
optimisation.  It's still better to split them into individual
machine instructions for scheduling though, via define_split or
define_insn_and_split.
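
A hypothetical sketch of that idiom (the pattern name, constraints and
helper function are all invented):

  (define_insn_and_split "*fake_doubleword_store"
    [(set (match_operand:DI 0 "memory_operand" "=m")
          (match_operand:DI 1 "register_operand" "r"))]
    ""
    "#"                       ; no single machine instruction
    "&& reload_completed"     ; split after register allocation
    [(set (match_dup 2) (match_dup 3))
     (set (match_dup 4) (match_dup 5))]
    ;; Hypothetical helper that fills in operands[2..5] with the
    ;; word-sized halves, so the scheduler sees two separate stores.
    "split_doubleword_store (operands);")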

>> 3. Actually, what I described above is only the "simple" case.  It may
>> be that the rtl for either out_insn or in_insn is really a parallel
>> clause with multiple rtl trees beneath it.  In this case, we compare the
>> subtrees in a "similar" way to see if the compound expressions qualify
>> for the store_data_bypass_p "optimization".  (I've got some questions
>> about how this is done below)  As currently implemented, special
>> handling is given to a CLOBBER subtree as part of either PARALLEL
>> expression: we ignore it.  This is because CLOBBER does not represent
>> any real machine instructions.  It just represents semantic information
>> that might be used by the compiler.

The distinction between SET and CLOBBER is more whether the value
produced is useful to GCC or not.  E.g. instructions that set the
flags as a side-effect often have two patterns associated with them:
one of the form:

  (parallel [(set dest src)
             (clobber flags)])

for cases in which the flags result isn't useful and one of the form:

  (parallel [(set flags ...)
             (set dest src)])

for cases in which it is.  (The outer (parallel ...) is implicit
in a define_insn and gets added automatically by the generators.)
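
In define_insn form the first variant would look something like this
(the name, output template and flags-register number are invented):

  (define_insn "*fake_addsi3_clobber_flags"
    [(set (match_operand:SI 0 "register_operand" "=r")
          (plus:SI (match_operand:SI 1 "register_operand" "r")
                   (match_operand:SI 2 "register_operand" "r")))
     (clobber (reg:CC 100))]  ; hard flags register, number invented
    ""
    "fake_add %0,%1,%2")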

>> In addition to seeking confirmation of my existing understanding of the
>> code as outlined above, the specific questions that I am seeking help
>> with are:
>> 
>> 1. In the current implementation (as I understand it), near the top of
>> the function body, we handle the case that the consumer (in_insn) rtl is
>> a single SET expression and the producer (out_insn) rtl is a PARALLEL
>> expression containing multiple sets.  The way I read this code, we are
>> requiring that every one of the producer's parallel SET instructions
>> produce the same value that is to be consumed in order to qualify this
>> sequence as a "store data bypass".  That seems wrong to me.  I would
>> expect that we only need "one" of the produced values to match the
>> consumed value in order to qualify for the "store data bypass"
>> optimization.  Please explain.

Yeah, I agree that looks unnecessarily restrictive.  It'll correctly
handle store-multiple instructions, but it seems like it should also
allow non-MEM SET_DESTs in parallel with MEM SET_DESTs, provided that
the non-MEM SET_DESTs aren't used by the consumer.
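
Concretely, the case in question would look something like this
(register numbers schematic): the producer sets both a GPR and the
condition register, while the consumer stores only the GPR:

  ;; out_insn: two SETs, only the first of which feeds the store
  (parallel [(set (reg:SI 3)
                  (plus:SI (reg:SI 4) (reg:SI 5)))
             (set (reg:CC 100)
                  (compare:CC (plus:SI (reg:SI 4) (reg:SI 5))
                              (const_int 0)))])

  ;; in_insn: stores the value produced by the first SET
  (set (mem:SI (reg:SI 6)) (reg:SI 3))

On the current implementation's reading, the CC SET disqualifies the
pair even though the bypass would apply to the GPR value.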

Thanks,
Richard

