[PR 81616] Deferring FMA transformations in tight loops

Martin Jambor mjambor@suse.cz
Fri Dec 22 00:01:00 GMT 2017


Hi,

On Mon, Dec 18 2017, Richard Biener wrote:
> On Fri, Dec 15, 2017 at 3:19 PM, Martin Jambor <mjambor@suse.cz> wrote:
>>
>> Hello,
>>
>> the patch below prevents creation of fused-multiply-and-add instructions
>> in the widening_mul gimple pass on the Zen-based AMD CPUs and as a
>> result fixes regressions of native znver1 tuning when compared to
>> generic tuning in:
>>
>>   - the matrix.c testcase of PR 81616 (straightforward matrix
>>     multiplication) at -O2 and -O3, which is currently 60% (!),
>>
>>   - SPEC 2006 454.calculix at -O2, which is currently over 20%, and
>>
>>   - SPEC 2017 510.parest at -O2 and -Ofast, which is currently also
>>     about 20% in both cases.
>>
>> The basic idea is to detect loops in the following form:
>>
>>     <bb 6>
>>     # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
>>     ...
>>     _65 = _14 * _16;
>>     accumulator_66 = _65 + accumulator_111;
>>
>> and prevents creating an FMA for it.  Because it has to in at least the
>> parest and calculix cases, it also handles chains of more than one FMA
>> candidate in which each result feeds the next candidate's addend:
>>
>>
>>     <bb 6>
>>     # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
>>     ...
>>     _65 = _14 * _16;
>>     accumulator_55 = _65 + accumulator_111;
>>     _67 = _24 * _36;
>>     accumulator_66 = _67 + accumulator_55;
>>
>> Unfortunately, to really get rid of the calculix regression, the
>> algorithm cannot just look at one BB at a time but also has to work for
>> cases like the following:
>>
>>      1  void mult(void)
>>      2  {
>>      3      int i, j, k, l;
>>      4
>>      5     for(i=0; i<SIZE; ++i)
>>      6     {
>>      7        for(j=0; j<SIZE; ++j)
>>      8        {
>>      9           for(l=0; l<SIZE; l+=10)
>>     10           {
>>     11               c[i][j] += a[i][l] * b[k][l];
>>     12               for(k=1; k<10; ++k)
>>     13               {
>>     14                   c[i][j] += a[i][l+k] * b[k][l+k];
>>     15               }
>>     16
>>     17           }
>>     18        }
>>     19     }
>>     20  }
>>
>> where the FMA on line 14 feeds into the one on line 11 in an
>> encompassing loop.  Therefore I have changed the structure of the pass
>> to work in reverse dominance order, and it keeps a hash set of the
>> results of rejected FMA candidates which it checks when looking at PHI
>> nodes of the current BB.  Without this reorganization, calculix was
>> still 8% slower with native tuning than with the generic one.
>>
>> When the deferring mechanism realizes that the FMA candidates in the
>> current BB do not all form a single tight-loop chain like in the
>> examples above, it goes back to all the previously deferred candidates
>> (in the current BB only) and performs the transformation.
>>
>> The main reason for this is to keep the patch conservative (and also
>> simple), but it also means that the following function is not affected
>> and is still 20% slower when compiled with native march and tuning than
>> with the generic one:
>>
>>      1  void mult(struct s *p1, struct s *p2)
>>      2  {
>>      3     int i, j, k;
>>      4
>>      5     for(i=0; i<SIZE; ++i)
>>      6     {
>>      7        for(j=0; j<SIZE; ++j)
>>      8        {
>>      9           for(k=0; k<SIZE; ++k)
>>     10           {
>>     11              p1->c[i][j] += p1->a[i][k] * p1->b[k][j];
>>     12              p2->c[i][j] += p2->a[i][k] * p2->b[k][j];
>>     13           }
>>     14        }
>>     15     }
>>     16  }
>
> So here we apply store-motion to p[12]->c[i][j] so we see a
> reduction in SSA?  That is, you are really looking for
> SSA SCCs that involve only FMA candidates?  There's SCC-finding code
> in tree-ssa-sccvn.c, DFS (), but a straightforward use->def walk over
> the adds should find an SCC just involving the adds?  That is, rather
> than doing all the deferring stuff I wonder if it is easier to simply
> gather the SCC, if there is any, and add "disqualifications", like to
> each other member?  So when seeing
>
>   x += a1 * b1;
>   x += a2 * b2;
>   x += a3 * b3;
> ...

That is certainly an alternative, but I am not sure it would result in
simpler code, at least not with identical intended behavior, because I
would then have to walk backwards from the additions to the
multiplications, skipping a potential negation, to check that they form
a viable FMA.  And remembering disqualifications is likely to be about
as complex as remembering deferred transformations.
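
For the record, the backward walk would have to look roughly like the
sketch below (illustrative only, with a made-up helper name; the real
candidate checks in the pass also look at types, uses and so on):

  static gimple *
  feeding_mult_of_addend (tree op)
  {
    if (TREE_CODE (op) != SSA_NAME)
      return NULL;
    gimple *def = SSA_NAME_DEF_STMT (op);
    if (!is_gimple_assign (def))
      return NULL;

    /* Skip over a potential negation of the multiplication result.  */
    if (gimple_assign_rhs_code (def) == NEGATE_EXPR)
      {
        tree negated = gimple_assign_rhs1 (def);
        if (TREE_CODE (negated) != SSA_NAME)
          return NULL;
        def = SSA_NAME_DEF_STMT (negated);
        if (!is_gimple_assign (def))
          return NULL;
      }

    if (gimple_assign_rhs_code (def) != MULT_EXPR)
      return NULL;
    return def;
  }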

> we'd still generate an FMA for every 2nd statement?  Basically

When I manually unroll the matrix multiplication twice, not creating
every second FMA improves run-time only by 20%, whereas removing all of
them improves it by 60%.  When I tried to unroll the loop more,
specifically eight times, the regression disappeared even with an
unpatched compiler; I assume the loads hid the latency issue.
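
For reference, by not creating every second FMA in the twice-unrolled
loop I mean something like the following (a hand-written sketch, not
the exact code I measured):

    for (k = 0; k < SIZE; k += 2)
      {
        c[i][j] += a[i][k] * b[k][j];          /* left as mul + add */
        c[i][j] += a[i][k + 1] * b[k + 1][j];  /* contracted to an FMA */
      }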

> do some scheduling of as many mults as needed to cover
> the latency of the first FMA.
>
> To go back to the Zen specific issue - so FMA has a higher overall
> latency than mul + add (?) and (or?)

No, if I read the tables I'm using for this right, the FMA latency is
5 cycles, whereas add + mul have 7 combined.

> if there's more than one mul being
> accumulated then the 2nd mul can execute in parallel with the first
> add (in case dependences allow).

That is how I understand the issue.
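
In other words, what matters is the loop-carried dependence on the
accumulator.  A simple dot product illustrates it (the ~3 cycle figure
for the add alone is an assumption on my part; the tables I have only
give the combined 7):

  #include <math.h>

  #define SIZE 1000
  static double a[SIZE], b[SIZE];

  /* Contracted: each iteration waits the full FMA latency (5 cycles)
     on the accumulator before the next one can start.  */
  double
  dot_fma (void)
  {
    double acc = 0.0;
    for (int k = 0; k < SIZE; ++k)
      acc = fma (a[k], b[k], acc);   /* acc -> acc: ~5 cycles/iteration */
    return acc;
  }

  /* Split: the multiply does not depend on the accumulator, so only
     the add (assumed ~3 of the combined 7 cycles) is on the critical
     path and the multiplies can overlap.  */
  double
  dot_split (void)
  {
    double acc = 0.0;
    for (int k = 0; k < SIZE; ++k)
      {
        double t = a[k] * b[k];      /* independent of acc */
        acc = acc + t;               /* acc -> acc: ~3 cycles/iteration */
      }
    return acc;
  }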

> I wonder if there's an easy way to
> query the DFA for the latency of FMA vs. ADD and/or if the --param
> (or target hook?) should rather tell us sth about this latency difference?

Well, I guess that ideally this would really be an instruction-selection
decision based on whether the addend is likely to be ready or not.  But
I am not sure how to model this at the gimple level, and I did not think
I had enough time to investigate this for gcc 8, so I basically re-used
(most of) the observations that led to Venkat's

  https://gcc.gnu.org/ml/gcc-patches/2017-11/msg00437.html

but in order to prevent FMA generation, not to split it.

>
> Zen IIRC can execute two FMA128 in parallel (thus one FMA256 per cycle).
> That makes it an odd observation that avx256 can "fix" the issue.  WRT
> the above suggestion the situation needs to be clarified.

I don't have a good explanation for this; I will try to ask AMD.

>
>> I suppose that the best optimization for the above would be to split the
>> loops, but one could probably construct at least an artificial testcase
>> where the FMAs would keep enough locality that it is not the case.  The
>> mechanism can be easily extended to keep track of not just one chain but
>> a few, preferably as a followup, if people think it makes sense.
>>
>> An interesting observation is that the matrix multiplication does not
>> suffer the penalty when compiled with -O3 -mprefer-vector-width=256.
>> Apparently 256-bit vector processing can hide the latency penalty when
>> internally it is split into two halves.  The same goes for 512-bit
>> vectors.  That is why the patch leaves those be - well, there is a param
>> for the threshold which is set to zero for everybody but znver1.  If
>> maintainers of any other architecture suspect that their FMAs might
>> suffer similar latency problem, they can easily try tweaking that
>> parameter and see what happens with the matrix multiplication example.
>>
>> I have bootstrapped and tested the patch on x86_64-linux (as it is and
>> also with the param set to 256 by default to make it trigger).  I have
>> also measured run-times of all benchmarks in SPEC 2006 FP and SPEC 2017
>> FPrate and the only changes are the big improvements of calculix and
>> parest.
>>
>> After I address any comments and/or suggestions, would it be OK for
>> trunk?
>
> What does the patch do to bwaves whose innermost loop is a sequence
> of FMAs (slightly complicated by reassoc widening the sum)?

It splits the chain of FMAs at -O2 and -O3 (not at -Ofast where
reassociation does that).  It actually speeds up the testcase by over 5%
at -O2 and over 10% at -O3.

>
> So for bwaves we currently have loads feeding the multiplies which might
> cause enough latency to make the FMA issue not matter and/or there
> are intermediate non-FMA adds that might hide the issue.  Forcing
> a full FMA chain is slowing Zen down significantly though (see my report).

Apparently so.

>
> Another issue is: are FMAs not discovered anyway by combine later?
> Possibly not on x86 since all patterns use FMA rather than (plus (mult ...)).

I don't think so, but I may easily be wrong when it comes to such late
passes.

>
> One minor issue below (otherwise I was just looking at the high-level parts).
>

...

>> +static gphi *
>> +result_of_phi (tree op)
>> +{
>> +  if (TREE_CODE (op) != SSA_NAME)
>> +    return NULL;
>
>   return dyn_cast <gphi *> (SSA_NAME_DEF_STMT (op));
>

Fixed on my private branch.

> ?  Maybe inline into the single(?) user.

There are two.  I can do that, but the (TREE_CODE (op) != SSA_NAME) part
would IMHO make the code uglier.
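
For reference, with your dyn_cast suggestion the helper boils down to
the following, and inlining it would mean duplicating the SSA_NAME
check at both of its uses:

  static gphi *
  result_of_phi (tree op)
  {
    if (TREE_CODE (op) != SSA_NAME)
      return NULL;
    return dyn_cast <gphi *> (SSA_NAME_DEF_STMT (op));
  }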

...

>> +/* After processing statements of a BB and recording STATE, return true if the
>> +   initial phi is fed by the last FMA candidate result or one such result from
>> +   previously processed BBs marked in LAST_RESULT_SET.  */
>> +
>> +static bool
>> +last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,
>> +                                     hash_set<tree> *last_result_set)
>> +{
>> +  ssa_op_iter iter;
>> +  use_operand_p use;
>> +  FOR_EACH_PHI_ARG (use, state->m_initial_phi, iter, SSA_OP_USE)
>> +    {
>> +      tree t = USE_FROM_PTR (use);
>> +      if (t == state->m_last_result
>> +         || last_result_set->contains (t))
>> +       return true;
>> +    }
>
> maybe query the other way around,
>
>    if (single_imm_use (state->m_last_result, ....)
>       && def_stmt is PHI?
>

I am not sure I understand.  There are usually two uses of
state->m_last_result: one in the PHI node and at least one final use of
the computed sum after the loop.  Also, I need to check whether the
arguments are in last_result_set.
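
For illustration, in a reduction like the following (illustrative C,
not one of the actual testcases), the last candidate's result feeds
both the back edge of the accumulator PHI and the use after the loop,
so single_imm_use on it would fail:

  double
  dot (const double *a, const double *b, int n)
  {
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
      acc += a[i] * b[i];   /* result feeds the accumulator PHI ...  */
    return acc;             /* ... and this use after the loop */
  }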

Thanks for the comments,

Martin


