This is the mail archive of the mailing list for the GCC project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: PING: Re: [patch] implement simd loops in trunk (OMP_SIMD)

On 08/19/13 13:18, Aldy Hernandez wrote:

Well, apparently this last revision was approved and I didn't even know about it :).

Tested one last time against current trunk and committed as revision 202029.

My apologies to Jakub for the merge problems he will inherit on the gomp-4_0-branch.


On 08/19/13 04:35, Jakub Jelinek wrote:
On Thu, Aug 01, 2013 at 09:38:31AM -1000, Richard Henderson wrote:
+      if (simd
+         /*
+         || (fd->sched_kind == OMP_CLAUSE_SCHEDULE_STATIC
+         && !fd->have_ordered)*/)

Debugging leftovers or what?

gomp-4_0-branch contains also the || there, but for the trunk merge
I think Aldy can leave that comment out, I'll deal with it when
merging the
trunk changes back to gomp-4_0-branch.

Removed comment and the enclosed code altogether.

+  /* Enforce simdlen 1 in simd loops with data sharing clauses
+     variable sized vars.  That is unnecessarily hard to support
and very
+     unlikely to result in vectorized code anyway.  */
+  if (is_simd)
+    for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c))
+      switch (OMP_CLAUSE_CODE (c))
+       {
+           max_vf = 1;
+         /* FALLTHRU */
+       case OMP_CLAUSE_PRIVATE:
+       case OMP_CLAUSE_LINEAR:
+         if (is_variable_sized (OMP_CLAUSE_DECL (c)))
+           max_vf = 1;
+         break;
+       default:
+         continue;
+       }

Comment is wrong, code is right, adjusting max_vf not simdlen.

max_vf = 1 causes
           c = build_omp_clause (UNKNOWN_LOCATION, OMP_CLAUSE_SAFELEN);
           OMP_CLAUSE_SAFELEN_EXPR (c) = build_int_cst
so I think the comment should just be changed s/simdlen/safelen/,
always get
those 2 wrong...  simdlen is for elemental functions, safelen for simd

Adjusted comment to document both things-- that max_vf is being set, and
that this later triggers a safelen of 1:

+  /* Set max_vf=1 (which will later enforce safelen=1) in simd loops
+     with data sharing clauses referencing variable sized vars.  That
+     is unnecessarily hard to support and very unlikely to result in
+     vectorized code anyway.  */

+  /* If max_vf is non-NULL, then we can use only vectorization factor
+     up to the max_vf we chose.  So stick it into safelen clause.  */
+  if (max_vf)
+    {
+      tree c = find_omp_clause (gimple_omp_for_clauses (ctx->stmt),
+                               OMP_CLAUSE_SAFELEN);

First, non-zero, not non-null -- max_vf is not a pointer.



Second, I don't understand why we're adjusting safelen.  The VF we
for optimizing the loop (even 1), does not change the fact that there
no dependencies between loop iterations larger than that.

No, it adds depenencies.  We choose some size of the per-simd lane
and we must not allow larger vectorization factor than that, as every
lane should have different object.  The size is normally the maximum
possible value for a target, so it should be big enough, but we really
it to be recorded, so that we don't generate invalid code.  max_vf = 1
is for the cases where we punt on handling it properly, those loops
for now
can't be vectorized just because they had safelen clause (normal
vectorization rules will probably also punt on those).  In the future if
e.g. privatized per-SIMD lane VLAs are common we can improve that, but
now I somehow doubt that would be the case.

I'll let you two sort this out.  Nothing done.

Did you want to be adding a _max_vf_ attribute to record stuff that we
learned while examining the omp loop?  E.g. the max_vf=1 reduction

Given the only adjustment we make to max_vf is to disable vectorization,
did we in fact want to discard the simd attribute instead, making this a
"regular" openmp loop?

See above, it is not only about disabling it, we record the sizes of the
arrays there.  And, for SAFELEN clause with 1, we actually don't set
loop->force_vect and set loop->safelen to 0 (i.e. normal loop).

Similarly, you too can sort this out.

+       if (__builtin_expect (N32 cond3 N31, 0)) goto ZERO_ITER_BB;
+       if (cond3 is <)
+         adj = STEP3 - 1;
+       else
+         adj = STEP3 + 1;
+       count3 = (adj + N32 - N31) / STEP3;
+       if (__builtin_expect (N22 cond2 N21, 0)) goto ZERO_ITER_BB;
+       if (cond2 is <)
+         adj = STEP2 - 1;
+       else
+         adj = STEP2 + 1;
+       count2 = (adj + N22 - N21) / STEP2;
+       if (__builtin_expect (N12 cond1 N11, 0)) goto ZERO_ITER_BB;

CEIL_DIV_EXPR, instead of TRUNC_DIV_EXPR and adjusting by hand?
Unless we
can't generate the same code in the end because generically we don't
know that
the values involved must be negative for GT?

This is just moving code around (and merging multiple copies of it),
the old
code was also using TRUNC_DIV_EXPR.  I think we can try CEIL_DIV_EXPR,
I'd strongly prefer to only do it incrementally and separately from these

Added a comment for a future enhancement.

I wonder if we do better mooshing all of the BBs together, creating
one larger
BB with all the computation and the unexpected jump at the end.  I.e.

   bool zero3, zero2, zero1, zero;

   zero3 = N32 c3 N31;
   count3 = (N32 - N31) /[cl] STEP3;
   zero2 = N22 c2 N21;
   count2 = (N22 - N21) /[cl] STEP2;
   zero1 = N12 c1 N11;
   count1 = (N12 - N11) /[cl] STEP1;
   zero = zero3 || zero2 || zero1;
   count = count1 * count2 * count3;
   if (__builtin_expect(zero, false)) goto zero_iter_bb;

After all, we expect the zero=false, and thus we expect to have to
evaluate all
of the comparison expressions, so short-circuiting oughtn't be a
win.  Since
the condition isn't protecting a denominator, we're not concerned about
divide-by-zero, so we can fully evaluate count even if a numerator
turned out
to be wrong.

It seems like putting this all together would create much better
opportunities, and less pressure on the chip's branch predictor.

Ditto this.  Though, from my experience, on i?86/x86_64 the above
results in
tons of setne/sete insns and it would surprise me if the generated
code was

Added a big note arguing for one big BB as a future enhancement.

@@ -291,6 +300,15 @@ vect_analyze_data_ref_dependence (struct

You've two adjustments in this function.  First hunk for "dont_know",
which is
fine; second hunk for "data is bad", which I suppose is really a form of
dont_know and thus also fine.

But it seems like there should be a third hunk in the "know" section,
in which
we perform some MAX(dist, loop->safelen) computation inside that loop.

The user guarantee IMHO is useful only if you don't know or would give up
otherwise, if you can compute the exact distance, it is better to just
what you've computed.

Alternately, why are we patching this function at all, rather than
whatever bit
of code creates the data_dependence_relation data?

This is the spot where we first see *max_vf and can tweak it, in
tree-data-ref.c vectorization factor can't be adjusted.

I expect all this will be a NO-OP as well.  Nothing done.

+  hash_table <simduid_to_vf> simduid_to_vf_htab;
+  hash_table <decl_to_simduid> decl_to_simduid_htab;

Why two hashes?  Seems like you can map from decl to vf directly.  At
point do you have a simduid integer, but not the decl from whence it

simduid_to_vf is for mapping simduid to vectorization factor of that
decl_to_simduid maps the array VAR_DECLs with "omp declare simd"
to the simduid of the loop they are used in; you can have several such
arrays in a loop, or none.

As mentioned earlier, I have changed the nomenclature and documented
this a bit better.

How does this look?

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]