This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: PING: Re: [patch] implement simd loops in trunk (OMP_SIMD)


On 08/19/13 13:18, Aldy Hernandez wrote:

Well, apparently this last revision was approved and I didn't even know about it :).

Tested one last time against current trunk and committed as revision 202029.

My apologies to Jakub for the merge problems he will inherit on the gomp-4_0-branch.

Aldy

On 08/19/13 04:35, Jakub Jelinek wrote:
On Thu, Aug 01, 2013 at 09:38:31AM -1000, Richard Henderson wrote:
+      if (simd
+         /*
+         || (fd->sched_kind == OMP_CLAUSE_SCHEDULE_STATIC
+         && !fd->have_ordered)*/)

Debugging leftovers or what?

gomp-4_0-branch also contains the || there, but for the trunk merge I think
Aldy can leave that comment out; I'll deal with it when merging the trunk
changes back to gomp-4_0-branch.

Removed comment and the enclosed code altogether.

+  /* Enforce simdlen 1 in simd loops with data sharing clauses referencing
+     variable sized vars.  That is unnecessarily hard to support and very
+     unlikely to result in vectorized code anyway.  */
+  if (is_simd)
+    for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c))
+      switch (OMP_CLAUSE_CODE (c))
+       {
+       case OMP_CLAUSE_REDUCTION:
+         if (OMP_CLAUSE_REDUCTION_PLACEHOLDER (c))
+           max_vf = 1;
+         /* FALLTHRU */
+       case OMP_CLAUSE_PRIVATE:
+       case OMP_CLAUSE_FIRSTPRIVATE:
+       case OMP_CLAUSE_LASTPRIVATE:
+       case OMP_CLAUSE_LINEAR:
+         if (is_variable_sized (OMP_CLAUSE_DECL (c)))
+           max_vf = 1;
+         break;
+       default:
+         continue;
+       }

The comment is wrong, the code is right: it adjusts max_vf, not simdlen.

max_vf = 1 causes
           c = build_omp_clause (UNKNOWN_LOCATION, OMP_CLAUSE_SAFELEN);
           OMP_CLAUSE_SAFELEN_EXPR (c) = build_int_cst (integer_type_node,
                                                        max_vf);
so I think the comment should just be changed s/simdlen/safelen/; I always
get those two wrong...  simdlen is for elemental functions, safelen is for
simd loops.

Adjusted the comment to document both things: that max_vf is being set, and
that this later triggers a safelen of 1:

+  /* Set max_vf=1 (which will later enforce safelen=1) in simd loops
+     with data sharing clauses referencing variable sized vars.  That
+     is unnecessarily hard to support and very unlikely to result in
+     vectorized code anyway.  */
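For reference, here is a minimal sketch of the kind of user code this case
is aimed at (the function and variable names are made up for illustration):
"tmp" is variable sized because "n" is only known at run time, so the data
sharing clause forces max_vf = 1 and therefore an effective safelen of 1.

   void
   scale (float *a, int n, int m)
   {
     float tmp[n];                  /* variable sized var */
   #pragma omp simd private(tmp)    /* data sharing clause referencing it */
     for (int i = 0; i < m; i++)
       {
         tmp[0] = a[i];
         a[i] = tmp[0] * 2.0f;
       }
   }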


+  /* If max_vf is non-NULL, then we can use only vectorization factor
+     up to the max_vf we chose.  So stick it into safelen clause.  */
+  if (max_vf)
+    {
+      tree c = find_omp_clause (gimple_omp_for_clauses (ctx->stmt),
+                               OMP_CLAUSE_SAFELEN);

First, non-zero, not non-null -- max_vf is not a pointer.

Ack.

Fixed.


Second, I don't understand why we're adjusting safelen.  The VF we choose
for optimizing the loop (even 1) does not change the fact that there are no
dependencies between loop iterations larger than that.

No, it adds dependencies.  We choose some size for the per-simd-lane arrays,
and we must not allow a larger vectorization factor than that, as every simd
lane should have a different object.  The size is normally the maximum
possible value for a target, so it should be big enough, but we really need
it to be recorded so that we don't generate invalid code.  max_vf = 1 is for
the cases where we punt on handling it properly; those loops for now can't
be vectorized just because they had a safelen clause (normal vectorization
rules will probably also punt on those).  In the future, if e.g. privatized
per-SIMD lane VLAs are common, we can improve that, but right now I somehow
doubt that will be the case.

I'll let you two sort this out.  Nothing done.
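As an aside, here is a hand-written illustration (not GCC's actual lowering)
of the constraint described above: a per-simd-lane array of some fixed size
backs the privatized object, so the vectorizer must never pick a VF larger
than that size, which is exactly what recording it in safelen guarantees.

   /* MAX_VF is the per-lane array size chosen when the loop is lowered,
      normally the target maximum.  */
   enum { MAX_VF = 16 };

   void
   saxpy (float *a, const float *b, int n)
   {
     float priv[MAX_VF];            /* one slot per simd lane */
   #pragma omp simd safelen(16)     /* never use a VF larger than 16 */
     for (int i = 0; i < n; i++)
       {
         priv[i % MAX_VF] = 2.0f * b[i];   /* lane-private temporary */
         a[i] += priv[i % MAX_VF];
       }
   }

With a VF of at most 16, the slots touched by one vector iteration are all
distinct; with a larger VF two lanes would share a slot and the result would
be wrong.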


Did you want to add a _max_vf_ attribute to record things we learned while
examining the omp loop, e.g. the max_vf=1 reduction above?

Given that the only adjustment we make to max_vf is to disable vectorization,
did we in fact want to discard the simd attribute instead, making this a
"regular" OpenMP loop?

See above, it is not only about disabling it; we record the sizes of the
arrays there.  And, for a SAFELEN clause with 1, we actually don't set
loop->force_vect and set loop->safelen to 0 (i.e. a normal loop).

Similarly, I'll let you two sort this out.


+       if (__builtin_expect (N32 cond3 N31, 0)) goto ZERO_ITER_BB;
+       if (cond3 is <)
+         adj = STEP3 - 1;
+       else
+         adj = STEP3 + 1;
+       count3 = (adj + N32 - N31) / STEP3;
+       if (__builtin_expect (N22 cond2 N21, 0)) goto ZERO_ITER_BB;
+       if (cond2 is <)
+         adj = STEP2 - 1;
+       else
+         adj = STEP2 + 1;
+       count2 = (adj + N22 - N21) / STEP2;
+       if (__builtin_expect (N12 cond1 N11, 0)) goto ZERO_ITER_BB;

CEIL_DIV_EXPR, instead of TRUNC_DIV_EXPR and adjusting by hand?  Unless we
can't generate the same code in the end because generically we don't know
that the values involved must be negative for GT?

This is just moving code around (and merging multiple copies of it); the old
code was also using TRUNC_DIV_EXPR.  I think we can try CEIL_DIV_EXPR, but
I'd strongly prefer to do that incrementally and separately from these
changes.

Added a comment for a future enhancement.
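For the record, a small stand-alone illustration of the equivalence in
question, assuming cond is "<" and STEP is positive (the GT case mirrors it
with a negative step and adj = STEP + 1):

   /* The hand adjustment plus TRUNC_DIV_EXPR:
        count = (adj + N2 - N1) / STEP   with adj = STEP - 1
      computes the same value a CEIL_DIV_EXPR of (N2 - N1) by STEP would.  */
   long
   iteration_count_lt (long n1, long n2, long step)   /* step > 0, n1 < n2 */
   {
     long adj = step - 1;
     return (adj + n2 - n1) / step;    /* == ceil ((n2 - n1) / step) */
   }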


I wonder if we do better mooshing all of the BBs together, creating one
larger BB with all the computation and the unexpected jump at the end.  I.e.

   bool zero3, zero2, zero1, zero;

   zero3 = N32 c3 N31;
   count3 = (N32 - N31) /[cl] STEP3;
   zero2 = N22 c2 N21;
   count2 = (N22 - N21) /[cl] STEP2;
   zero1 = N12 c1 N11;
   count1 = (N12 - N11) /[cl] STEP1;
   zero = zero3 || zero2 || zero1;
   count = count1 * count2 * count3;
   if (__builtin_expect(zero, false)) goto zero_iter_bb;

After all, we expect zero=false, and thus we expect to have to evaluate all
of the comparison expressions, so short-circuiting oughtn't be a win.  Since
the condition isn't protecting a denominator, we're not concerned about
divide-by-zero, so we can fully evaluate count even if a numerator turned
out to be wrong.

It seems like putting this all together would create much better scheduling
opportunities, and less pressure on the chip's branch predictor.

Ditto this.  Though, in my experience, on i?86/x86_64 the above results in
tons of setne/sete insns, and it would surprise me if the generated code was
faster.

Added a big note arguing for one big BB as a future enhancement.
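For concreteness, a rough C rendering of the single-block form (purely
illustrative; it mirrors the zero1/zero2/zero3 tests as written in the sketch
above, assumes all three conditions are "<" with positive steps, and spells
the /[cl] ceiling divisions by hand), showing how bitwise ORs replace the
short-circuit branches:

   long
   total_count (long n11, long n12, long step1,
                long n21, long n22, long step2,
                long n31, long n32, long step3)
   {
     _Bool zero3 = n32 < n31;                        /* zero3 = N32 c3 N31 */
     long count3 = (n32 - n31 + step3 - 1) / step3;  /* /[cl] by hand */
     _Bool zero2 = n22 < n21;
     long count2 = (n22 - n21 + step2 - 1) / step2;
     _Bool zero1 = n12 < n11;
     long count1 = (n12 - n11 + step1 - 1) / step1;
     _Bool zero = zero3 | zero2 | zero1;     /* bitwise |, no extra branches */
     long count = count1 * count2 * count3;  /* harmless even if zero is set */
     if (__builtin_expect (zero, 0))
       return 0;                             /* the zero_iter_bb case */
     return count;
   }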


@@ -291,6 +300,15 @@ vect_analyze_data_ref_dependence (struct data_dependence_re

You've two adjustments in this function.  First hunk for "dont_know", which
is fine; second hunk for "data is bad", which I suppose is really a form of
dont_know and thus also fine.

But it seems like there should be a third hunk in the "know" section, in
which we perform some MAX(dist, loop->safelen) computation inside that loop.

The user guarantee IMHO is useful only if you don't know or would give up
otherwise; if you can compute the exact distance, it is better to just use
what you've computed.

Alternatively, why are we patching this function at all, rather than
whatever bit of code creates the data_dependence_relation data?

This is the spot where we first see *max_vf and can tweak it; in
tree-data-ref.c the vectorization factor can't be adjusted.

I expect all this will be a NO-OP as well.  Nothing done.
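As background, a tiny example of the user guarantee a safelen clause encodes
(illustrative only): the dependence distance below is unknown at compile
time, so the dependence analysis alone would have to give up, but the clause
allows any VF up to 8.

   void
   shift_add (float *a, int k, int n)   /* caller promises k >= 8 */
   {
   #pragma omp simd safelen(8)          /* no dependences shorter than 8 */
     for (int i = k; i < n; i++)
       a[i] = a[i - k] + 1.0f;
   }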


+  hash_table <simduid_to_vf> simduid_to_vf_htab;
+  hash_table <decl_to_simduid> decl_to_simduid_htab;

Why two hashes?  Seems like you can map from decl to vf directly.  At what
point do you have a simduid integer, but not the decl from whence it came?

simduid_to_vf is for mapping a simduid to the vectorization factor of that
loop.  decl_to_simduid maps the array VAR_DECLs with the "omp declare simd"
attribute to the simduid of the loop they are used in; you can have several
such arrays in a loop, or none.

As mentioned earlier, I have changed the nomenclature and documented
this a bit better.
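For illustration, a minimal plain-C sketch of the two-level mapping being
described (the field layout is made up; the real code uses GCC's hash_table
templates):

   /* A decl maps to its loop's simduid, and each simduid maps to the
      vectorization factor eventually chosen for that loop, so conceptually
        vf = simduid_to_vf[decl_to_simduid[decl]]  */
   struct decl_to_simduid_entry { void *decl; unsigned int simduid; };
   struct simduid_to_vf_entry { unsigned int simduid; int vf; };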

How does this look?

