This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PING][PATCH][PR65443] Add transform_to_exit_first_loop_alt


On 26/05/15 12:39, Richard Biener wrote:
On Thu, 14 May 2015, Tom de Vries wrote:

On 20-04-15 14:25, Richard Biener wrote:
On Wed, 15 Apr 2015, Tom de Vries wrote:

On 03-04-15 14:39, Tom de Vries wrote:
On 27-03-15 15:10, Tom de Vries wrote:
Hi,

this patch fixes PR65443, a todo in the parloops pass for function
transform_to_exit_first_loop:
...
      TODO: the common case is that latch of the loop is empty and
immediately
      follows the loop exit.  In this case, it would be better not to
copy
the
      body of the loop, but only move the entry of the loop directly
before
the
      exit check and increase the number of iterations of the loop by
one.
      This may need some additional preconditioning in case NIT = ~0.
...

The current implementation of transform_to_exit_first_loop transforms
the
loop
by moving and duplicating the loop body. This patch transforms the
loop
into the
same normal form, but without the duplication, if that's possible
(determined by
try_transform_to_exit_first_loop_alt).

The actual transformation, done by transform_to_exit_first_loop_alt
transforms
this loop:
...
        bb preheader:
        ...
        goto <bb header>

        <bb header>:
        ...
        if (ivtmp < n)
          goto <bb latch>;
        else
          goto <bb exit>;

        <bb latch>:
        ivtmp = ivtmp + inc;
        goto <bb header>
...

into this one:
...
        bb preheader:
        ...
        goto <bb newheader>

        <bb header>:
        ...
        goto <bb latch>;

        <bb newheader>:
        if (ivtmp < n + 1)
          goto <bb header>;
        else
          goto <bb exit>;

        <bb latch>:
        ivtmp = ivtmp + inc;
        goto <bb newheader>
...


Updated patch, now using redirect_edge_var_map and flush_pending_stmts.

Bootstrapped and reg-tested on x86_64.

OK for stage1 trunk?


Ping.


Hi Richard,

thanks for the review.

Sorry for the delay (too many things to review, too much other work...)


Np, thanks for the review.

+static void
+replace_imm_uses (tree val, imm_use_iterator *imm_iter)
+{
+  use_operand_p use_p;
+
+  FOR_EACH_IMM_USE_ON_STMT (use_p, *imm_iter)
+    SET_USE (use_p, val);

Use propagate_value.  Why this odd interface passing a imm_iter?!


The replace_imm_uses function is common code factored out of
replace_uses_in_bb_by and replace_uses_in_bbs_by. I'm not sure what is odd
about the interface of the replace_imm_uses function, but I've removed the
function.

I tried using propagate_value, but that one doesn't like virtuals.

In fact most of the "repair SSA" in transform_to_exit_first_loop_alt
looks too ad-hoc to me ... it also looks somewhat familiar to other
code ...

Unfortunately the comment before the function isn't in SSA form
so it's hard to follow the transform.


I've added the ssa bits in the transformation comment before the function, and
updated variable names and comments in the function.

I consider the parloops code bitrotten, no repair possible, so
I might be convinced to not care about new spaghetti in there...

+  /* Fix up loop structure.  TODO: Check whether this is sufficient.  */
+  loop->header = new_header;
+

no, surely not.  Number of iterations (and estimates) are off
after the transform

The loop is cancelled at the end of gen_parallel_loop, and the estimations are
removed.  We only seem to be using a limited set of the loop data until the
loop is cancelled. Updated comment accordingly.

and loop->latch might also need updating.


That one actually stays the same. Updated comment accordingly.

"Safest" is to simply schedule a fixup (but you'll lose any
loop annotation in that process).


Since the loop is cancelled, AFAIU we don't need that, but... You're
referring to the annotations mentioned in
replace_loop_annotate_in_block? Losing the annotations seems to be a
generic problem of scheduling such a fixup, not particular to this
patch. Do you have a concern related to the patch and these annotations?

For example such annotations (or others added by OpenMP like the
safe_len stuff).  Generally fixups preserve those _iff_ the loop
header stays the same.  So in your case doing

   loop->header = new_header;
   set_loops_state (LOOPS_NEED_FIXUP);

would probably "work".  But yes, if the loop is cancelled anyway
there is no point jumping through hoops.

+  /* Figure out whether nit + 1 overflows.  */
+  if (TREE_CODE (nit) == INTEGER_CST)
+    {
+      if (!tree_int_cst_equal (nit, TYPE_MAXVAL (nit_type)))

in case nit_type is a pointer type TYPE_MAXVAL will be NULL I think.


We've done ivcanon just before this point, so AFAIU we can assume we're
dealing with an unsigned integer.

Is the whole exercise for performance?

I'm trying to fix the todo in the code for parloops, which is about
performance, though I don't expect a large gain.

OTOH, my main focus is a related oacc kernels issue. For an oacc kernels
region, the entire loop is enclosed in a kernels region. Peeling of the last
iteration means I have to guard that iteration to run on only one thread,
which currently isn't done, so in that sense it's a correctness issue as well.

But as the patch doesn't cover all cases you still have a correctness
issue to solve, no?  It seems to me the last iteration should simply
_not_ be in the oacc kernels region?


Correct. On gomp-4_0-branch, I've committed r223849 (submitted at https://gcc.gnu.org/ml/gcc-patches/2015-06/msg00077.html). If we don't manage to do the alt transformation for a loop in an oacc kernels region, we abandon parallelization of that loop.

FWIW, another way to deal with this problem is to guard the peeled off last loop iteration such that only one accelerator thread executes it. This is a generic problem, that we'll have to solve for the case that the user specifies sequential code in a kernels region. But if the user didn't specify sequential code, then we better not have the compiler introduce it.

In that case using an
entirely new, unsigned IV, that runs from 0 to niter should be easiest and

Introducting such an IV would mean that we don't have to update the IV, but
still we have to connect the new IV to the uses of the old IV.

The current special handling of the IV variable is actually not that
elaborate:
...
+      /* Replace ivtmp_a with ivtmp_c in condition 'if (ivtmp_a < n)'.  */
+      replace_uses_in_bb_by (res_a, res_c, new_header);
...
So I'm not sure it's easier.

just run-time guard that niter == +INF case?

That doesn't work out nicely for the oacc kernels case. I need to know
compile-time wheter I can parallelize or not.

Hmm, I see.

For the graphite case, can't you make graphite generate the
loops exit-first in the first place?


Perhaps, but that doesn't make a difference for the oacc kernels case.

The testcases are just correctness ones?  That is, there isn't
any testcase that checks whether the new code is exercised?
(no extra debugging dumping?)


There are 3 new test patterns, each with a libgomp/testsuite/libgomp.c and a
gcc/testsuite/gcc.dg variant, so 6 new test-cases in total. The
gcc/testsuite/gcc.dg variant checks that the new code is exercised. The
libgomp/testsuite/libgomp.c variant checks that the new code generates correct
code.

Reposting updated patch (for brevity, without testcases) below.

I'm ok with the patch and count on you to fix eventual fallout ;)


Great, will do.

I'll merge this patch from gomp-4_0-branch to trunk, retest and commit.

Thanks,
- Tom


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]