This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PR tree-optimization/52558]: RFC: questions on store data race


On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez <aldyh@redhat.com> wrote:
> On 04/13/12 03:46, Richard Guenther wrote:
>>
>> On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez <aldyh@redhat.com> wrote:
>
>
> Richard.  Thanks so much for reviewing and providing an alternative
> approach, which AFAICT provides superior results.
>
>
>> A similar effect could be obtained by keeping a flag whether we entered
>> the path that stored the value before the transform.  Thus, do
>>
>>   lsm = g2;  // technically not needed, if you also handle loads
>>   flag = false;
>>   for (...)
>>     {
>>        if (g1)
>>          {
>>             if (flag)
>>               g2 = lsm;
>>          }
>>        else
>>          {
>>             lsm = 0;
>>             flag = true;
>>          }
>>     }
>>   if (flag)
>>     g2 = lsm;
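The transform above can be written out as self-contained C to check its semantics. This is a sketch with invented names (`orig`, `transformed`, and globals standing in for `g1`/`g2`), not code from the patch; it only demonstrates that the flag-guarded version stores iff the original would have stored:

```c
#include <assert.h>

/* Hypothetical stand-ins for the globals in the pseudocode above.  */
static int g1;
static int g2;

/* Original loop: stores to g2 only on the else path.  */
static void orig (int iters)
{
  for (int i = 0; i < iters; i++)
    {
      if (g1)
        ;  /* some work not touching g2 */
      else
        g2 = 0;
    }
}

/* Flag-based store motion: the final store to g2 happens only if the
   storing path was actually taken, so an execution that never reaches
   the else path performs no store at all.  */
static void transformed (int iters)
{
  int lsm = g2;   /* technically not needed if loads are handled too */
  int flag = 0;
  for (int i = 0; i < iters; i++)
    {
      if (g1)
        {
          if (flag)
            g2 = lsm;
        }
      else
        {
          lsm = 0;
          flag = 1;
        }
    }
  if (flag)
    g2 = lsm;
}
```

With `g1` nonzero, `transformed` never writes `g2`; with `g1` zero, both versions leave `g2 == 0`.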
>
>
> I have implemented this in the attached patch, and it seems to be generating
> better code than my other if-changed approach.  It also avoids generating
> unnecessary loads that may trap.
>
> For the original testcase:
>
>
> int g_1 = 1;
> int g_2 = 0;
>
> int func_1(void)
> {
>   int l;
>   for (l = 0; l < 1234; l++)
>   {
>     if (g_1)
>       return l;
>     else
>       g_2 = 0;
>   }
>   return 999;
> }
>
> After all optimizations are done, we are now generating the following for
> the flags approach:
>
> new:
> func_1:
>         movl    g_1(%rip), %edx
>         xorl    %eax, %eax
>         testl   %edx, %edx
>         je      .L5
>         rep
>         ret
> .L5:
>         movl    $0, g_2(%rip)
>         movw    $999, %ax
>         ret
>
> Which is significantly better than the unmodified trunk of:
>
> func_1:
>         movl    g_1(%rip), %edx
>         movl    g_2(%rip), %eax
>         testl   %edx, %edx
>         je      .L5
>         movl    %eax, g_2(%rip)
>         xorl    %eax, %eax
>         ret
> .L5:
>         movl    $0, g_2(%rip)
>         movl    $999, %eax
>         ret
>
> And in other less contrived cases, we generate correct code that avoids
> writes on code paths that did not have them.  For example, in Hans'
> register promotion example:
>
> int count;
>
> struct obj {
>    int data;
>    struct obj *next;
> } *q;
>
> void func()
> {
>   struct obj *p;
>   for (p = q; p; p = p->next)
>     if (p->data > 0)
>       count++;
> }
>
> Under the new memory model we should avoid promoting "count" to a register
> and unilaterally writing to it upon exiting the loop.
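At the source level, the flag-based promotion the patch performs amounts to something like the following sketch. Names here (`func_promoted`, `tmp`, `wrote`) are invented for illustration; the point is that a loop which never takes the if-branch performs no store to `count`, so it cannot race with another thread writing `count`:

```c
#include <assert.h>
#include <stddef.h>

static int count;

struct obj
{
  int data;
  struct obj *next;
};

/* Flag-guarded promotion of count: keep the running value in a local
   and store back to the global only if some iteration actually
   incremented it.  */
static void func_promoted (struct obj *q)
{
  int tmp = count;   /* promoted copy */
  int wrote = 0;     /* did any iteration store?  */
  for (struct obj *p = q; p; p = p->next)
    if (p->data > 0)
      {
        tmp++;
        wrote = 1;
      }
  if (wrote)
    count = tmp;
}
```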
>
> With the attached patch, we now generate:
>
> func:
> .LFB0:
>         movq    q(%rip), %rax
>         xorl    %ecx, %ecx
>         movl    count(%rip), %edx
>         testq   %rax, %rax
>         je      .L1
> .L9:
>         movl    (%rax), %esi
>         testl   %esi, %esi
>         jle     .L3
>         addl    $1, %edx
>         movl    $1, %ecx
> .L3:
>         movq    8(%rax), %rax
>         testq   %rax, %rax
>         jne     .L9
>         testb   %cl, %cl
>         jne     .L12
> .L1:
>         rep
>         ret
> .L12:
>         movl    %edx, count(%rip)
>         ret
>
> Which is as expected.
>
> I don't understand your suggestion of:
>
>
>>   lsm = g2;  // technically not needed, if you also handle loads
>>
>> that would allow to extend this to cover the cases where the access may
>>
>> eventually trap (if you omit the load before the loop).
>
>
> Can't I just remove the lsm=g2?  What's this about also handling loads?
>
>
>>
>> Not sure which is more efficient (you can eventually combine flag variables
>> for different store locations, combining the if-changed compare is not so
>> easy I guess).
>
>
> So far, I see better code generated with the flags approach, so...
>
>
>>> 1. No pass can figure out the equality (or inequality) of g_2_lsm and g_2.
>>> In the PR, Richard mentions that both FRE/PRE can figure this out, but they
>>> are not run after store motion.  DOM should also be able to, but apparently
>>> does not :(.  Tips?
>>
>>
>> Well.  Schedule FRE after loop opts ...
>>
>> DOM is not prepared to handle loads/stores with differing VOPs - it was
>> never really updated after the alias-improvements merge.
>>
>>> 2. The GIMPLE_CONDs I create in this patch are currently causing problems
>>> with complex floats/doubles.  do_jump is complaining that it can't compare
>>> them, so I assume it is too late (in tree-ssa-loop-im.c) to compare complex
>>> values since complex lowering has already happened (??). Is there a more
>>> canonical way of creating a GIMPLE_COND that may contain complex floats at
>>> this stage?
>>
>>
>> As you are handling redundant stores you want a bitwise comparison anyway,
>> but yes, complex compares need to be lowered.  But as said, you want a
>> bitwise comparison, not a value-comparison.  You can try using
>
>
> I can ignore all this.  I have implemented both alternatives, with a target
> hook as suggested, but I'm thinking of removing it altogether and just
> leaving the flags approach.
>
>> Few comments on the patch itself.
>>
>> +         new_bb = split_edge (ex);
>> +         then_bb = create_empty_bb (new_bb);
>> +         if (current_loops && new_bb->loop_father)
>> +           add_bb_to_loop (then_bb, new_bb->loop_father);
>> +         gsi = gsi_start_bb (new_bb);
>> +         t1 = make_rename_temp (TREE_TYPE (ref->mem), NULL);
>>
>> Hmm, why do you need to re-load the value?  Don't you have both
>> the value as it was loaded before the loop and the new value (the lsm
>> tmp reg) already?  Why not simply remember the SSA name used?  Ah, because
>> store motion also uses make_rename_temp.  If you don't want to change
>> that (which would be good but inconvenient because of the required PHI
>> insertions), I'd change the code that performs the load in the pre-header
>> to do
>>
>>   SSA_NAME = ref->mem;
>>   rename-lsm-tmp = SSA_NAME;
>>
>> so you can remember SSA_NAME and re-use it here.  That probably solves
>> your optimization issue, too.
>
>
> Fixed.
>
>
>> +         for (gsi = gsi_start_phis (old_dest); !gsi_end_p (gsi); gsi_next (&gsi))
>> +           {
>> +             gimple phi = gsi_stmt (gsi);
>> +             unsigned i;
>> +
>> +             for (i = 0; i < gimple_phi_num_args (phi); i++)
>> +               if (gimple_phi_arg_edge (phi, i)->src == new_bb)
>> +                 {
>> +                   tree arg = gimple_phi_arg_def (phi, i);
>> +                   add_phi_arg (phi, arg, then_old_edge, UNKNOWN_LOCATION);
>> +                   update_stmt (phi);
>> +                 }
>> +           }
>>
>> Hmm.  This is of course doing unnecessary work in case there are multiple
>> moved stores.  In fact splitting the exit edge is only necessary if there
>> is more than one predecessor.  Otherwise you can simply split the exit block
>> after inserting the new conditional after its labels.
>
>
> Is this still the case with the current patch?  If so, I'm a bit confused as
> to what you want here.
>
>
>>
>> +  if (opts->x_flag_tm)
>> +    {
>> +      if (opts->x_flag_non_call_exceptions)
>> +        sorry ("transactional memory is not supported with non-call exceptions");
>> +
>> +      set_param_value ("allow-load-data-races", 0,
>> +                       opts->x_param_values, opts_set->x_param_values);
>> +      set_param_value ("allow-store-data-races", 0,
>> +                       opts->x_param_values, opts_set->x_param_values);
>>
>> Unless the user explicitly set those params?  Which means using params
>> wasn't a very good idea ... ideally you'd diagnose an error when the user
>> would mix -fgnu-tm and -fallow-store-data-races.  So - can we remove
>> allow-load-data-races (we are never going to implement that) and make
>> allow-store-data-races a proper -f flag please?
>
>
> Sorry, that was not meant for public consumption.  I have rewritten this to
> either take the param value as is, or in the case of flag_tm, only restrict
> the optimization if the loop is inside an actual transaction (see the
> changes to trans-mem.c, cause I know you'll hate bb_in_transaction() :-)).
>
>
>>
>> +    }
>>
>> And that should be a separate patch
>
>
> I can certainly reimplement --param=allow-{load,store}-data-races as proper
> -f flags.  I don't care either way.  But I remember Ian Taylor having a
> different opinion.  If y'all agree, I can implement whatever.
>
>
>> It would be nice if you could think about LIM/SM for possibly trapping
>> loads/stores while you are doing this work.
>
>
> We are no longer adding additional trapping loads (as with the if-changed
> approach).  Is this something else you'd like me to concentrate on?
>
> Please take a look at the attached patch.  I tested the flags alternative on
> a full bootstrap, and there's only one regression, which I will look at, but
> I'd like to know if the current WIP is aligned with what you had in mind.

+  /* ?? Perhaps we should cache this somewhere in the BB, or are
+     multiple levels of empty BB's common. ??  */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      int res = bb_in_transaction_1 (e->src);
+      if (res != -1)
+       return (bool) res;
+    }
+  return false;

that's weird code - if the predecessors do not all agree on being in or out
of a transaction, you return whichever answer the first decidable predecessor
happens to give.  So it seems this is somewhat fragile.

If bb status can be derived from looking at a single stmt why is the
transaction-ness not recorded as a basic-block flag then?  Thus,
why do we have a bit in gimple stmts?

Why can nobody merge a transaction and a non-transaction basic-block
making your predicate above wrong because only the 2nd stmt is
in a transaction?
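To make the fragility concrete, here is a toy model of that predecessor walk; the types and names (`toy_bb`, `toy_bb_in_transaction`) are invented for illustration and are not GCC's real data structures. With predecessors that disagree, the answer depends purely on edge order:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the predicate under review: a block's status is
   -1 (undecided: keep walking), 0 (not in a transaction) or
   1 (in a transaction).  */
struct toy_bb
{
  int status;                 /* -1 / 0 / 1 */
  struct toy_bb *preds[2];
  int n_preds;
};

static int toy_bb_in_transaction (struct toy_bb *bb)
{
  if (bb->status != -1)
    return bb->status;
  for (int i = 0; i < bb->n_preds; i++)
    {
      int res = toy_bb_in_transaction (bb->preds[i]);
      if (res != -1)
        return res;           /* first decidable predecessor wins */
    }
  return 0;
}
```

Querying a merge block whose predecessors are one in-transaction and one not-in-transaction block returns different answers depending on which predecessor edge comes first, which is exactly the "random value" objection above.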

+  bool single_exit_p = single_pred_p (ex->dest);

that's a strange variable name ;)

+/* Helper function for execute_sm.  On every path that sets REF, set
+   an appropriate flag indicating the store.  */
+
+static tree
+execute_sm_if_changed_flag_set (struct loop *loop, mem_ref_p ref)
+{
...
+  FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
+    {
+      gimple_stmt_iterator gsi;
+      gimple stmt;
+
+      gsi = gsi_start_bb (gimple_bb (loc->stmt));
+      for (; !gsi_end_p (gsi); gsi_next (&gsi))
+       if (gsi_stmt (gsi) == loc->stmt)

does not seem to do it on every path but on every REF setter.  And instead
of the innermost loop just use gsi_for_stmt (loc->stmt).

+  /* Emit the load code into the latch, so that we are sure it will
+     be processed after all dependencies.  */
+  latch_edge = loop_latch_edge (loop);
+  load = gimple_build_assign (mem_ssa, unshare_expr (ref->mem));
   lim_data = init_lim_data (load);
   lim_data->max_loop = loop;
   lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);
+  load = gimple_build_assign (tmp_var, mem_ssa);
+  lim_data = init_lim_data (load);
+  lim_data->max_loop = loop;
+  lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);

the 2nd assign is no longer a "load", so I suppose you don't need any lim_data
for it?

+      else if (store == STORE_IF_CHANGED_FLAG)
+       execute_sm_if_changed (ex, ref->mem, mem_ssa, tmp_var,
+                              store_flag);

and this can pass NULL instead of mem_ssa?

Overall this looks reasonable.  With also handling trapping things I meant
that we should be able to apply store-motion to

int a[256];
void foo (int *p)
{
  int i;
  for (i = 0; i < 256; i++)
    if (flag)
      *p = a[i];
}

but we cannot, because the transform

  lsm = *p;
  for (i = 0; i < 256; ++i)
    if (flag)
      lsm = a[i];
  *p = lsm;

even when the store is properly conditional the load is not (and you do not
change that).  The unconditional load may trap here.  What I meant to say
is that we do not need the load at all unless there is also a load involved
inside the loop - if we use the flag-based approach.  If there is also a load
inside the loop we'd need to perform the load there (or not handle this
case).
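Spelled out in C, the flag-based form of that example never touches *p unless a store actually happened, so the pre-loop load (and its potential trap) disappears entirely. This is a hand-written sketch with invented names (`foo_sm`, `wrote`), assuming `flag` is some global condition as in the example above:

```c
#include <assert.h>
#include <stddef.h>

static int a[256];
static int flag;

/* Flag-based store motion applied by hand: there is no load of *p at
   all, so the transform remains valid even if p points to memory that
   cannot be read.  */
static void foo_sm (int *p)
{
  int lsm = 0;    /* never consulted unless a store happened */
  int wrote = 0;
  for (int i = 0; i < 256; i++)
    if (flag)
      {
        lsm = a[i];
        wrote = 1;
      }
  if (wrote)
    *p = lsm;
}
```

When `flag` is false, `foo_sm` can even be passed a null pointer safely, since `*p` is neither read nor written.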

But overall this looks nice!

Thanks,
Richard.

> Thanks again.
> Aldy

