Bug 61034 - Optimizing takes too many passes
Summary: Optimizing takes too many passes
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 5.0
: P3 normal
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2014-05-02 10:52 UTC by Marc Glisse
Modified: 2015-11-26 11:32 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2014-05-05 00:00:00


Attachments
hack (1.64 KB, patch)
2014-05-05 10:48 UTC, Richard Biener
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Marc Glisse 2014-05-02 10:52:47 UTC
Hello,

when I compile the code below (-O3), I have plenty of calls to malloc left. If I make the final expression simpler, they disappear. If I add some more fre/copy_prop/dse/dce passes, they eventually get optimized, but the number of passes required seems to increase with the size of the expression, which obviously doesn't scale. Cycling through the passes seems like an obvious way to work around this. Otherwise, I don't really know which pass to blame this on. Maybe FRE could do better, it leaves some:

  _79 = _73; 
  _79->count = _81;
[things that don't alias, small if-then-else]
  _86 = _73;
  _87 = _86->count;

which PRE will handle later after other passes have cleaned it up a bit and removed everything in the middle []. But it isn't at all obvious this is the right place to improve things.

The code is a reference counted double. I was trying to optimize something more complicated, but the simpler things have to come first...


#define assume(x) if(!(x))__builtin_unreachable()
inline void* operator new(__SIZE_TYPE__ n){ return __builtin_malloc(n); }
inline void operator delete(void *p) { __builtin_free(p); }
struct O {
  double num;
  int count;
};
struct I {
  O *o;
  I(double d = 0) : o (new O) { o->num = d; o->count = 1; }
  I(I const&i) { assume(i.o->count >= 1); o = i.o; ++o->count; }
  I& operator=(I const&i) { I(i).swap(*this); return *this; }
  ~I() { if (--o->count == 0) delete o; }
  void swap(I& i) { O *tmp = o; o = i.o; i.o = tmp; }
  I& operator*= (I const&i) {
    if (o->count > 1) *this = I(o->num);
    o->num *= i.o->num;
    return *this;
  }
  I& operator-= (I const&i) {
    if (o->count > 1) *this = I(o->num);
    o->num -= i.o->num;
    return *this;
  }
};
inline I operator* (I a, I const&b) { return a *= b; }
inline I operator- (I a, I const&b) { return a -= b; }
inline bool operator< (I const&a, I const&b) { return a.o->num < b.o->num; }

bool f(I a, I b, I c, I d) {
  return (a * d - b * c) * (a * b - c * d) < 42;
}
Comment 1 Richard Biener 2014-05-05 09:08:21 UTC
You are talking about

  _73 = __builtin_malloc (16);
...
  _79 = _73;
...
  _79->count = _81;
...
  _83->count = _85;
...
  _86 = _73;
  _87 = _86->count;

note that FRE doesn't do copy-propagation at its simplification stage.
Usually the issue is that alias info isn't the best possible, and indeed:

  # PT = nonlocal escaped
  _2 = MEM[(const struct I &)c_3(D)].o;
...
  # PT = nonlocal escaped { D.2585 }
  _83 = _2;
...
  # PT = { D.2585 }
  _86 = _73;
  _87 = _86->count;

so _86->count is possibly clobbered by the store because the pointer may
point to the malloc result.

So ... as FRE figures out some pointer equivalencies it could ideally
also take the stricter points-to info into account.  It already somewhat
does this, but obviously not fully (see vn_reference_lookup_3, the
"First try to disambiguate ..." code at the beginning).

Hmm.

Confirmed (FRE is to blame).

I will have a look (at some point).
Comment 2 Richard Biener 2014-05-05 10:25:15 UTC
Actually it's the conditional free that "clobbers" the value as _83 may
point to the same thing as _86 (so we don't CSE (possible) use-after-frees).

  if (_85 == 0)
    goto <bb 6>;
  else     
    goto <bb 7>;
           
  <bb 6>:  
  # .MEM_287 = VDEF <.MEM_286>
  # USE = nonlocal
  # CLB = nonlocal
  __builtin_free (_83);
           
  <bb 7>:  
  # .MEM_40 = PHI <.MEM_286(5), .MEM_287(6)>

and we don't use VN to improve that disambiguation result (we know that
_83 == _2 and that doesn't point to the malloc result in question).
Comment 3 Richard Biener 2014-05-05 10:33:50 UTC
points-to cannot be improved here as it is flow-insensitive for memory
accesses and the pointers are copied:

  D.2327.o = _2;
  D.2327.o = _79;

thus they blend together.

So we have to improve alias walking by also making get_continuation_for_phi
(and callees) use the 'translate' callback and improve the 'translate'
callback in VN to handle calls (which it doesn't at the moment, and it's
also somewhat difficult to do that ... if done generically, we can of
course special-case builtins we can ignore for CSE/PRE purposes here).
Comment 4 Richard Biener 2014-05-05 10:48:53 UTC
Created attachment 32736 [details]
hack

patch that does the job (but has wrong-code issues, so I need to add a new
hook to walk_non_aliased_vuses instead).

  _88 = 1;

you might want to check whether that improves things and point me to stuff
that is left (FRE related).
Comment 5 Marc Glisse 2014-05-05 11:56:28 UTC
(In reply to Richard Biener from comment #4)
> you might want to check whether that improves things and point me to stuff
> that is left (FRE related).

The fre2 dump looks much nicer (the .optimized dump is a bit surprising, I would need more time to look at it). It is not sufficient, but it clearly helps.

In the fre2 dump, I still see (skipping many lines, I may have skipped some essential ones):

  D.2325.o = _4;
  D.2325.o = _101;
  _105 = _4;
  __builtin_free (_105);
  _96 = D.2325.o;

(instead of _96 = _101)
Comment 6 rguenther@suse.de 2014-05-05 12:18:59 UTC
On Mon, 5 May 2014, glisse at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=61034
> 
> --- Comment #5 from Marc Glisse <glisse at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #4)
> > you might want to check whether that improves things and point me to stuff
> > that is left (FRE related).
> 
> The fre2 dump looks much nicer (the .optimized dump is a bit surprising, I
> would need more time to look at it). It is not sufficient, but it clearly
> helps.
> 
> In the fre2 dump, I still see (skipping many lines, I may have skipped some
> essential ones):
> 
>   D.2325.o = _4;
>   D.2325.o = _101;
>   _105 = _4;
>   __builtin_free (_105);
>   _96 = D.2325.o;
> 
> (instead of _96 = _101)

that's a conditional assignment AFAICS
Comment 7 Marc Glisse 2014-05-05 13:40:58 UTC
(In reply to rguenther@suse.de from comment #6)
> that's a conditional assignment AFAICS

Ah, you are right of course. It shouldn't be conditional, but it will take a VRP pass to notice that. If I schedule another FRE right after VRP1, things optimize nicely, and after some cleanup by DOM+DSE, DCE2 can remove all malloc+free. However, if I don't add this extra FRE pass, we somehow don't manage. Note that in the PRE dump, with just your patch (no extra pass), I see:

  pretmp_92 = 1;
  _235 = pretmp_92;
  if (_235 == 0)

and these conditions seem to be what prevents us from finishing the job.
Comment 8 Richard Biener 2014-05-07 11:08:21 UTC
(In reply to Marc Glisse from comment #7)
> (In reply to rguenther@suse.de from comment #6)
> > that's a conditional assignment AFAICS
> 
> Ah, you are right of course. It shouldn't be conditional, but it will take a
> VRP pass to notice that. If I schedule another FRE right after VRP1, things
> optimize nicely, and after some cleanup by DOM+DSE, DCE2 can remove all
> malloc+free. However, if I don't add this extra FRE pass, we somehow don't
> manage. Note that in the PRE dump, with just your patch (no extra pass), I
> see:
> 
>   pretmp_92 = 1;
>   _235 = pretmp_92;
>   if (_235 == 0)
> 
> and these conditions seem to be what prevents us from finishing the job.

Yeah.  Looks somewhat tricky, but I'll play with it.  Meanwhile testing
a proper patch for the first issue.
Comment 9 Richard Biener 2014-05-07 11:50:04 UTC
(In reply to Richard Biener from comment #8)
> (In reply to Marc Glisse from comment #7)
> > (In reply to rguenther@suse.de from comment #6)
> > > that's a conditional assignment AFAICS
> > 
> > Ah, you are right of course. It shouldn't be conditional, but it will take a
> > VRP pass to notice that. If I schedule another FRE right after VRP1, things
> > optimize nicely, and after some cleanup by DOM+DSE, DCE2 can remove all
> > malloc+free. However, if I don't add this extra FRE pass, we somehow don't
> > manage. Note that in the PRE dump, with just your patch (no extra pass), I
> > see:
> > 
> >   pretmp_92 = 1;
> >   _235 = pretmp_92;
> >   if (_235 == 0)
> > 
> > and these conditions seem to be what prevents us from finishing the job.
> 
> Yeah.  Looks somewhat tricky, but I'll play with it.  Meanwhile testing
> a proper patch for the first issue.

Ok, so we have after PRE

<bb 39>:
_240 = 2;
_241 = 1;
MEM[(struct O *)_73].count = _241;
if (1 == 0)
  goto <bb 40>;
else
  goto <bb 53>;

<bb 53>:
goto <bb 41>;

<bb 40>:
__builtin_free (_73);
pretmp_48 = MEM[(struct O *)_73].count;

<bb 41>:
# prephitmp_35 = PHI <1(53), pretmp_48(40)>
D.2328 ={v} {CLOBBER};
D.2328 ={v} {CLOBBER};
_242 = prephitmp_35;
if (_242 == 1)
  goto <bb 42>;
else
  goto <bb 54>;

and only CFG-cleanup will simplify the PHI node by removing the unreachable
path and thus we end up with

_242 = 1;
if (_242 == 1)

of course the above shows another case where free() bogusly is thought
to clobber the load from count.  But in general as SCCVN/PRE is not
predicate aware (or optimistically treating edges as not executed) the
issue cannot always be avoided.  Applying copy-propagation during
elimination would make CFG-cleanup do the job though.
Comment 10 Richard Biener 2014-05-07 14:19:46 UTC
Author: rguenth
Date: Wed May  7 14:19:14 2014
New Revision: 210160

URL: http://gcc.gnu.org/viewcvs?rev=210160&root=gcc&view=rev
Log:
2014-05-07  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/61034
	* tree-ssa-alias.c (call_may_clobber_ref_p_1): Export.
	(maybe_skip_until): Use translate to take into account
	lattices when trying to do disambiguations.
	(get_continuation_for_phi_1): Likewise.
	(get_continuation_for_phi): Adjust for added translate
	arguments.
	(walk_non_aliased_vuses): Likewise.
	* tree-ssa-alias.h (get_continuation_for_phi): Adjust
	prototype.
	(walk_non_aliased_vuses): Likewise.
	(call_may_clobber_ref_p_1): Declare.
	* tree-ssa-sccvn.c (vn_reference_lookup_3): Also
	disambiguate against calls.  Stop early if we are
	only supposed to disambiguate.
	* tree-ssa-pre.c (translate_vuse_through_block): Adjust.

	* g++.dg/tree-ssa/pr61034.C: New testcase.

Added:
    trunk/gcc/testsuite/g++.dg/tree-ssa/pr61034.C
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-alias.c
    trunk/gcc/tree-ssa-alias.h
    trunk/gcc/tree-ssa-pre.c
    trunk/gcc/tree-ssa-sccvn.c
Comment 11 Marc Glisse 2014-05-07 15:14:41 UTC
The committed patch doesn't seem to optimize as well as the hack. fre2 (- is the hack and + is the new trunk, unless I confused my directories):

-  _8 = _98;
+  _8 = MEM[(const struct I &)b_9(D)].o;

-  _224 = _201;
+  _224 = _18->count;

etc

I assume that's because call_may_clobber_ref_p_1 sometimes says that free clobbers its argument while the hack was assuming it never does. "free" is strange, somehow here we would want call_may_clobber_ref_p_1 to return false (we can't be reading after a free, so we must have taken another path, this free didn't run and didn't clobber anything) even when stmt_kills_ref_p_1 would return true, but that would confuse other parts of gcc.
Comment 12 rguenther@suse.de 2014-05-08 10:00:39 UTC
On Wed, 7 May 2014, glisse at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=61034
> 
> --- Comment #11 from Marc Glisse <glisse at gcc dot gnu.org> ---
> The committed patch doesn't seem to optimize as well as the hack. fre2 (- is
> the hack and + is the new trunk, unless I confused my directories):
> 
> -  _8 = _98;
> +  _8 = MEM[(const struct I &)b_9(D)].o;
> 
> -  _224 = _201;
> +  _224 = _18->count;
> 
> etc
> 
> I assume that's because call_may_clobber_ref_p_1 sometimes says that free
> clobbers its argument while the hack was assuming it never does. "free" is
> strange, somehow here we would want call_may_clobber_ref_p_1 to return false
> (we can't be reading after a free, so we must have taken another path, this
> free didn't run and didn't clobber anything) even when stmt_kills_ref_p_1 would
> return true, but that would confuse other parts of gcc.

Yeah, I know.  call_may_clobber_ref_p_1 has to guard against sinking
a load across it as well.

I'm working on improving value-numbering in other ways at the moment,
we can come back and special-case free and va_end builtins later
in SCCVN.
Comment 13 Richard Biener 2015-11-26 11:32:09 UTC
We arrive at different final optimizations depending on PUSH_ARGS_REVERSED
(see PR67203).  Current (GCC 6) final state is either 3 or 4 calls depending
on that.  And this is only because the final DCE (which removes malloc/free
pairs) needs some more DSE (which only follows DCE).  The late dce/dse
passes are the only ones with this particular odering, all other pairs
come the other way around which would end up removing all malloc/free pairs
in this (finally).

Of course DSE and DCE depend on each other so exchanging the last two
isn't a trivial surgery.  Ideally DSE would have at least a basic
DCE embedded or we'd finally merge both passes (given that DSE is quite
ad-hoc anyway).