Bug 60911 - [4.9/4.10 Regression] wrong code with -O2 -flto -fipa-pta
Summary: [4.9/4.10 Regression] wrong code with -O2 -flto -fipa-pta
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: ipa (show other bugs)
Version: 5.0
: P3 normal
Target Milestone: 4.9.1
Assignee: Not yet assigned to anyone
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2014-04-21 08:58 UTC by Zdenek Sojka
Modified: 2014-04-25 07:49 UTC (History)
3 users (show)

See Also:
Host: x86_64-pc-linux-gnu
Target: x86_64-pc-linux-gnu
Build:
Known to work: 4.10.0, 4.8.3, 4.9.1
Known to fail: 4.9.0
Last reconfirmed: 2014-04-22 00:00:00


Attachments
reduced testcase (gcc.dg/torture/pr30665-1.c) (202 bytes, text/x-csrc)
2014-04-21 08:58 UTC, Zdenek Sojka
Details
cleanup patch (3.14 KB, patch)
2014-04-23 13:22 UTC, Richard Biener
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Zdenek Sojka 2014-04-21 08:58:13 UTC
Created attachment 32646 [details]
reduced testcase (gcc.dg/torture/pr30665-1.c)

Output:
$ gcc -O2 -flto -fipa-pta testcase.c
$ ./a.out 
Aborted

Relevant objdump output:
0000000000400530 <main>:
  400530:       48 83 ec 28             sub    rsp,0x28
  400534:       48 8d 74 24 10          lea    rsi,[rsp+0x10]
  400539:       48 89 e7                mov    rdi,rsp
  40053c:       c7 04 24 ad 3a 00 00    mov    DWORD PTR [rsp],0x3aad
  400543:       c7 44 24 04 d1 5a 00    mov    DWORD PTR [rsp+0x4],0x5ad1
  40054a:       00 
  40054b:       c7 44 24 10 19 45 00    mov    DWORD PTR [rsp+0x10],0x4519
  400552:       00 
  400553:       c7 44 24 14 a0 5b 00    mov    DWORD PTR [rsp+0x14],0x5ba0
  40055a:       00 
  40055b:       e8 10 01 00 00          call   400670 <f.constprop.0>
  400560:       e8 9b ff ff ff          call   400500 <abort@plt>
  400565:       0f 1f 00                nop    DWORD PTR [rax]

abort() is called unconditionally
Comment 1 Richard Biener 2014-04-22 08:45:36 UTC
I'll have a look.
Comment 2 Richard Biener 2014-04-23 10:15:17 UTC
The issue is that IPA-PTA sees main () still as calling f but the cgraph
has it calling f.constprop.0.  For some reason the IPA-CP transform has
not been applied to main () (yet?).  In .simdclone I see a call to
f.constprop.0.

Honza?  Martin?

Late IPA passes need all regular IPA transforms applied and clones
materialized.
Comment 3 Martin Jambor 2014-04-23 11:52:39 UTC
(In reply to Richard Biener from comment #2)
> Late IPA passes need all regular IPA transforms applied and clones
> materialized.

I agree that would make sense but it is not what we do.  Just look at
compile() in cgraphunit.c.  When late IPA passes run, clones are
already materialized but transformation phases of regular IPA passes
have not been run yet.  Apart from the fact that late-IPA passes do
not see the bodies transformed and inlining performed it also means
that call statements have not been updated to reflect call
redirections performed at analysis stage which is the problem here and
which is (for reasons I have never quite grasped) done as a part of
inlining transformation phase.

OTOH, I am all for moving the call to execute_all_ipa_transforms()
from expand_function() to compile(), at least in trunk.
Comment 4 rguenther@suse.de 2014-04-23 12:03:50 UTC
On Wed, 23 Apr 2014, jamborm at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60911
> 
> --- Comment #3 from Martin Jambor <jamborm at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #2)
> > Late IPA passes need all regular IPA transforms applied and clones
> > materialized.
> 
> I agree that would make sense but it is not what we do.  Just look at
> compile() in cgraphunit.c.  When late IPA passes run, clones are
> already materialized but transformation phases of regular IPA passes
> have not been run yet.  Apart from the fact that late-IPA passes do
> not see the bodies transformed and inlining performed it also means
> that call statements have not been updated to reflect call
> redirections performed at analysis stage which is the problem here and
> which is (for reasons I have never quite grasped) done as a part of
> inlining transformation phase.

Hmm, ok.  But that makes "simple" IPA passes work on a limbo state.
I think we delay applying IPA transforms as "optimization" with
respect to compile-time/memory-use as in theory we only require
a single function body in memory during LTRANS.  I wonder why
it works without LTO ... because there I _do_ see the IPA
transform applied to main ().

> OTOH, I am all for moving the call to execute_all_ipa_transforms()
> from expand_function() to compile(), at least in trunk.

We also have, in execute_one_pass,

  /* SIPLE IPA passes do not handle callgraphs with IPA transforms in it.
     Apply all trnasforms first.  */
  if (pass->type == SIMPLE_IPA_PASS)
    {
      bool applied = false;
      do_per_function (apply_ipa_transforms, (void *)&applied);
      if (applied)
        symtab_remove_unreachable_nodes (true, dump_file);
      /* Restore current_pass.  */
      current_pass = pass;

but that doesn't seem to work ... at least with LTO?  Which is
because for main()

      FOR_EACH_DEFINED_FUNCTION (node)
        if (node->analyzed && gimple_has_body_p (node->decl)
            && (!node->clone_of || node->decl != node->clone_of->decl))
          {

this doesn't trigger (for f.constprop it does).
gimple_has_body_p (node->decl) is false.

main/1 (main) @0x7ffff6daf148
  Type: function definition analyzed
  Visibility: externally_visible prevailing_def public
  References: 
  Referring: 
  Read from file: /tmp/ccED1aQr.ltrans0.o
  Availability: available
  First run: 0
  Function flags: only_called_at_startup
  Called by: 
  Calls: abort/2 (0.00 per call) f.constprop.0/3 (1.00 per call) 

it's probably not yet streamed in?
Comment 5 Richard Biener 2014-04-23 12:15:47 UTC
Ah, and tree-ssa-structalias.c does

  /* Build the constraints.  */
  FOR_EACH_DEFINED_FUNCTION (node)
    {
      varinfo_t vi;
      /* Nodes without a body are not interesting.  Especially do not
         visit clones at this point for now - we get duplicate decls
         there for inline clones at least.  */
      if (!cgraph_function_with_gimple_body_p (node) || node->clone_of)
        continue;
      cgraph_get_body (node);

but doesn't apply IPA transforms.  I think the kludge in the pass manager
should do that instead.  I wonder how the above doesn't trigger the
in_lto_p assert in cgraph_get_body though ... (maybe the odd DECL_RESULT
check makes sure we don't trigger that path).

Ick.  That code path triggers also with

2171          do_per_function (clear_last_verified, NULL);

doing the push/pop_cfun game on each fn just to clear last_verified ...
in WPA stage ... same for

2179      do_per_function (update_properties_after_pass, pass);

and more.  Trying to make

Index: gcc/passes.c
===================================================================
--- gcc/passes.c        (revision 209677)
+++ gcc/passes.c        (working copy)
@@ -1506,9 +1506,11 @@ do_per_function (void (*callback) (void
     {
       struct cgraph_node *node;
       FOR_EACH_DEFINED_FUNCTION (node)
-       if (node->analyzed && gimple_has_body_p (node->decl)
+       if (node->analyzed
+           && cgraph_function_with_gimple_body_p (node)
            && (!node->clone_of || node->decl != node->clone_of->decl))
          {
+           cgraph_get_body (node);
            push_cfun (DECL_STRUCT_FUNCTION (node->decl));
            callback (data);
            if (!flag_wpa)

not ICE ...

Index: gcc/passes.c
===================================================================
--- gcc/passes.c        (revision 209677)
+++ gcc/passes.c        (working copy)
@@ -1506,18 +1506,24 @@ do_per_function (void (*callback) (void
     {
       struct cgraph_node *node;
       FOR_EACH_DEFINED_FUNCTION (node)
-       if (node->analyzed && gimple_has_body_p (node->decl)
+       if (node->analyzed
+           && cgraph_function_with_gimple_body_p (node)
            && (!node->clone_of || node->decl != node->clone_of->decl))
          {
-           push_cfun (DECL_STRUCT_FUNCTION (node->decl));
-           callback (data);
            if (!flag_wpa)
+             cgraph_get_body (node);
+           if (gimple_has_body_p (node->decl))
              {
-               free_dominance_info (CDI_DOMINATORS);
-               free_dominance_info (CDI_POST_DOMINATORS);
+               push_cfun (DECL_STRUCT_FUNCTION (node->decl));
+               callback (data);
+               if (!flag_wpa)
+                 {
+                   free_dominance_info (CDI_DOMINATORS);
+                   free_dominance_info (CDI_POST_DOMINATORS);
+                 }
+               pop_cfun ();
+               ggc_collect ();
              }
-           pop_cfun ();
-           ggc_collect ();
          }
     }
 }

works though.
Comment 6 Richard Biener 2014-04-23 12:19:53 UTC
Seems like simd_clone_create gets along with this because it does

      if (!cgraph_function_with_gimple_body_p (old_node))
        return NULL;
      cgraph_get_body (old_node);
      new_node = cgraph_function_versioning (old_node, vNULL, NULL, NULL,
                                             false, NULL, NULL, "simdclone");

and cgraph_function_versioning applies IPA transforms?
Comment 7 Richard Biener 2014-04-23 12:22:42 UTC
Anyway, that push/pop_cfun stuff should be pushed to the callbacks, not be
done in do_per_function ().  Likewise the ggc_collect ().

Btw, same code in do_per_function_toporder.
Comment 8 Richard Biener 2014-04-23 13:22:25 UTC
Created attachment 32662 [details]
cleanup patch
Comment 9 Jan Hubicka 2014-04-23 19:55:46 UTC
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60911
> 
> --- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
> Ah, and tree-ssa-structalias.c does
> 
>   /* Build the constraints.  */
>   FOR_EACH_DEFINED_FUNCTION (node)
>     {
>       varinfo_t vi;
>       /* Nodes without a body are not interesting.  Especially do not
>          visit clones at this point for now - we get duplicate decls
>          there for inline clones at least.  */
>       if (!cgraph_function_with_gimple_body_p (node) || node->clone_of)
>         continue;
>       cgraph_get_body (node);
> 
> but doesn't apply IPA transforms.  I think the kludge in the pass manager
> should do that instead.  I wonder how the above doesn't trigger the
> in_lto_p assert in cgraph_get_body though ... (maybe the odd DECL_RESULT
> check makes sure we don't trigger that path).

Odd DECL_RESULT check is check if the function is already in memory.
It probably should be documented though.

The purpose of function is to implement lazy loading of function bodies.
If body is already loaded it should do nothing.

Making pass manager to read all function bodies prior first enabled late
small_ipa pass is definitely possible, but it would mean that every late IPA
pass will push up peak memory use of ltrans stage quite a bit. I think PTA
should be done on whole program and not here. Late IPA passes (in addition to
experiments where this is OK) may be useful for stuff that doesn't need to
touch whole program all the time, I think Jakub's SIMD stuff is a good example.

What about adding parameter to cgraph_get_body to apply transformations?
(inliner and clone materialization needs bodies w/o transforms)
Comment 10 Jan Hubicka 2014-04-23 20:43:20 UTC
> 
> We also have, in execute_one_pass,
> 
>   /* SIPLE IPA passes do not handle callgraphs with IPA transforms in it.
>      Apply all trnasforms first.  */
>   if (pass->type == SIMPLE_IPA_PASS)
>     {
>       bool applied = false;
>       do_per_function (apply_ipa_transforms, (void *)&applied);
>       if (applied)
>         symtab_remove_unreachable_nodes (true, dump_file);
>       /* Restore current_pass.  */
>       current_pass = pass;
> 
> but that doesn't seem to work ... at least with LTO?  Which is
> because for main()
> 
>       FOR_EACH_DEFINED_FUNCTION (node)
>         if (node->analyzed && gimple_has_body_p (node->decl)
>             && (!node->clone_of || node->decl != node->clone_of->decl))
>           {
> 
> this doesn't trigger (for f.constprop it does).
> gimple_has_body_p (node->decl) is false.

yeah, this is something I missed with the lazy loading. There is predicate
cgraph_function_with_gimple_body_p with does not care about the lazy stuff
(i.e. will return same value for both). The problem is that we do not want to
run the verifiers that are also called by this for_each.* wrapper.

As written in previous post, for 4.10, I think we should have cgraph_get_body
to lazily load body and apply transformations/produce the clone. In short it should
do whatever necessary to actually get the real body to the optimizer.

I want to keep the default optimization queue to not necessarily load
everything into memory. Therefore I think late IPA passes should be responsible
to call cgraph_get_body themselves to avoid growth in ltrans peak memory use
for passes that touch just few functions (like this SIMD stuff). 

In long term IPA-PTA, that is an example of pass that touch all bodies, should
go to real IPA queue IMO and be reimplemented by one of the more scalable
algorithms perhaps with context senstivity. In fact I have student (Lada Laska)
who is looking into the topic.

For local passes I think it should be pass manager's responsibilty to load the
bodies, as it is now.

This would hopefully not be difficult to use and it won't boost peak ltrans
memory that now, with WPA improvements, seems to be a bottleneck
http://hubicka.blogspot.ca/2014/04/linktime-optimization-in-gcc-2-firefox.html
at least for firefox and -flto=16 (it should actually reduce the peak)

For 4.9 I would suggest simply making the above code to apply transformations
even with LTO (i.e. drop the gimple_has_body_p check from the patch and replace it
cgraph_function_with_gimple_body_p). Will look into it now.
Comment 11 Richard Biener 2014-04-24 09:46:55 UTC
Index: gcc/passes.c
===================================================================
--- gcc/passes.c        (revision 209742)
+++ gcc/passes.c        (working copy)
@@ -2194,8 +2194,26 @@ execute_one_pass (opt_pass *pass)
      Apply all trnasforms first.  */
   if (pass->type == SIMPLE_IPA_PASS)
     {
+      struct cgraph_node *node;
       bool applied = false;
-      do_per_function (apply_ipa_transforms, (void *)&applied);
+      FOR_EACH_DEFINED_FUNCTION (node)
+       if (node->analyzed
+           && cgraph_function_with_gimple_body_p (node)
+           && (!node->clone_of || node->decl != node->clone_of->decl))
+         {
+           if (!node->global.inlined_to
+               && node->ipa_transforms_to_apply.exists ())
+             {
+               cgraph_get_body (node);
+               push_cfun (DECL_STRUCT_FUNCTION (node->decl));
+               execute_all_ipa_transforms ();
+               rebuild_cgraph_edges ();
+               free_dominance_info (CDI_DOMINATORS);
+               free_dominance_info (CDI_POST_DOMINATORS);
+               pop_cfun ();
+               applied = true;
+             }
+         }
       if (applied)
         symtab_remove_unreachable_nodes (true, dump_file);
       /* Restore current_pass.  */

should work and avoid all the issues I ran into with the patches patching
do_per_function.

I'm testing that for trunk/4.9 now with possible cleanups done on trunk
only.
Comment 12 Richard Biener 2014-04-25 07:45:14 UTC
Author: rguenth
Date: Fri Apr 25 07:44:40 2014
New Revision: 209779

URL: http://gcc.gnu.org/viewcvs?rev=209779&root=gcc&view=rev
Log:
2014-04-25  Richard Biener  <rguenther@suse.de>

	PR ipa/60911
	* passes.c (apply_ipa_transforms): Inline into only caller ...
	(execute_one_pass): ... here.  Properly bring in function
	bodies for nodes we want to apply IPA transforms to.

	* gcc.dg/lto/pr60911_0.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/lto/pr60911_0.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/passes.c
    trunk/gcc/testsuite/ChangeLog
Comment 13 Richard Biener 2014-04-25 07:48:41 UTC
Author: rguenth
Date: Fri Apr 25 07:48:06 2014
New Revision: 209781

URL: http://gcc.gnu.org/viewcvs?rev=209781&root=gcc&view=rev
Log:
2014-04-25  Richard Biener  <rguenther@suse.de>

	PR ipa/60911
	* passes.c (apply_ipa_transforms): Inline into only caller ...
	(execute_one_pass): ... here.  Properly bring in function
	bodies for nodes we want to apply IPA transforms to.

	* gcc.dg/lto/pr60911_0.c: New testcase.

Added:
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/lto/pr60911_0.c
Modified:
    branches/gcc-4_9-branch/gcc/ChangeLog
    branches/gcc-4_9-branch/gcc/passes.c
    branches/gcc-4_9-branch/gcc/testsuite/ChangeLog
Comment 14 Richard Biener 2014-04-25 07:49:47 UTC
Fixed.