Bug 23455 - tree load PRE is not working for global variables
Summary: tree load PRE is not working for global variables
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.1.0
: P2 enhancement
Target Milestone: 4.4.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, TREE
: 16797 25553 35286 35287 (view as bug list)
Depends on:
Blocks: 19721
  Show dependency treegraph
 
Reported: 2005-08-18 07:55 UTC by Paolo Bonzini
Modified: 2014-12-15 19:17 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2006-01-02 19:31:41


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Paolo Bonzini 2005-08-18 07:55:55 UTC
Load PRE is scheduled for 4.2, I'm creating this bug because load PRE is
currently split between CSE and GCSE (this bug blocks the "optimizations caught
by CSE" meta-bug, PR19721).  Given this code,

   unsigned outcnt;
   extern void flush_outbuf(void);

   void
   bi_windup(unsigned char *outbuf, unsigned char bi_buf)
   {
       outbuf[outcnt] = bi_buf;
       if (outcnt == 16384)
               flush_outbuf();
       outbuf[outcnt] = bi_buf;
   }
   
we'd want the code to become

    void
    bi_windup(unsigned char *outbuf, unsigned char bi_buf)
    {
        int t1 = outcnt;
        outbuf[t1] = bi_buf;
        int t2 = outcnt, t3;
        if (t2 == 16384) {
                flush_outbuf();
		t3 = outcnt;
	} else
		t3 = t2;
        outbuf[t3] = bi_buf;
    }

currently this optimization is split between CSE (with path following) and GCSE.  
Actually, GCSE is able to do it alone if the number of GCSE passes is increased.
Comment 1 Daniel Berlin 2005-08-18 13:09:59 UTC
Subject: Re:  New: load PRE is missing

On Thu, 2005-08-18 at 07:55 +0000, bonzini at gcc dot gnu dot org wrote:
> Load PRE is scheduled for 4.2, I'm creating this bug because load PRE is
> currently split between CSE and GCSE (this bug blocks the "optimizations caught
> by CSE" meta-bug, PR19721).  Given this code,
> 
>    unsigned outcnt;
>    extern void flush_outbuf(void);
> 
>    void
>    bi_windup(unsigned char *outbuf, unsigned char bi_buf)
>    {
>        outbuf[outcnt] = bi_buf;
>        if (outcnt == 16384)
>                flush_outbuf();
>        outbuf[outcnt] = bi_buf;
>    }
>    
> we'd want the code to become
> 
>     void
>     bi_windup(unsigned char *outbuf, unsigned char bi_buf)
>     {
>         int t1 = outcnt;
>         outbuf[t1] = bi_buf;
>         int t2 = outcnt, t3;
>         if (t2 == 16384) {
>                 flush_outbuf();
> 		t3 = outcnt;
> 	} else
> 		t3 = t2;
>         outbuf[t3] = bi_buf;
>     }
> 

This doesn't remove a single load.
In fact, i'm not sure what you think is better about this code.

However, it certainly won't be caught by load PRE.


Comment 2 Paolo Bonzini 2005-08-18 13:44:01 UTC
outcnt aliases with outbuf
Comment 3 Daniel Berlin 2005-08-18 13:46:20 UTC
Subject: Re:  load PRE is missing

On Thu, 2005-08-18 at 13:44 +0000, bonzini at gcc dot gnu dot org wrote:
> ------- Additional Comments From bonzini at gcc dot gnu dot org  2005-08-18 13:44 -------
> outcnt aliases with outbuf
> 

Then there is nothing to optimize in this testcase.

If you look at the code you want, you have exactly the same number of
loads from the global along the main path, and have actually *added* an
extra load along the other path.

This is *not* an optimization.


Comment 4 Paolo Bonzini 2005-08-18 13:46:30 UTC
Sorry.

outcnt aliases with outbuf, so the load of t2 cannot be removed.  The GIMPLE
code that is now emitted is something like:


    void
    bi_windup(unsigned char *outbuf, unsigned char bi_buf)
    {
        int t1 = outcnt;
        outbuf[t1] = bi_buf;
        int t2 = outcnt;
        if (t2 == 16384)
                flush_outbuf();
	int t3 = outcnt;
        outbuf[t3] = bi_buf;
    }

where t3 == t2 if flush_outbuf is not called.  My code removes the load into t3
if flush_outbuf is not called.

 
Comment 5 Paolo Bonzini 2006-01-02 19:31:41 UTC
Reconfirmed after the first load PRE patch went in, as it does not handle globals.
Comment 6 Steven Bosscher 2006-01-02 21:27:21 UTC
Ehm, wouldn't "unsigned char *" alias everything?  GCSE doesn't do load PRE for me on the original test case, either.  And, as long as "outbuf[outcnt] = bi_buf;" has a V_MAY_DEF for outcnt, you're not going to see any load PRE happening here.

With a changed test case, GCSE _does_ perform load PRE:

unsigned long outcnt;
extern void flush_outbuf(void);

void
bi_windup(unsigned int *outbuf, unsigned int bi_buf)
{
    unsigned long t1 = outcnt;
    outbuf[t1] = bi_buf;

    unsigned long t2 = outcnt;
    if (t2 == 16384)
      flush_outbuf();

    unsigned long t3 = outcnt;
    outbuf[t3] = bi_buf;
}

This gives me the following asm on x86-64:
bi_windup:
        movq    %rbx, -16(%rsp)
        movq    %rbp, -8(%rsp)
        subq    $24, %rsp
        movq    outcnt(%rip), %rax
        movq    %rdi, %rbp
        movl    %esi, %ebx
        cmpq    $16384, %rax
        movl    %esi, (%rbp,%rax,4)
        je      .L6
.L2:
        movl    %ebx, (%rbp,%rax,4)
        movq    8(%rsp), %rbx
        movq    16(%rsp), %rbp
        addq    $24, %rsp
        ret
        .p2align 4,,7
.L6:
        call    flush_outbuf
        movq    outcnt(%rip), %rax
        jmp     .L2


Looking at the .pre tree dump, there isn't anything obvious in the way of doing load PRE here AFAICT:

bi_windup (outbufD.1866, bi_bufD.1867)
{
  unsigned intD.3 storetmp.21D.1918;
  long unsigned intD.4 t3D.1872;
  long unsigned intD.4 t2D.1871;
  long unsigned intD.4 t1D.1870;
  unsigned intD.3 * D.1878;
  unsigned intD.3 * D.1877;
  long unsigned intD.4 D.1876;
  unsigned intD.3 * D.1875;
  unsigned intD.3 * D.1874;
  long unsigned intD.4 D.1873;

  # BLOCK 2 freq:10000
  # PRED: ENTRY [100.0%]  (fallthru,exec)
  #   VUSE <outcntD.1863_1>;
  t1D.1870_2 = outcntD.1863;
  D.1873_3 = t1D.1870_2 * 4;
  D.1874_4 = (unsigned intD.3 *) D.1873_3;
  D.1875_6 = D.1874_4 + outbufD.1866_5;
  #   TMT.4D.1900_16 = V_MAY_DEF <TMT.4D.1900_15>;
  *D.1875_6 = bi_bufD.1867_7;
  if (t1D.1870_2 == 16384) goto <L0>; else goto <L2>;
  # SUCC: 3 [26.2%]  (true,exec) 5 [73.8%]  (false,exec)

  # BLOCK 5 freq:7378
  # PRED: 2 [73.8%]  (false,exec)
<L2>:;
  goto <bb 4> (<L1>);
  # SUCC: 4 [100.0%]  (fallthru)

  # BLOCK 3 freq:2622
  # PRED: 2 [26.2%]  (true,exec)
<L0>:;
  #   outcntD.1863_18 = V_MAY_DEF <outcntD.1863_1>;
  #   TMT.4D.1900_19 = V_MAY_DEF <TMT.4D.1900_16>;
  flush_outbuf ();
  # SUCC: 4 [100.0%]  (fallthru,exec)

  # BLOCK 4 freq:10000
  # PRED: 5 [100.0%]  (fallthru) 3 [100.0%]  (fallthru,exec)
  # TMT.4D.1900_14 = PHI <TMT.4D.1900_16(5), TMT.4D.1900_19(3)>;
  # outcntD.1863_13 = PHI <outcntD.1863_1(5), outcntD.1863_18(3)>;
<L1>:;
  #   VUSE <outcntD.1863_13>;
  t3D.1872_9 = outcntD.1863;
  D.1876_10 = t3D.1872_9 * 4;
  D.1877_11 = (unsigned intD.3 *) D.1876_10;
  D.1878_12 = D.1877_11 + outbufD.1866_5;
  #   TMT.4D.1900_17 = V_MAY_DEF <TMT.4D.1900_14>;
  *D.1878_12 = bi_bufD.1867_7;
  return;
  # SUCC: EXIT [100.0%]

}


Comment 7 Steven Bosscher 2006-01-02 21:30:52 UTC
With an even more modified test case, load PRE does happen:

unsigned long outcnt;
extern void flush_outbuf(void);

void
bi_windup(unsigned int *outbuf, unsigned int bi_buf, unsigned long *outcnt)
{
    unsigned long t1 = *outcnt;
    outbuf[t1] = bi_buf;

    unsigned long t2 = *outcnt;
    if (t2 == 16384)
      flush_outbuf();

    unsigned long t3 = *outcnt;
    outbuf[t3] = bi_buf;
}


--> .pre tree dump

bi_windup (outbufD.1866, bi_bufD.1867, outcntD.1868)
{
  unsigned intD.3 * prephitmp.26D.1925;
  unsigned intD.3 * pretmp.25D.1924;
  long unsigned intD.4 prephitmp.24D.1923;
  long unsigned intD.4 pretmp.23D.1922;
  unsigned intD.3 storetmp.22D.1921;
  long unsigned intD.4 t3D.1873;
  long unsigned intD.4 t2D.1872;
  long unsigned intD.4 t1D.1871;
  unsigned intD.3 * D.1879;
  unsigned intD.3 * D.1878;
  long unsigned intD.4 D.1877;
  unsigned intD.3 * D.1876;
  unsigned intD.3 * D.1875;
  long unsigned intD.4 D.1874;

  # BLOCK 2 freq:10000
  # PRED: ENTRY [100.0%]  (fallthru,exec)
  #   VUSE <TMT.4D.1902_15>;
  t1D.1871_2 = *outcntD.1868_1;
  D.1874_3 = t1D.1871_2 * 4;
  D.1875_4 = (unsigned intD.3 *) D.1874_3;
  D.1876_6 = D.1875_4 + outbufD.1866_5;
  #   TMT.5D.1903_17 = V_MAY_DEF <TMT.5D.1903_16>;
  *D.1876_6 = bi_bufD.1867_7;
  if (t1D.1871_2 == 16384) goto <L0>; else goto <L2>;
  # SUCC: 3 [26.2%]  (true,exec) 5 [73.8%]  (false,exec)

  # BLOCK 5 freq:7378
  # PRED: 2 [73.8%]  (false,exec)
<L2>:;
  goto <bb 4> (<L1>);
  # SUCC: 4 [100.0%]  (fallthru)

  # BLOCK 3 freq:2622
  # PRED: 2 [26.2%]  (true,exec)
<L0>:;
  #   TMT.4D.1902_19 = V_MAY_DEF <TMT.4D.1902_15>;
  #   TMT.5D.1903_20 = V_MAY_DEF <TMT.5D.1903_17>;
  flush_outbuf ();
  #   VUSE <TMT.4D.1902_19>;
  pretmp.23D.1922_23 = *outcntD.1868_1;
  pretmp.23D.1922_25 = pretmp.23D.1922_23 * 4;
  pretmp.25D.1924_27 = (unsigned intD.3 *) pretmp.23D.1922_25;
  pretmp.25D.1924_29 = outbufD.1866_5 + pretmp.25D.1924_27;
  # SUCC: 4 [100.0%]  (fallthru,exec)

  # BLOCK 4 freq:10000
  # PRED: 5 [100.0%]  (fallthru) 3 [100.0%]  (fallthru,exec)
  # prephitmp.26D.1925_30 = PHI <D.1876_6(5), pretmp.25D.1924_29(3)>;
  # prephitmp.26D.1925_28 = PHI <D.1875_4(5), pretmp.25D.1924_27(3)>;
  # prephitmp.24D.1923_26 = PHI <D.1874_3(5), pretmp.23D.1922_25(3)>;
  # prephitmp.24D.1923_24 = PHI <t1D.1871_2(5), pretmp.23D.1922_23(3)>;
  # TMT.5D.1903_14 = PHI <TMT.5D.1903_17(5), TMT.5D.1903_20(3)>;
  # TMT.4D.1902_13 = PHI <TMT.4D.1902_15(5), TMT.4D.1902_19(3)>;
<L1>:;
  t3D.1873_9 = prephitmp.24D.1923_24;
  D.1877_10 = prephitmp.24D.1923_26;
  D.1878_11 = prephitmp.26D.1925_28;
  D.1879_12 = prephitmp.26D.1925_30;
  #   TMT.5D.1903_18 = V_MAY_DEF <TMT.5D.1903_14>;
  *D.1879_12 = bi_bufD.1867_7;
  return;
  # SUCC: EXIT [100.0%]

}
Comment 8 Paolo Bonzini 2006-01-03 08:04:33 UTC
> Ehm, wouldn't "unsigned char *" alias everything?  GCSE doesn't do load PRE for
> me on the original test case, either.

Yes, I was checking if load PRE removes the redundant load of t3 if flush_outbuf is not called.  See comment #4.
Comment 9 Daniel Berlin 2006-01-03 18:29:56 UTC
Load PRE on globals is going to take a while, because we do some stupid things in hashing (like hashing tcc_declaration's by pointer, instead of DECL_UID).
Comment 10 Steven Bosscher 2006-01-04 22:35:56 UTC
Re. comment #9, I don't understand how the way we hash would make load PRE harder.  Could you elaborate a bit on what is missing or done "wrong" for load PRE of globals??
Comment 11 Steven Bosscher 2006-01-07 00:25:46 UTC
The thread starting at http://gcc.gnu.org/ml/gcc-patches/2006-01/msg00373.html addresses my question in comment #10.
Comment 12 Andrew Pinski 2008-04-07 01:40:53 UTC
*** Bug 35286 has been marked as a duplicate of this bug. ***
Comment 13 Andrew Pinski 2008-04-07 01:44:18 UTC
*** Bug 25553 has been marked as a duplicate of this bug. ***
Comment 14 Andrew Pinski 2008-04-07 01:46:50 UTC
*** Bug 35287 has been marked as a duplicate of this bug. ***
Comment 15 Steven Bosscher 2008-07-06 08:53:05 UTC
Patch here:
http://gcc.gnu.org/ml/gcc-patches/2008-07/msg00430.html

We should re-evaluate the need for gcse.c load-PRE after Daniel's patch goes in.  The last time I looked at what loads gcse.c load-PRE would move, it only touched loads from globals, so it may be a redundant pass when tree-ssa-pre moves them.
Comment 16 Daniel Berlin 2008-07-08 16:11:58 UTC
Subject: Bug 23455

Author: dberlin
Date: Tue Jul  8 16:11:06 2008
New Revision: 137631

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=137631
Log:
2008-07-05  Daniel Berlin  <dberlin@dberlin.org>
	
	Fix PR tree-optimization/23455
	Fix PR tree-optimization/35286
	Fix PR tree-optimization/35287
	* Makefile.in (OBJS-common): Remove tree-vn.o.
	tree-vn.o: Remove.
	* dbgcnt.def: Add treepre_insert debug counter.
	* gcc/tree-flow.h (add_to_value): Updated for other changes.
	(debug_value_expressions): Ditto.
	(print_value_expressions): Ditto.
	* tree-pretty-print.c (dump_generic_node): Updated for
	VALUE_HANDLE removal.
	* tree-ssa-dom.c (record_equality): Ditto.
	(cprop_operand): Ditto.
	(lookup_avail_expr): Ditto.
	* tree-ssa-threadedge.c
	(record_temporary_equivalences_from_stmts_at_dest): Ditto.
	(simplify_control_stmt_condition): Ditto.
	* tree.c (tree_code_size): Ditto.
	(tree_node_structure): Ditto.
	(iterative_hash_expr): Ditto.
	* tree.def: Ditto.
	* tree.h (VALUE_HANDLE_ID): Ditto.
	(VALUE_HANDLE_EXPR_SET): Ditto.
	(struct tree_value_handle): Ditto.
	(union tree_node): Ditto.
	* treestruct.def: Ditto.
	* tree-vn.c: Removed.
	* tree-ssa-pre.c: Rewritten entirely.
	* tree-ssa-sccvn.c (constant_to_value_id): New hashtable.
	(constant_value_ids): Ditto.
	(vn_nary_op_t): Moved to header.
	(vn_phi_t): Ditto.
	(vn_reference_op_t): Ditto
	(vn_reference_t): Ditto.
	(next_value_id): New variable.
	(VN_INFO): Add an assert.
	(vn_constant_eq): New function.
	(vn_constant_hash): Ditto.
	(get_or_alloc_constant_value_id): Ditto.
	(value_id_constant_p): Ditto.
	(vn_reference_compute_hash): De-staticify.
	(copy_reference_ops_from_ref): Don't use get_callee_fndecl.
	Disable some code with a FIXME.
	Remove VALUE_HANDLE use.
	(valueize_refs): Update opcode if it changes from ssa name to
	constant.
	(vn_reference_lookup_1): Add new argument.
	(vn_reference_lookup):  Ditto.
	(vn_reference_lookup_pieces): New function.
	(vn_reference_insert): Add return type. Modify to deal with value
	ids.
	(vn_reference_insert_pieces):  New function.
	(vn_nary_op_compute_hash): De-staticify.
	(vn_nary_op_eq): Ditto.
	(vn_nary_op_lookup_pieces): New function.
	(vn_nary_op_lookup): Add new argument.  
	(vn_nary_op_insert_pieces): New function.
	(vn_nary_op_insert): Add return type. Modify to deal with value
	ids.
	(vn_phi_insert): Ditto.
	(visit_unary_op): Update for callee changes.
	(visit_binary_op): Ditto.
	(visit_reference_op_load): Ditto.
	(visit_reference_op_store): Ditto.
	(init_scc_vn): Init next_value_id, constant_to_value_id and
	constant_value_ids. 
	(free_scc_vn): Free them.
	(set_hashtable_value_ids): New function.
	(run_scc_vn): Use it.
	(get_max_value_id): New function.
	(get_next_value_id): Ditto.
	(expressions_equal_p): Moved from tree-vn.c
	(sort_vuses): Ditto.
	(sort_vuses_heap): Ditto.
	* tree-ssa-sccvn.h: Structures moved from tree-ssa-sccvn.c (noted
	above).
	* tree.c (iterative_hash_hashval_t): Made non-static
	* tree.h (iterative_hash_hashval_t): Declare it.


Added:
    trunk/gcc/testsuite/gcc.c-torture/compile/20080704-1.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr23455.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr35286.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr35287.c
Removed:
    trunk/gcc/tree-vn.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/dbgcnt.def
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loadpre24.c
    trunk/gcc/tree-flow.h
    trunk/gcc/tree-pretty-print.c
    trunk/gcc/tree-ssa-dom.c
    trunk/gcc/tree-ssa-pre.c
    trunk/gcc/tree-ssa-sccvn.c
    trunk/gcc/tree-ssa-sccvn.h
    trunk/gcc/tree-ssa-threadedge.c
    trunk/gcc/tree.c
    trunk/gcc/tree.def
    trunk/gcc/tree.h
    trunk/gcc/treestruct.def

Comment 17 Richard Biener 2008-07-08 20:27:54 UTC
Fixed.
Comment 18 Sebastian Pop 2014-12-15 19:17:46 UTC
*** Bug 16797 has been marked as a duplicate of this bug. ***