Bug 34043 - Missed optimization causing extra loads and stores when using x86_64 builtin function together with aggregate types.
Summary: Missed optimization causing extra loads and stores when using x86_64 builtin ...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.3.0
: P3 enhancement
Target Milestone: 4.4.0
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on: 33989
Blocks:
  Show dependency treegraph
 
Reported: 2007-11-09 16:14 UTC by jsjodin
Modified: 2008-03-14 16:16 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2008-02-28 16:40:35


Attachments
Source code to expose the missed optimization (486 bytes, text/plain)
2007-11-09 16:16 UTC, jsjodin
Details
Assembly code for bad code. (612 bytes, text/plain)
2007-11-09 16:16 UTC, jsjodin
Details
Assembly code for good code. (581 bytes, text/plain)
2007-11-09 16:17 UTC, jsjodin
Details
patch prototype (2.36 KB, text/plain)
2008-01-04 15:38 UTC, Richard Biener
Details

Note You need to log in before you can comment on or make changes to this bug.
Description jsjodin 2007-11-09 16:14:44 UTC
Compile the attached file test.cpp with the following flags:
g++ -DNDEBUG -m32 -msse2 -O2 -msse2 -DSHOW_BUG gcc_bug.cpp -S

In the generated code there are useless stores and loads of %xmm0 to
-40(%eps) and -56(%eps).  If the code is compiled without -DSHOW_BUG
it will generate a more optimal version without the extra memory
accesses. See the attached generated files test_bad.s and test_good.s
for the resulting code. This is an old problem existing in 4.1.x.
Comment 1 jsjodin 2007-11-09 16:16:05 UTC
Created attachment 14516 [details]
Source code to expose the missed optimization
Comment 2 jsjodin 2007-11-09 16:16:49 UTC
Created attachment 14517 [details]
Assembly code for bad code.
Comment 3 jsjodin 2007-11-09 16:17:12 UTC
Created attachment 14518 [details]
Assembly code for good code.
Comment 4 Andrew Pinski 2007-11-09 18:24:06 UTC
Related to PR 33790 (and most likely fixed by it).  There is another issue with that bug relating to not deleting the extra store.
Comment 5 jsjodin 2007-11-13 17:07:16 UTC
(In reply to comment #4)
> Related to PR 33790 (and most likely fixed by it).  There is another issue with
> that bug relating to not deleting the extra store.
> 

Indeed the extra load disappeared when with the patch. The store did not get deleted as expected. I looked at the differences between the good and bad case. 
Compiling the good case has the following sequence before the fre pass:
Note: src and dst are unions

src.f = D.9650_45;
D.9630_31 = src.f;
D.9655_46 = __builtin_ia32_addps (D.9630_31, D.9630_31);
dst.f = D.9655_46;
D.9632_33 = dst.f;                

After fre the temps have been propagated and replaced the uses of dst.f:

src.f = D.9650_45;                                                                                                                                                                                  
D.9630_31 = D.9650_45;                                                                                                                                         
D.9655_46 = __builtin_ia32_addps (D.9630_31, D.9630_31);                                                                                                                                            
dst.f = D.9655_46;                                                                                                                                                                                     
D.9632_33 = D.9655_46;

The extra stores to src.f are eliminated in dce. 

The bad case has the following code before and after fre:
src.i = D.9651_44;                                                                                                                                                                                  
D.9630_31 = src.f;                                                                                                                                                                                  
D.9655_45 = __builtin_ia32_addps (D.9630_31, D.9630_31);                                                                                                                                            
dst.f = D.9655_45;                                                                                                                                                                                  
D.9632_33 = dst.i;   

Since the src.i and src.f are probably not considered to be the same the propagation does not work. It might be possible to handle this case if one consideres the size of data being written and read from unions.
Comment 6 jsjodin 2007-12-03 15:41:56 UTC
> Indeed the extra load disappeared when with the patch. The store did not get
> deleted as expected. I looked at the differences between the good and bad case. 

After doing some more investigation an attempt to fix the problem was done in the value numbering pass. By using the size and a unique type for the size to compute the value number it not longer matters which union that is accessed, only the number of bits that are read/written. This gives good code test.cpp, however the problem with this approach is that in FRE the different types result in a cast, but it should really be a reinterpretation of the bits (different types, same bit pattern). The small example below gives: int: -1, float: -1.000000 which is wrong. The expected result is: int: -1, float: nan

The solution would be to make the reinterpretation possible somehow. 


Simple test:
---------------------------------------------------------------
#include <stdio.h>

union foo {
  int i;
  float f;
} bar;


int main()
{
  int x = 0xFFFFFFFF;
  float y = 0.0;
  bar.i = x;
  y = bar.f;
  printf("int: %d, float: %f\n", bar.i, bar.f); 

  return 0;
}
-----------------------------------------------------------------




Index: tree-ssa-sccvn.c
===================================================================
--- tree-ssa-sccvn.c    (revision 130099)
+++ tree-ssa-sccvn.c    (working copy)
@@ -456,6 +456,33 @@ shared_vuses_from_stmt (tree stmt)
   return shared_lookup_vops;
 }
 
+
+static tree get_generic_type_node_from_size(int size)
+{
+  tree result = NULL;
+  switch(size)
+    {
+    case 8:
+      result = unsigned_intQI_type_node;
+      break;
+    case 16:
+      result = unsigned_intHI_type_node;
+      break;
+    case 32:
+      result = unsigned_intSI_type_node;
+      break;
+    case 64:
+      result = unsigned_intDI_type_node;
+      break;
+    case 128:
+      result = unsigned_intTI_type_node;
+      break;
+    default:
+      break;
+    }
+  return result;
+}
+
 /* Copy the operations present in load/store/call REF into RESULT, a vector of
    vn_reference_op_s's.  */
 
@@ -521,7 +548,23 @@ copy_reference_ops_from_ref (tree ref, V
          break;
        case COMPONENT_REF:
          /* Record field as operand.  */
-         temp.op0 = TREE_OPERAND (ref, 1);
+         {
+           tree aggregate = TREE_OPERAND (ref, 0);
+           tree aggregate_type = TREE_TYPE(aggregate);
+           tree field_decl = TREE_OPERAND (ref, 1);
+           tree field_decl_type = TREE_TYPE(field_decl);
+           tree field_decl_type_size = TYPE_SIZE(field_decl_type);
+           if (TREE_CODE (aggregate_type) == UNION_TYPE
+               && TREE_CODE (field_decl_type_size) == INTEGER_CST)
+             {
+               temp.op0 = field_decl_type_size;
+               temp.type = get_generic_type_node_from_size(TREE_INT_CST_LOW(field_decl_type_size)) != NULL 
+                             ? get_generic_type_node_from_size(TREE_INT_CST_LOW(field_decl_type_size)) 
+                             : temp.type;
+             } else {
+               temp.op0 = TREE_OPERAND (ref, 1);
+             }
+         }
          break;
        case ARRAY_RANGE_REF:
        case ARRAY_REF:
 
Comment 7 Andrew Pinski 2007-12-09 22:34:20 UTC
>+static tree get_generic_type_node_from_size(int size)

This function needs lots of improvement because the size of the modes is based on UNIT_BIT_SIZE (I think that is the macro name) and really you can go from a size to a mode and then to a tree type.

Note this patch really fixes my bug (PR 32964) if PRE is extended to handle this too.
Comment 8 jsjodin 2007-12-13 01:56:49 UTC
(In reply to comment #7)
> >+static tree get_generic_type_node_from_size(int size)
> 
> This function needs lots of improvement because the size of the modes is based
> on UNIT_BIT_SIZE (I think that is the macro name) and really you can go from a
> size to a mode and then to a tree type.
> 
> Note this patch really fixes my bug (PR 32964) if PRE is extended to handle
> this too.
> 

Yes I rewrote the patch to always use the first member of a union that match a specific size. When the expression was extracted (queried from VN function) a VIEW_CONVERT_EXPR would be insterted if the expected type did not match the stored expression type. This caused some problems (assertions) in the compiler that I have not been able to track down yet, although it fixed both my test cases and for constants it doesn't cause any problems since the expression gets folded. To avoid introducing VIEW_CONVERT_EXPR it may be possible to return an expression (from VN) that refers to a union member with a matching type instead. It would mean that the value number would be the same, but the returned expression might be different. 



Comment 9 Richard Biener 2008-01-03 17:29:44 UTC
Confirmed.  While I have a patch to make VN handle the case of type-punning
constants, making it to generate VIEW_CONVERT_EXPRs instead runs into the
principle barrier that FRE does not do insertion.

The patch looks like

Index: tree-ssa-sccvn.c
===================================================================
*** tree-ssa-sccvn.c    (revision 131302)
--- tree-ssa-sccvn.c    (working copy)
*************** visit_reference_op_store (tree lhs, tree
*** 1231,1236 ****
--- 1247,1292 ----
        }
 
        vn_reference_insert (lhs, op, vdefs);
+
+       /* For unions and constant stores we can register stores to
+        the other union members as well, allowing propagation.
+        Do this for two-element unions only, as they are likely
+        used for type-punning which we want to optimize.  */
+       if (TREE_CODE (lhs) == COMPONENT_REF
+         && TREE_CODE (TREE_TYPE (TREE_OPERAND (lhs, 0))) == UNION_TYPE
+         /* We may be able to handle non-constant type-punnings as well.  */
+         && (TREE_CODE (op) == INTEGER_CST
+             || TREE_CODE (op) == REAL_CST
+             || TREE_CODE (op) == VECTOR_CST
+             || TREE_CODE (op) == COMPLEX_CST)
+         && fields_length (TREE_TYPE (TREE_OPERAND (lhs, 0))) == 2)
+       {
+         tree field = TYPE_FIELDS (TREE_TYPE (TREE_OPERAND (lhs, 0)));
+         tree val;
+
+         if (field == TREE_OPERAND (lhs, 1))
+           field = TREE_CHAIN (field);
+
+         /* Do not generate VIEW_CONVERT_EXPRs, but only handle the
+            cases we can fold it away.  */
+         val = fold_unary (VIEW_CONVERT_EXPR, TREE_TYPE (field), op);
+         if (val)
+           {
+             if (dump_file)
+               {
+                 fprintf (dump_file, "Value numbering store to alternate "
+                          "union field ");
+                 print_generic_expr (dump_file, field, 0);
+                 fprintf (dump_file, " with type-punned value ");
+                 print_generic_expr (dump_file, val, 0);
+                 fprintf (dump_file, "\n");
+               }
+             vn_reference_insert (build3 (COMPONENT_REF, TREE_TYPE (field),
+                                          TREE_OPERAND (lhs, 0), field,
+                                          NULL_TREE),
+                                  val, vdefs);
+           }
+       }
      }
    else
      {

and it handles the testcase in comment #6 correctly.
Comment 10 Richard Biener 2008-01-04 15:35:59 UTC
We can 'abuse' PREs inserting of fake stores machinery to handle the case in
PR33989, where for

union a
{
  int i;
  float f;
};

void f (float *a, int *b)
{
  union a c;
  c.f = *a;
  *b = c.i;
}

we then get (PRE)

<bb 2>:
  # VUSE <SMT.6_5(D)>
  D.1549_2 = *a_1(D);
  # SFT.0_8 = VDEF <SFT.0_6(D)>
  # SFT.1_9 = VDEF <SFT.1_7(D)>
  c.f = D.1549_2;
  uniontmp.12_13 = VIEW_CONVERT_EXPR<int>(D.1549_2);
  D.1550_3 = uniontmp.12_13;
  # SMT.7_11 = VDEF <SMT.7_10(D)>
  *b_4(D) = D.1550_3;
  return;

and before expand:

<bb 2>:
  # SMT.7_11 = VDEF <SMT.7_10(D)>
  *b = VIEW_CONVERT_EXPR<int>(*a);
  return;

and finally

f:
.LFB2:
        movl    (%rdi), %eax
        movl    %eax, (%rsi)
        ret

but we need to make sure VN processes the inserted conversions (which do
not have uses) and we need to enhance PREs poolification of trees to
handle COMPONENT_REFs (and possibly more).
Comment 11 Richard Biener 2008-01-04 15:38:12 UTC
Created attachment 14879 [details]
patch prototype

So this patch certainly will ICE in a lot of cases, I wonder if poolifying the
inserted code is worth the hassle.  It also shows that PRE of loads/stores is
probably defect for all but indirect references.
Comment 12 Richard Biener 2008-03-14 14:52:56 UTC
Subject: Bug 34043

Author: rguenth
Date: Fri Mar 14 14:52:07 2008
New Revision: 133218

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=133218
Log:
2008-03-14  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/34043
	PR tree-optimization/33989
	* tree-ssa-pre.c (execute_pre): Allow SCCVN to do insertion
	when doing FRE.
	(bitmap_find_leader): Use extra argument to verify dominance
	relationship inside a basic-block.
	(can_PRE_operation): Add VIEW_CONVERT_EXPR.
	(find_leader_in_sets): Adjust.
	(create_component_ref_by_pieces): Take extra argument for
	dominance check, handle lookup failures.
	(find_or_generate_expression): Likewise.
	(create_expression_by_pieces): Likewise.
	(insert_into_preds_of_block): Adjust.
	(create_value_expr_from): If asked for, verify all operands
	are in the blocks AVAIL_OUT set.
	(make_values_for_stmt): Check for SSA_NAMEs that are life
	over an abnormal edge.
	(compute_avail): Remove such check.
	(do_SCCVN_insertion): New function.
	(eliminate): If we do not find a leader suitable for replacement
	insert a replacement expression from SCCVN if available.
	* tree-ssa-sccvn.h (run_scc_vn): Update prototype.
	(struct vn_ssa_aux): Add needs_insertion flag.
	* tree-ssa-sccvn.c (may_insert): New global flag.
	(copy_reference_ops_from_ref): Value-number union member access
	based on its size, not type and member if insertion is allowed.
	(visit_reference_op_load): For a weak match from union type
	punning lookup a view-converted value and insert a SSA_NAME
	for that value if that is not found.
	(visit_use): Make dumps shorter.  Do not disallow value numbering
	SSA_NAMEs that are life over an abnormal edge to constants.
	(free_scc_vn): Release inserted SSA_NAMEs.
	(run_scc_vn): New flag to specify whether insertion is allowed.
	Process SSA_NAMEs in forward order.
	* tree-ssa-loop-im.c (for_each_index): Handle invariant
	ADDR_EXPRs inside VIEW_CONVERT_EXPR.
	* fold-const.c (fold_unary): Fold VIEW_CONVERT_EXPRs from/to
	pointer type to/from integral types that do not change the
	precision to regular conversions.

	* gcc.dg/tree-ssa/ssa-fre-7.c: New testcase.
	* gcc.dg/tree-ssa/ssa-fre-8.c: Likewise.
	* gcc.dg/tree-ssa/ssa-fre-9.c: Likewise.
	* gcc.dg/tree-ssa/ssa-fre-10.c: Likewise.
	* gcc.dg/tree-ssa/ssa-pre-17.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-10.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-7.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-8.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-9.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-17.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/fold-const.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-loop-im.c
    trunk/gcc/tree-ssa-pre.c
    trunk/gcc/tree-ssa-sccvn.c
    trunk/gcc/tree-ssa-sccvn.h

Comment 13 Richard Biener 2008-03-14 16:16:34 UTC
Fixed.