Bug 52890 - Revision 185336 causes 10% degradation on cpu2000 benchmark 252.eon
Summary: Revision 185336 causes 10% degradation on cpu2000 benchmark 252.eon
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 4.8.0
Importance: P3 normal
Target Milestone: ---
Assignee: Martin Jambor
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-04-06 16:59 UTC by Pat Haugen
Modified: 2012-12-03 13:25 UTC
CC List: 3 users

See Also:
Host: powerpc64-linux
Target: powerpc64-linux
Build: powerpc64-linux
Known to work:
Known to fail:
Last reconfirmed: 2012-04-10 00:00:00


Attachments
Reduced testcase (332 bytes, text/plain)
2012-09-12 23:04 UTC, Pat Haugen

Description Pat Haugen 2012-04-06 16:59:10 UTC
Following testcase extracted from benchmark file mrXZRectangle.cc.

$ cat test.cc
// g++ -m64 -O2 -mcpu=power7 -S test.cc
class ggVector3 {
public:
    ggVector3() {e[0] = 1.0; e[1] = e[2] = 0.0; }
    ggVector3(double e0, double e1, double e2) {
                   e[0] = e0; e[1] = e1; e[2] = e2; }
    double e[3];
};

class ggONB3 {
public:
    ggONB3() {  }
    ggONB3(const ggVector3& a, const ggVector3& b, const ggVector3& c) {
         U = a; V = b; W = c; }
private:
    ggVector3 U,V,W;
};

class mrViewingHitRecord {
public:
    ggONB3 UVW;
};

void foo(mrViewingHitRecord& VHR)
{

      VHR.UVW = ggONB3(ggVector3 (1, 0, 0),
                       ggVector3 (0, 0, -1),
                       ggVector3 (0, 1, 0));
}

Revision 185335 produces a tight sequence of 8-byte stores:

        stfd 0,8(3)
        stfd 0,16(3)
        stfd 0,24(3)
        stfd 0,32(3)
        ...

Revision 185336, with its additional check for SLOW_UNALIGNED_ACCESS, now takes the path through store_bit_field(), which generates a much larger sequence of byte stores.

        stb 8,0(3)
        stb 10,1(3)
        stb 9,2(3)
        stb 9,3(3)
        stb 9,4(3)
        stb 9,5(3)
        ...
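
At the source level, the two sequences are equivalent to the following sketch (illustration only, not GCC output; the function names are made up):

#include <cstring>

// One aligned 8-byte store, as in the r185335 output (a single stfd).
void store_aligned (double *dst, double v)
{
  *dst = v;
}

// The conservative fallback taken via store_bit_field() when the
// destination is believed to be only byte-aligned: the value is written
// out one byte at a time, matching the run of stb instructions above.
void store_bytewise (unsigned char *dst, double v)
{
  unsigned char bytes[sizeof v];
  std::memcpy (bytes, &v, sizeof v);
  for (unsigned i = 0; i < sizeof v; ++i)
    dst[i] = bytes[i];
}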
Comment 1 Pat Haugen 2012-04-06 21:13:04 UTC
My main question here is why the MEM_REF has an alignment of only 8 bits (i.e. byte alignment).

Breakpoint 1, expand_assignment (to=0xfffafed2790, from=<optimized out>, 
    nontemporal=<optimized out>) at /home/pthaugen/src/gcc/temp3/gcc/gcc/expr.c:4653
4653            store_bit_field (mem, GET_MODE_BITSIZE (mode),
(gdb) prt to
 <mem_ref 0xfffafed2790
    type <real_type 0xfffb0057b60 double DF
        size <integer_cst 0xfffafe125e0 constant 64>
        unit size <integer_cst 0xfffafe12600 constant 8>
        align 8 symtab 0 alias set 2 canonical type 0xfffaff00f18 precision 64>
   
    arg 0 <ssa_name 0xfffb00e1d60
        type <reference_type 0xfffb0055f28 type <record_type 0xfffb0055d30 mrViewingHitRecord>
            public unsigned DI size <integer_cst 0xfffafe125e0 64> unit size <integer_cst 0xfffafe12600 8>
            align 64 symtab 0 alias set 6 canonical type 0xfffb0055f28>
        visited var <parm_decl 0xfffafe41980 VHR>def_stmt GIMPLE_NOP

        version 1
        ptr-info 0xfffafe9af80>
    arg 1 <integer_cst 0xfffafe19b80 type <reference_type 0xfffb0055f28> constant 0>
    test.cc:29:44>
(gdb) pr mem
(mem:DF (reg/v/f:DI 120 [ VHR ]) [5 MEM[(struct mrViewingHitRecord &)VHR_1(D)]+0 S8 A8])
Comment 2 Richard Biener 2012-04-10 12:06:26 UTC
(gdb) prt to
 <mem_ref 0xfffafed2790
    type <real_type 0xfffb0057b60 double DF
        size <integer_cst 0xfffafe125e0 constant 64>
        unit size <integer_cst 0xfffafe12600 constant 8>
        align 8 symtab 0 alias set 2 canonical type 0xfffaff00f18 precision 64>

I suppose SRA does this.  Can you check with -fno-tree-sra?

The code in SRA, which is

  /* If prev_base were always an originally performed access
     we can extract more optimistic alignment information
     by looking at the access mode.  That would constrain the
     alignment of base + base_offset which we would need to
     adjust according to offset.  */
  align = get_pointer_alignment_1 (base, &misalign);
  if (misalign == 0
      && (TREE_CODE (prev_base) == MEM_REF
          || TREE_CODE (prev_base) == TARGET_MEM_REF))
    align = MAX (align, TYPE_ALIGN (TREE_TYPE (prev_base)));
  misalign += (double_int_sext (tree_to_double_int (off),
                                TYPE_PRECISION (TREE_TYPE (off))).low
               * BITS_PER_UNIT);
  misalign = misalign & (align - 1);
  if (misalign != 0)
    align = (misalign & -misalign);
  if (align < TYPE_ALIGN (exp_type))
    exp_type = build_aligned_type (exp_type, align);

only looks at the conservative alignment in this case (prev_base is not a
MEM_REF, but a COMPONENT_REF, so the optimistic MAX against its TYPE_ALIGN
never applies).  I think Martin has a patch to fix this.
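
To make the arithmetic above easier to follow, here is a standalone restatement using plain integers instead of trees and double_ints (an illustrative sketch, not GCC source; the function and parameter names are invented):

/* All quantities are in bits except offset_bytes.  */
static unsigned int
sketch_ref_alignment (unsigned int align,      /* provable alignment of base */
                      unsigned int misalign,   /* provable misalignment of base */
                      long offset_bytes,       /* offset of the new access */
                      unsigned int type_align) /* TYPE_ALIGN (exp_type) */
{
  /* Fold the access offset into the misalignment of the base.  */
  misalign += offset_bytes * 8;                /* BITS_PER_UNIT */
  misalign &= align - 1;
  /* A nonzero misalignment caps the alignment at the largest power of
     two that still divides it.  */
  if (misalign != 0)
    align = misalign & -misalign;
  /* If that is weaker than the type's own alignment, SRA builds an
     under-aligned variant of the type (build_aligned_type), which is how
     the MEM_REF in comment #1 ends up with "align 8".  */
  return align < type_align ? align : type_align;
}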
Comment 3 Pat Haugen 2012-04-10 16:13:49 UTC
(In reply to comment #2)
> (gdb) prt to
>  <mem_ref 0xfffafed2790
>     type <real_type 0xfffb0057b60 double DF
>         size <integer_cst 0xfffafe125e0 constant 64>
>         unit size <integer_cst 0xfffafe12600 constant 8>
>         align 8 symtab 0 alias set 2 canonical type 0xfffaff00f18 precision 64>
> 
> I suppose SRA does this.  Can you check with -fno-tree-sra?

With -fno-tree-sra they both generate the same code, a sequence of 8-byte stores to build up the rhs, and then a call to memcpy for the entire structure copy.

At expand phase we now have:

  struct ggVector3 D.2563;
  struct ggVector3 D.2562;
  struct ggVector3 D.2561;
  struct ggONB3 D.2573;

  # BLOCK 2 freq:10000
  # PRED: ENTRY [100.0%]  (fallthru,exec)
  D.2561.e[0] = 1.0e+0;
  D.2561.e[1] = 0.0;
  D.2561.e[2] = 0.0;
  D.2562.e[0] = 0.0;
  D.2562.e[1] = 0.0;
  D.2562.e[2] = -1.0e+0;
  D.2563.e[0] = 0.0;
  D.2563.e[1] = 1.0e+0;
  D.2563.e[2] = 0.0;
  D.2573.U = D.2561;
  D.2573.V = D.2562;
  D.2573.W = D.2563;
  VHR_1(D)->UVW = D.2573;


Whereas with SRA we had:

  MEM[(struct mrViewingHitRecord &)VHR_1(D)] = 1.0e+0;
  MEM[(struct mrViewingHitRecord &)VHR_1(D) + 8] = 0.0;
  MEM[(struct mrViewingHitRecord &)VHR_1(D) + 16] = 0.0;
  MEM[(struct mrViewingHitRecord &)VHR_1(D) + 24] = 0.0;
  MEM[(struct mrViewingHitRecord &)VHR_1(D) + 32] = 0.0;
  MEM[(struct mrViewingHitRecord &)VHR_1(D) + 40] = -1.0e+0;
  MEM[(struct mrViewingHitRecord &)VHR_1(D) + 48] = 0.0;
  MEM[(struct mrViewingHitRecord &)VHR_1(D) + 56] = 1.0e+0;
  MEM[(struct mrViewingHitRecord &)VHR_1(D) + 64] = 0.0;
Comment 4 Martin Jambor 2012-04-11 14:30:56 UTC
(In reply to comment #2)
> only looks at conservative alignment in this case (prev_base is not a base,
> but a COMPONENT_REF).  I think martin has a patch to fix this.

Unfortunately, http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00206.html
does not help.  I'm going to look at this in more detail then...
Comment 5 Martin Jambor 2012-04-11 15:53:44 UTC
But when I add the following line to the patch linked above, the
problem goes away:

Index: src/gcc/tree-sra.c
===================================================================
--- src.orig/gcc/tree-sra.c
+++ src/gcc/tree-sra.c
@@ -1455,6 +1455,7 @@ build_ref_for_offset (location_t loc, tr
     }
   else if (TREE_CODE (base) == MEM_REF)
     {
+      prev_base = base;
       off = build_int_cst (TREE_TYPE (TREE_OPERAND (base, 1)),
                           base_offset + offset / BITS_PER_UNIT);
       off = int_const_binop (PLUS_EXPR, TREE_OPERAND (base, 1), off);
Comment 6 Martin Jambor 2012-04-11 16:06:31 UTC
I'm assigning this to myself.  After I commit the aforementioned patch
(hopefully very soon), I'll work on a follow-up similar to the above
that will also handle some situations in which
get_addr_base_and_unit_offset in build_ref_for_offset returns NULL.
Comment 7 Pat Haugen 2012-04-11 18:34:55 UTC
I observed similar behavior on loads (8-byte loads transformed into a sequence of byte loads/shifts/ors) with r185470, the unaligned read patch. This caused sizable degradation on a few benchmarks as well. -fno-tree-sra had no effect on the unaligned load case.

Reduced testcase:

// g++ -S -m64 -O2 -mcpu=power7 load.cc
typedef int ggBoolean;

class mrSurface
{
};

template <class T> class ggTrain
{
public:
  ggBoolean Append (T item);
  T operator[] (int i) const
  {
    return data[i];
  }

private:
  T *data;
  int nData;
};

class mrSurfaceList : public mrSurface
{
  virtual void viewingHit () const;

protected:
  ggTrain<mrSurface *> surfaces;
};

extern mrSurface *sPtr;

void
mrSurfaceList::viewingHit () const
{
  sPtr = surfaces[1];
}



150r.expand:

  D.2562_3 = MEM[(const struct ggTrain *)this_1(D) + 8B];
  D.2560_11 = MEM[(struct mrSurface * *)D.2562_3 + 8B];
  sPtr = D.2560_11;


The first MEM[...] gets expanded to the sequence of lbzs/shifts/ors.
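
What that expansion amounts to at the source level is roughly the following (illustration only, not GCC output; the function names are made up):

#include <cstdint>

// One aligned 8-byte load (a single ld on powerpc64).
std::uint64_t load_aligned (const std::uint64_t *p)
{
  return *p;
}

// The byte-wise fallback used when the pointer is assumed to be only
// byte-aligned: eight lbz loads combined with shifts and ors
// (big-endian byte order, as on powerpc64 here).
std::uint64_t load_bytewise (const unsigned char *p)
{
  std::uint64_t v = 0;
  for (unsigned i = 0; i < 8; ++i)
    v = (v << 8) | p[i];
  return v;
}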
Comment 8 Martin Jambor 2012-04-17 15:45:47 UTC
The last case is caused by IPA-SRA; -fno-ipa-sra helps, I believe.  All
prerequisite patches are committed now, so I'll concentrate on this and
hopefully come up with a patch soon.
Comment 9 David Edelsohn 2012-09-12 15:23:27 UTC
Please add new testcases -- hopefully generic tests -- in the common part of the testsuite to cover this type of regression.
Comment 10 Pat Haugen 2012-09-12 23:04:55 UTC
Created attachment 28181
Reduced testcase

Martin,
  Have you done any more digging on this? I just discovered that cpu2006 benchmark 471.omnetpp suffers from the same problem (8-byte loads turned into a sequence of byte loads/shifts/ors). It causes a 12% degradation on PowerPC and also goes away when -fno-ipa-sra is added. I'm attaching a reduced testcase that can be compiled with g++ -S -O3 -m64 -mcpu=power7.
Comment 11 Martin Jambor 2012-09-14 13:55:04 UTC
(In reply to comment #10)
> 
> Martin,
>   Have you done any more digging on this? I just discovered that cpu2006
> benchmark 471.omnetpp suffers from the same problem (8-byte loads turned into
> a sequence of byte loads/shifts/ors). It causes a 12% degradation on PowerPC and
> also goes away when -fno-ipa-sra is added. I'm attaching a reduced testcase
> that can be compiled with g++ -S -O3 -m64 -mcpu=power7.

Thanks for the reduced testcase.  This PR is still on my radar, but so
far other issues have had higher priority.  On the other hand, I agree
I have not looked at this in too long; I'll do my best to look at it
soon.  I can't promise much in the next 2-3 weeks, though.
Comment 12 Martin Jambor 2012-11-27 20:47:21 UTC
I have proposed a patch on the mailing list:

http://gcc.gnu.org/ml/gcc-patches/2012-11/msg02265.html
Comment 13 Martin Jambor 2012-11-30 16:11:41 UTC
Author: jamborm
Date: Fri Nov 30 16:11:33 2012
New Revision: 193998

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193998
Log:
2012-11-30  Martin Jambor  <mjambor@suse.cz>

	PR middle-end/52890
	PR tree-optimization/55415
	PR tree-optimization/54386
	PR target/55448
	* ipa-prop.c (ipa_modify_call_arguments): Be optimistic when
	get_pointer_alignment_1 returns false and the base was not a
	dereference.
	* tree-sra.c (access_precludes_ipa_sra_p): New parameter req_align,
	added check for required alignment.  Update the user.

	* testsuite/gcc.dg/ipa/ipa-sra-7.c: New test.
	* testsuite/gcc.dg/ipa/ipa-sra-8.c: Likewise.
	* testsuite/gcc.dg/ipa/ipa-sra-9.c: Likewise.
	* testsuite/gcc.target/i386/pr55448.c: Likewise.


Added:
    trunk/gcc/testsuite/gcc.dg/ipa/ipa-sra-7.c
    trunk/gcc/testsuite/gcc.dg/ipa/ipa-sra-8.c
    trunk/gcc/testsuite/gcc.dg/ipa/ipa-sra-9.c
    trunk/gcc/testsuite/gcc.target/i386/pr55448.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-prop.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-sra.c
Comment 14 Martin Jambor 2012-12-03 13:25:26 UTC
Finally fixed.  Thanks for your patience.