The following testcase was extracted from the benchmark file mrXZRectangle.cc.

$ cat test.cc
// g++ -m64 -O2 -mcpu=power7 -S test.cc
class ggVector3 {
public:
  ggVector3() { e[0] = 1.0; e[1] = e[2] = 0.0; }
  ggVector3(double e0, double e1, double e2) { e[0] = e0; e[1] = e1; e[2] = e2; }
  double e[3];
};

class ggONB3 {
public:
  ggONB3() { }
  ggONB3(const ggVector3& a, const ggVector3& b, const ggVector3& c)
  { U = a; V = b; W = c; }
private:
  ggVector3 U, V, W;
};

class mrViewingHitRecord {
public:
  ggONB3 UVW;
};

void foo(mrViewingHitRecord& VHR)
{
  VHR.UVW = ggONB3(ggVector3(1, 0, 0),
                   ggVector3(0, 0, -1),
                   ggVector3(0, 1, 0));
}

Revision 185335 produces a tight sequence of 8-byte stores:

        stfd 0,8(3)
        stfd 0,16(3)
        stfd 0,24(3)
        stfd 0,32(3)
        ...

Revision 185336, which added a check for SLOW_UNALIGNED_ACCESS, now takes the
path through store_bit_field(), which generates a much larger sequence of
byte stores:

        stb 8,0(3)
        stb 10,1(3)
        stb 9,2(3)
        stb 9,3(3)
        stb 9,4(3)
        stb 9,5(3)
        ...
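For reference, a hedged C++ sketch (not actual GCC output) of what the
byte-store path effectively computes; a single stfd per double would suffice
here because the record really is 8-byte aligned:

#include <cstdint>

// Hedged sketch: the effect of the stb sequence when store_bit_field()
// only trusts 1-byte alignment.  Each 8-byte value is stored one byte at
// a time, most significant byte first (big-endian POWER layout).
void byte_wise_store (unsigned char *p, uint64_t bits)
{
  for (int i = 7; i >= 0; --i)
    {
      p[i] = (unsigned char) bits;   // one stb per byte
      bits >>= 8;
    }
}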
My main question on this is why the MEM_REF has an alignment of 8 bits
(i.e. only one byte)?

Breakpoint 1, expand_assignment (to=0xfffafed2790, from=<optimized out>,
    nontemporal=<optimized out>)
    at /home/pthaugen/src/gcc/temp3/gcc/gcc/expr.c:4653
4653              store_bit_field (mem, GET_MODE_BITSIZE (mode),
(gdb) prt to
 <mem_ref 0xfffafed2790
    type <real_type 0xfffb0057b60 double DF
        size <integer_cst 0xfffafe125e0 constant 64>
        unit size <integer_cst 0xfffafe12600 constant 8>
        align 8 symtab 0 alias set 2 canonical type 0xfffaff00f18 precision 64>
    arg 0 <ssa_name 0xfffb00e1d60
        type <reference_type 0xfffb0055f28
            type <record_type 0xfffb0055d30 mrViewingHitRecord>
            public unsigned DI
            size <integer_cst 0xfffafe125e0 64>
            unit size <integer_cst 0xfffafe12600 8>
            align 64 symtab 0 alias set 6 canonical type 0xfffb0055f28>
        visited var <parm_decl 0xfffafe41980 VHR>
        def_stmt GIMPLE_NOP
        version 1
        ptr-info 0xfffafe9af80>
    arg 1 <integer_cst 0xfffafe19b80 type <reference_type 0xfffb0055f28> constant 0>
    test.cc:29:44>
(gdb) pr mem
(mem:DF (reg/v/f:DI 120 [ VHR ])
    [5 MEM[(struct mrViewingHitRecord &)VHR_1(D)]+0 S8 A8])
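Note that alignments in these dumps are in bits: "align 8" on the MEM_REF
type, and "A8" on the RTL mem, both mean 1-byte alignment, while the declared
types would allow 64-bit (8-byte) alignment.  A minimal standalone check of
the source-level alignment (a hypothetical snippet, needs -std=c++11; the
classes are trimmed from the testcase in comment #1):

#include <cstddef>

// Trimmed from the testcase in comment #1; compile-time checks only.
class ggVector3 { public: double e[3]; };
class ggONB3 { ggVector3 U, V, W; };
class mrViewingHitRecord { public: ggONB3 UVW; };

static_assert (alignof (mrViewingHitRecord) == 8,
               "all members are doubles, so the record is 8-byte aligned");
static_assert (alignof (double) * 8 == 64,
               "i.e. TYPE_ALIGN == 64 bits, not the 8 bits in the MEM_REF");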
(gdb) prt to
 <mem_ref 0xfffafed2790
    type <real_type 0xfffb0057b60 double DF
        size <integer_cst 0xfffafe125e0 constant 64>
        unit size <integer_cst 0xfffafe12600 constant 8>
        align 8 symtab 0 alias set 2 canonical type 0xfffaff00f18 precision 64>

I suppose SRA does this.  Can you check with -fno-tree-sra?

The code in SRA, which is

  /* If prev_base were always an originally performed access we can extract
     more optimistic alignment information by looking at the access mode.
     That would constrain the alignment of base + base_offset which we would
     need to adjust according to offset.  */
  align = get_pointer_alignment_1 (base, &misalign);
  if (misalign == 0
      && (TREE_CODE (prev_base) == MEM_REF
          || TREE_CODE (prev_base) == TARGET_MEM_REF))
    align = MAX (align, TYPE_ALIGN (TREE_TYPE (prev_base)));
  misalign += (double_int_sext (tree_to_double_int (off),
                                TYPE_PRECISION (TREE_TYPE (off))).low
               * BITS_PER_UNIT);
  misalign = misalign & (align - 1);
  if (misalign != 0)
    align = (misalign & -misalign);
  if (align < TYPE_ALIGN (exp_type))
    exp_type = build_aligned_type (exp_type, align);

only looks at the conservative alignment in this case (prev_base is not a
MEM_REF here, but a COMPONENT_REF).  I think Martin has a patch to fix this.
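A hedged walk-through of how the quoted code arrives at a 1-byte-aligned type
in this case; the concrete values are inferred from the dumps in comment #1,
not taken from an actual debug session:

// Hedged sketch (values inferred, not traced) of the alignment narrowing
// the quoted SRA code performs for the store at offset 0.
static unsigned int narrowed_align_bits ()
{
  unsigned int align = 8;     // get_pointer_alignment_1: only the
                              // conservative BITS_PER_UNIT alignment is known
  unsigned int misalign = 0;
  // prev_base is a COMPONENT_REF, so the optimistic
  // align = MAX (align, TYPE_ALIGN (TREE_TYPE (prev_base))) branch is skipped.
  misalign += 0 * 8;          // off == 0 for the first store, BITS_PER_UNIT == 8
  misalign &= align - 1;      // still 0, so align stays at 8 bits
  // TYPE_ALIGN (double) == 64 > align == 8, so build_aligned_type wraps the
  // double in a 1-byte-aligned variant: the "align 8" MEM_REF above.
  return align;
}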
(In reply to comment #2)
> (gdb) prt to
>  <mem_ref 0xfffafed2790
>     type <real_type 0xfffb0057b60 double DF
>         size <integer_cst 0xfffafe125e0 constant 64>
>         unit size <integer_cst 0xfffafe12600 constant 8>
>         align 8 symtab 0 alias set 2 canonical type 0xfffaff00f18 precision 64>
>
> I suppose SRA does this.  Can you check with -fno-tree-sra?

With -fno-tree-sra both revisions generate the same code: a sequence of
8-byte stores to build up the rhs, followed by a call to memcpy for the
entire structure copy.  At expand time we now have:

  struct ggVector3 D.2563;
  struct ggVector3 D.2562;
  struct ggVector3 D.2561;
  struct ggONB3 D.2573;

  # BLOCK 2 freq:10000
  # PRED: ENTRY [100.0%]  (fallthru,exec)
  D.2561.e[0] = 1.0e+0;
  D.2561.e[1] = 0.0;
  D.2561.e[2] = 0.0;
  D.2562.e[0] = 0.0;
  D.2562.e[1] = 0.0;
  D.2562.e[2] = -1.0e+0;
  D.2563.e[0] = 0.0;
  D.2563.e[1] = 1.0e+0;
  D.2563.e[2] = 0.0;
  D.2573.U = D.2561;
  D.2573.V = D.2562;
  D.2573.W = D.2563;
  VHR_1(D)->UVW = D.2573;

Whereas with SRA we had:

  MEM[(struct mrViewingHitRecord &)VHR_1(D)] = 1.0e+0;
  MEM[(struct mrViewingHitRecord &)VHR_1(D) + 8] = 0.0;
  MEM[(struct mrViewingHitRecord &)VHR_1(D) + 16] = 0.0;
  MEM[(struct mrViewingHitRecord &)VHR_1(D) + 24] = 0.0;
  MEM[(struct mrViewingHitRecord &)VHR_1(D) + 32] = 0.0;
  MEM[(struct mrViewingHitRecord &)VHR_1(D) + 40] = -1.0e+0;
  MEM[(struct mrViewingHitRecord &)VHR_1(D) + 48] = 0.0;
  MEM[(struct mrViewingHitRecord &)VHR_1(D) + 56] = 1.0e+0;
  MEM[(struct mrViewingHitRecord &)VHR_1(D) + 64] = 0.0;
(In reply to comment #2)
> only looks at the conservative alignment in this case (prev_base is not a
> MEM_REF here, but a COMPONENT_REF).  I think Martin has a patch to fix
> this.

Unfortunately, http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00206.html does
not help.  I'm going to look at this in more detail, then...
But when I add the following line to the patch linked above, the problem
goes away:

Index: src/gcc/tree-sra.c
===================================================================
--- src.orig/gcc/tree-sra.c
+++ src/gcc/tree-sra.c
@@ -1455,6 +1455,7 @@ build_ref_for_offset (location_t loc, tr
     }
   else if (TREE_CODE (base) == MEM_REF)
     {
+      prev_base = base;
       off = build_int_cst (TREE_TYPE (TREE_OPERAND (base, 1)),
                            base_offset + offset / BITS_PER_UNIT);
       off = int_const_binop (PLUS_EXPR, TREE_OPERAND (base, 1), off);
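Presumably this helps because prev_base is now a MEM_REF, so the optimistic
branch of the code quoted in comment #2 fires.  A hedged sketch continuing
the earlier walk-through; the 64-bit TYPE_ALIGN value is an assumption about
this testcase, not a traced value:

#include <algorithm>

// Hedged sketch: with TREE_CODE (prev_base) == MEM_REF the optimistic
// branch of the quoted SRA code is now taken.
static unsigned int align_bits_after_fix ()
{
  unsigned int align = 8;        // conservative pointer alignment, as before
  unsigned int misalign = 0;
  // misalign == 0 and prev_base is a MEM_REF, so the MAX branch runs;
  // TYPE_ALIGN (TREE_TYPE (prev_base)) is assumed to be 64 bits here.
  align = std::max (align, 64u);
  misalign = (misalign + 0 * 8) & (align - 1);   // off == 0: still 0
  return align;   // 64 bits, so the expander can emit stfd again
}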
I'm assigning this to myself.  After I commit the aforementioned patch
(hopefully very soon), I'll work on a follow-up similar to the above that
will also handle some situations in which get_addr_base_and_unit_offset in
build_ref_for_offset returns NULL.
I observed similar behavior on loads (8-byte loads transformed into a
sequence of byte loads/shifts/ors) with r185470, the unaligned read patch.
This also caused some sizable degradation on a few benchmarks.
-fno-tree-sra had no effect on the unaligned load case.

Reduced testcase:

// g++ -S -m64 -O2 -mcpu=power7 load.cc
typedef int ggBoolean;
class mrSurface { };

template <class T>
class ggTrain {
public:
  ggBoolean Append (T item);
  T operator[] (int i) const { return data[i]; }
private:
  T *data;
  int nData;
};

class mrSurfaceList : public mrSurface {
  virtual void viewingHit () const;
protected:
  ggTrain<mrSurface *> surfaces;
};

extern mrSurface *sPtr;

void mrSurfaceList::viewingHit () const
{
  sPtr = surfaces[1];
}

In the 150r.expand dump:

  D.2562_3 = MEM[(const struct ggTrain *)this_1(D) + 8B];
  D.2560_11 = MEM[(struct mrSurface * *)D.2562_3 + 8B];
  sPtr = D.2560_11;

The first MEM[...] gets expanded to the sequence of lbzs/shifts/ors.
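For illustration, a hedged C++ sketch (not actual compiler output) of the
value the lbz/shift/or sequence reconstructs for the first load above; a
single 8-byte load would do the same job when the pointer is in fact
properly aligned:

#include <cstdint>

// Hedged sketch: a big-endian 64-bit load done byte-wise, which is what
// the expander falls back to when it only trusts 1-byte alignment.
uint64_t byte_wise_load (const unsigned char *p)
{
  uint64_t v = 0;
  for (int i = 0; i < 8; ++i)
    v = (v << 8) | p[i];   // one lbz, one shift, one or per byte
  return v;
}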
The last case is caused by IPA-SRA; -fno-ipa-sra helps, I believe.  All
prerequisite patches are committed now, so I'll concentrate on this and
hopefully come up with a patch soon.
Please add new testcases -- hopefully generic tests -- in the common part of the testsuite to cover this type of regression.
Created attachment 28181 [details]
Reduced testcase

Martin,
Have you done any more digging on this?  I just discovered that the cpu2006
benchmark 471.omnetpp suffers from the same problem (8-byte loads turned into
a sequence of byte loads/shifts/ors).  It causes a 12% degradation on PowerPC,
and it also goes away when -fno-ipa-sra is added.  I'm attaching a reduced
testcase that can be compiled with g++ -S -O3 -m64 -mcpu=power7.
(In reply to comment #10)
> Martin,
> Have you done any more digging on this?  I just discovered that the cpu2006
> benchmark 471.omnetpp suffers from the same problem (8-byte loads turned
> into a sequence of byte loads/shifts/ors).  It causes a 12% degradation on
> PowerPC, and it also goes away when -fno-ipa-sra is added.  I'm attaching a
> reduced testcase that can be compiled with g++ -S -O3 -m64 -mcpu=power7.

Thanks for the reduced testcase.  This PR is still on my radar, but so far
other issues have had higher priority.  On the other hand, I agree I have not
looked at this for too long; I'll do my best to look at it soon.  I can't
promise much in the next 2-3 weeks, though.
I have proposed a patch on the mailing list: http://gcc.gnu.org/ml/gcc-patches/2012-11/msg02265.html
Author: jamborm
Date: Fri Nov 30 16:11:33 2012
New Revision: 193998

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193998
Log:
2012-11-30  Martin Jambor  <mjambor@suse.cz>

        PR middle-end/52890
        PR tree-optimization/55415
        PR tree-optimization/54386
        PR target/55448
        * ipa-prop.c (ipa_modify_call_arguments): Be optimistic when
        get_pointer_alignment_1 returns false and the base was not a
        dereference.
        * tree-sra.c (access_precludes_ipa_sra_p): New parameter req_align,
        added check for required alignment.  Update the user.

        * testsuite/gcc.dg/ipa/ipa-sra-7.c: New test.
        * testsuite/gcc.dg/ipa/ipa-sra-8.c: Likewise.
        * testsuite/gcc.dg/ipa/ipa-sra-9.c: Likewise.
        * testsuite/gcc.target/i386/pr55448.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.dg/ipa/ipa-sra-7.c
    trunk/gcc/testsuite/gcc.dg/ipa/ipa-sra-8.c
    trunk/gcc/testsuite/gcc.dg/ipa/ipa-sra-9.c
    trunk/gcc/testsuite/gcc.target/i386/pr55448.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-prop.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-sra.c
Finally fixed.  Thanks for your patience.