43846 – [4.5 Regression] array vs members, total scalarization issues

Bug 43846 - [4.5 Regression] array vs members, total scalarization issues

Summary: [4.5 Regression] array vs members, total scalarization issues

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	4.5.0

Importance:	P3 normal
Target Milestone:	4.5.1
Assignee:	Martin Jambor

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2010-04-22 06:25 UTC by tbp
Modified:	2010-05-24 11:04 UTC (History)
CC List:	6 users (show)

See Also:
Host:
Target:
Build:
Known to work:	4.4.1
Known to fail:	4.5.0
Last reconfirmed:	2010-04-22 12:35:41

Attachments
Proposed fix (1.96 KB, patch) 2010-04-22 17:18 UTC, Martin Jambor	Details \| Diff
parser.i from kdelibs. (249.33 KB, application/octet-stream) 2010-05-23 21:25 UTC, Pawel Sikora	Details
Show Obsolete (1) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description tbp 2010-04-22 06:25:41 UTC

Hello,

As instructed in http://gcc.gnu.org/ml/gcc/2010-04/msg00505.html i'm filing this.
I've noticed gcc 4.5 has more trouble than it used to to removed dead stores when dealing with arrays, as exemplified by
$ cat huh.cc
struct foo_t {
	float x, y, z;
	foo_t() {}
	foo_t(float a, float b, float c) : x(a),y(b),z(c) {}
	friend foo_t operator*(foo_t lhs, float s) { return foo_t(lhs.x*s, lhs.y*s, lhs.z*s); }
	friend float dot(foo_t lhs, foo_t rhs) { return lhs.x*rhs.x + lhs.y*rhs.y + lhs.z*rhs.z; }
};
struct bar_t {
	float m[3];
	bar_t() {}
	bar_t(float a, float b, float c) { m[0]=a; m[1]=b; m[2]=c; }
	friend bar_t operator*(bar_t lhs, float s) { return bar_t(lhs.m[0]*s, lhs.m[1]*s, lhs.m[2]*s); }
	friend float dot(bar_t lhs, bar_t rhs) { return lhs.m[0]*rhs.m[0] + lhs.m[1]*rhs.m[1] + lhs.m[2]*rhs.m[2]; }
};
namespace {
	template<typename T> float magsqr(T v) { return dot(v, v); }
	template<typename T> T norm(T v) { return v*(1/__builtin_sqrtf(magsqr(v))); }
}
void frob1(const foo_t &a, foo_t &b) { b = norm(a); }
void frob2(const bar_t &a, bar_t &b) { b = norm(a); }
int main() { return 0; }
$ /usr/local/gcc-4.5.0/bin/g++ -O2 -ffast-math -march=native huh.cc

Whereas frob1 and frob2 do not differ with, say, gcc 4.4.1, they now do.

$ /usr/local/gcc-4.5.0/bin/g++ -v
Using built-in specs.
COLLECT_GCC=/usr/local/gcc-4.5.0/bin/g++
COLLECT_LTO_WRAPPER=/usr/local/gcc-4.5.0/libexec/gcc/x86_64-unknown-linux-gnu/4.5.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr/local/gcc-4.5.0 --enable-languages=c,c++ --enable-threads=posix --disable-checking --disable-nls --with-system-zlib --disable-shared --enable-checking=none --disable-bootstrap --enable-mpfr --enable-gold
Thread model: posix
gcc version 4.5.0 (GCC) 
$ g++ -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.4.1-4ubuntu9' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 --program-suffix=-4.4 --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --disable-werror --with-arch-32=i486 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu9)

Comment 1 Richard Biener 2010-04-22 09:07:39 UTC

Hm, frob1 looks like

_Z5frob1RK5foo_tRS_:
.LFB18:
        movss   (%rdi), %xmm3
        movss   4(%rdi), %xmm2
        movaps  %xmm3, %xmm4
        movaps  %xmm2, %xmm0
        mulss   %xmm3, %xmm4
        movss   8(%rdi), %xmm1
        mulss   %xmm2, %xmm0
        addss   %xmm4, %xmm0
        movaps  %xmm1, %xmm4
        mulss   %xmm1, %xmm4
        addss   %xmm4, %xmm0
        rsqrtss %xmm0, %xmm4
        mulss   %xmm4, %xmm0
        mulss   %xmm4, %xmm0
        mulss   .LC1(%rip), %xmm4
        addss   .LC0(%rip), %xmm0
        mulss   %xmm4, %xmm0
        mulss   %xmm0, %xmm3
        mulss   %xmm0, %xmm2
        mulss   %xmm1, %xmm0
        movss   %xmm3, (%rsi)
        movss   %xmm2, 4(%rsi)
        movss   %xmm0, 8(%rsi)
        ret

and frob2 like

_Z5frob2RK5bar_tRS_:
.LFB19:
        movss   (%rdi), %xmm3
        movss   4(%rdi), %xmm2
        movaps  %xmm3, %xmm4
        movaps  %xmm2, %xmm0
        mulss   %xmm3, %xmm4
        movss   8(%rdi), %xmm1
        mulss   %xmm2, %xmm0
        addss   %xmm4, %xmm0
        movaps  %xmm1, %xmm4
        mulss   %xmm1, %xmm4
        addss   %xmm4, %xmm0
        rsqrtss %xmm0, %xmm4
        mulss   %xmm4, %xmm0
        mulss   %xmm4, %xmm0
        mulss   .LC1(%rip), %xmm4
        addss   .LC0(%rip), %xmm0
        mulss   %xmm4, %xmm0
        mulss   %xmm0, %xmm3
        mulss   %xmm0, %xmm2
        mulss   %xmm1, %xmm0
        movss   %xmm3, -24(%rsp)
        movss   %xmm2, -20(%rsp)
        movq    -24(%rsp), %rax
        movss   %xmm0, -16(%rsp)
        movq    %rax, (%rsi)
        movl    -16(%rsp), %eax
        movl    %eax, 8(%rsi)
        ret

so it's an aggregate copy that is not scalarized in frob2:

  b_1(D)->x = D.2444_20;
  b_1(D)->y = D.2443_19;
  b_1(D)->z = D.2442_18;
  return;

vs.

  D.2464.m[0] = D.2473_20;
  D.2464.m[1] = D.2472_19;
  D.2464.m[2] = D.2471_18;
  *b_1(D) = D.2464;
  return;

all inlining happens during early inlining and frob1 and frob2 are reasonably
similar after early inlining.

But then we have early SRA which does

;; Function void frob1(const foo_t&, foo_t&) (_Z5frob1RK5foo_tRS_)

Candidate (2452): D.2452
Candidate (2434): v
Candidate (2435): D.2435
Will attempt to totally scalarize D.2435 (UID: 2435):
Will attempt to totally scalarize D.2452 (UID: 2452):
Marking v offset: 0, size: 32:  to be replaced.
Marking v offset: 32, size: 32:  to be replaced.
Marking v offset: 64, size: 32:  to be replaced.
...

;; Function void frob2(const bar_t&, bar_t&) (_Z5frob2RK5bar_tRS_)

Candidate (2481): D.2481
Candidate (2464): D.2464
Candidate (2463): v
Marking v offset: 0, size: 32:  to be replaced.
Marking v offset: 32, size: 32:  to be replaced.
Marking v offset: 64, size: 32:  to be replaced.
...
! Disqualifying D.2464 - No scalar replacements to be created.

so it doesn't consider the struct with the array for total scalarization
for some reason.  Martin?

Comment 2 Martin Jambor 2010-04-22 12:35:40 UTC

(In reply to comment #1)
> 
> so it doesn't consider the struct with the array for total scalarization
> for some reason.  Martin?
> 

Well, that was a deliberate decision when fixing PR 42585 (see
type_consists_of_records_p).  The code is simpler because it does not
have to know how to iterate over the array index domain.

Of course, we can alleviate this restriction and learn how to iterate.
However, all the accesses for the whole array are already created,
that is not the issue.  The problem basically is that when we see the
sequence

  D.2035.m[0] = D.2044_20;
  D.2035.m[1] = D.2043_19;
  D.2035.m[2] = D.2042_18;
  *b_1(D) = D.2035;

(and there are no other accesses to D.2035) the condition that tries
to prevent us from creating unnecessary replacements kicks in and we
decide not to scalarize.  The intent of the current code (possibly
among other reasons) was to avoid going through a replacement when the
whole structure was then passed as an argument to a function and
similar situations.  But it should not be very difficult to change the
condition (in analyze_access_subtree) to handle both situations right.

Doing this, rather than total scalarization for arrays (which should
be only useful as a substitute for a copy propagation) should enable
us to handle even huge arrays.

I'll get to this right after dealing with PR 43835.

Comment 3 davidxl 2010-04-22 17:04:48 UTC

(In reply to comment #2)
> (In reply to comment #1)
> > 
> > so it doesn't consider the struct with the array for total scalarization
> > for some reason.  Martin?
> > 
> 
> Well, that was a deliberate decision when fixing PR 42585 (see
> type_consists_of_records_p).  The code is simpler because it does not
> have to know how to iterate over the array index domain.
> 
> Of course, we can alleviate this restriction and learn how to iterate.
> However, all the accesses for the whole array are already created,
> that is not the issue.  The problem basically is that when we see the
> sequence
> 
>   D.2035.m[0] = D.2044_20;
>   D.2035.m[1] = D.2043_19;
>   D.2035.m[2] = D.2042_18;
>   *b_1(D) = D.2035;
> 
> (and there are no other accesses to D.2035) the condition that tries
> to prevent us from creating unnecessary replacements kicks in and we
> decide not to scalarize. 

This code sequence looks like a good motivating factor for scalarizing/expansion. In fact, small arrays should be treated the same way as records if all accesses are through compile time constant indices. This is a common scenario after full unrolling. 

 The intent of the current code (possibly
> among other reasons) was to avoid going through a replacement when the
> whole structure was then passed as an argument to a function and
> similar situations. 

If the temp aggregate is passed to call and the calling convention is not exposed at the IL level, then it is not a good sra candidate as no copy (both code and storage) elimination will be exposed. In this one, the temp aggregate is used as the RHS of an assignment, thus it is a good candidate to expand. So will be the reverse case:

aggregate1 = aggregate2;
 ..
... = aggregate1.e1;
... = aggregate1.e2;

David

> But it should not be very difficult to change the
> condition (in analyze_access_subtree) to handle both situations right.
> 
> Doing this, rather than total scalarization for arrays (which should
> be only useful as a substitute for a copy propagation) should enable
> us to handle even huge arrays.
> 
> I'll get to this right after dealing with PR 43835.
>

Comment 4 Martin Jambor 2010-04-22 17:18:59 UTC

Created attachment 20464 [details]
Proposed fix

I'm currently testing this patch and will submit it tomorrow if everything goes OK.

Comment 5 Martin Jambor 2010-04-23 14:52:30 UTC

Subject: Bug 43846

Author: jamborm
Date: Fri Apr 23 14:52:06 2010
New Revision: 158668

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=158668
Log:
2010-04-23  Martin Jambor  <mjambor@suse.cz>

	PR tree-optimization/43846
	* tree-sra.c (struct access): New flag grp_assignment_read.
	(build_accesses_from_assign): Set grp_assignment_read.
	(sort_and_splice_var_accesses): Propagate grp_assignment_read.
	(enum mark_read_status): New type.
	(analyze_access_subtree): Propagate grp_assignment_read, create
	accesses also if both direct_read and root->grp_assignment_read.

	* testsuite/gcc.dg/tree-ssa/sra-10.c: New test.


Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-10.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-sra.c

Comment 6 Martin Jambor 2010-04-28 13:10:15 UTC

Subject: Bug 43846

Author: jamborm
Date: Wed Apr 28 13:09:56 2010
New Revision: 158826

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=158826
Log:
2010-04-28  Martin Jambor  <mjambor@suse.cz>

	PR tree-optimization/43846
	* tree-sra.c (struct access): New flag grp_assignment_read.
	(build_accesses_from_assign): Set grp_assignment_read.
	(sort_and_splice_var_accesses): Propagate grp_assignment_read.
	(enum mark_read_status): New type.
	(analyze_access_subtree): Propagate grp_assignment_read, create
	accesses also if both direct_read and root->grp_assignment_read.

	* testsuite/gcc.dg/tree-ssa/sra-10.c: New test.


Added:
    branches/gcc-4_5-branch/gcc/testsuite/gcc.dg/tree-ssa/sra-10.c
Modified:
    branches/gcc-4_5-branch/gcc/ChangeLog
    branches/gcc-4_5-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_5-branch/gcc/tree-sra.c

Comment 7 Martin Jambor 2010-04-28 13:15:05 UTC

This is now fixed on both the trunk and the 4.5 branch.

Comment 8 tbp 2010-04-28 13:43:34 UTC

Allow me to extend to you my most profuse praises and blessing; may all the woman in your vicinity fall pregnant and your male progeny be granted abounding chest hair.

Comment 9 Pawel Sikora 2010-05-23 11:53:59 UTC

(In reply to comment #7)
> This is now fixed on both the trunk and the 4.5 branch.
> 

this commit produces broken libkhtml.so.5.4.0 from kdelibs-4.4.3.
in details, it produces different/broken binaries for khtml/css/parser.cpp
and khtml/svg/SVGGradientElement.cpp.

finally we get nice GPF during knode/kmail/konqueror startup:

[KCrash Handler]
#5  memcpy () at ../sysdeps/x86_64/memcpy.S:78
#6  0x00007f546e63fc5e in QString::QString(QChar const*, int) () from 
/usr/lib64/libQtCore.so.4
#7  0x00007f5469f70e2e in qString (ps=<value optimized out>) at 
/usr/src/debug/kdelibs-4.4.3/khtml/css/cssparser.h:84
#8  DOM::CSSParser::parseValue (ps=<value optimized out>) at 
/usr/src/debug/kdelibs-4.4.3/khtml/css/cssparser.cpp:518
#9  0x00007f5469f95075 in cssyyparse (parser=0x7fff08c22820) at 
/usr/src/debug/kdelibs-4.4.3/khtml/css/parser.cpp:2969
#10 0x00007f5469f67d00 in DOM::CSSParser::runParser (this=0x7fff08c22820) at 
/usr/src/debug/kdelibs-4.4.3/khtml/css/cssparser.cpp:151
(...)

Comment 10 Pawel Sikora 2010-05-23 21:25:42 UTC

Created attachment 20731 [details]
parser.i from kdelibs.

Comment 11 Martin Jambor 2010-05-24 09:43:49 UTC

(In reply to comment #9)
> (In reply to comment #7)
> > This is now fixed on both the trunk and the 4.5 branch.
> > 
> 
> this commit produces broken libkhtml.so.5.4.0 from kdelibs-4.4.3.
> in details, it produces different/broken binaries for khtml/css/parser.cpp
> and khtml/svg/SVGGradientElement.cpp.
> 

Please file this as a separate bug and CC me.  I can't promise I'll be
able to look at it this week though.

Comment 12 Pawel Sikora 2010-05-24 11:04:22 UTC

Comment on attachment 20731 [details]
parser.i from kdelibs.

moved to separated PR44258.