User account creation filtered due to spam.

Bug 44423 - [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
Summary: [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.6.0
: P3 normal
Target Milestone: 4.5.1
Assignee: Martin Jambor
URL:
Keywords: missed-optimization, ssemmx
Depends on:
Blocks:
 
Reported: 2010-06-05 10:16 UTC by Martin Reinecke
Modified: 2010-06-15 10:04 UTC (History)
3 users (show)

See Also:
Host: x86_64-unknown-linux-gnu
Target: x86_64-unknown-linux-gnu
Build: x86_64-unknown-linux-gnu
Known to work: 4.4.2
Known to fail: 4.5.1 4.6.0
Last reconfirmed: 2010-06-08 13:16:32


Attachments
test case (405 bytes, text/plain)
2010-06-05 10:18 UTC, Martin Reinecke
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Reinecke 2010-06-05 10:16:58 UTC
If the attached test case is compiled with current trunk, it runs almost 4 times more slowly than the same code compiled with gcc 4.4 and identical options:

~/ujedi/splotchnew>gcc -O2 -v bugrep.c -W -Wall
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: /tmp/S/gcc-4.4.2/configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var --mandir=/usr/share/man --infodir=/usr/share/info --enable-cpp --enable-nls --enable-shared --enable-multilib --enable-languages=c,c++,fortran
Thread model: posix
gcc version 4.4.2 (GCC) 
COLLECT_GCC_OPTIONS='-O2' '-v' '-W' '-Wall' '-mtune=generic'
 /afs/mpa/@sys/system/MPA-5.13/usr/bin/../libexec/gcc/x86_64-unknown-linux-gnu/4.4.2/cc1 -quiet -v -iprefix /afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/ bugrep.c -quiet -dumpbase bugrep.c -mtune=generic -auxbase bugrep -O2 -W -Wall -version -o /tmp/cc3cnddf.s
ignoring nonexistent directory "/afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/../../../../x86_64-unknown-linux-gnu/include"
ignoring duplicate directory "/afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/../../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/include"
ignoring duplicate directory "/afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/../../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/include-fixed"
ignoring nonexistent directory "/afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/../../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/../../../../x86_64-unknown-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 /afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/include
 /afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/include-fixed
 /usr/local/include
 /usr/include
End of search list.
GNU C (GCC) version 4.4.2 (x86_64-unknown-linux-gnu)
        compiled by GNU C version 4.4.2, GMP version 4.3.1, MPFR version 2.4.1.
warning: MPFR header version 2.4.1 differs from library version 2.4.2.
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: 364bbcc3e471e0834234549d6d2cb3d0
COLLECT_GCC_OPTIONS='-O2' '-v' '-W' '-Wall' '-mtune=generic'
 /afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/../../../../x86_64-unknown-linux-gnu/bin/as -V -Qy -o /tmp/ccKlmGkp.o /tmp/cc3cnddf.s
GNU assembler version 2.19.1 (x86_64-unknown-linux-gnu) using BFD version (GNU Binutils) 2.19.1
COMPILER_PATH=/afs/mpa/@sys/system/MPA-5.13/usr/bin/../libexec/gcc/x86_64-unknown-linux-gnu/4.4.2/:/afs/mpa/@sys/system/MPA-5.13/usr/bin/../libexec/gcc/:/usr/libexec/gcc/x86_64-unknown-linux-gnu/4.4.2/:/usr/libexec/gcc/x86_64-unknown-linux-gnu/:/usr/lib/gcc/x86_64-unknown-linux-gnu/4.4.2/:/usr/lib/gcc/x86_64-unknown-linux-gnu/:/afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/../../../../x86_64-unknown-linux-gnu/bin/
LIBRARY_PATH=/afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/:/afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/:/usr/lib/gcc/x86_64-unknown-linux-gnu/4.4.2/:/afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/../../../../lib64/:/lib/../lib64/:/usr/lib/../lib64/:/afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/../../../../x86_64-unknown-linux-gnu/lib/:/afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-O2' '-v' '-W' '-Wall' '-mtune=generic'
 /afs/mpa/@sys/system/MPA-5.13/usr/bin/../libexec/gcc/x86_64-unknown-linux-gnu/4.4.2/collect2 --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 /afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/../../../../lib64/crt1.o /afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/../../../../lib64/crti.o /afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/crtbegin.o -L/afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2 -L/afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc -L/usr/lib/gcc/x86_64-unknown-linux-gnu/4.4.2 -L/afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/../../../../x86_64-unknown-linux-gnu/lib -L/afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/../../.. /tmp/ccKlmGkp.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed /afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/crtend.o /afs/mpa/@sys/system/MPA-5.13/usr/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.2/../../../../lib64/crtn.o

~/ujedi/splotchnew>time ./a.out
0.440u 0.000s 0:00.44 100.0%    0+0k 0+0io 0pf+0w



~/ujedi/splotchnew>gcc -O2 -v bugrep.c -W -Wall
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/afs/mpa/data/martin/ugcc/libexec/gcc/x86_64-unknown-linux-gnu/4.6.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: /scratch/martin/gcc/configure --disable-bootstrap --enable-gold --enable-plugins --prefix=/afs/mpa/data/martin/ugcc --with-ppl=/afs/mpa/data/martin/numlibs64 --with-cloog=/afs/mpa/data/martin/numlibs64 --with-libelf=/afs/mpa/data/martin/numlibs64 --enable-languages=c++,fortran --enable-target=all --enable-checking=release
Thread model: posix
gcc version 4.6.0 20100604 (experimental) [trunk revision 160258] (GCC) 
COLLECT_GCC_OPTIONS='-O2' '-v' '-W' '-Wall' '-mtune=generic' '-march=x86-64'
 /afs/mpa/data/martin/ugcc/libexec/gcc/x86_64-unknown-linux-gnu/4.6.0/cc1 -quiet -v bugrep.c -quiet -dumpbase bugrep.c -mtune=generic -march=x86-64 -auxbase bugrep -O2 -W -Wall -version -o /tmp/ccm5w6IQ.s
GNU C (GCC) version 4.6.0 20100604 (experimental) [trunk revision 160258] (x86_64-unknown-linux-gnu)
        compiled by GNU C version 4.4.2, GMP version 4.3.2, MPFR version 2.4.2, MPC version 0.8.1
warning: GMP header version 4.3.2 differs from library version 4.3.1.
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
ignoring nonexistent directory "/afs/mpa/data/martin/ugcc/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include
 /afs/mpa/data/martin/ugcc/include
 /afs/mpa/data/martin/ugcc/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/include
 /afs/mpa/data/martin/ugcc/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/include-fixed
 /usr/include
End of search list.
GNU C (GCC) version 4.6.0 20100604 (experimental) [trunk revision 160258] (x86_64-unknown-linux-gnu)
        compiled by GNU C version 4.4.2, GMP version 4.3.2, MPFR version 2.4.2, MPC version 0.8.1
warning: GMP header version 4.3.2 differs from library version 4.3.1.
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: d4da26fe4a5a46c88c2087304495247d
COLLECT_GCC_OPTIONS='-O2' '-v' '-W' '-Wall' '-mtune=generic' '-march=x86-64'
 as -V -Qy --64 -o /tmp/ccmGCvZC.o /tmp/ccm5w6IQ.s
GNU assembler version 2.19.1 (x86_64-unknown-linux-gnu) using BFD version (GNU Binutils) 2.19.1
COMPILER_PATH=/afs/mpa/data/martin/ugcc/libexec/gcc/x86_64-unknown-linux-gnu/4.6.0/:/afs/mpa/data/martin/ugcc/libexec/gcc/x86_64-unknown-linux-gnu/4.6.0/:/afs/mpa/data/martin/ugcc/libexec/gcc/x86_64-unknown-linux-gnu/:/afs/mpa/data/martin/ugcc/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/:/afs/mpa/data/martin/ugcc/lib/gcc/x86_64-unknown-linux-gnu/
LIBRARY_PATH=/afs/mpa/data/martin/ugcc/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/:/afs/mpa/data/martin/ugcc/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../lib64/:/lib/../lib64/:/usr/lib/../lib64/:/afs/mpa/data/martin/ugcc/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-O2' '-v' '-W' '-Wall' '-mtune=generic' '-march=x86-64'
 /afs/mpa/data/martin/ugcc/libexec/gcc/x86_64-unknown-linux-gnu/4.6.0/collect2 --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 /usr/lib/../lib64/crt1.o /usr/lib/../lib64/crti.o /afs/mpa/data/martin/ugcc/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/crtbegin.o -L/afs/mpa/data/martin/ugcc/lib/gcc/x86_64-unknown-linux-gnu/4.6.0 -L/afs/mpa/data/martin/ugcc/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/afs/mpa/data/martin/ugcc/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../.. /tmp/ccmGCvZC.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed /afs/mpa/data/martin/ugcc/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/crtend.o /usr/lib/../lib64/crtn.o

~/ujedi/splotchnew>time ./a.out
1.580u 0.000s 0:01.58 100.0%    0+0k 0+0io 0pf+0w


This was run on a Core2Duo.
gcc 4.5 also exhibits this regression.


Another observation: if I change ARRSZ in the testcase to 20 instead of 1024, all executables run even more slowly (by about another factor of five); I have absolutely no explanation for this :-/
Comment 1 Martin Reinecke 2010-06-05 10:18:06 UTC
Created attachment 20849 [details]
test case
Comment 2 Richard Biener 2010-06-05 10:47:28 UTC
We have (4.4):

<bb 2>:
  va.f[0] = a->r;
  va.f[1] = a->g;
  va.f[2] = a->b;
  va.f[3] = 0.0;
  pretmp.40 = va.v;
  ivtmp.61 = 0;

<bb 3>:
  att.12 = MEM[base: pre1, index: ivtmp.61] * pre2;
  tmpatt = {att.12, att.12, att.12, att.12};
  tmpatt.66 = __builtin_ia32_mulps (tmpatt, pretmp.40);
  D.3720 = __builtin_ia32_addps (tmpatt.66, MEM[base: lpic, index: ivtmp.61, step: 4]);
  MEM[base: lpic, index: ivtmp.61, step: 4] = D.3720;
  ivtmp.61 = ivtmp.61 + 4;
  if (ivtmp.61 != 80)
    goto <bb 3>;
  else
    goto <bb 4>;

vs. (4.6):

<bb 2>:
  va$f$0_3 = a_2(D)->r;
  va$f$1_4 = a_2(D)->g;
  va$f$2_5 = a_2(D)->b;

<bb 3>:
  # ivtmp.31_52 = PHI <ivtmp.31_51(3), 0(2)>
  D.4345_11 = MEM[base: pre1_9(D), index: ivtmp.31_52];
  att.1_13 = D.4345_11 * pre2_12(D);
  tmpatt_37 = {att.1_13, att.1_13, att.1_13, att.1_13};
  va.f[0] = va$f$0_3;
  va.f[1] = va$f$1_4;
  va.f[2] = va$f$2_5;
  va.f[3] = 0.0;
  D.4347_16 = va.v;
  tmpatt_38 = __builtin_ia32_mulps (tmpatt_37, D.4347_16);
  D.4350_25 = MEM[base: lpic_20(D), index: ivtmp.31_52, step: 4];
  D.4382_39 = __builtin_ia32_addps (tmpatt_38, D.4350_25);
  MEM[base: lpic_20(D), index: ivtmp.31_52, step: 4] = D.4382_39;
  ivtmp.31_51 = ivtmp.31_52 + 4;
  if (ivtmp.31_51 != 80)
    goto <bb 3>;
  else
    goto <bb 4>;

which means that somehow we were not able to hoist loop invariant stuff.

investigating.
Comment 3 Richard Biener 2010-06-05 10:56:06 UTC
Ok.  Fact is that no pass can move invariant store/load pairs.  But that's
pre-existing - the main issue is that the new SRA implementation ends up
rematerializing the stores inside the loop!

Diff of pre-esra vs. esra:

 <bb 2>:
   D.4339_3 = a_2(D)->r;
-  va.f[0] = D.4339_3;
+  va$f$0_33 = D.4339_3;
   D.4340_4 = a_2(D)->g;
-  va.f[1] = D.4340_4;
+  va$f$1_32 = D.4340_4;
   D.4341_5 = a_2(D)->b;
-  va.f[2] = D.4341_5;
-  va.f[3] = 0.0;
+  va$f$2_31 = D.4341_5;
+  va$f$3_30 = 0.0;
   y_6 = 0;
   goto <bb 4>;
 
@@ -504,6 +203,10 @@
   tmpatt_37 = {D.4375_36, D.4375_36, D.4375_36, D.4375_36};
   tmpatt_40 = tmpatt_37;
   tmpatt_15 = tmpatt_40;
+  va.f[0] = va$f$0_33;
+  va.f[1] = va$f$1_32;
+  va.f[2] = va$f$2_31;
+  va.f[3] = va$f$3_30;
   D.4347_16 = va.v;
   tmpatt_38 = __builtin_ia32_mulps (tmpatt_15, D.4347_16);
   tmpatt_41 = tmpatt_38;

that's of course bad (and the scalarization in this particular case looks
useless, too - the only use is an aggregate one, covering all stores).
Comment 4 Martin Jambor 2010-06-08 13:16:32 UTC
Mine
Comment 5 Martin Reinecke 2010-06-08 13:54:38 UTC
(In reply to comment #2)
> We have (4.4):
> 
> <bb 2>:
>   va.f[0] = a->r;
>   va.f[1] = a->g;
>   va.f[2] = a->b;
>   va.f[3] = 0.0;
>   pretmp.40 = va.v;
>   ivtmp.61 = 0;

[...]

Could you please tell me the compiler flag(s) needed to produce this kind of information? That seems much more useful for debugging and chasing performance bottlenecks than assembler dumps...
Comment 6 Richard Biener 2010-06-08 14:02:59 UTC
(In reply to comment #5)
> (In reply to comment #2)
> > We have (4.4):
> > 
> > <bb 2>:
> >   va.f[0] = a->r;
> >   va.f[1] = a->g;
> >   va.f[2] = a->b;
> >   va.f[3] = 0.0;
> >   pretmp.40 = va.v;
> >   ivtmp.61 = 0;
> 
> [...]
> 
> Could you please tell me the compiler flag(s) needed to produce this kind of
> information? That seems much more useful for debugging and chasing performance
> bottlenecks than assembler dumps...

-fdump-tree-all will leave you with a hunded dump files, program state
after each individual tree optimization pass.  -da will add dumps after
each RTL pass.
Comment 7 Martin Jambor 2010-06-08 14:29:30 UTC
I don't think I can fix this bug in its most general form without
doing some flow-sensitive decisions (which can be difficult for
aggregates) and without causing PR 43846 again.  (Aggregate
copy-propagation and either of the two things described below should
do the trick, though).

As noted, this is caused by a fix to PR 43846 which on the other hand
is certainly not necessary for non-aggregate types when we do type
punning of register types through unions.  I've got a two line patch
testing that and it works (and bootstraps and tests) fine.

However, that is only a change in the new heuristics and if the array
elements are individually read somewhere else in the function too, a
different decision making condition will kick in and we will end up
with the replacements and extra statements in the loop again.

Therefore, I now tend to think that these accesses to SSE vectors are
a good reason to simply disallow scalarization of anything that has a
non-aggregate parent in the SRA access tree.  This would only affect
type punning through unions and weird typecasts (none of which could
be processed by previous SRA).

Actually, I had this disallowed when I was developing the new SRA most
of the time and then decided to allow it only very late.  I don't
remember why I did that.  I'm now testing a patch doing that, maybe
some testcase will remind me what the reason was.

I will ponder about this a bit more but probably will soon submit a
patch doing the latter.
Comment 8 Richard Biener 2010-06-08 14:50:51 UTC
I don't think you need flow-sensitivity.

Basically when you have only aggregate uses (as in this case) then you only
want to scalarize if the destination of the use is scalarized as well
(to be able to copyprop out the aggregate copy).
Comment 9 Martin Jambor 2010-06-08 15:00:06 UTC
(In reply to comment #8)
> I don't think you need flow-sensitivity.
> 
> Basically when you have only aggregate uses (as in this case)

Vectors are considered scalars in GCC.  That is why the solutions
described above work.

> then you only want to scalarize if the destination of the use is
> scalarized as well (to be able to copyprop out the aggregate copy).

Well, that is what I thought until someone filed PR 43846.
Comment 10 Richard Biener 2010-06-08 15:11:09 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > I don't think you need flow-sensitivity.
> > 
> > Basically when you have only aggregate uses (as in this case)
> 
> Vectors are considered scalars in GCC.  That is why the solutions
> described above work.
> 
> > then you only want to scalarize if the destination of the use is
> > scalarized as well (to be able to copyprop out the aggregate copy).
> 
> Well, that is what I thought until someone filed PR 43846.

Hm, yes.  But there you know that

  D.2464.m[0] = D.2473_20;
  D.2464.m[1] = D.2472_19;
  D.2464.m[2] = D.2471_18;
  *b_1(D) = D.2464;

D.2464 will be dead after scalarization.  In the particular case of the
current bug the aggregate remains live because of the load from va.v
which we cannot scalarize(*).

(*) we can scalarize this particular case if you manage to build a
proper constructor from the elements - but that's probably a bit
involved.
Comment 11 Martin Jambor 2010-06-09 09:02:41 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > (In reply to comment #8)
> > > I don't think you need flow-sensitivity.
> > > 
> > > Basically when you have only aggregate uses (as in this case)
> > 
> > Vectors are considered scalars in GCC.  That is why the solutions
> > described above work.
> > 
> > > then you only want to scalarize if the destination of the use is
> > > scalarized as well (to be able to copyprop out the aggregate copy).
> > 
> > Well, that is what I thought until someone filed PR 43846.
> 
> Hm, yes.  But there you know that
> 
>   D.2464.m[0] = D.2473_20;
>   D.2464.m[1] = D.2472_19;
>   D.2464.m[2] = D.2471_18;
>   *b_1(D) = D.2464;
> 
> D.2464 will be dead after scalarization.

If D.2464 was larger than just m, that would not necessarily be the
case and we would still want to avoid the extra copies.

However, I it is true that it would make sense to take
grp_assignment_read into account only if the whole access subtree
would end up with grp_unscalarized_data set to zero but that would
require quite a rewrite of analyze_access_subtree and would not help
in this case because grp_unscalarized_data is zero, the union is
covered by scalar replacements.

The real issue is that

>  In the particular case of the
> current bug the aggregate remains live because of the load from va.v
> which we cannot scalarize(*).

we determine this very late, in sra_modify_assign (right after the big
comment) and in the most general form this can be determined only when
we already have the whole access tree (so if we wanted to do this
during analysis, we would have to scan the function body twice).
Nevertheless, for scalar accesses that have scalar sub-accesses this
is always true and it can be easily detected and so we can simply
disallow them, like I wrote in comment #7.  And disallow them always,
since otherwise it would be easy to _add_ stuff to the function that
is causing trouble now so that any heuristics is confused and decides
to produce replacements.

I'll submit a patch in a while.

> 
> (*) we can scalarize this particular case if you manage to build a
> proper constructor from the elements - but that's probably a bit
> involved.
> 

Well, I don't think I want to implement that... but I am curious,
would that actually lead to better (or even different) code if I
placed something like that into the loop?  And I also thought that in
gimple, constructors only could have invariants in them.
Comment 12 Martin Jambor 2010-06-09 09:05:57 UTC
(In reply to comment #11)
> >   D.2464.m[0] = D.2473_20;
> >   D.2464.m[1] = D.2472_19;
> >   D.2464.m[2] = D.2471_18;
> >   *b_1(D) = D.2464;
> > 
> > D.2464 will be dead after scalarization.
> 
> If D.2464 was larger than just m, that would not necessarily be the
> case and we would still want to avoid the extra copies.
> 

Of course this would be true only if the last assignment was 

*b_1(D) = D.2464.m;

but the argument is the same.
Comment 13 Martin Jambor 2010-06-09 11:20:52 UTC
Subject: Bug 44423

Author: jamborm
Date: Wed Jun  9 11:20:03 2010
New Revision: 160462

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=160462
Log:
2010-06-09  Martin Jambor  <mjambor@suse.cz>

	PR tree-optimization/44423
	* tree-sra.c (dump_access): Dump also grp_assignment_read.
	(analyze_access_subtree): Pass negative allow_replacements to children
	if the current type is scalar.

	* testsuite/gcc.dg/tree-ssa/pr44423.c: New test.


Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr44423.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-sra.c

Comment 14 Martin Reinecke 2010-06-09 12:06:03 UTC
SSE performance is fine again, thanks a lot!

One more question, if that's OK...
Depending on ARRSZ the testcase uses wildly varying amounts of CPU time; it's about half a second for ARRSZ=1024, but almost 10 seconds for ARRSZ=20 on my machine, which is extremely strange because the operation count is the same in both cases. I suspect that something weird is happening with respect to the cache and prefetching. Should I open another PR for this?
Comment 15 Martin Jambor 2010-06-14 12:39:53 UTC
(In reply to comment #14)
> SSE performance is fine again, thanks a lot!
> 
> One more question, if that's OK...
> Depending on ARRSZ the testcase uses wildly varying amounts of CPU time; it's
> about half a second for ARRSZ=1024, but almost 10 seconds for ARRSZ=20 on my
> machine, which is extremely strange because the operation count is the same in
> both cases. I suspect that something weird is happening with respect to the
> cache and prefetching. Should I open another PR for this?
> 

The generated assembly is not different for the two cases, except that
there are much smaller offsets, of course.  This means that the lpic
and pre1 arrays are much closer to each other which may be something
the processor does not like.  I find this surprising but unless you
can think of a specific missed optimization opportunity (I can't), I
don't think it is a PR material.
Comment 16 Martin Reinecke 2010-06-14 12:46:17 UTC
(In reply to comment #15)

I have found the problem in the meantime ... it's my mistake, sorry about the noise :(

The problem is that I did not explicitly zero the arrays in main(), so they apparently contained NaN or similar nastinesses for the small ARRSZ, and "usual" numbers for large ARRSZ. Of course the processor chokes on the "unusual" numbers and takes much longer to execute the code.
I'm not sure whether the zeroing should be added for the regression test case ... but since you check for compiler diagnostic and do not try to run the resulting executable that's probably not necessary.
Comment 17 Martin Jambor 2010-06-14 12:50:08 UTC
OK, I did not put much effort into my thinking about it :-)
Yes, the testcase is fine as it is.
I'm not testing the patch on the 4.5 branch and will commit it today
if everything goes fine.
Comment 18 Martin Jambor 2010-06-15 09:48:54 UTC
Subject: Bug 44423

Author: jamborm
Date: Tue Jun 15 09:48:39 2010
New Revision: 160775

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=160775
Log:
2010-06-15  Martin Jambor  <mjambor@suse.cz>

	PR tree-optimization/44423
	* tree-sra.c (dump_access): Dump also grp_assignment_read.
	(analyze_access_subtree): Pass negative allow_replacements to children
	if the current type is scalar.

	* testsuite/gcc.dg/tree-ssa/pr44423.c: New test.


Added:
    branches/gcc-4_5-branch/gcc/testsuite/gcc.dg/tree-ssa/pr44423.c
Modified:
    branches/gcc-4_5-branch/gcc/ChangeLog
    branches/gcc-4_5-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_5-branch/gcc/tree-sra.c

Comment 19 Martin Jambor 2010-06-15 10:04:33 UTC
This is now fixed on both the trunk and the 4.5 branch.