User account creation filtered due to spam.

Bug 32921 - [4.3/4.4 Regression] Revision 126326 causes 12% slowdown
Summary: [4.3/4.4 Regression] Revision 126326 causes 12% slowdown
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.3.0
: P2 normal
Target Milestone: 4.4.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 33816 35972
Blocks: 33974
  Show dependency treegraph
 
Reported: 2007-07-28 00:23 UTC by H.J. Lu
Modified: 2009-02-02 19:16 UTC (History)
12 users (show)

See Also:
Host:
Target:
Build:
Known to work: 4.4.0
Known to fail: 4.3.0 4.3.3
Last reconfirmed: 2008-04-30 20:29:48


Attachments
Testcase (315 bytes, application/octet-stream)
2007-10-17 16:14 UTC, Pat Haugen
Details
Testcase (997 bytes, text/plain)
2007-10-23 20:30 UTC, Pat Haugen
Details
partition heuristics change (1.37 KB, text/plain)
2007-10-24 15:19 UTC, Richard Biener
Details
Testcase (1.55 KB, text/plain)
2008-04-30 18:49 UTC, Pat Haugen
Details
Testcase (1.99 KB, text/plain)
2008-05-01 19:26 UTC, Pat Haugen
Details

Note You need to log in before you can comment on or make changes to this bug.
Description H.J. Lu 2007-07-28 00:23:02 UTC
This checkin:

http://gcc.gnu.org/ml/gcc-cvs/2007-07/msg00168.html

causes 12% slowdown for 437.leslie3d in SPEC CPU 2006 on 64bit Core 2
Duo when compiled with -O2 -ffast-math.
Comment 1 Andrew Pinski 2007-07-28 00:32:08 UTC
Do you know why?  Because right now there is not enough information in this bug to declare if it is a bug or not.
Comment 2 Andrew Pinski 2007-07-28 00:32:44 UTC
Or really a scheduling fuck up in the core 2 duo.
Comment 3 H.J. Lu 2007-07-28 00:51:13 UTC
I also saw 13% slowdown on Opteron. It may also happen on other none-x86
processors.

This program is in Fortran. It has loops with 3-D, 4-D and 5-D arrays.
Comment 4 Andrew Pinski 2007-07-28 01:02:06 UTC
The patch just makes more types be the same inside GCC so if this caused a regression, then this is a latent bug or it is the case where mem-ssa failed.
Comment 5 Richard Biener 2007-07-28 10:02:08 UTC
It might be a missed optimization, so it would be nice to see where the slowdown is and what changed there wrt trees we get.  (note I will not have time to
investigate this for the next three weeks as I'm on vacation)
Comment 6 H.J. Lu 2007-07-28 20:51:37 UTC
This part of the patch:

Index: tree-ssa.c
===================================================================
--- tree-ssa.c  (revision 126326)
+++ tree-ssa.c  (working copy)
@@ -979,11 +979,9 @@ useless_type_conversion_p (tree outer_ty
        return false;

       /* Do not lose casts between pointers with different
-        TYPE_REF_CAN_ALIAS_ALL setting or alias sets.  */
-      if ((TYPE_REF_CAN_ALIAS_ALL (inner_type)
-          != TYPE_REF_CAN_ALIAS_ALL (outer_type))
-         || (get_alias_set (TREE_TYPE (inner_type))
-             != get_alias_set (TREE_TYPE (outer_type))))
+        TYPE_REF_CAN_ALIAS_ALL setting.  */
+      if (TYPE_REF_CAN_ALIAS_ALL (inner_type)
+         != TYPE_REF_CAN_ALIAS_ALL (outer_type))
        return false;

       /* Do not lose casts from const qualified to non-const

causes this performance regression.

437.leslie3d is a big file with 19 functions. If I put each function in
a separate file, there is no regression. If I put 2 particular functions
in one file, there are 10-30% regressions depending on input file. Those
2 functions work on several global arrays.

I tried " --param max-aliased-vops=1000000 --param avg-aliased-vops=7".
It doesn't make a difference.
Comment 7 rguenther@suse.de 2007-07-28 20:59:26 UTC
Subject: Re:  [4.3 Regression]  Revision 126326
 causes 12% slowdown

On Sat, 28 Jul 2007, hjl at lucon dot org wrote:

> causes this performance regression.
> 
> 437.leslie3d is a big file with 19 functions. If I put each function in
> a separate file, there is no regression. If I put 2 particular functions
> in one file, there are 10-30% regressions depending on input file. Those
> 2 functions work on several global arrays.
> 
> I tried " --param max-aliased-vops=1000000 --param avg-aliased-vops=7".
> It doesn't make a difference.

This sounds strange and maybe relates this to PR32624.

Richard.
Comment 8 H.J. Lu 2007-07-28 23:25:25 UTC
(In reply to comment #7)

> This sounds strange and maybe relates this to PR32624.
> 

I tried the patch in

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32624#c4

It doesn't make a difference in this case. But both are related to
useless_type_conversion_p.
Comment 9 H.J. Lu 2007-10-07 15:16:52 UTC
Gcc 4.3 revision 128885 is much slower than gcc 4.1 redhat revision 128771
with -O2 -ffast-math on 64bit Linux/Core 2 Duo:

437.leslie3d                     -26.2%
Comment 10 H.J. Lu 2007-10-07 15:19:38 UTC
(In reply to comment #9)
> Gcc 4.3 revision 128885 is much slower than gcc 4.1 redhat revision 128771
> with -O2 -ffast-math on 64bit Linux/Core 2 Duo:
> 
> 437.leslie3d                     -26.2%
> 

The main difference between gcc 4.1 redhat and gcc 4.1 is the -mtune=generic
change. I believe gcc 4.3 will also be much slower than gcc 4.1 on Opteron.
Comment 11 Pat Haugen 2007-10-17 16:14:00 UTC
Created attachment 14364 [details]
Testcase
Comment 12 Pat Haugen 2007-10-17 16:18:53 UTC
And now some comments to go with the prior attatchment...

This checkin is causing a 75% degradation on leslie3d for PowerPC. As HJ observed earlier, it depends on a second function accessing some of the global data.

Generated code for simple copy loop in EXTRAPI(), compiled with -O2.

revision 126325:

        .p2align 4,,15
.L3:
        lfd 0,0(11)      #* ivtmp.71, tmp198
        add 11,11,6      # ivtmp.71, ivtmp.71, ivtmp.77
        stfd 0,0(9)      #* ivtmp.76, tmp198
        add 9,9,5        # ivtmp.76, ivtmp.76, ivtmp.73
        bdnz .L3         #


revision 126326:

        .p2align 4,,15
.L6:
        lwz 10,12(27)    # <variable>.stride, prephitmp.66
        lwz 3,12(28)     # <variable>.stride, prephitmp.56
.L3:
        mullw 10,10,12   # tmp141, prephitmp.66, i
        lwz 9,36(30)     # <variable>.stride, <variable>.stride
        lwz 0,36(31)     # <variable>.stride, <variable>.stride
        lwz 8,4(30)      # uav.offset, uav.offset
        lwz 11,4(31)     # qav.offset, qav.offset
        lwz 6,24(30)     # <variable>.stride, <variable>.stride
        lwz 7,24(31)     # <variable>.stride, <variable>.stride
        lwz 4,0(30)      # uav.data, uav.data
        lwz 5,0(31)      # qav.data, qav.data
        mullw 3,3,12     # tmp160, prephitmp.56, i
        slwi 9,9,1       # tmp142, <variable>.stride,
        slwi 0,0,1       # tmp161, <variable>.stride,
        add 9,9,8        # tmp143, tmp142, uav.offset
        add 0,0,11       # tmp162, tmp161, qav.offset
        add 9,9,6        # tmp145, tmp143, <variable>.stride
        add 0,0,7        # tmp164, tmp162, <variable>.stride
        add 9,9,10       # tmp147, tmp145, tmp141
        add 0,0,3        # tmp166, tmp164, tmp160
        slwi 9,9,3       # tmp148, tmp147,
        slwi 0,0,3       # tmp167, tmp166,
        addi 12,12,1     # i, i,
        lfdx 0,5,0       #* qav.data, tmp169
        cmpw 7,12,29     # imax.3, tmp170, i
        stfdx 0,9,4      #* uav.data, tmp169
        bne 7,.L6        #

Comment 13 Andrew Pinski 2007-10-17 16:59:24 UTC
Can someone explain this may_alias behavior:

so we have in the IR:
  # VUSE <SFT.30_53>
D.892_19 = qav.data;
D.893_20 = (real8[0:] *) D.892_19;


 in may_alias we get a constraint of:
 D.892_19 = qav
D.893_20 = D.892_19



Isn't this wrong?  D. 892_19 shouldn't just point to anywhere?

Inside the loop we get:

# VUSE <imax_33, SFT.11_34, SFT.12_35, SFT.13_36, SFT.14_37, SFT.15_38, SFT.16_39, SFT.17_40, SFT.18_41, SFT.19_42, SFT.20_43, SFT.21_44, SFT.22_45, SFT.23_46, SFT.24_47, SFT.25_48, SFT.26_49, SFT.27_50, SFT.28_51, SFT.29_52, SFT.30_53, SMT.35_54>
D.903_30 = (*D.893_20)[D.902_29];

Which obviously shows that the vops are wrong.
Comment 14 Daniel Berlin 2007-10-17 17:41:34 UTC
Subject: Re:  [4.3 Regression] Revision 126326 causes 12% slowdown

On 17 Oct 2007 16:59:25 -0000, pinskia at gcc dot gnu dot org
<gcc-bugzilla@gcc.gnu.org> wrote:
>
>
> ------- Comment #13 from pinskia at gcc dot gnu dot org  2007-10-17 16:59 -------
> Can someone explain this may_alias behavior:
>
> so we have in the IR:
>   # VUSE <SFT.30_53>
> D.892_19 = qav.data;
> D.893_20 = (real8[0:] *) D.892_19;
>
>
>  in may_alias we get a constraint of:
>  D.892_19 = qav
> D.893_20 = D.892_19
>
>
>
> Isn't this wrong?  D. 892_19 shouldn't just point to anywhere?

qav should have a constraint somewhere that says qav = &ANYTHING

You won't directly see D.892_19 = ANYTHING
Comment 15 Richard Biener 2007-10-17 21:21:53 UTC
comment #12 hints at that this is really the same problem as PR32624 (which
basically says aliasing is fucked up and non-deterministic).
Comment 16 Richard Biener 2007-10-18 11:45:39 UTC
I have a patch.
Comment 17 H.J. Lu 2007-10-18 20:55:02 UTC
This patch:

http://gcc.gnu.org/ml/gcc-patches/2007-10/msg01047.html

makes 437.leslie3d 10% faster on Intel Core 2 Duo 64bit. But it is still 23%
slower than gcc 4.1 Red Hat.
Comment 18 Richard Biener 2007-10-19 11:26:13 UTC
Subject: Bug 32921

Author: rguenth
Date: Fri Oct 19 11:25:55 2007
New Revision: 129484

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=129484
Log:
2007-10-19  Richard Guenther  <rguenther@suse.de>

	PR middle-end/32921
	* tree.c (build_array_type): Do not re-layout unbound array
	types.

	* gfortran.dg/pr32921.f: New testcase.

Added:
    trunk/gcc/testsuite/gfortran.dg/pr32921.f
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree.c

Comment 19 Richard Biener 2007-10-19 12:27:40 UTC
Subject: Bug 32921

Author: rguenth
Date: Fri Oct 19 12:27:25 2007
New Revision: 129487

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=129487
Log:
2007-10-19  Richard Guenther  <rguenther@suse.de>

        Revert
        2007-10-19  Richard Guenther  <rguenther@suse.de>

	PR middle-end/32921
	* tree.c (build_array_type): Do not re-layout unbound array
	types.

	* gfortran.dg/pr32921.f: New testcase.

Removed:
    trunk/gcc/testsuite/gfortran.dg/pr32921.f
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree.c

Comment 20 Richard Biener 2007-10-19 12:40:10 UTC
Complete mess.  Can of worms.  Unassigning.
Comment 21 Richard Biener 2007-10-19 15:36:33 UTC
Subject: Bug 32921

Author: rguenth
Date: Fri Oct 19 15:36:05 2007
New Revision: 129491

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=129491
Log:
2007-10-19  Richard Guenther  <rguenther@suse.de>

	PR middle-end/33816
	PR middle-end/32921
	* stor-layout.c (layout_type): Assert that aggregates do not
	have their alias sets set.
	* alias.c (get_alias_set): Return alias set zero for incomplete
	types, return the alias set of the element for incomplete array
	types, but do not remember these.

	* gfortran.dg/pr32921.f: New testcase.

Added:
    trunk/gcc/testsuite/gfortran.dg/pr32921.f
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/alias.c
    trunk/gcc/stor-layout.c
    trunk/gcc/testsuite/ChangeLog

Comment 22 Richard Biener 2007-10-19 16:28:24 UTC
Actually, the fix for PR33816 might have fixed this.
Comment 23 H.J. Lu 2007-10-19 22:13:25 UTC
Gcc 4.3 revision 129493 makes 437.leslie3d 25% faster than revision 129372 on
Intel Core 2 Duo 64bit. But it is still 13% slower than gcc 4.1 Red Hat.
Comment 24 patchapp@dberlin.org 2007-10-20 04:21:16 UTC
Subject: Bug number PR32921

A patch for this bug has been added to the patch tracker.
The mailing list url for the patch is http://gcc.gnu.org/ml/gcc-patches/2007-10/msg01036.html
Comment 25 Richard Biener 2007-10-20 09:38:49 UTC
HJ, does applying the patch from comment #6 bring back performance to 4.1 RH
level?
Comment 26 H.J. Lu 2007-10-20 16:05:39 UTC
(In reply to comment #25)
> HJ, does applying the patch from comment #6 bring back performance to 4.1 RH
> level?
> 

It makes no difference. I saw 20% slowdown in 437.leslie3d on Intel Core
2 Duo 64bit between revision 117890 and revision 119497. It could be
PR 30735.
Comment 27 Pat Haugen 2007-10-22 21:11:11 UTC
I tried a recent mainline on PowerPC for leslie3d, revision 129550 improved over revision 129454 by 67%.
Comment 28 Joey Ye 2007-10-23 02:23:40 UTC
Got similar result on x86_64, Core 2 improves 24% from 129469 to 129504. That's great.

Comment 29 Pat Haugen 2007-10-23 20:29:25 UTC
    Found another example on PowerPC in the same benchmark that is not fixed by the checked in patches.  Compiled with -m32 -O2. From the loop in procedure FLUXI:

    revision 126325:

    .L47:
            lfd 0,0(9)       #* ivtmp.277, tmp346
            lfd 13,-8(9)     #, tmp347
            add 9,9,0        # ivtmp.277, ivtmp.277, ivtmp.309
            fsub 0,0,13      # tmp345, tmp346, tmp347
            fmul 0,0,12      # tmp348, tmp345, dtvol.23
            fneg 0,0         # tmp349, tmp348
            stfd 0,0(11)     #* ivtmp.306, tmp349
            add 11,11,10     # ivtmp.306, ivtmp.306, ivtmp.280
            bdnz .L47        #



    revision 126326 (and current mainline):

    .L83:
            lwz 0,48(7)      # <variable>.stride, <variable>.stride
            lfd 0,0(10)      #* ivtmp.277, tmp416
            lfd 13,-8(10)    #, tmp417
            lfd 12,0(5)      # dtvol, dtvol
            add 10,10,6      # ivtmp.277, ivtmp.277, ivtmp.306
            mullw 0,8,0      # tmp409, l, <variable>.stride
            fsub 0,0,13      # tmp415, tmp416, tmp417
            lwz 9,4(7)       # du.offset, du.offset
            lwz 11,0(7)      # du.data, du.data
            fmul 0,0,12      # tmp420, tmp415, dtvol
            fneg 0,0         # tmp422, tmp420
            add 0,0,9        # tmp412, tmp409, du.offset
            addi 8,8,1       # l, l,
            slwi 0,0,3       # tmp413, tmp412,
            stfdx 0,11,0     #, tmp422
            bdnz .L83        #

Comment 30 Pat Haugen 2007-10-23 20:30:50 UTC
Created attachment 14402 [details]
Testcase
Comment 31 Richard Biener 2007-10-24 12:23:21 UTC
This is a partitioning issue.  We allocate the same MPT to loads and stores
in a loop.  Fixed with for example --param max-aliased-vops=10000.

Diego, this is yours.  Avoiding false aliasing of loads and stores should
be top priority in the partitioning heuristics.

Though even with no paritioning we are not able to move the invariant load
of dtvol:

.L68:
        movsd   (%rsi), %xmm0
        addl    $1, %ecx
        addq    %rdi, %rsi
        subsd   (%rdx), %xmm0
        addq    %rdi, %rdx 
        mulsd   __les3d_data_MOD_dtvol(%rip), %xmm0
        xorpd   %xmm1, %xmm0
        movsd   %xmm0, (%rax)
        addq    %r8, %rax 
        cmpl    $6, %ecx
        jne     .L68

because:

  # VUSE <dtvol_613>
  dtvol.23_424 = dtvol; 
  D.1276_425 = D.1274_423 * dtvol.23_424;
  D.1277_426 = -D.1276_425;
  # dtvol_1865 = VDEF <dtvol_613>
  # HEAP.185_1160 = VDEF <HEAP.185_609>
  # HEAP.186_1142 = VDEF <HEAP.186_1194>
  # HEAP.202_1108 = VDEF <HEAP.202_1177>
  (*pretmp.248_955)[D.1259_397] = D.1277_426;

as we think that pretmp.248_955 might point to dtvol.  The pointers just
have a constraint like

pretmp.246_921 = du

and

du = &ANYTHING
du.ubound = &ANYTHING
du.lbound = &ANYTHING
du.stride = &ANYTHING
du.ubound = &ANYTHING
du.lbound = &ANYTHING
du.stride = &ANYTHING
du.ubound = &ANYTHING
du.lbound = &ANYTHING
du.stride = &ANYTHING
du.ubound = &ANYTHING
du.lbound = &ANYTHING
du.stride = &ANYTHING
du.dtype = &ANYTHING
du.offset = &ANYTHING

and dtvol and DU are global variables.  So we are possibly out of luck
for this one.
Comment 32 Richard Biener 2007-10-24 15:02:16 UTC
I have a patch to "fix" the heuristics, but it doesn't have effect as nobody
hoists PHI nodes that have become invariant apperantly.  After PRE I see



<bb 100>:
  # HEAP.202_828 = PHI <HEAP.202_747(150), HEAP.202_743(148)>
  
<bb 101>:
  # SMT.222_639 = PHI <SMT.222_638(100), SMT.222_599(152)>
  # dtvol_656 = PHI <dtvol_620(100), dtvol_655(152)>
  # MPT.242_668 = PHI <MPT.242_744(100), MPT.242_668(152)>
  # l_1 = PHI <1(100), l_428(152)>
  # VUSE <MPT.242_668>
  D.1241_376 = du.data;
... (more VUSEs of MPT.242_668, the only VDEFs in the loop follow)
  # dtvol_655 = VDEF <dtvol_656>
  # SMT.222_599 = VDEF <SMT.222_639>
  (*D.1242_377)[D.1259_397] = D.1277_426;
  l_428 = l_1 + 1;
  if (l_1 == 5)
    goto <bb 102>;
  else
    goto <bb 152>;

<bb 152>:
  goto <bb 101>;

<bb 102>:

the

  # SMT.222_639 = PHI <SMT.222_638(100), SMT.222_599(152)>

could be hoisted to BB100 (even to BB92).  Who is supposed to do this?
Comment 33 Richard Biener 2007-10-24 15:18:02 UTC
DOM does, so scheduling another dominator pass after PRE (where alias is re-run)
fixes the problem.  One problem with partitioning is that we don't collect
statistics for symbols inside a partition, which makes all but the first
paritioning run prefer already partitioned symbols.
Comment 34 Richard Biener 2007-10-24 15:19:10 UTC
Created attachment 14406 [details]
partition heuristics change

The idea is to change the statistics to make the values in the fields
match their names and based on that avoid partitioning for variables
that are often directly accessed or indirectly accessed as mostly
reads or writes.  It probably makes sense to change the field names
and weight the individual accesses with their BB frequency again.

Of course the patch is not finished and is untested apart from throwing
it on the 2nd testcase.
Comment 35 Pat Haugen 2007-10-24 23:15:14 UTC
I reran leslie3d on PowerPC with --param max-aliased-vops=10000. The result was a 90% improvement over my prior revision 129550 run (218% improvement over the original mainline run).
Comment 36 Richard Biener 2007-10-25 08:55:24 UTC
One reason why we see a regression here regarding to partitioning is that the
fortran FE now inlines allocate () producing three calls instead of one, which
spoils the partitioning heuristics:

  {
    void * D.1028;
    logical4 D.1027;
    int8 size.0;
    int8 D.1025;
    
    qs.dtype = 537;
    qs.dim[0].lbound = 0;
    qs.dim[0].ubound = (int8) (imax + -1);
    qs.dim[0].stride = 1;
    D.1025 = (int8) (imax + -1) + 1;
    D.1027 = imax <= 0;
    if (D.1027)
      {
        size.0 = 0;
      }
    else
      {
        size.0 = D.1025 * 8;
      }
    if (qs.data == 0B)
      {
        {
          void * D.1030;
          int8 D.1029;

          D.1029 = size.0;
          if (D.1029 < 0)
            {
              _gfortran_runtime_error (&"Attempt to allocate negative amount of memory. Possible integer overflow"[1]{lb: 1 sz: 1});
            }
          else
            {
              D.1030 = __builtin_malloc (MAX_EXPR <D.1029, 1>);
              if (D.1030 == 0B)
                {
                  _gfortran_os_error (&"Out of memory"[1]{lb: 1 sz: 1});
                }
            }
          D.1028 = D.1030;
        }
      }
    else
      {
        _gfortran_runtime_error (&"Attempting to allocate already allocated array"[1]{lb: 1 sz: 1});
      }
    qs.data = D.1028;
    qs.offset = 0;
  }

if we mark the error functions as having no VOPs (I don't see that we need
to preserve or order memory operations for these particular functions that
do not return) then this should clean up the number of VOPs considerably.
Comment 37 Andrew Pinski 2007-10-25 10:05:40 UTC
_gfortran_runtime_error is marked as no return which means virtual operations should not be on it.  Sounds like noreturn should be the same as no vops (maybe).  You have to take care about exceptions as no return functions can still change global state and "return" via an exception (or a long jump).
Comment 38 Richard Biener 2007-10-25 11:02:26 UTC
Another thing is that for all of the mem_sym_stats we collect, we _only_ consider
memory references through pointers(!), but not for example

  # VUSE <MPT.242_651>
  D.1244_380 = du.dim[0].stride;

the associated SFT will neither get indirect nor direct reads accounted.
Comment 39 Richard Biener 2007-10-25 11:06:47 UTC
No, in general noreturn functions cannot be treated as novops.  Consider

void __attribute__((noreturn,noinline)) my_main(int *ret)
{
  exit(*ret);
}

int main()
{
  int ret = 0;
  my_main (&ret);
}

without VOPs we would remove the store to ret.
Comment 40 Richard Biener 2008-03-14 16:53:11 UTC
Adjusting target milestone (though realistically this will not be fixed for 4.3.x).
Comment 41 Richard Biener 2008-04-23 14:09:15 UTC
Subject: Bug 32921

Author: rguenth
Date: Wed Apr 23 14:08:25 2008
New Revision: 134598

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=134598
Log:
2008-04-23  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/27799
	PR tree-optimization/32921
	PR tree-optimization/32624
	* tree-ssa-structalias.c (merge_smts_into): Only merge the
	SMTs aliases and the tag itself into the solution.
	* tree-ssa-alias.c (compute_flow_sensitive_aliasing): Do not
	merge the points-to solution back into the SMT aliases.
	(may_alias_p): Use alias_set_subset_of instead of
	aliases_conflict_p.  A pointer which points to
	memory with alias set zero may access any variable.

	* gcc.dg/tree-ssa/pr27799.c: New testcase.
	* gcc.dg/tree-ssa/20030807-7.c: Remove xfail, scan vrp dump.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr27799.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/tree-ssa/20030807-7.c
    trunk/gcc/tree-ssa-alias.c
    trunk/gcc/tree-ssa-structalias.c

Comment 42 Richard Biener 2008-04-23 14:09:55 UTC
Likely fixed on the mainline.
Comment 43 Pat Haugen 2008-04-30 18:49:24 UTC
Created attachment 15553 [details]
Testcase

I tried a mainline with the latest patch.  While we no longer have problems with the prior testcases, there is no improvement for leslie3d on ppc64. I can still double the performance of the benchmark by specifying --param max-aliased-vops=10000.

Including a new trimmed down testcase from the benchmark where I'm still seeing poor code when max-aliased-vops is not increased, compiled with 'gfortran -m32 -O2'.

Refer to the first nested loop in procedure FLUXK():

            DO I = I1, I2
               QS(I) = WAV(I,J,K) * ZAREA
            END DO


Base:

.L150:
        lwz 0,24(18)     # <variable>.stride, <variable>.stride
        lwz 9,36(18)     # <variable>.stride, <variable>.stride
        lwz 11,12(18)    # <variable>.stride, <variable>.stride
        lwz 10,4(18)     # wav.offset, wav.offset
        mullw 0,17,0     # tmp660, ivtmp.602, <variable>.stride
        lwz 8,0(18)      # wav.data, wav.data
        mullw 9,30,9     # tmp666, ivtmp.590, <variable>.stride
        mullw 11,6,11    # tmp670, i, <variable>.stride
        add 0,0,9        # tmp672, tmp660, tmp666
        addi 6,6,1       # i, i,
        add 0,0,11       # tmp673, tmp672, tmp670
        add 0,0,10       # tmp674, tmp673, wav.offset
        slwi 0,0,3       # tmp676, tmp674,
        lfdx 0,8,0       #, tmp678
        fmul 0,0,8       # tmp679, tmp678, zarea.64
        stfdx 0,15,7     #* ivtmp.589, tmp679
        addi 7,7,8       # ivtmp.589, ivtmp.589,
        bdnz .L150       #



With --param max-aliased-vops=1000:

.L150:
        lfd 0,0(11)      #* ivtmp.599, tmp739
        add 11,11,30     # ivtmp.599, ivtmp.599, D.2783
        fmul 0,0,8       # tmp740, tmp739, zarea.64
        stfdx 0,22,9     #* ivtmp.602, tmp740
        addi 9,9,8       # ivtmp.602, ivtmp.602,
        bdnz .L150       #
Comment 44 Richard Biener 2008-04-30 20:29:48 UTC
We are putting the HEAP tag for D.914_385 and wav in the same partition:

<bb 198>:
  # MPT.501_1171 = PHI <MPT.501_1167(197), MPT.501_1228(199)>
  # i_25 = PHI <1(197), i_646(199)>
  D.1264_629 = i_25 + -1;
  # VUSE <MPT.501_1171>
  D.1265_630 = wav.data;
  D.1266_631 = (real(kind=8)[0:] *) D.1265_630;
  # VUSE <MPT.501_1171>
  D.1267_632 = wav.dim[0].stride;
  D.1268_633 = i_25 * D.1267_632;
  # VUSE <MPT.501_1171>
  D.1269_634 = wav.dim[1].stride;
  D.1270_635 = j_28 * D.1269_634;
  # VUSE <MPT.501_1171>
  D.1272_637 = wav.dim[2].stride;
  D.1273_638 = k_29 * D.1272_637;
  # VUSE <MPT.501_1171>
  D.1275_640 = wav.offset;
  D.1271_636 = D.1270_635 + D.1275_640;
  D.1274_639 = D.1271_636 + D.1273_638;
  D.1276_641 = D.1274_639 + D.1268_633;
  # VUSE <dxi_301, zarea_284, SMT.480_250>
  D.1277_642 = (*D.1266_631)[D.1276_641];
  D.1279_644 = D.1277_642 * zarea.64_643;
  # MPT.501_1228 = VDEF <MPT.501_1171>
  (*D.914_385)[D.1264_629] = D.1279_644;
  i_646 = i_25 + 1;
  if (i_25 == i2_608)


A shorter testcase doesn't reproduce this bogus decision even with
--param max-aliased-vops=0.  LIM should be able to disambiguate here
with TBAA.  And indeed, the following seems to fix it

Index: tree-ssa-loop-im.c
===================================================================
--- tree-ssa-loop-im.c	(revision 134835)
+++ tree-ssa-loop-im.c	(working copy)
@@ -1640,6 +1640,8 @@ mem_refs_may_alias_p (tree mem1, tree me
 	  && SSA_VAR_P (mem1)
 	  && !AGGREGATE_TYPE_P (TREE_TYPE (mem1)))
 	return false;
+      if (!alias_sets_conflict_p (get_alias_set (mem1), get_alias_set (mem2)))
+	return false;
     }
 
   /* The expansion of addresses may be a bit expensive, thus we only do

Pat - can you confirm this?
Comment 45 Richard Biener 2008-04-30 21:43:10 UTC
Subject: Bug 32921

Author: rguenth
Date: Wed Apr 30 21:42:24 2008
New Revision: 134838

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=134838
Log:
2008-04-30  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/32921
	* tree-ssa-loop-im.c (mem_refs_may_alias_p): Disambiguate with TBAA.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-loop-im.c

Comment 46 Pat Haugen 2008-04-30 22:35:37 UTC
Just following up with comments posted on irc.  The patch does fix the problem I was seeing.  Spec ratio improved from 5.18 to 9.07 with the patch (75%), not quite the full 100% improvement I was seeing with --param max-aliased-vops=10000, but getting awfully close.
Comment 47 Pat Haugen 2008-05-01 19:26:54 UTC
Created attachment 15557 [details]
Testcase

Found some more cases, which are contributing to the missing 25%. Looks like similar code, but obviously something different about it since it's not getting caught with the latest patch.  Compiled with -O2 -m64.

From first nested loop in FLUXI():

            DO I = I1,I2
               QS(I) = UAV(I,J,K) * XAREA
            END DO


Base:

.L185:
        ld 10,168(1)     # pretmp.1065,
        ld 4,424(1)      # D.1331,
        sldi 11,9,3      # tmp667, i,
        mulld 0,10,9     # tmp669,, i
        add 0,0,25       # tmp670, tmp669, ivtmp.1093
        addi 9,9,1       # tmp675, i,
        sldi 0,0,3       # tmp671, tmp670,
        lfdx 0,4,0       #, tmp673
        extsw 9,9        # i.1073, tmp675
        fmul 0,0,28      # tmp674, tmp673, xarea.63
        stfdx 0,26,11    #, tmp674
        bdnz .L185       #


With --param max-aliased-vops=10000:

.L185:
        lfd 0,0(11)      #* ivtmp.886, tmp667
        add 11,11,18     # ivtmp.886, ivtmp.886, D.3744
        fmul 0,0,27      # tmp668, tmp667, xarea.63
        stfdx 0,31,9     #* ivtmp.889, tmp668
        addi 9,9,8       # ivtmp.889, ivtmp.889,
        cmpd 7,9,29      # tmp787, tmp673, ivtmp.889
        bne 7,.L185      #
Comment 48 Richard Biener 2008-05-02 10:28:58 UTC
Btw, on x86_64 leslie3d performance is now above that from before r126326.

The differences you mention can be seen on x86_64 as well, but they are
not related to aliasing or partitioning but due to differences in what
IVOPTs produces.

The first difference at all on the tree level is with mergephi2 that is able
to remove a single forwarder BB if --param max-aliased-vops=10000 is _not_ specified.  First real changes happen when DOM is able to CSE some loads
with --param max-aliased-vops=10000 but not without:

-  D.1747_808 = D.1747_763;
-  D.1748_809 = D.1748_764;
-  D.1749_811 = D.1749_766;
-  D.1750_812 = D.1749_766 * D.1644_799;
+  D.1747_808 = pav.data;
+  D.1748_809 = (real(kind=8)[0:] *) D.1747_808;
+  D.1749_811 = pav.dim[0].stride;
+  D.1750_812 = D.1644_799 * D.1749_811;

and as PRE itself is not alias-oracle aware as well further missed optimizations
occur.
Comment 49 Richard Biener 2008-05-02 12:16:57 UTC
Missing jump-threading causes quite a number of missed FRE opportunities, we
have

  if (i2 >= 0)
    {
       ... = load X
    }

  if (i2 >= 0)
    {
       ... = load X
    }

where we figure out the redundancy but don't do the replacement because
the result of the first load is not available at the place of the second load.

Some pass reordering may fix this and similar problems but of course may
have other side-effects.  My suggestion is to move passes from

   fre, vrp, ch

to

   ch, vrp, fre

and dropping the final dom pass in favor of another FRE one (DOM has
weaker memory optimization but in addition does some jump threading and
cond expr combining, both of which don't happen a lot after loop
optimizations).

But pass reordering has to wait until after the statistics patch has
been approved.

Note that teaching DOM about the alias-oracle is very difficult and
unlikely to happen - I'd rather get rid of DOM completely.
Comment 50 dnovillo@google.com 2008-05-02 12:32:20 UTC
Subject: Re:  [4.3/4.4 Regression]  Revision
 126326 causes 12% slowdown

On 05/02/08 08:16, rguenth at gcc dot gnu dot org wrote:

> and dropping the final dom pass in favor of another FRE one (DOM has
> weaker memory optimization but in addition does some jump threading and
> cond expr combining, both of which don't happen a lot after loop
> optimizations).

Along these lines, I think we would really benefit from doing a thorough 
study of pass reordering using the techniques in GCC-ICI (GCC Summit 
2008) and COLE (CGO 2008).  Both have found significant improvements 
with different optimization pipelines.


> Note that teaching DOM about the alias-oracle is very difficult and
> unlikely to happen - I'd rather get rid of DOM completely.

Agreed.


Diego.
Comment 51 rguenther@suse.de 2008-05-02 12:55:17 UTC
Subject: Re:  [4.3/4.4 Regression]  Revision
 126326 causes 12% slowdown

On Fri, 2 May 2008, dnovillo at google dot com wrote:

> ------- Comment #50 from dnovillo at google dot com  2008-05-02 12:32 -------
> Subject: Re:  [4.3/4.4 Regression]  Revision
>  126326 causes 12% slowdown
> 
> On 05/02/08 08:16, rguenth at gcc dot gnu dot org wrote:
> 
> > and dropping the final dom pass in favor of another FRE one (DOM has
> > weaker memory optimization but in addition does some jump threading and
> > cond expr combining, both of which don't happen a lot after loop
> > optimizations).
> 
> Along these lines, I think we would really benefit from doing a thorough 
> study of pass reordering using the techniques in GCC-ICI (GCC Summit 
> 2008) and COLE (CGO 2008).  Both have found significant improvements 
> with different optimization pipelines.

If somebody wants to do the legwork ... (the study probably needs to
be re-done and weight passes properly with "we like this pass" and
"we don't like this pass").  Just by visual inspection and reasoning
I can come up with some reasonable changes -- that probably expose
missed optimizations that can and should be fixed in the proper
passes rather than relying on others to cleanup (see the CCP bugs
I tackled over the last month -- I expect similar stuff in other
passes).

The question is how to go forward here.  As any re-ordering
that makes sense appearantly needs thorough analysis of regressions
that happen this isn't an automatic process.  Automatic processes
either require "perfect" passes or will come up with solutions that
either still regress or contain a lot more passes than we like
(in possible un-intuitive order).

Richard.
Comment 52 Mark Mitchell 2008-05-02 14:16:01 UTC
Yes, the "perfect pass" problem is what concerns me too.  For example, if we try to do dynamic reordering of passes, or allow users to specify that, we have to worry that, in practice, the compiler will crash or generate wrong code.  We'll have no good way of ever validating even a small set of the possible combinations.

Perhaps we need to make the passes fast, so we can run them more often?  Or weave some of them together, even though of course it's nice if each pass is logically separate and does a single thing?
Comment 53 rguenther@suse.de 2008-05-02 15:05:18 UTC
Subject: Re:  [4.3/4.4 Regression]  Revision
 126326 causes 12% slowdown

On Fri, 2 May 2008, mmitchel at gcc dot gnu dot org wrote:

> ------- Comment #52 from mmitchel at gcc dot gnu dot org  2008-05-02 14:16 -------
> Yes, the "perfect pass" problem is what concerns me too.  For example, if we
> try to do dynamic reordering of passes, or allow users to specify that, we have
> to worry that, in practice, the compiler will crash or generate wrong code. 
> We'll have no good way of ever validating even a small set of the possible
> combinations.

True.  Still if we don't have this ability to easily re-order passes
we'll never know ;)

So I still would like to have our pass-manager scripted, if it is
only for development purposes of GCC.

> Perhaps we need to make the passes fast, so we can run them more often?  Or
> weave some of them together, even though of course it's nice if each pass is
> logically separate and does a single thing?

Most of the cleanup passes are fast, and one of the most important things
in my view is maintainability which helps to reduce possible wrong-code
bugs.  This of course is easier with passes that do one single thing.

One thing we should consider is to keep more information around across
passes to make passes simpler.  For example VRP throws away range
information - we could separate out jump threading if we didn't do that,
likewise other passes could benefit from this information.  FRE and PRE
compute value numbers which are also thrown away.

Richard.
Comment 54 Richard Biener 2008-06-06 14:57:40 UTC
4.3.1 is being released, adjusting target milestone.
Comment 55 Joseph S. Myers 2008-08-27 22:02:25 UTC
4.3.2 is released, changing milestones to 4.3.3.
Comment 56 Richard Biener 2008-12-08 15:37:14 UTC
Unassigning.
Comment 57 Richard Biener 2009-01-24 10:19:52 UTC
GCC 4.3.3 is being released, adjusting target milestone.
Comment 58 Paolo Bonzini 2009-01-31 14:26:46 UTC
> Btw, on x86_64 leslie3d performance is now above that from before r126326.

Changing to 4.3 only.  A separate bug (likely not a regression) should be opened for the testcase of comment #47.

Might even be WONTFIX for 4.3.
Comment 59 Richard Biener 2009-01-31 14:35:56 UTC
This is btw very likely fixed on the alias-improvements branch.  SPEC testing
that for PPC would be much appreciated.

OTOH I agree with Paolo, so ... FIXED for 4.4, WONTFIX for 4.3.
Comment 60 Pat Haugen 2009-02-02 19:16:44 UTC
I tried leslie3d on PPC. The alias-improvements branch does indeed seem to fix the issue. The version of leslie3d built with the alias-improvements branch is about 10% faster than a version built with trunk, and is equivalent to a trunk version built with --param max-aliased-vops=10000.