This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
> 
> --- Comment #12 from amker at gcc dot gnu.org ---
> (In reply to rguenther@suse.de from comment #11)
> > On Thu, 18 Jan 2018, amker at gcc dot gnu.org wrote:
> > 
> > 
> > I think the zeroing stmt can be distributed into a separate loop nest
> > > (up to whatever level we choose) and in the then non-parallelized nest
> > the memset can stay at the current level.  So distribute
> > 
> > >          do j=1,ny
> > >             jm1=mod(j+ny-2,ny)+1
> > >             jp1=mod(j,ny)+1
> > >             do i=1,nx
> > >                im1=mod(i+nx-2,nx)+1
> > >                ip1=mod(i,nx)+1
> > >                do l=1,nb
> > >                   y(l,i,j,k)=0.0d0
> > >                   do m=1,nb
> > >                      y(l,i,j,k)=y(l,i,j,k)+
> > >                      ;; ....
> > >                   enddo
> > >                enddo
> > >             enddo
> > >          enddo
> > 
> > to
> > 
> > >          do j=1,ny
> > >             jm1=mod(j+ny-2,ny)+1
> > >             jp1=mod(j,ny)+1
> > >             do i=1,nx
> > >                im1=mod(i+nx-2,nx)+1
> > >                ip1=mod(i,nx)+1
> > >                do l=1,nb
> > >                   y(l,i,j,k)=0.0d0
> > >                enddo
> > >             enddo
> > >          enddo
> > >          do j=1,ny
> > >             jm1=mod(j+ny-2,ny)+1
> > >             jp1=mod(j,ny)+1
> > >             do i=1,nx
> > >                im1=mod(i+nx-2,nx)+1
> > >                ip1=mod(i,nx)+1
> > >                do l=1,nb
> > >                   do m=1,nb
> > >                      y(l,i,j,k)=y(l,i,j,k)+
> > >                      ;; ....
> > >                   enddo
> > >                enddo
> > >             enddo
> > >          enddo
> > 
> Yes, this can be done.  For now it's disabled: unless the zeroing stmt is
> classified as a builtin partition, it gets fused because of the shared memory
> reference to y(l,i,j,k).  This step can be enabled by cost model changes.  The
> only problem is that the cost model change doesn't make sense here (without
> considering the builtin partition stuff, it should be fused, right?)

It might be profitable to distribute away stores that have no dependent
stmts (thus stores from invariants).

Another heuristic would be to never merge builtin partitions with
other partitions because loop optimizations do not handle memory
builtins (the data dependence limitation).  Which might also be a reason
not to handle those as builtins but revert to a non-builtin
classification.

I suppose implementing both, then looking at which distributions
change because of them on, say, SPEC CPU 2006 and classifying each
change as either good or bad, is the only way we'd know whether such
a cost model change is wanted.

> > And then do memset replacement in the first loop.
> I guess this step is just as hard as what I mentioned?  We still need to prove
> that the loops around the zeroing statement don't leave bubbles in memory.

No, you'd simply have the i and j loops containing a memset...
