The powerpc64-unknown-linux-gnu compiler generates much worse code for the inl1130 function in the gromacs benchmark from SPEC 2006 when compiling for the VSX instruction set with -mcpu=power7 (or -mvsx). The code in question is not vectorizable, and in fact only uses integer and single precision floating point.

Just to be clear, the powerpc architecture originally had two sets of registers (FLOAT_REGS for the scalar floating point registers and ALTIVEC_REGS for the vector single precision/integer registers). The VSX addition to the architecture adds a new set of scalar/vector instructions that can use registers from either register set. So, in the VSX work, I added a new register class (VSX_REGS) that is the union of the two register classes, and changed TARGET_IRA_COVER_CLASSES to return a cover class containing VSX_REGS in the VSX case, and FLOAT_REGS/ALTIVEC_REGS in the traditional case.

For the enclosed test case, the compiler generates the following spills for these options:

-O3 -ffast-math -mcpu=power7 -mvsx -maltivec: 117 stfs, 139 lfs
-O3 -ffast-math -mcpu=power5 -maltivec: 80 stfs, 100 lfs
-O3 -ffast-math -mcpu=power5: 80 stfs, 100 lfs

Now, if I enable -fno-ira-share-spill-slots, it gets somewhat better, though obviously it uses more stack space because it can't reuse the spill stack slots:

-O3 -ffast-math -mcpu=power7 -mvsx -maltivec -fno-ira-share-spill-slots: 102 stfs, 111 lfs

If I don't change the IRA cover class, gromacs generates the same code as before, but other benchmarks that do use the 64 registers won't compile correctly.
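For reference, the cover-class hook described above has roughly the following shape. This is a simplified sketch, not the exact rs6000 backend code: the class lists are abbreviated and illustrative (the real arrays also include GENERAL_REGS' companions, CR registers, etc., and are terminated by LIM_REG_CLASSES as IRA requires).

/* Sketch of the target hook behind TARGET_IRA_COVER_CLASSES.
   Class lists abbreviated; illustrative only.  */
static const enum reg_class *
rs6000_ira_cover_classes (void)
{
  /* Traditional case: scalar FP and Altivec registers stay in
     separate cover classes.  */
  static const enum reg_class cover_classes_pre_vsx[] = {
    GENERAL_REGS, FLOAT_REGS, ALTIVEC_REGS, /* ... */ LIM_REG_CLASSES
  };
  /* VSX case: a single cover class that is the union of the two
     register files, since VSX instructions can use either set.  */
  static const enum reg_class cover_classes_vsx[] = {
    GENERAL_REGS, VSX_REGS, /* ... */ LIM_REG_CLASSES
  };

  return TARGET_VSX ? cover_classes_vsx : cover_classes_pre_vsx;
}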
Created attachment 20134 [details] Test case from the gromacs benchmark
Created attachment 20135 [details] Bzip2 tar file of the assembly output for altivec, vsx, scalar, and no-spill
Created attachment 20136 [details] Bzip2 tar file of the ira dump output for altivec, vsx, scalar, and no-spill
FWIW, I seem to get considerably worse code from mainline than you -- for -O3 -ffast-math -mcpu=power7 -mvsx -maltivec I get 140 stfs and 192 lfs insns (compared to 117 & 139 respectively that you reported).

Just for fun, I ran the same code through a ppc compiler with the LRS code from reload-v2 and get 133:178 stfs/lfs insns, so that code clearly is helping, but it's not enough to offset the badness shown by IRA.

I couldn't reconcile how -fno-ira-share-spill-slots would be changing the number of load/store insns, so I poked at that a bit. -fno-ira-share-spill-slots twiddles whether a pseudo which gets assigned a hard reg is put into live_throughout or dead_or_set_p in the reload chain structures, which in turn changes which pseudos get reassigned hard regs during reload. This is a somewhat odd effect and should be investigated further.
(In reply to comment #0)
> In the enclosed test case, it generates the following spills for the options:
> -O3 -ffast-math -mcpu=power7 -mvsx -maltivec: 117 stfs, 139 lfs
> -O3 -ffast-math -mcpu=power5 -maltivec: 80 stfs, 100 lfs
> -O3 -ffast-math -mcpu=power5: 80 stfs, 100 lfs

Hi, Mike. I think the comparison should be done with the same -mcpu, because the first insn scheduling pass increases register pressure differently for different architectures. But that is not so important. I see a lot of spills during assignment because memory is more profitable. Graph coloring pushes the allocnos on the stack suggesting that they will get registers (and that does not happen during the assignment).

On one of my branches, I got

-O3 -ffast-math -mcpu=power7 -mno-vsx -maltivec: 248 stfs and lfs
-O3 -ffast-math -mcpu=power7 -mvsx -maltivec: 331 stfs and lfs
-O3 -ffast-math -mcpu=power7 -mvsx -maltivec -fsched-pressure: 310
-O3 -ffast-math -mcpu=power7 -mvsx -maltivec -fsched-pressure -fexper: 179

where -fexper switches on the new graph coloring code without cover classes that I am working on. So I think that this new code and register-pressure-sensitive insn scheduling will help.

Still, I'll investigate a bit more why there are a lot of unexpected spills during assignment with -mvsx for the current code.
(In reply to comment #4)
> FWIW, I seem to get considerably worse code from mainline than you -- for -O3
> -ffast-math -mcpu=power7 -mvsx -maltivec I get 140 stfs and 192 lfs insns
> (compared to 117 & 139 respectively that you reported).

I suspect the difference is because Mike counted only stfs/lfs and you counted stfs(x)/lfs(x). But maybe I am wrong.

> Just for fun, I ran the same code through a ppc compiler with the LRS code
> from reload-v2 and get 133:178 stfs/lfs insns, so that code clearly is helping,
> but it's not enough to offset the badness shown by IRA.
>
> I couldn't reconcile how -fno-ira-share-spill-slots would be changing the
> number of load/store insns, so I poked at that a bit.

Yes, I cannot understand that either.

> -fno-ira-share-spill-slots twiddles whether or not a pseudo which gets assigned
> a hard reg is put into live_throughout or dead_or_set_p in the reload chain
> structures, which in turn changes what pseudos get reassigned hard regs during
> reload. This is a somewhat odd effect and should be investigated further.
Subject: Re: Powerpc generates worse code for -mvsx on gromacs even though there are no VSX instructions used

On Mon, Mar 22, 2010 at 10:20:21PM -0000, vmakarov at redhat dot com wrote:
> (In reply to comment #4)
> > FWIW, I seem to get considerably worse code from mainline than you -- for -O3
> > -ffast-math -mcpu=power7 -mvsx -maltivec I get 140 stfs and 192 lfs insns
> > (compared to 117 & 139 respectively that you reported).
>
> I suspect the difference is because Mike counted only stfs/lfs and you counted
> stfs(x)/lfs(x). But maybe I am wrong.

I only counted the stores and loads to the stack, i.e. egrep '(stfs|lfs).*\(1\)', since I was just looking for the spills.

> > Just for fun, I ran the same code through a ppc compiler with the LRS code
> > from reload-v2 and get 133:178 stfs/lfs insns, so that code clearly is helping,
> > but it's not enough to offset the badness shown by IRA.
> >
> > I couldn't reconcile how -fno-ira-share-spill-slots would be changing the
> > number of load/store insns, so I poked at that a bit.
>
> Yes, I cannot understand that either.

Note, while -fno-ira-share-spill-slots has fewer spills, I just measured the results on the machine, and the time spent is pretty much the same.

> > -fno-ira-share-spill-slots twiddles whether or not a pseudo which gets assigned
> > a hard reg is put into live_throughout or dead_or_set_p in the reload chain
> > structures, which in turn changes what pseudos get reassigned hard regs during
> > reload. This is a somewhat odd effect and should be investigated further.
Subject: Re: Powerpc generates worse code for -mvsx on gromacs even though there are no VSX instructions used

On Mon, Mar 22, 2010 at 10:16:56PM -0000, vmakarov at redhat dot com wrote:
> (In reply to comment #0)
> > In the enclosed test case, it generates the following spills for the options:
> > -O3 -ffast-math -mcpu=power7 -mvsx -maltivec: 117 stfs, 139 lfs
> > -O3 -ffast-math -mcpu=power5 -maltivec: 80 stfs, 100 lfs
> > -O3 -ffast-math -mcpu=power5: 80 stfs, 100 lfs
>
> Hi, Mike. I think the comparison should be done with the same -mcpu, because
> the first insn scheduling pass increases register pressure differently for
> different architectures. But that is not so important. I see a lot of spills
> during assignment because memory is more profitable. Graph coloring pushes
> them on the stack suggesting that they will get registers (and that does not
> happen during the assignment).

I just tried it for -mcpu=power5 -mtune=power7 -maltivec:

with -mvsx: 117 stfs to stack, 139 lfs from stack
without -mvsx: 74 stfs to stack, 90 lfs from stack

If I disable setting the cover class to VSX_REGS, I get 74/90 spills.

> On one of my branches, I got
> -O3 -ffast-math -mcpu=power7 -mno-vsx -maltivec: 248 stfs and lfs
> -O3 -ffast-math -mcpu=power7 -mvsx -maltivec: 331 stfs and lfs
> -O3 -ffast-math -mcpu=power7 -mvsx -maltivec -fsched-pressure: 310
> -O3 -ffast-math -mcpu=power7 -mvsx -maltivec -fsched-pressure -fexper: 179
>
> where -fexper switches on the new graph coloring code without cover classes
> that I am working on.
>
> So I think that this new code and register-pressure-sensitive insn scheduling
> will help.
>
> Still, I'll investigate a bit more why there are a lot of unexpected spills
> during assignment with -mvsx for the current code.

Ok, thanks.
Subject: Re: Powerpc generates worse code for -mvsx on gromacs even though there are no VSX instructions used

On 03/22/10 16:20, vmakarov at redhat dot com wrote:
> (In reply to comment #4)
>> FWIW, I seem to get considerably worse code from mainline than you -- for -O3
>> -ffast-math -mcpu=power7 -mvsx -maltivec I get 140 stfs and 192 lfs insns
>> (compared to 117 & 139 respectively that you reported).
>
> I suspect the difference is because Mike counted only stfs/lfs and you counted
> stfs(x)/lfs(x). But maybe I am wrong.

I think you're right. I get 117/144 for mainline now (compared to 117/139 in the PR), so those are in the right ballpark. With the LRS bits I get 110/130, which is a clear improvement, but still nowhere near good enough.

>> Just for fun, I ran the same code through a ppc compiler with the LRS code
>> from reload-v2 and get 133:178 stfs/lfs insns, so that code clearly is helping,
>> but it's not enough to offset the badness shown by IRA.
>>
>> I couldn't reconcile how -fno-ira-share-spill-slots would be changing the
>> number of load/store insns, so I poked at that a bit.
>
> Yes, I cannot understand that either.

Given how -fno-ira-share-spill-slots twiddles the bitmaps in the reload chains, it's got to be either reload register selection or reallocation occurring during reload.

jeff
(In reply to comment #5)
> Still, I'll investigate a bit more why there are a lot of unexpected spills
> during assignment with -mvsx for the current code.

The problem is that part of VSX_REGS (the Altivec registers) cannot hold values of SFmode, and the coloring algorithm does not take that into account. The problem can be solved by checking this in the available-register calculation. The patch I will send soon decreases the number of stfs(x)/lfs(x) insns from 332 to 246.
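The idea behind the fix can be sketched roughly as follows. This is a simplified illustration, not the committed ira-color.c patch; the ALLOCNO_* accessors and ira_* tables are approximations of IRA's internals of that era and should be read as assumed names.

/* Simplified sketch of setup_allocno_available_regs_num after the fix.
   When computing how many hard registers are really available to
   allocno A, count as unavailable not only the registers that conflict
   with A, but also the registers of the cover class that are prohibited
   for A's mode -- e.g. the Altivec half of VSX_REGS for an SFmode
   pseudo.  Accessor and table names are approximate.  */
static void
setup_allocno_available_regs_num (ira_allocno_t a)
{
  int i, n, hard_regno;
  enum reg_class cover_class = ALLOCNO_COVER_CLASS (a);
  enum machine_mode mode = ALLOCNO_MODE (a);

  if (cover_class == NO_REGS)
    return;

  for (n = 0, i = 0; i < ira_class_hard_regs_num[cover_class]; i++)
    {
      hard_regno = ira_class_hard_regs[cover_class][i];
      /* A register is unavailable if it conflicts with the allocno,
	 or (the new part) if it cannot hold values of the allocno's
	 mode at all.  */
      if (TEST_HARD_REG_BIT (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a), hard_regno)
	  || TEST_HARD_REG_BIT (ira_prohibited_class_mode_regs[cover_class][mode],
				hard_regno))
	n++;
    }
  ALLOCNO_AVAILABLE_REGS_NUM (a) = ira_class_hard_regs_num[cover_class] - n;
}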
Subject: Bug 43413

Author: vmakarov
Date: Tue Mar 23 19:18:42 2010
New Revision: 157676

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=157676

Log:
2010-03-23  Vladimir Makarov  <vmakarov@redhat.com>

	PR rtl-optimization/43413
	* ira-color.c (setup_allocno_available_regs_num): Count prohibited
	hard regs too.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ira-color.c
This reduces the spills and brings the performance back up. I'm closing the bug. Thanks.