The powerpc64-unknown-linux-gnu compiler generates much worse code for the inl1130 function in the gromacs benchmark from SPEC 2006 when compiling for the VSX instruction set with -mcpu=power7 (or -mvsx). The code in question is not vectorizable, and in fact only uses integer and single precision floating point.

Just to be clear, the powerpc architecture originally had two sets of registers (FLOAT_REGS for the scalar floating point registers and ALTIVEC_REGS for the vector single precision/integer registers). The VSX addition to the architecture adds a new set of scalar/vector instructions that can use registers from either register set. So, in the VSX work, I added a new register class (VSX_REGS) that is the union of the two register classes, and changed TARGET_IRA_COVER_CLASSES to return a cover class containing VSX_REGS in the VSX case, and FLOAT_REGS/ALTIVEC_REGS in the traditional case.

For the enclosed test case, the compiler generates the following spills for these options:

-O3 -ffast-math -mcpu=power7 -mvsx -maltivec: 117 stfs, 139 lfs
-O3 -ffast-math -mcpu=power5 -maltivec: 80 stfs, 100 lfs
-O3 -ffast-math -mcpu=power5: 80 stfs, 100 lfs

Now, if I enable -fno-ira-share-spill-slots, it gets somewhat better, though obviously it uses more stack space because it can't reuse the spill stack slots:

-O3 -ffast-math -mcpu=power7 -mvsx -maltivec -fno-ira-share-spill-slots: 102 stfs, 111 lfs

If I don't change the IRA cover class, gromacs generates the same code as before, but other benchmarks that do use the 64 registers won't compile correctly.
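For reference, the cover-class hook described above has roughly the following shape. This is a simplified sketch, not the exact rs6000 backend code: the class lists are abbreviated and illustrative (the real arrays also include GENERAL_REGS' companions, CR registers, etc., and are terminated by LIM_REG_CLASSES as IRA requires).

/* Sketch of the target hook behind TARGET_IRA_COVER_CLASSES.
   Class lists abbreviated; illustrative only.  */
static const enum reg_class *
rs6000_ira_cover_classes (void)
{
  /* Traditional case: scalar FP and Altivec registers stay in
     separate cover classes.  */
  static const enum reg_class cover_classes_pre_vsx[] = {
    GENERAL_REGS, FLOAT_REGS, ALTIVEC_REGS, /* ... */ LIM_REG_CLASSES
  };
  /* VSX case: a single cover class that is the union of the two
     register files, since VSX instructions can use either set.  */
  static const enum reg_class cover_classes_vsx[] = {
    GENERAL_REGS, VSX_REGS, /* ... */ LIM_REG_CLASSES
  };

  return TARGET_VSX ? cover_classes_vsx : cover_classes_pre_vsx;
}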
Created attachment 20134 [details] Test case from the gromacs benchmark
Created attachment 20135 [details] Bzip2 tar file of the assembly output for altivec, vsx, scalar, and no-spill
Created attachment 20136 [details] Bzip2 tar file of the ira dump output for altivec, vsx, scalar, and no-spill
FWIW, I seem to get considerably worse code from mainline than you -- for -O3 -ffast-math -mcpu=power7 -mvsx -maltivec I get 140 stfs and 192 lfs insns (compared to 117 & 139 respectively that you reported).

Just for fun, I ran the same code through a ppc compiler with the LRS code from reload-v2 and get 133:178 stfs/lfs insns, so that code clearly is helping, but it's not enough to offset the badness shown by IRA.

I couldn't reconcile how -fno-ira-share-spill-slots would be changing the number of load/store insns, so I poked at that a bit. -fno-ira-share-spill-slots twiddles whether a pseudo which gets assigned a hard reg is put into live_throughout or dead_or_set_p in the reload chain structures, which in turn changes which pseudos get reassigned hard regs during reload. This is a somewhat odd effect and should be investigated further.
(In reply to comment #0)
> In the enclosed test case, it generates the following spills for the options:
> -O3 -ffast-math -mcpu=power7 -mvsx -maltivec: 117 stfs, 139 lfs
> -O3 -ffast-math -mcpu=power5 -maltivec: 80 stfs, 100 lfs
> -O3 -ffast-math -mcpu=power5: 80 stfs, 100 lfs

Hi, Mike. I think the comparison should be done with the same -mcpu, because the first insn scheduling pass increases register pressure differently for different architectures. But that is not so important. I see a lot of spills during assignment because memory is more profitable. Graph coloring pushes the allocnos on the stack suggesting that they will get registers (and that does not happen during the assignment).

On one of my branches, I got

-O3 -ffast-math -mcpu=power7 -mno-vsx -maltivec: 248 stfs and lfs
-O3 -ffast-math -mcpu=power7 -mvsx -maltivec: 331 stfs and lfs
-O3 -ffast-math -mcpu=power7 -mvsx -maltivec -fsched-pressure: 310
-O3 -ffast-math -mcpu=power7 -mvsx -maltivec -fsched-pressure -fexper: 179

where -fexper switches on the new graph coloring code without cover classes that I am working on. So I think that this new code and register-pressure-sensitive insn scheduling will help.

Still, I'll investigate a bit more why there are a lot of unexpected spills during assignment with -mvsx for the current code.
(In reply to comment #4)
> FWIW, I seem to get considerably worse code from mainline than you -- for -O3
> -ffast-math -mcpu=power7 -mvsx -maltivec I get 140 stfs and 192 lfs insns
> (compared to 117 & 139 respectively that you reported).

I suspect the difference is because Mike counted only stfs/lfs and you counted stfs(x)/lfs(x). But maybe I am wrong.

> Just for fun, I ran the same code through a ppc compiler with the LRS code
> from reload-v2 and get 133:178 stfs/lfs insns, so that code clearly is helping,
> but it's not enough to offset the badness shown by IRA.
>
> I couldn't reconcile how -fno-ira-share-spill-slots would be changing the
> number of load/store insns, so I poked at that a bit.

Yes, I cannot understand that either.

> -fno-ira-share-spill-slots twiddles whether or not a pseudo which gets assigned
> a hard reg is put into live_throughout or dead_or_set_p in the reload chain
> structures, which in turn changes what pseudos get reassigned hard regs during
> reload. This is a somewhat odd effect and should be investigated further.
Subject: Re: Powerpc generates worse code for -mvsx on gromacs even though there are no VSX instructions used

On Mon, Mar 22, 2010 at 10:20:21PM -0000, vmakarov at redhat dot com wrote:
> (In reply to comment #4)
> > FWIW, I seem to get considerably worse code from mainline than you -- for -O3
> > -ffast-math -mcpu=power7 -mvsx -maltivec I get 140 stfs and 192 lfs insns
> > (compared to 117 & 139 respectively that you reported).
>
> I suspect the difference is because Mike counted only stfs/lfs and you counted
> stfs(x)/lfs(x). But maybe I am wrong.

I only counted the stores and loads to the stack, i.e. egrep '(stfs|lfs).*\(1\)', since I was just looking for the spills.

> > Just for fun, I ran the same code through a ppc compiler with the LRS code
> > from reload-v2 and get 133:178 stfs/lfs insns, so that code clearly is helping,
> > but it's not enough to offset the badness shown by IRA.
> >
> > I couldn't reconcile how -fno-ira-share-spill-slots would be changing the
> > number of load/store insns, so I poked at that a bit.
>
> Yes, I cannot understand that either.

Note, while -fno-ira-share-spill-slots has fewer spills, I just measured the results on the machine, and the time spent is pretty much the same.

> > -fno-ira-share-spill-slots twiddles whether or not a pseudo which gets assigned
> > a hard reg is put into live_throughout or dead_or_set_p in the reload chain
> > structures, which in turn changes what pseudos get reassigned hard regs during
> > reload. This is a somewhat odd effect and should be investigated further.
Subject: Re: Powerpc generates worse code for -mvsx on gromacs even though there are no VSX instructions used

On Mon, Mar 22, 2010 at 10:16:56PM -0000, vmakarov at redhat dot com wrote:
> (In reply to comment #0)
> > In the enclosed test case, it generates the following spills for the options:
> > -O3 -ffast-math -mcpu=power7 -mvsx -maltivec: 117 stfs, 139 lfs
> > -O3 -ffast-math -mcpu=power5 -maltivec: 80 stfs, 100 lfs
> > -O3 -ffast-math -mcpu=power5: 80 stfs, 100 lfs
>
> Hi, Mike. I think the comparison should be done with the same -mcpu, because
> the first insn scheduling pass increases register pressure differently for
> different architectures. But that is not so important. I see a lot of spills
> during assignment because memory is more profitable. Graph coloring pushes
> them on the stack suggesting that they will get registers (and that does not
> happen during the assignment).

I just tried it for -mcpu=power5 -mtune=power7 -maltivec:

with -mvsx: 117 stfs to stack, 139 lfs from stack
without -mvsx: 74 stfs to stack, 90 lfs from stack

If I disable setting the cover class to VSX_REGS, I get 74/90 spills.

> On one of my branches, I got
> -O3 -ffast-math -mcpu=power7 -mno-vsx -maltivec: 248 stfs and lfs
> -O3 -ffast-math -mcpu=power7 -mvsx -maltivec: 331 stfs and lfs
> -O3 -ffast-math -mcpu=power7 -mvsx -maltivec -fsched-pressure: 310
> -O3 -ffast-math -mcpu=power7 -mvsx -maltivec -fsched-pressure -fexper: 179
>
> where -fexper switches on the new graph coloring code without cover classes
> that I am working on.
>
> So I think that this new code and register-pressure-sensitive insn scheduling
> will help.
>
> Still, I'll investigate a bit more why there are a lot of unexpected spills
> during assignment with -mvsx for the current code.

Ok, thanks.
Subject: Re: Powerpc generates worse code for -mvsx on gromacs even though there are no VSX instructions used

On 03/22/10 16:20, vmakarov at redhat dot com wrote:
> (In reply to comment #4)
>> FWIW, I seem to get considerably worse code from mainline than you -- for -O3
>> -ffast-math -mcpu=power7 -mvsx -maltivec I get 140 stfs and 192 lfs insns
>> (compared to 117 & 139 respectively that you reported).
>
> I suspect the difference is because Mike counted only stfs/lfs and you counted
> stfs(x)/lfs(x). But maybe I am wrong.

I think you're right. I get 117/144 for mainline now (compared to 117/139 in the PR), so those are in the right ballpark. With the LRS bits I get 110/130, which is a clear improvement, but still nowhere near good enough.

>> Just for fun, I ran the same code through a ppc compiler with the LRS code
>> from reload-v2 and get 133:178 stfs/lfs insns, so that code clearly is helping,
>> but it's not enough to offset the badness shown by IRA.
>>
>> I couldn't reconcile how -fno-ira-share-spill-slots would be changing the
>> number of load/store insns, so I poked at that a bit.
>
> Yes, I cannot understand that either.

Given how -fno-ira-share-spill-slots twiddles the bitmaps in the reload chains, it's got to be either reload register selection or reallocation occurring during reload.

jeff
(In reply to comment #5)
> Still, I'll investigate a bit more why there are a lot of unexpected spills
> during assignment with -mvsx for the current code.

The problem is that part of VSX_REGS (the Altivec registers) cannot hold values of SFmode, and the coloring algorithm does not take that into account. The problem can be solved by checking this in the available-register calculation. The patch I will send soon decreases the number of stfs(x)/lfs(x) insns from 332 to 246.
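The idea behind the fix can be sketched roughly as follows. This is a simplified illustration, not the committed ira-color.c patch; the ALLOCNO_* accessors and ira_* tables are approximations of IRA's internals of that era and should be read as assumed names.

/* Simplified sketch of setup_allocno_available_regs_num after the fix.
   When computing how many hard registers are really available to
   allocno A, count as unavailable not only the registers that conflict
   with A, but also the registers of the cover class that are prohibited
   for A's mode -- e.g. the Altivec half of VSX_REGS for an SFmode
   pseudo.  Accessor and table names are approximate.  */
static void
setup_allocno_available_regs_num (ira_allocno_t a)
{
  int i, n, hard_regno;
  enum reg_class cover_class = ALLOCNO_COVER_CLASS (a);
  enum machine_mode mode = ALLOCNO_MODE (a);

  if (cover_class == NO_REGS)
    return;

  for (n = 0, i = 0; i < ira_class_hard_regs_num[cover_class]; i++)
    {
      hard_regno = ira_class_hard_regs[cover_class][i];
      /* A register is unavailable if it conflicts with the allocno,
	 or (the new part) if it cannot hold values of the allocno's
	 mode at all.  */
      if (TEST_HARD_REG_BIT (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a), hard_regno)
	  || TEST_HARD_REG_BIT (ira_prohibited_class_mode_regs[cover_class][mode],
				hard_regno))
	n++;
    }
  ALLOCNO_AVAILABLE_REGS_NUM (a) = ira_class_hard_regs_num[cover_class] - n;
}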
Subject: Bug 43413

Author: vmakarov
Date: Tue Mar 23 19:18:42 2010
New Revision: 157676

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=157676

Log:
2010-03-23  Vladimir Makarov  <vmakarov@redhat.com>

	PR rtl-optimization/43413
	* ira-color.c (setup_allocno_available_regs_num): Count prohibited
	hard regs too.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ira-color.c
This reduces the spills and brings the performance back up. I'm closing the bug. Thanks.