| Summary: | x87 reg allocated for constants for -mfpmath=sse |
|---|---|
| Product: | gcc |
| Reporter: | Uroš Bizjak <ubizjak> |
| Component: | target |
| Assignee: | Paolo Bonzini <bonzini> |
| Status: | RESOLVED FIXED |
| Severity: | minor |
| CC: | bonzini, dalej, fjahanian, gcc-bugs, matz, pawel_sikora, rguenth, rhill |
| Priority: | P2 |
| Keywords: | missed-optimization, patch, ra |
| Version: | 4.0.0 |
| Target Milestone: | --- |
| URL: | http://gcc.gnu.org/ml/gcc-patches/2006-01/msg01825.html |
| Host: | |
| Target: | i686-*-* |
| Build: | |
| Known to work: | |
| Known to fail: | |
| Last reconfirmed: | 2005-07-27 15:53:05 |
| Bug Depends on: | |
| Bug Blocks: | 23111 |
| Attachments: | Self-contained example, updated patch, Re-updated patch (reload.c only), another patch update, one more updated patch |
Description

Uroš Bizjak 2005-01-27 08:34:00 UTC

Created attachment 8080 [details]: Self-contained example
A patch (an RFC, actually) that was used to teach the allocator which register set to use: http://gcc.gnu.org/ml/gcc-patches/2005-01/msg01783.html [This patch probably doesn't work well with the xmmintrin.h stuff, as pointed out by rth in a follow-up comment.]

BTW: I don't think that x87 should be fully disabled for -mfpmath=sse. st(0) can be used as temporary storage for memory-to-memory transfers. Also, it can do on-the-fly FP extending and truncating without touching an SSE reg:

    movsd    (%eax), xmm1    # ~7 cycles
    cvtsd2ss xmm1, (%esp)    # 14 cycles

could be implemented by:

    fldl  (%eax)    # ~7 cycles
    fstps (%esp)    # ~7 cycles

There is nothing wrong if fldl is replaced with fldz or fld1. The performance problems arise when the memory location is used in subsequent SSE computations; in that case it would be better if the zero were "generated" in an SSE register and stored to memory from the SSE reg.

rth hacked the constraints recently to have better ra for some fp cases. Can you see if the bug is still there today on mainline?

(In reply to comment #4)
> rth hacked the constraints recently to have better ra for some fp cases. Can
> you see if the bug is still there today on mainline?

gcc version 4.0.0 20050209 (experimental), with '-O2 -march=pentium4 -mfpmath=sse -ffast-math -D__NO_MATH_INLINES':

    ...
    comisd  .LC2, %xmm0
    jb      .L2
    fld1                      <- load of 1 into SSE reg
    fstpl   -112(%ebp)
    movsd   -112(%ebp), %xmm2
    fldz                      <- load of 0 into SSE reg
    fstl    -112(%ebp)
    movsd   -112(%ebp), %xmm1
    movapd  %xmm1, %xmm0
    .L4:
    fldl    32(%ebx)
    movsd   32(%ebx), %xmm3
    fstpl   -96(%ebp)
    movsd   %xmm3, -96(%ebp)
    movsd   -96(%ebp), %xmm3  [ -not needed- ]
    mulsd   %xmm2, %xmm3
    subsd   %xmm0, %xmm3
    fldl    24(%ebx)          <- move to temporary, this is OK
    fstpl   -88(%ebp)
    mulsd   -88(%ebp), %xmm2
    xorpd   .LC4, %xmm2
    movsd   -88(%ebp), %xmm4
    mulsd   %xmm1, %xmm4
    ...

The bug is still there (it also manifests itself in a couple of other places).

Could somebody confirm this bug? A testcase is attached.

Confirmed.

*** Bug 22453 has been marked as a duplicate of this bug. ***

Smaller testcase from PR 22453:

    extern double f(double, double);

    void g (double x)
    {
      double z, y;
      z = 0.0;
      y = 1.0 - x;
    again:
      z = y - z;
      f(z, 1.0);
      if (z == 0.0)
        goto again;
    }

has a fld1 instruction when compiled with "-mfpmath=sse -msse2 -msseregparm -mtune=pentiumpro -O2". This instruction is caused by a reload into a FLOAT_REGS register, and moving the value to an SSE register needs secondary memory.

I was going to add this text to PR 22453 when I noticed that it was closed as a duplicate of this one, so I am putting it here for reference, although everything seems to have been analyzed already:

The reload happens because reg 58 gets no hard reg: it is live over a call, and it is not worthwhile to put it into a call-clobbered reg (which SSE regs are). So reg 58 is placed on the stack (at ebp+16). Now this mem must be initialized with 1.0. Whether that is done via x87 (fld1; fst ebp+16), via GENERAL_REGS (mov 1.0 -> (reg:DF ax); mov (reg:DF ax) -> (ebp+16)), or via SSE_REGS (movsd (mem 1.0) -> xmm0; mov xmm0 -> (ebp+16)) is actually not that important: you won't get rid of this reload. Except that _if_ you force it to use SSE_REGS, then the next reload from (ebp+16) for the next insn can be inherited (as it's then in the same mode), hence the initial store to ebp+16 is useless and will be removed.
This can be tested with this hack:

    --- i386.md	12 Jul 2005 09:20:12 -0000	1.645
    +++ i386.md	13 Jul 2005 13:47:06 -0000
    @@ -2417,9 +2417,9 @@
     (define_insn "*movdf_nointeger"
       [(set (match_operand:DF 0 "nonimmediate_operand"
    -		"=f#Y,m ,f#Y,*r ,o ,Y*x#f,Y*x#f,Y*x#f ,m ")
    +		"=?f#Y,m ,f#Y,*?r ,o ,Y*x#f,Y*x#f,Y*x#f ,m ")
     	(match_operand:DF 1 "general_operand"
    -		"fm#Y,f#Y,G ,*roF,F*r,C ,Y*x#f,HmY*x#f,Y*x#f"))]
    +		"?fm#Y,f#Y,G ,*?roF,F*r,C ,Y*x#f,HmY*x#f,Y*x#f"))]
       "(GET_CODE (operands[0]) != MEM || GET_CODE (operands[1]) != MEM)
        && ((optimize_size || !TARGET_INTEGER_DFMODE_MOVES) && !TARGET_64BIT)
        && (reload_in_progress || reload_completed

But I don't immediately see how reload could be convinced to do so automatically, as the choice of the reload class for one insn is independent from the choices of reloads for the same reg in other insns.

http://gcc.gnu.org/ml/gcc-patches/2005-01/msg01783.html crashes on Richard Guenther's libsse2, FWIW.

Michael, thank you very much. Your analysis will probably help a lot.

> But I don't see immediately how reload could be convinced to do so
> automatically, as the choice of the reload class for one insn is independend
> from the choices of reloads for the same reg but in other insns.

We can use PREFERRED_RELOAD_CLASS and PREFERRED_OUTPUT_RELOAD_CLASS.

I am not sure if the fix is over-eager, but it works great on libsse2. Every fld that was there disappears, and from a cursory check, in current mainline's code all of them could have been inherited. Patches are at:

- http://gcc.gnu.org/ml/gcc-patches/2005-07/msg00914.html (regclass)
- http://gcc.gnu.org/ml/gcc-patches/2005-07/msg00983.html (reload+i386)

It is also necessary to remove the `#' hints for regclass from the MD, with

    sed -i 's/#[^",\][^",]*\([",]\)/\1/g' i386.md

Paolo

Created attachment 9374 [details]: updated patch

Updated patch; it should fix the perlbmk and twolf failures seen with previous versions of the patch.
The failures were reproducible on x86_64 with

    ./cc1 -O2 -fvisibility=hidden -quiet write2.i -o - -da -quiet -march=k8 -mfpmath=sse -m32

where write2.i is:

    extern double f;

    void dbl (long long value)
    {
      double g = (double) (value);
      __asm__ (";;;" : : "t" (g) : "xmm0", "xmm1",
               "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7");
      f = g + 2.0;
    }
(the asm is only there to force selection of the *floatdidf2_387 pattern)
Unfortunately the latest version of this patch causes a bootstrap failure on ppc:

    ../../gcc3.apple.200502/gcc/reload.c: In function 'find_reloads':
    ../../gcc3.apple.200502/gcc/reload.c:4512: internal compiler error: in do_output_reload, at reload1.c:6936

which is

    /* If is a JUMP_INSN, we can't support output reloads yet.  */
    gcc_assert (!JUMP_P (insn));

Digging further.

Created attachment 9418 [details]: Re-updated patch (reload.c only)
The preceding patch fixes the ICE I was getting. The tests following the modified area in find_reloads were being skipped in cases where they weren't before (in particular, when output reloads are not allowed, this was not detected). The revised patch moves the area Paolo modified below those tests. Going through more testing now.

Created attachment 9445 [details]: another patch update
More problems... Below this comment:

    /* Alternative loses if it requires a type of reload not
       permitted for this insn.  We can always reload SCRATCH
       and objects with a REG_UNUSED note.  */

there is an "else if" that must be changed to an "if". Attaching the patch that fixes this.
Created attachment 9454 [details]: one more updated patch

Note that http://gcc.gnu.org/ml/gcc/2005-07/txt00007.txt should be applied together with this patch, because each patch helps the other. SPEC results for i686-pc-linux-gnu follow. The only significant regression is in galgel; overall it is about 1% better for SPECint and 2% better for SPECfp. Note that crafty improves a lot because of Dale's patch.

                         Base                       Peak
                     ref    time   ratio       ref    time   ratio
    164.gzip        1400    151     925*      1400    152     921*
    175.vpr         1400    163     859*      1400    162     862*
    176.gcc         1100    70.9   1552*      1100    71.1   1548*
    181.mcf         1800    176    1021*      1800    176    1020*
    186.crafty      1000    107     938*      1000    102     984*
    197.parser      1800    205     877*      1800    205     876*
    252.eon         1300    149     873*      1300    141     919*
    253.perlbmk     1800    126    1434*      1800    123    1464*
    254.gap         1100    82.1   1340*      1100    80.8   1361*
    255.vortex      1900    134    1415*      1900    134    1413*
    256.bzip2       1500    161     930*      1500    161     933*
    300.twolf       3000    235    1276*      3000    234    1281*
    SPECint_base2000        1093
    SPECint2000                                1106

    168.wupwise     1600    174     920*      1600    173     926*
    171.swim        3100    180    1721*      3100    181    1713*
    172.mgrid       1800    257     700*      1800    257     701*
    173.applu       2100    200    1049*      2100    199    1056*
    177.mesa        1400    125    1116*      1400    123    1138*
    178.galgel      2900    305     952*      2900    312     930*
    179.art         2600    216    1206*      2600    212    1229*
    183.equake      1300    80.5   1615*      1300    76.1   1708*
    187.facerec     1900    286     664*      1900    286     664*
    188.ammp        2200    305     721*      2200    298     739*
    189.lucas       2000    225     889*      2000    203     983*
    191.fma3d       2100    397     530*      2100    373     563*
    200.sixtrack    1100    188     587*      1100    188     586*
    301.apsi        2600    262     991*      2600    261     996*
    SPECfp_base2000          921
    SPECfp2000                                  939

    NOTES
    -----
    Base flags: -O2 -msse -msse2 -mfpmath=sse
    Peak flags: -O2 -msse -msse2 -mfpmath=sse

(In reply to comment #18)
> Created an attachment (id=9454)
> one more updated patch
>
> Note that http://gcc.gnu.org/ml/gcc/2005-07/txt00007.txt should be applied
> together with this patch, because each patch helps the other.

We have the entire patch in our Apple local branch. This patch causes a performance degradation in a user benchmark, summarized in this test case:

    typedef int __m64 __attribute__ ((__vector_size__ (8)));

    static __inline __m64 __attribute__((__always_inline__))
    _mm_add_si64 (__m64 __m1, __m64 __m2)
    {
      return (__m64) __builtin_ia32_paddq ((long long)__m1, (long long)__m2);
    }

    __m64
    unsigned_add3 (const __m64 *a, const __m64 *b, __m64 *result, unsigned long count)
    {
      __m64 sum, _a, _b;
      unsigned int i;

      _a = a[0];
      _b = b[0];
      sum = _mm_add_si64 (_a, _b);

      for (i = 1; i < count; i++)
        {
          result[i-1] = sum;
          _a = a[i];
          _b = b[i];
          sum = _mm_add_si64 (_a, _b);
        }

      return sum;
    }

Compiling this test case with -O1 (-fPIC -march=pentium4 are defaults), we get:

    L4:
        movl    20(%ebp), %edi
        movl    %eax, -8(%edi,%ecx,8)
        movl    %edx, -4(%edi,%ecx,8)
        leal    0(,%ecx,8), %eax
        movl    16(%ebp), %edx
        movq    (%edx,%eax), %mm0
        movq    %mm0, -32(%ebp)    <--------
        movq    -32(%ebp), %mm0    <--------
        movl    12(%ebp), %edx
        paddq   (%edx,%eax), %mm0
        movq    %mm0, -40(%ebp)
        movl    -40(%ebp), %eax
        movl    -36(%ebp), %edx
        addl    $1, %ecx
        cmpl    %ecx, %esi
        jne     L4

Note the redundant store and load. FSF gcc-4.0, by contrast, produces the following loop (no redundancy):

    L4:
        movl    20(%ebp), %edi
        movl    %eax, -8(%edi,%ecx,8)
        movl    %edx, -4(%edi,%ecx,8)
        leal    0(,%ecx,8), %eax
        movl    12(%ebp), %edx
        movq    (%edx,%eax), %mm0
        movl    16(%ebp), %edi
        paddq   (%edi,%eax), %mm0
        movq    %mm0, -16(%ebp)
        movl    -16(%ebp), %eax
        movl    -12(%ebp), %edx
        addl    $1, %ecx
        cmpl    %ecx, %esi
        jne     L4

The main culprit in the patch is this statement in regclass.c:

    /* If no register class is better than memory, use memory. */
    if (p->mem_cost < best_cost)
      best = NO_REGS;

For the following RTL:

    (insn 21 19 23 0 (set (reg/v:V2SI 63 [ sum ])
            (subreg:V2SI (reg:DI 69) 0)) 736 {*movv2si_internal} (insn_list:REG_DEP_TRUE 19 (nil))
        (expr_list:REG_DEAD (reg:DI 69)
            (nil)))

the cost computation says MMX_REGS is inferior to a store to memory. I spent some time trying to find out why the cost of memory is less than the cost of keeping the operation in a register. This is the pattern which causes the register cost to be higher (in mmx.md):

    (define_insn "*mov<mode>_internal"
      [(set (match_operand:MMXMODEI 0 "nonimmediate_operand"
    		"=*y,*y ,m ,*y,*Y,*Y,*Y ,m ,*x,*x,*x,m ,?r ,?m")
    	(match_operand:MMXMODEI 1 "vector_move_operand"
    		"C ,*ym,*y,*Y,*y,C ,*Ym,*Y,C ,*x,m ,*x,irm,r"))]

Note that in this pattern the cost computation for MMX_REGS is ignored in every alternative ('*' in front of y), so the cost that is computed is the one for 'r', which is GENERAL_REGS. That cost is too high and eventually results in the memory cost being lower than the register cost. I tried the following simple patch as an experiment and got all the performance back (it is now comparable with 4.0). Note that in this patch I removed the '*' in the 2nd alternative, so the cost of keeping the operand in the MMX_REGS class is factored in; this resulted in a lower cost than that of memory. Is this the way to go? It is just an experiment which seems to work, but I am wondering about its implications, as most of the other alternatives also have their cost ignored in this and similar patterns. Here is the patch:

    RCS file: /cvs/gcc/gcc/gcc/config/i386/mmx.md,v
    retrieving revision 1.6.14.3
    diff -c -p -r1.6.14.3 mmx.md
    *** config/i386/mmx.md	9 Jul 2005 22:06:34 -0000	1.6.14.3
    --- config/i386/mmx.md	20 Sep 2005 21:23:52 -0000
    ***************
    *** 87,95 ****
      (define_insn "*mov<mode>_internal"
        [(set (match_operand:MMXMODEI 0 "nonimmediate_operand"
    !		"=*y,*y ,m ,*y,*Y,*Y,*Y ,m ,*x,*x,*x,m ,?r ,?m")
    	  (match_operand:MMXMODEI 1 "vector_move_operand"
    !		"C ,*ym,*y,*Y,*y,C ,*Ym,*Y,C ,*x,m ,*x,irm,r"))]
        "TARGET_MMX
         && (GET_CODE (operands[0]) != MEM || GET_CODE (operands[1]) != MEM)"
        "@
    --- 87,95 ----
      (define_insn "*mov<mode>_internal"
        [(set (match_operand:MMXMODEI 0 "nonimmediate_operand"
    !		"=*y,y ,m ,*y,*Y,*Y,*Y ,m ,*x,*x,*x,m ,?r ,?m")
    	  (match_operand:MMXMODEI 1 "vector_move_operand"
    !		"C ,ym,*y,*Y,*y,C ,*Ym,*Y,C ,*x,m ,*x,irm,r"))]
        "TARGET_MMX
         && (GET_CODE (operands[0]) != MEM || GET_CODE (operands[1]) != MEM)"
        "@

Subject: Re: x87 reg allocated for constants for -mfpmath=sse
>Note that in this pattern cost computation of MMX_REGS are all ignored ('*' in front of y). So, the cost
>which is computed is for 'r' which is GENERAL_REGS. This cost is too high and eventually results in
>memory cost to be lower than register cost. I tried the following simple patch as experiment and got all
>the performance back (it is now comparable with 4.0). Note that in this patch, I removed the '*' in the
>2nd alternative so cost of keeping the operand in mmx_regs class is factored in. This resulted in a
>lower cost than that of memory. Is this the way to go? This is just an experiment which seems to work.
>
>
I think it makes sense. The x86 back-end is playing too many tricks
(such as the # classes) with the register allocator and regclass
especially, and they are biting back.
Still, I'd rather hear from an expert as to why the classes were written
like this.
Paolo
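
As an aside for readers following the thread, the regclass decision being discussed can be pictured with a small self-contained sketch. This is only an illustration, not the actual regclass.c code — the names pick_preferred_class, pseudo_costs and class_usable are invented here — but it mimics the quoted `if (p->mem_cost < best_cost) best = NO_REGS;` logic: alternatives hidden behind `*` contribute no usable register cost, so memory can win the comparison.

    #include <limits.h>

    /* Illustrative sketch only -- not the actual GCC regclass.c code.
       Classes whose constraint alternatives are all starred ('*') are
       treated as if they contributed no usable cost, so only the
       remaining classes compete against the memory cost.  */
    enum reg_class { NO_REGS, GENERAL_REGS, MMX_REGS, SSE_REGS, N_REG_CLASSES };

    struct pseudo_costs
    {
      int reg_cost[N_REG_CLASSES];  /* cost of keeping the pseudo in each class */
      int mem_cost;                 /* cost of keeping it in a stack slot */
    };

    static enum reg_class
    pick_preferred_class (const struct pseudo_costs *p,
                          const int class_usable[N_REG_CLASSES])
    {
      enum reg_class best = NO_REGS;
      int best_cost = INT_MAX;
      int c;

      for (c = GENERAL_REGS; c < N_REG_CLASSES; c++)
        if (class_usable[c] && p->reg_cost[c] < best_cost)
          {
            best = (enum reg_class) c;
            best_cost = p->reg_cost[c];
          }

      /* The statement quoted in the comment above: if no register class
         beats memory, record no preference and leave the pseudo in memory.  */
      if (p->mem_cost < best_cost)
        best = NO_REGS;

      return best;
    }

With every MMX alternative starred, class_usable[MMX_REGS] is effectively 0 in this picture, so only GENERAL_REGS competes against mem_cost and the pseudo ends up in a stack slot, producing the redundant store/load pair shown earlier.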
I agree with Paolo. Dale, can you please take care of merging this into 4.2?

Subject: Bug 19653
Author: bonzini
Date: Mon Apr 3 11:20:07 2006
New Revision: 112637

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=112637

Log:
2005-08-08  Paolo Bonzini  <bonzini@gnu.org>
            Dale Johannesen  <dalej@apple.com>

	PR target/19653
	* regclass.c (struct reg_pref): Update documentation.
	(regclass): Set prefclass to NO_REGS if memory is the best option.
	(record_reg_classes): Cope with a prefclass set to NO_REGS.
	* reload.c (find_reloads): Take PREFERRED_OUTPUT_RELOAD_CLASS
	into account.  For non-registers, equate an empty preferred
	reload class to a `!' in the constraint; move the if clause to
	do so after those that reject the insn.
	(push_reload): Allow PREFERRED_*_RELOAD_CLASS to liberally
	return NO_REGS.
	(find_dummy_reload): Likewise.
	* doc/tm.texi (Register Classes): Document what it means
	if PREFERRED_*_RELOAD_CLASS return NO_REGS.
	* config/i386/i386.c (ix86_preferred_reload_class): Force
	using SSE registers (and return NO_REGS for floating-point
	constants) if math is done with SSE.
	(ix86_preferred_output_reload_class): New.
	* config/i386/i386-protos.h (ix86_preferred_output_reload_class): New.
	* config/i386/i386.h (PREFERRED_OUTPUT_RELOAD_CLASS): New.
	* config/i386/i386.md: Remove # register preferences.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/config/i386/i386-protos.h
	trunk/gcc/config/i386/i386.c
	trunk/gcc/config/i386/i386.h
	trunk/gcc/config/i386/i386.md
	trunk/gcc/doc/tm.texi
	trunk/gcc/regclass.c
	trunk/gcc/reload.c

Fixed on mainline.

Subject: Bug 19653
Author: bonzini
Date: Tue Apr 18 08:23:39 2006
New Revision: 113026

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=113026

Log:
2006-04-18  Paolo Bonzini  <bonzini@gnu.org>

	PR target/27117

	Partial revert of revision 112637
	2006-04-03  Paolo Bonzini  <bonzini@gnu.org>
		    Dale Johannesen  <dalej@apple.com>

	PR target/19653
	* regclass.c (struct reg_pref): Update documentation.
	(regclass): Set prefclass to NO_REGS if memory is the best option.
	(record_reg_classes): Cope with a prefclass set to NO_REGS.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/regclass.c
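
For completeness, here is a rough sketch of the i386 part of the fix described in the ChangeLog above (the part not touched by the partial revert, which only undoes the regclass.c changes). It is a paraphrase written for illustration, not the committed GCC source: it assumes the usual i386 back-end environment (rtl.h and i386.h macros such as TARGET_SSE_MATH, SSE_FLOAT_MODE_P and MAYBE_SSE_CLASS_P), and the function name is invented.

    /* Rough sketch of the i386 change described in the ChangeLog; a
       paraphrase, not the committed code.  When FP math is done in SSE,
       floating-point constants get no preferred register class at all
       (so reload keeps them in memory instead of materializing them with
       an x87 fld1/fldz/fldl), and other FP values are steered toward the
       SSE registers rather than the x87 stack.  */
    static enum reg_class
    sketch_preferred_reload_class (rtx x, enum reg_class rclass)
    {
      enum machine_mode mode = GET_MODE (x);

      if (TARGET_SSE_MATH && SSE_FLOAT_MODE_P (mode))
        {
          /* An FP constant would need an x87 load just to be stored back
             to memory; returning NO_REGS tells reload that memory itself
             is fine, which is what lets the later load be inherited.  */
          if (GET_CODE (x) == CONST_DOUBLE)
            return NO_REGS;

          /* Otherwise prefer SSE registers for SFmode/DFmode values.  */
          if (MAYBE_SSE_CLASS_P (rclass))
            return SSE_REGS;

          return NO_REGS;
        }

      return rclass;
    }

Returning NO_REGS from PREFERRED_*_RELOAD_CLASS is exactly the behaviour the reload.c and doc/tm.texi changes in the commit make legal, and it is what removes the fld1/fstpl/movsd sequences shown at the top of this report.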