On x86_64, the following function (extracted from gmp, the asms are from its longlong.h) void mul_basecase (unsigned long * wp, unsigned long * up, long un, unsigned long * vp, long vn) { long j; unsigned long prod_low, prod_high; unsigned long cy_dig; unsigned long v_limb; v_limb = vp[0]; cy_dig = 0; for (j = un; j > 0; j--) { unsigned long u_limb, w_limb; u_limb = *up++; __asm__ ("mulq %3" : "=a" (prod_low), "=d" (prod_high) : "%0" (u_limb), "rm" (v_limb)); __asm__ ("addq %5,%q1\n" "\tadcq %3,%q0" : "=r" (cy_dig), "=&r" (w_limb) : "0" (prod_high), "rme" (0), "%1" (prod_low), "rme" (cy_dig)); *wp++ = w_limb; } } produces with optimization: #APP # 16 "t.c" 1 mulq %rcx # 0 "" 2 # 18 "t.c" 1 addq %rdx,%rax adcq $0,%rdx # 0 "" 2 #NO_APP where you can see that cy_dig is allocated to the upper half of the mulq result. It looks like that we are confused by rest_of_match_asm_constraints which inserts (insn 44 25 26 4 t.c:18 (set (reg/v:DI 67 [ cy_dig ]) (reg/v:DI 68 [ prod_high ])) -1 (nil)) after the mulq asm.
1) umul_ppmm(high_prod, low_prod, multipler, multiplicand) multiplies two UWtype integers MULTIPLER and MULTIPLICAND, and generates a two UWtype word product in HIGH_PROD and LOW_PROD. #define umul_ppmm(w1, w0, u, v) \ __asm__ ("mulq %3" \ : "=a" (w0), "=d" (w1) \ : "%0" ((UDItype)(u)), "rm" ((UDItype)(v))) 7) add_ssaaaa(high_sum, low_sum, high_addend_1, low_addend_1, high_addend_2, low_addend_2) adds two UWtype integers, composed by HIGH_ADDEND_1 and LOW_ADDEND_1, and HIGH_ADDEND_2 and LOW_ADDEND_2 respectively. The result is placed in HIGH_SUM and LOW_SUM. Overflow (i.e. carry out) is not stored anywhere, and is lost. #define add_ssaaaa(sh, sl, ah, al, bh, bl) \ __asm__ ("addq %5,%q1\n\tadcq %3,%q0" \ : "=r" (sh), "=&r" (sl) \ : "0" ((UDItype)(ah)), "rme" ((UDItype)(bh)), \ "%1" ((UDItype)(al)), "rme" ((UDItype)(bl))) marking %0 early-clobbered fixes the problem.
Does it fix the bug to remove the pass?
Yes, disabling rest_of_match_asm_constraints fixes the problem.
Correct asm is then for example #APP # 16 "t.c" 1 mulq %rcx # 0 "" 2 # 18 "t.c" 1 addq %r9,%rax adcq $0,%rdx # 0 "" 2 #NO_APP
(In reply to comment #1) > #define add_ssaaaa(sh, sl, ah, al, bh, bl) \ > __asm__ ("addq %5,%q1\n\tadcq %3,%q0" \ > : "=r" (sh), "=&r" (sl) \ > : "0" ((UDItype)(ah)), "rme" ((UDItype)(bh)), \ > "%1" ((UDItype)(al)), "rme" ((UDItype)(bl))) > > marking %0 early-clobbered fixes the problem. We also have longlong.h in gcc/ directory, where #define add_ssaaaa(sh, sl, ah, al, bh, bl) \ __asm__ ("addq %5,%1\n\tadcq %3,%0" \ : "=r" ((UDItype) (sh)), \ "=&r" ((UDItype) (sl)) \ : "%0" ((UDItype) (ah)), \ "rme" ((UDItype) (bh)), \ "%1" ((UDItype) (al)), \ "rme" ((UDItype) (bl))) It looks to me that gmp's longlong.h is buggy (gcc/longlong.h is used in soft-fp TFmode calculations and these are used in a couple of places through the testsuite). >
Subject: Re: wrong code for multiple output asm, wrong df? ubizjak at gmail dot com wrote: > ------- Comment #5 from ubizjak at gmail dot com 2007-09-25 13:58 ------- > (In reply to comment #1) > >> #define add_ssaaaa(sh, sl, ah, al, bh, bl) \ >> __asm__ ("addq %5,%q1\n\tadcq %3,%q0" \ >> : "=r" (sh), "=&r" (sl) \ >> : "0" ((UDItype)(ah)), "rme" ((UDItype)(bh)), \ >> "%1" ((UDItype)(al)), "rme" ((UDItype)(bl))) >> >> marking %0 early-clobbered fixes the problem. > > We also have longlong.h in gcc/ directory, where > > #define add_ssaaaa(sh, sl, ah, al, bh, bl) \ > __asm__ ("addq %5,%1\n\tadcq %3,%0" \ > : "=r" ((UDItype) (sh)), \ > "=&r" ((UDItype) (sl)) \ > : "%0" ((UDItype) (ah)), \ > "rme" ((UDItype) (bh)), \ > "%1" ((UDItype) (al)), \ > "rme" ((UDItype) (bl))) I think that both version ought to work. Paolo
(In reply to comment #6) > I think that both version ought to work. There is a comment in the '%' documentation: GCC can only handle one commutative pair in an asm; if you use more, the compiler may fail. Note that you need not use the The 'fail' word is not exactly encouraging to use two '%'s in the asm.
Subject: Re: wrong code for multiple output asm, wrong df? > There is a comment in the '%' documentation: > > GCC can only handle one commutative pair in an asm; if you use > more, the compiler may fail. Note that you need not use the Right, but then it is gmp (the subject of this PR) who's right and gcc who's wrong. You said the other way round. :-)
Use of % in asm should be discougaged.
(In reply to comment #8) > Right, but then it is gmp (the subject of this PR) who's right and gcc > who's wrong. You said the other way round. :-) ;) My conclusion is based on the generated code, where gmp's version produces wrong and gcc's version produces correct code. But let's fix this problem for real.
The fact that "%0" version produces correct code is simply due to strtoul in match_asm_constraints_1, where "%0" is not recognized as a valid string for conversion and further processing of input constraint is stopped.
(In reply to comment #1) > marking %0 early-clobbered fixes the problem. Please look at comment #7. Confirmed as a 4.3 regression, something is wrong in match_asm_constraints_1.
(In reply to comment #12) > > marking %0 early-clobbered fixes the problem. > Please look at comment #7. I think I need some sleep. I was thinking of comment #11.
This isn't related to commutative constraints, can be reproduced with: void mul_basecase (unsigned long *wp, unsigned long *up, long un, unsigned long *vp, long vn) { long j; unsigned long prod_low, prod_high; unsigned long cy_dig; unsigned long v_limb; v_limb = vp[0]; cy_dig = 0; for (j = un; j > 0; j--) { unsigned long u_limb, w_limb; u_limb = *up++; __asm__ ("mulq %3" : "=a" (prod_low), "=d" (prod_high) : "0" (u_limb), "rm" (v_limb)); __asm__ ("addq %5,%q1\n\tadcq %3,%q0" : "=r" (cy_dig), "=&r" (w_limb) : "0" (prod_high), "rme" (0), "1" (prod_low), "rme" (cy_dig)); *wp++ = w_limb; } } The problem is that match_asm_constraints_1 doesn't do any checking whether the change it wants to do is actually valid. Particularly it must and does not check whether the output (whose value it will kill in the new insn prepended before the asm) isn't among inputs of the asm. Also, I wonder whether it shouldn't limit any changes to REGs, ATM it will happily copy around MEMs etc., which IMHO is highly undesirable. When the output is present among inputs (except for the one with matching constraint), we have either a choice to create a new pseudo or just don't do anything.
The restriction at least not to allow MEM_Ps was posted in: http://gcc.gnu.org/ml/gcc-patches/2007-07/msg01329.html but never applied to the trunk. But I believe it should instead check just for REG_P, instead of !MEM_P.
I will implement something along the lines that Jakub discussed. In the meanwhile, could anybody figure a self-contained execution testcase based on comment #14? Thanks!
(In reply to comment #16) something like this (for gcc.target/i386): /* { dg-do run } */ /* { dg-require-effective-target lp64 } */ /* { dg-options "-O2"} */ unsigned long a [100]; int main (void) { unsigned long v = 0x500000000UL, c = 13, low, high; long j; a [0] = 0x200000000UL; a [1] = 0x300000000UL; for (j = 0; j < 100; j++) { asm ("mulq %3" : "=a" (low), "=d" (high) : "0" (a [j]), "rm" (v)); asm ("addq %5,%q1\n\tadcq %3,%q0" : "=r" (c), "=&r" (a [j]) : "0" (high), "rme" (0), "1" (low), "rme" (c)); } if (a [0] != 13 || a [1] != 10 || a [2] != 15 || a [3] != 0) __builtin_abort (); return 0; }
Sure, no problem. /* PR rtl-optimization/33552 */ /* { dg-do run } */ /* { dg-options "-O2" } */ extern void abort (void); void __attribute__((noinline)) foo (unsigned long *wp, unsigned long *up, long un, unsigned long *vp) { long j; unsigned long prod_low, prod_high; unsigned long cy_dig; unsigned long v_limb; v_limb = vp[0]; cy_dig = 64; for (j = un; j > 0; j--) { unsigned long u_limb, w_limb; u_limb = *up++; __asm__ ("" : "=r" (prod_low), "=r" (prod_high) : "0" (u_limb), "1" (v_limb)); __asm__ ("mov %5, %1; add %5, %0" : "=r" (cy_dig), "=&r" (w_limb) : "0" (prod_high), "rm" (0), "1" (prod_low), "rm" (cy_dig)); *wp++ = w_limb; } } int main (void) { unsigned long wp[4]; unsigned long up[4] = { 0x1248, 0x248a, 0x1745, 0x1853 }; unsigned long vp = 0xdead; foo (wp, up, 4, &vp); if (wp[0] != 0x40 || wp[1] != 0xdeed || wp[2] != 0x1bd9a || wp[3] != 0x29c47) abort (); return 0; } For dg.target/i386, unless we add a bunch of target variants for the only asm arch dependent string in there.
Have a patch on http://gcc.gnu.org/ml/gcc-patches/2007-09/msg01968.html . Fixes also the reload failure on x86 -O2 -fPIC on this testcase (which hits glibc): /* { dg-do compile } */ /* { dg-options "-O2 -fPIC" } */ typedef unsigned int size_t; char * __mempcpy_by2 (char *__dest, __const char *__src, size_t __srclen) { register char *__tmp = __dest; register unsigned long int __d0, __d1; __asm__ __volatile__ ("shrl $1,%3\n\t" "jz 2f\n" "1:\n\t" "movl (%2),%0\n\t" "leal 4(%2),%2\n\t" "movl %0,(%1)\n\t" "leal 4(%1),%1\n\t" "decl %3\n\t" "jnz 1b\n" "2:\n\t" "movw (%2),%w0\n\t" "movw %w0,(%1)" : "=&q" (__d0), "=r" (__tmp), "=&r" (__src), "=&r" (__d1), "=m" ( *(struct { __extension__ char __x[__srclen]; } *)__dest) : "1" (__tmp), "2" (__src), "3" (__srclen / 2), "m" ( *(struct { __extension__ char __x[__srclen]; } *)__src) : "cc"); return __tmp + 2; }
Thanks.
(In reply to comment #19) > Have a patch on http://gcc.gnu.org/ml/gcc-patches/2007-09/msg01968.html . > Fixes also the reload failure on x86 -O2 -fPIC on this testcase (which hits > glibc): Do we need a solution for "%." decorations of matched operands? strtoul() will not return 0 for "%0" operands, as described in Comment 11.
I don't think so, as you don't know which input will match the output (i.e. whether the two inputs will be swapped) if you have a % constraint in the asm.
Subject: Bug 33552 Author: matz Date: Fri Sep 28 13:31:50 2007 New Revision: 128864 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=128864 Log: PR rtl-optimization/33552 * function.c (match_asm_constraints_1): Check for overlap in inputs and replace all occurences. Modified: trunk/gcc/ChangeLog trunk/gcc/function.c
Subject: Bug 33552 Author: matz Date: Fri Sep 28 13:33:09 2007 New Revision: 128865 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=128865 Log: PR rtl-optimization/33552 * gcc.target/i386/pr33552.c: New runtime test. * gcc.target/i386/strinline.c: New compile time test. Added: trunk/gcc/testsuite/gcc.target/i386/pr33552.c trunk/gcc/testsuite/gcc.target/i386/strinline.c Modified: trunk/gcc/testsuite/ChangeLog
Fixed now.