Bug 33552 - [4.3 Regression] wrong code for multiple output asm, wrong df?
Summary: [4.3 Regression] wrong code for multiple output asm, wrong df?
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 4.3.0
: P3 major
Target Milestone: 4.3.0
Assignee: Not yet assigned to anyone
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2007-09-25 12:22 UTC by Richard Biener
Modified: 2007-09-28 14:45 UTC (History)
6 users (show)

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2007-09-27 06:28:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Richard Biener 2007-09-25 12:22:50 UTC
On x86_64, the following function (extracted from gmp, the asms are from
its longlong.h)

void
mul_basecase (unsigned long * wp, unsigned long * up, long un,
              unsigned long * vp, long vn)
{
  long j;
  unsigned long prod_low, prod_high;
  unsigned long cy_dig;
  unsigned long v_limb;

  v_limb = vp[0];
  cy_dig = 0;
  for (j = un; j > 0; j--)
    {
      unsigned long u_limb, w_limb;
      u_limb = *up++;
      __asm__ ("mulq %3" : "=a" (prod_low), "=d" (prod_high)
                         : "%0" (u_limb), "rm" (v_limb));
      __asm__ ("addq %5,%q1\n"
             "\tadcq %3,%q0" : "=r" (cy_dig), "=&r" (w_limb)
                             : "0" (prod_high), "rme" (0), "%1" (prod_low), "rme" (cy_dig));
      *wp++ = w_limb;
    }
}

produces with optimization:

#APP
# 16 "t.c" 1
        mulq %rcx
# 0 "" 2
# 18 "t.c" 1
        addq %rdx,%rax
        adcq $0,%rdx
# 0 "" 2
#NO_APP

where you can see that cy_dig is allocated to the upper half of the mulq
result.  It looks like that we are confused by rest_of_match_asm_constraints
which inserts

(insn 44 25 26 4 t.c:18 (set (reg/v:DI 67 [ cy_dig ])
        (reg/v:DI 68 [ prod_high ])) -1 (nil))

after the mulq asm.
Comment 1 Richard Biener 2007-09-25 12:32:25 UTC
   1) umul_ppmm(high_prod, low_prod, multipler, multiplicand) multiplies two
   UWtype integers MULTIPLER and MULTIPLICAND, and generates a two UWtype
   word product in HIGH_PROD and LOW_PROD.

#define umul_ppmm(w1, w0, u, v) \
  __asm__ ("mulq %3"                                                    \
           : "=a" (w0), "=d" (w1)                                       \
           : "%0" ((UDItype)(u)), "rm" ((UDItype)(v)))


   7) add_ssaaaa(high_sum, low_sum, high_addend_1, low_addend_1,
   high_addend_2, low_addend_2) adds two UWtype integers, composed by
   HIGH_ADDEND_1 and LOW_ADDEND_1, and HIGH_ADDEND_2 and LOW_ADDEND_2
   respectively.  The result is placed in HIGH_SUM and LOW_SUM.  Overflow
   (i.e. carry out) is not stored anywhere, and is lost.

#define add_ssaaaa(sh, sl, ah, al, bh, bl) \
  __asm__ ("addq %5,%q1\n\tadcq %3,%q0"                                 \
           : "=r" (sh), "=&r" (sl)                                      \
           : "0"  ((UDItype)(ah)), "rme" ((UDItype)(bh)),               \
             "%1" ((UDItype)(al)), "rme" ((UDItype)(bl)))

marking %0 early-clobbered fixes the problem.
Comment 2 Paolo Bonzini 2007-09-25 12:33:21 UTC
Does it fix the bug to remove the pass?
Comment 3 Richard Biener 2007-09-25 12:37:03 UTC
Yes, disabling rest_of_match_asm_constraints fixes the problem.
Comment 4 Richard Biener 2007-09-25 12:40:29 UTC
Correct asm is then for example

#APP
# 16 "t.c" 1
        mulq %rcx
# 0 "" 2
# 18 "t.c" 1
        addq %r9,%rax
        adcq $0,%rdx
# 0 "" 2
#NO_APP
Comment 5 Uroš Bizjak 2007-09-25 13:58:36 UTC
(In reply to comment #1)

> #define add_ssaaaa(sh, sl, ah, al, bh, bl) \
>   __asm__ ("addq %5,%q1\n\tadcq %3,%q0"                                 \
>            : "=r" (sh), "=&r" (sl)                                      \
>            : "0"  ((UDItype)(ah)), "rme" ((UDItype)(bh)),               \
>              "%1" ((UDItype)(al)), "rme" ((UDItype)(bl)))
> 
> marking %0 early-clobbered fixes the problem.

We also have longlong.h in gcc/ directory, where

#define add_ssaaaa(sh, sl, ah, al, bh, bl) \
  __asm__ ("addq %5,%1\n\tadcq %3,%0"					\
	   : "=r" ((UDItype) (sh)),					\
	     "=&r" ((UDItype) (sl))					\
	   : "%0" ((UDItype) (ah)),					\
	     "rme" ((UDItype) (bh)),					\
	     "%1" ((UDItype) (al)),					\
	     "rme" ((UDItype) (bl)))

It looks to me that gmp's longlong.h is buggy (gcc/longlong.h is used in soft-fp TFmode calculations and these are used in a couple of places through the testsuite).
> 

Comment 6 paolo.bonzini@lu.unisi.ch 2007-09-25 14:22:03 UTC
Subject: Re:  wrong code for multiple output asm,
 wrong df?

ubizjak at gmail dot com wrote:
> ------- Comment #5 from ubizjak at gmail dot com  2007-09-25 13:58 -------
> (In reply to comment #1)
> 
>> #define add_ssaaaa(sh, sl, ah, al, bh, bl) \
>>   __asm__ ("addq %5,%q1\n\tadcq %3,%q0"                                 \
>>            : "=r" (sh), "=&r" (sl)                                      \
>>            : "0"  ((UDItype)(ah)), "rme" ((UDItype)(bh)),               \
>>              "%1" ((UDItype)(al)), "rme" ((UDItype)(bl)))
>>
>> marking %0 early-clobbered fixes the problem.
> 
> We also have longlong.h in gcc/ directory, where
> 
> #define add_ssaaaa(sh, sl, ah, al, bh, bl) \
>   __asm__ ("addq %5,%1\n\tadcq %3,%0"                                   \
>            : "=r" ((UDItype) (sh)),                                     \
>              "=&r" ((UDItype) (sl))                                     \
>            : "%0" ((UDItype) (ah)),                                     \
>              "rme" ((UDItype) (bh)),                                    \
>              "%1" ((UDItype) (al)),                                     \
>              "rme" ((UDItype) (bl)))

I think that both version ought to work.

Paolo
Comment 7 Uroš Bizjak 2007-09-25 14:29:14 UTC
(In reply to comment #6)

> I think that both version ought to work.

There is a comment in the '%' documentation:

     GCC can only handle one commutative pair in an asm; if you use
     more, the compiler may fail.  Note that you need not use the

The 'fail' word is not exactly encouraging to use two '%'s in the asm.
Comment 8 paolo.bonzini@lu.unisi.ch 2007-09-25 14:40:00 UTC
Subject: Re:  wrong code for multiple output asm,
 wrong df?

> There is a comment in the '%' documentation:
> 
>      GCC can only handle one commutative pair in an asm; if you use
>      more, the compiler may fail.  Note that you need not use the

Right, but then it is gmp (the subject of this PR) who's right and gcc 
who's wrong.  You said the other way round. :-)
Comment 9 Andrew Pinski 2007-09-25 17:15:06 UTC
Use of % in asm should be discougaged.
Comment 10 Uroš Bizjak 2007-09-25 17:53:43 UTC
(In reply to comment #8)

> Right, but then it is gmp (the subject of this PR) who's right and gcc 
> who's wrong.  You said the other way round. :-)

;)

My conclusion is based on the generated code, where gmp's version produces wrong and gcc's version produces correct code. But let's fix this problem for real.
Comment 11 Uroš Bizjak 2007-09-25 18:48:10 UTC
The fact that "%0" version produces correct code is simply due to strtoul in match_asm_constraints_1, where "%0" is not recognized as a valid string for conversion and further processing of input constraint is stopped.
Comment 12 Uroš Bizjak 2007-09-25 18:57:01 UTC
(In reply to comment #1)

> marking %0 early-clobbered fixes the problem.

Please look at comment #7.

Confirmed as a 4.3 regression, something is wrong in match_asm_constraints_1. 
Comment 13 Uroš Bizjak 2007-09-25 19:29:53 UTC
(In reply to comment #12)

> > marking %0 early-clobbered fixes the problem.
> Please look at comment #7.

I think I need some sleep. I was thinking of comment #11.
Comment 14 Jakub Jelinek 2007-09-26 21:19:48 UTC
This isn't related to commutative constraints, can be reproduced with:
void
mul_basecase (unsigned long *wp, unsigned long *up, long un,
              unsigned long *vp, long vn)
{
  long j;
  unsigned long prod_low, prod_high;
  unsigned long cy_dig;
  unsigned long v_limb;
  v_limb = vp[0];
  cy_dig = 0;
  for (j = un; j > 0; j--)
    {
      unsigned long u_limb, w_limb;
      u_limb = *up++;
      __asm__ ("mulq %3"
               : "=a" (prod_low), "=d" (prod_high)
               : "0" (u_limb), "rm" (v_limb));
      __asm__ ("addq %5,%q1\n\tadcq %3,%q0"
               : "=r" (cy_dig), "=&r" (w_limb)
               : "0" (prod_high), "rme" (0), "1" (prod_low), "rme" (cy_dig));
      *wp++ = w_limb;
    }
}

The problem is that match_asm_constraints_1 doesn't do any checking whether
the change it wants to do is actually valid.
Particularly it must and does not check whether the output (whose value it will
kill in the new insn prepended before the asm) isn't among inputs of the asm.
Also, I wonder whether it shouldn't limit any changes to REGs, ATM it will happily copy around MEMs etc., which IMHO is highly undesirable.
When the output is present among inputs (except for the one with matching constraint), we have either a choice to create a new pseudo or just don't do anything.
Comment 15 Jakub Jelinek 2007-09-26 21:24:37 UTC
The restriction at least not to allow MEM_Ps was posted in:
http://gcc.gnu.org/ml/gcc-patches/2007-07/msg01329.html
but never applied to the trunk.  But I believe it should instead check just
for REG_P, instead of !MEM_P.
Comment 16 Paolo Bonzini 2007-09-27 06:28:00 UTC
I will implement something along the lines that Jakub discussed.  In the meanwhile, could anybody figure a self-contained execution testcase based on comment #14?

Thanks!
Comment 17 Serge Belyshev 2007-09-27 08:01:35 UTC
(In reply to comment #16)
something like this (for gcc.target/i386):

/* { dg-do run } */
/* { dg-require-effective-target lp64 } */
/* { dg-options "-O2"} */

unsigned long a [100];

int main (void)
{
  unsigned long v = 0x500000000UL, c = 13, low, high;
  long j;

  a [0] = 0x200000000UL;
  a [1] = 0x300000000UL;

  for (j = 0; j < 100; j++)
    {
      asm ("mulq %3"
	   : "=a" (low), "=d" (high)
	   : "0" (a [j]), "rm" (v));
      
      asm ("addq %5,%q1\n\tadcq %3,%q0"
	   : "=r" (c), "=&r" (a [j])
	   : "0" (high), "rme" (0), "1" (low), "rme" (c));
    }

  if (a [0] != 13 || a [1] != 10 || a [2] != 15 || a [3] != 0)
    __builtin_abort ();
  return 0;
}
Comment 18 Jakub Jelinek 2007-09-27 08:12:01 UTC
Sure, no problem.

/* PR rtl-optimization/33552 */
/* { dg-do run } */
/* { dg-options "-O2" } */

extern void abort (void);

void
__attribute__((noinline))
foo (unsigned long *wp, unsigned long *up, long un, unsigned long *vp)
{
  long j;
  unsigned long prod_low, prod_high;
  unsigned long cy_dig;
  unsigned long v_limb;
  v_limb = vp[0];
  cy_dig = 64;
  for (j = un; j > 0; j--)
    {
      unsigned long u_limb, w_limb;
      u_limb = *up++;
      __asm__ (""
               : "=r" (prod_low), "=r" (prod_high)
               : "0" (u_limb), "1" (v_limb));
      __asm__ ("mov %5, %1; add %5, %0"
               : "=r" (cy_dig), "=&r" (w_limb)
               : "0" (prod_high), "rm" (0), "1" (prod_low), "rm" (cy_dig));
      *wp++ = w_limb;
    }
}

int
main (void)
{
  unsigned long wp[4];
  unsigned long up[4] = { 0x1248, 0x248a, 0x1745, 0x1853 };
  unsigned long vp = 0xdead;
  foo (wp, up, 4, &vp);
  if (wp[0] != 0x40 || wp[1] != 0xdeed || wp[2] != 0x1bd9a || wp[3] != 0x29c47)
    abort ();
  return 0;
}

For dg.target/i386, unless we add a bunch of target variants for the only
asm arch dependent string in there.
Comment 19 Michael Matz 2007-09-27 14:52:39 UTC
Have a patch on http://gcc.gnu.org/ml/gcc-patches/2007-09/msg01968.html .
Fixes also the reload failure on x86 -O2 -fPIC on this testcase (which hits
glibc):

/* { dg-do compile } */
/* { dg-options "-O2 -fPIC" } */
typedef unsigned int size_t;
 char *
__mempcpy_by2 (char *__dest, __const char *__src, size_t __srclen)
{
  register char *__tmp = __dest;
  register unsigned long int __d0, __d1;
  __asm__ __volatile__
    ("shrl      $1,%3\n\t"
     "jz        2f\n"
     "1:\n\t"
     "movl      (%2),%0\n\t"
     "leal      4(%2),%2\n\t"
     "movl      %0,(%1)\n\t"
     "leal      4(%1),%1\n\t"
     "decl      %3\n\t"
     "jnz       1b\n"
     "2:\n\t"
     "movw      (%2),%w0\n\t"
     "movw      %w0,(%1)"
     : "=&q" (__d0), "=r" (__tmp), "=&r" (__src), "=&r" (__d1),
       "=m" ( *(struct { __extension__ char __x[__srclen]; } *)__dest)
     : "1" (__tmp), "2" (__src), "3" (__srclen / 2),
       "m" ( *(struct { __extension__ char __x[__srclen]; } *)__src)
     : "cc");
  return __tmp + 2;
}
Comment 20 Paolo Bonzini 2007-09-27 15:01:29 UTC
Thanks.
Comment 21 Uroš Bizjak 2007-09-27 19:59:21 UTC
(In reply to comment #19)
> Have a patch on http://gcc.gnu.org/ml/gcc-patches/2007-09/msg01968.html .
> Fixes also the reload failure on x86 -O2 -fPIC on this testcase (which hits
> glibc):

Do we need a solution for "%." decorations of matched operands? strtoul() will not return 0 for "%0" operands, as described in Comment 11.
Comment 22 Paolo Bonzini 2007-09-28 10:04:15 UTC
I don't think so, as you don't know which input will match the output (i.e. whether the two inputs will be swapped) if you have a % constraint in the asm.
Comment 23 Michael Matz 2007-09-28 13:32:12 UTC
Subject: Bug 33552

Author: matz
Date: Fri Sep 28 13:31:50 2007
New Revision: 128864

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=128864
Log:
        PR rtl-optimization/33552
        * function.c (match_asm_constraints_1): Check for overlap in
        inputs and replace all occurences.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/function.c

Comment 24 Michael Matz 2007-09-28 13:33:26 UTC
Subject: Bug 33552

Author: matz
Date: Fri Sep 28 13:33:09 2007
New Revision: 128865

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=128865
Log:
        PR rtl-optimization/33552
        * gcc.target/i386/pr33552.c: New runtime test.
        * gcc.target/i386/strinline.c: New compile time test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr33552.c
    trunk/gcc/testsuite/gcc.target/i386/strinline.c
Modified:
    trunk/gcc/testsuite/ChangeLog

Comment 25 Michael Matz 2007-09-28 14:45:27 UTC
Fixed now.