11203 – source doesn't compile with -O0 but they compile with -O3

Status:	RESOLVED INVALID

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	inline-asm
Version:	3.2.3

Importance:	P2 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	ice-on-valid-code

Duplicates (9):	13410 13850 14090 17291 19549 20645 23743 25226 25853
Depends on:
Blocks:

Reported:	2003-06-16 07:07 UTC by Sergei Patchkov
Modified:	2014-02-16 10:01 UTC
CC List:	15 users

See Also:
Host:
Target:	i?86-*
Build:
Known to work:
Known to fail:
Last reconfirmed:	2004-12-02 02:19:24

Attachments
compressed archive containing two files (1.54 KB, application/octet-stream) 2003-06-16 07:13 UTC, Sergei Patchkov

Description Sergei Patchkov 2003-06-16 07:07:32 UTC

I have a small source file:

start --->

typedef struct {
    float real;
    float imag;
    } complex_t;

extern void
fft_asmb_3dnow (int k, complex_t * x, complex_t * wTB,
		const complex_t * d, const complex_t * d_3)
{
  register complex_t *x2k, *x3k, *x4k, *wB;
  {
    __asm__ __volatile__ ("movq	%4, %%mm0\n\t"
			  "movq	%5, %%mm1\n\t"
			  "movq	%%mm0, %%mm5\n\t"
			  "pfadd %%mm1, %%mm5\n\t"
			  "pxor	%%mm6, %%mm0\n\t"
			  "pxor	%%mm7, %%mm1\n\t"
			  "pfadd %%mm1, %%mm0\n\t"
			  "movq	%%mm0, %%mm4\n\t"
			  "pswapd %%mm4, %%mm4\n\t"
			  "movq	%6, %%mm0\n\t"
			  "movq	%7, %%mm2\n\t"
			  "movq	%%mm0, %%mm1\n\t"
			  "movq	%%mm2, %%mm3\n\t"
			  "pfadd %%mm5, %%mm0\n\t"
			  "pfadd %%mm4, %%mm2\n\t"
			  "movq	%%mm0, %0\n\t"
			  "pfsub %%mm5, %%mm1\n\t"
			  "movq	%%mm2, %3\n\t"
			  "pfsub %%mm4, %%mm3\n\t"
			  "movq	%%mm1, %1\n\t"
			  "movq	%%mm3, %2":"=m"(x[0]),
			  "=m"(x3k[0]),
			  "=m"(x2k[0]),
			  "=m"(x4k[0]):"m"(wTB[0]),
			  "m" (wTB[k * 2]),
			   "m" (x[0]),
			   "m" (x2k[0]):"memory");
  };
}

end ------->

If I pass the "-O0" option to gcc 3.2.3 or 3.3.1, the
compiler refuses with:
"error: can't find a register in class 'GENERAL_REGS' while reloading 'asm'"
but if I pass "-O3", the code compiles and works fine.

What is wrong: the compiler or my asm code?

Comment 1 Sergei Patchkov 2003-06-16 07:13:03 UTC

Created attachment 4229
compressed archive containing two files

The attachment is a compressed archive containing two files: err.log and
my source code.

Comment 2 Wolfgang Bangerth 2003-06-16 14:32:37 UTC

Confirmed with 3.2.x, 3.3 and mainline
W.

Comment 3 Nathanael C. Nerode 2003-07-09 04:28:07 UTC

Maybe neither.  *sigh*
i386 is infamously register-starved, and there may simply not be enough registers unless optimizations are used.

Comment 4 Steven Bosscher 2003-07-29 08:24:58 UTC

Any chance this might be related to bug 9929?  A patch for that is pending
(http://gcc.gnu.org/ml/gcc-patches/2003-07/msg02582.html), so someone could try...

Comment 5 Andrew Pinski 2003-12-16 22:59:03 UTC

*** Bug 13410 has been marked as a duplicate of this bug. ***

Comment 6 Andrew Pinski 2004-02-10 01:16:34 UTC

*** Bug 14090 has been marked as a duplicate of this bug. ***

Comment 7 Sergei Patchkov 2004-03-31 08:37:21 UTC

The current gcc (3.3.4 20040331) still can't compile the sample code with the command
$gcc -O0 -c regs_test.c
but it can with
$gcc -O0 -fnew-ra -c regs_test.c
and with
$gcc -Os -c regs_test.c

Comment 8 Andrew Pinski 2004-08-06 07:01:14 UTC

*** Bug 13850 has been marked as a duplicate of this bug. ***

Comment 9 Pawel Sikora 2004-08-15 11:00:42 UTC

Confirmed with 3.4.2-20040806 (-O0 works, -O[123] fails).

P.S. Building qemu-0.5.5 also fails.
 
pentium3-pld-linux-gcc -O2 -march=pentium3 --save-temps -fomit-frame-pointer 
-mpreferred-stack-boundary=2 -falign-functions=0 -fno-reorder-blocks 
-fno-optimize-sibling-calls -I. 
-I/home/users/pluto/rpm/BUILD/qemu-0.5.5/target-i386 
-I/home/users/pluto/rpm/BUILD/qemu-0.5.5 -D_GNU_SOURCE 
-c -o op.o /home/users/pluto/rpm/BUILD/qemu-0.5.5/target-i386/op.c 
/home/users/pluto/rpm/BUILD/qemu-0.5.5/target-i386/ops_template_mem.h: 
In function `op_rolb_kernel_T0_T1_cc': 
/home/users/pluto/rpm/BUILD/qemu-0.5.5/softmmu_header.h:179: 
error: can't find a register in class `GENERAL_REGS' while reloading `asm' 
 
 
static inline void glue(glue(st, SUFFIX), MEMSUFFIX)(void *ptr, RES_TYPE v) 
{ 
    asm volatile ("movl %0, %%edx\n"                 /* line 179 */
                  "movl %0, %%eax\n" 
                  "shrl %3, %%edx\n" 
                  "andl %4, %%eax\n" 
                  "andl %2, %%edx\n" 
                  "leal %5(%%edx, %%ebp), %%edx\n" 
                  "cmpl (%%edx), %%eax\n" 
                  "movl %0, %%eax\n" 
                  "je 1f\n" 
#if DATA_SIZE == 1 
                  "movzbl %b1, %%edx\n" 
#elif DATA_SIZE == 2 
                  "movzwl %w1, %%edx\n" 
#elif DATA_SIZE == 4 
                  "movl %1, %%edx\n" 
#else 
#error unsupported size 
#endif 
                  "pushl %6\n" 
                  "call %7\n" 
                  "popl %%eax\n" 
                  "jmp 2f\n" 
                  "1:\n" 
                  "addl 4(%%edx), %%eax\n" 
#if DATA_SIZE == 1 
                  "movb %b1, (%%eax)\n" 
#elif DATA_SIZE == 2 
                  "movw %w1, (%%eax)\n" 
#elif DATA_SIZE == 4 
                  "movl %1, (%%eax)\n" 
#else 
#error unsupported size 
#endif 
                  "2:\n" 
                  : 
                  : "r" (ptr), 
/* NOTE: 'q' would be needed as constraint, but we could not use it 
   with T1 ! */ 
                  "r" (v), 
                  "i" ((CPU_TLB_SIZE - 1) << 3), 
                  "i" (TARGET_PAGE_BITS - 3), 
                  "i" (TARGET_PAGE_MASK | (DATA_SIZE - 1)), 
                  "m" (*(uint32_t *)offsetof(CPUState, 
tlb_write[CPU_MEM_INDEX][0].address)), 
                  "i" (CPU_MEM_INDEX), 
                  "m" (*(uint8_t *)&glue(glue(__st, SUFFIX), MMUSUFFIX)) 
                  : "%eax", "%ecx", "%edx", "memory", "cc"); 
}

Comment 10 Andrew Pinski 2004-09-02 18:26:26 UTC

*** Bug 17291 has been marked as a duplicate of this bug. ***

Comment 11 stian 2005-01-01 17:15:42 UTC

Reference to other bug-reports:

http://bugs.gentoo.org/show_bug.cgi?id=71360

Comment 12 Andrew Pinski 2005-01-01 17:22:48 UTC

Why do people write inline-asm like this?
It is crazy to do so.  Split up the inline-asm correctly.
Anyone who writes inline-asm like this should get what they get.
For mmx inline-asm, you should be using the intrinsics instead, as suggested before,
or just write a real asm file.
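
For illustration only (not part of the original comment): a minimal sketch of what the intrinsics route can look like. It assumes SSE intrinsics from <xmmintrin.h> as a stand-in for the 3DNow! instructions in the report, and the function name butterfly2 is hypothetical. It covers only the add/subtract step of a butterfly, not the reporter's full routine.

```c
#include <assert.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics (assumption: used instead of 3DNow!) */
#endif

typedef struct {
    float real;
    float imag;
} complex_t;

/* For two consecutive complex values: sum = a + b, diff = a - b.
 * The compiler picks the registers itself, so there are no asm
 * operands for the register allocator to satisfy. */
static void butterfly2(const complex_t *a, const complex_t *b,
                       complex_t *sum, complex_t *diff)
{
    assert(a && b && sum && diff);
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(&a[0].real);   /* a[0].re, a[0].im, a[1].re, a[1].im */
    __m128 vb = _mm_loadu_ps(&b[0].real);
    _mm_storeu_ps(&sum[0].real, _mm_add_ps(va, vb));
    _mm_storeu_ps(&diff[0].real, _mm_sub_ps(va, vb));
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 2; i++) {
        sum[i].real = a[i].real + b[i].real;
        sum[i].imag = a[i].imag + b[i].imag;
        diff[i].real = a[i].real - b[i].real;
        diff[i].imag = a[i].imag - b[i].imag;
    }
#endif
}
```

Whether this is as fast as hand-written asm depends on the compiler version, which is exactly the objection raised in the reply below.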

Comment 13 michaelni 2005-01-01 18:57:10 UTC

(In reply to comment #12)
> Why do people write inline-asm like this?

why not? it's valid code and a compiler should compile valid code ...


> It is crazy to do so.  Split up the inline-asm correctly.

fix gcc first so it doesn't load & store more than needed between the split-up parts


> Anyone who writes like inline-asm should get what they get.
> For mmx inline-asm, you should be using the intrinsics instead as suggested before

let's see why it's not using intrinsics:
* it was written before intrinsics support was common
* intrinsics commonly fail or get miscompiled; it's so bad that some of the
altivec intrinsic code has been disabled in ffmpeg if standard gcc is detected.
There have also been very serious and similar problems in mplayer with
altivec intrinsics; sadly I can't provide more details as I don't have a ppc
* many if not most of the mplayer developers still use gcc 2.95 because gcc 3.*
is slower and needs more memory, and AFAIK 2.95 doesn't support intrinsics
* it is a lot of work to rewrite and debug it just to make it compilable with
gcc -O0


> or just write real asm file.

that's not a good idea either, as:
* it's slower due to the additional call/ret/parameter passing
* there are some symbol name mangling issues on some obscure systems (see the
mplayer-dev or cvslog mailing list; it was discussed there a long time ago)

Comment 14 Steven Bosscher 2005-01-01 22:50:12 UTC

You've just constrained the compiler too much to do anything.  You're right 
that gcc should produce fewer loads and stores sometimes, but in this case 
I suggest you show that this actually hurts you still with GCC 4.0, I would 
hope it does better.  In any case, just because code is syntactically "valid" 
GNU C doesn't mean gcc can always compile it.  With this kind of inline asm, 
you're bound to confuse the register allocator.  The fact that it works at O3 
is pure luck and not a bug.  Note that you're hitting an *error*, not an ICE. 
It is a deliberate choice to inform you that GCC cannot compile your inline 
assembly.  Bad luck for you.

Comment 15 Steven Bosscher 2005-01-01 23:05:23 UTC

I will note for the record that disabling local-alloc will resolve 
this problem.  A patch for that is in the audit trail of another bug, 
for unrelated reasons: http://gcc.gnu.org/PR13776.  It also happens 
to fix the particular problem in this bug report.

Comment 16 Andrew Pinski 2005-01-20 21:04:12 UTC

*** Bug 19549 has been marked as a duplicate of this bug. ***

Comment 17 Martin Drab 2005-01-21 12:39:18 UTC

(In reply to comment #15)
> I will note for the record that disabling local-alloc will resolve 
> this problem.  A patch for that is in the audit trail of another bug, 
> for unrelated reasons: http://gcc.gnu.org/PR13776.  It also happens 
> to fix the particular problem in this bug report. 

I didn't test the source from this bug report, but the patch mentioned
above (disabling local-alloc) DOES NOT resolve the problem with the test code
from bug report http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19549. Consequently,
it also doesn't fix the problem of compiling ffmpeg's
libavcodec/i386/dsputil_mmx.c, which is the original from which the
test code was extracted/simplified. So either it is not the same bug (as marked
by Andrew) or the problem was not resolved. And IMHO this should be perfectly
valid, since the operands to the asm construct are all marked as "m" (!!!),
so no registers should be needed for that! They are just memory operands!! And
so I think this bug (or at least
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19549) should NOT be marked as resolved.

Comment 18 Falk Hueffner 2005-01-21 13:55:46 UTC

(In reply to comment #17)
> And IMHO this should be perfectly
> valid, since the operands to the asm construction are all marked as "m" (!!!),
> so no registers should be needed for that!

Huh? The memory operands are not at a compile-time constant address, so of course
you need a register to hold them. Of course, you need only one register for
all of them, but you explicitly prevented gcc from discovering that by specifying
-O0.
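
For illustration only (not part of the original comment): since each memory operand at a run-time address needs its address in a register while the asm executes, one way to cap the number of registers needed at once is to give each asm statement fewer operands, as suggested in comment 12. The sketch below is an assumption-laden toy: the function name mark4 is hypothetical, and it stores byte constants instead of doing MMX work. The compiler may reload addresses between the two statements, which is the cost comment 13 objects to.

```c
#include <assert.h>

/* Writes 1, 2, 3, 4 through four pointers.  Each asm statement has only
 * two memory operands, so at most two address registers are live per
 * statement instead of four. */
static void mark4(unsigned char *p0, unsigned char *p1,
                  unsigned char *p2, unsigned char *p3)
{
    assert(p0 && p1 && p2 && p3);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__("movb $1, %0\n\t"
                         "movb $2, %1"
                         : "=m"(*p0), "=m"(*p1));
    __asm__ __volatile__("movb $3, %0\n\t"
                         "movb $4, %1"
                         : "=m"(*p2), "=m"(*p3));
#else
    /* Portable fallback for non-x86 targets. */
    *p0 = 1; *p1 = 2; *p2 = 3; *p3 = 4;
#endif
}
```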

Comment 19 Martin Drab 2005-01-21 14:10:03 UTC

(In reply to comment #18)
> Huh? The memory operands are not at a compile time constant address, so of course
> you need a register to hold them. Of course, you need only one register for
> all of them, but you explicitly prevented gcc from discovering that by specifying
> -O0.

Sure, one, sorry. But problem is the Bug 19549 testcode doesn't compile AT ALL.
I.e., not only with -O0, but also with -O1, -O2, or -O3. It simply doesn't
compile under any circumstances.

Comment 20 Andrew Pinski 2005-01-21 15:15:18 UTC

*** Bug 19549 has been marked as a duplicate of this bug. ***

Comment 21 Martin Drab 2005-01-21 15:49:05 UTC

OK, sorry, the Bug 19549 testcode passes with -O1 and above, but the original,
that it was stripped from (maybe too much stripped) doesn't:

-- test2.c -------------------------------------
extern const unsigned char ff_h263_loop_filter_strength[32];
static const unsigned long long ff_pb_FC __attribute__((used)) __attribute__
((aligned(8))) = 0xFCFCFCFCFCFCFCFCULL;
void h263_h_loop_filter_mmx(unsigned char *src, int stride, int qscale){
    const int strength= ff_h263_loop_filter_strength[qscale];
    unsigned long long temp[4] __attribute__ ((aligned(8)));
    unsigned char *btemp= (unsigned char *)temp;
    src -= 2;
    asm volatile(""
        : "+m" (temp[0]),
          "+m" (temp[1]),
          "+m" (temp[2]),
          "+m" (temp[3])
        : "g" (2*strength), "m"(ff_pb_FC)
    );
    asm volatile(""
        : "=m" (*(unsigned int*)(src + 0*stride)),
          "=m" (*(unsigned int*)(src + 1*stride)),
          "=m" (*(unsigned int*)(src + 2*stride)),
          "=m" (*(unsigned int*)(src + 3*stride)),
          "=m" (*(unsigned int*)(src + 4*stride)),
          "=m" (*(unsigned int*)(src + 5*stride)),
          "=m" (*(unsigned int*)(src + 6*stride)),
          "=m" (*(unsigned int*)(src + 7*stride))
    );
}
------------------------------------------------

Or do you consider this also invalid?

Comment 22 Falk Hueffner 2005-01-21 16:33:32 UTC

(In reply to comment #21)

> Or do you consider this also invalid?

It doesn't seem invalid to me. But it is basically impossible to write the
register allocator such that it finds a register allocation for every situation
where it's theoretically possible. So this is unlikely to get fixed in a
reliable way.

Comment 23 Martin Drab 2005-01-21 16:48:14 UTC

(In reply to comment #22)
> It doesn't seem invalid to me. But it is basically impossible to write the
> register allocator such that it finds a register allocation for every situation
> where it's theoretically possible. So this is unlikely to get fixed in a
> reliable way.

OK, I think I have fixed the code in ffmpeg to help gcc with the compilation a bit
(I hope the fix will be accepted). So consider the above code as one more test
case, in case the problem gets resolved at some point.

Comment 24 Steven Bosscher 2005-01-22 12:14:27 UTC

Martin, you should realize that this problem *cannot* be solved.  Yes, 
there will perhaps be a time when this particular test case compiles, 
though I think that is unlikely.  But anyway, then there will be other 
cases that fail. 
 
The reason is dead simple: register allocation is NP-complete, so it 
is even *theoretically* not possible to write register allocators that 
always find a coloring.  That means any register allocator will always 
fail on some very constrained asm input.  And you cannot allow it to 
run indefinitely until a coloring is found, because then you've turned 
the graph coloring problem into the halting problem because you can't 
prove that a coloring exists and that the register allocator algorithm 
will terminate. 
 
So really it doesn't matter at all whether or not your specific inline 
asm compiles or not.  When yours does, someone else's will fail.

Comment 25 stian 2005-01-22 15:58:53 UTC

What if you resolve all memory references into temporary variables, e.g.
void *a=(src + 0*stride)
and use those instead? Doesn't that lessen the stress put on the register
allocator?

Comment 26 michaelni 2005-01-22 17:10:11 UTC

(In reply to comment #14)
> In any case, just because code is syntactically "valid" 
> GNU C doesn't mean gcc can always compile it.  With this kind of inline asm, 
> you're bound to confuse the register allocator.  The fact that it works at O3 
> is pure luck and not a bug.  

well, you are the gcc developers, so there's not much arguing about what you
consider valid, but last time I checked, the docs did not mention that asm
statements may fail to compile at random. IMO, as long as this is not clearly
stated in the docs, this bug report really shouldn't be marked as invalid. Say you
don't want to fix it, say it would be too complicated to fix, or whatever, but it's
not invalid.


> Note that you're hitting an *error*, not an ICE. 

no, at least one of the bug reports marked as a duplicate of this one ends in an ICE



(In reply to comment #24)
> Martin, you should realize that this problem *cannot* be solved. Yes, 
> there will perhaps be a time when this particular test case compiles, 
> though I think that is unlikely.  But anyway, then there will be other 
> cases that fail. 

hmm, so the problem cannot be solved, but then maybe it will be solved, but this
doesn't count because there will be other unrelated bugs? I can't follow this
reasoning. Or do you mean that you can never solve all bugs, and so there's no need to
fix any single one?


>  
> The reason is dead simple: register allocation is NP-complete, so it 
> is even *theoretically* not possible to write register allocators that 
> always find a coloring. 

register allocation in general is NP-complete, yes, but you seem to forget that
this is about finding the optimal solution, while gcc fails to find any solution.
In practice that is a matter of assigning the registers beginning from the most
constrained operands to the least, and copying a few things onto the stack if gcc
can't figure out how to access them. Sure, this method might fail in 0.001% of the
practical cases and need a 2nd or 3rd pass where it tries different registers.
It might also happen that in some intentionally overconstrained cases it ends up
searching all 5040 possible assignments of 7 registers onto 7 non-memory
operands, but still it won't fail.

> That means any register allocator will always 
> fail on some very constrained asm input.

now that statement is just false, not to mention irrelevant, as none of these asm
statements are unreasonably constrained


>  And you cannot allow it to 
> run indefinitely until a coloring is found, because then you've turned 
> the graph coloring problem into the halting problem because you can't 
> prove that a coloring exists and that the register allocator algorithm 
> will terminate. 

this is ridiculous: the number of possible colorings is finite, so you can always try
them all in finite time
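
For illustration only (not part of the original comment): the "try them all" argument can be sketched as an exhaustive search over every assignment of k registers to n values. This is a toy model, not how GCC's allocator works: the names find_coloring and MAX_NODES are hypothetical, and it ignores spilling, register classes and addressing modes. Its only point is that the loop visits at most k^n assignments and therefore always terminates.

```c
#include <assert.h>

#define MAX_NODES 16

/* adj[i][j] != 0 means values i and j are live at the same time and
 * must not share a register.  Tries every assignment of k colors
 * (registers) to n nodes; returns 1 and fills color[] on success,
 * 0 if no valid coloring exists. */
static int find_coloring(int n, int k,
                         const unsigned char adj[][MAX_NODES], int color[])
{
    assert(n >= 0 && n <= MAX_NODES && k >= 0);
    if (n > 0 && k == 0)
        return 0;
    for (int i = 0; i < n; i++)
        color[i] = 0;

    for (;;) {
        /* Check the current assignment for conflicts. */
        int ok = 1;
        for (int i = 0; i < n && ok; i++)
            for (int j = i + 1; j < n; j++)
                if (adj[i][j] && color[i] == color[j]) {
                    ok = 0;
                    break;
                }
        if (ok)
            return 1;

        /* Advance to the next assignment like an odometer in base k. */
        int pos = 0;
        while (pos < n && ++color[pos] == k) {
            color[pos] = 0;
            pos++;
        }
        if (pos == n)
            return 0;   /* wrapped around: all k^n assignments were tried */
    }
}
```

The running time is exponential in n, which is the sense in which the problem is hard; for a handful of asm operands and seven registers the search space is still small.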

Comment 27 Daniel Berlin 2005-01-22 17:21:02 UTC

Subject: Re:  source doesn't compile with -O0 but they
 compile with -O3



>
>
>>
>> The reason is dead simple: register allocation is NP-complete, so it
>> is even *theoretically* not possible to write register allocators that
>> always find a coloring.
>
> register allocation in general is NP-complete, yes, but you seem to forget that
> this is about finding the optimal solution, while gcc fails to find any solution.
> In practice that is a matter of assigning the registers beginning from the most
> constrained operands to the least, and copying a few things onto the stack if gcc
> can't figure out how to access them. Sure, this method might fail in 0.001% of the
> practical cases and need a 2nd or 3rd pass where it tries different registers.
> It might also happen that in some intentionally overconstrained cases it ends up
> searching all 5040 possible assignments of 7 registers onto 7 non-memory
> operands, but still it won't fail.

Just to also point out, it doesn't appear to be NP-complete for register
interference graphs, because they all seem to be 1-perfect.
Various papers have observed this, and I've actually compiled all of gcc,
libstdc++, etc., and every package ever on my computer, and not once has a
single non-1-perfect interference graph
occurred [my compiler would abort if one did].

On 1-perfect graphs you can solve this problem in O(time it takes to 
determine the max clique), and there already exists a polynomial time 
algorithm for max-clique on perfect graphs.



  >
>> That means any register allocator will always
>> fail on some very constrained asm input.
>
> now that statement is just false, not to mention irrelevant, as none of these asm
> statements are unreasonably constrained

You are correct, NP-completeness does not imply impossibility.
There are only a finite number of possibilities.
>
>
>>  And you cannot allow it to
>> run indefinitely until a coloring is found, because then you've turned
>> the graph coloring problem into the halting problem because you can't
>> prove that a coloring exists and that the register allocator algorithm
>> will terminate.
>
> this is ridiculous: the number of possible colorings is finite, so you can always try
> them all in finite time

You are right, he is wrong.

Comment 28 Sergei Patchkov 2005-01-24 06:45:43 UTC

Subject: Re:  source doesn't compile with -O0 but they
 compile with -O3

Yeah, fine battle!

Comment 29 Andrew Pinski 2005-03-26 00:29:27 UTC

*** Bug 20645 has been marked as a duplicate of this bug. ***

Comment 30 Andrew Pinski 2005-09-05 22:20:00 UTC

*** Bug 23743 has been marked as a duplicate of this bug. ***

Comment 31 Andrew Pinski 2005-12-02 17:44:50 UTC

*** Bug 25226 has been marked as a duplicate of this bug. ***

Comment 32 Andrew Pinski 2005-12-02 17:46:29 UTC

*** Bug 25221 has been marked as a duplicate of this bug. ***

Comment 33 Andrew Pinski 2006-01-19 12:38:03 UTC

*** Bug 25853 has been marked as a duplicate of this bug. ***

Comment 34 Stephan Diestelhorst 2006-04-21 15:56:08 UTC

> The reason is dead simple: register allocation is NP-complete, so it 
> is even *theoretically* not possible to write register allocators that 
> always find a coloring.

Not at all. If a problem is NP-hard, you can in fact solve it! It is just quite likely that your algorithm takes exponentially many steps in the size of the problem. Given the few registers of x86, that might turn out not to be a problem.

> That means any register allocator will always 
> fail on some very constrained asm input.  And you cannot allow it to 
> run indefinitely until a coloring is found, because then you've turned 
> the graph coloring problem into the halting problem because you can't 
> prove that a coloring exists and that the register allocator algorithm 
> will terminate. 

Not necessarily. The coloring problem is decidable (just enumerate all the colorings, a.k.a. register mappings), whereas the halting problem is not decidable (it is only semi-decidable, if you're interested in that).
 
> So really it doesn't matter at all whether or not your specific inline 
> asm compiles or not.  When yours does, someone else's will fail. 

Nope.

Comment 35 Stephan Diestelhorst 2006-04-21 15:59:04 UTC

(In reply to comment #34)
Sorry for the spam. Didn't read up to the end. Have been quite angry with the whole situation....

Comment 36 Trent Piepho 2006-11-08 20:03:00 UTC

(In reply to comment #21)
>     asm volatile(""
>         : "=m" (*(unsigned int*)(src + 0*stride)),
>           "=m" (*(unsigned int*)(src + 1*stride)),
>           "=m" (*(unsigned int*)(src + 2*stride)),
>           "=m" (*(unsigned int*)(src + 3*stride)),
>           "=m" (*(unsigned int*)(src + 4*stride)),
>           "=m" (*(unsigned int*)(src + 5*stride)),
>           "=m" (*(unsigned int*)(src + 6*stride)),
>           "=m" (*(unsigned int*)(src + 7*stride))
>     );

(In reply to comment #26)
> it might also happen that in some intentionally overconstrained cases it ends up
> searching the whole 5040 possible assignments of 7 registers onto 7 non memory
> operands but still it wont fail

The example Martin gave has *8* operands.  You can try every possible direct mapping of those 8 addresses to just 7 registers, but they will obviously all fail.  Except with ia32 addressing modes it _can_ be done, and with only 4 registers.

reg1 = src, reg2 = stride, reg3 = src+stride*3, reg4 = src+stride*6
Then the 8 memory operands are:
(reg1), (reg1,reg2,1), (reg1,reg2,2), (reg3),
(reg1,reg2,4), (reg3,reg2,2), (reg4), (reg3,reg2,4)
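
For illustration only (not part of the original comment): the four-register scheme above can be written out by passing the four values as "r" operands and spelling the addressing modes in the template, with a "memory" clobber standing in for the eight "=m" outputs. This is a sketch under assumptions: the function name mark_rows is hypothetical, and it stores the row index as a byte instead of doing the real filter work.

```c
#include <assert.h>
#include <stdint.h>

/* Writes the row index (0..7) into the first byte of each of 8 rows,
 * using only four registers for the eight addresses. */
static void mark_rows(unsigned char *src, intptr_t stride)
{
    assert(src != 0);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned char *row3 = src + 3 * stride;   /* reg3 = src+stride*3 */
    unsigned char *row6 = src + 6 * stride;   /* reg4 = src+stride*6 */
    __asm__ __volatile__(
        "movb $0, (%0)\n\t"        /* src + 0*stride          */
        "movb $1, (%0,%1,1)\n\t"   /* src + 1*stride          */
        "movb $2, (%0,%1,2)\n\t"   /* src + 2*stride          */
        "movb $3, (%2)\n\t"        /* row3                    */
        "movb $4, (%0,%1,4)\n\t"   /* src + 4*stride          */
        "movb $5, (%2,%1,2)\n\t"   /* row3 + 2*stride = row 5 */
        "movb $6, (%3)\n\t"        /* row6                    */
        "movb $7, (%2,%1,4)"       /* row3 + 4*stride = row 7 */
        : /* no outputs: the stores are covered by the "memory" clobber */
        : "r"(src), "r"(stride), "r"(row3), "r"(row6)
        : "memory");
#else
    /* Portable fallback for non-x86 targets. */
    for (int i = 0; i < 8; i++)
        src[i * stride] = (unsigned char)i;
#endif
}
```

The trade-off is that the "memory" clobber tells the compiler far less than eight precise "=m" operands would, so it has to assume any memory may have changed.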

When one considers all the addressing modes, there are not just 7 possible registers, but (I think) 261 possible addresses.  There are not just 5040 possibilities as Michael said, but over 76 x 10^15 possible ways of assigning these addresses to 7 operands!  Then each register can be loaded not just with an address but with some sub-expression too, like how I loaded reg2 with stride.
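
For illustration only (not part of the original comment): both counts quoted here are ordered selections without repetition, n * (n-1) * ... * (n-k+1). With n = 7 and k = 7 this gives the 5040 from comment 26; with the estimated 261 addresses and k = 7 it gives roughly 7.6 x 10^16, i.e. the "76 x 10^15" above. The helper name falling_factorial is hypothetical, and the figure 261 is taken from the comment as given.

```c
#include <assert.h>

/* n * (n-1) * ... * (n-k+1): the number of ordered ways to pick k
 * distinct items out of n.  No overflow check; the values used in
 * this thread fit comfortably in 64 bits. */
static unsigned long long falling_factorial(unsigned n, unsigned k)
{
    assert(k <= n);
    unsigned long long result = 1;
    for (unsigned i = 0; i < k; i++)
        result *= (unsigned long long)(n - i);
    return result;
}
```

Calling falling_factorial(7, 7) reproduces 5040, and falling_factorial(261, 7) reproduces the larger estimate.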

Even for ia32, which makes up for its limited number of registers with complex addressing modes, finding a register allocation that satisfies an asm statement is not something that can always be done in reasonable time.  If the number of operands is <= the number of available registers, gcc should always be able to find an allocation (_an_ allocation, not the best allocation), but it doesn't.

Comment 37 michaelni 2006-11-08 20:45:04 UTC

(In reply to comment #36)
> (In reply to comment #21)
> >     asm volatile(""
> >         : "=m" (*(unsigned int*)(src + 0*stride)),
> >           "=m" (*(unsigned int*)(src + 1*stride)),
> >           "=m" (*(unsigned int*)(src + 2*stride)),
> >           "=m" (*(unsigned int*)(src + 3*stride)),
> >           "=m" (*(unsigned int*)(src + 4*stride)),
> >           "=m" (*(unsigned int*)(src + 5*stride)),
> >           "=m" (*(unsigned int*)(src + 6*stride)),
> >           "=m" (*(unsigned int*)(src + 7*stride))
> >     );
> 
> (In reply to comment #26)
> > it might also happen that in some intentionally overconstrained cases it ends up
> > searching the whole 5040 possible assignments of 7 registers onto 7 non memory
> > operands but still it wont fail
> 
> The example Martin gave has *8* operands.  You can try every possible direct
> mapping of those 8 addresses to just 7 registers, but they will obviously all
> fail.  Except with ia32 addressing modes it _can_ be done, and with only 4
> registers.
> 
> reg1 = src, reg2 = stride, reg3 = src+stride*3, reg4 = src+stride*6
> Then the 8 memory operands are:
> (reg1), (reg1,reg2,1), (reg1,reg2,2), (reg3),
> (reg1,reg2,4), (reg3,reg2,2), (reg4), (reg3,reg2,4)
> 
> When one considers all the addressing modes, there are not just 7 possible
> registers, but (I think) 261 possible addresses.  There are not just 5040
> possibilities as Michael said, but over 76 x 10^15 possible ways of assigning
> these addresses to 7 operands!  Then each register can be loaded not just with
> an address but with some sub-expression too, like how I loaded reg2 with
> stride.

"m" operands and variations can be copied onto the stack and accessed from there, so no matter how many memory operands there are they can always be accessed over esp on ia32, so whatever you did calculate it is meaningless

now if there is an unwritten rule that "m" operands and variations of them cannot be copied anywhere, then it would be very desirable to have an asm constraint like "m" without this restriction; this would resolve this and several other bugs.
also, it would be very nice if such a don't-copy restriction on "m", if it does exist, could be documented

Comment 38 Trent Piepho 2007-02-27 19:36:03 UTC

(In reply to comment #37)
> now if there is a unwritten rule that "m" operands and variations of them
> cannot be copied anywhere, then it would be very desireable to have a asm
> constraint like "m" without this restriction this would resolve this and
> several other bugs
> also it would be very nice if such a dont copy restriction on "m" if it does
> exist could be documented

Copying "m" operands onto the stack might not be such a great thing to wish for.  Imagine if you used asm("movaps %%xmm0, %0": "=m"(x[i]));  If x[i] is only 32 bits wide and gcc copied it onto the stack, then writing 16 bytes with movaps wouldn't also write to x[i+1] through x[i+3] as intended.  I know there is plenty of asm code in ffmpeg that overwrites or overreads memory operands and would fail if gcc tried to move them onto the stack.  There is also alignment: movaps requires an aligned address, and maybe you have chosen x and i in such a way that it will be aligned.  But when gcc copies the value onto the stack, how is it supposed to know what alignment it needs?

Comment 39 michaelni 2007-02-27 22:50:12 UTC

(In reply to comment #38)
> (In reply to comment #37)
> > now if there is a unwritten rule that "m" operands and variations of them
> > cannot be copied anywhere, then it would be very desireable to have a asm
> > constraint like "m" without this restriction this would resolve this and
> > several other bugs
> > also it would be very nice if such a dont copy restriction on "m" if it does
> > exist could be documented
> 
> Copying "m" operands onto the stack might not be such a great thing to wish
> for.  Imagine if you used asm("movaps %xmm0, %0": "=m"(x[i]));  If x[i] is only
> 32-bits, and gcc copied it onto the stack, then writing 16 bytes with movaps
> wouldn't also write to x[i+1] to x[i+3] as intended.  I know there is a plenty
> of asm code in ffmpeg that overwrites or overreads memory operands and will
> fail if gcc tried to move them onto the stack.  There is also alignment. 
> movaps requires an aligned address, and maybe you have chosen x and i in such a
> way that it will be aligned.  But when gcc copies the value onto the stack, how
> is it supposed to know what alignment it needs?
 
well, the data type used in "m"() must of course be correct, which here means a 128-bit type. Alignment can be handled like with all other types: double also gets aligned if the architecture needs it, so a uint128_t or sse128 or whatever can be as well. The example you show is a fairly obscure special case with respect to moving "m" to the stack. In the end there's a need for an "m"-like constraint which must not be movable and an "m"-like constraint which may be movable (to the stack, for example); the exact letters used are irrelevant.
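
For illustration only (not part of the original comment): a sketch of giving the "m" operand a type that covers the full 16 bytes the instruction writes, so the compiler knows the real extent of the store. The type vec4f and the function zero4 are hypothetical names, and the sketch uses the unaligned movups rather than movaps so that it makes no alignment assumption.

```c
#include <assert.h>

/* A 16-byte type, so the "=m" operand describes everything the
 * 16-byte store touches instead of a single 32-bit element. */
typedef struct {
    float v[4];
} vec4f;

/* Stores four zero floats at *dst with one 16-byte SSE store. */
static void zero4(vec4f *dst)
{
    assert(dst != 0);
#if defined(__GNUC__) && defined(__SSE__) && \
    (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__(
        "xorps %%xmm0, %%xmm0\n\t"   /* xmm0 = 0 */
        "movups %%xmm0, %0"          /* unaligned 16-byte store */
        : "=m"(*dst)
        :
        : "xmm0");
#else
    /* Portable fallback for targets without SSE. */
    for (int i = 0; i < 4; i++)
        dst->v[i] = 0.0f;
#endif
}
```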

Comment 40 astrange+gcc@gmail.com 2009-10-18 19:56:24 UTC

Linked from http://x264dev.multimedia.cx/?p=185, I'd forgotten all about the ridiculous flamewar in this one.

Just as a note, the actual definitions of the four variables (from liba52):
  x2k = x + 2 * k;
  x3k = x2k + 2 * k;
  x4k = x3k + 2 * k;
  wB = wTB + 2 * k;

Also, the asm inputs are silly - output 0 is the same as input 6 for no reason, and the same with output 2 and input 7. So change those to "+m" and change %6/%7 to %0/%2.
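
For illustration only (not part of the original comment): the "+m" form marks one operand as both read and written, instead of listing the same location once as an "=m" output and again as an "m" input. The sketch below is a toy with a hypothetical function name, increment; it shows the constraint syntax, not the FFT code.

```c
#include <assert.h>

/* Adds 1 to *counter in place.  "+m" says the asm both reads and
 * writes the operand, so it is listed only once and referred to
 * as %0 throughout the template. */
static void increment(int *counter)
{
    assert(counter != 0);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__("addl $1, %0"
                         : "+m"(*counter)
                         :
                         : "cc");
#else
    /* Portable fallback for non-x86 targets. */
    *counter += 1;
#endif
}
```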

That doesn't actually change anything, even though it should free two registers. It works with gcc 4.5 -O0 -fno-pic -fomit-frame-pointer, but not without one of those flags. Looks like that's because it's allocating 2 more registers for the unused fake inputs for the "+m" - change it to "=m" and it works with one flag removed, but still not both. So there's a specific bug.

And of course it all works at -O1 because it doesn't have to use registers there. So maybe it should just do that.
