Bug 39847 - 16 symbolic register names generates error: more than 30 operands in 'asm'
Status: RESOLVED WONTFIX
Alias: None
Product: gcc
Classification: Unclassified
Component: inline-asm
Version: 4.3.2
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2009-04-22 13:47 UTC by David Taht
Modified: 2009-04-23 11:08 UTC

Host: x86_64-linux-gnu
Target: x86_64-linux-gnu
Build: x86_64-linux-gnu

Attachments
Test program demonstrating 16 register breakage on inline asm in x86_64 (1.17 KB, text/plain)
2009-04-22 13:49 UTC, David Taht

Description David Taht 2009-04-22 13:47:55 UTC
Using more than 15 operands in inline assembly of the symbolic form:

  "... lots of assembly elided ..." \
  "movaps %[inner_filter_index], %[icoeff_l2]\n\t"			\
  :[icoeff_l2] "+x" (icoeff_l2),					\
   [inner_filter_index] "+x" (inner_filter_index),			\
   ... 13 more operands ...

blows up at 16 register names, which is well below the 30 named in the error message; either limit is too low for modern vector architectures such as x86_64, VMX, Larrabee, etc.
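
For readers unfamiliar with the syntax, here is a minimal sketch of a complete statement using symbolic operand names. It is not the reporter's code: the function name `copy_via_xmm` and the operands `src` and `dst` are hypothetical, and the asm is only compiled on x86-64 with a GNU-compatible compiler (plain C is used elsewhere).

```c
/* Copies src into dst through SSE registers using symbolic operand names.
   Falls back to plain C where x86-64 GNU-style inline asm is unavailable. */
static float copy_via_xmm(float src)
{
    float dst = 0.0f;
#if defined(__x86_64__) && defined(__GNUC__)
    /* AT&T syntax: source first, destination second. */
    __asm__ ("movaps %[src], %[dst]"
             : [dst] "+x" (dst)      /* read-write operand in an xmm register */
             : [src] "x" (src));     /* input-only operand in an xmm register */
#else
    dst = src;                       /* portable fallback */
#endif
    return dst;
}
```

Each bracketed name in the template refers to the operand declared with the same name, so the operand list can be reordered without renumbering the template.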

Test program to come...
Comment 1 David Taht 2009-04-22 13:49:59 UTC
Created attachment 17672
Test program demonstrating 16 register breakage on inline asm in x86_64
Comment 2 David Taht 2009-04-22 14:04:01 UTC
Tested against gcc (Ubuntu 4.3.2-1ubuntu12) 4.3.2: fails.
Comment 3 Jakub Jelinek 2009-04-22 14:30:25 UTC
"+x" counts as 2 operands, not just one (one input and one output).
Comment 4 David Taht 2009-04-22 14:58:05 UTC
"+" counts as two operands? ok, that makes sense. So, basically 2*num_of_physical_regs would be a saner default ... 
Comment 5 Andrew Pinski 2009-04-22 15:00:45 UTC
Actually no, it would be better if you moved over to using the intrinsics, which can be optimized and scheduled.
Comment 6 David Taht 2009-04-22 15:40:13 UTC
Pinskia: Actually, no. I started with the intrinsics and looked hard at what the code scheduler was doing before settling on rewriting this in inline assembly. 

The intrinsics have several problems that affect the code quality in this case.

1) They don't issue a request from memory for many instructions, such as cvtps2pd. Doing one-liners for stuff like that is feasible but even harder to understand and debug than pure assembly.  Gcc also seems to have a misguided sense of how many clocks cvtX2Y instructions take.

2) The combination of intrinsics, C, and assembly gcc was generating included a lot of extra instructions: promoting ints to longs, leas, etc. 

3) The optimizer tends to push prefetches to the end of the loop when they really need to happen as early as possible. This particular bit of code *might* benefit from prefetching (it is not a very predictable access pattern), but at the end of the loop prefetches hurt more than they help.

4) This code is right up against the edge of the x86_64 register set (all the xmm registers (for 8 channel resampling) and 7 integer registers).

5) You can't use push/pop across multiple bits of inline assembly.

Yes, it would be nice if gcc did a better job on it...

I can show you oprofiles of the gcc generated code, but the larger point remains that doing complex vectorized operations tends to use up a lot of registers and doing it well requires hand optimized assembly... and to do that well, it would be helpful to have as many named parameters available as in the register set.

Comment 7 Andrew Pinski 2009-04-22 15:45:43 UTC
(In reply to comment #6)
> Pinskia: Actually, no. I started with the intrinsics and looked hard at what the
> code scheduler was doing before settling on rewriting this in inline assembly. 
> 
> The intrinsics have several problems that affect the code quality in this case.
> 
> 1) They don't issue a request from memory for many instructions, such as
> cvtps2pd. Doing one-liners for stuff like that is feasible but even harder to
> understand and debug than pure assembly.  Gcc also seems to have a misguided
> sense of how many clocks cvtX2Y instructions take.

Are you using the correct -mtune= value for the processor you are tuning for?  Different processors have different cycle counts.  If you have an issue with the optimizers, I would rather see bugs filed for that than have you work around it with inline asm.

> 
> 2) The combination of intrinsics, C, and assembly gcc was generating included a
> lot of extra instructions, promoting ints to longs, leas, etc. 

Int-to-long promotion is normal and a different issue, and really you should have filed a bug for this one.

> 
> 3) The optimizer tends to push prefetches to the end of the loop when it really
> needs to happen as early as possible. This particular bit of code *might*
> benefit from prefetching (it is not a very predictable access pattern) but at
> the end of the loop prefetches hurt more than they help.

File a bug.

> 
> 4) this code is right up against the edge of the x86_64 register set (all the
> xmm registers (for 8 channel resampling) and 7 integer registers) 

Try 4.4.0, which was just released; it has a better register allocator.

> I can show you oprofiles of the gcc generated code, but the larger point
> remains that doing complex vectorized operations tends to use up a lot of
> registers and doing it well requires hand optimized assembly... and to do that
> well, it would be helpful to have as many named parameters available as in the
> register set.

No, GCC should be doing a better job with the intrinsics, which is much better than you doing it manually in inline asm.  Inline asm should be used when there are no intrinsics for the instruction, or for something you really cannot do using intrinsics.
Comment 8 Andrew Haley 2009-04-22 16:53:39 UTC
I don't see why this is changed to WONTFIX.  Fixing inline asm to allow the use of all a machine's registers is trivial, and should not be refused for the sake of a pedantic argument about whether someone should be using asm.  gcc is a professional tool whose users are capable of deciding for themselves if they need to use asm.  It makes no sense at all arbitrarily to restrict the number of operands.

Comment 9 David Taht 2009-04-22 17:24:19 UTC
Pinskia:

It is going to take me a long time to address these issues piecemeal, so...

0) I will build gcc-4.4 and try that. I will also make the 1 line patch to it to try increasing the number of asm params, and try that. I would prefer that someone with more guts inside the guts of gcc do the latter, I fear I would rapidly end up over my head. Is it a magic number or just a stupid default?

re 1) I am using -mtune=core2 -O3 which is correct. 

I note that, looking at the generated code today without those flags and with plain -O2, the non-SSE version (just doubles) generates the following code sequence for

left [0] += icoeff * filter->buffer [data_index];
left [1] += icoeff * filter->buffer [data_index+1];

 - where left[0] and icoeff are doubles, filter->buffer[data_index] is a float

movss  (%r11),%xmm0
cvtps2pd %xmm0,%xmm0; cvtss2sd would be more correct and faster on most x86_64 arches prior to the k10 and core2.
... mult and add elided, second line elided ... 

(-O3 -mtune will do a cvtss2sd (%r9), %xmm0 which is better)
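
For reference, the two statements can be wrapped in a self-contained function. This is a sketch, assuming a plain `const float *` in place of `filter->buffer`; the name `accumulate_pair` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulate two float samples, scaled by a double coefficient, into
   double sums. Each float is converted to double before the multiply. */
static void accumulate_pair(double left[2], double icoeff,
                            const float *buffer, size_t data_index)
{
    assert(left != NULL && buffer != NULL);
    left[0] += icoeff * buffer[data_index];
    left[1] += icoeff * buffer[data_index + 1];
}
```

Each `buffer[...]` read is a float that C converts to double before the multiply, which is the conversion the compiler implements with cvtps2pd or cvtss2sd in the listings above.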

Converting this into the SSE2 equivalent can't be expressed in the intrinsics (it requires an explicit, separate load and cast). Doing it as inline assembly ended up generating extra leas, would not get scheduled well, and stuff like that. 

... like I said, it will take me a while to discuss this piecemeal and going to 0) is the right thing. 
Comment 10 Andrew Haley 2009-04-22 17:33:26 UTC
> 0) I will build gcc-4.4 and try that. I will also make the 1 line patch to it
> to try increasing the number of asm params, and try that. I would prefer that
> someone with more guts inside the guts of gcc do the latter, I fear I would
> rapidly end up over my head. Is it a magic number or just a stupid default?

Try

max_recog_operands = FIRST_PSEUDO_REGISTER*2
Comment 11 Andrew Haley 2009-04-22 17:47:39 UTC
I suspect the reason the limit is 30 is that when that code was written the largest register set was 32 registers, 2 of which were reserved to the implementation.  Inline asm hasn't kept up with the hardware.
Comment 12 Jakub Jelinek 2009-04-22 17:51:53 UTC
That's not going to fly very well e.g. on ia64:
config/ia64/ia64.h:#define FIRST_PSEUDO_REGISTER 334
then we have automatic arrays like:
char operands_match[MAX_RECOG_OPERANDS][MAX_RECOG_OPERANDS];
etc., many of them are cleared quite often, so such a change could sometimes lead to overflowing stack and definitely to a noticeable slowdown.
Or struct recog_data contains many MAX_RECOG_OPERANDS sized arrays in it.
Or e.g.:
extern struct operand_alternative recog_op_alt[MAX_RECOG_OPERANDS][MAX_RECOG_ALTERNATIVES];
where operand_alternative struct is pretty large.
Many functions also iterate from 0 to MAX_RECOG_OPERANDS.
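
To put rough numbers on the stack concern, the sketch below computes the size of a square `char` table such as `operands_match` for a given operand limit. The helper name `square_table_bytes` is hypothetical; the figures 30 and 334 come from this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed by a table declared as char t[n][n], like operands_match
   with MAX_RECOG_OPERANDS equal to n. */
static size_t square_table_bytes(size_t n)
{
    assert(n != 0);
    return n * n * sizeof(char);
}
```

At the current limit of 30 the table is 900 bytes; at 2 * 334 = 668 it would be 446,224 bytes, roughly 500 times larger, for this one array alone.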
Comment 13 David Taht 2009-04-22 17:55:40 UTC
@Andrew
>I suspect the reason the limit is 30 is that when that code was written the
>largest register set was 32 registers, 2 of which were reserved to the
>implementation.  Inline asm hasn't kept up with the hardware.

That old huh? Given that I/O operands take two virtual regs... methinks that the history of this is more of an x86ism... 

and symbolic register parameters date back to gcc 3.1....
Comment 14 David Taht 2009-04-22 18:00:46 UTC
@Jakub:

I'm going to build this thing today. (once I figure out the best way, and I figure it will take a while, even so) Are there any specific tests I should run to check for performance issues? I expect any stack overflows to show up quickly. :)

Admittedly, in this case (x86_64) we're only talking about doubling the number of registers available, not the extreme ia64 case... 
Comment 15 Andrew Pinski 2009-04-22 18:03:19 UTC
(In reply to comment #13)
> That old huh? Given that I/O operands take two virtual regs... methinks that
> the history of this is more of an x86ism... 
> 
> and symbolic register parameters date back to gcc 3.1....

They date back to 1.0 :).  And x86 was not the first target that GCC implemented (m68k was definitely early on).  So the argument about it being an x86ism is wrong.  And increasing the limit will not really help, because the bigger the inline asm, the more likely the program will not work.
Comment 16 David Taht 2009-04-22 18:30:45 UTC
@Jakub/Andrew:

max_recog_operands = MIN(FIRST_PSEUDO_REGISTER*2,SOME_SANE_VALUE_DERIVED_FROM_SMASHING_THE_STACK_ON_IA64)
; // ?

I certainly am not in a position to make a one line change to gcc and test it on ia64 or other insane architectures like vmx,intel avx, etc, etc... I also somehow doubt that a human could deal with 668 registers (a code generator might)

this human, at least, copes with 32 registers just fine.
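
The clamped limit suggested in this comment could be sketched as below. Both the helper name `clamped_operand_limit` and the `cap` parameter are hypothetical; no particular ceiling is proposed in this thread, and this is not a patch against GCC.

```c
#include <assert.h>

/* Hypothetical helper: twice the hard-register count (each "+" operand
   needs an output slot and an input slot), capped at a chosen ceiling. */
static int clamped_operand_limit(int first_pseudo_register, int cap)
{
    assert(first_pseudo_register > 0 && cap > 0);
    int wanted = first_pseudo_register * 2;
    return wanted < cap ? wanted : cap;
}
```

With the ia64 value of 334 and an arbitrary illustrative cap of 64, the result is 64; a target with 20 hard registers would get 40.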
Comment 17 Andrew Haley 2009-04-22 18:40:35 UTC
I agree with Jakub's point.

David, can you try using named register variables instead of register operands?  I think that may work, unless there is some other limit of which I'm unaware.
Comment 18 H.J. Lu 2009-04-22 19:03:10 UTC
(In reply to comment #6)
> Pinskia: Actually, no. I started with the intrinsics and looked hard at what the
> code scheduler was doing before settling on rewriting this in inline assembly. 
> 
> The intrinsics have several problems that affect the code quality in this case.
> 

Please provide an example to show code quality problems with intrinsics.
I will take a look.
Comment 19 David Taht 2009-04-22 19:48:19 UTC
@Andrew: I agree with Jakub's point too, but don't believe merely doubling the number of operands will hurt much. Am trying it against 4.3.2... it's building as I write. 

When I figure out how to safely build 4.4 I will look at its code quality and fiddle in the same ways.

I don't understand how using named register variables would help, except for making this slightly easier to write in C + snippets of asm. Symbolic assembly, and using the occasional complex memory-addressing instruction, helps a lot. I will think on it.

@H.J: I will provide an example when I get the spare brain cells. It will pay for me to test against 4.4 first, however. 

I very much appreciate all the attention paid to this today. I am going away to hack for a while while my cpu glows from building gcc.

Comment 20 David Taht 2009-04-23 03:29:22 UTC
I got gcc 4.4 built with the 1 line patch.

It assembles my 24 operand function just fine (which had several errors in the asm that I couldn't detect without assembling it - pushed out to my repo now - it even gets through a few loops with my as yet unfinished test code). Yea! 

Optimal register allocation (w/wo REX prefixes) is an issue (but that was why I was writing this as inline asm in the first place, that is easy to fix)

It successfully compiles, links, and runs my project (ScreamingRabbitCode) at -O3 -mtune=core2 w/o the asm code in 17.397s. gcc 4.3.2 takes 17.865s. (both are best case times over several runs and well within margin of error) 

This is obviously not a particularly good test (I suspect the bottleneck is libtool). I will try it and an out-of-the-box 4.4 on some bigger stuff tomorrow (suggestions? The biggest thing I ever build is ardour), and figure out how to run the gcc testsuite as well.


Comment 21 David Taht 2009-04-23 03:44:43 UTC
(In reply to comment #20)
> It successfully compiles, links, and runs my project (ScreamingRabbitCode) at
> -O3 -mtune=core2 w/o the asm code in 17.397s. gcc 4.3.2 takes 17.865s. (both
> are best case times over several runs and well within margin of error) 

To be clear, the timing above was for compiling and linking using the patched gcc 4.4 vs gcc 4.3.2. 
Comment 22 Andrew Haley 2009-04-23 11:08:23 UTC
Re named register variables:
You can, instead of using 

[coeff_ptr_l1] "+r" (coeff_ptr_l1)

declare something like

register long double *coeff_ptr_l1 asm ("r8");

and then use "%%r8" in your asm.  This means that you allocate the registers instead of the compiler, but it may solve your immediate problem.
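
A minimal sketch of the explicit register variable syntax, assuming x86-64 and a GNU-compatible compiler (other targets take the plain C path). The function name `add_one` is hypothetical. The register name is written without a percent prefix, as in the GCC manual's examples. The variable is still passed as an operand here, because current GCC documentation only describes local register variables as a way to pin asm operands to a register; naming the register directly in the template, as suggested above, relies on the compiler keeping the value there.

```c
/* Adds 1 to v. On x86-64 the value is pinned to r8 through an explicit
   register variable; elsewhere plain C is used. */
static long add_one(long v)
{
#if defined(__x86_64__) && defined(__GNUC__)
    register long r __asm__("r8") = v;   /* "r8", no percent prefix */
    /* r is listed as an operand, so %0 expands to %r8 in the template. */
    __asm__ ("addq $1, %0" : "+r" (r) : : "cc");
    return r;
#else
    return v + 1;                        /* portable fallback */
#endif
}
```

Because the programmer, not the compiler, picks the register, any other use of r8 inside the same asm statement has to be managed by hand.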