Using more than 15 operands in inline assembly of the symbolic form:

    "... lots of assembly elided ..." \
    "movaps %[inner_filter_index], %[icoeff_l2]\n\t" \
    : [icoeff_l2] "+x" (icoeff_l2), \
      [inner_filter_index] "+x" (inner_filter_index), \
    ... 13 more operands ...

blows up at 16 register names, well below the advertised limit of 30 operands, and in either case too low for modern vector architectures such as x86_64, vmx, larrabee, etc. Test program to come...
Created attachment 17672 [details] Test program demonstrating 16 register breakage on inline asm in x86_64
tested against: gcc (Ubuntu 4.3.2-1ubuntu12) 4.3.2 - fails
"+x" counts as 2 operands, not just one (one input and one output).
"+" counts as two operands? ok, that makes sense. So, basically 2*num_of_physical_regs would be a saner default ...
Actually no, it would be better if you moved over to using the intrinsics, which can be optimized and scheduled.
Pinska: Actually, no. I started with the intrinsics and looked hard at what the code scheduler was doing before settling on rewriting this in inline assembly.

The intrinsics have several problems that affect the code quality in this case:

1) They don't issue a request from memory for many instructions, such as cvtps2pd. Doing oneliners for stuff like that is feasible, but even harder to understand and debug than pure assembly. Gcc also seems to have a misguided sense of how many clocks cvtX2Y instructions take.

2) The combination of intrinsics, C, and assembly gcc was generating included a lot of extra instructions: promoting ints to longs, leas, etc.

3) The optimizer tends to push prefetches to the end of the loop when they really need to happen as early as possible. This particular bit of code *might* benefit from prefetching (it is not a very predictable access pattern), but at the end of the loop prefetches hurt more than they help.

4) This code is right up against the edge of the x86_64 register set (all the xmm registers (for 8-channel resampling) and 7 integer registers).

5) You can't use push/pop across multiple bits of inline assembly.

Yes, it would be nice if gcc did a better job on it... I can show you oprofiles of the gcc-generated code, but the larger point remains that doing complex vectorized operations tends to use up a lot of registers, and doing it well requires hand-optimized assembly... and to do that well, it would be helpful to have as many named parameters available as there are registers in the register set.
(In reply to comment #6)
> Pinska: Actually, no. I started with the intrinsics and looked hard at what the
> code scheduler was doing before settling on rewriting this in inline assembly.
>
> The intrinsics have several problems that effect the code quality in this case.
>
> 1) They don't issue a request from memory for many instructions, such as
> cvtps2pd. Doing oneliners for stuff like is feasible but even harder to
> understand and debug than pure assembly. Gcc also seems to have a misguided
> sense for how many clocks cvtX2Y instructions take.

Are you using the correct -mtune= value for the processor you are tuning for? Different processors have different clock cycles. If you have an issue with the optimizers, I would rather see bugs filed there than have you work around it with inline-asm.

> 2) The combination of intrinsics, C, and assembly gcc was generating included a
> lot of extra instructions, promoting ints to longs, leas, etc.

Int to long promotion is normal and a different issue; really, you should have filed this one too.

> 3) The optimizer tends to push prefetches to the end of the loop when it really
> needs to happen as early as possible. This particular bit of code *might*
> benefit from prefetching (it is not a very predictable access pattern) but at
> the end of the loop prefetches hurt more than they help.

File a bug.

> 4) this code is right up against the edge of the x86_64 register set (all the
> xmm registers (for 8 channel resampling) and 7 integer registers)

Try 4.4.0, which was just released; it has a better register allocator.

> I can show you oprofiles of the gcc generated code, but the larger point
> remains that doing complex vectorized operations tends to use up a lot of
> registers and doing it well requires hand optimized assembly... and to do that
> well, it would be helpful to have as many named parameters available as in the
> register set.
No, GCC should be doing a better job with the intrinsics, which is much better than you doing it manually in inline-asm. Inline-asm should be used when there is no intrinsic for the instruction, or for something which you really cannot do using intrinsics.
I don't see why this is changed to WONTFIX. Fixing inline asm to allow the use of all a machine's registers is trivial, and should not be refused for the sake of a pedantic argument about whether someone should be using asm. gcc is a professional tool whose users are capable of deciding for themselves if they need to use asm. It makes no sense at all arbitrarily to restrict the number of operands.
Pinskia: It is going to take me a long time to address these issues piecemeal, so...

0) I will build gcc-4.4 and try that. I will also make the 1-line patch to it to try increasing the number of asm params, and try that. I would prefer that someone more at home in the guts of gcc do the latter; I fear I would rapidly end up over my head. Is it a magic number or just a stupid default?

re 1) I am using -mtune=core2 -O3, which is correct. I note that, looking at the generated code today, without that and with -O2, using the non-sse version (just doubles), -O2 generates the following code sequence for

    left [0] += icoeff * filter->buffer [data_index];
    left [1] += icoeff * filter->buffer [data_index+1];

where left[0] and icoeff are doubles and filter->buffer[data_index] is a float:

    movss    (%r11),%xmm0
    cvtps2pd %xmm0,%xmm0   ; cvtss2sd would be more correct and faster on most
                           ; x86_64 arches prior to the k10 and core2
    ... mult and add elided, second line elided ...

(-O3 -mtune will do a cvtss2sd (%r9),%xmm0, which is better.)

Converting this into the SSE2 equivalent can't be expressed in the intrinsics (it requires an explicit, separate load & cast). Doing it as inline assembly ended up generating extra leas, would not get scheduled well, and stuff like that.

... like I said, it will take me a while to discuss this piecemeal, and going to 0) is the right thing.
Subject: Re: 16 symbolic register names generates error: more than 30 operands in 'asm'

> 0) I will build gcc-4.4 and try that. I will also make the 1 line patch to it
> to try increasing the number of asm params, and try that. I would prefer that
> someone with more guts inside the guts of gcc do the latter, I fear I would
> rapidly end up over my head. Is it a magic number or just a stupid default?

Try max_recog_operands = FIRST_PSEUDO_REGISTER*2
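As a sketch only: in the GCC sources of that era the limit is a compile-time constant in gcc/recog.h, so the one-line experiment suggested above looks roughly like the following (whether FIRST_PSEUDO_REGISTER is already visible at that point is target- and include-order-dependent; this is an experiment, not a submittable patch).

```diff
--- a/gcc/recog.h
+++ b/gcc/recog.h
@@
-#define MAX_RECOG_OPERANDS 30
+/* Experiment only: scale the operand ceiling with the target's
+   register count, as suggested in this bug.  */
+#define MAX_RECOG_OPERANDS (FIRST_PSEUDO_REGISTER * 2)
```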
I suspect the reason the limit is 30 is that when that code was written the largest register set was 32 registers, 2 of which were reserved to the implementation. Inline asm hasn't kept up with the hardware.
That's not going to fly very well e.g. on ia64:

    config/ia64/ia64.h:#define FIRST_PSEUDO_REGISTER 334

then we have automatic arrays like:

    char operands_match[MAX_RECOG_OPERANDS][MAX_RECOG_OPERANDS];

etc.; many of them are cleared quite often, so such a change could sometimes lead to overflowing the stack and definitely to a noticeable slowdown. struct recog_data also contains many MAX_RECOG_OPERANDS-sized arrays. Or e.g.:

    extern struct operand_alternative recog_op_alt[MAX_RECOG_OPERANDS][MAX_RECOG_ALTERNATIVES];

where the operand_alternative struct is pretty large. Many functions also iterate from 0 to MAX_RECOG_OPERANDS.
@Andrew

> I suspect the reason the limit is 30 is that when that code was written the
> largest register set was 32 registers, 2 of which were reserved to the
> implementation. Inline asm hasn't kept up with the hardware.

That old, huh? Given that I/O operands take two virtual regs... methinks the history of this is more of an x86ism...

and symbolic register parameters date back to gcc 3.1....
@Jakub: I'm going to build this thing today (once I figure out the best way, which I figure will take a while even so). Are there any specific tests I should run to check for performance issues? I expect any stack overflows to show up quickly. :) Admittedly, in this case (x86_64) we're only talking about doubling the number of registers available, not the extreme ia64 case...
(In reply to comment #13)
> That old huh? Given that I/O operands take two virtual regs... methinks that
> the history of this is more of an x86ism...
>
> and symbolic register parameters date back to gcc 3.1....

They date back to 1.0 :). And x86 was not the first target that GCC implemented (m68k was definitely early on), so the argument about it being an x86ism is wrong. And really, increasing the limit will not help much, because the bigger the inline-asm, the more likely the program will not work.
@Jakub/Andrew:

    max_recog_operands = MIN(FIRST_PSEUDO_REGISTER*2, SOME_SANE_VALUE_DERIVED_FROM_SMASHING_THE_STACK_ON_IA64); // ?

I certainly am not in a position to make a one-line change to gcc and test it on ia64 or other insane architectures like vmx, intel avx, etc... I also somehow doubt that a human could deal with 668 registers (a code generator might); this human, at least, copes with 32 registers just fine.
I agree with Jakub's point. David, can you try using named register variables instead of register operands? I think that may work, unless there is some other limit of which I'm unaware.
(In reply to comment #6)
> Pinska: Actually, no. I started with the intrinsics and looked hard at what the
> code scheduler was doing before settling on rewriting this in inline assembly.
>
> The intrinsics have several problems that effect the code quality in this case.

Please provide an example to show code quality problems with intrinsics. I will take a look.
@Andrew: I agree with Jakub's point too, but I don't believe merely doubling the number of operands will hurt much. Am trying it against 4.3.2... it's building as I write. When I figure out how to safely build 4.4, I will look at its code quality and fiddle in the same ways.

I don't understand how using named register variables would help, except for making this slightly easier to write in C + snippets of asm. Symbolic assembly and the occasional complex memory-addressing instruction help a lot. I will think on it.

@H.J: I will provide an example when I get the spare brain cells. It will pay for me to test against 4.4 first, however.

I very much appreciate all the attention paid to this today. I am going away to hack for a while while my cpu glows from building gcc.
I got gcc 4.4 built with the 1-line patch. It assembles my 24-operand function just fine (which had several errors in the asm that I couldn't detect without assembling it; pushed out to my repo now; it even gets through a few loops with my as-yet-unfinished test code). Yea!

Optimal register allocation (w/wo REX prefixes) is an issue (but that was why I was writing this as inline asm in the first place; that is easy to fix).

It successfully compiles, links, and runs my project (ScreamingRabbitCode) at -O3 -mtune=core2 w/o the asm code in 17.397s. gcc 4.3.2 takes 17.865s. (Both are best-case times over several runs and well within the margin of error.) This is obviously not a particularly good test (I suspect the bottleneck is libtool). I will try it and an out-of-the-box 4.4 on some bigger stuff tomorrow (suggestions? The biggest thing I ever build is ardour), and figure out how to run the gcc testsuite as well.
(In reply to comment #20)
> It successfully compiles, links, and runs my project (ScreamingRabbitCode) at
> -O3 -mtune=core2 w/o the asm code in 17.397s. gcc 4.3.2 takes 17.865s. (both
> are best case times over several runs and well within margin of error)

To be clear, the timing above was for compiling and linking using the patched gcc 4.4 vs gcc 4.3.2.
Re named register variables: you can, instead of using

    [coeff_ptr_l1] "+r" (coeff_ptr_l1)

declare something like

    register long double *coeff_ptr_l1 asm ("r8");

and then use "%%r8" in your asm. (Note the declaration takes the bare register name "r8"; the "%%" form belongs only in the asm template.) This means that you allocate the registers instead of the compiler, but it may solve your immediate problem.