Since GCC 5 (r208990 still works fine, r215478 ICEs already up to current trunk)
the following testcase ICEs on armv7hl-linux-gnueabi with -O2:
"\n" : : "nor" (0), "nor" (0));
Removing just one newline makes the ICE go away. get_attr_length for such inline asm is 248 bytes (estimation) and the arm minipool code is apparently upset about it.
Here's the original thread where the bug was reported when compiling qemu on Fedora Rawhide for armv7: https://firstname.lastname@example.org/thread/GD3ABSWD6HHTNEKV2EJY4PXABQ245UCZ/
Note, in qemu it isn't just those newlines but simply a long inline asm.
It can very well be something that is empty or very short in the .text section,
e.g. can .pushsection; .section whatever; put lots of stuff in there; .popsection.
But as inline asm is a black box to gcc, we have only very rough estimation for the inline asm text length, which just counts how many newlines and typically ; characters there are and multiple that by some insn size.
I suspect the best we're likely to be able to do is to downgrade the ICE into an error if there's a long inline ASM in the code, much like the way we handle invalid constraints in inline ASMs.
That will still mean qemu will not compile.
I admit I know next to nothing about the arm minipool stuff, but can't whatever needs to be inserted be inserted either before or after the inline asm, if it is needed for the asm inputs then likely before that?
It depends on the the reference. Some minipool references can only be forwards due to the nature of the instruction making the reference. I'll need to take a look, and I might also need a real testcase, not just the reduced one below.
Created attachment 51867 [details]
Unreduced preprocessed source.
OK, so the real problem in the test case is the constraint ("nor") is meaningless on Arm (because there is no instruction in the architecture that can accept both a constant and a memref) and this confuses the minipool code because it exploits this restriction to detect insns that need to be reworked by the md_reorg pass.
When processing an ASM we allow only a forward literal pool reference and it must be less than 250 bytes beyond the /start/ of the pattern (because we don't know where in the asm it gets used). So we have to deduct from that range 4 bytes for every asm statement: add too many lines to the asm and we reach the point where it is impossible to place the literal pool even directly after the asm.
So I think really this is an ICE on invalid, because the constraint is not meaningful on Arm.
CCing Frank as this is systemtap sys/sdt.h
# ifndef STAP_SDT_ARG_CONSTRAINT
# if defined __powerpc__
# define STAP_SDT_ARG_CONSTRAINT nZr
# define STAP_SDT_ARG_CONSTRAINT nor
All of n, o and r are generic constraints and const0_rtx is valid for the "n" constraint, so why is the backend trying to put it into memory at all?
What is systemtap trying to do is not use those operands in any instruction, but note for the debugger how to find out the value of the asm input operand (read some register, some memory or the immediate constant).
It's been this way now for over 20 years. A single instruction cannot take a constant and a memory for the same operand, so this has been used in the backend to trigger the minipool generation: a constant in an operand that takes a memory triggers a minipool spill to be pushed.
If we changed this now we risk breaking existing inline asms that exploit this feature (good or bad) and expect a constant to be pushed into a minipool entry.
Inline asm that only works with memory but in constraints says it accepts both immediate constant and memory is IMNSHO just broken, it is just fine if the compiler makes a different choice.
If "nor" with constant input on arm has meant actually just "or", then sure, systemtap could be changed and after a couple of years it will propagate to all stap copies used in the wild, but it is quite severe misoptimization of one of the most common cases. The systemtap macros don't really know what argument will be passed to them, whether a constant, something that lives in memory, something that lives in a register and ideally wants as few actual instructions before those macros as possible to arrange the arguments so that debugger can inspect them.
Note, sys/sdt.h is using the "nor" constraints for 11 years already.
Maybe it's possible to provide specific, architecture-independent constraints for Systemtap-like use cases?
If it can be proven that all gcc versions until now treat "nor" constraint as ignoring the n in there and pushing all constants into constant pool, I think it could be changed into "or" for arm32. But it would be IMNSHO unnecessary pessimization (but it could e.g. be done for GCC < 12 or whenever this would be fixed).
Another option is to tweak whatever generates those large inline asms.
In the qemu case it is created with
/usr/bin/python3 ../scripts/tracetool.py --backend=dtrace --group=util --format=h /builddir/build/BUILD/qemu-6.1.0/util/trace-events trace/trace-util.h
whatever that is (but that means I haven't actually seen what it generates).
Note, apparently several other packages are affected, so not sure what changed recently in systemtap-sdt-devel or whatever else that adds up to this.
In the preprocessed source I got I see several blocks of
It already uses assembler macros, perhaps adding a macro to do all 3 at once or perhaps with something extra could bring the number of newlines down...
Apparently the change on the systemtap side was:
which indeed adds a lot of newlines to the inline asms.
But when already using gas macros, I wonder if all that
#define _SDT_ASM_TEMPLATE_1 _SDT_ARGFMT(1)
#define _SDT_ASM_TEMPLATE_2 _SDT_ASM_TEMPLATE_1 _SDT_ASM_BLANK _SDT_ARGFMT(2)
#define _SDT_ASM_TEMPLATE_3 _SDT_ASM_TEMPLATE_2 _SDT_ASM_BLANK _SDT_ARGFMT(3)
#define _SDT_ASM_TEMPLATE_4 _SDT_ASM_TEMPLATE_3 _SDT_ASM_BLANK _SDT_ARGFMT(4)
#define _SDT_ASM_TEMPLATE_5 _SDT_ASM_TEMPLATE_4 _SDT_ASM_BLANK _SDT_ARGFMT(5)
#define _SDT_ASM_TEMPLATE_6 _SDT_ASM_TEMPLATE_5 _SDT_ASM_BLANK _SDT_ARGFMT(6)
#define _SDT_ASM_TEMPLATE_7 _SDT_ASM_TEMPLATE_6 _SDT_ASM_BLANK _SDT_ARGFMT(7)
#define _SDT_ASM_TEMPLATE_8 _SDT_ASM_TEMPLATE_7 _SDT_ASM_BLANK _SDT_ARGFMT(8)
#define _SDT_ASM_TEMPLATE_9 _SDT_ASM_TEMPLATE_8 _SDT_ASM_BLANK _SDT_ARGFMT(9)
#define _SDT_ASM_TEMPLATE_10 _SDT_ASM_TEMPLATE_9 _SDT_ASM_BLANK _SDT_ARGFMT(10)
#define _SDT_ASM_TEMPLATE_11 _SDT_ASM_TEMPLATE_10 _SDT_ASM_BLANK _SDT_ARGFMT(11)
#define _SDT_ASM_TEMPLATE_12 _SDT_ASM_TEMPLATE_11 _SDT_ASM_BLANK _SDT_ARGFMT(12)
couldn't be rewritten into use of another .macro _SDT_ASM_TEMPLATE that just takes emits it all.
.macro sum from=0, to=5
macro from gas documentation for inspiration.
Note, the %n[_SDT_S##no] in there need to stay (dunno about the
_SDT_ASM_SUBSTR(_SDT_ARGTMPL(_SDT_A##no)) stuff), but that could be achieved
by giving the macro from, to, arg, args:vararg arguments and use it like:
_SDT_ASM_TEMPLATE 1, 4, %n[_SDT_S1], %n[_SDT_S2], %n[_SDT_S3], %n[_SDT_S4]
FYI, I've opened a bug against systemtap in Fedora to track this problem on that side too https://bugzilla.redhat.com/show_bug.cgi?id=2026858
Note that we document how the size of asm is estimated:
and unfortunately asm inline ("..." ...) makes the size estimation 0 only for inlining purposes and not for others too.
So, for systemtap it is still desirable to use as few newlines/; characters in the pattern as possible. If one macro is used to handle 1-12 operands through recursions, one way to save a few newlines would be avoid all those other 5 macros, each one is used just once for each parameter and avoiding their .macro, .endm and .purgem lines will get rid of 15 lines...
Similarly, not doing .pushsection/.popsection 3 times for each argument but just once would help etc.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.