Code generated by gcc targeting `sh2eb-linux-musl` occasionally
contains duplicate symbols with optimization level `O2` or above
and produces the following errors during assembly:
$ gcc -c -mfdpic -O2 -Wno-int-conversion mintest.c
mintest.s: Assembler messages:
mintest.s:44: Error: symbol `.LPCS0' is already defined
This also occurs when the target is `sh2-linux-musl` but when it
specifically does not target `sh4` (even FDPIC). It does not
occur at optimization levels `O0` or `O1`.
I am using GCC with the following configuration, however, this
may also occur with 7.3.0 (unverified):
GCC : 8.2.0
binutils : 2.31.1
musl : git-39ef61 (2018-11-19)
a minimal test case:
/* mintest.c */
int a, b, c, d;
int *e = (void *)0;
void f ()
for (; d;)
if (b) c = a;
e = e = 1 << c;
the assembly contains:
I could not reproduce this under 6.4.0 or 7.3.0 with binutils of
2.30 (the .LPCS0 sections are not present in the assembly).
Binutils version should not be relevant; the bug here is before anything even gets to binutils. It looks like one of the RTL patterns used for calling libgcc bitshift functions from FDPIC was inteded to be non-duplicable but isn't. This bug goes undetected on J2 (-mj2) target because the J2 has the SH3 barrel shift instructions and does not need libgcc for variable shifts, but affects baseline SH2 ISA. I'll look back at the source and see if I can figure out what's happenning.
You can compile the code with the '-dp' option to see which insn patterns make up the asm code. The pattern names will be emitted as comments in the asm output.
Created attachment 45545 [details]
Tarball containing intermediate asm (with -dp) for each of 5 cases.
The error can be reproduced at `O1` optimization level with both
(strictly both) of the following options:
./cc -c mintest.c -O1 -freorder-blocks-algorithm=stc -ftree-pre
Changing to `-freorder-blocks-algorithm=simple` will not reveal
the issue at `O1`, `O2` or `O3`.
In summary, the only known ways to reproduce this issue are:
(0) `-O2` as described in original bug report;
(1) `-O1 -freorder-blocks-algorithm=stc -ftree-pre`, exclusively
not at any other optimization level;
and the only known ways ot mitigate this issue using either of
the above configurations are:
(2) `-O2 -freorder-blocks-algorithm=simple`;
(3) `-O1` without specifically both of the aforementioned flags.
The attached tarball contains 5 files named by letters 'A' - 'E'
containing the generated assembly, each with -dp` as suggested:
(A) FAIL: `-O2 -freorder-blocks-algorithm=stc`
(B) PASS: `-O2 -freorder-blocks-algorithm=simple`
(C) FAIL: `-O1 -freorder-blocks-algorithm=stc -ftree-pre`
(D) PASS: `-O1 -freorder-blocks-algorithm=simple -ftree-pre`
(E) PASS: `-O1 -freorder-blocks-algorithm=stc`
Created attachment 45546 [details]
All files produced by -O2 -da
The ashlsi3_d_call insn, alternative 1 seems to be the problem here, but I think the same problem could happen for all other libcalls that expand to a bsrf sequence.
In this particular case, something in the optimizers decides that moving the shift instruction out of the loop is a good idea. To do that, it copies the shift instruction only -- which is "ashlsi3_d_call", which is the bsrf plus the following label. The label is generated once during RTL expansion.
This works for normal PIC code which calculates the call address and puts it in a register first. This happens only once during RTL expansion and the address remains fixed. Subsequent copies of the shift instruction (which is just a jsr) work, because they re-use the calculated address.
In case of FDPIC, the use of bsrf will always require a unique address/offset in the symbols. It can't just make arbitrary copies of the shift instruction (which is a bsrf) because the pre-calculated address/offset will be wrong.
1) If not already there, something needs to be added that allows re-generating the address/offset of the copied/cloned instruction and also re-emitting the constant load. This is probably not so easy to do, as instructions can be copied in many places during the compilation.
2) Postpone the calculation and emission of symbols and constant loads until a very late stage of the compilation using very late splitters.
3) Use jsr instead of bsrf in the compiler and let the linker post-optimize calls via relaxation (although linker relaxation on SH has been broken for a long time).
I don't know much about PIC/FDPIC, but one thing I've noticed looks strange.
The right-shift pattern "ashrsi3_n" will result in
bsrf r1 ! 268 [c=5 l=2] call_valuei_pcrel
While the left-shift pattern "ashlsi3_d_call" will result in
bsrf r6 ! 24 [c=80 l=2] ashlsi3_d_call/1
I think all the FDPIC work done to use bsrf like this was probably a mistake. It ends up greatly enlarging functions that make a lot of such calls, for example soft-float that does it for each floating point operation. We actually encountered this in production usage. I think the bulk of the FDPIC patch was adding this stuff, and I wouldn't be opposed to removing all that (just generating jsr) if we can determine that it fixes bugs and results in better codegen.
Converting the FDPIC from bsrf back to jsr sounds like quite some work. However, I think chances of success are higher of it does the same thing as the normal PIC code.
Do you know what the main reason was to use bsrf for FDPIC? How does FDPIC differ from regular PIC?
I think it's actually just a matter of removing the patterns for generating bsrf, but I may be mistaken. Generating jsr should be what happens "by default" in some sense if GCC just has to load the address, no? Of course that would also explicitly load the GOT pointer for the callee which we don't need since it's local. I'll try to take a look at this in more detail soon and see what I can find.
(In reply to Rich Felker from comment #9)
> I think it's actually just a matter of removing the patterns for generating
> bsrf, but I may be mistaken. Generating jsr should be what happens "by
> default" in some sense if GCC just has to load the address, no?
I think so, yes.
> Of course
> that would also explicitly load the GOT pointer for the callee which we
> don't need since it's local.
Can you make an example? Maybe it can get optimized away afterwards, if it's not used?