Bug 89012 - SH2 (FDPIC) duplicate symbols in generated assembly.
Summary: SH2 (FDPIC) duplicate symbols in generated assembly.
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 8.2.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-23 14:49 UTC by Zach van Rijn
Modified: 2019-10-02 16:39 UTC (History)
2 users (show)

See Also:
Host: x86_64-linux-musl
Target: sh2[,eb]-linux-musl (FDPIC)
Build: x86_64-linux-musl
Known to work: 6.4.0
Known to fail: 8.2.0
Last reconfirmed: 2019-09-29 00:00:00


Attachments
Tarball containing intermediate asm (with -dp) for each of 5 cases. (1.98 KB, application/x-compressed-tar)
2019-01-29 04:44 UTC, Zach van Rijn
Details
All files produced by -O2 -da (77.11 KB, application/x-compressed-tar)
2019-01-29 05:12 UTC, Zach van Rijn
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Zach van Rijn 2019-01-23 14:49:39 UTC
Code generated by gcc targeting `sh2eb-linux-musl` occasionally
contains duplicate symbols with optimization level `O2` or above
and produces the following errors during assembly:

    $ gcc -c -mfdpic -O2 -Wno-int-conversion mintest.c
    mintest.s: Assembler messages:
    mintest.s:44: Error: symbol `.LPCS0' is already defined

This also occurs when the target is `sh2-linux-musl` but when it
specifically does not target `sh4` (even FDPIC). It does not
occur at optimization levels `O0` or `O1`.

I am using GCC with the following configuration, however, this
may also occur with 7.3.0 (unverified):

    GCC      : 8.2.0
    binutils : 2.31.1
    musl     : git-39ef61 (2018-11-19)

a minimal test case:

/* mintest.c */
int a, b, c, d;
int *e = (void *)0;
void f ()
{
    for (; d;)
    {
        if (b) c = a;
        e = e[0] = 1 << c;
    }
}

the assembly contains:

.LPCS0:
        mov.l   @r2,r0
        mov.l   r0,@r1
        mov.l   @r3,r2
        tst     r2,r2
        bt.s    .L5
        mov     r0,r1
.LPCS0:
        mov.l   r0,@r2
        mov.l   r0,@r1
        mov.l   @r3,r2
        tst     r2,r2
        bf.s    .L2
        mov     r0,r1

I could not reproduce this under 6.4.0 or 7.3.0 with binutils of
2.30 (the .LPCS0 sections are not present in the assembly).
Comment 1 Rich Felker 2019-01-23 15:55:55 UTC
Binutils version should not be relevant; the bug here is before anything even gets to binutils. It looks like one of the RTL patterns used for calling libgcc bitshift functions from FDPIC was inteded to be non-duplicable but isn't. This bug goes undetected on J2 (-mj2) target because the J2 has the SH3 barrel shift instructions and does not need libgcc for variable shifts, but affects baseline SH2 ISA. I'll look back at the source and see if I can figure out what's happenning.
Comment 2 Oleg Endo 2019-01-25 13:06:41 UTC
You can compile the code with the '-dp' option to see which insn patterns make up the asm code.  The pattern names will be emitted as comments in the asm output.
Comment 3 Zach van Rijn 2019-01-29 04:44:43 UTC
Created attachment 45545 [details]
Tarball containing intermediate asm (with -dp) for each of 5 cases.
Comment 4 Zach van Rijn 2019-01-29 04:45:15 UTC
The error can be reproduced at `O1` optimization level with both
(strictly both) of the following options:

./cc -c mintest.c -O1 -freorder-blocks-algorithm=stc -ftree-pre

Changing to `-freorder-blocks-algorithm=simple` will not reveal
the issue at `O1`, `O2` or `O3`.

In summary, the only known ways to reproduce this issue are:

(0) `-O2` as described in original bug report;

(1) `-O1 -freorder-blocks-algorithm=stc -ftree-pre`, exclusively
    not at any other optimization level;

and the only known ways ot mitigate this issue using either of
the above configurations are:

(2) `-O2 -freorder-blocks-algorithm=simple`;

(3) `-O1` without specifically both of the aforementioned flags.

The attached tarball contains 5 files named by letters 'A' - 'E'
containing the generated assembly, each with -dp` as suggested:

(A) FAIL: `-O2 -freorder-blocks-algorithm=stc`

(B) PASS: `-O2 -freorder-blocks-algorithm=simple`

(C) FAIL: `-O1 -freorder-blocks-algorithm=stc -ftree-pre`

(D) PASS: `-O1 -freorder-blocks-algorithm=simple -ftree-pre`

(E) PASS: `-O1 -freorder-blocks-algorithm=stc`
Comment 5 Zach van Rijn 2019-01-29 05:12:06 UTC
Created attachment 45546 [details]
All files produced by -O2 -da
Comment 6 Oleg Endo 2019-09-29 03:50:41 UTC
The ashlsi3_d_call insn, alternative 1 seems to be the problem here, but I think the same problem could happen for all other libcalls that expand to a bsrf sequence.

In this particular case, something in the optimizers decides that moving the shift instruction out of the loop is a good idea.  To do that, it copies the shift instruction only -- which is "ashlsi3_d_call", which is the bsrf plus the following label.  The label is generated once during RTL expansion.

This works for normal PIC code which calculates the call address and puts it in a register first.  This happens only once during RTL expansion and the address remains fixed.  Subsequent copies of the shift instruction (which is just a jsr) work, because they re-use the calculated address.

In case of FDPIC, the use of bsrf will always require a unique address/offset in the symbols.  It can't just make arbitrary copies of the shift instruction (which is a bsrf) because the pre-calculated address/offset will be wrong.


1) If not already there, something needs to be added that allows re-generating the address/offset of the copied/cloned instruction and also re-emitting the constant load.  This is probably not so easy to do, as instructions can be copied in many places during the compilation.

2) Postpone the calculation and emission of symbols and constant loads until a very late stage of the compilation using very late splitters.

3) Use jsr instead of bsrf in the compiler and let the linker post-optimize calls via relaxation (although linker relaxation on SH has been broken for a long time).


I don't know much about PIC/FDPIC, but one thing I've noticed looks strange.
The right-shift pattern "ashrsi3_n" will result in

	bsrf	r1	! 268	[c=5 l=2]  call_valuei_pcrel
	...

	.long	___ashrsi3@PLT-(.LPCS0+2-.)



While the left-shift pattern "ashlsi3_d_call" will result in

	bsrf	r6	! 24	[c=80 l=2]  ashlsi3_d_call/1
	...

	.long	___ashlsi3_r0-(.LPCS0+2)
Comment 7 Rich Felker 2019-09-29 04:02:18 UTC
I think all the FDPIC work done to use bsrf like this was probably a mistake. It ends up greatly enlarging functions that make a lot of such calls, for example soft-float that does it for each floating point operation. We actually encountered this in production usage. I think the bulk of the FDPIC patch was adding this stuff, and I wouldn't be opposed to removing all that (just generating jsr) if we can determine that it fixes bugs and results in better codegen.
Comment 8 Oleg Endo 2019-09-29 04:10:06 UTC
Converting the FDPIC from bsrf back to jsr sounds like quite some work.  However, I think chances of success are higher of it does the same thing as the normal PIC code.

Do you know what the main reason was to use bsrf for FDPIC?  How does FDPIC differ from regular PIC?
Comment 9 Rich Felker 2019-09-30 18:32:59 UTC
I think it's actually just a matter of removing the patterns for generating bsrf, but I may be mistaken. Generating jsr should be what happens "by default" in some sense if GCC just has to load the address, no? Of course that would also explicitly load the GOT pointer for the callee which we don't need since it's local. I'll try to take a look at this in more detail soon and see what I can find.
Comment 10 Oleg Endo 2019-10-02 16:39:52 UTC
(In reply to Rich Felker from comment #9)
> I think it's actually just a matter of removing the patterns for generating
> bsrf, but I may be mistaken. Generating jsr should be what happens "by
> default" in some sense if GCC just has to load the address, no?

I think so, yes.

> Of course 
> that would also explicitly load the GOT pointer for the callee which we
> don't need since it's local.

Can you make an example?  Maybe it can get optimized away afterwards, if it's not used?