Bug 98482 - -mfentry creates invalid call for -mcmodel=large
Summary: -mfentry creates invalid call for -mcmodel=large
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 10.2.1
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2020-12-30 19:13 UTC by Topi Miettinen
Modified: 2021-01-08 16:47 UTC (History)
6 users (show)

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-01-07 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Topi Miettinen 2020-12-30 19:13:10 UTC
GCC on x86_64 with `-mfentry` generates invalid code for `-mcmodel=large`. The call to `__fentry__` uses plain `call` instruction, but this can only address locations within 32 bit range while the target may be anywhere in the 64 bit range due to `-mcmodel=large`.

The expected code would be something which loads a 64 bit value to a register and then uses register indirect call, so instead of
        call    __fentry__
there needs to be
        movabsq $__fentry__, %rax
        call    *%rax

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/10/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 10.2.1-3' --with-bugurl=file:///usr/share/doc/gcc-10/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-10 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-10-0py4jQ/gcc-10-10.2.1/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-10-0py4jQ/gcc-10-10.2.1/debian/tmp-gcn/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-mutex
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.2.1 20201224 (Debian 10.2.1-3) 
$ cat mfentry.c
void func(void)
{
}
$ gcc -mfentry -pg -S mfentry.c -mcmodel=large -fno-pie -O2 -save-temps
$ cat mfentry.i
# 1 "mfentry.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 31 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 32 "<command-line>" 2
# 1 "mfentry.c"
void func(void)
{
}
$ cat mfentry.s
        .file   "mfentry.c"
        .text
        .p2align 4
        .globl  func
        .type   func, @function
func:
.LFB0:
        .cfi_startproc
1:      call    __fentry__
        ret
        .cfi_endproc
.LFE0:
        .size   func, .-func
        .ident  "GCC: (Debian 10.2.1-3) 10.2.1 20201224"
        .section        .note.GNU-stack,"",@progbits
Comment 1 Hongtao.liu 2021-01-07 10:44:52 UTC
(In reply to Topi Miettinen from comment #0)
> GCC on x86_64 with `-mfentry` generates invalid code for `-mcmodel=large`.
> The call to `__fentry__` uses plain `call` instruction, but this can only
> address locations within 32 bit range while the target may be anywhere in
> the 64 bit range due to `-mcmodel=large`.
> 
> The expected code would be something which loads a 64 bit value to a
> register and then uses register indirect call, so instead of
>         call    __fentry__
> there needs to be
>         movabsq $__fentry__, %rax
>         call    *%rax

according to PSabi, %rax is not safe.
%rax
temporary register; with variable arguments No
passes information about the number of vector
registers used; 1 st return register

and by the time of output __fentry__ in gcc, register is already accocated, is there any regs supposed to be safe in the entry of function? or we need to spill reg to stack and load it back after call, it looks inefficient.
Comment 2 Uroš Bizjak 2021-01-07 12:01:30 UTC
(In reply to Hongtao.liu from comment #1)
> and by the time of output __fentry__ in gcc, register is already accocated,
> is there any regs supposed to be safe in the entry of function? or we need
> to spill reg to stack and load it back after call, it looks inefficient.

You can use any calee-saved register here.
Comment 3 Uroš Bizjak 2021-01-07 12:24:33 UTC
(In reply to Uroš Bizjak from comment #2)
> (In reply to Hongtao.liu from comment #1)
> > and by the time of output __fentry__ in gcc, register is already accocated,
> > is there any regs supposed to be safe in the entry of function? or we need
> > to spill reg to stack and load it back after call, it looks inefficient.
> 
> You can use any calee-saved register here.

Eh, no - __fentry__ is called before pushes.
Comment 4 Topi Miettinen 2021-01-07 13:30:43 UTC
Sorry, I didn't check the ABI. It seems that %r11 and maybe %r10 should be usable:

Figure 3.4: Register Usage

Register
 Usage
 Preserved across function calls

%r10
 temporary register, used for passing a function’s static chain pointer
 No

%r11
 temporary register
 No

Otherwise, I suppose any register could be used if it's saved:
        pushq   %reg
        movabsq $__fentry__, %reg
        call    *%reg
        popq     %reg
Comment 5 Uroš Bizjak 2021-01-07 13:37:44 UTC
(In reply to Topi Miettinen from comment #4)
> Sorry, I didn't check the ABI. It seems that %r11 and maybe %r10 should be
> usable:

%r11 is already used as PROFILE_COUNT_REGISTER for !NO_PROFILE_COUNTERS targets.
Comment 6 H.J. Lu 2021-01-07 23:50:20 UTC
A patch is posted at

https://gcc.gnu.org/pipermail/gcc-patches/2021-January/563033.html
Comment 7 Hongtao.liu 2021-01-08 02:08:42 UTC
(In reply to H.J. Lu from comment #6)
> A patch is posted at
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2021-January/563033.html

Yes, %r10 is pushed before __fentry__

cat test.c

void func(int (*param)(int));

void outer(int x)
{
    int nested(int y)
    {
        // If x is not used somewhere in here,
        // then the function will be "lifted" into
        // a normal, non-nested function.
        return x + y;
    }
    func(nested);
}

with -O2 -pg -mfentry got

nested.0:
.LFB1:
	.cfi_startproc
	pushq	%r10
1:	call	__fentry__
	popq	%r10
	movl	(%r10), %eax
	addl	%edi, %eax
	ret
	.cfi_endproc
.LFE1:
	.size	nested.0, .-nested.0
	.p2align 4
	.globl	outer
	.type	outer, @function
outer:
.LFB0:
	.cfi_startproc
1:	call	__fentry__
	subq	$56, %rsp
	.cfi_def_cfa_offset 64
	leaq	64(%rsp), %rax
	movq	%rax, 32(%rsp)
	movl	$-17599, %eax
	movl	%edi, (%rsp)
	movw	%ax, 4(%rsp)
	movl	$-17847, %edx
	movl	$nested.0, %eax
	leaq	4(%rsp), %rdi
	movl	%eax, 6(%rsp)
	movw	%dx, 10(%rsp)
	movq	%rsp, 12(%rsp)
	movl	$-1864106167, 20(%rsp)
	call	func
	addq	$56, %rsp
	.cfi_def_cfa_offset 8
Comment 8 Topi Miettinen 2021-01-08 08:58:15 UTC
I'm unfortunately ignorant to GCC internals and usage of %r10, but otherwise the patch looks good to me.

For -mcmodel=large -fPIC, the call sequence probably needs to be similar to how other extern functions are called under those flags:

.L2:
        movabsq $_GLOBAL_OFFSET_TABLE_-.L2, %reg0
        leaq    .L2(%rip), %reg1
        movabsq $__fentry__@PLTOFF, %reg2
        addq    %reg0, %reg1
        addq    %reg1, %reg2
        call    *%reg2
Comment 9 Uroš Bizjak 2021-01-08 09:29:27 UTC
(In reply to Topi Miettinen from comment #8)
> I'm unfortunately ignorant to GCC internals and usage of %r10, but otherwise
> the patch looks good to me.
> 
> For -mcmodel=large -fPIC, the call sequence probably needs to be similar to
> how other extern functions are called under those flags:
> 
> .L2:
>         movabsq $_GLOBAL_OFFSET_TABLE_-.L2, %reg0
>         leaq    .L2(%rip), %reg1
>         movabsq $__fentry__@PLTOFF, %reg2
>         addq    %reg0, %reg1
>         addq    %reg1, %reg2
>         call    *%reg2

We are only lucky to get one temporary register (%r10), so perhaps the above could be implemented for NO_PROFILE_COUNTERS targets, where %r11 is also available.
Comment 10 Jakub Jelinek 2021-01-08 09:34:12 UTC
If we are emitting for nested functions
	pushq	%r10
1:	call	__fentry__
	popq	%r10
(is it ok to misalign the stack for __fentry__? but then even plain call __fentry__ actually misaligns it), then perhaps we can do similarly for the PIC case.  But I wonder how does __fentry__ then find the caller if it can't rely on the return address being right above the return address to the function that called __fentry__ (appart from unwind info of course, but we don't really emit .cfi_* directives here either, do we?).
Comment 11 GCC Commits 2021-01-08 12:54:04 UTC
The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:1b885264a48dcd71b7aeb26c0abeb91246724897

commit r11-6548-g1b885264a48dcd71b7aeb26c0abeb91246724897
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 7 14:27:49 2021 -0800

    x86-64: Use R10 for profiling large model
    
    R10 is caller-saved.  Although it can be used as a static chain register,
    it is preserved when calling mcount for nested functions.  Use R10 as a
    scratch register to call mcount in large model.
    
    gcc/
    
            PR target/98482
            * config/i386/i386.c (x86_function_profiler): Use R10 to call
            mcount in large model.  Sorry for large model with PIC.
    
    gcc/testsuite/
    
            PR target/98482
            * gcc.target/i386/pr98482-1.c: New test.
            * gcc.target/i386/pr98482-1.c: Likewise.
Comment 12 GCC Commits 2021-01-08 14:48:04 UTC
The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:76be18f442948d1a4bc49a7d670b07097f9e5983

commit r11-6552-g76be18f442948d1a4bc49a7d670b07097f9e5983
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Jan 8 05:20:19 2021 -0800

    x86-64: Use R10 and R11 for profiling large model with PIC
    
    For NO_PROFILE_COUNTERS targets, R11 is a scratch register.  We can use
    R10 and R11 to call mcount in large model with PIC.
    
    gcc/
    
            PR target/98482
            * config/i386/i386.c (x86_function_profiler): Use R10 and R11
            to call mcount in large model with PIC for NO_PROFILE_COUNTERS
            targets.
    
    gcc/testsuite/
    
            PR target/98482
            * gcc.target/i386/pr98482-2.c: Updated.
Comment 13 H.J. Lu 2021-01-08 14:49:45 UTC
Fixed for GCC 11 so far.  Please open a new GCC bug for mcount stack
alignment.
Comment 14 Uroš Bizjak 2021-01-08 14:50:57 UTC
(In reply to Jakub Jelinek from comment #10)
> If we are emitting for nested functions
> 	pushq	%r10
> 1:	call	__fentry__
> 	popq	%r10
> (is it ok to misalign the stack for __fentry__? but then even plain call
> __fentry__ actually misaligns it), then perhaps we can do similarly for the
> PIC case.  But I wonder how does __fentry__ then find the caller if it can't
> rely on the return address being right above the return address to the
> function that called __fentry__ (appart from unwind info of course, but we
> don't really emit .cfi_* directives here either, do we?).

Generic part of the compiler pushes static chain register for nested functions, so there is little we can do in the target part. If there is a problem with misaligned stack, then I think __mcount_internal will have to be realigned, because calls to both, mcount and __fentry__ can be misaligned.

I don't know what to do with __fentry__ argument. Luckily, mcount finds its argument via frame pointer, so it works there.
Comment 15 GCC Commits 2021-01-08 16:47:20 UTC
The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:745d04e796c1a7ebcea0185d0742d58b0c0030ab

commit r11-6557-g745d04e796c1a7ebcea0185d0742d58b0c0030ab
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Jan 8 08:41:38 2021 -0800

    x86-64: Require lp64 for PR target/98482 tests
    
    Require lp64 for PR target/98482 tests since -mcmodel=large is isn't
    supported for x32.
    
            PR target/98482
            * gcc.target/i386/pr98482-1.c: Require lp64.
            * gcc.target/i386/pr98482-2.c: Likewise.