struct s1 { unsigned char a, b; };
unsigned long f1(struct s1 x) { return x.a + x.b; }

struct s2 { unsigned a: 8, b: 8; };
unsigned long f2(struct s2 x) { return x.a + x.b; }

compiled with -O3 -mcpu=pca56 gives:

f1:
        lda sp,-16(sp)
        stw a0,0(sp)
        ldbu t1,0(sp)
        ldbu t2,1(sp)
        addq t1,t2,v0
        lda sp,16(sp)
        ret

f2:
        mov a0,t0
        and a0,0xff,t1
        extbl t0,0x1,t2
        addq t1,t2,v0
        ret

In the second case, gcc generates pretty much optimal code [1], whereas in
the first case it spills the struct to the stack just to re-read it in
single bytes. While this doesn't look too terrible, it gets really ugly
with -mcpu=ev4 (the default in most distributions), which has no byte
load/store instructions. It would be nice if code similar to the second
case could be generated for the first case as well.

[1] Well, except for the spurious mov, and it could generate
    zap a0,0x03,v0; perr a0,v0,v0, but I don't expect it to :)

Release: 3.2 20020607 (experimental)

Environment:
System: Linux borkum 2.4.18 #6 Wed Apr 24 22:18:43 CEST 2002 alpha unknown
Architecture: alpha

host: alphapca56-unknown-linux-gnu
build: alphapca56-unknown-linux-gnu
target: alphapca56-unknown-linux-gnu
configured with: ../configure --enable-languages=c
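As an aside, here is a hedged source-level workaround sketch (not from the
report; the function name, the memcpy idiom, and the little-endian
assumption are mine): copying the two-byte struct through an integer lets
the fields be extracted with a mask and a shift, mirroring what the
bit-field version f2 already allows the compiler to do. Whether a given
gcc version actually avoids the spill this way is untested here.

#include <string.h>

struct s1 { unsigned char a, b; };

/* Extract both bytes from an integer copy of the struct; a memcpy of a
   known two-byte size is normally folded to a simple move. The mask and
   shift assume a little-endian target such as Alpha. */
unsigned long f1_workaround(struct s1 x)
{
    unsigned short bits;
    memcpy(&bits, &x, sizeof bits);
    return (bits & 0xff) + (bits >> 8);   /* x.a + x.b */
}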
Falk,

I don't know Alpha assembly, but here's what gcc 3.3 branch produces.
I take it this is still not optimal?

Thanks,

Dara

        .set noat
        .set noreorder
        .arch pca56
        .text
        .align 2
        .align 4
        .globl f1
        .ent f1
$f1..ng:
f1:
        .frame $30,0,$26,0
        lda $30,-16($30)
        .prologue 0
        stw $16,0($30)
        ldbu $2,0($30)
        ldbu $3,1($30)
        addl $2,$3,$0
        lda $30,16($30)
        ret $31,($26),1
        .end f1
        .align 2
        .align 4
        .globl f2
        .ent f2
$f2..ng:
f2:
        .frame $30,0,$26,0
        .prologue 0
        bis $31,$16,$1
        and $16,0xff,$2
        extbl $1,1,$3
        addl $2,$3,$0
        ret $31,($26),1
        .end f2
        .ident "GCC: (GNU) 3.3.1 20030608 (prerelease)"
Subject: Re: Alpha: Access of bytes in struct parameters

"dhazeghi@yahoo.com" <gcc-bugzilla@gcc.gnu.org> writes:

> I don't know Alpha assembly, but here's what gcc 3.3 branch
> produces. I take it this is still not optimal?

Indeed (the code is even worse than it looks, because it will produce a
replay trap, so it can easily take 10-20 cycles). This seems to be a
well-known problem; it was recently discussed on the gcc list
(http://gcc.gnu.org/ml/gcc/2003-06/msg00629.html). Unfortunately, the
proposed idea of special-casing one field per register wouldn't help
here. However, the fact that gcc produces optimal code for the bit-field
case indicates that it might be possible to improve the situation without
large changes such as introducing new RTL.
Confirmed...
Still present in 4.1. As a guess, we're creating a BLKmode register
(because alignof(struct s1) is less than alignof(HImode)), and
assign_parm_setup_block forces the data onto the stack. It's possible
this can be bypassed for some structures, but I haven't given it much
thought.
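A hedged illustration of that guess (the type and function names are
mine, and the outcome is speculative, not verified here): if BLKmode is
chosen because the struct's alignment is below HImode's, then raising the
alignment to two bytes should let the parameter get HImode instead, so it
never goes through assign_parm_setup_block at all.

/* Assumes GCC attribute syntax; on a strict-alignment target like Alpha,
   a two-byte-aligned two-byte struct should qualify for HImode. */
struct s1_aligned { unsigned char a, b; } __attribute__((aligned(2)));

unsigned long f1_aligned(struct s1_aligned x)
{
    return x.a + x.b;   /* same body as f1 from the report */
}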
And there is nothing Alpha specific about this. Any target which passes
structures in registers can show it. For instance, ia64:

f1:
        .prologue
        .body
        .mmi
        st2 [r12] = r32
        nop 0
        mov r14 = r12
        ;;
        .mmi
        ld1 r15 = [r14], 1
        ;;
        ld1 r8 = [r14]
        nop 0
        ;;
        .mib
        nop 0
        add r8 = r8, r15
        br.ret.sptk.many b0
        .endp f1#
        .align 16
        .global f2#
        .proc f2#
f2:
        .prologue
        .body
        .mii
        nop 0
        extr.u r8 = r32, 8, 8
        zxt1 r32 = r32
        ;;
        .mib
        nop 0
        add r8 = r8, r32
        br.ret.sptk.many b0
One more testcase, from x86-64 / gcc-4.1:

float re(float _Complex a)
{
    return __real__ a;
}

is compiled to:

re:
        movq    %xmm0, -8(%rsp)
        movss   -8(%rsp), %xmm0
        ret

but can be optimized to just `movss %xmm0, %xmm0`.
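To spell out why a single move suffices: under the x86_64 ABI, a float
_Complex travels packed in one SSE register, real part in the low 32 bits.
A hedged sketch of the same extraction done by hand (the helper name is
mine; the memcpy calls are the usual type-pun idiom):

#include <stdint.h>
#include <string.h>

/* Manual version of __real__: the complex value is a 64-bit container
   with the real part in bits 0-31 and the imaginary part in bits 32-63.
   Compilers normally fold these memcpy calls into plain moves. */
float re_manual(float _Complex a)
{
    uint64_t bits;
    memcpy(&bits, &a, sizeof bits);
    uint32_t lo = (uint32_t)bits;       /* __real__ a */
    float r;
    memcpy(&r, &lo, sizeof r);
    return r;
}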
The ia64 testcase now seems to be optimized perfectly, as of at least
GCC 10, though the other cases look like they're still handled badly.
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:1ad584d538d349db13cfa8440222d91d5e9aff3f

commit r13-859-g1ad584d538d349db13cfa8440222d91d5e9aff3f
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Mon May 30 21:32:58 2022 +0100

    Allow SCmode and DImode to be tieable with TARGET_64BIT on x86_64.

    This patch is a form of insurance policy in case my patch for PR 7061
    runs into problems on non-x86 targets; the middle-end can add an extra
    check that the backend is happy placing SCmode and DImode values in
    the same register, before creating a SUBREG. Unfortunately,
    ix86_modes_tieable_p currently claims this is not allowed(?), even
    though the default target hook for modes_tieable_p is to always
    return true [i.e. false can be used to specifically prohibit bad
    combinations], and the x86_64 ABI passes SCmode values in DImode
    registers!

    This makes the backend's modes_tieable_p hook a little more
    forgiving, and additionally enables interconversion between SCmode
    and V2SFmode, and between DCmode and V2DFmode, which opens
    interesting opportunities in the future.

    2022-05-30  Roger Sayle  <roger@nextmovesoftware.com>

    gcc/ChangeLog
        * config/i386/i386.cc (ix86_modes_tieable_p): Allow SCmode to be
        tieable with DImode on TARGET_64BIT, and SCmode tieable with
        V2SFmode, and DCmode with V2DFmode.
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:1753a7120109c1d3b682f9487d6cca64fb2f0929

commit r13-1038-g1753a7120109c1d3b682f9487d6cca64fb2f0929
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Fri Jun 10 15:14:23 2022 +0100

    PR rtl-optimization/7061: Complex number arguments on x86_64-like ABIs.

    This patch addresses the issue in comment #6 of PR rtl-optimization/7061
    (a four-digit PR number) from 2006, where on x86_64 complex number
    arguments are unconditionally spilled to the stack. For the test cases
    below:

    float re(float _Complex a) { return __real__ a; }
    float im(float _Complex a) { return __imag__ a; }

    GCC with -O2 currently generates:

    re:
            movq    %xmm0, -8(%rsp)
            movss   -8(%rsp), %xmm0
            ret
    im:
            movq    %xmm0, -8(%rsp)
            movss   -4(%rsp), %xmm0
            ret

    with this patch we now generate:

    re:
            ret
    im:
            movq    %xmm0, %rax
            shrq    $32, %rax
            movd    %eax, %xmm0
            ret

    [Technically, this shift can be performed on %xmm0 in a single
    instruction, but the backend needs to be taught to do that; the
    important bit is that the SCmode argument isn't written to the stack.]

    The patch itself is to emit_group_store: just before RTL expansion
    commits to writing to the stack, we check whether the store group
    consists of a single scalar integer register that holds a complex-mode
    value; on x86_64, SCmode arguments are passed in DImode registers. If
    this is the case, we can use a SUBREG to "view_convert" the integer
    to the equivalent complex mode.

    An interesting corner case that showed up during testing is that
    x86_64 also passes HCmode arguments in DImode registers(!), i.e.
    using modes of different sizes. This is easily handled by first
    converting to an integer mode of the correct size, and then
    generating a complex mode SUBREG of that. This is similar in concept
    to the patch I proposed here:
    https://gcc.gnu.org/pipermail/gcc-patches/2022-February/590139.html

    2022-06-10  Roger Sayle  <roger@nextmovesoftware.com>

    gcc/ChangeLog
        PR rtl-optimization/7061
        * expr.cc (emit_group_store): For groups that consist of a single
        scalar integer register that holds a complex mode value, use
        gen_lowpart to generate a SUBREG to "view_convert" to the complex
        mode.  For modes of different sizes, first convert to an integer
        mode of the appropriate size.

    gcc/testsuite/ChangeLog
        PR rtl-optimization/7061
        * gcc.target/i386/pr7061-1.c: New test case.
        * gcc.target/i386/pr7061-2.c: New test case.
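The HCmode corner case mentioned above can be seen with a one-line
function; this is an illustrative example rather than the committed test
case, and it assumes a compiler and target with _Float16 support:

/* HCmode holds only 32 bits of data, yet the x86_64 calling convention
   still passes it in a 64-bit DImode register, exercising the
   size-mismatch path that first narrows to a 32-bit integer mode. */
_Float16 re16(_Float16 _Complex a)
{
    return __real__ a;
}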
llvm emits just:

im:                                     # @im
        shufps  xmm0, xmm0, 85          # xmm0 = xmm0[1,1,1,1]
        ret
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:64d4f27a0ce47e97867512bda7fa5683acf8a134

commit r13-1282-g64d4f27a0ce47e97867512bda7fa5683acf8a134
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Mon Jun 27 07:47:40 2022 +0100

    Implement __imag__ of float _Complex using shufps on x86_64.

    This patch is a follow-up improvement to my recent patch for
    PR rtl-optimization/7061. That patch added the test case
    gcc.target/i386/pr7061-2.c:

    float im(float _Complex a) { return __imag__ a; }

    for which GCC on x86_64 currently generates:

    im:
            movq    %xmm0, %rax
            shrq    $32, %rax
            movd    %eax, %xmm0
            ret

    but with this patch we now generate (the same as LLVM):

    im:
            shufps  $85, %xmm0, %xmm0
            ret

    This is achieved by providing a define_insn_and_split that allows a
    truncated lshiftrt:DI by 32 to be performed on either SSE or general
    registers: if the register allocator prefers to use SSE, we split to
    shufps_v4si, and if not, we use a regular shrq.

    2022-06-27  Roger Sayle  <roger@nextmovesoftware.com>

    gcc/ChangeLog
        PR rtl-optimization/7061
        * config/i386/i386.md (*highpartdisi2): New define_insn_and_split.

    gcc/testsuite/ChangeLog
        PR rtl-optimization/7061
        * gcc.target/i386/pr7061-2.c: Update to look for shufps.
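For readers who don't speak x86 assembly, the new sequence corresponds to
the following intrinsics; this is an illustrative equivalent rather than
GCC's implementation, and the function name is made up. The shuffle
immediate 85 is 0b01010101, which selects lane 1 (the imaginary part) for
all four result lanes.

#include <xmmintrin.h>

/* Equivalent of "shufps $85, %xmm0, %xmm0; ret": broadcast lane 1 so the
   imaginary part lands in the low scalar lane of the result. */
float im_via_shufps(__m128 v)
{
    __m128 t = _mm_shuffle_ps(v, v, 0x55);
    return _mm_cvtss_f32(t);
}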