struct s1 { unsigned char a, b; };
unsigned long f1(struct s1 x) { return x.a + x.b; }

struct s2 { unsigned a: 8, b: 8; };
unsigned long f2(struct s2 x) { return x.a + x.b; }

compiled with -O3 -mcpu=pca56 gives:

f1:
        lda sp,-16(sp)
        stw a0,0(sp)
        ldbu t1,0(sp)
        ldbu t2,1(sp)
        addq t1,t2,v0
        lda sp,16(sp)
        ret

f2:
        mov a0,t0
        and a0,0xff,t1
        extbl t0,0x1,t2
        addq t1,t2,v0
        ret

In the second case, gcc generates pretty much optimal code [1], whereas in
the first case it spills the struct to the stack just to re-read it in
single bytes. While this doesn't look too terrible, it gets really ugly
with -mcpu=ev4 (the default in most distributions), which has no byte
load/store instructions. It would be nice if code similar to the second
case could be generated for the first case as well.

[1] Well, except for the spurious mov, and it could generate
    zap a0,0x03,v0; perr a0,v0,v0, but I don't expect it to :)

Release: 3.2 20020607 (experimental)

Environment:
System: Linux borkum 2.4.18 #6 Wed Apr 24 22:18:43 CEST 2002 alpha unknown
Architecture: alpha

host: alphapca56-unknown-linux-gnu
build: alphapca56-unknown-linux-gnu
target: alphapca56-unknown-linux-gnu
configured with: ../configure --enable-languages=c
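As an aside, here is a hedged source-level workaround sketch (not from the
report; the function name, the memcpy idiom, and the little-endian
assumption are mine): copying the two-byte struct through an integer lets
the fields be extracted with a mask and a shift, mirroring what the
bit-field version f2 already allows the compiler to do. Whether a given
gcc version actually avoids the spill this way is untested here.

#include <string.h>

struct s1 { unsigned char a, b; };

/* Extract both bytes from an integer copy of the struct; a memcpy of a
   known two-byte size is normally folded to a simple move. The mask and
   shift assume a little-endian target such as Alpha. */
unsigned long f1_workaround(struct s1 x)
{
    unsigned short bits;
    memcpy(&bits, &x, sizeof bits);
    return (bits & 0xff) + (bits >> 8);   /* x.a + x.b */
}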
Falk,

I don't know Alpha assembly, but here's what gcc 3.3 branch produces.
I take it this is still not optimal?

Thanks,

Dara

        .set noat
        .set noreorder
        .arch pca56
        .text
        .align 2
        .align 4
        .globl f1
        .ent f1
$f1..ng:
f1:
        .frame $30,0,$26,0
        lda $30,-16($30)
        .prologue 0
        stw $16,0($30)
        ldbu $2,0($30)
        ldbu $3,1($30)
        addl $2,$3,$0
        lda $30,16($30)
        ret $31,($26),1
        .end f1
        .align 2
        .align 4
        .globl f2
        .ent f2
$f2..ng:
f2:
        .frame $30,0,$26,0
        .prologue 0
        bis $31,$16,$1
        and $16,0xff,$2
        extbl $1,1,$3
        addl $2,$3,$0
        ret $31,($26),1
        .end f2
        .ident "GCC: (GNU) 3.3.1 20030608 (prerelease)"
Subject: Re: Alpha: Access of bytes in struct parameters

"dhazeghi@yahoo.com" <gcc-bugzilla@gcc.gnu.org> writes:

> I don't know Alpha assembly, but here's what gcc 3.3 branch
> produces. I take it this is still not optimal?

Indeed (the code is even worse than it looks, because it will produce a
replay trap, so it can easily take 10-20 cycles). This seems to be a
well-known problem; it was recently discussed on the gcc list
(http://gcc.gnu.org/ml/gcc/2003-06/msg00629.html). Unfortunately, the
proposed idea of special-casing one field per register wouldn't help
here. However, the fact that gcc produces optimal code for the bit-field
case indicates that it might be possible to improve the situation without
large changes such as introducing new RTL.
Confirmed...
Still present in 4.1. As a guess, we're creating a BLKmode register
(because alignof(struct s1) is less than alignof(HImode)), and
assign_parm_setup_block forces the data onto the stack. It's possible
this can be bypassed for some structures, but I haven't given it much
thought.
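A hedged illustration of that guess (the type and function names are
mine, and the outcome is speculative, not verified here): if BLKmode is
chosen because the struct's alignment is below HImode's, then raising the
alignment to two bytes should let the parameter get HImode instead, so it
never goes through assign_parm_setup_block at all.

/* Assumes GCC attribute syntax; on a strict-alignment target like Alpha,
   a two-byte-aligned two-byte struct should qualify for HImode. */
struct s1_aligned { unsigned char a, b; } __attribute__((aligned(2)));

unsigned long f1_aligned(struct s1_aligned x)
{
    return x.a + x.b;   /* same body as f1 from the report */
}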
And there is nothing Alpha specific about this. Any target which passes
structures in registers can show it. For instance, ia64:

f1:
        .prologue
        .body
        .mmi
        st2 [r12] = r32
        nop 0
        mov r14 = r12
        ;;
        .mmi
        ld1 r15 = [r14], 1
        ;;
        ld1 r8 = [r14]
        nop 0
        ;;
        .mib
        nop 0
        add r8 = r8, r15
        br.ret.sptk.many b0
        .endp f1#
        .align 16
        .global f2#
        .proc f2#
f2:
        .prologue
        .body
        .mii
        nop 0
        extr.u r8 = r32, 8, 8
        zxt1 r32 = r32
        ;;
        .mib
        nop 0
        add r8 = r8, r32
        br.ret.sptk.many b0
One more testcase, from x86-64 / gcc-4.1:

float re(float _Complex a)
{
    return __real__ a;
}

is compiled to:

re:
        movq    %xmm0, -8(%rsp)
        movss   -8(%rsp), %xmm0
        ret

but can be optimized to just `movss %xmm0, %xmm0`.
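To spell out why a single move suffices: under the x86_64 ABI, a float
_Complex travels packed in one SSE register, real part in the low 32 bits.
A hedged sketch of the same extraction done by hand (the helper name is
mine; the memcpy calls are the usual type-pun idiom):

#include <stdint.h>
#include <string.h>

/* Manual version of __real__: the complex value is a 64-bit container
   with the real part in bits 0-31 and the imaginary part in bits 32-63.
   Compilers normally fold these memcpy calls into plain moves. */
float re_manual(float _Complex a)
{
    uint64_t bits;
    memcpy(&bits, &a, sizeof bits);
    uint32_t lo = (uint32_t)bits;       /* __real__ a */
    float r;
    memcpy(&r, &lo, sizeof r);
    return r;
}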
The ia64 testcase now seems to be optimized perfectly, as of at least
GCC 10, though the other cases look like they're still handled badly.
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:1ad584d538d349db13cfa8440222d91d5e9aff3f

commit r13-859-g1ad584d538d349db13cfa8440222d91d5e9aff3f
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Mon May 30 21:32:58 2022 +0100

    Allow SCmode and DImode to be tieable with TARGET_64BIT on x86_64.

    This patch is a form of insurance policy in case my patch for PR 7061
    runs into problems on non-x86 targets; the middle-end can add an extra
    check that the backend is happy placing SCmode and DImode values in
    the same register, before creating a SUBREG. Unfortunately,
    ix86_modes_tieable_p currently claims this is not allowed(?), even
    though the default target hook for modes_tieable_p is to always
    return true [i.e. false can be used to specifically prohibit bad
    combinations], and the x86_64 ABI passes SCmode values in DImode
    registers!

    This makes the backend's modes_tieable_p hook a little more
    forgiving, and additionally enables interconversion between SCmode
    and V2SFmode, and between DCmode and V2DFmode, which opens
    interesting opportunities in the future.

    2022-05-30  Roger Sayle  <roger@nextmovesoftware.com>

    gcc/ChangeLog
        * config/i386/i386.cc (ix86_modes_tieable_p): Allow SCmode to be
        tieable with DImode on TARGET_64BIT, and SCmode tieable with
        V2SFmode, and DCmode with V2DFmode.
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:1753a7120109c1d3b682f9487d6cca64fb2f0929

commit r13-1038-g1753a7120109c1d3b682f9487d6cca64fb2f0929
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Fri Jun 10 15:14:23 2022 +0100

    PR rtl-optimization/7061: Complex number arguments on x86_64-like ABIs.

    This patch addresses the issue in comment #6 of PR rtl-optimization/7061
    (a four-digit PR number) from 2006, where on x86_64 complex number
    arguments are unconditionally spilled to the stack. For the test cases
    below:

    float re(float _Complex a) { return __real__ a; }
    float im(float _Complex a) { return __imag__ a; }

    GCC with -O2 currently generates:

    re:
            movq    %xmm0, -8(%rsp)
            movss   -8(%rsp), %xmm0
            ret
    im:
            movq    %xmm0, -8(%rsp)
            movss   -4(%rsp), %xmm0
            ret

    with this patch we now generate:

    re:
            ret
    im:
            movq    %xmm0, %rax
            shrq    $32, %rax
            movd    %eax, %xmm0
            ret

    [Technically, this shift can be performed on %xmm0 in a single
    instruction, but the backend needs to be taught to do that; the
    important bit is that the SCmode argument isn't written to the stack.]

    The patch itself is to emit_group_store: just before RTL expansion
    commits to writing to the stack, we check whether the store group
    consists of a single scalar integer register that holds a complex-mode
    value; on x86_64, SCmode arguments are passed in DImode registers. If
    this is the case, we can use a SUBREG to "view_convert" the integer
    to the equivalent complex mode.

    An interesting corner case that showed up during testing is that
    x86_64 also passes HCmode arguments in DImode registers(!), i.e.
    using modes of different sizes. This is easily handled by first
    converting to an integer mode of the correct size, and then
    generating a complex mode SUBREG of that. This is similar in concept
    to the patch I proposed here:
    https://gcc.gnu.org/pipermail/gcc-patches/2022-February/590139.html

    2022-06-10  Roger Sayle  <roger@nextmovesoftware.com>

    gcc/ChangeLog
        PR rtl-optimization/7061
        * expr.cc (emit_group_store): For groups that consist of a single
        scalar integer register that holds a complex mode value, use
        gen_lowpart to generate a SUBREG to "view_convert" to the complex
        mode.  For modes of different sizes, first convert to an integer
        mode of the appropriate size.

    gcc/testsuite/ChangeLog
        PR rtl-optimization/7061
        * gcc.target/i386/pr7061-1.c: New test case.
        * gcc.target/i386/pr7061-2.c: New test case.
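The HCmode corner case mentioned above can be seen with a one-line
function; this is an illustrative example rather than the committed test
case, and it assumes a compiler and target with _Float16 support:

/* HCmode holds only 32 bits of data, yet the x86_64 calling convention
   still passes it in a 64-bit DImode register, exercising the
   size-mismatch path that first narrows to a 32-bit integer mode. */
_Float16 re16(_Float16 _Complex a)
{
    return __real__ a;
}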
llvm emits just:

im:                                     # @im
        shufps  xmm0, xmm0, 85          # xmm0 = xmm0[1,1,1,1]
        ret
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:64d4f27a0ce47e97867512bda7fa5683acf8a134

commit r13-1282-g64d4f27a0ce47e97867512bda7fa5683acf8a134
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Mon Jun 27 07:47:40 2022 +0100

    Implement __imag__ of float _Complex using shufps on x86_64.

    This patch is a follow-up improvement to my recent patch for
    PR rtl-optimization/7061. That patch added the test case
    gcc.target/i386/pr7061-2.c:

    float im(float _Complex a) { return __imag__ a; }

    for which GCC on x86_64 currently generates:

    im:
            movq    %xmm0, %rax
            shrq    $32, %rax
            movd    %eax, %xmm0
            ret

    but with this patch we now generate (the same as LLVM):

    im:
            shufps  $85, %xmm0, %xmm0
            ret

    This is achieved by providing a define_insn_and_split that allows a
    truncated lshiftrt:DI by 32 to be performed on either SSE or general
    registers: if the register allocator prefers to use SSE, we split to
    shufps_v4si, and if not, we use a regular shrq.

    2022-06-27  Roger Sayle  <roger@nextmovesoftware.com>

    gcc/ChangeLog
        PR rtl-optimization/7061
        * config/i386/i386.md (*highpartdisi2): New define_insn_and_split.

    gcc/testsuite/ChangeLog
        PR rtl-optimization/7061
        * gcc.target/i386/pr7061-2.c: Update to look for shufps.
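For readers who don't speak x86 assembly, the new sequence corresponds to
the following intrinsics; this is an illustrative equivalent rather than
GCC's implementation, and the function name is made up. The shuffle
immediate 85 is 0b01010101, which selects lane 1 (the imaginary part) for
all four result lanes.

#include <xmmintrin.h>

/* Equivalent of "shufps $85, %xmm0, %xmm0; ret": broadcast lane 1 so the
   imaginary part lands in the low scalar lane of the result. */
float im_via_shufps(__m128 v)
{
    __m128 t = _mm_shuffle_ps(v, v, 0x55);
    return _mm_cvtss_f32(t);
}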