This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: [i386] Scalar DImode instructions on XMM registers
- From: Ilya Enkovich <enkovich dot gnu at gmail dot com>
- To: Vladimir Makarov <vmakarov at redhat dot com>
- Cc: GCC Development <gcc at gcc dot gnu dot org>, Uros Bizjak <ubizjak at gmail dot com>, Richard Henderson <rth at redhat dot com>, Jan Hubicka <hubicka at ucw dot cz>, Jeff Law <law at redhat dot com>
- Date: Mon, 18 May 2015 15:13:44 +0300
- Subject: Re: [i386] Scalar DImode instructions on XMM registers
- References: <CAMbmDYYT6zE86-xAYs08VV2nWDK6Np+qEYoj+6oGM276MtBuPQ at mail dot gmail dot com> <CAFULd4YVruAT=RHgENhBcuKZgE6FvRa=8aR6WygKm9F4GjnJyg at mail dot gmail dot com> <CAFULd4aycTg3bYKx7c9GXpgiY4WeqmLh1f5HFYL6K+K35QmTWA at mail dot gmail dot com> <CAMbmDYaDrCnDCnQfP0toV87pi_mE_pbPCP6M-FEkGNDAtWKFUA at mail dot gmail dot com> <CAFULd4amXWDT45oUNqi2cLL2Tec-kMJm7Kz301myZSWZw-3H7Q at mail dot gmail dot com> <alpine dot DEB dot 2 dot 11 dot 1504241222020 dot 1687 at laptop-mg dot saclay dot inria dot fr> <CAMbmDYYfq-RVYa0MwrGH_DpnV7psPHKZpxaouMuq_nsOPeO_ug at mail dot gmail dot com> <20150425013239 dot GB719 at atrey dot karlin dot mff dot cuni dot cz> <CAMbmDYbN7Zk9gg=UNRP3O8L8e5qxiK6jXi-SLEVDoMmBbqLXFQ at mail dot gmail dot com>
2015-05-06 17:18 GMT+03:00 Ilya Enkovich <enkovich.gnu@gmail.com>:
> 2015-04-25 4:32 GMT+03:00 Jan Hubicka <hubicka@ucw.cz>:
>> Hi,
>> I am adding Vladimir and Richard to CC. I tried to solve a similar problem
>> with FP math years ago by having -mfpmath=sse,i387. The idea was to allow
>> use of i387 registers when SSE ones run out and possibly also model the fact
>> that Pentium4 had faster i387 additions than SSE additions. I also had some
>> plans to extend this to mixed SSE/MMX/GPR integer arithmetic, but never
>> got to that.
>>
>> This did not really fly because the regalloc was not really able to
>> understand it (I made a patch to regclass to propagate the classes and figure
>> out which operations need to stay in i387 and which in SSE to avoid reloading,
>> but that never got in).
>>
>> I believe Vladimir did some work on this with IRA (it is able to spill GPR
>> regs into SSE and do a few other tricks).
>>
>> Also I believe it was kind of Richard's design decision to avoid the use of
>> (paradoxical) subregs for vector conversions because these have funny
>> implications.
>>
>> The code for handling upper parts of paradoxical subregs is controlled by
>> macros around SUBREG_PROMOTED_VAR_P but I do not think it will handle
>> V1DI->V2DI conversions fluently without some middle-end hacking. (it will
>> probably try to produce zero extensions)
>>
>> While we are on SSE instructions, it would be great to finally teach
>> copy_by_pieces/store_by_pieces to use vector instructions (these are more
>> compact and either equally fast or faster on some CPUs). I hope to get into
>> this, but it would be great if someone beat me to it.
>>
>> Honza
>>
>
> I'm trying to implement it as a separate RTL pass which chooses a
> scalar/vector mode for each 64-bit computation chain and performs the
> transformation if we choose to use vectors. I also want to split DI
> instructions which are going to be implemented on GPRs before RA
> (currently it is done in the second split pass). What a good metric for
> such a transformation would be is a big question, but currently I can't
> even make it generate correct code when paradoxical subregs are used. It
> works in simple cases but I run into trouble when spills appear.
>
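As a rough illustration of what such a pass aims for, the 64-bit AND/OR chain from the testcase can be written by hand with SSE2 intrinsics. This is only a sketch of the intended result, not the pass itself: the function name `chain_or_and` is made up for this example, and the scalar fallback is there only so that the snippet still builds where SSE2 is not enabled.

```c
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Hypothetical helper: computes arr[0] | (arr[1] & arr[2]) on 64-bit values. */
uint64_t chain_or_and(const uint64_t *arr)
{
#ifdef __SSE2__
    /* movq loads: the low 64 bits hold the value, the upper 64 bits are zeroed. */
    __m128i a = _mm_loadl_epi64((const __m128i *)&arr[0]);
    __m128i b = _mm_loadl_epi64((const __m128i *)&arr[1]);
    __m128i c = _mm_loadl_epi64((const __m128i *)&arr[2]);

    /* pand followed by por, both on full XMM registers. */
    __m128i r = _mm_or_si128(a, _mm_and_si128(b, c));

    /* movq store of the low 64-bit lane back to memory. */
    uint64_t out;
    _mm_storel_epi64((__m128i *)&out, r);
    return out;
#else
    /* Scalar fallback so the example builds without SSE2. */
    return arr[0] | (arr[1] & arr[2]);
#endif
}
```

On a 32-bit target the scalar form needs two GPR operations per 64-bit AND or OR, which is why keeping the whole chain in one XMM register looks attractive.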
> Trying to beat the following testcase:
>
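> extern void counter (unsigned long long);  /* assumed declaration; omitted in the original */
>
> void
> test (long long *arr)
> {
>   register unsigned long long tmp;
>   tmp = arr[0] | arr[1] & arr[2];
>   while (tmp)
>     {
>       counter (tmp);
>       tmp = *(arr++) & tmp;
>     }
> }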
>
> RTL I generate seems OK to me (ignoring the fact that it is not optimal):
>
> (insn 6 3 50 2 (set (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ])
> (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
> (const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D)
> + 8B]+0 S8 A64])) pr65105-1.c:22 89 {*movdi_internal}
> (nil))
> (insn 50 6 7 2 (set (reg:DI 104)
> (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
> (const_int 16 [0x10])) [2 MEM[(long long int
> *)arr_5(D) + 16B]+0 S8 A64])) pr65105-1.c:22 -1
> (nil))
> (insn 7 50 51 2 (set (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
> (and:V2DI (subreg:V2DI (reg:DI 98 [ MEM[(long long int
> *)arr_5(D) + 8B] ]) 0)
> (subreg:V2DI (reg:DI 104) 0))) pr65105-1.c:22 3487 {*andv2di3}
> (expr_list:REG_DEAD (subreg:V2DI (reg:DI 98 [ MEM[(long long int
> *)arr_5(D) + 8B] ]) 0)
> (expr_list:REG_UNUSED (reg:CC 17 flags)
> (expr_list:REG_EQUAL (and:DI (mem:DI (plus:SI (reg/v/f:SI
> 96 [ arr ])
> (const_int 8 [0x8])) [2 MEM[(long long int
> *)arr_5(D) + 8B]+0 S8 A64])
> (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
> (const_int 16 [0x10])) [2 MEM[(long long
> int *)arr_5(D) + 16B]+0 S8 A64]))
> (nil)))))
> (insn 51 7 8 2 (set (reg:DI 105)
> (mem:DI (reg/v/f:SI 96 [ arr ]) [2 *arr_5(D)+0 S8 A64]))
> pr65105-1.c:22 -1
> (nil))
> (insn 8 51 46 2 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
> (ior:V2DI (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
> (subreg:V2DI (reg:DI 105) 0))) pr65105-1.c:22 3489 {*iorv2di3}
> (expr_list:REG_DEAD (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
> (expr_list:REG_UNUSED (reg:CC 17 flags)
> (nil))))
> (insn 46 8 47 2 (set (reg:V2DI 103)
> (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:22 -1
> (nil))
> (insn 47 46 48 2 (set (subreg:SI (reg:DI 101) 0)
> (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
> (nil))
> (insn 48 47 49 2 (set (reg:V2DI 103)
> (lshiftrt:V2DI (reg:V2DI 103)
> (const_int 32 [0x20]))) pr65105-1.c:22 -1
> (nil))
> (insn 49 48 9 2 (set (subreg:SI (reg:DI 101) 4)
> (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
> (nil))
> (note 9 49 10 2 NOTE_INSN_DELETED)
> (insn 10 9 11 2 (parallel [
> (set (reg:CCZ 17 flags)
> (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4)
> (subreg:SI (reg:DI 101) 0))
> (const_int 0 [0])))
> (clobber (scratch:SI))
> ]) pr65105-1.c:23 447 {*iorsi_3}
> (nil))
> (jump_insn 11 10 37 2 (set (pc)
> (if_then_else (ne (reg:CCZ 17 flags)
> (const_int 0 [0]))
> (label_ref:SI 37)
> (pc))) pr65105-1.c:23 619 {*jcc_1}
> (expr_list:REG_DEAD (reg:CCZ 17 flags)
> (int_list:REG_BR_PROB 9100 (nil)))
> -> 37)
> (code_label 37 11 36 3 11 "" [2 uses])
> (note 36 37 18 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
> (insn 18 36 19 3 (set (mem:DI (reg/f:SI 7 sp) [0 S8 A32])
> (reg/v:DI 87 [ tmp ])) pr65105-1.c:25 89 {*movdi_internal}
> (nil))
> (call_insn 19 18 20 3 (call (mem:QI (symbol_ref:SI ("counter") [flags
> 0x3] <function_decl 0x7f94046ea798 counter>) [0 counter S1 A8])
> (const_int 8 [0x8])) pr65105-1.c:25 666 {*call}
> (expr_list:REG_CALL_DECL (symbol_ref:SI ("counter") [flags 0x3]
> <function_decl 0x7f94046ea798 counter>)
> (expr_list:REG_EH_REGION (const_int 0 [0])
> (nil)))
> (expr_list:DI (use (mem:DI (reg/f:SI 7 sp) [0 S8 A32]))
> (nil)))
> (insn 20 19 52 3 (parallel [
> (set (reg/v/f:SI 96 [ arr ])
> (plus:SI (reg/v/f:SI 96 [ arr ])
> (const_int 8 [0x8])))
> (clobber (reg:CC 17 flags))
> ]) pr65105-1.c:26 220 {*addsi_1}
> (expr_list:REG_UNUSED (reg:CC 17 flags)
> (nil)))
> (insn 52 20 21 3 (set (reg:DI 106)
> (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
> (const_int -8 [0xfffffffffffffff8])) [2 MEM[base:
> arr_14, offset: 4294967288B]+0 S8 A64])) pr65105-1.c:26 -1
> (nil))
> (insn 21 52 42 3 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
> (and:V2DI (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
> (subreg:V2DI (reg:DI 106) 0))) pr65105-1.c:26 3487 {*andv2di3}
> (expr_list:REG_UNUSED (reg:CC 17 flags)
> (nil)))
> (insn 42 21 43 3 (set (reg:V2DI 102)
> (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:26 -1
> (nil))
> (insn 43 42 44 3 (set (subreg:SI (reg:DI 101) 0)
> (subreg:SI (reg:V2DI 102) 0)) pr65105-1.c:26 -1
> (nil))
> (insn 44 43 45 3 (set (reg:V2DI 102)
> (lshiftrt:V2DI (reg:V2DI 102)
> (const_int 32 [0x20]))) pr65105-1.c:26 -1
> (nil))
> (insn 45 44 23 3 (set (subreg:SI (reg:DI 101) 4)
> (subreg:SI (reg:V2DI 102) 0)) pr65105-1.c:26 -1
> (nil))
> (note 23 45 24 3 NOTE_INSN_DELETED)
> (insn 24 23 25 3 (parallel [
> (set (reg:CCZ 17 flags)
> (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4)
> (subreg:SI (reg:DI 101) 0))
> (const_int 0 [0])))
> (clobber (scratch:SI))
> ]) pr65105-1.c:23 447 {*iorsi_3}
> (nil))
> (jump_insn 25 24 30 3 (set (pc)
> (if_then_else (ne (reg:CCZ 17 flags)
> (const_int 0 [0]))
> (label_ref:SI 37)
> (pc))) pr65105-1.c:23 619 {*jcc_1}
> (expr_list:REG_DEAD (reg:CCZ 17 flags)
> (int_list:REG_BR_PROB 9100 (nil)))
> -> 37)
>
>
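The sequence in insns 46-49 together with the compare in insn 10 (copy to a vector register, take the low 32 bits, shift right by 32, take the low 32 bits again, then OR the two halves to test for zero) can be modeled in plain C roughly as below. The `v2di` struct and both function names are illustrative only. The upper lane is kept in the model to show that a paradoxical subreg leaves it unspecified, so the result must not depend on it.

```c
#include <stdint.h>

/* Illustrative model of a V2DI register: two 64-bit lanes. */
typedef struct { uint64_t lane[2]; } v2di;

/* Models (subreg:V2DI (reg:DI x) 0): only lane 0 is meaningful.
   The upper lane is unspecified; 'junk' stands in for whatever was there. */
static v2di paradoxical_v2di(uint64_t x, uint64_t junk)
{
    v2di v = { { x, junk } };
    return v;
}

/* Models insns 46-49 plus the zero test of insn 10. */
static int di_is_nonzero(v2di v)
{
    uint32_t lo = (uint32_t)v.lane[0];   /* low SI part of the vector register */
    v.lane[0] >>= 32;                    /* lshiftrt:V2DI by 32 shifts each lane */
    v.lane[1] >>= 32;
    uint32_t hi = (uint32_t)v.lane[0];   /* low SI part again, now the old high half */
    return (lo | hi) != 0;               /* ior:SI compared against zero */
}
```

Whatever sits in the upper lane never reaches `lo` or `hi`, which matches the expectation that the extraction sequence itself is safe.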
> r87 [tmp] has one definition before the loop (insn 8) and one
> definition in the loop (insn 21). But after reload I see that the result
> of insn 8 is stored to the stack and this stored value is used in the loop.
> The value produced by insn 21 is not stored to the stack, and therefore
> a wrong value is used starting from the second loop iteration. Here is
> the resulting assembler:
>
> test:
> .LFB10:
> .cfi_startproc
> pushl %ebx
> .cfi_def_cfa_offset 8
> .cfi_offset 3, -8
> leal -40(%esp), %esp
> .cfi_def_cfa_offset 48
> movl 48(%esp), %ebx
> movq 8(%ebx), %xmm1
> movq 16(%ebx), %xmm0
> pand %xmm1, %xmm0
> movq (%ebx), %xmm1
> movdqa %xmm0, %xmm4
> por %xmm1, %xmm4
> movdqa %xmm4, %xmm0
> movd %xmm4, %edx
> **movq %xmm4, 16(%esp)**
> psrlq $32, %xmm0
> movd %xmm0, %eax
> orl %edx, %eax
> je .L7
> .p2align 4,,15
> .L11:
> **movl 16(%esp), %eax**
> addl $8, %ebx
> **movl 20(%esp), %edx**
> movl %eax, (%esp)
> movl %edx, 4(%esp)
> call counter
> movq -8(%ebx), %xmm0
> **movdqa 16(%esp), %xmm2**
> pand %xmm0, %xmm2
> movdqa %xmm2, %xmm0
> movd %xmm2, %edx
> psrlq $32, %xmm0
> movd %xmm0, %eax
> orl %edx, %eax
> jne .L11
> .L7:
> leal 40(%esp), %esp
> .cfi_def_cfa_offset 8
> popl %ebx
> .cfi_restore 3
> .cfi_def_cfa_offset 4
> ret
>
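To make the failure mode concrete, here is a small C model of the loop in the listing above. It is a simulation under simplifying assumptions, not generated code: `slot` stands for the stack slot at 16(%esp), `reg` for the XMM register holding tmp, and the hypothetical `store_back` flag controls whether the value from insn 21 is written back to the slot. The call to counter is replaced by counting iterations, and `n` bounds the walk over the array.

```c
#include <stdint.h>

/* Returns how many times the loop body runs, i.e. how often counter()
   would be called. n bounds the walk so the model always terminates. */
static unsigned run_loop(const uint64_t *arr, unsigned n, int store_back)
{
    uint64_t reg = arr[0] | (arr[1] & arr[2]);  /* insn 8 */
    uint64_t slot = reg;                        /* movq %xmm4, 16(%esp) */
    unsigned calls = 0, i = 0;

    while (reg != 0 && i < n) {
        calls++;                                /* counter() is passed the slot value */
        reg = slot & arr[i++];                  /* movdqa 16(%esp), then pand */
        if (store_back)
            slot = reg;                         /* the store missing after reload */
    }
    return calls;
}
```

With `arr = {3, 1, 2, 0}` and `n = 4`, the model runs the body 3 times when the slot is updated and 4 times when it is not, because the stale slot keeps feeding the original value of tmp into the AND.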
> Am I misusing paradoxical subregs? Is there any other way to mix scalar
> and vector code and perform vector casts?
>
> BTW this test works OK with another set of options, when r87 is not
> spilled to memory but is instead preserved in GPRs across the call.
>
> Thanks,
> Ilya
Hi Vladimir,
Could you please comment on this?
Thanks,
Ilya