This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Re: [i386] Scalar DImode instructions on XMM registers

From: Ilya Enkovich <enkovich dot gnu at gmail dot com>
To: Vladimir Makarov <vmakarov at redhat dot com>
Cc: GCC Development <gcc at gcc dot gnu dot org>, Uros Bizjak <ubizjak at gmail dot com>, Richard Henderson <rth at redhat dot com>, Jan Hubicka <hubicka at ucw dot cz>, Jeff Law <law at redhat dot com>
Date: Mon, 18 May 2015 15:13:44 +0300
Subject: Re: [i386] Scalar DImode instructions on XMM registers

2015-05-06 17:18 GMT+03:00 Ilya Enkovich <enkovich.gnu@gmail.com>:
> 2015-04-25 4:32 GMT+03:00 Jan Hubicka <hubicka@ucw.cz>:
>> Hi,
>> I am adding Vladimir and Richard to CC. I tried to solve a similar problem
>> with FP math years ago by having -mfpmath=sse,i387. The idea was to allow
>> use of i387 registers when SSE ones run out and possibly also model the fact
>> that Pentium4 had faster i387 additions than SSE additions. I also had some
>> plans to extend this to mixed SSE/MMX/GPR integer arithmetic, but never
>> got to that.
>>
>> This did not really fly because the regalloc was not really able to
>> understand it (I made a patch to regclass to propagate the classes and figure out
>> which operations need to stay in i387 and which in SSE to avoid reloading, but
>> that never got in).
>>
>> I believe Vladimir did some work on this with IRA (he is able to spill GPR
>> regs into SSE and do a bit of other tricks).
>>
>> Also I believe it was kind of Richard's design decision to avoid use of
>> (paradoxical) subregs for vector conversions because these have funny
>> implications.
>>
>> The code for handling upper parts of paradoxical subregs is controlled by
>> macros around SUBREG_PROMOTED_VAR_P but I do not think it will handle
>> V1DI->V2DI conversions fluently without some middle-end hacking. (it will
>> probably try to produce zero extensions)
>>
>> While we are on SSE instructions, it would be great to finally teach
>> copy_by_pieces/store_by_pieces to use vector instructions (these are more
>> compact and either equally fast or faster on some CPUs). I hope to get into
>> this, but it would be great if someone beat me to it.
>>
>> Honza
>>
>
> I'm trying to implement it as a separate RTL pass which chooses a
> scalar/vector mode for each 64-bit computation chain and performs the
> transformation if we choose to use vectors. I also want to split DI
> instructions which are going to be implemented on GPRs before RA
> (currently it is done in the second split pass). A good metric for such a
> transformation is a big question, but currently I can't even make it
> generate correct code when paradoxical subregs are used. It works in
> simple cases but I run into trouble when spills appear.
>
> Trying to beat the following testcase:
>
> extern void counter (unsigned long long);
>
> void
> test (long long *arr)
> {
>   register unsigned long long tmp;
>   tmp = arr[0] | arr[1] & arr[2];
>   while (tmp)
>     {
>       counter (tmp);
>       tmp = *(arr++) & tmp;
>     }
> }
>
> RTL I generate seems OK to me (ignoring the fact that it is not optimal):
>
> (insn 6 3 50 2 (set (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ])
>         (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
>                 (const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D)
> + 8B]+0 S8 A64])) pr65105-1.c:22 89 {*movdi_internal}
>      (nil))
> (insn 50 6 7 2 (set (reg:DI 104)
>         (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
>                 (const_int 16 [0x10])) [2 MEM[(long long int
> *)arr_5(D) + 16B]+0 S8 A64])) pr65105-1.c:22 -1
>      (nil))
> (insn 7 50 51 2 (set (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
>         (and:V2DI (subreg:V2DI (reg:DI 98 [ MEM[(long long int
> *)arr_5(D) + 8B] ]) 0)
>             (subreg:V2DI (reg:DI 104) 0))) pr65105-1.c:22 3487 {*andv2di3}
>      (expr_list:REG_DEAD (subreg:V2DI (reg:DI 98 [ MEM[(long long int
> *)arr_5(D) + 8B] ]) 0)
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (expr_list:REG_EQUAL (and:DI (mem:DI (plus:SI (reg/v/f:SI
> 96 [ arr ])
>                             (const_int 8 [0x8])) [2 MEM[(long long int
> *)arr_5(D) + 8B]+0 S8 A64])
>                     (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
>                             (const_int 16 [0x10])) [2 MEM[(long long
> int *)arr_5(D) + 16B]+0 S8 A64]))
>                 (nil)))))
> (insn 51 7 8 2 (set (reg:DI 105)
>         (mem:DI (reg/v/f:SI 96 [ arr ]) [2 *arr_5(D)+0 S8 A64]))
> pr65105-1.c:22 -1
>      (nil))
> (insn 8 51 46 2 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
>         (ior:V2DI (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
>             (subreg:V2DI (reg:DI 105) 0))) pr65105-1.c:22 3489 {*iorv2di3}
>      (expr_list:REG_DEAD (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (nil))))
> (insn 46 8 47 2 (set (reg:V2DI 103)
>         (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:22 -1
>      (nil))
> (insn 47 46 48 2 (set (subreg:SI (reg:DI 101) 0)
>         (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
>      (nil))
> (insn 48 47 49 2 (set (reg:V2DI 103)
>         (lshiftrt:V2DI (reg:V2DI 103)
>             (const_int 32 [0x20]))) pr65105-1.c:22 -1
>      (nil))
> (insn 49 48 9 2 (set (subreg:SI (reg:DI 101) 4)
>         (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
>      (nil))
> (note 9 49 10 2 NOTE_INSN_DELETED)
> (insn 10 9 11 2 (parallel [
>             (set (reg:CCZ 17 flags)
>                 (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4)
>                         (subreg:SI (reg:DI 101) 0))
>                     (const_int 0 [0])))
>             (clobber (scratch:SI))
>         ]) pr65105-1.c:23 447 {*iorsi_3}
>      (nil))
> (jump_insn 11 10 37 2 (set (pc)
>         (if_then_else (ne (reg:CCZ 17 flags)
>                 (const_int 0 [0]))
>             (label_ref:SI 37)
>             (pc))) pr65105-1.c:23 619 {*jcc_1}
>      (expr_list:REG_DEAD (reg:CCZ 17 flags)
>         (int_list:REG_BR_PROB 9100 (nil)))
>  -> 37)
> (code_label 37 11 36 3 11 "" [2 uses])
> (note 36 37 18 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
> (insn 18 36 19 3 (set (mem:DI (reg/f:SI 7 sp) [0  S8 A32])
>         (reg/v:DI 87 [ tmp ])) pr65105-1.c:25 89 {*movdi_internal}
>      (nil))
> (call_insn 19 18 20 3 (call (mem:QI (symbol_ref:SI ("counter") [flags
> 0x3]  <function_decl 0x7f94046ea798 counter>) [0 counter S1 A8])
>         (const_int 8 [0x8])) pr65105-1.c:25 666 {*call}
>      (expr_list:REG_CALL_DECL (symbol_ref:SI ("counter") [flags 0x3]
> <function_decl 0x7f94046ea798 counter>)
>         (expr_list:REG_EH_REGION (const_int 0 [0])
>             (nil)))
>     (expr_list:DI (use (mem:DI (reg/f:SI 7 sp) [0  S8 A32]))
>         (nil)))
> (insn 20 19 52 3 (parallel [
>             (set (reg/v/f:SI 96 [ arr ])
>                 (plus:SI (reg/v/f:SI 96 [ arr ])
>                     (const_int 8 [0x8])))
>             (clobber (reg:CC 17 flags))
>         ]) pr65105-1.c:26 220 {*addsi_1}
>      (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (nil)))
> (insn 52 20 21 3 (set (reg:DI 106)
>         (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
>                 (const_int -8 [0xfffffffffffffff8])) [2 MEM[base:
> arr_14, offset: 4294967288B]+0 S8 A64])) pr65105-1.c:26 -1
>      (nil))
> (insn 21 52 42 3 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
>         (and:V2DI (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
>             (subreg:V2DI (reg:DI 106) 0))) pr65105-1.c:26 3487 {*andv2di3}
>      (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (nil)))
> (insn 42 21 43 3 (set (reg:V2DI 102)
>         (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:26 -1
>      (nil))
> (insn 43 42 44 3 (set (subreg:SI (reg:DI 101) 0)
>         (subreg:SI (reg:V2DI 102) 0)) pr65105-1.c:26 -1
>      (nil))
> (insn 44 43 45 3 (set (reg:V2DI 102)
>         (lshiftrt:V2DI (reg:V2DI 102)
>             (const_int 32 [0x20]))) pr65105-1.c:26 -1
>      (nil))
> (insn 45 44 23 3 (set (subreg:SI (reg:DI 101) 4)
>         (subreg:SI (reg:V2DI 102) 0)) pr65105-1.c:26 -1
>      (nil))
> (note 23 45 24 3 NOTE_INSN_DELETED)
> (insn 24 23 25 3 (parallel [
>             (set (reg:CCZ 17 flags)
>                 (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4)
>                         (subreg:SI (reg:DI 101) 0))
>                     (const_int 0 [0])))
>             (clobber (scratch:SI))
>         ]) pr65105-1.c:23 447 {*iorsi_3}
>      (nil))
> (jump_insn 25 24 30 3 (set (pc)
>         (if_then_else (ne (reg:CCZ 17 flags)
>                 (const_int 0 [0]))
>             (label_ref:SI 37)
>             (pc))) pr65105-1.c:23 619 {*jcc_1}
>      (expr_list:REG_DEAD (reg:CCZ 17 flags)
>         (int_list:REG_BR_PROB 9100 (nil)))
>  -> 37)
>
>
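
For readers less used to RTL, insns 6-8 and 46-49 above correspond
roughly to the SSE2 intrinsics below. This is only an illustrative
sketch: the names di_halves and or_and_halves are made up here and do
not come from the patch, and the scalar fallback exists solely so the
snippet builds on targets without SSE2. One difference worth keeping in
mind is that _mm_loadl_epi64 (movq) zeroes the upper 64 bits of the
register, whereas a paradoxical subreg leaves them undefined.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper type: the two 32-bit halves of a DImode value,
   as they end up in GPRs before the *iorsi_3 compare (insn 10).  */
typedef struct { uint32_t lo, hi; } di_halves;

/* Computes arr[0] | (arr[1] & arr[2]) and splits the result in two.  */
di_halves
or_and_halves (const uint64_t *arr)
{
  di_halves r;
#if defined(__SSE2__)
  /* 64-bit loads into XMM registers (movq).  */
  __m128i a = _mm_loadl_epi64 ((const __m128i *) &arr[0]);
  __m128i b = _mm_loadl_epi64 ((const __m128i *) &arr[1]);
  __m128i c = _mm_loadl_epi64 ((const __m128i *) &arr[2]);
  /* pand, then por: the V2DI forms of the DImode and/ior.  */
  __m128i t = _mm_or_si128 (_mm_and_si128 (b, c), a);
  /* movd for the low half; psrlq $32 followed by movd for the high half.  */
  r.lo = (uint32_t) _mm_cvtsi128_si32 (t);
  r.hi = (uint32_t) _mm_cvtsi128_si32 (_mm_srli_epi64 (t, 32));
#else
  /* Plain scalar fallback with the same result.  */
  uint64_t t = arr[0] | (arr[1] & arr[2]);
  r.lo = (uint32_t) t;
  r.hi = (uint32_t) (t >> 32);
#endif
  return r;
}
```

The loop test then only has to OR the two halves together and compare
against zero, which is what insns 10 and 24 do.
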
> r87 [tmp] has one definition before the loop (insn 8) and one
> definition in the loop (insn 21). But after reload I see that the insn 8
> result is stored to the stack and this stored value is used in the loop.
> The value produced by insn 21, however, is not stored to the stack, and
> therefore a wrong value is used starting from the second loop iteration.
> Here is the resulting assembler:
>
> test:
> .LFB10:
>         .cfi_startproc
>         pushl   %ebx
>         .cfi_def_cfa_offset 8
>         .cfi_offset 3, -8
>         leal    -40(%esp), %esp
>         .cfi_def_cfa_offset 48
>         movl    48(%esp), %ebx
>         movq    8(%ebx), %xmm1
>         movq    16(%ebx), %xmm0
>         pand    %xmm1, %xmm0
>         movq    (%ebx), %xmm1
>         movdqa  %xmm0, %xmm4
>         por     %xmm1, %xmm4
>         movdqa  %xmm4, %xmm0
>         movd    %xmm4, %edx
>         **movq    %xmm4, 16(%esp)**
>         psrlq   $32, %xmm0
>         movd    %xmm0, %eax
>         orl     %edx, %eax
>         je      .L7
>         .p2align 4,,15
> .L11:
>         **movl    16(%esp), %eax**
>         addl    $8, %ebx
>         **movl    20(%esp), %edx**
>         movl    %eax, (%esp)
>         movl    %edx, 4(%esp)
>         call    counter
>         movq    -8(%ebx), %xmm0
>         **movdqa  16(%esp), %xmm2**
>         pand    %xmm0, %xmm2
>         movdqa  %xmm2, %xmm0
>         movd    %xmm2, %edx
>         psrlq   $32, %xmm0
>         movd    %xmm0, %eax
>         orl     %edx, %eax
>         jne     .L11
> .L7:
>         leal    40(%esp), %esp
>         .cfi_def_cfa_offset 8
>         popl    %ebx
>         .cfi_restore 3
>         .cfi_def_cfa_offset 4
>         ret
>
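
The effect of the missing store can be modelled in plain C. The sketch
below is not taken from the thread: trace_expected and trace_miscompiled
are hypothetical names, and the n and cap bounds are added only so the
model terminates safely (the testcase itself has no bound). Both
functions assume n >= 3 and record the values that would be passed to
counter ().

```c
#include <stddef.h>
#include <stdint.h>

/* Source-level semantics of the testcase loop.  */
size_t
trace_expected (const uint64_t *arr, size_t n, uint64_t *out, size_t cap)
{
  size_t i = 0, k = 0;
  uint64_t tmp = arr[0] | (arr[1] & arr[2]);
  while (tmp && i < n && k < cap)
    {
      out[k++] = tmp;           /* counter (tmp) */
      tmp = arr[i++] & tmp;
    }
  return k;
}

/* Model of the generated assembly: the insn 8 result is spilled to a
   stack slot once, and every later read goes through that slot.  */
size_t
trace_miscompiled (const uint64_t *arr, size_t n, uint64_t *out, size_t cap)
{
  size_t i = 0, k = 0;
  uint64_t slot = arr[0] | (arr[1] & arr[2]);   /* movq %xmm4, 16(%esp) */
  uint64_t reg = slot;
  while (reg && i < n && k < cap)
    {
      out[k++] = slot;          /* argument reloaded from 16(%esp) */
      reg = arr[i++] & slot;    /* pand result is never stored back */
    }
  return k;
}
```

With arr = { 0x0F, 0xFF, 0xF0, 0x03, 0 } the two traces agree on the
first call and differ from the second call on, which matches the
behaviour described above.
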
> Do I misuse paradoxical subregs? Is there any other way to mix scalar
> and vector code and perform vector casts?
>
> BTW this test works OK with another option set, where r87 is not spilled to
> memory but is preserved in GPRs across the call instead.
>
> Thanks,
> Ilya

Hi Vladimir,

Could you please comment on this?

Thanks,
Ilya

Follow-Ups:
- Re: [i386] Scalar DImode instructions on XMM registers
  - From: Vladimir Makarov

References:
- Re: [i386] Scalar DImode instructions on XMM registers
  - From: Ilya Enkovich