This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
- From: Wei Mi <wmi at google dot com>
- To: "Zamyatin, Igor" <igor dot zamyatin at intel dot com>
- Cc: GCC Patches <gcc-patches at gcc dot gnu dot org>, David Li <davidxl at google dot com>, "ubizjak at gmail dot com" <ubizjak at gmail dot com>
- Date: Fri, 20 Sep 2013 15:12:50 -0700
- Subject: Re: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
- References: <CA+4CFy6AWcvddHi-S8N1pzae7ChCx00EF7+mPM88pdQuyy3Zow at mail dot gmail dot com> <0EFAB2BDD0F67E4FB6CCC8B9F87D7569427A8C52 at IRSMSX101 dot ger dot corp dot intel dot com>
Ping.
> -----Original Message-----
> From: Wei Mi [mailto:wmi@google.com]
> Sent: Thursday, September 12, 2013 2:51 AM
> To: GCC Patches
> Cc: David Li; Zamyatin, Igor
> Subject: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
>
> For the following testcase 1.c, performance on Westmere and Sandy Bridge
> is better with -mtune-ctrl=^use_vector_fp_converts (it improves from
> 3.46s to 2.83s). This means cvtss2sd is often better than
> unpcklps+cvtps2pd on recent x86 platforms.
>
> 1.c:
> float total = 0.2;
> int k = 5;
>
> int main() {
>   int i;
>
>   for (i = 0; i < 1000000000; i++) {
>     total += (0.5 + k);
>   }
>
>   return total == 0.3;
> }
>
> assembly generated by gcc-r201963 without -mtune-ctrl=^use_vector_fp_converts:
> .L2:
>         unpcklps %xmm0, %xmm0
>         subl     $1, %eax
>         cvtps2pd %xmm0, %xmm0
>         addsd    %xmm1, %xmm0
>         unpcklpd %xmm0, %xmm0
>         cvtpd2ps %xmm0, %xmm0
>         jne      .L2
>
> assembly generated by gcc-r201963 with -mtune-ctrl=^use_vector_fp_converts:
> .L2:
>         cvtss2sd %xmm0, %xmm0
>         subl     $1, %eax
>         addsd    %xmm1, %xmm0
>         cvtsd2ss %xmm0, %xmm0
>         jne      .L2
>
> But for testcase 2.c (thanks to Igor Zamyatin for the testcase),
> performance with -mtune-ctrl=^use_vector_fp_converts is worse.
> Analysis of the assembly shows the degradation comes from a partial
> register stall caused by cvtsd2ss. Adding "pxor %xmm0, %xmm0" before
> "cvtsd2ss b(,%rdx,8), %xmm0" gets the performance back.
>
> 2.c:
> double b[1024];
>
> float a[1024];
>
> int main()
> {
>   int i;
>   for (i = 0; i < 1024 * 1024 * 256; i++)
>     a[i & 1023] = a[i & 1023] * (float)b[i & 1023];
>   return (int)a[512];
> }
>
> without -mtune-ctrl=^use_vector_fp_converts:
> .L2:
>         movl     %eax, %edx
>         addl     $1, %eax
>         andl     $1023, %edx
>         cmpl     $268435456, %eax
>         movsd    b(,%rdx,8), %xmm0
>         cvtpd2ps %xmm0, %xmm0   ==> no partial reg stall, because movsd
>                                     writes the whole register
>         mulss    a(,%rdx,4), %xmm0
>         movss    %xmm0, a(,%rdx,4)
>         jne      .L2
>
> with -mtune-ctrl=^use_vector_fp_converts:
> .L2:
>         movl     %eax, %edx
>         addl     $1, %eax
>         andl     $1023, %edx
>         cmpl     $268435456, %eax
>         cvtsd2ss b(,%rdx,8), %xmm0   ==> partial reg stall; needs
>                                          "pxor %xmm0, %xmm0" inserted
>                                          before this insn
>         mulss    a(,%rdx,4), %xmm0
>         movss    %xmm0, a(,%rdx,4)
>         jne      .L2
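With the proposed peephole applied, the inner loop of 2.c should then look roughly like this (a hand-written sketch of the expected codegen, not compiler output):

```asm
.L2:
        movl     %eax, %edx
        addl     $1, %eax
        andl     $1023, %edx
        cmpl     $268435456, %eax
        pxor     %xmm0, %xmm0          # zero xmm0, breaking the dependency
                                       # on its previous contents
        cvtsd2ss b(,%rdx,8), %xmm0     # now merges into a freshly zeroed reg
        mulss    a(,%rdx,4), %xmm0
        movss    %xmm0, a(,%rdx,4)
        jne      .L2
```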
>
> So the patch turns off use_vector_fp_converts for m_CORE_ALL so that
> cvtss2sd/cvtsd2ss are used directly, and adds "pxor %xmmreg, %xmmreg"
> before cvtss2sd/cvtsd2ss to break the partial reg stall (similar to what
> r201308 does for cvtsi2ss/cvtsi2sd). Bootstrap and regression tests pass.
> OK for trunk?
>
> Thanks,
> Wei Mi.
>
> 2013-09-11  Wei Mi  <wmi@google.com>
>
>         * config/i386/x86-tune.def (X86_TUNE_USE_VECTOR_FP_CONVERTS):
>         Remove m_CORE_ALL.
>         * config/i386/i386.md: Add define_peephole2 patterns to
>         break partial reg stall for cvtss2sd and cvtsd2ss.
>
> Index: config/i386/x86-tune.def
> ===================================================================
> --- config/i386/x86-tune.def (revision 201963)
> +++ config/i386/x86-tune.def (working copy)
> @@ -189,7 +189,7 @@ DEF_TUNE (X86_TUNE_NOT_VECTORMODE, "not_
>  /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
>     from FP to FP.  */
>  DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS, "use_vector_fp_converts",
> -          m_CORE_ALL | m_AMDFAM10 | m_GENERIC)
> +          m_AMDFAM10 | m_GENERIC)
>  /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
>     from integer to FP.  */
>  DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)
> Index: config/i386/i386.md
> ===================================================================
> --- config/i386/i386.md (revision 201963)
> +++ config/i386/i386.md (working copy)
> @@ -5075,6 +5075,63 @@
>    emit_move_insn (operands[0], CONST0_RTX (<ssevecmode>mode));
>  })
>
> +;; Break partial reg stall for cvtsd2ss.
> +
> +(define_peephole2
> +  [(set (match_operand:SF 0 "register_operand")
> +        (float_truncate:SF
> +          (match_operand:DF 1 "nonimmediate_operand")))]
> +  "TARGET_SSE2 && TARGET_SSE_MATH
> +   && TARGET_SSE_PARTIAL_REG_DEPENDENCY
> +   && optimize_function_for_speed_p (cfun)
> +   && reload_completed && SSE_REG_P (operands[0])
> +   && peep2_reg_dead_p (0, operands[0])
> +   && (!SSE_REG_P (operands[1])
> +       || REGNO (operands[0]) != REGNO (operands[1]))"
> +  [(set (match_dup 0)
> +        (vec_merge:V4SF
> +          (vec_duplicate:V4SF
> +            (float_truncate:V2SF
> +              (match_dup 1)))
> +          (match_dup 0)
> +          (const_int 1)))]
> +{
> +  operands[0] = simplify_gen_subreg (V4SFmode, operands[0],
> +                                     SFmode, 0);
> +  operands[1] = simplify_gen_subreg (V2DFmode, operands[1],
> +                                     DFmode, 0);
> +  emit_move_insn (operands[0], CONST0_RTX (V4SFmode));
> +})
> +
> +;; Break partial reg stall for cvtss2sd.
> +
> +(define_peephole2
> +  [(set (match_operand:DF 0 "register_operand")
> +        (float_extend:DF
> +          (match_operand:SF 1 "nonimmediate_operand")))]
> +  "TARGET_SSE2 && TARGET_SSE_MATH
> +   && TARGET_SSE_PARTIAL_REG_DEPENDENCY
> +   && optimize_function_for_speed_p (cfun)
> +   && reload_completed && SSE_REG_P (operands[0])
> +   && peep2_reg_dead_p (0, operands[0])
> +   && (!SSE_REG_P (operands[1])
> +       || REGNO (operands[0]) != REGNO (operands[1]))"
> +  [(set (match_dup 0)
> +        (vec_merge:V2DF
> +          (float_extend:V2DF
> +            (vec_select:V2SF
> +              (match_dup 1)
> +              (parallel [(const_int 0) (const_int 1)])))
> +          (match_dup 0)
> +          (const_int 1)))]
> +{
> +  operands[0] = simplify_gen_subreg (V2DFmode, operands[0],
> +                                     DFmode, 0);
> +  operands[1] = simplify_gen_subreg (V4SFmode, operands[1],
> +                                     SFmode, 0);
> +  emit_move_insn (operands[0], CONST0_RTX (V2DFmode));
> +})
> +
>  ;; Avoid store forwarding (partial memory) stall penalty
>  ;; by passing DImode value through XMM registers.  */