This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
[patch] tuning gcc for AMDFAM10 processor (patch 3)
- From: "Jagasia, Harsha" <harsha dot jagasia at amd dot com>
- To: "Richard Henderson" <rth at redhat dot com>
- Cc: gcc-patches at gcc dot gnu dot org
- Date: Tue, 30 Jan 2007 18:58:09 -0600
- Subject: [patch] tuning gcc for AMDFAM10 processor (patch 3)
Hi Richard,
>On Mon, Jan 29, 2007 at 07:12:44PM -0600, Jagasia, Harsha wrote:
>> + xorps reg3, reg3
>> + movaps reg3, reg2
>
>Surely you're not advocating *moving* a zero. =)
Actually this is something the current mainline compiler does; it is not
being introduced by this patch. This patch does not enable
x86_sse_unaligned_move_optimal for any target other than amdfam10. So
-mtune=generic leaves x86_sse_unaligned_move_optimal disabled, but it
enables x86_sse_partial_reg_dependency. The resulting code
is as indicated in the comments:
+   Code generation for unaligned packed loads of single precision
+   data:
+     if (x86_sse_partial_reg_dependency == true)
+       {
+         if (x86_sse_unaligned_move_optimal == true)
+           {
+             movups mem, reg
+           }
+         else
+           {
+             xorps reg3, reg3
+             movaps reg3, reg2
+             movlps mem, reg2
+             movhps mem+8, reg2
+           }
+       }
I built one of the Polyhedron benchmarks with "gfortran -march=k8
-mtune=generic -O3 -ftree-vectorize -w -S aermod.f90 -o
generic/aermod.s".
(I have not extracted a simple test case yet, but we have observed this
with other Polyhedron and CPU2006 FP benchmarks as well.)
Snippet:
xorps %xmm2, %xmm2
shufps $0, %xmm0, %xmm0
addq $4, %rax
leaq (%r12,%rax), %r11
leaq (%rbp,%rax), %r10
leaq (%rdi,%rax), %rax
xorl %r8d, %r8d
xorl %ecx, %ecx
movaps %xmm0, %xmm3
.p2align 4,,7
.L154:
movaps %xmm2, %xmm0
addq $1, %r8
movaps %xmm2, %xmm1
movlps (%rcx,%r10), %xmm0
movlps (%rcx,%r11), %xmm1
movhps 8(%rcx,%r10), %xmm0
movhps 8(%rcx,%r11), %xmm1
It is possible that this is a bug with generic. I think we need Honza to
pitch in on this as he wrote this code. However, the last I heard he is
attending a course on ergodic Ramsey theory somewhere in the mountains
without much internet access.
Meanwhile, perhaps I should fix the comments in the patch I posted to
indicate clearly what is new because of the patch and what is
already being done. Thoughts?
>
>> @@ -9434,6 +9491,13 @@ ix86_expand_vector_move_misalign (enum m
>> }
>> else
>> {
>> + if (TARGET_SSE_UNALIGNED_MOVE_OPTIMAL)
>> + {
>> + op0 = gen_lowpart (V2DFmode, op0);
>> + op1 = gen_lowpart (V2DFmode, op1);
>> + emit_insn (gen_sse2_movupd (op0, op1));
>> + return;
>> + }
>> /* ??? Not sure about the best option for the Intel chips.
>> The following would seem to satisfy; the register is
>> entirely cleared, breaking the dependency chain. We
>> @@ -9453,7 +9517,16 @@ ix86_expand_vector_move_misalign (enum m
>> else
>> {
>> if (TARGET_SSE_PARTIAL_REG_DEPENDENCY)
>> + {
>> + if (TARGET_SSE_UNALIGNED_MOVE_OPTIMAL)
>> + {
>> + op0 = gen_lowpart (V4SFmode, op0);
>> + op1 = gen_lowpart (V4SFmode, op1);
>> + emit_insn (gen_sse_movups (op0, op1));
>> + return;
>> + }
>> emit_move_insn (op0, CONST0_RTX (mode));
>> + }
>
>Un-nest both of these blocks from the IF they're inside.
>TARGET_SSE_UNALIGNED_MOVE_OPTIMAL really has no bearing on
>TARGET_SSE_PARTIAL_REG_DEPENDENCY or TARGET_SSE_SPLIT_REGS,
>and should override both of them.
Ok, I will change this.
Thanks,
Harsha