This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: PATCH: Add XOP 128-bit and 256-bit support for upcoming AMD Orochi processor.
- From: Uros Bizjak <ubizjak at gmail dot com>
- To: Uros Bizjak <ubizjak at gmail dot com>
- Cc: Sebastian Pop <sebpop at gmail dot com>, "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, Jan Hubicka <hubicka at ucw dot cz>, "Harle, Christophe" <christophe dot harle at amd dot com>
- Date: Thu, 19 Nov 2009 22:35:09 +0100
- Subject: Re: PATCH: Add XOP 128-bit and 256-bit support for upcoming AMD Orochi processor.
- References: <5787cf470910210254m63c9f5bbp2585eb4f52ab8360@mail.gmail.com> <1C8DE0332CB01445BF7ADEDE3DDD570718D0502C@sausexmbp02.amd.com> <4AE5DB24.7090907@gmail.com> <cb9d34b20911190751w60bf89e5qd1fdd482e360b081@mail.gmail.com> <4B059CC4.4080801@gmail.com>
On 11/19/2009 08:30 PM, Uros Bizjak wrote:
Note that combine will see a nonimmediate_operand, where the memory
operand will later be fixed up in the reload pass. Reload will
automatically move the memory operand to an SSE register to satisfy the
operand 3 constraint.
Actually, looking a bit deeper into the splitting logic of the FMA
instructions, I think that this whole business of using the
ix86_fma4_valid_op_p () predicate and splitting depending on the type of
operand is flawed.
Let me illustrate this with the vector FMA insn, where we split invalid
instructions by looking at the operands, using the
ix86_expand_fma4_multiple_memory fix-up function:
(define_insn "fma4_fmadd<mode>4256"
  [(set (match_operand:FMA4MODEF4 0 "register_operand" "=x,x,x")
	(plus:FMA4MODEF4
	  (mult:FMA4MODEF4
	    (match_operand:FMA4MODEF4 1 "nonimmediate_operand" "x,x,xm")
	    (match_operand:FMA4MODEF4 2 "nonimmediate_operand" "x,xm,x"))
	  (match_operand:FMA4MODEF4 3 "nonimmediate_operand" "xm,x,x")))]
  "TARGET_FMA4
   && ix86_fma4_valid_op_p (operands, insn, 4, true, 2, true)"
  "vfmadd<fma4modesuffixf4>\t{%3, %2, %1, %0|%0, %1, %2, %3}"
  [(set_attr "type" "ssemuladd")
   (set_attr "mode" "<MODE>")])
;; Split fmadd with two memory operands into a load and the fmadd.
(define_split
  [(set (match_operand:FMA4MODEF4 0 "register_operand" "")
	(plus:FMA4MODEF4
	  (mult:FMA4MODEF4
	    (match_operand:FMA4MODEF4 1 "nonimmediate_operand" "")
	    (match_operand:FMA4MODEF4 2 "nonimmediate_operand" ""))
	  (match_operand:FMA4MODEF4 3 "nonimmediate_operand" "")))]
  "TARGET_FMA4
   && !ix86_fma4_valid_op_p (operands, insn, 4, true, 1, true)
   && ix86_fma4_valid_op_p (operands, insn, 4, true, 2, true)
   && !reg_mentioned_p (operands[0], operands[1])
   && !reg_mentioned_p (operands[0], operands[2])
   && !reg_mentioned_p (operands[0], operands[3])"
  [(const_int 0)]
{
  ix86_expand_fma4_multiple_memory (operands, 4, <MODE>mode);
  emit_insn (gen_fma4_fmadd<mode>4256 (operands[0], operands[1],
				       operands[2], operands[3]));
  DONE;
})
This generates the following vectorized loop:
.L2:
	vmovaps	b(%rax), %xmm1
	vmovaps	d(%rax), %xmm0
	vfmaddps	%xmm0, c(%rax), %xmm1, %xmm0
	vmovaps	%xmm0, a(%rax)
	addq	$16, %rax
	cmpq	$40960, %rax
	jne	.L2
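For reference, a minimal C sketch of the kind of loop this sequence corresponds to (the array names follow the asm; the length N is an assumption inferred from the 40960-byte loop bound, i.e. 10240 floats of 4 bytes each):

```c
/* A hedged sketch, not the actual gcc.target/i386/fma4-vector.c source:
   N = 10240 is inferred from the cmpq $40960 bound above.  */
#define N 10240

float a[N], b[N], c[N], d[N];

void
fma_loop (void)
{
  int i;

  /* Each iteration computes a multiply-add, a = b * c + d, which the
     vectorizer can turn into the vfmaddps sequence shown above.  */
  for (i = 0; i < N; i++)
    a[i] = b[i] * c[i] + d[i];
}
```

Compiled with optimization and FMA4 enabled (e.g. -O2 -ftree-vectorize -mfma4 on a compiler with this patch), such a loop is the sort of input that exercises these patterns.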
However, the same result can be obtained by carefully choosing the operand
constraints in the alternatives:
(define_insn "fma4_fmadd<mode>4"
  [(set (match_operand:SSEMODEF4 0 "register_operand" "=x,x")
	(plus:SSEMODEF4
	  (mult:SSEMODEF4
	    (match_operand:SSEMODEF4 1 "nonimmediate_operand" "%x,x")
	    (match_operand:SSEMODEF4 2 "nonimmediate_operand" " x,m"))
	  (match_operand:SSEMODEF4 3 "nonimmediate_operand" "xm,x")))]
  "TARGET_FMA4"
  "___vfmadd<ssemodesuffixf4>\t{%3, %2, %1, %0|%0, %1, %2, %3}"
  [(set_attr "type" "ssemuladd")
   (set_attr "mode" "<MODE>")])
Please note that all other insn predicates, splitters and fixups for
this instruction were disabled. This pattern still correctly vectorizes
the operation with three arrays, and for gcc.target/i386/fma4-vector.c
generates a similar vectorized sequence:
.L2:
	vmovaps	b(%rax), %xmm1
	vmovaps	c(%rax), %xmm2
	___vfmaddps	d(%rax), %xmm2, %xmm1, %xmm0
	vmovaps	%xmm0, a(%rax)
	addq	$16, %rax
	cmpq	$40960, %rax
	jne	.L2
Please note that by introducing "%" into the operand 1 constraint, which
marks the multiplication operands as commutative, we also increased the
number of possibilities available to reload, since it may swap operands
1 and 2 to satisfy an alternative.
Uros.