This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: PATCH: Add XOP 128-bit and 256-bit support for upcoming AMD Orochi processor.
- From: Uros Bizjak <ubizjak at gmail dot com>
- To: Uros Bizjak <ubizjak at gmail dot com>
- Cc: Sebastian Pop <sebpop at gmail dot com>, "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, Jan Hubicka <hubicka at ucw dot cz>, "Harle, Christophe" <christophe dot harle at amd dot com>
- Date: Thu, 19 Nov 2009 22:35:09 +0100
- Subject: Re: PATCH: Add XOP 128-bit and 256-bit support for upcoming AMD Orochi processor.
- References: <5787cf470910210254m63c9f5bbp2585eb4f52ab8360@mail.gmail.com> <1C8DE0332CB01445BF7ADEDE3DDD570718D0502C@sausexmbp02.amd.com> <4AE5DB24.7090907@gmail.com> <cb9d34b20911190751w60bf89e5qd1fdd482e360b081@mail.gmail.com> <4B059CC4.4080801@gmail.com>
On 11/19/2009 08:30 PM, Uros Bizjak wrote:
Note that combine will see a nonimmediate_operand, where the memory
operand will later be fixed up in the reload pass. Reload will
automatically move the memory operand to an SSE register to satisfy the
operand 3 constraint.
Actually, looking a bit deeper into the splitting logic of the FMA
instructions, I think that this whole business of using the
ix86_fma4_valid_op_p () predicate and splitting depending on the type of
operand is flawed.
Let me illustrate this with the vector FMA insn, where we split invalid
instructions by looking at the operands, using the
ix86_expand_fma4_multiple_memory fix-up function:
(define_insn "fma4_fmadd<mode>4256"
  [(set (match_operand:FMA4MODEF4 0 "register_operand" "=x,x,x")
	(plus:FMA4MODEF4
	  (mult:FMA4MODEF4
	    (match_operand:FMA4MODEF4 1 "nonimmediate_operand" "x,x,xm")
	    (match_operand:FMA4MODEF4 2 "nonimmediate_operand" "x,xm,x"))
	  (match_operand:FMA4MODEF4 3 "nonimmediate_operand" "xm,x,x")))]
  "TARGET_FMA4
   && ix86_fma4_valid_op_p (operands, insn, 4, true, 2, true)"
  "vfmadd<fma4modesuffixf4>\t{%3, %2, %1, %0|%0, %1, %2, %3}"
  [(set_attr "type" "ssemuladd")
   (set_attr "mode" "<MODE>")])
;; Split fmadd with two memory operands into a load and the fmadd.
(define_split
  [(set (match_operand:FMA4MODEF4 0 "register_operand" "")
	(plus:FMA4MODEF4
	  (mult:FMA4MODEF4
	    (match_operand:FMA4MODEF4 1 "nonimmediate_operand" "")
	    (match_operand:FMA4MODEF4 2 "nonimmediate_operand" ""))
	  (match_operand:FMA4MODEF4 3 "nonimmediate_operand" "")))]
  "TARGET_FMA4
   && !ix86_fma4_valid_op_p (operands, insn, 4, true, 1, true)
   && ix86_fma4_valid_op_p (operands, insn, 4, true, 2, true)
   && !reg_mentioned_p (operands[0], operands[1])
   && !reg_mentioned_p (operands[0], operands[2])
   && !reg_mentioned_p (operands[0], operands[3])"
  [(const_int 0)]
{
  ix86_expand_fma4_multiple_memory (operands, 4, <MODE>mode);
  emit_insn (gen_fma4_fmadd<mode>4256 (operands[0], operands[1],
				       operands[2], operands[3]));
  DONE;
})
This generates the following vectorized loop:
.L2:
	vmovaps	b(%rax), %xmm1
	vmovaps	d(%rax), %xmm0
	vfmaddps	%xmm0, c(%rax), %xmm1, %xmm0
	vmovaps	%xmm0, a(%rax)
	addq	$16, %rax
	cmpq	$40960, %rax
	jne	.L2
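For reference, a minimal C sketch of the kind of loop this sequence corresponds to (the array names follow the asm; the length N is an assumption inferred from the 40960-byte loop bound, i.e. 10240 floats of 4 bytes each):

```c
/* A hedged sketch, not the actual gcc.target/i386/fma4-vector.c source:
   N = 10240 is inferred from the cmpq $40960 bound above.  */
#define N 10240

float a[N], b[N], c[N], d[N];

void
fma_loop (void)
{
  int i;

  /* Each iteration computes a multiply-add, a = b * c + d, which the
     vectorizer can turn into the vfmaddps sequence shown above.  */
  for (i = 0; i < N; i++)
    a[i] = b[i] * c[i] + d[i];
}
```

Compiled with optimization and FMA4 enabled (e.g. -O2 -ftree-vectorize -mfma4 on a compiler with this patch), such a loop is the sort of input that exercises these patterns.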
However, the same result can be obtained by carefully choosing the operand
constraints in the alternatives:
(define_insn "fma4_fmadd<mode>4"
  [(set (match_operand:SSEMODEF4 0 "register_operand" "=x,x")
	(plus:SSEMODEF4
	  (mult:SSEMODEF4
	    (match_operand:SSEMODEF4 1 "nonimmediate_operand" "%x,x")
	    (match_operand:SSEMODEF4 2 "nonimmediate_operand" " x,m"))
	  (match_operand:SSEMODEF4 3 "nonimmediate_operand" "xm,x")))]
  "TARGET_FMA4"
  "___vfmadd<ssemodesuffixf4>\t{%3, %2, %1, %0|%0, %1, %2, %3}"
  [(set_attr "type" "ssemuladd")
   (set_attr "mode" "<MODE>")])
Please note that all other insn predicates, splitters and fixups for
this instruction were disabled. This pattern still correctly vectorizes
the operation with three arrays, and for gcc.target/i386/fma4-vector.c
generates a similar vectorized sequence:
.L2:
	vmovaps	b(%rax), %xmm1
	vmovaps	c(%rax), %xmm2
	___vfmaddps	d(%rax), %xmm2, %xmm1, %xmm0
	vmovaps	%xmm0, a(%rax)
	addq	$16, %rax
	cmpq	$40960, %rax
	jne	.L2
Please note that by introducing "%" into the operand 1 constraint, which
marks the multiplication operands as commutative, we also increased the
number of possibilities available to reload, since it may swap operands
1 and 2 to satisfy an alternative.
Uros.