This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[PATCH v2, ARM] 64-bit shifts in NEON


Richard Earnshaw wrote:
> On 20/09/12 16:49, Ulrich Weigand wrote:
> > Richard Earnshaw wrote:
> >
> >> Hmm, this is going to cause bottlenecks on Cortex-A15: writing a Neon
> >> single-precision register and then reading it back as a double-precision
> >> value will cause scheduling problems.
[snip]
> >> A solution to this is to have the set of the shifter register done as a
> >> lane-set operation rather than as a set of the lower register, but it
> >> probably needs some thought as to how to achieve this without creating
> >> other overheads.
> >
> > What instruction are you refering to here?  Loads from memory?
> 
> Yes, if that's the source, or if from another register, something like
> 
> 	vmov.32 Dd[0], Rt
> 
> (it doesn't matter that the other lane remains unintialized).  This has
> the advantage that it doesn't clobber the other half of the register.
> 
> If the operand is already known to be in an S-register, then
> vdup(scalar) can be used, but of course that needs a full 64-bit target
> register.

Here's an updated version of the patch that allocated double registers
and uses lane-set or load from memory instructions to fill in the
shift count operand.

Re-tested on arm-linux-gnueabi with no regression.  Re-benchmarked
(the Linaro backport) with results comparable to the original patch.

Would this version be OK?

Bye,
Ulrich


2012-10-16  Andrew Stubbs  <ams@codesourcery.com>
            Ulrich Weigand  <ulrich.weigand@linaro.org>

	* config/arm/arm.c (arm_emit_coreregs_64bit_shift): Fix comment.
	* config/arm/arm.md (opt, opt_enabled): New attributes.
	(enabled): Use opt_enabled.
	(ashldi3, ashrdi3, lshrdi3): Add TARGET_NEON case.
	(ashldi3): Allow general operands for TARGET_NEON case.
	* config/arm/iterators.md (rshifts): New code iterator.
	(shift, shifttype): New code attributes.
	* config/arm/neon.md (UNSPEC_LOAD_COUNT): New unspec type.
	(neon_load_count, ashldi3_neon_noclobber, ashldi3_neon,
	signed_shift_di3_neon, unsigned_shift_di3_neon,
	ashrdi3_neon_imm_noclobber, lshrdi3_neon_imm_noclobber,
	<shift>di3_neon): New patterns.


Index: gcc/config/arm/arm.c
===================================================================
*** gcc/config/arm/arm.c	(revision 191400)
--- gcc/config/arm/arm.c	(working copy)
*************** arm_autoinc_modes_ok_p (enum machine_mod
*** 26293,26300 ****
     Input requirements:
      - It is safe for the input and output to be the same register, but
        early-clobber rules apply for the shift amount and scratch registers.
!     - Shift by register requires both scratch registers.  Shift by a constant
!       less than 32 in Thumb2 mode requires SCRATCH1 only.  In all other cases
        the scratch registers may be NULL.
      - Ashiftrt by a register also clobbers the CC register.  */
  void
--- 26293,26299 ----
     Input requirements:
      - It is safe for the input and output to be the same register, but
        early-clobber rules apply for the shift amount and scratch registers.
!     - Shift by register requires both scratch registers.  In all other cases
        the scratch registers may be NULL.
      - Ashiftrt by a register also clobbers the CC register.  */
  void
Index: gcc/config/arm/neon.md
===================================================================
*** gcc/config/arm/neon.md	(revision 191400)
--- gcc/config/arm/neon.md	(working copy)
***************
*** 23,28 ****
--- 23,29 ----
  (define_c_enum "unspec" [
    UNSPEC_ASHIFT_SIGNED
    UNSPEC_ASHIFT_UNSIGNED
+   UNSPEC_LOAD_COUNT
    UNSPEC_VABD
    UNSPEC_VABDL
    UNSPEC_VADD
***************
*** 1170,1175 ****
--- 1171,1359 ----
    DONE;
  })
  
+ ;; 64-bit shifts
+ 
+ ;; This pattern loads a 32-bit shift count into a 64-bit NEON register,
+ ;; leaving the upper half uninitalized.  This is OK since the shift
+ ;; instruction only looks at the low 8 bits anyway.  To avoid confusing
+ ;; data flow analysis however, we pretend the full register is set
+ ;; using an unspec.
+ (define_insn "neon_load_count"
+   [(set (match_operand:DI 0 "s_register_operand" "=w,w")
+         (unspec:DI [(match_operand:SI 1 "nonimmediate_operand" "Um,r")]
+                    UNSPEC_LOAD_COUNT))]
+   "TARGET_NEON"
+   "@
+    vld1.32\t{%P0[0]}, %A1
+    vmov.32\t%P0[0], %1"
+   [(set_attr "neon_type" "neon_vld1_vld2_lane,neon_mcr")]
+ )
+ 
+ (define_insn "ashldi3_neon_noclobber"
+   [(set (match_operand:DI 0 "s_register_operand"	    "=w,w")
+ 	(ashift:DI (match_operand:DI 1 "s_register_operand" " w,w")
+ 		   (match_operand:DI 2 "reg_or_int_operand" " i,w")))]
+   "TARGET_NEON && reload_completed
+    && (!CONST_INT_P (operands[2])
+        || (INTVAL (operands[2]) >= 0 && INTVAL (operands[2]) < 64))"
+   "@
+    vshl.u64\t%P0, %P1, %2
+    vshl.u64\t%P0, %P1, %P2"
+   [(set_attr "neon_type" "neon_vshl_ddd,neon_vshl_ddd")]
+ )
+ 
+ (define_insn_and_split "ashldi3_neon"
+   [(set (match_operand:DI 0 "s_register_operand"	    "= w, w,?&r,?r, ?w,w")
+ 	(ashift:DI (match_operand:DI 1 "s_register_operand" " 0w, w, 0r, r, 0w,w")
+ 		   (match_operand:SI 2 "general_operand"    "rUm, i,  r, i,rUm,i")))
+    (clobber (match_scratch:SI 3				    "= X, X,?&r, X,  X,X"))
+    (clobber (match_scratch:SI 4				    "= X, X,?&r, X,  X,X"))
+    (clobber (match_scratch:DI 5				    "=&w, X,  X, X, &w,X"))
+    (clobber (reg:CC_C CC_REGNUM))]
+   "TARGET_NEON"
+   "#"
+   "TARGET_NEON && reload_completed"
+   [(const_int 0)]
+   "
+   {
+     if (IS_VFP_REGNUM (REGNO (operands[0])))
+       {
+         if (CONST_INT_P (operands[2]))
+ 	  {
+ 	    if (INTVAL (operands[2]) < 1)
+ 	      {
+ 	        emit_insn (gen_movdi (operands[0], operands[1]));
+ 		DONE;
+ 	      }
+ 	    else if (INTVAL (operands[2]) > 63)
+ 	      operands[2] = gen_rtx_CONST_INT (VOIDmode, 63);
+ 	  }
+ 	else
+ 	  {
+ 	    emit_insn (gen_neon_load_count (operands[5], operands[2]));
+ 	    operands[2] = operands[5];
+ 	  }
+ 
+ 	/* Ditch the unnecessary clobbers.  */
+ 	emit_insn (gen_ashldi3_neon_noclobber (operands[0], operands[1],
+ 					       operands[2]));
+       }
+     else
+       {
+ 	if (CONST_INT_P (operands[2]) && INTVAL (operands[2]) == 1)
+ 	  /* This clobbers CC.  */
+ 	  emit_insn (gen_arm_ashldi3_1bit (operands[0], operands[1]));
+ 	else
+ 	  arm_emit_coreregs_64bit_shift (ASHIFT, operands[0], operands[1],
+ 					 operands[2], operands[3], operands[4]);
+       }
+     DONE;
+   }"
+   [(set_attr "arch" "nota8,nota8,*,*,onlya8,onlya8")
+    (set_attr "opt" "*,*,speed,speed,*,*")]
+ )
+ 
+ ; The shift amount needs to be negated for right-shifts
+ (define_insn "signed_shift_di3_neon"
+   [(set (match_operand:DI 0 "s_register_operand"	     "=w")
+ 	(unspec:DI [(match_operand:DI 1 "s_register_operand" " w")
+ 		    (match_operand:DI 2 "s_register_operand" " w")]
+ 		   UNSPEC_ASHIFT_SIGNED))]
+   "TARGET_NEON && reload_completed"
+   "vshl.s64\t%P0, %P1, %P2"
+   [(set_attr "neon_type" "neon_vshl_ddd")]
+ )
+ 
+ ; The shift amount needs to be negated for right-shifts
+ (define_insn "unsigned_shift_di3_neon"
+   [(set (match_operand:DI 0 "s_register_operand"	     "=w")
+ 	(unspec:DI [(match_operand:DI 1 "s_register_operand" " w")
+ 		    (match_operand:DI 2 "s_register_operand" " w")]
+ 		   UNSPEC_ASHIFT_UNSIGNED))]
+   "TARGET_NEON && reload_completed"
+   "vshl.u64\t%P0, %P1, %P2"
+   [(set_attr "neon_type" "neon_vshl_ddd")]
+ )
+ 
+ (define_insn "ashrdi3_neon_imm_noclobber"
+   [(set (match_operand:DI 0 "s_register_operand"	      "=w")
+ 	(ashiftrt:DI (match_operand:DI 1 "s_register_operand" " w")
+ 		     (match_operand:DI 2 "const_int_operand"  " i")))]
+   "TARGET_NEON && reload_completed
+    && INTVAL (operands[2]) > 0 && INTVAL (operands[2]) <= 64"
+   "vshr.s64\t%P0, %P1, %2"
+   [(set_attr "neon_type" "neon_vshl_ddd")]
+ )
+ 
+ (define_insn "lshrdi3_neon_imm_noclobber"
+   [(set (match_operand:DI 0 "s_register_operand"	      "=w")
+ 	(lshiftrt:DI (match_operand:DI 1 "s_register_operand" " w")
+ 		     (match_operand:DI 2 "const_int_operand"  " i")))]
+   "TARGET_NEON && reload_completed
+    && INTVAL (operands[2]) > 0 && INTVAL (operands[2]) <= 64"
+   "vshr.u64\t%P0, %P1, %2"
+   [(set_attr "neon_type" "neon_vshl_ddd")]
+ )
+ 
+ ;; ashrdi3_neon
+ ;; lshrdi3_neon
+ (define_insn_and_split "<shift>di3_neon"
+   [(set (match_operand:DI 0 "s_register_operand"	     "= w, w,?&r,?r,?w,?w")
+ 	(rshifts:DI (match_operand:DI 1 "s_register_operand" " 0w, w, 0r, r,0w, w")
+ 		    (match_operand:SI 2 "reg_or_int_operand" "  r, i,  r, i, r, i")))
+    (clobber (match_scratch:SI 3				     "=2r, X, &r, X,2r, X"))
+    (clobber (match_scratch:SI 4				     "= X, X, &r, X, X, X"))
+    (clobber (match_scratch:DI 5				     "=&w, X,  X, X,&w, X"))
+    (clobber (reg:CC CC_REGNUM))]
+   "TARGET_NEON"
+   "#"
+   "TARGET_NEON && reload_completed"
+   [(const_int 0)]
+   "
+   {
+     if (IS_VFP_REGNUM (REGNO (operands[0])))
+       {
+ 	if (CONST_INT_P (operands[2]))
+ 	  {
+ 	    if (INTVAL (operands[2]) < 1)
+ 	      {
+ 	        emit_insn (gen_movdi (operands[0], operands[1]));
+ 		DONE;
+ 	      }
+ 	    else if (INTVAL (operands[2]) > 64)
+ 	      operands[2] = gen_rtx_CONST_INT (VOIDmode, 64);
+ 
+ 	    /* Ditch the unnecessary clobbers.  */
+ 	    emit_insn (gen_<shift>di3_neon_imm_noclobber (operands[0],
+ 							  operands[1],
+ 							  operands[2]));
+ 	  }
+ 	else 
+ 	  {
+ 	    /* We must use a negative left-shift.  */
+ 	    emit_insn (gen_negsi2 (operands[3], operands[2]));
+ 	    emit_insn (gen_neon_load_count (operands[5], operands[3]));
+ 	    emit_insn (gen_<shifttype>_shift_di3_neon (operands[0], operands[1],
+ 						       operands[5]));
+ 	  }
+       }
+     else
+       {
+ 	if (CONST_INT_P (operands[2]) && INTVAL (operands[2]) == 1)
+ 	  /* This clobbers CC.  */
+ 	  emit_insn (gen_arm_<shift>di3_1bit (operands[0], operands[1]));
+ 	else
+ 	  /* This clobbers CC (ASHIFTRT by register only).  */
+ 	  arm_emit_coreregs_64bit_shift (<CODE>, operands[0], operands[1],
+ 				 	 operands[2], operands[3], operands[4]);
+       }
+ 
+     DONE;
+   }"
+   [(set_attr "arch" "nota8,nota8,*,*,onlya8,onlya8")
+    (set_attr "opt" "*,*,speed,speed,*,*")]
+ )
+ 
  ;; Widening operations
  
  (define_insn "widen_ssum<mode>3"
Index: gcc/config/arm/iterators.md
===================================================================
*** gcc/config/arm/iterators.md	(revision 191254)
--- gcc/config/arm/iterators.md	(working copy)
***************
*** 188,193 ****
--- 188,196 ----
  ;; A list of widening operators
  (define_code_iterator SE [sign_extend zero_extend])
  
+ ;; Right shifts
+ (define_code_iterator rshifts [ashiftrt lshiftrt])
+ 
  ;;----------------------------------------------------------------------------
  ;; Mode attributes
  ;;----------------------------------------------------------------------------
***************
*** 449,451 ****
--- 452,459 ----
  
  ;; Assembler mnemonics for signedness of widening operations.
  (define_code_attr US [(sign_extend "s") (zero_extend "u")])
+ 
+ ;; Right shifts
+ (define_code_attr shift [(ashiftrt "ashr") (lshiftrt "lshr")])
+ (define_code_attr shifttype [(ashiftrt "signed") (lshiftrt "unsigned")])
+ 
Index: gcc/config/arm/arm.md
===================================================================
*** gcc/config/arm/arm.md	(revision 191254)
--- gcc/config/arm/arm.md	(working copy)
***************
*** 249,254 ****
--- 249,270 ----
  
  	(const_string "no")))
  
+ (define_attr "opt" "any,speed,size"
+   (const_string "any"))
+ 
+ (define_attr "opt_enabled" "no,yes"
+   (cond [(eq_attr "opt" "any")
+          (const_string "yes")
+ 
+ 	 (and (eq_attr "opt" "speed")
+ 	      (match_test "optimize_function_for_speed_p (cfun)"))
+ 	 (const_string "yes")
+ 
+ 	 (and (eq_attr "opt" "size")
+ 	      (match_test "optimize_function_for_size_p (cfun)"))
+ 	 (const_string "yes")]
+ 	(const_string "no")))
+ 
  ; Allows an insn to disable certain alternatives for reasons other than
  ; arch support.
  (define_attr "insn_enabled" "no,yes"
***************
*** 256,266 ****
  
  ; Enable all alternatives that are both arch_enabled and insn_enabled.
   (define_attr "enabled" "no,yes"
!    (if_then_else (eq_attr "insn_enabled" "yes")
!                (if_then_else (eq_attr "arch_enabled" "yes")
!                              (const_string "yes")
!                              (const_string "no"))
!                 (const_string "no")))
  
  ; POOL_RANGE is how far away from a constant pool entry that this insn
  ; can be placed.  If the distance is zero, then this insn will never
--- 272,286 ----
  
  ; Enable all alternatives that are both arch_enabled and insn_enabled.
   (define_attr "enabled" "no,yes"
!    (cond [(eq_attr "insn_enabled" "no")
! 	  (const_string "no")
! 
! 	  (eq_attr "arch_enabled" "no")
! 	  (const_string "no")
! 
! 	  (eq_attr "opt_enabled" "no")
! 	  (const_string "no")]
! 	 (const_string "yes")))
  
  ; POOL_RANGE is how far away from a constant pool entry that this insn
  ; can be placed.  If the distance is zero, then this insn will never
***************
*** 3489,3497 ****
  (define_expand "ashldi3"
    [(set (match_operand:DI            0 "s_register_operand" "")
          (ashift:DI (match_operand:DI 1 "s_register_operand" "")
!                    (match_operand:SI 2 "reg_or_int_operand" "")))]
    "TARGET_32BIT"
    "
    if (!CONST_INT_P (operands[2]) && TARGET_REALLY_IWMMXT)
      ; /* No special preparation statements; expand pattern as above.  */
    else
--- 3509,3531 ----
  (define_expand "ashldi3"
    [(set (match_operand:DI            0 "s_register_operand" "")
          (ashift:DI (match_operand:DI 1 "s_register_operand" "")
!                    (match_operand:SI 2 "general_operand" "")))]
    "TARGET_32BIT"
    "
+   if (TARGET_NEON)
+     {
+       /* Delay the decision whether to use NEON or core-regs until
+ 	 register allocation.  */
+       emit_insn (gen_ashldi3_neon (operands[0], operands[1], operands[2]));
+       DONE;
+     }
+   else
+     {
+       /* Only the NEON case can handle in-memory shift counts.  */
+       if (!reg_or_int_operand (operands[2], SImode))
+         operands[2] = force_reg (SImode, operands[2]);
+     }
+ 
    if (!CONST_INT_P (operands[2]) && TARGET_REALLY_IWMMXT)
      ; /* No special preparation statements; expand pattern as above.  */
    else
***************
*** 3566,3571 ****
--- 3600,3613 ----
                       (match_operand:SI 2 "reg_or_int_operand" "")))]
    "TARGET_32BIT"
    "
+   if (TARGET_NEON)
+     {
+       /* Delay the decision whether to use NEON or core-regs until
+ 	 register allocation.  */
+       emit_insn (gen_ashrdi3_neon (operands[0], operands[1], operands[2]));
+       DONE;
+     }
+ 
    if (!CONST_INT_P (operands[2]) && TARGET_REALLY_IWMMXT)
      ; /* No special preparation statements; expand pattern as above.  */
    else
***************
*** 3638,3643 ****
--- 3680,3693 ----
                       (match_operand:SI 2 "reg_or_int_operand" "")))]
    "TARGET_32BIT"
    "
+   if (TARGET_NEON)
+     {
+       /* Delay the decision whether to use NEON or core-regs until
+ 	 register allocation.  */
+       emit_insn (gen_lshrdi3_neon (operands[0], operands[1], operands[2]));
+       DONE;
+     }
+ 
    if (!CONST_INT_P (operands[2]) && TARGET_REALLY_IWMMXT)
      ; /* No special preparation statements; expand pattern as above.  */
    else


-- 
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  Ulrich.Weigand@de.ibm.com


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]