[PATCH,RFC] Support non-constants as the second argument of __builtin_expect
- From: Trevor_Smigiel at playstation dot sony dot com
- To: gcc-patches at gcc dot gnu dot org
- Cc: andrew_pinski at playstation dot sony dot com, Ulrich Weigand <Ulrich dot Weigand at de dot ibm dot com>
- Date: Wed, 15 Jul 2009 14:41:35 -0700
- Subject: [PATCH,RFC] Support non-constants as the second argument of __builtin_expect
Hi,
The Cell SPU processor does not have typical hardware branch prediction.
(By typical, I mean not like x86 or powerpc.) The compiler must insert
explicit prefetch instructions to say "this branch will go to this
target". It is also possible to do it conditionally at run time. It is
useful to expose these prefetch instructions to programmers so they can
do their own software-based dynamic branch prediction. (See below for
specific examples.)
In our Language Extensions document for Cell we extend
__builtin_expect() to allow a non-constant in the second argument. This
will result in a conditional prefetch instruction.
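As a rough sketch of the intended usage (compute_prediction, fast_path
and slow_path below are placeholder names used only for illustration,
not part of the patch), a call with a non-constant second argument
might look like this:

  extern int compute_prediction (void);
  extern void fast_path (void), slow_path (void);

  void
  example (int cond)
  {
    /* The second argument is evaluated at run time; the target can
       turn it into a conditional prefetch of the branch target.  */
    if (__builtin_expect (cond, compute_prediction ()))
      fast_path ();
    else
      slow_path ();
  }

With a constant second argument the builtin behaves as it always has.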
Are there other targets that could take advantage of this extension?
Do any other targets have some kind of conditional prefetch insn?
The attached patch implements this extension as follows:
- do not remove __builtin_expect() in strip_predict_hints() when the
second argument is non-constant and the target provides the
"builtin_expect" md file pattern.
- add new rtl pattern "builtin_expect" with 2 input operands that
correspond to the 2 arguments of __builtin_expect(). Perhaps it
should also have a matching output operand, but the implementation
for SPU does not require it.
- call the "builtin_expect" pattern in expand_builtin_expect().
- the target specific code defines the new rtl pattern and generates
whatever code is necessary for that target.
The target specific part on SPU does the following:
- the "builtin_expect" pattern saves the two input operands in global
variables.
- when generating a conditional branch instruction and the first operand
for the branch matches the first operand saved by "builtin_expect" we
generate the code for dynamic instruction fetching.
- added 2 new "*cbranch" patterns which include a use of the register
which contains the predicted target.
- the machine reorg generates the prefetch insn using that register.
I chose this approach because it had minimal impact on common code.
Most of the effort happens in target specific code.
Is this approach reasonable? I think saving operands in a global is
fragile, i.e., changing the code for branches could indirectly break a
target's "builtin_expect" implementation.
A different implementation might be to add a new pattern like
"expect_cbranch<mode>5" that passes the second argument of
builtin_expect() directly, rather than the target saving it in a global.
One way to achieve this is to update the code that sets branch
priorities based on __builtin_expect() and have it record the second
argument somewhere, then use that when expanding. But I do not know the
best place to save that argument or how to represent it in the tree IR.
Another consideration for the target-side implementation is that the
"builtin_expect" pattern might generate some setup code (e.g., loading
addresses), and we want that setup code to be optimized, which means we
should generate it at expand time. At the same time, we want this
extension to have minimal impact on the branch optimizations that would
happen otherwise. (E.g., if a branch is dead code, it should still be
removed, even though it is affected by builtin_expect.)
Here is more detail on SPU hardware and why this extension is useful:
On SPU, the fall-through case of a branch has no delay by default and
the taken case has a delay of about 18 cycles, which is the time needed
to fetch instructions into the instruction buffer.
SPU has a hint instruction, which can prefetch instructions from a
specified address and insert them into the pipeline after a specified
instruction. For example:
        hbrr    branch,target   # the hint instruction
        nop
        nop                     # need at least 8 insns between the
        nop                     # hint and the branch for the insn
        nop                     # pipeline to take advantage of it.
        nop
        nop
        nop
        nop
branch: br      target          # the hinted branch
Without the hint instruction, the "br target" instruction will stall
while it fetches the instructions at "target". With the hint
instruction, the SPU has already started fetching instructions at
"target" and they will be inserted into the instruction pipeline right
after the "br target" instruction. When the hint executes early enough
the branch has no delay; otherwise the branch will stall for the
remaining time needed to complete the fetch.
There is also a hint instruction that uses the address specified in a
register.
        hbr     branch,$reg
The obvious use is to prefetch for indirect calls, switch tables and
function returns. But it also allows the compiler or programmer to do
dynamic branch prediction.
For example, by default the compiler will hint the backwards branch of a
loop so the taken case has no delay. This means the last iteration will
stall. It is possible to prevent the stall by dynamically prefetching.
(This is not always profitable because we are adding instructions to the
loop.)
        # Without dynamic prediction we would use this hint insn:
        #   hbrr    branch,loop
        # It can be issued outside the loop as long as there is no
        # code in the loop that clobbers the internal hint state.
        ila     $2,loop
        ila     $3,loop_end
        ila     $4,iterations
loop:   ai      $4,$4,-1
        cgti    $5,$4,0
        selb    $6,$3,$2,$5     # conditional move
        hbr     branch,$6
        nop
        nop                     # in real code this would be instructions
        nop                     # that do useful work.
        nop
        nop
        nop
        nop
        nop
branch: brnz    $5,loop         # repeat while the count was > 0
loop_end:
To allow programmers to write a loop like this we extend
__builtin_expect() to allow a non-constant as the second argument.
For example:
  int i = iterations;
  do
    {
      i--;
    }
  while (__builtin_expect (i > 0, i > 0));
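The second argument does not have to repeat the condition. As a purely
illustrative sketch (this example is not from the patch; count_positive,
a and n are placeholder names), the prediction can instead be derived
from the data being processed:

  /* Illustrative only: guess that each element has the same sign as
     the previous one and use that guess as the run-time prediction.  */
  int
  count_positive (const int *a, int n)
  {
    int i, count = 0;
    for (i = 1; i < n; i++)
      if (__builtin_expect (a[i] > 0, a[i - 1] > 0))
        count++;
    return count;
  }

Whether something like this is profitable depends on how well the guess
tracks the real outcome and on the extra instructions added to the loop,
as noted above.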
For the SPU implementation, one tricky part is getting the 2 addresses
used for the branch targets. How do we allow for branch optimizations in
the RTL passes while also allowing the instructions that load the
addresses to be optimized (e.g., hoisted out of a loop)? The conditional
move we generate must be optimized in step with optimizations done to
the corresponding branch.
We do this by creating the 2 "*cbranch" patterns mentioned above. We
reposition the labels referenced in those patterns to the appropriate
location based on how the branch now looks. It is not 100% clear
to me that this will always work, but it works in the simple cases,
e.g., the branch is removed, or its condition is flipped. This builtin
is not widely used, and the effect when it is wrong is only worse
performance, not incorrect results, so I am ok with the SPU
implementation. Other targets could choose to make the branch volatile
to prevent optimizations on it.
Thanks,
Trevor
Index: gcc/gcc/doc/md.texi
===================================================================
*** gcc/gcc/doc/md.texi (revision 149502)
--- gcc/gcc/doc/md.texi (working copy)
*************** inclusive and operand 1 exclusive.
*** 5318,5323 ****
--- 5318,5332 ----
If this pattern is not defined, a call to the library function
@code{__clear_cache} is used.
+ @cindex @code{builtin_expect} instruction pattern
+ @item @samp{builtin_expect}
+
+ This pattern, if defined, handles a call to @code{__builtin_expect} with
+ a non-constant second argument. Operand 0 is the target register of
+ @code{__builtin_expect} and can be compared against operands of the
+ following compare or branch to determine when it applies. Operand 1 is
+ the expanded value of the second argument.
+
@end table
@end ifset
Index: gcc/gcc/builtins.c
===================================================================
*** gcc/gcc/builtins.c (revision 149502)
--- gcc/gcc/builtins.c (working copy)
*************** static rtx
*** 5276,5292 ****
expand_builtin_expect (tree exp, rtx target)
{
tree arg, c;
if (call_expr_nargs (exp) < 2)
return const0_rtx;
arg = CALL_EXPR_ARG (exp, 0);
c = CALL_EXPR_ARG (exp, 1);
! target = expand_expr (arg, target, VOIDmode, EXPAND_NORMAL);
/* When guessing was done, the hints should be already stripped away. */
gcc_assert (!flag_guess_branch_prob
|| optimize == 0 || errorcount || sorrycount);
! return target;
}
void
--- 5276,5310 ----
expand_builtin_expect (tree exp, rtx target)
{
tree arg, c;
+ rtx new_target;
if (call_expr_nargs (exp) < 2)
return const0_rtx;
arg = CALL_EXPR_ARG (exp, 0);
c = CALL_EXPR_ARG (exp, 1);
! new_target = expand_expr (arg, target, VOIDmode, EXPAND_NORMAL);
!
! #ifdef HAVE_builtin_expect
! if (HAVE_builtin_expect)
! {
! int icode = CODE_FOR_builtin_expect;
! enum machine_mode mode1 = insn_data[icode].operand[1].mode;
! rtx op1 = expand_expr (c, NULL_RTX, VOIDmode, EXPAND_NORMAL);
! if (GET_MODE (op1) != mode1 && mode1 != VOIDmode
! && GET_MODE (op1) != VOIDmode)
! op1 = convert_modes (mode1, GET_MODE (op1), op1, 0);
! if (!insn_data[icode].operand[1].predicate (op1, mode1)
! && mode1 != VOIDmode)
! op1 = copy_to_mode_reg (mode1, op1);
! emit_insn (gen_builtin_expect (target, op1));
! }
! #else
/* When guessing was done, the hints should be already stripped away. */
gcc_assert (!flag_guess_branch_prob
|| optimize == 0 || errorcount || sorrycount);
! #endif
! return new_target;
}
void
Index: gcc/gcc/testsuite/lib/compat.exp
===================================================================
*** gcc/gcc/testsuite/lib/compat.exp (revision 149502)
--- gcc/gcc/testsuite/lib/compat.exp (working copy)
*************** proc compat-execute { src1 sid use_alt }
*** 287,296 ****
# On the SPU, most of the compat test cases exceed local store size.
# Use automatic overlay support to make them fit.
if { [check_effective_target_spu_auto_overlay] } {
! set extra_flags_1 "$extra_flags_1 -Wl,--auto-overlay"
! set extra_flags_1 "$extra_flags_1 -ffunction-sections"
! set extra_flags_2 "$extra_flags_2 -ffunction-sections"
! set extra_flags_3 "$extra_flags_3 -ffunction-sections"
}
# Define the names of the object files.
--- 287,299 ----
# On the SPU, most of the compat test cases exceed local store size.
# Use automatic overlay support to make them fit.
if { [check_effective_target_spu_auto_overlay] } {
! # set extra_flags_1 "$extra_flags_1 -Wl,--auto-overlay"
! # set extra_flags_1 "$extra_flags_1 -ffunction-sections"
! # set extra_flags_2 "$extra_flags_2 -ffunction-sections"
! # set extra_flags_3 "$extra_flags_3 -ffunction-sections"
! set extra_flags_1 "$extra_flags_1 -mlarge-mem -Wl,--local-store=0:0x400000"
! set extra_flags_2 "$extra_flags_2 -mlarge-mem"
! set extra_flags_3 "$extra_flags_3 -mlarge-mem"
}
# Define the names of the object files.
Index: gcc/gcc/predict.c
===================================================================
*** gcc/gcc/predict.c (revision 149502)
--- gcc/gcc/predict.c (working copy)
*************** strip_predict_hints (void)
*** 1318,1324 ****
if (fndecl
&& DECL_BUILT_IN_CLASS (fndecl) == BUILT_IN_NORMAL
&& DECL_FUNCTION_CODE (fndecl) == BUILT_IN_EXPECT
! && gimple_call_num_args (stmt) == 2)
{
var = gimple_call_lhs (stmt);
ass_stmt = gimple_build_assign (var, gimple_call_arg (stmt, 0));
--- 1318,1330 ----
if (fndecl
&& DECL_BUILT_IN_CLASS (fndecl) == BUILT_IN_NORMAL
&& DECL_FUNCTION_CODE (fndecl) == BUILT_IN_EXPECT
! && gimple_call_num_args (stmt) == 2
! #ifdef HAVE_builtin_expect
! /* When the target provides a builtin_expect rtl pattern
! keep the calls that aren't constant. */
! && is_gimple_constant (gimple_call_arg (stmt, 1))
! #endif
! )
{
var = gimple_call_lhs (stmt);
ass_stmt = gimple_build_assign (var, gimple_call_arg (stmt, 0));
Index: gcc/gcc/config/spu/spu.c
===================================================================
*** gcc/gcc/config/spu/spu.c (revision 149502)
--- gcc/gcc/config/spu/spu.c (working copy)
*************** static unsigned int spu_section_type_fla
*** 214,219 ****
--- 214,220 ----
static rtx spu_expand_load (rtx, rtx, rtx, int);
extern const char *reg_names[];
+ rtx spu_expect_op0, spu_expect_op1;
/* Which instruction set architecture to use. */
int spu_arch;
*************** spu_emit_branch_or_set (int is_set, rtx
*** 1096,1101 ****
--- 1097,1103 ----
{
rtx bcomp;
rtx loc_ref;
+ rtx jump_pat;
/* We don't have branch on QI compare insns, so we convert the
QI compare result to a HI result. */
*************** spu_emit_branch_or_set (int is_set, rtx
*** 1113,1121 ****
bcomp = gen_rtx_NE (comp_mode, compare_result, const0_rtx);
loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[3]);
! emit_jump_insn (gen_rtx_SET (VOIDmode, pc_rtx,
! gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
! loc_ref, pc_rtx)));
}
else if (is_set == 2)
{
--- 1115,1167 ----
bcomp = gen_rtx_NE (comp_mode, compare_result, const0_rtx);
loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[3]);
! jump_pat = gen_rtx_SET (VOIDmode, pc_rtx,
! gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
! loc_ref, pc_rtx));
!
! if (flag_schedule_insns_after_reload && TARGET_BRANCH_HINTS
! && spu_expect_op0 && comp_mode == Pmode
! && spu_expect_op0 == op0)
! {
! rtx then_reg = gen_reg_rtx (Pmode);
! rtx else_reg = gen_reg_rtx (Pmode);
! rtx expect_cmp = gen_reg_rtx (Pmode);
! rtx hint_target = gen_reg_rtx (Pmode);
! rtx then_label = gen_label_rtx ();
! rtx then_ref = gen_rtx_LABEL_REF (VOIDmode, then_label);
! rtx else_label = gen_label_rtx ();
! rtx else_ref = gen_rtx_LABEL_REF (VOIDmode, else_label);
! rtvec v;
!
! emit_move_insn (then_reg, then_ref);
! emit_move_insn (else_reg, else_ref);
! emit_insn (gen_clgt_si (expect_cmp, spu_expect_op1, const0_rtx));
! emit_insn (gen_selb (hint_target, then_reg, else_reg, expect_cmp));
!
! LABEL_NUSES (then_label)++;
! LABEL_PRESERVE_P (then_label) = 1;
! LABEL_NUSES (else_label)++;
! LABEL_PRESERVE_P (else_label) = 1;
!
! /* We delete the labels to make sure they don't get used for
! anything else. The machine reorg phase will move them to
! the correct place. We don't try to reuse existing labels
! because we move these around later. */
! delete_insn (emit_label (then_label));
! delete_insn (emit_label (else_label));
!
! v = rtvec_alloc (5);
! RTVEC_ELT (v, 0) = jump_pat;
! RTVEC_ELT (v, 1) = gen_rtx_USE (VOIDmode, hint_target);
! RTVEC_ELT (v, 2) = gen_rtx_USE (VOIDmode, then_ref);
! RTVEC_ELT (v, 3) = gen_rtx_USE (VOIDmode, else_ref);
! RTVEC_ELT (v, 4) = gen_rtx_CLOBBER (VOIDmode,
! gen_rtx_REG (SImode,
! HBR_REGNUM));
! jump_pat = gen_rtx_PARALLEL (VOIDmode, v);
! }
!
! emit_jump_insn (jump_pat);
}
else if (is_set == 2)
{
*************** spu_emit_branch_or_set (int is_set, rtx
*** 1166,1171 ****
--- 1212,1218 ----
else
emit_move_insn (target, compare_result);
}
+ spu_expect_op0 = spu_expect_op1 = 0;
}
HOST_WIDE_INT
*************** get_branch_target (rtx branch)
*** 2360,2365 ****
--- 2407,2417 ----
|| GET_CODE (PATTERN (branch)) == ADDR_DIFF_VEC)
return 0;
+ /* branch for __builtin_expect */
+ if (INSN_CODE (branch) == CODE_FOR_expect_then
+ || INSN_CODE (branch) == CODE_FOR_expect_else)
+ return XEXP (XVECEXP (PATTERN (branch), 0, 1), 0);
+
set = single_set (branch);
src = SET_SRC (set);
if (GET_CODE (SET_DEST (set)) != PC)
*************** get_branch_target (rtx branch)
*** 2369,2374 ****
--- 2421,2427 ----
{
rtx lab = 0;
rtx note = find_reg_note (branch, REG_BR_PROB, 0);
+
if (note)
{
/* If the more probable case is not a fall through, then
*************** get_branch_target (rtx branch)
*** 2416,2421 ****
--- 2469,2476 ----
static bool
insn_clobbers_hbr (rtx insn)
{
+ if (NONJUMP_INSN_P (insn) && INSN_CODE (insn) == CODE_FOR_hbr)
+ return 1;
if (INSN_P (insn)
&& GET_CODE (PATTERN (insn)) == PARALLEL)
{
*************** spu_machine_dependent_reorg (void)
*** 2688,2693 ****
--- 2743,2751 ----
branch = insn;
branch_addr = insn_addr;
required_dist = spu_hint_dist;
+ if (INSN_CODE (branch) == CODE_FOR_expect_then
+ || INSN_CODE (branch) == CODE_FOR_expect_else)
+ required_dist = 0;
}
}
}
*************** spu_machine_dependent_reorg (void)
*** 2812,2817 ****
--- 2870,2904 ----
if (offset > 0)
XVECEXP (unspec, 0, 0) = plus_constant (label_ref, offset);
}
+ else if (JUMP_P (insn) && (INSN_CODE (insn) == CODE_FOR_expect_then
+ || INSN_CODE (insn) == CODE_FOR_expect_else))
+ {
+ /* __builtin_expect with a non-constant second argument
+ generates patterns which contain labels that need to be
+ relocated. These are generated in spu_emit_branch_or_set. */
+ rtx set0 = XVECEXP (PATTERN (insn), 0, 0);
+ rtx use2 = XVECEXP (PATTERN (insn), 0, 2);
+ rtx use3 = XVECEXP (PATTERN (insn), 0, 3);
+ rtx label0 = XEXP (XEXP (set0, 1), 1);
+ rtx label2 = XEXP (XEXP (use2, 0), 0);
+ rtx label3 = XEXP (XEXP (use3, 0), 0);
+ if (GET_CODE (label0) == PC)
+ label0 = XEXP (XEXP (set0, 1), 2);
+ if (GET_CODE (XEXP (XEXP (set0, 1), 0)) == NE)
+ {
+ remove_insn (label2);
+ add_insn_after (label2, insn, 0);
+ remove_insn (label3);
+ add_insn_after (label3, XEXP (label0, 0), 0);
+ }
+ else
+ {
+ remove_insn (label2);
+ add_insn_after (label2, XEXP (label0, 0), 0);
+ remove_insn (label3);
+ add_insn_after (label3, insn, 0);
+ }
+ }
if (spu_flag_var_tracking)
{
Index: gcc/gcc/config/spu/spu.h
===================================================================
*** gcc/gcc/config/spu/spu.h (revision 149502)
--- gcc/gcc/config/spu/spu.h (working copy)
*************** extern GTY(()) int spu_tune;
*** 51,57 ****
/* Default target_flags if no switches specified. */
#ifndef TARGET_DEFAULT
#define TARGET_DEFAULT (MASK_ERROR_RELOC | MASK_SAFE_DMA | MASK_BRANCH_HINTS \
! | MASK_SAFE_HINTS)
#endif
--- 51,57 ----
/* Default target_flags if no switches specified. */
#ifndef TARGET_DEFAULT
#define TARGET_DEFAULT (MASK_ERROR_RELOC | MASK_SAFE_DMA | MASK_BRANCH_HINTS \
! | MASK_SAFE_HINTS | MASK_LARGE_MEM)
#endif
*************** targetm.resolve_overloaded_builtin = spu
*** 593,598 ****
--- 593,600 ----
} \
} while (0)
+ extern GTY(()) rtx spu_expect_op0;
+ extern GTY(()) rtx spu_expect_op1;
/* Builtins. */
Index: gcc/gcc/config/spu/spu.md
===================================================================
*** gcc/gcc/config/spu/spu.md (revision 149502)
--- gcc/gcc/config/spu/spu.md (working copy)
*************** (define_insn ""
*** 3750,3755 ****
--- 3750,3790 ----
"bi%b1%b0z\t%1,$lr"
[(set_attr "type" "br")])
+ ;; The following 2 variations are used when using __builtin_expect
+ ;; with a non-constant second argument. They include a reference to a
+ ;; deleted label which will be placed immediately after the branch.
+ (define_insn "expect_then"
+ [(set (pc)
+ (if_then_else (match_operator 1 "branch_comparison_operator"
+ [(match_operand 2
+ "spu_reg_operand" "r")
+ (const_int 0)])
+ (label_ref (match_operand 0 "" ""))
+ (pc)))
+ (use (match_operand 3 "register_operand" "r"))
+ (use (match_operand 4 "" ""))
+ (use (match_operand 5 "" ""))
+ (clobber (reg:SI 130))]
+ ""
+ "br%b2%b1z\t%2,%0"
+ [(set_attr "type" "br")])
+
+ (define_insn "expect_else"
+ [(set (pc)
+ (if_then_else (match_operator 1 "branch_comparison_operator"
+ [(match_operand 2
+ "spu_reg_operand" "r")
+ (const_int 0)])
+ (pc)
+ (label_ref (match_operand 0 "" ""))))
+ (use (match_operand 3 "register_operand" "r"))
+ (use (match_operand 4 "" ""))
+ (use (match_operand 5 "" ""))
+ (clobber (reg:SI 130))]
+ ""
+ "br%b2%b1z\t%2,%0"
+ [(set_attr "type" "br")])
+
;; vector conditional compare patterns
(define_expand "vcond<mode>"
*************** (define_insn "stack_protect_test_si"
*** 5326,5328 ****
--- 5361,5374 ----
(set_attr "type" "multi1")]
)
+ ;; Save the operands for use by spu_emit_branch_or_set
+ (define_expand "builtin_expect"
+ [(use (match_operand:SI 0 "spu_reg_operand" "r"))
+ (use (match_operand:SI 1 "spu_reg_operand" "r"))]
+ ""
+ {
+ spu_expect_op0 = operands[0];
+ spu_expect_op1 = operands[1];
+ DONE;
+ })
+