Thread-safe profiling

Michael Matz matz@suse.de
Wed Mar 14 20:23:00 GMT 2007


Hi,

this patch implements thread-safe profiling by using the __sync_* builtins 
where appropriate.  Currently the profile updating code is not thread-safe 
at all (it simply increments 64-bit wide values in a global counters[] 
array), which might lead to counter values slightly less than the real 
number (for instance for the edge counters), or even completely invalid 
counter values (e.g. for the one_value counter, which sometimes also 
decrements).

As this comes at some performance cost (a simple test program of mine, 
which uses the profile functions heavily, needed 2:36 minutes instead of 
0:51 when using thread-safe profiling) I made this an option.  The other 
reason for making this an option is that not all architectures support 
the necessary atomic builtins.  You'll get linking errors (about 
__sync_val_compare_and_swap_8) if your architecture doesn't support them and 
you request thread-safe profiling (i.e. you can provide a library 
implementing those functions on your own, using pthread mutexes or 
whatever, if you really insist on thread-safe profiling on such incapable 
machines, at even higher cost).  As libgcov.a also contains such code 
it needs to be compiled with -march set to something which supports these 
instructions (e.g. i686 on x86), which is done when GCC is configured 
correctly (like --with-arch=i686 at configure time).

For this I extended libgcov a bit by mirroring all profile counter 
update functions: for each __gcov_bla_profiler I added a 
__gcov_bla_profiler_r function, which is the thread-safe one, implemented 
with GCC builtins.  Mostly trivial stuff, except for the one_value counter 
implementation, which is a bit tricky: to avoid contention it doesn't use 
one global mutex, but one of the counter[] values itself, which requires 
jumping through some hoops to implement the spin lock by hand.

In the course of implementing this I hit some bugs in the expansion code 
of these builtins and in the i386 backend, which are also fixed by this 
patch.

Bootstrapped and regtested on x86_64-linux (though I made some small 
changes to it afterwards, so it's being retested).  Verified that it does 
what it's supposed to by inspecting the output of gcov on a multi-threaded 
test program.

Ok for mainline?


Ciao,
Michael.
-- 
	* builtins.c (expand_builtin_sync_operation,
	expand_builtin_compare_and_swap,
	expand_builtin_lock_test_and_set): Don't extend CONST_INTs.

	* config/i386/sync.md (sync_double_compare_and_swapdi_pic,
	sync_double_compare_and_swap_ccdi_pic): Use "SD" as constraint
	for operand 3.

	* common.opt (fprofile-thread-safe): New option.
	* tree-profile.c (gcov_type_size_log2): New static variable.
	(tree_init_ic_make_global_vars): Make __gcov_indirect_call_callee
	and __gcov_indirect_call_counters be TLS variables.
	(tree_init_edge_profiler): Initialize gcov_type_size_log2 and
	use the *_r profile updaters when flag_profile_thread_safe.
	(tree_gen_edge_profiler): Emit thread-safe code when
	flag_profile_thread_safe.
	* libgcov.c (__gcov_interval_profiler_r, __gcov_pow2_profiler_r,
	__gcov_one_value_profiler_body_r, __gcov_one_value_profiler_r,
	__gcov_indirect_call_profiler_r, __gcov_average_profiler_r,
	__gcov_ior_profiler_r): New thread-safe profile updaters.
	* Makefile.in (LIBGCOV): Add new *_r profile updater functions.
	* doc/invoke.texi (-fprofile-thread-safe): Document new option.

Index: builtins.c
===================================================================
--- builtins.c	(revision 122909)
+++ builtins.c	(working copy)
@@ -5717,8 +5717,10 @@ expand_builtin_sync_operation (enum mach
   mem = get_builtin_sync_mem (CALL_EXPR_ARG (exp, 0), mode);
 
   val = expand_expr (CALL_EXPR_ARG (exp, 1), NULL, mode, EXPAND_NORMAL);
-  /* If VAL is promoted to a wider mode, convert it back to MODE.  */
-  val = convert_to_mode (mode, val, 1);
+  /* If VAL is promoted to a wider mode, convert it back to MODE.
+     Not for CONST_INTs though.  */
+  if (GET_MODE (val) != VOIDmode)
+    val = convert_to_mode (mode, val, 1);
 
   if (ignore)
     return expand_sync_operation (mem, val, code);
@@ -5742,12 +5744,16 @@ expand_builtin_compare_and_swap (enum ma
 
 
   old_val = expand_expr (CALL_EXPR_ARG (exp, 1), NULL, mode, EXPAND_NORMAL);
-  /* If OLD_VAL is promoted to a wider mode, convert it back to MODE.  */
-  old_val = convert_to_mode (mode, old_val, 1);
+  /* If OLD_VAL is promoted to a wider mode, convert it back to MODE.
+     Not for CONST_INTs though.  */
+  if (GET_MODE (old_val) != VOIDmode)
+    old_val = convert_to_mode (mode, old_val, 1);
 
   new_val = expand_expr (CALL_EXPR_ARG (exp, 2), NULL, mode, EXPAND_NORMAL);
-  /* If NEW_VAL is promoted to a wider mode, convert it back to MODE.  */
-  new_val = convert_to_mode (mode, new_val, 1);
+  /* If NEW_VAL is promoted to a wider mode, convert it back to MODE.
+     Not for CONST_INTs though.  */
+  if (GET_MODE (new_val) != VOIDmode)
+    new_val = convert_to_mode (mode, new_val, 1);
 
   if (is_bool)
     return expand_bool_compare_and_swap (mem, old_val, new_val, target);
@@ -5770,8 +5776,10 @@ expand_builtin_lock_test_and_set (enum m
   /* Expand the operands.  */
   mem = get_builtin_sync_mem (CALL_EXPR_ARG (exp, 0), mode);
   val = expand_expr (CALL_EXPR_ARG (exp, 1), NULL, mode, EXPAND_NORMAL);
-  /* If VAL is promoted to a wider mode, convert it back to MODE.  */
-  val = convert_to_mode (mode, val, 1);
+  /* If VAL is promoted to a wider mode, convert it back to MODE.
+     Not for CONST_INTs though.  */
+  if (GET_MODE (val) != VOIDmode)
+    val = convert_to_mode (mode, val, 1);
 
   return expand_sync_lock_test_and_set (mem, val, target);
 }
Index: common.opt
===================================================================
--- common.opt	(revision 122909)
+++ common.opt	(working copy)
@@ -735,6 +735,10 @@ fprofile-generate
 Common
 Enable common options for generating profile info for profile feedback directed optimizations
 
+fprofile-thread-safe
+Common Report Var(flag_profile_thread_safe)
+Generate thread safe profiling code, used with -fprofile-generate
+
 fprofile-use
 Common
 Enable common options for performing profile feedback directed optimizations
Index: tree-profile.c
===================================================================
--- tree-profile.c	(revision 122909)
+++ tree-profile.c	(working copy)
@@ -61,6 +61,8 @@ static GTY(()) tree ic_void_ptr_var;
 static GTY(()) tree ic_gcov_type_ptr_var;
 static GTY(()) tree ptr_void;
 
+static int gcov_type_size_log2;
+
 /* Do initialization work for the edge profiler.  */
 
 /* Add code:
@@ -82,6 +84,7 @@ tree_init_ic_make_global_vars (void)
   TREE_PUBLIC (ic_void_ptr_var) = 0;
   DECL_ARTIFICIAL (ic_void_ptr_var) = 1;
   DECL_INITIAL (ic_void_ptr_var) = NULL;
+  DECL_TLS_MODEL (ic_void_ptr_var) = decl_default_tls_model (ic_void_ptr_var);
   assemble_variable (ic_void_ptr_var, 0, 0, 0);
 
   gcov_type_ptr = build_pointer_type (get_gcov_type ());
@@ -93,6 +96,8 @@ tree_init_ic_make_global_vars (void)
   TREE_PUBLIC (ic_gcov_type_ptr_var) = 0;
   DECL_ARTIFICIAL (ic_gcov_type_ptr_var) = 1;
   DECL_INITIAL (ic_gcov_type_ptr_var) = NULL;
+  DECL_TLS_MODEL (ic_gcov_type_ptr_var)
+    = decl_default_tls_model (ic_gcov_type_ptr_var);
   assemble_variable (ic_gcov_type_ptr_var, 0, 0, 0);
 }
 
@@ -109,6 +114,8 @@ tree_init_edge_profiler (void)
   if (!gcov_type_node)
     {
       gcov_type_node = get_gcov_type ();
+      gcov_type_size_log2 = tree_low_cst (TYPE_SIZE_UNIT (gcov_type_node), 1);
+      gcov_type_size_log2 = exact_log2 (gcov_type_size_log2);
       gcov_type_ptr = build_pointer_type (gcov_type_node);
 
       /* void (*) (gcov_type *, gcov_type, int, unsigned)  */
@@ -118,7 +125,9 @@ tree_init_edge_profiler (void)
 					  integer_type_node,
 					  unsigned_type_node, NULL_TREE);
       tree_interval_profiler_fn
-	      = build_fn_decl ("__gcov_interval_profiler",
+	      = build_fn_decl (flag_profile_thread_safe
+	      		       ? "__gcov_interval_profiler_r"
+			       : "__gcov_interval_profiler",
 				     interval_profiler_fn_type);
 
       /* void (*) (gcov_type *, gcov_type)  */
@@ -126,7 +135,9 @@ tree_init_edge_profiler (void)
 	      = build_function_type_list (void_type_node,
 					  gcov_type_ptr, gcov_type_node,
 					  NULL_TREE);
-      tree_pow2_profiler_fn = build_fn_decl ("__gcov_pow2_profiler",
+      tree_pow2_profiler_fn = build_fn_decl (flag_profile_thread_safe
+      					     ? "__gcov_pow2_profiler_r"
+					     : "__gcov_pow2_profiler",
 						   pow2_profiler_fn_type);
 
       /* void (*) (gcov_type *, gcov_type)  */
@@ -135,7 +146,9 @@ tree_init_edge_profiler (void)
 					  gcov_type_ptr, gcov_type_node,
 					  NULL_TREE);
       tree_one_value_profiler_fn
-	      = build_fn_decl ("__gcov_one_value_profiler",
+	      = build_fn_decl (flag_profile_thread_safe
+	      		       ? "__gcov_one_value_profiler_r"
+			       : "__gcov_one_value_profiler",
 				     one_value_profiler_fn_type);
 
       tree_init_ic_make_global_vars ();
@@ -147,17 +160,23 @@ tree_init_edge_profiler (void)
 					  ptr_void,
 					  ptr_void, NULL_TREE);
       tree_indirect_call_profiler_fn
-	      = build_fn_decl ("__gcov_indirect_call_profiler",
+	      = build_fn_decl (flag_profile_thread_safe
+	      		       ? "__gcov_indirect_call_profiler_r"
+			       : "__gcov_indirect_call_profiler",
 				     ic_profiler_fn_type);
       /* void (*) (gcov_type *, gcov_type)  */
       average_profiler_fn_type
 	      = build_function_type_list (void_type_node,
 					  gcov_type_ptr, gcov_type_node, NULL_TREE);
       tree_average_profiler_fn
-	      = build_fn_decl ("__gcov_average_profiler",
+	      = build_fn_decl (flag_profile_thread_safe
+	      		       ? "__gcov_average_profiler_r"
+			       : "__gcov_average_profiler",
 				     average_profiler_fn_type);
       tree_ior_profiler_fn
-	      = build_fn_decl ("__gcov_ior_profiler",
+	      = build_fn_decl (flag_profile_thread_safe
+	      		       ? "__gcov_ior_profiler_r"
+			       : "__gcov_ior_profiler",
 				     average_profiler_fn_type);
     }
 }
@@ -169,18 +188,33 @@ tree_init_edge_profiler (void)
 static void
 tree_gen_edge_profiler (int edgeno, edge e)
 {
-  tree tmp1 = create_tmp_var (gcov_type_node, "PROF");
-  tree tmp2 = create_tmp_var (gcov_type_node, "PROF");
   tree ref = tree_coverage_counter_ref (GCOV_COUNTER_ARCS, edgeno);
   tree one = build_int_cst (gcov_type_node, 1);
-  tree stmt1 = build_gimple_modify_stmt (tmp1, ref);
-  tree stmt2 = build_gimple_modify_stmt (tmp2,
-					 build2 (PLUS_EXPR, gcov_type_node,
-						 tmp1, one));
-  tree stmt3 = build_gimple_modify_stmt (ref, tmp2);
-  bsi_insert_on_edge (e, stmt1);
-  bsi_insert_on_edge (e, stmt2);
-  bsi_insert_on_edge (e, stmt3);
+  if (! flag_profile_thread_safe)
+    {
+      tree tmp1 = create_tmp_var (gcov_type_node, "PROF");
+      tree tmp2 = create_tmp_var (gcov_type_node, "PROF");
+      tree stmt1 = build_gimple_modify_stmt (tmp1, ref);
+      tree stmt2 = build_gimple_modify_stmt (tmp2,
+					     build2 (PLUS_EXPR, gcov_type_node,
+						     tmp1, one));
+      tree stmt3 = build_gimple_modify_stmt (ref, tmp2);
+      bsi_insert_on_edge (e, stmt1);
+      bsi_insert_on_edge (e, stmt2);
+      bsi_insert_on_edge (e, stmt3);
+    }
+  else
+    {
+      tree ref_ptr;
+      tree stmts, call;
+      ref_ptr = force_gimple_operand (build_addr (ref, current_function_decl),
+				      &stmts, true, NULL_TREE);
+      if (stmts)
+	bsi_insert_on_edge (e, stmts);
+      call = built_in_decls[BUILT_IN_FETCH_AND_ADD_1 + gcov_type_size_log2];
+      call = build_call_expr (call, 2, ref_ptr, one);
+      bsi_insert_on_edge (e, call);
+    }
 }
 
 /* Emits code to get VALUE to instrument at BSI, and returns the
Index: Makefile.in
===================================================================
--- Makefile.in	(revision 122909)
+++ Makefile.in	(working copy)
@@ -1204,6 +1204,9 @@ LIBGCOV = _gcov _gcov_merge_add _gcov_me
     _gcov_execv _gcov_execvp _gcov_execve \
     _gcov_interval_profiler _gcov_pow2_profiler _gcov_one_value_profiler \
     _gcov_indirect_call_profiler _gcov_average_profiler _gcov_ior_profiler \
+    _gcov_interval_profiler_r _gcov_pow2_profiler_r _gcov_one_value_profiler_r \
+    _gcov_indirect_call_profiler_r _gcov_average_profiler_r \
+    _gcov_ior_profiler_r \
     _gcov_merge_ior
 
 FPBIT_FUNCS = _pack_sf _unpack_sf _addsub_sf _mul_sf _div_sf \
Index: libgcov.c
===================================================================
--- libgcov.c	(revision 122909)
+++ libgcov.c	(working copy)
@@ -724,6 +724,22 @@ __gcov_interval_profiler (gcov_type *cou
 }
 #endif
 
+#ifdef L_gcov_interval_profiler_r
+void
+__gcov_interval_profiler_r (gcov_type *counters, gcov_type value,
+			  int start, unsigned steps)
+{
+  gcov_type delta = value - start;
+  if (delta < 0)
+    counters = &counters[steps + 1];
+  else if (delta >= steps)
+    counters = &counters[steps];
+  else
+    counters = &counters[delta];
+  __sync_fetch_and_add (counters, 1);
+}
+#endif
+
 #ifdef L_gcov_pow2_profiler
 /* If VALUE is a power of two, COUNTERS[1] is incremented.  Otherwise
    COUNTERS[0] is incremented.  */
@@ -738,6 +754,16 @@ __gcov_pow2_profiler (gcov_type *counter
 }
 #endif
 
+#ifdef L_gcov_pow2_profiler_r
+void
+__gcov_pow2_profiler_r (gcov_type *counters, gcov_type value)
+{
+  if ((value & (value - 1)) == 0)
+    counters = counters + 1;
+  __sync_fetch_and_add (counters, 1);
+}
+#endif
+
 /* Tries to determine the most common value among its inputs.  Checks if the
    value stored in COUNTERS[0] matches VALUE.  If this is the case, COUNTERS[1]
    is incremented.  If this is not the case and COUNTERS[1] is not zero,
@@ -763,6 +789,53 @@ __gcov_one_value_profiler_body (gcov_typ
   counters[2]++;
 }
 
+static inline void
+__gcov_one_value_profiler_body_r (gcov_type *counters, gcov_type value)
+{
+  /* We want to atomically update counter 0 and 1, but they depend on each
+     other, so we have no choice than to build a critical section.  We are
+     using counter 1 as spinlock for that section, with an value of
+     -1 as detector that we've entered it.  For that we need to make sure
+     that -1 is no valid value for counter 1 in the normal flow of operation.
+     We ensure that by special casing the increment.  Zero is no good
+     guard value, because that is how those counters start.  We're not using
+     a simple static int for fear of lock contention.  */
+  gcov_type old_val;
+  while (1)
+    {
+      gcov_type old2;
+      /* We need the old value, which we need to read atomically, but
+         the only primitive we have for that is compare_and_swap :-/ */
+      old_val = __sync_val_compare_and_swap (&counters[1], 0, 0);
+      if (old_val == (gcov_type)-1)
+	continue;
+      old2 = __sync_val_compare_and_swap (&counters[1], old_val, (gcov_type)-1);
+      if (old2 == old_val)
+        break;
+    }
+  /* counters[1] is now -1, we're sure we're the only thread here,
+     and old_val contains the original value of counters[1].  */
+  if (value == counters[0])
+    old_val++;
+  else if (old_val == 0)
+    {
+      old_val = 1;
+      counters[0] = value;
+    }
+  else
+    old_val--;
+  /* If we've overflowed make sure we're not using that value as it's
+     also our lock guard value.  Just overflow some more to 0.  */
+  if (old_val == (gcov_type)-1)
+    old_val = 0;
+  counters[2]++;
+
+  /* Release the lock by writing something different from -1.  But as this
+     is possibly a long long even writes aren't atomic, so use
+     compare_and_swap again.  */
+  __sync_val_compare_and_swap (&counters[1], (gcov_type)-1, old_val);
+}
+
 #ifdef L_gcov_one_value_profiler
 void
 __gcov_one_value_profiler (gcov_type *counters, gcov_type value)
@@ -771,6 +844,14 @@ __gcov_one_value_profiler (gcov_type *co
 }
 #endif
 
+#ifdef L_gcov_one_value_profiler_r
+void
+__gcov_one_value_profiler_r (gcov_type *counters, gcov_type value)
+{
+  __gcov_one_value_profiler_body_r (counters, value);
+}
+#endif
+
 #ifdef L_gcov_indirect_call_profiler
 /* Tries to determine the most common value among its inputs. */
 void
@@ -782,6 +863,16 @@ __gcov_indirect_call_profiler (gcov_type
 }
 #endif
 
+#ifdef L_gcov_indirect_call_profiler_r
+void
+__gcov_indirect_call_profiler_r (gcov_type* counter, gcov_type value, 
+			       void* cur_func, void* callee_func)
+{
+  if (cur_func == callee_func)
+    __gcov_one_value_profiler_body_r (counter, value);
+}
+#endif
+
 
 #ifdef L_gcov_average_profiler
 /* Increase corresponding COUNTER by VALUE.  FIXME: Perhaps we want
@@ -795,6 +886,15 @@ __gcov_average_profiler (gcov_type *coun
 }
 #endif
 
+#ifdef L_gcov_average_profiler_r
+void
+__gcov_average_profiler_r (gcov_type *counters, gcov_type value)
+{
+  __sync_fetch_and_add (&counters[0], value);
+  __sync_fetch_and_add (&counters[1], 1);
+}
+#endif
+
 #ifdef L_gcov_ior_profiler
 /* Increase corresponding COUNTER by VALUE.  FIXME: Perhaps we want
    to saturate up.  */
@@ -806,6 +906,14 @@ __gcov_ior_profiler (gcov_type *counters
 }
 #endif
 
+#ifdef L_gcov_ior_profiler_r
+void
+__gcov_ior_profiler_r (gcov_type *counters, gcov_type value)
+{
+  __sync_fetch_and_or (counters, value);
+}
+#endif
+
 #ifdef L_gcov_fork
 /* A wrapper for the fork function.  Flushes the accumulated profiling data, so
    that they are not counted twice.  */
Index: config/i386/sync.md
===================================================================
--- config/i386/sync.md	(revision 122909)
+++ config/i386/sync.md	(working copy)
@@ -98,6 +98,15 @@ (define_insn "sync_double_compare_and_sw
   ""
   "lock{\;| }cmpxchg<doublemodesuffix>b{\t| }%1")
 
+;; Theoretically we'd like to use constraint "r" (any reg) for operand
+;; 3, but that includes ecx.  If operand 3 and 4 are the same (like when
+;; the input is -1LL) GCC might chose to allocate operand 3 to ecx, like
+;; operand 4.  This breaks, as the xchg will move the PIC register contents
+;; to %ecx then --> boom.  Operands 3 and 4 really need to be different
+;; registers, which in this case means operand 3 must not be ecx.
+;; Instead of playing tricks with fake early clobbers or the like we
+;; just enumerate all regs possible here, which (as this is !TARGET_64BIT)
+;; are just esi and edi.
 (define_insn "*sync_double_compare_and_swapdi_pic"
   [(set (match_operand:DI 0 "register_operand" "=A")
 	(match_operand:DI 1 "memory_operand" "+m"))
@@ -105,7 +114,7 @@ (define_insn "*sync_double_compare_and_s
 	(unspec_volatile:DI
 	  [(match_dup 1)
 	   (match_operand:DI 2 "register_operand" "A")
-	   (match_operand:SI 3 "register_operand" "r")
+	   (match_operand:SI 3 "register_operand" "SD")
 	   (match_operand:SI 4 "register_operand" "c")]
 	  UNSPECV_CMPXCHG_1))
    (clobber (reg:CC FLAGS_REG))]
@@ -189,6 +198,8 @@ (define_insn "sync_double_compare_and_sw
   ""
   "lock{\;| }cmpxchg<doublemodesuffix>b{\t| }%1")
 
+;; See above for the explanation of using the constraint "SD" for
+;; operand 3.
 (define_insn "*sync_double_compare_and_swap_ccdi_pic"
   [(set (match_operand:DI 0 "register_operand" "=A")
 	(match_operand:DI 1 "memory_operand" "+m"))
@@ -196,7 +207,7 @@ (define_insn "*sync_double_compare_and_s
 	(unspec_volatile:DI
 	  [(match_dup 1)
 	   (match_operand:DI 2 "register_operand" "A")
-	   (match_operand:SI 3 "register_operand" "r")
+	   (match_operand:SI 3 "register_operand" "SD")
 	   (match_operand:SI 4 "register_operand" "c")]
 	  UNSPECV_CMPXCHG_1))
    (set (reg:CCZ FLAGS_REG)
Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 122909)
+++ doc/invoke.texi	(working copy)
@@ -332,7 +332,7 @@ Objective-C and Objective-C++ Dialects}.
 -fno-toplevel-reorder -fno-trapping-math  -fno-zero-initialized-in-bss @gol
 -fomit-frame-pointer  -foptimize-register-move @gol
 -foptimize-sibling-calls  -fprefetch-loop-arrays @gol
--fprofile-generate -fprofile-use @gol
+-fprofile-generate -fprofile-use -fprofile-thread-safe@gol
 -fregmove  -frename-registers @gol
 -freorder-blocks  -freorder-blocks-and-partition -freorder-functions @gol
 -frerun-cse-after-loop @gol
@@ -5867,8 +5867,24 @@ By default, GCC emits an error message i
 match the source code.  This error can be turned into a warning by using
 @option{-Wcoverage-mismatch}.  Note this may result in poorly optimized
 code.
+
+@item -fprofile-thread-safe
+@opindex fprofile-thread-safe
+Together with @option{-fprofile-generate} this will generate thread safe
+profiling code.  Normally this code isn't thread-safe, so the profile counters
+might be wrong in multi-threaded programs.  Usually they won't be off very
+far but if exact profile counters are needed you need this option.
+This comes at a performance price when running the program
+generating the profiles.  Only some architectures support the atomic
+instructions necessary for this.  If your architecture doesn't support them,
+you will get link errors about missing symbols __sync_fetch_and_add_8.
+
+GCC also has to be configured correctly, so that libgcov.a (the library
+containing some of the support code) is compiled for an architecture
+supporting these atomic instructions.
 @end table
 
+
 The following options control compiler behavior regarding floating
 point arithmetic.  These options trade off between speed and
 correctness.  All must be specifically enabled.


