Re: [PATCH] add self-tuning to x86 hardware fast path in libitm


Hello everyone,

after a summer internship and some backlog catching up in the past
weeks, I have finally gotten around to revising the patch according
to the latest discussion.

The changes have not caused any regressions, and the latest speedup
results are consistent with what we had before.


In the following I add some comments inline to the discussion; the
updated patch follows after that.


> >
> > So let's iterate on this in parallel with the other changes that we need
> > to get in place.  I'd prefer to have some more confidence that measuring
> > txn throughput in epochs is the best way forward.
> >
> > Here are a few thoughts:
> >
> > Why do we actually need to measure succeeded transactions?  If a HW txn
> > is just getting committed without interfering with anything else, is
> > this really different from, say, two HW txns getting committed?  Aren't
> > we really just interested in wasted work, or delayed txns?  That may
> > help taking care of the nontxnal vs. txnal problem.
> >
> > Measuring committed txns during a time that might otherwise be spent by
> > a serial txns could be useful to figure out how much other useful work a
> > serial txn prevents.  But we'd see that as well if we'd just go serial
> > during the auto-tuning because then concurrent txns will have to wait;
> > and in this case we could measure it in the slow path of those
> > concurrent txns (ie, during waiting, where measurement overhead wouldn't
> > matter as much).
> >
> > If a serial txn is running, concurrent txns (that wait) are free to sync
> > and tell the serial how much cost it creates for the concurrent txns.
> > There, txn length could matter, but we won't find out for real until
> > after the concurrent ones have run (they could be pretty short, so we
> > can't simply assume that the concurrent ones are as long as the serial
> > one, so that simply the number of concurrent ones can be used to
> > calculate delayed work).


I understand your concern: we would be measuring failure in a slow
path rather than success in the fast/common path.

My problem is that the approach taken in our work is tailored to
assessing one given metric, in our case some form of throughput of
successfully executed atomic blocks.
However, to measure failure, we seem to need two things:
 1) the rate of failed hardware transactions
 2) the time spent waiting for the serial lock

In short, performance will be bad if the mix of those two issues is
bad enough. The problem is how to define a metric that considers the
two together, and it is not obvious to me that there is one.
Furthermore, that deviates significantly from our approach; in the
end, even if there is a solution that works in that way, it is a
different approach, not one that can be applied to our self-tuning
framework.

What we are doing right now is incrementing a counter in a (most
likely cached) line that is single-writer and multiple-reader. Most
often, it will be owned by the single writer. As such, the cost
should be quite negligible compared to the execution of the atomic
block, even more so when a hardware transaction was used, as the
commit itself should take a hundred cycles or more.
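
To make that concrete, here is a minimal sketch of the counting
scheme (illustrative only, not the patch code; all names are made
up):

  #include <atomic>
  #include <cstddef>
  #include <cstdint>

  struct thread_stats
  {
    // Written only by the owning thread; aligned so each counter sits
    // in its own cache line and readers cause no false sharing.
    alignas(64) std::atomic<uint64_t> txns_committed{0};
  };

  // Fast path: single-writer increment, no read-modify-write needed.
  inline void on_commit(thread_stats &self)
  {
    self.txns_committed.store(
      self.txns_committed.load(std::memory_order_relaxed) + 1,
      std::memory_order_relaxed);
  }

  // Slow path (re-optimizing thread): sum all per-thread counters.
  inline uint64_t total_txns(const thread_stats *stats, size_t n)
  {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i)
      sum += stats[i].txns_committed.load(std::memory_order_relaxed);
    return sum;
  }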

A good way to measure this impact is to use the new "htm_notune" mode
(selected through the ITM_DEFAULT_METHOD environment variable, see
retry.cc below) and compare its performance with the unmodified
libitm that I use as a baseline above. The average results are
similar, with no trend noticeable outside the standard deviation in
either direction.




>
> >
> >>
> >> > Also, note that the mitigation strategy for rdtsc
> >> > short-comings that you mention in the paper is not applicable in
> >> > general, specifically not in libitm.
> >>
> >>
> >> I suppose you mean the preemption of threads inflating the cycles measured?
> >
> > Yes, preemption and migration of threads (in case there's no global
> > sync'ed TSC or similar) -- you mentioned in the paper that you pin
> > threads to cores...
> >
> >> This would be similarly a problem to any time source that tries to measure the
> >> amount of work performed; not sure how we can avoid it in general. Any thoughts?
> >
> > Not really as long as we keep depending on measuring time in a
> > light-weight way.  Measuring smaller time intervals could make it less
> > likely that preemption happens during such an interval, though.
> >


Note that in this patch we use no such pinning. In fact, I have
measured that it does not have a noticeable impact on performance.
Maybe that would change under heavy over-subscription of the cores
with many threads. Still, it is not a correctness issue, so I believe
that performing well in the common case should be the deciding
factor.



>
> >>
> >> > Another issue is that we need to implement the tuning in a portable way.
> >> > You currently make it depend on whether the assembler knows about RTM,
> >> > whereas the rest of the code makes this more portable and defers to
> >> > arch-specific files.  I'd prefer if we could use the tuning on other
> >> > archs too.  But for that, we need to cleanly separate generic from
> >> > arch-specific parts.  That affects magic numbers as well as things like
> >> > rdtsc().
> >>
> >>
> >> Yes, I refrained from adding new calls to the arch-specific files, to
> >> contain the
> >> changes mainly. But that is possible and that's part of the feedback I
> >> was hoping
> >> to get.
> >
> > OK.  Let me know if you want further input regarding this.


I have added this arch-specific separation in the new patch. PTAL.



>
> >
> >> > I'm wondering about whether it really makes sense to treat XABORT like
> >> > conflicts and other abort reasons, instead of like capacity aborts.
> >> > Perhaps we need to differentiate between the explicit abort codes glibc
> >> > uses, so that we can distinguish between cases where an abort is
> >> > supposed to signal incompatibility with txnal execution and cases where
> >> > it's just used for lock elision (see sysdeps/unix/sysv/linux/x86/hle.h
> >> > in current glibc):
> >> > #define _ABORT_LOCK_BUSY        0xff
> >> > #define _ABORT_LOCK_IS_LOCKED   0xfe
> >> > #define _ABORT_NESTED_TRYLOCK   0xfd
> >>
> >>
> >> I am not sure I follow: are you suggesting to consider other aborts besides
> >> those of capacity?
> >
> > We may want to do this with UCB, similar to capacity.  Another option
> > would be to, for now, make fixed choices regarding whether they are
> > considered permanent or not.
> > _ABORT_NESTED_TRYLOCK is an incompatibility with txnal execution
> > (similar to illegal instructions and such).
> > _ABORT_LOCK_BUSY is failing lock elision due to the lock being already
> > acquired; thus, this is transient I'd say.
> >
> > Andi, what was _ABORT_LOCK_IS_LOCKED used for again?


This seems a more or less trivial change once it is agreed upon, so I
suggest we leave it for later, if/when this patch gets through.


 >
>
> >> > > --- libitm/config/x86/target.h (revision 219316)
> >> > > +++ libitm/config/x86/target.h (working copy)
> >> > > @@ -126,12 +126,25 @@ htm_abort_should_retry (uint32_t begin_ret)
> >> > >    return begin_ret & _XABORT_RETRY;
> >> > >  }
> >> > >
> >> > > +static inline bool
> >> > > +htm_is_capacity_abort(uint32_t begin_ret)
> >> > > +{
> >> > > +  return begin_ret & _XABORT_CAPACITY;
> >> > > +}
> >> >
> >> > Is this indeed just about capacity, or do we actually mean a more
> >> > general situation?
> >>
> >> The idea of the Upper Confidence Bounds (UCB) part of the algorithm was
> >> to detect "real" capacity aborts (i.e., permanent) vs those that are
> >> "transient".
> >> So in this case we really wanted to know if the abort was a capacity one so
> >> that we could direct the UCB algorithm and understand the "type" of capacity
> >> abort.
> >
> > OK.  But do you want to single out capacity aborts as one case for which
> > you want to detect permanent vs. transient, or do you want to generally
> > find out whether aborts that are reported as permanent (e.g., capacity)
> > are indeed permanent?  In the latter case, a more generic name would be
> > more appropriate.  (And that's not just a naming thing, but guidance for
> > other archs regarding what they should put in the "capacity" bucket of
> > abort reasons...)
> >
> >


As far as we know, capacity aborts are the ones that fit our
criteria: they are identifiable as a distinct abort category, yet
they are not necessarily persistent/deterministic. This could be
expanded to other abort types in the future, but it is not something
we tested, as other abort types are not as widespread.
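
For reference, the score that reoptimize_htm_execution (in the patch
below) computes for each capacity-abort mode is a standard UCB1
index; in plain form, with the names used in the patch:

  total     = txns_while_stubborn + txns_while_halven + txns_while_giveup
  ucb(mode) = (txns_while_mode / cycles_while_mode) / best_ever_throughput
              + sqrt(2 * ln(total) / txns_while_mode)

The first term is the exploitation part (normalized throughput
observed under that mode); the square root is the exploration bonus,
which shrinks as a mode gets sampled more often.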





Updated patch follows:

Index: util.cc
===================================================================
--- util.cc (revision 219316)
+++ util.cc (working copy)
@@ -94,4 +94,15 @@ xrealloc (void *old, size_t size, bool separate_cl
   return r;
 }

+// State for the generation is not synchronized. It assumes single thread
+// usage at any time. It is, in fact, invoked by the thread that is
+// re-optimizing the libitm HTM usage, so the assumption holds.
+uint32_t
+random_int(uint32_t max)
+{
+  static uint32_t seed = 123456789;
+  seed = (1103515245 * seed + 12345);
+  // Bound the result to [0, max), as the callers expect.
+  return seed % max;
+}
+
 } // namespace GTM
Index: libitm.h
===================================================================
--- libitm.h (revision 219316)
+++ libitm.h (working copy)
@@ -101,7 +101,11 @@ typedef enum
    /* These are not part of the ABI but used for custom HTM fast paths.  See
       ITM_beginTransaction and gtm_thread::begin_transaction.  */
    pr_HTMRetryableAbort = 0x800000,
-   pr_HTMRetriedAfterAbort = 0x1000000
+   pr_HTMRetriedAfterAbort = 0x1000000,
+   /* Reports whether a capacity abort was detected in the custom HTM fast
+      path, so that it can be used in the abort handler in
+      gtm_thread::begin_transaction.  */
+   pr_HTMCapacityAbort          = 0x2000000
 } _ITM_codeProperties;

 /* Result from startTransaction that describes what actions to take.
Index: method-serial.cc
===================================================================
--- method-serial.cc (revision 219316)
+++ method-serial.cc (working copy)
@@ -223,6 +223,19 @@ struct htm_mg : public method_group
     // initially disabled.
 #ifdef USE_HTM_FASTPATH
     htm_fastpath = htm_init();
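+    // Seed the self-tuning state: start in STUBBORN mode, and give
+    // the per-mode statistics neutral non-zero values so that the
+    // first UCB computation is well defined.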
+    optimizer.optimized_mode = STUBBORN;
+    optimizer.last_cycles = get_efficient_time();
+    optimizer.last_total_txs_executed = 0;
+    optimizer.last_throughput = optimizer.MIN_THROUGHPUT;
+    optimizer.last_attempts = htm_fastpath > 0 ? htm_fastpath - 1 : 1;
+    optimizer.best_ever_throughput = optimizer.last_throughput;
+    optimizer.best_ever_attempts = htm_fastpath;
+    optimizer.txns_while_stubborn = 1;
+    optimizer.cycles_while_stubborn = optimizer.INIT_CYCLES;
+    optimizer.txns_while_halven = 1;
+    optimizer.cycles_while_halven = optimizer.INIT_CYCLES;
+    optimizer.txns_while_giveup = 1;
+    optimizer.cycles_while_giveup = optimizer.INIT_CYCLES;
 #endif
   }
   virtual void fini()
Index: retry.cc
===================================================================
--- retry.cc (revision 219316)
+++ retry.cc (working copy)
@@ -218,7 +218,6 @@ GTM::gtm_thread::set_default_dispatch(GTM::abi_dis
   default_dispatch.store(disp, memory_order_relaxed);
 }

-
 static GTM::abi_dispatch*
 parse_default_method()
 {
@@ -254,10 +253,17 @@ parse_default_method()
       disp = GTM::dispatch_ml_wt();
       env += 5;
     }
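+  // Note: this must be tested before the plain "htm" prefix below,
+  // which would otherwise also match "htm_notune".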
+  else if (strncmp(env, "htm_notune", 10) == 0)
+    {
+      disp = GTM::dispatch_htm();
+      env += 10;
+      GTM::optimizer.set_is_enabled(0);
+    }
   else if (strncmp(env, "htm", 3) == 0)
     {
       disp = GTM::dispatch_htm();
       env += 3;
+      GTM::optimizer.set_is_enabled(1);
     }
   else
     goto unknown;
Index: beginend.cc
===================================================================
--- beginend.cc (revision 219316)
+++ beginend.cc (working copy)
@@ -39,6 +39,9 @@ unsigned GTM::gtm_thread::number_of_threads = 0;
 gtm_stmlock GTM::gtm_stmlock_array[LOCK_ARRAY_SIZE];
 atomic<gtm_version> GTM::gtm_clock;

+// Holds the optimized configuration for parameters relevant to HTM execution.
+gtm_global_optimizer GTM::optimizer;
+
 /* ??? Move elsewhere when we figure out library initialization.  */
 uint64_t GTM::gtm_spin_count_var = 1000;

@@ -95,7 +98,139 @@ thread_exit_init()
     GTM_fatal("Creating thread release TLS key failed.");
 }

+/* Implementations for the fixed-point arithmetic type. */
+const uint32_t GTM::fixed_pt_t::FIXED_PT_WIDTH = 64;
+const uint32_t GTM::fixed_pt_t::FIXED_PT_INTEGER_WIDTH = 40;
+const uint32_t GTM::fixed_pt_t::FIXED_PT_FRAC_WIDTH =
+  GTM::fixed_pt_t::FIXED_PT_WIDTH - GTM::fixed_pt_t::FIXED_PT_INTEGER_WIDTH;
+const fixed_pt_t GTM::fixed_pt_t::FIXED_PT_ONE = GTM::fixed_pt_t::one();
+const fixed_pt_t GTM::fixed_pt_t::FIXED_PT_TWO =
+  GTM::fixed_pt_t::FIXED_PT_ONE.add(GTM::fixed_pt_t::FIXED_PT_ONE);
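+// Initial, fictional, throughput value used to seed the optimizer
+// (declared in libitm_i.h; defined here because it is not integral).
+const fixed_pt_t GTM::gtm_global_optimizer::MIN_THROUGHPUT =
+  GTM::fixed_pt_t(0.001);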

+fixed_pt_t
+GTM::fixed_pt_t::one()
+{
+  fixed_pt_t result(0);
+  result.number = static_cast<uint64_t>(1) << FIXED_PT_FRAC_WIDTH;
+  return result;
+}
+
+fixed_pt_t
+GTM::fixed_pt_t::add(const fixed_pt_t operand) const
+{
+  fixed_pt_t result(0);
+  result.number = this->number + operand.number;
+  return result;
+}
+
+fixed_pt_t
+GTM::fixed_pt_t::sub(const fixed_pt_t operand) const
+{
+  fixed_pt_t result(0);
+  result.number = this->number - operand.number;
+  return result;
+}
+
+fixed_pt_t
+GTM::fixed_pt_t::mul(const fixed_pt_t operand) const
+{
+  fixed_pt_t result(0);
+  result.number = (static_cast<__uint128_t>(this->number) *
+     static_cast<__uint128_t>(operand.number)) >> FIXED_PT_FRAC_WIDTH;
+  return result;
+}
+
+fixed_pt_t
+GTM::fixed_pt_t::div(const fixed_pt_t operand) const
+{
+  fixed_pt_t result(0);
+  result.number =
+    (static_cast<__uint128_t>(this->number) << FIXED_PT_FRAC_WIDTH) /
+      static_cast<__uint128_t>(operand.number);
+  return result;
+}
+
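+// Fixed-point square root via the Babylonian (Newton) iteration;
+// arguments below one are inverted first and the result inverted back.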
+fixed_pt_t
+GTM::fixed_pt_t::sqrt() const
+{
+  uint64_t invert = 0;
+  uint64_t iter = FIXED_PT_FRAC_WIDTH;
+  uint64_t i;
+  fixed_pt_t n = *this;
+
+  if (n.number == 0 || n.number == FIXED_PT_ONE.number)
+    {
+      return n;
+    }
+  if (n.number < FIXED_PT_ONE.number && n.number > 6)
+    {
+      invert = 1;
+      n = FIXED_PT_ONE.div(n);
+    }
+  if (n.number > FIXED_PT_ONE.number)
+    {
+      uint64_t s = n.number;
+      iter = 0;
+      while (s > 0)
+        {
+          s >>= 2;
+          iter++;
+        }
+    }
+
+  fixed_pt_t l(0);
+  l.number = (n.number >> 1) + 1;
+  for (i = 0; i < iter; i++)
+    {
+      l.number = l.add(n.div(l)).number >> 1;
+    }
+  if (invert)
+    {
+      return FIXED_PT_ONE.div(l);
+    }
+  return l;
+}
+
+fixed_pt_t
+GTM::fixed_pt_t::ln() const
+{
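+  // Natural logarithm via argument reduction plus a polynomial
+  // approximation of ln(1 + f), following the approach of fdlibm's
+  // e_log (the LG[] coefficients below are the fdlibm ones).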
+  const fixed_pt_t LN2 = fixed_pt_t(0.69314718055994530942);
+  const fixed_pt_t LG[7] = {
+    fixed_pt_t(6.666666666666735130e-01),
+    fixed_pt_t(3.999999999940941908e-01),
+    fixed_pt_t(2.857142874366239149e-01),
+    fixed_pt_t(2.222219843214978396e-01),
+    fixed_pt_t(1.818357216161805012e-01),
+    fixed_pt_t(1.531383769920937332e-01),
+    fixed_pt_t(1.479819860511658591e-01)
+  };
+
+  if (this->number == 0)
+    {
+      return 0xffffffff;
+    }
+
+  fixed_pt_t log2 = fixed_pt_t(0);
+  fixed_pt_t xi = *this;
+  while (xi.number > FIXED_PT_TWO.number)
+    {
+      xi.number >>= 1;
+      log2.number++;
+    }
+
+  fixed_pt_t f = xi.sub(FIXED_PT_ONE);
+  fixed_pt_t s = f.div(FIXED_PT_TWO.add(f));
+  fixed_pt_t z = s.mul(s);
+  fixed_pt_t w = z.mul(z);
+  fixed_pt_t R =
+    w.mul(LG[1].add(w.mul(LG[3].add(w.mul(LG[5]))))).add(
+      z.mul(LG[0].add(w.mul(LG[2].add(
+        w.mul(LG[4].add(w.mul(LG[6]))))))));
+
+  // Note: log2.number holds a raw integer count, and the fixed_pt_t
+  // constructor scales by FIXED_PT_ONE itself, so the value must not
+  // be shifted by FIXED_PT_FRAC_WIDTH again here.
+  return LN2.mul(fixed_pt_t(static_cast<double>(log2.number))).add(
+    f.sub(s.mul(f.sub(R))));
+}
+
 GTM::gtm_thread::~gtm_thread()
 {
   if (nesting > 0)
@@ -149,6 +284,180 @@ choose_code_path(uint32_t prop, abi_dispatch *disp
     return a_runInstrumentedCode;
 }

+// Periodically invoked by some thread(s) that are in the HTM fall-back path.
+// This implements the Upper Confidence Bounds and Gradient Descent to tune,
+// respectively, how to deal with HTM capacity aborts and how many HTM retries
+// to allow. For more details, check Diegues et al. in USENIX ICAC 2014.
+void
+GTM::gtm_global_optimizer::reoptimize_htm_execution()
+{
+  // Avoid redundant re-optimization if another thread may be doing it already.
+  // This allows the threads to avoid racing to re-optimize concurrently, which
+  // is useless.
+
+  if (unlikely(GTM::gtm_thread::serial_lock.is_write_locked()))
+    {
+      return;
+    }
+  // Eventually, some thread(s) will follow this path, even in the face of
+  // spurious short-cut exits due to the serial lock being taken.
+  GTM::gtm_thread::serial_lock.write_lock();
+
+  // Collect the statistics obtained so far.
+  uint64_t total_txs_executed = 0;
+  gtm_thread *ptr = GTM::gtm_thread::list_of_threads;
+  for (; ptr; ptr = ptr->next_thread)
+    {
+      total_txs_executed += ptr->number_executed_txns;
+    }
+
+  // Obtain the delta performance with respect to the last period measured.
+  // We expect the underlying architecture to provide an efficient way to
+  // measure time passed since the last re-optimization.
+  const uint64_t current_cycles = get_efficient_time();
+  const uint64_t cycles_used = current_cycles - optimizer.last_cycles;
+  const uint64_t new_txs_executed =
+    total_txs_executed - optimizer.last_total_txs_executed;
+  const fixed_pt_t current_throughput =
+    fixed_pt_t(new_txs_executed).div(fixed_pt_t(cycles_used));
+
+  if (current_throughput.number == 0)
+    {
+      // This thread probably executed right after another one as they both
+      // found it was time to re-optimize, but did not contend on the serial
+      // lock. As such, we avoid redundant re-optimizations and short cut out.
+      GTM::gtm_thread::serial_lock.write_unlock();
+      return;
+    }
+
+  optimizer.last_cycles = current_cycles;
+  optimizer.last_total_txs_executed = total_txs_executed;
+  if (optimizer.optimized_mode == STUBBORN)
+    {
+      optimizer.txns_while_stubborn += new_txs_executed;
+      optimizer.cycles_while_stubborn += cycles_used;
+    }
+  else if (optimizer.optimized_mode == HALVEN)
+    {
+      optimizer.txns_while_halven += new_txs_executed;
+      optimizer.cycles_while_halven += cycles_used;
+    }
+  else
+    {
+      optimizer.txns_while_giveup += new_txs_executed;
+      optimizer.cycles_while_giveup += cycles_used;
+    }
+
+  // Compute Upper Confidence Bounds for the mode to choose next to decide on
+  // how to deal with HTM capacity aborts.
+  // Use fixed point arithmetic types to spare floating point usage.
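+  // For each mode m:
+  //   ucb_m = (txns_m / cycles_m) / best_ever_throughput
+  //           + sqrt(2 * ln(total txns) / txns_m)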
+  const fixed_pt_t log_sum =
+    GTM::fixed_pt_t::FIXED_PT_TWO.mul(
+      (fixed_pt_t(optimizer.txns_while_stubborn).add(
+       fixed_pt_t(optimizer.txns_while_halven)).add(
+       fixed_pt_t(optimizer.txns_while_giveup))).ln());
+  const fixed_pt_t ucb_stubborn =
+    ((fixed_pt_t(optimizer.txns_while_stubborn).div(
+      fixed_pt_t(optimizer.cycles_while_stubborn))).div(
+      optimizer.best_ever_throughput)).add(
+     (log_sum.div(fixed_pt_t(optimizer.txns_while_stubborn))).sqrt());
+  const fixed_pt_t ucb_halven =
+    ((fixed_pt_t(optimizer.txns_while_halven).div(
+      fixed_pt_t(optimizer.cycles_while_halven))).div(
+      optimizer.best_ever_throughput)).add(
+     (log_sum.div(fixed_pt_t(optimizer.txns_while_halven))).sqrt());
+  const fixed_pt_t ucb_giveup =
+    ((fixed_pt_t(optimizer.txns_while_giveup).div(
+      fixed_pt_t(optimizer.cycles_while_giveup)).div(
+      optimizer.best_ever_throughput))).add(
+     (log_sum.div(fixed_pt_t(optimizer.txns_while_giveup))).sqrt());
+
+  if (ucb_stubborn.number > ucb_halven.number &&
+      ucb_stubborn.number > ucb_giveup.number)
+    {
+      optimizer.optimized_mode = STUBBORN;
+    }
+  else if (ucb_halven.number > ucb_giveup.number)
+    {
+      optimizer.optimized_mode = HALVEN;
+    }
+  else
+    {
+      optimizer.optimized_mode = GIVEUP;
+    }
+
+  // Compute Gradient Descent to regulate the number of retries in HTM.
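+  // If throughput improved since the last period, keep moving the
+  // number of retries in the same direction as the last change; if it
+  // degraded, reverse direction. A large degradation restores the best
+  // configuration seen so far, and with 1% probability we jump to a
+  // random retry count to escape local maxima.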
+  const fixed_pt_t LARGE_DEGRADATION = fixed_pt_t(1.40);
+  const fixed_pt_t MIN_IMPROVEMENT = fixed_pt_t(1.05);
+  const int32_t MAX_RETRIES = 20;
+  const int32_t MIN_RETRIES = 2;
+
+  const fixed_pt_t change_for_better =
+    current_throughput.div(optimizer.last_throughput);
+  const fixed_pt_t change_for_worse =
+    optimizer.last_throughput.div(current_throughput);
+  int32_t last_attempts = optimizer.last_attempts;
+  int32_t current_attempts = htm_fastpath;
+  int32_t new_attempts = current_attempts;
+  if (unlikely(change_for_worse.number > LARGE_DEGRADATION.number))
+    {
+      htm_fastpath = optimizer.best_ever_attempts;
+      optimizer.last_throughput = current_throughput;
+      optimizer.last_attempts = current_attempts;
+      goto finish_reoptimization;
+    }
+
+  if (unlikely(random_int(100) == 0))
+    {
+      optimizer.last_attempts = htm_fastpath;
+      optimizer.last_throughput = current_throughput;
+      htm_fastpath = random_int(MAX_RETRIES - MIN_RETRIES) + MIN_RETRIES;
+      goto finish_reoptimization;
+    }
+
+  if (change_for_better.number > MIN_IMPROVEMENT.number)
+    {
+      new_attempts += current_attempts - last_attempts;
+      if (current_attempts == last_attempts)
+        {
+          new_attempts += random_int(2) == 0 ? -1 : 1;
+        }
+    }
+  else if (change_for_worse.number > MIN_IMPROVEMENT.number)
+    {
+      new_attempts -= current_attempts - last_attempts;
+      if (current_attempts == last_attempts)
+        {
+          new_attempts += random_int(2) == 0 ? -1 : 1;
+        }
+    }
+
+  if (unlikely(new_attempts > MAX_RETRIES))
+    {
+      new_attempts = MAX_RETRIES;
+    }
+  else if (unlikely(new_attempts < MIN_RETRIES))
+    {
+      new_attempts = MIN_RETRIES;
+    }
+
+  if (current_attempts != new_attempts)
+    {
+      optimizer.last_attempts = current_attempts;
+      optimizer.last_throughput = current_throughput;
+    }
+
+  htm_fastpath = new_attempts;
+  if (unlikely(current_throughput.number >
+               optimizer.best_ever_throughput.number))
+    {
+      optimizer.best_ever_throughput = current_throughput;
+      optimizer.best_ever_attempts = current_attempts;
+    }
+
+ finish_reoptimization:
+  GTM::gtm_thread::serial_lock.write_unlock();
+}
+
 uint32_t
 GTM::gtm_thread::begin_transaction (uint32_t prop, const gtm_jmpbuf *jb)
 {
@@ -209,8 +518,36 @@ GTM::gtm_thread::begin_transaction (uint32_t prop,
     }
   // The transaction has aborted.  Don't retry if it's unlikely that
   // retrying the transaction will be successful.
-  if (!htm_abort_should_retry(ret))
-    break;
+  if (optimizer.is_enabled)
+    {
+      // If the HTM self-tuning is enabled, then we act according to
+      // its configuration.
+      if (htm_is_capacity_abort(ret))
+        {
+          gtm_capacity_abort_mode capacity_mode =
+            optimizer.optimized_mode;
+          // Adapt the retries according to the mode. Note that the
+          // linear decrease is implicit in the outer for loop.
+          if (capacity_mode == HALVEN)
+            t = t / 2 + 1;
+          else if (capacity_mode == GIVEUP)
+            break;
+        }
+
+      // Take the chance of having aborted to possibly re-optimize
+      // the HTM configuration. Do this as a last resort before going
+      // for the serial lock.
+      if (unlikely(tx->number_executed_txns %
+                   optimizer.OPTIMIZATION_PERIOD_TXS == 0 && t == 1))
+        optimizer.reoptimize_htm_execution();
+    }
+  else
+    {
+      // Execute the fixed logic when there is no adaptation.
+      if (!htm_abort_should_retry(ret))
+        break;
+    }
+
   // Wait until any concurrent serial-mode transactions have finished.
   // This is an empty critical section, but won't be elided.
   if (serial_lock.is_write_locked())
@@ -248,7 +585,11 @@ GTM::gtm_thread::begin_transaction (uint32_t prop,
   // HTM fastpath aborted, and that we thus have to decide whether to retry
   // the fastpath (returning a_tryHTMFastPath) or just proceed with the
   // fallback method.
-  if (likely(htm_fastpath && (prop & pr_HTMRetryableAbort)))
+  // Even if pr_HTMRetryableAbort is not true, we still consider to retry if
+  // it was a capacity abort (stated by pr_HTMCapacityAbort) and the self
+  // tuning capabilities are enabled (stated by optimizer.is_enabled).
+  if (likely(htm_fastpath && ((prop & pr_HTMRetryableAbort)
+      || ((prop & pr_HTMCapacityAbort) && (optimizer.is_enabled)))))
     {
       tx = gtm_thr();
       if (unlikely(tx == NULL))
@@ -266,6 +607,26 @@ GTM::gtm_thread::begin_transaction (uint32_t prop,

       if (--tx->restart_total > 0)
  {
+          // See above: we use the optimized configuration if enabled.
+          if (optimizer.is_enabled)
+            {
+              // See above: similarly, we adapt upon capacity aborts.
+              if (prop & pr_HTMCapacityAbort)
+                {
+                  gtm_capacity_abort_mode capacity_mode =
+                    optimizer.optimized_mode;
+                  if (capacity_mode == HALVEN)
+                    tx->restart_total = tx->restart_total / 2 + 1;
+                  else if (capacity_mode == GIVEUP)
+                    goto stop_custom_htm_fastpath;
+                }
+
+              // See above: similarly, we periodically re-optimize.
+              if (unlikely(tx->number_executed_txns %
+                           optimizer.OPTIMIZATION_PERIOD_TXS == 0 &&
+                           tx->restart_total == 1))
+                optimizer.reoptimize_htm_execution();
+            }
+
   // Wait until any concurrent serial-mode transactions have finished.
   // Essentially the same code as above.
   if (serial_lock.is_write_locked())
@@ -665,12 +1026,23 @@ _ITM_commitTransaction(void)
   if (likely(htm_fastpath && !gtm_thread::serial_lock.is_write_locked()))
     {
       htm_commit();
+      gtm_thread *tx = gtm_thr();
+      if (unlikely(tx == NULL))
+        {
+          // See above: when using the HTM fast-path, we might not have
+          // created a thread object yet.
+          tx = new gtm_thread();
+          set_gtm_thr(tx);
+        }
+      tx->number_executed_txns++;
       return;
     }
 #endif
   gtm_thread *tx = gtm_thr();
   if (!tx->trycommit ())
     tx->restart (RESTART_VALIDATE_COMMIT);
+
+  tx->number_executed_txns++;
 }

 void ITM_REGPARM
@@ -681,6 +1053,14 @@ _ITM_commitTransactionEH(void *exc_ptr)
   if (likely(htm_fastpath && !gtm_thread::serial_lock.is_write_locked()))
     {
       htm_commit();
+      gtm_thread *tx = gtm_thr();
+      if (unlikely(tx == NULL))
+        {
+          // See above.
+          tx = new gtm_thread();
+          set_gtm_thr(tx);
+        }
+      tx->number_executed_txns++;
       return;
     }
 #endif
@@ -690,4 +1070,5 @@ _ITM_commitTransactionEH(void *exc_ptr)
       tx->eh_in_flight = exc_ptr;
       tx->restart (RESTART_VALIDATE_COMMIT);
     }
+  tx->number_executed_txns++;
 }
Index: config/s390/target.h
===================================================================
--- config/s390/target.h (revision 219316)
+++ config/s390/target.h (working copy)
@@ -116,12 +116,30 @@ htm_abort_should_retry (uint32_t begin_ret)
 }

 static inline bool
+htm_is_capacity_abort (uint32_t begin_ret)
+{
+  return begin_ret != _HTM_TBEGIN_PERSISTENT;
+}
+
+static inline bool
 htm_transaction_active ()
 {
   __asm volatile (".machine \"all\"  \n\t");
   return __builtin_tx_nesting_depth() != 0;
 }

+static inline uint64_t
+get_efficient_time()
+{
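+  // No cheap time source is wired up here yet; self-tuning stays
+  // disabled on s390 via allows_reoptimization() below.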
+  return 0;
+}
+
+static inline bool
+allows_reoptimization()
+{
+  return false;
+}
+
 #endif

 } // namespace GTM
Index: config/arm/target.h
===================================================================
--- config/arm/target.h (revision 219316)
+++ config/arm/target.h (working copy)
@@ -46,4 +46,10 @@ cpu_relax (void)
   __sync_synchronize ();
 }

+static inline bool
+allows_reoptimization()
+{
+  return false;
+}
+
 } // namespace GTM
Index: config/powerpc/target.h
===================================================================
--- config/powerpc/target.h (revision 219316)
+++ config/powerpc/target.h (working copy)
@@ -74,6 +74,7 @@ cpu_relax (void)
 #define _TBEGIN_STARTED       0
 #define _TBEGIN_INDETERMINATE 1
 #define _TBEGIN_PERSISTENT    2
+#define _TBEGIN_CAPACITY      3

 /* Number of retries for transient failures.  */
 #define _HTM_RETRIES 10
@@ -101,6 +102,9 @@ htm_begin (void)
   if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
     return _TBEGIN_PERSISTENT;

+  if (_TEXASRU_FOOTPRINT_OVERFLOW (__builtin_get_texasru ()))
+    return _TBEGIN_CAPACITY;
+
   return _TBEGIN_INDETERMINATE;
 }

@@ -128,6 +132,12 @@ htm_abort_should_retry (uint32_t begin_ret)
   return begin_ret != _TBEGIN_PERSISTENT;
 }

+static inline bool
+htm_is_capacity_abort (uint32_t begin_ret)
+{
+  return begin_ret == _TBEGIN_CAPACITY;
+}
+
 /* Returns true iff a hardware transaction is currently being executed.  */
 static inline bool
 htm_transaction_active (void)
@@ -135,6 +145,36 @@ htm_transaction_active (void)
   return (_HTM_STATE (__builtin_ttest ()) == _HTM_TRANSACTIONAL);
 }

+static inline uint64_t
+get_efficient_time()
+{
+  #if defined (__powerpc64__) || defined (__ppc64__)
+  uint64_t __tb;
+  __asm__ volatile ("mfspr %[tb], 268\n"
+    : [tb]"=r" (__tb)
+    : );
+  return __tb;
+  #else
+  register unsigned long __tbu, __tbl, __tmp;
+  __asm__ volatile (
+    "0:\n"
+    "mftbu %[tbu]\n"
+    "mftbl %[tbl]\n"
+    "mftbu %[tmp]\n"
+    "cmpw %[tbu], %[tmp]\n"
+    "bne- 0b\n"
+    : [tbu]"=r" (__tbu), [tbl]"=r" (__tbl), [tmp]"=r" (__tmp)
+    : );
+  return (( (uint64_t) __tbu << 32) | __tbl);
+  #endif /* not __powerpc64__ */
+}
+
+static inline bool
+allows_reoptimization()
+{
+  return true;
+}
+
 #endif

 } // namespace GTM
Index: config/alpha/target.h
===================================================================
--- config/alpha/target.h (revision 219316)
+++ config/alpha/target.h (working copy)
@@ -41,4 +41,10 @@ cpu_relax (void)
   __asm volatile ("" : : : "memory");
 }

+static inline bool
+allows_reoptimization()
+{
+  return false;
+}
+
 } // namespace GTM
Index: config/x86/sjlj.S
===================================================================
--- config/x86/sjlj.S (revision 219316)
+++ config/x86/sjlj.S (working copy)
@@ -59,12 +59,14 @@
 #define pr_hasNoAbort 0x08
 #define pr_HTMRetryableAbort 0x800000
 #define pr_HTMRetriedAfterAbort 0x1000000
+#define pr_HTMCapacityAbort     0x2000000
 #define a_runInstrumentedCode 0x01
 #define a_runUninstrumentedCode 0x02
 #define a_tryHTMFastPath 0x20

 #define _XABORT_EXPLICIT (1 << 0)
 #define _XABORT_RETRY (1 << 1)
+#define _XABORT_CAPACITY (1 << 3)

  .text

@@ -111,6 +113,12 @@ SYM(_ITM_beginTransaction):
 	testl	$(_XABORT_RETRY|_XABORT_EXPLICIT), %eax
-	jz	.Lno_htm
+	jz	.Ltest_capacity
 	orl	$pr_HTMRetryableAbort, %edi
+	/* Flag capacity aborts as well, even when not retryable, so that
+	   the C++ code can decide whether they warrant a retry.  */
+.Ltest_capacity:
+	testl	$(_XABORT_CAPACITY), %eax
+	jz	.Lno_htm
+	orl	$pr_HTMCapacityAbort, %edi
 	/* Let the C++ code handle the retry policy.  */
 .Lno_htm:
 #endif
Index: config/x86/target.h
===================================================================
--- config/x86/target.h (revision 219316)
+++ config/x86/target.h (working copy)
@@ -126,13 +126,33 @@ htm_abort_should_retry (uint32_t begin_ret)
   return begin_ret & _XABORT_RETRY;
 }

+static inline bool
+htm_is_capacity_abort(uint32_t begin_ret)
+{
+  return begin_ret & _XABORT_CAPACITY;
+}
+
 /* Returns true iff a hardware transaction is currently being executed.  */
 static inline bool
 htm_transaction_active ()
 {
   return _xtest() != 0;
 }
+
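+/* Reads the x86 time-stamp counter. RDTSC is cheap but not
+   serializing, which is acceptable for the coarse-grained epochs
+   measured by the optimizer.  */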
+static inline uint64_t
+get_efficient_time()
+{
+  uint32_t hi, lo;
+  __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
+  return ((unsigned long long) lo) | (((unsigned long long) hi) << 32);
+}
+
+static inline bool
+allows_reoptimization()
+{
+  return true;
+}
+
 #endif

-
 } // namespace GTM
Index: config/aarch64/target.h
===================================================================
--- config/aarch64/target.h (revision 219316)
+++ config/aarch64/target.h (working copy)
@@ -42,4 +42,10 @@ cpu_relax (void)
   __asm volatile ("" : : : "memory");
 }

+static inline bool
+allows_reoptimization()
+{
+  return false;
+}
+
 } // namespace GTM
Index: config/sparc/target.h
===================================================================
--- config/sparc/target.h (revision 219316)
+++ config/sparc/target.h (working copy)
@@ -39,4 +39,10 @@ cpu_relax (void)
   __asm volatile ("rd %%ccr, %%g0" : : : "memory");
 }

+static inline bool
+allows_reoptimization()
+{
+  return false;
+}
+
 } // namespace GTM
Index: config/sh/target.h
===================================================================
--- config/sh/target.h (revision 219316)
+++ config/sh/target.h (working copy)
@@ -44,4 +44,10 @@ cpu_relax (void)
   __asm volatile ("" : : : "memory");
 }

+static inline bool
+allows_reoptimization()
+{
+  return false;
+}
+
 } // namespace GTM
Index: libitm_i.h
===================================================================
--- libitm_i.h (revision 219316)
+++ libitm_i.h (working copy)
@@ -242,6 +242,9 @@ struct gtm_thread
   uint32_t restart_reason[NUM_RESTARTS];
   uint32_t restart_total;

+  // Keeps track of how many transactions were successfully executed.
+  uint64_t number_executed_txns;
+
   // *** The shared part of gtm_thread starts here. ***
   // Shared state is on separate cachelines to avoid false sharing with
   // thread-local parts of gtm_thread.
@@ -309,6 +312,94 @@ struct gtm_thread
   void commit_user_actions ();
 };

+// Definition of fixed point arithmetic types.
+// FIXED_PT_INTEGER_WIDTH bits hold the integer part (40 here) and the
+// remaining FIXED_PT_FRAC_WIDTH bits the fractional part (24).
+struct fixed_pt_t
+{
+  uint64_t number;
+
+  fixed_pt_t(double n)
+  {
+    this->number = static_cast<uint64_t>(n * FIXED_PT_ONE.number
+                                         + (n >= 0 ? 0.5 : -0.5));
+  }
+
+  fixed_pt_t add(const fixed_pt_t operand) const;
+  fixed_pt_t sub(const fixed_pt_t operand) const;
+  fixed_pt_t mul(const fixed_pt_t operand) const;
+  fixed_pt_t div(const fixed_pt_t operand) const;
+  fixed_pt_t sqrt() const;
+  fixed_pt_t ln() const;
+
+  static fixed_pt_t one();
+
+  static const uint32_t FIXED_PT_WIDTH;
+  static const uint32_t FIXED_PT_INTEGER_WIDTH;
+  static const uint32_t FIXED_PT_FRAC_WIDTH;
+  static const fixed_pt_t FIXED_PT_ONE;
+  static const fixed_pt_t FIXED_PT_TWO;
+};
+
+// Different ways to deal with capacity aborts in HTM execution.
+enum gtm_capacity_abort_mode
+{
+  STUBBORN,
+  HALVEN,
+  GIVEUP
+};
+
+// Maintains the current values optimized for HTM execution and the
+// corresponding statistics gathered for the decision-making.
+struct gtm_global_optimizer
+{
+  // Whether to re-optimize the HTM execution. This is controlled by
+  // the ITM_DEFAULT_METHOD environment variable (htm vs. htm_notune).
+  uint32_t is_enabled;
+
+  // Mode chosen to currently deal with capacity aborts.
+  gtm_capacity_abort_mode optimized_mode;
+  // The number of attempts chosen to currently insist on HTM execution is
+  // maintained in htm_fastpath.
+
+  uint64_t last_cycles;
+  uint64_t last_total_txs_executed;
+
+  fixed_pt_t last_throughput;
+  uint32_t last_attempts;
+
+  fixed_pt_t best_ever_throughput;
+  uint32_t best_ever_attempts;
+
+  uint64_t txns_while_stubborn;
+  uint64_t cycles_while_stubborn;
+  uint64_t txns_while_halven;
+  uint64_t cycles_while_halven;
+  uint64_t txns_while_giveup;
+  uint64_t cycles_while_giveup;
+
+  gtm_global_optimizer() :
+    last_throughput(fixed_pt_t(0.0001)),
+    best_ever_throughput(fixed_pt_t(0.0001))
+  {
+  }
+
+  void
+  set_is_enabled(bool arg)
+  {
+    is_enabled = allows_reoptimization() && arg;
+  }
+
+  void reoptimize_htm_execution();
+
+  // Period, in number of transactions, at which we run the
+  // re-optimization procedure.
+  static const uint32_t OPTIMIZATION_PERIOD_TXS = 500;
+  // Initialization value for the last, fictional, throughput measured.
+  // Defined in beginend.cc, as in-class initialization is only valid
+  // for integral types.
+  static const fixed_pt_t MIN_THROUGHPUT;
+  // This serves as an initialization for the cycles used so far.
+  static const uint32_t INIT_CYCLES = 100;
+};
+
 } // namespace GTM

 #include "tls.h"
@@ -346,6 +437,9 @@ extern gtm_cacheline_mask gtm_mask_stack(gtm_cache
 // the name, avoiding complex name mangling.
 extern uint32_t htm_fastpath __asm__(UPFX "gtm_htm_fastpath");

+// Maintains the optimization for HTM execution.
+extern gtm_global_optimizer optimizer;
+
 } // namespace GTM

 #endif // LIBITM_I_H
Index: common.h
===================================================================
--- common.h (revision 219316)
+++ common.h (working copy)
@@ -59,6 +59,11 @@ extern void * xcalloc (size_t s, bool separate_cl
 extern void * xrealloc (void *p, size_t s, bool separate_cl = false)
   __attribute__((malloc, nothrow));

+// Linear congruential pseudo-random number generator, with the same
+// constants used by glibc for small-state generators. Returns a value
+// in [0, max).
+extern uint32_t random_int(uint32_t max);
+
 } // namespace GTM

