diff mbox

Handle abortTransaction with RTM

Message ID 509C5B26.6000204@redhat.com
State New
Headers show

Commit Message

Richard Henderson Nov. 9, 2012, 1:23 a.m. UTC
On 2012-11-08 15:32, Torvald Riegel wrote:
> On Thu, 2012-11-08 at 15:38 +0000, Lai, Konrad wrote:
>> The xabort argument is the only "escape" currently allowed in RTM. So it is not possible to use a separate Boolean in memory.
> 
> No, that's not what I meant.  The boolean would be used in libitm's
> htm_abort(), which the architecture-specific code (eg,
> config/x86/target.h) would then change into whatever the particular HTM
> uses as abort reasons (eg, true would become 0xff).  That's just to keep
> as much of libitm portable as possible, nothing more.
> 
>> [Roman just posted an example in http://software.intel.com/en-us/blogs/2012/11/06/exploring-intel-transactional-synchronization-extensions-with-intel-software ]
>>
>> I don't know "any" large code that uses cancel. Someone claim there is one in STAMP, and one got speed up if it was removed. I think this discussion potentially explains why.
> 
> The abort in STAMP is bogus.  They didn't instrument all of the code (if
> you use the manual instrumentation), and the abort is just there to
> "catch" race conditions and other inconsistencies.
> 
> My suggestions for next steps are to move the begin stuff to assembly.
> After that, we can go for the abort branch, if removing it really makes
> a non-negligible performance difference.

I believe this is the sort of patch that Torvald was talking about
for handling abortTransaction with RTM.

Andi, can you please test?


r~

Comments

Andi Kleen Nov. 9, 2012, 4:34 a.m. UTC | #1
Richard Henderson <rth@redhat.com> writes:
>  
>  static inline void
> -htm_abort ()
> +htm_abort_retry ()
>  {
>    // ??? According to a yet unpublished ABI rule, 0xff is reserved and
>    // supposed to signal a busy lock.  Source: andi.kleen@intel.com
>    _xabort(0xff);
>  }
>  
> +static inline void
> +htm_abort_cancel ()
> +{
> +  // ??? What's the unpublished ABI rule for this, Andi?

There is none for cancel, just for lock-is-locked (0xfe) and lock-busy (0xff)

The convention is just for easier abort profiling. The profiler (perf) can display
this abort code and it's far easier to understand if common situations
have their standard code. But you can always make up your own too.

-Andi
Richard Henderson Nov. 9, 2012, 4:55 p.m. UTC | #2
On 2012-11-08 17:23, Richard Henderson wrote:
> I believe this is the sort of patch that Torvald was talking about
> for handling abortTransaction with RTM.

FYI, I realized that this patch doesn't handle aborts on a
nested transaction properly.


r~
Torvald Riegel Nov. 9, 2012, 5:56 p.m. UTC | #3
On Thu, 2012-11-08 at 17:23 -0800, Richard Henderson wrote:
> +         // Honor an abort from abortTransaction.
> +         else if (htm_abort_is_cancel(ret))
> +           return a_abortTransaction | a_restoreLiveVariables;

The problem is that we cannot reliably detect whether an abort with a
certain abort reason code really means that we got canceled.  A nested
transaction is one example: how do we distinguish whether the nested or
the outermost transaction were canceled?.
Another example is transaction_pure or tm_wrapper code that might reuse
the same abort reason for explicit aborts: we would like to allow
malloc() calls in transactions, but if malloc uses transactions with
explicit aborts, we might abort a transaction without a cause.

It might be possible to avoid that by carefully specifying the rules for
abort reasons, which of those can be used in nested transactions, and so
on.  But I'm not aware of any specification like that, and it seems to
likely be a pretty fragile approach.

Thus, there are two options for how to handle transactions that may
abort: Either execute them transactionally the first time, and do an
explicit HTM abort on transaction_cancel that tells libitm not not retry
as HW transaction, or execute such transactions always as SW
transactions.  Which of these two options is better depends on how
frequently transaction_cancel would actually be called.  If it's
infrequent, then trying to run as HW transactions might be better.

Torvald
Torvald Riegel Nov. 9, 2012, 6:01 p.m. UTC | #4
On Thu, 2012-11-08 at 20:34 -0800, Andi Kleen wrote:
> Richard Henderson <rth@redhat.com> writes:
> >  
> >  static inline void
> > -htm_abort ()
> > +htm_abort_retry ()
> >  {
> >    // ??? According to a yet unpublished ABI rule, 0xff is reserved and
> >    // supposed to signal a busy lock.  Source: andi.kleen@intel.com
> >    _xabort(0xff);
> >  }
> >  
> > +static inline void
> > +htm_abort_cancel ()
> > +{
> > +  // ??? What's the unpublished ABI rule for this, Andi?
> 
> There is none for cancel, just for lock-is-locked (0xfe) and lock-busy (0xff)
> 
> The convention is just for easier abort profiling. The profiler (perf) can display
> this abort code and it's far easier to understand if common situations
> have their standard code. But you can always make up your own too.

I'm not sure this is quite true.  If a libitm-executed transaction is
started from within some other transactional region (e.g., managed
explicitly by the user), and those two disagree about what is an abort
that should be retried or not, then this can at least have a negative
impact on performance: Either because the outermost HW transaction
handler thinks it shouldn't retry even though it should, or because the
outermost transaction retries continuously even though there's no way it
can succeed.

If all code that uses transactions will eventually choose the fallback
path, this is not a correctness issue.  But it can be a performance
issue, so I think that HTMs and the code using them should be careful to
agree about abort reasons and what they communicate.


Torvald
Andi Kleen Nov. 9, 2012, 6:21 p.m. UTC | #5
Torvald Riegel <triegel@redhat.com> writes:
>
> I'm not sure this is quite true.  If a libitm-executed transaction is
> started from within some other transactional region (e.g., managed
> explicitly by the user), and those two disagree about what is an abort
> that should be retried or not, then this can at least have a negative
> impact on performance: Either because the outermost HW transaction
> handler thinks it shouldn't retry even though it should, or because the
> outermost transaction retries continuously even though there's no way it
> can succeed.

It shouldn't retry continuously for a nested abort.
Also retrying a lot is dangerous anyways, you have to be very
careful about that.

Short term your biggest problem is profiling the whole thing in any
case, and that is what the XABORT with well defined codes make much
easier.

-Andi
Andi Kleen Nov. 9, 2012, 6:24 p.m. UTC | #6
Torvald Riegel <triegel@redhat.com> writes:

> On Thu, 2012-11-08 at 17:23 -0800, Richard Henderson wrote:
>> +         // Honor an abort from abortTransaction.
>> +         else if (htm_abort_is_cancel(ret))
>> +           return a_abortTransaction | a_restoreLiveVariables;
>
> The problem is that we cannot reliably detect whether an abort with a
> certain abort reason code really means that we got canceled.  A nested
> transaction is one example: how do we distinguish whether the nested or
> the outermost transaction were canceled?.

In TSX you always come back to the outermost anyways.
Look at the nested bit in the abort code?

> Thus, there are two options for how to handle transactions that may
> abort: Either execute them transactionally the first time, and do an
> explicit HTM abort on transaction_cancel that tells libitm not not retry
> as HW transaction, or execute such transactions always as SW
> transactions.  Which of these two options is better depends on how
> frequently transaction_cancel would actually be called.  If it's
> infrequent, then trying to run as HW transactions might be better.

I believe the current evidence is that cancel is very uncommon.

-Andi
Torvald Riegel Nov. 9, 2012, 6:29 p.m. UTC | #7
On Fri, 2012-11-09 at 10:24 -0800, Andi Kleen wrote:
> Torvald Riegel <triegel@redhat.com> writes:
> 
> > On Thu, 2012-11-08 at 17:23 -0800, Richard Henderson wrote:
> >> +         // Honor an abort from abortTransaction.
> >> +         else if (htm_abort_is_cancel(ret))
> >> +           return a_abortTransaction | a_restoreLiveVariables;
> >
> > The problem is that we cannot reliably detect whether an abort with a
> > certain abort reason code really means that we got canceled.  A nested
> > transaction is one example: how do we distinguish whether the nested or
> > the outermost transaction were canceled?.
> 
> In TSX you always come back to the outermost anyways.

That, and the fact that we cannot reliably communicate which transaction
(on the programming language level) was supposed to be aborted, is
exactly the problem that I described.
Andi Kleen Nov. 9, 2012, 8:54 p.m. UTC | #8
> I'm not sure this is quite true.  If a libitm-executed transaction is

It's just a convention. You don't have to use it. Not doing it will
just make abort profiling harder.

-Andi
Torvald Riegel Nov. 12, 2012, 10:46 a.m. UTC | #9
On Fri, 2012-11-09 at 21:54 +0100, Andi Kleen wrote:
> > I'm not sure this is quite true.  If a libitm-executed transaction is
> 
> It's just a convention. You don't have to use it.

That's true ...

> Not doing it will
> just make abort profiling harder.

... but I disagree with this one.  This won't only make profiling
harder, but it also can have a negative impact on performance.  If you
think that there's something wrong with the reasons that I gave for why
this can be a performance issue, please say why you think so instead of
just saying that it's just about profiling.

To avoid the potential performance complications, it would probably be
sufficient to encourage people to follow this convention.  For example,
making sure that libitm and other libraries such as glibc are aware of
it (or, ideally, follow it) would cover most of the combinations I
suppose;  perhaps the convention should also be mentioned in the RTM
intrinsics header file.


Torvald
diff mbox

Patch

diff --git a/libitm/beginend.cc b/libitm/beginend.cc
index 4369946..8f5c4ef 100644
--- a/libitm/beginend.cc
+++ b/libitm/beginend.cc
@@ -166,18 +166,16 @@  GTM::gtm_thread::begin_transaction (uint32_t prop, const gtm_jmpbuf *jb)
     GTM_fatal("pr_undoLogCode not supported");
 
 #if defined(USE_HTM_FASTPATH) && !defined(HTM_CUSTOM_FASTPATH)
-  // HTM fastpath.  Only chosen in the absence of transaction_cancel to allow
-  // using an uninstrumented code path.
-  // The fastpath is enabled only by dispatch_htm's method group, which uses
-  // serial-mode methods as fallback.  Serial-mode transactions cannot execute
-  // concurrently with HW transactions because the latter monitor the serial
-  // lock's writer flag and thus abort if another thread is or becomes a
-  // serial transaction.  Therefore, if the fastpath is enabled, then a
-  // transaction is not executing as a HW transaction iff the serial lock is
-  // write-locked.  This allows us to use htm_fastpath and the serial lock's
-  // writer flag to reliable determine whether the current thread runs a HW
-  // transaction, and thus we do not need to maintain this information in
-  // per-thread state.
+  // HTM fastpath.  The fastpath is enabled only by dispatch_htm's method
+  // group, which uses serial-mode methods as fallback.  Serial-mode
+  // transactions cannot execute concurrently with HW transactions because
+  // the latter monitor the serial lock's writer flag and thus abort if
+  // another thread is or becomes a serial transaction.  Therefore, if the
+  // fastpath is enabled, then a transaction is not executing as a HW
+  // transaction iff the serial lock is write-locked.  This allows us to
+  // use htm_fastpath and the serial lock's writer flag to reliable determine
+  // whether the current thread runs a HW transaction, and thus we do not
+  // need to maintain this information in per-thread state.
   // If an uninstrumented code path is not available, we can still run
   // instrumented code from a HW transaction because the HTM fastpath kicks
   // in early in both begin and commit, and the transaction is not canceled.
@@ -187,7 +185,7 @@  GTM::gtm_thread::begin_transaction (uint32_t prop, const gtm_jmpbuf *jb)
   // indeed in serial mode, and HW transactions should never need serial mode
   // for any internal changes (e.g., they never abort visibly to the STM code
   // and thus do not trigger the standard retry handling).
-  if (likely(htm_fastpath && (prop & pr_hasNoAbort)))
+  if (likely(htm_fastpath))
     {
       for (uint32_t t = htm_fastpath; t; t--)
 	{
@@ -198,33 +196,41 @@  GTM::gtm_thread::begin_transaction (uint32_t prop, const gtm_jmpbuf *jb)
 	      // Monitor the writer flag in the serial-mode lock, and abort
 	      // if there is an active or waiting serial-mode transaction.
 	      if (unlikely(serial_lock.is_write_locked()))
-		htm_abort();
+		htm_abort_retry();
 	      else
 		// We do not need to set a_saveLiveVariables because of HTM.
 		return (prop & pr_uninstrumentedCode) ?
 		    a_runUninstrumentedCode : a_runInstrumentedCode;
 	    }
-	  // The transaction has aborted.  Don't retry if it's unlikely that
+	  // The transaction has aborted.  Retry if it's likely that
 	  // retrying the transaction will be successful.
-	  if (!htm_abort_should_retry(ret))
-	    break;
-	  // Wait until any concurrent serial-mode transactions have finished.
-	  // This is an empty critical section, but won't be elided.
-	  if (serial_lock.is_write_locked())
+	  if (htm_abort_should_retry(ret))
 	    {
-	      tx = gtm_thr();
-	      if (unlikely(tx == NULL))
-	        {
-	          // See below.
-	          tx = new gtm_thread();
-	          set_gtm_thr(tx);
-	        }
-	      serial_lock.read_lock(tx);
-	      serial_lock.read_unlock(tx);
-	      // TODO We should probably reset the retry count t here, unless
-	      // we have retried so often that we should go serial to avoid
-	      // starvation.
+	      // Wait until any concurrent serial-mode transactions have
+	      // finished.  This is an empty critical section, but won't
+	      // be elided.
+	      if (serial_lock.is_write_locked())
+		{
+		  tx = gtm_thr();
+		  if (unlikely(tx == NULL))
+		    {
+		      // See below.
+		      tx = new gtm_thread();
+		      set_gtm_thr(tx);
+		    }
+		  serial_lock.read_lock(tx);
+		  serial_lock.read_unlock(tx);
+		  // TODO We should probably reset the retry count t here,
+		  // unless we have retried so often that we should go
+		  // serial to avoid starvation.
+		}
 	    }
+	  // Honor an abort from abortTransaction.
+	  else if (htm_abort_is_cancel(ret))
+	    return a_abortTransaction | a_restoreLiveVariables;
+	  // Otherwise quit using HTM and fall back to STM.
+	  else
+	    break;
 	}
     }
 #endif
@@ -438,6 +444,11 @@  GTM::gtm_thread::rollback (gtm_transaction_cp *cp, bool aborting)
 void ITM_REGPARM
 _ITM_abortTransaction (_ITM_abortReason reason)
 {
+#if defined(USE_HTM_FASTPATH)
+  if (likely(htm_fastpath && !gtm_thread::serial_lock.is_write_locked()))
+    htm_abort_cancel();
+#endif
+
   gtm_thread *tx = gtm_thr();
 
   assert (reason == userAbort || reason == (userAbort | outerAbort));
diff --git a/libitm/config/x86/target.h b/libitm/config/x86/target.h
index 41ae2eb..1d2e822 100644
--- a/libitm/config/x86/target.h
+++ b/libitm/config/x86/target.h
@@ -125,18 +125,32 @@  htm_commit ()
 }
 
 static inline void
-htm_abort ()
+htm_abort_retry ()
 {
   // ??? According to a yet unpublished ABI rule, 0xff is reserved and
   // supposed to signal a busy lock.  Source: andi.kleen@intel.com
   _xabort(0xff);
 }
 
+static inline void
+htm_abort_cancel ()
+{
+  // ??? What's the unpublished ABI rule for this, Andi?
+  _xabort(0);
+}
+
 static inline bool
 htm_abort_should_retry (uint32_t begin_ret)
 {
   return begin_ret & _XABORT_RETRY;
 }
+
+static inline bool
+htm_abort_is_cancel (uint32_t begin_ret)
+{
+  // return (begin_ret & _XABORT_EXPLICIT) && _XABORT_CODE(begin_ret) == 0;
+  return begin_ret == _XABORT_EXPLICIT;
+}
 #endif