Patchwork libitm: fixes and optimizations in gl_wt

Submitter Torvald Riegel
Date Feb. 20, 2012, 4:54 p.m.
Message ID <1329756857.2986.1153.camel@triegel.csb>
Permalink /patch/142168/
State New

Comments

Torvald Riegel - Feb. 20, 2012, 4:54 p.m.
The following patches don't depend on each other but are all for the
gl_wt TM method.

patch1:
For memory transfers from source regions with the read-for-write
modifier, undo logging of these regions was missing.  Also optimize the
number of accesses to gtm_thr().

patch2:
Similar to the ml_wt code, we don't need to use acq_rel but just acquire
memory order when acquiring the orec.  The following release fence is
still necessary.
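The acquire-CAS-plus-release-fence pattern that patch2 describes can be sketched as follows (a standalone illustration with made-up names, not libitm's actual code; the orec layout and restart logic are simplified assumptions):

```cpp
#include <atomic>
#include <cstdint>

// Global ownership record doubling as the time base, as in gl_wt.
static std::atomic<uint64_t> orec{0};
constexpr uint64_t LOCK_BIT = 1;

bool try_acquire_orec(uint64_t snapshot)
{
  uint64_t expected = snapshot;
  // acquire is enough here: a thread that observes our locked value
  // will restart anyway, so it does not rely on any further
  // happens-before edge being established by this CAS.
  if (!orec.compare_exchange_strong(expected, snapshot | LOCK_BIT,
                                    std::memory_order_acquire))
    return false;  // someone else holds the orec -> caller restarts

  // One standalone release fence instead of release order on every
  // subsequent data store; it synchronizes with acquire loads of the
  // data during validation.
  std::atomic_thread_fence(std::memory_order_release);
  return true;
}
```

The point of the fence is that it covers all later plain stores at once, which is why it remains necessary even after weakening the CAS.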

patch3:
Recently, I changed how gtm_rwlock handles upgrades to serial mode.
This change was to fix privatization safety and basically consists of
not modifying gtm_thread::shared_state until trycommit() and/or
rollback() have finished after the upgrade (and before the switch to
serialirr dispatch).  With this in place, we can avoid the previous
workarounds for the commit-in-serial-mode corner case.

patch4:
Privatization safety in gl_wt rollback was previously enforced with
something that probably worked but for which the precise C++11 mem model
reasoning was unclear.  The patch fixes that (so it's clear why it works
on the basis of the language's memory model, not on the HW models), and
it is less costly than the previously used seq_cst barrier.

patch5:
Just put gl_wt's global orec on a separate cacheline, same as we do in
ml_wt.
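The padding idea in patch5 can be sketched in isolation like this (hypothetical names; HW_CACHELINE_SIZE is libitm's constant, 64 is just an assumed value). As the review below notes, aligning the member is sufficient: the whole struct inherits the alignment and the compiler pads it to a multiple of that alignment, so no explicit tail padding is needed:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t CACHELINE = 64;  // assumed cache line size

struct method_group_sketch
{
  // Aligning this member forces the struct's alignment to CACHELINE,
  // and sizeof is rounded up to a multiple of the alignment, keeping
  // the orec from sharing a line with unrelated data.
  alignas(CACHELINE) std::atomic<std::uint64_t> orec;
};

static_assert(alignof(method_group_sketch) == CACHELINE,
              "struct inherits member alignment");
static_assert(sizeof(method_group_sketch) % CACHELINE == 0,
              "compiler pads to the alignment");
```

Note this sketch has no vtable; in the real gl_mg the vtable pointer can still share the orec's line, which is what the follow-up discussion is about.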


OK for trunk?
commit 61ffbfa5610636556335595cc4c28c9e8cd45e23
Author: Torvald Riegel <triegel@redhat.com>
Date:   Mon Feb 20 14:04:17 2012 +0100

    libitm: Add missing undo-logging of RfW src regions in gl_wt memtransfer.
    
    	libitm/
    	* method-gl.cc (gl_wt_dispatch::validate): New.
    	(gl_wt_dispatch::pre_write): New.
    	(gl_wt_dispatch::memtransfer_static): Add missing undo for RfW src.
    	Optimize number of calls to gtm_thr.
commit 609458878f63aafd77902466dae76b59a6c2ba68
Author: Torvald Riegel <triegel@redhat.com>
Date:   Mon Feb 20 14:27:55 2012 +0100

    libitm: Optimize memory order requirements in gl_wt pre_write.
    
    	libitm/
    	* method-gl.cc (gl_wt_dispatch::pre_write): Optimize memory orders.

diff --git a/libitm/method-gl.cc b/libitm/method-gl.cc
index 8df3d30..958e81f 100644
--- a/libitm/method-gl.cc
+++ b/libitm/method-gl.cc
@@ -103,16 +103,27 @@ protected:
 	  tx->restart(RESTART_VALIDATE_WRITE);
 
 	// CAS global orec from our snapshot time to the locked state.
-	// We need acq_rel memory order here to synchronize with other loads
-	// and modifications of orec.
+        // We need acquire memory order here to synchronize with other
+        // (ownership) releases of the orec.  We do not need acq_rel order
+        // because whenever another thread reads from this CAS'
+        // modification, then it will abort anyway and does not rely on
+        // any further happens-before relation to be established.
+	// Also note that unlike in ml_wt's increase of the global time
+	// base (remember that the global orec is used as time base), we do
+	// not need release memory order here because we do not need to make
+	// prior orec acquisitions visible to other threads that try to
+	// extend their snapshot time.
 	if (!o_gl_mg.orec.compare_exchange_strong (now, gl_mg::set_locked(now),
-						   memory_order_acq_rel))
+						   memory_order_acquire))
 	  tx->restart(RESTART_LOCKED_WRITE);
 
 	// We use an explicit fence here to avoid having to use release
 	// memory order for all subsequent data stores.  This fence will
 	// synchronize with loads of the data with acquire memory order.  See
 	// validate() for why this is necessary.
+        // Adding release memory order to the prior CAS is not sufficient,
+        // at least according to the Batty et al. formalization of the
+        // memory model.
 	atomic_thread_fence(memory_order_release);
 
 	// Set shared_state to new value.
commit 4892ddf448108b20d44cb7529dcc8561614a0ff4
Author: Torvald Riegel <triegel@redhat.com>
Date:   Mon Feb 20 14:53:23 2012 +0100

    libitm: Remove obsolete handling of prior serial lock corner cases in gl_wt.
    
    	libitm/
    	* method-gl.cc (gl_wt_dispatch::trycommit): Remove handling of
    	serial mode corner cases made obsolete by prior gtm_rwlock changes.
    	(gl_wt_dispatch::rollback): Same.

diff --git a/libitm/method-gl.cc b/libitm/method-gl.cc
index 958e81f..89e33a9 100644
--- a/libitm/method-gl.cc
+++ b/libitm/method-gl.cc
@@ -76,12 +76,10 @@ static gl_mg o_gl_mg;
 // validate that no other update transaction comitted before we acquired the
 // orec, so we have the most recent timestamp and no other transaction can
 // commit until we have committed).
-// However, we therefore cannot use this method for a serial transaction
-// (because shared_state needs to remain at ~0) and we have to be careful
-// when switching to serial mode (see the special handling in trycommit() and
-// rollback()).
-// ??? This sharing adds some complexity wrt. serial mode. Just use a separate
-// state variable?
+// However, we therefore depend on shared_state not being modified by the
+// serial lock during upgrades to serial mode, which is ensured by
+// gtm_thread::serialirr_mode by not calling gtm_rwlock::write_upgrade_finish
+// before we have committed or rolled back.
 class gl_wt_dispatch : public abi_dispatch
 {
 protected:
@@ -305,15 +303,6 @@ public:
     gtm_thread* tx = gtm_thr();
     gtm_word v = tx->shared_state.load(memory_order_relaxed);
 
-    // Special case: If shared_state is ~0, then we have acquired the
-    // serial lock (tx->state is not updated yet). In this case, the previous
-    // value isn't available anymore, so grab it from the global lock, which
-    // must have a meaningful value because no other transactions are active
-    // anymore. In particular, if it is locked, then we are an update
-    // transaction, which is all we care about for commit.
-    if (v == ~(typeof v)0)
-      v = o_gl_mg.orec.load(memory_order_relaxed);
-
     // Release the orec but do not reset shared_state, which will be modified
     // by the serial lock right after our commit anyway. Also, resetting
     // shared state here would interfere with the serial lock's use of this
@@ -341,15 +330,6 @@ public:
 
     gtm_thread *tx = gtm_thr();
     gtm_word v = tx->shared_state.load(memory_order_relaxed);
-    // Special case: If shared_state is ~0, then we have acquired the
-    // serial lock (tx->state is not updated yet). In this case, the previous
-    // value isn't available anymore, so grab it from the global lock, which
-    // must have a meaningful value because no other transactions are active
-    // anymore. In particular, if it is locked, then we are an update
-    // transaction, which is all we care about for rollback.
-    bool is_serial = v == ~(typeof v)0;
-    if (is_serial)
-      v = o_gl_mg.orec.load(memory_order_relaxed);
 
     // Release lock and increment version number to prevent dirty reads.
     // Also reset shared state here, so that begin_or_restart() can expect a
@@ -362,10 +342,7 @@ public:
 	o_gl_mg.orec.store(v, memory_order_release);
 
 	// Also reset the timestamp published via shared_state.
-	// Special case: Only do this if we are not a serial transaction
-	// because otherwise, we would interfere with the serial lock.
-	if (!is_serial)
-	  tx->shared_state.store(v, memory_order_release);
+	tx->shared_state.store(v, memory_order_release);
 
 	// We need a store-load barrier after this store to prevent it
 	// from becoming visible after later data loads because the
commit 629ec95d80d0bb77ab43d4fd3c922c88073594a0
Author: Torvald Riegel <triegel@redhat.com>
Date:   Mon Feb 20 16:17:59 2012 +0100

    libitm: Optimize synchronization in gl_wt rollback.
    
    	libitm/
    	* method-gl.cc (gl_wt_dispatch::rollback): Optimize memory orders.

diff --git a/libitm/method-gl.cc b/libitm/method-gl.cc
index 89e33a9..02865fb 100644
--- a/libitm/method-gl.cc
+++ b/libitm/method-gl.cc
@@ -336,22 +336,26 @@ public:
     // value that is correct wrt. privatization safety.
     if (gl_mg::is_locked(v))
       {
-	// Release the global orec, increasing its version number / timestamp.
-        // See begin_or_restart() for why we need release memory order here.
+	// With our rollback, global time increases.
 	v = gl_mg::clear_locked(v) + 1;
-	o_gl_mg.orec.store(v, memory_order_release);
 
-	// Also reset the timestamp published via shared_state.
+	// First reset the timestamp published via shared_state.  Release
+	// memory order will make this happen after undoing prior data writes.
+	// This must also happen before we actually release the global orec
+	// next, so that future update transactions in other threads observe
+	// a meaningful snapshot time for our transaction; otherwise, they
+	// could read a shared_state value with the LOCK_BIT set, which can
+	// break privatization safety because it's larger than the actual
+	// snapshot time.  Note that we only need to consider other update
+	// transactions because only those will potentially privatize data.
 	tx->shared_state.store(v, memory_order_release);
 
-	// We need a store-load barrier after this store to prevent it
-	// from becoming visible after later data loads because the
-	// previous value of shared_state has been higher than the actual
-	// snapshot time (the lock bit had been set), which could break
-	// privatization safety. We do not need a barrier before this
-	// store (see pre_write() for an explanation).
-	// ??? What is the precise reasoning in the C++11 model?
-	atomic_thread_fence(memory_order_seq_cst);
+	// Release the global orec, increasing its version number / timestamp.
+	// See begin_or_restart() for why we need release memory order here,
+	// and we also need it to make future update transactions read the
+	// prior update to shared_state too (update transactions acquire the
+	// global orec with acquire memory order).
+	o_gl_mg.orec.store(v, memory_order_release);
       }
 
   }
commit 423e4b674968252fbda648fc8857a4520d9e30e8
Author: Torvald Riegel <triegel@redhat.com>
Date:   Mon Feb 20 16:22:58 2012 +0100

    libitm: Put gl_wt global lock on separate cache line.
    
    	libitm/
    	* method-gl.cc (gl_wt_dispatch::orec): Put on separate cacheline.

diff --git a/libitm/method-gl.cc b/libitm/method-gl.cc
index 02865fb..01347e5 100644
--- a/libitm/method-gl.cc
+++ b/libitm/method-gl.cc
@@ -41,7 +41,8 @@ struct gl_mg : public method_group
   static gtm_word clear_locked(gtm_word l) { return l & ~LOCK_BIT; }
 
   // The global ownership record.
-  atomic<gtm_word> orec;
+  atomic<gtm_word> orec __attribute__((aligned(HW_CACHELINE_SIZE)));
+  char tailpadding[HW_CACHELINE_SIZE - sizeof(atomic<gtm_word>)];
 
   virtual void init()
   {
@@ -52,7 +53,6 @@ struct gl_mg : public method_group
   virtual void fini() { }
 };
 
-// TODO cacheline padding
 static gl_mg o_gl_mg;
Richard Henderson - Feb. 20, 2012, 6:36 p.m.
On 02/20/12 08:54, Torvald Riegel wrote:
> The following patches don't depend on each other but are all for the
> gl_wt TM method.
> 
> patch1:
> For memory transfers from source regions with the read-for-write
> modifier, undo logging of these regions was missing.  Also optimize the
> number of accesses to gtm_thr().

Out of curiosity, does an optional argument as in

  static void pre_write(const void *addr, size_t len, gtm_thread *tx = gtm_thr())

work just as well as

+  static void pre_write(const void *addr, size_t len)
+  {
+    gtm_thread *tx = gtm_thr();
+    pre_write(addr, len, tx);
+  }

> +	if (dst_mod == WaW || dst_mod == NONTXNAL)
> +	  tx = gtm_thr();

This sort of micro-optimization can't be good.  If we are, for some reason,
not unifying the two loads, it seems to me that it'd be better to simply
unconditionally load the thread data here:

> +    gtm_thread *tx = 0;

The actual bug fix, calling pre_write for RfW, is of course ok.

> patch2:
> Similar to the ml_wt code, we don't need to use acq_rel but just acquire
> memory order when acquiring the orec.  The following release fence is
> still necessary.

Ok.


> 
> patch3:
> Recently, I changed how gtm_rwlock handles upgrades to serial mode.
> This change was to fix privatization safety and basically consists of
> not modifying gtm_thread::shared_state until trycommit() and/or
> rollback() have finished after the upgrade (and before the switch to
> serialirr dispatch).  With this in place, we can avoid the previous
> workarounds for the commit-in-serial-mode corner case.

Ok.


> 
> patch4:
> Privatization safety in gl_wt rollback was previously enforced with
> something that probably worked but for which the precise C++11 mem model
> reasoning was unclear.  The patch fixes that (so it's clear why it works
> on the basis of the language's memory model, not on the HW models), and
> it is less costly than the previously used seq_cst barrier.

Ok.

> 
> patch5:
> Just put gl_wt's global orec on a separate cacheline, same as we do in
> ml_wt.

> +  atomic<gtm_word> orec __attribute__((aligned(HW_CACHELINE_SIZE)));
> +  char tailpadding[HW_CACHELINE_SIZE - sizeof(atomic<gtm_word>)];

The "tailpadding" structure should not be required.  Since an element of
the structure is aligned, the whole class will be aligned.  And the 
compiler automatically pads aligned structures so that arrays of them
work, and satisfy the alignment.

Or is the orec overlapping with the vtable or something?


r~
Torvald Riegel - Feb. 20, 2012, 8:58 p.m.
On Mon, 2012-02-20 at 10:36 -0800, Richard Henderson wrote:
> [...]
> > patch5:
> > Just put gl_wt's global orec on a separate cacheline, same as we do in
> > ml_wt.
> 
> > +  atomic<gtm_word> orec __attribute__((aligned(HW_CACHELINE_SIZE)));
> > +  char tailpadding[HW_CACHELINE_SIZE - sizeof(atomic<gtm_word>)];
> 
> The "tailpadding" structure should not be required.  Since an element of
> the structure is aligned, the whole class will be aligned.  And the 
> compiler automatically pads aligned structures so that arrays of them
> work, and satisfy the alignment.
> 
> Or is the orec overlapping with the vtable or something?

It is without the tail-padding, but currently the virtual functions are
called very infrequently.  Removed the tailpadding and added a comment
instead.

Committed with your changes applied.

Patch

diff --git a/libitm/method-gl.cc b/libitm/method-gl.cc
index d6d050d..8df3d30 100644
--- a/libitm/method-gl.cc
+++ b/libitm/method-gl.cc
@@ -85,9 +85,8 @@  static gl_mg o_gl_mg;
 class gl_wt_dispatch : public abi_dispatch
 {
 protected:
-  static void pre_write(const void *addr, size_t len)
+  static void pre_write(const void *addr, size_t len, gtm_thread *tx)
   {
-    gtm_thread *tx = gtm_thr();
     gtm_word v = tx->shared_state.load(memory_order_relaxed);
     if (unlikely(!gl_mg::is_locked(v)))
       {
@@ -123,7 +122,13 @@  protected:
     tx->undolog.log(addr, len);
   }
 
-  static void validate()
+  static void pre_write(const void *addr, size_t len)
+  {
+    gtm_thread *tx = gtm_thr();
+    pre_write(addr, len, tx);
+  }
+
+  static void validate(gtm_thread *tx)
   {
     // Check that snapshot is consistent.  We expect the previous data load to
     // have acquire memory order, or be atomic and followed by an acquire
@@ -137,12 +142,17 @@  protected:
     // or read an orec value that was written after the data had been written.
     // Either will allow us to detect inconsistent reads because it will have
     // a higher/different value.
-    gtm_thread *tx = gtm_thr();
     gtm_word l = o_gl_mg.orec.load(memory_order_relaxed);
     if (l != tx->shared_state.load(memory_order_relaxed))
       tx->restart(RESTART_VALIDATE_READ);
   }
 
+  static void validate()
+  {
+    gtm_thread *tx = gtm_thr();
+    validate(tx);
+  }
+
   template <typename V> static V load(const V* addr, ls_modifier mod)
   {
     // Read-for-write should be unlikely, but we need to handle it or will
@@ -198,9 +208,20 @@  public:
   static void memtransfer_static(void *dst, const void* src, size_t size,
       bool may_overlap, ls_modifier dst_mod, ls_modifier src_mod)
   {
-    if ((dst_mod != WaW && src_mod != RaW)
-	&& (dst_mod != NONTXNAL || src_mod == RfW))
-      pre_write(dst, size);
+    gtm_thread *tx = 0;
+    if (dst_mod != WaW && dst_mod != NONTXNAL)
+      {
+	tx = gtm_thr();
+	pre_write(dst, size, tx);
+      }
+    // We need at least undo-logging for an RfW src region because we might
+    // subsequently write there with WaW.
+    if (src_mod == RfW)
+      {
+	if (dst_mod == WaW || dst_mod == NONTXNAL)
+	  tx = gtm_thr();
+	pre_write(src, size, tx);
+      }
 
     // FIXME We should use atomics here (see store()).  Let's just hope that
     // memcpy/memmove are good enough.
@@ -211,7 +232,11 @@  public:
 
     if (src_mod != RfW && src_mod != RaW && src_mod != NONTXNAL
 	&& dst_mod != WaW)
-      validate();
+      {
+	if ((dst_mod == WaW || dst_mod == NONTXNAL) && src_mod != RfW)
+	  tx = gtm_thr();
+	validate(tx);
+      }
   }
 
   static void memset_static(void *dst, int c, size_t size, ls_modifier mod)