
[8/7] net/netfilter/nf_conntrack_core: Remove another memory barrier

Message ID 5d88110f-cf0e-c72e-7acc-518b736e715e@colorfullife.com
State Changes Requested
Delegated to: Pablo Neira

Commit Message

Manfred Spraul Sept. 5, 2016, 6:57 p.m. UTC
Hi Peter,

On 09/02/2016 09:22 PM, Peter Zijlstra wrote:
> On Fri, Sep 02, 2016 at 08:35:55AM +0200, Manfred Spraul wrote:
>> On 09/01/2016 06:41 PM, Peter Zijlstra wrote:
>>> On Thu, Sep 01, 2016 at 04:30:39PM +0100, Will Deacon wrote:
>>>> On Thu, Sep 01, 2016 at 05:27:52PM +0200, Manfred Spraul wrote:
>>>>> Since spin_unlock_wait() is defined as equivalent to spin_lock();
>>>>> spin_unlock(), the memory barrier before spin_unlock_wait() is
>>>>> also not required.
>>> Note that ACQUIRE+RELEASE isn't a barrier.
>>>
>>> Both are semi-permeable and things can cross in the middle, like:
>>>
>>>
>>> 	x = 1;
>>> 	LOCK
>>> 	UNLOCK
>>> 	r = y;
>>>
>>> can (validly) get re-ordered like:
>>>
>>> 	LOCK
>>> 	r = y;
>>> 	x = 1;
>>> 	UNLOCK
>>>
>>> So if you want things ordered, as I think you do, I think the smp_mb()
>>> is still needed.
>> CPU1:
>> x=1; /* without WRITE_ONCE */
>> LOCK(l);
>> UNLOCK(l);
>> <do_semop>
>> smp_store_release(x,0)
>>
>>
>> CPU2:
>> LOCK(l)
>> if (smp_load_acquire(x)==1) goto slow_path
>> <do_semop>
>> UNLOCK(l)
>>
>> Ordering is enforced because both CPUs access the same lock.
>>
>> x=1 can't be reordered past the UNLOCK(l); I don't see that further
>> guarantees are necessary.
>>
>> Correct?
> Correct, sadly implementations do not comply :/ In fact, even x86 is
> broken here.
>
> I spoke to Will earlier today and he suggests either making
> spin_unlock_wait() stronger to avoid any and all such surprises or just
> getting rid of the thing.
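
To restate the concern above in a form one can actually run: an SB-shaped
user-space litmus test, with pthread spinlocks standing in for the kernel's
spin_lock()/spin_unlock() and relaxed C11 atomics for the plain accesses.
This is only an illustration, not kernel code:

/* sb_lock.c: SB-shaped litmus test, user-space sketch only.
 * Build: gcc -O2 -pthread sb_lock.c -o sb_lock
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define ITERS 1000000

static atomic_int x, y;
static int r0, r1;
static pthread_spinlock_t l0, l1;
static pthread_barrier_t start, stop;

static void *thread0(void *arg)
{
	for (int i = 0; i < ITERS; i++) {
		pthread_barrier_wait(&start);
		atomic_store_explicit(&x, 1, memory_order_relaxed);
		pthread_spin_lock(&l0);		/* ACQUIRE */
		pthread_spin_unlock(&l0);	/* RELEASE */
		r0 = atomic_load_explicit(&y, memory_order_relaxed);
		pthread_barrier_wait(&stop);
	}
	return NULL;
}

static void *thread1(void *arg)
{
	for (int i = 0; i < ITERS; i++) {
		pthread_barrier_wait(&start);
		atomic_store_explicit(&y, 1, memory_order_relaxed);
		pthread_spin_lock(&l1);
		pthread_spin_unlock(&l1);
		r1 = atomic_load_explicit(&x, memory_order_relaxed);
		pthread_barrier_wait(&stop);
	}
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;
	int hits = 0;

	pthread_spin_init(&l0, PTHREAD_PROCESS_PRIVATE);
	pthread_spin_init(&l1, PTHREAD_PROCESS_PRIVATE);
	pthread_barrier_init(&start, NULL, 3);
	pthread_barrier_init(&stop, NULL, 3);
	pthread_create(&t0, NULL, thread0, NULL);
	pthread_create(&t1, NULL, thread1, NULL);

	for (int i = 0; i < ITERS; i++) {
		atomic_store(&x, 0);
		atomic_store(&y, 0);
		pthread_barrier_wait(&start);	/* release both threads */
		pthread_barrier_wait(&stop);	/* wait for their loads */
		if (r0 == 0 && r1 == 0)
			hits++;
	}
	printf("r0==0 && r1==0 observed %d times\n", hits);

	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	return 0;
}

If the LOCK/UNLOCK pair acted as a full barrier, r0 == 0 && r1 == 0 would
be impossible. Because ACQUIRE and RELEASE are only semi-permeable, that
outcome stays allowed: expect a count of 0 on x86 (the locked RMW in the
lock acquisition is a full barrier there), but possibly non-zero counts on
arm64 or power.
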
I've tried the trivial solution:
Replace spin_unlock_wait() with spin_lock(); spin_unlock().
With sem-scalebench, I get around a factor-2 slowdown with an array of
16 semaphores and a factor-13 slowdown with an array of 256
semaphores :-( [with LOCKDEP+DEBUG_SPINLOCK].
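
For reference, the shape of that change, as a user-space analogue with
pthread spinlocks (the helper name is made up for the example; this is
not the actual patch against ipc/sem.c):

#include <pthread.h>

/* one lock per semaphore in the array */
static void wait_for_all_holders(pthread_spinlock_t *locks, int nsems)
{
	for (int i = 0; i < nsems; i++) {
		/* acquire+release waits for any current holder and is a
		 * full ordering point, but it costs an atomic RMW pair per
		 * semaphore even when the lock is uncontended, which is
		 * where the slowdown on large arrays comes from */
		pthread_spin_lock(&locks[i]);
		pthread_spin_unlock(&locks[i]);
	}
}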

Anyone around with a ppc or arm? How slow is the loop of
spin_unlock_wait() calls?
Single CPU is sufficient.

Question 1: How large is the difference between:
#./sem-scalebench -t 10 -c 1 -p 1 -o 4 -f -d 1
#./sem-scalebench -t 10 -c 1 -p 1 -o 4 -f -d 256
https://github.com/manfred-colorfu/ipcscale

For x86, the difference is only ~30%.

Question 2:
Is it faster if the attached patch is applied? (relative to mmots)
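
For anyone who does not want to read the diff below, this is roughly the
handshake the patch implements, sketched in user space: C11 atomics stand
in for smp_store_mb()/smp_load_acquire()/smp_store_release(), pthread
spinlocks for the kernel spinlocks, and the struct/function names only
mirror ipc/sem.c. It is a sketch, not kernel code:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sem {
	pthread_spinlock_t lock;
	_Atomic bool is_locked;		/* set while the fast path holds lock */
};

struct sem_array {
	_Atomic bool complex_mode;	/* a complex operation is in flight */
	struct sem *base;
	int nsems;
};

/* fast path: lock a single semaphore, unless a complex op is running */
static bool sem_lock_one(struct sem_array *sma, int num)
{
	struct sem *sem = &sma->base[num];

	pthread_spin_lock(&sem->lock);

	/* ~ smp_store_mb(sem->is_locked, true): the flag must be visible
	 * before complex_mode is read */
	atomic_store_explicit(&sem->is_locked, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);

	if (!atomic_load_explicit(&sma->complex_mode, memory_order_acquire))
		return true;		/* fast path successful */

	/* complex op in flight: back out, caller takes the global lock */
	atomic_store_explicit(&sem->is_locked, false, memory_order_release);
	pthread_spin_unlock(&sem->lock);
	return false;
}

static void sem_unlock_one(struct sem_array *sma, int num)
{
	struct sem *sem = &sma->base[num];

	/* ~ smp_store_release(&sem->is_locked, false) */
	atomic_store_explicit(&sem->is_locked, false, memory_order_release);
	pthread_spin_unlock(&sem->lock);
}

/* slow path: make complex_mode visible, then wait for every fast-path
 * holder that might not yet have seen it */
static void complexmode_enter(struct sem_array *sma)
{
	/* ~ smp_store_mb(sma->complex_mode, true) */
	atomic_store_explicit(&sma->complex_mode, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);

	for (int i = 0; i < sma->nsems; i++) {
		struct sem *sem = &sma->base[i];

		/* only semaphores whose flag is still set need the expensive
		 * lock+unlock; for a mostly idle array that is one barrier
		 * instead of one per semaphore */
		if (atomic_load_explicit(&sem->is_locked, memory_order_acquire)) {
			pthread_spin_lock(&sem->lock);
			pthread_spin_unlock(&sem->lock);
		}
	}
}

Either the fast path sees complex_mode and backs off, or
complexmode_enter() sees is_locked and waits via lock+unlock; the two
full fences make it impossible for both sides to miss each other.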

--
     Manfred

Comments

Will Deacon Sept. 6, 2016, 5:56 p.m. UTC | #1
On Mon, Sep 05, 2016 at 08:57:19PM +0200, Manfred Spraul wrote:
> On 09/02/2016 09:22 PM, Peter Zijlstra wrote:
> Anyone around with a ppc or arm? How slow is the loop of the
> spin_unlock_wait() calls?
> Single CPU is sufficient.
> 
> Question 1: How large is the difference between:
> #./sem-scalebench -t 10 -c 1 -p 1 -o 4 -f -d 1
> #./sem-scalebench -t 10 -c 1 -p 1 -o 4 -f -d 256
> https://github.com/manfred-colorfu/ipcscale

Not sure exactly what you want me to run here, but with an arm64
defconfig -rc3 kernel, those two invocations give me "Max total" values
where the first is 20x bigger than the second.

Will

Patch

From b063c9edbb264cfcbca6c23eee3c85f90cd77ae1 Mon Sep 17 00:00:00 2001
From: Manfred Spraul <manfred@colorfullife.com>
Date: Mon, 5 Sep 2016 20:45:38 +0200
Subject: [PATCH] ipc/sem.c: Avoid spin_unlock_wait()

Experimental, not fully tested!
spin_unlock_wait() may be expensive because it must ensure memory
ordering.
Test: would it be faster if an explicit is_locked flag were used?
For large arrays, only one barrier would be required.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 ipc/sem.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 5e318c5..062ece2d 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -101,6 +101,7 @@  struct sem {
 	 */
 	int	sempid;
 	spinlock_t	lock;	/* spinlock for fine-grained semtimedop */
+	int		is_locked;	/* locked flag */
 	struct list_head pending_alter; /* pending single-sop operations */
 					/* that alter the semaphore */
 	struct list_head pending_const; /* pending single-sop operations */
@@ -282,17 +283,22 @@  static void complexmode_enter(struct sem_array *sma)
 
 	/* We need a full barrier after seting complex_mode:
 	 * The write to complex_mode must be visible
-	 * before we read the first sem->lock spinlock state.
+	 * before we read the first sem->is_locked state.
 	 */
 	smp_store_mb(sma->complex_mode, true);
 
 	for (i = 0; i < sma->sem_nsems; i++) {
 		sem = sma->sem_base + i;
-		spin_unlock_wait(&sem->lock);
+		if (sem->is_locked) {
+			spin_lock(&sem->lock);
+			spin_unlock(&sem->lock);
+		}
 	}
 	/*
-	 * spin_unlock_wait() is not a memory barriers, it is only a
-	 * control barrier. The code must pair with spin_unlock(&sem->lock),
+	 * If spin_lock(); spin_unlock() is used, then everything is
+	 * ordered. Otherwise: Reading sem->is_locked is only a control
+	 * barrier.
+	 * The code must pair with smp_store_release(&sem->is_locked),
 	 * thus just the control barrier is insufficient.
 	 *
 	 * smp_rmb() is sufficient, as writes cannot pass the control barrier.
@@ -364,17 +370,16 @@  static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		spin_lock(&sem->lock);
 
 		/*
-		 * See 51d7d5205d33
-		 * ("powerpc: Add smp_mb() to arch_spin_is_locked()"):
-		 * A full barrier is required: the write of sem->lock
-		 * must be visible before the read is executed
+		 * set is_locked. It must be ordered before
+		 * reading sma->complex_mode.
 		 */
-		smp_mb();
+		smp_store_mb(sem->is_locked, true);
 
 		if (!smp_load_acquire(&sma->complex_mode)) {
 			/* fast path successful! */
 			return sops->sem_num;
 		}
+		smp_store_release(&sem->is_locked, false);
 		spin_unlock(&sem->lock);
 	}
 
@@ -387,6 +392,8 @@  static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		 * back to the fast path.
 		 */
 		spin_lock(&sem->lock);
+		/* no need for smp_mb, we own the global lock */
+		sem->is_locked = true;
 		ipc_unlock_object(&sma->sem_perm);
 		return sops->sem_num;
 	} else {
@@ -406,6 +413,7 @@  static inline void sem_unlock(struct sem_array *sma, int locknum)
 		ipc_unlock_object(&sma->sem_perm);
 	} else {
 		struct sem *sem = sma->sem_base + locknum;
+		smp_store_release(&sem->is_locked, false);
 		spin_unlock(&sem->lock);
 	}
 }
@@ -551,6 +559,7 @@  static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 		INIT_LIST_HEAD(&sma->sem_base[i].pending_alter);
 		INIT_LIST_HEAD(&sma->sem_base[i].pending_const);
 		spin_lock_init(&sma->sem_base[i].lock);
+		sma->sem_base[i].is_locked = false;
 	}
 
 	sma->complex_count = 0;
-- 
2.7.4