Patchwork [RFC] net: add additional lock to qdisc to increase enqueue/dequeue fairness

login
register
mail settings
Submitter Alexander Duyck
Date March 23, 2010, 8:25 p.m.
Message ID <20100323202553.21598.10754.stgit@gitlad.jf.intel.com>
Download mbox | patch
Permalink /patch/48363/
State RFC
Delegated to: David Miller
Headers show

Comments

Alexander Duyck - March 23, 2010, 8:25 p.m.
The qdisc layer shows a significant issue when you start transmitting from
multiple CPUs.  The issue is that the transmit rate drops significantly, and I
believe it is due to the fact that the spinlock is shared between the 1
dequeue, and n-1 enqueue cpu threads.  In order to improve this situation I am
adding one additional lock which will need to be obtained during the enqueue
portion of the path.  This essentially allows sch_direct_xmit to jump to
near the head of the line when attempting to obtain the lock after
completing a transmit.

Running the script below I saw an increase from 200K packets per second to
1.07M packets per second as a result of this patch.

for j in `seq 0 15`; do
	for i in `seq 0 7`; do
		netperf -H <ip> -t UDP_STREAM -l 600 -N -T $i -- -m 6 &
	done
done

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 include/net/sch_generic.h |    3 ++-
 net/core/dev.c            |    7 ++++++-
 net/sched/sch_generic.c   |    1 +
 3 files changed, 9 insertions(+), 2 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - March 23, 2010, 8:31 p.m.
From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Tue, 23 Mar 2010 13:25:53 -0700

> The qdisc layer shows a significant issue when you start transmitting from
> multiple CPUs.

Why are you hitting any central qdisc lock on a multiqueue
configuration?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones - March 23, 2010, 8:40 p.m.
Alexander Duyck wrote:
> The qdisc layer shows a significant issue when you start transmitting from
> multiple CPUs.  The issue is that the transmit rate drops significantly, and I
> believe it is due to the fact that the spinlock is shared between the 1
> dequeue, and n-1 enqueue cpu threads.  In order to improve this situation I am
> adding one additional lock which will need to be obtained during the enqueue
> portion of the path.  This essentially allows sch_direct_xmit to jump to
> near the head of the line when attempting to obtain the lock after
> completing a transmit.
> 
> Running the script below I saw an increase from 200K packets per second to
> 1.07M packets per second as a result of this patch.
> 
> for j in `seq 0 15`; do
> 	for i in `seq 0 7`; do
> 		netperf -H <ip> -t UDP_STREAM -l 600 -N -T $i -- -m 6 &
> 	done
> done


Wow - someone is using the no control connection mode in netperf :)  Or were you 
actually trying to get the test-specific -N option to control whether or not 
netperf "connects" the UDP socket?

rick jones
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - March 23, 2010, 8:54 p.m.
Le mardi 23 mars 2010 à 13:25 -0700, Alexander Duyck a écrit :
> The qdisc layer shows a significant issue when you start transmitting from
> multiple CPUs.  The issue is that the transmit rate drops significantly, and I
> believe it is due to the fact that the spinlock is shared between the 1
> dequeue, and n-1 enqueue cpu threads.  In order to improve this situation I am
> adding one additional lock which will need to be obtained during the enqueue
> portion of the path.  This essentially allows sch_direct_xmit to jump to
> near the head of the line when attempting to obtain the lock after
> completing a transmit.
> 
> Running the script below I saw an increase from 200K packets per second to
> 1.07M packets per second as a result of this patch.
> 
> for j in `seq 0 15`; do
> 	for i in `seq 0 7`; do
> 		netperf -H <ip> -t UDP_STREAM -l 600 -N -T $i -- -m 6 &
> 	done
> done
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
> 

Hi Alexander

Thats a pretty good topic :)

So to speedup a pathological case (dozen of cpus all sending to same
queue), you suggest adding a spin_lock to fast path, slowing down normal
cases ?

Quite frankly, the real problem in this case is not the reduced
throughput, but fact that one cpu can stay a long time doing the xmits
to device, of skb queued by other cpus. This can hurt latencies a lot,
for real time threads for example...

I wonder if ticket spinlocks are not the problem. Maybe we want a
variant of spinlocks, so that cpu doing transmits can get the lock
before other cpus...



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - March 23, 2010, 9:18 p.m.
Le mardi 23 mars 2010 à 21:54 +0100, Eric Dumazet a écrit :
> I wonder if ticket spinlocks are not the problem. Maybe we want a
> variant of spinlocks, so that cpu doing transmits can get the lock
> before other cpus...

Something like this portable implementation :

struct spinprio {
	spinlock_t	lock;
	int		highprio_cnt;
};

void spinprio_lock(struct spinprio *l)
{
	while (1) {
		spin_lock(&l->lock);
		if (!l->highprio_cnt)
			break;
		spin_unlock(&l->lock);
		cpu_relax();
	}
}

void spinprio_unlock(struct spinprio *l)
{
	spin_unlock(&l->lock);
}

void spinprio_relock(struct spinprio *l)
{
	l->highprio_cnt = 1;
	spin_lock(&l->lock);
	l->highprio_cnt = 0;
}

We would have to use spinprio_unlock()/spinprio_relock() in
sch_direct_xmit()



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - March 23, 2010, 9:45 p.m.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 23 Mar 2010 21:54:27 +0100

> Quite frankly, the real problem in this case is not the reduced
> throughput, but fact that one cpu can stay a long time doing the xmits
> to device, of skb queued by other cpus. This can hurt latencies a lot,
> for real time threads for example...
> 
> I wonder if ticket spinlocks are not the problem. Maybe we want a
> variant of spinlocks, so that cpu doing transmits can get the lock
> before other cpus...

I want to note that things operate the way they do now
intentionally.

Herbert Xu and Jamal Hadi Salim were active in this area
about 4 years ago.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Duyck - March 23, 2010, 10:08 p.m.
David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 23 Mar 2010 21:54:27 +0100
> 
>> Quite frankly, the real problem in this case is not the reduced
>> throughput, but fact that one cpu can stay a long time doing the
>> xmits to device, of skb queued by other cpus. This can hurt
>> latencies a lot, for real time threads for example...
>> 
>> I wonder if ticket spinlocks are not the problem. Maybe we want a
>> variant of spinlocks, so that cpu doing transmits can get the lock
>> before other cpus...
> 
> I want to note that things operate the way they do now
> intentionally.
> 
> Herbert Xu and Jamal Hadi Salim were active in this area
> about 4 years ago.

I'll do some digging on this to see what I can find out.  I'm not sure why we would want to do things this way since it seems like the CPU that has qdisc_running is sitting idle due to the fact that the ticket spinlocks have us waiting for all of the other CPU enqueues before we can get around to another dequeue.  It seems like it would be in our best interest to keep qdisc_running going while we are still enqueuing on the other CPUs.

I'll also look into the idea of possibly coming up with a more portable solution such as a priority based spinlock solution.

Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - March 23, 2010, 10:13 p.m.
Le mardi 23 mars 2010 à 14:45 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 23 Mar 2010 21:54:27 +0100
> 
> > Quite frankly, the real problem in this case is not the reduced
> > throughput, but fact that one cpu can stay a long time doing the xmits
> > to device, of skb queued by other cpus. This can hurt latencies a lot,
> > for real time threads for example...
> > 
> > I wonder if ticket spinlocks are not the problem. Maybe we want a
> > variant of spinlocks, so that cpu doing transmits can get the lock
> > before other cpus...
> 
> I want to note that things operate the way they do now
> intentionally.
> 
> Herbert Xu and Jamal Hadi Salim were active in this area
> about 4 years ago.

Yes, but ticket spinlocks were added after their work (in 2008 - 2.6.25
if I remember well) and change things.

We want cpu owning __QDISC_STATE_RUNNING being able to re-get the lock
as fast as possible. Alexander results can show the possible speedup.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andi Kleen - March 24, 2010, 2:12 a.m.
David Miller <davem@davemloft.net> writes:

> From: Alexander Duyck <alexander.h.duyck@intel.com>
> Date: Tue, 23 Mar 2010 13:25:53 -0700
>
>> The qdisc layer shows a significant issue when you start transmitting from
>> multiple CPUs.
>
> Why are you hitting any central qdisc lock on a multiqueue
> configuration?

One thing to remember is that systems can have a lot more CPUs than a NIC queues
(e.g. 64 vs 8)

-Andi
David Miller - March 24, 2010, 2:58 a.m.
From: "Duyck, Alexander H" <alexander.h.duyck@intel.com>
Date: Tue, 23 Mar 2010 15:08:12 -0700

> I'll do some digging on this to see what I can find out.

Please see:  http://vger.kernel.org/jamal_netconf2006.sxi
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - March 24, 2010, 5:42 a.m.
Le mardi 23 mars 2010 à 19:58 -0700, David Miller a écrit :
> From: "Duyck, Alexander H" <alexander.h.duyck@intel.com>
> Date: Tue, 23 Mar 2010 15:08:12 -0700
> 
> > I'll do some digging on this to see what I can find out.
> 
> Please see:  http://vger.kernel.org/jamal_netconf2006.sxi
> --

Thanks David, this is very informative !

Do we have a list of such network related documents ?


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - March 24, 2010, 6:10 a.m.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 24 Mar 2010 06:42:04 +0100

> Do we have a list of such network related documents ?

Not really, just scan http://vger.kernel.org/netconf200${N}.html
for the papers :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
stephen hemminger - May 21, 2010, 3:43 p.m.
On Tue, 23 Mar 2010 23:13:00 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le mardi 23 mars 2010 à 14:45 -0700, David Miller a écrit :
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Tue, 23 Mar 2010 21:54:27 +0100
> > 
> > > Quite frankly, the real problem in this case is not the reduced
> > > throughput, but fact that one cpu can stay a long time doing the xmits
> > > to device, of skb queued by other cpus. This can hurt latencies a lot,
> > > for real time threads for example...
> > > 
> > > I wonder if ticket spinlocks are not the problem. Maybe we want a
> > > variant of spinlocks, so that cpu doing transmits can get the lock
> > > before other cpus...
> > 
> > I want to note that things operate the way they do now
> > intentionally.
> > 
> > Herbert Xu and Jamal Hadi Salim were active in this area
> > about 4 years ago.
> 
> Yes, but ticket spinlocks were added after their work (in 2008 - 2.6.25
> if I remember well) and change things.
> 
> We want cpu owning __QDISC_STATE_RUNNING being able to re-get the lock
> as fast as possible. Alexander results can show the possible speedup.

What about having a special function (spin_lock_greedy?) that just ignores
the ticket mechanism and always assumes it has right to next ticket.
Eric Dumazet - May 21, 2010, 4:44 p.m.
Le vendredi 21 mai 2010 à 08:43 -0700, Stephen Hemminger a écrit :

> What about having a special function (spin_lock_greedy?) that just ignores
> the ticket mechanism and always assumes it has right to next ticket.
> 

I am not sure we want to do this, for a single use case...
Adding a new lock primitive to linux should be really really motivated.

x86 ticket spinlocks are nice because we are sure no cpu is going to
stay forever in this code. For other arches, I dont know how this is
coped.

I thought about this before re-working on this patch, but found it was
easier to slightly change Alexander code, ie adding a regular spinlock
for the slowpath.

We could use cmpxchg() and manipulate several bits at once in fast path.
( __QDISC_STATE_RUNNING, __QDISC_STATE_LOCKED ... ) but making the crowd
of cpus spin on the same bits/cacheline than dequeue worker would
definitely slowdown the worker.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
stephen hemminger - May 21, 2010, 5:38 p.m.
On Fri, 21 May 2010 18:44:35 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le vendredi 21 mai 2010 à 08:43 -0700, Stephen Hemminger a écrit :
> 
> > What about having a special function (spin_lock_greedy?) that just ignores
> > the ticket mechanism and always assumes it has right to next ticket.
> > 
> 
> I am not sure we want to do this, for a single use case...
> Adding a new lock primitive to linux should be really really motivated.
> 
> x86 ticket spinlocks are nice because we are sure no cpu is going to
> stay forever in this code. For other arches, I dont know how this is
> coped.
> 
> I thought about this before re-working on this patch, but found it was
> easier to slightly change Alexander code, ie adding a regular spinlock
> for the slowpath.
> 
> We could use cmpxchg() and manipulate several bits at once in fast path.
> ( __QDISC_STATE_RUNNING, __QDISC_STATE_LOCKED ... ) but making the crowd
> of cpus spin on the same bits/cacheline than dequeue worker would
> definitely slowdown the worker.

Your solution is okay, but it is a special case performance hack
and we seem to be getting more and more of these lately.
This is a general problem of any producer/consumer case;
other code will have same problem (like disk requests) 
and can't use the same solution.

Maybe the RT and scheduler folks have some input because it does
seem like the same problem has occurred in other contexts before.
Eric Dumazet - May 21, 2010, 5:40 p.m.
Le vendredi 21 mai 2010 à 18:44 +0200, Eric Dumazet a écrit :

> We could use cmpxchg() and manipulate several bits at once in fast path.
> ( __QDISC_STATE_RUNNING, __QDISC_STATE_LOCKED ... ) but making the crowd
> of cpus spin on the same bits/cacheline than dequeue worker would
> definitely slowdown the worker.
> 
> 

Maybe I am missing something, but __QDISC_STATE_RUNNING is always
manipulated with the lock held...

We might avoid two atomic ops when changing this state (if moved to a
separate container) in fast path (when a cpu sends only one packet and
returns)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - June 2, 2010, 9:48 a.m.
Le vendredi 21 mai 2010 à 19:40 +0200, Eric Dumazet a écrit :
> Le vendredi 21 mai 2010 à 18:44 +0200, Eric Dumazet a écrit :
> 
> > We could use cmpxchg() and manipulate several bits at once in fast path.
> > ( __QDISC_STATE_RUNNING, __QDISC_STATE_LOCKED ... ) but making the crowd
> > of cpus spin on the same bits/cacheline than dequeue worker would
> > definitely slowdown the worker.
> > 
> > 
> 
> Maybe I am missing something, but __QDISC_STATE_RUNNING is always
> manipulated with the lock held...
> 
> We might avoid two atomic ops when changing this state (if moved to a
> separate container) in fast path (when a cpu sends only one packet and
> returns)
> 
> 

Here are two patches to implement this idea.

First patch to abstract QDISC_STATE_RUNNING access.

Second patch to add a __qstate container and remove the atomic ops.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 67dc08e..f5088a9 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -51,7 +51,6 @@  struct Qdisc {
 	struct list_head	list;
 	u32			handle;
 	u32			parent;
-	atomic_t		refcnt;
 	struct gnet_stats_rate_est	rate_est;
 	int			(*reshape_fail)(struct sk_buff *skb,
 					struct Qdisc *q);
@@ -65,6 +64,8 @@  struct Qdisc {
 	struct netdev_queue	*dev_queue;
 	struct Qdisc		*next_sched;
 
+	atomic_t		refcnt;
+	spinlock_t		enqueue_lock;
 	struct sk_buff		*gso_skb;
 	/*
 	 * For performance sake on SMP, we put highly modified fields at the end
diff --git a/net/core/dev.c b/net/core/dev.c
index a03aab4..8a2d508 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2006,8 +2006,10 @@  static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	spinlock_t *root_lock = qdisc_lock(q);
 	int rc;
 
+	spin_lock(&q->enqueue_lock);
 	spin_lock(root_lock);
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
+		spin_unlock(&q->enqueue_lock);
 		kfree_skb(skb);
 		rc = NET_XMIT_DROP;
 	} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
@@ -2018,7 +2020,9 @@  static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		 * xmit the skb directly.
 		 */
 		__qdisc_update_bstats(q, skb->len);
-		if (sch_direct_xmit(skb, q, dev, txq, root_lock))
+		spin_unlock(&q->enqueue_lock);
+		rc = sch_direct_xmit(skb, q, dev, txq, root_lock);
+		if (rc)
 			__qdisc_run(q);
 		else
 			clear_bit(__QDISC_STATE_RUNNING, &q->state);
@@ -2026,6 +2030,7 @@  static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		rc = NET_XMIT_SUCCESS;
 	} else {
 		rc = qdisc_enqueue_root(skb, q);
+		spin_unlock(&q->enqueue_lock);
 		qdisc_run(q);
 	}
 	spin_unlock(root_lock);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 5173c1e..4b5e125 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -538,6 +538,7 @@  struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 	sch = (struct Qdisc *) QDISC_ALIGN((unsigned long) p);
 	sch->padded = (char *) sch - (char *) p;
 
+	spin_lock_init(&sch->enqueue_lock);
 	INIT_LIST_HEAD(&sch->list);
 	skb_queue_head_init(&sch->q);
 	sch->ops = ops;