Message ID | 20100323202553.21598.10754.stgit@gitlad.jf.intel.com
---|---
State | RFC, archived
Delegated to: | David Miller
From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Tue, 23 Mar 2010 13:25:53 -0700

> The qdisc layer shows a significant issue when you start transmitting from
> multiple CPUs.

Why are you hitting any central qdisc lock on a multiqueue configuration?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alexander Duyck wrote:
> The qdisc layer shows a significant issue when you start transmitting from
> multiple CPUs. The issue is that the transmit rate drops significantly, and I
> believe it is due to the fact that the spinlock is shared between the 1
> dequeue, and n-1 enqueue cpu threads. In order to improve this situation I am
> adding one additional lock which will need to be obtained during the enqueue
> portion of the path. This essentially allows sch_direct_xmit to jump to
> near the head of the line when attempting to obtain the lock after
> completing a transmit.
>
> Running the script below I saw an increase from 200K packets per second to
> 1.07M packets per second as a result of this patch.
>
> for j in `seq 0 15`; do
>   for i in `seq 0 7`; do
>     netperf -H <ip> -t UDP_STREAM -l 600 -N -T $i -- -m 6 &
>   done
> done

Wow - someone is using the no control connection mode in netperf :)

Or were you actually trying to get the test-specific -N option to control
whether or not netperf "connects" the UDP socket?

rick jones
On Tue, 23 Mar 2010 at 13:25 -0700, Alexander Duyck wrote:
> The qdisc layer shows a significant issue when you start transmitting from
> multiple CPUs. The issue is that the transmit rate drops significantly, and I
> believe it is due to the fact that the spinlock is shared between the 1
> dequeue, and n-1 enqueue cpu threads. In order to improve this situation I am
> adding one additional lock which will need to be obtained during the enqueue
> portion of the path. This essentially allows sch_direct_xmit to jump to
> near the head of the line when attempting to obtain the lock after
> completing a transmit.
>
> Running the script below I saw an increase from 200K packets per second to
> 1.07M packets per second as a result of this patch.
>
> for j in `seq 0 15`; do
>   for i in `seq 0 7`; do
>     netperf -H <ip> -t UDP_STREAM -l 600 -N -T $i -- -m 6 &
>   done
> done
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---

Hi Alexander

That's a pretty good topic :)

So to speed up a pathological case (dozens of CPUs all sending to the same
queue), you suggest adding a spinlock to the fast path, slowing down the
normal cases?

Quite frankly, the real problem in this case is not the reduced throughput,
but the fact that one CPU can spend a long time doing the xmits to the
device of skbs queued by other CPUs. This can hurt latencies a lot, for
real-time threads for example...

I wonder if ticket spinlocks are not the problem. Maybe we want a variant
of spinlocks, so that the CPU doing transmits can get the lock before the
other CPUs...
On Tue, 23 Mar 2010 at 21:54 +0100, Eric Dumazet wrote:
> I wonder if ticket spinlocks are not the problem. Maybe we want a
> variant of spinlocks, so that cpu doing transmits can get the lock
> before other cpus...

Something like this portable implementation:

struct spinprio {
        spinlock_t lock;
        int        highprio_cnt;
};

void spinprio_lock(struct spinprio *l)
{
        while (1) {
                spin_lock(&l->lock);
                if (!l->highprio_cnt)
                        break;
                spin_unlock(&l->lock);
                cpu_relax();
        }
}

void spinprio_unlock(struct spinprio *l)
{
        spin_unlock(&l->lock);
}

void spinprio_relock(struct spinprio *l)
{
        l->highprio_cnt = 1;
        spin_lock(&l->lock);
        l->highprio_cnt = 0;
}

We would have to use spinprio_unlock()/spinprio_relock() in
sch_direct_xmit()
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 23 Mar 2010 21:54:27 +0100

> Quite frankly, the real problem in this case is not the reduced
> throughput, but fact that one cpu can stay a long time doing the xmits
> to device, of skb queued by other cpus. This can hurt latencies a lot,
> for real time threads for example...
>
> I wonder if ticket spinlocks are not the problem. Maybe we want a
> variant of spinlocks, so that cpu doing transmits can get the lock
> before other cpus...

I want to note that things operate the way they do now intentionally.

Herbert Xu and Jamal Hadi Salim were active in this area about 4 years ago.
David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 23 Mar 2010 21:54:27 +0100
>
>> Quite frankly, the real problem in this case is not the reduced
>> throughput, but fact that one cpu can stay a long time doing the
>> xmits to device, of skb queued by other cpus. This can hurt
>> latencies a lot, for real time threads for example...
>>
>> I wonder if ticket spinlocks are not the problem. Maybe we want a
>> variant of spinlocks, so that cpu doing transmits can get the lock
>> before other cpus...
>
> I want to note that things operate the way they do now
> intentionally.
>
> Herbert Xu and Jamal Hadi Salim were active in this area
> about 4 years ago.

I'll do some digging on this to see what I can find out. I'm not sure why
we would want to do things this way, since it seems like the CPU that has
qdisc_running is sitting idle due to the fact that the ticket spinlocks
have us waiting on all of the other CPU enqueues before we can get around
to another dequeue. It seems like it would be in our best interest to keep
qdisc_running going while we are still enqueuing on the other CPUs.

I'll also look into the idea of possibly coming up with a more portable
solution, such as a priority-based spinlock.

Thanks,

Alex
On Tue, 23 Mar 2010 at 14:45 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 23 Mar 2010 21:54:27 +0100
>
> > Quite frankly, the real problem in this case is not the reduced
> > throughput, but fact that one cpu can stay a long time doing the xmits
> > to device, of skb queued by other cpus. This can hurt latencies a lot,
> > for real time threads for example...
> >
> > I wonder if ticket spinlocks are not the problem. Maybe we want a
> > variant of spinlocks, so that cpu doing transmits can get the lock
> > before other cpus...
>
> I want to note that things operate the way they do now
> intentionally.
>
> Herbert Xu and Jamal Hadi Salim were active in this area
> about 4 years ago.

Yes, but ticket spinlocks were added after their work (in 2008 - 2.6.25 if
I remember correctly) and changed things.

We want the CPU owning __QDISC_STATE_RUNNING to be able to re-take the
lock as fast as possible. Alexander's results show the possible speedup.
David Miller <davem@davemloft.net> writes:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> Date: Tue, 23 Mar 2010 13:25:53 -0700
>
>> The qdisc layer shows a significant issue when you start transmitting from
>> multiple CPUs.
>
> Why are you hitting any central qdisc lock on a multiqueue
> configuration?

One thing to remember is that systems can have a lot more CPUs than a NIC
has queues (e.g. 64 vs 8)

-Andi
From: "Duyck, Alexander H" <alexander.h.duyck@intel.com>
Date: Tue, 23 Mar 2010 15:08:12 -0700

> I'll do some digging on this to see what I can find out.

Please see: http://vger.kernel.org/jamal_netconf2006.sxi
On Tue, 23 Mar 2010 at 19:58 -0700, David Miller wrote:
> From: "Duyck, Alexander H" <alexander.h.duyck@intel.com>
> Date: Tue, 23 Mar 2010 15:08:12 -0700
>
> > I'll do some digging on this to see what I can find out.
>
> Please see: http://vger.kernel.org/jamal_netconf2006.sxi

Thanks David, this is very informative!

Do we have a list of such network-related documents?
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 24 Mar 2010 06:42:04 +0100

> Do we have a list of such network-related documents ?

Not really, just scan http://vger.kernel.org/netconf200${N}.html for the
papers :-)
On Tue, 23 Mar 2010 23:13:00 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 23 Mar 2010 at 14:45 -0700, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Tue, 23 Mar 2010 21:54:27 +0100
> >
> > > Quite frankly, the real problem in this case is not the reduced
> > > throughput, but fact that one cpu can stay a long time doing the xmits
> > > to device, of skb queued by other cpus. This can hurt latencies a lot,
> > > for real time threads for example...
> > >
> > > I wonder if ticket spinlocks are not the problem. Maybe we want a
> > > variant of spinlocks, so that cpu doing transmits can get the lock
> > > before other cpus...
> >
> > I want to note that things operate the way they do now
> > intentionally.
> >
> > Herbert Xu and Jamal Hadi Salim were active in this area
> > about 4 years ago.
>
> Yes, but ticket spinlocks were added after their work (in 2008 - 2.6.25
> if I remember well) and change things.
>
> We want cpu owning __QDISC_STATE_RUNNING being able to re-get the lock
> as fast as possible. Alexander results can show the possible speedup.

What about having a special function (spin_lock_greedy?) that just ignores
the ticket mechanism and always assumes it has the right to the next
ticket?
On Fri, 21 May 2010 at 08:43 -0700, Stephen Hemminger wrote:

> What about having a special function (spin_lock_greedy?) that just ignores
> the ticket mechanism and always assumes it has right to next ticket.

I am not sure we want to do this for a single use case...
Adding a new lock primitive to Linux should be really strongly motivated.

x86 ticket spinlocks are nice because we are sure no CPU is going to stay
forever in this code. For other arches, I don't know how this is handled.

I thought about this before re-working on this patch, but found it was
easier to slightly change Alexander's code, i.e. adding a regular spinlock
for the slow path.

We could use cmpxchg() and manipulate several bits at once in the fast
path (__QDISC_STATE_RUNNING, __QDISC_STATE_LOCKED ...), but making the
crowd of CPUs spin on the same bits/cacheline as the dequeue worker would
definitely slow down the worker.
On Fri, 21 May 2010 18:44:35 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Fri, 21 May 2010 at 08:43 -0700, Stephen Hemminger wrote:
>
> > What about having a special function (spin_lock_greedy?) that just ignores
> > the ticket mechanism and always assumes it has right to next ticket.
>
> I am not sure we want to do this, for a single use case...
> Adding a new lock primitive to linux should be really really motivated.
>
> x86 ticket spinlocks are nice because we are sure no cpu is going to
> stay forever in this code. For other arches, I dont know how this is
> coped.
>
> I thought about this before re-working on this patch, but found it was
> easier to slightly change Alexander code, ie adding a regular spinlock
> for the slowpath.
>
> We could use cmpxchg() and manipulate several bits at once in fast path.
> ( __QDISC_STATE_RUNNING, __QDISC_STATE_LOCKED ... ) but making the crowd
> of cpus spin on the same bits/cacheline than dequeue worker would
> definitely slowdown the worker.

Your solution is okay, but it is a special-case performance hack, and we
seem to be getting more and more of these lately. This is a general
problem of any producer/consumer case; other code will have the same
problem (like disk requests) and can't use the same solution.

Maybe the RT and scheduler folks have some input, because it does seem
like the same problem has occurred in other contexts before.
On Fri, 21 May 2010 at 18:44 +0200, Eric Dumazet wrote:
> We could use cmpxchg() and manipulate several bits at once in fast path.
> ( __QDISC_STATE_RUNNING, __QDISC_STATE_LOCKED ... ) but making the crowd
> of cpus spin on the same bits/cacheline than dequeue worker would
> definitely slowdown the worker.

Maybe I am missing something, but __QDISC_STATE_RUNNING is always
manipulated with the lock held...

We might avoid two atomic ops when changing this state (if moved to a
separate container) in the fast path (when a CPU sends only one packet
and returns).
On Fri, 21 May 2010 at 19:40 +0200, Eric Dumazet wrote:
> Maybe I am missing something, but __QDISC_STATE_RUNNING is always
> manipulated with the lock held...
>
> We might avoid two atomic ops when changing this state (if moved to a
> separate container) in fast path (when a cpu sends only one packet and
> returns)

Here are two patches to implement this idea.

The first patch abstracts QDISC_STATE_RUNNING access. The second patch
adds a __qstate container and removes the atomic ops.
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 67dc08e..f5088a9 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -51,7 +51,6 @@ struct Qdisc {
 	struct list_head	list;
 	u32			handle;
 	u32			parent;
-	atomic_t		refcnt;
 	struct gnet_stats_rate_est	rate_est;
 	int			(*reshape_fail)(struct sk_buff *skb,
 					struct Qdisc *q);
@@ -65,6 +64,8 @@ struct Qdisc {
 	struct netdev_queue	*dev_queue;
 	struct Qdisc		*next_sched;
 
+	atomic_t		refcnt;
+	spinlock_t		enqueue_lock;
 	struct sk_buff		*gso_skb;
 	/*
 	 * For performance sake on SMP, we put highly modified fields at the end
diff --git a/net/core/dev.c b/net/core/dev.c
index a03aab4..8a2d508 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2006,8 +2006,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	spinlock_t *root_lock = qdisc_lock(q);
 	int rc;
 
+	spin_lock(&q->enqueue_lock);
 	spin_lock(root_lock);
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
+		spin_unlock(&q->enqueue_lock);
 		kfree_skb(skb);
 		rc = NET_XMIT_DROP;
 	} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
@@ -2018,7 +2020,9 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		 * xmit the skb directly.
 		 */
 		__qdisc_update_bstats(q, skb->len);
-		if (sch_direct_xmit(skb, q, dev, txq, root_lock))
+		spin_unlock(&q->enqueue_lock);
+		rc = sch_direct_xmit(skb, q, dev, txq, root_lock);
+		if (rc)
 			__qdisc_run(q);
 		else
 			clear_bit(__QDISC_STATE_RUNNING, &q->state);
@@ -2026,6 +2030,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		rc = NET_XMIT_SUCCESS;
 	} else {
 		rc = qdisc_enqueue_root(skb, q);
+		spin_unlock(&q->enqueue_lock);
 		qdisc_run(q);
 	}
 	spin_unlock(root_lock);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 5173c1e..4b5e125 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -538,6 +538,7 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 	sch = (struct Qdisc *) QDISC_ALIGN((unsigned long) p);
 	sch->padded = (char *) sch - (char *) p;
 
+	spin_lock_init(&sch->enqueue_lock);
 	INIT_LIST_HEAD(&sch->list);
 	skb_queue_head_init(&sch->q);
 	sch->ops = ops;
The qdisc layer shows a significant issue when you start transmitting from
multiple CPUs. The issue is that the transmit rate drops significantly, and I
believe it is due to the fact that the spinlock is shared between the 1
dequeue, and n-1 enqueue cpu threads. In order to improve this situation I am
adding one additional lock which will need to be obtained during the enqueue
portion of the path. This essentially allows sch_direct_xmit to jump to
near the head of the line when attempting to obtain the lock after
completing a transmit.

Running the script below I saw an increase from 200K packets per second to
1.07M packets per second as a result of this patch.

for j in `seq 0 15`; do
  for i in `seq 0 7`; do
    netperf -H <ip> -t UDP_STREAM -l 600 -N -T $i -- -m 6 &
  done
done

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 include/net/sch_generic.h |    3 ++-
 net/core/dev.c            |    7 ++++++-
 net/sched/sch_generic.c   |    1 +
 3 files changed, 9 insertions(+), 2 deletions(-)