Message ID | 20080920.225033.261544815.davem@davemloft.net |
---|---|
State | Rejected, archived |
Delegated to: | David Miller |
On Sat, Sep 20, 2008 at 10:50:33PM -0700, David Miller wrote:
>
> Ok, here is the kind of thing I'm suggesting in all of this.
>
> It gets rid of bouncing unnecessarily into __qdisc_run() when
> dev_queue_xmit()'s finally selected TXQ is stopped.
>
> It also gets rid of the dev_requeue_skb() looping case Jarek
> discovered.

Looks good to me. However, I've changed my mind about letting this
thread die :)

Let's go back to the basic requirements. I claim that there are
exactly two ways in which multiple TX queues are useful: SMP scaling
and QOS.

In the first case we stuff different flows into different queues to
reduce contention between CPUs. In the latter case we put packets of
different priorities into different queues in order to prevent a storm
of lower-priority packets from starving higher-priority ones that
arrive later.

Despite the different motivations, these two scenarios have one thing
in common: we can structure them such that there is a one-to-one
correspondence between the qdisc/software queues and the hardware
queues. I know that this isn't currently the case for prio, but I'll
get to that later in the message.

What I'm trying to say is that we shouldn't ever support cases where
a single qdisc empties into multiple TX queues. It just doesn't make
sense. For example, if you were using a qdisc like TBF, multiple
queues buy you absolutely nothing. All they do is give you a longer
queue to stuff packets into, but you can already get that in software.

Why am I saying all this? It's because a lot of the complexity in the
current code comes from supporting the case of one qdisc queue mapping
onto n hardware queues. If we didn't do that, then handling stopped
queues becomes trivial (or at least as easy as it was before).

To put it another way, it makes absolutely no sense to map packets
onto different queues after you've dequeued them from a single qdisc
queue. The mapping by hashing is for SMP scalability only, and if
you've already gone through a single qdisc queue you can stop worrying
about it because it will suck on SMP :)

Going back to the case of prio, I think what we should do is to create
a separate qdisc queue for each band. The qdisc selection should be
done before the packet is queued, just as we do in the TX hashing
case.

Cheers,
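The hash-based mapping of flows to TX queues mentioned above can be sketched in a few lines of standalone C. Everything here is illustrative: `flow_tx_hash` and its mixing constants are invented for this sketch (the kernel's real implementation hashes the skb with jhash via a helper along the lines of `skb_tx_hash()`). The point is only that a flow's 4-tuple deterministically picks one queue, so different CPUs handling different flows rarely contend on the same qdisc/TXQ lock.

```c
#include <stdint.h>

/* Sketch: map a flow (4-tuple) to one of num_tx_queues TX queues.
 * The same flow always hashes to the same queue, so per-flow packet
 * ordering is preserved while distinct flows spread across queues. */
static unsigned int flow_tx_hash(uint32_t saddr, uint32_t daddr,
                                 uint16_t sport, uint16_t dport,
                                 unsigned int num_tx_queues)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);

    /* cheap integer mix, not the kernel's jhash */
    h ^= h >> 16;
    h *= 0x45d9f3bU;
    h ^= h >> 16;

    return h % num_tx_queues;
}
```

Herbert's argument is then that this selection only pays off if it happens at enqueue time, before any shared software queue: funneling all flows through one qdisc first serializes them anyway.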
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 21 Sep 2008 15:38:21 +0900

> Going back to the case of prio, I think what we should do is to
> create a separate qdisc queue for each band. The qdisc selection
> should be done before the packet is queued, just as we do in the
> TX hashing case.

That's a very interesting idea.

This works if you want it at the root, but what if you only wanted to
prio at a leaf? I think that case has value too.

I also tend to disagree with another assertion mentioned earlier, the
one where having a shared qdisc sucks on SMP. It doesn't. The TX
queue lock is held much longer than the qdisc lock.

The ->hard_start_xmit() TXQ lock has to be held while:

1) DMA mapping the SKB
2) building TX descriptors
3) doing at least one PIO to the hardware

Each of these operations can be on the order of a few thousand cycles,
whereas a qdisc dequeue is perhaps a few hundred, maybe on the order
of a thousand, except in very elaborate classful qdisc configs.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
On Sat, Sep 20, 2008 at 10:50:33PM -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sat, 20 Sep 2008 22:35:38 -0700 (PDT)
>
> > Therefore it seems logical that what really needs to happen is that
> > we simply pick some new local special token value for 'ret' so that
> > we can handle that case.  "-1" would probably work fine.
> ...
> > I also think the qdisc_run() test needs to be there.  When the TX
> > queue fills up, we will be doing tons of completely useless work going:
>
> Ok, here is the kind of thing I'm suggesting in all of this.
>
> It gets rid of bouncing unnecessarily into __qdisc_run() when
> dev_queue_xmit()'s finally selected TXQ is stopped.
>
> It also gets rid of the dev_requeue_skb() looping case Jarek
> discovered.
>
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index b786a5b..4082f39 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -90,10 +90,7 @@ extern void __qdisc_run(struct Qdisc *q);
>
>  static inline void qdisc_run(struct Qdisc *q)
>  {
> -	struct netdev_queue *txq = q->dev_queue;
> -
> -	if (!netif_tx_queue_stopped(txq) &&
> -	    !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
> +	if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
>  		__qdisc_run(q);
>  }
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fdfc4b6..4654127 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1809,7 +1809,15 @@ gso:
>  		rc = NET_XMIT_DROP;
>  	} else {
>  		rc = qdisc_enqueue_root(skb, q);
> -		qdisc_run(q);
> +
> +		txq = NULL;
> +		if (rc == NET_XMIT_SUCCESS) {
> +			int map = skb_get_queue_mapping(skb);

Deja vu? How about e.g. blackhole_enqueue()?

> +			txq = netdev_get_tx_queue(dev, map);
> +		}
> +
> +		if (!txq || !netif_tx_queue_stopped(txq))
> +			qdisc_run(q);
>  	}
>  	spin_unlock(root_lock);
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index ec0a083..b6e6926 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -117,7 +117,7 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
>  static inline int qdisc_restart(struct Qdisc *q)
>  {
>  	struct netdev_queue *txq;
> -	int ret = NETDEV_TX_BUSY;
> +	int ret = -2;
>  	struct net_device *dev;
>  	spinlock_t *root_lock;
>  	struct sk_buff *skb;
> @@ -158,7 +158,7 @@ static inline int qdisc_restart(struct Qdisc *q)
>  		if (unlikely (ret != NETDEV_TX_BUSY && net_ratelimit()))
>  			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
>  			       dev->name, ret, q->q.qlen);
> -
> +	case -2:
>  		ret = dev_requeue_skb(skb, q);

Hmm... I really can't see any difference - except getting rid of this
if sometimes.

>  		break;
>  	}

Jarek P.
On Sun, Sep 21, 2008 at 12:03:01AM -0700, David Miller wrote:
>
> This works if you want it at the root, but what if you only wanted to
> prio at a leaf? I think that case has value too.

Good question :)

I think what we should do is to pass some token that represents the
TX queue that's being run down into the dequeue function.

Then each qdisc can decide which child to recursively dequeue based
on that token (or ignore it for non-prio qdiscs such as HTB).

When the token reaches the leaf we have two cases:

1) A prio-like qdisc that has separate queues based on priorities.
   In this case we dequeue the respective queue based on the token.

2) Any other qdisc. We dequeue the first packet that hashes into the
   queue given by the token. Ideally these qdiscs should have
   separate queues already so that this would be trivial.

> I tend to also disagree with another mentioned assertion. The one
> where having a shared qdisc sucks on SMP. It doesn't. The TX queue
> lock is held much longer than the qdisc lock.

Yes, I was exaggerating :)

However, after answering your question above I'm even more convinced
that we should be separating the traffic at the point of enqueueing,
and not after we dequeue it in qdisc_run.

The only reason to do the separation after dequeueing would be to
allow the TX queue selection to change in the meantime. However,
since I see absolutely no reason why we'd need that, it's just so
much simpler to separate them at qdisc_enqueue, and actually have the
same number of software queues as there are hardware queues.

Cheers,
On Tue, Sep 23, 2008 at 02:23:33PM +0800, Herbert Xu wrote:
> On Sun, Sep 21, 2008 at 12:03:01AM -0700, David Miller wrote:
> >
> > This works if you want it at the root, but what if you only wanted to
> > prio at a leaf? I think that case has value too.
>
> Good question :)
>
> I think what we should do is to pass some token that represents
> the TX queue that's being run down into the dequeue function.
>
> Then each qdisc can decide which child to recursively dequeue
> based on that token (or ignore it for non-prio qdiscs such as
> HTB).

I don't think HTB could be considered a non-prio qdisc.

> When the token reaches the leaf we have two cases:
> 1) A prio-like qdisc that has separate queues based on priorities.
>    In this case we dequeue the respective queue based on the token.

As a matter of fact, I can't figure out this idea of a prio at the
root or leaf either. Could you explain at which point you expect the
gain? If it's about the locks, what kind of synchronization would be
used to ensure packets from lower-prio queues (or qdiscs?) aren't
sent to free TX queues while higher-prio ones wait on stopped ones?

Thanks,
Jarek P.
On Wed, Sep 24, 2008 at 07:15:21AM +0000, Jarek Poplawski wrote:
>
> > Then each qdisc can decide which child to recursively dequeue
> > based on that token (or ignore it for non-prio qdiscs such as
> > HTB).
>
> I don't think HTB could be considered a non-prio qdisc.

It is non-prio in the sense that it has other criteria for deciding
which child qdisc to enqueue into.

> As a matter of fact, I can't figure out this idea of a prio at the
> root or leaf either. Could you explain at which point you expect the
> gain? If it's about the locks, what kind of synchronization would be
> used to ensure packets from lower-prio queues (or qdiscs?) aren't
> sent to free TX queues while higher-prio ones wait on stopped ones?

It's very simple really. For a non-leaf prio you determine which
child qdisc to enqueue into using the priority. For a leaf prio you
determine which software queue to enqueue into based on the priority.

To put it another way, what I'm saying is that instead of duplicating
the qdiscs as we do now for pfifo_fast, we should make the leaf
qdiscs duplicate their software queues to match the hardware queues
they're feeding into.

Cheers,
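Herbert's leaf-prio proposal can be sketched as a tiny standalone C model. All names here (`struct prio_leaf`, `band_enqueue`, `prio_leaf_dequeue`, the `txq_token` parameter) are invented for illustration and are not kernel API; the sketch only shows the shape of the idea: the leaf keeps one software queue per hardware TX queue, and dequeue takes a token naming which TXQ is being run, so a stopped TXQ never blocks the other bands.

```c
#include <stddef.h>

#define MAX_BANDS 4

struct pkt { struct pkt *next; };

struct band_queue {
    struct pkt *head, *tail;    /* simple singly linked FIFO */
};

struct prio_leaf {
    /* band i is the software queue feeding hardware TX queue i */
    struct band_queue band[MAX_BANDS];
};

/* classification happens at enqueue time: the caller has already
 * picked the band (from skb priority, in Herbert's scheme) */
static void band_enqueue(struct band_queue *q, struct pkt *p)
{
    p->next = NULL;
    if (q->tail)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
}

/* dequeue only from the band matching the TX-queue token; other
 * bands are untouched even if this one is empty */
static struct pkt *prio_leaf_dequeue(struct prio_leaf *pl, int txq_token)
{
    struct band_queue *q = &pl->band[txq_token];
    struct pkt *p = q->head;

    if (p) {
        q->head = p->next;
        if (!q->head)
            q->tail = NULL;
    }
    return p;
}
```

This is exactly the one-to-one software/hardware queue correspondence from earlier in the thread, applied at a leaf: no packet is ever remapped to a different queue after dequeue.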
On Wed, Sep 24, 2008 at 04:04:27PM +0800, Herbert Xu wrote:
> On Wed, Sep 24, 2008 at 07:15:21AM +0000, Jarek Poplawski wrote:
> >
> > > Then each qdisc can decide which child to recursively dequeue
> > > based on that token (or ignore it for non-prio qdiscs such as
> > > HTB).
> >
> > I don't think HTB could be considered a non-prio qdisc.
>
> It is non-prio in the sense that it has other criteria for deciding
> which child qdisc to enqueue into.
>
> > As a matter of fact, I can't figure out this idea of a prio at the
> > root or leaf either. Could you explain at which point you expect the
> > gain? If it's about the locks, what kind of synchronization would be
> > used to ensure packets from lower-prio queues (or qdiscs?) aren't
> > sent to free TX queues while higher-prio ones wait on stopped ones?
>
> It's very simple really. For a non-leaf prio you determine which
> child qdisc to enqueue into using the priority. For a leaf prio
> you determine which software queue to enqueue into based on the
> priority.

OK, it's too simple then. Could you make this more complex and show
me the gain?

> To put it another way, what I'm saying is that instead of duplicating
> the qdiscs as we do now for pfifo_fast, we should make the leaf
> qdiscs duplicate their software queues to match the hardware queues
> they're feeding into.

It looks like sch_multiq, but you probably mean something else...

Cheers,
Jarek P.
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index b786a5b..4082f39 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -90,10 +90,7 @@ extern void __qdisc_run(struct Qdisc *q);
 
 static inline void qdisc_run(struct Qdisc *q)
 {
-	struct netdev_queue *txq = q->dev_queue;
-
-	if (!netif_tx_queue_stopped(txq) &&
-	    !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
+	if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
 		__qdisc_run(q);
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index fdfc4b6..4654127 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1809,7 +1809,15 @@ gso:
 		rc = NET_XMIT_DROP;
 	} else {
 		rc = qdisc_enqueue_root(skb, q);
-		qdisc_run(q);
+
+		txq = NULL;
+		if (rc == NET_XMIT_SUCCESS) {
+			int map = skb_get_queue_mapping(skb);
+			txq = netdev_get_tx_queue(dev, map);
+		}
+
+		if (!txq || !netif_tx_queue_stopped(txq))
+			qdisc_run(q);
 	}
 	spin_unlock(root_lock);
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ec0a083..b6e6926 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -117,7 +117,7 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 static inline int qdisc_restart(struct Qdisc *q)
 {
 	struct netdev_queue *txq;
-	int ret = NETDEV_TX_BUSY;
+	int ret = -2;
 	struct net_device *dev;
 	spinlock_t *root_lock;
 	struct sk_buff *skb;
@@ -158,7 +158,7 @@ static inline int qdisc_restart(struct Qdisc *q)
 		if (unlikely (ret != NETDEV_TX_BUSY && net_ratelimit()))
 			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
 			       dev->name, ret, q->q.qlen);
-
+	case -2:
 		ret = dev_requeue_skb(skb, q);
 		break;
 	}
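The control flow this patch creates in qdisc_restart() can be modeled in a few lines of standalone C. Everything here is a toy (the names `toy_qdisc_restart`, `TOY_TX_OK`, `TOY_TX_BUSY`, `NOT_XMITTED` are invented); only the -2 sentinel and the fall-through mirror the patch. When the TX queue is stopped, the driver is never called, `ret` keeps its sentinel, the switch skips the "BUG" warning, and the skb is requeued just as in the NETDEV_TX_BUSY case.

```c
#define TOY_TX_OK    0
#define TOY_TX_BUSY  1
#define NOT_XMITTED (-2)    /* sentinel: driver was never called */

/* Toy model of the switch in the patched qdisc_restart(): a stopped
 * TXQ leaves ret at the sentinel, which lands in the same requeue
 * path as TOY_TX_BUSY but without triggering the "BUG" warning. */
static int toy_qdisc_restart(int txq_stopped, int driver_ret,
                             int *requeued, int *warned)
{
    int ret = NOT_XMITTED;

    *requeued = 0;
    *warned = 0;

    if (!txq_stopped)
        ret = driver_ret;    /* stands in for dev_hard_start_xmit() */

    switch (ret) {
    case TOY_TX_OK:
        break;               /* packet is gone, nothing to do */
    default:
        if (ret != TOY_TX_BUSY)
            *warned = 1;     /* the printk(KERN_WARNING "BUG ...") */
        /* fall through */
    case NOT_XMITTED:
        *requeued = 1;       /* dev_requeue_skb() */
        break;
    }
    return ret;
}
```

The `case -2:` label after the warning is what makes the sentinel skip the printk while still sharing the requeue statement, which is the whole point of Jarek's "I really can't see any difference - except getting rid of this if" remark.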