| Message ID | 20080920234843.GA2531@ami.dom.local |
|---|---|
| State | Rejected, archived |
| Delegated to: | David Miller |
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sun, 21 Sep 2008 01:48:43 +0200

> On Sat, Sep 20, 2008 at 12:21:37AM -0700, David Miller wrote:
> ...
> > Let's look at what actually matters for cpu utilization. These
> > __qdisc_run() things are invoked in two situations where we might
> > block on the hw queue being stopped:
> >
> > 1) When feeding packets into the qdisc in dev_queue_xmit().
...
> > 2) When waking up a queue. And here we should schedule the qdisc_run
> > _unconditionally_.
...
> > The cpu utilization savings exist for case #1 only, and we can
> > implement the bypass logic _perfectly_ as described above.
> >
> > For #2 there is nothing to check, just do it and see what comes
> > out of the qdisc.
>
> Right, unless __netif_schedule() wasn't done when waking up. I've
> thought about this because of another thread/patch around this
> problem, and got misled by dev_requeue_skb() scheduling. Now, I think
> this could be the main reason for this high load. Anyway, if we want
> to skip this check for #2 I think something like the patch below is
> needed.

Hmmm, looking at your patch....

It's only doing something new when the driver returns NETDEV_TX_BUSY
from ->hard_start_xmit().

That _never_ happens in any sane driver. That case is for buggy
devices that do not maintain their TX queue state properly. And in
fact it's a case for which I advocate we just drop the packet instead
of requeueing. :-)

Oh I see, you're concerned about those cases where qdisc_restart()
ends up using the default initialization of the 'ret' variable.

Really, for the case where the driver actually returns NETDEV_TX_BUSY
we _do_ want to unconditionally __netif_schedule(), since the device
doesn't maintain its queue state in the normal way.

Therefore it seems logical that what really needs to happen is that
we simply pick some new local special token value for 'ret' so that
we can handle that case. "-1" would probably work fine.

So I'm dropping your patch.
I also think the qdisc_run() test needs to be there. When the TX
queue fills up, we will be doing tons of completely useless work
going:

1) ->dequeue
2) qdisc unlock
3) TXQ lock
4) test state
5) TXQ unlock
6) qdisc lock
7) ->requeue

for EVERY SINGLE packet that is generated towards that device.

That has to be expensive, and I am still very much convinced that
this was the original regression cause that made me put that TXQ
state test back into qdisc_run().

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
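The per-packet churn described above can be illustrated with a toy user-space C model. This is a hedged sketch with invented names (`model_restart`, `model_run`, `wasted_ops`), not the kernel code: it only counts the per-packet operations to show what the up-front TXQ state test in qdisc_run() is meant to save.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the TX queue and qdisc state (names invented). */
struct model_txq { bool stopped; };
struct model_qdisc { int qlen; int wasted_ops; };

/* One qdisc_restart()-like pass; returns 1 if a packet went out.
 * On a stopped queue it still pays for steps 1-7 listed above. */
static int model_restart(struct model_qdisc *q, struct model_txq *txq)
{
    if (q->qlen == 0)
        return 0;
    q->qlen--;                    /* 1) ->dequeue */
    q->wasted_ops += 5;           /* 2-6) unlock/lock/test/unlock/lock */
    if (txq->stopped) {
        q->qlen++;                /* 7) ->requeue */
        return 0;
    }
    return 1;
}

/* qdisc_run()-like loop, with or without the cheap up-front test. */
static void model_run(struct model_qdisc *q, struct model_txq *txq,
                      bool test_txq_state_first)
{
    if (test_txq_state_first && txq->stopped)
        return;                   /* bypass: no per-packet work at all */
    while (model_restart(q, txq))
        ;
}
```

With the queue stopped, every run without the test burns a full dequeue/lock-juggling/requeue cycle per packet fed toward the device; with the test, the stopped case costs a single flag check.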
On Sat, Sep 20, 2008 at 10:35:38PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Sun, 21 Sep 2008 01:48:43 +0200
>
> > On Sat, Sep 20, 2008 at 12:21:37AM -0700, David Miller wrote:
> > ...
> > > Let's look at what actually matters for cpu utilization. These
> > > __qdisc_run() things are invoked in two situations where we might
> > > block on the hw queue being stopped:
> > >
> > > 1) When feeding packets into the qdisc in dev_queue_xmit().
> ...
> > > 2) When waking up a queue. And here we should schedule the qdisc_run
> > > _unconditionally_.
> ...
> > > The cpu utilization savings exist for case #1 only, and we can
> > > implement the bypass logic _perfectly_ as described above.
> > >
> > > For #2 there is nothing to check, just do it and see what comes
> > > out of the qdisc.
> >
> > Right, unless __netif_schedule() wasn't done when waking up. I've
> > thought about this because of another thread/patch around this
> > problem, and got misled by dev_requeue_skb() scheduling. Now, I think
> > this could be the main reason for this high load. Anyway, if we want
> > to skip this check for #2 I think something like the patch below is
> > needed.
>
> Hmmm, looking at your patch....
>
> It's only doing something new when the driver returns NETDEV_TX_BUSY
> from ->hard_start_xmit().
>
> That _never_ happens in any sane driver. That case is for buggy
> devices that do not maintain their TX queue state properly. And in
> fact it's a case for which I advocate we just drop the packet instead
> of requeueing. :-)

OK, then let's do it! Why can't I see this in your new patch?

> Oh I see, you're concerned about those cases where qdisc_restart()
> ends up using the default initialization of the 'ret' variable.

Yes, this is my main concern.

> Really, for the case where the driver actually returns NETDEV_TX_BUSY
> we _do_ want to unconditionally __netif_schedule(), since the device
> doesn't maintain its queue state in the normal way.
So, do you advocate both dropping the packet and unconditionally
calling __netif_schedule()?!

> Therefore it seems logical that what really needs to happen is that
> we simply pick some new local special token value for 'ret' so that
> we can handle that case. "-1" would probably work fine.
>
> So I'm dropping your patch.
>
> I also think the qdisc_run() test needs to be there. When the TX
> queue fills up, we will be doing tons of completely useless work
> going:
>
> 1) ->dequeue
> 2) qdisc unlock
> 3) TXQ lock
> 4) test state
> 5) TXQ unlock
> 6) qdisc lock
> 7) ->requeue
>
> for EVERY SINGLE packet that is generated towards that device.
>
> That has to be expensive,

I agree this useless work should be avoided, but only with a reliable
(and not too expensive) test. Your test might be done for the last
packet in the queue, while all the previous packets (and especially
the first one) saw a different state of the queue. This should work
well for single-queue devices and for multiqueues with dedicated
qdiscs, but is doubtful for multiqueues with one qdisc, where it
actually should be most needed, because of potentially complex
multiclass configs with this new problem of head-of-line blocking
(Alexander's main concern).

BTW, since this problem is strongly connected with the requeuing
policy, I wonder why you seemingly lost interest in this. I tried to
advocate for your simple, one-level requeuing, but also Herbert's
peek, and Alexander's early detection, after some polish(!), should
make this initial test meaningless.

> and I am still very much convinced that
> this was the original regression cause that made me put that TXQ
> state test back into qdisc_run().

I doubt this: I've just looked at Andrew Gallatin's report, and there
is really a lot of net_tx_action, __netif_schedule, and guess what:
pfifo_fast_requeue in this oprofile...

Jarek P.
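The reliability objection for the shared-qdisc multiqueue case can be made concrete with a toy model (all names here are hypothetical, not the kernel API). The bypass test consults only the txq that the *last* dequeued packet mapped to, which says nothing about the txq the next head-of-queue packet will map to:

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NTXQ 2

/* A device with several TX queues fed by a single qdisc (toy model). */
struct model_dev { bool stopped[MODEL_NTXQ]; };

/* The qdisc_run()-style bypass test: look only at the txq that the
 * previously dequeued packet was mapped to. */
static bool model_would_bypass(const struct model_dev *dev, int last_txq)
{
    return dev->stopped[last_txq];
}

/* What actually matters: the state of the txq the *next* packet maps to. */
static bool model_can_xmit(const struct model_dev *dev, int next_txq)
{
    return !dev->stopped[next_txq];
}
```

If txq 0 is stopped but txq 1 is running, and the last packet mapped to txq 0 while the next maps to txq 1, the test skips qdisc_run() even though the next packet could go out — exactly the head-of-line blocking worry raised above.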
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sun, 21 Sep 2008 11:57:06 +0200

> On Sat, Sep 20, 2008 at 10:35:38PM -0700, David Miller wrote:
> > That _never_ happens in any sane driver. That case is for buggy
> > devices that do not maintain their TX queue state properly. And in
> > fact it's a case for which I advocate we just drop the packet
> > instead of requeueing. :-)
>
> OK, then let's do it! Why can't I see this in your new patch?

I'm trying to address one thing at a time. I really want to also
encourage an audit of the drivers that trigger that condition, and I
fear if I put the packet drop in there it might not happen :)

> BTW, since this problem is strongly connected with the requeuing
> policy, I wonder why you seemingly lost interest in this. I tried to
> advocate for your simple, one-level requeuing, but also Herbert's
> peek, and Alexander's early detection, after some polish(!), should
> make this initial test meaningless.

Yes, thanks for reminding me about the multiq qdisc head-of-line
blocking thing.

I really don't like the requeue/peek patches, because they resulted
in so much code duplication in the CBQ and other classful qdiscs.
Alexander's patch has similar code duplication issues.

Since I've seen the code duplication happen twice, I begin to suspect
we're attacking the implementation (not the idea) from the wrong
angle.

It might make review easier if we first attack the classful qdiscs
and restructure their internal implementation into separate "pick"
and "remove" operations. Of course, initially it'll just be that
->dequeue is implemented as pick+remove.

On a similar note, I think all of the ->requeue() uses can die
trivially except for the netem usage.

> > and I am still very much convinced that
> > this was the original regression cause that made me put that TXQ
> > state test back into qdisc_run().
>
> I doubt this: I've just looked at Andrew Gallatin's report, and there
> is really a lot of net_tx_action, __netif_schedule, and guess what:
> pfifo_fast_requeue in this oprofile...

I see.
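The "pick"/"remove" restructuring proposed above can be sketched in user-space C (toy names, not the kernel API): ->dequeue becomes pick followed by remove, and since a packet refused by a stopped queue was never actually taken out, there is nothing left for ->requeue to do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A toy FIFO standing in for a classless qdisc (hypothetical names). */
struct toy_fifo { int pkts[8]; int head, tail; };

/* "pick": look at the head packet without removing it. */
static int *toy_pick(struct toy_fifo *q)
{
    return q->head == q->tail ? NULL : &q->pkts[q->head];
}

/* "remove": commit the pick once the driver has accepted the packet. */
static void toy_remove(struct toy_fifo *q)
{
    q->head++;
}

/* dequeue == pick + remove; a stopped TX queue never commits the
 * remove, so no requeue path is needed at all. */
static bool toy_xmit_one(struct toy_fifo *q, bool txq_stopped)
{
    int *pkt = toy_pick(q);

    if (pkt == NULL || txq_stopped)
        return false;   /* nothing was removed, nothing to put back */
    toy_remove(q);
    return true;
}
```

The design point: the packet only leaves the queue once transmission is certain, which sidesteps the requeue-ordering and rescheduling questions discussed in this thread.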
```diff
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ec0a083..bae2eb8 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -42,14 +42,17 @@ static inline int qdisc_qlen(struct Qdisc *q)
 	return q->q.qlen;
 }
 
-static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
+static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q,
+				  bool stopped)
 {
 	if (unlikely(skb->next))
 		q->gso_skb = skb;
 	else
 		q->ops->requeue(skb, q);
 
-	__netif_schedule(q);
+	if (!stopped)
+		__netif_schedule(q);
+
 	return 0;
 }
 
@@ -89,7 +92,7 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 		 * some time.
 		 */
 		__get_cpu_var(netdev_rx_stat).cpu_collision++;
-		ret = dev_requeue_skb(skb, q);
+		ret = dev_requeue_skb(skb, q, false);
 	}
 
 	return ret;
@@ -121,6 +124,7 @@ static inline int qdisc_restart(struct Qdisc *q)
 	struct net_device *dev;
 	spinlock_t *root_lock;
 	struct sk_buff *skb;
+	bool stopped;
 
 	/* Dequeue packet */
 	if (unlikely((skb = dequeue_skb(q)) == NULL))
@@ -135,9 +139,13 @@ static inline int qdisc_restart(struct Qdisc *q)
 	txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
 
 	HARD_TX_LOCK(dev, txq, smp_processor_id());
-	if (!netif_tx_queue_stopped(txq) &&
-	    !netif_tx_queue_frozen(txq))
+	if (!netif_tx_queue_stopped(txq) && !netif_tx_queue_frozen(txq)) {
 		ret = dev_hard_start_xmit(skb, dev, txq);
+		stopped = netif_tx_queue_stopped(txq) ||
+			  netif_tx_queue_frozen(txq);
+	} else {
+		stopped = true;
+	}
 	HARD_TX_UNLOCK(dev, txq);
 
 	spin_lock(root_lock);
@@ -159,12 +167,11 @@ static inline int qdisc_restart(struct Qdisc *q)
 		printk(KERN_WARNING "BUG %s code %d qlen %d\n",
 		       dev->name, ret, q->q.qlen);
 
-		ret = dev_requeue_skb(skb, q);
+		ret = dev_requeue_skb(skb, q, stopped);
 		break;
 	}
 
-	if (ret && (netif_tx_queue_stopped(txq) ||
-		    netif_tx_queue_frozen(txq)))
+	if (ret && stopped)
 		ret = 0;
 
 	return ret;
```
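The core idea of this (rejected) patch — requeue the packet but skip __netif_schedule() when the TX queue is stopped, on the theory that the wake-up path will reschedule it — can be modeled in a few lines of user-space C (hypothetical names, not the kernel functions):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy qdisc state: queued packets plus a count of reschedules. */
struct toy_q { int qlen; int schedules; };

static void toy_netif_schedule(struct toy_q *q)
{
    q->schedules++;
}

/* Models the patched dev_requeue_skb(): put the packet back, but only
 * reschedule the qdisc if the TX queue was not stopped; a stopped
 * queue is expected to be rescheduled when it wakes up. */
static int toy_requeue(struct toy_q *q, bool stopped)
{
    q->qlen++;                  /* put the skb back on the qdisc */
    if (!stopped)
        toy_netif_schedule(q);  /* only reschedule a running queue */
    return 0;
}
```

This shows the behavioral change being debated: in the stopped case the requeue no longer triggers an immediate reschedule, which is precisely why it only matters when a driver returns NETDEV_TX_BUSY or the queue stops under the transmit.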