Patchwork [take,2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race

login
register
mail settings
Submitter Jarek Poplawski
Date Sept. 20, 2008, 11:48 p.m.
Message ID <20080920234843.GA2531@ami.dom.local>
Download mbox | patch
Permalink /patch/761/
State Rejected
Delegated to: David Miller
Headers show

Comments

Jarek Poplawski - Sept. 20, 2008, 11:48 p.m.
On Sat, Sep 20, 2008 at 12:21:37AM -0700, David Miller wrote:
...
> Let's look at what actually matters for cpu utilization.  These
> __qdisc_run() things are invoked in two situations where we might
> block on the hw queue being stopped:
> 
> 1) When feeding packets into the qdisc in dev_queue_xmit().
> 
>    Guess what?  We _know_ the queue this packet is going to
>    hit.
> 
>    The only new thing we can possible trigger and be interested
>    in at this specific point is if _this_ packet can be sent at
>    this time.
> 
>    And we can check that queue mapping after the qdisc_enqueue_root()
>    call, so that multiq aware qdiscs can have made their changes.
> 
> 2) When waking up a queue.  And here we should schedule the qdisc_run
>    _unconditionally_.
> 
>    If the queue was full, it is extremely likely that new packets
>    are bound for that device queue.  There is no real savings to
>    be had by doing this peek/requeue/dequeue stuff.
> 
> The cpu utilization savings exist for case #1 only, and we can
> implement the bypass logic _perfectly_ as described above.
> 
> For #2 there is nothing to check, just do it and see what comes
> out of the qdisc.

Right, unless __netif_schedule() wasn't done when waking up. I've
thought about this because of another thread/patch around this
problem, and got misled by dev_requeue_skb() scheduling. Now, I think
this could be the main reason for this high load. Anyway, if we want
to skip this check for #2 I think something like the patch below is
needed.

> I would suggest adding an skb pointer argument to qdisc_run().
> If it's NULL, unconditionally schedule __qdisc_run().  Else,
> only schedule if the TX queue indicated by skb_queue_mapping()
> is not stopped.
> 
> dev_queue_xmit() will use the "pass the skb" case, but only if
> qdisc_enqueue_root()'s return value doesn't indicate that there
> is a potential drop.  On potential drop, we'll pass NULL to
> make sure we don't potentially reference a free'd SKB.
> 
> The other case in net_tx_action() can always pass NULL to qdisc_run().

I'm not convinced this #1 is useful for us: this could be an skb #1000
in a queue; the tx status could change many times before this packet
would be #1; why worry? This adds additional checks on the fast path
for something which is unlikely even if this skb would be #1, but for
any later skbs it's only a guess. IMHO, if we can't check for the next
skb to be xmitted it's better to skip this test entirely (which seems
to be safe with the patch below).

Jarek P.

--------------->
pkt_sched: dev_requeue_skb: Don't schedule if a queue is stopped

Doing __netif_schedule() while requeuing because of a stopped tx queue
and skipping such a test in qdisc_run() can cause a requeuing loop with
high cpu use until the queue is awaken.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

 net/sched/sch_generic.c |   23 +++++++++++++++--------
 1 files changed, 15 insertions(+), 8 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Sept. 21, 2008, 5:35 a.m.
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sun, 21 Sep 2008 01:48:43 +0200

> On Sat, Sep 20, 2008 at 12:21:37AM -0700, David Miller wrote:
> ...
> > Let's look at what actually matters for cpu utilization.  These
> > __qdisc_run() things are invoked in two situations where we might
> > block on the hw queue being stopped:
> > 
> > 1) When feeding packets into the qdisc in dev_queue_xmit().
 ...
> > 2) When waking up a queue.  And here we should schedule the qdisc_run
> >    _unconditionally_.
 ...
> > The cpu utilization savings exist for case #1 only, and we can
> > implement the bypass logic _perfectly_ as described above.
> > 
> > For #2 there is nothing to check, just do it and see what comes
> > out of the qdisc.
> 
> Right, unless __netif_schedule() wasn't done when waking up. I've
> thought about this because of another thread/patch around this
> problem, and got misled by dev_requeue_skb() scheduling. Now, I think
> this could be the main reason for this high load. Anyway, if we want
> to skip this check for #2 I think something like the patch below is
> needed.

Hmmm, looking at your patch....

It's only doing something new when the driver returns NETDEV_TX_BUSY
from ->hard_start_xmit().

That _never_ happens in any sane driver.  That case is for buggy
devices that do not maintain their TX queue state properly.  And
in fact it's a case for which I advocate we just drop the packet
instead of requeueing.  :-)

Oh I see, you're concerned about that cases where qdisc_restart() ends
up using the default initialization of the 'ret' variable.

Really, for the case where the driver actually returns NETDEV_TX_BUSY
we _do_ want to unconditionally __netif_schedule(), since the device
doesn't maintain it's queue state in the normal way.

Therefore it seems logical that what really needs to happen is that
we simply pick some new local special token value for 'ret' so that
we can handle that case.  "-1" would probably work fine.

So I'm dropping your patch.

I also think the qdisc_run() test needs to be there.  When the TX
queue fills up, we will doing tons of completely useless work going:

1) ->dequeue
2) qdisc unlock
3) TXQ lock
4) test state
5) TXQ unlock
6) qdisc lock
7) ->requeue

for EVERY SINGLE packet that is generated towards that device.

That has to be expensive, and I am still very much convinced that
this was the original regression cause that made me put that TXQ
state test back into qdisc_run().
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski - Sept. 21, 2008, 9:57 a.m.
On Sat, Sep 20, 2008 at 10:35:38PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Sun, 21 Sep 2008 01:48:43 +0200
> 
> > On Sat, Sep 20, 2008 at 12:21:37AM -0700, David Miller wrote:
> > ...
> > > Let's look at what actually matters for cpu utilization.  These
> > > __qdisc_run() things are invoked in two situations where we might
> > > block on the hw queue being stopped:
> > > 
> > > 1) When feeding packets into the qdisc in dev_queue_xmit().
>  ...
> > > 2) When waking up a queue.  And here we should schedule the qdisc_run
> > >    _unconditionally_.
>  ...
> > > The cpu utilization savings exist for case #1 only, and we can
> > > implement the bypass logic _perfectly_ as described above.
> > > 
> > > For #2 there is nothing to check, just do it and see what comes
> > > out of the qdisc.
> > 
> > Right, unless __netif_schedule() wasn't done when waking up. I've
> > thought about this because of another thread/patch around this
> > problem, and got misled by dev_requeue_skb() scheduling. Now, I think
> > this could be the main reason for this high load. Anyway, if we want
> > to skip this check for #2 I think something like the patch below is
> > needed.
> 
> Hmmm, looking at your patch....
> 
> It's only doing something new when the driver returns NETDEV_TX_BUSY
> from ->hard_start_xmit().
> 
> That _never_ happens in any sane driver.  That case is for buggy
> devices that do not maintain their TX queue state properly.  And
> in fact it's a case for which I advocate we just drop the packet
> instead of requeueing.  :-)

OK, then let's do it! Why I can't see this in your new patch?

> 
> Oh I see, you're concerned about that cases where qdisc_restart() ends
> up using the default initialization of the 'ret' variable.

Yes, this is my main concern.

> 
> Really, for the case where the driver actually returns NETDEV_TX_BUSY
> we _do_ want to unconditionally __netif_schedule(), since the device
> doesn't maintain it's queue state in the normal way.

So, do you advocate both to drop the packet and unconditionally
__netif_schedule()?!

> 
> Therefore it seems logical that what really needs to happen is that
> we simply pick some new local special token value for 'ret' so that
> we can handle that case.  "-1" would probably work fine.
> 
> So I'm dropping your patch.
> 
> I also think the qdisc_run() test needs to be there.  When the TX
> queue fills up, we will doing tons of completely useless work going:
> 
> 1) ->dequeue
> 2) qdisc unlock
> 3) TXQ lock
> 4) test state
> 5) TXQ unlock
> 6) qdisc lock
> 7) ->requeue
> 
> for EVERY SINGLE packet that is generated towards that device.
> 
> That has to be expensive,

I agree this useless work should be avoided, but only with a reliable
(and not too expensive) test. Your test might be done for the last
packet in the queue, while all the previous packets (and especially
the first one) have a different state of the queue. This should work
well for uniqueue devs and multiqueues with dedicated qdiscs, but is
doubtful for multiqueues with one qdisc, where it actually should be
most needed, because of potentially complex multiclass configs with
this new problem of blocking at the head-of-line (Alexander's main
concern).

BTW, since this problem is strongly conected with the requeuing
policy, I wonder why you seemingly lost interest in this. I tried to
advocate for your simple, one level requeuing, but also Herbert's
peek, and Alexander's early detection, after some polish(!), should
make this initial test meaningless.

> and I am still very much convinced that
> this was the original regression cause that made me put that TXQ
> state test back into qdisc_run().

I doubt this: I've just looked at this Andrew Gallatin's report, and
there is really a lot of net_tx_action, __netif_schedule, and guess
what: pfifo_fast_requeue in this oprofile...

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Sept. 21, 2008, 10:18 a.m.
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sun, 21 Sep 2008 11:57:06 +0200

> On Sat, Sep 20, 2008 at 10:35:38PM -0700, David Miller wrote:
> > That _never_ happens in any sane driver.  That case is for buggy
> > devices that do not maintain their TX queue state properly.  And
> > in fact it's a case for which I advocate we just drop the packet
> > instead of requeueing.  :-)
> 
> OK, then let's do it! Why I can't see this in your new patch?

I'm trying to address one thing at a time.  I really want to
also encourage an audit of the drivers that trigger that condition,
and I fear if I put the packet drop in there it might not happen :)

> BTW, since this problem is strongly conected with the requeuing
> policy, I wonder why you seemingly lost interest in this. I tried to
> advocate for your simple, one level requeuing, but also Herbert's
> peek, and Alexander's early detection, after some polish(!), should
> make this initial test meaningless.

Yes, thanks for reminding me about the the multiq qdisc head of line
blocking thing.

I really don't like the requeue/peek patches, because they resulted in
so much code duplication in the CBQ and other classful qdiscs.

Alexander's patch has similar code duplication issues.

Since I've seen the code duplication happen twice, I begin to suspect
we're attacking the implementation (not the idea) from the wrong
angle.

It might make review easier if we first attack the classful qdiscs and
restructure their internal implementation into seperate "pick" and a
"remove" operations.  Of course, initially it'll just be that
->dequeue is implemented as pick+remove.

On a similar note I think all of the ->requeue() uses can die
trivially except for the netem usage.

> > and I am still very much convinced that
> > this was the original regression cause that made me put that TXQ
> > state test back into qdisc_run().
> 
> I doubt this: I've just looked at this Andrew Gallatin's report, and
> there is really a lot of net_tx_action, __netif_schedule, and guess
> what: pfifo_fast_requeue in this oprofile...

I see.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ec0a083..bae2eb8 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -42,14 +42,17 @@  static inline int qdisc_qlen(struct Qdisc *q)
 	return q->q.qlen;
 }
 
-static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
+static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q,
+				  bool stopped)
 {
 	if (unlikely(skb->next))
 		q->gso_skb = skb;
 	else
 		q->ops->requeue(skb, q);
 
-	__netif_schedule(q);
+	if (!stopped)
+		__netif_schedule(q);
+
 	return 0;
 }
 
@@ -89,7 +92,7 @@  static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 		 * some time.
 		 */
 		__get_cpu_var(netdev_rx_stat).cpu_collision++;
-		ret = dev_requeue_skb(skb, q);
+		ret = dev_requeue_skb(skb, q, false);
 	}
 
 	return ret;
@@ -121,6 +124,7 @@  static inline int qdisc_restart(struct Qdisc *q)
 	struct net_device *dev;
 	spinlock_t *root_lock;
 	struct sk_buff *skb;
+	bool stopped;
 
 	/* Dequeue packet */
 	if (unlikely((skb = dequeue_skb(q)) == NULL))
@@ -135,9 +139,13 @@  static inline int qdisc_restart(struct Qdisc *q)
 	txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
 
 	HARD_TX_LOCK(dev, txq, smp_processor_id());
-	if (!netif_tx_queue_stopped(txq) &&
-	    !netif_tx_queue_frozen(txq))
+	if (!netif_tx_queue_stopped(txq) && !netif_tx_queue_frozen(txq)) {
 		ret = dev_hard_start_xmit(skb, dev, txq);
+		stopped = netif_tx_queue_stopped(txq) ||
+			  netif_tx_queue_frozen(txq);
+	} else {
+		stopped = true;
+	}
 	HARD_TX_UNLOCK(dev, txq);
 
 	spin_lock(root_lock);
@@ -159,12 +167,11 @@  static inline int qdisc_restart(struct Qdisc *q)
 			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
 			       dev->name, ret, q->q.qlen);
 
-		ret = dev_requeue_skb(skb, q);
+		ret = dev_requeue_skb(skb, q, stopped);
 		break;
 	}
 
-	if (ret && (netif_tx_queue_stopped(txq) ||
-		    netif_tx_queue_frozen(txq)))
+	if (ret && stopped)
 		ret = 0;
 
 	return ret;