Patchwork [take,2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race

login
register
mail settings
Submitter David Miller
Date Sept. 21, 2008, 5:50 a.m.
Message ID <20080920.225033.261544815.davem@davemloft.net>
Download mbox | patch
Permalink /patch/773/
State Rejected
Delegated to: David Miller
Headers show

Comments

David Miller - Sept. 21, 2008, 5:50 a.m.
From: David Miller <davem@davemloft.net>
Date: Sat, 20 Sep 2008 22:35:38 -0700 (PDT)

> Therefore it seems logical that what really needs to happen is that
> we simply pick some new local special token value for 'ret' so that
> we can handle that case.  "-1" would probably work fine.
 ...
> I also think the qdisc_run() test needs to be there.  When the TX
> queue fills up, we will doing tons of completely useless work going:

Ok, here is the kind of thing I'm suggesting in all of this.

It gets rid of bouncing unnecessarily into __qdisc_run() when
dev_queue_xmit()'s finally selected TXQ is stopped.

It also gets rid of the dev_requeue_skb() looping case Jarek
discovered.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu - Sept. 21, 2008, 6:38 a.m.
On Sat, Sep 20, 2008 at 10:50:33PM -0700, David Miller wrote:
>
> Ok, here is the kind of thing I'm suggesting in all of this.
> 
> It gets rid of bouncing unnecessarily into __qdisc_run() when
> dev_queue_xmit()'s finally selected TXQ is stopped.
> 
> It also gets rid of the dev_requeue_skb() looping case Jarek
> discovered.

Looks good to me.

However, I've changed my mind about letting this thread die :)

Let's go back to the basic requirements.  I claim that there
are exactly two different ways for which multiple TX queues are
useful:  SMP scaling and QOS.  In the first case we stuff different
flows into different queues to reduce contention between CPUs.
In the latter case we put packets of different priorities into
different queues in order to prevent a storm of lower priority
packets from starving higher priority ones that arrive later.

Despite the different motivations, these two scenarios have
one thing in common, we can structure it such that there is
a one-to-one correspondence between the qdisc/software queues
and the hardware queues.  I know that this isn't currently the
case for prio but I'll get to that later in the message.

What I'm trying to say is that we shouldn't ever support cases
where a single qdisc empties into multiple TX queues.  It just
doesn't make sense.

For example, if you were using a qdisc like TBF, multiple queues
buy you absolutely nothing.  All it does is give you a longer queue
to stuff packets into but you can already get that in software.

Why am I saying all this? It's because a lot of the complexity
in the current code comes from supporting the case of one qdisc
queue mapping onto n hardware queues.  If we didn't do that then
handling stopped queues becomes trivial (or at least as easy as
it was before).

Put it another way, it makes absolutely no sense to map packets
onto different queues after you've dequeued them from a single
qdisc queue.  The mapping by hashing is for SMP scalability only
and if you've already gone through a single qdisc queue you can
stop worrying about it because it will suck on SMP :)

Going back to the case of prio, I think what we should do is to
create a separate qdisc queue for each band.  The qdisc selection
should be done before the packet is queued, just as we do in the
TX hashing case.

Cheers,
David Miller - Sept. 21, 2008, 7:03 a.m.
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 21 Sep 2008 15:38:21 +0900

> Going back to the case of prio, I think what we should do is to
> create a separate qdisc queue for each band.  The qdisc selection
> should be done before the packet is queued, just as we do in the
> TX hashing case.

That's a very interesting idea.

This works if you want it at the root, but what if you only wanted to
prio at a leaf?  I think that case has value too.

I tend to also disagree with another mentioned assertion.  The one
where having a shared qdisc sucks on SMP.  It doesn't.  The TX queue
lock is held much longer than the qdisc lock.

The ->hard_start_xmit() TXQ lock has to be held while:

1) DMA mapping the SKB
2) building TX descriptors
3) doing at least one PIO to the hardware

These operations, each, can be on the order of few thousands of
cycles.

Whereas a qdisc dequeue is perhaps a few hundred, maybe on the order
of a thousand, except in very elaborate classful qdisc configs.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski - Sept. 21, 2008, 3:25 p.m.
On Sat, Sep 20, 2008 at 10:50:33PM -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sat, 20 Sep 2008 22:35:38 -0700 (PDT)
> 
> > Therefore it seems logical that what really needs to happen is that
> > we simply pick some new local special token value for 'ret' so that
> > we can handle that case.  "-1" would probably work fine.
>  ...
> > I also think the qdisc_run() test needs to be there.  When the TX
> > queue fills up, we will doing tons of completely useless work going:
> 
> Ok, here is the kind of thing I'm suggesting in all of this.
> 
> It gets rid of bouncing unnecessarily into __qdisc_run() when
> dev_queue_xmit()'s finally selected TXQ is stopped.
> 
> It also gets rid of the dev_requeue_skb() looping case Jarek
> discovered.
> 
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index b786a5b..4082f39 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -90,10 +90,7 @@ extern void __qdisc_run(struct Qdisc *q);
>  
>  static inline void qdisc_run(struct Qdisc *q)
>  {
> -	struct netdev_queue *txq = q->dev_queue;
> -
> -	if (!netif_tx_queue_stopped(txq) &&
> -	    !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
> +	if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
>  		__qdisc_run(q);
>  }
>  
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fdfc4b6..4654127 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1809,7 +1809,15 @@ gso:
>  			rc = NET_XMIT_DROP;
>  		} else {
>  			rc = qdisc_enqueue_root(skb, q);
> -			qdisc_run(q);
> +
> +			txq = NULL;
> +			if (rc == NET_XMIT_SUCCESS) {
> +				int map = skb_get_queue_mapping(skb);

Deja vu? How about e.g. blackhole_enqueue()?

> +				txq = netdev_get_tx_queue(dev, map);
> +			}
> +
> +			if (!txq || !netif_tx_queue_stopped(txq))
> +				qdisc_run(q);
>  		}
>  		spin_unlock(root_lock);
>  
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index ec0a083..b6e6926 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -117,7 +117,7 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
>  static inline int qdisc_restart(struct Qdisc *q)
>  {
>  	struct netdev_queue *txq;
> -	int ret = NETDEV_TX_BUSY;
> +	int ret = -2;
>  	struct net_device *dev;
>  	spinlock_t *root_lock;
>  	struct sk_buff *skb;
> @@ -158,7 +158,7 @@ static inline int qdisc_restart(struct Qdisc *q)
>  		if (unlikely (ret != NETDEV_TX_BUSY && net_ratelimit()))
>  			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
>  			       dev->name, ret, q->q.qlen);
> -
> +	case -2:
>  		ret = dev_requeue_skb(skb, q);

Hmm... I really can't see any difference - except getting rid of this
if sometimes.

>  		break;
>  	}

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu - Sept. 23, 2008, 6:23 a.m.
On Sun, Sep 21, 2008 at 12:03:01AM -0700, David Miller wrote:
>
> This works if you want it at the root, but what if you only wanted to
> prio at a leaf?  I think that case has value too.

Good question :)

I think what we should do is to pass some token that rerepsents
the TX queue that's being run down into the dequeue function.

Then each qdisc can decide which child to recursively dequeue
based on that token (or ignore it for non-prio qdiscs such as
HTB).  When the token reaches the leaf then we have two cases:

1) A prio-like qdisc that has separate queues based on priorities.
In this case we dequeue the respective queue based on the token.

2) Any other qdisc.  We dequeue the first packet that hashes
into the queue given by the token.  Ideally these qdiscs should
have separate queues already so that this would be trivial.

> I tend to also disagree with another mentioned assertion.  The one
> where having a shared qdisc sucks on SMP.  It doesn't.  The TX queue
> lock is held much longer than the qdisc lock.

Yes I was exaggerating :)

However, after answering your question above I'm even more convinced
that we should be separating the traffic at the point of enqueueing,
and not after we dequeue it in qdisc_run.

The only reason to do the separation after dequeueing would be to
allow the TX queue selction to change in the time being.  However,
since I see absolutely no reason why we'd need that, it's just so
much simpler to separate them at qdisc_enqueue, and actually have
the same number of software queues as there are hardware queues.

Cheers,
Jarek Poplawski - Sept. 24, 2008, 7:15 a.m.
On Tue, Sep 23, 2008 at 02:23:33PM +0800, Herbert Xu wrote:
> On Sun, Sep 21, 2008 at 12:03:01AM -0700, David Miller wrote:
> >
> > This works if you want it at the root, but what if you only wanted to
> > prio at a leaf?  I think that case has value too.
> 
> Good question :)
> 
> I think what we should do is to pass some token that rerepsents
> the TX queue that's being run down into the dequeue function.
> 
> Then each qdisc can decide which child to recursively dequeue
> based on that token (or ignore it for non-prio qdiscs such as
> HTB).

I don't think HTB could be considered as a non-prio qdisc.

> When the token reaches the leaf then we have two cases:
> 1) A prio-like qdisc that has separate queues based on priorities.
> In this case we dequeue the respective queue based on the token.

As matter of fact I can't figure out this idea of a prio at the root
or leaf either. Could you explain in which point do you expect the
gain? If it's about the locks, what kind of synchronization would be
used to assure packets from lower prio queues (or qdiscs?)  aren't
sent to free tx queues, while higher prio wait on stopped ones?

Thanks,
Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu - Sept. 24, 2008, 8:04 a.m.
On Wed, Sep 24, 2008 at 07:15:21AM +0000, Jarek Poplawski wrote:
>
> > Then each qdisc can decide which child to recursively dequeue
> > based on that token (or ignore it for non-prio qdiscs such as
> > HTB).
> 
> I don't think HTB could be considered as a non-prio qdisc.

It is non-prio in the sense that it has other criteria for deciding
which child qdisc to enqueue into.

> As matter of fact I can't figure out this idea of a prio at the root
> or leaf either. Could you explain in which point do you expect the
> gain? If it's about the locks, what kind of synchronization would be
> used to assure packets from lower prio queues (or qdiscs?)  aren't
> sent to free tx queues, while higher prio wait on stopped ones?

It's very simple really.  For a non-leaf prio you determine which
child qdisc to enqueue into using the priority.  For a leaf prio
you determine which software queue to enqueue into based on the
priority.

To put it another way, what I'm saying is that instead of duplicating
the qdiscs as we do now for pfifo_fast, we should make the leaf
qdiscs duplicate its software queues to match the hardware queues
it's feeding into.

Cheers,
Jarek Poplawski - Sept. 24, 2008, 8:28 a.m.
On Wed, Sep 24, 2008 at 04:04:27PM +0800, Herbert Xu wrote:
> On Wed, Sep 24, 2008 at 07:15:21AM +0000, Jarek Poplawski wrote:
> >
> > > Then each qdisc can decide which child to recursively dequeue
> > > based on that token (or ignore it for non-prio qdiscs such as
> > > HTB).
> > 
> > I don't think HTB could be considered as a non-prio qdisc.
> 
> It is non-prio in the sense that it has other criteria for deciding
> which child qdisc to enqueue into.
> 
> > As matter of fact I can't figure out this idea of a prio at the root
> > or leaf either. Could you explain in which point do you expect the
> > gain? If it's about the locks, what kind of synchronization would be
> > used to assure packets from lower prio queues (or qdiscs?)  aren't
> > sent to free tx queues, while higher prio wait on stopped ones?
> 
> It's very simple really.  For a non-leaf prio you determine which
> child qdisc to enqueue into using the priority.  For a leaf prio
> you determine which software queue to enqueue into based on the
> priority.

OK, it's too simple then. Could you make this more complex and
show me the gain.

> To put it another way, what I'm saying is that instead of duplicating
> the qdiscs as we do now for pfifo_fast, we should make the leaf
> qdiscs duplicate its software queues to match the hardware queues
> it's feeding into.

It looks like sch_multiq, but you probably mean something else...

Cheers,
Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index b786a5b..4082f39 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -90,10 +90,7 @@  extern void __qdisc_run(struct Qdisc *q);
 
 static inline void qdisc_run(struct Qdisc *q)
 {
-	struct netdev_queue *txq = q->dev_queue;
-
-	if (!netif_tx_queue_stopped(txq) &&
-	    !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
+	if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
 		__qdisc_run(q);
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index fdfc4b6..4654127 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1809,7 +1809,15 @@  gso:
 			rc = NET_XMIT_DROP;
 		} else {
 			rc = qdisc_enqueue_root(skb, q);
-			qdisc_run(q);
+
+			txq = NULL;
+			if (rc == NET_XMIT_SUCCESS) {
+				int map = skb_get_queue_mapping(skb);
+				txq = netdev_get_tx_queue(dev, map);
+			}
+
+			if (!txq || !netif_tx_queue_stopped(txq))
+				qdisc_run(q);
 		}
 		spin_unlock(root_lock);
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ec0a083..b6e6926 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -117,7 +117,7 @@  static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 static inline int qdisc_restart(struct Qdisc *q)
 {
 	struct netdev_queue *txq;
-	int ret = NETDEV_TX_BUSY;
+	int ret = -2;
 	struct net_device *dev;
 	spinlock_t *root_lock;
 	struct sk_buff *skb;
@@ -158,7 +158,7 @@  static inline int qdisc_restart(struct Qdisc *q)
 		if (unlikely (ret != NETDEV_TX_BUSY && net_ratelimit()))
 			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
 			       dev->name, ret, q->q.qlen);
-
+	case -2:
 		ret = dev_requeue_skb(skb, q);
 		break;
 	}