macvtap: Limit packet queue length

Message ID 20100722074431.GA26744@gondor.apana.org.au
State Accepted, archived
Delegated to: David Miller

Commit Message

Herbert Xu July 22, 2010, 7:44 a.m. UTC
On Thu, Jul 22, 2010 at 02:41:57PM +0800, Herbert Xu wrote:
> Hi:
> 
> macvtap: Limit packet queue length

Chris has informed me that he's already tried a similar patch
and it only makes the problem worse :)

The issue is that the macvtap TX queue length defaults to zero.

So here is an updated patch which addresses this:

macvtap: Limit packet queue length

Mark Wagner reported OOM symptoms when sending UDP traffic over
a macvtap link to a kvm receiver.

This appears to be caused by the fact that macvtap packet queues
are unlimited in length.  This means that if the receiver can't
keep up with the rate of flow, then we will hit OOM. Of course
it gets worse if the OOM killer then decides to kill the receiver.

This patch imposes a cap on the packet queue length, in the same
way as the tuntap driver, using the device TX queue length.

Please note that macvtap currently has no way of giving congestion
notification, which means the software device TX queue cannot be
used and packets will always be dropped once the macvtap driver
queue fills up.
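
As a rough user-space illustration of the tail-drop behaviour described above (a sketch only, not the kernel code): `struct pkt`, `struct pkt_queue`, `pkt_enqueue()` and `offer_packets()` are hypothetical names invented for this example, with `max_len` playing the role of the device TX queue length.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel objects; none of these names
 * come from the patch. */
struct pkt {
	struct pkt *next;
};

struct pkt_queue {
	struct pkt *head, *tail;
	unsigned int len;      /* packets currently queued */
	unsigned int max_len;  /* plays the role of dev->tx_queue_len */
};

enum { RX_SUCCESS = 0, RX_DROP = 1 };

/* Tail-drop enqueue: takes ownership of p on every path. */
static int pkt_enqueue(struct pkt_queue *q, struct pkt *p)
{
	if (q->len >= q->max_len) {
		free(p);           /* queue full: drop the newest packet */
		return RX_DROP;
	}
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
	q->len++;
	return RX_SUCCESS;
}

/* Offer n packets to a queue capped at max_len with no reader draining
 * it; return how many packets were dropped. */
static unsigned int offer_packets(unsigned int max_len, unsigned int n)
{
	struct pkt_queue q = { NULL, NULL, 0, max_len };
	unsigned int dropped = 0;

	for (unsigned int i = 0; i < n; i++) {
		struct pkt *p = malloc(sizeof(*p));
		if (!p)
			break;
		if (pkt_enqueue(&q, p) == RX_DROP)
			dropped++;
	}
	/* Free what was queued so the sketch itself does not leak. */
	while (q.head) {
		struct pkt *p = q.head;
		q.head = p->next;
		free(p);
	}
	return dropped;
}
```

Under these assumptions, `offer_packets(500, 600)` reports 100 drops, while a cap of 0 drops every packet offered, which is consistent with the limit only being useful once the default queue length is non-zero.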

This shouldn't be a great problem for the scenario where macvtap
is used to feed a kvm receiver, as the traffic is most likely
external in origin so congestion notification can't be applied
anyway.

Of course, if anybody decides to complain about guest-to-guest
UDP packet loss down the track, then we may have to revisit this.

Incidentally, this patch also fixes a real memory leak when
macvtap_get_queue fails.
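
The leak fix follows the usual single-exit ownership pattern: once the forwarding function has been handed a buffer, every return path must either pass it on or free it. A minimal user-space sketch of that pattern, assuming made-up names (`buf_alloc()`, `buf_free()`, `lookup_queue()`, `forward()`) and a live-buffer counter that exists only to make a leak observable:

```c
#include <stdlib.h>

static int live_bufs;  /* buffers allocated but not yet freed */

struct buf {
	int len;
};

static struct buf *buf_alloc(void)
{
	struct buf *b = malloc(sizeof(*b));
	if (b)
		live_bufs++;
	return b;
}

static void buf_free(struct buf *b)
{
	free(b);
	live_bufs--;
}

/* Hypothetical lookup that can fail, standing in for macvtap_get_queue(). */
static int *lookup_queue(int have_queue)
{
	static int queue_slot;
	return have_queue ? &queue_slot : NULL;
}

/* The callee owns b: every exit either consumes it or frees it. */
static int forward(struct buf *b, int have_queue)
{
	int *q = lookup_queue(have_queue);

	if (!q)
		goto drop;     /* returning an error here without freeing b would leak it */

	/* A real driver would queue b for a reader; the sketch consumes it. */
	(*q)++;
	buf_free(b);
	return 0;

drop:
	buf_free(b);
	return 1;
}

/* Number of buffers still live after a forward() whose lookup failed. */
static int leaked_after_failed_forward(void)
{
	int before = live_bufs;
	struct buf *b = buf_alloc();

	if (!b)
		return 0;
	forward(b, 0);
	return live_bufs - before;
}
```

With the `goto drop` path in place, `leaked_after_failed_forward()` returns 0; an early `return` on lookup failure without the free would leave one buffer live per failed call.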

Chris Wright noticed that for this patch to work, we need a
non-zero TX queue length.  This patch includes his work to change
the default macvtap TX queue length to 500.

Reported-by: Mark Wagner <mwagner@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>


Cheers,

Comments

Chris Wright July 22, 2010, 7:47 a.m. UTC | #1
* Herbert Xu (herbert@gondor.hengli.com.au) wrote:
> On Thu, Jul 22, 2010 at 02:41:57PM +0800, Herbert Xu wrote:
> > Hi:
> > 
> > macvtap: Limit packet queue length
> 
> Chris has informed me that he's already tried a similar patch
> and it only makes the problem worse :)
> 
> The issue is that the macvtap TX queue length defaults to zero.
> 
> So here is an updated patch which addresses this:
> 
> macvtap: Limit packet queue length
> 
> Mark Wagner reported OOM symptoms when sending UDP traffic over
> a macvtap link to a kvm receiver.
> 
> This appears to be caused by the fact that macvtap packet queues
> are unlimited in length.  This means that if the receiver can't
> keep up with the rate of flow, then we will hit OOM. Of course
> it gets worse if the OOM killer then decides to kill the receiver.
> 
> This patch imposes a cap on the packet queue length, in the same
> way as the tuntap driver, using the device TX queue length.
> 
> Please note that macvtap currently has no way of giving congestion
> notification, which means the software device TX queue cannot be
> used and packets will always be dropped once the macvtap driver
> queue fills up.
> 
> This shouldn't be a great problem for the scenario where macvtap
> is used to feed a kvm receiver, as the traffic is most likely
> external in origin so congestion notification can't be applied
> anyway.
> 
> Of course, if anybody decides to complain about guest-to-guest
> UDP packet loss down the track, then we may have to revisit this.
> 
> Incidentally, this patch also fixes a real memory leak when
> macvtap_get_queue fails.
> 
> Chris Wright noticed that for this patch to work, we need a
> non-zero TX queue length.  This patch includes his work to change
> the default macvtap TX queue length to 500.
> 
> Reported-by: Mark Wagner <mwagner@redhat.com>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Acked-by: Chris Wright <chrisw@sous-sol.org>

Thanks Herbert.
-chris
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann July 22, 2010, 9:30 a.m. UTC | #2
On Thursday 22 July 2010, Herbert Xu wrote:
> On Thu, Jul 22, 2010 at 02:41:57PM +0800, Herbert Xu wrote:
> > Hi:
> > 
> > macvtap: Limit packet queue length
> 
> Chris has informed me that he's already tried a similar patch
> and it only makes the problem worse :)
> 
> The issue is that the macvtap TX queue length defaults to zero.
> 
> So here is an updated patch which addresses this:

Thanks for debugging this and coming up with a solution.
I'm currently travelling, so I can't easily work on it myself.

> Please note that macvtap currently has no way of giving congestion
> notification, which means the software device TX queue cannot be
> used and packets will always be dropped once the macvtap driver
> queue fills up.

This is something I was planning to look into for doing it right,
and then I forgot about it. I'll investigate what could be done
to get proper flow control once I get back to the office.

> Chris Wright noticed that for this patch to work, we need a
> non-zero TX queue length.  This patch includes his work to change
> the default macvtap TX queue length to 500.

The only problem I can see with this patch is that it depends on
the *TX* queue length. Unlike tun/tap, from the macvtap network
interface's point of view this is the receive queue, not the
transmit queue.

In the TX direction, we really don't queue, since we simply forward
to the lowerdev tx queue, so exposing the tunable to user space
as the tx queue length is a little bit awkward, as well as inconsistent
between macvtap and macvlan.

> Reported-by: Mark Wagner <mwagner@redhat.com>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

As long as we're missing a better solution,

Acked-by: Arnd Bergmann <arnd@arndb.de>
Shirley Ma July 22, 2010, 4:05 p.m. UTC | #3
On Thu, 2010-07-22 at 11:30 +0200, Arnd Bergmann wrote:
> In the TX direction, we really don't queue, since we simply forward
> to the lowerdev tx queue, so exposing the tunable to user space
> as the tx queue length is a little bit awkward, as well as inconsistent
> between macvtap and macvlan.

Maybe we can use the lowerdev backlog queue size here, from the
receive path?

thanks
Shirley

Herbert Xu July 22, 2010, 4:08 p.m. UTC | #4
On Thu, Jul 22, 2010 at 09:05:26AM -0700, Shirley Ma wrote:
> On Thu, 2010-07-22 at 11:30 +0200, Arnd Bergmann wrote:
> > In the TX direction, we really don't queue, since we simply forward
> > to the lowerdev tx queue, so exposing the tunable to user space
> > as the tx queue length is a little bit awkward, as well as inconsistent
> > between macvtap and macvlan.
> 
> Maybe we can use the lowerdev backlog queue size here, from the
> receive path?

No, you may wish to set different queue lengths for different
macvtap devices over the same lowerdev so you definitely don't
want to use any lowerdev parameter for this.

I honestly don't see any problems with using tx_queue_len since
it isn't used by anything else for macvtap.

Cheers,
Shirley Ma July 22, 2010, 6:42 p.m. UTC | #5
Then it would be better to add a comment here indicating that the
macvtap tx_queue_len actually controls sk_receive_queue in the
receive path.

Thanks
Shirley

David Miller July 22, 2010, 8:09 p.m. UTC | #6
From: Arnd Bergmann <arnd@arndb.de>
Date: Thu, 22 Jul 2010 11:30:53 +0200

> On Thursday 22 July 2010, Herbert Xu wrote:
>> Reported-by: Mark Wagner <mwagner@redhat.com>
>> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> As long as we're missing a better solution,
> 
> Acked-by: Arnd Bergmann <arnd@arndb.de>

Applied, thanks everyone.

Patch

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 87e8d4c..f15fe2c 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -499,7 +499,7 @@  static const struct net_device_ops macvlan_netdev_ops = {
 	.ndo_validate_addr	= eth_validate_addr,
 };
 
-static void macvlan_setup(struct net_device *dev)
+void macvlan_common_setup(struct net_device *dev)
 {
 	ether_setup(dev);
 
@@ -508,6 +508,12 @@  static void macvlan_setup(struct net_device *dev)
 	dev->destructor		= free_netdev;
 	dev->header_ops		= &macvlan_hard_header_ops,
 	dev->ethtool_ops	= &macvlan_ethtool_ops;
+}
+EXPORT_SYMBOL_GPL(macvlan_common_setup);
+
+static void macvlan_setup(struct net_device *dev)
+{
+	macvlan_common_setup(dev);
 	dev->tx_queue_len	= 0;
 }
 
@@ -705,7 +711,6 @@  int macvlan_link_register(struct rtnl_link_ops *ops)
 	/* common fields */
 	ops->priv_size		= sizeof(struct macvlan_dev);
 	ops->get_tx_queues	= macvlan_get_tx_queues;
-	ops->setup		= macvlan_setup;
 	ops->validate		= macvlan_validate;
 	ops->maxtype		= IFLA_MACVLAN_MAX;
 	ops->policy		= macvlan_policy;
@@ -719,6 +724,7 @@  EXPORT_SYMBOL_GPL(macvlan_link_register);
 
 static struct rtnl_link_ops macvlan_link_ops = {
 	.kind		= "macvlan",
+	.setup		= macvlan_setup,
 	.newlink	= macvlan_newlink,
 	.dellink	= macvlan_dellink,
 };
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index a8a94e2..ff02b83 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -180,11 +180,18 @@  static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
 {
 	struct macvtap_queue *q = macvtap_get_queue(dev, skb);
 	if (!q)
-		return -ENOLINK;
+		goto drop;
+
+	if (skb_queue_len(&q->sk.sk_receive_queue) >= dev->tx_queue_len)
+		goto drop;
 
 	skb_queue_tail(&q->sk.sk_receive_queue, skb);
 	wake_up_interruptible_poll(sk_sleep(&q->sk), POLLIN | POLLRDNORM | POLLRDBAND);
-	return 0;
+	return NET_RX_SUCCESS;
+
+drop:
+	kfree_skb(skb);
+	return NET_RX_DROP;
 }
 
 /*
@@ -235,8 +242,15 @@  static void macvtap_dellink(struct net_device *dev,
 	macvlan_dellink(dev, head);
 }
 
+static void macvtap_setup(struct net_device *dev)
+{
+	macvlan_common_setup(dev);
+	dev->tx_queue_len = TUN_READQ_SIZE;
+}
+
 static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
 	.kind		= "macvtap",
+	.setup		= macvtap_setup,
 	.newlink	= macvtap_newlink,
 	.dellink	= macvtap_dellink,
 };
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 9ea047a..1ffaeff 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -67,6 +67,8 @@  static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
 	}
 }
 
+extern void macvlan_common_setup(struct net_device *dev);
+
 extern int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
 				  struct nlattr *tb[], struct nlattr *data[],
 				  int (*receive)(struct sk_buff *skb),