diff mbox

[RFC] Regression in linux 2.6.32 virtio_net seen with vhost-net

Message ID 1260312605.19229.8.camel@w-sridhar.beaverton.ibm.com
State Rejected, archived
Delegated to: David Miller
Headers show

Commit Message

Sridhar Samudrala Dec. 8, 2009, 10:50 p.m. UTC
When testing vhost-net patches with a guest running linux 2.6.32, i am
running into "Unexpected full queue" warning messages from start_xmit() in
virtio_net.c causing a lot of requeues and a drastic reduction in throughput.

With a guest running 2.6.31, i get guest to host throughput around 7000Mb/s,
but it drops to around 3200Mb/s with 2.6.32.

I don't see this problem with usermode virtio in qemu, but i get a thruput of
only 2700Mb/s with both 2.6.31 and 2.6.32.

The following patch fixes this problem by dropping the skb and not requeuing
to qdisc when it cannot be added to ring buffer. With this patch, i see
similar performance as 2.6.31 with vhost-net.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>


Thanks
Sridhar





--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Herbert Xu Dec. 13, 2009, 12:25 p.m. UTC | #1
Sridhar Samudrala <sri@us.ibm.com> wrote:
> When testing vhost-net patches with a guest running linux 2.6.32, i am
> running into "Unexpected full queue" warning messages from start_xmit() in
> virtio_net.c causing a lot of requeues and a drastic reduction in throughput.
> 
> With a guest running 2.6.31, i get guest to host throughput around 7000Mb/s,
> but it drops to around 3200Mb/s with 2.6.32.
> 
> I don't see this problem with usermode virtio in qemu, but i get a thruput of
> only 2700Mb/s with both 2.6.31 and 2.6.32.
> 
> The following patch fixes this problem by dropping the skb and not requeuing
> to qdisc when it cannot be added to ring buffer. With this patch, i see
> similar performance as 2.6.31 with vhost-net.
> 
> Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index b9e002f..307cfd6 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -528,7 +528,11 @@ again:
>                        netif_start_queue(dev);
>                        goto again;
>                }
> -               return NETDEV_TX_BUSY;
> +
> +               /* drop the skb under stress. */ 
> +               vi->dev->stats.tx_dropped++;
> +               kfree_skb(skb);
> +               return NETDEV_TX_OK;
>        }

This patch only hides the real problem.  The issue is that we
should never start the queue unless we can accomodate a full
64KB packet.

In your case, whenever we have the space for a single slot we'd
start the queue.  However, if the head of the queue required more
than one slot it would immediately return BUSY and stop the queue
again.

Your patch simply drops the packet (likely to be TSO) which would
end up upsetting the TCP transmitter.

Please see what a real Ethernet driver (e.g., tg3 or ixgbe) does
for queue management.

Cheers,
Michael S. Tsirkin Dec. 13, 2009, 11:40 p.m. UTC | #2
On Sun, Dec 13, 2009 at 08:25:12PM +0800, Herbert Xu wrote:
> Sridhar Samudrala <sri@us.ibm.com> wrote:
> > When testing vhost-net patches with a guest running linux 2.6.32, i am
> > running into "Unexpected full queue" warning messages from start_xmit() in
> > virtio_net.c causing a lot of requeues and a drastic reduction in throughput.
> > 
> > With a guest running 2.6.31, i get guest to host throughput around 7000Mb/s,
> > but it drops to around 3200Mb/s with 2.6.32.
> > 
> > I don't see this problem with usermode virtio in qemu, but i get a thruput of
> > only 2700Mb/s with both 2.6.31 and 2.6.32.
> > 
> > The following patch fixes this problem by dropping the skb and not requeuing
> > to qdisc when it cannot be added to ring buffer. With this patch, i see
> > similar performance as 2.6.31 with vhost-net.
> > 
> > Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index b9e002f..307cfd6 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -528,7 +528,11 @@ again:
> >                        netif_start_queue(dev);
> >                        goto again;
> >                }
> > -               return NETDEV_TX_BUSY;
> > +
> > +               /* drop the skb under stress. */ 
> > +               vi->dev->stats.tx_dropped++;
> > +               kfree_skb(skb);
> > +               return NETDEV_TX_OK;
> >        }
> 
> This patch only hides the real problem.  The issue is that we
> should never start the queue unless we can accomodate a full
> 64KB packet.
> 
> In your case, whenever we have the space for a single slot we'd
> start the queue.  However, if the head of the queue required more
> than one slot it would immediately return BUSY and stop the queue
> again.
> 
> Your patch simply drops the packet (likely to be TSO) which would
> end up upsetting the TCP transmitter.
> 
> Please see what a real Ethernet driver (e.g., tg3 or ixgbe) does
> for queue management.
> 
> Cheers,

An interesting thing is that
48925e372f04f5e35fec6269127c62b2c71ab794
was supposed to do this already.

> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Dec. 15, 2009, 2:42 p.m. UTC | #3
On Mon, Dec 14, 2009 at 01:40:55AM +0200, Michael S. Tsirkin wrote:
>
> An interesting thing is that
> 48925e372f04f5e35fec6269127c62b2c71ab794
> was supposed to do this already.

But Rusty missed the queue wake call.  That too needs to check
that the required amount of space exists.

Cheers,
Sridhar Samudrala Dec. 15, 2009, 4:26 p.m. UTC | #4
On Tue, 2009-12-15 at 22:42 +0800, Herbert Xu wrote:
> On Mon, Dec 14, 2009 at 01:40:55AM +0200, Michael S. Tsirkin wrote:
> >
> > An interesting thing is that
> > 48925e372f04f5e35fec6269127c62b2c71ab794
> > was supposed to do this already.
> 
> But Rusty missed the queue wake call.  That too needs to check
> that the required amount of space exists.

Yes. I noticied that. skb_xmit_done() doesn't check for MAX_SKB_FRAGS+2
descriptors before doing a netif_wake_queue(). We don't have any easy way 
to get the number of free desciptors in the virtqueue and we may need a 
vring_get_num_free() interface.

Also, can we do free_old_xmit_skbs() to freeup any used descriptors in
the callback(skb_xmit_done) context? I tried this, but hit a panic in
free_old_xmit_skbs(__skb_unlink() trying to remove the skb from the vi->send queue).
Looks like we need to add the skb to the vi->send queue before doing a kick.

Thanks
Sridhar

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin Dec. 15, 2009, 11:32 p.m. UTC | #5
On Tue, Dec 15, 2009 at 10:42:52PM +0800, Herbert Xu wrote:
> On Mon, Dec 14, 2009 at 01:40:55AM +0200, Michael S. Tsirkin wrote:
> >
> > An interesting thing is that
> > 48925e372f04f5e35fec6269127c62b2c71ab794
> > was supposed to do this already.
> 
> But Rusty missed the queue wake call.  That too needs to check
> that the required amount of space exists.

Right. Hmm. So for this to work we'll need to
1. Free skb upon interrupt instead of
   waiting for the next xmit call
2. Add API to query free ring capacity

Rusty, sounds like a good plan?

We could also extend host to delay interrupt
until there is sufficient TX capacity
but of course we also need to support
old hosts as well.


> Cheers,
> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Dec. 16, 2009, 1:21 a.m. UTC | #6
On Tue, Dec 15, 2009 at 08:26:20AM -0800, Sridhar Samudrala wrote:
> 
> Also, can we do free_old_xmit_skbs() to freeup any used descriptors in
> the callback(skb_xmit_done) context? I tried this, but hit a panic in
> free_old_xmit_skbs(__skb_unlink() trying to remove the skb from the vi->send queue).
> Looks like we need to add the skb to the vi->send queue before doing a kick.

No you can't free skbs from hard IRQ context.

Cheers,
Herbert Xu Dec. 16, 2009, 1:58 a.m. UTC | #7
On Wed, Dec 16, 2009 at 01:32:27AM +0200, Michael S. Tsirkin wrote:
> 
> Right. Hmm. So for this to work we'll need to
> 1. Free skb upon interrupt instead of
>    waiting for the next xmit call

No the next xmit call can free the skbs.

> 2. Add API to query free ring capacity

This is all you need.

> We could also extend host to delay interrupt
> until there is sufficient TX capacity
> but of course we also need to support
> old hosts as well.

Yes that what real hardware does too.  They are able to delay
the interrupt until the number of free entries exceeds a minimum
threshold.

Cheers,
Rusty Russell Dec. 16, 2009, 2:41 a.m. UTC | #8
On Sun, 13 Dec 2009 10:55:12 pm Herbert Xu wrote:
> Please see what a real Ethernet driver (e.g., tg3 or ixgbe) does
> for queue management.

Hi Herbert,

   Thanks for the hint.  They seem to use NAPI for xmit cleanup, so that's
what we should do?  I'll try, but such a rewrite doesn't belong in 2.6.32.

Sridhar, if you simply remove the dev_warn() does that restore some
performance?  I'll work on a proper fix for 2.6.33 now.

Thanks,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Dec. 16, 2009, 2:53 a.m. UTC | #9
On Wed, Dec 16, 2009 at 01:11:40PM +1030, Rusty Russell wrote:
> 
>    Thanks for the hint.  They seem to use NAPI for xmit cleanup, so that's
> what we should do?  I'll try, but such a rewrite doesn't belong in 2.6.32.

Well it depends.  Real drivers can't touch the hardware so they're
stuck with whatever the hardware does.  For virtio we do have the
flexibility of modifying the backend.

Having said that, for existing backends that will signal when there
is just a single free entry on the queue something like NAPI could
reduce the overhead associated with the IRQs.

Cheers,
Rusty Russell Dec. 16, 2009, 4:37 a.m. UTC | #10
On Wed, 16 Dec 2009 10:02:27 am Michael S. Tsirkin wrote:
> Right. Hmm. So for this to work we'll need to
> 1. Free skb upon interrupt instead of
>    waiting for the next xmit call
> 2. Add API to query free ring capacity
> 
> Rusty, sounds like a good plan?

Well, the query stuff is not too bad, but I can't completely convince myself
it's race-free.  We don't want to do locking.

A NAPI-style solution seems cleaner, and I'm testing that now.

> We could also extend host to delay interrupt
> until there is sufficient TX capacity
> but of course we also need to support
> old hosts as well.

Xen does this, and I rejected it in favor of simple enable/disable
flags in the original virtio design.  It was probably wrong: while the
guest can enable on a "few remaining" heuristic, it's going to have
latency.  The host can do a more timely decision.

There's nothing stopping the Host from doing this heuristic today, of
course: the DISABLE flag is advisory only.  But let's check the limitations
of the guest-enable approach first?

Thanks,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin Dec. 16, 2009, 10:37 a.m. UTC | #11
On Wed, Dec 16, 2009 at 03:07:53PM +1030, Rusty Russell wrote:
> On Wed, 16 Dec 2009 10:02:27 am Michael S. Tsirkin wrote:
> > Right. Hmm. So for this to work we'll need to
> > 1. Free skb upon interrupt instead of
> >    waiting for the next xmit call
> > 2. Add API to query free ring capacity
> > 
> > Rusty, sounds like a good plan?
> 
> Well, the query stuff is not too bad, but I can't completely convince myself
> it's race-free.  We don't want to do locking.

We do not want to lock TX? For performance?
It might not be a problem: interrupts only start
to be enabled when we are running out of space on TX,
so this is a kind of slow path anyway now.

If necessary, we can also only do this range check
if netif_tx_queue_stopped.

> A NAPI-style solution seems cleaner, and I'm testing that now.

Hmm, as you say separately, this might not be 2.6.32 material though.
Maybe a simply capacity check will be safe enough for 2.6.32.

> > We could also extend host to delay interrupt
> > until there is sufficient TX capacity
> > but of course we also need to support
> > old hosts as well.
> 
> Xen does this, and I rejected it in favor of simple enable/disable
> flags in the original virtio design.  It was probably wrong: while the
> guest can enable on a "few remaining" heuristic, it's going to have
> latency.  The host can do a more timely decision.
> 
> There's nothing stopping the Host from doing this heuristic today, of
> course: the DISABLE flag is advisory only.

Heh, but this might hurt performance on guests that do
assume it's used correctly. A new feature might be cleaner?

>  But let's check the limitations
> of the guest-enable approach first?

Sure.

> Thanks,
> Rusty.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sridhar Samudrala Dec. 16, 2009, 5:42 p.m. UTC | #12
On Wed, 2009-12-16 at 13:11 +1030, Rusty Russell wrote:
> Sridhar, if you simply remove the dev_warn() does that restore some
> performance?  I'll work on a proper fix for 2.6.33 now.

Here is the comparision of a 60sec TCP STREAM test from 2.6.31.6 and 2.6.32 guests
to a 2.6.32 host with vhost-net patches.
Removing the dev_warn() in 2.6.32 doesn't make any significant difference
in throughput. I think the requeues are the main cause for the regression.
Will try your xmit NAPI patch.

linux-2.6.32
------------
$ ./netperf -c -C -H 192.168.122.1 -t TCP_STREAM  -l60
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.1 (192.168.122.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    60.03      3279.98   81.48    77.79    2.035   3.885  
$ tc -s qdisc show dev eth0
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 24708777531 bytes 1469482 pkt (dropped 0, overlimits 0 requeues 283575) 
 rate 0bit 0pps backlog 0b 0p requeues 283575 
$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 54:52:00:35:e3:74 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast   
    97000940   1469585  0       0       0       0      
    TX: bytes  packets  errors  dropped carrier collsns 
    3233946817 1469533  0       0       0       0      

linux-2.6.31.6
--------------
$ ./netperf -c -C -H 192.168.122.1 -t TCP_STREAM -l60
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.1 (192.168.122.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    60.03      8038.48   69.44    71.02    0.708   1.448  
$ tc -s qdisc show dev eth0
qdisc pfifo_fast 0: root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 60378698117 bytes 937388 pkt (dropped 0, overlimits 0 requeues 61) 
 rate 0bit 0pps backlog 0b 0p requeues 61 
$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 54:52:00:35:e3:73 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast   
    54702188   828655   0       0       0       0      
    TX: bytes  packets  errors  dropped carrier collsns 
    248753779  937399   0       7       0       0     


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b9e002f..307cfd6 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -528,7 +528,11 @@  again:
 			netif_start_queue(dev);
 			goto again;
 		}
-		return NETDEV_TX_BUSY;
+
+		/* drop the skb under stress. */ 
+ 		vi->dev->stats.tx_dropped++;
+		kfree_skb(skb);
+		return NETDEV_TX_OK;
 	}
 	vi->svq->vq_ops->kick(vi->svq);