
Network performance with small packets

Message ID 1296523838.30191.39.camel@sridhar.beaverton.ibm.com
State RFC, archived
Delegated to: David Miller

Commit Message

Sridhar Samudrala Feb. 1, 2011, 1:30 a.m. UTC
On Mon, 2011-01-31 at 18:24 -0600, Steve Dobbelstein wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 01/28/2011 06:16:16 AM:
> 
> > OK, so thinking about it more, maybe the issue is this:
> > tx becomes full. We process one request and interrupt the guest,
> > then it adds one request and the queue is full again.
> >
> > Maybe the following will help it stabilize?
> > By itself it does nothing, but if you set
> > all the parameters to a huge value we will
> > only interrupt when we see an empty ring.
> > Which might be too much: pls try other values
> > in the middle: e.g. make bufs half the ring,
> > or bytes some small value, or packets some
> > small value etc.
> >
> > Warning: completely untested.
> >
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index aac05bc..6769cdc 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -32,6 +32,13 @@
> >   * Using this limit prevents one virtqueue from starving others. */
> >  #define VHOST_NET_WEIGHT 0x80000
> >
> > +int tx_bytes_coalesce = 0;
> > +module_param(tx_bytes_coalesce, int, 0644);
> > +int tx_bufs_coalesce = 0;
> > +module_param(tx_bufs_coalesce, int, 0644);
> > +int tx_packets_coalesce = 0;
> > +module_param(tx_packets_coalesce, int, 0644);
> > +
> >  enum {
> >     VHOST_NET_VQ_RX = 0,
> >     VHOST_NET_VQ_TX = 1,
> > @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
> >     int err, wmem;
> >     size_t hdr_size;
> >     struct socket *sock;
> > +   int bytes_coalesced = 0;
> > +   int bufs_coalesced = 0;
> > +   int packets_coalesced = 0;
> >
> >     /* TODO: check that we are running from vhost_worker? */
> >     sock = rcu_dereference_check(vq->private_data, 1);
> > @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
> >        if (err != len)
> >           pr_debug("Truncated TX packet: "
> >               " len %d != %zd\n", err, len);
> > -      vhost_add_used_and_signal(&net->dev, vq, head, 0);
> >        total_len += len;
> > +      packets_coalesced += 1;
> > +      bytes_coalesced += len;
> > +      bufs_coalesced += in;
> 
> Should this instead be:
>       bufs_coalesced += out;
> 
> Perusing the code I see that earlier there is a check to see if "in" is not
> zero, and, if so, error out of the loop.  After the check, "in" is not
> touched until it is added to bufs_coalesced, effectively not changing
> bufs_coalesced, meaning bufs_coalesced will never trigger the conditions
> below.

Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.

I tried a simpler version of this patch, without any tunables, by
delaying the signaling until we come out of the for loop.
It definitely reduced the number of vmexits significantly for the
small-message guest-to-host stream test, and the throughput went up a little.
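The simpler version (quoted in full in the replies below):

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b3ca10..5f9fae9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		vhost_add_used(vq, head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
@@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
 
+	if (total_len > 0)
+		vhost_signal(&net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }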


> 
> Or am I missing something?
> 
> > +      if (unlikely(packets_coalesced > tx_packets_coalesce ||
> > +              bytes_coalesced > tx_bytes_coalesce ||
> > +              bufs_coalesced > tx_bufs_coalesce))
> > +         vhost_add_used_and_signal(&net->dev, vq, head, 0);
> > +      else
> > +         vhost_add_used(vq, head, 0);
> >        if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> >           vhost_poll_queue(&vq->poll);
> >           break;
> >        }
> >     }
> >
> > +   if (likely(packets_coalesced > tx_packets_coalesce ||
> > +         bytes_coalesced > tx_bytes_coalesce ||
> > +         bufs_coalesced > tx_bufs_coalesce))
> > +      vhost_signal(&net->dev, vq);
> >     mutex_unlock(&vq->mutex);
> >  }

It is possible that we can miss signaling the guest even after
processing a few pkts, if we don't hit any of these conditions.

> >
> 
> Steve D.
> 


Comments

Michael S. Tsirkin Feb. 1, 2011, 5:56 a.m. UTC | #1
On Mon, Jan 31, 2011 at 05:30:38PM -0800, Sridhar Samudrala wrote:
> On Mon, 2011-01-31 at 18:24 -0600, Steve Dobbelstein wrote:
> > "Michael S. Tsirkin" <mst@redhat.com> wrote on 01/28/2011 06:16:16 AM:
> > 
> > > OK, so thinking about it more, maybe the issue is this:
> > > tx becomes full. We process one request and interrupt the guest,
> > > then it adds one request and the queue is full again.
> > >
> > > Maybe the following will help it stabilize?
> > > By itself it does nothing, but if you set
> > > all the parameters to a huge value we will
> > > only interrupt when we see an empty ring.
> > > Which might be too much: pls try other values
> > > in the middle: e.g. make bufs half the ring,
> > > or bytes some small value, or packets some
> > > small value etc.
> > >
> > > Warning: completely untested.
> > >
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > index aac05bc..6769cdc 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -32,6 +32,13 @@
> > >   * Using this limit prevents one virtqueue from starving others. */
> > >  #define VHOST_NET_WEIGHT 0x80000
> > >
> > > +int tx_bytes_coalesce = 0;
> > > +module_param(tx_bytes_coalesce, int, 0644);
> > > +int tx_bufs_coalesce = 0;
> > > +module_param(tx_bufs_coalesce, int, 0644);
> > > +int tx_packets_coalesce = 0;
> > > +module_param(tx_packets_coalesce, int, 0644);
> > > +
> > >  enum {
> > >     VHOST_NET_VQ_RX = 0,
> > >     VHOST_NET_VQ_TX = 1,
> > > @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
> > >     int err, wmem;
> > >     size_t hdr_size;
> > >     struct socket *sock;
> > > +   int bytes_coalesced = 0;
> > > +   int bufs_coalesced = 0;
> > > +   int packets_coalesced = 0;
> > >
> > >     /* TODO: check that we are running from vhost_worker? */
> > >     sock = rcu_dereference_check(vq->private_data, 1);
> > > @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
> > >        if (err != len)
> > >           pr_debug("Truncated TX packet: "
> > >               " len %d != %zd\n", err, len);
> > > -      vhost_add_used_and_signal(&net->dev, vq, head, 0);
> > >        total_len += len;
> > > +      packets_coalesced += 1;
> > > +      bytes_coalesced += len;
> > > +      bufs_coalesced += in;
> > 
> > Should this instead be:
> >       bufs_coalesced += out;
> > 
> > Perusing the code I see that earlier there is a check to see if "in" is not
> > zero, and, if so, error out of the loop.  After the check, "in" is not
> > touched until it is added to bufs_coalesced, effectively not changing
> > bufs_coalesced, meaning bufs_coalesced will never trigger the conditions
> > below.
> 
> Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.
> 
> I tried a simpler version of this patch without any tunables by
> delaying the signaling until we come out of the for loop.
> It definitely reduced the number of vmexits significantly for small message
> guest to host stream test and the throughput went up a little.
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 9b3ca10..5f9fae9 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
>  		if (err != len)
>  			pr_debug("Truncated TX packet: "
>  				 " len %d != %zd\n", err, len);
> -		vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +		vhost_add_used(vq, head, 0);
>  		total_len += len;
>  		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>  			vhost_poll_queue(&vq->poll);
> @@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
>  		}
>  	}
>  
> +	if (total_len > 0)
> +		vhost_signal(&net->dev, vq);
>  	mutex_unlock(&vq->mutex);
>  }
>  
> 
> > 
> > Or am I missing something?
> > 
> > > +      if (unlikely(packets_coalesced > tx_packets_coalesce ||
> > > +              bytes_coalesced > tx_bytes_coalesce ||
> > > +              bufs_coalesced > tx_bufs_coalesce))
> > > +         vhost_add_used_and_signal(&net->dev, vq, head, 0);
> > > +      else
> > > +         vhost_add_used(vq, head, 0);
> > >        if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> > >           vhost_poll_queue(&vq->poll);
> > >           break;
> > >        }
> > >     }
> > >
> > > +   if (likely(packets_coalesced > tx_packets_coalesce ||
> > > +         bytes_coalesced > tx_bytes_coalesce ||
> > > +         bufs_coalesced > tx_bufs_coalesce))
> > > +      vhost_signal(&net->dev, vq);
> > >     mutex_unlock(&vq->mutex);
> > >  }
> 
> It is possible that we can miss signaling the guest even after
> processing a few pkts, if we don't hit any of these conditions.

Yes. It really should be
   if (likely(packets_coalesced && bytes_coalesced && bufs_coalesced))
      vhost_signal(&net->dev, vq);

> > >
> > 
> > Steve D.
> > 
Shirley Ma Feb. 1, 2011, 9:09 p.m. UTC | #2
On Mon, 2011-01-31 at 17:30 -0800, Sridhar Samudrala wrote:
> Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.
> 
> I tried a simpler version of this patch without any tunables by
> delaying the signaling until we come out of the for loop.
> It definitely reduced the number of vmexits significantly for small
> message
> guest to host stream test and the throughput went up a little.
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 9b3ca10..5f9fae9 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
>                 if (err != len)
>                         pr_debug("Truncated TX packet: "
>                                  " len %d != %zd\n", err, len);
> -               vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +               vhost_add_used(vq, head, 0);
>                 total_len += len;
>                 if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>                         vhost_poll_queue(&vq->poll);
> @@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
>                 }
>         }
> 
> +       if (total_len > 0)
> +               vhost_signal(&net->dev, vq);
>         mutex_unlock(&vq->mutex);
>  }

Reducing the signaling will reduce the CPU utilization by reducing VM
exits. 

The small-message BW is a problem we have seen with a fast guest and a slow
vhost; even when I increased VHOST_NET_WEIGHT several times, it didn't help
BW that much. For large message sizes, vhost is able to process all packets
in time. I have played around with the guest/host code, and so far I only
see a huge BW improvement by dropping packets on the guest side.

Thanks
Shirley

Michael S. Tsirkin Feb. 1, 2011, 9:24 p.m. UTC | #3
On Tue, Feb 01, 2011 at 01:09:45PM -0800, Shirley Ma wrote:
> On Mon, 2011-01-31 at 17:30 -0800, Sridhar Samudrala wrote:
> > Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.
> > 
> > I tried a simpler version of this patch without any tunables by
> > delaying the signaling until we come out of the for loop.
> > It definitely reduced the number of vmexits significantly for small
> > message
> > guest to host stream test and the throughput went up a little.
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 9b3ca10..5f9fae9 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
> >                 if (err != len)
> >                         pr_debug("Truncated TX packet: "
> >                                  " len %d != %zd\n", err, len);
> > -               vhost_add_used_and_signal(&net->dev, vq, head, 0);
> > +               vhost_add_used(vq, head, 0);
> >                 total_len += len;
> >                 if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> >                         vhost_poll_queue(&vq->poll);
> > @@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
> >                 }
> >         }
> > 
> > +       if (total_len > 0)
> > +               vhost_signal(&net->dev, vq);
> >         mutex_unlock(&vq->mutex);
> >  }
> 
> Reducing the signaling will reduce the CPU utilization by reducing VM
> exits. 
> 
> The small message BW is a problem we have seen faster guest/slow vhost,
> even I increased VHOST_NET_WEIGHT times, it didn't help that much for
> BW. For large message size, vhost is able to process all packets on
> time. I played around with guest/host codes, I only see huge BW
> improvement by dropping packets on guest side so far.
> 
> Thanks
> Shirley


My theory is that the issue is not signalling.
Rather, our queue fills up, then host handles
one packet and sends an interrupt, and we
immediately wake the queue. So the vq
once it gets full, stays full.

If you try my patch with bufs threshold set to e.g.
half the vq, what we will do is send interrupt after we have processed
half the vq.  So host has half the vq to go, and guest has half the vq
to fill.

See?
Shirley Ma Feb. 1, 2011, 9:32 p.m. UTC | #4
On Tue, 2011-02-01 at 23:24 +0200, Michael S. Tsirkin wrote:
> My theory is that the issue is not signalling.
> Rather, our queue fills up, then host handles
> one packet and sends an interrupt, and we
> immediately wake the queue. So the vq
> once it gets full, stays full.

From the printk debugging output, it might not be exactly the case. The
ring gets full, runs a bit, then gets full again, runs a bit, then full...

> If you try my patch with bufs threshold set to e.g.
> half the vq, what we will do is send interrupt after we have processed
> half the vq.  So host has half the vq to go, and guest has half the vq
> to fill.
> 
> See?

I am cleaning up my set up to run your patch ...

Shirley


Michael S. Tsirkin Feb. 1, 2011, 9:42 p.m. UTC | #5
On Tue, Feb 01, 2011 at 01:32:35PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 23:24 +0200, Michael S. Tsirkin wrote:
> > My theory is that the issue is not signalling.
> > Rather, our queue fills up, then host handles
> > one packet and sends an interrupt, and we
> > immediately wake the queue. So the vq
> > once it gets full, stays full.
> 
> >From the printk debugging output, it might not be exactly the case. The
> ring gets full, run a bit, then gets full, then run a bit, then full...

Yes, but does it get even half empty in between?

> > If you try my patch with bufs threshold set to e.g.
> > half the vq, what we will do is send interrupt after we have processed
> > half the vq.  So host has half the vq to go, and guest has half the vq
> > to fill.
> > 
> > See?
> 
> I am cleaning up my set up to run your patch ...
> 
> Shirley
> 
Shirley Ma Feb. 1, 2011, 9:53 p.m. UTC | #6
On Tue, 2011-02-01 at 23:42 +0200, Michael S. Tsirkin wrote:
> On Tue, Feb 01, 2011 at 01:32:35PM -0800, Shirley Ma wrote:
> > On Tue, 2011-02-01 at 23:24 +0200, Michael S. Tsirkin wrote:
> > > My theory is that the issue is not signalling.
> > > Rather, our queue fills up, then host handles
> > > one packet and sends an interrupt, and we
> > > immediately wake the queue. So the vq
> > > once it gets full, stays full.
> > 
> > >From the printk debugging output, it might not be exactly the case.
> The
> > ring gets full, run a bit, then gets full, then run a bit, then
> full...
> 
> Yes, but does it get even half empty in between?

Sometimes, but most of the time it does not get half empty in between.
printk slows down the traffic, though, so it's not accurate. I think your
patch will improve the performance if it signals the guest when half of the
ring is empty.

But you manage the signaling by TX bytes; I would like to change it to
half of the ring size instead. Is that OK?
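
A minimal sketch of that descriptor-based variant (an illustration, not a
posted patch), reusing the bufs_coalesced counter from the patch above and
resetting it after each signal:

	/* Signal once half the ring has been consumed since the last signal. */
	if (unlikely(bufs_coalesced >= vq->num / 2)) {
		vhost_add_used_and_signal(&net->dev, vq, head, 0);
		bufs_coalesced = 0;
	} else
		vhost_add_used(vq, head, 0);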

Shirley



Michael S. Tsirkin Feb. 1, 2011, 9:56 p.m. UTC | #7
On Tue, Feb 01, 2011 at 01:53:05PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 23:42 +0200, Michael S. Tsirkin wrote:
> > On Tue, Feb 01, 2011 at 01:32:35PM -0800, Shirley Ma wrote:
> > > On Tue, 2011-02-01 at 23:24 +0200, Michael S. Tsirkin wrote:
> > > > My theory is that the issue is not signalling.
> > > > Rather, our queue fills up, then host handles
> > > > one packet and sends an interrupt, and we
> > > > immediately wake the queue. So the vq
> > > > once it gets full, stays full.
> > > 
> > > >From the printk debugging output, it might not be exactly the case.
> > The
> > > ring gets full, run a bit, then gets full, then run a bit, then
> > full...
> > 
> > Yes, but does it get even half empty in between?
> 
> Sometimes, most of them not half of empty in between. But printk slow
> down the traffics, so it's not accurate. I think your patch will improve
> the performance if it signals guest when half of the ring size is
> empty. 
> 
> But you manage signal by using TX bytes,

There are flags for bytes, buffers and packets.
Try playing with any one of them :)
Just be sure to use v2.


>I would like to change it to
> half of the ring size instead for signaling. Is that OK?
> 
> Shirley
> 
> 

Sure that is why I made it a parameter so you can experiment.
Shirley Ma Feb. 1, 2011, 10:59 p.m. UTC | #8
On Tue, 2011-02-01 at 23:56 +0200, Michael S. Tsirkin wrote:
> There are flags for bytes, buffers and packets.
> Try playing with any one of them :)
> Just be sure to use v2.
> 
> 
> >I would like to change it to
> > half of the ring size instead for signaling. Is that OK?
> > 
> > Shirley
> > 
> > 
> 
> Sure that is why I made it a parameter so you can experiment. 

The initial test results show that CPU utilization has been reduced somewhat
and BW has increased somewhat with the default parameters: for 1K message
size, BW goes from 2.5Gb/s to about 2.8Gb/s and CPU utilization drops from
4x% to 38% (similar results to the patch I submitted a while ago to reduce
signaling on vhost), but still far away from the dropping-packet results.

I am going to change the code to use 1/2 ring size to wake the netif
queue.

Shirley

Michael S. Tsirkin Feb. 2, 2011, 4:40 a.m. UTC | #9
On Tue, Feb 01, 2011 at 02:59:57PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 23:56 +0200, Michael S. Tsirkin wrote:
> > There are flags for bytes, buffers and packets.
> > Try playing with any one of them :)
> > Just be sure to use v2.
> > 
> > 
> > >I would like to change it to
> > > half of the ring size instead for signaling. Is that OK?
> > > 
> > > Shirley
> > > 
> > > 
> > 
> > Sure that is why I made it a parameter so you can experiment. 
> 
> The initial test results shows that the CPUs utilization has been
> reduced some, and BW has increased some with the default parameters,
> like 1K message size BW goes from 2.5Gb/s about 2.8Gb/s, CPU utilization
> down from 4x% to 38%, (Similar results from the patch I submitted a
> while ago to reduce signaling on vhost) but far away from dropping
> packet results.
> 
> I am going to change the code to use 1/2 ring size to wake the netif
> queue.
> 
> Shirley

Just tweak the parameters with sysfs, you do not have to edit the code:
echo 64 > /sys/module/vhost_net/parameters/tx_bufs_coalesce

Or in a similar way for tx_packets_coalesce (since we use indirect,
packets will typically use 1 buffer each).
Shirley Ma Feb. 2, 2011, 6:05 a.m. UTC | #10
On Wed, 2011-02-02 at 06:40 +0200, Michael S. Tsirkin wrote:
> Just tweak the parameters with sysfs, you do not have to edit the code:
> echo 64 > /sys/module/vhost_net/parameters/tx_bufs_coalesce
> 
> Or in a similar way for tx_packets_coalesce (since we use indirect,
> packets will typically use 1 buffer each).

We should use packets instead of buffers: in the indirect case, one packet
has multiple buffers, but each packet uses only one descriptor from the ring
(default size is 256).

echo 128 > /sys/module/vhost_net/parameters/tx_packets_coalesce

The way I am changing it is: only when the netif queue has stopped do we
start to count num_free descriptors and send the signal to wake the netif
queue.

Shirley

Shirley Ma Feb. 2, 2011, 6:19 a.m. UTC | #11
On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> 
> The way I am changing is only when netif queue has stopped, then we
> start to count num_free descriptors to send the signal to wake netif
> queue. 

I forgot to mention, the code change I am making is in the guest kernel: in
the xmit callback, only wake up the queue when it is stopped && num_free >=
1/2 * vq->num. I add a new API in virtio_ring.
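
A rough sketch of what that helper could look like (an assumption based on
the virtqueue_free_size() call in the callback snippet later in this thread;
it just exposes the current count of free descriptors in virtio_ring.c):

/* Return the number of free descriptors in the ring. */
unsigned int virtqueue_free_size(struct virtqueue *_vq)
{
	struct vring_virtqueue *vq = to_vvq(_vq);

	return vq->num_free;
}
EXPORT_SYMBOL_GPL(virtqueue_free_size);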

However vhost signaling reduction is needed as well. The patch I
submitted a while ago showed both CPUs and BW improvement.

Thanks
Shirley

Michael S. Tsirkin Feb. 2, 2011, 6:29 a.m. UTC | #12
On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > 
> > The way I am changing is only when netif queue has stopped, then we
> > start to count num_free descriptors to send the signal to wake netif
> > queue. 
> 
> I forgot to mention, the code change I am making is in guest kernel, in
> xmit call back only wake up the queue when it's stopped && num_free >=
> 1/2 *vq->num, I add a new API in virtio_ring.

Interesting. Yes, I agree an API extension would be helpful. However,
wouldn't just the signaling reduction be enough, without guest changes?

> However vhost signaling reduction is needed as well. The patch I
> submitted a while ago showed both CPUs and BW improvement.
> 
> Thanks
> Shirley

Which patch was that?
Krishna Kumar Feb. 2, 2011, 6:34 a.m. UTC | #13
> On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> >
> > The way I am changing is only when netif queue has stopped, then we
> > start to count num_free descriptors to send the signal to wake netif
> > queue.
>
> I forgot to mention, the code change I am making is in guest kernel, in
> xmit call back only wake up the queue when it's stopped && num_free >=
> 1/2 *vq->num, I add a new API in virtio_ring.

FYI :)

I have tried this before. There are a couple of issues:

1. the free count will not reduce until you run free_old_xmit_skbs,
   which will not run anymore since the tx queue is stopped.
2. You cannot call free_old_xmit_skbs directly as it races with a
   queue that was just awakened (current cb was due to the delay
   in disabling cb's).

You have to call free_old_xmit_skbs() under netif_queue_stopped()
check to avoid the race.

I got a small improvement in my testing up to some number of threads
(32 or 48?), but beyond that I was getting a regression.

Thanks,

- KK

> However vhost signaling reduction is needed as well. The patch I
> submitted a while ago showed both CPUs and BW improvement.

Shirley Ma Feb. 2, 2011, 7:03 a.m. UTC | #14
On Wed, 2011-02-02 at 12:04 +0530, Krishna Kumar2 wrote:
> > On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > >
> > > The way I am changing is only when netif queue has stopped, then
> we
> > > start to count num_free descriptors to send the signal to wake
> netif
> > > queue.
> >
> > I forgot to mention, the code change I am making is in guest kernel,
> in
> > xmit call back only wake up the queue when it's stopped && num_free
> >=
> > 1/2 *vq->num, I add a new API in virtio_ring.
> 
> FYI :)

> I have tried this before. There are a couple of issues:
> 
> 1. the free count will not reduce until you run free_old_xmit_skbs,
>    which will not run anymore since the tx queue is stopped.
> 2. You cannot call free_old_xmit_skbs directly as it races with a
>    queue that was just awakened (current cb was due to the delay
>    in disabling cb's).
> 
> You have to call free_old_xmit_skbs() under netif_queue_stopped()
> check to avoid the race.

Yes, that's what I did: when the netif queue stops, don't enable the
queue, just free_old_xmit_skbs(); if not enough were freed, enable the
callback until half of the ring size is freed, then wake the netif
queue. But somehow I didn't reach the performance of dropping packets;
need to think about it more. :)

Thanks
Shirley

Shirley Ma Feb. 2, 2011, 7:14 a.m. UTC | #15
On Wed, 2011-02-02 at 08:29 +0200, Michael S. Tsirkin wrote:
> On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote:
> > On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > > 
> > > The way I am changing is only when netif queue has stopped, then
> we
> > > start to count num_free descriptors to send the signal to wake
> netif
> > > queue. 
> > 
> > I forgot to mention, the code change I am making is in guest kernel,
> in
> > xmit call back only wake up the queue when it's stopped && num_free
> >=
> > 1/2 *vq->num, I add a new API in virtio_ring.
> 
> Interesting. Yes, I agree an API extension would be helpful. However,
> wouldn't just the signaling reduction be enough, without guest
> changes?

w/i guest change, I played around with the parameters, for example: I could
get 3.7Gb/s with 42% CPU, BW increasing from 2.5Gb/s, for 1K message size;
w/i dropping packets, I was able to get up to 6.2Gb/s with similar CPU
usage.

> > However vhost signaling reduction is needed as well. The patch I
> > submitted a while ago showed both CPUs and BW improvement.
> > 
> > Thanks
> > Shirley
> 
> Which patch was that? 

The patch was called "vhost: TX used buffer guest signal accumulation".
You suggested splitting add_used_bufs and signal. I am still thinking
about the best approach to coordinate guest (virtio_kick) and
vhost (handle_tx), vhost (signaling) and guest (xmit callback) to reduce
the overheads, so I haven't submitted the new patch yet.

Thanks
Shirley

Shirley Ma Feb. 2, 2011, 7:33 a.m. UTC | #16
On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> w/i guest change, I played around the parameters,for example: I could
> get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
> size,
> w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU
> usage. 

I meant w/o guest change, only vhost changes. Sorry about that.

Shirley

Krishna Kumar Feb. 2, 2011, 7:37 a.m. UTC | #17
> Shirley Ma <mashirle@us.ibm.com> wrote:
>
> > I have tried this before. There are a couple of issues:
> >
> > 1. the free count will not reduce until you run free_old_xmit_skbs,
> >    which will not run anymore since the tx queue is stopped.
> > 2. You cannot call free_old_xmit_skbs directly as it races with a
> >    queue that was just awakened (current cb was due to the delay
> >    in disabling cb's).
> >
> > You have to call free_old_xmit_skbs() under netif_queue_stopped()
> > check to avoid the race.
>
> Yes, that' what I did, when the netif queue stop, don't enable the
> queue, just free_old_xmit_skbs(), if not enough freed, then enabling
> callback until half of the ring size are freed, then wake the netif
> queue. But somehow I didn't reach the performance compared to drop
> packets, need to think about it more. :)

Did you check if the number of vmexits increased with this
patch? This is possible if the device was keeping up (and
not going into a stop, start, xmit 1 packet, stop, start
loop). Also maybe you should try for 1/4th instead of 1/2?

MST's delayed signalling should avoid this issue, I haven't
tried both together.

Thanks,

- KK

Michael S. Tsirkin Feb. 2, 2011, 10:48 a.m. UTC | #18
On Wed, Feb 02, 2011 at 12:04:37PM +0530, Krishna Kumar2 wrote:
> > On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > >
> > > The way I am changing is only when netif queue has stopped, then we
> > > start to count num_free descriptors to send the signal to wake netif
> > > queue.
> >
> > I forgot to mention, the code change I am making is in guest kernel, in
> > xmit call back only wake up the queue when it's stopped && num_free >=
> > 1/2 *vq->num, I add a new API in virtio_ring.
> 
> FYI :)
> 
> I have tried this before. There are a couple of issues:
> 
> 1. the free count will not reduce until you run free_old_xmit_skbs,
>    which will not run anymore since the tx queue is stopped.
> 2. You cannot call free_old_xmit_skbs directly as it races with a
>    queue that was just awakened (current cb was due to the delay
>    in disabling cb's).
> 
> You have to call free_old_xmit_skbs() under netif_queue_stopped()
> check to avoid the race.
> 
> I got a small improvement in my testing upto some number of threads
> (32 or 48?), but beyond that I was getting a regression.
> 
> Thanks,
> 
> - KK
> 
> > However vhost signaling reduction is needed as well. The patch I
> > submitted a while ago showed both CPUs and BW improvement.

Yes, I think doing this in the host is much simpler,
just send an interrupt after there's a decent amount
of space in the queue.

Having said that the simple heuristic that I coded
might be a bit too simple.
Michael S. Tsirkin Feb. 2, 2011, 10:48 a.m. UTC | #19
On Tue, Feb 01, 2011 at 11:14:51PM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 08:29 +0200, Michael S. Tsirkin wrote:
> > On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote:
> > > On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > > > 
> > > > The way I am changing is only when netif queue has stopped, then
> > we
> > > > start to count num_free descriptors to send the signal to wake
> > netif
> > > > queue. 
> > > 
> > > I forgot to mention, the code change I am making is in guest kernel,
> > in
> > > xmit call back only wake up the queue when it's stopped && num_free
> > >=
> > > 1/2 *vq->num, I add a new API in virtio_ring.
> > 
> > Interesting. Yes, I agree an API extension would be helpful. However,
> > wouldn't just the signaling reduction be enough, without guest
> > changes?
> 
> w/i guest change, I played around the parameters,for example: I could
> get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message size,
> w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU
> usage.

We need to consider them separately IMO.  What's the best we can get
without guest change?  And which parameters give it?
There will always be old guests, and as far as I can tell
it should work better from host.

> > > However vhost signaling reduction is needed as well. The patch I
> > > submitted a while ago showed both CPUs and BW improvement.
> > > 
> > > Thanks
> > > Shirley
> > 
> > Which patch was that? 
> 
> The patch was called "vhost: TX used buffer guest signal accumulation".
Yes, a somewhat similar idea.

> You suggested to split add_used_bufs and signal.
Exactly. And this is basically what this patch does.

> I am still thinking
> what's the best approach to cooperate guest (virtio_kick) and
> vhost(handle_tx), vhost(signaling) and guest (xmit callback) to reduce
> the overheads, so I haven't submit the new patch yet.
> 
> Thanks
> Shirley
Michael S. Tsirkin Feb. 2, 2011, 10:49 a.m. UTC | #20
On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> > w/i guest change, I played around the parameters,for example: I could
> > get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
> > size,
> > w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU
> > usage. 
> 
> I meant w/o guest change, only vhost changes. Sorry about that.
> 
> Shirley

Ah, excellent. What were the parameters?
Shirley Ma Feb. 2, 2011, 3:39 p.m. UTC | #21
On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote:
> Yes, I think doing this in the host is much simpler,
> just send an interrupt after there's a decent amount
> of space in the queue.
> 
> Having said that the simple heuristic that I coded
> might be a bit too simple.

From the debugging output I have seen so far (a single small-message
TCP_STREAM test), I think the right approach is to patch both guest and
vhost. The problem I have found is a regression for the single small-message
TCP_STREAM test. The old kernel works well for TCP_STREAM; only the new
kernel has the problem.

For Steven's problem, it's a multiple-stream TCP_RR issue: the old guest
doesn't perform well, and neither does the new guest kernel. We tested the
patch reducing vhost signaling before; it didn't help the performance at all.

Thanks
Shirley

Shirley Ma Feb. 2, 2011, 3:42 p.m. UTC | #22
On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote:
> On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote:
> > On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> > > w/i guest change, I played around the parameters,for example: I
> could
> > > get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
> > > size,
> > > w/i dropping packet, I was able to get up to 6.2Gb/s with similar
> CPU
> > > usage. 
> > 
> > I meant w/o guest change, only vhost changes. Sorry about that.
> > 
> > Shirley
> 
> Ah, excellent. What were the parameters? 

I used half of the ring size, 129, for the packet counter, but the
performance is still not as good as dropping packets on the guest: 3.7Gb/s
vs. 6.2Gb/s.

Shirley

Michael S. Tsirkin Feb. 2, 2011, 3:47 p.m. UTC | #23
On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote:
> > Yes, I think doing this in the host is much simpler,
> > just send an interrupt after there's a decent amount
> > of space in the queue.
> > 
> > Having said that the simple heuristic that I coded
> > might be a bit too simple.
> 
> >From the debugging out what I have seen so far (a single small message
> TCP_STEAM test), I think the right approach is to patch both guest and
> vhost.

One problem is that slowing down the guest helps here.
So there's a chance that just by adding complexity
in guest driver we get a small improvement :(

We can't rely on a patched guest anyway, so
I think it is best to test guest and host changes separately.

And I do agree something needs to be done in guest too,
for example when vqs share an interrupt, we
might invoke a callback when we see vq is not empty
even though it's not requested. Probably should
check interrupts enabled here?

> The problem I have found is a regression for single  small
> message TCP_STEAM test. Old kernel works well for TCP_STREAM, only new
> kernel has problem.

Likely new kernel is faster :)

> For Steven's problem, it's multiple stream TCP_RR issues, the old guest
> doesn't perform well, so does new guest kernel. We tested reducing vhost
> signaling patch before, it didn't help the performance at all.
> 
> Thanks
> Shirley

Yes, it seems unrelated to tx interrupts.
Michael S. Tsirkin Feb. 2, 2011, 3:48 p.m. UTC | #24
On Wed, Feb 02, 2011 at 07:42:51AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote:
> > On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote:
> > > On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> > > > w/i guest change, I played around the parameters,for example: I
> > could
> > > > get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
> > > > size,
> > > > w/i dropping packet, I was able to get up to 6.2Gb/s with similar
> > CPU
> > > > usage. 
> > > 
> > > I meant w/o guest change, only vhost changes. Sorry about that.
> > > 
> > > Shirley
> > 
> > Ah, excellent. What were the parameters? 
> 
> I used half of the ring size 129 for packet counters, but the
> performance is still not as good as dropping packets on guest, 3.7 Gb/s
> vs. 6.2Gb/s.
> 
> Shirley

And this is with sndbuf=0 in host, yes?
And do you see a lot of tx interrupts?
How many packets per interrupt?
Shirley Ma Feb. 2, 2011, 5:10 p.m. UTC | #25
On Wed, 2011-02-02 at 17:47 +0200, Michael S. Tsirkin wrote:
> On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote:
> > On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote:
> > > Yes, I think doing this in the host is much simpler,
> > > just send an interrupt after there's a decent amount
> > > of space in the queue.
> > > 
> > > Having said that the simple heuristic that I coded
> > > might be a bit too simple.
> > 
> > >From the debugging out what I have seen so far (a single small
> message
> > TCP_STEAM test), I think the right approach is to patch both guest
> and
> > vhost.
> 
> One problem is slowing down the guest helps here.
> So there's a chance that just by adding complexity
> in guest driver we get a small improvement :(
> 
> We can't rely on a patched guest anyway, so
> I think it is best to test guest and host changes separately.
> 
> And I do agree something needs to be done in guest too,
> for example when vqs share an interrupt, we
> might invoke a callback when we see vq is not empty
> even though it's not requested. Probably should
> check interrupts enabled here?

Yes, I modified xmit callback something like below:

static void skb_xmit_done(struct virtqueue *svq)
{
        struct virtnet_info *vi = svq->vdev->priv;

        /* Suppress further interrupts. */
        virtqueue_disable_cb(svq);

        /* We were probably waiting for more output buffers. */
        if (netif_queue_stopped(vi->dev)) {
                free_old_xmit_skbs(vi);
                if (virtqueue_free_size(svq)  <= svq->vring.num / 2) {
                        virtqueue_enable_cb(svq);
			return;
		}
        }
	netif_wake_queue(vi->dev);
}

> > The problem I have found is a regression for single  small
> > message TCP_STEAM test. Old kernel works well for TCP_STREAM, only
> new
> > kernel has problem.
> 
> Likely new kernel is faster :)

> > For Steven's problem, it's multiple stream TCP_RR issues, the old
> guest
> > doesn't perform well, so does new guest kernel. We tested reducing
> vhost
> > signaling patch before, it didn't help the performance at all.
> > 
> > Thanks
> > Shirley
> 
> Yes, it seems unrelated to tx interrupts. 

The issue is more likely related to latency. Do you have anything in
mind on how to reduce vhost latency?

Thanks
Shirley

Shirley Ma Feb. 2, 2011, 5:12 p.m. UTC | #26
On Wed, 2011-02-02 at 17:48 +0200, Michael S. Tsirkin wrote:
> And this is with sndbuf=0 in host, yes?
> And do you see a lot of tx interrupts?
> How packets per interrupt?

Nope, sndbuf doesn't matter since I never hit the sock wmem condition in
vhost. I am still playing around; let me know what data you would like to
collect.

Thanks
Shirley

Michael S. Tsirkin Feb. 2, 2011, 5:32 p.m. UTC | #27
On Wed, Feb 02, 2011 at 09:10:35AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 17:47 +0200, Michael S. Tsirkin wrote:
> > On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote:
> > > On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote:
> > > > Yes, I think doing this in the host is much simpler,
> > > > just send an interrupt after there's a decent amount
> > > > of space in the queue.
> > > > 
> > > > Having said that the simple heuristic that I coded
> > > > might be a bit too simple.
> > > 
> > > >From the debugging out what I have seen so far (a single small
> > message
> > > TCP_STEAM test), I think the right approach is to patch both guest
> > and
> > > vhost.
> > 
> > One problem is slowing down the guest helps here.
> > So there's a chance that just by adding complexity
> > in guest driver we get a small improvement :(
> > 
> > We can't rely on a patched guest anyway, so
> > I think it is best to test guest and host changes separately.
> > 
> > And I do agree something needs to be done in guest too,
> > for example when vqs share an interrupt, we
> > might invoke a callback when we see vq is not empty
> > even though it's not requested. Probably should
> > check interrupts enabled here?
> 
> Yes, I modified xmit callback something like below:
> 
> static void skb_xmit_done(struct virtqueue *svq)
> {
>         struct virtnet_info *vi = svq->vdev->priv;
> 
>         /* Suppress further interrupts. */
>         virtqueue_disable_cb(svq);
> 
>         /* We were probably waiting for more output buffers. */
>         if (netif_queue_stopped(vi->dev)) {
>                 free_old_xmit_skbs(vi);
>                 if (virtqueue_free_size(svq)  <= svq->vring.num / 2) {
>                         virtqueue_enable_cb(svq);
> 			return;
> 		}
>         }
> 	netif_wake_queue(vi->dev);
> }

OK, but this should have no effect with a vhost patch
which should ensure that we don't get an interrupt
until the queue is at least half empty.
Right?

> > > The problem I have found is a regression for single  small
> > > message TCP_STEAM test. Old kernel works well for TCP_STREAM, only
> > new
> > > kernel has problem.
> > 
> > Likely new kernel is faster :)
> 
> > > For Steven's problem, it's multiple stream TCP_RR issues, the old
> > guest
> > > doesn't perform well, so does new guest kernel. We tested reducing
> > vhost
> > > signaling patch before, it didn't help the performance at all.
> > > 
> > > Thanks
> > > Shirley
> > 
> > Yes, it seems unrelated to tx interrupts. 
> 
> The issue is more likely related to latency.

Could be. Why do you think so?

> Do you have anything in
> mind on how to reduce vhost latency?
> 
> Thanks
> Shirley

Hmm, bypassing the bridge might help a bit.
Are you using tap+bridge or macvtap?
Shirley Ma Feb. 2, 2011, 6:11 p.m. UTC | #28
On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote:
> OK, but this should have no effect with a vhost patch
> which should ensure that we don't get an interrupt
> until the queue is at least half empty.
> Right?

There should be some coordination between guest and vhost. We shouldn't
count the TX packets when the netif queue is enabled, since the next guest
TX xmit will free any used buffers in vhost. We need to be careful here in
case we miss interrupts when the netif queue has stopped.

However, we can't change old guests, so we can test the patches separately:
guest only, vhost only, and the combination.

> > > 
> > > Yes, it seems unrelated to tx interrupts. 
> > 
> > The issue is more likely related to latency.
> 
> Could be. Why do you think so?

Since I played with a latency hack, I can see a performance difference for
different latencies.

> > Do you have anything in
> > mind on how to reduce vhost latency?
> > 
> > Thanks
> > Shirley
> 
> Hmm, bypassing the bridge might help a bit.
> Are you using tap+bridge or macvtap? 

I am using tap+bridge for the TCP_RR test. I think Steven tested macvtap
before; he might have some data from his workload performance
measurements.

Shirley

Michael S. Tsirkin Feb. 2, 2011, 6:20 p.m. UTC | #29
On Wed, Feb 02, 2011 at 07:42:51AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote:
> > On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote:
> > > On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> > > > w/i guest change, I played around the parameters,for example: I
> > could
> > > > get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
> > > > size,
> > > > w/i dropping packet, I was able to get up to 6.2Gb/s with similar
> > CPU
> > > > usage. 
> > > 
> > > I meant w/o guest change, only vhost changes. Sorry about that.
> > > 
> > > Shirley
> > 
> > Ah, excellent. What were the parameters? 
> 
> I used half of the ring size 129 for packet counters,
> but the
> performance is still not as good as dropping packets on guest, 3.7 Gb/s
> vs. 6.2Gb/s.
> 
> Shirley

How many packets and bytes per interrupt are sent?
Also, what about other values for the counters and other counters?

What does your patch do? Just drop packets instead of
stopping the interface?

To understand when we should drop packets in the guest, we need to know
*why* it helps. Otherwise, how do we know it will work for others?
Note that qdisc will drop packets when it overruns -
so what is different? Also, are we over-running some other queue
somewhere?
Shirley Ma Feb. 2, 2011, 6:26 p.m. UTC | #30
On Wed, 2011-02-02 at 20:20 +0200, Michael S. Tsirkin wrote:
> How many packets and bytes per interrupt are sent?
> Also, what about other values for the counters and other counters?
> 
> What does your patch do? Just drop packets instead of
> stopping the interface?
> 
> To have an understanding when should we drop packets
> in the guest, we need to know *why* does it help.
> Otherwise, how do we know it will work for others?
> Note that qdisc will drop packets when it overruns -
> so what is different? Also, are we over-running some other queue
> somewhere? 

Agreed. I am trying to put more debugging output to look for all these
answers.

Shirley

Michael S. Tsirkin Feb. 2, 2011, 6:27 p.m. UTC | #31
On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote:
> > OK, but this should have no effect with a vhost patch
> > which should ensure that we don't get an interrupt
> > until the queue is at least half empty.
> > Right?
> 
> There should be some coordination between guest and vhost.

What kind of coordination? With a patched vhost and a full ring,
you should get an interrupt per 100 packets.
Is this what you see? And if yes, isn't the guest patch
doing nothing then?

> We shouldn't
> count the TX packets when netif queue is enabled since next guest TX
> xmit will free any used buffers in vhost. We need to be careful here in
> case we miss the interrupts when netif queue has stopped.
> 
> However we can't change old guest so we can test the patches separately
> for guest only, vhost only, and the combination.
> 
> > > > 
> > > > Yes, it seems unrelated to tx interrupts. 
> > > 
> > > The issue is more likely related to latency.
> > 
> > Could be. Why do you think so?
> 
> Since I played with latency hack, I can see performance difference for
> different latency.

Which hack was that?

> > > Do you have anything in
> > > mind on how to reduce vhost latency?
> > > 
> > > Thanks
> > > Shirley
> > 
> > Hmm, bypassing the bridge might help a bit.
> > Are you using tap+bridge or macvtap? 
> 
> I am using tap+bridge for TCP_RR test, I think Steven tested macvtap
> before. He might have some data from his workload performance
> measurement.
> 
> Shirley
Shirley Ma Feb. 2, 2011, 7:29 p.m. UTC | #32
On Wed, 2011-02-02 at 20:27 +0200, Michael S. Tsirkin wrote:
> On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote:
> > On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote:
> > > OK, but this should have no effect with a vhost patch
> > > which should ensure that we don't get an interrupt
> > > until the queue is at least half empty.
> > > Right?
> > 
> > There should be some coordination between guest and vhost.
> 
> What kind of coordination? With a patched vhost, and a full ring.
> you should get an interrupt per 100 packets.
> Is this what you see? And if yes, isn't the guest patch
> doing nothing then?

vhost_signal won't be able to send any TX interrupts to the guest when the
guest TX interrupt is disabled. The guest TX interrupt is only enabled when
running out of descriptors.

> > We shouldn't
> > count the TX packets when netif queue is enabled since next guest TX
> > xmit will free any used buffers in vhost. We need to be careful here
> in
> > case we miss the interrupts when netif queue has stopped.
> > 
> > However we can't change old guest so we can test the patches
> separately
> > for guest only, vhost only, and the combination.
> > 
> > > > > 
> > > > > Yes, it seems unrelated to tx interrupts. 
> > > > 
> > > > The issue is more likely related to latency.
> > > 
> > > Could be. Why do you think so?
> > 
> > Since I played with latency hack, I can see performance difference
> for
> > different latency.
> 
> Which hack was that? 

I tried to accumulate multiple guest-to-host notifications for TX xmits;
it did help the multiple-stream TCP_RR results. I also forced vhost
handle_tx to handle more packets; both hacks seemed to help.

Thanks
Shirley

Michael S. Tsirkin Feb. 2, 2011, 8:17 p.m. UTC | #33
On Wed, Feb 02, 2011 at 11:29:35AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 20:27 +0200, Michael S. Tsirkin wrote:
> > On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote:
> > > On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote:
> > > > OK, but this should have no effect with a vhost patch
> > > > which should ensure that we don't get an interrupt
> > > > until the queue is at least half empty.
> > > > Right?
> > > 
> > > There should be some coordination between guest and vhost.
> > 
> > What kind of coordination? With a patched vhost, and a full ring.
> > you should get an interrupt per 100 packets.
> > Is this what you see? And if yes, isn't the guest patch
> > doing nothing then?
> 
> vhost_signal won't be able send any TX interrupts to guest when guest TX
> interrupt is disabled. Guest TX interrupt is only enabled when running
> out of descriptors.

Well, this is also the only case where the queue is stopped, no?

> > > We shouldn't
> > > count the TX packets when netif queue is enabled since next guest TX
> > > xmit will free any used buffers in vhost. We need to be careful here
> > in
> > > case we miss the interrupts when netif queue has stopped.
> > > 
> > > However we can't change old guest so we can test the patches
> > separately
> > > for guest only, vhost only, and the combination.
> > > 
> > > > > > 
> > > > > > Yes, it seems unrelated to tx interrupts. 
> > > > > 
> > > > > The issue is more likely related to latency.
> > > > 
> > > > Could be. Why do you think so?
> > > 
> > > Since I played with latency hack, I can see performance difference
> > for
> > > different latency.
> > 
> > Which hack was that? 
> 
> I tried to accumulate multiple guest to host notifications for TX xmits,
> it did help multiple streams TCP_RR results;

I don't see a point in delaying the used idx update, do you?
So delaying just the signal seems better, right?

> I also forced vhost
> handle_tx to handle more packets; both hack seemed help.
> 
> Thanks
> Shirley

Haven't noticed that part, how does your patch make it
handle more packets?
Shirley Ma Feb. 2, 2011, 9:03 p.m. UTC | #34
On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote:
> Well, this is also the only case where the queue is stopped, no?
Yes. I got some debugging data; I saw that sometimes there were so many
packets waiting to be freed in the guest between vhost_signal and the guest
xmit callback. It looks like too much time is spent between vhost_signal and
the guest xmit callback?

> > I tried to accumulate multiple guest to host notifications for TX
> xmits,
> > it did help multiple streams TCP_RR results;
> I don't see a point to delay used idx update, do you?

It might cause each vhost handle_tx to process more packets.

> So delaying just signal seems better, right?

I think I need to define the test matrix to collect data for TX xmit
from guest to host here for different tests.

Data to be collected:
---------------------
1. kvm_stat for VM, I/O exits
2. cpu utilization for both guest and host
3. cat /proc/interrupts on guest
4. packets rate from vhost handle_tx per loop
5. guest netif queue stop rate
6. how many packets are waiting for free between vhost signaling and
guest callback
7. performance results

Test
----
1. TCP_STREAM single stream test for 1K to 4K message size
2. TCP_RR (64 instance test): 128 - 1K request/response size

Different hacks
---------------
1. Base line data ( with the patch to fix capacity check first,
free_old_xmit_skbs returns number of skbs)

2. Drop packet data (will put some debugging in generic networking code)

3. Delay guest netif queue wake up until certain descriptors (1/2 ring
size, 1/4 ring size...) are available once the queue has stopped.

4. Accumulate more packets per vhost signal in handle_tx?

5. 3 & 4 combinations

6. Accumulate more packets per guest kick() (TCP_RR) by adding a timer? 

7. Accumulate more packets per vhost handle_tx() by adding some delay?

> Haven't noticed that part, how does your patch make it
handle more packets?

Added a delay in handle_tx().

What else?

It would take some time to do this.

Shirley

Michael S. Tsirkin Feb. 2, 2011, 9:20 p.m. UTC | #35
On Wed, Feb 02, 2011 at 01:03:05PM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote:
> > Well, this is also the only case where the queue is stopped, no?
> Yes. I got some debugging data, I saw that sometimes there were so many
> packets were waiting for free in guest between vhost_signal & guest xmit
> callback.

What does this mean?

> Looks like the time spent too long from vhost_signal to guest
> xmit callback?



> > > I tried to accumulate multiple guest to host notifications for TX
> > xmits,
> > > it did help multiple streams TCP_RR results;
> > I don't see a point to delay used idx update, do you?
> 
> It might cause per vhost handle_tx processed more packets.

I don't understand. It's a couple of writes - what is the issue?

> > So delaying just signal seems better, right?
> 
> I think I need to define the test matrix to collect data for TX xmit
> from guest to host here for different tests.
> 
> Data to be collected:
> ---------------------
> 1. kvm_stat for VM, I/O exits
> 2. cpu utilization for both guest and host
> 3. cat /proc/interrupts on guest
> 4. packets rate from vhost handle_tx per loop
> 5. guest netif queue stop rate
> 6. how many packets are waiting for free between vhost signaling and
> guest callback
> 7. performance results
> 
> Test
> ----
> 1. TCP_STREAM single stream test for 1K to 4K message size
> 2. TCP_RR (64 instance test): 128 - 1K request/response size
> 
> Different hacks
> ---------------
> 1. Base line data ( with the patch to fix capacity check first,
> free_old_xmit_skbs returns number of skbs)
> 
> 2. Drop packet data (will put some debugging in generic networking code)
> 
> 3. Delay guest netif queue wake up until certain descriptors (1/2 ring
> size, 1/4 ring size...) are available once the queue has stopped.
> 
> 4. Accumulate more packets per vhost signal in handle_tx?
> 
> 5. 3 & 4 combinations
> 
> 6. Accumulate more packets per guest kick() (TCP_RR) by adding a timer? 
> 
> 7. Accumulate more packets per vhost handle_tx() by adding some delay?
> 
> > Haven't noticed that part, how does your patch make it
> handle more packets?
> 
> Added a delay in handle_tx().
> 
> What else?
> 
> It would take sometimes to do this.
> 
> Shirley


Need to think about this.
Shirley Ma Feb. 2, 2011, 9:41 p.m. UTC | #36
On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote:
> > On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote:
> > > Well, this is also the only case where the queue is stopped, no?
> > Yes. I got some debugging data, I saw that sometimes there were so
> many
> > packets were waiting for free in guest between vhost_signal & guest
> xmit
> > callback.
> 
> What does this mean?

Let's look at the sequence here:

guest start_xmit()
	xmit_skb()
	if ring is full,
		enable_cb()

guest skb_xmit_done()
	disable_cb,
        printk free_old_xmit_skbs <-- it was between more than 1/2 and the full ring size
	printk vq->num_free 

vhost handle_tx()
	if (guest interrupt is enabled)
		signal guest to free xmit buffers

So between the guest filling the queue / stopping the queue / enabling the
callback and the guest receiving the callback from the host to run
free_old_xmit_skbs, there were between about half and a full ring's worth of
descriptors available. I thought there would be only a few. (I disabled your
vhost patch for this test.)
 

> > Looks like the time spent too long from vhost_signal to guest
> > xmit callback?
> 
> 
> 
> > > > I tried to accumulate multiple guest to host notifications for
> TX
> > > xmits,
> > > > it did help multiple streams TCP_RR results;
> > > I don't see a point to delay used idx update, do you?
> > 
> > It might cause per vhost handle_tx processed more packets.
> 
> I don't understand. It's a couple of writes - what is the issue?

Oh, handle_tx could process more packets per loop in the multiple-stream
TCP_RR case. I need to print out the per-loop rate to confirm this.

Shirley

Shirley Ma Feb. 3, 2011, 5:05 a.m. UTC | #37
On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote:
> > I think I need to define the test matrix to collect data for TX xmit
> > from guest to host here for different tests.
> > 
> > Data to be collected:
> > ---------------------
> > 1. kvm_stat for VM, I/O exits
> > 2. cpu utilization for both guest and host
> > 3. cat /proc/interrupts on guest
> > 4. packets rate from vhost handle_tx per loop
> > 5. guest netif queue stop rate
> > 6. how many packets are waiting for free between vhost signaling and
> > guest callback
> > 7. performance results
> > 
> > Test
> > ----
> > 1. TCP_STREAM single stream test for 1K to 4K message size
> > 2. TCP_RR (64 instance test): 128 - 1K request/response size
> > 
> > Different hacks
> > ---------------
> > 1. Base line data ( with the patch to fix capacity check first,
> > free_old_xmit_skbs returns number of skbs)
> > 
> > 2. Drop packet data (will put some debugging in generic networking
> code)

Since I found that netif queue stop/wake-up is so expensive, I created a
packet-dropping patch on the guest side so I don't need to debug the generic
networking code.

guest start_xmit()
	capacity = free_old_xmit_skbs() + virtqueue_get_num_freed()
	if (capacity == 0) {
		drop this packet;
		return;
	}

In the patch, both guest TX interrupts and the callback have been omitted.
The host-side vhost_signal in handle_tx can be removed entirely as well. (A
new virtio_ring API is needed to export the total number of free descriptors
here -- virtqueue_get_num_freed.)
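
A slightly fuller sketch of what the guest-side drop path could look like
(virtqueue_get_num_freed() is the hypothetical new API mentioned above and
does not exist upstream; the surrounding function shape is an assumption for
illustration only):

static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct virtnet_info *vi = netdev_priv(dev);
	int capacity;

	/* Reclaim completed buffers and count the free descriptors. */
	capacity = free_old_xmit_skbs(vi) + virtqueue_get_num_freed(vi->svq);
	if (capacity == 0) {
		/* No room: drop instead of stopping the queue, so no
		 * netif stop/wake-up and no TX interrupt is needed. */
		dev->stats.tx_dropped++;
		dev_kfree_skb_any(skb);
		return NETDEV_TX_OK;
	}

	/* Queue the packet and kick the host as usual. */
	xmit_skb(vi, skb);
	virtqueue_kick(vi->svq);
	return NETDEV_TX_OK;
}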

Initial TCP_STREAM performance results I got for guest to local host:
4.2 Gb/s for 1K message size (vs. 2.5 Gb/s),
6.2 Gb/s for 2K message size (vs. 3.8 Gb/s), and
9.8 Gb/s for 4K message size (vs. 5.x Gb/s).

Since the large message size (64K) doesn't hit the (capacity == 0) case, the
performance is only a little better (from 13.x Gb/s to 14.x Gb/s).

kvm_stat output shows a significant reduction in both VM and I/O exits, and
no guest TX interrupts.

With packet dropping, TCP retransmissions increased, so the performance
numbers vary.

This might not be a good solution, but it gave us some idea of how expensive
the netif queue stop/wake-up and guest/host notification paths are.

I haven't found a better way to reduce the netif queue stop/wake-up rate for
small message sizes. But I think once we address this, guest TX performance
for small messages will jump.

I also compared this with returning TX_BUSY when (capacity == 0); it is not
as good as dropping packets.

> > 3. Delay guest netif queue wake up until certain descriptors (1/2
> ring
> > size, 1/4 ring size...) are available once the queue has stopped.
> > 
> > 4. Accumulate more packets per vhost signal in handle_tx?
> > 
> > 5. 3 & 4 combinations
> > 
> > 6. Accumulate more packets per guest kick() (TCP_RR) by adding a
> timer? 
> > 
> > 7. Accumulate more packets per vhost handle_tx() by adding some
> delay?
> > 
> > > Haven't noticed that part, how does your patch make it
> > handle more packets?
> > 
> > Added a delay in handle_tx().
> > 
> > What else?
> > 
> > It would take sometimes to do this.
> > 
> > Shirley
> 
> 
> Need to think about this.
> 
> 

Michael S. Tsirkin Feb. 3, 2011, 5:59 a.m. UTC | #38
On Wed, Feb 02, 2011 at 01:41:33PM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote:
> > > On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote:
> > > > Well, this is also the only case where the queue is stopped, no?
> > > Yes. I got some debugging data, I saw that sometimes there were so
> > many
> > > packets were waiting for free in guest between vhost_signal & guest
> > xmit
> > > callback.
> > 
> > What does this mean?
> 
> Let's look at the sequence here:
> 
> guest start_xmit()
> 	xmit_skb()
> 	if ring is full,
> 		enable_cb()
> 
> guest skb_xmit_done()
> 	disable_cb,
>         printk free_old_xmit_skbs <-- it was between more than 1/2 to
> full ring size
> 	printk vq->num_free 
> 
> vhost handle_tx()
> 	if (guest interrupt is enabled)
> 		signal guest to free xmit buffers
> 
> So between guest queue full/stopped queue/enable call back to guest
> receives the callback from host to free_old_xmit_skbs, there were about
> 1/2 to full ring size descriptors available. I thought there were only a
> few. (I disabled your vhost patch for this test.)


The expected number is vq->num - max skb frags - 2.

> 
> > > Looks like the time spent too long from vhost_signal to guest
> > > xmit callback?
> > 
> > 
> > 
> > > > > I tried to accumulate multiple guest to host notifications for
> > TX
> > > > xmits,
> > > > > it did help multiple streams TCP_RR results;
> > > > I don't see a point to delay used idx update, do you?
> > > 
> > > It might cause per vhost handle_tx processed more packets.
> > 
> > I don't understand. It's a couple of writes - what is the issue?
> 
> Oh, handle_tx could process more packets per loop for multiple streams
> TCP_RR case. I need to print out the data rate per loop to confirm this.
> 
> Shirley
Shirley Ma Feb. 3, 2011, 6:09 a.m. UTC | #39
On Thu, 2011-02-03 at 07:59 +0200, Michael S. Tsirkin wrote:
> > Let's look at the sequence here:
> > 
> > guest start_xmit()
> >       xmit_skb()
> >       if ring is full,
> >               enable_cb()
> > 
> > guest skb_xmit_done()
> >       disable_cb,
> >         printk free_old_xmit_skbs <-- it was between more than 1/2
> to
> > full ring size
> >       printk vq->num_free 
> > 
> > vhost handle_tx()
> >       if (guest interrupt is enabled)
> >               signal guest to free xmit buffers
> > 
> > So between guest queue full/stopped queue/enable call back to guest
> > receives the callback from host to free_old_xmit_skbs, there were
> about
> > 1/2 to full ring size descriptors available. I thought there were
> only a
> > few. (I disabled your vhost patch for this test.)
> 
> 
> The expected number is vq->num - max skb frags - 2. 

It varied (up to the ring size, 256). This is with indirect buffers; the
count is how many descriptors were freed, not the number of buffers.

Why do you think it is vq->num - max skb frags - 2 here?

Shirley

Michael S. Tsirkin Feb. 3, 2011, 6:13 a.m. UTC | #40
On Wed, Feb 02, 2011 at 09:05:56PM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote:
> > > I think I need to define the test matrix to collect data for TX xmit
> > > from guest to host here for different tests.
> > > 
> > > Data to be collected:
> > > ---------------------
> > > 1. kvm_stat for VM, I/O exits
> > > 2. cpu utilization for both guest and host
> > > 3. cat /proc/interrupts on guest
> > > 4. packets rate from vhost handle_tx per loop
> > > 5. guest netif queue stop rate
> > > 6. how many packets are waiting for free between vhost signaling and
> > > guest callback
> > > 7. performance results
> > > 
> > > Test
> > > ----
> > > 1. TCP_STREAM single stream test for 1K to 4K message size
> > > 2. TCP_RR (64 instance test): 128 - 1K request/response size
> > > 
> > > Different hacks
> > > ---------------
> > > 1. Base line data ( with the patch to fix capacity check first,
> > > free_old_xmit_skbs returns number of skbs)
> > > 
> > > 2. Drop packet data (will put some debugging in generic networking
> > code)
> 
> Since I found that the netif queue stop/wake up is so expensive, I
> created a dropping packets patch on guest side so I don't need to debug
> generic networking code.
> 
> guest start_xmit()
> 	capacity = free_old_xmit_skb() + virtqueue_get_num_freed()
> 	if (capacity == 0)
> 		drop this packet;
> 		return;
> 
> In the patch, both guest TX interrupts and callback have been omitted.
> Host vhost_signal in handle_tx can totally be removed as well. (A new
> virtio_ring API is needed for exporting total of num_free descriptors
> here -- virtioqueue_get_num_freed)
> 
> Initial TCP_STREAM performance results I got for guest to local host 
> 4.2Gb/s for 1K message size, (vs. 2.5Gb/s)
> 6.2Gb/s for 2K message size, and (vs. 3.8Gb/s)
> 9.8Gb/s for 4K message size. (vs.5.xGb/s)

What is the average packet size, # bytes per ack, and # of interrupts per
packet? It could be that just slowing down transmission makes GSO work
better.

> Since large message size (64K) doesn't hit (capacity == 0) case, so the
> performance only has a little better. (from 13.xGb/s to 14.x Gb/s)
> 
> kvm_stat output shows significant exits reduction for both VM and I/O,
> no guest TX interrupts.
> 
> With dropping packets, TCP retrans has been increased here, so I can see
> performance numbers are various.
> 
> This might be not a good solution, but it gave us some ideas on
> expensive netif queue stop/wake up between guest and host notification.
> 
> I couldn't find a better solution on how to reduce netif queue stop/wake
> up rate for small message size. But I think once we can address this,
> the guest TX performance will burst for small message size.
> 
> I also compared this with return TX_BUSY approach when (capacity == 0),
> it is not as good as dropping packets.
> 
> > > 3. Delay guest netif queue wake up until certain descriptors (1/2
> > ring
> > > size, 1/4 ring size...) are available once the queue has stopped.
> > > 
> > > 4. Accumulate more packets per vhost signal in handle_tx?
> > > 
> > > 5. 3 & 4 combinations
> > > 
> > > 6. Accumulate more packets per guest kick() (TCP_RR) by adding a
> > timer? 
> > > 
> > > 7. Accumulate more packets per vhost handle_tx() by adding some
> > delay?
> > > 
> > > > Haven't noticed that part, how does your patch make it
> > > handle more packets?
> > > 
> > > Added a delay in handle_tx().
> > > 
> > > What else?
> > > 
> > > It would take sometimes to do this.
> > > 
> > > Shirley
> > 
> > 
> > Need to think about this.
> > 
> > 
Michael S. Tsirkin Feb. 3, 2011, 6:16 a.m. UTC | #41
On Wed, Feb 02, 2011 at 10:09:14PM -0800, Shirley Ma wrote:
> On Thu, 2011-02-03 at 07:59 +0200, Michael S. Tsirkin wrote:
> > > Let's look at the sequence here:
> > > 
> > > guest start_xmit()
> > >       xmit_skb()
> > >       if ring is full,
> > >               enable_cb()
> > > 
> > > guest skb_xmit_done()
> > >       disable_cb,
> > >         printk free_old_xmit_skbs <-- it was between more than 1/2
> > to
> > > full ring size
> > >       printk vq->num_free 
> > > 
> > > vhost handle_tx()
> > >       if (guest interrupt is enabled)
> > >               signal guest to free xmit buffers
> > > 
> > > So between guest queue full/stopped queue/enable call back to guest
> > > receives the callback from host to free_old_xmit_skbs, there were
> > about
> > > 1/2 to full ring size descriptors available. I thought there were
> > only a
> > > few. (I disabled your vhost patch for this test.)
> > 
> > 
> > The expected number is vq->num - max skb frags - 2. 
> 
> It was various (up to the ring size 256). This is using indirection
> buffers, it returned how many freed descriptors, not number of buffers.
> 
> Why do you think it is vq->num - max skb frags - 2 here?
> 
> Shirley

Well, the queue is stopped, which happens when:

        if (capacity < 2+MAX_SKB_FRAGS) {
                netif_stop_queue(dev);
                if (unlikely(!virtqueue_enable_cb(vi->svq))) {
                        /* More just got used, free them then recheck. */
                        capacity += free_old_xmit_skbs(vi);
                        if (capacity >= 2+MAX_SKB_FRAGS) {
                                netif_start_queue(dev);
                                virtqueue_disable_cb(vi->svq);
                        }
                }
        }

This should be the most common case.
I guess the case with += free_old_xmit_skbs is what can get us more.
But it should be rare. Can you count how common it is?
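
One way to count it, purely as an illustration (stop_events and
recheck_wakes would be static unsigned long debug counters; they and the
ratelimited print are assumptions layered on the code above, not an existing
debug patch):

	if (capacity < 2+MAX_SKB_FRAGS) {
		stop_events++;			/* queue about to be stopped */
		netif_stop_queue(dev);
		if (unlikely(!virtqueue_enable_cb(vi->svq))) {
			/* More just got used, free them then recheck. */
			capacity += free_old_xmit_skbs(vi);
			if (capacity >= 2+MAX_SKB_FRAGS) {
				recheck_wakes++;	/* the "rare" path */
				netif_start_queue(dev);
				virtqueue_disable_cb(vi->svq);
			}
		}
		if (net_ratelimit())
			pr_debug("tx stops %lu, recheck wakes %lu\n",
				 stop_events, recheck_wakes);
	}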
Shirley Ma Feb. 3, 2011, 3:58 p.m. UTC | #42
On Thu, 2011-02-03 at 08:13 +0200, Michael S. Tsirkin wrote:
> > Initial TCP_STREAM performance results I got for guest to local
> host 
> > 4.2Gb/s for 1K message size, (vs. 2.5Gb/s)
> > 6.2Gb/s for 2K message size, and (vs. 3.8Gb/s)
> > 9.8Gb/s for 4K message size. (vs.5.xGb/s)
> 
> What is the average packet size, # bytes per ack, and the # of
> interrupts
> per packet? It could be that just slowing down trahsmission
> makes GSO work better. 

There are no TX interrupts with the packet-dropping patch.

GSO/TSO is the key to small-message performance; without GSO/TSO, throughput
is limited to about 2 Gb/s no matter how big the message size is. I think any
work we try here will increase the rate of large packets. BTW, with packet
dropping, TCP went into fast retransmit, not slow start.

I will collect tcpdump and netstat data before and after to compare the
packet size/rate with and without the patch.

Thanks
Shirley

Michael S. Tsirkin Feb. 3, 2011, 4:20 p.m. UTC | #43
On Thu, Feb 03, 2011 at 07:58:00AM -0800, Shirley Ma wrote:
> On Thu, 2011-02-03 at 08:13 +0200, Michael S. Tsirkin wrote:
> > > Initial TCP_STREAM performance results I got for guest to local
> > host 
> > > 4.2Gb/s for 1K message size, (vs. 2.5Gb/s)
> > > 6.2Gb/s for 2K message size, and (vs. 3.8Gb/s)
> > > 9.8Gb/s for 4K message size. (vs.5.xGb/s)
> > 
> > What is the average packet size, # bytes per ack, and the # of
> > interrupts
> > per packet? It could be that just slowing down trahsmission
> > makes GSO work better. 
> 
> There is no TX interrupts with dropping packet.
> 
> GSO/TSO is the key for small message performance, w/o GSO/TSO, the
> performance is limited to about 2Gb/s no matter how big the message size
> it is. I think any work we try here will increase large packet size
> rate. BTW for dropping packet, TCP increased fast retrans, not slow
> start. 
> 
> I will collect tcpdump, netstart before and after data to compare packet
> size/rate w/o w/i the patch.
> 
> Thanks
> Shirley

Just a thought: does it help to make the tx queue len of the virtio device
smaller, e.g. to match the vq size?
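
A two-line illustration of that idea on the driver side (the value 256 and
putting it in virtnet_probe() are assumptions; the same effect can be had at
runtime with "ifconfig <dev> txqueuelen 256"):

	/* Default the qdisc length to the TX vring size (assumed 256 here)
	 * instead of the 1000 set by ether_setup(). */
	dev->tx_queue_len = 256;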
Shirley Ma Feb. 3, 2011, 5:18 p.m. UTC | #44
On Thu, 2011-02-03 at 18:20 +0200, Michael S. Tsirkin wrote:
> Just a thought: does it help to make tx queue len of the
> virtio device smaller? 

Yes, that's what I did before. Reducing txqueuelen makes the qdisc drop
packets early, but it's hard to get a reliable performance gain by tuning
txqueuelen; I tried it on different systems and each required different
values.

Also, I tried another patch: instead of dropping packets, I used a timer
(2 jiffies) to enable/disable the queue on the guest without interrupt
notification. It gets better performance than the original, but worse than
dropping packets, because netif stop/wake-up still happens too often.

vhost definitely needs to be improved to handle small message sizes. It
can't keep up with the small-message packet rate with queue size 256, even
with ring size 1024. QEMU doesn't seem to allow increasing the TX ring size
to 2K (qemu_kvm fails to start with no errors), so I am not able to test
that.

Thanks
Shirley


Patch

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b3ca10..5f9fae9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -197,7 +197,7 @@  static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		vhost_add_used(vq, head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
@@ -205,6 +205,8 @@  static void handle_tx(struct vhost_net *net)
 		}
 	}
 
+	if (total_len > 0)
+		vhost_signal(&net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }