
[2/2] virtio_net: remove send completion interrupts and avoid TX queue overrun through packet drop

Message ID 20110322113649.GA17071@redhat.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Michael S. Tsirkin March 22, 2011, 11:36 a.m. UTC
On Mon, Mar 21, 2011 at 11:03:07AM -0700, Shirley Ma wrote:
> On Fri, 2011-03-18 at 18:41 -0700, Shirley Ma wrote:
> > > > +       /* Drop packet instead of stop queue for better performance */
> > > 
> > > I would like to see some justification as to why this is the right
> > > way to go and not just papering over the real problem. 
> > 
> > Fair. KVM guest virtio_net TX queue stop/restart is pretty expensive,
> > which involves:
> > 
> > 1. Guest enable callback: one memory barrier, interrupt flag set
> 
> Missed this cost: for historical reasons, it also involves a guest exit from
> I/O write (PCI_QUEUE_NOTIFY).

OK, after some research, it looks like the reason was the tx timer that
qemu used to use. So the hack of avoiding the add_buf call will
avoid this kick and so break these hosts.
I guess we can add a feature bit to detect a new host
and so avoid the kick. We are running low on feature bits
unfortunately, but just for testing, could you quantify the difference
that this makes using the following patch:

Comments

Shirley Ma March 23, 2011, 2:26 a.m. UTC | #1
On Tue, 2011-03-22 at 13:36 +0200, Michael S. Tsirkin wrote:
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index cc2f73e..6106017 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -185,11 +185,6 @@ int virtqueue_add_buf_gfp(struct virtqueue *_vq,
>         if (vq->num_free < out + in) {
>                 pr_debug("Can't add buf len %i - avail = %i\n",
>                          out + in, vq->num_free);
> -               /* FIXME: for historical reasons, we force a notify here if
> -                * there are outgoing parts to the buffer.  Presumably the
> -                * host should service the ring ASAP. */
> -                * host should service the ring ASAP. */
> -               if (out)
> -                       vq->notify(&vq->vq);
>                 END_USE(vq);
>                 return -ENOSPC;
>         }
> 

With simply removing the notify here, it does help the case when TX
overrun hits too often, for example for 1K message size, the single
TCP_STREAM performance improved from 2.xGb/s to 4.xGb/s.
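For reference, the drop-instead-of-stop idea being measured here can be
sketched roughly as follows on the driver side. This is illustrative only,
not the actual [2/2] patch: my_tx_slots_free() and my_send() are
hypothetical helpers standing in for the real virtio_net code, and the
headroom threshold is just an example.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

int my_tx_slots_free(struct net_device *dev);               /* hypothetical */
void my_send(struct net_device *dev, struct sk_buff *skb);  /* hypothetical */

static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* Not enough room for a worst-case packet: drop it instead of
	 * calling netif_stop_queue(), so we never pay for the stop /
	 * enable-callback / interrupt / wake round trip. */
	if (my_tx_slots_free(dev) < MAX_SKB_FRAGS + 2) {
		dev->stats.tx_dropped++;
		dev_kfree_skb_any(skb);
		return NETDEV_TX_OK;
	}

	my_send(dev, skb);	/* post the skb to the TX virtqueue */
	return NETDEV_TX_OK;
}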

Thanks
Shirley

Rusty Russell March 24, 2011, 12:16 a.m. UTC | #2
On Tue, 22 Mar 2011 13:36:50 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Mon, Mar 21, 2011 at 11:03:07AM -0700, Shirley Ma wrote:
> > On Fri, 2011-03-18 at 18:41 -0700, Shirley Ma wrote:
> > > > > +       /* Drop packet instead of stop queue for better performance */
> > > > 
> > > > I would like to see some justification as to why this is the right
> > > > way to go and not just papering over the real problem. 
> > > 
> > > Fair. KVM guest virtio_net TX queue stop/restart is pretty expensive,
> > > which involves:
> > > 
> > > 1. Guest enable callback: one memory barrier, interrupt flag set
> > 
> > Missed this cost: for historical reasons, it also involves a guest exit from
> > I/O write (PCI_QUEUE_NOTIFY).
> 
> OK, after some research, it looks like the reason was the tx timer that
> qemu used to use. So the hack of avoiding the add_buf call will
> avoid this kick and so break these hosts.
> I guess we can add a feature bit to detect a new host
> and so avoid the kick. We are running low on feature bits
> unfortunately, but just for testing, could you quantify the difference
> that this makes using the following patch:

Performance would suffer for those ancient qemus if we didn't do this,
but it wouldn't be fatal to them.

I think we should just remove it; the standard certainly doesn't mention
it.

Cheers,
Rusty.
Rusty Russell March 24, 2011, 12:30 a.m. UTC | #3
> With simply removing the notify here, it does help the case when TX
> overrun hits too often, for example for 1K message size, the single
> TCP_STREAM performance improved from 2.xGb/s to 4.xGb/s.

OK, we'll be getting rid of the "kick on full", so please delete that on
all benchmarks.

Now, does the capacity check before add_buf() still win anything?  I
can't see how unless we have some weird bug.

Once we've sorted that out, we should look at the more radical change
of publishing last_used and using that to intuit whether interrupts
should be sent.  If we're not careful with ordering and barriers that
could introduce more bugs.
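As a rough illustration of what publishing last_used buys us, here is a
toy user-space model (names and the exact check are my own, not proposed
kernel code): the guest publishes the last used index it has processed,
and the host skips the completion interrupt unless the entries it just
finished cross that index. The barriers this needs are exactly the part
that has to be done carefully.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Host side: does the guest care about the used entries in
 * (old_used, new_used]?  guest_seen is the used index the guest says it
 * has already handled.  Indices are free-running 16-bit counters, so
 * unsigned subtraction handles wraparound. */
static bool need_interrupt(uint16_t old_used, uint16_t new_used,
			   uint16_t guest_seen)
{
	return (uint16_t)(new_used - guest_seen - 1) <
	       (uint16_t)(new_used - old_used);
}

int main(void)
{
	/* Guest has seen up to 9; host completes 10..12: interrupt. */
	printf("%d\n", need_interrupt(9, 12, 9));	/* prints 1 */
	/* Guest was polling and already consumed 10..12: no interrupt. */
	printf("%d\n", need_interrupt(9, 12, 12));	/* prints 0 */
	return 0;
}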

Anything else on the optimization agenda I've missed?

Thanks,
Rusty.
Michael S. Tsirkin March 24, 2011, 6:39 a.m. UTC | #4
On Thu, Mar 24, 2011 at 10:46:49AM +1030, Rusty Russell wrote:
> On Tue, 22 Mar 2011 13:36:50 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Mon, Mar 21, 2011 at 11:03:07AM -0700, Shirley Ma wrote:
> > > On Fri, 2011-03-18 at 18:41 -0700, Shirley Ma wrote:
> > > > > > +       /* Drop packet instead of stop queue for better performance */
> > > > > 
> > > > > I would like to see some justification as to why this is the right
> > > > > way to go and not just papering over the real problem. 
> > > > 
> > > > Fair. KVM guest virtio_net TX queue stop/restart is pretty expensive,
> > > > which involves:
> > > > 
> > > > 1. Guest enable callback: one memory barrier, interrupt flag set
> > > 
> > > Missed this cost: for historical reasons, it also involves a guest exit from
> > > I/O write (PCI_QUEUE_NOTIFY).
> > 
> > OK, after some research, it looks like the reason was the tx timer that
> > qemu used to use. So the hack of avoiding the add_buf call will
> > avoid this kick and so break these hosts.
> > I guess we can add a feature bit to detect a new host
> > and so avoid the kick. We are running low on feature bits
> > unfortunately, but just for testing, could you quantify the difference
> > that this makes using the following patch:
> 
> Performance would suffer for those ancient qemus if we didn't do this,
> but it wouldn't be fatal to them.
> 
> I think we should just remove it; the standard certainly doesn't mention
> it.
> 
> Cheers,
> Rusty.

I agree here. Anthony, agree?
Michael S. Tsirkin March 24, 2011, 2:28 p.m. UTC | #5
On Thu, Mar 24, 2011 at 11:00:53AM +1030, Rusty Russell wrote:
> > With simply removing the notify here, it does help the case when TX
> > overrun hits too often, for example for 1K message size, the single
> > TCP_STREAM performance improved from 2.xGb/s to 4.xGb/s.
> 
> OK, we'll be getting rid of the "kick on full", so please delete that on
> all benchmarks.
> 
> Now, does the capacity check before add_buf() still win anything?  I
> can't see how unless we have some weird bug.
> 
> Once we've sorted that out, we should look at the more radical change
> of publishing last_used and using that to intuit whether interrupts
> should be sent.  If we're not careful with ordering and barriers that
> could introduce more bugs.

Right. I am working on this, and trying to be careful.
One thing I'm in doubt about: sometimes we just want to
disable interrupts. Should still use flags in that case?
I thought that if we make the published index 0 to vq->num - 1,
then a special value in the index field could disable
interrupts completely. We could even reuse the space
for the flags field to stick the index in. Too complex?

> Anything else on the optimization agenda I've missed?
> 
> Thanks,
> Rusty.

Several other things I am looking at; cooperation welcome:
1. It's probably a good idea to update avail index
   immediately instead of upon kick: for RX
   this might help parallelism with the host.

2. Adding an API to add a single buffer instead of s/g,
   seems to help a bit.

3. For TX sometimes we free a single buffer, sometimes
   a ton of them, which might make the transmit latency
   vary. It's probably a good idea to limit this,
   maybe free the minimal number possible to keep the device
   going without stops, maybe free up to MAX_SKB_FRAGS.

4. If the ring is full, we now notify right after
   the first entry is consumed. For TX this is suboptimal,
   we should try delaying the interrupt on host.

More ideas, would be nice if someone can try them out:
1. We are allocating/freeing buffers for indirect descriptors.
   Use some kind of pool instead?
   And we could preformat part of the descriptor.

2. I didn't have time to work on virtio2 ideas presented
   at the kvm forum yet, any takers?
Shirley Ma March 24, 2011, 5:46 p.m. UTC | #6
On Thu, 2011-03-24 at 16:28 +0200, Michael S. Tsirkin wrote:
> On Thu, Mar 24, 2011 at 11:00:53AM +1030, Rusty Russell wrote:
> > > With simply removing the notify here, it does help the case when TX
> > > overrun hits too often, for example for 1K message size, the single
> > > TCP_STREAM performance improved from 2.xGb/s to 4.xGb/s.
> > 
> > OK, we'll be getting rid of the "kick on full", so please delete that on
> > all benchmarks.
> > 
> > Now, does the capacity check before add_buf() still win anything?  I
> > can't see how unless we have some weird bug.
> > 
> > Once we've sorted that out, we should look at the more radical change
> > of publishing last_used and using that to intuit whether interrupts
> > should be sent.  If we're not careful with ordering and barriers that
> > could introduce more bugs.
> 
> Right. I am working on this, and trying to be careful.
> One thing I'm in doubt about: sometimes we just want to
> disable interrupts. Should still use flags in that case?
> I thought that if we make the published index 0 to vq->num - 1,
> then a special value in the index field could disable
> interrupts completely. We could even reuse the space
> for the flags field to stick the index in. Too complex?
> > Anything else on the optimization agenda I've missed?
> > 
> > Thanks,
> > Rusty.
> 
> Several other things I am looking at; cooperation welcome:
> 1. It's probably a good idea to update avail index
>    immediately instead of upon kick: for RX
>    this might help parallelism with the host.
Is it possible to use the same idea as publishing the last used idx to
publish the avail idx? Then we can save guest iowrites/exits.

> 2. Adding an API to add a single buffer instead of s/g,
>    seems to help a bit.
> 
> 3. For TX sometimes we free a single buffer, sometimes
>    a ton of them, which might make the transmit latency
>    vary. It's probably a good idea to limit this,
>    maybe free the minimal number possible to keep the device
>    going without stops, maybe free up to MAX_SKB_FRAGS.

I am playing with it now, collecting more perf data to see what the
best number of used buffers to free is.

> 4. If the ring is full, we now notify right after
>    the first entry is consumed. For TX this is suboptimal,
>    we should try delaying the interrupt on host.

> More ideas, would be nice if someone can try them out:
> 1. We are allocating/freeing buffers for indirect descriptors.
>    Use some kind of pool instead?
>    And we could preformat part of the descriptor.
> 2. I didn't have time to work on virtio2 ideas presented
>    at the kvm forum yet, any takers?
If I have time, I will look at this.

Thanks
Shirley


Michael S. Tsirkin March 24, 2011, 6:10 p.m. UTC | #7
On Thu, Mar 24, 2011 at 10:46:49AM -0700, Shirley Ma wrote:
> On Thu, 2011-03-24 at 16:28 +0200, Michael S. Tsirkin wrote:
> > On Thu, Mar 24, 2011 at 11:00:53AM +1030, Rusty Russell wrote:
> > > > With simply removing the notify here, it does help the case when TX
> > > > overrun hits too often, for example for 1K message size, the single
> > > > TCP_STREAM performance improved from 2.xGb/s to 4.xGb/s.
> > > 
> > > OK, we'll be getting rid of the "kick on full", so please delete that on
> > > all benchmarks.
> > > 
> > > Now, does the capacity check before add_buf() still win anything?  I
> > > can't see how unless we have some weird bug.
> > > 
> > > Once we've sorted that out, we should look at the more radical change
> > > of publishing last_used and using that to intuit whether interrupts
> > > should be sent.  If we're not careful with ordering and barriers that
> > > could introduce more bugs.
> > 
> > Right. I am working on this, and trying to be careful.
> > One thing I'm in doubt about: sometimes we just want to
> > disable interrupts. Should still use flags in that case?
> > I thought that if we make the published index 0 to vq->num - 1,
> > then a special value in the index field could disable
> > interrupts completely. We could even reuse the space
> > for the flags field to stick the index in. Too complex?
> > > Anything else on the optimization agenda I've missed?
> > > 
> > > Thanks,
> > > Rusty.
> > 
> > Several other things I am looking at; cooperation welcome:
> > 1. It's probably a good idea to update avail index
> >    immediately instead of upon kick: for RX
> >    this might help parallelism with the host.
> Is it possible to use the same idea as publishing the last used idx to
> publish the avail idx? Then we can save guest iowrites/exits.

Yes, but unrelated to 1 above :)

> > 2. Adding an API to add a single buffer instead of s/g,
> >    seems to help a bit.
> > 
> > 3. For TX sometimes we free a single buffer, sometimes
> >    a ton of them, which might make the transmit latency
> >    vary. It's probably a good idea to limit this,
> >    maybe free the minimal number possible to keep the device
> >    going without stops, maybe free up to MAX_SKB_FRAGS.
> 
> I am playing with it now, collecting more perf data to see what the
> best number of used buffers to free is.

The best IMO is to keep the number of freed buffers constant
so that we have more or less identical overhead for each packet.
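A toy sketch of that, with invented names and an arbitrary budget; in
virtio_net this would sit where completed skbs are reclaimed before a
new packet is queued.

#include <stdbool.h>

#define TX_FREE_BUDGET 18	/* e.g. roughly MAX_SKB_FRAGS + 2 */

struct tx_ring;					/* opaque in this sketch */
bool tx_ring_has_used(struct tx_ring *r);	/* hypothetical */
void tx_ring_free_one_used(struct tx_ring *r);	/* hypothetical */

/* Reclaim at most TX_FREE_BUDGET completed buffers per transmit, so the
 * per-packet reclaim cost stays roughly constant instead of sometimes
 * zero and sometimes the whole ring. */
static void reclaim_some_tx(struct tx_ring *r)
{
	int budget = TX_FREE_BUDGET;

	while (budget-- > 0 && tx_ring_has_used(r))
		tx_ring_free_one_used(r);
}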

> > 4. If the ring is full, we now notify right after
> >    the first entry is consumed. For TX this is suboptimal,
> >    we should try delaying the interrupt on host.
> 
> > More ideas, would be nice if someone can try them out:
> > 1. We are allocating/freeing buffers for indirect descriptors.
> >    Use some kind of pool instead?
> >    And we could preformat part of the descriptor.
> > 2. I didn't have time to work on virtio2 ideas presented
> >    at the kvm forum yet, any takers?
> If I have time, I will look at this.
> 
> Thanks
> Shirley
> 
Rusty Russell March 25, 2011, 4:50 a.m. UTC | #8
On Thu, 24 Mar 2011 16:28:22 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Mar 24, 2011 at 11:00:53AM +1030, Rusty Russell wrote:
> > > With simply removing the notify here, it does help the case when TX
> > > overrun hits too often, for example for 1K message size, the single
> > > TCP_STREAM performance improved from 2.xGb/s to 4.xGb/s.
> > 
> > OK, we'll be getting rid of the "kick on full", so please delete that on
> > all benchmarks.
> > 
> > Now, does the capacity check before add_buf() still win anything?  I
> > can't see how unless we have some weird bug.
> > 
> > Once we've sorted that out, we should look at the more radical change
> > of publishing last_used and using that to intuit whether interrupts
> > should be sent.  If we're not careful with ordering and barriers that
> > could introduce more bugs.
> 
> Right. I am working on this, and trying to be careful.
> One thing I'm in doubt about: sometimes we just want to
> disable interrupts. Should still use flags in that case?
> I thought that if we make the published index 0 to vq->num - 1,
> then a special value in the index field could disable
> interrupts completely. We could even reuse the space
> for the flags field to stick the index in. Too complex?

Making the index free-running avoids the "full or empty" confusion, plus
offers an extra debugging insight.

I think that if they really want to disable interrupts, the flag should
still work, and when the client accepts the "publish last_idx" feature
they are accepting that interrupts may be omitted if they haven't
updated last_idx yet.

> > Anything else on the optimization agenda I've missed?
> > 
> > Thanks,
> > Rusty.
> 
> Several other things I am looking at; cooperation welcome:
> 1. It's probably a good idea to update avail index
>    immediately instead of upon kick: for RX
>    this might help parallelism with the host.

Yes, once we've done everything else, we should measure this.  It makes
sense.

> 2. Adding an API to add a single buffer instead of s/g,
>    seems to help a bit.

This goes last, since it's kind of an ugly hack, but all internal to
Linux if we decide it's a win.

> 3. For TX sometimes we free a single buffer, sometimes
>    a ton of them, which might make the transmit latency
>    vary. It's probably a good idea to limit this,
>    maybe free the minimal number possible to keep the device
>    going without stops, maybe free up to MAX_SKB_FRAGS.

This kind of heuristic is going to be quite variable depending on
circumstance, I think, so it's a lot of work to make sure we get it
right.

> 4. If the ring is full, we now notify right after
>    the first entry is consumed. For TX this is suboptimal,
>    we should try delaying the interrupt on host.

Lguest already does that: only sends an interrupt when it's run out of
things to do.  It does update the used ring, however, as it processes
them.

This seems sensible to me, but needs to be measured separately as well.
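A minimal model of that host-side behaviour (all names invented, not
lguest code): used entries are published per buffer, and a single
interrupt is raised only once the queue is drained.

#include <stdbool.h>

struct vq_model;				/* opaque in this sketch */
bool vq_has_pending(struct vq_model *vq);	/* hypothetical */
void vq_process_one(struct vq_model *vq);	/* hypothetical: consume buffer, publish used entry */
void vq_inject_irq(struct vq_model *vq);	/* hypothetical */
bool vq_irq_enabled(struct vq_model *vq);	/* hypothetical: guest has not suppressed irqs */

static void host_service_queue(struct vq_model *vq)
{
	bool did_work = false;

	while (vq_has_pending(vq)) {
		vq_process_one(vq);	/* used ring updated as we go */
		did_work = true;
	}

	/* One interrupt at the end, once there is nothing left to do. */
	if (did_work && vq_irq_enabled(vq))
		vq_inject_irq(vq);
}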

> More ideas, would be nice if someone can try them out:
> 1. We are allocating/freeing buffers for indirect descriptors.
>    Use some kind of pool instead?
>    And we could preformat part of the descriptor.

We need some poolish mechanism for virtio_blk too; perhaps an allocation
callback which both can use (virtio_blk to alloc from a pool, virtio_net
to recycle?).

Along similar lines to preformatting, we could actually try to prepend
the skb_vnet_hdr to the vnet data, and use a single descriptor for the
hdr and the first part of the packet.

Though IIRC, qemu's virtio barfs if the first descriptor isn't just the
hdr (barf...).
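Roughly what the prepended-header layout would mean, with simplified
stand-in structures (not the in-tree virtio_net_hdr or scatterlist
types): if the header is placed immediately in front of the packet data,
one contiguous element can describe both.

#include <stdint.h>

struct vnet_hdr {		/* stand-in for the virtio-net header */
	uint8_t  flags;
	uint8_t  gso_type;
	uint16_t hdr_len, gso_size, csum_start, csum_offset;
};

struct sg_elem {		/* one scatter-gather element */
	void     *addr;
	uint32_t  len;
};

/* Returns the number of elements used, or -1 if the caller must fall
 * back to the two-element layout that old hosts expect. */
static int fill_single_desc(struct sg_elem *sg, struct vnet_hdr *hdr,
			    void *data, uint32_t len)
{
	if ((char *)data != (char *)(hdr + 1))
		return -1;	/* header and data are not contiguous */

	sg[0].addr = hdr;
	sg[0].len  = sizeof(*hdr) + len;
	return 1;
}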

> 2. I didn't have time to work on virtio2 ideas presented
>    at the kvm forum yet, any takers?

I didn't even attend.  But I think that virtio2 is moribund for the
moment; there wasn't enough demand, and it's clear that there are
optimizations unexplored in virtio1.

Cheers,
Rusty.
Rusty Russell March 25, 2011, 4:51 a.m. UTC | #9
On Thu, 24 Mar 2011 10:46:49 -0700, Shirley Ma <mashirle@us.ibm.com> wrote:
> On Thu, 2011-03-24 at 16:28 +0200, Michael S. Tsirkin wrote:
> > Several other things I am looking at; cooperation welcome:
> > 1. It's probably a good idea to update avail index
> >    immediately instead of upon kick: for RX
> >    this might help parallelism with the host.
> Is it possible to use the same idea as publishing the last used idx to
> publish the avail idx? Then we can save guest iowrites/exits.

Yes, it should be symmetrical.  Test independently of course, but the
same logic applies.

Thanks!
Rusty.
Michael S. Tsirkin March 27, 2011, 7:52 a.m. UTC | #10
On Fri, Mar 25, 2011 at 03:20:46PM +1030, Rusty Russell wrote:
> > 3. For TX sometimes we free a single buffer, sometimes
> >    a ton of them, which might make the transmit latency
> >    vary. It's probably a good idea to limit this,
> >    maybe free the minimal number possible to keep the device
> >    going without stops, maybe free up to MAX_SKB_FRAGS.
> 
> This kind of heuristic is going to be quite variable depending on
> circumstance, I think, so it's a lot of work to make sure we get it
> right.

Hmm, trying to keep the amount of work per descriptor
constant seems to make sense though, no?
Latency variations are not good for either RT uses or
protocols such as TCP.

> > 4. If the ring is full, we now notify right after
> >    the first entry is consumed. For TX this is suboptimal,
> >    we should try delaying the interrupt on host.
> 
> Lguest already does that: only sends an interrupt when it's run out of
> things to do.  It does update the used ring, however, as it processes
> them.

There are many approaches here. I suspect something like interrupting
after half the work is done might be better for parallelism.

> 
> This seems sensible to me, but needs to be measured separately as well.

Agree.

> > More ideas, would be nice if someone can try them out:
> > 1. We are allocating/freeing buffers for indirect descriptors.
> >    Use some kind of pool instead?
> >    And we could preformat part of the descriptor.
> 
> We need some poolish mechanism for virtio_blk too; perhaps an allocation
> callback which both can use (virtio_blk to alloc from a pool, virtio_net
> to recycle?).

BTW for recycling, we need to be careful about NUMA effects: probably
store the CPU id and reallocate if we switch CPUs ... (or NUMA nodes -
unfortunately not always described correctly).
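A toy model of such a recycling cache with the NUMA check (names are
invented; current_numa_node() stands in for something like
numa_node_id() in the kernel, and a real version would be per-CPU):

#include <stdlib.h>

int current_numa_node(void);	/* hypothetical */

struct desc_cache {
	void *buf;	/* one cached indirect descriptor block, or NULL */
	int   node;	/* node recorded when the block was cached */
};

static void *desc_get(struct desc_cache *c, size_t size)
{
	if (c->buf && c->node == current_numa_node()) {
		void *buf = c->buf;	/* recycle: same node, no allocation */
		c->buf = NULL;
		return buf;
	}
	free(c->buf);			/* wrong node (or empty): start fresh */
	c->buf = NULL;
	return malloc(size);
}

static void desc_put(struct desc_cache *c, void *buf)
{
	if (c->buf) {
		free(buf);		/* cache already holds a block */
		return;
	}
	c->buf  = buf;
	c->node = current_numa_node();	/* assumes alloc and free on the same node */
}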

> Along similar lines to preformatting, we could actually try to prepend
> the skb_vnet_hdr to the vnet data, and use a single descriptor for the
> hdr and the first part of the packet.
> 
> Though IIRC, qemu's virtio barfs if the first descriptor isn't just the
> hdr (barf...).

Maybe we can try fixing this before adding more flags; then e.g. the
publish-used flag can be reused to also tell us the layout is flexible.
Or just add a feature flag for that.

> > 2. I didn't have time to work on virtio2 ideas presented
> >    at the kvm forum yet, any takers?
> 
> I didn't even attend.

Hmm, right. But what was presented there was discussed on list as well:
a single R/W descriptor ring with valid bit instead of 2 rings
+ a descriptor array.
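Very roughly, such a single-ring layout might look like the sketch below.
This is purely illustrative - field names and flag bits are made up, not
a spec - with a cookie field along the lines Rusty mentions in his reply.

#include <stdint.h>

#define DESC_F_POSTED	(1u << 0)	/* set by driver, cleared by device */
#define DESC_F_WRITE	(1u << 1)	/* device writes into this buffer */

struct ring_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t cookie;	/* handed back unchanged on completion */
};

/* Driver posts a buffer into the next slot; the ring is full when that
 * slot is still marked posted (the device has not completed it yet). */
static int post_buffer(struct ring_desc *ring, uint16_t size, uint16_t *head,
		       uint64_t addr, uint32_t len, uint16_t cookie,
		       int device_writes)
{
	struct ring_desc *d = &ring[*head % size];

	if (d->flags & DESC_F_POSTED)
		return -1;		/* ring full */

	d->addr   = addr;
	d->len    = len;
	d->cookie = cookie;
	/* A write barrier belongs here before flipping the ownership bit. */
	d->flags  = DESC_F_POSTED | (device_writes ? DESC_F_WRITE : 0);
	(*head)++;
	return 0;
}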

>  But I think that virtio2 is moribund for the moment; there wasn't
> enough demand, and it's clear that there are optimizations unexplored
> in virtio1.

I absolutely agree that not all lessons have been learned; playing
with different ring layouts would make at least an interesting paper
IMO.
Rusty Russell April 4, 2011, 6:13 a.m. UTC | #11
On Sun, 27 Mar 2011 09:52:54 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > Though IIRC, qemu's virtio barfs if the first descriptor isn't just the
> > hdr (barf...).
> 
> Maybe we can try fixing this before adding more flags; then e.g. the
> publish-used flag can be reused to also tell us the layout is
> flexible. Or just add a feature flag for that.

We should probably do this at some stage, yes.

> > > 2. I didn't have time to work on virtio2 ideas presented
> > >    at the kvm forum yet, any takers?
> > 
> > I didn't even attend.
> 
> Hmm, right. But what was presented there was discussed on list as well:
> a single R/W descriptor ring with valid bit instead of 2 rings
> + a descriptor array.

I'll be happy when we reach the point that the extra cacheline is
hurting us :)

Then we should do direct descriptors w/ a cookie as the value to hand
back when finished.  That seems to be close to optimal.

> I absolutely agree that not all lessons have been learned; playing
> with different ring layouts would make at least an interesting paper
> IMO.

Yes, I'd like to see the results...

Thanks,
Rusty.

Patch

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index cc2f73e..6106017 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -185,11 +185,6 @@  int virtqueue_add_buf_gfp(struct virtqueue *_vq,
 	if (vq->num_free < out + in) {
 		pr_debug("Can't add buf len %i - avail = %i\n",
 			 out + in, vq->num_free);
-		/* FIXME: for historical reasons, we force a notify here if
-		 * there are outgoing parts to the buffer.  Presumably the
-		 * host should service the ring ASAP. */
-		if (out)
-			vq->notify(&vq->vq);
 		END_USE(vq);
 		return -ENOSPC;
 	}