Patchwork qemu/virtio-net: remove wrong s/g layout assumptions

login
register
mail settings
Submitter Michael S. Tsirkin
Date Nov. 24, 2009, 7:45 p.m.
Message ID <20091124194502.GA4250@redhat.com>
Download mbox | patch
Permalink /patch/39231/
State New
Headers show

Comments

Michael S. Tsirkin - Nov. 24, 2009, 7:45 p.m.
virtio net currently assumes that the first s/g element it gets is
always virtio net header. This is wrong.
There should be no assumption on sg boundaries.  For example, the guest
should be able to put the virtio_net_hdr in the front of the skbuf data
if there is room.  Get rid of this assumption, properly consume space
from iovec, always.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

Rusty, since you suggested this fix, could you ack it please?

 hw/virtio-net.c |   80 +++++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 57 insertions(+), 23 deletions(-)
Anthony Liguori - Nov. 24, 2009, 7:50 p.m.
Michael S. Tsirkin wrote:
> virtio net currently assumes that the first s/g element it gets is
> always virtio net header. This is wrong.
> There should be no assumption on sg boundaries.  For example, the guest
> should be able to put the virtio_net_hdr in the front of the skbuf data
> if there is room.  Get rid of this assumption, properly consume space
> from iovec, always.
>   

Practically speaking, we ought to advertise a feature bit to let a 
kernel know that we are no longer broken.

Otherwise, there are a ton of old userspaces that will break with new 
guests.
Michael S. Tsirkin - Nov. 24, 2009, 7:54 p.m.
On Tue, Nov 24, 2009 at 01:50:25PM -0600, Anthony Liguori wrote:
> Michael S. Tsirkin wrote:
>> virtio net currently assumes that the first s/g element it gets is
>> always virtio net header. This is wrong.
>> There should be no assumption on sg boundaries.  For example, the guest
>> should be able to put the virtio_net_hdr in the front of the skbuf data
>> if there is room.  Get rid of this assumption, properly consume space
>> from iovec, always.
>>   
>
> Practically speaking, we ought to advertise a feature bit to let a  
> kernel know that we are no longer broken.
>
> Otherwise, there are a ton of old userspaces that will break with new  
> guests.

My thinking is, first of all let's fix the bug.
We'll add a feature bit when or if some guest wants to use it.
Maybe this will be 100 years down the road when all old userspace
has died a natural death :)
Makes sense?

> -- 
> Regards,
>
> Anthony Liguori
Anthony Liguori - Nov. 24, 2009, 9:04 p.m.
Michael S. Tsirkin wrote:
> On Tue, Nov 24, 2009 at 01:50:25PM -0600, Anthony Liguori wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>> virtio net currently assumes that the first s/g element it gets is
>>> always virtio net header. This is wrong.
>>> There should be no assumption on sg boundaries.  For example, the guest
>>> should be able to put the virtio_net_hdr in the front of the skbuf data
>>> if there is room.  Get rid of this assumption, properly consume space
>>> from iovec, always.
>>>   
>>>       
>> Practically speaking, we ought to advertise a feature bit to let a  
>> kernel know that we are no longer broken.
>>
>> Otherwise, there are a ton of old userspaces that will break with new  
>> guests.
>>     
>
> My thinking is, first of all let's fix the bug.
> We'll add a feature bit when or if some guest wants to use it.
> Maybe this will be 100 years down the road when all old userspace
> has died a natural death :)
> Makes sense?
>   

I don't think it's useful to do this without adding a feature bit.  If 
we don't add a feature bit, the guest kernel cannot rely on this 
behavior so it means by definition this is dead code.
Michael S. Tsirkin - Nov. 24, 2009, 9:30 p.m.
On Tue, Nov 24, 2009 at 03:04:47PM -0600, Anthony Liguori wrote:
> Michael S. Tsirkin wrote:
>> On Tue, Nov 24, 2009 at 01:50:25PM -0600, Anthony Liguori wrote:
>>   
>>> Michael S. Tsirkin wrote:
>>>     
>>>> virtio net currently assumes that the first s/g element it gets is
>>>> always virtio net header. This is wrong.
>>>> There should be no assumption on sg boundaries.  For example, the guest
>>>> should be able to put the virtio_net_hdr in the front of the skbuf data
>>>> if there is room.  Get rid of this assumption, properly consume space
>>>> from iovec, always.
>>>>         
>>> Practically speaking, we ought to advertise a feature bit to let a   
>>> kernel know that we are no longer broken.
>>>
>>> Otherwise, there are a ton of old userspaces that will break with new 
>>>  guests.
>>>     
>>
>> My thinking is, first of all let's fix the bug.
>> We'll add a feature bit when or if some guest wants to use it.
>> Maybe this will be 100 years down the road when all old userspace
>> has died a natural death :)
>> Makes sense?
>>   
>
> I don't think it's useful to do this without adding a feature bit.
> If we don't add a feature bit, the guest kernel cannot rely on this  
> behavior so it means by definition this is dead code.


It's useful because this way I won't have to maintain the fix, and it
will make it possible for guests to experiment with layouts, without
hacking qemu. If someone wants to make it a product, that's a different
thing.

Also, it might be a valid thing for a guest to say that host needs to be
fixed. Not everyone might care about running on any possible broken qemu
version.

Finally - where do we draw the line? Does any bugfix need a feature bit?

> -- 
> Regards,
>
> Anthony Liguori
Anthony Liguori - Nov. 24, 2009, 9:57 p.m.
Michael S. Tsirkin wrote:
> It's useful because this way I won't have to maintain the fix, and it
> will make it possible for guests to experiment with layouts, without
> hacking qemu. If someone wants to make it a product, that's a different
> thing.
>   

Advertising a new feature is not hard.  It's one line of code in qemu 
with Rusty's ACK.

> Also, it might be a valid thing for a guest to say that host needs to be
> fixed. Not everyone might care about running on any possible broken qemu
> version.
>
> Finally - where do we draw the line? Does any bugfix need a feature bit?
>   

This is a spec bug, not a qemu bug IMHO.

The kernel drivers have always behaved this way and when this was all 
written originally, the semantics were never defined.  All the userspace 
implementations relied on this.  The fact that the spec claims a 
different behavior is a result of deciding that it should be different 
after the fact.

Thinking about it more, your patch is broken.  If you run it against a 
really old guest, it will break badly.

You assume that the guest header is always a fixed size.  It's not, 
we've added fields as we've added feature bits.  You actually have to 
look at the set of Ack'd features bits to determine how large the header is.

Which is why I now remember why this has never changed.  It's a PITA 
:-)  I'm not even sure you can do it in a correct way.
Michael S. Tsirkin - Nov. 24, 2009, 10:20 p.m.
On Tue, Nov 24, 2009 at 03:57:48PM -0600, Anthony Liguori wrote:
> Michael S. Tsirkin wrote:
>> It's useful because this way I won't have to maintain the fix, and it
>> will make it possible for guests to experiment with layouts, without
>> hacking qemu. If someone wants to make it a product, that's a different
>> thing.
>>   
>
> Advertising a new feature is not hard.  It's one line of code in qemu  
> with Rusty's ACK.

Okay. Rusty, ACK?

>> Also, it might be a valid thing for a guest to say that host needs to be
>> fixed. Not everyone might care about running on any possible broken qemu
>> version.
>>
>> Finally - where do we draw the line? Does any bugfix need a feature bit?
>>   
>
> This is a spec bug, not a qemu bug IMHO.
>
> The kernel drivers have always behaved this way and when this was all  
> written originally, the semantics were never defined.  All the userspace  
> implementations relied on this.  The fact that the spec claims a  
> different behavior is a result of deciding that it should be different  
> after the fact.
>
> Thinking about it more, your patch is broken.  If you run it against a  
> really old guest, it will break badly.

From what I can tell, my patch will behave exactly as
existing code does unless guest puts more data than virtio net hdr
size in the first element. In that last case, qemu called exit(1)
and I handle it properly.

> You assume that the guest header is always a fixed size.  It's not,  
> we've added fields as we've added feature bits.

Hmm. How so? Look at old qemu code I am replacing. qemu
just exits if header size is different from sizeof virtio_net_hdr.

> You actually have to  
> look at the set of Ack'd features bits to determine how large the header 
> is.

I think you are speaking about the mergeable header thing.
If you look at the code, I think you will see that I handle this
case correctly.

> Which is why I now remember why this has never changed.  It's a PITA :-)  
> I'm not even sure you can do it in a correct way.

I think I got it right. If not, let me know which feature is handled
wrong please.

> -- 
> Regards,
>
> Anthony Liguori
Michael S. Tsirkin - Nov. 30, 2009, 11:16 a.m.
On Wed, Nov 25, 2009 at 12:20:05AM +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 24, 2009 at 03:57:48PM -0600, Anthony Liguori wrote:
> > Michael S. Tsirkin wrote:
> >> It's useful because this way I won't have to maintain the fix, and it
> >> will make it possible for guests to experiment with layouts, without
> >> hacking qemu. If someone wants to make it a product, that's a different
> >> thing.
> >>   
> >
> > Advertising a new feature is not hard.  It's one line of code in qemu  
> > with Rusty's ACK.
> 
> Okay. Rusty, ACK?
> 
> >> Also, it might be a valid thing for a guest to say that host needs to be
> >> fixed. Not everyone might care about running on any possible broken qemu
> >> version.
> >>
> >> Finally - where do we draw the line? Does any bugfix need a feature bit?
> >>   
> >
> > This is a spec bug, not a qemu bug IMHO.
> >
> > The kernel drivers have always behaved this way and when this was all  
> > written originally, the semantics were never defined.  All the userspace  
> > implementations relied on this.  The fact that the spec claims a  
> > different behavior is a result of deciding that it should be different  
> > after the fact.
> >
> > Thinking about it more, your patch is broken.  If you run it against a  
> > really old guest, it will break badly.
> 
> From what I can tell, my patch will behave exactly as
> existing code does unless guest puts more data than virtio net hdr
> size in the first element. In that last case, qemu called exit(1)
> and I handle it properly.
> 
> > You assume that the guest header is always a fixed size.  It's not,  
> > we've added fields as we've added feature bits.
> 
> Hmm. How so? Look at old qemu code I am replacing. qemu
> just exits if header size is different from sizeof virtio_net_hdr.
> 
> > You actually have to  
> > look at the set of Ack'd features bits to determine how large the header 
> > is.
> 
> I think you are speaking about the mergeable header thing.
> If you look at the code, I think you will see that I handle this
> case correctly.
> 
> > Which is why I now remember why this has never changed.  It's a PITA :-)  
> > I'm not even sure you can do it in a correct way.
> 
> I think I got it right. If not, let me know which feature is handled
> wrong please.

Anthony could you comment please?  I think the patch is correct, and I
even tested on a relatively old guest.
If you still think it is wrong, please tell me why.

> > -- 
> > Regards,
> >
> > Anthony Liguori

Patch

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 2f147e5..06c5148 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -434,26 +434,59 @@  static int iov_fill(struct iovec *iov, int iovcnt, const void *buf, int count)
     return offset;
 }
 
+static int iov_skip(struct iovec *iov, int iovcnt, int count)
+{
+    int offset, i;
+
+    offset = i = 0;
+    while (offset < count && i < iovcnt) {
+        int len = MIN(iov[i].iov_len, count - offset);
+	iov[i].iov_base += len;
+	iov[i].iov_len -= len;
+        offset += len;
+        i++;
+    }
+
+    return offset;
+}
+
+static int iov_copy(struct iovec *to, struct iovec *from, int iovcnt, int count)
+{
+    int offset, i;
+
+    offset = i = 0;
+    while (offset < count && i < iovcnt) {
+        int len = MIN(from[i].iov_len, count - offset);
+        to[i].iov_base = from[i].iov_base;
+        to[i].iov_len = from[i].iov_len;
+        offset += len;
+        i++;
+    }
+
+    return i;
+}
+
 static int receive_header(VirtIONet *n, struct iovec *iov, int iovcnt,
                           const void *buf, size_t size, size_t hdr_len)
 {
-    struct virtio_net_hdr *hdr = (struct virtio_net_hdr *)iov[0].iov_base;
+    struct virtio_net_hdr hdr = {};
     int offset = 0;
 
-    hdr->flags = 0;
-    hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
+    hdr.flags = 0;
+    hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
 
     if (n->has_vnet_hdr) {
-        memcpy(hdr, buf, sizeof(*hdr));
-        offset = sizeof(*hdr);
-        work_around_broken_dhclient(hdr, buf + offset, size - offset);
+        memcpy(&hdr, buf, sizeof hdr);
+        offset = sizeof hdr;
+        work_around_broken_dhclient(&hdr, buf + offset, size - offset);
     }
 
+    iov_fill(iov, iovcnt, &hdr, sizeof hdr);
+
     /* We only ever receive a struct virtio_net_hdr from the tapfd,
      * but we may be passing along a larger header to the guest.
      */
-    iov[0].iov_base += hdr_len;
-    iov[0].iov_len  -= hdr_len;
+    iov_skip(iov, iovcnt, hdr_len);
 
     return offset;
 }
@@ -514,7 +547,8 @@  static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
 static ssize_t virtio_net_receive(VLANClientState *vc, const uint8_t *buf, size_t size)
 {
     VirtIONet *n = vc->opaque;
-    struct virtio_net_hdr_mrg_rxbuf *mhdr = NULL;
+    struct iovec mhdr[VIRTQUEUE_MAX_SIZE];
+    int mhdrcnt = 0;
     size_t hdr_len, offset, i;
 
     if (!virtio_net_can_receive(n->vc))
@@ -552,16 +586,13 @@  static ssize_t virtio_net_receive(VLANClientState *vc, const uint8_t *buf, size_
             exit(1);
         }
 
-        if (!n->mergeable_rx_bufs && elem.in_sg[0].iov_len != hdr_len) {
-            fprintf(stderr, "virtio-net header not in first element\n");
-            exit(1);
-        }
-
         memcpy(&sg, &elem.in_sg[0], sizeof(sg[0]) * elem.in_num);
 
         if (i == 0) {
-            if (n->mergeable_rx_bufs)
-                mhdr = (struct virtio_net_hdr_mrg_rxbuf *)sg[0].iov_base;
+            if (n->mergeable_rx_bufs) {
+                mhdrcnt = iov_copy(mhdr, sg, elem.in_num,
+                                   sizeof(struct virtio_net_hdr_mrg_rxbuf));
+            }
 
             offset += receive_header(n, sg, elem.in_num,
                                      buf + offset, size - offset, hdr_len);
@@ -579,8 +610,12 @@  static ssize_t virtio_net_receive(VLANClientState *vc, const uint8_t *buf, size_
         offset += len;
     }
 
-    if (mhdr)
-        mhdr->num_buffers = i;
+    if (mhdrcnt) {
+        uint16_t num = i;
+        iov_skip(mhdr, mhdrcnt,
+                 offsetof(struct virtio_net_hdr_mrg_rxbuf, num_buffers));
+        iov_fill(mhdr, mhdrcnt, &num, sizeof num);
+    }
 
     virtqueue_flush(n->rx_vq, i);
     virtio_notify(&n->vdev, n->rx_vq);
@@ -627,20 +662,19 @@  static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
             sizeof(struct virtio_net_hdr_mrg_rxbuf) :
             sizeof(struct virtio_net_hdr);
 
-        if (out_num < 1 || out_sg->iov_len != hdr_len) {
-            fprintf(stderr, "virtio-net header not in first element\n");
+        if (out_num < 1) {
+            fprintf(stderr, "virtio-net: no output element\n");
             exit(1);
         }
 
         /* ignore the header if GSO is not supported */
         if (!n->has_vnet_hdr) {
-            out_num--;
-            out_sg++;
+            iov_skip(out_sg, out_num, hdr_len);
             len += hdr_len;
         } else if (n->mergeable_rx_bufs) {
             /* tapfd expects a struct virtio_net_hdr */
             hdr_len -= sizeof(struct virtio_net_hdr);
-            out_sg->iov_len -= hdr_len;
+            iov_skip(out_sg, out_num, hdr_len);
             len += hdr_len;
         }