[2/3,RFC] Changes for MQ virtio-net

Message ID 20110302100600.GB31309@redhat.com
State RFC, archived
Delegated to: David Miller

Commit Message

Michael S. Tsirkin March 2, 2011, 10:06 a.m. UTC
On Tue, Mar 01, 2011 at 09:32:56PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 02/28/2011 03:13:20 PM:
> 
> Thank you once again for your feedback on both these patches.
> I will send the qemu patch tomorrow. I will also send the next
> version incorporating these suggestions once we finalize some
> minor points.
> 
> > Overall looks good.
> > The fact that numtxqs also means the number of rx queues needs some cleanup.
> > init/cleanup routines need more symmetry.
> > Error handling on setup also seems slightly buggy or at least asymmetrical.
> > Finally, this will use up a large number of MSI vectors,
> > while TX interrupts mostly stay unused.
> >
> > Some comments below.
> >
> > > +/* Maximum number of individual RX/TX queues supported */
> > > +#define VIRTIO_MAX_TXQS		 		 16
> > > +
> >
> > This also does not seem to belong in the header.
> 
> Both virtio-net and vhost need some check to make sure very
> high values are not passed by userspace. Is this not required?

Whatever we stick in the header is effectively part of the
host/guest interface. Are you sure we'll never want
more than 16 VQs? This value does not seem that high.

> > > +#define VIRTIO_NET_F_NUMTXQS		 21		 /* Device supports multiple TX queues */
> >
> > VIRTIO_NET_F_MULTIQUEUE ?
> 
> Yes, that's a better name.
> 
> > > @@ -34,6 +38,8 @@ struct virtio_net_config {
> > >  		 __u8 mac[6];
> > >  		 /* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
> > >  		 __u16 status;
> > > +		 /* number of RX/TX queues */
> > > +		 __u16 numtxqs;
> >
> > The interface here is a bit ugly:
> > - this is really both # of tx and rx queues but called numtxqs
> > - there's a hardcoded max value
> > - 0 is assumed to be same as 1
> > - assumptions above are undocumented.
> >
> > One way to address this could be num_queue_pairs, and something like
> > 		 /* The actual number of TX and RX queues is num_queue_pairs + 1 each. */
> > 		 __u16 num_queue_pairs;
> > (and tweak code to match).
> >
> > Alternatively, have separate registers for the number of tx and rx queues.
> 
> OK, so virtio_net_config has num_queue_pairs, and this gets converted to
> numtxqs in virtnet_info?

Or put num_queue_pairs in virtnet_info too.

> > > +struct virtnet_info {
> > > +		 struct send_queue **sq;
> > > +		 struct receive_queue **rq;
> > > +
> > > +		 /* read-mostly variables */
> > > +		 int numtxqs ____cacheline_aligned_in_smp;
> >
> > Why do you think this alignment is a win?
> 
> Actually this code was from the earlier patchset (MQ TX only) where
> the layout was different. Now rq and sq are allocated as follows:
> 	vi->sq = kzalloc(numtxqs * sizeof(*vi->sq), GFP_KERNEL);
> 	for (i = 0; i < numtxqs; i++) {
> 		vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
> Since the two pointers become read-only during use, there is no cache
> line dirtying.  I will remove this directive.
> 
> > > +/*
> > > + * Note for 'qnum' below:
> > > + *		 first 'numtxqs' vqs are RX, next 'numtxqs' vqs are TX.
> > > + */
> >
> > Another option to consider is to have them RX,TX,RX,TX:
> > this way vq->queue_index / 2 gives you the
> > queue pair number, no need to read numtxqs. On the other hand, it makes
> > the #RX==#TX assumption even more entrenched.
> 
> OK. I was following how many drivers were allocating RX and TX's
> together - eg ixgbe_adapter has tx_ring and rx_ring arrays; bnx2
> has rx_buf_ring and tx_buf_ring arrays, etc.

That's fine. I am only talking about the VQ numbers.
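
To be concrete, the two numberings would map like this (just a sketch,
macro names invented):

	/* current layout: first numtxqs vqs are RX, the next numtxqs are TX */
	#define rx_vq_blocked(qpair, n)		(qpair)
	#define tx_vq_blocked(qpair, n)		((qpair) + (n))

	/* interleaved layout: RX,TX,RX,TX,... - no need to read numtxqs */
	#define qpair_of_vq(vq_idx)		((vq_idx) / 2)
	#define rx_vq_interleaved(qpair)	((qpair) * 2)
	#define tx_vq_interleaved(qpair)	((qpair) * 2 + 1)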

> Also, vhost has some
> code that processes tx first before rx (e.g. vhost_net_stop/flush),

No idea why I did it this way. I don't think it matters.

> so this approach seemed helpful.
> I am OK either way, what do you
> suggest?

We get less code generated but also less flexibility.
I am not sure; I'll play around with the code. For now
let's keep it as is.


> > > +		 err = vi->vdev->config->find_vqs(vi->vdev, totalvqs, vqs, callbacks,
> > > +		 		 		 		 		  (const char **)names);
> > > +		 if (err)
> > > +		 		 goto free_params;
> > > +
> >
> > This would use up quite a lot of vectors. However,
> > tx interrupt is, in fact, slow path. So, assuming we don't have
> > enough vectors to use per vq, I think it's a good idea to
> > support reducing MSI vector usage by mapping all TX VQs to the same
> > vector and separate vectors for RX.
> > The hypervisor actually allows this, but we don't have an API at the
> > virtio level to pass that info to virtio pci ATM.
> > Any idea what a good API to use would be?
> 
> Yes, it is a waste to have these vectors for tx ints. I initially
> thought of adding a flag to virtio_device to pass to vp_find_vqs,
> but it won't work, so a new API is needed. I can work with you on
> this in the background if you like.


OK. For starters, how about we change find_vqs to get a structure?  Then
we can easily add flags that tell us that some interrupts are rare.
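
Roughly something like this (a sketch only - the flags field and the
flag name are invented here):

	struct virtqueue_info {
		struct virtqueue *vq;
		vq_callback_t *callback;
		const char *name;
		unsigned flags;
	#define VIRTIO_VQ_F_INTR_RARE	0x1	/* interrupt is slow path */
	};

	int (*find_vqs)(struct virtio_device *, unsigned nvqs,
			struct virtqueue_info vq_info[]);

virtio-pci could then route all vqs marked VIRTIO_VQ_F_INTR_RARE to a
single shared MSI-X vector.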

> > > +		 for (i = 0; i < numtxqs; i++) {
> > > +		 		 vi->rq[i]->rvq = vqs[i];
> > > +		 		 vi->sq[i]->svq = vqs[i + numtxqs];
> >
> > This logic is spread all over. We need some kind of macro to
> > get the queue number from the vq number and back.
> 
> Will add this.
> 
> > > +		 if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
> > > +		 		 vi->cvq = vqs[i + numtxqs];
> > > +
> > > +		 		 if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
> > > +		 		 		 vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
> >
> > This bit does not seem to belong in initialize_vqs.
> 
> I will move it back to probe.
> 
> > > +		 err = virtio_config_val(vdev, VIRTIO_NET_F_NUMTXQS,
> > > +		 		 		 		 offsetof(struct virtio_net_config, numtxqs),
> > > +		 		 		 		 &numtxqs);
> > > +
> > > +		 /* We need at least one txq */
> > > +		 if (err || !numtxqs)
> > > +		 		 numtxqs = 1;
> >
> > err is okay, but should we just fail on illegal values?
> > Or change the semantics:
> >	n = 0;
> >	err = virtio_config_val(vdev, VIRTIO_NET_F_NUMTXQS,
> >				offsetof(struct virtio_net_config, numtxqs),
> >				&n);
> >	numtxq = n + 1;
> 
> Will this be better:
> 	int num_queue_pairs = 2;
> 	int numtxqs;
> 
> 	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
> 				offsetof(struct virtio_net_config,
> 					 num_queue_pairs), &num_queue_pairs);
> 	<ignore error, if any>
> 	numtxqs = num_queue_pairs / 2;
> 
> > > +		 if (numtxqs > VIRTIO_MAX_TXQS)
> > > +		 		 return -EINVAL;
> >
> > Do we strictly need this?
> > I think we should just use whatever hardware has,
> > or alternatively somehow ignore the unused queues
> > (easy for tx, not sure about rx).
> 
> vq's are matched between qemu, virtio-net and vhost. Isn't some check
> required that userspace has not passed a bad value?


For virtio, I'm not too concerned: qemu can already easily
crash the guest :)
For vhost yes, but I'm concerned that even with 16 VQs we are
drinking a lot of resources already. I would be happier
if we had a file descriptor per VQ pair in some way.
Then the amount of memory userspace can use up is
limited by the # of file descriptors.

> > > +		 		 if (vi->rq[i]->num == 0) {
> > > +		 		 		 err = -ENOMEM;
> > > +		 		 		 goto free_recv_bufs;
> > > +		 		 }
> > >  		 }
> > If this fails for vq > 0, you have to detach bufs.
> 
> Right, will fix this.
> 
> > >  free_vqs:
> > > +		 for (i = 0; i < numtxqs; i++)
> > > +		 		 cancel_delayed_work_sync(&vi->rq[i]->refill);
> > >  		 vdev->config->del_vqs(vdev);
> > > -free:
> > > +		 free_rq_sq(vi);
> >
> > If we have a wrapper to init all vqs, pls add a wrapper to clean up
> > all vqs as well.
> 
> Will add that.
> 
> > > +		 for (i = 0; i < vi->numtxqs; i++) {
> > > +		 		 struct virtqueue *rvq = vi->rq[i]->rvq;
> > > +
> > > +		 		 while (1) {
> > > +		 		 		 buf = virtqueue_detach_unused_buf(rvq);
> > > +		 		 		 if (!buf)
> > > +		 		 		 		 break;
> > > +		 		 		 if (vi->mergeable_rx_bufs || vi->big_packets)
> > > +		 		 		 		 give_pages(vi->rq[i], buf);
> > > +		 		 		 else
> > > +		 		 		 		 dev_kfree_skb(buf);
> > > +		 		 		 --vi->rq[i]->num;
> > > +		 		 }
> > > +		 		 BUG_ON(vi->rq[i]->num != 0);
> > >  		 }
> > > -		 BUG_ON(vi->num != 0);
> > > +
> > > +		 free_rq_sq(vi);
> >
> >
> > This looks wrong here. This function should detach
> > and free all bufs, not internal malloc stuff.
> 
> That is being done by free_receive_buf after free_unused_bufs()
> returns. I hope this addresses your point.
> 
> > I think we should have free_unused_bufs that handles
> > a single queue, and call it in a loop.
> 
> OK, so define free_unused_bufs() as:
> 
> static void free_unused_bufs(struct virtnet_info *vi, struct virtqueue *svq,
> 			     struct virtqueue *rvq)
> {
> 	/* Use svq and rvq with the remaining code unchanged */
> }
> 
> Thanks,
> 
> - KK

Not sure I understand. I am just suggesting
adding symmetrical functions like init/cleanup,
alloc/free, etc., instead of adding stuff in random
functions that just happen to be called at the right time.
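
E.g. for free_unused_bufs() above, a per-queue helper along these lines
(just a sketch, based on the code quoted above):

	static void free_unused_rx_bufs(struct virtnet_info *vi, int qnum)
	{
		void *buf;

		while ((buf = virtqueue_detach_unused_buf(vi->rq[qnum]->rvq))) {
			if (vi->mergeable_rx_bufs || vi->big_packets)
				give_pages(vi->rq[qnum], buf);
			else
				dev_kfree_skb(buf);
			--vi->rq[qnum]->num;
		}
		BUG_ON(vi->rq[qnum]->num != 0);
	}

with a matching loop in the caller, so setup and teardown stay symmetrical.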

Comments

Krishna Kumar March 8, 2011, 3:32 p.m. UTC | #1
"Michael S. Tsirkin" <mst@redhat.com> wrote on 03/02/2011 03:36:00 PM:

Sorry for the delayed response, I have been sick the last few days.
I am responding to both your posts here.

> > Both virtio-net and vhost need some check to make sure very
> > high values are not passed by userspace. Is this not required?
>
> Whatever we stick in the header is effectively part of the
> host/guest interface. Are you sure we'll never want
> more than 16 VQs? This value does not seem that high.

OK, so even constants cannot change?  Given that, should I remove all
checks and use kcalloc?
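
I.e. something like this (sketch):

	/*
	 * Size the arrays from the device-supplied value; kcalloc()
	 * fails cleanly if the multiplication overflows, so no
	 * hard-coded VIRTIO_MAX_TXQS bound is needed.
	 */
	vi->sq = kcalloc(numtxqs, sizeof(*vi->sq), GFP_KERNEL);
	vi->rq = kcalloc(numtxqs, sizeof(*vi->rq), GFP_KERNEL);
	if (!vi->sq || !vi->rq)
		goto free_rq_sq;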

> > OK, so virtio_net_config has num_queue_pairs, and this gets converted to
> > numtxqs in virtnet_info?
>
> Or put num_queue_pairs in virtnet_info too.

For virtnet_info, having numtxqs is easier since all code that loops needs
only 'numtxqs'.

> > Also, vhost has some
> > code that processes tx first before rx (e.g. vhost_net_stop/flush),
>
> No idea why I did it this way. I don't think it matters.
>
> > so this approach seemed helpful.
> > I am OK either way, what do you
> > suggest?
>
> We get less code generated but also less flexibility.
> I am not sure; I'll play around with the code. For now
> let's keep it as is.

OK.

> > Yes, it is a waste to have these vectors for tx ints. I initially
> > thought of adding a flag to virtio_device to pass to vp_find_vqs,
> > but it won't work, so a new API is needed. I can work with you on
> > this in the background if you like.
>
> OK. For starters, how about we change find_vqs to get a structure?  Then
> we can easily add flags that tell us that some interrupts are rare.

Yes. OK to work on this outside this patch series, I guess?

> > vq's are matched between qemu, virtio-net and vhost. Isn't some check
> > required that userspace has not passed a bad value?
>
>
> For virtio, I'm not too concerned: qemu can already easily
> crash the guest :)
> For vhost yes, but I'm concerned that even with 16 VQs we are
> drinking a lot of resources already. I would be happier
> if we had a file descriptor per VQ pair in some way.
> Then the amount of memory userspace can use up is
> limited by the # of file descriptors.

I will start working on this approach this week and see how it goes.

> > OK, so define free_unused_bufs() as:
> >
> > static void free_unused_bufs(struct virtnet_info *vi, struct virtqueue *svq,
> > 		 		 		      struct virtqueue *rvq)
> > {
> > 		 /* Use svq and rvq with the remaining code unchanged */
> > }
>
> Not sure I understand. I am just suggesting
> adding symmetrical functions like init/cleanup,
> alloc/free, etc., instead of adding stuff in random
> functions that just happen to be called at the right time.

OK, I will clean up this part in the next revision.

> > I was not sure what the best way is - a sysctl parameter? Or should the
> > maximum depend on the number of host cpus? But that results in too many
> > threads, e.g. if I have 16 cpus and 16 txqs.
>
> I guess the question is, wouldn't # of threads == # of vqs work best?
> If we process stuff on a single CPU, let's make it pass through
> a single VQ.
> And to do this, we could simply open multiple vhost fds without
> changing vhost at all.
>
> Would this work well?
>
> > > > -		 		  enum vhost_net_poll_state tx_poll_state;
> > > > +		 		  enum vhost_net_poll_state *tx_poll_state;
> > >
> > > another array?
> >
> > Yes... I am also allocating twice the space that is required
> > to make its usage simple.
>
> Where's the allocation? Couldn't find it.

vhost_setup_vqs(net.c) allocates it based on nvqs, though numtxqs is
enough.
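
I.e. it could be sized as (sketch - assuming the usual 'struct vhost_net *n'
pointer used in net.c):

	/* only the TX vqs are polled, so numtxqs entries suffice */
	n->tx_poll_state = kcalloc(numtxqs, sizeof(*n->tx_poll_state),
				   GFP_KERNEL);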

Thanks,

- KK

Michael S. Tsirkin March 8, 2011, 3:41 p.m. UTC | #2
On Tue, Mar 08, 2011 at 09:02:38PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 03/02/2011 03:36:00 PM:
> 
> Sorry for the delayed response, I have been sick the last few days.
> I am responding to both your posts here.
> 
> > > Both virtio-net and vhost need some check to make sure very
> > > high values are not passed by userspace. Is this not required?
> >
> > Whatever we stick in the header is effectively part of the
> > host/guest interface. Are you sure we'll never want
> > more than 16 VQs? This value does not seem that high.
> 
> OK, so even constants cannot change?

Parse error :).

> Given that, should I remove all
> checks and use kcalloc?

I think that it's not a bad idea to avoid crashing if
the hardware (in our case, the host) misbehaves.
But as long as the code is prepared to handle any # of vqs,
I don't see the point of limiting that arbitrarily,
and if the code to handle hardware errors is complex,
it'll have bugs itself.

> > > OK, so virtio_net_config has num_queue_pairs, and this gets converted to
> > > numtxqs in virtnet_info?
> >
> > Or put num_queue_pairs in virtnet_info too.
> 
> For virtnet_info, having numtxqs is easier since all code that loops needs
> only 'numtxqs'.

It seemed to me that the code used numtxqs for # of rx qs as well
sometimes.

> > > Also, vhost has some
> > > code that processes tx first before rx (e.g. vhost_net_stop/flush),
> >
> > No idea why I did it this way. I don't think it matters.
> >
> > > so this approach seemed helpful.
> > > I am OK either way, what do you
> > > suggest?
> >
> > We get less code generated but also less flexibility.
> > I am not sure; I'll play around with the code. For now
> > let's keep it as is.
> 
> OK.
> 
> > > Yes, it is a waste to have these vectors for tx ints. I initially
> > > thought of adding a flag to virtio_device to pass to vp_find_vqs,
> > > but it won't work, so a new API is needed. I can work with you on
> > > this in the background if you like.
> >
> > OK. For starters, how about we change find_vqs to get a structure?  Then
> > we can easily add flags that tell us that some interrupts are rare.
> 
> Yes. OK to work on this outside this patch series, I guess?

We could work on this in parallel. Changing APIs is not a problem,
but I'm a bit concerned that this might affect the host/guest ABI as well.

> > > vq's are matched between qemu, virtio-net and vhost. Isn't some check
> > > required that userspace has not passed a bad value?
> >
> >
> > For virtio, I'm not too concerned: qemu can already easily
> > crash the guest :)
> > For vhost yes, but I'm concerned that even with 16 VQs we are
> > drinking a lot of resources already. I would be happier
> > if we had a file descriptor per VQ pair in some way.
> > Then the amount of memory userspace can use up is
> > limited by the # of file descriptors.
> 
> I will start working on this approach this week and see how it goes.

Also, could you post your current version of the qemu code pls?
It's useful for testing and to see the whole picture.

> > > OK, so define free_unused_bufs() as:
> > >
> > > static void free_unused_bufs(struct virtnet_info *vi, struct virtqueue *svq,
> > > 		 		 		      struct virtqueue *rvq)
> > > {
> > > 		 /* Use svq and rvq with the remaining code unchanged */
> > > }
> >
> > Not sure I understand. I am just suggesting
> > adding symmetrical functions like init/cleanup,
> > alloc/free, etc., instead of adding stuff in random
> > functions that just happen to be called at the right time.
> 
> OK, I will clean up this part in the next revision.
> 
> > > I was not sure what the best way is - a sysctl parameter? Or should the
> > > maximum depend on the number of host cpus? But that results in too many
> > > threads, e.g. if I have 16 cpus and 16 txqs.
> >
> > I guess the question is, wouldn't # of threads == # of vqs work best?
> > If we process stuff on a single CPU, let's make it pass through
> > a single VQ.
> > And to do this, we could simply open multiple vhost fds without
> > changing vhost at all.
> >
> > Would this work well?
> >
> > > > > -		 		  enum vhost_net_poll_state tx_poll_state;
> > > > > +		 		  enum vhost_net_poll_state *tx_poll_state;
> > > >
> > > > another array?
> > >
> > > Yes... I am also allocating twice the space that is required
> > > to make its usage simple.
> >
> > Where's the allocation? Couldn't find it.
> 
> vhost_setup_vqs(net.c) allocates it based on nvqs, though numtxqs is
> enough.
> 
> Thanks,
> 
> - KK
Krishna Kumar March 9, 2011, 6:01 a.m. UTC | #3
"Michael S. Tsirkin" <mst@redhat.com> wrote on 03/08/2011 09:11:04 PM:

> Also, could you post your current version of the qemu code pls?
> It's useful for testing and to see the whole picture.

Sorry for the delay on this.

I am attaching the qemu changes. Some parts of the patch are
completely redundant, e.g. MAX_TUN_DEVICES, and I will remove
them later.

It works with latest qemu and the kernel patch sent earlier.

Please let me know if there are any issues.

thanks,

- KK


(See attached file: qemu.patch)

Patch

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 800617b..2b765bb 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -78,7 +78,14 @@ 
  *	This gives the final feature bits for the device: it can change
  *	the dev->feature bits if it wants.
  */
+
 typedef void vq_callback_t(struct virtqueue *);
+struct virtqueue_info {
+	struct virtqueue *vq;
+	vq_callback_t *callback;
+	const char *name;
+};
+
 struct virtio_config_ops {
 	void (*get)(struct virtio_device *vdev, unsigned offset,
 		    void *buf, unsigned len);
@@ -88,9 +95,7 @@  struct virtio_config_ops {
 	void (*set_status)(struct virtio_device *vdev, u8 status);
 	void (*reset)(struct virtio_device *vdev);
 	int (*find_vqs)(struct virtio_device *, unsigned nvqs,
-			struct virtqueue *vqs[],
-			vq_callback_t *callbacks[],
-			const char *names[]);
+			struct virtqueue_info vq_info[]);
 	void (*del_vqs)(struct virtio_device *);
 	u32 (*get_features)(struct virtio_device *vdev);
 	void (*finalize_features)(struct virtio_device *vdev);