
[RFC,net-next,v2,1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit

Message ID 1515736720-39368-2-git-send-email-sridhar.samudrala@intel.com
State RFC, archived
Delegated to: David Miller
Series Enable virtio to act as a backup for a passthru device

Commit Message

Samudrala, Sridhar Jan. 12, 2018, 5:58 a.m. UTC
This feature bit can be used by the hypervisor to indicate that the virtio_net
device should act as a backup for another device with the same MAC address.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/virtio_net.c        | 2 +-
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)
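
For readers less familiar with virtio feature negotiation, here is a minimal
sketch (not part of the patch) of how the guest driver could test the bit once
it has been negotiated.  It assumes this patch is applied so that
VIRTIO_NET_F_BACKUP exists; virtio_has_feature() is the existing helper, and
what the driver actually does with the result is the subject of the rest of
the series.

#include <linux/virtio_config.h>
#include <linux/virtio_net.h>

static bool virtnet_backup_requested(struct virtio_device *vdev)
{
	/* True only if the device offered the bit and the driver accepted
	 * it during feature negotiation.
	 */
	return virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP);
}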

Comments

Alexander H Duyck Jan. 17, 2018, 6:15 p.m. UTC | #1
On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
<sridhar.samudrala@intel.com> wrote:
> This feature bit can be used by hypervisor to indicate virtio_net device to
> act as a backup for another device with the same MAC address.
>
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> ---
>  drivers/net/virtio_net.c        | 2 +-
>  include/uapi/linux/virtio_net.h | 3 +++
>  2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 12dfc5fee58e..f149a160a8c5 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
>         VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>         VIRTIO_NET_F_CTRL_MAC_ADDR, \
>         VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
> -       VIRTIO_NET_F_SPEED_DUPLEX
> +       VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
>
>  static unsigned int features[] = {
>         VIRTNET_FEATURES,
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index 5de6ed37695b..c7c35fd1a5ed 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -57,6 +57,9 @@
>                                          * Steering */
>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
>
> +#define VIRTIO_NET_F_BACKUP      62    /* Act as backup for another device
> +                                        * with the same MAC.
> +                                        */
>  #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
>
>  #ifndef VIRTIO_NET_NO_LEGACY

I'm not a huge fan of the name "backup" since that implies that the
Virtio interface is only used if the VF is not present, and there are
multiple cases, such as dealing with east/west or
broadcast/multicast traffic, where it may be desirable to use the
para-virtual interface rather than deal with the PCI overhead/bottleneck
of sending the packet.

What if instead of BACKUP we used the name SIDE_CHANNEL? Basically it
is a bit of a double entendre: we are using the physical MAC address
to provide configuration information, and in addition this
interface acts as a secondary channel for passing frames to and from
the guest rather than just using the VF.

Just a thought.

Thanks.

- Alex
Michael S. Tsirkin Jan. 17, 2018, 7:02 p.m. UTC | #2
On Wed, Jan 17, 2018 at 10:15:52AM -0800, Alexander Duyck wrote:
> On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
> <sridhar.samudrala@intel.com> wrote:
> > This feature bit can be used by hypervisor to indicate virtio_net device to
> > act as a backup for another device with the same MAC address.
> >
> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > ---
> >  drivers/net/virtio_net.c        | 2 +-
> >  include/uapi/linux/virtio_net.h | 3 +++
> >  2 files changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 12dfc5fee58e..f149a160a8c5 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
> >         VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
> >         VIRTIO_NET_F_CTRL_MAC_ADDR, \
> >         VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
> > -       VIRTIO_NET_F_SPEED_DUPLEX
> > +       VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
> >
> >  static unsigned int features[] = {
> >         VIRTNET_FEATURES,
> > diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> > index 5de6ed37695b..c7c35fd1a5ed 100644
> > --- a/include/uapi/linux/virtio_net.h
> > +++ b/include/uapi/linux/virtio_net.h
> > @@ -57,6 +57,9 @@
> >                                          * Steering */
> >  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
> >
> > +#define VIRTIO_NET_F_BACKUP      62    /* Act as backup for another device
> > +                                        * with the same MAC.
> > +                                        */
> >  #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
> >
> >  #ifndef VIRTIO_NET_NO_LEGACY
> 
> I'm not a huge fan of the name "backup" since that implies that the
> Virtio interface is only used if the VF is not present, and there are
> multiple instances such as dealing with east/west or
> broadcast/multicast traffic where it may be desirable to use the
> para-virtual interface rather then deal with PCI overhead/bottleneck
> to send the packet.

Right now hypervisors mostly expect that yes, only one at a time is
used.  E.g. if you try to do multicast sending packets on both VF and
virtio then you will end up with two copies of each packet.

To me the east/west scenario looks like you want something
more similar to a bridge on top of the virtio/PT pair.

So I suspect that use-case will need a separate configuration bit,
and possibly that's when you will want something more powerful
such as a full bridge.


> What if instead of BACKUP we used the name SIDE_CHANNEL? Basically it
> is a bit of double entendre as we are using the physical MAC address
> to provide configuration information, and then in addition this
> interface acts as a secondary channel for passing frames to and from
> the guest rather than just using the VF.
> 
> Just a thought.
> 
> Thanks.
> 
> - Alex

I just feel that's a very generic name, not conveying enough information
about how the hypervisor expects the pair of devices to be used.
Samudrala, Sridhar Jan. 17, 2018, 7:25 p.m. UTC | #3
On 1/17/2018 11:02 AM, Michael S. Tsirkin wrote:
> On Wed, Jan 17, 2018 at 10:15:52AM -0800, Alexander Duyck wrote:
>> On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
>> <sridhar.samudrala@intel.com> wrote:
>>> This feature bit can be used by hypervisor to indicate virtio_net device to
>>> act as a backup for another device with the same MAC address.
>>>
>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>> ---
>>>   drivers/net/virtio_net.c        | 2 +-
>>>   include/uapi/linux/virtio_net.h | 3 +++
>>>   2 files changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 12dfc5fee58e..f149a160a8c5 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
>>>          VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>>>          VIRTIO_NET_F_CTRL_MAC_ADDR, \
>>>          VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>>> -       VIRTIO_NET_F_SPEED_DUPLEX
>>> +       VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
>>>
>>>   static unsigned int features[] = {
>>>          VIRTNET_FEATURES,
>>> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
>>> index 5de6ed37695b..c7c35fd1a5ed 100644
>>> --- a/include/uapi/linux/virtio_net.h
>>> +++ b/include/uapi/linux/virtio_net.h
>>> @@ -57,6 +57,9 @@
>>>                                           * Steering */
>>>   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
>>>
>>> +#define VIRTIO_NET_F_BACKUP      62    /* Act as backup for another device
>>> +                                        * with the same MAC.
>>> +                                        */
>>>   #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
>>>
>>>   #ifndef VIRTIO_NET_NO_LEGACY
>> I'm not a huge fan of the name "backup" since that implies that the
>> Virtio interface is only used if the VF is not present, and there are
>> multiple instances such as dealing with east/west or
>> broadcast/multicast traffic where it may be desirable to use the
>> para-virtual interface rather then deal with PCI overhead/bottleneck
>> to send the packet.
> Right now hypervisors mostly expect that yes, only one at a time is
> used.  E.g. if you try to do multicast sending packets on both VF and
> virtio then you will end up with two copies of each packet.

I think we want to use only one interface to send out any given packet. In the
case of broadcasts/multicasts it would be an optimization to send them via
virtio, and this patch series adds that optimization.

In the receive path, broadcasts should only go to the PF and reach the VM
via virtio so that the VM doesn't see duplicate broadcasts.


>
> To me the east/west scenario looks like you want something
> more similar to a bridge on top of the virtio/PT pair.
>
> So I suspect that use-case will need a separate configuration bit,
> and possibly that's when you will want something more powerful
> such as a full bridge.

East-west optimization of unicasts would be a harder problem to solve, as
the hypervisor somehow needs to indicate to the VM which destination MACs are local.

>
>
>> What if instead of BACKUP we used the name SIDE_CHANNEL? Basically it
>> is a bit of double entendre as we are using the physical MAC address
>> to provide configuration information, and then in addition this
>> interface acts as a secondary channel for passing frames to and from
>> the guest rather than just using the VF.
>>
>> Just a thought.
>>
>> Thanks.
>>
>> - Alex
> I just feel that's a very generic name, not conveying enough information
> about how they hypervisor expects the pair of devices to be used.
>
Michael S. Tsirkin Jan. 17, 2018, 7:57 p.m. UTC | #4
On Wed, Jan 17, 2018 at 11:25:41AM -0800, Samudrala, Sridhar wrote:
> 
> 
> On 1/17/2018 11:02 AM, Michael S. Tsirkin wrote:
> > On Wed, Jan 17, 2018 at 10:15:52AM -0800, Alexander Duyck wrote:
> > > On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
> > > <sridhar.samudrala@intel.com> wrote:
> > > > This feature bit can be used by hypervisor to indicate virtio_net device to
> > > > act as a backup for another device with the same MAC address.
> > > > 
> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > > > ---
> > > >   drivers/net/virtio_net.c        | 2 +-
> > > >   include/uapi/linux/virtio_net.h | 3 +++
> > > >   2 files changed, 4 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > index 12dfc5fee58e..f149a160a8c5 100644
> > > > --- a/drivers/net/virtio_net.c
> > > > +++ b/drivers/net/virtio_net.c
> > > > @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
> > > >          VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
> > > >          VIRTIO_NET_F_CTRL_MAC_ADDR, \
> > > >          VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
> > > > -       VIRTIO_NET_F_SPEED_DUPLEX
> > > > +       VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
> > > > 
> > > >   static unsigned int features[] = {
> > > >          VIRTNET_FEATURES,
> > > > diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> > > > index 5de6ed37695b..c7c35fd1a5ed 100644
> > > > --- a/include/uapi/linux/virtio_net.h
> > > > +++ b/include/uapi/linux/virtio_net.h
> > > > @@ -57,6 +57,9 @@
> > > >                                           * Steering */
> > > >   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
> > > > 
> > > > +#define VIRTIO_NET_F_BACKUP      62    /* Act as backup for another device
> > > > +                                        * with the same MAC.
> > > > +                                        */
> > > >   #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
> > > > 
> > > >   #ifndef VIRTIO_NET_NO_LEGACY
> > > I'm not a huge fan of the name "backup" since that implies that the
> > > Virtio interface is only used if the VF is not present, and there are
> > > multiple instances such as dealing with east/west or
> > > broadcast/multicast traffic where it may be desirable to use the
> > > para-virtual interface rather then deal with PCI overhead/bottleneck
> > > to send the packet.
> > Right now hypervisors mostly expect that yes, only one at a time is
> > used.  E.g. if you try to do multicast sending packets on both VF and
> > virtio then you will end up with two copies of each packet.
> 
> I think we want to use only 1 interface to  send out any packet. In case of
> broadcast/multicasts it would be an optimization to send them via virtio and
> this patch series adds that optimization.

Right, that's what I think we should rather avoid for now.

It's *not* an optimization if there's a single VM on this host,
or if a specific multicast group does not have any VMs on the same
host.

I'd rather we just sent everything out on the PT if that's
there. The reason we have virtio in the picture is just so
we can migrate without downtime.


> In the receive path,  the broadcasts should only go the PF and reach the VM
> via vitio so that the VM doesn't see duplicate broadcasts.
> 
> 
> > 
> > To me the east/west scenario looks like you want something
> > more similar to a bridge on top of the virtio/PT pair.
> > 
> > So I suspect that use-case will need a separate configuration bit,
> > and possibly that's when you will want something more powerful
> > such as a full bridge.
> 
> east-west optimization of unicasts would be a harder problem to solve as
> somehow the hypervisor needs to indicate the VM about the local dest. macs

Using a bridge with a dedicated device for east/west would let the
bridge use standard learning techniques for that, perhaps?


> > 
> > 
> > > What if instead of BACKUP we used the name SIDE_CHANNEL? Basically it
> > > is a bit of double entendre as we are using the physical MAC address
> > > to provide configuration information, and then in addition this
> > > interface acts as a secondary channel for passing frames to and from
> > > the guest rather than just using the VF.
> > > 
> > > Just a thought.
> > > 
> > > Thanks.
> > > 
> > > - Alex
> > I just feel that's a very generic name, not conveying enough information
> > about how they hypervisor expects the pair of devices to be used.
> >
Alexander H Duyck Jan. 17, 2018, 9:49 p.m. UTC | #5
On Wed, Jan 17, 2018 at 11:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Jan 17, 2018 at 11:25:41AM -0800, Samudrala, Sridhar wrote:
>>
>>
>> On 1/17/2018 11:02 AM, Michael S. Tsirkin wrote:
>> > On Wed, Jan 17, 2018 at 10:15:52AM -0800, Alexander Duyck wrote:
>> > > On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
>> > > <sridhar.samudrala@intel.com> wrote:
>> > > > This feature bit can be used by hypervisor to indicate virtio_net device to
>> > > > act as a backup for another device with the same MAC address.
>> > > >
>> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > > > ---
>> > > >   drivers/net/virtio_net.c        | 2 +-
>> > > >   include/uapi/linux/virtio_net.h | 3 +++
>> > > >   2 files changed, 4 insertions(+), 1 deletion(-)
>> > > >
>> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> > > > index 12dfc5fee58e..f149a160a8c5 100644
>> > > > --- a/drivers/net/virtio_net.c
>> > > > +++ b/drivers/net/virtio_net.c
>> > > > @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
>> > > >          VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>> > > >          VIRTIO_NET_F_CTRL_MAC_ADDR, \
>> > > >          VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>> > > > -       VIRTIO_NET_F_SPEED_DUPLEX
>> > > > +       VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
>> > > >
>> > > >   static unsigned int features[] = {
>> > > >          VIRTNET_FEATURES,
>> > > > diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
>> > > > index 5de6ed37695b..c7c35fd1a5ed 100644
>> > > > --- a/include/uapi/linux/virtio_net.h
>> > > > +++ b/include/uapi/linux/virtio_net.h
>> > > > @@ -57,6 +57,9 @@
>> > > >                                           * Steering */
>> > > >   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
>> > > >
>> > > > +#define VIRTIO_NET_F_BACKUP      62    /* Act as backup for another device
>> > > > +                                        * with the same MAC.
>> > > > +                                        */
>> > > >   #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
>> > > >
>> > > >   #ifndef VIRTIO_NET_NO_LEGACY
>> > > I'm not a huge fan of the name "backup" since that implies that the
>> > > Virtio interface is only used if the VF is not present, and there are
>> > > multiple instances such as dealing with east/west or
>> > > broadcast/multicast traffic where it may be desirable to use the
>> > > para-virtual interface rather then deal with PCI overhead/bottleneck
>> > > to send the packet.
>> > Right now hypervisors mostly expect that yes, only one at a time is
>> > used.  E.g. if you try to do multicast sending packets on both VF and
>> > virtio then you will end up with two copies of each packet.
>>
>> I think we want to use only 1 interface to  send out any packet. In case of
>> broadcast/multicasts it would be an optimization to send them via virtio and
>> this patch series adds that optimization.
>
> Right that's what I think we should rather avoid for now.
>
> It's *not* an optimization if there's a single VM on this host,
> or if a specific multicast group does not have any VMs on same
> host.

Agreed. In my mind this is something that is controlled by the
pass-thru interface once it is enslaved.

> I'd rather we just sent everything out on the PT if that's
> there. The reason we have virtio in the picture is just so
> we can migrate without downtime.

I wasn't saying we do that in all cases. That would be something that
would have to be decided by the pass-thru interface. Ideally the
virtio would provide just enough information to get itself into the
bond, and I see this being the mechanism for it to do so. From there
the complexity mostly lies in the pass-thru interface configuring the
correct transmit modes if, for example, you have multiple pass-thru
interfaces or a more complex traffic setup due to things like
SwitchDev.

In my mind we go the bonding route and there are a few use cases for all
of this. First is the backup case that is being addressed here. That
becomes your basic "copy netvsc" approach, which would be the
default. It is how we would handle basic pass-thru back-up paths. If
the host decides to send multicast/broadcast traffic from the host up
through it, that is a host-side decision. I am okay with our default
transmit behavior from the guest being to send everything through the
pass-thru interface. All the virtio would be doing is advertising that
it is a side channel for some sort of bond with another interface.
Calling it a side channel implies that it isn't the direct channel
for the interface, but rather an alternative communications channel for
things like backup, and other future uses.
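
To make the "copy netvsc" idea above a little more concrete, here is a rough
sketch of what a MAC-match based auto-enslave could look like from the guest
side.  The netdev notifier machinery and ether_addr_equal() are existing
kernel APIs; virtnet_get_backup_dev() and virtnet_enslave_vf() are purely
hypothetical helpers used only for illustration.

#include <linux/netdevice.h>
#include <linux/etherdevice.h>

static int virtnet_netdev_event(struct notifier_block *nb,
				unsigned long event, void *ptr)
{
	struct net_device *vf = netdev_notifier_info_to_dev(ptr);
	struct net_device *backup;

	if (event != NETDEV_REGISTER)
		return NOTIFY_DONE;

	backup = virtnet_get_backup_dev(vf);		/* hypothetical */
	if (backup && ether_addr_equal(vf->perm_addr, backup->perm_addr))
		virtnet_enslave_vf(backup, vf);		/* hypothetical */

	return NOTIFY_OK;
}

static struct notifier_block virtnet_netdev_notifier = {
	.notifier_call = virtnet_netdev_event,
};
/* registered at module init with register_netdevice_notifier() */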

There end up being a few different "phase 2" concepts I have floating
around which I want to avoid limiting. Solving the east/west problem
is an example. In my mind that just becomes a bonding Tx mode that is
controlled via the pass-thru interface. The VF could black-list the
local addresses so that they instead fall into the virtio interface.
In addition I seem to recall watching a presentation from Mellanox
where they were talking about a VF per NUMA node in a system which
would imply multiple VFs with the same MAC address. I'm not sure how
the current patch handles that. Obviously that would require a more
complex load balancing setup. The bonding solution ends up being a way
to resolve that so that they could just have it take care of picking
the right Tx queue based on the NUMA affinity and fall back to the
virtio/netvsc when those fail.

>> In the receive path,  the broadcasts should only go the PF and reach the VM
>> via vitio so that the VM doesn't see duplicate broadcasts.
>>
>>
>> >
>> > To me the east/west scenario looks like you want something
>> > more similar to a bridge on top of the virtio/PT pair.
>> >
>> > So I suspect that use-case will need a separate configuration bit,
>> > and possibly that's when you will want something more powerful
>> > such as a full bridge.
>>
>> east-west optimization of unicasts would be a harder problem to solve as
>> somehow the hypervisor needs to indicate the VM about the local dest. macs
>
> Using a bridge with a dedicated device for east/west would let
> bridge use standard learning techniques for that perhaps?

I'm not sure about having the guest have to learn. In my mind the VF
should know who is local and who isn't. In the case of SwitchDev it
should be possible for the port representors and the switch to provide
data on which interfaces are bonded on the host side and which aren't.
With that data it would be pretty easy to just put together a list of
addresses that would prefer to go the para-virtual route instead of
being transmitted through physical hardware.

In addition, a bridge implies much more overhead, since normally a
bridge can receive a packet on one interface and transmit it on
another. We don't really need that. This is more of a VEPA type setup
and doesn't need to be anything all that complex. You could probably
even handle the Tx queue selection via a simple eBPF program and map,
since the input for whatever is used to select Tx should be pretty
simple: destination MAC, source NUMA node, etc., and the data set
shouldn't be too large.
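
As a toy illustration of that eBPF idea, something along these lines could map
destination MACs to a "prefer virtio" hint.  The map layout, section names and
return-value convention here are invented purely for illustration and are not
part of this series.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/types.h>
#include "bpf_helpers.h"

/* destination MAC -> 1 if the frame should prefer the virtio path */
struct bpf_map_def SEC("maps") local_macs = {
	.type        = BPF_MAP_TYPE_HASH,
	.key_size    = ETH_ALEN,
	.value_size  = sizeof(__u32),
	.max_entries = 1024,
};

SEC("classifier")
int select_tx_path(struct __sk_buff *skb)
{
	__u8 dmac[ETH_ALEN];
	__u32 *via_virtio;

	/* destination MAC is the first six bytes of the Ethernet header */
	if (bpf_skb_load_bytes(skb, 0, dmac, ETH_ALEN) < 0)
		return 0;				/* default: pass-thru */

	via_virtio = bpf_map_lookup_elem(&local_macs, dmac);
	return via_virtio ? *via_virtio : 0;		/* 1 = prefer virtio */
}

char _license[] SEC("license") = "GPL";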

>> > > What if instead of BACKUP we used the name SIDE_CHANNEL? Basically it
>> > > is a bit of double entendre as we are using the physical MAC address
>> > > to provide configuration information, and then in addition this
>> > > interface acts as a secondary channel for passing frames to and from
>> > > the guest rather than just using the VF.
>> > >
>> > > Just a thought.
>> > >
>> > > Thanks.
>> > >
>> > > - Alex
>> > I just feel that's a very generic name, not conveying enough information
>> > about how they hypervisor expects the pair of devices to be used.

I would argue that BACKUP is too specific. I can see many other uses
other than just being a backup interface and I don't want us to box
ourselves in. I agree that it makes sense for active/backup to be the
base mode, but I really don't think it conveys all of the
possibilities since I see this as possibly being able to solve
multiple issues as this evolves. I'm also open to other suggestions.
We could go with PASSTHRU_SIDE_CHANNEL for instance, I just didn't
suggest that since it seemed kind of wordy.
Michael S. Tsirkin Jan. 22, 2018, 9:31 p.m. UTC | #6
On Wed, Jan 17, 2018 at 01:49:58PM -0800, Alexander Duyck wrote:
> On Wed, Jan 17, 2018 at 11:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Wed, Jan 17, 2018 at 11:25:41AM -0800, Samudrala, Sridhar wrote:
> >>
> >>
> >> On 1/17/2018 11:02 AM, Michael S. Tsirkin wrote:
> >> > On Wed, Jan 17, 2018 at 10:15:52AM -0800, Alexander Duyck wrote:
> >> > > On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
> >> > > <sridhar.samudrala@intel.com> wrote:
> >> > > > This feature bit can be used by hypervisor to indicate virtio_net device to
> >> > > > act as a backup for another device with the same MAC address.
> >> > > >
> >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> > > > ---
> >> > > >   drivers/net/virtio_net.c        | 2 +-
> >> > > >   include/uapi/linux/virtio_net.h | 3 +++
> >> > > >   2 files changed, 4 insertions(+), 1 deletion(-)
> >> > > >
> >> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> > > > index 12dfc5fee58e..f149a160a8c5 100644
> >> > > > --- a/drivers/net/virtio_net.c
> >> > > > +++ b/drivers/net/virtio_net.c
> >> > > > @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
> >> > > >          VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
> >> > > >          VIRTIO_NET_F_CTRL_MAC_ADDR, \
> >> > > >          VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
> >> > > > -       VIRTIO_NET_F_SPEED_DUPLEX
> >> > > > +       VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
> >> > > >
> >> > > >   static unsigned int features[] = {
> >> > > >          VIRTNET_FEATURES,
> >> > > > diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> >> > > > index 5de6ed37695b..c7c35fd1a5ed 100644
> >> > > > --- a/include/uapi/linux/virtio_net.h
> >> > > > +++ b/include/uapi/linux/virtio_net.h
> >> > > > @@ -57,6 +57,9 @@
> >> > > >                                           * Steering */
> >> > > >   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
> >> > > >
> >> > > > +#define VIRTIO_NET_F_BACKUP      62    /* Act as backup for another device
> >> > > > +                                        * with the same MAC.
> >> > > > +                                        */
> >> > > >   #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
> >> > > >
> >> > > >   #ifndef VIRTIO_NET_NO_LEGACY
> >> > > I'm not a huge fan of the name "backup" since that implies that the
> >> > > Virtio interface is only used if the VF is not present, and there are
> >> > > multiple instances such as dealing with east/west or
> >> > > broadcast/multicast traffic where it may be desirable to use the
> >> > > para-virtual interface rather then deal with PCI overhead/bottleneck
> >> > > to send the packet.
> >> > Right now hypervisors mostly expect that yes, only one at a time is
> >> > used.  E.g. if you try to do multicast sending packets on both VF and
> >> > virtio then you will end up with two copies of each packet.
> >>
> >> I think we want to use only 1 interface to  send out any packet. In case of
> >> broadcast/multicasts it would be an optimization to send them via virtio and
> >> this patch series adds that optimization.
> >
> > Right that's what I think we should rather avoid for now.
> >
> > It's *not* an optimization if there's a single VM on this host,
> > or if a specific multicast group does not have any VMs on same
> > host.
> 
> Agreed. In my mind this is something that is controlled by the
> pass-thru interface once it is enslaved.

It would be pretty tricky to control through the PT
interface since a PT interface pretends to be a physical
device, which has no concept of VMs.

> > I'd rather we just sent everything out on the PT if that's
> > there. The reason we have virtio in the picture is just so
> > we can migrate without downtime.
> 
> I wasn't saying we do that in all cases. That would be something that
> would have to be decided by the pass-thru interface. Ideally the
> virtio would provide just enough information to get itself into the
> bond and I see this being the mechanism for it to do so. From there
> the complexity mostly lies in the pass-thru interface to configure the
> correct transmit modes if for example you have multiple pass-thru
> interfaces or a more complex traffic setup due to things like
> SwitchDev.
> 
> In my mind we go the bonding route and there are few use cases for all
> of this. First is the backup case that is being addressed here. That
> becomes your basic "copy netvsc" approach for this which would be
> default. It is how we would handle basic pass-thru back-up paths. If
> the host decides to send multicast/broadcast traffic from the host up
> through it that is a host side decision. I am okay with our default
> transmit behavior from the guest being to send everything through the
> pass-thru interface. All the virtio would be doing is advertising that
> it is a side channel for some sort of bond with another interface. By
> calling it a side channel it implies that it isn't the direct channel
> for the interface, but it is an alternative communications channel for
> things like backup, and other future uses.
> 
> There end up being a few different "phase 2" concepts I have floating
> around which I want to avoid limiting. Solving the east/west problem
> is an example. In my mind that just becomes a bonding Tx mode that is
> controlled via the pass-thru interface. The VF could black-list the
> local addresses so that they instead fall into the virtio interface.
> In addition I seem to recall watching a presentation from Mellanox
> where they were talking about a VF per NUMA node in a system which
> would imply multiple VFs with the same MAC address. I'm not sure how
> the current patch handles that. Obviously that would require a more
> complex load balancing setup. The bonding solution ends up being a way
> to resolve that so that they could just have it take care of picking
> the right Tx queue based on the NUMA affinity and fall back to the
> virtio/netvsc when those fail.

The way I see it, we'd need to pass a bunch of extra information from
host to guest, and we'd have to use a PV interface for it.
When we do this, we'll need to have another
feature bit, and we can call it SIDE_CHANNEL or whatever.


> >> In the receive path,  the broadcasts should only go the PF and reach the VM
> >> via vitio so that the VM doesn't see duplicate broadcasts.
> >>
> >>
> >> >
> >> > To me the east/west scenario looks like you want something
> >> > more similar to a bridge on top of the virtio/PT pair.
> >> >
> >> > So I suspect that use-case will need a separate configuration bit,
> >> > and possibly that's when you will want something more powerful
> >> > such as a full bridge.
> >>
> >> east-west optimization of unicasts would be a harder problem to solve as
> >> somehow the hypervisor needs to indicate the VM about the local dest. macs
> >
> > Using a bridge with a dedicated device for east/west would let
> > bridge use standard learning techniques for that perhaps?
> 
> I'm not sure about having the guest have to learn.

It's certainly a way to do this, but certainly not the only one.

> In my mind the VF
> should know who is local and who isn't.

Right. But note that these things change.

> In the case of SwitchDev it
> should be possible for the port representors and the switch to provide
> data on which interfaces are bonded on the host side and which aren't.
> With that data it would be pretty easy to just put together a list of
> addresses that would prefer to go the para-virtual route instead of
> being transmitted through physical hardware.
> 
> In addition a bridge implies much more overhead since normally a
> bridge can receive a packet in on one interface and transmit it on
> another. We don't really need that. This is more of a VEPA type setup
> and doesn't need to be anything all that complex. You could probably
> even handle the Tx queue selection via a simple eBPF program and map
> since the input for whatever is used to select Tx should be pretty
> simple, destination MAC, source NUMA node, etc, and the data-set
> shouldn't be too large.

That sounds interesting. A separate device might make this kind of setup
a bit easier.  Sridhar, did you look into creating a separate device for
the virtual bond device at all?  It does not have to be in a separate
module, that kind of refactoring can come later, but once we commit to
using the same single device as virtio, we can't change that.


> >> > > What if instead of BACKUP we used the name SIDE_CHANNEL? Basically it
> >> > > is a bit of double entendre as we are using the physical MAC address
> >> > > to provide configuration information, and then in addition this
> >> > > interface acts as a secondary channel for passing frames to and from
> >> > > the guest rather than just using the VF.
> >> > >
> >> > > Just a thought.
> >> > >
> >> > > Thanks.
> >> > >
> >> > > - Alex
> >> > I just feel that's a very generic name, not conveying enough information
> >> > about how they hypervisor expects the pair of devices to be used.
> 
> I would argue that BACKUP is too specific. I can see many other uses
> other than just being a backup interface

True, but the ones you listed above all seem to require virtio
changes anyway; we'll be able to add a new feature bit or
rename this one as we make them.

> and I don't want us to box
> ourselves in. I agree that it makes sense for active/backup to be the
> base mode, but I really don't think it conveys all of the
> possibilities since I see this as possibly being able to solve
> multiple issues as this evolves. I'm also open to other suggestions.
> We could go with PASSTHRU_SIDE_CHANNEL for instance, I just didn't
> suggest that since it seemed kind of wordy.

There's only so much info a single bit can confer :)
So we can always rename later; the point is to draw the line
and say "this is the functionality current hosts expect".
Samudrala, Sridhar Jan. 22, 2018, 11:27 p.m. UTC | #7
On 1/22/2018 1:31 PM, Michael S. Tsirkin wrote:
> On Wed, Jan 17, 2018 at 01:49:58PM -0800, Alexander Duyck wrote:
>> On Wed, Jan 17, 2018 at 11:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> On Wed, Jan 17, 2018 at 11:25:41AM -0800, Samudrala, Sridhar wrote:
>>>>
>>>> On 1/17/2018 11:02 AM, Michael S. Tsirkin wrote:
>>>>> On Wed, Jan 17, 2018 at 10:15:52AM -0800, Alexander Duyck wrote:
>>>>>> On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
>>>>>> <sridhar.samudrala@intel.com> wrote:
>>>>>>> This feature bit can be used by hypervisor to indicate virtio_net device to
>>>>>>> act as a backup for another device with the same MAC address.
>>>>>>>
>>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>>>> ---
>>>>>>>    drivers/net/virtio_net.c        | 2 +-
>>>>>>>    include/uapi/linux/virtio_net.h | 3 +++
>>>>>>>    2 files changed, 4 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>>>> index 12dfc5fee58e..f149a160a8c5 100644
>>>>>>> --- a/drivers/net/virtio_net.c
>>>>>>> +++ b/drivers/net/virtio_net.c
>>>>>>> @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
>>>>>>>           VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>>>>>>>           VIRTIO_NET_F_CTRL_MAC_ADDR, \
>>>>>>>           VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>>>>>>> -       VIRTIO_NET_F_SPEED_DUPLEX
>>>>>>> +       VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
>>>>>>>
>>>>>>>    static unsigned int features[] = {
>>>>>>>           VIRTNET_FEATURES,
>>>>>>> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
>>>>>>> index 5de6ed37695b..c7c35fd1a5ed 100644
>>>>>>> --- a/include/uapi/linux/virtio_net.h
>>>>>>> +++ b/include/uapi/linux/virtio_net.h
>>>>>>> @@ -57,6 +57,9 @@
>>>>>>>                                            * Steering */
>>>>>>>    #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
>>>>>>>
>>>>>>> +#define VIRTIO_NET_F_BACKUP      62    /* Act as backup for another device
>>>>>>> +                                        * with the same MAC.
>>>>>>> +                                        */
>>>>>>>    #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
>>>>>>>
>>>>>>>    #ifndef VIRTIO_NET_NO_LEGACY
>>>>>> I'm not a huge fan of the name "backup" since that implies that the
>>>>>> Virtio interface is only used if the VF is not present, and there are
>>>>>> multiple instances such as dealing with east/west or
>>>>>> broadcast/multicast traffic where it may be desirable to use the
>>>>>> para-virtual interface rather then deal with PCI overhead/bottleneck
>>>>>> to send the packet.
>>>>> Right now hypervisors mostly expect that yes, only one at a time is
>>>>> used.  E.g. if you try to do multicast sending packets on both VF and
>>>>> virtio then you will end up with two copies of each packet.
>>>> I think we want to use only 1 interface to  send out any packet. In case of
>>>> broadcast/multicasts it would be an optimization to send them via virtio and
>>>> this patch series adds that optimization.
>>> Right that's what I think we should rather avoid for now.
>>>
>>> It's *not* an optimization if there's a single VM on this host,
>>> or if a specific multicast group does not have any VMs on same
>>> host.
>> Agreed. In my mind this is something that is controlled by the
>> pass-thru interface once it is enslaved.
> It would be pretty tricky to control through the PT
> interface since a PT interface pretends to be a physical
> device, which has no concept of VMs.
>
>>> I'd rather we just sent everything out on the PT if that's
>>> there. The reason we have virtio in the picture is just so
>>> we can migrate without downtime.
>> I wasn't saying we do that in all cases. That would be something that
>> would have to be decided by the pass-thru interface. Ideally the
>> virtio would provide just enough information to get itself into the
>> bond and I see this being the mechanism for it to do so. From there
>> the complexity mostly lies in the pass-thru interface to configure the
>> correct transmit modes if for example you have multiple pass-thru
>> interfaces or a more complex traffic setup due to things like
>> SwitchDev.
>>
>> In my mind we go the bonding route and there are few use cases for all
>> of this. First is the backup case that is being addressed here. That
>> becomes your basic "copy netvsc" approach for this which would be
>> default. It is how we would handle basic pass-thru back-up paths. If
>> the host decides to send multicast/broadcast traffic from the host up
>> through it that is a host side decision. I am okay with our default
>> transmit behavior from the guest being to send everything through the
>> pass-thru interface. All the virtio would be doing is advertising that
>> it is a side channel for some sort of bond with another interface. By
>> calling it a side channel it implies that it isn't the direct channel
>> for the interface, but it is an alternative communications channel for
>> things like backup, and other future uses.
>>
>> There end up being a few different "phase 2" concepts I have floating
>> around which I want to avoid limiting. Solving the east/west problem
>> is an example. In my mind that just becomes a bonding Tx mode that is
>> controlled via the pass-thru interface. The VF could black-list the
>> local addresses so that they instead fall into the virtio interface.
>> In addition I seem to recall watching a presentation from Mellanox
>> where they were talking about a VF per NUMA node in a system which
>> would imply multiple VFs with the same MAC address. I'm not sure how
>> the current patch handles that. Obviously that would require a more
>> complex load balancing setup. The bonding solution ends up being a way
>> to resolve that so that they could just have it take care of picking
>> the right Tx queue based on the NUMA affinity and fall back to the
>> virtio/netvsc when those fail.
> The way I see it, we'd need to pass a bunch of extra information
> host to guest, and we'd have to use a PV interface for it.
> When we do this, we'll need to have another
> feature bit, and we can call it SIDE_CHANNEL or whatever.
>
>
>>>> In the receive path,  the broadcasts should only go the PF and reach the VM
>>>> via vitio so that the VM doesn't see duplicate broadcasts.
>>>>
>>>>
>>>>> To me the east/west scenario looks like you want something
>>>>> more similar to a bridge on top of the virtio/PT pair.
>>>>>
>>>>> So I suspect that use-case will need a separate configuration bit,
>>>>> and possibly that's when you will want something more powerful
>>>>> such as a full bridge.
>>>> east-west optimization of unicasts would be a harder problem to solve as
>>>> somehow the hypervisor needs to indicate the VM about the local dest. macs
>>> Using a bridge with a dedicated device for east/west would let
>>> bridge use standard learning techniques for that perhaps?
>> I'm not sure about having the guest have to learn.
> It's certainly a way to do this, but certainly not the only one.
>
>> In my mind the VF
>> should know who is local and who isn't.
> Right. But note that these things change.
>
>> In the case of SwitchDev it
>> should be possible for the port representors and the switch to provide
>> data on which interfaces are bonded on the host side and which aren't.
>> With that data it would be pretty easy to just put together a list of
>> addresses that would prefer to go the para-virtual route instead of
>> being transmitted through physical hardware.
>>
>> In addition a bridge implies much more overhead since normally a
>> bridge can receive a packet in on one interface and transmit it on
>> another. We don't really need that. This is more of a VEPA type setup
>> and doesn't need to be anything all that complex. You could probably
>> even handle the Tx queue selection via a simple eBPF program and map
>> since the input for whatever is used to select Tx should be pretty
>> simple, destination MAC, source NUMA node, etc, and the data-set
>> shouldn't be too large.
> That sounds interesting. A separate device might make this kind of setup
> a bit easier.  Sridhar, did you look into creating a separate device for
> the virtual bond device at all?  It does not have to be in a separate
> module, that kind of refactoring can come later, but once we commit to
> using the same single device as virtio, we can't change that.

No, I haven't looked into creating a separate device. If we are going to
create a new device, I guess it has to be of a new device type with its
own driver.

As we are using virtio_net to control and manage the VF data path, it is
not clear to me what the advantage is of creating a new device rather
than extending virtio_net to manage the VF datapath via a transparent
bond mechanism.

Thanks
Sridhar
Stephen Hemminger Jan. 23, 2018, 12:02 a.m. UTC | #8
On Mon, 22 Jan 2018 15:27:40 -0800
"Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:

> On 1/22/2018 1:31 PM, Michael S. Tsirkin wrote:
> > On Wed, Jan 17, 2018 at 01:49:58PM -0800, Alexander Duyck wrote:  
> >> On Wed, Jan 17, 2018 at 11:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote:  
> >>> On Wed, Jan 17, 2018 at 11:25:41AM -0800, Samudrala, Sridhar wrote:  
> >>>>
> >>>> On 1/17/2018 11:02 AM, Michael S. Tsirkin wrote:  
> >>>>> On Wed, Jan 17, 2018 at 10:15:52AM -0800, Alexander Duyck wrote:  
> >>>>>> On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
> >>>>>> <sridhar.samudrala@intel.com> wrote:  
> >>>>>>> This feature bit can be used by hypervisor to indicate virtio_net device to
> >>>>>>> act as a backup for another device with the same MAC address.
> >>>>>>>
> >>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >>>>>>> ---
> >>>>>>>    drivers/net/virtio_net.c        | 2 +-
> >>>>>>>    include/uapi/linux/virtio_net.h | 3 +++
> >>>>>>>    2 files changed, 4 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >>>>>>> index 12dfc5fee58e..f149a160a8c5 100644
> >>>>>>> --- a/drivers/net/virtio_net.c
> >>>>>>> +++ b/drivers/net/virtio_net.c
> >>>>>>> @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
> >>>>>>>           VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
> >>>>>>>           VIRTIO_NET_F_CTRL_MAC_ADDR, \
> >>>>>>>           VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
> >>>>>>> -       VIRTIO_NET_F_SPEED_DUPLEX
> >>>>>>> +       VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
> >>>>>>>
> >>>>>>>    static unsigned int features[] = {
> >>>>>>>           VIRTNET_FEATURES,
> >>>>>>> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> >>>>>>> index 5de6ed37695b..c7c35fd1a5ed 100644
> >>>>>>> --- a/include/uapi/linux/virtio_net.h
> >>>>>>> +++ b/include/uapi/linux/virtio_net.h
> >>>>>>> @@ -57,6 +57,9 @@
> >>>>>>>                                            * Steering */
> >>>>>>>    #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
> >>>>>>>
> >>>>>>> +#define VIRTIO_NET_F_BACKUP      62    /* Act as backup for another device
> >>>>>>> +                                        * with the same MAC.
> >>>>>>> +                                        */
> >>>>>>>    #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
> >>>>>>>
> >>>>>>>    #ifndef VIRTIO_NET_NO_LEGACY  
> >>>>>> I'm not a huge fan of the name "backup" since that implies that the
> >>>>>> Virtio interface is only used if the VF is not present, and there are
> >>>>>> multiple instances such as dealing with east/west or
> >>>>>> broadcast/multicast traffic where it may be desirable to use the
> >>>>>> para-virtual interface rather then deal with PCI overhead/bottleneck
> >>>>>> to send the packet.  
> >>>>> Right now hypervisors mostly expect that yes, only one at a time is
> >>>>> used.  E.g. if you try to do multicast sending packets on both VF and
> >>>>> virtio then you will end up with two copies of each packet.  
> >>>> I think we want to use only 1 interface to  send out any packet. In case of
> >>>> broadcast/multicasts it would be an optimization to send them via virtio and
> >>>> this patch series adds that optimization.  
> >>> Right that's what I think we should rather avoid for now.
> >>>
> >>> It's *not* an optimization if there's a single VM on this host,
> >>> or if a specific multicast group does not have any VMs on same
> >>> host.  
> >> Agreed. In my mind this is something that is controlled by the
> >> pass-thru interface once it is enslaved.  
> > It would be pretty tricky to control through the PT
> > interface since a PT interface pretends to be a physical
> > device, which has no concept of VMs.
> >  
> >>> I'd rather we just sent everything out on the PT if that's
> >>> there. The reason we have virtio in the picture is just so
> >>> we can migrate without downtime.  
> >> I wasn't saying we do that in all cases. That would be something that
> >> would have to be decided by the pass-thru interface. Ideally the
> >> virtio would provide just enough information to get itself into the
> >> bond and I see this being the mechanism for it to do so. From there
> >> the complexity mostly lies in the pass-thru interface to configure the
> >> correct transmit modes if for example you have multiple pass-thru
> >> interfaces or a more complex traffic setup due to things like
> >> SwitchDev.
> >>
> >> In my mind we go the bonding route and there are few use cases for all
> >> of this. First is the backup case that is being addressed here. That
> >> becomes your basic "copy netvsc" approach for this which would be
> >> default. It is how we would handle basic pass-thru back-up paths. If
> >> the host decides to send multicast/broadcast traffic from the host up
> >> through it that is a host side decision. I am okay with our default
> >> transmit behavior from the guest being to send everything through the
> >> pass-thru interface. All the virtio would be doing is advertising that
> >> it is a side channel for some sort of bond with another interface. By
> >> calling it a side channel it implies that it isn't the direct channel
> >> for the interface, but it is an alternative communications channel for
> >> things like backup, and other future uses.
> >>
> >> There end up being a few different "phase 2" concepts I have floating
> >> around which I want to avoid limiting. Solving the east/west problem
> >> is an example. In my mind that just becomes a bonding Tx mode that is
> >> controlled via the pass-thru interface. The VF could black-list the
> >> local addresses so that they instead fall into the virtio interface.
> >> In addition I seem to recall watching a presentation from Mellanox
> >> where they were talking about a VF per NUMA node in a system which
> >> would imply multiple VFs with the same MAC address. I'm not sure how
> >> the current patch handles that. Obviously that would require a more
> >> complex load balancing setup. The bonding solution ends up being a way
> >> to resolve that so that they could just have it take care of picking
> >> the right Tx queue based on the NUMA affinity and fall back to the
> >> virtio/netvsc when those fail.  
> > The way I see it, we'd need to pass a bunch of extra information
> > host to guest, and we'd have to use a PV interface for it.
> > When we do this, we'll need to have another
> > feature bit, and we can call it SIDE_CHANNEL or whatever.
> >
> >  
> >>>> In the receive path,  the broadcasts should only go the PF and reach the VM
> >>>> via vitio so that the VM doesn't see duplicate broadcasts.
> >>>>
> >>>>  
> >>>>> To me the east/west scenario looks like you want something
> >>>>> more similar to a bridge on top of the virtio/PT pair.
> >>>>>
> >>>>> So I suspect that use-case will need a separate configuration bit,
> >>>>> and possibly that's when you will want something more powerful
> >>>>> such as a full bridge.  
> >>>> east-west optimization of unicasts would be a harder problem to solve as
> >>>> somehow the hypervisor needs to indicate the VM about the local dest. macs  
> >>> Using a bridge with a dedicated device for east/west would let
> >>> bridge use standard learning techniques for that perhaps?  
> >> I'm not sure about having the guest have to learn.  
> > It's certainly a way to do this, but certainly not the only one.
> >  
> >> In my mind the VF
> >> should know who is local and who isn't.  
> > Right. But note that these things change.
> >  
> >> In the case of SwitchDev it
> >> should be possible for the port representors and the switch to provide
> >> data on which interfaces are bonded on the host side and which aren't.
> >> With that data it would be pretty easy to just put together a list of
> >> addresses that would prefer to go the para-virtual route instead of
> >> being transmitted through physical hardware.
> >>
> >> In addition a bridge implies much more overhead since normally a
> >> bridge can receive a packet in on one interface and transmit it on
> >> another. We don't really need that. This is more of a VEPA type setup
> >> and doesn't need to be anything all that complex. You could probably
> >> even handle the Tx queue selection via a simple eBPF program and map
> >> since the input for whatever is used to select Tx should be pretty
> >> simple, destination MAC, source NUMA node, etc, and the data-set
> >> shouldn't be too large.  
> > That sounds interesting. A separate device might make this kind of setup
> > a bit easier.  Sridhar, did you look into creating a separate device for
> > the virtual bond device at all?  It does not have to be in a separate
> > module, that kind of refactoring can come later, but once we commit to
> > using the same single device as virtio, we can't change that.  
> 
> No. I haven't looked into creating a separate device. If we are going to 
> create a new
> device, i guess it has to be of a new device type with its own driver.
> 
> As we are using virtio_net to control and manage the VF data path, it is 
> not clear to me
> what is the advantage of creating a new device rather than extending 
> virtio_net to manage
> the VF datapath via transparent bond mechanism.
> 
> Thanks
> Sridhar
> 
> 

The requirement with Azure accelerated networking was that a stock distribution image from the
store must be able to run unmodified and get accelerated networking.
Not sure if other environments need to work the same way, but it would be nice.

That meant no additional setup scripts (aka no bonding) and also it must
work transparently with hot-plug. Also there is a diverse set of environments:
openstack, cloud-init, NetworkManager and systemd. The solution had to not depend
on any one of them, but also not break any of them.
Michael S. Tsirkin Jan. 23, 2018, 12:05 a.m. UTC | #9
On Mon, Jan 22, 2018 at 03:27:40PM -0800, Samudrala, Sridhar wrote:
> > > You could probably
> > > even handle the Tx queue selection via a simple eBPF program and map
> > > since the input for whatever is used to select Tx should be pretty
> > > simple, destination MAC, source NUMA node, etc, and the data-set
> > > shouldn't be too large.
> > That sounds interesting. A separate device might make this kind of setup
> > a bit easier.  Sridhar, did you look into creating a separate device for
> > the virtual bond device at all?  It does not have to be in a separate
> > module, that kind of refactoring can come later, but once we commit to
> > using the same single device as virtio, we can't change that.
> 
> No. I haven't looked into creating a separate device. If we are going to
> create a new
> device, i guess it has to be of a new device type with its own driver.

Well, not necessarily - just a separate set of netdev ops,
kind of like how e.g. vlans share a driver with the main driver.
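
One way to read "a separate netdev ops" is the same module registering a
second net_device with its own ops table.  A minimal sketch, with all of the
virtnet_bond_* names being hypothetical:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* hypothetical stubs, only so the ops table below is self-contained */
static int virtnet_bond_open(struct net_device *dev) { return 0; }
static int virtnet_bond_stop(struct net_device *dev) { return 0; }

static netdev_tx_t virtnet_bond_xmit(struct sk_buff *skb,
				     struct net_device *dev)
{
	/* a real implementation would hand the skb to the VF or to virtio */
	dev_kfree_skb_any(skb);
	return NETDEV_TX_OK;
}

static const struct net_device_ops virtnet_bond_netdev_ops = {
	.ndo_open	= virtnet_bond_open,
	.ndo_stop	= virtnet_bond_stop,
	.ndo_start_xmit	= virtnet_bond_xmit,
};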

> As we are using virtio_net to control and manage the VF data path, it is not
> clear to me
> what is the advantage of creating a new device rather than extending
> virtio_net to manage
> the VF datapath via transparent bond mechanism.
> 
> Thanks
> Sridhar

So that XDP redirect actions can differentiate between virtio, the PT
device and the bond. Without it there's no way to redirect
to virtio specifically.
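
As a tiny (hypothetical) illustration of the point: with separate netdevs, an
XDP program can redirect explicitly to whichever member it wants by ifindex.
The VIRTIO_IFINDEX value is a made-up placeholder; bpf_redirect() is the
existing helper.

#include <linux/bpf.h>
#include "bpf_helpers.h"

#define VIRTIO_IFINDEX 3	/* placeholder for the virtio netdev's ifindex */

SEC("xdp")
int xdp_redirect_to_virtio(struct xdp_md *ctx)
{
	return bpf_redirect(VIRTIO_IFINDEX, 0);
}

char _license[] SEC("license") = "GPL";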
Jakub Kicinski Jan. 23, 2018, 12:16 a.m. UTC | #10
On Tue, 23 Jan 2018 02:05:48 +0200, Michael S. Tsirkin wrote:
> > As we are using virtio_net to control and manage the VF data path, it is not
> > clear to me
> > what is the advantage of creating a new device rather than extending
> > virtio_net to manage
> > the VF datapath via transparent bond mechanism.
> 
> So that XDP redirect actions can differentiate between virtio, PT
> device and the bond. Without it there's no way to redirect
> to virtio specifically.

Let's make a list :P

separate netdev:
1. virtio device being a bond master is confusing/unexpected.
2. virtio device being both a master and a slave is confusing.
3. configuration of a master may have different semantics than
   configuration of a slave.
4. two para-virt devices may create a loop (or rather will be bound 
   to each other non-deterministically, depending on which spawns first).
5. there is no user configuration AFAIR in the existing patch; the VM admin
   won't be able to prevent the bond.  A separate netdev we can make 
   removable even if it's spawned automatically.
6. XDP redirect use-case (or any explicit use of the virtio slave)
   (from MST)

independent driver:
7. code reuse.

separate device:
8. bond any netdev with any netdev.
9. reuse well-known device driver model.
a. natural anchor for hypervisor configuration (switchdev etc.)
b. next-gen silicon may be able to disguise itself as a virtio device, and
   the loop check in the virtio driver will prevent the legitimate bond in
   such a case.  AFAIU that's one of the goals of the next-gen virtio spec
   as well.
Michael S. Tsirkin Jan. 23, 2018, 12:47 a.m. UTC | #11
On Mon, Jan 22, 2018 at 04:16:23PM -0800, Jakub Kicinski wrote:
> On Tue, 23 Jan 2018 02:05:48 +0200, Michael S. Tsirkin wrote:
> > > As we are using virtio_net to control and manage the VF data path, it is not
> > > clear to me
> > > what is the advantage of creating a new device rather than extending
> > > virtio_net to manage
> > > the VF datapath via transparent bond mechanism.
> > 
> > So that XDP redirect actions can differentiate between virtio, PT
> > device and the bond. Without it there's no way to redirect
> > to virtio specifically.
> 
> Let's make a list :P
> 
> separate netdev:
> 1. virtio device being a bond master is confusing/unexpected.
> 2. virtio device being both a master and a slave is confusing.

vlans are like this too, aren't they?

> 3. configuration of a master may have different semantics than
>    configuration of a slave.
> 4. two para-virt devices may create a loop (or rather will be bound 
>    to each other indeterministically, depending on which spawns first).

For 2 virtio devices, we can disable the bond to make it deterministic.

> 5. there is no user configuration AFAIR in existing patch, VM admin
>    won't be able to prevent the bond.  Separate netdev we can make 
>    removable even if it's spawned automatically.

That's more or less a feature. If you want configurability, just use
any of the existing generic solutions (team,bond,bridge,...).

> 6. XDP redirect use-case (or any explicit use of the virtio slave)
>    (from MST)
> 
> independent driver:
> 7. code reuse.

With netvsc? That precludes a separate device though because of
compatibility.

> 
> separate device:

I'm not sure I understand how "separate device" is different from
"separate netdev". Do you advocate for a special device whose job is
just to tell the guest "bind these two devices together"?

Yea, sure, that works. However it's certainly more work to
implement and manage at all levels. Further:

- it doesn't answer the question
- a feature bit in a virtio device is cheap enough that
  I wouldn't worry too much about this feature
  going unused later.

> 8. bond any netdev with any netdev.
> 9. reuse well-known device driver model.
> a. natural anchor for hypervisor configuration (switchdev etc.)

separate netdev has the same property

> b. next-gen silicon may be able to disguise as virtio device, and the
>    loop check in virtio driver will prevent the legitimate bond it such
>    case.  AFAIU that's one of the goals of next-gen virtio spec as well.

In fact we have a virtio feature bit for the fallback.
So this part does not depend on how software in the guest works
and does not need software solutions.
Jakub Kicinski Jan. 23, 2018, 1:13 a.m. UTC | #12
On Tue, 23 Jan 2018 02:47:57 +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 22, 2018 at 04:16:23PM -0800, Jakub Kicinski wrote:
> > On Tue, 23 Jan 2018 02:05:48 +0200, Michael S. Tsirkin wrote:  
> > > > As we are using virtio_net to control and manage the VF data path, it is not
> > > > clear to me
> > > > what is the advantage of creating a new device rather than extending
> > > > virtio_net to manage
> > > > the VF datapath via transparent bond mechanism.  
> > > 
> > > So that XDP redirect actions can differentiate between virtio, PT
> > > device and the bond. Without it there's no way to redirect
> > > to virtio specifically.  
> > 
> > Let's make a list :P
> > 
> > separate netdev:
> > 1. virtio device being a bond master is confusing/unexpected.
> > 2. virtio device being both a master and a slave is confusing.  
> 
> vlans are like this too, aren't they?

Perhaps a bad wording.  Both master and member would be better.

> > 3. configuration of a master may have different semantics than
> >    configuration of a slave.
> > 4. two para-virt devices may create a loop (or rather will be bound 
> >    to each other indeterministically, depending on which spawns first).  
> 
> For 2 virtio devices, we can disable the bond to make it deterministic.

Do you mean the hypervisor can or is there a knob in virtio_net to mask
off features?  Would that require re-probe of the virtio device?

> > 5. there is no user configuration AFAIR in existing patch, VM admin
> >    won't be able to prevent the bond.  Separate netdev we can make 
> >    removable even if it's spawned automatically.  
> 
> That's more or less a feature. If you want configurability, just use
> any of the existing generic solutions (team,bond,bridge,...).

The use case in mind is that VM admin wants to troubleshoot a problem
and temporarily disable the auto-bond without touching the hypervisor 
(and either member preferably).

> > 6. XDP redirect use-case (or any explicit use of the virtio slave)
> >    (from MST)
> > 
> > independent driver:
> > 7. code reuse.  
> 
> With netvsc? That precludes a separate device though because of
> compatibility.

Hopefully with one of the established bonding drivers (fingers
crossed).  We may see proliferation of special bonds (see Achiad's
presentation from last netdev about NIC-NUMA-node-bonds).

> > separate device:  
> 
> I'm not sure I understand how "separate device" is different from
> "separate netdev". Do you advocate for a special device who's job is
> just to tell the guest "bind these two devices together"?
> 
> Yea, sure, that works. However for sure it's more work to
> implement and manage at all levels. Further
> 
> - it doesn't answer the question
> - a feature bit in a virtio device is cheap enough that
>   I wouldn't worry too much about this feature
>   going unused later.

Right, I think we are referring to different things as device.  I mean
a bus device/struct device, but no strong preference on that one.  I'll
be happy as long as there is a separate netdev, really :)

> > 8. bond any netdev with any netdev.
> > 9. reuse well-known device driver model.
> > a. natural anchor for hypervisor configuration (switchdev etc.)  
> 
> saparate netdev has the same property
>
> > b. next-gen silicon may be able to disguise as virtio device, and the
> >    loop check in virtio driver will prevent the legitimate bond it such
> >    case.  AFAIU that's one of the goals of next-gen virtio spec as well.  
> 
> In fact we have a virtio feature bit for the fallback.
> So this part does not depend on how software in guest works
> and does not need software solutions.

You mean in the new spec?  Nice.  Still I think people will try to
implement the old one too given sufficiently capable HW.
Michael S. Tsirkin Jan. 23, 2018, 1:23 a.m. UTC | #13
On Mon, Jan 22, 2018 at 05:13:01PM -0800, Jakub Kicinski wrote:
> On Tue, 23 Jan 2018 02:47:57 +0200, Michael S. Tsirkin wrote:
> > On Mon, Jan 22, 2018 at 04:16:23PM -0800, Jakub Kicinski wrote:
> > > On Tue, 23 Jan 2018 02:05:48 +0200, Michael S. Tsirkin wrote:  
> > > > > As we are using virtio_net to control and manage the VF data path, it is not
> > > > > clear to me
> > > > > what is the advantage of creating a new device rather than extending
> > > > > virtio_net to manage
> > > > > the VF datapath via transparent bond mechanism.  
> > > > 
> > > > So that XDP redirect actions can differentiate between virtio, PT
> > > > device and the bond. Without it there's no way to redirect
> > > > to virtio specifically.  
> > > 
> > > Let's make a list :P
> > > 
> > > separate netdev:
> > > 1. virtio device being a bond master is confusing/unexpected.
> > > 2. virtio device being both a master and a slave is confusing.  
> > 
> > vlans are like this too, aren't they?
> 
> Perhaps a bad wording.  Both master and member would be better.
> 
> > > 3. configuration of a master may have different semantics than
> > >    configuration of a slave.
> > > 4. two para-virt devices may create a loop (or rather will be bound 
> > >    to each other indeterministically, depending on which spawns first).  
> > 
> > For 2 virtio devices, we can disable the bond to make it deterministic.
> 
> Do you mean the hypervisor can or is there a knob in virtio_net to mask
> off features?


Hypervisor can do it too. And it really should:
specifying 2 devices as backup and giving them the same MAC
seems like a misconfiguration.

But it's easy to do in virtio without knobs - check
that the potential slave is also a virtio device with the
backup flag, and don't make it a slave.
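
For illustration, that check could look roughly like the sketch below.
Only virtio_has_feature() and the netdev_ops comparison are existing
kernel interfaces; virtnet_should_enslave() is a hypothetical helper,
not code from this series.

/*
 * Hypothetical sketch of the loop check described above: refuse to
 * enslave a candidate that is itself a virtio_net advertising BACKUP,
 * so two para-virtual devices never bind to each other.
 */
static bool virtnet_should_enslave(struct net_device *candidate)
{
	struct virtnet_info *vi;

	/* Not a virtio_net netdev; the MAC match is checked elsewhere. */
	if (candidate->netdev_ops != &virtnet_netdev)
		return true;

	vi = netdev_priv(candidate);

	/* Another BACKUP-capable virtio_net would create a loop. */
	return !virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP);
}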


>  Would that require re-probe of the virtio device?

Probably not.

> > > 5. there is no user configuration AFAIR in existing patch, VM admin
> > >    won't be able to prevent the bond.  Separate netdev we can make 
> > >    removable even if it's spawned automatically.  
> > 
> > That's more or less a feature. If you want configurability, just use
> > any of the existing generic solutions (team,bond,bridge,...).
> 
> The use case in mind is that VM admin wants to troubleshoot a problem
> and temporarily disable the auto-bond without touching the hypervisor 
> (and either member preferably).

I don't think it's possible to support this unconditionally.

E.g. think of a config where these actually share
a backend, with virtio becoming active when PT
access to the backend is disabled.

So you will need some device specific extension for that.


> > > 6. XDP redirect use-case (or any explicit use of the virtio slave)
> > >    (from MST)
> > > 
> > > independent driver:
> > > 7. code reuse.  
> > 
> > With netvsc? That precludes a separate device though because of
> > compatibility.
> 
> Hopefully with one of the established bonding drivers (fingers
> crossed).

There is very little similarity. Calling this device a bond
just confuses people.

>  We may see proliferation of special bonds (see Achiad's
> presentation from last netdev about NIC-NUMA-node-bonds).

I'll take a look, but this isn't like a bond at all, no more than a vlan
is a bond.  E.g. if the PT link goes down then the link is down, period,
and you do not want to switch to virtio.

> > > separate device:  
> > 
> > I'm not sure I understand how "separate device" is different from
> > "separate netdev". Do you advocate for a special device who's job is
> > just to tell the guest "bind these two devices together"?
> > 
> > Yea, sure, that works. However for sure it's more work to
> > implement and manage at all levels. Further
> > 
> > - it doesn't answer the question
> > - a feature bit in a virtio device is cheap enough that
> >   I wouldn't worry too much about this feature
> >   going unused later.
> 
> Right, I think we are referring to different things as device.  I mean
> a bus device/struct device, but no strong preference on that one.  I'll
> be happy as long as there is a separate netdev, really :)
> 
> > > 8. bond any netdev with any netdev.
> > > 9. reuse well-known device driver model.
> > > a. natural anchor for hypervisor configuration (switchdev etc.)  
> > 
> > saparate netdev has the same property
> >
> > > b. next-gen silicon may be able to disguise as virtio device, and the
> > >    loop check in virtio driver will prevent the legitimate bond it such
> > >    case.  AFAIU that's one of the goals of next-gen virtio spec as well.  
> > 
> > In fact we have a virtio feature bit for the fallback.
> > So this part does not depend on how software in guest works
> > and does not need software solutions.
> 
> You mean in the new spec?  Nice.  Still I think people will try to
> implement the old one too given sufficiently capable HW.

Existing HW won't have the BACKUP feature so the new functionality
won't be activated. So no problem I think.
Samudrala, Sridhar Jan. 23, 2018, 1:34 a.m. UTC | #14
On 1/22/2018 4:05 PM, Michael S. Tsirkin wrote:
> On Mon, Jan 22, 2018 at 03:27:40PM -0800, Samudrala, Sridhar wrote:
>>>> You could probably
>>>> even handle the Tx queue selection via a simple eBPF program and map
>>>> since the input for whatever is used to select Tx should be pretty
>>>> simple, destination MAC, source NUMA node, etc, and the data-set
>>>> shouldn't be too large.
>>> That sounds interesting. A separate device might make this kind of setup
>>> a bit easier.  Sridhar, did you look into creating a separate device for
>>> the virtual bond device at all?  It does not have to be in a separate
>>> module, that kind of refactoring can come later, but once we commit to
>>> using the same single device as virtio, we can't change that.
>> No. I haven't looked into creating a separate device. If we are going to
>> create a new
>> device, i guess it has to be of a new device type with its own driver.
> Well not necessarily - just a separate netdev ops.
> Kind of like e.g. vlans share a driver with the main driver.

Not sure what you meant by vlans sharing a driver with the main driver.
IIUC, vlans are supported via the 802.1q driver, which creates a netdev of
type 'vlan' with vlan_netdev_ops.
>
>> As we are using virtio_net to control and manage the VF data path, it is not
>> clear to me
>> what is the advantage of creating a new device rather than extending
>> virtio_net to manage
>> the VF datapath via transparent bond mechanism.
>>
>> Thanks
>> Sridhar
> So that XDP redirect actions can differentiate between virtio, PT
> device and the bond. Without it there's no way to redirect
> to virtio specifically.
I guess this use case is for a guest admin to add BPF programs to the VF
netdev and redirect frames to virtio.  How does the bond enable this feature?


Thanks
Sridhar
Samudrala, Sridhar Jan. 23, 2018, 1:37 a.m. UTC | #15
On 1/22/2018 4:02 PM, Stephen Hemminger wrote:
>
>>>   
>>>> In the case of SwitchDev it
>>>> should be possible for the port representors and the switch to provide
>>>> data on which interfaces are bonded on the host side and which aren't.
>>>> With that data it would be pretty easy to just put together a list of
>>>> addresses that would prefer to go the para-virtual route instead of
>>>> being transmitted through physical hardware.
>>>>
>>>> In addition a bridge implies much more overhead since normally a
>>>> bridge can receive a packet in on one interface and transmit it on
>>>> another. We don't really need that. This is more of a VEPA type setup
>>>> and doesn't need to be anything all that complex. You could probably
>>>> even handle the Tx queue selection via a simple eBPF program and map
>>>> since the input for whatever is used to select Tx should be pretty
>>>> simple, destination MAC, source NUMA node, etc, and the data-set
>>>> shouldn't be too large.
>>> That sounds interesting. A separate device might make this kind of setup
>>> a bit easier.  Sridhar, did you look into creating a separate device for
>>> the virtual bond device at all?  It does not have to be in a separate
>>> module, that kind of refactoring can come later, but once we commit to
>>> using the same single device as virtio, we can't change that.
>> No. I haven't looked into creating a separate device. If we are going to
>> create a new
>> device, i guess it has to be of a new device type with its own driver.
>>
>> As we are using virtio_net to control and manage the VF data path, it is
>> not clear to me
>> what is the advantage of creating a new device rather than extending
>> virtio_net to manage
>> the VF datapath via transparent bond mechanism.
>>
>> Thanks
>> Sridhar
>>
>>
> The requirement with Azure accelerated network was that a stock distribution image from the
> store must be able to run unmodified and get accelerated networking.
> Not sure if other environments need to work the same, but it would be nice.
>
> That meant no additional setup scripts (aka no bonding) and also it must
> work transparently with hot-plug. Also there are diverse set of environments:
> openstack, cloudinit, network manager and systemd. The solution had to not depend
> on any one of them, but also not break any of them.
Yes. Cloud service providers using KVM as the hypervisor have a similar
requirement: to provide accelerated networking with VM images that support
virtio_net.

Thanks
Sridhar
Michael S. Tsirkin Jan. 23, 2018, 2:04 a.m. UTC | #16
On Mon, Jan 22, 2018 at 05:34:37PM -0800, Samudrala, Sridhar wrote:
> On 1/22/2018 4:05 PM, Michael S. Tsirkin wrote:
> > On Mon, Jan 22, 2018 at 03:27:40PM -0800, Samudrala, Sridhar wrote:
> > > > > You could probably
> > > > > even handle the Tx queue selection via a simple eBPF program and map
> > > > > since the input for whatever is used to select Tx should be pretty
> > > > > simple, destination MAC, source NUMA node, etc, and the data-set
> > > > > shouldn't be too large.
> > > > That sounds interesting. A separate device might make this kind of setup
> > > > a bit easier.  Sridhar, did you look into creating a separate device for
> > > > the virtual bond device at all?  It does not have to be in a separate
> > > > module, that kind of refactoring can come later, but once we commit to
> > > > using the same single device as virtio, we can't change that.
> > > No. I haven't looked into creating a separate device. If we are going to
> > > create a new
> > > device, i guess it has to be of a new device type with its own driver.
> > Well not necessarily - just a separate netdev ops.
> > Kind of like e.g. vlans share a driver with the main driver.
> 
> Not sure what you meant by vlans sharing a driver with the main driver.
> IIUC, vlans are supported via 802.1q driver and  creates a netdev of type
> 'vlan'
> with vlan_netdev_ops

But nothing prevents a single module from registering
multiple ops.


> > 
> > > As we are using virtio_net to control and manage the VF data path, it is not
> > > clear to me
> > > what is the advantage of creating a new device rather than extending
> > > virtio_net to manage
> > > the VF datapath via transparent bond mechanism.
> > > 
> > > Thanks
> > > Sridhar
> > So that XDP redirect actions can differentiate between virtio, PT
> > device and the bond. Without it there's no way to redirect
> > to virtio specifically.
> I guess this usecase is for a guest admin to add bpf programs to VF netdev
> and
> redirect frames to virtio.

No - this doesn't make much sense IMHO.  The use case would be VM
bridging where we process packets incoming on one virtio interface and
transmit them on another one. It was pointed out that using a separate
master netdev and making both virtio and PT its slaves would let a
redirect target virtio specifically; without that, it won't be possible
to force a redirect to virtio.

How important that use-case is, I am not sure.
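
As a rough illustration of that use case, an XDP program attached to one
interface could force frames to the virtio netdev by ifindex; this only
works if virtio keeps a netdev of its own.  The program below is a
hypothetical sketch, not part of the patch series, and VIRTIO_IFINDEX is
a placeholder the loader would have to resolve.

/* Hypothetical sketch only. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define VIRTIO_IFINDEX 3	/* placeholder ifindex of the virtio netdev */

SEC("xdp")
int xdp_force_virtio(struct xdp_md *ctx)
{
	/* Redirect every frame to the virtio interface. */
	return bpf_redirect(VIRTIO_IFINDEX, 0);
}

char _license[] SEC("license") = "GPL";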

>  How does bond enable this feature?
> 
> 
> Thanks
> Sridhar

As it's a userspace configuration, I guess for starters we
can punt this to userspace, too.
Alexander H Duyck Jan. 23, 2018, 3:36 a.m. UTC | #17
On Mon, Jan 22, 2018 at 6:04 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Jan 22, 2018 at 05:34:37PM -0800, Samudrala, Sridhar wrote:
>> On 1/22/2018 4:05 PM, Michael S. Tsirkin wrote:
>> > On Mon, Jan 22, 2018 at 03:27:40PM -0800, Samudrala, Sridhar wrote:
>> > > > > You could probably
>> > > > > even handle the Tx queue selection via a simple eBPF program and map
>> > > > > since the input for whatever is used to select Tx should be pretty
>> > > > > simple, destination MAC, source NUMA node, etc, and the data-set
>> > > > > shouldn't be too large.
>> > > > That sounds interesting. A separate device might make this kind of setup
>> > > > a bit easier.  Sridhar, did you look into creating a separate device for
>> > > > the virtual bond device at all?  It does not have to be in a separate
>> > > > module, that kind of refactoring can come later, but once we commit to
>> > > > using the same single device as virtio, we can't change that.
>> > > No. I haven't looked into creating a separate device. If we are going to
>> > > create a new
>> > > device, i guess it has to be of a new device type with its own driver.
>> > Well not necessarily - just a separate netdev ops.
>> > Kind of like e.g. vlans share a driver with the main driver.
>>
>> Not sure what you meant by vlans sharing a driver with the main driver.
>> IIUC, vlans are supported via 802.1q driver and  creates a netdev of type
>> 'vlan'
>> with vlan_netdev_ops
>
> But nothing prevents a single module from registering
> multiple ops.

Just to clarify, it seems like what you are suggesting is just adding
the "master" as a separate set of netdev ops or netdevice and to have
virtio spawn two network devices, one slave and one master, if the
BACKUP bit is set. Do I have that right?

I am good with the code still living in the virtio driver and
consolidation with other similar implementations and further
improvement could probably happen later as part of some refactor.

Thanks.

- Alex
Jakub Kicinski Jan. 23, 2018, 7:21 p.m. UTC | #18
On Tue, 23 Jan 2018 03:23:57 +0200, Michael S. Tsirkin wrote:
> > > > b. next-gen silicon may be able to disguise as virtio device, and the
> > > >    loop check in virtio driver will prevent the legitimate bond it such
> > > >    case.  AFAIU that's one of the goals of next-gen virtio spec as well.    
> > > 
> > > In fact we have a virtio feature bit for the fallback.
> > > So this part does not depend on how software in guest works
> > > and does not need software solutions.  
> > 
> > You mean in the new spec?  Nice.  Still I think people will try to
> > implement the old one too given sufficiently capable HW.  
> 
> Existing HW won't have the BACKUP feature so the new functionality
> won't be activated. So no problem I think.

I meant code that compares netdev_ops, e.g.:

+	/* Skip our own events */
+	if (event_dev->netdev_ops == &virtnet_netdev)
+		return NOTIFY_DONE;

Would be an obstacle to bonding virtio_nets.  But that's just one of
the thoughts, perhaps of disputable value.  Anyway, separate netdev and
netdev_ops will solve this, and I think we agree that's preferable :)
Michael S. Tsirkin Jan. 23, 2018, 11:01 p.m. UTC | #19
On Mon, Jan 22, 2018 at 07:36:32PM -0800, Alexander Duyck wrote:
> On Mon, Jan 22, 2018 at 6:04 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Jan 22, 2018 at 05:34:37PM -0800, Samudrala, Sridhar wrote:
> >> On 1/22/2018 4:05 PM, Michael S. Tsirkin wrote:
> >> > On Mon, Jan 22, 2018 at 03:27:40PM -0800, Samudrala, Sridhar wrote:
> >> > > > > You could probably
> >> > > > > even handle the Tx queue selection via a simple eBPF program and map
> >> > > > > since the input for whatever is used to select Tx should be pretty
> >> > > > > simple, destination MAC, source NUMA node, etc, and the data-set
> >> > > > > shouldn't be too large.
> >> > > > That sounds interesting. A separate device might make this kind of setup
> >> > > > a bit easier.  Sridhar, did you look into creating a separate device for
> >> > > > the virtual bond device at all?  It does not have to be in a separate
> >> > > > module, that kind of refactoring can come later, but once we commit to
> >> > > > using the same single device as virtio, we can't change that.
> >> > > No. I haven't looked into creating a separate device. If we are going to
> >> > > create a new
> >> > > device, i guess it has to be of a new device type with its own driver.
> >> > Well not necessarily - just a separate netdev ops.
> >> > Kind of like e.g. vlans share a driver with the main driver.
> >>
> >> Not sure what you meant by vlans sharing a driver with the main driver.
> >> IIUC, vlans are supported via 802.1q driver and  creates a netdev of type
> >> 'vlan'
> >> with vlan_netdev_ops
> >
> > But nothing prevents a single module from registering
> > multiple ops.
> 
> Just to clarify, it seems like what you are suggesting is just adding
> the "master" as a separate set of netdev ops or netdevice and to have
> virtio spawn two network devices, one slave and one master, if the
> BACKUP bit is set. Do I have that right?

Yes, that was my idea.
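
For what it's worth, a rough sketch of that shape follows; every
virtnet_bypass_* name is purely illustrative, and a real implementation
would track and forward to the active slave.

/*
 * Illustrative-only sketch: when BACKUP is negotiated, the virtio
 * module registers a second netdev with its own netdev_ops to act as
 * the master, alongside the regular virtio netdev acting as a slave.
 */
static netdev_tx_t virtnet_bypass_xmit(struct sk_buff *skb,
				       struct net_device *dev)
{
	/* Placeholder: a real master would forward to the active slave. */
	dev_kfree_skb_any(skb);
	return NETDEV_TX_OK;
}

static const struct net_device_ops virtnet_bypass_netdev_ops = {
	.ndo_start_xmit	= virtnet_bypass_xmit,
};

static int virtnet_bypass_create(struct virtnet_info *vi)
{
	struct net_device *master;

	master = alloc_etherdev(0);
	if (!master)
		return -ENOMEM;

	master->netdev_ops = &virtnet_bypass_netdev_ops;
	/* Master shares the MAC of the virtio (and eventually VF) slave. */
	ether_addr_copy(master->dev_addr, vi->dev->dev_addr);

	return register_netdev(master);
}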

> I am good with the code still living in the virtio driver and
> consolidation with other similar implementations and further
> improvement could probably happen later as part of some refactor.
> 
> Thanks.
> 
> - Alex

Patch

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 12dfc5fee58e..f149a160a8c5 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2829,7 +2829,7 @@  static struct virtio_device_id id_table[] = {
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
 	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-	VIRTIO_NET_F_SPEED_DUPLEX
+	VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..c7c35fd1a5ed 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@ 
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_BACKUP	  62	/* Act as backup for another device
+					 * with the same MAC.
+					 */
 #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY