diff mbox series

[net] failover: eliminate callback hell

Message ID 20180605034231.31610-1-sthemmin@microsoft.com
State Changes Requested, archived
Delegated to: David Miller
Headers show
Series [net] failover: eliminate callback hell | expand

Commit Message

Stephen Hemminger June 5, 2018, 3:42 a.m. UTC
The net failover should be a simple library, not a virtual
object with function callbacks (see callback hell).
The code is simpler is smaller both for the netvsc and virtio use case.

The code is restructured in many ways. I should have given these
as review comments to net_failover during review
but did not want to overwhelm the original submitter.
Therefore it was merged prematurely.

Some of the many items changed are:

  * The support routines should just be selected as needed in
    kernel config, no need for them to be visible config items.

  * Both netvsc and net_failover should keep their list of their
    own devices. Not a common list.

  * The matching of secondary device to primary device policy
    is up to the network device. Both net_failover and netvsc
    will use MAC for now but can change separately.

  * The match policy is only used during initial discovery; after
    that the secondary device knows what the upper device is because
    of the parent/child relationship; no searching is required.

  * Now, netvsc and net_failover use the same delayed work type
    mechanism for setup. Previously, net_failover code was triggering off
    name change but a similar policy was rejected for netvsc.
    "what is good for the goose is good for the gander"

  * The net_failover private device info 'struct net_failover_info'
    should have been private to the driver file, not a visible
    API.

  * The net_failover device should use SET_NETDEV_DEV
    that is intended only for physical devices not virtual devices.

  * No point in having DocBook style comments on a driver file.
    They only make sense on an external exposed API.

  * net_failover only supports Ethernet, so use ether_addr_copy.

  * Set permanent and current address of net_failover device
    to match the primary.

  * Carrier should be marked off before registering device
    the net_failover device.

  * Use netdev_XXX for log messages, in net_failover (not dev_xxx)

  * Since failover infrastructure is about linking devices just
    use RTNL no need for other locking in init and teardown.

  * Don't bother with ERR_PTR() style return if only possible
    return is success or no memory.

  * As much as possible, the terms master and slave should be avoided
    because of their cultural connotations.

Note; this code has been tested on Hyper-V
but is compile tested only on virtio.

Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---

Although this patch needs to go into 4.18 (linux-net),
this version is based against net-next because net-next
hasn't been merged into linux-net yet.


 drivers/net/hyperv/hyperv_net.h |   3 +-
 drivers/net/hyperv/netvsc_drv.c | 173 +++++++++++------
 drivers/net/net_failover.c      | 312 ++++++++++++++++++++-----------
 drivers/net/virtio_net.c        |   9 +-
 include/net/failover.h          |  31 +---
 include/net/net_failover.h      |  32 +---
 net/Kconfig                     |  13 +-
 net/core/failover.c             | 316 ++++----------------------------
 8 files changed, 373 insertions(+), 516 deletions(-)

Comments

Samudrala, Sridhar June 5, 2018, 5:22 p.m. UTC | #1
On 6/4/2018 8:42 PM, Stephen Hemminger wrote:
> The net failover should be a simple library, not a virtual
> object with function callbacks (see callback hell).
> The code is simpler is smaller both for the netvsc and virtio use case.

I quickly tried this patch and it breaks virtio-net in standby mode.
I don't see failover netdev, unloading virtio-net causes a crash.

With these changes, there is very minimal code that is shared between
netvsc and virtio-net. The notifier and event handling code and the
lookup_bymac routines are now duplicated in both the drivers. I thought
we wanted to keep this code common between the 2 drivers and we went through
multiple revisions to make sure that it works with both netvsc's 2 netdev
and virtio-net's 3 netdev models.

The reason for the indirect ops is to support these 2 different models and
i am not sure if the overhead of the callbacks is that significant considering
that they are not called in the hot path.




>
> The code is restructured in many ways. I should have given these
> as review comments to net_failover during review
> but did not want to overwhelm the original submitter.
> Therefore it was merged prematurely.
>
> Some of the many items changed are:
>
>    * The support routines should just be selected as needed in
>      kernel config, no need for them to be visible config items.
>
>    * Both netvsc and net_failover should keep their list of their
>      own devices. Not a common list.
>
> 	  * The matching of secondary device to primary device policy
>      is up to the network device. Both net_failover and netvsc
>      will use MAC for now but can change separately.
>
>    * The match policy is only used during initial discovery; after
>      that the secondary device knows what the upper device is because
>      of the parent/child relationship; no searching is required.
>
>    * Now, netvsc and net_failover use the same delayed work type
>      mechanism for setup. Previously, net_failover code was triggering off
>      name change but a similar policy was rejected for netvsc.
>      "what is good for the goose is good for the gander"
>
>    * The net_failover private device info 'struct net_failover_info'
>      should have been private to the driver file, not a visible
>      API.
>
>    * The net_failover device should use SET_NETDEV_DEV
>      that is intended only for physical devices not virtual devices.
>
>    * No point in having DocBook style comments on a driver file.
>      They only make sense on an external exposed API.
>
>    * net_failover only supports Ethernet, so use ether_addr_copy.
>
>    * Set permanent and current address of net_failover device
>      to match the primary.
>
>    * Carrier should be marked off before registering device
>      the net_failover device.
>
>    * Use netdev_XXX for log messages, in net_failover (not dev_xxx)
>
>    * Since failover infrastructure is about linking devices just
>      use RTNL no need for other locking in init and teardown.
>
>    * Don't bother with ERR_PTR() style return if only possible
>      return is success or no memory.
>
>    * As much as possible, the terms master and slave should be avoided
>      because of their cultural connotations.
>
> Note; this code has been tested on Hyper-V
> but is compile tested only on virtio.
>
> Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
> Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
> ---
>
> Although this patch needs to go into 4.18 (linux-net),
> this version is based against net-next because net-next
> hasn't been merged into linux-net yet.
>
>
>   drivers/net/hyperv/hyperv_net.h |   3 +-
>   drivers/net/hyperv/netvsc_drv.c | 173 +++++++++++------
>   drivers/net/net_failover.c      | 312 ++++++++++++++++++++-----------
>   drivers/net/virtio_net.c        |   9 +-
>   include/net/failover.h          |  31 +---
>   include/net/net_failover.h      |  32 +---
>   net/Kconfig                     |  13 +-
>   net/core/failover.c             | 316 ++++----------------------------
>   8 files changed, 373 insertions(+), 516 deletions(-)
>
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 99d8e7398a5b..c7d25d10765e 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -902,6 +902,8 @@ struct net_device_context {
>   	struct hv_device *device_ctx;
>   	/* netvsc_device */
>   	struct netvsc_device __rcu *nvdev;
> +	/* list of netvsc net_devices */
> +	struct list_head list;
>   	/* reconfigure work */
>   	struct delayed_work dwork;
>   	/* last reconfig time */
> @@ -933,7 +935,6 @@ struct net_device_context {
>   	/* Serial number of the VF to team with */
>   	u32 vf_serial;
>   
> -	struct failover *failover;
>   };
>   
>   /* Per channel data */
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index bef4d55a108c..074e6b8578df 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -70,6 +70,8 @@ static int debug = -1;
>   module_param(debug, int, 0444);
>   MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
>   
> +static LIST_HEAD(netvsc_dev_list);
> +
>   static void netvsc_change_rx_flags(struct net_device *net, int change)
>   {
>   	struct net_device_context *ndev_ctx = netdev_priv(net);
> @@ -1846,101 +1848,120 @@ static void netvsc_vf_setup(struct work_struct *w)
>   	}
>   
>   	vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
> -	if (vf_netdev)
> +	if (vf_netdev) {
>   		__netvsc_vf_setup(ndev, vf_netdev);
> -
> +		dev_put(vf_netdev);
> +	}
>   	rtnl_unlock();
>   }
>   
> -static int netvsc_pre_register_vf(struct net_device *vf_netdev,
> -				  struct net_device *ndev)
> +static struct net_device *get_netvsc_bymac(const u8 *mac)
>   {
> -	struct net_device_context *net_device_ctx;
> -	struct netvsc_device *netvsc_dev;
> +	struct net_device_context *ndev_ctx;
>   
> -	net_device_ctx = netdev_priv(ndev);
> -	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
> -	if (!netvsc_dev || rtnl_dereference(net_device_ctx->vf_netdev))
> -		return -ENODEV;
> +	ASSERT_RTNL();
>   
> -	return 0;
> +	list_for_each_entry(ndev_ctx, &netvsc_dev_list, list) {
> +		struct net_device *dev = hv_get_drvdata(ndev_ctx->device_ctx);
> +
> +		if (ether_addr_equal(mac, dev->perm_addr))
> +			return dev;
> +	}
> +
> +	return NULL;
>   }
>   
> -static int netvsc_register_vf(struct net_device *vf_netdev,
> -			      struct net_device *ndev)
> +static int netvsc_register_vf(struct net_device *vf_netdev)
>   {
> -	struct net_device_context *ndev_ctx = netdev_priv(ndev);
> +	struct net_device *ndev;
> +	struct net_device_context *ndev_ctx;
> +
> +	/* Must use Ethernet addresses */
> +	if (vf_netdev->addr_len != ETH_ALEN)
> +		return NOTIFY_DONE;
> +
> +	/* VF must be a physical device not VLAN, etc */
> +	if (!vf_netdev->dev.parent)
> +		return NOTIFY_DONE;
> +
> +	/* Use the MAC address to locate the synthetic interface to
> +	 * associate with the VF interface.
> +	 */
> +	ndev = get_netvsc_bymac(vf_netdev->perm_addr);
> +	if (!ndev)
> +		return NOTIFY_DONE;
> +
> +	/* If network device is being removed, don't do anything */
> +	ndev_ctx = netdev_priv(ndev);
> +	if (!rtnl_dereference(ndev_ctx->nvdev))
> +		return NOTIFY_DONE;
> +
> +	if (netdev_failover_join(vf_netdev, ndev, netvsc_vf_handle_frame)) {
> +		netdev_err(vf_netdev, "could not join: %s", ndev->name);
> +		return NOTIFY_DONE;
> +	}
>   
>   	/* set slave flag before open to prevent IPv6 addrconf */
>   	vf_netdev->flags |= IFF_SLAVE;
>   
> +	dev_hold(vf_netdev);
> +
>   	schedule_delayed_work(&ndev_ctx->vf_takeover, VF_TAKEOVER_INT);
>   
>   	call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
>   
>   	netdev_info(vf_netdev, "joined to %s\n", ndev->name);
>   
> -	dev_hold(vf_netdev);
>   	rcu_assign_pointer(ndev_ctx->vf_netdev, vf_netdev);
>   
> -	return 0;
> +	return NOTIFY_OK;
>   }
>   
>   /* VF up/down change detected, schedule to change data path */
> -static int netvsc_vf_changed(struct net_device *vf_netdev,
> -			     struct net_device *ndev)
> +static int netvsc_vf_changed(struct net_device *vf_netdev)
>   {
>   	struct net_device_context *net_device_ctx;
>   	struct netvsc_device *netvsc_dev;
> +	struct net_device *ndev;
>   	bool vf_is_up = netif_running(vf_netdev);
>   
> +	ndev = netdev_failover_upper_get(vf_netdev);
> +	if (!ndev)
> +		return NOTIFY_DONE;
> +
>   	net_device_ctx = netdev_priv(ndev);
>   	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
>   	if (!netvsc_dev)
> -		return -ENODEV;
> +		return NOTIFY_DONE;
>   
>   	netvsc_switch_datapath(ndev, vf_is_up);
>   	netdev_info(ndev, "Data path switched %s VF: %s\n",
>   		    vf_is_up ? "to" : "from", vf_netdev->name);
>   
> -	return 0;
> +	return NOTIFY_OK;
>   }
>   
> -static int netvsc_pre_unregister_vf(struct net_device *vf_netdev,
> -				    struct net_device *ndev)
> +static int netvsc_unregister_vf(struct net_device *vf_netdev)
>   {
>   	struct net_device_context *net_device_ctx;
> +	struct net_device *ndev;
>   
> -	net_device_ctx = netdev_priv(ndev);
> -	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
> -
> -	return 0;
> -}
> -
> -static int netvsc_unregister_vf(struct net_device *vf_netdev,
> -				struct net_device *ndev)
> -{
> -	struct net_device_context *net_device_ctx;
> +	ndev = netdev_failover_upper_get(vf_netdev);
> +	if (!ndev)
> +		return NOTIFY_DONE;
>   
>   	net_device_ctx = netdev_priv(ndev);
> +	if (cancel_delayed_work_sync(&net_device_ctx->vf_takeover))
> +		dev_put(vf_netdev);
>   
>   	netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
>   
> +	netdev_failover_unjoin(vf_netdev, ndev);
>   	RCU_INIT_POINTER(net_device_ctx->vf_netdev, NULL);
> -	dev_put(vf_netdev);
>   
> -	return 0;
> +	return NOTIFY_OK;
>   }
>   
> -static struct failover_ops netvsc_failover_ops = {
> -	.slave_pre_register	= netvsc_pre_register_vf,
> -	.slave_register		= netvsc_register_vf,
> -	.slave_pre_unregister	= netvsc_pre_unregister_vf,
> -	.slave_unregister	= netvsc_unregister_vf,
> -	.slave_link_change	= netvsc_vf_changed,
> -	.slave_handle_frame	= netvsc_vf_handle_frame,
> -};
> -
>   static int netvsc_probe(struct hv_device *dev,
>   			const struct hv_vmbus_device_id *dev_id)
>   {
> @@ -2009,6 +2030,8 @@ static int netvsc_probe(struct hv_device *dev,
>   
>   	memcpy(net->dev_addr, device_info.mac_adr, ETH_ALEN);
>   
> +	net->priv_flags |= IFF_FAILOVER;
> +
>   	/* hw_features computed in rndis_netdev_set_hwcaps() */
>   	net->features = net->hw_features |
>   		NETIF_F_HIGHDMA | NETIF_F_SG |
> @@ -2024,23 +2047,19 @@ static int netvsc_probe(struct hv_device *dev,
>   	else
>   		net->max_mtu = ETH_DATA_LEN;
>   
> -	ret = register_netdev(net);
> +	rtnl_lock();
> +	ret = register_netdevice(net);
>   	if (ret != 0) {
>   		pr_err("Unable to register netdev.\n");
>   		goto register_failed;
>   	}
>   
> -	net_device_ctx->failover = failover_register(net, &netvsc_failover_ops);
> -	if (IS_ERR(net_device_ctx->failover)) {
> -		ret = PTR_ERR(net_device_ctx->failover);
> -		goto err_failover;
> -	}
> -
> -	return ret;
> +	list_add(&net_device_ctx->list, &netvsc_dev_list);
> +	rtnl_unlock();
> +	return 0;
>   
> -err_failover:
> -	unregister_netdev(net);
>   register_failed:
> +	rtnl_unlock();
>   	rndis_filter_device_remove(dev, nvdev);
>   rndis_failed:
>   	free_percpu(net_device_ctx->vf_stats);
> @@ -2079,15 +2098,17 @@ static int netvsc_remove(struct hv_device *dev)
>   	 */
>   	rtnl_lock();
>   	vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
> -	if (vf_netdev)
> -		failover_slave_unregister(vf_netdev);
> +	if (vf_netdev) {
> +		netdev_failover_unjoin(vf_netdev, net);
> +		dev_put(vf_netdev);
> +	}
>   
>   	if (nvdev)
>   		rndis_filter_device_remove(dev, nvdev);
>   
>   	unregister_netdevice(net);
>   
> -	failover_unregister(ndev_ctx->failover);
> +	list_del(&ndev_ctx->list);
>   
>   	rtnl_unlock();
>   	rcu_read_unlock();
> @@ -2115,8 +2136,47 @@ static struct  hv_driver netvsc_drv = {
>   	.remove = netvsc_remove,
>   };
>   
> +/* On Hyper-V, every VF interface is matched with a corresponding
> + * synthetic interface. The synthetic interface is presented first
> + * to the guest. When the corresponding VF instance is registered,
> + * we will take care of switching the data path.
> + */
> +static int netvsc_netdev_event(struct notifier_block *this,
> +			       unsigned long event, void *ptr)
> +{
> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> +
> +	/* Skip parent events */
> +	if (netif_is_failover(event_dev))
> +		return NOTIFY_DONE;
> +
> +	/* Avoid non-Ethernet type devices */
> +	if (event_dev->type != ARPHRD_ETHER)
> +		return NOTIFY_DONE;
> +
> +	switch (event) {
> +	case NETDEV_REGISTER:
> +		return netvsc_register_vf(event_dev);
> +
> +	case NETDEV_UNREGISTER:
> +		return netvsc_unregister_vf(event_dev);
> +
> +	case NETDEV_UP:
> +	case NETDEV_DOWN:
> +		return netvsc_vf_changed(event_dev);
> +
> +	default:
> +		return NOTIFY_DONE;
> +	}
> +}
> +
> +static struct notifier_block netvsc_netdev_notifier = {
> +	.notifier_call = netvsc_netdev_event,
> +};
> +
>   static void __exit netvsc_drv_exit(void)
>   {
> +	unregister_netdevice_notifier(&netvsc_netdev_notifier);
>   	vmbus_driver_unregister(&netvsc_drv);
>   }
>   
> @@ -2136,6 +2196,7 @@ static int __init netvsc_drv_init(void)
>   	if (ret)
>   		return ret;
>   
> +	register_netdevice_notifier(&netvsc_netdev_notifier);
>   	return 0;
>   }
>   
> diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
> index 83f7420ddea5..e0d30527f748 100644
> --- a/drivers/net/net_failover.c
> +++ b/drivers/net/net_failover.c
> @@ -28,6 +28,46 @@
>   #include <uapi/linux/if_arp.h>
>   #include <net/net_failover.h>
>   
> +static LIST_HEAD(net_failover_list);
> +
> +/* failover state */
> +struct net_failover_info {
> +	struct net_device *failover_dev;
> +
> +	/* list of failover virtual devices */
> +	struct list_head list;
> +
> +	/* primary netdev with same MAC */
> +	struct net_device __rcu *primary_dev;
> +
> +	/* standby netdev */
> +	struct net_device __rcu *standby_dev;
> +
> +	/* primary netdev stats */
> +	struct rtnl_link_stats64 primary_stats;
> +
> +	/* standby netdev stats */
> +	struct rtnl_link_stats64 standby_stats;
> +
> +	/* aggregated stats */
> +	struct rtnl_link_stats64 failover_stats;
> +
> +	/* spinlock while updating stats */
> +	spinlock_t stats_lock;
> +
> +	/* delayed setup of slave */
> +	struct delayed_work standby_init;
> +};
> +
> +#define FAILOVER_VLAN_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
> +				 NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
> +				 NETIF_F_HIGHDMA | NETIF_F_LRO)
> +
> +#define FAILOVER_ENC_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
> +				 NETIF_F_RXCSUM | NETIF_F_ALL_TSO)
> +
> +#define FAILOVER_SETUP_INTERVAL	(HZ / 10)
> +
>   static bool net_failover_xmit_ready(struct net_device *dev)
>   {
>   	return netif_running(dev) && netif_carrier_ok(dev);
> @@ -460,22 +500,42 @@ static void net_failover_lower_state_changed(struct net_device *slave_dev,
>   	netdev_lower_state_changed(slave_dev, &info);
>   }
>   
> -static int net_failover_slave_pre_register(struct net_device *slave_dev,
> -					   struct net_device *failover_dev)
> +static struct net_device *get_net_failover_bymac(const u8 *mac)
>   {
> -	struct net_device *standby_dev, *primary_dev;
> +	struct net_failover_info *nfo_info;
> +
> +	ASSERT_RTNL();
> +
> +	list_for_each_entry(nfo_info, &net_failover_list, list) {
> +		struct net_device *failover_dev = nfo_info->failover_dev;
> +
> +		if (ether_addr_equal(mac, failover_dev->perm_addr))
> +			return failover_dev;
> +	}
> +
> +	return NULL;
> +}
> +
> +static int net_failover_register_event(struct net_device *slave_dev)
> +{
> +	struct net_device *failover_dev, *standby_dev, *primary_dev;
>   	struct net_failover_info *nfo_info;
>   	bool slave_is_standby;
>   
> +	failover_dev = get_net_failover_bymac(slave_dev->perm_addr);
> +	if (!failover_dev)
> +		return NOTIFY_DONE;
> +
>   	nfo_info = netdev_priv(failover_dev);
>   	standby_dev = rtnl_dereference(nfo_info->standby_dev);
>   	primary_dev = rtnl_dereference(nfo_info->primary_dev);
>   	slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
>   	if (slave_is_standby ? standby_dev : primary_dev) {
> -		netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
> +		netdev_err(failover_dev,
> +			   "%s attempting to register as slave dev when %s already present\n",
>   			   slave_dev->name,
>   			   slave_is_standby ? "standby" : "primary");
> -		return -EINVAL;
> +		return NOTIFY_DONE;
>   	}
>   
>   	/* We want to allow only a direct attached VF device as a primary
> @@ -484,23 +544,33 @@ static int net_failover_slave_pre_register(struct net_device *slave_dev,
>   	 */
>   	if (!slave_is_standby && (!slave_dev->dev.parent ||
>   				  !dev_is_pci(slave_dev->dev.parent)))
> -		return -EINVAL;
> +		return NOTIFY_DONE;
>   
>   	if (failover_dev->features & NETIF_F_VLAN_CHALLENGED &&
>   	    vlan_uses_dev(failover_dev)) {
> -		netdev_err(failover_dev, "Device %s is VLAN challenged and failover device has VLAN set up\n",
> +		netdev_err(failover_dev,
> +			   "Device %s is VLAN challenged and failover device has VLAN set up\n",
>   			   failover_dev->name);
> -		return -EINVAL;
> +		return NOTIFY_DONE;
>   	}
>   
> -	return 0;
> +	if (netdev_failover_join(slave_dev, failover_dev,
> +				 net_failover_handle_frame)) {
> +		netdev_err(failover_dev, "could not join: %s", slave_dev->name);
> +		return NOTIFY_DONE;
> +	}
> +
> +	/* Trigger rest of setup in process context */
> +	schedule_delayed_work(&nfo_info->standby_init, FAILOVER_SETUP_INTERVAL);
> +
> +	return NOTIFY_OK;
>   }
>   
> -static int net_failover_slave_register(struct net_device *slave_dev,
> -				       struct net_device *failover_dev)
> +static void __net_failover_setup(struct net_device *failover_dev)
>   {
> +	struct net_failover_info *nfo_info = netdev_priv(failover_dev);
> +	struct net_device *slave_dev = rtnl_dereference(nfo_info->standby_dev);
>   	struct net_device *standby_dev, *primary_dev;
> -	struct net_failover_info *nfo_info;
>   	bool slave_is_standby;
>   	u32 orig_mtu;
>   	int err;
> @@ -509,13 +579,12 @@ static int net_failover_slave_register(struct net_device *slave_dev,
>   	orig_mtu = slave_dev->mtu;
>   	err = dev_set_mtu(slave_dev, failover_dev->mtu);
>   	if (err) {
> -		netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
> +		netdev_err(failover_dev,
> +			   "unable to change mtu of %s to %u register failed\n",
>   			   slave_dev->name, failover_dev->mtu);
>   		goto done;
>   	}
>   
> -	dev_hold(slave_dev);
> -
>   	if (netif_running(failover_dev)) {
>   		err = dev_open(slave_dev);
>   		if (err && (err != -EBUSY)) {
> @@ -537,7 +606,6 @@ static int net_failover_slave_register(struct net_device *slave_dev,
>   		goto err_vlan_add;
>   	}
>   
> -	nfo_info = netdev_priv(failover_dev);
>   	standby_dev = rtnl_dereference(nfo_info->standby_dev);
>   	primary_dev = rtnl_dereference(nfo_info->primary_dev);
>   	slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
> @@ -562,52 +630,56 @@ static int net_failover_slave_register(struct net_device *slave_dev,
>   	netdev_info(failover_dev, "failover %s slave:%s registered\n",
>   		    slave_is_standby ? "standby" : "primary", slave_dev->name);
>   
> -	return 0;
> +	return;
>   
>   err_vlan_add:
>   	dev_uc_unsync(slave_dev, failover_dev);
>   	dev_mc_unsync(slave_dev, failover_dev);
>   	dev_close(slave_dev);
>   err_dev_open:
> -	dev_put(slave_dev);
>   	dev_set_mtu(slave_dev, orig_mtu);
>   done:
> -	return err;
> +	return;
>   }
>   
> -static int net_failover_slave_pre_unregister(struct net_device *slave_dev,
> -					     struct net_device *failover_dev)
> +static void net_failover_setup(struct work_struct *w)
>   {
> -	struct net_device *standby_dev, *primary_dev;
> -	struct net_failover_info *nfo_info;
> +	struct net_failover_info *nfo_info
> +		= container_of(w, struct net_failover_info, standby_init.work);
> +	struct net_device *failover_dev = nfo_info->failover_dev;
>   
> -	nfo_info = netdev_priv(failover_dev);
> -	primary_dev = rtnl_dereference(nfo_info->primary_dev);
> -	standby_dev = rtnl_dereference(nfo_info->standby_dev);
> -
> -	if (slave_dev != primary_dev && slave_dev != standby_dev)
> -		return -ENODEV;
> +	/* handle race with cancel delayed work on removal */
> +	if (!rtnl_trylock()) {
> +		schedule_delayed_work(&nfo_info->standby_init, 0);
> +		return;
> +	}
>   
> -	return 0;
> +	__net_failover_setup(failover_dev);
> +	rtnl_unlock();
>   }
>   
> -static int net_failover_slave_unregister(struct net_device *slave_dev,
> -					 struct net_device *failover_dev)
> +static int net_failover_unregister_event(struct net_device *slave_dev)
>   {
> -	struct net_device *standby_dev, *primary_dev;
> +	struct net_device *failover_dev, *primary_dev, *standby_dev;
>   	struct net_failover_info *nfo_info;
>   	bool slave_is_standby;
>   
> +	failover_dev = netdev_failover_upper_get(slave_dev);
> +	if (!failover_dev)
> +		return NOTIFY_DONE;
> +
>   	nfo_info = netdev_priv(failover_dev);
>   	primary_dev = rtnl_dereference(nfo_info->primary_dev);
>   	standby_dev = rtnl_dereference(nfo_info->standby_dev);
>   
> +	if (slave_dev != primary_dev && slave_dev != standby_dev)
> +		return NOTIFY_DONE;
> +
>   	vlan_vids_del_by_dev(slave_dev, failover_dev);
>   	dev_uc_unsync(slave_dev, failover_dev);
>   	dev_mc_unsync(slave_dev, failover_dev);
>   	dev_close(slave_dev);
>   
> -	nfo_info = netdev_priv(failover_dev);
>   	dev_get_stats(failover_dev, &nfo_info->failover_stats);
>   
>   	slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
> @@ -628,22 +700,25 @@ static int net_failover_slave_unregister(struct net_device *slave_dev,
>   	netdev_info(failover_dev, "failover %s slave:%s unregistered\n",
>   		    slave_is_standby ? "standby" : "primary", slave_dev->name);
>   
> -	return 0;
> +	return NOTIFY_OK;
>   }
>   
> -static int net_failover_slave_link_change(struct net_device *slave_dev,
> -					  struct net_device *failover_dev)
> +static int net_failover_link_event(struct net_device *slave_dev)
> +
>   {
> -	struct net_device *primary_dev, *standby_dev;
> +	struct net_device *failover_dev, *primary_dev, *standby_dev;
>   	struct net_failover_info *nfo_info;
>   
> -	nfo_info = netdev_priv(failover_dev);
> +	failover_dev = netdev_failover_upper_get(slave_dev);
> +	if (!failover_dev)
> +		return NOTIFY_DONE;
>   
> +	nfo_info = netdev_priv(failover_dev);
>   	primary_dev = rtnl_dereference(nfo_info->primary_dev);
>   	standby_dev = rtnl_dereference(nfo_info->standby_dev);
>   
>   	if (slave_dev != primary_dev && slave_dev != standby_dev)
> -		return -ENODEV;
> +		return NOTIFY_DONE;
>   
>   	if ((primary_dev && net_failover_xmit_ready(primary_dev)) ||
>   	    (standby_dev && net_failover_xmit_ready(standby_dev))) {
> @@ -657,43 +732,11 @@ static int net_failover_slave_link_change(struct net_device *slave_dev,
>   
>   	net_failover_lower_state_changed(slave_dev, primary_dev, standby_dev);
>   
> -	return 0;
> +	return NOTIFY_DONE;
>   }
>   
> -static int net_failover_slave_name_change(struct net_device *slave_dev,
> -					  struct net_device *failover_dev)
> -{
> -	struct net_device *primary_dev, *standby_dev;
> -	struct net_failover_info *nfo_info;
> -
> -	nfo_info = netdev_priv(failover_dev);
> -
> -	primary_dev = rtnl_dereference(nfo_info->primary_dev);
> -	standby_dev = rtnl_dereference(nfo_info->standby_dev);
> -
> -	if (slave_dev != primary_dev && slave_dev != standby_dev)
> -		return -ENODEV;
> -
> -	/* We need to bring up the slave after the rename by udev in case
> -	 * open failed with EBUSY when it was registered.
> -	 */
> -	dev_open(slave_dev);
> -
> -	return 0;
> -}
> -
> -static struct failover_ops net_failover_ops = {
> -	.slave_pre_register	= net_failover_slave_pre_register,
> -	.slave_register		= net_failover_slave_register,
> -	.slave_pre_unregister	= net_failover_slave_pre_unregister,
> -	.slave_unregister	= net_failover_slave_unregister,
> -	.slave_link_change	= net_failover_slave_link_change,
> -	.slave_name_change	= net_failover_slave_name_change,
> -	.slave_handle_frame	= net_failover_handle_frame,
> -};
> -
>   /**
> - * net_failover_create - Create and register a failover instance
> + * net_failover_create - Create and register a failover device
>    *
>    * @dev: standby netdev
>    *
> @@ -703,13 +746,12 @@ static struct failover_ops net_failover_ops = {
>    * the original standby netdev and a VF netdev with the same MAC gets
>    * registered as primary netdev.
>    *
> - * Return: pointer to failover instance
> + * Return: pointer to failover network device
>    */
> -struct failover *net_failover_create(struct net_device *standby_dev)
> +struct net_device *net_failover_create(struct net_device *standby_dev)
>   {
> -	struct device *dev = standby_dev->dev.parent;
> +	struct net_failover_info *nfo_info;
>   	struct net_device *failover_dev;
> -	struct failover *failover;
>   	int err;
>   
>   	/* Alloc at least 2 queues, for now we are going with 16 assuming
> @@ -717,18 +759,22 @@ struct failover *net_failover_create(struct net_device *standby_dev)
>   	 */
>   	failover_dev = alloc_etherdev_mq(sizeof(struct net_failover_info), 16);
>   	if (!failover_dev) {
> -		dev_err(dev, "Unable to allocate failover_netdev!\n");
> -		return ERR_PTR(-ENOMEM);
> +		netdev_err(standby_dev, "Unable to allocate failover_netdev!\n");
> +		return NULL;
>   	}
>   
> +	nfo_info = netdev_priv(failover_dev);
>   	dev_net_set(failover_dev, dev_net(standby_dev));
> -	SET_NETDEV_DEV(failover_dev, dev);
> +	nfo_info->failover_dev = failover_dev;
> +	INIT_DELAYED_WORK(&nfo_info->standby_init, net_failover_setup);
>   
>   	failover_dev->netdev_ops = &failover_dev_ops;
>   	failover_dev->ethtool_ops = &failover_ethtool_ops;
>   
>   	/* Initialize the device options */
> -	failover_dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
> +	failover_dev->priv_flags |= IFF_UNICAST_FLT |
> +				    IFF_NO_QUEUE |
> +				    IFF_FAILOVER;
>   	failover_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>   				       IFF_TX_SKB_SHARING);
>   
> @@ -746,29 +792,38 @@ struct failover *net_failover_create(struct net_device *standby_dev)
>   	failover_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>   	failover_dev->features |= failover_dev->hw_features;
>   
> -	memcpy(failover_dev->dev_addr, standby_dev->dev_addr,
> -	       failover_dev->addr_len);
> +	ether_addr_copy(failover_dev->dev_addr, standby_dev->dev_addr);
> +	ether_addr_copy(failover_dev->perm_addr, standby_dev->perm_addr);
>   
>   	failover_dev->min_mtu = standby_dev->min_mtu;
>   	failover_dev->max_mtu = standby_dev->max_mtu;
>   
> -	err = register_netdev(failover_dev);
> +	netif_carrier_off(failover_dev);
> +
> +	rtnl_lock();
> +	err = register_netdevice(failover_dev);
>   	if (err) {
> -		dev_err(dev, "Unable to register failover_dev!\n");
> +		netdev_err(standby_dev, "Unable to register failover_dev!\n");
>   		goto err_register_netdev;
>   	}
>   
> -	netif_carrier_off(failover_dev);
> +	err = netdev_failover_join(standby_dev, failover_dev,
> +				   net_failover_handle_frame);
> +	if (err) {
> +		netdev_err(failover_dev, "Unable to join with %s\n",
> +			   standby_dev->name);
> +		goto err_failover_join;
> +	}
>   
> -	failover = failover_register(failover_dev, &net_failover_ops);
> -	if (IS_ERR(failover))
> -		goto err_failover_register;
> +	list_add(&nfo_info->list, &net_failover_list);
> +	rtnl_unlock();
>   
> -	return failover;
> +	return failover_dev;
>   
> -err_failover_register:
> -	unregister_netdev(failover_dev);
> +err_failover_join:
> +	unregister_netdevice(failover_dev);
>   err_register_netdev:
> +	rtnl_unlock();
>   	free_netdev(failover_dev);
>   
>   	return ERR_PTR(err);
> @@ -786,31 +841,27 @@ EXPORT_SYMBOL_GPL(net_failover_create);
>    * netdev. Used by paravirtual drivers that use 3-netdev model.
>    *
>    */
> -void net_failover_destroy(struct failover *failover)
> +void net_failover_destroy(struct net_device *failover_dev)
>   {
> -	struct net_failover_info *nfo_info;
> -	struct net_device *failover_dev;
> +	struct net_failover_info *nfo_info = netdev_priv(failover_dev);
>   	struct net_device *slave_dev;
>   
> -	if (!failover)
> -		return;
> -
> -	failover_dev = rcu_dereference(failover->failover_dev);
> -	nfo_info = netdev_priv(failover_dev);
> -
>   	netif_device_detach(failover_dev);
>   
>   	rtnl_lock();
> -
>   	slave_dev = rtnl_dereference(nfo_info->primary_dev);
> -	if (slave_dev)
> -		failover_slave_unregister(slave_dev);
> +	if (slave_dev) {
> +		netdev_failover_unjoin(slave_dev, failover_dev);
> +		dev_put(slave_dev);
> +	}
>   
>   	slave_dev = rtnl_dereference(nfo_info->standby_dev);
> -	if (slave_dev)
> -		failover_slave_unregister(slave_dev);
> +	if (slave_dev) {
> +		netdev_failover_unjoin(slave_dev, failover_dev);
> +		dev_put(slave_dev);
> +	}
>   
> -	failover_unregister(failover);
> +	list_del(&nfo_info->list);
>   
>   	unregister_netdevice(failover_dev);
>   
> @@ -820,9 +871,53 @@ void net_failover_destroy(struct failover *failover)
>   }
>   EXPORT_SYMBOL_GPL(net_failover_destroy);
>   
> +static int net_failover_event(struct notifier_block *this,
> +			      unsigned long event, void *ptr)
> +{
> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> +
> +	/* Skip parent events */
> +	if (netif_is_failover(event_dev))
> +		return NOTIFY_DONE;
> +
> +	/* Avoid non-Ethernet type devices */
> +	if (event_dev->type != ARPHRD_ETHER)
> +		return NOTIFY_DONE;
> +
> +	/* Avoid Vlan dev with same MAC registering as VF */
> +	if (is_vlan_dev(event_dev))
> +		return NOTIFY_DONE;
> +
> +	/* Avoid Bonding master dev with same MAC registering as VF */
> +	if ((event_dev->priv_flags & IFF_BONDING) &&
> +	    (event_dev->flags & IFF_MASTER))
> +		return NOTIFY_DONE;
> +
> +	switch (event) {
> +	case NETDEV_REGISTER:
> +		return net_failover_register_event(event_dev);
> +
> +	case NETDEV_UNREGISTER:
> +		return net_failover_unregister_event(event_dev);
> +
> +	case NETDEV_UP:
> +	case NETDEV_DOWN:
> +	case NETDEV_CHANGE:
> +		return net_failover_link_event(event_dev);
> +
> +	default:
> +		return NOTIFY_DONE;
> +	}
> +}
> +
> +static struct notifier_block net_failover_notifier = {
> +	.notifier_call = net_failover_event,
> +};
> +
>   static __init int
>   net_failover_init(void)
>   {
> +	register_netdevice_notifier(&net_failover_notifier);
>   	return 0;
>   }
>   module_init(net_failover_init);
> @@ -830,6 +925,7 @@ module_init(net_failover_init);
>   static __exit
>   void net_failover_exit(void)
>   {
> +	unregister_netdevice_notifier(&net_failover_notifier);
>   }
>   module_exit(net_failover_exit);
>   
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 6d710b8b41c5..b40ae28dac93 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -215,7 +215,7 @@ struct virtnet_info {
>   	unsigned long guest_offloads;
>   
>   	/* failover when STANDBY feature enabled */
> -	struct failover *failover;
> +	struct net_device *failover;
>   };
>   
>   struct padded_vnet_hdr {
> @@ -2930,11 +2930,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>   	virtnet_init_settings(dev);
>   
>   	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
> -		vi->failover = net_failover_create(vi->dev);
> -		if (IS_ERR(vi->failover)) {
> -			err = PTR_ERR(vi->failover);
> +		err = -ENOMEM;
> +		vi->failover = net_failover_create(dev);
> +		if (!vi->failover)
>   			goto free_vqs;
> -		}
>   	}
>   
>   	err = register_netdev(dev);
> diff --git a/include/net/failover.h b/include/net/failover.h
> index bb15438f39c7..22d6c1369101 100644
> --- a/include/net/failover.h
> +++ b/include/net/failover.h
> @@ -6,31 +6,10 @@
>   
>   #include <linux/netdevice.h>
>   
> -struct failover_ops {
> -	int (*slave_pre_register)(struct net_device *slave_dev,
> -				  struct net_device *failover_dev);
> -	int (*slave_register)(struct net_device *slave_dev,
> -			      struct net_device *failover_dev);
> -	int (*slave_pre_unregister)(struct net_device *slave_dev,
> -				    struct net_device *failover_dev);
> -	int (*slave_unregister)(struct net_device *slave_dev,
> -				struct net_device *failover_dev);
> -	int (*slave_link_change)(struct net_device *slave_dev,
> -				 struct net_device *failover_dev);
> -	int (*slave_name_change)(struct net_device *slave_dev,
> -				 struct net_device *failover_dev);
> -	rx_handler_result_t (*slave_handle_frame)(struct sk_buff **pskb);
> -};
> -
> -struct failover {
> -	struct list_head list;
> -	struct net_device __rcu *failover_dev;
> -	struct failover_ops __rcu *ops;
> -};
> -
> -struct failover *failover_register(struct net_device *dev,
> -				   struct failover_ops *ops);
> -void failover_unregister(struct failover *failover);
> -int failover_slave_unregister(struct net_device *slave_dev);
> +int netdev_failover_join(struct net_device *lower, struct net_device *upper,
> +			 rx_handler_func_t *rx_handler);
> +struct net_device *netdev_failover_upper_get(struct net_device *lower);
> +void netdev_failover_unjoin(struct net_device *lower,
> +			    struct net_device *upper);
>   
>   #endif /* _FAILOVER_H */
> diff --git a/include/net/net_failover.h b/include/net/net_failover.h
> index b12a1c469d1c..a99b3b00b4e3 100644
> --- a/include/net/net_failover.h
> +++ b/include/net/net_failover.h
> @@ -6,35 +6,7 @@
>   
>   #include <net/failover.h>
>   
> -/* failover state */
> -struct net_failover_info {
> -	/* primary netdev with same MAC */
> -	struct net_device __rcu *primary_dev;
> -
> -	/* standby netdev */
> -	struct net_device __rcu *standby_dev;
> -
> -	/* primary netdev stats */
> -	struct rtnl_link_stats64 primary_stats;
> -
> -	/* standby netdev stats */
> -	struct rtnl_link_stats64 standby_stats;
> -
> -	/* aggregated stats */
> -	struct rtnl_link_stats64 failover_stats;
> -
> -	/* spinlock while updating stats */
> -	spinlock_t stats_lock;
> -};
> -
> -struct failover *net_failover_create(struct net_device *standby_dev);
> -void net_failover_destroy(struct failover *failover);
> -
> -#define FAILOVER_VLAN_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
> -				 NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
> -				 NETIF_F_HIGHDMA | NETIF_F_LRO)
> -
> -#define FAILOVER_ENC_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
> -				 NETIF_F_RXCSUM | NETIF_F_ALL_TSO)
> +struct net_device *net_failover_create(struct net_device *standby_dev);
> +void net_failover_destroy(struct net_device *failover_dev);
>   
>   #endif /* _NET_FAILOVER_H */
> diff --git a/net/Kconfig b/net/Kconfig
> index f738a6f27665..697d84202695 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -433,17 +433,8 @@ config PAGE_POOL
>          bool
>   
>   config FAILOVER
> -	tristate "Generic failover module"
> -	help
> -	  The failover module provides a generic interface for paravirtual
> -	  drivers to register a netdev and a set of ops with a failover
> -	  instance. The ops are used as event handlers that get called to
> -	  handle netdev register/unregister/link change/name change events
> -	  on slave pci ethernet devices with the same mac address as the
> -	  failover netdev. This enables paravirtual drivers to use a
> -	  VF as an accelerated low latency datapath. It also allows live
> -	  migration of VMs with direct attached VFs by failing over to the
> -	  paravirtual datapath when the VF is unplugged.
> +	bool
> +	default n
>   
>   endif   # if NET
>   
> diff --git a/net/core/failover.c b/net/core/failover.c
> index 4a92a98ccce9..499f0fd7e4d3 100644
> --- a/net/core/failover.c
> +++ b/net/core/failover.c
> @@ -1,10 +1,8 @@
>   // SPDX-License-Identifier: GPL-2.0
>   /* Copyright (c) 2018, Intel Corporation. */
>   
> -/* A common module to handle registrations and notifications for paravirtual
> +/* A library for managing chained upper/oower devices such as
>    * drivers to enable accelerated datapath and support VF live migration.
> - *
> - * The notifier and event handling code is based on netvsc driver.
>    */
>   
>   #include <linux/module.h>
> @@ -14,302 +12,62 @@
>   #include <linux/if_vlan.h>
>   #include <net/failover.h>
>   
> -static LIST_HEAD(failover_list);
> -static DEFINE_SPINLOCK(failover_lock);
> -
> -static struct net_device *failover_get_bymac(u8 *mac, struct failover_ops **ops)
> -{
> -	struct net_device *failover_dev;
> -	struct failover *failover;
> -
> -	spin_lock(&failover_lock);
> -	list_for_each_entry(failover, &failover_list, list) {
> -		failover_dev = rtnl_dereference(failover->failover_dev);
> -		if (ether_addr_equal(failover_dev->perm_addr, mac)) {
> -			*ops = rtnl_dereference(failover->ops);
> -			spin_unlock(&failover_lock);
> -			return failover_dev;
> -		}
> -	}
> -	spin_unlock(&failover_lock);
> -	return NULL;
> -}
> -
> -/**
> - * failover_slave_register - Register a slave netdev
> - *
> - * @slave_dev: slave netdev that is being registered
> - *
> - * Registers a slave device to a failover instance. Only ethernet devices
> - * are supported.
> - */
> -static int failover_slave_register(struct net_device *slave_dev)
> +/* failover_join - Join an lower netdev with an upper device. */
> +int netdev_failover_join(struct net_device *lower_dev,
> +			 struct net_device *upper_dev,
> +			 rx_handler_func_t *rx_handler)
>   {
> -	struct netdev_lag_upper_info lag_upper_info;
> -	struct net_device *failover_dev;
> -	struct failover_ops *fops;
>   	int err;
>   
> -	if (slave_dev->type != ARPHRD_ETHER)
> -		goto done;
> -
>   	ASSERT_RTNL();
>   
> -	failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
> -	if (!failover_dev)
> -		goto done;
> +	/* Don't allow joining devices of different protocols */
> +	if (upper_dev->type != lower_dev->type)
> +		return -EINVAL;
>   
> -	if (fops && fops->slave_pre_register &&
> -	    fops->slave_pre_register(slave_dev, failover_dev))
> -		goto done;
> -
> -	err = netdev_rx_handler_register(slave_dev, fops->slave_handle_frame,
> -					 failover_dev);
> +	err = netdev_rx_handler_register(lower_dev, rx_handler, upper_dev);
>   	if (err) {
> -		netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
> +		netdev_err(lower_dev,
> +			   "can not register failover rx handler (err = %d)\n",
>   			   err);
> -		goto done;
> +		return err;
>   	}
>   
> -	lag_upper_info.tx_type = NETDEV_LAG_TX_TYPE_ACTIVEBACKUP;
> -	err = netdev_master_upper_dev_link(slave_dev, failover_dev, NULL,
> -					   &lag_upper_info, NULL);
> +	err = netdev_master_upper_dev_link(lower_dev, upper_dev, NULL,
> +					   NULL, NULL);
>   	if (err) {
> -		netdev_err(slave_dev, "can not set failover device %s (err = %d)\n",
> -			   failover_dev->name, err);
> -		goto err_upper_link;
> +		netdev_err(lower_dev,
> +			   "can not set failover device %s (err = %d)\n",
> +			   upper_dev->name, err);
> +		netdev_rx_handler_unregister(lower_dev);
> +		return err;
>   	}
>   
> -	slave_dev->priv_flags |= IFF_FAILOVER_SLAVE;
> -
> -	if (fops && fops->slave_register &&
> -	    !fops->slave_register(slave_dev, failover_dev))
> -		return NOTIFY_OK;
> -
> -	netdev_upper_dev_unlink(slave_dev, failover_dev);
> -	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
> -err_upper_link:
> -	netdev_rx_handler_unregister(slave_dev);
> -done:
> -	return NOTIFY_DONE;
> -}
> -
> -/**
> - * failover_slave_unregister - Unregister a slave netdev
> - *
> - * @slave_dev: slave netdev that is being unregistered
> - *
> - * Unregisters a slave device from a failover instance.
> - */
> -int failover_slave_unregister(struct net_device *slave_dev)
> -{
> -	struct net_device *failover_dev;
> -	struct failover_ops *fops;
> -
> -	if (!netif_is_failover_slave(slave_dev))
> -		goto done;
> -
> -	ASSERT_RTNL();
> -
> -	failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
> -	if (!failover_dev)
> -		goto done;
> -
> -	if (fops && fops->slave_pre_unregister &&
> -	    fops->slave_pre_unregister(slave_dev, failover_dev))
> -		goto done;
> -
> -	netdev_rx_handler_unregister(slave_dev);
> -	netdev_upper_dev_unlink(slave_dev, failover_dev);
> -	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
> -
> -	if (fops && fops->slave_unregister &&
> -	    !fops->slave_unregister(slave_dev, failover_dev))
> -		return NOTIFY_OK;
> -
> -done:
> -	return NOTIFY_DONE;
> +	dev_hold(lower_dev);
> +	lower_dev->priv_flags |= IFF_FAILOVER_SLAVE;
> +	return 0;
>   }
> -EXPORT_SYMBOL_GPL(failover_slave_unregister);
> +EXPORT_SYMBOL_GPL(netdev_failover_join);
>   
> -static int failover_slave_link_change(struct net_device *slave_dev)
> +/* Find upper network device for failover slave device */
> +struct net_device *netdev_failover_upper_get(struct net_device *lower_dev)
>   {
> -	struct net_device *failover_dev;
> -	struct failover_ops *fops;
> -
> -	if (!netif_is_failover_slave(slave_dev))
> -		goto done;
> -
> -	ASSERT_RTNL();
> -
> -	failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
> -	if (!failover_dev)
> -		goto done;
> -
> -	if (!netif_running(failover_dev))
> -		goto done;
> +	if (!netif_is_failover_slave(lower_dev))
> +		return NULL;
>   
> -	if (fops && fops->slave_link_change &&
> -	    !fops->slave_link_change(slave_dev, failover_dev))
> -		return NOTIFY_OK;
> -
> -done:
> -	return NOTIFY_DONE;
> +	return netdev_master_upper_dev_get(lower_dev);
>   }
> +EXPORT_SYMBOL_GPL(netdev_failover_upper_get);
>   
> -static int failover_slave_name_change(struct net_device *slave_dev)
> +/* failover_unjoin - Break connection between lower and upper device. */
> +void netdev_failover_unjoin(struct net_device *lower_dev,
> +			    struct net_device *upper_dev)
>   {
> -	struct net_device *failover_dev;
> -	struct failover_ops *fops;
> -
> -	if (!netif_is_failover_slave(slave_dev))
> -		goto done;
> -
>   	ASSERT_RTNL();
>   
> -	failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
> -	if (!failover_dev)
> -		goto done;
> -
> -	if (!netif_running(failover_dev))
> -		goto done;
> -
> -	if (fops && fops->slave_name_change &&
> -	    !fops->slave_name_change(slave_dev, failover_dev))
> -		return NOTIFY_OK;
> -
> -done:
> -	return NOTIFY_DONE;
> -}
> -
> -static int
> -failover_event(struct notifier_block *this, unsigned long event, void *ptr)
> -{
> -	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> -
> -	/* Skip parent events */
> -	if (netif_is_failover(event_dev))
> -		return NOTIFY_DONE;
> -
> -	switch (event) {
> -	case NETDEV_REGISTER:
> -		return failover_slave_register(event_dev);
> -	case NETDEV_UNREGISTER:
> -		return failover_slave_unregister(event_dev);
> -	case NETDEV_UP:
> -	case NETDEV_DOWN:
> -	case NETDEV_CHANGE:
> -		return failover_slave_link_change(event_dev);
> -	case NETDEV_CHANGENAME:
> -		return failover_slave_name_change(event_dev);
> -	default:
> -		return NOTIFY_DONE;
> -	}
> -}
> -
> -static struct notifier_block failover_notifier = {
> -	.notifier_call = failover_event,
> -};
> -
> -static void
> -failover_existing_slave_register(struct net_device *failover_dev)
> -{
> -	struct net *net = dev_net(failover_dev);
> -	struct net_device *dev;
> -
> -	rtnl_lock();
> -	for_each_netdev(net, dev) {
> -		if (netif_is_failover(dev))
> -			continue;
> -		if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr))
> -			failover_slave_register(dev);
> -	}
> -	rtnl_unlock();
> -}
> -
> -/**
> - * failover_register - Register a failover instance
> - *
> - * @dev: failover netdev
> - * @ops: failover ops
> - *
> - * Allocate and register a failover instance for a failover netdev. ops
> - * provides handlers for slave device register/unregister/link change/
> - * name change events.
> - *
> - * Return: pointer to failover instance
> - */
> -struct failover *failover_register(struct net_device *dev,
> -				   struct failover_ops *ops)
> -{
> -	struct failover *failover;
> -
> -	if (dev->type != ARPHRD_ETHER)
> -		return ERR_PTR(-EINVAL);
> -
> -	failover = kzalloc(sizeof(*failover), GFP_KERNEL);
> -	if (!failover)
> -		return ERR_PTR(-ENOMEM);
> -
> -	rcu_assign_pointer(failover->ops, ops);
> -	dev_hold(dev);
> -	dev->priv_flags |= IFF_FAILOVER;
> -	rcu_assign_pointer(failover->failover_dev, dev);
> -
> -	spin_lock(&failover_lock);
> -	list_add_tail(&failover->list, &failover_list);
> -	spin_unlock(&failover_lock);
> -
> -	netdev_info(dev, "failover master:%s registered\n", dev->name);
> -
> -	failover_existing_slave_register(dev);
> -
> -	return failover;
> -}
> -EXPORT_SYMBOL_GPL(failover_register);
> -
> -/**
> - * failover_unregister - Unregister a failover instance
> - *
> - * @failover: pointer to failover instance
> - *
> - * Unregisters and frees a failover instance.
> - */
> -void failover_unregister(struct failover *failover)
> -{
> -	struct net_device *failover_dev;
> -
> -	failover_dev = rcu_dereference(failover->failover_dev);
> -
> -	netdev_info(failover_dev, "failover master:%s unregistered\n",
> -		    failover_dev->name);
> -
> -	failover_dev->priv_flags &= ~IFF_FAILOVER;
> -	dev_put(failover_dev);
> -
> -	spin_lock(&failover_lock);
> -	list_del(&failover->list);
> -	spin_unlock(&failover_lock);
> -
> -	kfree(failover);
> +	netdev_rx_handler_unregister(lower_dev);
> +	netdev_upper_dev_unlink(lower_dev, upper_dev);
> +	dev_put(lower_dev);
> +	lower_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
>   }
> -EXPORT_SYMBOL_GPL(failover_unregister);
> -
> -static __init int
> -failover_init(void)
> -{
> -	register_netdevice_notifier(&failover_notifier);
> -
> -	return 0;
> -}
> -module_init(failover_init);
> -
> -static __exit
> -void failover_exit(void)
> -{
> -	unregister_netdevice_notifier(&failover_notifier);
> -}
> -module_exit(failover_exit);
> -
> -MODULE_DESCRIPTION("Generic failover infrastructure/interface");
> -MODULE_LICENSE("GPL v2");
> +EXPORT_SYMBOL_GPL(netdev_failover_unjoin);
Stephen Hemminger June 5, 2018, 5:45 p.m. UTC | #2
On Tue, 5 Jun 2018 10:22:13 -0700
"Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:

> On 6/4/2018 8:42 PM, Stephen Hemminger wrote:
> > The net failover should be a simple library, not a virtual
> > object with function callbacks (see callback hell).
> > The code is simpler is smaller both for the netvsc and virtio use case.  
> 
> I quickly tried this patch and it breaks virtio-net in standby mode.
> I don't see failover netdev, unloading virtio-net causes a crash.

I said it wasn't tested. Not surprising. Don't have a version of KVM
that supports standby (and not going to build KVM from scratch for this).

> 
> With these changes, there is very minimal code that is shared between
> netvsc and virtio-net. The notifier and event handling code and the
> lookup_bymac routines are now duplicated in both the drivers. I thought
> we wanted to keep this code common between the 2 drivers and we went through
> multiple revisions to make sure that it works with both netvsc's 2 netdev
> and virtio-net's 3 netdev models.

The sharing is what needs to be shared; it turns out not much
is common really. The total code size is less with this version.
Also, since even with nested virtualization, both drivers will not
be present on same guest at same time.

> The reason for the indirect ops is to support these 2 different models and
> i am not sure if the overhead of the callbacks is that significant considering
> that they are not called in the hot path.

The callbacks invert the functionality. It makes everything bottom up.
David Miller June 5, 2018, 6:14 p.m. UTC | #3
From: Stephen Hemminger <stephen@networkplumber.org>
Date: Tue, 5 Jun 2018 10:45:10 -0700

> I said it wasn't tested. Not surprising. Don't have a version of KVM
> that supports standby (and not going to build KVM from scratch for
> this).

It would definitely help me if you put "RFC" in the subject line
for patches which aren't tested :-)

Thanks.
Michael S. Tsirkin June 5, 2018, 6:35 p.m. UTC | #4
Thanks, I think this is nice patch but I wonder whether it can be split
up somewhat. Not all of it is uncontroversial.

On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:
>   * The matching of secondary device to primary device policy
>     is up to the network device. Both net_failover and netvsc
>     will use MAC for now but can change separately.

I actually suspect both will change to a serial number
down the road.

>   * The match policy is only used during initial discovery; after
>     that the secondary device knows what the upper device is because
>     of the parent/child relationship; no searching is required.

That would obviously be an improvement - does it have to be tied with
rest of changes?

>   * Now, netvsc and net_failover use the same delayed work type
>     mechanism for setup. Previously, net_failover code was triggering off
>     name change but a similar policy was rejected for netvsc.
>     "what is good for the goose is good for the gander"

I don't really understand what you are saying here.  I think the delayed
hack is kind of ugly and seems racy.  Current failover code was rejected
by whom?  Why is new one good and for whom?  Did you want to do a name
change in netvsc but it was rejected? Could you clarify please?

>   * The net_failover private device info 'struct net_failover_info'
>     should have been private to the driver file, not a visible
>     API.
> 
>   * The net_failover device should use SET_NETDEV_DEV
>     that is intended only for physical devices not virtual devices.

You mean should not.

>   * No point in having DocBook style comments on a driver file.
>     They only make sense on an external exposed API.
> 
>   * net_failover only supports Ethernet, so use ether_addr_copy.

It is since you need to know about all the things you need to copy, and
because of mac matching.  But it isn't too much effort to add more
transports and I don't see value in going in the reverse direction and
making it more ethernet specific that it already is.

>   * Set permanent and current address of net_failover device
>     to match the primary.
> 
>   * Carrier should be marked off before registering device
>     the net_failover device.

Are above two bugfixes?

>   * Use netdev_XXX for log messages, in net_failover (not dev_xxx)
> 
>   * Since failover infrastructure is about linking devices just
>     use RTNL no need for other locking in init and teardown.
> 
>   * Don't bother with ERR_PTR() style return if only possible
>     return is success or no memory.
> 
>   * As much as possible, the terms master and slave should be avoided
>     because of their cultural connotations.

Also for consistency, failover is calling these primary and standby now.

> Note; this code has been tested on Hyper-V
> but is compile tested only on virtio.
> 
> Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
> Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
> ---
> 
> Although this patch needs to go into 4.18 (linux-net),

I'd rather we focused on fixing bugs in 4.18, and left refactoring to
4.19.

At some point you said refactoring is needed to support matching using
the serial number, but I see this didn't make 4.18. So no rush IMHO.

> this version is based against net-next because net-next
> hasn't been merged into linux-net yet.
> 
> 
>  drivers/net/hyperv/hyperv_net.h |   3 +-
>  drivers/net/hyperv/netvsc_drv.c | 173 +++++++++++------
>  drivers/net/net_failover.c      | 312 ++++++++++++++++++++-----------
>  drivers/net/virtio_net.c        |   9 +-
>  include/net/failover.h          |  31 +---
>  include/net/net_failover.h      |  32 +---
>  net/Kconfig                     |  13 +-
>  net/core/failover.c             | 316 ++++----------------------------
>  8 files changed, 373 insertions(+), 516 deletions(-)
> 
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 99d8e7398a5b..c7d25d10765e 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -902,6 +902,8 @@ struct net_device_context {
>  	struct hv_device *device_ctx;
>  	/* netvsc_device */
>  	struct netvsc_device __rcu *nvdev;
> +	/* list of netvsc net_devices */
> +	struct list_head list;
>  	/* reconfigure work */
>  	struct delayed_work dwork;
>  	/* last reconfig time */
> @@ -933,7 +935,6 @@ struct net_device_context {
>  	/* Serial number of the VF to team with */
>  	u32 vf_serial;
>  
> -	struct failover *failover;
>  };
>  
>  /* Per channel data */
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index bef4d55a108c..074e6b8578df 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -70,6 +70,8 @@ static int debug = -1;
>  module_param(debug, int, 0444);
>  MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
>  
> +static LIST_HEAD(netvsc_dev_list);
> +
>  static void netvsc_change_rx_flags(struct net_device *net, int change)
>  {
>  	struct net_device_context *ndev_ctx = netdev_priv(net);
> @@ -1846,101 +1848,120 @@ static void netvsc_vf_setup(struct work_struct *w)
>  	}
>  
>  	vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
> -	if (vf_netdev)
> +	if (vf_netdev) {
>  		__netvsc_vf_setup(ndev, vf_netdev);
> -
> +		dev_put(vf_netdev);
> +	}
>  	rtnl_unlock();
>  }
>  
> -static int netvsc_pre_register_vf(struct net_device *vf_netdev,
> -				  struct net_device *ndev)
> +static struct net_device *get_netvsc_bymac(const u8 *mac)
>  {
> -	struct net_device_context *net_device_ctx;
> -	struct netvsc_device *netvsc_dev;
> +	struct net_device_context *ndev_ctx;
>  
> -	net_device_ctx = netdev_priv(ndev);
> -	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
> -	if (!netvsc_dev || rtnl_dereference(net_device_ctx->vf_netdev))
> -		return -ENODEV;
> +	ASSERT_RTNL();
>  
> -	return 0;
> +	list_for_each_entry(ndev_ctx, &netvsc_dev_list, list) {
> +		struct net_device *dev = hv_get_drvdata(ndev_ctx->device_ctx);
> +
> +		if (ether_addr_equal(mac, dev->perm_addr))
> +			return dev;
> +	}
> +
> +	return NULL;
>  }
>  
> -static int netvsc_register_vf(struct net_device *vf_netdev,
> -			      struct net_device *ndev)
> +static int netvsc_register_vf(struct net_device *vf_netdev)
>  {
> -	struct net_device_context *ndev_ctx = netdev_priv(ndev);
> +	struct net_device *ndev;
> +	struct net_device_context *ndev_ctx;
> +
> +	/* Must use Ethernet addresses */
> +	if (vf_netdev->addr_len != ETH_ALEN)
> +		return NOTIFY_DONE;
> +
> +	/* VF must be a physical device not VLAN, etc */
> +	if (!vf_netdev->dev.parent)
> +		return NOTIFY_DONE;
> +
> +	/* Use the MAC address to locate the synthetic interface to
> +	 * associate with the VF interface.
> +	 */
> +	ndev = get_netvsc_bymac(vf_netdev->perm_addr);
> +	if (!ndev)
> +		return NOTIFY_DONE;
> +
> +	/* If network device is being removed, don't do anything */
> +	ndev_ctx = netdev_priv(ndev);
> +	if (!rtnl_dereference(ndev_ctx->nvdev))
> +		return NOTIFY_DONE;
> +
> +	if (netdev_failover_join(vf_netdev, ndev, netvsc_vf_handle_frame)) {
> +		netdev_err(vf_netdev, "could not join: %s", ndev->name);
> +		return NOTIFY_DONE;
> +	}
>  
>  	/* set slave flag before open to prevent IPv6 addrconf */
>  	vf_netdev->flags |= IFF_SLAVE;
>  
> +	dev_hold(vf_netdev);
> +
>  	schedule_delayed_work(&ndev_ctx->vf_takeover, VF_TAKEOVER_INT);
>  
>  	call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
>  
>  	netdev_info(vf_netdev, "joined to %s\n", ndev->name);
>  
> -	dev_hold(vf_netdev);
>  	rcu_assign_pointer(ndev_ctx->vf_netdev, vf_netdev);
>  
> -	return 0;
> +	return NOTIFY_OK;
>  }
>  
>  /* VF up/down change detected, schedule to change data path */
> -static int netvsc_vf_changed(struct net_device *vf_netdev,
> -			     struct net_device *ndev)
> +static int netvsc_vf_changed(struct net_device *vf_netdev)
>  {
>  	struct net_device_context *net_device_ctx;
>  	struct netvsc_device *netvsc_dev;
> +	struct net_device *ndev;
>  	bool vf_is_up = netif_running(vf_netdev);
>  
> +	ndev = netdev_failover_upper_get(vf_netdev);
> +	if (!ndev)
> +		return NOTIFY_DONE;
> +
>  	net_device_ctx = netdev_priv(ndev);
>  	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
>  	if (!netvsc_dev)
> -		return -ENODEV;
> +		return NOTIFY_DONE;
>  
>  	netvsc_switch_datapath(ndev, vf_is_up);
>  	netdev_info(ndev, "Data path switched %s VF: %s\n",
>  		    vf_is_up ? "to" : "from", vf_netdev->name);
>  
> -	return 0;
> +	return NOTIFY_OK;
>  }
>  
> -static int netvsc_pre_unregister_vf(struct net_device *vf_netdev,
> -				    struct net_device *ndev)
> +static int netvsc_unregister_vf(struct net_device *vf_netdev)
>  {
>  	struct net_device_context *net_device_ctx;
> +	struct net_device *ndev;
>  
> -	net_device_ctx = netdev_priv(ndev);
> -	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
> -
> -	return 0;
> -}
> -
> -static int netvsc_unregister_vf(struct net_device *vf_netdev,
> -				struct net_device *ndev)
> -{
> -	struct net_device_context *net_device_ctx;
> +	ndev = netdev_failover_upper_get(vf_netdev);
> +	if (!ndev)
> +		return NOTIFY_DONE;
>  
>  	net_device_ctx = netdev_priv(ndev);
> +	if (cancel_delayed_work_sync(&net_device_ctx->vf_takeover))
> +		dev_put(vf_netdev);
>  
>  	netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
>  
> +	netdev_failover_unjoin(vf_netdev, ndev);
>  	RCU_INIT_POINTER(net_device_ctx->vf_netdev, NULL);
> -	dev_put(vf_netdev);
>  
> -	return 0;
> +	return NOTIFY_OK;
>  }
>  
> -static struct failover_ops netvsc_failover_ops = {
> -	.slave_pre_register	= netvsc_pre_register_vf,
> -	.slave_register		= netvsc_register_vf,
> -	.slave_pre_unregister	= netvsc_pre_unregister_vf,
> -	.slave_unregister	= netvsc_unregister_vf,
> -	.slave_link_change	= netvsc_vf_changed,
> -	.slave_handle_frame	= netvsc_vf_handle_frame,
> -};
> -
>  static int netvsc_probe(struct hv_device *dev,
>  			const struct hv_vmbus_device_id *dev_id)
>  {
> @@ -2009,6 +2030,8 @@ static int netvsc_probe(struct hv_device *dev,
>  
>  	memcpy(net->dev_addr, device_info.mac_adr, ETH_ALEN);
>  
> +	net->priv_flags |= IFF_FAILOVER;
> +
>  	/* hw_features computed in rndis_netdev_set_hwcaps() */
>  	net->features = net->hw_features |
>  		NETIF_F_HIGHDMA | NETIF_F_SG |
> @@ -2024,23 +2047,19 @@ static int netvsc_probe(struct hv_device *dev,
>  	else
>  		net->max_mtu = ETH_DATA_LEN;
>  
> -	ret = register_netdev(net);
> +	rtnl_lock();
> +	ret = register_netdevice(net);
>  	if (ret != 0) {
>  		pr_err("Unable to register netdev.\n");
>  		goto register_failed;
>  	}
>  
> -	net_device_ctx->failover = failover_register(net, &netvsc_failover_ops);
> -	if (IS_ERR(net_device_ctx->failover)) {
> -		ret = PTR_ERR(net_device_ctx->failover);
> -		goto err_failover;
> -	}
> -
> -	return ret;
> +	list_add(&net_device_ctx->list, &netvsc_dev_list);
> +	rtnl_unlock();
> +	return 0;
>  
> -err_failover:
> -	unregister_netdev(net);
>  register_failed:
> +	rtnl_unlock();
>  	rndis_filter_device_remove(dev, nvdev);
>  rndis_failed:
>  	free_percpu(net_device_ctx->vf_stats);
> @@ -2079,15 +2098,17 @@ static int netvsc_remove(struct hv_device *dev)
>  	 */
>  	rtnl_lock();
>  	vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
> -	if (vf_netdev)
> -		failover_slave_unregister(vf_netdev);
> +	if (vf_netdev) {
> +		netdev_failover_unjoin(vf_netdev, net);
> +		dev_put(vf_netdev);
> +	}
>  
>  	if (nvdev)
>  		rndis_filter_device_remove(dev, nvdev);
>  
>  	unregister_netdevice(net);
>  
> -	failover_unregister(ndev_ctx->failover);
> +	list_del(&ndev_ctx->list);
>  
>  	rtnl_unlock();
>  	rcu_read_unlock();
> @@ -2115,8 +2136,47 @@ static struct  hv_driver netvsc_drv = {
>  	.remove = netvsc_remove,
>  };
>  
> +/* On Hyper-V, every VF interface is matched with a corresponding
> + * synthetic interface. The synthetic interface is presented first
> + * to the guest. When the corresponding VF instance is registered,
> + * we will take care of switching the data path.
> + */
> +static int netvsc_netdev_event(struct notifier_block *this,
> +			       unsigned long event, void *ptr)
> +{
> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> +
> +	/* Skip parent events */
> +	if (netif_is_failover(event_dev))
> +		return NOTIFY_DONE;
> +
> +	/* Avoid non-Ethernet type devices */
> +	if (event_dev->type != ARPHRD_ETHER)
> +		return NOTIFY_DONE;
> +
> +	switch (event) {
> +	case NETDEV_REGISTER:
> +		return netvsc_register_vf(event_dev);
> +
> +	case NETDEV_UNREGISTER:
> +		return netvsc_unregister_vf(event_dev);
> +
> +	case NETDEV_UP:
> +	case NETDEV_DOWN:
> +		return netvsc_vf_changed(event_dev);
> +
> +	default:
> +		return NOTIFY_DONE;
> +	}
> +}
> +
> +static struct notifier_block netvsc_netdev_notifier = {
> +	.notifier_call = netvsc_netdev_event,
> +};
> +
>  static void __exit netvsc_drv_exit(void)
>  {
> +	unregister_netdevice_notifier(&netvsc_netdev_notifier);
>  	vmbus_driver_unregister(&netvsc_drv);
>  }
>  
> @@ -2136,6 +2196,7 @@ static int __init netvsc_drv_init(void)
>  	if (ret)
>  		return ret;
>  
> +	register_netdevice_notifier(&netvsc_netdev_notifier);
>  	return 0;
>  }
>  
> diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
> index 83f7420ddea5..e0d30527f748 100644
> --- a/drivers/net/net_failover.c
> +++ b/drivers/net/net_failover.c
> @@ -28,6 +28,46 @@
>  #include <uapi/linux/if_arp.h>
>  #include <net/net_failover.h>
>  
> +static LIST_HEAD(net_failover_list);
> +
> +/* failover state */
> +struct net_failover_info {
> +	struct net_device *failover_dev;
> +
> +	/* list of failover virtual devices */
> +	struct list_head list;
> +
> +	/* primary netdev with same MAC */
> +	struct net_device __rcu *primary_dev;
> +
> +	/* standby netdev */
> +	struct net_device __rcu *standby_dev;
> +
> +	/* primary netdev stats */
> +	struct rtnl_link_stats64 primary_stats;
> +
> +	/* standby netdev stats */
> +	struct rtnl_link_stats64 standby_stats;
> +
> +	/* aggregated stats */
> +	struct rtnl_link_stats64 failover_stats;
> +
> +	/* spinlock while updating stats */
> +	spinlock_t stats_lock;
> +
> +	/* delayed setup of slave */
> +	struct delayed_work standby_init;
> +};
> +
> +#define FAILOVER_VLAN_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
> +				 NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
> +				 NETIF_F_HIGHDMA | NETIF_F_LRO)
> +
> +#define FAILOVER_ENC_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
> +				 NETIF_F_RXCSUM | NETIF_F_ALL_TSO)
> +
> +#define FAILOVER_SETUP_INTERVAL	(HZ / 10)
> +
>  static bool net_failover_xmit_ready(struct net_device *dev)
>  {
>  	return netif_running(dev) && netif_carrier_ok(dev);
> @@ -460,22 +500,42 @@ static void net_failover_lower_state_changed(struct net_device *slave_dev,
>  	netdev_lower_state_changed(slave_dev, &info);
>  }
>  
> -static int net_failover_slave_pre_register(struct net_device *slave_dev,
> -					   struct net_device *failover_dev)
> +static struct net_device *get_net_failover_bymac(const u8 *mac)
>  {
> -	struct net_device *standby_dev, *primary_dev;
> +	struct net_failover_info *nfo_info;
> +
> +	ASSERT_RTNL();
> +
> +	list_for_each_entry(nfo_info, &net_failover_list, list) {
> +		struct net_device *failover_dev = nfo_info->failover_dev;
> +
> +		if (ether_addr_equal(mac, failover_dev->perm_addr))
> +			return failover_dev;
> +	}
> +
> +	return NULL;
> +}
> +
> +static int net_failover_register_event(struct net_device *slave_dev)
> +{
> +	struct net_device *failover_dev, *standby_dev, *primary_dev;
>  	struct net_failover_info *nfo_info;
>  	bool slave_is_standby;
>  
> +	failover_dev = get_net_failover_bymac(slave_dev->perm_addr);
> +	if (!failover_dev)
> +		return NOTIFY_DONE;
> +
>  	nfo_info = netdev_priv(failover_dev);
>  	standby_dev = rtnl_dereference(nfo_info->standby_dev);
>  	primary_dev = rtnl_dereference(nfo_info->primary_dev);
>  	slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
>  	if (slave_is_standby ? standby_dev : primary_dev) {
> -		netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
> +		netdev_err(failover_dev,
> +			   "%s attempting to register as slave dev when %s already present\n",
>  			   slave_dev->name,
>  			   slave_is_standby ? "standby" : "primary");
> -		return -EINVAL;
> +		return NOTIFY_DONE;
>  	}
>  
>  	/* We want to allow only a direct attached VF device as a primary
> @@ -484,23 +544,33 @@ static int net_failover_slave_pre_register(struct net_device *slave_dev,
>  	 */
>  	if (!slave_is_standby && (!slave_dev->dev.parent ||
>  				  !dev_is_pci(slave_dev->dev.parent)))
> -		return -EINVAL;
> +		return NOTIFY_DONE;
>  
>  	if (failover_dev->features & NETIF_F_VLAN_CHALLENGED &&
>  	    vlan_uses_dev(failover_dev)) {
> -		netdev_err(failover_dev, "Device %s is VLAN challenged and failover device has VLAN set up\n",
> +		netdev_err(failover_dev,
> +			   "Device %s is VLAN challenged and failover device has VLAN set up\n",
>  			   failover_dev->name);
> -		return -EINVAL;
> +		return NOTIFY_DONE;
>  	}
>  
> -	return 0;
> +	if (netdev_failover_join(slave_dev, failover_dev,
> +				 net_failover_handle_frame)) {
> +		netdev_err(failover_dev, "could not join: %s", slave_dev->name);
> +		return NOTIFY_DONE;
> +	}
> +
> +	/* Trigger rest of setup in process context */
> +	schedule_delayed_work(&nfo_info->standby_init, FAILOVER_SETUP_INTERVAL);
> +
> +	return NOTIFY_OK;
>  }
>  
> -static int net_failover_slave_register(struct net_device *slave_dev,
> -				       struct net_device *failover_dev)
> +static void __net_failover_setup(struct net_device *failover_dev)
>  {
> +	struct net_failover_info *nfo_info = netdev_priv(failover_dev);
> +	struct net_device *slave_dev = rtnl_dereference(nfo_info->standby_dev);
>  	struct net_device *standby_dev, *primary_dev;
> -	struct net_failover_info *nfo_info;
>  	bool slave_is_standby;
>  	u32 orig_mtu;
>  	int err;
> @@ -509,13 +579,12 @@ static int net_failover_slave_register(struct net_device *slave_dev,
>  	orig_mtu = slave_dev->mtu;
>  	err = dev_set_mtu(slave_dev, failover_dev->mtu);
>  	if (err) {
> -		netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
> +		netdev_err(failover_dev,
> +			   "unable to change mtu of %s to %u register failed\n",
>  			   slave_dev->name, failover_dev->mtu);
>  		goto done;
>  	}
>  
> -	dev_hold(slave_dev);
> -
>  	if (netif_running(failover_dev)) {
>  		err = dev_open(slave_dev);
>  		if (err && (err != -EBUSY)) {
> @@ -537,7 +606,6 @@ static int net_failover_slave_register(struct net_device *slave_dev,
>  		goto err_vlan_add;
>  	}
>  
> -	nfo_info = netdev_priv(failover_dev);
>  	standby_dev = rtnl_dereference(nfo_info->standby_dev);
>  	primary_dev = rtnl_dereference(nfo_info->primary_dev);
>  	slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
> @@ -562,52 +630,56 @@ static int net_failover_slave_register(struct net_device *slave_dev,
>  	netdev_info(failover_dev, "failover %s slave:%s registered\n",
>  		    slave_is_standby ? "standby" : "primary", slave_dev->name);
>  
> -	return 0;
> +	return;
>  
>  err_vlan_add:
>  	dev_uc_unsync(slave_dev, failover_dev);
>  	dev_mc_unsync(slave_dev, failover_dev);
>  	dev_close(slave_dev);
>  err_dev_open:
> -	dev_put(slave_dev);
>  	dev_set_mtu(slave_dev, orig_mtu);
>  done:
> -	return err;
> +	return;
>  }
>  
> -static int net_failover_slave_pre_unregister(struct net_device *slave_dev,
> -					     struct net_device *failover_dev)
> +static void net_failover_setup(struct work_struct *w)
>  {
> -	struct net_device *standby_dev, *primary_dev;
> -	struct net_failover_info *nfo_info;
> +	struct net_failover_info *nfo_info
> +		= container_of(w, struct net_failover_info, standby_init.work);
> +	struct net_device *failover_dev = nfo_info->failover_dev;
>  
> -	nfo_info = netdev_priv(failover_dev);
> -	primary_dev = rtnl_dereference(nfo_info->primary_dev);
> -	standby_dev = rtnl_dereference(nfo_info->standby_dev);
> -
> -	if (slave_dev != primary_dev && slave_dev != standby_dev)
> -		return -ENODEV;
> +	/* handle race with cancel delayed work on removal */
> +	if (!rtnl_trylock()) {
> +		schedule_delayed_work(&nfo_info->standby_init, 0);
> +		return;
> +	}
>  
> -	return 0;
> +	__net_failover_setup(failover_dev);
> +	rtnl_unlock();
>  }
>  
> -static int net_failover_slave_unregister(struct net_device *slave_dev,
> -					 struct net_device *failover_dev)
> +static int net_failover_unregister_event(struct net_device *slave_dev)
>  {
> -	struct net_device *standby_dev, *primary_dev;
> +	struct net_device *failover_dev, *primary_dev, *standby_dev;
>  	struct net_failover_info *nfo_info;
>  	bool slave_is_standby;
>  
> +	failover_dev = netdev_failover_upper_get(slave_dev);
> +	if (!failover_dev)
> +		return NOTIFY_DONE;
> +
>  	nfo_info = netdev_priv(failover_dev);
>  	primary_dev = rtnl_dereference(nfo_info->primary_dev);
>  	standby_dev = rtnl_dereference(nfo_info->standby_dev);
>  
> +	if (slave_dev != primary_dev && slave_dev != standby_dev)
> +		return NOTIFY_DONE;
> +
>  	vlan_vids_del_by_dev(slave_dev, failover_dev);
>  	dev_uc_unsync(slave_dev, failover_dev);
>  	dev_mc_unsync(slave_dev, failover_dev);
>  	dev_close(slave_dev);
>  
> -	nfo_info = netdev_priv(failover_dev);
>  	dev_get_stats(failover_dev, &nfo_info->failover_stats);
>  
>  	slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
> @@ -628,22 +700,25 @@ static int net_failover_slave_unregister(struct net_device *slave_dev,
>  	netdev_info(failover_dev, "failover %s slave:%s unregistered\n",
>  		    slave_is_standby ? "standby" : "primary", slave_dev->name);
>  
> -	return 0;
> +	return NOTIFY_OK;
>  }
>  
> -static int net_failover_slave_link_change(struct net_device *slave_dev,
> -					  struct net_device *failover_dev)
> +static int net_failover_link_event(struct net_device *slave_dev)
> +
>  {
> -	struct net_device *primary_dev, *standby_dev;
> +	struct net_device *failover_dev, *primary_dev, *standby_dev;
>  	struct net_failover_info *nfo_info;
>  
> -	nfo_info = netdev_priv(failover_dev);
> +	failover_dev = netdev_failover_upper_get(slave_dev);
> +	if (!failover_dev)
> +		return NOTIFY_DONE;
>  
> +	nfo_info = netdev_priv(failover_dev);
>  	primary_dev = rtnl_dereference(nfo_info->primary_dev);
>  	standby_dev = rtnl_dereference(nfo_info->standby_dev);
>  
>  	if (slave_dev != primary_dev && slave_dev != standby_dev)
> -		return -ENODEV;
> +		return NOTIFY_DONE;
>  
>  	if ((primary_dev && net_failover_xmit_ready(primary_dev)) ||
>  	    (standby_dev && net_failover_xmit_ready(standby_dev))) {
> @@ -657,43 +732,11 @@ static int net_failover_slave_link_change(struct net_device *slave_dev,
>  
>  	net_failover_lower_state_changed(slave_dev, primary_dev, standby_dev);
>  
> -	return 0;
> +	return NOTIFY_DONE;
>  }
>  
> -static int net_failover_slave_name_change(struct net_device *slave_dev,
> -					  struct net_device *failover_dev)
> -{
> -	struct net_device *primary_dev, *standby_dev;
> -	struct net_failover_info *nfo_info;
> -
> -	nfo_info = netdev_priv(failover_dev);
> -
> -	primary_dev = rtnl_dereference(nfo_info->primary_dev);
> -	standby_dev = rtnl_dereference(nfo_info->standby_dev);
> -
> -	if (slave_dev != primary_dev && slave_dev != standby_dev)
> -		return -ENODEV;
> -
> -	/* We need to bring up the slave after the rename by udev in case
> -	 * open failed with EBUSY when it was registered.
> -	 */
> -	dev_open(slave_dev);
> -
> -	return 0;
> -}
> -
> -static struct failover_ops net_failover_ops = {
> -	.slave_pre_register	= net_failover_slave_pre_register,
> -	.slave_register		= net_failover_slave_register,
> -	.slave_pre_unregister	= net_failover_slave_pre_unregister,
> -	.slave_unregister	= net_failover_slave_unregister,
> -	.slave_link_change	= net_failover_slave_link_change,
> -	.slave_name_change	= net_failover_slave_name_change,
> -	.slave_handle_frame	= net_failover_handle_frame,
> -};
> -
>  /**
> - * net_failover_create - Create and register a failover instance
> + * net_failover_create - Create and register a failover device
>   *
>   * @dev: standby netdev
>   *
> @@ -703,13 +746,12 @@ static struct failover_ops net_failover_ops = {
>   * the original standby netdev and a VF netdev with the same MAC gets
>   * registered as primary netdev.
>   *
> - * Return: pointer to failover instance
> + * Return: pointer to failover network device
>   */
> -struct failover *net_failover_create(struct net_device *standby_dev)
> +struct net_device *net_failover_create(struct net_device *standby_dev)
>  {
> -	struct device *dev = standby_dev->dev.parent;
> +	struct net_failover_info *nfo_info;
>  	struct net_device *failover_dev;
> -	struct failover *failover;
>  	int err;
>  
>  	/* Alloc at least 2 queues, for now we are going with 16 assuming
> @@ -717,18 +759,22 @@ struct failover *net_failover_create(struct net_device *standby_dev)
>  	 */
>  	failover_dev = alloc_etherdev_mq(sizeof(struct net_failover_info), 16);
>  	if (!failover_dev) {
> -		dev_err(dev, "Unable to allocate failover_netdev!\n");
> -		return ERR_PTR(-ENOMEM);
> +		netdev_err(standby_dev, "Unable to allocate failover_netdev!\n");
> +		return NULL;
>  	}
>  
> +	nfo_info = netdev_priv(failover_dev);
>  	dev_net_set(failover_dev, dev_net(standby_dev));
> -	SET_NETDEV_DEV(failover_dev, dev);
> +	nfo_info->failover_dev = failover_dev;
> +	INIT_DELAYED_WORK(&nfo_info->standby_init, net_failover_setup);
>  
>  	failover_dev->netdev_ops = &failover_dev_ops;
>  	failover_dev->ethtool_ops = &failover_ethtool_ops;
>  
>  	/* Initialize the device options */
> -	failover_dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
> +	failover_dev->priv_flags |= IFF_UNICAST_FLT |
> +				    IFF_NO_QUEUE |
> +				    IFF_FAILOVER;
>  	failover_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>  				       IFF_TX_SKB_SHARING);
>  
> @@ -746,29 +792,38 @@ struct failover *net_failover_create(struct net_device *standby_dev)
>  	failover_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>  	failover_dev->features |= failover_dev->hw_features;
>  
> -	memcpy(failover_dev->dev_addr, standby_dev->dev_addr,
> -	       failover_dev->addr_len);
> +	ether_addr_copy(failover_dev->dev_addr, standby_dev->dev_addr);
> +	ether_addr_copy(failover_dev->perm_addr, standby_dev->perm_addr);
>  
>  	failover_dev->min_mtu = standby_dev->min_mtu;
>  	failover_dev->max_mtu = standby_dev->max_mtu;
>  
> -	err = register_netdev(failover_dev);
> +	netif_carrier_off(failover_dev);
> +
> +	rtnl_lock();
> +	err = register_netdevice(failover_dev);
>  	if (err) {
> -		dev_err(dev, "Unable to register failover_dev!\n");
> +		netdev_err(standby_dev, "Unable to register failover_dev!\n");
>  		goto err_register_netdev;
>  	}
>  
> -	netif_carrier_off(failover_dev);
> +	err = netdev_failover_join(standby_dev, failover_dev,
> +				   net_failover_handle_frame);
> +	if (err) {
> +		netdev_err(failover_dev, "Unable to join with %s\n",
> +			   standby_dev->name);
> +		goto err_failover_join;
> +	}
>  
> -	failover = failover_register(failover_dev, &net_failover_ops);
> -	if (IS_ERR(failover))
> -		goto err_failover_register;
> +	list_add(&nfo_info->list, &net_failover_list);
> +	rtnl_unlock();
>  
> -	return failover;
> +	return failover_dev;
>  
> -err_failover_register:
> -	unregister_netdev(failover_dev);
> +err_failover_join:
> +	unregister_netdevice(failover_dev);
>  err_register_netdev:
> +	rtnl_unlock();
>  	free_netdev(failover_dev);
>  
>  	return ERR_PTR(err);
> @@ -786,31 +841,27 @@ EXPORT_SYMBOL_GPL(net_failover_create);
>   * netdev. Used by paravirtual drivers that use 3-netdev model.
>   *
>   */
> -void net_failover_destroy(struct failover *failover)
> +void net_failover_destroy(struct net_device *failover_dev)
>  {
> -	struct net_failover_info *nfo_info;
> -	struct net_device *failover_dev;
> +	struct net_failover_info *nfo_info = netdev_priv(failover_dev);
>  	struct net_device *slave_dev;
>  
> -	if (!failover)
> -		return;
> -
> -	failover_dev = rcu_dereference(failover->failover_dev);
> -	nfo_info = netdev_priv(failover_dev);
> -
>  	netif_device_detach(failover_dev);
>  
>  	rtnl_lock();
> -
>  	slave_dev = rtnl_dereference(nfo_info->primary_dev);
> -	if (slave_dev)
> -		failover_slave_unregister(slave_dev);
> +	if (slave_dev) {
> +		netdev_failover_unjoin(slave_dev, failover_dev);
> +		dev_put(slave_dev);
> +	}
>  
>  	slave_dev = rtnl_dereference(nfo_info->standby_dev);
> -	if (slave_dev)
> -		failover_slave_unregister(slave_dev);
> +	if (slave_dev) {
> +		netdev_failover_unjoin(slave_dev, failover_dev);
> +		dev_put(slave_dev);
> +	}
>  
> -	failover_unregister(failover);
> +	list_del(&nfo_info->list);
>  
>  	unregister_netdevice(failover_dev);
>  
> @@ -820,9 +871,53 @@ void net_failover_destroy(struct failover *failover)
>  }
>  EXPORT_SYMBOL_GPL(net_failover_destroy);
>  
> +static int net_failover_event(struct notifier_block *this,
> +			      unsigned long event, void *ptr)
> +{
> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> +
> +	/* Skip parent events */
> +	if (netif_is_failover(event_dev))
> +		return NOTIFY_DONE;
> +
> +	/* Avoid non-Ethernet type devices */
> +	if (event_dev->type != ARPHRD_ETHER)
> +		return NOTIFY_DONE;
> +
> +	/* Avoid Vlan dev with same MAC registering as VF */
> +	if (is_vlan_dev(event_dev))
> +		return NOTIFY_DONE;
> +
> +	/* Avoid Bonding master dev with same MAC registering as VF */
> +	if ((event_dev->priv_flags & IFF_BONDING) &&
> +	    (event_dev->flags & IFF_MASTER))
> +		return NOTIFY_DONE;
> +
> +	switch (event) {
> +	case NETDEV_REGISTER:
> +		return net_failover_register_event(event_dev);
> +
> +	case NETDEV_UNREGISTER:
> +		return net_failover_unregister_event(event_dev);
> +
> +	case NETDEV_UP:
> +	case NETDEV_DOWN:
> +	case NETDEV_CHANGE:
> +		return net_failover_link_event(event_dev);
> +
> +	default:
> +		return NOTIFY_DONE;
> +	}
> +}
> +
> +static struct notifier_block net_failover_notifier = {
> +	.notifier_call = net_failover_event,
> +};
> +
>  static __init int
>  net_failover_init(void)
>  {
> +	register_netdevice_notifier(&net_failover_notifier);
>  	return 0;
>  }
>  module_init(net_failover_init);
> @@ -830,6 +925,7 @@ module_init(net_failover_init);
>  static __exit
>  void net_failover_exit(void)
>  {
> +	unregister_netdevice_notifier(&net_failover_notifier);
>  }
>  module_exit(net_failover_exit);
>  
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 6d710b8b41c5..b40ae28dac93 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -215,7 +215,7 @@ struct virtnet_info {
>  	unsigned long guest_offloads;
>  
>  	/* failover when STANDBY feature enabled */
> -	struct failover *failover;
> +	struct net_device *failover;
>  };
>  
>  struct padded_vnet_hdr {
> @@ -2930,11 +2930,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	virtnet_init_settings(dev);
>  
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
> -		vi->failover = net_failover_create(vi->dev);
> -		if (IS_ERR(vi->failover)) {
> -			err = PTR_ERR(vi->failover);
> +		err = -ENOMEM;
> +		vi->failover = net_failover_create(dev);
> +		if (!vi->failover)
>  			goto free_vqs;
> -		}
>  	}
>  
>  	err = register_netdev(dev);
> diff --git a/include/net/failover.h b/include/net/failover.h
> index bb15438f39c7..22d6c1369101 100644
> --- a/include/net/failover.h
> +++ b/include/net/failover.h
> @@ -6,31 +6,10 @@
>  
>  #include <linux/netdevice.h>
>  
> -struct failover_ops {
> -	int (*slave_pre_register)(struct net_device *slave_dev,
> -				  struct net_device *failover_dev);
> -	int (*slave_register)(struct net_device *slave_dev,
> -			      struct net_device *failover_dev);
> -	int (*slave_pre_unregister)(struct net_device *slave_dev,
> -				    struct net_device *failover_dev);
> -	int (*slave_unregister)(struct net_device *slave_dev,
> -				struct net_device *failover_dev);
> -	int (*slave_link_change)(struct net_device *slave_dev,
> -				 struct net_device *failover_dev);
> -	int (*slave_name_change)(struct net_device *slave_dev,
> -				 struct net_device *failover_dev);
> -	rx_handler_result_t (*slave_handle_frame)(struct sk_buff **pskb);
> -};
> -
> -struct failover {
> -	struct list_head list;
> -	struct net_device __rcu *failover_dev;
> -	struct failover_ops __rcu *ops;
> -};
> -
> -struct failover *failover_register(struct net_device *dev,
> -				   struct failover_ops *ops);
> -void failover_unregister(struct failover *failover);
> -int failover_slave_unregister(struct net_device *slave_dev);
> +int netdev_failover_join(struct net_device *lower, struct net_device *upper,
> +			 rx_handler_func_t *rx_handler);
> +struct net_device *netdev_failover_upper_get(struct net_device *lower);
> +void netdev_failover_unjoin(struct net_device *lower,
> +			    struct net_device *upper);
>  
>  #endif /* _FAILOVER_H */
> diff --git a/include/net/net_failover.h b/include/net/net_failover.h
> index b12a1c469d1c..a99b3b00b4e3 100644
> --- a/include/net/net_failover.h
> +++ b/include/net/net_failover.h
> @@ -6,35 +6,7 @@
>  
>  #include <net/failover.h>
>  
> -/* failover state */
> -struct net_failover_info {
> -	/* primary netdev with same MAC */
> -	struct net_device __rcu *primary_dev;
> -
> -	/* standby netdev */
> -	struct net_device __rcu *standby_dev;
> -
> -	/* primary netdev stats */
> -	struct rtnl_link_stats64 primary_stats;
> -
> -	/* standby netdev stats */
> -	struct rtnl_link_stats64 standby_stats;
> -
> -	/* aggregated stats */
> -	struct rtnl_link_stats64 failover_stats;
> -
> -	/* spinlock while updating stats */
> -	spinlock_t stats_lock;
> -};
> -
> -struct failover *net_failover_create(struct net_device *standby_dev);
> -void net_failover_destroy(struct failover *failover);
> -
> -#define FAILOVER_VLAN_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
> -				 NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
> -				 NETIF_F_HIGHDMA | NETIF_F_LRO)
> -
> -#define FAILOVER_ENC_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
> -				 NETIF_F_RXCSUM | NETIF_F_ALL_TSO)
> +struct net_device *net_failover_create(struct net_device *standby_dev);
> +void net_failover_destroy(struct net_device *failover_dev);
>  
>  #endif /* _NET_FAILOVER_H */
> diff --git a/net/Kconfig b/net/Kconfig
> index f738a6f27665..697d84202695 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -433,17 +433,8 @@ config PAGE_POOL
>         bool
>  
>  config FAILOVER
> -	tristate "Generic failover module"
> -	help
> -	  The failover module provides a generic interface for paravirtual
> -	  drivers to register a netdev and a set of ops with a failover
> -	  instance. The ops are used as event handlers that get called to
> -	  handle netdev register/unregister/link change/name change events
> -	  on slave pci ethernet devices with the same mac address as the
> -	  failover netdev. This enables paravirtual drivers to use a
> -	  VF as an accelerated low latency datapath. It also allows live
> -	  migration of VMs with direct attached VFs by failing over to the
> -	  paravirtual datapath when the VF is unplugged.
> +	bool
> +	default n
>  
>  endif   # if NET
>  
> diff --git a/net/core/failover.c b/net/core/failover.c
> index 4a92a98ccce9..499f0fd7e4d3 100644
> --- a/net/core/failover.c
> +++ b/net/core/failover.c
> @@ -1,10 +1,8 @@
>  // SPDX-License-Identifier: GPL-2.0
>  /* Copyright (c) 2018, Intel Corporation. */
>  
> -/* A common module to handle registrations and notifications for paravirtual
> +/* A library for managing chained upper/oower devices such as
>   * drivers to enable accelerated datapath and support VF live migration.
> - *
> - * The notifier and event handling code is based on netvsc driver.
>   */
>  
>  #include <linux/module.h>
> @@ -14,302 +12,62 @@
>  #include <linux/if_vlan.h>
>  #include <net/failover.h>
>  
> -static LIST_HEAD(failover_list);
> -static DEFINE_SPINLOCK(failover_lock);
> -
> -static struct net_device *failover_get_bymac(u8 *mac, struct failover_ops **ops)
> -{
> -	struct net_device *failover_dev;
> -	struct failover *failover;
> -
> -	spin_lock(&failover_lock);
> -	list_for_each_entry(failover, &failover_list, list) {
> -		failover_dev = rtnl_dereference(failover->failover_dev);
> -		if (ether_addr_equal(failover_dev->perm_addr, mac)) {
> -			*ops = rtnl_dereference(failover->ops);
> -			spin_unlock(&failover_lock);
> -			return failover_dev;
> -		}
> -	}
> -	spin_unlock(&failover_lock);
> -	return NULL;
> -}
> -
> -/**
> - * failover_slave_register - Register a slave netdev
> - *
> - * @slave_dev: slave netdev that is being registered
> - *
> - * Registers a slave device to a failover instance. Only ethernet devices
> - * are supported.
> - */
> -static int failover_slave_register(struct net_device *slave_dev)
> +/* failover_join - Join an lower netdev with an upper device. */
> +int netdev_failover_join(struct net_device *lower_dev,
> +			 struct net_device *upper_dev,
> +			 rx_handler_func_t *rx_handler)
>  {
> -	struct netdev_lag_upper_info lag_upper_info;
> -	struct net_device *failover_dev;
> -	struct failover_ops *fops;
>  	int err;
>  
> -	if (slave_dev->type != ARPHRD_ETHER)
> -		goto done;
> -
>  	ASSERT_RTNL();
>  
> -	failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
> -	if (!failover_dev)
> -		goto done;
> +	/* Don't allow joining devices of different protocols */
> +	if (upper_dev->type != lower_dev->type)
> +		return -EINVAL;
>  
> -	if (fops && fops->slave_pre_register &&
> -	    fops->slave_pre_register(slave_dev, failover_dev))
> -		goto done;
> -
> -	err = netdev_rx_handler_register(slave_dev, fops->slave_handle_frame,
> -					 failover_dev);
> +	err = netdev_rx_handler_register(lower_dev, rx_handler, upper_dev);
>  	if (err) {
> -		netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
> +		netdev_err(lower_dev,
> +			   "can not register failover rx handler (err = %d)\n",
>  			   err);
> -		goto done;
> +		return err;
>  	}
>  
> -	lag_upper_info.tx_type = NETDEV_LAG_TX_TYPE_ACTIVEBACKUP;
> -	err = netdev_master_upper_dev_link(slave_dev, failover_dev, NULL,
> -					   &lag_upper_info, NULL);
> +	err = netdev_master_upper_dev_link(lower_dev, upper_dev, NULL,
> +					   NULL, NULL);
>  	if (err) {
> -		netdev_err(slave_dev, "can not set failover device %s (err = %d)\n",
> -			   failover_dev->name, err);
> -		goto err_upper_link;
> +		netdev_err(lower_dev,
> +			   "can not set failover device %s (err = %d)\n",
> +			   upper_dev->name, err);
> +		netdev_rx_handler_unregister(lower_dev);
> +		return err;
>  	}
>  
> -	slave_dev->priv_flags |= IFF_FAILOVER_SLAVE;
> -
> -	if (fops && fops->slave_register &&
> -	    !fops->slave_register(slave_dev, failover_dev))
> -		return NOTIFY_OK;
> -
> -	netdev_upper_dev_unlink(slave_dev, failover_dev);
> -	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
> -err_upper_link:
> -	netdev_rx_handler_unregister(slave_dev);
> -done:
> -	return NOTIFY_DONE;
> -}
> -
> -/**
> - * failover_slave_unregister - Unregister a slave netdev
> - *
> - * @slave_dev: slave netdev that is being unregistered
> - *
> - * Unregisters a slave device from a failover instance.
> - */
> -int failover_slave_unregister(struct net_device *slave_dev)
> -{
> -	struct net_device *failover_dev;
> -	struct failover_ops *fops;
> -
> -	if (!netif_is_failover_slave(slave_dev))
> -		goto done;
> -
> -	ASSERT_RTNL();
> -
> -	failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
> -	if (!failover_dev)
> -		goto done;
> -
> -	if (fops && fops->slave_pre_unregister &&
> -	    fops->slave_pre_unregister(slave_dev, failover_dev))
> -		goto done;
> -
> -	netdev_rx_handler_unregister(slave_dev);
> -	netdev_upper_dev_unlink(slave_dev, failover_dev);
> -	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
> -
> -	if (fops && fops->slave_unregister &&
> -	    !fops->slave_unregister(slave_dev, failover_dev))
> -		return NOTIFY_OK;
> -
> -done:
> -	return NOTIFY_DONE;
> +	dev_hold(lower_dev);
> +	lower_dev->priv_flags |= IFF_FAILOVER_SLAVE;
> +	return 0;
>  }
> -EXPORT_SYMBOL_GPL(failover_slave_unregister);
> +EXPORT_SYMBOL_GPL(netdev_failover_join);
>  
> -static int failover_slave_link_change(struct net_device *slave_dev)
> +/* Find upper network device for failover slave device */
> +struct net_device *netdev_failover_upper_get(struct net_device *lower_dev)
>  {
> -	struct net_device *failover_dev;
> -	struct failover_ops *fops;
> -
> -	if (!netif_is_failover_slave(slave_dev))
> -		goto done;
> -
> -	ASSERT_RTNL();
> -
> -	failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
> -	if (!failover_dev)
> -		goto done;
> -
> -	if (!netif_running(failover_dev))
> -		goto done;
> +	if (!netif_is_failover_slave(lower_dev))
> +		return NULL;
>  
> -	if (fops && fops->slave_link_change &&
> -	    !fops->slave_link_change(slave_dev, failover_dev))
> -		return NOTIFY_OK;
> -
> -done:
> -	return NOTIFY_DONE;
> +	return netdev_master_upper_dev_get(lower_dev);
>  }
> +EXPORT_SYMBOL_GPL(netdev_failover_upper_get);
>  
> -static int failover_slave_name_change(struct net_device *slave_dev)
> +/* failover_unjoin - Break connection between lower and upper device. */
> +void netdev_failover_unjoin(struct net_device *lower_dev,
> +			    struct net_device *upper_dev)
>  {
> -	struct net_device *failover_dev;
> -	struct failover_ops *fops;
> -
> -	if (!netif_is_failover_slave(slave_dev))
> -		goto done;
> -
>  	ASSERT_RTNL();
>  
> -	failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
> -	if (!failover_dev)
> -		goto done;
> -
> -	if (!netif_running(failover_dev))
> -		goto done;
> -
> -	if (fops && fops->slave_name_change &&
> -	    !fops->slave_name_change(slave_dev, failover_dev))
> -		return NOTIFY_OK;
> -
> -done:
> -	return NOTIFY_DONE;
> -}
> -
> -static int
> -failover_event(struct notifier_block *this, unsigned long event, void *ptr)
> -{
> -	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> -
> -	/* Skip parent events */
> -	if (netif_is_failover(event_dev))
> -		return NOTIFY_DONE;
> -
> -	switch (event) {
> -	case NETDEV_REGISTER:
> -		return failover_slave_register(event_dev);
> -	case NETDEV_UNREGISTER:
> -		return failover_slave_unregister(event_dev);
> -	case NETDEV_UP:
> -	case NETDEV_DOWN:
> -	case NETDEV_CHANGE:
> -		return failover_slave_link_change(event_dev);
> -	case NETDEV_CHANGENAME:
> -		return failover_slave_name_change(event_dev);
> -	default:
> -		return NOTIFY_DONE;
> -	}
> -}
> -
> -static struct notifier_block failover_notifier = {
> -	.notifier_call = failover_event,
> -};
> -
> -static void
> -failover_existing_slave_register(struct net_device *failover_dev)
> -{
> -	struct net *net = dev_net(failover_dev);
> -	struct net_device *dev;
> -
> -	rtnl_lock();
> -	for_each_netdev(net, dev) {
> -		if (netif_is_failover(dev))
> -			continue;
> -		if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr))
> -			failover_slave_register(dev);
> -	}
> -	rtnl_unlock();
> -}
> -
> -/**
> - * failover_register - Register a failover instance
> - *
> - * @dev: failover netdev
> - * @ops: failover ops
> - *
> - * Allocate and register a failover instance for a failover netdev. ops
> - * provides handlers for slave device register/unregister/link change/
> - * name change events.
> - *
> - * Return: pointer to failover instance
> - */
> -struct failover *failover_register(struct net_device *dev,
> -				   struct failover_ops *ops)
> -{
> -	struct failover *failover;
> -
> -	if (dev->type != ARPHRD_ETHER)
> -		return ERR_PTR(-EINVAL);
> -
> -	failover = kzalloc(sizeof(*failover), GFP_KERNEL);
> -	if (!failover)
> -		return ERR_PTR(-ENOMEM);
> -
> -	rcu_assign_pointer(failover->ops, ops);
> -	dev_hold(dev);
> -	dev->priv_flags |= IFF_FAILOVER;
> -	rcu_assign_pointer(failover->failover_dev, dev);
> -
> -	spin_lock(&failover_lock);
> -	list_add_tail(&failover->list, &failover_list);
> -	spin_unlock(&failover_lock);
> -
> -	netdev_info(dev, "failover master:%s registered\n", dev->name);
> -
> -	failover_existing_slave_register(dev);
> -
> -	return failover;
> -}
> -EXPORT_SYMBOL_GPL(failover_register);
> -
> -/**
> - * failover_unregister - Unregister a failover instance
> - *
> - * @failover: pointer to failover instance
> - *
> - * Unregisters and frees a failover instance.
> - */
> -void failover_unregister(struct failover *failover)
> -{
> -	struct net_device *failover_dev;
> -
> -	failover_dev = rcu_dereference(failover->failover_dev);
> -
> -	netdev_info(failover_dev, "failover master:%s unregistered\n",
> -		    failover_dev->name);
> -
> -	failover_dev->priv_flags &= ~IFF_FAILOVER;
> -	dev_put(failover_dev);
> -
> -	spin_lock(&failover_lock);
> -	list_del(&failover->list);
> -	spin_unlock(&failover_lock);
> -
> -	kfree(failover);
> +	netdev_rx_handler_unregister(lower_dev);
> +	netdev_upper_dev_unlink(lower_dev, upper_dev);
> +	dev_put(lower_dev);
> +	lower_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
>  }
> -EXPORT_SYMBOL_GPL(failover_unregister);
> -
> -static __init int
> -failover_init(void)
> -{
> -	register_netdevice_notifier(&failover_notifier);
> -
> -	return 0;
> -}
> -module_init(failover_init);
> -
> -static __exit
> -void failover_exit(void)
> -{
> -	unregister_netdevice_notifier(&failover_notifier);
> -}
> -module_exit(failover_exit);
> -
> -MODULE_DESCRIPTION("Generic failover infrastructure/interface");
> -MODULE_LICENSE("GPL v2");
> +EXPORT_SYMBOL_GPL(netdev_failover_unjoin);
> -- 
> 2.17.1
Stephen Hemminger June 5, 2018, 6:53 p.m. UTC | #5
On Tue, 5 Jun 2018 21:35:26 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> Thanks, I think this is nice patch but I wonder whether it can be split
> up somewhat. Not all of it is uncontroversial.

I started that way, but then I was fixing code that was later deleted.
The big change was eliminating the callbacks.

> 
> On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:
> >   * The matching of secondary device to primary device policy
> >     is up to the network device. Both net_failover and netvsc
> >     will use MAC for now but can change separately.  
> 
> I actually suspect both will change to a serial number
> down the road.
> 
> >   * The match policy is only used during initial discovery; after
> >     that the secondary device knows what the upper device is because
> >     of the parent/child relationship; no searching is required.  
> 
> That would obviously be an improvement - does it have to be tied with
> rest of changes?

This was not possible with the version of the common code that
is in net now.

> 
> >   * Now, netvsc and net_failover use the same delayed work type
> >     mechanism for setup. Previously, net_failover code was triggering off
> >     name change but a similar policy was rejected for netvsc.
> >     "what is good for the goose is good for the gander"  
> 
> I don't really understand what you are saying here.  I think the delayed
> hack is kind of ugly and seems racy.  Current failover code was rejected
> by whom?  Why is new one good and for whom?  Did you want to do a name
> change in netvsc but it was rejected? Could you clarify please?

See:
   https://patchwork.ozlabs.org/patch/851711/

> 
> >   * The net_failover private device info 'struct net_failover_info'
> >     should have been private to the driver file, not a visible
> >     API.
> > 
> >   * The net_failover device should use SET_NETDEV_DEV
> >     that is intended only for physical devices not virtual devices.  
> 
> You mean should not.

Yes. Virtual device should not set device parent.

> 
> >   * No point in having DocBook style comments on a driver file.
> >     They only make sense on an external exposed API.
> > 
> >   * net_failover only supports Ethernet, so use ether_addr_copy.  
> 
> It is since you need to know about all the things you need to copy, and
> because of mac matching.  But it isn't too much effort to add more
> transports and I don't see value in going in the reverse direction and
> making it more ethernet specific that it already is.

Sure, then do memcpy base on addr_len

> 
> >   * Set permanent and current address of net_failover device
> >     to match the primary.
> > 
> >   * Carrier should be marked off before registering device
> >     the net_failover device.  
> 
> Are above two bugfixes?

Yes.

> 
> >   * Use netdev_XXX for log messages, in net_failover (not dev_xxx)
> > 
> >   * Since failover infrastructure is about linking devices just
> >     use RTNL no need for other locking in init and teardown.
> > 
> >   * Don't bother with ERR_PTR() style return if only possible
> >     return is success or no memory.
> > 
> >   * As much as possible, the terms master and slave should be avoided
> >     because of their cultural connotations.  
> 
> Also for consistency, failover is calling these primary and standby now.

Good, let's standardize on that. 

> 
> > Note; this code has been tested on Hyper-V
> > but is compile tested only on virtio.
> > 
> > Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
> > Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
> > ---
> > 
> > Although this patch needs to go into 4.18 (linux-net),  
> 
> I'd rather we focused on fixing bugs in 4.18, and left refactoring to
> 4.19.
>

Either we fix or revert the current code in 4.18.
Sorry, I am not having callback hell code in any vendor or upstream kernel.
Michael S. Tsirkin June 5, 2018, 7:38 p.m. UTC | #6
On Tue, Jun 05, 2018 at 11:53:05AM -0700, Stephen Hemminger wrote:
> > >   * Now, netvsc and net_failover use the same delayed work type
> > >     mechanism for setup. Previously, net_failover code was triggering off
> > >     name change but a similar policy was rejected for netvsc.
> > >     "what is good for the goose is good for the gander"  
> > 
> > I don't really understand what you are saying here.  I think the delayed
> > hack is kind of ugly and seems racy.  Current failover code was rejected
> > by whom?  Why is new one good and for whom?  Did you want to do a name
> > change in netvsc but it was rejected? Could you clarify please?
> 
> See:
>    https://patchwork.ozlabs.org/patch/851711/

Let me try to summarize that:

	You wanted to speed up the delayed link up.  You had an idea to
	additionally take link up when userspace renames the interface (standby
	one which is also the failover for netvsc).

	But userspace might not do any renames, in which case there will
	still be the delay, and so this never got applied.

	Is this a good summary?

Davem said delay should go away completely as it's not robust, and I
think I agree.  So I don't think we should make all failover users use
delay. IIUC failover kept a delay option especially for netvsc to
minimize the surprise factor. Hopefully we can come up with
something more robust and drop that option completely.


> > >   * Set permanent and current address of net_failover device
> > >     to match the primary.
> > > 
> > >   * Carrier should be marked off before registering device
> > >     the net_failover device.  
> > 
> > Are above two bugfixes?
> 
> Yes.

Maybe fix these two as first patches in the set?

> > > Although this patch needs to go into 4.18 (linux-net),  
> > 
> > I'd rather we focused on fixing bugs in 4.18, and left refactoring to
> > 4.19.
> >
> 
> Either we fix or revert the current code in 4.18.
> Sorry, I am not having callback hell code in any vendor or upstream kernel.

I agree callbacks add complexity which often isn't necessary, so
removing them where possible is a good cleanup.  But maybe a patch
shouldn't mix bugfixes, cleanups and behaviour changes all together.  If
nothing else it makes review harder.  Splitting patches up might make it
more likely they can go into 4.18 which seems to be what you want.

HTH,
Stephen Hemminger June 5, 2018, 9:52 p.m. UTC | #7
On Tue, 5 Jun 2018 22:38:43 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> > 
> > See:
> >    https://patchwork.ozlabs.org/patch/851711/  
> 
> Let me try to summarize that:
> 
> 	You wanted to speed up the delayed link up.  You had an idea to
> 	additionally take link up when userspace renames the interface (standby
> 	one which is also the failover for netvsc).
> 
> 	But userspace might not do any renames, in which case there will
> 	still be the delay, and so this never got applied.
> 
> 	Is this a good summary?
> 
> Davem said delay should go away completely as it's not robust, and I
> think I agree.  So I don't think we should make all failover users use
> delay. IIUC failover kept a delay option especially for netvsc to
> minimize the surprise factor. Hopefully we can come up with
> something more robust and drop that option completely.

The timeout was the original solution to how to complete setup after
userspace has had a chance to rename the device. Unfortunately, the whole network
device initialization (cooperation with udev and userspace) is a a mess because
there is no well defined specification, and there are multiple ways userspace
does this in old and new distributions.  The timeout has its own issues
(how long, handling errors during that window, what if userspace modifies other
device state); and open to finding a better solution.

My point was that if name change can not be relied on (or used) by netvsc,
then we can't allow it for net_failover either.
Samudrala, Sridhar June 5, 2018, 11:52 p.m. UTC | #8
On 6/5/2018 2:52 PM, Stephen Hemminger wrote:
> On Tue, 5 Jun 2018 22:38:43 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
>>> See:
>>>     https://patchwork.ozlabs.org/patch/851711/
>> Let me try to summarize that:
>>
>> 	You wanted to speed up the delayed link up.  You had an idea to
>> 	additionally take link up when userspace renames the interface (standby
>> 	one which is also the failover for netvsc).
>>
>> 	But userspace might not do any renames, in which case there will
>> 	still be the delay, and so this never got applied.
>>
>> 	Is this a good summary?
>>
>> Davem said delay should go away completely as it's not robust, and I
>> think I agree.  So I don't think we should make all failover users use
>> delay. IIUC failover kept a delay option especially for netvsc to
>> minimize the surprise factor. Hopefully we can come up with
>> something more robust and drop that option completely.
> The timeout was the original solution to how to complete setup after
> userspace has had a chance to rename the device. Unfortunately, the whole network
> device initialization (cooperation with udev and userspace) is a a mess because
> there is no well defined specification, and there are multiple ways userspace
> does this in old and new distributions.  The timeout has its own issues
> (how long, handling errors during that window, what if userspace modifies other
> device state); and open to finding a better solution.
>
> My point was that if name change can not be relied on (or used) by netvsc,
> then we can't allow it for net_failover either.

I think the push back was with the usage of the delay, not bringing up the primary/standby
device in the name change event handler.
Can't netvsc use this mechanism instead of depending on the delay?
Stephen Hemminger June 6, 2018, 3:51 a.m. UTC | #9
On Tue, 5 Jun 2018 16:52:22 -0700
"Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:

> On 6/5/2018 2:52 PM, Stephen Hemminger wrote:
> > On Tue, 5 Jun 2018 22:38:43 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >  
> >>> See:
> >>>     https://patchwork.ozlabs.org/patch/851711/  
> >> Let me try to summarize that:
> >>
> >> 	You wanted to speed up the delayed link up.  You had an idea to
> >> 	additionally take link up when userspace renames the interface (standby
> >> 	one which is also the failover for netvsc).
> >>
> >> 	But userspace might not do any renames, in which case there will
> >> 	still be the delay, and so this never got applied.
> >>
> >> 	Is this a good summary?
> >>
> >> Davem said delay should go away completely as it's not robust, and I
> >> think I agree.  So I don't think we should make all failover users use
> >> delay. IIUC failover kept a delay option especially for netvsc to
> >> minimize the surprise factor. Hopefully we can come up with
> >> something more robust and drop that option completely.  
> > The timeout was the original solution to how to complete setup after
> > userspace has had a chance to rename the device. Unfortunately, the whole network
> > device initialization (cooperation with udev and userspace) is a a mess because
> > there is no well defined specification, and there are multiple ways userspace
> > does this in old and new distributions.  The timeout has its own issues
> > (how long, handling errors during that window, what if userspace modifies other
> > device state); and open to finding a better solution.
> >
> > My point was that if name change can not be relied on (or used) by netvsc,
> > then we can't allow it for net_failover either.  
> 
> I think the push back was with the usage of the delay, not bringing up the primary/standby
> device in the name change event handler.
> Can't netvsc use this mechanism instead of depending on the delay?
> 
> 

The patch that was rejected for netvsc was about using name change.
Also, you can't depend on name change; you still need a timer. Not all distributions
change name of devices. Or user has blocked that by udev rules.
Samudrala, Sridhar June 6, 2018, 5:39 a.m. UTC | #10
On 6/5/2018 8:51 PM, Stephen Hemminger wrote:
> On Tue, 5 Jun 2018 16:52:22 -0700
> "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
>
>> On 6/5/2018 2:52 PM, Stephen Hemminger wrote:
>>> On Tue, 5 Jun 2018 22:38:43 +0300
>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>   
>>>>> See:
>>>>>      https://patchwork.ozlabs.org/patch/851711/
>>>> Let me try to summarize that:
>>>>
>>>> 	You wanted to speed up the delayed link up.  You had an idea to
>>>> 	additionally take link up when userspace renames the interface (standby
>>>> 	one which is also the failover for netvsc).
>>>>
>>>> 	But userspace might not do any renames, in which case there will
>>>> 	still be the delay, and so this never got applied.
>>>>
>>>> 	Is this a good summary?
>>>>
>>>> Davem said delay should go away completely as it's not robust, and I
>>>> think I agree.  So I don't think we should make all failover users use
>>>> delay. IIUC failover kept a delay option especially for netvsc to
>>>> minimize the surprise factor. Hopefully we can come up with
>>>> something more robust and drop that option completely.
>>> The timeout was the original solution to how to complete setup after
>>> userspace has had a chance to rename the device. Unfortunately, the whole network
>>> device initialization (cooperation with udev and userspace) is a a mess because
>>> there is no well defined specification, and there are multiple ways userspace
>>> does this in old and new distributions.  The timeout has its own issues
>>> (how long, handling errors during that window, what if userspace modifies other
>>> device state); and open to finding a better solution.
>>>
>>> My point was that if name change can not be relied on (or used) by netvsc,
>>> then we can't allow it for net_failover either.
>> I think the push back was with the usage of the delay, not bringing up the primary/standby
>> device in the name change event handler.
>> Can't netvsc use this mechanism instead of depending on the delay?
>>
>>
> The patch that was rejected for netvsc was about using name change.
> Also, you can't depend on name change; you still need a timer. Not all distributions
> change name of devices. Or user has blocked that by udev rules.

In the net_failover_slave_register() we do a dev_open() and ignore any failure due to
EBUSY and do another dev_open() in the name change event handler.
If the name is not expected to change, i would think the dev_open() at the time of
register will succeed.
Stephen Hemminger June 6, 2018, 6 a.m. UTC | #11
On Tue, 5 Jun 2018 22:39:12 -0700
"Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:

> On 6/5/2018 8:51 PM, Stephen Hemminger wrote:
> > On Tue, 5 Jun 2018 16:52:22 -0700
> > "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> >  
> >> On 6/5/2018 2:52 PM, Stephen Hemminger wrote:  
> >>> On Tue, 5 Jun 2018 22:38:43 +0300
> >>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>>     
> >>>>> See:
> >>>>>      https://patchwork.ozlabs.org/patch/851711/  
> >>>> Let me try to summarize that:
> >>>>
> >>>> 	You wanted to speed up the delayed link up.  You had an idea to
> >>>> 	additionally take link up when userspace renames the interface (standby
> >>>> 	one which is also the failover for netvsc).
> >>>>
> >>>> 	But userspace might not do any renames, in which case there will
> >>>> 	still be the delay, and so this never got applied.
> >>>>
> >>>> 	Is this a good summary?
> >>>>
> >>>> Davem said delay should go away completely as it's not robust, and I
> >>>> think I agree.  So I don't think we should make all failover users use
> >>>> delay. IIUC failover kept a delay option especially for netvsc to
> >>>> minimize the surprise factor. Hopefully we can come up with
> >>>> something more robust and drop that option completely.  
> >>> The timeout was the original solution to how to complete setup after
> >>> userspace has had a chance to rename the device. Unfortunately, the whole network
> >>> device initialization (cooperation with udev and userspace) is a a mess because
> >>> there is no well defined specification, and there are multiple ways userspace
> >>> does this in old and new distributions.  The timeout has its own issues
> >>> (how long, handling errors during that window, what if userspace modifies other
> >>> device state); and open to finding a better solution.
> >>>
> >>> My point was that if name change can not be relied on (or used) by netvsc,
> >>> then we can't allow it for net_failover either.  
> >> I think the push back was with the usage of the delay, not bringing up the primary/standby
> >> device in the name change event handler.
> >> Can't netvsc use this mechanism instead of depending on the delay?
> >>
> >>  
> > The patch that was rejected for netvsc was about using name change.
> > Also, you can't depend on name change; you still need a timer. Not all distributions
> > change name of devices. Or user has blocked that by udev rules.  
> 
> In the net_failover_slave_register() we do a dev_open() and ignore any failure due to
> EBUSY and do another dev_open() in the name change event handler.
> If the name is not expected to change, i would think the dev_open() at the time of
> register will succeed.

The problem is your first dev_open will bring device up and lockout
udev from changing the network device name.
Samudrala, Sridhar June 6, 2018, 6:11 a.m. UTC | #12
On 6/5/2018 11:00 PM, Stephen Hemminger wrote:
> On Tue, 5 Jun 2018 22:39:12 -0700
> "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
>
>> On 6/5/2018 8:51 PM, Stephen Hemminger wrote:
>>> On Tue, 5 Jun 2018 16:52:22 -0700
>>> "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
>>>   
>>>> On 6/5/2018 2:52 PM, Stephen Hemminger wrote:
>>>>> On Tue, 5 Jun 2018 22:38:43 +0300
>>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>      
>>>>>>> See:
>>>>>>>       https://patchwork.ozlabs.org/patch/851711/
>>>>>> Let me try to summarize that:
>>>>>>
>>>>>> 	You wanted to speed up the delayed link up.  You had an idea to
>>>>>> 	additionally take link up when userspace renames the interface (standby
>>>>>> 	one which is also the failover for netvsc).
>>>>>>
>>>>>> 	But userspace might not do any renames, in which case there will
>>>>>> 	still be the delay, and so this never got applied.
>>>>>>
>>>>>> 	Is this a good summary?
>>>>>>
>>>>>> Davem said delay should go away completely as it's not robust, and I
>>>>>> think I agree.  So I don't think we should make all failover users use
>>>>>> delay. IIUC failover kept a delay option especially for netvsc to
>>>>>> minimize the surprise factor. Hopefully we can come up with
>>>>>> something more robust and drop that option completely.
>>>>> The timeout was the original solution to how to complete setup after
>>>>> userspace has had a chance to rename the device. Unfortunately, the whole network
>>>>> device initialization (cooperation with udev and userspace) is a a mess because
>>>>> there is no well defined specification, and there are multiple ways userspace
>>>>> does this in old and new distributions.  The timeout has its own issues
>>>>> (how long, handling errors during that window, what if userspace modifies other
>>>>> device state); and open to finding a better solution.
>>>>>
>>>>> My point was that if name change can not be relied on (or used) by netvsc,
>>>>> then we can't allow it for net_failover either.
>>>> I think the push back was with the usage of the delay, not bringing up the primary/standby
>>>> device in the name change event handler.
>>>> Can't netvsc use this mechanism instead of depending on the delay?
>>>>
>>>>   
>>> The patch that was rejected for netvsc was about using name change.
>>> Also, you can't depend on name change; you still need a timer. Not all distributions
>>> change name of devices. Or user has blocked that by udev rules.
>> In the net_failover_slave_register() we do a dev_open() and ignore any failure due to
>> EBUSY and do another dev_open() in the name change event handler.
>> If the name is not expected to change, i would think the dev_open() at the time of
>> register will succeed.
> The problem is your first dev_open will bring device up and lockout
> udev from changing the network device name.

I have tried with/without udev and didn't see any issue with the naming of the primary/standby
devices in my testing.

With the 3-netdev failover model, we are only interested in setting the right name for the failover
netdev and that is the reason we do SET_NETDEV_DEV on that netdev. Does it really matter if udev fails
to rename the lower primary/standby netdevs, i don't think it will matter? The user is not expected
to touch the lower netdevs.
Jiri Pirko June 6, 2018, 7:25 a.m. UTC | #13
Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>The net failover should be a simple library, not a virtual
>object with function callbacks (see callback hell).

Why just a library? It should do a common things. I think it should be a
virtual object. Looks like your patch again splits the common
functionality into multiple drivers. That is kind of backwards attitude.
I don't get it. We should rather focus on fixing the mess the
introduction of netvsc-bonding caused and switch netvsc to 3-netdev
model.
Michael S. Tsirkin June 6, 2018, 12:19 p.m. UTC | #14
On Tue, Jun 05, 2018 at 08:51:18PM -0700, Stephen Hemminger wrote:
> > I think the push back was with the usage of the delay, not bringing up the primary/standby
> > device in the name change event handler.
> > Can't netvsc use this mechanism instead of depending on the delay?
> > 
> > 
> 
> The patch that was rejected for netvsc was about using name change.

So failover is now doing exactly what you wanted netvsc to do.  Rather
than reverting everyone to old behaviour how about using more pieces
from failover?

> Also, you can't depend on name change; you still need a timer. Not all distributions
> change name of devices.

So failover chose not to implement the delayed open so far.
If it does I suspect we'll have to keep it around forever -
kind of like netvsc seems to be stuck with it.
But let's see if there's enough actual demand from people running
ancient distros with latest kernels to even start looking for
a solution for failover.

And this kind of behaviour change really should be split out
so we can discuss it separately.

> Or user has blocked that by udev rules.

Don't do that then?
Michael S. Tsirkin June 6, 2018, 12:30 p.m. UTC | #15
On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
> Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
> >The net failover should be a simple library, not a virtual
> >object with function callbacks (see callback hell).
> 
> Why just a library? It should do a common things. I think it should be a
> virtual object. Looks like your patch again splits the common
> functionality into multiple drivers. That is kind of backwards attitude.
> I don't get it. We should rather focus on fixing the mess the
> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> model.

So it seems that at least one benefit for netvsc would be better
handling of renames.

Question is how can this change to 3-netdev happen?  Stephen is
concerned about risk of breaking some userspace.

Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
address, and you said then "why not use existing network namespaces
rather than inventing a new abstraction". So how about it then? Do you
want to find a way to use namespaces to hide the PV device for netvsc
compatibility?
Stephen Hemminger June 6, 2018, 9:16 p.m. UTC | #16
On Tue, 5 Jun 2018 23:11:37 -0700
"Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:

> On 6/5/2018 11:00 PM, Stephen Hemminger wrote:
> > On Tue, 5 Jun 2018 22:39:12 -0700
> > "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> >  
> >> On 6/5/2018 8:51 PM, Stephen Hemminger wrote:  
> >>> On Tue, 5 Jun 2018 16:52:22 -0700
> >>> "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> >>>     
> >>>> On 6/5/2018 2:52 PM, Stephen Hemminger wrote:  
> >>>>> On Tue, 5 Jun 2018 22:38:43 +0300
> >>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>>>>        
> >>>>>>> See:
> >>>>>>>       https://patchwork.ozlabs.org/patch/851711/  
> >>>>>> Let me try to summarize that:
> >>>>>>
> >>>>>> 	You wanted to speed up the delayed link up.  You had an idea to
> >>>>>> 	additionally take link up when userspace renames the interface (standby
> >>>>>> 	one which is also the failover for netvsc).
> >>>>>>
> >>>>>> 	But userspace might not do any renames, in which case there will
> >>>>>> 	still be the delay, and so this never got applied.
> >>>>>>
> >>>>>> 	Is this a good summary?
> >>>>>>
> >>>>>> Davem said delay should go away completely as it's not robust, and I
> >>>>>> think I agree.  So I don't think we should make all failover users use
> >>>>>> delay. IIUC failover kept a delay option especially for netvsc to
> >>>>>> minimize the surprise factor. Hopefully we can come up with
> >>>>>> something more robust and drop that option completely.  
> >>>>> The timeout was the original solution to how to complete setup after
> >>>>> userspace has had a chance to rename the device. Unfortunately, the whole network
> >>>>> device initialization (cooperation with udev and userspace) is a a mess because
> >>>>> there is no well defined specification, and there are multiple ways userspace
> >>>>> does this in old and new distributions.  The timeout has its own issues
> >>>>> (how long, handling errors during that window, what if userspace modifies other
> >>>>> device state); and open to finding a better solution.
> >>>>>
> >>>>> My point was that if name change can not be relied on (or used) by netvsc,
> >>>>> then we can't allow it for net_failover either.  
> >>>> I think the push back was with the usage of the delay, not bringing up the primary/standby
> >>>> device in the name change event handler.
> >>>> Can't netvsc use this mechanism instead of depending on the delay?
> >>>>
> >>>>     
> >>> The patch that was rejected for netvsc was about using name change.
> >>> Also, you can't depend on name change; you still need a timer. Not all distributions
> >>> change name of devices. Or user has blocked that by udev rules.  
> >> In the net_failover_slave_register() we do a dev_open() and ignore any failure due to
> >> EBUSY and do another dev_open() in the name change event handler.
> >> If the name is not expected to change, i would think the dev_open() at the time of
> >> register will succeed.  
> > The problem is your first dev_open will bring device up and lockout
> > udev from changing the network device name.  
> 
> I have tried with/without udev and didn't see any issue with the naming of the primary/standby
> devices in my testing.
> 
> With the 3-netdev failover model, we are only interested in setting the right name for the failover
> netdev and that is the reason we do SET_NETDEV_DEV on that netdev. Does it really matter if udev fails
> to rename the lower primary/standby netdevs, i don't think it will matter? The user is not expected
> to touch the lower netdevs.

Renaming matters to some users and the udev maintainers. They want the VF to be named enpXXX
The primary in virtio case needs to be ens3 with some cloud platforms.

I think you need to get rid of triggering off of the name change.

Long term, let's open the discussion about allowing network devices to change name when up?
Maybe with NETIF_LIVENAME_CHANGE flag?
Stephen Hemminger June 6, 2018, 9:17 p.m. UTC | #17
On Wed, 6 Jun 2018 15:19:30 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Jun 05, 2018 at 08:51:18PM -0700, Stephen Hemminger wrote:
> > > I think the push back was with the usage of the delay, not bringing up the primary/standby
> > > device in the name change event handler.
> > > Can't netvsc use this mechanism instead of depending on the delay?
> > > 
> > >   
> > 
> > The patch that was rejected for netvsc was about using name change.  
> 
> So failover is now doing exactly what you wanted netvsc to do.  Rather
> than reverting everyone to old behaviour how about using more pieces
> from failover?
> 
> > Also, you can't depend on name change; you still need a timer. Not all distributions
> > change name of devices.  
> 
> So failover chose not to implement the delayed open so far.
> If it does I suspect we'll have to keep it around forever -
> kind of like netvsc seems to be stuck with it.
> But let's see if there's enough actual demand from people running
> ancient distros with latest kernels to even start looking for
> a solution for failover.
> 
> And this kind of behaviour change really should be split out
> so we can discuss it separately.
> 
> > Or user has blocked that by udev rules.  
> 
> Don't do that then?
> 

If you don't want to allow udev to rename the device, then just pull
the name change hook.
Stephen Hemminger June 6, 2018, 9:24 p.m. UTC | #18
On Wed, 6 Jun 2018 15:30:27 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:  
> > >The net failover should be a simple library, not a virtual
> > >object with function callbacks (see callback hell).  
> > 
> > Why just a library? It should do a common things. I think it should be a
> > virtual object. Looks like your patch again splits the common
> > functionality into multiple drivers. That is kind of backwards attitude.
> > I don't get it. We should rather focus on fixing the mess the
> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > model.  
> 
> So it seems that at least one benefit for netvsc would be better
> handling of renames.
> 
> Question is how can this change to 3-netdev happen?  Stephen is
> concerned about risk of breaking some userspace.
> 
> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> address, and you said then "why not use existing network namespaces
> rather than inventing a new abstraction". So how about it then? Do you
> want to find a way to use namespaces to hide the PV device for netvsc
> compatibility?
> 

Netvsc can't work with 3 dev model. MS has worked with enough distro's and
startups that all demand eth0 always be present. And VF may come and go.
After this history, there is a strong motivation not to change how kernel
behaves. Switching to 3 device model would be perceived as breaking
existing userspace.

With virtio you can  work it out with the distro's yourself.
There is no pre-existing semantics to deal with.

For the virtio, I don't see the need for IFF_HIDDEN.
With 3-dev model as long as you mark the PV and VF devices
as slaves, then userspace knows to leave them alone. Assuming userspace
is already able to deal with team and bond devices.
Any time you introduce new UAPI behavior something breaks.

On the rename front, I really don't care if VF can be renamed. And for
netvsc want to allow the PV device to be renamed. Udev developers want that
but have not found a stable/persistent value to expose to userspace
to allow it.
Stephen Hemminger June 6, 2018, 9:26 p.m. UTC | #19
On Wed, 6 Jun 2018 09:25:12 +0200
Jiri Pirko <jiri@resnulli.us> wrote:

> Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
> >The net failover should be a simple library, not a virtual
> >object with function callbacks (see callback hell).  
> 
> Why just a library? It should do a common things. I think it should be a
> virtual object. Looks like your patch again splits the common
> functionality into multiple drivers. That is kind of backwards attitude.
> I don't get it. We should rather focus on fixing the mess the
> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> model.

To me, callbacks are worse than a simple notifier. Is there some other
agenda going on? do you want to get rid of network notifier.

Netvsc is not switching to 3 device model in near future.
Too much userspace is wrapped around it already.
Michael S. Tsirkin June 6, 2018, 9:30 p.m. UTC | #20
On Wed, Jun 06, 2018 at 02:16:20PM -0700, Stephen Hemminger wrote:
> On Tue, 5 Jun 2018 23:11:37 -0700
> "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> 
> > On 6/5/2018 11:00 PM, Stephen Hemminger wrote:
> > > On Tue, 5 Jun 2018 22:39:12 -0700
> > > "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> > >  
> > >> On 6/5/2018 8:51 PM, Stephen Hemminger wrote:  
> > >>> On Tue, 5 Jun 2018 16:52:22 -0700
> > >>> "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> > >>>     
> > >>>> On 6/5/2018 2:52 PM, Stephen Hemminger wrote:  
> > >>>>> On Tue, 5 Jun 2018 22:38:43 +0300
> > >>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >>>>>        
> > >>>>>>> See:
> > >>>>>>>       https://patchwork.ozlabs.org/patch/851711/  
> > >>>>>> Let me try to summarize that:
> > >>>>>>
> > >>>>>> 	You wanted to speed up the delayed link up.  You had an idea to
> > >>>>>> 	additionally take link up when userspace renames the interface (standby
> > >>>>>> 	one which is also the failover for netvsc).
> > >>>>>>
> > >>>>>> 	But userspace might not do any renames, in which case there will
> > >>>>>> 	still be the delay, and so this never got applied.
> > >>>>>>
> > >>>>>> 	Is this a good summary?
> > >>>>>>
> > >>>>>> Davem said delay should go away completely as it's not robust, and I
> > >>>>>> think I agree.  So I don't think we should make all failover users use
> > >>>>>> delay. IIUC failover kept a delay option especially for netvsc to
> > >>>>>> minimize the surprise factor. Hopefully we can come up with
> > >>>>>> something more robust and drop that option completely.  
> > >>>>> The timeout was the original solution to how to complete setup after
> > >>>>> userspace has had a chance to rename the device. Unfortunately, the whole network
> > >>>>> device initialization (cooperation with udev and userspace) is a a mess because
> > >>>>> there is no well defined specification, and there are multiple ways userspace
> > >>>>> does this in old and new distributions.  The timeout has its own issues
> > >>>>> (how long, handling errors during that window, what if userspace modifies other
> > >>>>> device state); and open to finding a better solution.
> > >>>>>
> > >>>>> My point was that if name change can not be relied on (or used) by netvsc,
> > >>>>> then we can't allow it for net_failover either.  
> > >>>> I think the push back was with the usage of the delay, not bringing up the primary/standby
> > >>>> device in the name change event handler.
> > >>>> Can't netvsc use this mechanism instead of depending on the delay?
> > >>>>
> > >>>>     
> > >>> The patch that was rejected for netvsc was about using name change.
> > >>> Also, you can't depend on name change; you still need a timer. Not all distributions
> > >>> change name of devices. Or user has blocked that by udev rules.  
> > >> In the net_failover_slave_register() we do a dev_open() and ignore any failure due to
> > >> EBUSY and do another dev_open() in the name change event handler.
> > >> If the name is not expected to change, i would think the dev_open() at the time of
> > >> register will succeed.  
> > > The problem is your first dev_open will bring device up and lockout
> > > udev from changing the network device name.  
> > 
> > I have tried with/without udev and didn't see any issue with the naming of the primary/standby
> > devices in my testing.
> > 
> > With the 3-netdev failover model, we are only interested in setting the right name for the failover
> > netdev and that is the reason we do SET_NETDEV_DEV on that netdev. Does it really matter if udev fails
> > to rename the lower primary/standby netdevs, i don't think it will matter? The user is not expected
> > to touch the lower netdevs.
> 
> Renaming matters to some users and the udev maintainers. They want the VF to be named enpXXX
> The primary in virtio case needs to be ens3 with some cloud platforms.

Confused. VF can't be a standby, of it's used in a failover it's a
primary, you can't call it both enpXXX amd ens3. Could you describe the
use case in a bit more detail?

> 
> I think you need to get rid of triggering off of the name change.

Worth considering down the road since it's a bit of a hack but it's one
we won't have trouble supporting, unlike the delayed uplink.

> Long term, let's open the discussion about allowing network devices to change name when up?
> Maybe with NETIF_LIVENAME_CHANGE flag?

That's probably the cleanest approach assuming it can be made to work
without races. I suspect we can just live with what we have until then.
Michael S. Tsirkin June 6, 2018, 9:47 p.m. UTC | #21
On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote:
> On Wed, 6 Jun 2018 15:30:27 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
> > > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:  
> > > >The net failover should be a simple library, not a virtual
> > > >object with function callbacks (see callback hell).  
> > > 
> > > Why just a library? It should do a common things. I think it should be a
> > > virtual object. Looks like your patch again splits the common
> > > functionality into multiple drivers. That is kind of backwards attitude.
> > > I don't get it. We should rather focus on fixing the mess the
> > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > > model.  
> > 
> > So it seems that at least one benefit for netvsc would be better
> > handling of renames.
> > 
> > Question is how can this change to 3-netdev happen?  Stephen is
> > concerned about risk of breaking some userspace.
> > 
> > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > address, and you said then "why not use existing network namespaces
> > rather than inventing a new abstraction". So how about it then? Do you
> > want to find a way to use namespaces to hide the PV device for netvsc
> > compatibility?
> > 
> 
> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> startups that all demand eth0 always be present. And VF may come and go.

Well failover seems to maintain this invariant with the 3 dev model.

> After this history, there is a strong motivation not to change how kernel
> behaves. Switching to 3 device model would be perceived as breaking
> existing userspace.

I feel I'm misunderstood. I was asking whether a 3-rd device can be
hidden so that userspace does not know that you switched to a 3 device
model. It will think there are 2 devices and will keep working.

If you do that, then there won't be anything that
would be perceived as breaking existing userspace, will there?


> With virtio you can  work it out with the distro's yourself.
> There is no pre-existing semantics to deal with.
> 
> For the virtio, I don't see the need for IFF_HIDDEN.
> With 3-dev model as long as you mark the PV and VF devices
> as slaves, then userspace knows to leave them alone. Assuming userspace
> is already able to deal with team and bond devices.

That's clear enough.

> Any time you introduce new UAPI behavior something breaks.

Not if we do it right.

> On the rename front, I really don't care if VF can be renamed.

OK that's nice.

> And for
> netvsc want to allow the PV device to be renamed.

That's because of the 2 device model, right?  So that explains why even
if the delayed hack is good for the goose it might not be good for the
gander :)

> Udev developers want that
> but have not found a stable/persistent value to expose to userspace
> to allow it.
Samudrala, Sridhar June 6, 2018, 9:54 p.m. UTC | #22
On 6/6/2018 2:24 PM, Stephen Hemminger wrote:
> On Wed, 6 Jun 2018 15:30:27 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>>> Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>>>> The net failover should be a simple library, not a virtual
>>>> object with function callbacks (see callback hell).
>>> Why just a library? It should do a common things. I think it should be a
>>> virtual object. Looks like your patch again splits the common
>>> functionality into multiple drivers. That is kind of backwards attitude.
>>> I don't get it. We should rather focus on fixing the mess the
>>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>>> model.
>> So it seems that at least one benefit for netvsc would be better
>> handling of renames.
>>
>> Question is how can this change to 3-netdev happen?  Stephen is
>> concerned about risk of breaking some userspace.
>>
>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> address, and you said then "why not use existing network namespaces
>> rather than inventing a new abstraction". So how about it then? Do you
>> want to find a way to use namespaces to hide the PV device for netvsc
>> compatibility?
>>
> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> startups that all demand eth0 always be present. And VF may come and go.
> After this history, there is a strong motivation not to change how kernel
> behaves. Switching to 3 device model would be perceived as breaking
> existing userspace.

I think it should be possible for netvsc to work with 3 dev model if the only
requirement is that eth0 will always be present. With net_failover, you will
see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an issue
if somehow there is userspace requirement that there can be only 2 netdevs, not 3
when VF is plugged.

eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc device
and the IP address gets configured on eth0. Will this be an issue?


>
> With virtio you can  work it out with the distro's yourself.
> There is no pre-existing semantics to deal with.
>
> For the virtio, I don't see the need for IFF_HIDDEN.
> With 3-dev model as long as you mark the PV and VF devices
> as slaves, then userspace knows to leave them alone. Assuming userspace
> is already able to deal with team and bond devices.
> Any time you introduce new UAPI behavior something breaks.
>
> On the rename front, I really don't care if VF can be renamed. And for
> netvsc want to allow the PV device to be renamed. Udev developers want that
> but have not found a stable/persistent value to expose to userspace
> to allow it.
Stephen Hemminger June 6, 2018, 10:21 p.m. UTC | #23
On Thu, 7 Jun 2018 00:30:21 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 06, 2018 at 02:16:20PM -0700, Stephen Hemminger wrote:
> > On Tue, 5 Jun 2018 23:11:37 -0700
> > "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> >   
> > > On 6/5/2018 11:00 PM, Stephen Hemminger wrote:  
> > > > On Tue, 5 Jun 2018 22:39:12 -0700
> > > > "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> > > >    
> > > >> On 6/5/2018 8:51 PM, Stephen Hemminger wrote:    
> > > >>> On Tue, 5 Jun 2018 16:52:22 -0700
> > > >>> "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> > > >>>       
> > > >>>> On 6/5/2018 2:52 PM, Stephen Hemminger wrote:    
> > > >>>>> On Tue, 5 Jun 2018 22:38:43 +0300
> > > >>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > >>>>>          
> > > >>>>>>> See:
> > > >>>>>>>       https://patchwork.ozlabs.org/patch/851711/    
> > > >>>>>> Let me try to summarize that:
> > > >>>>>>
> > > >>>>>> 	You wanted to speed up the delayed link up.  You had an idea to
> > > >>>>>> 	additionally take link up when userspace renames the interface (standby
> > > >>>>>> 	one which is also the failover for netvsc).
> > > >>>>>>
> > > >>>>>> 	But userspace might not do any renames, in which case there will
> > > >>>>>> 	still be the delay, and so this never got applied.
> > > >>>>>>
> > > >>>>>> 	Is this a good summary?
> > > >>>>>>
> > > >>>>>> Davem said delay should go away completely as it's not robust, and I
> > > >>>>>> think I agree.  So I don't think we should make all failover users use
> > > >>>>>> delay. IIUC failover kept a delay option especially for netvsc to
> > > >>>>>> minimize the surprise factor. Hopefully we can come up with
> > > >>>>>> something more robust and drop that option completely.    
> > > >>>>> The timeout was the original solution to how to complete setup after
> > > >>>>> userspace has had a chance to rename the device. Unfortunately, the whole network
> > > >>>>> device initialization (cooperation with udev and userspace) is a a mess because
> > > >>>>> there is no well defined specification, and there are multiple ways userspace
> > > >>>>> does this in old and new distributions.  The timeout has its own issues
> > > >>>>> (how long, handling errors during that window, what if userspace modifies other
> > > >>>>> device state); and open to finding a better solution.
> > > >>>>>
> > > >>>>> My point was that if name change can not be relied on (or used) by netvsc,
> > > >>>>> then we can't allow it for net_failover either.    
> > > >>>> I think the push back was with the usage of the delay, not bringing up the primary/standby
> > > >>>> device in the name change event handler.
> > > >>>> Can't netvsc use this mechanism instead of depending on the delay?
> > > >>>>
> > > >>>>       
> > > >>> The patch that was rejected for netvsc was about using name change.
> > > >>> Also, you can't depend on name change; you still need a timer. Not all distributions
> > > >>> change name of devices. Or user has blocked that by udev rules.    
> > > >> In the net_failover_slave_register() we do a dev_open() and ignore any failure due to
> > > >> EBUSY and do another dev_open() in the name change event handler.
> > > >> If the name is not expected to change, i would think the dev_open() at the time of
> > > >> register will succeed.    
> > > > The problem is your first dev_open will bring device up and lockout
> > > > udev from changing the network device name.    
> > > 
> > > I have tried with/without udev and didn't see any issue with the naming of the primary/standby
> > > devices in my testing.
> > > 
> > > With the 3-netdev failover model, we are only interested in setting the right name for the failover
> > > netdev and that is the reason we do SET_NETDEV_DEV on that netdev. Does it really matter if udev fails
> > > to rename the lower primary/standby netdevs, i don't think it will matter? The user is not expected
> > > to touch the lower netdevs.  
> > 
> > Renaming matters to some users and the udev maintainers. They want the VF to be named enpXXX
> > The primary in virtio case needs to be ens3 with some cloud platforms.  
> 
> Confused. VF can't be a standby, of it's used in a failover it's a
> primary, you can't call it both enpXXX amd ens3. Could you describe the
> use case in a bit more detail?

Sorry, got things backwards.  The primary is VF and it should be possible
to have it renamed based on PCI info.
The standby PV (in 3 dev model) would get renamed by udev to ens3.
So the failover device would just be ethX right?

> > 
> > I think you need to get rid of triggering off of the name change.  
> 
> Worth considering down the road since it's a bit of a hack but it's one
> we won't have trouble supporting, unlike the delayed uplink.

You can't depend on userspace doing rename.

> 
> > Long term, let's open the discussion about allowing network devices to change name when up?
> > Maybe with NETIF_LIVENAME_CHANGE flag?  
> 
> That's probably the cleanest approach assuming it can be made to work
> without races. I suspect we can just live with what we have until then.
> 
>
Stephen Hemminger June 6, 2018, 10:24 p.m. UTC | #24
On Thu, 7 Jun 2018 00:47:52 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote:
> > On Wed, 6 Jun 2018 15:30:27 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> > > > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:    
> > > > >The net failover should be a simple library, not a virtual
> > > > >object with function callbacks (see callback hell).    
> > > > 
> > > > Why just a library? It should do a common things. I think it should be a
> > > > virtual object. Looks like your patch again splits the common
> > > > functionality into multiple drivers. That is kind of backwards attitude.
> > > > I don't get it. We should rather focus on fixing the mess the
> > > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > > > model.    
> > > 
> > > So it seems that at least one benefit for netvsc would be better
> > > handling of renames.
> > > 
> > > Question is how can this change to 3-netdev happen?  Stephen is
> > > concerned about risk of breaking some userspace.
> > > 
> > > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > > address, and you said then "why not use existing network namespaces
> > > rather than inventing a new abstraction". So how about it then? Do you
> > > want to find a way to use namespaces to hide the PV device for netvsc
> > > compatibility?
> > >   
> > 
> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > startups that all demand eth0 always be present. And VF may come and go.  
> 
> Well failover seems to maintain this invariant with the 3 dev model.
> 
> > After this history, there is a strong motivation not to change how kernel
> > behaves. Switching to 3 device model would be perceived as breaking
> > existing userspace.  
> 
> I feel I'm misunderstood. I was asking whether a 3-rd device can be
> hidden so that userspace does not know that you switched to a 3 device
> model. It will think there are 2 devices and will keep working.
> 
> If you do that, then there won't be anything that
> would be perceived as breaking existing userspace, will there?

DPDK now knows about the netvsc 2 device model and drivers in userspace
depend on it.

> 
> 
> > With virtio you can  work it out with the distro's yourself.
> > There is no pre-existing semantics to deal with.
> > 
> > For the virtio, I don't see the need for IFF_HIDDEN.
> > With 3-dev model as long as you mark the PV and VF devices
> > as slaves, then userspace knows to leave them alone. Assuming userspace
> > is already able to deal with team and bond devices.  
> 
> That's clear enough.
> 
> > Any time you introduce new UAPI behavior something breaks.  
> 
> Not if we do it right.
> 
> > On the rename front, I really don't care if VF can be renamed.  
> 
> OK that's nice.
> 
> > And for
> > netvsc want to allow the PV device to be renamed.  
> 
> That's because of the 2 device model, right?  So that explains why even
> if the delayed hack is good for the goose it might not be good for the
> gander :)

You are bringing up the VF right away. How does the 3-device initialization
state machine work?  Do you give a window for udev to possibly rename the
VF? Do you rely on that?

> 
> > Udev developers want that
> > but have not found a stable/persistent value to expose to userspace
> > to allow it.
Stephen Hemminger June 6, 2018, 10:25 p.m. UTC | #25
On Wed, 6 Jun 2018 14:54:04 -0700
"Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:

> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:
> > On Wed, 6 Jun 2018 15:30:27 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >  
> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:  
> >>>> The net failover should be a simple library, not a virtual
> >>>> object with function callbacks (see callback hell).  
> >>> Why just a library? It should do a common things. I think it should be a
> >>> virtual object. Looks like your patch again splits the common
> >>> functionality into multiple drivers. That is kind of backwards attitude.
> >>> I don't get it. We should rather focus on fixing the mess the
> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >>> model.  
> >> So it seems that at least one benefit for netvsc would be better
> >> handling of renames.
> >>
> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> concerned about risk of breaking some userspace.
> >>
> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> address, and you said then "why not use existing network namespaces
> >> rather than inventing a new abstraction". So how about it then? Do you
> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> compatibility?
> >>  
> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > startups that all demand eth0 always be present. And VF may come and go.
> > After this history, there is a strong motivation not to change how kernel
> > behaves. Switching to 3 device model would be perceived as breaking
> > existing userspace.  
> 
> I think it should be possible for netvsc to work with 3 dev model if the only
> requirement is that eth0 will always be present. With net_failover, you will
> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an issue
> if somehow there is userspace requirement that there can be only 2 netdevs, not 3
> when VF is plugged.
> 
> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc device
> and the IP address gets configured on eth0. Will this be an issue?

DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess
but that is the way it is.
Alexander H Duyck June 7, 2018, 2:17 p.m. UTC | #26
On Wed, Jun 6, 2018 at 3:25 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Wed, 6 Jun 2018 14:54:04 -0700
> "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
>
>> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:
>> > On Wed, 6 Jun 2018 15:30:27 +0300
>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >
>> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>> >>>> The net failover should be a simple library, not a virtual
>> >>>> object with function callbacks (see callback hell).
>> >>> Why just a library? It should do a common things. I think it should be a
>> >>> virtual object. Looks like your patch again splits the common
>> >>> functionality into multiple drivers. That is kind of backwards attitude.
>> >>> I don't get it. We should rather focus on fixing the mess the
>> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >>> model.
>> >> So it seems that at least one benefit for netvsc would be better
>> >> handling of renames.
>> >>
>> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> concerned about risk of breaking some userspace.
>> >>
>> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> address, and you said then "why not use existing network namespaces
>> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> compatibility?
>> >>
>> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> > startups that all demand eth0 always be present. And VF may come and go.
>> > After this history, there is a strong motivation not to change how kernel
>> > behaves. Switching to 3 device model would be perceived as breaking
>> > existing userspace.
>>
>> I think it should be possible for netvsc to work with 3 dev model if the only
>> requirement is that eth0 will always be present. With net_failover, you will
>> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an issue
>> if somehow there is userspace requirement that there can be only 2 netdevs, not 3
>> when VF is plugged.
>>
>> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc device
>> and the IP address gets configured on eth0. Will this be an issue?
>
> DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess
> but that is the way it is.

Why would DPDK care what we do in the kernel? Isn't it just slapping
vfio-pci on the netdevs it sees?
Stephen Hemminger June 7, 2018, 2:51 p.m. UTC | #27
On Thu, 7 Jun 2018 07:17:51 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Wed, Jun 6, 2018 at 3:25 PM, Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> > On Wed, 6 Jun 2018 14:54:04 -0700
> > "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> >  
> >> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:  
> >> > On Wed, 6 Jun 2018 15:30:27 +0300
> >> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> >  
> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:  
> >> >>>> The net failover should be a simple library, not a virtual
> >> >>>> object with function callbacks (see callback hell).  
> >> >>> Why just a library? It should do a common things. I think it should be a
> >> >>> virtual object. Looks like your patch again splits the common
> >> >>> functionality into multiple drivers. That is kind of backwards attitude.
> >> >>> I don't get it. We should rather focus on fixing the mess the
> >> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >> >>> model.  
> >> >> So it seems that at least one benefit for netvsc would be better
> >> >> handling of renames.
> >> >>
> >> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> >> concerned about risk of breaking some userspace.
> >> >>
> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> >> address, and you said then "why not use existing network namespaces
> >> >> rather than inventing a new abstraction". So how about it then? Do you
> >> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> >> compatibility?
> >> >>  
> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> >> > startups that all demand eth0 always be present. And VF may come and go.
> >> > After this history, there is a strong motivation not to change how kernel
> >> > behaves. Switching to 3 device model would be perceived as breaking
> >> > existing userspace.  
> >>
> >> I think it should be possible for netvsc to work with 3 dev model if the only
> >> requirement is that eth0 will always be present. With net_failover, you will
> >> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an issue
> >> if somehow there is userspace requirement that there can be only 2 netdevs, not 3
> >> when VF is plugged.
> >>
> >> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc device
> >> and the IP address gets configured on eth0. Will this be an issue?  
> >
> > DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess
> > but that is the way it is.  
> 
> Why would DPDK care what we do in the kernel? Isn't it just slapping
> vfio-pci on the netdevs it sees?

Alex, you are correct for Intel devices; but DPDK on Azure is not Intel based.,.
The DPDK support uses:
 * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to
   userspace. This means VF netdev device must exist and be visible.
 * Slow path using kernel netvsc device, TAP and BPF to get exception
   path packets to userspace.
 * A autodiscovery mechanism that to set all this up that relies on
   2 device model and sysfs.

In this version, there is no VFIO-PCI. And also Hyper-V does not have virtual
IOMMU so VFIO will not work there at all.
Michael S. Tsirkin June 7, 2018, 2:57 p.m. UTC | #28
On Wed, Jun 06, 2018 at 03:24:08PM -0700, Stephen Hemminger wrote:
> On Thu, 7 Jun 2018 00:47:52 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote:
> > > On Wed, 6 Jun 2018 15:30:27 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > > > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> > > > > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:    
> > > > > >The net failover should be a simple library, not a virtual
> > > > > >object with function callbacks (see callback hell).    
> > > > > 
> > > > > Why just a library? It should do a common things. I think it should be a
> > > > > virtual object. Looks like your patch again splits the common
> > > > > functionality into multiple drivers. That is kind of backwards attitude.
> > > > > I don't get it. We should rather focus on fixing the mess the
> > > > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > > > > model.    
> > > > 
> > > > So it seems that at least one benefit for netvsc would be better
> > > > handling of renames.
> > > > 
> > > > Question is how can this change to 3-netdev happen?  Stephen is
> > > > concerned about risk of breaking some userspace.
> > > > 
> > > > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > > > address, and you said then "why not use existing network namespaces
> > > > rather than inventing a new abstraction". So how about it then? Do you
> > > > want to find a way to use namespaces to hide the PV device for netvsc
> > > > compatibility?
> > > >   
> > > 
> > > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > > startups that all demand eth0 always be present. And VF may come and go.  
> > 
> > Well failover seems to maintain this invariant with the 3 dev model.
> > 
> > > After this history, there is a strong motivation not to change how kernel
> > > behaves. Switching to 3 device model would be perceived as breaking
> > > existing userspace.  
> > 
> > I feel I'm misunderstood. I was asking whether a 3-rd device can be
> > hidden so that userspace does not know that you switched to a 3 device
> > model. It will think there are 2 devices and will keep working.
> > 
> > If you do that, then there won't be anything that
> > would be perceived as breaking existing userspace, will there?
> 
> DPDK now knows about the netvsc 2 device model and drivers in userspace
> depend on it.

Interesting but I'm not sure how this answers the question. How would
DPDK care that there's a hidden device? If you can point out the
code in question, maybe a way can be found to make changes while
keeping it working.

> > 
> > 
> > > With virtio you can  work it out with the distro's yourself.
> > > There is no pre-existing semantics to deal with.
> > > 
> > > For the virtio, I don't see the need for IFF_HIDDEN.
> > > With 3-dev model as long as you mark the PV and VF devices
> > > as slaves, then userspace knows to leave them alone. Assuming userspace
> > > is already able to deal with team and bond devices.  
> > 
> > That's clear enough.
> > 
> > > Any time you introduce new UAPI behavior something breaks.  
> > 
> > Not if we do it right.
> > 
> > > On the rename front, I really don't care if VF can be renamed.  
> > 
> > OK that's nice.
> > 
> > > And for
> > > netvsc want to allow the PV device to be renamed.  
> > 
> > That's because of the 2 device model, right?  So that explains why even
> > if the delayed hack is good for the goose it might not be good for the
> > gander :)
> 
> You are bringing up the VF right away. How does the 3-device initialization
> state machine work?  Do you give a window for udev to possibly rename the
> VF? Do you rely on that?
> 
> > 
> > > Udev developers want that
> > > but have not found a stable/persistent value to expose to userspace
> > > to allow it.
Stephen Hemminger June 7, 2018, 3:23 p.m. UTC | #29
On Thu, 7 Jun 2018 17:57:50 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 06, 2018 at 03:24:08PM -0700, Stephen Hemminger wrote:
> > On Thu, 7 Jun 2018 00:47:52 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote:  
> > > > On Wed, 6 Jun 2018 15:30:27 +0300
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > >     
> > > > > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:    
> > > > > > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:      
> > > > > > >The net failover should be a simple library, not a virtual
> > > > > > >object with function callbacks (see callback hell).      
> > > > > > 
> > > > > > Why just a library? It should do a common things. I think it should be a
> > > > > > virtual object. Looks like your patch again splits the common
> > > > > > functionality into multiple drivers. That is kind of backwards attitude.
> > > > > > I don't get it. We should rather focus on fixing the mess the
> > > > > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > > > > > model.      
> > > > > 
> > > > > So it seems that at least one benefit for netvsc would be better
> > > > > handling of renames.
> > > > > 
> > > > > Question is how can this change to 3-netdev happen?  Stephen is
> > > > > concerned about risk of breaking some userspace.
> > > > > 
> > > > > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > > > > address, and you said then "why not use existing network namespaces
> > > > > rather than inventing a new abstraction". So how about it then? Do you
> > > > > want to find a way to use namespaces to hide the PV device for netvsc
> > > > > compatibility?
> > > > >     
> > > > 
> > > > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > > > startups that all demand eth0 always be present. And VF may come and go.    
> > > 
> > > Well failover seems to maintain this invariant with the 3 dev model.
> > >   
> > > > After this history, there is a strong motivation not to change how kernel
> > > > behaves. Switching to 3 device model would be perceived as breaking
> > > > existing userspace.    
> > > 
> > > I feel I'm misunderstood. I was asking whether a 3-rd device can be
> > > hidden so that userspace does not know that you switched to a 3 device
> > > model. It will think there are 2 devices and will keep working.
> > > 
> > > If you do that, then there won't be anything that
> > > would be perceived as breaking existing userspace, will there?  
> > 
> > DPDK now knows about the netvsc 2 device model and drivers in userspace
> > depend on it.  
> 
> Interesting but I'm not sure how this answers the question. How would
> DPDK care that there's a hidden device? If you can point out the
> code in question, maybe a way can be found to make changes while
> keeping it working.

See http://dpdk.org/browse/dpdk/tree/drivers/net/vdev_netvsc/vdev_netvsc.c

I am working to eliminate the necessity of this complex model in DPDK.
Having a 3 device model inside DPDK has just as many problems as in the
kernel.
Michael S. Tsirkin June 7, 2018, 3:41 p.m. UTC | #30
On Thu, Jun 07, 2018 at 07:51:12AM -0700, Stephen Hemminger wrote:
> On Thu, 7 Jun 2018 07:17:51 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
> 
> > On Wed, Jun 6, 2018 at 3:25 PM, Stephen Hemminger
> > <stephen@networkplumber.org> wrote:
> > > On Wed, 6 Jun 2018 14:54:04 -0700
> > > "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> > >  
> > >> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:  
> > >> > On Wed, 6 Jun 2018 15:30:27 +0300
> > >> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >> >  
> > >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> > >> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:  
> > >> >>>> The net failover should be a simple library, not a virtual
> > >> >>>> object with function callbacks (see callback hell).  
> > >> >>> Why just a library? It should do a common things. I think it should be a
> > >> >>> virtual object. Looks like your patch again splits the common
> > >> >>> functionality into multiple drivers. That is kind of backwards attitude.
> > >> >>> I don't get it. We should rather focus on fixing the mess the
> > >> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > >> >>> model.  
> > >> >> So it seems that at least one benefit for netvsc would be better
> > >> >> handling of renames.
> > >> >>
> > >> >> Question is how can this change to 3-netdev happen?  Stephen is
> > >> >> concerned about risk of breaking some userspace.
> > >> >>
> > >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > >> >> address, and you said then "why not use existing network namespaces
> > >> >> rather than inventing a new abstraction". So how about it then? Do you
> > >> >> want to find a way to use namespaces to hide the PV device for netvsc
> > >> >> compatibility?
> > >> >>  
> > >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > >> > startups that all demand eth0 always be present. And VF may come and go.
> > >> > After this history, there is a strong motivation not to change how kernel
> > >> > behaves. Switching to 3 device model would be perceived as breaking
> > >> > existing userspace.  
> > >>
> > >> I think it should be possible for netvsc to work with 3 dev model if the only
> > >> requirement is that eth0 will always be present. With net_failover, you will
> > >> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an issue
> > >> if somehow there is userspace requirement that there can be only 2 netdevs, not 3
> > >> when VF is plugged.
> > >>
> > >> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc device
> > >> and the IP address gets configured on eth0. Will this be an issue?  
> > >
> > > DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess
> > > but that is the way it is.  
> > 
> > Why would DPDK care what we do in the kernel? Isn't it just slapping
> > vfio-pci on the netdevs it sees?
> 
> Alex, you are correct for Intel devices; but DPDK on Azure is not Intel based.,.
> The DPDK support uses:
>  * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to
>    userspace. This means VF netdev device must exist and be visible.
>  * Slow path using kernel netvsc device, TAP and BPF to get exception
>    path packets to userspace.
>  * A autodiscovery mechanism that to set all this up that relies on
>    2 device model and sysfs.

Could you describe what does it look for exactly? What will break if
instead of MLX5 being a child of the PV, it's a child of the failover
device?

> In this version, there is no VFIO-PCI. And also Hyper-V does not have virtual
> IOMMU so VFIO will not work there at all.
>
Stephen Hemminger June 7, 2018, 4:17 p.m. UTC | #31
On Thu, 7 Jun 2018 18:41:31 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> > > Why would DPDK care what we do in the kernel? Isn't it just slapping
> > > vfio-pci on the netdevs it sees?  
> > 
> > Alex, you are correct for Intel devices; but DPDK on Azure is not Intel based.,.
> > The DPDK support uses:
> >  * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to
> >    userspace. This means VF netdev device must exist and be visible.
> >  * Slow path using kernel netvsc device, TAP and BPF to get exception
> >    path packets to userspace.
> >  * A autodiscovery mechanism that to set all this up that relies on
> >    2 device model and sysfs.  
> 
> Could you describe what does it look for exactly? What will break if
> instead of MLX5 being a child of the PV, it's a child of the failover
> device?

So in DPDK there is an internal four device model:
	1. failsafe is like failover in your model
	2. TAP is used like netvsc in kernel
	3. MLX5 is the VF
	4. vdev_netvsc is a pseudo device whose only reason to exist
	   is to glue everything together.

Digging deeper inside...

Vdev_netvsc does:
   * driver is started in a convuluted way off device arguments
   * probe routine for driver runs
      - scans list of kernel interfaces in sysfs
      - matches those using VMBUS 
      - skip netvsc devices that have an IPV4 route
   * scan for PCI devices that have same MAC address as kernel netvsc
     devices discovered in previous step
   * add these interfaces to arguments to failsafe

Then failsafe configures based on arguments on device

The code works but is specific to the Azure hardware model, and exposes lots
of things to application that it should not have to care about.

If you  try and walk through this code in DPDK, you will see why I have developed
a dislike for high levels of indirection.
Michael S. Tsirkin June 7, 2018, 5:22 p.m. UTC | #32
On Thu, Jun 07, 2018 at 09:17:42AM -0700, Stephen Hemminger wrote:
> On Thu, 7 Jun 2018 18:41:31 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > > > Why would DPDK care what we do in the kernel? Isn't it just slapping
> > > > vfio-pci on the netdevs it sees?  
> > > 
> > > Alex, you are correct for Intel devices; but DPDK on Azure is not Intel based.,.
> > > The DPDK support uses:
> > >  * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to
> > >    userspace. This means VF netdev device must exist and be visible.
> > >  * Slow path using kernel netvsc device, TAP and BPF to get exception
> > >    path packets to userspace.
> > >  * A autodiscovery mechanism that to set all this up that relies on
> > >    2 device model and sysfs.  
> > 
> > Could you describe what does it look for exactly? What will break if
> > instead of MLX5 being a child of the PV, it's a child of the failover
> > device?
> 
> So in DPDK there is an internal four device model:
> 	1. failsafe is like failover in your model
> 	2. TAP is used like netvsc in kernel
> 	3. MLX5 is the VF
> 	4. vdev_netvsc is a pseudo device whose only reason to exist
> 	   is to glue everything together.
> 
> Digging deeper inside...
> 
> Vdev_netvsc does:
>    * driver is started in a convuluted way off device arguments
>    * probe routine for driver runs
>       - scans list of kernel interfaces in sysfs
>       - matches those using VMBUS 

Could you tell a bit more what does this step entail?

>       - skip netvsc devices that have an IPV4 route
>    * scan for PCI devices that have same MAC address as kernel netvsc
>      devices discovered in previous step
>    * add these interfaces to arguments to failsafe
> 
> Then failsafe configures based on arguments on device
> 
> The code works but is specific to the Azure hardware model, and exposes lots
> of things to application that it should not have to care about.
> 
> If you  try and walk through this code in DPDK, you will see why I have developed
> a dislike for high levels of indirection.
> 
> 
> 	   

Thanks that was helpful!  I'll try to poke at it next week.  Just from
the description it seems the kernel is merely used to locate the MAC
address through sysfs and that for this DPDK code to keep working the
hidden device must be hidden from it in sysfs - is that a fair summary?
Stephen Hemminger June 8, 2018, 6:30 p.m. UTC | #33
On Thu, 7 Jun 2018 20:22:15 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jun 07, 2018 at 09:17:42AM -0700, Stephen Hemminger wrote:
> > On Thu, 7 Jun 2018 18:41:31 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > > > Why would DPDK care what we do in the kernel? Isn't it just slapping
> > > > > vfio-pci on the netdevs it sees?    
> > > > 
> > > > Alex, you are correct for Intel devices; but DPDK on Azure is not Intel based.,.
> > > > The DPDK support uses:
> > > >  * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to
> > > >    userspace. This means VF netdev device must exist and be visible.
> > > >  * Slow path using kernel netvsc device, TAP and BPF to get exception
> > > >    path packets to userspace.
> > > >  * A autodiscovery mechanism that to set all this up that relies on
> > > >    2 device model and sysfs.    
> > > 
> > > Could you describe what does it look for exactly? What will break if
> > > instead of MLX5 being a child of the PV, it's a child of the failover
> > > device?  
> > 
> > So in DPDK there is an internal four device model:
> > 	1. failsafe is like failover in your model
> > 	2. TAP is used like netvsc in kernel
> > 	3. MLX5 is the VF
> > 	4. vdev_netvsc is a pseudo device whose only reason to exist
> > 	   is to glue everything together.
> > 
> > Digging deeper inside...
> > 
> > Vdev_netvsc does:
> >    * driver is started in a convuluted way off device arguments
> >    * probe routine for driver runs
> >       - scans list of kernel interfaces in sysfs
> >       - matches those using VMBUS   
> 
> Could you tell a bit more what does this step entail?

Quick code high/low lights.


	ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, 1, name,
					kvargs, specified, &matched);
static int
vdev_netvsc_foreach_iface(int (*func)(const struct if_nameindex *iface,
				      const struct ether_addr *eth_addr,
				      va_list ap), int is_netvsc, ...)
{
	struct if_nameindex *iface = if_nameindex();


	for (i = 0; iface[i].if_name; ++i) {

		is_netvsc_ret = vdev_netvsc_iface_is_netvsc(&iface[i]) ? 1 : 0;
		if (is_netvsc ^ is_netvsc_ret)
			continue;

		strlcpy(req.ifr_name, iface[i].if_name, sizeof(req.ifr_name));
		if (ioctl(s, SIOCGIFHWADDR, &req) == -1) {
		}

		memcpy(eth_addr.addr_bytes, req.ifr_hwaddr.sa_data,
		       RTE_DIM(eth_addr.addr_bytes));

		ret = func(&iface[i], &eth_addr, ap);  << func is vdev_netvsc_netvsc_probe


static int
vdev_netvsc_netvsc_probe(const struct if_nameindex *iface,
			 const struct ether_addr *eth_addr,
			 va_list ap)
{

	/* Routed NetVSC should not be probed. */
	if (vdev_netvsc_has_route(iface, AF_INET) ||
	    vdev_netvsc_has_route(iface, AF_INET6)) {
		if (!specified)
			return 0;
		DRV_LOG(WARNING, "probably using routed NetVSC interface \"%s\""
			" (index %u)", iface->if_name, iface->if_index);
	}
	/* Create interface context. */
	ctx = calloc(1, sizeof(*ctx));
...


> 
> >       - skip netvsc devices that have an IPV4 route
> >    * scan for PCI devices that have same MAC address as kernel netvsc
> >      devices discovered in previous step
> >    * add these interfaces to arguments to failsafe
> > 
> > Then failsafe configures based on arguments on device
> > 
> > The code works but is specific to the Azure hardware model, and exposes lots
> > of things to application that it should not have to care about.
> > 
> > If you  try and walk through this code in DPDK, you will see why I have developed
> > a dislike for high levels of indirection.
> > 
> > 
> > 	     
> 
> Thanks that was helpful!  I'll try to poke at it next week.  Just from
> the description it seems the kernel is merely used to locate the MAC
> address through sysfs and that for this DPDK code to keep working the
> hidden device must be hidden from it in sysfs - is that a fair summary?

What is the point of the 3 device model? What value does it have
to userspace? How would userspace use each of the three devices.
Going back to 3 device model really doesn't make sense to me if
there is not visible benefit.

Some other considerations:
   * there is ongoing development to support RDMA failover as
     well in netvsc.

   * there is a new driver which implements the VMBUS protocol
     in userspace for DPDK. This gets rid of several layers and
     removes any special scanning code. The vmbus device is
     unbound from netvsc and bound to UIO device.  Then the user
     space DPDK driver manages all the host signalling events
     including VF discovery. It is really 2 device model done
     all in userspace. The kernel device is still needed when
     the VF is mellanox; because that is how the MLX DPDK driver
     rolls.

  * what about nested KVM on Hyper-V? Would it make sense to
    have a way to pass subset of VF queues to guest?
Michael S. Tsirkin June 8, 2018, 7:04 p.m. UTC | #34
On Fri, Jun 08, 2018 at 11:30:08AM -0700, Stephen Hemminger wrote:
>   * what about nested KVM on Hyper-V? Would it make sense to
>     have a way to pass subset of VF queues to guest?

No as long as hyper-v doesn't have a vIOMMU.
Siwei Liu June 8, 2018, 10:25 p.m. UTC | #35
On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Wed, 6 Jun 2018 15:30:27 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>> > >The net failover should be a simple library, not a virtual
>> > >object with function callbacks (see callback hell).
>> >
>> > Why just a library? It should do a common things. I think it should be a
>> > virtual object. Looks like your patch again splits the common
>> > functionality into multiple drivers. That is kind of backwards attitude.
>> > I don't get it. We should rather focus on fixing the mess the
>> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> > model.
>>
>> So it seems that at least one benefit for netvsc would be better
>> handling of renames.
>>
>> Question is how can this change to 3-netdev happen?  Stephen is
>> concerned about risk of breaking some userspace.
>>
>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> address, and you said then "why not use existing network namespaces
>> rather than inventing a new abstraction". So how about it then? Do you
>> want to find a way to use namespaces to hide the PV device for netvsc
>> compatibility?
>>
>
> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> startups that all demand eth0 always be present. And VF may come and go.
> After this history, there is a strong motivation not to change how kernel
> behaves. Switching to 3 device model would be perceived as breaking
> existing userspace.
>
> With virtio you can  work it out with the distro's yourself.
> There is no pre-existing semantics to deal with.
>
> For the virtio, I don't see the need for IFF_HIDDEN.

I have a somewhat different view regarding IFF_HIDDEN. The purpose of
that flag, as well as the 1-netdev model, is to have a means to
inherit the interface name from the VF, and to eliminate playing hacks
around renaming devices, customizing udev rules and et al. Why
inheriting VF's name important? To allow existing config/setup around
VF continues to work across kernel feature upgrade. Most of network
config files in all distros are based on interface names. Few are MAC
address based but making lower slaves hidden would cover the rest. And
most importantly, preserving the same level of user experience as
using raw VF interface once getting all ndo_ops and ethtool_ops
exposed. This is essential to realize transparent live migration that
users dont have to learn and be aware of the undertaken.

It's unfair to say all virtio use cases don't need IFF_HIDDEN. A few
use cases don't care about getting slaves exposed, the 3-netdev model
would work for them. For the rest, the pre-existing semantics to them
is the single VF device model they've already dealt with. This is
nothing different than having Azure stick to the 2-netdev model
because of existing user base IMHO.

-Siwei


> With 3-dev model as long as you mark the PV and VF devices
> as slaves, then userspace knows to leave them alone. Assuming userspace
> is already able to deal with team and bond devices.
> Any time you introduce new UAPI behavior something breaks.
>
> On the rename front, I really don't care if VF can be renamed. And for
> netvsc want to allow the PV device to be renamed. Udev developers want that
> but have not found a stable/persistent value to expose to userspace
> to allow it.
Siwei Liu June 8, 2018, 10:54 p.m. UTC | #36
On Wed, Jun 6, 2018 at 2:54 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
>
> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:
>>
>> On Wed, 6 Jun 2018 15:30:27 +0300
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>>>>
>>>> Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>>>>>
>>>>> The net failover should be a simple library, not a virtual
>>>>> object with function callbacks (see callback hell).
>>>>
>>>> Why just a library? It should do a common things. I think it should be a
>>>> virtual object. Looks like your patch again splits the common
>>>> functionality into multiple drivers. That is kind of backwards attitude.
>>>> I don't get it. We should rather focus on fixing the mess the
>>>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>>>> model.
>>>
>>> So it seems that at least one benefit for netvsc would be better
>>> handling of renames.
>>>
>>> Question is how can this change to 3-netdev happen?  Stephen is
>>> concerned about risk of breaking some userspace.
>>>
>>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>>> address, and you said then "why not use existing network namespaces
>>> rather than inventing a new abstraction". So how about it then? Do you
>>> want to find a way to use namespaces to hide the PV device for netvsc
>>> compatibility?
>>>
>> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> startups that all demand eth0 always be present. And VF may come and go.
>> After this history, there is a strong motivation not to change how kernel
>> behaves. Switching to 3 device model would be perceived as breaking
>> existing userspace.
>
>
> I think it should be possible for netvsc to work with 3 dev model if the
> only
> requirement is that eth0 will always be present. With net_failover, you will
> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an
> issue
> if somehow there is userspace requirement that there can be only 2 netdevs,
> not 3
> when VF is plugged.
>
> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc
> device
> and the IP address gets configured on eth0. Will this be an issue?
>
Did you realize that the eth0 name in the current 3-netdev code can't
be consistently persisted across reboot, if you have more than one VFs
to pair with? On one boot it got eth0/eth0nsby, on the next it may get
eth1/eth1nsby on the same interface.

It won't be useable by default until you add some custom udev rules.

-Siwei

>
>
>>
>> With virtio you can  work it out with the distro's yourself.
>> There is no pre-existing semantics to deal with.
>>
>> For the virtio, I don't see the need for IFF_HIDDEN.
>> With 3-dev model as long as you mark the PV and VF devices
>> as slaves, then userspace knows to leave them alone. Assuming userspace
>> is already able to deal with team and bond devices.
>> Any time you introduce new UAPI behavior something breaks.
>>
>> On the rename front, I really don't care if VF can be renamed. And for
>> netvsc want to allow the PV device to be renamed. Udev developers want
>> that
>> but have not found a stable/persistent value to expose to userspace
>> to allow it.
>
>
Stephen Hemminger June 8, 2018, 11:18 p.m. UTC | #37
On Fri, 8 Jun 2018 15:25:59 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> > On Wed, 6 Jun 2018 15:30:27 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >  
> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:  
> >> > >The net failover should be a simple library, not a virtual
> >> > >object with function callbacks (see callback hell).  
> >> >
> >> > Why just a library? It should do a common things. I think it should be a
> >> > virtual object. Looks like your patch again splits the common
> >> > functionality into multiple drivers. That is kind of backwards attitude.
> >> > I don't get it. We should rather focus on fixing the mess the
> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >> > model.  
> >>
> >> So it seems that at least one benefit for netvsc would be better
> >> handling of renames.
> >>
> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> concerned about risk of breaking some userspace.
> >>
> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> address, and you said then "why not use existing network namespaces
> >> rather than inventing a new abstraction". So how about it then? Do you
> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> compatibility?
> >>  
> >
> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > startups that all demand eth0 always be present. And VF may come and go.
> > After this history, there is a strong motivation not to change how kernel
> > behaves. Switching to 3 device model would be perceived as breaking
> > existing userspace.
> >
> > With virtio you can  work it out with the distro's yourself.
> > There is no pre-existing semantics to deal with.
> >
> > For the virtio, I don't see the need for IFF_HIDDEN.  
> 
> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> that flag, as well as the 1-netdev model, is to have a means to
> inherit the interface name from the VF, and to eliminate playing hacks
> around renaming devices, customizing udev rules and et al. Why
> inheriting VF's name important? To allow existing config/setup around
> VF continues to work across kernel feature upgrade. Most of network
> config files in all distros are based on interface names. Few are MAC
> address based but making lower slaves hidden would cover the rest. And
> most importantly, preserving the same level of user experience as
> using raw VF interface once getting all ndo_ops and ethtool_ops
> exposed. This is essential to realize transparent live migration that
> users dont have to learn and be aware of the undertaken.

Inheriting the VF name will fail in the migration scenario.
It is perfectly reasonable to migrate a guest to another machine where
the VF PCI address is different. And since current udev/systemd model
is to base network device name off of PCI address, the device will change
name when guest is migrated.

On Azure, the VF maybe removed (by host) at any time and then later
reattached. There is no guarantee that VF will show back up at
the same synthetic PCI address. It will likely have a different
PCI domain value.
Siwei Liu June 8, 2018, 11:44 p.m. UTC | #38
On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Fri, 8 Jun 2018 15:25:59 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>> <stephen@networkplumber.org> wrote:
>> > On Wed, 6 Jun 2018 15:30:27 +0300
>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >
>> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>> >> > >The net failover should be a simple library, not a virtual
>> >> > >object with function callbacks (see callback hell).
>> >> >
>> >> > Why just a library? It should do a common things. I think it should be a
>> >> > virtual object. Looks like your patch again splits the common
>> >> > functionality into multiple drivers. That is kind of backwards attitude.
>> >> > I don't get it. We should rather focus on fixing the mess the
>> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >> > model.
>> >>
>> >> So it seems that at least one benefit for netvsc would be better
>> >> handling of renames.
>> >>
>> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> concerned about risk of breaking some userspace.
>> >>
>> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> address, and you said then "why not use existing network namespaces
>> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> compatibility?
>> >>
>> >
>> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> > startups that all demand eth0 always be present. And VF may come and go.
>> > After this history, there is a strong motivation not to change how kernel
>> > behaves. Switching to 3 device model would be perceived as breaking
>> > existing userspace.
>> >
>> > With virtio you can  work it out with the distro's yourself.
>> > There is no pre-existing semantics to deal with.
>> >
>> > For the virtio, I don't see the need for IFF_HIDDEN.
>>
>> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> that flag, as well as the 1-netdev model, is to have a means to
>> inherit the interface name from the VF, and to eliminate playing hacks
>> around renaming devices, customizing udev rules and et al. Why
>> inheriting VF's name important? To allow existing config/setup around
>> VF continues to work across kernel feature upgrade. Most of network
>> config files in all distros are based on interface names. Few are MAC
>> address based but making lower slaves hidden would cover the rest. And
>> most importantly, preserving the same level of user experience as
>> using raw VF interface once getting all ndo_ops and ethtool_ops
>> exposed. This is essential to realize transparent live migration that
>> users dont have to learn and be aware of the undertaken.
>
> Inheriting the VF name will fail in the migration scenario.
> It is perfectly reasonable to migrate a guest to another machine where
> the VF PCI address is different. And since current udev/systemd model
> is to base network device name off of PCI address, the device will change
> name when guest is migrated.
>
The scenario of having VF on a different PCI address on post migration
is essentially equal to plugging in a new NIC. Why it has to pair with
the original PV? A sepearte PV device should be in place to pair the
new VF.


> On Azure, the VF maybe removed (by host) at any time and then later
> reattached. There is no guarantee that VF will show back up at
> the same synthetic PCI address. It will likely have a different
> PCI domain value.

This is something QEMU can do and make sure the PCI address is
consistent after migration.

-Siwei
Stephen Hemminger June 9, 2018, 12:02 a.m. UTC | #39
On Fri, 8 Jun 2018 16:44:12 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> > On Fri, 8 Jun 2018 15:25:59 -0700
> > Siwei Liu <loseweigh@gmail.com> wrote:
> >  
> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
> >> <stephen@networkplumber.org> wrote:  
> >> > On Wed, 6 Jun 2018 15:30:27 +0300
> >> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> >  
> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:  
> >> >> > >The net failover should be a simple library, not a virtual
> >> >> > >object with function callbacks (see callback hell).  
> >> >> >
> >> >> > Why just a library? It should do a common things. I think it should be a
> >> >> > virtual object. Looks like your patch again splits the common
> >> >> > functionality into multiple drivers. That is kind of backwards attitude.
> >> >> > I don't get it. We should rather focus on fixing the mess the
> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >> >> > model.  
> >> >>
> >> >> So it seems that at least one benefit for netvsc would be better
> >> >> handling of renames.
> >> >>
> >> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> >> concerned about risk of breaking some userspace.
> >> >>
> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> >> address, and you said then "why not use existing network namespaces
> >> >> rather than inventing a new abstraction". So how about it then? Do you
> >> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> >> compatibility?
> >> >>  
> >> >
> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> >> > startups that all demand eth0 always be present. And VF may come and go.
> >> > After this history, there is a strong motivation not to change how kernel
> >> > behaves. Switching to 3 device model would be perceived as breaking
> >> > existing userspace.
> >> >
> >> > With virtio you can  work it out with the distro's yourself.
> >> > There is no pre-existing semantics to deal with.
> >> >
> >> > For the virtio, I don't see the need for IFF_HIDDEN.  
> >>
> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> >> that flag, as well as the 1-netdev model, is to have a means to
> >> inherit the interface name from the VF, and to eliminate playing hacks
> >> around renaming devices, customizing udev rules and et al. Why
> >> inheriting VF's name important? To allow existing config/setup around
> >> VF continues to work across kernel feature upgrade. Most of network
> >> config files in all distros are based on interface names. Few are MAC
> >> address based but making lower slaves hidden would cover the rest. And
> >> most importantly, preserving the same level of user experience as
> >> using raw VF interface once getting all ndo_ops and ethtool_ops
> >> exposed. This is essential to realize transparent live migration that
> >> users dont have to learn and be aware of the undertaken.  
> >
> > Inheriting the VF name will fail in the migration scenario.
> > It is perfectly reasonable to migrate a guest to another machine where
> > the VF PCI address is different. And since current udev/systemd model
> > is to base network device name off of PCI address, the device will change
> > name when guest is migrated.
> >  
> The scenario of having VF on a different PCI address on post migration
> is essentially equal to plugging in a new NIC. Why it has to pair with
> the original PV? A sepearte PV device should be in place to pair the
> new VF.

The host only guarantees that the PV device will be on the same network.
It does not make any PCI guarantees. The way Windows works is to find
the device based on "serial number" which is an Hyper-V specific attribute
of PCI devices.

I considered naming off of serial number but that won't work for the
case where PV device is present first and VF arrives later. The serial
number is attribute of VF, not the PV which is there first.

Your ideas about having the PCI information of the VF form the name
of the failover device have the same problem. The PV device may
be the only one present on boot.


> > On Azure, the VF maybe removed (by host) at any time and then later
> > reattached. There is no guarantee that VF will show back up at
> > the same synthetic PCI address. It will likely have a different
> > PCI domain value.  
> 
> This is something QEMU can do and make sure the PCI address is
> consistent after migration.
> 
> -Siwei
Siwei Liu June 9, 2018, 12:42 a.m. UTC | #40
On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Fri, 8 Jun 2018 16:44:12 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
>> <stephen@networkplumber.org> wrote:
>> > On Fri, 8 Jun 2018 15:25:59 -0700
>> > Siwei Liu <loseweigh@gmail.com> wrote:
>> >
>> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>> >> <stephen@networkplumber.org> wrote:
>> >> > On Wed, 6 Jun 2018 15:30:27 +0300
>> >> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >> >
>> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>> >> >> > >The net failover should be a simple library, not a virtual
>> >> >> > >object with function callbacks (see callback hell).
>> >> >> >
>> >> >> > Why just a library? It should do a common things. I think it should be a
>> >> >> > virtual object. Looks like your patch again splits the common
>> >> >> > functionality into multiple drivers. That is kind of backwards attitude.
>> >> >> > I don't get it. We should rather focus on fixing the mess the
>> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >> >> > model.
>> >> >>
>> >> >> So it seems that at least one benefit for netvsc would be better
>> >> >> handling of renames.
>> >> >>
>> >> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> >> concerned about risk of breaking some userspace.
>> >> >>
>> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> >> address, and you said then "why not use existing network namespaces
>> >> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> >> compatibility?
>> >> >>
>> >> >
>> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> >> > startups that all demand eth0 always be present. And VF may come and go.
>> >> > After this history, there is a strong motivation not to change how kernel
>> >> > behaves. Switching to 3 device model would be perceived as breaking
>> >> > existing userspace.
>> >> >
>> >> > With virtio you can  work it out with the distro's yourself.
>> >> > There is no pre-existing semantics to deal with.
>> >> >
>> >> > For the virtio, I don't see the need for IFF_HIDDEN.
>> >>
>> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> >> that flag, as well as the 1-netdev model, is to have a means to
>> >> inherit the interface name from the VF, and to eliminate playing hacks
>> >> around renaming devices, customizing udev rules and et al. Why
>> >> inheriting VF's name important? To allow existing config/setup around
>> >> VF continues to work across kernel feature upgrade. Most of network
>> >> config files in all distros are based on interface names. Few are MAC
>> >> address based but making lower slaves hidden would cover the rest. And
>> >> most importantly, preserving the same level of user experience as
>> >> using raw VF interface once getting all ndo_ops and ethtool_ops
>> >> exposed. This is essential to realize transparent live migration that
>> >> users dont have to learn and be aware of the undertaken.
>> >
>> > Inheriting the VF name will fail in the migration scenario.
>> > It is perfectly reasonable to migrate a guest to another machine where
>> > the VF PCI address is different. And since current udev/systemd model
>> > is to base network device name off of PCI address, the device will change
>> > name when guest is migrated.
>> >
>> The scenario of having VF on a different PCI address on post migration
>> is essentially equal to plugging in a new NIC. Why it has to pair with
>> the original PV? A sepearte PV device should be in place to pair the
>> new VF.
>
> The host only guarantees that the PV device will be on the same network.
> It does not make any PCI guarantees. The way Windows works is to find
> the device based on "serial number" which is an Hyper-V specific attribute
> of PCI devices.
>
> I considered naming off of serial number but that won't work for the
> case where PV device is present first and VF arrives later. The serial
> number is attribute of VF, not the PV which is there first.

I assume the PV can get that information ahead of time before VF
arrives? Without it how do you match the device when you see a VF
coming with some serial number? Is it possible for PV to get the
matching SN even earlier during probe time? Or it has to depend on the
presence of vPCI bridge to generate this SN?

>
> Your ideas about having the PCI information of the VF form the name
> of the failover device have the same problem. The PV device may
> be the only one present on boot.

Yeah, this is a chicken-egg problem indeed, and that was the reason
why I supply the BDF info for PV to name the master interface.
However, the ACPI PCI slot needs to depend on the PCI bus enumeration
so that can't be predictable.  Would it make sense to only rename when
the first time a matching VF appears and PV interface isn't brought
up, then the failover master would always stick to the name
afterwards? I think it should cover most scenarios as it's usually
during boot time (dracut) the VF first appears and the PV interface at
the time then shouldn't have been configured yet.

-Siwei

>
>
>> > On Azure, the VF maybe removed (by host) at any time and then later
>> > reattached. There is no guarantee that VF will show back up at
>> > the same synthetic PCI address. It will likely have a different
>> > PCI domain value.
>>
>> This is something QEMU can do and make sure the PCI address is
>> consistent after migration.
>>
>> -Siwei
>
Jakub Kicinski June 9, 2018, 1:29 a.m. UTC | #41
On Fri, 8 Jun 2018 16:44:12 -0700, Siwei Liu wrote:
> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> >> that flag, as well as the 1-netdev model, is to have a means to
> >> inherit the interface name from the VF, and to eliminate playing hacks
> >> around renaming devices, customizing udev rules and et al. Why
> >> inheriting VF's name important? To allow existing config/setup around
> >> VF continues to work across kernel feature upgrade. Most of network
> >> config files in all distros are based on interface names. Few are MAC
> >> address based but making lower slaves hidden would cover the rest. And
> >> most importantly, preserving the same level of user experience as
> >> using raw VF interface once getting all ndo_ops and ethtool_ops
> >> exposed. This is essential to realize transparent live migration that
> >> users dont have to learn and be aware of the undertaken.  
> >
> > Inheriting the VF name will fail in the migration scenario.
> > It is perfectly reasonable to migrate a guest to another machine where
> > the VF PCI address is different. And since current udev/systemd model
> > is to base network device name off of PCI address, the device will change
> > name when guest is migrated.
> >  
> The scenario of having VF on a different PCI address on post migration
> is essentially equal to plugging in a new NIC. Why it has to pair with
> the original PV? A sepearte PV device should be in place to pair the
> new VF.

IMHO it may be a better idea to look at the VF as acceleration for the
PV rather than PV a migration vehicle from the VF.  Hence we should
continue to follow the naming of PV, like the current implementation
does implicitly by linking to PV's struct device.
Michael S. Tsirkin June 11, 2018, 2:01 p.m. UTC | #42
On Fri, Jun 08, 2018 at 05:02:35PM -0700, Stephen Hemminger wrote:
> On Fri, 8 Jun 2018 16:44:12 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
> 
> > On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
> > <stephen@networkplumber.org> wrote:
> > > On Fri, 8 Jun 2018 15:25:59 -0700
> > > Siwei Liu <loseweigh@gmail.com> wrote:
> > >  
> > >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
> > >> <stephen@networkplumber.org> wrote:  
> > >> > On Wed, 6 Jun 2018 15:30:27 +0300
> > >> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >> >  
> > >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> > >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:  
> > >> >> > >The net failover should be a simple library, not a virtual
> > >> >> > >object with function callbacks (see callback hell).  
> > >> >> >
> > >> >> > Why just a library? It should do a common things. I think it should be a
> > >> >> > virtual object. Looks like your patch again splits the common
> > >> >> > functionality into multiple drivers. That is kind of backwards attitude.
> > >> >> > I don't get it. We should rather focus on fixing the mess the
> > >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > >> >> > model.  
> > >> >>
> > >> >> So it seems that at least one benefit for netvsc would be better
> > >> >> handling of renames.
> > >> >>
> > >> >> Question is how can this change to 3-netdev happen?  Stephen is
> > >> >> concerned about risk of breaking some userspace.
> > >> >>
> > >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > >> >> address, and you said then "why not use existing network namespaces
> > >> >> rather than inventing a new abstraction". So how about it then? Do you
> > >> >> want to find a way to use namespaces to hide the PV device for netvsc
> > >> >> compatibility?
> > >> >>  
> > >> >
> > >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > >> > startups that all demand eth0 always be present. And VF may come and go.
> > >> > After this history, there is a strong motivation not to change how kernel
> > >> > behaves. Switching to 3 device model would be perceived as breaking
> > >> > existing userspace.
> > >> >
> > >> > With virtio you can  work it out with the distro's yourself.
> > >> > There is no pre-existing semantics to deal with.
> > >> >
> > >> > For the virtio, I don't see the need for IFF_HIDDEN.  
> > >>
> > >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> > >> that flag, as well as the 1-netdev model, is to have a means to
> > >> inherit the interface name from the VF, and to eliminate playing hacks
> > >> around renaming devices, customizing udev rules and et al. Why
> > >> inheriting VF's name important? To allow existing config/setup around
> > >> VF continues to work across kernel feature upgrade. Most of network
> > >> config files in all distros are based on interface names. Few are MAC
> > >> address based but making lower slaves hidden would cover the rest. And
> > >> most importantly, preserving the same level of user experience as
> > >> using raw VF interface once getting all ndo_ops and ethtool_ops
> > >> exposed. This is essential to realize transparent live migration that
> > >> users dont have to learn and be aware of the undertaken.  
> > >
> > > Inheriting the VF name will fail in the migration scenario.
> > > It is perfectly reasonable to migrate a guest to another machine where
> > > the VF PCI address is different. And since current udev/systemd model
> > > is to base network device name off of PCI address, the device will change
> > > name when guest is migrated.
> > >  
> > The scenario of having VF on a different PCI address on post migration
> > is essentially equal to plugging in a new NIC. Why it has to pair with
> > the original PV? A sepearte PV device should be in place to pair the
> > new VF.
> 
> The host only guarantees that the PV device will be on the same network.
> It does not make any PCI guarantees. The way Windows works is to find
> the device based on "serial number" which is an Hyper-V specific attribute
> of PCI devices.
> 
> I considered naming off of serial number but that won't work for the
> case where PV device is present first and VF arrives later. The serial
> number is attribute of VF, not the PV which is there first.
> 
> Your ideas about having the PCI information of the VF form the name
> of the failover device have the same problem. The PV device may
> be the only one present on boot.

We plan to add the serial number to the PV.


> 
> > > On Azure, the VF maybe removed (by host) at any time and then later
> > > reattached. There is no guarantee that VF will show back up at
> > > the same synthetic PCI address. It will likely have a different
> > > PCI domain value.  
> > 
> > This is something QEMU can do and make sure the PCI address is
> > consistent after migration.
> > 
> > -Siwei
Stephen Hemminger June 11, 2018, 3:17 p.m. UTC | #43
On Fri, 8 Jun 2018 15:54:38 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Wed, Jun 6, 2018 at 2:54 PM, Samudrala, Sridhar
> <sridhar.samudrala@intel.com> wrote:
> >
> >
> > On 6/6/2018 2:24 PM, Stephen Hemminger wrote:  
> >>
> >> On Wed, 6 Jun 2018 15:30:27 +0300
> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>  
> >>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >>>>
> >>>> Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:  
> >>>>>
> >>>>> The net failover should be a simple library, not a virtual
> >>>>> object with function callbacks (see callback hell).  
> >>>>
> >>>> Why just a library? It should do a common things. I think it should be a
> >>>> virtual object. Looks like your patch again splits the common
> >>>> functionality into multiple drivers. That is kind of backwards attitude.
> >>>> I don't get it. We should rather focus on fixing the mess the
> >>>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >>>> model.  
> >>>
> >>> So it seems that at least one benefit for netvsc would be better
> >>> handling of renames.
> >>>
> >>> Question is how can this change to 3-netdev happen?  Stephen is
> >>> concerned about risk of breaking some userspace.
> >>>
> >>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >>> address, and you said then "why not use existing network namespaces
> >>> rather than inventing a new abstraction". So how about it then? Do you
> >>> want to find a way to use namespaces to hide the PV device for netvsc
> >>> compatibility?
> >>>  
> >> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> >> startups that all demand eth0 always be present. And VF may come and go.
> >> After this history, there is a strong motivation not to change how kernel
> >> behaves. Switching to 3 device model would be perceived as breaking
> >> existing userspace.  
> >
> >
> > I think it should be possible for netvsc to work with 3 dev model if the
> > only
> > requirement is that eth0 will always be present. With net_failover, you will
> > see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an
> > issue
> > if somehow there is userspace requirement that there can be only 2 netdevs,
> > not 3
> > when VF is plugged.
> >
> > eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc
> > device
> > and the IP address gets configured on eth0. Will this be an issue?
> >  
> Did you realize that the eth0 name in the current 3-netdev code can't
> be consistently persisted across reboot, if you have more than one VFs
> to pair with? On one boot it got eth0/eth0nsby, on the next it may get
> eth1/eth1nsby on the same interface.
> 
> It won't be useable by default until you add some custom udev rules.
> 

I think there is no reason to break things by moving netvsc to 3 device
model. 

The first device probed is always the same on Hyper-V/Azure, and always
comes up as eth0. The order comes from the fact that they are reported
to guest in that order and currently vmbus probe is single threaded.
Stephen Hemminger June 11, 2018, 3:22 p.m. UTC | #44
On Fri, 8 Jun 2018 17:42:21 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> > On Fri, 8 Jun 2018 16:44:12 -0700
> > Siwei Liu <loseweigh@gmail.com> wrote:
> >  
> >> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
> >> <stephen@networkplumber.org> wrote:  
> >> > On Fri, 8 Jun 2018 15:25:59 -0700
> >> > Siwei Liu <loseweigh@gmail.com> wrote:
> >> >  
> >> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
> >> >> <stephen@networkplumber.org> wrote:  
> >> >> > On Wed, 6 Jun 2018 15:30:27 +0300
> >> >> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> >> >  
> >> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:  
> >> >> >> > >The net failover should be a simple library, not a virtual
> >> >> >> > >object with function callbacks (see callback hell).  
> >> >> >> >
> >> >> >> > Why just a library? It should do a common things. I think it should be a
> >> >> >> > virtual object. Looks like your patch again splits the common
> >> >> >> > functionality into multiple drivers. That is kind of backwards attitude.
> >> >> >> > I don't get it. We should rather focus on fixing the mess the
> >> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >> >> >> > model.  
> >> >> >>
> >> >> >> So it seems that at least one benefit for netvsc would be better
> >> >> >> handling of renames.
> >> >> >>
> >> >> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> >> >> concerned about risk of breaking some userspace.
> >> >> >>
> >> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> >> >> address, and you said then "why not use existing network namespaces
> >> >> >> rather than inventing a new abstraction". So how about it then? Do you
> >> >> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> >> >> compatibility?
> >> >> >>  
> >> >> >
> >> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> >> >> > startups that all demand eth0 always be present. And VF may come and go.
> >> >> > After this history, there is a strong motivation not to change how kernel
> >> >> > behaves. Switching to 3 device model would be perceived as breaking
> >> >> > existing userspace.
> >> >> >
> >> >> > With virtio you can  work it out with the distro's yourself.
> >> >> > There is no pre-existing semantics to deal with.
> >> >> >
> >> >> > For the virtio, I don't see the need for IFF_HIDDEN.  
> >> >>
> >> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> >> >> that flag, as well as the 1-netdev model, is to have a means to
> >> >> inherit the interface name from the VF, and to eliminate playing hacks
> >> >> around renaming devices, customizing udev rules and et al. Why
> >> >> inheriting VF's name important? To allow existing config/setup around
> >> >> VF continues to work across kernel feature upgrade. Most of network
> >> >> config files in all distros are based on interface names. Few are MAC
> >> >> address based but making lower slaves hidden would cover the rest. And
> >> >> most importantly, preserving the same level of user experience as
> >> >> using raw VF interface once getting all ndo_ops and ethtool_ops
> >> >> exposed. This is essential to realize transparent live migration that
> >> >> users dont have to learn and be aware of the undertaken.  
> >> >
> >> > Inheriting the VF name will fail in the migration scenario.
> >> > It is perfectly reasonable to migrate a guest to another machine where
> >> > the VF PCI address is different. And since current udev/systemd model
> >> > is to base network device name off of PCI address, the device will change
> >> > name when guest is migrated.
> >> >  
> >> The scenario of having VF on a different PCI address on post migration
> >> is essentially equal to plugging in a new NIC. Why it has to pair with
> >> the original PV? A sepearte PV device should be in place to pair the
> >> new VF.  
> >
> > The host only guarantees that the PV device will be on the same network.
> > It does not make any PCI guarantees. The way Windows works is to find
> > the device based on "serial number" which is an Hyper-V specific attribute
> > of PCI devices.
> >
> > I considered naming off of serial number but that won't work for the
> > case where PV device is present first and VF arrives later. The serial
> > number is attribute of VF, not the PV which is there first.  
> 
> I assume the PV can get that information ahead of time before VF
> arrives? Without it how do you match the device when you see a VF
> coming with some serial number? Is it possible for PV to get the
> matching SN even earlier during probe time? Or it has to depend on the
> presence of vPCI bridge to generate this SN?



NO. the PV device does not know ahead of time and there are scenario
where the serial and PCI info can change when it does arrive. These
are test cases (not something people usually do). Example on WS2016:
  Guest configured with two or more vswitches and NICs.
  SR-IOV is not enabled

Later:
  On Hyper-V console (or Powershell command line) on host SR-IOV
  is enabled on the second NIC.

  The guest will be notified of new PCI device; the "serial number"
  will be 1.

If same process is repeated but in this case the first NIC has
SR-IOV enabled, it will get serial # 1.


I agree with Jakub. What you are proposing is backwards. The VF
must be thought of as a dependent of PV device not vice/versa.

> >
> > Your ideas about having the PCI information of the VF form the name
> > of the failover device have the same problem. The PV device may
> > be the only one present on boot.  
> 
> Yeah, this is a chicken-egg problem indeed, and that was the reason
> why I supply the BDF info for PV to name the master interface.
> However, the ACPI PCI slot needs to depend on the PCI bus enumeration
> so that can't be predictable.  Would it make sense to only rename when
> the first time a matching VF appears and PV interface isn't brought
> up, then the failover master would always stick to the name
> afterwards? I think it should cover most scenarios as it's usually
> during boot time (dracut) the VF first appears and the PV interface at
> the time then shouldn't have been configured yet.
> 
> -Siwei
> 
> >
> >  
> >> > On Azure, the VF maybe removed (by host) at any time and then later
> >> > reattached. There is no guarantee that VF will show back up at
> >> > the same synthetic PCI address. It will likely have a different
> >> > PCI domain value.  
> >>
> >> This is something QEMU can do and make sure the PCI address is
> >> consistent after migration.
> >>
> >> -Siwei  
> >
Michael S. Tsirkin June 11, 2018, 6:07 p.m. UTC | #45
On Wed, Jun 06, 2018 at 03:21:21PM -0700, Stephen Hemminger wrote:
> > > 
> > > I think you need to get rid of triggering off of the name change.  
> > 
> > Worth considering down the road since it's a bit of a hack but it's one
> > we won't have trouble supporting, unlike the delayed uplink.
> 
> You can't depend on userspace doing rename.

There's only so much we can do to add new functionality to old
userspace. You can always just use PV and all will work.
Michael S. Tsirkin June 11, 2018, 6:10 p.m. UTC | #46
On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:
>   * Set permanent and current address of net_failover device
>     to match the primary.
> 
>   * Carrier should be marked off before registering device
>     the net_failover device.

Sridhar, do we want to address this?
If yes, could you please take a look at addressing these
meanwhile, while we keep arguing about making API changes?
Siwei Liu June 11, 2018, 6:56 p.m. UTC | #47
On Fri, Jun 8, 2018 at 6:29 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
> On Fri, 8 Jun 2018 16:44:12 -0700, Siwei Liu wrote:
>> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> >> that flag, as well as the 1-netdev model, is to have a means to
>> >> inherit the interface name from the VF, and to eliminate playing hacks
>> >> around renaming devices, customizing udev rules and et al. Why
>> >> inheriting VF's name important? To allow existing config/setup around
>> >> VF continues to work across kernel feature upgrade. Most of network
>> >> config files in all distros are based on interface names. Few are MAC
>> >> address based but making lower slaves hidden would cover the rest. And
>> >> most importantly, preserving the same level of user experience as
>> >> using raw VF interface once getting all ndo_ops and ethtool_ops
>> >> exposed. This is essential to realize transparent live migration that
>> >> users dont have to learn and be aware of the undertaken.
>> >
>> > Inheriting the VF name will fail in the migration scenario.
>> > It is perfectly reasonable to migrate a guest to another machine where
>> > the VF PCI address is different. And since current udev/systemd model
>> > is to base network device name off of PCI address, the device will change
>> > name when guest is migrated.
>> >
>> The scenario of having VF on a different PCI address on post migration
>> is essentially equal to plugging in a new NIC. Why it has to pair with
>> the original PV? A sepearte PV device should be in place to pair the
>> new VF.
>
> IMHO it may be a better idea to look at the VF as acceleration for the
> PV rather than PV a migration vehicle from the VF.  Hence we should

I'm basically talking about two use cases not about solutions or
implementations specifically. As said, the one I'm looking into needs
to migrate a pre-failover VF setup to 1-netdev failover model in a
transparent manner. There's no point to switch PCI address back and
forth in the backend to set where to bind the PV or the VF, as you
have no ways to predict what guest kernel will be running until its
fully loaded. Supporting a VF on new location binding to existing PV
might be nice, but not directly relevant to those who don't need this
side feature than migration itself.

Having said that, while I somewhat agree both use cases should have
its own place in the picture, I don't think judging one better than
the other or vice versa is logical IMHO.

> continue to follow the naming of PV, like the current implementation
> does implicitly by linking to PV's struct device.

The current implementation may only work with new userspace, even so
the eth0/eth0nsby naming is not consistenly persisted due to races in
bus probing. The naming part should be fixed.

-Siwei
Siwei Liu June 11, 2018, 7:23 p.m. UTC | #48
On Mon, Jun 11, 2018 at 8:22 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Fri, 8 Jun 2018 17:42:21 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger
>> <stephen@networkplumber.org> wrote:
>> > On Fri, 8 Jun 2018 16:44:12 -0700
>> > Siwei Liu <loseweigh@gmail.com> wrote:
>> >
>> >> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
>> >> <stephen@networkplumber.org> wrote:
>> >> > On Fri, 8 Jun 2018 15:25:59 -0700
>> >> > Siwei Liu <loseweigh@gmail.com> wrote:
>> >> >
>> >> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>> >> >> <stephen@networkplumber.org> wrote:
>> >> >> > On Wed, 6 Jun 2018 15:30:27 +0300
>> >> >> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >> >> >
>> >> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>> >> >> >> > >The net failover should be a simple library, not a virtual
>> >> >> >> > >object with function callbacks (see callback hell).
>> >> >> >> >
>> >> >> >> > Why just a library? It should do a common things. I think it should be a
>> >> >> >> > virtual object. Looks like your patch again splits the common
>> >> >> >> > functionality into multiple drivers. That is kind of backwards attitude.
>> >> >> >> > I don't get it. We should rather focus on fixing the mess the
>> >> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >> >> >> > model.
>> >> >> >>
>> >> >> >> So it seems that at least one benefit for netvsc would be better
>> >> >> >> handling of renames.
>> >> >> >>
>> >> >> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> >> >> concerned about risk of breaking some userspace.
>> >> >> >>
>> >> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> >> >> address, and you said then "why not use existing network namespaces
>> >> >> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> >> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> >> >> compatibility?
>> >> >> >>
>> >> >> >
>> >> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> >> >> > startups that all demand eth0 always be present. And VF may come and go.
>> >> >> > After this history, there is a strong motivation not to change how kernel
>> >> >> > behaves. Switching to 3 device model would be perceived as breaking
>> >> >> > existing userspace.
>> >> >> >
>> >> >> > With virtio you can  work it out with the distro's yourself.
>> >> >> > There is no pre-existing semantics to deal with.
>> >> >> >
>> >> >> > For the virtio, I don't see the need for IFF_HIDDEN.
>> >> >>
>> >> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> >> >> that flag, as well as the 1-netdev model, is to have a means to
>> >> >> inherit the interface name from the VF, and to eliminate playing hacks
>> >> >> around renaming devices, customizing udev rules and et al. Why
>> >> >> inheriting VF's name important? To allow existing config/setup around
>> >> >> VF continues to work across kernel feature upgrade. Most of network
>> >> >> config files in all distros are based on interface names. Few are MAC
>> >> >> address based but making lower slaves hidden would cover the rest. And
>> >> >> most importantly, preserving the same level of user experience as
>> >> >> using raw VF interface once getting all ndo_ops and ethtool_ops
>> >> >> exposed. This is essential to realize transparent live migration that
>> >> >> users dont have to learn and be aware of the undertaken.
>> >> >
>> >> > Inheriting the VF name will fail in the migration scenario.
>> >> > It is perfectly reasonable to migrate a guest to another machine where
>> >> > the VF PCI address is different. And since current udev/systemd model
>> >> > is to base network device name off of PCI address, the device will change
>> >> > name when guest is migrated.
>> >> >
>> >> The scenario of having VF on a different PCI address on post migration
>> >> is essentially equal to plugging in a new NIC. Why it has to pair with
>> >> the original PV? A sepearte PV device should be in place to pair the
>> >> new VF.
>> >
>> > The host only guarantees that the PV device will be on the same network.
>> > It does not make any PCI guarantees. The way Windows works is to find
>> > the device based on "serial number" which is an Hyper-V specific attribute
>> > of PCI devices.
>> >
>> > I considered naming off of serial number but that won't work for the
>> > case where PV device is present first and VF arrives later. The serial
>> > number is attribute of VF, not the PV which is there first.
>>
>> I assume the PV can get that information ahead of time before VF
>> arrives? Without it how do you match the device when you see a VF
>> coming with some serial number? Is it possible for PV to get the
>> matching SN even earlier during probe time? Or it has to depend on the
>> presence of vPCI bridge to generate this SN?
>
>
>
> NO. the PV device does not know ahead of time and there are scenario
> where the serial and PCI info can change when it does arrive. These
> are test cases (not something people usually do). Example on WS2016:
>   Guest configured with two or more vswitches and NICs.
>   SR-IOV is not enabled
>
> Later:
>   On Hyper-V console (or Powershell command line) on host SR-IOV
>   is enabled on the second NIC.
>
>   The guest will be notified of new PCI device; the "serial number"
>   will be 1.
>
> If same process is repeated but in this case the first NIC has
> SR-IOV enabled, it will get serial # 1.
>
>
> I agree with Jakub. What you are proposing is backwards. The VF
> must be thought of as a dependent of PV device not vice/versa.

I don't enforce netvsc moving to the same 1-netdev model, did I? I
understand Hyper-V has its specific design that's hard to get around
of.

All I said transparent live migration and the 1-netdev model should
work for the passthrough with virtio as helper under QEMU. As I recall
the initial intent was to use virtio as a migration helper rather than
having VF as acceleration path. The latter is as far as I know is from
Hyper-V's point of view. I don't know where those side features come
from and why doing live migration religiously is backwards.

-Siwei

>
>> >
>> > Your ideas about having the PCI information of the VF form the name
>> > of the failover device have the same problem. The PV device may
>> > be the only one present on boot.
>>
>> Yeah, this is a chicken-egg problem indeed, and that was the reason
>> why I supply the BDF info for PV to name the master interface.
>> However, the ACPI PCI slot needs to depend on the PCI bus enumeration
>> so that can't be predictable.  Would it make sense to only rename when
>> the first time a matching VF appears and PV interface isn't brought
>> up, then the failover master would always stick to the name
>> afterwards? I think it should cover most scenarios as it's usually
>> during boot time (dracut) the VF first appears and the PV interface at
>> the time then shouldn't have been configured yet.
>>
>> -Siwei
>>
>> >
>> >
>> >> > On Azure, the VF maybe removed (by host) at any time and then later
>> >> > reattached. There is no guarantee that VF will show back up at
>> >> > the same synthetic PCI address. It will likely have a different
>> >> > PCI domain value.
>> >>
>> >> This is something QEMU can do and make sure the PCI address is
>> >> consistent after migration.
>> >>
>> >> -Siwei
>> >
>
Samudrala, Sridhar June 11, 2018, 7:34 p.m. UTC | #49
On 6/11/2018 11:10 AM, Michael S. Tsirkin wrote:
> On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:
>>    * Set permanent and current address of net_failover device
>>      to match the primary.
>>
>>    * Carrier should be marked off before registering device
>>      the net_failover device.
> Sridhar, do we want to address this?
> If yes, could you please take a look at addressing these
> meanwhile, while we keep arguing about making API changes?

Sure. I will submit patches to address these issues raised by Stephen.
Samudrala, Sridhar June 12, 2018, 12:08 a.m. UTC | #50
On 6/11/2018 12:34 PM, Samudrala, Sridhar wrote:
>
> On 6/11/2018 11:10 AM, Michael S. Tsirkin wrote:
>> On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:
>>>    * Set permanent and current address of net_failover device
>>>      to match the primary.

We copy the dev_addr of standby dev to failover_dev in net_failover_create()
before calling register_netdev().
register_netdev() does a copy of dev_addr to perm_addr.

So i don't think this is an issue.

>>>
>>>    * Carrier should be marked off before registering device
>>>      the net_failover device.

Will fix this and also a couple of places dev_err() needs to be replaced with netdev_err()

>> Sridhar, do we want to address this?
>> If yes, could you please take a look at addressing these
>> meanwhile, while we keep arguing about making API changes?
>
> Sure. I will submit patches to address these issues raised by Stephen.
>
>
Michael S. Tsirkin June 12, 2018, 2:14 a.m. UTC | #51
On Mon, Jun 11, 2018 at 11:56:56AM -0700, Siwei Liu wrote:
> The current implementation may only work with new userspace, even so
> the eth0/eth0nsby naming is not consistenly persisted due to races in
> bus probing.

Which race do you mean exactly?
diff mbox series

Patch

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 99d8e7398a5b..c7d25d10765e 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -902,6 +902,8 @@  struct net_device_context {
 	struct hv_device *device_ctx;
 	/* netvsc_device */
 	struct netvsc_device __rcu *nvdev;
+	/* list of netvsc net_devices */
+	struct list_head list;
 	/* reconfigure work */
 	struct delayed_work dwork;
 	/* last reconfig time */
@@ -933,7 +935,6 @@  struct net_device_context {
 	/* Serial number of the VF to team with */
 	u32 vf_serial;
 
-	struct failover *failover;
 };
 
 /* Per channel data */
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index bef4d55a108c..074e6b8578df 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -70,6 +70,8 @@  static int debug = -1;
 module_param(debug, int, 0444);
 MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
 
+static LIST_HEAD(netvsc_dev_list);
+
 static void netvsc_change_rx_flags(struct net_device *net, int change)
 {
 	struct net_device_context *ndev_ctx = netdev_priv(net);
@@ -1846,101 +1848,120 @@  static void netvsc_vf_setup(struct work_struct *w)
 	}
 
 	vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
-	if (vf_netdev)
+	if (vf_netdev) {
 		__netvsc_vf_setup(ndev, vf_netdev);
-
+		dev_put(vf_netdev);
+	}
 	rtnl_unlock();
 }
 
-static int netvsc_pre_register_vf(struct net_device *vf_netdev,
-				  struct net_device *ndev)
+static struct net_device *get_netvsc_bymac(const u8 *mac)
 {
-	struct net_device_context *net_device_ctx;
-	struct netvsc_device *netvsc_dev;
+	struct net_device_context *ndev_ctx;
 
-	net_device_ctx = netdev_priv(ndev);
-	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
-	if (!netvsc_dev || rtnl_dereference(net_device_ctx->vf_netdev))
-		return -ENODEV;
+	ASSERT_RTNL();
 
-	return 0;
+	list_for_each_entry(ndev_ctx, &netvsc_dev_list, list) {
+		struct net_device *dev = hv_get_drvdata(ndev_ctx->device_ctx);
+
+		if (ether_addr_equal(mac, dev->perm_addr))
+			return dev;
+	}
+
+	return NULL;
 }
 
-static int netvsc_register_vf(struct net_device *vf_netdev,
-			      struct net_device *ndev)
+static int netvsc_register_vf(struct net_device *vf_netdev)
 {
-	struct net_device_context *ndev_ctx = netdev_priv(ndev);
+	struct net_device *ndev;
+	struct net_device_context *ndev_ctx;
+
+	/* Must use Ethernet addresses */
+	if (vf_netdev->addr_len != ETH_ALEN)
+		return NOTIFY_DONE;
+
+	/* VF must be a physical device not VLAN, etc */
+	if (!vf_netdev->dev.parent)
+		return NOTIFY_DONE;
+
+	/* Use the MAC address to locate the synthetic interface to
+	 * associate with the VF interface.
+	 */
+	ndev = get_netvsc_bymac(vf_netdev->perm_addr);
+	if (!ndev)
+		return NOTIFY_DONE;
+
+	/* If network device is being removed, don't do anything */
+	ndev_ctx = netdev_priv(ndev);
+	if (!rtnl_dereference(ndev_ctx->nvdev))
+		return NOTIFY_DONE;
+
+	if (netdev_failover_join(vf_netdev, ndev, netvsc_vf_handle_frame)) {
+		netdev_err(vf_netdev, "could not join: %s", ndev->name);
+		return NOTIFY_DONE;
+	}
 
 	/* set slave flag before open to prevent IPv6 addrconf */
 	vf_netdev->flags |= IFF_SLAVE;
 
+	dev_hold(vf_netdev);
+
 	schedule_delayed_work(&ndev_ctx->vf_takeover, VF_TAKEOVER_INT);
 
 	call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
 
 	netdev_info(vf_netdev, "joined to %s\n", ndev->name);
 
-	dev_hold(vf_netdev);
 	rcu_assign_pointer(ndev_ctx->vf_netdev, vf_netdev);
 
-	return 0;
+	return NOTIFY_OK;
 }
 
 /* VF up/down change detected, schedule to change data path */
-static int netvsc_vf_changed(struct net_device *vf_netdev,
-			     struct net_device *ndev)
+static int netvsc_vf_changed(struct net_device *vf_netdev)
 {
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
+	struct net_device *ndev;
 	bool vf_is_up = netif_running(vf_netdev);
 
+	ndev = netdev_failover_upper_get(vf_netdev);
+	if (!ndev)
+		return NOTIFY_DONE;
+
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev)
-		return -ENODEV;
+		return NOTIFY_DONE;
 
 	netvsc_switch_datapath(ndev, vf_is_up);
 	netdev_info(ndev, "Data path switched %s VF: %s\n",
 		    vf_is_up ? "to" : "from", vf_netdev->name);
 
-	return 0;
+	return NOTIFY_OK;
 }
 
-static int netvsc_pre_unregister_vf(struct net_device *vf_netdev,
-				    struct net_device *ndev)
+static int netvsc_unregister_vf(struct net_device *vf_netdev)
 {
 	struct net_device_context *net_device_ctx;
+	struct net_device *ndev;
 
-	net_device_ctx = netdev_priv(ndev);
-	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
-
-	return 0;
-}
-
-static int netvsc_unregister_vf(struct net_device *vf_netdev,
-				struct net_device *ndev)
-{
-	struct net_device_context *net_device_ctx;
+	ndev = netdev_failover_upper_get(vf_netdev);
+	if (!ndev)
+		return NOTIFY_DONE;
 
 	net_device_ctx = netdev_priv(ndev);
+	if (cancel_delayed_work_sync(&net_device_ctx->vf_takeover))
+		dev_put(vf_netdev);
 
 	netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
 
+	netdev_failover_unjoin(vf_netdev, ndev);
 	RCU_INIT_POINTER(net_device_ctx->vf_netdev, NULL);
-	dev_put(vf_netdev);
 
-	return 0;
+	return NOTIFY_OK;
 }
 
-static struct failover_ops netvsc_failover_ops = {
-	.slave_pre_register	= netvsc_pre_register_vf,
-	.slave_register		= netvsc_register_vf,
-	.slave_pre_unregister	= netvsc_pre_unregister_vf,
-	.slave_unregister	= netvsc_unregister_vf,
-	.slave_link_change	= netvsc_vf_changed,
-	.slave_handle_frame	= netvsc_vf_handle_frame,
-};
-
 static int netvsc_probe(struct hv_device *dev,
 			const struct hv_vmbus_device_id *dev_id)
 {
@@ -2009,6 +2030,8 @@  static int netvsc_probe(struct hv_device *dev,
 
 	memcpy(net->dev_addr, device_info.mac_adr, ETH_ALEN);
 
+	net->priv_flags |= IFF_FAILOVER;
+
 	/* hw_features computed in rndis_netdev_set_hwcaps() */
 	net->features = net->hw_features |
 		NETIF_F_HIGHDMA | NETIF_F_SG |
@@ -2024,23 +2047,19 @@  static int netvsc_probe(struct hv_device *dev,
 	else
 		net->max_mtu = ETH_DATA_LEN;
 
-	ret = register_netdev(net);
+	rtnl_lock();
+	ret = register_netdevice(net);
 	if (ret != 0) {
 		pr_err("Unable to register netdev.\n");
 		goto register_failed;
 	}
 
-	net_device_ctx->failover = failover_register(net, &netvsc_failover_ops);
-	if (IS_ERR(net_device_ctx->failover)) {
-		ret = PTR_ERR(net_device_ctx->failover);
-		goto err_failover;
-	}
-
-	return ret;
+	list_add(&net_device_ctx->list, &netvsc_dev_list);
+	rtnl_unlock();
+	return 0;
 
-err_failover:
-	unregister_netdev(net);
 register_failed:
+	rtnl_unlock();
 	rndis_filter_device_remove(dev, nvdev);
 rndis_failed:
 	free_percpu(net_device_ctx->vf_stats);
@@ -2079,15 +2098,17 @@  static int netvsc_remove(struct hv_device *dev)
 	 */
 	rtnl_lock();
 	vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
-	if (vf_netdev)
-		failover_slave_unregister(vf_netdev);
+	if (vf_netdev) {
+		netdev_failover_unjoin(vf_netdev, net);
+		dev_put(vf_netdev);
+	}
 
 	if (nvdev)
 		rndis_filter_device_remove(dev, nvdev);
 
 	unregister_netdevice(net);
 
-	failover_unregister(ndev_ctx->failover);
+	list_del(&ndev_ctx->list);
 
 	rtnl_unlock();
 	rcu_read_unlock();
@@ -2115,8 +2136,47 @@  static struct  hv_driver netvsc_drv = {
 	.remove = netvsc_remove,
 };
 
+/* On Hyper-V, every VF interface is matched with a corresponding
+ * synthetic interface. The synthetic interface is presented first
+ * to the guest. When the corresponding VF instance is registered,
+ * we will take care of switching the data path.
+ */
+static int netvsc_netdev_event(struct notifier_block *this,
+			       unsigned long event, void *ptr)
+{
+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+
+	/* Skip parent events */
+	if (netif_is_failover(event_dev))
+		return NOTIFY_DONE;
+
+	/* Avoid non-Ethernet type devices */
+	if (event_dev->type != ARPHRD_ETHER)
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		return netvsc_register_vf(event_dev);
+
+	case NETDEV_UNREGISTER:
+		return netvsc_unregister_vf(event_dev);
+
+	case NETDEV_UP:
+	case NETDEV_DOWN:
+		return netvsc_vf_changed(event_dev);
+
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block netvsc_netdev_notifier = {
+	.notifier_call = netvsc_netdev_event,
+};
+
 static void __exit netvsc_drv_exit(void)
 {
+	unregister_netdevice_notifier(&netvsc_netdev_notifier);
 	vmbus_driver_unregister(&netvsc_drv);
 }
 
@@ -2136,6 +2196,7 @@  static int __init netvsc_drv_init(void)
 	if (ret)
 		return ret;
 
+	register_netdevice_notifier(&netvsc_netdev_notifier);
 	return 0;
 }
 
diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
index 83f7420ddea5..e0d30527f748 100644
--- a/drivers/net/net_failover.c
+++ b/drivers/net/net_failover.c
@@ -28,6 +28,46 @@ 
 #include <uapi/linux/if_arp.h>
 #include <net/net_failover.h>
 
+static LIST_HEAD(net_failover_list);
+
+/* failover state */
+struct net_failover_info {
+	struct net_device *failover_dev;
+
+	/* list of failover virtual devices */
+	struct list_head list;
+
+	/* primary netdev with same MAC */
+	struct net_device __rcu *primary_dev;
+
+	/* standby netdev */
+	struct net_device __rcu *standby_dev;
+
+	/* primary netdev stats */
+	struct rtnl_link_stats64 primary_stats;
+
+	/* standby netdev stats */
+	struct rtnl_link_stats64 standby_stats;
+
+	/* aggregated stats */
+	struct rtnl_link_stats64 failover_stats;
+
+	/* spinlock while updating stats */
+	spinlock_t stats_lock;
+
+	/* delayed setup of slave */
+	struct delayed_work standby_init;
+};
+
+#define FAILOVER_VLAN_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
+				 NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
+				 NETIF_F_HIGHDMA | NETIF_F_LRO)
+
+#define FAILOVER_ENC_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
+				 NETIF_F_RXCSUM | NETIF_F_ALL_TSO)
+
+#define FAILOVER_SETUP_INTERVAL	(HZ / 10)
+
 static bool net_failover_xmit_ready(struct net_device *dev)
 {
 	return netif_running(dev) && netif_carrier_ok(dev);
@@ -460,22 +500,42 @@  static void net_failover_lower_state_changed(struct net_device *slave_dev,
 	netdev_lower_state_changed(slave_dev, &info);
 }
 
-static int net_failover_slave_pre_register(struct net_device *slave_dev,
-					   struct net_device *failover_dev)
+static struct net_device *get_net_failover_bymac(const u8 *mac)
 {
-	struct net_device *standby_dev, *primary_dev;
+	struct net_failover_info *nfo_info;
+
+	ASSERT_RTNL();
+
+	list_for_each_entry(nfo_info, &net_failover_list, list) {
+		struct net_device *failover_dev = nfo_info->failover_dev;
+
+		if (ether_addr_equal(mac, failover_dev->perm_addr))
+			return failover_dev;
+	}
+
+	return NULL;
+}
+
+static int net_failover_register_event(struct net_device *slave_dev)
+{
+	struct net_device *failover_dev, *standby_dev, *primary_dev;
 	struct net_failover_info *nfo_info;
 	bool slave_is_standby;
 
+	failover_dev = get_net_failover_bymac(slave_dev->perm_addr);
+	if (!failover_dev)
+		return NOTIFY_DONE;
+
 	nfo_info = netdev_priv(failover_dev);
 	standby_dev = rtnl_dereference(nfo_info->standby_dev);
 	primary_dev = rtnl_dereference(nfo_info->primary_dev);
 	slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
 	if (slave_is_standby ? standby_dev : primary_dev) {
-		netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
+		netdev_err(failover_dev,
+			   "%s attempting to register as slave dev when %s already present\n",
 			   slave_dev->name,
 			   slave_is_standby ? "standby" : "primary");
-		return -EINVAL;
+		return NOTIFY_DONE;
 	}
 
 	/* We want to allow only a direct attached VF device as a primary
@@ -484,23 +544,33 @@  static int net_failover_slave_pre_register(struct net_device *slave_dev,
 	 */
 	if (!slave_is_standby && (!slave_dev->dev.parent ||
 				  !dev_is_pci(slave_dev->dev.parent)))
-		return -EINVAL;
+		return NOTIFY_DONE;
 
 	if (failover_dev->features & NETIF_F_VLAN_CHALLENGED &&
 	    vlan_uses_dev(failover_dev)) {
-		netdev_err(failover_dev, "Device %s is VLAN challenged and failover device has VLAN set up\n",
+		netdev_err(failover_dev,
+			   "Device %s is VLAN challenged and failover device has VLAN set up\n",
 			   failover_dev->name);
-		return -EINVAL;
+		return NOTIFY_DONE;
 	}
 
-	return 0;
+	if (netdev_failover_join(slave_dev, failover_dev,
+				 net_failover_handle_frame)) {
+		netdev_err(failover_dev, "could not join: %s", slave_dev->name);
+		return NOTIFY_DONE;
+	}
+
+	/* Trigger rest of setup in process context */
+	schedule_delayed_work(&nfo_info->standby_init, FAILOVER_SETUP_INTERVAL);
+
+	return NOTIFY_OK;
 }
 
-static int net_failover_slave_register(struct net_device *slave_dev,
-				       struct net_device *failover_dev)
+static void __net_failover_setup(struct net_device *failover_dev)
 {
+	struct net_failover_info *nfo_info = netdev_priv(failover_dev);
+	struct net_device *slave_dev = rtnl_dereference(nfo_info->standby_dev);
 	struct net_device *standby_dev, *primary_dev;
-	struct net_failover_info *nfo_info;
 	bool slave_is_standby;
 	u32 orig_mtu;
 	int err;
@@ -509,13 +579,12 @@  static int net_failover_slave_register(struct net_device *slave_dev,
 	orig_mtu = slave_dev->mtu;
 	err = dev_set_mtu(slave_dev, failover_dev->mtu);
 	if (err) {
-		netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
+		netdev_err(failover_dev,
+			   "unable to change mtu of %s to %u register failed\n",
 			   slave_dev->name, failover_dev->mtu);
 		goto done;
 	}
 
-	dev_hold(slave_dev);
-
 	if (netif_running(failover_dev)) {
 		err = dev_open(slave_dev);
 		if (err && (err != -EBUSY)) {
@@ -537,7 +606,6 @@  static int net_failover_slave_register(struct net_device *slave_dev,
 		goto err_vlan_add;
 	}
 
-	nfo_info = netdev_priv(failover_dev);
 	standby_dev = rtnl_dereference(nfo_info->standby_dev);
 	primary_dev = rtnl_dereference(nfo_info->primary_dev);
 	slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
@@ -562,52 +630,56 @@  static int net_failover_slave_register(struct net_device *slave_dev,
 	netdev_info(failover_dev, "failover %s slave:%s registered\n",
 		    slave_is_standby ? "standby" : "primary", slave_dev->name);
 
-	return 0;
+	return;
 
 err_vlan_add:
 	dev_uc_unsync(slave_dev, failover_dev);
 	dev_mc_unsync(slave_dev, failover_dev);
 	dev_close(slave_dev);
 err_dev_open:
-	dev_put(slave_dev);
 	dev_set_mtu(slave_dev, orig_mtu);
 done:
-	return err;
+	return;
 }
 
-static int net_failover_slave_pre_unregister(struct net_device *slave_dev,
-					     struct net_device *failover_dev)
+static void net_failover_setup(struct work_struct *w)
 {
-	struct net_device *standby_dev, *primary_dev;
-	struct net_failover_info *nfo_info;
+	struct net_failover_info *nfo_info
+		= container_of(w, struct net_failover_info, standby_init.work);
+	struct net_device *failover_dev = nfo_info->failover_dev;
 
-	nfo_info = netdev_priv(failover_dev);
-	primary_dev = rtnl_dereference(nfo_info->primary_dev);
-	standby_dev = rtnl_dereference(nfo_info->standby_dev);
-
-	if (slave_dev != primary_dev && slave_dev != standby_dev)
-		return -ENODEV;
+	/* handle race with cancel delayed work on removal */
+	if (!rtnl_trylock()) {
+		schedule_delayed_work(&nfo_info->standby_init, 0);
+		return;
+	}
 
-	return 0;
+	__net_failover_setup(failover_dev);
+	rtnl_unlock();
 }
 
-static int net_failover_slave_unregister(struct net_device *slave_dev,
-					 struct net_device *failover_dev)
+static int net_failover_unregister_event(struct net_device *slave_dev)
 {
-	struct net_device *standby_dev, *primary_dev;
+	struct net_device *failover_dev, *primary_dev, *standby_dev;
 	struct net_failover_info *nfo_info;
 	bool slave_is_standby;
 
+	failover_dev = netdev_failover_upper_get(slave_dev);
+	if (!failover_dev)
+		return NOTIFY_DONE;
+
 	nfo_info = netdev_priv(failover_dev);
 	primary_dev = rtnl_dereference(nfo_info->primary_dev);
 	standby_dev = rtnl_dereference(nfo_info->standby_dev);
 
+	if (slave_dev != primary_dev && slave_dev != standby_dev)
+		return NOTIFY_DONE;
+
 	vlan_vids_del_by_dev(slave_dev, failover_dev);
 	dev_uc_unsync(slave_dev, failover_dev);
 	dev_mc_unsync(slave_dev, failover_dev);
 	dev_close(slave_dev);
 
-	nfo_info = netdev_priv(failover_dev);
 	dev_get_stats(failover_dev, &nfo_info->failover_stats);
 
 	slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
@@ -628,22 +700,25 @@  static int net_failover_slave_unregister(struct net_device *slave_dev,
 	netdev_info(failover_dev, "failover %s slave:%s unregistered\n",
 		    slave_is_standby ? "standby" : "primary", slave_dev->name);
 
-	return 0;
+	return NOTIFY_OK;
 }
 
-static int net_failover_slave_link_change(struct net_device *slave_dev,
-					  struct net_device *failover_dev)
+static int net_failover_link_event(struct net_device *slave_dev)
+
 {
-	struct net_device *primary_dev, *standby_dev;
+	struct net_device *failover_dev, *primary_dev, *standby_dev;
 	struct net_failover_info *nfo_info;
 
-	nfo_info = netdev_priv(failover_dev);
+	failover_dev = netdev_failover_upper_get(slave_dev);
+	if (!failover_dev)
+		return NOTIFY_DONE;
 
+	nfo_info = netdev_priv(failover_dev);
 	primary_dev = rtnl_dereference(nfo_info->primary_dev);
 	standby_dev = rtnl_dereference(nfo_info->standby_dev);
 
 	if (slave_dev != primary_dev && slave_dev != standby_dev)
-		return -ENODEV;
+		return NOTIFY_DONE;
 
 	if ((primary_dev && net_failover_xmit_ready(primary_dev)) ||
 	    (standby_dev && net_failover_xmit_ready(standby_dev))) {
@@ -657,43 +732,11 @@  static int net_failover_slave_link_change(struct net_device *slave_dev,
 
 	net_failover_lower_state_changed(slave_dev, primary_dev, standby_dev);
 
-	return 0;
+	return NOTIFY_DONE;
 }
 
-static int net_failover_slave_name_change(struct net_device *slave_dev,
-					  struct net_device *failover_dev)
-{
-	struct net_device *primary_dev, *standby_dev;
-	struct net_failover_info *nfo_info;
-
-	nfo_info = netdev_priv(failover_dev);
-
-	primary_dev = rtnl_dereference(nfo_info->primary_dev);
-	standby_dev = rtnl_dereference(nfo_info->standby_dev);
-
-	if (slave_dev != primary_dev && slave_dev != standby_dev)
-		return -ENODEV;
-
-	/* We need to bring up the slave after the rename by udev in case
-	 * open failed with EBUSY when it was registered.
-	 */
-	dev_open(slave_dev);
-
-	return 0;
-}
-
-static struct failover_ops net_failover_ops = {
-	.slave_pre_register	= net_failover_slave_pre_register,
-	.slave_register		= net_failover_slave_register,
-	.slave_pre_unregister	= net_failover_slave_pre_unregister,
-	.slave_unregister	= net_failover_slave_unregister,
-	.slave_link_change	= net_failover_slave_link_change,
-	.slave_name_change	= net_failover_slave_name_change,
-	.slave_handle_frame	= net_failover_handle_frame,
-};
-
 /**
- * net_failover_create - Create and register a failover instance
+ * net_failover_create - Create and register a failover device
  *
  * @dev: standby netdev
  *
@@ -703,13 +746,12 @@  static struct failover_ops net_failover_ops = {
  * the original standby netdev and a VF netdev with the same MAC gets
  * registered as primary netdev.
  *
- * Return: pointer to failover instance
+ * Return: pointer to failover network device
  */
-struct failover *net_failover_create(struct net_device *standby_dev)
+struct net_device *net_failover_create(struct net_device *standby_dev)
 {
-	struct device *dev = standby_dev->dev.parent;
+	struct net_failover_info *nfo_info;
 	struct net_device *failover_dev;
-	struct failover *failover;
 	int err;
 
 	/* Alloc at least 2 queues, for now we are going with 16 assuming
@@ -717,18 +759,22 @@  struct failover *net_failover_create(struct net_device *standby_dev)
 	 */
 	failover_dev = alloc_etherdev_mq(sizeof(struct net_failover_info), 16);
 	if (!failover_dev) {
-		dev_err(dev, "Unable to allocate failover_netdev!\n");
-		return ERR_PTR(-ENOMEM);
+		netdev_err(standby_dev, "Unable to allocate failover_netdev!\n");
+		return NULL;
 	}
 
+	nfo_info = netdev_priv(failover_dev);
 	dev_net_set(failover_dev, dev_net(standby_dev));
-	SET_NETDEV_DEV(failover_dev, dev);
+	nfo_info->failover_dev = failover_dev;
+	INIT_DELAYED_WORK(&nfo_info->standby_init, net_failover_setup);
 
 	failover_dev->netdev_ops = &failover_dev_ops;
 	failover_dev->ethtool_ops = &failover_ethtool_ops;
 
 	/* Initialize the device options */
-	failover_dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
+	failover_dev->priv_flags |= IFF_UNICAST_FLT |
+				    IFF_NO_QUEUE |
+				    IFF_FAILOVER;
 	failover_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
 				       IFF_TX_SKB_SHARING);
 
@@ -746,29 +792,38 @@  struct failover *net_failover_create(struct net_device *standby_dev)
 	failover_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
 	failover_dev->features |= failover_dev->hw_features;
 
-	memcpy(failover_dev->dev_addr, standby_dev->dev_addr,
-	       failover_dev->addr_len);
+	ether_addr_copy(failover_dev->dev_addr, standby_dev->dev_addr);
+	ether_addr_copy(failover_dev->perm_addr, standby_dev->perm_addr);
 
 	failover_dev->min_mtu = standby_dev->min_mtu;
 	failover_dev->max_mtu = standby_dev->max_mtu;
 
-	err = register_netdev(failover_dev);
+	netif_carrier_off(failover_dev);
+
+	rtnl_lock();
+	err = register_netdevice(failover_dev);
 	if (err) {
-		dev_err(dev, "Unable to register failover_dev!\n");
+		netdev_err(standby_dev, "Unable to register failover_dev!\n");
 		goto err_register_netdev;
 	}
 
-	netif_carrier_off(failover_dev);
+	err = netdev_failover_join(standby_dev, failover_dev,
+				   net_failover_handle_frame);
+	if (err) {
+		netdev_err(failover_dev, "Unable to join with %s\n",
+			   standby_dev->name);
+		goto err_failover_join;
+	}
 
-	failover = failover_register(failover_dev, &net_failover_ops);
-	if (IS_ERR(failover))
-		goto err_failover_register;
+	list_add(&nfo_info->list, &net_failover_list);
+	rtnl_unlock();
 
-	return failover;
+	return failover_dev;
 
-err_failover_register:
-	unregister_netdev(failover_dev);
+err_failover_join:
+	unregister_netdevice(failover_dev);
 err_register_netdev:
+	rtnl_unlock();
 	free_netdev(failover_dev);
 
 	return ERR_PTR(err);
@@ -786,31 +841,27 @@  EXPORT_SYMBOL_GPL(net_failover_create);
  * netdev. Used by paravirtual drivers that use 3-netdev model.
  *
  */
-void net_failover_destroy(struct failover *failover)
+void net_failover_destroy(struct net_device *failover_dev)
 {
-	struct net_failover_info *nfo_info;
-	struct net_device *failover_dev;
+	struct net_failover_info *nfo_info = netdev_priv(failover_dev);
 	struct net_device *slave_dev;
 
-	if (!failover)
-		return;
-
-	failover_dev = rcu_dereference(failover->failover_dev);
-	nfo_info = netdev_priv(failover_dev);
-
 	netif_device_detach(failover_dev);
 
 	rtnl_lock();
-
 	slave_dev = rtnl_dereference(nfo_info->primary_dev);
-	if (slave_dev)
-		failover_slave_unregister(slave_dev);
+	if (slave_dev) {
+		netdev_failover_unjoin(slave_dev, failover_dev);
+		dev_put(slave_dev);
+	}
 
 	slave_dev = rtnl_dereference(nfo_info->standby_dev);
-	if (slave_dev)
-		failover_slave_unregister(slave_dev);
+	if (slave_dev) {
+		netdev_failover_unjoin(slave_dev, failover_dev);
+		dev_put(slave_dev);
+	}
 
-	failover_unregister(failover);
+	list_del(&nfo_info->list);
 
 	unregister_netdevice(failover_dev);
 
@@ -820,9 +871,53 @@  void net_failover_destroy(struct failover *failover)
 }
 EXPORT_SYMBOL_GPL(net_failover_destroy);
 
+static int net_failover_event(struct notifier_block *this,
+			      unsigned long event, void *ptr)
+{
+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+
+	/* Skip parent events */
+	if (netif_is_failover(event_dev))
+		return NOTIFY_DONE;
+
+	/* Avoid non-Ethernet type devices */
+	if (event_dev->type != ARPHRD_ETHER)
+		return NOTIFY_DONE;
+
+	/* Avoid Vlan dev with same MAC registering as VF */
+	if (is_vlan_dev(event_dev))
+		return NOTIFY_DONE;
+
+	/* Avoid Bonding master dev with same MAC registering as VF */
+	if ((event_dev->priv_flags & IFF_BONDING) &&
+	    (event_dev->flags & IFF_MASTER))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		return net_failover_register_event(event_dev);
+
+	case NETDEV_UNREGISTER:
+		return net_failover_unregister_event(event_dev);
+
+	case NETDEV_UP:
+	case NETDEV_DOWN:
+	case NETDEV_CHANGE:
+		return net_failover_link_event(event_dev);
+
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block net_failover_notifier = {
+	.notifier_call = net_failover_event,
+};
+
 static __init int
 net_failover_init(void)
 {
+	register_netdevice_notifier(&net_failover_notifier);
 	return 0;
 }
 module_init(net_failover_init);
@@ -830,6 +925,7 @@  module_init(net_failover_init);
 static __exit
 void net_failover_exit(void)
 {
+	unregister_netdevice_notifier(&net_failover_notifier);
 }
 module_exit(net_failover_exit);
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6d710b8b41c5..b40ae28dac93 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -215,7 +215,7 @@  struct virtnet_info {
 	unsigned long guest_offloads;
 
 	/* failover when STANDBY feature enabled */
-	struct failover *failover;
+	struct net_device *failover;
 };
 
 struct padded_vnet_hdr {
@@ -2930,11 +2930,10 @@  static int virtnet_probe(struct virtio_device *vdev)
 	virtnet_init_settings(dev);
 
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
-		vi->failover = net_failover_create(vi->dev);
-		if (IS_ERR(vi->failover)) {
-			err = PTR_ERR(vi->failover);
+		err = -ENOMEM;
+		vi->failover = net_failover_create(dev);
+		if (!vi->failover)
 			goto free_vqs;
-		}
 	}
 
 	err = register_netdev(dev);
diff --git a/include/net/failover.h b/include/net/failover.h
index bb15438f39c7..22d6c1369101 100644
--- a/include/net/failover.h
+++ b/include/net/failover.h
@@ -6,31 +6,10 @@ 
 
 #include <linux/netdevice.h>
 
-struct failover_ops {
-	int (*slave_pre_register)(struct net_device *slave_dev,
-				  struct net_device *failover_dev);
-	int (*slave_register)(struct net_device *slave_dev,
-			      struct net_device *failover_dev);
-	int (*slave_pre_unregister)(struct net_device *slave_dev,
-				    struct net_device *failover_dev);
-	int (*slave_unregister)(struct net_device *slave_dev,
-				struct net_device *failover_dev);
-	int (*slave_link_change)(struct net_device *slave_dev,
-				 struct net_device *failover_dev);
-	int (*slave_name_change)(struct net_device *slave_dev,
-				 struct net_device *failover_dev);
-	rx_handler_result_t (*slave_handle_frame)(struct sk_buff **pskb);
-};
-
-struct failover {
-	struct list_head list;
-	struct net_device __rcu *failover_dev;
-	struct failover_ops __rcu *ops;
-};
-
-struct failover *failover_register(struct net_device *dev,
-				   struct failover_ops *ops);
-void failover_unregister(struct failover *failover);
-int failover_slave_unregister(struct net_device *slave_dev);
+int netdev_failover_join(struct net_device *lower, struct net_device *upper,
+			 rx_handler_func_t *rx_handler);
+struct net_device *netdev_failover_upper_get(struct net_device *lower);
+void netdev_failover_unjoin(struct net_device *lower,
+			    struct net_device *upper);
 
 #endif /* _FAILOVER_H */
diff --git a/include/net/net_failover.h b/include/net/net_failover.h
index b12a1c469d1c..a99b3b00b4e3 100644
--- a/include/net/net_failover.h
+++ b/include/net/net_failover.h
@@ -6,35 +6,7 @@ 
 
 #include <net/failover.h>
 
-/* failover state */
-struct net_failover_info {
-	/* primary netdev with same MAC */
-	struct net_device __rcu *primary_dev;
-
-	/* standby netdev */
-	struct net_device __rcu *standby_dev;
-
-	/* primary netdev stats */
-	struct rtnl_link_stats64 primary_stats;
-
-	/* standby netdev stats */
-	struct rtnl_link_stats64 standby_stats;
-
-	/* aggregated stats */
-	struct rtnl_link_stats64 failover_stats;
-
-	/* spinlock while updating stats */
-	spinlock_t stats_lock;
-};
-
-struct failover *net_failover_create(struct net_device *standby_dev);
-void net_failover_destroy(struct failover *failover);
-
-#define FAILOVER_VLAN_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
-				 NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
-				 NETIF_F_HIGHDMA | NETIF_F_LRO)
-
-#define FAILOVER_ENC_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
-				 NETIF_F_RXCSUM | NETIF_F_ALL_TSO)
+struct net_device *net_failover_create(struct net_device *standby_dev);
+void net_failover_destroy(struct net_device *failover_dev);
 
 #endif /* _NET_FAILOVER_H */
diff --git a/net/Kconfig b/net/Kconfig
index f738a6f27665..697d84202695 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -433,17 +433,8 @@  config PAGE_POOL
        bool
 
 config FAILOVER
-	tristate "Generic failover module"
-	help
-	  The failover module provides a generic interface for paravirtual
-	  drivers to register a netdev and a set of ops with a failover
-	  instance. The ops are used as event handlers that get called to
-	  handle netdev register/unregister/link change/name change events
-	  on slave pci ethernet devices with the same mac address as the
-	  failover netdev. This enables paravirtual drivers to use a
-	  VF as an accelerated low latency datapath. It also allows live
-	  migration of VMs with direct attached VFs by failing over to the
-	  paravirtual datapath when the VF is unplugged.
+	bool
+	default n
 
 endif   # if NET
 
diff --git a/net/core/failover.c b/net/core/failover.c
index 4a92a98ccce9..499f0fd7e4d3 100644
--- a/net/core/failover.c
+++ b/net/core/failover.c
@@ -1,10 +1,8 @@ 
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2018, Intel Corporation. */
 
-/* A common module to handle registrations and notifications for paravirtual
+/* A library for managing chained upper/oower devices such as
  * drivers to enable accelerated datapath and support VF live migration.
- *
- * The notifier and event handling code is based on netvsc driver.
  */
 
 #include <linux/module.h>
@@ -14,302 +12,62 @@ 
 #include <linux/if_vlan.h>
 #include <net/failover.h>
 
-static LIST_HEAD(failover_list);
-static DEFINE_SPINLOCK(failover_lock);
-
-static struct net_device *failover_get_bymac(u8 *mac, struct failover_ops **ops)
-{
-	struct net_device *failover_dev;
-	struct failover *failover;
-
-	spin_lock(&failover_lock);
-	list_for_each_entry(failover, &failover_list, list) {
-		failover_dev = rtnl_dereference(failover->failover_dev);
-		if (ether_addr_equal(failover_dev->perm_addr, mac)) {
-			*ops = rtnl_dereference(failover->ops);
-			spin_unlock(&failover_lock);
-			return failover_dev;
-		}
-	}
-	spin_unlock(&failover_lock);
-	return NULL;
-}
-
-/**
- * failover_slave_register - Register a slave netdev
- *
- * @slave_dev: slave netdev that is being registered
- *
- * Registers a slave device to a failover instance. Only ethernet devices
- * are supported.
- */
-static int failover_slave_register(struct net_device *slave_dev)
+/* failover_join - Join an lower netdev with an upper device. */
+int netdev_failover_join(struct net_device *lower_dev,
+			 struct net_device *upper_dev,
+			 rx_handler_func_t *rx_handler)
 {
-	struct netdev_lag_upper_info lag_upper_info;
-	struct net_device *failover_dev;
-	struct failover_ops *fops;
 	int err;
 
-	if (slave_dev->type != ARPHRD_ETHER)
-		goto done;
-
 	ASSERT_RTNL();
 
-	failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
-	if (!failover_dev)
-		goto done;
+	/* Don't allow joining devices of different protocols */
+	if (upper_dev->type != lower_dev->type)
+		return -EINVAL;
 
-	if (fops && fops->slave_pre_register &&
-	    fops->slave_pre_register(slave_dev, failover_dev))
-		goto done;
-
-	err = netdev_rx_handler_register(slave_dev, fops->slave_handle_frame,
-					 failover_dev);
+	err = netdev_rx_handler_register(lower_dev, rx_handler, upper_dev);
 	if (err) {
-		netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
+		netdev_err(lower_dev,
+			   "can not register failover rx handler (err = %d)\n",
 			   err);
-		goto done;
+		return err;
 	}
 
-	lag_upper_info.tx_type = NETDEV_LAG_TX_TYPE_ACTIVEBACKUP;
-	err = netdev_master_upper_dev_link(slave_dev, failover_dev, NULL,
-					   &lag_upper_info, NULL);
+	err = netdev_master_upper_dev_link(lower_dev, upper_dev, NULL,
+					   NULL, NULL);
 	if (err) {
-		netdev_err(slave_dev, "can not set failover device %s (err = %d)\n",
-			   failover_dev->name, err);
-		goto err_upper_link;
+		netdev_err(lower_dev,
+			   "can not set failover device %s (err = %d)\n",
+			   upper_dev->name, err);
+		netdev_rx_handler_unregister(lower_dev);
+		return err;
 	}
 
-	slave_dev->priv_flags |= IFF_FAILOVER_SLAVE;
-
-	if (fops && fops->slave_register &&
-	    !fops->slave_register(slave_dev, failover_dev))
-		return NOTIFY_OK;
-
-	netdev_upper_dev_unlink(slave_dev, failover_dev);
-	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
-err_upper_link:
-	netdev_rx_handler_unregister(slave_dev);
-done:
-	return NOTIFY_DONE;
-}
-
-/**
- * failover_slave_unregister - Unregister a slave netdev
- *
- * @slave_dev: slave netdev that is being unregistered
- *
- * Unregisters a slave device from a failover instance.
- */
-int failover_slave_unregister(struct net_device *slave_dev)
-{
-	struct net_device *failover_dev;
-	struct failover_ops *fops;
-
-	if (!netif_is_failover_slave(slave_dev))
-		goto done;
-
-	ASSERT_RTNL();
-
-	failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
-	if (!failover_dev)
-		goto done;
-
-	if (fops && fops->slave_pre_unregister &&
-	    fops->slave_pre_unregister(slave_dev, failover_dev))
-		goto done;
-
-	netdev_rx_handler_unregister(slave_dev);
-	netdev_upper_dev_unlink(slave_dev, failover_dev);
-	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
-
-	if (fops && fops->slave_unregister &&
-	    !fops->slave_unregister(slave_dev, failover_dev))
-		return NOTIFY_OK;
-
-done:
-	return NOTIFY_DONE;
+	dev_hold(lower_dev);
+	lower_dev->priv_flags |= IFF_FAILOVER_SLAVE;
+	return 0;
 }
-EXPORT_SYMBOL_GPL(failover_slave_unregister);
+EXPORT_SYMBOL_GPL(netdev_failover_join);
 
-static int failover_slave_link_change(struct net_device *slave_dev)
+/* Find upper network device for failover slave device */
+struct net_device *netdev_failover_upper_get(struct net_device *lower_dev)
 {
-	struct net_device *failover_dev;
-	struct failover_ops *fops;
-
-	if (!netif_is_failover_slave(slave_dev))
-		goto done;
-
-	ASSERT_RTNL();
-
-	failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
-	if (!failover_dev)
-		goto done;
-
-	if (!netif_running(failover_dev))
-		goto done;
+	if (!netif_is_failover_slave(lower_dev))
+		return NULL;
 
-	if (fops && fops->slave_link_change &&
-	    !fops->slave_link_change(slave_dev, failover_dev))
-		return NOTIFY_OK;
-
-done:
-	return NOTIFY_DONE;
+	return netdev_master_upper_dev_get(lower_dev);
 }
+EXPORT_SYMBOL_GPL(netdev_failover_upper_get);
 
-static int failover_slave_name_change(struct net_device *slave_dev)
+/* failover_unjoin - Break connection between lower and upper device. */
+void netdev_failover_unjoin(struct net_device *lower_dev,
+			    struct net_device *upper_dev)
 {
-	struct net_device *failover_dev;
-	struct failover_ops *fops;
-
-	if (!netif_is_failover_slave(slave_dev))
-		goto done;
-
 	ASSERT_RTNL();
 
-	failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
-	if (!failover_dev)
-		goto done;
-
-	if (!netif_running(failover_dev))
-		goto done;
-
-	if (fops && fops->slave_name_change &&
-	    !fops->slave_name_change(slave_dev, failover_dev))
-		return NOTIFY_OK;
-
-done:
-	return NOTIFY_DONE;
-}
-
-static int
-failover_event(struct notifier_block *this, unsigned long event, void *ptr)
-{
-	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
-
-	/* Skip parent events */
-	if (netif_is_failover(event_dev))
-		return NOTIFY_DONE;
-
-	switch (event) {
-	case NETDEV_REGISTER:
-		return failover_slave_register(event_dev);
-	case NETDEV_UNREGISTER:
-		return failover_slave_unregister(event_dev);
-	case NETDEV_UP:
-	case NETDEV_DOWN:
-	case NETDEV_CHANGE:
-		return failover_slave_link_change(event_dev);
-	case NETDEV_CHANGENAME:
-		return failover_slave_name_change(event_dev);
-	default:
-		return NOTIFY_DONE;
-	}
-}
-
-static struct notifier_block failover_notifier = {
-	.notifier_call = failover_event,
-};
-
-static void
-failover_existing_slave_register(struct net_device *failover_dev)
-{
-	struct net *net = dev_net(failover_dev);
-	struct net_device *dev;
-
-	rtnl_lock();
-	for_each_netdev(net, dev) {
-		if (netif_is_failover(dev))
-			continue;
-		if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr))
-			failover_slave_register(dev);
-	}
-	rtnl_unlock();
-}
-
-/**
- * failover_register - Register a failover instance
- *
- * @dev: failover netdev
- * @ops: failover ops
- *
- * Allocate and register a failover instance for a failover netdev. ops
- * provides handlers for slave device register/unregister/link change/
- * name change events.
- *
- * Return: pointer to failover instance
- */
-struct failover *failover_register(struct net_device *dev,
-				   struct failover_ops *ops)
-{
-	struct failover *failover;
-
-	if (dev->type != ARPHRD_ETHER)
-		return ERR_PTR(-EINVAL);
-
-	failover = kzalloc(sizeof(*failover), GFP_KERNEL);
-	if (!failover)
-		return ERR_PTR(-ENOMEM);
-
-	rcu_assign_pointer(failover->ops, ops);
-	dev_hold(dev);
-	dev->priv_flags |= IFF_FAILOVER;
-	rcu_assign_pointer(failover->failover_dev, dev);
-
-	spin_lock(&failover_lock);
-	list_add_tail(&failover->list, &failover_list);
-	spin_unlock(&failover_lock);
-
-	netdev_info(dev, "failover master:%s registered\n", dev->name);
-
-	failover_existing_slave_register(dev);
-
-	return failover;
-}
-EXPORT_SYMBOL_GPL(failover_register);
-
-/**
- * failover_unregister - Unregister a failover instance
- *
- * @failover: pointer to failover instance
- *
- * Unregisters and frees a failover instance.
- */
-void failover_unregister(struct failover *failover)
-{
-	struct net_device *failover_dev;
-
-	failover_dev = rcu_dereference(failover->failover_dev);
-
-	netdev_info(failover_dev, "failover master:%s unregistered\n",
-		    failover_dev->name);
-
-	failover_dev->priv_flags &= ~IFF_FAILOVER;
-	dev_put(failover_dev);
-
-	spin_lock(&failover_lock);
-	list_del(&failover->list);
-	spin_unlock(&failover_lock);
-
-	kfree(failover);
+	netdev_rx_handler_unregister(lower_dev);
+	netdev_upper_dev_unlink(lower_dev, upper_dev);
+	dev_put(lower_dev);
+	lower_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
 }
-EXPORT_SYMBOL_GPL(failover_unregister);
-
-static __init int
-failover_init(void)
-{
-	register_netdevice_notifier(&failover_notifier);
-
-	return 0;
-}
-module_init(failover_init);
-
-static __exit
-void failover_exit(void)
-{
-	unregister_netdevice_notifier(&failover_notifier);
-}
-module_exit(failover_exit);
-
-MODULE_DESCRIPTION("Generic failover infrastructure/interface");
-MODULE_LICENSE("GPL v2");
+EXPORT_SYMBOL_GPL(netdev_failover_unjoin);