[net] bonding: Fix bonding crash

Message ID	1472793514-31850-1-git-send-email-mahesh@bandewar.net
State	Accepted, archived
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> From: Mahesh Bandewar <mahesh@bandewar.net> To: Jay Vosburgh <j.vosburgh@gmail.com>, Andy Gospodarek <gospo@cumulusnetworks.com>, Veaceslav Falico <vfalico@gmail.com>, David Miller <davem@davemloft.net> Cc: Mahesh Bandewar <maheshb@google.com>, Eric Dumazet <edumazet@google.com>, netdev <netdev@vger.kernel.org> Subject: [PATCH net] bonding: Fix bonding crash Date: Thu, 1 Sep 2016 22:18:34 -0700 Message-Id: <1472793514-31850-1-git-send-email-mahesh@bandewar.net> Sender: netdev-owner@vger.kernel.org Precedence: bulk

Mahesh Bandewar Sept. 2, 2016, 5:18 a.m. UTC

From: Mahesh Bandewar <maheshb@google.com>

Following few steps will crash kernel -

  (a) Create bonding master
      > modprobe bonding miimon=50
  (b) Create macvlan bridge on eth2
      > ip link add link eth2 dev mvl0 address aa:0:0:0:0:01 \
	   type macvlan
  (c) Now try adding eth2 into the bond
      > echo +eth2 > /sys/class/net/bond0/bonding/slaves
      <crash>

Bonding does lots of things before checking if the device enslaved is
busy or not.

In this case when the notifier call-chain sends notifications, the
bond_netdev_event() assumes that the rx_handler /rx_handler_data is
registered while the bond_enslave() hasn't progressed far enough to
register rx_handler for the new slave.

This patch adds a rx_handler check that can be performed right at the
beginning of the enslave code to avoid getting into this situation.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
 drivers/net/bonding/bond_main.c |  7 ++++---
 include/linux/netdevice.h       |  1 +
 net/core/dev.c                  | 16 ++++++++++++++++
 3 files changed, 21 insertions(+), 3 deletions(-)

Jiri Pirko Sept. 2, 2016, 7:52 a.m. UTC | #1

Fri, Sep 02, 2016 at 07:18:34AM CEST, mahesh@bandewar.net wrote:
>From: Mahesh Bandewar <maheshb@google.com>
>
>Following few steps will crash kernel -
>
>  (a) Create bonding master
>      > modprobe bonding miimon=50
>  (b) Create macvlan bridge on eth2
>      > ip link add link eth2 dev mvl0 address aa:0:0:0:0:01 \
>	   type macvlan
>  (c) Now try adding eth2 into the bond
>      > echo +eth2 > /sys/class/net/bond0/bonding/slaves
>      <crash>
>
>Bonding does lots of things before checking if the device enslaved is
>busy or not.
>
>In this case when the notifier call-chain sends notifications, the
>bond_netdev_event() assumes that the rx_handler /rx_handler_data is
>registered while the bond_enslave() hasn't progressed far enough to
>register rx_handler for the new slave.
>
>This patch adds a rx_handler check that can be performed right at the
>beginning of the enslave code to avoid getting into this situation.
>
>Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>---
> drivers/net/bonding/bond_main.c |  7 ++++---
> include/linux/netdevice.h       |  1 +
> net/core/dev.c                  | 16 ++++++++++++++++
> 3 files changed, 21 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 217e8da0628c..9599ed6f1213 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1341,9 +1341,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
> 			    slave_dev->name);
> 	}
> 
>-	/* already enslaved */
>-	if (slave_dev->flags & IFF_SLAVE) {
>-		netdev_dbg(bond_dev, "Error: Device was already enslaved\n");
>+	/* already in-use? */
>+	if (netdev_is_rx_handler_busy(slave_dev)) {
>+		netdev_err(bond_dev,
>+			   "Error: Device is in use and cannot be enslaved\n");
> 		return -EBUSY;
> 	}
> 
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index d122be9345c7..99ba59a7b114 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -3266,6 +3266,7 @@ static inline void napi_free_frags(struct napi_struct *napi)
> 	napi->skb = NULL;
> }
> 
>+bool netdev_is_rx_handler_busy(struct net_device *dev);
> int netdev_rx_handler_register(struct net_device *dev,
> 			       rx_handler_func_t *rx_handler,
> 			       void *rx_handler_data);
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 34b5322bc081..81e6f7298122 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -3965,6 +3965,22 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> }
> 
> /**
>+ *	netdev_is_rx_handler_busy - check if receive handler is registered
>+ *	@dev: device to check
>+ *
>+ *	Check if a receive handler is already registered for a given device.
>+ *	Return true if there one.
>+ *
>+ *	The caller must hold the rtnl_mutex.
>+ */
>+bool netdev_is_rx_handler_busy(struct net_device *dev)
>+{
>+	ASSERT_RTNL();
>+	return dev && rtnl_dereference(dev->rx_handler);
>+}
>+EXPORT_SYMBOL_GPL(netdev_is_rx_handler_busy);

No, please, don't make bonding a spacial citizen introducing this.
Please handle the issue inside the bonding code, like we do for the rest
of master devices (and how it was once done for bonding). Thanks.

Eric Dumazet Sept. 2, 2016, 1:33 p.m. UTC | #2

On Fri, 2016-09-02 at 09:52 +0200, Jiri Pirko wrote:

> 
> No, please, don't make bonding a spacial citizen introducing this.
> Please handle the issue inside the bonding code, like we do for the rest
> of master devices (and how it was once done for bonding). Thanks.

I do not feel this netdev_is_rx_handler_busy() use is special to
bonding.

It makes sense to use it early, to avoid complex rollback, once various
events have been sent all over.

bond_enslave() is 447 lines long already.

You perfectly know how hard it is to 'handle the issue inside the
bonding code' as you chose to not fix bonding and write team instead.

So Mahesh patch makes perfect sense to me. It exactly fixes a stupid
bond_enslave() behavior, trying to set rx_handler way too late.

By doing sanity checks before any action, we simply do not have to add
complex rollback.

Acked-by: Eric Dumazet <edumazet@google.com>

Thanks.

David Ahern Sept. 2, 2016, 2:30 p.m. UTC | #3

On 9/1/16 11:18 PM, Mahesh Bandewar wrote:
> From: Mahesh Bandewar <maheshb@google.com>
> 
> Following few steps will crash kernel -
> 
>   (a) Create bonding master
>       > modprobe bonding miimon=50
>   (b) Create macvlan bridge on eth2
>       > ip link add link eth2 dev mvl0 address aa:0:0:0:0:01 \
> 	   type macvlan
>   (c) Now try adding eth2 into the bond
>       > echo +eth2 > /sys/class/net/bond0/bonding/slaves
>       <crash>
> 
> Bonding does lots of things before checking if the device enslaved is
> busy or not.
> 
> In this case when the notifier call-chain sends notifications, the
> bond_netdev_event() assumes that the rx_handler /rx_handler_data is
> registered while the bond_enslave() hasn't progressed far enough to
> register rx_handler for the new slave.
> 
> This patch adds a rx_handler check that can be performed right at the
> beginning of the enslave code to avoid getting into this situation.
> 
> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
> ---
>  drivers/net/bonding/bond_main.c |  7 ++++---
>  include/linux/netdevice.h       |  1 +
>  net/core/dev.c                  | 16 ++++++++++++++++
>  3 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 217e8da0628c..9599ed6f1213 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -1341,9 +1341,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>  			    slave_dev->name);
>  	}
>  
> -	/* already enslaved */
> -	if (slave_dev->flags & IFF_SLAVE) {
> -		netdev_dbg(bond_dev, "Error: Device was already enslaved\n");
> +	/* already in-use? */
> +	if (netdev_is_rx_handler_busy(slave_dev)) {
> +		netdev_err(bond_dev,
> +			   "Error: Device is in use and cannot be enslaved\n");
>  		return -EBUSY;
>  	}
>  

This check duplicates what netdev_rx_handler_register does. Why not move the call to netdev_rx_handler_register here and then call unregister on failure paths?

Eric Dumazet Sept. 2, 2016, 2:45 p.m. UTC | #4

On Fri, 2016-09-02 at 08:30 -0600, David Ahern wrote:

> This check duplicates what netdev_rx_handler_register does. Why not
> move the call to netdev_rx_handler_register here and then call
> unregister on failure paths?

As soon as you call netdev_rx_handler_register(), incoming packets will
hit your driver and we'll likely crash since the enslaving is not done
yet.

Really think about RCU, we do rcu_assign_pointer() because we want all
prior changes being committed to memory before 'enabling' readers to see
the updated rcu protected pointer.

There are 9 call sites where netdev_rx_handler_register() is used.

We can get rid of the extra check in netdev_rx_handler_register() once
all of them are using netdev_is_rx_handler_busy()

Since this patch takes care of bonding only, we need to keep the
existing check in netdev_rx_handler_register()

Anyway, we are speaking of control function, an extra check is simply
safer, like all the ASSERT_RTNL() we do have...

David Ahern Sept. 2, 2016, 2:57 p.m. UTC | #5

On 9/2/16 8:45 AM, Eric Dumazet wrote:
> On Fri, 2016-09-02 at 08:30 -0600, David Ahern wrote:
> 
>> This check duplicates what netdev_rx_handler_register does. Why not
>> move the call to netdev_rx_handler_register here and then call
>> unregister on failure paths?
> 
> As soon as you call netdev_rx_handler_register(), incoming packets will
> hit your driver and we'll likely crash since the enslaving is not done
> yet.
> 
> Really think about RCU, we do rcu_assign_pointer() because we want all
> prior changes being committed to memory before 'enabling' readers to see
> the updated rcu protected pointer.
> 
> There are 9 call sites where netdev_rx_handler_register() is used.
> 
> We can get rid of the extra check in netdev_rx_handler_register() once
> all of them are using netdev_is_rx_handler_busy()
> 
> Since this patch takes care of bonding only, we need to keep the
> existing check in netdev_rx_handler_register()
> 
> Anyway, we are speaking of control function, an extra check is simply
> safer, like all the ASSERT_RTNL() we do have...
> 

I hit this same problem yesterday but with the bridge. I forgot I had a macvlan device on an interface and tried to enslave it to a bridge. It failed with EBUSY without crashing the kernel so it is one example that handles the conflict, and the bridge also calls the register before the enslaving is done.

Eric Dumazet Sept. 2, 2016, 3:21 p.m. UTC | #6

On Fri, 2016-09-02 at 08:57 -0600, David Ahern wrote:

> I hit this same problem yesterday but with the bridge. I forgot I had
> a macvlan device on an interface and tried to enslave it to a bridge.
> It failed with EBUSY without crashing the kernel so it is one example
> that handles the conflict, and the bridge also calls the register
> before the enslaving is done.
> 

Sure, some netdev_rx_handler_register() users are simpler than bonding,
this was the point Jiri raised. Hey guys just use team instead of
bonding ;)

If you carefully look at bridge code, you'll find other kind of errors
for sure ;)

For example, should dev_disable_lro() be called before or after
netdev_rx_handler_register()  ? ;)

Mahesh proposal allows for simplification, since one level can be
removed from the error rollback chain.

Ideally the netdev_rx_handler_register() should be called only when no
further errors can be detected in the 'enslaving' process, otherwise
some live packets could come and be incorrectly processed/dropped by a
not fully initialized driver.

So it is a chicken and egg problem, if you allow
netdev_rx_handler_register() to return an error, while it is so easy
to early check what could go wrong with it.

Eric Dumazet Sept. 2, 2016, 3:40 p.m. UTC | #7

On Fri, 2016-09-02 at 08:21 -0700, Eric Dumazet wrote:

> Ideally the netdev_rx_handler_register() should be called only when no
> further errors can be detected in the 'enslaving' process, otherwise
> some live packets could come and be incorrectly processed/dropped by a
> not fully initialized driver.
> 
> So it is a chicken and egg problem, if you allow
> netdev_rx_handler_register() to return an error, while it is so easy
> to early check what could go wrong with it.

One final point is that this patch is easy to backport to stable
versions.

bond_enslave() is a monster, and changing the rollback chain is likely
to cause backport merge conflicts.

Jiri Pirko Sept. 2, 2016, 4:25 p.m. UTC | #8

Fri, Sep 02, 2016 at 03:33:20PM CEST, eric.dumazet@gmail.com wrote:
>On Fri, 2016-09-02 at 09:52 +0200, Jiri Pirko wrote:
>
>> 
>> No, please, don't make bonding a spacial citizen introducing this.
>> Please handle the issue inside the bonding code, like we do for the rest
>> of master devices (and how it was once done for bonding). Thanks.
>
>I do not feel this netdev_is_rx_handler_busy() use is special to
>bonding.
>
>It makes sense to use it early, to avoid complex rollback, once various
>events have been sent all over.
>
>bond_enslave() is 447 lines long already.
>
>You perfectly know how hard it is to 'handle the issue inside the
>bonding code' as you chose to not fix bonding and write team instead.
>
>So Mahesh patch makes perfect sense to me. It exactly fixes a stupid
>bond_enslave() behavior, trying to set rx_handler way too late.
>
>By doing sanity checks before any action, we simply do not have to add
>complex rollback.

Understand the reason. All I say is we tried hard (I surely did) in
the past to remove bonding specific quirks and now we are adding one.
And the simple reason is that the code is such a mess.

Just use team instead and you'll be fine. You can "google" it :)

Eric Dumazet Sept. 2, 2016, 4:37 p.m. UTC | #9

On Fri, 2016-09-02 at 18:25 +0200, Jiri Pirko wrote:

> Understand the reason. All I say is we tried hard (I surely did) in
> the past to remove bonding specific quirks and now we are adding one.
> And the simple reason is that the code is such a mess.
> 
> Just use team instead and you'll be fine. You can "google" it :)

Sure, please join _our_ team and make all the needed changes in user
land.

The kernel part of it is epsilon compared to all the control part (ie
talk to the various Google switches.), and monitoring.

I am fine to leave the bug in upstream bonding driver, if you really
want to force people out of bonding land !

Here an automated test simply used macvlan and did not remove the
macvlan before enslaving the netdev again in a bonding, we already fixed
the test, because it is faster than deploying new kernels anyway.

Mahesh Bandewar (महेश बंडेवार) Sept. 2, 2016, 5:08 p.m. UTC | #10

On Fri, Sep 2, 2016 at 9:37 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2016-09-02 at 18:25 +0200, Jiri Pirko wrote:
>
>> Understand the reason. All I say is we tried hard (I surely did) in
>> the past to remove bonding specific quirks and now we are adding one.
>> And the simple reason is that the code is such a mess.
>>
I would not qualify this as bonding specific change! The check of you can
really use the interface is simple enough, and can be used by other virtual
drivers (e.g. macvlan, ipvlan, vrf etc.). Waiting until
rx_handler_register() can be
called in your code and then try to do rollback is bad. Also doing
register() early
has it's own consequences. On contrary, I would argue that this is much cleaner.

Agreed that bonding-driver has become a monster but that doesn't mean we
should not fix issues. This argument would be valid when we are
dealing with last
couple installations / use cases of bonding and everyone else has moved to
alternatives. Unfortunately we are not there yet :(

>> Just use team instead and you'll be fine. You can "google" it :)
>
> Sure, please join _our_ team and make all the needed changes in user
> land.
>
:) Cannot put it more eloquently than Eric did but we would (in
theory) love to move
to team-driver, but logistics don't support this (yet!).

> The kernel part of it is epsilon compared to all the control part (ie
> talk to the various Google switches.), and monitoring.
>
> I am fine to leave the bug in upstream bonding driver, if you really
> want to force people out of bonding land !
>
> Here an automated test simply used macvlan and did not remove the
> macvlan before enslaving the netdev again in a bonding, we already fixed
> the test, because it is faster than deploying new kernels anyway.
>
>
>

Jiri Pirko Sept. 2, 2016, 5:15 p.m. UTC | #11

Fri, Sep 02, 2016 at 07:08:54PM CEST, maheshb@google.com wrote:
>On Fri, Sep 2, 2016 at 9:37 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Fri, 2016-09-02 at 18:25 +0200, Jiri Pirko wrote:
>>
>>> Understand the reason. All I say is we tried hard (I surely did) in
>>> the past to remove bonding specific quirks and now we are adding one.
>>> And the simple reason is that the code is such a mess.
>>>
>I would not qualify this as bonding specific change! The check of you can
>really use the interface is simple enough, and can be used by other virtual
>drivers (e.g. macvlan, ipvlan, vrf etc.). Waiting until
>rx_handler_register() can be
>called in your code and then try to do rollback is bad. Also doing
>register() early
>has it's own consequences. On contrary, I would argue that this is much cleaner.

Fair enough. Would be great to do this in all other rx_handler users so
we are consistent. Thanks!


>
>Agreed that bonding-driver has become a monster but that doesn't mean we
>should not fix issues. This argument would be valid when we are
>dealing with last
>couple installations / use cases of bonding and everyone else has moved to
>alternatives. Unfortunately we are not there yet :(
>
>>> Just use team instead and you'll be fine. You can "google" it :)
>>
>> Sure, please join _our_ team and make all the needed changes in user
>> land.
>>
>:) Cannot put it more eloquently than Eric did but we would (in
>theory) love to move
>to team-driver, but logistics don't support this (yet!).
>
>> The kernel part of it is epsilon compared to all the control part (ie
>> talk to the various Google switches.), and monitoring.
>>
>> I am fine to leave the bug in upstream bonding driver, if you really
>> want to force people out of bonding land !
>>
>> Here an automated test simply used macvlan and did not remove the
>> macvlan before enslaving the netdev again in a bonding, we already fixed
>> the test, because it is faster than deploying new kernels anyway.
>>
>>
>>

David Miller Sept. 4, 2016, 6:41 p.m. UTC | #12

From: Mahesh Bandewar <mahesh@bandewar.net>
Date: Thu,  1 Sep 2016 22:18:34 -0700

> From: Mahesh Bandewar <maheshb@google.com>
> 
> Following few steps will crash kernel -
> 
>   (a) Create bonding master
>       > modprobe bonding miimon=50
>   (b) Create macvlan bridge on eth2
>       > ip link add link eth2 dev mvl0 address aa:0:0:0:0:01 \
> 	   type macvlan
>   (c) Now try adding eth2 into the bond
>       > echo +eth2 > /sys/class/net/bond0/bonding/slaves
>       <crash>
> 
> Bonding does lots of things before checking if the device enslaved is
> busy or not.
> 
> In this case when the notifier call-chain sends notifications, the
> bond_netdev_event() assumes that the rx_handler /rx_handler_data is
> registered while the bond_enslave() hasn't progressed far enough to
> register rx_handler for the new slave.
> 
> This patch adds a rx_handler check that can be performed right at the
> beginning of the enslave code to avoid getting into this situation.
> 
> Signed-off-by: Mahesh Bandewar <maheshb@google.com>

Applied and queued up for -stable, thanks.

[net] bonding: Fix bonding crash

Commit Message

Comments

Patch