[RFC] bonding: fix workqueue re-arming races

Message ID 20100901131626.GA12447@midget.suse.cz
State RFC, archived
Delegated to: David Miller

Commit Message

Jiri Bohac Sept. 1, 2010, 1:16 p.m. UTC
On Tue, Aug 31, 2010 at 01:54:23PM -0700, Jay Vosburgh wrote:
> Jiri Bohac <jbohac@suse.cz> wrote:
> >[note, this does not deal with bond_loadbalance_arp_mon(), where
> >rtnl is now taken as well in net-next; I'll do this if you think
> >the idea is good ]
> 
> 	I don't believe the loadbalance_arp_mon acquires RTNL in
> net-next.  I recall discussing this with Andy not too long ago, but I
> didn't think that change went in, and I don't see it in the tree.

Of course, you are right, I misread the e-mail thread and did not look at
the code.

> >+void bond_alb_promisc_disable(struct work_struct *work)
> >+{
> >+	struct bonding *bond = container_of(work, struct bonding,
> >+					    alb_promisc_disable_work);
> >+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
> >+
> >+	/*
> >+	 * dev_set_promiscuity requires rtnl and
> >+	 * nothing else.
> >+	 */
> >+	rtnl_lock();
> >+	dev_set_promiscuity(bond->curr_active_slave->dev, -1);
> >+	bond_info->primary_is_promisc = 0;
> >+	bond_info->rlb_promisc_timeout_counter = 0;
> >+	rtnl_unlock();
> >+}
> 
> 	What prevents this from deadlocking such that cpu A is in
> bond_close, holding RTNL and in cancel_delayed_work_sync, while cpu B is
> in the above function, trying to acquire RTNL?

The main idea of the patch is to move the code (the "commit"
functions) that needs rtnl to another work item. Then
cancel_delayed_work_sync() can be used to cancel the re-arming
work. But you are absolutely right, there is still a deadlock,
since I queue the "commit" work on the same workqueue. So when
cancel_delayed_work_sync() waits for the re-arming work to
finish, it can wait forever because a previously queued "commit"
work is waiting for rtnl.

The solution is to move the "commit" work items to a different
workqueue. Fixed in the new version of the patch below
(bond->wq_rtnl).
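
To illustrate the deadlock being avoided, here is a minimal sketch
of the ordering (an illustration only, not code from the patch):

	/*
	 *  cpu A (bond_close)         cpu B (bond->wq worker)
	 *  ------------------         -----------------------
	 *  rtnl_lock()
	 *                             "commit" work runs,
	 *                             rtnl_lock() -> blocks on cpu A
	 *  cancel_delayed_work_sync(&bond->mii_work)
	 *    -> waits for mii_work, which is queued behind the
	 *       blocked "commit" work on the same single-threaded
	 *       workqueue => deadlock
	 */

With the "commit" work on its own workqueue (bond->wq_rtnl), the
worker that cancel_delayed_work_sync() waits for can no longer be
stuck behind an item that is itself waiting for rtnl.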

> 	Also, assuming for the moment that the above isn't a problem,
> curr_active_slave may be NULL if the last slave is removed between the
> time bond_alb_promisc_disable is scheduled and when it runs.  I'm not
> sure that the alb_bond_info can be guaranteed to be valid, either, if
> the mode changed.

Yes, there may be problems like these, but they are already
present in the current code, because bond->lock is released
before rtnl is taken.

Sure, it would be good to deal with these problems, but I don't
think this patch introduces new races like these. They are
already there ... (see below)

> > void bond_alb_monitor(struct work_struct *work)
> > {
> > 	struct bonding *bond = container_of(work, struct bonding,
> >@@ -1407,10 +1424,6 @@ void bond_alb_monitor(struct work_struct *work)
> >
> > 	read_lock(&bond->lock);
> >
> >-	if (bond->kill_timers) {
> >-		goto out;
> >-	}
> >-
> > 	if (bond->slave_cnt == 0) {
> > 		bond_info->tx_rebalance_counter = 0;
> > 		bond_info->lp_counter = 0;
> >@@ -1462,25 +1475,11 @@ void bond_alb_monitor(struct work_struct *work)
> > 	if (bond_info->rlb_enabled) {
> > 		if (bond_info->primary_is_promisc &&
> > 		    (++bond_info->rlb_promisc_timeout_counter >= RLB_PROMISC_TIMEOUT)) {
> >-
> >-			/*
> >-			 * dev_set_promiscuity requires rtnl and
> >-			 * nothing else.
> >-			 */
> >-			read_unlock(&bond->lock);

... e.g. here; the current slave may change/disappear, the mode
may change .... 

> >-			rtnl_lock();
> >-
> >-			bond_info->rlb_promisc_timeout_counter = 0;
> >-

I fixed both issues in this new version of the patch.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>

Comments

Jay Vosburgh Sept. 1, 2010, 5:14 p.m. UTC | #1
Jiri Bohac <jbohac@suse.cz> wrote:

>On Tue, Aug 31, 2010 at 01:54:23PM -0700, Jay Vosburgh wrote:
>> Jiri Bohac <jbohac@suse.cz> wrote:
>> >[note, this does not deal with bond_loadbalance_arp_mon(), where
>> >rtnl is now taken as well in net-next; I'll do this if you think
>> >the idea is good ]
>> 
>> 	I don't believe the loadbalance_arp_mon acquires RTNL in
>> net-next.  I recall discussing this with Andy not too long ago, but I
>> didn't think that change went in, and I don't see it in the tree.
>
>Of course, you are right, I misread the e-mail thread and did not look at
>the code.
>
>> >+void bond_alb_promisc_disable(struct work_struct *work)
>> >+{
>> >+	struct bonding *bond = container_of(work, struct bonding,
>> >+					    alb_promisc_disable_work);
>> >+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>> >+
>> >+	/*
>> >+	 * dev_set_promiscuity requires rtnl and
>> >+	 * nothing else.
>> >+	 */
>> >+	rtnl_lock();
>> >+	dev_set_promiscuity(bond->curr_active_slave->dev, -1);
>> >+	bond_info->primary_is_promisc = 0;
>> >+	bond_info->rlb_promisc_timeout_counter = 0;
>> >+	rtnl_unlock();
>> >+}
>> 
>> 	What prevents this from deadlocking such that cpu A is in
>> bond_close, holding RTNL and in cancel_delayed_work_sync, while cpu B is
>> in the above function, trying to acquire RTNL?
>
>The main idea of the patch is to move the code (the "commit"
>functions) that needs rtnl to another work item. Then
>cancel_delayed_work_sync() can be used to cancel the re-arming
>work. But you are absolutely right, there is still a deadlock,
>since I queue the "commit" work on the same workqueue. So when
>cancel_delayed_work_sync() waits for the re-arming work to
>finish, it can wait forever because a previously queued "commit"
>work is waiting for rtnl.
>
>The solution is to move the "commit" work items to a different
>workqueue. Fixed in the new version of the patch below
>(bond->wq_rtnl).
>
>> 	Also, assuming for the moment that the above isn't a problem,
>> curr_active_slave may be NULL if the last slave is removed between the
>> time bond_alb_promisc_disable is scheduled and when it runs.  I'm not
>> sure that the alb_bond_info can be guaranteed to be valid, either, if
>> the mode changed.
>
>Yes, there may be problems like these, but they are already
>present in the current code, because bond->lock is released
>before rtnl is taken.
>
>Sure, it would be good to deal with these problems, but I don't
>think this patch introduces new races like these. They are
>already there ... (see below)

	Yep, the current code does appear to have a race in
bond_alb_monitor that should be fixed (to check curr_active_slave before
dereferencing it).

	I also thought a bit more, and in the current code, the mode
shouldn't change in the middle of one of the work functions, because a
mode change requires the bond to be closed, so the various work things
will be stopped (more or less; excepting the race under discussion
here).

	I don't think this is true for the new wq_rtnl functions,
however, because its work items are not canceled until the workqueue
itself is freed in bond_destructor.  Does the wq_rtnl open new races,
because its work items are not synchronized with other activities
(close in particular)?  It's possible for its work functions (which may
do things like set the active slave, carrier, etc) to be invoked after
the bond has closed, and possibly reopened, or been otherwise adjusted.

	I'm not sure this is better than one of the alternatives I
believe we discussed the last time around: having the rtnl-acquiring
work functions do a conditional acquire of rtnl, and if that fails,
reschedule themselves.

	So, e.g., bond_mii_monitor becomes something like:

void bond_mii_monitor(struct work_struct *work)
{
	struct bonding *bond = container_of(work, struct bonding,
					    mii_work.work);

	read_lock(&bond->lock);

	if (bond->slave_cnt == 0)
		goto re_arm;

	if (bond->send_grat_arp) {
		read_lock(&bond->curr_slave_lock);
		bond_send_gratuitous_arp(bond);
		read_unlock(&bond->curr_slave_lock);
	}

	if (bond->send_unsol_na) {
		read_lock(&bond->curr_slave_lock);
		bond_send_unsolicited_na(bond);
		read_unlock(&bond->curr_slave_lock);
	}

	if (bond_miimon_inspect(bond)) {
		read_unlock(&bond->lock);
		/*
		 * Awe-inspiring comment explaining why we do this
		 */
		if (rtnl_trylock()) {
			read_lock(&bond->lock);

			bond_miimon_commit(bond);

			read_unlock(&bond->lock);
			rtnl_unlock();	/* might sleep, hold no other locks */
			read_lock(&bond->lock);
		} else {
			read_lock(&bond->lock);
			queue_delayed_work(bond->wq, &bond->mii_work, 0);
			read_unlock(&bond->lock);
			return;
		}
	}

re_arm:
	if (bond->params.miimon)
		queue_delayed_work(bond->wq, &bond->mii_work,
				   msecs_to_jiffies(bond->params.miimon));
	read_unlock(&bond->lock);
}

	Which, if I'm not missing something, does not need the
kill_timers (because I believe we can use cancel_delayed_work_sync now
that there's no deadlock against rtnl).

	It might need a "fast reschedule" flag of some sort so that the
gratuitous ARPs and NAs aren't bunched up when the trylock fails, but
that's a secondary concern.

	Thoughts?

>> > void bond_alb_monitor(struct work_struct *work)
>> > {
>> > 	struct bonding *bond = container_of(work, struct bonding,
>> >@@ -1407,10 +1424,6 @@ void bond_alb_monitor(struct work_struct *work)
>> >
>> > 	read_lock(&bond->lock);
>> >
>> >-	if (bond->kill_timers) {
>> >-		goto out;
>> >-	}
>> >-
>> > 	if (bond->slave_cnt == 0) {
>> > 		bond_info->tx_rebalance_counter = 0;
>> > 		bond_info->lp_counter = 0;
>> >@@ -1462,25 +1475,11 @@ void bond_alb_monitor(struct work_struct *work)
>> > 	if (bond_info->rlb_enabled) {
>> > 		if (bond_info->primary_is_promisc &&
>> > 		    (++bond_info->rlb_promisc_timeout_counter >= RLB_PROMISC_TIMEOUT)) {
>> >-
>> >-			/*
>> >-			 * dev_set_promiscuity requires rtnl and
>> >-			 * nothing else.
>> >-			 */
>> >-			read_unlock(&bond->lock);
>
>... e.g. here; the current slave may change/disappear, the mode
>may change .... 
>
>> >-			rtnl_lock();
>> >-
>> >-			bond_info->rlb_promisc_timeout_counter = 0;
>> >-
>
>I fixed both issues in this new version of the patch.
>
>Signed-off-by: Jiri Bohac <jbohac@suse.cz>
>
>
>diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>index 822f586..8015e12 100644
>--- a/drivers/net/bonding/bond_3ad.c
>+++ b/drivers/net/bonding/bond_3ad.c
>@@ -2119,10 +2119,6 @@ void bond_3ad_state_machine_handler(struct work_struct *work)
>
> 	read_lock(&bond->lock);
>
>-	if (bond->kill_timers) {
>-		goto out;
>-	}
>-
> 	//check if there are any slaves
> 	if (bond->slave_cnt == 0) {
> 		goto re_arm;
>@@ -2166,7 +2162,6 @@ void bond_3ad_state_machine_handler(struct work_struct *work)
>
> re_arm:
> 	queue_delayed_work(bond->wq, &bond->ad_work, ad_delta_in_ticks);
>-out:
> 	read_unlock(&bond->lock);
> }
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index c746b33..e4fa3a5 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -1397,6 +1397,35 @@ out:
> 	return NETDEV_TX_OK;
> }
>
>+void bond_alb_promisc_disable(struct work_struct *work)
>+{
>+	struct bonding *bond = container_of(work, struct bonding,
>+					    alb_promisc_disable_work);
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+
>+	/*
>+	 * dev_set_promiscuity requires rtnl and
>+	 * nothing else.
>+	 */
>+	rtnl_lock();
>+	read_lock(&bond->lock);
>+	read_lock(&bond->curr_slave_lock);
>+
>+	if (!bond_is_lb(bond))
>+		goto out;
>+	if (!bond->curr_active_slave)
>+		goto out;
>+
>+	dev_set_promiscuity(bond->curr_active_slave->dev, -1);
>+	bond_info->primary_is_promisc = 0;
>+	bond_info->rlb_promisc_timeout_counter = 0;
>+
>+out:
>+	read_unlock(&bond->curr_slave_lock);
>+	read_unlock(&bond->lock);
>+	rtnl_unlock();

	Should we end up using this, it is also holding too many locks
over the dev_set_promiscuity call.


>+}
>+
> void bond_alb_monitor(struct work_struct *work)
> {
> 	struct bonding *bond = container_of(work, struct bonding,
>@@ -1407,10 +1436,6 @@ void bond_alb_monitor(struct work_struct *work)
>
> 	read_lock(&bond->lock);
>
>-	if (bond->kill_timers) {
>-		goto out;
>-	}
>-
> 	if (bond->slave_cnt == 0) {
> 		bond_info->tx_rebalance_counter = 0;
> 		bond_info->lp_counter = 0;
>@@ -1462,25 +1487,11 @@ void bond_alb_monitor(struct work_struct *work)
> 	if (bond_info->rlb_enabled) {
> 		if (bond_info->primary_is_promisc &&
> 		    (++bond_info->rlb_promisc_timeout_counter >= RLB_PROMISC_TIMEOUT)) {
>-
>-			/*
>-			 * dev_set_promiscuity requires rtnl and
>-			 * nothing else.
>-			 */
>-			read_unlock(&bond->lock);
>-			rtnl_lock();
>-
>-			bond_info->rlb_promisc_timeout_counter = 0;
>-
> 			/* If the primary was set to promiscuous mode
> 			 * because a slave was disabled then
> 			 * it can now leave promiscuous mode.
> 			 */
>-			dev_set_promiscuity(bond->curr_active_slave->dev, -1);
>-			bond_info->primary_is_promisc = 0;
>-
>-			rtnl_unlock();
>-			read_lock(&bond->lock);
>+			queue_work(bond->wq_rtnl, &bond->alb_promisc_disable_work);
> 		}
>
> 		if (bond_info->rlb_rebalance) {
>@@ -1505,7 +1516,6 @@ void bond_alb_monitor(struct work_struct *work)
>
> re_arm:
> 	queue_delayed_work(bond->wq, &bond->alb_work, alb_delta_in_ticks);
>-out:
> 	read_unlock(&bond->lock);
> }
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 2cc4cfc..aae2864 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -2343,10 +2343,15 @@ static int bond_miimon_inspect(struct bonding *bond)
> 	return commit;
> }
>
>-static void bond_miimon_commit(struct bonding *bond)
>+static void bond_miimon_commit(struct work_struct *work)
> {
> 	struct slave *slave;
> 	int i;
>+	struct bonding *bond = container_of(work, struct bonding,
>+					    miimon_commit_work);
>+
>+	rtnl_lock();
>+	read_lock(&bond->lock);
>
> 	bond_for_each_slave(bond, slave, i) {
> 		switch (slave->new_link) {
>@@ -2421,15 +2426,18 @@ static void bond_miimon_commit(struct bonding *bond)
> 		}
>
> do_failover:
>-		ASSERT_RTNL();
> 		write_lock_bh(&bond->curr_slave_lock);
> 		bond_select_active_slave(bond);
> 		write_unlock_bh(&bond->curr_slave_lock);
> 	}
>
> 	bond_set_carrier(bond);
>+
>+	read_unlock(&bond->lock);
>+	rtnl_unlock();
> }
>
>+
> /*
>  * bond_mii_monitor
>  *
>@@ -2438,14 +2446,13 @@ do_failover:
>  * an acquisition of appropriate locks followed by a commit phase to
>  * implement whatever link state changes are indicated.
>  */
>+
> void bond_mii_monitor(struct work_struct *work)
> {
> 	struct bonding *bond = container_of(work, struct bonding,
> 					    mii_work.work);
>
> 	read_lock(&bond->lock);
>-	if (bond->kill_timers)
>-		goto out;
>
> 	if (bond->slave_cnt == 0)
> 		goto re_arm;
>@@ -2462,23 +2469,14 @@ void bond_mii_monitor(struct work_struct *work)
> 		read_unlock(&bond->curr_slave_lock);
> 	}
>
>-	if (bond_miimon_inspect(bond)) {
>-		read_unlock(&bond->lock);
>-		rtnl_lock();
>-		read_lock(&bond->lock);
>+	if (bond_miimon_inspect(bond))
>+		queue_work(bond->wq_rtnl, &bond->miimon_commit_work);
>
>-		bond_miimon_commit(bond);
>-
>-		read_unlock(&bond->lock);
>-		rtnl_unlock();	/* might sleep, hold no other locks */
>-		read_lock(&bond->lock);
>-	}
>
> re_arm:
> 	if (bond->params.miimon)
> 		queue_delayed_work(bond->wq, &bond->mii_work,
> 				   msecs_to_jiffies(bond->params.miimon));
>-out:
> 	read_unlock(&bond->lock);
> }
>
>@@ -2778,9 +2776,6 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
>
> 	delta_in_ticks = msecs_to_jiffies(bond->params.arp_interval);
>
>-	if (bond->kill_timers)
>-		goto out;
>-
> 	if (bond->slave_cnt == 0)
> 		goto re_arm;
>
>@@ -2867,7 +2862,6 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
> re_arm:
> 	if (bond->params.arp_interval)
> 		queue_delayed_work(bond->wq, &bond->arp_work, delta_in_ticks);
>-out:
> 	read_unlock(&bond->lock);
> }
>
>@@ -2949,13 +2943,19 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
> /*
>  * Called to commit link state changes noted by inspection step of
>  * active-backup mode ARP monitor.
>- *
>- * Called with RTNL and bond->lock for read.
>  */
>-static void bond_ab_arp_commit(struct bonding *bond, int delta_in_ticks)
>+static void bond_ab_arp_commit(struct work_struct *work)
> {
> 	struct slave *slave;
> 	int i;
>+	int delta_in_ticks;
>+	struct bonding *bond = container_of(work, struct bonding,
>+					    ab_arp_commit_work);
>+
>+	rtnl_lock();
>+	read_lock(&bond->lock);
>+
>+	delta_in_ticks = msecs_to_jiffies(bond->params.arp_interval);
>
> 	bond_for_each_slave(bond, slave, i) {
> 		switch (slave->new_link) {
>@@ -3014,6 +3014,9 @@ do_failover:
> 	}
>
> 	bond_set_carrier(bond);
>+
>+	read_unlock(&bond->lock);
>+	rtnl_unlock();
> }
>
> /*
>@@ -3093,9 +3096,6 @@ void bond_activebackup_arp_mon(struct work_struct *work)
>
> 	read_lock(&bond->lock);
>
>-	if (bond->kill_timers)
>-		goto out;
>-
> 	delta_in_ticks = msecs_to_jiffies(bond->params.arp_interval);
>
> 	if (bond->slave_cnt == 0)
>@@ -3113,24 +3113,15 @@ void bond_activebackup_arp_mon(struct work_struct *work)
> 		read_unlock(&bond->curr_slave_lock);
> 	}
>
>-	if (bond_ab_arp_inspect(bond, delta_in_ticks)) {
>-		read_unlock(&bond->lock);
>-		rtnl_lock();
>-		read_lock(&bond->lock);
>+	if (bond_ab_arp_inspect(bond, delta_in_ticks))
>+		queue_work(bond->wq_rtnl, &bond->ab_arp_commit_work);
>
>-		bond_ab_arp_commit(bond, delta_in_ticks);
>-
>-		read_unlock(&bond->lock);
>-		rtnl_unlock();
>-		read_lock(&bond->lock);
>-	}
>
> 	bond_ab_arp_probe(bond);
>
> re_arm:
> 	if (bond->params.arp_interval)
> 		queue_delayed_work(bond->wq, &bond->arp_work, delta_in_ticks);
>-out:
> 	read_unlock(&bond->lock);
> }
>
>@@ -3720,8 +3711,6 @@ static int bond_open(struct net_device *bond_dev)
> {
> 	struct bonding *bond = netdev_priv(bond_dev);
>
>-	bond->kill_timers = 0;
>-
> 	if (bond_is_lb(bond)) {
> 		/* bond_alb_initialize must be called before the timer
> 		 * is started.
>@@ -3781,26 +3770,23 @@ static int bond_close(struct net_device *bond_dev)
> 	bond->send_grat_arp = 0;
> 	bond->send_unsol_na = 0;
>
>-	/* signal timers not to re-arm */
>-	bond->kill_timers = 1;
>-
> 	write_unlock_bh(&bond->lock);
>
> 	if (bond->params.miimon) {  /* link check interval, in milliseconds. */
>-		cancel_delayed_work(&bond->mii_work);
>+		cancel_delayed_work_sync(&bond->mii_work);
> 	}
>
> 	if (bond->params.arp_interval) {  /* arp interval, in milliseconds. */
>-		cancel_delayed_work(&bond->arp_work);
>+		cancel_delayed_work_sync(&bond->arp_work);
> 	}
>
> 	switch (bond->params.mode) {
> 	case BOND_MODE_8023AD:
>-		cancel_delayed_work(&bond->ad_work);
>+		cancel_delayed_work_sync(&bond->ad_work);
> 		break;
> 	case BOND_MODE_TLB:
> 	case BOND_MODE_ALB:
>-		cancel_delayed_work(&bond->alb_work);
>+		cancel_delayed_work_sync(&bond->alb_work);
> 		break;
> 	default:
> 		break;
>@@ -4601,6 +4587,8 @@ static void bond_destructor(struct net_device *bond_dev)
> 	struct bonding *bond = netdev_priv(bond_dev);
> 	if (bond->wq)
> 		destroy_workqueue(bond->wq);
>+	if (bond->wq_rtnl)
>+		destroy_workqueue(bond->wq_rtnl);
> 	free_netdev(bond_dev);
> }
>
>@@ -4660,23 +4648,19 @@ static void bond_setup(struct net_device *bond_dev)
>
> static void bond_work_cancel_all(struct bonding *bond)
> {
>-	write_lock_bh(&bond->lock);
>-	bond->kill_timers = 1;
>-	write_unlock_bh(&bond->lock);
>-
> 	if (bond->params.miimon && delayed_work_pending(&bond->mii_work))
>-		cancel_delayed_work(&bond->mii_work);
>+		cancel_delayed_work_sync(&bond->mii_work);
>
> 	if (bond->params.arp_interval && delayed_work_pending(&bond->arp_work))
>-		cancel_delayed_work(&bond->arp_work);
>+		cancel_delayed_work_sync(&bond->arp_work);
>
> 	if (bond->params.mode == BOND_MODE_ALB &&
> 	    delayed_work_pending(&bond->alb_work))
>-		cancel_delayed_work(&bond->alb_work);
>+		cancel_delayed_work_sync(&bond->alb_work);
>
> 	if (bond->params.mode == BOND_MODE_8023AD &&
> 	    delayed_work_pending(&bond->ad_work))
>-		cancel_delayed_work(&bond->ad_work);
>+		cancel_delayed_work_sync(&bond->ad_work);
> }
>
> /*
>@@ -5083,6 +5067,12 @@ static int bond_init(struct net_device *bond_dev)
> 	bond->wq = create_singlethread_workqueue(bond_dev->name);
> 	if (!bond->wq)
> 		return -ENOMEM;
>+	bond->wq_rtnl = create_singlethread_workqueue(bond_dev->name);
>+	if (!bond->wq_rtnl) {
>+		destroy_workqueue(bond->wq);
>+		bond->wq = NULL;
>+		return -ENOMEM;
>+	}
>
> 	bond_set_lockdep_class(bond_dev);
>
>@@ -5094,6 +5084,9 @@ static int bond_init(struct net_device *bond_dev)
> 	bond_prepare_sysfs_group(bond);
>
> 	__hw_addr_init(&bond->mc_list);
>+	INIT_WORK(&bond->miimon_commit_work, bond_miimon_commit);
>+	INIT_WORK(&bond->ab_arp_commit_work, bond_ab_arp_commit);
>+	INIT_WORK(&bond->alb_promisc_disable_work, bond_alb_promisc_disable);
> 	return 0;
> }
>
>diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>index c6fdd85..43ba807 100644
>--- a/drivers/net/bonding/bonding.h
>+++ b/drivers/net/bonding/bonding.h
>@@ -198,7 +198,6 @@ struct bonding {
> 	s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
> 	rwlock_t lock;
> 	rwlock_t curr_slave_lock;
>-	s8       kill_timers;
> 	s8	 send_grat_arp;
> 	s8	 send_unsol_na;
> 	s8	 setup_by_slave;
>@@ -219,10 +218,14 @@ struct bonding {
> 	struct   vlan_group *vlgrp;
> 	struct   packet_type arp_mon_pt;
> 	struct   workqueue_struct *wq;
>+	struct   workqueue_struct *wq_rtnl;
> 	struct   delayed_work mii_work;
> 	struct   delayed_work arp_work;
> 	struct   delayed_work alb_work;
> 	struct   delayed_work ad_work;
>+	struct    work_struct miimon_commit_work;
>+	struct    work_struct ab_arp_commit_work;
>+	struct    work_struct alb_promisc_disable_work;
> #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
> 	struct   in6_addr master_ipv6;
> #endif
>@@ -348,6 +351,7 @@ void bond_select_active_slave(struct bonding *bond);
> void bond_change_active_slave(struct bonding *bond, struct slave *new_active);
> void bond_register_arp(struct bonding *);
> void bond_unregister_arp(struct bonding *);
>+void bond_alb_promisc_disable(struct work_struct *work);
>
> struct bond_net {
> 	struct net *		net;	/* Associated network namespace */
>
>-- 
>Jiri Bohac <jbohac@suse.cz>
>SUSE Labs, SUSE CZ

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
Jiri Bohac Sept. 1, 2010, 6:31 p.m. UTC | #2
On Wed, Sep 01, 2010 at 10:14:34AM -0700, Jay Vosburgh wrote:
> 	I also thought a bit more, and in the current code, the mode
> shouldn't change in the middle of one of the work functions, because a
> mode change requires the bond to be closed, so the various work things
> will be stopped (more or less; excepting the race under discussion
> here).
>
> 	I don't think this is true for the new wq_rtnl functions,
> however, because its work items are not canceled until the workqueue
> itself is freed in bond_destructor.  Does the wq_rtnl open new races,
> because its work items are not synchronized with other activities
> (close in particular)?  It's possible for its work functions (which may
> do things like set the active slave, carrier, etc) to be invoked after
> the bond has closed, and possibly reopened, or been otherwise adjusted.

I don't think this patch opens new races. The current race
scenario is:

1) schedule_delayed_work(foo)
2) foo's timer expires and foo is queued on bond->wq
  (possibly, foo starts to run and either gets preempted or
  sleeps on rtnl)
3) bond_close() sets kill_timers=1 and calls
  cancel_delayed_work() which accomplishes nothing
4) bond_open() sets kill_timers=0
5) bond_open() calls schedule_delayed_work(bar)
6) foo may run the "commit" work that should not be run
7) foo re-arms
8) if (foo == bar) -> BUG	/* bond->mode did not change */

With this patch, it is:

1) schedule_delayed_work(foo)
2) foo's timer expires and foo is queued on bond->wq
3) foo may have queued foo_commit on bond->wq_rtnl
4) bond_close() cancels foo
5) bond_open()
6) foo_commit may run and it should not be run

The patch avoids the problem of 7) and 8)

I think the race in 6) remains the same. It is now easier to fix.
This could even be done with a flag (similar to kill_timers),
which would be set each time the "commit" work is queued on wq_rtnl and
cleared by bond_close(). This should avoid the races completely,
I think. The trick is that, unlike kill_timers, bond_open() would
not touch this flag.
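
A minimal sketch of that flag, with a hypothetical field name
(commit_queued) that is not part of the patch below:

	/* in struct bonding: s8 commit_queued; */

	/* monitor work, under bond->lock: */
	if (bond_miimon_inspect(bond)) {
		bond->commit_queued = 1;
		queue_work(bond->wq_rtnl, &bond->miimon_commit_work);
	}

	/* "commit" work, after taking rtnl and bond->lock: */
	if (!bond->commit_queued)
		goto out;	/* bond_close() ran in between */

	/* bond_close(), under bond->lock; bond_open() never
	 * touches the flag: */
	bond->commit_queued = 0;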

> 	I'm not sure this is better than one of the alternatives I
> believe we discussed the last time around: having the rtnl-acquiring
> work functions do a conditional acquire of rtnl, and if that fails,
> reschedule themselves.

[...]

> 		if (rtnl_trylock()) {
> 			read_lock(&bond->lock);
> 
> 			bond_miimon_commit(bond);
> 
> 			read_unlock(&bond->lock);
> 			rtnl_unlock();	/* might sleep, hold no other locks */
> 			read_lock(&bond->lock);
> 		} else {
> 			read_lock(&bond->lock);
> 			queue_delayed_work(bond->wq, &bond->mii_work, 0);
> 			read_unlock(&bond->lock);
> 			return;
> 		}

I actually tried the other variant suggested last time
(basically:

while (!rtnl_trylock()) {
	read_lock(&bond->lock);
	kill = bond->kill_timers;
	read_unlock(&bond->lock);
	if (kill)
		return;
}

and gave that to a customer experiencing this problem (I cannot
reproduce it myself). It was reported to lock up. I suspect some
kind of live-lock on bond->lock caused by the active waiting, but
I did not spend too much time debugging this.
[BTW, this is https://bugzilla.novell.com/show_bug.cgi?id=602969
,Novell BZ account needed]

FWIW  this would be the only use of rtnl_trylock() in the kernel,
besides a few places that do:
	if (!rtnl_trylock()) return restart_syscall();
I think it is plain ugly -- semaphores are just not supposed to be
spun on.

Your re-scheduling variant is more or less equivalent to active
spinning, isn't it? Anyway, if you really think it is a better approach,
are you going to apply it? I can supply the patch. (Although I
kinda don't like people seeing my name next to it ;))
Jay Vosburgh Sept. 1, 2010, 8 p.m. UTC | #3
Jiri Bohac <jbohac@suse.cz> wrote:

>On Wed, Sep 01, 2010 at 10:14:34AM -0700, Jay Vosburgh wrote:
>> 	I also thought a bit more, and in the current code, the mode
>> shouldn't change in the middle of one of the work functions, because a
>> mode change requires the bond to be closed, so the various work things
>> will be stopped (more or less; excepting the race under discussion
>> here).
>>
>> 	I don't think this is true for the new wq_rtnl functions,
>> however, because its work items are not canceled until the workqueue
>> itself is freed in bond_destructor.  Does the wq_rtnl open new races,
>> because its work items are not synchronized with other activities
>> (close in particular)?  It's possible for its work functions (which may
>> do things like set the active slave, carrier, etc) to be invoked after
>> the bond has closed, and possibly reopened, or been otherwise adjusted.
>
>I don't think this patch opens new races. The current race
>scenario is:
>
>1) schedule_delayed_work(foo)
>2) foo's timer expires and foo is queued on bond->wq
>  (possibly, foo starts to run and either gets preempted or
>  sleeps on rtnl)
>3) bond_close() sets kill_timers=1 and calls
>  cancel_delayed_work() which accomplishes nothing
>4) bond_open() sets kill_timers=0
>5) bond_open() calls schedule_delayed_work(bar)
>6) foo may run the "commit" work that should not be run
>7) foo re-arms
>8) if (foo == bar) -> BUG	/* bond->mode did not change */
>
>With this patch, it is:
>
>1) schedule_delayed_work(foo)
>2) foo's timer expires and foo is queued on bond->wq
>3) foo may have queued foo_commit on bond->wq_rtnl
>4) bond_close() cancels foo
>5) bond_open()
>6) foo_commit may run and it should not be run
>
>The patch avoids the problem of 7) and 8)

	But the "with patch" #6 is a bigger window; it doesn't require
step 5; the foo_commit, et al, can always run after bond_close (because
nothing ever cancels the foo_commit except for unregistration).  That's
the part that makes me nervous.

	The current race, as I understand it, requires a "close then
open" sequence with little delay between the two.

>I think the race in 6) remains the same. It is now easier to fix.
>This could even be done with a flag (similar to kill_timers),
>which would be set each time the "commit" work is queued on wq_rtnl and
>cleared by bond_close(). This should avoid the races completely,
>I think. The trick is that, unlike kill_timers, bond_open() would
>not touch this flag.

	I'm chewing on whether or not it's feasible to introduce some
kind of generation count into bond_open/close, so that, e.g., at
bond_close, the generation is incremented.  Each time any of the work
items is queued, the current generation is stashed somewhere private to
that work item (in struct bonding, probably).  Then, when it runs, it
compares the current generation to the stored one.  If they don't match,
then the work item does nothing.
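
	(A minimal sketch of the idea, with hypothetical field names:)

	/* in struct bonding: unsigned long gen, mii_gen; */

	/* bond_close(), under bond->lock: */
	bond->gen++;

	/* when queuing the commit work, under bond->lock: */
	bond->mii_gen = bond->gen;
	queue_work(bond->wq_rtnl, &bond->miimon_commit_work);

	/* commit work, after taking rtnl and bond->lock: */
	if (bond->mii_gen != bond->gen)
		goto out;	/* stale: queued before the last close */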

	It's still a "kill_timers," but perhaps not subject to the
close/open issue that a boolean flag has.  It's also not all that
elegant, either, but maybe is less bad than the other alternatives.

>> 	I'm not sure this is better than one of the alternatives I
>> believe we discussed the last time around: having the rtnl-acquiring
>> work functions do a conditional acquire of rtnl, and if that fails,
>> reschedule themselves.
>
>[...]
>
>> 		if (rtnl_trylock()) {
>> 			read_lock(&bond->lock);
>> 
>> 			bond_miimon_commit(bond);
>> 
>> 			read_unlock(&bond->lock);
>> 			rtnl_unlock();	/* might sleep, hold no other locks */
>> 			read_lock(&bond->lock);
>> 		} else {
>> 			read_lock(&bond->lock);
>> 			queue_delayed_work(bond->wq, &bond->mii_work, 0);
>> 			read_unlock(&bond->lock);
>> 			return;
>> 		}
>
>I actually tried the other variant suggested last time
>(basically:
>
>while (!rtnl_trylock()) {
>	read_lock(&bond->lock);
>	kill = bond->kill_timers;
>	read_unlock(&bond->lock);
>	if (kill)
>		return;
>}
>
>and gave that to a customer experiencing this problem (I cannot
>reproduce it myself). It was reported to lock up. I suspect some
>kind of live-lock on bond->lock caused by the active waiting, but
>I did not spend too much time debugging this.
>[BTW, this is https://bugzilla.novell.com/show_bug.cgi?id=602969
>,Novell BZ account needed]

	My BZ account is unworthy to access that bug; can you provide
any information as to how they're hitting the problem?  Presumably
they're doing something that's doing a fast down/up cycle on the bond,
but anything else?

>FWIW  this would be the only use of rtnl_trylock() in the kernel,
>besides a few places that do:
>	if (!rtnl_trylock()) return restart_syscall();
>I think it is plain ugly -- semaphores are just not supposed to be
>spun on.
>
>Your re-scheduling variant is more or less equivalent to active
>spinning, isn't it? Anyway, if you really think it is a better approach,
>are you going to apply it? I can supply the patch. (Although I
>kinda don't like people seeing my name next to it ;))

	Well, it could have a queue_delayed_work() with a bit of real
time in there, and it's not a simple spin loop, there's actually a
scheduling step in the middle.  I'm reluctant to call it "better,"
though, as that kind of implies it's inherently "good."  "Less bad,"
perhaps.

	Anyway, maybe there are other ways to accomplish the end goal
(no executing of "stuff" after close) without resorting to either of
these strategies (since what we're really discussing here is relative
awfulness; neither is what I'd call a really good option).

	I'm wondering if there's any utility in the "generation count"
idea I mention above.  It's still a sentinel, but if that can be worked
out to reliably stop the work items after close, then maybe it's the
least bad option.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


Jiri Bohac Sept. 1, 2010, 8:56 p.m. UTC | #4
On Wed, Sep 01, 2010 at 01:00:38PM -0700, Jay Vosburgh wrote:
> Jiri Bohac <jbohac@suse.cz> wrote:
> >I don't think this patch opens new races. The current race
> >scenario is:
> >
> >1) schedule_delayed_work(foo)
> >2) foo's timer expires and foo is queued on bond->wq
> >  (possibly, foo starts to run and either gets preempted or
> >  sleeps on rtnl)
> >3) bond_close() sets kill_timers=1 and calls
> >  cancel_delayed_work() which accomplishes nothing
> >4) bond_open() sets kill_timers=0
> >5) bond_open() calls schedule_delayed_work(bar)
> >6) foo may run the "commit" work that should not be run
> >7) foo re-arms
> >8) if (foo == bar) -> BUG	/* bond->mode did not change */
> >
> >With this patch, it is:
> >
> >1) schedule_delayed_work(foo)
> >2) foo's timer expires and foo is queued on bond->wq
> >3) foo may have queued foo_commit on bond->wq_rtnl
> >4) bond_close() cancels foo
> >5) bond_open()
> >6) foo_commit may run and it should not be run
> >
> >The patch avoids the problem of 7) and 8)
> 
> 	But the "with patch" #6 is a bigger window; it doesn't require
> step 5; the foo_commit, et al, can always run after bond_close (because
> nothing ever cancels the foo_commit except for unregistration).  That's
> the part that makes me nervous.

We can always call cancel_work(foo_commit) in bond_close. This
should make the race window the same size it is now.
I didn't do that because I was thinking we'd avoid the race
somehow completely. Perhaps we can do cancel_work() now and solve
it cleanly later.

> 	The current race, as I understand it, requires a "close then
> open" sequence with little delay between the two.

Yeah, not sure how small the delay has to be. With releasing
bond->lock, acquiring rtnl and re-acquiring bond->lock in most of
the work items it may be pretty long. Putting an extra check for
kill_timers after bond->lock is re-acquired will make the window
much smaller ...  just in case this is the way we want to "fix"
race conditions ;-)

> >I think the race in 6) remains the same. It is now easier to fix.
> >This could even be done with a flag (similar to kill_timers),
> >which would be set each time the "commit" work is queued on wq_rtnl and
> >cleared by bond_close(). This should avoid the races completely,
> >I think. The trick is that, unlike kill_timers, bond_open() would
> >not touch this flag.
> 
> 	I'm chewing on whether or not it's feasible to introduce some
> kind of generation count into bond_open/close, so that, e.g., at
> bond_close, the generation is incremented.  Each time any of the work
> items is queued, the current generation is stashed somewhere private to
> that work item (in struct bonding, probably).  Then, when it runs, it
> compares the current generation to the stored one.  If they don't match,
> then the work item does nothing.

I thought about the generation count as well before I did this
patch. I don't think you can put the counter in struct bonding --
because that would be overwritten with the new value if the work
item is re-scheduled by bond_open.

I think you would have to create a new dynamic structure on each
work schedule and pass it to the work item in the "data" pointer.
The structure would contain the counter and the bond pointer. It
would be freed by the work item. I did not like this too much.
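
Roughly, as a hypothetical sketch (with today's work_struct the
context would be embedded and recovered via container_of rather
than a raw "data" pointer):

	struct commit_ctx {
		struct work_struct work;
		struct bonding *bond;
		unsigned long gen;	/* generation at queue time */
	};

	/* queuing side, under bond->lock, hence GFP_ATOMIC: */
	struct commit_ctx *ctx = kmalloc(sizeof(*ctx), GFP_ATOMIC);
	if (ctx) {
		INIT_WORK(&ctx->work, bond_commit_fn);
		ctx->bond = bond;
		ctx->gen = bond->gen;	/* counter bumped by bond_close() */
		queue_work(bond->wq_rtnl, &ctx->work);
	}

	/* hypothetical work function: */
	void bond_commit_fn(struct work_struct *work)
	{
		struct commit_ctx *ctx = container_of(work,
					struct commit_ctx, work);

		if (ctx->gen == ctx->bond->gen)
			/* ... take rtnl and do the commit ... */;
		kfree(ctx);
	}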

> >[BTW, this is https://bugzilla.novell.com/show_bug.cgi?id=602969
> >,Novell BZ account needeed]
> 
> 	My BZ account is unworthy to access that bug; can you provide
> any information as to how they're hitting the problem?  Presumably
> they're doing something that's doing a fast down/up cycle on the bond,
> but anything else?

They are doing "rcnetwork restart", which will do the
close->open. Perhaps all the contention on the rtnl (lots of work
with other network interfaces) makes the race window longer. I
couldn't reproduce this.

> 	I'm wondering if there's any utility in the "generation count"
> idea I mention above.  It's still a sentinel, but if that can be worked
> out to reliably stop the work items after close, then maybe it's the
> least bad option.

Not without the dynamic allocation, I think.
How about the "kill_timers" on top of this patch (see my
previous e-mail) -- a flag that would be set when queuing the
"commit" work and cleared by bond_close()?

While this cannot stop the re-arming race that kill_timers is
trying to stop now, it should be able to stop the "commit" work
items (where it does not matter if you try to queue them on the
workqueue a second time, since they are not delayed work).
Jay Vosburgh Sept. 2, 2010, 12:54 a.m. UTC | #5
Jiri Bohac <jbohac@suse.cz> wrote:

>On Wed, Sep 01, 2010 at 01:00:38PM -0700, Jay Vosburgh wrote:
>> Jiri Bohac <jbohac@suse.cz> wrote:
>> >I don't think this patch opens new races. The current race
>> >scenario is:
>> >
>> >1) schedule_delayed_work(foo)
>> >2) foo's timer expires and foo is queued on bond->wq
>> >  (possibly, foo starts to run and either gets preempted or
>> >  sleeps on rtnl)
>> >3) bond_close() sets kill_timers=1 and calls
>> >  cancel_delayed_work() which accomplishes nothing
>> >4) bond_open() sets kill_timers=0
>> >5) bond_open() calls schedule_delayed_work(bar)
>> >6) foo may run the "commit" work that should not be run
>> >7) foo re-arms
>> >8) if (foo == bar) -> BUG	/* bond->mode did not change */

	In looking at bond_open a bit, I wonder if a contributing factor
to the crash (if that's what happens) is that INIT_DELAYED_WORK is being
done in bond_open on a work item that's already queued or running (left
over from the kill_timers sentinel being missed).  Calling
queue_delayed_work when the work item is already queued is ok, I
believe, so I don't think that part is an issue (or at least not a
panic-worthy one).

	I suspect that the INIT_DELAYED_WORK calls will have to move to
bond_create if any of the work items end up being able to run beyond the
end of close (which seems likely; I'm running out of ideas).
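
	I.e., roughly, for mii_work (the arp_work handler is chosen
by mode at bond_open time, so it would need more care):

	/* in bond_init()/bond_create(), instead of bond_open(): */
	INIT_DELAYED_WORK(&bond->mii_work, bond_mii_monitor);

That way bond_open() only queues work items that are already
initialized, and a still-running item is never re-initialized
underneath itself.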

>> >With this patch, it is:
>> >
>> >1) schedule_delayed_work(foo)
>> >2) foo's timer expires and foo is queued on bond->wq
>> >3) foo may have queued foo_commit on bond->wq_rtnl
>> >4) bond_close() cancels foo
>> >5) bond_open()
>> >6) foo_commit may run and it should not be run
>> >
>> >The patch avoids the problem of 7) and 8)
>> 
>> 	But the "with patch" #6 is a bigger window; it doesn't require
>> step 5; the foo_commit, et al, can always run after bond_close (because
>> nothing ever cancels the foo_commit except for unregistration).  That's
>> the part that makes me nervous.
>
>We can always call cancel_work(foo_commit) in bond_close. This
>should make the race window the same size it is now.
>I didn't do that because I was thinking we'd avoid the race
>somehow completely. Perhaps we can do cancel_work() now and solve
>it cleanly later.
>
>> 	The current race, as I understand it, requires a "close then
>> open" sequence with little delay between the two.
>
>Yeah, not sure how small the delay has to be. With releasing
>bond->lock, acquiring rtnl and re-acquiring bond->lock in most of
>the work items it may be pretty long. Putting an extra check for
>kill_timers after bond->lock is re-acquired will make the window
>much smaller ...  just in case this is the way we want to "fix"
>race conditions ;-)
>
>> >I think the race in 6) remains the same. It is now easier to fix.
>> >This could even be done with a flag (similar to kill_timers),
>> >which would be set each time the "commit" work is queued on wq_rtnl and
>> >cleared by bond_close(). This should avoid the races completely,
>> >I think. The trick is that, unlike kill_timers, bond_open() would
>> >not touch this flag.
>> 
>> 	I'm chewing on whether or not it's feasible to introduce some
>> kind of generation count into bond_open/close, so that, e.g., at
>> bond_close, the generation is incremented.  Each time any of the work
>> items is queued, the current generation is stashed somewhere private to
>> that work item (in struct bonding, probably).  Then, when it runs, it
>> compares the current generation to the stored one.  If they don't match,
>> then the work item does nothing.
>
>I thought about the generation count as well before I did this
>patch. I don't think you can put the counter in struct bonding --
>because that would be overwritten with the new value if the work
>item is re-scheduled by bond_open.
>
>I think you would have to create a new dynamic structure on each
>work schedule and pass it to the work item in the "data" pointer.
>The structure would contain the counter and the bond pointer. It
>would be freed by the work item. I did not like this too much.
>
>> >[BTW, this is https://bugzilla.novell.com/show_bug.cgi?id=602969
>> >,Novell BZ account needeed]
>> 
>> 	My BZ account is unworthy to access that bug; can you provide
>> any information as to how they're hitting the problem?  Presumably
>> they're doing something that's doing a fast down/up cycle on the bond,
>> but anything else?
>
>They are doing "rcnetwork restart", which will do the
>close->open. Perhaps all the contention on the rtnl (lots of work
>with other network interfaces) makes the race window longer. I
>couldn't reproduce this.

	What actually happens when the failure occurs?  Is there a stack
dump?

>> 	I'm wondering if there's any utility in the "generation count"
>> idea I mention above.  It's still a sentinel, but if that can be worked
>> out to reliably stop the work items after close, then maybe it's the
>> least bad option.
>
>Not without the dynamic allocation, I think.
>How about the "kill_timers" on top of this patch (see my
>previous e-mail) -- a flag that would be set when queuing the
>"commit" work and cleared by bond_close()?
>
>While this cannot stop the re-arming race that kill_timers is
>trying to stop now, it should be able to stop the "commit" work
>items (where it does not matter if you try to queue them on the
>workqueue a second time, since they are not delayed work).
>
>-- 
>Jiri Bohac <jbohac@suse.cz>
>SUSE Labs, SUSE CZ

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com