From patchwork Tue Nov 7 10:50:07 2017
X-Patchwork-Submitter: Jay Vosburgh
X-Patchwork-Id: 835232
X-Patchwork-Delegate: davem@davemloft.net
From: Jay Vosburgh
To: netdev@vger.kernel.org
Cc: Alex Sidorenko, Mahesh Bandewar, Jarod Wilson, Veaceslav Falico,
 Andy Gospodarek, "David Miller"
Subject: [PATCH net] bonding: fix slave stuck in BOND_LINK_FAIL state
Date: Tue, 07 Nov 2017 19:50:07 +0900
Message-ID: <15116.1510051807@nyx>

The bonding miimon logic has a flaw, in that a failure of the
rtnl_trylock can cause a slave to become permanently stuck in
BOND_LINK_FAIL state.

The sequence of events to cause this is as follows:

1) bond_miimon_inspect finds that a slave's link is down, and so
calls bond_propose_link_state, setting slave->link_new_state to
BOND_LINK_FAIL, then sets slave->new_link to BOND_LINK_DOWN and
returns non-zero.

2) In bond_mii_monitor, the rtnl_trylock fails, and the timer is
rescheduled. No change is committed.

3) bond_miimon_inspect is called again, but this time the slave from
step 1 has recovered. slave->new_link is reset to NOCHANGE, and, as
slave->link was never changed, the switch enters the BOND_LINK_UP
case, and does nothing. The pending BOND_LINK_FAIL state from step 1
remains pending, as link_new_state is not reset.

4) The state from step 3 persists until another slave changes link
state and causes bond_miimon_inspect to return non-zero. At this
point, the BOND_LINK_FAIL state change on the slave from steps 1-3 is
committed, and the slave will remain stuck in BOND_LINK_FAIL state
even though it is actually link up.

The remedy for this is to initialize link_new_state on each entry
to bond_miimon_inspect, as is already done with new_link.

Reported-by: Alex Sidorenko
Reviewed-by: Jarod Wilson
Signed-off-by: Jay Vosburgh
Fixes: fb9eb899a6dc ("bonding: handle link transition from FAIL to UP correctly")
Acked-by: Mahesh Bandewar
---
 drivers/net/bonding/bond_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index c99dc59d729b..167434e952da 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2042,6 +2042,7 @@ static int bond_miimon_inspect(struct bonding *bond)
 
 	bond_for_each_slave_rcu(bond, slave, iter) {
 		slave->new_link = BOND_LINK_NOCHANGE;
+		slave->link_new_state = slave->link;
 
 		link_state = bond_check_dev_link(bond, slave->dev, 0);
 