From patchwork Wed May 6 23:03:48 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Cong Wang X-Patchwork-Id: 469171 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 8E202140281 for ; Thu, 7 May 2015 09:03:54 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751638AbbEFXDu (ORCPT ); Wed, 6 May 2015 19:03:50 -0400 Received: from mail-wg0-f50.google.com ([74.125.82.50]:33815 "EHLO mail-wg0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751459AbbEFXDt (ORCPT ); Wed, 6 May 2015 19:03:49 -0400 Received: by wgic8 with SMTP id c8so354895wgi.1 for ; Wed, 06 May 2015 16:03:48 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:cc:content-type; bh=zr9K0arXOs3cGiaoJTqD+tmGng1JacSI32poiO6+iHo=; b=EsgqbB1Ys+Ff2zAPvJcmY25kXeeBHy3bYRpc9aYrr/OTv0FakesbKkhsSF25t+bSNC h3QsTlp0LrlHpqDvK/beJeoPMgrG7RHSs+Z6YSqZeI7bSfw1mKXEjofpLXceGN2WHT7o okFWtXURl9E0OROWMtCXJrDCoX9ec9ZFiaL349uiArqE/12nRFl0QfrpYJTXjasHjMhR wor5PfkLXcUusVoi1yTzr9WxaeLNPmWQELXGGv9R4d+uuryX8JajGimZi64sL7iWs3cZ fKmAHrzxjwhYKfZ/wawAU9i5lrBCL6yIAJ6j+Qg71EBMFisYu1mFhD0weiJkjXihtxxM q2oQ== X-Gm-Message-State: ALoCoQnph9QA1gPdiXFDKNR46LLB24FPrKq55J4/8oTScHKKr7W++2tFVkxEmxeIqd2BhlggGxnk MIME-Version: 1.0 X-Received: by 10.194.78.12 with SMTP id x12mr1774577wjw.112.1430953428357; Wed, 06 May 2015 16:03:48 -0700 (PDT) Received: by 10.28.24.211 with HTTP; Wed, 6 May 2015 16:03:48 -0700 (PDT) In-Reply-To: <1430892526-2446-1-git-send-email-ying.xue@windriver.com> References: <1430892526-2446-1-git-send-email-ying.xue@windriver.com> Date: Wed, 6 May 2015 16:03:48 -0700 Message-ID: Subject: Re: [RFC PATCH v2 net-next] netlink: avoid namespace change while creating socket From: Cong Wang To: Ying Xue Cc: netdev , Herbert Xu , xemul@openvz.org, David Miller , Eric Dumazet , "Eric W. Biederman" Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org On Tue, May 5, 2015 at 11:08 PM, Ying Xue wrote: > Commit 23fe18669e7f ("[NETNS]: Fix race between put_net() and > netlink_kernel_create().") attempts to fix the following race > scenario: > > put_net() > if (atomic_dec_and_test(&net->refcnt)) > /* true */ > __put_net(net); > queue_work(...); > > /* > * note: the net now has refcnt 0, but still in > * the global list of net namespaces > */ > > == re-schedule == > > register_pernet_subsys(&some_ops); > register_pernet_operations(&some_ops); > (*some_ops)->init(net); > /* > * we call netlink_kernel_create() here > * in some places > */ > netlink_kernel_create(); > sk_alloc(); > get_net(net); /* refcnt = 1 */ > /* > * now we drop the net refcount not to > * block the net namespace exit in the > * future (or this can be done on the > * error path) > */ > put_net(sk->sk_net); > if (atomic_dec_and_test(&...)) > /* > * true. BOOOM! The net is > * scheduled for release twice > */ > > In order to prevent the race from happening, the commit adopted the > following solution: create netlink socket inside init_net namespace > and then re-attach it to the desired one right after the socket is > created; similarly, when close the socket, move back its namespace > to init_net so that the socket can be destroyed in the context which > is same as the socket creation. > > Actually the proposal artificially makes the whole thing complex. > Instead there exists a simpler solution to avoid the risk of net > double release: if we find that the net reference counter reaches > zero before the reference counter will be increased in sk_alloc(), > we can identify that the process of the net namespace exit happening > in workqueue is not finished yet. At the moment, we should immediately > exit from sk_alloc() to avoid the risk. This is because once refcount > reaches zero, the net will be definetely destroyed later in workqueue > whatever we take its refcount or not. This solution is not only simple > and easily understandable, but also it can help to avoid the redundant > namespace change. > Hmm, why does the caller have to handle some race condition of the callee? Isn't this solvable at netns API layer? How about the following patch (PoC ONLY) ? __put_net() should be able to detect a pending cleanup work, shouldn't it? error = ops_init(ops, net); @@ -409,12 +410,17 @@ void __put_net(struct net *net) { /* Cleanup the network namespace in process context */ unsigned long flags; + bool added = false; spin_lock_irqsave(&cleanup_list_lock, flags); - list_add(&net->cleanup_list, &cleanup_list); + if (list_empty(&net->cleanup_list)) { + list_add(&net->cleanup_list, &cleanup_list); + added = true; + } spin_unlock_irqrestore(&cleanup_list_lock, flags); - queue_work(netns_wq, &net_cleanup_work); + if (added) + queue_work(netns_wq, &net_cleanup_work); } EXPORT_SYMBOL_GPL(__put_net); --- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 78fc04a..ded15a7 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -242,6 +242,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns) net->dev_base_seq = 1; net->user_ns = user_ns; idr_init(&net->netns_ids); + LIST_HEAD_INIT(&net->cleanup_list); list_for_each_entry(ops, &pernet_list, list) {