From patchwork Sun Jan 26 08:54:46 2014
X-Patchwork-Submitter: Pablo Neira Ayuso
X-Patchwork-Id: 314158
Date: Sun, 26 Jan 2014 09:54:46 +0100
From: Pablo Neira Ayuso
To: Patrick McHardy
Cc: netfilter-devel@vger.kernel.org, arturo.borrero.glez@gmail.com
Subject: Re: [PATCH] netfilter: nf_tables: fix racy rule deletion
Message-ID: <20140126085446.GA4130@localhost>
References: <1390655031-4115-1-git-send-email-pablo@netfilter.org>
 <20140125135552.GA31554@macbook.localnet>
 <20140125163533.GA14235@macbook.localnet>
 <20140125171451.GA11373@macbook.localnet>
In-Reply-To: <20140125171451.GA11373@macbook.localnet>
X-Mailing-List: netfilter-devel@vger.kernel.org

On Sat, Jan 25, 2014 at 05:14:51PM +0000, Patrick McHardy wrote:
> On Sat, Jan 25, 2014 at 04:35:33PM +0000, Patrick McHardy wrote:
> > On Sat, Jan 25, 2014 at 01:55:52PM +0000, Patrick McHardy wrote:
> > > On Sat, Jan 25, 2014 at 02:03:51PM +0100, Pablo Neira Ayuso wrote:
> > > > We still have a bug somewhere else. When creating 10000 rules like
> > > > tcp dport { 22, 23 }, I can see more than 10000 sets:
> > > >
> > > > # ./nft-set-get ip | wc -l
> > > > 10016
> > > >
> > > > It seems set 511 is not being used. See below:
> > > >
> > > > # ./nft-rule-get
> > > > ip filter output 513 512
> > > >   [ payload load 1b @ network header + 9 => reg 1 ]
> > > >   [ cmp eq reg 1 0x00000006 ]
> > > >   [ payload load 2b @ transport header + 2 => reg 1 ]
> > > >   [ lookup reg 1 set set510 ]
> > > >   [ counter pkts 0 bytes 0 ]
> > > >
> > > > ip filter output 514 513
> > > >   [ payload load 1b @ network header + 9 => reg 1 ]
> > > >   [ cmp eq reg 1 0x00000006 ]
> > > >   [ payload load 2b @ transport header + 2 => reg 1 ]
> > > >   [ lookup reg 1 set set512 ]
> > > >   [ counter pkts 0 bytes 0 ]
> > > >
> > > > It seems to happen every time 512 sets are added. Still
> > > > investigating, so this needs a second follow-up patch to resolve
> > > > what Arturo is reporting.
> > >
> > > Yeah, we seem to have a couple of bugs in nf_tables_set_alloc_name().
> > > I'll fix them up and will then have a look at this patch.
> >
> > I can't reproduce the gaps in the name space, but we have an obvious
> > overflow since we're using BITS_PER_LONG * PAGE_SIZE instead of
> > BITS_PER_BYTE * PAGE_SIZE.
> >
> > This shouldn't have affected your test case though, since the overflow
> > only happens with more than 32768 sets.
>
> As a start, please try this patch. It fixes the overflow and might also
> fix your problem.
>
> diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
> index 9ce3053..e8c7437 100644
> --- a/net/netfilter/nf_tables_api.c
> +++ b/net/netfilter/nf_tables_api.c
> @@ -1989,13 +1992,13 @@ static int nf_tables_set_alloc_name(struct nft_ctx *ctx, struct nft_set *set,
>
>                  if (!sscanf(i->name, name, &tmp))
>                          continue;
> -                if (tmp < 0 || tmp > BITS_PER_LONG * PAGE_SIZE)
> +                if (tmp < 0 || tmp >= BITS_PER_BYTE * PAGE_SIZE)
>                          continue;
>
>                  set_bit(tmp, inuse);
>          }
>
> -        n = find_first_zero_bit(inuse, BITS_PER_LONG * PAGE_SIZE);
> +        n = find_first_zero_bit(inuse, BITS_PER_BYTE * PAGE_SIZE);
>          free_page((unsigned long)inuse);
>  }
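For the record, the arithmetic behind that overflow: the inuse bitmap is a
single page (note the free_page() in the hunk above), which holds
BITS_PER_BYTE * PAGE_SIZE bits, i.e. 32768 with 4K pages. The old bound of
BITS_PER_LONG * PAGE_SIZE is 262144 on 64-bit, so set_bit() with a parsed
index above 32767 wrote outside the page. A quick userspace sketch of the
numbers (the constants assume 4K pages and 64-bit longs; they are not taken
from the kernel headers):

/* Sanity check of the bitmap bounds; assumes 4K pages and 64-bit longs. */
#include <stdio.h>

#define PAGE_SIZE     4096UL
#define BITS_PER_BYTE 8UL
#define BITS_PER_LONG 64UL

int main(void)
{
        unsigned long capacity  = BITS_PER_BYTE * PAGE_SIZE; /* 32768 */
        unsigned long old_bound = BITS_PER_LONG * PAGE_SIZE; /* 262144 */

        /* Any index in [capacity, old_bound) made set_bit() write past
         * the allocated page, up to 8x its size. */
        printf("capacity %lu bits, old bound %lu bits (%lux too large)\n",
               capacity, old_bound, old_bound / capacity);
        return 0;
}

The change from > to >= in the same hunk also fixes an off-by-one at the
upper end, since valid bit indices run from 0 to capacity - 1.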
Tested this patch, it works fine here: I hit -EMFILE at 32768 sets, with no
crashes.

The problem I was reporting was different, though: I found a bug in the
batching code of libmnl. The mnl_nlmsg_batch_next() function was not
accounting for the last message when it did not fit into the batch. With my
patch plus the libmnl patch below, I can run:

 nft -f pablo-lots-test; nft flush table filter; nft delete chain filter output; nft delete table filter

without unused anonymous sets being left attached to the table and without
-EBUSY problems in that table.

> Another thing is that our name allocation algorithm really sucks. It
> was copied from dev_alloc_name(), but network device allocation doesn't
> happen on the scale that set allocation might. I'm considering switching
> to something that takes O(1). Basically, the name allocation is only
> useful for anonymous sets anyway, since in all other cases you need to
> manually populate them. So if we switch to a prefix string that can't
> clash with user-defined names, we can simply use an incrementing 64-bit
> counter. My proposal would be to just use names starting with \0.
> Alternatively, use a handle instead of a name for anonymous sets.
>
> The second upside is that it's no longer possible for the user to run
> into an unexpected EEXIST when using setN or mapN as a name.
>
> Thoughts?

I like the u64 handle for anonymous sets: it's similar to what we do with
other objects, and we get O(1) handle allocation. I think we can allow both
the u64 handle and the set%d/map%d names. The kernel can check if the
handle is available first; if not, it can check whether the name looks like
set%d or map%d (so the maximum-number-of-sets limitation only applies to
that case). Userspace just needs to send both the set%d name and the u64
handle. Would you be OK with that?
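To make that concrete, here is a minimal sketch of the O(1) scheme under
discussion (the struct and all names below are invented for illustration;
this is not actual nf_tables code):

#include <stdint.h>

/* Hypothetical per-table state, for illustration only. */
struct table {
        uint64_t next_anon_handle;      /* monotonic, never reused */
};

/*
 * O(1) allocation: just hand out the next counter value. Unlike the
 * "set%d" scheme there is no scan over the existing sets and no bitmap,
 * and a u64 counter cannot realistically wrap or run out.
 */
static uint64_t table_alloc_anon_handle(struct table *t)
{
        return t->next_anon_handle++;
}

Since such an identifier can never clash with a user-defined name, the
unexpected-EEXIST problem goes away as well, whichever encoding (a handle
or a \0-prefixed name) ends up on the wire.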
diff --git a/src/nlmsg.c b/src/nlmsg.c
index fdb7af8..0a414a7 100644
--- a/src/nlmsg.c
+++ b/src/nlmsg.c
@@ -484,14 +484,15 @@ EXPORT_SYMBOL(mnl_nlmsg_batch_stop);
 bool mnl_nlmsg_batch_next(struct mnl_nlmsg_batch *b)
 {
         struct nlmsghdr *nlh = b->cur;
+        bool ret = true;

         if (b->buflen + nlh->nlmsg_len > b->limit) {
                 b->overflow = true;
-                return false;
+                ret = false;
         }
         b->cur = b->buf + b->buflen + nlh->nlmsg_len;
         b->buflen += nlh->nlmsg_len;

-        return true;
+        return ret;
 }
 EXPORT_SYMBOL(mnl_nlmsg_batch_next);
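To put the fix in context, this is roughly the pattern callers use to drive
the batching API (a sketch only: build_msg(), its NLMSG_NOOP payload and
send_many() are made-up placeholders, not code from nft). Before the patch,
the message that crossed the batch limit was left out of the accounting, so
flushing mnl_nlmsg_batch_size() bytes silently dropped it, which is
consistent with one rule going missing each time a batch filled up, leaving
its anonymous set unused:

#include <stdint.h>
#include <unistd.h>
#include <libmnl/libmnl.h>

/* Placeholder message constructor, for illustration only. */
static void build_msg(void *buf, uint32_t seq)
{
        struct nlmsghdr *nlh = mnl_nlmsg_put_header(buf);

        nlh->nlmsg_type  = NLMSG_NOOP;  /* stand-in for a real command */
        nlh->nlmsg_flags = NLM_F_REQUEST;
        nlh->nlmsg_seq   = seq;
}

static void send_many(struct mnl_socket *nl, uint32_t n)
{
        /* Twice the limit, so the one message that crosses the limit
         * still fits in the buffer. */
        char buf[MNL_SOCKET_BUFFER_SIZE * 2];
        struct mnl_nlmsg_batch *b;
        uint32_t seq;

        b = mnl_nlmsg_batch_start(buf, MNL_SOCKET_BUFFER_SIZE);
        for (seq = 0; seq < n; seq++) {
                build_msg(mnl_nlmsg_batch_current(b), seq);
                if (!mnl_nlmsg_batch_next(b)) {
                        /* The batch is full. With the patch above, the
                         * message that crossed the limit is still counted
                         * in the batch size, so it goes out with this
                         * flush instead of being silently dropped. */
                        mnl_socket_sendto(nl, mnl_nlmsg_batch_head(b),
                                          mnl_nlmsg_batch_size(b));
                        mnl_nlmsg_batch_stop(b);
                        b = mnl_nlmsg_batch_start(buf,
                                                  MNL_SOCKET_BUFFER_SIZE);
                }
        }
        if (!mnl_nlmsg_batch_is_empty(b))
                mnl_socket_sendto(nl, mnl_nlmsg_batch_head(b),
                                  mnl_nlmsg_batch_size(b));
        mnl_nlmsg_batch_stop(b);
}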