diff mbox

[nf-next] netfilter: nf_tables: export rule-set generation ID

Message ID 1410282105-12880-1-git-send-email-pablo@netfilter.org
State Changes Requested
Delegated to: Pablo Neira
Headers show

Commit Message

Pablo Neira Ayuso Sept. 9, 2014, 5:01 p.m. UTC
This patch adds the NFT_MSG_GENID command to nf_tables that exposes
the 16-bits ruleset generation ID. This ID is incremented in every
commit. The generation ID is also exposed to userspace through the
nfnetlink res_id header field when dumping object lists.

This generation ID allows a userspace client to detect that an update
has happened between two consecutive object list dumps, so it can
retry from scratch.

This is complementary to the NLM_F_DUMP_INTR approach, which allows
us to detect an interference in the middle one single list dumping.
There is no way to explicitly check that an interference has occurred
between two list dumps from the kernel, since it doesn't know how
many lists the userspace client is actually going to dump.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/uapi/linux/netfilter/nf_tables.h |    2 +
 net/netfilter/nf_tables_api.c            |   64 +++++++++++++++++++++++++++---
 2 files changed, 60 insertions(+), 6 deletions(-)

Comments

Patrick McHardy Sept. 9, 2014, 6:20 p.m. UTC | #1
On Tue, Sep 09, 2014 at 07:01:45PM +0200, Pablo Neira Ayuso wrote:
> This patch adds the NFT_MSG_GENID command to nf_tables that exposes
> the 16-bits ruleset generation ID. This ID is incremented in every
> commit. The generation ID is also exposed to userspace through the
> nfnetlink res_id header field when dumping object lists.
> 
> This generation ID allows a userspace client to detect that an update
> has happened between two consecutive object list dumps, so it can
> retry from scratch.
> 
> This is complementary to the NLM_F_DUMP_INTR approach, which allows
> us to detect an interference in the middle one single list dumping.
> There is no way to explicitly check that an interference has occurred
> between two list dumps from the kernel, since it doesn't know how
> many lists the userspace client is actually going to dump.

Well, the obvious question is, are we sure that 16 bit is enough?
I mean, sure, it most likely is for almost any usecase, but if you
want to write a reliable piece of software using it, can you be
sure and how could you handle the case that it overflows?

If we for instance consider an optimization algorithm that wants to
perform an update and has to dump all objects, perform its change,
reoptimize the resulting ruleset and then insert it. This might take
some considerable amount of time. I guess its at least possible that
2^16 updates could be performed in the same period.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso Sept. 10, 2014, 2:40 p.m. UTC | #2
On Tue, Sep 09, 2014 at 07:20:16PM +0100, Patrick McHardy wrote:
> On Tue, Sep 09, 2014 at 07:01:45PM +0200, Pablo Neira Ayuso wrote:
> > This patch adds the NFT_MSG_GENID command to nf_tables that exposes
> > the 16-bits ruleset generation ID. This ID is incremented in every
> > commit. The generation ID is also exposed to userspace through the
> > nfnetlink res_id header field when dumping object lists.
> > 
> > This generation ID allows a userspace client to detect that an update
> > has happened between two consecutive object list dumps, so it can
> > retry from scratch.
> > 
> > This is complementary to the NLM_F_DUMP_INTR approach, which allows
> > us to detect an interference in the middle one single list dumping.
> > There is no way to explicitly check that an interference has occurred
> > between two list dumps from the kernel, since it doesn't know how
> > many lists the userspace client is actually going to dump.
> 
> Well, the obvious question is, are we sure that 16 bit is enough?
> I mean, sure, it most likely is for almost any usecase, but if you
> want to write a reliable piece of software using it, can you be
> sure and how could you handle the case that it overflows?
> 
> If we for instance consider an optimization algorithm that wants to
> perform an update and has to dump all objects, perform its change,
> reoptimize the resulting ruleset and then insert it. This might take
> some considerable amount of time. I guess its at least possible that
> 2^16 updates could be performed in the same period.

Right. nf-hipac was taking quite some time to perform those
transformations (I remember they were masking the update time by using
a kernel thread in the netlink receive path), so it's reasonable to
assume that those transformation algorithms may take enough time to
see a wrap-around.

I think of several alternatives:

1) Switch to 2^32. With this approach, I don't see a way to
expose it via the existing headers (ie. nfnetlink res_id was easy).
Thus, we don't have a way to stop a dump immediately after when we
notice the objects we're getting are stale. So the approach should
look more like:

request genID and annotate it
dump objects
request genID and compare what we annotated, if unequal, retry.

This can be expensive when we get an unfortunate update between two
list dumps when we operate with large object list though. BTW, the
update between two list dumps is unlikely but still possible.

2) Stick to 2^16 but introduce a commit event. The optimization
algorithm can subscribe to this events so it notices when it's been
working with stale objects. The optimization software would need to
poll for commit events and we still may lose events.

I think we have to go 1).

Regarding the commit event, Arturo wants this for nft-sync anyway, so
he knows where the batch ends to consistently replicate and apply the
rules to the other peer. So this can come in a separated patch.

Let me know, thanks!
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Patrick McHardy Sept. 10, 2014, 3:23 p.m. UTC | #3
On Wed, Sep 10, 2014 at 04:40:54PM +0200, Pablo Neira Ayuso wrote:
> On Tue, Sep 09, 2014 at 07:20:16PM +0100, Patrick McHardy wrote:
> > On Tue, Sep 09, 2014 at 07:01:45PM +0200, Pablo Neira Ayuso wrote:
> > > This patch adds the NFT_MSG_GENID command to nf_tables that exposes
> > > the 16-bits ruleset generation ID. This ID is incremented in every
> > > commit. The generation ID is also exposed to userspace through the
> > > nfnetlink res_id header field when dumping object lists.
> > > 
> > > This generation ID allows a userspace client to detect that an update
> > > has happened between two consecutive object list dumps, so it can
> > > retry from scratch.
> > > 
> > > This is complementary to the NLM_F_DUMP_INTR approach, which allows
> > > us to detect an interference in the middle one single list dumping.
> > > There is no way to explicitly check that an interference has occurred
> > > between two list dumps from the kernel, since it doesn't know how
> > > many lists the userspace client is actually going to dump.
> > 
> > Well, the obvious question is, are we sure that 16 bit is enough?
> > I mean, sure, it most likely is for almost any usecase, but if you
> > want to write a reliable piece of software using it, can you be
> > sure and how could you handle the case that it overflows?
> > 
> > If we for instance consider an optimization algorithm that wants to
> > perform an update and has to dump all objects, perform its change,
> > reoptimize the resulting ruleset and then insert it. This might take
> > some considerable amount of time. I guess its at least possible that
> > 2^16 updates could be performed in the same period.
> 
> Right. nf-hipac was taking quite some time to perform those
> transformations (I remember they were masking the update time by using
> a kernel thread in the netlink receive path), so it's reasonable to
> assume that those transformation algorithms may take enough time to
> see a wrap-around.
> 
> I think of several alternatives:
> 
> 1) Switch to 2^32. With this approach, I don't see a way to
> expose it via the existing headers (ie. nfnetlink res_id was easy).
> Thus, we don't have a way to stop a dump immediately after when we
> notice the objects we're getting are stale. So the approach should
> look more like:
> 
> request genID and annotate it
> dump objects
> request genID and compare what we annotated, if unequal, retry.
> 
> This can be expensive when we get an unfortunate update between two
> list dumps when we operate with large object list though. BTW, the
> update between two list dumps is unlikely but still possible.

Alternatively put it into an attribute and include that with every
message or object.

> 2) Stick to 2^16 but introduce a commit event. The optimization
> algorithm can subscribe to this events so it notices when it's been
> working with stale objects. The optimization software would need to
> poll for commit events and we still may lose events.

Since events are unreliable, so this wouldn't work that easily. On
message loss you'd have to always start over. OTOH probably not
a big problem, at least currently, because message loss only happens
when other notifications are sent which, at present, obviously indicates
some change has been made.

> I think we have to go 1).
> 
> Regarding the commit event, Arturo wants this for nft-sync anyway, so
> he knows where the batch ends to consistently replicate and apply the
> rules to the other peer. So this can come in a separated patch.
> 
> Let me know, thanks!
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index c000947..351418a 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -51,6 +51,7 @@  enum nft_verdicts {
  * @NFT_MSG_NEWSETELEM: create a new set element (enum nft_set_elem_attributes)
  * @NFT_MSG_GETSETELEM: get a set element (enum nft_set_elem_attributes)
  * @NFT_MSG_DELSETELEM: delete a set element (enum nft_set_elem_attributes)
+ * @NFT_MSG_GENID: get the rule-set generation ID (no attributes)
  */
 enum nf_tables_msg_types {
 	NFT_MSG_NEWTABLE,
@@ -68,6 +69,7 @@  enum nf_tables_msg_types {
 	NFT_MSG_NEWSETELEM,
 	NFT_MSG_GETSETELEM,
 	NFT_MSG_DELSETELEM,
+	NFT_MSG_GENID,
 	NFT_MSG_MAX,
 };
 
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index eac4fab..e5d0e93 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -222,6 +222,7 @@  static int nf_tables_fill_table_info(struct sk_buff *skb, u32 portid, u32 seq,
 {
 	struct nlmsghdr *nlh;
 	struct nfgenmsg *nfmsg;
+	struct net *net = sock_net(skb->sk);
 
 	event |= NFNL_SUBSYS_NFTABLES << 8;
 	nlh = nlmsg_put(skb, portid, seq, event, sizeof(struct nfgenmsg), flags);
@@ -231,7 +232,7 @@  static int nf_tables_fill_table_info(struct sk_buff *skb, u32 portid, u32 seq,
 	nfmsg = nlmsg_data(nlh);
 	nfmsg->nfgen_family	= family;
 	nfmsg->version		= NFNETLINK_V0;
-	nfmsg->res_id		= 0;
+	nfmsg->res_id		= htons(net->nft.base_seq & 0xffff);
 
 	if (nla_put_string(skb, NFTA_TABLE_NAME, table->name) ||
 	    nla_put_be32(skb, NFTA_TABLE_FLAGS, htonl(table->flags)) ||
@@ -690,6 +691,7 @@  static int nf_tables_fill_chain_info(struct sk_buff *skb, u32 portid, u32 seq,
 {
 	struct nlmsghdr *nlh;
 	struct nfgenmsg *nfmsg;
+	struct net *net = sock_net(skb->sk);
 
 	event |= NFNL_SUBSYS_NFTABLES << 8;
 	nlh = nlmsg_put(skb, portid, seq, event, sizeof(struct nfgenmsg), flags);
@@ -699,7 +701,7 @@  static int nf_tables_fill_chain_info(struct sk_buff *skb, u32 portid, u32 seq,
 	nfmsg = nlmsg_data(nlh);
 	nfmsg->nfgen_family	= family;
 	nfmsg->version		= NFNETLINK_V0;
-	nfmsg->res_id		= 0;
+	nfmsg->res_id		= htons(net->nft.base_seq & 0xffff);
 
 	if (nla_put_string(skb, NFTA_CHAIN_TABLE, table->name))
 		goto nla_put_failure;
@@ -1449,6 +1451,7 @@  static int nf_tables_fill_rule_info(struct sk_buff *skb, u32 portid, u32 seq,
 	struct nlattr *list;
 	const struct nft_rule *prule;
 	int type = event | NFNL_SUBSYS_NFTABLES << 8;
+	struct net *net = sock_net(skb->sk);
 
 	nlh = nlmsg_put(skb, portid, seq, type, sizeof(struct nfgenmsg),
 			flags);
@@ -1458,7 +1461,7 @@  static int nf_tables_fill_rule_info(struct sk_buff *skb, u32 portid, u32 seq,
 	nfmsg = nlmsg_data(nlh);
 	nfmsg->nfgen_family	= family;
 	nfmsg->version		= NFNETLINK_V0;
-	nfmsg->res_id		= 0;
+	nfmsg->res_id		= htons(net->nft.base_seq & 0xffff);
 
 	if (nla_put_string(skb, NFTA_RULE_TABLE, table->name))
 		goto nla_put_failure;
@@ -2204,7 +2207,7 @@  static int nf_tables_fill_set(struct sk_buff *skb, const struct nft_ctx *ctx,
 	nfmsg = nlmsg_data(nlh);
 	nfmsg->nfgen_family	= ctx->afi->family;
 	nfmsg->version		= NFNETLINK_V0;
-	nfmsg->res_id		= 0;
+	nfmsg->res_id		= htons(ctx->net->nft.base_seq & 0xffff);
 
 	if (nla_put_string(skb, NFTA_SET_TABLE, ctx->table->name))
 		goto nla_put_failure;
@@ -2836,7 +2839,7 @@  static int nf_tables_dump_set(struct sk_buff *skb, struct netlink_callback *cb)
 	nfmsg = nlmsg_data(nlh);
 	nfmsg->nfgen_family = ctx.afi->family;
 	nfmsg->version      = NFNETLINK_V0;
-	nfmsg->res_id       = 0;
+	nfmsg->res_id	    = htons(ctx.net->nft.base_seq & 0xffff);
 
 	if (nla_put_string(skb, NFTA_SET_ELEM_LIST_TABLE, ctx.table->name))
 		goto nla_put_failure;
@@ -2917,7 +2920,7 @@  static int nf_tables_fill_setelem_info(struct sk_buff *skb,
 	nfmsg = nlmsg_data(nlh);
 	nfmsg->nfgen_family	= ctx->afi->family;
 	nfmsg->version		= NFNETLINK_V0;
-	nfmsg->res_id		= 0;
+	nfmsg->res_id		= htons(ctx->net->nft.base_seq & 0xffff);
 
 	if (nla_put_string(skb, NFTA_SET_TABLE, ctx->table->name))
 		goto nla_put_failure;
@@ -3204,6 +3207,52 @@  static int nf_tables_delsetelem(struct sock *nlsk, struct sk_buff *skb,
 	return err;
 }
 
+static int nf_tables_fill_genid_info(struct sk_buff *skb, struct net *net,
+				     u32 portid, u32 seq)
+{
+	struct nlmsghdr *nlh;
+	struct nfgenmsg *nfmsg;
+	int event = (NFNL_SUBSYS_NFTABLES << 8) | NFT_MSG_GENID;
+
+	nlh = nlmsg_put(skb, portid, seq, event, sizeof(struct nfgenmsg), 0);
+	if (nlh == NULL)
+		goto nla_put_failure;
+
+	nfmsg = nlmsg_data(nlh);
+	nfmsg->nfgen_family	= AF_UNSPEC;
+	nfmsg->version		= NFNETLINK_V0;
+	nfmsg->res_id		= htons(net->nft.base_seq & 0xffff);
+
+	return nlmsg_end(skb, nlh);
+
+nla_put_failure:
+	nlmsg_trim(skb, nlh);
+	return -EMSGSIZE;
+}
+
+static int nf_tables_getgenid(struct sock *nlsk, struct sk_buff *skb,
+			      const struct nlmsghdr *nlh,
+			      const struct nlattr * const nla[])
+{
+	struct net *net = sock_net(skb->sk);
+	struct sk_buff *skb2;
+	int err;
+
+	skb2 = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (skb2 == NULL)
+		return -ENOMEM;
+
+	err = nf_tables_fill_genid_info(skb2, net, NETLINK_CB(skb).portid,
+					nlh->nlmsg_seq);
+	if (err < 0)
+		goto err;
+
+	return nlmsg_unicast(nlsk, skb2, NETLINK_CB(skb).portid);
+err:
+	kfree_skb(skb2);
+	return err;
+}
+
 static const struct nfnl_callback nf_tables_cb[NFT_MSG_MAX] = {
 	[NFT_MSG_NEWTABLE] = {
 		.call_batch	= nf_tables_newtable,
@@ -3280,6 +3329,9 @@  static const struct nfnl_callback nf_tables_cb[NFT_MSG_MAX] = {
 		.attr_count	= NFTA_SET_ELEM_LIST_MAX,
 		.policy		= nft_set_elem_list_policy,
 	},
+	[NFT_MSG_GENID] = {
+		.call		= nf_tables_getgenid,
+	},
 };
 
 static void nft_chain_commit_update(struct nft_trans *trans)