Patchwork [nftables,3/9] netfilter: nf_tables: atomic rule updates and dumps

login
register
mail settings
Submitter Pablo Neira
Date Jan. 31, 2013, 12:03 a.m.
Message ID <1359590645-4703-3-git-send-email-pablo@netfilter.org>
Download mbox | patch
Permalink /patch/217051/
State Accepted
Headers show

Comments

Pablo Neira - Jan. 31, 2013, 12:03 a.m.
From: Pablo Neira Ayuso <pablo@netfilter.org>

This patch adds global atomic rule updates and dumps based on
bitmask generations. This allows to atomically commit a set of
rule-set updates incrementally without altering the internal
state of existing nf_tables expressions/matches/targets.

The idea consists of using a generation cursor of 1 bit and
a bitmask of 2 bits per rule. Assuming the gencursor is 0,
then the genmask (expressed as a bitmask) can be interpreted
as:

00 active in the present, will be active in the next generation.
01 inactive in the present, will be active in the next generation.
10 active in the present, will be deleted in the next generation.
 ^
 gencursor

Once you invoke the transition to the next generation, the global
gencursor is updated:

00 active in the present, will be active in the next generation.
01 active in the present, needs to zero its future, it becomes 00.
10 inactive in the present, delete now.
^
gencursor

If a dump is in progress and nf_tables enters a new generation,
the dump will stop and return -EBUSY to let userspace know that
it has to retry again. In order to invalidate dumps, a global
genctr counter is increased everytime nf_tables enters a new
generation.

This new operation can be used from the user-space utility
that controls the firewall, eg.

nft restore < file

The rule updates contained in `file' will be applied atomically.

cat file
-----
add filter INPUT ip saddr 1.1.1.1 counter accept #1
del filter INPUT ip daddr 2.2.2.2 counter drop   #2
commit                                           #3
-EOF-

Note that the rule 1 will be inactive until the transition to the
next generation, the rule 2 will be evicted in the next generation.

There is a penalty during the rule update due to the branch
misprediction in the packet matching framework. But that should be
quickly resolved once the iteration over the dirty list that
contain rules that require updates is finished.

Event notification happens once the rule-set update has been
committed. So we skip notifications is case the rule-set update
is aborted, which can happen in case that the rule-set is tested
to apply correctly.

This patch is based on ideas extracted from discussions with
Patrick McHardy.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netfilter/nf_tables.h |    8 ++
 include/net/netfilter/nf_tables.h   |   10 +-
 include/net/netns/nftables.h        |    2 +
 net/netfilter/nf_tables_api.c       |  227 ++++++++++++++++++++++++++++++++---
 net/netfilter/nf_tables_core.c      |   10 ++
 5 files changed, 240 insertions(+), 17 deletions(-)
Tomasz Bursztyka - Feb. 18, 2013, 5:17 p.m.
Hi Pablo,

Hopefully you mentioned such patchset, went far away in my mail box.
I am fine with most of them but this particular one.
Some small parts should be reworked a bit.

The overall idea is nice, especially the bit field so it does not make 
the nft_rule bigger.

> diff --git a/include/linux/netfilter/nf_tables.h b/include/linux/netfilter/nf_tables.h
> index 7640290..3749069 100644
> --- a/include/linux/netfilter/nf_tables.h
> +++ b/include/linux/netfilter/nf_tables.h
> @@ -37,6 +37,8 @@ enum nf_tables_msg_types {
>   	NFT_MSG_NEWSETELEM,
>   	NFT_MSG_GETSETELEM,
>   	NFT_MSG_DELSETELEM,
> +	NFT_MSG_COMMIT,
> +	NFT_MSG_ABORT,
>   	NFT_MSG_MAX,
>   };
>   
> @@ -85,12 +87,18 @@ enum nft_chain_attributes {
>   };
>   #define NFTA_CHAIN_MAX		(__NFTA_CHAIN_MAX - 1)
>   
> +enum {
> +	NFT_RULE_F_COMMIT	= (1 << 0),
> +	NFT_RULE_F_MASK		= NFT_RULE_F_COMMIT,
> +};
> +

I guess you added this flags to add rules on non-transaction based 
use-case, so it commits right away.
Wouldn't it be better to have no such flags in the API, but instead keep 
track of an on going transaction in the struct nft_ctx (let's have an 
internal flag here).
I.E: no on-going transaction -> we directly commit. If not, it's handled 
as being part of the transaction.

I just took a look at nfnetlink however, it does not seem possible to 
keep a data pointer per connection on the subsystem.
Wait, there is a field in struct sock that is meant for such usage. It 
would require to change nf_tables_api.c only then.
Would it make sense or am I wrong with nft_ctx usage?  (there would be 
an issue: to free such context when the connection goes down, we could 
abort an on-going transaction this way then)

>   enum nft_rule_attributes {
>   	NFTA_RULE_UNSPEC,
>   	NFTA_RULE_TABLE,
>   	NFTA_RULE_CHAIN,
>   	NFTA_RULE_HANDLE,
>   	NFTA_RULE_EXPRESSIONS,
> +	NFTA_RULE_FLAGS,
>   	__NFTA_RULE_MAX
>   };

... then it would not require such extra arguments, we would keep a 
simple api.

>    *	@rcu_head: used internally for rcu
>    *	@handle: rule handle
> + *	@genmask: generation mask
>    *	@dlen: length of expression data
>    *	@data: expression data
>    */
>   struct nft_rule {
>   	struct list_head		list;
> +	struct list_head		dirty_list;
>   	struct rcu_head			rcu_head;
> -	u64				handle:48,
> +	u64				handle:46,
> +					genmask:2,
>   					dlen:16;
>   	unsigned char			data[]
>   		__attribute__((aligned(__alignof__(struct nft_expr))));
> @@ -366,8 +370,10 @@ enum nft_chain_flags {
>    *	struct nft_chain - nf_tables chain
>    *
>    *	@rules: list of rules in the chain
> + *	@dirty_rules: rules that need an update after next generation

See the end of the patch, on abort or commit function. I think we should 
try to do something better (performance wise)

>    *	@list: used internally
>    *	@rcu_head: used internally
> + *	@net: net namespace that this chain belongs to

I would see that in another patch, even if it's a really tiny one. 
(preceding this current one)
Moreover that we will have to full support the namespaces at some point, 
right?

> +static int nf_tables_commit(struct sock *nlsk, struct sk_buff *skb,
> +			    const struct nlmsghdr *nlh,
> +			    const struct nlattr * const nla[])
> +{
> +	const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
> +	const struct nft_af_info *afi;
> +	struct net *net = sock_net(skb->sk);
> +	struct nft_table *table;
> +	struct nft_chain *chain;
> +	struct nft_rule *rule, *tmp;
> +	int family = nfmsg->nfgen_family;
> +	bool create;
> +
> +	create = nlh->nlmsg_flags & NLM_F_CREATE ? true : false;
> +
> +	afi = nf_tables_afinfo_lookup(net, nfmsg->nfgen_family, create);
> +	if (IS_ERR(afi))
> +		return PTR_ERR(afi);
> +
> +	/* Bump generation counter, invalidate any dump in progress */
> +	net->nft.genctr++;
> +
> +	/* A new generation has just started */
> +	net->nft.gencursor++;
> +
> +	/* Make sure all packets have left the previous generation before
> +	 * purging old rules.
> +	 */
> +	synchronize_rcu();
> +
> +	list_for_each_entry(table, &afi->tables, list) {
> +		list_for_each_entry(chain, &table->chains, list) {
> +			list_for_each_entry_safe(rule, tmp, &chain->dirty_rules, dirty_list) {
> +				/* Delete this rule from the dirty list */
> +				list_del(&rule->dirty_list);

Ok here it sounds we could do better. Instead of storing the dirty list 
inside each chain, we may try an external one so we won't have to go 
through the whole table/chain/rule base like we do in a dump. If such 
dirty list is external (keeping a pointer on targeted table and chain 
too for each rule), it will be possible to loop only on it.

And having this list_for_each_entry() make the line out of 80 chars 
anyway. (there is the same issue for nft_dump_rules and some other)

> +static int nf_tables_abort(struct sock *nlsk, struct sk_buff *skb,
> +			   const struct nlmsghdr *nlh,
> +			   const struct nlattr * const nla[])
> +{
> +	const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
> +	const struct nft_af_info *afi;
> +	struct net *net = sock_net(skb->sk);
> +	struct nft_table *table;
> +	struct nft_chain *chain;
> +	struct nft_rule *rule, *tmp;
> +	bool create;
> +
> +	create = nlh->nlmsg_flags & NLM_F_CREATE ? true : false;
> +
> +	afi = nf_tables_afinfo_lookup(net, nfmsg->nfgen_family, create);
> +	if (IS_ERR(afi))
> +		return PTR_ERR(afi);
> +
> +	list_for_each_entry(table, &afi->tables, list) {
> +		list_for_each_entry(chain, &table->chains, list) {
> +			list_for_each_entry_safe(rule, tmp, &chain->dirty_rules, dirty_list) {

Same here.


Br,

Tomasz
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Patrick McHardy - Feb. 19, 2013, 10:02 p.m.
On Thu, Jan 31, 2013 at 01:03:59AM +0100, pablo@netfilter.org wrote:
> From: Pablo Neira Ayuso <pablo@netfilter.org>
> 
> This patch adds global atomic rule updates and dumps based on
> bitmask generations. This allows to atomically commit a set of
> rule-set updates incrementally without altering the internal
> state of existing nf_tables expressions/matches/targets.
> 
> The idea consists of using a generation cursor of 1 bit and
> a bitmask of 2 bits per rule. Assuming the gencursor is 0,
> then the genmask (expressed as a bitmask) can be interpreted
> as:
> 
> 00 active in the present, will be active in the next generation.
> 01 inactive in the present, will be active in the next generation.
> 10 active in the present, will be deleted in the next generation.
>  ^
>  gencursor
> 
> Once you invoke the transition to the next generation, the global
> gencursor is updated:
> 
> 00 active in the present, will be active in the next generation.
> 01 active in the present, needs to zero its future, it becomes 00.
> 10 inactive in the present, delete now.
> ^
> gencursor
> 
> If a dump is in progress and nf_tables enters a new generation,
> the dump will stop and return -EBUSY to let userspace know that
> it has to retry again. In order to invalidate dumps, a global
> genctr counter is increased everytime nf_tables enters a new
> generation.
> 
> This new operation can be used from the user-space utility
> that controls the firewall, eg.
> 
> nft restore < file
> 
> The rule updates contained in `file' will be applied atomically.
> 
> cat file
> -----
> add filter INPUT ip saddr 1.1.1.1 counter accept #1
> del filter INPUT ip daddr 2.2.2.2 counter drop   #2
> commit                                           #3
> -EOF-
> 
> Note that the rule 1 will be inactive until the transition to the
> next generation, the rule 2 will be evicted in the next generation.
> 
> There is a penalty during the rule update due to the branch
> misprediction in the packet matching framework. But that should be
> quickly resolved once the iteration over the dirty list that
> contain rules that require updates is finished.
> 
> Event notification happens once the rule-set update has been
> committed. So we skip notifications is case the rule-set update
> is aborted, which can happen in case that the rule-set is tested
> to apply correctly.
> 
> This patch is based on ideas extracted from discussions with
> Patrick McHardy.

Generally, I think this has a pretty high penalty both memory-wise
and performance-wise, so we should reconsider the idea of moving
most of the costs to userspace by temporarily adding "skip" rules
that skip over uncommitted rules.

A couple of comments on the patch below though.

> @@ -37,6 +37,8 @@ enum nf_tables_msg_types {
>  	NFT_MSG_NEWSETELEM,
>  	NFT_MSG_GETSETELEM,
>  	NFT_MSG_DELSETELEM,
> +	NFT_MSG_COMMIT,
> +	NFT_MSG_ABORT,

I see why you're doing this, but this is similar to the NEWCHAINNAME
attribute, it doesn't really fit into the scheme how the netlink API
is designed.

> index 3ba63b6..1131e49 100644
> --- a/include/net/netfilter/nf_tables.h
> +++ b/include/net/netfilter/nf_tables.h
> @@ -319,15 +319,19 @@ static inline void *nft_expr_priv(const struct nft_expr *expr)
>   *	struct nft_rule - nf_tables rule
>   *
>   *	@list: used internally
> + *	@dirty_list: this rule needs an update after new generation
>   *	@rcu_head: used internally for rcu
>   *	@handle: rule handle
> + *	@genmask: generation mask
>   *	@dlen: length of expression data
>   *	@data: expression data
>   */
>  struct nft_rule {
>  	struct list_head		list;
> +	struct list_head		dirty_list;

This is a quite heavy price. To reduce memory overhead, we could just have
dirty chains and iterate over all rules within them. 

> diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
> index db6150b..7f08381 100644
> --- a/net/netfilter/nf_tables_api.c
> +++ b/net/netfilter/nf_tables_api.c
>  static int nf_tables_dump_rules(struct sk_buff *skb,
>  				struct netlink_callback *cb)
>  {
> @@ -1250,6 +1288,7 @@ static int nf_tables_dump_rules(struct sk_buff *skb,
>  	unsigned int idx = 0, s_idx = cb->args[0];
>  	struct net *net = sock_net(skb->sk);
>  	int family = nfmsg->nfgen_family;
> +	unsigned int genctr = net->nft.genctr, gencursor = net->nft.gencursor;
>  
>  	list_for_each_entry(afi, &net->nft.af_info, list) {
>  		if (family != NFPROTO_UNSPEC && family != afi->family)
> @@ -1258,6 +1297,8 @@ static int nf_tables_dump_rules(struct sk_buff *skb,
>  		list_for_each_entry(table, &afi->tables, list) {
>  			list_for_each_entry(chain, &table->chains, list) {
>  				list_for_each_entry(rule, &chain->rules, list) {
> +					if (!nft_rule_is_active(net, rule))
> +						goto cont;

Not sure whether we should really skip them during the dump. That hides
information from userspace, we could just as well include an INACTIVE flag
and have userspace decide whether to process them or not.

> +static void
> +nf_tables_delrule_one(struct nft_ctx *ctx, struct nft_rule *rule, u32 flags)
> +{
> +	if (flags & NFT_RULE_F_COMMIT) {
> +		struct nft_chain *chain = (struct nft_chain *)ctx->chain;
> +
> +		nft_rule_disactivate_next(ctx->net, rule);

deactivate :)

> +		list_add(&rule->dirty_list, &chain->dirty_rules);
> +	} else {
> +		list_del_rcu(&rule->list);
> +		nf_tables_rule_notify(ctx->skb, ctx->nlh, ctx->table,
> +				      ctx->chain, rule, NFT_MSG_DELRULE,
> +				      0, ctx->afi->family);
> +		nf_tables_rule_destroy(rule);
> +	}
> +}
> +
>  static int nf_tables_delrule(struct sock *nlsk, struct sk_buff *skb,
>  			     const struct nlmsghdr *nlh,
>  			     const struct nlattr * const nla[])
> +static int nf_tables_commit(struct sock *nlsk, struct sk_buff *skb,
> +			    const struct nlmsghdr *nlh,
> +			    const struct nlattr * const nla[])
> +{
> +	const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
> +	const struct nft_af_info *afi;
> +	struct net *net = sock_net(skb->sk);
> +	struct nft_table *table;
> +	struct nft_chain *chain;
> +	struct nft_rule *rule, *tmp;
> +	int family = nfmsg->nfgen_family;
> +	bool create;
> +
> +	create = nlh->nlmsg_flags & NLM_F_CREATE ? true : false;
> +
> +	afi = nf_tables_afinfo_lookup(net, nfmsg->nfgen_family, create);
> +	if (IS_ERR(afi))
> +		return PTR_ERR(afi);
> +
> +	/* Bump generation counter, invalidate any dump in progress */
> +	net->nft.genctr++;
> +
> +	/* A new generation has just started */
> +	net->nft.gencursor++;
> +
> +	/* Make sure all packets have left the previous generation before
> +	 * purging old rules.
> +	 */
> +	synchronize_rcu();

This doesn't work. As soon as synchronize_rcu() is finished, new packets
might enter nft_do_chain(). And this also seems to be the main problem
with this approach, you can't iteratively flip the active bits without
locking out packet processing, otherwise it will be just as non-atomic as
it is now. For this to work you'd need a bigger generation counter in the
rules that doesn't overflow and doesn't require changes to its values.

> +
> +	list_for_each_entry(table, &afi->tables, list) {
> +		list_for_each_entry(chain, &table->chains, list) {
> +			list_for_each_entry_safe(rule, tmp, &chain->dirty_rules, dirty_list) {
> +				/* Delete this rule from the dirty list */
> +				list_del(&rule->dirty_list);
> +
> +				/* This rule was inactive in the past and just
> +				 * became active. Clear the next bit of the
> +				 * genmask since its meaning has changed, now
> +				 * it is the future.
> +				 */
> +				if (nft_rule_is_active(net, rule)) {
> +					nft_rule_clear(net, rule);
> +					nf_tables_rule_notify(skb, nlh, table,
> +							      chain, rule,
> +							      NFT_MSG_NEWRULE,
> +							      0,
> +							      nfmsg->nfgen_family);
> +					continue;
> +				}
> +
> +				/* This rule is in the past, get rid of it */
> +				list_del_rcu(&rule->list);
> +				nf_tables_rule_notify(skb, nlh, table, chain,
> +						      rule, NFT_MSG_DELRULE, 0,
> +						      family);
> +				nf_tables_rule_destroy(rule);
> +			}
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira - Feb. 20, 2013, 12:44 a.m.
Hi Patrick,

On Tue, Feb 19, 2013 at 11:02:58PM +0100, Patrick McHardy wrote:
> On Thu, Jan 31, 2013 at 01:03:59AM +0100, pablo@netfilter.org wrote:
> > From: Pablo Neira Ayuso <pablo@netfilter.org>
> > 
> > This patch adds global atomic rule updates and dumps based on
> > bitmask generations. This allows to atomically commit a set of
> > rule-set updates incrementally without altering the internal
> > state of existing nf_tables expressions/matches/targets.
> > 
> > The idea consists of using a generation cursor of 1 bit and
> > a bitmask of 2 bits per rule. Assuming the gencursor is 0,
> > then the genmask (expressed as a bitmask) can be interpreted
> > as:
> > 
> > 00 active in the present, will be active in the next generation.
> > 01 inactive in the present, will be active in the next generation.
> > 10 active in the present, will be deleted in the next generation.
> >  ^
> >  gencursor
> > 
> > Once you invoke the transition to the next generation, the global
> > gencursor is updated:
> > 
> > 00 active in the present, will be active in the next generation.
> > 01 active in the present, needs to zero its future, it becomes 00.
> > 10 inactive in the present, delete now.
> > ^
> > gencursor
> > 
> > If a dump is in progress and nf_tables enters a new generation,
> > the dump will stop and return -EBUSY to let userspace know that
> > it has to retry again. In order to invalidate dumps, a global
> > genctr counter is increased everytime nf_tables enters a new
> > generation.
> > 
> > This new operation can be used from the user-space utility
> > that controls the firewall, eg.
> > 
> > nft restore < file
> > 
> > The rule updates contained in `file' will be applied atomically.
> > 
> > cat file
> > -----
> > add filter INPUT ip saddr 1.1.1.1 counter accept #1
> > del filter INPUT ip daddr 2.2.2.2 counter drop   #2
> > commit                                           #3
> > -EOF-
> > 
> > Note that the rule 1 will be inactive until the transition to the
> > next generation, the rule 2 will be evicted in the next generation.
> > 
> > There is a penalty during the rule update due to the branch
> > misprediction in the packet matching framework. But that should be
> > quickly resolved once the iteration over the dirty list that
> > contain rules that require updates is finished.
> > 
> > Event notification happens once the rule-set update has been
> > committed. So we skip notifications is case the rule-set update
> > is aborted, which can happen in case that the rule-set is tested
> > to apply correctly.
> > 
> > This patch is based on ideas extracted from discussions with
> > Patrick McHardy.
> 
> Generally, I think this has a pretty high penalty both memory-wise
> and performance-wise, so we should reconsider the idea of moving
> most of the costs to userspace by temporarily adding "skip" rules
> that skip over uncommitted rules.
>
> A couple of comments on the patch below though.
> 
> > @@ -37,6 +37,8 @@ enum nf_tables_msg_types {
> >  	NFT_MSG_NEWSETELEM,
> >  	NFT_MSG_GETSETELEM,
> >  	NFT_MSG_DELSETELEM,
> > +	NFT_MSG_COMMIT,
> > +	NFT_MSG_ABORT,
> 
> I see why you're doing this, but this is similar to the NEWCHAINNAME
> attribute, it doesn't really fit into the scheme how the netlink API
> is designed.

What would you suggest as an alternative?

> > index 3ba63b6..1131e49 100644
> > --- a/include/net/netfilter/nf_tables.h
> > +++ b/include/net/netfilter/nf_tables.h
> > @@ -319,15 +319,19 @@ static inline void *nft_expr_priv(const struct nft_expr *expr)
> >   *	struct nft_rule - nf_tables rule
> >   *
> >   *	@list: used internally
> > + *	@dirty_list: this rule needs an update after new generation
> >   *	@rcu_head: used internally for rcu
> >   *	@handle: rule handle
> > + *	@genmask: generation mask
> >   *	@dlen: length of expression data
> >   *	@data: expression data
> >   */
> >  struct nft_rule {
> >  	struct list_head		list;
> > +	struct list_head		dirty_list;
> 
> This is a quite heavy price. To reduce memory overhead, we could just have
> dirty chains and iterate over all rules within them.

We can save that the extra struct list_head for the dirty_list in the
rule by allocating temporary container structure, I have a follow up
patch for that.

> > diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
> > index db6150b..7f08381 100644
> > --- a/net/netfilter/nf_tables_api.c
> > +++ b/net/netfilter/nf_tables_api.c
> >  static int nf_tables_dump_rules(struct sk_buff *skb,
> >  				struct netlink_callback *cb)
> >  {
> > @@ -1250,6 +1288,7 @@ static int nf_tables_dump_rules(struct sk_buff *skb,
> >  	unsigned int idx = 0, s_idx = cb->args[0];
> >  	struct net *net = sock_net(skb->sk);
> >  	int family = nfmsg->nfgen_family;
> > +	unsigned int genctr = net->nft.genctr, gencursor = net->nft.gencursor;
> >  
> >  	list_for_each_entry(afi, &net->nft.af_info, list) {
> >  		if (family != NFPROTO_UNSPEC && family != afi->family)
> > @@ -1258,6 +1297,8 @@ static int nf_tables_dump_rules(struct sk_buff *skb,
> >  		list_for_each_entry(table, &afi->tables, list) {
> >  			list_for_each_entry(chain, &table->chains, list) {
> >  				list_for_each_entry(rule, &chain->rules, list) {
> > +					if (!nft_rule_is_active(net, rule))
> > +						goto cont;
> 
> Not sure whether we should really skip them during the dump. That hides
> information from userspace, we could just as well include an INACTIVE flag
> and have userspace decide whether to process them or not.

Agreed, they should be dumped.

> > +static void
> > +nf_tables_delrule_one(struct nft_ctx *ctx, struct nft_rule *rule, u32 flags)
> > +{
> > +	if (flags & NFT_RULE_F_COMMIT) {
> > +		struct nft_chain *chain = (struct nft_chain *)ctx->chain;
> > +
> > +		nft_rule_disactivate_next(ctx->net, rule);
> 
> deactivate :)

Hm, I swear I checked this in the dictionary before writing it, you
know how broken my English can be ;-)

> > +		list_add(&rule->dirty_list, &chain->dirty_rules);
> > +	} else {
> > +		list_del_rcu(&rule->list);
> > +		nf_tables_rule_notify(ctx->skb, ctx->nlh, ctx->table,
> > +				      ctx->chain, rule, NFT_MSG_DELRULE,
> > +				      0, ctx->afi->family);
> > +		nf_tables_rule_destroy(rule);
> > +	}
> > +}
> > +
> >  static int nf_tables_delrule(struct sock *nlsk, struct sk_buff *skb,
> >  			     const struct nlmsghdr *nlh,
> >  			     const struct nlattr * const nla[])
> > +static int nf_tables_commit(struct sock *nlsk, struct sk_buff *skb,
> > +			    const struct nlmsghdr *nlh,
> > +			    const struct nlattr * const nla[])
> > +{
> > +	const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
> > +	const struct nft_af_info *afi;
> > +	struct net *net = sock_net(skb->sk);
> > +	struct nft_table *table;
> > +	struct nft_chain *chain;
> > +	struct nft_rule *rule, *tmp;
> > +	int family = nfmsg->nfgen_family;
> > +	bool create;
> > +
> > +	create = nlh->nlmsg_flags & NLM_F_CREATE ? true : false;
> > +
> > +	afi = nf_tables_afinfo_lookup(net, nfmsg->nfgen_family, create);
> > +	if (IS_ERR(afi))
> > +		return PTR_ERR(afi);
> > +
> > +	/* Bump generation counter, invalidate any dump in progress */
> > +	net->nft.genctr++;
> > +
> > +	/* A new generation has just started */
> > +	net->nft.gencursor++;
> > +
> > +	/* Make sure all packets have left the previous generation before
> > +	 * purging old rules.
> > +	 */
> > +	synchronize_rcu();
> 
> This doesn't work. As soon as synchronize_rcu() is finished, new packets
> might enter nft_do_chain(). And this also seems to be the main problem
> with this approach, you can't iteratively flip the active bits without
> locking out packet processing, otherwise it will be just as non-atomic as
> it is now. For this to work you'd need a bigger generation counter in the
> rules that doesn't overflow and doesn't require changes to its values.

The gencursor is being cached at the entrace of nft_do_chain_pktinfo to
avoid that intermediate state. So AFAICS we wait until old packets
have finished iterating over the rule-set with the old gencursor and,
then, new packets will see the new rule-set using the new gencursor.

> > +	list_for_each_entry(table, &afi->tables, list) {
> > +		list_for_each_entry(chain, &table->chains, list) {
> > +			list_for_each_entry_safe(rule, tmp, &chain->dirty_rules, dirty_list) {
> > +				/* Delete this rule from the dirty list */
> > +				list_del(&rule->dirty_list);
> > +
> > +				/* This rule was inactive in the past and just
> > +				 * became active. Clear the next bit of the
> > +				 * genmask since its meaning has changed, now
> > +				 * it is the future.
> > +				 */
> > +				if (nft_rule_is_active(net, rule)) {
> > +					nft_rule_clear(net, rule);
> > +					nf_tables_rule_notify(skb, nlh, table,
> > +							      chain, rule,
> > +							      NFT_MSG_NEWRULE,
> > +							      0,
> > +							      nfmsg->nfgen_family);
> > +					continue;
> > +				}
> > +
> > +				/* This rule is in the past, get rid of it */
> > +				list_del_rcu(&rule->list);
> > +				nf_tables_rule_notify(skb, nlh, table, chain,
> > +						      rule, NFT_MSG_DELRULE, 0,
> > +						      family);
> > +				nf_tables_rule_destroy(rule);
> > +			}
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira - Feb. 20, 2013, 1:12 a.m.
Hi Tomasz,

On Mon, Feb 18, 2013 at 07:17:56PM +0200, Tomasz Bursztyka wrote:
> Hi Pablo,
> 
> Hopefully you mentioned such patchset, went far away in my mail box.
> I am fine with most of them but this particular one.
> Some small parts should be reworked a bit.
> 
> The overall idea is nice, especially the bit field so it does not
> make the nft_rule bigger.
> 
> >diff --git a/include/linux/netfilter/nf_tables.h b/include/linux/netfilter/nf_tables.h
> >index 7640290..3749069 100644
> >--- a/include/linux/netfilter/nf_tables.h
> >+++ b/include/linux/netfilter/nf_tables.h
> >@@ -37,6 +37,8 @@ enum nf_tables_msg_types {
> >  	NFT_MSG_NEWSETELEM,
> >  	NFT_MSG_GETSETELEM,
> >  	NFT_MSG_DELSETELEM,
> >+	NFT_MSG_COMMIT,
> >+	NFT_MSG_ABORT,
> >  	NFT_MSG_MAX,
> >  };
> >@@ -85,12 +87,18 @@ enum nft_chain_attributes {
> >  };
> >  #define NFTA_CHAIN_MAX		(__NFTA_CHAIN_MAX - 1)
> >+enum {
> >+	NFT_RULE_F_COMMIT	= (1 << 0),
> >+	NFT_RULE_F_MASK		= NFT_RULE_F_COMMIT,
> >+};
> >+
> 
> I guess you added this flags to add rules on non-transaction based
> use-case, so it commits right away.
> Wouldn't it be better to have no such flags in the API, but instead
> keep track of an on going transaction in the struct nft_ctx (let's
> have an internal flag here).
> I.E: no on-going transaction -> we directly commit. If not, it's
> handled as being part of the transaction.
> 
> I just took a look at nfnetlink however, it does not seem possible
> to keep a data pointer per connection on the subsystem.
> Wait, there is a field in struct sock that is meant for such usage.
> It would require to change nf_tables_api.c only then.
> Would it make sense or am I wrong with nft_ctx usage?  (there would
> be an issue: to free such context when the connection goes down, we
> could abort an on-going transaction this way then)

We can define some container structure to store rules in the dirty
list:

struct nft_rule_update {
        struct list_head head;
        uint32_t nl_portid;
        struct nft_rule *rule;
        struct nft_table *table;
        struct nft_chain *chain;
}

That should allows us to remove the struct list_head dirty_list in
struct nft_rule that I needed for this.

The nl_portid would be the netlink portid so we know what updates
belong to what netlink connection. Still I don't see how to get rid of
the commit flag.

Could you develop your idea?

> >  enum nft_rule_attributes {
> >  	NFTA_RULE_UNSPEC,
> >  	NFTA_RULE_TABLE,
> >  	NFTA_RULE_CHAIN,
> >  	NFTA_RULE_HANDLE,
> >  	NFTA_RULE_EXPRESSIONS,
> >+	NFTA_RULE_FLAGS,
> >  	__NFTA_RULE_MAX
> >  };
> 
> ... then it would not require such extra arguments, we would keep a
> simple api.
> 
> >   *	@rcu_head: used internally for rcu
> >   *	@handle: rule handle
> >+ *	@genmask: generation mask
> >   *	@dlen: length of expression data
> >   *	@data: expression data
> >   */
> >  struct nft_rule {
> >  	struct list_head		list;
> >+	struct list_head		dirty_list;
> >  	struct rcu_head			rcu_head;
> >-	u64				handle:48,
> >+	u64				handle:46,
> >+					genmask:2,
> >  					dlen:16;
> >  	unsigned char			data[]
> >  		__attribute__((aligned(__alignof__(struct nft_expr))));
> >@@ -366,8 +370,10 @@ enum nft_chain_flags {
> >   *	struct nft_chain - nf_tables chain
> >   *
> >   *	@rules: list of rules in the chain
> >+ *	@dirty_rules: rules that need an update after next generation
> 
> See the end of the patch, on abort or commit function. I think we
> should try to do something better (performance wise)
> 
> >   *	@list: used internally
> >   *	@rcu_head: used internally
> >+ *	@net: net namespace that this chain belongs to
> 
> I would see that in another patch, even if it's a really tiny one.
> (preceding this current one)
> Moreover that we will have to full support the namespaces at some
> point, right?

I need that net for the code in nft_do_chain_pktinfo added in this
patch. Probably it can be added to base chains only, would need to
check that.

> >+static int nf_tables_commit(struct sock *nlsk, struct sk_buff *skb,
> >+			    const struct nlmsghdr *nlh,
> >+			    const struct nlattr * const nla[])
> >+{
> >+	const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
> >+	const struct nft_af_info *afi;
> >+	struct net *net = sock_net(skb->sk);
> >+	struct nft_table *table;
> >+	struct nft_chain *chain;
> >+	struct nft_rule *rule, *tmp;
> >+	int family = nfmsg->nfgen_family;
> >+	bool create;
> >+
> >+	create = nlh->nlmsg_flags & NLM_F_CREATE ? true : false;
> >+
> >+	afi = nf_tables_afinfo_lookup(net, nfmsg->nfgen_family, create);
> >+	if (IS_ERR(afi))
> >+		return PTR_ERR(afi);
> >+
> >+	/* Bump generation counter, invalidate any dump in progress */
> >+	net->nft.genctr++;
> >+
> >+	/* A new generation has just started */
> >+	net->nft.gencursor++;
> >+
> >+	/* Make sure all packets have left the previous generation before
> >+	 * purging old rules.
> >+	 */
> >+	synchronize_rcu();
> >+
> >+	list_for_each_entry(table, &afi->tables, list) {
> >+		list_for_each_entry(chain, &table->chains, list) {
> >+			list_for_each_entry_safe(rule, tmp, &chain->dirty_rules, dirty_list) {
> >+				/* Delete this rule from the dirty list */
> >+				list_del(&rule->dirty_list);
> 
> Ok here it sounds we could do better. Instead of storing the dirty
> list inside each chain, we may try an external one so we won't have
> to go through the whole table/chain/rule base like we do in a dump.
> If such dirty list is external (keeping a pointer on targeted table
> and chain too for each rule), it will be possible to loop only on
> it.

Yes, that's doable. We can store a pointer to the table and the chain
in the nft_rule_update container structure that I proposed above.

> And having this list_for_each_entry() make the line out of 80 chars
> anyway. (there is the same issue for nft_dump_rules and some other)
> 
> >+static int nf_tables_abort(struct sock *nlsk, struct sk_buff *skb,
> >+			   const struct nlmsghdr *nlh,
> >+			   const struct nlattr * const nla[])
> >+{
> >+	const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
> >+	const struct nft_af_info *afi;
> >+	struct net *net = sock_net(skb->sk);
> >+	struct nft_table *table;
> >+	struct nft_chain *chain;
> >+	struct nft_rule *rule, *tmp;
> >+	bool create;
> >+
> >+	create = nlh->nlmsg_flags & NLM_F_CREATE ? true : false;
> >+
> >+	afi = nf_tables_afinfo_lookup(net, nfmsg->nfgen_family, create);
> >+	if (IS_ERR(afi))
> >+		return PTR_ERR(afi);
> >+
> >+	list_for_each_entry(table, &afi->tables, list) {
> >+		list_for_each_entry(chain, &table->chains, list) {
> >+			list_for_each_entry_safe(rule, tmp, &chain->dirty_rules, dirty_list) {
> 
> Same here.
> 
> 
> Br,
> 
> Tomasz
> --
> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tomasz Bursztyka - Feb. 20, 2013, 8:16 a.m.
Hi Pablo,

>>
>>> diff --git a/include/linux/netfilter/nf_tables.h b/include/linux/netfilter/nf_tables.h
>>> index 7640290..3749069 100644
>>> --- a/include/linux/netfilter/nf_tables.h
>>> +++ b/include/linux/netfilter/nf_tables.h
>>> @@ -37,6 +37,8 @@ enum nf_tables_msg_types {
>>>   	NFT_MSG_NEWSETELEM,
>>>   	NFT_MSG_GETSETELEM,
>>>   	NFT_MSG_DELSETELEM,
>>> +	NFT_MSG_COMMIT,
>>> +	NFT_MSG_ABORT,
>>>   	NFT_MSG_MAX,
>>>   };
>>> @@ -85,12 +87,18 @@ enum nft_chain_attributes {
>>>   };
>>>   #define NFTA_CHAIN_MAX		(__NFTA_CHAIN_MAX - 1)
>>> +enum {
>>> +	NFT_RULE_F_COMMIT	= (1 << 0),
>>> +	NFT_RULE_F_MASK		= NFT_RULE_F_COMMIT,
>>> +};
>>> +
>> I guess you added this flags to add rules on non-transaction based
>> use-case, so it commits right away.
>> Wouldn't it be better to have no such flags in the API, but instead
>> keep track of an on going transaction in the struct nft_ctx (let's
>> have an internal flag here).
>> I.E: no on-going transaction -> we directly commit. If not, it's
>> handled as being part of the transaction.
>>
>> I just took a look at nfnetlink however, it does not seem possible
>> to keep a data pointer per connection on the subsystem.
>> Wait, there is a field in struct sock that is meant for such usage.
>> It would require to change nf_tables_api.c only then.
>> Would it make sense or am I wrong with nft_ctx usage?  (there would
>> be an issue: to free such context when the connection goes down, we
>> could abort an on-going transaction this way then)
> We can define some container structure to store rules in the dirty
> list:
>
> struct nft_rule_update {
>          struct list_head head;
>          uint32_t nl_portid;
>          struct nft_rule *rule;
>          struct nft_table *table;
>          struct nft_chain *chain;
> }
>
> That should allows us to remove the struct list_head dirty_list in
> struct nft_rule that I needed for this.
>
> The nl_portid would be the netlink portid so we know what updates
> belong to what netlink connection. Still I don't see how to get rid of
> the commit flag.
>
> Could you develop your idea?

I was exactly thinking about such external list. But it would be tighten 
to the netlink connection more deeply: as a user data to the socket, 
instead of storing the nl_portid.
I will explain below why.

To me iptables-nftables is a non-transaction based tool. There is no 
point to start a transaction for one rule.
Btw it would then require a  NFT_MSG_START or some sort, to start the 
transaction based manipulation.

Let's say now you have a software, which require to do rules 
manipulation, very often, which will be always connected through netlink 
to nftables.
starts a transaction:
     - manip 1
     - manip 2
     - ...
     - manip n
Commit or Abort.

done.

Now, for whatever reason: the software crashes in the middle of a 
transaction. When it restarts it has no idea what it was doing and why.
Here is why we should tight the transaction per netlink connection: if 
the connections breaks, we abort right away. It's transparent.
It's consistent also with the fact that you raise a notification only 
when the rule has been activated. Until it comes: no one knows about those
rules in the transaction but the one who owns the transaction.

We could do that via the struct sock { ... user_data ... }; related to 
the netlink connection, storing the transaction list.

Now, no need of a flag: if the transaction list for the current netlink 
connection is not NULL: you know you are on a transaction-based work, so
whatever manipulation comes: it will be part of the transaction. If it's 
NULL, you do as usual: activating the rule right away.


>
>>>    *	@list: used internally
>>>    *	@rcu_head: used internally
>>> + *	@net: net namespace that this chain belongs to
>> I would see that in another patch, even if it's a really tiny one.
>> (preceding this current one)
>> Moreover that we will have to full support the namespaces at some
>> point, right?
> I need that net for the code in nft_do_chain_pktinfo added in this
> patch. Probably it can be added to base chains only, would need to
> check that.

Indeed. I missed that.


Tomasz
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tomasz Bursztyka - Feb. 20, 2013, 10:32 a.m.
Hi Patrick,

>
>> @@ -37,6 +37,8 @@ enum nf_tables_msg_types {
>>   	NFT_MSG_NEWSETELEM,
>>   	NFT_MSG_GETSETELEM,
>>   	NFT_MSG_DELSETELEM,
>> +	NFT_MSG_COMMIT,
>> +	NFT_MSG_ABORT,
> I see why you're doing this, but this is similar to the NEWCHAINNAME
> attribute, it doesn't really fit into the scheme how the netlink API
> is designed.

  Why doesn't it fit? Do you mean you would prefer one command leads to 
one result, so no transaction based manipulation?
I would rather see a NFT_MSG_START added actually.

>
>> diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
>> index db6150b..7f08381 100644
>> --- a/net/netfilter/nf_tables_api.c
>> +++ b/net/netfilter/nf_tables_api.c
>>   static int nf_tables_dump_rules(struct sk_buff *skb,
>>   				struct netlink_callback *cb)
>>   {
>> @@ -1250,6 +1288,7 @@ static int nf_tables_dump_rules(struct sk_buff *skb,
>>   	unsigned int idx = 0, s_idx = cb->args[0];
>>   	struct net *net = sock_net(skb->sk);
>>   	int family = nfmsg->nfgen_family;
>> +	unsigned int genctr = net->nft.genctr, gencursor = net->nft.gencursor;
>>   
>>   	list_for_each_entry(afi, &net->nft.af_info, list) {
>>   		if (family != NFPROTO_UNSPEC && family != afi->family)
>> @@ -1258,6 +1297,8 @@ static int nf_tables_dump_rules(struct sk_buff *skb,
>>   		list_for_each_entry(table, &afi->tables, list) {
>>   			list_for_each_entry(chain, &table->chains, list) {
>>   				list_for_each_entry(rule, &chain->rules, list) {
>> +					if (!nft_rule_is_active(net, rule))
>> +						goto cont;
> Not sure whether we should really skip them during the dump. That hides
> information from userspace, we could just as well include an INACTIVE flag
> and have userspace decide whether to process them or not.

Personally I don't see much point dumping non-active (yet to be 
committed) rules to anybody.
The current way of notifying the rule change when committed, as 
described at the end of this patch message, is I think just right.
That would be better though within the proposal I made earlier to Pablo.

Unless you don't want transaction based manipulation. What would be the 
idea then?


Tomasz
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira - Feb. 20, 2013, 11:10 p.m.
Hi Tomasz,

On Wed, Feb 20, 2013 at 10:16:29AM +0200, Tomasz Bursztyka wrote:
[...]
> >We can define some container structure to store rules in the dirty
> >list:
> >
> >struct nft_rule_update {
> >         struct list_head head;
> >         uint32_t nl_portid;
> >         struct nft_rule *rule;
> >         struct nft_table *table;
> >         struct nft_chain *chain;
> >}
> >
> >That should allows us to remove the struct list_head dirty_list in
> >struct nft_rule that I needed for this.
> >
> >The nl_portid would be the netlink portid so we know what updates
> >belong to what netlink connection. Still I don't see how to get rid of
> >the commit flag.
> >
> >Could you develop your idea?
> 
> I was exactly thinking about such external list. But it would be
> tighten to the netlink connection more deeply: as a user data to the
> socket, instead of storing the nl_portid.
> I will explain below why.
> 
> To me iptables-nftables is a non-transaction based tool. There is no
> point to start a transaction for one rule.

Agreed, ie. Not for iptables, only for iptables-restore.

> Btw it would then require a  NFT_MSG_START or some sort, to start
> the transaction based manipulation.

What I had in mind is that the transaction is considered to implicitly
start once you open a netlink socket from user-space, you add rule
updates that you want to happen atomically and then you call commit.

> Let's say now you have a software, which require to do rules
> manipulation, very often, which will be always connected through
> netlink to nftables.
> starts a transaction:
>     - manip 1
>     - manip 2
>     - ...
>     - manip n
> Commit or Abort.
> 
> done.
> 
> Now, for whatever reason: the software crashes in the middle of a
> transaction. When it restarts it has no idea what it was doing and
> why.  Here is why we should tight the transaction per netlink
> connection: if the connections breaks, we abort right away. It's
> transparent.

Good point, I have a patch for that as well. We can catch the
NETLINK_URELEASE event to remove all entries in the dirty list for the
program that crashed / exited without committing.

See net/netfilter/nfnetlink_queue_core.c for instance.

> It's consistent also with the fact that you raise a notification
> only when the rule has been activated. Until it comes: no one knows
> about those rules in the transaction but the one who owns the
> transaction.
> 
> We could do that via the struct sock { ... user_data ... }; related
> to the netlink connection, storing the transaction list.
>
> Now, no need of a flag: if the transaction list for the current
> netlink connection is not NULL: you know you are on a
> transaction-based work.
> whatever manipulation comes: it will be part of the transaction. If
> it's NULL, you do as usual: activating the rule right away.

This reminds me that we can use netlink's NLM_F_ATOMIC and remove
that flags.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/include/linux/netfilter/nf_tables.h b/include/linux/netfilter/nf_tables.h
index 7640290..3749069 100644
--- a/include/linux/netfilter/nf_tables.h
+++ b/include/linux/netfilter/nf_tables.h
@@ -37,6 +37,8 @@  enum nf_tables_msg_types {
 	NFT_MSG_NEWSETELEM,
 	NFT_MSG_GETSETELEM,
 	NFT_MSG_DELSETELEM,
+	NFT_MSG_COMMIT,
+	NFT_MSG_ABORT,
 	NFT_MSG_MAX,
 };
 
@@ -85,12 +87,18 @@  enum nft_chain_attributes {
 };
 #define NFTA_CHAIN_MAX		(__NFTA_CHAIN_MAX - 1)
 
+enum {
+	NFT_RULE_F_COMMIT	= (1 << 0),
+	NFT_RULE_F_MASK		= NFT_RULE_F_COMMIT,
+};
+
 enum nft_rule_attributes {
 	NFTA_RULE_UNSPEC,
 	NFTA_RULE_TABLE,
 	NFTA_RULE_CHAIN,
 	NFTA_RULE_HANDLE,
 	NFTA_RULE_EXPRESSIONS,
+	NFTA_RULE_FLAGS,
 	__NFTA_RULE_MAX
 };
 #define NFTA_RULE_MAX		(__NFTA_RULE_MAX - 1)
diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 3ba63b6..1131e49 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -319,15 +319,19 @@  static inline void *nft_expr_priv(const struct nft_expr *expr)
  *	struct nft_rule - nf_tables rule
  *
  *	@list: used internally
+ *	@dirty_list: this rule needs an update after new generation
  *	@rcu_head: used internally for rcu
  *	@handle: rule handle
+ *	@genmask: generation mask
  *	@dlen: length of expression data
  *	@data: expression data
  */
 struct nft_rule {
 	struct list_head		list;
+	struct list_head		dirty_list;
 	struct rcu_head			rcu_head;
-	u64				handle:48,
+	u64				handle:46,
+					genmask:2,
 					dlen:16;
 	unsigned char			data[]
 		__attribute__((aligned(__alignof__(struct nft_expr))));
@@ -366,8 +370,10 @@  enum nft_chain_flags {
  *	struct nft_chain - nf_tables chain
  *
  *	@rules: list of rules in the chain
+ *	@dirty_rules: rules that need an update after next generation
  *	@list: used internally
  *	@rcu_head: used internally
+ *	@net: net namespace that this chain belongs to
  *	@handle: chain handle
  *	@flags: bitmask of enum nft_chain_flags
  *	@use: number of jump references to this chain
@@ -376,8 +382,10 @@  enum nft_chain_flags {
  */
 struct nft_chain {
 	struct list_head		rules;
+	struct list_head		dirty_rules;
 	struct list_head		list;
 	struct rcu_head			rcu_head;
+	struct net			*net;
 	u64				handle;
 	u8				flags;
 	u16				use;
diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
index a98b1c5..fe919c7 100644
--- a/include/net/netns/nftables.h
+++ b/include/net/netns/nftables.h
@@ -10,6 +10,8 @@  struct netns_nftables {
 	struct nft_af_info	*ipv4;
 	struct nft_af_info	*ipv6;
 	struct nft_af_info	*bridge;
+	u8			gencursor:1,
+				genctr:7;
 };
 
 #endif
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index db6150b..7f08381 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -863,7 +863,9 @@  static int nf_tables_newchain(struct sock *nlsk, struct sk_buff *skb,
 	}
 
 	INIT_LIST_HEAD(&chain->rules);
+	INIT_LIST_HEAD(&chain->dirty_rules);
 	chain->handle = nf_tables_alloc_handle(table);
+	chain->net = net;
 	nla_strlcpy(chain->name, name, NFT_CHAIN_MAXNAMELEN);
 
 	list_add_tail(&chain->list, &table->chains);
@@ -1150,6 +1152,7 @@  static const struct nla_policy nft_rule_policy[NFTA_RULE_MAX + 1] = {
 				    .len = NFT_CHAIN_MAXNAMELEN - 1 },
 	[NFTA_RULE_HANDLE]	= { .type = NLA_U64 },
 	[NFTA_RULE_EXPRESSIONS]	= { .type = NLA_NESTED },
+	[NFTA_RULE_FLAGS]	= { .type = NLA_U32 },
 };
 
 static int nf_tables_fill_rule_info(struct sk_buff *skb, u32 portid, u32 seq,
@@ -1239,6 +1242,41 @@  err:
 	return err;
 }
 
+static inline bool
+nft_rule_is_active(struct net *net, const struct nft_rule *rule)
+{
+	return (rule->genmask & (1 << net->nft.gencursor)) == 0;
+}
+
+static inline int gencursor_next(struct net *net)
+{
+	return net->nft.gencursor+1 == 1 ? 1 : 0;
+}
+
+static inline int
+nft_rule_is_active_next(struct net *net, const struct nft_rule *rule)
+{
+	return (rule->genmask & (1 << gencursor_next(net))) == 0;
+}
+
+static inline void
+nft_rule_activate_next(struct net *net, struct nft_rule *rule)
+{
+	/* Now inactive, will be active in the future */
+	rule->genmask = (1 << net->nft.gencursor);
+}
+
+static inline void
+nft_rule_disactivate_next(struct net *net, struct nft_rule *rule)
+{
+	rule->genmask = (1 << gencursor_next(net));
+}
+
+static inline void nft_rule_clear(struct net *net, struct nft_rule *rule)
+{
+	rule->genmask = 0;
+}
+
 static int nf_tables_dump_rules(struct sk_buff *skb,
 				struct netlink_callback *cb)
 {
@@ -1250,6 +1288,7 @@  static int nf_tables_dump_rules(struct sk_buff *skb,
 	unsigned int idx = 0, s_idx = cb->args[0];
 	struct net *net = sock_net(skb->sk);
 	int family = nfmsg->nfgen_family;
+	unsigned int genctr = net->nft.genctr, gencursor = net->nft.gencursor;
 
 	list_for_each_entry(afi, &net->nft.af_info, list) {
 		if (family != NFPROTO_UNSPEC && family != afi->family)
@@ -1258,6 +1297,8 @@  static int nf_tables_dump_rules(struct sk_buff *skb,
 		list_for_each_entry(table, &afi->tables, list) {
 			list_for_each_entry(chain, &table->chains, list) {
 				list_for_each_entry(rule, &chain->rules, list) {
+					if (!nft_rule_is_active(net, rule))
+						goto cont;
 					if (idx < s_idx)
 						goto cont;
 					if (idx > s_idx)
@@ -1276,6 +1317,10 @@  cont:
 		}
 	}
 done:
+	/* Invalidate this dump, a transition to the new generation happened */
+	if (gencursor != net->nft.gencursor || genctr != net->nft.genctr)
+		return -EBUSY;
+
 	cb->args[0] = idx;
 	return skb->len;
 }
@@ -1376,6 +1421,7 @@  static int nf_tables_newrule(struct sock *nlsk, struct sk_buff *skb,
 	unsigned int size, i, n;
 	int err, rem;
 	bool create;
+	u32 flags = 0;
 	u64 handle;
 
 	create = nlh->nlmsg_flags & NLM_F_CREATE ? true : false;
@@ -1434,6 +1480,15 @@  static int nf_tables_newrule(struct sock *nlsk, struct sk_buff *skb,
 	if (rule == NULL)
 		goto err1;
 
+	if (nla[NFTA_RULE_FLAGS]) {
+		flags = ntohl(nla_get_be32(nla[NFTA_RULE_FLAGS]));
+		if (flags & ~NFT_RULE_F_MASK)
+			return -EINVAL;
+
+		if (flags & NFT_RULE_F_COMMIT)
+			nft_rule_activate_next(net, rule);
+	}
+
 	rule->handle = handle;
 	rule->dlen   = size;
 
@@ -1447,16 +1502,26 @@  static int nf_tables_newrule(struct sock *nlsk, struct sk_buff *skb,
 	}
 
 	if (nlh->nlmsg_flags & NLM_F_REPLACE) {
-		list_replace_rcu(&old_rule->list, &rule->list);
-		nf_tables_rule_destroy(old_rule);
+		if (flags & NFT_RULE_F_COMMIT) {
+			nft_rule_disactivate_next(net, old_rule);
+			list_add_tail_rcu(&rule->list, &chain->rules);
+		} else {
+			list_replace_rcu(&old_rule->list, &rule->list);
+			nf_tables_rule_destroy(old_rule);
+		}
 	} else if (nlh->nlmsg_flags & NLM_F_APPEND)
 		list_add_tail_rcu(&rule->list, &chain->rules);
 	else
 		list_add_rcu(&rule->list, &chain->rules);
 
-	nf_tables_rule_notify(skb, nlh, table, chain, rule, NFT_MSG_NEWRULE,
-			      nlh->nlmsg_flags & (NLM_F_APPEND | NLM_F_REPLACE),
-			      nfmsg->nfgen_family);
+	if (flags & NFT_RULE_F_COMMIT)
+		list_add(&rule->dirty_list, &chain->dirty_rules);
+	else {
+		nf_tables_rule_notify(skb, nlh, table, chain, rule,
+				      NFT_MSG_NEWRULE,
+				      nlh->nlmsg_flags & (NLM_F_APPEND | NLM_F_REPLACE),
+				      nfmsg->nfgen_family);
+	}
 	return 0;
 
 err2:
@@ -1469,6 +1534,23 @@  err1:
 	return err;
 }
 
+static void
+nf_tables_delrule_one(struct nft_ctx *ctx, struct nft_rule *rule, u32 flags)
+{
+	if (flags & NFT_RULE_F_COMMIT) {
+		struct nft_chain *chain = (struct nft_chain *)ctx->chain;
+
+		nft_rule_disactivate_next(ctx->net, rule);
+		list_add(&rule->dirty_list, &chain->dirty_rules);
+	} else {
+		list_del_rcu(&rule->list);
+		nf_tables_rule_notify(ctx->skb, ctx->nlh, ctx->table,
+				      ctx->chain, rule, NFT_MSG_DELRULE,
+				      0, ctx->afi->family);
+		nf_tables_rule_destroy(rule);
+	}
+}
+
 static int nf_tables_delrule(struct sock *nlsk, struct sk_buff *skb,
 			     const struct nlmsghdr *nlh,
 			     const struct nlattr * const nla[])
@@ -1480,6 +1562,8 @@  static int nf_tables_delrule(struct sock *nlsk, struct sk_buff *skb,
 	struct nft_chain *chain;
 	struct nft_rule *rule, *tmp;
 	int family = nfmsg->nfgen_family;
+	struct nft_ctx ctx;
+	u32 flags = 0;
 
 	afi = nf_tables_afinfo_lookup(net, family, false);
 	if (IS_ERR(afi))
@@ -1493,31 +1577,132 @@  static int nf_tables_delrule(struct sock *nlsk, struct sk_buff *skb,
 	if (IS_ERR(chain))
 		return PTR_ERR(chain);
 
+	nft_ctx_init(&ctx, skb, nlh, afi, table, chain);
+
+	if (nla[NFTA_RULE_FLAGS]) {
+		flags = ntohl(nla_get_be32(nla[NFTA_RULE_FLAGS]));
+
+		if (flags & ~NFT_RULE_F_MASK)
+			return -EINVAL;
+	}
+
 	if (nla[NFTA_RULE_HANDLE]) {
 		rule = nf_tables_rule_lookup(chain, nla[NFTA_RULE_HANDLE]);
 		if (IS_ERR(rule))
 			return PTR_ERR(rule);
 
-		/* List removal must be visible before destroying expressions */
-		list_del_rcu(&rule->list);
-
-		nf_tables_rule_notify(skb, nlh, table, chain, rule,
-				      NFT_MSG_DELRULE, 0, family);
-		nf_tables_rule_destroy(rule);
+		nf_tables_delrule_one(&ctx, rule, flags);
 	} else {
 		/* Remove all rules in this chain */
-		list_for_each_entry_safe(rule, tmp, &chain->rules, list) {
-			list_del_rcu(&rule->list);
+		list_for_each_entry_safe(rule, tmp, &chain->rules, list)
+			nf_tables_delrule_one(&ctx, rule, flags);
+	}
 
-			nf_tables_rule_notify(skb, nlh, table, chain, rule,
-					      NFT_MSG_DELRULE, 0, family);
-			nf_tables_rule_destroy(rule);
+	return 0;
+}
+
+static int nf_tables_commit(struct sock *nlsk, struct sk_buff *skb,
+			    const struct nlmsghdr *nlh,
+			    const struct nlattr * const nla[])
+{
+	const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
+	const struct nft_af_info *afi;
+	struct net *net = sock_net(skb->sk);
+	struct nft_table *table;
+	struct nft_chain *chain;
+	struct nft_rule *rule, *tmp;
+	int family = nfmsg->nfgen_family;
+	bool create;
+
+	create = nlh->nlmsg_flags & NLM_F_CREATE ? true : false;
+
+	afi = nf_tables_afinfo_lookup(net, nfmsg->nfgen_family, create);
+	if (IS_ERR(afi))
+		return PTR_ERR(afi);
+
+	/* Bump generation counter, invalidate any dump in progress */
+	net->nft.genctr++;
+
+	/* A new generation has just started */
+	net->nft.gencursor++;
+
+	/* Make sure all packets have left the previous generation before
+	 * purging old rules.
+	 */
+	synchronize_rcu();
+
+	list_for_each_entry(table, &afi->tables, list) {
+		list_for_each_entry(chain, &table->chains, list) {
+			list_for_each_entry_safe(rule, tmp, &chain->dirty_rules, dirty_list) {
+				/* Delete this rule from the dirty list */
+				list_del(&rule->dirty_list);
+
+				/* This rule was inactive in the past and just
+				 * became active. Clear the next bit of the
+				 * genmask since its meaning has changed, now
+				 * it is the future.
+				 */
+				if (nft_rule_is_active(net, rule)) {
+					nft_rule_clear(net, rule);
+					nf_tables_rule_notify(skb, nlh, table,
+							      chain, rule,
+							      NFT_MSG_NEWRULE,
+							      0,
+							      nfmsg->nfgen_family);
+					continue;
+				}
+
+				/* This rule is in the past, get rid of it */
+				list_del_rcu(&rule->list);
+				nf_tables_rule_notify(skb, nlh, table, chain,
+						      rule, NFT_MSG_DELRULE, 0,
+						      family);
+				nf_tables_rule_destroy(rule);
+			}
 		}
 	}
 
 	return 0;
 }
 
+static int nf_tables_abort(struct sock *nlsk, struct sk_buff *skb,
+			   const struct nlmsghdr *nlh,
+			   const struct nlattr * const nla[])
+{
+	const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
+	const struct nft_af_info *afi;
+	struct net *net = sock_net(skb->sk);
+	struct nft_table *table;
+	struct nft_chain *chain;
+	struct nft_rule *rule, *tmp;
+	bool create;
+
+	create = nlh->nlmsg_flags & NLM_F_CREATE ? true : false;
+
+	afi = nf_tables_afinfo_lookup(net, nfmsg->nfgen_family, create);
+	if (IS_ERR(afi))
+		return PTR_ERR(afi);
+
+	list_for_each_entry(table, &afi->tables, list) {
+		list_for_each_entry(chain, &table->chains, list) {
+			list_for_each_entry_safe(rule, tmp, &chain->dirty_rules, dirty_list) {
+				/* Delete all rules from the dirty list */
+				list_del(&rule->dirty_list);
+
+				if (!nft_rule_is_active_next(net, rule)) {
+					nft_rule_clear(net, rule);
+					continue;
+				}
+
+				/* This rule is inactive, get rid of it */
+				list_del_rcu(&rule->list);
+				nf_tables_rule_destroy(rule);
+			}
+		}
+	}
+	return 0;
+}
+
 /*
  * Sets
  */
@@ -2477,6 +2662,16 @@  static const struct nfnl_callback nf_tables_cb[NFT_MSG_MAX] = {
 		.attr_count	= NFTA_SET_ELEM_LIST_MAX,
 		.policy		= nft_set_elem_list_policy,
 	},
+	[NFT_MSG_COMMIT] = {
+		.call		= nf_tables_commit,
+		.attr_count	= NFTA_TABLE_MAX,
+		.policy		= nft_rule_policy,
+	},
+	[NFT_MSG_ABORT] = {
+		.call		= nf_tables_abort,
+		.attr_count	= NFTA_TABLE_MAX,
+		.policy		= nft_rule_policy,
+	},
 };
 
 static const struct nfnetlink_subsystem nf_tables_subsys = {
diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c
index b9917b7..a3f848f 100644
--- a/net/netfilter/nf_tables_core.c
+++ b/net/netfilter/nf_tables_core.c
@@ -72,12 +72,22 @@  nft_do_chain_pktinfo(struct nft_pktinfo *pkt, const struct nf_hook_ops *ops)
 		const struct nft_chain	*chain;
 		const struct nft_rule	*rule;
 	} jumpstack[NFT_JUMP_STACK_SIZE];
+	/*
+	 * Cache cursor to avoid problems in case that the cursor is updated
+	 * while traversing the ruleset.
+	 */
+	unsigned int gencursor = chain->net->nft.gencursor;
 
 do_chain:
 	rule = list_entry(&chain->rules, struct nft_rule, list);
 next_rule:
 	data[NFT_REG_VERDICT].verdict = NFT_CONTINUE;
 	list_for_each_entry_continue_rcu(rule, &chain->rules, list) {
+
+		/* This rule is not active, skip. */
+		if (unlikely(rule->genmask & (1 << gencursor)))
+			continue;
+
 		nft_rule_for_each_expr(expr, last, rule) {
 			if (expr->ops == &nft_cmp_fast_ops)
 				nft_cmp_fast_eval(expr, data);