[v3,nf-next,2/2] netfilter: nf_tables: support updating userdata for nft_table

Message ID 20240311141454.31537-2-tianquan23@gmail.com
State New
Series [v3,nf-next,1/2] netfilter: nf_tables: use struct nlattr * to store userdata for nft_table

Commit Message

Quan Tian March 11, 2024, 2:14 p.m. UTC
The NFTA_TABLE_USERDATA attribute was ignored on updates. The patch adds
handling for it to support table comment updates.

Signed-off-by: Quan Tian <tianquan23@gmail.com>
---
 v2: Change to store userdata in struct nlattr * to ensure atomic update
 v3: Do not call nft_trans_destroy() on table updates in nf_tables_commit()

 include/net/netfilter/nf_tables.h |  3 ++
 net/netfilter/nf_tables_api.c     | 56 ++++++++++++++++++++++---------
 2 files changed, 43 insertions(+), 16 deletions(-)

Comments

Florian Westphal March 12, 2024, 12:27 p.m. UTC | #1
Quan Tian <tianquan23@gmail.com> wrote:
> The NFTA_TABLE_USERDATA attribute was ignored on updates. The patch adds
> handling for it to support table comment updates.

One generic API question below.  Pablo, please look at this too.

>  		case NFT_MSG_NEWTABLE:
>  			if (nft_trans_table_update(trans)) {
> -				if (!(trans->ctx.table->flags & __NFT_TABLE_F_UPDATE)) {
> -					nft_trans_destroy(trans);
> -					break;
> +				if (trans->ctx.table->flags & __NFT_TABLE_F_UPDATE) {
> +					if (trans->ctx.table->flags & NFT_TABLE_F_DORMANT)
> +						nf_tables_table_disable(net, trans->ctx.table);
> +					trans->ctx.table->flags &= ~__NFT_TABLE_F_UPDATE;
>  				}
> -				if (trans->ctx.table->flags & NFT_TABLE_F_DORMANT)
> -					nf_tables_table_disable(net, trans->ctx.table);
> -
> -				trans->ctx.table->flags &= ~__NFT_TABLE_F_UPDATE;
> +				swap(trans->ctx.table->udata, nft_trans_table_udata(trans));
> +				nf_tables_table_notify(&trans->ctx, NFT_MSG_NEWTABLE

AFAICS this means that if the table has udata attached, and userspace
makes an update request without a UDATA netlink attribute, we will
delete the existing udata.

Is that right?

My question is, should we instead leave the existing udata as-is and not
support removal, only replace?
Pablo Neira Ayuso March 12, 2024, 12:49 p.m. UTC | #2
On Tue, Mar 12, 2024 at 01:27:58PM +0100, Florian Westphal wrote:
> Quan Tian <tianquan23@gmail.com> wrote:
> > The NFTA_TABLE_USERDATA attribute was ignored on updates. The patch adds
> > handling for it to support table comment updates.
> 
> One generic API question below.  Pablo, please look at this too.
> 
> >  		case NFT_MSG_NEWTABLE:
> >  			if (nft_trans_table_update(trans)) {
> > -				if (!(trans->ctx.table->flags & __NFT_TABLE_F_UPDATE)) {
> > -					nft_trans_destroy(trans);
> > -					break;
> > +				if (trans->ctx.table->flags & __NFT_TABLE_F_UPDATE) {
> > +					if (trans->ctx.table->flags & NFT_TABLE_F_DORMANT)
> > +						nf_tables_table_disable(net, trans->ctx.table);
> > +					trans->ctx.table->flags &= ~__NFT_TABLE_F_UPDATE;
> >  				}
> > -				if (trans->ctx.table->flags & NFT_TABLE_F_DORMANT)
> > -					nf_tables_table_disable(net, trans->ctx.table);
> > -
> > -				trans->ctx.table->flags &= ~__NFT_TABLE_F_UPDATE;
> > +				swap(trans->ctx.table->udata, nft_trans_table_udata(trans));
> > +				nf_tables_table_notify(&trans->ctx, NFT_MSG_NEWTABLE
> 
> AFAICS this means that if the table has udata attached, and userspace
> makes an update request without a UDATA netlink attribute, we will
> delete the existing udata.
> 
> Is that right?
> 
> My question is, should we instead leave the existing udata as-is and not
> support removal, only replace?

I would leave it in place too if no _USERDATA is specified.

One more question is whether the memcmp() with old and new udata makes
sense considering two consecutive requests for _USERDATA update in one
batch.
Florian Westphal March 12, 2024, 1:01 p.m. UTC | #3
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > AFAICS this means that if the table has udata attached, and userspace
> > makes an update request without a UDATA netlink attribute, we will
> > delete the existing udata.
> > 
> > Is that right?
> > 
> > My question is, should we instead leave the existing udata as-is and not
> > support removal, only replace?
> 
> I would leave it in place too if no _USERDATA is specified.
> 
> One more question is if the memcmp() with old and new udata makes
> sense considering two consecutive requests for _USERDATA update in one
> batch.

Great point, any second udata change request in the same batch must fail.

We learned this the hard way with flag updates :(
Pablo Neira Ayuso March 12, 2024, 1:49 p.m. UTC | #4
On Mon, Mar 11, 2024 at 10:14:54PM +0800, Quan Tian wrote:
> The NFTA_TABLE_USERDATA attribute was ignored on updates. The patch adds
> handling for it to support table comment updates.

dump path is lockless:

        if (table->udata) {
                if (nla_put(skb, NFTA_TABLE_USERDATA, table->udlen, table->udata))
                        goto nla_put_failure;
        }

there are two things to update at the same time here, table->udata and
table->udlen.

This needs to be reworked fully if updates are required.

nft_userdata {
        const u8        *data;
        u16             len;
}

then, update struct nft_table to have:

        struct nft_userdata __rcu *user;

then, from dump path, rcu_dereference struct nft_userdata pointer.

BTW, does swap() ensure rcu semantics?

> Signed-off-by: Quan Tian <tianquan23@gmail.com>
> ---
> v2: Change to store userdata in struct nlattr * to ensure atomic update
>  v3: Do not call nft_trans_destroy() on table updates in nf_tables_commit()
> 
>  include/net/netfilter/nf_tables.h |  3 ++
>  net/netfilter/nf_tables_api.c     | 56 ++++++++++++++++++++++---------
>  2 files changed, 43 insertions(+), 16 deletions(-)
> 
> diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
> index 144dc469ebf8..43c747bd3f80 100644
> --- a/include/net/netfilter/nf_tables.h
> +++ b/include/net/netfilter/nf_tables.h
> @@ -1684,10 +1684,13 @@ struct nft_trans_chain {
>  
>  struct nft_trans_table {
>  	bool				update;
> +	struct nlattr			*udata;
>  };
>  
>  #define nft_trans_table_update(trans)	\
>  	(((struct nft_trans_table *)trans->data)->update)
> +#define nft_trans_table_udata(trans)	\
> +	(((struct nft_trans_table *)trans->data)->udata)
>  
>  struct nft_trans_elem {
>  	struct nft_set			*set;
> diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
> index 85088297dd0d..62a2a1798052 100644
> --- a/net/netfilter/nf_tables_api.c
> +++ b/net/netfilter/nf_tables_api.c
> @@ -188,6 +188,17 @@ static struct nlattr *nft_userdata_dup(const struct nlattr *udata, gfp_t gfp)
>  	return kmemdup(udata, nla_total_size(nla_len(udata)), gfp);
>  }
>  
> +static bool nft_userdata_is_same(const struct nlattr *a, const struct nlattr *b)
> +{
> +	if (!a && !b)
> +		return true;
> +	if (!a || !b)
> +		return false;
> +	if (nla_len(a) != nla_len(b))
> +		return false;
> +	return !memcmp(nla_data(a), nla_data(b), nla_len(a));
> +}
> +
>  static void __nft_set_trans_bind(const struct nft_ctx *ctx, struct nft_set *set,
>  				 bool bind)
>  {
> @@ -1210,16 +1221,16 @@ static int nf_tables_updtable(struct nft_ctx *ctx)
>  {
>  	struct nft_trans *trans;
>  	u32 flags;
> +	const struct nlattr *udata = ctx->nla[NFTA_TABLE_USERDATA];
>  	int ret;
>  
> -	if (!ctx->nla[NFTA_TABLE_FLAGS])
> -		return 0;
> -
> -	flags = ntohl(nla_get_be32(ctx->nla[NFTA_TABLE_FLAGS]));
> -	if (flags & ~NFT_TABLE_F_MASK)
> -		return -EOPNOTSUPP;
> +	if (ctx->nla[NFTA_TABLE_FLAGS]) {
> +		flags = ntohl(nla_get_be32(ctx->nla[NFTA_TABLE_FLAGS]));
> +		if (flags & ~NFT_TABLE_F_MASK)
> +			return -EOPNOTSUPP;
> +	}
>  
> -	if (flags == ctx->table->flags)
> +	if (flags == ctx->table->flags && nft_userdata_is_same(udata, ctx->table->udata))
>  		return 0;
>  
>  	if ((nft_table_has_owner(ctx->table) &&
> @@ -1240,6 +1251,14 @@ static int nf_tables_updtable(struct nft_ctx *ctx)
>  	if (trans == NULL)
>  		return -ENOMEM;
>  
> +	if (udata) {
> +		nft_trans_table_udata(trans) = nft_userdata_dup(udata, GFP_KERNEL_ACCOUNT);
> +		if (!nft_trans_table_udata(trans)) {
> +			ret = -ENOMEM;
> +			goto err_table_udata;
> +		}
> +	}
> +
>  	if ((flags & NFT_TABLE_F_DORMANT) &&
>  	    !(ctx->table->flags & NFT_TABLE_F_DORMANT)) {
>  		ctx->table->flags |= NFT_TABLE_F_DORMANT;
> @@ -1271,6 +1290,8 @@ static int nf_tables_updtable(struct nft_ctx *ctx)
>  
>  err_register_hooks:
>  	ctx->table->flags |= NFT_TABLE_F_DORMANT;
> +	kfree(nft_trans_table_udata(trans));
> +err_table_udata:
>  	nft_trans_destroy(trans);
>  	return ret;
>  }
> @@ -9378,6 +9399,9 @@ static void nft_obj_commit_update(struct nft_trans *trans)
>  static void nft_commit_release(struct nft_trans *trans)
>  {
>  	switch (trans->msg_type) {
> +	case NFT_MSG_NEWTABLE:
> +		kfree(nft_trans_table_udata(trans));
> +		break;
>  	case NFT_MSG_DELTABLE:
>  	case NFT_MSG_DESTROYTABLE:
>  		nf_tables_table_destroy(&trans->ctx);
> @@ -10140,19 +10164,18 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
>  		switch (trans->msg_type) {
>  		case NFT_MSG_NEWTABLE:
>  			if (nft_trans_table_update(trans)) {
> -				if (!(trans->ctx.table->flags & __NFT_TABLE_F_UPDATE)) {
> -					nft_trans_destroy(trans);
> -					break;
> +				if (trans->ctx.table->flags & __NFT_TABLE_F_UPDATE) {
> +					if (trans->ctx.table->flags & NFT_TABLE_F_DORMANT)
> +						nf_tables_table_disable(net, trans->ctx.table);
> +					trans->ctx.table->flags &= ~__NFT_TABLE_F_UPDATE;
>  				}
> -				if (trans->ctx.table->flags & NFT_TABLE_F_DORMANT)
> -					nf_tables_table_disable(net, trans->ctx.table);
> -
> -				trans->ctx.table->flags &= ~__NFT_TABLE_F_UPDATE;
> +				swap(trans->ctx.table->udata, nft_trans_table_udata(trans));
> +				nf_tables_table_notify(&trans->ctx, NFT_MSG_NEWTABLE);
>  			} else {
>  				nft_clear(net, trans->ctx.table);
> +				nf_tables_table_notify(&trans->ctx, NFT_MSG_NEWTABLE);
> +				nft_trans_destroy(trans);
>  			}
> -			nf_tables_table_notify(&trans->ctx, NFT_MSG_NEWTABLE);
> -			nft_trans_destroy(trans);
>  			break;
>  		case NFT_MSG_DELTABLE:
>  		case NFT_MSG_DESTROYTABLE:
> @@ -10430,6 +10453,7 @@ static int __nf_tables_abort(struct net *net, enum nfnl_abort_action action)
>  		switch (trans->msg_type) {
>  		case NFT_MSG_NEWTABLE:
>  			if (nft_trans_table_update(trans)) {
> +				kfree(nft_trans_table_udata(trans));
>  				if (!(trans->ctx.table->flags & __NFT_TABLE_F_UPDATE)) {
>  					nft_trans_destroy(trans);
>  					break;
> -- 
> 2.39.3 (Apple Git-145)
>
Florian Westphal March 12, 2024, 2:02 p.m. UTC | #5
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Mon, Mar 11, 2024 at 10:14:54PM +0800, Quan Tian wrote:
> > The NFTA_TABLE_USERDATA attribute was ignored on updates. The patch adds
> > handling for it to support table comment updates.
> 
> dump path is lockless:
> 
>         if (table->udata) {
>                 if (nla_put(skb, NFTA_TABLE_USERDATA, table->udlen, table->udata))
>                         goto nla_put_failure;
>         }
> 
> there are two things to update at the same time here, table->udata and
> table->udlen.
> 
> This needs to be reworked fully if updates are required.

See first patch in the series, it makes this a single pointer,
but you are right...

> then, update struct nft_table to have:
> 
>         struct nft_userdata __rcu *user;

.. this needs an __rcu annotation.  I'll respond
to patch 1 too.

> BTW, does swap() ensure rcu semantics?

No, this needs to use rcu_replace_pointer() and manual update
of the old one stored in the transaction update.
Florian Westphal March 12, 2024, 2:10 p.m. UTC | #6
Quan Tian <tianquan23@gmail.com> wrote:
> -				trans->ctx.table->flags &= ~__NFT_TABLE_F_UPDATE;
> +				swap(trans->ctx.table->udata, nft_trans_table_udata(trans));
> +				nf_tables_table_notify(&trans->ctx, NFT_MSG_NEWTABLE);

I missed this in my review; as Pablo pointed out, you can't use swap()
here, because table->udata is RCU-protected.

Something like this should work as replacement:

nft_trans_table_udata(trans) = rcu_replace_pointer(trans->ctx.table->udata,
                                                   nft_trans_table_udata(trans),
						   lockdep_commit_lock_is_held(trans->ctx.net));

This will swap and ensure all stores are visible.
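The replace-and-keep-old pattern described above can be sketched in user space with C11 atomics as a rough stand-in for the kernel's RCU pointer primitives (the names below are hypothetical illustrations, not code from the patch):

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for struct nft_table's RCU-protected udata pointer. */
static _Atomic(const char *) table_udata;

/*
 * Model of the rcu_replace_pointer() idea: publish the new pointer so
 * concurrent readers see either the old or the new value (never a torn
 * one), and hand the previous pointer back to the caller, who stashes
 * it in the transaction for release after commit.
 */
static const char *udata_replace(const char *new_udata)
{
	return atomic_exchange_explicit(&table_udata, new_udata,
					memory_order_acq_rel);
}
```

A real kernel implementation would use rcu_assign_pointer()/rcu_replace_pointer() and defer freeing the returned pointer until after a grace period; the atomic exchange here only models the swap-and-retain semantics.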
Quan Tian March 12, 2024, 2:26 p.m. UTC | #7
On Tue, Mar 12, 2024 at 02:01:34PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > AFAICS this means that if the table as udata attached, and userspace
> > > makes an update request without a UDATA netlink attribute, we will
> > > delete the existing udata.
> > > 
> > > Is that right?
> > > 
> > > My question is, should we instead leave the existing udata as-is and not
> > > support removal, only replace?
> > 
> > I would leave it in place too if no _USERDATA is specified.
> > 

Sure, I will change it in the proposed way.

> > One more question is if the memcmp() with old and new udata makes
> > sense considering two consecutive requests for _USERDATA update in one
> > batch.
> 
> Great point, any second udata change request in the same batch must fail.
> 
> We learned this the hard way with flag updates :(

Is it the same as two consecutive requests for chain name update and
chain stats update?

In nf_tables_commit():
The 1st trans swaps old udata with 1st new udata;
The 2nd trans swaps 1st new udata with 2nd new udata.

In nft_commit_release():
The 1st trans frees old udata;
The 2nd trans frees 1st new udata.

So multiple udata requests in a batch could work?

Thanks,
Quan
Quan Tian March 12, 2024, 2:30 p.m. UTC | #8
On Tue, Mar 12, 2024 at 03:10:46PM +0100, Florian Westphal wrote:
> Quan Tian <tianquan23@gmail.com> wrote:
> > -				trans->ctx.table->flags &= ~__NFT_TABLE_F_UPDATE;
> > +				swap(trans->ctx.table->udata, nft_trans_table_udata(trans));
> > +				nf_tables_table_notify(&trans->ctx, NFT_MSG_NEWTABLE);
> 
> I missed this in my review, as Pablo pointed out you can't use swap()
> here, because table->udata is rcu protected.
> 
> Something like this should work as replacement:
> 
> nft_trans_table_udata(trans) = rcu_replace_pointer(trans->ctx.table->udata,
>                                                    nft_trans_table_udata(trans),
> 						   lockdep_commit_lock_is_held(trans->ctx.net));
> 
> This will swap and ensure all stores are visible.

Thank you Pablo and Florian for pointing it out. I will fix it following
your suggestions.

Thanks,
Quan
Florian Westphal March 12, 2024, 2:33 p.m. UTC | #9
Quan Tian <tianquan23@gmail.com> wrote:
> In nf_tables_commit():
> The 1st trans swaps old udata with 1st new udata;
> The 2nd trans swaps 1st new udata with 2nd new udata.
> 
> In nft_commit_release():
> The 1st trans frees old udata;
> The 2nd trans frees 1st new udata.
> 
> So multiple udata requests in a batch could work?

Yes, it could work indeed but we got bitten by
subtle bugs with back-to-back updates.

If there is a simple way to detect and reject
this, then I believe it's better to disallow it.

Unless you come up with a use-case where such back-to-back
udata updates make sense, of course.
Pablo Neira Ayuso March 12, 2024, 2:36 p.m. UTC | #10
On Tue, Mar 12, 2024 at 10:26:17PM +0800, Quan Tian wrote:
> On Tue, Mar 12, 2024 at 02:01:34PM +0100, Florian Westphal wrote:
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > > AFAICS this means that if the table as udata attached, and userspace
> > > > makes an update request without a UDATA netlink attribute, we will
> > > > delete the existing udata.
> > > > 
> > > > Is that right?
> > > > 
> > > > My question is, should we instead leave the existing udata as-is and not
> > > > support removal, only replace?
> > > 
> > > I would leave it in place too if no _USERDATA is specified.
> > > 
> 
> Sure, I will change it in the proposed way.
> 
> > > One more question is if the memcmp() with old and new udata makes
> > > sense considering two consecutive requests for _USERDATA update in one
> > > batch.
> > 
> > Great point, any second udata change request in the same batch must fail.
> > 
> > We learned this the hard way with flag updates :(
> 
> Is it the same as two consecutive requests for chain name update and
> chain stats update? 
> 
> In nf_tables_commit():
> The 1st trans swaps old udata with 1st new udata;
> The 2nd trans swaps 1st new udata with 2nd new udata.
>
> In nft_commit_release():
> The 1st trans frees old udata;
> The 2nd trans frees 1st new udata.

And the abort path simply releases the transaction objects, since udata
was left intact in place.

> So multiple udata requests in a batch could work?

Yes, if rcu is used correctly, that should work fine. But memcmp()
needs to be removed.
Pablo Neira Ayuso March 12, 2024, 10:23 p.m. UTC | #11
On Tue, Mar 12, 2024 at 03:33:00PM +0100, Florian Westphal wrote:
> Quan Tian <tianquan23@gmail.com> wrote:
> > In nf_tables_commit():
> > The 1st trans swaps old udata with 1st new udata;
> > The 2nd trans swaps 1st new udata with 2nd new udata.
> > 
> > In nft_commit_release():
> > The 1st trans frees old udata;
> > The 2nd trans frees 1st new udata.
> > 
> > So multiple udata requests in a batch could work?
> 
> Yes, it could work indeed but we got bitten by
> subtle bugs with back-to-back updates.

yes, we have seen subtle bugs recently. As for the table flags, the
dormant flag has been particularly a problem. It is a tricky one because
it registers hooks from the preparation step (which might fail), but it
cannot register hooks there because the abort path might need to undo
things, and re-registering the hooks could lead to failures from a later
path which does not admit failures. For the dormant flag, another
possibility would be to handle this from the core, so there is no need
to register and unregister hooks, instead simply mark them as inactive.

The dormant flag use case is rather a corner case, but hardware offload
will sooner or later require a mode to run in _hardware mode only_
(currently it is both hardware and software for nftables). Considering
the hardware offload API has been designed for packet classifiers from
the late 90s (that is, expressing your policy very strictly as a linear
ruleset), that means dropping packets early is fine, but wanted traffic
gets evaluated in a linear list.

> If there is a simple way to detect and reject
> this then I believe it's better to disallow it.

That requires iterating over the list of transactions, or adding some
kind of flag to reject this.

> Unless you come up with a use-case where such back-to-back
> udata updates make sense of course.

I don't have a use-case for this table userdata myself; it is
currently only used to store comments by userspace. Why someone would
want to update such a comment associated with a table, I don't know.

I would like to know if there are plans to submit similar patches for
other objects. As for sets, this needs care because the userdata
contains the set description.
Quan Tian March 13, 2024, 1:35 a.m. UTC | #12
On Tue, Mar 12, 2024 at 11:23:56PM +0100, Pablo Neira Ayuso wrote:
> On Tue, Mar 12, 2024 at 03:33:00PM +0100, Florian Westphal wrote:
> > Quan Tian <tianquan23@gmail.com> wrote:
> > > In nf_tables_commit():
> > > The 1st trans swaps old udata with 1st new udata;
> > > The 2nd trans swaps 1st new udata with 2nd new udata.
> > > 
> > > In nft_commit_release():
> > > The 1st trans frees old udata;
> > > The 2nd trans frees 1st new udata.
> > > 
> > > So multiple udata requests in a batch could work?
> > 
> > Yes, it could work indeed but we got bitten by
> > subtle bugs with back-to-back updates.
> 
> yes, we have seen subtle bugs recently. As for the table flags, the
> dormant flag has been particularly a problem, it is tricky one because
> it registers hooks from preparation step (which might fail) but it
> cannot register hooks because abort path might need undo things, and
> re-register the hooks could lead to failures from a later path which
> does not admit failures. For the dormant flag, another possibility
> would be to handle this from the core, so there is no need to register
> and unregister hooks, instead simply set them as inactive.
> 
> The dormant flag use case is rather corner case, but hardware offload
> will require sooner or later a mode to run in _hardware mode only_
> (currently it is both hardware and software for nftables) and
> considering the hardware offload API has been designed for packet
> classifiers from the late 90s (that is, very strictly express your
> policy in a linear ruleset) that means dropping packets early is fine,
> but wanted traffic gets evaluated in a linear list.
> 
> > If there is a simple way to detect and reject
> > this then I believe it's better to disallow it.
> 
> That requires to iterate over the list of transaction, or add some
> kind of flag to reject this.
> 
> > Unless you come up with a use-case where such back-to-back
> > udata updates make sense of course.
> 
> I don't have a use-case for this table userdata myself, this is
> currently only used to store comments by userspace, why someone would
> be willing to update such comment associated to a table, I don't know.

There was a use-case in kube-proxy where we wanted to use the comment to
store version/compatibility information, so we could know whether it
had to recreate the whole table when upgrading to a new version due to
incompatible chain/rule changes (e.g. a chain's hook and priority is
changed). The reason we wanted to avoid recreating the whole table
when it doesn't have to is that deleting the table would also
destroy the dynamic sets in the table, losing some data.

Another minor reason is that the comments could be used to store
human-readable explanations for the objects, and these may change when
the objects' functions change. It would be great if they could be
updated via the "add" operation; otherwise "delete+add" would always be
needed to keep comments up-to-date.

However, I don't know a use-case for back-to-back comment updates, so I
will check which one is more complex and risky: rejecting it or
supporting it.

> I would like to know if there are plans to submit similar patches for
> other objects. As for sets, this needs to be careful because userdata
> contains the set description.

As explained above, I think there is some value in supporting comment
updates for other objects, especially when the objects contain dynamic
data. But I understand there are risks like you mentioned, and it would
need to be more careful than with tables. I would volunteer to give it a
try for other objects, starting with objects similar to tables first, if
it doesn't sound bad to you. Please let me know if you feel it's not
worth it.

Thanks,
Quan
Pablo Neira Ayuso March 13, 2024, 9:40 a.m. UTC | #13
On Wed, Mar 13, 2024 at 09:35:10AM +0800, Quan Tian wrote:
> On Tue, Mar 12, 2024 at 11:23:56PM +0100, Pablo Neira Ayuso wrote:
[...]
> > I don't have a use-case for this table userdata myself, this is
> > currently only used to store comments by userspace, why someone would
> > be willing to update such comment associated to a table, I don't know.
> 
> There was a use-case in kube-proxy that we wanted to use comment to
> store version/compatibility information so we could know whether it
> has to recreate the whole table when upgrading to a new version due to
> incompatible chain/rule changes (e.g. a chain's hook and priority is
> changed). The reason why we wanted to avoid recreating the whole table
> when it doesn't have to is because deleting the table would also
> > destroy the dynamic sets in the table, losing some data.

There is a generation number which gets bumped for each ruleset
update; it is currently global.

Would having such a generation ID per table help, or do you need more
flexibility in what needs to be stored in the userdata area?

> Another minor reason is that the comments could be used to store
> human-readable explanation for the objects. And they may be change when
> the objects' functions change. It would be great if they could be
> updated via the "add" operation, otherwise "delete+add" would always be
> needed to keep comments up-to-date.
> 
> However, I don't know a use-case for back-to-back comment updates, I
> will check which one is more complex and risky: rejecting it or
> supporting it.

We have these back-to-back updates everywhere. The original idea was to
allow robots to place all commands in a batch and send it to the
kernel. If the robot adds objects and then, while the same batch is
being collected, another update on such an object happens, including one
that cancels the object that was just added, this is accepted, since
such nonsense is valid according to the transactional model.

In this transactional model, batch processing also continues on
errors rather than stopping on the first error; error unwinding is
special in that sense because objects remain in place and are just
marked as deleted. This allows for getting all errors to userspace at
once.

It is harder than classic netdev interfaces in the kernel.

Florian has concerns about this transactional model being too hard because
of the recent bug reports. If the direction is to tighten this, then
I think it all should be revised, not just rejecting this userdata update
while everything else still allows for back-to-back updates.

This is of course not related to your patch.

> > I would like to know if there are plans to submit similar patches for
> > other objects. As for sets, this needs to be careful because userdata
> > contains the set description.
> 
> As explained above, I think there is some value in supporting comment
> update for other objects, especially when the objects contain dynamic
> data. But I understand there are risks like you mentioned and it would
> need to be more careful than tables. I would volunteer to take a try
> for other objects, starting with objects similiar to tables first, if
> it doesn’t sound bad to you. Please let me know if you feel it's not
> worth.

I am not saying it is not worth it; I am trying to understand your
requirements in case there is a chance to provide an alternative path
to using _USERDATA for this.

Thanks for explaining.
Quan Tian March 13, 2024, 5:08 p.m. UTC | #14
On Wed, Mar 13, 2024 at 10:40:05AM +0100, Pablo Neira Ayuso wrote:
> On Wed, Mar 13, 2024 at 09:35:10AM +0800, Quan Tian wrote:
> > On Tue, Mar 12, 2024 at 11:23:56PM +0100, Pablo Neira Ayuso wrote:
> [...]
> > > I don't have a use-case for this table userdata myself, this is
> > > currently only used to store comments by userspace, why someone would
> > > be willing to update such comment associated to a table, I don't know.
> > 
> > There was a use-case in kube-proxy that we wanted to use comment to
> > store version/compatibility information so we could know whether it
> > has to recreate the whole table when upgrading to a new version due to
> > incompatible chain/rule changes (e.g. a chain's hook and priority is
> > changed). The reason why we wanted to avoid recreating the whole table
> > when it doesn't have to is because deleting the table would also
> > destroy the dynamic sets in the table, losing some data.
> 
> There is a generation number which gets bumped for each ruleset
> update which is currently global.
> 
> Would having such generation ID per table help or you need more
> flexibility in what needs to be stored in the userdata area?

An auto-incremented generation ID may not meet the above requirement, as
the table could also be frequently updated by configuration changes at
runtime, not only after application upgrades. An example scenario
could be: say we have a load-balancer application based on nftables:

* LB v0.1.0 installs a collection of nftable rules;
* LB v0.1.1 makes some changes compatible with v0.1.0. In the upgrade case,
  it doesn't need to recreate the whole table if the existing rules
  were installed by version >= 0.1.0;
* LB v0.2.0 makes some changes incompatible with v0.1.x (e.g. some
  chains' priorities are changed), it needs to recreate the whole table
  when upgrading from v0.1.x to it;
* LB v0.2.1 makes some changes compatible with v0.2.0, it doesn't need
  to recreate the table when upgrading from v0.2.0 to it but needs to
  recreate it when upgrading from v0.1.x to it.

With support for comment updates, we can store, get, and update
such version information in the comment, and use it to determine when
it's necessary to recreate the table.
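The decision rule in the scenario above boils down to comparing the version stored in the comment against a compatibility floor, i.e. the newest release that introduced an incompatible change (a sketch with an assumed major.minor scheme; the struct and helper names are illustrative, not from kube-proxy):

```c
#include <assert.h>
#include <stdbool.h>

struct ver { int major, minor; };

/*
 * Recreate the table when the rules were installed by a version older
 * than the compatibility floor. In the scenario above, the floor moves
 * to v0.2.0 once the incompatible chain-priority change lands.
 */
static bool must_recreate(struct ver installed, struct ver floor)
{
	if (installed.major != floor.major)
		return installed.major < floor.major;
	return installed.minor < floor.minor;
}
```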

>> If there is a simple way to detect and reject
>> this then I believe it's better to disallow it.
>
> That requires to iterate over the list of transaction, or add some
> kind of flag to reject this.

I tried to detect back-to-back userdata comment updates, but found it's
more complex than I thought. A flag may not work, because it can only
tell whether the 1st update touched the comment; we don't know whether
the 2nd update is the same as the 1st one unless we iterate over the
list of transactions, which looks a bit overkill. It seems simpler if we
just allow it and accept the last userdata.

>> My question is, should we instead leave the existing udata as-is and not
>> support removal, only replace?
>
>I would leave it in place too if no _USERDATA is specified.

I got a question when trying to implement this. If the update request
specifies USERDATA but its length is 0, do we treat it like it was not
specified, or remove the existing userdata? I am asking because I see
nf_tables_newrule() treats 0 length in the same way as unspecified and
doesn't initialize an nft_userdata. That makes sense for create, but I'm
not sure if it's right to do the same for update.

And if we want to treat it as "remove comment", we can't simply add a
single pointer of nft_userdata to nft_trans_table to achieve it, because
there would be 3 cases: leave it in place, remove it, update it.
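One way to express those three cases is a small tri-state derived from the attribute, rather than a bare pointer in nft_trans_table (purely a user-space sketch; the enum and helper are hypothetical, not part of the patch):

```c
#include <assert.h>
#include <stddef.h>

/* The three possible outcomes of an update with respect to userdata. */
enum udata_op {
	UDATA_KEEP,	/* attribute absent: leave table->udata alone */
	UDATA_REMOVE,	/* attribute present with length 0 */
	UDATA_REPLACE,	/* attribute present with a payload */
};

/* Classify a netlink-style attribute (NULL means not specified). */
static enum udata_op udata_classify(const void *attr, size_t len)
{
	if (!attr)
		return UDATA_KEEP;
	return len ? UDATA_REPLACE : UDATA_REMOVE;
}
```

Storing such an op code in the transaction would let the commit path distinguish "leave in place" from "remove" without overloading a single NULL pointer.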

Thanks,
Quan

Patch

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 144dc469ebf8..43c747bd3f80 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -1684,10 +1684,13 @@  struct nft_trans_chain {
 
 struct nft_trans_table {
 	bool				update;
+	struct nlattr			*udata;
 };
 
 #define nft_trans_table_update(trans)	\
 	(((struct nft_trans_table *)trans->data)->update)
+#define nft_trans_table_udata(trans)	\
+	(((struct nft_trans_table *)trans->data)->udata)
 
 struct nft_trans_elem {
 	struct nft_set			*set;
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 85088297dd0d..62a2a1798052 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -188,6 +188,17 @@  static struct nlattr *nft_userdata_dup(const struct nlattr *udata, gfp_t gfp)
 	return kmemdup(udata, nla_total_size(nla_len(udata)), gfp);
 }
 
+static bool nft_userdata_is_same(const struct nlattr *a, const struct nlattr *b)
+{
+	if (!a && !b)
+		return true;
+	if (!a || !b)
+		return false;
+	if (nla_len(a) != nla_len(b))
+		return false;
+	return !memcmp(nla_data(a), nla_data(b), nla_len(a));
+}
+
 static void __nft_set_trans_bind(const struct nft_ctx *ctx, struct nft_set *set,
 				 bool bind)
 {
@@ -1210,16 +1221,16 @@  static int nf_tables_updtable(struct nft_ctx *ctx)
 {
 	struct nft_trans *trans;
 	u32 flags;
+	const struct nlattr *udata = ctx->nla[NFTA_TABLE_USERDATA];
 	int ret;
 
-	if (!ctx->nla[NFTA_TABLE_FLAGS])
-		return 0;
-
-	flags = ntohl(nla_get_be32(ctx->nla[NFTA_TABLE_FLAGS]));
-	if (flags & ~NFT_TABLE_F_MASK)
-		return -EOPNOTSUPP;
+	if (ctx->nla[NFTA_TABLE_FLAGS]) {
+		flags = ntohl(nla_get_be32(ctx->nla[NFTA_TABLE_FLAGS]));
+		if (flags & ~NFT_TABLE_F_MASK)
+			return -EOPNOTSUPP;
+	}
 
-	if (flags == ctx->table->flags)
+	if (flags == ctx->table->flags && nft_userdata_is_same(udata, ctx->table->udata))
 		return 0;
 
 	if ((nft_table_has_owner(ctx->table) &&
@@ -1240,6 +1251,14 @@  static int nf_tables_updtable(struct nft_ctx *ctx)
 	if (trans == NULL)
 		return -ENOMEM;
 
+	if (udata) {
+		nft_trans_table_udata(trans) = nft_userdata_dup(udata, GFP_KERNEL_ACCOUNT);
+		if (!nft_trans_table_udata(trans)) {
+			ret = -ENOMEM;
+			goto err_table_udata;
+		}
+	}
+
 	if ((flags & NFT_TABLE_F_DORMANT) &&
 	    !(ctx->table->flags & NFT_TABLE_F_DORMANT)) {
 		ctx->table->flags |= NFT_TABLE_F_DORMANT;
@@ -1271,6 +1290,8 @@  static int nf_tables_updtable(struct nft_ctx *ctx)
 
 err_register_hooks:
 	ctx->table->flags |= NFT_TABLE_F_DORMANT;
+	kfree(nft_trans_table_udata(trans));
+err_table_udata:
 	nft_trans_destroy(trans);
 	return ret;
 }
@@ -9378,6 +9399,9 @@  static void nft_obj_commit_update(struct nft_trans *trans)
 static void nft_commit_release(struct nft_trans *trans)
 {
 	switch (trans->msg_type) {
+	case NFT_MSG_NEWTABLE:
+		kfree(nft_trans_table_udata(trans));
+		break;
 	case NFT_MSG_DELTABLE:
 	case NFT_MSG_DESTROYTABLE:
 		nf_tables_table_destroy(&trans->ctx);
@@ -10140,19 +10164,18 @@  static int nf_tables_commit(struct net *net, struct sk_buff *skb)
 		switch (trans->msg_type) {
 		case NFT_MSG_NEWTABLE:
 			if (nft_trans_table_update(trans)) {
-				if (!(trans->ctx.table->flags & __NFT_TABLE_F_UPDATE)) {
-					nft_trans_destroy(trans);
-					break;
+				if (trans->ctx.table->flags & __NFT_TABLE_F_UPDATE) {
+					if (trans->ctx.table->flags & NFT_TABLE_F_DORMANT)
+						nf_tables_table_disable(net, trans->ctx.table);
+					trans->ctx.table->flags &= ~__NFT_TABLE_F_UPDATE;
 				}
-				if (trans->ctx.table->flags & NFT_TABLE_F_DORMANT)
-					nf_tables_table_disable(net, trans->ctx.table);
-
-				trans->ctx.table->flags &= ~__NFT_TABLE_F_UPDATE;
+				swap(trans->ctx.table->udata, nft_trans_table_udata(trans));
+				nf_tables_table_notify(&trans->ctx, NFT_MSG_NEWTABLE);
 			} else {
 				nft_clear(net, trans->ctx.table);
+				nf_tables_table_notify(&trans->ctx, NFT_MSG_NEWTABLE);
+				nft_trans_destroy(trans);
 			}
-			nf_tables_table_notify(&trans->ctx, NFT_MSG_NEWTABLE);
-			nft_trans_destroy(trans);
 			break;
 		case NFT_MSG_DELTABLE:
 		case NFT_MSG_DESTROYTABLE:
@@ -10430,6 +10453,7 @@  static int __nf_tables_abort(struct net *net, enum nfnl_abort_action action)
 		switch (trans->msg_type) {
 		case NFT_MSG_NEWTABLE:
 			if (nft_trans_table_update(trans)) {
+				kfree(nft_trans_table_udata(trans));
 				if (!(trans->ctx.table->flags & __NFT_TABLE_F_UPDATE)) {
 					nft_trans_destroy(trans);
 					break;