
[nf-next,1/3] netfilter: nf_tables: add generation mask to table objects

Message ID 1438679128-4146-1-git-send-email-pablo@netfilter.org
State Changes Requested
Delegated to: Pablo Neira

Commit Message

Pablo Neira Ayuso Aug. 4, 2015, 9:05 a.m. UTC
The dumping of table objects can be inconsistent when interfering with the
preparation phase of our 2-phase commit protocol because:

1) We remove objects from the lists during the preparation phase, and they can be
   re-added from the abort step. Thus, we may miss objects that are still
   active.

2) We add new objects to the lists during the preparation phase, so we may get
   objects that are not yet active with an internal flag set.

We can resolve this problem with generation masks, as we already do for rules
when we expose them to the packet path.

After this change, we always obtain a consistent list as long as we stay in the
same generation. The userspace side can detect interferences through the
generation counter. If so, it needs to restart.

As a result, we can get rid of the internal NFT_TABLE_INACTIVE flag.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_tables.h |    4 +-
 net/netfilter/nf_tables_api.c     |  104 ++++++++++++++++++++++++-------------
 2 files changed, 71 insertions(+), 37 deletions(-)
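
The generation-mask idea from the commit message can be sketched as a small
userspace model. This is only an illustration under stated assumptions: the
names (gen_state, gen_obj, obj_activate_next, ...) are hypothetical and are
not the identifiers used by the patch, and a set bit is assumed to mean
"inactive in that generation".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; names are illustrative, not the kernel API. */
struct gen_state { unsigned int cursor; };	/* 0 or 1: index of the current generation bit */
struct gen_obj   { uint8_t genmask; };		/* a set bit means "inactive in that generation" */

static inline uint8_t gen_cur_bit(const struct gen_state *g)
{
	return (uint8_t)(1u << g->cursor);
}

static inline uint8_t gen_next_bit(const struct gen_state *g)
{
	return (uint8_t)(1u << (g->cursor ^ 1u));
}

/* What a dump (current generation) is allowed to see. */
static inline bool obj_active_cur(const struct gen_state *g, const struct gen_obj *o)
{
	return (o->genmask & gen_cur_bit(g)) == 0;
}

/* What a pending batch (next generation) is allowed to see. */
static inline bool obj_active_next(const struct gen_state *g, const struct gen_obj *o)
{
	return (o->genmask & gen_next_bit(g)) == 0;
}

/* New object: hidden from the current generation, visible in the next one. */
static inline void obj_activate_next(const struct gen_state *g, struct gen_obj *o)
{
	o->genmask = gen_cur_bit(g);
}

/* Deleted object: still visible now, hidden from the next generation on. */
static inline void obj_deactivate_next(const struct gen_state *g, struct gen_obj *o)
{
	o->genmask = gen_next_bit(g);
}

/* Commit: the next generation becomes the current one. */
static inline void gen_commit(struct gen_state *g)
{
	g->cursor ^= 1u;
}

/* After the commit, drop the stale bit so the object is active in both generations. */
static inline void obj_clear(const struct gen_state *g, struct gen_obj *o)
{
	o->genmask &= (uint8_t)~gen_next_bit(g);
}
```

In this model a dump only reports objects for which obj_active_cur() holds,
so objects added or removed by a batch that is still being prepared do not
change what the dump sees until gen_commit() runs.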

Comments

Patrick McHardy Aug. 4, 2015, 9:09 a.m. UTC | #1
On 04.08, Pablo Neira Ayuso wrote:
> The dumping of table objects can be inconsistent when interfering with the
> preparation phase of our 2-phase commit protocol because:
> 
> 1) We remove objects from the lists during the preparation phase, that can be
>    added re-added from the abort step. Thus, we may miss objects that are still
>    active.
> 
> 2) We add new objects to the lists during the preparation phase, so we may get
>    objects that are not yet active with an internal flag set.
> 
> We can resolve this problem with generation masks, as we already do for rules
> when we expose them to the packet path.
> 
> After this change, we always obtain a consistent list as long as we stay in the
> same generation. The userspace side can detect interferences through the
> generation counter. If so, it needs to restart.
> 
> As a result, we can get rid of the internal NFT_TABLE_INACTIVE flag.

I have a similar patch queued up, however there seems to be something missing
in this patch. The lookup functions need to take the genmask into account.
Otherwise you can not delete and add a new table in the same batch. The same
holds for all other object types.

> +static struct nft_table *nf_tables_table_lookup(struct net *net,
> +						const struct nft_af_info *afi,
> +						const struct nlattr *nla,
> +						bool trans)
>  {
>  	struct nft_table *table;
>  
> @@ -382,10 +411,10 @@ static struct nft_table *nf_tables_table_lookup(const struct nft_af_info *afi,
>  		return ERR_PTR(-EINVAL);
>  
>  	table = nft_table_lookup(afi, nla);
> -	if (table != NULL)
> -		return table;
> +	if (table == NULL || (trans && !nft_table_is_active_next(net, table)))
> +		return ERR_PTR(-ENOENT);

We really need to check the genid itself, in some cases we *only* want
currently active tables, for instance gettable and dumps.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso Aug. 4, 2015, 9:29 a.m. UTC | #2
On Tue, Aug 04, 2015 at 11:09:17AM +0200, Patrick McHardy wrote:
> On 04.08, Pablo Neira Ayuso wrote:
> > The dumping of table objects can be inconsistent when interfering with the
> > preparation phase of our 2-phase commit protocol because:
> > 
> > 1) We remove objects from the lists during the preparation phase, that can be
> >    added re-added from the abort step. Thus, we may miss objects that are still
> >    active.
> > 
> > 2) We add new objects to the lists during the preparation phase, so we may get
> >    objects that are not yet active with an internal flag set.
> > 
> > We can resolve this problem with generation masks, as we already do for rules
> > when we expose them to the packet path.
> > 
> > After this change, we always obtain a consistent list as long as we stay in the
> > same generation. The userspace side can detect interferences through the
> > generation counter. If so, it needs to restart.
> > 
> > As a result, we can get rid of the internal NFT_TABLE_INACTIVE flag.
> 
> I have a similar patch queued up, however there seems to be something missing
> in this patch. The lookup functions need to take the genmask into account.

They already do for the deletion case, so we hit -ENOENT for objects
that have been deleted in this batch, so we cannot delete objects
twice.

> Otherwise you can not delete and add a new table in the same batch. The same
> holds for all other object types.

I can with this patch, we always operate with the *next* bit to
indicate that the object will be inactive in the future.

> > +static struct nft_table *nf_tables_table_lookup(struct net *net,
> > +						const struct nft_af_info *afi,
> > +						const struct nlattr *nla,
> > +						bool trans)
> >  {
> >  	struct nft_table *table;
> >  
> > @@ -382,10 +411,10 @@ static struct nft_table *nf_tables_table_lookup(const struct nft_af_info *afi,
> >  		return ERR_PTR(-EINVAL);
> >  
> >  	table = nft_table_lookup(afi, nla);
> > -	if (table != NULL)
> > -		return table;
> > +	if (table == NULL || (trans && !nft_table_is_active_next(net, table)))
> > +		return ERR_PTR(-ENOENT);
> 
> We really need to check the genid itself, in some cases we *only* want
> currently active tables, f.i. gettable and dumps.

This is what this patch is doing from the dump path.

We shouldn't check if the object is active from the lookup function if
we're in the middle of a transaction, since we hold the lock there is
no way we can see inactive objects in the list. There's only one
transaction at the same time.
--
Patrick McHardy Aug. 4, 2015, 10:26 a.m. UTC | #3
On 04.08, Pablo Neira Ayuso wrote:
> On Tue, Aug 04, 2015 at 11:09:17AM +0200, Patrick McHardy wrote:
> > On 04.08, Pablo Neira Ayuso wrote:
> > > The dumping of table objects can be inconsistent when interfering with the
> > > preparation phase of our 2-phase commit protocol because:
> > > 
> > > 1) We remove objects from the lists during the preparation phase, that can be
> > >    added re-added from the abort step. Thus, we may miss objects that are still
> > >    active.
> > > 
> > > 2) We add new objects to the lists during the preparation phase, so we may get
> > >    objects that are not yet active with an internal flag set.
> > > 
> > > We can resolve this problem with generation masks, as we already do for rules
> > > when we expose them to the packet path.
> > > 
> > > After this change, we always obtain a consistent list as long as we stay in the
> > > same generation. The userspace side can detect interferences through the
> > > generation counter. If so, it needs to restart.
> > > 
> > > As a result, we can get rid of the internal NFT_TABLE_INACTIVE flag.
> > 
> > I have a similar patch queued up, however there seems to be something missing
> > in this patch. The lookup functions need to take the genmask into account.
> 
> They already do for the deletion case, so we hit -ENOENT for objects
> that has been deleted in this batch, so we cannot delete objects
> twice.

> @@ -829,10 +860,10 @@ static int nf_tables_deltable(struct sock *nlsk, struct sk_buff *skb,
>       if (IS_ERR(afi))
>               return PTR_ERR(afi);

> -	table = nf_tables_table_lookup(afi, nla[NFTA_TABLE_NAME]);
> +	table = nf_tables_table_lookup(net, afi, nla[NFTA_TABLE_NAME], true);
>       if (IS_ERR(table))
>               return PTR_ERR(table);
> -	if (table->flags & NFT_TABLE_INACTIVE)
> +	if (!nft_table_is_active(net, table))
                return -ENOENT;

Looking at it, that part seems wrong. They need to be active in the *next*
generation, not the current one, to be deleted. All netlink actions only
affect the next generation.

The same bug is present in multiple locations.
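
A minimal sketch of the rule described above, in a userspace model with
hypothetical names (model_table, model_deltable); it is not the code from the
patch. It assumes a set genmask bit means "inactive in that generation" and
validates a delete request against the next generation only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: a set bit in genmask means "inactive in that generation". */
struct model_table { uint8_t genmask; };

static inline uint8_t next_bit(unsigned int cursor)
{
	return (uint8_t)(1u << (cursor ^ 1u));
}

static inline bool active_next(unsigned int cursor, const struct model_table *t)
{
	return (t->genmask & next_bit(cursor)) == 0;
}

/* Delete request in the preparation phase: only the next generation matters. */
static inline int model_deltable(unsigned int cursor, struct model_table *t)
{
	if (!active_next(cursor, t))
		return -ENOENT;			/* already deleted in this batch */
	t->genmask |= next_bit(cursor);		/* inactive from the next generation on */
	return 0;
}
```

With this check a second delete of the same table in one batch fails with
-ENOENT, while deleting a table that was added earlier in the same batch is
accepted, because that table is active in the next generation.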

> > Otherwise you can not delete and add a new table in the same batch. The same
> > holds for all other object types.
> 
> I can with this patch, we always operate with the *next* bit to
> indicate that the object will be inactive in the future.
> 
> > > +static struct nft_table *nf_tables_table_lookup(struct net *net,
> > > +						const struct nft_af_info *afi,
> > > +						const struct nlattr *nla,
> > > +						bool trans)
> > >  {
> > >  	struct nft_table *table;
> > >  
> > > @@ -382,10 +411,10 @@ static struct nft_table *nf_tables_table_lookup(const struct nft_af_info *afi,
> > >  		return ERR_PTR(-EINVAL);
> > >  
> > >  	table = nft_table_lookup(afi, nla);
> > > -	if (table != NULL)
> > > -		return table;
> > > +	if (table == NULL || (trans && !nft_table_is_active_next(net, table)))
> > > +		return ERR_PTR(-ENOENT);
> > 
> > We really need to check the genid itself, in some cases we *only* want
> > currently active tables, f.i. gettable and dumps.
> 
> This is what this patch is doing from the dump path.
> 
> We shouldn't check if the object is active from the lookup function if
> we're in the middle of a transaction, since we hold the lock there is
> no way we can see inactive objects in the list. There's only one
> transaction at the same time.

That's not entirely correct. Dump continuations happen asynchronously to
netlink modifications and commit operations, so the genid may bump in the
middle. We can get an inconsistent view if we have:

			dump set elements from set x table y
delete table y
create table y
create set x
begin commit
			continue dump from new set
commit, send NEWGEN

Sure, we will get a NEWGEN message, but at that time we might already have
sent a full message for the new table/set since that message is only sent
after the commit is completed.
--
Pablo Neira Ayuso Aug. 4, 2015, 5:04 p.m. UTC | #4
On Tue, Aug 04, 2015 at 12:26:35PM +0200, Patrick McHardy wrote:
> On 04.08, Pablo Neira Ayuso wrote:
> > On Tue, Aug 04, 2015 at 11:09:17AM +0200, Patrick McHardy wrote:
[...]
> > > I have a similar patch queued up, however there seems to be something missing
> > > in this patch. The lookup functions need to take the genmask into account.
> > 
> > They already do for the deletion case, so we hit -ENOENT for objects
> > that has been deleted in this batch, so we cannot delete objects
> > twice.
> 
> > @@ -829,10 +860,10 @@ static int nf_tables_deltable(struct sock *nlsk, struct sk_buff *skb,
> >       if (IS_ERR(afi))
> >               return PTR_ERR(afi);
> 
> > -	table = nf_tables_table_lookup(afi, nla[NFTA_TABLE_NAME]);
> > +	table = nf_tables_table_lookup(net, afi, nla[NFTA_TABLE_NAME], true);
> >       if (IS_ERR(table))
> >               return PTR_ERR(table);
> > -	if (table->flags & NFT_TABLE_INACTIVE)
> > +	if (!nft_table_is_active(net, table))
>                 return -ENOENT;
> 
> Looking at it, that part seems wrong. They need to be active in the *next*
> generation, not the current one, to be deleted. All netlink actions only
> affect the next generation.
> 
> The same bug is present in multiple locations.

That check is there to avoid the deletion of a table that has been
added in this batch, unlike the delete + add, the add + delete in the
same batch doesn't make much sense.

Revisiting this scenario, this is how this looks if we remove that check:

preparation starts:

add: table X (10), added to table list (now inactive)
del: table X (11), inactive next.
              ^
              gencursor

commit starts (update gencursor):

add: table X (01): clear past and report event, *NOTE*: the rule table is inactive.
del: table X (01): delete from list and report event.
               ^
               gencursor

So it seems it should be fine to remove it as it is defensive. I think
robots can generate this kind of command placing updates in a batch,
anyway that should come in a follow up patch IMO.
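
The add + delete walkthrough above can be replayed in a small userspace
model. This is a sketch under assumptions, not the patch itself: the types
and functions (model_batch, model_prepare, model_commit) are hypothetical, a
set genmask bit means "inactive in that generation", and the batch holds at
most four operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum model_op    { MODEL_ADD, MODEL_DEL };
enum model_event { EV_NEWTABLE, EV_DELTABLE };

struct model_table { uint8_t genmask; bool linked; };

struct model_batch {
	unsigned int cursor;		/* current generation bit index */
	enum model_op ops[4];		/* queued operations */
	int nops;
	enum model_event events[4];	/* events reported at commit time */
	int nevents;
};

static inline uint8_t bit_cur(const struct model_batch *b)
{
	return (uint8_t)(1u << b->cursor);
}

static inline uint8_t bit_next(const struct model_batch *b)
{
	return (uint8_t)(1u << (b->cursor ^ 1u));
}

/* Preparation phase: queue the operation and update the genmask only. */
static inline bool model_prepare(struct model_batch *b, struct model_table *t,
				 enum model_op op)
{
	if (b->nops >= 4)
		return false;
	if (op == MODEL_ADD) {
		t->linked = true;		/* added to the list, inactive for now */
		t->genmask = bit_cur(b);
	} else {
		t->genmask |= bit_next(b);	/* inactive in the next generation too */
	}
	b->ops[b->nops++] = op;
	return true;
}

/* Commit phase: flip the cursor, then process the queued operations in order. */
static inline void model_commit(struct model_batch *b, struct model_table *t)
{
	b->cursor ^= 1u;
	for (int i = 0; i < b->nops; i++) {
		if (b->ops[i] == MODEL_ADD) {
			t->genmask &= (uint8_t)~bit_next(b);	/* clear the past bit */
			b->events[b->nevents++] = EV_NEWTABLE;
		} else {
			t->linked = false;			/* delete from the list */
			b->events[b->nevents++] = EV_DELTABLE;
		}
	}
	b->nops = 0;
}
```

In this model the table is never active in any generation: after the commit
its current-generation bit is still set when the new-table event is
reported, and the delete step then unlinks it.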

> > > Otherwise you can not delete and add a new table in the same batch. The same
> > > holds for all other object types.
> > 
> > I can with this patch, we always operate with the *next* bit to
> > indicate that the object will be inactive in the future.
> > 
> > > > +static struct nft_table *nf_tables_table_lookup(struct net *net,
> > > > +						const struct nft_af_info *afi,
> > > > +						const struct nlattr *nla,
> > > > +						bool trans)
> > > >  {
> > > >  	struct nft_table *table;
> > > >  
> > > > @@ -382,10 +411,10 @@ static struct nft_table *nf_tables_table_lookup(const struct nft_af_info *afi,
> > > >  		return ERR_PTR(-EINVAL);
> > > >  
> > > >  	table = nft_table_lookup(afi, nla);
> > > > -	if (table != NULL)
> > > > -		return table;
> > > > +	if (table == NULL || (trans && !nft_table_is_active_next(net, table)))
> > > > +		return ERR_PTR(-ENOENT);
> > > 
> > > We really need to check the genid itself, in some cases we *only* want
> > > currently active tables, f.i. gettable and dumps.
> > 
> > This is what this patch is doing from the dump path.
> > 
> > We shouldn't check if the object is active from the lookup function if
> > we're in the middle of a transaction, since we hold the lock there is
> > no way we can see inactive objects in the list. There's only one
> > transaction at the same time.
> 
> That's not entirely correct. Dump continuations happen asynchronously to
> netlink modifications and commit operations, so the genid may bump in the
> middle. We can get an inconsistent view if we have:
> 
> 			dump set elements from set x table y
> delete table y
> create table y
> create set x
> begin commit
> 			continue dump from new set

We catch this from the nfnlhdr->res_id field in the nfnetlink message,
but see below.
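
A rough sketch of the res_id check on the userspace side. It assumes each
message of a multi-part dump carries the generation ID under which it was
built; model_dump_msg and model_dump_consistent are hypothetical names, not
an existing library API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one dump message: only the generation ID matters here. */
struct model_dump_msg { uint16_t res_id; };

/*
 * A dump is usable only if every part was produced under the same generation.
 * On a mismatch the caller is expected to throw the parts away and restart.
 */
static inline bool model_dump_consistent(const struct model_dump_msg *msgs, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		if (msgs[i].res_id != msgs[0].res_id)
			return false;	/* generation changed mid-dump */
	}
	return true;
}
```

This detects a generation bump between dump continuations, but only at the
granularity of the whole ruleset: any commit invalidates the dump, whatever
objects it touched.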

> commit, send NEWGEN
> 
> Sure, we will get a NEWGEN message, but at that time we might already have
> sent a full message for the new table/set since that message is only send
> after the commit is completed.

I agree that we could send an event message at the beginning of the commit
phase to announce the beginning of the new generation and another one to
indicate the end of this transaction.

- preparation phase -
delete table y
create table y
create set x
- commit phase -
send NEWGEN, attribute type: begin
delete table y
create table y
create set x
send NEWGEN, attribute type: end

Thanks for your feedback!
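
How a listener might consume the begin/end bracketing proposed above, as a
sketch only. The begin/end attribute is a proposal in this thread, so the
event kinds and the model_listener type are hypothetical, and the model
assumes events are delivered in order.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event kinds; the begin/end attribute is only a proposal here. */
enum model_event { MODEL_GEN_BEGIN, MODEL_GEN_END, MODEL_OBJ_EVENT };

struct model_listener {
	bool in_commit;		/* between NEWGEN begin and NEWGEN end */
	bool dump_running;	/* a multi-part dump is in flight */
	bool must_restart;	/* the dump overlapped a commit */
};

static inline void model_listener_feed(struct model_listener *l, enum model_event ev)
{
	switch (ev) {
	case MODEL_GEN_BEGIN:
		l->in_commit = true;
		if (l->dump_running)
			l->must_restart = true;	/* the dump spans two generations */
		break;
	case MODEL_GEN_END:
		l->in_commit = false;
		break;
	case MODEL_OBJ_EVENT:
		/* Object events are only expected inside a begin/end bracket. */
		assert(l->in_commit);
		break;
	}
}
```

The point of the begin marker is that the listener learns about the new
generation before any object event of that generation arrives, instead of
only after the commit has completed.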
--
Pablo Neira Ayuso Aug. 4, 2015, 6:21 p.m. UTC | #5
On Tue, Aug 04, 2015 at 11:09:17AM +0200, Patrick McHardy wrote:
> On 04.08, Pablo Neira Ayuso wrote:
> > The dumping of table objects can be inconsistent when interfering with the
> > preparation phase of our 2-phase commit protocol because:
> > 
> > 1) We remove objects from the lists during the preparation phase, that can be
> >    added re-added from the abort step. Thus, we may miss objects that are still
> >    active.
> > 
> > 2) We add new objects to the lists during the preparation phase, so we may get
> >    objects that are not yet active with an internal flag set.
> > 
> > We can resolve this problem with generation masks, as we already do for rules
> > when we expose them to the packet path.
> > 
> > After this change, we always obtain a consistent list as long as we stay in the
> > same generation. The userspace side can detect interferences through the
> > generation counter. If so, it needs to restart.
> > 
> > As a result, we can get rid of the internal NFT_TABLE_INACTIVE flag.
> 
> I have a similar patch queued up, however there seems to be something missing
> in this patch. The lookup functions need to take the genmask into account.
> Otherwise you can not delete and add a new table in the same batch. The same
> holds for all other object types.

I got what you meant, we have to skip the deleted table when iterating
over the list.
--
Patrick McHardy Aug. 5, 2015, 8:41 a.m. UTC | #6
On 04.08, Pablo Neira Ayuso wrote:
> On Tue, Aug 04, 2015 at 11:09:17AM +0200, Patrick McHardy wrote:
> > On 04.08, Pablo Neira Ayuso wrote:
> > > The dumping of table objects can be inconsistent when interfering with the
> > > preparation phase of our 2-phase commit protocol because:
> > > 
> > > 1) We remove objects from the lists during the preparation phase, that can be
> > >    added re-added from the abort step. Thus, we may miss objects that are still
> > >    active.
> > > 
> > > 2) We add new objects to the lists during the preparation phase, so we may get
> > >    objects that are not yet active with an internal flag set.
> > > 
> > > We can resolve this problem with generation masks, as we already do for rules
> > > when we expose them to the packet path.
> > > 
> > > After this change, we always obtain a consistent list as long as we stay in the
> > > same generation. The userspace side can detect interferences through the
> > > generation counter. If so, it needs to restart.
> > > 
> > > As a result, we can get rid of the internal NFT_TABLE_INACTIVE flag.
> > 
> > I have a similar patch queued up, however there seems to be something missing
> > in this patch. The lookup functions need to take the genmask into account.
> > Otherwise you can not delete and add a new table in the same batch. The same
> > holds for all other object types.
> 
> I got what you meant, we have to skip the delete table when iterating
> over the list.

Exactly. I'd propose to simply pass in the requested genid, this has the added
benefit of not having to sprinkle those checks throughout the code.
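
Passing the requested generation into the lookup could look roughly like the
following userspace sketch. The names (model_table, model_table_lookup) are
hypothetical and the table list is modelled as a plain array; a set genmask
bit is assumed to mean "inactive in that generation".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: a set bit in genmask means "inactive in that generation". */
struct model_table { const char *name; uint8_t genmask; };

/*
 * Return the first table called `name` that is active in the generation
 * selected by `genmask`, skipping entries that are inactive there.
 */
static inline struct model_table *
model_table_lookup(struct model_table *tables, size_t n,
		   const char *name, uint8_t genmask)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(tables[i].name, name) == 0 &&
		    (tables[i].genmask & genmask) == 0)
			return &tables[i];
	}
	return NULL;
}
```

With a delete + add of table "y" in one batch, the list holds two entries
named "y". A dump passes the current-generation bit and finds the old one; a
batch operation passes the next-generation bit and finds the new one, so the
callers need no extra activeness checks.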
--
Patrick McHardy Aug. 5, 2015, 9:09 a.m. UTC | #7
On 04.08, Pablo Neira Ayuso wrote:
> On Tue, Aug 04, 2015 at 12:26:35PM +0200, Patrick McHardy wrote:
> > On 04.08, Pablo Neira Ayuso wrote:
> > > On Tue, Aug 04, 2015 at 11:09:17AM +0200, Patrick McHardy wrote:
> [...]
> > > > I have a similar patch queued up, however there seems to be something missing
> > > > in this patch. The lookup functions need to take the genmask into account.
> > > 
> > > They already do for the deletion case, so we hit -ENOENT for objects
> > > that has been deleted in this batch, so we cannot delete objects
> > > twice.
> > 
> > > @@ -829,10 +860,10 @@ static int nf_tables_deltable(struct sock *nlsk, struct sk_buff *skb,
> > >       if (IS_ERR(afi))
> > >               return PTR_ERR(afi);
> > 
> > > -	table = nf_tables_table_lookup(afi, nla[NFTA_TABLE_NAME]);
> > > +	table = nf_tables_table_lookup(net, afi, nla[NFTA_TABLE_NAME], true);
> > >       if (IS_ERR(table))
> > >               return PTR_ERR(table);
> > > -	if (table->flags & NFT_TABLE_INACTIVE)
> > > +	if (!nft_table_is_active(net, table))
> >                 return -ENOENT;
> > 
> > Looking at it, that part seems wrong. They need to be active in the *next*
> > generation, not the current one, to be deleted. All netlink actions only
> > affect the next generation.
> > 
> > The same bug is present in multiple locations.
> 
> That check is there to avoid the deletion of a table that has been
> added in this batch, unlike the delete + add, the add + delete in the
> same batch doesn't make much sense.

It's still a valid sequence. All actions should only ever look at activeness
in the next generation since that is when the change will take effect.

> Revisiting this scenario, this how this looks if we remove that check:
> 
> preparation starts:
> 
> add: table X (10), added to table list (now inactive)
> del: table X (11), inactive next.
>               ^
>               gencursor
> 
> commit starts (update gencursor):
> 
> add: table X (01): clear past and report event, *NOTE*: the rule table is inactive.
> add: table X (01): delete from list and report event.
>                ^
>                gencursor
> 
> So it seems it should be fine to remove it as it is defensive. I think
> robots can generate this kind of command placing updates in a batch,
> anyway that should come in a follow up patch IMO.

I don't follow. Why add an unnecessary check just to remove it again?
As I said, the only thing that matters is the next generation, we should
never even look at the current one when performing actions.

> > > We shouldn't check if the object is active from the lookup function if
> > > we're in the middle of a transaction, since we hold the lock there is
> > > no way we can see inactive objects in the list. There's only one
> > > transaction at the same time.
> > 
> > That's not entirely correct. Dump continuations happen asynchronously to
> > netlink modifications and commit operations, so the genid may bump in the
> > middle. We can get an inconsistent view if we have:
> > 
> > 			dump set elements from set x table y
> > delete table y
> > create table y
> > create set x
> > begin commit
> > 			continue dump from new set
> 
> We catch this from the nfnlhdr->res_id field in the nfnetlink message,
> but see below.
> 
> > commit, send NEWGEN
> > 
> > Sure, we will get a NEWGEN message, but at that time we might already have
> > sent a full message for the new table/set since that message is only send
> > after the commit is completed.
> 
> I agree in that an event message at the beginning of the commit phase
> to announce the beginning new generation and another one to indicate
> of this transaction.
> 
> - preparation phase -
> delete table y
> create table y
> create set x
> - commit phase -
> send NEWGEN, attribute type: begin
> delete table y
> create table y
> create set x
> send NEWGEN, attribute type: end
> 
> Thanks for your feedback!

That might work if the message ordering is then guaranteed. However I think
we can fix this case without changing NEWGEN. Let me think about that a bit,
for now just taking care of the genid checks correctly seems like a good
step forward.

BTW, we also need to adjust loop detection to only take into account
active rules, active chains, active sets etc.
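
A simplified sketch of loop detection that ignores inactive objects. It is
not the kernel's algorithm: each chain is assumed to hold at most one jump
rule, chains are indexed in an array, and model_chain / model_has_loop are
hypothetical names. A set genmask bit means "inactive in that generation".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one optional jump rule per chain. */
struct model_chain {
	int jump_to;		/* index of the target chain, or -1 for no jump */
	uint8_t rule_genmask;	/* set bit: the jump rule is inactive in that generation */
};

/*
 * Follow jumps from `start`, considering only rules active in the generation
 * selected by `genmask`. Taking n jumps among n chains means some chain was
 * visited twice, so a loop exists.
 */
static inline bool model_has_loop(const struct model_chain *chains, int n,
				  int start, uint8_t genmask)
{
	int cur = start;

	for (int steps = 0; steps < n; steps++) {
		const struct model_chain *c = &chains[cur];

		if (c->jump_to < 0 || (c->rule_genmask & genmask) != 0)
			return false;	/* no jump, or the jump is inactive here */
		cur = c->jump_to;
	}
	return true;
}
```

The check would be run with the next-generation bit, so a jump rule deleted
earlier in the same batch no longer counts towards a loop.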
--
Pablo Neira Ayuso Aug. 6, 2015, 10:20 a.m. UTC | #8
On Wed, Aug 05, 2015 at 11:09:16AM +0200, Patrick McHardy wrote:
> On 04.08, Pablo Neira Ayuso wrote:
[...]
> > Revisiting this scenario, this how this looks if we remove that check:
> > 
> > preparation starts:
> > 
> > add: table X (10), added to table list (now inactive)
> > del: table X (11), inactive next.
> >               ^
> >               gencursor
> > 
> > commit starts (update gencursor):
> > 
> > add: table X (01): clear past and report event, *NOTE*: the rule table is inactive.
> > add: table X (01): delete from list and report event.
> >                ^
> >                gencursor
> > 
> > So it seems it should be fine to remove it as it is defensive. I think
> > robots can generate this kind of command placing updates in a batch,
> > anyway that should come in a follow up patch IMO.
> 
> I don't follow. Why add an unnecessary check just to remove it again?
> As I said, the only thing that matters is the next generation, we should
> never even look at the current one when performing actions.

Yes, we can remove those checks to reject add+del in the same batch in
the first place.

I remember I added this because I found some problematic scenario, but
looking at the example above, I agree we can remove this in the first
place. I'm going to recheck for other objects too.

> > > > We shouldn't check if the object is active from the lookup function if
> > > > we're in the middle of a transaction, since we hold the lock there is
> > > > no way we can see inactive objects in the list. There's only one
> > > > transaction at the same time.
> > > 
> > > That's not entirely correct. Dump continuations happen asynchronously to
> > > netlink modifications and commit operations, so the genid may bump in the
> > > middle. We can get an inconsistent view if we have:
> > > 
> > > 			dump set elements from set x table y
> > > delete table y
> > > create table y
> > > create set x
> > > begin commit
> > > 			continue dump from new set
> > 
> > We catch this from the nfnlhdr->res_id field in the nfnetlink message,
> > but see below.
> > 
> > > commit, send NEWGEN
> > > 
> > > Sure, we will get a NEWGEN message, but at that time we might already have
> > > sent a full message for the new table/set since that message is only send
> > > after the commit is completed.
> > 
> > I agree in that an event message at the beginning of the commit phase
> > to announce the beginning new generation and another one to indicate
> > of this transaction.
> > 
> > - preparation phase -
> > delete table y
> > create table y
> > create set x
> > - commit phase -
> > send NEWGEN, attribute type: begin
> > delete table y
> > create table y
> > create set x
> > send NEWGEN, attribute type: end
> > 
> > Thanks for your feedback!
> 
> That might work if the message ordering is then guaranteed. However I think
> we can fix this case without changing NEWGEN. Let me think about that a bit,
> for now just taking care of the genid checks correctly seems like a good
> step forward.

But we can catch this problem through ->res_id, OK?

> BTW, we also need to adjust loop detection to only take into account
> active rules, active chains, active sets etc.

Indeed, thanks Patrick.

Will you take care of this? It would be great to have a fix for these
in this merge window. On top of that, I have a patchset here to add
named expressions as you suggested as a generic way to implement named
counters (or any other stateful expression) and I need this to be
fixed first so I don't need to add another ugly _INACTIVE flag to the
nft_nexpr object.

Let me know, thanks!
--
Pablo Neira Ayuso Aug. 6, 2015, 10:21 a.m. UTC | #9
On Wed, Aug 05, 2015 at 10:41:28AM +0200, Patrick McHardy wrote:
> On 04.08, Pablo Neira Ayuso wrote:
> > On Tue, Aug 04, 2015 at 11:09:17AM +0200, Patrick McHardy wrote:
> > > On 04.08, Pablo Neira Ayuso wrote:
> > > > The dumping of table objects can be inconsistent when interfering with the
> > > > preparation phase of our 2-phase commit protocol because:
> > > > 
> > > > 1) We remove objects from the lists during the preparation phase, that can be
> > > >    added re-added from the abort step. Thus, we may miss objects that are still
> > > >    active.
> > > > 
> > > > 2) We add new objects to the lists during the preparation phase, so we may get
> > > >    objects that are not yet active with an internal flag set.
> > > > 
> > > > We can resolve this problem with generation masks, as we already do for rules
> > > > when we expose them to the packet path.
> > > > 
> > > > After this change, we always obtain a consistent list as long as we stay in the
> > > > same generation. The userspace side can detect interferences through the
> > > > generation counter. If so, it needs to restart.
> > > > 
> > > > As a result, we can get rid of the internal NFT_TABLE_INACTIVE flag.
> > > 
> > > I have a similar patch queued up, however there seems to be something missing
> > > in this patch. The lookup functions need to take the genmask into account.
> > > Otherwise you can not delete and add a new table in the same batch. The same
> > > holds for all other object types.
> > 
> > I got what you meant, we have to skip the delete table when iterating
> > over the list.
> 
> Exactly. I'd propose to simply pass in the requested genid, this has the added
> benefit of not having to sprinkle those checks throughout the code.

If that simplifies the patchset, I think it's a good idea. Thanks!
--
Patrick McHardy Aug. 8, 2015, 3:53 p.m. UTC | #10
On 06.08, Pablo Neira Ayuso wrote:
> > That might work if the message ordering is then guaranteed. However I think
> > we can fix this case without changing NEWGEN. Let me think about that a bit,
> > for now just taking care of the genid checks correctly seems like a good
> > step forward.
> 
> But we can catch this problem through ->res_id, OK?

Have to look at it in detail. Currently sitting at the airport, will
take me a bit.

> > BTW, we also need to adjust loop detection to only take into account
> > active rules, active chains, active sets etc.
> 
> Indeed, thanks Patrick.
> 
> Will you take care of this? It would be great to have a fix for these
> in this merge window. On top of that, I have a patchset here to add

Sure. I already have this in my patches, however I'll wait for your new
patchset so I can test on top of it.

> named expressions as you suggested as a generic way to implement named
> counters (or any other stateful expression) and I need that this is
> fixed first so I don't need to add another ugly _INACTIVE flag to the
> nft_nexpr object.
> 
> Let me know, thanks!

I agree, the _INACTIVE flags need to go.
--
Patrick McHardy Aug. 10, 2015, 7:56 a.m. UTC | #11
On 06.08, Pablo Neira Ayuso wrote:
> On Wed, Aug 05, 2015 at 11:09:16AM +0200, Patrick McHardy wrote:
> > > > > We shouldn't check if the object is active from the lookup function if
> > > > > we're in the middle of a transaction, since we hold the lock there is
> > > > > no way we can see inactive objects in the list. There's only one
> > > > > transaction at the same time.
> > > > 
> > > > That's not entirely correct. Dump continuations happen asynchronously to
> > > > netlink modifications and commit operations, so the genid may bump in the
> > > > middle. We can get an inconsistent view if we have:
> > > > 
> > > > 			dump set elements from set x table y
> > > > delete table y
> > > > create table y
> > > > create set x
> > > > begin commit
> > > > 			continue dump from new set
> > > 
> > > We catch this from the nfnlhdr->res_id field in the nfnetlink message,
> > > but see below.
> > > 
> > > > commit, send NEWGEN
> > > > 
> > > > Sure, we will get a NEWGEN message, but at that time we might already have
> > > > sent a full message for the new table/set since that message is only send
> > > > after the commit is completed.
> > > 
> > > I agree in that an event message at the beginning of the commit phase
> > > to announce the beginning new generation and another one to indicate
> > > of this transaction.
> > > 
> > > - preparation phase -
> > > delete table y
> > > create table y
> > > create set x
> > > - commit phase -
> > > send NEWGEN, attribute type: begin
> > > delete table y
> > > create table y
> > > create set x
> > > send NEWGEN, attribute type: end
> > > 
> > > Thanks for your feedback!
> > 
> > That might work if the message ordering is then guaranteed. However I think
> > we can fix this case without changing NEWGEN. Let me think about that a bit,
> > for now just taking care of the genid checks correctly seems like a good
> > step forward.
> 
> But we can catch this problem through ->res_id, OK?

I guess we could with a unique res_id per object, but how would this work
with multiple object types? Any change bumps res_id, so we'd invalidate
the full dump for any change.
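
The concern above can be illustrated with a minimal sketch. This is not kernel
or libmnl code: struct ruleset_model, struct dump_state and every function name
below are hypothetical, and the single shared counter only stands in for the
res_id value discussed in this thread. It assumes that any committed change
bumps that one counter, so a multi-part dump is invalidated even when the
change touched an unrelated object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one generation id shared by all object types. */
struct ruleset_model {
	uint16_t genid;	/* stands in for the res_id carried in dump messages */
};

/* Any committed change bumps the shared id, whatever object it touched. */
static void commit_change(struct ruleset_model *rs)
{
	rs->genid++;
}

/* State a userspace dumper would keep across dump continuations. */
struct dump_state {
	bool started;
	uint16_t start_genid;
};

/*
 * Returns true if this dump part belongs to the same generation as the
 * first part. A false return means the dump has to be restarted.
 */
static bool dump_part_consistent(struct dump_state *st, uint16_t msg_genid)
{
	if (!st->started) {
		st->started = true;
		st->start_genid = msg_genid;
		return true;
	}
	return st->start_genid == msg_genid;
}

/* A change to some other object still invalidates the running dump. */
bool demo_unrelated_change_invalidates_dump(void)
{
	struct ruleset_model rs = { .genid = 7 };
	struct dump_state st = { .started = false, .start_genid = 0 };
	bool first, second;

	first = dump_part_consistent(&st, rs.genid);
	commit_change(&rs);	/* e.g. a chain added in another table */
	second = dump_part_consistent(&st, rs.genid);

	assert(first);
	return first && !second;
}
```

With a single counter the dumper cannot tell whether the object it is dumping
was affected, so it has to restart on every bump.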
Pablo Neira Ayuso Aug. 10, 2015, 6:37 p.m. UTC | #12
On Mon, Aug 10, 2015 at 09:56:46AM +0200, Patrick McHardy wrote:
> On 06.08, Pablo Neira Ayuso wrote:
> > On Wed, Aug 05, 2015 at 11:09:16AM +0200, Patrick McHardy wrote:
[...]
> > > > - preparation phase -
> > > > delete table y
> > > > create table y
> > > > create set x
> > > > - commit phase -
> > > > send NEWGEN, attribute type: begin
> > > > delete table y
> > > > create table y
> > > > create set x
> > > > send NEWGEN, attribute type: end
> > > > 
> > > > Thanks for your feedback!
> > > 
> > > That might work if the message ordering is then guaranteed. However I think
> > > we can fix this case without changing NEWGEN. Let me think about that a bit,
> > > for now just taking care of the genid checks correctly seems like a good
> > > step forward.
> > 
> > But we can catch this problem through ->res_id, OK?
> 
> I guess we could with a unique res_id per object, but how would this work
> with multiple object types? Any change bumps res_id, so we'd invalidate
> the full dump for any change.

I see. If we want to be able to invalidate caches at the per-object level,
then I think we have to revive the idea of having a netlink attribute
for the per-object generation counter.
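
A rough sketch of what per-object invalidation could look like from the cache
side follows. The thread does not define the attribute or its layout, so
struct obj_model, struct cache_entry and the helper names are assumptions made
purely for illustration: each object carries its own counter, and a cache
entry is stale only if that particular object changed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical object carrying its own generation counter. */
struct obj_model {
	uint32_t genid;
};

/* Hypothetical userspace cache entry remembering the counter it saw. */
struct cache_entry {
	const struct obj_model *obj;
	uint32_t seen_genid;
};

/* Only the modified object gets a new generation. */
static void obj_modify(struct obj_model *obj)
{
	obj->genid++;
}

static struct cache_entry cache_fill(const struct obj_model *obj)
{
	struct cache_entry e = { .obj = obj, .seen_genid = obj->genid };

	return e;
}

/* An entry is valid as long as its own object has not changed. */
static bool cache_entry_valid(const struct cache_entry *e)
{
	return e->seen_genid == e->obj->genid;
}

/* Changing set x leaves the cached view of table y untouched. */
bool demo_per_object_invalidation(void)
{
	struct obj_model table_y = { .genid = 0 };
	struct obj_model set_x = { .genid = 0 };
	struct cache_entry ct = cache_fill(&table_y);
	struct cache_entry cs = cache_fill(&set_x);

	obj_modify(&set_x);

	assert(cache_entry_valid(&ct));
	return cache_entry_valid(&ct) && !cache_entry_valid(&cs);
}
```

Compared with one shared counter, the cost is an extra field per object and an
extra attribute per message, in exchange for not discarding unaffected cache
entries.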
Patch
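
The generation-mask helpers added by this patch can be exercised outside the
kernel with a small userspace model. This is a sketch only: struct model_net,
struct model_table and the model_* functions are hypothetical stand-ins, and
the two-generation cursor (bit 0 or bit 1, flipped at commit) is an assumption
about how nft_genmask_cur() and nft_genmask_next() behave. A set bit means
"inactive in that generation".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-netns generation cursor (0 or 1). */
struct model_net {
	unsigned int gencursor;
};

/* Hypothetical stand-in for the 2-bit genmask field of struct nft_table. */
struct model_table {
	unsigned int genmask:2;
};

static unsigned int model_genmask_cur(const struct model_net *net)
{
	return 1u << net->gencursor;
}

static unsigned int model_genmask_next(const struct model_net *net)
{
	return 1u << (net->gencursor ^ 1u);
}

static bool model_table_is_active(const struct model_net *net,
				  const struct model_table *t)
{
	return (t->genmask & model_genmask_cur(net)) == 0;
}

static bool model_table_is_active_next(const struct model_net *net,
				       const struct model_table *t)
{
	return (t->genmask & model_genmask_next(net)) == 0;
}

/* New object: inactive now, active once the generation flips. */
static void model_table_activate_next(const struct model_net *net,
				      struct model_table *t)
{
	t->genmask = model_genmask_cur(net);
}

/* Deleted object: active now, inactive once the generation flips. */
static void model_table_deactivate_next(const struct model_net *net,
					struct model_table *t)
{
	t->genmask = model_genmask_next(net);
}

static void model_table_clear(const struct model_net *net,
			      struct model_table *t)
{
	t->genmask &= ~model_genmask_next(net);
}

/* Commit step in this model: the next generation becomes current. */
static void model_commit_flip(struct model_net *net)
{
	net->gencursor ^= 1u;
}

/* Walks a table through add + commit, then delete + abort. */
bool model_demo(void)
{
	struct model_net net = { .gencursor = 0 };
	struct model_table t = { .genmask = 0 };

	model_table_activate_next(&net, &t);	/* preparation: new table */
	assert(!model_table_is_active(&net, &t));
	model_commit_flip(&net);		/* commit */
	model_table_clear(&net, &t);
	assert(model_table_is_active(&net, &t));

	model_table_deactivate_next(&net, &t);	/* preparation: delete */
	assert(!model_table_is_active_next(&net, &t));
	model_table_clear(&net, &t);		/* abort */
	return model_table_is_active(&net, &t) &&
	       model_table_is_active_next(&net, &t);
}
```

In the delete + abort path the table never leaves the list, so a dump running
in the current generation keeps seeing it throughout.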

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 2a24668..1b94bf2 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -827,6 +827,7 @@  unsigned int nft_do_chain(struct nft_pktinfo *pkt,
  *	@hgenerator: handle generator state
  *	@use: number of chain references to this table
  *	@flags: table flag (see enum nft_table_flags)
+ *	@genmask: generation mask
  *	@name: name of the table
  */
 struct nft_table {
@@ -835,7 +836,8 @@  struct nft_table {
 	struct list_head		sets;
 	u64				hgenerator;
 	u32				use;
-	u16				flags;
+	u16				flags:14,
+					genmask:2;
 	char				name[NFT_TABLE_MAXNAMELEN];
 };
 
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 4a41eb9..cee7326 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -173,8 +173,35 @@  static void nf_tables_unregister_hooks(const struct nft_table *table,
 	nft_unregister_basechain(nft_base_chain(chain), hook_nops);
 }
 
-/* Internal table flags */
-#define NFT_TABLE_INACTIVE	(1 << 15)
+static inline bool
+nft_table_is_active(struct net *net, const struct nft_table *table)
+{
+	return (table->genmask & nft_genmask_cur(net)) == 0;
+}
+
+static inline bool
+nft_table_is_active_next(struct net *net, const struct nft_table *table)
+{
+	return (table->genmask & nft_genmask_next(net)) == 0;
+}
+
+static inline void
+nft_table_activate_next(struct net *net, struct nft_table *table)
+{
+	/* Now inactive, will be active in the future */
+	table->genmask = nft_genmask_cur(net);
+}
+
+static inline void
+nft_table_deactivate_next(struct net *net, struct nft_table *table)
+{
+	table->genmask = nft_genmask_next(net);
+}
+
+static inline void nft_table_clear(struct net *net, struct nft_table *table)
+{
+	table->genmask &= ~nft_genmask_next(net);
+}
 
 static int nft_trans_table_add(struct nft_ctx *ctx, int msg_type)
 {
@@ -185,7 +212,7 @@  static int nft_trans_table_add(struct nft_ctx *ctx, int msg_type)
 		return -ENOMEM;
 
 	if (msg_type == NFT_MSG_NEWTABLE)
-		ctx->table->flags |= NFT_TABLE_INACTIVE;
+		nft_table_activate_next(ctx->net, ctx->table);
 
 	list_add_tail(&trans->list, &ctx->net->nft.commit_list);
 	return 0;
@@ -199,7 +226,7 @@  static int nft_deltable(struct nft_ctx *ctx)
 	if (err < 0)
 		return err;
 
-	list_del_rcu(&ctx->table->list);
+	nft_table_deactivate_next(ctx->net, ctx->table);
 	return err;
 }
 
@@ -373,8 +400,10 @@  static struct nft_table *nft_table_lookup(const struct nft_af_info *afi,
 	return NULL;
 }
 
-static struct nft_table *nf_tables_table_lookup(const struct nft_af_info *afi,
-						const struct nlattr *nla)
+static struct nft_table *nf_tables_table_lookup(struct net *net,
+						const struct nft_af_info *afi,
+						const struct nlattr *nla,
+						bool trans)
 {
 	struct nft_table *table;
 
@@ -382,10 +411,10 @@  static struct nft_table *nf_tables_table_lookup(const struct nft_af_info *afi,
 		return ERR_PTR(-EINVAL);
 
 	table = nft_table_lookup(afi, nla);
-	if (table != NULL)
-		return table;
+	if (table == NULL || (trans && !nft_table_is_active_next(net, table)))
+		return ERR_PTR(-ENOENT);
 
-	return ERR_PTR(-ENOENT);
+	return table;
 }
 
 static inline u64 nf_tables_alloc_handle(struct nft_table *table)
@@ -522,6 +551,8 @@  static int nf_tables_dump_tables(struct sk_buff *skb,
 			if (idx > s_idx)
 				memset(&cb->args[1], 0,
 				       sizeof(cb->args) - sizeof(cb->args[0]));
+			if (!nft_table_is_active(net, table))
+				continue;
 			if (nf_tables_fill_table_info(skb, net,
 						      NETLINK_CB(cb->skb).portid,
 						      cb->nlh->nlmsg_seq,
@@ -564,10 +595,10 @@  static int nf_tables_gettable(struct sock *nlsk, struct sk_buff *skb,
 	if (IS_ERR(afi))
 		return PTR_ERR(afi);
 
-	table = nf_tables_table_lookup(afi, nla[NFTA_TABLE_NAME]);
+	table = nf_tables_table_lookup(net, afi, nla[NFTA_TABLE_NAME], false);
 	if (IS_ERR(table))
 		return PTR_ERR(table);
-	if (table->flags & NFT_TABLE_INACTIVE)
+	if (!nft_table_is_active(net, table))
 		return -ENOENT;
 
 	skb2 = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
@@ -691,7 +722,7 @@  static int nf_tables_newtable(struct sock *nlsk, struct sk_buff *skb,
 		return PTR_ERR(afi);
 
 	name = nla[NFTA_TABLE_NAME];
-	table = nf_tables_table_lookup(afi, name);
+	table = nf_tables_table_lookup(net, afi, name, true);
 	if (IS_ERR(table)) {
 		if (PTR_ERR(table) != -ENOENT)
 			return PTR_ERR(table);
@@ -699,7 +730,7 @@  static int nf_tables_newtable(struct sock *nlsk, struct sk_buff *skb,
 	}
 
 	if (table != NULL) {
-		if (table->flags & NFT_TABLE_INACTIVE)
+		if (!nft_table_is_active(net, table))
 			return -ENOENT;
 		if (nlh->nlmsg_flags & NLM_F_EXCL)
 			return -EEXIST;
@@ -829,10 +860,10 @@  static int nf_tables_deltable(struct sock *nlsk, struct sk_buff *skb,
 	if (IS_ERR(afi))
 		return PTR_ERR(afi);
 
-	table = nf_tables_table_lookup(afi, nla[NFTA_TABLE_NAME]);
+	table = nf_tables_table_lookup(net, afi, nla[NFTA_TABLE_NAME], true);
 	if (IS_ERR(table))
 		return PTR_ERR(table);
-	if (table->flags & NFT_TABLE_INACTIVE)
+	if (!nft_table_is_active(net, table))
 		return -ENOENT;
 
 	ctx.afi = afi;
@@ -1123,10 +1154,10 @@  static int nf_tables_getchain(struct sock *nlsk, struct sk_buff *skb,
 	if (IS_ERR(afi))
 		return PTR_ERR(afi);
 
-	table = nf_tables_table_lookup(afi, nla[NFTA_CHAIN_TABLE]);
+	table = nf_tables_table_lookup(net, afi, nla[NFTA_CHAIN_TABLE], false);
 	if (IS_ERR(table))
 		return PTR_ERR(table);
-	if (table->flags & NFT_TABLE_INACTIVE)
+	if (!nft_table_is_active(net, table))
 		return -ENOENT;
 
 	chain = nf_tables_chain_lookup(table, nla[NFTA_CHAIN_NAME]);
@@ -1249,7 +1280,7 @@  static int nf_tables_newchain(struct sock *nlsk, struct sk_buff *skb,
 	if (IS_ERR(afi))
 		return PTR_ERR(afi);
 
-	table = nf_tables_table_lookup(afi, nla[NFTA_CHAIN_TABLE]);
+	table = nf_tables_table_lookup(net, afi, nla[NFTA_CHAIN_TABLE], true);
 	if (IS_ERR(table))
 		return PTR_ERR(table);
 
@@ -1493,10 +1524,10 @@  static int nf_tables_delchain(struct sock *nlsk, struct sk_buff *skb,
 	if (IS_ERR(afi))
 		return PTR_ERR(afi);
 
-	table = nf_tables_table_lookup(afi, nla[NFTA_CHAIN_TABLE]);
+	table = nf_tables_table_lookup(net, afi, nla[NFTA_CHAIN_TABLE], true);
 	if (IS_ERR(table))
 		return PTR_ERR(table);
-	if (table->flags & NFT_TABLE_INACTIVE)
+	if (!nft_table_is_active(net, table))
 		return -ENOENT;
 
 	chain = nf_tables_chain_lookup(table, nla[NFTA_CHAIN_NAME]);
@@ -1957,10 +1988,10 @@  static int nf_tables_getrule(struct sock *nlsk, struct sk_buff *skb,
 	if (IS_ERR(afi))
 		return PTR_ERR(afi);
 
-	table = nf_tables_table_lookup(afi, nla[NFTA_RULE_TABLE]);
+	table = nf_tables_table_lookup(net, afi, nla[NFTA_RULE_TABLE], false);
 	if (IS_ERR(table))
 		return PTR_ERR(table);
-	if (table->flags & NFT_TABLE_INACTIVE)
+	if (!nft_table_is_active(net, table))
 		return -ENOENT;
 
 	chain = nf_tables_chain_lookup(table, nla[NFTA_RULE_CHAIN]);
@@ -2037,7 +2068,7 @@  static int nf_tables_newrule(struct sock *nlsk, struct sk_buff *skb,
 	if (IS_ERR(afi))
 		return PTR_ERR(afi);
 
-	table = nf_tables_table_lookup(afi, nla[NFTA_RULE_TABLE]);
+	table = nf_tables_table_lookup(net, afi, nla[NFTA_RULE_TABLE], true);
 	if (IS_ERR(table))
 		return PTR_ERR(table);
 
@@ -2194,10 +2225,10 @@  static int nf_tables_delrule(struct sock *nlsk, struct sk_buff *skb,
 	if (IS_ERR(afi))
 		return PTR_ERR(afi);
 
-	table = nf_tables_table_lookup(afi, nla[NFTA_RULE_TABLE]);
+	table = nf_tables_table_lookup(net, afi, nla[NFTA_RULE_TABLE], true);
 	if (IS_ERR(table))
 		return PTR_ERR(table);
-	if (table->flags & NFT_TABLE_INACTIVE)
+	if (!nft_table_is_active(net, table))
 		return -ENOENT;
 
 	if (nla[NFTA_RULE_CHAIN]) {
@@ -2348,7 +2379,8 @@  static const struct nla_policy nft_set_desc_policy[NFTA_SET_DESC_MAX + 1] = {
 static int nft_ctx_init_from_setattr(struct nft_ctx *ctx,
 				     const struct sk_buff *skb,
 				     const struct nlmsghdr *nlh,
-				     const struct nlattr * const nla[])
+				     const struct nlattr * const nla[],
+				     bool trans)
 {
 	struct net *net = sock_net(skb->sk);
 	const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
@@ -2365,10 +2397,10 @@  static int nft_ctx_init_from_setattr(struct nft_ctx *ctx,
 		if (afi == NULL)
 			return -EAFNOSUPPORT;
 
-		table = nf_tables_table_lookup(afi, nla[NFTA_SET_TABLE]);
+		table = nf_tables_table_lookup(net, afi, nla[NFTA_SET_TABLE], trans);
 		if (IS_ERR(table))
 			return PTR_ERR(table);
-		if (table->flags & NFT_TABLE_INACTIVE)
+		if (!nft_table_is_active(net, table))
 			return -ENOENT;
 	}
 
@@ -2631,7 +2663,7 @@  static int nf_tables_getset(struct sock *nlsk, struct sk_buff *skb,
 	int err;
 
 	/* Verify existence before starting dump */
-	err = nft_ctx_init_from_setattr(&ctx, skb, nlh, nla);
+	err = nft_ctx_init_from_setattr(&ctx, skb, nlh, nla, false);
 	if (err < 0)
 		return err;
 
@@ -2795,7 +2827,7 @@  static int nf_tables_newset(struct sock *nlsk, struct sk_buff *skb,
 	if (IS_ERR(afi))
 		return PTR_ERR(afi);
 
-	table = nf_tables_table_lookup(afi, nla[NFTA_SET_TABLE]);
+	table = nf_tables_table_lookup(net, afi, nla[NFTA_SET_TABLE], true);
 	if (IS_ERR(table))
 		return PTR_ERR(table);
 
@@ -2897,7 +2929,7 @@  static int nf_tables_delset(struct sock *nlsk, struct sk_buff *skb,
 	if (nla[NFTA_SET_TABLE] == NULL)
 		return -EINVAL;
 
-	err = nft_ctx_init_from_setattr(&ctx, skb, nlh, nla);
+	err = nft_ctx_init_from_setattr(&ctx, skb, nlh, nla, true);
 	if (err < 0)
 		return err;
 
@@ -3040,10 +3072,10 @@  static int nft_ctx_init_from_elemattr(struct nft_ctx *ctx,
 	if (IS_ERR(afi))
 		return PTR_ERR(afi);
 
-	table = nf_tables_table_lookup(afi, nla[NFTA_SET_ELEM_LIST_TABLE]);
+	table = nf_tables_table_lookup(net, afi, nla[NFTA_SET_ELEM_LIST_TABLE], trans);
 	if (IS_ERR(table))
 		return PTR_ERR(table);
-	if (!trans && (table->flags & NFT_TABLE_INACTIVE))
+	if (!trans && !nft_table_is_active(net, table))
 		return -ENOENT;
 
 	nft_ctx_init(ctx, skb, nlh, afi, table, NULL, nla);
@@ -3915,12 +3947,13 @@  static int nf_tables_commit(struct sk_buff *skb)
 					trans->ctx.table->flags |= NFT_TABLE_F_DORMANT;
 				}
 			} else {
-				trans->ctx.table->flags &= ~NFT_TABLE_INACTIVE;
+				nft_table_clear(net, trans->ctx.table);
 			}
 			nf_tables_table_notify(&trans->ctx, NFT_MSG_NEWTABLE);
 			nft_trans_destroy(trans);
 			break;
 		case NFT_MSG_DELTABLE:
+			list_del_rcu(&trans->ctx.table->list);
 			nf_tables_table_notify(&trans->ctx, NFT_MSG_DELTABLE);
 			break;
 		case NFT_MSG_NEWCHAIN:
@@ -4046,8 +4079,7 @@  static int nf_tables_abort(struct sk_buff *skb)
 			}
 			break;
 		case NFT_MSG_DELTABLE:
-			list_add_tail_rcu(&trans->ctx.table->list,
-					  &trans->ctx.afi->tables);
+			nft_table_clear(trans->ctx.net, trans->ctx.table);
 			nft_trans_destroy(trans);
 			break;
 		case NFT_MSG_NEWCHAIN: