
[nf] x_tables: Properly close read section with read_seqcount_retry

Message ID 1605320516-17810-1-git-send-email-stranche@codeaurora.org
State Under Review
Delegated to: Pablo Neira

Commit Message

Sean Tranchetti Nov. 14, 2020, 2:21 a.m. UTC
xtables uses a modified version of a seqcount lock for synchronization
between replacing private table information and the packet processing
path (i.e. XX_do_table). The main modification is in the "write"
semantics, where instead of adding 1 for each write, the write side will
add 1 only if it finds no other writes ongoing, and adds 1 again (thereby
clearing the LSB) when it finishes.

This allows the read side to avoid calling read_seqcount_begin() in a loop
if a write is detected, since the value is guaranteed to only increment
once all writes have completed. As such, it is only necessary to check if
the initial value of the sequence count has changed to inform the reader
that all writes are finished.

However, the current check for the changed value uses the wrong API;
raw_seqcount_read() is protected by smp_rmb() in the same way as
read_seqcount_begin(), making it appropriate for ENTERING read-side
critical sections, but not exiting them. For that, read_seqcount_retry()
must be used to ensure the proper barrier placement for synchronization
with xt_write_recseq_end() (itself modeled after write_seqcount_end()).

Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
---
 net/netfilter/x_tables.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Florian Westphal Nov. 14, 2020, 4:53 p.m. UTC | #1
Sean Tranchetti <stranche@codeaurora.org> wrote:
> xtables uses a modified version of a seqcount lock for synchronization
> between replacing private table information and the packet processing
> path (i.e. XX_do_table). The main modification is in the "write"
> semantics, where instead of adding 1 for each write, the write side will
> add 1 only if it finds no other writes ongoing, and adds 1 again (thereby
> clearing the LSB) when it finishes.
> 
> This allows the read side to avoid calling read_seqcount_begin() in a loop
> if a write is detected, since the value is guaranteed to only increment
> once all writes have completed. As such, it is only necessary to check if
> the initial value of the sequence count has changed to inform the reader
> that all writes are finished.
> 
> However, the current check for the changed value uses the wrong API;
> raw_seqcount_read() is protected by smp_rmb() in the same way as
> read_seqcount_begin(), making it appropriate for ENTERING read-side
> critical sections, but not exiting them. For that, read_seqcount_retry()
> must be used to ensure the proper barrier placement for synchronization
> with xt_write_recseq_end() (itself modeled after write_seqcount_end()).

[..]

> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index af22dbe..39f1f2b 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1404,7 +1404,7 @@ xt_replace_table(struct xt_table *table,
>  			do {
>  				cond_resched();
>  				cpu_relax();
> -			} while (seq == raw_read_seqcount(s));
> +			} while (!read_seqcount_retry(s, seq));

I'm fine with this change but AFAIU this is just a cleanup since
this part isn't a read-sequence, as no 'shared state' is accessed/read between
the seqcount begin and the do{}while. smp_rmb placement should not matter here.

Did I miss anything?

Thanks.
Subash Abhinov Kasiviswanathan Nov. 16, 2020, 6:32 a.m. UTC | #2
> I'm fine with this change but AFAIU this is just a cleanup since
> this part isn't a read-sequence, as no 'shared state' is accessed/read between
> the seqcount begin and the do{}while. smp_rmb placement should not matter here.
> 
> Did I miss anything?
> 
> Thanks.

Hi Florian

To provide more background on this, we are seeing occasional crashes in a
regression rack in the packet receive path where there appear to be
some rules being modified concurrently.

Unable to handle kernel NULL pointer dereference at virtual
address 000000000000008e
pc : ip6t_do_table+0x5d0/0x89c
lr : ip6t_do_table+0x5b8/0x89c
ip6t_do_table+0x5d0/0x89c
ip6table_filter_hook+0x24/0x30
nf_hook_slow+0x84/0x120
ip6_input+0x74/0xe0
ip6_rcv_finish+0x7c/0x128
ipv6_rcv+0xac/0xe4
__netif_receive_skb+0x84/0x17c
process_backlog+0x15c/0x1b8
napi_poll+0x88/0x284
net_rx_action+0xbc/0x23c
__do_softirq+0x20c/0x48c

We found that ip6t_do_table was reading stale values from the READ_ONCE,
leading to a use-after-free when dereferencing per-CPU jumpstack values.

[https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/ipv6/netfilter/ip6_tables.c?h=v5.4.77#n282]
	addend = xt_write_recseq_begin();
	private = READ_ONCE(table->private); /* Address dependency. */

We were able to confirm that xt_replace_table & __do_replace
had already executed by logging the seqcount values. The value of the
seqcount at ip6t_do_table was 1 more than the value at xt_replace_table.
The seqcount read at xt_replace_table also showed that it was an even value
and hence meant that there was no concurrent writer instance
(xt_write_recseq_begin) at that point.

[https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/netfilter/x_tables.c?h=v5.4.77#n1401]
		u32 seq = raw_read_seqcount(s);

This means that table assignment at xt_replace_table did not take effect
as expected.

[https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/netfilter/x_tables.c?h=v5.4.77#n1386]
	smp_wmb();
	table->private = newinfo;

	/* make sure all cpus see new ->private value */
	smp_wmb();

We want to know if this barrier usage is as expected here.
Alternatively, would changing this help here -

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 525f674..417ea1b 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1431,11 +1431,11 @@ xt_replace_table(struct xt_table *table,
          * Ensure contents of newinfo are visible before assigning to
          * private.
          */
-       smp_wmb();
-       table->private = newinfo;
+       smp_mb();
+       WRITE_ONCE(table->private, newinfo);

         /* make sure all cpus see new ->private value */
-       smp_wmb();
+       smp_mb();

         /*
          * Even though table entries have now been swapped, other CPU's
Florian Westphal Nov. 16, 2020, 2:18 p.m. UTC | #3
subashab@codeaurora.org <subashab@codeaurora.org> wrote:
> > I'm fine with this change but AFAIU this is just a cleanup since
> > this part isn't a read-sequence, as no 'shared state' is accessed/read between
> > the seqcount begin and the do{}while. smp_rmb placement should not matter here.
> > 
> > Did I miss anything?
> > 
> > Thanks.
> 
> Hi Florian
> 
> To provide more background on this, we are seeing occasional crashes in a
> regression rack in the packet receive path where there appear to be
> some rules being modified concurrently.
> 
> Unable to handle kernel NULL pointer dereference at virtual
> address 000000000000008e
> pc : ip6t_do_table+0x5d0/0x89c
> lr : ip6t_do_table+0x5b8/0x89c
> ip6t_do_table+0x5d0/0x89c
> ip6table_filter_hook+0x24/0x30
> nf_hook_slow+0x84/0x120
> ip6_input+0x74/0xe0
> ip6_rcv_finish+0x7c/0x128
> ipv6_rcv+0xac/0xe4
> __netif_receive_skb+0x84/0x17c
> process_backlog+0x15c/0x1b8
> napi_poll+0x88/0x284
> net_rx_action+0xbc/0x23c
> __do_softirq+0x20c/0x48c
> 
> We found that ip6t_do_table was reading stale values from the READ_ONCE,
> leading to a use-after-free when dereferencing per-CPU jumpstack values.
> 
> [https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/ipv6/netfilter/ip6_tables.c?h=v5.4.77#n282]
> 	addend = xt_write_recseq_begin();
> 	private = READ_ONCE(table->private); /* Address dependency. */
> 
> We were able to confirm that xt_replace_table & __do_replace
> had already executed by logging the seqcount values. The value of the
> seqcount at ip6t_do_table was 1 more than the value at xt_replace_table.

Can you elaborate? Was that before xt_write_recseq_begin() or after?

> The seqcount read at xt_replace_table also showed that it was an even value
> and hence meant that there was no concurrent writer instance
> (xt_write_recseq_begin) at that point.

Ok, so either ip6t_do_table was done or just starting (and it should not
matter which one).

> [https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/netfilter/x_tables.c?h=v5.4.77#n1401]
> 		u32 seq = raw_read_seqcount(s);
> 
> This means that table assignment at xt_replace_table did not take effect
> as expected.

You mean
"private = READ_ONCE(table->private); /* Address dependency. */" in
ip6t_do_table did pick up the 'old' private pointer, AFTER xt_replace
had updated it?

> [https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/netfilter/x_tables.c?h=v5.4.77#n1386]
> 	smp_wmb();
> 	table->private = newinfo;
> 
> 	/* make sure all cpus see new ->private value */
> 	smp_wmb();
> 
> We want to know if this barrier usage is as expected here.
> Alternatively, would changing this help here -
> 
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index 525f674..417ea1b 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1431,11 +1431,11 @@ xt_replace_table(struct xt_table *table,
>          * Ensure contents of newinfo are visible before assigning to
>          * private.
>          */
> -       smp_wmb();
> -       table->private = newinfo;
> +       smp_mb();
> +       WRITE_ONCE(table->private, newinfo);

The WRITE_ONCE looks correct.

>         /* make sure all cpus see new ->private value */
> -       smp_wmb();
> +       smp_mb();

Is that to order wrt. seqcount_sequence?
Do you have a way to reproduce such crashes?

I tried to no avail, but I guess that's just because amd64 is more
forgiving.

Thanks!
Subash Abhinov Kasiviswanathan Nov. 16, 2020, 4:26 p.m. UTC | #4
>> We found that ip6t_do_table was reading stale values from the READ_ONCE,
>> leading to a use-after-free when dereferencing per-CPU jumpstack values.
>> 
>> [https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/ipv6/netfilter/ip6_tables.c?h=v5.4.77#n282]
>> 	addend = xt_write_recseq_begin();
>> 	private = READ_ONCE(table->private); /* Address dependency. */
>> 
>> We were able to confirm that xt_replace_table & __do_replace
>> had already executed by logging the seqcount values. The value of the
>> seqcount at ip6t_do_table was 1 more than the value at xt_replace_table.
> 
> Can you elaborate? Was that before xt_write_recseq_begin() or after?

The log in ip6t_do_table was after xt_write_recseq_begin().

[https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/ipv6/netfilter/ip6_tables.c?h=v5.4.77#n282]
	addend = xt_write_recseq_begin();

The log in xt_replace_table was after the raw_read_seqcount of
per_cpu(xt_recseq).

[https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/netfilter/x_tables.c?h=v5.4.77#n1400]
		seqcount_t *s = &per_cpu(xt_recseq, cpu);
		u32 seq = raw_read_seqcount(s);

In this specific run, we observed that the value in ip6t_do_table after
xt_write_recseq_begin() was 75316887. The value in xt_replace_table after
the raw_read_seqcount of per_cpu(xt_recseq) was 75316886. This confirms
that the private pointer assignment in xt_replace_table happened before
the start of ip6t_do_table.

> You mean
> "private = READ_ONCE(table->private); /* Address dependency. */" in
> ip6_do_table did pick up the 'old' private pointer, AFTER xt_replace
> had updated it?
> 

Yes, that is what we have observed from the logging.

>> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
>> index 525f674..417ea1b 100644
>> --- a/net/netfilter/x_tables.c
>> +++ b/net/netfilter/x_tables.c
>> @@ -1431,11 +1431,11 @@ xt_replace_table(struct xt_table *table,
>>          * Ensure contents of newinfo are visible before assigning to
>>          * private.
>>          */
>> -       smp_wmb();
>> -       table->private = newinfo;
>> +       smp_mb();
>> +       WRITE_ONCE(table->private, newinfo);
> 
> The WRITE_ONCE looks correct.
> 
>>         /* make sure all cpus see new ->private value */
>> -       smp_wmb();
>> +       smp_mb();
> 
> Is that to order wrt. seqcount_sequence?

Correct, we want to ensure that the table->private is updated and is
in sync on all CPUs beyond that point.

> Do you have a way to reproduce such crashes?
> 
> I tried to no avail, but I guess that's just because amd64 is more
> forgiving.
> 
> Thanks!

Unfortunately we are seeing it on ARM64 regression systems which run a
variety of usecases, so the exact steps are not known.
Florian Westphal Nov. 16, 2020, 5:04 p.m. UTC | #5
subashab@codeaurora.org <subashab@codeaurora.org> wrote:
> > Is that to order wrt. seqcount_sequence?
> 
> Correct, we want to ensure that the table->private is updated and is
> in sync on all CPUs beyond that point.
> 
> > Do you have a way to reproduce such crashes?
> > 
> > I tried to no avail, but I guess that's just because amd64 is more
> > forgiving.
> > 
> > Thanks!
> 
> Unfortunately we are seeing it on ARM64 regression systems which run a
> variety of usecases, so the exact steps are not known.

Ok.  Would you be willing to run some of those with your suggested
change to see if that resolves the crashes or is that so rare that this
isn't practical?
Subash Abhinov Kasiviswanathan Nov. 16, 2020, 5:51 p.m. UTC | #6
>> Unfortunately we are seeing it on ARM64 regression systems which run a
>> variety of usecases, so the exact steps are not known.
> 
> Ok.  Would you be willing to run some of those with your suggested
> change to see if that resolves the crashes or is that so rare that this
> isn't practical?

I can try that out. Let me know if you have any other suggestions as well
and I can try that too.

I assume we can't add locks here as it would be in the packet processing
path.
Florian Westphal Nov. 16, 2020, 6:20 p.m. UTC | #7
subashab@codeaurora.org <subashab@codeaurora.org> wrote:
> > > Unfortunately we are seeing it on ARM64 regression systems which run a
> > > variety of usecases, so the exact steps are not known.
> > 
> > Ok.  Would you be willing to run some of those with your suggested
> > change to see if that resolves the crashes or is that so rare that this
> > isn't practical?
> 
> I can try that out. Let me know if you have any other suggestions as well
> and I can try that too.
> 
> I assume we can't add locks here as it would be in the packet processing
> path.

Yes.  We can add a synchronize_net() in xt_replace_table if needed
though, before starting to put the references on the old ruleset.
This would avoid the free of the jumpstack while skbs are still
in-flight.
Will Deacon Nov. 18, 2020, 12:13 p.m. UTC | #8
Hi all,

[+tglx and peterz for seqcount abuse]

On Mon, Nov 16, 2020 at 07:20:28PM +0100, Florian Westphal wrote:
> subashab@codeaurora.org <subashab@codeaurora.org> wrote:
> > > > Unfortunately we are seeing it on ARM64 regression systems which run a
> > > > variety of usecases, so the exact steps are not known.
> > > 
> > > Ok.  Would you be willing to run some of those with your suggested
> > > change to see if that resolves the crashes or is that so rare that this
> > > isn't practical?
> > 
> > I can try that out. Let me know if you have any other suggestions as well
> > and I can try that too.
> > 
> > I assume we can't add locks here as it would be in the packet processing
> > path.
> 
> Yes.  We can add a synchronize_net() in xt_replace_table if needed
> though, before starting to put the references on the old ruleset.
> This would avoid the free of the jumpstack while skbs are still
> in-flight.

I tried to make sense of what's going on here, and it looks to me like
the interaction between xt_replace_table() and ip6t_do_table() relies on
store->load order that is _not_ enforced in the code.

xt_replace_table() does:

	table->private = newinfo;

	/* make sure all cpus see new ->private value */
	smp_wmb();

	/*
	 * Even though table entries have now been swapped, other CPU's
	 * may still be using the old entries...
	 */
	local_bh_enable();

	/* ... so wait for even xt_recseq on all cpus */
	for_each_possible_cpu(cpu) {
		seqcount_t *s = &per_cpu(xt_recseq, cpu);
		u32 seq = raw_read_seqcount(s);

		if (seq & 1) {
			do {
				cond_resched();
				cpu_relax();
			} while (seq == raw_read_seqcount(s));
		}
	}

and I think the idea here is that we swizzle the table->private pointer
to point to the new data, and then wait for all pre-existing readers to
finish with the old table (isn't this what RCU is for?).

On the reader side, ip6t_do_table() does:

	addend = xt_write_recseq_begin();
	private = READ_ONCE(table->private); /* Address dependency. */

where xt_write_recseq_begin() is pretty creative:

	unsigned int addend = (__this_cpu_read(xt_recseq.sequence) + 1) & 1;
	__this_cpu_add(xt_recseq.sequence, addend);
	smp_wmb();
	return addend;

I think this can be boiled down to an instance of the "SB" litmus test:

CPU 0				CPU 1

Wtable = newinfo		Rseq = 0
smp_wmb()			Wseq = 1	(i.e. seq += 1)
Rseq = 0			smp_wmb()
<free oldinfo>			Rtable = oldinfo

Since smp_wmb() only orders store-store, both CPUs can hoist the reads
up and we get the undesirable outcome. This should be observable on x86 too.

Given that this appears to be going wrong in practice, I've included a quick
hack below which should fix the ordering. However, I think it would be
better if we could avoid abusing the seqcount code for this sort of thing.

Cheers,

Will

--->8

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 5deb099d156d..8ec48466410a 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -376,7 +376,7 @@ static inline unsigned int xt_write_recseq_begin(void)
 	 * since addend is most likely 1
 	 */
 	__this_cpu_add(xt_recseq.sequence, addend);
-	smp_wmb();
+	smp_mb();
 
 	return addend;
 }
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index af22dbe85e2c..383a47ff0c40 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1384,10 +1384,12 @@ xt_replace_table(struct xt_table *table,
 	 * private.
 	 */
 	smp_wmb();
-	table->private = newinfo;
-
-	/* make sure all cpus see new ->private value */
-	smp_wmb();
+	WRITE_ONCE(table->private, newinfo);
+	/*
+	 * make sure all cpus see new ->private value before we load the
+	 * sequence counters
+	 */
+	smp_mb();
 
 	/*
 	 * Even though table entries have now been swapped, other CPU's
Florian Westphal Nov. 18, 2020, 12:42 p.m. UTC | #9
Will Deacon <will@kernel.org> wrote:
> On Mon, Nov 16, 2020 at 07:20:28PM +0100, Florian Westphal wrote:
> > subashab@codeaurora.org <subashab@codeaurora.org> wrote:
> > > > > Unfortunately we are seeing it on ARM64 regression systems which run a
> > > > > variety of usecases, so the exact steps are not known.
> > > > 
> > > > Ok.  Would you be willing to run some of those with your suggested
> > > > change to see if that resolves the crashes or is that so rare that this
> > > > isn't practical?
> > > 
> > > I can try that out. Let me know if you have any other suggestions as well
> > > and I can try that too.
> > > 
> > > I assume we can't add locks here as it would be in the packet processing
> > > path.
> > 
> > Yes.  We can add a synchronize_net() in xt_replace_table if needed
> > though, before starting to put the references on the old ruleset.
> > This would avoid the free of the jumpstack while skbs are still
> > in-flight.
> 
> I tried to make sense of what's going on here, and it looks to me like
> the interaction between xt_replace_table() and ip6t_do_table() relies on
> store->load order that is _not_ enforced in the code.
> 
> xt_replace_table() does:
> 
> 	table->private = newinfo;
> 
> 	/* make sure all cpus see new ->private value */
> 	smp_wmb();
> 
> 	/*
> 	 * Even though table entries have now been swapped, other CPU's
> 	 * may still be using the old entries...
> 	 */
> 	local_bh_enable();
> 
> 	/* ... so wait for even xt_recseq on all cpus */
> 	for_each_possible_cpu(cpu) {
> 		seqcount_t *s = &per_cpu(xt_recseq, cpu);
> 		u32 seq = raw_read_seqcount(s);
> 
> 		if (seq & 1) {
> 			do {
> 				cond_resched();
> 				cpu_relax();
> 			} while (seq == raw_read_seqcount(s));
> 		}
> 	}
> 
> and I think the idea here is that we swizzle the table->private pointer
> to point to the new data, and then wait for all pre-existing readers to
> finish with the old table (isn't this what RCU is for?).

Yes and yes.   Before the current 'for_each_possible_cpu() + seqcount'
abuse, the abuse was just implicit; xt_replace_table did NOT wait for
readers to go away and relied on arp, eb and ip(6)tables to store the
counters back to userspace after xt_replace_table().

This meant a read_seqcount_begin/retry for each counter & eventually
gave complaints from k8s users that x_tables rule replacement was too
slow.

[..]

> Given that this appears to be going wrong in practice, I've included a quick
> hack below which should fix the ordering. However, I think it would be
> better if we could avoid abusing the seqcount code for this sort of thing.

I'd be ok with going via the simpler solution & waiting to see if k8s users
complain that it's too slow.  Those memory blobs can be huge so I would
not use call_rcu here.

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index af22dbe85e2c..b5911985d1eb 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1384,7 +1384,7 @@ xt_replace_table(struct xt_table *table,
         * private.
         */
        smp_wmb();
-       table->private = newinfo;
+       WRITE_ONCE(table->private, newinfo);
 
        /* make sure all cpus see new ->private value */
        smp_wmb();
@@ -1394,19 +1394,7 @@ xt_replace_table(struct xt_table *table,
         * may still be using the old entries...
         */
        local_bh_enable();
-
-       /* ... so wait for even xt_recseq on all cpus */
-       for_each_possible_cpu(cpu) {
-               seqcount_t *s = &per_cpu(xt_recseq, cpu);
-               u32 seq = raw_read_seqcount(s);
-
-               if (seq & 1) {
-                       do {
-                               cond_resched();
-                               cpu_relax();
-                       } while (seq == raw_read_seqcount(s));
-               }
-       }
+       synchronize_rcu();
 
        audit_log_nfcfg(table->name, table->af, private->number,
                        !private->number ? AUDIT_XT_OP_REGISTER :
Will Deacon Nov. 18, 2020, 12:54 p.m. UTC | #10
On Wed, Nov 18, 2020 at 01:42:28PM +0100, Florian Westphal wrote:
> Will Deacon <will@kernel.org> wrote:
> > On Mon, Nov 16, 2020 at 07:20:28PM +0100, Florian Westphal wrote:
> > > subashab@codeaurora.org <subashab@codeaurora.org> wrote:
> > > > > > Unfortunately we are seeing it on ARM64 regression systems which run a
> > > > > > variety of usecases, so the exact steps are not known.
> > > > > 
> > > > > Ok.  Would you be willing to run some of those with your suggested
> > > > > change to see if that resolves the crashes or is that so rare that this
> > > > > isn't practical?
> > > > 
> > > > I can try that out. Let me know if you have any other suggestions as well
> > > > and I can try that too.
> > > > 
> > > > I assume we can't add locks here as it would be in the packet processing
> > > > path.
> > > 
> > > Yes.  We can add a synchronize_net() in xt_replace_table if needed
> > > though, before starting to put the references on the old ruleset.
> > > This would avoid the free of the jumpstack while skbs are still
> > > in-flight.
> > 
> > I tried to make sense of what's going on here, and it looks to me like
> > the interaction between xt_replace_table() and ip6t_do_table() relies on
> > store->load order that is _not_ enforced in the code.
> > 
> > xt_replace_table() does:
> > 
> > 	table->private = newinfo;
> > 
> > 	/* make sure all cpus see new ->private value */
> > 	smp_wmb();
> > 
> > 	/*
> > 	 * Even though table entries have now been swapped, other CPU's
> > 	 * may still be using the old entries...
> > 	 */
> > 	local_bh_enable();
> > 
> > 	/* ... so wait for even xt_recseq on all cpus */
> > 	for_each_possible_cpu(cpu) {
> > 		seqcount_t *s = &per_cpu(xt_recseq, cpu);
> > 		u32 seq = raw_read_seqcount(s);
> > 
> > 		if (seq & 1) {
> > 			do {
> > 				cond_resched();
> > 				cpu_relax();
> > 			} while (seq == raw_read_seqcount(s));
> > 		}
> > 	}
> > 
> > and I think the idea here is that we swizzle the table->private pointer
> > to point to the new data, and then wait for all pre-existing readers to
> > finish with the old table (isn't this what RCU is for?).
> 
> Yes and yes.   Before the current 'for_each_possible_cpu() + seqcount'
> abuse, the abuse was just implicit; xt_replace_table did NOT wait for
> readers to go away and relied on arp, eb and ip(6)tables to store the
> counters back to userspace after xt_replace_table().

Thanks for the background.

> This meant a read_seqcount_begin/retry for each counter & eventually
> gave complaints from k8s users that x_tables rule replacement was too
> slow.
> 
> [..]
> 
> > Given that this appears to be going wrong in practice, I've included a quick
> > hack below which should fix the ordering. However, I think it would be
> > better if we could avoid abusing the seqcount code for this sort of thing.
> 
> I'd be ok with going via the simpler solution & waiting to see if k8s users
> complain that it's too slow.  Those memory blobs can be huge so I would
> not use call_rcu here.

If you can stomach the cost of synchronize_rcu() then this at least
gets rid of that for_each_possible_cpu() loop! Also...

> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index af22dbe85e2c..b5911985d1eb 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1384,7 +1384,7 @@ xt_replace_table(struct xt_table *table,
>          * private.
>          */
>         smp_wmb();
> -       table->private = newinfo;
> +       WRITE_ONCE(table->private, newinfo);

... you could make this rcu_assign_pointer() and get rid of the preceding
smp_wmb()...

>  
>         /* make sure all cpus see new ->private value */
>         smp_wmb();

... and this smp_wmb() is no longer needed because synchronize_rcu()
takes care of the ordering.

> @@ -1394,19 +1394,7 @@ xt_replace_table(struct xt_table *table,
>          * may still be using the old entries...
>          */
>         local_bh_enable();
> -
> -       /* ... so wait for even xt_recseq on all cpus */
> -       for_each_possible_cpu(cpu) {
> -               seqcount_t *s = &per_cpu(xt_recseq, cpu);
> -               u32 seq = raw_read_seqcount(s);
> -
> -               if (seq & 1) {
> -                       do {
> -                               cond_resched();
> -                               cpu_relax();
> -                       } while (seq == raw_read_seqcount(s));
> -               }
> -       }

Do we still need xt_write_recseq_{begin,end}() with this removed?

Cheers,

Will
Florian Westphal Nov. 18, 2020, 1:14 p.m. UTC | #11
Will Deacon <will@kernel.org> wrote:
> > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> > index af22dbe85e2c..b5911985d1eb 100644
> > --- a/net/netfilter/x_tables.c
> > +++ b/net/netfilter/x_tables.c
> > @@ -1384,7 +1384,7 @@ xt_replace_table(struct xt_table *table,
> >          * private.
> >          */
> >         smp_wmb();
> > -       table->private = newinfo;
> > +       WRITE_ONCE(table->private, newinfo);
> 
> ... you could make this rcu_assign_pointer() and get rid of the preceding
> smp_wmb()...

Yes, it would make sense to add proper rcu annotation as well.

> >         /* make sure all cpus see new ->private value */
> >         smp_wmb();
> 
> ... and this smp_wmb() is no longer needed because synchronize_rcu()
> takes care of the ordering.

Right, thanks.

> > @@ -1394,19 +1394,7 @@ xt_replace_table(struct xt_table *table,
> >          * may still be using the old entries...
> >          */
> >         local_bh_enable();
> > -
> > -       /* ... so wait for even xt_recseq on all cpus */
> > -       for_each_possible_cpu(cpu) {
> > -               seqcount_t *s = &per_cpu(xt_recseq, cpu);
> > -               u32 seq = raw_read_seqcount(s);
> > -
> > -               if (seq & 1) {
> > -                       do {
> > -                               cond_resched();
> > -                               cpu_relax();
> > -                       } while (seq == raw_read_seqcount(s));
> > -               }
> > -       }
> 
> Do we still need xt_write_recseq_{begin,end}() with this removed?

Yes, it enables userspace to get stable per rule packet and byte counters.
Subash Abhinov Kasiviswanathan Nov. 18, 2020, 8:39 p.m. UTC | #12
I have tried the following to ensure the instruction ordering of the private
assignment, and I haven't seen the crash so far.

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index af22dbe..2a4f6b3 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1389,6 +1389,9 @@ xt_replace_table(struct xt_table *table,
         /* make sure all cpus see new ->private value */
         smp_wmb();

+       /* make sure the instructions above are actually executed */
+       smp_mb();
+
         /*
          * Even though table entries have now been swapped, other CPU's
          * may still be using the old entries...

I had added a debug to store the values of the xt_recseq.sequence in
ip6t_do_table after the increment, so it probably forced a load in the code
order rather than allowing the CPU to defer it.

[https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv6/netfilter/ip6_tables.c?h=v5.10-rc4#n282]
	addend = xt_write_recseq_begin();

Otherwise, I would have needed the additional instruction barrier which
Will noted in xt_write_recseq_begin() below.

> diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
> index 5deb099d156d..8ec48466410a 100644
> --- a/include/linux/netfilter/x_tables.h
> +++ b/include/linux/netfilter/x_tables.h
> @@ -376,7 +376,7 @@ static inline unsigned int xt_write_recseq_begin(void)
>  	 * since addend is most likely 1
>  	 */
>  	__this_cpu_add(xt_recseq.sequence, addend);
> -	smp_wmb();
> +	smp_mb();
> 
>  	return addend;
>  }

>> ... you could make this rcu_assign_pointer() and get rid of the preceding
>> smp_wmb()...
> 
> Yes, it would make sense to add proper rcu annotation as well.
> 
>> >         /* make sure all cpus see new ->private value */
>> >         smp_wmb();
>> 
>> ... and this smp_wmb() is no longer needed because synchronize_rcu()
>> takes care of the ordering.

I assume we would need the following changes to address this.

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 5deb099..7ab0e4f 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -227,7 +227,7 @@ struct xt_table {
  	unsigned int valid_hooks;

  	/* Man behind the curtain... */
-	struct xt_table_info *private;
+	struct xt_table_info __rcu *private;

  	/* Set this to THIS_MODULE if you are a module, otherwise NULL */
  	struct module *me;
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index d1e04d2..6a2b551 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -203,7 +203,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,

  	local_bh_disable();
  	addend = xt_write_recseq_begin();
-	private = READ_ONCE(table->private); /* Address dependency. */
+	private = rcu_access_pointer(table->private);
  	cpu     = smp_processor_id();
  	table_base = private->entries;
  	jumpstack  = (struct arpt_entry **)private->jumpstack[cpu];
@@ -649,7 +649,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table)
  {
  	unsigned int countersize;
  	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	const struct xt_table_info *private = rcu_access_pointer(table->private);

  	/* We need atomic snapshot of counters: rest doesn't change
  	 * (other than comefrom, which userspace doesn't care
@@ -673,7 +673,7 @@ static int copy_entries_to_user(unsigned int total_size,
  	unsigned int off, num;
  	const struct arpt_entry *e;
  	struct xt_counters *counters;
-	struct xt_table_info *private = table->private;
+	struct xt_table_info *private = rcu_access_pointer(table->private);
  	int ret = 0;
  	void *loc_cpu_entry;

@@ -1330,7 +1330,7 @@ static int compat_copy_entries_to_user(unsigned int total_size,
  				       void __user *userptr)
  {
  	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	const struct xt_table_info *private = rcu_access_pointer(table->private);
  	void __user *pos;
  	unsigned int size;
  	int ret = 0;
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index f15bc21..64d0fb7 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -258,7 +258,7 @@ ipt_do_table(struct sk_buff *skb,
  	WARN_ON(!(table->valid_hooks & (1 << hook)));
  	local_bh_disable();
  	addend = xt_write_recseq_begin();
-	private = READ_ONCE(table->private); /* Address dependency. */
+	private = rcu_access_pointer(table->private);
  	cpu        = smp_processor_id();
  	table_base = private->entries;
  	jumpstack  = (struct ipt_entry **)private->jumpstack[cpu];
@@ -791,7 +791,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table)
  {
  	unsigned int countersize;
  	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	const struct xt_table_info *private = rcu_access_pointer(table->private);

  	/* We need atomic snapshot of counters: rest doesn't change
  	   (other than comefrom, which userspace doesn't care
@@ -815,7 +815,7 @@ copy_entries_to_user(unsigned int total_size,
  	unsigned int off, num;
  	const struct ipt_entry *e;
  	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	const struct xt_table_info *private = rcu_access_pointer(table->private);
  	int ret = 0;
  	const void *loc_cpu_entry;

@@ -1543,7 +1543,7 @@ compat_copy_entries_to_user(unsigned int total_size, struct xt_table *table,
  			    void __user *userptr)
  {
  	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	const struct xt_table_info *private = rcu_access_pointer(table->private);
  	void __user *pos;
  	unsigned int size;
  	int ret = 0;
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 2e2119b..db27e29 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -280,7 +280,7 @@ ip6t_do_table(struct sk_buff *skb,

  	local_bh_disable();
  	addend = xt_write_recseq_begin();
-	private = READ_ONCE(table->private); /* Address dependency. */
+	private = rcu_access_pointer(table->private);
  	cpu        = smp_processor_id();
  	table_base = private->entries;
  	jumpstack  = (struct ip6t_entry **)private->jumpstack[cpu];
@@ -807,7 +807,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table)
  {
  	unsigned int countersize;
  	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	const struct xt_table_info *private = rcu_access_pointer(table->private);

  	/* We need atomic snapshot of counters: rest doesn't change
  	   (other than comefrom, which userspace doesn't care
@@ -831,7 +831,7 @@ copy_entries_to_user(unsigned int total_size,
  	unsigned int off, num;
  	const struct ip6t_entry *e;
  	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	const struct xt_table_info *private = rcu_access_pointer(table->private);
  	int ret = 0;
  	const void *loc_cpu_entry;

@@ -1552,7 +1552,7 @@ compat_copy_entries_to_user(unsigned int total_size, struct xt_table *table,
  			    void __user *userptr)
  {
  	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	const struct xt_table_info *private = rcu_access_pointer(table->private);
  	void __user *pos;
  	unsigned int size;
  	int ret = 0;
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index af22dbe..2e6c09c 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1367,7 +1367,7 @@ xt_replace_table(struct xt_table *table,

  	/* Do the substitution. */
  	local_bh_disable();
-	private = table->private;
+	private = rcu_access_pointer(table->private);

  	/* Check inside lock: is the old number correct? */
  	if (num_counters != private->number) {
@@ -1379,15 +1379,8 @@ xt_replace_table(struct xt_table *table,
  	}

  	newinfo->initial_entries = private->initial_entries;
-	/*
-	 * Ensure contents of newinfo are visible before assigning to
-	 * private.
-	 */
-	smp_wmb();
-	table->private = newinfo;

-	/* make sure all cpus see new ->private value */
-	smp_wmb();
+	rcu_assign_pointer(table->private, newinfo);

  	/*
  	 * Even though table entries have now been swapped, other CPU's
@@ -1442,12 +1435,12 @@ struct xt_table *xt_register_table(struct net *net,
  	}

  	/* Simplifies replace_table code. */
-	table->private = bootstrap;
+	rcu_assign_pointer(table->private, bootstrap);

  	if (!xt_replace_table(table, 0, newinfo, &ret))
  		goto unlock;

-	private = table->private;
+	private = rcu_access_pointer(table->private);
  	pr_debug("table->private->number = %u\n", private->number);

  	/* save number of initial entries */
@@ -1470,7 +1463,8 @@ void *xt_unregister_table(struct xt_table *table)
  	struct xt_table_info *private;

  	mutex_lock(&xt[table->af].mutex);
-	private = table->private;
+	private = rcu_access_pointer(table->private);
+	RCU_INIT_POINTER(table->private, NULL);
  	list_del(&table->list);
  	mutex_unlock(&xt[table->af].mutex);
  	audit_log_nfcfg(table->name, table->af, private->number,
Florian Westphal Nov. 18, 2020, 9:10 p.m. UTC | #13
subashab@codeaurora.org <subashab@codeaurora.org> wrote:
> I have tried the following to ensure the instruction ordering of private
> assignment and I haven't seen the crash so far.
> 
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index af22dbe..2a4f6b3 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1389,6 +1389,9 @@ xt_replace_table(struct xt_table *table,
>         /* make sure all cpus see new ->private value */
>         smp_wmb();
> 
> +       /* make sure the instructions above are actually executed */
> +       smp_mb();
> +

This looks funny, I'd rather have s/smp_wmb/smp_mb instead of yet
another one.

> I assume we would need the following changes to address this.

Yes, something like this.  More comments below.

> diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
> index 5deb099..7ab0e4f 100644
> --- a/include/linux/netfilter/x_tables.h
> +++ b/include/linux/netfilter/x_tables.h
> @@ -227,7 +227,7 @@ struct xt_table {
>  	unsigned int valid_hooks;
> 
>  	/* Man behind the curtain... */
> -	struct xt_table_info *private;
> +	struct xt_table_info __rcu *private;
> 
>  	/* Set this to THIS_MODULE if you are a module, otherwise NULL */
>  	struct module *me;
> diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
> index d1e04d2..6a2b551 100644
> --- a/net/ipv4/netfilter/arp_tables.c
> +++ b/net/ipv4/netfilter/arp_tables.c
> @@ -203,7 +203,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,
> 
>  	local_bh_disable();
>  	addend = xt_write_recseq_begin();
> -	private = READ_ONCE(table->private); /* Address dependency. */
> +	private = rcu_access_pointer(table->private);

The three _do_table() functions need to use rcu_dereference().

>  	cpu     = smp_processor_id();
>  	table_base = private->entries;
>  	jumpstack  = (struct arpt_entry **)private->jumpstack[cpu];
> @@ -649,7 +649,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table)
>  {
>  	unsigned int countersize;
>  	struct xt_counters *counters;
> -	const struct xt_table_info *private = table->private;
> +	const struct xt_table_info *private = rcu_access_pointer(table->private);

This looks wrong.  I know it's ok, but perhaps it's better to add this:

struct xt_table_info *xt_table_get_private_protected(const struct xt_table *table)
{
 return rcu_dereference_protected(table->private, mutex_is_locked(&xt[table->af].mutex));
}
EXPORT_SYMBOL(xt_table_get_private_protected);

to x_tables.c.

If you dislike this extra function, add

#define xt_table_get_private_protected(t) rcu_access_pointer((t)->private)

in include/linux/netfilter/x_tables.h, with a big fat comment telling
that the xt table mutex must be held.

But I'd rather have/use the helper function as it documents when it's
safe to do this (and there will be splats if misused).

> index af22dbe..2e6c09c 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1367,7 +1367,7 @@ xt_replace_table(struct xt_table *table,
> 
>  	/* Do the substitution. */
>  	local_bh_disable();
> -	private = table->private;
> +	private = rcu_access_pointer(table->private);

AFAICS the local_bh_disable/enable calls can be removed too after this:
if we're interrupted by a softirq calling any of the _do_table()
functions, changes to the xt seqcount do not matter anymore.

>       /*
>        * Even though table entries have now been swapped, other CPU's

We need this additional hunk to switch to rcu for replacement/sync, no?

-       local_bh_enable();
-
-       /* ... so wait for even xt_recseq on all cpus */
-       for_each_possible_cpu(cpu) {
-               seqcount_t *s = &per_cpu(xt_recseq, cpu);
-               u32 seq = raw_read_seqcount(s);
-
-               if (seq & 1) {
-                       do {
-                               cond_resched();
-                               cpu_relax();
-                       } while (seq == raw_read_seqcount(s));
-               }
-       }
+       synchronize_rcu();
Subash Abhinov Kasiviswanathan Nov. 20, 2020, 5:53 a.m. UTC | #14
> The three _do_table() functions need to use rcu_dereference().
> 
> This looks wrong.  I know its ok, but perhaps its better to add this:
> 
> struct xt_table_info *xt_table_get_private_protected(const struct xt_table *table)
> {
>  return rcu_dereference_protected(table->private,
> mutex_is_locked(&xt[table->af].mutex));
> }
> EXPORT_SYMBOL(xt_table_get_private_protected);
> 
> to x_tables.c.
> 
> If you dislike this extra function, add
> 
> #define xt_table_get_private_protected(t) rcu_access_pointer((t)->private)
> 
> in include/linux/netfilter/x_tables.h, with a bit fat comment telling
> that the xt table mutex must be held.
> 
> But I'd rather have/use the helper function as it documents when its
> safe to do this (and there will be splats if misused).
> AFAICS the local_bh_disable/enable calls can be removed too after this,
> if we're interrupted by softirq calling any of the _do_table()
> functions changes to the xt seqcount do not matter anymore.
> 
> 
> We need this additional hunk to switch to rcu for replacement/sync, no?
> 
> -       local_bh_enable();
> -
> -       /* ... so wait for even xt_recseq on all cpus */
> -       for_each_possible_cpu(cpu) {
> -               seqcount_t *s = &per_cpu(xt_recseq, cpu);
> -               u32 seq = raw_read_seqcount(s);
> -
> -               if (seq & 1) {
> -                       do {
> -                               cond_resched();
> -                               cpu_relax();
> -                       } while (seq == raw_read_seqcount(s));
> -               }
> -       }
> +       synchronize_rcu();

I've updated the patch with your comments.
Do you expect a performance impact, either in the datapath or perhaps more in
the rule installation, with the rcu changes?
As I understand it, the change in the barrier types would be sufficient to
resolve the race.

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 5deb099..8ebb641 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -227,7 +227,7 @@ struct xt_table {
  	unsigned int valid_hooks;

  	/* Man behind the curtain... */
-	struct xt_table_info *private;
+	struct xt_table_info __rcu *private;

  	/* Set this to THIS_MODULE if you are a module, otherwise NULL */
  	struct module *me;
@@ -448,6 +448,9 @@ xt_get_per_cpu_counter(struct xt_counters *cnt, unsigned int cpu)

  struct nf_hook_ops *xt_hook_ops_alloc(const struct xt_table *, nf_hookfn *);

+struct xt_table_info
+*xt_table_get_private_protected(const struct xt_table *table);
+
  #ifdef CONFIG_COMPAT
  #include <net/compat.h>

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index d1e04d2..dda5d8f 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -203,7 +203,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,

  	local_bh_disable();
  	addend = xt_write_recseq_begin();
-	private = READ_ONCE(table->private); /* Address dependency. */
+	private = xt_table_get_private_protected(table);
  	cpu     = smp_processor_id();
  	table_base = private->entries;
  	jumpstack  = (struct arpt_entry **)private->jumpstack[cpu];
@@ -649,7 +649,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table)
  {
  	unsigned int countersize;
  	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	const struct xt_table_info *private = xt_table_get_private_protected(table);

  	/* We need atomic snapshot of counters: rest doesn't change
  	 * (other than comefrom, which userspace doesn't care
@@ -673,7 +673,7 @@ static int copy_entries_to_user(unsigned int total_size,
  	unsigned int off, num;
  	const struct arpt_entry *e;
  	struct xt_counters *counters;
-	struct xt_table_info *private = table->private;
+	struct xt_table_info *private = xt_table_get_private_protected(table);
  	int ret = 0;
  	void *loc_cpu_entry;

@@ -1330,7 +1330,7 @@ static int compat_copy_entries_to_user(unsigned int total_size,
  				       void __user *userptr)
  {
  	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	const struct xt_table_info *private = xt_table_get_private_protected(table);
  	void __user *pos;
  	unsigned int size;
  	int ret = 0;
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index f15bc21..5ec422c 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -258,7 +258,7 @@ ipt_do_table(struct sk_buff *skb,
  	WARN_ON(!(table->valid_hooks & (1 << hook)));
  	local_bh_disable();
  	addend = xt_write_recseq_begin();
-	private = READ_ONCE(table->private); /* Address dependency. */
+	private = xt_table_get_private_protected(table);
  	cpu        = smp_processor_id();
  	table_base = private->entries;
  	jumpstack  = (struct ipt_entry **)private->jumpstack[cpu];
@@ -791,7 +791,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table)
  {
  	unsigned int countersize;
  	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	const struct xt_table_info *private = xt_table_get_private_protected(table);

  	/* We need atomic snapshot of counters: rest doesn't change
  	   (other than comefrom, which userspace doesn't care
@@ -815,7 +815,7 @@ copy_entries_to_user(unsigned int total_size,
  	unsigned int off, num;
  	const struct ipt_entry *e;
  	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	const struct xt_table_info *private = xt_table_get_private_protected(table);
  	int ret = 0;
  	const void *loc_cpu_entry;

@@ -1543,7 +1543,7 @@ compat_copy_entries_to_user(unsigned int total_size, struct xt_table *table,
  			    void __user *userptr)
  {
  	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	const struct xt_table_info *private = xt_table_get_private_protected(table);
  	void __user *pos;
  	unsigned int size;
  	int ret = 0;
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 2e2119b..91d8364 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -280,7 +280,7 @@ ip6t_do_table(struct sk_buff *skb,

  	local_bh_disable();
  	addend = xt_write_recseq_begin();
-	private = READ_ONCE(table->private); /* Address dependency. */
+	private = xt_table_get_private_protected(table);
  	cpu        = smp_processor_id();
  	table_base = private->entries;
  	jumpstack  = (struct ip6t_entry **)private->jumpstack[cpu];
@@ -807,7 +807,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table)
  {
  	unsigned int countersize;
  	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	const struct xt_table_info *private = xt_table_get_private_protected(table);

  	/* We need atomic snapshot of counters: rest doesn't change
  	   (other than comefrom, which userspace doesn't care
@@ -831,7 +831,7 @@ copy_entries_to_user(unsigned int total_size,
  	unsigned int off, num;
  	const struct ip6t_entry *e;
  	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	const struct xt_table_info *private = xt_table_get_private_protected(table);
  	int ret = 0;
  	const void *loc_cpu_entry;

@@ -1552,7 +1552,7 @@ compat_copy_entries_to_user(unsigned int total_size, struct xt_table *table,
  			    void __user *userptr)
  {
  	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	const struct xt_table_info *private = xt_table_get_private_protected(table);
  	void __user *pos;
  	unsigned int size;
  	int ret = 0;
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index af22dbe..416a617 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1349,6 +1349,14 @@ struct xt_counters *xt_counters_alloc(unsigned int counters)
  }
  EXPORT_SYMBOL(xt_counters_alloc);

+struct xt_table_info
+*xt_table_get_private_protected(const struct xt_table *table)
+{
+	return rcu_dereference_protected(table->private,
+					 mutex_is_locked(&xt[table->af].mutex));
+}
+EXPORT_SYMBOL(xt_table_get_private_protected);
+
  struct xt_table_info *
  xt_replace_table(struct xt_table *table,
  	      unsigned int num_counters,
@@ -1356,7 +1364,6 @@ xt_replace_table(struct xt_table *table,
  	      int *error)
  {
  	struct xt_table_info *private;
-	unsigned int cpu;
  	int ret;

  	ret = xt_jumpstack_alloc(newinfo);
@@ -1366,8 +1373,7 @@ xt_replace_table(struct xt_table *table,
  	}

  	/* Do the substitution. */
-	local_bh_disable();
-	private = table->private;
+	private = xt_table_get_private_protected(table);

  	/* Check inside lock: is the old number correct? */
  	if (num_counters != private->number) {
@@ -1379,34 +1385,9 @@ xt_replace_table(struct xt_table *table,
  	}

  	newinfo->initial_entries = private->initial_entries;
-	/*
-	 * Ensure contents of newinfo are visible before assigning to
-	 * private.
-	 */
-	smp_wmb();
-	table->private = newinfo;
-
-	/* make sure all cpus see new ->private value */
-	smp_wmb();

-	/*
-	 * Even though table entries have now been swapped, other CPU's
-	 * may still be using the old entries...
-	 */
-	local_bh_enable();
-
-	/* ... so wait for even xt_recseq on all cpus */
-	for_each_possible_cpu(cpu) {
-		seqcount_t *s = &per_cpu(xt_recseq, cpu);
-		u32 seq = raw_read_seqcount(s);
-
-		if (seq & 1) {
-			do {
-				cond_resched();
-				cpu_relax();
-			} while (seq == raw_read_seqcount(s));
-		}
-	}
+	rcu_assign_pointer(table->private, newinfo);
+	synchronize_rcu();

  	audit_log_nfcfg(table->name, table->af, private->number,
  			!private->number ? AUDIT_XT_OP_REGISTER :
@@ -1442,12 +1423,12 @@ struct xt_table *xt_register_table(struct net *net,
  	}

  	/* Simplifies replace_table code. */
-	table->private = bootstrap;
+	rcu_assign_pointer(table->private, bootstrap);

  	if (!xt_replace_table(table, 0, newinfo, &ret))
  		goto unlock;

-	private = table->private;
+	private = xt_table_get_private_protected(table);
  	pr_debug("table->private->number = %u\n", private->number);

  	/* save number of initial entries */
@@ -1470,7 +1451,8 @@ void *xt_unregister_table(struct xt_table *table)
  	struct xt_table_info *private;

  	mutex_lock(&xt[table->af].mutex);
-	private = table->private;
+	private = xt_table_get_private_protected(table);
+	RCU_INIT_POINTER(table->private, NULL);
  	list_del(&table->list);
  	mutex_unlock(&xt[table->af].mutex);
  	audit_log_nfcfg(table->name, table->af, private->number,
Florian Westphal Nov. 20, 2020, 6:31 a.m. UTC | #15
subashab@codeaurora.org <subashab@codeaurora.org> wrote:
> I've updated the patch with your comments.
> Do you expect a performance impact either in datapath or perhaps more in
> the rule installation with the rcu changes.

Rule installation.  synchronize_rcu() can take several seconds on busy
systems.

> diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
> index d1e04d2..dda5d8f 100644
> --- a/net/ipv4/netfilter/arp_tables.c
> +++ b/net/ipv4/netfilter/arp_tables.c
> @@ -203,7 +203,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,
> 
>  	local_bh_disable();
>  	addend = xt_write_recseq_begin();
> -	private = READ_ONCE(table->private); /* Address dependency. */
> +	private = xt_table_get_private_protected(table);

Err, no, this needs to be plain rcu_dereference(table->private).
Same in the other _do_table() versions.

We do not hold the table mutex here.
Peter Zijlstra Nov. 20, 2020, 9:44 a.m. UTC | #16
On Thu, Nov 19, 2020 at 10:53:13PM -0700, subashab@codeaurora.org wrote:
> +struct xt_table_info
> +*xt_table_get_private_protected(const struct xt_table *table)
> +{
> +	return rcu_dereference_protected(table->private,
> +					 mutex_is_locked(&xt[table->af].mutex));
> +}
> +EXPORT_SYMBOL(xt_table_get_private_protected);

In all debug builds this function compiles to a single memory
dereference. Do you really want that out-of-line?
Peter Zijlstra Nov. 20, 2020, 9:53 a.m. UTC | #17
On Fri, Nov 20, 2020 at 10:44:13AM +0100, Peter Zijlstra wrote:
> On Thu, Nov 19, 2020 at 10:53:13PM -0700, subashab@codeaurora.org wrote:
> > +struct xt_table_info
> > +*xt_table_get_private_protected(const struct xt_table *table)
> > +{
> > +	return rcu_dereference_protected(table->private,
> > +					 mutex_is_locked(&xt[table->af].mutex));
> > +}
> > +EXPORT_SYMBOL(xt_table_get_private_protected);
> 
> In all debug builds this function compiles to a single memory

! went missing... :/

> dereference. Do you really want that out-of-line?
Florian Westphal Nov. 20, 2020, 10:20 a.m. UTC | #18
Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Nov 20, 2020 at 10:44:13AM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 19, 2020 at 10:53:13PM -0700, subashab@codeaurora.org wrote:
> > > +struct xt_table_info
> > > +*xt_table_get_private_protected(const struct xt_table *table)
> > > +{
> > > +	return rcu_dereference_protected(table->private,
> > > +					 mutex_is_locked(&xt[table->af].mutex));
> > > +}
> > > +EXPORT_SYMBOL(xt_table_get_private_protected);
> > 
> > In all debug builds this function compiles to a single memory
> 
> ! went missing... :/

? Not sure about "!".

> > dereference. Do you really want that out-of-line?

Its the lesser of two evils, the other solution would be to export
xt[] from x_tables.c; its static right now.

Or we just use a rcu primitive that does no checking but I'd prefer
avoid that.
Peter Zijlstra Nov. 20, 2020, 10:47 a.m. UTC | #19
On Fri, Nov 20, 2020 at 11:20:32AM +0100, Florian Westphal wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Nov 20, 2020 at 10:44:13AM +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 19, 2020 at 10:53:13PM -0700, subashab@codeaurora.org wrote:
> > > > +struct xt_table_info
> > > > +*xt_table_get_private_protected(const struct xt_table *table)
> > > > +{
> > > > +	return rcu_dereference_protected(table->private,
> > > > +					 mutex_is_locked(&xt[table->af].mutex));
> > > > +}
> > > > +EXPORT_SYMBOL(xt_table_get_private_protected);
> > > 
> > > In all debug builds this function compiles to a single memory
> > 
> > ! went missing... :/
> 
> ? Not sure about "!".

!debug gets you a single deref; for debug builds we do get extra code.

> > > dereference. Do you really want that out-of-line?
> 
> Its the lesser of two evils, the other solution would be to export
> xt[] from x_tables.c; its static right now.

Bah, missed that.
Subash Abhinov Kasiviswanathan Nov. 21, 2020, 1:27 a.m. UTC | #20
>> > > > +struct xt_table_info
>> > > > +*xt_table_get_private_protected(const struct xt_table *table)
>> > > > +{
>> > > > +	return rcu_dereference_protected(table->private,
>> > > > +					 mutex_is_locked(&xt[table->af].mutex));
>> > > > +}
>> > > > +EXPORT_SYMBOL(xt_table_get_private_protected);
>> > >
>> > > In all debug builds this function compiles to a single memory
>> >
>> > ! went missing... :/
>> 
>> ? Not sure about "!".
> 
> !debug gets you a single deref, for debug builds do we get extra code.

Can you elaborate on this? I am not able to get the context here.
Are you referring to the definition of RCU_LOCKDEP_WARN?
RCU_LOCKDEP_WARN is a NOOP if CONFIG_PROVE_RCU is false.

#ifdef CONFIG_PROVE_RCU
#define RCU_LOCKDEP_WARN(c, s)						\
	do {								\
		static bool __section(.data.unlikely) __warned;		\
		if (debug_lockdep_rcu_enabled() && !__warned && (c)) {	\
			__warned = true;				\
			lockdep_rcu_suspicious(__FILE__, __LINE__, s);	\
		}							\
	} while (0)
...
#else /* #ifdef CONFIG_PROVE_RCU */

#define RCU_LOCKDEP_WARN(c, s) do { } while (0)
Patch

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index af22dbe..39f1f2b 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1404,7 +1404,7 @@  xt_replace_table(struct xt_table *table,
 			do {
 				cond_resched();
 				cpu_relax();
-			} while (seq == raw_read_seqcount(s));
+			} while (!read_seqcount_retry(s, seq));
 		}
 	}