diff mbox

[RFC,v3,2/2] br_netfilter: per-netns copy of structure for sysctl flags

Message ID 5370F781.7010909@parallels.com
State RFC
Headers show

Commit Message

Vasily Averin May 12, 2014, 4:32 p.m. UTC
pernet_operations creates per-netns copy of common structure for sysctl flags
and initialize it values taken from init_brnf_net.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
 net/bridge/br_netfilter.c |  104 +++++++++++++++++++++++++++++++++++++-------
 1 files changed, 87 insertions(+), 17 deletions(-)

Comments

Bart De Schuymer May 12, 2014, 7:04 p.m. UTC | #1
Vasily Averin schreef op 12/05/2014 18:32:
> pernet_operations creates per-netns copy of common structure for sysctl flags
> and initialize it values taken from init_brnf_net.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

> +static int __net_init brnf_net_init(struct net *net)
> +{
> +	struct brnf_net *bn = brnf_net(net);
> +
> +	memcpy(bn, &init_brnf_net, sizeof(struct brnf_net));
> +	bn->net = net;
> +	return brnf_sysctl_net_register(bn);

This does introduce a bit of backwards incompatibility (easily fixed by 
adapting scripts), but this is really unavoidable when transforming an 
existing global configuration to a per-netns configuration. I'm ok with it.

cheers,
Bart

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vasily Averin May 12, 2014, 8:11 p.m. UTC | #2
On 05/12/2014 11:04 PM, Bart De Schuymer wrote:
> Vasily Averin schreef op 12/05/2014 18:32:
>> pernet_operations creates per-netns copy of common structure for sysctl flags
>> and initialize it values taken from init_brnf_net.
>>
>> Signed-off-by: Vasily Averin <vvs@openvz.org>
> 
>> +static int __net_init brnf_net_init(struct net *net)
>> +{
>> +    struct brnf_net *bn = brnf_net(net);
>> +
>> +    memcpy(bn, &init_brnf_net, sizeof(struct brnf_net));
>> +    bn->net = net;
>> +    return brnf_sysctl_net_register(bn);
> 
> This does introduce a bit of backwards incompatibility (easily fixed
> by adapting scripts), but this is really unavoidable when
> transforming an existing global configuration to a per-netns
> configuration. I'm ok with it.

Could you please explain, which backward incompatibility you mean here?
Nobody changes values init_brnf_net,
init_net have own copy, like any other network namespaces.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bart De Schuymer May 13, 2014, 7:28 p.m. UTC | #3
Vasily Averin schreef op 12/05/2014 22:11:
> On 05/12/2014 11:04 PM, Bart De Schuymer wrote:
>> Vasily Averin schreef op 12/05/2014 18:32:
>>> pernet_operations creates per-netns copy of common structure for sysctl flags
>>> and initialize it values taken from init_brnf_net.
>>>
>>> Signed-off-by: Vasily Averin <vvs@openvz.org>
>>
>>> +static int __net_init brnf_net_init(struct net *net)
>>> +{
>>> +    struct brnf_net *bn = brnf_net(net);
>>> +
>>> +    memcpy(bn, &init_brnf_net, sizeof(struct brnf_net));
>>> +    bn->net = net;
>>> +    return brnf_sysctl_net_register(bn);
>>
>> This does introduce a bit of backwards incompatibility (easily fixed
>> by adapting scripts), but this is really unavoidable when
>> transforming an existing global configuration to a per-netns
>> configuration. I'm ok with it.
>
> Could you please explain, which backward incompatibility you mean here?
> Nobody changes values init_brnf_net,
> init_net have own copy, like any other network namespaces.

Well, init_brnf_net is never written to, so it keeps the default flags.
If a new netns is created, a copy of the contents of init_brnf_net is 
made. So, whenever a netns is created, it starts with the default flags 
(e.g. brnf_call_iptables is always 1 for a newly created netns).

In the current kernel, when a new netns is created, the configuration of 
the main netns is used (the proc system doesn't even show the flags in 
the created netns): if brnf_call_iptables is 0 before the new netns is 
created, iptables won't see bridged IP traffic in the new netns.
With your patch, this behaviour will change.

It's possible to alter your patch to keep the same behaviour as before 
at netns creation, but always starting from the same defaults is cleaner.

cheers,
Bart

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vasily Averin May 15, 2014, 9:01 a.m. UTC | #4
Dear Tejun,

how do you think, which defaults should be used for per-namespace settings in general case
and for per-netns sysctls especially? Do we have some common position about this or
perhaps we already have some setting that allows to select desired behavior?

I'm preparing patch that makes per-netns sysctls in br_netfilter,
to be able to enable/disable br-nf-call processing in each network namespace independently.

I've initialized sysctl values in each netns by system defaults, like it was done in similar cases.
However Bart pointed that "this does introduce a bit of backwards incompatibility":
currently all netns shares the br_netfilter sysctl settings applied in init_net.

From OpenVz point of view containers should be properly isolated,
should have predictable initial configuration
and should not depend on settings applied in another containers.
On the other hand independent containers is only one of possible usecases,
and I have no strong objections against Bart's proposal. Frankly speaking
initially I've planned to copy setting from init_net too.

To make possible both variants I can introduce one more setting,
it allows to specify desired behavior:
to use system defaults or to copy current settings from init_net.

However probably the same dilemma was observed in another subsystems?
Perhaps this selector already exists?

If isn't, how do you think, should I introduce some new global parameter,
or may be it should be some local bridge-only-related setting?

Thank you,
	Vasily Averin

you can find some more details about my patch in netfilter-devel@
[PATCH RFC v3 2/2] br_netfilter: per-netns copy of structure for sysctl flags
On 05/13/2014 11:28 PM, Bart De Schuymer wrote:
> Vasily Averin schreef op 12/05/2014 22:11:
>> On 05/12/2014 11:04 PM, Bart De Schuymer wrote:
>>> Vasily Averin schreef op 12/05/2014 18:32:
>>>> pernet_operations creates per-netns copy of common structure for sysctl flags
>>>> and initialize it values taken from init_brnf_net.
>>>>
>>>> Signed-off-by: Vasily Averin <vvs@openvz.org>
>>>
>>>> +static int __net_init brnf_net_init(struct net *net)
>>>> +{
>>>> +    struct brnf_net *bn = brnf_net(net);
>>>> +
>>>> +    memcpy(bn, &init_brnf_net, sizeof(struct brnf_net));
>>>> +    bn->net = net;
>>>> +    return brnf_sysctl_net_register(bn);
>>>
>>> This does introduce a bit of backwards incompatibility (easily fixed
>>> by adapting scripts), but this is really unavoidable when
>>> transforming an existing global configuration to a per-netns
>>> configuration. I'm ok with it.
>>
>> Could you please explain, which backward incompatibility you mean here?
>> Nobody changes values init_brnf_net,
>> init_net have own copy, like any other network namespaces.
> 
> Well, init_brnf_net is never written to, so it keeps the default flags.
> If a new netns is created, a copy of the contents of init_brnf_net is made. So, whenever a netns is created, it starts with the default flags (e.g. brnf_call_iptables is always 1 for a newly created netns).
> 
> In the current kernel, when a new netns is created, the configuration of the main netns is used (the proc system doesn't even show the flags in the created netns): if brnf_call_iptables is 0 before the new netns is created, iptables won't see bridged IP traffic in the new netns.
> With your patch, this behaviour will change.
> 
> It's possible to alter your patch to keep the same behaviour as before at netns creation, but always starting from the same defaults is cleaner.

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Serge E. Hallyn May 15, 2014, 11:02 a.m. UTC | #5
Quoting Vasily Averin (vvs@parallels.com):
> Dear Tejun,
> 
> how do you think, which defaults should be used for per-namespace settings in general case
> and for per-netns sysctls especially? Do we have some common position about this or
> perhaps we already have some setting that allows to select desired behavior?
> 
> I'm preparing patch that makes per-netns sysctls in br_netfilter,
> to be able to enable/disable br-nf-call processing in each network namespace independently.
> 
> I've initialized sysctl values in each netns by system defaults, like it was done in similar cases.
> However Bart pointed that "this does introduce a bit of backwards incompatibility":
> currently all netns shares the br_netfilter sysctl settings applied in init_net.
> 
> From OpenVz point of view containers should be properly isolated,
> should have predictable initial configuration
> and should not depend on settings applied in another containers.

In 'another container' no, but if you are starting a nested container, then
the from the parent container, yes.  Not from init_net.

> On the other hand independent containers is only one of possible usecases,
> and I have no strong objections against Bart's proposal. Frankly speaking
> initially I've planned to copy setting from init_net too.
> 
> To make possible both variants I can introduce one more setting,
> it allows to specify desired behavior:
> to use system defaults or to copy current settings from init_net.
> 
> However probably the same dilemma was observed in another subsystems?
> Perhaps this selector already exists?
> 
> If isn't, how do you think, should I introduce some new global parameter,
> or may be it should be some local bridge-only-related setting?
> 
> Thank you,
> 	Vasily Averin
> 
> you can find some more details about my patch in netfilter-devel@
> [PATCH RFC v3 2/2] br_netfilter: per-netns copy of structure for sysctl flags
> On 05/13/2014 11:28 PM, Bart De Schuymer wrote:
> > Vasily Averin schreef op 12/05/2014 22:11:
> >> On 05/12/2014 11:04 PM, Bart De Schuymer wrote:
> >>> Vasily Averin schreef op 12/05/2014 18:32:
> >>>> pernet_operations creates per-netns copy of common structure for sysctl flags
> >>>> and initialize it values taken from init_brnf_net.
> >>>>
> >>>> Signed-off-by: Vasily Averin <vvs@openvz.org>
> >>>
> >>>> +static int __net_init brnf_net_init(struct net *net)
> >>>> +{
> >>>> +    struct brnf_net *bn = brnf_net(net);
> >>>> +
> >>>> +    memcpy(bn, &init_brnf_net, sizeof(struct brnf_net));
> >>>> +    bn->net = net;
> >>>> +    return brnf_sysctl_net_register(bn);
> >>>
> >>> This does introduce a bit of backwards incompatibility (easily fixed
> >>> by adapting scripts), but this is really unavoidable when
> >>> transforming an existing global configuration to a per-netns
> >>> configuration. I'm ok with it.
> >>
> >> Could you please explain, which backward incompatibility you mean here?
> >> Nobody changes values init_brnf_net,
> >> init_net have own copy, like any other network namespaces.
> > 
> > Well, init_brnf_net is never written to, so it keeps the default flags.
> > If a new netns is created, a copy of the contents of init_brnf_net is made. So, whenever a netns is created, it starts with the default flags (e.g. brnf_call_iptables is always 1 for a newly created netns).
> > 
> > In the current kernel, when a new netns is created, the configuration of the main netns is used (the proc system doesn't even show the flags in the created netns): if brnf_call_iptables is 0 before the new netns is created, iptables won't see bridged IP traffic in the new netns.
> > With your patch, this behaviour will change.
> > 
> > It's possible to alter your patch to keep the same behaviour as before at netns creation, but always starting from the same defaults is cleaner.
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo May 15, 2014, 5:48 p.m. UTC | #6
Hello,

On Thu, May 15, 2014 at 01:01:52PM +0400, Vasily Averin wrote:
> how do you think, which defaults should be used for per-namespace settings in general case
> and for per-netns sysctls especially? Do we have some common position about this or
> perhaps we already have some setting that allows to select desired behavior?

Unfortunately, I don't have much idea on the subject.  Serge and
others on the containers list should know a lot better than me.

Thanks.
Maciej Żenczykowski May 16, 2014, 11:16 a.m. UTC | #7
Isn't pulling current settings from init_net an isolation violation if
init_net isn't the namespace you are in at the time you are creating
the new namespace?

The way I see it there are 2 possibilities:
(a) you use some kernel (probably compile time) defaults (ie. what
init_net gets when you boot machine)
(b) you inherit from current namespace
I'm not sure what the right choice is.

For something like 'iptables configuration' it seems (a) is correct
(come up with no firewall).
For something like 'tcp socket memory limits' or 'bindv6only' or
'v6.default_use_tempaddr) it does seem like (b) is possibly more
appropriate.

That said I think there are cases where (a) is clearly correct and (b)
is clearly not desirable (iptables conf being a prime example).
After all a new namespace doesn't inherit interfaces from the
namespace we're in when we create it.

I can't think of any cases where (b) is clearly correct and (a) is
clearly not desirable.
[I guess this is less than clear for settings which auto scale at boot
with available ram and/or number of cpus in the machine]

Based on that doing (a) for everything may be the right choice
(consistency trump...).
This would imply network namespace you are in should have no effect on
the new network namespace you are creating.

OTOH, if I want to change some tcp mem tuning sysctl (or something
like net.ipv6.conf.default.use_tempaddr = 2)
it would be annoying if /etc/sysctl.conf didn't apply to non-init
namespace.  But perhaps this is better solved in userspace
by loading some /etc/sysctls-for-new-network-namespaces.conf settings
in some network namespace creating libraries.

- Maciej
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bart De Schuymer May 19, 2014, 7:30 p.m. UTC | #8
Vasily Averin schreef op 15/05/2014 11:01:
> Dear Tejun,
>
> how do you think, which defaults should be used for per-namespace settings in general case
> and for per-netns sysctls especially? Do we have some common position about this or
> perhaps we already have some setting that allows to select desired behavior?
>
> I'm preparing patch that makes per-netns sysctls in br_netfilter,
> to be able to enable/disable br-nf-call processing in each network namespace independently.
>
> I've initialized sysctl values in each netns by system defaults, like it was done in similar cases.
> However Bart pointed that "this does introduce a bit of backwards incompatibility":
> currently all netns shares the br_netfilter sysctl settings applied in init_net.
>
>>From OpenVz point of view containers should be properly isolated,
> should have predictable initial configuration
> and should not depend on settings applied in another containers.
> On the other hand independent containers is only one of possible usecases,
> and I have no strong objections against Bart's proposal. Frankly speaking
> initially I've planned to copy setting from init_net too.

You misread my mail. I stated that I'm ok with always starting from the 
defaults (as your patch does). As pointed out by Maciej, always starting 
from init_net isn't really an option in case of nested namespaces (start 
from the parent's namespace instead).
There'll always be pros and cons to whatever you choose here. Complete 
backwards compatibility isn't possible either way.
The only way to keep backwards compatibility is to introduce new proc 
file names and keep the old behavior for the old names (but I'm not in 
favor of that).

cheers,
Bart

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vasily Averin June 24, 2014, 8:21 a.m. UTC | #9
On 05/19/2014 11:30 PM, Bart De Schuymer wrote:
> As pointed out by Maciej, always
> starting from init_net isn't really an option in case of nested
> namespaces (start from the parent's namespace instead).

Dear Bart, Serge, Maciej
thank you very much for your feedback!

I've analyzed possibility to inherit settings from parent net-namespace,
discovered problems described below and finally decided to follow
Maciej's way (a) "use some kernel defaults", with adding an ability
to change pre-compiled kernel defaults.

Below you can found more detailed description of discovered problems.

1) there are no (easy) ways to find parent of given network namespace.

Network namespaces in kernel are not hierarchical but flat,
"struct net" have no reference to parent netns, and my collegians expect
that Eric Biederman will likely object to adding a parent netns pointer.

Without this reference I do not see any good ways to copy parents settings.

2) settings inheriting does not work if subsystem module is loaded after
creation of network namespace. 

In this case all namespaces get pre-compiled defaults settings, and seems
there are no ways to apply "adjusted" setting to all already existing netns.

Moreover there is curious situation: to apply required sysctl settings
during module loading, Red Hat recommends to force "sysctl -p" execution
via install command in modprobe.conf
https://bugzilla.redhat.com/show_bug.cgi?id=634735#c7

However if module is loaded from inside one of network namespaces
it does not work!

In this case sysctl is executed inside netns. 
If assigned sysctl key is not virtualized -- sysctl command can fail
if key is virtualized  -- setting  in current netns  will be adjusted,
but not -- in init_net, that looks unexpected for me.

I believe initial subsystem settings of newly created namespace should
not differ from initial settings of newly created subsystem in already
existing namespace. In case in-kernel setting inheriting this behavior
cannot be reached, additional subsystem tuning is required anyway.

Therefore Maceiej's variant (a) "use some kernel defaults" looks like
right choice for me. If parent wants to assign some adjusted settings
in child environments -- it can only force loading of required modules
and apply required settings directly.

At the same time I would like to have an ability to change pre-compiled
defaults somehow. In my patch I'm going to add new module options, that
allows node owner to specify wished "safe" settings before module loading,
and change them via sysfs after this.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman June 25, 2014, 7:45 a.m. UTC | #10
Vasily Averin <vvs@parallels.com> writes:

> On 05/19/2014 11:30 PM, Bart De Schuymer wrote:
>> As pointed out by Maciej, always
>> starting from init_net isn't really an option in case of nested
>> namespaces (start from the parent's namespace instead).
>
> Dear Bart, Serge, Maciej
> thank you very much for your feedback!

I am missing the context which makes raises this issue.

> I've analyzed possibility to inherit settings from parent net-namespace,
> discovered problems described below and finally decided to follow
> Maciej's way (a) "use some kernel defaults", with adding an ability
> to change pre-compiled kernel defaults.
>
> Below you can found more detailed description of discovered problems.
>
> 1) there are no (easy) ways to find parent of given network namespace.
>
> Network namespaces in kernel are not hierarchical but flat,
> "struct net" have no reference to parent netns, and my collegians expect
> that Eric Biederman will likely object to adding a parent netns pointer.
>
> Without this reference I do not see any good ways to copy parents
> settings.

Copying settings can easily happen at netowkr namespace creation time.
Copying at any other time is too weird to even think about.  So no you
don't need a parent network namespace pointer to enable copying.

> 2) settings inheriting does not work if subsystem module is loaded after
> creation of network namespace. 
>
> In this case all namespaces get pre-compiled defaults settings, and seems
> there are no ways to apply "adjusted" setting to all already existing
> netns.

setns.

> Moreover there is curious situation: to apply required sysctl settings
> during module loading, Red Hat recommends to force "sysctl -p" execution
> via install command in modprobe.conf
> https://bugzilla.redhat.com/show_bug.cgi?id=634735#c7
>
> However if module is loaded from inside one of network namespaces
> it does not work!

Why not?  The appropriate events should fire globally.  And note
in most use cases participants in a network namespace won't have
permissions to load modules.

> In this case sysctl is executed inside netns. 
> If assigned sysctl key is not virtualized -- sysctl command can fail
> if key is virtualized  -- setting  in current netns  will be adjusted,
> but not -- in init_net, that looks unexpected for me.

From what little you have said.  This sounds like a don't do that
then situation.  Certainly if a module is the kernel triggers request
module the module will be loaded in the initial set of namespaces.

> I believe initial subsystem settings of newly created namespace should
> not differ from initial settings of newly created subsystem in already
> existing namespace. In case in-kernel setting inheriting this behavior
> cannot be reached, additional subsystem tuning is required anyway.

You are arguing that creation of a network namespace should use the
kernel's default values for sysctls?  That is a fairly reasonable
position to take.

> Therefore Maceiej's variant (a) "use some kernel defaults" looks like
> right choice for me. If parent wants to assign some adjusted settings
> in child environments -- it can only force loading of required modules
> and apply required settings directly.
>
> At the same time I would like to have an ability to change pre-compiled
> defaults somehow. In my patch I'm going to add new module options, that
> allows node owner to specify wished "safe" settings before module loading,
> and change them via sysfs after this.

Why sysfs and not sysctl?

It is not clear to me what is going on, from the limited details I see
in this message it sounds like there may be a bit of overdesign and
tackling problems that do not matter in the real world going on.

For any kernel settings that apply to a network namespace we have
two very basic choices.
- Set them to default values when a namespace is initialized.
- Copy them from somewhere when the namespace is created.

Last I looked at that code we were copying sysctl values from
the initial network namespace instead of the creators network
namespace.  Which has always seemed a bit silly to me.

In general most people don't care and this does not cause an
issue for most folks, or we could not have gone 5+ years without
addressing it.

For most things any practical program at this point is going to
have to set the sysctls it cares about because it is going to
have to run on existing kernels.

Beyond that I don't have a strong opinion but we could either set values
to well expected defaults, or copy them from the creators previous
network namespace.  Both would give deterministic results without any
significant chance of breaking userspace today.

Is their a compelling use case in this conversation that could weigh the
decision of which semantics make the most sense?

Adding sysfs entries or module parameters to change the action of
sysctls sounds like there is something broken somewhere.  Unfortunately
it is not clear to me where that somewhere is.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 31bfd90..9886afc 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -40,6 +40,7 @@ 
 #include "br_private.h"
 #ifdef CONFIG_SYSCTL
 #include <linux/sysctl.h>
+#include <net/netns/generic.h>
 #endif
 
 #define skb_origaddr(skb)	 (((struct bridge_skb_cb *) \
@@ -47,9 +48,6 @@ 
 #define store_orig_dstaddr(skb)	 (skb_origaddr(skb) = ip_hdr(skb)->daddr)
 #define dnat_took_place(skb)	 (skb_origaddr(skb) != ip_hdr(skb)->daddr)
 
-#ifdef CONFIG_SYSCTL
-static struct ctl_table_header *brnf_sysctl_header;
-#endif
 #define brnf_call_arptables 1
 #define brnf_call_iptables 1
 #define brnf_call_ip6tables 1
@@ -58,6 +56,7 @@  static struct ctl_table_header *brnf_sysctl_header;
 #define brnf_pass_vlan_indev 0
 
 #ifdef CONFIG_SYSCTL
+static int brnf_net_id __read_mostly;
 static struct brnf_net init_brnf_net = {
 	.hdr			= NULL,
 	.call_arptables		= brnf_call_arptables,
@@ -68,7 +67,13 @@  static struct brnf_net init_brnf_net = {
 	.pass_vlan_indev	= brnf_pass_vlan_indev,
 };
 
-#define brnf_flag(skb, flag)		init_brnf_net.flag
+static inline struct brnf_net *brnf_net(const struct net *net)
+{
+	return net_generic(net, brnf_net_id);
+}
+
+#define skb_netns(skb)			net((skb)->dev)
+#define brnf_flag(skb, flag)		brnf_net(skb_netns(skb))->flag
 #else
 #define brnf_flag(skb, flag)		brnf_##flag
 #endif
@@ -1066,6 +1071,70 @@  static struct ctl_table brnf_table[] = {
 	},
 	{ }
 };
+
+static int brnf_sysctl_net_register(struct brnf_net *bn)
+{
+	struct ctl_table *table;
+	struct ctl_table_header *hdr;
+	int i;
+
+	table = brnf_table;
+	if (!net_eq(bn->net, &init_net)) {
+
+		table = kmemdup(table, sizeof(brnf_table), GFP_KERNEL);
+		if (!table)
+			goto err_alloc;
+	}
+	for (i = 0; table[i].data; i++)
+		table[i].data += (char *)bn - (char *)&init_brnf_net;
+
+	hdr = register_net_sysctl(bn->net, "net/bridge", table);
+	if (!hdr)
+		goto err_reg;
+
+	bn->hdr = hdr;
+	return 0;
+
+err_reg:
+	if (!net_eq(bn->net, &init_net))
+		kfree(table);
+err_alloc:
+	return -ENOMEM;
+}
+
+static void brnf_sysctl_net_unregister(struct brnf_net *bn)
+{
+	struct ctl_table *table;
+
+	if (bn->hdr == NULL)
+		return;
+
+	table = bn->hdr->ctl_table_arg;
+	unregister_net_sysctl_table(bn->hdr);
+	if (!net_eq(bn->net, &init_net))
+		kfree(table);
+}
+
+static int __net_init brnf_net_init(struct net *net)
+{
+	struct brnf_net *bn = brnf_net(net);
+
+	memcpy(bn, &init_brnf_net, sizeof(struct brnf_net));
+	bn->net = net;
+	return brnf_sysctl_net_register(bn);
+}
+
+static void __net_exit brnf_net_exit(struct net *net)
+{
+	brnf_sysctl_net_unregister(brnf_net(net));
+}
+
+static struct pernet_operations __net_initdata brnf_net_ops = {
+	.init	= brnf_net_init,
+	.exit	= brnf_net_exit,
+	.id	= &brnf_net_id,
+	.size	= sizeof(struct brnf_net),
+};
 #endif
 
 int __init br_netfilter_init(void)
@@ -1074,32 +1143,33 @@  int __init br_netfilter_init(void)
 
 	ret = dst_entries_init(&fake_dst_ops);
 	if (ret < 0)
-		return ret;
+		goto err_dst;
 
 	ret = nf_register_hooks(br_nf_ops, ARRAY_SIZE(br_nf_ops));
-	if (ret < 0) {
-		dst_entries_destroy(&fake_dst_ops);
-		return ret;
-	}
+	if (ret < 0)
+		goto err_nf;
+
 #ifdef CONFIG_SYSCTL
-	brnf_sysctl_header = register_net_sysctl(&init_net, "net/bridge", brnf_table);
-	if (brnf_sysctl_header == NULL) {
-		printk(KERN_WARNING
-		       "br_netfilter: can't register to sysctl.\n");
+	ret = register_pernet_subsys(&brnf_net_ops);
+	if (ret < 0) {
 		nf_unregister_hooks(br_nf_ops, ARRAY_SIZE(br_nf_ops));
-		dst_entries_destroy(&fake_dst_ops);
-		return -ENOMEM;
+		goto err_nf;
 	}
 #endif
 	printk(KERN_NOTICE "Bridge firewalling registered\n");
 	return 0;
+
+err_nf:
+	dst_entries_destroy(&fake_dst_ops);
+err_dst:
+	return ret;
 }
 
 void br_netfilter_fini(void)
 {
-	nf_unregister_hooks(br_nf_ops, ARRAY_SIZE(br_nf_ops));
 #ifdef CONFIG_SYSCTL
-	unregister_net_sysctl_table(brnf_sysctl_header);
+	unregister_pernet_subsys(&brnf_net_ops);
 #endif
+	nf_unregister_hooks(br_nf_ops, ARRAY_SIZE(br_nf_ops));
 	dst_entries_destroy(&fake_dst_ops);
 }