
[nf-next,0/8] netfilter: untangle bridge and bridge netfilter

Message ID 20150309130253.GA6677@salvia
State Accepted
Delegated to: Pablo Neira

Commit Message

Pablo Neira Ayuso March 9, 2015, 1:02 p.m. UTC
Hi Florian,

On Thu, Mar 05, 2015 at 12:52:32AM +0100, Florian Westphal wrote:
> bridge_netfilter.h contains various helpers; some are only used by br_netfilter,
> others, however, are also called from the bridge or even the IP stack.
> 
> Let's start untangling bridge, bridge netfilter, and the
> rest of the ip stack (esp. ip_fragment).
> 
> This changes ip_fragment() so that bridge netfilter
> can pass in the required information as arguments instead
> of using skb->nf_bridge to pass some extra information to it.
> 
> Another problem with br_netfilter and the way it's plumbed to
> ip/ip6-tables (physdev match) is skb->nf_bridge.
> 
> nf_bridge is a kmalloced blob with some extra information, including
> the bridge in and out ports (mainly for iptables' physdev match).
> It also has various state bits so we know what manipulations
> have been performed by bridge netfilter on the skb (e.g.
> ppp header stripping).
>
> nf_bridge also provides scratch space where br_netfilter saves
> (and later restores) various things, e.g. ipv4 address for
> dnat detection, mac address to fix up ip fragmented skbs, etc.
> 
> But in almost all cases we can avoid using ->data completely.

I think one of the goals of this patchset is to prepare the removal of
that nf_bridge pointer from sk_buff, which sounds like a good idea to me.

Did you consider implementing this scratchpad area as a per-cpu
area? I mean, something similar to what we used to have in
ip_conntrack for events:

http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/net/netfilter/nf_conntrack_ecache.c?v=2.6.25#L65

Following that approach, I think we could avoid changing the
ip_fragment() interface. My concern is that these changes are too
specific to br_netfilter, which is not a good reference client
given all its design problems.
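To make the per-cpu scratchpad idea concrete, here is a minimal userspace sketch under stated assumptions: a thread-local slot stands in for a `DEFINE_PER_CPU` area used under `local_bh_disable()`, and all names (`br_nf_scratch`, the helpers) are hypothetical, not from any actual patch:

```c
#include <string.h>

/* Hypothetical per-CPU scratchpad for br_netfilter state; a
 * thread-local slot models the per-CPU variable. */
struct br_nf_scratch {
	unsigned int mask;       /* state bits, e.g. "ppp header stripped" */
	unsigned char mac[6];    /* saved MAC for fragment fixup */
	unsigned int daddr;      /* saved IPv4 address for DNAT detection */
};

static _Thread_local struct br_nf_scratch br_nf_scratch_slot;

/* Claim the slot when processing of one skb starts. */
static struct br_nf_scratch *br_nf_scratch_begin(void)
{
	memset(&br_nf_scratch_slot, 0, sizeof(br_nf_scratch_slot));
	return &br_nf_scratch_slot;
}

/* Later hooks in the same processing run reuse the same slot. */
static struct br_nf_scratch *br_nf_scratch_get(void)
{
	return &br_nf_scratch_slot;
}
```

The catch, discussed further down the thread, is that such a slot is only valid while the packet stays on this CPU within a single processing run.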

I'm preparing a new net-next batch, I can quickly take these three if
you agree:

1/8 bridge: move mac header copying into br_netfilter
2/8 netfilter: bridge: move nf_bridge_update_protocol to where its used
4/8 netfilter: bridge: refactor conditional in br_nf_dev_queue_xmit

Regarding 3/8, I think we can move that code to br_netfilter.c so we
reduce ifdef pollution a bit and keep as much of this monster code as possible
fenced off in that file. Please see the attached patch.

BTW, I think 6/8 needs some ifdef to make sure NF_CONNTRACK is in
place. 7/8 needs NF_BRDIGE_MAX_MAC_HEADER_LENGTH, which seems to have
a small typo in it; well, these are dependent on 4/8 anyway.

Let me know, thanks a lot!

Comments

Florian Westphal March 9, 2015, 1:13 p.m. UTC | #1
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

[ CC Andy, see below for ip_fragment api discussion ]

> > Let's start untangling bridge, bridge netfilter, and the
> > rest of the ip stack (esp. ip_fragment).
> > 
> > This changes ip_fragment() so that bridge netfilter
> > can pass in the required information as arguments instead
> > of using skb->nf_bridge to pass some extra information to it.
> > 
> > Another problem with br_netfilter and the way it's plumbed to
> > ip/ip6-tables (physdev match) is skb->nf_bridge.
> > 
> > nf_bridge is a kmalloced blob with some extra information, including
> > the bridge in and out ports (mainly for iptables' physdev match).
> > It also has various state bits so we know what manipulations
> > have been performed by bridge netfilter on the skb (e.g.
> > ppp header stripping).
> >
> > nf_bridge also provides scratch space where br_netfilter saves
> > (and later restores) various things, e.g. ipv4 address for
> > dnat detection, mac address to fix up ip fragmented skbs, etc.
> > 
> > But in almost all cases we can avoid using ->data completely.
> 
> I think one of the goals of this patchset is to prepare the removal of
> that nf_bridge pointer from sk_buff, which sounds like a good idea to me.

The 2nd (and last) half of the set folds nf_bridge_info into skbuff,
at the end (where it's not initialized at allocation time).

The struct is a lot smaller by then (16 bytes on 64-bit, 12 on 32-bit),
so we'd increase the skbuff size by only 8 bytes.

> Did you consider implementing this scratchpad area as a per-cpu
> area? I mean, something similar to what we used to have in
> ip_conntrack for events:
> 
> http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/net/netfilter/nf_conntrack_ecache.c?v=2.6.25#L65
> 
> Following that approach, I think we could avoid changing the
> ip_fragment() interface.

Yes, I had a patch that used a percpu area.  BUT, please see conflicting
patch:

http://patchwork.ozlabs.org/patch/447939/

So if OVS needs such an interface, I thought it made no sense to work
around the current API using percpu data.

Andy, it looks like I could either wait for your patches to hit
net-next, or you could refactor your changes based on my ip_fragment
changes; AFAICS the use cases are pretty much the same so both
ovs and bridge netfilter could use one unified ip_fragment().
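As a toy model of what "pass the required information as arguments" could look like (this is not the real kernel API; the names, the array-based output, and the context struct are illustrative only — the real ip_fragment() hands each fragment to an output callback instead):

```c
#include <stddef.h>
#include <string.h>

struct frag {
	unsigned char l2[14];    /* rebuilt link-layer header */
	size_t payload_len;
};

/* Caller-supplied context replacing the hidden skb->nf_bridge lookup:
 * the saved MAC header travels as an explicit argument. */
struct frag_ctx {
	const unsigned char *saved_mac;  /* NULL for the plain IP path */
	size_t mac_len;
};

/* Split total_len into mtu-sized pieces; in the bridge case the saved
 * MAC is restored on every fragment because the caller passed it in. */
static int ip_fragment_model(size_t total_len, size_t mtu,
			     const struct frag_ctx *ctx,
			     struct frag *out, size_t max_frags)
{
	size_t off, n = 0;

	for (off = 0; off < total_len; off += mtu, n++) {
		if (n == max_frags)
			return -1;
		out[n].payload_len = (total_len - off > mtu) ? mtu : total_len - off;
		memset(out[n].l2, 0, sizeof(out[n].l2));
		if (ctx && ctx->saved_mac)
			memcpy(out[n].l2, ctx->saved_mac, ctx->mac_len);
	}
	return (int)n;
}
```

The point of the sketch is only the shape of the interface: no per-packet side channel, so both OVS and br_netfilter callers could supply their own context.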

> My concern is that these changes are too
> specific to br_netfilter, which is not a good reference client
> given all its design problems.

I agree that a change just for br_netfilter's sake sucks, but please also
keep in mind that right now we have 'extra signalling' into ip_fragment
via the skb->nf_bridge pointer, which I dislike even more.

> I'm preparing a new net-next batch, I can quickly take these three if
> you agree:
> 
> 1/8 bridge: move mac header copying into br_netfilter
> 2/8 netfilter: bridge: move nf_bridge_update_protocol to where its used
> 4/8 netfilter: bridge: refactor conditional in br_nf_dev_queue_xmit

Sure.  I can then submit the entire series (i.e. part1 without these 3
plus part2).  I have no further br netfilter changes after that.

> Regarding 3/8, I think we can move that code to br_netfilter.c so we
> reduce ifdef pollution a bit and keep as much of this monster code as possible
> fenced off in that file. Please see the attached patch.

Sure.

> BTW, I think 6/8 needs some ifdef to make sure NF_CONNTRACK is in
> place. 7/8 needs NF_BRDIGE_MAX_MAC_HEADER_LENGTH, which seems to have
> a small typo in it; well, these are dependent on 4/8 anyway.

Seems you're right, I'll look at it, thanks Pablo!
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Westphal March 9, 2015, 1:59 p.m. UTC | #2
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > nf_bridge is a kmalloced blob with some extra information, including
> > the bridge in and out ports (mainly for iptables' physdev match).
> > It also has various state bits so we know what manipulations
> > have been performed by bridge netfilter on the skb (e.g.
> > ppp header stripping).
> >
> > nf_bridge also provides scratch space where br_netfilter saves
> > (and later restores) various things, e.g. ipv4 address for
> > dnat detection, mac address to fix up ip fragmented skbs, etc.
> > 
> > But in almost all cases we can avoid using ->data completely.
> 
> I think one of the goals of this patchset is to prepare the removal of
> that nf_bridge pointer from sk_buff, which sounds like a good idea to me.
> 
> Did you consider implementing this scratchpad area as a per-cpu
> area? I mean, something similar to what we used to have in
> ip_conntrack for events:

I see that I misread part of what you wrote.

We cannot use percpu data for nf_bridge_info; at least it's not trivial
to do (I tried).

Main issues are:
- nfqueue (we have to save/restore info)
- local delivery (skb is enqueued in backlog)
- skb is dropped (we need to reset scratch data)
- DNAT mac hack (skb can be queued in qdisc).

We _can_ use percpu data for passing the original mac
header to the ip_fragment output function.  But, as mentioned,
there is a patch in netdev patchwork that adds extra output
parameter for use with OVS so it seems cleaner to (re-) use common
api in both OVS and br netfilter.

> BTW, I think 6/8 needs some ifdef to make sure NF_CONNTRACK is in
> place.

Right, thanks for catching this.

> 7/8 needs NF_BRDIGE_MAX_MAC_HEADER_LENGTH, which seems to have
> a small typo in it; well, these are dependent on 4/8 anyway.

Sorry -- I'm dense -- what typo? :-)
Pablo Neira Ayuso March 9, 2015, 4:47 p.m. UTC | #3
On Mon, Mar 09, 2015 at 02:13:56PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> 
> > I'm preparing a new net-next batch, I can quickly take these three if
> > you agree:
> > 
> > 1/8 bridge: move mac header copying into br_netfilter
> > 2/8 netfilter: bridge: move nf_bridge_update_protocol to where its used
> > 4/8 netfilter: bridge: refactor conditional in br_nf_dev_queue_xmit
> 
> Sure.  I can then submit the entire series (i.e. part1 without these 3
> plus part2).  I have no further br netfilter changes after that.
> 
> > Regarding 3/8, I think we can move that code to br_netfilter.c so we
> > reduce ifdef pollution a bit and keep as much of this monster code
> > fenced under that file. Please, see patch attached.
> 
> Sure.

OK, I'll push these four small changes in the following nf-next pull
request then, so you and Andy can sort this out. Thanks Florian.
David Miller March 9, 2015, 5:16 p.m. UTC | #4
From: Florian Westphal <fw@strlen.de>
Date: Mon, 9 Mar 2015 14:13:56 +0100

> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> 
> [ CC Andy, see below for ip_fragment api discussion ]
> 
>> > Let's start untangling bridge, bridge netfilter, and the
>> > rest of the ip stack (esp. ip_fragment).
>> > 
>> > This changes ip_fragment() so that bridge netfilter
>> > can pass in the required information as arguments instead
>> > of using skb->nf_bridge to pass some extra information to it.
>> > 
>> > Another problem with br_netfilter and the way it's plumbed to
>> > ip/ip6-tables (physdev match) is skb->nf_bridge.
>> > 
>> > nf_bridge is a kmalloced blob with some extra information, including
>> > the bridge in and out ports (mainly for iptables' physdev match).
>> > It also has various state bits so we know what manipulations
>> > have been performed by bridge netfilter on the skb (e.g.
>> > ppp header stripping).
>> >
>> > nf_bridge also provides scratch space where br_netfilter saves
>> > (and later restores) various things, e.g. ipv4 address for
>> > dnat detection, mac address to fix up ip fragmented skbs, etc.
>> > 
>> > But in almost all cases we can avoid using ->data completely.
>> 
>> I think one of the goals of this patchset is to prepare the removal of
>> that nf_bridge pointer from sk_buff, which sounds like a good idea to me.
> 
> The 2nd (and last) half of the set folds nf_bridge_info into skbuff,
> at the end (where it's not initialized at allocation time).
> 
> The struct is a lot smaller by then (16 bytes on 64-bit, 12 on 32-bit),
> so we'd increase the skbuff size by only 8 bytes.

Sorry, increases in size of any kind of sk_buff are strongly discouraged.

Especially for something like nfbridge.
Florian Westphal March 9, 2015, 5:35 p.m. UTC | #5
David Miller <davem@redhat.com> wrote:
> > The 2nd (and last) half of the set folds nf_bridge_info into skbuff,
> > at the end (where it's not initialized at allocation time).
> > 
> > The struct is a lot smaller by then (16 bytes on 64-bit, 12 on 32-bit),
> > so we'd increase the skbuff size by only 8 bytes.
> 
> Sorry, increases in size of any kind of sk_buff are strongly discouraged.

No need to be sorry, it's not a show-stopper.
It's the last patch in the series and nothing depends on this change.

> Especially for something like nfbridge.

It's not initialized, ever, if the bridge is not used, so it shouldn't be a big
deal.  The goal of the entire exercise is to remove the need to zero the
nf_bridge pointer at skb allocation time, so one of the patches in the
series moves it to the end of the skb.

The only reason I added the 'fold it' patch on top is that after the
refactoring the nf_bridge_info struct is just 16 bytes, so it seemed
preferable to me to add the cost of 8 more bytes and get rid of the
nf_bridge refcounting at the same time.

I'd just send it along with the series so everyone can look at it,
and, if you (or others) still prefer to keep the pointer then Pablo can
just drop it.
David Miller March 9, 2015, 7:20 p.m. UTC | #6
From: Florian Westphal <fw@strlen.de>
Date: Mon, 9 Mar 2015 18:35:15 +0100

> David Miller <davem@redhat.com> wrote:
>> > The 2nd (and last) half of the set folds nf_bridge_info into skbuff,
>> > at the end (where it's not initialized at allocation time).
>> > 
>> > The struct is a lot smaller by then (16 bytes on 64-bit, 12 on 32-bit),
>> > so we'd increase the skbuff size by only 8 bytes.
>> 
>> Sorry, increases in size of any kind of sk_buff are strongly discouraged.
> 
> No need to be sorry, it's not a show-stopper.
> It's the last patch in the series and nothing depends on this change.
> 
>> Especially for something like nfbridge.
> 
> It's not initialized, ever, if the bridge is not used, so it shouldn't be a big
> deal.

It's about the overall size of the structure, not whether the member is
ever used or not.
Pablo Neira Ayuso March 14, 2015, 9 a.m. UTC | #7
On Mon, Mar 09, 2015 at 02:59:10PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > nf_bridge is a kmalloced blob with some extra information, including
> > > the bridge in and out ports (mainly for iptables' physdev match).
> > > It also has various state bits so we know what manipulations
> > > have been performed by bridge netfilter on the skb (e.g.
> > > ppp header stripping).
> > >
> > > nf_bridge also provides scratch space where br_netfilter saves
> > > (and later restores) various things, e.g. ipv4 address for
> > > dnat detection, mac address to fix up ip fragmented skbs, etc.
> > > 
> > > But in almost all cases we can avoid using ->data completely.
> > 
> > I think one of the goals of this patchset is to prepare the removal of
> > that nf_bridge pointer from sk_buff, which sounds like a good idea to me.
> > 
> > Did you consider implementing this scratchpad area as a per-cpu
> > area? I mean, something similar to what we used to have in
> > ip_conntrack for events:
> 
> I see that I misread part of what you wrote.
> 
> We cannot use percpu data for nf_bridge_info; at least it's not trivial
> to do (I tried).
> 
> Main issues are:
> - nfqueue (we have to save/restore info)
> - local delivery (skb is enqueued in backlog)
> - skb is dropped (we need to reset scratch data)
> - DNAT mac hack (skb can be queued in qdisc).

What about something like this?

1) We keep an nf_bridge table that contains the nf_bridge structures
   that are currently in use, so we can look them up every
   time we need them. This can be implemented as a per-cpu list of
   nf_bridge structures.

2) We have another per-cpu cache to hold a pointer to the current
   nf_bridge.

3) We only need one bit from sk_buff to indicate that this sk_buff has
   nf_bridge info attached.

Then, we can use the sk_buff pointer as a unique way to identify
the owner of the nf_bridge structure.

   struct nf_bridge {
        struct list_head        head;
        const struct sk_buff    *owner;
        ...
   }

so if we need the scratchpad area, we first have to look it up:

   struct nf_bridge *nf_bridge_find(struct sk_buff *skb)
   {
        ...

        /* First, check if we have it in the per_cpu cache. */
        nf_bridge = per_cpu_ptr(nf_bridge_pcpu);
        if (nf_bridge->owner == skb)
            return nf_bridge;

        /* Otherwise, slow path: find this scratchpad area in the list. */
        ...
        nf_bridge_list = per_cpu_ptr(nf_bridge_pcpu_list);
        list_for_each_entry(n, nf_bridge_list, head) {
            if (n->owner == skb) {
                /* Update the per_cpu cache. */
                rcu_assign_pointer(nf_bridge, n);
                return n;
            }
        }

        return NULL;
   }

From skb_release_head_state() we'll have to:

    if (skb->nf_bridge_present) {
         struct nf_bridge *nf_bridge = nf_bridge_find(skb);

         /* Remove it from the per-cpu nf_bridge cache. */
         list_del(&nf_bridge->head);
         kfree(nf_bridge);
    }

In nfqueue or other situations where we enqueue the packet, we'll enter
the slow path of nf_bridge_find(). This comes with some overhead in
that case, but one of the goals is to get rid of this pointer from
sk_buff, which is unused for most people out there, as you already
mentioned in some patch.
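A minimal userspace model of this owner-keyed scheme (a singly-linked list stands in for the per-CPU list and a plain pointer for the per-CPU cache; RCU and locking are omitted, and all names are hypothetical):

```c
#include <stdlib.h>

struct sk_buff { int placeholder; };

struct nf_bridge {
	struct nf_bridge *next;
	const struct sk_buff *owner;  /* the skb this entry belongs to */
	unsigned int mask;
};

static _Thread_local struct nf_bridge *nf_bridge_list;   /* "per-CPU" list */
static _Thread_local struct nf_bridge *nf_bridge_cache;  /* "per-CPU" cache */

static struct nf_bridge *nf_bridge_alloc(const struct sk_buff *skb)
{
	struct nf_bridge *n = calloc(1, sizeof(*n));

	if (!n)
		return NULL;
	n->owner = skb;
	n->next = nf_bridge_list;
	nf_bridge_list = n;
	nf_bridge_cache = n;
	return n;
}

static struct nf_bridge *nf_bridge_find(const struct sk_buff *skb)
{
	struct nf_bridge *n;

	/* Fast path: the cache usually points at the current packet. */
	if (nf_bridge_cache && nf_bridge_cache->owner == skb)
		return nf_bridge_cache;

	/* Slow path (e.g. after nfqueue): walk the list, refresh cache. */
	for (n = nf_bridge_list; n; n = n->next) {
		if (n->owner == skb) {
			nf_bridge_cache = n;
			return n;
		}
	}
	return NULL;
}

/* What skb_release_head_state() would do for skbs with the bit set. */
static void nf_bridge_release(const struct sk_buff *skb)
{
	struct nf_bridge **pp, *n;

	for (pp = &nf_bridge_list; (n = *pp) != NULL; pp = &n->next) {
		if (n->owner == skb) {
			*pp = n->next;
			if (nf_bridge_cache == n)
				nf_bridge_cache = NULL;
			free(n);
			return;
		}
	}
}
```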

I'm proposing this because I have concerns with the approach of placing
nf_bridge in skb->cb. This br_netfilter thing makes the skb go back
and forth between different layers while expecting to find the right
data in it; this seems fragile to me.

What do you think?
Florian Westphal March 14, 2015, 11:13 a.m. UTC | #8
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > We cannot use percpu data for nf_bridge_info; at least it's not trivial
> > to do (I tried).
> > 
> > Main issues are:
> > - nfqueue (we have to save/restore info)
> > - local delivery (skb is enqueued in backlog)
> > - skb is dropped (we need to reset scratch data)
> > - DNAT mac hack (skb can be queued in qdisc).

The DNAT mac hack is no longer an issue; it can be handled via skb->cb.

> What about something like this?
> 
> 1) We keep an nf_bridge table that contains the nf_bridge structures
>    that are currently in use, so we can look them up every
>    time we need them. This can be implemented as a per-cpu list of
>    nf_bridge structures.

Not sure if it can be percpu.  In particular, when we pass the frame up
the skb can be enqueued in the backlog; we might also encounter the ingress
scheduler, so I am not sure that the skb cannot move to another cpu.

> 2) We have another per-cpu cache to hold a pointer to the current
>    nf_bridge.
> 
> 3) We only need one bit from sk_buff to indicate that this sk_buff has
>    nf_bridge info attached.

Yes, the "put it in skb->cb" approach also does this: it puts 2-bit
"state" information into the skb and then removes the pointer.

> Then, we can use the sk_buff pointer as a unique way to identify
> the owner of the nf_bridge structure.
> 
>    struct nf_bridge {
>         struct list_head        head;
>         const struct sk_buff    *owner;
>         ...
>    }
> 
> so if we need the scratchpad area, we first have to look it up:
> 
>    struct nf_bridge *nf_bridge_find(struct sk_buff *skb)
>    {
>         ...
> 
>         /* First, check if we have it in the per_cpu cache. */
>         nf_bridge = per_cpu_ptr(nf_bridge_pcpu);
>         if (nf_bridge->owner == skb)
>             return nf_bridge;
> 
>         /* Otherwise, slow path: find this scratchpad area in the list. */
>         ...
>         nf_bridge_list = per_cpu_ptr(nf_bridge_pcpu_list);
>         list_for_each_entry(n, nf_bridge_list, head) {
>             if (n->owner == skb) {
>                 /* Update the per_cpu cache. */
>                 rcu_assign_pointer(nf_bridge, n);
>                 return n;
>             }
>         }
> 
>         return NULL;
>    }

Ok, I see.  This would work, but I'd prefer to just use the skb control
buffer.

> I'm proposing this because I have concerns with the approach of placing
> nf_bridge in skb->cb. This br_netfilter thing makes the skb go back
> and forth between different layers while expecting to find the right
> data in it; this seems fragile to me.

Mhhh, for the *usual* case of "skb is forwarded by bridge", skb->cb
should be safe since we only move the skb between bridge and inet(6).

The only "problematic" case is where the skb is DNAT'd; then things get
hairy (e.g. dnat-to-host-on-other-device).

> What do you think?

I think that currently it doesn't matter what solution we'll pick in the
end, I'd first like to send all my refactoring patches.

With those patches, it reduces the nf_bridge_info content that we need
to the physin and physout devices, plus a two-bit state in skb.

Then, if you still think that ->cb is too fragile I'd be happy to add
the lookup table method.  For the normal case of bridge forwarding, it
shouldn't be needed though, perhaps we could also use a hybrid approach,
cb-based fastpath for forwarding, lookup-table slowpath for dnat and
local delivery cases.

One other solution would be to use skb->cb but store not the bridge device
pointers but their ifindexes instead (we currently don't hold any
device refcounts, so the current method isn't 100% safe anyway since
such a device can be removed ...).
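A rough sketch of that ifindex variant (the field names and the 48-byte skb->cb size are assumptions for illustration, not taken from any actual patch):

```c
/* Hypothetical skb->cb layout for br_netfilter: interface indexes
 * instead of net_device pointers.  If the device disappears, the later
 * re-lookup (dev_get_by_index() in the kernel) simply fails instead of
 * leaving a dangling pointer behind. */
struct br_nf_cb {
	int physinif;         /* ifindex of the bridge in-port  */
	int physoutif;        /* ifindex of the bridge out-port */
	unsigned char state;  /* the two-bit br_netfilter state */
};

/* Must fit into the skb->cb scratch area (assumed 48 bytes here). */
_Static_assert(sizeof(struct br_nf_cb) <= 48, "br_nf_cb must fit in skb->cb");
```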

Pablo Neira Ayuso March 16, 2015, 12:38 p.m. UTC | #9
Hi Florian,

On Sat, Mar 14, 2015 at 12:13:25PM +0100, Florian Westphal wrote:
> I think that currently it doesn't matter what solution we'll pick in the
> end, I'd first like to send all my refactoring patches.
> 
> With those patches, it reduces the nf_bridge_info content that we need
> to the physin and physout devices, plus a two-bit state in skb.
> 
> Then, if you still think that ->cb is too fragile I'd be happy to add
> the lookup table method.  For the normal case of bridge forwarding, it
> shouldn't be needed though, perhaps we could also use a hybrid approach,
> cb-based fastpath for forwarding, lookup-table slowpath for dnat and
> local delivery cases.

I think the use of skb->cb introduces another dependency between
br_netfilter and the core, which is what we should avoid.  I still
have concerns regarding future extensibility issues that this may
introduce:

1) If we want to shrink the skb->cb in size, br_netfilter would limit
   this shrink since we'll have to keep enough room for our internal
   private information, even if most people will not need this.

2) If inet or qdisc private info size needs to be increased for
   whatever reason, we may run out of space to keep the br_netfilter data
   after it. And we cannot ask David to allow us to allocate more bytes
   for skb->cb to resolve this.

I think we have to find some permanent solution to isolate this
br_netfilter code as much as possible. Then, focus on providing a
replacement covering the minimal set of features that most users
of br_netfilter rely on; I would say:

* nfqueue support
* transparent proxying
* some simple way to perform stateless NAT (mangle mac and IP address
  to place the packet in the right interface).

so we can point them to nf_tables to progressively migrate to a better
environment; the overall goal to me is to deprecate this br_netfilter
thing.

I think we can use the per-cpu cache + global hashtable to look up
nf_bridge data (using the skbuff as index). The forwarded traffic will
find its nf_bridge info most of the time in the cache, without having
to follow the slow lookup path.

> One other solution would be to use skb->cb but store not the bridge device
> pointers but their ifindexes instead (we currently don't hold any
> device refcounts, so the current method isn't 100% safe anyway since
> such a device can be removed ...).

Good catch, I think it would be great to start by fixing this problem
first.

Let me know,
Thanks!
Florian Westphal March 16, 2015, 1:01 p.m. UTC | #10
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Sat, Mar 14, 2015 at 12:13:25PM +0100, Florian Westphal wrote:
> > I think that currently it doesn't matter what solution we'll pick in the
> > end, I'd first like to send all my refactoring patches.
> > 
> > With those patches, it reduces the nf_bridge_info content that we need
> > to the physin and physout devices, plus a two-bit state in skb.
> > 
> > Then, if you still think that ->cb is too fragile I'd be happy to add
> > the lookup table method.  For the normal case of bridge forwarding, it
> > shouldn't be needed though, perhaps we could also use a hybrid approach,
> > cb-based fastpath for forwarding, lookup-table slowpath for dnat and
> > local delivery cases.
> 
> I think the use of skb->cb introduces another dependency between
> br_netfilter and the core, which is what we should avoid.  I still
> have concerns regarding future extensibility issues that this may
> introduce:
> 
> 1) If we want to shrink the skb->cb in size, br_netfilter would limit
>    this shrink since we'll have to keep enough room for our internal
>    private information, even if most people will not need this.

Given the size of the tcp and gro control blocks this shouldn't be an issue.
We'd still fit into a 40-byte cb, and, if that were a problem, we
could still go with a (more complex) external storage scheme.

> 2) If inet or qdisc private info size needs to be increased for
>    whatever reason, we may run out of space to keep the br_netfilter data
>    after it. And we cannot ask David to allow us to allocate more bytes
>    for skb->cb to resolve this.

Sure, absolutely.  skb->cb increase is out of question.

I don't think it's a problem though; qdisc private info could even be
reduced to 16 bytes (2 pointers on 64-bit, which I think is a reasonable size
to provide for qdiscs).  Infiniband has a similar hack.

And tcp cb has to embed  inet/inet6 as well.

We wouldn't even need to pull off this stunt, actually; we could just zap
the skb instead of reinjecting it... But that would induce e.g. a syn
retransmit, which I want to avoid.

> I think we have to find some permanent solution to isolate this
> br_netfilter code as much as possible. Then, focus on providing a
> replacement including the minimal set of features for what most users
> are using br_netfilter, I would say:
> 
> * nfqueue support
> * transparent proxying
> * some simple way to perform stateless NAT (mangle mac and IP address
>   to place the packet in the right interface).

Please, no stateless (IP) nat; it usually bites users due to the lack
of helpers...  I've no problem with mac mangling.

> so we can point them to nf_tables to progressively migrate to a better
> environment, the overall goal to me is to deprecate this br_netfilter
> thing.

Yes, but it will be a long time before we can remove this thing.

> I think we can use the per-cpu cache + global hashtable to look up for
> nf_bridge data (using skbuff as index). The forwarded traffic will
> find its nf_bridge info most of the time in the cache without having
> to follow the slow lookup path.

I'm still not convinced we even need lookups in the forward case since
the skb can never leave the br_netfilter context.

And we only need to store the physin/out ports, nothing else, so it's a
bit sad that we need to add such lookup stuff in the first place :-/

I agree that it would still be better than the skb->nf_bridge
indirection, though.

> > One other solution would be to use skb->cb but store not the bridge device
> > pointers but their ifindexes instead (we currently don't hold any
> > device refcounts, so the current method isn't 100% safe anyway since
> > such a device can be removed ...).
> 
> Good catch, I think it would be great to start by fixing this problem
> first.

I don't see how this is easily fixable, sorry -- it would mean we'd have
to dev_hold() right after allocation of the nf_bridge_info struct...

We can store ifindexes of physin/physout and re-lookup, sure.

I'll start on the central storage infrastructure.
Do you want me to base it on the current patches I have or on plain
nf-next?

Thanks.
Florian Westphal March 16, 2015, 1:41 p.m. UTC | #11
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> I think the use of skb->cb introduces another dependency between
> br_netfilter and the core, which is what we should avoid.  I still
> have concerns regarding future extensibility issues that this may
> introduce:

One more thing (sorry, forgot to mention this).
We CANNOT avoid a dependency between br_netfilter and the core.

It's just a matter of what this dependency looks like, and how
much of a burden it is wrt. maintenance and debuggability.

Currently the dependency is skb->nf_bridge pointer.

I proposed to replace this pointer with a smaller state + placing the
rest of the info in skb->cb, with all the problems that this entails.

Using your 'external table' will also require some interaction;
I don't see how you'd remove entries from such a table except via some
callback function sitting in the kfree_skb path, so we'd have to replace
the current nf_bridge_put() call with something like

if (skb->nf_bridge_state && nf_bridge_info_destructor_hook)
	nf_bridge_info_destructor_hook(skb);

in kfree_skb.

We'd also still need the refcounting and to figure out a way to handle
skb_clone properly (although I *think* we only have to care about those
spots where the bridge clones skbs and don't need a generic solution).

I admit that your proposal has fewer 'hidden' dependencies compared
to skb->cb based solutions, though.
Pablo Neira Ayuso March 16, 2015, 1:47 p.m. UTC | #12
On Mon, Mar 16, 2015 at 02:01:10PM +0100, Florian Westphal wrote:
> We can store ifindexes of physin/physout and re-lookup, sure.
> 
> I'll start on the central storage infrastructure.
> Do you want me to base it on the current patches i have or on plain
> nf-next?

Let me pick these first:

[-next,v2] netfilter: bridge: query conntrack about skb dnat
[-next,resend] netfilter: bridge: remove BRNF_STATE_BRIDGED flag

This, I'll ping David to ack it:

[v2,nf-next,1/6] net: untangle ip_fragment and bridge netfilter

The follow-up patches, please rebase upon the central storage
idea. I like your small state-machine idea instead of playing with
flags, and the new functions that encapsulate the save/restore mangling
to keep IP/bridge happy.

Thanks.

Patch

From 57a8e6c3ac46601615f0df0ccc98e6481c8017a9 Mon Sep 17 00:00:00 2001
From: Florian Westphal <fw@strlen.de>
Date: Thu, 5 Mar 2015 00:52:35 +0100
Subject: [PATCH] netfilter: bridge: move DNAT helper to br_netfilter

Only one caller; there is no need to keep this in a header.
Move it to br_netfilter.c, where it belongs.

Based on patch from Florian Westphal.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netfilter_bridge.h |   12 ------------
 net/bridge/br_device.c           |    5 +----
 net/bridge/br_netfilter.c        |   32 ++++++++++++++++++++++++++++++++
 net/bridge/br_private.h          |    5 +++++
 4 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index dd580a9..bb39113 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -44,18 +44,6 @@  static inline unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
 }
 
 int br_handle_frame_finish(struct sk_buff *skb);
-/* Only used in br_device.c */
-static inline int br_nf_pre_routing_finish_bridge_slow(struct sk_buff *skb)
-{
-	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
-
-	skb_pull(skb, ETH_HLEN);
-	nf_bridge->mask ^= BRNF_BRIDGED_DNAT;
-	skb_copy_to_linear_data_offset(skb, -(ETH_HLEN-ETH_ALEN),
-				       skb->nf_bridge->data, ETH_HLEN-ETH_ALEN);
-	skb->dev = nf_bridge->physindev;
-	return br_handle_frame_finish(skb);
-}
 
 /* This is called by the IP fragmenting code and it ensures there is
  * enough room for the encapsulating header (if there is one). */
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index ffd379d..294cbcc 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -36,13 +36,10 @@  netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 	u16 vid = 0;
 
 	rcu_read_lock();
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	if (skb->nf_bridge && (skb->nf_bridge->mask & BRNF_BRIDGED_DNAT)) {
-		br_nf_pre_routing_finish_bridge_slow(skb);
+	if (br_nf_prerouting_finish_bridge(skb)) {
 		rcu_read_unlock();
 		return NETDEV_TX_OK;
 	}
-#endif
 
 	u64_stats_update_begin(&brstats->syncp);
 	brstats->tx_packets++;
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index ef1fe28..a8361c7 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -892,6 +892,38 @@  static unsigned int ip_sabotage_in(const struct nf_hook_ops *ops,
 	return NF_ACCEPT;
 }
 
+/* This is called when br_netfilter has called into iptables/netfilter,
+ * and DNAT has taken place on a bridge-forwarded packet.
+ *
+ * neigh->output has created a new MAC header, with local br0 MAC
+ * as saddr.
+ *
+ * This restores the original MAC saddr of the bridged packet
+ * before invoking bridge forward logic to transmit the packet.
+ */
+static void br_nf_pre_routing_finish_bridge_slow(struct sk_buff *skb)
+{
+	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
+
+	skb_pull(skb, ETH_HLEN);
+	nf_bridge->mask &= ~BRNF_BRIDGED_DNAT;
+
+	skb_copy_to_linear_data_offset(skb, -(ETH_HLEN-ETH_ALEN),
+				       skb->nf_bridge->data, ETH_HLEN-ETH_ALEN);
+	skb->dev = nf_bridge->physindev;
+	br_handle_frame_finish(skb);
+}
+
+int br_nf_prerouting_finish_bridge(struct sk_buff *skb)
+{
+	if (skb->nf_bridge && (skb->nf_bridge->mask & BRNF_BRIDGED_DNAT)) {
+		br_nf_pre_routing_finish_bridge_slow(skb);
+		return 1;
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(br_nf_prerouting_finish_bridge);
+
 void br_netfilter_enable(void)
 {
 }
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index de09199..d63fc17 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -764,10 +764,15 @@  static inline int br_vlan_enabled(struct net_bridge *br)
 
 /* br_netfilter.c */
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+int br_nf_prerouting_finish_bridge(struct sk_buff *skb);
 int br_nf_core_init(void);
 void br_nf_core_fini(void);
 void br_netfilter_rtable_init(struct net_bridge *);
 #else
+static inline int br_nf_prerouting_finish_bridge(struct sk_buff *skb)
+{
+        return 0;
+}
 static inline int br_nf_core_init(void) { return 0; }
 static inline void br_nf_core_fini(void) {}
 #define br_netfilter_rtable_init(x)
-- 
1.7.10.4