Message ID | 776b8819c85c83088478b933a35691133055347a.1430733932.git.daniel@iogearbox.net
---|---
State | Changes Requested
Delegated to | Pablo Neira
On Mon, May 04, 2015 at 12:23:41PM +0200, Daniel Borkmann wrote:
> This patch adds support for the possibility of doing NAT with
> conflicting IP address/ports tuples from multiple, isolated
> tenants, represented as network namespaces and netfilter zones.
> For such internal VRFs, traffic is directed to a single or shared
> pool of public IP address/port range for the external/public VRF.
>
> Or in other words, this allows for doing NAT *between* VRFs
> instead of *inside* VRFs without requiring each tenant to NAT
> twice or to use its own dedicated IP address to SNAT to, also
> with the side effect to not requiring to expose a unique marker
> per tenant in the data center to the public.
>
> Simplified example scheme:
>
>  +--- VRF A ---+  +--- CT Zone 1 --------+
>  | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
>  +-------------+  +--+-------------------+
>                      |
>                   +--+--+
>                   | L3  +-SNAT-[20.1.1.1:20000-40000]--eth0
>                   +--+--+
>                      |
>  +-- VRF B ----+  +--- CT Zone 2 --------+
>  | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
>  +-------------+  +----------------------+

So, it's the skb->mark that survives between the containers. I'm not
sure it makes sense to keep a zone 0 from the container that performs
SNAT. Instead, we can probably restore the zone based on the
skb->mark. The problem is that the existing zone is u16. In nftables,
Patrick already mentioned about supporting casting so we can do
something like:

  ct zone set (u16)meta mark

So you can reserve a part of the skb->mark to map it to the zone. I'm
not very convinced about this.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel"
in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Pablo,

On 05/04/2015 12:34 PM, Pablo Neira Ayuso wrote:
> On Mon, May 04, 2015 at 12:23:41PM +0200, Daniel Borkmann wrote:
>> This patch adds support for the possibility of doing NAT with
>> conflicting IP address/ports tuples from multiple, isolated
>> tenants, represented as network namespaces and netfilter zones.
>> For such internal VRFs, traffic is directed to a single or shared
>> pool of public IP address/port range for the external/public VRF.
>>
>> Or in other words, this allows for doing NAT *between* VRFs
>> instead of *inside* VRFs without requiring each tenant to NAT
>> twice or to use its own dedicated IP address to SNAT to, also
>> with the side effect to not requiring to expose a unique marker
>> per tenant in the data center to the public.
>>
>> Simplified example scheme:
>>
>>  +--- VRF A ---+  +--- CT Zone 1 --------+
>>  | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
>>  +-------------+  +--+-------------------+
>>                      |
>>                   +--+--+
>>                   | L3  +-SNAT-[20.1.1.1:20000-40000]--eth0
>>                   +--+--+
>>                      |
>>  +-- VRF B ----+  +--- CT Zone 2 --------+
>>  | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
>>  +-------------+  +----------------------+
>
> So, it's the skb->mark that survives between the containers. I'm not
> sure it makes sense to keep a zone 0 from the container that performs
> SNAT. Instead, we can probably restore the zone based on the
> skb->mark. The problem is that the existing zone is u16. In nftables,
> Patrick already mentioned about supporting casting so we can do
> something like:
>
>   ct zone set (u16)meta mark
>
> So you can reserve a part of the skb->mark to map it to the zone. I'm
> not very convinced about this.

Thanks for the feedback! I'm not yet sure, though, that I have fully
understood how the above suggestion addresses the described problem,
i.e. how would replies to the SNATed traffic find the correct zone
again?

Our issue, simplified, basically boils down to this: given are two
zones, both use IP address <A>, and both zones want to talk to IP
address <B> in a third zone. To let those two with <A> talk to <B>,
connections are being routed + SNATed from a non-unique to a unique
address/port tuple [which the proposed approach solves], so they can
talk to <B>.

Best,
Daniel
On Mon, May 04, 2015 at 01:59:15PM +0200, Daniel Borkmann wrote:
> Hi Pablo,
>
> On 05/04/2015 12:34 PM, Pablo Neira Ayuso wrote:
> >On Mon, May 04, 2015 at 12:23:41PM +0200, Daniel Borkmann wrote:
> >>This patch adds support for the possibility of doing NAT with
> >>conflicting IP address/ports tuples from multiple, isolated
> >>tenants, represented as network namespaces and netfilter zones.
> >>For such internal VRFs, traffic is directed to a single or shared
> >>pool of public IP address/port range for the external/public VRF.
> >>
> >>Or in other words, this allows for doing NAT *between* VRFs
> >>instead of *inside* VRFs without requiring each tenant to NAT
> >>twice or to use its own dedicated IP address to SNAT to, also
> >>with the side effect to not requiring to expose a unique marker
> >>per tenant in the data center to the public.
> >>
> >>Simplified example scheme:
> >>
> >> +--- VRF A ---+  +--- CT Zone 1 --------+
> >> | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
> >> +-------------+  +--+-------------------+
> >>                     |
> >>                  +--+--+
> >>                  | L3  +-SNAT-[20.1.1.1:20000-40000]--eth0
> >>                  +--+--+
> >>                     |
> >> +-- VRF B ----+  +--- CT Zone 2 --------+
> >> | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
> >> +-------------+  +----------------------+
> >
> >So, it's the skb->mark that survives between the containers. I'm not
> >sure it makes sense to keep a zone 0 from the container that performs
> >SNAT. Instead, we can probably restore the zone based on the
> >skb->mark. The problem is that the existing zone is u16. In nftables,
> >Patrick already mentioned about supporting casting so we can do
> >something like:
> >
> > ct zone set (u16)meta mark
> >
> >So you can reserve a part of the skb->mark to map it to the zone. I'm
> >not very convinced about this.
>
> Thanks for the feedback! I'm not yet sure, though, that I have fully
> understood how the above suggestion addresses the described problem,
> i.e. how would replies to the SNATed traffic find the correct zone
> again?

From the original direction, you can set the zone based on the mark:

  -m mark --mark 1 -j CT --zone 1

Then, from the reply direction, you can restore it:

  -m conntrack --ctzone 1 -j MARK --set-mark 1
  ...

--ctzone is not supported though; it would need a new revision of the
conntrack match.

> Our issue, simplified, basically boils down to this: given are two
> zones, both use IP address <A>, and both zones want to talk to IP
> address <B> in a third zone. To let those two with <A> talk to <B>,
> connections are being routed + SNATed from a non-unique to a unique
> address/port tuple [which the proposed approach solves], so they can
> talk to <B>.
On 05/04/15 at 03:08pm, Pablo Neira Ayuso wrote:
> On Mon, May 04, 2015 at 01:59:15PM +0200, Daniel Borkmann wrote:
> > On 05/04/2015 12:34 PM, Pablo Neira Ayuso wrote:
> > >So, it's the skb->mark that survives between the containers. I'm not
> > >sure it makes sense to keep a zone 0 from the container that performs
> > >SNAT. Instead, we can probably restore the zone based on the
> > >skb->mark. The problem is that the existing zone is u16. In nftables,
> > >Patrick already mentioned about supporting casting so we can do
> > >something like:
> > >
> > > ct zone set (u16)meta mark
> > >
> > >So you can reserve a part of the skb->mark to map it to the zone. I'm
> > >not very convinced about this.
> >
> > Thanks for the feedback! I'm not yet sure, though, that I have fully
> > understood how the above suggestion addresses the described problem,
> > i.e. how would replies to the SNATed traffic find the correct zone
> > again?
>
> From the original direction, you can set the zone based on the mark:
>
>   -m mark --mark 1 -j CT --zone 1
>
> Then, from the reply direction, you can restore it:
>
>   -m conntrack --ctzone 1 -j MARK --set-mark 1
>   ...
>
> --ctzone is not supported though; it would need a new revision of the
> conntrack match.

Given that the multiple source zones which talk to a common
destination zone may have conflicting IPs, the SNAT must either occur
in the source zone, where the source address is still unique, or the
CT tuple must be made unique with a source zone identifier so that the
SNAT can occur in the destination zone.

Doing the SNAT in the source zone requires using a unique IP pool to
map to for each source zone, as otherwise IP sources may clash again
in the destination zone. We obviously can't do --SNAT -to 10.1.1.1 in
two namespaces and then just route into a third namespace. This
approach is not scalable in a container environment with 100s or even
1000s of containers, each in its own network namespace.

What we want to do instead is to do the SNAT in the destination zone,
where we can have a single SNAT rule which covers all source zones.
This allows inter-namespace communication in a /31 with minimal waste
of addresses.
On 05/04/2015 03:08 PM, Pablo Neira Ayuso wrote:
> On Mon, May 04, 2015 at 01:59:15PM +0200, Daniel Borkmann wrote:
>> Hi Pablo,
>>
>> On 05/04/2015 12:34 PM, Pablo Neira Ayuso wrote:
>>> On Mon, May 04, 2015 at 12:23:41PM +0200, Daniel Borkmann wrote:
>>>> This patch adds support for the possibility of doing NAT with
>>>> conflicting IP address/ports tuples from multiple, isolated
>>>> tenants, represented as network namespaces and netfilter zones.
>>>> For such internal VRFs, traffic is directed to a single or shared
>>>> pool of public IP address/port range for the external/public VRF.
>>>>
>>>> Or in other words, this allows for doing NAT *between* VRFs
>>>> instead of *inside* VRFs without requiring each tenant to NAT
>>>> twice or to use its own dedicated IP address to SNAT to, also
>>>> with the side effect to not requiring to expose a unique marker
>>>> per tenant in the data center to the public.
>>>>
>>>> Simplified example scheme:
>>>>
>>>>  +--- VRF A ---+  +--- CT Zone 1 --------+
>>>>  | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
>>>>  +-------------+  +--+-------------------+
>>>>                      |
>>>>                   +--+--+
>>>>                   | L3  +-SNAT-[20.1.1.1:20000-40000]--eth0
>>>>                   +--+--+
>>>>                      |
>>>>  +-- VRF B ----+  +--- CT Zone 2 --------+
>>>>  | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
>>>>  +-------------+  +----------------------+
>>>
>>> So, it's the skb->mark that survives between the containers. I'm not
>>> sure it makes sense to keep a zone 0 from the container that performs
>>> SNAT. Instead, we can probably restore the zone based on the
>>> skb->mark. The problem is that the existing zone is u16. In nftables,
>>> Patrick already mentioned about supporting casting so we can do
>>> something like:
>>>
>>> ct zone set (u16)meta mark
>>>
>>> So you can reserve a part of the skb->mark to map it to the zone. I'm
>>> not very convinced about this.
>>
>> Thanks for the feedback! I'm not yet sure, though, that I have fully
>> understood how the above suggestion addresses the described problem,
>> i.e. how would replies to the SNATed traffic find the correct zone
>> again?
>
> From the original direction, you can set the zone based on the mark:
>
>   -m mark --mark 1 -j CT --zone 1
>
> Then, from the reply direction, you can restore it:
>
>   -m conntrack --ctzone 1 -j MARK --set-mark 1
>   ...
>
> --ctzone is not supported though; it would need a new revision of the
> conntrack match.

Ok, thanks a lot, now I see what you mean. If I'm not missing
something, I see two problems with that: the first is that the zone
match would be linear, e.g. if we support 100 or more zones, we would
need to walk through the rules linearly until we find --mark 100,
right?

The other issue is that from the reply direction (when the packet
comes in with the translated address), we couldn't match in the
connection tracking table on the correct zone. The above restore rule
would assume that the match itself has already taken place and was
successful, no? (That is actually why we are direction based:
--flextuple ORIGINAL|REPLY.)

>> Our issue, simplified, basically boils down to this: given are two
>> zones, both use IP address <A>, and both zones want to talk to IP
>> address <B> in a third zone. To let those two with <A> talk to <B>,
>> connections are being routed + SNATed from a non-unique to a unique
>> address/port tuple [which the proposed approach solves], so they can
>> talk to <B>.
On Mon, May 04, 2015 at 03:47:33PM +0200, Thomas Graf wrote:
[...]
> Given that the multiple source zones which talk to a common
> destination zone may have conflicting IPs, the SNAT must either
> occur in the source zone, where the source address is still unique,
> or the CT tuple must be made unique with a source zone identifier
> so that the SNAT can occur in the destination zone.
>
> Doing the SNAT in the source zone requires using a unique IP pool
> to map to for each source zone, as otherwise IP sources may clash
> again in the destination zone. We obviously can't do --SNAT -to
> 10.1.1.1 in two namespaces and then just route into a third
> namespace. This approach is not scalable in a container environment
> with 100s or even 1000s of containers, each in its own network
> namespace.
>
> What we want to do instead is to do the SNAT in the destination
> zone, where we can have a single SNAT rule which covers all source
> zones. This allows inter-namespace communication in a /31 with
> minimal waste of addresses.

Thanks for explaining. So you need to allocate a unique tuple using
the mark to avoid clashes for the first packet that goes in the
original direction using the same pool. Then, the NAT engine will
allocate a unique tuple in the reply direction.

But what is the use case for -j CT --flextuple reply? By the time you
see the reply packet, the tuple has already been created.

Another question is whether it makes sense to have part of the flows
using your flextuple idea while some others do not, i.e.

  -s x.y.z.w/24 -j CT --flextuple original

so shouldn't this be a global switch that includes the skb->mark only
for packets coming in the original direction?

I also wonder how you're going to deal with port redirections. This
only seems to work for SNAT/masquerade to me, if the NAT happens from
the VRF side.
Hi Pablo,

On 05/06/2015 04:27 PM, Pablo Neira Ayuso wrote:
> On Mon, May 04, 2015 at 03:47:33PM +0200, Thomas Graf wrote:
> [...]
>> Given that the multiple source zones which talk to a common
>> destination zone may have conflicting IPs, the SNAT must either
>> occur in the source zone, where the source address is still unique,
>> or the CT tuple must be made unique with a source zone identifier
>> so that the SNAT can occur in the destination zone.
>>
>> Doing the SNAT in the source zone requires using a unique IP pool
>> to map to for each source zone, as otherwise IP sources may clash
>> again in the destination zone. We obviously can't do --SNAT -to
>> 10.1.1.1 in two namespaces and then just route into a third
>> namespace. This approach is not scalable in a container environment
>> with 100s or even 1000s of containers, each in its own network
>> namespace.
>>
>> What we want to do instead is to do the SNAT in the destination
>> zone, where we can have a single SNAT rule which covers all source
>> zones. This allows inter-namespace communication in a /31 with
>> minimal waste of addresses.
>
> Thanks for explaining. So you need to allocate a unique tuple using
> the mark to avoid clashes for the first packet that goes in the
> original direction using the same pool. Then, the NAT engine will
> allocate a unique tuple in the reply direction.

Yes, that's correct. In the original direction, due to the overlapping
tuple, the ct-mark is considered as well for the match, and in the
reply direction SNAT already chooses a unique tuple. That's
essentially the rationale for our use case with SNAT.

> But what is the use case for -j CT --flextuple reply? By the time you
> see the reply packet, the tuple has already been created.

Given this change is completely NAT agnostic, we can keep it as a
generic addition to the conntracker. Given that the mark is very
flexible, I think it could also be used for load balancing as a
different usage.

> Another question is whether it makes sense to have part of the flows
> using your flextuple idea while some others do not, i.e.
>
>   -s x.y.z.w/24 -j CT --flextuple original
>
> so shouldn't this be a global switch that includes the skb->mark
> only for packets coming in the original direction?

I first thought about a global sysctl switch, but eventually found
this configuration possibility from the iptables side much cleaner
and better integrated. I think if the environment is correctly
configured for that, such a partial flextuple scenario works, too.

> I also wonder how you're going to deal with port redirections. This
> only seems to work for SNAT/masquerade to me, if the NAT happens from
> the VRF side.

In our case, we'd like to use the flextuple when we're explicitly
configuring iptables with SNAT. For DNAT, one could reuse it in a
different, somewhat reversed example to the one we previously had,
together with mark-based routing and a match on the reply side.

Thanks a lot,
Daniel
Hi Daniel,

On Wed, May 06, 2015 at 08:00:42PM +0200, Daniel Borkmann wrote:
> On 05/06/2015 04:27 PM, Pablo Neira Ayuso wrote:
[...]
> >But what is the use case for -j CT --flextuple reply? By the time you
> >see the reply packet, the tuple has already been created.
>
> Given this change is completely NAT agnostic, we can keep it as a
> generic addition to the conntracker. Given that the mark is very
> flexible, I think it could also be used for load balancing as a
> different usage.

The original and reply tuples of the conntrack are set by the first
packet going in the original direction; at that time they are both
inserted into the hashes, and in this scenario it will use the default
mark (0). So if you set --flextuple reply, conntrack will include the
mark as part of the hash tuple for the first reply packet, but that
packet will not match the conntrack object that was created by the
first original packet, and it will result in a new conntrack assuming
that the reply is the original direction. This looks broken to me.

> >Another question is whether it makes sense to have part of the flows
> >using your flextuple idea while some others do not, i.e.
> >
> > -s x.y.z.w/24 -j CT --flextuple original
> >
> >so shouldn't this be a global switch that includes the skb->mark
> >only for packets coming in the original direction?
>
> I first thought about a global sysctl switch, but eventually found
> this configuration possibility from the iptables side much cleaner
> and better integrated. I think if the environment is correctly
> configured for that, such a partial flextuple scenario works, too.

This is consuming two ct status bits; these are exposed to userspace,
and we have a limited number of bits there. The one in the original
direction might be justified for the SNAT case in the specific
scenario that you show. I don't see yet how this can make sense in a
hybrid scenario. We may end up with a packet that can potentially
create and match two different flow objects if this is misconfigured.

> >I also wonder how you're going to deal with port redirections. This
> >only seems to work for SNAT/masquerade to me, if the NAT happens from
> >the VRF side.
>
> In our case, we'd like to use the flextuple when we're explicitly
> configuring iptables with SNAT. For DNAT, one could reuse it in a
> different, somewhat reversed example to the one we previously had,
> together with mark-based routing and a match on the reply side.

OK, so different VRFs with overlapping networks that are redirected to
the port of another destination. That might make sense. Still, the
hybrid scenario and the --flextuple reply need some thinking.

Thanks.
Hi Pablo,

On 05/06/2015 08:50 PM, Pablo Neira Ayuso wrote:
> On Wed, May 06, 2015 at 08:00:42PM +0200, Daniel Borkmann wrote:
>> On 05/06/2015 04:27 PM, Pablo Neira Ayuso wrote:
> [...]

Thanks for your feedback!

...

>>> Another question is whether it makes sense to have part of the flows
>>> using your flextuple idea while some others do not, i.e.
>>>
>>>   -s x.y.z.w/24 -j CT --flextuple original
>>>
>>> so shouldn't this be a global switch that includes the skb->mark
>>> only for packets coming in the original direction?
>>
>> I first thought about a global sysctl switch, but eventually found
>> this configuration possibility from the iptables side much cleaner
>> and better integrated. I think if the environment is correctly
>> configured for that, such a partial flextuple scenario works, too.
>
> This is consuming two ct status bits; these are exposed to userspace,
> and we have a limited number of bits there. The one in the original
> direction might be justified for the SNAT case in the specific
> scenario that you show.

Okay, agreed. I will respin the set with only the --flextuple ORIGINAL
direction allowed, where we'd for now only consume a single status
bit. If later on there's a need to extend this for REPLY (or even
hybrid), we still have the option to extend it.

Thanks,
Daniel
Hi Daniel,

On Thu, May 07, 2015 at 02:01:11PM +0200, Daniel Borkmann wrote:
> ...
> >>>Another question is whether it makes sense to have part of the flows
> >>>using your flextuple idea while some others do not, i.e.
> >>>
> >>> -s x.y.z.w/24 -j CT --flextuple original
> >>>
> >>>so shouldn't this be a global switch that includes the skb->mark
> >>>only for packets coming in the original direction?
> >>
> >>I first thought about a global sysctl switch, but eventually found
> >>this configuration possibility from the iptables side much cleaner
> >>and better integrated. I think if the environment is correctly
> >>configured for that, such a partial flextuple scenario works, too.
> >
> >This is consuming two ct status bits; these are exposed to userspace,
> >and we have a limited number of bits there. The one in the original
> >direction might be justified for the SNAT case in the specific
> >scenario that you show.
>
> Okay, agreed. I will respin the set with only the --flextuple ORIGINAL
> direction allowed, where we'd for now only consume a single status
> bit. If later on there's a need to extend this for REPLY (or even
> hybrid), we still have the option to extend it.

I would like to know if it makes sense to add this later on. Could you
elaborate on a DNAT scenario where this would be useful?

Thanks.
Hi Pablo,

On 05/07/2015 08:10 PM, Pablo Neira Ayuso wrote:
> On Thu, May 07, 2015 at 02:01:11PM +0200, Daniel Borkmann wrote:
>> ...
>>>>> Another question is whether it makes sense to have part of the
>>>>> flows using your flextuple idea while some others do not, i.e.
>>>>>
>>>>>   -s x.y.z.w/24 -j CT --flextuple original
>>>>>
>>>>> so shouldn't this be a global switch that includes the skb->mark
>>>>> only for packets coming in the original direction?
>>>>
>>>> I first thought about a global sysctl switch, but eventually found
>>>> this configuration possibility from the iptables side much cleaner
>>>> and better integrated. I think if the environment is correctly
>>>> configured for that, such a partial flextuple scenario works, too.
>>>
>>> This is consuming two ct status bits; these are exposed to
>>> userspace, and we have a limited number of bits there. The one in
>>> the original direction might be justified for the SNAT case in the
>>> specific scenario that you show.
>>
>> Okay, agreed. I will respin the set with only the --flextuple
>> ORIGINAL direction allowed, where we'd for now only consume a single
>> status bit. If later on there's a need to extend this for REPLY (or
>> even hybrid), we still have the option to extend it.
>
> I would like to know if it makes sense to add this later on. Could you
> elaborate on a DNAT scenario where this would be useful?

What comes to mind in case of hybrid usage for firewalling is that
flextuple in both directions would act similarly to zones; for
example, you could map things like a tunnel id into the u32 space and
include that in the matcher without many additional rules or memory
overhead.

For the reply-only case, I was thinking about a scenario where you'd
have multiple containers behind the DNAT, each with the same ip/port
that a server listens on, and you'd select one of the containers,
e.g. via the xt_statistic module, as a mark and do mark-based routing
behind the DNAT. But that setup still has the source in front of the
DNAT unique, so it wouldn't need mark inclusion in the reply case. So
for reply-only, I currently don't find an intuitive use case.

Best,
Daniel
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 095433b..6d67ab4 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -51,6 +51,8 @@ union nf_conntrack_expect_proto {
 struct nf_conntrack_helper;
 
+#define NF_CT_DEFAULT_MARK 0
+
 /* Must be kept in sync with the classes defined by helpers */
 #define NF_CT_MAX_EXPECT_CLASSES 4
@@ -277,6 +279,28 @@ static inline int nf_ct_is_untracked(const struct nf_conn *ct)
         return test_bit(IPS_UNTRACKED_BIT, &ct->status);
 }
 
+static inline bool nf_ct_is_flextuple(const struct nf_conn *ct,
+                                      const enum ip_conntrack_dir dir)
+{
+        switch (dir) {
+        case IP_CT_DIR_ORIGINAL:
+                return test_bit(IPS_ORIG_FLEXTUPLE_BIT, &ct->status);
+        case IP_CT_DIR_REPLY:
+                return test_bit(IPS_REPL_FLEXTUPLE_BIT, &ct->status);
+        default:
+                return false;
+        }
+}
+
+static inline void nf_ct_init_flextuple(const struct nf_conn *tmpl,
+                                        struct nf_conn *ct)
+{
+        if (test_bit(IPS_ORIG_FLEXTUPLE_BIT, &tmpl->status))
+                __set_bit(IPS_ORIG_FLEXTUPLE_BIT, &ct->status);
+        if (test_bit(IPS_REPL_FLEXTUPLE_BIT, &tmpl->status))
+                __set_bit(IPS_REPL_FLEXTUPLE_BIT, &ct->status);
+}
+
 /* Packet is received from loopback */
 static inline bool nf_is_loopback_packet(const struct sk_buff *skb)
 {
diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index f2f0fa3..b4fd6c6 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -52,7 +52,7 @@ bool nf_ct_invert_tuple(struct nf_conntrack_tuple *inverse,
 /* Find a connection corresponding to a tuple. */
 struct nf_conntrack_tuple_hash *
-nf_conntrack_find_get(struct net *net, u16 zone,
+nf_conntrack_find_get(struct net *net, u16 zone, u32 mark,
                       const struct nf_conntrack_tuple *tuple);
 
 int __nf_conntrack_confirm(struct sk_buff *skb);
diff --git a/include/uapi/linux/netfilter/nf_conntrack_common.h b/include/uapi/linux/netfilter/nf_conntrack_common.h
index 319f471..b242948 100644
--- a/include/uapi/linux/netfilter/nf_conntrack_common.h
+++ b/include/uapi/linux/netfilter/nf_conntrack_common.h
@@ -91,6 +91,13 @@ enum ip_conntrack_status {
         /* Conntrack got a helper explicitly attached via CT target. */
         IPS_HELPER_BIT = 13,
         IPS_HELPER = (1 << IPS_HELPER_BIT),
+
+        /* Entry is a flexible conntrack tuple match with mark */
+        IPS_ORIG_FLEXTUPLE_BIT = 14,
+        IPS_ORIG_FLEXTUPLE = (1 << IPS_ORIG_FLEXTUPLE_BIT),
+
+        IPS_REPL_FLEXTUPLE_BIT = 15,
+        IPS_REPL_FLEXTUPLE = (1 << IPS_REPL_FLEXTUPLE_BIT),
 };
 
 /* Connection tracking event types */
diff --git a/include/uapi/linux/netfilter/xt_CT.h b/include/uapi/linux/netfilter/xt_CT.h
index 5a688c1..5b3bff6 100644
--- a/include/uapi/linux/netfilter/xt_CT.h
+++ b/include/uapi/linux/netfilter/xt_CT.h
@@ -6,7 +6,12 @@
 enum {
         XT_CT_NOTRACK = 1 << 0,
         XT_CT_NOTRACK_ALIAS = 1 << 1,
-        XT_CT_MASK = XT_CT_NOTRACK | XT_CT_NOTRACK_ALIAS,
+        XT_CT_FLEX_ORIG = 1 << 2,
+        XT_CT_FLEX_REPL = 1 << 3,
+
+        /* Full option mask */
+        XT_CT_MASK = XT_CT_NOTRACK | XT_CT_NOTRACK_ALIAS |
+                     XT_CT_FLEX_ORIG | XT_CT_FLEX_REPL,
 };
 
 struct xt_ct_target_info {
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index 30ad955..13280fb 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -280,7 +280,8 @@ getorigdst(struct sock *sk, int optval, void __user *user, int *len)
                 return -EINVAL;
         }
 
-        h = nf_conntrack_find_get(sock_net(sk), NF_CT_DEFAULT_ZONE, &tuple);
+        h = nf_conntrack_find_get(sock_net(sk), NF_CT_DEFAULT_ZONE,
+                                  NF_CT_DEFAULT_MARK, &tuple);
         if (h) {
                 struct sockaddr_in sin;
                 struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
index 80d5554..4fc1e83 100644
--- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
+++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
@@ -160,7 +160,7 @@ icmp_error_message(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
         *ctinfo = IP_CT_RELATED;
 
-        h = nf_conntrack_find_get(net, zone, &innertuple);
+        h = nf_conntrack_find_get(net, zone, skb->mark, &innertuple);
         if (!h) {
                 pr_debug("icmp_error_message: no match\n");
                 return -NF_ACCEPT;
diff --git a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
index 4ba0c34..21af1a7 100644
--- a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
@@ -251,7 +251,8 @@ ipv6_getorigdst(struct sock *sk, int optval, void __user *user, int *len)
         if (*len < 0 || (unsigned int) *len < sizeof(sin6))
                 return -EINVAL;
 
-        h = nf_conntrack_find_get(sock_net(sk), NF_CT_DEFAULT_ZONE, &tuple);
+        h = nf_conntrack_find_get(sock_net(sk), NF_CT_DEFAULT_ZONE,
+                                  NF_CT_DEFAULT_MARK, &tuple);
         if (!h) {
                 pr_debug("IP6T_SO_ORIGINAL_DST: Can't find %pI6c/%u-%pI6c/%u.\n",
                          &tuple.src.u3.ip6, ntohs(tuple.src.u.tcp.port),
diff --git a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
index 90388d6..9d0a4ce 100644
--- a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
@@ -177,7 +177,7 @@ icmpv6_error_message(struct net *net, struct nf_conn *tmpl,
         *ctinfo = IP_CT_RELATED;
 
-        h = nf_conntrack_find_get(net, zone, &intuple);
+        h = nf_conntrack_find_get(net, zone, skb->mark, &intuple);
         if (!h) {
                 pr_debug("icmpv6_error: no match\n");
                 return -NF_ACCEPT;
diff --git a/net/netfilter/ipvs/ip_vs_nfct.c b/net/netfilter/ipvs/ip_vs_nfct.c
index 5882bbf..f27fc79 100644
--- a/net/netfilter/ipvs/ip_vs_nfct.c
+++ b/net/netfilter/ipvs/ip_vs_nfct.c
@@ -275,7 +275,7 @@ void ip_vs_conn_drop_conntrack(struct ip_vs_conn *cp)
                 __func__, ARG_TUPLE(&tuple), ARG_CONN(cp));
 
         h = nf_conntrack_find_get(ip_vs_conn_net(cp), NF_CT_DEFAULT_ZONE,
-                                  &tuple);
+                                  NF_CT_DEFAULT_MARK, &tuple);
         if (h) {
                 ct = nf_ct_tuplehash_to_ctrack(h);
                 /* Show what happens instead of calling nf_ct_kill() */
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 13fad86..6d15ee0 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -385,19 +385,29 @@ static void death_by_timeout(unsigned long ul_conntrack)
         nf_ct_delete((struct nf_conn *)ul_conntrack, 0, 0);
 }
 
+static inline bool nf_ct_probe_flex(const struct nf_conn *ct,
+                                    enum ip_conntrack_dir dir, u32 mark)
+{
+        return nf_ct_is_flextuple(ct, dir) ? ct->mark == mark : true;
+}
+
 static inline bool
 nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
                 const struct nf_conntrack_tuple *tuple,
-                u16 zone)
+                u16 zone, u32 mark)
 {
         struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
 
         /* A conntrack can be recreated with the equal tuple,
-         * so we need to check that the conntrack is confirmed
+         * so we need to check that the conntrack is confirmed.
+         *
+         * Probing for direction-based flex-tuple is last in
+         * order to filter out most mismatches first.
          */
         return nf_ct_tuple_equal(tuple, &h->tuple) &&
-               nf_ct_zone(ct) == zone &&
-               nf_ct_is_confirmed(ct);
+               nf_ct_zone(ct) == zone &&
+               nf_ct_is_confirmed(ct) &&
+               nf_ct_probe_flex(ct, NF_CT_DIRECTION(h), mark);
 }
 
 /*
@@ -406,7 +416,7 @@ nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
  * and recheck nf_ct_tuple_equal(tuple, &h->tuple)
  */
 static struct nf_conntrack_tuple_hash *
-____nf_conntrack_find(struct net *net, u16 zone,
+____nf_conntrack_find(struct net *net, u16 zone, u32 mark,
                       const struct nf_conntrack_tuple *tuple, u32 hash)
 {
         struct nf_conntrack_tuple_hash *h;
@@ -419,7 +429,7 @@ ____nf_conntrack_find(struct net *net, u16 zone,
         local_bh_disable();
 begin:
         hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[bucket], hnnode) {
-                if (nf_ct_key_equal(h, tuple, zone)) {
+                if (nf_ct_key_equal(h, tuple, zone, mark)) {
                         NF_CT_STAT_INC(net, found);
                         local_bh_enable();
                         return h;
@@ -442,7 +452,7 @@ begin:
 /* Find a connection corresponding to a tuple. */
 static struct nf_conntrack_tuple_hash *
-__nf_conntrack_find_get(struct net *net, u16 zone,
+__nf_conntrack_find_get(struct net *net, u16 zone, u32 mark,
                         const struct nf_conntrack_tuple *tuple, u32 hash)
 {
         struct nf_conntrack_tuple_hash *h;
@@ -450,14 +460,14 @@ __nf_conntrack_find_get(struct net *net, u16 zone,
         rcu_read_lock();
 begin:
-        h = ____nf_conntrack_find(net, zone, tuple, hash);
+        h = ____nf_conntrack_find(net, zone, mark, tuple, hash);
         if (h) {
                 ct = nf_ct_tuplehash_to_ctrack(h);
                 if (unlikely(nf_ct_is_dying(ct) ||
                              !atomic_inc_not_zero(&ct->ct_general.use)))
                         h = NULL;
                 else {
-                        if (unlikely(!nf_ct_key_equal(h, tuple, zone))) {
+                        if (unlikely(!nf_ct_key_equal(h, tuple, zone, mark))) {
                                 nf_ct_put(ct);
                                 goto begin;
                         }
@@ -469,10 +479,10 @@ begin:
 }
 
 struct nf_conntrack_tuple_hash *
-nf_conntrack_find_get(struct net *net, u16 zone,
+nf_conntrack_find_get(struct net *net, u16 zone, u32 mark,
                       const struct nf_conntrack_tuple *tuple)
 {
-        return __nf_conntrack_find_get(net, zone, tuple,
+        return __nf_conntrack_find_get(net, zone, mark, tuple,
                                        hash_conntrack_raw(tuple, zone));
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_find_get);
@@ -920,6 +930,9 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
                 nfct_synproxy_ext_add(ct);
         }
 
+        if (tmpl)
+                nf_ct_init_flextuple(tmpl, ct);
+
         timeout_ext = tmpl ? nf_ct_timeout_find(tmpl) : NULL;
         if (timeout_ext)
                 timeouts = NF_CT_TIMEOUT_EXT_DATA(timeout_ext);
@@ -1019,7 +1032,7 @@ resolve_normal_ct(struct net *net, struct nf_conn *tmpl,
         /* look for tuple match */
         hash = hash_conntrack_raw(&tuple, zone);
-        h = __nf_conntrack_find_get(net, zone, &tuple, hash);
+        h = __nf_conntrack_find_get(net, zone, skb->mark, &tuple, hash);
         if (!h) {
                 h = init_conntrack(net, tmpl, &tuple, l3proto,
                                    l4proto, skb, dataoff, hash);
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index d1c2394..03265d21 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1078,7 +1078,7 @@ ctnetlink_del_conntrack(struct sock *ctnl, struct sk_buff *skb,
         if (err < 0)
                 return err;
 
-        h = nf_conntrack_find_get(net, zone, &tuple);
+        h = nf_conntrack_find_get(net, zone, NF_CT_DEFAULT_MARK, &tuple);
         if (!h)
                 return -ENOENT;
@@ -1147,7 +1147,7 @@ ctnetlink_get_conntrack(struct sock *ctnl, struct sk_buff *skb,
         if (err < 0)
                 return err;
 
-        h = nf_conntrack_find_get(net, zone, &tuple);
+        h = nf_conntrack_find_get(net, zone, NF_CT_DEFAULT_MARK, &tuple);
         if (!h)
                 return -ENOENT;
@@ -1765,7 +1765,7 @@ ctnetlink_create_conntrack(struct net *net, u16 zone,
         if (err < 0)
                 goto err2;
 
-        master_h = nf_conntrack_find_get(net, zone, &master);
+        master_h = nf_conntrack_find_get(net, zone, NF_CT_DEFAULT_MARK, &master);
         if (master_h == NULL) {
                 err = -ENOENT;
                 goto err2;
@@ -1824,9 +1824,9 @@ ctnetlink_new_conntrack(struct sock *ctnl, struct sk_buff *skb,
         }
 
         if (cda[CTA_TUPLE_ORIG])
-                h = nf_conntrack_find_get(net, zone, &otuple);
+                h = nf_conntrack_find_get(net, zone, NF_CT_DEFAULT_MARK, &otuple);
         else if (cda[CTA_TUPLE_REPLY])
-                h =
nf_conntrack_find_get(net, zone, &rtuple); + h = nf_conntrack_find_get(net, zone, NF_CT_DEFAULT_MARK, &rtuple); if (h == NULL) { err = -ENOENT; @@ -2628,7 +2628,7 @@ static int ctnetlink_dump_exp_ct(struct sock *ctnl, struct sk_buff *skb, return err; } - h = nf_conntrack_find_get(net, zone, &tuple); + h = nf_conntrack_find_get(net, zone, NF_CT_DEFAULT_MARK, &tuple); if (!h) return -ENOENT; @@ -2960,7 +2960,7 @@ ctnetlink_create_expect(struct net *net, u16 zone, return err; /* Look for master conntrack of this expectation */ - h = nf_conntrack_find_get(net, zone, &master_tuple); + h = nf_conntrack_find_get(net, zone, NF_CT_DEFAULT_MARK, &master_tuple); if (!h) return -ENOENT; ct = nf_ct_tuplehash_to_ctrack(h); diff --git a/net/netfilter/nf_conntrack_pptp.c b/net/netfilter/nf_conntrack_pptp.c index 825c3e3..ce965af 100644 --- a/net/netfilter/nf_conntrack_pptp.c +++ b/net/netfilter/nf_conntrack_pptp.c @@ -150,7 +150,7 @@ static int destroy_sibling_or_exp(struct net *net, struct nf_conn *ct, pr_debug("trying to timeout ct or exp for tuple "); nf_ct_dump_tuple(t); - h = nf_conntrack_find_get(net, zone, t); + h = nf_conntrack_find_get(net, zone, ct->mark, t); if (h) { sibling = nf_ct_tuplehash_to_ctrack(h); pr_debug("setting timeout of conntrack %p to 0\n", sibling); diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c index 75747ae..b1d9b27 100644 --- a/net/netfilter/xt_CT.c +++ b/net/netfilter/xt_CT.c @@ -228,6 +228,11 @@ static int xt_ct_tg_check(const struct xt_tgchk_param *par, goto err3; } + if (info->flags & XT_CT_FLEX_ORIG) + set_bit(IPS_ORIG_FLEXTUPLE_BIT, &ct->status); + if (info->flags & XT_CT_FLEX_REPL) + set_bit(IPS_REPL_FLEXTUPLE_BIT, &ct->status); + nf_conntrack_tmpl_insert(par->net, ct); out: info->ct = ct; diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c index 29ba621..2fa6551 100644 --- a/net/netfilter/xt_connlimit.c +++ b/net/netfilter/xt_connlimit.c @@ -134,7 +134,7 @@ static bool add_hlist(struct hlist_head *head, static 
unsigned int check_hlist(struct net *net, struct hlist_head *head, const struct nf_conntrack_tuple *tuple, - u16 zone, + u16 zone, u32 mark, bool *addit) { const struct nf_conntrack_tuple_hash *found; @@ -148,7 +148,7 @@ static unsigned int check_hlist(struct net *net, /* check the saved connections */ hlist_for_each_entry_safe(conn, n, head, node) { - found = nf_conntrack_find_get(net, zone, &conn->tuple); + found = nf_conntrack_find_get(net, zone, mark, &conn->tuple); if (found == NULL) { hlist_del(&conn->node); kmem_cache_free(connlimit_conn_cachep, conn); @@ -201,7 +201,7 @@ static unsigned int count_tree(struct net *net, struct rb_root *root, const struct nf_conntrack_tuple *tuple, const union nf_inet_addr *addr, const union nf_inet_addr *mask, - u8 family, u16 zone) + u8 family, u16 zone, u32 mark) { struct xt_connlimit_rb *gc_nodes[CONNLIMIT_GC_MAX_NODES]; struct rb_node **rbnode, *parent; @@ -229,7 +229,8 @@ count_tree(struct net *net, struct rb_root *root, } else { /* same source network -> be counted! 
*/ unsigned int count; - count = check_hlist(net, &rbconn->hhead, tuple, zone, &addit); + count = check_hlist(net, &rbconn->hhead, tuple, zone, + mark, &addit); tree_nodes_free(root, gc_nodes, gc_count); if (!addit) @@ -245,7 +246,7 @@ count_tree(struct net *net, struct rb_root *root, continue; /* only used for GC on hhead, retval and 'addit' ignored */ - check_hlist(net, &rbconn->hhead, tuple, zone, &addit); + check_hlist(net, &rbconn->hhead, tuple, zone, mark, &addit); if (hlist_empty(&rbconn->hhead)) gc_nodes[gc_count++] = rbconn; } @@ -290,7 +291,7 @@ static int count_them(struct net *net, const struct nf_conntrack_tuple *tuple, const union nf_inet_addr *addr, const union nf_inet_addr *mask, - u_int8_t family, u16 zone) + u_int8_t family, u16 zone, u32 mark) { struct rb_root *root; int count; @@ -306,7 +307,7 @@ static int count_them(struct net *net, spin_lock_bh(&xt_connlimit_locks[hash % CONNLIMIT_LOCK_SLOTS]); - count = count_tree(net, root, tuple, addr, mask, family, zone); + count = count_tree(net, root, tuple, addr, mask, family, zone, mark); spin_unlock_bh(&xt_connlimit_locks[hash % CONNLIMIT_LOCK_SLOTS]); @@ -346,7 +347,7 @@ connlimit_mt(const struct sk_buff *skb, struct xt_action_param *par) } connections = count_them(net, info->data, tuple_ptr, &addr, - &info->mask, par->family, zone); + &info->mask, par->family, zone, skb->mark); if (connections == 0) /* kmalloc failed, drop it entirely */ goto hotdrop; diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c index 8e47251..a0385f3 100644 --- a/net/sched/act_connmark.c +++ b/net/sched/act_connmark.c @@ -71,7 +71,8 @@ static int tcf_connmark(struct sk_buff *skb, const struct tc_action *a, proto, &tuple)) goto out; - thash = nf_conntrack_find_get(dev_net(skb->dev), ca->zone, &tuple); + thash = nf_conntrack_find_get(dev_net(skb->dev), ca->zone, + skb->mark, &tuple); if (!thash) goto out;