
[nf-next] netfilter: conntrack: add support for flextuples

Message ID 776b8819c85c83088478b933a35691133055347a.1430733932.git.daniel@iogearbox.net
State Changes Requested
Delegated to: Pablo Neira

Commit Message

Daniel Borkmann May 4, 2015, 10:23 a.m. UTC
This patch adds support for doing NAT with conflicting IP address/port
tuples from multiple, isolated tenants, represented as network
namespaces and netfilter zones. For such internal VRFs, traffic is
directed to a single, shared pool of public IP addresses/ports for
the external/public VRF.

In other words, this allows NAT to be done *between* VRFs instead of
*inside* VRFs, without requiring each tenant to NAT twice or to use
its own dedicated IP address to SNAT to. As a side effect, no unique
per-tenant marker from the data center needs to be exposed to the
public.

Simplified example scheme:

  +--- VRF A ---+  +--- CT Zone 1 --------+
  | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
  +-------------+  +--+-------------------+
                      |
                   +--+--+
                   | L3  +-SNAT-[20.1.1.1:20000-40000]--eth0
                   +--+--+
                      |
  +-- VRF B ----+  +--- CT Zone 2 --------+
  | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
  +-------------+  +----------------------+

VRF A and VRF B are two tenants, e.g. each represented as a network
namespace. The connection state for each VRF is tracked separately
to implement differing policies, which results in one zone per VRF.
The operator does L3 between the VRFs using any kind of L3 routing
entity. The NAT is done in a separate zone so that global,
tenant-independent policies can be applied there.

The connection tracking table is a natural fit for this, and the VRF
context is preserved in the flow by using a mark, which offers high
flexibility and can be configured/set based on any criteria. The
ability to selectively include the mark in the tuple match for a
particular direction is what we call a flextuple.

With the help of flextuples, we can handle the conflicting IP
address/port tuples in the NAT zone; simplified example and path
traversal explanation:

  iptables -t nat -A POSTROUTING -o <dev> -j SNAT --to <IP>
  iptables -t raw -A PREROUTING -j CT --flextuple ORIGINAL

  iptables -t mangle -A PREROUTING -m conntrack --ctdir ORIGINAL -j CONNMARK --save-mark
  iptables -t mangle -A POSTROUTING -m conntrack --ctdir REPLY -j CONNMARK --restore-mark

  +--- VRF A ---+  +--- CT Zone 1 --------+
  | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
  +-------------+  +--+-------------------+
                      | v-- mark=A
                   +--+--+      +-- CT Zone 0 -(rev-mapping)-+
                   |  L3 +-SNAT-| <-- 10.1.1.1:20000, mark=A |
                   +--+--+      | <-- 10.1.1.1:20000, mark=B |
                      |         +----------------------------+
                      | ^-- mark=B
  +-- VRF B ----+  +--- CT Zone 2 --------+
  | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
  +-------------+  +----------------------+

In the outgoing direction, packets starting from VRF A pass through
the dedicated CT zone 1, where VRF A specific firewalling policies
apply, and are then handed to the L3 forwarder.

Based on the port/dev, the skb is marked with a unique tenant id and
directed to the SNAT zone, where the conntracker stores the skb->mark
in ct->mark and matches original-direction traffic on the flextuple.

A unique entry for reply traffic with the public IP/port mapping is
created. When the reverse NAT is done, ct->mark is restored into
skb->mark by the rule above, and the packet is pushed back to the L3
entity, which knows which tenant to forward the skb to with the help
of mark-based routing.
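
Putting this together with the rules above, the remaining glue could,
purely as an illustration, look roughly as follows (device names, mark
values and routing table numbers are placeholders and not part of this
patch):

  # set a per-tenant mark based on the ingress device, in the raw
  # table so that it is already set when conntrack does the lookup
  iptables -t raw -A PREROUTING -i veth-a -j MARK --set-mark 1
  iptables -t raw -A PREROUTING -i veth-b -j MARK --set-mark 2

  # shared SNAT pool towards the public VRF (TCP-only here, since a
  # port range is specified)
  iptables -t nat -A POSTROUTING -o eth0 -p tcp -j SNAT --to-source 20.1.1.1:20000-40000

  # route reply traffic back to the right tenant via the restored mark
  ip rule add fwmark 1 table 101
  ip rule add fwmark 2 table 102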

The implementation is rather straightforward; the only requirement is
generic flextuple infrastructure in the connection tracker, no changes
to NAT need to be done.

For the connection tracker, the flextuple direction is configured
through the CT jump target in the raw table, which stores the flag in
a ct template; the flag is picked up when a real connection is
created, so that the matcher can evaluate it.

For users not configuring a flextuple, there is no change in
behaviour. Moreover, using flextuples does not increase the memory
footprint of a connection tracking entry.

Joint work with Thomas Graf and Madhu Challa, also thanks to
Florian Westphal for input.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Madhu Challa <challa@noironetworks.com>
---
 include/net/netfilter/nf_conntrack.h               | 24 ++++++++++++++
 include/net/netfilter/nf_conntrack_core.h          |  2 +-
 include/uapi/linux/netfilter/nf_conntrack_common.h |  7 ++++
 include/uapi/linux/netfilter/xt_CT.h               |  7 +++-
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c     |  3 +-
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c       |  2 +-
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c     |  3 +-
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c     |  2 +-
 net/netfilter/ipvs/ip_vs_nfct.c                    |  2 +-
 net/netfilter/nf_conntrack_core.c                  | 37 +++++++++++++++-------
 net/netfilter/nf_conntrack_netlink.c               | 14 ++++----
 net/netfilter/nf_conntrack_pptp.c                  |  2 +-
 net/netfilter/xt_CT.c                              |  5 +++
 net/netfilter/xt_connlimit.c                       | 17 +++++-----
 net/sched/act_connmark.c                           |  3 +-
 15 files changed, 94 insertions(+), 36 deletions(-)

Comments

Pablo Neira Ayuso May 4, 2015, 10:34 a.m. UTC | #1
On Mon, May 04, 2015 at 12:23:41PM +0200, Daniel Borkmann wrote:
> This patch adds support for the possibility of doing NAT with
> conflicting IP address/ports tuples from multiple, isolated
> tenants, represented as network namespaces and netfilter zones.
> For such internal VRFs, traffic is directed to a single or shared
> pool of public IP address/port range for the external/public VRF.
> 
> Or in other words, this allows for doing NAT *between* VRFs
> instead of *inside* VRFs without requiring each tenant to NAT
> twice or to use its own dedicated IP address to SNAT to, also
> with the side effect to not requiring to expose a unique marker
> per tenant in the data center to the public.
> 
> Simplified example scheme:
> 
>   +--- VRF A ---+  +--- CT Zone 1 --------+
>   | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
>   +-------------+  +--+-------------------+
>                       |
>                    +--+--+
>                    | L3  +-SNAT-[20.1.1.1:20000-40000]--eth0
>                    +--+--+
>                       |
>   +-- VRF B ----+  +--- CT Zone 2 --------+
>   | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
>   +-------------+  +----------------------+

So, it's the skb->mark that survives between the containers.  I'm not
sure it makes sense to keep a zone 0 from the container that performs
SNAT. Instead, we can probably restore the zone based on the
skb->mark. The problem is that the existing zone is u16. In nftables,
Patrick already mentioned supporting casting, so we can do
something like:

        ct zone set (u16)meta mark

So you can reserve a part of the skb->mark to map it to the zone. I'm
not very convinced about this.
Daniel Borkmann May 4, 2015, 11:59 a.m. UTC | #2
Hi Pablo,

On 05/04/2015 12:34 PM, Pablo Neira Ayuso wrote:
> On Mon, May 04, 2015 at 12:23:41PM +0200, Daniel Borkmann wrote:
>> This patch adds support for the possibility of doing NAT with
>> conflicting IP address/ports tuples from multiple, isolated
>> tenants, represented as network namespaces and netfilter zones.
>> For such internal VRFs, traffic is directed to a single or shared
>> pool of public IP address/port range for the external/public VRF.
>>
>> Or in other words, this allows for doing NAT *between* VRFs
>> instead of *inside* VRFs without requiring each tenant to NAT
>> twice or to use its own dedicated IP address to SNAT to, also
>> with the side effect to not requiring to expose a unique marker
>> per tenant in the data center to the public.
>>
>> Simplified example scheme:
>>
>>    +--- VRF A ---+  +--- CT Zone 1 --------+
>>    | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
>>    +-------------+  +--+-------------------+
>>                        |
>>                     +--+--+
>>                     | L3  +-SNAT-[20.1.1.1:20000-40000]--eth0
>>                     +--+--+
>>                        |
>>    +-- VRF B ----+  +--- CT Zone 2 --------+
>>    | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
>>    +-------------+  +----------------------+
>
> So, it's the skb->mark that survives between the containers.  I'm not
> sure it makes sense to keep a zone 0 from the container that performs
> SNAT. Instead, we can probably restore the zone based on the
> skb->mark. The problem is that the existing zone is u16. In nftables,
> Patrick already mentioned about supporting casting so we can do
> something like:
>
>          ct zone set (u16)meta mark
>
> So you can reserve a part of the skb->mark to map it to the zone. I'm
> not very convinced about this.

Thanks for the feedback! I'm not sure, though, that I have fully
understood the above suggestion to the described problem yet, i.e. how
would replies to the SNAT find the correct zone again?

Our issue, simplified, basically boils down to this: given two zones
that both use IP address <A>, both zones want to talk to IP address
<B> in a third zone. To let those two with <A> talk to <B>, connections
are routed + SNATed from a non-unique to a unique address/port tuple
[which the proposed approach solves], so they can talk to <B>.

Best,
Daniel
Pablo Neira Ayuso May 4, 2015, 1:08 p.m. UTC | #3
On Mon, May 04, 2015 at 01:59:15PM +0200, Daniel Borkmann wrote:
> Hi Pablo,
> 
> On 05/04/2015 12:34 PM, Pablo Neira Ayuso wrote:
> >On Mon, May 04, 2015 at 12:23:41PM +0200, Daniel Borkmann wrote:
> >>This patch adds support for the possibility of doing NAT with
> >>conflicting IP address/ports tuples from multiple, isolated
> >>tenants, represented as network namespaces and netfilter zones.
> >>For such internal VRFs, traffic is directed to a single or shared
> >>pool of public IP address/port range for the external/public VRF.
> >>
> >>Or in other words, this allows for doing NAT *between* VRFs
> >>instead of *inside* VRFs without requiring each tenant to NAT
> >>twice or to use its own dedicated IP address to SNAT to, also
> >>with the side effect to not requiring to expose a unique marker
> >>per tenant in the data center to the public.
> >>
> >>Simplified example scheme:
> >>
> >>   +--- VRF A ---+  +--- CT Zone 1 --------+
> >>   | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
> >>   +-------------+  +--+-------------------+
> >>                       |
> >>                    +--+--+
> >>                    | L3  +-SNAT-[20.1.1.1:20000-40000]--eth0
> >>                    +--+--+
> >>                       |
> >>   +-- VRF B ----+  +--- CT Zone 2 --------+
> >>   | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
> >>   +-------------+  +----------------------+
> >
> >So, it's the skb->mark that survives between the containers.  I'm not
> >sure it makes sense to keep a zone 0 from the container that performs
> >SNAT. Instead, we can probably restore the zone based on the
> >skb->mark. The problem is that the existing zone is u16. In nftables,
> >Patrick already mentioned about supporting casting so we can do
> >something like:
> >
> >         ct zone set (u16)meta mark
> >
> >So you can reserve a part of the skb->mark to map it to the zone. I'm
> >not very convinced about this.
> 
> Thanks for the feedback! I'm not yet sure though, I understood the
> above suggestion to the described problem fully so far, i.e. how
> would replies on the SNAT find the correct zone again?

From the original direction, you can set the zone based on the mark:

        -m mark --mark 1 -j CT --zone 1

Then, from the reply direction, you can restore it:

        -m conntrack --ctzone 1 -j MARK --set-mark 1
        ...

--ctzone is not supported yet, though; it would need a new revision of
the conntrack match.

> Our issue simplified, basically boils down to: given are two zones,
> both use IP address <A>, both zones want to talk to IP address <B> in
> a third zone. To let those two with <A> talk to <B>, connections are
> being routed + SNATed from a non-unique to a unique address/port
> tuple [which the proposed approach solves], so they can talk to <B>.
Thomas Graf May 4, 2015, 1:47 p.m. UTC | #4
On 05/04/15 at 03:08pm, Pablo Neira Ayuso wrote:
> On Mon, May 04, 2015 at 01:59:15PM +0200, Daniel Borkmann wrote:
> > On 05/04/2015 12:34 PM, Pablo Neira Ayuso wrote:
> > >So, it's the skb->mark that survives between the containers.  I'm not
> > >sure it makes sense to keep a zone 0 from the container that performs
> > >SNAT. Instead, we can probably restore the zone based on the
> > >skb->mark. The problem is that the existing zone is u16. In nftables,
> > >Patrick already mentioned about supporting casting so we can do
> > >something like:
> > >
> > >         ct zone set (u16)meta mark
> > >
> > >So you can reserve a part of the skb->mark to map it to the zone. I'm
> > >not very convinced about this.
> > 
> > Thanks for the feedback! I'm not yet sure though, I understood the
> > above suggestion to the described problem fully so far, i.e. how
> > would replies on the SNAT find the correct zone again?
> 
> From the original direction, you can set the zone based on the mark:
> 
>         -m mark --mark 1 -j CT --zone 1
> 
> Then, from the reply direction, you can restore it:
> 
>         -m conntrack --ctzone 1 -j MARK --set-mark 1
>         ...
> 
> --ctzone is not supported though, it would need a new revision for the
> conntrack match.

Given that the multiple source zones which talk to a common
destination zone may have conflicting IPs, the SNAT must either
occur in the source zone where the source address is still unique
or the CT tuple must be made unique with a source zone identifier
so that the SNAT can occur in the destination zone.

Doing the SNAT in the source zone requires using a unique IP pool to
map to for each source zone, as otherwise source IPs may clash again
in the destination zone. We obviously can't do -j SNAT --to 10.1.1.1 in
two namespaces and then just route into a third namespace. This
approach is not scalable in a container environment with 100s or even
1000s of containers, each in its own network namespace.
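
To make that concrete (the addresses below are made up), per source
zone SNAT would mean something like

    netns A:  iptables -t nat -A POSTROUTING -j SNAT --to-source 172.16.1.1-172.16.1.254
    netns B:  iptables -t nat -A POSTROUTING -j SNAT --to-source 172.16.2.1-172.16.2.254

i.e. one dedicated, non-overlapping pool per namespace.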

What we want to do instead is to do the SNAT in the destination zone,
where we can have a single SNAT rule which covers all source zones.
This allows inter-namespace communication in a /31 with minimal waste
of addresses.
Daniel Borkmann May 4, 2015, 1:51 p.m. UTC | #5
On 05/04/2015 03:08 PM, Pablo Neira Ayuso wrote:
> On Mon, May 04, 2015 at 01:59:15PM +0200, Daniel Borkmann wrote:
>> Hi Pablo,
>>
>> On 05/04/2015 12:34 PM, Pablo Neira Ayuso wrote:
>>> On Mon, May 04, 2015 at 12:23:41PM +0200, Daniel Borkmann wrote:
>>>> This patch adds support for the possibility of doing NAT with
>>>> conflicting IP address/ports tuples from multiple, isolated
>>>> tenants, represented as network namespaces and netfilter zones.
>>>> For such internal VRFs, traffic is directed to a single or shared
>>>> pool of public IP address/port range for the external/public VRF.
>>>>
>>>> Or in other words, this allows for doing NAT *between* VRFs
>>>> instead of *inside* VRFs without requiring each tenant to NAT
>>>> twice or to use its own dedicated IP address to SNAT to, also
>>>> with the side effect to not requiring to expose a unique marker
>>>> per tenant in the data center to the public.
>>>>
>>>> Simplified example scheme:
>>>>
>>>>    +--- VRF A ---+  +--- CT Zone 1 --------+
>>>>    | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
>>>>    +-------------+  +--+-------------------+
>>>>                        |
>>>>                     +--+--+
>>>>                     | L3  +-SNAT-[20.1.1.1:20000-40000]--eth0
>>>>                     +--+--+
>>>>                        |
>>>>    +-- VRF B ----+  +--- CT Zone 2 --------+
>>>>    | 10.1.1.1/8  +--+ 10.1.1.1 ESTABLISHED |
>>>>    +-------------+  +----------------------+
>>>
>>> So, it's the skb->mark that survives between the containers.  I'm not
>>> sure it makes sense to keep a zone 0 from the container that performs
>>> SNAT. Instead, we can probably restore the zone based on the
>>> skb->mark. The problem is that the existing zone is u16. In nftables,
>>> Patrick already mentioned about supporting casting so we can do
>>> something like:
>>>
>>>          ct zone set (u16)meta mark
>>>
>>> So you can reserve a part of the skb->mark to map it to the zone. I'm
>>> not very convinced about this.
>>
>> Thanks for the feedback! I'm not yet sure though, I understood the
>> above suggestion to the described problem fully so far, i.e. how
>> would replies on the SNAT find the correct zone again?
>
>  From the original direction, you can set the zone based on the mark:
>
>          -m mark --mark 1 -j CT --zone 1
>
> Then, from the reply direction, you can restore it:
>
>          -m conntrack --ctzone 1 -j MARK --set-mark 1
>          ...
>
> --ctzone is not supported though, it would need a new revision for the
> conntrack match.

Ok, thanks a lot, now I see what you mean.

If I'm not missing something, I see two problems with that: the first
is that the zone match would be linear, e.g. if we support 100 or more
zones, we would need to walk through the rules linearly until we find
--mark 100, right?
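
For illustration, that would mean a rule list along these lines, one
rule per tenant/zone (and similarly for restoring the mark from
--ctzone in the reply direction):

        -m mark --mark 1 -j CT --zone 1
        -m mark --mark 2 -j CT --zone 2
        ...
        -m mark --mark 100 -j CT --zone 100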

The other issue is that in the reply direction (when the packet comes
in with the translated addr), we couldn't match on the correct zone in
the connection tracking table. The above restore rule would assume that
the match itself has already taken place and was successful, no? (That
is actually why we are direction based: --flextuple ORIGINAL|REPLY.)

>> Our issue simplified, basically boils down to: given are two zones,
>> both use IP address <A>, both zones want to talk to IP address <B> in
>> a third zone. To let those two with <A> talk to <B>, connections are
>> being routed + SNATed from a non-unique to a unique address/port
>> tuple [which the proposed approach solves], so they can talk to <B>.
Pablo Neira Ayuso May 6, 2015, 2:27 p.m. UTC | #6
On Mon, May 04, 2015 at 03:47:33PM +0200, Thomas Graf wrote:
[...] 
> Given that the multiple source zones which talk to a common
> destination zone may have conflicting IPs, the SNAT must either
> occur in the source zone where the source address is still unique
> or the CT tuple must be made unique with a source zone identifier
> so that the SNAT can occur in the destination zone.
> 
> Doing the SNAT in the source zone requires to use a unique IP pool
> to map to for each source zone as otherwise IP sources may clash again
> in the destination zone. We obviously can't do --SNAT -to 10.1.1.1 in
> two namespaces and then just route into a third namespace. This
> approach is not scalable in a container environment with 100s or even
> 1000s of containers each in its own network namespace.
> 
> What we want to do instead is to do the SNAT in the destination zone
> where we can have a single SNAT rule which overs all source zones.
> This allows inter namespace communication in a /31 with minimal waste
> of addresses.

Thanks for explaining. So you need to allocate a unique tuple using
the mark to avoid the clashes for the first packet that goes in the
original direction using the same pool. Then, the NAT engine will
allocate a unique tuple in the reply direction.

But what is the use case for -j CT --flextuple reply? By the time you
see the reply packet, the tuple has already been created.

Another question is whether it makes sense to have part of the flows
use your flextuple idea while others do not, i.e.

        -s x.y.z.w/24 -j CT --flextuple original

so shouldn't this be a global switch that includes the skb->mark
only for packets coming in the original direction?

I also wonder how you're going to deal with port redirections. This
only seems to work for SNAT/masquerade to me if the NAT happens from
the VRF side.
Daniel Borkmann May 6, 2015, 6 p.m. UTC | #7
Hi Pablo,

On 05/06/2015 04:27 PM, Pablo Neira Ayuso wrote:
> On Mon, May 04, 2015 at 03:47:33PM +0200, Thomas Graf wrote:
> [...]
>> Given that the multiple source zones which talk to a common
>> destination zone may have conflicting IPs, the SNAT must either
>> occur in the source zone where the source address is still unique
>> or the CT tuple must be made unique with a source zone identifier
>> so that the SNAT can occur in the destination zone.
>>
>> Doing the SNAT in the source zone requires to use a unique IP pool
>> to map to for each source zone as otherwise IP sources may clash again
>> in the destination zone. We obviously can't do --SNAT -to 10.1.1.1 in
>> two namespaces and then just route into a third namespace. This
>> approach is not scalable in a container environment with 100s or even
>> 1000s of containers each in its own network namespace.
>>
>> What we want to do instead is to do the SNAT in the destination zone
>> where we can have a single SNAT rule which overs all source zones.
>> This allows inter namespace communication in a /31 with minimal waste
>> of addresses.
>
> Thanks for explaining. So you need to allocate an unique tuple using
> the mark to avoid the clashes for the first packet that goes original
> using the same pool. Then, the NAT engine will allocate an unique
> tuple in the reply direction.

Yes, that's correct. In the original direction, due to the overlapping
tuple, the ct->mark is considered for the match as well, and in the
reply direction SNAT already chooses a unique tuple. That's essentially
the rationale for our use case with SNAT.

> But what is the use case for -j CT --flextuple reply ? By when you see
> the reply packet the tuple was already created.

Since this change is completely NAT-agnostic, we can keep it as a
generic addition to the conntracker. Given that the mark is very
flexible, I think it could also be used for load balancing as a
different use case.

> Another question is if it makes sense to have part of the flows using
> your flextuple idea while some others not, ie.
>
>          -s x.y.z.w/24 -j CT --flextuple original
>
> so shouldn't this be a global switch that includes the skb->mark
> only for packets coming in the original direction?

I first thought about a global sysctl switch, but eventually found
this configuration possibility from the iptables side much cleaner and
better integrated. I think that if the environment is correctly
configured for it, such a partial flextuple scenario works, too.

> I also wonder how you're going to deal with port redirections. This
> only seem to be working SNAT/masquerade to me if the NAT happens from
> VRF side.

In our case, we'd like to use the flextuple when we're explicitly
configuring iptables with SNAT. For DNAT, one could reuse it in a
different, somewhat reversed version of the example we had previously,
together with mark-based routing and a match on the reply side.

Thanks a lot,
Daniel
Pablo Neira Ayuso May 6, 2015, 6:50 p.m. UTC | #8
Hi Daniel,

On Wed, May 06, 2015 at 08:00:42PM +0200, Daniel Borkmann wrote:
> On 05/06/2015 04:27 PM, Pablo Neira Ayuso wrote:
[...]
> >But what is the use case for -j CT --flextuple reply ? By when you see
> >the reply packet the tuple was already created.
> 
> Given this change is completely NAT agnostic, we can keep it as a
> generic addition to the conntracker. Given that the mark is very
> flexible, I think it could also be used for load balancing as a
> different usage.

The original and reply tuples of the conntrack are set by the first
packet going in the original direction; at that time they are both
inserted into the hashes, and in this scenario they will use the
default mark (0). So if you set --flextuple reply, conntrack will
include the mark as part of the hash tuple for the first reply packet,
but that packet will not match the conntrack object that was created
by the first original packet, and it will result in a new conntrack
that assumes the reply is the original direction. This looks broken to
me.

> >Another question is if it makes sense to have part of the flows using
> >your flextuple idea while some others not, ie.
> >
> >         -s x.y.z.w/24 -j CT --flextuple original
> >
> >so shouldn't this be a global switch that includes the skb->mark
> >only for packets coming in the original direction?
> 
> I first thought about a global sysctl switch, but eventually found
> this config possibility from iptables side much cleaner resp. better
> integrated. I think if the environment is correctly configured for
> that, such a partial flextuple scenario works, too.

This consumes two ct status bits; these are exposed to userspace, and
we have a limited number of bits there. The one in the original
direction might be justified for the SNAT case in the specific
scenario that you show.

I don't see yet how this can make sense in a hybrid scenario. We may
end up with a packet that can potentially create and match two
different flow objects if this is misconfigured.

> >I also wonder how you're going to deal with port redirections. This
> >only seem to be working SNAT/masquerade to me if the NAT happens from
> >VRF side.
> 
> In our case, we'd like to use the flextuple when we're explicitly
> configuring iptables with SNAT. For DNAT, one could reuse it in a
> different, somewhat reversed example we previously had and together
> with mark based routing and match on the reply side.

OK, so different VRFs with overlapping networks that are redirected to
the port of another destination. That might make sense. Still, the
hybrid scenario and --flextuple reply need some thinking.

Thanks.
Daniel Borkmann May 7, 2015, 12:01 p.m. UTC | #9
Hi Pablo,

On 05/06/2015 08:50 PM, Pablo Neira Ayuso wrote:
> On Wed, May 06, 2015 at 08:00:42PM +0200, Daniel Borkmann wrote:
>> On 05/06/2015 04:27 PM, Pablo Neira Ayuso wrote:
> [...]

Thanks for your feedback!
...
>>> Another question is if it makes sense to have part of the flows using
>>> your flextuple idea while some others not, ie.
>>>
>>>          -s x.y.z.w/24 -j CT --flextuple original
>>>
>>> so shouldn't this be a global switch that includes the skb->mark
>>> only for packets coming in the original direction?
>>
>> I first thought about a global sysctl switch, but eventually found
>> this config possibility from iptables side much cleaner resp. better
>> integrated. I think if the environment is correctly configured for
>> that, such a partial flextuple scenario works, too.
>
> This is consuming two ct status bits, these are exposed to userspace,
> and we have a limited number of bits there. The one in the original
> direction might be justified for the SNAT case in the specific
> scenario that you show.

Okay, agreed. I will respin the set with only the --flextuple ORIGINAL
direction allowed, where we'd for now only consume a single status bit.
If later on there's a need to extend this to REPLY (or even hybrid), we
still have the option to extend it.

Thanks,
Daniel
Pablo Neira Ayuso May 7, 2015, 6:10 p.m. UTC | #10
Hi Daniel,

On Thu, May 07, 2015 at 02:01:11PM +0200, Daniel Borkmann wrote:
> ...
> >>>Another question is if it makes sense to have part of the flows using
> >>>your flextuple idea while some others not, ie.
> >>>
> >>>         -s x.y.z.w/24 -j CT --flextuple original
> >>>
> >>>so shouldn't this be a global switch that includes the skb->mark
> >>>only for packets coming in the original direction?
> >>
> >>I first thought about a global sysctl switch, but eventually found
> >>this config possibility from iptables side much cleaner resp. better
> >>integrated. I think if the environment is correctly configured for
> >>that, such a partial flextuple scenario works, too.
> >
> >This is consuming two ct status bits, these are exposed to userspace,
> >and we have a limited number of bits there. The one in the original
> >direction might be justified for the SNAT case in the specific
> >scenario that you show.
> 
> Okay, agreed. I will respin the set with --flextuple ORIGINAL direction
> allowed where we'd for now only consume a single status bit. If later
> on there's a need to extend this for REPLY (or even hybrid), we still
> have the option to extend it.

I would like to know if it makes sense to add this later on. Could you
elaborate on a DNAT scenario where this would be useful?

Thanks.
Daniel Borkmann May 8, 2015, 9:45 a.m. UTC | #11
Hi Pablo,

On 05/07/2015 08:10 PM, Pablo Neira Ayuso wrote:
> On Thu, May 07, 2015 at 02:01:11PM +0200, Daniel Borkmann wrote:
>> ...
>>>>> Another question is if it makes sense to have part of the flows using
>>>>> your flextuple idea while some others not, ie.
>>>>>
>>>>>          -s x.y.z.w/24 -j CT --flextuple original
>>>>>
>>>>> so shouldn't this be a global switch that includes the skb->mark
>>>>> only for packets coming in the original direction?
>>>>
>>>> I first thought about a global sysctl switch, but eventually found
>>>> this config possibility from iptables side much cleaner resp. better
>>>> integrated. I think if the environment is correctly configured for
>>>> that, such a partial flextuple scenario works, too.
>>>
>>> This is consuming two ct status bits, these are exposed to userspace,
>>> and we have a limited number of bits there. The one in the original
>>> direction might be justified for the SNAT case in the specific
>>> scenario that you show.
>>
>> Okay, agreed. I will respin the set with --flextuple ORIGINAL direction
>> allowed where we'd for now only consume a single status bit. If later
>> on there's a need to extend this for REPLY (or even hybrid), we still
>> have the option to extend it.
>
> I would like to know if it makes sense to add this later on. Would you
> elaborate a useful DNAT scenario where this can be useful?

What comes to mind for hybrid usage in firewalling is that flextuple
in both directions would act similarly to zones; for example, you
could map things like a tunnel id into the u32 space and include that
in the matcher without many additional rules or memory overhead.

For the reply-only case, I was thinking of a scenario where you'd have
multiple containers behind the DNAT, all with the same ip/port on which
a server listens, and you'd select one of the containers e.g. via the
xt_statistic module as a mark and do mark-based routing behind the
DNAT; but that still has a unique source in front of the DNAT, so it
wouldn't need the mark included in the reply case. So for reply-only, I
currently don't find an intuitive use case.

Best,
Daniel

Patch

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 095433b..6d67ab4 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -51,6 +51,8 @@  union nf_conntrack_expect_proto {
 
 struct nf_conntrack_helper;
 
+#define NF_CT_DEFAULT_MARK		0
+
 /* Must be kept in sync with the classes defined by helpers */
 #define NF_CT_MAX_EXPECT_CLASSES	4
 
@@ -277,6 +279,28 @@  static inline int nf_ct_is_untracked(const struct nf_conn *ct)
 	return test_bit(IPS_UNTRACKED_BIT, &ct->status);
 }
 
+static inline bool nf_ct_is_flextuple(const struct nf_conn *ct,
+				      const enum ip_conntrack_dir dir)
+{
+	switch (dir) {
+	case IP_CT_DIR_ORIGINAL:
+		return test_bit(IPS_ORIG_FLEXTUPLE_BIT, &ct->status);
+	case IP_CT_DIR_REPLY:
+		return test_bit(IPS_REPL_FLEXTUPLE_BIT, &ct->status);
+	default:
+		return false;
+	}
+}
+
+static inline void nf_ct_init_flextuple(const struct nf_conn *tmpl,
+					struct nf_conn *ct)
+{
+	if (test_bit(IPS_ORIG_FLEXTUPLE_BIT, &tmpl->status))
+		__set_bit(IPS_ORIG_FLEXTUPLE_BIT, &ct->status);
+	if (test_bit(IPS_REPL_FLEXTUPLE_BIT, &tmpl->status))
+		__set_bit(IPS_REPL_FLEXTUPLE_BIT, &ct->status);
+}
+
 /* Packet is received from loopback */
 static inline bool nf_is_loopback_packet(const struct sk_buff *skb)
 {
diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index f2f0fa3..b4fd6c6 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -52,7 +52,7 @@  bool nf_ct_invert_tuple(struct nf_conntrack_tuple *inverse,
 
 /* Find a connection corresponding to a tuple. */
 struct nf_conntrack_tuple_hash *
-nf_conntrack_find_get(struct net *net, u16 zone,
+nf_conntrack_find_get(struct net *net, u16 zone, u32 mark,
 		      const struct nf_conntrack_tuple *tuple);
 
 int __nf_conntrack_confirm(struct sk_buff *skb);
diff --git a/include/uapi/linux/netfilter/nf_conntrack_common.h b/include/uapi/linux/netfilter/nf_conntrack_common.h
index 319f471..b242948 100644
--- a/include/uapi/linux/netfilter/nf_conntrack_common.h
+++ b/include/uapi/linux/netfilter/nf_conntrack_common.h
@@ -91,6 +91,13 @@  enum ip_conntrack_status {
 	/* Conntrack got a helper explicitly attached via CT target. */
 	IPS_HELPER_BIT = 13,
 	IPS_HELPER = (1 << IPS_HELPER_BIT),
+
+	/* Entry is a flexible conntrack tuple match with mark */
+	IPS_ORIG_FLEXTUPLE_BIT = 14,
+	IPS_ORIG_FLEXTUPLE = (1 << IPS_ORIG_FLEXTUPLE_BIT),
+
+	IPS_REPL_FLEXTUPLE_BIT = 15,
+	IPS_REPL_FLEXTUPLE = (1 << IPS_REPL_FLEXTUPLE_BIT),
 };
 
 /* Connection tracking event types */
diff --git a/include/uapi/linux/netfilter/xt_CT.h b/include/uapi/linux/netfilter/xt_CT.h
index 5a688c1..5b3bff6 100644
--- a/include/uapi/linux/netfilter/xt_CT.h
+++ b/include/uapi/linux/netfilter/xt_CT.h
@@ -6,7 +6,12 @@ 
 enum {
 	XT_CT_NOTRACK		= 1 << 0,
 	XT_CT_NOTRACK_ALIAS	= 1 << 1,
-	XT_CT_MASK		= XT_CT_NOTRACK | XT_CT_NOTRACK_ALIAS,
+	XT_CT_FLEX_ORIG		= 1 << 2,
+	XT_CT_FLEX_REPL		= 1 << 3,
+
+	/* Full option mask */
+	XT_CT_MASK		= XT_CT_NOTRACK | XT_CT_NOTRACK_ALIAS |
+				  XT_CT_FLEX_ORIG | XT_CT_FLEX_REPL,
 };
 
 struct xt_ct_target_info {
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index 30ad955..13280fb 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -280,7 +280,8 @@  getorigdst(struct sock *sk, int optval, void __user *user, int *len)
 		return -EINVAL;
 	}
 
-	h = nf_conntrack_find_get(sock_net(sk), NF_CT_DEFAULT_ZONE, &tuple);
+	h = nf_conntrack_find_get(sock_net(sk), NF_CT_DEFAULT_ZONE,
+				  NF_CT_DEFAULT_MARK, &tuple);
 	if (h) {
 		struct sockaddr_in sin;
 		struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
index 80d5554..4fc1e83 100644
--- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
+++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
@@ -160,7 +160,7 @@  icmp_error_message(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
 
 	*ctinfo = IP_CT_RELATED;
 
-	h = nf_conntrack_find_get(net, zone, &innertuple);
+	h = nf_conntrack_find_get(net, zone, skb->mark, &innertuple);
 	if (!h) {
 		pr_debug("icmp_error_message: no match\n");
 		return -NF_ACCEPT;
diff --git a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
index 4ba0c34..21af1a7 100644
--- a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
@@ -251,7 +251,8 @@  ipv6_getorigdst(struct sock *sk, int optval, void __user *user, int *len)
 	if (*len < 0 || (unsigned int) *len < sizeof(sin6))
 		return -EINVAL;
 
-	h = nf_conntrack_find_get(sock_net(sk), NF_CT_DEFAULT_ZONE, &tuple);
+	h = nf_conntrack_find_get(sock_net(sk), NF_CT_DEFAULT_ZONE,
+				  NF_CT_DEFAULT_MARK, &tuple);
 	if (!h) {
 		pr_debug("IP6T_SO_ORIGINAL_DST: Can't find %pI6c/%u-%pI6c/%u.\n",
 			 &tuple.src.u3.ip6, ntohs(tuple.src.u.tcp.port),
diff --git a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
index 90388d6..9d0a4ce 100644
--- a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
@@ -177,7 +177,7 @@  icmpv6_error_message(struct net *net, struct nf_conn *tmpl,
 
 	*ctinfo = IP_CT_RELATED;
 
-	h = nf_conntrack_find_get(net, zone, &intuple);
+	h = nf_conntrack_find_get(net, zone, skb->mark, &intuple);
 	if (!h) {
 		pr_debug("icmpv6_error: no match\n");
 		return -NF_ACCEPT;
diff --git a/net/netfilter/ipvs/ip_vs_nfct.c b/net/netfilter/ipvs/ip_vs_nfct.c
index 5882bbf..f27fc79 100644
--- a/net/netfilter/ipvs/ip_vs_nfct.c
+++ b/net/netfilter/ipvs/ip_vs_nfct.c
@@ -275,7 +275,7 @@  void ip_vs_conn_drop_conntrack(struct ip_vs_conn *cp)
 		__func__, ARG_TUPLE(&tuple), ARG_CONN(cp));
 
 	h = nf_conntrack_find_get(ip_vs_conn_net(cp), NF_CT_DEFAULT_ZONE,
-				  &tuple);
+				  NF_CT_DEFAULT_MARK, &tuple);
 	if (h) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
 		/* Show what happens instead of calling nf_ct_kill() */
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 13fad86..6d15ee0 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -385,19 +385,29 @@  static void death_by_timeout(unsigned long ul_conntrack)
 	nf_ct_delete((struct nf_conn *)ul_conntrack, 0, 0);
 }
 
+static inline bool nf_ct_probe_flex(const struct nf_conn *ct,
+				    enum ip_conntrack_dir dir, u32 mark)
+{
+	return nf_ct_is_flextuple(ct, dir) ? ct->mark == mark : true;
+}
+
 static inline bool
 nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
 			const struct nf_conntrack_tuple *tuple,
-			u16 zone)
+			u16 zone, u32 mark)
 {
 	struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
 
 	/* A conntrack can be recreated with the equal tuple,
-	 * so we need to check that the conntrack is confirmed
+	 * so we need to check that the conntrack is confirmed.
+	 *
+	 * Probing for direction-based flex-tuple is last in
+	 * order to filter out most mismatches first.
 	 */
 	return nf_ct_tuple_equal(tuple, &h->tuple) &&
-		nf_ct_zone(ct) == zone &&
-		nf_ct_is_confirmed(ct);
+	       nf_ct_zone(ct) == zone &&
+	       nf_ct_is_confirmed(ct) &&
+	       nf_ct_probe_flex(ct, NF_CT_DIRECTION(h), mark);
 }
 
 /*
@@ -406,7 +416,7 @@  nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
  *   and recheck nf_ct_tuple_equal(tuple, &h->tuple)
  */
 static struct nf_conntrack_tuple_hash *
-____nf_conntrack_find(struct net *net, u16 zone,
+____nf_conntrack_find(struct net *net, u16 zone, u32 mark,
 		      const struct nf_conntrack_tuple *tuple, u32 hash)
 {
 	struct nf_conntrack_tuple_hash *h;
@@ -419,7 +429,7 @@  ____nf_conntrack_find(struct net *net, u16 zone,
 	local_bh_disable();
 begin:
 	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[bucket], hnnode) {
-		if (nf_ct_key_equal(h, tuple, zone)) {
+		if (nf_ct_key_equal(h, tuple, zone, mark)) {
 			NF_CT_STAT_INC(net, found);
 			local_bh_enable();
 			return h;
@@ -442,7 +452,7 @@  begin:
 
 /* Find a connection corresponding to a tuple. */
 static struct nf_conntrack_tuple_hash *
-__nf_conntrack_find_get(struct net *net, u16 zone,
+__nf_conntrack_find_get(struct net *net, u16 zone, u32 mark,
 			const struct nf_conntrack_tuple *tuple, u32 hash)
 {
 	struct nf_conntrack_tuple_hash *h;
@@ -450,14 +460,14 @@  __nf_conntrack_find_get(struct net *net, u16 zone,
 
 	rcu_read_lock();
 begin:
-	h = ____nf_conntrack_find(net, zone, tuple, hash);
+	h = ____nf_conntrack_find(net, zone, mark, tuple, hash);
 	if (h) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
 		if (unlikely(nf_ct_is_dying(ct) ||
 			     !atomic_inc_not_zero(&ct->ct_general.use)))
 			h = NULL;
 		else {
-			if (unlikely(!nf_ct_key_equal(h, tuple, zone))) {
+			if (unlikely(!nf_ct_key_equal(h, tuple, zone, mark))) {
 				nf_ct_put(ct);
 				goto begin;
 			}
@@ -469,10 +479,10 @@  begin:
 }
 
 struct nf_conntrack_tuple_hash *
-nf_conntrack_find_get(struct net *net, u16 zone,
+nf_conntrack_find_get(struct net *net, u16 zone, u32 mark,
 		      const struct nf_conntrack_tuple *tuple)
 {
-	return __nf_conntrack_find_get(net, zone, tuple,
+	return __nf_conntrack_find_get(net, zone, mark, tuple,
 				       hash_conntrack_raw(tuple, zone));
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_find_get);
@@ -920,6 +930,9 @@  init_conntrack(struct net *net, struct nf_conn *tmpl,
 		nfct_synproxy_ext_add(ct);
 	}
 
+	if (tmpl)
+		nf_ct_init_flextuple(tmpl, ct);
+
 	timeout_ext = tmpl ? nf_ct_timeout_find(tmpl) : NULL;
 	if (timeout_ext)
 		timeouts = NF_CT_TIMEOUT_EXT_DATA(timeout_ext);
@@ -1019,7 +1032,7 @@  resolve_normal_ct(struct net *net, struct nf_conn *tmpl,
 
 	/* look for tuple match */
 	hash = hash_conntrack_raw(&tuple, zone);
-	h = __nf_conntrack_find_get(net, zone, &tuple, hash);
+	h = __nf_conntrack_find_get(net, zone, skb->mark, &tuple, hash);
 	if (!h) {
 		h = init_conntrack(net, tmpl, &tuple, l3proto, l4proto,
 				   skb, dataoff, hash);
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index d1c2394..03265d21 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1078,7 +1078,7 @@  ctnetlink_del_conntrack(struct sock *ctnl, struct sk_buff *skb,
 	if (err < 0)
 		return err;
 
-	h = nf_conntrack_find_get(net, zone, &tuple);
+	h = nf_conntrack_find_get(net, zone, NF_CT_DEFAULT_MARK, &tuple);
 	if (!h)
 		return -ENOENT;
 
@@ -1147,7 +1147,7 @@  ctnetlink_get_conntrack(struct sock *ctnl, struct sk_buff *skb,
 	if (err < 0)
 		return err;
 
-	h = nf_conntrack_find_get(net, zone, &tuple);
+	h = nf_conntrack_find_get(net, zone, NF_CT_DEFAULT_MARK, &tuple);
 	if (!h)
 		return -ENOENT;
 
@@ -1765,7 +1765,7 @@  ctnetlink_create_conntrack(struct net *net, u16 zone,
 		if (err < 0)
 			goto err2;
 
-		master_h = nf_conntrack_find_get(net, zone, &master);
+		master_h = nf_conntrack_find_get(net, zone, NF_CT_DEFAULT_MARK, &master);
 		if (master_h == NULL) {
 			err = -ENOENT;
 			goto err2;
@@ -1824,9 +1824,9 @@  ctnetlink_new_conntrack(struct sock *ctnl, struct sk_buff *skb,
 	}
 
 	if (cda[CTA_TUPLE_ORIG])
-		h = nf_conntrack_find_get(net, zone, &otuple);
+		h = nf_conntrack_find_get(net, zone, NF_CT_DEFAULT_MARK, &otuple);
 	else if (cda[CTA_TUPLE_REPLY])
-		h = nf_conntrack_find_get(net, zone, &rtuple);
+		h = nf_conntrack_find_get(net, zone, NF_CT_DEFAULT_MARK, &rtuple);
 
 	if (h == NULL) {
 		err = -ENOENT;
@@ -2628,7 +2628,7 @@  static int ctnetlink_dump_exp_ct(struct sock *ctnl, struct sk_buff *skb,
 			return err;
 	}
 
-	h = nf_conntrack_find_get(net, zone, &tuple);
+	h = nf_conntrack_find_get(net, zone, NF_CT_DEFAULT_MARK, &tuple);
 	if (!h)
 		return -ENOENT;
 
@@ -2960,7 +2960,7 @@  ctnetlink_create_expect(struct net *net, u16 zone,
 		return err;
 
 	/* Look for master conntrack of this expectation */
-	h = nf_conntrack_find_get(net, zone, &master_tuple);
+	h = nf_conntrack_find_get(net, zone, NF_CT_DEFAULT_MARK, &master_tuple);
 	if (!h)
 		return -ENOENT;
 	ct = nf_ct_tuplehash_to_ctrack(h);
diff --git a/net/netfilter/nf_conntrack_pptp.c b/net/netfilter/nf_conntrack_pptp.c
index 825c3e3..ce965af 100644
--- a/net/netfilter/nf_conntrack_pptp.c
+++ b/net/netfilter/nf_conntrack_pptp.c
@@ -150,7 +150,7 @@  static int destroy_sibling_or_exp(struct net *net, struct nf_conn *ct,
 	pr_debug("trying to timeout ct or exp for tuple ");
 	nf_ct_dump_tuple(t);
 
-	h = nf_conntrack_find_get(net, zone, t);
+	h = nf_conntrack_find_get(net, zone, ct->mark, t);
 	if (h)  {
 		sibling = nf_ct_tuplehash_to_ctrack(h);
 		pr_debug("setting timeout of conntrack %p to 0\n", sibling);
diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c
index 75747ae..b1d9b27 100644
--- a/net/netfilter/xt_CT.c
+++ b/net/netfilter/xt_CT.c
@@ -228,6 +228,11 @@  static int xt_ct_tg_check(const struct xt_tgchk_param *par,
 			goto err3;
 	}
 
+	if (info->flags & XT_CT_FLEX_ORIG)
+		set_bit(IPS_ORIG_FLEXTUPLE_BIT, &ct->status);
+	if (info->flags & XT_CT_FLEX_REPL)
+		set_bit(IPS_REPL_FLEXTUPLE_BIT, &ct->status);
+
 	nf_conntrack_tmpl_insert(par->net, ct);
 out:
 	info->ct = ct;
diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index 29ba621..2fa6551 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -134,7 +134,7 @@  static bool add_hlist(struct hlist_head *head,
 static unsigned int check_hlist(struct net *net,
 				struct hlist_head *head,
 				const struct nf_conntrack_tuple *tuple,
-				u16 zone,
+				u16 zone, u32 mark,
 				bool *addit)
 {
 	const struct nf_conntrack_tuple_hash *found;
@@ -148,7 +148,7 @@  static unsigned int check_hlist(struct net *net,
 
 	/* check the saved connections */
 	hlist_for_each_entry_safe(conn, n, head, node) {
-		found = nf_conntrack_find_get(net, zone, &conn->tuple);
+		found = nf_conntrack_find_get(net, zone, mark, &conn->tuple);
 		if (found == NULL) {
 			hlist_del(&conn->node);
 			kmem_cache_free(connlimit_conn_cachep, conn);
@@ -201,7 +201,7 @@  static unsigned int
 count_tree(struct net *net, struct rb_root *root,
 	   const struct nf_conntrack_tuple *tuple,
 	   const union nf_inet_addr *addr, const union nf_inet_addr *mask,
-	   u8 family, u16 zone)
+	   u8 family, u16 zone, u32 mark)
 {
 	struct xt_connlimit_rb *gc_nodes[CONNLIMIT_GC_MAX_NODES];
 	struct rb_node **rbnode, *parent;
@@ -229,7 +229,8 @@  count_tree(struct net *net, struct rb_root *root,
 		} else {
 			/* same source network -> be counted! */
 			unsigned int count;
-			count = check_hlist(net, &rbconn->hhead, tuple, zone, &addit);
+			count = check_hlist(net, &rbconn->hhead, tuple, zone,
+					    mark, &addit);
 
 			tree_nodes_free(root, gc_nodes, gc_count);
 			if (!addit)
@@ -245,7 +246,7 @@  count_tree(struct net *net, struct rb_root *root,
 			continue;
 
 		/* only used for GC on hhead, retval and 'addit' ignored */
-		check_hlist(net, &rbconn->hhead, tuple, zone, &addit);
+		check_hlist(net, &rbconn->hhead, tuple, zone, mark, &addit);
 		if (hlist_empty(&rbconn->hhead))
 			gc_nodes[gc_count++] = rbconn;
 	}
@@ -290,7 +291,7 @@  static int count_them(struct net *net,
 		      const struct nf_conntrack_tuple *tuple,
 		      const union nf_inet_addr *addr,
 		      const union nf_inet_addr *mask,
-		      u_int8_t family, u16 zone)
+		      u_int8_t family, u16 zone, u32 mark)
 {
 	struct rb_root *root;
 	int count;
@@ -306,7 +307,7 @@  static int count_them(struct net *net,
 
 	spin_lock_bh(&xt_connlimit_locks[hash % CONNLIMIT_LOCK_SLOTS]);
 
-	count = count_tree(net, root, tuple, addr, mask, family, zone);
+	count = count_tree(net, root, tuple, addr, mask, family, zone, mark);
 
 	spin_unlock_bh(&xt_connlimit_locks[hash % CONNLIMIT_LOCK_SLOTS]);
 
@@ -346,7 +347,7 @@  connlimit_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	}
 
 	connections = count_them(net, info->data, tuple_ptr, &addr,
-	                         &info->mask, par->family, zone);
+	                         &info->mask, par->family, zone, skb->mark);
 	if (connections == 0)
 		/* kmalloc failed, drop it entirely */
 		goto hotdrop;
diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c
index 8e47251..a0385f3 100644
--- a/net/sched/act_connmark.c
+++ b/net/sched/act_connmark.c
@@ -71,7 +71,8 @@  static int tcf_connmark(struct sk_buff *skb, const struct tc_action *a,
 			       proto, &tuple))
 		goto out;
 
-	thash = nf_conntrack_find_get(dev_net(skb->dev), ca->zone, &tuple);
+	thash = nf_conntrack_find_get(dev_net(skb->dev), ca->zone,
+				      skb->mark, &tuple);
 	if (!thash)
 		goto out;