
netfilter: Kill unreplied conntracks by ICMP errors

Message ID: 1386861575-121885-1-git-send-email-xiaosuo@gmail.com
State: Not Applicable

Commit Message

Changli Gao Dec. 12, 2013, 3:19 p.m. UTC
Think about the following scenario:

+--------+      +-------+      +----------+
| Server +------+ NAT 1 +------| Client 1 |
+---+----+      +-------+      +----------+
    |
    |           +-------+      +----------+
    +-----------+ NAT 2 +------| Client 2 |
                +-------+      +----------+

The following UDP hole punching steps are used to establish a direct session
between Client 1 and Client 2 with the help of Server.

1. Client 1 sends a UDP packet to Server, and Server learns the public IP and
   port of Client 1.
2. Client 2 sends a UDP packet to Server, and Server learns the public IP and
   port of Client 2.
3. Server tells Client 1 the public IP and port of Client 2.
4. Server tells Client 2 the public IP and port of Client 1.
5. Client 1 sends UDP packets to the public IP and port of Client 2.
6. Client 2 sends UDP packets to the public IP and port of Client 1.

If both NAT 1 and NAT 2 are Cone NAT, Client 1 and Client 2 can communicate
with each other directly.
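
For illustration, Client 1's side of steps 1, 3 and 5 can be sketched in
userspace roughly as below. The address, port and the toy wire format (Server
returning the peer's sockaddr_in verbatim) are assumptions made only for this
sketch, and error handling is omitted; the point is that the same socket, and
hence the same NAT mapping, is used towards both Server and the peer.

/* Hypothetical sketch of Client 1 (steps 1, 3 and 5 above). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    /* Step 1: register with the rendezvous Server (placeholder address). */
    struct sockaddr_in server = { .sin_family = AF_INET,
                                  .sin_port = htons(9000) };
    inet_pton(AF_INET, "198.51.100.1", &server.sin_addr);
    sendto(fd, "hi", 2, 0, (struct sockaddr *)&server, sizeof(server));

    /* Step 3: Server returns Client 2's public endpoint (assumed here to be
     * a raw struct sockaddr_in, purely for the purpose of this sketch). */
    struct sockaddr_in peer;
    recvfrom(fd, &peer, sizeof(peer), 0, NULL, NULL);

    /* Step 5: punch the hole by sending to Client 2's public IP and port
     * from the same socket, reusing the mapping created in step 1. */
    sendto(fd, "punch", 5, 0, (struct sockaddr *)&peer, sizeof(peer));

    close(fd);
    return 0;
}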

Linux tries its best to be a Port Restricted NAT. But there is a race condition
between 5 and 6.

Suppose the packet from Client 1 to the public IP and port of Client 2 reaches
NAT 2 before the packet from Client 2 to the public IP and port of Client 1.
Since there is no corresponding conntrack in NAT 2 yet, the packet is treated
as a new session destined to NAT 2 itself, and that port is most likely not
open on NAT 2, so in the end a Port Unreachable ICMP packet is delivered to
Client 1.

Then the packet from Client 2 to the public IP and port of Client 1 reaches
NAT 2, and NAT 2 cannot use the same public IP and port it used towards Server
as the source, because the corresponding tuple is already in use, so NAT 2 has
to allocate a new pair of IP and port.

The simplest solution is to kill unreplied conntracks when ICMP errors are
received for them.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Oliver Smith Dec. 15, 2013, 6:57 a.m. UTC | #1
On Thursday 12 December 2013 10:19:34 Changli Gao wrote:
> Think about the following scenario:
> 
> +--------+      +-------+      +----------+
> | Server +------+ NAT 1 +------| Client 1 |
> +---+----+      +-------+      +----------+
>     |
>     |           +-------+      +----------+
>     +-----------+ NAT 2 +------| Client 2 |
>                 +-------+      +----------+
> 
> The following UDP hole punching steps are used to establish a direct session
> between Client 1 and Client 2 with the help of Server.
> 
> 1. Client 1 sends a UDP packet to Server, and Server learns the public IP
> and port of Client 1.
> 2. Client 2 sends a UDP packet to Server, and Server learns the public IP
> and port of Client 2.
> 3. Server tells Client 1 the public IP and port of Client 2.
> 4. Server tells Client 2 the public IP and port of Client 1.
> 5. Client 1 sends UDP packets to the public IP and port of Client 2.
> 6. Client 2 sends UDP packets to the public IP and port of Client 1.
> 
> If both NAT 1 and NAT 2 are Cone NAT, Client 1 and Client 2 can communicate
> with each other directly.
> 
> Linux tries its best to be a Port Restricted NAT. But there is a race
> condition between 5 and 6.
> 
> Suppose the packet from Client 1 to the public IP and port of Client 2
> reaches NAT 2 before the packet from Client 2 to the public IP and port of
> Client 1. Since there is no corresponding conntrack in NAT 2 yet, the packet
> is treated as a new session destined to NAT 2 itself, and that port is most
> likely not open on NAT 2, so in the end a Port Unreachable ICMP packet is
> delivered to Client 1.

I don't think that's universally the case; whether or not a port unreachable 
happens is going to depend on the configured behaviour; it may very well just 
silently drop the packet.

> 
> Then the packet from Client 2 to the public IP and port of Client 1 reaches
> NAT 2, and NAT 2 cannot use the same public IP and port it used towards
> Server as the source, because the corresponding tuple is already in use, so
> NAT 2 has to allocate a new pair of IP and port.
> 
> The simplest solution is to kill unreplied conntracks when ICMP errors are
> received for them.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> ---
>  net/ipv4/netfilter/nf_conntrack_proto_icmp.c |    7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
> b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c index a338dad..6210820
> 100644
> --- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
> +++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
> @@ -135,6 +135,7 @@ icmp_error_message(struct net *net, struct nf_conn
> *tmpl, struct sk_buff *skb, const struct nf_conntrack_l4proto *innerproto;
>  	const struct nf_conntrack_tuple_hash *h;
>  	u16 zone = tmpl ? nf_ct_zone(tmpl) : NF_CT_DEFAULT_ZONE;
> +	struct nf_conn *ct;
> 
>  	NF_CT_ASSERT(skb->nfct == NULL);
> 
> @@ -169,8 +170,12 @@ icmp_error_message(struct net *net, struct nf_conn
> *tmpl, struct sk_buff *skb, if (NF_CT_DIRECTION(h) == IP_CT_DIR_REPLY)
>  		*ctinfo += IP_CT_IS_REPLY;
> 
> +	ct = nf_ct_tuplehash_to_ctrack(h);
> +	if (!test_bit(IPS_SEEN_REPLY, &ct->status))
> +		nf_ct_kill_acct(ct, *ctinfo, skb);
> +
Perhaps I'm mistaken here so please correct me if so:

Firstly, I don't see why this is necessary, because once the client does the
hole punch, the conntrack entry should still be good to go provided the other
end is adhering to the port it's supposed to use. UDP is unreliable so an
application shouldn't be expecting perfect delivery; once Client B finally
does their initial transmit, a retransmit on the part of Client A should
succeed without any special behaviour on the part of Netfilter.

Secondly, I see this as a great opportunity for a DoS attack if someone can
spam ICMP errors down the pipe at you.
>  	/* Update skb to refer to this connection */
> -	skb->nfct = &nf_ct_tuplehash_to_ctrack(h)->ct_general;
> +	skb->nfct = &ct->ct_general;
>  	skb->nfctinfo = *ctinfo;
>  	return NF_ACCEPT;
>  }

Regards,
Oliver.
Pablo Neira Ayuso Dec. 17, 2013, 1:01 p.m. UTC | #2
On Sun, Dec 15, 2013 at 07:57:02AM +0100, Oliver wrote:
> On Thursday 12 December 2013 10:19:34 Changli Gao wrote:
> > Think about the following scenario:
> > 
> > +--------+      +-------+      +----------+
> > | Server +------+ NAT 1 +------| Client 1 |
> > +---+----+      +-------+      +----------+
> >     |
> >     |           +-------+      +----------+
> >     +-----------+ NAT 2 +------| Client 2 |
> >                 +-------+      +----------+
> > 
> > The following UDP hole punching steps are used to establish a direct
> > session between Client 1 and Client 2 with the help of Server.
> > 
> > 1. Client 1 sends a UDP packet to Server, and Server learns the public IP
> > and port of Client 1.
> > 2. Client 2 sends a UDP packet to Server, and Server learns the public IP
> > and port of Client 2.
> > 3. Server tells Client 1 the public IP and port of Client 2.
> > 4. Server tells Client 2 the public IP and port of Client 1.
> > 5. Client 1 sends UDP packets to the public IP and port of Client 2.
> > 6. Client 2 sends UDP packets to the public IP and port of Client 1.
> > 
> > If both NAT 1 and NAT 2 are Cone NAT, Client 1 and Client 2 can communicate
> > with each other directly.
> > 
> > Linux tries its best to be a Port Restricted NAT. But there is a race
> > condition between 5 and 6.
> > 
> > Suppose the packet from Client 1 to the public IP and port of Client 2
> > reaches NAT 2 before the packet from Client 2 to the public IP and port of
> > Client 1. Since there is no corresponding conntrack in NAT 2 yet, the
> > packet is treated as a new session destined to NAT 2 itself, and that port
> > is most likely not open on NAT 2, so in the end a Port Unreachable ICMP
> > packet is delivered to Client 1.
> 
> I don't think that's universally the case; whether or not a port unreachable 
> happens is going to depend on the configured behaviour; it may very well just 
> silently drop the packet.

Indeed. You can configure those two NATs to make them more hole-punching
friendly by dropping UDP packets to local closed ports, so that no conntrack
entry will be created.
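
A minimal sketch of one way to do that on the NAT box follows; the exact rule
and its placement are an illustrative assumption (it also assumes any UDP
services the box itself runs are accepted by earlier rules), not a rule
proposed in this thread:

    # Silently drop unsolicited UDP addressed to the box itself instead of
    # answering with an ICMP Port Unreachable. The INPUT chain only sees
    # locally destined traffic, so forwarded/NATed flows are unaffected, and
    # a packet dropped here never gets its conntrack entry confirmed, so no
    # unreplied entry is left occupying the tuple.
    iptables -A INPUT -p udp -m conntrack --ctstate NEW -j DROP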
Changli Gao Dec. 17, 2013, 2:46 p.m. UTC | #3
On Sun, Dec 15, 2013 at 2:57 PM, Oliver
<oliver@8.c.9.b.0.7.4.0.1.0.0.2.ip6.arpa> wrote:
>
> I don't think that's universally the case; whether or not a port unreachable
> happens is going to depend on the configured behaviour;

I know the rate of ICMP error packets is limited by default, but in many
cases the rate of ICMP packets isn't very large, so this may not be a
problem.

> it may very well just
> silently drop the packet.
>

Yes, I know that is also a solution, but it requires explicit configuration.

> Perhaps I'm mistaken here so please correct me if so:
>
> Firstly, I don't see why this is necessary, because once the client does the
> hole punch, the conntrack entry should still be good to go provided the
> other end is adhering to the port it's supposed to use. UDP is unreliable so
> an application shouldn't be expecting perfect delivery; once Client B
> finally does their initial transmit, a retransmit on the part of Client A
> should succeed without any special behaviour on the part of Netfilter.
>

Strictly speaking, Linux's NAT implementation isn't a Port Restricted Cone
NAT but is based on conntrack, so once the first packet is delivered locally,
the following packets will be delivered locally too until the conntrack times
out due to idleness. So UDP hole punching doesn't work.

> Secondly, I see this as a great opportunity for a DoS attack if someone can
> spam ICMP errors down the pipe at you.

I don't think so, because the attacker must know the correct 5-tuple before
the reply packets pass through the Linux box.

Thanks.
Changli Gao Dec. 17, 2013, 2:52 p.m. UTC | #4
On Tue, Dec 17, 2013 at 9:01 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>
> Indeed. You can configure those two NATs to make them more hole-punching
> friendly by dropping UDP packets to local closed ports, so that no conntrack
> entry will be created.

Yes, but it requires explicit configuration. Why not make it work by default,
even though it may fail in some situations? Less is better than none, isn't
it?

Thanks.
Pablo Neira Ayuso Dec. 17, 2013, 4:58 p.m. UTC | #5
On Tue, Dec 17, 2013 at 10:52:02PM +0800, Changli Gao wrote:
> On Tue, Dec 17, 2013 at 9:01 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >
> > Indeed. You can configure those two NATs to make them more hole-punching
> > friendly by dropping UDP packets to local closed ports, so that no
> > conntrack entry will be created.
> 
> Yes, but it requires explicit configuration. Why not make it work by
> default, even though it may fail in some situations? Less is better than
> none, isn't it?

With this patch, an ICMP destination unreachable - fragmentation needed
coming after a big UDP packet will trigger the removal of the UDP conntrack
entry, which should not happen.
Changli Gao Dec. 19, 2013, 4:29 a.m. UTC | #6
On Wed, Dec 18, 2013 at 12:58 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>
> With this patch, an ICMP destination unreachable - fragmentation needed
> coming after a big UDP packet will trigger the removal of the UDP conntrack
> entry, which should not happen.

I don't think this is a problem, as the conntrack is unreplied and the
original packet was dropped. That means no virtual circuit has been set up
along the path, so why should we keep the partial circuit? And the later
retransmitted packet will set up a new conntrack anyway.

Thanks.
Pablo Neira Ayuso Dec. 19, 2013, 7:51 p.m. UTC | #7
On Thu, Dec 19, 2013 at 12:29:02PM +0800, Changli Gao wrote:
> On Wed, Dec 18, 2013 at 12:58 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >
> > With this patch, an ICMP destination unreachable - fragmentation needed
> > coming after a big UDP packet will trigger the removal of the UDP
> > conntrack entry, which should not happen.
> 
> I don't think this is a problem, as the conntrack is unreplied and the
> original packet was dropped. That means no virtual circuit has been set up
> along the path, so why should we keep the partial circuit? And the later
> retransmitted packet will set up a new conntrack anyway.

There are protocols like RTP running over UDP which do not require traffic to
be seen in both directions, so that partial-circuit assumption sounds wrong to
me. The behaviour that this patch introduces is sloppy: the flow state
information would be released and the counters would be lost in that case.

Patch

diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
index a338dad..6210820 100644
--- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
+++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
@@ -135,6 +135,7 @@  icmp_error_message(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
 	const struct nf_conntrack_l4proto *innerproto;
 	const struct nf_conntrack_tuple_hash *h;
 	u16 zone = tmpl ? nf_ct_zone(tmpl) : NF_CT_DEFAULT_ZONE;
+	struct nf_conn *ct;
 
 	NF_CT_ASSERT(skb->nfct == NULL);
 
@@ -169,8 +170,12 @@  icmp_error_message(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
 	if (NF_CT_DIRECTION(h) == IP_CT_DIR_REPLY)
 		*ctinfo += IP_CT_IS_REPLY;
 
+	ct = nf_ct_tuplehash_to_ctrack(h);
+	if (!test_bit(IPS_SEEN_REPLY, &ct->status))
+		nf_ct_kill_acct(ct, *ctinfo, skb);
+
 	/* Update skb to refer to this connection */
-	skb->nfct = &nf_ct_tuplehash_to_ctrack(h)->ct_general;
+	skb->nfct = &ct->ct_general;
 	skb->nfctinfo = *ctinfo;
 	return NF_ACCEPT;
 }