diff mbox series

[1/4] ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes

Message ID 20190529102542.17742-2-stefan.bader@canonical.com
State Not Applicable
Delegated to: David Miller
Headers show
Series ipv6: frags: fixups for linux-4.4.y | expand

Commit Message

Stefan Bader May 29, 2019, 10:25 a.m. UTC
From: Jiri Wiesner <jwiesner@suse.com>

The *_frag_reasm() functions are susceptible to miscalculating the byte
count of packet fragments in case the truesize of a head buffer changes.
The truesize member may be changed by the call to skb_unclone(), leaving
the fragment memory limit counter unbalanced even if all fragments are
processed. This miscalculation goes unnoticed as long as the network
namespace which holds the counter is not destroyed.

Should an attempt be made to destroy a network namespace that holds an
unbalanced fragment memory limit counter the cleanup of the namespace
never finishes. The thread handling the cleanup gets stuck in
inet_frags_exit_net() waiting for the percpu counter to reach zero. The
thread is usually in running state with a stacktrace similar to:

 PID: 1073   TASK: ffff880626711440  CPU: 1   COMMAND: "kworker/u48:4"
  #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
  #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
  #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
  #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
  #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
 #10 [ffff880621563e38] process_one_work at ffffffff81096f14

It is not possible to create new network namespaces, and processes
that call unshare() end up being stuck in uninterruptible sleep state
waiting to acquire the net_mutex.

The bug was observed in the IPv6 netfilter code by Per Sundstrom.
I thank him for his analysis of the problem. The parts of this patch
that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.

Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
Reported-by: Per Sundstrom <per.sundstrom@redqube.se>
Acked-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

(backported from commit ebaf39e6032faf77218220707fc3fa22487784e0)
[smb: context adjustments in net/ipv6/netfilter/nf_conntrack_reasm.c]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 net/ipv4/ip_fragment.c                  | 7 +++++++
 net/ipv6/netfilter/nf_conntrack_reasm.c | 8 +++++++-
 net/ipv6/reassembly.c                   | 8 +++++++-
 3 files changed, 21 insertions(+), 2 deletions(-)

Comments

Greg KH May 29, 2019, 10:37 a.m. UTC | #1
On Wed, May 29, 2019 at 12:25:39PM +0200, Stefan Bader wrote:
> From: Jiri Wiesner <jwiesner@suse.com>
> 
> The *_frag_reasm() functions are susceptible to miscalculating the byte
> count of packet fragments in case the truesize of a head buffer changes.
> The truesize member may be changed by the call to skb_unclone(), leaving
> the fragment memory limit counter unbalanced even if all fragments are
> processed. This miscalculation goes unnoticed as long as the network
> namespace which holds the counter is not destroyed.
> 
> Should an attempt be made to destroy a network namespace that holds an
> unbalanced fragment memory limit counter the cleanup of the namespace
> never finishes. The thread handling the cleanup gets stuck in
> inet_frags_exit_net() waiting for the percpu counter to reach zero. The
> thread is usually in running state with a stacktrace similar to:
> 
>  PID: 1073   TASK: ffff880626711440  CPU: 1   COMMAND: "kworker/u48:4"
>   #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
>   #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
>   #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
>   #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
>   #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
>  #10 [ffff880621563e38] process_one_work at ffffffff81096f14
> 
> It is not possible to create new network namespaces, and processes
> that call unshare() end up being stuck in uninterruptible sleep state
> waiting to acquire the net_mutex.
> 
> The bug was observed in the IPv6 netfilter code by Per Sundstrom.
> I thank him for his analysis of the problem. The parts of this patch
> that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
> 
> Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
> Reported-by: Per Sundstrom <per.sundstrom@redqube.se>
> Acked-by: Peter Oskolkov <posk@google.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> (backported from commit ebaf39e6032faf77218220707fc3fa22487784e0)
> [smb: context adjustments in net/ipv6/netfilter/nf_conntrack_reasm.c]
> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

I can't take a patch for 4.4.y that is not in 4.9.y as anyone upgrading
kernel versions would have a regression :(

Can you also provide a backport of the needed patches for 4.9.y for this
issue so I can take these?

thanks,

greg k-h
Stefan Bader May 29, 2019, 12:31 p.m. UTC | #2
On 29.05.19 12:37, Greg KH wrote:
> On Wed, May 29, 2019 at 12:25:39PM +0200, Stefan Bader wrote:
>> From: Jiri Wiesner <jwiesner@suse.com>
>>
>> The *_frag_reasm() functions are susceptible to miscalculating the byte
>> count of packet fragments in case the truesize of a head buffer changes.
>> The truesize member may be changed by the call to skb_unclone(), leaving
>> the fragment memory limit counter unbalanced even if all fragments are
>> processed. This miscalculation goes unnoticed as long as the network
>> namespace which holds the counter is not destroyed.
>>
>> Should an attempt be made to destroy a network namespace that holds an
>> unbalanced fragment memory limit counter the cleanup of the namespace
>> never finishes. The thread handling the cleanup gets stuck in
>> inet_frags_exit_net() waiting for the percpu counter to reach zero. The
>> thread is usually in running state with a stacktrace similar to:
>>
>>  PID: 1073   TASK: ffff880626711440  CPU: 1   COMMAND: "kworker/u48:4"
>>   #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
>>   #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
>>   #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
>>   #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
>>   #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
>>  #10 [ffff880621563e38] process_one_work at ffffffff81096f14
>>
>> It is not possible to create new network namespaces, and processes
>> that call unshare() end up being stuck in uninterruptible sleep state
>> waiting to acquire the net_mutex.
>>
>> The bug was observed in the IPv6 netfilter code by Per Sundstrom.
>> I thank him for his analysis of the problem. The parts of this patch
>> that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
>>
>> Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
>> Reported-by: Per Sundstrom <per.sundstrom@redqube.se>
>> Acked-by: Peter Oskolkov <posk@google.com>
>> Signed-off-by: David S. Miller <davem@davemloft.net>
>>
>> (backported from commit ebaf39e6032faf77218220707fc3fa22487784e0)
>> [smb: context adjustments in net/ipv6/netfilter/nf_conntrack_reasm.c]
>> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
> 
> I can't take a patch for 4.4.y that is not in 4.9.y as anyone upgrading
> kernel versions would have a regression :(
> 
> Can you also provide a backport of the needed patches for 4.9.y for this
> issue so I can take these?

I will, once it is clear that a) the backport looks alright and b) is ok to be
done. Alternatively it might be decided that only the parts necessary for
pulling out a frag head should be picked.

Or the net-devs might decide they want to send things out. The problem
potentially exists in anything that has some stable support up to v5.1
and I have no complete overview where this was backported to.

So this is more a start of discussion that a request to apply it. stable was
just included to make stable maintainers aware.

-Stefan
> 
> thanks,
> 
> greg k-h
>
Stefan Bader June 4, 2019, 1:32 p.m. UTC | #3
On 29.05.19 12:37, Greg KH wrote:
> On Wed, May 29, 2019 at 12:25:39PM +0200, Stefan Bader wrote:
>> From: Jiri Wiesner <jwiesner@suse.com>
>>
>> The *_frag_reasm() functions are susceptible to miscalculating the byte
>> count of packet fragments in case the truesize of a head buffer changes.
>> The truesize member may be changed by the call to skb_unclone(), leaving
>> the fragment memory limit counter unbalanced even if all fragments are
>> processed. This miscalculation goes unnoticed as long as the network
>> namespace which holds the counter is not destroyed.
>>
>> Should an attempt be made to destroy a network namespace that holds an
>> unbalanced fragment memory limit counter the cleanup of the namespace
>> never finishes. The thread handling the cleanup gets stuck in
>> inet_frags_exit_net() waiting for the percpu counter to reach zero. The
>> thread is usually in running state with a stacktrace similar to:
>>
>>  PID: 1073   TASK: ffff880626711440  CPU: 1   COMMAND: "kworker/u48:4"
>>   #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
>>   #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
>>   #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
>>   #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
>>   #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
>>  #10 [ffff880621563e38] process_one_work at ffffffff81096f14
>>
>> It is not possible to create new network namespaces, and processes
>> that call unshare() end up being stuck in uninterruptible sleep state
>> waiting to acquire the net_mutex.
>>
>> The bug was observed in the IPv6 netfilter code by Per Sundstrom.
>> I thank him for his analysis of the problem. The parts of this patch
>> that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
>>
>> Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
>> Reported-by: Per Sundstrom <per.sundstrom@redqube.se>
>> Acked-by: Peter Oskolkov <posk@google.com>
>> Signed-off-by: David S. Miller <davem@davemloft.net>
>>
>> (backported from commit ebaf39e6032faf77218220707fc3fa22487784e0)
>> [smb: context adjustments in net/ipv6/netfilter/nf_conntrack_reasm.c]
>> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
> 
> I can't take a patch for 4.4.y that is not in 4.9.y as anyone upgrading
> kernel versions would have a regression :(
> 
> Can you also provide a backport of the needed patches for 4.9.y for this
> issue so I can take these?

It turns out that I cannot provide patches for 4.9.y because those are not
needed there. Patches #1 and #2 of the list I did do not explicitly appear.
However patch #3 does and it might be possible to implicitly do the changes of
the other two by adjusting the removal of the functions from the old locations
and doing the additions unmodified.
And the final patch for pulling the skb from the list is included in 4.9.y by
backports for using rbtrees in ipv6, too. In 4.9.y however the skb_get() still
needs to be dropped. Sasha did not apply it in the end, maybe partially because
of my warning that this was not enough in 4.4.y.

So I think there are two options for 4.4.y which I would defer to the net-devs
to decide:
- either also backport the patches to use rbtrees in ipv6 to 4.4.y (including
  use of inet_frag_pull_head() in ip6_expire_frag_queue() and dropping the
  skb_get() there.
- or some limited change to ip6_expire_frag_queue(). Probably only doing the
  part not related to rbtrees. This would not require patch #3 as pre-req.
  Patches #1 and #2 might be considered separately. Those would be unrelated
  to the crash in ip6_expire-frag_queue() but could fix other issues (not
  liking the sound of net namespace teardown possibly getting stuck).

For 4.9.y, please re-consider picking "ip6: fix skb leak in
ip6frag_expire_frag_queue()". Depending on checks down the path of icmpv6_send()
it might be a crash, too. In that case it should be enough as
inet_frag_pull_head() takes the skb off the queue, so it won't get double freed
when releasing the whole queue.

-Stefan
> 
> thanks,
> 
> greg k-h
>
diff mbox series

Patch

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 2c75330b1c71..5387e6ab78d7 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -519,6 +519,7 @@  static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 	struct rb_node *rbn;
 	int len;
 	int ihlen;
+	int delta;
 	int err;
 	u8 ecn;
 
@@ -560,10 +561,16 @@  static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 	if (len > 65535)
 		goto out_oversize;
 
+	delta = - head->truesize;
+
 	/* Head of list must not be cloned. */
 	if (skb_unclone(head, GFP_ATOMIC))
 		goto out_nomem;
 
+	delta += head->truesize;
+	if (delta)
+		add_frag_mem_limit(qp->q.net, delta);
+
 	/* If the first fragment is fragmented itself, we split
 	 * it to two chunks: the first with data and paged part
 	 * and the second, holding only fragments. */
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 664c84e47bab..e5d06ceb2c0c 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -353,7 +353,7 @@  static struct sk_buff *
 nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
 {
 	struct sk_buff *fp, *op, *head = fq->q.fragments;
-	int    payload_len;
+	int    payload_len, delta;
 	u8 ecn;
 
 	inet_frag_kill(&fq->q);
@@ -374,12 +374,18 @@  nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
 		goto out_oversize;
 	}
 
+	delta = - head->truesize;
+
 	/* Head of list must not be cloned. */
 	if (skb_unclone(head, GFP_ATOMIC)) {
 		pr_debug("skb is cloned but can't expand head");
 		goto out_oom;
 	}
 
+	delta += head->truesize;
+	if (delta)
+		add_frag_mem_limit(fq->q.net, delta);
+
 	/* If the first fragment is fragmented itself, we split
 	 * it to two chunks: the first with data and paged part
 	 * and the second, holding only fragments. */
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index ec917f58d105..020cf70273c9 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -343,7 +343,7 @@  static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 {
 	struct net *net = container_of(fq->q.net, struct net, ipv6.frags);
 	struct sk_buff *fp, *head = fq->q.fragments;
-	int    payload_len;
+	int    payload_len, delta;
 	unsigned int nhoff;
 	int sum_truesize;
 	u8 ecn;
@@ -384,10 +384,16 @@  static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 	if (payload_len > IPV6_MAXPLEN)
 		goto out_oversize;
 
+	delta = - head->truesize;
+
 	/* Head of list must not be cloned. */
 	if (skb_unclone(head, GFP_ATOMIC))
 		goto out_oom;
 
+	delta += head->truesize;
+	if (delta)
+		add_frag_mem_limit(fq->q.net, delta);
+
 	/* If the first fragment is fragmented itself, we split
 	 * it to two chunks: the first with data and paged part
 	 * and the second, holding only fragments. */