diff mbox series

ipv4: Delete uncached routes upon unregistration of loopback device.

Message ID 519ea12b-4c24-9e8e-c5eb-ca02c9c7d264@i-love.sakura.ne.jp
State Rejected
Delegated to: David Miller
Headers show
Series ipv4: Delete uncached routes upon unregistration of loopback device. | expand

Commit Message

Tetsuo Handa May 4, 2019, 2:52 p.m. UTC
syzbot is hitting infinite loop when a loopback device in a namespace is
unregistered [1]. This is because rt_flush_dev() is moving the refcount of
"any device to unregister" to "a loopback device in that namespace" but
nobody can drop the refcount moved from non loopback devices when the
loopback device in that namespace is unregistered.

This behavior was introduced by commit caacf05e5ad1abf0 ("ipv4: Properly
purge netdev references on uncached routes.") but there is no description
why we have to temporarily move the refcount to "a loopback device in that
namespace" and why it is safe to do so, for rt_flush_dev() becomes a no-op
when "a loopback device in that namespace" is about to be unregistered.

Since I don't know the reason, this patch breaks the infinite loop by
deleting the uncached route (which eventually drops the refcount via
dst_destroy()) when "a loopback device in that namespace" is unregistered
rather than when "non-loopback devices in that namespace" is unregistered.

[1] https://syzkaller.appspot.com/bug?id=bae9a2236bfede42cf3d219e6bf6740c583568a4

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+30209ea299c09d8785c9@syzkaller.appspotmail.com>
---
 net/ipv4/route.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Eric Dumazet May 4, 2019, 3:56 p.m. UTC | #1
On 5/4/19 10:52 AM, Tetsuo Handa wrote:
> syzbot is hitting infinite loop when a loopback device in a namespace is
> unregistered [1]. This is because rt_flush_dev() is moving the refcount of
> "any device to unregister" to "a loopback device in that namespace" but
> nobody can drop the refcount moved from non loopback devices when the
> loopback device in that namespace is unregistered.
> 
> This behavior was introduced by commit caacf05e5ad1abf0 ("ipv4: Properly
> purge netdev references on uncached routes.") but there is no description
> why we have to temporarily move the refcount to "a loopback device in that
> namespace" and why it is safe to do so, for rt_flush_dev() becomes a no-op
> when "a loopback device in that namespace" is about to be unregistered.
> 
> Since I don't know the reason, this patch breaks the infinite loop by
> deleting the uncached route (which eventually drops the refcount via
> dst_destroy()) when "a loopback device in that namespace" is unregistered
> rather than when "non-loopback devices in that namespace" is unregistered.

Well, you have not fixed a bug, you simply made sure that whatever cpu is using the
routes you forcibly deleted is going to crash the host very soon (use-after-frees have
undefined behavior, but KASAN should crash most of the times)

Please do not send patches like that with a huge CC list, keep networking patches
to netdev mailing list.

Mahesh has an alternative patch, adding a fake device that can not be dismantled
to make sure we fully intercept skbs sent through a dead route, instead of relying
on loopback dropping them later at some point.
Tetsuo Handa May 4, 2019, 5:09 p.m. UTC | #2
On 2019/05/05 0:56, Eric Dumazet wrote:> 
> Well, you have not fixed a bug, you simply made sure that whatever cpu is using the
> routes you forcibly deleted is going to crash the host very soon (use-after-frees have
> undefined behavior, but KASAN should crash most of the times)

I confirmed that this patch survives "#syz test:" before submitting.
But you know that this patch is deleting the route entry too early. OK.

> 
> Please do not send patches like that with a huge CC list, keep networking patches
> to netdev mailing list.

If netdev people started working on this "minutely crashing bug" earlier,
I would not have written a patch...

> 
> Mahesh has an alternative patch, adding a fake device that can not be dismantled
> to make sure we fully intercept skbs sent through a dead route, instead of relying
> on loopback dropping them later at some point.

So, the reason to temporarily move the refcount is to give enough period
so that the route entry is no longer used. But moving the refcount to a
loopback device in a namespace was wrong. Is this understanding correct?

Compared to moving the refcount to the loopback device in the init namespace,
the fake device can somehow drop the refcount moved via rt_flush_dev(), can't it?

Anyway, I'll wait for Mahesh.
Eric Dumazet May 4, 2019, 5:24 p.m. UTC | #3
On 5/4/19 1:09 PM, Tetsuo Handa wrote:
> On 2019/05/05 0:56, Eric Dumazet wrote:> 
>> Well, you have not fixed a bug, you simply made sure that whatever cpu is using the
>> routes you forcibly deleted is going to crash the host very soon (use-after-frees have
>> undefined behavior, but KASAN should crash most of the times)
> 
> I confirmed that this patch survives "#syz test:" before submitting.
> But you know that this patch is deleting the route entry too early. OK.
> 
>>
>> Please do not send patches like that with a huge CC list, keep networking patches
>> to netdev mailing list.
> 
> If netdev people started working on this "minutely crashing bug" earlier,
> I would not have written a patch...


So, just that you know, we are working on bug fixes, and this is best effort.

It is not because _you_ want to fix a particular bug (out of hundreds)
that we need to stop everything and work full time on a particular bug.

And here the root cause of the problem is elsewhere. A dst is leaking somewhere,
and prevents the netns dismantle.

We had many dst leaks in the past, and they keep being added by new bugs.

> 
>>
>> Mahesh has an alternative patch, adding a fake device that can not be dismantled
>> to make sure we fully intercept skbs sent through a dead route, instead of relying
>> on loopback dropping them later at some point.
> 
> So, the reason to temporarily move the refcount is to give enough period
> so that the route entry is no longer used. But moving the refcount to a
> loopback device in a namespace was wrong. Is this understanding correct?

I believe you need spend more time on studying the networking code by yourself,
add tracing if you believe this could be useful to you and others.

> 
> Compared to moving the refcount to the loopback device in the init namespace,
> the fake device can somehow drop the refcount moved via rt_flush_dev(), can't it?
> 

The fake device wont ever disappear.

> Anyway, I'll wait for Mahesh.
>
Julian Anastasov May 4, 2019, 8:13 p.m. UTC | #4
Hello,

On Sat, 4 May 2019, Tetsuo Handa wrote:

> syzbot is hitting infinite loop when a loopback device in a namespace is
> unregistered [1]. This is because rt_flush_dev() is moving the refcount of
> "any device to unregister" to "a loopback device in that namespace" but
> nobody can drop the refcount moved from non loopback devices when the
> loopback device in that namespace is unregistered.
> 
> This behavior was introduced by commit caacf05e5ad1abf0 ("ipv4: Properly
> purge netdev references on uncached routes.") but there is no description
> why we have to temporarily move the refcount to "a loopback device in that
> namespace" and why it is safe to do so, for rt_flush_dev() becomes a no-op
> when "a loopback device in that namespace" is about to be unregistered.
> 
> Since I don't know the reason, this patch breaks the infinite loop by
> deleting the uncached route (which eventually drops the refcount via
> dst_destroy()) when "a loopback device in that namespace" is unregistered
> rather than when "non-loopback devices in that namespace" is unregistered.

	There is one simple rule: code that holds device references should
catch device events and to put the references. This should happen
not after the NETDEV_UNREGISTER event.

	But there are users such as dsts that have longer life because
there are other objects that can hold dsts - sockets, for example.
Sockets can hold dst as result of route resolving (sk_dst_cache) and to 
reuse it as long as it is valid. And many sockets can point to same dst.
Obviously, socket can be idle while device disappears. We do not
propagate device events to every socket. What we do is to mark the dst
entry as invalid and to drop its dev reference. So, the problem can be
noticed on next sending when the cached dst is revalidated. As the dst 
entry is marked as obsolete, it will be dropped.

	What you see in rt_flush_dev() is that the IPv4 route subsystem
drops its dev references at the right time. Why net->loopback_dev ?
Because we prefer the leaked dsts to prevent netns to be released, so
that we have more info where is the problem. If we account the leaks
to init netns, nobody will notice them. These are leaks that missed
many cleanup steps: device events and even net-exit events. If you
see "lo" in error messages, it is more likely a dst leak, i.e.
we got a dst reference but dst_release() was not called before netns
cleanup. In such case you need to track not the dev references but the
dst references. If another device is shown, it is not a dst leak
but some dev leak (dev_put not called).

> [1] https://syzkaller.appspot.com/bug?id=bae9a2236bfede42cf3d219e6bf6740c583568a4
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Reported-by: syzbot <syzbot+30209ea299c09d8785c9@syzkaller.appspotmail.com>
> ---
>  net/ipv4/route.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 6fdf1c195d8e..7e865c11d4f3 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1522,15 +1522,21 @@ void rt_flush_dev(struct net_device *dev)
>  {
>  	struct net *net = dev_net(dev);
>  	struct rtable *rt;
> +	struct rtable *tmp;
>  	int cpu;
>  
>  	for_each_possible_cpu(cpu) {
>  		struct uncached_list *ul = &per_cpu(rt_uncached_list, cpu);
>  
>  		spin_lock_bh(&ul->lock);
> -		list_for_each_entry(rt, &ul->head, rt_uncached) {
> +		list_for_each_entry_safe(rt, tmp, &ul->head, rt_uncached) {
>  			if (rt->dst.dev != dev)
>  				continue;
> +			if (dev == net->loopback_dev) {
> +				list_del_init(&rt->rt_uncached);
> +				ip_rt_put(rt);
> +				continue;
> +			}
>  			rt->dst.dev = net->loopback_dev;
>  			dev_hold(rt->dst.dev);
>  			dev_put(dev);
> -- 
> 2.17.1

Regards

--
Julian Anastasov <ja@ssi.bg>
diff mbox series

Patch

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6fdf1c195d8e..7e865c11d4f3 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1522,15 +1522,21 @@  void rt_flush_dev(struct net_device *dev)
 {
 	struct net *net = dev_net(dev);
 	struct rtable *rt;
+	struct rtable *tmp;
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
 		struct uncached_list *ul = &per_cpu(rt_uncached_list, cpu);
 
 		spin_lock_bh(&ul->lock);
-		list_for_each_entry(rt, &ul->head, rt_uncached) {
+		list_for_each_entry_safe(rt, tmp, &ul->head, rt_uncached) {
 			if (rt->dst.dev != dev)
 				continue;
+			if (dev == net->loopback_dev) {
+				list_del_init(&rt->rt_uncached);
+				ip_rt_put(rt);
+				continue;
+			}
 			rt->dst.dev = net->loopback_dev;
 			dev_hold(rt->dst.dev);
 			dev_put(dev);