diff mbox series

[net] ipv6: remove null_entry before adding default route

Message ID 20180106013835.127556-1-tracywwnj@gmail.com
State Changes Requested, archived
Delegated to: David Miller
Headers show
Series [net] ipv6: remove null_entry before adding default route | expand

Commit Message

Wei Wang Jan. 6, 2018, 1:38 a.m. UTC
From: Wei Wang <weiwan@google.com>

In the current code, when creating a new fib6 table, tb6_root.leaf gets
initialized to net->ipv6.ip6_null_entry.
If a default route is being added with rt->rt6i_metric = 0xffffffff,
fib6_add() will add this route after net->ipv6.ip6_null_entry. As
null_entry is shared, it could cause problem.

In order to fix it, set fn->leaf to NULL before calling
fib6_add_rt2node() when trying to add the first default route.
And reset fn->leaf to null_entry when adding fails or when deleting the
last default route.

syzkaller reported the following issue which is fixed by this commit:

Comments

Martin KaFai Lau Jan. 6, 2018, 7:42 a.m. UTC | #1
On Fri, Jan 05, 2018 at 05:38:35PM -0800, Wei Wang wrote:
> From: Wei Wang <weiwan@google.com>
> 
> In the current code, when creating a new fib6 table, tb6_root.leaf gets
> initialized to net->ipv6.ip6_null_entry.
> If a default route is being added with rt->rt6i_metric = 0xffffffff,
> fib6_add() will add this route after net->ipv6.ip6_null_entry. As
> null_entry is shared, it could cause problem.
> 
> In order to fix it, set fn->leaf to NULL before calling
> fib6_add_rt2node() when trying to add the first default route.
> And reset fn->leaf to null_entry when adding fails or when deleting the
> last default route.
> 
> syzkaller reported the following issue which is fixed by this commit:
> =============================
> WARNING: suspicious RCU usage
> 4.15.0-rc5+ #171 Not tainted
> -----------------------------
> net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!
> 
> other info that might help us debug this:
> 
> rcu_scheduler_active = 2, debug_locks = 1
> 4 locks held by swapper/0/0:
>  #0:  ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] lockdep_copy_map include/linux/lockdep.h:178 [inline]
>  #0:  ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
>  #1:  (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] spin_lock_bh include/linux/spinlock.h:315 [inline]
>  #1:  (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
>  #2:  (rcu_read_lock){....}, at: [<0000000091db762d>] __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
>  #3:  (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] spin_lock_bh include/linux/spinlock.h:315 [inline]
>  #3:  (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948
> 
> stack backtrace:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
>  <IRQ>
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
>  fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
>  fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
>  fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
>  fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
>  fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
>  __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
>  fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
>  fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
>  fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
>  call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
>  expire_timers kernel/time/timer.c:1357 [inline]
>  __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
>  run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>  invoke_softirq kernel/softirq.c:365 [inline]
>  irq_exit+0x1cc/0x200 kernel/softirq.c:405
>  exiting_irq arch/x86/include/asm/apic.h:540 [inline]
>  smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
>  apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
>  </IRQ>
> 
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
> Signed-off-by: Wei Wang <weiwan@google.com>
> ---
>  net/ipv6/ip6_fib.c | 45 +++++++++++++++++++++++++++++++++++----------
>  1 file changed, 35 insertions(+), 10 deletions(-)
> 
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index d11a5578e4f8..37cb4ad1ea29 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -640,6 +640,11 @@ static struct fib6_node *fib6_add_1(struct net *net,
>  			if (!(fn->fn_flags & RTN_RTINFO)) {
>  				RCU_INIT_POINTER(fn->leaf, NULL);
>  				rt6_release(leaf);
> +			/* remove null_entry in the root node */
> +			} else if (fn->fn_flags & RTN_TL_ROOT &&
> +				   rcu_access_pointer(fn->leaf) ==
> +				   net->ipv6.ip6_null_entry) {
> +				RCU_INIT_POINTER(fn->leaf, NULL);
It seems the reader side could see tb6_root.leaf == NULL after
this change and I think it should be fine?  If it is, instead
of switching betwen NULL and ip6_null_entry, would it be simpler
to always set tb6_root.leaf to NULL until a legit default route
is added?

>  			}
>  
>  			return fn;
> @@ -1270,14 +1275,27 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
>  	return err;
>  
>  failure:
> -	/* fn->leaf could be NULL if fn is an intermediate node and we
> -	 * failed to add the new route to it in both subtree creation
> -	 * failure and fib6_add_rt2node() failure case.
> -	 * In both cases, fib6_repair_tree() should be called to fix
> +	/* fn->leaf could be NULL if:
> +	 * 1. fn is the root node in the table and we fail to add the default
> +	 * route to it.
> +	 * In this case, we put fn->leaf back to net->ipv6.ip6_null_entry as
> +	 * the way the table was created.
> +	 * 2. fn is an intermediate node and we failed to add the new
> +	 * route to it in both subtree creation failure and fib6_add_rt2node()
> +	 * failure case.
> +	 * In this case, fib6_repair_tree() should be called to fix
>  	 * fn->leaf.
>  	 */
> -	if (fn && !(fn->fn_flags & (RTN_RTINFO|RTN_ROOT)))
> -		fib6_repair_tree(info->nl_net, table, fn);
> +	if (fn) {
> +		if (fn->fn_flags & RTN_TL_ROOT) {
> +			if (!rcu_access_pointer(fn->leaf)) {
> +				rcu_assign_pointer(fn->leaf,
> +					   info->nl_net->ipv6.ip6_null_entry);
> +			}
> +		} else if (!(fn->fn_flags & (RTN_RTINFO|RTN_ROOT))) {
> +			fib6_repair_tree(info->nl_net, table, fn);
> +		}
> +	}
>  	/* Always release dst as dst->__refcnt is guaranteed
>  	 * to be taken before entering this function
>  	 */
> @@ -1685,11 +1703,18 @@ static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn,
>  	}
>  	read_unlock(&net->ipv6.fib6_walker_lock);
>  
> -	/* If it was last route, expunge its radix tree node */
> +	/* If it was last route:
> +	 * 1. For root node, put back null_entry as how the table was created.
> +	 * 2. For other nodes, expunge its radix tree node.
> +	 */
>  	if (!rcu_access_pointer(fn->leaf)) {
> -		fn->fn_flags &= ~RTN_RTINFO;
> -		net->ipv6.rt6_stats->fib_route_nodes--;
> -		fn = fib6_repair_tree(net, table, fn);
> +		if (fn->fn_flags & RTN_TL_ROOT) {
> +			rcu_assign_pointer(fn->leaf, net->ipv6.ip6_null_entry);
> +		} else {
> +			fn->fn_flags &= ~RTN_RTINFO;
> +			net->ipv6.rt6_stats->fib_route_nodes--;
> +			fn = fib6_repair_tree(net, table, fn);
> +		}
>  	}
>  
>  	fib6_purge_rt(rt, fn, net);
> -- 
> 2.16.0.rc0.223.g4a4ac83678-goog
>
Wei Wang Jan. 7, 2018, 1:41 a.m. UTC | #2
On Fri, Jan 5, 2018 at 11:42 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> On Fri, Jan 05, 2018 at 05:38:35PM -0800, Wei Wang wrote:
>> From: Wei Wang <weiwan@google.com>
>>
>> In the current code, when creating a new fib6 table, tb6_root.leaf gets
>> initialized to net->ipv6.ip6_null_entry.
>> If a default route is being added with rt->rt6i_metric = 0xffffffff,
>> fib6_add() will add this route after net->ipv6.ip6_null_entry. As
>> null_entry is shared, it could cause problem.
>>
>> In order to fix it, set fn->leaf to NULL before calling
>> fib6_add_rt2node() when trying to add the first default route.
>> And reset fn->leaf to null_entry when adding fails or when deleting the
>> last default route.
>>
>> syzkaller reported the following issue which is fixed by this commit:
>> =============================
>> WARNING: suspicious RCU usage
>> 4.15.0-rc5+ #171 Not tainted
>> -----------------------------
>> net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!
>>
>> other info that might help us debug this:
>>
>> rcu_scheduler_active = 2, debug_locks = 1
>> 4 locks held by swapper/0/0:
>>  #0:  ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] lockdep_copy_map include/linux/lockdep.h:178 [inline]
>>  #0:  ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
>>  #1:  (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] spin_lock_bh include/linux/spinlock.h:315 [inline]
>>  #1:  (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
>>  #2:  (rcu_read_lock){....}, at: [<0000000091db762d>] __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
>>  #3:  (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] spin_lock_bh include/linux/spinlock.h:315 [inline]
>>  #3:  (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948
>>
>> stack backtrace:
>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>> Call Trace:
>>  <IRQ>
>>  __dump_stack lib/dump_stack.c:17 [inline]
>>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>>  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
>>  fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
>>  fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
>>  fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
>>  fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
>>  fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
>>  __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
>>  fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
>>  fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
>>  fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
>>  call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
>>  expire_timers kernel/time/timer.c:1357 [inline]
>>  __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
>>  run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
>>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>>  invoke_softirq kernel/softirq.c:365 [inline]
>>  irq_exit+0x1cc/0x200 kernel/softirq.c:405
>>  exiting_irq arch/x86/include/asm/apic.h:540 [inline]
>>  smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
>>  apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
>>  </IRQ>
>>
>> Reported-by: syzbot <syzkaller@googlegroups.com>
>> Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
>> Signed-off-by: Wei Wang <weiwan@google.com>
>> ---
>>  net/ipv6/ip6_fib.c | 45 +++++++++++++++++++++++++++++++++++----------
>>  1 file changed, 35 insertions(+), 10 deletions(-)
>>
>> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
>> index d11a5578e4f8..37cb4ad1ea29 100644
>> --- a/net/ipv6/ip6_fib.c
>> +++ b/net/ipv6/ip6_fib.c
>> @@ -640,6 +640,11 @@ static struct fib6_node *fib6_add_1(struct net *net,
>>                       if (!(fn->fn_flags & RTN_RTINFO)) {
>>                               RCU_INIT_POINTER(fn->leaf, NULL);
>>                               rt6_release(leaf);
>> +                     /* remove null_entry in the root node */
>> +                     } else if (fn->fn_flags & RTN_TL_ROOT &&
>> +                                rcu_access_pointer(fn->leaf) ==
>> +                                net->ipv6.ip6_null_entry) {
>> +                             RCU_INIT_POINTER(fn->leaf, NULL);
> It seems the reader side could see tb6_root.leaf == NULL after
> this change and I think it should be fine?  If it is, instead
> of switching betwen NULL and ip6_null_entry, would it be simpler
> to always set tb6_root.leaf to NULL until a legit default route
> is added?
>

Agree that reader side could see tb6_root.leaf == NULL and it should
be fine cause the lookup code path takes care of NULL case.
If we init tb6_root to NULL, it does seem simpler. The only place we
need to take care is that fib6_del() should not try to expunge fn if
is root node if it does not contain any route. But does it seem weird
that tb6_root.fn_flags is marked with RTN_RTINFO without containing
any route?

>>                       }
>>
>>                       return fn;
>> @@ -1270,14 +1275,27 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
>>       return err;
>>
>>  failure:
>> -     /* fn->leaf could be NULL if fn is an intermediate node and we
>> -      * failed to add the new route to it in both subtree creation
>> -      * failure and fib6_add_rt2node() failure case.
>> -      * In both cases, fib6_repair_tree() should be called to fix
>> +     /* fn->leaf could be NULL if:
>> +      * 1. fn is the root node in the table and we fail to add the default
>> +      * route to it.
>> +      * In this case, we put fn->leaf back to net->ipv6.ip6_null_entry as
>> +      * the way the table was created.
>> +      * 2. fn is an intermediate node and we failed to add the new
>> +      * route to it in both subtree creation failure and fib6_add_rt2node()
>> +      * failure case.
>> +      * In this case, fib6_repair_tree() should be called to fix
>>        * fn->leaf.
>>        */
>> -     if (fn && !(fn->fn_flags & (RTN_RTINFO|RTN_ROOT)))
>> -             fib6_repair_tree(info->nl_net, table, fn);
>> +     if (fn) {
>> +             if (fn->fn_flags & RTN_TL_ROOT) {
>> +                     if (!rcu_access_pointer(fn->leaf)) {
>> +                             rcu_assign_pointer(fn->leaf,
>> +                                        info->nl_net->ipv6.ip6_null_entry);
>> +                     }
>> +             } else if (!(fn->fn_flags & (RTN_RTINFO|RTN_ROOT))) {
>> +                     fib6_repair_tree(info->nl_net, table, fn);
>> +             }
>> +     }
>>       /* Always release dst as dst->__refcnt is guaranteed
>>        * to be taken before entering this function
>>        */
>> @@ -1685,11 +1703,18 @@ static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn,
>>       }
>>       read_unlock(&net->ipv6.fib6_walker_lock);
>>
>> -     /* If it was last route, expunge its radix tree node */
>> +     /* If it was last route:
>> +      * 1. For root node, put back null_entry as how the table was created.
>> +      * 2. For other nodes, expunge its radix tree node.
>> +      */
>>       if (!rcu_access_pointer(fn->leaf)) {
>> -             fn->fn_flags &= ~RTN_RTINFO;
>> -             net->ipv6.rt6_stats->fib_route_nodes--;
>> -             fn = fib6_repair_tree(net, table, fn);
>> +             if (fn->fn_flags & RTN_TL_ROOT) {
>> +                     rcu_assign_pointer(fn->leaf, net->ipv6.ip6_null_entry);
>> +             } else {
>> +                     fn->fn_flags &= ~RTN_RTINFO;
>> +                     net->ipv6.rt6_stats->fib_route_nodes--;
>> +                     fn = fib6_repair_tree(net, table, fn);
>> +             }
>>       }
>>
>>       fib6_purge_rt(rt, fn, net);
>> --
>> 2.16.0.rc0.223.g4a4ac83678-goog
>>
Martin KaFai Lau Jan. 7, 2018, 6:16 a.m. UTC | #3
On Sat, Jan 06, 2018 at 05:41:28PM -0800, Wei Wang wrote:
> On Fri, Jan 5, 2018 at 11:42 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> > On Fri, Jan 05, 2018 at 05:38:35PM -0800, Wei Wang wrote:
> >> From: Wei Wang <weiwan@google.com>
> >>
> >> In the current code, when creating a new fib6 table, tb6_root.leaf gets
> >> initialized to net->ipv6.ip6_null_entry.
> >> If a default route is being added with rt->rt6i_metric = 0xffffffff,
> >> fib6_add() will add this route after net->ipv6.ip6_null_entry. As
> >> null_entry is shared, it could cause problem.
> >>
> >> In order to fix it, set fn->leaf to NULL before calling
> >> fib6_add_rt2node() when trying to add the first default route.
> >> And reset fn->leaf to null_entry when adding fails or when deleting the
> >> last default route.
> >>
> >> syzkaller reported the following issue which is fixed by this commit:
> >> =============================
> >> WARNING: suspicious RCU usage
> >> 4.15.0-rc5+ #171 Not tainted
> >> -----------------------------
> >> net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!
> >>
> >> other info that might help us debug this:
> >>
> >> rcu_scheduler_active = 2, debug_locks = 1
> >> 4 locks held by swapper/0/0:
> >>  #0:  ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] lockdep_copy_map include/linux/lockdep.h:178 [inline]
> >>  #0:  ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
> >>  #1:  (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] spin_lock_bh include/linux/spinlock.h:315 [inline]
> >>  #1:  (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
> >>  #2:  (rcu_read_lock){....}, at: [<0000000091db762d>] __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
> >>  #3:  (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] spin_lock_bh include/linux/spinlock.h:315 [inline]
> >>  #3:  (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948
> >>
> >> stack backtrace:
> >> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171
> >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> >> Call Trace:
> >>  <IRQ>
> >>  __dump_stack lib/dump_stack.c:17 [inline]
> >>  dump_stack+0x194/0x257 lib/dump_stack.c:53
> >>  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
> >>  fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
> >>  fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
> >>  fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
> >>  fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
> >>  fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
> >>  __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
> >>  fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
> >>  fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
> >>  fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
> >>  call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
> >>  expire_timers kernel/time/timer.c:1357 [inline]
> >>  __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
> >>  run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
> >>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
> >>  invoke_softirq kernel/softirq.c:365 [inline]
> >>  irq_exit+0x1cc/0x200 kernel/softirq.c:405
> >>  exiting_irq arch/x86/include/asm/apic.h:540 [inline]
> >>  smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
> >>  apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
> >>  </IRQ>
> >>
> >> Reported-by: syzbot <syzkaller@googlegroups.com>
> >> Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
> >> Signed-off-by: Wei Wang <weiwan@google.com>
> >> ---
> >>  net/ipv6/ip6_fib.c | 45 +++++++++++++++++++++++++++++++++++----------
> >>  1 file changed, 35 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> >> index d11a5578e4f8..37cb4ad1ea29 100644
> >> --- a/net/ipv6/ip6_fib.c
> >> +++ b/net/ipv6/ip6_fib.c
> >> @@ -640,6 +640,11 @@ static struct fib6_node *fib6_add_1(struct net *net,
> >>                       if (!(fn->fn_flags & RTN_RTINFO)) {
> >>                               RCU_INIT_POINTER(fn->leaf, NULL);
> >>                               rt6_release(leaf);
> >> +                     /* remove null_entry in the root node */
> >> +                     } else if (fn->fn_flags & RTN_TL_ROOT &&
> >> +                                rcu_access_pointer(fn->leaf) ==
> >> +                                net->ipv6.ip6_null_entry) {
> >> +                             RCU_INIT_POINTER(fn->leaf, NULL);
> > It seems the reader side could see tb6_root.leaf == NULL after
> > this change and I think it should be fine?  If it is, instead
> > of switching betwen NULL and ip6_null_entry, would it be simpler
> > to always set tb6_root.leaf to NULL until a legit default route
> > is added?
> >
> 
> Agree that reader side could see tb6_root.leaf == NULL and it should
> be fine cause the lookup code path takes care of NULL case.
> If we init tb6_root to NULL, it does seem simpler. The only place we
> need to take care is that fib6_del() should not try to expunge fn if
> is root node if it does not contain any route. But does it seem weird
> that tb6_root.fn_flags is marked with RTN_RTINFO without containing
> any route?
Good point on the writer's side.  I think we also need to modify
fib6_add_1() to expect fn->leaf == NULL.  Also, agree that
RTN_RTINFO needs to be set/clear accordingly.  We also need to
make changes to the CONFIG_IPV6_SUBTREES......
It should be doable but requires more changes to break the writer's side
assumption.  net-next may be a better fit for those changes.

For this patch, I have one inline comment below:

> 
> >>                       }
> >>
> >>                       return fn;
> >> @@ -1270,14 +1275,27 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
> >>       return err;
> >>
> >>  failure:
> >> -     /* fn->leaf could be NULL if fn is an intermediate node and we
> >> -      * failed to add the new route to it in both subtree creation
> >> -      * failure and fib6_add_rt2node() failure case.
> >> -      * In both cases, fib6_repair_tree() should be called to fix
> >> +     /* fn->leaf could be NULL if:
> >> +      * 1. fn is the root node in the table and we fail to add the default
> >> +      * route to it.
> >> +      * In this case, we put fn->leaf back to net->ipv6.ip6_null_entry as
> >> +      * the way the table was created.
> >> +      * 2. fn is an intermediate node and we failed to add the new
> >> +      * route to it in both subtree creation failure and fib6_add_rt2node()
> >> +      * failure case.
> >> +      * In this case, fib6_repair_tree() should be called to fix
> >>        * fn->leaf.
> >>        */
> >> -     if (fn && !(fn->fn_flags & (RTN_RTINFO|RTN_ROOT)))
> >> -             fib6_repair_tree(info->nl_net, table, fn);
> >> +     if (fn) {
> >> +             if (fn->fn_flags & RTN_TL_ROOT) {
> >> +                     if (!rcu_access_pointer(fn->leaf)) {
> >> +                             rcu_assign_pointer(fn->leaf,
> >> +                                        info->nl_net->ipv6.ip6_null_entry);
> >> +                     }
> >> +             } else if (!(fn->fn_flags & (RTN_RTINFO|RTN_ROOT))) {
> >> +                     fib6_repair_tree(info->nl_net, table, fn);
fib6_repair_tree() is responsible to ensure intermediate fn->leaf is
not NULL (or free the fn if possible).  In this case, we want tb6_root.leaf
to be not NULL.  Can we consolidate the tb6_root.leaf fixing codes to
fib6_repair_tree() also such that all tree repairing logic is
in one place?

> >> +             }
> >> +     }
> >>       /* Always release dst as dst->__refcnt is guaranteed
> >>        * to be taken before entering this function
> >>        */
> >> @@ -1685,11 +1703,18 @@ static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn,
> >>       }
> >>       read_unlock(&net->ipv6.fib6_walker_lock);
> >>
> >> -     /* If it was last route, expunge its radix tree node */
> >> +     /* If it was last route:
> >> +      * 1. For root node, put back null_entry as how the table was created.
> >> +      * 2. For other nodes, expunge its radix tree node.
> >> +      */
> >>       if (!rcu_access_pointer(fn->leaf)) {
> >> -             fn->fn_flags &= ~RTN_RTINFO;
> >> -             net->ipv6.rt6_stats->fib_route_nodes--;
> >> -             fn = fib6_repair_tree(net, table, fn);
> >> +             if (fn->fn_flags & RTN_TL_ROOT) {
> >> +                     rcu_assign_pointer(fn->leaf, net->ipv6.ip6_null_entry);
> >> +             } else {
> >> +                     fn->fn_flags &= ~RTN_RTINFO;
> >> +                     net->ipv6.rt6_stats->fib_route_nodes--;
> >> +                     fn = fib6_repair_tree(net, table, fn);
> >> +             }
> >>       }
> >>
> >>       fib6_purge_rt(rt, fn, net);
> >> --
> >> 2.16.0.rc0.223.g4a4ac83678-goog
> >>
Wei Wang Jan. 7, 2018, 5:57 p.m. UTC | #4
On Sat, Jan 6, 2018 at 10:16 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> On Sat, Jan 06, 2018 at 05:41:28PM -0800, Wei Wang wrote:
>> On Fri, Jan 5, 2018 at 11:42 PM, Martin KaFai Lau <kafai@fb.com> wrote:
>> > On Fri, Jan 05, 2018 at 05:38:35PM -0800, Wei Wang wrote:
>> >> From: Wei Wang <weiwan@google.com>
>> >>
>> >> In the current code, when creating a new fib6 table, tb6_root.leaf gets
>> >> initialized to net->ipv6.ip6_null_entry.
>> >> If a default route is being added with rt->rt6i_metric = 0xffffffff,
>> >> fib6_add() will add this route after net->ipv6.ip6_null_entry. As
>> >> null_entry is shared, it could cause problem.
>> >>
>> >> In order to fix it, set fn->leaf to NULL before calling
>> >> fib6_add_rt2node() when trying to add the first default route.
>> >> And reset fn->leaf to null_entry when adding fails or when deleting the
>> >> last default route.
>> >>
>> >> syzkaller reported the following issue which is fixed by this commit:
>> >> =============================
>> >> WARNING: suspicious RCU usage
>> >> 4.15.0-rc5+ #171 Not tainted
>> >> -----------------------------
>> >> net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!
>> >>
>> >> other info that might help us debug this:
>> >>
>> >> rcu_scheduler_active = 2, debug_locks = 1
>> >> 4 locks held by swapper/0/0:
>> >>  #0:  ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] lockdep_copy_map include/linux/lockdep.h:178 [inline]
>> >>  #0:  ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
>> >>  #1:  (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] spin_lock_bh include/linux/spinlock.h:315 [inline]
>> >>  #1:  (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
>> >>  #2:  (rcu_read_lock){....}, at: [<0000000091db762d>] __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
>> >>  #3:  (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] spin_lock_bh include/linux/spinlock.h:315 [inline]
>> >>  #3:  (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948
>> >>
>> >> stack backtrace:
>> >> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171
>> >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>> >> Call Trace:
>> >>  <IRQ>
>> >>  __dump_stack lib/dump_stack.c:17 [inline]
>> >>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>> >>  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
>> >>  fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
>> >>  fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
>> >>  fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
>> >>  fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
>> >>  fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
>> >>  __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
>> >>  fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
>> >>  fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
>> >>  fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
>> >>  call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
>> >>  expire_timers kernel/time/timer.c:1357 [inline]
>> >>  __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
>> >>  run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
>> >>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>> >>  invoke_softirq kernel/softirq.c:365 [inline]
>> >>  irq_exit+0x1cc/0x200 kernel/softirq.c:405
>> >>  exiting_irq arch/x86/include/asm/apic.h:540 [inline]
>> >>  smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
>> >>  apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
>> >>  </IRQ>
>> >>
>> >> Reported-by: syzbot <syzkaller@googlegroups.com>
>> >> Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
>> >> Signed-off-by: Wei Wang <weiwan@google.com>
>> >> ---
>> >>  net/ipv6/ip6_fib.c | 45 +++++++++++++++++++++++++++++++++++----------
>> >>  1 file changed, 35 insertions(+), 10 deletions(-)
>> >>
>> >> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
>> >> index d11a5578e4f8..37cb4ad1ea29 100644
>> >> --- a/net/ipv6/ip6_fib.c
>> >> +++ b/net/ipv6/ip6_fib.c
>> >> @@ -640,6 +640,11 @@ static struct fib6_node *fib6_add_1(struct net *net,
>> >>                       if (!(fn->fn_flags & RTN_RTINFO)) {
>> >>                               RCU_INIT_POINTER(fn->leaf, NULL);
>> >>                               rt6_release(leaf);
>> >> +                     /* remove null_entry in the root node */
>> >> +                     } else if (fn->fn_flags & RTN_TL_ROOT &&
>> >> +                                rcu_access_pointer(fn->leaf) ==
>> >> +                                net->ipv6.ip6_null_entry) {
>> >> +                             RCU_INIT_POINTER(fn->leaf, NULL);
>> > It seems the reader side could see tb6_root.leaf == NULL after
>> > this change and I think it should be fine?  If it is, instead
>> > of switching betwen NULL and ip6_null_entry, would it be simpler
>> > to always set tb6_root.leaf to NULL until a legit default route
>> > is added?
>> >
>>
>> Agree that reader side could see tb6_root.leaf == NULL and it should
>> be fine cause the lookup code path takes care of NULL case.
>> If we init tb6_root to NULL, it does seem simpler. The only place we
>> need to take care is that fib6_del() should not try to expunge fn if
>> is root node if it does not contain any route. But does it seem weird
>> that tb6_root.fn_flags is marked with RTN_RTINFO without containing
>> any route?
> Good point on the writer's side.  I think we also need to modify
> fib6_add_1() to expect fn->leaf == NULL.  Also, agree that
> RTN_RTINFO needs to be set/clear accordingly.  We also need to
> make changes to the CONFIG_IPV6_SUBTREES......
> It should be doable but requires more changes to break the writer's side
> assumption.  net-next may be a better fit for those changes.
>

Understood. Yes. Agree. I will give it a try on net-next.

> For this patch, I have one inline comment below:
>

>>
>> >>                       }
>> >>
>> >>                       return fn;
>> >> @@ -1270,14 +1275,27 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
>> >>       return err;
>> >>
>> >>  failure:
>> >> -     /* fn->leaf could be NULL if fn is an intermediate node and we
>> >> -      * failed to add the new route to it in both subtree creation
>> >> -      * failure and fib6_add_rt2node() failure case.
>> >> -      * In both cases, fib6_repair_tree() should be called to fix
>> >> +     /* fn->leaf could be NULL if:
>> >> +      * 1. fn is the root node in the table and we fail to add the default
>> >> +      * route to it.
>> >> +      * In this case, we put fn->leaf back to net->ipv6.ip6_null_entry as
>> >> +      * the way the table was created.
>> >> +      * 2. fn is an intermediate node and we failed to add the new
>> >> +      * route to it in both subtree creation failure and fib6_add_rt2node()
>> >> +      * failure case.
>> >> +      * In this case, fib6_repair_tree() should be called to fix
>> >>        * fn->leaf.
>> >>        */
>> >> -     if (fn && !(fn->fn_flags & (RTN_RTINFO|RTN_ROOT)))
>> >> -             fib6_repair_tree(info->nl_net, table, fn);
>> >> +     if (fn) {
>> >> +             if (fn->fn_flags & RTN_TL_ROOT) {
>> >> +                     if (!rcu_access_pointer(fn->leaf)) {
>> >> +                             rcu_assign_pointer(fn->leaf,
>> >> +                                        info->nl_net->ipv6.ip6_null_entry);
>> >> +                     }
>> >> +             } else if (!(fn->fn_flags & (RTN_RTINFO|RTN_ROOT))) {
>> >> +                     fib6_repair_tree(info->nl_net, table, fn);
> fib6_repair_tree() is responsible to ensure intermediate fn->leaf is
> not NULL (or free the fn if possible).  In this case, we want tb6_root.leaf
> to be not NULL.  Can we consolidate the tb6_root.leaf fixing codes to
> fib6_repair_tree() also such that all tree repairing logic is
> in one place?
>

Yes. Good idea. I will send out v2 for this.

>> >> +             }
>> >> +     }
>> >>       /* Always release dst as dst->__refcnt is guaranteed
>> >>        * to be taken before entering this function
>> >>        */
>> >> @@ -1685,11 +1703,18 @@ static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn,
>> >>       }
>> >>       read_unlock(&net->ipv6.fib6_walker_lock);
>> >>
>> >> -     /* If it was last route, expunge its radix tree node */
>> >> +     /* If it was last route:
>> >> +      * 1. For root node, put back null_entry as how the table was created.
>> >> +      * 2. For other nodes, expunge its radix tree node.
>> >> +      */
>> >>       if (!rcu_access_pointer(fn->leaf)) {
>> >> -             fn->fn_flags &= ~RTN_RTINFO;
>> >> -             net->ipv6.rt6_stats->fib_route_nodes--;
>> >> -             fn = fib6_repair_tree(net, table, fn);
>> >> +             if (fn->fn_flags & RTN_TL_ROOT) {
>> >> +                     rcu_assign_pointer(fn->leaf, net->ipv6.ip6_null_entry);
>> >> +             } else {
>> >> +                     fn->fn_flags &= ~RTN_RTINFO;
>> >> +                     net->ipv6.rt6_stats->fib_route_nodes--;
>> >> +                     fn = fib6_repair_tree(net, table, fn);
>> >> +             }
>> >>       }
>> >>
>> >>       fib6_purge_rt(rt, fn, net);
>> >> --
>> >> 2.16.0.rc0.223.g4a4ac83678-goog
>> >>
diff mbox series

Patch

=============================
WARNING: suspicious RCU usage
4.15.0-rc5+ #171 Not tainted
-----------------------------
net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
4 locks held by swapper/0/0:
 #0:  ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] lockdep_copy_map include/linux/lockdep.h:178 [inline]
 #0:  ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
 #1:  (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] spin_lock_bh include/linux/spinlock.h:315 [inline]
 #1:  (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
 #2:  (rcu_read_lock){....}, at: [<0000000091db762d>] __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
 #3:  (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] spin_lock_bh include/linux/spinlock.h:315 [inline]
 #3:  (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
 fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
 fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
 fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
 fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
 fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
 __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
 fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
 fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
 fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
 call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
 expire_timers kernel/time/timer.c:1357 [inline]
 __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
 run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
 __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
 invoke_softirq kernel/softirq.c:365 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:540 [inline]
 smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
 </IRQ>

Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Signed-off-by: Wei Wang <weiwan@google.com>
---
 net/ipv6/ip6_fib.c | 45 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index d11a5578e4f8..37cb4ad1ea29 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -640,6 +640,11 @@  static struct fib6_node *fib6_add_1(struct net *net,
 			if (!(fn->fn_flags & RTN_RTINFO)) {
 				RCU_INIT_POINTER(fn->leaf, NULL);
 				rt6_release(leaf);
+			/* remove null_entry in the root node */
+			} else if (fn->fn_flags & RTN_TL_ROOT &&
+				   rcu_access_pointer(fn->leaf) ==
+				   net->ipv6.ip6_null_entry) {
+				RCU_INIT_POINTER(fn->leaf, NULL);
 			}
 
 			return fn;
@@ -1270,14 +1275,27 @@  int fib6_add(struct fib6_node *root, struct rt6_info *rt,
 	return err;
 
 failure:
-	/* fn->leaf could be NULL if fn is an intermediate node and we
-	 * failed to add the new route to it in both subtree creation
-	 * failure and fib6_add_rt2node() failure case.
-	 * In both cases, fib6_repair_tree() should be called to fix
+	/* fn->leaf could be NULL if:
+	 * 1. fn is the root node in the table and we fail to add the default
+	 * route to it.
+	 * In this case, we put fn->leaf back to net->ipv6.ip6_null_entry as
+	 * the way the table was created.
+	 * 2. fn is an intermediate node and we failed to add the new
+	 * route to it in both subtree creation failure and fib6_add_rt2node()
+	 * failure case.
+	 * In this case, fib6_repair_tree() should be called to fix
 	 * fn->leaf.
 	 */
-	if (fn && !(fn->fn_flags & (RTN_RTINFO|RTN_ROOT)))
-		fib6_repair_tree(info->nl_net, table, fn);
+	if (fn) {
+		if (fn->fn_flags & RTN_TL_ROOT) {
+			if (!rcu_access_pointer(fn->leaf)) {
+				rcu_assign_pointer(fn->leaf,
+					   info->nl_net->ipv6.ip6_null_entry);
+			}
+		} else if (!(fn->fn_flags & (RTN_RTINFO|RTN_ROOT))) {
+			fib6_repair_tree(info->nl_net, table, fn);
+		}
+	}
 	/* Always release dst as dst->__refcnt is guaranteed
 	 * to be taken before entering this function
 	 */
@@ -1685,11 +1703,18 @@  static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn,
 	}
 	read_unlock(&net->ipv6.fib6_walker_lock);
 
-	/* If it was last route, expunge its radix tree node */
+	/* If it was last route:
+	 * 1. For root node, put back null_entry as how the table was created.
+	 * 2. For other nodes, expunge its radix tree node.
+	 */
 	if (!rcu_access_pointer(fn->leaf)) {
-		fn->fn_flags &= ~RTN_RTINFO;
-		net->ipv6.rt6_stats->fib_route_nodes--;
-		fn = fib6_repair_tree(net, table, fn);
+		if (fn->fn_flags & RTN_TL_ROOT) {
+			rcu_assign_pointer(fn->leaf, net->ipv6.ip6_null_entry);
+		} else {
+			fn->fn_flags &= ~RTN_RTINFO;
+			net->ipv6.rt6_stats->fib_route_nodes--;
+			fn = fib6_repair_tree(net, table, fn);
+		}
 	}
 
 	fib6_purge_rt(rt, fn, net);