diff mbox

[net] ipv4: restore rt->fi for reference counting

Message ID CAM_iQpWxDKHF4N1xor7qj+5s91GrctgcGnzN0buw18qs7__-gQ@mail.gmail.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Cong Wang May 12, 2017, 1:22 a.m. UTC
On Thu, May 11, 2017 at 5:07 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> So, if I understand you correctly it is safe to NULL'ing
> nh_dev in NETDEV_UNREGISTER_FINAL, right?
>
> If still not, how about transfer nh_dev's to loopback_dev
> too in NETDEV_UNREGISTER? Like we transfer dst->dev.
>
> I don't want to touch the fast path to check for NULL, as
> it will change more code and slow down performance.

Finally I come up with the attached patch. Please let me know if
I still miss anything.
commit edc38ecab7101487b35fc9152e166a2670467e49
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Tue May 9 14:35:10 2017 -0700

    ipv4: restore rt->fi for reference counting, try #2
    
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Comments

Eric Dumazet May 12, 2017, 4:55 a.m. UTC | #1
On Thu, 2017-05-11 at 18:22 -0700, Cong Wang wrote:
> On Thu, May 11, 2017 at 5:07 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > So, if I understand you correctly it is safe to NULL'ing
> > nh_dev in NETDEV_UNREGISTER_FINAL, right?
> >
> > If still not, how about transfer nh_dev's to loopback_dev
> > too in NETDEV_UNREGISTER? Like we transfer dst->dev.
> >
> > I don't want to touch the fast path to check for NULL, as
> > it will change more code and slow down performance.
> 
> Finally I come up with the attached patch. Please let me know if
> I still miss anything.

You have not addressed my prior feedback, unless I am mistaken ?

fib_route_seq_show() and others do not expect fi->fib_dev suddenly
becoming NULL. 

RCU contract is more complicated than simply adding rcu grace period
before delete.

Thanks.
Julian Anastasov May 12, 2017, 6:39 a.m. UTC | #2
Hello,

On Thu, 11 May 2017, Cong Wang wrote:

> On Thu, May 11, 2017 at 5:07 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > So, if I understand you correctly it is safe to NULL'ing
> > nh_dev in NETDEV_UNREGISTER_FINAL, right?
> >
> > If still not, how about transfer nh_dev's to loopback_dev
> > too in NETDEV_UNREGISTER? Like we transfer dst->dev.
> >
> > I don't want to touch the fast path to check for NULL, as
> > it will change more code and slow down performance.
> 
> Finally I come up with the attached patch. Please let me know if
> I still miss anything.

	fib_flush will unlink the FIB infos at NETDEV_UNREGISTER
time, so we can not see them in any hash tables later on
NETDEV_UNREGISTER_FINAL. fib_put_nh_devs() can not work
except if moving NHs in another hash table but that should
not be needed.

	I'm thinking for the following solution which
is a bit hackish:

- on NETDEV_UNREGISTER we want to put the nh_dev references,
so fib_release_info is a good place. But as fib_release_info
is not always called, we will have two places that put
references. We can use such hack:

	- for example, use nh_oif to know if dev_put is
	already called

	- fib_release_info() should set nh_oif to 0 because
	it will now call dev_put without clearing nh_dev

	- free_fib_info_rcu() will not call dev_put if nh_oif is 0:
	if (nexthop_nh->nh_dev && nexthop_nh->nh_oif)
		dev_put(nexthop_nh->nh_dev);

	- as fi is unlinked, there is no chance fib_info_hashfn()
	to use the cleared nh_oif for hashing

	- we expect noone to touch nh_dev fields after fi is
	unlinked, this includes free_fib_info and free_fib_info_rcu

	What we achieve:

- between NETDEV_UNREGISTER and NETDEV_UNREGISTER_FINAL nh_dev
still points to our dev (and not to loopback, I think, this
answers your question in previous email), so route lookups
will not find loopback in the FIB tree. We do not want
some packets to be wrongly rerouted. Even if we don't
hold dev reference, the RCU grace period helps here.

- fib_dev/nh_dev != NULL checks are not needed, so this addresses
Eric's concerns. BTW, fib_route_seq_show actually checks
for fi->fib_dev, may be not in a safe way (lockless_dereference
needed?) but as we don't set nh_dev to NULL this is not
needed.

	What more? What about nh_pcpu_rth_output and
nh_rth_input holding routes? We should think about
them too. I should think more if nh_oif trick can work
for them, may be not because nh_oif is optional...
May be we should purge them somehow?

Regards

--
Julian Anastasov <ja@ssi.bg>
Cong Wang May 12, 2017, 5:27 p.m. UTC | #3
On Thu, May 11, 2017 at 11:39 PM, Julian Anastasov <ja@ssi.bg> wrote:
>
>         Hello,
>
> On Thu, 11 May 2017, Cong Wang wrote:
>
>> On Thu, May 11, 2017 at 5:07 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > So, if I understand you correctly it is safe to NULL'ing
>> > nh_dev in NETDEV_UNREGISTER_FINAL, right?
>> >
>> > If still not, how about transfer nh_dev's to loopback_dev
>> > too in NETDEV_UNREGISTER? Like we transfer dst->dev.
>> >
>> > I don't want to touch the fast path to check for NULL, as
>> > it will change more code and slow down performance.
>>
>> Finally I come up with the attached patch. Please let me know if
>> I still miss anything.
>
>         fib_flush will unlink the FIB infos at NETDEV_UNREGISTER
> time, so we can not see them in any hash tables later on
> NETDEV_UNREGISTER_FINAL. fib_put_nh_devs() can not work
> except if moving NHs in another hash table but that should
> not be needed.

Ah, I did miss the hlist_del() in fib_table_flush(), so we just need some
way to link those fib_nh together for NETDEV_UNREGISTER_FINAL,
a linked list should be sufficient, but requires adding list_head to fib_nh.

>
>         I'm thinking for the following solution which
> is a bit hackish:
>
> - on NETDEV_UNREGISTER we want to put the nh_dev references,
> so fib_release_info is a good place. But as fib_release_info
> is not always called, we will have two places that put
> references. We can use such hack:
>
>         - for example, use nh_oif to know if dev_put is
>         already called
>
>         - fib_release_info() should set nh_oif to 0 because
>         it will now call dev_put without clearing nh_dev

Are you sure we are safe to call dev_put() in fib_release_info()
for _all_ paths, especially non-unregister paths? See:

commit e49cc0da7283088c5e03d475ffe2fdcb24a6d5b1
Author: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Date:   Wed May 23 15:39:45 2012 +0000

    ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

But, hmm, very interesting, I always thought dev_put() triggers a
kfree() potentially, but here your suggestion actually leverages the fact
that it is merely a pcpu counter decrease, so for unregister path,
this is just giving refcnt back, which means it is safe as long as
we don't have any parallel unregister?  We should because of RTNL.

I see why you say this is hackish, really it is. ;) We have to ensure
the evil dev_put() is only called on unregister path.

>
>         - free_fib_info_rcu() will not call dev_put if nh_oif is 0:
>         if (nexthop_nh->nh_dev && nexthop_nh->nh_oif)
>                 dev_put(nexthop_nh->nh_dev);
>
>         - as fi is unlinked, there is no chance fib_info_hashfn()
>         to use the cleared nh_oif for hashing
>
>         - we expect noone to touch nh_dev fields after fi is
>         unlinked, this includes free_fib_info and free_fib_info_rcu
>
>         What we achieve:
>
> - between NETDEV_UNREGISTER and NETDEV_UNREGISTER_FINAL nh_dev
> still points to our dev (and not to loopback, I think, this
> answers your question in previous email), so route lookups
> will not find loopback in the FIB tree. We do not want
> some packets to be wrongly rerouted. Even if we don't
> hold dev reference, the RCU grace period helps here.

Thanks to dev_put() for not calling a netdev_freemem(). ;-)

>
> - fib_dev/nh_dev != NULL checks are not needed, so this addresses
> Eric's concerns. BTW, fib_route_seq_show actually checks
> for fi->fib_dev, may be not in a safe way (lockless_dereference
> needed?) but as we don't set nh_dev to NULL this is not
> needed.
>

I think Eric was complaining about the lack of rcu_dereference()
there.

>         What more? What about nh_pcpu_rth_output and
> nh_rth_input holding routes? We should think about
> them too. I should think more if nh_oif trick can work
> for them, may be not because nh_oif is optional...
> May be we should purge them somehow?
>

Or maybe don't touch nh_oif but using a new flag in
nh_flags?? We only need to know if we should call
dev_put() in free_fib_info_rcu().

Again, I am still not sure if it is any better than just
putting these fib_nh into a linked list.

Thanks.
Cong Wang May 12, 2017, 5:49 p.m. UTC | #4
On Thu, May 11, 2017 at 9:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2017-05-11 at 18:22 -0700, Cong Wang wrote:
>> On Thu, May 11, 2017 at 5:07 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > So, if I understand you correctly it is safe to NULL'ing
>> > nh_dev in NETDEV_UNREGISTER_FINAL, right?
>> >
>> > If still not, how about transfer nh_dev's to loopback_dev
>> > too in NETDEV_UNREGISTER? Like we transfer dst->dev.
>> >
>> > I don't want to touch the fast path to check for NULL, as
>> > it will change more code and slow down performance.
>>
>> Finally I come up with the attached patch. Please let me know if
>> I still miss anything.
>
> You have not addressed my prior feedback, unless I am mistaken ?
>
> fib_route_seq_show() and others do not expect fi->fib_dev suddenly
> becoming NULL.
>
> RCU contract is more complicated than simply adding rcu grace period
> before delete.
>

If you mean the lack of rcu_dereference() in fib_route_seq_show() ,
yeah, I don't fix it because I think it should be in a separate patch,
it is not this patch which introduces RCU to nh_dev as I explained
previously.

Or you mean anything else?
Cong Wang May 12, 2017, 8:58 p.m. UTC | #5
On Fri, May 12, 2017 at 10:27 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> Or maybe don't touch nh_oif but using a new flag in
> nh_flags?? We only need to know if we should call
> dev_put() in free_fib_info_rcu().
>
> Again, I am still not sure if it is any better than just
> putting these fib_nh into a linked list.
>

I am afraid we have to use a linked list, because either a flag or
nh_oif is per nh, but one nh could have multiple different nh_devs...
Cong Wang May 12, 2017, 9:13 p.m. UTC | #6
On Fri, May 12, 2017 at 1:58 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Fri, May 12, 2017 at 10:27 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> Or maybe don't touch nh_oif but using a new flag in
>> nh_flags?? We only need to know if we should call
>> dev_put() in free_fib_info_rcu().
>>
>> Again, I am still not sure if it is any better than just
>> putting these fib_nh into a linked list.
>>
>
> I am afraid we have to use a linked list, because either a flag or
> nh_oif is per nh, but one nh could have multiple different nh_devs...

Scratch that. Each fib_nh has one nh_dev...
I am trying to use a flag.
Julian Anastasov May 12, 2017, 9:27 p.m. UTC | #7
Hello,

On Fri, 12 May 2017, Cong Wang wrote:

> On Thu, May 11, 2017 at 11:39 PM, Julian Anastasov <ja@ssi.bg> wrote:
> >
> >         fib_flush will unlink the FIB infos at NETDEV_UNREGISTER
> > time, so we can not see them in any hash tables later on
> > NETDEV_UNREGISTER_FINAL. fib_put_nh_devs() can not work
> > except if moving NHs in another hash table but that should
> > not be needed.
> 
> Ah, I did miss the hlist_del() in fib_table_flush(), so we just need some
> way to link those fib_nh together for NETDEV_UNREGISTER_FINAL,
> a linked list should be sufficient, but requires adding list_head to fib_nh.

	It could be a fib_info_devhash_dead hash table, similar
to fib_info_devhash where we can hash NHs for dead fib infos
but then we will need a fib_clntref reference or otherwise last
user can drop the fib info before NETDEV_UNREGISTER_FINAL is called.
It will add more code in fib_sync_down_dev to know when
to call fib_info_hold. But OTOH, it will reduce the work
needed for careful release of the references in fib_release_info.

	So, we have 2 possible cases to consider:

1. route deleted, fib_release_info called, fi held by socket,
NETDEV_UNREGISTER can not see the NHs because they are
unlinked in fib_release_info. dev unreg delayed for long time...

2. NETDEV_UNREGISTER called first, before the NHs are
unlinked.

	Now the main question: is FIB_LOOKUP_NOREF used
everywhere in IPv4? I guess so. If not, it means
someone can walk its res->fi NHs which is bad. I think,
this will delay the unregistration for long time and we
can not solve the problem.

	If yes, free_fib_info() should not use call_rcu.
Instead, fib_release_info() will start RCU callback to
drop everything via a common function for fib_release_info
and free_fib_info. As result, the last fib_info_put will
just need to free fi->fib_metrics and fi.

	Something like:

/* Long living data */
fib_info_destroy:
{
	if (fi->fib_dead == 0) {
		pr_warn("Freeing alive fib_info %p\n", fi);
		return;
	}

	if (fi->fib_metrics != (u32 *) dst_default_metrics)
		kfree(fi->fib_metrics);
	kfree(fi);
}

/* Do not imply RCU grace period anymore, last user is last */
fib_info_put():
{
	if (atomic_dec_and_test(&fi->fib_clntref))
		fib_info_destroy(fi);
}

/* Release everything except long living fields (fib_metrics) */
fib_info_release():
{
        change_nexthops(fi) {
                if (nexthop_nh->nh_dev)
                        dev_put(nexthop_nh->nh_dev);
                lwtstate_put(nexthop_nh->nh_lwtstate);
                free_nh_exceptions(nexthop_nh);
                rt_fibinfo_free_cpus(nexthop_nh->nh_pcpu_rth_output);
                rt_fibinfo_free(&nexthop_nh->nh_rth_input);
        } endfor_nexthops(fi);
}

/* RCU grace period after unlink */
fib_release_info_rcu():
{
	struct fib_info *fi = container_of(head, struct fib_info, rcu);

	fib_info_release(fi);
	fib_info_put(fi);
}

/* Called just on error */
free_fib_info():
{
	fib_info_release(fi);
	fib_info_put(fi);
}

fib_release_info():
{
	...
	fi->fib_dead = 1;
-	fib_info_put(fi);
+	call_rcu(&fi->rcu, fib_release_info_rcu);
}

> >         I'm thinking for the following solution which
> > is a bit hackish:
> >
> > - on NETDEV_UNREGISTER we want to put the nh_dev references,
> > so fib_release_info is a good place. But as fib_release_info
> > is not always called, we will have two places that put
> > references. We can use such hack:
> >
> >         - for example, use nh_oif to know if dev_put is
> >         already called
> >
> >         - fib_release_info() should set nh_oif to 0 because
> >         it will now call dev_put without clearing nh_dev
> 
> Are you sure we are safe to call dev_put() in fib_release_info()
> for _all_ paths, especially non-unregister paths? See:

	Yep, dev_put is safe there...

> commit e49cc0da7283088c5e03d475ffe2fdcb24a6d5b1
> Author: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> Date:   Wed May 23 15:39:45 2012 +0000
> 
>     ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

	...as long as we do not set nh_dev to NULL

> But, hmm, very interesting, I always thought dev_put() triggers a
> kfree() potentially, but here your suggestion actually leverages the fact
> that it is merely a pcpu counter decrease, so for unregister path,
> this is just giving refcnt back, which means it is safe as long as
> we don't have any parallel unregister?  We should because of RTNL.

	Yes, we are in the context of unregistration and
there is a rcu_barrier() that prevents device to be
destroyed while there are RCU users. Refcnt reaching 0 is
not enough to free dev, RCU users must be considered,
they do not get reference.

> I see why you say this is hackish, really it is. ;) We have to ensure
> the evil dev_put() is only called on unregister path.

	Or after a RCU grace period but not later...

> > - fib_dev/nh_dev != NULL checks are not needed, so this addresses
> > Eric's concerns. BTW, fib_route_seq_show actually checks
> > for fi->fib_dev, may be not in a safe way (lockless_dereference
> > needed?) but as we don't set nh_dev to NULL this is not
> > needed.
> >
> 
> I think Eric was complaining about the lack of rcu_dereference()
> there.

	Not needed if we don't set nh_dev to NULL.

> >         What more? What about nh_pcpu_rth_output and
> > nh_rth_input holding routes? We should think about
> > them too. I should think more if nh_oif trick can work
> > for them, may be not because nh_oif is optional...
> > May be we should purge them somehow?
> >
> 
> Or maybe don't touch nh_oif but using a new flag in
> nh_flags?? We only need to know if we should call
> dev_put() in free_fib_info_rcu().

	fib_dead is a better option if we decide to use
such solution: 1=not called by fib_release_info,
2=called by fib_release_info.

Regards

--
Julian Anastasov <ja@ssi.bg>
diff mbox

Patch

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 6692c57..0f04f4d 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -392,6 +392,7 @@  int fib_unmerge(struct net *net);
 /* Exported by fib_semantics.c */
 int ip_fib_check_default(__be32 gw, struct net_device *dev);
 int fib_sync_down_dev(struct net_device *dev, unsigned long event, bool force);
+void fib_put_nh_devs(struct net_device *dev);
 int fib_sync_down_addr(struct net_device *dev, __be32 local);
 int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
 
diff --git a/include/net/route.h b/include/net/route.h
index 2cc0e14..4335eb7 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -69,6 +69,7 @@  struct rtable {
 
 	struct list_head	rt_uncached;
 	struct uncached_list	*rt_uncached_list;
+	struct fib_info		*fi; /* for refcnt to shared metrics */
 };
 
 static inline bool rt_is_input_route(const struct rtable *rt)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 39bd1ed..59b0a1d 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1178,6 +1178,9 @@  static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 		fib_disable_ip(dev, event, true);
 		rt_flush_dev(dev);
 		return NOTIFY_DONE;
+	} else if (event == NETDEV_UNREGISTER_FINAL) {
+		fib_put_nh_devs(dev);
+		return NOTIFY_DONE;
 	}
 
 	in_dev = __in_dev_get_rtnl(dev);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index da449dd..b85c5bb 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1453,6 +1453,36 @@  int fib_sync_down_dev(struct net_device *dev, unsigned long event, bool force)
 	return ret;
 }
 
+/* We have to release these nh_dev here because a dst could still hold a
+ * fib_info via rt->fi, we can't wait for GC, a socket could hold the dst
+ * for a long time.
+ */
+void fib_put_nh_devs(struct net_device *dev)
+{
+	struct fib_info *prev_fi = NULL;
+	unsigned int hash = fib_devindex_hashfn(dev->ifindex);
+	struct hlist_head *head = &fib_info_devhash[hash];
+	struct fib_nh *nh;
+
+	hlist_for_each_entry(nh, head, nh_hash) {
+		struct fib_info *fi = nh->nh_parent;
+
+		if (nh->nh_dev != dev || fi == prev_fi)
+			continue;
+		prev_fi = fi;
+		change_nexthops(fi) {
+			if (nexthop_nh->nh_dev == dev) {
+				/* This should be safe, we are on unregister
+				 * path, after synchronize_net() and
+				 * rcu_barrier(), no one could see this.
+				 */
+				RCU_INIT_POINTER(nexthop_nh->nh_dev, NULL);
+				dev_put(dev);
+			}
+		} endfor_nexthops(fi)
+	}
+}
+
 /* Must be invoked inside of an RCU protected region.  */
 static void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
 {
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 655d9ee..f647310 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1387,6 +1387,11 @@  static void ipv4_dst_destroy(struct dst_entry *dst)
 {
 	struct rtable *rt = (struct rtable *) dst;
 
+	if (rt->fi) {
+		fib_info_put(rt->fi);
+		rt->fi = NULL;
+	}
+
 	if (!list_empty(&rt->rt_uncached)) {
 		struct uncached_list *ul = rt->rt_uncached_list;
 
@@ -1424,6 +1429,16 @@  static bool rt_cache_valid(const struct rtable *rt)
 		!rt_is_expired(rt);
 }
 
+static void rt_init_metrics(struct rtable *rt, struct fib_info *fi)
+{
+	if (fi->fib_metrics != (u32 *)dst_default_metrics) {
+		fib_info_hold(fi);
+		rt->fi = fi;
+	}
+
+	dst_init_metrics(&rt->dst, fi->fib_metrics, true);
+}
+
 static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 			   const struct fib_result *res,
 			   struct fib_nh_exception *fnhe,
@@ -1438,7 +1453,7 @@  static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 			rt->rt_gateway = nh->nh_gw;
 			rt->rt_uses_gateway = 1;
 		}
-		dst_init_metrics(&rt->dst, fi->fib_metrics, true);
+		rt_init_metrics(rt, fi);
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		rt->dst.tclassid = nh->nh_tclassid;
 #endif
@@ -1490,6 +1505,7 @@  struct rtable *rt_dst_alloc(struct net_device *dev,
 		rt->rt_gateway = 0;
 		rt->rt_uses_gateway = 0;
 		rt->rt_table_id = 0;
+		rt->fi = NULL;
 		INIT_LIST_HEAD(&rt->rt_uncached);
 
 		rt->dst.output = ip_output;