diff mbox

[net] ipv4: restore rt->fi for reference counting

Message ID CAM_iQpXtWUZoGS_KEqfZc+ZbzYxaF1hbAV7H9Qd82=r4auJojw@mail.gmail.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Cong Wang May 15, 2017, 6:34 p.m. UTC
On Fri, May 12, 2017 at 2:27 PM, Julian Anastasov <ja@ssi.bg> wrote:
>         Now the main question: is FIB_LOOKUP_NOREF used
> everywhere in IPv4? I guess so. If not, it means
> someone can walk its res->fi NHs which is bad. I think,
> this will delay the unregistration for long time and we
> can not solve the problem.
>
>         If yes, free_fib_info() should not use call_rcu.
> Instead, fib_release_info() will start RCU callback to
> drop everything via a common function for fib_release_info
> and free_fib_info. As result, the last fib_info_put will
> just need to free fi->fib_metrics and fi.


Yes it is used. But this is a different problem from the
dev refcnt issue, right? I can send a separate patch to
address it.


>> Are you sure we are safe to call dev_put() in fib_release_info()
>> for _all_ paths, especially non-unregister paths? See:
>
>         Yep, dev_put is safe there...
>
>> commit e49cc0da7283088c5e03d475ffe2fdcb24a6d5b1
>> Author: Yanmin Zhang <yanmin_zhang@linux.intel.com>
>> Date:   Wed May 23 15:39:45 2012 +0000
>>
>>     ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
>
>         ...as long as we do not set nh_dev to NULL
>

OK, fair enough, then I think the best solution here is to move
the dev_put() from free_fib_info_rcu() to fib_release_info(),
fib_nh is already removed from hash there anyway.




Thanks!

Comments

Julian Anastasov May 15, 2017, 8:37 p.m. UTC | #1
Hello,

On Mon, 15 May 2017, Cong Wang wrote:

> On Fri, May 12, 2017 at 2:27 PM, Julian Anastasov <ja@ssi.bg> wrote:
> >         Now the main question: is FIB_LOOKUP_NOREF used
> > everywhere in IPv4? I guess so. If not, it means
> > someone can walk its res->fi NHs which is bad. I think,
> > this will delay the unregistration for long time and we
> > can not solve the problem.
> >
> >         If yes, free_fib_info() should not use call_rcu.
> > Instead, fib_release_info() will start RCU callback to
> > drop everything via a common function for fib_release_info
> > and free_fib_info. As result, the last fib_info_put will
> > just need to free fi->fib_metrics and fi.
> 
> 
> Yes it is used. But this is a different problem from the
> dev refcnt issue, right? I can send a separate patch to
> address it.

	Any user that does not set FIB_LOOKUP_NOREF
will need nh_dev refcounts. The assumption is that the
NHs are accessed, who knows, may be even after RCU grace
period. As result, we can not use dev_put on NETDEV_UNREGISTER.
So, we should check if there are users that do not
set FIB_LOOKUP_NOREF, at first look, I don't see such ones
for IPv4.

> >> Are you sure we are safe to call dev_put() in fib_release_info()
> >> for _all_ paths, especially non-unregister paths? See:
> >
> >         Yep, dev_put is safe there...
> >
> >> commit e49cc0da7283088c5e03d475ffe2fdcb24a6d5b1
> >> Author: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> >> Date:   Wed May 23 15:39:45 2012 +0000
> >>
> >>     ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
> >
> >         ...as long as we do not set nh_dev to NULL
> >
> 
> OK, fair enough, then I think the best solution here is to move
> the dev_put() from free_fib_info_rcu() to fib_release_info(),
> fib_nh is already removed from hash there anyway.

	free_fib_info still needs to put the references,
that is the reason for the common fib_info_release() in
my example. It happens in fib_create_info() where free_fib_info()
is called. The func names in my example can be corrected,
if needed.

> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index da449dd..cb712d1 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -205,8 +205,6 @@ static void free_fib_info_rcu(struct rcu_head *head)
>         struct fib_info *fi = container_of(head, struct fib_info, rcu);
> 
>         change_nexthops(fi) {
> -               if (nexthop_nh->nh_dev)
> -                       dev_put(nexthop_nh->nh_dev);
>                 lwtstate_put(nexthop_nh->nh_lwtstate);
>                 free_nh_exceptions(nexthop_nh);
>                 rt_fibinfo_free_cpus(nexthop_nh->nh_pcpu_rth_output);
> @@ -246,6 +244,14 @@ void fib_release_info(struct fib_info *fi)
>                         if (!nexthop_nh->nh_dev)
>                                 continue;
>                         hlist_del(&nexthop_nh->nh_hash);
> +                       /* We have to release these nh_dev here because a dst
> +                        * could still hold a fib_info via rt->fi, we can't wait
> +                        * for GC, a socket could hold the dst for a long time.
> +                        *
> +                        * This is safe, dev_put() alone does not really free
> +                        * the netdevice, we just have to put the refcnt back.
> +                        */
> +                       dev_put(nexthop_nh->nh_dev);
>                 } endfor_nexthops(fi)
>                 fi->fib_dead = 1;

	Such solution needs the fib_dead = 1|2 game to
know who dropped the nh_dev reference, fib_release_info (2) or
fib_create_info (1). You can not remove the dev_put calls
from free_fib_info_rcu.

>                 fib_info_put(fi);

Regards

--
Julian Anastasov <ja@ssi.bg>
Cong Wang May 15, 2017, 10:13 p.m. UTC | #2
On Mon, May 15, 2017 at 1:37 PM, Julian Anastasov <ja@ssi.bg> wrote:
>         Any user that does not set FIB_LOOKUP_NOREF
> will need nh_dev refcounts. The assumption is that the
> NHs are accessed, who knows, may be even after RCU grace
> period. As result, we can not use dev_put on NETDEV_UNREGISTER.
> So, we should check if there are users that do not
> set FIB_LOOKUP_NOREF, at first look, I don't see such ones
> for IPv4.

I see, although we do have FIB_LOOKUP_NOREF set all the times,
there are other places we hold fib_clntref too, for example
mlxsw_sp_router_fib_event_work(), it actually uses nh_dev too...

So I am afraid moving dev_put() to fib_release_info() is not a solution
here. I have to rethink about it.
Julian Anastasov May 16, 2017, 7:46 a.m. UTC | #3
Hello,

On Mon, 15 May 2017, Cong Wang wrote:

> On Mon, May 15, 2017 at 1:37 PM, Julian Anastasov <ja@ssi.bg> wrote:
> >         Any user that does not set FIB_LOOKUP_NOREF
> > will need nh_dev refcounts. The assumption is that the
> > NHs are accessed, who knows, may be even after RCU grace
> > period. As result, we can not use dev_put on NETDEV_UNREGISTER.
> > So, we should check if there are users that do not
> > set FIB_LOOKUP_NOREF, at first look, I don't see such ones
> > for IPv4.
> 
> I see, although we do have FIB_LOOKUP_NOREF set all the times,
> there are other places we hold fib_clntref too, for example
> mlxsw_sp_router_fib_event_work(), it actually uses nh_dev too...
> 
> So I am afraid moving dev_put() to fib_release_info() is not a solution
> here. I have to rethink about it.

	At first look, they use fib_info_hold() to get fib_clntref 
reference from places where fib_treeref is not fatally decreased
to 0 but later a work is used which finishes the job. I guess, we
can convert such places to use just a fib_treeref reference.
They can use such new method instead of fib_info_hold:

void fib_treeref_get(struct fib_info *fi)
{
	spin_lock_bh(&fib_info_lock);
	fi->fib_treeref++;
	spin_unlock_bh(&fib_info_lock);
}

They will use fib_release_info() to put the reference. But
on FIB_EVENT_ENTRY_DEL there is a small window where the
scheduled work delays the unlink of fib info from the
hash tables, i.e. there is a risk fib_find_info to reuse
a dead fib info.

	May be we can add a fi->fib_flags & RTNH_F_DEAD
check there but the problem is that it is set also on
NETDEV_DOWN. While attempts to add route to device with
!(dev->flags & IFF_UP) is rejected by fib_check_nh(),
fib_create_info still can create routes when
cfg->fc_scope == RT_SCOPE_HOST. So, RTNH_F_DEAD check
in fib_find_info can avoid the reuse of fib info for
host routes while device is down but not unregistered.
As result, many fib infos can be created instead of one
that is reused. Adding new RTNH_F_* flag to properly handle
this does not look good...

Regards

--
Julian Anastasov <ja@ssi.bg>
Cong Wang May 16, 2017, 5:53 p.m. UTC | #4
On Tue, May 16, 2017 at 12:46 AM, Julian Anastasov <ja@ssi.bg> wrote:
>
>         Hello,
>
> On Mon, 15 May 2017, Cong Wang wrote:
>
>> On Mon, May 15, 2017 at 1:37 PM, Julian Anastasov <ja@ssi.bg> wrote:
>> >         Any user that does not set FIB_LOOKUP_NOREF
>> > will need nh_dev refcounts. The assumption is that the
>> > NHs are accessed, who knows, may be even after RCU grace
>> > period. As result, we can not use dev_put on NETDEV_UNREGISTER.
>> > So, we should check if there are users that do not
>> > set FIB_LOOKUP_NOREF, at first look, I don't see such ones
>> > for IPv4.
>>
>> I see, although we do have FIB_LOOKUP_NOREF set all the times,
>> there are other places we hold fib_clntref too, for example
>> mlxsw_sp_router_fib_event_work(), it actually uses nh_dev too...
>>
>> So I am afraid moving dev_put() to fib_release_info() is not a solution
>> here. I have to rethink about it.
>
>         At first look, they use fib_info_hold() to get fib_clntref
> reference from places where fib_treeref is not fatally decreased
> to 0 but later a work is used which finishes the job. I guess, we
> can convert such places to use just a fib_treeref reference.
> They can use such new method instead of fib_info_hold:
>
> void fib_treeref_get(struct fib_info *fi)
> {
>         spin_lock_bh(&fib_info_lock);
>         fi->fib_treeref++;
>         spin_unlock_bh(&fib_info_lock);
> }
>
> They will use fib_release_info() to put the reference. But
> on FIB_EVENT_ENTRY_DEL there is a small window where the
> scheduled work delays the unlink of fib info from the
> hash tables, i.e. there is a risk fib_find_info to reuse
> a dead fib info.
>

Right, that seems risky.

How about adding a check for ->fib_dead in these work's?
If the last treeref is gone, probably it is no longer needed
to continue to do the offloading work.
Cong Wang May 16, 2017, 6:16 p.m. UTC | #5
On Tue, May 16, 2017 at 10:53 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> How about adding a check for ->fib_dead in these work's?
> If the last treeref is gone, probably it is no longer needed
> to continue to do the offloading work.

Hmm, this doesn't look correct either, we may miss an offloading work
if fib_release_info() comes first.
diff mbox

Patch

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index da449dd..cb712d1 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -205,8 +205,6 @@  static void free_fib_info_rcu(struct rcu_head *head)
        struct fib_info *fi = container_of(head, struct fib_info, rcu);

        change_nexthops(fi) {
-               if (nexthop_nh->nh_dev)
-                       dev_put(nexthop_nh->nh_dev);
                lwtstate_put(nexthop_nh->nh_lwtstate);
                free_nh_exceptions(nexthop_nh);
                rt_fibinfo_free_cpus(nexthop_nh->nh_pcpu_rth_output);
@@ -246,6 +244,14 @@  void fib_release_info(struct fib_info *fi)
                        if (!nexthop_nh->nh_dev)
                                continue;
                        hlist_del(&nexthop_nh->nh_hash);
+                       /* We have to release these nh_dev here because a dst
+                        * could still hold a fib_info via rt->fi, we can't wait
+                        * for GC, a socket could hold the dst for a long time.
+                        *
+                        * This is safe, dev_put() alone does not really free
+                        * the netdevice, we just have to put the refcnt back.
+                        */
+                       dev_put(nexthop_nh->nh_dev);
                } endfor_nexthops(fi)
                fi->fib_dead = 1;
                fib_info_put(fi);