Message ID | ca64de092db5a2ac80d22eaa9d662520@codeaurora.org |
---|---|
State | Superseded |
Headers | show |
Series | Refcount mismatch when unregistering netdevice from kernel | expand |
On 12/8/20 4:55 AM, stranche@codeaurora.org wrote: > Hi everyone, > > We've recently been investigating a refcount problem when unregistering a netdevice from the kernel. It seems that the IPv6 module is still holding various references to the inet6_dev associated with the main netdevice struct that are not being released, preventing the unregistration from completing. > > After tracing the various locations that take/release refcounts to these two structs, we see that there are mismatches in the refcount for the inet6_dev in the DST path when there are routes flushed with the rt6_uncached_list_flush_dev() function during rt6_disable_ip() when the device is unregistering. It seems that usually these references are freed via ip6_dst_ifdown() in the dst_ops->ifdown callback from dst_dev_put(), but this callback is not executed in the rt6_uncached_list_flush_dev() function. Instead, a reference to the inet6_dev is simply dropped to account for the inet6_dev_get() in ip6_rt_copy_init(). > > Unfortunately, this does not seem to be sufficient, as these uncached routes have an additional refcount on the inet6_device attached to them from the DST allocation. In the normal case, this reference from the DST allocation will happen in the dst_ops->destroy() callback in the dst_destroy() function when the DST is being freed. However, since rt6_uncached_list_flush_dev() changes the inet6_device stored in the DST to the loopback device, the dst_ops->destroy() callback doesn't decrement the refcount on the correct inet6_dev struct. > > We're wondering if it would be appropriate to put() the refcount additionally for the uncached routes when flushing out the list for the unregistering device. Perhaps something like the following? > dev refcount imbalances are quite common, particularly on old kernel versions, lacking few fixes. First thing is to let us know on which kernel version you see this, and how you reproduce it.
Hi Sean, Thanks for reporting the issue. Like Eric said, if you could provide the kernel version you are using, that would be great. And if you have a reproducer, that would be even better. I have a few comments inline below. On Tue, Dec 8, 2020 at 7:08 AM Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > > On 12/8/20 4:55 AM, stranche@codeaurora.org wrote: > > Hi everyone, > > > > We've recently been investigating a refcount problem when unregistering a netdevice from the kernel. It seems that the IPv6 module is still holding various references to the inet6_dev associated with the main netdevice struct that are not being released, preventing the unregistration from completing. > > > > After tracing the various locations that take/release refcounts to these two structs, we see that there are mismatches in the refcount for the inet6_dev in the DST path when there are routes flushed with the rt6_uncached_list_flush_dev() function during rt6_disable_ip() when the device is unregistering. It seems that usually these references are freed via ip6_dst_ifdown() in the dst_ops->ifdown callback from dst_dev_put(), but this callback is not executed in the rt6_uncached_list_flush_dev() function. Instead, a reference to the inet6_dev is simply dropped to account for the inet6_dev_get() in ip6_rt_copy_init(). > > rt6_uncached_list_flush_dev() actually tries to replace the inet6_dev with loopback_dev, and release the reference to the previous inet6_dev by calling in6_dev_put(), which is actually doing the same thing as ip6_dst_ifdown(). I don't understand why you say " a reference to the inet6_dev is simply dropped". > > Unfortunately, this does not seem to be sufficient, as these uncached routes have an additional refcount on the inet6_device attached to them from the DST allocation. In the normal case, this reference from the DST allocation will happen in the dst_ops->destroy() callback in the dst_destroy() function when the DST is being freed. However, since rt6_uncached_list_flush_dev() changes the inet6_device stored in the DST to the loopback device, the dst_ops->destroy() callback doesn't decrement the refcount on the correct inet6_dev struct. > > The additional refcount to the DST is also released by doing the following: if (rt_dev == dev) { rt->dst.dev = blackhole_netdev; dev_hold(rt->dst.dev); dev_put(rt_dev); } Am I missing something? > > We're wondering if it would be appropriate to put() the refcount additionally for the uncached routes when flushing out the list for the unregistering device. Perhaps something like the following? > > > > dev refcount imbalances are quite common, particularly on old kernel versions, lacking few fixes. > > First thing is to let us know on which kernel version you see this, and how you reproduce it. >
Hi Wei and Eric, Thanks for the replies. This was reported to us on the 5.4.61 kernel during a customer regression suite, so we don't have an exact reproducer unfortunately. From the trace logs we've added it seems like this is happening during IPv6 transport mode XFRM data transfer and the device is unregistered in the middle of it, but we've been unable to reproduce it ourselves.. We're open to trying out and sharing debug patches if needed though. > rt6_uncached_list_flush_dev() actually tries to replace the inet6_dev > with loopback_dev, and release the reference to the previous inet6_dev > by calling in6_dev_put(), which is actually doing the same thing as > ip6_dst_ifdown(). I don't understand why you say " a reference to the > inet6_dev is simply dropped". Fair. I was going off the semantics used by the dst_dev_put() function which calls dst_ops->ifdown() explicitly. At least in the case of xfrm6_dst_ifdown() this swap of the loopback device and putting the refcount seems like it could be missing a few things. > The additional refcount to the DST is also released by doing the > following: > if (rt_dev == dev) { > rt->dst.dev = blackhole_netdev; > dev_hold(rt->dst.dev); > dev_put(rt_dev); > } > Am I missing something? That dev_put() is on the actual netdevice struct, not the inet6_dev associated with it. We're seeing many calls to icmp6_dst_alloc() and xfrm6_fill_dst() here, both of which seem to associate a reference to the inet6_dev struct with the DST in addition to the standard dev_hold() on the netdevice during the dst_alloc()/dst_init(). Thanks, Sean
On Tue, Dec 8, 2020 at 11:13 AM <stranche@codeaurora.org> wrote: > > Hi Wei and Eric, > > Thanks for the replies. > > This was reported to us on the 5.4.61 kernel during a customer > regression suite, so we don't have an exact reproducer unfortunately. > From the trace logs we've added it seems like this is happening during > IPv6 transport mode XFRM data transfer and the device is unregistered in > the middle of it, but we've been unable to reproduce it ourselves.. > We're open to trying out and sharing debug patches if needed though. > I double checked 5.4.61, and I didn't find any missing fixes in this area AFAICT. > > rt6_uncached_list_flush_dev() actually tries to replace the inet6_dev > > with loopback_dev, and release the reference to the previous inet6_dev > > by calling in6_dev_put(), which is actually doing the same thing as > > ip6_dst_ifdown(). I don't understand why you say " a reference to the > > inet6_dev is simply dropped". > > Fair. I was going off the semantics used by the dst_dev_put() function > which calls dst_ops->ifdown() explicitly. At least in the case of > xfrm6_dst_ifdown() this swap of the loopback device and putting the > refcount seems like it could be missing a few things. > Looking more into the xfrm code, I think the major difference between xfrm dst and non-xfrm dst is that, xfrm code creates a list of dst entries in one dst_entry linked by xfrm_dst_child(). In xfrm_bundle_create(), which I believe is the main function to create xfrm dst bundles, it allocates the main dst entry and its children, and it calls xfrm_fill_dst() for each of them. So I think each dst in the list (including all the children) are added into the uncached_list. The difference between the current code in rt6_uncached_list_flush_dev() vs dst_ops->ifdown() is that, the current code only releases the refcnt to inet6_dev associated with the main dst, while xfrm6_dst_ifdown() tries to release inet6_dev associated with every dst linked by xfrm_dst_child(). However, since xfrm_bundle_create() anyway adds each child dst to the uncached list, I don't see how the current code could miss any refcnt. BTW, have you tried your previous proposed patch and confirmed it would fix the issue? > > The additional refcount to the DST is also released by doing the > > following: > > if (rt_dev == dev) { > > rt->dst.dev = blackhole_netdev; > > dev_hold(rt->dst.dev); > > dev_put(rt_dev); > > } > > Am I missing something? > > That dev_put() is on the actual netdevice struct, not the inet6_dev > associated with it. We're seeing many calls to icmp6_dst_alloc() and > xfrm6_fill_dst() here, both of which seem to associate a reference to > the inet6_dev struct with the DST in addition to the standard dev_hold() > on the netdevice during the dst_alloc()/dst_init(). > Could we further distinguish between dst added to the uncached list by icmp6_dst_alloc() and xfrm6_fill_dst(), and confirm which ones are the ones leaking reference? I suspect it would be the xfrm ones, but I think it is worth verifying. > Thanks, > Sean
On 12/8/20 2:51 PM, Wei Wang wrote: > On Tue, Dec 8, 2020 at 11:13 AM <stranche@codeaurora.org> wrote: >> >> Hi Wei and Eric, >> >> Thanks for the replies. >> >> This was reported to us on the 5.4.61 kernel during a customer >> regression suite, so we don't have an exact reproducer unfortunately. >> From the trace logs we've added it seems like this is happening during >> IPv6 transport mode XFRM data transfer and the device is unregistered in >> the middle of it, but we've been unable to reproduce it ourselves.. >> We're open to trying out and sharing debug patches if needed though. >> > > I double checked 5.4.61, and I didn't find any missing fixes in this > area AFAICT. > >>> rt6_uncached_list_flush_dev() actually tries to replace the inet6_dev >>> with loopback_dev, and release the reference to the previous inet6_dev >>> by calling in6_dev_put(), which is actually doing the same thing as >>> ip6_dst_ifdown(). I don't understand why you say " a reference to the >>> inet6_dev is simply dropped". >> >> Fair. I was going off the semantics used by the dst_dev_put() function >> which calls dst_ops->ifdown() explicitly. At least in the case of >> xfrm6_dst_ifdown() this swap of the loopback device and putting the >> refcount seems like it could be missing a few things. >> > > Looking more into the xfrm code, I think the major difference between > xfrm dst and non-xfrm dst is that, xfrm code creates a list of dst > entries in one dst_entry linked by xfrm_dst_child(). > In xfrm_bundle_create(), which I believe is the main function to > create xfrm dst bundles, it allocates the main dst entry and its > children, and it calls xfrm_fill_dst() for each of them. So I think > each dst in the list (including all the children) are added into the > uncached_list. > The difference between the current code in > rt6_uncached_list_flush_dev() vs dst_ops->ifdown() is that, the > current code only releases the refcnt to inet6_dev associated with the > main dst, while xfrm6_dst_ifdown() tries to release inet6_dev > associated with every dst linked by xfrm_dst_child(). However, since > xfrm_bundle_create() anyway adds each child dst to the uncached list, > I don't see how the current code could miss any refcnt. > BTW, have you tried your previous proposed patch and confirmed it > would fix the issue? > >>> The additional refcount to the DST is also released by doing the >>> following: >>> if (rt_dev == dev) { >>> rt->dst.dev = blackhole_netdev; >>> dev_hold(rt->dst.dev); >>> dev_put(rt_dev); >>> } >>> Am I missing something? >> >> That dev_put() is on the actual netdevice struct, not the inet6_dev >> associated with it. We're seeing many calls to icmp6_dst_alloc() and >> xfrm6_fill_dst() here, both of which seem to associate a reference to >> the inet6_dev struct with the DST in addition to the standard dev_hold() >> on the netdevice during the dst_alloc()/dst_init(). >> > > Could we further distinguish between dst added to the uncached list by > icmp6_dst_alloc() and xfrm6_fill_dst(), and confirm which ones are the > ones leaking reference? > I suspect it would be the xfrm ones, but I think it is worth verifying. > Finally found the reference: tools/testing/selftests/net/l2tp.sh at one point was triggering a refcount leak: https://lore.kernel.org/netdev/20190801235421.8344-1-dsahern@kernel.org/ And then Colin found more problems with it: https://lore.kernel.org/netdev/450f5abb-5fe8-158d-d267-4334e15f8e58@canonical.com/ running that on a 5.8 kernel on Ubuntu 20.10 did not trigger the problem. Neither did Ubuntu 20.04 with 5.4.0-51-generic. Can you run it on your 5.4 version and see?
diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 6602f43..554b07b 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -156,9 +156,11 @@ void rt6_uncached_list_del(struct rt6_info *rt) char rt6_uncached_list_flush_dev_log1[1000][512]; int rt6_uncached_list_flush_dev_log1_iter = 0; -static void rt6_uncached_list_flush_dev(struct net *net, struct net_device *dev) +static void rt6_uncached_list_flush_dev(struct net *net, struct net_device *dev, + unsigned long event) { struct net_device *loopback_dev = net->loopback_dev; + bool unreg = event == NETDEV_UNREGISTER; int cpu;