Message ID | 1492704599476.84165@Dell.com |
---|---|
State | Not Applicable, archived |
Delegated to: | David Miller |
Headers | show |
On Thu, Apr 20, 2017 at 04:09:56PM +0000, Joe.Ghalam@dell.com wrote:
> I agree with this change, but the same purge would be needed for the macvlan_dellink() call also.
I don't think that's necessary because as long as the master
device is still around it will continue to process the broadcast
queue, thus removing any reference counts held.
It's only when the queue is purged that we run into trouble.
Cheers,
That's not true. macvlan_dellink() unregisters the queue, and macvlan_process_broadcast() will never get called. Please note that I'm not speculating. I have traced enabled on the dev_put and dev_hold, and I'm reporting a real, reproducible issue. Her is a sequence of calls logged, when the issue happens. macvlan_process_broadcast() never happens. Apr 19 04:35:39 OS10 kernel: e101-001-0.v257: dev_put 16 dst_destroy Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_hold 15 dev_get_by_index Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 16 do_ip_setsockopt Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_hold 15 dst_alloc Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 16 dst_destroy Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_hold 15 macvlan_broadcast_enqueue Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 16 macvlan_process_broadcast Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_hold 15 macvlan_broadcast_enqueue Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 16 macvlan_process_broadcast Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_hold 15 macvlan_broadcast_enqueue <---insert Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 16 neigh_destroy Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 15 neigh_destroy Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 14 neigh_destroy Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 13 dst_destroy Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 12 dst_destroy Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 11 __netdev_adjacent_dev_remove <--- macvlan_dellink() Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 10 __netdev_adjacent_dev_remove <--- macvlan_dellink() Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 9 neigh_parms_release Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 8 neigh_parms_release Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 7 in6_dev_finish_destroy Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 6 rx_queue_release Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 5 netdev_queue_release Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 4 rollback_registered_many Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 3 free_fib_info_rcu
Please DO NOT top-post. Thank you.
On Fri, Apr 21, 2017 at 7:40 AM, <Joe.Ghalam@dell.com> wrote: > That's not true. macvlan_dellink() unregisters the queue, and macvlan_process_broadcast() will never get called. Please note that I'm not speculating. I have traced enabled on the dev_put and dev_hold, and I'm reporting a real, reproducible issue. > Her is a sequence of calls logged, when the issue happens. macvlan_process_broadcast() never happens. > > Apr 19 04:35:39 OS10 kernel: e101-001-0.v257: dev_put 16 dst_destroy > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_hold 15 dev_get_by_index > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 16 do_ip_setsockopt > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_hold 15 dst_alloc > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 16 dst_destroy > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_hold 15 macvlan_broadcast_enqueue > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 16 macvlan_process_broadcast > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_hold 15 macvlan_broadcast_enqueue > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 16 macvlan_process_broadcast > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_hold 15 macvlan_broadcast_enqueue <---insert > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 16 neigh_destroy > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 15 neigh_destroy > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 14 neigh_destroy > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 13 dst_destroy > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 12 dst_destroy > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 11 __netdev_adjacent_dev_remove <--- macvlan_dellink() > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 10 __netdev_adjacent_dev_remove <--- macvlan_dellink() > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 9 neigh_parms_release > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 8 neigh_parms_release > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 7 in6_dev_finish_destroy > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 6 rx_queue_release > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 5 netdev_queue_release > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 4 rollback_registered_many > Apr 19 04:35:41 OS10 kernel: e101-001-0.v257: dev_put 3 free_fib_info_rcu > May be the system is busy and snapshot is too small, and eventually process_broadcast() should get called. Deleting a slave does nothing about cancelling the work-queue so it would happen eventually. The change that Herbert proposed is correct. When packets are enqueued for processing later a dev reference is taken and it's removed when it's processed when it gets scheduled. The backlog is per port so it makes sense to remove reference(s) before purging the queue prior to deleting the port. > ________________________________________ > From: Herbert Xu <herbert@gondor.apana.org.au> > Sent: Thursday, April 20, 2017 9:40 PM > To: Ghalam, Joe > Cc: davem@davemloft.net; Wichmann, Clifford; netdev@vger.kernel.org > Subject: Re: macvlan: Fix device ref leak when purging bc_queue > > On Thu, Apr 20, 2017 at 04:09:56PM +0000, Joe.Ghalam@dell.com wrote: >> I agree with this change, but the same purge would be needed for the macvlan_dellink() call also. > > I don't think that's necessary because as long as the master > device is still around it will continue to process the broadcast > queue, thus removing any reference counts held. > > It's only when the queue is purged that we run into trouble. > > Cheers, > -- > Email: Herbert Xu <herbert@gondor.apana.org.au> > Home Page: http://gondor.apana.org.au/~herbert/ > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Fri, Apr 21, 2017 at 02:40:50PM +0000, Joe.Ghalam@dell.com wrote:
> That's not true. macvlan_dellink() unregisters the queue, and macvlan_process_broadcast() will never get called. Please note that I'm not speculating. I have traced enabled on the dev_put and dev_hold, and I'm reporting a real, reproducible issue.
The only thing that can stop macvlan_process_broadcast from getting
called is macvlan_port_destroy. Nothing else can stop the work
queue, unless of course the work queue mechanism itself is broken.
So if you're sure macvlan_port_destroy is never even called in
your case, then you'll need to start debugging the kernel work
queue mechanism to see why macvlan_process_broadcast is not getting
called.
Cheers,
> The only thing that can stop macvlan_process_broadcast from getting > called is macvlan_port_destroy. Nothing else can stop the work > queue, unless of course the work queue mechanism itself is broken. > So if you're sure macvlan_port_destroy is never even called in > your case, then you'll need to start debugging the kernel work > queue mechanism to see why macvlan_process_broadcast is not getting > called. I will get your changes reloaded and re-tested without any other debug tools. Hopefully, we'll see success. I will let you know if I see any issues. Btw, is your fix committed already? if not, do you know when and where it would be committed? Thanks, Joe
From: <Joe.Ghalam@dell.com> Date: Mon, 24 Apr 2017 15:01:24 +0000 >> The only thing that can stop macvlan_process_broadcast from getting >> called is macvlan_port_destroy. Nothing else can stop the work >> queue, unless of course the work queue mechanism itself is broken. > >> So if you're sure macvlan_port_destroy is never even called in >> your case, then you'll need to start debugging the kernel work >> queue mechanism to see why macvlan_process_broadcast is not getting >> called. > > I will get your changes reloaded and re-tested without any other debug tools. Hopefully, we'll see success. I will let you know if I see any issues. > Btw, is your fix committed already? if not, do you know when and where it would be committed? I'm waiting for this discussion to settle down before I apply the patch.
> I'm waiting for this discussion to settle down before I apply the patch.
Thanks David. I will get some answers soon, and hopefully the change is a good one.
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 9261722..b34eaaa 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -1139,6 +1139,7 @@ static int macvlan_port_create(struct net_device *dev) static void macvlan_port_destroy(struct net_device *dev) { struct macvlan_port *port = macvlan_port_get_rtnl(dev); + struct sk_buff *skb; dev->priv_flags &= ~IFF_MACVLAN_PORT; netdev_rx_handler_unregister(dev); @@ -1147,7 +1148,15 @@ static void macvlan_port_destroy(struct net_device *dev) * but we need to cancel it and purge left skbs if any. */ cancel_work_sync(&port->bc_work); - __skb_queue_purge(&port->bc_queue); + + while ((skb = __skb_dequeue(&port->bc_queue))) { + const struct macvlan_dev *src = MACVLAN_SKB_CB(skb)->src; + + if (src) + dev_put(src->dev); + + kfree_skb(skb); + } kfree(port); }