Message ID | 871u4t1d9t.fsf@xmission.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
On Thu, Sep 12, 2013 at 10:50 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: > Francesco Ruggeri <fruggeri@aristanetworks.com> writes: > >> That would be great. There would still be one scenario to take care of though: >> >> - veth interfaces v0 and v1 are in namespaces ns0 and ns1. >> - process p0 unregisters v0, which also causes v1 to be unregistered. >> When p0 enters netdev_run_todo both v0 and v1 are in net_todo_list and >> have been unlisted from their namespaces. >> - then in p0's netdev_run_todo: > > So I looked at this a little more and this problem appears largely > specific to veth. In the normal case the caller of dellink has to hold > a reference to the network namespace to find the device to delete. > > So I think the solution is just to warp the interface of the second > device into the network namespace of the device we are actually > deleting. > > I will buy that similar situations can happen with other virtual devices > that have one foot in two network namespaces, and I expect the same > solution will apply. > > So the patch below looks like the solution. If there is more than one > device that needs this treatment perhaps the code should be moved > into a helper function rather than expanded inline. > > Does this look like it will fix your issue? > > Eric > To summarize your proposal: 1) Remove re-broadcasting of NETDEV_UNREGISTER and NETDEV_UNREGISTER_FINAL from netdev_wait_allrefs. 2) If unregistering a net_device causes unregistering of other virtual devices (eg veth, macvlan, vlan) then move the virtual devices to the namespace of the original net_device. I do have some concerns about both correctness and feasibility of this approach. About 2), namespace dependent operations triggered by unregistering the virtual devices (eg rt_flush_dev, dst_dev_event/dst_ifdown and probably more) would not be done in the namespaces where they should. About the feasibility of 1), consider just as an example dst_dev_event in NETDEV_UNREGISTER_FINAL. dst_dev_event will not process dst->dev unless the dst_entry is already in dst_garbage.list, ie unless it has already been dst_free'd or dst_release'd. But since destroying resources is often delayed to workqueues or to one of several garbage collection lists, can we guarantee that one broadcast of NETDEV_UNREGISTER_FINAL is always enough? There may be similar cases with NETDEV_UNREGISTER. I do urge you to take a second look at the approach that I proposed. All it does is make sure that a namespace loopback_dev is not removed until any devices unlisted from that namespace are also disposed of. That in turn prevents non-device pernet subsystems from exiting that namespace in ops_exit_list. Logically it is enforcing the right behavior (namely non-device pernet subsystems should not exit a namespace until all devices in that namespace - listed or unlisted - have been properly disposed of) and it correctly supports existing code, such as rt_flush_dev and dst_ifdown, which relies on the correct loopback_dev to be there. Francesco -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Francesco Ruggeri <fruggeri@aristanetworks.com> writes: > On Thu, Sep 12, 2013 at 10:50 PM, Eric W. Biederman > <ebiederm@xmission.com> wrote: > To summarize your proposal: > 1) Remove re-broadcasting of NETDEV_UNREGISTER and > NETDEV_UNREGISTER_FINAL from netdev_wait_allrefs. Forget that it isn't needed. It would be nice but since it doesn't solve the entire problem let's not go there. > 2) If unregistering a net_device causes unregistering of other virtual > devices (eg veth, macvlan, vlan) then move the virtual devices to the > namespace of the original net_device. > > I do have some concerns about both correctness and feasibility of this approach. > > About 2), namespace dependent operations triggered by unregistering > the virtual devices (eg rt_flush_dev, dst_dev_event/dst_ifdown and > probably more) would not be done in the namespaces where they should. Yes they will. That is what dev_change_net_namespace does. dev_change_net_namespace would not be correct if it did not do that. Now dev_change_net_namespace is comparitively expensive so it may not be the best approach but it is an approach that will work. > I do urge you to take a second look at the approach that I proposed. > All it does is make sure that a namespace loopback_dev is not removed > until any devices unlisted from that namespace are also disposed of. > That in turn prevents non-device pernet subsystems from exiting that > namespace in ops_exit_list. It was worth a second look. I can not find anything wrong with your patch but I can not convince myself that it is correct either. The moving around the loopback device in the net dev todo list to prevent deadlock I can't imagine why you are doing that. I think I would prefer something more explicit and less likely to break. Perhaps something that keeps a network namespace from exiting while we delete devices. Using the loopback device like that looks fragile. However it is very true that we require the loopback device to stay around until all of the dst_ifdown calls in a network namespace are made. At least my original concern that you could be incrementing the count on the loop back device when it had already reached zero can't be the case. It would be very nice to have something that I could think through easily and was robust to change. Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Sep 13, 2013 at 6:46 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: > Francesco Ruggeri <fruggeri@aristanetworks.com> writes: > >> On Thu, Sep 12, 2013 at 10:50 PM, Eric W. Biederman >> <ebiederm@xmission.com> wrote: >> >> I do have some concerns about both correctness and feasibility of this approach. >> >> About 2), namespace dependent operations triggered by unregistering >> the virtual devices (eg rt_flush_dev, dst_dev_event/dst_ifdown and >> probably more) would not be done in the namespaces where they should. > > Yes they will. That is what dev_change_net_namespace does. You are right, I don't know what I was thinking. > > It was worth a second look. I can not find anything wrong with your > patch but I can not convince myself that it is correct either. The > moving around the loopback device in the net dev todo list to prevent > deadlock I can't imagine why you are doing that. > That is in order not to introduce a potential deadlock when multiple namespaces are destroyed in default_device_exit_batch. Take the same veth scenario as before: v0 in namespace ns0 (loopback_dev lo0) and similarly for v1, ns1 and lo1. Assume two processes destroy ns0 and ns1. cleanup_net is executed in a workqueue and the two namespaces can end up being processed in the same instance of cleanup_net/ops_exit_list/default_device_exit_batch. default_device_exit_batch traverses each namespace's dev_base list and unregister_netdevice_queue is executed in this order: v0 v1 lo0 v1 v0 lo1. unregister_netdevice_queue is executed twice on v0 and v1 but that is ok because it uses list_move_tail and only the last one sticks. The resulting list for unregister_netdevice_many is: lo0 v1 v0 lo1. If v0 holds a reference to lo0 there will be a deadlock in netdev_run_todo if v0 does not come before lo0 in net_todo_list. By pushing all loopback_devs to the end of net_todo_list this situation is avoided. This is the sequence with today's (actually 3.4) code: unregister_netdevice_queue: v0 (ns ffff880037aecd00) unregister_netdevice_queue: v1 (ns ffff880037aed800) unregister_netdevice_queue: lo (ns ffff880037aecd00) unregister_netdevice_queue: v1 (ns ffff880037aed800) unregister_netdevice_queue: v0 (ns ffff880037aecd00) unregister_netdevice_queue: lo (ns ffff880037aed800) unregister_netdevice_many: lo (ns ffff880037aecd00) v1 (ns ffff880037aed800) v0 (ns ffff880037aecd00) lo (ns ffff880037aed800) netdev_run_todo: lo (ns ffff880037aecd00) v1 (ns ffff880037aed800) v0 (ns ffff880037aecd00) lo (ns ffff880037aed800) and this is the sequence after pushing the loopback_devs to the tail of net_todo_list: unregister_netdevice_queue: v0 (ns ffff8800379f8000) unregister_netdevice_queue: v1 (ns ffff8800378c0b00) unregister_netdevice_queue: lo (ns ffff8800379f8000) unregister_netdevice_queue: v1 (ns ffff8800378c0b00) unregister_netdevice_queue: v0 (ns ffff8800379f8000) unregister_netdevice_queue: lo (ns ffff8800378c0b00) unregister_netdevice_many: lo (ns ffff8800379f8000) v1 (ns ffff8800378c0b00) v0 (ns ffff8800379f8000) lo (ns ffff8800378c0b00) netdev_run_todo: v1 (ns ffff8800378c0b00) v0 (ns ffff8800379f8000) lo (ns ffff8800379f8000) lo (ns ffff8800378c0b00) Should we take this discussion offline? I do appreciate your spending time on this. Thanks, Francesco > > Eric > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Francesco Ruggeri <fruggeri@aristanetworks.com> writes: > On Fri, Sep 13, 2013 at 6:46 PM, Eric W. Biederman > <ebiederm@xmission.com> wrote: >> Francesco Ruggeri <fruggeri@aristanetworks.com> writes: >> >>> On Thu, Sep 12, 2013 at 10:50 PM, Eric W. Biederman >>> <ebiederm@xmission.com> wrote: >>> >>> I do have some concerns about both correctness and feasibility of this approach. >>> >>> About 2), namespace dependent operations triggered by unregistering >>> the virtual devices (eg rt_flush_dev, dst_dev_event/dst_ifdown and >>> probably more) would not be done in the namespaces where they should. >> >> Yes they will. That is what dev_change_net_namespace does. > > You are right, I don't know what I was thinking. >> It was worth a second look. I can not find anything wrong with your >> patch but I can not convince myself that it is correct either. The >> moving around the loopback device in the net dev todo list to prevent >> deadlock I can't imagine why you are doing that. >> > > That is in order not to introduce a potential deadlock when multiple > namespaces are destroyed in default_device_exit_batch. > Take the same veth scenario as before: > v0 in namespace ns0 (loopback_dev lo0) and similarly for v1, ns1 and lo1. > Assume two processes destroy ns0 and ns1. cleanup_net is executed in a > workqueue and the two namespaces can end up being processed in the > same instance of cleanup_net/ops_exit_list/default_device_exit_batch. > default_device_exit_batch traverses each namespace's dev_base list and > unregister_netdevice_queue is executed in this order: > v0 v1 lo0 v1 v0 lo1. > unregister_netdevice_queue is executed twice on v0 and v1 but that is > ok because it uses list_move_tail and only the last one sticks. > The resulting list for unregister_netdevice_many is: > lo0 v1 v0 lo1. > If v0 holds a reference to lo0 there will be a deadlock in > netdev_run_todo if v0 does not come before lo0 in net_todo_list. By > pushing all loopback_devs to the end of net_todo_list this situation > is avoided. > > This is the sequence with today's (actually 3.4) code: > > unregister_netdevice_queue: v0 (ns ffff880037aecd00) > unregister_netdevice_queue: v1 (ns ffff880037aed800) > unregister_netdevice_queue: lo (ns ffff880037aecd00) > unregister_netdevice_queue: v1 (ns ffff880037aed800) > unregister_netdevice_queue: v0 (ns ffff880037aecd00) > unregister_netdevice_queue: lo (ns ffff880037aed800) > unregister_netdevice_many: lo (ns ffff880037aecd00) v1 (ns > ffff880037aed800) v0 (ns ffff880037aecd00) lo (ns ffff880037aed800) > netdev_run_todo: lo (ns ffff880037aecd00) v1 (ns ffff880037aed800) v0 > (ns ffff880037aecd00) lo (ns ffff880037aed800) Interesting. So we have a very small chance of hillarity ensuing with dst_ifdown. But we probably won't notice because even if lo has been freed the memory likely hasn't been recycled. So dst_hold likely does not affect anyone. So it seems real problems only happens when we send the rebroadcast. Which explains why this has gone unnoticed for so long. I really don't want to support concurrency during the network namespace shutdown. That quickly becomes too crazy to think about. I believe the weird reordering of veth devices is solved or largely solved by: commit f45a5c267da35174e22cec955093a7513dc1623d Author: Eric Dumazet <edumazet@google.com> Date: Fri Feb 8 20:10:49 2013 +0000 veth: fix NULL dereference in veth_dellink() commit d0e2c55e7c940 (veth: avoid a NULL deref in veth_stats_one) added another NULL deref in veth_dellink(). # ip link add name veth1 type veth peer name veth0 # rmmod veth We crash because veth_dellink() is called twice, so we must take care of NULL peer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Which unfortunately means you really need to test 3.11 or 3.12-rc1 to make headway on this issue. If there is further veth specific madness we can solve it by tweaking the veth driver. > and this is the sequence after pushing the loopback_devs to the tail > of net_todo_list: > > unregister_netdevice_queue: v0 (ns ffff8800379f8000) > unregister_netdevice_queue: v1 (ns ffff8800378c0b00) > unregister_netdevice_queue: lo (ns ffff8800379f8000) > unregister_netdevice_queue: v1 (ns ffff8800378c0b00) > unregister_netdevice_queue: v0 (ns ffff8800379f8000) > unregister_netdevice_queue: lo (ns ffff8800378c0b00) > unregister_netdevice_many: lo (ns ffff8800379f8000) v1 (ns > ffff8800378c0b00) v0 (ns ffff8800379f8000) lo (ns ffff8800378c0b00) > netdev_run_todo: v1 (ns ffff8800378c0b00) v0 (ns ffff8800379f8000) lo > (ns ffff8800379f8000) lo (ns ffff8800378c0b00) I am scratching my head. That isn't what I would expect from putting loopback devices at the end of the list. Even more I am not comfortable with any situation where preserving the order in which the devices are unregistered is not sufficient to make things work correctly. That quickly gets into fragile layering problems, and I don't know how anyone would be able to maintain that. I still don't see a better solution than dev_change_net_namespace, for the network devices that have this problem. Network devices are not supposed to be findable after the dance at the start of cleanup_net. It might be possible to actually build asynchronous network device unregistration and opportunistick batching. Aka cleanup network devices much like we clean up network namespaces now, and by funnelling them all into a single threaded work queue that would probably stop the races, as well as improve performance. I don't have the energy to do that right now unfortunately :( > Should we take this discussion offline? > I do appreciate your spending time on this. If anyone complains we can drop them off of the CC but it is useful to leave a record of the thought process that went into this, so we should at the minimum keep netdev in the loop. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 16, 2013 at 3:45 AM, Eric W. Biederman <ebiederm@xmission.com> wrote: > I believe the weird reordering of veth devices is solved or largely > solved by: > > commit f45a5c267da35174e22cec955093a7513dc1623d I looked at d0e2c55e and f45a5c26 in the past and I do not think they are related to the problem I see, which I think is more related to the infrastructure. See more below. > >> and this is the sequence after pushing the loopback_devs to the tail >> of net_todo_list: >> >> unregister_netdevice_queue: v0 (ns ffff8800379f8000) >> unregister_netdevice_queue: v1 (ns ffff8800378c0b00) >> unregister_netdevice_queue: lo (ns ffff8800379f8000) >> unregister_netdevice_queue: v1 (ns ffff8800378c0b00) >> unregister_netdevice_queue: v0 (ns ffff8800379f8000) >> unregister_netdevice_queue: lo (ns ffff8800378c0b00) >> unregister_netdevice_many: lo (ns ffff8800379f8000) v1 (ns >> ffff8800378c0b00) v0 (ns ffff8800379f8000) lo (ns ffff8800378c0b00) >> netdev_run_todo: v1 (ns ffff8800378c0b00) v0 (ns ffff8800379f8000) lo >> (ns ffff8800379f8000) lo (ns ffff8800378c0b00) > > I am scratching my head. That isn't what I would expect from putting > loopback devices at the end of the list. The way to read it is as follows: unregister_netdevice_many gets the sequence lo0 v1 v0 lo1 (lo0 is the occurrence of "lo" with the same namespace as v0, etc.). As the interfaces are unlisted the loopback_devs are queued at the end of net_todo_list, so when netdev_run_todo executes net_todo_list is v1 v0 lo0 lo1. > > Even more I am not comfortable with any situation where preserving the > order in which the devices are unregistered is not sufficient to make > things work correctly. That quickly gets into fragile layering > problems, and I don't know how anyone would be able to maintain that. > I agree with your general statement, but unfortunately the order in which the interfaces are unregistered is not preserved even in the current code, since dev_close_many already rearranges them based on whether they are UP or not. If I delete a namespace with only lo and v0 this is the order in which they are released depending on their state. If lo is DOWN then it will be processed before v0 in netdev_run_todo. ====== lo down, v0 down unregister_netdevice_queue: v0 (ns ffff880037bb0b00) unregister_netdevice_queue: lo (ns ffff880037bb0b00) unregister_netdevice_many: v0 (ns ffff880037bb0b00) lo (ns ffff880037bb0b00) netdev_run_todo: lo (ns ffff880037bb0b00) v0 (ns ffff880037bb0b00) ====== lo up, v0 down unregister_netdevice_queue: v0 (ns ffff8800378a9600) unregister_netdevice_queue: lo (ns ffff8800378a9600) unregister_netdevice_many: v0 (ns ffff8800378a9600) lo (ns ffff8800378a9600) netdev_run_todo: v0 (ns ffff8800378a9600) lo (ns ffff8800378a9600) ====== lo down, v0 up unregister_netdevice_queue: v0 (ns ffff880037bb0b00) unregister_netdevice_queue: lo (ns ffff880037bb0b00) unregister_netdevice_many: v0 (ns ffff880037bb0b00) lo (ns ffff880037bb0b00) netdev_run_todo: lo (ns ffff880037bb0b00) v0 (ns ffff880037bb0b00) ====== lo up, v0 up unregister_netdevice_queue: v0 (ns ffff8800378a9600) unregister_netdevice_queue: lo (ns ffff8800378a9600) unregister_netdevice_many: v0 (ns ffff8800378a9600) lo (ns ffff8800378a9600) netdev_run_todo: v0 (ns ffff8800378a9600) lo (ns ffff8800378a9600) > I still don't see a better solution than dev_change_net_namespace, > for the network devices that have this problem. Network devices are not > supposed to be findable after the dance at the start of cleanup_net. > I think the problem we are seeing is an infrastructure one. If I can restate it: 1) When destroying a namespace all net_devices in it should be disposed of before any non-device pernet subsystems exit. The existing code tries to do this but it does not take into consideration net_devices that may have been unlisted from the namespace by another process which is still in the process of destroying them (specifically in netdev_run_todo/netdev_wait_allrefs). There is also another issue, separate but related (my diffs did not try to address it): 2) dev_change_net_namespace does not have the equivalent of netdev_wait_allrefs. If rebroadcasts are in fact needed when destroying a net_device then the same should apply when the net_device is moved out of a namespace. With existing code a net_device can be moved to a different namespace before its refcount drops to 0. Any later broadcasts will not trigger any action in the original namespace where they should have occurred. I suspect that this is not catastrophic but its effects could be subtle. This is another reason why I would avoid using dev_change_net_namespace as a driver level solution for 1) above, besides the fact that we would have to apply it to all drivers affected by this (veth, vlan, macvlan, maybe more). > It might be possible to actually build asynchronous network device > unregistration and opportunistick batching. Aka cleanup network devices > much like we clean up network namespaces now, and by funnelling them all > into a single threaded work queue that would probably stop the races, as > well as improve performance. I don't have the energy to do that right > now unfortunately :( > That would probably solve it, but as you suggest it will take a considerable amount of work and code re-structuring. Francesco -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Francesco Ruggeri <fruggeri@aristanetworks.com> writes: > On Mon, Sep 16, 2013 at 3:45 AM, Eric W. Biederman > <ebiederm@xmission.com> wrote: > >> I believe the weird reordering of veth devices is solved or largely >> solved by: >> >> commit f45a5c267da35174e22cec955093a7513dc1623d > > I looked at d0e2c55e and f45a5c26 in the past and I do not think they > are related to the problem I see, which I think is more related to the > infrastructure. See more below. If everything is in the same batch they definitely relate, as it stops refiling one of the devices past the point where it's loopback device is destroyed. It is not the fundamental problem but it is related and those fixes. >>> and this is the sequence after pushing the loopback_devs to the tail >>> of net_todo_list: >>> >>> unregister_netdevice_queue: v0 (ns ffff8800379f8000) >>> unregister_netdevice_queue: v1 (ns ffff8800378c0b00) >>> unregister_netdevice_queue: lo (ns ffff8800379f8000) >>> unregister_netdevice_queue: v1 (ns ffff8800378c0b00) >>> unregister_netdevice_queue: v0 (ns ffff8800379f8000) >>> unregister_netdevice_queue: lo (ns ffff8800378c0b00) >>> unregister_netdevice_many: lo (ns ffff8800379f8000) v1 (ns >>> ffff8800378c0b00) v0 (ns ffff8800379f8000) lo (ns ffff8800378c0b00) >>> netdev_run_todo: v1 (ns ffff8800378c0b00) v0 (ns ffff8800379f8000) lo >>> (ns ffff8800379f8000) lo (ns ffff8800378c0b00) >> >> I am scratching my head. That isn't what I would expect from putting >> loopback devices at the end of the list. > > The way to read it is as follows: > unregister_netdevice_many gets the sequence lo0 v1 v0 lo1 (lo0 is the > occurrence of "lo" with the same namespace as v0, etc.). > As the interfaces are unlisted the loopback_devs are queued at the end > of net_todo_list, so when netdev_run_todo executes net_todo_list is v1 > v0 lo0 lo1. My reading jumbled the unregister_netdevice_many line and the netdev_run_todo line. So I was seeing v0, lo, v1, v0, lo, lo. But since both loop devices are now at the end your that part of your patch looks ok. >> Even more I am not comfortable with any situation where preserving the >> order in which the devices are unregistered is not sufficient to make >> things work correctly. That quickly gets into fragile layering >> problems, and I don't know how anyone would be able to maintain that. >> > > I agree with your general statement, but unfortunately the order in > which the interfaces are unregistered is not preserved even in the > current code, since dev_close_many already rearranges them based on > whether they are UP or not. Ugh. That is a bug. I have sent a patch to solve this by simply not reordering the device deletions as that is easier to think about. And more robust if we happen to care about anything besides the loopback device. Reording happens so seldom I can't imagine it being tested properly. I have also sent a patch to set loopback_dev to NULL so that it is more likely we will see your class of bugs. I think I see bugs with the handling of rtmsg_ifinfo, and netpoll_rx_enable/disable in dev_close_many as well. But those are significantly less serious issues. If you could verify that my patch to dev_close solves the ordering issues you were seeing I would appreciate that. >> I still don't see a better solution than dev_change_net_namespace, >> for the network devices that have this problem. Network devices are not >> supposed to be findable after the dance at the start of cleanup_net. >> > > I think the problem we are seeing is an infrastructure one. If I can restate it: > > 1) When destroying a namespace all net_devices in it should be > disposed of before any non-device pernet subsystems exit. The existing > code tries to do this but it does not take into consideration > net_devices that may have been unlisted from the namespace by another > process which is still in the process of destroying them (specifically > in netdev_run_todo/netdev_wait_allrefs). The existing code makes the assumption that is not possible to find a network device and manipulate it outside the network namespace in any way (outside of the network namespace cleanup methods) after the network namespace reference count has dropped to 0 and the network namespace is no longer in the network namespace list. That assumption is clearly not true for veth pairs, vlans, macvlans and the like, and it is an infrastructure problem. The question is what is the most robust comprehensible and maintinable fix? (That doesn't penalize the fast path of course). > There is also another issue, separate but related (my diffs did not > try to address it): > > 2) dev_change_net_namespace does not have the equivalent of > netdev_wait_allrefs. If rebroadcasts are in fact needed when > destroying a net_device then the same should apply when the net_device > is moved out of a namespace. With existing code a net_device can be > moved to a different namespace before its refcount drops to 0. Any > later broadcasts will not trigger any action in the original namespace > where they should have occurred. Not having the broadcast in dev_change_net_namespace is not a bug. The repeated broadcast is most definitely there for hysterical raisins, and with a good audit of dev_hold/dev_put call sites can most definitely be removed. At a practical level the repeated broadcast is just acting as a hammer and saying to subsystems you messed up now let this device go. Let go! Let go! Let go! And there is a slight hope that something will let go when told multiple times. > I suspect that this is not catastrophic but its effects could be > subtle. This is another reason why I would avoid using > dev_change_net_namespace as a driver level solution for 1) above, > besides the fact that we would have to apply it to all drivers > affected by this (veth, vlan, macvlan, maybe more). We don't avoid the rebroadcast and wait, with dev_change_net_namespace, they are just performed in a different network namespace. So if something is wrong we will know. >> It might be possible to actually build asynchronous network device >> unregistration and opportunistick batching. Aka cleanup network devices >> much like we clean up network namespaces now, and by funnelling them all >> into a single threaded work queue that would probably stop the races, as >> well as improve performance. I don't have the energy to do that right >> now unfortunately :( >> > > That would probably solve it, but as you suggest it will take a > considerable amount of work and code re-structuring. And it may be worth it for the speed up you get when destroying 384 mlag ports. Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 16, 2013 at 5:25 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: > > If you could verify that my patch to dev_close solves the ordering > issues you were seeing I would appreciate that. > I have not run extensive tests, but your patch fixes the reordering issue I was seeing in dev_close_many. When I destroyed a namespace with only v0 and lo the order was preserved from unregister_netdevice_many to netdev_run_todo. Francesco ====== lo down, v0 down unregister_netdevice_queue: v0 (ns ffff880136bc8000) unregister_netdevice_queue: lo (ns ffff880136bc8000) unregister_netdevice_many: v0 (ns ffff880136bc8000) lo (ns ffff880136bc8000) netdev_run_todo: v0 (ns ffff880136bc8000) lo (ns ffff880136bc8000) ====== lo up, v0 down unregister_netdevice_queue: v0 (ns ffff880037ac8000) unregister_netdevice_queue: lo (ns ffff880037ac8000) unregister_netdevice_many: v0 (ns ffff880037ac8000) lo (ns ffff880037ac8000) netdev_run_todo: v0 (ns ffff880037ac8000) lo (ns ffff880037ac8000) ====== lo down, v0 up unregister_netdevice_queue: v0 (ns ffff880136bc8000) unregister_netdevice_queue: lo (ns ffff880136bc8000) unregister_netdevice_many: v0 (ns ffff880136bc8000) lo (ns ffff880136bc8000) netdev_run_todo: v0 (ns ffff880136bc8000) lo (ns ffff880136bc8000) ====== lo up, v0 up unregister_netdevice_queue: v0 (ns ffff880037ac8000) unregister_netdevice_queue: lo (ns ffff880037ac8000) unregister_netdevice_many: v0 (ns ffff880037ac8000) lo (ns ffff880037ac8000) netdev_run_todo: v0 (ns ffff880037ac8000) lo (ns ffff880037ac8000) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/net/veth.c b/drivers/net/veth.c index da86652..5922066 100644 --- a/drivers/net/veth.c +++ b/drivers/net/veth.c @@ -423,6 +423,19 @@ static void veth_dellink(struct net_device *dev, struct list_head *head) unregister_netdevice_queue(dev, head); if (peer) { + struct net *net = dev_net(dev); + if (dev_net(peer) != net) { + /* Move the peer to the same net to avoid teardown races */ + char peer_name[IFNAMSIZ]; + int err; + snprintf(fb_name, IFNAMSIZ, "dev%d", peer->ifindex); + err = dev_change_net_namespace(peer, net, peer_name); + if (err) { + pr_emerg("%s: failed to move %s to peers net: %d\n", + __func__, peer->name, err); + BUG(); + } + } priv = netdev_priv(peer); RCU_INIT_POINTER(priv->peer, NULL); unregister_netdevice_queue(peer, head);