[1/1] net: race condition when removing virtual net_device

Message ID	871u4t1d9t.fsf@xmission.com
State	RFC, archived
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> From: ebiederm@xmission.com (Eric W. Biederman) To: Francesco Ruggeri <fruggeri@aristanetworks.com> Cc: "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jiri Pirko <jiri@resnulli.us>, Alexander Duyck <alexander.h.duyck@intel.com>, Cong Wang <amwang@redhat.com>, netdev@vger.kernel.org References: <1379008796-2121-1-git-send-email-fruggeri@aristanetworks.com> <87txhp249u.fsf@xmission.com> <CA+HUmGgyCsc-er6iB_mqg9whMyaquGWinkHOQEuEHbFCYVrXBw@mail.gmail.com> Date: Thu, 12 Sep 2013 22:50:06 -0700 In-Reply-To: <CA+HUmGgyCsc-er6iB_mqg9whMyaquGWinkHOQEuEHbFCYVrXBw@mail.gmail.com> (Francesco Ruggeri's message of "Thu, 12 Sep 2013 14:48:19 -0700") Message-ID: <871u4t1d9t.fsf@xmission.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain BODY: T_TM2_M_HEADER_IN_MSG * 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% * [score: 0.4174] * -0.0 DCC_CHECK_NEGATIVE Not listed in DCC * [sa06 1397; Body=1 Fuz1=1 Fuz2=1] * 0.0 T_TooManySym_01 4+ unique symbols in subject Re: [PATCH 1/1] net: race condition when removing virtual net_device Sender: netdev-owner@vger.kernel.org Precedence: bulk

Eric W. Biederman Sept. 13, 2013, 5:50 a.m. UTC

Francesco Ruggeri <fruggeri@aristanetworks.com> writes:

> That would be great. There would still be one scenario to take care of though:
>
> - veth interfaces v0 and v1 are in namespaces ns0 and ns1.
> - process p0 unregisters v0, which also causes v1 to be unregistered.
> When p0 enters netdev_run_todo both v0 and v1 are in net_todo_list and
> have been unlisted from their namespaces.
> - then in p0's netdev_run_todo:

So I looked at this a little more and this problem appears largely
specific to veth.  In the normal case the caller of dellink has to hold
a reference to the network namespace to find the device to delete.

So I think the solution is just to warp the interface of the second
device into the network namespace of the device we are actually
deleting.

I will buy that similar situations can happen with other virtual devices
that have one foot in two network namespaces, and I expect the same
solution will apply.

So the patch below looks like the solution.  If there is more than one
device that needs this treatment perhaps the code should be moved
into a helper function rather than expanded inline.

Does this look like it will fix your issue?

Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Francesco Ruggeri Sept. 13, 2013, 5:54 p.m. UTC | #1

On Thu, Sep 12, 2013 at 10:50 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Francesco Ruggeri <fruggeri@aristanetworks.com> writes:
>
>> That would be great. There would still be one scenario to take care of though:
>>
>> - veth interfaces v0 and v1 are in namespaces ns0 and ns1.
>> - process p0 unregisters v0, which also causes v1 to be unregistered.
>> When p0 enters netdev_run_todo both v0 and v1 are in net_todo_list and
>> have been unlisted from their namespaces.
>> - then in p0's netdev_run_todo:
>
> So I looked at this a little more and this problem appears largely
> specific to veth.  In the normal case the caller of dellink has to hold
> a reference to the network namespace to find the device to delete.
>
> So I think the solution is just to warp the interface of the second
> device into the network namespace of the device we are actually
> deleting.
>
> I will buy that similar situations can happen with other virtual devices
> that have one foot in two network namespaces, and I expect the same
> solution will apply.
>
> So the patch below looks like the solution.  If there is more than one
> device that needs this treatment perhaps the code should be moved
> into a helper function rather than expanded inline.
>
> Does this look like it will fix your issue?
>
> Eric
>

To summarize your proposal:
1) Remove re-broadcasting of NETDEV_UNREGISTER and
NETDEV_UNREGISTER_FINAL from netdev_wait_allrefs.
2) If unregistering a net_device causes unregistering of other virtual
devices (eg veth, macvlan, vlan) then move the virtual devices to the
namespace of the original net_device.

I do have some concerns about both correctness and feasibility of this approach.

About 2), namespace dependent operations triggered by unregistering
the virtual devices (eg rt_flush_dev, dst_dev_event/dst_ifdown and
probably more) would not be done in the namespaces where they should.

About the feasibility of 1), consider just as an example dst_dev_event
in NETDEV_UNREGISTER_FINAL. dst_dev_event will not process dst->dev
unless the dst_entry is already in dst_garbage.list, ie unless it has
already been dst_free'd or dst_release'd. But since destroying
resources is often delayed to workqueues or to one of several garbage
collection lists, can we guarantee that one broadcast of
NETDEV_UNREGISTER_FINAL is always enough? There may be similar cases
with NETDEV_UNREGISTER.

I do urge you to take a second look at the approach that I proposed.
All it does is make sure that a namespace loopback_dev is not removed
until any devices unlisted from that namespace are also disposed of.
That in turn prevents non-device pernet subsystems from exiting that
namespace in ops_exit_list.
Logically it is enforcing the right behavior (namely non-device pernet
subsystems should not exit a namespace until all devices in that
namespace - listed or unlisted - have been properly disposed of) and
it correctly supports existing code, such as rt_flush_dev and
dst_ifdown, which relies on the correct loopback_dev to be there.

Francesco
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman Sept. 14, 2013, 1:46 a.m. UTC | #2

Francesco Ruggeri <fruggeri@aristanetworks.com> writes:

> On Thu, Sep 12, 2013 at 10:50 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:

> To summarize your proposal:
> 1) Remove re-broadcasting of NETDEV_UNREGISTER and
> NETDEV_UNREGISTER_FINAL from netdev_wait_allrefs.

Forget that it isn't needed.  It would be nice but since it doesn't
solve the entire problem let's not go there.

> 2) If unregistering a net_device causes unregistering of other virtual
> devices (eg veth, macvlan, vlan) then move the virtual devices to the
> namespace of the original net_device.
>
> I do have some concerns about both correctness and feasibility of this approach.
>
> About 2), namespace dependent operations triggered by unregistering
> the virtual devices (eg rt_flush_dev, dst_dev_event/dst_ifdown and
> probably more) would not be done in the namespaces where they should.

Yes they will.  That is what dev_change_net_namespace does.
dev_change_net_namespace would not be correct if it did not do that.

Now dev_change_net_namespace is comparitively expensive so it may not be the
best approach but it is an approach that will work.

> I do urge you to take a second look at the approach that I proposed.
> All it does is make sure that a namespace loopback_dev is not removed
> until any devices unlisted from that namespace are also disposed of.
> That in turn prevents non-device pernet subsystems from exiting that
> namespace in ops_exit_list.

It was worth a second look.  I can not find anything wrong with your
patch but I can not convince myself that it is correct either.  The
moving around the loopback device in the net dev todo list to prevent
deadlock I can't imagine why you are doing that.

I think I would prefer something more explicit and less likely to break.
Perhaps something that keeps a network namespace from exiting while we
delete devices.  Using the loopback device like that looks fragile.
However it is very true that we require the loopback device to stay
around until all of the dst_ifdown calls in a network namespace are
made.

At least my original concern that you could be incrementing the count on
the loop back device when it had already reached zero can't be the case.

It would be very nice to have something that I could think through
easily and was robust to change.

Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Francesco Ruggeri Sept. 16, 2013, 2:54 a.m. UTC | #3

On Fri, Sep 13, 2013 at 6:46 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Francesco Ruggeri <fruggeri@aristanetworks.com> writes:
>
>> On Thu, Sep 12, 2013 at 10:50 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>
>> I do have some concerns about both correctness and feasibility of this approach.
>>
>> About 2), namespace dependent operations triggered by unregistering
>> the virtual devices (eg rt_flush_dev, dst_dev_event/dst_ifdown and
>> probably more) would not be done in the namespaces where they should.
>
> Yes they will.  That is what dev_change_net_namespace does.

You are right, I don't know what I was thinking.

>
> It was worth a second look.  I can not find anything wrong with your
> patch but I can not convince myself that it is correct either.  The
> moving around the loopback device in the net dev todo list to prevent
> deadlock I can't imagine why you are doing that.
>

That is in order not to introduce a potential deadlock when multiple
namespaces are destroyed in default_device_exit_batch.
Take the same veth scenario as before:
v0 in namespace ns0 (loopback_dev lo0) and similarly for v1, ns1 and lo1.
Assume two processes destroy ns0 and ns1. cleanup_net is executed in a
workqueue and the two namespaces can end up being processed in the
same instance of cleanup_net/ops_exit_list/default_device_exit_batch.
default_device_exit_batch traverses each namespace's dev_base list and
unregister_netdevice_queue is executed in this order:
v0 v1 lo0 v1 v0 lo1.
unregister_netdevice_queue is executed twice on v0 and v1 but that is
ok because it uses list_move_tail and only the last one sticks.
The resulting list for unregister_netdevice_many is:
lo0 v1 v0 lo1.
If v0 holds a reference to lo0 there will be a deadlock in
netdev_run_todo if v0 does not come before lo0 in net_todo_list. By
pushing all loopback_devs to the end of net_todo_list this situation
is avoided.

This is the sequence with today's (actually 3.4) code:

unregister_netdevice_queue: v0 (ns ffff880037aecd00)
unregister_netdevice_queue: v1 (ns ffff880037aed800)
unregister_netdevice_queue: lo (ns ffff880037aecd00)
unregister_netdevice_queue: v1 (ns ffff880037aed800)
unregister_netdevice_queue: v0 (ns ffff880037aecd00)
unregister_netdevice_queue: lo (ns ffff880037aed800)
unregister_netdevice_many: lo (ns ffff880037aecd00) v1 (ns
ffff880037aed800) v0 (ns ffff880037aecd00) lo (ns ffff880037aed800)
netdev_run_todo: lo (ns ffff880037aecd00) v1 (ns ffff880037aed800) v0
(ns ffff880037aecd00) lo (ns ffff880037aed800)

and this is the sequence after pushing the loopback_devs to the tail
of net_todo_list:

unregister_netdevice_queue: v0 (ns ffff8800379f8000)
unregister_netdevice_queue: v1 (ns ffff8800378c0b00)
unregister_netdevice_queue: lo (ns ffff8800379f8000)
unregister_netdevice_queue: v1 (ns ffff8800378c0b00)
unregister_netdevice_queue: v0 (ns ffff8800379f8000)
unregister_netdevice_queue: lo (ns ffff8800378c0b00)
unregister_netdevice_many: lo (ns ffff8800379f8000) v1 (ns
ffff8800378c0b00) v0 (ns ffff8800379f8000) lo (ns ffff8800378c0b00)
netdev_run_todo: v1 (ns ffff8800378c0b00) v0 (ns ffff8800379f8000) lo
(ns ffff8800379f8000) lo (ns ffff8800378c0b00)

Should we take this discussion offline?
I do appreciate your spending time on this.

Thanks,
Francesco

>
> Eric
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman Sept. 16, 2013, 10:45 a.m. UTC | #4

Francesco Ruggeri <fruggeri@aristanetworks.com> writes:

> On Fri, Sep 13, 2013 at 6:46 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Francesco Ruggeri <fruggeri@aristanetworks.com> writes:
>>
>>> On Thu, Sep 12, 2013 at 10:50 PM, Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>>
>>> I do have some concerns about both correctness and feasibility of this approach.
>>>
>>> About 2), namespace dependent operations triggered by unregistering
>>> the virtual devices (eg rt_flush_dev, dst_dev_event/dst_ifdown and
>>> probably more) would not be done in the namespaces where they should.
>>
>> Yes they will.  That is what dev_change_net_namespace does.
>
> You are right, I don't know what I was thinking.

>> It was worth a second look.  I can not find anything wrong with your
>> patch but I can not convince myself that it is correct either.  The
>> moving around the loopback device in the net dev todo list to prevent
>> deadlock I can't imagine why you are doing that.
>>
>
> That is in order not to introduce a potential deadlock when multiple
> namespaces are destroyed in default_device_exit_batch.
> Take the same veth scenario as before:
> v0 in namespace ns0 (loopback_dev lo0) and similarly for v1, ns1 and lo1.
> Assume two processes destroy ns0 and ns1. cleanup_net is executed in a
> workqueue and the two namespaces can end up being processed in the
> same instance of cleanup_net/ops_exit_list/default_device_exit_batch.
> default_device_exit_batch traverses each namespace's dev_base list and
> unregister_netdevice_queue is executed in this order:
> v0 v1 lo0 v1 v0 lo1.
> unregister_netdevice_queue is executed twice on v0 and v1 but that is
> ok because it uses list_move_tail and only the last one sticks.
> The resulting list for unregister_netdevice_many is:
> lo0 v1 v0 lo1.
> If v0 holds a reference to lo0 there will be a deadlock in
> netdev_run_todo if v0 does not come before lo0 in net_todo_list. By
> pushing all loopback_devs to the end of net_todo_list this situation
> is avoided.

>
> This is the sequence with today's (actually 3.4) code:
>
> unregister_netdevice_queue: v0 (ns ffff880037aecd00)
> unregister_netdevice_queue: v1 (ns ffff880037aed800)
> unregister_netdevice_queue: lo (ns ffff880037aecd00)
> unregister_netdevice_queue: v1 (ns ffff880037aed800)
> unregister_netdevice_queue: v0 (ns ffff880037aecd00)
> unregister_netdevice_queue: lo (ns ffff880037aed800)
> unregister_netdevice_many: lo (ns ffff880037aecd00) v1 (ns
> ffff880037aed800) v0 (ns ffff880037aecd00) lo (ns ffff880037aed800)
> netdev_run_todo: lo (ns ffff880037aecd00) v1 (ns ffff880037aed800) v0
> (ns ffff880037aecd00) lo (ns ffff880037aed800)

Interesting.

So we have a very small chance of hillarity ensuing with dst_ifdown.
But we probably won't notice because even if lo has been freed
the memory likely hasn't been recycled.  So dst_hold likely does not
affect anyone.

So it seems real problems only happens when we send the rebroadcast.
Which explains why this has gone unnoticed for so long.

I really don't want to support concurrency during the network namespace
shutdown.  That quickly becomes too crazy to think about.

I believe the weird reordering of veth devices is solved or largely
solved by:

commit f45a5c267da35174e22cec955093a7513dc1623d
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Feb 8 20:10:49 2013 +0000

    veth: fix NULL dereference in veth_dellink()

    commit d0e2c55e7c940 (veth: avoid a NULL deref in veth_stats_one)
    added another NULL deref in veth_dellink().

    # ip link add name veth1 type veth peer name veth0
    # rmmod veth

    We crash because veth_dellink() is called twice, so we must
    take care of NULL peer.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Which unfortunately means you really need to test 3.11 or 3.12-rc1
to make headway on this issue.  If there is further veth specific
madness we can solve it by tweaking the veth driver.

> and this is the sequence after pushing the loopback_devs to the tail
> of net_todo_list:
>
> unregister_netdevice_queue: v0 (ns ffff8800379f8000)
> unregister_netdevice_queue: v1 (ns ffff8800378c0b00)
> unregister_netdevice_queue: lo (ns ffff8800379f8000)
> unregister_netdevice_queue: v1 (ns ffff8800378c0b00)
> unregister_netdevice_queue: v0 (ns ffff8800379f8000)
> unregister_netdevice_queue: lo (ns ffff8800378c0b00)
> unregister_netdevice_many: lo (ns ffff8800379f8000) v1 (ns
> ffff8800378c0b00) v0 (ns ffff8800379f8000) lo (ns ffff8800378c0b00)
> netdev_run_todo: v1 (ns ffff8800378c0b00) v0 (ns ffff8800379f8000) lo
> (ns ffff8800379f8000) lo (ns ffff8800378c0b00)

I am scratching my head.  That isn't what I would expect from putting
loopback devices at the end of the list.

Even more I am not comfortable with any situation where preserving the
order in which the devices are unregistered is not sufficient to make
things work correctly.  That quickly gets into fragile layering
problems, and I don't know how anyone would be able to maintain that.

I still don't see a better solution than dev_change_net_namespace,
for the network devices that have this problem.  Network devices are not
supposed to be findable after the dance at the start of cleanup_net.

It might be possible to actually build asynchronous network device
unregistration and opportunistick batching.  Aka cleanup network devices
much like we clean up network namespaces now, and by funnelling them all
into a single threaded work queue that would probably stop the races, as
well as improve performance.  I don't have the energy to do that right
now unfortunately :(

> Should we take this discussion offline?
> I do appreciate your spending time on this.

If anyone complains we can drop them off of the CC but it is useful to
leave a record of the thought process that went into this, so we should
at the minimum keep netdev in the loop.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Francesco Ruggeri Sept. 16, 2013, 8:30 p.m. UTC | #5

On Mon, Sep 16, 2013 at 3:45 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:

> I believe the weird reordering of veth devices is solved or largely
> solved by:
>
> commit f45a5c267da35174e22cec955093a7513dc1623d

I looked at d0e2c55e and f45a5c26 in the past and I do not think they
are related to the problem I see, which I think is more related to the
infrastructure. See more below.

>
>> and this is the sequence after pushing the loopback_devs to the tail
>> of net_todo_list:
>>
>> unregister_netdevice_queue: v0 (ns ffff8800379f8000)
>> unregister_netdevice_queue: v1 (ns ffff8800378c0b00)
>> unregister_netdevice_queue: lo (ns ffff8800379f8000)
>> unregister_netdevice_queue: v1 (ns ffff8800378c0b00)
>> unregister_netdevice_queue: v0 (ns ffff8800379f8000)
>> unregister_netdevice_queue: lo (ns ffff8800378c0b00)
>> unregister_netdevice_many: lo (ns ffff8800379f8000) v1 (ns
>> ffff8800378c0b00) v0 (ns ffff8800379f8000) lo (ns ffff8800378c0b00)
>> netdev_run_todo: v1 (ns ffff8800378c0b00) v0 (ns ffff8800379f8000) lo
>> (ns ffff8800379f8000) lo (ns ffff8800378c0b00)
>
> I am scratching my head.  That isn't what I would expect from putting
> loopback devices at the end of the list.

The way to read it is as follows:
unregister_netdevice_many gets the sequence lo0 v1 v0 lo1 (lo0 is the
occurrence of "lo" with the same namespace as v0, etc.).
As the interfaces are unlisted the loopback_devs are queued at the end
of net_todo_list, so when netdev_run_todo executes net_todo_list is v1
v0 lo0 lo1.

>
> Even more I am not comfortable with any situation where preserving the
> order in which the devices are unregistered is not sufficient to make
> things work correctly.  That quickly gets into fragile layering
> problems, and I don't know how anyone would be able to maintain that.
>

I agree with your general statement, but unfortunately the order in
which the interfaces are unregistered is not preserved even in the
current code, since dev_close_many already rearranges them based on
whether they are UP or not.

If I delete a namespace with only lo and v0 this is the order in which
they are released depending on their state. If lo is DOWN then it will
be processed before v0 in netdev_run_todo.

====== lo down, v0 down
unregister_netdevice_queue: v0 (ns ffff880037bb0b00)
unregister_netdevice_queue: lo (ns ffff880037bb0b00)
unregister_netdevice_many: v0 (ns ffff880037bb0b00) lo (ns ffff880037bb0b00)
netdev_run_todo: lo (ns ffff880037bb0b00) v0 (ns ffff880037bb0b00)

====== lo up, v0 down
unregister_netdevice_queue: v0 (ns ffff8800378a9600)
unregister_netdevice_queue: lo (ns ffff8800378a9600)
unregister_netdevice_many: v0 (ns ffff8800378a9600) lo (ns ffff8800378a9600)
netdev_run_todo: v0 (ns ffff8800378a9600) lo (ns ffff8800378a9600)

====== lo down, v0 up
unregister_netdevice_queue: v0 (ns ffff880037bb0b00)
unregister_netdevice_queue: lo (ns ffff880037bb0b00)
unregister_netdevice_many: v0 (ns ffff880037bb0b00) lo (ns ffff880037bb0b00)
netdev_run_todo: lo (ns ffff880037bb0b00) v0 (ns ffff880037bb0b00)

====== lo up, v0 up
unregister_netdevice_queue: v0 (ns ffff8800378a9600)
unregister_netdevice_queue: lo (ns ffff8800378a9600)
unregister_netdevice_many: v0 (ns ffff8800378a9600) lo (ns ffff8800378a9600)
netdev_run_todo: v0 (ns ffff8800378a9600) lo (ns ffff8800378a9600)

> I still don't see a better solution than dev_change_net_namespace,
> for the network devices that have this problem.  Network devices are not
> supposed to be findable after the dance at the start of cleanup_net.
>

I think the problem we are seeing is an infrastructure one. If I can restate it:

1) When destroying a namespace all net_devices in it should be
disposed of before any non-device pernet subsystems exit. The existing
code tries to do this but it does not take into consideration
net_devices that may have been unlisted from the namespace by another
process which is still in the process of destroying them (specifically
in netdev_run_todo/netdev_wait_allrefs).

There is also another issue, separate but related (my diffs did not
try to address it):

2) dev_change_net_namespace does not have the equivalent of
netdev_wait_allrefs. If rebroadcasts are in fact needed when
destroying a net_device then the same should apply when the net_device
is moved out of a namespace. With existing code a net_device can be
moved to a different namespace before its refcount drops to 0. Any
later broadcasts will not trigger any action in the original namespace
where they should have occurred.

I suspect that this is not catastrophic but its effects could be
subtle. This is another reason why I would avoid using
dev_change_net_namespace as a driver level solution for 1) above,
besides the fact that we would have to apply it to all drivers
affected by this (veth, vlan, macvlan, maybe more).

> It might be possible to actually build asynchronous network device
> unregistration and opportunistick batching.  Aka cleanup network devices
> much like we clean up network namespaces now, and by funnelling them all
> into a single threaded work queue that would probably stop the races, as
> well as improve performance.  I don't have the energy to do that right
> now unfortunately :(
>

That would probably solve it, but as you suggest it will take a
considerable amount of work and code re-structuring.

Francesco
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman Sept. 17, 2013, 12:25 a.m. UTC | #6

Francesco Ruggeri <fruggeri@aristanetworks.com> writes:

> On Mon, Sep 16, 2013 at 3:45 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>
>> I believe the weird reordering of veth devices is solved or largely
>> solved by:
>>
>> commit f45a5c267da35174e22cec955093a7513dc1623d
>
> I looked at d0e2c55e and f45a5c26 in the past and I do not think they
> are related to the problem I see, which I think is more related to the
> infrastructure. See more below.

If everything is in the same batch they definitely relate, as it stops
refiling one of the devices past the point where it's loopback device
is destroyed.

It is not the fundamental problem but it is related and those fixes.

>>> and this is the sequence after pushing the loopback_devs to the tail
>>> of net_todo_list:
>>>
>>> unregister_netdevice_queue: v0 (ns ffff8800379f8000)
>>> unregister_netdevice_queue: v1 (ns ffff8800378c0b00)
>>> unregister_netdevice_queue: lo (ns ffff8800379f8000)
>>> unregister_netdevice_queue: v1 (ns ffff8800378c0b00)
>>> unregister_netdevice_queue: v0 (ns ffff8800379f8000)
>>> unregister_netdevice_queue: lo (ns ffff8800378c0b00)
>>> unregister_netdevice_many: lo (ns ffff8800379f8000) v1 (ns
>>> ffff8800378c0b00) v0 (ns ffff8800379f8000) lo (ns ffff8800378c0b00)
>>> netdev_run_todo: v1 (ns ffff8800378c0b00) v0 (ns ffff8800379f8000) lo
>>> (ns ffff8800379f8000) lo (ns ffff8800378c0b00)
>>
>> I am scratching my head.  That isn't what I would expect from putting
>> loopback devices at the end of the list.
>
> The way to read it is as follows:
> unregister_netdevice_many gets the sequence lo0 v1 v0 lo1 (lo0 is the
> occurrence of "lo" with the same namespace as v0, etc.).
> As the interfaces are unlisted the loopback_devs are queued at the end
> of net_todo_list, so when netdev_run_todo executes net_todo_list is v1
> v0 lo0 lo1.

My reading jumbled the unregister_netdevice_many line and the
netdev_run_todo line.  So I was seeing v0, lo, v1, v0, lo, lo.
But since both loop devices are now at the end your that part of your
patch looks ok.

>> Even more I am not comfortable with any situation where preserving the
>> order in which the devices are unregistered is not sufficient to make
>> things work correctly.  That quickly gets into fragile layering
>> problems, and I don't know how anyone would be able to maintain that.
>>
>
> I agree with your general statement, but unfortunately the order in
> which the interfaces are unregistered is not preserved even in the
> current code, since dev_close_many already rearranges them based on
> whether they are UP or not.

Ugh.  That is a bug.

I have sent a patch to solve this by simply not reordering the device
deletions as that is easier to think about.  And more robust if we
happen to care about anything besides the loopback device.  Reording
happens so seldom I can't imagine it being tested properly.

I have also sent a patch to set loopback_dev to NULL so that it is more
likely we will see your class of bugs.

I think I see bugs with the handling of rtmsg_ifinfo, and
netpoll_rx_enable/disable in dev_close_many as well. But those are
significantly less serious issues.

If you could verify that my patch to dev_close solves the ordering
issues you were seeing I would appreciate that.

>> I still don't see a better solution than dev_change_net_namespace,
>> for the network devices that have this problem.  Network devices are not
>> supposed to be findable after the dance at the start of cleanup_net.
>>
>
> I think the problem we are seeing is an infrastructure one. If I can restate it:
>
> 1) When destroying a namespace all net_devices in it should be
> disposed of before any non-device pernet subsystems exit. The existing
> code tries to do this but it does not take into consideration
> net_devices that may have been unlisted from the namespace by another
> process which is still in the process of destroying them (specifically
> in netdev_run_todo/netdev_wait_allrefs).

The existing code makes the assumption that is not possible to find
a network device and manipulate it outside the network namespace in
any way (outside of the network namespace cleanup methods) after the
network namespace reference count has dropped to 0 and the network
namespace is no longer in the network namespace list.

That assumption is clearly not true for veth pairs, vlans, macvlans and
the like, and it is an infrastructure problem.

The question is what is the most robust comprehensible and maintinable
fix?  (That doesn't penalize the fast path of course).

> There is also another issue, separate but related (my diffs did not
> try to address it):
>
> 2) dev_change_net_namespace does not have the equivalent of
> netdev_wait_allrefs. If rebroadcasts are in fact needed when
> destroying a net_device then the same should apply when the net_device
> is moved out of a namespace. With existing code a net_device can be
> moved to a different namespace before its refcount drops to 0. Any
> later broadcasts will not trigger any action in the original namespace
> where they should have occurred.

Not having the broadcast in dev_change_net_namespace is not a bug.  The
repeated broadcast is most definitely there for hysterical raisins, and
with a good audit of dev_hold/dev_put call sites can most definitely be
removed.  At a practical level the repeated broadcast is just acting as
a hammer and saying to subsystems you messed up now let this device go.
Let go! Let go! Let go! And there is a slight hope that something will
let go when told multiple times.

> I suspect that this is not catastrophic but its effects could be
> subtle. This is another reason why I would avoid using
> dev_change_net_namespace as a driver level solution for 1) above,
> besides the fact that we would have to apply it to all drivers
> affected by this (veth, vlan, macvlan, maybe more).

We don't avoid the rebroadcast and wait, with dev_change_net_namespace,
they are just performed in a different network namespace.  So if
something is wrong we will know.

>> It might be possible to actually build asynchronous network device
>> unregistration and opportunistick batching.  Aka cleanup network devices
>> much like we clean up network namespaces now, and by funnelling them all
>> into a single threaded work queue that would probably stop the races, as
>> well as improve performance.  I don't have the energy to do that right
>> now unfortunately :(
>>
>
> That would probably solve it, but as you suggest it will take a
> considerable amount of work and code re-structuring.

And it may be worth it for the speed up you get when destroying 384 mlag
ports.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Francesco Ruggeri Sept. 17, 2013, 5:12 a.m. UTC | #7

On Mon, Sep 16, 2013 at 5:25 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:

>
> If you could verify that my patch to dev_close solves the ordering
> issues you were seeing I would appreciate that.
>

I have not run extensive tests, but your patch fixes the reordering
issue I was seeing in dev_close_many. When I destroyed a namespace
with only v0 and lo the order was preserved from
unregister_netdevice_many to netdev_run_todo.

Francesco

====== lo down, v0 down
unregister_netdevice_queue: v0 (ns ffff880136bc8000)
unregister_netdevice_queue: lo (ns ffff880136bc8000)
unregister_netdevice_many: v0 (ns ffff880136bc8000) lo (ns ffff880136bc8000)
netdev_run_todo: v0 (ns ffff880136bc8000) lo (ns ffff880136bc8000)

====== lo up, v0 down
unregister_netdevice_queue: v0 (ns ffff880037ac8000)
unregister_netdevice_queue: lo (ns ffff880037ac8000)
unregister_netdevice_many: v0 (ns ffff880037ac8000) lo (ns ffff880037ac8000)
netdev_run_todo: v0 (ns ffff880037ac8000) lo (ns ffff880037ac8000)


====== lo down, v0 up
unregister_netdevice_queue: v0 (ns ffff880136bc8000)
unregister_netdevice_queue: lo (ns ffff880136bc8000)
unregister_netdevice_many: v0 (ns ffff880136bc8000) lo (ns ffff880136bc8000)
netdev_run_todo: v0 (ns ffff880136bc8000) lo (ns ffff880136bc8000)


====== lo up, v0 up
unregister_netdevice_queue: v0 (ns ffff880037ac8000)
unregister_netdevice_queue: lo (ns ffff880037ac8000)
unregister_netdevice_many: v0 (ns ffff880037ac8000) lo (ns ffff880037ac8000)
netdev_run_todo: v0 (ns ffff880037ac8000) lo (ns ffff880037ac8000)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[1/1] net: race condition when removing virtual net_device

Commit Message

Comments

Patch