IPv4/IPv6 sysctl unregistration deadlock

Message ID	20090226084924.16cb3e08@nehalam
State	Accepted, archived
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> Date: Thu, 26 Feb 2009 08:49:24 -0800 From: Stephen Hemminger <shemminger@vyatta.com> To: ebiederm@xmission.com (Eric W. Biederman) Cc: Herbert Xu <herbert@gondor.apana.org.au>, Patrick McHardy <kaber@trash.net>, Linux Netdev List <netdev@vger.kernel.org>, "David S. Miller" <davem@davemloft.net> Subject: Re: IPv4/IPv6 sysctl unregistration deadlock Message-ID: <20090226084924.16cb3e08@nehalam> In-Reply-To: <m14oyhis31.fsf@fess.ebiederm.org> References: <49A4D5D5.5090602@trash.net> <20090225061902.GA32430@gondor.apana.org.au> <49A4E3F8.4050406@trash.net> <49A4F0D7.20304@trash.net> <20090225084321.GA1101@gondor.apana.org.au> <m163ixk9t2.fsf@fess.ebiederm.org> <20090226062257.GA11511@gondor.apana.org.au> <m14oyhis31.fsf@fess.ebiederm.org> Organization: Vyatta Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: netdev-owner@vger.kernel.org Precedence: bulk

stephen hemminger Feb. 26, 2009, 4:49 p.m. UTC

On Wed, 25 Feb 2009 23:18:42 -0800
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Herbert Xu <herbert@gondor.apana.org.au> writes:
> 
> > On Wed, Feb 25, 2009 at 10:10:33PM -0800, Eric W. Biederman wrote:
> >>
> >> How does adding a rename operation to sysctl sound?
> >
> > Yes that would definitely help.  Of course for the unregister case
> > we'd still need either an async removal or a no-op as Patrick
> > suggested.
> 
> After having reread the thread and looking at the code I think
> I understand what is happening.
> 
> sysctl, proc, and sysfs all need to wait until there are no
> more users before their unregister operation succeeds.  So that
> we can guarantee that it is safe to remove a module that provides
> the callback function.
> 
> Currently ndo_stop, NETDEV_DOWN, unlist_netdevice and I don't
> know how much other code is run from unregister_netdevice
> with the rtnl lock.  If we do an asynchronous unregister
> we need to ensure that entire code path is safe without
> rtnl_lock.  And we would need to run the unregister work
> from rtnl_lock.
> 
> Ugh.  netdev_store() and a few other functions in net-sysfs.c
> take rtnl_lock.  The instance in netdev_store appears to date
> back to 21 May 2003 sometime during 2.5.
> 
> So this is an old problem that we are just noticing now. Ugh.
> 
> Currently rtnl_lock() protects the netdevice_notifier_chain.
> So it appears we need to hold rtnl_lock().
> 
> Which leads me to conclude either we need to completely rewrite the
> locking rules for the networking stack, or we need to teach the sysfs,
> sysctl, and proc how to grab a subsystem lock around a callback.
> 
> We already do this for netlink with netlink_create_kernel.
> 
> So I guess we need a variants of:
> register_sysctl_table, proc_create, and class_create_file.
> 
> What a pain, but at least it looks like it can work.
> 
> Eric

What about something like this:

Subject: [PATCH] Avoid race between network down and sysfs


Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman Feb. 26, 2009, 7:01 p.m. UTC | #1

Stephen Hemminger <shemminger@vyatta.com> writes:

> What about something like this:
>
> Subject: [PATCH] Avoid race between network down and sysfs

As far as solutions go.  That looks like the easiest correct solution.
So.  
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Will -ERESTARTSYS trigger the in kernel restart logic in this case?

There are a lot more cases to cover, and I don't I like it long
term.  Spinning waiting for rtnl_lock feels wrong.  Plus it does
not help with discovering the problem in new sysfs, sysctl, or
proc files.

It has the major advantage that we can fix things now.

> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>
> --- a/net/core/net-sysfs.c	2009-02-26 08:36:18.000000000 -0800
> +++ b/net/core/net-sysfs.c	2009-02-26 08:37:51.000000000 -0800
> @@ -77,7 +77,9 @@ static ssize_t netdev_store(struct devic
>  	if (endp == buf)
>  		goto err;
>  
> -	rtnl_lock();
> +	if (!rtnl_trylock())
> +		return -ERESTARTSYS;
> +
>  	if (dev_isalive(net)) {
>  		if ((ret = (*set)(net, new)) == 0)
>  			ret = len;
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

stephen hemminger Feb. 26, 2009, 8:24 p.m. UTC | #2

On Thu, 26 Feb 2009 11:01:41 -0800
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Stephen Hemminger <shemminger@vyatta.com> writes:
> 
> > What about something like this:
> >
> > Subject: [PATCH] Avoid race between network down and sysfs
> 
> As far as solutions go.  That looks like the easiest correct solution.
> So.  
> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> 
> Will -ERESTARTSYS trigger the in kernel restart logic in this case?
> 
> There are a lot more cases to cover, and I don't I like it long
> term.  Spinning waiting for rtnl_lock feels wrong.  Plus it does
> not help with discovering the problem in new sysfs, sysctl, or
> proc files.
> 
> It has the major advantage that we can fix things now.
> 

I haven't tested it, but it should restart in VFS.
Spinning is not that big a deal, and it also handles the case
where the name changed or some other race occurred during processing.
When syscall is re-entered, the name will no longer be found and
-ENOENT will be returned.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Herbert Xu Feb. 27, 2009, 12:59 a.m. UTC | #3

On Thu, Feb 26, 2009 at 08:49:24AM -0800, Stephen Hemminger wrote:
>
> -	rtnl_lock();
> +	if (!rtnl_trylock())
> +		return -ERESTARTSYS;

This is going to spin instead of sleep on contention, is that
intended?

Cheers,

stephen hemminger Feb. 27, 2009, 1:25 a.m. UTC | #4

On Fri, 27 Feb 2009 08:59:45 +0800
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Thu, Feb 26, 2009 at 08:49:24AM -0800, Stephen Hemminger wrote:
> >
> > -	rtnl_lock();
> > +	if (!rtnl_trylock())
> > +		return -ERESTARTSYS;
> 
> This is going to spin instead of sleep on contention, is that
> intended?
> 
> Cheers,

It walks all the way back out to VFS, which is what you have to
do since the sysctl may move.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ben Greear Feb. 27, 2009, 6:26 p.m. UTC | #5

Stephen Hemminger wrote:

> Subject: [PATCH] Avoid race between network down and sysfs
> 
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> 
> --- a/net/core/net-sysfs.c	2009-02-26 08:36:18.000000000 -0800
> +++ b/net/core/net-sysfs.c	2009-02-26 08:37:51.000000000 -0800
> @@ -77,7 +77,9 @@ static ssize_t netdev_store(struct devic
>  	if (endp == buf)
>  		goto err;
>  
> -	rtnl_lock();
> +	if (!rtnl_trylock())
> +		return -ERESTARTSYS;
> +
>  	if (dev_isalive(net)) {
>  		if ((ret = (*set)(net, new)) == 0)
>  			ret = len;

I can test this to see if it fixes my problem.  Are the above lines the
entirety of the patch?

I think I'll add a printk if we do the ERESTARTSYS to make sure that I
hit the race (but still recover).

Thanks,
Ben

stephen hemminger Feb. 27, 2009, 6:38 p.m. UTC | #6

On Fri, 27 Feb 2009 10:26:43 -0800
Ben Greear <greearb@candelatech.com> wrote:

> Stephen Hemminger wrote:
> 
> > Subject: [PATCH] Avoid race between network down and sysfs
> > 
> > 
> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> > 
> > --- a/net/core/net-sysfs.c	2009-02-26 08:36:18.000000000 -0800
> > +++ b/net/core/net-sysfs.c	2009-02-26 08:37:51.000000000 -0800
> > @@ -77,7 +77,9 @@ static ssize_t netdev_store(struct devic
> >  	if (endp == buf)
> >  		goto err;
> >  
> > -	rtnl_lock();
> > +	if (!rtnl_trylock())
> > +		return -ERESTARTSYS;
> > +
> >  	if (dev_isalive(net)) {
> >  		if ((ret = (*set)(net, new)) == 0)
> >  			ret = len;
> 
> I can test this to see if it fixes my problem.  Are the above lines the
> entirety of the patch?

yes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patrick McHardy March 2, 2009, 11:07 a.m. UTC | #7

Stephen Hemminger wrote:
> On Fri, 27 Feb 2009 10:26:43 -0800
> Ben Greear <greearb@candelatech.com> wrote:
> 
>> Stephen Hemminger wrote:
>>
>>> Subject: [PATCH] Avoid race between network down and sysfs
>>>
>>>
>>> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>>>
>>> --- a/net/core/net-sysfs.c	2009-02-26 08:36:18.000000000 -0800
>>> +++ b/net/core/net-sysfs.c	2009-02-26 08:37:51.000000000 -0800
>>> @@ -77,7 +77,9 @@ static ssize_t netdev_store(struct devic
>>>  	if (endp == buf)
>>>  		goto err;
>>>  
>>> -	rtnl_lock();
>>> +	if (!rtnl_trylock())
>>> +		return -ERESTARTSYS;
>>> +
>>>  	if (dev_isalive(net)) {
>>>  		if ((ret = (*set)(net, new)) == 0)
>>>  			ret = len;
>> I can test this to see if it fixes my problem.  Are the above lines the
>> entirety of the patch?
> 
> yes

We should be able to avoid the restart looping in most cases,
we only need to do this while unregistration is in progress.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patrick McHardy March 2, 2009, 11:21 a.m. UTC | #8

Patrick McHardy wrote:
> Stephen Hemminger wrote:
>>>> @@ -77,7 +77,9 @@ static ssize_t netdev_store(struct devic
>>>>      if (endp == buf)
>>>>          goto err;
>>>>  
>>>> -    rtnl_lock();
>>>> +    if (!rtnl_trylock())
>>>> +        return -ERESTARTSYS;
>>>> +
>>>>      if (dev_isalive(net)) {
>>>>          if ((ret = (*set)(net, new)) == 0)
>>>>              ret = len;
>>> I can test this to see if it fixes my problem.  Are the above lines the
>>> entirety of the patch?
>>
>> yes
> 
> We should be able to avoid the restart looping in most cases,
> we only need to do this while unregistration is in progress.

It doesn't seem to work. The idea was something like this:

static int addrconf_fixup_forwarding(struct ctl_table *table, int *p, 
int old)
{
	struct inet6_dev *idev;
	struct net *net;

	net = (struct net *)table->extra2;
	if (p == &net->ipv6.devconf_dflt->forwarding)
		return 0;

	/* Unregistration in progress */
	idev = table->extra1;
	if (idev->cnf.sysctl == NULL)
		return -ERESTARTSYS;

	rtnl_lock();
...

but the process might have entered this function while the rtnl
is already held, but the unregistration hasn't been triggered yet.
So it would still deadlock.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ben Greear March 2, 2009, 10:11 p.m. UTC | #9

Stephen Hemminger wrote:
> On Fri, 27 Feb 2009 10:26:43 -0800
> Ben Greear <greearb@candelatech.com> wrote:
> 
>> Stephen Hemminger wrote:
>>
>>> Subject: [PATCH] Avoid race between network down and sysfs
>>>
>>>
>>> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>>>
>>> --- a/net/core/net-sysfs.c	2009-02-26 08:36:18.000000000 -0800
>>> +++ b/net/core/net-sysfs.c	2009-02-26 08:37:51.000000000 -0800
>>> @@ -77,7 +77,9 @@ static ssize_t netdev_store(struct devic
>>>  	if (endp == buf)
>>>  		goto err;
>>>  
>>> -	rtnl_lock();
>>> +	if (!rtnl_trylock())
>>> +		return -ERESTARTSYS;
>>> +
>>>  	if (dev_isalive(net)) {
>>>  		if ((ret = (*set)(net, new)) == 0)
>>>  			ret = len;
>> I can test this to see if it fixes my problem.  Are the above lines the
>> entirety of the patch?
> 
> yes

With both of Stephen's patches included in the latest -rc6 source,
I re-ran the test and it seems to be working (I added printks so
that I would know the new code was being exercised)

I had 2000 or so mac-vlans configured, with 10 of them being
re-configured concurrently, while also deleting groups of 20-100
mac-vlans in my test.  This was locking up reliably before,
and now it seems to be working fine.

Here's the kernel log showing the ERESTARTSYS in action.  I don't
have an easy way to check to see if the VFS (or whatever) retried
the call properly, but will let you all know if I see any indication
that isn't working.

I only saw the ipv6 fixup in my logs, but maybe my test case just
doesn't hit the other...

Mar  2 13:37:41 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:37:41 simech-1 kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
Mar  2 13:37:42 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:38:13 simech-1 kernel:last message repeated 42 times
Mar  2 13:39:20 simech-1 kernel:last message repeated 71 times
Mar  2 13:40:13 simech-1 kernel:last message repeated 68 times
Mar  2 13:40:13 simech-1 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 8
Mar  2 13:40:13 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:40:21 simech-1 kernel:last message repeated 11 times
Mar  2 13:40:21 simech-1 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 13
Mar  2 13:40:21 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:40:28 simech-1 kernel:last message repeated 16 times
Mar  2 13:40:28 simech-1 kernel: __ratelimit: 3352 callbacks suppressed
Mar  2 13:40:28 simech-1 kernel: Neighbour table overflow.
Mar  2 13:40:28 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:40:34 simech-1 kernel:last message repeated 8 times
Mar  2 13:40:34 simech-1 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 20
Mar  2 13:40:34 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:40:54 simech-1 kernel:last message repeated 42 times
Mar  2 13:40:54 simech-1 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 7
Mar  2 13:40:54 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:41:01 simech-1 kernel:last message repeated 13 times
Mar  2 13:41:01 simech-1 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 13
Mar  2 13:41:01 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:41:14 simech-1 kernel:last message repeated 26 times
Mar  2 13:41:14 simech-1 dhclient: No DHCPOFFERS received.
Mar  2 13:41:14 simech-1 dhclient: No working leases in persistent database - sleeping.
Mar  2 13:41:15 simech-1 dhclient: receive_packet failed on eth1: Network is down
Mar  2 13:41:16 simech-1 kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
Mar  2 13:41:17 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:41:24 simech-1 kernel:last message repeated 16 times
Mar  2 13:41:24 simech-1 kernel: Neighbour table overflow.
Mar  2 13:41:25 simech-1 kernel:last message repeated 8 times
Mar  2 13:41:25 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:41:30 simech-1 kernel:last message repeated 11 times
Mar  2 13:41:30 simech-1 kernel: __ratelimit: 7059 callbacks suppressed
Mar  2 13:41:30 simech-1 kernel: Neighbour table overflow.
Mar  2 13:41:30 simech-1 kernel:last message repeated 9 times
Mar  2 13:41:30 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:41:35 simech-1 kernel:last message repeated 10 times
Mar  2 13:41:35 simech-1 kernel: __ratelimit: 7973 callbacks suppressed
Mar  2 13:41:35 simech-1 kernel: Neighbour table overflow.
Mar  2 13:41:36 simech-1 kernel:last message repeated 9 times
Mar  2 13:41:36 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:41:40 simech-1 kernel:last message repeated 7 times
Mar  2 13:41:40 simech-1 kernel: __ratelimit: 10890 callbacks suppressed
Mar  2 13:41:40 simech-1 kernel: Neighbour table overflow.
Mar  2 13:41:40 simech-1 kernel:last message repeated 9 times
Mar  2 13:41:40 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:41:46 simech-1 kernel:last message repeated 13 times
Mar  2 13:41:46 simech-1 kernel: __ratelimit: 6571 callbacks suppressed
Mar  2 13:41:46 simech-1 kernel: Neighbour table overflow.
Mar  2 13:41:46 simech-1 kernel:last message repeated 9 times
Mar  2 13:41:46 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:41:51 simech-1 kernel:last message repeated 7 times
Mar  2 13:41:51 simech-1 kernel: __ratelimit: 15491 callbacks suppressed
Mar  2 13:41:51 simech-1 kernel: Neighbour table overflow.
Mar  2 13:41:51 simech-1 kernel:last message repeated 9 times
Mar  2 13:41:51 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:41:56 simech-1 kernel:last message repeated 9 times
Mar  2 13:41:56 simech-1 kernel: __ratelimit: 9658 callbacks suppressed
Mar  2 13:41:56 simech-1 kernel: Neighbour table overflow.
Mar  2 13:41:56 simech-1 kernel:last message repeated 9 times
Mar  2 13:41:56 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:42:01 simech-1 kernel:last message repeated 10 times
Mar  2 13:42:01 simech-1 kernel: __ratelimit: 12678 callbacks suppressed
Mar  2 13:42:01 simech-1 kernel: Neighbour table overflow.
Mar  2 13:42:01 simech-1 kernel:last message repeated 9 times
Mar  2 13:42:01 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:42:06 simech-1 kernel:last message repeated 7 times
Mar  2 13:42:06 simech-1 kernel: __ratelimit: 13401 callbacks suppressed
Mar  2 13:42:06 simech-1 kernel: Neighbour table overflow.
Mar  2 13:42:06 simech-1 kernel:last message repeated 9 times
Mar  2 13:42:06 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:42:11 simech-1 kernel:last message repeated 10 times
Mar  2 13:42:11 simech-1 kernel: __ratelimit: 11875 callbacks suppressed
Mar  2 13:42:11 simech-1 kernel: Neighbour table overflow.
Mar  2 13:42:11 simech-1 kernel:last message repeated 9 times
Mar  2 13:42:11 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.
Mar  2 13:42:16 simech-1 kernel:last message repeated 9 times
Mar  2 13:42:16 simech-1 kernel: __ratelimit: 9362 callbacks suppressed
Mar  2 13:42:16 simech-1 kernel: Neighbour table overflow.
Mar  2 13:42:17 simech-1 kernel:last message repeated 9 times
Mar  2 13:42:17 simech-1 kernel: Returning ERESTARTSYS in ipv6-addrconf-fixup-forwarding to avoid deadlock.

Thanks,
Ben

Patrick McHardy March 2, 2009, 10:20 p.m. UTC | #10

Ben Greear wrote:
> With both of Stephen's patches included in the latest -rc6 source,
> I re-ran the test and it seems to be working (I added printks so
> that I would know the new code was being exercise
> 
> I had 2000 or so mac-vlans configured, with 10 of them being
> re-configured concurrently, while also deleting groups of 20-100
> mac-vlans in my test.  This was locking up reliably before,
> and now it seems to be working fine.
> 
> Here's the kernel log showing the ERESTARTSYS in action.  I don't
> have an easy way to check to see if the VFS (or whatever) retried
> the call properly, but will let you all know if I see any indication
> that isn't working.
> 
> I only saw the ipv6 fixup in my logs, but maybe my test case just
> doesn't hit the other...

This looks like its working fine. Despite the non-desirable active
spinning, this seems like the best fix (actually much simpler than
I expected to be possible) at this time. If we just could avoid
the spinning when unnecessary, it would be perfect :)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Miller March 2, 2009, 10:47 p.m. UTC | #11

From: Patrick McHardy <kaber@trash.net>
Date: Mon, 02 Mar 2009 23:20:49 +0100

> Ben Greear wrote:
> > With both of Stephen's patches included in the latest -rc6 source,
> > I re-ran the test and it seems to be working (I added printks so
> > that I would know the new code was being exercise
> > I had 2000 or so mac-vlans configured, with 10 of them being
> > re-configured concurrently, while also deleting groups of 20-100
> > mac-vlans in my test.  This was locking up reliably before,
> > and now it seems to be working fine.
> > Here's the kernel log showing the ERESTARTSYS in action.  I don't
> > have an easy way to check to see if the VFS (or whatever) retried
> > the call properly, but will let you all know if I see any indication
> > that isn't working.
> > I only saw the ipv6 fixup in my logs, but maybe my test case just
> > doesn't hit the other...
> 
> This looks like its working fine. Despite the non-desirable active
> spinning, this seems like the best fix (actually much simpler than
> I expected to be possible) at this time. If we just could avoid
> the spinning when unnecessary, it would be perfect :)

Could you give that "not actually in-progress" detection a shot?

I don't like the spinning either.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patrick McHardy March 2, 2009, 11:03 p.m. UTC | #12

David Miller wrote:
> From: Patrick McHardy <kaber@trash.net>
> Date: Mon, 02 Mar 2009 23:20:49 +0100
> 
>> This looks like its working fine. Despite the non-desirable active
>> spinning, this seems like the best fix (actually much simpler than
>> I expected to be possible) at this time. If we just could avoid
>> the spinning when unnecessary, it would be perfect :)
> 
> Could you give that "not actually in-progress" detection a shot?
> 
> I don't like the spinning either.


I tried this morning, the problem is that its always the sysctl
handler which will run into the deadlock first, but there is no
reliable indication to avoid it other than that the RTNL is
already held. The problem is that the sysctl interface puts the
process holding the RTNL to sleep and allows a process requiring
it to run. Any different synchronization attempt will have the
same problem, it seems you simply can't hold any locks while
unregistering sysctls.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Miller March 3, 2009, 8:48 a.m. UTC | #13

From: Patrick McHardy <kaber@trash.net>
Date: Tue, 03 Mar 2009 00:03:00 +0100

> David Miller wrote:
> > From: Patrick McHardy <kaber@trash.net>
> > Date: Mon, 02 Mar 2009 23:20:49 +0100
> > 
> >> This looks like its working fine. Despite the non-desirable active
> >> spinning, this seems like the best fix (actually much simpler than
> >> I expected to be possible) at this time. If we just could avoid
> >> the spinning when unnecessary, it would be perfect :)
> > Could you give that "not actually in-progress" detection a shot?
> > I don't like the spinning either.
> 
> I tried this morning, the problem is that its always the sysctl
> handler which will run into the deadlock first, but there is no
> reliable indication to avoid it other than that the RTNL is
> already held. The problem is that the sysctl interface puts the
> process holding the RTNL to sleep and allows a process requiring
> it to run. Any different synchronization attempt will have the
> same problem, it seems you simply can't hold any locks while
> unregistering sysctls.

Ok, I applied Stephen's patches then...
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman March 8, 2009, 3:36 a.m. UTC | #14

David Miller <davem@davemloft.net> writes:

> From: Patrick McHardy <kaber@trash.net>
> Date: Tue, 03 Mar 2009 00:03:00 +0100
>
>> David Miller wrote:
>> > From: Patrick McHardy <kaber@trash.net>
>> > Date: Mon, 02 Mar 2009 23:20:49 +0100
>> > 
>> >> This looks like its working fine. Despite the non-desirable active
>> >> spinning, this seems like the best fix (actually much simpler than
>> >> I expected to be possible) at this time. If we just could avoid
>> >> the spinning when unnecessary, it would be perfect :)
>> > Could you give that "not actually in-progress" detection a shot?
>> > I don't like the spinning either.
>> 
>> I tried this morning, the problem is that its always the sysctl
>> handler which will run into the deadlock first, but there is no
>> reliable indication to avoid it other than that the RTNL is
>> already held. The problem is that the sysctl interface puts the
>> process holding the RTNL to sleep and allows a process requiring
>> it to run. Any different synchronization attempt will have the
>> same problem, it seems you simply can't hold any locks while
>> unregistering sysctls.
>
> Ok, I applied Stephen's patches then...

From the further brainstorming department...

It appears we are using the wait for the completion
for two distinct purposes.

Waiting until it is safe to free storage.

Blocking module unregistration until the all of the
users of the code are gone.

I'm wondering if we could move the memory freeing.

And consolidate the waiting ultimately moving the
wait for completion into netdev_run_todo after
we drop the lock.

The reason I care is that it looks like to get hotplug
handling sane, I'm going to need to implement a generic
version of what sysfs, proc and sysctl are doing, and I
expect that generic version belongs in the vfs ultimately
giving us the capability to implement revoke.

So I am going to be revisiting how the code works, and if
I can come up with a cleaner solution to the networking
stack it would be a good time to implement it.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

IPv4/IPv6 sysctl unregistration deadlock

Commit Message

Comments

Patch