NFS TCP race condition with SOCK_ASYNC_NOSPACE

Message ID 4ECA94F9.4090503@citrix.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Andrew Cooper Nov. 21, 2011, 6:14 p.m. UTC
Following some debugging, I believe that the attached patch fixes the
problem.

Simply returning EAGAIN is not sufficient, as the task does not get
requeued, and times out 13 seconds later (as per our mount options). 
Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.

I realize that this is a gross hack and I should probably not be using
SOCK_ASYNC_NOSPACE in that way.  Is there a better way to achieve the
same solution?

Comments

Trond Myklebust Nov. 22, 2011, 11:38 a.m. UTC | #1
On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote: 
> Following some debugging, I believe that the attached patch fixes the
> problem.
> 
> Simply returning EAGAIN is not sufficient, as the task does not get
> requeued, and times out 13 seconds later (as per our mount options). 
> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
> 
> I realize that this is a gross hack and I should probably not be using
> SOCK_ASYNC_NOSPACE in that way.  Is there a better way to achieve the
> same solution?
> 

What you are doing will cause the request to be put to sleep with no
guarantee that it will ever be woken up. Why would we want to do that if
there is no report of a tcp window/buffer space congestion?
Andrew Cooper Nov. 22, 2011, 12:02 p.m. UTC | #2
On 22/11/11 11:38, Trond Myklebust wrote:
> On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote: 
>> Following some debugging, I believe that the attached patch fixes the
>> problem.
>>
>> Simply returning EAGAIN is not sufficient, as the task does not get
>> requeued, and times out 13 seconds later (as per our mount options). 
>> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
>>
>> I realize that this is a gross hack and I should probably not be using
>> SOCK_ASYNC_NOSPACE in that way.  Is there a better way to achieve the
>> same solution?
>>
> What you are doing will cause the request to be put to sleep with no
> guarantee that it will ever be woken up. Why would we want to do that if
> there is no report of a tcp window/buffer space congestion?

But the reason we get to this code is because there was a report of
space collision.  What would you suggest instead?  Changing
xs_{tcp,udp}_send_request() to retry in this case would defeat the point
of having xs_nospace().

What should happen is that the request gets re-queued to run at the next
available opportunity, rather than possibly sleeping for some length
of time.  At the moment, leaving SOCK_ASYNC_NOSPACE unset causes the
request to never be woken, whereas setting that bit seems to cause it to
always be re-queued at some near point in the future.
Trond Myklebust Nov. 22, 2011, 12:10 p.m. UTC | #3
On Tue, 2011-11-22 at 12:02 +0000, Andrew Cooper wrote: 
> On 22/11/11 11:38, Trond Myklebust wrote:
> > On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote: 
> >> Following some debugging, I believe that the attached patch fixes the
> >> problem.
> >>
> >> Simply returning EAGAIN is not sufficient, as the task does not get
> >> requeued, and times out 13 seconds later (as per our mount options). 
> >> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
> >>
> >> I realize that this is a gross hack and I should probably not be using
> >> SOCK_ASYNC_NOSPACE in that way.  Is there a better way to achieve the
> >> same solution?
> >>
> > What you are doing will cause the request to be put to sleep with no
> > guarantee that it will ever be woken up. Why would we want to do that if
> > there is no report of a tcp window/buffer space congestion?
> 
> But the reason we get to this code is because there was a report of
> space collision.  What would you suggest instead?  Changing
> xs_{tcp,udp}_send_request() to retry in this case would defeat the point
> of having xs_nospace().

I suggest doing absolutely nothing: do what you originally proposed,
which is to report the EAGAIN so that the client state machine retries
the socket write.

My point is that this is a context which is _not_ atomic with the
original report of tcp window/buffer space congestion. There are no
locks or anything else that will guarantee that the congestion still
exists, and the fact that the SOCK_ASYNC_NOSPACE flag is now clear
indicates that this is the case.
The whole purpose of xs_nospace() is to wait until a congestion
condition clears. If the congestion clears before we get here, then we
have no reason to do anything special other than retry.

Trond
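
The non-atomicity described above can be illustrated with a small user-space model. This is only a sketch: `model_sock`, `model_send()`, `model_write_space()` and `model_nospace()` are hypothetical names, not kernel functions, and the model assumes (as a simplification) that the flag is raised on a failed send and cleared when buffer space returns. It shows congestion being reported, clearing before the no-space handler runs, and a plain retry then succeeding.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a socket; not the kernel's struct socket. */
struct model_sock {
	bool async_nospace;   /* models SOCK_ASYNC_NOSPACE */
	int  free_bytes;      /* models free send-buffer space */
};

/* Models a send attempt: fails with -EAGAIN and raises the flag when full. */
static int model_send(struct model_sock *s, int len)
{
	if (s->free_bytes < len) {
		s->async_nospace = true;
		return -EAGAIN;
	}
	s->free_bytes -= len;
	return len;
}

/* Models the write-space notification: space returns, flag is cleared. */
static void model_write_space(struct model_sock *s, int freed)
{
	s->free_bytes += freed;
	s->async_nospace = false;
}

/*
 * Models the suggested handling: always report -EAGAIN, and only ask the
 * caller to sleep while the congestion flag is still set.
 */
static int model_nospace(const struct model_sock *s, bool *must_sleep)
{
	*must_sleep = s->async_nospace;
	return -EAGAIN;
}

/* Runs the racy sequence and returns the result of the final send. */
int model_race_demo(void)
{
	struct model_sock s = { .async_nospace = false, .free_bytes = 0 };
	bool must_sleep = false;
	int ret;

	ret = model_send(&s, 100);       /* congestion is reported here ... */
	assert(ret == -EAGAIN);
	model_write_space(&s, 4096);     /* ... but clears before we react */

	ret = model_nospace(&s, &must_sleep);
	if (ret == -EAGAIN && !must_sleep)
		ret = model_send(&s, 100);   /* plain retry, no sleeping needed */
	return ret;
}
```

In this model, `model_race_demo()` returns the byte count of the retried send, because nothing put the task to sleep between the cleared flag and the retry.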
Andrew Cooper Nov. 22, 2011, 12:16 p.m. UTC | #4
On 22/11/11 12:10, Trond Myklebust wrote:
> On Tue, 2011-11-22 at 12:02 +0000, Andrew Cooper wrote: 
>> On 22/11/11 11:38, Trond Myklebust wrote:
>>> On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote: 
>>>> Following some debugging, I believe that the attached patch fixes the
>>>> problem.
>>>>
>>>> Simply returning EAGAIN is not sufficient, as the task does not get
>>>> requeued, and times out 13 seconds later (as per our mount options). 
>>>> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
>>>>
>>>> I realize that this is a gross hack and I should probably not be using
>>>> SOCK_ASYNC_NOSPACE in that way.  Is there a better way to achieve the
>>>> same solution?
>>>>
>>> What you are doing will cause the request to be put to sleep with no
>>> guarantee that it will ever be woken up. Why would we want to do that if
>>> there is no report of a tcp window/buffer space congestion?
>> But the reason we get to this code is because there was a report of
>> space collision.  What would you suggest instead?  Changing
>> xs_{tcp,udp}_send_request() to retry in this case would defeat the point
>> of having xs_nospace().
> I suggest doing absolutely nothing: do what you originally proposed,
> which is to report the EAGAIN so that the client state machine retries
> the socket write.
>
> My point is that this is a context which is _not_ atomic with the
> original report of tcp window/buffer space congestion. There are no
> locks or anything else that will guarantee that the congestion still
> exists, and the fact that the SOCK_ASYNC_NOSPACE flag is now clear
> indicates that this is the case.
> The whole purpose of xs_nospace() is to wait until a congestion
> condition clears. If the congestion clears before we get here, then we
> have no reason to do anything special other than retry.
>
> Trond

I am slightly confused as to what you mean now.

When you take out the if(test_bit test and always set ret to EAGAIN and
requeue the request, the next time it wakes up is when it is killed due
to timeout.  This results in substantially worse effects for the
userspace, as the NFS session is killed.

Did you mean something else when you said "always report EAGAIN"?
Trond Myklebust Nov. 22, 2011, 12:22 p.m. UTC | #5
On Tue, 2011-11-22 at 12:16 +0000, Andrew Cooper wrote: 
> On 22/11/11 12:10, Trond Myklebust wrote:
> > On Tue, 2011-11-22 at 12:02 +0000, Andrew Cooper wrote: 
> >> On 22/11/11 11:38, Trond Myklebust wrote:
> >>> On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote: 
> >>>> Following some debugging, I believe that the attached patch fixes the
> >>>> problem.
> >>>>
> >>>> Simply returning EAGAIN is not sufficient, as the task does not get
> >>>> requeued, and times out 13 seconds later (as per our mount options). 
> >>>> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
> >>>>
> >>>> I realize that this is a gross hack and I should probably not be using
> >>>> SOCK_ASYNC_NOSPACE in that way.  Is there a better way to achieve the
> >>>> same solution?
> >>>>
> >>> What you are doing will cause the request to be put to sleep with no
> >>> guarantee that it will ever be woken up. Why would we want to do that if
> >>> there is no report of a tcp window/buffer space congestion?
> >> But the reason we get to this code is because there was a report of
> >> space collision.  What would you suggest instead?  Changing
> >> xs_{tcp,udp}_send_request() to retry in this case would defeat the point
> >> of having xs_nospace().
> > I suggest doing absolutely nothing: do what you originally proposed,
> > which is to report the EAGAIN so that the client state machine retries
> > the socket write.
> >
> > My point is that this is a context which is _not_ atomic with the
> > original report of tcp window/buffer space congestion. There are no
> > locks or anything else that will guarantee that the congestion still
> > exists, and the fact that the SOCK_ASYNC_NOSPACE flag is now clear
> > indicates that this is the case.
> > The whole purpose of xs_nospace() is to wait until a congestion
> > condition clears. If the congestion clears before we get here, then we
> > have no reason to do anything special other than retry.
> >
> > Trond
> 
> I am slightly confused as to what you mean now.
> 
> When you take out the if(test_bit test and always set ret to EAGAIN and
> requeue the request, the next time it wakes up is when it is killed due
> to timeout.  This results in substantially worse effects for the
> userspace, as the NFS session is killed.

What is putting the request to sleep? It should be awake when it enters
xs_nospace(), and nothing in or after that function should be putting it
to sleep until we've retried with call_transmit().

> Did you mean something else when you said "always report EAGAIN"?

Nope.
Andrew Cooper Nov. 22, 2011, 12:34 p.m. UTC | #6
On 22/11/11 12:22, Trond Myklebust wrote:
> On Tue, 2011-11-22 at 12:16 +0000, Andrew Cooper wrote: 
>> On 22/11/11 12:10, Trond Myklebust wrote:
>>> On Tue, 2011-11-22 at 12:02 +0000, Andrew Cooper wrote: 
>>>> On 22/11/11 11:38, Trond Myklebust wrote:
>>>>> On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote: 
>>>>>> Following some debugging, I believe that the attached patch fixes the
>>>>>> problem.
>>>>>>
>>>>>> Simply returning EAGAIN is not sufficient, as the task does not get
>>>>>> requeued, and times out 13 seconds later (as per our mount options). 
>>>>>> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
>>>>>>
>>>>>> I realize that this is a gross hack and I should probably not be using
>>>>>> SOCK_ASYNC_NOSPACE in that way.  Is there a better way to achieve the
>>>>>> same solution?
>>>>>>
>>>>> What you are doing will cause the request to be put to sleep with no
>>>>> guarantee that it will ever be woken up. Why would we want to do that if
>>>>> there is no report of a tcp window/buffer space congestion?
>>>> But the reason we get to this code is because there was a report of
>>>> space collision.  What would you suggest instead?  Changing
>>>> xs_{tcp,udp}_send_request() to retry in this case would defeat the point
>>>> of having xs_nospace().
>>> I suggest doing absolutely nothing: do what you originally proposed,
>>> which is to report the EAGAIN so that the client state machine retries
>>> the socket write.
>>>
>>> My point is that this is a context which is _not_ atomic with the
>>> original report of tcp window/buffer space congestion. There are no
>>> locks or anything else that will guarantee that the congestion still
>>> exists, and the fact that the SOCK_ASYNC_NOSPACE flag is now clear
>>> indicates that this is the case.
>>> The whole purpose of xs_nospace() is to wait until a congestion
>>> condition clears. If the congestion clears before we get here, then we
>>> have no reason to do anything special other than retry.
>>>
>>> Trond
>> I am slightly confused as to what you mean now.
>>
>> When you take out the if(test_bit test and always set ret to EAGAIN and
>> requeue the request, the next time it wakes up is when it is killed due
>> to timeout.  This results in substantially worse effects for the
>> userspace, as the NFS session is killed.
> What is putting the request to sleep? It should be awake when it enters
> xs_nospace(), and nothing in or after that function should be putting it
> to sleep until we've retried with call_transmit().
>

I presume it is the call to xprt_wait_for_buffer_space(), which calls
rpc_sleep_on().  xs_tcp_write_space() appears to wake it up via
sk->sk_write_space, which is triggered when the sock gains more
space; in this specific case that has already happened.

Sorry if I am being a bit slow here - I am still learning my way round
an unfamiliar codebase.

>> Did you mean something else when you said "always report EAGAIN"?
> Nope.
>
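
The ordering problem described in this message (the wake-up event firing before the task goes to sleep) can be modeled in user space. This is a sketch under stated assumptions: `model_task`, `model_waitq` and the `model_*()` functions are hypothetical and stand in for the RPC task, its wait queue, the write-space callback and the RPC timeout; none of them are kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's rpc_task or rpc_wait_queue. */
struct model_task  { bool asleep; bool timed_out; };
struct model_waitq { struct model_task *sleeper; };

/* Models rpc_sleep_on(): park the task on the queue. */
static void model_sleep_on(struct model_waitq *q, struct model_task *t)
{
	t->asleep = true;
	q->sleeper = t;
}

/* Models the write-space callback: wakes a sleeper only if one is queued. */
static void model_write_space(struct model_waitq *q)
{
	if (q->sleeper) {
		q->sleeper->asleep = false;
		q->sleeper = NULL;
	}
}

/* Models the RPC timeout: wakes whoever is still queued and marks it. */
static void model_timeout(struct model_waitq *q)
{
	if (q->sleeper) {
		q->sleeper->asleep = false;
		q->sleeper->timed_out = true;
		q->sleeper = NULL;
	}
}

/* Returns true if the task was woken by the write-space event, not the timer. */
bool model_woken_by_event(bool event_before_sleep)
{
	struct model_waitq q = { NULL };
	struct model_task t = { false, false };

	if (event_before_sleep) {
		model_write_space(&q);   /* nobody queued yet: wake-up is lost */
		model_sleep_on(&q, &t);
	} else {
		model_sleep_on(&q, &t);
		model_write_space(&q);   /* normal case: sleeper is woken */
	}
	model_timeout(&q);           /* no-op if the task was already woken */
	return !t.timed_out;
}
```

With `event_before_sleep` set, the only thing left to wake the task in this model is the timeout, which matches the symptom reported earlier in the thread.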

Patch

diff -r 69bd2176baf9 net/sunrpc/xprtsock.c
--- a/net/sunrpc/xprtsock.c	Mon Nov 07 13:00:06 2011 +0000
+++ b/net/sunrpc/xprtsock.c	Mon Nov 21 18:00:14 2011 +0000
@@ -503,17 +503,16 @@  static int xs_nospace(struct rpc_task *t
 
 	/* Don't race with disconnect */
 	if (xprt_connected(xprt)) {
-		if (test_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags)) {
-			ret = -EAGAIN;
-			/*
-			 * Notify TCP that we're limited by the application
-			 * window size
-			 */
-			set_bit(SOCK_NOSPACE, &transport->sock->flags);
-			transport->inet->sk_write_pending++;
-			/* ...and wait for more buffer space */
-			xprt_wait_for_buffer_space(task, xs_nospace_callback);
-		}
+		set_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags);
+		ret = -EAGAIN;
+		/*
+		 * Notify TCP that we're limited by the application
+		 * window size
+		 */
+		set_bit(SOCK_NOSPACE, &transport->sock->flags);
+		transport->inet->sk_write_pending++;
+		/* ...and wait for more buffer space */
+		xprt_wait_for_buffer_space(task, xs_nospace_callback);
 	} else {
 		clear_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags);
 		ret = -ENOTCONN;
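
To compare the two sides of the hunk above without a kernel build, the branch logic can be mirrored in a user-space model. This is a sketch, not the kernel code: `model_xprt` and the `model_*()` functions are hypothetical, locking is omitted, and the initial value of `ret` (0 here) is an assumption because it lies outside the lines shown in the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the transport state; not a kernel type. */
struct model_xprt {
	bool connected;       /* models xprt_connected(xprt) */
	bool async_nospace;   /* models SOCK_ASYNC_NOSPACE */
	bool nospace;         /* models SOCK_NOSPACE */
	int  write_pending;   /* models sk_write_pending */
	bool task_sleeping;   /* models xprt_wait_for_buffer_space() having run */
};

/* Shared tail of both variants: flag the socket and put the task to sleep. */
static int model_wait_for_space(struct model_xprt *x)
{
	x->nospace = true;
	x->write_pending++;
	x->task_sleeping = true;
	return -EAGAIN;
}

/* Mirrors the removed ("-") lines: sleep only if the flag is still set. */
static int model_nospace_before(struct model_xprt *x)
{
	int ret = 0; /* assumption: the initial value is outside the hunk */

	if (x->connected) {
		if (x->async_nospace)
			ret = model_wait_for_space(x);
	} else {
		x->async_nospace = false;
		ret = -ENOTCONN;
	}
	return ret;
}

/* Mirrors the added ("+") lines: force the flag on, then always sleep. */
static int model_nospace_after(struct model_xprt *x)
{
	int ret = 0; /* same assumption as above */

	if (x->connected) {
		x->async_nospace = true;
		ret = model_wait_for_space(x);
	} else {
		x->async_nospace = false;
		ret = -ENOTCONN;
	}
	return ret;
}

/* Connected transport whose congestion flag is already clear on entry. */
bool model_sleeps_when_flag_clear(bool patched)
{
	struct model_xprt x = { .connected = true };
	int ret = patched ? model_nospace_after(&x) : model_nospace_before(&x);

	assert(ret == (patched ? -EAGAIN : 0));
	return x.task_sleeping;
}
```

For the case debated in the comments (connected, flag already clear), the pre-patch model neither sleeps nor reports -EAGAIN, while the patched model reports -EAGAIN and sleeps unconditionally. The model says nothing about whether anything later wakes the task.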