Patchwork 2.6.30-rc deadline scheduler performance regression for iozone over NFS

Submitter Trond Myklebust
Date May 13, 2009, 11:45 p.m.
Message ID <1242258338.5407.244.camel@heimdal.trondhjem.org>
Permalink /patch/27185/
State Not Applicable
Delegated to: David Miller

Comments

Trond Myklebust - May 13, 2009, 11:45 p.m.
On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
> Hi, netdev folks.  The summary here is:
> 
> A patch added in the 2.6.30 development cycle caused a performance
> regression in my NFS iozone testing.  The patch in question is the
> following:
> 
> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
> Author: Olga Kornievskaia <aglo@citi.umich.edu>
> Date:   Tue Oct 21 14:13:47 2008 -0400
> 
>     svcrpc: take advantage of tcp autotuning
>  
> which is also quoted below.  Using 8 nfsd threads, a single client doing
> 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
> KB/s under 2.6.30-rc4.  I also see more run to run variation under
> 2.6.30-rc4 using the deadline I/O scheduler on the server.  That
> variation disappears (as does the performance regression) when reverting
> the above commit.

It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
function. I can see no reason why we should stop processing new incoming
RPC requests just because the send buffer happens to be 2/3 full. If we
see that we have space for another reply, then we should just go for it.
OTOH, we do want to ensure that the SOCK_NOSPACE flag remains set, so
that the TCP layer knows that we're congested, and that we'd like it to
increase the send window size, please.

Could you therefore please see if the following (untested) patch helps?

Cheers
  Trond
---------------------------------------------------------------------
From 1545cbda1b1cda2500cb9db3c760a05fc4f6ed4d Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Wed, 13 May 2009 19:44:58 -0400
Subject: [PATCH] SUNRPC: Fix the TCP server's send buffer accounting

Currently, the sunrpc server is refusing to allow us to process new RPC
calls if the TCP send buffer is 2/3 full, even if we do actually have
enough free space to guarantee that we can send another request.
The following patch fixes svc_tcp_has_wspace() so that we only stop
processing requests if we know that the socket buffer cannot possibly fit
another reply.

It also fixes the tcp write_space() callback so that we only clear the
SOCK_NOSPACE flag when the TCP send buffer is less than 2/3 full.
This should ensure that the send window will grow as per the standard TCP
socket code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 net/sunrpc/svcsock.c |   32 ++++++++++++++++----------------
 1 files changed, 16 insertions(+), 16 deletions(-)

Jeff Moyer - May 14, 2009, 1:34 p.m.
Trond Myklebust <trond.myklebust@fys.uio.no> writes:

> On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
>> Hi, netdev folks.  The summary here is:
>> 
>> A patch added in the 2.6.30 development cycle caused a performance
>> regression in my NFS iozone testing.  The patch in question is the
>> following:
>> 
>> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
>> Author: Olga Kornievskaia <aglo@citi.umich.edu>
>> Date:   Tue Oct 21 14:13:47 2008 -0400
>> 
>>     svcrpc: take advantage of tcp autotuning
>>  
>> which is also quoted below.  Using 8 nfsd threads, a single client doing
>> 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
>> KB/s under 2.6.30-rc4.  I also see more run to run variation under
>> 2.6.30-rc4 using the deadline I/O scheduler on the server.  That
>> variation disappears (as does the performance regression) when reverting
>> the above commit.
>
> It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
> function. I can see no reason why we should stop processing new incoming
> RPC requests just because the send buffer happens to be 2/3 full. If we
> see that we have space for another reply, then we should just go for it.
> OTOH, we do want to ensure that the SOCK_NOSPACE flag remains set, so
> that the TCP layer knows that we're congested, and that we'd like it to
> increase the send window size, please.
>
> Could you therefore please see if the following (untested) patch helps?

I'm seeing slightly better results with the patch:

71548
75987
71557
87432
83538

But that's still not up to the speeds we saw under 2.6.29.  The packet
capture for one run can be found here:
  http://people.redhat.com/jmoyer/trond.pcap.bz2

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

J. Bruce Fields - May 14, 2009, 5:55 p.m.
On Wed, May 13, 2009 at 07:45:38PM -0400, Trond Myklebust wrote:
> On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
> > Hi, netdev folks.  The summary here is:
> > 
> > A patch added in the 2.6.30 development cycle caused a performance
> > regression in my NFS iozone testing.  The patch in question is the
> > following:
> > 
> > commit 47a14ef1af48c696b214ac168f056ddc79793d0e
> > Author: Olga Kornievskaia <aglo@citi.umich.edu>
> > Date:   Tue Oct 21 14:13:47 2008 -0400
> > 
> >     svcrpc: take advantage of tcp autotuning
> >  
> > which is also quoted below.  Using 8 nfsd threads, a single client doing
> > 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
> > KB/s under 2.6.30-rc4.  I also see more run to run variation under
> > 2.6.30-rc4 using the deadline I/O scheduler on the server.  That
> > variation disappears (as does the performance regression) when reverting
> > the above commit.
> 
> It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
> function. I can see no reason why we should stop processing new incoming
> RPC requests just because the send buffer happens to be 2/3 full. If we

I agree, the calculation doesn't look right.  But where do you get the
2/3 number from?

...
> @@ -964,23 +973,14 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
>  	struct svc_sock *svsk =	container_of(xprt, struct svc_sock, sk_xprt);
>  	struct svc_serv	*serv = svsk->sk_xprt.xpt_server;
>  	int required;
> -	int wspace;
> -
> -	/*
> -	 * Set the SOCK_NOSPACE flag before checking the available
> -	 * sock space.
> -	 */
> -	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> -	required = atomic_read(&svsk->sk_xprt.xpt_reserved) + serv->sv_max_mesg;
> -	wspace = sk_stream_wspace(svsk->sk_sk);
> -
> -	if (wspace < sk_stream_min_wspace(svsk->sk_sk))
> -		return 0;
> -	if (required * 2 > wspace)
> -		return 0;
>  
> -	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> +	required = (atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg) * 2;
> +	if (sk_stream_wspace(svsk->sk_sk) < required)

This calculation looks the same before and after--you've just moved the
"*2" into the calculation of "required".  Am I missing something?  Maybe
you meant to write:

	required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg * 2;

without the parentheses?

That looks closer, assuming the calculation is meant to be:

		atomic_read(..) == amount of buffer space we think we
			already need
		serv->sv_max_mesg * 2 == space for worst-case request
			and reply?

--b.

> +		goto out_nospace;
>  	return 1;
> +out_nospace:
> +	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> +	return 0;
>  }

Trond Myklebust - May 14, 2009, 6:26 p.m.
On Thu, 2009-05-14 at 13:55 -0400, J. Bruce Fields wrote:
> On Wed, May 13, 2009 at 07:45:38PM -0400, Trond Myklebust wrote:
> > On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
> > > Hi, netdev folks.  The summary here is:
> > > 
> > > A patch added in the 2.6.30 development cycle caused a performance
> > > regression in my NFS iozone testing.  The patch in question is the
> > > following:
> > > 
> > > commit 47a14ef1af48c696b214ac168f056ddc79793d0e
> > > Author: Olga Kornievskaia <aglo@citi.umich.edu>
> > > Date:   Tue Oct 21 14:13:47 2008 -0400
> > > 
> > >     svcrpc: take advantage of tcp autotuning
> > >  
> > > which is also quoted below.  Using 8 nfsd threads, a single client doing
> > > 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
> > > KB/s under 2.6.30-rc4.  I also see more run to run variation under
> > > 2.6.30-rc4 using the deadline I/O scheduler on the server.  That
> > > variation disappears (as does the performance regression) when reverting
> > > the above commit.
> > 
> > It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
> > function. I can see no reason why we should stop processing new incoming
> > RPC requests just because the send buffer happens to be 2/3 full. If we
> 
> I agree, the calculation doesn't look right.  But where do you get the
> 2/3 number from?

That's the sk_stream_wspace() vs. sk_stream_min_wspace() comparison.

> ...
> > @@ -964,23 +973,14 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
> >  	struct svc_sock *svsk =	container_of(xprt, struct svc_sock, sk_xprt);
> >  	struct svc_serv	*serv = svsk->sk_xprt.xpt_server;
> >  	int required;
> > -	int wspace;
> > -
> > -	/*
> > -	 * Set the SOCK_NOSPACE flag before checking the available
> > -	 * sock space.
> > -	 */
> > -	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> > -	required = atomic_read(&svsk->sk_xprt.xpt_reserved) + serv->sv_max_mesg;
> > -	wspace = sk_stream_wspace(svsk->sk_sk);
> > -
> > -	if (wspace < sk_stream_min_wspace(svsk->sk_sk))
> > -		return 0;
> > -	if (required * 2 > wspace)
> > -		return 0;
> >  
> > -	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> > +	required = (atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg) * 2;
> > +	if (sk_stream_wspace(svsk->sk_sk) < required)
> 
> This calculation looks the same before and after--you've just moved the
> "*2" into the calculation of "required".  Am I missing something?  Maybe
> you meant to write:
> 
> 	required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg * 2;
> 
> without the parentheses?

I wasn't trying to change that part of the calculation. I'm just
splitting out the stuff which has to do with TCP congestion (i.e. the
window size), and stuff which has to do with remaining socket buffer
space. I do, however, agree that we should probably drop that *2.
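
The two forms of "required" discussed in this exchange differ only in
what gets doubled. The following is a minimal user-space sketch of that
arithmetic, not kernel code; the function names and the plain int
parameters (standing in for atomic_read(&xprt->xpt_reserved) and
serv->sv_max_mesg) are hypothetical.

```c
/*
 * Form used in the patch (and, in effect, in the old code):
 * doubles both the space already reserved and the new reply.
 */
static int required_patch(int reserved, int max_mesg)
{
	return (reserved + max_mesg) * 2;
}

/*
 * Form suggested in review: the reserved space is counted once,
 * only the worst-case message size is doubled.
 */
static int required_review(int reserved, int max_mesg)
{
	return reserved + max_mesg * 2;
}
```

The two agree only when nothing is reserved; with any outstanding
reservation the patch form demands "reserved" more bytes of free send
buffer than the review form.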

However there is (as usual) 'interesting behaviour' when it comes to
deferred requests. Their buffer space is already accounted for in the
'xpt_reserved' calculation, but they cannot get re-scheduled unless
svc_tcp_has_wspace() thinks it has enough free socket space for yet
another reply. Can you spell 'deadlock', children?

Trond
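
The deferred-request problem described above can be illustrated with a
deliberately simplified user-space model. This is a sketch under stated
assumptions, not the kernel's accounting: the function names are
hypothetical, and "wspace", "reserved" and "max_mesg" are plain ints
standing in for sk_stream_wspace(), xpt_reserved and sv_max_mesg.

```c
#include <stdbool.h>

/*
 * Models the space check in svc_tcp_has_wspace(): it always asks for
 * room for one more worst-case reply on top of what is reserved.
 */
static bool model_has_wspace(int wspace, int reserved, int max_mesg)
{
	int required = (reserved + max_mesg) * 2;

	return wspace >= required;
}

/*
 * Assumption for illustration: a deferred request needs no new space,
 * because its reply is already included in 'reserved'.
 */
static bool model_deferred_fits(int wspace, int reserved)
{
	return wspace >= reserved;
}
```

With wspace = 250, reserved = 100 and max_mesg = 100, the deferred
request's own reply fits, yet the check asks for 400 bytes and refuses to
schedule it. If that reservation is only released once the request runs,
nothing in this model ever frees it.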

J. Bruce Fields - May 15, 2009, 9:37 p.m.
On Thu, May 14, 2009 at 02:26:09PM -0400, Trond Myklebust wrote:
> On Thu, 2009-05-14 at 13:55 -0400, J. Bruce Fields wrote:
> > On Wed, May 13, 2009 at 07:45:38PM -0400, Trond Myklebust wrote:
> > > On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
> > > > Hi, netdev folks.  The summary here is:
> > > > 
> > > > A patch added in the 2.6.30 development cycle caused a performance
> > > > regression in my NFS iozone testing.  The patch in question is the
> > > > following:
> > > > 
> > > > commit 47a14ef1af48c696b214ac168f056ddc79793d0e
> > > > Author: Olga Kornievskaia <aglo@citi.umich.edu>
> > > > Date:   Tue Oct 21 14:13:47 2008 -0400
> > > > 
> > > >     svcrpc: take advantage of tcp autotuning
> > > >  
> > > > which is also quoted below.  Using 8 nfsd threads, a single client doing
> > > > 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
> > > > KB/s under 2.6.30-rc4.  I also see more run to run variation under
> > > > 2.6.30-rc4 using the deadline I/O scheduler on the server.  That
> > > > variation disappears (as does the performance regression) when reverting
> > > > the above commit.
> > > 
> > > It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
> > > function. I can see no reason why we should stop processing new incoming
> > > RPC requests just because the send buffer happens to be 2/3 full. If we
> > 
> > I agree, the calculation doesn't look right.  But where do you get the
> > 2/3 number from?
> 
> That's the sk_stream_wspace() vs. sk_stream_min_wspace() comparison.

Oh, I see, so looking at their implementations,

	sk_stream_wspace(sk) < sk_stream_min_wspace(sk)

is equivalent to sk_sndbuf - sk_wmem_queued < sk_wmem_queued/2, or
sk_wmem_queued > 2/3 sk_sndbuf, got it.  I didn't understand that the point
of this patch was just to move that calculation around--now I see.

--b.
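
The 2/3 figure can be checked with a small user-space model of the two
helpers. This is a sketch, not the kernel source: struct model_sock and
the model_* names are hypothetical stand-ins, and the bodies assume
sk_stream_wspace() is sk_sndbuf - sk_wmem_queued and
sk_stream_min_wspace() is sk_wmem_queued / 2, as described in this thread.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two struct sock fields used here. */
struct model_sock {
	int sndbuf;       /* models sk->sk_sndbuf */
	int wmem_queued;  /* models sk->sk_wmem_queued */
};

/* Free send-buffer space, modelled on sk_stream_wspace(). */
static int model_wspace(const struct model_sock *sk)
{
	return sk->sndbuf - sk->wmem_queued;
}

/* Wakeup threshold, modelled on sk_stream_min_wspace(). */
static int model_min_wspace(const struct model_sock *sk)
{
	return sk->wmem_queued >> 1;
}

/* True when wspace < min_wspace, i.e. more than 2/3 of sndbuf is queued. */
static bool model_over_two_thirds(const struct model_sock *sk)
{
	return model_wspace(sk) < model_min_wspace(sk);
}
```

For a 90000-byte send buffer the comparison flips just above 60000
queued bytes, which is two thirds of the buffer.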

Patch

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index af31988..8962355 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -386,6 +386,15 @@  static void svc_write_space(struct sock *sk)
 	}
 }
 
+static void svc_tcp_write_space(struct sock *sk)
+{
+	struct socket *sock = sk->sk_socket;
+
+	if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) && sock)
+		clear_bit(SOCK_NOSPACE, &sock->flags);
+	svc_write_space(sk);
+}
+
 /*
  * Copy the UDP datagram's destination address to the rqstp structure.
  * The 'destination' address in this case is the address to which the
@@ -964,23 +973,14 @@  static int svc_tcp_has_wspace(struct svc_xprt *xprt)
 	struct svc_sock *svsk =	container_of(xprt, struct svc_sock, sk_xprt);
 	struct svc_serv	*serv = svsk->sk_xprt.xpt_server;
 	int required;
-	int wspace;
-
-	/*
-	 * Set the SOCK_NOSPACE flag before checking the available
-	 * sock space.
-	 */
-	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
-	required = atomic_read(&svsk->sk_xprt.xpt_reserved) + serv->sv_max_mesg;
-	wspace = sk_stream_wspace(svsk->sk_sk);
-
-	if (wspace < sk_stream_min_wspace(svsk->sk_sk))
-		return 0;
-	if (required * 2 > wspace)
-		return 0;
 
-	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
+	required = (atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg) * 2;
+	if (sk_stream_wspace(svsk->sk_sk) < required)
+		goto out_nospace;
 	return 1;
+out_nospace:
+	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
+	return 0;
 }
 
 static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
@@ -1036,7 +1036,7 @@  static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 		dprintk("setting up TCP socket for reading\n");
 		sk->sk_state_change = svc_tcp_state_change;
 		sk->sk_data_ready = svc_tcp_data_ready;
-		sk->sk_write_space = svc_write_space;
+		sk->sk_write_space = svc_tcp_write_space;
 
 		svsk->sk_reclen = 0;
 		svsk->sk_tcplen = 0;
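
The behaviour the patch above aims for can be sketched as a user-space
model. This is an illustration only: struct model_sock, the model_*
functions and the boolean "nospace" field (standing in for the
SOCK_NOSPACE bit) are hypothetical, and the wakeup done by
svc_write_space() is left out.

```c
#include <stdbool.h>

struct model_sock {
	int sndbuf;       /* models sk->sk_sndbuf */
	int wmem_queued;  /* models sk->sk_wmem_queued */
	bool nospace;     /* models the SOCK_NOSPACE flag */
};

static int model_wspace(const struct model_sock *sk)
{
	return sk->sndbuf - sk->wmem_queued;
}

static int model_min_wspace(const struct model_sock *sk)
{
	return sk->wmem_queued >> 1;
}

/*
 * Models the patched svc_tcp_has_wspace(): refuse only when the free
 * space cannot hold the required bytes, and flag congestion then.
 */
static bool model_has_wspace(struct model_sock *sk, int reserved, int max_mesg)
{
	int required = (reserved + max_mesg) * 2;

	if (model_wspace(sk) < required) {
		sk->nospace = true;
		return false;
	}
	return true;
}

/*
 * Models svc_tcp_write_space(): the flag is cleared only once the
 * buffer has drained to 2/3 full or less.
 */
static void model_write_space(struct model_sock *sk)
{
	if (model_wspace(sk) >= model_min_wspace(sk))
		sk->nospace = false;
}
```

In this model a socket that is more than 2/3 full still accepts new work
as long as the reply fits, while the congestion flag, once set, stays set
until the buffer drains below the 2/3 mark.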