
sunrpc: use better NUMA affinities

Message ID 1311876249.2346.39.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet July 28, 2011, 6:04 p.m. UTC
Use NUMA aware allocations to reduce latencies and increase throughput.

sunrpc kthreads can use kthread_create_on_node() if pool_mode is
"percpu" or "pernode", and svc_prepare_thread()/svc_init_buffer() can
also take into account NUMA node affinity for memory allocations.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: "J. Bruce Fields" <bfields@fieldses.org>
CC: Neil Brown <neilb@suse.de>
CC: David Miller <davem@davemloft.net>
---
 fs/lockd/svc.c             |    2 +-
 fs/nfs/callback.c          |    2 +-
 include/linux/sunrpc/svc.h |    2 +-
 net/sunrpc/svc.c           |   33 ++++++++++++++++++++++++---------
 4 files changed, 27 insertions(+), 12 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

J. Bruce Fields July 29, 2011, 4:42 p.m. UTC | #1
On Thu, Jul 28, 2011 at 08:04:09PM +0200, Eric Dumazet wrote:
> Use NUMA aware allocations to reduce latencies and increase throughput.
> 
> sunrpc kthreads can use kthread_create_on_node() if pool_mode is
> "percpu" or "pernode", and svc_prepare_thread()/svc_init_buffer() can
> also take into account NUMA node affinity for memory allocations.
...
> @@ -662,14 +675,16 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
>  		nrservs--;
>  		chosen_pool = choose_pool(serv, pool, &state);
>  
> -		rqstp = svc_prepare_thread(serv, chosen_pool);
> +		node = svc_pool_map_get_node(chosen_pool->sp_id);
> +		rqstp = svc_prepare_thread(serv, chosen_pool, node);

The only correct value for the third argument there is
svc_pool_map_get_node(chosen_pool->sp_id), so let's have
svc_prepare_thread() call that itself.

Seems OK otherwise.

Any suggestions on how we should test this?

--b.

>  		if (IS_ERR(rqstp)) {
>  			error = PTR_ERR(rqstp);
>  			break;
>  		}
>  
>  		__module_get(serv->sv_module);
> -		task = kthread_create(serv->sv_function, rqstp, serv->sv_name);
> +		task = kthread_create_on_node(serv->sv_function, rqstp,
> +					      node, serv->sv_name);
>  		if (IS_ERR(task)) {
>  			error = PTR_ERR(task);
>  			module_put(serv->sv_module);
> 
> 
Eric Dumazet July 29, 2011, 6:02 p.m. UTC | #2
On Friday, July 29, 2011 at 12:42 -0400, J. Bruce Fields wrote:
> On Thu, Jul 28, 2011 at 08:04:09PM +0200, Eric Dumazet wrote:
> > Use NUMA aware allocations to reduce latencies and increase throughput.
> > 
> > sunrpc kthreads can use kthread_create_on_node() if pool_mode is
> > "percpu" or "pernode", and svc_prepare_thread()/svc_init_buffer() can
> > also take into account NUMA node affinity for memory allocations.
> ...
> > @@ -662,14 +675,16 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
> >  		nrservs--;
> >  		chosen_pool = choose_pool(serv, pool, &state);
> >  
> > -		rqstp = svc_prepare_thread(serv, chosen_pool);
> > +		node = svc_pool_map_get_node(chosen_pool->sp_id);
> > +		rqstp = svc_prepare_thread(serv, chosen_pool, node);
> 
> The only correct value for the third argument there is
> svc_pool_map_get_node(chosen_pool->sp_id), so let's have
> svc_prepare_thread() call that itself.
> 

I have no idea what you mean ;)

I need 'node' for the following kthread_create_on_node() call.


> Seems OK otherwise.
> 
> Any suggestions on how we should test this?

I did tests on my machine; it looks good.

I checked that the stacks were now correct using:
"echo t > /proc/sysrq-trigger"



J. Bruce Fields July 29, 2011, 6:08 p.m. UTC | #3
On Fri, Jul 29, 2011 at 08:02:05PM +0200, Eric Dumazet wrote:
> Le vendredi 29 juillet 2011 à 12:42 -0400, J. Bruce Fields a écrit :
> > On Thu, Jul 28, 2011 at 08:04:09PM +0200, Eric Dumazet wrote:
> > > Use NUMA aware allocations to reduce latencies and increase throughput.
> > > 
> > > sunrpc kthreads can use kthread_create_on_node() if pool_mode is
> > > "percpu" or "pernode", and svc_prepare_thread()/svc_init_buffer() can
> > > also take into account NUMA node affinity for memory allocations.
> > ...
> > > @@ -662,14 +675,16 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
> > >  		nrservs--;
> > >  		chosen_pool = choose_pool(serv, pool, &state);
> > >  
> > > -		rqstp = svc_prepare_thread(serv, chosen_pool);
> > > +		node = svc_pool_map_get_node(chosen_pool->sp_id);
> > > +		rqstp = svc_prepare_thread(serv, chosen_pool, node);
> > 
> > The only correct value for the third argument there is
> > svc_pool_map_get_node(chosen_pool->sp_id), so let's have
> > svc_prepare_thread() call that itself.
> > 
> 
> I have no idea what you mean ;)
> 
> I need 'node' for the following kthread_create_on_node() call.

Doh, of course--apologies.

> > Seems OK otherwise.
> > 
> > Any suggestions on how we should test this?
> 
> I did tests on my machine; it looks good.
> 
> I checked that the stacks were now correct using:
> "echo t > /proc/sysrq-trigger"

I was wondering more about good tests of nfsd's performance on numa;
that might be more of a question for Greg.

--b.
Greg Banks July 29, 2011, 8:39 p.m. UTC | #4
Sent from my iPhone

On 30/07/2011, at 4:08, "J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Fri, Jul 29, 2011 at 08:02:05PM +0200, Eric Dumazet wrote:
>> On Friday, July 29, 2011 at 12:42 -0400, J. Bruce Fields wrote:
>>> On Thu, Jul 28, 2011 at 08:04:09PM +0200, Eric Dumazet wrote:
>>>> Use NUMA aware allocations to reduce latencies and increase throughput.
>>>>
>>>> sunrpc kthreads can use kthread_create_on_node() if pool_mode is
>>>> "percpu" or "pernode", and svc_prepare_thread()/svc_init_buffer() can
>>>> also take into account NUMA node affinity for memory allocations.
>>> ...
>>>> @@ -662,14 +675,16 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
>>>>        nrservs--;
>>>>        chosen_pool = choose_pool(serv, pool, &state);
>>>>
>>>> -        rqstp = svc_prepare_thread(serv, chosen_pool);
>>>> +        node = svc_pool_map_get_node(chosen_pool->sp_id);
>>>> +        rqstp = svc_prepare_thread(serv, chosen_pool, node);
>>>
>>> The only correct value for the third argument there is
>>> svc_pool_map_get_node(chosen_pool->sp_id), so let's have
>>> svc_prepare_thread() call that itself.
>>>
>>
>> I have no idea what you mean ;)
>>
>> I need 'node' for the following kthread_create_on_node() call.
>
> Doh, of course--apologies.
>
>>> Seems OK otherwise.
>>>
>>> Any suggestions on how we should test this?
>>
>> I did tests on my machine; it looks good.
>>
>> I checked that the stacks were now correct using:
>> "echo t > /proc/sysrq-trigger"
>
> I was wondering more about good tests of nfsd's performance on numa;
> that might be more of a question for Greg.
>

To really show a big difference you need a much bigger box, or slower
NUMA interconnects than today's. You also want network cards locally
attached to each node and a metadata-heavy (i.e. high RPC call rate)
load.

Greg.

Patch

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index abfff9d..c061b9a 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -282,7 +282,7 @@  int lockd_up(void)
 	/*
 	 * Create the kernel thread and wait for it to start.
 	 */
-	nlmsvc_rqst = svc_prepare_thread(serv, &serv->sv_pools[0]);
+	nlmsvc_rqst = svc_prepare_thread(serv, &serv->sv_pools[0], NUMA_NO_NODE);
 	if (IS_ERR(nlmsvc_rqst)) {
 		error = PTR_ERR(nlmsvc_rqst);
 		nlmsvc_rqst = NULL;
diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
index e3d2942..ce620b5 100644
--- a/fs/nfs/callback.c
+++ b/fs/nfs/callback.c
@@ -125,7 +125,7 @@  nfs4_callback_up(struct svc_serv *serv)
 	else
 		goto out_err;
 
-	return svc_prepare_thread(serv, &serv->sv_pools[0]);
+	return svc_prepare_thread(serv, &serv->sv_pools[0], NUMA_NO_NODE);
 
 out_err:
 	if (ret == 0)
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 223588a..a78a51e 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -404,7 +404,7 @@  struct svc_procedure {
 struct svc_serv *svc_create(struct svc_program *, unsigned int,
 			    void (*shutdown)(struct svc_serv *));
 struct svc_rqst *svc_prepare_thread(struct svc_serv *serv,
-					struct svc_pool *pool);
+					struct svc_pool *pool, int node);
 void		   svc_exit_thread(struct svc_rqst *);
 struct svc_serv *  svc_create_pooled(struct svc_program *, unsigned int,
 			void (*shutdown)(struct svc_serv *),
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 6a69a11..30d70ab 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -295,6 +295,18 @@  svc_pool_map_put(void)
 }
 
 
+static int svc_pool_map_get_node(unsigned int pidx)
+{
+	const struct svc_pool_map *m = &svc_pool_map;
+
+	if (m->count) {
+		if (m->mode == SVC_POOL_PERCPU)
+			return cpu_to_node(m->pool_to[pidx]);
+		if (m->mode == SVC_POOL_PERNODE)
+			return m->pool_to[pidx];
+	}
+	return NUMA_NO_NODE;
+}
 /*
  * Set the given thread's cpus_allowed mask so that it
  * will only run on cpus in the given pool.
@@ -499,7 +511,7 @@  EXPORT_SYMBOL_GPL(svc_destroy);
  * We allocate pages and place them in rq_argpages.
  */
 static int
-svc_init_buffer(struct svc_rqst *rqstp, unsigned int size)
+svc_init_buffer(struct svc_rqst *rqstp, unsigned int size, int node)
 {
 	unsigned int pages, arghi;
 
@@ -513,7 +525,7 @@  svc_init_buffer(struct svc_rqst *rqstp, unsigned int size)
 	arghi = 0;
 	BUG_ON(pages > RPCSVC_MAXPAGES);
 	while (pages) {
-		struct page *p = alloc_page(GFP_KERNEL);
+		struct page *p = alloc_pages_node(node, GFP_KERNEL, 0);
 		if (!p)
 			break;
 		rqstp->rq_pages[arghi++] = p;
@@ -536,11 +548,11 @@  svc_release_buffer(struct svc_rqst *rqstp)
 }
 
 struct svc_rqst *
-svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool)
+svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 {
 	struct svc_rqst	*rqstp;
 
-	rqstp = kzalloc(sizeof(*rqstp), GFP_KERNEL);
+	rqstp = kzalloc_node(sizeof(*rqstp), GFP_KERNEL, node);
 	if (!rqstp)
 		goto out_enomem;
 
@@ -554,15 +566,15 @@  svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool)
 	rqstp->rq_server = serv;
 	rqstp->rq_pool = pool;
 
-	rqstp->rq_argp = kmalloc(serv->sv_xdrsize, GFP_KERNEL);
+	rqstp->rq_argp = kmalloc_node(serv->sv_xdrsize, GFP_KERNEL, node);
 	if (!rqstp->rq_argp)
 		goto out_thread;
 
-	rqstp->rq_resp = kmalloc(serv->sv_xdrsize, GFP_KERNEL);
+	rqstp->rq_resp = kmalloc_node(serv->sv_xdrsize, GFP_KERNEL, node);
 	if (!rqstp->rq_resp)
 		goto out_thread;
 
-	if (!svc_init_buffer(rqstp, serv->sv_max_mesg))
+	if (!svc_init_buffer(rqstp, serv->sv_max_mesg, node))
 		goto out_thread;
 
 	return rqstp;
@@ -647,6 +659,7 @@  svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
 	struct svc_pool *chosen_pool;
 	int error = 0;
 	unsigned int state = serv->sv_nrthreads-1;
+	int node;
 
 	if (pool == NULL) {
 		/* The -1 assumes caller has done a svc_get() */
@@ -662,14 +675,16 @@  svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
 		nrservs--;
 		chosen_pool = choose_pool(serv, pool, &state);
 
-		rqstp = svc_prepare_thread(serv, chosen_pool);
+		node = svc_pool_map_get_node(chosen_pool->sp_id);
+		rqstp = svc_prepare_thread(serv, chosen_pool, node);
 		if (IS_ERR(rqstp)) {
 			error = PTR_ERR(rqstp);
 			break;
 		}
 
 		__module_get(serv->sv_module);
-		task = kthread_create(serv->sv_function, rqstp, serv->sv_name);
+		task = kthread_create_on_node(serv->sv_function, rqstp,
+					      node, serv->sv_name);
 		if (IS_ERR(task)) {
 			error = PTR_ERR(task);
 			module_put(serv->sv_module);