diff mbox series

[v4] lib: optimize cpumask_local_spread()

Message ID 1576051437-23230-1-git-send-email-zhangshaokun@hisilicon.com
State Not Applicable
Delegated to: David Miller
Headers show
Series [v4] lib: optimize cpumask_local_spread() | expand

Commit Message

Shaokun Zhang Dec. 11, 2019, 8:03 a.m. UTC
From: yuqi jin <jinyuqi@huawei.com>

In multi-processor and NUMA system, I/O driver will find cpu cores that
which shall be bound IRQ. When cpu cores in the local numa have been
used, it is better to find the node closest to the local numa node for
performance, instead of choosing any online cpu immediately.

On Huawei Kunpeng 920 server, there are 4 NUMA node(0 - 3) in the 2-cpu
system(0 - 1). The topology of this server is followed:
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 63379 MB
node 0 free: 61899 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 64509 MB
node 1 free: 63942 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 64509 MB
node 2 free: 63056 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 3 size: 63997 MB
node 3 free: 63420 MB
node distances:
node   0   1   2   3
  0:  10  16  32  33
  1:  16  10  25  32
  2:  32  25  10  16
  3:  33  32  16  10

We perform PS (parameter server) business test, the behavior of the 
service is that the client initiates a request through the network card,
the server responds to the request after calculation. When two PS
processes run on node2 and node3 separately and the network card is
located on 'node2' which is in cpu1, the performance of node2 (26W QPS)
and node3 (22W QPS) is different.
It is better that the NIC queues are bound to the cpu1 cores in turn,
then XPS will also be properly initialized, while cpumask_local_spread
only considers the local node. When the number of NIC queues exceeds the
number of cores in the local node, it returns to the online core directly.
So when PS runs on node3 sending a calculated request, the performance is
not as good as the node2.
The IRQ from 369-392 will be bound from NUMA node0 to NUMA node3 with this
patch, before the patch:
Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
0
Euler:/sys/bus/pci # cat /proc/irq/370/smp_affinity_list
1
...
Euler:/sys/bus/pci # cat /proc/irq/391/smp_affinity_list
22
Euler:/sys/bus/pci # cat /proc/irq/392/smp_affinity_list
23
After the patch:
Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
72
Euler:/sys/bus/pci # cat /proc/irq/370/smp_affinity_list
73
...
Euler:/sys/bus/pci # cat /proc/irq/391/smp_affinity_list
94
Euler:/sys/bus/pci # cat /proc/irq/392/smp_affinity_list
95
So the performance of the node3 is the same as node2 that is 26W QPS when
the network card is still in 'node2' with the patch.

It is considered that the NIC and other I/O devices shall initialize the
interrupt binding, if the cores of the local node are used up, it is
reasonable to return the node closest to it. Let's optimize it and find
the nearest node through NUMA distance for the non-local NUMA nodes.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: yuqi jin <jinyuqi@huawei.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
---
ChangeLog from v3:
    1. Make spread_lock local to cpumask_local_spread();
    2. Add more descriptions on the affinities change in log;

ChangeLog from v2:
    1. Change the variables as static and use spinlock to protect;
    2. Give more explantation on test and performance;

 lib/cpumask.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 90 insertions(+), 12 deletions(-)

Comments

Michal Hocko Dec. 11, 2019, 9:08 a.m. UTC | #1
On Wed 11-12-19 16:03:57, Shaokun Zhang wrote:
> From: yuqi jin <jinyuqi@huawei.com>
> 
> In multi-processor and NUMA system, I/O driver will find cpu cores that
> which shall be bound IRQ. When cpu cores in the local numa have been
> used, it is better to find the node closest to the local numa node for
> performance, instead of choosing any online cpu immediately.
> 
> On Huawei Kunpeng 920 server, there are 4 NUMA node(0 - 3) in the 2-cpu
> system(0 - 1). The topology of this server is followed:
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> node 0 size: 63379 MB
> node 0 free: 61899 MB
> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> node 1 size: 64509 MB
> node 1 free: 63942 MB
> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
> node 2 size: 64509 MB
> node 2 free: 63056 MB
> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
> node 3 size: 63997 MB
> node 3 free: 63420 MB
> node distances:
> node   0   1   2   3
>   0:  10  16  32  33
>   1:  16  10  25  32
>   2:  32  25  10  16
>   3:  33  32  16  10
> 
> We perform PS (parameter server) business test, the behavior of the 
> service is that the client initiates a request through the network card,
> the server responds to the request after calculation. When two PS
> processes run on node2 and node3 separately and the network card is
> located on 'node2' which is in cpu1, the performance of node2 (26W QPS)
> and node3 (22W QPS) is different.
> It is better that the NIC queues are bound to the cpu1 cores in turn,
> then XPS will also be properly initialized, while cpumask_local_spread
> only considers the local node. When the number of NIC queues exceeds the
> number of cores in the local node, it returns to the online core directly.
> So when PS runs on node3 sending a calculated request, the performance is
> not as good as the node2.
> The IRQ from 369-392 will be bound from NUMA node0 to NUMA node3 with this
> patch, before the patch:
> Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
> 0
> Euler:/sys/bus/pci # cat /proc/irq/370/smp_affinity_list
> 1
> ...
> Euler:/sys/bus/pci # cat /proc/irq/391/smp_affinity_list
> 22
> Euler:/sys/bus/pci # cat /proc/irq/392/smp_affinity_list
> 23
> After the patch:
> Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
> 72
> Euler:/sys/bus/pci # cat /proc/irq/370/smp_affinity_list
> 73
> ...
> Euler:/sys/bus/pci # cat /proc/irq/391/smp_affinity_list
> 94
> Euler:/sys/bus/pci # cat /proc/irq/392/smp_affinity_list
> 95
> So the performance of the node3 is the same as node2 that is 26W QPS when
> the network card is still in 'node2' with the patch.
> 
> It is considered that the NIC and other I/O devices shall initialize the
> interrupt binding, if the cores of the local node are used up, it is
> reasonable to return the node closest to it. Let's optimize it and find
> the nearest node through NUMA distance for the non-local NUMA nodes.

As I've said/asked earlier. I am missing some background how this is
affecting other existing users. Is this just that nobody has noticed the
suboptimal cpu usage or is your workload very special in that regards.

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Paul Burton <paul.burton@mips.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Signed-off-by: yuqi jin <jinyuqi@huawei.com>
> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
> ---
> ChangeLog from v3:
>     1. Make spread_lock local to cpumask_local_spread();
>     2. Add more descriptions on the affinities change in log;
> 
> ChangeLog from v2:
>     1. Change the variables as static and use spinlock to protect;
>     2. Give more explantation on test and performance;
> 
>  lib/cpumask.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 90 insertions(+), 12 deletions(-)
> 
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index 0cb672eb107c..f7394ba36116 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -6,6 +6,7 @@
>  #include <linux/export.h>
>  #include <linux/memblock.h>
>  #include <linux/numa.h>
> +#include <linux/spinlock.h>
>  
>  /**
>   * cpumask_next - get the next cpu in a cpumask
> @@ -192,18 +193,39 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
>  }
>  #endif
>  
> -/**
> - * cpumask_local_spread - select the i'th cpu with local numa cpu's first
> - * @i: index number
> - * @node: local numa_node
> - *
> - * This function selects an online CPU according to a numa aware policy;
> - * local cpus are returned first, followed by non-local ones, then it
> - * wraps around.
> - *
> - * It's not very efficient, but useful for setup.
> - */
> -unsigned int cpumask_local_spread(unsigned int i, int node)
> +static void calc_node_distance(int *node_dist, int node)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr_node_ids; i++)
> +		node_dist[i] = node_distance(node, i);
> +}
> +
> +static int find_nearest_node(int *node_dist, bool *used)
> +{
> +	int i, min_dist = node_dist[0], node_id = -1;
> +
> +	/* Choose the first unused node to compare */
> +	for (i = 0; i < nr_node_ids; i++) {
> +		if (used[i] == 0) {
> +			min_dist = node_dist[i];
> +			node_id = i;
> +			break;
> +		}
> +	}
> +
> +	/* Compare and return the nearest node */
> +	for (i = 0; i < nr_node_ids; i++) {
> +		if (node_dist[i] < min_dist && used[i] == 0) {
> +			min_dist = node_dist[i];
> +			node_id = i;
> +		}
> +	}
> +
> +	return node_id;
> +}
> +
> +static unsigned int __cpumask_local_spread(unsigned int i, int node)
>  {
>  	int cpu;
>  
> @@ -231,4 +253,60 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>  	}
>  	BUG();
>  }
> +
> +/**
> + * cpumask_local_spread - select the i'th cpu with local numa cpu's first
> + * @i: index number
> + * @node: local numa_node
> + *
> + * This function selects an online CPU according to a numa aware policy;
> + * local cpus are returned first, followed by the nearest non-local ones,
> + * then it wraps around.
> + *
> + * It's not very efficient, but useful for setup.
> + */
> +unsigned int cpumask_local_spread(unsigned int i, int node)
> +{
> +	static DEFINE_SPINLOCK(spread_lock);
> +	static int node_dist[MAX_NUMNODES];
> +	static bool used[MAX_NUMNODES];
> +	unsigned long flags;
> +	int cpu, j, id;
> +
> +	/* Wrap: we always want a cpu. */
> +	i %= num_online_cpus();
> +
> +	if (node == NUMA_NO_NODE) {
> +		for_each_cpu(cpu, cpu_online_mask)
> +			if (i-- == 0)
> +				return cpu;
> +	} else {
> +		if (nr_node_ids > MAX_NUMNODES)
> +			return __cpumask_local_spread(i, node);
> +
> +		spin_lock_irqsave(&spread_lock, flags);
> +		memset(used, 0, nr_node_ids * sizeof(bool));
> +		calc_node_distance(node_dist, node);
> +		for (j = 0; j < nr_node_ids; j++) {
> +			id = find_nearest_node(node_dist, used);
> +			if (id < 0)
> +				break;
> +
> +			for_each_cpu_and(cpu, cpumask_of_node(id),
> +					 cpu_online_mask)
> +				if (i-- == 0) {
> +					spin_unlock_irqrestore(&spread_lock,
> +							       flags);
> +					return cpu;
> +				}
> +			used[id] = 1;
> +		}
> +		spin_unlock_irqrestore(&spread_lock, flags);
> +
> +		for_each_cpu(cpu, cpu_online_mask)
> +			if (i-- == 0)
> +				return cpu;
> +	}
> +	BUG();
> +}
>  EXPORT_SYMBOL(cpumask_local_spread);
> -- 
> 2.7.4
Shaokun Zhang Dec. 12, 2019, 1:41 a.m. UTC | #2
Hi Michal,

+Cc: Rusty Russell who optimised this function in commit
f36963c9d3f6 ("cpumask_set_cpu_local_first => cpumask_local_spread, lament")

On 2019/12/11 17:08, Michal Hocko wrote:
> On Wed 11-12-19 16:03:57, Shaokun Zhang wrote:
>> From: yuqi jin <jinyuqi@huawei.com>
>>
>> In multi-processor and NUMA system, I/O driver will find cpu cores that
>> which shall be bound IRQ. When cpu cores in the local numa have been
>> used, it is better to find the node closest to the local numa node for
>> performance, instead of choosing any online cpu immediately.
>>
>> On Huawei Kunpeng 920 server, there are 4 NUMA node(0 - 3) in the 2-cpu
>> system(0 - 1). The topology of this server is followed:
>> available: 4 nodes (0-3)
>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
>> node 0 size: 63379 MB
>> node 0 free: 61899 MB
>> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
>> node 1 size: 64509 MB
>> node 1 free: 63942 MB
>> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
>> node 2 size: 64509 MB
>> node 2 free: 63056 MB
>> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
>> node 3 size: 63997 MB
>> node 3 free: 63420 MB
>> node distances:
>> node   0   1   2   3
>>   0:  10  16  32  33
>>   1:  16  10  25  32
>>   2:  32  25  10  16
>>   3:  33  32  16  10
>>
>> We perform PS (parameter server) business test, the behavior of the 
>> service is that the client initiates a request through the network card,
>> the server responds to the request after calculation. When two PS
>> processes run on node2 and node3 separately and the network card is
>> located on 'node2' which is in cpu1, the performance of node2 (26W QPS)
>> and node3 (22W QPS) is different.
>> It is better that the NIC queues are bound to the cpu1 cores in turn,
>> then XPS will also be properly initialized, while cpumask_local_spread
>> only considers the local node. When the number of NIC queues exceeds the
>> number of cores in the local node, it returns to the online core directly.
>> So when PS runs on node3 sending a calculated request, the performance is
>> not as good as the node2.
>> The IRQ from 369-392 will be bound from NUMA node0 to NUMA node3 with this
>> patch, before the patch:
>> Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
>> 0
>> Euler:/sys/bus/pci # cat /proc/irq/370/smp_affinity_list
>> 1
>> ...
>> Euler:/sys/bus/pci # cat /proc/irq/391/smp_affinity_list
>> 22
>> Euler:/sys/bus/pci # cat /proc/irq/392/smp_affinity_list
>> 23
>> After the patch:
>> Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
>> 72
>> Euler:/sys/bus/pci # cat /proc/irq/370/smp_affinity_list
>> 73
>> ...
>> Euler:/sys/bus/pci # cat /proc/irq/391/smp_affinity_list
>> 94
>> Euler:/sys/bus/pci # cat /proc/irq/392/smp_affinity_list
>> 95
>> So the performance of the node3 is the same as node2 that is 26W QPS when
>> the network card is still in 'node2' with the patch.
>>
>> It is considered that the NIC and other I/O devices shall initialize the
>> interrupt binding, if the cores of the local node are used up, it is
>> reasonable to return the node closest to it. Let's optimize it and find
>> the nearest node through NUMA distance for the non-local NUMA nodes.
> 
> As I've said/asked earlier. I am missing some background how this is
> affecting other existing users. Is this just that nobody has noticed the

I appreciate yours question, but I can't answer 'YES' or 'NO' to it. Since
I encountered this issue, I'm glad to discuss this in the community. If we
can solve it, it can give gain for people who maybe spend time to spot it
again.

> suboptimal cpu usage or is your workload very special in that regards.

What I have done is just that choosing cpu core from the nearest non-local
NUMA node, not the random online cpu core when the local cpu core is used
up. It's no matter what the workload is, Why I mention the PS in the patch?
Because this issue is triggered by it.

Thanks,
Shaokun

> 
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Juergen Gross <jgross@suse.com>
>> Cc: Paul Burton <paul.burton@mips.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>> Cc: Mike Rapoport <rppt@linux.ibm.com>
>> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
>> Signed-off-by: yuqi jin <jinyuqi@huawei.com>
>> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
>> ---
>> ChangeLog from v3:
>>     1. Make spread_lock local to cpumask_local_spread();
>>     2. Add more descriptions on the affinities change in log;
>>
>> ChangeLog from v2:
>>     1. Change the variables as static and use spinlock to protect;
>>     2. Give more explantation on test and performance;
>>
>>  lib/cpumask.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
>>  1 file changed, 90 insertions(+), 12 deletions(-)
>>
>> diff --git a/lib/cpumask.c b/lib/cpumask.c
>> index 0cb672eb107c..f7394ba36116 100644
>> --- a/lib/cpumask.c
>> +++ b/lib/cpumask.c
>> @@ -6,6 +6,7 @@
>>  #include <linux/export.h>
>>  #include <linux/memblock.h>
>>  #include <linux/numa.h>
>> +#include <linux/spinlock.h>
>>  
>>  /**
>>   * cpumask_next - get the next cpu in a cpumask
>> @@ -192,18 +193,39 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
>>  }
>>  #endif
>>  
>> -/**
>> - * cpumask_local_spread - select the i'th cpu with local numa cpu's first
>> - * @i: index number
>> - * @node: local numa_node
>> - *
>> - * This function selects an online CPU according to a numa aware policy;
>> - * local cpus are returned first, followed by non-local ones, then it
>> - * wraps around.
>> - *
>> - * It's not very efficient, but useful for setup.
>> - */
>> -unsigned int cpumask_local_spread(unsigned int i, int node)
>> +static void calc_node_distance(int *node_dist, int node)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nr_node_ids; i++)
>> +		node_dist[i] = node_distance(node, i);
>> +}
>> +
>> +static int find_nearest_node(int *node_dist, bool *used)
>> +{
>> +	int i, min_dist = node_dist[0], node_id = -1;
>> +
>> +	/* Choose the first unused node to compare */
>> +	for (i = 0; i < nr_node_ids; i++) {
>> +		if (used[i] == 0) {
>> +			min_dist = node_dist[i];
>> +			node_id = i;
>> +			break;
>> +		}
>> +	}
>> +
>> +	/* Compare and return the nearest node */
>> +	for (i = 0; i < nr_node_ids; i++) {
>> +		if (node_dist[i] < min_dist && used[i] == 0) {
>> +			min_dist = node_dist[i];
>> +			node_id = i;
>> +		}
>> +	}
>> +
>> +	return node_id;
>> +}
>> +
>> +static unsigned int __cpumask_local_spread(unsigned int i, int node)
>>  {
>>  	int cpu;
>>  
>> @@ -231,4 +253,60 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>>  	}
>>  	BUG();
>>  }
>> +
>> +/**
>> + * cpumask_local_spread - select the i'th cpu with local numa cpu's first
>> + * @i: index number
>> + * @node: local numa_node
>> + *
>> + * This function selects an online CPU according to a numa aware policy;
>> + * local cpus are returned first, followed by the nearest non-local ones,
>> + * then it wraps around.
>> + *
>> + * It's not very efficient, but useful for setup.
>> + */
>> +unsigned int cpumask_local_spread(unsigned int i, int node)
>> +{
>> +	static DEFINE_SPINLOCK(spread_lock);
>> +	static int node_dist[MAX_NUMNODES];
>> +	static bool used[MAX_NUMNODES];
>> +	unsigned long flags;
>> +	int cpu, j, id;
>> +
>> +	/* Wrap: we always want a cpu. */
>> +	i %= num_online_cpus();
>> +
>> +	if (node == NUMA_NO_NODE) {
>> +		for_each_cpu(cpu, cpu_online_mask)
>> +			if (i-- == 0)
>> +				return cpu;
>> +	} else {
>> +		if (nr_node_ids > MAX_NUMNODES)
>> +			return __cpumask_local_spread(i, node);
>> +
>> +		spin_lock_irqsave(&spread_lock, flags);
>> +		memset(used, 0, nr_node_ids * sizeof(bool));
>> +		calc_node_distance(node_dist, node);
>> +		for (j = 0; j < nr_node_ids; j++) {
>> +			id = find_nearest_node(node_dist, used);
>> +			if (id < 0)
>> +				break;
>> +
>> +			for_each_cpu_and(cpu, cpumask_of_node(id),
>> +					 cpu_online_mask)
>> +				if (i-- == 0) {
>> +					spin_unlock_irqrestore(&spread_lock,
>> +							       flags);
>> +					return cpu;
>> +				}
>> +			used[id] = 1;
>> +		}
>> +		spin_unlock_irqrestore(&spread_lock, flags);
>> +
>> +		for_each_cpu(cpu, cpu_online_mask)
>> +			if (i-- == 0)
>> +				return cpu;
>> +	}
>> +	BUG();
>> +}
>>  EXPORT_SYMBOL(cpumask_local_spread);
>> -- 
>> 2.7.4
>
Shaokun Zhang Dec. 16, 2019, noon UTC | #3
Hi Andrew/Michal,

Any other comments please?

Thanks,
Shaokun

On 2019/12/11 17:08, Michal Hocko wrote:
> On Wed 11-12-19 16:03:57, Shaokun Zhang wrote:
>> From: yuqi jin <jinyuqi@huawei.com>
>>
>> In multi-processor and NUMA system, I/O driver will find cpu cores that
>> which shall be bound IRQ. When cpu cores in the local numa have been
>> used, it is better to find the node closest to the local numa node for
>> performance, instead of choosing any online cpu immediately.
>>
>> On Huawei Kunpeng 920 server, there are 4 NUMA node(0 - 3) in the 2-cpu
>> system(0 - 1). The topology of this server is followed:
>> available: 4 nodes (0-3)
>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
>> node 0 size: 63379 MB
>> node 0 free: 61899 MB
>> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
>> node 1 size: 64509 MB
>> node 1 free: 63942 MB
>> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
>> node 2 size: 64509 MB
>> node 2 free: 63056 MB
>> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
>> node 3 size: 63997 MB
>> node 3 free: 63420 MB
>> node distances:
>> node   0   1   2   3
>>   0:  10  16  32  33
>>   1:  16  10  25  32
>>   2:  32  25  10  16
>>   3:  33  32  16  10
>>
>> We perform PS (parameter server) business test, the behavior of the 
>> service is that the client initiates a request through the network card,
>> the server responds to the request after calculation. When two PS
>> processes run on node2 and node3 separately and the network card is
>> located on 'node2' which is in cpu1, the performance of node2 (26W QPS)
>> and node3 (22W QPS) is different.
>> It is better that the NIC queues are bound to the cpu1 cores in turn,
>> then XPS will also be properly initialized, while cpumask_local_spread
>> only considers the local node. When the number of NIC queues exceeds the
>> number of cores in the local node, it returns to the online core directly.
>> So when PS runs on node3 sending a calculated request, the performance is
>> not as good as the node2.
>> The IRQ from 369-392 will be bound from NUMA node0 to NUMA node3 with this
>> patch, before the patch:
>> Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
>> 0
>> Euler:/sys/bus/pci # cat /proc/irq/370/smp_affinity_list
>> 1
>> ...
>> Euler:/sys/bus/pci # cat /proc/irq/391/smp_affinity_list
>> 22
>> Euler:/sys/bus/pci # cat /proc/irq/392/smp_affinity_list
>> 23
>> After the patch:
>> Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
>> 72
>> Euler:/sys/bus/pci # cat /proc/irq/370/smp_affinity_list
>> 73
>> ...
>> Euler:/sys/bus/pci # cat /proc/irq/391/smp_affinity_list
>> 94
>> Euler:/sys/bus/pci # cat /proc/irq/392/smp_affinity_list
>> 95
>> So the performance of the node3 is the same as node2 that is 26W QPS when
>> the network card is still in 'node2' with the patch.
>>
>> It is considered that the NIC and other I/O devices shall initialize the
>> interrupt binding, if the cores of the local node are used up, it is
>> reasonable to return the node closest to it. Let's optimize it and find
>> the nearest node through NUMA distance for the non-local NUMA nodes.
> 
> As I've said/asked earlier. I am missing some background how this is
> affecting other existing users. Is this just that nobody has noticed the
> suboptimal cpu usage or is your workload very special in that regards.
> 
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Juergen Gross <jgross@suse.com>
>> Cc: Paul Burton <paul.burton@mips.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>> Cc: Mike Rapoport <rppt@linux.ibm.com>
>> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
>> Signed-off-by: yuqi jin <jinyuqi@huawei.com>
>> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
>> ---
>> ChangeLog from v3:
>>     1. Make spread_lock local to cpumask_local_spread();
>>     2. Add more descriptions on the affinities change in log;
>>
>> ChangeLog from v2:
>>     1. Change the variables as static and use spinlock to protect;
>>     2. Give more explantation on test and performance;
>>
>>  lib/cpumask.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
>>  1 file changed, 90 insertions(+), 12 deletions(-)
>>
>> diff --git a/lib/cpumask.c b/lib/cpumask.c
>> index 0cb672eb107c..f7394ba36116 100644
>> --- a/lib/cpumask.c
>> +++ b/lib/cpumask.c
>> @@ -6,6 +6,7 @@
>>  #include <linux/export.h>
>>  #include <linux/memblock.h>
>>  #include <linux/numa.h>
>> +#include <linux/spinlock.h>
>>  
>>  /**
>>   * cpumask_next - get the next cpu in a cpumask
>> @@ -192,18 +193,39 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
>>  }
>>  #endif
>>  
>> -/**
>> - * cpumask_local_spread - select the i'th cpu with local numa cpu's first
>> - * @i: index number
>> - * @node: local numa_node
>> - *
>> - * This function selects an online CPU according to a numa aware policy;
>> - * local cpus are returned first, followed by non-local ones, then it
>> - * wraps around.
>> - *
>> - * It's not very efficient, but useful for setup.
>> - */
>> -unsigned int cpumask_local_spread(unsigned int i, int node)
>> +static void calc_node_distance(int *node_dist, int node)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nr_node_ids; i++)
>> +		node_dist[i] = node_distance(node, i);
>> +}
>> +
>> +static int find_nearest_node(int *node_dist, bool *used)
>> +{
>> +	int i, min_dist = node_dist[0], node_id = -1;
>> +
>> +	/* Choose the first unused node to compare */
>> +	for (i = 0; i < nr_node_ids; i++) {
>> +		if (used[i] == 0) {
>> +			min_dist = node_dist[i];
>> +			node_id = i;
>> +			break;
>> +		}
>> +	}
>> +
>> +	/* Compare and return the nearest node */
>> +	for (i = 0; i < nr_node_ids; i++) {
>> +		if (node_dist[i] < min_dist && used[i] == 0) {
>> +			min_dist = node_dist[i];
>> +			node_id = i;
>> +		}
>> +	}
>> +
>> +	return node_id;
>> +}
>> +
>> +static unsigned int __cpumask_local_spread(unsigned int i, int node)
>>  {
>>  	int cpu;
>>  
>> @@ -231,4 +253,60 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>>  	}
>>  	BUG();
>>  }
>> +
>> +/**
>> + * cpumask_local_spread - select the i'th cpu with local numa cpu's first
>> + * @i: index number
>> + * @node: local numa_node
>> + *
>> + * This function selects an online CPU according to a numa aware policy;
>> + * local cpus are returned first, followed by the nearest non-local ones,
>> + * then it wraps around.
>> + *
>> + * It's not very efficient, but useful for setup.
>> + */
>> +unsigned int cpumask_local_spread(unsigned int i, int node)
>> +{
>> +	static DEFINE_SPINLOCK(spread_lock);
>> +	static int node_dist[MAX_NUMNODES];
>> +	static bool used[MAX_NUMNODES];
>> +	unsigned long flags;
>> +	int cpu, j, id;
>> +
>> +	/* Wrap: we always want a cpu. */
>> +	i %= num_online_cpus();
>> +
>> +	if (node == NUMA_NO_NODE) {
>> +		for_each_cpu(cpu, cpu_online_mask)
>> +			if (i-- == 0)
>> +				return cpu;
>> +	} else {
>> +		if (nr_node_ids > MAX_NUMNODES)
>> +			return __cpumask_local_spread(i, node);
>> +
>> +		spin_lock_irqsave(&spread_lock, flags);
>> +		memset(used, 0, nr_node_ids * sizeof(bool));
>> +		calc_node_distance(node_dist, node);
>> +		for (j = 0; j < nr_node_ids; j++) {
>> +			id = find_nearest_node(node_dist, used);
>> +			if (id < 0)
>> +				break;
>> +
>> +			for_each_cpu_and(cpu, cpumask_of_node(id),
>> +					 cpu_online_mask)
>> +				if (i-- == 0) {
>> +					spin_unlock_irqrestore(&spread_lock,
>> +							       flags);
>> +					return cpu;
>> +				}
>> +			used[id] = 1;
>> +		}
>> +		spin_unlock_irqrestore(&spread_lock, flags);
>> +
>> +		for_each_cpu(cpu, cpu_online_mask)
>> +			if (i-- == 0)
>> +				return cpu;
>> +	}
>> +	BUG();
>> +}
>>  EXPORT_SYMBOL(cpumask_local_spread);
>> -- 
>> 2.7.4
>
diff mbox series

Patch

diff --git a/lib/cpumask.c b/lib/cpumask.c
index 0cb672eb107c..f7394ba36116 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@ 
 #include <linux/export.h>
 #include <linux/memblock.h>
 #include <linux/numa.h>
+#include <linux/spinlock.h>
 
 /**
  * cpumask_next - get the next cpu in a cpumask
@@ -192,18 +193,39 @@  void __init free_bootmem_cpumask_var(cpumask_var_t mask)
 }
 #endif
 
-/**
- * cpumask_local_spread - select the i'th cpu with local numa cpu's first
- * @i: index number
- * @node: local numa_node
- *
- * This function selects an online CPU according to a numa aware policy;
- * local cpus are returned first, followed by non-local ones, then it
- * wraps around.
- *
- * It's not very efficient, but useful for setup.
- */
-unsigned int cpumask_local_spread(unsigned int i, int node)
+static void calc_node_distance(int *node_dist, int node)
+{
+	int i;
+
+	for (i = 0; i < nr_node_ids; i++)
+		node_dist[i] = node_distance(node, i);
+}
+
+static int find_nearest_node(int *node_dist, bool *used)
+{
+	int i, min_dist = node_dist[0], node_id = -1;
+
+	/* Choose the first unused node to compare */
+	for (i = 0; i < nr_node_ids; i++) {
+		if (used[i] == 0) {
+			min_dist = node_dist[i];
+			node_id = i;
+			break;
+		}
+	}
+
+	/* Compare and return the nearest node */
+	for (i = 0; i < nr_node_ids; i++) {
+		if (node_dist[i] < min_dist && used[i] == 0) {
+			min_dist = node_dist[i];
+			node_id = i;
+		}
+	}
+
+	return node_id;
+}
+
+static unsigned int __cpumask_local_spread(unsigned int i, int node)
 {
 	int cpu;
 
@@ -231,4 +253,60 @@  unsigned int cpumask_local_spread(unsigned int i, int node)
 	}
 	BUG();
 }
+
+/**
+ * cpumask_local_spread - select the i'th cpu with local numa cpu's first
+ * @i: index number
+ * @node: local numa_node
+ *
+ * This function selects an online CPU according to a numa aware policy;
+ * local cpus are returned first, followed by the nearest non-local ones,
+ * then it wraps around.
+ *
+ * It's not very efficient, but useful for setup.
+ */
+unsigned int cpumask_local_spread(unsigned int i, int node)
+{
+	static DEFINE_SPINLOCK(spread_lock);
+	static int node_dist[MAX_NUMNODES];
+	static bool used[MAX_NUMNODES];
+	unsigned long flags;
+	int cpu, j, id;
+
+	/* Wrap: we always want a cpu. */
+	i %= num_online_cpus();
+
+	if (node == NUMA_NO_NODE) {
+		for_each_cpu(cpu, cpu_online_mask)
+			if (i-- == 0)
+				return cpu;
+	} else {
+		if (nr_node_ids > MAX_NUMNODES)
+			return __cpumask_local_spread(i, node);
+
+		spin_lock_irqsave(&spread_lock, flags);
+		memset(used, 0, nr_node_ids * sizeof(bool));
+		calc_node_distance(node_dist, node);
+		for (j = 0; j < nr_node_ids; j++) {
+			id = find_nearest_node(node_dist, used);
+			if (id < 0)
+				break;
+
+			for_each_cpu_and(cpu, cpumask_of_node(id),
+					 cpu_online_mask)
+				if (i-- == 0) {
+					spin_unlock_irqrestore(&spread_lock,
+							       flags);
+					return cpu;
+				}
+			used[id] = 1;
+		}
+		spin_unlock_irqrestore(&spread_lock, flags);
+
+		for_each_cpu(cpu, cpu_online_mask)
+			if (i-- == 0)
+				return cpu;
+	}
+	BUG();
+}
 EXPORT_SYMBOL(cpumask_local_spread);