[ovs-dev] ovs: do not allocate memory from offline numa node

Message ID 20151002101822.12499.27658.stgit@buzz
State Not Applicable

Commit Message

Konstantin Khlebnikov Oct. 2, 2015, 10:18 a.m. UTC
When openvswitch tries to allocate memory from offline NUMA node 0:
stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO, 0)
it hits VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
[ recently replaced with VM_WARN_ON(!node_online(nid)) ] in linux/gfp.h.
This patch disables NUMA affinity in this case.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

---

<4>[   24.368805] ------------[ cut here ]------------
<2>[   24.368846] kernel BUG at include/linux/gfp.h:325!
<4>[   24.368868] invalid opcode: 0000 [#1] SMP
<4>[   24.368892] Modules linked in: openvswitch vxlan udp_tunnel ip6_udp_tunnel gre libcrc32c kvm_amd kvm crc32_pclmul ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw mgag200 ttm drm_kms_helper drm gf128mul glue_helper serio_raw aes_x86_64 sysimgblt sysfillrect syscopyarea sp5100_tco amd64_edac_mod edac_core edac_mce_amd i2c_piix4 k10temp fam15h_power microcode raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 igb multipath i2c_algo_bit i2c_core linear dca psmouse ptp ahci pata_atiixp pps_core libahci
<4>[   24.369225] CPU: 22 PID: 987 Comm: ovs-vswitchd Not tainted 3.18.19-24 #1
<4>[   24.369255] Hardware name: Supermicro H8DGU/H8DGU, BIOS 3.0b       05/07/2013
<4>[   24.369286] task: ffff8807f2433240 ti: ffff8807ec9a0000 task.ti: ffff8807ec9a0000
<4>[   24.369317] RIP: 0010:[<ffffffff8119da34>]  [<ffffffff8119da34>] new_slab+0x2d4/0x380
<4>[   24.369359] RSP: 0018:ffff8807ec9a35d8  EFLAGS: 00010246
<4>[   24.369383] RAX: 0000000000000000 RBX: ffff8807ff403c00 RCX: 0000000000000000
<4>[   24.369412] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000002012d0
<4>[   24.369441] RBP: ffff8807ec9a3608 R08: ffff8807f193cfe0 R09: 000000010080000a
<4>[   24.369471] R10: 00000000f193cf01 R11: 0000000000015f38 R12: 0000000000000000
<4>[   24.369501] R13: 0000000000000080 R14: 0000000000000000 R15: 00000000000000d0
<4>[   24.369531] FS:  00007febb0cbe980(0000) GS:ffff8807ffd80000(0000) knlGS:0000000000000000
<4>[   24.369563] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   24.369588] CR2: 00007efc53abc1b8 CR3: 00000007f213f000 CR4: 00000000000407e0
<4>[   24.369618] Stack:
<4>[   24.369630]  ffff8807ec9a3618 0000000000000000 0000000000000000 ffff8807ffd958c0
<4>[   24.369669]  ffff8807ff403c00 00000000000080d0 ffff8807ec9a36f8 ffffffff816cc548
<4>[   24.370755]  ffff8807ec9a3708 0000000000000296 0000000000000004 0000000000000000
<4>[   24.371777] Call Trace:
<4>[   24.372929]  [<ffffffff816cc548>] __slab_alloc+0x33b/0x459
<4>[   24.374179]  [<ffffffffa0192a09>] ? ovs_flow_alloc+0x59/0x110 [openvswitch]
<4>[   24.375390]  [<ffffffff8114da93>] ? get_page_from_freelist+0x483/0x9f0
<4>[   24.376623]  [<ffffffff8136b15e>] ? memzero_explicit+0xe/0x10
<4>[   24.377767]  [<ffffffffa0192a09>] ? ovs_flow_alloc+0x59/0x110 [openvswitch]
<4>[   24.378951]  [<ffffffff8119e12c>] kmem_cache_alloc_node+0x9c/0x1b0
<4>[   24.379916]  [<ffffffff8119f08b>] ? kmem_cache_alloc+0x18b/0x1a0
<4>[   24.390806]  [<ffffffffa01929cd>] ? ovs_flow_alloc+0x1d/0x110 [openvswitch]
<4>[   24.391779]  [<ffffffffa0192a09>] ovs_flow_alloc+0x59/0x110 [openvswitch]
<4>[   24.392875]  [<ffffffffa018b18b>] ovs_flow_cmd_new+0x5b/0x360 [openvswitch]
<4>[   24.394004]  [<ffffffff8114e16c>] ? __alloc_pages_nodemask+0x16c/0xaf0
<4>[   24.394973]  [<ffffffff815bba77>] ? __alloc_skb+0x87/0x2a0
<4>[   24.395926]  [<ffffffff8138b240>] ? nla_parse+0x90/0x110
<4>[   24.476276]  [<ffffffff815fe453>] genl_family_rcv_msg+0x373/0x3d0
<4>[   24.477704]  [<ffffffff811a09dc>] ? __kmalloc_node_track_caller+0x6c/0x220
<4>[   24.478859]  [<ffffffff815fe4f4>] genl_rcv_msg+0x44/0x80
<4>[   24.479987]  [<ffffffff815fe4b0>] ? genl_family_rcv_msg+0x3d0/0x3d0
<4>[   24.481325]  [<ffffffff815fda49>] netlink_rcv_skb+0xb9/0xe0
<4>[   24.482466]  [<ffffffff815fdd6c>] genl_rcv+0x2c/0x40
<4>[   24.483554]  [<ffffffff815fd04b>] netlink_unicast+0x12b/0x1c0
<4>[   24.484739]  [<ffffffff815fd472>] netlink_sendmsg+0x392/0x6d0
<4>[   24.485942]  [<ffffffff815b2f9f>] sock_sendmsg+0xaf/0xc0
<4>[   24.486953]  [<ffffffff815fd2e2>] ? netlink_sendmsg+0x202/0x6d0
<4>[   24.487969]  [<ffffffff815b3622>] ___sys_sendmsg.part.19+0x322/0x330
<4>[   24.489167]  [<ffffffff815b3839>] ? SYSC_sendto+0xf9/0x130
<4>[   24.490217]  [<ffffffff815b367a>] ___sys_sendmsg+0x4a/0x70
<4>[   24.491162]  [<ffffffff815b40c9>] __sys_sendmsg+0x49/0x90
<4>[   24.492082]  [<ffffffff815b4129>] SyS_sendmsg+0x19/0x20
<4>[   24.493181]  [<ffffffff816d6c09>] system_call_fastpath+0x12/0x17
<4>[   24.494124] Code: 40 e9 ea fe ff ff 90 e8 6b 69 ff ff 49 89 c4 e9 07 fe ff ff 4c 89 f7 ff d0 e9 26 ff ff ff 49 c7 04 06 00 00 00 00 e9 3c ff ff ff <0f> 0b ba 00 10 00 00 be 5a 00 00 00 4c 89 ef 48 d3 e2 e8 65 2a
<1>[   24.496071] RIP  [<ffffffff8119da34>] new_slab+0x2d4/0x380
<4>[   24.497152]  RSP <ffff8807ec9a35d8>
<4>[   24.498945] ---[ end trace 6f97360ff4a9ee45 ]---
---
 net/openvswitch/flow_table.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
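
Editorial sketch: the node choice described in the commit message can be modeled in userspace roughly as follows. This is a minimal model under stated assumptions, not the kernel code: `model_online`, `MODEL_MAX_NUMNODES` and the `model_*`/`stats_nid_*` helpers are hypothetical names, and the model treats NUMA_NO_NODE as "no node preference", which never reaches the quoted check.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NUMNODES 4   /* hypothetical limit, for the model only */
#define NUMA_NO_NODE (-1)      /* the kernel's "no node preference" value */

/* Hypothetical stand-in for the kernel's online-node map. */
static bool model_online[MODEL_MAX_NUMNODES];

static bool model_node_online(int nid)
{
	return nid >= 0 && nid < MODEL_MAX_NUMNODES && model_online[nid];
}

/* True when nid would trip the condition quoted from linux/gfp.h. */
static bool model_nid_trips_check(int nid)
{
	return nid < 0 || nid >= MODEL_MAX_NUMNODES || !model_node_online(nid);
}

/* Node argument before the patch: node 0, unconditionally. */
static int stats_nid_before_patch(void)
{
	return 0;
}

/* Node argument after the patch: node 0 only while it is online. */
static int stats_nid_after_patch(void)
{
	int nid = model_node_online(0) ? 0 : NUMA_NO_NODE;

	/* The patched choice is "no preference" or a node that passes. */
	assert(nid == NUMA_NO_NODE || !model_nid_trips_check(nid));
	return nid;
}
```

With node 0 marked offline in the model, the pre-patch argument trips the check while the patched one degrades to NUMA_NO_NODE.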

Comments

Pravin B Shelar Oct. 2, 2015, 10:38 p.m. UTC | #1
On Fri, Oct 2, 2015 at 3:18 AM, Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
> When openvswitch tries allocate memory from offline numa node 0:
> stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO, 0)
> It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
> [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
> This patch disables numa affinity in this case.
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>
> ---
>
> <4>[   24.368805] ------------[ cut here ]------------
> <2>[   24.368846] kernel BUG at include/linux/gfp.h:325!
> <4>[   24.368868] invalid opcode: 0000 [#1] SMP
> <4>[   24.368892] Modules linked in: openvswitch vxlan udp_tunnel ip6_udp_tunnel gre libcrc32c kvm_amd kvm crc32_pclmul ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw mgag200 ttm drm_kms_helper drm gf128mul glue_helper serio_raw aes_x86_64 sysimgblt sysfillrect syscopyarea sp5100_tco amd64_edac_mod edac_core edac_mce_amd i2c_piix4 k10temp fam15h_power microcode raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 igb multipath i2c_algo_bit i2c_core linear dca psmouse ptp ahci pata_atiixp pps_core libahci
> <4>[   24.369225] CPU: 22 PID: 987 Comm: ovs-vswitchd Not tainted 3.18.19-24 #1
> <4>[   24.369255] Hardware name: Supermicro H8DGU/H8DGU, BIOS 3.0b       05/07/2013
> <4>[   24.369286] task: ffff8807f2433240 ti: ffff8807ec9a0000 task.ti: ffff8807ec9a0000
> <4>[   24.369317] RIP: 0010:[<ffffffff8119da34>]  [<ffffffff8119da34>] new_slab+0x2d4/0x380
> <4>[   24.369359] RSP: 0018:ffff8807ec9a35d8  EFLAGS: 00010246
> <4>[   24.369383] RAX: 0000000000000000 RBX: ffff8807ff403c00 RCX: 0000000000000000
> <4>[   24.369412] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000002012d0
> <4>[   24.369441] RBP: ffff8807ec9a3608 R08: ffff8807f193cfe0 R09: 000000010080000a
> <4>[   24.369471] R10: 00000000f193cf01 R11: 0000000000015f38 R12: 0000000000000000
> <4>[   24.369501] R13: 0000000000000080 R14: 0000000000000000 R15: 00000000000000d0
> <4>[   24.369531] FS:  00007febb0cbe980(0000) GS:ffff8807ffd80000(0000) knlGS:0000000000000000
> <4>[   24.369563] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[   24.369588] CR2: 00007efc53abc1b8 CR3: 00000007f213f000 CR4: 00000000000407e0
> <4>[   24.369618] Stack:
> <4>[   24.369630]  ffff8807ec9a3618 0000000000000000 0000000000000000 ffff8807ffd958c0
> <4>[   24.369669]  ffff8807ff403c00 00000000000080d0 ffff8807ec9a36f8 ffffffff816cc548
> <4>[   24.370755]  ffff8807ec9a3708 0000000000000296 0000000000000004 0000000000000000
> <4>[   24.371777] Call Trace:
> <4>[   24.372929]  [<ffffffff816cc548>] __slab_alloc+0x33b/0x459
> <4>[   24.374179]  [<ffffffffa0192a09>] ? ovs_flow_alloc+0x59/0x110 [openvswitch]
> <4>[   24.375390]  [<ffffffff8114da93>] ? get_page_from_freelist+0x483/0x9f0
> <4>[   24.376623]  [<ffffffff8136b15e>] ? memzero_explicit+0xe/0x10
> <4>[   24.377767]  [<ffffffffa0192a09>] ? ovs_flow_alloc+0x59/0x110 [openvswitch]
> <4>[   24.378951]  [<ffffffff8119e12c>] kmem_cache_alloc_node+0x9c/0x1b0
> <4>[   24.379916]  [<ffffffff8119f08b>] ? kmem_cache_alloc+0x18b/0x1a0
> <4>[   24.390806]  [<ffffffffa01929cd>] ? ovs_flow_alloc+0x1d/0x110 [openvswitch]
> <4>[   24.391779]  [<ffffffffa0192a09>] ovs_flow_alloc+0x59/0x110 [openvswitch]
> <4>[   24.392875]  [<ffffffffa018b18b>] ovs_flow_cmd_new+0x5b/0x360 [openvswitch]
> <4>[   24.394004]  [<ffffffff8114e16c>] ? __alloc_pages_nodemask+0x16c/0xaf0
> <4>[   24.394973]  [<ffffffff815bba77>] ? __alloc_skb+0x87/0x2a0
> <4>[   24.395926]  [<ffffffff8138b240>] ? nla_parse+0x90/0x110
> <4>[   24.476276]  [<ffffffff815fe453>] genl_family_rcv_msg+0x373/0x3d0
> <4>[   24.477704]  [<ffffffff811a09dc>] ? __kmalloc_node_track_caller+0x6c/0x220
> <4>[   24.478859]  [<ffffffff815fe4f4>] genl_rcv_msg+0x44/0x80
> <4>[   24.479987]  [<ffffffff815fe4b0>] ? genl_family_rcv_msg+0x3d0/0x3d0
> <4>[   24.481325]  [<ffffffff815fda49>] netlink_rcv_skb+0xb9/0xe0
> <4>[   24.482466]  [<ffffffff815fdd6c>] genl_rcv+0x2c/0x40
> <4>[   24.483554]  [<ffffffff815fd04b>] netlink_unicast+0x12b/0x1c0
> <4>[   24.484739]  [<ffffffff815fd472>] netlink_sendmsg+0x392/0x6d0
> <4>[   24.485942]  [<ffffffff815b2f9f>] sock_sendmsg+0xaf/0xc0
> <4>[   24.486953]  [<ffffffff815fd2e2>] ? netlink_sendmsg+0x202/0x6d0
> <4>[   24.487969]  [<ffffffff815b3622>] ___sys_sendmsg.part.19+0x322/0x330
> <4>[   24.489167]  [<ffffffff815b3839>] ? SYSC_sendto+0xf9/0x130
> <4>[   24.490217]  [<ffffffff815b367a>] ___sys_sendmsg+0x4a/0x70
> <4>[   24.491162]  [<ffffffff815b40c9>] __sys_sendmsg+0x49/0x90
> <4>[   24.492082]  [<ffffffff815b4129>] SyS_sendmsg+0x19/0x20
> <4>[   24.493181]  [<ffffffff816d6c09>] system_call_fastpath+0x12/0x17
> <4>[   24.494124] Code: 40 e9 ea fe ff ff 90 e8 6b 69 ff ff 49 89 c4 e9 07 fe ff ff 4c 89 f7 ff d0 e9 26 ff ff ff 49 c7 04 06 00 00 00 00 e9 3c ff ff ff <0f> 0b ba 00 10 00 00 be 5a 00 00 00 4c 89 ef 48 d3 e2 e8 65 2a
> <1>[   24.496071] RIP  [<ffffffff8119da34>] new_slab+0x2d4/0x380
> <4>[   24.497152]  RSP <ffff8807ec9a35d8>
> <4>[   24.498945] ---[ end trace 6f97360ff4a9ee45 ]---
> ---
>  net/openvswitch/flow_table.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
> index f2ea83ba4763..c7f74aab34b9 100644
> --- a/net/openvswitch/flow_table.c
> +++ b/net/openvswitch/flow_table.c
> @@ -93,7 +93,8 @@ struct sw_flow *ovs_flow_alloc(void)
>
>         /* Initialize the default stat node. */
>         stats = kmem_cache_alloc_node(flow_stats_cache,
> -                                     GFP_KERNEL | __GFP_ZERO, 0);
> +                                     GFP_KERNEL | __GFP_ZERO,
> +                                     node_online(0) ? 0 : NUMA_NO_NODE);
>         if (!stats)
>                 goto err;
>

Acked-by: Pravin B Shelar <pshelar@nicira.com>

Thanks!
David Miller Oct. 5, 2015, 1:44 p.m. UTC | #2
From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Date: Fri, 02 Oct 2015 13:18:22 +0300

> When openvswitch tries allocate memory from offline numa node 0:
> stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO, 0)
> It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
> [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
> This patch disables numa affinity in this case.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

Applied, but this should probably use NUMA_NO_NODE unconditionally.
Vlastimil Babka Oct. 5, 2015, 1:59 p.m. UTC | #3
On 10/02/2015 12:18 PM, Konstantin Khlebnikov wrote:
> When openvswitch tries allocate memory from offline numa node 0:
> stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO, 0)
> It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
> [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
> This patch disables numa affinity in this case.
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

...

> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
> index f2ea83ba4763..c7f74aab34b9 100644
> --- a/net/openvswitch/flow_table.c
> +++ b/net/openvswitch/flow_table.c
> @@ -93,7 +93,8 @@ struct sw_flow *ovs_flow_alloc(void)
>
>   	/* Initialize the default stat node. */
>   	stats = kmem_cache_alloc_node(flow_stats_cache,
> -				      GFP_KERNEL | __GFP_ZERO, 0);
> +				      GFP_KERNEL | __GFP_ZERO,
> +				      node_online(0) ? 0 : NUMA_NO_NODE);

Stupid question: can node 0 become offline between this check, and the 
VM_WARN_ON? :) BTW what kind of system has node 0 offline?

>   	if (!stats)
>   		goto err;
>
>
Alexander Duyck Oct. 5, 2015, 8:25 p.m. UTC | #4
On 10/05/2015 06:59 AM, Vlastimil Babka wrote:
> On 10/02/2015 12:18 PM, Konstantin Khlebnikov wrote:
>> When openvswitch tries allocate memory from offline numa node 0:
>> stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | 
>> __GFP_ZERO, 0)
>> It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || 
>> !node_online(nid))
>> [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
>> This patch disables numa affinity in this case.
>>
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>
> ...
>
>> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
>> index f2ea83ba4763..c7f74aab34b9 100644
>> --- a/net/openvswitch/flow_table.c
>> +++ b/net/openvswitch/flow_table.c
>> @@ -93,7 +93,8 @@ struct sw_flow *ovs_flow_alloc(void)
>>
>>       /* Initialize the default stat node. */
>>       stats = kmem_cache_alloc_node(flow_stats_cache,
>> -                      GFP_KERNEL | __GFP_ZERO, 0);
>> +                      GFP_KERNEL | __GFP_ZERO,
>> +                      node_online(0) ? 0 : NUMA_NO_NODE);
>
> Stupid question: can node 0 become offline between this check, and the 
> VM_WARN_ON? :) BTW what kind of system has node 0 offline?

Another question to ask: is it possible for node 0 to be online
but memoryless?

I would say you are better off just making this call kmem_cache_alloc.  
I don't see anything that indicates the memory has to come from node 0, 
so adding the extra overhead doesn't provide any value.

- Alex
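
Editorial sketch: the distinction raised here (online versus actually having memory) can be made concrete with a small userspace model. Everything below is an assumption made for illustration: the three-state enum and the helper names are hypothetical, and the model only tracks node 0.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE (-1)   /* the kernel's "no node preference" value */

/* The three states of node 0 raised in the thread. */
enum model_node0_state {
	NODE0_OFFLINE,
	NODE0_ONLINE_NO_MEMORY,
	NODE0_ONLINE_WITH_MEMORY,
};

/* Patch as posted: consults only the online state. */
static int nid_patch_as_posted(enum model_node0_state s)
{
	return s != NODE0_OFFLINE ? 0 : NUMA_NO_NODE;
}

/* Suggested alternative: never express a node preference. */
static int nid_no_preference(enum model_node0_state s)
{
	(void)s;
	return NUMA_NO_NODE;
}

/* Can a request for nid be satisfied from node 0's own memory? */
static bool served_from_node0_memory(int nid, enum model_node0_state s)
{
	assert(nid == NUMA_NO_NODE || nid == 0); /* the model only knows node 0 */
	return nid == 0 && s == NODE0_ONLINE_WITH_MEMORY;
}
```

In the memoryless case the posted patch still asks for node 0, even though nothing can be served from that node's own memory. That is the corner case the question points at.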
Jesse Gross Oct. 7, 2015, 1:01 a.m. UTC | #5
On Mon, Oct 5, 2015 at 1:25 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On 10/05/2015 06:59 AM, Vlastimil Babka wrote:
>>
>> On 10/02/2015 12:18 PM, Konstantin Khlebnikov wrote:
>>>
>>> When openvswitch tries allocate memory from offline numa node 0:
>>> stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO,
>>> 0)
>>> It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
>>> [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
>>> This patch disables numa affinity in this case.
>>>
>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>
>>
>> ...
>>
>>> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
>>> index f2ea83ba4763..c7f74aab34b9 100644
>>> --- a/net/openvswitch/flow_table.c
>>> +++ b/net/openvswitch/flow_table.c
>>> @@ -93,7 +93,8 @@ struct sw_flow *ovs_flow_alloc(void)
>>>
>>>       /* Initialize the default stat node. */
>>>       stats = kmem_cache_alloc_node(flow_stats_cache,
>>> -                      GFP_KERNEL | __GFP_ZERO, 0);
>>> +                      GFP_KERNEL | __GFP_ZERO,
>>> +                      node_online(0) ? 0 : NUMA_NO_NODE);
>>
>>
>> Stupid question: can node 0 become offline between this check, and the
>> VM_WARN_ON? :) BTW what kind of system has node 0 offline?
>
>
> Another question to ask would be is it possible for node 0 to be online, but
> be a memoryless node?
>
> I would say you are better off just making this call kmem_cache_alloc.  I
> don't see anything that indicates the memory has to come from node 0, so
> adding the extra overhead doesn't provide any value.

I agree that this at least makes me wonder, though I actually have
concerns in the opposite direction - I see assumptions about this
being on node 0 in net/openvswitch/flow.c.

Jarno, since you originally wrote this code, can you take a look to see
if everything still makes sense?
Jarno Rajahalme Oct. 7, 2015, 5:47 p.m. UTC | #6
> On Oct 6, 2015, at 6:01 PM, Jesse Gross <jesse@nicira.com> wrote:
> 
> On Mon, Oct 5, 2015 at 1:25 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On 10/05/2015 06:59 AM, Vlastimil Babka wrote:
>>> 
>>> On 10/02/2015 12:18 PM, Konstantin Khlebnikov wrote:
>>>> 
>>>> When openvswitch tries allocate memory from offline numa node 0:
>>>> stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO,
>>>> 0)
>>>> It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
>>>> [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
>>>> This patch disables numa affinity in this case.
>>>> 
>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>> 
>>> 
>>> ...
>>> 
>>>> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
>>>> index f2ea83ba4763..c7f74aab34b9 100644
>>>> --- a/net/openvswitch/flow_table.c
>>>> +++ b/net/openvswitch/flow_table.c
>>>> @@ -93,7 +93,8 @@ struct sw_flow *ovs_flow_alloc(void)
>>>> 
>>>>      /* Initialize the default stat node. */
>>>>      stats = kmem_cache_alloc_node(flow_stats_cache,
>>>> -                      GFP_KERNEL | __GFP_ZERO, 0);
>>>> +                      GFP_KERNEL | __GFP_ZERO,
>>>> +                      node_online(0) ? 0 : NUMA_NO_NODE);
>>> 
>>> 
>>> Stupid question: can node 0 become offline between this check, and the
>>> VM_WARN_ON? :) BTW what kind of system has node 0 offline?
>> 
>> 
>> Another question to ask would be is it possible for node 0 to be online, but
>> be a memoryless node?
>> 
>> I would say you are better off just making this call kmem_cache_alloc.  I
>> don't see anything that indicates the memory has to come from node 0, so
>> adding the extra overhead doesn't provide any value.
> 
> I agree that this at least makes me wonder, though I actually have
> concerns in the opposite direction - I see assumptions about this
> being on node 0 in net/openvswitch/flow.c.
> 
> Jarno, since you original wrote this code, can you take a look to see
> if everything still makes sense?

We keep the pre-allocated stats node at array index 0, which is initially used by all CPUs. If CPUs from multiple NUMA nodes start updating the stats, we allocate additional stats nodes (up to one per NUMA node), and the CPUs on node 0 keep using the preallocated entry. If stats cannot be allocated from a CPU's local node, the CPUs on that node keep using the entry at index 0. Currently the code in net/openvswitch/flow.c will retry the local allocation repeatedly, which may not be optimal when the local node has no memory.

Allocating the memory for index 0 from a node other than node 0, as discussed here, just means that the CPUs on node 0 will keep using non-local memory for stats. In a scenario where there are CPUs on two nodes (0, 1), but only node 1 has memory, a shared flow entry will still end up with separate stats memory allocated for each node, but both allocations would reside on node 1. However, there is still a high likelihood that the two allocations would not share a cache line, which should prevent the nodes from invalidating each other’s caches. Based on this, I do not see a problem with relaxing the memory allocation for the default stats node. If node 0 has memory, however, it would be better to allocate the memory from node 0.

  Jarno
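
Editorial sketch: the per-node stats array with fallback to index 0, as described above, can be modeled in userspace roughly like this. It is a simplification under stated assumptions. All `model_*` names are hypothetical, and a missing local entry is allocated on the first update from that node rather than only once several nodes contend. The default entry is allocated with no node preference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_NR_NODES 2

struct model_flow_stats {
	uint64_t packet_count;
	uint64_t byte_count;
};

struct model_flow {
	/* stats[0] is pre-allocated; other entries are filled in on demand. */
	struct model_flow_stats *stats[MODEL_NR_NODES];
};

/* Hypothetical knob: which nodes can satisfy a local allocation. */
static bool model_node_has_memory[MODEL_NR_NODES];

static struct model_flow_stats *model_alloc_stats_on(int node)
{
	assert(node >= 0 && node < MODEL_NR_NODES);
	if (!model_node_has_memory[node])
		return NULL;
	return calloc(1, sizeof(struct model_flow_stats));
}

static bool model_flow_init(struct model_flow *flow)
{
	for (int i = 0; i < MODEL_NR_NODES; i++)
		flow->stats[i] = NULL;
	/* Default entry: allocated with no node preference in this model. */
	flow->stats[0] = calloc(1, sizeof(struct model_flow_stats));
	return flow->stats[0] != NULL;
}

static void model_stats_update(struct model_flow *flow, int cpu_node,
			       uint64_t len)
{
	assert(cpu_node >= 0 && cpu_node < MODEL_NR_NODES);
	struct model_flow_stats *s = flow->stats[cpu_node];

	if (!s) {
		/* Retried on every update until the local allocation succeeds. */
		s = model_alloc_stats_on(cpu_node);
		if (s)
			flow->stats[cpu_node] = s;
		else
			s = flow->stats[0];   /* fall back to the default entry */
	}
	s->packet_count++;
	s->byte_count += len;
}

static void model_flow_destroy(struct model_flow *flow)
{
	for (int i = 0; i < MODEL_NR_NODES; i++) {
		free(flow->stats[i]);
		flow->stats[i] = NULL;
	}
}
```

While node 1 has no memory in the model, updates from node 1 land in the index-0 entry and the allocation is attempted again on each update. That repeated attempt is the behaviour described above as possibly suboptimal.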
Jesse Gross Oct. 8, 2015, 11:03 p.m. UTC | #7
On Wed, Oct 7, 2015 at 10:47 AM, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
>
>> On Oct 6, 2015, at 6:01 PM, Jesse Gross <jesse@nicira.com> wrote:
>>
>> On Mon, Oct 5, 2015 at 1:25 PM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>>> On 10/05/2015 06:59 AM, Vlastimil Babka wrote:
>>>>
>>>> On 10/02/2015 12:18 PM, Konstantin Khlebnikov wrote:
>>>>>
>>>>> When openvswitch tries allocate memory from offline numa node 0:
>>>>> stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO,
>>>>> 0)
>>>>> It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
>>>>> [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
>>>>> This patch disables numa affinity in this case.
>>>>>
>>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>>>
>>>>
>>>> ...
>>>>
>>>>> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
>>>>> index f2ea83ba4763..c7f74aab34b9 100644
>>>>> --- a/net/openvswitch/flow_table.c
>>>>> +++ b/net/openvswitch/flow_table.c
>>>>> @@ -93,7 +93,8 @@ struct sw_flow *ovs_flow_alloc(void)
>>>>>
>>>>>      /* Initialize the default stat node. */
>>>>>      stats = kmem_cache_alloc_node(flow_stats_cache,
>>>>> -                      GFP_KERNEL | __GFP_ZERO, 0);
>>>>> +                      GFP_KERNEL | __GFP_ZERO,
>>>>> +                      node_online(0) ? 0 : NUMA_NO_NODE);
>>>>
>>>>
>>>> Stupid question: can node 0 become offline between this check, and the
>>>> VM_WARN_ON? :) BTW what kind of system has node 0 offline?
>>>
>>>
>>> Another question to ask would be is it possible for node 0 to be online, but
>>> be a memoryless node?
>>>
>>> I would say you are better off just making this call kmem_cache_alloc.  I
>>> don't see anything that indicates the memory has to come from node 0, so
>>> adding the extra overhead doesn't provide any value.
>>
>> I agree that this at least makes me wonder, though I actually have
>> concerns in the opposite direction - I see assumptions about this
>> being on node 0 in net/openvswitch/flow.c.
>>
>> Jarno, since you original wrote this code, can you take a look to see
>> if everything still makes sense?
>
> We keep the pre-allocated stats node at array index 0, which is initially used by all CPUs, but if CPUs from multiple numa nodes start updating the stats, we allocate additional stats nodes (up to one per numa node), and the CPUs on node 0 keep using the preallocated entry. If stats cannot be allocated from CPUs local node, then those CPUs keep using the entry at index 0. Currently the code in net/openvswitch/flow.c will try to allocate the local memory repeatedly, which may not be optimal when there is no memory at the local node.
>
> Allocating the memory for the index 0 from other than node 0, as discussed here, just means that the CPUs on node 0 will keep on using non-local memory for stats. In a scenario where there are CPUs on two nodes (0, 1), but only the node 1 has memory, a shared flow entry will still end up having separate memory allocated for both nodes, but both of the nodes would be at node 1. However, there is still a high likelihood that the memory allocations would not share a cache line, which should prevent the nodes from invalidating each other’s caches. Based on this I do not see a problem relaxing the memory allocation for the default stats node. If node 0 has memory, however, it would be better to allocate the memory from node 0.

Thanks for going through all of that.

It seems like the question that is being raised is whether it actually
makes sense to try to get the initial memory on node 0, especially
since it seems to introduce some corner cases? Is there any reason why
the flow is more likely to hit node 0 than a randomly chosen one?
(Assuming that this is a multinode system, otherwise it's kind of a
moot point.) We could have a separate pointer to the default allocated
memory, so it wouldn't conflict with memory that was intentionally
allocated for node 0.
Jarno Rajahalme Oct. 9, 2015, 3:54 p.m. UTC | #8
> On Oct 8, 2015, at 4:03 PM, Jesse Gross <jesse@nicira.com> wrote:
> 
> On Wed, Oct 7, 2015 at 10:47 AM, Jarno Rajahalme <jrajahalme@nicira.com <mailto:jrajahalme@nicira.com>> wrote:
>> 
>>> On Oct 6, 2015, at 6:01 PM, Jesse Gross <jesse@nicira.com> wrote:
>>> 
>>> On Mon, Oct 5, 2015 at 1:25 PM, Alexander Duyck
>>> <alexander.duyck@gmail.com> wrote:
>>>> On 10/05/2015 06:59 AM, Vlastimil Babka wrote:
>>>>> 
>>>>> On 10/02/2015 12:18 PM, Konstantin Khlebnikov wrote:
>>>>>> 
>>>>>> When openvswitch tries allocate memory from offline numa node 0:
>>>>>> stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO,
>>>>>> 0)
>>>>>> It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
>>>>>> [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
>>>>>> This patch disables numa affinity in this case.
>>>>>> 
>>>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>>>> 
>>>>> 
>>>>> ...
>>>>> 
>>>>>> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
>>>>>> index f2ea83ba4763..c7f74aab34b9 100644
>>>>>> --- a/net/openvswitch/flow_table.c
>>>>>> +++ b/net/openvswitch/flow_table.c
>>>>>> @@ -93,7 +93,8 @@ struct sw_flow *ovs_flow_alloc(void)
>>>>>> 
>>>>>>     /* Initialize the default stat node. */
>>>>>>     stats = kmem_cache_alloc_node(flow_stats_cache,
>>>>>> -                      GFP_KERNEL | __GFP_ZERO, 0);
>>>>>> +                      GFP_KERNEL | __GFP_ZERO,
>>>>>> +                      node_online(0) ? 0 : NUMA_NO_NODE);
>>>>> 
>>>>> 
>>>>> Stupid question: can node 0 become offline between this check, and the
>>>>> VM_WARN_ON? :) BTW what kind of system has node 0 offline?
>>>> 
>>>> 
>>>> Another question to ask would be is it possible for node 0 to be online, but
>>>> be a memoryless node?
>>>> 
>>>> I would say you are better off just making this call kmem_cache_alloc.  I
>>>> don't see anything that indicates the memory has to come from node 0, so
>>>> adding the extra overhead doesn't provide any value.
>>> 
>>> I agree that this at least makes me wonder, though I actually have
>>> concerns in the opposite direction - I see assumptions about this
>>> being on node 0 in net/openvswitch/flow.c.
>>> 
>>> Jarno, since you original wrote this code, can you take a look to see
>>> if everything still makes sense?
>> 
>> We keep the pre-allocated stats node at array index 0, which is initially used by all CPUs, but if CPUs from multiple numa nodes start updating the stats, we allocate additional stats nodes (up to one per numa node), and the CPUs on node 0 keep using the preallocated entry. If stats cannot be allocated from CPUs local node, then those CPUs keep using the entry at index 0. Currently the code in net/openvswitch/flow.c will try to allocate the local memory repeatedly, which may not be optimal when there is no memory at the local node.
>> 
>> Allocating the memory for the index 0 from other than node 0, as discussed here, just means that the CPUs on node 0 will keep on using non-local memory for stats. In a scenario where there are CPUs on two nodes (0, 1), but only the node 1 has memory, a shared flow entry will still end up having separate memory allocated for both nodes, but both of the nodes would be at node 1. However, there is still a high likelihood that the memory allocations would not share a cache line, which should prevent the nodes from invalidating each other’s caches. Based on this I do not see a problem relaxing the memory allocation for the default stats node. If node 0 has memory, however, it would be better to allocate the memory from node 0.
> 
> Thanks for going through all of that.
> 
> It seems like the question that is being raised is whether it actually
> makes sense to try to get the initial memory on node 0, especially
> since it seems to introduce some corner cases? Is there any reason why
> the flow is more likely to hit node 0 than a randomly chosen one?
> (Assuming that this is a multinode system, otherwise it's kind of a
> moot point.) We could have a separate pointer to the default allocated
> memory, so it wouldn't conflict with memory that was intentionally
> allocated for node 0.

It would still be preferable to know from which node the default stats node was allocated, and store it in the appropriate pointer in the array. We could then add a new “default stats node index” that would be used to locate the node in the array of pointers we already have. That way we would avoid extra allocation and processing of the default stats node.

  Jarno
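
Editorial sketch: one way to read the "default stats node index" proposal, as a userspace model. The field name `default_node`, the `allocated_on` parameter and all `model_*` helpers are hypothetical. How real code would learn which node an allocation came from is outside this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_NR_NODES 2

struct model_flow_stats {
	uint64_t packet_count;
	uint64_t byte_count;
};

struct model_flow {
	int default_node;   /* hypothetical "default stats node index" */
	struct model_flow_stats *stats[MODEL_NR_NODES];
};

/*
 * allocated_on: the node the default entry's memory happened to come
 * from; it is an input here because the model cannot discover it.
 */
static bool model_flow_init(struct model_flow *flow, int allocated_on)
{
	assert(allocated_on >= 0 && allocated_on < MODEL_NR_NODES);
	for (int i = 0; i < MODEL_NR_NODES; i++)
		flow->stats[i] = NULL;
	/* Store the default entry in the slot of the node that owns it. */
	flow->stats[allocated_on] = calloc(1, sizeof(struct model_flow_stats));
	flow->default_node = allocated_on;
	return flow->stats[allocated_on] != NULL;
}

/* Local entry if one exists, otherwise the default entry. */
static struct model_flow_stats *model_stats_for(const struct model_flow *flow,
						int cpu_node)
{
	assert(cpu_node >= 0 && cpu_node < MODEL_NR_NODES);
	struct model_flow_stats *s = flow->stats[cpu_node];

	return s ? s : flow->stats[flow->default_node];
}

static void model_flow_destroy(struct model_flow *flow)
{
	for (int i = 0; i < MODEL_NR_NODES; i++) {
		free(flow->stats[i]);
		flow->stats[i] = NULL;
	}
}
```

Because the default entry lives in the slot of the node it was allocated on, CPUs on that node get it as their local entry, and no second allocation is needed for that node.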
Jesse Gross Oct. 9, 2015, 10:11 p.m. UTC | #9
On Fri, Oct 9, 2015 at 8:54 AM, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
>
> On Oct 8, 2015, at 4:03 PM, Jesse Gross <jesse@nicira.com> wrote:
>
> On Wed, Oct 7, 2015 at 10:47 AM, Jarno Rajahalme <jrajahalme@nicira.com>
> wrote:
>
>
> On Oct 6, 2015, at 6:01 PM, Jesse Gross <jesse@nicira.com> wrote:
>
> On Mon, Oct 5, 2015 at 1:25 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>
> On 10/05/2015 06:59 AM, Vlastimil Babka wrote:
>
>
> On 10/02/2015 12:18 PM, Konstantin Khlebnikov wrote:
>
>
> When openvswitch tries allocate memory from offline numa node 0:
> stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO,
> 0)
> It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
> [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
> This patch disables numa affinity in this case.
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>
>
>
> ...
>
> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
> index f2ea83ba4763..c7f74aab34b9 100644
> --- a/net/openvswitch/flow_table.c
> +++ b/net/openvswitch/flow_table.c
> @@ -93,7 +93,8 @@ struct sw_flow *ovs_flow_alloc(void)
>
>     /* Initialize the default stat node. */
>     stats = kmem_cache_alloc_node(flow_stats_cache,
> -                      GFP_KERNEL | __GFP_ZERO, 0);
> +                      GFP_KERNEL | __GFP_ZERO,
> +                      node_online(0) ? 0 : NUMA_NO_NODE);
>
>
>
> Stupid question: can node 0 become offline between this check, and the
> VM_WARN_ON? :) BTW what kind of system has node 0 offline?
>
>
>
> Another question to ask would be is it possible for node 0 to be online, but
> be a memoryless node?
>
> I would say you are better off just making this call kmem_cache_alloc.  I
> don't see anything that indicates the memory has to come from node 0, so
> adding the extra overhead doesn't provide any value.
>
>
> I agree that this at least makes me wonder, though I actually have
> concerns in the opposite direction - I see assumptions about this
> being on node 0 in net/openvswitch/flow.c.
>
> Jarno, since you original wrote this code, can you take a look to see
> if everything still makes sense?
>
>
> We keep the pre-allocated stats node at array index 0, which is initially
> used by all CPUs, but if CPUs from multiple numa nodes start updating the
> stats, we allocate additional stats nodes (up to one per numa node), and the
> CPUs on node 0 keep using the preallocated entry. If stats cannot be
> allocated from CPUs local node, then those CPUs keep using the entry at
> index 0. Currently the code in net/openvswitch/flow.c will try to allocate
> the local memory repeatedly, which may not be optimal when there is no
> memory at the local node.
>
> Allocating the memory for the index 0 from other than node 0, as discussed
> here, just means that the CPUs on node 0 will keep on using non-local memory
> for stats. In a scenario where there are CPUs on two nodes (0, 1), but only
> the node 1 has memory, a shared flow entry will still end up having separate
> memory allocated for both nodes, but both of the nodes would be at node 1.
> However, there is still a high likelihood that the memory allocations would
> not share a cache line, which should prevent the nodes from invalidating
> each other’s caches. Based on this I do not see a problem relaxing the
> memory allocation for the default stats node. If node 0 has memory, however,
> it would be better to allocate the memory from node 0.
>
>
> Thanks for going through all of that.
>
> It seems like the question that is being raised is whether it actually
> makes sense to try to get the initial memory on node 0, especially
> since it seems to introduce some corner cases? Is there any reason why
> the flow is more likely to hit node 0 than a randomly chosen one?
> (Assuming that this is a multinode system, otherwise it's kind of a
> moot point.) We could have a separate pointer to the default allocated
> memory, so it wouldn't conflict with memory that was intentionally
> allocated for node 0.
>
>
> It would still be preferable to know from which node the default stats node
> was allocated, and store it in the appropriate pointer in the array. We
> could then add a new “default stats node index” that would be used to locate
> the node in the array of pointers we already have. That way we would avoid
> extra allocation and processing of the default stats node.

I agree, that sounds reasonable to me. Will you make that change?

Besides eliminating corner cases, it might help performance in some
cases too by avoiding stressing memory bandwidth on node 0.
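
For illustration only, the "default stats node index" idea quoted above could be modelled in userspace C roughly as follows. This is a sketch under stated assumptions, not the openvswitch code: `model_flow`, `model_stats`, `MODEL_MAX_NODES` and the helper names are all hypothetical, and `calloc()` stands in for the kernel slab allocator. The caller is assumed to know which node the preallocated entry actually landed on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_MAX_NODES 4 /* hypothetical node count for this model */

struct model_stats {
	unsigned long packets;
};

struct model_flow {
	int default_node;                           /* slot holding the preallocated entry */
	struct model_stats *stats[MODEL_MAX_NODES]; /* one slot per NUMA node */
};

/* Store the preallocated entry in the slot of the node it came from. */
static int model_flow_init(struct model_flow *flow, int allocated_on)
{
	if (allocated_on < 0 || allocated_on >= MODEL_MAX_NODES)
		return -1;
	for (int i = 0; i < MODEL_MAX_NODES; i++)
		flow->stats[i] = NULL;
	flow->stats[allocated_on] = calloc(1, sizeof(struct model_stats));
	if (!flow->stats[allocated_on])
		return -1;
	flow->default_node = allocated_on;
	return 0;
}

/* Prefer the node-local entry; otherwise use the default entry. */
static struct model_stats *model_flow_stats(struct model_flow *flow, int node)
{
	if (node >= 0 && node < MODEL_MAX_NODES && flow->stats[node])
		return flow->stats[node];
	return flow->stats[flow->default_node];
}

/* Release every allocated slot. */
static void model_flow_destroy(struct model_flow *flow)
{
	for (int i = 0; i < MODEL_MAX_NODES; i++) {
		free(flow->stats[i]);
		flow->stats[i] = NULL;
	}
}
```

In this model no separate pointer is needed for the default entry: it lives in the ordinary per-node array, and `default_node` only records which slot to use when a CPU's own node has no entry yet.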
Jarno Rajahalme Oct. 10, 2015, 12:02 a.m. UTC | #10
> On Oct 9, 2015, at 3:11 PM, Jesse Gross <jesse@nicira.com> wrote:
> 
> On Fri, Oct 9, 2015 at 8:54 AM, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
>> It would still be preferable to know from which node the default stats node
>> was allocated, and store it in the appropriate pointer in the array. We
>> could then add a new “default stats node index” that would be used to locate
>> the node in the array of pointers we already have. That way we would avoid
>> extra allocation and processing of the default stats node.
> 
> I agree, that sounds reasonable to me. Will you make that change?
> 
> Besides eliminating corner cases, it might help performance in some
> cases too by avoiding stressing memory bandwidth on node 0.

I’ll do this,

  Jarno
Jarno Rajahalme Oct. 20, 2015, 5:58 p.m. UTC | #11
> On Oct 9, 2015, at 5:02 PM, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
> 
> 
>> On Oct 9, 2015, at 3:11 PM, Jesse Gross <jesse@nicira.com> wrote:
>> 
>> On Fri, Oct 9, 2015 at 8:54 AM, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
>>> It would still be preferable to know from which node the default stats node
>>> was allocated, and store it in the appropriate pointer in the array. We
>>> could then add a new “default stats node index” that would be used to locate
>>> the node in the array of pointers we already have. That way we would avoid
>>> extra allocation and processing of the default stats node.
>> 
>> I agree, that sounds reasonable to me. Will you make that change?
>> 
>> Besides eliminating corner cases, it might help performance in some
>> cases too by avoiding stressing memory bandwidth on node 0.
> 

According to the comment above kmem_cache_alloc_node(), kmem_cache_alloc_node() should not BUG_ON/WARN_ON in this case:
> /**
>  * kmem_cache_alloc_node - Allocate an object on the specified node
>  * @cachep: The cache to allocate from.
>  * @flags: See kmalloc().
>  * @nodeid: node number of the target node.
>  *
>  * Identical to kmem_cache_alloc but it will allocate memory on the given
>  * node, which can improve the performance for cpu bound structures.
>  *
>  * Fallback to other node is possible if __GFP_THISNODE is not set.
>  */
See also this from cpuset.c:

> /**
>  * cpuset_mem_spread_node() - On which node to begin search for a file page
>  * cpuset_slab_spread_node() - On which node to begin search for a slab page
>  *
>  * If a task is marked PF_SPREAD_PAGE or PF_SPREAD_SLAB (as for
>  * tasks in a cpuset with is_spread_page or is_spread_slab set),
>  * and if the memory allocation used cpuset_mem_spread_node()
>  * to determine on which node to start looking, as it will for
>  * certain page cache or slab cache pages such as used for file
>  * system buffers and inode caches, then instead of starting on the
>  * local node to look for a free page, rather spread the starting
>  * node around the tasks mems_allowed nodes.
>  *
>  * We don't have to worry about the returned node being offline
>  * because "it can't happen", and even if it did, it would be ok.
>  *
>  * The routines calling guarantee_online_mems() are careful to
>  * only set nodes in task->mems_allowed that are online.  So it
>  * should not be possible for the following code to return an
>  * offline node.  But if it did, that would be ok, as this routine
>  * is not returning the node where the allocation must be, only
>  * the node where the search should start.  The zonelist passed to
>  * __alloc_pages() will include all nodes.  If the slab allocator
>  * is passed an offline node, it will fall back to the local node.
>  * See kmem_cache_alloc_node().
>  */


Based on this, it seems this is a bug in the memory allocator: perhaps it should not be calling alloc_pages_exact_node() when __GFP_THISNODE is not set?

  Jarno
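
As a rough userspace illustration of the contract described in the kmem_cache_alloc_node() comment quoted above (try the requested node, fall back elsewhere unless __GFP_THISNODE is set), consider the following sketch. It is only a model under stated assumptions: `model_node`, `model_alloc_node()`, `MODEL_THISNODE` and `MODEL_NO_NODE` are hypothetical stand-ins, and the fallback order (local node first, then any node) is a simplification, not the kernel's zonelist logic.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NODES 4
#define MODEL_NO_NODE   (-1)  /* stands in for NUMA_NO_NODE */
#define MODEL_THISNODE  0x1u  /* stands in for __GFP_THISNODE */

struct model_node {
	bool online;
	unsigned int free_objects;
};

/* Take one object from nid if that node is valid, online and has memory. */
static bool model_take(struct model_node *nodes, int nid)
{
	if (nid < 0 || nid >= MODEL_MAX_NODES)
		return false;
	if (!nodes[nid].online || nodes[nid].free_objects == 0)
		return false;
	nodes[nid].free_objects--;
	return true;
}

/* Returns the node that served the request, or MODEL_NO_NODE on failure. */
static int model_alloc_node(struct model_node *nodes, unsigned int flags,
			    int nid, int local_nid)
{
	if (nid == MODEL_NO_NODE)
		nid = local_nid;          /* no preference: start at the local node */
	if (model_take(nodes, nid))
		return nid;
	if (flags & MODEL_THISNODE)
		return MODEL_NO_NODE;     /* strict request: no fallback allowed */
	if (model_take(nodes, local_nid))
		return local_nid;         /* fallback 1: the caller's local node */
	for (int i = 0; i < MODEL_MAX_NODES; i++)
		if (model_take(nodes, i))
			return i;         /* fallback 2: any node with memory */
	return MODEL_NO_NODE;
}
```

In this model a request for an offline node 0 without the strict flag is simply served from another node, which is the behaviour the quoted comments describe.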
Vlastimil Babka Oct. 21, 2015, 8:55 a.m. UTC | #12
On 10/20/2015 07:58 PM, Jarno Rajahalme wrote:
>
>> On Oct 9, 2015, at 5:02 PM, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
>>
>>
>>> On Oct 9, 2015, at 3:11 PM, Jesse Gross <jesse@nicira.com> wrote:
>>>
>>> On Fri, Oct 9, 2015 at 8:54 AM, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
>>>> It would still be preferable to know from which node the default stats node
>>>> was allocated, and store it in the appropriate pointer in the array. We
>>>> could then add a new “default stats node index” that would be used to locate
>>>> the node in the array of pointers we already have. That way we would avoid
>>>> extra allocation and processing of the default stats node.
>>>
>>> I agree, that sounds reasonable to me. Will you make that change?
>>>
>>> Besides eliminating corner cases, it might help performance in some
>>> cases too by avoiding stressing memory bandwidth on node 0.
>>
>
> According to the comment above kmem_cache_alloc_node(),
> kmem_cache_alloc_node() should not BUG_ON/WARN_ON in this case:
>> /**
>>  * kmem_cache_alloc_node - Allocate an object on the specified node
>>  * @cachep: The cache to allocate from.
>>  * @flags: See kmalloc().
>>  * @nodeid: node number of the target node.
>>  *
>>  * Identical to kmem_cache_alloc but it will allocate memory on the given
>>  * node, which can improve the performance for cpu bound structures.
>>  *
>>  * Fallback to other node is possible if __GFP_THISNODE is not set.
>>  */
> See also this from cpuset.c:
>
>> /**
>>  * cpuset_mem_spread_node() - On which node to begin search for a file page
>>  * cpuset_slab_spread_node() - On which node to begin search for a slab page
>>  *
>>  * If a task is marked PF_SPREAD_PAGE or PF_SPREAD_SLAB (as for
>>  * tasks in a cpuset with is_spread_page or is_spread_slab set),
>>  * and if the memory allocation used cpuset_mem_spread_node()
>>  * to determine on which node to start looking, as it will for
>>  * certain page cache or slab cache pages such as used for file
>>  * system buffers and inode caches, then instead of starting on the
>>  * local node to look for a free page, rather spread the starting
>>  * node around the tasks mems_allowed nodes.
>>  *
>>  * We don't have to worry about the returned node being offline
>>  * because "it can't happen", and even if it did, it would be ok.
>>  *
>>  * The routines calling guarantee_online_mems() are careful to
>>  * only set nodes in task->mems_allowed that are online.  So it
>>  * should not be possible for the following code to return an
>>  * offline node.  But if it did, that would be ok, as this routine
>>  * is not returning the node where the allocation must be, only
>>  * the node where the search should start.  The zonelist passed to
>>  * __alloc_pages() will include all nodes.  If the slab allocator
>>  * is passed an offline node, it will fall back to the local node.

OK, this is probably only true without __GFP_THISNODE.

>>  * See kmem_cache_alloc_node().
>>  */
>
> Based on this it seems this is a bug in the memory allocator, it
> probably should not be calling alloc_pages_exact_node()

alloc_pages_exact_node() doesn't exist anymore in 4.3-rcX

So what exact problem do you think there is? What I can see is that:
- cpuset_slab_spread_node() says it shouldn't return an offline node, but
asserts that if it happens anyway, slab will fall back
- slab.c calls the spread_node function from alternate_node_alloc() and
then passes the nodeid to ____cache_alloc_node(), which calls
cache_grow() with __GFP_THISNODE, which eventually calls
__alloc_pages_node(). There VM_WARN_ON() may fire for an offline node,
and with __GFP_THISNODE the allocation will also fail... but then
fallback_alloc() takes over.

So the issue is a potential VM_WARN_ON when/if cpuset_slab_spread_node()
fails to guarantee the node is online?

> when __GFP_THISNODE is not set?
>
>    Jarno
>
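
The sequence described in this reply (a strict attempt on the chosen node that may warn and fail, followed by a fallback) can be pictured with a small userspace model. This is only a sketch of the description above, not the slab code: `model_state`, `model_alloc_exact()` and `model_slab_alloc()` are hypothetical names, and the `warnings` counter merely marks the point where VM_WARN_ON() would fire.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NODES 2

struct model_state {
	bool online[MODEL_MAX_NODES];
	int warnings; /* counts simulated VM_WARN_ON() hits */
};

/* Models the strict (__GFP_THISNODE-style) attempt on one node. */
static bool model_alloc_exact(struct model_state *s, int nid)
{
	if (nid < 0 || nid >= MODEL_MAX_NODES)
		return false;
	if (!s->online[nid]) {
		s->warnings++; /* the warning would be reported here */
		return false;  /* and the strict allocation fails */
	}
	return true;
}

/* Try the spread node strictly, then fall back to any online node. */
static int model_slab_alloc(struct model_state *s, int spread_nid)
{
	if (model_alloc_exact(s, spread_nid))
		return spread_nid;
	for (int i = 0; i < MODEL_MAX_NODES; i++)
		if (s->online[i])
			return i; /* the fallback path succeeds */
	return -1;
}
```

The point of the model is that the allocation can still succeed through the fallback even though the warning has already been emitted, so the visible symptom is the warning rather than an allocation failure.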

Patch

diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index f2ea83ba4763..c7f74aab34b9 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -93,7 +93,8 @@  struct sw_flow *ovs_flow_alloc(void)
 
 	/* Initialize the default stat node. */
 	stats = kmem_cache_alloc_node(flow_stats_cache,
-				      GFP_KERNEL | __GFP_ZERO, 0);
+				      GFP_KERNEL | __GFP_ZERO,
+				      node_online(0) ? 0 : NUMA_NO_NODE);
 	if (!stats)
 		goto err;