
[RFC,2/3] topology: support node_numa_mem() for determining the fallback node

Message ID alpine.DEB.2.10.1402071245040.20246@nuc (mailing list archive)
State Not Applicable

Commit Message

Christoph Lameter (Ampere) Feb. 7, 2014, 6:51 p.m. UTC
Here is a draft of a patch to make this work with memoryless nodes.

The first thing is that we modify node_match to also match if we hit an
empty node. In that case we simply take the current slab if it's there.

If there is no current slab then a regular allocation occurs with the
memoryless node. The page allocator will fall back to a possible node and
that will become the current slab. The next alloc from a memoryless node
will then use that slab.

For that we also add some tracking of allocations on nodes that were not
satisfied, using the empty_node[] array. A successful alloc on a node
clears that flag.

I would rather avoid the empty_node[] array since it's global and there may
be thread-specific allocation restrictions, but it would be expensive to do
an allocation attempt via the page allocator just to make sure that there is
really no page available from the page allocator.

Comments

Nishanth Aravamudan Feb. 7, 2014, 9:38 p.m. UTC | #1
On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.

Hi Christoph, should this be tested instead of Joonsoo's patch 2 (and 3)?

Thanks,
Nish
Joonsoo Kim Feb. 10, 2014, 1:15 a.m. UTC | #2
On Fri, Feb 07, 2014 at 01:38:55PM -0800, Nishanth Aravamudan wrote:
> On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> > Here is a draft of a patch to make this work with memoryless nodes.
> 
> Hi Christoph, this should be tested instead of Joonsoo's patch 2 (and 3)?

Hello,

I guess that your system has another problem that makes my patches ineffective.
Maybe it will also affect Christoph's. Could you confirm page_to_nid(),
numa_mem_id() and node_present_pages()? I mostly suspect page_to_nid().

Thanks.
Joonsoo Kim Feb. 10, 2014, 1:29 a.m. UTC | #3
On Fri, Feb 07, 2014 at 12:51:07PM -0600, Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.
> 
> The first thing is that we modify node_match to also match if we hit an
> empty node. In that case we simply take the current slab if its there.

Why not inspect whether we can get the page on the best node, such as the
numa_mem_id() node?

> 
> If there is no current slab then a regular allocation occurs with the
> memoryless node. The page allocator will fallback to a possible node and
> that will become the current slab. Next alloc from a memoryless node
> will then use that slab.
> 
> For that we also add some tracking of allocations on nodes that were not
> satisfied using the empty_node[] array. A successful alloc on a node
> clears that flag.
> 
> I would rather avoid the empty_node[] array since its global and there may
> be thread specific allocation restrictions but it would be expensive to do
> an allocation attempt via the page allocator to make sure that there is
> really no page available from the page allocator.
> 
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
> +++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
> @@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +static int empty_node[MAX_NUMNODES];
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
>  	void *last;
>  	void *p;
>  	int order;
> +	int alloc_node;
> 
>  	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>  	page = allocate_slab(s,
>  		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> -	if (!page)
> +	if (!page) {
> +		if (node != NUMA_NO_NODE)
> +			empty_node[node] = 1;
>  		goto out;
> +	}

empty_node cannot be set for a memoryless node, since the page allocation
would succeed on a different node.

Thanks.
Nishanth Aravamudan Feb. 10, 2014, 7:13 p.m. UTC | #4
Hi Christoph,

On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.
> 
> The first thing is that we modify node_match to also match if we hit an
> empty node. In that case we simply take the current slab if its there.
> 
> If there is no current slab then a regular allocation occurs with the
> memoryless node. The page allocator will fallback to a possible node and
> that will become the current slab. Next alloc from a memoryless node
> will then use that slab.
> 
> For that we also add some tracking of allocations on nodes that were not
> satisfied using the empty_node[] array. A successful alloc on a node
> clears that flag.
> 
> I would rather avoid the empty_node[] array since its global and there may
> be thread specific allocation restrictions but it would be expensive to do
> an allocation attempt via the page allocator to make sure that there is
> really no page available from the page allocator.

With this patch on our test system (I pulled out the numa_mem_id()
change, since you Acked Joonsoo's already), on top of 3.13.0 + my
kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
patch 1):

MemTotal:        8264704 kB
MemFree:         5924608 kB
...
Slab:            1402496 kB
SReclaimable:     102848 kB
SUnreclaim:      1299648 kB

And Anton's slabusage reports:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       207 MB   98.60%  100.00%
task_struct                         134 MB   97.82%  100.00%
kmalloc-8192                        117 MB  100.00%  100.00%
pgtable-2^12                        111 MB  100.00%  100.00%
pgtable-2^10                        104 MB  100.00%  100.00%

For comparison, Anton's patch applied at the same point in the series:

meminfo:

MemTotal:        8264704 kB
MemFree:         4150464 kB
...
Slab:            1590336 kB
SReclaimable:     208768 kB
SUnreclaim:      1381568 kB

slabusage:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       227 MB   98.63%  100.00%
kmalloc-8192                        130 MB  100.00%  100.00%
task_struct                         129 MB   97.73%  100.00%
pgtable-2^12                        112 MB  100.00%  100.00%
pgtable-2^10                        106 MB  100.00%  100.00%


Consider this patch:

Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

I was thinking about your concerns about empty_node[]. Would it make
sense to use a helper function, rather than direct access to
empty_node, such as:

	bool is_node_empty(int nid)

	void set_node_empty(int nid, bool empty)

which we stub out if !HAVE_MEMORYLESS_NODES to return false and no-op,
respectively?

That way only architectures that have memoryless nodes pay the penalty
of the array allocation?

Thanks,
Nish

> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
> +++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
> @@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +static int empty_node[MAX_NUMNODES];
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
>  	void *last;
>  	void *p;
>  	int order;
> +	int alloc_node;
> 
>  	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>  	page = allocate_slab(s,
>  		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> -	if (!page)
> +	if (!page) {
> +		if (node != NUMA_NO_NODE)
> +			empty_node[node] = 1;
>  		goto out;
> +	}
> 
>  	order = compound_order(page);
> -	inc_slabs_node(s, page_to_nid(page), page->objects);
> +	alloc_node = page_to_nid(page);
> +	empty_node[alloc_node] = 0;
> +	inc_slabs_node(s, alloc_node, page->objects);
>  	memcg_bind_pages(s, order);
>  	page->slab_cache = s;
>  	__SetPageSlab(page);
> @@ -1712,7 +1720,7 @@ static void *get_partial(struct kmem_cac
>  		struct kmem_cache_cpu *c)
>  {
>  	void *object;
> -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> 
>  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
>  	if (object || node != NUMA_NO_NODE)
> @@ -2107,8 +2115,25 @@ static void flush_all(struct kmem_cache
>  static inline int node_match(struct page *page, int node)
>  {
>  #ifdef CONFIG_NUMA
> -	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
> +	int page_node;
> +
> +	/* No data means no match */
> +	if (!page)
>  		return 0;
> +
> +	/* Node does not matter. Therefore anything is a match */
> +	if (node == NUMA_NO_NODE)
> +		return 1;
> +
> +	/* Did we hit the requested node ? */
> +	page_node = page_to_nid(page);
> +	if (page_node == node)
> +		return 1;
> +
> +	/* If the node has available data then we can use it. Mismatch */
> +	return !empty_node[page_node];
> +
> +	/* Target node empty so just take anything */
>  #endif
>  	return 1;
>  }
>
Joonsoo Kim Feb. 11, 2014, 7:42 a.m. UTC | #5
On Mon, Feb 10, 2014 at 11:13:21AM -0800, Nishanth Aravamudan wrote:
> Hi Christoph,
> 
> On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> > Here is a draft of a patch to make this work with memoryless nodes.
> > 
> > The first thing is that we modify node_match to also match if we hit an
> > empty node. In that case we simply take the current slab if its there.
> > 
> > If there is no current slab then a regular allocation occurs with the
> > memoryless node. The page allocator will fallback to a possible node and
> > that will become the current slab. Next alloc from a memoryless node
> > will then use that slab.
> > 
> > For that we also add some tracking of allocations on nodes that were not
> > satisfied using the empty_node[] array. A successful alloc on a node
> > clears that flag.
> > 
> > I would rather avoid the empty_node[] array since its global and there may
> > be thread specific allocation restrictions but it would be expensive to do
> > an allocation attempt via the page allocator to make sure that there is
> > really no page available from the page allocator.
> 
> With this patch on our test system (I pulled out the numa_mem_id()
> change, since you Acked Joonsoo's already), on top of 3.13.0 + my
> kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
> patch 1):
> 
> MemTotal:        8264704 kB
> MemFree:         5924608 kB
> ...
> Slab:            1402496 kB
> SReclaimable:     102848 kB
> SUnreclaim:      1299648 kB
> 
> And Anton's slabusage reports:
> 
> slab                                   mem     objs    slabs
>                                       used   active   active
> ------------------------------------------------------------
> kmalloc-16384                       207 MB   98.60%  100.00%
> task_struct                         134 MB   97.82%  100.00%
> kmalloc-8192                        117 MB  100.00%  100.00%
> pgtable-2^12                        111 MB  100.00%  100.00%
> pgtable-2^10                        104 MB  100.00%  100.00%
> 
> For comparison, Anton's patch applied at the same point in the series:
> 
> meminfo:
> 
> MemTotal:        8264704 kB
> MemFree:         4150464 kB
> ...
> Slab:            1590336 kB
> SReclaimable:     208768 kB
> SUnreclaim:      1381568 kB
> 
> slabusage:
> 
> slab                                   mem     objs    slabs
>                                       used   active   active
> ------------------------------------------------------------
> kmalloc-16384                       227 MB   98.63%  100.00%
> kmalloc-8192                        130 MB  100.00%  100.00%
> task_struct                         129 MB   97.73%  100.00%
> pgtable-2^12                        112 MB  100.00%  100.00%
> pgtable-2^10                        106 MB  100.00%  100.00%
> 
> 
> Consider this patch:
> 
> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

Hello,

I still think that there is another problem.
Your report about CONFIG_SLAB said that SLAB uses just 200MB.
Below is your previous report.

  Ok, with your patches applied and CONFIG_SLAB enabled:

  MemTotal:        8264640 kB
  MemFree:         7119680 kB
  Slab:             207232 kB
  SReclaimable:      32896 kB
  SUnreclaim:       174336 kB

The numbers on CONFIG_SLUB with these patches tell us that SLUB uses 1.4GB.
There is a large difference in slab usage.

And, I should note that the number of active objects in slabinfo can be wrong
in some situations, since it doesn't consider the cpu slab (and cpu partial slabs).

I recommend confirming page_to_nid() and the other things I mentioned earlier.

Thanks.
Christoph Lameter (Ampere) Feb. 11, 2014, 6:45 p.m. UTC | #6
On Mon, 10 Feb 2014, Joonsoo Kim wrote:

> On Fri, Feb 07, 2014 at 12:51:07PM -0600, Christoph Lameter wrote:
> > Here is a draft of a patch to make this work with memoryless nodes.
> >
> > The first thing is that we modify node_match to also match if we hit an
> > empty node. In that case we simply take the current slab if its there.
>
> Why not inspecting whether we can get the page on the best node such as
> numa_mem_id() node?

It's expensive to do so.

> empty_node cannot be set on memoryless node, since page allocation would
> succeed on different node.

Ok, then we need to add a check there for being on the right node too.
Nishanth Aravamudan Feb. 13, 2014, 6:51 a.m. UTC | #7
Hi Joonsoo,

On 11.02.2014 [16:42:00 +0900], Joonsoo Kim wrote:
> On Mon, Feb 10, 2014 at 11:13:21AM -0800, Nishanth Aravamudan wrote:
> > Hi Christoph,
> > 
> > On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> > > Here is a draft of a patch to make this work with memoryless nodes.
> > > 
> > > The first thing is that we modify node_match to also match if we hit an
> > > empty node. In that case we simply take the current slab if its there.
> > > 
> > > If there is no current slab then a regular allocation occurs with the
> > > memoryless node. The page allocator will fallback to a possible node and
> > > that will become the current slab. Next alloc from a memoryless node
> > > will then use that slab.
> > > 
> > > For that we also add some tracking of allocations on nodes that were not
> > > satisfied using the empty_node[] array. A successful alloc on a node
> > > clears that flag.
> > > 
> > > I would rather avoid the empty_node[] array since its global and there may
> > > be thread specific allocation restrictions but it would be expensive to do
> > > an allocation attempt via the page allocator to make sure that there is
> > > really no page available from the page allocator.
> > 
> > With this patch on our test system (I pulled out the numa_mem_id()
> > change, since you Acked Joonsoo's already), on top of 3.13.0 + my
> > kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
> > patch 1):
> > 
> > MemTotal:        8264704 kB
> > MemFree:         5924608 kB
> > ...
> > Slab:            1402496 kB
> > SReclaimable:     102848 kB
> > SUnreclaim:      1299648 kB
> > 
> > And Anton's slabusage reports:
> > 
> > slab                                   mem     objs    slabs
> >                                       used   active   active
> > ------------------------------------------------------------
> > kmalloc-16384                       207 MB   98.60%  100.00%
> > task_struct                         134 MB   97.82%  100.00%
> > kmalloc-8192                        117 MB  100.00%  100.00%
> > pgtable-2^12                        111 MB  100.00%  100.00%
> > pgtable-2^10                        104 MB  100.00%  100.00%
> > 
> > For comparison, Anton's patch applied at the same point in the series:
> > 
> > meminfo:
> > 
> > MemTotal:        8264704 kB
> > MemFree:         4150464 kB
> > ...
> > Slab:            1590336 kB
> > SReclaimable:     208768 kB
> > SUnreclaim:      1381568 kB
> > 
> > slabusage:
> > 
> > slab                                   mem     objs    slabs
> >                                       used   active   active
> > ------------------------------------------------------------
> > kmalloc-16384                       227 MB   98.63%  100.00%
> > kmalloc-8192                        130 MB  100.00%  100.00%
> > task_struct                         129 MB   97.73%  100.00%
> > pgtable-2^12                        112 MB  100.00%  100.00%
> > pgtable-2^10                        106 MB  100.00%  100.00%
> > 
> > 
> > Consider this patch:
> > 
> > Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> > Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> 
> Hello,
> 
> I still think that there is another problem.
> Your report about CONFIG_SLAB said that SLAB uses just 200MB.
> Below is your previous report.
> 
>   Ok, with your patches applied and CONFIG_SLAB enabled:
> 
>   MemTotal:        8264640 kB
>   MemFree:         7119680 kB
>   Slab:             207232 kB
>   SReclaimable:      32896 kB
>   SUnreclaim:       174336 kB
> 
> The number on CONFIG_SLUB with these patches tell us that SLUB uses 1.4GB.
> There is large difference on slab usage.

Agreed. But, at least for now, this gets us to not OOM all the time :) I
think that's significant progress. I will continue to look at where the
other gaps are, but would like to see Christoph's latest patch get merged
(pending my re-testing).

> And, I should note that number of active objects on slabinfo can be
> wrong on some situation, since it doesn't consider cpu slab (and cpu
> partial slab).

Well, I grabbed everything from /sys/kernel/slab for you in the
tarballs, I believe.

> I recommend to confirm page_to_nid() and other things as I mentioned
> earlier.

I believe these all work once CONFIG_HAVE_MEMORYLESS_NODES was set for
ppc64, but will test it again when I have access to the test system.

Also, given that only ia64 and (hopefully soon) ppc64 can set
CONFIG_HAVE_MEMORYLESS_NODES, does that mean x86_64 can't have
memoryless nodes present? Even with fakenuma? Just curious.

-Nish
Joonsoo Kim Feb. 17, 2014, 7 a.m. UTC | #8
On Wed, Feb 12, 2014 at 10:51:37PM -0800, Nishanth Aravamudan wrote:
> Hi Joonsoo,
> Also, given that only ia64 and (hopefuly soon) ppc64 can set
> CONFIG_HAVE_MEMORYLESS_NODES, does that mean x86_64 can't have
> memoryless nodes present? Even with fakenuma? Just curious.

I don't know, because I'm not an expert on NUMA systems :)
At first glance, fakenuma can't be used for testing
CONFIG_HAVE_MEMORYLESS_NODES. Maybe some modification is needed.

Thanks.
Christoph Lameter (Ampere) Feb. 18, 2014, 4:57 p.m. UTC | #9
On Mon, 17 Feb 2014, Joonsoo Kim wrote:

> On Wed, Feb 12, 2014 at 10:51:37PM -0800, Nishanth Aravamudan wrote:
> > Hi Joonsoo,
> > Also, given that only ia64 and (hopefuly soon) ppc64 can set
> > CONFIG_HAVE_MEMORYLESS_NODES, does that mean x86_64 can't have
> > memoryless nodes present? Even with fakenuma? Just curious.

x86_64 currently does not support memoryless nodes; otherwise it would
have set CONFIG_HAVE_MEMORYLESS_NODES in the kconfig. Memoryless nodes are
a bit strange given that the NUMA paradigm is to have NUMA nodes (meaning
memory) with processors. A memoryless node means that we have a fake NUMA
node without memory, just processors. Not very efficient. Not sure why
people use these configurations.

> I don't know, because I'm not expert on NUMA system :)
> At first glance, fakenuma can't be used for testing
> CONFIG_HAVE_MEMORYLESS_NODES. Maybe some modification is needed.

Well yeah. You'd have to do some mods to enable that testing.
Nishanth Aravamudan Feb. 18, 2014, 5:28 p.m. UTC | #10
On 18.02.2014 [10:57:09 -0600], Christoph Lameter wrote:
> On Mon, 17 Feb 2014, Joonsoo Kim wrote:
> 
> > On Wed, Feb 12, 2014 at 10:51:37PM -0800, Nishanth Aravamudan wrote:
> > > Hi Joonsoo,
> > > Also, given that only ia64 and (hopefuly soon) ppc64 can set
> > > CONFIG_HAVE_MEMORYLESS_NODES, does that mean x86_64 can't have
> > > memoryless nodes present? Even with fakenuma? Just curious.
> 
> x86_64 currently does not support memoryless nodes otherwise it would
> have set CONFIG_HAVE_MEMORYLESS_NODES in the kconfig. Memoryless nodes are
> a bit strange given that the NUMA paradigm is to have NUMA nodes (meaning
> memory) with processors. MEMORYLESS nodes means that we have a fake NUMA
> node without memory but just processors. Not very efficient. Not sure why
> people use these configurations.

Well, on powerpc, with the hypervisor providing the resources and the
topology, you can have cpuless and memoryless nodes. I'm not sure how
"fake" the NUMA is -- since the resources are virtualized into one
system, it's logically possible that the actual topology is CPUs from
physical node 0 and memory from physical node 2. I would think that with
KVM on a sufficiently large (physically NUMA x86_64) and loaded system,
one could cause the same sort of configuration to occur for a guest?

In any case, these configurations happen fairly often on long-running
(not rebooted) systems as LPARs are created/destroyed, resources are
DLPAR'd in and out of LPARs, etc.

> > I don't know, because I'm not expert on NUMA system :)
> > At first glance, fakenuma can't be used for testing
> > CONFIG_HAVE_MEMORYLESS_NODES. Maybe some modification is needed.
> 
> Well yeah. You'd have to do some mods to enable that testing.

I might look into it, as it might have sped up testing these changes.

Thanks,
Nish
Christoph Lameter (Ampere) Feb. 18, 2014, 7:58 p.m. UTC | #11
On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

>
> Well, on powerpc, with the hypervisor providing the resources and the
> topology, you can have cpuless and memoryless nodes. I'm not sure how
> "fake" the NUMA is -- as I think since the resources are virtualized to
> be one system, it's logically possible that the actual topology of the
> resources can be CPUs from physical node 0 and memory from physical node
> 2. I would think with KVM on a sufficiently large (physically NUMA
> x86_64) and loaded system, one could cause the same sort of
> configuration to occur for a guest?

Ok but since you have a virtualized environment: Why not provide a fake
home node with fake memory that could be anywhere? This would avoid the
whole problem of supporting such a config at the kernel level.

Do not have a fake node that has no memory.

> In any case, these configurations happen fairly often on long-running
> (not rebooted) systems as LPARs are created/destroyed, resources are
> DLPAR'd in and out of LPARs, etc.

Ok then also move the memory of the local node somewhere?

> I might look into it, as it might have sped up testing these changes.

I guess that will be necessary in order to support memoryless nodes
long term.
Nishanth Aravamudan Feb. 18, 2014, 9:09 p.m. UTC | #12
On 18.02.2014 [13:58:20 -0600], Christoph Lameter wrote:
> On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:
> 
> >
> > Well, on powerpc, with the hypervisor providing the resources and the
> > topology, you can have cpuless and memoryless nodes. I'm not sure how
> > "fake" the NUMA is -- as I think since the resources are virtualized to
> > be one system, it's logically possible that the actual topology of the
> > resources can be CPUs from physical node 0 and memory from physical node
> > 2. I would think with KVM on a sufficiently large (physically NUMA
> > x86_64) and loaded system, one could cause the same sort of
> > configuration to occur for a guest?
> 
> Ok but since you have a virtualized environment: Why not provide a fake
> home node with fake memory that could be anywhere? This would avoid the
> whole problem of supporting such a config at the kernel level.

We use the topology provided by the hypervisor, it does actually reflect
where CPUs and memory are, and their corresponding performance/NUMA
characteristics.

> Do not have a fake node that has no memory.
> 
> > In any case, these configurations happen fairly often on long-running
> > (not rebooted) systems as LPARs are created/destroyed, resources are
> > DLPAR'd in and out of LPARs, etc.
> 
> Ok then also move the memory of the local node somewhere?

This happens below the OS, we don't control the hypervisor's decisions.
I'm not sure if that's what you are suggesting.

Thanks,
Nish
Christoph Lameter (Ampere) Feb. 18, 2014, 9:49 p.m. UTC | #13
On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

> We use the topology provided by the hypervisor, it does actually reflect
> where CPUs and memory are, and their corresponding performance/NUMA
> characteristics.

And so there are actually nodes without memory that have processors?
Can the hypervisor or the linux arch code be convinced to ignore nodes
without memory or assign a sane default node to processors?

> > Ok then also move the memory of the local node somewhere?
>
> This happens below the OS, we don't control the hypervisor's decisions.
> I'm not sure if that's what you are suggesting.

You could also do this from the powerpc arch code by sanitizing the
processor / node information that is then used by Linux.
Nishanth Aravamudan Feb. 18, 2014, 10:22 p.m. UTC | #14
On 18.02.2014 [15:49:22 -0600], Christoph Lameter wrote:
> On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:
> 
> > We use the topology provided by the hypervisor, it does actually reflect
> > where CPUs and memory are, and their corresponding performance/NUMA
> > characteristics.
> 
> And so there are actually nodes without memory that have processors?

Virtually (topologically, as indicated to Linux), yes. Physically, I
don't think they are, but they might be exhausted, which is why we get
these odd-appearing NUMA configurations.

> Can the hypervisor or the linux arch code be convinced to ignore nodes
> without memory or assign a sane default node to processors?

I think this happens quite often, so I don't think we want to ignore
the performance impact of the underlying NUMA configuration. I guess we
could special-case memoryless/cpuless configurations somewhat, but I
don't think there's any reason to do that if we can make memoryless-node
support work in-kernel?

> > > Ok then also move the memory of the local node somewhere?
> >
> > This happens below the OS, we don't control the hypervisor's decisions.
> > I'm not sure if that's what you are suggesting.
> 
> You could also do this from the powerpc arch code by sanitizing the
> processor / node information that is then used by Linux.

I see what you're saying now, thanks!

-Nish
Christoph Lameter (Ampere) Feb. 19, 2014, 4:11 p.m. UTC | #15
On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

> the performance impact of the underlying NUMA configuration. I guess we
> could special-case memoryless/cpuless configurations somewhat, but I
> don't think there's any reason to do that if we can make memoryless-node
> support work in-kernel?

Well, we can make it work in-kernel, but it has always been a bit wacky (as
is the idea of NUMA "memory" nodes without memory).
David Rientjes Feb. 19, 2014, 10:03 p.m. UTC | #16
On Tue, 18 Feb 2014, Christoph Lameter wrote:

> Ok but since you have a virtualized environment: Why not provide a fake
> home node with fake memory that could be anywhere? This would avoid the
> whole problem of supporting such a config at the kernel level.
> 

By ACPI, the abstraction of a NUMA node can include any combination of
cpus, memory, I/O resources, networking, or storage devices. This allows
two memoryless nodes, for example, to have different proximity to memory.

Patch

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
+++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
@@ -132,6 +132,8 @@  static inline bool kmem_cache_has_cpu_pa
 #endif
 }

+static int empty_node[MAX_NUMNODES];
+
 /*
  * Issues still to be resolved:
  *
@@ -1405,16 +1407,22 @@  static struct page *new_slab(struct kmem
 	void *last;
 	void *p;
 	int order;
+	int alloc_node;

 	BUG_ON(flags & GFP_SLAB_BUG_MASK);

 	page = allocate_slab(s,
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
-	if (!page)
+	if (!page) {
+		if (node != NUMA_NO_NODE)
+			empty_node[node] = 1;
 		goto out;
+	}

 	order = compound_order(page);
-	inc_slabs_node(s, page_to_nid(page), page->objects);
+	alloc_node = page_to_nid(page);
+	empty_node[alloc_node] = 0;
+	inc_slabs_node(s, alloc_node, page->objects);
 	memcg_bind_pages(s, order);
 	page->slab_cache = s;
 	__SetPageSlab(page);
@@ -1712,7 +1720,7 @@  static void *get_partial(struct kmem_cac
 		struct kmem_cache_cpu *c)
 {
 	void *object;
-	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;

 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
@@ -2107,8 +2115,25 @@  static void flush_all(struct kmem_cache
 static inline int node_match(struct page *page, int node)
 {
 #ifdef CONFIG_NUMA
-	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
+	int page_node;
+
+	/* No data means no match */
+	if (!page)
 		return 0;
+
+	/* Node does not matter. Therefore anything is a match */
+	if (node == NUMA_NO_NODE)
+		return 1;
+
+	/* Did we hit the requested node ? */
+	page_node = page_to_nid(page);
+	if (page_node == node)
+		return 1;
+
+	/* If the node has available data then we can use it. Mismatch */
+	return !empty_node[page_node];
+
+	/* Target node empty so just take anything */
 #endif
 	return 1;
 }