Patchwork [v5,01/14] memory-hotplug: try to offline the memory twice to avoid dependence

Submitter Tang Chen
Date Dec. 24, 2012, 12:09 p.m.
Message ID <1356350964-13437-2-git-send-email-tangchen@cn.fujitsu.com>
Permalink /patch/208059/
State Not Applicable

Comments

Tang Chen - Dec. 24, 2012, 12:09 p.m.
From: Wen Congyang <wency@cn.fujitsu.com>

Memory may fail to be offlined when CONFIG_MEMCG is selected.
For example, suppose there is a memory device on node 1 whose address range
is [1G, 1.5G). You will find 4 new directories, memory8, memory9, memory10,
and memory11, under the directory /sys/devices/system/memory/.

If CONFIG_MEMCG is selected, we allocate memory to store the page cgroup
when we online pages. When we online memory8, the memory that stores its page
cgroup is not provided by this memory device. But when we online memory9, the
memory that stores its page cgroup may be provided by memory8. In that case we
can't offline memory8 while memory9 is still online, so the memory should be
offlined in the reverse of the order in which it was onlined.

When the memory device is hot-removed, we automatically offline the memory
provided by this memory device. But we don't know which memory block was
onlined first, so offlining may fail. In that case, iterate twice to offline
the memory:
1st iteration: offline every non-primary memory block.
2nd iteration: offline the primary (i.e. first added) memory block.

This idea is suggested by KOSAKI Motohiro.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 mm/memory_hotplug.c |   16 ++++++++++++++--
 1 files changed, 14 insertions(+), 2 deletions(-)
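
The two-pass idea described above can be sketched in user space as follows. This is only a toy model under stated assumptions, not the kernel code: `toy_node`, `toy_offline_block()` and `toy_remove_memory()` are hypothetical names, and `holder[]` is an assumed record of which block stores another block's page_cgroup. The sketch computes its retry flag afresh on each pass, so it never makes more than two passes.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NBLOCKS 4

/*
 * Toy model (hypothetical): holder[i] is the index of the block whose
 * memory stores the page_cgroup array of block i, or -1 if that array
 * lives outside this memory device.
 */
struct toy_node {
	int holder[TOY_NBLOCKS];
	int online[TOY_NBLOCKS];
};

/* Stand-in for offline_memory_block(): 0 on success, -1 if the block is busy. */
int toy_offline_block(struct toy_node *n, int blk)
{
	assert(blk >= 0 && blk < TOY_NBLOCKS);

	if (!n->online[blk])
		return 0;	/* already offline, nothing to do */

	for (int i = 0; i < TOY_NBLOCKS; i++)
		if (i != blk && n->online[i] && n->holder[i] == blk)
			return -1;	/* another online block still depends on us */

	n->online[blk] = 0;
	return 0;
}

/* First pass tolerates failures; the second pass returns the first error. */
int toy_remove_memory(struct toy_node *n)
{
	int return_on_error = 0;

	for (;;) {
		int retry = 0;

		for (int blk = 0; blk < TOY_NBLOCKS; blk++) {
			int ret = toy_offline_block(n, blk);

			if (ret) {
				if (return_on_error)
					return ret;
				retry = 1;
			}
		}

		if (!retry)
			return 0;
		return_on_error = 1;
	}
}
```

In this model, if blocks 1-3 all keep their page_cgroup in block 0, the first pass offlines blocks 1-3 and the second pass offlines block 0, so `toy_remove_memory()` returns 0.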
Glauber Costa - Dec. 25, 2012, 8:35 a.m.
On 12/24/2012 04:09 PM, Tang Chen wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> memory can't be offlined when CONFIG_MEMCG is selected.
> For example: there is a memory device on node 1. The address range
> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
> and memory11 under the directory /sys/devices/system/memory/.
> 
> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
> when we online pages. When we online memory8, the memory stored page cgroup
> is not provided by this memory device. But when we online memory9, the memory
> stored page cgroup may be provided by memory8. So we can't offline memory8
> now. We should offline the memory in the reversed order.
> 
> When the memory device is hotremoved, we will auto offline memory provided
> by this memory device. But we don't know which memory is onlined first, so
> offlining memory may fail. In such case, iterate twice to offline the memory.
> 1st iterate: offline every non primary memory block.
> 2nd iterate: offline primary (i.e. first added) memory block.
> 
> This idea is suggested by KOSAKI Motohiro.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

Maybe there is something here that I am missing - I admit that I came
late to this one, but this really sounds like a very ugly hack, that
really has no place in here.

Retrying, of course, may make sense, if we have reasonable belief that
we may now succeed. If this is the case, you need to document - in the
code - why that is.

The memcg argument, however, doesn't really cut it. Why can't we make
all page_cgroup allocations local to the node they are describing? If
memcg is the culprit here, we should fix it, and not retry. If there is
still any benefit in retrying, then we retry being very specific about why.
KAMEZAWA Hiroyuki - Dec. 26, 2012, 3:02 a.m.
(2012/12/24 21:09), Tang Chen wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> memory can't be offlined when CONFIG_MEMCG is selected.
> For example: there is a memory device on node 1. The address range
> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
> and memory11 under the directory /sys/devices/system/memory/.
> 
> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
> when we online pages. When we online memory8, the memory stored page cgroup
> is not provided by this memory device. But when we online memory9, the memory
> stored page cgroup may be provided by memory8. So we can't offline memory8
> now. We should offline the memory in the reversed order.
> 

That is the case if memory8 is onlined as NORMAL memory, right?

IIUC, vmalloc() uses __GFP_HIGHMEM but doesn't use __GFP_MOVABLE.

> When the memory device is hotremoved, we will auto offline memory provided
> by this memory device. But we don't know which memory is onlined first, so
> offlining memory may fail. In such case, iterate twice to offline the memory.
> 1st iterate: offline every non primary memory block.
> 2nd iterate: offline primary (i.e. first added) memory block.
> 
> This idea is suggested by KOSAKI Motohiro.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

I'm not sure, but shouldn't the whole DIMM be onlined as MOVABLE memory?

Anyway, I agree this kind of retry is required if the memory is onlined as NORMAL memory.
But is retrying once enough?

Thanks,
-Kame

> ---
>   mm/memory_hotplug.c |   16 ++++++++++++++--
>   1 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d04ed87..62e04c9 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1388,10 +1388,13 @@ int remove_memory(u64 start, u64 size)
>   	unsigned long start_pfn, end_pfn;
>   	unsigned long pfn, section_nr;
>   	int ret;
> +	int return_on_error = 0;
> +	int retry = 0;
>   
>   	start_pfn = PFN_DOWN(start);
>   	end_pfn = start_pfn + PFN_DOWN(size);
>   
> +repeat:
>   	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>   		section_nr = pfn_to_section_nr(pfn);
>   		if (!present_section_nr(section_nr))
> @@ -1410,14 +1413,23 @@ int remove_memory(u64 start, u64 size)
>   
>   		ret = offline_memory_block(mem);
>   		if (ret) {
> -			kobject_put(&mem->dev.kobj);
> -			return ret;
> +			if (return_on_error) {
> +				kobject_put(&mem->dev.kobj);
> +				return ret;
> +			} else {
> +				retry = 1;
> +			}
>   		}
>   	}
>   
>   	if (mem)
>   		kobject_put(&mem->dev.kobj);
>   
> +	if (retry) {
> +		return_on_error = 1;
> +		goto repeat;
> +	}
> +
>   	return 0;
>   }
>   #else
>
Wen Congyang - Dec. 30, 2012, 5:49 a.m.
At 12/26/2012 11:02 AM, Kamezawa Hiroyuki Wrote:
> (2012/12/24 21:09), Tang Chen wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> memory can't be offlined when CONFIG_MEMCG is selected.
>> For example: there is a memory device on node 1. The address range
>> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>> and memory11 under the directory /sys/devices/system/memory/.
>>
>> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
>> when we online pages. When we online memory8, the memory stored page cgroup
>> is not provided by this memory device. But when we online memory9, the memory
>> stored page cgroup may be provided by memory8. So we can't offline memory8
>> now. We should offline the memory in the reversed order.
>>
> 
> If memory8 is onlined as NORMAL memory ...right ?

Yes, memory8 is onlined as NORMAL memory. And when we online memory9, we allocate
memory from memory8 to store page cgroup information.

> 
> IIUC, vmalloc() uses __GFP_HIGHMEM but doesn't use __GFP_MOVABLE.
> 
>> When the memory device is hotremoved, we will auto offline memory provided
>> by this memory device. But we don't know which memory is onlined first, so
>> offlining memory may fail. In such case, iterate twice to offline the memory.
>> 1st iterate: offline every non primary memory block.
>> 2nd iterate: offline primary (i.e. first added) memory block.
>>
>> This idea is suggested by KOSAKI Motohiro.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> 
> I'm not sure but the whole DIMM should be onlined as MOVABLE mem ?

If the whole DIMM is onlined as MOVABLE memory, we can offline it without
needing to retry.

> 
> Anyway, I agree this kind of retry is required if memory is onlined as NORMAL mem.
> But retry-once is ok ?

I'm not sure, but I think in most cases the user will online first the memory
that was hot-added first. So the first attempt may always fail, and a single
retry can then succeed.

Thanks
Wen Congyang
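
Whether one retry is enough depends on how the page_cgroup dependencies are laid out. The toy model below counts how many ascending-order passes are needed before every block is offline. All names are hypothetical, and `holder[i]` is an assumed record of which block stores block i's page_cgroup (-1 if it lives elsewhere); the model says nothing about which layouts the allocator actually produces.

```c
#include <assert.h>

#define TOY_MAX_BLOCKS 16

/* A block can go offline only if no other online block keeps metadata in it. */
int toy_can_offline(const int *holder, const int *online, int n, int blk)
{
	for (int i = 0; i < n; i++)
		if (i != blk && online[i] && holder[i] == blk)
			return 0;
	return 1;
}

/*
 * Number of ascending-order passes until all n blocks are offline,
 * or -1 if some pass makes no progress (e.g. a dependency cycle).
 */
int toy_passes_needed(const int *holder, int n)
{
	int online[TOY_MAX_BLOCKS];
	int remaining = n;
	int passes = 0;

	assert(n > 0 && n <= TOY_MAX_BLOCKS);
	for (int i = 0; i < n; i++)
		online[i] = 1;

	while (remaining > 0) {
		int progress = 0;

		for (int blk = 0; blk < n; blk++) {
			if (online[blk] && toy_can_offline(holder, online, n, blk)) {
				online[blk] = 0;
				remaining--;
				progress = 1;
			}
		}
		if (!progress)
			return -1;
		passes++;
	}
	return passes;
}
```

In this model, a layout where blocks 1-3 all depend on block 0 (`{-1, 0, 0, 0}`) needs two passes, which matches the retry-once scheme. A chain where each block depends on the previous one (`{-1, 0, 1, 2}`) needs four passes, so a single retry would not be enough if such a chain can occur.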

> 
> Thanks,
> -Kame
> 
>> ---
>>   mm/memory_hotplug.c |   16 ++++++++++++++--
>>   1 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index d04ed87..62e04c9 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1388,10 +1388,13 @@ int remove_memory(u64 start, u64 size)
>>   	unsigned long start_pfn, end_pfn;
>>   	unsigned long pfn, section_nr;
>>   	int ret;
>> +	int return_on_error = 0;
>> +	int retry = 0;
>>   
>>   	start_pfn = PFN_DOWN(start);
>>   	end_pfn = start_pfn + PFN_DOWN(size);
>>   
>> +repeat:
>>   	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>>   		section_nr = pfn_to_section_nr(pfn);
>>   		if (!present_section_nr(section_nr))
>> @@ -1410,14 +1413,23 @@ int remove_memory(u64 start, u64 size)
>>   
>>   		ret = offline_memory_block(mem);
>>   		if (ret) {
>> -			kobject_put(&mem->dev.kobj);
>> -			return ret;
>> +			if (return_on_error) {
>> +				kobject_put(&mem->dev.kobj);
>> +				return ret;
>> +			} else {
>> +				retry = 1;
>> +			}
>>   		}
>>   	}
>>   
>>   	if (mem)
>>   		kobject_put(&mem->dev.kobj);
>>   
>> +	if (retry) {
>> +		return_on_error = 1;
>> +		goto repeat;
>> +	}
>> +
>>   	return 0;
>>   }
>>   #else
>>
> 
> 
>
Wen Congyang - Dec. 30, 2012, 5:58 a.m.
At 12/25/2012 04:35 PM, Glauber Costa Wrote:
> On 12/24/2012 04:09 PM, Tang Chen wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> memory can't be offlined when CONFIG_MEMCG is selected.
>> For example: there is a memory device on node 1. The address range
>> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>> and memory11 under the directory /sys/devices/system/memory/.
>>
>> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
>> when we online pages. When we online memory8, the memory stored page cgroup
>> is not provided by this memory device. But when we online memory9, the memory
>> stored page cgroup may be provided by memory8. So we can't offline memory8
>> now. We should offline the memory in the reversed order.
>>
>> When the memory device is hotremoved, we will auto offline memory provided
>> by this memory device. But we don't know which memory is onlined first, so
>> offlining memory may fail. In such case, iterate twice to offline the memory.
>> 1st iterate: offline every non primary memory block.
>> 2nd iterate: offline primary (i.e. first added) memory block.
>>
>> This idea is suggested by KOSAKI Motohiro.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> 
> Maybe there is something here that I am missing - I admit that I came
> late to this one, but this really sounds like a very ugly hack, that
> really has no place in here.
> 
> Retrying, of course, may make sense, if we have reasonable belief that
> we may now succeed. If this is the case, you need to document - in the
> code - while is that.
> 
> The memcg argument, however, doesn't really cut it. Why can't we make
> all page_cgroup allocations local to the node they are describing? If
> memcg is the culprit here, we should fix it, and not retry. If there is
> still any benefit in retrying, then we retry being very specific about why.

We already try to make all page_cgroup allocations local to the node they
describe. But if the memory is the first memory onlined in this node, we have
to allocate its page_cgroup from another node.

For example, node1 has 4 memory blocks, 8-11, and we online them from 8 to 11:
1. memory block 8: page_cgroup allocations are in other nodes
2. memory block 9: page_cgroup allocations are in memory block 8

So we should offline memory block 9 first. But we don't know in which order
the user onlined the memory blocks.

I think we can modify memcg like this:
allocate the page_cgroup memory from the memory block it describes

I am not sure whether it is OK to do so.

Thanks
Wen Congyang

> 
> 
>
Glauber Costa - Jan. 9, 2013, 3:09 p.m.
On 12/30/2012 09:58 AM, Wen Congyang wrote:
> At 12/25/2012 04:35 PM, Glauber Costa Wrote:
>> On 12/24/2012 04:09 PM, Tang Chen wrote:
>>> From: Wen Congyang <wency@cn.fujitsu.com>
>>>
>>> memory can't be offlined when CONFIG_MEMCG is selected.
>>> For example: there is a memory device on node 1. The address range
>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>>> and memory11 under the directory /sys/devices/system/memory/.
>>>
>>> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
>>> when we online pages. When we online memory8, the memory stored page cgroup
>>> is not provided by this memory device. But when we online memory9, the memory
>>> stored page cgroup may be provided by memory8. So we can't offline memory8
>>> now. We should offline the memory in the reversed order.
>>>
>>> When the memory device is hotremoved, we will auto offline memory provided
>>> by this memory device. But we don't know which memory is onlined first, so
>>> offlining memory may fail. In such case, iterate twice to offline the memory.
>>> 1st iterate: offline every non primary memory block.
>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>
>>> This idea is suggested by KOSAKI Motohiro.
>>>
>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>
>> Maybe there is something here that I am missing - I admit that I came
>> late to this one, but this really sounds like a very ugly hack, that
>> really has no place in here.
>>
>> Retrying, of course, may make sense, if we have reasonable belief that
>> we may now succeed. If this is the case, you need to document - in the
>> code - while is that.
>>
>> The memcg argument, however, doesn't really cut it. Why can't we make
>> all page_cgroup allocations local to the node they are describing? If
>> memcg is the culprit here, we should fix it, and not retry. If there is
>> still any benefit in retrying, then we retry being very specific about why.
> 
> We try to make all page_cgroup allocations local to the node they are describing
> now. If the memory is the first memory onlined in this node, we will allocate
> it from the other node.
> 
> For example, node1 has 4 memory blocks: 8-11, and we online it from 8 to 11
> 1. memory block 8, page_cgroup allocations are in the other nodes
> 2. memory block 9, page_cgroup allocations are in memory block 8
> 
> So we should offline memory block 9 first. But we don't know in which order
> the user online the memory block.
> 
> I think we can modify memcg like this:
> allocate the memory from the memory block they are describing
> 
> I am not sure it is OK to do so.

I don't see a reason why not.

You would have to tweak the lookup function for page_cgroup a bit, but
assuming you will always have the pfns and limits, it should be easy to do.

I think the only tricky part is that today we have a single
node_page_cgroup, and we would of course have to have one per memory
block. My assumption is that the number of memory blocks is limited and
likely not very big. So even a static array would do.

Kamezawa, do you have any input in here?
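
A per-block lookup along these lines could look roughly like the sketch below. It is a user-space illustration only: the `toy_*` names, the 64-entry table and the 128MB block size with 4KB pages are all assumptions made for the example, not existing kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGES_PER_BLOCK 32768UL	/* assumed: 128MB block / 4KB page */
#define TOY_MAX_BLOCKS      64UL	/* assumed small, fixed upper bound */

/* Stand-in for struct page_cgroup; the layout is illustrative. */
struct toy_page_cgroup {
	unsigned long flags;
	void *mem_cgroup;
};

/* One descriptor per memory block instead of one per node. */
struct toy_block_pc {
	unsigned long start_pfn;
	struct toy_page_cgroup *base;	/* NULL while the block has no array */
};

static struct toy_block_pc toy_blocks[TOY_MAX_BLOCKS];

/* Record where the page_cgroup array of the block starting at start_pfn lives. */
void toy_register_block(unsigned long start_pfn, struct toy_page_cgroup *base)
{
	unsigned long idx = start_pfn / TOY_PAGES_PER_BLOCK;

	assert(start_pfn % TOY_PAGES_PER_BLOCK == 0);
	assert(idx < TOY_MAX_BLOCKS);
	toy_blocks[idx].start_pfn = start_pfn;
	toy_blocks[idx].base = base;
}

/* Map a pfn to its page_cgroup entry via the block it belongs to. */
struct toy_page_cgroup *toy_lookup_page_cgroup(unsigned long pfn)
{
	unsigned long idx = pfn / TOY_PAGES_PER_BLOCK;

	if (idx >= TOY_MAX_BLOCKS || !toy_blocks[idx].base)
		return NULL;
	return toy_blocks[idx].base + (pfn - toy_blocks[idx].start_pfn);
}
```

With these assumed sizes, the block starting at 1G has index 8, which lines up with the memory8 example earlier in the thread.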
Tang Chen - Jan. 10, 2013, 1:38 a.m.
Hi Glauber,

On 01/09/2013 11:09 PM, Glauber Costa wrote:
>>
>> We try to make all page_cgroup allocations local to the node they are describing
>> now. If the memory is the first memory onlined in this node, we will allocate
>> it from the other node.
>>
>> For example, node1 has 4 memory blocks: 8-11, and we online it from 8 to 11
>> 1. memory block 8, page_cgroup allocations are in the other nodes
>> 2. memory block 9, page_cgroup allocations are in memory block 8
>>
>> So we should offline memory block 9 first. But we don't know in which order
>> the user online the memory block.
>>
>> I think we can modify memcg like this:
>> allocate the memory from the memory block they are describing
>>
>> I am not sure it is OK to do so.
>
> I don't see a reason why not.

I'm not sure, but if we do this, we would introduce a fragment in each
memory block (a memory section, 128MB, right?). Is this a problem when
we use large pages (such as 1GB pages)?

Even if not, will these fragments have any bad effects?

Thanks. :)

>
> You would have to tweak a bit the lookup function for page_cgroup, but
> assuming you will always have the pfns and limits, it should be easy to do.
>
> I think the only tricky part is that today we have a single
> node_page_cgroup, and we would of course have to have one per memory
> block. My assumption is that the number of memory blocks is limited and
> likely not very big. So even a static array would do.
>
> Kamezawa, do you have any input in here?
>
Tang Chen - Feb. 6, 2013, 3:07 a.m.
Hi Glauber, all,

An old thing I want to discuss with you. :)

On 01/09/2013 11:09 PM, Glauber Costa wrote:
>>>> memory can't be offlined when CONFIG_MEMCG is selected.
>>>> For example: there is a memory device on node 1. The address range
>>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>>>> and memory11 under the directory /sys/devices/system/memory/.
>>>>
>>>> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
>>>> when we online pages. When we online memory8, the memory stored page cgroup
>>>> is not provided by this memory device. But when we online memory9, the memory
>>>> stored page cgroup may be provided by memory8. So we can't offline memory8
>>>> now. We should offline the memory in the reversed order.
>>>>
>>>> When the memory device is hotremoved, we will auto offline memory provided
>>>> by this memory device. But we don't know which memory is onlined first, so
>>>> offlining memory may fail. In such case, iterate twice to offline the memory.
>>>> 1st iterate: offline every non primary memory block.
>>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>>
>>>> This idea is suggested by KOSAKI Motohiro.
>>>>
>>>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
>>>
>>> Maybe there is something here that I am missing - I admit that I came
>>> late to this one, but this really sounds like a very ugly hack, that
>>> really has no place in here.
>>>
>>> Retrying, of course, may make sense, if we have reasonable belief that
>>> we may now succeed. If this is the case, you need to document - in the
>>> code - while is that.
>>>
>>> The memcg argument, however, doesn't really cut it. Why can't we make
>>> all page_cgroup allocations local to the node they are describing? If
>>> memcg is the culprit here, we should fix it, and not retry. If there is
>>> still any benefit in retrying, then we retry being very specific about why.
>>
>> We try to make all page_cgroup allocations local to the node they are describing
>> now. If the memory is the first memory onlined in this node, we will allocate
>> it from the other node.
>>
>> For example, node1 has 4 memory blocks: 8-11, and we online it from 8 to 11
>> 1. memory block 8, page_cgroup allocations are in the other nodes
>> 2. memory block 9, page_cgroup allocations are in memory block 8
>>
>> So we should offline memory block 9 first. But we don't know in which order
>> the user online the memory block.
>>
>> I think we can modify memcg like this:
>> allocate the memory from the memory block they are describing
>>
>> I am not sure it is OK to do so.
>
> I don't see a reason why not.
>
> You would have to tweak a bit the lookup function for page_cgroup, but
> assuming you will always have the pfns and limits, it should be easy to do.
>
> I think the only tricky part is that today we have a single
> node_page_cgroup, and we would of course have to have one per memory
> block. My assumption is that the number of memory blocks is limited and
> likely not very big. So even a static array would do.
>

About the idea "allocate the memory from the memory block they are describing":

online_pages()
  |-->memory_notify(MEM_GOING_ONLINE, &arg) ----------- memory of this section is not in buddy yet.
       |-->page_cgroup_callback()
            |-->online_page_cgroup()
                 |-->init_section_page_cgroup()
                      |-->alloc_page_cgroup() --------- allocate page_cgroup from buddy system.

When onlining pages, we allocate page_cgroup from the buddy system, and the
pages being onlined are not in the buddy system yet. I think we can reserve
some memory in the section for page_cgroup, and return all the rest to the
buddy system.

But when the system is booting,

start_kernel()
  |-->setup_arch()
  |-->mm_init()
  |    |-->mem_init()
  |         |-->numa_free_all_bootmem() -------------- all the pages are in buddy system.
  |-->page_cgroup_init()
       |-->init_section_page_cgroup()
            |-->alloc_page_cgroup() ------------------ I don't know how to reserve memory in each section.

So does anyone have an idea about how to deal with this when the system is booting?


And one more question: a memory section is 128MB in Linux. If we reserve
part of each section for page_cgroup, then anyone who wants to allocate
contiguous memory larger than 128MB will fail, right?
Is that OK?

Thanks. :)
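
To put rough numbers on the reservation idea, the following sketch computes how many pages a section would have to set aside for its own page_cgroup array. The helper names are hypothetical, and the 4KB page size and the per-page entry size passed in are assumptions; the real entry size depends on the kernel version and architecture.

```c
#include <assert.h>

#define TOY_PAGE_SIZE    4096UL		/* assumed page size */
#define TOY_SECTION_SIZE (128UL << 20)	/* 128MB section, as in this thread */

/* Number of pages covered by one section. */
unsigned long toy_pages_in_section(void)
{
	return TOY_SECTION_SIZE / TOY_PAGE_SIZE;
}

/* Pages needed to hold one entry of entry_size bytes per page, rounded up. */
unsigned long toy_reserved_pages(unsigned long entry_size)
{
	unsigned long bytes = toy_pages_in_section() * entry_size;

	assert(entry_size > 0);
	return (bytes + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
}

/* Pages of the section that could still be handed to the buddy system. */
unsigned long toy_free_pages(unsigned long entry_size)
{
	return toy_pages_in_section() - toy_reserved_pages(entry_size);
}
```

Assuming a 16-byte entry, a 128MB section has 32768 pages and would reserve 128 of them (512KB), leaving 32640 pages. Under this model the reserved pages sit at the start of every section, so no free physically contiguous run could cross a section boundary.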
Tang Chen - Feb. 6, 2013, 9:17 a.m.
Hi all,

On 02/06/2013 11:07 AM, Tang Chen wrote:
> Hi Glauber, all,
>
> An old thing I want to discuss with you. :)
>
> On 01/09/2013 11:09 PM, Glauber Costa wrote:
>>>>> memory can't be offlined when CONFIG_MEMCG is selected.
>>>>> For example: there is a memory device on node 1. The address range
>>>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9,
>>>>> memory10,
>>>>> and memory11 under the directory /sys/devices/system/memory/.
>>>>>
>>>>> If CONFIG_MEMCG is selected, we will allocate memory to store page
>>>>> cgroup
>>>>> when we online pages. When we online memory8, the memory stored
>>>>> page cgroup
>>>>> is not provided by this memory device. But when we online memory9,
>>>>> the memory
>>>>> stored page cgroup may be provided by memory8. So we can't offline
>>>>> memory8
>>>>> now. We should offline the memory in the reversed order.
>>>>>
>>>>> When the memory device is hotremoved, we will auto offline memory
>>>>> provided
>>>>> by this memory device. But we don't know which memory is onlined
>>>>> first, so
>>>>> offlining memory may fail. In such case, iterate twice to offline
>>>>> the memory.
>>>>> 1st iterate: offline every non primary memory block.
>>>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>>>
>>>>> This idea is suggested by KOSAKI Motohiro.
>>>>>
>>>>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
>>>>
>>>> Maybe there is something here that I am missing - I admit that I came
>>>> late to this one, but this really sounds like a very ugly hack, that
>>>> really has no place in here.
>>>>
>>>> Retrying, of course, may make sense, if we have reasonable belief that
>>>> we may now succeed. If this is the case, you need to document - in the
>>>> code - while is that.
>>>>
>>>> The memcg argument, however, doesn't really cut it. Why can't we make
>>>> all page_cgroup allocations local to the node they are describing? If
>>>> memcg is the culprit here, we should fix it, and not retry. If there is
>>>> still any benefit in retrying, then we retry being very specific
>>>> about why.
>>>
>>> We try to make all page_cgroup allocations local to the node they are
>>> describing
>>> now. If the memory is the first memory onlined in this node, we will
>>> allocate
>>> it from the other node.
>>>
>>> For example, node1 has 4 memory blocks: 8-11, and we online it from 8
>>> to 11
>>> 1. memory block 8, page_cgroup allocations are in the other nodes
>>> 2. memory block 9, page_cgroup allocations are in memory block 8
>>>
>>> So we should offline memory block 9 first. But we don't know in which
>>> order
>>> the user online the memory block.
>>>
>>> I think we can modify memcg like this:
>>> allocate the memory from the memory block they are describing
>>>
>>> I am not sure it is OK to do so.
>>
>> I don't see a reason why not.
>>
>> You would have to tweak a bit the lookup function for page_cgroup, but
>> assuming you will always have the pfns and limits, it should be easy
>> to do.
>>
>> I think the only tricky part is that today we have a single
>> node_page_cgroup, and we would of course have to have one per memory
>> block. My assumption is that the number of memory blocks is limited and
>> likely not very big. So even a static array would do.
>>
>
> About the idea "allocate the memory from the memory block they are
> describing",
>
> online_pages()
> |-->memory_notify(MEM_GOING_ONLINE, &arg) ----------- memory of this
> section is not in buddy yet.
> |-->page_cgroup_callback()
> |-->online_page_cgroup()
> |-->init_section_page_cgroup()
> |-->alloc_page_cgroup() --------- allocate page_cgroup from buddy system.
>
> When onlining pages, we allocate page_cgroup from buddy. And the being
> onlined pages are not in
> buddy yet. I think we can reserve some memory in the section for
> page_cgroup, and return all the
> rest to the buddy.
>
> But when the system is booting,
>
> start_kernel()
> |-->setup_arch()
> |-->mm_init()
> | |-->mem_init()
> | |-->numa_free_all_bootmem() -------------- all the pages are in buddy
> system.
> |-->page_cgroup_init()
> |-->init_section_page_cgroup()
> |-->alloc_page_cgroup() ------------------ I don't know how to reserve
> memory in each section.
>
> So any idea about how to deal with it when the system is booting please?
>

How about doing it this way?

1) Add a new flag PAGE_CGROUP_INFO, like SECTION_INFO and MIX_SECTION_INFO.
2) In sparse_init(), reserve some pages at the beginning of each section as
bootmem.
3) In register_page_bootmem_info_section(), mark these pages with
      page->lru.next = PAGE_CGROUP_INFO;

Then these pages will not go to the buddy system.

But I do worry about fragmentation, because part of each section will
be in use from the very beginning.

Thanks. :)

>
> And one more question, a memory section is 128MB in Linux. If we reserve
> part of the them for page_cgroup,
> then anyone who wants to allocate a contiguous memory larger than 128MB,
> it will fail, right ?
> Is it OK ?
>
> Thanks. :)
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
Tang Chen - Feb. 6, 2013, 10:10 a.m.
On 02/06/2013 05:17 PM, Tang Chen wrote:
> Hi all,
>
> On 02/06/2013 11:07 AM, Tang Chen wrote:
>> Hi Glauber, all,
>>
>> An old thing I want to discuss with you. :)
>>
>> On 01/09/2013 11:09 PM, Glauber Costa wrote:
>>>>>> memory can't be offlined when CONFIG_MEMCG is selected.
>>>>>> For example: there is a memory device on node 1. The address range
>>>>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9,
>>>>>> memory10,
>>>>>> and memory11 under the directory /sys/devices/system/memory/.
>>>>>>
>>>>>> If CONFIG_MEMCG is selected, we will allocate memory to store page
>>>>>> cgroup
>>>>>> when we online pages. When we online memory8, the memory stored
>>>>>> page cgroup
>>>>>> is not provided by this memory device. But when we online memory9,
>>>>>> the memory
>>>>>> stored page cgroup may be provided by memory8. So we can't offline
>>>>>> memory8
>>>>>> now. We should offline the memory in the reversed order.
>>>>>>
>>>>>> When the memory device is hotremoved, we will auto offline memory
>>>>>> provided
>>>>>> by this memory device. But we don't know which memory is onlined
>>>>>> first, so
>>>>>> offlining memory may fail. In such case, iterate twice to offline
>>>>>> the memory.
>>>>>> 1st iterate: offline every non primary memory block.
>>>>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>>>>
>>>>>> This idea is suggested by KOSAKI Motohiro.
>>>>>>
>>>>>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
>>>>>
>>>>> Maybe there is something here that I am missing - I admit that I came
>>>>> late to this one, but this really sounds like a very ugly hack, that
>>>>> really has no place in here.
>>>>>
>>>>> Retrying, of course, may make sense, if we have reasonable belief that
>>>>> we may now succeed. If this is the case, you need to document - in the
>>>>> code - while is that.
>>>>>
>>>>> The memcg argument, however, doesn't really cut it. Why can't we make
>>>>> all page_cgroup allocations local to the node they are describing? If
>>>>> memcg is the culprit here, we should fix it, and not retry. If
>>>>> there is
>>>>> still any benefit in retrying, then we retry being very specific
>>>>> about why.
>>>>
>>>> We try to make all page_cgroup allocations local to the node they are
>>>> describing
>>>> now. If the memory is the first memory onlined in this node, we will
>>>> allocate
>>>> it from the other node.
>>>>
>>>> For example, node1 has 4 memory blocks: 8-11, and we online it from 8
>>>> to 11
>>>> 1. memory block 8, page_cgroup allocations are in the other nodes
>>>> 2. memory block 9, page_cgroup allocations are in memory block 8
>>>>
>>>> So we should offline memory block 9 first. But we don't know in which
>>>> order
>>>> the user online the memory block.
>>>>
>>>> I think we can modify memcg like this:
>>>> allocate the memory from the memory block they are describing
>>>>
>>>> I am not sure it is OK to do so.
>>>
>>> I don't see a reason why not.
>>>
>>> You would have to tweak a bit the lookup function for page_cgroup, but
>>> assuming you will always have the pfns and limits, it should be easy
>>> to do.
>>>
>>> I think the only tricky part is that today we have a single
>>> node_page_cgroup, and we would of course have to have one per memory
>>> block. My assumption is that the number of memory blocks is limited and
>>> likely not very big. So even a static array would do.
>>>
>>
>> About the idea "allocate the memory from the memory block they are
>> describing",
>>
>> online_pages()
>> |-->memory_notify(MEM_GOING_ONLINE, &arg) ----------- memory of this
>> section is not in buddy yet.
>> |-->page_cgroup_callback()
>> |-->online_page_cgroup()
>> |-->init_section_page_cgroup()
>> |-->alloc_page_cgroup() --------- allocate page_cgroup from buddy system.
>>
>> When onlining pages, we allocate page_cgroup from buddy. And the being
>> onlined pages are not in
>> buddy yet. I think we can reserve some memory in the section for
>> page_cgroup, and return all the
>> rest to the buddy.
>>
>> But when the system is booting,
>>
>> start_kernel()
>> |-->setup_arch()
>> |-->mm_init()
>> | |-->mem_init()
>> | |-->numa_free_all_bootmem() -------------- all the pages are in buddy
>> system.
>> |-->page_cgroup_init()
>> |-->init_section_page_cgroup()
>> |-->alloc_page_cgroup() ------------------ I don't know how to reserve
>> memory in each section.
>>
>> So any idea about how to deal with it when the system is booting please?
>>
>
> How about this way.
>
> 1) Add a new flag PAGE_CGROUP_INFO, like SECTION_INFO and MIX_SECTION_INFO.
> 2) In sparse_init(), reserve some beginning pages of each section as
> bootmem.

Hi all,

After digging into the bootmem code, I ran into another problem.

memblock allocates memory from high addresses to low addresses, using
memblock.current_limit to remember where the upper limit is. What I am doing
would produce a lot of fragments, and the memory would become non-contiguous.
So we would need to modify memblock as well.

I don't think it's a good idea. What do you think?

Thanks. :)

> 3) In register_page_bootmem_info_section(), set these pages as
> page->lru.next = PAGE_CGROUP_INFO;
>
> Then these pages will not go to buddy system.
>
> But I do worry about the fragment problem because part of each section will
> be used in the very beginning.
>
> Thanks. :)
>
>>
>> And one more question, a memory section is 128MB in Linux. If we reserve
>> part of the them for page_cgroup,
>> then anyone who wants to allocate a contiguous memory larger than 128MB,
>> it will fail, right ?
>> Is it OK ?
>>
>> Thanks. :)
>>
>>
>>
>>
Glauber Costa - Feb. 6, 2013, 2:24 p.m.
On 02/06/2013 02:10 PM, Tang Chen wrote:
> On 02/06/2013 05:17 PM, Tang Chen wrote:
>> Hi all,
>>
>> On 02/06/2013 11:07 AM, Tang Chen wrote:
>>> Hi Glauber, all,
>>>
>>> An old thing I want to discuss with you. :)
>>>
>>> On 01/09/2013 11:09 PM, Glauber Costa wrote:
>>>>>>> memory can't be offlined when CONFIG_MEMCG is selected.
>>>>>>> For example: there is a memory device on node 1. The address range
>>>>>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9,
>>>>>>> memory10,
>>>>>>> and memory11 under the directory /sys/devices/system/memory/.
>>>>>>>
>>>>>>> If CONFIG_MEMCG is selected, we will allocate memory to store page
>>>>>>> cgroup
>>>>>>> when we online pages. When we online memory8, the memory stored
>>>>>>> page cgroup
>>>>>>> is not provided by this memory device. But when we online memory9,
>>>>>>> the memory
>>>>>>> stored page cgroup may be provided by memory8. So we can't offline
>>>>>>> memory8
>>>>>>> now. We should offline the memory in the reversed order.
>>>>>>>
>>>>>>> When the memory device is hotremoved, we will auto offline memory
>>>>>>> provided
>>>>>>> by this memory device. But we don't know which memory is onlined
>>>>>>> first, so
>>>>>>> offlining memory may fail. In such case, iterate twice to offline
>>>>>>> the memory.
>>>>>>> 1st iterate: offline every non primary memory block.
>>>>>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>>>>>
>>>>>>> This idea is suggested by KOSAKI Motohiro.
>>>>>>>
>>>>>>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
>>>>>>
>>>>>> Maybe there is something here that I am missing - I admit that I came
>>>>>> late to this one, but this really sounds like a very ugly hack, that
>>>>>> really has no place in here.
>>>>>>
>>>>>> Retrying, of course, may make sense, if we have reasonable belief
>>>>>> that
>>>>>> we may now succeed. If this is the case, you need to document - in
>>>>>> the
>>>>>> code - while is that.
>>>>>>
>>>>>> The memcg argument, however, doesn't really cut it. Why can't we make
>>>>>> all page_cgroup allocations local to the node they are describing? If
>>>>>> memcg is the culprit here, we should fix it, and not retry. If
>>>>>> there is
>>>>>> still any benefit in retrying, then we retry being very specific
>>>>>> about why.
>>>>>
>>>>> We try to make all page_cgroup allocations local to the node they are
>>>>> describing
>>>>> now. If the memory is the first memory onlined in this node, we will
>>>>> allocate
>>>>> it from the other node.
>>>>>
>>>>> For example, node1 has 4 memory blocks: 8-11, and we online it from 8
>>>>> to 11
>>>>> 1. memory block 8, page_cgroup allocations are in the other nodes
>>>>> 2. memory block 9, page_cgroup allocations are in memory block 8
>>>>>
>>>>> So we should offline memory block 9 first. But we don't know in which
>>>>> order
>>>>> the user online the memory block.
>>>>>
>>>>> I think we can modify memcg like this:
>>>>> allocate the memory from the memory block they are describing
>>>>>
>>>>> I am not sure it is OK to do so.
>>>>
>>>> I don't see a reason why not.
>>>>
>>>> You would have to tweak a bit the lookup function for page_cgroup, but
>>>> assuming you will always have the pfns and limits, it should be easy
>>>> to do.
>>>>
>>>> I think the only tricky part is that today we have a single
>>>> node_page_cgroup, and we would of course have to have one per memory
>>>> block. My assumption is that the number of memory blocks is limited and
>>>> likely not very big. So even a static array would do.
>>>>
>>>
>>> About the idea "allocate the memory from the memory block they are
>>> describing",
>>>
>>> online_pages()
>>> |-->memory_notify(MEM_GOING_ONLINE, &arg) ----------- memory of this
>>> section is not in buddy yet.
>>> |-->page_cgroup_callback()
>>> |-->online_page_cgroup()
>>> |-->init_section_page_cgroup()
>>> |-->alloc_page_cgroup() --------- allocate page_cgroup from buddy
>>> system.
>>>
>>> When onlining pages, we allocate page_cgroup from buddy. And the being
>>> onlined pages are not in
>>> buddy yet. I think we can reserve some memory in the section for
>>> page_cgroup, and return all the
>>> rest to the buddy.
>>>
>>> But when the system is booting,
>>>
>>> start_kernel()
>>> |-->setup_arch()
>>> |-->mm_init()
>>> | |-->mem_init()
>>> | |-->numa_free_all_bootmem() -------------- all the pages are in buddy
>>> system.
>>> |-->page_cgroup_init()
>>> |-->init_section_page_cgroup()
>>> |-->alloc_page_cgroup() ------------------ I don't know how to reserve
>>> memory in each section.
>>>
>>> So any idea about how to deal with it when the system is booting please?
>>>
>>
>> How about this way.
>>
>> 1) Add a new flag PAGE_CGROUP_INFO, like SECTION_INFO and
>> MIX_SECTION_INFO.
>> 2) In sparse_init(), reserve some beginning pages of each section as
>> bootmem.
> 
> Hi all,
> 
> After digging into bootmem code, I met another problem.
> 
> memblock allocates memory from high address to low address, using
> memblock.current_limit
> to remember where the upper limit is. What I am doing will produce a lot
> of fragments,
> and the memory will be non-contiguous. So we need to modify memblock again.
> 
> I don't think it's a good idea. How do you think ?
> 
> Thanks. :)
> 
>> 3) In register_page_bootmem_info_section(), set these pages as
>> page->lru.next = PAGE_CGROUP_INFO;
>>
>> Then these pages will not go to buddy system.
>>
>> But I do worry about the fragment problem because part of each section
>> will
>> be used in the very beginning.
>>
>> Thanks. :)
>>
>>>
>>> And one more question, a memory section is 128MB in Linux. If we reserve
>>> part of the them for page_cgroup,
>>> then anyone who wants to allocate a contiguous memory larger than 128MB,
>>> it will fail, right ?
>>> Is it OK ?
No, it is not.

Another take on this: can't we free all the page_cgroup structures before
we actually start removing the sections? If we do this, we would be
left with basically no problem at all, since by the time your code starts
running we would no longer have any page_cgroup allocated.

All you have to guarantee is that it happens after the memory block is
already isolated and allocations can no longer reach it.

What do you think ?
Tang Chen - Feb. 7, 2013, 7:56 a.m.
On 02/06/2013 10:24 PM, Glauber Costa wrote:
>>>> And one more question, a memory section is 128MB in Linux. If we reserve
>>>> part of the them for page_cgroup,
>>>> then anyone who wants to allocate a contiguous memory larger than 128MB,
>>>> it will fail, right ?
>>>> Is it OK ?
> No, it is not.
>
> Another take on this: Can't we free all the page_cgroup structure before
> we actually start removing the sections ? If we do this, we would be
> basically left with no problem at all, since when your code starts
> running we would no longer have any page_cgroup allocated.
>
> All you have to guarantee is that it happens after the memory block is
> already isolated and allocations no longer can reach it.
>
> What do you think ?

Hi Glauber,

I don't think so. We can offline some of the sections and leave the
rest online.

For example, suppose we store the page_cgroups of memory9~11 in memory8.
When we offline memory8, we free memory8's own page_cgroup, which is
stored in another section, but we cannot free the page_cgroups stored
in memory8 while memory9~11 are still online.

So we still need to offline memory9~11 first, and then offline memory8,
right? I think it makes no difference.

Thanks. :)
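
The ordering constraint discussed above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions, not
kernel code: struct block, try_offline() and remove_all() are hypothetical
names, and the "host" field simply records which block is assumed to hold
another block's page_cgroup array. The control flow follows the two-pass
idea from the patch description (tolerate failures on the first pass, give
up on the first failure in the second pass). Unlike the diff, the sketch
clears "retry" at the start of each pass so that a successful second pass
ends the loop.

```c
#include <stdbool.h>

#define NR_BLOCKS 4

/* Hypothetical model of one memory block (e.g. memory8..memory11). */
struct block {
	int id;      /* block number, for readability only */
	int host;    /* index of the block assumed to hold this block's
	              * page_cgroup array, or -1 if it lives elsewhere */
	bool online;
};

/*
 * A block can go offline only if no other online block keeps its
 * page_cgroup array inside it. Returns 0 on success, -1 if busy.
 */
static int try_offline(struct block *b, int idx)
{
	if (!b[idx].online)
		return 0;	/* already offline: nothing to do */

	for (int i = 0; i < NR_BLOCKS; i++)
		if (i != idx && b[i].online && b[i].host == idx)
			return -1;	/* stand-in for a busy error */

	b[idx].online = false;
	return 0;
}

/* Two-pass removal: pass 1 records failures, pass 2 returns on error. */
static int remove_all(struct block *b)
{
	int return_on_error = 0;
	int retry;

repeat:
	retry = 0;	/* assumption of this sketch: reset on every pass */
	for (int i = 0; i < NR_BLOCKS; i++) {
		int ret = try_offline(b, i);

		if (ret) {
			if (return_on_error)
				return ret;
			retry = 1;
		}
	}

	if (retry) {
		return_on_error = 1;
		goto repeat;
	}

	return 0;
}
```

With memory9~11 all hosted in memory8, the first pass offlines 9~11 and the
second pass offlines 8. In this model two passes are only enough when the
dependencies are one level deep; a chain such as 8 <- 9 <- 10 <- 11, walked
in ascending order, still fails on the second pass.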

Patch

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d04ed87..62e04c9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1388,10 +1388,13 @@  int remove_memory(u64 start, u64 size)
 	unsigned long start_pfn, end_pfn;
 	unsigned long pfn, section_nr;
 	int ret;
+	int return_on_error = 0;
+	int retry = 0;
 
 	start_pfn = PFN_DOWN(start);
 	end_pfn = start_pfn + PFN_DOWN(size);
 
+repeat:
 	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
 		section_nr = pfn_to_section_nr(pfn);
 		if (!present_section_nr(section_nr))
@@ -1410,14 +1413,23 @@  int remove_memory(u64 start, u64 size)
 
 		ret = offline_memory_block(mem);
 		if (ret) {
-			kobject_put(&mem->dev.kobj);
-			return ret;
+			if (return_on_error) {
+				kobject_put(&mem->dev.kobj);
+				return ret;
+			} else {
+				retry = 1;
+			}
 		}
 	}
 
 	if (mem)
 		kobject_put(&mem->dev.kobj);
 
+	if (retry) {
+		return_on_error = 1;
+		goto repeat;
+	}
+
 	return 0;
 }
 #else