[RFC] memory-hotplug: Use dev_online for memhp_auto_offline

Message ID 20170221172234.8047.33382.stgit@ltcalpine2-lp14.aus.stglabs.ibm.com (mailing list archive)
State RFC
Commit Message

Nathan Fontenot Feb. 21, 2017, 5:22 p.m. UTC
Commit 31bc3858e "add automatic onlining policy for the newly added memory"
provides the capability to have added memory automatically onlined
during add, but this appears to be slightly broken.

The current implementation uses walk_memory_range() to call
online_memory_block, which uses memory_block_change_state() to online
the memory. Instead I think we should be calling device_online()
for the memory block in online_memory_block. This would online
the memory (the memory bus online routine memory_subsys_online()
called from device_online calls memory_block_change_state()) and
properly update the device struct offline flag.

As a result of the current implementation, attempting to remove
a memory block after adding it using auto online fails. This is
because doing a remove, for instance
'echo offline > /sys/devices/system/memory/memoryXXX/state', uses
device_offline() which checks the dev->offline flag.
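
The flag mismatch described above can be reproduced in a small userspace model. This is only a sketch: the names mirror the kernel functions mentioned in the description, but the bodies are simplified assumptions, not the kernel implementation. In particular, the model assumes that device_offline() does nothing and returns a positive value when dev->offline is already set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <errno.h>

/* Toy userspace model, NOT kernel code. Names follow the kernel's,
 * bodies are simplified assumptions for illustration only. */
enum mem_state { MEM_OFFLINE, MEM_ONLINE };

struct device { bool offline; };
struct memory_block { enum mem_state state; struct device dev; };

/* Recover the memory_block that embeds a given device. */
#define to_memory_block(d) \
	((struct memory_block *)((char *)(d) - offsetof(struct memory_block, dev)))

/* Changes the block state only; it never touches dev.offline. */
static int memory_block_change_state(struct memory_block *mem,
				     enum mem_state to, enum mem_state from_req)
{
	if (mem->state != from_req)
		return -EINVAL;
	mem->state = to;
	return 0;
}

/* Device-level online: change the state, then record it in the flag. */
static int device_online(struct device *dev)
{
	int ret;

	if (!dev->offline)
		return 1;	/* flag says "already online": nothing is done */
	ret = memory_block_change_state(to_memory_block(dev),
					MEM_ONLINE, MEM_OFFLINE);
	if (ret == 0)
		dev->offline = false;
	return ret;
}

/* Device-level offline: consults the flag before doing any work. */
static int device_offline(struct device *dev)
{
	int ret;

	if (dev->offline)
		return 1;	/* flag says "already offline": nothing is done */
	ret = memory_block_change_state(to_memory_block(dev),
					MEM_OFFLINE, MEM_ONLINE);
	if (ret == 0)
		dev->offline = true;
	return ret;
}
```

In this model, a block that starts as { MEM_OFFLINE, offline = true } and is onlined by calling memory_block_change_state() directly ends up with state MEM_ONLINE while dev.offline is still true, so a later device_offline() returns 1 without touching the block. Onlining the same block through device_online() keeps the state and the flag in sync, and device_offline() then succeeds.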

There is a workaround in that a user could online the memory or have
a udev rule to online the memory by using the sysfs interface. The
sysfs interface to online memory goes through device_online() which
should update the dev->offline flag. I'm not sure that having kernel
memory hotplug rely on userspace actions is the correct way to go.

I have tried reading through the email threads when the original patch
was submitted and could not determine if this is the expected behavior.
The problem with the current behavior was found when trying to update
memory hotplug on powerpc to use auto online.

-Nathan Fontenot
---
 drivers/base/memory.c  |    2 +-
 include/linux/memory.h |    3 ---
 mm/memory_hotplug.c    |    2 +-
 3 files changed, 2 insertions(+), 5 deletions(-)

Comments

Vitaly Kuznetsov Feb. 22, 2017, 9:32 a.m. UTC | #1
Hi,

s,memhp_auto_offline,memhp_auto_online, in the subject please :-)

Nathan Fontenot <nfont@linux.vnet.ibm.com> writes:

> Commit 31bc3858e "add automatic onlining policy for the newly added memory"
> provides the capability to have added memory automatically onlined
> during add, but this appears to be slightly broken.
>
> The current implementation uses walk_memory_range() to call
> online_memory_block, which uses memory_block_change_state() to online
> the memory. Instead I think we should be calling device_online()
> for the memory block in online_memory_block. This would online
> the memory (the memory bus online routine memory_subsys_online()
> called from device_online calls memory_block_change_state()) and
> properly update the device struct offline flag.
>
> As a result of the current implementation, attempting to remove
> a memory block after adding it using auto online fails. This is
> because doing a remove, for instance
> 'echo offline > /sys/devices/system/memory/memoryXXX/state', uses
> device_offline() which checks the dev->offline flag.

I see the issue.

>
> There is a workaround in that a user could online the memory or have
> a udev rule to online the memory by using the sysfs interface. The
> sysfs interface to online memory goes through device_online() which
> should updated the dev->offline flag. I'm not sure that having kernel
> memory hotplug rely on userspace actions is the correct way to go.

Using a udev rule for memory onlining is possible when you disable
memhp_auto_online, but in some cases it doesn't work well. E.g. when we
use memory hotplug to address memory pressure, the loop through userspace
is really slow and memory consuming, and we may hit OOM before we manage to
online the newly added memory. In addition to that, systemd/udev folks
continuously refused to add this udev rule to udev, calling it stupid, as
it actually is an unconditional and redundant ping-pong between kernel
and udev.

>
> I have tried reading through the email threads when the origianl patch
> was submitted and could not determine if this is the expected behavior.
> The problem with the current behavior was found when trying to update
> memory hotplug on powerpc to use auto online.
>
> -Nathan Fontenot
> ---
>  drivers/base/memory.c  |    2 +-
>  include/linux/memory.h |    3 ---
>  mm/memory_hotplug.c    |    2 +-
>  3 files changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 8ab8ea1..ede46f3 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -249,7 +249,7 @@ static bool pages_correctly_reserved(unsigned long start_pfn)
>  	return ret;
>  }
>
> -int memory_block_change_state(struct memory_block *mem,
> +static int memory_block_change_state(struct memory_block *mem,
>  		unsigned long to_state, unsigned long from_state_req)
>  {
>  	int ret = 0;
> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index 093607f..b723a68 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -109,9 +109,6 @@ static inline int memory_isolate_notify(unsigned long val, void *v)
>  extern int register_memory_isolate_notifier(struct notifier_block *nb);
>  extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
>  extern int register_new_memory(int, struct mem_section *);
> -extern int memory_block_change_state(struct memory_block *mem,
> -				     unsigned long to_state,
> -				     unsigned long from_state_req);
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  extern int unregister_memory_section(struct mem_section *);
>  #endif
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e43142c1..6f7a289 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1329,7 +1329,7 @@ int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
>
>  static int online_memory_block(struct memory_block *mem, void *arg)
>  {
> -	return memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
> +	return device_online(&mem->dev);
>  }
>
>  /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */

Your patch looks good to me, I tested it on x86 (Hyper-V) and it seems
to work.

Thanks!
Michal Hocko Feb. 23, 2017, 12:56 p.m. UTC | #2
On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
[...]
> > There is a workaround in that a user could online the memory or have
> > a udev rule to online the memory by using the sysfs interface. The
> > sysfs interface to online memory goes through device_online() which
> > should updated the dev->offline flag. I'm not sure that having kernel
> > memory hotplug rely on userspace actions is the correct way to go.
> 
> Using udev rule for memory onlining is possible when you disable
> memhp_auto_online but in some cases it doesn't work well, e.g. when we
> use memory hotplug to address memory pressure the loop through userspace
> is really slow and memory consuming, we may hit OOM before we manage to
> online newly added memory.

How does the in-kernel implementation prevent that?

> In addition to that, systemd/udev folks
> continuosly refused to add this udev rule to udev calling it stupid as
> it actually is an unconditional and redundant ping-pong between kernel
> and udev.

This is a policy and as such it doesn't belong to the kernel. The whole
auto-enable in the kernel is just plain wrong IMHO and we shouldn't have
merged it.
Vitaly Kuznetsov Feb. 23, 2017, 1:31 p.m. UTC | #3
Michal Hocko <mhocko@kernel.org> writes:

> On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
> [...]
>> > There is a workaround in that a user could online the memory or have
>> > a udev rule to online the memory by using the sysfs interface. The
>> > sysfs interface to online memory goes through device_online() which
>> > should updated the dev->offline flag. I'm not sure that having kernel
>> > memory hotplug rely on userspace actions is the correct way to go.
>> 
>> Using udev rule for memory onlining is possible when you disable
>> memhp_auto_online but in some cases it doesn't work well, e.g. when we
>> use memory hotplug to address memory pressure the loop through userspace
>> is really slow and memory consuming, we may hit OOM before we manage to
>> online newly added memory.
>
> How does the in-kernel implementation prevents from that?
>

Onlining memory on hot-plug is much more reliable, e.g. if we were able
to add it in add_memory_resource() we'll also manage to online it. With
udev rule we may end up adding many blocks and then (as udev is
asynchronous) failing to online any of them. In-kernel operation is
synchronous.

>> In addition to that, systemd/udev folks
>> continuosly refused to add this udev rule to udev calling it stupid as
>> it actually is an unconditional and redundant ping-pong between kernel
>> and udev.
>
> This is a policy and as such it doesn't belong to the kernel. The whole
> auto-enable in the kernel is just plain wrong IMHO and we shouldn't have
> merged it.

I disagree.

First of all it's not a policy, it is a default. We have many other
defaults in kernel. When I add a network card or a storage, for example,
I don't need to go anywhere and 'enable' it before I'm able to use
it from userspace. And for memory (and CPUs) we, for some unknown reason,
opted for something completely different. If someone is plugging new
memory into a box he probably wants to use it, I don't see much value in
waiting for a special confirmation from him. 

Second, this feature is optional. If you want to keep old behavior just
don't enable it.

Third, this solves real world issues. With Hyper-V it is very easy to
show udev failing on stress. No other solution to the issue was ever
suggested.
Michal Hocko Feb. 23, 2017, 3:09 p.m. UTC | #4
On Thu 23-02-17 14:31:24, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
> > [...]
> >> > There is a workaround in that a user could online the memory or have
> >> > a udev rule to online the memory by using the sysfs interface. The
> >> > sysfs interface to online memory goes through device_online() which
> >> > should updated the dev->offline flag. I'm not sure that having kernel
> >> > memory hotplug rely on userspace actions is the correct way to go.
> >> 
> >> Using udev rule for memory onlining is possible when you disable
> >> memhp_auto_online but in some cases it doesn't work well, e.g. when we
> >> use memory hotplug to address memory pressure the loop through userspace
> >> is really slow and memory consuming, we may hit OOM before we manage to
> >> online newly added memory.
> >
> > How does the in-kernel implementation prevents from that?
> >
> 
> Onlining memory on hot-plug is much more reliable, e.g. if we were able
> to add it in add_memory_resource() we'll also manage to online it.

How does that differ from initiating online from the users?

> With
> udev rule we may end up adding many blocks and then (as udev is
> asynchronous) failing to online any of them.

Why would it fail?

> In-kernel operation is synchronous.

which doesn't mean anything as the context is preemptible AFAICS.

> >> In addition to that, systemd/udev folks
> >> continuosly refused to add this udev rule to udev calling it stupid as
> >> it actually is an unconditional and redundant ping-pong between kernel
> >> and udev.
> >
> > This is a policy and as such it doesn't belong to the kernel. The whole
> > auto-enable in the kernel is just plain wrong IMHO and we shouldn't have
> > merged it.
> 
> I disagree.
> 
> First of all it's not a policy, it is a default. We have many other
> defaults in kernel. When I add a network card or a storage, for example,
> I don't need to go anywhere and 'enable' it before I'm able to use
> it from userspace. An for memory (and CPUs) we, for some unknown reason
> opted for something completely different. If someone is plugging new
> memory into a box he probably wants to use it, I don't see much value in
> waiting for a special confirmation from him. 

This was not my decision so I can only guess, but to me it makes sense.
Both memory and CPUs can be physically present and offline, which is a
perfectly reasonable state. So having a two-phase physical hotadd is
just built on top of the physical vs. logical distinction. I completely
understand that some usecases will really like to online the whole node
as soon as it appears present. But an automatic in-kernel implementation
has its downsides - e.g. if this operation fails in the middle you will
not know about that unless you check all the memblocks in sysfs. This is
really a poor interface.

> Second, this feature is optional. If you want to keep old behavior just
> don't enable it.

It just adds unnecessary configuration noise as well.

> Third, this solves real world issues. With Hyper-V it is very easy to
> show udev failing on stress. 

What is the reason for these failures? Do you have any link handy?

> No other solution to the issue was ever suggested.

You mean like using ballooning for memory overcommit, like other more
reasonable virtualization solutions?
Vitaly Kuznetsov Feb. 23, 2017, 3:49 p.m. UTC | #5
Michal Hocko <mhocko@kernel.org> writes:

> On Thu 23-02-17 14:31:24, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
>> > [...]
>> >> > There is a workaround in that a user could online the memory or have
>> >> > a udev rule to online the memory by using the sysfs interface. The
>> >> > sysfs interface to online memory goes through device_online() which
>> >> > should updated the dev->offline flag. I'm not sure that having kernel
>> >> > memory hotplug rely on userspace actions is the correct way to go.
>> >> 
>> >> Using udev rule for memory onlining is possible when you disable
>> >> memhp_auto_online but in some cases it doesn't work well, e.g. when we
>> >> use memory hotplug to address memory pressure the loop through userspace
>> >> is really slow and memory consuming, we may hit OOM before we manage to
>> >> online newly added memory.
>> >
>> > How does the in-kernel implementation prevents from that?
>> >
>> 
>> Onlining memory on hot-plug is much more reliable, e.g. if we were able
>> to add it in add_memory_resource() we'll also manage to online it.
>
> How does that differ from initiating online from the users?
>
>> With
>> udev rule we may end up adding many blocks and then (as udev is
>> asynchronous) failing to online any of them.
>
> Why would it fail?
>
>> In-kernel operation is synchronous.
>
> which doesn't mean anything as the context is preemptible AFAICS.
>

It actually does,

imagine the following example: you run a small guest (256M of memory)
and now there is a request to add 1000 128MB blocks to it. In case you
do it the old way you're very likely to get OOM somewhere in the middle
as you keep adding blocks which require kernel memory and nobody is
onlining them (or, at least, you're racing with the onliner). With the
in-kernel implementation we're going to online the first block when it's
added and only then go to the second.
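
To put rough numbers on this example, here is a back-of-the-envelope sketch of the per-block metadata cost. The 4 KiB page size and 64-byte struct page are assumptions (typical for x86_64, not stated in this thread), and the sketch further assumes that this metadata is allocated from already-online memory at add time. Page tables and other bookkeeping are ignored.

```c
#include <stdint.h>

/* Assumed values, typical for x86_64 but not taken from the thread. */
#define PAGE_SIZE_BYTES   4096ULL
#define STRUCT_PAGE_BYTES 64ULL

/* Bytes of per-page metadata needed to describe block_bytes of memory. */
static uint64_t memmap_bytes(uint64_t block_bytes)
{
	return (block_bytes / PAGE_SIZE_BYTES) * STRUCT_PAGE_BYTES;
}

/*
 * Number of blocks that can be added before the metadata alone uses up
 * budget_bytes, if none of the added blocks is onlined in between.
 */
static uint64_t blocks_until_exhausted(uint64_t budget_bytes,
				       uint64_t block_bytes)
{
	uint64_t per_block = memmap_bytes(block_bytes);

	if (per_block == 0)
		return UINT64_MAX;	/* block smaller than a page: no cost in this model */
	return budget_bytes / per_block;
}
```

Under these assumptions a 128MB block costs 2MB of metadata, so 1000 blocks need about 2000MB, and a guest with 256M could describe at most 128 blocks even if every byte of its memory were free for that purpose.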

>> >> In addition to that, systemd/udev folks
>> >> continuosly refused to add this udev rule to udev calling it stupid as
>> >> it actually is an unconditional and redundant ping-pong between kernel
>> >> and udev.
>> >
>> > This is a policy and as such it doesn't belong to the kernel. The whole
>> > auto-enable in the kernel is just plain wrong IMHO and we shouldn't have
>> > merged it.
>> 
>> I disagree.
>> 
>> First of all it's not a policy, it is a default. We have many other
>> defaults in kernel. When I add a network card or a storage, for example,
>> I don't need to go anywhere and 'enable' it before I'm able to use
>> it from userspace. An for memory (and CPUs) we, for some unknown reason
>> opted for something completely different. If someone is plugging new
>> memory into a box he probably wants to use it, I don't see much value in
>> waiting for a special confirmation from him. 
>
> This was not my decision so I can only guess but to me it makes sense.
> Both memory and cpus can be physically present and offline which is a
> perfectly reasonable state. So having a two phase physicall hotadd is
> just built on top of physical vs. logical distinction. I completely
> understand that some usecases will really like to online the whole node
> as soon as it appears present. But an automatic in-kernel implementation
> has its down sites - e.g. if this operation fails in the middle you will
> not know about that unless you check all the memblocks in sysfs. This is
> really a poor interface.

And how do you know that some blocks failed to online with udev? Who
handles these failures and how? And, the last but not least, why do
these failures happen?

>
>> Second, this feature is optional. If you want to keep old behavior just
>> don't enable it.
>
> It just adds unnecessary configuration noise as well
>

For any particular user everything he doesn't need is 'noise'...

>> Third, this solves real world issues. With Hyper-V it is very easy to
>> show udev failing on stress. 
>
> What is the reason for this failures. Do you have any link handy?
>

The reason is going out of memory, swapping and being slow in
general. Again, think about the example I gave above: there is a request
to add many memory blocks and if we try to handle them all before any of
them gets onlined we will get OOM and may even kill the udev process.

>> No other solution to the issue was ever suggested.
>
> you mean like using ballooning for the memory overcommit like other more
> reasonable virtualization solutions?

Not sure how ballooning is related here. Hyper-V uses memory hotplug to
add memory to domains, I don't think we have any other solutions for
that. From hypervisor's point of view the memory was added when the
particular request succeeded, it is not aware of our 'logical/physical'
separation.
Michal Hocko Feb. 23, 2017, 4:12 p.m. UTC | #6
On Thu 23-02-17 16:49:06, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Thu 23-02-17 14:31:24, Vitaly Kuznetsov wrote:
> >> Michal Hocko <mhocko@kernel.org> writes:
> >> 
> >> > On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
> >> > [...]
> >> >> > There is a workaround in that a user could online the memory or have
> >> >> > a udev rule to online the memory by using the sysfs interface. The
> >> >> > sysfs interface to online memory goes through device_online() which
> >> >> > should updated the dev->offline flag. I'm not sure that having kernel
> >> >> > memory hotplug rely on userspace actions is the correct way to go.
> >> >> 
> >> >> Using udev rule for memory onlining is possible when you disable
> >> >> memhp_auto_online but in some cases it doesn't work well, e.g. when we
> >> >> use memory hotplug to address memory pressure the loop through userspace
> >> >> is really slow and memory consuming, we may hit OOM before we manage to
> >> >> online newly added memory.
> >> >
> >> > How does the in-kernel implementation prevents from that?
> >> >
> >> 
> >> Onlining memory on hot-plug is much more reliable, e.g. if we were able
> >> to add it in add_memory_resource() we'll also manage to online it.
> >
> > How does that differ from initiating online from the users?
> >
> >> With
> >> udev rule we may end up adding many blocks and then (as udev is
> >> asynchronous) failing to online any of them.
> >
> > Why would it fail?
> >
> >> In-kernel operation is synchronous.
> >
> > which doesn't mean anything as the context is preemptible AFAICS.
> >
> 
> It actually does,
> 
> imagine the following example: you run a small guest (256M of memory)
> and now there is a request to add 1000 128mb blocks to it. 

Is a grow from 256M -> 128GB really something that happens in real life?
Don't get me wrong but to me this sounds quite exaggerated. Hot-adding
memory is an operation which itself has to allocate memory, so it has to
scale with the currently available memory IMHO.

> In case you
> do it the old way you're very likely to get OOM somewhere in the middle
> as you keep adding blocks which requere kernel memory and nobody is
> onlining it (or, at least you're racing with the onliner). With
> in-kernel implementation we're going to online the first block when it's
> added and only then go to the second.

Yes, adding memory will cost you some memory and that is why I am
really skeptical when memory hotplug is used under strong memory
pressure. This can lead to OOMs even when you online one block at a
time.

[...]
> > This was not my decision so I can only guess but to me it makes sense.
> > Both memory and cpus can be physically present and offline which is a
> > perfectly reasonable state. So having a two phase physicall hotadd is
> > just built on top of physical vs. logical distinction. I completely
> > understand that some usecases will really like to online the whole node
> > as soon as it appears present. But an automatic in-kernel implementation
> > has its down sites - e.g. if this operation fails in the middle you will
> > not know about that unless you check all the memblocks in sysfs. This is
> > really a poor interface.
> 
> And how do you know that some blocks failed to online with udev?

Because udev will run code which can cope with that - retry if the
error is recoverable or simply report with all the details. Compare that
to crawling the system log to see that something has broken...

> Who
> handles these failures and how? And, the last but not least, why do
> these failures happen?

I haven't heard reports about the failures and from looking into the
code those are possible but very unlikely.
Vitaly Kuznetsov Feb. 23, 2017, 4:36 p.m. UTC | #7
Michal Hocko <mhocko@kernel.org> writes:

> On Thu 23-02-17 16:49:06, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Thu 23-02-17 14:31:24, Vitaly Kuznetsov wrote:
>> >> Michal Hocko <mhocko@kernel.org> writes:
>> >> 
>> >> > On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
>> >> > [...]
>> >> >> > There is a workaround in that a user could online the memory or have
>> >> >> > a udev rule to online the memory by using the sysfs interface. The
>> >> >> > sysfs interface to online memory goes through device_online() which
>> >> >> > should updated the dev->offline flag. I'm not sure that having kernel
>> >> >> > memory hotplug rely on userspace actions is the correct way to go.
>> >> >> 
>> >> >> Using udev rule for memory onlining is possible when you disable
>> >> >> memhp_auto_online but in some cases it doesn't work well, e.g. when we
>> >> >> use memory hotplug to address memory pressure the loop through userspace
>> >> >> is really slow and memory consuming, we may hit OOM before we manage to
>> >> >> online newly added memory.
>> >> >
>> >> > How does the in-kernel implementation prevents from that?
>> >> >
>> >> 
>> >> Onlining memory on hot-plug is much more reliable, e.g. if we were able
>> >> to add it in add_memory_resource() we'll also manage to online it.
>> >
>> > How does that differ from initiating online from the users?
>> >
>> >> With
>> >> udev rule we may end up adding many blocks and then (as udev is
>> >> asynchronous) failing to online any of them.
>> >
>> > Why would it fail?
>> >
>> >> In-kernel operation is synchronous.
>> >
>> > which doesn't mean anything as the context is preemptible AFAICS.
>> >
>> 
>> It actually does,
>> 
>> imagine the following example: you run a small guest (256M of memory)
>> and now there is a request to add 1000 128mb blocks to it. 
>
> Is a grow from 256M -> 128GB really something that happens in real life?
> Don't get me wrong but to me this sounds quite exaggerated. Hotmem add
> which is an operation which has to allocate memory has to scale with the
> currently available memory IMHO.

With virtual machines this is very real and not exaggerated at
all. E.g. Hyper-V host can be tuned to automatically add new memory when
guest is running out of it. Even 100 blocks can represent an issue.

>
>> In case you
>> do it the old way you're very likely to get OOM somewhere in the middle
>> as you keep adding blocks which requere kernel memory and nobody is
>> onlining it (or, at least you're racing with the onliner). With
>> in-kernel implementation we're going to online the first block when it's
>> added and only then go to the second.
>
> Yes, adding a memory will cost you some memory and that is why I am
> really skeptical when memory hotplug is used under a strong memory
> pressure. This can lead to OOMs even when you online one block at the
> time.

If you can't allocate anything then yes, of course it will fail. But if
you try to add many blocks without onlining at the same time the
probability of failure is orders of magnitude higher.

(a bit unrelated) I was actually thinking about the possible failure and
had the following idea in my head: we always keep everything allocated
for one additional memory block so when hotplug happens we use this
reserved space to add the block, online it and immediately reserve space
for the next one. I didn't do any coding yet.

>
> [...]
>> > This was not my decision so I can only guess but to me it makes sense.
>> > Both memory and cpus can be physically present and offline which is a
>> > perfectly reasonable state. So having a two phase physicall hotadd is
>> > just built on top of physical vs. logical distinction. I completely
>> > understand that some usecases will really like to online the whole node
>> > as soon as it appears present. But an automatic in-kernel implementation
>> > has its down sites - e.g. if this operation fails in the middle you will
>> > not know about that unless you check all the memblocks in sysfs. This is
>> > really a poor interface.
>> 
>> And how do you know that some blocks failed to online with udev?
>
> Because the udev will run a code which can cope with that - retry if the
> error is recoverable or simply report with all the details. Compare that
> to crawling the system log to see that something has broken...

I don't know much about udev, but the most common rule to online memory
I've met is:

SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline",  ATTR{state}="online"

It doesn't do anything smart.

In current RHEL7 it is even worse:

SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/uname -p", RESULT!="s390*", ATTR{state}=="offline", ATTR{state}="online"

so to online a new memory block we actually need to run a process.
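
For comparison, a userspace helper that reports and retries failures, which the rules above do not, could look roughly like the sketch below. It is not taken from udev or from any existing tool: the function names are hypothetical, and the set of errors treated as retryable (EBUSY, EAGAIN, EINTR) is an assumption. The path argument would be a block's state file, e.g. /sys/devices/system/memory/memoryXXX/state.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write "online" to a memory block's sysfs state file.
 * Returns 0 on success or a negative errno value.
 * Hypothetical helper, not part of udev or any existing tool.
 */
static int online_block(const char *state_path)
{
	static const char cmd[] = "online";
	const ssize_t len = (ssize_t)(sizeof(cmd) - 1);
	ssize_t n;
	int err = 0;
	int fd;

	fd = open(state_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, cmd, (size_t)len);
	if (n < 0)
		err = -errno;
	else if (n != len)
		err = -EIO;	/* short write: treat as an I/O error */

	if (close(fd) < 0 && err == 0)
		err = -errno;
	return err;
}

/*
 * Retry up to 'attempts' times on errors assumed to be transient.
 * A real helper would probably back off between attempts and log
 * the final error; both are omitted here.
 */
static int online_block_retry(const char *state_path, int attempts)
{
	int err = -EINVAL;

	while (attempts-- > 0) {
		err = online_block(state_path);
		if (err != -EBUSY && err != -EAGAIN && err != -EINTR)
			break;
	}
	return err;
}
```

Note that such a helper is still a separate process that has to be started for each block, so it does not address the cost of the round trip through userspace discussed in this thread.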

>
>> Who
>> handles these failures and how? And, the last but not least, why do
>> these failures happen?
>
> I haven't heard reports about the failures and from looking into the
> code those are possible but very unlikely.

My point is - failures are possible, yes, but in the most common
use-case if we hot-plugged some memory we most probably want to use it
and the feature does that. I'd be glad to hear about possible
improvements to it of course.
Michal Hocko Feb. 23, 2017, 5:41 p.m. UTC | #8
On Thu 23-02-17 17:36:38, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko@kernel.org> writes:
[...]
> > Is a grow from 256M -> 128GB really something that happens in real life?
> > Don't get me wrong but to me this sounds quite exaggerated. Hotmem add
> > which is an operation which has to allocate memory has to scale with the
> > currently available memory IMHO.
> 
> With virtual machines this is very real and not exaggerated at
> all. E.g. Hyper-V host can be tuned to automatically add new memory when
> guest is running out of it. Even 100 blocks can represent an issue.

Do you have any reference to a bug report? I am really curious because
something really smells wrong and it is not clear that the chosen
solution is really the best one.
[...]
> > Because the udev will run a code which can cope with that - retry if the
> > error is recoverable or simply report with all the details. Compare that
> > to crawling the system log to see that something has broken...
> 
> I don't know much about udev, but the most common rule to online memory
> I've met is:
> 
> SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline",  ATTR{state}="online"
> 
> doesn't do anything smart.

So what? Is there anything that prevents doing something smarter?
Vitaly Kuznetsov Feb. 23, 2017, 6:14 p.m. UTC | #9
Michal Hocko <mhocko@kernel.org> writes:

> On Thu 23-02-17 17:36:38, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
> [...]
>> > Is a grow from 256M -> 128GB really something that happens in real life?
>> > Don't get me wrong but to me this sounds quite exaggerated. Hotmem add
>> > which is an operation which has to allocate memory has to scale with the
>> > currently available memory IMHO.
>> 
>> With virtual machines this is very real and not exaggerated at
>> all. E.g. Hyper-V host can be tuned to automatically add new memory when
>> guest is running out of it. Even 100 blocks can represent an issue.
>
> Do you have any reference to a bug report. I am really curious because
> something really smells wrong and it is not clear that the chosen
> solution is really the best one.

Unfortunately I'm not aware of any publicly posted bug reports (CC:
K. Y. - he may have a reference) but I think I still remember everything
correctly. Not sure how deep you want me to go into details though...

Virtual guests under stress were getting into OOM easily and the OOM
killer was even killing the udev process trying to online the
memory. There was a workaround for the issue added to the hyper-v driver
doing memory add:

hv_mem_hot_add(...) {
...
 add_memory(....);
 wait_for_completion_timeout(..., 5*HZ);
 ...
}

the completion was done by observing the MEM_ONLINE event. This, of
course, was slowing things down significantly and waiting for a
userspace action in kernel is not a nice thing to have (not speaking
about all other memory adding methods which had the same issue). Just
removing this wait was leading us to the same OOM as the hypervisor was
adding more and more memory and eventually even add_memory() was
failing, udev and other processes were killed,...

With the feature in place we have new memory available right after we do
add_memory(), everything is serialized.
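
The effect of serialization can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a model of the Hyper-V driver: each add is assumed to consume a fixed amount of free memory, each onlined block is assumed to become fully available, and 'lag' is a hypothetical parameter for how many further adds happen before a given block gets onlined (0 means onlining happens synchronously with the add).

```c
#include <stdint.h>

/*
 * Simulate hot-adding 'nblocks' blocks.
 * free_bytes:  memory available before the first add
 * block_bytes: memory gained once a block is onlined
 * cost_bytes:  memory consumed by each add (assumed constant)
 * lag:         the block added at step i is onlined at step i + lag
 * Returns how many blocks were added before an add would have failed.
 */
static unsigned int simulate_hot_add(uint64_t free_bytes, unsigned int nblocks,
				     uint64_t block_bytes, uint64_t cost_bytes,
				     unsigned int lag)
{
	unsigned int added = 0;
	unsigned int i;

	for (i = 0; i < nblocks; i++) {
		if (free_bytes < cost_bytes)
			break;			/* this add would fail */
		free_bytes -= cost_bytes;
		added++;
		if (i >= lag)
			free_bytes += block_bytes;	/* block i - lag comes online */
	}
	return added;
}
```

With 256M free, 128MB blocks and an assumed cost of 2MB per add, a lag of 0 lets all 1000 adds through, while a lag of 200 or more stops after 128 adds because free memory runs out before the first block is onlined.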

> [...]
>> > Because the udev will run a code which can cope with that - retry if the
>> > error is recoverable or simply report with all the details. Compare that
>> > to crawling the system log to see that something has broken...
>> 
>> I don't know much about udev, but the most common rule to online memory
>> I've met is:
>> 
>> SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline",  ATTR{state}="online"
>> 
>> doesn't do anything smart.
>
> So what? Is there anything that prevents doing something smarter?

Yes, the asynchronous nature of all this stuff. There is no way you can
stop other blocks from being added to the system while you're processing
something in userspace.
Michal Hocko Feb. 24, 2017, 1:37 p.m. UTC | #10
On Thu 23-02-17 19:14:27, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Thu 23-02-17 17:36:38, Vitaly Kuznetsov wrote:
> >> Michal Hocko <mhocko@kernel.org> writes:
> > [...]
> >> > Is a grow from 256M -> 128GB really something that happens in real life?
> >> > Don't get me wrong but to me this sounds quite exaggerated. Hotmem add
> >> > which is an operation which has to allocate memory has to scale with the
> >> > currently available memory IMHO.
> >> 
> >> With virtual machines this is very real and not exaggerated at
> >> all. E.g. Hyper-V host can be tuned to automatically add new memory when
> >> guest is running out of it. Even 100 blocks can represent an issue.
> >
> > Do you have any reference to a bug report. I am really curious because
> > something really smells wrong and it is not clear that the chosen
> > solution is really the best one.
> 
> Unfortunately I'm not aware of any publicly posted bug reports (CC:
> K. Y. - he may have a reference) but I think I still remember everything
> correctly. Not sure how deep you want me to go into details though...

As much as possible to understand what was really going on...

> Virtual guests under stress were getting into OOM easily and the OOM
> killer was even killing the udev process trying to online the
> memory.

Do you happen to have any OOM report? I am really surprised that udev
would be an oom victim because that process is really small. Who is
consuming all the memory then?

Have you measured how much memory we need to allocate to add one
memblock?

> There was a workaround for the issue added to the hyper-v driver
> doing memory add:
> 
> hv_mem_hot_add(...) {
> ...
>  add_memory(....);
>  wait_for_completion_timeout(..., 5*HZ);
>  ...
> }

I can still see 
		/*
		 * Wait for the memory block to be onlined when memory onlining
		 * is done outside of kernel (memhp_auto_online). Since the hot
		 * add has succeeded, it is ok to proceed even if the pages in
		 * the hot added region have not been "onlined" within the
		 * allowed time.
		 */
		if (dm_device.ha_waiting)
			wait_for_completion_timeout(&dm_device.ol_waitevent,
						    5*HZ);
 
> the completion was done by observing for the MEM_ONLINE event. This, of
> course, was slowing things down significantly and waiting for a
> userspace action in kernel is not a nice thing to have (not speaking
> about all other memory adding methods which had the same issue). Just
> removing this wait was leading us to the same OOM as the hypervisor was
> adding more and more memory and eventually even add_memory() was
> failing, udev and other processes were killed,...

Yes, I agree that waiting on a user action from the kernel is very far
from ideal.
 
> With the feature in place we have new memory available right after we do
> add_memory(), everything is serialized.

What prevented you from onlining the memory explicitly from
hv_mem_hot_add path? Why do you need a user visible policy for that at
all? You could also add a parameter to add_memory that would do the same
thing. Or am I missing something?
Vitaly Kuznetsov Feb. 24, 2017, 2:10 p.m. UTC | #11
Michal Hocko <mhocko@kernel.org> writes:

> On Thu 23-02-17 19:14:27, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Thu 23-02-17 17:36:38, Vitaly Kuznetsov wrote:
>> >> Michal Hocko <mhocko@kernel.org> writes:
>> > [...]
>> >> > Is a grow from 256M -> 128GB really something that happens in real life?
>> >> > Don't get me wrong but to me this sounds quite exaggerated. Hotmem add
>> >> > which is an operation which has to allocate memory has to scale with the
>> >> > currently available memory IMHO.
>> >> 
>> >> With virtual machines this is very real and not exaggerated at
>> >> all. E.g. Hyper-V host can be tuned to automatically add new memory when
>> >> guest is running out of it. Even 100 blocks can represent an issue.
>> >
>> > Do you have any reference to a bug report? I am really curious because
>> > something really smells wrong and it is not clear that the chosen
>> > solution is really the best one.
>> 
>> Unfortunately I'm not aware of any publicly posted bug reports (CC:
>> K. Y. - he may have a reference) but I think I still remember everything
>> correctly. Not sure how deep you want me to go into details though...
>
> As much as possible to understand what was really going on...
>
>> Virtual guests under stress were getting into OOM easily and the OOM
>> killer was even killing the udev process trying to online the
>> memory.
>
> Do you happen to have any OOM report? I am really surprised that udev
> would be an oom victim because that process is really small. Who is
> consuming all the memory then?

It's been a while since I worked on this and unfortunately I don't have
a log. From what I remember, the kernel itself was consuming all memory
so *all* processes were victims.

>
> Have you measured how much memory we need to allocate to add one
> memblock?

No, it's actually a good idea if we decide to do some sort of pre-allocation.

Just did a quick (and probably dirty) test: increasing guest memory from
4G to 8G (32 x 128MB blocks) requires 68MB of memory, so it's roughly 2MB
per block. It's really easy to trigger OOM for small guests.
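
To make the arithmetic explicit, here is a minimal userspace sketch of the
calculation. The constants are only the figures from the quick test above
(one measurement, not kernel constants), and the function names are
hypothetical helpers introduced for illustration.

```c
/* Figures from the quick test above; one rough measurement, not constants. */
#define BLOCK_SIZE_MB   128.0   /* size of one hot-added memory block */
#define MEASURED_MB      68.0   /* memory consumed while adding the blocks */
#define ADDED_MB       4096.0   /* memory added when growing 4G -> 8G */

/* Number of memory blocks needed to cover added_mb of new memory. */
static double blocks_added(double added_mb)
{
	return added_mb / BLOCK_SIZE_MB;
}

/* Memory consumed per added block, in MB. */
static double overhead_per_block_mb(double measured_mb, double added_mb)
{
	return measured_mb / blocks_added(added_mb);
}

/* Overhead expressed as a percentage of the memory being added. */
static double overhead_percent(double measured_mb, double added_mb)
{
	return 100.0 * measured_mb / added_mb;
}
```

With these inputs the helpers give 32 blocks, 2.125MB per block, and an
overhead of about 1.66% of the added memory.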

>
>> There was a workaround for the issue added to the hyper-v driver
>> doing memory add:
>> 
>> hv_mem_hot_add(...) {
>> ...
>>  add_memory(....);
>>  wait_for_completion_timeout(..., 5*HZ);
>>  ...
>> }
>
> I can still see 
> 		/*
> 		 * Wait for the memory block to be onlined when memory onlining
> 		 * is done outside of kernel (memhp_auto_online). Since the hot
> 		 * add has succeeded, it is ok to proceed even if the pages in
> 		 * the hot added region have not been "onlined" within the
> 		 * allowed time.
> 		 */
> 		if (dm_device.ha_waiting)
> 			wait_for_completion_timeout(&dm_device.ol_waitevent,
> 						    5*HZ);
>

See 

 dm_device.ha_waiting = !memhp_auto_online;

30 lines above. The workaround is still there for the udev case and it is
still equally bad.

>> the completion was done by observing for the MEM_ONLINE event. This, of
>> course, was slowing things down significantly and waiting for a
>> userspace action in kernel is not a nice thing to have (not speaking
>> about all other memory adding methods which had the same issue). Just
>> removing this wait was leading us to the same OOM as the hypervisor was
>> adding more and more memory and eventually even add_memory() was
>> failing, udev and other processes were killed,...
>
> Yes, I agree that waiting on a user action from the kernel is very far
> from ideal.
>
>> With the feature in place we have new memory available right after we do
>> add_memory(), everything is serialized.
>
> What prevented you from onlining the memory explicitly from
> hv_mem_hot_add path? Why do you need a user visible policy for that at
> all? You could also add a parameter to add_memory that would do the same
> thing. Or am I missing something?

We have different mechanisms for adding memory, I'm aware of at least 3:
ACPI, Xen, Hyper-V. The issue I'm addressing is general enough, I'm
pretty sure I can reproduce the issue on Xen, for example - just boot a
small guest and try adding tons of memory. Why should we have different
defaults for different technologies? 

And, BTW, the link to the previous discussion:
https://groups.google.com/forum/#!msg/linux.kernel/AxvyuQjr4GY/TLC-K0sL_NEJ
Michal Hocko Feb. 24, 2017, 2:41 p.m. UTC | #12
On Fri 24-02-17 15:10:29, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Thu 23-02-17 19:14:27, Vitaly Kuznetsov wrote:
[...]
> >> Virtual guests under stress were getting into OOM easily and the OOM
> >> killer was even killing the udev process trying to online the
> >> memory.
> >
> > Do you happen to have any OOM report? I am really surprised that udev
> > would be an oom victim because that process is really small. Who is
> > consuming all the memory then?
> 
> It's been a while since I worked on this and unfortunately I don't have
> a log. From what I remember, the kernel itself was consuming all memory
> so *all* processes were victims.

This suggests that something is terminally broken!
 
> > Have you measured how much memory we need to allocate to add one
> > memblock?
> 
> No, it's actually a good idea if we decide to do some sort of pre-allocation.
> 
> Just did a quick (and probably dirty) test: increasing guest memory from
> 4G to 8G (32 x 128MB blocks) requires 68MB of memory, so it's roughly 2MB
> per block. It's really easy to trigger OOM for small guests.

So we need ~1.5% of the added memory. That doesn't sound like something
that would trigger the OOM killer too easily, assuming that the increase is
not way too large. Going from 256M (your earlier example) to 8G looks like
it will eat half the memory, which is still quite far away from the OOM. I
would call such an increase bad memory balancing, though, to be honest. A
more reasonable memory balancing would go and double the available memory
IMHO. Anyway, I still think that hotplug is a terrible way to do memory
ballooning.

[...]
> >> the completion was done by observing for the MEM_ONLINE event. This, of
> >> course, was slowing things down significantly and waiting for a
> >> userspace action in kernel is not a nice thing to have (not speaking
> >> about all other memory adding methods which had the same issue). Just
> >> removing this wait was leading us to the same OOM as the hypervisor was
> >> adding more and more memory and eventually even add_memory() was
> >> failing, udev and other processes were killed,...
> >
> > Yes, I agree that waiting on a user action from the kernel is very far
> > from ideal.
> >
> >> With the feature in place we have new memory available right after we do
> >> add_memory(), everything is serialized.
> >
> > What prevented you from onlining the memory explicitly from
> > hv_mem_hot_add path? Why do you need a user visible policy for that at
> > all? You could also add a parameter to add_memory that would do the same
> > thing. Or am I missing something?
> 
> We have different mechanisms for adding memory, I'm aware of at least 3:
> ACPI, Xen, Hyper-V. The issue I'm addressing is general enough, I'm
> pretty sure I can reproduce the issue on Xen, for example - just boot a
> small guest and try adding tons of memory. Why should we have different
> defaults for different technologies? 

Just make them all online the memory explicitly. I really do not see why
this should be decided by the poor user. Put it differently, when should I
disable auto online when using Hyper-V or another of the mentioned
technologies? CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE should simply die and
I would even be for killing the whole memhp_auto_online thing along the
way. This simply doesn't make any sense to me.
 
> And, BTW, the link to the previous discussion:
> https://groups.google.com/forum/#!msg/linux.kernel/AxvyuQjr4GY/TLC-K0sL_NEJ

I remember this discussion and objected to the approach back then as
well.
Vitaly Kuznetsov Feb. 24, 2017, 3:05 p.m. UTC | #13
Michal Hocko <mhocko@kernel.org> writes:

> On Fri 24-02-17 15:10:29, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Thu 23-02-17 19:14:27, Vitaly Kuznetsov wrote:
> [...]
>> >> Virtual guests under stress were getting into OOM easily and the OOM
>> >> killer was even killing the udev process trying to online the
>> >> memory.
>> >
>> > Do you happen to have any OOM report? I am really surprised that udev
>> > would be an oom victim because that process is really small. Who is
>> > consuming all the memory then?
>> 
>> It's been a while since I worked on this and unfortunately I don't have
>> a log. From what I remember, the kernel itself was consuming all memory
>> so *all* processes were victims.
>
> This suggests that something is terminally broken!
>
>> > Have you measured how much memory we need to allocate to add one
>> > memblock?
>> 
>> No, it's actually a good idea if we decide to do some sort of pre-allocation.
>> 
>> Just did a quick (and probably dirty) test: increasing guest memory from
>> 4G to 8G (32 x 128MB blocks) requires 68MB of memory, so it's roughly 2MB
>> per block. It's really easy to trigger OOM for small guests.
>
> So we need ~1.5% of the added memory. That doesn't sound like something
> that would trigger the OOM killer too easily, assuming that the increase is
> not way too large. Going from 256M (your earlier example) to 8G looks like
> it will eat half the memory, which is still quite far away from the OOM.

And if the kernel itself takes 128MB of RAM (which is not something
extraordinary with many CPUs) we have zero left. Go to something bigger
than 8G and you die.
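
A rough sketch of that headroom calculation, under stated assumptions: the
~2MB-per-128MB-block cost from the quick test earlier in the thread, a 128MB
kernel footprint, and none of the newly added memory being onlined before
all blocks have been added (the asynchronous worst case). The helper names
are hypothetical and exist only for this illustration.

```c
/* Assumed figures taken from this thread, not kernel constants. */
#define BLOCK_SIZE_MB          128L  /* size of one hot-added memory block */
#define OVERHEAD_PER_BLOCK_MB    2L  /* rounded cost of adding one block */

/* Memory needed to add the blocks that grow base_mb to target_mb, in MB. */
static long hotadd_overhead_mb(long base_mb, long target_mb)
{
	/* Round up: a partial block still costs a whole block. */
	long blocks = (target_mb - base_mb + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB;

	return blocks * OVERHEAD_PER_BLOCK_MB;
}

/*
 * Memory left in the original guest once the add has been paid for, in MB.
 * A negative result means the add cannot be paid for from the memory that
 * was present before any new block came online.
 */
static long headroom_mb(long base_mb, long kernel_mb, long target_mb)
{
	return base_mb - kernel_mb - hotadd_overhead_mb(base_mb, target_mb);
}
```

Under these assumptions, growing a 256M guest to 8G costs 124MB and leaves
about 4MB of headroom; growing it to 16G gives a negative result.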

> I would call such
> an increase bad memory balancing, though, to be honest. A more
> reasonable memory balancing would go and double the available memory
> IMHO. Anyway, I still think that hotplug is a terrible way to do memory
> ballooning.

That's what we have in *all* modern hypervisors. And I don't see why
it's bad.

>
> [...]
>> >> the completion was done by observing for the MEM_ONLINE event. This, of
>> >> course, was slowing things down significantly and waiting for a
>> >> userspace action in kernel is not a nice thing to have (not speaking
>> >> about all other memory adding methods which had the same issue). Just
>> >> removing this wait was leading us to the same OOM as the hypervisor was
>> >> adding more and more memory and eventually even add_memory() was
>> >> failing, udev and other processes were killed,...
>> >
>> > Yes, I agree that waiting on a user action from the kernel is very far
>> > from ideal.
>> >
>> >> With the feature in place we have new memory available right after we do
>> >> add_memory(), everything is serialized.
>> >
>> > What prevented you from onlining the memory explicitly from
>> > hv_mem_hot_add path? Why do you need a user visible policy for that at
>> > all? You could also add a parameter to add_memory that would do the same
>> > thing. Or am I missing something?
>> 
>> We have different mechanisms for adding memory, I'm aware of at least 3:
>> ACPI, Xen, Hyper-V. The issue I'm addressing is general enough, I'm
>> pretty sure I can reproduce the issue on Xen, for example - just boot a
>> small guest and try adding tons of memory. Why should we have different
>> defaults for different technologies? 
>
> Just make them all online the memory explicitly. I really do not see why
> this should be decided by the poor user. Put it differently, when should I
> disable auto online when using Hyper-V or another of the mentioned
> technologies? CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE should simply die and
> I would even be for killing the whole memhp_auto_online thing along the
> way. This simply doesn't make any sense to me.

ACPI, for example, is shared between KVM/QEMU, VMware and real
hardware. I can understand why bare metal guys might not want to have
auto-online by default (though major Linux distros ship the stupid
'offline' -> 'online' udev rule and nobody complains) -- they're doing
some physical action - going to a server room, opening the box,
plugging in memory, going back to their place - but with VMs it's not like
that. What's gonna be the default for ACPI then?

I don't understand why CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE is
disturbing and why we need to take this choice away from distros. I
don't understand what we're gaining by replacing it with
per-memory-add-technology defaults.

>> And, BTW, the link to the previous discussion:
>> https://groups.google.com/forum/#!msg/linux.kernel/AxvyuQjr4GY/TLC-K0sL_NEJ
>
> I remember this discussion and objected to the approach back then as
> well.

Patch

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 8ab8ea1..ede46f3 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -249,7 +249,7 @@  static bool pages_correctly_reserved(unsigned long start_pfn)
 	return ret;
 }
 
-int memory_block_change_state(struct memory_block *mem,
+static int memory_block_change_state(struct memory_block *mem,
 		unsigned long to_state, unsigned long from_state_req)
 {
 	int ret = 0;
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 093607f..b723a68 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -109,9 +109,6 @@  static inline int memory_isolate_notify(unsigned long val, void *v)
 extern int register_memory_isolate_notifier(struct notifier_block *nb);
 extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
 extern int register_new_memory(int, struct mem_section *);
-extern int memory_block_change_state(struct memory_block *mem,
-				     unsigned long to_state,
-				     unsigned long from_state_req);
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern int unregister_memory_section(struct mem_section *);
 #endif
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e43142c1..6f7a289 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1329,7 +1329,7 @@  int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
 
 static int online_memory_block(struct memory_block *mem, void *arg)
 {
-	return memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
+	return device_online(&mem->dev);
 }
 
 /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
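
The effect of the one-line change in online_memory_block() can be seen in a
simplified userspace model. This is only a sketch: struct model_block and
the model_* functions are hypothetical stand-ins, and the return-value
conventions are assumed to mirror device_online()/device_offline() (0 on
success, 1 when the device flag says there is nothing to do).

```c
#include <stdbool.h>
#include <errno.h>

enum model_state { MODEL_MEM_OFFLINE, MODEL_MEM_ONLINE };

struct model_block {
	enum model_state state;  /* stands in for the memory block state */
	bool dev_offline;        /* stands in for the dev->offline flag */
};

/* A newly added block starts offline, with the device flag set. */
static void model_add(struct model_block *b)
{
	b->state = MODEL_MEM_OFFLINE;
	b->dev_offline = true;
}

/* Direct state change: touches the block state but not the device flag. */
static int model_change_state(struct model_block *b, enum model_state to,
			      enum model_state from_req)
{
	if (b->state != from_req)
		return -EINVAL;
	b->state = to;
	return 0;
}

/* Device-level online: changes the state and clears the flag on success. */
static int model_device_online(struct model_block *b)
{
	int ret;

	if (!b->dev_offline)
		return 1;	/* flag says the device is already online */
	ret = model_change_state(b, MODEL_MEM_ONLINE, MODEL_MEM_OFFLINE);
	if (!ret)
		b->dev_offline = false;
	return ret;
}

/* Device-level offline: consults the flag before doing anything. */
static int model_device_offline(struct model_block *b)
{
	int ret;

	if (b->dev_offline)
		return 1;	/* flag says the device is already offline */
	ret = model_change_state(b, MODEL_MEM_OFFLINE, MODEL_MEM_ONLINE);
	if (!ret)
		b->dev_offline = true;
	return ret;
}
```

In this model, onlining through model_change_state() leaves dev_offline set,
so a later model_device_offline() returns 1 and the block stays online.
Onlining through model_device_online() clears the flag, and the later
offline request succeeds.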