| Message ID | 20200117105759.27905-1-david@redhat.com (mailing list archive) |
|---|---|
| State | Not Applicable |
| Series | [RFC,v1] mm: is_mem_section_removable() overhaul |
| Context | Check | Description |
|---|---|---|
| snowpatch_ozlabs/apply_patch | warning | Failed to apply on branch powerpc/merge (20862247a368dbb75d6e97d82345999adaacf3cc) |
| snowpatch_ozlabs/apply_patch | warning | Failed to apply on branch powerpc/next (3a9d970f17e05a7b26f782beb8f7f2118d1741ea) |
| snowpatch_ozlabs/apply_patch | warning | Failed to apply on branch linus/master (575966e080270b7574175da35f7f7dd5ecd89ff4) |
| snowpatch_ozlabs/apply_patch | warning | Failed to apply on branch powerpc/fixes (6da3eced8c5f3b03340b0c395bacd552c4d52411) |
| snowpatch_ozlabs/apply_patch | success | Successfully applied on branch linux-next (de970dffa7d19eae1d703c3534825308ef8d5dec) |
| snowpatch_ozlabs/build-ppc64le | success | Build succeeded |
| snowpatch_ozlabs/build-ppc64be | success | Build succeeded |
| snowpatch_ozlabs/build-ppc64e | success | Build succeeded |
| snowpatch_ozlabs/build-pmac32 | success | Build succeeded |
| snowpatch_ozlabs/checkpatch | warning | total: 0 errors, 0 warnings, 2 checks, 209 lines checked |
| snowpatch_ozlabs/needsstable | success | Patch has no Fixes tags |
On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
> Let's refactor that code. We want to check if we can offline memory
> blocks. Add a new function is_mem_section_offlineable() for that and
> make it call is_mem_section_offlineable() for each contained section.
> Within is_mem_section_offlineable(), add some more sanity checks and
> directly bail out if the section contains holes or if it spans multiple
> zones.

I didn't read the patch (yet) but I am wondering: if we want to touch this code, can we simply always return true there? I mean, whoever depends on this check is racy anyway, and the failure can happen even after the sysfs file says good to go, right? The check is essentially as expensive as calling the offlining code itself. So the only use case I can think of is a dumb driver that crawls over the blocks, checks which are removable, and tries to hot-remove them. But just trying to offline one block after another is essentially going to achieve the same.

Or does anybody see any reasonable use case that would break if we did that unconditional behavior?
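The brute-force crawl Michal describes can be sketched as a user-space shell function against the documented memory-hotplug sysfs ABI (`/sys/devices/system/memory/memoryX/state`). This is only a sketch: the sysfs root is parameterized so the function can be pointed at a scratch directory, and against a real system it must run as root.

```shell
# Attempt to offline every online memory block and report the outcome,
# without consulting "removable" first. Each write to "state" takes the
# real locks and does the full isolation work, so success or failure
# here is the definitive answer.
offline_all_blocks() {
    sysfs="${1:-/sys/devices/system/memory}"
    for block in "$sysfs"/memory*; do
        [ -f "$block/state" ] || continue
        # Skip blocks that are already offline.
        [ "$(cat "$block/state")" = "online" ] || continue
        if echo offline > "$block/state" 2>/dev/null; then
            echo "offlined ${block##*/}"
        else
            echo "failed to offline ${block##*/}"
        fi
    done
}
```

The cost of this loop on a machine with many memory blocks is exactly what the `removable` hint was introduced to avoid.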
On 17.01.20 12:33, Michal Hocko wrote:
> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
> [...]
> I didn't read the patch (yet) but I am wondering. If we want to touch
> this code, can we simply always return true there? I mean whoever
> depends on this check is racy and the failure can happen even after
> the sysfs says good to go, right? The check is essentially as expensive
> as calling the offlining code itself. So the only usecase I can think of
> is a dumb driver to crawl over blocks and check which is removable and
> try to hotremove it. But just trying to offline one block after another
> is essentially going to achieve the same.

Some thoughts:

1. It allows you to check if memory is likely to be offlineable without doing expensive locking and trying to isolate pages (meaning: zone->lock and mem_hotplug_lock, but also calling drain_all_pages() when isolating).

2. There are use cases that want to identify a memory block/DIMM to unplug. One example is the PPC DLPAR code (see this patch). Going over all memory blocks trying to offline them is an expensive operation.

3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils) makes use of /sys/.../removable to speed up the search, AFAIK.

4. lsmem displays/groups by "removable".

5. If "removable=false", then it usually really is not offlineable. Of course, there could also be races (someone frees the last unmovable page), but it means "don't even try". OTOH, "removable=true" is racier and gives fewer guarantees ("looks okay, feel free to try").

> Or does anybody see any reasonable usecase that would break if we did
> that unconditional behavior?

If we always returned "true", the whole reason the interface was originally introduced would be "broken" (meaning less performant, as you would try to offline every memory block):

commit 5c755e9fd813810680abd56ec09a5f90143e815b
Author: Badari Pulavarty <pbadari@us.ibm.com>
Date:   Wed Jul 23 21:28:19 2008 -0700

    memory-hotplug: add sysfs removable attribute for hotplug memory remove

    Memory may be hot-removed on a per-memory-block basis, particularly on
    POWER where the SPARSEMEM section size often matches the memory-block
    size.  A user-level agent must be able to identify which sections of
    memory are likely to be removable before attempting the potentially
    expensive operation.  This patch adds a file called "removable" to the
    memory directory in sysfs to help such an agent.  In this patch, a memory
    block is considered removable if;

I'd love to see this go away (just like "valid_zones"), but I don't think it is possible. So this patch makes it work a little more correctly (multiple zones, holes), cleans up the code, and avoids races with unplug code. It cannot, however, give any more guarantees that memory offlining will actually succeed.
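The user-level agent flow that commit describes can be sketched in shell: consult `removable` first and attempt the expensive offline only for blocks that report 1. The file paths follow the documented sysfs memory-hotplug ABI; the sysfs root parameter is only there so the function can be exercised against a test directory.

```shell
# Offline only blocks whose "removable" file reports 1, i.e. the
# "likely to be removable" hint the 2008 commit introduced.
offline_removable_blocks() {
    sysfs="${1:-/sys/devices/system/memory}"
    for block in "$sysfs"/memory*; do
        [ -f "$block/removable" ] || continue
        [ "$(cat "$block/removable")" = "1" ] || continue
        # The hint is racy by design: this write can still fail.
        if echo offline > "$block/state" 2>/dev/null; then
            echo "offlined ${block##*/}"
        fi
    done
}
```

Compared with blindly offlining every block, this only pays the locking and page-isolation cost for blocks that are at least likely candidates, which is the speedup powerpc-utils relies on.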
On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
> On 17.01.20 12:33, Michal Hocko wrote:
> [...]
> Some thoughts:
>
> 1. It allows you to check if memory is likely to be offlineable without
>    doing expensive locking and trying to isolate pages [...]
>
> 2. There are use cases that want to identify a memory block/DIMM to
>    unplug. One example is PPC DLPAR code (see this patch). [...]
>
> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
>    makes use of /sys/.../removable to speed up the search AFAIK.

Well, while I do see those points, I am not really sure they are worth having a broken-by-definition interface.

> 4. lsmem displays/groups by "removable".

Is anybody really using that?

> 5. If "removable=false" then it usually really is not offlineable.
>    Of course, there could also be races (free the last unmovable page),
>    but it means "don't even try". OTOH, "removable=true" is more racy,
>    and gives less guarantees. ("looks okay, feel free to try")

Yeah, but you could already be pessimistic and try movable zones before other kernel zones.

> > Or does anybody see any reasonable usecase that would break if we did
> > that unconditional behavior?
>
> If we would return always "true", then the whole reason the
> interface originally was introduced would be "broken" [...]

I would argue that the whole interface is broken ;). Not the first time in kernel development history, and not the last time either. What I am trying to say here is that unless there are _real_ use cases that depend on knowing that something surely is _not_ offlineable, I would just try to drop the functionality while preserving the interface and see what happens.
On 17.01.20 15:52, Michal Hocko wrote:
> On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
> [...]
> Well, while I do see those points I am not really sure they are worth
> having a broken (by-definition) interface.

It's a pure speedup. And for that, the interface has been working perfectly fine for years?

> > 4. lsmem displays/groups by "removable".
>
> Is anybody really using that?

Well, at least I am using it when testing, to identify which (ZONE_NORMAL!) block I can easily offline/re-online (e.g., to validate all the zone-shrinking stuff I have been fixing).

So there is at least one user ;)

[...]

> > If we would return always "true", then the whole reason the
> > interface originally was introduced would be "broken" [...]
>
> I would argue that the whole interface is broken ;). [...] unless there
> are _real_ usecases depending on knowing that something surely is _not_
> offlineable then I would just try to drop the functionality while
> preserving the interface and see what happens.

I can see that, but I can perfectly well understand why - especially powerpc - wants a fast way to sense which blocks it actually makes sense to try to offline.

The original patch correctly states "which sections of memory are likely to be removable before attempting the potentially expensive operation."

It works as designed, I would say.
On Fri 17-01-20 15:58:26, David Hildenbrand wrote:
> On 17.01.20 15:52, Michal Hocko wrote:
> [...]
> It's a pure speedup. And for that, the interface has been working
> perfectly fine for years?
> [...]
> Well at least I am using that when testing to identify which
> (ZONE_NORMAL!) block I can easily offline/re-online [...]
>
> So there is at least one user ;)

Fair enough. But I would argue that there are better ways to do the same solely for testing purposes, rather than having subtly broken code to maintain.

> I can see that, but I can perfectly well understand why - especially
> powerpc - wants a fast way to sense which blocks it actually makes sense
> to try to offline.
>
> The original patch correctly states "which sections of memory are likely
> to be removable before attempting the potentially expensive operation."
>
> It works as designed I would say.

Then I would just keep it crippled the same way it has been for years without anybody noticing.
On Fri, Jan 17, 2020 at 7:30 AM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> Then I would just keep it crippled the same way it has been for years
> without anybody noticing.

I tend to agree. At least the kmem driver that wants to unplug memory could not use an interface that does not give stable answers. It just relies on remove_memory() to return a definitive error.
On 17.01.20 16:54, Dan Williams wrote:
> [...]
> I tend to agree. At least the kmem driver that wants to unplug memory
> could not use an interface that does not give stable answers. It just
> relies on remove_memory() to return a definitive error.

Just because kmem cannot reuse such an interface doesn't mean we should not touch it (or I am not getting your point). Especially, this interface is about "can it likely be offlined and then eventually be removed (if there is a HW interface for that)" (as documented), not about "will remove_memory() work".

We do have users, and if we agree to keep it (which I think we should, as I expressed), then I think we should un-cripple and fix it. After all, we have to maintain it. The current interface provides what was documented - "likely to be offlineable". (The chosen name was just horribly bad - as I expressed a while ago already :) )
On Fri, Jan 17, 2020 at 8:11 AM David Hildenbrand <david@redhat.com> wrote:
> [...]
> Just because kmem cannot reuse such an interface doesn't mean we should
> not touch it (or I am not getting your point). Especially, this
> interface is about "can it likely be offlined and then eventually be
> removed (if there is a HW interface for that)" (as documented), not
> about "will remove_memory() work".

Unless the user is willing to hold the device_hotplug_lock over the evaluation, the result is unreliable. For example, the changes to removable_show() are no better than they were previously, because the result is invalidated as soon as the lock is dropped.

> We do have users and if we agree to keep it (what I think we should as I
> expressed) then I think we should un-cripple and fix it. [...]

Are there users that misbehave because they try to offline a section with holes? I brought up kmem because it's an unplug user that does not care whether the failure was due to pinned pages or holes in sections. Do others care about that precision in a meaningful way?
On Fri 17-01-20 08:57:51, Dan Williams wrote:
[...]
> Unless the user is willing to hold the device_hotplug_lock over the
> evaluation then the result is unreliable.

Do we want to hold the device_hotplug_lock from this user-readable file in the first place? My book says that this is just waiting to become a problem.

Really, the interface is flawed and should never have been merged in the first place. We cannot simply remove it altogether, I am afraid, so let's at least remove the bogus code and pretend that the world is a better place where everything is removable, except the reality sucks...
On 20.01.20 08:48, Michal Hocko wrote: > On Fri 17-01-20 08:57:51, Dan Williams wrote: > [...] >> Unless the user is willing to hold the device_hotplug_lock over the >> evaluation then the result is unreliable. > > Do we want to hold the device_hotplug_lock from this user readable file > in the first place? My book says that this just waits to become a > problem. It was the "big hammer" solution for this RFC. I think we could do with a try_lock() on the device_lock() paired with a device->removed flag. The latter is helpful for properly catching zombie devices on the onlining/offlining path either way (and on my todo list). > > Really, the interface is flawed and should have never been merged in the > first place. We cannot simply remove it altogether I am afraid so let's > at least remove the bogus code and pretend that the world is a better > place where everything is removable except the reality sucks... As I expressed already, the interface works as designed/documented and has been used like that for years. I tend to agree that it never should have been merged like that. We have (at least) two places that are racy (with concurrent memory hotplug): 1. /sys/.../memoryX/removable - a) make it always return yes and make the interface useless - b) add proper locking and keep it running as is (e.g., so David can identify offlineable memory blocks :) ). 2. /sys/.../memoryX/valid_zones - a) always return "none" if the memory is online - b) add proper locking and keep it running as is - c) cache the result ("zone") when a block is onlined (e.g., in mem->zone. If it is NULL, either mixed zones or unknown) At least 2. already scream for a proper device_lock() locking as the mem->state is not stable across the function call. 1a and 2a are the easiest solutions but remove all ways to identify if a memory block could theoretically be offlined - without trying (especially, also to identify the MOVABLE zone). 
I tend to prefer 1b) and 2c), paired with proper device_lock() locking. We don't affect existing use cases but are able to simplify the code + fix the races. What's your opinion? Any alternatives?
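The try_lock() idea can be sketched with a small, self-contained userspace mock. Note that `struct mock_device`, its `dead` flag, and `removable_show_sketch()` are simplified stand-ins for the kernel's device_trylock(), dev->p->dead, and the sysfs removable_show() handler; this illustrates the proposed pattern only, it is not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel objects under discussion. */
struct mock_device {
    bool locked;   /* models device_lock() being held by someone else */
    bool dead;     /* models dev->p->dead: device being torn down */
    bool online;   /* models mem->state == MEM_ONLINE */
};

/* device_trylock() analogue: succeeds only if nobody holds the lock. */
static bool mock_device_trylock(struct mock_device *dev)
{
    if (dev->locked)
        return false;
    dev->locked = true;
    return true;
}

static void mock_device_unlock(struct mock_device *dev)
{
    dev->locked = false;
}

/*
 * Sketch of a removable_show() that refuses to answer instead of
 * blocking on or racing with concurrent hotplug: -1 means "busy,
 * try again", 0/1 is the usual removable answer. A zombie (dead)
 * device reports not-removable.
 */
static int removable_show_sketch(struct mock_device *dev)
{
    int ret;

    if (!mock_device_trylock(dev))
        return -1;                      /* concurrent hotplug in flight */
    if (dev->dead)
        ret = 0;                        /* device already being removed */
    else
        ret = dev->online ? 1 : 0;      /* real code would scan sections */
    mock_device_unlock(dev);
    return ret;
}
```

The point of the sketch is that the sysfs read bails out early (here with -1) rather than waiting on the big device_hotplug_lock hammer.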
On 20.01.20 10:14, David Hildenbrand wrote: > On 20.01.20 08:48, Michal Hocko wrote: >> On Fri 17-01-20 08:57:51, Dan Williams wrote: >> [...] >>> Unless the user is willing to hold the device_hotplug_lock over the >>> evaluation then the result is unreliable. >> >> Do we want to hold the device_hotplug_lock from this user readable file >> in the first place? My book says that this just waits to become a >> problem. > > It was the "big hammer" solution for this RFC. > > I think we could do with a try_lock() on the device_lock() paired with a > device->removed flag. The latter is helpful for properly catching zombie > devices on the onlining/offlining path either way (and on my todo list). We do have dev->p->dead which could come in handy.
On Mon 20-01-20 10:14:44, David Hildenbrand wrote: > On 20.01.20 08:48, Michal Hocko wrote: > > On Fri 17-01-20 08:57:51, Dan Williams wrote: > > [...] > >> Unless the user is willing to hold the device_hotplug_lock over the > >> evaluation then the result is unreliable. > > > > Do we want to hold the device_hotplug_lock from this user readable file > > in the first place? My book says that this just waits to become a > > problem. > > It was the "big hammer" solution for this RFC. > > I think we could do with a try_lock() on the device_lock() paired with a > device->removed flag. The latter is helpful for properly catching zombie > devices on the onlining/offlining path either way (and on my todo list). try_lock would be more considerate. It would at least make any potential hammering a bit harder. > > Really, the interface is flawed and should have never been merged in the > > first place. We cannot simply remove it altogether I am afraid so let's > > at least remove the bogus code and pretend that the world is a better > > place where everything is removable except the reality sucks... > > As I expressed already, the interface works as designed/documented and > has been used like that for years. It seems we do differ in the usefulness though. Using a crappy interface for years doesn't make it less crappy. I do realize we cannot remove the interface but we can remove issues with the implementation and I dare to say that most existing users wouldn't really notice. > I tend to agree that it never should have been merged like that. > > We have (at least) two places that are racy (with concurrent memory > hotplug): > > 1. /sys/.../memoryX/removable > - a) make it always return yes and make the interface useless > - b) add proper locking and keep it running as is (e.g., so David can > identify offlineable memory blocks :) ). > > 2. 
/sys/.../memoryX/valid_zones > - a) always return "none" if the memory is online > - b) add proper locking and keep it running as is > - c) cache the result ("zone") when a block is onlined (e.g., in > mem->zone. If it is NULL, either mixed zones or unknown) > > At least 2. already scream for a proper device_lock() locking as the > mem->state is not stable across the function call. > > 1a and 2a are the easiest solutions but remove all ways to identify if a > memory block could theoretically be offlined - without trying > (especially, also to identify the MOVABLE zone). > > I tend to prefer 1b) and 2c), paired with proper device_lock() locking. > We don't affect existing use cases but are able to simplify the code + > fix the races. > > What's your opinion? Any alternatives? 1a) and 2c) if you ask me.
>>> Really, the interface is flawed and should have never been merged in the >>> first place. We cannot simply remove it altogether I am afraid so let's >>> at least remove the bogus code and pretend that the world is a better >>> place where everything is removable except the reality sucks... >> >> As I expressed already, the interface works as designed/documented and >> has been used like that for years. > > It seems we do differ in the usefulness though. Using a crappy interface > for years doesn't make it less crappy. I do realize we cannot remove the > interface but we can remove issues with the implementation and I dare to > say that most existing users wouldn't really notice. Well, at least powerpc-utils (why this interface was introduced) will notice a) performance wise and b) because more logging output will be generated (obviously non-offlineable blocks will be tried to offline). However, it should not break, because we could have had races before/false positives. > >> I tend to agree that it never should have been merged like that. >> >> We have (at least) two places that are racy (with concurrent memory >> hotplug): >> >> 1. /sys/.../memoryX/removable >> - a) make it always return yes and make the interface useless >> - b) add proper locking and keep it running as is (e.g., so David can >> identify offlineable memory blocks :) ). >> >> 2. /sys/.../memoryX/valid_zones >> - a) always return "none" if the memory is online >> - b) add proper locking and keep it running as is >> - c) cache the result ("zone") when a block is onlined (e.g., in >> mem->zone. If it is NULL, either mixed zones or unknown) >> >> At least 2. already scream for a proper device_lock() locking as the >> mem->state is not stable across the function call. >> >> 1a and 2a are the easiest solutions but remove all ways to identify if a >> memory block could theoretically be offlined - without trying >> (especially, also to identify the MOVABLE zone). 
>> >> I tend to prefer 1b) and 2c), paired with proper device_lock() locking. >> We don't affect existing use cases but are able to simplify the code + >> fix the races. >> >> What's your opinion? Any alternatives? > > 1a) and 2c) if you ask me. > I'll look into that all, just might take a little (busy with a lot of stuff). But after all, it does not seem to be urgent. 1a) will be easy, I'll post a patch soon that we can let rest in -next for a bit to see if people start to scream out loud.
On Wed 22-01-20 11:39:08, David Hildenbrand wrote: > >>> Really, the interface is flawed and should have never been merged in the > >>> first place. We cannot simply remove it altogether I am afraid so let's > >>> at least remove the bogus code and pretend that the world is a better > >>> place where everything is removable except the reality sucks... > >> > >> As I expressed already, the interface works as designed/documented and > >> has been used like that for years. > > > > It seems we do differ in the usefulness though. Using a crappy interface > > for years doesn't make it less crappy. I do realize we cannot remove the > > interface but we can remove issues with the implementation and I dare to > > say that most existing users wouldn't really notice. > > Well, at least powerpc-utils (why this interface was introduced) will > notice a) performance wise and b) because more logging output will be > generated (obviously non-offlineable blocks will be tried to offline). I would really appreciate some specific example for a real usecase. I am not familiar with powerpc-utils workflows myself.
On 22.01.20 11:42, Michal Hocko wrote: > On Wed 22-01-20 11:39:08, David Hildenbrand wrote: >>>>> Really, the interface is flawed and should have never been merged in the >>>>> first place. We cannot simply remove it altogether I am afraid so let's >>>>> at least remove the bogus code and pretend that the world is a better >>>>> place where everything is removable except the reality sucks... >>>> >>>> As I expressed already, the interface works as designed/documented and >>>> has been used like that for years. >>> >>> It seems we do differ in the usefulness though. Using a crappy interface >>> for years doesn't make it less crappy. I do realize we cannot remove the >>> interface but we can remove issues with the implementation and I dare to >>> say that most existing users wouldn't really notice. >> >> Well, at least powerpc-utils (why this interface was introduced) will >> notice a) performance wise and b) because more logging output will be >> generated (obviously non-offlineable blocks will be tried to offline). > > I would really appreciate some specific example for a real usecase. I am > not familiar with powerpc-utils worklflows myself. > Not an expert myself: https://github.com/ibm-power-utilities/powerpc-utils -> src/drmgr/drslot_chrp_mem.c On request to remove some memory it will a) Read "->removable" of all memory blocks ("lmb") b) Check if the request can be fulfilled using the removable blocks c) Try to offline the memory blocks by actually trying to offline them. If that succeeded, trigger removal using some hypervisor hooks. Interestingly, with "AMS ballooning", it will already consider the "removable" information useless (most probably, because of non-migratable balloon pages that can be offlined - I assume the powerpc code that I converted to proper balloon compaction just recently). a) and b) are skipped. Returning "yes" on all blocks will make them handle it just like if "AMS ballooning" is active. So any memory block will be tried. 
Should work but will be slower if no ballooning is active.
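To make that trade-off concrete, here is a self-contained mock of the hint-then-offline flow used by drslot_chrp_mem.c. Everything here is invented for illustration (the block layout, the single `movable` success criterion); real offlining can fail for many more reasons and the hint can be stale:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of memory blocks; "movable" decides whether a real
 * offline attempt would succeed. Purely illustrative. */
struct mock_block {
    bool movable;
    bool offline;
};

/* Cheap "->removable" heuristic read; may be stale in real life. */
static bool hint_removable(const struct mock_block *b)
{
    return b->movable;
}

/* Expensive offline attempt; counts every attempt made. */
static bool try_offline(struct mock_block *b, int *attempts)
{
    (*attempts)++;
    if (!b->movable)
        return false;
    b->offline = true;
    return true;
}

/* powerpc-utils-style flow: optionally scan hints first, then only
 * attempt the candidates, instead of attempting every block blindly. */
static int offline_n_blocks(struct mock_block *blocks, size_t n,
                            int want, bool use_hint, int *attempts)
{
    int done = 0;

    for (size_t i = 0; i < n && done < want; i++) {
        if (use_hint && !hint_removable(&blocks[i]))
            continue;               /* skip blocks the hint rules out */
        if (try_offline(&blocks[i], attempts))
            done++;
    }
    return done;
}
```

With the hint pass, only the promising blocks get an offline attempt; without it, every block is attempted until the request is satisfied, which is the extra work and extra logging mentioned above.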
On 22.01.20 11:54, David Hildenbrand wrote: > On 22.01.20 11:42, Michal Hocko wrote: >> On Wed 22-01-20 11:39:08, David Hildenbrand wrote: >>>>>> Really, the interface is flawed and should have never been merged in the >>>>>> first place. We cannot simply remove it altogether I am afraid so let's >>>>>> at least remove the bogus code and pretend that the world is a better >>>>>> place where everything is removable except the reality sucks... >>>>> >>>>> As I expressed already, the interface works as designed/documented and >>>>> has been used like that for years. >>>> >>>> It seems we do differ in the usefulness though. Using a crappy interface >>>> for years doesn't make it less crappy. I do realize we cannot remove the >>>> interface but we can remove issues with the implementation and I dare to >>>> say that most existing users wouldn't really notice. >>> >>> Well, at least powerpc-utils (why this interface was introduced) will >>> notice a) performance wise and b) because more logging output will be >>> generated (obviously non-offlineable blocks will be tried to offline). >> >> I would really appreciate some specific example for a real usecase. I am >> not familiar with powerpc-utils worklflows myself. >> > > Not an expert myself: > > https://github.com/ibm-power-utilities/powerpc-utils > > -> src/drmgr/drslot_chrp_mem.c > > On request to remove some memory it will > > a) Read "->removable" of all memory blocks ("lmb") > b) Check if the request can be fulfilled using the removable blocks > c) Try to offline the memory blocks by trying to offline it. If that > succeeded, trigger removeal of it using some hypervisor hooks. > > Interestingly, with "AMS ballooning", it will already consider the > "removable" information useless (most probably, because of > non-migratable balloon pages that can be offlined - I assume the powerpc > code that I converted to proper balloon compaction just recently). a) > and b) is skipped. 
> > Returning "yes" on all blocks will make them handle it just like if "AMS > ballooning" is active. So any memory block will be tried. Should work > but will be slower if no ballooning is active. > On lsmem: https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.lgdd/lgdd_r_lsmem_cmd.html " Removable yes if the memory range can be set offline, no if it cannot be set offline. A dash (-) means that the range is already offline. The kernel method that identifies removable memory ranges is heuristic and not exact. Occasionally, memory ranges are falsely reported as removable or falsely reported as not removable. " Usage of lsmem paired with chmem: https://access.redhat.com/solutions/3937181 Especially interesting for IBM z Systems, whereby memory onlining/offlining will trigger the actual population of memory in the hypervisor. So if an admin wants to offline some memory (to give it back to the hypervisor), it would use lsmem to identify such blocks first, instead of trying random blocks until one offlining request succeeds. E.g., documented in https://books.google.de/books?id=1UEhDQAAQBAJ&pg=PA117&lpg=PA117&dq=lsmem+removable&source=bl&ots=OzMfU6Gbzu&sig=ACfU3U2IfH0eTVJs0qu50FdkysA3iC0elw&hl=de&sa=X&ved=2ahUKEwjQpdXQkpfnAhVOzqQKHTN4BsoQ6AEwBXoECAoQAQ#v=onepage&q=lsmem%20removable&f=false So I still think that the interface is useful and is getting used in real life. Users tolerate false positives/negatives.
On Wed 22-01-20 12:58:16, David Hildenbrand wrote: > On 22.01.20 11:54, David Hildenbrand wrote: > > On 22.01.20 11:42, Michal Hocko wrote: > >> On Wed 22-01-20 11:39:08, David Hildenbrand wrote: > >>>>>> Really, the interface is flawed and should have never been merged in the > >>>>>> first place. We cannot simply remove it altogether I am afraid so let's > >>>>>> at least remove the bogus code and pretend that the world is a better > >>>>>> place where everything is removable except the reality sucks... > >>>>> > >>>>> As I expressed already, the interface works as designed/documented and > >>>>> has been used like that for years. > >>>> > >>>> It seems we do differ in the usefulness though. Using a crappy interface > >>>> for years doesn't make it less crappy. I do realize we cannot remove the > >>>> interface but we can remove issues with the implementation and I dare to > >>>> say that most existing users wouldn't really notice. > >>> > >>> Well, at least powerpc-utils (why this interface was introduced) will > >>> notice a) performance wise and b) because more logging output will be > >>> generated (obviously non-offlineable blocks will be tried to offline). > >> > >> I would really appreciate some specific example for a real usecase. I am > >> not familiar with powerpc-utils worklflows myself. > >> > > > > Not an expert myself: > > > > https://github.com/ibm-power-utilities/powerpc-utils > > > > -> src/drmgr/drslot_chrp_mem.c > > > > On request to remove some memory it will > > > > a) Read "->removable" of all memory blocks ("lmb") > > b) Check if the request can be fulfilled using the removable blocks > > c) Try to offline the memory blocks by trying to offline it. If that > > succeeded, trigger removeal of it using some hypervisor hooks. 
> > > > Interestingly, with "AMS ballooning", it will already consider the > > "removable" information useless (most probably, because of > > non-migratable balloon pages that can be offlined - I assume the powerpc > > code that I converted to proper balloon compaction just recently). a) > > and b) is skipped. > > > > Returning "yes" on all blocks will make them handle it just like if "AMS > > ballooning" is active. So any memory block will be tried. Should work > > but will be slower if no ballooning is active. > > > > On lsmem: > > https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.lgdd/lgdd_r_lsmem_cmd.html > > " > Removable > yes if the memory range can be set offline, no if it cannot be set > offline. A dash (-) means that the range is already offline. The kernel > method that identifies removable memory ranges is heuristic and not > exact. Occasionally, memory ranges are falsely reported as removable or > falsely reported as not removable. > " > > Usage of lsmem paird with chmem: > > https://access.redhat.com/solutions/3937181 > > > Especially interesting for IBM z Systems, whereby memory > onlining/offlining will trigger the actual population of memory in the > hypervisor. So if an admin wants to offline some memory (to give it back > to the hypervisor), it would use lsmem to identify such blocks first, > instead of trying random blocks until one offlining request succeeds. I am sorry for being dense here but I still do not understand why s390 and the way how it does the hotremove matters here. Afterall there are no arch specific operations done until the memory is offlined. Also randomly checking memory blocks and then hoping that the offline will succeed is not way much different from just trying the offline the block. Both have to crawl through the pfn range and bail out on the unmovable memory.
On 22.01.20 17:46, Michal Hocko wrote: > On Wed 22-01-20 12:58:16, David Hildenbrand wrote: >> On 22.01.20 11:54, David Hildenbrand wrote: >>> On 22.01.20 11:42, Michal Hocko wrote: >>>> On Wed 22-01-20 11:39:08, David Hildenbrand wrote: >>>>>>>> Really, the interface is flawed and should have never been merged in the >>>>>>>> first place. We cannot simply remove it altogether I am afraid so let's >>>>>>>> at least remove the bogus code and pretend that the world is a better >>>>>>>> place where everything is removable except the reality sucks... >>>>>>> >>>>>>> As I expressed already, the interface works as designed/documented and >>>>>>> has been used like that for years. >>>>>> >>>>>> It seems we do differ in the usefulness though. Using a crappy interface >>>>>> for years doesn't make it less crappy. I do realize we cannot remove the >>>>>> interface but we can remove issues with the implementation and I dare to >>>>>> say that most existing users wouldn't really notice. >>>>> >>>>> Well, at least powerpc-utils (why this interface was introduced) will >>>>> notice a) performance wise and b) because more logging output will be >>>>> generated (obviously non-offlineable blocks will be tried to offline). >>>> >>>> I would really appreciate some specific example for a real usecase. I am >>>> not familiar with powerpc-utils worklflows myself. >>>> >>> >>> Not an expert myself: >>> >>> https://github.com/ibm-power-utilities/powerpc-utils >>> >>> -> src/drmgr/drslot_chrp_mem.c >>> >>> On request to remove some memory it will >>> >>> a) Read "->removable" of all memory blocks ("lmb") >>> b) Check if the request can be fulfilled using the removable blocks >>> c) Try to offline the memory blocks by trying to offline it. If that >>> succeeded, trigger removeal of it using some hypervisor hooks. 
>>> >>> Interestingly, with "AMS ballooning", it will already consider the >>> "removable" information useless (most probably, because of >>> non-migratable balloon pages that can be offlined - I assume the powerpc >>> code that I converted to proper balloon compaction just recently). a) >>> and b) is skipped. >>> >>> Returning "yes" on all blocks will make them handle it just like if "AMS >>> ballooning" is active. So any memory block will be tried. Should work >>> but will be slower if no ballooning is active. >>> >> >> On lsmem: >> >> https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.lgdd/lgdd_r_lsmem_cmd.html >> >> " >> Removable >> yes if the memory range can be set offline, no if it cannot be set >> offline. A dash (-) means that the range is already offline. The kernel >> method that identifies removable memory ranges is heuristic and not >> exact. Occasionally, memory ranges are falsely reported as removable or >> falsely reported as not removable. >> " >> >> Usage of lsmem paird with chmem: >> >> https://access.redhat.com/solutions/3937181 >> >> >> Especially interesting for IBM z Systems, whereby memory >> onlining/offlining will trigger the actual population of memory in the >> hypervisor. So if an admin wants to offline some memory (to give it back >> to the hypervisor), it would use lsmem to identify such blocks first, >> instead of trying random blocks until one offlining request succeeds. > > I am sorry for being dense here but I still do not understand why s390 It's good that we talk about it :) It's hard to reconstruct actual use cases from tools and some documentation only ... Side note (just FYI): One difference on s390x compared to other architectures (AFAIKS) is that once memory is offline, you might not be allowed (by the hypervisor) to online it again - because it was effectively unplugged. Such memory is not removed via remove_memory(), it's simply kept offline. > and the way how it does the hotremove matters here. 
Afterall there are > no arch specific operations done until the memory is offlined. Also > randomly checking memory blocks and then hoping that the offline will > succeed is not way much different from just trying the offline the > block. Both have to crawl through the pfn range and bail out on the > unmovable memory. I think in general we have two approaches to memory unplugging. 1. Know explicitly what you want to unplug (e.g., a DIMM spanning multiple memory blocks). 2. Find random memory blocks you can offline/unplug. For 1, I think we both agree that we don't need this. Just try to offline and you know if it worked. Now of course, for 2 you can try random blocks until you succeed. From a sysadmin point of view that's very inefficient. From a powerpc-utils point of view, that's inefficient. I just learned that "chmem"[1] has a mode where you can specify a "size" and not only a range. So a sysadmin can still control onlining/offlining for this use case with a few commands. In other tools (e.g., powerpc-utils), well, you have to try to offline random memory blocks (just like chmem does). AFAIK, once we render /sys/.../removable useless, I can see the following changes: 1. Trying to offline a certain amount of memory blocks gets slower/takes longer/is less efficient. Might be tolerable. The tools seem to keep working. 2. You can no longer make a rough estimate of how much memory you could offline - before you actually try to offline it. I can only imagine that something like this makes sense in a virtual environment (e.g., IBM z) to balance memory between virtual machines, but I am not aware of a real user of something like that. So what I can do is a) Come up with a patch that rips that stuff out (well I have that already lying around) b) Describe the existing users + changes we will see c) CC relevant people I identify (lsmem/chmem/powerpc-utils/etc.) on the patch to see if we are missing other use cases/users/implications. Sounds like a plan? 
[1] https://git.kernel.org/pub/scm/utils/util-linux/util-linux.git/tree/sys-utils/chmem.c
On Wed 22-01-20 19:15:47, David Hildenbrand wrote: > On 22.01.20 17:46, Michal Hocko wrote: > > On Wed 22-01-20 12:58:16, David Hildenbrand wrote: [...] > >> Especially interesting for IBM z Systems, whereby memory > >> onlining/offlining will trigger the actual population of memory in the > >> hypervisor. So if an admin wants to offline some memory (to give it back > >> to the hypervisor), it would use lsmem to identify such blocks first, > >> instead of trying random blocks until one offlining request succeeds. > > > > I am sorry for being dense here but I still do not understand why s390 > > It's good that we talk about it :) It's hard to reconstruct actual use > cases from tools and some documentation only ... > > Side note (just FYI): One difference on s390x compared to other > architectures (AFAIKS) is that once memory is offline, you might not be > allowed (by the hypervisor) to online it again - because it was > effectively unplugged. Such memory is not removed via remove_memory(), > it's simply kept offline. I have a very vague understanding of s390 specialities but this is not really relevant to the discussion AFAICS because this happens _after_ offlining. > > and the way how it does the hotremove matters here. Afterall there are > > no arch specific operations done until the memory is offlined. Also > > randomly checking memory blocks and then hoping that the offline will > > succeed is not way much different from just trying the offline the > > block. Both have to crawl through the pfn range and bail out on the > > unmovable memory. > > I think in general we have to approaches to memory unplugging. > > 1. Know explicitly what you want to unplug (e.g., a DIMM spanning > multiple memory blocks). > > 2. Find random memory blocks you can offline/unplug. > > > For 1, I think we both agree that we don't need this. Just try to > offline and you know if it worked. > > Now of course, for 2 you can try random blocks until you succeeded. 
From > a sysadmin point of view that's very inefficient. From a powerpc-utils > point of view, that's inefficient. How exactly is check + offline more optimal than offline which makes check as its first step? I will get to your later points after this is clarified.
On 22.01.20 19:38, Michal Hocko wrote: > On Wed 22-01-20 19:15:47, David Hildenbrand wrote: >> On 22.01.20 17:46, Michal Hocko wrote: >>> On Wed 22-01-20 12:58:16, David Hildenbrand wrote: > [...] >>>> Especially interesting for IBM z Systems, whereby memory >>>> onlining/offlining will trigger the actual population of memory in the >>>> hypervisor. So if an admin wants to offline some memory (to give it back >>>> to the hypervisor), it would use lsmem to identify such blocks first, >>>> instead of trying random blocks until one offlining request succeeds. >>> >>> I am sorry for being dense here but I still do not understand why s390 >> >> It's good that we talk about it :) It's hard to reconstruct actual use >> cases from tools and some documentation only ... >> >> Side note (just FYI): One difference on s390x compared to other >> architectures (AFAIKS) is that once memory is offline, you might not be >> allowed (by the hypervisor) to online it again - because it was >> effectively unplugged. Such memory is not removed via remove_memory(), >> it's simply kept offline. > > I have a very vague understanding of s390 specialities but this is not > really relevant to the discussion AFAICS because this happens _after_ > offlining. Jep, that's why I flagged it as a side note. > >>> and the way how it does the hotremove matters here. Afterall there are >>> no arch specific operations done until the memory is offlined. Also >>> randomly checking memory blocks and then hoping that the offline will >>> succeed is not way much different from just trying the offline the >>> block. Both have to crawl through the pfn range and bail out on the >>> unmovable memory. >> >> I think in general we have to approaches to memory unplugging. >> >> 1. Know explicitly what you want to unplug (e.g., a DIMM spanning >> multiple memory blocks). >> >> 2. Find random memory blocks you can offline/unplug. >> >> >> For 1, I think we both agree that we don't need this. 
Just try to >> offline and you know if it worked. >> >> Now of course, for 2 you can try random blocks until you succeeded. From >> a sysadmin point of view that's very inefficient. From a powerpc-utils >> point of view, that's inefficient. > > How exactly is check + offline more optimal then offline which makes > check as its first step? I will get to your later points after this is > clarified. Scanning (almost) lockless is more efficient than bouncing back and forth with the device_hotplug_lock, mem_hotplug_lock, cpu_hotplug_lock and zone locks - as far as I understand. And as far as I understood, that was the whole reason of the original commit. Anyhow, you should have read until the end of my mail to find what you were looking for :)
On Wed 22-01-20 19:15:47, David Hildenbrand wrote: [...] > c) CC relevant people I identify (lsmem/chmem/powerpc-utils/etc.) on the > patch to see if we are missing other use cases/users/implications. > > Sounds like a plan? I would start with this step. Thanks!
On Wed 22-01-20 19:46:15, David Hildenbrand wrote: > On 22.01.20 19:38, Michal Hocko wrote: [...] > > How exactly is check + offline more optimal then offline which makes > > check as its first step? I will get to your later points after this is > > clarified. > > Scanning (almost) lockless is more efficient than bouncing back and > forth with the device_hotplug_lock, mem_hotplug_lock, cpu_hotplug_lock > and zone locks - as far as I understand. All but the zone lock shouldn't be really contended and as such shouldn't cause any troubles. zone->lock really depends on the page allocator usage of course. But as soon as we have a contention then it is just more likely that the result is less reliable. I would be also really curious about how much actual time could be saved by this - some real numbers - because hotplug operations shouldn't happen so often that this would stand out. At least that is my understanding. > And as far as I understood, that was the whole reason of the original > commit. Well, I have my doubts but it might be just me and I might be wrong. My experience from a large part of the memory hotplug functionality is that it was driven by a good intention but without a due diligence to think behind the most obvious usecase. Having a removable flag on the memblock sounds like a neat idea of course. But an inherently racy flag is just borderline useful. Anyway, I will stop at this moment and wait for real usecases. Thanks!
On Wed, Jan 22, 2020 at 11:09 AM Michal Hocko <mhocko@kernel.org> wrote: > > On Wed 22-01-20 19:46:15, David Hildenbrand wrote: > > On 22.01.20 19:38, Michal Hocko wrote: > [...] > > > How exactly is check + offline more optimal then offline which makes > > > check as its first step? I will get to your later points after this is > > > clarified. > > > > Scanning (almost) lockless is more efficient than bouncing back and > > forth with the device_hotplug_lock, mem_hotplug_lock, cpu_hotplug_lock > > and zone locks - as far as I understand. > > All but the zone lock shouldn't be really contended and as such > shouldn't cause any troubles. zone->lock really depends on the page > allocator usage of course. But as soon as we have a contention then it > is just more likely that the result is less reliable. > > I would be also really curious about how much actual time could be saved > by this - some real numbers - because hotplug operations shouldn't > happen so often that this would stand out. At least that is my > understanding. > > > And as far as I understood, that was the whole reason of the original > > commit. > > Well, I have my doubts but it might be just me and I might be wrong. My > experience from a large part of the memory hotplug functionality is that > it was driven by a good intention but without a due diligence to think > behind the most obvious usecase. Having a removable flag on the memblock > sounds like a neat idea of course. But an inherently racy flag is just > borderline useful. > > Anyway, I will stop at this moment and wait for real usecases. ...that and practical numbers showing that optimizing an interface that can at best give rough estimate answers is worth the code change.
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c126b94d1943..8d80159465e4 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -337,34 +337,24 @@ static int pseries_remove_mem_node(struct device_node *np)
 
 static bool lmb_is_removable(struct drmem_lmb *lmb)
 {
-	int i, scns_per_block;
-	bool rc = true;
-	unsigned long pfn, block_sz;
-	u64 phys_addr;
+	struct memory_block *mem;
+	bool rc = false;
 
 	if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
 		return false;
 
-	block_sz = memory_block_size_bytes();
-	scns_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
-	phys_addr = lmb->base_addr;
-
 #ifdef CONFIG_FA_DUMP
 	/*
 	 * Don't hot-remove memory that falls in fadump boot memory area
 	 * and memory that is reserved for capturing old kernel memory.
 	 */
-	if (is_fadump_memory_area(phys_addr, block_sz))
+	if (is_fadump_memory_area(lmb->base_addr, memory_block_size_bytes()))
 		return false;
 #endif
-
-	for (i = 0; i < scns_per_block; i++) {
-		pfn = PFN_DOWN(phys_addr);
-		if (!pfn_present(pfn))
-			continue;
-
-		rc = rc && is_mem_section_removable(pfn, PAGES_PER_SECTION);
-		phys_addr += MIN_MEMORY_BLOCK_SIZE;
+	mem = lmb_to_memblock(lmb);
+	if (mem) {
+		rc = is_memory_block_offlineable(mem);
+		put_device(&mem->dev);
 	}
 
 	return rc;
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index c6d288fad493..f744250c34d0 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -104,6 +104,25 @@ static ssize_t phys_index_show(struct device *dev,
 	return sprintf(buf, "%08lx\n", phys_index);
 }
 
+/*
+ * Test if a memory block is likely to be offlineable. Returns true if
+ * the block is already offline.
+ *
+ * Called under device_hotplug_lock.
+ */
+bool is_memory_block_offlineable(struct memory_block *mem)
+{
+	int i;
+
+	if (mem->state != MEM_ONLINE)
+		return true;
+
+	for (i = 0; i < sections_per_block; i++)
+		if (!is_mem_section_offlineable(mem->start_section_nr + i))
+			return false;
+	return true;
+}
+
 /*
  * Show whether the memory block is likely to be offlineable (or is already
  * offline). Once offline, the memory block could be removed. The return
@@ -114,20 +133,14 @@ static ssize_t removable_show(struct device *dev, struct device_attribute *attr,
 			      char *buf)
 {
 	struct memory_block *mem = to_memory_block(dev);
-	unsigned long pfn;
-	int ret = 1, i;
-
-	if (mem->state != MEM_ONLINE)
-		goto out;
+	int ret;
 
-	for (i = 0; i < sections_per_block; i++) {
-		if (!present_section_nr(mem->start_section_nr + i))
-			continue;
-		pfn = section_nr_to_pfn(mem->start_section_nr + i);
-		ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
-	}
+	ret = lock_device_hotplug_sysfs();
+	if (ret)
+		return ret;
+	ret = is_memory_block_offlineable(mem);
+	unlock_device_hotplug();
 
-out:
 	return sprintf(buf, "%d\n", ret);
 }
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 0b8d791b6669..faf03eb64ecc 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -91,6 +91,7 @@ typedef int (*walk_memory_blocks_func_t)(struct memory_block *, void *);
 extern int walk_memory_blocks(unsigned long start, unsigned long size,
 			      void *arg, walk_memory_blocks_func_t func);
 extern int for_each_memory_block(void *arg, walk_memory_blocks_func_t func);
+extern bool is_memory_block_offlineable(struct memory_block *mem);
 #define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
 
 #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index f4d59155f3d4..8d087772f748 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -306,15 +306,14 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 
-extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
+extern bool is_mem_section_offlineable(unsigned long nr);
 extern void try_offline_node(int nid);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern int remove_memory(int nid, u64 start, u64 size);
 extern void __remove_memory(int nid, u64 start, u64 size);
 
 #else
-static inline bool is_mem_section_removable(unsigned long pfn,
-					    unsigned long nr_pages)
+static inline bool is_mem_section_offlineable(unsigned long nr)
 {
 	return false;
 }
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7a6de9b0dcab..a6d14d2b7f0c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1128,46 +1128,51 @@ static unsigned long next_active_pageblock(unsigned long pfn)
 	return pfn + pageblock_nr_pages;
 }
 
-static bool is_pageblock_removable_nolock(unsigned long pfn)
+static int count_system_ram_pages_cb(unsigned long start_pfn,
+				     unsigned long nr_pages, void *data)
 {
-	struct page *page = pfn_to_page(pfn);
+	unsigned long *nr_system_ram_pages = data;
+
+	*nr_system_ram_pages += nr_pages;
+	return 0;
+}
+
+/*
+ * Check if a section is likely to be offlineable.
+ *
+ * Called with device_hotplug_lock.
+ */
+bool is_mem_section_offlineable(unsigned long nr)
+{
+	const unsigned long start_pfn = section_nr_to_pfn(nr);
+	const unsigned long end_pfn = start_pfn + PAGES_PER_SECTION;
+	unsigned long pfn, nr_pages = 0;
 	struct zone *zone;
 
-	/*
-	 * We have to be careful here because we are iterating over memory
-	 * sections which are not zone aware so we might end up outside of
-	 * the zone but still within the section.
-	 * We have to take care about the node as well. If the node is offline
-	 * its NODE_DATA will be NULL - see page_zone.
-	 */
-	if (!node_online(page_to_nid(page)))
+	if (!present_section_nr(nr))
 		return false;
-
-	zone = page_zone(page);
-	pfn = page_to_pfn(page);
-	if (!zone_spans_pfn(zone, pfn))
+	if (!online_section_nr(nr))
 		return false;
 
-	return !has_unmovable_pages(zone, page, MIGRATE_MOVABLE,
-				    MEMORY_OFFLINE);
-}
-
-/* Checks if this range of memory is likely to be hot-removable. */
-bool is_mem_section_removable(unsigned long start_pfn, unsigned long nr_pages)
-{
-	unsigned long end_pfn, pfn;
+	/* we don't allow to offline sections with holes */
+	walk_system_ram_range(start_pfn, PAGES_PER_SECTION, &nr_pages,
+			      count_system_ram_pages_cb);
+	if (nr_pages != PAGES_PER_SECTION)
+		return false;
 
-	end_pfn = min(start_pfn + nr_pages,
-		      zone_end_pfn(page_zone(pfn_to_page(start_pfn))));
+	/* we don't allow to offline sections with mixed zones/nodes */
+	zone = test_pages_in_a_zone(start_pfn, end_pfn);
+	if (!zone)
+		return false;
 
-	/* Check the starting page of each pageblock within the range */
+	/* check each pageblock if it contains unmovable pages */
 	for (pfn = start_pfn; pfn < end_pfn; pfn = next_active_pageblock(pfn)) {
-		if (!is_pageblock_removable_nolock(pfn))
+		if (has_unmovable_pages(zone, pfn_to_page(pfn), MIGRATE_MOVABLE,
+					MEMORY_OFFLINE))
 			return false;
 		cond_resched();
 	}
 
-	/* All pageblocks in the memory block are likely to be hot-removable */
 	return true;
 }
@@ -1436,15 +1441,6 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
 		node_clear_state(node, N_MEMORY);
 }
 
-static int count_system_ram_pages_cb(unsigned long start_pfn,
-				     unsigned long nr_pages, void *data)
-{
-	unsigned long *nr_system_ram_pages = data;
-
-	*nr_system_ram_pages += nr_pages;
-	return 0;
-}
-
 static int __ref __offline_pages(unsigned long start_pfn,
 				 unsigned long end_pfn)
 {
Let's refactor that code. We want to check if we can offline memory
blocks. Add a new function is_memory_block_offlineable() for that and
make it call is_mem_section_offlineable() for each contained section.
Within is_mem_section_offlineable(), add some more sanity checks and
directly bail out if the section contains holes or if it spans multiple
zones.

The old code was inherently racy with concurrent offlining/memory
unplug. Let's avoid that and grab the device_hotplug_lock. Luckily, we
are already holding it when calling from powerpc code.

Note1: If somebody wants to export this function for use in driver
code, we need a variant that takes the device_hotplug_lock itself.

Note2: If we could have a zombie device (not clear yet), the present
section checks would properly bail out early.

Note3: I'd prefer to hold the mem_hotplug_lock in read, but as we are
about to change the locking on the removal path (IOW, not hold it when
removing memory block devices), I do not want to go down that path.

Note4: Up to now, we would have returned "removable" even though we
would block offlining due to memory holes, multiple zones, or missing
sections.

Tested with DIMMs on x86-64. Compile-tested on Power.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Leonardo Bras <leonardo@linux.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Allison Randal <allison@lohutok.net>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: lantianyu1986@gmail.com
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 .../platforms/pseries/hotplug-memory.c | 24 ++-----
 drivers/base/memory.c                  | 37 ++++++----
 include/linux/memory.h                 |  1 +
 include/linux/memory_hotplug.h         |  5 +-
 mm/memory_hotplug.c                    | 68 +++++++++----------
 5 files changed, 67 insertions(+), 68 deletions(-)
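To make the new control flow easier to follow, here is a small userspace
model (not kernel code) of the check chain the patch introduces: a block
that is already offline reports true; an online block is offlineable only
if every contained section is present, online, hole-free, within a single
zone, and free of unmovable pages. "struct model_section" and its flags
are hypothetical stand-ins for the kernel's real checks
(present_section_nr(), online_section_nr(), walk_system_ram_range(),
test_pages_in_a_zone(), has_unmovable_pages()).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-section state; each flag stands in for one kernel check. */
struct model_section {
	bool present;     /* present_section_nr() */
	bool online;      /* online_section_nr() */
	bool has_holes;   /* walk_system_ram_range() found < PAGES_PER_SECTION */
	bool single_zone; /* test_pages_in_a_zone() returned a zone */
	bool unmovable;   /* has_unmovable_pages() hit in some pageblock */
};

/* Mirrors the bail-out order in the patch's is_mem_section_offlineable(). */
bool model_section_offlineable(const struct model_section *s)
{
	if (!s->present)
		return false;
	if (!s->online)
		return false;
	if (s->has_holes)     /* sections with holes are rejected */
		return false;
	if (!s->single_zone)  /* mixed zones/nodes are rejected */
		return false;
	return !s->unmovable;
}

/* Mirrors is_memory_block_offlineable(): an offline block reports true,
 * an online block requires every contained section to pass. */
bool model_block_offlineable(const struct model_section *secs,
			     int nr_sections, bool block_online)
{
	int i;

	if (!block_online)
		return true;
	for (i = 0; i < nr_sections; i++)
		if (!model_section_offlineable(&secs[i]))
			return false;
	return true;
}
```

Note that the model deliberately ignores locking; in the patch itself the
sysfs reader takes the device_hotplug_lock around the whole walk, and even
then the answer remains a racy estimate, as discussed above.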