| Message ID | 20190516023706.50118-1-tyreld@linux.ibm.com (mailing list archive) |
|---|---|
| State | Rejected |
| Series | [1/3] powerpc/pseries: Simplify cpu readd to use drc_index |
| Context | Check | Description |
|---|---|---|
| snowpatch_ozlabs/apply_patch | success | Successfully applied on branch next (8150a153c013aa2dd1ffae43370b89ac1347a7fb) |
| snowpatch_ozlabs/checkpatch | warning | total: 0 errors, 1 warnings, 0 checks, 39 lines checked |
Tyrel Datwyler <tyreld@linux.ibm.com> writes:

> The current dlpar_cpu_readd() takes in a cpu_id and uses that to look up
> the cpus device_node so that we can get at the ibm,my-drc-index
> property. The only user of cpu readd is an OF notifier call back. This
> call back already has a reference to the device_node and therefore can
> retrieve the drc_index from the device_node.

dlpar_cpu_readd is a hack to try to change the CPU-node relationship at
runtime without destabilizing the system. It doesn't accomplish that and
it should just be removed (and I'm working on that).
On 05/16/2019 12:17 PM, Nathan Lynch wrote:
> Tyrel Datwyler <tyreld@linux.ibm.com> writes:
>> The current dlpar_cpu_readd() takes in a cpu_id and uses that to look up
>> the cpus device_node so that we can get at the ibm,my-drc-index
>> property. The only user of cpu readd is an OF notifier call back. This
>> call back already has a reference to the device_node and therefore can
>> retrieve the drc_index from the device_node.
>
> dlpar_cpu_readd is a hack to try to change the CPU-node relationship at
> runtime without destabilizing the system. It doesn't accomplish that and
> it should just be removed (and I'm working on that).
>

I will politely disagree. We've done exactly this from userspace for
years. My experience still suggests that memory affinity is the problem
area, and that the work to push this all into the kernel originally was
poorly tested.

-Tyrel
Tyrel Datwyler <tyreld@linux.vnet.ibm.com> writes:

> On 05/16/2019 12:17 PM, Nathan Lynch wrote:
>> Tyrel Datwyler <tyreld@linux.ibm.com> writes:
>>> The current dlpar_cpu_readd() takes in a cpu_id and uses that to look up
>>> the cpus device_node so that we can get at the ibm,my-drc-index
>>> property. The only user of cpu readd is an OF notifier call back. This
>>> call back already has a reference to the device_node and therefore can
>>> retrieve the drc_index from the device_node.
>>
>> dlpar_cpu_readd is a hack to try to change the CPU-node relationship at
>> runtime without destabilizing the system. It doesn't accomplish that and
>> it should just be removed (and I'm working on that).
>>
>
> I will politely disagree. We've done exactly this from userspace for
> years. My experience still suggests that memory affinity is the
> problem area, and that the work to push this all into the kernel
> originally was poorly tested.

Kernel implementation details aside, how do you change the cpu-node
relationship at runtime without breaking NUMA-aware applications? Is
this not a fundamental issue to address before adding code like this?
On 05/20/2019 08:01 AM, Nathan Lynch wrote:
> Tyrel Datwyler <tyreld@linux.vnet.ibm.com> writes:
>
>> On 05/16/2019 12:17 PM, Nathan Lynch wrote:
>>> Tyrel Datwyler <tyreld@linux.ibm.com> writes:
>>>> The current dlpar_cpu_readd() takes in a cpu_id and uses that to look up
>>>> the cpus device_node so that we can get at the ibm,my-drc-index
>>>> property. The only user of cpu readd is an OF notifier call back. This
>>>> call back already has a reference to the device_node and therefore can
>>>> retrieve the drc_index from the device_node.
>>>
>>> dlpar_cpu_readd is a hack to try to change the CPU-node relationship at
>>> runtime without destabilizing the system. It doesn't accomplish that and
>>> it should just be removed (and I'm working on that).
>>>
>>
>> I will politely disagree. We've done exactly this from userspace for
>> years. My experience still suggests that memory affinity is the
>> problem area, and that the work to push this all into the kernel
>> originally was poorly tested.
>
> Kernel implementation details aside, how do you change the cpu-node
> relationship at runtime without breaking NUMA-aware applications? Is
> this not a fundamental issue to address before adding code like this?
>

If that is the concern then hotplug in general already breaks them. Take
for example the removal of a faulty processor and then adding a new
processor back. It is quite possible that the new processor is in a
different NUMA node. Keep in mind that in this scenario the new
processor and threads gets the same logical cpu ids as the faulty
processor we just removed.

Now we have to ask the question who is right and who is wrong. In this
case the kernel data structures reflect the correct NUMA topology.
However, did the NUMA aware application or libnuma make an assumption
that specific sets of logical cpu ids are always in the same NUMA node?

-Tyrel
Tyrel Datwyler <tyreld@linux.vnet.ibm.com> writes:

> On 05/20/2019 08:01 AM, Nathan Lynch wrote:
>> Kernel implementation details aside, how do you change the cpu-node
>> relationship at runtime without breaking NUMA-aware applications? Is
>> this not a fundamental issue to address before adding code like this?
>>
>
> If that is the concern then hotplug in general already breaks
> them. Take for example the removal of a faulty processor and then
> adding a new processor back. It is quite possible that the new
> processor is in a different NUMA node. Keep in mind that in this
> scenario the new processor and threads gets the same logical cpu ids
> as the faulty processor we just removed.

Yes, the problem is re-use of a logical CPU id with a node id that
differs from the one it was initially assigned, and there are several
ways to get into that situation on this platform.

We probably need to be more careful in how we allocate a spot in the CPU
maps for a newly-added processor. I believe the algorithm is simple
first-fit right now, and it doesn't take into account prior NUMA
relationships.

> Now we have to ask the question who is right and who is wrong. In this
> case the kernel data structures reflect the correct NUMA
> topology. However, did the NUMA aware application or libnuma make an
> assumption that specific sets of logical cpu ids are always in the
> same NUMA node?

Yes, and that assumption is widespread because people tend to develop on
an architecture where this kind of stuff doesn't happen (at least not
yet).

And I don't really agree that the current behavior reflects what is
actually going on. When Linux running in a PowerVM LPAR receives a
notification to change the NUMA properties of a processor at runtime,
it's because the platform has changed the physical characteristics of
the partition. I.e. you're now using a different physical processor,
with different relationships to the other resources in the system.

Even if it didn't destabilize the kernel (by changing the result of
cpu_to_node() when various subsystems assume it will be static),
continuing to use the logical CPU ids on the new processor obscures what
has actually happened. And we have developers that have told us that
this behavior - changing the logical cpu<->node relationship at runtime
- is something their existing NUMA-aware applications cannot handle.
```diff
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f85e2b01c3df..c906d9ec9013 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -133,7 +133,7 @@ static inline void shared_proc_topology_init(void) {}
 #define topology_core_cpumask(cpu)	(per_cpu(cpu_core_map, cpu))
 #define topology_core_id(cpu)		(cpu_to_core_id(cpu))
 
-int dlpar_cpu_readd(int cpu);
+int dlpar_cpu_readd(u32 drc_index);
 
 #endif
 #endif
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 57e64273cb33..40c0b6da12c2 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1479,9 +1479,9 @@ static int dt_update_callback(struct notifier_block *nb,
 	case OF_RECONFIG_UPDATE_PROPERTY:
 		if (of_node_is_type(update->dn, "cpu") &&
 		    !of_prop_cmp(update->prop->name, "ibm,associativity")) {
-			u32 core_id;
-			of_property_read_u32(update->dn, "reg", &core_id);
-			rc = dlpar_cpu_readd(core_id);
+			u32 drc_index;
+			of_property_read_u32(update->dn, "ibm,my-drc-index", &drc_index);
+			rc = dlpar_cpu_readd(drc_index);
 			rc = NOTIFY_OK;
 		}
 		break;
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 97feb6e79f1a..2dfa9416ce54 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -802,18 +802,10 @@ static int dlpar_cpu_add_by_count(u32 cpus_to_add)
 	return rc;
 }
 
-int dlpar_cpu_readd(int cpu)
+int dlpar_cpu_readd(u32 drc_index)
 {
-	struct device_node *dn;
-	struct device *dev;
-	u32 drc_index;
 	int rc;
 
-	dev = get_cpu_device(cpu);
-	dn = dev->of_node;
-
-	rc = of_property_read_u32(dn, "ibm,my-drc-index", &drc_index);
-
 	rc = dlpar_cpu_remove_by_index(drc_index);
 	if (!rc)
 		rc = dlpar_cpu_add(drc_index);
```
The current dlpar_cpu_readd() takes in a cpu_id and uses that to look up
the cpu's device_node so that we can get at the ibm,my-drc-index
property. The only user of cpu readd is an OF notifier callback. This
callback already has a reference to the device_node and therefore can
retrieve the drc_index from the device_node.

This patch simplifies dlpar_cpu_readd() to take a drc_index directly and
does away with an unnecessary device_node lookup.

Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/topology.h          |  2 +-
 arch/powerpc/mm/numa.c                       |  6 +++---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 10 +---------
 3 files changed, 5 insertions(+), 13 deletions(-)