Message ID | 20210303174857.1760393-2-clg@kaod.org (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | powerpc/xive: Map one IPI interrupt per node | expand |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch powerpc/merge (626a6c3d2e20da80aaa710104f34ea6037b28b33) |
snowpatch_ozlabs/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 17 lines checked |
snowpatch_ozlabs/needsstable | success | Patch has no Fixes tags |
On Wed, 3 Mar 2021 18:48:50 +0100 Cédric Le Goater <clg@kaod.org> wrote: > The 'chip_id' field of the XIVE CPU structure is used to choose a > target for a source located on the same chip when possible. This field > is assigned on the PowerNV platform using the "ibm,chip-id" property > on pSeries under KVM when NUMA nodes are defined but it is undefined This sentence seems to have a syntax problem... like it is missing an 'and' before 'on pSeries'. > under PowerVM. The XIVE source structure has a similar field > 'src_chip' which is only assigned on the PowerNV platform. > > cpu_to_node() returns a compatible value on all platforms, 0 being the > default node. It will also give us the opportunity to set the affinity > of a source on pSeries when we can localize them. > IIUC this relies on the fact that the NUMA node id is == to chip id on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable with this change. On the other hand, you have the pSeries case under PowerVM that doesn't xc->chip_id, which isn't passed to any hcall AFAICT. It looks like the chip id is only used for localization purpose in this case, right ? In this case, what about doing this change for pSeries only, somewhere in spapr.c ? > Signed-off-by: Cédric Le Goater <clg@kaod.org> > --- > arch/powerpc/sysdev/xive/common.c | 7 +------ > 1 file changed, 1 insertion(+), 6 deletions(-) > > diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c > index 595310e056f4..b8e456da28aa 100644 > --- a/arch/powerpc/sysdev/xive/common.c > +++ b/arch/powerpc/sysdev/xive/common.c > @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu) > > xc = per_cpu(xive_cpu, cpu); > if (!xc) { > - struct device_node *np; > - > xc = kzalloc_node(sizeof(struct xive_cpu), > GFP_KERNEL, cpu_to_node(cpu)); > if (!xc) > return -ENOMEM; > - np = of_get_cpu_node(cpu, NULL); > - if (np) > - xc->chip_id = of_get_ibm_chip_id(np); > - of_node_put(np); > + xc->chip_id = cpu_to_node(cpu); > xc->hw_ipi = XIVE_BAD_IRQ; > > per_cpu(xive_cpu, cpu) = xc;
On 3/8/21 6:13 PM, Greg Kurz wrote: > On Wed, 3 Mar 2021 18:48:50 +0100 > Cédric Le Goater <clg@kaod.org> wrote: > >> The 'chip_id' field of the XIVE CPU structure is used to choose a >> target for a source located on the same chip when possible. This field >> is assigned on the PowerNV platform using the "ibm,chip-id" property >> on pSeries under KVM when NUMA nodes are defined but it is undefined > > This sentence seems to have a syntax problem... like it is missing an > 'and' before 'on pSeries'. ah yes, or simply a comma. >> under PowerVM. The XIVE source structure has a similar field >> 'src_chip' which is only assigned on the PowerNV platform. >> >> cpu_to_node() returns a compatible value on all platforms, 0 being the >> default node. It will also give us the opportunity to set the affinity >> of a source on pSeries when we can localize them. >> > > IIUC this relies on the fact that the NUMA node id is == to chip id > on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable > with this change. Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel in Cc:) On PowerNV, Linux uses "ibm,associativity" property of the CPU to find the node id. This value is built from the chip id in OPAL, so the value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" property are unlikely to be different. cpu_to_node(cpu) is used in many places to allocate the structures locally to the owning node. XIVE is not an exception (see below in the same patch), it is better to be consistent and get the same information (node id) using the same routine. In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot unifies the controllers of the system to only expose one the OS. This is problematic and should be changed but it's another topic. > On the other hand, you have the pSeries case under PowerVM that > doesn't xc->chip_id, which isn't passed to any hcall AFAICT. yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid chip id. QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not always correct btw) > It looks like the chip id is only used for localization purpose in > this case, right ? Yes and PAPR sources are not localized. So it's not used. MSI sources could be if we rewrote the MSI driver. > In this case, what about doing this change for pSeries only, > somewhere in spapr.c ? The IPI code is common to all platforms and all have the same issue. I rather not. Thanks, C. >> Signed-off-by: Cédric Le Goater <clg@kaod.org> >> --- >> arch/powerpc/sysdev/xive/common.c | 7 +------ >> 1 file changed, 1 insertion(+), 6 deletions(-) >> >> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c >> index 595310e056f4..b8e456da28aa 100644 >> --- a/arch/powerpc/sysdev/xive/common.c >> +++ b/arch/powerpc/sysdev/xive/common.c >> @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu) >> >> xc = per_cpu(xive_cpu, cpu); >> if (!xc) { >> - struct device_node *np; >> - >> xc = kzalloc_node(sizeof(struct xive_cpu), >> GFP_KERNEL, cpu_to_node(cpu)); >> if (!xc) >> return -ENOMEM; >> - np = of_get_cpu_node(cpu, NULL); >> - if (np) >> - xc->chip_id = of_get_ibm_chip_id(np); >> - of_node_put(np); >> + xc->chip_id = cpu_to_node(cpu); >> xc->hw_ipi = XIVE_BAD_IRQ; >> >> per_cpu(xive_cpu, cpu) = xc; >
On 3/9/21 12:33 PM, Cédric Le Goater wrote: > On 3/8/21 6:13 PM, Greg Kurz wrote: >> On Wed, 3 Mar 2021 18:48:50 +0100 >> Cédric Le Goater <clg@kaod.org> wrote: >> >>> The 'chip_id' field of the XIVE CPU structure is used to choose a >>> target for a source located on the same chip when possible. This field >>> is assigned on the PowerNV platform using the "ibm,chip-id" property >>> on pSeries under KVM when NUMA nodes are defined but it is undefined >> >> This sentence seems to have a syntax problem... like it is missing an >> 'and' before 'on pSeries'. > > ah yes, or simply a comma. > >>> under PowerVM. The XIVE source structure has a similar field >>> 'src_chip' which is only assigned on the PowerNV platform. >>> >>> cpu_to_node() returns a compatible value on all platforms, 0 being the >>> default node. It will also give us the opportunity to set the affinity >>> of a source on pSeries when we can localize them. >>> >> >> IIUC this relies on the fact that the NUMA node id is == to chip id >> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable >> with this change. > > Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall > H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel > in Cc:) That's correct. H_HOME_NODE_ASSOCIATIVITY returns not only the node_id, but a list with the ibm,associativity domains of the CPU that "proc-no" (processor identifier) is mapped to inside QEMU. node_id in this case, considering that we're working with a reference-points of size 4, is the 4th element of the returned list. The last element is "procno" itself. > > On PowerNV, Linux uses "ibm,associativity" property of the CPU to find > the node id. This value is built from the chip id in OPAL, so the > value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" > property are unlikely to be different. > > cpu_to_node(cpu) is used in many places to allocate the structures > locally to the owning node. XIVE is not an exception (see below in the > same patch), it is better to be consistent and get the same information > (node id) using the same routine. > > > In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : > LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot > unifies the controllers of the system to only expose one the OS. This > is problematic and should be changed but it's another topic. > > >> On the other hand, you have the pSeries case under PowerVM that >> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. > > yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning > under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid > chip id. > > QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not > always correct btw) If you have a way to reliably reproduce this, let me know and I'll fix it up in QEMU. Thanks, DHB > >> It looks like the chip id is only used for localization purpose in >> this case, right ? > > Yes and PAPR sources are not localized. So it's not used. MSI sources > could be if we rewrote the MSI driver. > >> In this case, what about doing this change for pSeries only, >> somewhere in spapr.c ? > > The IPI code is common to all platforms and all have the same issue. > I rather not. > > Thanks, > > C. > >>> Signed-off-by: Cédric Le Goater <clg@kaod.org> >>> --- >>> arch/powerpc/sysdev/xive/common.c | 7 +------ >>> 1 file changed, 1 insertion(+), 6 deletions(-) >>> >>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c >>> index 595310e056f4..b8e456da28aa 100644 >>> --- a/arch/powerpc/sysdev/xive/common.c >>> +++ b/arch/powerpc/sysdev/xive/common.c >>> @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu) >>> >>> xc = per_cpu(xive_cpu, cpu); >>> if (!xc) { >>> - struct device_node *np; >>> - >>> xc = kzalloc_node(sizeof(struct xive_cpu), >>> GFP_KERNEL, cpu_to_node(cpu)); >>> if (!xc) >>> return -ENOMEM; >>> - np = of_get_cpu_node(cpu, NULL); >>> - if (np) >>> - xc->chip_id = of_get_ibm_chip_id(np); >>> - of_node_put(np); >>> + xc->chip_id = cpu_to_node(cpu); >>> xc->hw_ipi = XIVE_BAD_IRQ; >>> >>> per_cpu(xive_cpu, cpu) = xc; >> >
On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: > > > On 3/9/21 12:33 PM, Cédric Le Goater wrote: >> On 3/8/21 6:13 PM, Greg Kurz wrote: >>> On Wed, 3 Mar 2021 18:48:50 +0100 >>> Cédric Le Goater <clg@kaod.org> wrote: >>> >>>> The 'chip_id' field of the XIVE CPU structure is used to choose a >>>> target for a source located on the same chip when possible. This field >>>> is assigned on the PowerNV platform using the "ibm,chip-id" property >>>> on pSeries under KVM when NUMA nodes are defined but it is undefined >>> >>> This sentence seems to have a syntax problem... like it is missing an >>> 'and' before 'on pSeries'. >> >> ah yes, or simply a comma. >> >>>> under PowerVM. The XIVE source structure has a similar field >>>> 'src_chip' which is only assigned on the PowerNV platform. >>>> >>>> cpu_to_node() returns a compatible value on all platforms, 0 being the >>>> default node. It will also give us the opportunity to set the affinity >>>> of a source on pSeries when we can localize them. >>>> >>> >>> IIUC this relies on the fact that the NUMA node id is == to chip id >>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable >>> with this change. >> >> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall >> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel >> in Cc:) > > That's correct. H_HOME_NODE_ASSOCIATIVITY returns not only the node_id, but > a list with the ibm,associativity domains of the CPU that "proc-no" (processor > identifier) is mapped to inside QEMU. > > node_id in this case, considering that we're working with a reference-points > of size 4, is the 4th element of the returned list. The last element is > "procno" itself. > > >> >> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find >> the node id. This value is built from the chip id in OPAL, so the >> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" >> property are unlikely to be different. >> >> cpu_to_node(cpu) is used in many places to allocate the structures >> locally to the owning node. XIVE is not an exception (see below in the >> same patch), it is better to be consistent and get the same information >> (node id) using the same routine. >> >> >> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : >> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot >> unifies the controllers of the system to only expose one the OS. This >> is problematic and should be changed but it's another topic. >> >> >>> On the other hand, you have the pSeries case under PowerVM that >>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. >> >> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning >> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid >> chip id. >> >> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not >> always correct btw) > > > If you have a way to reliably reproduce this, let me know and I'll fix it > up in QEMU. with : -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 # dmesg | grep numa [ 0.013106] numa: Node 0 CPUs: 0-1 [ 0.013136] numa: Node 1 CPUs: 2-3 # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id ibm,chip-id = <0x01>; ibm,chip-id = <0x02>; ibm,chip-id = <0x00>; ibm,chip-id = <0x03>; with : -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 # dmesg | grep numa [ 0.013106] numa: Node 0 CPUs: 0-1 [ 0.013136] numa: Node 1 CPUs: 2-3 # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id ibm,chip-id = <0x00>; ibm,chip-id = <0x00>; ibm,chip-id = <0x00>; ibm,chip-id = <0x00>; I think we should simply remove "ibm,chip-id" since it's not used and not in the PAPR spec. Thanks, C. > > Thanks, > > > DHB > > >> >>> It looks like the chip id is only used for localization purpose in >>> this case, right ? >> >> Yes and PAPR sources are not localized. So it's not used. MSI sources >> could be if we rewrote the MSI driver. >> >>> In this case, what about doing this change for pSeries only, >>> somewhere in spapr.c ? >> >> The IPI code is common to all platforms and all have the same issue. >> I rather not. >> >> Thanks, >> >> C. >> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org> >>>> --- >>>> arch/powerpc/sysdev/xive/common.c | 7 +------ >>>> 1 file changed, 1 insertion(+), 6 deletions(-) >>>> >>>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c >>>> index 595310e056f4..b8e456da28aa 100644 >>>> --- a/arch/powerpc/sysdev/xive/common.c >>>> +++ b/arch/powerpc/sysdev/xive/common.c >>>> @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu) >>>> xc = per_cpu(xive_cpu, cpu); >>>> if (!xc) { >>>> - struct device_node *np; >>>> - >>>> xc = kzalloc_node(sizeof(struct xive_cpu), >>>> GFP_KERNEL, cpu_to_node(cpu)); >>>> if (!xc) >>>> return -ENOMEM; >>>> - np = of_get_cpu_node(cpu, NULL); >>>> - if (np) >>>> - xc->chip_id = of_get_ibm_chip_id(np); >>>> - of_node_put(np); >>>> + xc->chip_id = cpu_to_node(cpu); >>>> xc->hw_ipi = XIVE_BAD_IRQ; >>>> per_cpu(xive_cpu, cpu) = xc; >>> >>
On Tue, 9 Mar 2021 18:26:35 +0100 Cédric Le Goater <clg@kaod.org> wrote: > On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: > > > > > > On 3/9/21 12:33 PM, Cédric Le Goater wrote: > >> On 3/8/21 6:13 PM, Greg Kurz wrote: > >>> On Wed, 3 Mar 2021 18:48:50 +0100 > >>> Cédric Le Goater <clg@kaod.org> wrote: > >>> > >>>> The 'chip_id' field of the XIVE CPU structure is used to choose a > >>>> target for a source located on the same chip when possible. This field > >>>> is assigned on the PowerNV platform using the "ibm,chip-id" property > >>>> on pSeries under KVM when NUMA nodes are defined but it is undefined > >>> > >>> This sentence seems to have a syntax problem... like it is missing an > >>> 'and' before 'on pSeries'. > >> > >> ah yes, or simply a comma. > >> > >>>> under PowerVM. The XIVE source structure has a similar field > >>>> 'src_chip' which is only assigned on the PowerNV platform. > >>>> > >>>> cpu_to_node() returns a compatible value on all platforms, 0 being the > >>>> default node. It will also give us the opportunity to set the affinity > >>>> of a source on pSeries when we can localize them. > >>>> > >>> > >>> IIUC this relies on the fact that the NUMA node id is == to chip id > >>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable > >>> with this change. > >> > >> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall > >> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel > >> in Cc:) > [...] > >> > >> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find > >> the node id. This value is built from the chip id in OPAL, so the > >> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" > >> property are unlikely to be different. > >> > >> cpu_to_node(cpu) is used in many places to allocate the structures > >> locally to the owning node. XIVE is not an exception (see below in the > >> same patch), it is better to be consistent and get the same information > >> (node id) using the same routine. > >> > >> > >> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : > >> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot > >> unifies the controllers of the system to only expose one the OS. This > >> is problematic and should be changed but it's another topic. > >> > >> > >>> On the other hand, you have the pSeries case under PowerVM that > >>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. > >> > >> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning > >> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid > >> chip id. > >> > >> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not > >> always correct btw) > > > > > > If you have a way to reliably reproduce this, let me know and I'll fix it > > up in QEMU. > > with : > > -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 > > # dmesg | grep numa > [ 0.013106] numa: Node 0 CPUs: 0-1 > [ 0.013136] numa: Node 1 CPUs: 2-3 > > # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id > ibm,chip-id = <0x01>; > ibm,chip-id = <0x02>; > ibm,chip-id = <0x00>; > ibm,chip-id = <0x03>; > > with : > > -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 > > # dmesg | grep numa > [ 0.013106] numa: Node 0 CPUs: 0-1 > [ 0.013136] numa: Node 1 CPUs: 2-3 > > # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id > ibm,chip-id = <0x00>; > ibm,chip-id = <0x00>; > ibm,chip-id = <0x00>; > ibm,chip-id = <0x00>; > > I think we should simply remove "ibm,chip-id" since it's not used and > not in the PAPR spec. As I mentioned to Daniel on our call this morning, oddly it *does* appear to be used in the RHEL kernel, even though that's 4.18 based. This patch seems to have caused a minor regression; not in the identification of NUMA nodes, but in the number of sockets shown be lscpu, etc. See https://bugzilla.redhat.com/show_bug.cgi?id=1934421 for more information. Since the value was used by some PAPR kernels - even if they shouldn't have - I think we should only remove this for newer machine types. We also need to check what we're not supplying that the guest kernel is showing a different number of sockets than specified on the qemu command line. > > Thanks, > > C. > > > > [...] > [...] > [...] > [...] > [...] > [...] > [...] > [...] > [...] >
On 3/12/21 2:55 AM, David Gibson wrote: > On Tue, 9 Mar 2021 18:26:35 +0100 > Cédric Le Goater <clg@kaod.org> wrote: > >> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: >>> >>> >>> On 3/9/21 12:33 PM, Cédric Le Goater wrote: >>>> On 3/8/21 6:13 PM, Greg Kurz wrote: >>>>> On Wed, 3 Mar 2021 18:48:50 +0100 >>>>> Cédric Le Goater <clg@kaod.org> wrote: >>>>> >>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a >>>>>> target for a source located on the same chip when possible. This field >>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property >>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined >>>>> >>>>> This sentence seems to have a syntax problem... like it is missing an >>>>> 'and' before 'on pSeries'. >>>> >>>> ah yes, or simply a comma. >>>> >>>>>> under PowerVM. The XIVE source structure has a similar field >>>>>> 'src_chip' which is only assigned on the PowerNV platform. >>>>>> >>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the >>>>>> default node. It will also give us the opportunity to set the affinity >>>>>> of a source on pSeries when we can localize them. >>>>>> >>>>> >>>>> IIUC this relies on the fact that the NUMA node id is == to chip id >>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable >>>>> with this change. >>>> >>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall >>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel >>>> in Cc:) >> [...] >>>> >>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find >>>> the node id. This value is built from the chip id in OPAL, so the >>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" >>>> property are unlikely to be different. >>>> >>>> cpu_to_node(cpu) is used in many places to allocate the structures >>>> locally to the owning node. XIVE is not an exception (see below in the >>>> same patch), it is better to be consistent and get the same information >>>> (node id) using the same routine. >>>> >>>> >>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : >>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot >>>> unifies the controllers of the system to only expose one the OS. This >>>> is problematic and should be changed but it's another topic. >>>> >>>> >>>>> On the other hand, you have the pSeries case under PowerVM that >>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. >>>> >>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning >>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid >>>> chip id. >>>> >>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not >>>> always correct btw) >>> >>> >>> If you have a way to reliably reproduce this, let me know and I'll fix it >>> up in QEMU. >> >> with : >> >> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >> >> # dmesg | grep numa >> [ 0.013106] numa: Node 0 CPUs: 0-1 >> [ 0.013136] numa: Node 1 CPUs: 2-3 >> >> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >> ibm,chip-id = <0x01>; >> ibm,chip-id = <0x02>; >> ibm,chip-id = <0x00>; >> ibm,chip-id = <0x03>; >> >> with : >> >> -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >> >> # dmesg | grep numa >> [ 0.013106] numa: Node 0 CPUs: 0-1 >> [ 0.013136] numa: Node 1 CPUs: 2-3 >> >> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >> ibm,chip-id = <0x00>; >> ibm,chip-id = <0x00>; >> ibm,chip-id = <0x00>; >> ibm,chip-id = <0x00>; >> >> I think we should simply remove "ibm,chip-id" since it's not used and >> not in the PAPR spec. > > As I mentioned to Daniel on our call this morning, oddly it *does* > appear to be used in the RHEL kernel, even though that's 4.18 based. > This patch seems to have caused a minor regression; not in the > identification of NUMA nodes, but in the number of sockets shown be > lscpu, etc. See https://bugzilla.redhat.com/show_bug.cgi?id=1934421 > for more information. Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 : [root@localhost ~]# lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 128 On-line CPU(s) list: 0-127 Thread(s) per core: 4 Core(s) per socket: 16 Socket(s): 2 NUMA node(s): 2 Model: 2.2 (pvr 004e 1202) Model name: POWER9 (architected), altivec supported Hypervisor vendor: KVM Virtualization type: para L1d cache: 32K L1i cache: 32K NUMA node0 CPU(s): 0-63 NUMA node1 CPU(s): 64-127 [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1 /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1 /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1 /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1 /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1 .... "ibm,chip-id" is still being used on some occasion on pSeries machines. This is wrong :/ The problem is : #define topology_physical_package_id(cpu) (cpu_to_chip_id(cpu)) We should be using cpu_to_node(). C. > > Since the value was used by some PAPR kernels - even if they shouldn't > have - I think we should only remove this for newer machine types. We > also need to check what we're not supplying that the guest kernel is > showing a different number of sockets than specified on the qemu > command line. > >> >> Thanks, >> >> C. >> >> >> >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> > >
On 3/12/21 6:53 AM, Cédric Le Goater wrote: > On 3/12/21 2:55 AM, David Gibson wrote: >> On Tue, 9 Mar 2021 18:26:35 +0100 >> Cédric Le Goater <clg@kaod.org> wrote: >> >>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: >>>> >>>> >>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote: >>>>> On 3/8/21 6:13 PM, Greg Kurz wrote: >>>>>> On Wed, 3 Mar 2021 18:48:50 +0100 >>>>>> Cédric Le Goater <clg@kaod.org> wrote: >>>>>> >>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a >>>>>>> target for a source located on the same chip when possible. This field >>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property >>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined >>>>>> >>>>>> This sentence seems to have a syntax problem... like it is missing an >>>>>> 'and' before 'on pSeries'. >>>>> >>>>> ah yes, or simply a comma. >>>>> >>>>>>> under PowerVM. The XIVE source structure has a similar field >>>>>>> 'src_chip' which is only assigned on the PowerNV platform. >>>>>>> >>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the >>>>>>> default node. It will also give us the opportunity to set the affinity >>>>>>> of a source on pSeries when we can localize them. >>>>>>> >>>>>> >>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id >>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable >>>>>> with this change. >>>>> >>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall >>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel >>>>> in Cc:) >>> [...] >>>>> >>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find >>>>> the node id. This value is built from the chip id in OPAL, so the >>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" >>>>> property are unlikely to be different. >>>>> >>>>> cpu_to_node(cpu) is used in many places to allocate the structures >>>>> locally to the owning node. XIVE is not an exception (see below in the >>>>> same patch), it is better to be consistent and get the same information >>>>> (node id) using the same routine. >>>>> >>>>> >>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : >>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot >>>>> unifies the controllers of the system to only expose one the OS. This >>>>> is problematic and should be changed but it's another topic. >>>>> >>>>> >>>>>> On the other hand, you have the pSeries case under PowerVM that >>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. >>>>> >>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning >>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid >>>>> chip id. >>>>> >>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not >>>>> always correct btw) >>>> >>>> >>>> If you have a way to reliably reproduce this, let me know and I'll fix it >>>> up in QEMU. >>> >>> with : >>> >>> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >>> >>> # dmesg | grep numa >>> [ 0.013106] numa: Node 0 CPUs: 0-1 >>> [ 0.013136] numa: Node 1 CPUs: 2-3 >>> >>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >>> ibm,chip-id = <0x01>; >>> ibm,chip-id = <0x02>; >>> ibm,chip-id = <0x00>; >>> ibm,chip-id = <0x03>; >>> >>> with : >>> >>> -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >>> >>> # dmesg | grep numa >>> [ 0.013106] numa: Node 0 CPUs: 0-1 >>> [ 0.013136] numa: Node 1 CPUs: 2-3 >>> >>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >>> ibm,chip-id = <0x00>; >>> ibm,chip-id = <0x00>; >>> ibm,chip-id = <0x00>; >>> ibm,chip-id = <0x00>; >>> >>> I think we should simply remove "ibm,chip-id" since it's not used and >>> not in the PAPR spec. >> >> As I mentioned to Daniel on our call this morning, oddly it *does* >> appear to be used in the RHEL kernel, even though that's 4.18 based. >> This patch seems to have caused a minor regression; not in the >> identification of NUMA nodes, but in the number of sockets shown be >> lscpu, etc. See https://bugzilla.redhat.com/show_bug.cgi?id=1934421 >> for more information. > > Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we > remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 : > > [root@localhost ~]# lscpu > Architecture: ppc64le > Byte Order: Little Endian > CPU(s): 128 > On-line CPU(s) list: 0-127 > Thread(s) per core: 4 > Core(s) per socket: 16 > Socket(s): 2 > NUMA node(s): 2 > Model: 2.2 (pvr 004e 1202) > Model name: POWER9 (architected), altivec supported > Hypervisor vendor: KVM > Virtualization type: para > L1d cache: 32K > L1i cache: 32K > NUMA node0 CPU(s): 0-63 > NUMA node1 CPU(s): 64-127 > > [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id > /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1 > /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1 > /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1 > /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1 > /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1 > .... > > "ibm,chip-id" is still being used on some occasion on pSeries machines. > This is wrong :/ The problem is : > > #define topology_physical_package_id(cpu) (cpu_to_chip_id(cpu)) > > We should be using cpu_to_node(). IIUC the "real fix" then is this change you mentioned above, together with this xive patch as well, to stop using ibm,chip-id for good in the pserie kernel. With these changes QEMU can remove 'ibm,chip-id' from the pseries machine without impact. Is this correct? If that's the case, then I believe it's ok to go forward with the QEMU side change (just for 6.0.0 and newer machines). Or should I wait for the kernel changes to be merged upstream first? Thanks, DHB > > C. > >> >> Since the value was used by some PAPR kernels - even if they shouldn't >> have - I think we should only remove this for newer machine types. We >> also need to check what we're not supplying that the guest kernel is >> showing a different number of sockets than specified on the qemu >> command line. >> >>> >>> Thanks, >>> >>> C. >>> >>> >>> >>> [...] >>> [...] >>> [...] >>> [...] >>> [...] >>> [...] >>> [...] >>> [...] >>> [...] >>> >> >> >
On Fri, 12 Mar 2021 09:18:39 -0300 Daniel Henrique Barboza <danielhb@linux.ibm.com> wrote: > > > On 3/12/21 6:53 AM, Cédric Le Goater wrote: > > On 3/12/21 2:55 AM, David Gibson wrote: > >> On Tue, 9 Mar 2021 18:26:35 +0100 > >> Cédric Le Goater <clg@kaod.org> wrote: > >> > >>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: > >>>> > >>>> > >>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote: > >>>>> On 3/8/21 6:13 PM, Greg Kurz wrote: > >>>>>> On Wed, 3 Mar 2021 18:48:50 +0100 > >>>>>> Cédric Le Goater <clg@kaod.org> wrote: > >>>>>> > >>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a > >>>>>>> target for a source located on the same chip when possible. This field > >>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property > >>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined > >>>>>> > >>>>>> This sentence seems to have a syntax problem... like it is missing an > >>>>>> 'and' before 'on pSeries'. > >>>>> > >>>>> ah yes, or simply a comma. > >>>>> > >>>>>>> under PowerVM. The XIVE source structure has a similar field > >>>>>>> 'src_chip' which is only assigned on the PowerNV platform. > >>>>>>> > >>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the > >>>>>>> default node. It will also give us the opportunity to set the affinity > >>>>>>> of a source on pSeries when we can localize them. > >>>>>>> > >>>>>> > >>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id > >>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable > >>>>>> with this change. > >>>>> > >>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall > >>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel > >>>>> in Cc:) > >>> [...] > >>>>> > >>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find > >>>>> the node id. This value is built from the chip id in OPAL, so the > >>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" > >>>>> property are unlikely to be different. > >>>>> > >>>>> cpu_to_node(cpu) is used in many places to allocate the structures > >>>>> locally to the owning node. XIVE is not an exception (see below in the > >>>>> same patch), it is better to be consistent and get the same information > >>>>> (node id) using the same routine. > >>>>> > >>>>> > >>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : > >>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot > >>>>> unifies the controllers of the system to only expose one the OS. This > >>>>> is problematic and should be changed but it's another topic. > >>>>> > >>>>> > >>>>>> On the other hand, you have the pSeries case under PowerVM that > >>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. > >>>>> > >>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning > >>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid > >>>>> chip id. > >>>>> > >>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not > >>>>> always correct btw) > >>>> > >>>> > >>>> If you have a way to reliably reproduce this, let me know and I'll fix it > >>>> up in QEMU. > >>> > >>> with : > >>> > >>> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 > >>> > >>> # dmesg | grep numa > >>> [ 0.013106] numa: Node 0 CPUs: 0-1 > >>> [ 0.013136] numa: Node 1 CPUs: 2-3 > >>> > >>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id > >>> ibm,chip-id = <0x01>; > >>> ibm,chip-id = <0x02>; > >>> ibm,chip-id = <0x00>; > >>> ibm,chip-id = <0x03>; > >>> > >>> with : > >>> > >>> -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 > >>> > >>> # dmesg | grep numa > >>> [ 0.013106] numa: Node 0 CPUs: 0-1 > >>> [ 0.013136] numa: Node 1 CPUs: 2-3 > >>> > >>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id > >>> ibm,chip-id = <0x00>; > >>> ibm,chip-id = <0x00>; > >>> ibm,chip-id = <0x00>; > >>> ibm,chip-id = <0x00>; > >>> > >>> I think we should simply remove "ibm,chip-id" since it's not used and > >>> not in the PAPR spec. > >> > >> As I mentioned to Daniel on our call this morning, oddly it *does* > >> appear to be used in the RHEL kernel, even though that's 4.18 based. > >> This patch seems to have caused a minor regression; not in the > >> identification of NUMA nodes, but in the number of sockets shown be > >> lscpu, etc. See https://bugzilla.redhat.com/show_bug.cgi?id=1934421 > >> for more information. > > > > Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we > > remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 : > > > > [root@localhost ~]# lscpu > > Architecture: ppc64le > > Byte Order: Little Endian > > CPU(s): 128 > > On-line CPU(s) list: 0-127 > > Thread(s) per core: 4 > > Core(s) per socket: 16 > > Socket(s): 2 > > NUMA node(s): 2 > > Model: 2.2 (pvr 004e 1202) > > Model name: POWER9 (architected), altivec supported > > Hypervisor vendor: KVM > > Virtualization type: para > > L1d cache: 32K > > L1i cache: 32K > > NUMA node0 CPU(s): 0-63 > > NUMA node1 CPU(s): 64-127 > > > > [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id > > /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1 > > /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1 > > /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1 > > /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1 > > /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1 > > .... > > > > "ibm,chip-id" is still being used on some occasion on pSeries machines. > > This is wrong :/ The problem is : > > > > #define topology_physical_package_id(cpu) (cpu_to_chip_id(cpu)) > > > > We should be using cpu_to_node(). > > > IIUC the "real fix" then is this change you mentioned above, together with > this xive patch as well, to stop using ibm,chip-id for good in the pserie > kernel. With these changes QEMU can remove 'ibm,chip-id' from the pseries > machine without impact. Is this correct? > > If that's the case, then I believe it's ok to go forward with the QEMU side > change (just for 6.0.0 and newer machines). Or should I wait for the kernel > changes to be merged upstream first? > I'd say the latter since this is a breaking change and people will want to identify the upstream commits they have to backport to their kernel in order to support the disappearance of "ibm,chip-id". Cheers, -- Greg > > Thanks, > > > DHB > > > > > > C. > > > >> > >> Since the value was used by some PAPR kernels - even if they shouldn't > >> have - I think we should only remove this for newer machine types. We > >> also need to check what we're not supplying that the guest kernel is > >> showing a different number of sockets than specified on the qemu > >> command line. > >> > >>> > >>> Thanks, > >>> > >>> C. > >>> > >>> > >>> > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> > >> > >> > >
On 3/12/21 1:18 PM, Daniel Henrique Barboza wrote: > > > On 3/12/21 6:53 AM, Cédric Le Goater wrote: >> On 3/12/21 2:55 AM, David Gibson wrote: >>> On Tue, 9 Mar 2021 18:26:35 +0100 >>> Cédric Le Goater <clg@kaod.org> wrote: >>> >>>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: >>>>> >>>>> >>>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote: >>>>>> On 3/8/21 6:13 PM, Greg Kurz wrote: >>>>>>> On Wed, 3 Mar 2021 18:48:50 +0100 >>>>>>> Cédric Le Goater <clg@kaod.org> wrote: >>>>>>> >>>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a >>>>>>>> target for a source located on the same chip when possible. This field >>>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property >>>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined >>>>>>> >>>>>>> This sentence seems to have a syntax problem... like it is missing an >>>>>>> 'and' before 'on pSeries'. >>>>>> >>>>>> ah yes, or simply a comma. >>>>>> >>>>>>>> under PowerVM. The XIVE source structure has a similar field >>>>>>>> 'src_chip' which is only assigned on the PowerNV platform. >>>>>>>> >>>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the >>>>>>>> default node. It will also give us the opportunity to set the affinity >>>>>>>> of a source on pSeries when we can localize them. >>>>>>>> >>>>>>> >>>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id >>>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable >>>>>>> with this change. >>>>>> >>>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall >>>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel >>>>>> in Cc:) >>>> [...] >>>>>> >>>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find >>>>>> the node id. This value is built from the chip id in OPAL, so the >>>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" >>>>>> property are unlikely to be different. >>>>>> >>>>>> cpu_to_node(cpu) is used in many places to allocate the structures >>>>>> locally to the owning node. XIVE is not an exception (see below in the >>>>>> same patch), it is better to be consistent and get the same information >>>>>> (node id) using the same routine. >>>>>> >>>>>> >>>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : >>>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot >>>>>> unifies the controllers of the system to only expose one the OS. This >>>>>> is problematic and should be changed but it's another topic. >>>>>> >>>>>> >>>>>>> On the other hand, you have the pSeries case under PowerVM that >>>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. >>>>>> >>>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning >>>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid >>>>>> chip id. >>>>>> >>>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not >>>>>> always correct btw) >>>>> >>>>> >>>>> If you have a way to reliably reproduce this, let me know and I'll fix it >>>>> up in QEMU. >>>> >>>> with : >>>> >>>> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >>>> >>>> # dmesg | grep numa >>>> [ 0.013106] numa: Node 0 CPUs: 0-1 >>>> [ 0.013136] numa: Node 1 CPUs: 2-3 >>>> >>>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >>>> ibm,chip-id = <0x01>; >>>> ibm,chip-id = <0x02>; >>>> ibm,chip-id = <0x00>; >>>> ibm,chip-id = <0x03>; >>>> >>>> with : >>>> >>>> -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >>>> >>>> # dmesg | grep numa >>>> [ 0.013106] numa: Node 0 CPUs: 0-1 >>>> [ 0.013136] numa: Node 1 CPUs: 2-3 >>>> >>>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >>>> ibm,chip-id = <0x00>; >>>> ibm,chip-id = <0x00>; >>>> ibm,chip-id = <0x00>; >>>> ibm,chip-id = <0x00>; >>>> >>>> I think we should simply remove "ibm,chip-id" since it's not used and >>>> not in the PAPR spec. >>> >>> As I mentioned to Daniel on our call this morning, oddly it *does* >>> appear to be used in the RHEL kernel, even though that's 4.18 based. >>> This patch seems to have caused a minor regression; not in the >>> identification of NUMA nodes, but in the number of sockets shown be >>> lscpu, etc. See https://bugzilla.redhat.com/show_bug.cgi?id=1934421 >>> for more information. >> >> Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we >> remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 : >> >> [root@localhost ~]# lscpu >> Architecture: ppc64le >> Byte Order: Little Endian >> CPU(s): 128 >> On-line CPU(s) list: 0-127 >> Thread(s) per core: 4 >> Core(s) per socket: 16 >> Socket(s): 2 >> NUMA node(s): 2 >> Model: 2.2 (pvr 004e 1202) >> Model name: POWER9 (architected), altivec supported >> Hypervisor vendor: KVM >> Virtualization type: para >> L1d cache: 32K >> L1i cache: 32K >> NUMA node0 CPU(s): 0-63 >> NUMA node1 CPU(s): 64-127 >> >> [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id >> /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1 >> /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1 >> /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1 >> /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1 >> /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1 >> .... >> >> "ibm,chip-id" is still being used on some occasion on pSeries machines. >> This is wrong :/ The problem is : >> >> #define topology_physical_package_id(cpu) (cpu_to_chip_id(cpu)) >> >> We should be using cpu_to_node(). > > > IIUC the "real fix" then is this change you mentioned above, together with > this xive patch as well, These are independent. The XIVE patch just raised the issue because it's another usage example of cpu_to_chip_id() or directly "ibm,chip-id" in the XIVE case, on a pseries machine. The use of cpu_to_node(cpu) for topology_physical_package_id(cpu) is a fix for the sysfs issue reported in the redhat BZ. > to stop using ibm,chip-id for good in the pserie > kernel. With these changes QEMU can remove 'ibm,chip-id' from the pseries > machine without impact. Is this correct? Linux is already "broken" on PowerVM today since we don't have the "ibm,chip-id" property. QEMU is just hiding the problem on KVM. But we have to be bug compatible :) if the QEMU fix is under the pseries-6.x machine we should be fine. > If that's the case, then I believe it's ok to go forward with the QEMU side > change (just for 6.0.0 and newer machines). Or should I wait for the kernel > changes to be merged upstream first? Once Linux is fixed, we shouldn't care if QEMU exports 'ibm,chip-id' or not. I don't think the order is very important. These are independent. C.
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index 595310e056f4..b8e456da28aa 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu) xc = per_cpu(xive_cpu, cpu); if (!xc) { - struct device_node *np; - xc = kzalloc_node(sizeof(struct xive_cpu), GFP_KERNEL, cpu_to_node(cpu)); if (!xc) return -ENOMEM; - np = of_get_cpu_node(cpu, NULL); - if (np) - xc->chip_id = of_get_ibm_chip_id(np); - of_node_put(np); + xc->chip_id = cpu_to_node(cpu); xc->hw_ipi = XIVE_BAD_IRQ; per_cpu(xive_cpu, cpu) = xc;
The 'chip_id' field of the XIVE CPU structure is used to choose a target for a source located on the same chip when possible. This field is assigned on the PowerNV platform using the "ibm,chip-id" property on pSeries under KVM when NUMA nodes are defined but it is undefined under PowerVM. The XIVE source structure has a similar field 'src_chip' which is only assigned on the PowerNV platform. cpu_to_node() returns a compatible value on all platforms, 0 being the default node. It will also give us the opportunity to set the affinity of a source on pSeries when we can localize them. Signed-off-by: Cédric Le Goater <clg@kaod.org> --- arch/powerpc/sysdev/xive/common.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-)