Patchwork RFC: spapr: introduce smt_cpu_index

login
register
mail settings
Submitter Alexey Kardashevskiy
Date Oct. 31, 2013, 9:28 a.m.
Message ID <1383211692-25020-1-git-send-email-aik@ozlabs.ru>
Download mbox | patch
Permalink /patch/287459/
State New
Headers show

Comments

Alexey Kardashevskiy - Oct. 31, 2013, 9:28 a.m.
This is not really a patch to accept or review, this is a conversation
starter.

Cc: Badari Pulavarty  <pbadari@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andreas Färber <afaerber@suse.de>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

---

Normall CPUState::cpu_index is used to pick the right CPU for various
operations. However default consecutive numbering does not always work
for PPC64.

For example, on POWER7 (which supports 4 threads per core),
"-smp 8,threads=4" should create CPUs with indexes 0,1,2,3,4,5,6,7 and
"-smp 8,threads=1" should create CPUs with indexes 0,4,8,12,16,20,24,28.

These indexes are reflected in /proc/device-tree/cpus/PowerPC,POWER7@XX
and used to call KVM VCPU's ioctl.
In order to achieve this, kvmppc_fixup_cpu() was introduced. Roughly,
it multiplies cpu_index by the number of threads per core.

The approach used to work till we hit NUMA bug. For example, if to run
QEMU with:

-smp 16,sockets=16,cores=1,threads=1
-numa node,nodeid=0,cpus=0-7,mem=500
-numa node,nodeid=1,cpus=8-15,mem=500

then CPUs 0 1 4 5 6 7 8 9 10 11 12 13 14 15 get assigned to node#0 and
only CPUs 2 and 3 are assigned to node#1. This happens because
kvmppc_fixup_cpu() does change cpu_index but does not change NUMA binding.

We could fix kvmppc_fixup_cpu() to update NUMA binding but the whole idea
of hacking cpu_index does not seem right as (for example) with
"-smp 8,threads=1" the user would expect that QEMU Monitor's "cpu"
command would accept indexes 0, 1, 2,.. but this is not true as
it accepts 0, 4, 8,...

And there is an existing mechanism to solve my issue which allows having
architecture specific CPU indexes via kvm_arch_vcpu_id(). However it is:

1) KVM only and does not have empty stub for TCG;

2) Reverse converter is needed to get CPUState from KVM's CPU index
(which does not exist) as sPAPR will use the KVM CPU index in various
hypercalls.

So I could:
1. fix kvmppc_fixup_cpu() to update NUMA binding;
2. Apply the patch below and make sure it does not break x86/ARM/s390
with and without KVM (does not it?);
3. Move kvm_arch_vcpu_id() + new kvm_arch_get_vcpu_by_id() out of KVM part
and make it platform specific with or without KVM.

What is the correct approach here? Thanks.



---
 hw/intc/xics_kvm.c          | 14 +++++++-------
 hw/ppc/spapr.c              |  9 +++++----
 hw/ppc/spapr_hcall.c        |  2 +-
 hw/ppc/spapr_rtas.c         |  4 ++--
 include/sysemu/kvm.h        |  1 +
 kvm-stub.c                  | 10 ++++++++++
 target-ppc/cpu.h            |  3 +++
 target-ppc/kvm.c            | 24 +++++++++++++++++++++---
 target-ppc/translate_init.c |  1 +
 9 files changed, 51 insertions(+), 17 deletions(-)
Paolo Bonzini - Oct. 31, 2013, 12:50 p.m.
Il 31/10/2013 10:28, Alexey Kardashevskiy ha scritto:
> This is not really a patch to accept or review, this is a conversation
> starter.
> 
> Cc: Badari Pulavarty  <pbadari@linux.vnet.ibm.com>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Andreas Färber <afaerber@suse.de>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> ---
> 
> Normall CPUState::cpu_index is used to pick the right CPU for various
> operations. However default consecutive numbering does not always work
> for PPC64.
> 
> For example, on POWER7 (which supports 4 threads per core),
> "-smp 8,threads=4" should create CPUs with indexes 0,1,2,3,4,5,6,7 and
> "-smp 8,threads=1" should create CPUs with indexes 0,4,8,12,16,20,24,28.
> 
> These indexes are reflected in /proc/device-tree/cpus/PowerPC,POWER7@XX
> and used to call KVM VCPU's ioctl.

I think these two uses should be separated, even if they use the same
formula.  Define a PPC-specific concept, used by TCG as well, and make
kvm_arch_vcpu_id use it.

In other words:

- the hw/ppc/spapr* changes should not use kvm_arch_vcpu_id and
kvm_arch_get_vcpu_by_id, but something like ppc_get_vcpu_dt_id and
ppc_get_vcpu_by_dt_id.

- the comment for the new field should be something like /* The CPU
index used in the device tree.  KVM uses this index too.  */

- kvm_arch_vcpu_id can be a simple wrapper for ppc_get_vcpu_dt_id; no
need for the new kvm_arch_get_vcpu_by_id, and no kvm-stub.c changes
should be necessary either.

- everything else seems fine.

> 2. Apply the patch below and make sure it does not break x86/ARM/s390
> with and without KVM (does not it?);

If you do it right (i.e. do not touch kvm-stub.c) it should "obviously"
not break anything outside PPC.

Paolo

Patch

diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
index c203646..c96ca4d 100644
--- a/hw/intc/xics_kvm.c
+++ b/hw/intc/xics_kvm.c
@@ -65,7 +65,7 @@  static void icp_get_kvm_state(ICPState *ss)
     ret = kvm_vcpu_ioctl(ss->cs, KVM_GET_ONE_REG, &reg);
     if (ret != 0) {
         error_report("Unable to retrieve KVM interrupt controller state"
-                " for CPU %d: %s", ss->cs->cpu_index, strerror(errno));
+                " for CPU %ld: %s", kvm_arch_vcpu_id(ss->cs), strerror(errno));
         exit(1);
     }
 
@@ -97,7 +97,7 @@  static int icp_set_kvm_state(ICPState *ss, int version_id)
     ret = kvm_vcpu_ioctl(ss->cs, KVM_SET_ONE_REG, &reg);
     if (ret != 0) {
         error_report("Unable to restore KVM interrupt controller state (0x%"
-                PRIx64 ") for CPU %d: %s", state, ss->cs->cpu_index,
+                PRIx64 ") for CPU %ld: %s", state, kvm_arch_vcpu_id(ss->cs),
                 strerror(errno));
         return ret;
     }
@@ -313,9 +313,9 @@  static void xics_kvm_cpu_setup(XICSState *icp, PowerPCCPU *cpu)
     KVMXICSState *icpkvm = KVM_XICS(icp);
 
     cs = CPU(cpu);
-    ss = &icp->ss[cs->cpu_index];
+    ss = &icp->ss[kvm_arch_vcpu_id(cs)];
 
-    assert(cs->cpu_index < icp->nr_servers);
+    assert(kvm_arch_vcpu_id(cs) < icp->nr_servers);
     if (icpkvm->kernel_xics_fd == -1) {
         abort();
     }
@@ -325,15 +325,15 @@  static void xics_kvm_cpu_setup(XICSState *icp, PowerPCCPU *cpu)
         struct kvm_enable_cap xics_enable_cap = {
             .cap = KVM_CAP_IRQ_XICS,
             .flags = 0,
-            .args = {icpkvm->kernel_xics_fd, cs->cpu_index, 0, 0},
+            .args = {icpkvm->kernel_xics_fd, kvm_arch_vcpu_id(cs), 0, 0},
         };
 
         ss->cs = cs;
 
         ret = kvm_vcpu_ioctl(ss->cs, KVM_ENABLE_CAP, &xics_enable_cap);
         if (ret < 0) {
-            error_report("Unable to connect CPU%d to kernel XICS: %s",
-                    cs->cpu_index, strerror(errno));
+            error_report("Unable to connect CPU%ld to kernel XICS: %s",
+                    kvm_arch_vcpu_id(cs), strerror(errno));
             exit(1);
         }
     }
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index f76b355..1e00982 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -206,19 +206,20 @@  static int spapr_fixup_cpu_dt(void *fdt, sPAPREnvironment *spapr)
 
     CPU_FOREACH(cpu) {
         DeviceClass *dc = DEVICE_GET_CLASS(cpu);
+        int smt_cpu_index = kvm_arch_vcpu_id(cpu);
         uint32_t associativity[] = {cpu_to_be32(0x5),
                                     cpu_to_be32(0x0),
                                     cpu_to_be32(0x0),
                                     cpu_to_be32(0x0),
                                     cpu_to_be32(cpu->numa_node),
-                                    cpu_to_be32(cpu->cpu_index)};
+                                    cpu_to_be32(smt_cpu_index)};
 
-        if ((cpu->cpu_index % smt) != 0) {
+        if ((smt_cpu_index % smt) != 0) {
             continue;
         }
 
         snprintf(cpu_model, 32, "/cpus/%s@%x", dc->fw_name,
-                 cpu->cpu_index);
+                 smt_cpu_index);
 
         offset = fdt_path_offset(fdt, cpu_model);
         if (offset < 0) {
@@ -367,7 +368,7 @@  static void *spapr_create_fdt_skel(hwaddr initrd_base,
         CPUPPCState *env = &cpu->env;
         DeviceClass *dc = DEVICE_GET_CLASS(cs);
         PowerPCCPUClass *pcc = POWERPC_CPU_GET_CLASS(cs);
-        int index = cs->cpu_index;
+        int index = kvm_arch_vcpu_id(cs);
         uint32_t servers_prop[smp_threads];
         uint32_t gservers_prop[smp_threads * 2];
         char *nodename;
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index f755a53..1742c17 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -466,7 +466,7 @@  static target_ulong h_register_vpa(PowerPCCPU *cpu, sPAPREnvironment *spapr,
     CPUPPCState *tenv;
     CPUState *tcpu;
 
-    tcpu = qemu_get_cpu(procno);
+    tcpu = kvm_arch_get_vcpu_by_id(procno);
     if (!tcpu) {
         return H_PARAMETER;
     }
diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
index eb542f2..90a16f1 100644
--- a/hw/ppc/spapr_rtas.c
+++ b/hw/ppc/spapr_rtas.c
@@ -139,7 +139,7 @@  static void rtas_query_cpu_stopped_state(PowerPCCPU *cpu_,
     }
 
     id = rtas_ld(args, 0);
-    cpu = qemu_get_cpu(id);
+    cpu = kvm_arch_get_vcpu_by_id(id);
     if (cpu != NULL) {
         if (cpu->halted) {
             rtas_st(rets, 1, 0);
@@ -172,7 +172,7 @@  static void rtas_start_cpu(PowerPCCPU *cpu_, sPAPREnvironment *spapr,
     start = rtas_ld(args, 1);
     r3 = rtas_ld(args, 2);
 
-    cs = qemu_get_cpu(id);
+    cs = kvm_arch_get_vcpu_by_id(id);
     if (cs != NULL) {
         PowerPCCPU *cpu = POWERPC_CPU(cs);
         CPUPPCState *env = &cpu->env;
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 3b25f27..6358ada 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -222,6 +222,7 @@  int kvm_arch_init_vcpu(CPUState *cpu);
 
 /* Returns VCPU ID to be used on KVM_CREATE_VCPU ioctl() */
 unsigned long kvm_arch_vcpu_id(CPUState *cpu);
+CPUState *kvm_arch_get_vcpu_by_id(int index);
 
 void kvm_arch_reset_vcpu(CPUState *cpu);
 
diff --git a/kvm-stub.c b/kvm-stub.c
index e979f76..9d2acab 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -147,3 +147,13 @@  int kvm_irqchip_remove_irqfd_notifier(KVMState *s, EventNotifier *n, int virq)
     return -ENOSYS;
 }
 #endif
+
+unsigned long kvm_arch_vcpu_id(CPUState *cpu)
+{
+    return cpu->cpu_index;
+}
+
+CPUState *kvm_arch_get_vcpu_by_id(int index)
+{
+    return qemu_get_cpu(index);
+}
diff --git a/target-ppc/cpu.h b/target-ppc/cpu.h
index 26acdba..1eb417e 100644
--- a/target-ppc/cpu.h
+++ b/target-ppc/cpu.h
@@ -1068,6 +1068,9 @@  struct CPUPPCState {
      */
     uint8_t fit_period[4];
     uint8_t wdt_period[4];
+
+    /* The CPU index used for KVM */
+    int smt_cpu_index;
 };
 
 #define SET_FIT_PERIOD(a_, b_, c_, d_)          \
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index 10d0cd9..8a8948a 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -399,9 +399,27 @@  static inline void kvm_fixup_page_sizes(PowerPCCPU *cpu)
 
 #endif /* !defined (TARGET_PPC64) */
 
-unsigned long kvm_arch_vcpu_id(CPUState *cpu)
+unsigned long kvm_arch_vcpu_id(CPUState *cs)
 {
-    return cpu->cpu_index;
+    PowerPCCPU *cpu = POWERPC_CPU(cs);
+    CPUPPCState *env = &cpu->env;
+
+    return env->smt_cpu_index;
+}
+
+CPUState *kvm_arch_get_vcpu_by_id(int index)
+{
+    CPUState *cs;
+
+    CPU_FOREACH(cs) {
+        PowerPCCPU *cpu = POWERPC_CPU(cs);
+        CPUPPCState *env = &cpu->env;
+        if (env->smt_cpu_index == index) {
+            return cs;
+        }
+    }
+
+    return NULL;
 }
 
 int kvm_arch_init_vcpu(CPUState *cs)
@@ -1773,7 +1791,7 @@  int kvmppc_fixup_cpu(PowerPCCPU *cpu)
 
     /* Adjust cpu index for SMT */
     smt = kvmppc_smt_threads();
-    cs->cpu_index = (cs->cpu_index / smp_threads) * smt
+    cpu->env.smt_cpu_index = (cs->cpu_index / smp_threads) * smt
         + (cs->cpu_index % smp_threads);
 
     return 0;
diff --git a/target-ppc/translate_init.c b/target-ppc/translate_init.c
index 3d3952c..6a7ebbd 100644
--- a/target-ppc/translate_init.c
+++ b/target-ppc/translate_init.c
@@ -8557,6 +8557,7 @@  static void ppc_cpu_initfn(Object *obj)
 
     cs->env_ptr = env;
     cpu_exec_init(env);
+    env->smt_cpu_index = cs->cpu_index;
 
     env->msr_mask = pcc->msr_mask;
     env->mmu_model = pcc->mmu_model;