[RFC,1/1] KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space

Message ID 70974cfb62a7f09a53ec914d2909639884228244.1523516498.git.sam.bobroff@au1.ibm.com
State Superseded
Headers show
Series
  • [RFC,1/1] KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
Related show

Commit Message

Sam Bobroff April 12, 2018, 7:02 a.m.
It is not currently possible to create the full number of possible
VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses less
threads per core than it's core stride (or "VSMT mode"). This is
because the VCORE ID and XIVE offsets to grow beyond KVM_MAX_VCPUS
even though the VCPU ID is less than KVM_MAX_VCPU_ID.

To address this, "pack" the VCORE ID and XIVE offsets by using
knowledge of the way the VCPU IDs will be used when there are less
guest threads per core than the core stride. The primary thread of
each core will always be used first. Then, if the guest uses more than
one thread per core, these secondary threads will sequentially follow
the primary in each core.

So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
VCPUs are being spaced apart, so at least half of each core is empty
and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
into the second half of each core (4..7, in an 8-thread core).

Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
each core is being left empty, and we can map down into the second and
third quarters of each core (2, 3 and 5, 6 in an 8-thread core).

Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
threads are being used and 7/8 of the core is empty, allowing use of
the 1, 3, 5 and 7 thread slots.

(Strides less than 8 are handled similarly.)

This allows the VCORE ID or offset to be calculated quickly from the
VCPU ID or XIVE server numbers, without access to the VCPU structure.

Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
---
Hello everyone,

I've tested this on P8 and P9, in lots of combinations of host and guest
threading modes and it has been fine but it does feel like a "tricky"
approach, so I still feel somewhat wary about it.

I've posted it as an RFC because I have not tested it with guest native-XIVE,
and I suspect that it will take some work to support it.

 arch/powerpc/include/asm/kvm_book3s.h | 19 +++++++++++++++++++
 arch/powerpc/kvm/book3s_hv.c          | 14 ++++++++++----
 arch/powerpc/kvm/book3s_xive.c        |  9 +++++++--
 3 files changed, 36 insertions(+), 6 deletions(-)

Comments

David Gibson April 16, 2018, 4:09 a.m. | #1
On Thu, Apr 12, 2018 at 05:02:06PM +1000, Sam Bobroff wrote:
> It is not currently possible to create the full number of possible
> VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses less
> threads per core than it's core stride (or "VSMT mode"). This is
> because the VCORE ID and XIVE offsets to grow beyond KVM_MAX_VCPUS
> even though the VCPU ID is less than KVM_MAX_VCPU_ID.
> 
> To address this, "pack" the VCORE ID and XIVE offsets by using
> knowledge of the way the VCPU IDs will be used when there are less
> guest threads per core than the core stride. The primary thread of
> each core will always be used first. Then, if the guest uses more than
> one thread per core, these secondary threads will sequentially follow
> the primary in each core.
> 
> So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
> VCPUs are being spaced apart, so at least half of each core is empty
> and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
> into the second half of each core (4..7, in an 8-thread core).
> 
> Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
> each core is being left empty, and we can map down into the second and
> third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
> 
> Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
> threads are being used and 7/8 of the core is empty, allowing use of
> the 1, 3, 5 and 7 thread slots.
> 
> (Strides less than 8 are handled similarly.)
> 
> This allows the VCORE ID or offset to be calculated quickly from the
> VCPU ID or XIVE server numbers, without access to the VCPU structure.
> 
> Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
> ---
> Hello everyone,
> 
> I've tested this on P8 and P9, in lots of combinations of host and guest
> threading modes and it has been fine but it does feel like a "tricky"
> approach, so I still feel somewhat wary about it.
> 
> I've posted it as an RFC because I have not tested it with guest native-XIVE,
> and I suspect that it will take some work to support it.
> 
>  arch/powerpc/include/asm/kvm_book3s.h | 19 +++++++++++++++++++
>  arch/powerpc/kvm/book3s_hv.c          | 14 ++++++++++----
>  arch/powerpc/kvm/book3s_xive.c        |  9 +++++++--
>  3 files changed, 36 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
> index 376ae803b69c..1295056d564a 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -368,4 +368,23 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
>  #define SPLIT_HACK_MASK			0xff000000
>  #define SPLIT_HACK_OFFS			0xfb000000
>  
> +/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
> + * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core stride
> + * (but not it's actual threading mode, which is not available) to avoid
> + * collisions.
> + */
> +static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
> +{
> +	const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};

I'd suggest 1,3,5,7 at the end rather than 1,5,3,7 - accomplishes
roughly the same thing, but I think makes the pattern more obvious.

> +	int stride = kvm->arch.emul_smt_mode > 1 ?
> +		     kvm->arch.emul_smt_mode : kvm->arch.smt_mode;

AFAICT from BUG_ON()s etc. at the callsites, kvm->arch.smt_mode must
always be 1 when this is called, so the conditional here doesn't seem
useful.

> +	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
> +	u32 packed_id;
> +
> +	BUG_ON(block >= MAX_SMT_THREADS);
> +	packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
> +	BUG_ON(packed_id >= KVM_MAX_VCPUS);
> +	return packed_id;
> +}

It took me a while to wrap my head around the packing function, but I
think I got there in the end.  It's pretty clever.

One thing bothers me, though.  This certainly packs things under
KVM_MAX_VCPUS, but not necessarily under the actual number of vcpus.
e.g. KVM_MAC_VCPUS==16, 8 vcpus total, stride 8, 2 vthreads/vcore (as
qemu sees it), gives both unpacked IDs (0, 1, 8, 9, 16, 17, 24, 25)
and packed ids of (0, 1, 8, 9, 4, 5, 12, 13) - leaving 2, 3, 6, 7
etc. unused.

So again, the question is what exactly are these remapped IDs useful
for.  If we're indexing into a bare array of structures of size
KVM_MAX_VCPUS then we're *already* wasting a bunch of space by having
more entries than vcpus.  If we're indexing into something sparser,
then why is the remapping worthwhile?



> +
>  #endif /* __ASM_KVM_BOOK3S_H__ */
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 9cb9448163c4..49165cc90051 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1762,7 +1762,7 @@ static int threads_per_vcore(struct kvm *kvm)
>  	return threads_per_subcore;
>  }
>  
> -static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
>  {
>  	struct kvmppc_vcore *vcore;
>  
> @@ -1776,7 +1776,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
>  	init_swait_queue_head(&vcore->wq);
>  	vcore->preempt_tb = TB_NIL;
>  	vcore->lpcr = kvm->arch.lpcr;
> -	vcore->first_vcpuid = core * kvm->arch.smt_mode;
> +	vcore->first_vcpuid = id;
>  	vcore->kvm = kvm;
>  	INIT_LIST_HEAD(&vcore->preempt_list);
>  
> @@ -1992,12 +1992,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
>  	mutex_lock(&kvm->lock);
>  	vcore = NULL;
>  	err = -EINVAL;
> -	core = id / kvm->arch.smt_mode;
> +	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
> +		BUG_ON(kvm->arch.smt_mode != 1);
> +		core = kvmppc_pack_vcpu_id(kvm, id);
> +	} else {
> +		core = id / kvm->arch.smt_mode;
> +	}
>  	if (core < KVM_MAX_VCORES) {
>  		vcore = kvm->arch.vcores[core];
> +		BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
>  		if (!vcore) {
>  			err = -ENOMEM;
> -			vcore = kvmppc_vcore_create(kvm, core);
> +			vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
>  			kvm->arch.vcores[core] = vcore;
>  			kvm->arch.online_vcores++;
>  		}
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index f9818d7d3381..681dfe12a5f3 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -317,6 +317,11 @@ static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
>  	return -EBUSY;
>  }
>  
> +static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
> +{
> +	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
> +}
> +

I'm finding the XIVE indexing really baffling.  There are a bunch of
other places where the code uses (xive->vp_base + NUMBER) directly.
If those are host side references, I guess they don't need updates for
this.

But if that's the case, then how does indexing into the same array
with both host and guest server numbers make sense?

>  static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
>  			     struct kvmppc_xive_src_block *sb,
>  			     struct kvmppc_xive_irq_state *state)
> @@ -1084,7 +1089,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>  		pr_devel("Duplicate !\n");
>  		return -EEXIST;
>  	}
> -	if (cpu >= KVM_MAX_VCPUS) {
> +	if (cpu >= KVM_MAX_VCPU_ID) {
>  		pr_devel("Out of bounds !\n");
>  		return -EINVAL;
>  	}
> @@ -1098,7 +1103,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>  	xc->xive = xive;
>  	xc->vcpu = vcpu;
>  	xc->server_num = cpu;
> -	xc->vp_id = xive->vp_base + cpu;
> +	xc->vp_id = xive_vp(xive, cpu);
>  	xc->mfrr = 0xff;
>  	xc->valid = true;
>
Cédric Le Goater April 23, 2018, 9:06 a.m. | #2
On 04/16/2018 06:09 AM, David Gibson wrote:
> On Thu, Apr 12, 2018 at 05:02:06PM +1000, Sam Bobroff wrote:
>> It is not currently possible to create the full number of possible
>> VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses less
>> threads per core than it's core stride (or "VSMT mode"). This is
>> because the VCORE ID and XIVE offsets to grow beyond KVM_MAX_VCPUS
>> even though the VCPU ID is less than KVM_MAX_VCPU_ID.
>>
>> To address this, "pack" the VCORE ID and XIVE offsets by using
>> knowledge of the way the VCPU IDs will be used when there are less
>> guest threads per core than the core stride. The primary thread of
>> each core will always be used first. Then, if the guest uses more than
>> one thread per core, these secondary threads will sequentially follow
>> the primary in each core.
>>
>> So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
>> VCPUs are being spaced apart, so at least half of each core is empty
>> and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
>> into the second half of each core (4..7, in an 8-thread core).
>>
>> Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
>> each core is being left empty, and we can map down into the second and
>> third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
>>
>> Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
>> threads are being used and 7/8 of the core is empty, allowing use of
>> the 1, 3, 5 and 7 thread slots.
>>
>> (Strides less than 8 are handled similarly.)
>>
>> This allows the VCORE ID or offset to be calculated quickly from the
>> VCPU ID or XIVE server numbers, without access to the VCPU structure.
>>
>> Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
>> ---
>> Hello everyone,
>>
>> I've tested this on P8 and P9, in lots of combinations of host and guest
>> threading modes and it has been fine but it does feel like a "tricky"
>> approach, so I still feel somewhat wary about it.

Have you done any migration ? 

>> I've posted it as an RFC because I have not tested it with guest native-XIVE,
>> and I suspect that it will take some work to support it.

The KVM XIVE device will be different for XIVE exploitation mode, same structures 
though. I will send a patchset shortly. 

>>  arch/powerpc/include/asm/kvm_book3s.h | 19 +++++++++++++++++++
>>  arch/powerpc/kvm/book3s_hv.c          | 14 ++++++++++----
>>  arch/powerpc/kvm/book3s_xive.c        |  9 +++++++--
>>  3 files changed, 36 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
>> index 376ae803b69c..1295056d564a 100644
>> --- a/arch/powerpc/include/asm/kvm_book3s.h
>> +++ b/arch/powerpc/include/asm/kvm_book3s.h
>> @@ -368,4 +368,23 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
>>  #define SPLIT_HACK_MASK			0xff000000
>>  #define SPLIT_HACK_OFFS			0xfb000000
>>  
>> +/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
>> + * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core stride
>> + * (but not it's actual threading mode, which is not available) to avoid
>> + * collisions.
>> + */
>> +static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
>> +{
>> +	const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
> 
> I'd suggest 1,3,5,7 at the end rather than 1,5,3,7 - accomplishes
> roughly the same thing, but I think makes the pattern more obvious.
> 
>> +	int stride = kvm->arch.emul_smt_mode > 1 ?
>> +		     kvm->arch.emul_smt_mode : kvm->arch.smt_mode;
> 
> AFAICT from BUG_ON()s etc. at the callsites, kvm->arch.smt_mode must
> always be 1 when this is called, so the conditional here doesn't seem
> useful.
> 
>> +	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
>> +	u32 packed_id;
>> +
>> +	BUG_ON(block >= MAX_SMT_THREADS);
>> +	packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
>> +	BUG_ON(packed_id >= KVM_MAX_VCPUS);
>> +	return packed_id;
>> +}
> 
> It took me a while to wrap my head around the packing function, but I
> think I got there in the end.  It's pretty clever.
> 
> One thing bothers me, though.  This certainly packs things under
> KVM_MAX_VCPUS, but not necessarily under the actual number of vcpus.
> e.g. KVM_MAC_VCPUS==16, 8 vcpus total, stride 8, 2 vthreads/vcore (as
> qemu sees it), gives both unpacked IDs (0, 1, 8, 9, 16, 17, 24, 25)
> and packed ids of (0, 1, 8, 9, 4, 5, 12, 13) - leaving 2, 3, 6, 7
> etc. unused.
> 
> So again, the question is what exactly are these remapped IDs useful
> for.  If we're indexing into a bare array of structures of size
> KVM_MAX_VCPUS then we're *already* wasting a bunch of space by having
> more entries than vcpus.  If we're indexing into something sparser,
> then why is the remapping worthwhile?
> 
> 
> 
>> +
>>  #endif /* __ASM_KVM_BOOK3S_H__ */
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 9cb9448163c4..49165cc90051 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -1762,7 +1762,7 @@ static int threads_per_vcore(struct kvm *kvm)
>>  	return threads_per_subcore;
>>  }
>>  
>> -static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
>> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
>>  {
>>  	struct kvmppc_vcore *vcore;
>>  
>> @@ -1776,7 +1776,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
>>  	init_swait_queue_head(&vcore->wq);
>>  	vcore->preempt_tb = TB_NIL;
>>  	vcore->lpcr = kvm->arch.lpcr;
>> -	vcore->first_vcpuid = core * kvm->arch.smt_mode;
>> +	vcore->first_vcpuid = id;
>>  	vcore->kvm = kvm;
>>  	INIT_LIST_HEAD(&vcore->preempt_list);
>>  
>> @@ -1992,12 +1992,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
>>  	mutex_lock(&kvm->lock);
>>  	vcore = NULL;
>>  	err = -EINVAL;
>> -	core = id / kvm->arch.smt_mode;
>> +	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
>> +		BUG_ON(kvm->arch.smt_mode != 1);
>> +		core = kvmppc_pack_vcpu_id(kvm, id);
>> +	} else {
>> +		core = id / kvm->arch.smt_mode;
>> +	}
>>  	if (core < KVM_MAX_VCORES) {
>>  		vcore = kvm->arch.vcores[core];
>> +		BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
>>  		if (!vcore) {
>>  			err = -ENOMEM;
>> -			vcore = kvmppc_vcore_create(kvm, core);
>> +			vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
>>  			kvm->arch.vcores[core] = vcore;
>>  			kvm->arch.online_vcores++;
>>  		}
>> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
>> index f9818d7d3381..681dfe12a5f3 100644
>> --- a/arch/powerpc/kvm/book3s_xive.c
>> +++ b/arch/powerpc/kvm/book3s_xive.c
>> @@ -317,6 +317,11 @@ static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
>>  	return -EBUSY;
>>  }
>>  
>> +static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
>> +{
>> +	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
>> +}
>> +
> 
> I'm finding the XIVE indexing really baffling.  There are a bunch of
> other places where the code uses (xive->vp_base + NUMBER) directly.

This links the QEMU vCPU server NUMBER to a XIVE virtual processor number 
in OPAL. So we need to check that all used NUMBERs are, first, consistent 
and then, in the correct range.

> If those are host side references, I guess they don't need updates for
> this.

> But if that's the case, then how does indexing into the same array
> with both host and guest server numbers make sense?

yes. VPs are allocated with KVM_MAX_VCPUS :

	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);

but

	#define KVM_MAX_VCPU_ID  (threads_per_subcore * KVM_MAX_VCORES)

WE would need to change the allocation of the VPs I guess.

>>  static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
>>  			     struct kvmppc_xive_src_block *sb,
>>  			     struct kvmppc_xive_irq_state *state)
>> @@ -1084,7 +1089,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>>  		pr_devel("Duplicate !\n");
>>  		return -EEXIST;
>>  	}
>> -	if (cpu >= KVM_MAX_VCPUS) {
>> +	if (cpu >= KVM_MAX_VCPU_ID) {>>
>>  		pr_devel("Out of bounds !\n");
>>  		return -EINVAL;
>>  	}
>> @@ -1098,7 +1103,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>>  	xc->xive = xive;
>>  	xc->vcpu = vcpu;
>>  	xc->server_num = cpu;
>> -	xc->vp_id = xive->vp_base + cpu;
>> +	xc->vp_id = xive_vp(xive, cpu);
>>  	xc->mfrr = 0xff;
>>  	xc->valid = true;
>>  
>
Sam Bobroff April 24, 2018, 3:19 a.m. | #3
On Mon, Apr 23, 2018 at 11:06:35AM +0200, Cédric Le Goater wrote:
> On 04/16/2018 06:09 AM, David Gibson wrote:
> > On Thu, Apr 12, 2018 at 05:02:06PM +1000, Sam Bobroff wrote:
> >> It is not currently possible to create the full number of possible
> >> VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses less
> >> threads per core than it's core stride (or "VSMT mode"). This is
> >> because the VCORE ID and XIVE offsets to grow beyond KVM_MAX_VCPUS
> >> even though the VCPU ID is less than KVM_MAX_VCPU_ID.
> >>
> >> To address this, "pack" the VCORE ID and XIVE offsets by using
> >> knowledge of the way the VCPU IDs will be used when there are less
> >> guest threads per core than the core stride. The primary thread of
> >> each core will always be used first. Then, if the guest uses more than
> >> one thread per core, these secondary threads will sequentially follow
> >> the primary in each core.
> >>
> >> So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
> >> VCPUs are being spaced apart, so at least half of each core is empty
> >> and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
> >> into the second half of each core (4..7, in an 8-thread core).
> >>
> >> Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
> >> each core is being left empty, and we can map down into the second and
> >> third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
> >>
> >> Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
> >> threads are being used and 7/8 of the core is empty, allowing use of
> >> the 1, 3, 5 and 7 thread slots.
> >>
> >> (Strides less than 8 are handled similarly.)
> >>
> >> This allows the VCORE ID or offset to be calculated quickly from the
> >> VCPU ID or XIVE server numbers, without access to the VCPU structure.
> >>
> >> Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
> >> ---
> >> Hello everyone,
> >>
> >> I've tested this on P8 and P9, in lots of combinations of host and guest
> >> threading modes and it has been fine but it does feel like a "tricky"
> >> approach, so I still feel somewhat wary about it.
> 
> Have you done any migration ? 

No, but I will :-)

> >> I've posted it as an RFC because I have not tested it with guest native-XIVE,
> >> and I suspect that it will take some work to support it.
> 
> The KVM XIVE device will be different for XIVE exploitation mode, same structures 
> though. I will send a patchset shortly. 

Great. This is probably where conflicts between the host and guest
numbers will show up. (See dwg's question below.)

> >>  arch/powerpc/include/asm/kvm_book3s.h | 19 +++++++++++++++++++
> >>  arch/powerpc/kvm/book3s_hv.c          | 14 ++++++++++----
> >>  arch/powerpc/kvm/book3s_xive.c        |  9 +++++++--
> >>  3 files changed, 36 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
> >> index 376ae803b69c..1295056d564a 100644
> >> --- a/arch/powerpc/include/asm/kvm_book3s.h
> >> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> >> @@ -368,4 +368,23 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
> >>  #define SPLIT_HACK_MASK			0xff000000
> >>  #define SPLIT_HACK_OFFS			0xfb000000
> >>  
> >> +/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
> >> + * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core stride
> >> + * (but not it's actual threading mode, which is not available) to avoid
> >> + * collisions.
> >> + */
> >> +static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
> >> +{
> >> +	const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
> > 
> > I'd suggest 1,3,5,7 at the end rather than 1,5,3,7 - accomplishes
> > roughly the same thing, but I think makes the pattern more obvious.

OK.

> >> +	int stride = kvm->arch.emul_smt_mode > 1 ?
> >> +		     kvm->arch.emul_smt_mode : kvm->arch.smt_mode;
> > 
> > AFAICT from BUG_ON()s etc. at the callsites, kvm->arch.smt_mode must
> > always be 1 when this is called, so the conditional here doesn't seem
> > useful.

Ah yes, right. (That was an older version when I was thinking of using
it for P8 as well but that didn't seem to be a good idea.)

> >> +	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
> >> +	u32 packed_id;
> >> +
> >> +	BUG_ON(block >= MAX_SMT_THREADS);
> >> +	packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
> >> +	BUG_ON(packed_id >= KVM_MAX_VCPUS);
> >> +	return packed_id;
> >> +}
> > 
> > It took me a while to wrap my head around the packing function, but I
> > think I got there in the end.  It's pretty clever.

Thanks, I'll try to add a better description as well :-)

> > One thing bothers me, though.  This certainly packs things under
> > KVM_MAX_VCPUS, but not necessarily under the actual number of vcpus.
> > e.g. KVM_MAC_VCPUS==16, 8 vcpus total, stride 8, 2 vthreads/vcore (as
> > qemu sees it), gives both unpacked IDs (0, 1, 8, 9, 16, 17, 24, 25)
> > and packed ids of (0, 1, 8, 9, 4, 5, 12, 13) - leaving 2, 3, 6, 7
> > etc. unused.

That's right. The property it provides is that all the numbers are under
KVM_MAX_VCPUS (which, see below, is the size of the fixed areas) not
that they are sequential.

> > So again, the question is what exactly are these remapped IDs useful
> > for.  If we're indexing into a bare array of structures of size
> > KVM_MAX_VCPUS then we're *already* wasting a bunch of space by having
> > more entries than vcpus.  If we're indexing into something sparser,
> > then why is the remapping worthwhile?

Well, here's my thinking:

At the moment, kvm->vcores[] and xive->vp_base are both sized by NR_CPUS
(via KVM_MAX_VCPUS and KVM_MAX_VCORES which are both NR_CPUS). This is
enough space for the maximum number of VCPUs, and some space is wasted
when the guest uses less than this (but KVM doesn't know how many will
be created, so we can't do better easily). The problem is that the
indicies overflow before all of those VCPUs can be created, not that
more space is needed.

We could fix the overflow by expanding these areas to KVM_MAX_VCPU_ID
but that will use 8x the space we use now, and we know that no more than
KVM_MAX_VCPUS will be used so all this new space is basically wasted.

So remapping seems better if it will work. (Ben H. was strongly against
wasting more XIVE space if possible.)

In short, remapping provides a way to allow the guest to create it's full set
of VCPUs without wasting any more space than we do currently, without
having to do something more complicated like tracking used IDs or adding
additional KVM CAPs.

> >> +
> >>  #endif /* __ASM_KVM_BOOK3S_H__ */
> >> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> >> index 9cb9448163c4..49165cc90051 100644
> >> --- a/arch/powerpc/kvm/book3s_hv.c
> >> +++ b/arch/powerpc/kvm/book3s_hv.c
> >> @@ -1762,7 +1762,7 @@ static int threads_per_vcore(struct kvm *kvm)
> >>  	return threads_per_subcore;
> >>  }
> >>  
> >> -static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> >> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
> >>  {
> >>  	struct kvmppc_vcore *vcore;
> >>  
> >> @@ -1776,7 +1776,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> >>  	init_swait_queue_head(&vcore->wq);
> >>  	vcore->preempt_tb = TB_NIL;
> >>  	vcore->lpcr = kvm->arch.lpcr;
> >> -	vcore->first_vcpuid = core * kvm->arch.smt_mode;
> >> +	vcore->first_vcpuid = id;
> >>  	vcore->kvm = kvm;
> >>  	INIT_LIST_HEAD(&vcore->preempt_list);
> >>  
> >> @@ -1992,12 +1992,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
> >>  	mutex_lock(&kvm->lock);
> >>  	vcore = NULL;
> >>  	err = -EINVAL;
> >> -	core = id / kvm->arch.smt_mode;
> >> +	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
> >> +		BUG_ON(kvm->arch.smt_mode != 1);
> >> +		core = kvmppc_pack_vcpu_id(kvm, id);
> >> +	} else {
> >> +		core = id / kvm->arch.smt_mode;
> >> +	}
> >>  	if (core < KVM_MAX_VCORES) {
> >>  		vcore = kvm->arch.vcores[core];
> >> +		BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
> >>  		if (!vcore) {
> >>  			err = -ENOMEM;
> >> -			vcore = kvmppc_vcore_create(kvm, core);
> >> +			vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
> >>  			kvm->arch.vcores[core] = vcore;
> >>  			kvm->arch.online_vcores++;
> >>  		}
> >> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> >> index f9818d7d3381..681dfe12a5f3 100644
> >> --- a/arch/powerpc/kvm/book3s_xive.c
> >> +++ b/arch/powerpc/kvm/book3s_xive.c
> >> @@ -317,6 +317,11 @@ static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
> >>  	return -EBUSY;
> >>  }
> >>  
> >> +static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
> >> +{
> >> +	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
> >> +}
> >> +
> > 
> > I'm finding the XIVE indexing really baffling.  There are a bunch of
> > other places where the code uses (xive->vp_base + NUMBER) directly.

Ugh, yes. It looks like I botched part of my final cleanup and all the
cases you saw in kvm/book3s_xive.c should have been replaced with a call to
xive_vp(). I'll fix it and sorry for the confusion.

> This links the QEMU vCPU server NUMBER to a XIVE virtual processor number 
> in OPAL. So we need to check that all used NUMBERs are, first, consistent 
> and then, in the correct range.

Right. My approach was to allow XIVE to keep using server numbers that
are equal to VCPU IDs, and just pack down the ID before indexing into
the vp_base area.

> > If those are host side references, I guess they don't need updates for
> > this.

These are all guest side references.

> > But if that's the case, then how does indexing into the same array
> > with both host and guest server numbers make sense?

Right, it doesn't make sense to mix host and guest server numbers when
we're remapping only the guest ones, but in this case (without native
guest XIVE support) it's just guest ones.

> yes. VPs are allocated with KVM_MAX_VCPUS :
> 
> 	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
> 
> but
> 
> 	#define KVM_MAX_VCPU_ID  (threads_per_subcore * KVM_MAX_VCORES)
> 
> WE would need to change the allocation of the VPs I guess.

Yes, this is one of the structures that overflow if we don't pack the IDs.

> >>  static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
> >>  			     struct kvmppc_xive_src_block *sb,
> >>  			     struct kvmppc_xive_irq_state *state)
> >> @@ -1084,7 +1089,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
> >>  		pr_devel("Duplicate !\n");
> >>  		return -EEXIST;
> >>  	}
> >> -	if (cpu >= KVM_MAX_VCPUS) {
> >> +	if (cpu >= KVM_MAX_VCPU_ID) {>>
> >>  		pr_devel("Out of bounds !\n");
> >>  		return -EINVAL;
> >>  	}
> >> @@ -1098,7 +1103,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
> >>  	xc->xive = xive;
> >>  	xc->vcpu = vcpu;
> >>  	xc->server_num = cpu;
> >> -	xc->vp_id = xive->vp_base + cpu;
> >> +	xc->vp_id = xive_vp(xive, cpu);
> >>  	xc->mfrr = 0xff;
> >>  	xc->valid = true;
> >>  
> > 
>
David Gibson April 24, 2018, 3:48 a.m. | #4
On Tue, Apr 24, 2018 at 01:19:15PM +1000, Sam Bobroff wrote:
> On Mon, Apr 23, 2018 at 11:06:35AM +0200, Cédric Le Goater wrote:
> > On 04/16/2018 06:09 AM, David Gibson wrote:
> > > On Thu, Apr 12, 2018 at 05:02:06PM +1000, Sam Bobroff wrote:
> > >> It is not currently possible to create the full number of possible
> > >> VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses less
> > >> threads per core than it's core stride (or "VSMT mode"). This is
> > >> because the VCORE ID and XIVE offsets to grow beyond KVM_MAX_VCPUS
> > >> even though the VCPU ID is less than KVM_MAX_VCPU_ID.
> > >>
> > >> To address this, "pack" the VCORE ID and XIVE offsets by using
> > >> knowledge of the way the VCPU IDs will be used when there are less
> > >> guest threads per core than the core stride. The primary thread of
> > >> each core will always be used first. Then, if the guest uses more than
> > >> one thread per core, these secondary threads will sequentially follow
> > >> the primary in each core.
> > >>
> > >> So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
> > >> VCPUs are being spaced apart, so at least half of each core is empty
> > >> and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
> > >> into the second half of each core (4..7, in an 8-thread core).
> > >>
> > >> Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
> > >> each core is being left empty, and we can map down into the second and
> > >> third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
> > >>
> > >> Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
> > >> threads are being used and 7/8 of the core is empty, allowing use of
> > >> the 1, 3, 5 and 7 thread slots.
> > >>
> > >> (Strides less than 8 are handled similarly.)
> > >>
> > >> This allows the VCORE ID or offset to be calculated quickly from the
> > >> VCPU ID or XIVE server numbers, without access to the VCPU structure.
> > >>
> > >> Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
> > >> ---
> > >> Hello everyone,
> > >>
> > >> I've tested this on P8 and P9, in lots of combinations of host and guest
> > >> threading modes and it has been fine but it does feel like a "tricky"
> > >> approach, so I still feel somewhat wary about it.
> > 
> > Have you done any migration ? 
> 
> No, but I will :-)
> 
> > >> I've posted it as an RFC because I have not tested it with guest native-XIVE,
> > >> and I suspect that it will take some work to support it.
> > 
> > The KVM XIVE device will be different for XIVE exploitation mode, same structures 
> > though. I will send a patchset shortly. 
> 
> Great. This is probably where conflicts between the host and guest
> numbers will show up. (See dwg's question below.)
> 
> > >>  arch/powerpc/include/asm/kvm_book3s.h | 19 +++++++++++++++++++
> > >>  arch/powerpc/kvm/book3s_hv.c          | 14 ++++++++++----
> > >>  arch/powerpc/kvm/book3s_xive.c        |  9 +++++++--
> > >>  3 files changed, 36 insertions(+), 6 deletions(-)
> > >>
> > >> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
> > >> index 376ae803b69c..1295056d564a 100644
> > >> --- a/arch/powerpc/include/asm/kvm_book3s.h
> > >> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> > >> @@ -368,4 +368,23 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
> > >>  #define SPLIT_HACK_MASK			0xff000000
> > >>  #define SPLIT_HACK_OFFS			0xfb000000
> > >>  
> > >> +/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
> > >> + * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core stride
> > >> + * (but not it's actual threading mode, which is not available) to avoid
> > >> + * collisions.
> > >> + */
> > >> +static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
> > >> +{
> > >> +	const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
> > > 
> > > I'd suggest 1,3,5,7 at the end rather than 1,5,3,7 - accomplishes
> > > roughly the same thing, but I think makes the pattern more obvious.
> 
> OK.
> 
> > >> +	int stride = kvm->arch.emul_smt_mode > 1 ?
> > >> +		     kvm->arch.emul_smt_mode : kvm->arch.smt_mode;
> > > 
> > > AFAICT from BUG_ON()s etc. at the callsites, kvm->arch.smt_mode must
> > > always be 1 when this is called, so the conditional here doesn't seem
> > > useful.
> 
> Ah yes, right. (That was an older version when I was thinking of using
> it for P8 as well but that didn't seem to be a good idea.)
> 
> > >> +	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
> > >> +	u32 packed_id;
> > >> +
> > >> +	BUG_ON(block >= MAX_SMT_THREADS);
> > >> +	packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
> > >> +	BUG_ON(packed_id >= KVM_MAX_VCPUS);
> > >> +	return packed_id;
> > >> +}
> > > 
> > > It took me a while to wrap my head around the packing function, but I
> > > think I got there in the end.  It's pretty clever.
> 
> Thanks, I'll try to add a better description as well :-)
> 
> > > One thing bothers me, though.  This certainly packs things under
> > > KVM_MAX_VCPUS, but not necessarily under the actual number of vcpus.
> > > e.g. KVM_MAC_VCPUS==16, 8 vcpus total, stride 8, 2 vthreads/vcore (as
> > > qemu sees it), gives both unpacked IDs (0, 1, 8, 9, 16, 17, 24, 25)
> > > and packed ids of (0, 1, 8, 9, 4, 5, 12, 13) - leaving 2, 3, 6, 7
> > > etc. unused.
> 
> That's right. The property it provides is that all the numbers are under
> KVM_MAX_VCPUS (which, see below, is the size of the fixed areas) not
> that they are sequential.
> 
> > > So again, the question is what exactly are these remapped IDs useful
> > > for.  If we're indexing into a bare array of structures of size
> > > KVM_MAX_VCPUS then we're *already* wasting a bunch of space by having
> > > more entries than vcpus.  If we're indexing into something sparser,
> > > then why is the remapping worthwhile?
> 
> Well, here's my thinking:
> 
> At the moment, kvm->vcores[] and xive->vp_base are both sized by NR_CPUS
> (via KVM_MAX_VCPUS and KVM_MAX_VCORES which are both NR_CPUS). This is
> enough space for the maximum number of VCPUs, and some space is wasted
> when the guest uses less than this (but KVM doesn't know how many will
> be created, so we can't do better easily). The problem is that the
> indicies overflow before all of those VCPUs can be created, not that
> more space is needed.
> 
> We could fix the overflow by expanding these areas to KVM_MAX_VCPU_ID
> but that will use 8x the space we use now, and we know that no more than
> KVM_MAX_VCPUS will be used so all this new space is basically wasted.
> 
> So remapping seems better if it will work. (Ben H. was strongly against
> wasting more XIVE space if possible.)

Hm, ok.  Are the relevant arrays here per-VM, or global?  Or some of both?

> In short, remapping provides a way to allow the guest to create it's full set
> of VCPUs without wasting any more space than we do currently, without
> having to do something more complicated like tracking used IDs or adding
> additional KVM CAPs.
> 
> > >> +
> > >>  #endif /* __ASM_KVM_BOOK3S_H__ */
> > >> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > >> index 9cb9448163c4..49165cc90051 100644
> > >> --- a/arch/powerpc/kvm/book3s_hv.c
> > >> +++ b/arch/powerpc/kvm/book3s_hv.c
> > >> @@ -1762,7 +1762,7 @@ static int threads_per_vcore(struct kvm *kvm)
> > >>  	return threads_per_subcore;
> > >>  }
> > >>  
> > >> -static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> > >> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
> > >>  {
> > >>  	struct kvmppc_vcore *vcore;
> > >>  
> > >> @@ -1776,7 +1776,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> > >>  	init_swait_queue_head(&vcore->wq);
> > >>  	vcore->preempt_tb = TB_NIL;
> > >>  	vcore->lpcr = kvm->arch.lpcr;
> > >> -	vcore->first_vcpuid = core * kvm->arch.smt_mode;
> > >> +	vcore->first_vcpuid = id;
> > >>  	vcore->kvm = kvm;
> > >>  	INIT_LIST_HEAD(&vcore->preempt_list);
> > >>  
> > >> @@ -1992,12 +1992,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
> > >>  	mutex_lock(&kvm->lock);
> > >>  	vcore = NULL;
> > >>  	err = -EINVAL;
> > >> -	core = id / kvm->arch.smt_mode;
> > >> +	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
> > >> +		BUG_ON(kvm->arch.smt_mode != 1);
> > >> +		core = kvmppc_pack_vcpu_id(kvm, id);
> > >> +	} else {
> > >> +		core = id / kvm->arch.smt_mode;
> > >> +	}
> > >>  	if (core < KVM_MAX_VCORES) {
> > >>  		vcore = kvm->arch.vcores[core];
> > >> +		BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
> > >>  		if (!vcore) {
> > >>  			err = -ENOMEM;
> > >> -			vcore = kvmppc_vcore_create(kvm, core);
> > >> +			vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
> > >>  			kvm->arch.vcores[core] = vcore;
> > >>  			kvm->arch.online_vcores++;
> > >>  		}
> > >> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> > >> index f9818d7d3381..681dfe12a5f3 100644
> > >> --- a/arch/powerpc/kvm/book3s_xive.c
> > >> +++ b/arch/powerpc/kvm/book3s_xive.c
> > >> @@ -317,6 +317,11 @@ static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
> > >>  	return -EBUSY;
> > >>  }
> > >>  
> > >> +static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
> > >> +{
> > >> +	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
> > >> +}
> > >> +
> > > 
> > > I'm finding the XIVE indexing really baffling.  There are a bunch of
> > > other places where the code uses (xive->vp_base + NUMBER) directly.
> 
> Ugh, yes. It looks like I botched part of my final cleanup and all the
> cases you saw in kvm/book3s_xive.c should have been replaced with a call to
> xive_vp(). I'll fix it and sorry for the confusion.

Ok.

> > This links the QEMU vCPU server NUMBER to a XIVE virtual processor number 
> > in OPAL. So we need to check that all used NUMBERs are, first, consistent 
> > and then, in the correct range.
> 
> Right. My approach was to allow XIVE to keep using server numbers that
> are equal to VCPU IDs, and just pack down the ID before indexing into
> the vp_base area.
> 
> > > If those are host side references, I guess they don't need updates for
> > > this.
> 
> These are all guest side references.
> 
> > > But if that's the case, then how does indexing into the same array
> > > with both host and guest server numbers make sense?
> 
> Right, it doesn't make sense to mix host and guest server numbers when
> we're remapping only the guest ones, but in this case (without native
> guest XIVE support) it's just guest ones.

Right.  Will this remapping be broken by guest-visible XIVE?  That is
for the guest visible XIVE are we going to need to expose un-remapped
XIVE server IDs to the guest?

> > yes. VPs are allocated with KVM_MAX_VCPUS :
> > 
> > 	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
> > 
> > but
> > 
> > 	#define KVM_MAX_VCPU_ID  (threads_per_subcore * KVM_MAX_VCORES)
> > 
> > WE would need to change the allocation of the VPs I guess.
> 
> Yes, this is one of the structures that overflow if we don't pack the IDs.
> 
> > >>  static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
> > >>  			     struct kvmppc_xive_src_block *sb,
> > >>  			     struct kvmppc_xive_irq_state *state)
> > >> @@ -1084,7 +1089,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
> > >>  		pr_devel("Duplicate !\n");
> > >>  		return -EEXIST;
> > >>  	}
> > >> -	if (cpu >= KVM_MAX_VCPUS) {
> > >> +	if (cpu >= KVM_MAX_VCPU_ID) {>>
> > >>  		pr_devel("Out of bounds !\n");
> > >>  		return -EINVAL;
> > >>  	}
> > >> @@ -1098,7 +1103,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
> > >>  	xc->xive = xive;
> > >>  	xc->vcpu = vcpu;
> > >>  	xc->server_num = cpu;
> > >> -	xc->vp_id = xive->vp_base + cpu;
> > >> +	xc->vp_id = xive_vp(xive, cpu);
> > >>  	xc->mfrr = 0xff;
> > >>  	xc->valid = true;
> > >>  
> > > 
> >
Cédric Le Goater April 24, 2018, 7:44 a.m. | #5
On 04/24/2018 05:19 AM, Sam Bobroff wrote:
> On Mon, Apr 23, 2018 at 11:06:35AM +0200, Cédric Le Goater wrote:
>> On 04/16/2018 06:09 AM, David Gibson wrote:
>>> On Thu, Apr 12, 2018 at 05:02:06PM +1000, Sam Bobroff wrote:
>>>> It is not currently possible to create the full number of possible
>>>> VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses less
>>>> threads per core than it's core stride (or "VSMT mode"). This is
>>>> because the VCORE ID and XIVE offsets to grow beyond KVM_MAX_VCPUS
>>>> even though the VCPU ID is less than KVM_MAX_VCPU_ID.
>>>>
>>>> To address this, "pack" the VCORE ID and XIVE offsets by using
>>>> knowledge of the way the VCPU IDs will be used when there are less
>>>> guest threads per core than the core stride. The primary thread of
>>>> each core will always be used first. Then, if the guest uses more than
>>>> one thread per core, these secondary threads will sequentially follow
>>>> the primary in each core.
>>>>
>>>> So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
>>>> VCPUs are being spaced apart, so at least half of each core is empty
>>>> and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
>>>> into the second half of each core (4..7, in an 8-thread core).
>>>>
>>>> Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
>>>> each core is being left empty, and we can map down into the second and
>>>> third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
>>>>
>>>> Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
>>>> threads are being used and 7/8 of the core is empty, allowing use of
>>>> the 1, 3, 5 and 7 thread slots.
>>>>
>>>> (Strides less than 8 are handled similarly.)
>>>>
>>>> This allows the VCORE ID or offset to be calculated quickly from the
>>>> VCPU ID or XIVE server numbers, without access to the VCPU structure.
>>>>
>>>> Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
>>>> ---
>>>> Hello everyone,
>>>>
>>>> I've tested this on P8 and P9, in lots of combinations of host and guest
>>>> threading modes and it has been fine but it does feel like a "tricky"
>>>> approach, so I still feel somewhat wary about it.
>>
>> Have you done any migration ? 
> 
> No, but I will :-)
> 
>>>> I've posted it as an RFC because I have not tested it with guest native-XIVE,
>>>> and I suspect that it will take some work to support it.
>>
>> The KVM XIVE device will be different for XIVE exploitation mode, same structures 
>> though. I will send a patchset shortly. 
> 
> Great. This is probably where conflicts between the host and guest
> numbers will show up. (See dwg's question below.)

The 'server' part looks better than the XICS-over-XIVE glue in fact, 
may be because it has not yet been tortured.   


Here is my take on the server topic :

All the OPAL calls should take a 'vp_id' of some sort, the one from the 
struct kvmppc_xive_vcpu, or the result of a routine translating a guest 
side CPU number to a VP id in the range defined for the guest. Moreover,
it would be better to make sure the guest side CPU number is valid in 
KVM and do a kvmppc_xive_vcpu lookup each time we use one before calling
OPAL, like that we would also get the associated struct kvmppc_xive_vcpu 
and its 'vp_id'.

The 'server_num' of kvmppc_xive_vcpu should probably still be a guest 
side CPU number, but we need to check its usage. The only problem is 
when it is compared to 'act_server' of 'kvmppc_xive_irq_state'. 

if 'act_server' was a VP id that would make our life easier. we could 
get rid of xive->vp_base + NUMBER usage in : 

	xive_native_configure_irq( ..., xive->vp_base + server, ...)

Would it be complex to have a routine converting back a VP id to a
guest side cpu number ? we would need it in get_xive() and get_source()

If we start shuffling the XIVE code in the direction above, I rather
do it to make sure the XIVE native exploitation mode patchset stays in 
sync. 

>>>>  arch/powerpc/include/asm/kvm_book3s.h | 19 +++++++++++++++++++
>>>>  arch/powerpc/kvm/book3s_hv.c          | 14 ++++++++++----
>>>>  arch/powerpc/kvm/book3s_xive.c        |  9 +++++++--
>>>>  3 files changed, 36 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
>>>> index 376ae803b69c..1295056d564a 100644
>>>> --- a/arch/powerpc/include/asm/kvm_book3s.h
>>>> +++ b/arch/powerpc/include/asm/kvm_book3s.h
>>>> @@ -368,4 +368,23 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
>>>>  #define SPLIT_HACK_MASK			0xff000000
>>>>  #define SPLIT_HACK_OFFS			0xfb000000
>>>>  
>>>> +/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
>>>> + * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core stride
>>>> + * (but not it's actual threading mode, which is not available) to avoid
>>>> + * collisions.
>>>> + */
>>>> +static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
>>>> +{
>>>> +	const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
>>>
>>> I'd suggest 1,3,5,7 at the end rather than 1,5,3,7 - accomplishes
>>> roughly the same thing, but I think makes the pattern more obvious.
> 
> OK.
> 
>>>> +	int stride = kvm->arch.emul_smt_mode > 1 ?
>>>> +		     kvm->arch.emul_smt_mode : kvm->arch.smt_mode;
>>>
>>> AFAICT from BUG_ON()s etc. at the callsites, kvm->arch.smt_mode must
>>> always be 1 when this is called, so the conditional here doesn't seem
>>> useful.
> 
> Ah yes, right. (That was an older version when I was thinking of using
> it for P8 as well but that didn't seem to be a good idea.)
> 
>>>> +	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
>>>> +	u32 packed_id;
>>>> +
>>>> +	BUG_ON(block >= MAX_SMT_THREADS);
>>>> +	packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
>>>> +	BUG_ON(packed_id >= KVM_MAX_VCPUS);
>>>> +	return packed_id;
>>>> +}
>>>
>>> It took me a while to wrap my head around the packing function, but I
>>> think I got there in the end.  It's pretty clever.
> 
> Thanks, I'll try to add a better description as well :-)
> 
>>> One thing bothers me, though.  This certainly packs things under
>>> KVM_MAX_VCPUS, but not necessarily under the actual number of vcpus.
>>> e.g. KVM_MAC_VCPUS==16, 8 vcpus total, stride 8, 2 vthreads/vcore (as
>>> qemu sees it), gives both unpacked IDs (0, 1, 8, 9, 16, 17, 24, 25)
>>> and packed ids of (0, 1, 8, 9, 4, 5, 12, 13) - leaving 2, 3, 6, 7
>>> etc. unused.
> 
> That's right. The property it provides is that all the numbers are under
> KVM_MAX_VCPUS (which, see below, is the size of the fixed areas) not
> that they are sequential.
> 
>>> So again, the question is what exactly are these remapped IDs useful
>>> for.  If we're indexing into a bare array of structures of size
>>> KVM_MAX_VCPUS then we're *already* wasting a bunch of space by having
>>> more entries than vcpus.  If we're indexing into something sparser,
>>> then why is the remapping worthwhile?
> 
> Well, here's my thinking:
> 
> At the moment, kvm->vcores[] and xive->vp_base are both sized by NR_CPUS
> (via KVM_MAX_VCPUS and KVM_MAX_VCORES which are both NR_CPUS). This is
> enough space for the maximum number of VCPUs, and some space is wasted
> when the guest uses less than this (but KVM doesn't know how many will
> be created, so we can't do better easily). The problem is that the
> indicies overflow before all of those VCPUs can be created, not that
> more space is needed.
> 
> We could fix the overflow by expanding these areas to KVM_MAX_VCPU_ID
> but that will use 8x the space we use now, and we know that no more than
> KVM_MAX_VCPUS will be used so all this new space is basically wasted.
> 
> So remapping seems better if it will work. (Ben H. was strongly against
> wasting more XIVE space if possible.)

remapping is 'nearly' done. kvmppc_xive_vcpu holds both values already. 
it's a question of good usage. the KVM XIVE layer should use internally
VP ids and do a translation at the frontier: hcalls and host kernel 
routines (get/set_xive)

Thanks,

C.

> In short, remapping provides a way to allow the guest to create it's full set
> of VCPUs without wasting any more space than we do currently, without
> having to do something more complicated like tracking used IDs or adding
> additional KVM CAPs.
> 
>>>> +
>>>>  #endif /* __ASM_KVM_BOOK3S_H__ */
>>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>>> index 9cb9448163c4..49165cc90051 100644
>>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>>> @@ -1762,7 +1762,7 @@ static int threads_per_vcore(struct kvm *kvm)
>>>>  	return threads_per_subcore;
>>>>  }
>>>>  
>>>> -static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
>>>> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
>>>>  {
>>>>  	struct kvmppc_vcore *vcore;
>>>>  
>>>> @@ -1776,7 +1776,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
>>>>  	init_swait_queue_head(&vcore->wq);
>>>>  	vcore->preempt_tb = TB_NIL;
>>>>  	vcore->lpcr = kvm->arch.lpcr;
>>>> -	vcore->first_vcpuid = core * kvm->arch.smt_mode;
>>>> +	vcore->first_vcpuid = id;
>>>>  	vcore->kvm = kvm;
>>>>  	INIT_LIST_HEAD(&vcore->preempt_list);
>>>>  
>>>> @@ -1992,12 +1992,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
>>>>  	mutex_lock(&kvm->lock);
>>>>  	vcore = NULL;
>>>>  	err = -EINVAL;
>>>> -	core = id / kvm->arch.smt_mode;
>>>> +	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
>>>> +		BUG_ON(kvm->arch.smt_mode != 1);
>>>> +		core = kvmppc_pack_vcpu_id(kvm, id);
>>>> +	} else {
>>>> +		core = id / kvm->arch.smt_mode;
>>>> +	}
>>>>  	if (core < KVM_MAX_VCORES) {
>>>>  		vcore = kvm->arch.vcores[core];
>>>> +		BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
>>>>  		if (!vcore) {
>>>>  			err = -ENOMEM;
>>>> -			vcore = kvmppc_vcore_create(kvm, core);
>>>> +			vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
>>>>  			kvm->arch.vcores[core] = vcore;
>>>>  			kvm->arch.online_vcores++;
>>>>  		}
>>>> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
>>>> index f9818d7d3381..681dfe12a5f3 100644
>>>> --- a/arch/powerpc/kvm/book3s_xive.c
>>>> +++ b/arch/powerpc/kvm/book3s_xive.c
>>>> @@ -317,6 +317,11 @@ static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
>>>>  	return -EBUSY;
>>>>  }
>>>>  
>>>> +static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
>>>> +{
>>>> +	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
>>>> +}
>>>> +
>>>
>>> I'm finding the XIVE indexing really baffling.  There are a bunch of
>>> other places where the code uses (xive->vp_base + NUMBER) directly.
> 
> Ugh, yes. It looks like I botched part of my final cleanup and all the
> cases you saw in kvm/book3s_xive.c should have been replaced with a call to
> xive_vp(). I'll fix it and sorry for the confusion.
> 
>> This links the QEMU vCPU server NUMBER to a XIVE virtual processor number 
>> in OPAL. So we need to check that all used NUMBERs are, first, consistent 
>> and then, in the correct range.
> 
> Right. My approach was to allow XIVE to keep using server numbers that
> are equal to VCPU IDs, and just pack down the ID before indexing into
> the vp_base area.
> 
>>> If those are host side references, I guess they don't need updates for
>>> this.
> 
> These are all guest side references.
> 
>>> But if that's the case, then how does indexing into the same array
>>> with both host and guest server numbers make sense?
> 
> Right, it doesn't make sense to mix host and guest server numbers when
> we're remapping only the guest ones, but in this case (without native
> guest XIVE support) it's just guest ones.
> 
>> yes. VPs are allocated with KVM_MAX_VCPUS :
>>
>> 	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
>>
>> but
>>
>> 	#define KVM_MAX_VCPU_ID  (threads_per_subcore * KVM_MAX_VCORES)
>>
>> WE would need to change the allocation of the VPs I guess.
> 
> Yes, this is one of the structures that overflow if we don't pack the IDs.
> 
>>>>  static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
>>>>  			     struct kvmppc_xive_src_block *sb,
>>>>  			     struct kvmppc_xive_irq_state *state)
>>>> @@ -1084,7 +1089,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>>>>  		pr_devel("Duplicate !\n");
>>>>  		return -EEXIST;
>>>>  	}
>>>> -	if (cpu >= KVM_MAX_VCPUS) {
>>>> +	if (cpu >= KVM_MAX_VCPU_ID) {>>
>>>>  		pr_devel("Out of bounds !\n");
>>>>  		return -EINVAL;
>>>>  	}
>>>> @@ -1098,7 +1103,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>>>>  	xc->xive = xive;
>>>>  	xc->vcpu = vcpu;
>>>>  	xc->server_num = cpu;
>>>> -	xc->vp_id = xive->vp_base + cpu;
>>>> +	xc->vp_id = xive_vp(xive, cpu);
>>>>  	xc->mfrr = 0xff;
>>>>  	xc->valid = true;
>>>>  
>>>
>>
Sam Bobroff May 1, 2018, 4:52 a.m. | #6
On Tue, Apr 24, 2018 at 01:48:25PM +1000, David Gibson wrote:
> On Tue, Apr 24, 2018 at 01:19:15PM +1000, Sam Bobroff wrote:
> > On Mon, Apr 23, 2018 at 11:06:35AM +0200, Cédric Le Goater wrote:
> > > On 04/16/2018 06:09 AM, David Gibson wrote:
> > > > On Thu, Apr 12, 2018 at 05:02:06PM +1000, Sam Bobroff wrote:
> > > >> It is not currently possible to create the full number of possible
> > > >> VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses less
> > > >> threads per core than it's core stride (or "VSMT mode"). This is
> > > >> because the VCORE ID and XIVE offsets to grow beyond KVM_MAX_VCPUS
> > > >> even though the VCPU ID is less than KVM_MAX_VCPU_ID.
> > > >>
> > > >> To address this, "pack" the VCORE ID and XIVE offsets by using
> > > >> knowledge of the way the VCPU IDs will be used when there are less
> > > >> guest threads per core than the core stride. The primary thread of
> > > >> each core will always be used first. Then, if the guest uses more than
> > > >> one thread per core, these secondary threads will sequentially follow
> > > >> the primary in each core.
> > > >>
> > > >> So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
> > > >> VCPUs are being spaced apart, so at least half of each core is empty
> > > >> and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
> > > >> into the second half of each core (4..7, in an 8-thread core).
> > > >>
> > > >> Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
> > > >> each core is being left empty, and we can map down into the second and
> > > >> third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
> > > >>
> > > >> Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
> > > >> threads are being used and 7/8 of the core is empty, allowing use of
> > > >> the 1, 3, 5 and 7 thread slots.
> > > >>
> > > >> (Strides less than 8 are handled similarly.)
> > > >>
> > > >> This allows the VCORE ID or offset to be calculated quickly from the
> > > >> VCPU ID or XIVE server numbers, without access to the VCPU structure.
> > > >>
> > > >> Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
> > > >> ---
> > > >> Hello everyone,
> > > >>
> > > >> I've tested this on P8 and P9, in lots of combinations of host and guest
> > > >> threading modes and it has been fine but it does feel like a "tricky"
> > > >> approach, so I still feel somewhat wary about it.
> > > 
> > > Have you done any migration ? 
> > 
> > No, but I will :-)
> > 
> > > >> I've posted it as an RFC because I have not tested it with guest native-XIVE,
> > > >> and I suspect that it will take some work to support it.
> > > 
> > > The KVM XIVE device will be different for XIVE exploitation mode, same structures 
> > > though. I will send a patchset shortly. 
> > 
> > Great. This is probably where conflicts between the host and guest
> > numbers will show up. (See dwg's question below.)
> > 
> > > >>  arch/powerpc/include/asm/kvm_book3s.h | 19 +++++++++++++++++++
> > > >>  arch/powerpc/kvm/book3s_hv.c          | 14 ++++++++++----
> > > >>  arch/powerpc/kvm/book3s_xive.c        |  9 +++++++--
> > > >>  3 files changed, 36 insertions(+), 6 deletions(-)
> > > >>
> > > >> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
> > > >> index 376ae803b69c..1295056d564a 100644
> > > >> --- a/arch/powerpc/include/asm/kvm_book3s.h
> > > >> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> > > >> @@ -368,4 +368,23 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
> > > >>  #define SPLIT_HACK_MASK			0xff000000
> > > >>  #define SPLIT_HACK_OFFS			0xfb000000
> > > >>  
> > > >> +/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
> > > >> + * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core stride
> > > >> + * (but not it's actual threading mode, which is not available) to avoid
> > > >> + * collisions.
> > > >> + */
> > > >> +static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
> > > >> +{
> > > >> +	const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
> > > > 
> > > > I'd suggest 1,3,5,7 at the end rather than 1,5,3,7 - accomplishes
> > > > roughly the same thing, but I think makes the pattern more obvious.
> > 
> > OK.
> > 
> > > >> +	int stride = kvm->arch.emul_smt_mode > 1 ?
> > > >> +		     kvm->arch.emul_smt_mode : kvm->arch.smt_mode;
> > > > 
> > > > AFAICT from BUG_ON()s etc. at the callsites, kvm->arch.smt_mode must
> > > > always be 1 when this is called, so the conditional here doesn't seem
> > > > useful.
> > 
> > Ah yes, right. (That was an older version when I was thinking of using
> > it for P8 as well but that didn't seem to be a good idea.)
> > 
> > > >> +	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
> > > >> +	u32 packed_id;
> > > >> +
> > > >> +	BUG_ON(block >= MAX_SMT_THREADS);
> > > >> +	packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
> > > >> +	BUG_ON(packed_id >= KVM_MAX_VCPUS);
> > > >> +	return packed_id;
> > > >> +}
> > > > 
> > > > It took me a while to wrap my head around the packing function, but I
> > > > think I got there in the end.  It's pretty clever.
> > 
> > Thanks, I'll try to add a better description as well :-)
> > 
> > > > One thing bothers me, though.  This certainly packs things under
> > > > KVM_MAX_VCPUS, but not necessarily under the actual number of vcpus.
> > > > e.g. KVM_MAC_VCPUS==16, 8 vcpus total, stride 8, 2 vthreads/vcore (as
> > > > qemu sees it), gives both unpacked IDs (0, 1, 8, 9, 16, 17, 24, 25)
> > > > and packed ids of (0, 1, 8, 9, 4, 5, 12, 13) - leaving 2, 3, 6, 7
> > > > etc. unused.
> > 
> > That's right. The property it provides is that all the numbers are under
> > KVM_MAX_VCPUS (which, see below, is the size of the fixed areas) not
> > that they are sequential.
> > 
> > > > So again, the question is what exactly are these remapped IDs useful
> > > > for.  If we're indexing into a bare array of structures of size
> > > > KVM_MAX_VCPUS then we're *already* wasting a bunch of space by having
> > > > more entries than vcpus.  If we're indexing into something sparser,
> > > > then why is the remapping worthwhile?
> > 
> > Well, here's my thinking:
> > 
> > At the moment, kvm->vcores[] and xive->vp_base are both sized by NR_CPUS
> > (via KVM_MAX_VCPUS and KVM_MAX_VCORES which are both NR_CPUS). This is
> > enough space for the maximum number of VCPUs, and some space is wasted
> > when the guest uses less than this (but KVM doesn't know how many will
> > be created, so we can't do better easily). The problem is that the
> > indicies overflow before all of those VCPUs can be created, not that
> > more space is needed.
> > 
> > We could fix the overflow by expanding these areas to KVM_MAX_VCPU_ID
> > but that will use 8x the space we use now, and we know that no more than
> > KVM_MAX_VCPUS will be used so all this new space is basically wasted.
> > 
> > So remapping seems better if it will work. (Ben H. was strongly against
> > wasting more XIVE space if possible.)
> 
> Hm, ok.  Are the relevant arrays here per-VM, or global?  Or some of both?

Per-VM. They are the kvm->vcores[] array and the blocks of memory
pointed to by xive->vp_base.

> > In short, remapping provides a way to allow the guest to create it's full set
> > of VCPUs without wasting any more space than we do currently, without
> > having to do something more complicated like tracking used IDs or adding
> > additional KVM CAPs.
> > 
> > > >> +
> > > >>  #endif /* __ASM_KVM_BOOK3S_H__ */
> > > >> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > > >> index 9cb9448163c4..49165cc90051 100644
> > > >> --- a/arch/powerpc/kvm/book3s_hv.c
> > > >> +++ b/arch/powerpc/kvm/book3s_hv.c
> > > >> @@ -1762,7 +1762,7 @@ static int threads_per_vcore(struct kvm *kvm)
> > > >>  	return threads_per_subcore;
> > > >>  }
> > > >>  
> > > >> -static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> > > >> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
> > > >>  {
> > > >>  	struct kvmppc_vcore *vcore;
> > > >>  
> > > >> @@ -1776,7 +1776,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> > > >>  	init_swait_queue_head(&vcore->wq);
> > > >>  	vcore->preempt_tb = TB_NIL;
> > > >>  	vcore->lpcr = kvm->arch.lpcr;
> > > >> -	vcore->first_vcpuid = core * kvm->arch.smt_mode;
> > > >> +	vcore->first_vcpuid = id;
> > > >>  	vcore->kvm = kvm;
> > > >>  	INIT_LIST_HEAD(&vcore->preempt_list);
> > > >>  
> > > >> @@ -1992,12 +1992,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
> > > >>  	mutex_lock(&kvm->lock);
> > > >>  	vcore = NULL;
> > > >>  	err = -EINVAL;
> > > >> -	core = id / kvm->arch.smt_mode;
> > > >> +	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
> > > >> +		BUG_ON(kvm->arch.smt_mode != 1);
> > > >> +		core = kvmppc_pack_vcpu_id(kvm, id);
> > > >> +	} else {
> > > >> +		core = id / kvm->arch.smt_mode;
> > > >> +	}
> > > >>  	if (core < KVM_MAX_VCORES) {
> > > >>  		vcore = kvm->arch.vcores[core];
> > > >> +		BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
> > > >>  		if (!vcore) {
> > > >>  			err = -ENOMEM;
> > > >> -			vcore = kvmppc_vcore_create(kvm, core);
> > > >> +			vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
> > > >>  			kvm->arch.vcores[core] = vcore;
> > > >>  			kvm->arch.online_vcores++;
> > > >>  		}
> > > >> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> > > >> index f9818d7d3381..681dfe12a5f3 100644
> > > >> --- a/arch/powerpc/kvm/book3s_xive.c
> > > >> +++ b/arch/powerpc/kvm/book3s_xive.c
> > > >> @@ -317,6 +317,11 @@ static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
> > > >>  	return -EBUSY;
> > > >>  }
> > > >>  
> > > >> +static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
> > > >> +{
> > > >> +	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
> > > >> +}
> > > >> +
> > > > 
> > > > I'm finding the XIVE indexing really baffling.  There are a bunch of
> > > > other places where the code uses (xive->vp_base + NUMBER) directly.
> > 
> > Ugh, yes. It looks like I botched part of my final cleanup and all the
> > cases you saw in kvm/book3s_xive.c should have been replaced with a call to
> > xive_vp(). I'll fix it and sorry for the confusion.
> 
> Ok.
> 
> > > This links the QEMU vCPU server NUMBER to a XIVE virtual processor number 
> > > in OPAL. So we need to check that all used NUMBERs are, first, consistent 
> > > and then, in the correct range.
> > 
> > Right. My approach was to allow XIVE to keep using server numbers that
> > are equal to VCPU IDs, and just pack down the ID before indexing into
> > the vp_base area.
> > 
> > > > If those are host side references, I guess they don't need updates for
> > > > this.
> > 
> > These are all guest side references.
> > 
> > > > But if that's the case, then how does indexing into the same array
> > > > with both host and guest server numbers make sense?
> > 
> > Right, it doesn't make sense to mix host and guest server numbers when
> > we're remapping only the guest ones, but in this case (without native
> > guest XIVE support) it's just guest ones.
> 
> Right.  Will this remapping be broken by guest-visible XIVE?  That is
> for the guest visible XIVE are we going to need to expose un-remapped
> XIVE server IDs to the guest?

I'm not sure, I'll start looking at that next.

> > > yes. VPs are allocated with KVM_MAX_VCPUS :
> > > 
> > > 	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
> > > 
> > > but
> > > 
> > > 	#define KVM_MAX_VCPU_ID  (threads_per_subcore * KVM_MAX_VCORES)
> > > 
> > > WE would need to change the allocation of the VPs I guess.
> > 
> > Yes, this is one of the structures that overflow if we don't pack the IDs.
> > 
> > > >>  static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
> > > >>  			     struct kvmppc_xive_src_block *sb,
> > > >>  			     struct kvmppc_xive_irq_state *state)
> > > >> @@ -1084,7 +1089,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
> > > >>  		pr_devel("Duplicate !\n");
> > > >>  		return -EEXIST;
> > > >>  	}
> > > >> -	if (cpu >= KVM_MAX_VCPUS) {
> > > >> +	if (cpu >= KVM_MAX_VCPU_ID) {>>
> > > >>  		pr_devel("Out of bounds !\n");
> > > >>  		return -EINVAL;
> > > >>  	}
> > > >> @@ -1098,7 +1103,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
> > > >>  	xc->xive = xive;
> > > >>  	xc->vcpu = vcpu;
> > > >>  	xc->server_num = cpu;
> > > >> -	xc->vp_id = xive->vp_base + cpu;
> > > >> +	xc->vp_id = xive_vp(xive, cpu);
> > > >>  	xc->mfrr = 0xff;
> > > >>  	xc->valid = true;
> > > >>  
> > > > 
> > > 
> 
> 
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson
David Gibson May 3, 2018, 3:11 a.m. | #7
On Tue, May 01, 2018 at 02:52:21PM +1000, Sam Bobroff wrote:
> On Tue, Apr 24, 2018 at 01:48:25PM +1000, David Gibson wrote:
> > On Tue, Apr 24, 2018 at 01:19:15PM +1000, Sam Bobroff wrote:
> > > On Mon, Apr 23, 2018 at 11:06:35AM +0200, Cédric Le Goater wrote:
> > > > On 04/16/2018 06:09 AM, David Gibson wrote:
[snip]
> > > At the moment, kvm->vcores[] and xive->vp_base are both sized by NR_CPUS
> > > (via KVM_MAX_VCPUS and KVM_MAX_VCORES which are both NR_CPUS). This is
> > > enough space for the maximum number of VCPUs, and some space is wasted
> > > when the guest uses less than this (but KVM doesn't know how many will
> > > be created, so we can't do better easily). The problem is that the
> > > indicies overflow before all of those VCPUs can be created, not that
> > > more space is needed.
> > > 
> > > We could fix the overflow by expanding these areas to KVM_MAX_VCPU_ID
> > > but that will use 8x the space we use now, and we know that no more than
> > > KVM_MAX_VCPUS will be used so all this new space is basically wasted.
> > > 
> > > So remapping seems better if it will work. (Ben H. was strongly against
> > > wasting more XIVE space if possible.)
> > 
> > Hm, ok.  Are the relevant arrays here per-VM, or global?  Or some of both?
> 
> Per-VM. They are the kvm->vcores[] array and the blocks of memory
> pointed to by xive->vp_base.

Hm.  If it were global (where you can't know the size of a specific
VM) I'd certainly see the concern about not expanding the size of the
array.

As it is, I'm a little perplexed that we care so much about the
difference between KVM_MAX_VCPUS and KVM_MAX_VCPU_ID, a factor of 8,
when we apparently don't care about the difference between the vm's
actual number of cpus and KVM_MAX_VCPUS, a factor of maybe 2048 (for a
1vcpu guest and powernv_defconfig).

> 
> > > In short, remapping provides a way to allow the guest to create it's full set
> > > of VCPUs without wasting any more space than we do currently, without
> > > having to do something more complicated like tracking used IDs or adding
> > > additional KVM CAPs.
> > > 
> > > > >> +
> > > > >>  #endif /* __ASM_KVM_BOOK3S_H__ */
> > > > >> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > > > >> index 9cb9448163c4..49165cc90051 100644
> > > > >> --- a/arch/powerpc/kvm/book3s_hv.c
> > > > >> +++ b/arch/powerpc/kvm/book3s_hv.c
> > > > >> @@ -1762,7 +1762,7 @@ static int threads_per_vcore(struct kvm *kvm)
> > > > >>  	return threads_per_subcore;
> > > > >>  }
> > > > >>  
> > > > >> -static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> > > > >> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
> > > > >>  {
> > > > >>  	struct kvmppc_vcore *vcore;
> > > > >>  
> > > > >> @@ -1776,7 +1776,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> > > > >>  	init_swait_queue_head(&vcore->wq);
> > > > >>  	vcore->preempt_tb = TB_NIL;
> > > > >>  	vcore->lpcr = kvm->arch.lpcr;
> > > > >> -	vcore->first_vcpuid = core * kvm->arch.smt_mode;
> > > > >> +	vcore->first_vcpuid = id;
> > > > >>  	vcore->kvm = kvm;
> > > > >>  	INIT_LIST_HEAD(&vcore->preempt_list);
> > > > >>  
> > > > >> @@ -1992,12 +1992,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
> > > > >>  	mutex_lock(&kvm->lock);
> > > > >>  	vcore = NULL;
> > > > >>  	err = -EINVAL;
> > > > >> -	core = id / kvm->arch.smt_mode;
> > > > >> +	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
> > > > >> +		BUG_ON(kvm->arch.smt_mode != 1);
> > > > >> +		core = kvmppc_pack_vcpu_id(kvm, id);
> > > > >> +	} else {
> > > > >> +		core = id / kvm->arch.smt_mode;
> > > > >> +	}
> > > > >>  	if (core < KVM_MAX_VCORES) {
> > > > >>  		vcore = kvm->arch.vcores[core];
> > > > >> +		BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
> > > > >>  		if (!vcore) {
> > > > >>  			err = -ENOMEM;
> > > > >> -			vcore = kvmppc_vcore_create(kvm, core);
> > > > >> +			vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
> > > > >>  			kvm->arch.vcores[core] = vcore;
> > > > >>  			kvm->arch.online_vcores++;
> > > > >>  		}
> > > > >> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> > > > >> index f9818d7d3381..681dfe12a5f3 100644
> > > > >> --- a/arch/powerpc/kvm/book3s_xive.c
> > > > >> +++ b/arch/powerpc/kvm/book3s_xive.c
> > > > >> @@ -317,6 +317,11 @@ static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
> > > > >>  	return -EBUSY;
> > > > >>  }
> > > > >>  
> > > > >> +static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
> > > > >> +{
> > > > >> +	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
> > > > >> +}
> > > > >> +
> > > > > 
> > > > > I'm finding the XIVE indexing really baffling.  There are a bunch of
> > > > > other places where the code uses (xive->vp_base + NUMBER) directly.
> > > 
> > > Ugh, yes. It looks like I botched part of my final cleanup and all the
> > > cases you saw in kvm/book3s_xive.c should have been replaced with a call to
> > > xive_vp(). I'll fix it and sorry for the confusion.
> > 
> > Ok.
> > 
> > > > This links the QEMU vCPU server NUMBER to a XIVE virtual processor number 
> > > > in OPAL. So we need to check that all used NUMBERs are, first, consistent 
> > > > and then, in the correct range.
> > > 
> > > Right. My approach was to allow XIVE to keep using server numbers that
> > > are equal to VCPU IDs, and just pack down the ID before indexing into
> > > the vp_base area.
> > > 
> > > > > If those are host side references, I guess they don't need updates for
> > > > > this.
> > > 
> > > These are all guest side references.
> > > 
> > > > > But if that's the case, then how does indexing into the same array
> > > > > with both host and guest server numbers make sense?
> > > 
> > > Right, it doesn't make sense to mix host and guest server numbers when
> > > we're remapping only the guest ones, but in this case (without native
> > > guest XIVE support) it's just guest ones.
> > 
> > Right.  Will this remapping be broken by guest-visible XIVE?  That is
> > for the guest visible XIVE are we going to need to expose un-remapped
> > XIVE server IDs to the guest?
> 
> I'm not sure, I'll start looking at that next.
> 
> > > > yes. VPs are allocated with KVM_MAX_VCPUS :
> > > > 
> > > > 	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
> > > > 
> > > > but
> > > > 
> > > > 	#define KVM_MAX_VCPU_ID  (threads_per_subcore * KVM_MAX_VCORES)
> > > > 
> > > > WE would need to change the allocation of the VPs I guess.
> > > 
> > > Yes, this is one of the structures that overflow if we don't pack the IDs.
> > > 
> > > > >>  static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
> > > > >>  			     struct kvmppc_xive_src_block *sb,
> > > > >>  			     struct kvmppc_xive_irq_state *state)
> > > > >> @@ -1084,7 +1089,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
> > > > >>  		pr_devel("Duplicate !\n");
> > > > >>  		return -EEXIST;
> > > > >>  	}
> > > > >> -	if (cpu >= KVM_MAX_VCPUS) {
> > > > >> +	if (cpu >= KVM_MAX_VCPU_ID) {>>
> > > > >>  		pr_devel("Out of bounds !\n");
> > > > >>  		return -EINVAL;
> > > > >>  	}
> > > > >> @@ -1098,7 +1103,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
> > > > >>  	xc->xive = xive;
> > > > >>  	xc->vcpu = vcpu;
> > > > >>  	xc->server_num = cpu;
> > > > >> -	xc->vp_id = xive->vp_base + cpu;
> > > > >> +	xc->vp_id = xive_vp(xive, cpu);
> > > > >>  	xc->mfrr = 0xff;
> > > > >>  	xc->valid = true;
> > > > >>  
> > > > > 
> > > > 
> > 
> > 
> > 
> 
>
Sam Bobroff May 3, 2018, 3:38 a.m. | #8
On Thu, May 03, 2018 at 01:11:13PM +1000, David Gibson wrote:
> On Tue, May 01, 2018 at 02:52:21PM +1000, Sam Bobroff wrote:
> > On Tue, Apr 24, 2018 at 01:48:25PM +1000, David Gibson wrote:
> > > On Tue, Apr 24, 2018 at 01:19:15PM +1000, Sam Bobroff wrote:
> > > > On Mon, Apr 23, 2018 at 11:06:35AM +0200, Cédric Le Goater wrote:
> > > > > On 04/16/2018 06:09 AM, David Gibson wrote:
> [snip]
> > > > At the moment, kvm->vcores[] and xive->vp_base are both sized by NR_CPUS
> > > > (via KVM_MAX_VCPUS and KVM_MAX_VCORES which are both NR_CPUS). This is
> > > > enough space for the maximum number of VCPUs, and some space is wasted
> > > > when the guest uses less than this (but KVM doesn't know how many will
> > > > be created, so we can't do better easily). The problem is that the
> > > > indicies overflow before all of those VCPUs can be created, not that
> > > > more space is needed.
> > > > 
> > > > We could fix the overflow by expanding these areas to KVM_MAX_VCPU_ID
> > > > but that will use 8x the space we use now, and we know that no more than
> > > > KVM_MAX_VCPUS will be used so all this new space is basically wasted.
> > > > 
> > > > So remapping seems better if it will work. (Ben H. was strongly against
> > > > wasting more XIVE space if possible.)
> > > 
> > > Hm, ok.  Are the relevant arrays here per-VM, or global?  Or some of both?
> > 
> > Per-VM. They are the kvm->vcores[] array and the blocks of memory
> > pointed to by xive->vp_base.
> 
> Hm.  If it were global (where you can't know the size of a specific
> VM) I'd certainly see the concern about not expanding the size of the
> array.
> 
> As it is, I'm a little perplexed that we care so much about the
> difference between KVM_MAX_VCPUS and KVM_MAX_VCPU_ID, a factor of 8,
> when we apparently don't care about the difference between the vm's
> actual number of cpus and KVM_MAX_VCPUS, a factor of maybe 2048 (for a
> 1vcpu guest and powernv_defconfig).

I agree, and we should do better (more because of the XIVE area than the
vcores array), but that will require a coordinated change to QEMU and
KVM to export the information KVM needs and that's going to be more
complicated. So basically, yes, this is only partially fixing it but
it's easy to do.

As for how we could solve the bigger problem; I've discussed with Paul
the idea of adding the guest's threading mode as a second parameter when
QEMU sets KVM_CAP_SMT to set the VSMT mode but that only helps with
packing; KVM still needs to know the maximum number of (hot pluggable)
CPUs so we'll need some other extension as well.

> > > > In short, remapping provides a way to allow the guest to create it's full set
> > > > of VCPUs without wasting any more space than we do currently, without
> > > > having to do something more complicated like tracking used IDs or adding
> > > > additional KVM CAPs.
> > > > 
> > > > > >> +
> > > > > >>  #endif /* __ASM_KVM_BOOK3S_H__ */
> > > > > >> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > > > > >> index 9cb9448163c4..49165cc90051 100644
> > > > > >> --- a/arch/powerpc/kvm/book3s_hv.c
> > > > > >> +++ b/arch/powerpc/kvm/book3s_hv.c
> > > > > >> @@ -1762,7 +1762,7 @@ static int threads_per_vcore(struct kvm *kvm)
> > > > > >>  	return threads_per_subcore;
> > > > > >>  }
> > > > > >>  
> > > > > >> -static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> > > > > >> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
> > > > > >>  {
> > > > > >>  	struct kvmppc_vcore *vcore;
> > > > > >>  
> > > > > >> @@ -1776,7 +1776,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> > > > > >>  	init_swait_queue_head(&vcore->wq);
> > > > > >>  	vcore->preempt_tb = TB_NIL;
> > > > > >>  	vcore->lpcr = kvm->arch.lpcr;
> > > > > >> -	vcore->first_vcpuid = core * kvm->arch.smt_mode;
> > > > > >> +	vcore->first_vcpuid = id;
> > > > > >>  	vcore->kvm = kvm;
> > > > > >>  	INIT_LIST_HEAD(&vcore->preempt_list);
> > > > > >>  
> > > > > >> @@ -1992,12 +1992,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
> > > > > >>  	mutex_lock(&kvm->lock);
> > > > > >>  	vcore = NULL;
> > > > > >>  	err = -EINVAL;
> > > > > >> -	core = id / kvm->arch.smt_mode;
> > > > > >> +	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
> > > > > >> +		BUG_ON(kvm->arch.smt_mode != 1);
> > > > > >> +		core = kvmppc_pack_vcpu_id(kvm, id);
> > > > > >> +	} else {
> > > > > >> +		core = id / kvm->arch.smt_mode;
> > > > > >> +	}
> > > > > >>  	if (core < KVM_MAX_VCORES) {
> > > > > >>  		vcore = kvm->arch.vcores[core];
> > > > > >> +		BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
> > > > > >>  		if (!vcore) {
> > > > > >>  			err = -ENOMEM;
> > > > > >> -			vcore = kvmppc_vcore_create(kvm, core);
> > > > > >> +			vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
> > > > > >>  			kvm->arch.vcores[core] = vcore;
> > > > > >>  			kvm->arch.online_vcores++;
> > > > > >>  		}
> > > > > >> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> > > > > >> index f9818d7d3381..681dfe12a5f3 100644
> > > > > >> --- a/arch/powerpc/kvm/book3s_xive.c
> > > > > >> +++ b/arch/powerpc/kvm/book3s_xive.c
> > > > > >> @@ -317,6 +317,11 @@ static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
> > > > > >>  	return -EBUSY;
> > > > > >>  }
> > > > > >>  
> > > > > >> +static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
> > > > > >> +{
> > > > > >> +	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
> > > > > >> +}
> > > > > >> +
> > > > > > 
> > > > > > I'm finding the XIVE indexing really baffling.  There are a bunch of
> > > > > > other places where the code uses (xive->vp_base + NUMBER) directly.
> > > > 
> > > > Ugh, yes. It looks like I botched part of my final cleanup and all the
> > > > cases you saw in kvm/book3s_xive.c should have been replaced with a call to
> > > > xive_vp(). I'll fix it and sorry for the confusion.
> > > 
> > > Ok.
> > > 
> > > > > This links the QEMU vCPU server NUMBER to a XIVE virtual processor number 
> > > > > in OPAL. So we need to check that all used NUMBERs are, first, consistent 
> > > > > and then, in the correct range.
> > > > 
> > > > Right. My approach was to allow XIVE to keep using server numbers that
> > > > are equal to VCPU IDs, and just pack down the ID before indexing into
> > > > the vp_base area.
> > > > 
> > > > > > If those are host side references, I guess they don't need updates for
> > > > > > this.
> > > > 
> > > > These are all guest side references.
> > > > 
> > > > > > But if that's the case, then how does indexing into the same array
> > > > > > with both host and guest server numbers make sense?
> > > > 
> > > > Right, it doesn't make sense to mix host and guest server numbers when
> > > > we're remapping only the guest ones, but in this case (without native
> > > > guest XIVE support) it's just guest ones.
> > > 
> > > Right.  Will this remapping be broken by guest-visible XIVE?  That is
> > > for the guest visible XIVE are we going to need to expose un-remapped
> > > XIVE server IDs to the guest?
> > 
> > I'm not sure, I'll start looking at that next.
> > 
> > > > > yes. VPs are allocated with KVM_MAX_VCPUS :
> > > > > 
> > > > > 	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
> > > > > 
> > > > > but
> > > > > 
> > > > > 	#define KVM_MAX_VCPU_ID  (threads_per_subcore * KVM_MAX_VCORES)
> > > > > 
> > > > > WE would need to change the allocation of the VPs I guess.
> > > > 
> > > > Yes, this is one of the structures that overflow if we don't pack the IDs.
> > > > 
> > > > > >>  static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
> > > > > >>  			     struct kvmppc_xive_src_block *sb,
> > > > > >>  			     struct kvmppc_xive_irq_state *state)
> > > > > >> @@ -1084,7 +1089,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
> > > > > >>  		pr_devel("Duplicate !\n");
> > > > > >>  		return -EEXIST;
> > > > > >>  	}
> > > > > >> -	if (cpu >= KVM_MAX_VCPUS) {
> > > > > >> +	if (cpu >= KVM_MAX_VCPU_ID) {>>
> > > > > >>  		pr_devel("Out of bounds !\n");
> > > > > >>  		return -EINVAL;
> > > > > >>  	}
> > > > > >> @@ -1098,7 +1103,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
> > > > > >>  	xc->xive = xive;
> > > > > >>  	xc->vcpu = vcpu;
> > > > > >>  	xc->server_num = cpu;
> > > > > >> -	xc->vp_id = xive->vp_base + cpu;
> > > > > >> +	xc->vp_id = xive_vp(xive, cpu);
> > > > > >>  	xc->mfrr = 0xff;
> > > > > >>  	xc->valid = true;
> > > > > >>  
> > > > > > 
> > > > > 
> > > 
> > > 
> > > 
> > 
> > 
> 
> 
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson

Patch

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 376ae803b69c..1295056d564a 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -368,4 +368,23 @@  extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
 #define SPLIT_HACK_MASK			0xff000000
 #define SPLIT_HACK_OFFS			0xfb000000
 
+/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
+ * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core stride
+ * (but not it's actual threading mode, which is not available) to avoid
+ * collisions.
+ */
+static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
+{
+	const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
+	int stride = kvm->arch.emul_smt_mode > 1 ?
+		     kvm->arch.emul_smt_mode : kvm->arch.smt_mode;
+	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
+	u32 packed_id;
+
+	BUG_ON(block >= MAX_SMT_THREADS);
+	packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
+	BUG_ON(packed_id >= KVM_MAX_VCPUS);
+	return packed_id;
+}
+
 #endif /* __ASM_KVM_BOOK3S_H__ */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 9cb9448163c4..49165cc90051 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1762,7 +1762,7 @@  static int threads_per_vcore(struct kvm *kvm)
 	return threads_per_subcore;
 }
 
-static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
+static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
 {
 	struct kvmppc_vcore *vcore;
 
@@ -1776,7 +1776,7 @@  static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
 	init_swait_queue_head(&vcore->wq);
 	vcore->preempt_tb = TB_NIL;
 	vcore->lpcr = kvm->arch.lpcr;
-	vcore->first_vcpuid = core * kvm->arch.smt_mode;
+	vcore->first_vcpuid = id;
 	vcore->kvm = kvm;
 	INIT_LIST_HEAD(&vcore->preempt_list);
 
@@ -1992,12 +1992,18 @@  static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
 	mutex_lock(&kvm->lock);
 	vcore = NULL;
 	err = -EINVAL;
-	core = id / kvm->arch.smt_mode;
+	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+		BUG_ON(kvm->arch.smt_mode != 1);
+		core = kvmppc_pack_vcpu_id(kvm, id);
+	} else {
+		core = id / kvm->arch.smt_mode;
+	}
 	if (core < KVM_MAX_VCORES) {
 		vcore = kvm->arch.vcores[core];
+		BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
 		if (!vcore) {
 			err = -ENOMEM;
-			vcore = kvmppc_vcore_create(kvm, core);
+			vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
 			kvm->arch.vcores[core] = vcore;
 			kvm->arch.online_vcores++;
 		}
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index f9818d7d3381..681dfe12a5f3 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -317,6 +317,11 @@  static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
 	return -EBUSY;
 }
 
+static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
+{
+	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
+}
+
 static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
 			     struct kvmppc_xive_src_block *sb,
 			     struct kvmppc_xive_irq_state *state)
@@ -1084,7 +1089,7 @@  int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
 		pr_devel("Duplicate !\n");
 		return -EEXIST;
 	}
-	if (cpu >= KVM_MAX_VCPUS) {
+	if (cpu >= KVM_MAX_VCPU_ID) {
 		pr_devel("Out of bounds !\n");
 		return -EINVAL;
 	}
@@ -1098,7 +1103,7 @@  int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
 	xc->xive = xive;
 	xc->vcpu = vcpu;
 	xc->server_num = cpu;
-	xc->vp_id = xive->vp_base + cpu;
+	xc->vp_id = xive_vp(xive, cpu);
 	xc->mfrr = 0xff;
 	xc->valid = true;