diff mbox series

[v3,15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

Message ID 1530270944-11351-16-git-send-email-suzuki.poulose@arm.com
State New
Headers show
Series arm64: Dynamic & 52bit IPA support | expand

Commit Message

Suzuki K Poulose June 29, 2018, 11:15 a.m. UTC
Allow specifying the physical address size for a new VM via
the kvm_type argument for KVM_CREATE_VM ioctl. This allows
us to finalise the stage2 page table format as early as possible
and hence perform the right checks on the memory slots without
complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
of the type field and can encode more information in the future if
required. The IPA size is still capped at 40bits.

Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <cdall@kernel.org>
Cc: Peter Maydel <peter.maydell@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
---
 arch/arm/include/asm/kvm_mmu.h   |  2 ++
 arch/arm64/include/asm/kvm_arm.h | 10 +++-------
 arch/arm64/include/asm/kvm_mmu.h |  2 ++
 include/uapi/linux/kvm.h         | 10 ++++++++++
 virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
 5 files changed, 39 insertions(+), 9 deletions(-)

Comments

Marc Zyngier July 2, 2018, 1:13 p.m. UTC | #1
On 29/06/18 12:15, Suzuki K Poulose wrote:
> Allow specifying the physical address size for a new VM via
> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
> us to finalise the stage2 page table format as early as possible
> and hence perform the right checks on the memory slots without
> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
> of the type field and can encode more information in the future if
> required. The IPA size is still capped at 40bits.

Can't we relax this? There is no technical reason (AFAICS) not to allow
going down to 36bit IPA if the user has requested it.

If we run on a 36bit IPA system, the default would fail. But if the user
specified "please give me a 36bit IPA VM", we could satisfy that
requirement and allow them to run their stupidly small guest!

Thanks,

	M.
Suzuki K Poulose July 2, 2018, 1:31 p.m. UTC | #2
On 02/07/18 14:13, Marc Zyngier wrote:
> On 29/06/18 12:15, Suzuki K Poulose wrote:
>> Allow specifying the physical address size for a new VM via
>> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
>> us to finalise the stage2 page table format as early as possible
>> and hence perform the right checks on the memory slots without
>> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
>> of the type field and can encode more information in the future if
>> required. The IPA size is still capped at 40bits.
> 
> Can't we relax this? There is no technical reason (AFAICS) not to allow
> going down to 36bit IPA if the user has requested it.

Sure, we can.

> 
> If we run on a 36bit IPA system, the default would fail. But if the user
> specified "please give me a 36bit IPA VM", we could satisfy that
> requirement and allow them to run their stupidly small guest!

Absolutely. I will fix this in the next version.

Cheers
Suzuki
Will Deacon July 4, 2018, 3:51 p.m. UTC | #3
Hi Suzuki,

On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
> Allow specifying the physical address size for a new VM via
> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
> us to finalise the stage2 page table format as early as possible
> and hence perform the right checks on the memory slots without
> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
> of the type field and can encode more information in the future if
> required. The IPA size is still capped at 40bits.
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Cc: Peter Maydel <peter.maydell@linaro.org>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>  arch/arm/include/asm/kvm_mmu.h   |  2 ++
>  arch/arm64/include/asm/kvm_arm.h | 10 +++-------
>  arch/arm64/include/asm/kvm_mmu.h |  2 ++
>  include/uapi/linux/kvm.h         | 10 ++++++++++
>  virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
>  5 files changed, 39 insertions(+), 9 deletions(-)

[...]

> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 4df9bb6..fa4cab0 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_S390_SIE_PAGE_OFFSET 1
>  
>  /*
> + * On arm/arm64, machine type can be used to request the physical
> + * address size for the VM. Bits [7-0] have been reserved for the
> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
> + * value 0 implies the default IPA size, which is 40bits.
> + */
> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK	0xff
> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)		\
> +	((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)

This seems like you're allocating quite a lot of bits in a non-extensible
interface to a fairly esoteric parameter. Would it be better to add another
ioctl, or condense the number of sizes you support instead?

Will
Suzuki K Poulose July 4, 2018, 10:03 p.m. UTC | #4
On 07/04/2018 04:51 PM, Will Deacon wrote:
> Hi Suzuki,
> 
> On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
>> Allow specifying the physical address size for a new VM via
>> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
>> us to finalise the stage2 page table format as early as possible
>> and hence perform the right checks on the memory slots without
>> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
>> of the type field and can encode more information in the future if
>> required. The IPA size is still capped at 40bits.
>>
>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>> Cc: Christoffer Dall <cdall@kernel.org>
>> Cc: Peter Maydel <peter.maydell@linaro.org>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> ---
>>   arch/arm/include/asm/kvm_mmu.h   |  2 ++
>>   arch/arm64/include/asm/kvm_arm.h | 10 +++-------
>>   arch/arm64/include/asm/kvm_mmu.h |  2 ++
>>   include/uapi/linux/kvm.h         | 10 ++++++++++
>>   virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
>>   5 files changed, 39 insertions(+), 9 deletions(-)
> 
> [...]
> 
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 4df9bb6..fa4cab0 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
>>   #define KVM_S390_SIE_PAGE_OFFSET 1
>>   
>>   /*
>> + * On arm/arm64, machine type can be used to request the physical
>> + * address size for the VM. Bits [7-0] have been reserved for the
>> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
>> + * value 0 implies the default IPA size, which is 40bits.
>> + */
>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK	0xff
>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)		\
>> +	((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
> 
> This seems like you're allocating quite a lot of bits in a non-extensible
> interface to a fairly esoteric parameter. Would it be better to add another
> ioctl, or condense the number of sizes you support instead?

As I explained in the other thread, we need the size as soon as the VM
is created. The major challenge is keeping the backward compatibility by
mapping 0 to 40bits. I will give it a thought.

Suzuki
Suzuki K Poulose July 6, 2018, 1:49 p.m. UTC | #5
On 04/07/18 23:03, Suzuki K Poulose wrote:
> On 07/04/2018 04:51 PM, Will Deacon wrote:
>> Hi Suzuki,
>>
>> On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
>>> Allow specifying the physical address size for a new VM via
>>> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
>>> us to finalise the stage2 page table format as early as possible
>>> and hence perform the right checks on the memory slots without
>>> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
>>> of the type field and can encode more information in the future if
>>> required. The IPA size is still capped at 40bits.
>>>
>>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>>> Cc: Christoffer Dall <cdall@kernel.org>
>>> Cc: Peter Maydel <peter.maydell@linaro.org>
>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>> ---
>>>   arch/arm/include/asm/kvm_mmu.h   |  2 ++
>>>   arch/arm64/include/asm/kvm_arm.h | 10 +++-------
>>>   arch/arm64/include/asm/kvm_mmu.h |  2 ++
>>>   include/uapi/linux/kvm.h         | 10 ++++++++++
>>>   virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
>>>   5 files changed, 39 insertions(+), 9 deletions(-)
>>
>> [...]
>>
>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>> index 4df9bb6..fa4cab0 100644
>>> --- a/include/uapi/linux/kvm.h
>>> +++ b/include/uapi/linux/kvm.h
>>> @@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
>>>   #define KVM_S390_SIE_PAGE_OFFSET 1
>>>   /*
>>> + * On arm/arm64, machine type can be used to request the physical
>>> + * address size for the VM. Bits [7-0] have been reserved for the
>>> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
>>> + * value 0 implies the default IPA size, which is 40bits.
>>> + */
>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK    0xff
>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)        \
>>> +    ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
>>
>> This seems like you're allocating quite a lot of bits in a non-extensible
>> interface to a fairly esoteric parameter. Would it be better to add another
>> ioctl, or condense the number of sizes you support instead?
> 
> As I explained in the other thread, we need the size as soon as the VM
> is created. The major challenge is keeping the backward compatibility by
> mapping 0 to 40bits. I will give it a thought.

Here is one option. We could re-use the {V}TCR_ELx.{I}PS field format, which
occupies 3 bits and has the following definitions. (ID_AA64MMFR0_EL1:PARange
also has the field definitions, except that the field is 4bits wide, but
only 3bits are used)

000 32 bits, 4GB.
001 36 bits, 64GB.
010 40 bits, 1TB.
011 42 bits, 4TB.
100 44 bits, 16TB.
101 48 bits, 256TB.
110 52 bits, 4PB

But we need to map 0 => 40bits IPA to make our ABI backward compatible. So
we could use the additional one bit to indicate that IPA size is requested
in the 3 bits.

i.e,

machine_type:

Bit [2:0]	- Requested IPA size. Values follow VTCR_EL2.PS format.

Bit [3]		- 1 => IPA Size bits (Bits[2:0]) requested.
		0 => Not requested

The only minor down side is restricting to the predefined values above,
which is not a real issue for a VM.

Thoughts ?

Suzuki
Marc Zyngier July 6, 2018, 3:09 p.m. UTC | #6
On 06/07/18 14:49, Suzuki K Poulose wrote:
> On 04/07/18 23:03, Suzuki K Poulose wrote:
>> On 07/04/2018 04:51 PM, Will Deacon wrote:
>>> Hi Suzuki,
>>>
>>> On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
>>>> Allow specifying the physical address size for a new VM via
>>>> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
>>>> us to finalise the stage2 page table format as early as possible
>>>> and hence perform the right checks on the memory slots without
>>>> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
>>>> of the type field and can encode more information in the future if
>>>> required. The IPA size is still capped at 40bits.
>>>>
>>>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>>>> Cc: Christoffer Dall <cdall@kernel.org>
>>>> Cc: Peter Maydel <peter.maydell@linaro.org>
>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>>> ---
>>>>   arch/arm/include/asm/kvm_mmu.h   |  2 ++
>>>>   arch/arm64/include/asm/kvm_arm.h | 10 +++-------
>>>>   arch/arm64/include/asm/kvm_mmu.h |  2 ++
>>>>   include/uapi/linux/kvm.h         | 10 ++++++++++
>>>>   virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
>>>>   5 files changed, 39 insertions(+), 9 deletions(-)
>>>
>>> [...]
>>>
>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>> index 4df9bb6..fa4cab0 100644
>>>> --- a/include/uapi/linux/kvm.h
>>>> +++ b/include/uapi/linux/kvm.h
>>>> @@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
>>>>   #define KVM_S390_SIE_PAGE_OFFSET 1
>>>>   /*
>>>> + * On arm/arm64, machine type can be used to request the physical
>>>> + * address size for the VM. Bits [7-0] have been reserved for the
>>>> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
>>>> + * value 0 implies the default IPA size, which is 40bits.
>>>> + */
>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK    0xff
>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)        \
>>>> +    ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
>>>
>>> This seems like you're allocating quite a lot of bits in a non-extensible
>>> interface to a fairly esoteric parameter. Would it be better to add another
>>> ioctl, or condense the number of sizes you support instead?
>>
>> As I explained in the other thread, we need the size as soon as the VM
>> is created. The major challenge is keeping the backward compatibility by
>> mapping 0 to 40bits. I will give it a thought.
> 
> Here is one option. We could re-use the {V}TCR_ELx.{I}PS field format, which
> occupies 3 bits and has the following definitions. (ID_AA64MMFR0_EL1:PARange
> also has the field definitions, except that the field is 4bits wide, but
> only 3bits are used)
> 
> 000 32 bits, 4GB.
> 001 36 bits, 64GB.
> 010 40 bits, 1TB.
> 011 42 bits, 4TB.
> 100 44 bits, 16TB.
> 101 48 bits, 256TB.
> 110 52 bits, 4PB
> 
> But we need to map 0 => 40bits IPA to make our ABI backward compatible. So
> we could use the additional one bit to indicate that IPA size is requested
> in the 3 bits.
> 
> i.e,
> 
> machine_type:
> 
> Bit [2:0]	- Requested IPA size. Values follow VTCR_EL2.PS format.
> 
> Bit [3]		- 1 => IPA Size bits (Bits[2:0]) requested.
> 		0 => Not requested
> 
> The only minor down side is restricting to the predefined values above,
> which is not a real issue for a VM.
> 
> Thoughts ?

I'd be very wary of using that 4th bit to do something that is not in
the architecture. We have only a single value left to be used (0b111),
and then your scheme clashes with the architecture definition.

I'd rather encode things in a way that is independent from the
architecture, and be done with it. You can map 0 to 40bits, and we have
the ability to express all values the architecture has (just in a
different order).

Thanks,

	M.
Suzuki K Poulose July 6, 2018, 4:39 p.m. UTC | #7
On 07/06/2018 04:09 PM, Marc Zyngier wrote:
> On 06/07/18 14:49, Suzuki K Poulose wrote:
>> On 04/07/18 23:03, Suzuki K Poulose wrote:
>>> On 07/04/2018 04:51 PM, Will Deacon wrote:
>>>> Hi Suzuki,
>>>>
>>>> On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
>>>>> Allow specifying the physical address size for a new VM via
>>>>> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
>>>>> us to finalise the stage2 page table format as early as possible
>>>>> and hence perform the right checks on the memory slots without
>>>>> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
>>>>> of the type field and can encode more information in the future if
>>>>> required. The IPA size is still capped at 40bits.
>>>>>
>>>>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>>>>> Cc: Christoffer Dall <cdall@kernel.org>
>>>>> Cc: Peter Maydel <peter.maydell@linaro.org>
>>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>>>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>>>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>>>> ---
>>>>>    arch/arm/include/asm/kvm_mmu.h   |  2 ++
>>>>>    arch/arm64/include/asm/kvm_arm.h | 10 +++-------
>>>>>    arch/arm64/include/asm/kvm_mmu.h |  2 ++
>>>>>    include/uapi/linux/kvm.h         | 10 ++++++++++
>>>>>    virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
>>>>>    5 files changed, 39 insertions(+), 9 deletions(-)
>>>>
>>>> [...]
>>>>
>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>> index 4df9bb6..fa4cab0 100644
>>>>> --- a/include/uapi/linux/kvm.h
>>>>> +++ b/include/uapi/linux/kvm.h
>>>>> @@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
>>>>>    #define KVM_S390_SIE_PAGE_OFFSET 1
>>>>>    /*
>>>>> + * On arm/arm64, machine type can be used to request the physical
>>>>> + * address size for the VM. Bits [7-0] have been reserved for the
>>>>> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
>>>>> + * value 0 implies the default IPA size, which is 40bits.
>>>>> + */
>>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK    0xff
>>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)        \
>>>>> +    ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
>>>>
>>>> This seems like you're allocating quite a lot of bits in a non-extensible
>>>> interface to a fairly esoteric parameter. Would it be better to add another
>>>> ioctl, or condense the number of sizes you support instead?
>>>
>>> As I explained in the other thread, we need the size as soon as the VM
>>> is created. The major challenge is keeping the backward compatibility by
>>> mapping 0 to 40bits. I will give it a thought.
>>
>> Here is one option. We could re-use the {V}TCR_ELx.{I}PS field format, which
>> occupies 3 bits and has the following definitions. (ID_AA64MMFR0_EL1:PARange
>> also has the field definitions, except that the field is 4bits wide, but
>> only 3bits are used)
>>
>> 000 32 bits, 4GB.
>> 001 36 bits, 64GB.
>> 010 40 bits, 1TB.
>> 011 42 bits, 4TB.
>> 100 44 bits, 16TB.
>> 101 48 bits, 256TB.
>> 110 52 bits, 4PB
>>
>> But we need to map 0 => 40bits IPA to make our ABI backward compatible. So
>> we could use the additional one bit to indicate that IPA size is requested
>> in the 3 bits.
>>
>> i.e,
>>
>> machine_type:
>>
>> Bit [2:0]	- Requested IPA size. Values follow VTCR_EL2.PS format.
>>
>> Bit [3]		- 1 => IPA Size bits (Bits[2:0]) requested.
>> 		0 => Not requested
>>
>> The only minor down side is restricting to the predefined values above,
>> which is not a real issue for a VM.
>>
>> Thoughts ?
> 
> I'd be very wary of using that 4th bit to do something that is not in
> the architecture. We have only a single value left to be used (0b111),
> and then your scheme clashes with the architecture definition.

I agree. However, if we ever go beyond the 3bits in PARange, we have an
issue with {V}TCR counter part. But lets not take that chance.

> 
> I'd rather encode things in a way that is independent from the
> architecture, and be done with it. You can map 0 to 40bits, and we have
> the ability to express all values the architecture has (just in a
> different order).

The other option I can think of is encoding a signed number which is the 
difference of the IPA from 40. But that would need 5 bits if we were to
encode it as it is. And if we want to squeeze it in 4bit, we could store 
half the difference (limiting the IPA limit to even numbers).

i.e IPA = 40 + 2 * sign_extend(bits[3:0);


Suzuki
Dave Martin July 9, 2018, 11:23 a.m. UTC | #8
On Fri, Jul 06, 2018 at 05:39:00PM +0100, Suzuki K Poulose wrote:
> On 07/06/2018 04:09 PM, Marc Zyngier wrote:
> >On 06/07/18 14:49, Suzuki K Poulose wrote:
> >>On 04/07/18 23:03, Suzuki K Poulose wrote:
> >>>On 07/04/2018 04:51 PM, Will Deacon wrote:
> >>>>Hi Suzuki,
> >>>>
> >>>>On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
> >>>>>Allow specifying the physical address size for a new VM via
> >>>>>the kvm_type argument for KVM_CREATE_VM ioctl. This allows
> >>>>>us to finalise the stage2 page table format as early as possible
> >>>>>and hence perform the right checks on the memory slots without
> >>>>>complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
> >>>>>of the type field and can encode more information in the future if
> >>>>>required. The IPA size is still capped at 40bits.
> >>>>>
> >>>>>Cc: Marc Zyngier <marc.zyngier@arm.com>
> >>>>>Cc: Christoffer Dall <cdall@kernel.org>
> >>>>>Cc: Peter Maydel <peter.maydell@linaro.org>
> >>>>>Cc: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>Cc: Radim Krčmář <rkrcmar@redhat.com>
> >>>>>Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> >>>>>---
> >>>>>   arch/arm/include/asm/kvm_mmu.h   |  2 ++
> >>>>>   arch/arm64/include/asm/kvm_arm.h | 10 +++-------
> >>>>>   arch/arm64/include/asm/kvm_mmu.h |  2 ++
> >>>>>   include/uapi/linux/kvm.h         | 10 ++++++++++
> >>>>>   virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
> >>>>>   5 files changed, 39 insertions(+), 9 deletions(-)
> >>>>
> >>>>[...]
> >>>>
> >>>>>diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>>>index 4df9bb6..fa4cab0 100644
> >>>>>--- a/include/uapi/linux/kvm.h
> >>>>>+++ b/include/uapi/linux/kvm.h
> >>>>>@@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
> >>>>>   #define KVM_S390_SIE_PAGE_OFFSET 1
> >>>>>   /*
> >>>>>+ * On arm/arm64, machine type can be used to request the physical
> >>>>>+ * address size for the VM. Bits [7-0] have been reserved for the
> >>>>>+ * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
> >>>>>+ * value 0 implies the default IPA size, which is 40bits.
> >>>>>+ */
> >>>>>+#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK    0xff
> >>>>>+#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)        \
> >>>>>+    ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
> >>>>
> >>>>This seems like you're allocating quite a lot of bits in a non-extensible
> >>>>interface to a fairly esoteric parameter. Would it be better to add another
> >>>>ioctl, or condense the number of sizes you support instead?
> >>>
> >>>As I explained in the other thread, we need the size as soon as the VM
> >>>is created. The major challenge is keeping the backward compatibility by
> >>>mapping 0 to 40bits. I will give it a thought.
> >>
> >>Here is one option. We could re-use the {V}TCR_ELx.{I}PS field format, which
> >>occupies 3 bits and has the following definitions. (ID_AA64MMFR0_EL1:PARange
> >>also has the field definitions, except that the field is 4bits wide, but
> >>only 3bits are used)
> >>
> >>000 32 bits, 4GB.
> >>001 36 bits, 64GB.
> >>010 40 bits, 1TB.
> >>011 42 bits, 4TB.
> >>100 44 bits, 16TB.
> >>101 48 bits, 256TB.
> >>110 52 bits, 4PB
> >>
> >>But we need to map 0 => 40bits IPA to make our ABI backward compatible. So
> >>we could use the additional one bit to indicate that IPA size is requested
> >>in the 3 bits.
> >>
> >>i.e,
> >>
> >>machine_type:
> >>
> >>Bit [2:0]	- Requested IPA size. Values follow VTCR_EL2.PS format.
> >>
> >>Bit [3]		- 1 => IPA Size bits (Bits[2:0]) requested.
> >>		0 => Not requested
> >>
> >>The only minor down side is restricting to the predefined values above,
> >>which is not a real issue for a VM.
> >>
> >>Thoughts ?
> >
> >I'd be very wary of using that 4th bit to do something that is not in
> >the architecture. We have only a single value left to be used (0b111),
> >and then your scheme clashes with the architecture definition.
> 
> I agree. However, if we ever go beyond the 3bits in PARange, we have an
> issue with {V}TCR counter part. But lets not take that chance.
> 
> >
> >I'd rather encode things in a way that is independent from the
> >architecture, and be done with it. You can map 0 to 40bits, and we have
> >the ability to express all values the architecture has (just in a
> >different order).
> 
> The other option I can think of is encoding a signed number which is the
> difference of the IPA from 40. But that would need 5 bits if we were to
> encode it as it is. And if we want to squeeze it in 4bit, we could store
> half the difference (limiting the IPA limit to even numbers).
> 
> i.e IPA = 40 + 2 * sign_extend(bits[3:0);

I came across similar issues when trying to work out how to enable
SVE for KVM.  In the end I reduced this to a per-vcpu feature, but
it means that there is no global opt-in for the SVE-specific KVM
API extensions:

That's a bit gross, because SVE may require a change to the way
vcpus are initialised.  The set of supported SVE vector lengths needs
to be set somehow before the vcpu is set running, but it's tricky do
do that without a new ioctl -- which would mean that if SVE is enabled
for a vcpu then the vcpu is not considered runnable until the new
magic ioctl is called.

Opting into that semantic change globally at VM creation time might
be preferable.  On the SVE side, this is still very much subject to
review/change.


Here:

The KVM_CREATE_VM init argument seems undefined by the KVM core code and
is available for arches to abuse in creative ways.  x86 and arm have
nothing here and reject non-zero values with -EINVAL; s390 treats it as
a bitmask, and defines a sincle feature-like bit here; powerpc treats it
as an enumeration of VM types.

If we want to be extensible, we could

 a) Pass a pointer in type, and come up with some extensible VM parameter
    struct for it to point to (which then wouldn't need a cryptic
    compressed encoding), or

 b) Introduce a new "KVM_CREATE_VM2" variant that either takes such
    an argument, or mandates a parameter negotiation phase involving
    additional ioctls before marking the VM as ready for vcpu and
    device creation.

(a) feels like an easy backwards-compatible approach, but cannot be
readily adopted by other arches (maybe not an issue).

(b) might be considered overengineered, so it would need a bit of
thought.

Wedging arguments into a few bits in the type argument feels awkward,
and may be regretted later if we run out of bits, or something can't be
represented in the chosen encoding.

Cheers
---Dave
Marc Zyngier July 9, 2018, 12:29 p.m. UTC | #9
On 09/07/18 12:23, Dave Martin wrote:
> On Fri, Jul 06, 2018 at 05:39:00PM +0100, Suzuki K Poulose wrote:
>> On 07/06/2018 04:09 PM, Marc Zyngier wrote:
>>> On 06/07/18 14:49, Suzuki K Poulose wrote:
>>>> On 04/07/18 23:03, Suzuki K Poulose wrote:
>>>>> On 07/04/2018 04:51 PM, Will Deacon wrote:
>>>>>> Hi Suzuki,
>>>>>>
>>>>>> On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
>>>>>>> Allow specifying the physical address size for a new VM via
>>>>>>> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
>>>>>>> us to finalise the stage2 page table format as early as possible
>>>>>>> and hence perform the right checks on the memory slots without
>>>>>>> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
>>>>>>> of the type field and can encode more information in the future if
>>>>>>> required. The IPA size is still capped at 40bits.
>>>>>>>
>>>>>>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>>>>>>> Cc: Christoffer Dall <cdall@kernel.org>
>>>>>>> Cc: Peter Maydel <peter.maydell@linaro.org>
>>>>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>>>>>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>>>>>> ---
>>>>>>>   arch/arm/include/asm/kvm_mmu.h   |  2 ++
>>>>>>>   arch/arm64/include/asm/kvm_arm.h | 10 +++-------
>>>>>>>   arch/arm64/include/asm/kvm_mmu.h |  2 ++
>>>>>>>   include/uapi/linux/kvm.h         | 10 ++++++++++
>>>>>>>   virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
>>>>>>>   5 files changed, 39 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>>>> index 4df9bb6..fa4cab0 100644
>>>>>>> --- a/include/uapi/linux/kvm.h
>>>>>>> +++ b/include/uapi/linux/kvm.h
>>>>>>> @@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
>>>>>>>   #define KVM_S390_SIE_PAGE_OFFSET 1
>>>>>>>   /*
>>>>>>> + * On arm/arm64, machine type can be used to request the physical
>>>>>>> + * address size for the VM. Bits [7-0] have been reserved for the
>>>>>>> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
>>>>>>> + * value 0 implies the default IPA size, which is 40bits.
>>>>>>> + */
>>>>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK    0xff
>>>>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)        \
>>>>>>> +    ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
>>>>>>
>>>>>> This seems like you're allocating quite a lot of bits in a non-extensible
>>>>>> interface to a fairly esoteric parameter. Would it be better to add another
>>>>>> ioctl, or condense the number of sizes you support instead?
>>>>>
>>>>> As I explained in the other thread, we need the size as soon as the VM
>>>>> is created. The major challenge is keeping the backward compatibility by
>>>>> mapping 0 to 40bits. I will give it a thought.
>>>>
>>>> Here is one option. We could re-use the {V}TCR_ELx.{I}PS field format, which
>>>> occupies 3 bits and has the following definitions. (ID_AA64MMFR0_EL1:PARange
>>>> also has the field definitions, except that the field is 4bits wide, but
>>>> only 3bits are used)
>>>>
>>>> 000 32 bits, 4GB.
>>>> 001 36 bits, 64GB.
>>>> 010 40 bits, 1TB.
>>>> 011 42 bits, 4TB.
>>>> 100 44 bits, 16TB.
>>>> 101 48 bits, 256TB.
>>>> 110 52 bits, 4PB
>>>>
>>>> But we need to map 0 => 40bits IPA to make our ABI backward compatible. So
>>>> we could use the additional one bit to indicate that IPA size is requested
>>>> in the 3 bits.
>>>>
>>>> i.e,
>>>>
>>>> machine_type:
>>>>
>>>> Bit [2:0]	- Requested IPA size. Values follow VTCR_EL2.PS format.
>>>>
>>>> Bit [3]		- 1 => IPA Size bits (Bits[2:0]) requested.
>>>> 		0 => Not requested
>>>>
>>>> The only minor down side is restricting to the predefined values above,
>>>> which is not a real issue for a VM.
>>>>
>>>> Thoughts ?
>>>
>>> I'd be very wary of using that 4th bit to do something that is not in
>>> the architecture. We have only a single value left to be used (0b111),
>>> and then your scheme clashes with the architecture definition.
>>
>> I agree. However, if we ever go beyond the 3bits in PARange, we have an
>> issue with {V}TCR counter part. But lets not take that chance.
>>
>>>
>>> I'd rather encode things in a way that is independent from the
>>> architecture, and be done with it. You can map 0 to 40bits, and we have
>>> the ability to express all values the architecture has (just in a
>>> different order).
>>
>> The other option I can think of is encoding a signed number which is the
>> difference of the IPA from 40. But that would need 5 bits if we were to
>> encode it as it is. And if we want to squeeze it in 4bit, we could store
>> half the difference (limiting the IPA limit to even numbers).
>>
>> i.e IPA = 40 + 2 * sign_extend(bits[3:0);
> 
> I came across similar issues when trying to work out how to enable
> SVE for KVM.  In the end I reduced this to a per-vcpu feature, but
> it means that there is no global opt-in for the SVE-specific KVM
> API extensions:
> 
> That's a bit gross, because SVE may require a change to the way
> vcpus are initialised.  The set of supported SVE vector lengths needs
> to be set somehow before the vcpu is set running, but it's tricky do
> do that without a new ioctl -- which would mean that if SVE is enabled
> for a vcpu then the vcpu is not considered runnable until the new
> magic ioctl is called.
> 
> Opting into that semantic change globally at VM creation time might
> be preferable.  On the SVE side, this is still very much subject to
> review/change.
> 
> 
> Here:
> 
> The KVM_CREATE_VM init argument seems undefined by the KVM core code and
> is available for arches to abuse in creative ways.  x86 and arm have
> nothing here and reject non-zero values with -EINVAL; s390 treats it as
> a bitmask, and defines a sincle feature-like bit here; powerpc treats it
> as an enumeration of VM types.
> 
> If we want to be extensible, we could
> 
>  a) Pass a pointer in type, and come up with some extensible VM parameter
>     struct for it to point to (which then wouldn't need a cryptic
>     compressed encoding), or
> 
>  b) Introduce a new "KVM_CREATE_VM2" variant that either takes such
>     an argument, or mandates a parameter negotiation phase involving
>     additional ioctls before marking the VM as ready for vcpu and
>     device creation.
> 
> (a) feels like an easy backwards-compatible approach, but cannot be
> readily adopted by other arches (maybe not an issue).
> 
> (b) might be considered overengineered, so it would need a bit of
> thought.
> 
> Wedging arguments into a few bits in the type argument feels awkward,
> and may be regretted later if we run out of bits, or something can't be
> represented in the chosen encoding.

I think that's a pretty convincing argument for a "better" CREATE_VM,
one that would have a clearly defined, structured (and potentially
extensible) argument.

I've quickly hacked the following:

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b6270a3b38e9..3e76214034c2 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -735,6 +735,20 @@ struct kvm_ppc_resize_hpt {
 	__u32 pad;
 };

+struct kvm_create_vm2 {
+	__u64	version;	/* Or maybe not */
+	union {
+		struct {
+#define KVM_ARM_SVE_CAPABLE	(1 << 0)
+#define KVM_ARM_SELECT_IPA	{1 << 1)
+			__u64	capabilities;
+			__u16	sve_vlen;
+			__u8	ipa_size;
+		} arm64;
+		__u64	dummy[15];
+	};
+};
+
 #define KVMIO 0xAE

 /* machine type bits, to be used as argument to KVM_CREATE_VM */

Other architectures could fill in their own bits if they need to.

Thoughts?

	M.
Dave Martin July 9, 2018, 1:37 p.m. UTC | #10
On Mon, Jul 09, 2018 at 01:29:42PM +0100, Marc Zyngier wrote:
> On 09/07/18 12:23, Dave Martin wrote:
> > On Fri, Jul 06, 2018 at 05:39:00PM +0100, Suzuki K Poulose wrote:
> >> On 07/06/2018 04:09 PM, Marc Zyngier wrote:
> >>> On 06/07/18 14:49, Suzuki K Poulose wrote:
> >>>> On 04/07/18 23:03, Suzuki K Poulose wrote:
> >>>>> On 07/04/2018 04:51 PM, Will Deacon wrote:
> >>>>>> Hi Suzuki,
> >>>>>>
> >>>>>> On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
> >>>>>>> Allow specifying the physical address size for a new VM via
> >>>>>>> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
> >>>>>>> us to finalise the stage2 page table format as early as possible
> >>>>>>> and hence perform the right checks on the memory slots without
> >>>>>>> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
> >>>>>>> of the type field and can encode more information in the future if
> >>>>>>> required. The IPA size is still capped at 40bits.
> >>>>>>>
> >>>>>>> Cc: Marc Zyngier <marc.zyngier@arm.com>
> >>>>>>> Cc: Christoffer Dall <cdall@kernel.org>
> >>>>>>> Cc: Peter Maydel <peter.maydell@linaro.org>
> >>>>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
> >>>>>>> Cc: Radim Krčmář <rkrcmar@redhat.com>
> >>>>>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> >>>>>>> ---
> >>>>>>>   arch/arm/include/asm/kvm_mmu.h   |  2 ++
> >>>>>>>   arch/arm64/include/asm/kvm_arm.h | 10 +++-------
> >>>>>>>   arch/arm64/include/asm/kvm_mmu.h |  2 ++
> >>>>>>>   include/uapi/linux/kvm.h         | 10 ++++++++++
> >>>>>>>   virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
> >>>>>>>   5 files changed, 39 insertions(+), 9 deletions(-)
> >>>>>>
> >>>>>> [...]
> >>>>>>
> >>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>>>>> index 4df9bb6..fa4cab0 100644
> >>>>>>> --- a/include/uapi/linux/kvm.h
> >>>>>>> +++ b/include/uapi/linux/kvm.h
> >>>>>>> @@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
> >>>>>>>   #define KVM_S390_SIE_PAGE_OFFSET 1
> >>>>>>>   /*
> >>>>>>> + * On arm/arm64, machine type can be used to request the physical
> >>>>>>> + * address size for the VM. Bits [7-0] have been reserved for the
> >>>>>>> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
> >>>>>>> + * value 0 implies the default IPA size, which is 40bits.
> >>>>>>> + */
> >>>>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK    0xff
> >>>>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)        \
> >>>>>>> +    ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
> >>>>>>
> >>>>>> This seems like you're allocating quite a lot of bits in a non-extensible
> >>>>>> interface to a fairly esoteric parameter. Would it be better to add another
> >>>>>> ioctl, or condense the number of sizes you support instead?
> >>>>>
> >>>>> As I explained in the other thread, we need the size as soon as the VM
> >>>>> is created. The major challenge is keeping the backward compatibility by
> >>>>> mapping 0 to 40bits. I will give it a thought.
> >>>>
> >>>> Here is one option. We could re-use the {V}TCR_ELx.{I}PS field format, which
> >>>> occupies 3 bits and has the following definitions. (ID_AA64MMFR0_EL1:PARange
> >>>> also has the field definitions, except that the field is 4bits wide, but
> >>>> only 3bits are used)
> >>>>
> >>>> 000 32 bits, 4GB.
> >>>> 001 36 bits, 64GB.
> >>>> 010 40 bits, 1TB.
> >>>> 011 42 bits, 4TB.
> >>>> 100 44 bits, 16TB.
> >>>> 101 48 bits, 256TB.
> >>>> 110 52 bits, 4PB
> >>>>
> >>>> But we need to map 0 => 40bits IPA to make our ABI backward compatible. So
> >>>> we could use the additional one bit to indicate that IPA size is requested
> >>>> in the 3 bits.
> >>>>
> >>>> i.e,
> >>>>
> >>>> machine_type:
> >>>>
> >>>> Bit [2:0]	- Requested IPA size. Values follow VTCR_EL2.PS format.
> >>>>
> >>>> Bit [3]		- 1 => IPA Size bits (Bits[2:0]) requested.
> >>>> 		0 => Not requested
> >>>>
> >>>> The only minor down side is restricting to the predefined values above,
> >>>> which is not a real issue for a VM.
> >>>>
> >>>> Thoughts ?
> >>>
> >>> I'd be very wary of using that 4th bit to do something that is not in
> >>> the architecture. We have only a single value left to be used (0b111),
> >>> and then your scheme clashes with the architecture definition.
> >>
> >> I agree. However, if we ever go beyond the 3bits in PARange, we have an
> >> issue with {V}TCR counter part. But lets not take that chance.
> >>
> >>>
> >>> I'd rather encode things in a way that is independent from the
> >>> architecture, and be done with it. You can map 0 to 40bits, and we have
> >>> the ability to express all values the architecture has (just in a
> >>> different order).
> >>
> >> The other option I can think of is encoding a signed number which is the
> >> difference of the IPA from 40. But that would need 5 bits if we were to
> >> encode it as it is. And if we want to squeeze it in 4bit, we could store
> >> half the difference (limiting the IPA limit to even numbers).
> >>
> >> i.e IPA = 40 + 2 * sign_extend(bits[3:0);
> > 
> > I came across similar issues when trying to work out how to enable
> > SVE for KVM.  In the end I reduced this to a per-vcpu feature, but
> > it means that there is no global opt-in for the SVE-specific KVM
> > API extensions:
> > 
> > That's a bit gross, because SVE may require a change to the way
> > vcpus are initialised.  The set of supported SVE vector lengths needs
> > to be set somehow before the vcpu is set running, but it's tricky do
> > do that without a new ioctl -- which would mean that if SVE is enabled
> > for a vcpu then the vcpu is not considered runnable until the new
> > magic ioctl is called.
> > 
> > Opting into that semantic change globally at VM creation time might
> > be preferable.  On the SVE side, this is still very much subject to
> > review/change.
> > 
> > 
> > Here:
> > 
> > The KVM_CREATE_VM init argument seems undefined by the KVM core code and
> > is available for arches to abuse in creative ways.  x86 and arm have
> > nothing here and reject non-zero values with -EINVAL; s390 treats it as
> > a bitmask, and defines a sincle feature-like bit here; powerpc treats it
> > as an enumeration of VM types.
> > 
> > If we want to be extensible, we could
> > 
> >  a) Pass a pointer in type, and come up with some extensible VM parameter
> >     struct for it to point to (which then wouldn't need a cryptic
> >     compressed encoding), or
> > 
> >  b) Introduce a new "KVM_CREATE_VM2" variant that either takes such
> >     an argument, or mandates a parameter negotiation phase involving
> >     additional ioctls before marking the VM as ready for vcpu and
> >     device creation.
> > 
> > (a) feels like an easy backwards-compatible approach, but cannot be
> > readily adopted by other arches (maybe not an issue).
> > 
> > (b) might be considered overengineered, so it would need a bit of
> > thought.
> > 
> > Wedging arguments into a few bits in the type argument feels awkward,
> > and may be regretted later if we run out of bits, or something can't be
> > represented in the chosen encoding.
> 
> I think that's a pretty convincing argument for a "better" CREATE_VM,
> one that would have a clearly defined, structured (and potentially
> extensible) argument.
> 
> I've quickly hacked the following:
> 
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index b6270a3b38e9..3e76214034c2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -735,6 +735,20 @@ struct kvm_ppc_resize_hpt {
>  	__u32 pad;
>  };
> 
> +struct kvm_create_vm2 {
> +	__u64	version;	/* Or maybe not */
> +	union {
> +		struct {
> +#define KVM_ARM_SVE_CAPABLE	(1 << 0)
> +#define KVM_ARM_SELECT_IPA	{1 << 1)
> +			__u64	capabilities;
> +			__u16	sve_vlen;
> +			__u8	ipa_size;
> +		} arm64;
> +		__u64	dummy[15];
> +	};
> +};
> +
>  #define KVMIO 0xAE
> 
>  /* machine type bits, to be used as argument to KVM_CREATE_VM */
> 
> Other architectures could fill in their own bits if they need to.
> 
> Thoughts?

This kind of thing should work, but it may still get messy when we
add additional fields.

It we want this to work cross-arch, would it make sense to go
for a more generic approach, say

struct kvm_create_vm_attr_any {
        __u32   type;
};

#define KVM_CREATE_VM_ATTR_ARCH_CAPABILITIES 1
struct kvm_create_vm_attr_arch_capabilities {
        __u32   type;
        __u16   size; /* support future expansion of capabilities[] */
        __u16   reserved;
        __u64   capabilities[1];
};

#define KVM_CREATE_VM_ATTR_ARM64_PHYSADDR_SIZE 2
struct kvm_create_vm_attr_arm64_physaddr_size {
        __u32   type;
        __u32   physaddr_bits;
};

/* ... */

union kvm_create_vm_attr {
        struct kvm_create_vm_attr_any;
        struct kvm_create_vm_attr_arch_capabilities;
        struct kvm_create_vm_attr_arm64_physaddr_size;
        /* ... */
};

struct kvm_create_vm2 {
        __u32   version;        /* harmless, even if not useful */
        __u16   nr_attrs;       /* or could just terminate attrs with a
                                   NULL entry */
        union kvm_create_vm_attr __user *__user *attrs;
};


This is quite flexible, but obviously a bit heavy.

However, if we're adding a new interface due to lack of extensibility,
it may be worth going for something that's freely extensible.


Userspace might call this as

	struct kvm_create_vm_attr_arch_capabilities vm_arch_caps = {
		.type = KVM_CREATE_VM_ATTR_ARCH_CAPABILITIES,
		.size = 64,
		.capabilities[0] = KVM_CREATE_VM_ARM64_VCPU_NEEDS_SET_SVE_VLS,
	};

	struct kvm_create_vm_attr_arch_arm64_physaddr_size = {
		.type = KVM_CREATE_VM_ATTR_ARM64_PHYSADDR_SIZE,
		.physaddr_bits = 52,
	};

	union kvm_create_vm_attr **vmattrs[] = {
		&vm_arch_caps,
		&vm_arm64_physaddr_size,
		NULL, /* maybe */
	};

	struct kvm_create_vm2 vm;

	vm.version = 0;
	vm.nr_attrs = 2; /* maybe */
	vm.attrs = vmattrs;

	ioctl(..., KVM_CREATE_VM2, &vm);

Cheers
---Dave
Suzuki K Poulose July 10, 2018, 4:38 p.m. UTC | #11
On 09/07/18 14:37, Dave Martin wrote:
> On Mon, Jul 09, 2018 at 01:29:42PM +0100, Marc Zyngier wrote:
>> On 09/07/18 12:23, Dave Martin wrote:
>>> On Fri, Jul 06, 2018 at 05:39:00PM +0100, Suzuki K Poulose wrote:
>>>> On 07/06/2018 04:09 PM, Marc Zyngier wrote:
>>>>> On 06/07/18 14:49, Suzuki K Poulose wrote:
>>>>>> On 04/07/18 23:03, Suzuki K Poulose wrote:
>>>>>>> On 07/04/2018 04:51 PM, Will Deacon wrote:
>>>>>>>> Hi Suzuki,
>>>>>>>>
>>>>>>>> On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
>>>>>>>>> Allow specifying the physical address size for a new VM via
>>>>>>>>> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
>>>>>>>>> us to finalise the stage2 page table format as early as possible
>>>>>>>>> and hence perform the right checks on the memory slots without
>>>>>>>>> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
>>>>>>>>> of the type field and can encode more information in the future if
>>>>>>>>> required. The IPA size is still capped at 40bits.
>>>>>>>>>
>>>>>>>>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>>>>>>>>> Cc: Christoffer Dall <cdall@kernel.org>
>>>>>>>>> Cc: Peter Maydel <peter.maydell@linaro.org>
>>>>>>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>>>>>>>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>>>>>>>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>>>>>>>> ---
>>>>>>>>>    arch/arm/include/asm/kvm_mmu.h   |  2 ++
>>>>>>>>>    arch/arm64/include/asm/kvm_arm.h | 10 +++-------
>>>>>>>>>    arch/arm64/include/asm/kvm_mmu.h |  2 ++
>>>>>>>>>    include/uapi/linux/kvm.h         | 10 ++++++++++
>>>>>>>>>    virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
>>>>>>>>>    5 files changed, 39 insertions(+), 9 deletions(-)
>>>>>>>>
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>>>>>> index 4df9bb6..fa4cab0 100644
>>>>>>>>> --- a/include/uapi/linux/kvm.h
>>>>>>>>> +++ b/include/uapi/linux/kvm.h
>>>>>>>>> @@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
>>>>>>>>>    #define KVM_S390_SIE_PAGE_OFFSET 1
>>>>>>>>>    /*
>>>>>>>>> + * On arm/arm64, machine type can be used to request the physical
>>>>>>>>> + * address size for the VM. Bits [7-0] have been reserved for the
>>>>>>>>> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
>>>>>>>>> + * value 0 implies the default IPA size, which is 40bits.
>>>>>>>>> + */
>>>>>>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK    0xff
>>>>>>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)        \
>>>>>>>>> +    ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
>>>>>>>>
>>>>>>>> This seems like you're allocating quite a lot of bits in a non-extensible
>>>>>>>> interface to a fairly esoteric parameter. Would it be better to add another
>>>>>>>> ioctl, or condense the number of sizes you support instead?
>>>>>>>
>>>>>>> As I explained in the other thread, we need the size as soon as the VM
>>>>>>> is created. The major challenge is keeping the backward compatibility by
>>>>>>> mapping 0 to 40bits. I will give it a thought.
>>>>>>
>>>>>> Here is one option. We could re-use the {V}TCR_ELx.{I}PS field format, which
>>>>>> occupies 3 bits and has the following definitions. (ID_AA64MMFR0_EL1:PARange
>>>>>> also has the field definitions, except that the field is 4bits wide, but
>>>>>> only 3bits are used)
>>>>>>
>>>>>> 000 32 bits, 4GB.
>>>>>> 001 36 bits, 64GB.
>>>>>> 010 40 bits, 1TB.
>>>>>> 011 42 bits, 4TB.
>>>>>> 100 44 bits, 16TB.
>>>>>> 101 48 bits, 256TB.
>>>>>> 110 52 bits, 4PB
>>>>>>
>>>>>> But we need to map 0 => 40bits IPA to make our ABI backward compatible. So
>>>>>> we could use the additional one bit to indicate that IPA size is requested
>>>>>> in the 3 bits.
>>>>>>
>>>>>> i.e,
>>>>>>
>>>>>> machine_type:
>>>>>>
>>>>>> Bit [2:0]	- Requested IPA size. Values follow VTCR_EL2.PS format.
>>>>>>
>>>>>> Bit [3]		- 1 => IPA Size bits (Bits[2:0]) requested.
>>>>>> 		0 => Not requested
>>>>>>
>>>>>> The only minor down side is restricting to the predefined values above,
>>>>>> which is not a real issue for a VM.
>>>>>>
>>>>>> Thoughts ?
>>>>>
>>>>> I'd be very wary of using that 4th bit to do something that is not in
>>>>> the architecture. We have only a single value left to be used (0b111),
>>>>> and then your scheme clashes with the architecture definition.
>>>>
>>>> I agree. However, if we ever go beyond the 3bits in PARange, we have an
>>>> issue with {V}TCR counter part. But lets not take that chance.
>>>>
>>>>>
>>>>> I'd rather encode things in a way that is independent from the
>>>>> architecture, and be done with it. You can map 0 to 40bits, and we have
>>>>> the ability to express all values the architecture has (just in a
>>>>> different order).
>>>>
>>>> The other option I can think of is encoding a signed number which is the
>>>> difference of the IPA from 40. But that would need 5 bits if we were to
>>>> encode it as it is. And if we want to squeeze it in 4bit, we could store
>>>> half the difference (limiting the IPA limit to even numbers).
>>>>
>>>> i.e IPA = 40 + 2 * sign_extend(bits[3:0);
>>>
>>> I came across similar issues when trying to work out how to enable
>>> SVE for KVM.  In the end I reduced this to a per-vcpu feature, but
>>> it means that there is no global opt-in for the SVE-specific KVM
>>> API extensions:
>>>
>>> That's a bit gross, because SVE may require a change to the way
>>> vcpus are initialised.  The set of supported SVE vector lengths needs
>>> to be set somehow before the vcpu is set running, but it's tricky do
>>> do that without a new ioctl -- which would mean that if SVE is enabled
>>> for a vcpu then the vcpu is not considered runnable until the new
>>> magic ioctl is called.
>>>
>>> Opting into that semantic change globally at VM creation time might
>>> be preferable.  On the SVE side, this is still very much subject to
>>> review/change.
>>>
>>>
>>> Here:
>>>
>>> The KVM_CREATE_VM init argument seems undefined by the KVM core code and
>>> is available for arches to abuse in creative ways.  x86 and arm have
>>> nothing here and reject non-zero values with -EINVAL; s390 treats it as
>>> a bitmask, and defines a sincle feature-like bit here; powerpc treats it
>>> as an enumeration of VM types.
>>>
>>> If we want to be extensible, we could
>>>
>>>   a) Pass a pointer in type, and come up with some extensible VM parameter
>>>      struct for it to point to (which then wouldn't need a cryptic
>>>      compressed encoding), or
>>>
>>>   b) Introduce a new "KVM_CREATE_VM2" variant that either takes such
>>>      an argument, or mandates a parameter negotiation phase involving
>>>      additional ioctls before marking the VM as ready for vcpu and
>>>      device creation.
>>>
>>> (a) feels like an easy backwards-compatible approach, but cannot be
>>> readily adopted by other arches (maybe not an issue).
>>>
>>> (b) might be considered overengineered, so it would need a bit of
>>> thought.
>>>
>>> Wedging arguments into a few bits in the type argument feels awkward,
>>> and may be regretted later if we run out of bits, or something can't be
>>> represented in the chosen encoding.
>>
>> I think that's a pretty convincing argument for a "better" CREATE_VM,
>> one that would have a clearly defined, structured (and potentially
>> extensible) argument.
>>
>> I've quickly hacked the following:
>>
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index b6270a3b38e9..3e76214034c2 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -735,6 +735,20 @@ struct kvm_ppc_resize_hpt {
>>   	__u32 pad;
>>   };
>>
>> +struct kvm_create_vm2 {
>> +	__u64	version;	/* Or maybe not */
>> +	union {
>> +		struct {
>> +#define KVM_ARM_SVE_CAPABLE	(1 << 0)
>> +#define KVM_ARM_SELECT_IPA	{1 << 1)
>> +			__u64	capabilities;
>> +			__u16	sve_vlen;
>> +			__u8	ipa_size;
>> +		} arm64;
>> +		__u64	dummy[15];
>> +	};
>> +};
>> +
>>   #define KVMIO 0xAE
>>
>>   /* machine type bits, to be used as argument to KVM_CREATE_VM */
>>
>> Other architectures could fill in their own bits if they need to.
>>
>> Thoughts?
> 
> This kind of thing should work, but it may still get messy when we
> add additional fields.


Marc, Dave,

I like Dave's approach. Some comments below.

> 
> It we want this to work cross-arch, would it make sense to go
> for a more generic approach, say
> 
> struct kvm_create_vm_attr_any {
>          __u32   type;
> };
> 
> #define KVM_CREATE_VM_ATTR_ARCH_CAPABILITIES 1
> struct kvm_create_vm_attr_arch_capabilities {
>          __u32   type;
>          __u16   size; /* support future expansion of capabilities[] */
>          __u16   reserved;
>          __u64   capabilities[1];
> };

We also need to advertise which attributes are supported by the host,
so that the user can tune the available ones. That would make a bit mask
like the above trickier, unless we return the supported values back
in the argument ptr for the "probe" call. And this scheme in general
can be useful for passing back a non-boolean result specific to the
attribute, without having a per-attribute ioctl. (e.g, maximum limit
for IPA).

> 
> #define KVM_CREATE_VM_ATTR_ARM64_PHYSADDR_SIZE 2
> struct kvm_create_vm_attr_arm64_physaddr_size {
>          __u32   type;
>          __u32   physaddr_bits;
> };
> 
> /* ... */
> 
> union kvm_create_vm_attr {
>          struct kvm_create_vm_attr_any;
>          struct kvm_create_vm_attr_arch_capabilities;
>          struct kvm_create_vm_attr_arm64_physaddr_size;
>          /* ... */
> };

nit: Could we simply do s/kvm_create_vm_attr/kvm_vm_attr/ everywhere ?
While I agree that the kvm_create_vm_attr makes it implicit that the attributes
are valid only "create" ioctl, the lack of an ioctl to set the VM attribute
should be sufficient to indicate the same.

> 
> struct kvm_create_vm2 {
>          __u32   version;        /* harmless, even if not useful */
>          __u16   nr_attrs;       /* or could just terminate attrs with a
>                                     NULL entry */
>          union kvm_create_vm_attr __user *__user *attrs;
> };
> 
> 
> This is quite flexible, but obviously a bit heavy.
> 
> However, if we're adding a new interface due to lack of extensibility,
> it may be worth going for something that's freely extensible.

True. I could hack something up along the lines above and send it here.

> 
> 
> Userspace might call this as
> 
> 	struct kvm_create_vm_attr_arch_capabilities vm_arch_caps = {
> 		.type = KVM_CREATE_VM_ATTR_ARCH_CAPABILITIES,
> 		.size = 64,
> 		.capabilities[0] = KVM_CREATE_VM_ARM64_VCPU_NEEDS_SET_SVE_VLS,
> 	};
> 
> 	struct kvm_create_vm_attr_arch_arm64_physaddr_size = {
> 		.type = KVM_CREATE_VM_ATTR_ARM64_PHYSADDR_SIZE,
> 		.physaddr_bits = 52,
> 	};
> 
> 	union kvm_create_vm_attr **vmattrs[] = {
> 		&vm_arch_caps,
> 		&vm_arm64_physaddr_size,
> 		NULL, /* maybe */
> 	};
> 
> 	struct kvm_create_vm2 vm;
> 
> 	vm.version = 0;
> 	vm.nr_attrs = 2; /* maybe */
> 	vm.attrs = vmattrs;
> 
> 	ioctl(..., KVM_CREATE_VM2, &vm);

Thanks
Suzuki
Dave Martin July 10, 2018, 5:03 p.m. UTC | #12
On Tue, Jul 10, 2018 at 05:38:39PM +0100, Suzuki K Poulose wrote:
> On 09/07/18 14:37, Dave Martin wrote:
> >On Mon, Jul 09, 2018 at 01:29:42PM +0100, Marc Zyngier wrote:
> >>On 09/07/18 12:23, Dave Martin wrote:

[...]

> >>>Wedging arguments into a few bits in the type argument feels awkward,
> >>>and may be regretted later if we run out of bits, or something can't be
> >>>represented in the chosen encoding.
> >>
> >>I think that's a pretty convincing argument for a "better" CREATE_VM,
> >>one that would have a clearly defined, structured (and potentially
> >>extensible) argument.
> >>
> >>I've quickly hacked the following:
> >>
> >>diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>index b6270a3b38e9..3e76214034c2 100644
> >>--- a/include/uapi/linux/kvm.h
> >>+++ b/include/uapi/linux/kvm.h
> >>@@ -735,6 +735,20 @@ struct kvm_ppc_resize_hpt {
> >>  	__u32 pad;
> >>  };
> >>
> >>+struct kvm_create_vm2 {
> >>+	__u64	version;	/* Or maybe not */
> >>+	union {
> >>+		struct {
> >>+#define KVM_ARM_SVE_CAPABLE	(1 << 0)
> >>+#define KVM_ARM_SELECT_IPA	{1 << 1)
> >>+			__u64	capabilities;
> >>+			__u16	sve_vlen;
> >>+			__u8	ipa_size;
> >>+		} arm64;
> >>+		__u64	dummy[15];
> >>+	};
> >>+};
> >>+
> >>  #define KVMIO 0xAE
> >>
> >>  /* machine type bits, to be used as argument to KVM_CREATE_VM */
> >>
> >>Other architectures could fill in their own bits if they need to.
> >>
> >>Thoughts?
> >
> >This kind of thing should work, but it may still get messy when we
> >add additional fields.
> 
> 
> Marc, Dave,
> 
> I like Dave's approach. Some comments below.
> 
> >
> >It we want this to work cross-arch, would it make sense to go
> >for a more generic approach, say
> >
> >struct kvm_create_vm_attr_any {
> >         __u32   type;
> >};
> >
> >#define KVM_CREATE_VM_ATTR_ARCH_CAPABILITIES 1
> >struct kvm_create_vm_attr_arch_capabilities {
> >         __u32   type;
> >         __u16   size; /* support future expansion of capabilities[] */
> >         __u16   reserved;
> >         __u64   capabilities[1];
> >};
> 
> We also need to advertise which attributes are supported by the host,
> so that the user can tune the available ones. That would make a bit mask
> like the above trickier, unless we return the supported values back
> in the argument ptr for the "probe" call. And this scheme in general
> can be useful for passing back a non-boolean result specific to the
> attribute, without having a per-attribute ioctl. (e.g, maximum limit
> for IPA).

Maybe, but this could quickly become bloated.  (My approach already
feels a bit bloated...)

I'm not sure that arbitrarily complex negotiation will really be
needed, but userspace might want to change its mind if setting a
particular propertiy fails.

An alternative might be to have a bunch of per-VM ioctls to configure
different things, like x86 has.  There's at least precedent for that.
For arm, we currently only have a few.  That allows for easy extension,
at the cost of adding ioctls.

There may be some ioctls we can reuse, like KVM_ENABLE_CAP for per-
vm capability flags.


[...]

> >union kvm_create_vm_attr {
> >         struct kvm_create_vm_attr_any;
> >         struct kvm_create_vm_attr_arch_capabilities;
> >         struct kvm_create_vm_attr_arm64_physaddr_size;
> >         /* ... */
> >};
> 
> nit: Could we simply do s/kvm_create_vm_attr/kvm_vm_attr/ everywhere ?
> While I agree that the kvm_create_vm_attr makes it implicit that the attributes
> are valid only "create" ioctl, the lack of an ioctl to set the VM attribute
> should be sufficient to indicate the same.

I just randomly came up with some names.  The precise naming scheme
isn't that important, so long as it unlikely to result in name
collisions and so long as it's reasonablu clear (or compiler-checkable,
or preferably both) which things can be used where.

I wouldn't have a problem with something a bit terser.

> 
> >
> >struct kvm_create_vm2 {
> >         __u32   version;        /* harmless, even if not useful */
> >         __u16   nr_attrs;       /* or could just terminate attrs with a
> >                                    NULL entry */
> >         union kvm_create_vm_attr __user *__user *attrs;
> >};
> >
> >
> >This is quite flexible, but obviously a bit heavy.
> >
> >However, if we're adding a new interface due to lack of extensibility,
> >it may be worth going for something that's freely extensible.
> 
> True. I could hack something up along the lines above and send it here.

Sure, but best to keep it fairly rough for now.

Cheers
---Dave
Suzuki K Poulose July 11, 2018, 9:05 a.m. UTC | #13
On 10/07/18 18:03, Dave Martin wrote:
> On Tue, Jul 10, 2018 at 05:38:39PM +0100, Suzuki K Poulose wrote:
>> On 09/07/18 14:37, Dave Martin wrote:
>>> On Mon, Jul 09, 2018 at 01:29:42PM +0100, Marc Zyngier wrote:
>>>> On 09/07/18 12:23, Dave Martin wrote:
> 
> [...]
> 
>>>>> Wedging arguments into a few bits in the type argument feels awkward,
>>>>> and may be regretted later if we run out of bits, or something can't be
>>>>> represented in the chosen encoding.
>>>>
>>>> I think that's a pretty convincing argument for a "better" CREATE_VM,
>>>> one that would have a clearly defined, structured (and potentially
>>>> extensible) argument.
>>>>
>>>> I've quickly hacked the following:
>>>>
>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>> index b6270a3b38e9..3e76214034c2 100644
>>>> --- a/include/uapi/linux/kvm.h
>>>> +++ b/include/uapi/linux/kvm.h
>>>> @@ -735,6 +735,20 @@ struct kvm_ppc_resize_hpt {
>>>>   	__u32 pad;
>>>>   };
>>>>
>>>> +struct kvm_create_vm2 {
>>>> +	__u64	version;	/* Or maybe not */
>>>> +	union {
>>>> +		struct {
>>>> +#define KVM_ARM_SVE_CAPABLE	(1 << 0)
>>>> +#define KVM_ARM_SELECT_IPA	{1 << 1)
>>>> +			__u64	capabilities;
>>>> +			__u16	sve_vlen;
>>>> +			__u8	ipa_size;
>>>> +		} arm64;
>>>> +		__u64	dummy[15];
>>>> +	};
>>>> +};
>>>> +
>>>>   #define KVMIO 0xAE
>>>>
>>>>   /* machine type bits, to be used as argument to KVM_CREATE_VM */
>>>>
>>>> Other architectures could fill in their own bits if they need to.
>>>>
>>>> Thoughts?
>>>
>>> This kind of thing should work, but it may still get messy when we
>>> add additional fields.
>>
>>
>> Marc, Dave,
>>
>> I like Dave's approach. Some comments below.
>>
>>>
>>> It we want this to work cross-arch, would it make sense to go
>>> for a more generic approach, say
>>>
>>> struct kvm_create_vm_attr_any {
>>>          __u32   type;
>>> };
>>>
>>> #define KVM_CREATE_VM_ATTR_ARCH_CAPABILITIES 1
>>> struct kvm_create_vm_attr_arch_capabilities {
>>>          __u32   type;
>>>          __u16   size; /* support future expansion of capabilities[] */
>>>          __u16   reserved;
>>>          __u64   capabilities[1];
>>> };
>>
>> We also need to advertise which attributes are supported by the host,
>> so that the user can tune the available ones. That would make a bit mask
>> like the above trickier, unless we return the supported values back
>> in the argument ptr for the "probe" call. And this scheme in general
>> can be useful for passing back a non-boolean result specific to the
>> attribute, without having a per-attribute ioctl. (e.g, maximum limit
>> for IPA).
> 
> Maybe, but this could quickly become bloated.  (My approach already
> feels a bit bloated...)
> 
> I'm not sure that arbitrarily complex negotiation will really be
> needed, but userspace might want to change its mind if setting a
> particular propertiy fails.
> 
> An alternative might be to have a bunch of per-VM ioctls to configure
> different things, like x86 has.  There's at least precedent for that.
> For arm, we currently only have a few.  That allows for easy extension,
> at the cost of adding ioctls.

As you know, one of the major problems with the per-VM ioctls is
the ordering of different operations and tracking to make sure that
the userspace follows the expected order. e.g, the first approach for
IPA series was based on this and it made things complex enough to drop
it.

> 
> There may be some ioctls we can reuse, like KVM_ENABLE_CAP for per-
> vm capability flags.

May be we could switch to KVM_VM_CAPS and pass a list of capabilities
to be enabled at creation time ? The kvm_enable_cap can pass in additional
arguments for each cap. That way we don't have to rely on a new set of
attributes and probing becomes straight forward.

Suzuki
Dave Martin July 11, 2018, 10:38 a.m. UTC | #14
On Wed, Jul 11, 2018 at 10:05:50AM +0100, Suzuki K Poulose wrote:
> On 10/07/18 18:03, Dave Martin wrote:
> >On Tue, Jul 10, 2018 at 05:38:39PM +0100, Suzuki K Poulose wrote:
> >>On 09/07/18 14:37, Dave Martin wrote:
> >>>On Mon, Jul 09, 2018 at 01:29:42PM +0100, Marc Zyngier wrote:
> >>>>On 09/07/18 12:23, Dave Martin wrote:
> >
> >[...]
> >
> >>>>>Wedging arguments into a few bits in the type argument feels awkward,
> >>>>>and may be regretted later if we run out of bits, or something can't be
> >>>>>represented in the chosen encoding.
> >>>>
> >>>>I think that's a pretty convincing argument for a "better" CREATE_VM,
> >>>>one that would have a clearly defined, structured (and potentially
> >>>>extensible) argument.
> >>>>
> >>>>I've quickly hacked the following:
> >>>>
> >>>>diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>>index b6270a3b38e9..3e76214034c2 100644
> >>>>--- a/include/uapi/linux/kvm.h
> >>>>+++ b/include/uapi/linux/kvm.h
> >>>>@@ -735,6 +735,20 @@ struct kvm_ppc_resize_hpt {
> >>>>  	__u32 pad;
> >>>>  };
> >>>>
> >>>>+struct kvm_create_vm2 {
> >>>>+	__u64	version;	/* Or maybe not */
> >>>>+	union {
> >>>>+		struct {
> >>>>+#define KVM_ARM_SVE_CAPABLE	(1 << 0)
> >>>>+#define KVM_ARM_SELECT_IPA	{1 << 1)
> >>>>+			__u64	capabilities;
> >>>>+			__u16	sve_vlen;
> >>>>+			__u8	ipa_size;
> >>>>+		} arm64;
> >>>>+		__u64	dummy[15];
> >>>>+	};
> >>>>+};
> >>>>+
> >>>>  #define KVMIO 0xAE
> >>>>
> >>>>  /* machine type bits, to be used as argument to KVM_CREATE_VM */
> >>>>
> >>>>Other architectures could fill in their own bits if they need to.
> >>>>
> >>>>Thoughts?
> >>>
> >>>This kind of thing should work, but it may still get messy when we
> >>>add additional fields.
> >>
> >>
> >>Marc, Dave,
> >>
> >>I like Dave's approach. Some comments below.
> >>
> >>>
> >>>It we want this to work cross-arch, would it make sense to go
> >>>for a more generic approach, say
> >>>
> >>>struct kvm_create_vm_attr_any {
> >>>         __u32   type;
> >>>};
> >>>
> >>>#define KVM_CREATE_VM_ATTR_ARCH_CAPABILITIES 1
> >>>struct kvm_create_vm_attr_arch_capabilities {
> >>>         __u32   type;
> >>>         __u16   size; /* support future expansion of capabilities[] */
> >>>         __u16   reserved;
> >>>         __u64   capabilities[1];
> >>>};
> >>
> >>We also need to advertise which attributes are supported by the host,
> >>so that the user can tune the available ones. That would make a bit mask
> >>like the above trickier, unless we return the supported values back
> >>in the argument ptr for the "probe" call. And this scheme in general
> >>can be useful for passing back a non-boolean result specific to the
> >>attribute, without having a per-attribute ioctl. (e.g, maximum limit
> >>for IPA).
> >
> >Maybe, but this could quickly become bloated.  (My approach already
> >feels a bit bloated...)
> >
> >I'm not sure that arbitrarily complex negotiation will really be
> >needed, but userspace might want to change its mind if setting a
> >particular propertiy fails.
> >
> >An alternative might be to have a bunch of per-VM ioctls to configure
> >different things, like x86 has.  There's at least precedent for that.
> >For arm, we currently only have a few.  That allows for easy extension,
> >at the cost of adding ioctls.
> 
> As you know, one of the major problems with the per-VM ioctls is
> the ordering of different operations and tracking to make sure that
> the userspace follows the expected order. e.g, the first approach for
> IPA series was based on this and it made things complex enough to drop
> it.

I'm aware of that, but if we are adding a new KVM_CREATE_VM, we could
perhaps give it different semantics: i.e., we create a half-created VM
that only accepts configuration ioctls and a "finish creation" ioctl
that finalises everything before you're allowed to create devices,
vcpus etc.

This is the sort of thing I was moving torwards for SVE (but for
vcpus there).

I'm not saying we should drop the existing KVM_CREATE_VM2 ideas,
but that we should take a step back if it starts to accrue complexity.

> >
> >There may be some ioctls we can reuse, like KVM_ENABLE_CAP for per-
> >vm capability flags.
> 
> May be we could switch to KVM_VM_CAPS and pass a list of capabilities
> to be enabled at creation time ? The kvm_enable_cap can pass in additional
> arguments for each cap. That way we don't have to rely on a new set of
> attributes and probing becomes straight forward.

That's a possibility.  I guess we'd need to understand how exactly x86
uses this.

Cheers
---Dave
diff mbox series

Patch

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index d86f8dd..bcc3dd9 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -385,6 +385,8 @@  static inline u32 kvm_get_ipa_limit(void)
 	return KVM_PHYS_SHIFT;
 }
 
+static inline void kvm_config_stage2(struct kvm *kvm, u32 ipa_shift) {}
+
 #endif	/* !__ASSEMBLY__ */
 
 #endif /* __ARM_KVM_MMU_H__ */
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index b02c316..2e90942 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -128,19 +128,15 @@ 
 #define VTCR_EL2_T0SZ(x)	TCR_T0SZ(x)
 
 /*
- * We configure the Stage-2 page tables to always restrict the IPA space to be
- * 40 bits wide (T0SZ = 24).  Systems with a PARange smaller than 40 bits are
- * not known to exist and will break with this configuration.
+ * We configure the Stage-2 page tables based on the requested size of
+ * IPA for each VM. The default size is set to 40bits and is not allowed
+ * go below that limit (for backward compatibility).
  *
  * VTCR_EL2.PS is extracted from ID_AA64MMFR0_EL1.PARange at boot time
  * (see hyp-init.S).
  *
  * VTCR_EL2.SL0 and T0SZ are configured per VM at runtime before switching to
  * the VM.
- *
- * Note that when using 4K pages, we concatenate two first level page tables
- * together. With 16K pages, we concatenate 16 first level page tables.
- *
  */
 
 #define VTCR_EL2_COMMON_BITS	(VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index b4564d8..f3fb05a3 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -537,5 +537,7 @@  static inline u32 kvm_get_ipa_limit(void)
 	return KVM_PHYS_SHIFT;
 }
 
+static inline void kvm_config_stage2(struct kvm *kvm, u32 ipa_shift) {}
+
 #endif /* __ASSEMBLY__ */
 #endif /* __ARM64_KVM_MMU_H__ */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4df9bb6..fa4cab0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -751,6 +751,16 @@  struct kvm_ppc_resize_hpt {
 #define KVM_S390_SIE_PAGE_OFFSET 1
 
 /*
+ * On arm/arm64, machine type can be used to request the physical
+ * address size for the VM. Bits [7-0] have been reserved for the
+ * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
+ * value 0 implies the default IPA size, which is 40bits.
+ */
+#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK	0xff
+#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)		\
+	((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
+
+/*
  * ioctls for /dev/kvm fds:
  */
 #define KVM_GET_API_VERSION       _IO(KVMIO,   0x00)
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 0d99e67..1085761 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -112,6 +112,25 @@  void kvm_arch_check_processor_compat(void *rtn)
 }
 
 
+static int kvm_arch_config_vm(struct kvm *kvm, unsigned long type)
+{
+	u32 ipa_shift = KVM_VM_TYPE_ARM_PHYS_SHIFT(type);
+
+	/*
+	 * Make sure the size, if specified, is within the range of
+	 * default size and supported maximum limit.
+	 */
+	if (ipa_shift) {
+		if (ipa_shift < KVM_PHYS_SHIFT || ipa_shift > kvm_ipa_limit)
+			return -EINVAL;
+	} else {
+		ipa_shift = KVM_PHYS_SHIFT;
+	}
+
+	kvm_config_stage2(kvm, ipa_shift);
+	return 0;
+}
+
 /**
  * kvm_arch_init_vm - initializes a VM data structure
  * @kvm:	pointer to the KVM struct
@@ -120,8 +139,9 @@  int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 {
 	int ret, cpu;
 
-	if (type)
-		return -EINVAL;
+	ret = kvm_arch_config_vm(kvm, type);
+	if (ret)
+		return ret;
 
 	kvm->arch.last_vcpu_ran = alloc_percpu(typeof(*kvm->arch.last_vcpu_ran));
 	if (!kvm->arch.last_vcpu_ran)