[RFC,v2,00/11] KVM: arm64: Add support for hypercall services selection

Message ID 20211113012234.1443009-1-rananta@google.com

Raghavendra Rao Ananta Nov. 13, 2021, 1:22 a.m. UTC
Hello,

Continuing the discussion from [1], this series adds support for
user space to elect the hypercall services that it wishes to expose
to the guest, rather than having the guest discover them
unconditionally. The idea employed by the series was taken from
[1], as suggested by Marc Z.

In a broad sense, the idea is similar to the current implementation
of the PSCI interface: create a 'firmware pseudo-register' to handle
the firmware revisions. The series extends this idea to the other
hypercall services, such as TRNG (True Random Number Generator),
PV_TIME (Paravirtualized Time), and PTP (Precision Time Protocol).

For better categorization and future scaling, these firmware registers
are grouped by service call owner, but unlike the existing firmware
pseudo-registers, they hold the supported features in the form of a
bitmap. During VM (vCPU) initialization, each register shows the
upper limit of the features supported by the corresponding service.
The VMM can simply use GET_ONE_REG to discover the features. If it's
unhappy with any of them, it can write back the desired feature bitmap
using SET_ONE_REG.

KVM allows these modifications only until the VM has started. KVM also
assumes that the VMM is unaware of a register if that register remains
unaccessed (neither read nor written), and in that case simply clears
all the bits of the register so that the guest isn't accidentally
exposed to the features. Finally, the union of the bitmaps from all
the registers forms the set of services that are exposed to the guest.
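The read/write/sanitize life cycle described above can be modeled as a
small standalone sketch. The struct and helper names below are invented
for illustration and deliberately simplified (no locking, single
register); they are not the code from the series:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one bitmap firmware register (hypothetical, not kernel code). */
struct fw_reg_bmap {
	bool accessed;		/* has the VMM read or written the register? */
	uint64_t bmap;		/* features currently exposed to the guest */
};

/* At vCPU init, the register advertises the upper limit KVM supports. */
static void fw_reg_init(struct fw_reg_bmap *reg, uint64_t supported)
{
	reg->accessed = false;
	reg->bmap = supported;
}

/* GET_ONE_REG: reading marks the register as seen by the VMM. */
static uint64_t fw_reg_get(struct fw_reg_bmap *reg)
{
	reg->accessed = true;
	return reg->bmap;
}

/* SET_ONE_REG: reject any bit beyond what KVM supports. */
static int fw_reg_set(struct fw_reg_bmap *reg, uint64_t val, uint64_t supported)
{
	if (val & ~supported)
		return -1;	/* -EINVAL in the real interface */
	reg->accessed = true;
	reg->bmap = val;
	return 0;
}

/* First VM run: an untouched register is cleared, hiding its features. */
static void fw_reg_sanitize(struct fw_reg_bmap *reg)
{
	if (!reg->accessed)
		reg->bmap = 0;
}
```

A VMM that reads the register and writes back a subset keeps that subset;
a VMM that never touches it ends up exposing nothing from it.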

In order to provide backward compatibility with existing VMMs, a new
capability, KVM_CAP_ARM_HVC_FW_REG_BMAP, is introduced. To enable the
bitmap firmware registers extension, the capability must be explicitly
enabled. If it isn't, the behavior is unchanged from the previous
setup.
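The capability gate can be sketched as pure logic (a model with invented
names; the real code lives in kvm_vm_ioctl_enable_cap() and takes locks
this sketch omits):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical VM state, just enough to model the gate. */
struct vm_state {
	bool has_run_once;	  /* any vCPU has entered the guest */
	bool fw_reg_bmap_enabled; /* KVM_CAP_ARM_HVC_FW_REG_BMAP enabled */
};

/*
 * Enabling the capability is only allowed before the first vCPU run;
 * afterwards the guest may already have discovered services.
 */
static int enable_fw_reg_bmap(struct vm_state *vm)
{
	if (vm->has_run_once)
		return -EBUSY;
	vm->fw_reg_bmap_enabled = true;
	return 0;
}

/* Without the capability, every hypercall service stays visible. */
static bool hvc_call_visible(const struct vm_state *vm, bool bit_enabled)
{
	if (!vm->fw_reg_bmap_enabled)
		return true;
	return bit_enabled;
}
```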

The patches are based off of mainline kernel 5.15, with the selftest
patches from [2] applied.

Patch-1 factors out the non-PSCI related interface from psci.c to
hypercalls.c, as the series extends this list in the upcoming
patches.

Patches 2 and 3 introduce the core KVM functions kvm_vcpu_has_run_once()
and kvm_vm_has_run_once(), to be used in upcoming patches.

Patch-4 sets up the framework for the bitmap firmware pseudo-registers.
This includes introducing the capability KVM_CAP_ARM_HVC_FW_REG_BMAP,
read/write helpers for the registers, a helper to sanitize the registers
before the VM starts, and another helper to check whether a particular
hypercall service is supported for the guest.
It also adds the register KVM_REG_ARM_STD_BMAP to support ARM's
standard secure services.

Patch-5 introduces the firmware register, KVM_REG_ARM_STD_HYP_BMAP,
which holds the standard hypervisor services (such as PV_TIME).

Patch-6 introduces the firmware register, KVM_REG_ARM_VENDOR_HYP_BMAP,
which holds the vendor-specific hypercall services.

Patches 7 and 8 add the necessary documentation for the newly added
capability and firmware registers.

Patch-9 imports the SMCCC definitions from linux/arm-smccc.h into tools/
for further use in selftests.

Patch-10 adds the selftest to test the guest (using 'hvc') and VMM
interfaces (SET/GET_ONE_REG).

Patch-11 adds these firmware registers into the get-reg-list selftest.

[1]: https://lore.kernel.org/kvmarm/874kbcpmlq.wl-maz@kernel.org/T/
[2]: https://lore.kernel.org/kvmarm/YUzgdbYk8BeCnHyW@google.com/

Regards,
Raghavendra

v1 -> v2

Addressed comments by Oliver (thanks!):

- Introduced kvm_vcpu_has_run_once() and kvm_vm_has_run_once() in the
  core kvm code, rather than relying on ARM specific vcpu->arch.has_run_once.
- Writing to KVM_REG_ARM_PSCI_VERSION is done in hypercalls.c itself,
  rather than separating out to psci.c.
- Introduced KVM_CAP_ARM_HVC_FW_REG_BMAP to enable the extension.
- Tracks the register accesses from VMM to decide whether to sanitize
  a register or not, as opposed to sanitizing upon the first 'write'
  in v1.
- kvm_hvc_call_supported() is implemented using a direct switch-case
  statement, instead of looping over all the registers to pick the
  register for the function-id.
- Replaced the register bit definitions with #defines, instead of enums.
- Removed the patch v1-06/08 that imports the firmware register
  definitions as it's not needed.
- Separated out the documentation into its own patch, and the renaming
  of psci.rst to hypercalls.rst into another patch.
- Added the new firmware registers to the get-reg-list KVM selftest.

v1: https://lore.kernel.org/kvmarm/20211102002203.1046069-1-rananta@google.com/

Raghavendra Rao Ananta (11):
  KVM: arm64: Factor out firmware register handling from psci.c
  KVM: Introduce kvm_vcpu_has_run_once
  KVM: Introduce kvm_vm_has_run_once
  KVM: arm64: Setup a framework for hypercall bitmap firmware registers
  KVM: arm64: Add standard hypervisor firmware register
  KVM: arm64: Add vendor hypervisor firmware register
  Docs: KVM: Add doc for the bitmap firmware registers
  Docs: KVM: Rename psci.rst to hypercalls.rst
  tools: Import ARM SMCCC definitions
  selftests: KVM: aarch64: Introduce hypercall ABI test
  selftests: KVM: aarch64: Add the bitmap firmware registers to
    get-reg-list

 Documentation/virt/kvm/api.rst                |  23 +
 Documentation/virt/kvm/arm/hypercalls.rst     | 132 ++++++
 Documentation/virt/kvm/arm/psci.rst           |  77 ---
 arch/arm64/include/asm/kvm_host.h             |  21 +-
 arch/arm64/include/uapi/asm/kvm.h             |  12 +
 arch/arm64/kvm/arm.c                          |  31 +-
 arch/arm64/kvm/guest.c                        |   2 +-
 arch/arm64/kvm/hypercalls.c                   | 437 +++++++++++++++++-
 arch/arm64/kvm/psci.c                         | 166 -------
 arch/arm64/kvm/pvtime.c                       |   3 +
 arch/arm64/kvm/trng.c                         |   9 +-
 arch/arm64/kvm/vgic/vgic-init.c               |   2 +-
 arch/riscv/include/asm/kvm_host.h             |   3 -
 arch/riscv/kvm/vcpu.c                         |   7 +-
 include/kvm/arm_hypercalls.h                  |  20 +
 include/kvm/arm_psci.h                        |   7 -
 include/linux/kvm_host.h                      |   9 +
 include/uapi/linux/kvm.h                      |   1 +
 tools/include/linux/arm-smccc.h               | 188 ++++++++
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/aarch64/get-reg-list.c      |  35 ++
 .../selftests/kvm/aarch64/hypercalls.c        | 367 +++++++++++++++
 virt/kvm/kvm_main.c                           |  18 +
 24 files changed, 1291 insertions(+), 281 deletions(-)
 create mode 100644 Documentation/virt/kvm/arm/hypercalls.rst
 delete mode 100644 Documentation/virt/kvm/arm/psci.rst
 create mode 100644 tools/include/linux/arm-smccc.h
 create mode 100644 tools/testing/selftests/kvm/aarch64/hypercalls.c

Comments

Marc Zyngier Nov. 22, 2021, 4:27 p.m. UTC | #1
On Sat, 13 Nov 2021 01:22:25 +0000,
Raghavendra Rao Ananta <rananta@google.com> wrote:
> 
> Architectures such as arm64 and riscv uses vcpu variables
> such as has_run_once and ran_atleast_once, respectively,
> to mark if the vCPU has started running. Since these are
> architecture agnostic variables, introduce
> kvm_vcpu_has_run_once() as a core kvm functionality and
> use this instead of the architecture defined variables.
> 
> No functional change intended.
> 
> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>

arm64 is moving away from this, see [1]. You also don't need any new
state, as vcpu->pid gives you exactly what you need.

Happy to queue additional patches on top if you want to deal with
riscv.

Thanks,

	M.

[1] https://lore.kernel.org/all/20211018211158.3050779-1-maz@kernel.org/
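Marc's point that no new state is needed — vcpu->pid already encodes
whether the vCPU has ever entered the guest — can be sketched as a
standalone model (stub struct and invented helper name, not the actual
kvm_host.h change):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for the relevant field of struct kvm_vcpu. */
struct vcpu {
	void *pid;	/* set to the caller's pid on the first KVM_RUN */
};

/*
 * vcpu->pid is assigned the first time the vCPU thread issues KVM_RUN,
 * so a NULL pid means the vCPU has never run -- no extra flag needed.
 */
static bool vcpu_has_run_once(const struct vcpu *vcpu)
{
	return vcpu->pid != NULL;
}
```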
Marc Zyngier Nov. 22, 2021, 4:31 p.m. UTC | #2
On Sat, 13 Nov 2021 01:22:26 +0000,
Raghavendra Rao Ananta <rananta@google.com> wrote:
> 
> The upcoming patches need a way to detect if the VM, as
> a whole, has started. Hence, unionize kvm_vcpu_has_run_once()
> of all the vcpus of the VM and build kvm_vm_has_run_once()
> to achieve the functionality.
> 
> No functional change intended.
> 
> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> ---
>  include/linux/kvm_host.h |  2 ++
>  virt/kvm/kvm_main.c      | 17 +++++++++++++++++
>  2 files changed, 19 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index b373929c71eb..102e00c0e21c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1854,4 +1854,6 @@ static inline bool kvm_vcpu_has_run_once(struct kvm_vcpu *vcpu)
>  	return vcpu->has_run_once;
>  }
>  
> +bool kvm_vm_has_run_once(struct kvm *kvm);
> +
>  #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1ec8a8e959b2..3d8d96e8f61d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4339,6 +4339,23 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
>  	return fd;
>  }
>  
> +bool kvm_vm_has_run_once(struct kvm *kvm)
> +{
> +	int i, ret = false;
> +	struct kvm_vcpu *vcpu;
> +
> +	mutex_lock(&kvm->lock);
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		ret = kvm_vcpu_has_run_once(vcpu);
> +		if (ret)
> +			break;
> +	}
> +
> +	mutex_unlock(&kvm->lock);
> +	return ret;
> +}

This is horribly racy. Nothing prevents a vcpu from running behind
your back. If you want any sort of guarantee, look at what we do in
kvm_vgic_create(). Alexandru has patches that extract it to make it
generally available (at least for arm64).

	M.
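The kvm_vgic_create() pattern Marc refers to avoids the race by
trylocking every vCPU: the attempt fails if any vCPU thread currently
holds its lock (i.e. is in KVM_RUN), and while all locks are held no
vCPU can start running. A toy single-threaded model of that control
flow (invented names, bools standing in for the per-vCPU mutexes):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_VCPUS 4

/*
 * Stand-in for a per-vCPU mutex: "held" while the vCPU thread is inside
 * KVM_RUN (or while we hold it here). Models control flow only; the
 * real code uses mutex_trylock() on vcpu->mutex.
 */
struct vcpu {
	bool mutex_held;
};

static struct vcpu vcpus[NR_VCPUS];

static bool vcpu_trylock(struct vcpu *v)
{
	if (v->mutex_held)
		return false;
	v->mutex_held = true;
	return true;
}

static void vcpu_unlock(struct vcpu *v)
{
	v->mutex_held = false;
}

/*
 * Take every vCPU lock; if any vCPU is running, roll back the locks
 * already taken and fail. On success, VM-wide state can be changed
 * safely until unlock_all_vcpus() is called.
 */
static bool lock_all_vcpus(void)
{
	int i;

	for (i = 0; i < NR_VCPUS; i++) {
		if (!vcpu_trylock(&vcpus[i])) {
			while (--i >= 0)
				vcpu_unlock(&vcpus[i]);
			return false;
		}
	}
	return true;
}

static void unlock_all_vcpus(void)
{
	for (int i = 0; i < NR_VCPUS; i++)
		vcpu_unlock(&vcpus[i]);
}
```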
Marc Zyngier Nov. 22, 2021, 5:23 p.m. UTC | #3
On Sat, 13 Nov 2021 01:22:27 +0000,
Raghavendra Rao Ananta <rananta@google.com> wrote:
> 
> KVM regularly introduces new hypercall services to the guests without
> any consent from the Virtual Machine Manager (VMM). This means, the
> guests can observe hypercall services in and out as they migrate
> across various host kernel versions. This could be a major problem
> if the guest discovered a hypercall, started using it, and after
> getting migrated to an older kernel realizes that it's no longer
> available. Depending on how the guest handles the change, there's
> a potential chance that the guest would just panic.
> 
> As a result, there's a need for the VMM to elect the services that
> it wishes the guest to discover. VMM can elect these services based
> on the kernels spread across its (migration) fleet. To remedy this,
> extend the existing firmware psuedo-registers, such as
> KVM_REG_ARM_PSCI_VERSION, for all the hypercall services available.
> 
> These firmware registers are categorized based on the service call
> owners, and unlike the existing firmware psuedo-registers, they hold
> the features supported in the form of a bitmap. During VM (vCPU)
> initialization, the registers shows an upper-limit of the features
> supported by the corresponding registers. The VMM can simply use
> GET_ONE_REG to discover the features. If it's unhappy with any of
> the features, it can simply write-back the desired feature bitmap
> using SET_ONE_REG.
> 
> KVM allows these modification only until a VM has started. KVM also
> assumes that the VMM is unaware of a register if a register remains
> unaccessed (read/write), and would simply clear all the bits of the
> registers such that the guest accidently doesn't get exposed to the
> features. Finally, the set of bitmaps from all the registers are the
> services that are exposed to the guest.
> 
> In order to provide backward compatibility with already existing VMMs,
> a new capability, KVM_CAP_ARM_HVC_FW_REG_BMAP, is introduced. To enable
> the bitmap firmware registers extension, the capability must be
> explicitly enabled. If not, the behavior is similar to the previous
> setup.
> 
> In this patch, the framework adds the register only for ARM's standard
> secure services (owner value 4). Currently, this includes support only
> for ARM True Random Number Generator (TRNG) service, with bit-0 of the
> register representing mandatory features of v1.0. Other services are
> momentarily added in the upcoming patches.
> 
> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> ---
>  arch/arm64/include/asm/kvm_host.h |  16 +++
>  arch/arm64/include/uapi/asm/kvm.h |   4 +
>  arch/arm64/kvm/arm.c              |  23 +++-
>  arch/arm64/kvm/hypercalls.c       | 217 +++++++++++++++++++++++++++++-
>  arch/arm64/kvm/trng.c             |   9 +-
>  include/kvm/arm_hypercalls.h      |   7 +
>  include/uapi/linux/kvm.h          |   1 +
>  7 files changed, 262 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 02dffe50a20c..1546a2f973ef 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -102,6 +102,19 @@ struct kvm_s2_mmu {
>  struct kvm_arch_memory_slot {
>  };
>  
> +struct hvc_fw_reg_bmap {
> +	bool accessed;
> +	u64 reg_id;
> +	u64 bmap;
> +};
> +
> +struct hvc_reg_desc {
> +	spinlock_t lock;
> +	bool fw_reg_bmap_enabled;
> +
> +	struct hvc_fw_reg_bmap hvc_std_bmap;
> +};

Please document what these data structures track. Without any
documentation, it is pretty difficult to build a mental picture of how
this all fits together.

> +
>  struct kvm_arch {
>  	struct kvm_s2_mmu mmu;
>  
> @@ -137,6 +150,9 @@ struct kvm_arch {
>  
>  	/* Memory Tagging Extension enabled for the guest */
>  	bool mte_enabled;
> +
> +	/* Hypercall firmware registers' descriptor */
> +	struct hvc_reg_desc hvc_desc;
>  };
>  
>  struct kvm_vcpu_fault_info {
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index b3edde68bc3e..d6e099ed14ef 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -281,6 +281,10 @@ struct kvm_arm_copy_mte_tags {
>  #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED	3
>  #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED     	(1U << 4)
>  
> +#define KVM_REG_ARM_STD_BMAP			KVM_REG_ARM_FW_REG(3)
> +#define KVM_REG_ARM_STD_BIT_TRNG_V1_0		BIT(0)
> +#define KVM_REG_ARM_STD_BMAP_BIT_MAX		0	/* Last valid bit */
> +
>  /* SVE registers */
>  #define KVM_REG_ARM64_SVE		(0x15 << KVM_REG_ARM_COPROC_SHIFT)
>  
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 0cc148211b4e..f2099e4d1109 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -81,26 +81,32 @@ int kvm_arch_check_processor_compat(void *opaque)
>  int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  			    struct kvm_enable_cap *cap)
>  {
> -	int r;
> +	int r = 0;
> +	struct hvc_reg_desc *hvc_desc = &kvm->arch.hvc_desc;
>  
>  	if (cap->flags)
>  		return -EINVAL;
>  
>  	switch (cap->cap) {
>  	case KVM_CAP_ARM_NISV_TO_USER:
> -		r = 0;
>  		kvm->arch.return_nisv_io_abort_to_user = true;
>  		break;
>  	case KVM_CAP_ARM_MTE:
>  		mutex_lock(&kvm->lock);
> -		if (!system_supports_mte() || kvm->created_vcpus) {
> +		if (!system_supports_mte() || kvm->created_vcpus)
>  			r = -EINVAL;
> -		} else {
> -			r = 0;
> +		else
>  			kvm->arch.mte_enabled = true;
> -		}
>  		mutex_unlock(&kvm->lock);
>  		break;
> +	case KVM_CAP_ARM_HVC_FW_REG_BMAP:
> +		if (kvm_vm_has_run_once(kvm))
> +			return -EBUSY;
> +
> +		spin_lock(&hvc_desc->lock);

Does this really need to be a spin-lock? Are you ever taking it on a
context where you cannot sleep? And why does it need to be a new lock
when we already have a plethora of them?

> +		hvc_desc->fw_reg_bmap_enabled = true;
> +		spin_unlock(&hvc_desc->lock);
> +		break;
>  	default:
>  		r = -EINVAL;
>  		break;
> @@ -157,6 +163,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  
>  	set_default_spectre(kvm);
>  
> +	kvm_arm_init_hypercalls(kvm);
> +
>  	return ret;
>  out_free_stage2_pgd:
>  	kvm_free_stage2_pgd(&kvm->arch.mmu);
> @@ -215,6 +223,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_SET_GUEST_DEBUG:
>  	case KVM_CAP_VCPU_ATTRIBUTES:
>  	case KVM_CAP_PTP_KVM:
> +	case KVM_CAP_ARM_HVC_FW_REG_BMAP:
>  		r = 1;
>  		break;
>  	case KVM_CAP_SET_GUEST_DEBUG2:
> @@ -622,6 +631,8 @@ static int kvm_vcpu_first_run_init(struct kvm_vcpu *vcpu)
>  	if (kvm_vm_is_protected(kvm))
>  		kvm_call_hyp_nvhe(__pkvm_vcpu_init_traps, vcpu);
>  
> +	kvm_arm_sanitize_fw_regs(kvm);

What is the rational for doing this on first run? Surely this could be
done at the point where userspace performs the write, couldn't it?

My mental model is that the VMM reads some data, clears some bits if
it really wants to, and writes it back. There shouldn't be anything to
sanitise after the facts. Or at least that's my gut feeling so far.

/me reads on.

> +
>  	return ret;
>  }
>  
> diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
> index 9e136d91b470..f5df7bc61146 100644
> --- a/arch/arm64/kvm/hypercalls.c
> +++ b/arch/arm64/kvm/hypercalls.c
> @@ -58,6 +58,41 @@ static void kvm_ptp_get_time(struct kvm_vcpu *vcpu, u64 *val)
>  	val[3] = lower_32_bits(cycles);
>  }
>  
> +static bool
> +kvm_arm_fw_reg_feat_enabled(struct hvc_fw_reg_bmap *reg_bmap, u64 feat_bit)
> +{
> +	return reg_bmap->bmap & feat_bit;
> +}
> +
> +bool kvm_hvc_call_supported(struct kvm_vcpu *vcpu, u32 func_id)
> +{
> +	struct hvc_reg_desc *hvc_desc = &vcpu->kvm->arch.hvc_desc;
> +
> +	/*
> +	 * To ensure backward compatibility, support all the service calls,
> +	 * including new additions, if the firmware registers holding the
> +	 * feature bitmaps isn't explicitly enabled.
> +	 */
> +	if (!hvc_desc->fw_reg_bmap_enabled)
> +		return true;
> +
> +	switch (func_id) {
> +	case ARM_SMCCC_TRNG_VERSION:
> +	case ARM_SMCCC_TRNG_FEATURES:
> +	case ARM_SMCCC_TRNG_GET_UUID:
> +	case ARM_SMCCC_TRNG_RND32:
> +	case ARM_SMCCC_TRNG_RND64:
> +		return kvm_arm_fw_reg_feat_enabled(&hvc_desc->hvc_std_bmap,
> +					KVM_REG_ARM_STD_BIT_TRNG_V1_0);
> +	default:
> +		/* By default, allow the services that aren't listed here */
> +		return true;
> +	}
> +
> +	/* We shouldn't be reaching here */
> +	return true;

So why have anything at all?

> +}
> +
>  int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>  {
>  	u32 func_id = smccc_get_function(vcpu);
> @@ -65,6 +100,9 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>  	u32 feature;
>  	gpa_t gpa;
>  
> +	if (!kvm_hvc_call_supported(vcpu, func_id))
> +		goto out;
> +
>  	switch (func_id) {
>  	case ARM_SMCCC_VERSION_FUNC_ID:
>  		val[0] = ARM_SMCCC_VERSION_1_1;
> @@ -143,6 +181,7 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>  		return kvm_psci_call(vcpu);
>  	}
>  
> +out:
>  	smccc_set_retval(vcpu, val[0], val[1], val[2], val[3]);
>  	return 1;
>  }
> @@ -153,17 +192,178 @@ static const u64 fw_reg_ids[] = {
>  	KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2,
>  };
>  
> +static const u64 fw_reg_bmap_ids[] = {
> +	KVM_REG_ARM_STD_BMAP,
> +};
> +
> +static void kvm_arm_fw_reg_init_hvc(struct hvc_reg_desc *hvc_desc,
> +					struct hvc_fw_reg_bmap *fw_reg_bmap,
> +					u64 reg_id, u64 default_map)
> +{
> +	fw_reg_bmap->reg_id = reg_id;
> +	fw_reg_bmap->bmap = default_map;
> +}
> +
> +void kvm_arm_init_hypercalls(struct kvm *kvm)
> +{
> +	struct hvc_reg_desc *hvc_desc = &kvm->arch.hvc_desc;
> +
> +	spin_lock_init(&hvc_desc->lock);
> +
> +	kvm_arm_fw_reg_init_hvc(hvc_desc, &hvc_desc->hvc_std_bmap,
> +				KVM_REG_ARM_STD_BMAP, ARM_SMCCC_STD_FEATURES);
> +}
> +
> +static void kvm_arm_fw_reg_sanitize(struct hvc_fw_reg_bmap *fw_reg_bmap)
> +{
> +	if (!fw_reg_bmap->accessed)
> +		fw_reg_bmap->bmap = 0;
> +}
> +
> +/*
> + * kvm_arm_sanitize_fw_regs: Sanitize the hypercall firmware registers
> + *
> + * Sanitization, in the case of hypercall firmware registers, is basically
> + * clearing out the feature bitmaps so that the guests are not exposed to
> + * the services corresponding to a particular register. The registers that
> + * needs sanitization is decided on two factors on the user-space part:
> + *	1. Enablement of KVM_CAP_ARM_HVC_FW_REG_BMAP:
> + *	   If the user-space hasn't enabled the capability, it either means
> + *	   that it's unaware of its existence, or it simply doesn't want to
> + *	   participate in the arrangement and is okay with the default settings.
> + *	   The former case is to ensure backward compatibility.
> + *
> + *	2. Has the user-space accessed (read/write) the register? :
> + *	   If yes, it means that the user-space is aware of the register's
> + *	   existence and can set the bits as it sees fit for the guest. A
> + *	   read-only access from user-space indicates that the user-space is
> + *	   happy with the default settings, and doesn't wish to change it.
> + *
> + * The logic for sanitizing a register will then be:
> + * ---------------------------------------------------------------------------
> + * | CAP enabled | Accessed reg | Clear reg | Comments                       |
> + * ---------------------------------------------------------------------------
> + * |      N      |       N      |     N     |                                |
> + * |      N      |       Y      |     N     | -ENOENT returned during access |
> + * |      Y      |       N      |     Y     |                                |
> + * |      Y      |       Y      |     N     |                                |
> + * ---------------------------------------------------------------------------
> + */
> +void kvm_arm_sanitize_fw_regs(struct kvm *kvm)
> +{
> +	struct hvc_reg_desc *hvc_desc = &kvm->arch.hvc_desc;
> +
> +	spin_lock(&hvc_desc->lock);
> +
> +	if (!hvc_desc->fw_reg_bmap_enabled)
> +		goto out;
> +
> +	kvm_arm_fw_reg_sanitize(&hvc_desc->hvc_std_bmap);
> +
> +out:
> +	spin_unlock(&hvc_desc->lock);

I keep being baffled by this. Why should we track the VMM accesses or
the VMM writeback? This logic doesn't seem to bring anything useful as
far as I can tell. All we need to ensure is that what is written to
the pseudo-register is an acceptable subset of the previous value, and
I cannot see why this can't be done at write-time.

If you want to hide this behind a capability, fine (although my guts
feeling is that we don't need that either). But I really want to be
convinced about all this tracking.
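Marc's alternative — enforce at write time that the new value is an
acceptable subset of the previous one, and drop the access tracking and
the sanitize pass entirely — could look roughly like this (hypothetical
helper, not actual kernel code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Write-time validation sketch: a new bitmap is accepted only if it is
 * a subset of the current value, so the VMM can narrow the exposed
 * features but never turn a bit back on. No "accessed" flag needed.
 */
static int fw_reg_write(uint64_t *cur, uint64_t val)
{
	if (val & ~*cur)
		return -1;	/* -EINVAL: tried to set a cleared/unsupported bit */
	*cur = val;
	return 0;
}
```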

> +}
> +
> +static int kvm_arm_fw_reg_get_bmap(struct kvm *kvm,
> +				struct hvc_fw_reg_bmap *fw_reg_bmap, u64 *val)
> +{
> +	int ret = 0;
> +	struct hvc_reg_desc *hvc_desc = &kvm->arch.hvc_desc;
> +
> +	spin_lock(&hvc_desc->lock);
> +
> +	if (!hvc_desc->fw_reg_bmap_enabled) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	fw_reg_bmap->accessed = true;
> +	*val = fw_reg_bmap->bmap;
> +out:
> +	spin_unlock(&hvc_desc->lock);
> +	return ret;
> +}
> +
> +static int kvm_arm_fw_reg_set_bmap(struct kvm *kvm,
> +				struct hvc_fw_reg_bmap *fw_reg_bmap, u64 val)
> +{
> +	int ret = 0;
> +	u64 fw_reg_features;
> +	struct hvc_reg_desc *hvc_desc = &kvm->arch.hvc_desc;
> +
> +	spin_lock(&hvc_desc->lock);
> +
> +	if (!hvc_desc->fw_reg_bmap_enabled) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	if (fw_reg_bmap->bmap == val)
> +		goto out;
> +
> +	if (kvm_vm_has_run_once(kvm)) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +
> +	switch (fw_reg_bmap->reg_id) {
> +	case KVM_REG_ARM_STD_BMAP:
> +		fw_reg_features = ARM_SMCCC_STD_FEATURES;
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* Check for unsupported feature bit */
> +	if (val & ~fw_reg_features) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	fw_reg_bmap->accessed = true;
> +	fw_reg_bmap->bmap = val;
> +out:
> +	spin_unlock(&hvc_desc->lock);
> +	return ret;
> +}
> +
>  int kvm_arm_get_fw_num_regs(struct kvm_vcpu *vcpu)
>  {
> -	return ARRAY_SIZE(fw_reg_ids);
> +	struct hvc_reg_desc *hvc_desc = &vcpu->kvm->arch.hvc_desc;
> +	int n_regs = ARRAY_SIZE(fw_reg_ids);
> +
> +	spin_lock(&hvc_desc->lock);
> +
> +	if (hvc_desc->fw_reg_bmap_enabled)
> +		n_regs += ARRAY_SIZE(fw_reg_bmap_ids);
> +
> +	spin_unlock(&hvc_desc->lock);
> +
> +	return n_regs;
>  }
>  
>  int kvm_arm_copy_fw_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices)
>  {
> +	struct hvc_reg_desc *hvc_desc = &vcpu->kvm->arch.hvc_desc;
>  	int i;
>  
>  	for (i = 0; i < ARRAY_SIZE(fw_reg_ids); i++) {
> -		if (put_user(fw_reg_ids[i], uindices))
> +		if (put_user(fw_reg_ids[i], uindices++))
> +			return -EFAULT;
> +	}
> +
> +	spin_lock(&hvc_desc->lock);
> +
> +	if (!hvc_desc->fw_reg_bmap_enabled) {
> +		spin_unlock(&hvc_desc->lock);
> +		return 0;
> +	}
> +
> +	spin_unlock(&hvc_desc->lock);
> +
> +	for (i = 0; i < ARRAY_SIZE(fw_reg_bmap_ids); i++) {
> +		if (put_user(fw_reg_bmap_ids[i], uindices++))
>  			return -EFAULT;

I really don't get what you are trying to achieve with this locking.
You guard the 'enabled' bit, but you still allow the kernel view to be
copied to userspace while another thread is writing to it?

Thanks,

	M.
Raghavendra Rao Ananta Nov. 23, 2021, 6:31 p.m. UTC | #4
Hello Marc,

On Mon, Nov 22, 2021 at 8:27 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Sat, 13 Nov 2021 01:22:25 +0000,
> Raghavendra Rao Ananta <rananta@google.com> wrote:
> >
> > Architectures such as arm64 and riscv uses vcpu variables
> > such as has_run_once and ran_atleast_once, respectively,
> > to mark if the vCPU has started running. Since these are
> > architecture agnostic variables, introduce
> > kvm_vcpu_has_run_once() as a core kvm functionality and
> > use this instead of the architecture defined variables.
> >
> > No functional change intended.
> >
> > Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
>
> arm64 is moving away from this, see [1]. You also don't need any new
> state, as vcpu->pid gives you exactly what you need.

Thanks for the pointer. I can directly use this!
>
> Happy to queue additional patches on top if you want to deal with
> riscv.
>
Just to clarify, do you mean getting the riscv KVM support onto
kvmarm-next? If yes, then sure, I can make the changes in riscv to use
vcpu_has_run_once() from your series.

Regards,
Raghavendra


> Thanks,
>
>         M.
>
> [1] https://lore.kernel.org/all/20211018211158.3050779-1-maz@kernel.org/
>
> --
> Without deviation from the norm, progress is not possible.
Raghavendra Rao Ananta Nov. 23, 2021, 6:34 p.m. UTC | #5
On Mon, Nov 22, 2021 at 9:23 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Sat, 13 Nov 2021 01:22:27 +0000,
> Raghavendra Rao Ananta <rananta@google.com> wrote:
> >
> > KVM regularly introduces new hypercall services to the guests without
> > any consent from the Virtual Machine Manager (VMM). This means, the
> > guests can observe hypercall services in and out as they migrate
> > across various host kernel versions. This could be a major problem
> > if the guest discovered a hypercall, started using it, and after
> > getting migrated to an older kernel realizes that it's no longer
> > available. Depending on how the guest handles the change, there's
> > a potential chance that the guest would just panic.
> >
> > As a result, there's a need for the VMM to elect the services that
> > it wishes the guest to discover. VMM can elect these services based
> > on the kernels spread across its (migration) fleet. To remedy this,
> > extend the existing firmware psuedo-registers, such as
> > KVM_REG_ARM_PSCI_VERSION, for all the hypercall services available.
> >
> > These firmware registers are categorized based on the service call
> > owners, and unlike the existing firmware psuedo-registers, they hold
> > the features supported in the form of a bitmap. During VM (vCPU)
> > initialization, the registers shows an upper-limit of the features
> > supported by the corresponding registers. The VMM can simply use
> > GET_ONE_REG to discover the features. If it's unhappy with any of
> > the features, it can simply write-back the desired feature bitmap
> > using SET_ONE_REG.
> >
> > KVM allows these modification only until a VM has started. KVM also
> > assumes that the VMM is unaware of a register if a register remains
> > unaccessed (read/write), and would simply clear all the bits of the
> > registers such that the guest accidently doesn't get exposed to the
> > features. Finally, the set of bitmaps from all the registers are the
> > services that are exposed to the guest.
> >
> > In order to provide backward compatibility with already existing VMMs,
> > a new capability, KVM_CAP_ARM_HVC_FW_REG_BMAP, is introduced. To enable
> > the bitmap firmware registers extension, the capability must be
> > explicitly enabled. If not, the behavior is similar to the previous
> > setup.
> >
> > In this patch, the framework adds the register only for ARM's standard
> > secure services (owner value 4). Currently, this includes support only
> > for ARM True Random Number Generator (TRNG) service, with bit-0 of the
> > register representing mandatory features of v1.0. Other services are
> > momentarily added in the upcoming patches.
> >
> > Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_host.h |  16 +++
> >  arch/arm64/include/uapi/asm/kvm.h |   4 +
> >  arch/arm64/kvm/arm.c              |  23 +++-
> >  arch/arm64/kvm/hypercalls.c       | 217 +++++++++++++++++++++++++++++-
> >  arch/arm64/kvm/trng.c             |   9 +-
> >  include/kvm/arm_hypercalls.h      |   7 +
> >  include/uapi/linux/kvm.h          |   1 +
> >  7 files changed, 262 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 02dffe50a20c..1546a2f973ef 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -102,6 +102,19 @@ struct kvm_s2_mmu {
> >  struct kvm_arch_memory_slot {
> >  };
> >
> > +struct hvc_fw_reg_bmap {
> > +     bool accessed;
> > +     u64 reg_id;
> > +     u64 bmap;
> > +};
> > +
> > +struct hvc_reg_desc {
> > +     spinlock_t lock;
> > +     bool fw_reg_bmap_enabled;
> > +
> > +     struct hvc_fw_reg_bmap hvc_std_bmap;
> > +};
>
> Please document what these data structures track. Without any
> documentation, it is pretty difficult to build a mental picture of how
> this all fits together.
>
Sure, will do.
> > +
> >  struct kvm_arch {
> >       struct kvm_s2_mmu mmu;
> >
> > @@ -137,6 +150,9 @@ struct kvm_arch {
> >
> >       /* Memory Tagging Extension enabled for the guest */
> >       bool mte_enabled;
> > +
> > +     /* Hypercall firmware registers' descriptor */
> > +     struct hvc_reg_desc hvc_desc;
> >  };
> >
> >  struct kvm_vcpu_fault_info {
> > diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> > index b3edde68bc3e..d6e099ed14ef 100644
> > --- a/arch/arm64/include/uapi/asm/kvm.h
> > +++ b/arch/arm64/include/uapi/asm/kvm.h
> > @@ -281,6 +281,10 @@ struct kvm_arm_copy_mte_tags {
> >  #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED     3
> >  #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED          (1U << 4)
> >
> > +#define KVM_REG_ARM_STD_BMAP                 KVM_REG_ARM_FW_REG(3)
> > +#define KVM_REG_ARM_STD_BIT_TRNG_V1_0                BIT(0)
> > +#define KVM_REG_ARM_STD_BMAP_BIT_MAX         0       /* Last valid bit */
> > +
> >  /* SVE registers */
> >  #define KVM_REG_ARM64_SVE            (0x15 << KVM_REG_ARM_COPROC_SHIFT)
> >
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 0cc148211b4e..f2099e4d1109 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -81,26 +81,32 @@ int kvm_arch_check_processor_compat(void *opaque)
> >  int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> >                           struct kvm_enable_cap *cap)
> >  {
> > -     int r;
> > +     int r = 0;
> > +     struct hvc_reg_desc *hvc_desc = &kvm->arch.hvc_desc;
> >
> >       if (cap->flags)
> >               return -EINVAL;
> >
> >       switch (cap->cap) {
> >       case KVM_CAP_ARM_NISV_TO_USER:
> > -             r = 0;
> >               kvm->arch.return_nisv_io_abort_to_user = true;
> >               break;
> >       case KVM_CAP_ARM_MTE:
> >               mutex_lock(&kvm->lock);
> > -             if (!system_supports_mte() || kvm->created_vcpus) {
> > +             if (!system_supports_mte() || kvm->created_vcpus)
> >                       r = -EINVAL;
> > -             } else {
> > -                     r = 0;
> > +             else
> >                       kvm->arch.mte_enabled = true;
> > -             }
> >               mutex_unlock(&kvm->lock);
> >               break;
> > +     case KVM_CAP_ARM_HVC_FW_REG_BMAP:
> > +             if (kvm_vm_has_run_once(kvm))
> > +                     return -EBUSY;
> > +
> > +             spin_lock(&hvc_desc->lock);
>
> Does this really need to be a spin-lock? Are you ever taking it on a
> context where you cannot sleep? And why does it need to be a new lock
> when we already have a plethora of them?
>
I went with a spinlock because the critical sections are very small
and I didn't want to interfere with any other paths. But I suppose
that's not needed, since nothing here runs in a context that can't
sleep. I'll go with the kvm->lock mutex instead.
> > +             hvc_desc->fw_reg_bmap_enabled = true;
> > +             spin_unlock(&hvc_desc->lock);
> > +             break;
> >       default:
> >               r = -EINVAL;
> >               break;
> > @@ -157,6 +163,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> >
> >       set_default_spectre(kvm);
> >
> > +     kvm_arm_init_hypercalls(kvm);
> > +
> >       return ret;
> >  out_free_stage2_pgd:
> >       kvm_free_stage2_pgd(&kvm->arch.mmu);
> > @@ -215,6 +223,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >       case KVM_CAP_SET_GUEST_DEBUG:
> >       case KVM_CAP_VCPU_ATTRIBUTES:
> >       case KVM_CAP_PTP_KVM:
> > +     case KVM_CAP_ARM_HVC_FW_REG_BMAP:
> >               r = 1;
> >               break;
> >       case KVM_CAP_SET_GUEST_DEBUG2:
> > @@ -622,6 +631,8 @@ static int kvm_vcpu_first_run_init(struct kvm_vcpu *vcpu)
> >       if (kvm_vm_is_protected(kvm))
> >               kvm_call_hyp_nvhe(__pkvm_vcpu_init_traps, vcpu);
> >
> > +     kvm_arm_sanitize_fw_regs(kvm);
>
> What is the rational for doing this on first run? Surely this could be
> done at the point where userspace performs the write, couldn't it?
>
> My mental model is that the VMM reads some data, clears some bits if
> it really wants to, and writes it back. There shouldn't be anything to
> sanitise after the facts. Or at least that's my gut feeling so far.
>
I've tried to explain things below.

> /me reads on.
>
> > +
> >       return ret;
> >  }
> >
> > diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
> > index 9e136d91b470..f5df7bc61146 100644
> > --- a/arch/arm64/kvm/hypercalls.c
> > +++ b/arch/arm64/kvm/hypercalls.c
> > @@ -58,6 +58,41 @@ static void kvm_ptp_get_time(struct kvm_vcpu *vcpu, u64 *val)
> >       val[3] = lower_32_bits(cycles);
> >  }
> >
> > +static bool
> > +kvm_arm_fw_reg_feat_enabled(struct hvc_fw_reg_bmap *reg_bmap, u64 feat_bit)
> > +{
> > +     return reg_bmap->bmap & feat_bit;
> > +}
> > +
> > +bool kvm_hvc_call_supported(struct kvm_vcpu *vcpu, u32 func_id)
> > +{
> > +     struct hvc_reg_desc *hvc_desc = &vcpu->kvm->arch.hvc_desc;
> > +
> > +     /*
> > +      * To ensure backward compatibility, support all the service calls,
> > +      * including new additions, if the firmware registers holding the
> > +      * feature bitmaps isn't explicitly enabled.
> > +      */
> > +     if (!hvc_desc->fw_reg_bmap_enabled)
> > +             return true;
> > +
> > +     switch (func_id) {
> > +     case ARM_SMCCC_TRNG_VERSION:
> > +     case ARM_SMCCC_TRNG_FEATURES:
> > +     case ARM_SMCCC_TRNG_GET_UUID:
> > +     case ARM_SMCCC_TRNG_RND32:
> > +     case ARM_SMCCC_TRNG_RND64:
> > +             return kvm_arm_fw_reg_feat_enabled(&hvc_desc->hvc_std_bmap,
> > +                                     KVM_REG_ARM_STD_BIT_TRNG_V1_0);
> > +     default:
> > +             /* By default, allow the services that aren't listed here */
> > +             return true;
> > +     }
> > +
> > +     /* We shouldn't be reaching here */
> > +     return true;
>
> So why have anything at all?
>
I was expecting the compiler to complain, but apparently it doesn't. I'll remove the unreachable statement.
> > +}
> > +
> >  int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
> >  {
> >       u32 func_id = smccc_get_function(vcpu);
> > @@ -65,6 +100,9 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
> >       u32 feature;
> >       gpa_t gpa;
> >
> > +     if (!kvm_hvc_call_supported(vcpu, func_id))
> > +             goto out;
> > +
> >       switch (func_id) {
> >       case ARM_SMCCC_VERSION_FUNC_ID:
> >               val[0] = ARM_SMCCC_VERSION_1_1;
> > @@ -143,6 +181,7 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
> >               return kvm_psci_call(vcpu);
> >       }
> >
> > +out:
> >       smccc_set_retval(vcpu, val[0], val[1], val[2], val[3]);
> >       return 1;
> >  }
> > @@ -153,17 +192,178 @@ static const u64 fw_reg_ids[] = {
> >       KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2,
> >  };
> >
> > +static const u64 fw_reg_bmap_ids[] = {
> > +     KVM_REG_ARM_STD_BMAP,
> > +};
> > +
> > +static void kvm_arm_fw_reg_init_hvc(struct hvc_reg_desc *hvc_desc,
> > +                                     struct hvc_fw_reg_bmap *fw_reg_bmap,
> > +                                     u64 reg_id, u64 default_map)
> > +{
> > +     fw_reg_bmap->reg_id = reg_id;
> > +     fw_reg_bmap->bmap = default_map;
> > +}
> > +
> > +void kvm_arm_init_hypercalls(struct kvm *kvm)
> > +{
> > +     struct hvc_reg_desc *hvc_desc = &kvm->arch.hvc_desc;
> > +
> > +     spin_lock_init(&hvc_desc->lock);
> > +
> > +     kvm_arm_fw_reg_init_hvc(hvc_desc, &hvc_desc->hvc_std_bmap,
> > +                             KVM_REG_ARM_STD_BMAP, ARM_SMCCC_STD_FEATURES);
> > +}
> > +
> > +static void kvm_arm_fw_reg_sanitize(struct hvc_fw_reg_bmap *fw_reg_bmap)
> > +{
> > +     if (!fw_reg_bmap->accessed)
> > +             fw_reg_bmap->bmap = 0;
> > +}
> > +
> > +/*
> > + * kvm_arm_sanitize_fw_regs: Sanitize the hypercall firmware registers
> > + *
> > + * Sanitization, in the case of hypercall firmware registers, is basically
> > + * clearing out the feature bitmaps so that the guests are not exposed to
> > + * the services corresponding to a particular register. The registers that
> > + * needs sanitization is decided on two factors on the user-space part:
> > + *   1. Enablement of KVM_CAP_ARM_HVC_FW_REG_BMAP:
> > + *      If the user-space hasn't enabled the capability, it either means
> > + *      that it's unaware of its existence, or it simply doesn't want to
> > + *      participate in the arrangement and is okay with the default settings.
> > + *      The former case is to ensure backward compatibility.
> > + *
> > + *   2. Has the user-space accessed (read/write) the register? :
> > + *      If yes, it means that the user-space is aware of the register's
> > + *      existence and can set the bits as it sees fit for the guest. A
> > + *      read-only access from user-space indicates that the user-space is
> > + *      happy with the default settings, and doesn't wish to change it.
> > + *
> > + * The logic for sanitizing a register will then be:
> > + * ---------------------------------------------------------------------------
> > + * | CAP enabled | Accessed reg | Clear reg | Comments                       |
> > + * ---------------------------------------------------------------------------
> > + * |      N      |       N      |     N     |                                |
> > + * |      N      |       Y      |     N     | -ENOENT returned during access |
> > + * |      Y      |       N      |     Y     |                                |
> > + * |      Y      |       Y      |     N     |                                |
> > + * ---------------------------------------------------------------------------
> > + */
> > +void kvm_arm_sanitize_fw_regs(struct kvm *kvm)
> > +{
> > +     struct hvc_reg_desc *hvc_desc = &kvm->arch.hvc_desc;
> > +
> > +     spin_lock(&hvc_desc->lock);
> > +
> > +     if (!hvc_desc->fw_reg_bmap_enabled)
> > +             goto out;
> > +
> > +     kvm_arm_fw_reg_sanitize(&hvc_desc->hvc_std_bmap);
> > +
> > +out:
> > +     spin_unlock(&hvc_desc->lock);
>
> I keep being baffled by this. Why should we track the VMM accesses or
> the VMM writeback? This logic doesn't seem to bring anything useful as
> far as I can tell. All we need to ensure is that what is written to
> the pseudo-register is an acceptable subset of the previous value, and
> I cannot see why this can't be done at write-time.
>
> If you want to hide this behind a capability, fine (although my guts
> feeling is that we don't need that either). But I really want to be
> convinced about all this tracking.
>
The tracking of each owner register is necessary here to safeguard
against the possibility that user-space is unaware of a newly
introduced register, and hence hasn't accessed it. If it had at least
read the register, even without writing it back, we assume that
user-space is happy with the configuration. But the fact that the
register has not even been read indicates that user-space is unaware
of its existence. In such a case, if we don't sanitize (clear all the
bits of) this register, its features will be exposed unconditionally
to the guest.

The capability is introduced here to make sure that this new
infrastructure is backward compatible with old VMMs. If a VMM doesn't
enable this capability, it is probably unaware of the scheme, and
things will work as they always have: new services are exposed to the
guest unconditionally as and when they are introduced.
> > +}
> > +
> > +static int kvm_arm_fw_reg_get_bmap(struct kvm *kvm,
> > +                             struct hvc_fw_reg_bmap *fw_reg_bmap, u64 *val)
> > +{
> > +     int ret = 0;
> > +     struct hvc_reg_desc *hvc_desc = &kvm->arch.hvc_desc;
> > +
> > +     spin_lock(&hvc_desc->lock);
> > +
> > +     if (!hvc_desc->fw_reg_bmap_enabled) {
> > +             ret = -ENOENT;
> > +             goto out;
> > +     }
> > +
> > +     fw_reg_bmap->accessed = true;
> > +     *val = fw_reg_bmap->bmap;
> > +out:
> > +     spin_unlock(&hvc_desc->lock);
> > +     return ret;
> > +}
> > +
> > +static int kvm_arm_fw_reg_set_bmap(struct kvm *kvm,
> > +                             struct hvc_fw_reg_bmap *fw_reg_bmap, u64 val)
> > +{
> > +     int ret = 0;
> > +     u64 fw_reg_features;
> > +     struct hvc_reg_desc *hvc_desc = &kvm->arch.hvc_desc;
> > +
> > +     spin_lock(&hvc_desc->lock);
> > +
> > +     if (!hvc_desc->fw_reg_bmap_enabled) {
> > +             ret = -ENOENT;
> > +             goto out;
> > +     }
> > +
> > +     if (fw_reg_bmap->bmap == val)
> > +             goto out;
> > +
> > +     if (kvm_vm_has_run_once(kvm)) {
> > +             ret = -EBUSY;
> > +             goto out;
> > +     }
> > +
> > +     switch (fw_reg_bmap->reg_id) {
> > +     case KVM_REG_ARM_STD_BMAP:
> > +             fw_reg_features = ARM_SMCCC_STD_FEATURES;
> > +             break;
> > +     default:
> > +             ret = -EINVAL;
> > +             goto out;
> > +     }
> > +
> > +     /* Check for unsupported feature bit */
> > +     if (val & ~fw_reg_features) {
> > +             ret = -EINVAL;
> > +             goto out;
> > +     }
> > +
> > +     fw_reg_bmap->accessed = true;
> > +     fw_reg_bmap->bmap = val;
> > +out:
> > +     spin_unlock(&hvc_desc->lock);
> > +     return ret;
> > +}
> > +
> >  int kvm_arm_get_fw_num_regs(struct kvm_vcpu *vcpu)
> >  {
> > -     return ARRAY_SIZE(fw_reg_ids);
> > +     struct hvc_reg_desc *hvc_desc = &vcpu->kvm->arch.hvc_desc;
> > +     int n_regs = ARRAY_SIZE(fw_reg_ids);
> > +
> > +     spin_lock(&hvc_desc->lock);
> > +
> > +     if (hvc_desc->fw_reg_bmap_enabled)
> > +             n_regs += ARRAY_SIZE(fw_reg_bmap_ids);
> > +
> > +     spin_unlock(&hvc_desc->lock);
> > +
> > +     return n_regs;
> >  }
> >
> >  int kvm_arm_copy_fw_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices)
> >  {
> > +     struct hvc_reg_desc *hvc_desc = &vcpu->kvm->arch.hvc_desc;
> >       int i;
> >
> >       for (i = 0; i < ARRAY_SIZE(fw_reg_ids); i++) {
> > -             if (put_user(fw_reg_ids[i], uindices))
> > +             if (put_user(fw_reg_ids[i], uindices++))
> > +                     return -EFAULT;
> > +     }
> > +
> > +     spin_lock(&hvc_desc->lock);
> > +
> > +     if (!hvc_desc->fw_reg_bmap_enabled) {
> > +             spin_unlock(&hvc_desc->lock);
> > +             return 0;
> > +     }
> > +
> > +     spin_unlock(&hvc_desc->lock);
> > +
> > +     for (i = 0; i < ARRAY_SIZE(fw_reg_bmap_ids); i++) {
> > +             if (put_user(fw_reg_bmap_ids[i], uindices++))
> >                       return -EFAULT;
>
> I really don't get what you are trying to achieve with this locking.
> You guard the 'enabled' bit, but you still allow the kernel view to be
> copied to userspace while another thread is writing to it?
>
That's my mistake. Thanks for pointing it out. I'll get rid of the
spinlock and use the existing locks correctly.

Regards,
Raghavendra

> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.
Raghavendra Rao Ananta Nov. 23, 2021, 6:48 p.m. UTC | #6
On Mon, Nov 22, 2021 at 8:31 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Sat, 13 Nov 2021 01:22:26 +0000,
> Raghavendra Rao Ananta <rananta@google.com> wrote:
> >
> > The upcoming patches need a way to detect if the VM, as
> > a whole, has started. Hence, unionize kvm_vcpu_has_run_once()
> > of all the vcpus of the VM and build kvm_vm_has_run_once()
> > to achieve the functionality.
> >
> > No functional change intended.
> >
> > Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> > ---
> >  include/linux/kvm_host.h |  2 ++
> >  virt/kvm/kvm_main.c      | 17 +++++++++++++++++
> >  2 files changed, 19 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index b373929c71eb..102e00c0e21c 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1854,4 +1854,6 @@ static inline bool kvm_vcpu_has_run_once(struct kvm_vcpu *vcpu)
> >       return vcpu->has_run_once;
> >  }
> >
> > +bool kvm_vm_has_run_once(struct kvm *kvm);
> > +
> >  #endif
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 1ec8a8e959b2..3d8d96e8f61d 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -4339,6 +4339,23 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> >       return fd;
> >  }
> >
> > +bool kvm_vm_has_run_once(struct kvm *kvm)
> > +{
> > +     int i, ret = false;
> > +     struct kvm_vcpu *vcpu;
> > +
> > +     mutex_lock(&kvm->lock);
> > +
> > +     kvm_for_each_vcpu(i, vcpu, kvm) {
> > +             ret = kvm_vcpu_has_run_once(vcpu);
> > +             if (ret)
> > +                     break;
> > +     }
> > +
> > +     mutex_unlock(&kvm->lock);
> > +     return ret;
> > +}
>
> This is horribly racy. Nothing prevents a vcpu from running behind
> your back. If you want any sort of guarantee, look at what we do in
> kvm_vgic_create(). Alexandru has patches that extract it to make it
> generally available (at least for arm64).
>
Yes, I looked into kvm_lock_all_vcpus(), but backed off because the
series would call the function with the current vcpu's lock already
held.
Perhaps I can come up with a similar function, kvm_lock_all_vcpus_except(vcpu)?

Regards,
Raghavendra

>         M.
>
> --
> Without deviation from the norm, progress is not possible.
Andrew Jones Nov. 27, 2021, 1:16 p.m. UTC | #7
On Sat, Nov 13, 2021 at 01:22:24AM +0000, Raghavendra Rao Ananta wrote:
> Common hypercall firmware register handing is currently employed
> by psci.c. Since the upcoming patches add more of these registers,
> it's better to move the generic handling to hypercall.c for a
> cleaner presentation.
> 
> While we are at it, collect all the firmware registers under
> fw_reg_ids[] to help implement kvm_arm_get_fw_num_regs() and
> kvm_arm_copy_fw_reg_indices() in a generic way.
> 
> No functional change intended.
> 
> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> ---
>  arch/arm64/kvm/guest.c       |   2 +-
>  arch/arm64/kvm/hypercalls.c  | 170 +++++++++++++++++++++++++++++++++++
>  arch/arm64/kvm/psci.c        | 166 ----------------------------------
>  include/kvm/arm_hypercalls.h |   7 ++
>  include/kvm/arm_psci.h       |   7 --
>  5 files changed, 178 insertions(+), 174 deletions(-)
> 
> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index 5ce26bedf23c..625f97f7b304 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -18,7 +18,7 @@
>  #include <linux/string.h>
>  #include <linux/vmalloc.h>
>  #include <linux/fs.h>
> -#include <kvm/arm_psci.h>
> +#include <kvm/arm_hypercalls.h>
>  #include <asm/cputype.h>
>  #include <linux/uaccess.h>
>  #include <asm/fpsimd.h>
> diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
> index 30da78f72b3b..9e136d91b470 100644
> --- a/arch/arm64/kvm/hypercalls.c
> +++ b/arch/arm64/kvm/hypercalls.c
> @@ -146,3 +146,173 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>  	smccc_set_retval(vcpu, val[0], val[1], val[2], val[3]);
>  	return 1;
>  }
> +
> +static const u64 fw_reg_ids[] = {
> +	KVM_REG_ARM_PSCI_VERSION,
> +	KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1,
> +	KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2,
> +};
> +
> +int kvm_arm_get_fw_num_regs(struct kvm_vcpu *vcpu)
> +{
> +	return ARRAY_SIZE(fw_reg_ids);
> +}
> +
> +int kvm_arm_copy_fw_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(fw_reg_ids); i++) {
> +		if (put_user(fw_reg_ids[i], uindices))

This is missing the ++ on uindices, so it just writes the same offset
three times.

> +			return -EFAULT;
> +	}
> +
> +	return 0;
> +}

I assume the rest of the patch is just a cut+paste move of code.

Thanks,
drew
Andrew Jones Nov. 27, 2021, 5:27 p.m. UTC | #8
On Tue, Nov 23, 2021 at 10:34:23AM -0800, Raghavendra Rao Ananta wrote:
> On Mon, Nov 22, 2021 at 9:23 AM Marc Zyngier <maz@kernel.org> wrote:
> > I keep being baffled by this. Why should we track the VMM accesses or
> > the VMM writeback? This logic doesn't seem to bring anything useful as
> > far as I can tell. All we need to ensure is that what is written to
> > the pseudo-register is an acceptable subset of the previous value, and
> > I cannot see why this can't be done at write-time.
> >
> > If you want to hide this behind a capability, fine (although my guts
> > feeling is that we don't need that either). But I really want to be
> > convinced about all this tracking.
> >
> The tracking of each owner register is necessary here to safe-guard
> the possibility that the user-space may not be aware of a newly
> introduced register, and hence, hasn't accessed it. If it had at least
> read the register, but not write-back, we assume that the user-space
> is happy with the configuration. But the fact that the register has
> not even been read would state that user-space is unaware of the
> existence of this new register. In such a case, if we don't sanitize
> (clear all the bits) this register, the features will be exposed
> unconditionally to the guest.
> 
> The capability is introduced here to make sure that this new
> infrastructure is backward compatible with old VMMs. If the VMMs don't
> enable this capability, they are probably unaware of this, and this
> will work as it always has- expose new services to the guest
> unconditionally as and when they are introduced.

Hi Raghavendra,

I don't think we need a CAP that has to be enabled or to make any
assumptions or policy decisions in the kernel. I think we just need to
provide a bit more information to the VMM when it checks if KVM has the
CAP. If KVM would tell the VMM how many pseudo registers there are, which
can be done with the return value of the CAP, then the VMM code could be
something like this

  r = check_cap(KVM_CAP_ARM_HVC_FW_REG_BMAP);
  if (r) {
    num_regs = r;

    for (idx = 0; idx < num_regs; ++idx) {
      reg = hvc_fw_reg(idx);

      if (idx > vmm_last_known_idx) {
        ...
      } else {
        ...
      }
    }
  }

With this, the VMM is free to decide if it wants to clear all registers
greater than the last index it was aware of or if it wants to let those
registers just get exposed to the guest without knowing what's getting
exposed. Along with documenting that by default everything gets exposed
by KVM, which is the backwards compatible thing to do, then the VMM has
been warned and given everything it needs to manage its guests.

Another thing that might be nice is giving userspace control of how many
pseudo registers show up in get-reg-list. In order to migrate from a host
with a more recent KVM to a host with an older KVM[*] we should only
expose the number of pseudo registers that the older host is aware of.
The VMM would zero these registers out anyway, in order to be compatible
for migration, but that's not enough when they also show up in the list
(at least not with QEMU, which aborts migration when the destination
expects fewer registers than get-reg-list provides)

[*] This isn't a great idea, but it'd be nice if we can make it work,
because users may want to rollback upgrades or, after migrating to a
host with a newer kernel, they may want to migrate back to where they
started.

Thanks,
drew
Raghavendra Rao Ananta Nov. 30, 2021, 12:56 a.m. UTC | #9
On Sat, Nov 27, 2021 at 9:27 AM Andrew Jones <drjones@redhat.com> wrote:
>
> On Tue, Nov 23, 2021 at 10:34:23AM -0800, Raghavendra Rao Ananta wrote:
> > On Mon, Nov 22, 2021 at 9:23 AM Marc Zyngier <maz@kernel.org> wrote:
> > > I keep being baffled by this. Why should we track the VMM accesses or
> > > the VMM writeback? This logic doesn't seem to bring anything useful as
> > > far as I can tell. All we need to ensure is that what is written to
> > > the pseudo-register is an acceptable subset of the previous value, and
> > > I cannot see why this can't be done at write-time.
> > >
> > > If you want to hide this behind a capability, fine (although my guts
> > > feeling is that we don't need that either). But I really want to be
> > > convinced about all this tracking.
> > >
> > The tracking of each owner register is necessary here to safe-guard
> > the possibility that the user-space may not be aware of a newly
> > introduced register, and hence, hasn't accessed it. If it had at least
> > read the register, but not write-back, we assume that the user-space
> > is happy with the configuration. But the fact that the register has
> > not even been read would state that user-space is unaware of the
> > existence of this new register. In such a case, if we don't sanitize
> > (clear all the bits) this register, the features will be exposed
> > unconditionally to the guest.
> >
> > The capability is introduced here to make sure that this new
> > infrastructure is backward compatible with old VMMs. If the VMMs don't
> > enable this capability, they are probably unaware of this, and this
> > will work as it always has- expose new services to the guest
> > unconditionally as and when they are introduced.
>
> Hi Raghavendra,
>
> I don't think we need a CAP that has to be enabled or to make any
> assumptions or policy decisions in the kernel. I think we just need to
> provide a bit more information to the VMM when it checks if KVM has the
> CAP. If KVM would tell the VMM how may pseudo registers there are, which
> can be done with the return value of the CAP, then the VMM code could be
> something like this
>
>   r = check_cap(KVM_CAP_ARM_HVC_FW_REG_BMAP);
>   if (r) {
>     num_regs = r;
>
>     for (idx = 0; idx < num_regs; ++idx) {
>       reg = hvc_fw_reg(idx);
>
>       if (idx > vmm_last_known_idx) {
>         ...
>       } else {
>         ...
>       }
>     }
>   }
>
> With this, the VMM is free to decide if it wants to clear all registers
> greater than the last index it was aware of or if it wants to let those
> registers just get exposed to the guest without knowing what's getting
> exposed. Along with documenting that by default everything gets exposed
> by KVM, which is the backwards compatible thing to do, then the VMM has
> been warned and given everything it needs to manage its guests.
>
Hi Andrew,

Thanks for your comments and suggestions!

I like the idea of sharing the information via a read of the CAP, and
of not having to explicitly sanitize/clear the registers before the
guest begins to run.
However, the handshake then rests entirely on the API documentation,
which is a little concerning: user-space must remember to explicitly
clear any new register that it doesn't want to expose to the guest,
while the current approach does this automatically.
Any bug in the VMM's implementation could unintentionally expose
features to the guest. What do you think?

> Another thing that might be nice is giving userspace control of how many
> pseudo registers show up in get-reg-list. In order to migrate from a host
> with a more recent KVM to a host with an older KVM[*] we should only
> expose the number of pseudo registers that the older host is aware of.
> The VMM would zero these registers out anyway, in order to be compatible
> for migration, but that's not enough when they also show up in the list
> (at least not with QEMU that aborts migration when the destination
> expects less registers than what get-reg-list provides)
>
> [*] This isn't a great idea, but it'd be nice if we can make it work,
> because users may want to rollback upgrades or, after migrating to a
> host with a newer kernel, they may want to migrate back to where they
> started.
>
Good point. But IIUC, if the user-space is able to communicate the
info that it's expecting a certain get-reg-list, do you think it can
handle it at its end too, rather than relying on the kernel to send a
list back?

My assumption was that VMM would statically maintain a known set of
registers that it wants to work with and are to be modified by hand,
rather than relying on get-reg-list. This could be the least common
set of registers that are present in all the host kernels (higher or
lower versions) of the migration fleet. This config doesn't change
even with get-reg-list declaring a new register as the features
exposed by it could still be untested. Although, migrating to a host
with a missing register shouldn't be possible in this setting, but if
it encounters the scenario, it should be able to avoid migration to
the host (similar to QEMU).

Please correct me if you think it's a false assumption to proceed with.

Regards,
Raghavendra

> Thanks,
> drew
>
Raghavendra Rao Ananta Nov. 30, 2021, 12:57 a.m. UTC | #10
On Sat, Nov 27, 2021 at 5:16 AM Andrew Jones <drjones@redhat.com> wrote:
>
> On Sat, Nov 13, 2021 at 01:22:24AM +0000, Raghavendra Rao Ananta wrote:
> > Common hypercall firmware register handing is currently employed
> > by psci.c. Since the upcoming patches add more of these registers,
> > it's better to move the generic handling to hypercall.c for a
> > cleaner presentation.
> >
> > While we are at it, collect all the firmware registers under
> > fw_reg_ids[] to help implement kvm_arm_get_fw_num_regs() and
> > kvm_arm_copy_fw_reg_indices() in a generic way.
> >
> > No functional change intended.
> >
> > Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> > ---
> >  arch/arm64/kvm/guest.c       |   2 +-
> >  arch/arm64/kvm/hypercalls.c  | 170 +++++++++++++++++++++++++++++++++++
> >  arch/arm64/kvm/psci.c        | 166 ----------------------------------
> >  include/kvm/arm_hypercalls.h |   7 ++
> >  include/kvm/arm_psci.h       |   7 --
> >  5 files changed, 178 insertions(+), 174 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> > index 5ce26bedf23c..625f97f7b304 100644
> > --- a/arch/arm64/kvm/guest.c
> > +++ b/arch/arm64/kvm/guest.c
> > @@ -18,7 +18,7 @@
> >  #include <linux/string.h>
> >  #include <linux/vmalloc.h>
> >  #include <linux/fs.h>
> > -#include <kvm/arm_psci.h>
> > +#include <kvm/arm_hypercalls.h>
> >  #include <asm/cputype.h>
> >  #include <linux/uaccess.h>
> >  #include <asm/fpsimd.h>
> > diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
> > index 30da78f72b3b..9e136d91b470 100644
> > --- a/arch/arm64/kvm/hypercalls.c
> > +++ b/arch/arm64/kvm/hypercalls.c
> > @@ -146,3 +146,173 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
> >       smccc_set_retval(vcpu, val[0], val[1], val[2], val[3]);
> >       return 1;
> >  }
> > +
> > +static const u64 fw_reg_ids[] = {
> > +     KVM_REG_ARM_PSCI_VERSION,
> > +     KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1,
> > +     KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2,
> > +};
> > +
> > +int kvm_arm_get_fw_num_regs(struct kvm_vcpu *vcpu)
> > +{
> > +     return ARRAY_SIZE(fw_reg_ids);
> > +}
> > +
> > +int kvm_arm_copy_fw_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < ARRAY_SIZE(fw_reg_ids); i++) {
> > +             if (put_user(fw_reg_ids[i], uindices))
>
> This is missing the ++ on uindices, so it just writes the same offset
> three times.
>
Thanks for catching this! I realized it later and corrected it in
patch 04/11 of the series, but missed it here.
I'll fix it here as well.

> > +                     return -EFAULT;
> > +     }
> > +
> > +     return 0;
> > +}
>
> I assume the rest of the patch is just a cut+paste move of code.
>
That's right.

Regards,
Raghavendra

> Thanks,
> drew
>
Andrew Jones Nov. 30, 2021, 10:19 a.m. UTC | #11
On Mon, Nov 29, 2021 at 04:56:19PM -0800, Raghavendra Rao Ananta wrote:
> On Sat, Nov 27, 2021 at 9:27 AM Andrew Jones <drjones@redhat.com> wrote:
> >
> > On Tue, Nov 23, 2021 at 10:34:23AM -0800, Raghavendra Rao Ananta wrote:
> > > On Mon, Nov 22, 2021 at 9:23 AM Marc Zyngier <maz@kernel.org> wrote:
> > > > I keep being baffled by this. Why should we track the VMM accesses or
> > > > the VMM writeback? This logic doesn't seem to bring anything useful as
> > > > far as I can tell. All we need to ensure is that what is written to
> > > > the pseudo-register is an acceptable subset of the previous value, and
> > > > I cannot see why this can't be done at write-time.
> > > >
> > > > If you want to hide this behind a capability, fine (although my guts
> > > > feeling is that we don't need that either). But I really want to be
> > > > convinced about all this tracking.
> > > >
> > > The tracking of each owner register is necessary here to safe-guard
> > > the possibility that the user-space may not be aware of a newly
> > > introduced register, and hence, hasn't accessed it. If it had at least
> > > read the register, but not write-back, we assume that the user-space
> > > is happy with the configuration. But the fact that the register has
> > > not even been read would state that user-space is unaware of the
> > > existence of this new register. In such a case, if we don't sanitize
> > > (clear all the bits) this register, the features will be exposed
> > > unconditionally to the guest.
> > >
> > > The capability is introduced here to make sure that this new
> > > infrastructure is backward compatible with old VMMs. If the VMMs don't
> > > enable this capability, they are probably unaware of this, and this
> > > will work as it always has- expose new services to the guest
> > > unconditionally as and when they are introduced.
> >
> > Hi Raghavendra,
> >
> > I don't think we need a CAP that has to be enabled or to make any
> > assumptions or policy decisions in the kernel. I think we just need to
> > provide a bit more information to the VMM when it checks if KVM has the
> > CAP. If KVM would tell the VMM how may pseudo registers there are, which
> > can be done with the return value of the CAP, then the VMM code could be
> > something like this
> >
> >   r = check_cap(KVM_CAP_ARM_HVC_FW_REG_BMAP);
> >   if (r) {
> >     num_regs = r;
> >
> >     for (idx = 0; idx < num_regs; ++idx) {
> >       reg = hvc_fw_reg(idx);
> >
> >       if (idx > vmm_last_known_idx) {
> >         ...
> >       } else {
> >         ...
> >       }
> >     }
> >   }
> >
> > With this, the VMM is free to decide if it wants to clear all registers
> > greater than the last index it was aware of or if it wants to let those
> > registers just get exposed to the guest without knowing what's getting
> > exposed. Along with documenting that by default everything gets exposed
> > by KVM, which is the backwards compatible thing to do, then the VMM has
> > been warned and given everything it needs to manage its guests.
> >
> Hi Andrew,
> 
> Thanks for your comments and suggestions!
> 
> I like the idea of sharing info via a read of the CAP, and not having
> to explicitly sanitize/clear the registers before the guest begins to
> run.
> However, the handshake is done purely via API documentation, which is
> a little concerning: the user-space must remember to explicitly clear
> any new register that it doesn't want to expose to the guest, while
> the current approach does this automatically.
> Any bug in the VMM's implementation could be risky and unintentionally
> expose features to the guest. What do you think?

The VMM can mess things up in many ways. While KVM should protect itself
from the VMM, it shouldn't try to protect the VMM from the VMM itself. In
this case, the risk here isn't that we allow the VMM to do something that
can harm KVM, or even the guest. The risk is only that the VMM fails to do
what it wanted to do (assuming it didn't want to expose unknown features
to the guest). I.e. the risk here is only that the VMM has a bug, and it's
an easily detectable bug. I say let the VMM developers manage it.

> 
> > Another thing that might be nice is giving userspace control of how many
> > pseudo registers show up in get-reg-list. In order to migrate from a host
> > with a more recent KVM to a host with an older KVM[*] we should only
> > expose the number of pseudo registers that the older host is aware of.
> > The VMM would zero these registers out anyway, in order to be compatible
> > for migration, but that's not enough when they also show up in the list
> > (at least not with QEMU, which aborts migration when the destination
> > expects fewer registers than what get-reg-list provides)
> >
> > [*] This isn't a great idea, but it'd be nice if we can make it work,
> > because users may want to rollback upgrades or, after migrating to a
> > host with a newer kernel, they may want to migrate back to where they
> > started.
> >
> Good point. But IIUC, if the user-space is able to communicate the
> info that it's expecting a certain get-reg-list, do you think it can
> handle it at its end too, rather than relying on the kernel to send a
> list back?

Yes, I think we can probably manage this in the VMM, and maybe/probably
that's the better place to manage it.

> 
> My assumption was that the VMM would statically maintain a known set
> of registers that it wants to work with, modified by hand rather than
> derived from get-reg-list. This could be the least common set of
> registers present in all the host kernels (higher or lower versions)
> of the migration fleet. This config doesn't change even when
> get-reg-list declares a new register, as the features exposed by it
> could still be untested. Although migrating to a host with a missing
> register shouldn't be possible in this setting, if the VMM does
> encounter that scenario it should be able to avoid migrating to that
> host (similar to QEMU).
> 
> Please correct me if you think it's a false assumption to proceed with.

Your assumptions align with mine. It seems as we move towards CPU models,
get-reg-list's role will likely only be to confirm a host supports the
minimum required. We should probably implement/change the VMM to allow
migrating from a host with more registers to one with fewer, as long as
the one with fewer includes all the required registers. Of course we
also need to ensure that any registers we don't want to require are
not exposed to the guest, but I guess that's precisely what we're trying
to do with this series for at least some pseudo registers.

Thanks,
drew