[RFC,00/27] KVM/arm64: A stage 2 for the host

Message ID 20201117181607.1761516-1-qperret@google.com

Message

Quentin Perret Nov. 17, 2020, 6:15 p.m. UTC
Hi all,

This RFC series provides the infrastructure needed to wrap the host
kernel with a stage 2 page-table when running KVM in nVHE. This can be
useful for several use-cases, but the primary motivation is to
(eventually) be able to protect guest memory from the host kernel. More
details about the
overall idea, design, and motivations can be found in Will's talk at KVM
Forum 2020 [1], or the pKVM talk at the Android uconf during LPC 2020
[2].

This series essentially gets us to a point where the 'VM' bit is set
in the host's HCR_EL2 when running in nVHE and 'kvm-arm.protected' is
set on the kernel command line. The EL2 object directly handles memory
aborts from the host and fully manages its own stage 2 page-table.

However, this series does _not_ provide any real user for this (yet)
and simply idmaps everything into the host stage 2 as RWX cacheable.
This is all about the infrastructure for now, so clearly not ready for
inclusion upstream yet (hence the RFC tag), but the bases are there and
I thought it'd be useful to start a discussion with the community early
as this is a rather intrusive change. So, here goes.
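
For the curious, the "enable" path for the host stage 2 essentially
boils down to the below at EL2. This is only a minimal sketch to
illustrate the idea (it glosses over VTCR_EL2 setup and TLB/cache
maintenance, and is not the actual patch):

/*
 * Illustrative only: install the host's stage 2 and set HCR_EL2.VM so
 * that host memory accesses are translated through it.
 */
static void __load_host_stage2_sketch(struct kvm_s2_mmu *mmu)
{
	write_sysreg(kvm_get_vttbr(mmu), vttbr_el2);
	write_sysreg(read_sysreg(hcr_el2) | HCR_VM, hcr_el2);
	isb();
}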

One of the interesting requirements that comes with the series is that
managing page-tables requires some sort of memory allocator at EL2 to
allocate, refcount and free memory pages. Clearly, none of that is
currently possible in nVHE, so a significant chunk of the series is
dedicated to solving that problem. The proposed EL2 memory allocator
mimics the Linux buddy system in principle, and re-uses some of the
arm64 mm design choices. Specifically, it uses a vmemmap at EL2 which
contains a set of struct hyp_page entries to hold page metadata. To
support this, I extended the EL2 object to make it manage its own stage
1 page-table in addition to the host stage 2. This simplifies the
hyp_vmemmap
creation and was going to be required anyway for the protected VM
use-case -- the threat model implies the host cannot be trusted after
boot, and it will thus be crucial to ensure it cannot map arbitrary code
at EL2.
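
For illustration, the vmemmap scheme boils down to something like the
below (a sketch with abridged field names, not the exact layout from
nvhe/memory.h):

/* One struct hyp_page per page of memory, indexed by PFN. */
struct hyp_page {
	unsigned int refcount;
	unsigned int order;		/* buddy order when the page is free */
	struct hyp_pool *pool;
	struct list_head node;		/* freelist linkage */
};

/* Base VA of the EL2 vmemmap, set up during hyp init. */
extern u64 __hyp_vmemmap;
#define hyp_vmemmap		((struct hyp_page *)__hyp_vmemmap)

/* phys <-> metadata conversions are then simple arithmetic. */
#define hyp_phys_to_page(phys)	(&hyp_vmemmap[(phys) >> PAGE_SHIFT])

Only the portion of that array covering the pages donated to EL2 needs
to be backed by real memory, which is what hyp_back_vmemmap() takes
care of later in the series.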

The pool of memory pages used by the EL2 allocator is reserved by the
host early during boot (while it is still trusted) using the memblock
API, and are donated to EL2 during KVM init. The current assumption is
that the host reserves enough pages to allow the EL2 object to map all
of memory at page granularity for both hyp stage 1 and host stage 2,
plus some extra pages for device mappings.
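
Roughly speaking, the host-side reservation (reserve_kvm_hyp() in
kvm/hyp/reserved_mem.c) ends up looking like the below. This is a
simplified sketch: the helpers computing the per-stage page budgets are
made up here for brevity, and the real code is a bit more careful.

void __init reserve_kvm_hyp(void)
{
	u64 nr_pages;

	if (!is_hyp_mode_available() || is_kernel_in_hyp_mode())
		return;

	/* Worst-case page-table pages for hyp stage 1 + host stage 2 */
	nr_pages = hyp_s1_pgtable_pages() + host_s2_pgtable_pages();
	nr_pages += extra_device_pages();
	hyp_mem_size = nr_pages << PAGE_SHIFT;

	hyp_mem_base = memblock_phys_alloc(hyp_mem_size, PMD_SIZE);
	if (!hyp_mem_base)
		kvm_err("Failed to reserve hyp memory\n");
}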

On top of that the series introduces a few smaller features that are
needed along the way, but hopefully all of those are detailed properly
in the relevant commit messages.

And as a last note, I'd like to point out that there are at this point
trivial ways for the host to circumvent its stage 2 protection. It still
owns the guests' stage 2, for example, meaning that nothing would
prevent a malicious host from using a guest as a proxy to access
protected memory, _yet_. This series lays the groundwork for future
work to address these things, all of which will require a stage 2 over
the host at some point, so I just wanted to set the expectations right.

With all that in mind, the series is organized as follows:

 - patches 01-03 provide EL2 with some utility libraries needed for
   memory management and synchronization;

 - patches 04-09 mostly refactor small portions of the code to ease the
   EL2 memory management;

 - patches 10-17 add the actual EL2 memory management code, as well as
   the setup/bootstrap code on the KVM init path;

 - patches 18-24 refactor the existing stage 2 management code to make
   it re-usable from the EL2 object;

 - and finally patches 25-27 introduce the host stage 2 and the trap
   handling logic at EL2.

This work is based on the latest kvmarm/queue (which includes Marc's
host EL2 entry rework [3], as well as Will's guest vector refactoring
[4]) + David's PSCI proxying series [5].

And if you'd like a branch that has all the bits and pieces:

    https://android-kvm.googlesource.com/linux qperret/host-stage2

Boot-tested (host and guest) using qemu in VHE and nVHE, and on real
hardware on an AML-S905X-CC (Le Potato).

Thanks,
Quentin

[1] https://kvmforum2020.sched.com/event/eE24/virtualization-for-the-masses-exposing-kvm-on-android-will-deacon-google
[2] https://youtu.be/54q6RzS9BpQ?t=10859
[3] https://lore.kernel.org/kvmarm/20201109175923.445945-1-maz@kernel.org/
[4] https://lore.kernel.org/kvmarm/20201113113847.21619-1-will@kernel.org/
[5] https://lore.kernel.org/kvmarm/20201116204318.63987-1-dbrazdil@google.com/


Quentin Perret (24):
  KVM: arm64: Initialize kvm_nvhe_init_params early
  KVM: arm64: Avoid free_page() in page-table allocator
  KVM: arm64: Factor memory allocation out of pgtable.c
  KVM: arm64: Introduce a BSS section for use at Hyp
  KVM: arm64: Make kvm_call_hyp() a function call at Hyp
  KVM: arm64: Allow using kvm_nvhe_sym() in hyp code
  KVM: arm64: Introduce an early Hyp page allocator
  KVM: arm64: Stub CONFIG_DEBUG_LIST at Hyp
  KVM: arm64: Introduce a Hyp buddy page allocator
  KVM: arm64: Enable access to sanitized CPU features at EL2
  KVM: arm64: Factor out vector address calculation
  of/fdt: Introduce early_init_dt_add_memory_hyp()
  KVM: arm64: Prepare Hyp memory protection
  KVM: arm64: Elevate Hyp mappings creation at EL2
  KVM: arm64: Use kvm_arch for stage 2 pgtable
  KVM: arm64: Use kvm_arch in kvm_s2_mmu
  KVM: arm64: Set host stage 2 using kvm_nvhe_init_params
  KVM: arm64: Refactor kvm_arm_setup_stage2()
  KVM: arm64: Refactor __load_guest_stage2()
  KVM: arm64: Refactor __populate_fault_info()
  KVM: arm64: Make memcache anonymous in pgtable allocator
  KVM: arm64: Reserve memory for host stage 2
  KVM: arm64: Sort the memblock regions list
  KVM: arm64: Wrap the host with a stage 2

Will Deacon (3):
  arm64: lib: Annotate {clear,copy}_page() as position-independent
  KVM: arm64: Link position-independent string routines into .hyp.text
  KVM: arm64: Add standalone ticket spinlock implementation for use at
    hyp

 arch/arm64/include/asm/cpufeature.h           |   1 +
 arch/arm64/include/asm/hyp_image.h            |   4 +
 arch/arm64/include/asm/kvm_asm.h              |  13 +-
 arch/arm64/include/asm/kvm_cpufeature.h       |  19 ++
 arch/arm64/include/asm/kvm_host.h             |  17 +-
 arch/arm64/include/asm/kvm_hyp.h              |   8 +
 arch/arm64/include/asm/kvm_mmu.h              |  69 +++++-
 arch/arm64/include/asm/kvm_pgtable.h          |  41 +++-
 arch/arm64/include/asm/sections.h             |   1 +
 arch/arm64/kernel/asm-offsets.c               |   3 +
 arch/arm64/kernel/cpufeature.c                |  14 +-
 arch/arm64/kernel/image-vars.h                |  35 +++
 arch/arm64/kernel/vmlinux.lds.S               |   7 +
 arch/arm64/kvm/arm.c                          | 136 +++++++++--
 arch/arm64/kvm/hyp/Makefile                   |   2 +-
 arch/arm64/kvm/hyp/include/hyp/switch.h       |  36 +--
 arch/arm64/kvm/hyp/include/nvhe/early_alloc.h |  14 ++
 arch/arm64/kvm/hyp/include/nvhe/gfp.h         |  32 +++
 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |  33 +++
 arch/arm64/kvm/hyp/include/nvhe/memory.h      |  55 +++++
 arch/arm64/kvm/hyp/include/nvhe/mm.h          | 107 +++++++++
 arch/arm64/kvm/hyp/include/nvhe/spinlock.h    |  95 ++++++++
 arch/arm64/kvm/hyp/include/nvhe/util.h        |  25 ++
 arch/arm64/kvm/hyp/nvhe/Makefile              |   9 +-
 arch/arm64/kvm/hyp/nvhe/cache.S               |  13 ++
 arch/arm64/kvm/hyp/nvhe/cpufeature.c          |   8 +
 arch/arm64/kvm/hyp/nvhe/early_alloc.c         |  60 +++++
 arch/arm64/kvm/hyp/nvhe/hyp-init.S            |  39 ++++
 arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  50 ++++
 arch/arm64/kvm/hyp/nvhe/hyp.lds.S             |   1 +
 arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 191 ++++++++++++++++
 arch/arm64/kvm/hyp/nvhe/mm.c                  | 175 ++++++++++++++
 arch/arm64/kvm/hyp/nvhe/page_alloc.c          | 185 +++++++++++++++
 arch/arm64/kvm/hyp/nvhe/psci-relay.c          |   7 +-
 arch/arm64/kvm/hyp/nvhe/setup.c               | 214 ++++++++++++++++++
 arch/arm64/kvm/hyp/nvhe/stub.c                |  22 ++
 arch/arm64/kvm/hyp/nvhe/switch.c              |  12 +-
 arch/arm64/kvm/hyp/nvhe/tlb.c                 |   4 +-
 arch/arm64/kvm/hyp/pgtable.c                  |  98 ++++----
 arch/arm64/kvm/hyp/reserved_mem.c             |  95 ++++++++
 arch/arm64/kvm/mmu.c                          | 114 +++++++++-
 arch/arm64/kvm/reset.c                        |  42 +---
 arch/arm64/lib/clear_page.S                   |   4 +-
 arch/arm64/lib/copy_page.S                    |   4 +-
 arch/arm64/mm/init.c                          |   3 +
 drivers/of/fdt.c                              |   5 +
 46 files changed, 1971 insertions(+), 151 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_cpufeature.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/early_alloc.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/gfp.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/memory.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mm.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/spinlock.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/util.h
 create mode 100644 arch/arm64/kvm/hyp/nvhe/cache.S
 create mode 100644 arch/arm64/kvm/hyp/nvhe/cpufeature.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/early_alloc.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/mem_protect.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/mm.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/page_alloc.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/setup.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/stub.c
 create mode 100644 arch/arm64/kvm/hyp/reserved_mem.c

Comments

Fuad Tabba Nov. 23, 2020, 10:55 a.m. UTC | #1
Hi Quentin,

On Tue, Nov 17, 2020 at 6:16 PM 'Quentin Perret' via kernel-team
<kernel-team@android.com> wrote:
>
> Introduce the infrastructure in KVM to copy CPU feature registers into
> EL2-owned data-structures, so that sanitised values can be read
> directly at EL2 in nVHE.
>
> Given that only a subset of these features are being read by the
> hypervisor, the ones that need to be copied are to be listed under
> <asm/kvm_cpufeature.h> together with the name of the nVHE variable that
> will hold the copy.
>
> While at it, introduce the first user of this infrastructure by
> implementing __flush_dcache_area at EL2, which needs
> arm64_ftr_reg_ctrel0.
>
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/cpufeature.h     |  1 +
>  arch/arm64/include/asm/kvm_cpufeature.h | 17 ++++++++++++++
>  arch/arm64/kernel/cpufeature.c          | 12 ++++++++++
>  arch/arm64/kernel/image-vars.h          |  2 ++
>  arch/arm64/kvm/arm.c                    | 31 +++++++++++++++++++++++++
>  arch/arm64/kvm/hyp/nvhe/Makefile        |  3 ++-
>  arch/arm64/kvm/hyp/nvhe/cache.S         | 13 +++++++++++
>  arch/arm64/kvm/hyp/nvhe/cpufeature.c    |  8 +++++++
>  8 files changed, 86 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/include/asm/kvm_cpufeature.h
>  create mode 100644 arch/arm64/kvm/hyp/nvhe/cache.S
>  create mode 100644 arch/arm64/kvm/hyp/nvhe/cpufeature.c
>
> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> index da250e4741bd..3dfbd76fb647 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -600,6 +600,7 @@ void __init setup_cpu_features(void);
>  void check_local_cpu_capabilities(void);
>
>  u64 read_sanitised_ftr_reg(u32 id);
> +int copy_ftr_reg(u32 id, struct arm64_ftr_reg *dst);
>
>  static inline bool cpu_supports_mixed_endian_el0(void)
>  {
> diff --git a/arch/arm64/include/asm/kvm_cpufeature.h b/arch/arm64/include/asm/kvm_cpufeature.h
> new file mode 100644
> index 000000000000..d34f85cba358
> --- /dev/null
> +++ b/arch/arm64/include/asm/kvm_cpufeature.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (C) 2020 - Google LLC
> + * Author: Quentin Perret <qperret@google.com>
> + */

Missing include guard.


> +
> +#include <asm/cpufeature.h>
> +
> +#ifndef KVM_HYP_CPU_FTR_REG
> +#if defined(__KVM_NVHE_HYPERVISOR__)
> +#define KVM_HYP_CPU_FTR_REG(id, name) extern struct arm64_ftr_reg name;
> +#else
> +#define KVM_HYP_CPU_FTR_REG(id, name) DECLARE_KVM_NVHE_SYM(name);
> +#endif
> +#endif
> +
> +KVM_HYP_CPU_FTR_REG(SYS_CTR_EL0, arm64_ftr_reg_ctrel0)
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index dd5bc0f0cf0d..3bc86d1423f8 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -1116,6 +1116,18 @@ u64 read_sanitised_ftr_reg(u32 id)
>  }
>  EXPORT_SYMBOL_GPL(read_sanitised_ftr_reg);
>
> +int copy_ftr_reg(u32 id, struct arm64_ftr_reg *dst)
> +{
> +       struct arm64_ftr_reg *regp = get_arm64_ftr_reg(id);
> +
> +       if (!regp)
> +               return -EINVAL;
> +
> +       memcpy(dst, regp, sizeof(*regp));
> +
> +       return 0;
> +}
> +
>  #define read_sysreg_case(r)    \
>         case r:         return read_sysreg_s(r)
>
> diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
> index dd8ccc9efb6a..c35d768672eb 100644
> --- a/arch/arm64/kernel/image-vars.h
> +++ b/arch/arm64/kernel/image-vars.h
> @@ -116,6 +116,8 @@ __kvm_nvhe___memcpy                 = __kvm_nvhe___pi_memcpy;
>  __kvm_nvhe___memset                    = __kvm_nvhe___pi_memset;
>  #endif
>
> +_kvm_nvhe___flush_dcache_area          = __kvm_nvhe___pi___flush_dcache_area;
> +
>  #endif /* CONFIG_KVM */
>
>  #endif /* __ARM64_KERNEL_IMAGE_VARS_H */
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 391cf6753a13..c7f8fca97202 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -34,6 +34,7 @@
>  #include <asm/virt.h>
>  #include <asm/kvm_arm.h>
>  #include <asm/kvm_asm.h>
> +#include <asm/kvm_cpufeature.h>
>  #include <asm/kvm_mmu.h>
>  #include <asm/kvm_emulate.h>
>  #include <asm/sections.h>
> @@ -1636,6 +1637,29 @@ static void teardown_hyp_mode(void)
>         }
>  }
>
> +#undef KVM_HYP_CPU_FTR_REG
> +#define KVM_HYP_CPU_FTR_REG(id, name) \
> +       { .sys_id = id, .dst = (struct arm64_ftr_reg *)&kvm_nvhe_sym(name) },
> +static const struct __ftr_reg_copy_entry {
> +       u32                     sys_id;
> +       struct arm64_ftr_reg    *dst;
> +} hyp_ftr_regs[] = {
> +       #include <asm/kvm_cpufeature.h>
> +};
> +
> +static int copy_cpu_ftr_regs(void)
> +{
> +       int i, ret;
> +
> +       for (i = 0; i < ARRAY_SIZE(hyp_ftr_regs); i++) {
> +               ret = copy_ftr_reg(hyp_ftr_regs[i].sys_id, hyp_ftr_regs[i].dst);
> +               if (ret)
> +                       return ret;
> +       }
> +
> +       return 0;
> +}
> +
>  /**
>   * Inits Hyp-mode on all online CPUs
>   */
> @@ -1644,6 +1668,13 @@ static int init_hyp_mode(void)
>         int cpu;
>         int err = 0;
>
> +       /*
> +        * Copy the required CPU feature register in their EL2 counterpart
> +        */
> +       err = copy_cpu_ftr_regs();
> +       if (err)
> +               return err;
> +
>         /*
>          * Allocate Hyp PGD and setup Hyp identity mapping
>          */
> diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> index 9e5eacfec6ec..72cfe53f106f 100644
> --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> @@ -10,7 +10,8 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
>  lib-objs := $(addprefix ../../../lib/, $(lib-objs))
>
>  obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
> -        hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o
> +        hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o \
> +        cache.o cpufeature.o
>  obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
>          ../fpsimd.o ../hyp-entry.o ../exception.o
>  obj-y += $(lib-objs)
> diff --git a/arch/arm64/kvm/hyp/nvhe/cache.S b/arch/arm64/kvm/hyp/nvhe/cache.S
> new file mode 100644
> index 000000000000..36cef6915428
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/nvhe/cache.S
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Code copied from arch/arm64/mm/cache.S.
> + */
> +
> +#include <linux/linkage.h>
> +#include <asm/assembler.h>
> +#include <asm/alternative.h>
> +
> +SYM_FUNC_START_PI(__flush_dcache_area)
> +       dcache_by_line_op civac, sy, x0, x1, x2, x3
> +       ret
> +SYM_FUNC_END_PI(__flush_dcache_area)
> diff --git a/arch/arm64/kvm/hyp/nvhe/cpufeature.c b/arch/arm64/kvm/hyp/nvhe/cpufeature.c
> new file mode 100644
> index 000000000000..a887508f996f
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/nvhe/cpufeature.c
> @@ -0,0 +1,8 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2020 - Google LLC
> + * Author: Quentin Perret <qperret@google.com>
> + */
> +
> +#define KVM_HYP_CPU_FTR_REG(id, name) struct arm64_ftr_reg name;
> +#include <asm/kvm_cpufeature.h>
> --
> 2.29.2.299.gdc1121823c-goog
>

/fuad
David Brazdil Nov. 23, 2020, 12:34 p.m. UTC | #2
On Tue, Nov 17, 2020 at 06:15:42PM +0000, 'Quentin Perret' via kernel-team wrote:
> From: Will Deacon <will@kernel.org>
> 
> Pull clear_page(), copy_page(), memcpy() and memset() into the nVHE hyp
> code and ensure that we always execute the '__pi_' entry point on the
> offchance that it changes in future.
> 
> [ qperret: Commit title nits ]
> 
> Signed-off-by: Will Deacon <will@kernel.org>
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/kernel/image-vars.h   | 11 +++++++++++
>  arch/arm64/kvm/hyp/nvhe/Makefile |  4 ++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
> index 8539f34d7538..dd8ccc9efb6a 100644
> --- a/arch/arm64/kernel/image-vars.h
> +++ b/arch/arm64/kernel/image-vars.h
> @@ -105,6 +105,17 @@ KVM_NVHE_ALIAS(__stop___kvm_ex_table);
>  /* Array containing bases of nVHE per-CPU memory regions. */
>  KVM_NVHE_ALIAS(kvm_arm_hyp_percpu_base);
>  
> +/* Position-independent library routines */
> +__kvm_nvhe_clear_page			= __kvm_nvhe___pi_clear_page;
> +__kvm_nvhe_copy_page			= __kvm_nvhe___pi_copy_page;
> +__kvm_nvhe_memcpy			= __kvm_nvhe___pi_memcpy;
> +__kvm_nvhe_memset			= __kvm_nvhe___pi_memset;
> +
> +#ifdef CONFIG_KASAN
> +__kvm_nvhe___memcpy			= __kvm_nvhe___pi_memcpy;
> +__kvm_nvhe___memset			= __kvm_nvhe___pi_memset;
> +#endif
> +
>  #endif /* CONFIG_KVM */

Nit: Would be good to use the kvm_nvhe_sym() helper for the namespacing.
And feel free to define something like KVM_NVHE_ALIAS for PI in hyp-image.h.

>  
>  #endif /* __ARM64_KERNEL_IMAGE_VARS_H */
> diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> index 1f1e351c5fe2..590fdefb42dd 100644
> --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> @@ -6,10 +6,14 @@
>  asflags-y := -D__KVM_NVHE_HYPERVISOR__
>  ccflags-y := -D__KVM_NVHE_HYPERVISOR__
>  
> +lib-objs := clear_page.o copy_page.o memcpy.o memset.o
> +lib-objs := $(addprefix ../../../lib/, $(lib-objs))
> +
>  obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
>  	 hyp-main.o hyp-smp.o psci-relay.o
>  obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
>  	 ../fpsimd.o ../hyp-entry.o ../exception.o
> +obj-y += $(lib-objs)
>  
>  ##
>  ## Build rules for compiling nVHE hyp code
> -- 
> 2.29.2.299.gdc1121823c-goog
> 
David Brazdil Nov. 23, 2020, 12:51 p.m. UTC | #3
On Tue, Nov 17, 2020 at 06:15:48PM +0000, 'Quentin Perret' via kernel-team wrote:
> kvm_call_hyp() has some logic to issue a function call or a hypercall
> depending on the EL at which the kernel is running. However, all the code
> compiled under __KVM_NVHE_HYPERVISOR__ is guaranteed to run only at EL2,
> and in this case a simple function call is needed.
> 
> Add ifdefery to kvm_host.h to simplify kvm_call_hyp() in .hyp.text.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/kvm_host.h | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index ac11adab6602..7a5d5f4b3351 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -557,6 +557,7 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
>  void kvm_arm_halt_guest(struct kvm *kvm);
>  void kvm_arm_resume_guest(struct kvm *kvm);
>  
> +#ifndef __KVM_NVHE_HYPERVISOR__
>  #define kvm_call_hyp_nvhe(f, ...)						\
>  	({								\
>  		struct arm_smccc_res res;				\
> @@ -596,6 +597,11 @@ void kvm_arm_resume_guest(struct kvm *kvm);
>  									\
>  		ret;							\
>  	})
> +#else /* __KVM_NVHE_HYPERVISOR__ */
> +#define kvm_call_hyp(f, ...) f(__VA_ARGS__)
> +#define kvm_call_hyp_ret(f, ...) f(__VA_ARGS__)
> +#define kvm_call_hyp_nvhe(f, ...) f(__VA_ARGS__)
> +#endif /* __KVM_NVHE_HYPERVISOR__ */

I was hoping we could define this as the following instead. That would require
adding host-side declarations of all functions currently called with _nvhe.

#define kvm_call_hyp_nvhe(f, ...)					\
+	is_nvhe_hyp_code() ? f(__VA_ARGS__) :				\
	({								\
		struct arm_smccc_res res;				\
									\
		arm_smccc_1_1_hvc(KVM_HOST_SMCCC_FUNC(f),		\
				##__VA_ARGS__, &res);			\
		WARN_ON(res.a0 != SMCCC_RET_SUCCESS);			\
									\
		res.a1;							\
	})

Up to you what you think is cleaner, just my 2 cents...
David Brazdil Nov. 23, 2020, 12:57 p.m. UTC | #4
On Tue, Nov 17, 2020 at 06:15:49PM +0000, 'Quentin Perret' via kernel-team wrote:
> In order to allow the usage of code shared by the host and the hyp in
> static inline library functions, allow the usage of kvm_nvhe_sym() at EL2
> by defaulting to the raw symbol name.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/hyp_image.h | 4 ++++
>  arch/arm64/include/asm/kvm_asm.h   | 4 ++--
>  arch/arm64/kvm/arm.c               | 2 +-
>  3 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/hyp_image.h b/arch/arm64/include/asm/hyp_image.h
> index daa1a1da539e..8b807b646b8f 100644
> --- a/arch/arm64/include/asm/hyp_image.h
> +++ b/arch/arm64/include/asm/hyp_image.h
> @@ -7,11 +7,15 @@
>  #ifndef __ARM64_HYP_IMAGE_H__
>  #define __ARM64_HYP_IMAGE_H__
>  
> +#ifndef __KVM_NVHE_HYPERVISOR__
>  /*
>   * KVM nVHE code has its own symbol namespace prefixed with __kvm_nvhe_,
>   * to separate it from the kernel proper.
>   */
>  #define kvm_nvhe_sym(sym)	__kvm_nvhe_##sym
> +#else
> +#define kvm_nvhe_sym(sym)	sym
> +#endif
>  
>  #ifdef LINKER_SCRIPT
>  
> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> index 1a86581e581e..e4934f5e4234 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -173,11 +173,11 @@ struct kvm_s2_mmu;
>  DECLARE_KVM_NVHE_SYM(__kvm_hyp_init);
>  DECLARE_KVM_NVHE_SYM(__kvm_hyp_host_vector);
>  DECLARE_KVM_HYP_SYM(__kvm_hyp_vector);
> -DECLARE_KVM_NVHE_SYM(__kvm_hyp_psci_cpu_entry);
>  #define __kvm_hyp_init			CHOOSE_NVHE_SYM(__kvm_hyp_init)
>  #define __kvm_hyp_host_vector		CHOOSE_NVHE_SYM(__kvm_hyp_host_vector)
>  #define __kvm_hyp_vector		CHOOSE_HYP_SYM(__kvm_hyp_vector)
> -#define __kvm_hyp_psci_cpu_entry	CHOOSE_NVHE_SYM(__kvm_hyp_psci_cpu_entry)
> +
> +void kvm_nvhe_sym(__kvm_hyp_psci_cpu_entry)(void);
>  
>  extern unsigned long kvm_arm_hyp_percpu_base[NR_CPUS];
>  DECLARE_KVM_NVHE_SYM(__per_cpu_start);
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 882eb383bd75..391cf6753a13 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1369,7 +1369,7 @@ static void cpu_prepare_hyp_mode(int cpu)
>  
>  	params->vector_hyp_va = kern_hyp_va((unsigned long)kvm_ksym_ref(__kvm_hyp_host_vector));
>  	params->stack_hyp_va = kern_hyp_va(per_cpu(kvm_arm_hyp_stack_page, cpu) + PAGE_SIZE);
> -	params->entry_hyp_va = kern_hyp_va((unsigned long)kvm_ksym_ref(__kvm_hyp_psci_cpu_entry));
> +	params->entry_hyp_va = kern_hyp_va((unsigned long)kvm_ksym_ref_nvhe(__kvm_hyp_psci_cpu_entry));

Why is this change needed?

>  	params->pgd_pa = kvm_mmu_get_httbr();
>  
>  	/*
> -- 
> 2.29.2.299.gdc1121823c-goog
> 
David Brazdil Nov. 23, 2020, 1:22 p.m. UTC | #5
> +int copy_ftr_reg(u32 id, struct arm64_ftr_reg *dst)
> +{
> +	struct arm64_ftr_reg *regp = get_arm64_ftr_reg(id);
> +
> +	if (!regp)
> +		return -EINVAL;
> +
> +	memcpy(dst, regp, sizeof(*regp));
> +
> +	return 0;
> +}
> +
>  #define read_sysreg_case(r)	\
>  	case r:		return read_sysreg_s(r)
>  
> diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
> index dd8ccc9efb6a..c35d768672eb 100644
> --- a/arch/arm64/kernel/image-vars.h
> +++ b/arch/arm64/kernel/image-vars.h
> @@ -116,6 +116,8 @@ __kvm_nvhe___memcpy			= __kvm_nvhe___pi_memcpy;
>  __kvm_nvhe___memset			= __kvm_nvhe___pi_memset;
>  #endif
>  
> +_kvm_nvhe___flush_dcache_area		= __kvm_nvhe___pi___flush_dcache_area;
> +

Could you help me understand why we need this?
* Why do we need PI routines in the first place? Would my series that fixes
  relocations in hyp code remove the need?
* You added these aliases for the string routines because you were worried
  somebody would change the implementation in arch/arm64/lib, right? But this
  cache flush function is defined in hyp/nvhe. So why do we need to point to
  the PI alias if we control the implementation?
Quentin Perret Nov. 23, 2020, 1:51 p.m. UTC | #6
On Monday 23 Nov 2020 at 10:55:20 (+0000), Fuad Tabba wrote:
> > diff --git a/arch/arm64/include/asm/kvm_cpufeature.h b/arch/arm64/include/asm/kvm_cpufeature.h
> > new file mode 100644
> > index 000000000000..d34f85cba358
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/kvm_cpufeature.h
> > @@ -0,0 +1,17 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright (C) 2020 - Google LLC
> > + * Author: Quentin Perret <qperret@google.com>
> > + */
> 
> Missing include guard.

Right, but on purpose :)

See how arm.c includes this header twice with different definitions of
KVM_HYP_CPU_FTR_REG for instance.
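
That is, each includer provides its own expansion of the macro right
before pulling the header in (copied from the patch, trimmed):

/* EL2 side (nvhe/cpufeature.c): emit one definition per register */
#define KVM_HYP_CPU_FTR_REG(id, name) struct arm64_ftr_reg name;
#include <asm/kvm_cpufeature.h>

/* Host side (arm.c): emit one { sys_id, dst } table entry per register */
#undef KVM_HYP_CPU_FTR_REG
#define KVM_HYP_CPU_FTR_REG(id, name) \
	{ .sys_id = id, .dst = (struct arm64_ftr_reg *)&kvm_nvhe_sym(name) },
static const struct __ftr_reg_copy_entry {
	u32			sys_id;
	struct arm64_ftr_reg	*dst;
} hyp_ftr_regs[] = {
	#include <asm/kvm_cpufeature.h>
};

An include guard would turn the second expansion into a no-op.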

Thanks,
Quentin
Quentin Perret Nov. 23, 2020, 2:02 p.m. UTC | #7
On Monday 23 Nov 2020 at 12:57:23 (+0000), David Brazdil wrote:
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 882eb383bd75..391cf6753a13 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -1369,7 +1369,7 @@ static void cpu_prepare_hyp_mode(int cpu)
> >  
> >  	params->vector_hyp_va = kern_hyp_va((unsigned long)kvm_ksym_ref(__kvm_hyp_host_vector));
> >  	params->stack_hyp_va = kern_hyp_va(per_cpu(kvm_arm_hyp_stack_page, cpu) + PAGE_SIZE);
> > -	params->entry_hyp_va = kern_hyp_va((unsigned long)kvm_ksym_ref(__kvm_hyp_psci_cpu_entry));
> > +	params->entry_hyp_va = kern_hyp_va((unsigned long)kvm_ksym_ref_nvhe(__kvm_hyp_psci_cpu_entry));
> 
> Why is this change needed?

You mean this line specifically or the whole __kvm_hyp_psci_cpu_entry
thing?

For the latter, it is to avoid having the compiler complain about
__kvm_hyp_psci_cpu_entry being re-defined as a different symbol. If
there is a better way to solve this problem I'm happy to change it -- I
must admit I got a little confused with the namespacing along the way.
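
Concretely, with kvm_nvhe_sym() now expanding to the bare symbol name
in hyp code, the generic declaration in kvm_asm.h (assuming
DECLARE_KVM_NVHE_SYM still expands to an extern char array) would
clash with the function defined in psci-relay.c, roughly:

DECLARE_KVM_NVHE_SYM(__kvm_hyp_psci_cpu_entry);
	/* -> extern char __kvm_hyp_psci_cpu_entry[]; */

/* ...conflicts with: */
asmlinkage void __noreturn __kvm_hyp_psci_cpu_entry(void);

Hence the explicit prototype in kvm_asm.h, and kvm_ksym_ref_nvhe() on
the host side to keep referring to the namespaced symbol.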

Thanks,
Quentin
Quentin Perret Nov. 23, 2020, 2:06 p.m. UTC | #8
On Monday 23 Nov 2020 at 12:34:25 (+0000), David Brazdil wrote:
> On Tue, Nov 17, 2020 at 06:15:42PM +0000, 'Quentin Perret' via kernel-team wrote:
> > diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
> > index 8539f34d7538..dd8ccc9efb6a 100644
> > --- a/arch/arm64/kernel/image-vars.h
> > +++ b/arch/arm64/kernel/image-vars.h
> > @@ -105,6 +105,17 @@ KVM_NVHE_ALIAS(__stop___kvm_ex_table);
> >  /* Array containing bases of nVHE per-CPU memory regions. */
> >  KVM_NVHE_ALIAS(kvm_arm_hyp_percpu_base);
> >  
> > +/* Position-independent library routines */
> > +__kvm_nvhe_clear_page			= __kvm_nvhe___pi_clear_page;
> > +__kvm_nvhe_copy_page			= __kvm_nvhe___pi_copy_page;
> > +__kvm_nvhe_memcpy			= __kvm_nvhe___pi_memcpy;
> > +__kvm_nvhe_memset			= __kvm_nvhe___pi_memset;
> > +
> > +#ifdef CONFIG_KASAN
> > +__kvm_nvhe___memcpy			= __kvm_nvhe___pi_memcpy;
> > +__kvm_nvhe___memset			= __kvm_nvhe___pi_memset;
> > +#endif
> > +
> >  #endif /* CONFIG_KVM */
> 
> Nit: Would be good to use the kvm_nvhe_sym() helper for the namespacing.
> And feel free to define something like KVM_NVHE_ALIAS for PI in hyp-image.h.

Ack, that'd be much nicer, I'll fix it up for v2.

Thanks,
Quentin
Quentin Perret Nov. 23, 2020, 2:39 p.m. UTC | #9
On Monday 23 Nov 2020 at 13:22:23 (+0000), David Brazdil wrote:
> Could you help me understand why we need this?
> * Why do we need PI routines in the first place? Would my series that fixes
>   relocations in hyp code remove the need?
> * You added these aliases for the string routines because you were worried
>   somebody would change the implementation in arch/arm64/lib, right? But this
>   cache flush function is defined in hyp/nvhe. So why do we need to point to
>   the PI alias if we control the implementation?

Right, in the specific case of the __flush_dcache_area() function none
of the PI stuff is really needed I think. I did it this way to keep
things as consistent as possible with the host-side implementation, but
that is not required.

I understand this can cause confusion, so yes, I'll simplify this for
v2.

Cheers,
Quentin
David Brazdil Nov. 23, 2020, 2:54 p.m. UTC | #10
On Mon, Nov 23, 2020 at 02:02:50PM +0000, 'Quentin Perret' via kernel-team wrote:
> On Monday 23 Nov 2020 at 12:57:23 (+0000), David Brazdil wrote:
> > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > index 882eb383bd75..391cf6753a13 100644
> > > --- a/arch/arm64/kvm/arm.c
> > > +++ b/arch/arm64/kvm/arm.c
> > > @@ -1369,7 +1369,7 @@ static void cpu_prepare_hyp_mode(int cpu)
> > >  
> > >  	params->vector_hyp_va = kern_hyp_va((unsigned long)kvm_ksym_ref(__kvm_hyp_host_vector));
> > >  	params->stack_hyp_va = kern_hyp_va(per_cpu(kvm_arm_hyp_stack_page, cpu) + PAGE_SIZE);
> > > -	params->entry_hyp_va = kern_hyp_va((unsigned long)kvm_ksym_ref(__kvm_hyp_psci_cpu_entry));
> > > +	params->entry_hyp_va = kern_hyp_va((unsigned long)kvm_ksym_ref_nvhe(__kvm_hyp_psci_cpu_entry));
> > 
> > Why is this change needed?
> 
> You mean this line specifically or the whole __kvm_hyp_psci_cpu_entry
> thing?
> 
> For the latter, it is to avoid having the compiler complain about
> __kvm_hyp_psci_cpu_entry being re-defined as a different symbol. If
> there is a better way to solve this problem I'm happy to change it -- I
> must admit I got a little confused with the namespacing along the way.

Yeah, we do need a more robust approach. It's getting out of control.

> 
> Thanks,
> Quentin
> 
Fuad Tabba Dec. 3, 2020, 12:57 p.m. UTC | #11
Hi Quentin,

On Tue, Nov 17, 2020 at 6:17 PM 'Quentin Perret' via kernel-team
<kernel-team@android.com> wrote:
>
> When memory protection is enabled, the Hyp code needs the ability to
> create and manage its own page-table. To do so, introduce a new set of
> hypercalls to initialize Hyp memory protection.
>
> During the init hcall, the hypervisor runs with the host-provided
> page-table and uses the trivial early page allocator to create its own
> set of page-tables, using a memory pool that was donated by the host.
> Specifically, the hypervisor creates its own mappings for __hyp_text,
> the Hyp memory pool, the __hyp_bss, the portion of hyp_vmemmap
> corresponding to the Hyp pool, among other things. It then jumps back
> into the idmap page, switches to the newly-created pgd (instead of the
> temporary one provided by the host) and then installs the full-fledged
> buddy allocator, which will be the only one in use from then on.
>
> Note that for the sake of simplifying the review, this only introduces
> the code doing this operation, without actually being called by anything
> yet. This will be done in a subsequent patch, which will introduce the
> necessary host kernel changes.
>
> Credits to Will for __kvm_init_switch_pgd.
>
> Co-authored-by: Will Deacon <will@kernel.org>
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/kvm_asm.h         |   6 +-
>  arch/arm64/include/asm/kvm_host.h        |   8 +
>  arch/arm64/include/asm/kvm_hyp.h         |   8 +
>  arch/arm64/kernel/cpufeature.c           |   2 +-
>  arch/arm64/kernel/image-vars.h           |  19 +++
>  arch/arm64/kvm/hyp/Makefile              |   2 +-
>  arch/arm64/kvm/hyp/include/nvhe/memory.h |   6 +
>  arch/arm64/kvm/hyp/include/nvhe/mm.h     |  79 +++++++++
>  arch/arm64/kvm/hyp/nvhe/Makefile         |   4 +-
>  arch/arm64/kvm/hyp/nvhe/hyp-init.S       |  30 ++++
>  arch/arm64/kvm/hyp/nvhe/hyp-main.c       |  44 +++++
>  arch/arm64/kvm/hyp/nvhe/mm.c             | 175 ++++++++++++++++++++
>  arch/arm64/kvm/hyp/nvhe/psci-relay.c     |   2 -
>  arch/arm64/kvm/hyp/nvhe/setup.c          | 196 +++++++++++++++++++++++
>  arch/arm64/kvm/hyp/reserved_mem.c        |  75 +++++++++
>  arch/arm64/kvm/mmu.c                     |   2 +-
>  arch/arm64/mm/init.c                     |   3 +
>  17 files changed, 653 insertions(+), 8 deletions(-)
>  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mm.h
>  create mode 100644 arch/arm64/kvm/hyp/nvhe/mm.c
>  create mode 100644 arch/arm64/kvm/hyp/nvhe/setup.c
>  create mode 100644 arch/arm64/kvm/hyp/reserved_mem.c
>
> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> index e4934f5e4234..9266b17f8ba9 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -57,6 +57,10 @@
>  #define __KVM_HOST_SMCCC_FUNC___kvm_get_mdcr_el2               12
>  #define __KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs              13
>  #define __KVM_HOST_SMCCC_FUNC___vgic_v3_restore_aprs           14
> +#define __KVM_HOST_SMCCC_FUNC___kvm_hyp_protect                        15
> +#define __KVM_HOST_SMCCC_FUNC___hyp_create_mappings            16
> +#define __KVM_HOST_SMCCC_FUNC___hyp_create_private_mapping     17
> +#define __KVM_HOST_SMCCC_FUNC___hyp_cpu_set_vector             18
>
>  #ifndef __ASSEMBLY__
>
> @@ -171,7 +175,7 @@ struct kvm_vcpu;
>  struct kvm_s2_mmu;
>
>  DECLARE_KVM_NVHE_SYM(__kvm_hyp_init);
> -DECLARE_KVM_NVHE_SYM(__kvm_hyp_host_vector);
> +DECLARE_KVM_HYP_SYM(__kvm_hyp_host_vector);
>  DECLARE_KVM_HYP_SYM(__kvm_hyp_vector);
>  #define __kvm_hyp_init                 CHOOSE_NVHE_SYM(__kvm_hyp_init)
>  #define __kvm_hyp_host_vector          CHOOSE_NVHE_SYM(__kvm_hyp_host_vector)
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 7a5d5f4b3351..ee8bb8021637 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -742,4 +742,12 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
>  #define kvm_vcpu_has_pmu(vcpu)                                 \
>         (test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
>
> +#ifdef CONFIG_KVM
> +extern phys_addr_t hyp_mem_base;
> +extern phys_addr_t hyp_mem_size;
> +void __init reserve_kvm_hyp(void);
> +#else
> +static inline void reserve_kvm_hyp(void) { }
> +#endif
> +
>  #endif /* __ARM64_KVM_HOST_H__ */
> diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
> index 95a2bbbcc7e1..dbd2ef86afa9 100644
> --- a/arch/arm64/include/asm/kvm_hyp.h
> +++ b/arch/arm64/include/asm/kvm_hyp.h
> @@ -105,5 +105,13 @@ void __noreturn hyp_panic(void);
>  void __noreturn __hyp_do_panic(bool restore_host, u64 spsr, u64 elr, u64 par);
>  #endif
>
> +#ifdef __KVM_NVHE_HYPERVISOR__
> +void __kvm_init_switch_pgd(phys_addr_t phys, unsigned long size,
> +                          phys_addr_t pgd, void *sp, void *cont_fn);
> +int __kvm_hyp_protect(phys_addr_t phys, unsigned long size,
> +                     unsigned long nr_cpus, unsigned long *per_cpu_base);
> +void __noreturn __host_enter(struct kvm_cpu_context *host_ctxt);
> +#endif
> +
>  #endif /* __ARM64_KVM_HYP_H__ */
>
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index 3bc86d1423f8..010458f6d799 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -1722,7 +1722,7 @@ static void cpu_enable_mte(struct arm64_cpu_capabilities const *cap)
>  #endif /* CONFIG_ARM64_MTE */
>
>  #ifdef CONFIG_KVM
> -static bool enable_protected_kvm;
> +bool enable_protected_kvm;
>
>  static bool has_protected_kvm(const struct arm64_cpu_capabilities *entry, int __unused)
>  {
> diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
> index c35d768672eb..f2d43e6cd86d 100644
> --- a/arch/arm64/kernel/image-vars.h
> +++ b/arch/arm64/kernel/image-vars.h
> @@ -118,6 +118,25 @@ __kvm_nvhe___memset                        = __kvm_nvhe___pi_memset;
>
>  _kvm_nvhe___flush_dcache_area          = __kvm_nvhe___pi___flush_dcache_area;
>
> +/* Hypevisor VA size */
> +KVM_NVHE_ALIAS(hyp_va_bits);
> +
> +/* Kernel memory sections */
> +KVM_NVHE_ALIAS(__start_rodata);
> +KVM_NVHE_ALIAS(__end_rodata);
> +KVM_NVHE_ALIAS(__bss_start);
> +KVM_NVHE_ALIAS(__bss_stop);
> +
> +/* Hyp memory sections */
> +KVM_NVHE_ALIAS(__hyp_idmap_text_start);
> +KVM_NVHE_ALIAS(__hyp_idmap_text_end);
> +KVM_NVHE_ALIAS(__hyp_text_start);
> +KVM_NVHE_ALIAS(__hyp_text_end);
> +KVM_NVHE_ALIAS(__hyp_data_ro_after_init_start);
> +KVM_NVHE_ALIAS(__hyp_data_ro_after_init_end);
> +KVM_NVHE_ALIAS(__hyp_bss_start);
> +KVM_NVHE_ALIAS(__hyp_bss_end);
> +
>  #endif /* CONFIG_KVM */
>
>  #endif /* __ARM64_KERNEL_IMAGE_VARS_H */
> diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
> index 687598e41b21..b726332eec49 100644
> --- a/arch/arm64/kvm/hyp/Makefile
> +++ b/arch/arm64/kvm/hyp/Makefile
> @@ -10,4 +10,4 @@ subdir-ccflags-y := -I$(incdir)                               \
>                     -DDISABLE_BRANCH_PROFILING          \
>                     $(DISABLE_STACKLEAK_PLUGIN)
>
> -obj-$(CONFIG_KVM) += vhe/ nvhe/ pgtable.o
> +obj-$(CONFIG_KVM) += vhe/ nvhe/ pgtable.o reserved_mem.o
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/memory.h b/arch/arm64/kvm/hyp/include/nvhe/memory.h
> index ed47674bc988..c8af6fe87bfb 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/memory.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/memory.h
> @@ -6,6 +6,12 @@
>
>  #include <linux/types.h>
>
> +#define HYP_MEMBLOCK_REGIONS 128
> +struct hyp_memblock_region {
> +       phys_addr_t start;
> +       phys_addr_t end;
> +};
> +
>  struct hyp_pool;
>  struct hyp_page {
>         unsigned int refcount;
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/mm.h b/arch/arm64/kvm/hyp/include/nvhe/mm.h
> new file mode 100644
> index 000000000000..5a3ad6f4e5bc
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/include/nvhe/mm.h
> @@ -0,0 +1,79 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef __KVM_HYP_MM_H
> +#define __KVM_HYP_MM_H
> +
> +#include <asm/kvm_pgtable.h>
> +#include <asm/spectre.h>
> +#include <linux/types.h>
> +
> +#include <nvhe/memory.h>
> +#include <nvhe/spinlock.h>
> +
> +extern struct hyp_memblock_region kvm_nvhe_sym(hyp_memory)[];
> +extern int kvm_nvhe_sym(hyp_memblock_nr);
> +extern struct kvm_pgtable hyp_pgtable;

nit: I found the name of this struct to be confusing (hyp_pgtable),
since there's also
arch/arm64/kvm/mmu.c:25:static struct kvm_pgtable *hyp_pgtable;
which has the same name, but is a pointer to the hyp page table before
being swapped out in favor of this one.

> +extern hyp_spinlock_t __hyp_pgd_lock;
> +extern struct hyp_pool hpool;
> +extern u64 __io_map_base;
> +extern u32 hyp_va_bits;
> +
> +int hyp_create_idmap(void);
> +int hyp_map_vectors(void);
> +int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back);
> +int hyp_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
> +int hyp_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
> +int __hyp_create_mappings(unsigned long start, unsigned long size,
> +                         unsigned long phys, unsigned long prot);
> +unsigned long __hyp_create_private_mapping(phys_addr_t phys, size_t size,
> +                                          unsigned long prot);
> +

nit: I also thought that the hyp_create_mappings function names are a
bit confusing, since there's the create_hyp_mappings functions which
use the aforementioned *hyp_pgtable.

> +static inline void hyp_vmemmap_range(phys_addr_t phys, unsigned long size,
> +                                    unsigned long *start, unsigned long *end)
> +{
> +       unsigned long nr_pages = size >> PAGE_SHIFT;
> +       struct hyp_page *p = hyp_phys_to_page(phys);
> +
> +       *start = (unsigned long)p;
> +       *end = *start + nr_pages * sizeof(struct hyp_page);
> +       *start = ALIGN_DOWN(*start, PAGE_SIZE);
> +       *end = ALIGN(*end, PAGE_SIZE);
> +}
> +
> +static inline unsigned long __hyp_pgtable_max_pages(unsigned long nr_pages)
> +{
> +       unsigned long total = 0, i;
> +
> +       /* Provision the worst case scenario with 4 levels of page-table */
> +       for (i = 0; i < 4; i++) {
> +               nr_pages = DIV_ROUND_UP(nr_pages, PTRS_PER_PTE);
> +               total += nr_pages;
> +       }
> +
> +       return total;
> +}
> +
> +static inline unsigned long hyp_s1_pgtable_size(void)
> +{
> +       struct hyp_memblock_region *reg;
> +       unsigned long nr_pages, res = 0;
> +       int i;
> +
> +       if (kvm_nvhe_sym(hyp_memblock_nr) <= 0)
> +               return 0;
> +
> +       for (i = 0; i < kvm_nvhe_sym(hyp_memblock_nr); i++) {
> +               reg = &kvm_nvhe_sym(hyp_memory)[i];
> +               nr_pages = (reg->end - reg->start) >> PAGE_SHIFT;
> +               nr_pages = __hyp_pgtable_max_pages(nr_pages);
> +               res += nr_pages << PAGE_SHIFT;
> +       }
> +
> +       /* Allow 1 GiB for private mappings */
> +       nr_pages = (1 << 30) >> PAGE_SHIFT;
> +       nr_pages = __hyp_pgtable_max_pages(nr_pages);
> +       res += nr_pages << PAGE_SHIFT;
> +
> +       return res;
> +}
> +
> +#endif /* __KVM_HYP_MM_H */
> diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> index 72cfe53f106f..d7381a503182 100644
> --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> @@ -11,9 +11,9 @@ lib-objs := $(addprefix ../../../lib/, $(lib-objs))
>
>  obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
>          hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o \
> -        cache.o cpufeature.o
> +        cache.o cpufeature.o setup.o mm.o
>  obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
> -        ../fpsimd.o ../hyp-entry.o ../exception.o
> +        ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
>  obj-y += $(lib-objs)
>
>  ##
> diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> index 8f3602f320ac..e2d62297edfe 100644
> --- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> +++ b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> @@ -247,4 +247,34 @@ alternative_else_nop_endif
>
>  SYM_CODE_END(__kvm_handle_stub_hvc)
>
> +SYM_FUNC_START(__kvm_init_switch_pgd)
> +       /* Turn the MMU off */
> +       pre_disable_mmu_workaround
> +       mrs     x2, sctlr_el2
> +       bic     x3, x2, #SCTLR_ELx_M
> +       msr     sctlr_el2, x3
> +       isb
> +
> +       tlbi    alle2
> +
> +       /* Install the new pgtables */
> +       ldr     x3, [x0, #NVHE_INIT_PGD_PA]
> +       phys_to_ttbr x4, x3
> +alternative_if ARM64_HAS_CNP
> +       orr     x4, x4, #TTBR_CNP_BIT
> +alternative_else_nop_endif
> +       msr     ttbr0_el2, x4
> +
> +       /* Set the new stack pointer */
> +       ldr     x0, [x0, #NVHE_INIT_STACK_HYP_VA]
> +       mov     sp, x0
> +
> +       /* And turn the MMU back on! */
> +       dsb     nsh
> +       isb
> +       msr     sctlr_el2, x2
> +       isb
> +       ret     x1
> +SYM_FUNC_END(__kvm_init_switch_pgd)
> +

Should the instruction cache be flushed here (ic iallu), to discard
speculatively fetched instructions?

>         .popsection
> diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> index 933329699425..a0bfe0d26da6 100644
> --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> @@ -6,12 +6,15 @@
>
>  #include <hyp/switch.h>
>
> +#include <asm/pgtable-types.h>
>  #include <asm/kvm_asm.h>
>  #include <asm/kvm_emulate.h>
>  #include <asm/kvm_host.h>
>  #include <asm/kvm_hyp.h>
>  #include <asm/kvm_mmu.h>
>
> +#include <nvhe/mm.h>
> +
>  DEFINE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params);
>
>  #define cpu_reg(ctxt, r)       (ctxt)->regs.regs[r]
> @@ -106,6 +109,43 @@ static void handle___vgic_v3_restore_aprs(struct kvm_cpu_context *host_ctxt)
>         __vgic_v3_restore_aprs(kern_hyp_va(cpu_if));
>  }
>
> +static void handle___kvm_hyp_protect(struct kvm_cpu_context *host_ctxt)
> +{
> +       DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
> +       DECLARE_REG(unsigned long, size, host_ctxt, 2);
> +       DECLARE_REG(unsigned long, nr_cpus, host_ctxt, 3);
> +       DECLARE_REG(unsigned long *, per_cpu_base, host_ctxt, 4);
> +
> +       cpu_reg(host_ctxt, 1) = __kvm_hyp_protect(phys, size, nr_cpus,
> +                                                 per_cpu_base);
> +}
> +
> +static void handle___hyp_cpu_set_vector(struct kvm_cpu_context *host_ctxt)
> +{
> +       DECLARE_REG(enum arm64_hyp_spectre_vector, slot, host_ctxt, 1);
> +
> +       cpu_reg(host_ctxt, 1) = hyp_cpu_set_vector(slot);
> +}
> +
> +static void handle___hyp_create_mappings(struct kvm_cpu_context *host_ctxt)
> +{
> +       DECLARE_REG(unsigned long, start, host_ctxt, 1);
> +       DECLARE_REG(unsigned long, size, host_ctxt, 2);
> +       DECLARE_REG(unsigned long, phys, host_ctxt, 3);
> +       DECLARE_REG(unsigned long, prot, host_ctxt, 4);
> +
> +       cpu_reg(host_ctxt, 1) = __hyp_create_mappings(start, size, phys, prot);
> +}
> +
> +static void handle___hyp_create_private_mapping(struct kvm_cpu_context *host_ctxt)
> +{
> +       DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
> +       DECLARE_REG(size_t, size, host_ctxt, 2);
> +       DECLARE_REG(unsigned long, prot, host_ctxt, 3);
> +
> +       cpu_reg(host_ctxt, 1) = __hyp_create_private_mapping(phys, size, prot);
> +}
> +
>  typedef void (*hcall_t)(struct kvm_cpu_context *);
>
>  #define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = kimg_fn_ptr(handle_##x)
> @@ -125,6 +165,10 @@ static const hcall_t *host_hcall[] = {
>         HANDLE_FUNC(__kvm_get_mdcr_el2),
>         HANDLE_FUNC(__vgic_v3_save_aprs),
>         HANDLE_FUNC(__vgic_v3_restore_aprs),
> +       HANDLE_FUNC(__kvm_hyp_protect),
> +       HANDLE_FUNC(__hyp_cpu_set_vector),
> +       HANDLE_FUNC(__hyp_create_mappings),
> +       HANDLE_FUNC(__hyp_create_private_mapping),
>  };
>
>  static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
> diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
> new file mode 100644
> index 000000000000..cad5dae197c6
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/nvhe/mm.c
> @@ -0,0 +1,175 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2020 Google LLC
> + * Author: Quentin Perret <qperret@google.com>
> + */
> +
> +#include <linux/kvm_host.h>
> +#include <asm/kvm_hyp.h>
> +#include <asm/kvm_mmu.h>
> +#include <asm/kvm_pgtable.h>
> +#include <asm/spectre.h>
> +
> +#include <nvhe/early_alloc.h>
> +#include <nvhe/gfp.h>
> +#include <nvhe/memory.h>
> +#include <nvhe/mm.h>
> +#include <nvhe/spinlock.h>
> +
> +struct kvm_pgtable hyp_pgtable;
> +
> +hyp_spinlock_t __hyp_pgd_lock;
> +u64 __io_map_base;
> +
> +struct hyp_memblock_region hyp_memory[HYP_MEMBLOCK_REGIONS];
> +int hyp_memblock_nr;
> +
> +int __hyp_create_mappings(unsigned long start, unsigned long size,
> +                         unsigned long phys, unsigned long prot)
> +{
> +       int err;
> +
> +       hyp_spin_lock(&__hyp_pgd_lock);
> +       err = kvm_pgtable_hyp_map(&hyp_pgtable, start, size, phys, prot);
> +       hyp_spin_unlock(&__hyp_pgd_lock);
> +
> +       return err;
> +}
> +
> +unsigned long __hyp_create_private_mapping(phys_addr_t phys, size_t size,
> +                                          unsigned long prot)
> +{
> +       unsigned long addr;
> +       int ret;
> +
> +       hyp_spin_lock(&__hyp_pgd_lock);
> +
> +       size = PAGE_ALIGN(size + offset_in_page(phys));
> +       addr = __io_map_base;
> +       __io_map_base += size;
> +
> +       /* Are we overflowing on the vmemmap ? */
> +       if (__io_map_base > __hyp_vmemmap) {
> +               __io_map_base -= size;
> +               addr = 0;
> +               goto out;
> +       }
> +
> +       ret = kvm_pgtable_hyp_map(&hyp_pgtable, addr, size, phys, prot);
> +       if (ret) {
> +               addr = 0;
> +               goto out;
> +       }
> +
> +       addr = addr + offset_in_page(phys);
> +out:
> +       hyp_spin_unlock(&__hyp_pgd_lock);
> +
> +       return addr;
> +}
> +
> +int hyp_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
> +{
> +       unsigned long start = (unsigned long)from;
> +       unsigned long end = (unsigned long)to;
> +       unsigned long virt_addr;
> +       phys_addr_t phys;
> +
> +       start = start & PAGE_MASK;
> +       end = PAGE_ALIGN(end);
> +
> +       for (virt_addr = start; virt_addr < end; virt_addr += PAGE_SIZE) {
> +               int err;
> +
> +               phys = hyp_virt_to_phys((void *)virt_addr);
> +               err = __hyp_create_mappings(virt_addr, PAGE_SIZE, phys, prot);
> +               if (err)
> +                       return err;
> +       }
> +
> +       return 0;
> +}
> +
> +int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back)
> +{
> +       unsigned long start, end;
> +
> +       hyp_vmemmap_range(phys, size, &start, &end);
> +
> +       return __hyp_create_mappings(start, end - start, back, PAGE_HYP);
> +}
> +
> +static void *__hyp_bp_vect_base;
> +int hyp_cpu_set_vector(enum arm64_hyp_spectre_vector slot)
> +{
> +       void *vector;
> +
> +       switch (slot) {
> +       case HYP_VECTOR_DIRECT: {
> +               vector = hyp_symbol_addr(__kvm_hyp_vector);
> +               break;
> +       }
> +       case HYP_VECTOR_SPECTRE_DIRECT: {
> +               vector = hyp_symbol_addr(__bp_harden_hyp_vecs);
> +               break;
> +       }
> +       case HYP_VECTOR_INDIRECT:
> +       case HYP_VECTOR_SPECTRE_INDIRECT: {
> +               vector = (void *)__hyp_bp_vect_base;
> +               break;
> +       }
> +       default:
> +               return -EINVAL;
> +       }
> +
> +       vector = __kvm_vector_slot2addr(vector, slot);
> +       *this_cpu_ptr(&kvm_hyp_vector) = (unsigned long)vector;
> +
> +       return 0;
> +}
> +
> +int hyp_map_vectors(void)
> +{
> +       unsigned long bp_base;
> +
> +       if (!cpus_have_const_cap(ARM64_SPECTRE_V3A))
> +               return 0;
> +
> +       bp_base = (unsigned long)hyp_symbol_addr(__bp_harden_hyp_vecs);
> +       bp_base = __hyp_pa(bp_base);
> +       bp_base = __hyp_create_private_mapping(bp_base, __BP_HARDEN_HYP_VECS_SZ,
> +                                              PAGE_HYP_EXEC);
> +       if (!bp_base)
> +               return -1;
> +
> +       __hyp_bp_vect_base = (void *)bp_base;
> +
> +       return 0;
> +}
> +
> +int hyp_create_idmap(void)
> +{
> +       unsigned long start, end;
> +
> +       start = (unsigned long)hyp_symbol_addr(__hyp_idmap_text_start);
> +       start = hyp_virt_to_phys((void *)start);
> +       start = ALIGN_DOWN(start, PAGE_SIZE);
> +
> +       end = (unsigned long)hyp_symbol_addr(__hyp_idmap_text_end);
> +       end = hyp_virt_to_phys((void *)end);
> +       end = ALIGN(end, PAGE_SIZE);
> +
> +       /*
> +        * One half of the VA space is reserved to linearly map portions of
> +        * memory -- see va_layout.c for more details. The other half of the VA
> +        * space contains the trampoline page, and needs some care. Split that
> +        * second half in two and find the quarter of VA space not conflicting
> +        * with the idmap to place the IOs and the vmemmap. IOs use the lower
> +        * half of the quarter and the vmemmap the upper half.
> +        */
> +       __io_map_base = start & BIT(hyp_va_bits - 2);
> +       __io_map_base ^= BIT(hyp_va_bits - 2);
> +       __hyp_vmemmap = __io_map_base | BIT(hyp_va_bits - 3);
> +
> +       return __hyp_create_mappings(start, end - start, start, PAGE_HYP_EXEC);
> +}
> diff --git a/arch/arm64/kvm/hyp/nvhe/psci-relay.c b/arch/arm64/kvm/hyp/nvhe/psci-relay.c
> index dbe57ae84a0c..cfc6dac0f0ac 100644
> --- a/arch/arm64/kvm/hyp/nvhe/psci-relay.c
> +++ b/arch/arm64/kvm/hyp/nvhe/psci-relay.c
> @@ -193,8 +193,6 @@ static int psci_cpu_on(u64 func_id, struct kvm_cpu_context *host_ctxt)
>         return ret;
>  }
>
> -void __noreturn __host_enter(struct kvm_cpu_context *host_ctxt);
> -
>  asmlinkage void __noreturn __kvm_hyp_psci_cpu_entry(void)
>  {
>         struct kvm_host_psci_state *cpu_state = this_cpu_ptr(&kvm_host_psci_state);
> diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
> new file mode 100644
> index 000000000000..9679c97b875b
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/nvhe/setup.c
> @@ -0,0 +1,196 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2020 Google LLC
> + * Author: Quentin Perret <qperret@google.com>
> + */
> +
> +#include <linux/kvm_host.h>
> +#include <asm/kvm_hyp.h>
> +#include <asm/kvm_mmu.h>
> +#include <asm/kvm_pgtable.h>
> +
> +#include <nvhe/early_alloc.h>
> +#include <nvhe/gfp.h>
> +#include <nvhe/memory.h>
> +#include <nvhe/mm.h>
> +
> +struct hyp_pool hpool;
> +struct kvm_pgtable_mm_ops hyp_pgtable_mm_ops;
> +unsigned long hyp_nr_cpus;
> +
> +#define hyp_percpu_size ((unsigned long)__per_cpu_end - \
> +                        (unsigned long)__per_cpu_start)
> +
> +static void *stacks_base;
> +static void *vmemmap_base;
> +static void *hyp_pgt_base;
> +
> +static int divide_memory_pool(void *virt, unsigned long size)
> +{
> +       unsigned long vstart, vend, nr_pages;
> +
> +       hyp_early_alloc_init(virt, size);
> +
> +       stacks_base = hyp_early_alloc_contig(hyp_nr_cpus);
> +       if (!stacks_base)
> +               return -ENOMEM;
> +
> +       hyp_vmemmap_range(__hyp_pa(virt), size, &vstart, &vend);
> +       nr_pages = (vend - vstart) >> PAGE_SHIFT;
> +       vmemmap_base = hyp_early_alloc_contig(nr_pages);
> +       if (!vmemmap_base)
> +               return -ENOMEM;
> +
> +       nr_pages = hyp_s1_pgtable_size() >> PAGE_SHIFT;
> +       hyp_pgt_base = hyp_early_alloc_contig(nr_pages);
> +       if (!hyp_pgt_base)
> +               return -ENOMEM;
> +
> +       return 0;
> +}
> +
> +static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
> +                                unsigned long *per_cpu_base)
> +{
> +       void *start, *end, *virt = hyp_phys_to_virt(phys);
> +       int ret, i;
> +
> +       /* Recreate the hyp page-table using the early page allocator */
> +       hyp_early_alloc_init(hyp_pgt_base, hyp_s1_pgtable_size());
> +       ret = kvm_pgtable_hyp_init(&hyp_pgtable, hyp_va_bits,
> +                                  &hyp_early_alloc_mm_ops);
> +       if (ret)
> +               return ret;
> +
> +       ret = hyp_create_idmap();
> +       if (ret)
> +               return ret;
> +
> +       ret = hyp_map_vectors();
> +       if (ret)
> +               return ret;
> +
> +       ret = hyp_back_vmemmap(phys, size, hyp_virt_to_phys(vmemmap_base));
> +       if (ret)
> +               return ret;
> +
> +       ret = hyp_create_mappings(hyp_symbol_addr(__hyp_text_start),
> +                                 hyp_symbol_addr(__hyp_text_end),
> +                                 PAGE_HYP_EXEC);
> +       if (ret)
> +               return ret;
> +
> +       ret = hyp_create_mappings(hyp_symbol_addr(__start_rodata),
> +                                 hyp_symbol_addr(__end_rodata), PAGE_HYP_RO);
> +       if (ret)
> +               return ret;
> +
> +       ret = hyp_create_mappings(hyp_symbol_addr(__hyp_data_ro_after_init_start),
> +                                 hyp_symbol_addr(__hyp_data_ro_after_init_end),
> +                                 PAGE_HYP_RO);
> +       if (ret)
> +               return ret;
> +
> +       ret = hyp_create_mappings(hyp_symbol_addr(__bss_start),
> +                                 hyp_symbol_addr(__hyp_bss_end), PAGE_HYP);
> +       if (ret)
> +               return ret;
> +
> +       ret = hyp_create_mappings(hyp_symbol_addr(__hyp_bss_end),
> +                                 hyp_symbol_addr(__bss_stop), PAGE_HYP_RO);
> +       if (ret)
> +               return ret;
> +
> +       ret = hyp_create_mappings(virt, virt + size - 1, PAGE_HYP);
> +       if (ret)
> +               return ret;
> +
> +       for (i = 0; i < hyp_nr_cpus; i++) {
> +               start = (void *)kern_hyp_va(per_cpu_base[i]);
> +               end = start + PAGE_ALIGN(hyp_percpu_size);
> +               ret = hyp_create_mappings(start, end, PAGE_HYP);
> +               if (ret)
> +                       return ret;
> +       }
> +
> +       return 0;
> +}
> +
> +static void update_nvhe_init_params(void)
> +{
> +       struct kvm_nvhe_init_params *params;
> +       unsigned long i, stack;
> +
> +       for (i = 0; i < hyp_nr_cpus; i++) {
> +               stack = (unsigned long)stacks_base + (i << PAGE_SHIFT);
> +               params = per_cpu_ptr(&kvm_init_params, i);
> +               params->stack_hyp_va = stack + PAGE_SIZE;
> +               params->pgd_pa = __hyp_pa(hyp_pgtable.pgd);
> +               __flush_dcache_area(params, sizeof(*params));
> +       }
> +}
> +
> +static void *hyp_zalloc_hyp_page(void *arg)
> +{
> +       return hyp_alloc_pages(&hpool, HYP_GFP_ZERO, 0);
> +}
> +
> +void __noreturn __kvm_hyp_protect_finalise(void)
> +{
> +       struct kvm_host_data *host_data = this_cpu_ptr(&kvm_host_data);
> +       struct kvm_cpu_context *host_ctxt = &host_data->host_ctxt;
> +       unsigned long nr_pages, used_pages;
> +       int ret;
> +
> +       /* Now that the vmemmap is backed, install the full-fledged allocator */
> +       nr_pages = hyp_s1_pgtable_size() >> PAGE_SHIFT;
> +       used_pages = hyp_early_alloc_nr_pages();
> +       ret = hyp_pool_init(&hpool, __hyp_pa(hyp_pgt_base), nr_pages, used_pages);
> +       if (ret)
> +               goto out;
> +
> +       hyp_pgtable_mm_ops.zalloc_page = hyp_zalloc_hyp_page;
> +       hyp_pgtable_mm_ops.phys_to_virt = hyp_phys_to_virt;
> +       hyp_pgtable_mm_ops.virt_to_phys = hyp_virt_to_phys;
> +       hyp_pgtable_mm_ops.get_page = hyp_get_page;
> +       hyp_pgtable_mm_ops.put_page = hyp_put_page;
> +       hyp_pgtable.mm_ops = &hyp_pgtable_mm_ops;
> +
> +out:
> +       host_ctxt->regs.regs[0] = SMCCC_RET_SUCCESS;
> +       host_ctxt->regs.regs[1] = ret;
> +
> +       __host_enter(host_ctxt);
> +}
> +
> +int __kvm_hyp_protect(phys_addr_t phys, unsigned long size,
> +                     unsigned long nr_cpus, unsigned long *per_cpu_base)
> +{
> +       struct kvm_nvhe_init_params *params;
> +       void *virt = hyp_phys_to_virt(phys);
> +       void (*fn)(phys_addr_t params_pa, void *finalize_fn_va);
> +       int ret;
> +
> +       if (phys % PAGE_SIZE || size % PAGE_SIZE || (u64)virt % PAGE_SIZE)
> +               return -EINVAL;
> +
> +       hyp_spin_lock_init(&__hyp_pgd_lock);
> +       hyp_nr_cpus = nr_cpus;
> +
> +       ret = divide_memory_pool(virt, size);
> +       if (ret)
> +               return ret;
> +
> +       ret = recreate_hyp_mappings(phys, size, per_cpu_base);
> +       if (ret)
> +               return ret;
> +
> +       update_nvhe_init_params();
> +
> +       /* Jump in the idmap page to switch to the new page-tables */
> +       params = this_cpu_ptr(&kvm_init_params);
> +       fn = (typeof(fn))__hyp_pa(hyp_symbol_addr(__kvm_init_switch_pgd));
> +       fn(__hyp_pa(params), hyp_symbol_addr(__kvm_hyp_protect_finalise));
> +
> +       unreachable();
> +}
> diff --git a/arch/arm64/kvm/hyp/reserved_mem.c b/arch/arm64/kvm/hyp/reserved_mem.c
> new file mode 100644
> index 000000000000..02b0b18006f5
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/reserved_mem.c
> @@ -0,0 +1,75 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2020 - Google LLC
> + * Author: Quentin Perret <qperret@google.com>
> + */
> +
> +#include <linux/kvm_host.h>
> +#include <linux/memblock.h>
> +
> +#include <asm/kvm_host.h>
> +
> +#include <nvhe/memory.h>
> +#include <nvhe/mm.h>
> +
> +phys_addr_t hyp_mem_base;
> +phys_addr_t hyp_mem_size;
> +
> +void __init early_init_dt_add_memory_hyp(u64 base, u64 size)
> +{
> +       struct hyp_memblock_region *reg;
> +
> +       if (kvm_nvhe_sym(hyp_memblock_nr) >= HYP_MEMBLOCK_REGIONS)
> +               kvm_nvhe_sym(hyp_memblock_nr) = -1;
> +
> +       if (kvm_nvhe_sym(hyp_memblock_nr) < 0)
> +               return;
> +
> +       reg = kvm_nvhe_sym(hyp_memory);
> +       reg[kvm_nvhe_sym(hyp_memblock_nr)].start = base;
> +       reg[kvm_nvhe_sym(hyp_memblock_nr)].end = base + size;
> +       kvm_nvhe_sym(hyp_memblock_nr)++;
> +}
> +
> +extern bool enable_protected_kvm;
> +void __init reserve_kvm_hyp(void)
> +{
> +       u64 nr_pages, prev;
> +
> +       if (!enable_protected_kvm)
> +               return;
> +
> +       if (!is_hyp_mode_available() || is_kernel_in_hyp_mode())
> +               return;
> +
> +       if (kvm_nvhe_sym(hyp_memblock_nr) <= 0)
> +               return;
> +
> +       hyp_mem_size += num_possible_cpus() << PAGE_SHIFT;
> +       hyp_mem_size += hyp_s1_pgtable_size();
> +
> +       /*
> +        * The hyp_vmemmap needs to be backed by pages, but these pages
> +        * themselves need to be present in the vmemmap, so compute the number
> +        * of pages needed by looking for a fixed point.
> +        */
> +       nr_pages = 0;
> +       do {
> +               prev = nr_pages;
> +               nr_pages = (hyp_mem_size >> PAGE_SHIFT) + prev;
> +               nr_pages = DIV_ROUND_UP(nr_pages * sizeof(struct hyp_page), PAGE_SIZE);
> +               nr_pages += __hyp_pgtable_max_pages(nr_pages);
> +       } while (nr_pages != prev);
> +       hyp_mem_size += nr_pages << PAGE_SHIFT;
> +
> +       hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(),
> +                                             hyp_mem_size, SZ_2M);
> +       if (!hyp_mem_base) {
> +               kvm_err("Failed to reserve hyp memory\n");
> +               return;
> +       }
> +       memblock_reserve(hyp_mem_base, hyp_mem_size);
> +
> +       kvm_info("Reserved %lld MiB at 0x%llx\n", hyp_mem_size >> 20,
> +                hyp_mem_base);
> +}
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 278e163beda4..3cf9397dabdb 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1264,10 +1264,10 @@ static struct kvm_pgtable_mm_ops kvm_hyp_mm_ops = {
>         .virt_to_phys           = kvm_host_pa,
>  };
>
> +u32 hyp_va_bits;
>  int kvm_mmu_init(void)
>  {
>         int err;
> -       u32 hyp_va_bits;
>
>         hyp_idmap_start = __pa_symbol(__hyp_idmap_text_start);
>         hyp_idmap_start = ALIGN_DOWN(hyp_idmap_start, PAGE_SIZE);
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 095540667f0f..f81da019b677 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -34,6 +34,7 @@
>  #include <asm/fixmap.h>
>  #include <asm/kasan.h>
>  #include <asm/kernel-pgtable.h>
> +#include <asm/kvm_host.h>
>  #include <asm/memory.h>
>  #include <asm/numa.h>
>  #include <asm/sections.h>
> @@ -390,6 +391,8 @@ void __init arm64_memblock_init(void)
>
>         reserve_elfcorehdr();
>
> +       reserve_kvm_hyp();
> +
>         high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
>
>         dma_contiguous_reserve(arm64_dma32_phys_limit);
> --
> 2.29.2.299.gdc1121823c-goog

Cheers,
/fuad
Quentin Perret Dec. 4, 2020, 6:01 p.m. UTC | #12
On Thursday 03 Dec 2020 at 12:57:33 (+0000), Fuad Tabba wrote:
<snip>
> > +int hyp_create_idmap(void);
> > +int hyp_map_vectors(void);
> > +int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back);
> > +int hyp_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
> > +int hyp_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
> > +int __hyp_create_mappings(unsigned long start, unsigned long size,
> > +                         unsigned long phys, unsigned long prot);
> > +unsigned long __hyp_create_private_mapping(phys_addr_t phys, size_t size,
> > +                                          unsigned long prot);
> > +
> 
> nit: I also thought that the hyp_create_mappings function names are a
> bit confusing, since there's the create_hyp_mappings functions which
> use the aforementioned *hyp_pgtable.

Sure, happy to re-name those (and hyp_pgtable above). Any suggestions?


<snip>
> > +SYM_FUNC_START(__kvm_init_switch_pgd)
> > +       /* Turn the MMU off */
> > +       pre_disable_mmu_workaround
> > +       mrs     x2, sctlr_el2
> > +       bic     x3, x2, #SCTLR_ELx_M
> > +       msr     sctlr_el2, x3
> > +       isb
> > +
> > +       tlbi    alle2
> > +
> > +       /* Install the new pgtables */
> > +       ldr     x3, [x0, #NVHE_INIT_PGD_PA]
> > +       phys_to_ttbr x4, x3
> > +alternative_if ARM64_HAS_CNP
> > +       orr     x4, x4, #TTBR_CNP_BIT
> > +alternative_else_nop_endif
> > +       msr     ttbr0_el2, x4
> > +
> > +       /* Set the new stack pointer */
> > +       ldr     x0, [x0, #NVHE_INIT_STACK_HYP_VA]
> > +       mov     sp, x0
> > +
> > +       /* And turn the MMU back on! */
> > +       dsb     nsh
> > +       isb
> > +       msr     sctlr_el2, x2
> > +       isb
> > +       ret     x1
> > +SYM_FUNC_END(__kvm_init_switch_pgd)
> > +
> 
> Should the instruction cache be flushed here (ic iallu), to discard
> speculatively fetched instructions?

Hmm, Will? Thoughts?

Thanks,
Quentin
Will Deacon Dec. 7, 2020, 10:20 a.m. UTC | #13
On Fri, Dec 04, 2020 at 06:01:52PM +0000, Quentin Perret wrote:
> On Thursday 03 Dec 2020 at 12:57:33 (+0000), Fuad Tabba wrote:
> <snip>
> > > +int hyp_create_idmap(void);
> > > +int hyp_map_vectors(void);
> > > +int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back);
> > > +int hyp_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
> > > +int hyp_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
> > > +int __hyp_create_mappings(unsigned long start, unsigned long size,
> > > +                         unsigned long phys, unsigned long prot);
> > > +unsigned long __hyp_create_private_mapping(phys_addr_t phys, size_t size,
> > > +                                          unsigned long prot);
> > > +
> > 
> > nit: I also thought that the hyp_create_mappings function names are a
> > bit confusing, since there's the create_hyp_mappings functions which
> > use the aforementioned *hyp_pgtable.
> 
> Sure, happy to re-name those (and hyp_pgtable above). Any suggestions?
> 
> 
> <snip>
> > > +SYM_FUNC_START(__kvm_init_switch_pgd)
> > > +       /* Turn the MMU off */
> > > +       pre_disable_mmu_workaround
> > > +       mrs     x2, sctlr_el2
> > > +       bic     x3, x2, #SCTLR_ELx_M
> > > +       msr     sctlr_el2, x3
> > > +       isb
> > > +
> > > +       tlbi    alle2
> > > +
> > > +       /* Install the new pgtables */
> > > +       ldr     x3, [x0, #NVHE_INIT_PGD_PA]
> > > +       phys_to_ttbr x4, x3
> > > +alternative_if ARM64_HAS_CNP
> > > +       orr     x4, x4, #TTBR_CNP_BIT
> > > +alternative_else_nop_endif
> > > +       msr     ttbr0_el2, x4
> > > +
> > > +       /* Set the new stack pointer */
> > > +       ldr     x0, [x0, #NVHE_INIT_STACK_HYP_VA]
> > > +       mov     sp, x0
> > > +
> > > +       /* And turn the MMU back on! */
> > > +       dsb     nsh
> > > +       isb
> > > +       msr     sctlr_el2, x2
> > > +       isb
> > > +       ret     x1
> > > +SYM_FUNC_END(__kvm_init_switch_pgd)
> > > +
> > 
> > Should the instruction cache be flushed here (ic iallu), to discard
> > speculatively fetched instructions?
> 
> Hmm, Will? Thoughts?

The I-cache is physically tagged, so not sure what invalidation would
achieve here. Fuad -- what do you think could go wrong specifically?

Will
Mark Rutland Dec. 7, 2020, 11:05 a.m. UTC | #14
On Mon, Dec 07, 2020 at 10:20:03AM +0000, Will Deacon wrote:
> On Fri, Dec 04, 2020 at 06:01:52PM +0000, Quentin Perret wrote:
> > On Thursday 03 Dec 2020 at 12:57:33 (+0000), Fuad Tabba wrote:
> > <snip>
> > > > +SYM_FUNC_START(__kvm_init_switch_pgd)
> > > > +       /* Turn the MMU off */
> > > > +       pre_disable_mmu_workaround
> > > > +       mrs     x2, sctlr_el2
> > > > +       bic     x3, x2, #SCTLR_ELx_M
> > > > +       msr     sctlr_el2, x3
> > > > +       isb
> > > > +
> > > > +       tlbi    alle2
> > > > +
> > > > +       /* Install the new pgtables */
> > > > +       ldr     x3, [x0, #NVHE_INIT_PGD_PA]
> > > > +       phys_to_ttbr x4, x3
> > > > +alternative_if ARM64_HAS_CNP
> > > > +       orr     x4, x4, #TTBR_CNP_BIT
> > > > +alternative_else_nop_endif
> > > > +       msr     ttbr0_el2, x4
> > > > +
> > > > +       /* Set the new stack pointer */
> > > > +       ldr     x0, [x0, #NVHE_INIT_STACK_HYP_VA]
> > > > +       mov     sp, x0
> > > > +
> > > > +       /* And turn the MMU back on! */
> > > > +       dsb     nsh
> > > > +       isb
> > > > +       msr     sctlr_el2, x2
> > > > +       isb
> > > > +       ret     x1
> > > > +SYM_FUNC_END(__kvm_init_switch_pgd)
> > > > +
> > > 
> > > Should the instruction cache be flushed here (ic iallu), to discard
> > > speculatively fetched instructions?
> > 
> > Hmm, Will? Thoughts?
> 
> The I-cache is physically tagged, so not sure what invalidation would
> achieve here. Fuad -- what do you think could go wrong specifically?

While the MMU is off, instruction fetches can be made from the PoC
rather than the PoU, so where instructions have been modified/copied and
not cleaned to the PoC, it's possible to fetch stale copies into the
I-caches. The physical tag doesn't prevent that.

In the regular CPU boot paths, __enable_mmu() has an IC IALLU after
enabling the MMU to ensure that we get rid of anything stale (e.g. so
secondaries don't miss ftrace patching, which is only cleaned to the
PoU).

That might not be a problem here, if things are suitably padded and
never dynamically patched, but if so it's probably worth a comment.
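
To illustrate (untested, and only a sketch): the invalidation would slot
in right after the MMU is turned back on, mirroring what __enable_mmu()
does in the regular boot path:

	/* And turn the MMU back on! */
	dsb	nsh
	isb
	msr	sctlr_el2, x2
	isb
	/* Discard anything fetched from the PoC while the MMU was off */
	ic	iallu
	dsb	nsh
	isb
	ret	x1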

Fuad, is that the sort of thing you were considering, or did you have
additional concerns?

Thanks,
Mark.
Will Deacon Dec. 7, 2020, 11:10 a.m. UTC | #15
On Mon, Dec 07, 2020 at 11:05:45AM +0000, Mark Rutland wrote:
> On Mon, Dec 07, 2020 at 10:20:03AM +0000, Will Deacon wrote:
> > On Fri, Dec 04, 2020 at 06:01:52PM +0000, Quentin Perret wrote:
> > > On Thursday 03 Dec 2020 at 12:57:33 (+0000), Fuad Tabba wrote:
> > > <snip>
> > > > > +SYM_FUNC_START(__kvm_init_switch_pgd)
> > > > > +       /* Turn the MMU off */
> > > > > +       pre_disable_mmu_workaround
> > > > > +       mrs     x2, sctlr_el2
> > > > > +       bic     x3, x2, #SCTLR_ELx_M
> > > > > +       msr     sctlr_el2, x3
> > > > > +       isb
> > > > > +
> > > > > +       tlbi    alle2
> > > > > +
> > > > > +       /* Install the new pgtables */
> > > > > +       ldr     x3, [x0, #NVHE_INIT_PGD_PA]
> > > > > +       phys_to_ttbr x4, x3
> > > > > +alternative_if ARM64_HAS_CNP
> > > > > +       orr     x4, x4, #TTBR_CNP_BIT
> > > > > +alternative_else_nop_endif
> > > > > +       msr     ttbr0_el2, x4
> > > > > +
> > > > > +       /* Set the new stack pointer */
> > > > > +       ldr     x0, [x0, #NVHE_INIT_STACK_HYP_VA]
> > > > > +       mov     sp, x0
> > > > > +
> > > > > +       /* And turn the MMU back on! */
> > > > > +       dsb     nsh
> > > > > +       isb
> > > > > +       msr     sctlr_el2, x2
> > > > > +       isb
> > > > > +       ret     x1
> > > > > +SYM_FUNC_END(__kvm_init_switch_pgd)
> > > > > +
> > > > 
> > > > Should the instruction cache be flushed here (ic iallu), to discard
> > > > speculatively fetched instructions?
> > > 
> > > Hmm, Will? Thoughts?
> > 
> > The I-cache is physically tagged, so not sure what invalidation would
> > achieve here. Fuad -- what do you think could go wrong specifically?
> 
> While the MMU is off, instruction fetches can be made from the PoC
> rather than the PoU, so where instructions have been modified/copied and
> not cleaned to the PoC, it's possible to fetch stale copies into the
> I-caches. The physical tag doesn't prevent that.

Oh yeah, we even have a comment about that in
idmap_kpti_install_ng_mappings(). Maybe we should wrap disable_mmu and
enable_mmu in some macros so we don't have to trip over this every time (and
this would mean we could get rid of pre_disable_mmu_workaround too).
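
Something like the below, say (completely untested, and the exact names,
arguments and EL2-specific registers are made up here for illustration --
a real helper would probably be parameterised by exception level): the
disable side folds in the workaround, and the enable side folds in the
I-cache invalidation so callers can't forget either:

	.macro disable_mmu sctlr_old:req, tmp:req
	pre_disable_mmu_workaround
	mrs	\sctlr_old, sctlr_el2
	bic	\tmp, \sctlr_old, #SCTLR_ELx_M
	msr	sctlr_el2, \tmp
	isb
	.endm

	.macro enable_mmu sctlr_old:req
	dsb	nsh
	isb
	msr	sctlr_el2, \sctlr_old
	isb
	/* Discard instructions fetched from the PoC while the MMU was off */
	ic	iallu
	dsb	nsh
	isb
	.endm
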

> In the regular CPU boot paths, __enable_mmu() has an IC IALLU after
> enabling the MMU to ensure that we get rid of anything stale (e.g. so
> secondaries don't miss ftrace patching, which is only cleaned to the
> PoU).
> 
> That might not be a problem here, if things are suitably padded and
> never dynamically patched, but if so it's probably worth a comment.

It's fragile enough that we should just do the invalidation.

Will
Fuad Tabba Dec. 7, 2020, 11:14 a.m. UTC | #16
On Mon, Dec 7, 2020 at 11:05 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Mon, Dec 07, 2020 at 10:20:03AM +0000, Will Deacon wrote:
> > On Fri, Dec 04, 2020 at 06:01:52PM +0000, Quentin Perret wrote:
> > > On Thursday 03 Dec 2020 at 12:57:33 (+0000), Fuad Tabba wrote:
> > > <snip>
> > > > > +SYM_FUNC_START(__kvm_init_switch_pgd)
> > > > > +       /* Turn the MMU off */
> > > > > +       pre_disable_mmu_workaround
> > > > > +       mrs     x2, sctlr_el2
> > > > > +       bic     x3, x2, #SCTLR_ELx_M
> > > > > +       msr     sctlr_el2, x3
> > > > > +       isb
> > > > > +
> > > > > +       tlbi    alle2
> > > > > +
> > > > > +       /* Install the new pgtables */
> > > > > +       ldr     x3, [x0, #NVHE_INIT_PGD_PA]
> > > > > +       phys_to_ttbr x4, x3
> > > > > +alternative_if ARM64_HAS_CNP
> > > > > +       orr     x4, x4, #TTBR_CNP_BIT
> > > > > +alternative_else_nop_endif
> > > > > +       msr     ttbr0_el2, x4
> > > > > +
> > > > > +       /* Set the new stack pointer */
> > > > > +       ldr     x0, [x0, #NVHE_INIT_STACK_HYP_VA]
> > > > > +       mov     sp, x0
> > > > > +
> > > > > +       /* And turn the MMU back on! */
> > > > > +       dsb     nsh
> > > > > +       isb
> > > > > +       msr     sctlr_el2, x2
> > > > > +       isb
> > > > > +       ret     x1
> > > > > +SYM_FUNC_END(__kvm_init_switch_pgd)
> > > > > +
> > > >
> > > > Should the instruction cache be flushed here (ic iallu), to discard
> > > > speculatively fetched instructions?
> > >
> > > Hmm, Will? Thoughts?
> >
> > The I-cache is physically tagged, so not sure what invalidation would
> > achieve here. Fuad -- what do you think could go wrong specifically?
>
> While the MMU is off, instruction fetches can be made from the PoC
> rather than the PoU, so where instructions have been modified/copied and
> not cleaned to the PoC, it's possible to fetch stale copies into the
> I-caches. The physical tag doesn't prevent that.
>
> > In the regular CPU boot paths, __enable_mmu() has an IC IALLU after
> enabling the MMU to ensure that we get rid of anything stale (e.g. so
> secondaries don't miss ftrace patching, which is only cleaned to the
> PoU).
>
> That might not be a problem here, if things are suitably padded and
> never dynamically patched, but if so it's probably worth a comment.
>
> Fuad, is that the sort of thing you were considering, or did you have
> additional concerns?

No other concerns. Thanks Mark.
/fuad
Fuad Tabba Dec. 7, 2020, 11:16 a.m. UTC | #17
On Fri, Dec 4, 2020 at 6:01 PM Quentin Perret <qperret@google.com> wrote:
>
> On Thursday 03 Dec 2020 at 12:57:33 (+0000), Fuad Tabba wrote:
> <snip>
> > > +int hyp_create_idmap(void);
> > > +int hyp_map_vectors(void);
> > > +int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back);
> > > +int hyp_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
> > > +int hyp_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
> > > +int __hyp_create_mappings(unsigned long start, unsigned long size,
> > > +                         unsigned long phys, unsigned long prot);
> > > +unsigned long __hyp_create_private_mapping(phys_addr_t phys, size_t size,
> > > +                                          unsigned long prot);
> > > +
> >
> > nit: I also thought that the hyp_create_mappings function names are a
> > bit confusing, since there's the create_hyp_mappings functions which
> > use the aforementioned *hyp_pgtable.
>
> Sure, happy to re-name those (and hyp_pgtable above). Any suggestions?

Perhaps something to indicate that these are temporary, tmp_ or
bootstrap_ maybe?

Cheers,
/fuad
Quentin Perret Dec. 7, 2020, 11:58 a.m. UTC | #18
On Monday 07 Dec 2020 at 11:16:05 (+0000), Fuad Tabba wrote:
> On Fri, Dec 4, 2020 at 6:01 PM Quentin Perret <qperret@google.com> wrote:
> >
> > On Thursday 03 Dec 2020 at 12:57:33 (+0000), Fuad Tabba wrote:
> > <snip>
> > > > +int hyp_create_idmap(void);
> > > > +int hyp_map_vectors(void);
> > > > +int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back);
> > > > +int hyp_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
> > > > +int hyp_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
> > > > +int __hyp_create_mappings(unsigned long start, unsigned long size,
> > > > +                         unsigned long phys, unsigned long prot);
> > > > +unsigned long __hyp_create_private_mapping(phys_addr_t phys, size_t size,
> > > > +                                          unsigned long prot);
> > > > +
> > >
> > > nit: I also thought that the hyp_create_mappings function names are a
> > > bit confusing, since there's the create_hyp_mappings functions which
> > > use the aforementioned *hyp_pgtable.
> >
> > Sure, happy to re-name those (and hyp_pgtable above). Any suggestions?
> 
> Perhaps something to indicate that these are temporary, tmp_ or
> bootstrap_ maybe?

Hmm, the thing is these are temporary only in protected mode, they're
permanent otherwise :/

Perhaps I could prefix the protected pgtable (and associated functions)
with 'pkvm_' or so? Marc, any preferences?

Thanks,
Quentin
Will Deacon Dec. 7, 2020, 1:40 p.m. UTC | #19
Hi Quentin,

On Tue, Nov 17, 2020 at 06:15:56PM +0000, Quentin Perret wrote:
> When memory protection is enabled, the Hyp code needs the ability to
> create and manage its own page-table. To do so, introduce a new set of
> hypercalls to initialize Hyp memory protection.
> 
> During the init hcall, the hypervisor runs with the host-provided
> page-table and uses the trivial early page allocator to create its own
> set of page-tables, using a memory pool that was donated by the host.
> Specifically, the hypervisor creates its own mappings for __hyp_text,
> the Hyp memory pool, the __hyp_bss, the portion of hyp_vmemmap
> corresponding to the Hyp pool, among other things. It then jumps back in
> the idmap page, switches to use the newly-created pgd (instead of the
> temporary one provided by the host) and then installs the full-fledged
> buddy allocator which will then be the only one in used from then on.
> 
> Note that for the sake of simplifying the review, this only introduces
> the code doing this operation, without actually being called by anything
> yet. This will be done in a subsequent patch, which will introduce the
> necessary host kernel changes.

[...]

> diff --git a/arch/arm64/kvm/hyp/reserved_mem.c b/arch/arm64/kvm/hyp/reserved_mem.c
> new file mode 100644
> index 000000000000..02b0b18006f5
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/reserved_mem.c

[...]

> +extern bool enable_protected_kvm;
> +void __init reserve_kvm_hyp(void)
> +{
> +	u64 nr_pages, prev;
> +
> +	if (!enable_protected_kvm)
> +		return;
> +
> +	if (!is_hyp_mode_available() || is_kernel_in_hyp_mode())
> +		return;
> +
> +	if (kvm_nvhe_sym(hyp_memblock_nr) <= 0)
> +		return;
> +
> +	hyp_mem_size += num_possible_cpus() << PAGE_SHIFT;
> +	hyp_mem_size += hyp_s1_pgtable_size();
> +
> +	/*
> +	 * The hyp_vmemmap needs to be backed by pages, but these pages
> +	 * themselves need to be present in the vmemmap, so compute the number
> +	 * of pages needed by looking for a fixed point.
> +	 */
> +	nr_pages = 0;
> +	do {
> +		prev = nr_pages;
> +		nr_pages = (hyp_mem_size >> PAGE_SHIFT) + prev;
> +		nr_pages = DIV_ROUND_UP(nr_pages * sizeof(struct hyp_page), PAGE_SIZE);
> +		nr_pages += __hyp_pgtable_max_pages(nr_pages);
> +	} while (nr_pages != prev);
> +	hyp_mem_size += nr_pages << PAGE_SHIFT;
> +
> +	hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(),
> +					      hyp_mem_size, SZ_2M);
> +	if (!hyp_mem_base) {
> +		kvm_err("Failed to reserve hyp memory\n");
> +		return;
> +	}
> +	memblock_reserve(hyp_mem_base, hyp_mem_size);

Why not use the RESERVEDMEM_OF_DECLARE() interface for the hypervisor
memory? That way, the hypervisor memory can either be statically partitioned
as a carveout or allocated dynamically for us -- we wouldn't need to care.

Will
Marc Zyngier Dec. 7, 2020, 1:54 p.m. UTC | #20
On 2020-12-07 11:58, Quentin Perret wrote:
> On Monday 07 Dec 2020 at 11:16:05 (+0000), Fuad Tabba wrote:
>> On Fri, Dec 4, 2020 at 6:01 PM Quentin Perret <qperret@google.com> 
>> wrote:
>> >
>> > On Thursday 03 Dec 2020 at 12:57:33 (+0000), Fuad Tabba wrote:
>> > <snip>
>> > > > +int hyp_create_idmap(void);
>> > > > +int hyp_map_vectors(void);
>> > > > +int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back);
>> > > > +int hyp_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
>> > > > +int hyp_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
>> > > > +int __hyp_create_mappings(unsigned long start, unsigned long size,
>> > > > +                         unsigned long phys, unsigned long prot);
>> > > > +unsigned long __hyp_create_private_mapping(phys_addr_t phys, size_t size,
>> > > > +                                          unsigned long prot);
>> > > > +
>> > >
>> > > nit: I also thought that the hyp_create_mappings function names are a
>> > > bit confusing, since there's the create_hyp_mappings functions which
>> > > use the aforementioned *hyp_pgtable.
>> >
>> > Sure, happy to re-name those (and hyp_pgtable above). Any suggestions?
>> 
>> Perhaps something to indicate that these are temporary, tmp_ or
>> bootstrap_ maybe?
> 
> Hmm, the thing is these are temporary only in protected mode, they're
> permanent otherwise :/
> 
> Perhaps I could prefix the protected pgtable (and associated functions)
> with 'pkvm_' or so? Marc, any preferences?

None. Whichever name you pick, someone will ask you to change it.
Just call it Bob.

What I really *don't* want is see a blanket rename of existing symbols
or concepts.

Thanks,

         M.
Quentin Perret Dec. 7, 2020, 2:11 p.m. UTC | #21
On Monday 07 Dec 2020 at 13:40:52 (+0000), Will Deacon wrote:
> Why not use the RESERVEDMEM_OF_DECLARE() interface for the hypervisor
> memory? That way, the hypervisor memory can either be statically partitioned
> as a carveout or allocated dynamically for us -- we wouldn't need to care.

Yup, I did consider that, but the actual amount of memory we need to
reserve for the hypervisor depends on things such as the size of struct
hyp_page, which depends on the kernel you're running (that is, it might
change over time). So, that really felt like something the kernel should
be doing, to keep the DT backward compatible, ... Or did you have
something more elaborate in mind?

Thanks,
Quentin
Quentin Perret Dec. 7, 2020, 2:17 p.m. UTC | #22
On Monday 07 Dec 2020 at 13:54:48 (+0000), Marc Zyngier wrote:
> None. Whichever name you pick, someone will ask you to change it.
> Just call it Bob.

:-)

> What I really *don't* want is see a blanket rename of existing symbols
> or concepts.

Understood. I'll go with pkvm_create_mappings() and friends for all the
new functions unless someone comes up with a better name in the meantime.

Thanks,
Quentin
Will Deacon Dec. 8, 2020, 9:40 a.m. UTC | #23
On Mon, Dec 07, 2020 at 02:11:20PM +0000, Quentin Perret wrote:
> On Monday 07 Dec 2020 at 13:40:52 (+0000), Will Deacon wrote:
> > Why not use the RESERVEDMEM_OF_DECLARE() interface for the hypervisor
> > memory? That way, the hypervisor memory can either be statically partitioned
> > as a carveout or allocated dynamically for us -- we wouldn't need to care.
> 
> Yup, I did consider that, but the actual amount of memory we need to
> reserve for the hypervisor depends on things such as the size of struct
> hyp_page, which depends on the kernel you're running (that is, it might
> change over time). So, that really felt like something the kernel should
> be doing, to keep the DT backward compatible, ... Or did you have
> something more elaborate in mind?

No, that's fair. Just wanted to make sure we had a good reason not to use
the existing memory reservation code.

Will