[0/8] arm64 kexec kernel patches V7

Message ID cover.1421449714.git.geoff@infradead.org
State New

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/geoff/linux-kexec.git kexec-v7

Message

Geoff Levand Jan. 17, 2015, 12:23 a.m. UTC
Hi All,

This series adds the core support for kexec re-boots on arm64.  This v7 of the
series is mainly just a rebase to the latest arm64 for-next/core branch
(v3.19-rc4), and a few very minor changes requested for v6.

I have tested with the ARM VE fast model, the ARM Base model and the ARM
Foundation model with various kernel config options for both the first and
second stage kernels.

To load a second stage kernel and execute a kexec re-boot on arm64 my patches to
kexec-tools [2], which have not yet been merged upstream, are needed.

Patch 1 here moves proc-macros.S from arm64/mm to arm64/include/asm so that the
dcache_line_size macro it defines can be used by kexec's relocate kernel
routine.

Patches 2-4 rework the arm64 hcall mechanism to give the arm64 soft_restart()
routine the ability to switch exception levels from EL1 to EL2 for kernels that
were entered in EL2.

Patches 5-8 add the actual kexec support.

Please consider all patches for inclusion.

Note that the location of my development repositories has changed:

[1]  https://git.kernel.org/cgit/linux/kernel/git/geoff/linux-kexec.git
[2]  https://git.kernel.org/cgit/linux/kernel/git/geoff/kexec-tools.git

Several things are known to have problems on kexec re-boot:

spin-table
----------

PROBLEM: The spin-table enable method does not implement all the methods needed
for CPU hot-plug, so the first stage kernel cannot be shut down properly.

WORK-AROUND: Upgrade to system firmware that provides PSCI enable method
support, OR build the first stage kernel with CONFIG_SMP=n, OR pass 'maxcpus=1'
on the first stage kernel command line.

FIX: Upgrade system firmware to provide PSCI enable method support or add
missing spin-table support to the kernel.

KVM
---

PROBLEM: KVM acquires hypervisor resources on startup, but does not free those
resources on shutdown, so the first stage kernel cannot be shut down properly
when using kexec.

WORK-AROUND:  Build the first stage kernel with CONFIG_KVM=n.

FIX: Fix KVM to support soft_restart().  KVM needs to restore default exception
vectors, etc.

UEFI
----

PROBLEM: UEFI does not manage its runtime services virtual mappings in a way
that is compatible with a kexec re-boot, so the second stage kernel hangs on
boot-up.

WORK-AROUND:  Disable UEFI in firmware.

FIX: Ard Biesheuvel has done work to fix this.  Basic kexec re-boot has been
tested and works.  More comprehensive testing is needed.

/memreserve/
------------

PROBLEM: The use of device tree /memreserve/ entries is not compatible with
kexec re-boot.  The second stage kernel will use the reserved regions and the
system will become unstable.

WORK-AROUND: Pass a user specified DTB using the kexec --dtb option.

FIX: An interface to expose a binary device tree to user space has been
proposed.  User kexec utilities will need to be updated to add support for this
new interface.

ACPI
----

PROBLEM: The kernel for ACPI based systems does not export a device tree to the
standard user space location of '/proc/device-tree'.  Current applications
expect to access device tree information from this standard location.

WORK-AROUND: Disable ACPI in firmware, OR pass 'acpi=off' on the first stage
kernel command line, OR pass a user specified DTB using the kexec --dtb option.

FIX: An interface to expose a binary device tree to user space has been
proposed.  User kexec utilities will need to be updated to add support for this
new interface.
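
As a rough sketch of what updated user-space kexec utilities might do with such
an exported blob, the code below reads a file and validates the flattened
device tree header before use.  The path, helper name, and return convention
are illustrative only, not part of any proposed interface:

```c
#include <stdint.h>
#include <stdio.h>

#define OF_DT_HEADER 0xd00dfeedU

/* All FDT header fields are stored big-endian; read one 32-bit field. */
static uint32_t be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Sketch: validate the header of a kernel-exported device tree blob.
 * Returns the blob's totalsize on success, 0 on failure. */
static uint32_t fdt_blob_size(const char *path)
{
	unsigned char hdr[8];
	uint32_t size = 0;
	FILE *f = fopen(path, "rb");

	if (!f)
		return 0;
	if (fread(hdr, 1, sizeof(hdr), f) == sizeof(hdr) &&
	    be32(hdr) == OF_DT_HEADER)
		size = be32(hdr + 4);	/* totalsize follows the magic */
	fclose(f);
	return size;
}
```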

----------------------------------------------------------------
The following changes since commit 6083fe74b7bfffc2c7be8c711596608bda0cda6e:

  arm64: respect mem= for EFI (2015-01-16 16:21:58 +0000)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/geoff/linux-kexec.git kexec-v7

for you to fetch changes up to 7db998d9533d5efab816edfad3d7010fc2f7e62c:

  arm64/kexec: Enable kexec in the arm64 defconfig (2015-01-16 14:55:28 -0800)

----------------------------------------------------------------
Geoff Levand (8):
      arm64: Move proc-macros.S to include/asm
      arm64: Convert hcalls to use ISS field
      arm64: Add new hcall HVC_CALL_FUNC
      arm64: Add EL2 switch to soft_restart
      arm64/kexec: Add core kexec support
      arm64/kexec: Add pr_devel output
      arm64/kexec: Add checks for KVM
      arm64/kexec: Enable kexec in the arm64 defconfig

 arch/arm64/Kconfig                           |   9 ++
 arch/arm64/configs/defconfig                 |   1 +
 arch/arm64/include/asm/kexec.h               |  47 ++++++
 arch/arm64/include/asm/proc-fns.h            |   4 +-
 arch/arm64/{mm => include/asm}/proc-macros.S |   0
 arch/arm64/include/asm/virt.h                |  33 ++++
 arch/arm64/kernel/Makefile                   |   1 +
 arch/arm64/kernel/hyp-stub.S                 |  45 ++++--
 arch/arm64/kernel/machine_kexec.c            | 219 +++++++++++++++++++++++++++
 arch/arm64/kernel/process.c                  |  10 +-
 arch/arm64/kernel/relocate_kernel.S          | 160 +++++++++++++++++++
 arch/arm64/kvm/hyp.S                         |  18 ++-
 arch/arm64/mm/cache.S                        |   3 +-
 arch/arm64/mm/proc.S                         |  50 ++++--
 include/uapi/linux/kexec.h                   |   1 +
 15 files changed, 563 insertions(+), 38 deletions(-)
 create mode 100644 arch/arm64/include/asm/kexec.h
 rename arch/arm64/{mm => include/asm}/proc-macros.S (100%)
 create mode 100644 arch/arm64/kernel/machine_kexec.c
 create mode 100644 arch/arm64/kernel/relocate_kernel.S

Comments

Catalin Marinas Jan. 26, 2015, 5:44 p.m. UTC | #1
Hi Geoff,

On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> This series adds the core support for kexec re-boots on arm64.  This v7 of the
> series is mainly just a rebase to the latest arm64 for-next/core branch
> (v3.19-rc4), and a few very minor changes requested for v6.

I haven't looked at the series in detail before, so some of my comments
may have already been discussed.

> Several things are known to have problems on kexec re-boot:
> 
> spin-table

I think that's not too bad, for complete kexec support (SMP->SMP) we can
require some CPU unplug mechanism and PSCI is one of them.

> FIX: Upgrade system firmware to provide PSCI enable method support or add
> missing spin-table support to the kernel.

What's the missing spin-table support?

> ACPI
> ----
> 
> PROBLEM: The kernel for ACPI based systems does not export a device tree to the
> standard user space location of 'proc/device-tree'.  Current applications
> expect to access device tree information from this standard location.
> 
> WORK-AROUND: Disable ACPI in firmware, OR pass 'acpi=off' on the first stage
> kernel command line, OR pass a user specified DTB using the kexec --dtb option.
> 
> FIX: FIX: An interface to expose a binary device tree to user space has been
> proposed.  User kexec utilities will need to be updated to add support for this
> new interface.

So the fix here is to boot the second stage kernel with dtb, which means
that we mandate the existence of a DT file for any ACPI system. Are
there plans to make the kexec'ed kernel reuse the ACPI tables?
Catalin Marinas Jan. 26, 2015, 5:45 p.m. UTC | #2
On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> To allow the assembler macros defined in proc-macros.S to be used outside
> the mm code move the proc-macros.S file from arch/arm64/mm/ to
> arch/arm64/include/asm/ and fix up any preprocessor includes to use the new
> file location.
> 
> Signed-off-by: Geoff Levand <geoff@infradead.org>
> ---
>  arch/arm64/include/asm/proc-macros.S | 54 ++++++++++++++++++++++++++++++++++++
>  arch/arm64/mm/cache.S                |  3 +-
>  arch/arm64/mm/proc-macros.S          | 54 ------------------------------------
>  arch/arm64/mm/proc.S                 |  3 +-

Actually, I would just merge proc-macros.S into assembler.h. Not worth
keeping the former just for a few macros.
Catalin Marinas Jan. 26, 2015, 6:26 p.m. UTC | #3
On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> To allow for additional hcalls to be defined and to make the arm64 hcall API
> more consistent across exception vector routines, change the hcall implementations
> to use the ISS field of the ESR_EL2 register to specify the hcall type.
> 
> The existing arm64 hcall implementations are limited in that they only allow
> for two distinct hcalls; with the x0 register either zero, or not zero.  Also,
> the API of the hyp-stub exception vector routines and the KVM exception vector
> routines differ; hyp-stub uses a non-zero value in x0 to implement
> __hyp_set_vectors, whereas KVM uses it to implement kvm_call_hyp.
> 
> Define three new preprocessor macros HVC_GET_VECTORS, HVC_SET_VECTORS and
> HVC_CALL_HYP and to be used as hcall type specifiers and convert the
> existing __hyp_get_vectors(), __hyp_set_vectors() and kvm_call_hyp() routines
> to use these new macros when executing an HVC call.  Also change the
> corresponding hyp-stub and KVM el1_sync exception vector routines to use these
> new macros.
> 
> Signed-off-by: Geoff Levand <geoff@infradead.org>

Using the #imm value for HVC to separate what gets called looks fine to
me. However, I'd like to see a review from Marc/Christoffer on this
patch.

Some comments below:

> diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
> index 7a5df52..99c319c 100644
> --- a/arch/arm64/include/asm/virt.h
> +++ b/arch/arm64/include/asm/virt.h
> @@ -21,6 +21,26 @@
>  #define BOOT_CPU_MODE_EL1	(0xe11)
>  #define BOOT_CPU_MODE_EL2	(0xe12)
>  
> +/*
> + * HVC_GET_VECTORS - Return the value of the vbar_el2 register.
> + */
> +
> +#define HVC_GET_VECTORS 1
> +
> +/*
> + * HVC_SET_VECTORS - Set the value of the vbar_el2 register.
> + *
> + * @x0: Physical address of the new vector table.
> + */
> +
> +#define HVC_SET_VECTORS 2
> +
> +/*
> + * HVC_CALL_HYP - Execute a hyp routine.
> + */
> +
> +#define HVC_CALL_HYP 3

I think you can ignore this case (make it the default), just define it
as 0 as that's the normal use-case after initialisation and avoid
checking it explicitly.
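
The ESR_EL2 decode being discussed can be modelled in plain C.  The constants
match the patch, but the function itself is only an illustrative user-space
sketch of the el1_sync assembly dispatch, not kernel code:

```c
#include <stdint.h>

/* ESR_EL2 layout: EC in bits [31:26], ISS in bits [24:0]. */
#define ESR_ELx_EC_SHIFT	26
#define ESR_ELx_ISS_MASK	0x01ffffffUL
#define ESR_ELx_EC_HVC64	0x16

/* hcall type specifiers from the patch. */
#define HVC_GET_VECTORS	1
#define HVC_SET_VECTORS	2
#define HVC_CALL_HYP	3

/* Model of the dispatch: return which hcall an ESR_EL2 value encodes,
 * or -1 when the trapped exception was not an HVC. */
static int decode_hcall(uint32_t esr)
{
	uint32_t ec = esr >> ESR_ELx_EC_SHIFT;
	uint32_t iss = esr & ESR_ELx_ISS_MASK;

	if (ec != ESR_ELx_EC_HVC64)
		return -1;
	return (int)iss;	/* the HVC #imm value */
}
```

With Catalin's suggestion, HVC_CALL_HYP would become 0 and be the default
case, so only the non-zero ISS values would need explicit comparisons.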

>  /*
> diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
> index a272f33..e3db3fd 100644
> --- a/arch/arm64/kernel/hyp-stub.S
> +++ b/arch/arm64/kernel/hyp-stub.S
> @@ -22,6 +22,7 @@
>  #include <linux/irqchip/arm-gic-v3.h>
>  
>  #include <asm/assembler.h>
> +#include <asm/kvm_arm.h>
>  #include <asm/ptrace.h>
>  #include <asm/virt.h>
>  
> @@ -53,14 +54,22 @@ ENDPROC(__hyp_stub_vectors)
>  	.align 11
>  
>  el1_sync:
> -	mrs	x1, esr_el2
> -	lsr	x1, x1, #26
> -	cmp	x1, #0x16
> -	b.ne	2f				// Not an HVC trap
> -	cbz	x0, 1f
> -	msr	vbar_el2, x0			// Set vbar_el2
> +	mrs	x18, esr_el2
> +	lsr	x17, x18, #ESR_ELx_EC_SHIFT
> +	and	x18, x18, #ESR_ELx_ISS_MASK
> +
> +	cmp     x17, #ESR_ELx_EC_HVC64
> +	b.ne    2f				// Not an HVC trap
> +
> +	cmp	x18, #HVC_GET_VECTORS
> +	b.ne	1f
> +	mrs	x0, vbar_el2
>  	b	2f
> -1:	mrs	x0, vbar_el2			// Return vbar_el2
> +
> +1:	cmp	x18, #HVC_SET_VECTORS
> +	b.ne	2f
> +	msr	vbar_el2, x0
> +
>  2:	eret
>  ENDPROC(el1_sync)

You seem to be using x17 and x18 here freely. Do you have any guarantees
that the caller saved/restored those registers? I guess you assume they
are temporary registers and the caller first branches to a function
(like __kvm_hyp_call) and expects them to be corrupted. But I'm not sure
that's always the case. Take for example the __invoke_psci_fn_hvc where
the function is in C (we should change this for other reasons).

> diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
> index c0d8202..1916c89 100644
> --- a/arch/arm64/kvm/hyp.S
> +++ b/arch/arm64/kvm/hyp.S
> @@ -27,6 +27,7 @@
>  #include <asm/kvm_asm.h>
>  #include <asm/kvm_mmu.h>
>  #include <asm/memory.h>
> +#include <asm/virt.h>
>  
>  #define CPU_GP_REG_OFFSET(x)	(CPU_GP_REGS + x)
>  #define CPU_XREG_OFFSET(x)	CPU_GP_REG_OFFSET(CPU_USER_PT_REGS + 8*x)
> @@ -1106,12 +1107,9 @@ __hyp_panic_str:
>   * in Hyp mode (see init_hyp_mode in arch/arm/kvm/arm.c).  Return values are
>   * passed in r0 and r1.
>   *
> - * A function pointer with a value of 0 has a special meaning, and is
> - * used to implement __hyp_get_vectors in the same way as in
> - * arch/arm64/kernel/hyp_stub.S.
>   */
>  ENTRY(kvm_call_hyp)
> -	hvc	#0
> +	hvc	#HVC_CALL_HYP
>  	ret
>  ENDPROC(kvm_call_hyp)
>  
> @@ -1142,6 +1140,7 @@ el1_sync:					// Guest trapped into EL2
>  
>  	mrs	x1, esr_el2
>  	lsr	x2, x1, #ESR_ELx_EC_SHIFT
> +	and	x0, x1, #ESR_ELx_ISS_MASK
>  
>  	cmp	x2, #ESR_ELx_EC_HVC64
>  	b.ne	el1_trap
> @@ -1150,15 +1149,19 @@ el1_sync:					// Guest trapped into EL2
>  	cbnz	x3, el1_trap			// called HVC
>  
>  	/* Here, we're pretty sure the host called HVC. */
> +	mov	x18, x0

Same comment here about corrupting x18. If it is safe, maybe add some
comments in the calling place.

>  	pop	x2, x3
>  	pop	x0, x1
>  
> -	/* Check for __hyp_get_vectors */
> -	cbnz	x0, 1f
> +	cmp	x18, #HVC_GET_VECTORS
> +	b.ne	1f
>  	mrs	x0, vbar_el2
>  	b	2f
>  
> -1:	push	lr, xzr
> +1:	cmp	x18, #HVC_CALL_HYP
> +	b.ne	2f
> +
> +	push	lr, xzr

At this point, we expect either HVC_GET_VECTORS or HVC_CALL_HYP. I think
you can simply assume HVC_CALL_HYP as default and ignore the additional
cmp.
Grant Likely Jan. 26, 2015, 6:37 p.m. UTC | #4
On Mon, Jan 26, 2015 at 5:44 PM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> Hi Geoff,
>
> On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
>> This series adds the core support for kexec re-boots on arm64.  This v7 of the
>> series is mainly just a rebase to the latest arm64 for-next/core branch
>> (v3.19-rc4), and a few very minor changes requested for v6.
>
> I haven't looked at the series in detail before, so some of my comments
> may have already been discussed.
>
>> Several things are known to have problems on kexec re-boot:
>>
>> spin-table
>
> I think that's not too bad, for complete kexec support (SMP->SMP) we can
> require some CPU unplug mechanism and PSCI is one of them.
>
>> FIX: Upgrade system firmware to provide PSCI enable method support or add
>> missing spin-table support to the kernel.
>
> What's the missing spin-table support?
>
>> ACPI
>> ----
>>
>> PROBLEM: The kernel for ACPI based systems does not export a device tree to the
>> standard user space location of 'proc/device-tree'.  Current applications
>> expect to access device tree information from this standard location.
>>
>> WORK-AROUND: Disable ACPI in firmware, OR pass 'acpi=off' on the first stage
>> kernel command line, OR pass a user specified DTB using the kexec --dtb option.
>>
>> FIX: FIX: An interface to expose a binary device tree to user space has been
>> proposed.  User kexec utilities will need to be updated to add support for this
>> new interface.

The new interface is merged into mainline. /sys/firmware/fdt

> So the fix here is to boot the second stage kernel with dtb, which means
> that we mandate the existence of a DT file for any ACPI system. Are
> there plans to make the kexec'ed kernel reuse the ACPI tables?

Yes, the kexec'ed kernel will reuse the ACPI tables, and any other
data passed by UEFI. The DT we're talking about here is the DT
generated by the kernel's UEFI stub, and the kexec tools want access
to it so they can find the UEFI system table pointer.

g.
Mark Rutland Jan. 26, 2015, 6:55 p.m. UTC | #5
On Mon, Jan 26, 2015 at 05:44:14PM +0000, Catalin Marinas wrote:
> Hi Geoff,
> 
> On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> > This series adds the core support for kexec re-boots on arm64.  This v7 of the
> > series is mainly just a rebase to the latest arm64 for-next/core branch
> > (v3.19-rc4), and a few very minor changes requested for v6.
> 
> I haven't looked at the series in detail before, so some of my comments
> may have already been discussed.
> 
> > Several things are known to have problems on kexec re-boot:
> > 
> > spin-table
> 
> I think that's not too bad, for complete kexec support (SMP->SMP) we can
> require some CPU unplug mechanism and PSCI is one of them.
> 
> > FIX: Upgrade system firmware to provide PSCI enable method support or add
> > missing spin-table support to the kernel.
> 
> What's the missing spin-table support?

As you mention above, a mechanism for returning the CPUs to FW. There is
no spec for doing this with spin-table, which would have to be written
and vetted.

Mark.
Mark Rutland Jan. 26, 2015, 7:02 p.m. UTC | #6
On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> When a CPU is reset it needs to be put into the exception level it had when it
> entered the kernel.  Update cpu_reset() to accept an argument el2_switch which
> signals cpu_reset() to enter the soft reset address at EL2.  If el2_switch is
> not set the soft reset address will be entered at EL1.
> 
> Update cpu_soft_restart() and soft_restart() to pass the return of
> is_hyp_mode_available() as the el2_switch value to cpu_reset().  Also update the
> comments of cpu_reset(), cpu_soft_restart() and soft_restart() to reflect this
> change.
> 
> Signed-off-by: Geoff Levand <geoff@infradead.org>
> ---
>  arch/arm64/include/asm/proc-fns.h |  4 ++--
>  arch/arm64/kernel/process.c       | 10 ++++++++-
>  arch/arm64/mm/proc.S              | 47 +++++++++++++++++++++++++++++----------
>  3 files changed, 46 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/proc-fns.h b/arch/arm64/include/asm/proc-fns.h
> index 9a8fd84..339394d 100644
> --- a/arch/arm64/include/asm/proc-fns.h
> +++ b/arch/arm64/include/asm/proc-fns.h
> @@ -32,8 +32,8 @@ extern void cpu_cache_off(void);
>  extern void cpu_do_idle(void);
>  extern void cpu_do_switch_mm(unsigned long pgd_phys, struct mm_struct *mm);
>  extern void cpu_reset(unsigned long addr) __attribute__((noreturn));
> -void cpu_soft_restart(phys_addr_t cpu_reset,
> -		unsigned long addr) __attribute__((noreturn));
> +void cpu_soft_restart(phys_addr_t cpu_reset, unsigned long el2_switch,
> +		      unsigned long addr) __attribute__((noreturn));
>  extern void cpu_do_suspend(struct cpu_suspend_ctx *ptr);
>  extern u64 cpu_do_resume(phys_addr_t ptr, u64 idmap_ttbr);
>  
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index fde9923..371bbf1 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -50,6 +50,7 @@
>  #include <asm/mmu_context.h>
>  #include <asm/processor.h>
>  #include <asm/stacktrace.h>
> +#include <asm/virt.h>
>  
>  #ifdef CONFIG_CC_STACKPROTECTOR
>  #include <linux/stackprotector.h>
> @@ -60,7 +61,14 @@ EXPORT_SYMBOL(__stack_chk_guard);
>  void soft_restart(unsigned long addr)
>  {
>  	setup_mm_for_reboot();
> -	cpu_soft_restart(virt_to_phys(cpu_reset), addr);
> +
> +	/* TODO: Remove this conditional when KVM can support CPU restart. */
> +	if (IS_ENABLED(CONFIG_KVM))
> +		cpu_soft_restart(virt_to_phys(cpu_reset), 0, addr);

If we haven't torn down KVM, doesn't that mean that KVM is active at EL2
(with MMU and caches on) at this point?

If that's the case then we cannot possibly try to call kexec(), because
we cannot touch the memory used by the page tables for those EL2
mappings. Things will explode if we do.

Mark.
Mark Rutland Jan. 26, 2015, 7:16 p.m. UTC | #7
On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> Add three new files, kexec.h, machine_kexec.c and relocate_kernel.S to the
> arm64 architecture that add support for the kexec re-boot mechanism
> (CONFIG_KEXEC) on arm64 platforms.
> 
> Signed-off-by: Geoff Levand <geoff@infradead.org>
> ---
>  arch/arm64/Kconfig                  |   9 ++
>  arch/arm64/include/asm/kexec.h      |  47 +++++++++++
>  arch/arm64/kernel/Makefile          |   1 +
>  arch/arm64/kernel/machine_kexec.c   | 155 ++++++++++++++++++++++++++++++++++
>  arch/arm64/kernel/relocate_kernel.S | 160 ++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/kexec.h          |   1 +
>  6 files changed, 373 insertions(+)
>  create mode 100644 arch/arm64/include/asm/kexec.h
>  create mode 100644 arch/arm64/kernel/machine_kexec.c
>  create mode 100644 arch/arm64/kernel/relocate_kernel.S
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index b1f9a20..d9eb9cd 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -474,6 +474,15 @@ config SECCOMP
>           and the task is only allowed to execute a few safe syscalls
>           defined by each seccomp mode.
> 
> +config KEXEC
> +       depends on (!SMP || PM_SLEEP_SMP)
> +       bool "kexec system call"
> +       ---help---
> +         kexec is a system call that implements the ability to shutdown your
> +         current kernel, and to start another kernel.  It is like a reboot
> +         but it is independent of the system firmware.   And like a reboot
> +         you can start any kernel with it, not just Linux.
> +

[...]

> +/**
> + * kexec_is_dtb - Helper routine to check the device tree header signature.
> + */
> +static bool kexec_is_dtb(const void *dtb)
> +{
> +       __be32 magic;
> +
> +       return get_user(magic, (__be32 *)dtb) ? false :
> +               (be32_to_cpu(magic) == OF_DT_HEADER);
> +}
> +
> +/**
> + * kexec_find_dtb_seg - Helper routine to find the dtb segment.
> + */
> +static const struct kexec_segment *kexec_find_dtb_seg(
> +       const struct kimage *image)
> +{
> +       int i;
> +
> +       for (i = 0; i < image->nr_segments; i++) {
> +               if (kexec_is_dtb(image->segment[i].buf))
> +                       return &image->segment[i];
> +       }
> +
> +       return NULL;
> +}

As mentioned before, _please_ move the dtb handling to the
userspace-provided purgatory. It would be far better to get userspace to
handle setting up the dtb pointer explicitly. That avoids fragility
w.r.t. policy here as userspace will get exactly what it asked for,
nothing more, nothing less.

The fact that this is done on 32-bit arm does not mean that we must do
it here.

[...]

> +       /* Start new image. */
> +       ldr     x4, arm64_kexec_kimage_start
> +       ldr     x0, arm64_kexec_dtb_addr
> +       mov     x1, xzr
> +       mov     x2, xzr
> +       mov     x3, xzr
> +       br      x4

Likewise, this should be part of the userspace-provided purgatory code.
If we're staying true to "like a reboot you can start any kernel with
it, not just Linux", we shouldn't embed the Linux boot protocol here.

Mark.
Mark Rutland Jan. 26, 2015, 7:19 p.m. UTC | #8
On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> Add runtime checks that fail the arm64 kexec syscall for situations that would
> result in system instability do to problems in the KVM kernel support.
> These checks should be removed when the KVM problems are resolved fixed.
> 
> Signed-off-by: Geoff Levand <geoff@infradead.org>
> ---
>  arch/arm64/kernel/machine_kexec.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> index 3d84759..a36459d 100644
> --- a/arch/arm64/kernel/machine_kexec.c
> +++ b/arch/arm64/kernel/machine_kexec.c
> @@ -16,6 +16,9 @@
>  #include <asm/cacheflush.h>
>  #include <asm/system_misc.h>
>  
> +/* TODO: Remove this include when KVM can support a kexec reboot. */
> +#include <asm/virt.h>
> +
>  /* Global variables for the relocate_kernel routine. */
>  extern const unsigned char relocate_new_kernel[];
>  extern const unsigned long relocate_new_kernel_size;
> @@ -100,6 +103,13 @@ int machine_kexec_prepare(struct kimage *image)
>  
>  	kexec_image_info(image);
>  
> +	/* TODO: Remove this message when KVM can support a kexec reboot. */
> +	if (IS_ENABLED(CONFIG_KVM) && is_hyp_mode_available()) {
> +		pr_err("%s: Your kernel is configured with KVM support (CONFIG_KVM=y) which currently does not allow for kexec re-boot.\n",
> +			__func__);
> +		return -ENOSYS;
> +	}

If you really don't want to implement KVM teardown, surely this should
be at the start of the series, so we don't have a point in the middle
where things may explode in this case?

Mark.
Christoffer Dall Jan. 26, 2015, 8:39 p.m. UTC | #9
On Mon, Jan 26, 2015 at 8:19 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
>> Add runtime checks that fail the arm64 kexec syscall for situations that would
>> result in system instability do to problems in the KVM kernel support.
>> These checks should be removed when the KVM problems are resolved fixed.
>>
>> Signed-off-by: Geoff Levand <geoff@infradead.org>
>> ---
>>  arch/arm64/kernel/machine_kexec.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
>> index 3d84759..a36459d 100644
>> --- a/arch/arm64/kernel/machine_kexec.c
>> +++ b/arch/arm64/kernel/machine_kexec.c
>> @@ -16,6 +16,9 @@
>>  #include <asm/cacheflush.h>
>>  #include <asm/system_misc.h>
>>
>> +/* TODO: Remove this include when KVM can support a kexec reboot. */
>> +#include <asm/virt.h>
>> +
>>  /* Global variables for the relocate_kernel routine. */
>>  extern const unsigned char relocate_new_kernel[];
>>  extern const unsigned long relocate_new_kernel_size;
>> @@ -100,6 +103,13 @@ int machine_kexec_prepare(struct kimage *image)
>>
>>       kexec_image_info(image);
>>
>> +     /* TODO: Remove this message when KVM can support a kexec reboot. */
>> +     if (IS_ENABLED(CONFIG_KVM) && is_hyp_mode_available()) {
>> +             pr_err("%s: Your kernel is configured with KVM support (CONFIG_KVM=y) which currently does not allow for kexec re-boot.\n",
>> +                     __func__);
>> +             return -ENOSYS;
>> +     }
>
> If you really don't want to implement KVM teardown, surely this should
> be at the start of the series, so we don't have a point in the middle
> where things may explode in this case?
>
So this caters to support systems that don't support KVM (don't boot
in EL2) but is configured with both KVM and KEXEC?

Why not just make the kexec config dependent on !KVM?

-Christoffer
Geoff Levand Jan. 26, 2015, 8:57 p.m. UTC | #10
Hi Catalin,


On Mon, 2015-01-26 at 17:44 +0000, Catalin Marinas wrote:
> On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> > This series adds the core support for kexec re-boots on arm64.  This v7 of the
> > series is mainly just a rebase to the latest arm64 for-next/core branch
> > (v3.19-rc4), and a few very minor changes requested for v6.
> 
> I haven't looked at the series in detail before, so some of my comments
> may have already been discussed.
> 
> > Several things are known to have problems on kexec re-boot:
> > 
> > spin-table
> 
> I think that's not too bad, for complete kexec support (SMP->SMP) we can
> require some CPU unplug mechanism and PSCI is one of them.
> 
> > FIX: Upgrade system firmware to provide PSCI enable method support or add
> > missing spin-table support to the kernel.
> 
> What's the missing spin-table support?

I had a working spin-table implementation, but I have not kept that
up to date.  It is somewhat complicated, and it requires a mechanism
to return the secondary CPUs to the spin table code.  As Mark
mentioned, that mechanism would need to be decided on.  I don't plan
to put any more work into spin-table support.

Just for anyone interested, my old spin-table support patches are in
this branch:

https://git.kernel.org/cgit/linux/kernel/git/geoff/linux-kexec.git/log/?h=spin-table

> 
> > ACPI
> > ----
> > 
> > PROBLEM: The kernel for ACPI based systems does not export a device tree to the
> > standard user space location of 'proc/device-tree'.  Current applications
> > expect to access device tree information from this standard location.
> > 
> > WORK-AROUND: Disable ACPI in firmware, OR pass 'acpi=off' on the first stage
> > kernel command line, OR pass a user specified DTB using the kexec --dtb option.
> > 
> > FIX: FIX: An interface to expose a binary device tree to user space has been
> > proposed.  User kexec utilities will need to be updated to add support for this
> > new interface.
> 
> So the fix here is to boot the second stage kernel with dtb, which means
> that we mandate the existence of a DT file for any ACPI system. Are
> there plans to make the kexec'ed kernel reuse the ACPI tables?

As Grant mentioned, the dtb the UEFI stub creates and passes to the first
stage kernel is what is passed to the second stage kernel.  The second
stage kernel uses that to setup its ACPI support.

-Geoff
Geoff Levand Jan. 26, 2015, 8:58 p.m. UTC | #11
On Mon, 2015-01-26 at 21:39 +0100, Christoffer Dall wrote:
> On Mon, Jan 26, 2015 at 8:19 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> > On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> >> Add runtime checks that fail the arm64 kexec syscall for situations that would
> >> result in system instability do to problems in the KVM kernel support.
> >> These checks should be removed when the KVM problems are resolved fixed.
> >>
> >> Signed-off-by: Geoff Levand <geoff@infradead.org>
> >> ---
> >>  arch/arm64/kernel/machine_kexec.c | 10 ++++++++++
> >>  1 file changed, 10 insertions(+)
> >>
> >> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> >> index 3d84759..a36459d 100644
> >> --- a/arch/arm64/kernel/machine_kexec.c
> >> +++ b/arch/arm64/kernel/machine_kexec.c
> >> @@ -16,6 +16,9 @@
> >>  #include <asm/cacheflush.h>
> >>  #include <asm/system_misc.h>
> >>
> >> +/* TODO: Remove this include when KVM can support a kexec reboot. */
> >> +#include <asm/virt.h>
> >> +
> >>  /* Global variables for the relocate_kernel routine. */
> >>  extern const unsigned char relocate_new_kernel[];
> >>  extern const unsigned long relocate_new_kernel_size;
> >> @@ -100,6 +103,13 @@ int machine_kexec_prepare(struct kimage *image)
> >>
> >>       kexec_image_info(image);
> >>
> >> +     /* TODO: Remove this message when KVM can support a kexec reboot. */
> >> +     if (IS_ENABLED(CONFIG_KVM) && is_hyp_mode_available()) {
> >> +             pr_err("%s: Your kernel is configured with KVM support (CONFIG_KVM=y) which currently does not allow for kexec re-boot.\n",
> >> +                     __func__);
> >> +             return -ENOSYS;
> >> +     }
> >
> > If you really don't want to implement KVM teardown, surely this should
> > be at the start of the series, so we don't have a point in the middle
> > where things may explode in this case?
> >
> So this caters to support systems that don't support KVM (don't boot
> in EL2) but is configured with both KVM and KEXEC?
> 
> Why not just make the kexec config dependent on !KVM ?

Sure, that would work.  I put it this way so we can get build testing of
kexec since the arm64 defconfig has KVM set.

-Geoff
Geoff Levand Jan. 26, 2015, 9 p.m. UTC | #12
Hi Mark,

On Mon, 2015-01-26 at 19:19 +0000, Mark Rutland wrote:
> On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> > Add runtime checks that fail the arm64 kexec syscall for situations that would
> > result in system instability do to problems in the KVM kernel support.
> > These checks should be removed when the KVM problems are resolved fixed.
> > 
> > Signed-off-by: Geoff Levand <geoff@infradead.org>
> > ---
> >  arch/arm64/kernel/machine_kexec.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> > index 3d84759..a36459d 100644
> > --- a/arch/arm64/kernel/machine_kexec.c
> > +++ b/arch/arm64/kernel/machine_kexec.c
> > @@ -16,6 +16,9 @@
> >  #include <asm/cacheflush.h>
> >  #include <asm/system_misc.h>
> >  
> > +/* TODO: Remove this include when KVM can support a kexec reboot. */
> > +#include <asm/virt.h>
> > +
> >  /* Global variables for the relocate_kernel routine. */
> >  extern const unsigned char relocate_new_kernel[];
> >  extern const unsigned long relocate_new_kernel_size;
> > @@ -100,6 +103,13 @@ int machine_kexec_prepare(struct kimage *image)
> >  
> >  	kexec_image_info(image);
> >  
> > +	/* TODO: Remove this message when KVM can support a kexec reboot. */
> > +	if (IS_ENABLED(CONFIG_KVM) && is_hyp_mode_available()) {
> > +		pr_err("%s: Your kernel is configured with KVM support (CONFIG_KVM=y) which currently does not allow for kexec re-boot.\n",
> > +			__func__);
> > +		return -ENOSYS;
> > +	}
> 
> If you really don't want to implement KVM teardown, surely this should
> be at the start of the series, so we don't have a point in the middle
> where things may explode in this case?

Yes, you're right.  I'm hoping we can get the KVM fix done soon so this
is not needed.

-Geoff
Geoff Levand Jan. 26, 2015, 9:48 p.m. UTC | #13
Hi Mark,

On Mon, 2015-01-26 at 19:02 +0000, Mark Rutland wrote:
> On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> > When a CPU is reset it needs to be put into the exception level it had when it
> > entered the kernel.  Update cpu_reset() to accept an argument el2_switch which
> > signals cpu_reset() to enter the soft reset address at EL2.  If el2_switch is
> > not set the soft reset address will be entered at EL1.
> > 
> > Update cpu_soft_restart() and soft_restart() to pass the return of
> > is_hyp_mode_available() as the el2_switch value to cpu_reset().  Also update the
> > comments of cpu_reset(), cpu_soft_restart() and soft_restart() to reflect this
> > change.
> > 
> > Signed-off-by: Geoff Levand <geoff@infradead.org>
> > ---
> >  arch/arm64/include/asm/proc-fns.h |  4 ++--
> >  arch/arm64/kernel/process.c       | 10 ++++++++-
> >  arch/arm64/mm/proc.S              | 47 +++++++++++++++++++++++++++++----------
> >  3 files changed, 46 insertions(+), 15 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/proc-fns.h b/arch/arm64/include/asm/proc-fns.h
> > index 9a8fd84..339394d 100644
> > --- a/arch/arm64/include/asm/proc-fns.h
> > +++ b/arch/arm64/include/asm/proc-fns.h
> > @@ -32,8 +32,8 @@ extern void cpu_cache_off(void);
> >  extern void cpu_do_idle(void);
> >  extern void cpu_do_switch_mm(unsigned long pgd_phys, struct mm_struct *mm);
> >  extern void cpu_reset(unsigned long addr) __attribute__((noreturn));
> > -void cpu_soft_restart(phys_addr_t cpu_reset,
> > -		unsigned long addr) __attribute__((noreturn));
> > +void cpu_soft_restart(phys_addr_t cpu_reset, unsigned long el2_switch,
> > +		      unsigned long addr) __attribute__((noreturn));
> >  extern void cpu_do_suspend(struct cpu_suspend_ctx *ptr);
> >  extern u64 cpu_do_resume(phys_addr_t ptr, u64 idmap_ttbr);
> >  
> > diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> > index fde9923..371bbf1 100644
> > --- a/arch/arm64/kernel/process.c
> > +++ b/arch/arm64/kernel/process.c
> > @@ -50,6 +50,7 @@
> >  #include <asm/mmu_context.h>
> >  #include <asm/processor.h>
> >  #include <asm/stacktrace.h>
> > +#include <asm/virt.h>
> >  
> >  #ifdef CONFIG_CC_STACKPROTECTOR
> >  #include <linux/stackprotector.h>
> > @@ -60,7 +61,14 @@ EXPORT_SYMBOL(__stack_chk_guard);
> >  void soft_restart(unsigned long addr)
> >  {
> >  	setup_mm_for_reboot();
> > -	cpu_soft_restart(virt_to_phys(cpu_reset), addr);
> > +
> > +	/* TODO: Remove this conditional when KVM can support CPU restart. */
> > +	if (IS_ENABLED(CONFIG_KVM))
> > +		cpu_soft_restart(virt_to_phys(cpu_reset), 0, addr);
> 
> If we haven't torn down KVM, doesn't that mean that KVM is active at EL2
> (with MMU and caches on) at this point?
> 
> If that's the case then we cannot possibly try to call kexec(), because
> we cannot touch the memory used by the page tables for those EL2
> mappings. Things will explode if we do.

This conditional is just if KVM, do things the old way (don't try to
switch exception levels).  It is to handle the system shutdown case.

Another patch in this series '[PATCH 7/8] arm64/kexec: Add checks for
KVM' assures kexec cannot happen when KVM is configured.

-Geoff
Mark Rutland Jan. 27, 2015, 4:46 p.m. UTC | #14
On Mon, Jan 26, 2015 at 09:48:48PM +0000, Geoff Levand wrote:
> Hi Mark,
> 
> On Mon, 2015-01-26 at 19:02 +0000, Mark Rutland wrote:
> > On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> > > When a CPU is reset it needs to be put into the exception level it had when it
> > > entered the kernel.  Update cpu_reset() to accept an argument el2_switch which
> > > signals cpu_reset() to enter the soft reset address at EL2.  If el2_switch is
> > > not set the soft reset address will be entered at EL1.
> > > 
> > > Update cpu_soft_restart() and soft_restart() to pass the return of
> > > is_hyp_mode_available() as the el2_switch value to cpu_reset().  Also update the
> > > comments of cpu_reset(), cpu_soft_restart() and soft_restart() to reflect this
> > > change.
> > > 
> > > Signed-off-by: Geoff Levand <geoff@infradead.org>
> > > ---
> > >  arch/arm64/include/asm/proc-fns.h |  4 ++--
> > >  arch/arm64/kernel/process.c       | 10 ++++++++-
> > >  arch/arm64/mm/proc.S              | 47 +++++++++++++++++++++++++++++----------
> > >  3 files changed, 46 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/arch/arm64/include/asm/proc-fns.h b/arch/arm64/include/asm/proc-fns.h
> > > index 9a8fd84..339394d 100644
> > > --- a/arch/arm64/include/asm/proc-fns.h
> > > +++ b/arch/arm64/include/asm/proc-fns.h
> > > @@ -32,8 +32,8 @@ extern void cpu_cache_off(void);
> > >  extern void cpu_do_idle(void);
> > >  extern void cpu_do_switch_mm(unsigned long pgd_phys, struct mm_struct *mm);
> > >  extern void cpu_reset(unsigned long addr) __attribute__((noreturn));
> > > -void cpu_soft_restart(phys_addr_t cpu_reset,
> > > -		unsigned long addr) __attribute__((noreturn));
> > > +void cpu_soft_restart(phys_addr_t cpu_reset, unsigned long el2_switch,
> > > +		      unsigned long addr) __attribute__((noreturn));
> > >  extern void cpu_do_suspend(struct cpu_suspend_ctx *ptr);
> > >  extern u64 cpu_do_resume(phys_addr_t ptr, u64 idmap_ttbr);
> > >  
> > > diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> > > index fde9923..371bbf1 100644
> > > --- a/arch/arm64/kernel/process.c
> > > +++ b/arch/arm64/kernel/process.c
> > > @@ -50,6 +50,7 @@
> > >  #include <asm/mmu_context.h>
> > >  #include <asm/processor.h>
> > >  #include <asm/stacktrace.h>
> > > +#include <asm/virt.h>
> > >  
> > >  #ifdef CONFIG_CC_STACKPROTECTOR
> > >  #include <linux/stackprotector.h>
> > > @@ -60,7 +61,14 @@ EXPORT_SYMBOL(__stack_chk_guard);
> > >  void soft_restart(unsigned long addr)
> > >  {
> > >  	setup_mm_for_reboot();
> > > -	cpu_soft_restart(virt_to_phys(cpu_reset), addr);
> > > +
> > > +	/* TODO: Remove this conditional when KVM can support CPU restart. */
> > > +	if (IS_ENABLED(CONFIG_KVM))
> > > +		cpu_soft_restart(virt_to_phys(cpu_reset), 0, addr);
> > 
> > If we haven't torn down KVM, doesn't that mean that KVM is active at EL2
> > (with MMU and caches on) at this point?
> > 
> > If that's the case then we cannot possibly try to call kexec(), because
> > we cannot touch the memory used by the page tables for those EL2
> > mappings. Things will explode if we do.
> 
> This conditional is just if KVM, do things the old way (don't try to
> switch exception levels).  It is to handle the system shutdown case.

Having grepped treewide for soft_restart, other than kexec there are no
users for arm64. So surely kexec is the only case to cater for at the
moment?

> Another patch in this series '[PATCH 7/8] arm64/kexec: Add checks for
> KVM' assures kexec cannot happen when KVM is configured.

It would be better to just move this earlier (or even better, implement
KVM teardown).

Mark.
Catalin Marinas Jan. 27, 2015, 5:39 p.m. UTC | #15
On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
> index 99c319c..4f23a48 100644
> --- a/arch/arm64/include/asm/virt.h
> +++ b/arch/arm64/include/asm/virt.h
> @@ -41,6 +41,19 @@
>  
>  #define HVC_CALL_HYP 3
>  
> +/*
> + * HVC_CALL_FUNC - Execute a function at EL2.
> + *
> + * @x0: Physical address of the function to be executed.
> + * @x1: Passed as the first argument to the function.
> + * @x2: Passed as the second argument to the function.
> + * @x3: Passed as the third argument to the function.
> + *
> + * The called function must preserve the contents of register x18.

Can you pick a register that's normally callee saved?

> + */
> +
> +#define HVC_CALL_FUNC 4
> +
>  #ifndef __ASSEMBLY__
>  
>  /*
> diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
> index e3db3fd..b5d36e7 100644
> --- a/arch/arm64/kernel/hyp-stub.S
> +++ b/arch/arm64/kernel/hyp-stub.S
> @@ -66,9 +66,20 @@ el1_sync:
>  	mrs	x0, vbar_el2
>  	b	2f
>  
> -1:	cmp	x18, #HVC_SET_VECTORS
> -	b.ne	2f
> -	msr	vbar_el2, x0
> +1:	cmp     x18, #HVC_SET_VECTORS

This line doesn't seem to have any change, apart from some whitespace.
Or did you want to drop the label?

> +	b.ne    1f
> +	msr     vbar_el2, x0
> +	b       2f
> +
> +1:	cmp     x18, #HVC_CALL_FUNC
> +	b.ne    2f
> +	mov     x18, lr
> +	mov     lr, x0
> +	mov     x0, x1
> +	mov     x1, x2
> +	mov     x2, x3
> +	blr     lr
> +	mov     lr, x18
>  
>  2:	eret
>  ENDPROC(el1_sync)

What is the calling convention for this HVC? You mentioned x18 above but
what about other registers that the called function may corrupt (x18 is
a temporary register, so it's not expected to be callee saved).
Catalin Marinas Jan. 27, 2015, 5:57 p.m. UTC | #16
On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
>  ENTRY(cpu_reset)
> -	mrs	x1, sctlr_el1
> -	bic	x1, x1, #1
> -	msr	sctlr_el1, x1			// disable the MMU
> +	mrs	x2, sctlr_el1
> +	bic	x2, x2, #1
> +	msr	sctlr_el1, x2			// disable the MMU
>  	isb
> -	ret	x0
> +
> +	cbz	x0, 1f				// el2_switch?
> +	mov	x0, x1
> +	mov	x1, xzr
> +	mov	x2, xzr
> +	mov	x3, xzr
> +	hvc	#HVC_CALL_FUNC			// no return

If that's the only user of HVC_CALL_FUNC, why do we bother with
arguments, calling convention?
Mark Rutland Jan. 27, 2015, 6 p.m. UTC | #17
On Tue, Jan 27, 2015 at 05:39:47PM +0000, Catalin Marinas wrote:
> On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> > diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
> > index 99c319c..4f23a48 100644
> > --- a/arch/arm64/include/asm/virt.h
> > +++ b/arch/arm64/include/asm/virt.h
> > @@ -41,6 +41,19 @@
> >  
> >  #define HVC_CALL_HYP 3
> >  
> > +/*
> > + * HVC_CALL_FUNC - Execute a function at EL2.
> > + *
> > + * @x0: Physical address of the function to be executed.
> > + * @x1: Passed as the first argument to the function.
> > + * @x2: Passed as the second argument to the function.
> > + * @x3: Passed as the third argument to the function.
> > + *
> > + * The called function must preserve the contents of register x18.
> 
> Can you pick a register that's normally callee saved?

We're in the hyp-stub, so we don't have a stack in EL2. Therefore we
can't stack any of the existing callee-saved register values in order to
be able to use them.

One way to avoid that would be to have asm block which issues the HVC at
EL1 stack/unstack the LR around the HVC. Then we're free to corrupt the
LR at EL2 in order to call the provided function.
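
Mark's suggestion (stacking the LR at EL1 around the HVC) could look roughly
like this on the calling side. This is a hypothetical sketch, not code from
the posted series, and the selector-in-x18 convention is taken from the
quoted hyp-stub dispatch:

```asm
	/*
	 * Hypothetical EL1-side wrapper: preserve the link register on
	 * the EL1 stack around the trap, so the EL2 stub is free to
	 * clobber LR when it branches to the supplied function.
	 */
	str	x30, [sp, #-16]!	// save LR, keep SP 16-byte aligned
	mov	x18, #HVC_CALL_FUNC	// request the function-call service
	hvc	#0			// trap into el1_sync at EL2
	ldr	x30, [sp], #16		// restore LR after the stub returns
```

With something like that, the `mov x18, lr` / `mov lr, x18` shuffle in the
stub would no longer be needed.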

[...]

> > +1:	cmp     x18, #HVC_CALL_FUNC
> > +	b.ne    2f
> > +	mov     x18, lr
> > +	mov     lr, x0
> > +	mov     x0, x1
> > +	mov     x1, x2
> > +	mov     x2, x3
> > +	blr     lr
> > +	mov     lr, x18
> >  
> >  2:	eret
> >  ENDPROC(el1_sync)
> 
> What is the calling convention for this HVC? You mentioned x18 above but
> what about other registers that the called function may corrupt (x18 is
> a temporary register, so it's not expected to be callee saved).

Other than x18, the usual PCS rules apply here. We don't have a stack,
so the function we call can't make a nested call to anything else.

Mark.
Geoff Levand Jan. 27, 2015, 6:34 p.m. UTC | #18
Hi Mark,

On Tue, 2015-01-27 at 16:46 +0000, Mark Rutland wrote:
> On Mon, Jan 26, 2015 at 09:48:48PM +0000, Geoff Levand wrote:
> > This conditional is just if KVM, do things the old way (don't try to
> > switch exception levels).  It is to handle the system shutdown case.
> 
> Having grepped treewide for soft_restart, other than kexec there are no
> users for arm64. So surely kexec is the only case to cater for at the
> moment?

Yes, I think you're right, and so it seems we can drop this patch and
just have the 'Add checks for KVM' patch.

> > Another patch in this series '[PATCH 7/8] arm64/kexec: Add checks for
> > KVM' assures kexec cannot happen when KVM is configured.
> 
> It would be better to just move this earlier (or even better, implement
> KVM teardown).

Yes, I hope we don't really need to have any KVM work-arounds.

-Geoff
AKASHI Takahiro Jan. 29, 2015, 9:36 a.m. UTC | #19
Hello,

On 01/27/2015 04:19 AM, Mark Rutland wrote:
> On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
>> Add runtime checks that fail the arm64 kexec syscall for situations that would
>> result in system instability due to problems in the KVM kernel support.
>> These checks should be removed when the KVM problems are resolved.
>>
>> Signed-off-by: Geoff Levand <geoff@infradead.org>
>> ---
>>   arch/arm64/kernel/machine_kexec.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
>> index 3d84759..a36459d 100644
>> --- a/arch/arm64/kernel/machine_kexec.c
>> +++ b/arch/arm64/kernel/machine_kexec.c
>> @@ -16,6 +16,9 @@
>>   #include <asm/cacheflush.h>
>>   #include <asm/system_misc.h>
>>
>> +/* TODO: Remove this include when KVM can support a kexec reboot. */
>> +#include <asm/virt.h>
>> +
>>   /* Global variables for the relocate_kernel routine. */
>>   extern const unsigned char relocate_new_kernel[];
>>   extern const unsigned long relocate_new_kernel_size;
>> @@ -100,6 +103,13 @@ int machine_kexec_prepare(struct kimage *image)
>>
>>   	kexec_image_info(image);
>>
>> +	/* TODO: Remove this message when KVM can support a kexec reboot. */
>> +	if (IS_ENABLED(CONFIG_KVM) && is_hyp_mode_available()) {
>> +		pr_err("%s: Your kernel is configured with KVM support (CONFIG_KVM=y) which currently does not allow for kexec re-boot.\n",
>> +			__func__);
>> +		return -ENOSYS;
>> +	}
>
> If you really don't want to implement KVM teardown, surely this should
> be at the start of the series, so we don't have a point in the middle
> where things may explode in this case?

I'm going to fix this KVM issue (teardown) in cooperation with Geoff.

Looking into kvm init code, kvm_arch_init() in particular,
I guess that teardown function (kvm_arch_exit()) should do
   (reverting kvm_timer_hyp_init() per cpu)
- stop arch timer

   (reverting cpu_init_hyp_mode() per cpu)
- flush TLB
- jump into the identity mapping (using boot_hyp_pgd?)
- disable MMU?
- restore vbar_el2 to __hyp_stub_vectors (or NULL?)

   (reverting kvm_mmu_init())
- Do we need to free page tables and a bounce(trampoline) page?

Is this good enough for safely shutting down kvm?
Am I missing anything essential, or can I skip anything?

I really appreciate your comments.
-Takahiro AKASHI

> Mark.
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>
Marc Zyngier Jan. 29, 2015, 10:59 a.m. UTC | #21
On 29/01/15 09:57, AKASHI Takahiro wrote:
> Hello,
> 
> On 01/27/2015 04:19 AM, Mark Rutland wrote:
>> On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
>>> Add runtime checks that fail the arm64 kexec syscall for situations that would
>>> result in system instability due to problems in the KVM kernel support.
>>> These checks should be removed when the KVM problems are resolved.
>>>
>>> Signed-off-by: Geoff Levand <geoff@infradead.org>
>>> ---
>>>   arch/arm64/kernel/machine_kexec.c | 10 ++++++++++
>>>   1 file changed, 10 insertions(+)
>>>
>>> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
>>> index 3d84759..a36459d 100644
>>> --- a/arch/arm64/kernel/machine_kexec.c
>>> +++ b/arch/arm64/kernel/machine_kexec.c
>>> @@ -16,6 +16,9 @@
>>>   #include <asm/cacheflush.h>
>>>   #include <asm/system_misc.h>
>>>
>>> +/* TODO: Remove this include when KVM can support a kexec reboot. */
>>> +#include <asm/virt.h>
>>> +
>>>   /* Global variables for the relocate_kernel routine. */
>>>   extern const unsigned char relocate_new_kernel[];
>>>   extern const unsigned long relocate_new_kernel_size;
>>> @@ -100,6 +103,13 @@ int machine_kexec_prepare(struct kimage *image)
>>>
>>>   	kexec_image_info(image);
>>>
>>> +	/* TODO: Remove this message when KVM can support a kexec reboot. */
>>> +	if (IS_ENABLED(CONFIG_KVM) && is_hyp_mode_available()) {
>>> +		pr_err("%s: Your kernel is configured with KVM support (CONFIG_KVM=y) which currently does not allow for kexec re-boot.\n",
>>> +			__func__);
>>> +		return -ENOSYS;
>>> +	}
>>
>> If you really don't want to implement KVM teardown, surely this should
>> be at the start of the series, so we don't have a point in the middle
>> where things may explode in this case?
> 
> I'm going to fix this KVM issue (teardown) in cooperation with Geoff.
> 
> Looking into kvm init code, kvm_arch_init() in particular,
> I guess that teardown function (kvm_arch_exit()) should do
>    (reverting kvm_timer_hyp_init() per cpu)
> - stop arch timer

No need, there shouldn't be any guest running at that point.

>    (reverting cpu_init_hyp_mode() per cpu)
> - flush TLB
> - jump into identical mapping (using boot_hyp_pgd?)
> - disable MMU?

Yes, and for that, you need to go back to an idmap

> - restore vbar_el2 to __hyp_stub_vectors (or NULL?)

It doesn't matter, you want to stay in HYP for the next kernel.

>    (reverting kvm_mmu_init())
> - Do we need to free page tables and a bounce(trampoline) page?

I don't think that's useful, you've killed the kernel already.

> Is this good enough for safely shutting down kvm?
> Am I missing anything essential, or can I skip anything?

I've outlined the steps a couple of days ago there:
http://www.spinics.net/lists/arm-kernel/msg395177.html

and the outline above should give you a nice list of things to look at.
All in all, it is pretty straightforward, provided that you're careful
(commit 5a677ce044f has most of the information, just read it
backwards... ;-).

Thanks,

	M.
Mark Rutland Jan. 29, 2015, 6:47 p.m. UTC | #22
On Thu, Jan 29, 2015 at 10:59:45AM +0000, Marc Zyngier wrote:
> 
> On 29/01/15 09:57, AKASHI Takahiro wrote:
> > Hello,
> > 
> > On 01/27/2015 04:19 AM, Mark Rutland wrote:
> >> On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> >>> Add runtime checks that fail the arm64 kexec syscall for situations that would
> >>> result in system instability due to problems in the KVM kernel support.
> >>> These checks should be removed when the KVM problems are resolved.
> >>>
> >>> Signed-off-by: Geoff Levand <geoff@infradead.org>
> >>> ---
> >>>   arch/arm64/kernel/machine_kexec.c | 10 ++++++++++
> >>>   1 file changed, 10 insertions(+)
> >>>
> >>> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> >>> index 3d84759..a36459d 100644
> >>> --- a/arch/arm64/kernel/machine_kexec.c
> >>> +++ b/arch/arm64/kernel/machine_kexec.c
> >>> @@ -16,6 +16,9 @@
> >>>   #include <asm/cacheflush.h>
> >>>   #include <asm/system_misc.h>
> >>>
> >>> +/* TODO: Remove this include when KVM can support a kexec reboot. */
> >>> +#include <asm/virt.h>
> >>> +
> >>>   /* Global variables for the relocate_kernel routine. */
> >>>   extern const unsigned char relocate_new_kernel[];
> >>>   extern const unsigned long relocate_new_kernel_size;
> >>> @@ -100,6 +103,13 @@ int machine_kexec_prepare(struct kimage *image)
> >>>
> >>>   	kexec_image_info(image);
> >>>
> >>> +	/* TODO: Remove this message when KVM can support a kexec reboot. */
> >>> +	if (IS_ENABLED(CONFIG_KVM) && is_hyp_mode_available()) {
> >>> +		pr_err("%s: Your kernel is configured with KVM support (CONFIG_KVM=y) which currently does not allow for kexec re-boot.\n",
> >>> +			__func__);
> >>> +		return -ENOSYS;
> >>> +	}
> >>
> >> If you really don't want to implement KVM teardown, surely this should
> >> be at the start of the series, so we don't have a point in the middle
> >> where things may explode in this case?
> > 
> > I'm going to fix this KVM issue (teardown) in cooperation with Geoff.
> > 
> > Looking into kvm init code, kvm_arch_init() in particular,
> > I guess that teardown function (kvm_arch_exit()) should do
> >    (reverting kvm_timer_hyp_init() per cpu)
> > - stop arch timer
> 
> No need, there shouldn't be any guest running at that point.
> 
> >    (reverting cpu_init_hyp_mode() per cpu)
> > - flush TLB
> > - jump into identical mapping (using boot_hyp_pgd?)
> > - disable MMU?
> 
> Yes, and for that, you need to go back to an idmap
> 
> > - restore vbar_el2 to __hyp_stub_vectors (or NULL?)
> 
> It doesn't matter, you want to stay in HYP for the next kernel.

Well, it depends on how the teardown is implemented. I'd imagined we'd
have KVM tear itself down (restoring the hyp stub) as part of shut down.
Later kexec/soft_restart would call the stub to get back to EL2.

That way kexec doesn't need to know anything about KVM, and it has one
path that works regardless of whether KVM is compiled into the kernel.
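
In pseudocode, the model Mark describes might look like the following. This
is only an illustration: kvm_teardown_cpu() and restore_hyp_stub_vectors()
are invented names, and no such teardown exists in the kernel at this point
in the thread.

```c
/* Per-CPU KVM shutdown, executed during machine_shutdown() before
 * kexec runs.  All steps are notional. */
static void kvm_teardown_cpu(void)
{
	/* flush EL2 TLBs, return to the idmap, disable the EL2 MMU ... */
	restore_hyp_stub_vectors();	/* vbar_el2 = __hyp_stub_vectors */
}

/* With the stub restored, soft_restart() needs no KVM special case: */
void soft_restart(unsigned long addr)
{
	setup_mm_for_reboot();
	cpu_soft_restart(virt_to_phys(cpu_reset),
			 is_hyp_mode_available(), addr);
}
```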

Mark.
AKASHI Takahiro Jan. 30, 2015, 6:10 a.m. UTC | #23
Hello Marc, Mark

Thank you for your useful comments.
I need to look at Marc's commit(5a677ce044f) more carefully
about trampoline code.

On 01/30/2015 03:47 AM, Mark Rutland wrote:
> On Thu, Jan 29, 2015 at 10:59:45AM +0000, Marc Zyngier wrote:
>>
>> On 29/01/15 09:57, AKASHI Takahiro wrote:
>>> Hello,
>>>
>>> On 01/27/2015 04:19 AM, Mark Rutland wrote:
>>>> On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
>>>>> Add runtime checks that fail the arm64 kexec syscall for situations that would
>>>>> result in system instability due to problems in the KVM kernel support.
>>>>> These checks should be removed when the KVM problems are resolved.
>>>>>
>>>>> Signed-off-by: Geoff Levand <geoff@infradead.org>
>>>>> ---
>>>>>    arch/arm64/kernel/machine_kexec.c | 10 ++++++++++
>>>>>    1 file changed, 10 insertions(+)
>>>>>
>>>>> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
>>>>> index 3d84759..a36459d 100644
>>>>> --- a/arch/arm64/kernel/machine_kexec.c
>>>>> +++ b/arch/arm64/kernel/machine_kexec.c
>>>>> @@ -16,6 +16,9 @@
>>>>>    #include <asm/cacheflush.h>
>>>>>    #include <asm/system_misc.h>
>>>>>
>>>>> +/* TODO: Remove this include when KVM can support a kexec reboot. */
>>>>> +#include <asm/virt.h>
>>>>> +
>>>>>    /* Global variables for the relocate_kernel routine. */
>>>>>    extern const unsigned char relocate_new_kernel[];
>>>>>    extern const unsigned long relocate_new_kernel_size;
>>>>> @@ -100,6 +103,13 @@ int machine_kexec_prepare(struct kimage *image)
>>>>>
>>>>>    	kexec_image_info(image);
>>>>>
>>>>> +	/* TODO: Remove this message when KVM can support a kexec reboot. */
>>>>> +	if (IS_ENABLED(CONFIG_KVM) && is_hyp_mode_available()) {
>>>>> +		pr_err("%s: Your kernel is configured with KVM support (CONFIG_KVM=y) which currently does not allow for kexec re-boot.\n",
>>>>> +			__func__);
>>>>> +		return -ENOSYS;
>>>>> +	}
>>>>
>>>> If you really don't want to implement KVM teardown, surely this should
>>>> be at the start of the series, so we don't have a point in the middle
>>>> where things may explode in this case?
>>>
>>> I'm going to fix this KVM issue (teardown) in cooperation with Geoff.
>>>
>>> Looking into kvm init code, kvm_arch_init() in particular,
>>> I guess that teardown function (kvm_arch_exit()) should do
>>>     (reverting kvm_timer_hyp_init() per cpu)
>>> - stop arch timer
>>
>> No need, there shouldn't be any guest running at that point.
>>
>>>     (reverting cpu_init_hyp_mode() per cpu)
>>> - flush TLB
>>> - jump into the identity mapping (using boot_hyp_pgd?)
>>> - disable MMU?
>>
>> Yes, and for that, you need to go back to an idmap
>>
>>> - restore vbar_el2 to __hyp_stub_vectors (or NULL?)
>>
>> It doesn't matter, you want to stay in HYP for the next kernel.
>
> Well, it depends on how the teardown is implemented. I'd imagined we'd
> have KVM tear itself down (restoring the hyp stub) as part of shut down.
> Later kexec/soft_restart would call the stub to get back to EL2.
>
> That way kexec doesn't need to know anything about KVM, and it has one
> path that works regardless of whether KVM is compiled into the kernel.

Initially, I thought that we would define kvm_arch_exit() and call it
somewhere in the middle of the kexec path (no idea where yet).
But Geoff suggested that I implement a new hvc call, HVC_CPU_SHUTDOWN(??),
and have it called via cpu_notifier(CPU_DYING_FROZEN), initiated by
machine_shutdown() from kernel_kexec().
(As you know, a hook is already there for cpu online in kvm/arm.c.)

Is this the right place to put teardown function?
(I'm not sure whether this hook will also be called on the boot cpu.)

Thanks,
-Takahiro AKASHI



> Mark.
>
Mark Rutland Jan. 30, 2015, 12:14 p.m. UTC | #24
On Fri, Jan 30, 2015 at 06:10:53AM +0000, AKASHI Takahiro wrote:
> Hello Marc, Mark
> 
> Thank you for your useful comments.
> I need to look at Marc's commit(5a677ce044f) more carefully
> about trampoline code.
> 
> On 01/30/2015 03:47 AM, Mark Rutland wrote:
> > On Thu, Jan 29, 2015 at 10:59:45AM +0000, Marc Zyngier wrote:
> >>
> >> On 29/01/15 09:57, AKASHI Takahiro wrote:
> >>> Hello,
> >>>
> >>> On 01/27/2015 04:19 AM, Mark Rutland wrote:
> >>>> On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> >>>>> Add runtime checks that fail the arm64 kexec syscall for situations that would
> >>>>> result in system instability due to problems in the KVM kernel support.
> >>>>> These checks should be removed when the KVM problems are resolved.
> >>>>>
> >>>>> Signed-off-by: Geoff Levand <geoff@infradead.org>
> >>>>> ---
> >>>>>    arch/arm64/kernel/machine_kexec.c | 10 ++++++++++
> >>>>>    1 file changed, 10 insertions(+)
> >>>>>
> >>>>> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> >>>>> index 3d84759..a36459d 100644
> >>>>> --- a/arch/arm64/kernel/machine_kexec.c
> >>>>> +++ b/arch/arm64/kernel/machine_kexec.c
> >>>>> @@ -16,6 +16,9 @@
> >>>>>    #include <asm/cacheflush.h>
> >>>>>    #include <asm/system_misc.h>
> >>>>>
> >>>>> +/* TODO: Remove this include when KVM can support a kexec reboot. */
> >>>>> +#include <asm/virt.h>
> >>>>> +
> >>>>>    /* Global variables for the relocate_kernel routine. */
> >>>>>    extern const unsigned char relocate_new_kernel[];
> >>>>>    extern const unsigned long relocate_new_kernel_size;
> >>>>> @@ -100,6 +103,13 @@ int machine_kexec_prepare(struct kimage *image)
> >>>>>
> >>>>>    	kexec_image_info(image);
> >>>>>
> >>>>> +	/* TODO: Remove this message when KVM can support a kexec reboot. */
> >>>>> +	if (IS_ENABLED(CONFIG_KVM) && is_hyp_mode_available()) {
> >>>>> +		pr_err("%s: Your kernel is configured with KVM support (CONFIG_KVM=y) which currently does not allow for kexec re-boot.\n",
> >>>>> +			__func__);
> >>>>> +		return -ENOSYS;
> >>>>> +	}
> >>>>
> >>>> If you really don't want to implement KVM teardown, surely this should
> >>>> be at the start of the series, so we don't have a point in the middle
> >>>> where things may explode in this case?
> >>>
> >>> I'm going to fix this KVM issue (teardown) in cooperation with Geoff.
> >>>
> >>> Looking into kvm init code, kvm_arch_init() in particular,
> >>> I guess that teardown function (kvm_arch_exit()) should do
> >>>     (reverting kvm_timer_hyp_init() per cpu)
> >>> - stop arch timer
> >>
> >> No need, there shouldn't be any guest running at that point.
> >>
> >>>     (reverting cpu_init_hyp_mode() per cpu)
> >>> - flush TLB
> >>> - jump into the identity mapping (using boot_hyp_pgd?)
> >>> - disable MMU?
> >>
> >> Yes, and for that, you need to go back to an idmap
> >>
> >>> - restore vbar_el2 to __hyp_stub_vectors (or NULL?)
> >>
> >> It doesn't matter, you want to stay in HYP for the next kernel.
> >
> > Well, it depends on how the teardown is implemented. I'd imagined we'd
> > have KVM tear itself down (restoring the hyp stub) as part of shut down.
> > Later kexec/soft_restart would call the stub to get back to EL2.
> >
> > That way kexec doesn't need to know anything about KVM, and it has one
> > path that works regardless of whether KVM is compiled into the kernel.
> 
> Initially, I thought that we would define kvm_arch_exit() and call it
> somewhere in the middle of the kexec path (no idea where yet).
> But Geoff suggested that I implement a new hvc call, HVC_CPU_SHUTDOWN(??),
> and have it called via cpu_notifier(CPU_DYING_FROZEN), initiated by
> machine_shutdown() from kernel_kexec().
> (As you know, a hook is already there for cpu online in kvm/arm.c.)

I think it would make far more sense to have the KVM teardown entirely
contained within KVM (e.g. in kvm_arch_exit). That way there's no need
for an interdependency between KVM and kexec, and we'd be a step closer
to KVM as a module.

I'm not keen on having KVM teardown in the middle of the kexec path.

Mark.
Geoff Levand Jan. 30, 2015, 7:48 p.m. UTC | #25
Hi Takahiro.

On Fri, 2015-01-30 at 15:10 +0900, AKASHI Takahiro wrote:
> Initially, I thought that we would define kvm_arch_exit() and call it
> somewhere in the middle of kexec path (no idea yet).
> But Geoff suggested me to implement a new hvc call, HVC_CPU_SHUTDOWN(??),
> and make it called via cpu_notifier(CPU_DYING_FROZEN) initiated by
> machine_shutdown() from kernel_kexec().

As an initial implementation we can hook into the CPU_DYING_FROZEN
notifier sent to hyp_init_cpu_notify().  The longer term solution
should use kvm_arch_hardware_enable() and kvm_arch_hardware_disable().

The calls to cpu_notifier(CPU_DYING_FROZEN) are part of cpu hot
plug, and independent of kexec.  If someone were to add spin-table
cpu un-plug, then it would be used for that also.  It seems we should
be able to test without kexec by using cpu hot plug.

To tear down KVM you need to get back to hyp mode, and hence
the need for HVC_CPU_SHUTDOWN.  The sequence I envisioned would
be like this:

cpu_notifier(CPU_DYING_FROZEN)
 -> kvm_cpu_shutdown() 
    prepare for hvc
    -> HVC_CPU_SHUTDOWN
       now in hyp mode, do KVM tear down, restore default exception vectors

Once the default exception vectors are restored soft_restart()
can then execute the cpu_reset routine in EL2.

Some notes are here for those with access:  https://cards.linaro.org/browse/KWG-611

-Geoff
Geoff Levand Jan. 30, 2015, 9:47 p.m. UTC | #26
Hi Catalin,

On Tue, 2015-01-27 at 17:57 +0000, Catalin Marinas wrote:
> On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> >  ENTRY(cpu_reset)
> > -	mrs	x1, sctlr_el1
> > -	bic	x1, x1, #1
> > -	msr	sctlr_el1, x1			// disable the MMU
> > +	mrs	x2, sctlr_el1
> > +	bic	x2, x2, #1
> > +	msr	sctlr_el1, x2			// disable the MMU
> >  	isb
> > -	ret	x0
> > +
> > +	cbz	x0, 1f				// el2_switch?
> > +	mov	x0, x1
> > +	mov	x1, xzr
> > +	mov	x2, xzr
> > +	mov	x3, xzr
> > +	hvc	#HVC_CALL_FUNC			// no return
> 
> If that's the only user of HVC_CALL_FUNC, why do we bother with
> arguments, calling convention?

It was intended that HVC_CALL_FUNC be a mechanism to call a generic
function in hyp mode.  cpu_reset() is the only user of it now.

As Mark explained in another post, the use of x18 by the hyp stub
is to avoid the complication of setting up a stack for EL2.  We
thought this solution was acceptable since there are relatively
few HVC_CALL_FUNC calls and we could assure they would do the
right thing.

-Geoff
Geoff Levand Jan. 30, 2015, 9:52 p.m. UTC | #27
Hi Catalin,


On Tue, 2015-01-27 at 17:39 +0000, Catalin Marinas wrote:
> On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> > diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
> > index 99c319c..4f23a48 100644
> > --- a/arch/arm64/include/asm/virt.h
> > +++ b/arch/arm64/include/asm/virt.h
> > @@ -41,6 +41,19 @@
> >  
> >  #define HVC_CALL_HYP 3
> >  
> > +/*
> > + * HVC_CALL_FUNC - Execute a function at EL2.
> > + *
> > + * @x0: Physical address of the function to be executed.
> > + * @x1: Passed as the first argument to the function.
> > + * @x2: Passed as the second argument to the function.
> > + * @x3: Passed as the third argument to the function.
> > + *
> > + * The called function must preserve the contents of register x18.
> 
> Can you pick a register that's normally callee saved?

Mark covered this in his reply.

> > + */
> > +
> > +#define HVC_CALL_FUNC 4
> > +
> >  #ifndef __ASSEMBLY__
> >  
> >  /*
> > diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
> > index e3db3fd..b5d36e7 100644
> > --- a/arch/arm64/kernel/hyp-stub.S
> > +++ b/arch/arm64/kernel/hyp-stub.S
> > @@ -66,9 +66,20 @@ el1_sync:
> >  	mrs	x0, vbar_el2
> >  	b	2f
> >  
> > -1:	cmp	x18, #HVC_SET_VECTORS
> > -	b.ne	2f
> > -	msr	vbar_el2, x0
> > +1:	cmp     x18, #HVC_SET_VECTORS
> 
> This line doesn't seem to have any change, apart from some whitespace.
> Or did you want to drop the label?

Some whitespace problems got in there from so many rebases.  I went
through the series and cleaned them up.


> > +	b.ne    1f
> > +	msr     vbar_el2, x0
> > +	b       2f
> > +
> > +1:	cmp     x18, #HVC_CALL_FUNC
> > +	b.ne    2f
> > +	mov     x18, lr
> > +	mov     lr, x0
> > +	mov     x0, x1
> > +	mov     x1, x2
> > +	mov     x2, x3
> > +	blr     lr
> > +	mov     lr, x18
> >  
> >  2:	eret
> >  ENDPROC(el1_sync)
> 
> What is the calling convention for this HVC? You mentioned x18 above but
> what about other registers that the called function may corrupt (x18 is
> a temporary register, so it's not expected to be callee saved).

Again, Mark covered this in his reply.
Geoff Levand Jan. 30, 2015, 11:31 p.m. UTC | #28
On Mon, 2015-01-26 at 18:26 +0000, Catalin Marinas wrote:
> On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> > To allow for additional hcalls to be defined and to make the arm64 hcall API
> > more consistent across exception vector routines, change the hcall implementations
> > to use the ISS field of the ESR_EL2 register to specify the hcall type.
> > 
> > The existing arm64 hcall implementations are limited in that they only allow
> > for two distinct hcalls; with the x0 register either zero, or not zero.  Also,
> > the API of the hyp-stub exception vector routines and the KVM exception vector
> > routines differ; hyp-stub uses a non-zero value in x0 to implement
> > __hyp_set_vectors, whereas KVM uses it to implement kvm_call_hyp.
> > 
> > Define three new preprocessor macros HVC_GET_VECTORS, HVC_SET_VECTORS and
> > HVC_CALL_HYP and to be used as hcall type specifiers and convert the
> > existing __hyp_get_vectors(), __hyp_set_vectors() and kvm_call_hyp() routines
> > to use these new macros when executing an HVC call.  Also change the
> > corresponding hyp-stub and KVM el1_sync exception vector routines to use these
> > new macros.
> > 
> > Signed-off-by: Geoff Levand <geoff@infradead.org>
> 
> Using the #imm value for HVC to separate what gets called looks fine to
> me. However, I'd like to see a review from Marc/Christoffer on this
> patch.

Marc, Christoffer, comments please?

> Some comments below:
> 
> > diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
> > index 7a5df52..99c319c 100644
> > --- a/arch/arm64/include/asm/virt.h
> > +++ b/arch/arm64/include/asm/virt.h
> > @@ -21,6 +21,26 @@
> >  #define BOOT_CPU_MODE_EL1	(0xe11)
> >  #define BOOT_CPU_MODE_EL2	(0xe12)
> >  
> > +/*
> > + * HVC_GET_VECTORS - Return the value of the vbar_el2 register.
> > + */
> > +
> > +#define HVC_GET_VECTORS 1
> > +
> > +/*
> > + * HVC_SET_VECTORS - Set the value of the vbar_el2 register.
> > + *
> > + * @x0: Physical address of the new vector table.
> > + */
> > +
> > +#define HVC_SET_VECTORS 2
> > +
> > +/*
> > + * HVC_CALL_HYP - Execute a hyp routine.
> > + */
> > +
> > +#define HVC_CALL_HYP 3
> 
> I think you can ignore this case (make it the default), just define it
> as 0 as that's the normal use-case after initialisation and avoid
> checking it explicitly.

OK, I changed this so that HVC_CALL_HYP is the default at 0.

> >  /*
> > diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
> > index a272f33..e3db3fd 100644
> > --- a/arch/arm64/kernel/hyp-stub.S
> > +++ b/arch/arm64/kernel/hyp-stub.S
> > @@ -22,6 +22,7 @@
> >  #include <linux/irqchip/arm-gic-v3.h>
> >  
> >  #include <asm/assembler.h>
> > +#include <asm/kvm_arm.h>
> >  #include <asm/ptrace.h>
> >  #include <asm/virt.h>
> >  
> > @@ -53,14 +54,22 @@ ENDPROC(__hyp_stub_vectors)
> >  	.align 11
> >  
> >  el1_sync:
> > -	mrs	x1, esr_el2
> > -	lsr	x1, x1, #26
> > -	cmp	x1, #0x16
> > -	b.ne	2f				// Not an HVC trap
> > -	cbz	x0, 1f
> > -	msr	vbar_el2, x0			// Set vbar_el2
> > +	mrs	x18, esr_el2
> > +	lsr	x17, x18, #ESR_ELx_EC_SHIFT
> > +	and	x18, x18, #ESR_ELx_ISS_MASK
> > +
> > +	cmp     x17, #ESR_ELx_EC_HVC64
> > +	b.ne    2f				// Not an HVC trap
> > +
> > +	cmp	x18, #HVC_GET_VECTORS
> > +	b.ne	1f
> > +	mrs	x0, vbar_el2
> >  	b	2f
> > -1:	mrs	x0, vbar_el2			// Return vbar_el2
> > +
> > +1:	cmp	x18, #HVC_SET_VECTORS
> > +	b.ne	2f
> > +	msr	vbar_el2, x0
> > +
> >  2:	eret
> >  ENDPROC(el1_sync)
> 
> You seem to be using x17 and x18 here freely. Do you have any guarantees
> that the caller saved/restored those registers? I guess you assume they
> are temporary registers and the caller first branches to a function
> (like __kvm_hyp_call) and expects them to be corrupted. But I'm not sure
> that's always the case. Take for example the __invoke_psci_fn_hvc where
> the function is in C (we should change this for other reasons).

Yes, I assume the compiler will not expect them to be preserved.  I
missed __invoke_psci_fn_hvc.  Can we just add x17 and x18 to the
clobbered list?

        asm volatile(
                        __asmeq("%0", "x0")
                        __asmeq("%1", "x1")
                        __asmeq("%2", "x2")
                        __asmeq("%3", "x3")
                        "hvc    #0\n"
                : "+r" (function_id)
-               : "r" (arg0), "r" (arg1), "r" (arg2));
+               : "r" (arg0), "r" (arg1), "r" (arg2)
+               : "x17", "x18");


> > diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
> > index c0d8202..1916c89 100644
> > --- a/arch/arm64/kvm/hyp.S
> > +++ b/arch/arm64/kvm/hyp.S
> > @@ -27,6 +27,7 @@
> >  #include <asm/kvm_asm.h>
> >  #include <asm/kvm_mmu.h>
> >  #include <asm/memory.h>
> > +#include <asm/virt.h>
> >  
> >  #define CPU_GP_REG_OFFSET(x)	(CPU_GP_REGS + x)
> >  #define CPU_XREG_OFFSET(x)	CPU_GP_REG_OFFSET(CPU_USER_PT_REGS + 8*x)
> > @@ -1106,12 +1107,9 @@ __hyp_panic_str:
> >   * in Hyp mode (see init_hyp_mode in arch/arm/kvm/arm.c).  Return values are
> >   * passed in r0 and r1.
> >   *
> > - * A function pointer with a value of 0 has a special meaning, and is
> > - * used to implement __hyp_get_vectors in the same way as in
> > - * arch/arm64/kernel/hyp_stub.S.
> >   */
> >  ENTRY(kvm_call_hyp)
> > -	hvc	#0
> > +	hvc	#HVC_CALL_HYP
> >  	ret
> >  ENDPROC(kvm_call_hyp)
> >  
> > @@ -1142,6 +1140,7 @@ el1_sync:					// Guest trapped into EL2
> >  
> >  	mrs	x1, esr_el2
> >  	lsr	x2, x1, #ESR_ELx_EC_SHIFT
> > +	and	x0, x1, #ESR_ELx_ISS_MASK
> >  
> >  	cmp	x2, #ESR_ELx_EC_HVC64
> >  	b.ne	el1_trap
> > @@ -1150,15 +1149,19 @@ el1_sync:					// Guest trapped into EL2
> >  	cbnz	x3, el1_trap			// called HVC
> >  
> >  	/* Here, we're pretty sure the host called HVC. */
> > +	mov	x18, x0
> 
> Same comment here about corrupting x18. If it is safe, maybe add some
> comments in the calling place.

I added a comment regarding this to virt.h where the HVC_XXX macros
are defined.  I'll post that fixed up patch for review.

> 
> >  	pop	x2, x3
> >  	pop	x0, x1
> >  
> > -	/* Check for __hyp_get_vectors */
> > -	cbnz	x0, 1f
> > +	cmp	x18, #HVC_GET_VECTORS
> > +	b.ne	1f
> >  	mrs	x0, vbar_el2
> >  	b	2f
> >  
> > -1:	push	lr, xzr
> > +1:	cmp	x18, #HVC_CALL_HYP
> > +	b.ne	2f
> > +
> > +	push	lr, xzr
> 
> At this point, we expect either HVC_GET_VECTORS or HVC_CALL_HYP. I think
> you can simply assume HVC_CALL_HYP as default and ignore the additional
> cmp.

OK, did that.

-Geoff
AKASHI Takahiro Feb. 2, 2015, 8:18 a.m. UTC | #29
Geoff,

On 01/31/2015 04:48 AM, Geoff Levand wrote:
> Hi Takahiro.
>
> On Fri, 2015-01-30 at 15:10 +0900, AKASHI Takahiro wrote:
>> Initially, I thought that we would define kvm_arch_exit() and call it
>> somewhere in the middle of kexec path (no idea yet).
>> But Geoff suggested me to implement a new hvc call, HVC_CPU_SHUTDOWN(??),
>> and make it called via cpu_notifier(CPU_DYING_FROZEN) initiated by
>> machine_shutdown() from kernel_kexec().
>
> As an initial implementation we can hook into the CPU_DYING_FROZEN
> notifier sent to hyp_init_cpu_notify().  The longer term solution
> should use kvm_arch_hardware_enable() and kvm_arch_hardware_disable().

Are these two different approaches? I mean,
kexec will initiate cpu hotplug:
kernel_kexec() -> machine_shutdown() -> disable_nonboot_cpu()
    -> _cpu_down() -> cpu_notify_nofail(CPU_DEAD|...)

On the other hand, kvm already has a hook into kvm_arch_hardware_disable():
   (ignoring kvm_usage_count here)
kvm_cpu_hotplug(CPU_DYING) -> hardware_disable()
    -> hardware_disable_nolock() -> kvm_arch_hardware_disable()

So it seems that we don't have to add a new hook at hyp_init_cpu_notify()
if kvm_arch_hardware_disable() is properly implemented.
disable_nonboot_cpu() will not invoke cpu hotplug on the *boot* cpu, and
we should handle it in a separate way though.

Do I misunderstand anything here?


-Takahiro AKASHI

> The calls to cpu_notifier(CPU_DYING_FROZEN) are part of cpu hot
> plug, and independent of kexec.  If someone were to add spin-table
> cpu un-plug, then it would be used for that also.  It seems we should
> be able to test without kexec by using cpu hot plug.
>
> To tear down KVM you need to get back to hyp mode, and hence
> the need for HVC_CPU_SHUTDOWN.  The sequence I envisioned would
> be like this:
>
> cpu_notifier(CPU_DYING_FROZEN)
>   -> kvm_cpu_shutdown()
>      prepare for hvc
>      -> HVC_CPU_SHUTDOWN
>         now in hyp mode, do KVM tear down, restore default exception vectors
>
> Once the default exception vectors are restored soft_restart()
> can then execute the cpu_reset routine in EL2.
>
> Some notes are here for those with access:  https://cards.linaro.org/browse/KWG-611
>
> -Geoff
>
Catalin Marinas Feb. 2, 2015, 4:04 p.m. UTC | #30
On Fri, Jan 30, 2015 at 11:31:21PM +0000, Geoff Levand wrote:
> On Mon, 2015-01-26 at 18:26 +0000, Catalin Marinas wrote:
> > On Sat, Jan 17, 2015 at 12:23:34AM +0000, Geoff Levand wrote:
> > >  /*
> > > diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
> > > index a272f33..e3db3fd 100644
> > > --- a/arch/arm64/kernel/hyp-stub.S
> > > +++ b/arch/arm64/kernel/hyp-stub.S
> > > @@ -22,6 +22,7 @@
> > >  #include <linux/irqchip/arm-gic-v3.h>
> > >  
> > >  #include <asm/assembler.h>
> > > +#include <asm/kvm_arm.h>
> > >  #include <asm/ptrace.h>
> > >  #include <asm/virt.h>
> > >  
> > > @@ -53,14 +54,22 @@ ENDPROC(__hyp_stub_vectors)
> > >  	.align 11
> > >  
> > >  el1_sync:
> > > -	mrs	x1, esr_el2
> > > -	lsr	x1, x1, #26
> > > -	cmp	x1, #0x16
> > > -	b.ne	2f				// Not an HVC trap
> > > -	cbz	x0, 1f
> > > -	msr	vbar_el2, x0			// Set vbar_el2
> > > +	mrs	x18, esr_el2
> > > +	lsr	x17, x18, #ESR_ELx_EC_SHIFT
> > > +	and	x18, x18, #ESR_ELx_ISS_MASK
> > > +
> > > +	cmp     x17, #ESR_ELx_EC_HVC64
> > > +	b.ne    2f				// Not an HVC trap
> > > +
> > > +	cmp	x18, #HVC_GET_VECTORS
> > > +	b.ne	1f
> > > +	mrs	x0, vbar_el2
> > >  	b	2f
> > > -1:	mrs	x0, vbar_el2			// Return vbar_el2
> > > +
> > > +1:	cmp	x18, #HVC_SET_VECTORS
> > > +	b.ne	2f
> > > +	msr	vbar_el2, x0
> > > +
> > >  2:	eret
> > >  ENDPROC(el1_sync)
> > 
> > You seem to be using x17 and x18 here freely. Do you have any guarantees
> > that the caller saved/restored those registers? I guess you assume they
> > are temporary registers and the caller first branches to a function
> > (like __kvm_hyp_call) and expects them to be corrupted. But I'm not sure
> > that's always the case. Take for example the __invoke_psci_fn_hvc where
> > the function is in C (we should change this for other reasons).
> 
> Yes, I assume the compiler will not expect them to be preserved.  I
> missed __invoke_psci_fn_hvc.  Can we just add x17 and x18 to the
> clobbered list?
> 
>         asm volatile(
>                         __asmeq("%0", "x0")
>                         __asmeq("%1", "x1")
>                         __asmeq("%2", "x2")
>                         __asmeq("%3", "x3")
>                         "hvc    #0\n"
>                 : "+r" (function_id)
> -               : "r" (arg0), "r" (arg1), "r" (arg2));
> +               : "r" (arg0), "r" (arg1), "r" (arg2)
> +               : "x17", "x18");

I think we can ignore these because they would be called from a guest
context and IIUC we would only clobber x18 on the host HVC side.
Geoff Levand Feb. 6, 2015, 12:11 a.m. UTC | #31
Hi Takahiro,

On Mon, 2015-02-02 at 17:18 +0900, AKASHI Takahiro wrote:
> On 01/31/2015 04:48 AM, Geoff Levand wrote:
> > As an initial implementation we can hook into the CPU_DYING_FROZEN
> > notifier sent to hyp_init_cpu_notify().  The longer term solution
> > should use kvm_arch_hardware_enable() and kvm_arch_hardware_disable().
> 
> Are these two different approaches? 

Yes, these are two different solutions: one initial work-around, and a
more involved proper solution.  Hooking into the CPU_DYING_FROZEN
notifier would be an initial fix.  The proper solution would be to move
the KVM setup to kvm_arch_hardware_enable(), and the shutdown to
kvm_arch_hardware_disable().


> kernel_exec() -> machine_shutdown() -> disable_nonboot_cpu()
>     -> _cpu_down() -> cpu_notify_nofail(CPU_DEAD|...)
> 
> On the other hand, kvm already has a hook into kvm_arch_hardware_disable():
>    (ignoring kvm_usage_count here)
> kvm_cpu_hotplug(CPU_DYING) -> hardware_disable()
>     -> hardware_disable_nolock() -> kvm_arch_hardware_disable()
> 
> So it seems that we don't have to add a new hook at hyp_init_cpu_notify()
> if kvm_arch_hardware_disable() is properly implemented.

Yes, that is correct.  But, as above, you would also need to update the
KVM startup to use kvm_arch_hardware_enable().

> disable_nonboot_cpu() will not inovke cpu hotplug on *boot* cpu, and
> we should handle it in a separate way though.

IIRC, the secondary cpus go through PSCI on shutdown, and that path
is working OK.  Maybe I am mistaken though.

The primary cpu shutdown (hyp stubs restored) is what is missing.  The
primary cpu goes through cpu_soft_restart(), and that is what is
currently failing.

-Geoff
AKASHI Takahiro Feb. 6, 2015, 4:18 a.m. UTC | #32
On 02/06/2015 09:11 AM, Geoff Levand wrote:
> Hi Takahiro,
>
> On Mon, 2015-02-02 at 17:18 +0900, AKASHI Takahiro wrote:
>> On 01/31/2015 04:48 AM, Geoff Levand wrote:
>>> As an initial implementation we can hook into the CPU_DYING_FROZEN
>>> notifier sent to hyp_init_cpu_notify().  The longer term solution
>>> should use kvm_arch_hardware_enable() and kvm_arch_hardware_disable().
>>
>> Are these two different approaches?
>
> Yes, these are two different solutions,  One initial work-around, and a
> more involved proper solution.  Hooking into the CPU_DYING_FROZEN
> notifier would be a initial fix.  The proper solution would be to move
> the KVM setup to kvm_arch_hardware_enable(), and the shutdown to
> kvm_arch_hardware_disable().
>
>
>> kernel_exec() -> machine_shutdown() -> disable_nonboot_cpu()
>>      -> _cpu_down() -> cpu_notify_nofail(CPU_DEAD|...)
>>
>> On the other hand, kvm already has a hook into kvm_arch_hardware_disable():
>>     (ignoring kvm_usage_count here)
>> kvm_cpu_hotplug(CPU_DYING) -> hardware_disable()
>>      -> hardware_disable_nolock() -> kvm_arch_hardware_disable()
>>
>> So it seems that we don't have to add a new hook at hyp_init_cpu_notify()
>> if kvm_arch_hardware_disable() is properly implemented.
>
> Yes, that is correct.  But, as above, you would also need to update the
> KVM startup to use kvm_arch_hardware_enable().
>
>> disable_nonboot_cpu() will not inovke cpu hotplug on *boot* cpu, and
>> we should handle it in a separate way though.
>
> IIRC, the secondary cpus go through PSCI on shutdown, and that path
> is working OK.  Maybe I am mistaken though.

If so, why should we add a hook at hyp_init_cpu_notify() as an initial work-around?

> The primary cpu shutdown (hyp stubs restored) is what is missing.  The
> primary cpu goes through cpu_soft_restart(), and that is what is
> currently failing.

Yeah, we will call the teardown function manually in soft_restart().

-Takahiro AKASHI
>
> -Geoff
>
Geoff Levand Feb. 6, 2015, 7:06 a.m. UTC | #33
On Fri, 2015-02-06 at 13:18 +0900, AKASHI Takahiro wrote:
> On 02/06/2015 09:11 AM, Geoff Levand wrote:
> > Hi Takahiro,
> >
> > On Mon, 2015-02-02 at 17:18 +0900, AKASHI Takahiro wrote:
> >> On 01/31/2015 04:48 AM, Geoff Levand wrote:
> >>> As an initial implementation we can hook into the CPU_DYING_FROZEN
> >>> notifier sent to hyp_init_cpu_notify().  The longer term solution
> >>> should use kvm_arch_hardware_enable() and kvm_arch_hardware_disable().
> >>
> >> Are these two different approaches?
> >
> > Yes, these are two different solutions,  One initial work-around, and a
> > more involved proper solution.  Hooking into the CPU_DYING_FROZEN
> > notifier would be a initial fix.  The proper solution would be to move
> > the KVM setup to kvm_arch_hardware_enable(), and the shutdown to
> > kvm_arch_hardware_disable().
> >
> >
> >> kernel_exec() -> machine_shutdown() -> disable_nonboot_cpu()
> >>      -> _cpu_down() -> cpu_notify_nofail(CPU_DEAD|...)
> >>
> >> On the other hand, kvm already has a hook into kvm_arch_hardware_disable():
> >>     (ignoring kvm_usage_count here)
> >> kvm_cpu_hotplug(CPU_DYING) -> hardware_disable()
> >>      -> hardware_disable_nolock() -> kvm_arch_hardware_disable()
> >>
> >> So it seems that we don't have to add a new hook at hyp_init_cpu_notify()
> >> if kvm_arch_hardware_disable() is properly implemented.
> >
> > Yes, that is correct.  But, as above, you would also need to update the
> > KVM startup to use kvm_arch_hardware_enable().
> >
> >> disable_nonboot_cpu() will not inovke cpu hotplug on *boot* cpu, and
> >> we should handle it in a separate way though.
> >
> > IIRC, the secondary cpus go through PSCI on shutdown, and that path
> > is working OK.  Maybe I am mistaken though.
> 
> If so, why should we add a hook at hyp_init_cpu_notify() as initial work-around?

To tear down KVM on the primary cpu.

> > The primary cpu shutdown (hyp stubs restored) is what is missing.  The
> > primary cpu goes through cpu_soft_restart(), and that is what is
> > currently failing.
> 
> Yeah, we will call teardown function manually in soft_restart();

I think the KVM tear down should happen independent of soft_restart().
When soft_restart() is called, KVM should have already been torn down.

-Geoff
Christoffer Dall Feb. 19, 2015, 8:57 p.m. UTC | #34
On Fri, Jan 30, 2015 at 03:33:48PM -0800, Geoff Levand wrote:
> To allow for additional hcalls to be defined and to make the arm64 hcall API
> more consistent across exception vector routines, change the hcall implementations
> to use the ISS field of the ESR_EL2 register to specify the hcall type.

how does this make things more consistent?  Do we have other examples of
things using the immediate field which I'm missing?

> 
> The existing arm64 hcall implementations are limited in that they only allow
> for two distinct hcalls; with the x0 register either zero, or not zero.  Also,
> the API of the hyp-stub exception vector routines and the KVM exception vector
> routines differ; hyp-stub uses a non-zero value in x0 to implement
> __hyp_set_vectors, whereas KVM uses it to implement kvm_call_hyp.

this seems orthogonal to the use of the immediate field vs. x0 though,
so why is the immediate field preferred again?

> 
> Define three new preprocessor macros HVC_CALL_HYP, HVC_GET_VECTORS, and
> HVC_SET_VECTORS to be used as hcall type specifiers and convert the
> existing __hyp_get_vectors(), __hyp_set_vectors() and kvm_call_hyp() routines
> to use these new macros when executing an HVC call.  Also change the
> corresponding hyp-stub and KVM el1_sync exception vector routines to use these
> new macros.
> 
> Signed-off-by: Geoff Levand <geoff@infradead.org>
> ---
>  arch/arm64/include/asm/virt.h | 27 +++++++++++++++++++++++++++
>  arch/arm64/kernel/hyp-stub.S  | 32 +++++++++++++++++++++-----------
>  arch/arm64/kernel/psci.c      |  3 ++-
>  arch/arm64/kvm/hyp.S          | 16 +++++++++-------
>  4 files changed, 59 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
> index 7a5df52..eb10368 100644
> --- a/arch/arm64/include/asm/virt.h
> +++ b/arch/arm64/include/asm/virt.h
> @@ -18,6 +18,33 @@
>  #ifndef __ASM__VIRT_H
>  #define __ASM__VIRT_H
>  
> +/*
> + * The arm64 hcall implementation uses the ISS field of the ESR_EL2 register to
> + * specify the hcall type.  The exception handlers are allowed to use registers
> + * x17 and x18 in their implementation.  Any routine issuing an hcall must not
> + * expect these registers to be preserved.
> + */

I thought the existing use of registers were based on the arm procedure
call standard so we didn't have to worry about adding more caller-save
registers.

Don't we now have to start adding code around callers to make sure
callers know that x17 and x18 may be clobbered?

Thanks,
-Christoffer
Geoff Levand Feb. 25, 2015, 10:09 p.m. UTC | #35
Hi Christoffer,

On Thu, 2015-02-19 at 21:57 +0100, Christoffer Dall wrote:
> On Fri, Jan 30, 2015 at 03:33:48PM -0800, Geoff Levand wrote:
> > To allow for additional hcalls to be defined and to make the arm64 hcall API
> > more consistent across exception vector routines, change the hcall implementations
> > to use the ISS field of the ESR_EL2 register to specify the hcall type.
> 
> how does this make things more consistent?  Do we have other examples of
> things using the immediate field which I'm missing?

As I detail in the next paragraph, by consistent I mean in the API exposed,
not in the implementation.  This point is really secondary to the need for
more hyper calls as I discuss below.

> > The existing arm64 hcall implementations are limited in that they only allow
> > for two distinct hcalls; with the x0 register either zero, or not zero.  Also,
> > the API of the hyp-stub exception vector routines and the KVM exception vector
> > routines differ; hyp-stub uses a non-zero value in x0 to implement
> > __hyp_set_vectors, whereas KVM uses it to implement kvm_call_hyp.
> 
> this seems orthogonal to the use of the immediate field vs. x0 though,
> so why it the immediate field preferred again?

When a CPU is reset via cpu_soft_restart() we need to execute the caller
supplied reset routine at the exception level the kernel was entered at.
So, for a kernel entered at EL2, we need a way to execute that routine at
EL2.

The current hyp-stub vector implementation, which uses x0, is limited
to two hyper calls; __hyp_get_vectors and __hyp_set_vectors.  To
support cpu_soft_restart() we need a third hyper call, one which
allows for code to be executed at EL2.  My proposed use of the
immediate value of the hvc instruction will allow for 2^16 distinct
hyper calls.
 
> > Define three new preprocessor macros HVC_CALL_HYP, HVC_GET_VECTORS, and
> > HVC_SET_VECTORS to be used as hcall type specifiers and convert the
> > existing __hyp_get_vectors(), __hyp_set_vectors() and kvm_call_hyp() routines
> > to use these new macros when executing an HVC call.  Also change the
> > corresponding hyp-stub and KVM el1_sync exception vector routines to use these
> > new macros.
> > 
> > Signed-off-by: Geoff Levand <geoff@infradead.org>
> > ---
> >  arch/arm64/include/asm/virt.h | 27 +++++++++++++++++++++++++++
> >  arch/arm64/kernel/hyp-stub.S  | 32 +++++++++++++++++++++-----------
> >  arch/arm64/kernel/psci.c      |  3 ++-
> >  arch/arm64/kvm/hyp.S          | 16 +++++++++-------
> >  4 files changed, 59 insertions(+), 19 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
> > index 7a5df52..eb10368 100644
> > --- a/arch/arm64/include/asm/virt.h
> > +++ b/arch/arm64/include/asm/virt.h
> > @@ -18,6 +18,33 @@
> >  #ifndef __ASM__VIRT_H
> >  #define __ASM__VIRT_H
> >  
> > +/*
> > + * The arm64 hcall implementation uses the ISS field of the ESR_EL2 register to
> > + * specify the hcall type.  The exception handlers are allowed to use registers
> > + * x17 and x18 in their implementation.  Any routine issuing an hcall must not
> > + * expect these registers to be preserved.
> > + */
> 
> I thought the existing use of registers were based on the arm procedure
> call standard so we didn't have to worry about adding more caller-save
> registers.
> 
> Don't we now have to start adding code around callers to make sure
> callers know that x17 and x18 may be clobbered?

We use x17 and x18 to allow hyper calls to work without a stack, which
is needed for cpu_soft_restart(). The procedure call standard says that
these are temporary registers, so a C compiler should not expect these
to be preserved.

-Geoff
Christoffer Dall March 2, 2015, 10:13 p.m. UTC | #36
On Wed, Feb 25, 2015 at 02:09:30PM -0800, Geoff Levand wrote:
> Hi Christoffer,
> 
> On Thu, 2015-02-19 at 21:57 +0100, Christoffer Dall wrote:
> > On Fri, Jan 30, 2015 at 03:33:48PM -0800, Geoff Levand wrote:
> > > To allow for additional hcalls to be defined and to make the arm64 hcall API
> > > more consistent across exception vector routines, change the hcall implementations
> > > to use the ISS field of the ESR_EL2 register to specify the hcall type.
> > 
> > how does this make things more consistent?  Do we have other examples of
> > things using the immediate field which I'm missing?
> 
> As I detail in the next paragraph, by consistent I mean in the API exposed,
> not in the implementation.  This point is really secondary to the need for
> more hyper calls as I discuss below.
> 
> > > The existing arm64 hcall implementations are limited in that they only allow
> > > for two distinct hcalls; with the x0 register either zero, or not zero.  Also,
> > > the API of the hyp-stub exception vector routines and the KVM exception vector
> > > routines differ; hyp-stub uses a non-zero value in x0 to implement
> > > __hyp_set_vectors, whereas KVM uses it to implement kvm_call_hyp.
> > 
> > this seems orthogonal to the use of the immediate field vs. x0 though,
> > so why it the immediate field preferred again?
> 
> When a CPU is reset via cpu_soft_restart() we need to execute the caller
> supplied reset routine at the exception level the kernel was entered at.
> So, for a kernel entered at EL2, we need a way to execute that routine at
> EL2.
> 
> The current hyp-stub vector implementation, which uses x0, is limited
> to two hyper calls; __hyp_get_vectors and __hyp_set_vectors.  To
> support cpu_soft_restart() we need a third hyper call, one which
> allows for code to be executed at EL2.  My proposed use of the
> immediate value of the hvc instruction will allow for 2^16 distinct
> hyper calls.
>  

right, but using x0 allows for 2^64 distinct hypercalls.  Just to be
clear, I'm fine with using immediate field if there are no good reasons
not to, I was just curious as to what direct benefit it has.  After
thinking about it a bit, from my point of view, the benefit would be the
clarity that x0 is the first argument like a normal procedure call, so no
need to shift things around.  Is this part of the equation or am I
missing the overall purpose here?

> > > Define three new preprocessor macros HVC_CALL_HYP, HVC_GET_VECTORS, and
> > > HVC_SET_VECTORS to be used as hcall type specifiers and convert the
> > > existing __hyp_get_vectors(), __hyp_set_vectors() and kvm_call_hyp() routines
> > > to use these new macros when executing an HVC call.  Also change the
> > > corresponding hyp-stub and KVM el1_sync exception vector routines to use these
> > > new macros.
> > > 
> > > Signed-off-by: Geoff Levand <geoff@infradead.org>
> > > ---
> > >  arch/arm64/include/asm/virt.h | 27 +++++++++++++++++++++++++++
> > >  arch/arm64/kernel/hyp-stub.S  | 32 +++++++++++++++++++++-----------
> > >  arch/arm64/kernel/psci.c      |  3 ++-
> > >  arch/arm64/kvm/hyp.S          | 16 +++++++++-------
> > >  4 files changed, 59 insertions(+), 19 deletions(-)
> > > 
> > > diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
> > > index 7a5df52..eb10368 100644
> > > --- a/arch/arm64/include/asm/virt.h
> > > +++ b/arch/arm64/include/asm/virt.h
> > > @@ -18,6 +18,33 @@
> > >  #ifndef __ASM__VIRT_H
> > >  #define __ASM__VIRT_H
> > >  
> > > +/*
> > > + * The arm64 hcall implementation uses the ISS field of the ESR_EL2 register to
> > > + * specify the hcall type.  The exception handlers are allowed to use registers
> > > + * x17 and x18 in their implementation.  Any routine issuing an hcall must not
> > > + * expect these registers to be preserved.
> > > + */
> > 
> > I thought the existing use of registers was based on the arm procedure
> > call standard so we didn't have to worry about adding more caller-save
> > registers.
> > 
> > Don't we now have to start adding code around callers to make sure
> > callers know that x17 and x18 may be clobbered?
> 
> We use x17 and x18 to allow hyper calls to work without a stack, which
> is needed for cpu_soft_restart(). The procedure call standard says that
> these are temporary registers, so a C compiler should not expect these
> to be preserved.
> 
Then why not use r9-15 or r0-r7 as the AAPCS clearly specifies as
caller preserve instead of the registers which may have special meaning
etc.?

-Christoffer
Geoff Levand March 2, 2015, 11:22 p.m. UTC | #37
Hi Christoffer,

On Mon, 2015-03-02 at 14:13 -0800, Christoffer Dall wrote:
> On Wed, Feb 25, 2015 at 02:09:30PM -0800, Geoff Levand wrote:
> > The current hyp-stub vector implementation, which uses x0, is limited
> > to two hyper calls; __hyp_get_vectors and __hyp_set_vectors.  To
> > support cpu_soft_restart() we need a third hyper call, one which
> > allows for code to be executed at EL2.  My proposed use of the
> > immediate value of the hvc instruction will allow for 2^16 distinct
> > hyper calls.  
> 
> right, but using x0 allows for 2^64 distinct hypercalls.  Just to be
> clear, I'm fine with using the immediate field if there are no good
> reasons not to, I was just curious as to what direct benefit it has.
> After thinking about it a bit, from my point of view, the benefit would
> be the clarity that x0 is the first argument like a normal procedure
> call, so no need to shift things around.  Is this part of the equation
> or am I missing the overall purpose here?

Yes, in general it will make marshaling of args, etc. easier.  Also,
to me, if we are going to change the implementation it seems to be
the most natural way.
 
> > > > + * The arm64 hcall implementation uses the ISS field of the ESR_EL2 register to
> > > > + * specify the hcall type.  The exception handlers are allowed to use registers
> > > > + * x17 and x18 in their implementation.  Any routine issuing an hcall must not
> > > > + * expect these registers to be preserved.
> > > > + */
> > > 
> > > I thought the existing use of registers was based on the arm procedure
> > > call standard so we didn't have to worry about adding more caller-save
> > > registers.
> > > 
> > > Don't we now have to start adding code around callers to make sure
> > > callers know that x17 and x18 may be clobbered?
> > 
> > We use x17 and x18 to allow hyper calls to work without a stack, which
> > is needed for cpu_soft_restart(). The procedure call standard says that
> > these are temporary registers, so a C compiler should not expect these
> > to be preserved.
> > 
> Then why not use r9-15 or r0-r7 as the AAPCS clearly specifies as
> caller preserve instead of the registers which may have special meaning
> etc.?

OK, I will change these to x9, x10.  I'll post a v8 patch set soon.

-Geoff
Christopher Covington March 3, 2015, 9:47 p.m. UTC | #38
Hi Geoff,

On 03/02/2015 06:22 PM, Geoff Levand wrote:
> Hi Christoffer,
> 
> On Mon, 2015-03-02 at 14:13 -0800, Christoffer Dall wrote:
>> On Wed, Feb 25, 2015 at 02:09:30PM -0800, Geoff Levand wrote:
>>> The current hyp-stub vector implementation, which uses x0, is limited
>>> to two hyper calls; __hyp_get_vectors and __hyp_set_vectors.  To
>>> support cpu_soft_restart() we need a third hyper call, one which
>>> allows for code to be executed at EL2.  My proposed use of the
>>> immediate value of the hvc instruction will allow for 2^16 distinct
>>> hyper calls.  
>>
>> right, but using x0 allows for 2^64 distinct hypercalls.  Just to be
>> clear, I'm fine with using the immediate field if there are no good
>> reasons not to, I was just curious as to what direct benefit it has.
>> After thinking about it a bit, from my point of view, the benefit would
>> be the clarity that x0 is the first argument like a normal procedure
>> call, so no need to shift things around.  Is this part of the equation
>> or am I missing the overall purpose here?
> 
> Yes, in general it will make marshaling of args, etc. easier.  Also,
> to me, if we are going to change the implementation it seems to be
> the most natural way.

From reading the architecture documentation, I too expected the hypervisor
call instruction's immediate and the instruction specific syndrome to be used.
However I vaguely recall someone pointing out that reading the exception
syndrome register and extracting the instruction specific syndrome is bound to
take longer than simply using a general purpose register.

One might also consider alignment with the SMC Calling Convention document
[1], which, while originally written for SMC, is also used for HVC by PSCI [2].

1. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0028a/index.html
2. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0022c/index.html

Chris
Geoff Levand March 3, 2015, 10:35 p.m. UTC | #39
Hi Christopher,

On Tue, 2015-03-03 at 16:47 -0500, Christopher Covington wrote:
> On 03/02/2015 06:22 PM, Geoff Levand wrote:
> > Yes, in general it will make marshaling of args, etc. easier.  Also,
> > to me, if we are going to change the implementation it seems to be
> > the most natural way.
> 
> From reading the architecture documentation, I too expected the hypervisor
> call instruction's immediate and the instruction specific syndrome to be used.
> However I vaguely recall someone pointing out that reading the exception
> syndrome register and extracting the instruction specific syndrome is bound to
> take longer than simply using a general purpose register.
> 
> One might also consider alignment with the SMC Calling Convention document
> [1], which, while originally written for SMC, is also used for HVC by PSCI [2].
> 
> 1. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0028a/index.html
> 2. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0022c/index.html

On looking at the SMC document, I found this:
 
  The SMC instruction encodes an immediate value as defined by the ARM
  architecture [1][2]. The size of this and mechanism to access the
  immediate value differ between the ARM instruction sets. Additionally,
  it is time consuming for 32-bit Secure Monitor code to access this
  immediate value. Consequently:

   o An SMC immediate value of Zero must be used.
   o All other SMC immediate values are reserved. 

The first problem of differing access methods does not exist for our
case; the kernel will always use the same method.

As for the second problem, the current implementation already reads
esr_el2.  The new code just adds an AND instruction to mask the ISS
field.  I don't think this would be more overhead than shifting
registers.

One alternative would be to use a high register, say x7, and limit the
hcalls to args x0-x6, but I don't think this gains much over using the
immediate.

-Geoff