[v12,00/16] arm64 kexec kernel patches v12

Message ID cover.1448403503.git.geoff@infradead.org
State New

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/geoff/linux-kexec.git kexec-v12

Message

Geoff Levand Nov. 24, 2015, 10:25 p.m. UTC
Hi All,

This series adds the core support for kexec re-boot and kdump on ARM64.  This
version of the series combines Takahiro's kdump patches with my kexec patches.

Changes since v11:

o No changes, rebase to v4.4-rc2.

Changes since v10:

o Move the flush of the new image from arm64_relocate_new_kernel to machine_kexec.
o Pass values to arm64_relocate_new_kernel in registers, not in global variables.
o Fixups to setting the sctlr_el1 and sctlr_el2 flags.

To load a second stage kernel and execute a kexec re-boot, or to work with kdump
on ARM64 systems, a series of patches to kexec-tools [2], which has not yet been
merged upstream, is needed.

I have tested kexec with the ARM Foundation model, and Takahiro has reported
that kdump is working on the 96boards HiKey developer board.  Kexec on EFI
systems works correctly.  More ACPI + kexec testing is needed.

Patch 1 here moves the macros from proc-macros.S to asm/assembler.h so that the
dcache_line_size macro it defines can be used by kexec's relocate kernel
routine.

Patches 2 & 3 rework the ARM64 hcall mechanism to give the CPU reset routines
the ability to switch exception levels from EL1 to EL2 for kernels that were
entered in EL2.

Patch 4 allows KVM to handle a CPU reset.

Patches 5 - 7 add back the ARM64 CPU reset support that was recently removed
from the kernel.

Patches 8 - 10 add kexec support.

Patches 11-16 add kdump support.

Please consider all patches for inclusion.

[1]  https://git.kernel.org/cgit/linux/kernel/git/geoff/linux-kexec.git
[2]  https://git.kernel.org/cgit/linux/kernel/git/geoff/kexec-tools.git

-Geoff

The following changes since commit 1ec218373b8ebda821aec00bb156a9c94fad9cd4:

  Linux 4.4-rc2 (2015-11-22 16:45:59 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/geoff/linux-kexec.git kexec-v12

for you to fetch changes up to 83db66df70fc1f7cd4da8b580d264b9320991cf0:

  arm64: kdump: relax BUG_ON() if more than one cpus are still active (2015-11-24 13:59:27 -0800)

----------------------------------------------------------------
AKASHI Takahiro (7):
      arm64: kvm: allows kvm cpu hotplug
      arm64: kdump: reserve memory for crash dump kernel
      arm64: kdump: implement machine_crash_shutdown()
      arm64: kdump: add kdump support
      arm64: kdump: update a kernel doc
      arm64: kdump: enable kdump in the arm64 defconfig
      arm64: kdump: relax BUG_ON() if more than one cpus are still active

Geoff Levand (9):
      arm64: Fold proc-macros.S into assembler.h
      arm64: Convert hcalls to use HVC immediate value
      arm64: Add new hcall HVC_CALL_FUNC
      arm64: Add back cpu_reset routines
      Revert "arm64: mm: remove unused cpu_set_idmap_tcr_t0sz function"
      Revert "arm64: remove dead code"
      arm64/kexec: Add core kexec support
      arm64/kexec: Add pr_devel output
      arm64/kexec: Enable kexec in the arm64 defconfig

 Documentation/kdump/kdump.txt        |  32 ++++-
 arch/arm/include/asm/kvm_host.h      |  10 +-
 arch/arm/include/asm/kvm_mmu.h       |   1 +
 arch/arm/kvm/arm.c                   |  79 ++++++-----
 arch/arm/kvm/mmu.c                   |   5 +
 arch/arm64/Kconfig                   |  22 ++++
 arch/arm64/configs/defconfig         |   2 +
 arch/arm64/include/asm/assembler.h   |  48 ++++++-
 arch/arm64/include/asm/kexec.h       |  80 +++++++++++
 arch/arm64/include/asm/kvm_host.h    |  16 ++-
 arch/arm64/include/asm/kvm_mmu.h     |   1 +
 arch/arm64/include/asm/mmu.h         |   1 +
 arch/arm64/include/asm/mmu_context.h |  35 +++--
 arch/arm64/include/asm/virt.h        |  49 +++++++
 arch/arm64/kernel/Makefile           |   3 +
 arch/arm64/kernel/cpu-reset.S        |  84 ++++++++++++
 arch/arm64/kernel/cpu-reset.h        |  22 ++++
 arch/arm64/kernel/crash_dump.c       |  71 ++++++++++
 arch/arm64/kernel/hyp-stub.S         |  43 ++++--
 arch/arm64/kernel/machine_kexec.c    | 248 +++++++++++++++++++++++++++++++++++
 arch/arm64/kernel/relocate_kernel.S  | 131 ++++++++++++++++++
 arch/arm64/kernel/setup.c            |   7 +-
 arch/arm64/kernel/smp.c              |  16 ++-
 arch/arm64/kvm/hyp-init.S            |  34 ++++-
 arch/arm64/kvm/hyp.S                 |  44 +++++--
 arch/arm64/mm/cache.S                |   2 -
 arch/arm64/mm/init.c                 |  83 ++++++++++++
 arch/arm64/mm/mmu.c                  |  11 ++
 arch/arm64/mm/proc-macros.S          |  64 ---------
 arch/arm64/mm/proc.S                 |   3 -
 include/uapi/linux/kexec.h           |   1 +
 31 files changed, 1097 insertions(+), 151 deletions(-)
 create mode 100644 arch/arm64/include/asm/kexec.h
 create mode 100644 arch/arm64/kernel/cpu-reset.S
 create mode 100644 arch/arm64/kernel/cpu-reset.h
 create mode 100644 arch/arm64/kernel/crash_dump.c
 create mode 100644 arch/arm64/kernel/machine_kexec.c
 create mode 100644 arch/arm64/kernel/relocate_kernel.S
 delete mode 100644 arch/arm64/mm/proc-macros.S

Comments

Pratyush Anand Nov. 27, 2015, 1:13 p.m. UTC | #1
Hi Geoff,

On 24/11/2015:10:25:34 PM, Geoff Levand wrote:
> +	/* Test the entry flags. */
> +.Ltest_source:
> +	tbz	x18, IND_SOURCE_BIT, .Ltest_indirection
> +
> +	mov x20, x13				/*  x20 = copy dest */
> +	mov x21, x12				/*  x21 = copy src */

Till v10 we had invalidation of the relocated destination page to PoC here. I
could not understand why it was removed. Removing that piece of code breaks
kexec booting with mustang. I need [1] to kexec boot into the second kernel on
the mustang platform.

[1]
https://github.com/pratyushanand/linux/commit/431e3247391981a1e8b2864a83a5743e8a274cb9

~Pratyush
Marc Zyngier Nov. 27, 2015, 1:54 p.m. UTC | #2
On 24/11/15 22:25, Geoff Levand wrote:
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> 
> The current kvm implementation on arm64 does cpu-specific initialization
> at system boot, and has no way to gracefully shutdown a core in terms of
> kvm. This prevents, especially, kexec from rebooting the system on a boot
> core in EL2.
> 
> This patch adds a cpu tear-down function and also puts an existing cpu-init
> code into a separate function, kvm_arch_hardware_disable() and
> kvm_arch_hardware_enable() respectively.
> We don't need arm64-specific cpu hotplug hook any more.
> 
> Since this patch modifies common part of code between arm and arm64, one
> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
> compiling errors.
> 
> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> ---
>  arch/arm/include/asm/kvm_host.h   | 10 ++++-
>  arch/arm/include/asm/kvm_mmu.h    |  1 +
>  arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>  arch/arm/kvm/mmu.c                |  5 +++
>  arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>  arch/arm64/include/asm/kvm_mmu.h  |  1 +
>  arch/arm64/include/asm/virt.h     |  9 +++++
>  arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>  arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>  9 files changed, 138 insertions(+), 48 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 6692982..9242765 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -214,6 +214,15 @@ static inline void __cpu_init_hyp_mode(phys_addr_t boot_pgd_ptr,
>  	kvm_call_hyp((void*)hyp_stack_ptr, vector_ptr, pgd_ptr);
>  }
>  
> +static inline void __cpu_reset_hyp_mode(phys_addr_t boot_pgd_ptr,
> +					phys_addr_t phys_idmap_start)
> +{
> +	/*
> +	 * TODO
> +	 * kvm_call_reset(boot_pgd_ptr, phys_idmap_start);
> +	 */
> +}
> +
>  static inline int kvm_arch_dev_ioctl_check_extension(long ext)
>  {
>  	return 0;
> @@ -226,7 +235,6 @@ void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
>  
>  struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);
>  
> -static inline void kvm_arch_hardware_disable(void) {}
>  static inline void kvm_arch_hardware_unsetup(void) {}
>  static inline void kvm_arch_sync_events(struct kvm *kvm) {}
>  static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 405aa18..dc6fadf 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -66,6 +66,7 @@ void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu);
>  phys_addr_t kvm_mmu_get_httbr(void);
>  phys_addr_t kvm_mmu_get_boot_httbr(void);
>  phys_addr_t kvm_get_idmap_vector(void);
> +phys_addr_t kvm_get_idmap_start(void);
>  int kvm_mmu_init(void);
>  void kvm_clear_hyp_idmap(void);
>  
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index eab83b2..a5d9d74 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -16,7 +16,6 @@
>   * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
>   */
>  
> -#include <linux/cpu.h>
>  #include <linux/cpu_pm.h>
>  #include <linux/errno.h>
>  #include <linux/err.h>
> @@ -61,6 +60,8 @@ static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1);
>  static u8 kvm_next_vmid;
>  static DEFINE_SPINLOCK(kvm_vmid_lock);
>  
> +static DEFINE_PER_CPU(unsigned char, kvm_arm_hardware_enabled);
> +
>  static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu)
>  {
>  	BUG_ON(preemptible());
> @@ -85,11 +86,6 @@ struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void)
>  	return &kvm_arm_running_vcpu;
>  }
>  
> -int kvm_arch_hardware_enable(void)
> -{
> -	return 0;
> -}
> -
>  int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>  {
>  	return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
> @@ -959,7 +955,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
>  	}
>  }
>  
> -static void cpu_init_hyp_mode(void *dummy)
> +int kvm_arch_hardware_enable(void)
>  {
>  	phys_addr_t boot_pgd_ptr;
>  	phys_addr_t pgd_ptr;
> @@ -967,6 +963,9 @@ static void cpu_init_hyp_mode(void *dummy)
>  	unsigned long stack_page;
>  	unsigned long vector_ptr;
>  
> +	if (__hyp_get_vectors() != hyp_default_vectors)
> +		return 0;
> +
>  	/* Switch from the HYP stub to our own HYP init vector */
>  	__hyp_set_vectors(kvm_get_idmap_vector());
>  
> @@ -979,38 +978,50 @@ static void cpu_init_hyp_mode(void *dummy)
>  	__cpu_init_hyp_mode(boot_pgd_ptr, pgd_ptr, hyp_stack_ptr, vector_ptr);
>  
>  	kvm_arm_init_debug();
> +
> +	return 0;
>  }
>  
> -static int hyp_init_cpu_notify(struct notifier_block *self,
> -			       unsigned long action, void *cpu)
> +void kvm_arch_hardware_disable(void)
>  {
> -	switch (action) {
> -	case CPU_STARTING:
> -	case CPU_STARTING_FROZEN:
> -		if (__hyp_get_vectors() == hyp_default_vectors)
> -			cpu_init_hyp_mode(NULL);
> -		break;
> -	}
> +	phys_addr_t boot_pgd_ptr;
> +	phys_addr_t phys_idmap_start;
>  
> -	return NOTIFY_OK;
> -}
> +	if (__hyp_get_vectors() == hyp_default_vectors)
> +		return;
>  
> -static struct notifier_block hyp_init_cpu_nb = {
> -	.notifier_call = hyp_init_cpu_notify,
> -};
> +	boot_pgd_ptr = kvm_mmu_get_boot_httbr();
> +	phys_idmap_start = kvm_get_idmap_start();
> +
> +	__cpu_reset_hyp_mode(boot_pgd_ptr, phys_idmap_start);
> +}
>  
>  #ifdef CONFIG_CPU_PM
>  static int hyp_init_cpu_pm_notifier(struct notifier_block *self,
>  				    unsigned long cmd,
>  				    void *v)
>  {
> -	if (cmd == CPU_PM_EXIT &&
> -	    __hyp_get_vectors() == hyp_default_vectors) {
> -		cpu_init_hyp_mode(NULL);
> +	switch (cmd) {
> +	case CPU_PM_ENTER:
> +		if (__hyp_get_vectors() != hyp_default_vectors)
> +			__this_cpu_write(kvm_arm_hardware_enabled, 1);
> +		else
> +			__this_cpu_write(kvm_arm_hardware_enabled, 0);
> +		/*
> +		 * don't call kvm_arch_hardware_disable() in case of
> +		 * CPU_PM_ENTER because it doesn't actually save any state.
> +		 */
> +
> +		return NOTIFY_OK;
> +	case CPU_PM_EXIT:
> +		if (__this_cpu_read(kvm_arm_hardware_enabled))
> +			kvm_arch_hardware_enable();
> +
>  		return NOTIFY_OK;
> -	}
>  
> -	return NOTIFY_DONE;
> +	default:
> +		return NOTIFY_DONE;
> +	}
>  }
>  
>  static struct notifier_block hyp_init_cpu_pm_nb = {
> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>  	}
>  
>  	/*
> -	 * Execute the init code on each CPU.
> -	 */
> -	on_each_cpu(cpu_init_hyp_mode, NULL, 1);
> -
> -	/*
>  	 * Init HYP view of VGIC
>  	 */
>  	err = kvm_vgic_hyp_init();
> @@ -1186,26 +1192,15 @@ int kvm_arch_init(void *opaque)
>  		}
>  	}
>  
> -	cpu_notifier_register_begin();
> -
>  	err = init_hyp_mode();
>  	if (err)
>  		goto out_err;
>  
> -	err = __register_cpu_notifier(&hyp_init_cpu_nb);
> -	if (err) {
> -		kvm_err("Cannot register HYP init CPU notifier (%d)\n", err);
> -		goto out_err;
> -	}
> -
> -	cpu_notifier_register_done();
> -
>  	hyp_cpu_pm_init();
>  
>  	kvm_coproc_table_init();
>  	return 0;
>  out_err:
> -	cpu_notifier_register_done();
>  	return err;
>  }
>  
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 6984342..69b4a33 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -1644,6 +1644,11 @@ phys_addr_t kvm_get_idmap_vector(void)
>  	return hyp_idmap_vector;
>  }
>  
> +phys_addr_t kvm_get_idmap_start(void)
> +{
> +	return hyp_idmap_start;
> +}
> +
>  int kvm_mmu_init(void)
>  {
>  	int err;
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index a35ce72..0b540f8 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -223,6 +223,7 @@ struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
>  struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void);
>  
>  u64 kvm_call_hyp(void *hypfn, ...);
> +void kvm_call_reset(phys_addr_t boot_pgd_ptr, phys_addr_t phys_idmap_start);
>  void force_vm_exit(const cpumask_t *mask);
>  void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
>  
> @@ -247,7 +248,20 @@ static inline void __cpu_init_hyp_mode(phys_addr_t boot_pgd_ptr,
>  		     hyp_stack_ptr, vector_ptr);
>  }
>  
> -static inline void kvm_arch_hardware_disable(void) {}
> +static inline void __cpu_reset_hyp_mode(phys_addr_t boot_pgd_ptr,
> +					phys_addr_t phys_idmap_start)
> +{
> +	/*
> +	 * Call reset code, and switch back to stub hyp vectors.
> +	 */
> +	kvm_call_reset(boot_pgd_ptr, phys_idmap_start);
> +}
> +
> +struct vgic_sr_vectors {
> +	void	*save_vgic;
> +	void	*restore_vgic;
> +};
> +

Why is this structure being re-introduced? It has been removed in 48f8bd5.

>  static inline void kvm_arch_hardware_unsetup(void) {}
>  static inline void kvm_arch_sync_events(struct kvm *kvm) {}
>  static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 6150567..ff5a087 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -98,6 +98,7 @@ void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu);
>  phys_addr_t kvm_mmu_get_httbr(void);
>  phys_addr_t kvm_mmu_get_boot_httbr(void);
>  phys_addr_t kvm_get_idmap_vector(void);
> +phys_addr_t kvm_get_idmap_start(void);
>  int kvm_mmu_init(void);
>  void kvm_clear_hyp_idmap(void);
>  
> diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
> index 3070096..bca79f9 100644
> --- a/arch/arm64/include/asm/virt.h
> +++ b/arch/arm64/include/asm/virt.h
> @@ -58,9 +58,18 @@
>  
>  #define HVC_CALL_FUNC 3
>  
> +/*
> + * HVC_RESET_CPU - Reset cpu in EL2 to initial state.
> + *
> + * @x0: entry address in trampoline code in va
> + * @x1: identical mapping page table in pa
> + */
> +
>  #define BOOT_CPU_MODE_EL1	(0xe11)
>  #define BOOT_CPU_MODE_EL2	(0xe12)
>  
> +#define HVC_RESET_CPU 4
> +

Why isn't this next to the comment that documents it?

>  #ifndef __ASSEMBLY__
>  
>  /*
> diff --git a/arch/arm64/kvm/hyp-init.S b/arch/arm64/kvm/hyp-init.S
> index 2e67a48..1925163 100644
> --- a/arch/arm64/kvm/hyp-init.S
> +++ b/arch/arm64/kvm/hyp-init.S
> @@ -139,6 +139,39 @@ merged:
>  	eret
>  ENDPROC(__kvm_hyp_init)
>  
> +	/*
> +	 * x0: HYP boot pgd
> +	 * x1: HYP phys_idmap_start
> +	 */
> +ENTRY(__kvm_hyp_reset)
> +	/* We're in trampoline code in VA, switch back to boot page tables */
> +	msr	ttbr0_el2, x0
> +	isb
> +
> +	/* Invalidate the old TLBs */
> +	tlbi	alle2
> +	dsb	sy

I find the location of that TLBI a bit weird. If you want to do
something more useful, move it to after the MMU has been disabled.

> +
> +	/* Branch into PA space */
> +	adr	x0, 1f
> +	bfi	x1, x0, #0, #PAGE_SHIFT
> +	br	x1
> +
> +	/* We're now in idmap, disable MMU */
> +1:	mrs	x0, sctlr_el2
> +	ldr	x1, =SCTLR_EL2_FLAGS
> +	bic	x0, x0, x1		// Clear SCTL_M and etc
> +	msr	sctlr_el2, x0
> +	isb
> +
> +	/* Install stub vectors */
> +	adrp	x0, __hyp_stub_vectors
> +	add	x0, x0, #:lo12:__hyp_stub_vectors
> +	msr	vbar_el2, x0
> +
> +	eret
> +ENDPROC(__kvm_hyp_reset)
> +
>  	.ltorg
>  
>  	.popsection
> diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
> index 1bef8db..aca11d6 100644
> --- a/arch/arm64/kvm/hyp.S
> +++ b/arch/arm64/kvm/hyp.S
> @@ -939,6 +939,11 @@ ENTRY(kvm_call_hyp)
>  	ret
>  ENDPROC(kvm_call_hyp)
>  
> +ENTRY(kvm_call_reset)
> +	hvc	#HVC_RESET_CPU
> +	ret
> +ENDPROC(kvm_call_reset)
> +
>  .macro invalid_vector	label, target
>  	.align	2
>  \label:
> @@ -982,10 +987,27 @@ el1_sync:					// Guest trapped into EL2
>  	cmp	x18, #HVC_GET_VECTORS
>  	b.ne	1f
>  	mrs	x0, vbar_el2
> -	b	2f
> -
> -1:	/* Default to HVC_CALL_HYP. */
> +	b	do_eret
>  
> +	/* jump into trampoline code */
> +1:	cmp	x18, #HVC_RESET_CPU
> +	b.ne	2f
> +	/*
> +	 * Entry point is:
> +	 *	TRAMPOLINE_VA
> +	 *	+ (__kvm_hyp_reset - (__hyp_idmap_text_start & PAGE_MASK))
> +	 */
> +	adrp	x2, __kvm_hyp_reset
> +	add	x2, x2, #:lo12:__kvm_hyp_reset
> +	adrp	x3, __hyp_idmap_text_start
> +	add	x3, x3, #:lo12:__hyp_idmap_text_start
> +	and	x3, x3, PAGE_MASK
> +	sub	x2, x2, x3
> +	ldr	x3, =TRAMPOLINE_VA

Nit: You should be able to do a "mov x3, #TRAMPOLINE_VA" and avoid the
load from memory.

> +	add	x2, x2, x3
> +	br	x2				// no return
> +
> +2:	/* Default to HVC_CALL_HYP. */
>  	push	lr, xzr
>  
>  	/*
> @@ -999,7 +1021,9 @@ el1_sync:					// Guest trapped into EL2
>  	blr	lr
>  
>  	pop	lr, xzr
> -2:	eret
> +
> +do_eret:
> +	eret
>  
>  el1_trap:
>  	/*
> 

Thanks,

	M.
Marc Zyngier Nov. 27, 2015, 2:19 p.m. UTC | #3
On 24/11/15 22:25, Geoff Levand wrote:
> Commit 68234df4ea7939f98431aa81113fbdce10c4a84b (arm64: kill flush_cache_all())
> removed the global arm64 routines cpu_reset() and cpu_soft_restart() needed by
> the arm64 kexec and kdump support.  Add simplified versions of those two
> routines back with some changes needed for kexec in the new files cpu_reset.S,
> and cpu_reset.h.
> 
> When a CPU is reset it needs to be put into the exception level it had
> when it entered the kernel. Update cpu_reset() to accept an argument
> which signals if the reset address needs to be entered at EL1 or EL2.
> 
> Signed-off-by: Geoff Levand <geoff@infradead.org>
> ---
>  arch/arm64/kernel/cpu-reset.S | 84 +++++++++++++++++++++++++++++++++++++++++++
>  arch/arm64/kernel/cpu-reset.h | 22 ++++++++++++
>  2 files changed, 106 insertions(+)
>  create mode 100644 arch/arm64/kernel/cpu-reset.S
>  create mode 100644 arch/arm64/kernel/cpu-reset.h
> 
> diff --git a/arch/arm64/kernel/cpu-reset.S b/arch/arm64/kernel/cpu-reset.S
> new file mode 100644
> index 0000000..0423f27
> --- /dev/null
> +++ b/arch/arm64/kernel/cpu-reset.S
> @@ -0,0 +1,84 @@
> +/*
> + * cpu reset routines
> + *
> + * Copyright (C) 2001 Deep Blue Solutions Ltd.
> + * Copyright (C) 2012 ARM Ltd.
> + * Copyright (C) 2015 Huawei Futurewei Technologies.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/linkage.h>
> +#include <asm/assembler.h>
> +#include <asm/virt.h>
> +
> +#define SCTLR_ELx_I	(1 << 12)
> +#define SCTLR_ELx_SA	(1 << 3)
> +#define SCTLR_ELx_C	(1 << 2)
> +#define SCTLR_ELx_A	(1 << 1)
> +#define SCTLR_ELx_M	1

kvm_arm.h already has some of these defined, and sysreg.h has another
bunch. Can you please consolidate this into a single place (probably
sysreg.h)?

> +#define SCTLR_ELx_FLAGS	(SCTLR_ELx_M | SCTLR_ELx_A | SCTLR_ELx_C |	\
> +			 SCTLR_ELx_SA | SCTLR_ELx_I)
> +
> +.text
> +.pushsection    .idmap.text, "ax"
> +
> +.align 5

Why is this .align directive required?

> +
> +/*
> + * cpu_soft_restart(cpu_reset, el2_switch, entry, arg0, arg1) - Perform a cpu
> + * soft reset.
> + *
> + * @cpu_reset: Physical address of the cpu_reset routine.
> + * @el2_switch: Flag to indicate a switch to EL2 is needed, passed to cpu_reset.
> + * @entry: Location to jump to for soft reset, passed to cpu_reset.
> + * arg0: First argument passed to @entry.
> + * arg1: Second argument passed to @entry.
> + */
> +
> +ENTRY(cpu_soft_restart)
> +	mov	x18, x0				// cpu_reset
> +	mov	x0, x1				// el2_switch
> +	mov	x1, x2				// entry
> +	mov	x2, x3				// arg0
> +	mov	x3, x4				// arg1
> +	ret	x18
> +ENDPROC(cpu_soft_restart)

Grepping through the tree, I can only find a single use of
cpu_soft_restart, with cpu_reset as its first parameter.

Why do we need this indirection? Having

void cpu_soft_restart(el2_switch, entry, arg0, arg1);

should be enough...

> +
> +/*
> + * cpu_reset(el2_switch, entry, arg0, arg1) - Helper for cpu_soft_restart.
> + *
> + * @el2_switch: Flag to indicate a switch to EL2 is needed.
> + * @entry: Location to jump to for soft reset.
> + * arg0: First argument passed to @entry.
> + * arg1: Second argument passed to @entry.
> + *
> + * Put the CPU into the same state as it would be if it had been reset, and
> + * branch to what would be the reset vector. It must be executed with the
> + * flat identity mapping.
> + */
> +
> +ENTRY(cpu_reset)
> +	/* Clear sctlr_el1 flags. */
> +	mrs	x12, sctlr_el1
> +	ldr	x13, =SCTLR_ELx_FLAGS
> +	bic	x12, x12, x13
> +	msr	sctlr_el1, x12
> +	isb
> +
> +	cbz	x0, 1f				// el2_switch?
> +
> +	mov	x0, x1				// entry
> +	mov	x1, x2				// arg0
> +	mov	x2, x3				// arg1
> +	hvc	#HVC_CALL_FUNC			// no return
> +
> +1:	mov	x18, x1				// entry
> +	mov	x0, x2				// arg0
> +	mov	x1, x3				// arg1
> +	ret	x18
> +ENDPROC(cpu_reset)

And this can be an actual part of cpu_soft_restart.

> +
> +.popsection
> diff --git a/arch/arm64/kernel/cpu-reset.h b/arch/arm64/kernel/cpu-reset.h
> new file mode 100644
> index 0000000..79816b6
> --- /dev/null
> +++ b/arch/arm64/kernel/cpu-reset.h
> @@ -0,0 +1,22 @@
> +/*
> + * cpu reset routines
> + *
> + * Copyright (C) 2015 Huawei Futurewei Technologies.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#if !defined(_ARM64_CPU_RESET_H)

The usual idiom is "#ifndef __BLAH_H".

> +#define _ARM64_CPU_RESET_H
> +
> +#include <asm/virt.h>
> +
> +void __attribute__((noreturn)) cpu_reset(unsigned long el2_switch,
> +	unsigned long entry, unsigned long arg0, unsigned long arg1);
> +void __attribute__((noreturn)) cpu_soft_restart(phys_addr_t cpu_reset,
> +	unsigned long el2_switch, unsigned long entry, unsigned long arg0,
> +	unsigned long arg1);
> +
> +#endif
> 

There is a __noreturn attribute already defined.
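
Putting those two points together, the header could end up looking
something like this (just a sketch, untested):

#ifndef _ARM64_CPU_RESET_H
#define _ARM64_CPU_RESET_H

#include <linux/compiler.h>	/* for __noreturn */
#include <linux/types.h>	/* for phys_addr_t */
#include <asm/virt.h>

void __noreturn cpu_reset(unsigned long el2_switch, unsigned long entry,
	unsigned long arg0, unsigned long arg1);
void __noreturn cpu_soft_restart(phys_addr_t cpu_reset,
	unsigned long el2_switch, unsigned long entry, unsigned long arg0,
	unsigned long arg1);

#endif /* _ARM64_CPU_RESET_H */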

Thanks,

	M.
Marc Zyngier Nov. 27, 2015, 2:39 p.m. UTC | #4
On 24/11/15 22:25, Geoff Levand wrote:
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> 
> kdump calls machine_crash_shutdown() to shut down non-boot cpus and
> save registers' status in per-cpu ELF notes before starting the crash
> dump kernel. See kernel_kexec().
> 
> ipi_cpu_stop() is a bit modified and used to support this behavior.
> 
> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> ---
>  arch/arm64/include/asm/kexec.h    | 34 +++++++++++++++++++++++++++++++++-
>  arch/arm64/kernel/machine_kexec.c | 31 +++++++++++++++++++++++++++++--
>  arch/arm64/kernel/smp.c           | 16 ++++++++++++++--
>  3 files changed, 76 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
> index 46d63cd..555a955 100644
> --- a/arch/arm64/include/asm/kexec.h
> +++ b/arch/arm64/include/asm/kexec.h
> @@ -30,6 +30,8 @@
>  
>  #if !defined(__ASSEMBLY__)
>  
> +extern bool in_crash_kexec;
> +
>  /**
>   * crash_setup_regs() - save registers for the panic kernel
>   *
> @@ -40,7 +42,37 @@
>  static inline void crash_setup_regs(struct pt_regs *newregs,
>  				    struct pt_regs *oldregs)
>  {
> -	/* Empty routine needed to avoid build errors. */
> +	if (oldregs) {
> +		memcpy(newregs, oldregs, sizeof(*newregs));
> +	} else {
> +		__asm__ __volatile__ (
> +			"stp	 x0,   x1, [%3, #16 *  0]\n"
> +			"stp	 x2,   x3, [%3, #16 *  1]\n"
> +			"stp	 x4,   x5, [%3, #16 *  2]\n"
> +			"stp	 x6,   x7, [%3, #16 *  3]\n"
> +			"stp	 x8,   x9, [%3, #16 *  4]\n"
> +			"stp	x10,  x11, [%3, #16 *  5]\n"
> +			"stp	x12,  x13, [%3, #16 *  6]\n"
> +			"stp	x14,  x15, [%3, #16 *  7]\n"
> +			"stp	x16,  x17, [%3, #16 *  8]\n"
> +			"stp	x18,  x19, [%3, #16 *  9]\n"
> +			"stp	x20,  x21, [%3, #16 * 10]\n"
> +			"stp	x22,  x23, [%3, #16 * 11]\n"
> +			"stp	x24,  x25, [%3, #16 * 12]\n"
> +			"stp	x26,  x27, [%3, #16 * 13]\n"
> +			"stp	x28,  x29, [%3, #16 * 14]\n"
> +			"str	x30,	   [%3, #16 * 15]\n"
> +			"mov	%0, sp\n"
> +			"adr	%1, 1f\n"
> +			"mrs	%2, spsr_el1\n"
> +		"1:"
> +			: "=r" (newregs->sp),
> +			  "=r" (newregs->pc),
> +			  "=r" (newregs->pstate)
> +			: "r"  (&newregs->regs)
> +			: "memory"
> +		);

I wonder how useful this thing is, given that it starts by corrupting
whatever register is holding newregs->regs. Maybe this is not supposed
to be accurate anyway...


> +	}
>  }
>  
>  #endif /* !defined(__ASSEMBLY__) */
> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> index da28a26..d2d7e90 100644
> --- a/arch/arm64/kernel/machine_kexec.c
> +++ b/arch/arm64/kernel/machine_kexec.c
> @@ -9,6 +9,7 @@
>   * published by the Free Software Foundation.
>   */
>  
> +#include <linux/kernel.h>
>  #include <linux/kexec.h>
>  #include <linux/of_fdt.h>
>  #include <linux/slab.h>
> @@ -23,6 +24,7 @@
>  extern const unsigned char arm64_relocate_new_kernel[];
>  extern const unsigned long arm64_relocate_new_kernel_size;
>  
> +bool in_crash_kexec;
>  static unsigned long kimage_start;
>  
>  /**
> @@ -203,13 +205,38 @@ void machine_kexec(struct kimage *kimage)
>  	 */
>  
>  	cpu_soft_restart(virt_to_phys(cpu_reset),
> -		is_hyp_mode_available(),
> +		in_crash_kexec ? 0 : is_hyp_mode_available(),
>  		reboot_code_buffer_phys, kimage->head, kimage_start);
>  
>  	BUG(); /* Should never get here. */
>  }
>  
> +/**
> + * machine_crash_shutdown - shutdown non-boot cpus and save registers
> + */
>  void machine_crash_shutdown(struct pt_regs *regs)
>  {
> -	/* Empty routine needed to avoid build errors. */
> +	struct pt_regs dummy_regs;
> +	int cpu;
> +
> +	local_irq_disable();
> +
> +	in_crash_kexec = true;
> +
> +	/*
> +	 * clear and initialize the per-cpu info. This is necessary
> +	 * because, otherwise, slots for offline cpus would never be
> +	 * filled up. See smp_send_stop().
> +	 */
> +	memset(&dummy_regs, 0, sizeof(dummy_regs));
> +	for_each_possible_cpu(cpu)
> +		crash_save_cpu(&dummy_regs, cpu);
> +
> +	/* shutdown non-boot cpus */
> +	smp_send_stop();
> +
> +	/* for boot cpu */
> +	crash_save_cpu(regs, smp_processor_id());
> +
> +	pr_info("Starting crashdump kernel...\n");
>  }
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index b1adc51..15aabef 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -37,6 +37,7 @@
>  #include <linux/completion.h>
>  #include <linux/of.h>
>  #include <linux/irq_work.h>
> +#include <linux/kexec.h>
>  
>  #include <asm/alternative.h>
>  #include <asm/atomic.h>
> @@ -54,6 +55,8 @@
>  #include <asm/ptrace.h>
>  #include <asm/virt.h>
>  
> +#include "cpu-reset.h"
> +
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/ipi.h>
>  
> @@ -683,8 +686,12 @@ static DEFINE_RAW_SPINLOCK(stop_lock);
>  /*
>   * ipi_cpu_stop - handle IPI from smp_send_stop()
>   */
> -static void ipi_cpu_stop(unsigned int cpu)
> +static void ipi_cpu_stop(unsigned int cpu, struct pt_regs *regs)
>  {
> +#ifdef CONFIG_KEXEC
> +	/* printing messages may slow down the shutdown. */
> +	if (!in_crash_kexec)
> +#endif
>  	if (system_state == SYSTEM_BOOTING ||
>  	    system_state == SYSTEM_RUNNING) {
>  		raw_spin_lock(&stop_lock);

Irrespective of how useful this change is, how about having a predicate
instead? Something like:

static inline bool is_in_crash_kexec(void)
{
#ifdef CONFIG_KEXEC
	return in_crash_kexec;
#else
	return false;
#endif
}

located in machine_kexec.c (making in_crash_kexec static), and then

	if (!is_in_crash_kexec() && (system_state == ... || ...)) {

It would certainly look better.

> @@ -697,6 +704,11 @@ static void ipi_cpu_stop(unsigned int cpu)
>  
>  	local_irq_disable();
>  
> +#ifdef CONFIG_KEXEC
> +	if (in_crash_kexec)
> +		crash_save_cpu(regs, cpu);
> +#endif /* CONFIG_KEXEC */
> +

Same here.

>  	while (1)
>  		cpu_relax();
>  }
> @@ -727,7 +739,7 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
>  
>  	case IPI_CPU_STOP:
>  		irq_enter();
> -		ipi_cpu_stop(cpu);
> +		ipi_cpu_stop(cpu, regs);
>  		irq_exit();
>  		break;
>  
> 

Thanks,

	M.
Pratyush Anand Nov. 30, 2015, 5:28 a.m. UTC | #5
On 27/11/2015:02:19:37 PM, Marc Zyngier wrote:
> On 24/11/15 22:25, Geoff Levand wrote:
> > +ENTRY(cpu_soft_restart)
> > +	mov	x18, x0				// cpu_reset
> > +	mov	x0, x1				// el2_switch
> > +	mov	x1, x2				// entry
> > +	mov	x2, x3				// arg0
> > +	mov	x3, x4				// arg1
> > +	ret	x18
> > +ENDPROC(cpu_soft_restart)
> 
> Grepping through the tree, I can only find a single use of
> cpu_soft_restart, with cpu_reset as its first parameter.
> 
> Why do we need this indirection? Having

It is needed because we need to execute cpu_reset() in the physical address space.

> 
> void cpu_soft_restart(el2_switch, entry, arg0, arg1);
> 
> should be enough...

We can do with only cpu_soft_restart(), but then a function pointer to __pa()
of it would need to be called. Maybe the current approach is cleaner.
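
i.e. something like this at the call site (illustrative only; calling
through the physical address works because cpu_soft_restart() sits in the
idmap'ed .idmap.text section and the identity map is in place by then):

typedef void (*restart_fn_t)(unsigned long el2_switch, unsigned long entry,
			     unsigned long arg0, unsigned long arg1);

/* sketch of the machine_kexec() call, using the physical address */
restart_fn_t fn = (restart_fn_t)virt_to_phys(cpu_soft_restart);

fn(is_hyp_mode_available(), reboot_code_buffer_phys, kimage->head,
   kimage_start);	/* no return */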

~Pratyush
Geoff Levand Nov. 30, 2015, 6:51 p.m. UTC | #6
Hi,

On Fri, 2015-11-27 at 18:43 +0530, Pratyush Anand wrote:

> On 24/11/2015:10:25:34 PM, Geoff Levand wrote:
> > +	/* Test the entry flags. */
> > +.Ltest_source:
> > +	tbz	x18, IND_SOURCE_BIT, .Ltest_indirection
> > +
> > +	mov x20, x13				/*  x20 = copy dest */
> > +	mov x21, x12				/*  x21 = copy src */
> 
> Till v10 we had invalidation of the relocated destination page to PoC here. I
> could not understand why it was removed. Removing that piece of code breaks
> kexec booting with mustang. I need [1] to kexec boot into the second kernel on
> the mustang platform.

We need to flush the new kernel to PoC.  The code that was here that
was doing that would only be executed when the new kernel needed
relocation (the standard kexec case).  We also need to flush kernels
that do not need relocation (the standard kdump case).

I moved the new kernel flush to kexec_segment_flush(), called
unconditionally in machine_kexec() so we can handle both cases
with one piece of code.
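
For reference, kexec_segment_flush() amounts to roughly this (a sketch of
the helper in machine_kexec.c, not the exact code):

static void kexec_segment_flush(const struct kimage *kimage)
{
	unsigned long i;

	/* Clean+invalidate every destination segment to PoC. */
	for (i = 0; i < kimage->nr_segments; i++)
		__flush_dcache_area(phys_to_virt(kimage->segment[i].mem),
			kimage->segment[i].memsz);
}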

Have you experienced a problem on mustang with the current version?

-Geoff
Geoff Levand Nov. 30, 2015, 8:03 p.m. UTC | #7
Hi Marc,

On Fri, 2015-11-27 at 14:19 +0000, Marc Zyngier wrote:
> On 24/11/15 22:25, Geoff Levand wrote:
> > +#define SCTLR_ELx_I	(1 << 12)
> > +#define SCTLR_ELx_SA	(1 << 3)
> > +#define SCTLR_ELx_C	(1 << 2)
> > +#define SCTLR_ELx_A	(1 << 1)
> > +#define SCTLR_ELx_M	1
> 
> kvm_arm.h already has some of these defined, and sysreg.h has another
> bunch. Can you please consolidate this into a single place (probably
> sysreg.h)?

OK, I'll look into it.

> > +#define SCTLR_ELx_FLAGS	(SCTLR_ELx_M | SCTLR_ELx_A | SCTLR_ELx_C |	\
> > +			 SCTLR_ELx_SA | SCTLR_ELx_I)
> > +
> > +.text
> > +.pushsection    .idmap.text, "ax"
> > +
> > +.align 5
> 
> Why is this .align directive required?

In the original proc.S file, this was to align the cpu_reset
routine.  It has been there since the initial commit of
proc.S.  I don't think this alignment is needed, but maybe
there is some reason for it that I am not aware of?

> > +
> > +/*
> > + * cpu_soft_restart(cpu_reset, el2_switch, entry, arg0, arg1) - Perform a cpu
> > + * soft reset.
> > + *
> > + * @cpu_reset: Physical address of the cpu_reset routine.
> > + * @el2_switch: Flag to indicate a switch to EL2 is needed, passed to cpu_reset.
> > + * @entry: Location to jump to for soft reset, passed to cpu_reset.
> > + * arg0: First argument passed to @entry.
> > + * arg1: Second argument passed to @entry.
> > + */
> > +
> > +ENTRY(cpu_soft_restart)
> > +	mov	x18, x0				// cpu_reset
> > +	mov	x0, x1				// el2_switch
> > +	mov	x1, x2				// entry
> > +	mov	x2, x3				// arg0
> > +	mov	x3, x4				// arg1
> > +	ret	x18
> > +ENDPROC(cpu_soft_restart)
> 
> Grepping through the tree, I can only find a single use of
> cpu_soft_restart, with cpu_reset as its first parameter.
> 
> Why do we need this indirection? Having
> 
> void cpu_soft_restart(el2_switch, entry, arg0, arg1);
> 
> should be enough...

> > +/*
> > + * cpu_reset(el2_switch, entry, arg0, arg1) - Helper for cpu_soft_restart.
> > + *
> > + * @el2_switch: Flag to indicate a switch to EL2 is needed.
> > + * @entry: Location to jump to for soft reset.
> > + * arg0: First argument passed to @entry.
> > + * arg1: Second argument passed to @entry.
> > + *
> > + * Put the CPU into the same state as it would be if it had been reset, and
> > + * branch to what would be the reset vector. It must be executed with the
> > + * flat identity mapping.
> > + */
> > +
> > +ENTRY(cpu_reset)
> > +	/* Clear sctlr_el1 flags. */
> > +	mrs	x12, sctlr_el1
> > +	ldr	x13, =SCTLR_ELx_FLAGS
> > +	bic	x12, x12, x13
> > +	msr	sctlr_el1, x12
> > +	isb
> > +
> > +	cbz	x0, 1f				// el2_switch?
> > +
> > +	mov	x0, x1				// entry
> > +	mov	x1, x2				// arg0
> > +	mov	x2, x3				// arg1
> > +	hvc	#HVC_CALL_FUNC			// no return
> > +
> > +1:	mov	x18, x1				// entry
> > +	mov	x0, x2				// arg0
> > +	mov	x1, x3				// arg1
> > +	ret	x18
> > +ENDPROC(cpu_reset)
> 
> And this can be an actual part of cpu_soft_restart.

Sure, I'll look at your proposal from your reply to Pratyush.

> > +
> > +.popsection
> > diff --git a/arch/arm64/kernel/cpu-reset.h b/arch/arm64/kernel/cpu-reset.h
> > new file mode 100644
> > index 0000000..79816b6
> > --- /dev/null
> > +++ b/arch/arm64/kernel/cpu-reset.h
> > @@ -0,0 +1,22 @@
> > +/*
> > + * cpu reset routines
> > + *
> > + * Copyright (C) 2015 Huawei Futurewei Technologies.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#if !defined(_ARM64_CPU_RESET_H)
> 
> The usual idiom is "#ifndef __BLAH_H".

Sure, I can change that.  Grepping the arm64 sources shows one other
use of this style, in a kvm header...

> > +#define _ARM64_CPU_RESET_H
> > +
> > +#include <asm/virt.h>
> > +
> > +void __attribute__((noreturn)) cpu_reset(unsigned long el2_switch,
> > +	unsigned long entry, unsigned long arg0, unsigned long arg1);
> > +void __attribute__((noreturn)) cpu_soft_restart(phys_addr_t cpu_reset,
> > +	unsigned long el2_switch, unsigned long entry, unsigned long arg0,
> > +	unsigned long arg1);
> > +
> > +#endif
> > 
> 
> There is a __noreturn attribute already defined.

I'll switch these over.

Thanks for the review.

-Geoff
Pratyush Anand Dec. 1, 2015, 2:16 a.m. UTC | #8
Hi Geoff,

On 30/11/2015:10:51:15 AM, Geoff Levand wrote:
> Hi,
> 
> On Fri, 2015-11-27 at 18:43 +0530, Pratyush Anand wrote:
> 
> > On 24/11/2015:10:25:34 PM, Geoff Levand wrote:
> > > +	/* Test the entry flags. */
> > > +.Ltest_source:
> > > +	tbz	x18, IND_SOURCE_BIT, .Ltest_indirection
> > > +
> > > +	mov x20, x13				/*  x20 = copy dest */
> > > +	mov x21, x12				/*  x21 = copy src */
> > 
> > Till v10 we had invalidation of the relocated destination page to PoC here. I
> > could not understand why it was removed. Removing that piece of code breaks
> > kexec booting with mustang. I need [1] to kexec boot into the second kernel on
> > the mustang platform.
> 
> We need to flush the new kernel to PoC.  The code that was here that
> was doing that would only be executed when the new kernel needed
> relocation (the standard kexec case).  We also need to flush kernels
> that do not need relocation (the standard kdump case).
> 
> I moved the new kernel flush to kexec_segment_flush(), called
> unconditionally in machine_kexec() so we can handle both cases
> with one piece of code.

Yes, I had noticed that. Actually, flushing before the cache is disabled can
always cause a heisenbug.

> 
> Have you experienced a problem on mustang with the current version?

Yes, v10 works fine, but I need the invalidation fix for both v11 and v12 to
work on mustang.
I have not tested vanilla v12 on seattle, but v12 ported to rhelsa needs the
invalidation fix to work on seattle as well.

~Pratyush
Marc Zyngier Dec. 1, 2015, 9:38 a.m. UTC | #9
On 30/11/15 20:03, Geoff Levand wrote:

Hi Geoff,

>>> +.align 5
>>
>> Why is this .align directive required?
> 
> In the original proc.S file, this was to align the cpu_reset
> routine.  It has been there since the initial commit of the
> proc.S.  I don't think this alignment is needed, but maybe
> there is some reason it is there that I am not aware of?

None that I can think of, so just drop it - we'll find out pretty
quickly if there was an actual requirement.

Thanks,

	M.
Azriel Samson Dec. 1, 2015, 6:32 p.m. UTC | #10
Hi,

I tested with v11 on our platform and I too needed the invalidation fix.

Thanks,
Azriel Samson
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, 
a Linux Foundation Collaborative Project

-----Original Message-----
From: kexec [mailto:kexec-bounces@lists.infradead.org] On Behalf Of Pratyush
Anand
Sent: Monday, November 30, 2015 7:16 PM
To: Geoff Levand
Cc: Mark Rutland; marc.zyngier@arm.com; Catalin Marinas; Will Deacon; AKASHI
Takahiro; linux-arm-kernel@lists.infradead.org; Dave Young;
kexec@lists.infradead.org; christoffer.dall@linaro.org
Subject: Re: [PATCH v12 08/16] arm64/kexec: Add core kexec support

Hi Geoff,

On 30/11/2015:10:51:15 AM, Geoff Levand wrote:
> Hi,
> 
> On Fri, 2015-11-27 at 18:43 +0530, Pratyush Anand wrote:
> 
> > On 24/11/2015:10:25:34 PM, Geoff Levand wrote:
> > > +	/* Test the entry flags. */
> > > +.Ltest_source:
> > > +	tbz	x18, IND_SOURCE_BIT, .Ltest_indirection
> > > +
> > > +	mov x20, x13				/*  x20 = copy dest */
> > > +	mov x21, x12				/*  x21 = copy src */
> > 
> > Till v10 we had invalidation of the relocated destination page to PoC here. I
> > could not understand why it was removed. Removing that piece of code breaks
> > kexec booting with mustang. I need [1] to kexec boot into the second kernel on
> > the mustang platform.
> 
> We need to flush the new kernel to PoC.  The code that was here that 
> was doing that would only be executed when the new kernel needed 
> relocation (the standard kexec case).  We also need to flush kernels 
> that do not need relocation (the standard kdump case).
> 
> I moved the new kernel flush to kexec_segment_flush(), called 
> unconditionally in machine_kexec() so we can handle both cases with 
> one piece of code.

Yes, I had noticed that. Actually, flushing before the cache is disabled can
always cause a heisenbug.

> 
> Have you experienced a problem on mustang with the current version?

Yes, v10 works fine, but I need the invalidation fix for both v11 and v12 to
work on mustang.
I have not tested vanilla v12 on seattle, but v12 ported to rhelsa needs the
invalidation fix to work on seattle as well.

~Pratyush
Mark Rutland Dec. 1, 2015, 7:03 p.m. UTC | #11
On Tue, Dec 01, 2015 at 07:46:15AM +0530, Pratyush Anand wrote:
> Hi Geoff,
> 
> On 30/11/2015:10:51:15 AM, Geoff Levand wrote:
> > Hi,
> > 
> > On Fri, 2015-11-27 at 18:43 +0530, Pratyush Anand wrote:
> > 
> > > On 24/11/2015:10:25:34 PM, Geoff Levand wrote:
> > > > +	/* Test the entry flags. */
> > > > +.Ltest_source:
> > > > +	tbz	x18, IND_SOURCE_BIT, .Ltest_indirection
> > > > +
> > > > +	mov x20, x13				/*  x20 = copy dest */
> > > > +	mov x21, x12				/*  x21 = copy src */
> > > 
> > > Till v10 we had invalidation of the relocated destination page to PoC here. I
> > > could not understand why it was removed. Removing that piece of code breaks
> > > kexec booting with mustang. I need [1] to kexec boot into the second kernel on
> > > the mustang platform.
> > 
> > We need to flush the new kernel to PoC.  The code that was here that
> > was doing that would only be executed when the new kernel needed
> > relocation (the standard kexec case).  We also need to flush kernels
> > that do not need relocation (the standard kdump case).
> > 
> > I moved the new kernel flush to kexec_segment_flush(), called
> > unconditionally in machine_kexec() so we can handle both cases
> > with one piece of code.
> 
> Yes, I had noticed that. Actually, flushing before the cache is disabled can
> always cause a heisenbug.

Please use "clean", "invalidate", or "clean+invalidate" rather than
"flush". The latter is ambiguous and misleading.

You can validly perform maintenance while the cache may allocate for a
region of memory, it's just that afterwards the cache may hold a clean
entries for that region.

You can clean/clean+invalidate to push data to the PoC, or you can
invalidate/clean+invalidate to ensure that no asynchronous writebacks
occur later (so long as you do not make a cacheable write to said
cacheline).

The only thing that you cannot guarantee is that there is not some clean
cacheline allocated for a region of memory to which cacheable accesses
may be performed.

Note that the kernel only requires its Image to be clean to the PoC. So
long as this is true, we know that there will not be asynchronous
writebacks, and can invalidate as necessary as part of the boot process.

Thanks,
Mark.
Geoff Levand Dec. 2, 2015, 9:08 p.m. UTC | #12
Hi Mark,

On Tue, 2015-12-01 at 19:03 +0000, Mark Rutland wrote:
> You can validly perform maintenance while the cache may allocate for a
> region of memory, it's just that afterwards the cache may hold a clean
> entries for that region.
> 
> You can clean/clean+invalidate to push data to the PoC, or you can
> invalidate/clean+invalidate to ensure that no asynchronous writebacks
> occur later (so long as you do not make a cacheable write to said
> cacheline).
> 
> The only thing that you cannot guarantee is that there is not some clean
> cacheline allocated for a region of memory to which cacheable accesses
> may be performed.
> 
> Note that the kernel only requires its Image to be clean to the PoC. So
> long as this is true, we know that there will not be asynchronous
> writebacks, and can invalidate as necessary as part of the boot process.

In v10, which worked for Mustang and Qualcomm, we had:

  clean+invalidate to PoC all source pages
  disable d-cache
  loop {
    invalidate to PoC destination page
    copy page source->destination
  }
  enter new image

In v11 I changed this, and it did not work for those platforms:
 
  clean+invalidate to PoC all source pages
  clean+invalidate to PoC all destination pages
  disable d-cache
  loop {
    copy page source->destination
  }
  enter new image

Based on your comments above I would think both should work OK.

-Geoff
Ashwin Chaugule Dec. 2, 2015, 10:40 p.m. UTC | #13
Hello,

On 24 November 2015 at 17:25, Geoff Levand <geoff@infradead.org> wrote:
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>
> The current kvm implementation on arm64 does cpu-specific initialization
> at system boot, and has no way to gracefully shutdown a core in terms of
> kvm. This prevents, especially, kexec from rebooting the system on a boot
> core in EL2.
>
> This patch adds a cpu tear-down function and also puts an existing cpu-init
> code into a separate function, kvm_arch_hardware_disable() and
> kvm_arch_hardware_enable() respectively.
> We don't need arm64-specific cpu hotplug hook any more.
>
> Since this patch modifies common part of code between arm and arm64, one
> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
> compiling errors.
>
> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> ---
>  arch/arm/include/asm/kvm_host.h   | 10 ++++-
>  arch/arm/include/asm/kvm_mmu.h    |  1 +
>  arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>  arch/arm/kvm/mmu.c                |  5 +++
>  arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>  arch/arm64/include/asm/kvm_mmu.h  |  1 +
>  arch/arm64/include/asm/virt.h     |  9 +++++
>  arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>  arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>  9 files changed, 138 insertions(+), 48 deletions(-)

[..]

>
>
>  static struct notifier_block hyp_init_cpu_pm_nb = {
> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>         }
>
>         /*
> -        * Execute the init code on each CPU.
> -        */
> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
> -
> -       /*
>          * Init HYP view of VGIC
>          */
>         err = kvm_vgic_hyp_init();

With this flow, the cpu_init_hyp_mode() is called only at VM guest
creation, but vgic_hyp_init() is called at bootup. On a system with
GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
(to get the number of LRs), because we're not reading it from EL2
anymore.

What's the best way to fix this?
- Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
- Fold the VGIC init stuff back into hardware_enable()?
- Read the VGIC number of LRs from the hyp stub?
- ..

Regards,
Ashwin.
Geoff Levand Dec. 2, 2015, 10:49 p.m. UTC | #14
Azriel and Pratyush,

On Tue, 2015-12-01 at 11:32 -0700, Azriel Samson wrote:
> I tested with v11 on our platform and I too needed the invalidation
> fix.

I pushed out a kexec-v12.1 branch to my linux-kexec repo [1] with
fixes.  Could you test it with both kexec reboot and kdump?  For
kdump, use something like:

  CMDLINE += 'crashkernel=64M@2240M'

then

  # kexec -d --load-panic /boot/vmlinux.strip --append="1 mem=64M maxcpus=1 reset_devices console=... root=..."
  # echo c > /proc/sysrq-trigger

We need to test both reboot and kdump since the way the kexec is
done is different for the two.

[1]  https://git.kernel.org/cgit/linux/kernel/git/geoff/linux-kexec.git
-Geoff
Geoff Levand Dec. 2, 2015, 10:57 p.m. UTC | #15
On Mon, 2015-11-30 at 10:40 +0000, Marc Zyngier wrote:

> All that can be solved in C, and mostly at compile time. Using an
> assembler trampoline is complicating things for no good reason:

I added this into my kexec-v12.1 branch.  After some more testing I'll
post a v13 to the list.

-Geoff
Pratyush Anand Dec. 3, 2015, 4:15 a.m. UTC | #16
Hi Akashi,

Some of the points that came up while discussing with Mark Salter are worth
including in v13.

On 24/11/2015:10:25:34 PM, Geoff Levand wrote:
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>

[...]

> +/**
> + * machine_crash_shutdown - shutdown non-boot cpus and save registers

"non-panic" would be correct in stead of "non-boot". 

> +	/* shutdown non-boot cpus */

Ditto

> +	smp_send_stop();
> +
> +	/* for boot cpu */

"for panic cpu"


[...]

> @@ -697,6 +704,11 @@ static void ipi_cpu_stop(unsigned int cpu)
>  
>  	local_irq_disable();

We have "set_cpu_online(cpu, false);" just before it.
Panic core is waiting for non-panic to go offline, i.e. for the above event.

>  
> +#ifdef CONFIG_KEXEC
> +	if (in_crash_kexec)
> +		crash_save_cpu(regs, cpu);
> +#endif /* CONFIG_KEXEC */

However, we are still saving crash info for the non-panic cores. So, it would
be good to move crash_save_cpu() before set_cpu_online() to avoid any race
condition.
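
Something like this (a sketch; the stop_lock/pr_crit block is left out):

static void ipi_cpu_stop(unsigned int cpu, struct pt_regs *regs)
{
	/* ... message printing ... */

#ifdef CONFIG_KEXEC
	/* Save registers while this cpu is still marked online. */
	if (in_crash_kexec)
		crash_save_cpu(regs, cpu);
#endif

	set_cpu_online(cpu, false);

	local_irq_disable();

	while (1)
		cpu_relax();
}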

~Pratyush
Azriel Samson Dec. 3, 2015, 4:37 a.m. UTC | #17
Hi Geoff,

> I pushed out a kexec-v12.1 branch to my linux-kexec repo [1] with fixes.
> Could you test it with both kexec reboot and kdump?  For kdump, use
> something like:
>   CMDLINE += 'crashkernel=64M@2240M'

For kdump, I also require a version of the patch "arm64/kexec: Add support
for kexec-lite". Without this, the dump capture kernel cannot find the
device tree even on v11 that I tested.

Thanks,
Azriel Samson
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, 
a Linux Foundation Collaborative Project
Pratyush Anand Dec. 3, 2015, 6:09 a.m. UTC | #18
Hi Geoff,

On 02/12/2015:02:49:22 PM, Geoff Levand wrote:
> Azriel and Pratyush,
> 
> On Tue, 2015-12-01 at 11:32 -0700, Azriel Samson wrote:
> > I tested with v11 on our platform and I too needed the invalidation
> > fix.
> 
> I pushed out a kexec-v12.1 branch to my linux-kexec repo [1] with
> fixes.  Could you test it with both kexec reboot and kdump?  For
> kdump, use something like:

I tested kexec-v12.1 with mustang and it works for both kexec and kdump.

There have been a few updates from Akashi which are not in kexec-v12.1.
Please pick up his updates and merge them into v13.

~Pratyush
Will Deacon Dec. 3, 2015, 9:32 a.m. UTC | #19
On Wed, Dec 02, 2015 at 02:57:30PM -0800, Geoff Levand wrote:
> On Mon, 2015-11-30 at 10:40 +0000, Marc Zyngier wrote:
> 
> > All that can be solved in C, and mostly at compile time. Using an
> > assembler trampoline is complicating things for no good reason:
> 
> I added this into my kexec-v12.1 branch.  After some more testing I'll
> post a v13 to the list.

You may well need some notrace annotations if you want to run C code
here. Can you give it a spin with things like ftrace and kprobes?
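
i.e. any C helper on that path would need marking, something like (a
sketch; the function name is made up):

/* Reset-path code must not be patched by ftrace/kprobes. */
static void notrace do_kexec_cpu_reset(unsigned long el2_switch,
	unsigned long entry, unsigned long arg0, unsigned long arg1)
{
	/* must only call other notrace code from here */
}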

Will
Ashwin Chaugule Dec. 3, 2015, 1:55 p.m. UTC | #20
On 2 December 2015 at 17:40, Ashwin Chaugule <ashwin.chaugule@linaro.org> wrote:
> Hello,
>
> On 24 November 2015 at 17:25, Geoff Levand <geoff@infradead.org> wrote:
>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>
>> The current kvm implementation on arm64 does cpu-specific initialization
>> at system boot, and has no way to gracefully shutdown a core in terms of
>> kvm. This prevents, especially, kexec from rebooting the system on a boot
>> core in EL2.
>>
>> This patch adds a cpu tear-down function and also puts an existing cpu-init
>> code into a separate function, kvm_arch_hardware_disable() and
>> kvm_arch_hardware_enable() respectively.
>> We don't need arm64-specific cpu hotplug hook any more.
>>
>> Since this patch modifies common part of code between arm and arm64, one
>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
>> compiling errors.
>>
>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> ---
>>  arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>  arch/arm/include/asm/kvm_mmu.h    |  1 +
>>  arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>>  arch/arm/kvm/mmu.c                |  5 +++
>>  arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>  arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>  arch/arm64/include/asm/virt.h     |  9 +++++
>>  arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>  arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>  9 files changed, 138 insertions(+), 48 deletions(-)
>
> [..]
>
>>
>>
>>  static struct notifier_block hyp_init_cpu_pm_nb = {
>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>         }
>>
>>         /*
>> -        * Execute the init code on each CPU.
>> -        */
>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>> -
>> -       /*
>>          * Init HYP view of VGIC
>>          */
>>         err = kvm_vgic_hyp_init();
>
> With this flow, the cpu_init_hyp_mode() is called only at VM guest
> creation, but vgic_hyp_init() is called at bootup. On a system with
> GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
> (to get the number of LRs), because we're not reading it from EL2
> anymore.

...or more likely, that the function to read the ICH_VTR_EL2 is not
mapped in EL2, because we deferred the cpu_init_hyp_mode() calls. In
any case, this ends up breaking KVM and manifests as a guest console
which is stuck.

>
> What's the best way to fix this?
> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
> - Fold the VGIC init stuff back into hardware_enable()?
> - Read the VGIC number of LRs from the hyp stub?
> - ..
>
> Regards,
> Ashwin.
Marc Zyngier Dec. 3, 2015, 1:58 p.m. UTC | #21
On 02/12/15 22:40, Ashwin Chaugule wrote:
> Hello,
> 
> On 24 November 2015 at 17:25, Geoff Levand <geoff@infradead.org> wrote:
>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>
>> The current kvm implementation on arm64 does cpu-specific initialization
>> at system boot, and has no way to gracefully shutdown a core in terms of
>> kvm. This prevents, especially, kexec from rebooting the system on a boot
>> core in EL2.
>>
>> This patch adds a cpu tear-down function and also puts an existing cpu-init
>> code into a separate function, kvm_arch_hardware_disable() and
>> kvm_arch_hardware_enable() respectively.
>> We don't need arm64-specific cpu hotplug hook any more.
>>
>> Since this patch modifies common part of code between arm and arm64, one
>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
>> compiling errors.
>>
>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> ---
>>  arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>  arch/arm/include/asm/kvm_mmu.h    |  1 +
>>  arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>>  arch/arm/kvm/mmu.c                |  5 +++
>>  arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>  arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>  arch/arm64/include/asm/virt.h     |  9 +++++
>>  arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>  arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>  9 files changed, 138 insertions(+), 48 deletions(-)
> 
> [..]
> 
>>
>>
>>  static struct notifier_block hyp_init_cpu_pm_nb = {
>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>         }
>>
>>         /*
>> -        * Execute the init code on each CPU.
>> -        */
>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>> -
>> -       /*
>>          * Init HYP view of VGIC
>>          */
>>         err = kvm_vgic_hyp_init();
> 
> With this flow, the cpu_init_hyp_mode() is called only at VM guest
> creation, but vgic_hyp_init() is called at bootup. On a system with
> GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
> (to get the number of LRs), because we're not reading it from EL2
> anymore.

Indeed, this is completely broken (I just reproduced the issue on a
model). I wish this kind of detail had been checked earlier, but thanks
for pointing it out.

> What's the best way to fix this?
> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
> - Fold the VGIC init stuff back into hardware_enable()?

None of that works - kvm_arch_hardware_enable() is called once per CPU,
while vgic_hyp_init() can only be called once. Also,
kvm_arch_hardware_enable() is called from interrupt context, and I
wouldn't feel comfortable starting probing DT and allocating stuff from
there.

> - Read the VGIC number of LRs from the hyp stub?

That may UNDEF if called in the wrong context. Also, this defeats the
point of stubs, which is just to install the vectors for the hypervisor.

> - ..

Yeah, quite.

Geoff, Takahiro?

	M.
Mark Rutland Dec. 3, 2015, 4:06 p.m. UTC | #22
On Wed, Dec 02, 2015 at 01:08:33PM -0800, Geoff Levand wrote:
> Hi Mark,
> 
> On Tue, 2015-12-01 at 19:03 +0000, Mark Rutland wrote:
> > You can validly perform maintenance while the cache may allocate for a
> > region of memory, it's just that afterwards the cache may hold a clean
> > entries for that region.
> > 
> > You can clean/clean+invalidate to push data to the PoC, or you can
> > invalidate/clean+invalidate to ensure that no asynchronous writebacks
> > occur later (so long as you do not make a cacheable write to said
> > cacheline).
> > 
> > The only thing that you cannot guarantee is that there is not some clean
> > cacheline allocated for a region of memory to which cacheable accesses
> > may be performed.
> > 
> > Note that the kernel only requires its Image to be clean to the PoC. So
> long as this is true, we know that there will not be asynchronous
> > writebacks, and can invalidate as necessary as part of the boot process.

I've realised that my wording here is somewhat confusing.

When I say "clean to the PoC", I mean that for a given PA the cache can
either:
* Not hold an entry (i.e. be invalid).
* Hold a clean entry, where the data is identical to that at the PoC.

> In v10, which worked for Mustang and Qualcomm, we had:
> 
>   clean+invalidate to PoC all source pages
>   disable d-cache
>   loop {
>     invalidate to PoC destination page
>     copy page source->destination
>   }
>   enter new image
> 
> In v11 I changed this, and it did not work for those platforms:
>  
>   clean+invalidate to PoC all source pages
>   clean+invalidate to PoC all destination pages
>   disable d-cache
>   loop {
>     copy page source->destination
>   }
>   enter new image
> 
> Based on your comments above I would think both should work OK.

No.

In the latter case, clean lines can be allocated before cacheable data
accesses are inhibited. So stale clean lines can shadow the data copied
via non-cacheable accesses.

While the cache is clean, it isn't clean to the PoC: the lines are
marked clean, but their data no longer matches what was written to
memory.
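
To make this concrete, here is a minimal sketch (not the actual kexec
code; helper names and the line-size handling are simplified) of a copy
loop that stays safe once the D-cache is off, because each destination
line is invalidated to the PoC before the non-cacheable stores are made:

#include <linux/cache.h>        /* cache_line_size() */
#include <linux/string.h>       /* memcpy() */
#include <asm/page.h>           /* PAGE_SIZE */

static void copy_page_to_poc(unsigned long dst, unsigned long src)
{
        unsigned long line, end = dst + PAGE_SIZE;

        /* Discard any clean lines the cache may have allocated for dst. */
        for (line = dst; line < end; line += cache_line_size())
                asm volatile("dc ivac, %0" : : "r" (line) : "memory");
        asm volatile("dsb sy" : : : "memory");

        /* With the D-cache disabled these stores go straight to the PoC,
         * and no stale clean line can shadow them any more. */
        memcpy((void *)dst, (void *)src, PAGE_SIZE);
}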

Thanks,
Mark.
Geoff Levand Dec. 3, 2015, 7:56 p.m. UTC | #23
Hi,

On Wed, 2015-12-02 at 21:37 -0700, Azriel Samson wrote:
> > I pushed out a kexec-v12.1 branch to my linux-kexec repo [1] with fixes.
> > Could you test it with both kexec reboot and kdump?  For kdump, use
> > something like:
> >  CMDLINE += 'crashkernel=64M@2240M'
> 
> For kdump, I also require a version of the patch "arm64/kexec: Add support
> for kexec-lite". Without this, the dump capture kernel cannot find the
> device tree even on v11 that I tested.

Without that patch, does the 2nd stage kernel get the correct
dtb address in register x0?

Is the problem that the dtb data is corrupt when in the 2nd
stage kernel?

Is the dtb data OK when in the 1st stage kernel? 
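
One quick check (a sketch; dtb_looks_valid() is a made-up helper) is to
test the fdt magic at the address handed over in x0, in both the 1st and
2nd stage kernels:

#include <stdint.h>

/* A flattened device tree starts with the big-endian magic 0xd00dfeed. */
static int dtb_looks_valid(const void *dtb)
{
        const uint8_t *p = dtb;

        return p[0] == 0xd0 && p[1] == 0x0d && p[2] == 0xfe && p[3] == 0xed;
}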

-Geoff
Azriel Samson Dec. 4, 2015, 12:39 a.m. UTC | #24
Hi,

On 12/3/2015 12:56 PM, Geoff Levand wrote:
> Hi,
>
> On Wed, 2015-12-02 at 21:37 -0700, Azriel Samson wrote:
>> For kdump, I also require a version of the patch "arm64/kexec: Add support
>> for kexec-lite". Without this, the dump capture kernel cannot find the
>> device tree even on v11 that I tested.

This was a setup issue on my side. kdump with your v11 patches worked 
fine without the above patch when I used your version of kexec-tools.

In kexec-tools, I had to skip check_cpu_nodes() since we are using ACPI 
and the device tree does not contain cpu_info.

>
> Without that patch, does the 2nd stage kernel get the correct
> dtb address in register x0?
>
> Is the problem that the dtb data is corrupt when in the 2nd
> stage kernel?
>
> Is the dtb data OK when in the 1st stage kernel?
>
> -Geoff
>

Thanks,
Azriel Samson
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Pratyush Anand Dec. 4, 2015, 3:54 a.m. UTC | #25
Hi Geoff,

On 03/12/2015:05:39:12 PM, Azriel Samson wrote:
> This was a setup issue on my side. kdump with your v11 patches worked fine
> without the above patch when I used your version of kexec-tools.
> 
> In kexec-tools, I had to skip check_cpu_nodes() since we are using ACPI and
> the device tree does not contain cpu_info.

I think something like the following should be squashed into the patch
"arm64: Add arm64 kexec support":
https://github.com/pratyushanand/kexec-tools/commit/85f403cabb9ae0d7a37d1656a00fb609a9525086

~Pratyush
Geoff Levand Dec. 7, 2015, 6:47 p.m. UTC | #26
On Fri, 2015-12-04 at 09:24 +0530, Pratyush Anand wrote:
> Hi Geoff,
> 
> On 03/12/2015:05:39:12 PM, Azriel Samson wrote:
> > This was a setup issue on my side. kdump with your v11 patches worked fine
> > without the above patch when I used your version of kexec-tools.
> > 
> > In kexec-tools, I had to skip check_cpu_nodes() since we are using ACPI and
> > the device tree does not contain cpu_info.
> 
> I think something like the following should be squashed into the patch
> "arm64: Add arm64 kexec support":
> https://github.com/pratyushanand/kexec-tools/commit/85f403cabb9ae0d7a37d1656a00fb609a9525086

I think it's better to test for ACPI and then run
any meaningful checks than to just ignore the
check_cpu_nodes() failure.
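
Something along these lines (a sketch only; have_acpi() is a made-up
helper and the check_cpu_nodes() call is abbreviated):

#include <unistd.h>

/* Treat the presence of the ACPI tables in sysfs as "booted via ACPI". */
static int have_acpi(void)
{
        return access("/sys/firmware/acpi/tables", F_OK) == 0;
}

static int arm64_check_cpus(void)
{
        if (have_acpi())
                return 0;               /* no DT cpu info to check */

        return check_cpu_nodes();       /* the existing DT-based check */
}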

-Geoff
Geoff Levand Dec. 10, 2015, 12:49 a.m. UTC | #27
Hi Will,

On Thu, 2015-12-03 at 09:32 +0000, Will Deacon wrote:
> On Wed, Dec 02, 2015 at 02:57:30PM -0800, Geoff Levand wrote:
> > On Mon, 2015-11-30 at 10:40 +0000, Marc Zyngier wrote:
> > 
> > > All that can be solved in C, and mostly at compile time. Using an
> > > assembler trampoline is complicating things for no good reason:
> > 
> > I added this into my kexec-v12.1 branch.  After some more testing I'll
> > post a v13 to the list.
> 
> You may well need some notrace annotations if you want to run C code
> here. Can you give it a spin with things like ftrace and kprobes?

I tested ftrace with and without the notrace annotation on
cpu_soft_restart() and could do kexec re-boots either way.

When cpu_soft_restart() is called we've only done setup_mm_for_reboot(),
so I would think C code should still be OK.

-Geoff
Will Deacon Dec. 10, 2015, 10:17 a.m. UTC | #28
On Wed, Dec 09, 2015 at 04:49:22PM -0800, Geoff Levand wrote:
> On Thu, 2015-12-03 at 09:32 +0000, Will Deacon wrote:
> > On Wed, Dec 02, 2015 at 02:57:30PM -0800, Geoff Levand wrote:
> > > On Mon, 2015-11-30 at 10:40 +0000, Marc Zyngier wrote:
> > > 
> > > > All that can be solved in C, and mostly at compile time. Using an
> > > > assembler trampoline is complicating things for no good reason:
> > > 
> > > I added this into my kexec-v12.1 branch.  After some more testing I'll
> > > post a v13 to the list.
> > 
> > You may well need some notrace annotations if you want to run C code
> > here. Can you give it a spin with things like ftrace and kprobes?
> 
> I tested ftrace with and without the notrace annotation on
> cpu_soft_restart() and could do kexec re-boots either way.
> 
> When cpu_soft_restart() is called we've only done setup_mm_for_reboot(),
> so I would think C code should still be OK.

Yeah, you're right. I mistakenly thought that cpu_soft_restart was running
at the idmap address as opposed to the kernel virtual address.

Will
AKASHI Takahiro Dec. 10, 2015, 11:34 a.m. UTC | #29
Marc,

I am back from my vacation.

On 11/27/2015 11:39 PM, Marc Zyngier wrote:
> On 24/11/15 22:25, Geoff Levand wrote:
>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>
>> kdump calls machine_crash_shutdown() to shut down non-boot cpus and
>> save registers' status in per-cpu ELF notes before starting the crash
>> dump kernel. See kernel_kexec().
>>
>> ipi_cpu_stop() is a bit modified and used to support this behavior.
>>
>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> ---
>>   arch/arm64/include/asm/kexec.h    | 34 +++++++++++++++++++++++++++++++++-
>>   arch/arm64/kernel/machine_kexec.c | 31 +++++++++++++++++++++++++++++--
>>   arch/arm64/kernel/smp.c           | 16 ++++++++++++++--
>>   3 files changed, 76 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
>> index 46d63cd..555a955 100644
>> --- a/arch/arm64/include/asm/kexec.h
>> +++ b/arch/arm64/include/asm/kexec.h
>> @@ -30,6 +30,8 @@
>>
>>   #if !defined(__ASSEMBLY__)
>>
>> +extern bool in_crash_kexec;
>> +
>>   /**
>>    * crash_setup_regs() - save registers for the panic kernel
>>    *
>> @@ -40,7 +42,37 @@
>>   static inline void crash_setup_regs(struct pt_regs *newregs,
>>   				    struct pt_regs *oldregs)
>>   {
>> -	/* Empty routine needed to avoid build errors. */
>> +	if (oldregs) {
>> +		memcpy(newregs, oldregs, sizeof(*newregs));
>> +	} else {
>> +		__asm__ __volatile__ (
>> +			"stp	 x0,   x1, [%3, #16 *  0]\n"
>> +			"stp	 x2,   x3, [%3, #16 *  1]\n"
>> +			"stp	 x4,   x5, [%3, #16 *  2]\n"
>> +			"stp	 x6,   x7, [%3, #16 *  3]\n"
>> +			"stp	 x8,   x9, [%3, #16 *  4]\n"
>> +			"stp	x10,  x11, [%3, #16 *  5]\n"
>> +			"stp	x12,  x13, [%3, #16 *  6]\n"
>> +			"stp	x14,  x15, [%3, #16 *  7]\n"
>> +			"stp	x16,  x17, [%3, #16 *  8]\n"
>> +			"stp	x18,  x19, [%3, #16 *  9]\n"
>> +			"stp	x20,  x21, [%3, #16 * 10]\n"
>> +			"stp	x22,  x23, [%3, #16 * 11]\n"
>> +			"stp	x24,  x25, [%3, #16 * 12]\n"
>> +			"stp	x26,  x27, [%3, #16 * 13]\n"
>> +			"stp	x28,  x29, [%3, #16 * 14]\n"
>> +			"str	x30,	   [%3, #16 * 15]\n"
>> +			"mov	%0, sp\n"
>> +			"adr	%1, 1f\n"
>> +			"mrs	%2, spsr_el1\n"
>> +		"1:"
>> +			: "=r" (newregs->sp),
>> +			  "=r" (newregs->pc),
>> +			  "=r" (newregs->pstate)
>> +			: "r"  (&newregs->regs)
>> +			: "memory"
>> +		);
>
> I wonder how useful this thing is, given that it starts by corrupting
> whatever register is holding newregs->regs. Maybe this is not supposed
> to be accurate anyway...

I'm not quite sure about what part of my code you're mentioning here, but
crash_setup_regs() is solely called by crash_kexec(), and panic() is
the only caller of crash_kexec() with NULL argument which, in turn, is
used as 'oldregs' in crash_setup_regs().

Given this fact, I think that the values saved in newregs as indicated above
will be the best estimate of current cpu contexts.

The other caller of crash_kexec() is die() in traps.c, but there we call
it with the explicit cpu context captured at the exception.
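
For reference, the two paths look roughly like this (simplified; the
real callers live in kernel/panic.c, kernel/kexec_core.c and
arch/arm64/kernel/traps.c):

#include <linux/kexec.h>
#include <asm/ptrace.h>

/* panic(): no register context is available, so crash_kexec(NULL) makes
 * crash_setup_regs() sample the live registers with the inline asm. */
static void panic_path(void)
{
        crash_kexec(NULL);
}

/* die(): the exception handler already has a pt_regs, which is passed
 * through and simply memcpy()'d by crash_setup_regs(). */
static void die_path(struct pt_regs *regs)
{
        crash_kexec(regs);
}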

>
>> +	}
>>   }
>>
>>   #endif /* !defined(__ASSEMBLY__) */
>> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
>> index da28a26..d2d7e90 100644
>> --- a/arch/arm64/kernel/machine_kexec.c
>> +++ b/arch/arm64/kernel/machine_kexec.c
>> @@ -9,6 +9,7 @@
>>    * published by the Free Software Foundation.
>>    */
>>
>> +#include <linux/kernel.h>
>>   #include <linux/kexec.h>
>>   #include <linux/of_fdt.h>
>>   #include <linux/slab.h>
>> @@ -23,6 +24,7 @@
>>   extern const unsigned char arm64_relocate_new_kernel[];
>>   extern const unsigned long arm64_relocate_new_kernel_size;
>>
>> +bool in_crash_kexec;
>>   static unsigned long kimage_start;
>>
>>   /**
>> @@ -203,13 +205,38 @@ void machine_kexec(struct kimage *kimage)
>>   	 */
>>
>>   	cpu_soft_restart(virt_to_phys(cpu_reset),
>> -		is_hyp_mode_available(),
>> +		in_crash_kexec ? 0 : is_hyp_mode_available(),
>>   		reboot_code_buffer_phys, kimage->head, kimage_start);
>>
>>   	BUG(); /* Should never get here. */
>>   }
>>
>> +/**
>> + * machine_crash_shutdown - shutdown non-boot cpus and save registers
>> + */
>>   void machine_crash_shutdown(struct pt_regs *regs)
>>   {
>> -	/* Empty routine needed to avoid build errors. */
>> +	struct pt_regs dummy_regs;
>> +	int cpu;
>> +
>> +	local_irq_disable();
>> +
>> +	in_crash_kexec = true;
>> +
>> +	/*
>> +	 * clear and initialize the per-cpu info. This is necessary
>> +	 * because, otherwise, slots for offline cpus would never be
>> +	 * filled up. See smp_send_stop().
>> +	 */
>> +	memset(&dummy_regs, 0, sizeof(dummy_regs));
>> +	for_each_possible_cpu(cpu)
>> +		crash_save_cpu(&dummy_regs, cpu);
>> +
>> +	/* shutdown non-boot cpus */
>> +	smp_send_stop();
>> +
>> +	/* for boot cpu */
>> +	crash_save_cpu(regs, smp_processor_id());
>> +
>> +	pr_info("Starting crashdump kernel...\n");
>>   }
>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>> index b1adc51..15aabef 100644
>> --- a/arch/arm64/kernel/smp.c
>> +++ b/arch/arm64/kernel/smp.c
>> @@ -37,6 +37,7 @@
>>   #include <linux/completion.h>
>>   #include <linux/of.h>
>>   #include <linux/irq_work.h>
>> +#include <linux/kexec.h>
>>
>>   #include <asm/alternative.h>
>>   #include <asm/atomic.h>
>> @@ -54,6 +55,8 @@
>>   #include <asm/ptrace.h>
>>   #include <asm/virt.h>
>>
>> +#include "cpu-reset.h"
>> +
>>   #define CREATE_TRACE_POINTS
>>   #include <trace/events/ipi.h>
>>
>> @@ -683,8 +686,12 @@ static DEFINE_RAW_SPINLOCK(stop_lock);
>>   /*
>>    * ipi_cpu_stop - handle IPI from smp_send_stop()
>>    */
>> -static void ipi_cpu_stop(unsigned int cpu)
>> +static void ipi_cpu_stop(unsigned int cpu, struct pt_regs *regs)
>>   {
>> +#ifdef CONFIG_KEXEC
>> +	/* printing messages may slow down the shutdown. */
>> +	if (!in_crash_kexec)
>> +#endif
>>   	if (system_state == SYSTEM_BOOTING ||
>>   	    system_state == SYSTEM_RUNNING) {
>>   		raw_spin_lock(&stop_lock);
>
> Irrespective of how useful this change is, how about having a predicate
> instead? Something like:
>
> static inline bool is_in_crash_kexec(void)
> {
> #ifdef CONFIG_KEXEC
> 	return in_crash_kexec;
> #else
> 	return false;
> #endif
> }

OK, I will take your idea.

> located in machine_kexec.c (making the in_crash_kexec static), and then

but we cannot make in_crash_kexec static because it is used in both smp.c
and machine_kexec.c.

> 	if (!is_in_crash_kexec() && (systen_state == ... || ...) {
>
> It would certainly look better.

Thanks,
-Takahiro AKASHI

>> @@ -697,6 +704,11 @@ static void ipi_cpu_stop(unsigned int cpu)
>>
>>   	local_irq_disable();
>>
>> +#ifdef CONFIG_KEXEC
>> +	if (in_crash_kexec)
>> +		crash_save_cpu(regs, cpu);
>> +#endif /* CONFIG_KEXEC */
>> +
>
> Same here.
>
>>   	while (1)
>>   		cpu_relax();
>>   }
>> @@ -727,7 +739,7 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
>>
>>   	case IPI_CPU_STOP:
>>   		irq_enter();
>> -		ipi_cpu_stop(cpu);
>> +		ipi_cpu_stop(cpu, regs);
>>   		irq_exit();
>>   		break;
>>
>>
>
> Thanks,
>
> 	M.
>
AKASHI Takahiro Dec. 10, 2015, 11:42 a.m. UTC | #30
Pratyush,

On 12/03/2015 01:15 PM, Pratyush Anand wrote:
> Hi Akashi,
>
> Some of the points that came up while discussing with Mark Salter are worth
> including in v13.
>
> On 24/11/2015:10:25:34 PM, Geoff Levand wrote:
>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>
> [...]
>
>> +/**
>> + * machine_crash_shutdown - shutdown non-boot cpus and save registers
>
> "non-panic" would be correct in stead of "non-boot".

OK, but for consistency with other places (say, arm/kernel/machine_kexec.c)
I prefer "non-crashing cpus."

>> +	/* shutdown non-boot cpus */
>
> Ditto
>
>> +	smp_send_stop();
>> +
>> +	/* for boot cpu */
>
> "for panic cpu"

Ditto.
"for crashing cpu"

>
>
> [...]
>
>> @@ -697,6 +704,11 @@ static void ipi_cpu_stop(unsigned int cpu)
>>
>>   	local_irq_disable();
>
> We have "set_cpu_online(cpu, false);" just before it.
> The panic core is waiting for the non-panic cores to go offline, i.e. for the above event.
>
>>
>> +#ifdef CONFIG_KEXEC
>> +	if (in_crash_kexec)
>> +		crash_save_cpu(regs, cpu);
>> +#endif /* CONFIG_KEXEC */
>
> However, we are still saving crash info for the non-panic cores. So, it would be good
> to move crash_save_cpu() before set_cpu_online() to avoid any race condition.

Good point; the race would be very unlikely, though. I will fix it.
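
Concretely, the tail of ipi_cpu_stop() would then look something like
this (a sketch based on the hunks above):

#ifdef CONFIG_KEXEC
        /* Save this cpu's registers before marking it offline, so the
         * crashing cpu cannot run on past smp_send_stop() while this
         * cpu's slot is still empty. */
        if (in_crash_kexec)
                crash_save_cpu(regs, cpu);
#endif /* CONFIG_KEXEC */

        set_cpu_online(cpu, false);
        local_irq_disable();

        while (1)
                cpu_relax();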

Thanks,
-Takahiro AKASHI

> ~Pratyush
>
Marc Zyngier Dec. 10, 2015, 11:44 a.m. UTC | #31
On 10/12/15 11:34, AKASHI Takahiro wrote:
> Marc,
> 
> I am back from my vacation.
> 
> On 11/27/2015 11:39 PM, Marc Zyngier wrote:
>> On 24/11/15 22:25, Geoff Levand wrote:
>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>
>>> kdump calls machine_crash_shutdown() to shut down non-boot cpus and
>>> save registers' status in per-cpu ELF notes before starting the crash
>>> dump kernel. See kernel_kexec().
>>>
>>> ipi_cpu_stop() is a bit modified and used to support this behavior.
>>>
>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>> ---
>>>   arch/arm64/include/asm/kexec.h    | 34 +++++++++++++++++++++++++++++++++-
>>>   arch/arm64/kernel/machine_kexec.c | 31 +++++++++++++++++++++++++++++--
>>>   arch/arm64/kernel/smp.c           | 16 ++++++++++++++--
>>>   3 files changed, 76 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
>>> index 46d63cd..555a955 100644
>>> --- a/arch/arm64/include/asm/kexec.h
>>> +++ b/arch/arm64/include/asm/kexec.h
>>> @@ -30,6 +30,8 @@
>>>
>>>   #if !defined(__ASSEMBLY__)
>>>
>>> +extern bool in_crash_kexec;
>>> +
>>>   /**
>>>    * crash_setup_regs() - save registers for the panic kernel
>>>    *
>>> @@ -40,7 +42,37 @@
>>>   static inline void crash_setup_regs(struct pt_regs *newregs,
>>>   				    struct pt_regs *oldregs)
>>>   {
>>> -	/* Empty routine needed to avoid build errors. */
>>> +	if (oldregs) {
>>> +		memcpy(newregs, oldregs, sizeof(*newregs));
>>> +	} else {
>>> +		__asm__ __volatile__ (
>>> +			"stp	 x0,   x1, [%3, #16 *  0]\n"
>>> +			"stp	 x2,   x3, [%3, #16 *  1]\n"
>>> +			"stp	 x4,   x5, [%3, #16 *  2]\n"
>>> +			"stp	 x6,   x7, [%3, #16 *  3]\n"
>>> +			"stp	 x8,   x9, [%3, #16 *  4]\n"
>>> +			"stp	x10,  x11, [%3, #16 *  5]\n"
>>> +			"stp	x12,  x13, [%3, #16 *  6]\n"
>>> +			"stp	x14,  x15, [%3, #16 *  7]\n"
>>> +			"stp	x16,  x17, [%3, #16 *  8]\n"
>>> +			"stp	x18,  x19, [%3, #16 *  9]\n"
>>> +			"stp	x20,  x21, [%3, #16 * 10]\n"
>>> +			"stp	x22,  x23, [%3, #16 * 11]\n"
>>> +			"stp	x24,  x25, [%3, #16 * 12]\n"
>>> +			"stp	x26,  x27, [%3, #16 * 13]\n"
>>> +			"stp	x28,  x29, [%3, #16 * 14]\n"
>>> +			"str	x30,	   [%3, #16 * 15]\n"
>>> +			"mov	%0, sp\n"
>>> +			"adr	%1, 1f\n"
>>> +			"mrs	%2, spsr_el1\n"
>>> +		"1:"
>>> +			: "=r" (newregs->sp),
>>> +			  "=r" (newregs->pc),
>>> +			  "=r" (newregs->pstate)
>>> +			: "r"  (&newregs->regs)
>>> +			: "memory"
>>> +		);
>>
>> I wonder how useful this thing is, given that it starts by corrupting
>> whatever register is holding newregs->regs. Maybe this is not supposed
>> to be accurate anyway...
> 
> I'm not quite sure about what part of my code you're mentioning here, but
> crash_setup_regs() is solely called by crash_kexec(), and panic() is
> the only caller of crash_kexec() with NULL argument which, in turn, is
> used as 'oldregs' in crash_setup_regs().

You have this assembly sequence:

stp	 x0,   x1, [%3, #16 *  0]
[...]

where %3 itself is one of the x[0..30] registers. So you are saving
things that have already been corrupted by the saving procedure. Not
sure how useful that is, but as I said, maybe it is not supposed to be
completely accurate.

> Given this fact, I think that the values saved in newregs as indicated above
> will be the best estimate of current cpu contexts.
> 
> The other caller of crash_kexec() is die() in traps.c, but here we call
> it with explicit cpu contexts at exception.
> 
>>
>>> +	}
>>>   }
>>>
>>>   #endif /* !defined(__ASSEMBLY__) */
>>> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
>>> index da28a26..d2d7e90 100644
>>> --- a/arch/arm64/kernel/machine_kexec.c
>>> +++ b/arch/arm64/kernel/machine_kexec.c
>>> @@ -9,6 +9,7 @@
>>>    * published by the Free Software Foundation.
>>>    */
>>>
>>> +#include <linux/kernel.h>
>>>   #include <linux/kexec.h>
>>>   #include <linux/of_fdt.h>
>>>   #include <linux/slab.h>
>>> @@ -23,6 +24,7 @@
>>>   extern const unsigned char arm64_relocate_new_kernel[];
>>>   extern const unsigned long arm64_relocate_new_kernel_size;
>>>
>>> +bool in_crash_kexec;
>>>   static unsigned long kimage_start;
>>>
>>>   /**
>>> @@ -203,13 +205,38 @@ void machine_kexec(struct kimage *kimage)
>>>   	 */
>>>
>>>   	cpu_soft_restart(virt_to_phys(cpu_reset),
>>> -		is_hyp_mode_available(),
>>> +		in_crash_kexec ? 0 : is_hyp_mode_available(),
>>>   		reboot_code_buffer_phys, kimage->head, kimage_start);
>>>
>>>   	BUG(); /* Should never get here. */
>>>   }
>>>
>>> +/**
>>> + * machine_crash_shutdown - shutdown non-boot cpus and save registers
>>> + */
>>>   void machine_crash_shutdown(struct pt_regs *regs)
>>>   {
>>> -	/* Empty routine needed to avoid build errors. */
>>> +	struct pt_regs dummy_regs;
>>> +	int cpu;
>>> +
>>> +	local_irq_disable();
>>> +
>>> +	in_crash_kexec = true;
>>> +
>>> +	/*
>>> +	 * clear and initialize the per-cpu info. This is necessary
>>> +	 * because, otherwise, slots for offline cpus would never be
>>> +	 * filled up. See smp_send_stop().
>>> +	 */
>>> +	memset(&dummy_regs, 0, sizeof(dummy_regs));
>>> +	for_each_possible_cpu(cpu)
>>> +		crash_save_cpu(&dummy_regs, cpu);
>>> +
>>> +	/* shutdown non-boot cpus */
>>> +	smp_send_stop();
>>> +
>>> +	/* for boot cpu */
>>> +	crash_save_cpu(regs, smp_processor_id());
>>> +
>>> +	pr_info("Starting crashdump kernel...\n");
>>>   }
>>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>>> index b1adc51..15aabef 100644
>>> --- a/arch/arm64/kernel/smp.c
>>> +++ b/arch/arm64/kernel/smp.c
>>> @@ -37,6 +37,7 @@
>>>   #include <linux/completion.h>
>>>   #include <linux/of.h>
>>>   #include <linux/irq_work.h>
>>> +#include <linux/kexec.h>
>>>
>>>   #include <asm/alternative.h>
>>>   #include <asm/atomic.h>
>>> @@ -54,6 +55,8 @@
>>>   #include <asm/ptrace.h>
>>>   #include <asm/virt.h>
>>>
>>> +#include "cpu-reset.h"
>>> +
>>>   #define CREATE_TRACE_POINTS
>>>   #include <trace/events/ipi.h>
>>>
>>> @@ -683,8 +686,12 @@ static DEFINE_RAW_SPINLOCK(stop_lock);
>>>   /*
>>>    * ipi_cpu_stop - handle IPI from smp_send_stop()
>>>    */
>>> -static void ipi_cpu_stop(unsigned int cpu)
>>> +static void ipi_cpu_stop(unsigned int cpu, struct pt_regs *regs)
>>>   {
>>> +#ifdef CONFIG_KEXEC
>>> +	/* printing messages may slow down the shutdown. */
>>> +	if (!in_crash_kexec)
>>> +#endif
>>>   	if (system_state == SYSTEM_BOOTING ||
>>>   	    system_state == SYSTEM_RUNNING) {
>>>   		raw_spin_lock(&stop_lock);
>>
>> Irrespective of how useful this change is, how about having a predicate
>> instead? Something like:
>>
>> static inline bool is_in_crash_kexec(void)
>> {
>> #ifdef CONFIG_KEXEC
>> 	return in_crash_kexec;
>> #else
>> 	return false;
>> #endif
>> }
> 
> OK, I will take your idea.
> 
>>> located in machine_kexec.c (making the in_crash_kexec static), and then
> 
>> but we cannot make in_crash_kexec static because it is used in both smp.c
>> and machine_kexec.c.

smp.c only reads from in_crash_kexec (at least from what I can see in
this patch), so it should be able to use the accessor.

Thanks,

	M.
Pratyush Anand Dec. 10, 2015, 11:50 a.m. UTC | #32
On 10/12/2015:08:42:12 PM, AKASHI Takahiro wrote:
> >>+ * machine_crash_shutdown - shutdown non-boot cpus and save registers
> >
> >"non-panic" would be correct in stead of "non-boot".
> 
> OK, but for consistency with other places (say, arm/kernel/machine_kexec.c)
> I prefer "non-crashing cpus."

OK.

> 
> >>+	/* shutdown non-boot cpus */
> >
> >Ditto
> >
> >>+	smp_send_stop();
> >>+
> >>+	/* for boot cpu */
> >
> >"for panic cpu"
> 
> Ditto.
> "for crashing cpu"

OK.

~Pratyush
AKASHI Takahiro Dec. 10, 2015, 12:55 p.m. UTC | #33
On 12/10/2015 08:44 PM, Marc Zyngier wrote:
> On 10/12/15 11:34, AKASHI Takahiro wrote:
>> Marc,
>>
>> I am back from my vacation.
>>
>> On 11/27/2015 11:39 PM, Marc Zyngier wrote:
>>> On 24/11/15 22:25, Geoff Levand wrote:
>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>
>>>> kdump calls machine_crash_shutdown() to shut down non-boot cpus and
>>>> save registers' status in per-cpu ELF notes before starting the crash
>>>> dump kernel. See kernel_kexec().
>>>>
>>>> ipi_cpu_stop() is a bit modified and used to support this behavior.
>>>>
>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>> ---
>>>>    arch/arm64/include/asm/kexec.h    | 34 +++++++++++++++++++++++++++++++++-
>>>>    arch/arm64/kernel/machine_kexec.c | 31 +++++++++++++++++++++++++++++--
>>>>    arch/arm64/kernel/smp.c           | 16 ++++++++++++++--
>>>>    3 files changed, 76 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
>>>> index 46d63cd..555a955 100644
>>>> --- a/arch/arm64/include/asm/kexec.h
>>>> +++ b/arch/arm64/include/asm/kexec.h
>>>> @@ -30,6 +30,8 @@
>>>>
>>>>    #if !defined(__ASSEMBLY__)
>>>>
>>>> +extern bool in_crash_kexec;
>>>> +
>>>>    /**
>>>>     * crash_setup_regs() - save registers for the panic kernel
>>>>     *
>>>> @@ -40,7 +42,37 @@
>>>>    static inline void crash_setup_regs(struct pt_regs *newregs,
>>>>    				    struct pt_regs *oldregs)
>>>>    {
>>>> -	/* Empty routine needed to avoid build errors. */
>>>> +	if (oldregs) {
>>>> +		memcpy(newregs, oldregs, sizeof(*newregs));
>>>> +	} else {
>>>> +		__asm__ __volatile__ (
>>>> +			"stp	 x0,   x1, [%3, #16 *  0]\n"
>>>> +			"stp	 x2,   x3, [%3, #16 *  1]\n"
>>>> +			"stp	 x4,   x5, [%3, #16 *  2]\n"
>>>> +			"stp	 x6,   x7, [%3, #16 *  3]\n"
>>>> +			"stp	 x8,   x9, [%3, #16 *  4]\n"
>>>> +			"stp	x10,  x11, [%3, #16 *  5]\n"
>>>> +			"stp	x12,  x13, [%3, #16 *  6]\n"
>>>> +			"stp	x14,  x15, [%3, #16 *  7]\n"
>>>> +			"stp	x16,  x17, [%3, #16 *  8]\n"
>>>> +			"stp	x18,  x19, [%3, #16 *  9]\n"
>>>> +			"stp	x20,  x21, [%3, #16 * 10]\n"
>>>> +			"stp	x22,  x23, [%3, #16 * 11]\n"
>>>> +			"stp	x24,  x25, [%3, #16 * 12]\n"
>>>> +			"stp	x26,  x27, [%3, #16 * 13]\n"
>>>> +			"stp	x28,  x29, [%3, #16 * 14]\n"
>>>> +			"str	x30,	   [%3, #16 * 15]\n"
>>>> +			"mov	%0, sp\n"
>>>> +			"adr	%1, 1f\n"
>>>> +			"mrs	%2, spsr_el1\n"
>>>> +		"1:"
>>>> +			: "=r" (newregs->sp),
>>>> +			  "=r" (newregs->pc),
>>>> +			  "=r" (newregs->pstate)
>>>> +			: "r"  (&newregs->regs)
>>>> +			: "memory"
>>>> +		);
>>>
>>> I wonder how useful this thing is, given that it starts by corrupting
>>> whatever register is holding newregs->regs. Maybe this is not supposed
>>> to be accurate anyway...
>>
>> I'm not quite sure about what part of my code you're mentioning here, but
>> crash_setup_regs() is solely called by crash_kexec(), and panic() is
>> the only caller of crash_kexec() with NULL argument which, in turn, is
>> used as 'oldregs' in crash_setup_regs().
>
> You have this assembly sequence:
>
> stp	 x0,   x1, [%3, #16 *  0]
> [...]
>
> where %3 itself is one of the x[0..30] registers.

Not only %3, but the other registers as well:

> So you are saving
> things that have already been corrupted by the saving procedure. Not
> sure how useful that is, but as I said, maybe it is not supposed to be
> completely accurate.

x0, x1 ... are the current values in panic(), and not the exact cpu contexts
at the place we are really interested in.
We have no way here in panic() to know them, but sp and pc would still be useful
for back-tracing in later investigation of the dump file.

Please note that the same problem exists in the arm (and x86) implementations.

>> Given this fact, I think that the values saved in newregs as indicated above
>> will be the best estimate of current cpu contexts.
>>
>> The other caller of crash_kexec() is die() in traps.c, but there we call
>> it with the explicit cpu context captured at the exception.
>>
>>>
>>>> +	}
>>>>    }
>>>>
>>>>    #endif /* !defined(__ASSEMBLY__) */
>>>> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
>>>> index da28a26..d2d7e90 100644
>>>> --- a/arch/arm64/kernel/machine_kexec.c
>>>> +++ b/arch/arm64/kernel/machine_kexec.c
>>>> @@ -9,6 +9,7 @@
>>>>     * published by the Free Software Foundation.
>>>>     */
>>>>
>>>> +#include <linux/kernel.h>
>>>>    #include <linux/kexec.h>
>>>>    #include <linux/of_fdt.h>
>>>>    #include <linux/slab.h>
>>>> @@ -23,6 +24,7 @@
>>>>    extern const unsigned char arm64_relocate_new_kernel[];
>>>>    extern const unsigned long arm64_relocate_new_kernel_size;
>>>>
>>>> +bool in_crash_kexec;
>>>>    static unsigned long kimage_start;
>>>>
>>>>    /**
>>>> @@ -203,13 +205,38 @@ void machine_kexec(struct kimage *kimage)
>>>>    	 */
>>>>
>>>>    	cpu_soft_restart(virt_to_phys(cpu_reset),
>>>> -		is_hyp_mode_available(),
>>>> +		in_crash_kexec ? 0 : is_hyp_mode_available(),
>>>>    		reboot_code_buffer_phys, kimage->head, kimage_start);
>>>>
>>>>    	BUG(); /* Should never get here. */
>>>>    }
>>>>
>>>> +/**
>>>> + * machine_crash_shutdown - shutdown non-boot cpus and save registers
>>>> + */
>>>>    void machine_crash_shutdown(struct pt_regs *regs)
>>>>    {
>>>> -	/* Empty routine needed to avoid build errors. */
>>>> +	struct pt_regs dummy_regs;
>>>> +	int cpu;
>>>> +
>>>> +	local_irq_disable();
>>>> +
>>>> +	in_crash_kexec = true;
>>>> +
>>>> +	/*
>>>> +	 * clear and initialize the per-cpu info. This is necessary
>>>> +	 * because, otherwise, slots for offline cpus would never be
>>>> +	 * filled up. See smp_send_stop().
>>>> +	 */
>>>> +	memset(&dummy_regs, 0, sizeof(dummy_regs));
>>>> +	for_each_possible_cpu(cpu)
>>>> +		crash_save_cpu(&dummy_regs, cpu);
>>>> +
>>>> +	/* shutdown non-boot cpus */
>>>> +	smp_send_stop();
>>>> +
>>>> +	/* for boot cpu */
>>>> +	crash_save_cpu(regs, smp_processor_id());
>>>> +
>>>> +	pr_info("Starting crashdump kernel...\n");
>>>>    }
>>>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>>>> index b1adc51..15aabef 100644
>>>> --- a/arch/arm64/kernel/smp.c
>>>> +++ b/arch/arm64/kernel/smp.c
>>>> @@ -37,6 +37,7 @@
>>>>    #include <linux/completion.h>
>>>>    #include <linux/of.h>
>>>>    #include <linux/irq_work.h>
>>>> +#include <linux/kexec.h>
>>>>
>>>>    #include <asm/alternative.h>
>>>>    #include <asm/atomic.h>
>>>> @@ -54,6 +55,8 @@
>>>>    #include <asm/ptrace.h>
>>>>    #include <asm/virt.h>
>>>>
>>>> +#include "cpu-reset.h"
>>>> +
>>>>    #define CREATE_TRACE_POINTS
>>>>    #include <trace/events/ipi.h>
>>>>
>>>> @@ -683,8 +686,12 @@ static DEFINE_RAW_SPINLOCK(stop_lock);
>>>>    /*
>>>>     * ipi_cpu_stop - handle IPI from smp_send_stop()
>>>>     */
>>>> -static void ipi_cpu_stop(unsigned int cpu)
>>>> +static void ipi_cpu_stop(unsigned int cpu, struct pt_regs *regs)
>>>>    {
>>>> +#ifdef CONFIG_KEXEC
>>>> +	/* printing messages may slow down the shutdown. */
>>>> +	if (!in_crash_kexec)
>>>> +#endif
>>>>    	if (system_state == SYSTEM_BOOTING ||
>>>>    	    system_state == SYSTEM_RUNNING) {
>>>>    		raw_spin_lock(&stop_lock);
>>>
>>> Irrespective of how useful this change is, how about having a predicate
>>> instead? Something like:
>>>
>>> static inline bool is_in_crash_kexec(void)
>>> {
>>> #ifdef CONFIG_KEXEC
>>> 	return in_crash_kexec;
>>> #else
>>> 	return false;
>>> #endif
>>> }
>>
>> OK, I will take your idea.
>>
>>> located in machine_kexec.c (making the in_crash_kexec static), and then
>>
>>> but we cannot make in_crash_kexec static because it is used in both smp.c
>>> and machine_kexec.c.
>
> smp.c only reads from in_crash_kexec (at least from what I can see in
> this patch), so it should be able to use the accessor.

Only if we define the accessor as a real function, not an inline function in a header :)
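
For example (a sketch; the exact placement is an assumption):

/* arch/arm64/kernel/machine_kexec.c -- only built when CONFIG_KEXEC=y */
#include <linux/types.h>

static bool in_crash_kexec;

bool is_in_crash_kexec(void)
{
        return in_crash_kexec;
}

/* arch/arm64/include/asm/kexec.h */
#ifdef CONFIG_KEXEC
bool is_in_crash_kexec(void);
#else
static inline bool is_in_crash_kexec(void)
{
        return false;
}
#endif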

-Takahiro AKASHI

> Thanks,
>
> 	M.
>
Marc Zyngier Dec. 10, 2015, 1:43 p.m. UTC | #34
On 10/12/15 12:55, AKASHI Takahiro wrote:
> On 12/10/2015 08:44 PM, Marc Zyngier wrote:
>> On 10/12/15 11:34, AKASHI Takahiro wrote:
>>> Marc,
>>>
>>> I am back from my vacation.
>>>
>>> On 11/27/2015 11:39 PM, Marc Zyngier wrote:
>>>> On 24/11/15 22:25, Geoff Levand wrote:
>>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>
>>>>> kdump calls machine_crash_shutdown() to shut down non-boot cpus and
>>>>> save registers' status in per-cpu ELF notes before starting the crash
>>>>> dump kernel. See kernel_kexec().
>>>>>
>>>>> ipi_cpu_stop() is a bit modified and used to support this behavior.
>>>>>
>>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>> ---
>>>>>    arch/arm64/include/asm/kexec.h    | 34 +++++++++++++++++++++++++++++++++-
>>>>>    arch/arm64/kernel/machine_kexec.c | 31 +++++++++++++++++++++++++++++--
>>>>>    arch/arm64/kernel/smp.c           | 16 ++++++++++++++--
>>>>>    3 files changed, 76 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
>>>>> index 46d63cd..555a955 100644
>>>>> --- a/arch/arm64/include/asm/kexec.h
>>>>> +++ b/arch/arm64/include/asm/kexec.h
>>>>> @@ -30,6 +30,8 @@
>>>>>
>>>>>    #if !defined(__ASSEMBLY__)
>>>>>
>>>>> +extern bool in_crash_kexec;
>>>>> +
>>>>>    /**
>>>>>     * crash_setup_regs() - save registers for the panic kernel
>>>>>     *
>>>>> @@ -40,7 +42,37 @@
>>>>>    static inline void crash_setup_regs(struct pt_regs *newregs,
>>>>>    				    struct pt_regs *oldregs)
>>>>>    {
>>>>> -	/* Empty routine needed to avoid build errors. */
>>>>> +	if (oldregs) {
>>>>> +		memcpy(newregs, oldregs, sizeof(*newregs));
>>>>> +	} else {
>>>>> +		__asm__ __volatile__ (
>>>>> +			"stp	 x0,   x1, [%3, #16 *  0]\n"
>>>>> +			"stp	 x2,   x3, [%3, #16 *  1]\n"
>>>>> +			"stp	 x4,   x5, [%3, #16 *  2]\n"
>>>>> +			"stp	 x6,   x7, [%3, #16 *  3]\n"
>>>>> +			"stp	 x8,   x9, [%3, #16 *  4]\n"
>>>>> +			"stp	x10,  x11, [%3, #16 *  5]\n"
>>>>> +			"stp	x12,  x13, [%3, #16 *  6]\n"
>>>>> +			"stp	x14,  x15, [%3, #16 *  7]\n"
>>>>> +			"stp	x16,  x17, [%3, #16 *  8]\n"
>>>>> +			"stp	x18,  x19, [%3, #16 *  9]\n"
>>>>> +			"stp	x20,  x21, [%3, #16 * 10]\n"
>>>>> +			"stp	x22,  x23, [%3, #16 * 11]\n"
>>>>> +			"stp	x24,  x25, [%3, #16 * 12]\n"
>>>>> +			"stp	x26,  x27, [%3, #16 * 13]\n"
>>>>> +			"stp	x28,  x29, [%3, #16 * 14]\n"
>>>>> +			"str	x30,	   [%3, #16 * 15]\n"
>>>>> +			"mov	%0, sp\n"
>>>>> +			"adr	%1, 1f\n"
>>>>> +			"mrs	%2, spsr_el1\n"
>>>>> +		"1:"
>>>>> +			: "=r" (newregs->sp),
>>>>> +			  "=r" (newregs->pc),
>>>>> +			  "=r" (newregs->pstate)
>>>>> +			: "r"  (&newregs->regs)
>>>>> +			: "memory"
>>>>> +		);
>>>>
>>>> I wonder how useful this thing is, given that it starts by corrupting
>>>> whatever register is holding newregs->regs. Maybe this is not supposed
>>>> to be accurate anyway...
>>>
>>> I'm not quite sure about what part of my code you're mentioning here, but
>>> crash_setup_regs() is solely called by crash_kexec(), and panic() is
>>> the only caller of crash_kexec() with NULL argument which, in turn, is
>>> used as 'oldregs' in crash_setup_regs().
>>
>> You have this assembly sequence:
>>
>> stp	 x0,   x1, [%3, #16 *  0]
>> [...]
>>
>> where %3 itself is one of the x[0..30] registers.
> 
> Not only %3, but the other registers as well:
> 
>> So you are saving
>> things that have already been corrupted by the saving procedure. Not
>> sure how useful that is, but as I said, maybe it is not supposed to be
>> completely accurate.
> 
> x0, x1 ... are the current values in panic(), and not the exact cpu contexts
> at the place we are really interested in.
> We have no way here in panic() to know them, but sp and pc would still be useful
> for back-tracing in later investigation of the dump file.
> 
> Please note that the same problem exists in the arm (and x86) implementations.

As I said: if people don't expect to have a precise dump of the register
file, then fine.

>>> Given this fact, I think that the values saved in newregs as indicated above
>>> will be the best estimate of current cpu contexts.
>>>
>>> The other caller of crash_kexec() is die() in traps.c, but there we call
>>> it with the explicit cpu context captured at the exception.
>>>
>>>>
>>>>> +	}
>>>>>    }
>>>>>
>>>>>    #endif /* !defined(__ASSEMBLY__) */
>>>>> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
>>>>> index da28a26..d2d7e90 100644
>>>>> --- a/arch/arm64/kernel/machine_kexec.c
>>>>> +++ b/arch/arm64/kernel/machine_kexec.c
>>>>> @@ -9,6 +9,7 @@
>>>>>     * published by the Free Software Foundation.
>>>>>     */
>>>>>
>>>>> +#include <linux/kernel.h>
>>>>>    #include <linux/kexec.h>
>>>>>    #include <linux/of_fdt.h>
>>>>>    #include <linux/slab.h>
>>>>> @@ -23,6 +24,7 @@
>>>>>    extern const unsigned char arm64_relocate_new_kernel[];
>>>>>    extern const unsigned long arm64_relocate_new_kernel_size;
>>>>>
>>>>> +bool in_crash_kexec;
>>>>>    static unsigned long kimage_start;
>>>>>
>>>>>    /**
>>>>> @@ -203,13 +205,38 @@ void machine_kexec(struct kimage *kimage)
>>>>>    	 */
>>>>>
>>>>>    	cpu_soft_restart(virt_to_phys(cpu_reset),
>>>>> -		is_hyp_mode_available(),
>>>>> +		in_crash_kexec ? 0 : is_hyp_mode_available(),
>>>>>    		reboot_code_buffer_phys, kimage->head, kimage_start);
>>>>>
>>>>>    	BUG(); /* Should never get here. */
>>>>>    }
>>>>>
>>>>> +/**
>>>>> + * machine_crash_shutdown - shutdown non-boot cpus and save registers
>>>>> + */
>>>>>    void machine_crash_shutdown(struct pt_regs *regs)
>>>>>    {
>>>>> -	/* Empty routine needed to avoid build errors. */
>>>>> +	struct pt_regs dummy_regs;
>>>>> +	int cpu;
>>>>> +
>>>>> +	local_irq_disable();
>>>>> +
>>>>> +	in_crash_kexec = true;
>>>>> +
>>>>> +	/*
>>>>> +	 * clear and initialize the per-cpu info. This is necessary
>>>>> +	 * because, otherwise, slots for offline cpus would never be
>>>>> +	 * filled up. See smp_send_stop().
>>>>> +	 */
>>>>> +	memset(&dummy_regs, 0, sizeof(dummy_regs));
>>>>> +	for_each_possible_cpu(cpu)
>>>>> +		crash_save_cpu(&dummy_regs, cpu);
>>>>> +
>>>>> +	/* shutdown non-boot cpus */
>>>>> +	smp_send_stop();
>>>>> +
>>>>> +	/* for boot cpu */
>>>>> +	crash_save_cpu(regs, smp_processor_id());
>>>>> +
>>>>> +	pr_info("Starting crashdump kernel...\n");
>>>>>    }
>>>>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>>>>> index b1adc51..15aabef 100644
>>>>> --- a/arch/arm64/kernel/smp.c
>>>>> +++ b/arch/arm64/kernel/smp.c
>>>>> @@ -37,6 +37,7 @@
>>>>>    #include <linux/completion.h>
>>>>>    #include <linux/of.h>
>>>>>    #include <linux/irq_work.h>
>>>>> +#include <linux/kexec.h>
>>>>>
>>>>>    #include <asm/alternative.h>
>>>>>    #include <asm/atomic.h>
>>>>> @@ -54,6 +55,8 @@
>>>>>    #include <asm/ptrace.h>
>>>>>    #include <asm/virt.h>
>>>>>
>>>>> +#include "cpu-reset.h"
>>>>> +
>>>>>    #define CREATE_TRACE_POINTS
>>>>>    #include <trace/events/ipi.h>
>>>>>
>>>>> @@ -683,8 +686,12 @@ static DEFINE_RAW_SPINLOCK(stop_lock);
>>>>>    /*
>>>>>     * ipi_cpu_stop - handle IPI from smp_send_stop()
>>>>>     */
>>>>> -static void ipi_cpu_stop(unsigned int cpu)
>>>>> +static void ipi_cpu_stop(unsigned int cpu, struct pt_regs *regs)
>>>>>    {
>>>>> +#ifdef CONFIG_KEXEC
>>>>> +	/* printing messages may slow down the shutdown. */
>>>>> +	if (!in_crash_kexec)
>>>>> +#endif
>>>>>    	if (system_state == SYSTEM_BOOTING ||
>>>>>    	    system_state == SYSTEM_RUNNING) {
>>>>>    		raw_spin_lock(&stop_lock);
>>>>
>>>> Irrespective of how useful this change is, how about having a predicate
>>>> instead? Something like:
>>>>
>>>> static inline bool is_in_crash_kexec(void)
>>>> {
>>>> #ifdef CONFIG_KEXEC
>>>> 	return in_crash_kexec;
>>>> #else
>>>> 	return false;
>>>> #endif
>>>> }
>>>
>>> OK, I will take your idea.
>>>
>>>> located in machine_kexec.c (making the in_crash_kexec static), and then
>>>
>>> but we cannot make in_crash_kexec static because it is used in both smp.c
>>> and machine_kexec.c.
>>
>> smp.c only reads from in_crash_kexec (at least from what I can see in
>> this patch), so it should be able to use the accessor.
> 
> Only if we define the accessor as a real function, not an inline function in a header :)

I'll leave the implementation details in your capable hands. :-)

Thanks,

	M.
Geoff Levand Dec. 10, 2015, 6:31 p.m. UTC | #35
Hi All,

On Thu, 2015-12-03 at 13:58 +0000, Marc Zyngier wrote:
> Indeed, this is completely broken (I just reproduced the issue on a
> model).

There are users out there that want kexec/kdump support.  I think we
should remove this patch from the series, replace it with a temporary
Kconfig conditional that allows only one of KVM or kexec to be enabled,
and merge kexec and kdump support in for 4.5.  This will satisfy users
who need kexec but not KVM, like kexec-based bootloaders.  Once this
KVM hotplug patch is fixed we then merge it, either for 4.5 or 4.6,
depending on the timing.
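
The temporary conditional could be as simple as this sketch (only the
'depends on !KVM' line is the proposed change; the rest of the existing
KEXEC entry is elided):

config KEXEC
        bool "kexec system call"
        depends on !KVM         # temporary, until the KVM hotplug patch is fixed
        select KEXEC_CORE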

I'll prepare and post a v13 series that does the above.

-Geoff
Yang Shi Dec. 10, 2015, 6:44 p.m. UTC | #36
On 12/3/2015 5:58 AM, Marc Zyngier wrote:
> On 02/12/15 22:40, Ashwin Chaugule wrote:
>> Hello,
>>
>> On 24 November 2015 at 17:25, Geoff Levand <geoff@infradead.org> wrote:
>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>
>>> The current kvm implementation on arm64 does cpu-specific initialization
>>> at system boot, and has no way to gracefully shutdown a core in terms of
>>> kvm. This prevents, especially, kexec from rebooting the system on a boot
>>> core in EL2.
>>>
>>> This patch adds a cpu tear-down function and also puts an existing cpu-init
>>> code into a separate function, kvm_arch_hardware_disable() and
>>> kvm_arch_hardware_enable() respectively.
>>> We don't need arm64-specific cpu hotplug hook any more.
>>>
>>> Since this patch modifies common part of code between arm and arm64, one
>>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
>>> compiling errors.
>>>
>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>> ---
>>>   arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>>   arch/arm/include/asm/kvm_mmu.h    |  1 +
>>>   arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>>>   arch/arm/kvm/mmu.c                |  5 +++
>>>   arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>>   arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>>   arch/arm64/include/asm/virt.h     |  9 +++++
>>>   arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>>   arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>>   9 files changed, 138 insertions(+), 48 deletions(-)
>>
>> [..]
>>
>>>
>>>
>>>   static struct notifier_block hyp_init_cpu_pm_nb = {
>>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>>          }
>>>
>>>          /*
>>> -        * Execute the init code on each CPU.
>>> -        */
>>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>>> -
>>> -       /*
>>>           * Init HYP view of VGIC
>>>           */
>>>          err = kvm_vgic_hyp_init();
>>
>> With this flow, the cpu_init_hyp_mode() is called only at VM guest
>> creation, but vgic_hyp_init() is called at bootup. On a system with
>> GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
>> (to get the number of LRs), because we're not reading it from EL2
>> anymore.
>
> Indeed, this is completely broken (I just reproduced the issue on a
> model). I wish this kind of detail had been checked earlier, but thanks
> for pointing it out.

Sorry for chiming in late. I did run into something similar when I
tested some earlier versions of these patches with Akashi-san.

The guest bootup just hangs with a 4.1 kernel on my LS2085 board, which
has GICv3. Not sure if this is the same issue, but my bisect did point
to this patch.

I had thought it was a problem with my kernel (4.1 is a little old),
since Akashi-san couldn't reproduce this on his HiKey board.

Thanks,
Yang

>
>> What's the best way to fix this?
>> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
>> - Fold the VGIC init stuff back into hardware_enable()?
>
> None of that works - kvm_arch_hardware_enable() is called once per CPU,
> while vgic_hyp_init() can only be called once. Also,
> kvm_arch_hardware_enable() is called from interrupt context, and I
> wouldn't feel comfortable starting probing DT and allocating stuff from
> there.
>
>> - Read the VGIC number of LRs from the hyp stub?
>
> That may UNDEF if called in the wrong context. Also, this defeats the
> point of stubs, which is just to install the vectors for the hypervisor.
>
>> - ..
>
> Yeah, quite.
>
> Geoff, Takahiro?
>
> 	M.
>
AKASHI Takahiro Dec. 11, 2015, 8:09 a.m. UTC | #37
Yang,

On 12/11/2015 03:44 AM, Shi, Yang wrote:
> On 12/3/2015 5:58 AM, Marc Zyngier wrote:
>> On 02/12/15 22:40, Ashwin Chaugule wrote:
>>> Hello,
>>>
>>> On 24 November 2015 at 17:25, Geoff Levand <geoff@infradead.org> wrote:
>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>
>>>> The current kvm implementation on arm64 does cpu-specific initialization
>>>> at system boot, and has no way to gracefully shutdown a core in terms of
>>>> kvm. This prevents, especially, kexec from rebooting the system on a boot
>>>> core in EL2.
>>>>
>>>> This patch adds a cpu tear-down function and also puts an existing cpu-init
>>>> code into a separate function, kvm_arch_hardware_disable() and
>>>> kvm_arch_hardware_enable() respectively.
>>>> We don't need arm64-specific cpu hotplug hook any more.
>>>>
>>>> Since this patch modifies common part of code between arm and arm64, one
>>>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
>>>> compiling errors.
>>>>
>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>> ---
>>>>   arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>>>   arch/arm/include/asm/kvm_mmu.h    |  1 +
>>>>   arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>>>>   arch/arm/kvm/mmu.c                |  5 +++
>>>>   arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>>>   arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>>>   arch/arm64/include/asm/virt.h     |  9 +++++
>>>>   arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>>>   arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>>>   9 files changed, 138 insertions(+), 48 deletions(-)
>>>
>>> [..]
>>>
>>>>
>>>>
>>>>   static struct notifier_block hyp_init_cpu_pm_nb = {
>>>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>>>          }
>>>>
>>>>          /*
>>>> -        * Execute the init code on each CPU.
>>>> -        */
>>>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>>>> -
>>>> -       /*
>>>>           * Init HYP view of VGIC
>>>>           */
>>>>          err = kvm_vgic_hyp_init();
>>>
>>> With this flow, the cpu_init_hyp_mode() is called only at VM guest
>>> creation, but vgic_hyp_init() is called at bootup. On a system with
>>> GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
>>> (to get the number of LRs), because we're not reading it from EL2
>>> anymore.
>>
>> Indeed, this is completely broken (I just reproduced the issue on a
>> model). I wish this kind of detail had been checked earlier, but thanks
>> for pointing it out.
>
> Sorry for chiming in late. I did run into something similar when I tested some earlier versions of these patches with Akashi-san.
>
> The guest bootup just hangs with a 4.1 kernel on my LS2085 board, which has GICv3. Not sure if this is the same issue, but
> my bisect did point to this patch.

Can you please try my fixup! patch to determine whether it is the same issue or not?

Thanks
-Takahiro AKASHI


> I had thought it was a problem with my kernel (4.1 is a little old), since Akashi-san couldn't reproduce this on his
> HiKey board.
>
> Thanks,
> Yang
>
>>
>>> What's the best way to fix this?
>>> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
>>> - Fold the VGIC init stuff back into hardware_enable()?
>>
>> None of that works - kvm_arch_hardware_enable() is called once per CPU,
>> while vgic_hyp_init() can only be called once. Also,
>> kvm_arch_hardware_enable() is called from interrupt context, and I
>> wouldn't feel comfortable starting probing DT and allocating stuff from
>> there.
>>
>>> - Read the VGIC number of LRs from the hyp stub?
>>
>> That may UNDEF if called in the wrong context. Also, this defeats the
>> point of stubs, which is just to install the vectors for the hypervisor.
>>
>>> - ..
>>
>> Yeah, quite.
>>
>> Geoff, Takahiro?
>>
>>     M.
>>
>
Shanker Donthineni Dec. 11, 2015, 1 p.m. UTC | #38
On 12/11/2015 02:06 AM, AKASHI Takahiro wrote:
> Ashwin, Marc,
>
> On 12/03/2015 10:58 PM, Marc Zyngier wrote:
>> On 02/12/15 22:40, Ashwin Chaugule wrote:
>>> Hello,
>>>
>>> On 24 November 2015 at 17:25, Geoff Levand <geoff@infradead.org> wrote:
>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>
>>>> The current kvm implementation on arm64 does cpu-specific initialization
>>>> at system boot, and has no way to gracefully shutdown a core in terms of
>>>> kvm. This prevents, especially, kexec from rebooting the system on a boot
>>>> core in EL2.
>>>>
>>>> This patch adds a cpu tear-down function and also puts an existing cpu-init
>>>> code into a separate function, kvm_arch_hardware_disable() and
>>>> kvm_arch_hardware_enable() respectively.
>>>> We don't need arm64-specific cpu hotplug hook any more.
>>>>
>>>> Since this patch modifies common part of code between arm and arm64, one
>>>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
>>>> compiling errors.
>>>>
>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>> ---
>>>>   arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>>>   arch/arm/include/asm/kvm_mmu.h    |  1 +
>>>>   arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>>>>   arch/arm/kvm/mmu.c                |  5 +++
>>>>   arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>>>   arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>>>   arch/arm64/include/asm/virt.h     |  9 +++++
>>>>   arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>>>   arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>>>   9 files changed, 138 insertions(+), 48 deletions(-)
>>>
>>> [..]
>>>
>>>>
>>>>
>>>>   static struct notifier_block hyp_init_cpu_pm_nb = {
>>>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>>>          }
>>>>
>>>>          /*
>>>> -        * Execute the init code on each CPU.
>>>> -        */
>>>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>>>> -
>>>> -       /*
>>>>           * Init HYP view of VGIC
>>>>           */
>>>>          err = kvm_vgic_hyp_init();
>>>
>>> With this flow, the cpu_init_hyp_mode() is called only at VM guest
>>> creation, but vgic_hyp_init() is called at bootup. On a system with
>>> GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
>>> (to get the number of LRs), because we're not reading it from EL2
>>> anymore.
>
> Thank you for pointing this out.
> Recently I tested my kdump code on hikey, and as hikey(hi6220) has gic-400,
> I didn't notice this problem.
>
>> Indeed, this is completely broken (I just reproduced the issue on a
>> model). I wish this kind of detail had been checked earlier, but thanks
>> for pointing it out.
>>
>>> What's the best way to fix this?
>>> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
>>> - Fold the VGIC init stuff back into hardware_enable()?
>>
>> None of that works - kvm_arch_hardware_enable() is called once per CPU,
>> while vgic_hyp_init() can only be called once. Also,
>> kvm_arch_hardware_enable() is called from interrupt context, and I
>> wouldn't feel comfortable starting probing DT and allocating stuff from
>> there.
>
> Do you think so?
> How about the fixup! patch attached below?
> The point is that, like Ashwin's first idea, we initialize cpus temporarily
> before kvm_vgic_hyp_init() and then soon reset cpus again. Thus,
> kvm cpu hotplug will still continue to work as before.
> Now that cpu_init_hyp_mode() is revived as exactly the same as Marc's
> original code, the change will not be a big jump.
I have tested this patch and verified the fix on our hardware with GICv3 
without any issue.
>
> If kvm_hyp_call() in vgic_v3_probe()/kvm_vgic_hyp_init() is a *problem*,
> I hope this should work. Actually I confirmed that, with this fixup! patch,
> we could run a kvm guest and also successfully executed kexec on model w/gic-v3.
>
> My only concern is the following kernel message I saw when kexec shut down
> the kernel:
> (Please note that I was running one kvm guest (pid=961) here.)
>
> ===
> sh-4.3# ./kexec -d -e
> kexec version: 15.11.16.11.06-g41e52e2
> arch_process_options:112: command_line: (null)
> arch_process_options:114: initrd: (null)
> arch_process_options:115: dtb: (null)
> arch_process_options:117: port: 0x0
> kvm: exiting hardware virtualization
> kvm [961]: Unsupported exception type: 6248304    <== this message
> kexec_core: Starting new kernel
> Disabling non-boot CPUs ...
> CPU1: shutdown
> CPU2: shutdown
> CPU3: shutdown
> CPU4: shutdown
> CPU5: shutdown
> CPU6: shutdown
> CPU7: shutdown
> Bye!
> Booting Linux on physical CPU 0x0
> ...
> ===
>
> I don't know whether we can ignore this kind of message or not.
> Any thoughts?
>
> Thanks,
> -Takahiro AKASHI
>
>
>
>>> - Read the VGIC number of LRs from the hyp stub?
>>
>> That may UNDEF if called in the wrong context. Also, this defeats the
>> point of stubs, which is just to install the vectors for the hypervisor.
>>
>>> - ..
>>
>> Yeah, quite.
>>
>> Geoff, Takahiro?
>>
>>     M.
>>
> ----8<----
> From 66ca3baedf45c78c81a76ea77ddd6ace49550ab6 Mon Sep 17 00:00:00 2001
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> Date: Fri, 11 Dec 2015 13:43:35 +0900
> Subject: [PATCH 7/7] fixup! arm64: kvm: allows kvm cpu hotplug
>
> ---
>  arch/arm/kvm/arm.c |   37 +++++++++++++++++++++++++++----------
>  1 file changed, 27 insertions(+), 10 deletions(-)
>
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 518c3c7..8fe59ba 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -950,7 +950,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
>      }
>  }
>
> -int kvm_arch_hardware_enable(void)
> +static void cpu_init_hyp_mode(void *dummy)
>  {
>      phys_addr_t boot_pgd_ptr;
>      phys_addr_t pgd_ptr;
> @@ -958,9 +958,6 @@ int kvm_arch_hardware_enable(void)
>      unsigned long stack_page;
>      unsigned long vector_ptr;
>
> -    if (__hyp_get_vectors() != hyp_default_vectors)
> -        return 0;
> -
>      /* Switch from the HYP stub to our own HYP init vector */
>      __hyp_set_vectors(kvm_get_idmap_vector());
>
> @@ -973,24 +970,35 @@ int kvm_arch_hardware_enable(void)
>      __cpu_init_hyp_mode(boot_pgd_ptr, pgd_ptr, hyp_stack_ptr, vector_ptr);
>
>      kvm_arm_init_debug();
> -
> -    return 0;
>  }
>
> -void kvm_arch_hardware_disable(void)
> +static void cpu_reset_hyp_mode(void *dummy)
>  {
>      phys_addr_t boot_pgd_ptr;
>      phys_addr_t phys_idmap_start;
>
> -    if (__hyp_get_vectors() == hyp_default_vectors)
> -        return;
> -
>      boot_pgd_ptr = kvm_mmu_get_boot_httbr();
>      phys_idmap_start = kvm_get_idmap_start();
>
>      __cpu_reset_hyp_mode(boot_pgd_ptr, phys_idmap_start);
>  }
>
> +int kvm_arch_hardware_enable(void)
> +{
> +    if (__hyp_get_vectors() == hyp_default_vectors)
> +        cpu_init_hyp_mode(NULL);
> +
> +    return 0;
> +}
> +
> +void kvm_arch_hardware_disable(void)
> +{
> +    if (__hyp_get_vectors() == hyp_default_vectors)
> +        return;
> +
> +    cpu_reset_hyp_mode(NULL);
> +}
> +
>  #ifdef CONFIG_CPU_PM
>  static int hyp_init_cpu_pm_notifier(struct notifier_block *self,
>                      unsigned long cmd,
> @@ -1114,6 +1122,12 @@ static int init_hyp_mode(void)
>      }
>
>      /*
> +     * Execute the init code on each CPU.
> +     * Only needed to execute kvm_hyp_call() during *_hyp_init().
> +     */
> +    on_each_cpu(cpu_init_hyp_mode, NULL, 1);
> +
> +    /*
>       * Init HYP view of VGIC
>       */
>      err = kvm_vgic_hyp_init();
> @@ -1127,6 +1141,8 @@ static int init_hyp_mode(void)
>      if (err)
>          goto out_free_context;
>
> +    on_each_cpu(cpu_reset_hyp_mode, NULL, 1);
> +
>  #ifndef CONFIG_HOTPLUG_CPU
>      free_boot_hyp_pgd();
>  #endif
> @@ -1137,6 +1153,7 @@ static int init_hyp_mode(void)
>
>      return 0;
>  out_free_context:
> +    on_each_cpu(cpu_reset_hyp_mode, NULL, 1);
>      free_percpu(kvm_host_cpu_state);
>  out_free_mappings:
>      free_hyp_pgds();
Marc Zyngier Dec. 11, 2015, 4:28 p.m. UTC | #39
On 11/12/15 08:06, AKASHI Takahiro wrote:
> Ashwin, Marc,
> 
> On 12/03/2015 10:58 PM, Marc Zyngier wrote:
>> On 02/12/15 22:40, Ashwin Chaugule wrote:
>>> Hello,
>>>
>>> On 24 November 2015 at 17:25, Geoff Levand <geoff@infradead.org> wrote:
>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>
>>>> The current kvm implementation on arm64 does cpu-specific initialization
>>>> at system boot, and has no way to gracefully shutdown a core in terms of
>>>> kvm. This prevents, especially, kexec from rebooting the system on a boot
>>>> core in EL2.
>>>>
>>>> This patch adds a cpu tear-down function and also puts an existing cpu-init
>>>> code into a separate function, kvm_arch_hardware_disable() and
>>>> kvm_arch_hardware_enable() respectively.
>>>> We don't need arm64-specific cpu hotplug hook any more.
>>>>
>>>> Since this patch modifies common part of code between arm and arm64, one
>>>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
>>>> compiling errors.
>>>>
>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>> ---
>>>>   arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>>>   arch/arm/include/asm/kvm_mmu.h    |  1 +
>>>>   arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>>>>   arch/arm/kvm/mmu.c                |  5 +++
>>>>   arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>>>   arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>>>   arch/arm64/include/asm/virt.h     |  9 +++++
>>>>   arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>>>   arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>>>   9 files changed, 138 insertions(+), 48 deletions(-)
>>>
>>> [..]
>>>
>>>>
>>>>
>>>>   static struct notifier_block hyp_init_cpu_pm_nb = {
>>>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>>>          }
>>>>
>>>>          /*
>>>> -        * Execute the init code on each CPU.
>>>> -        */
>>>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>>>> -
>>>> -       /*
>>>>           * Init HYP view of VGIC
>>>>           */
>>>>          err = kvm_vgic_hyp_init();
>>>
>>> With this flow, the cpu_init_hyp_mode() is called only at VM guest
>>> creation, but vgic_hyp_init() is called at bootup. On a system with
>>> GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
>>> (to get the number of LRs), because we're not reading it from EL2
>>> anymore.
> 
> Thank you for pointing this out.
> Recently I tested my kdump code on hikey, and as hikey(hi6220) has gic-400,
> I didn't notice this problem.

Because GIC-400 is a GICv2 implementation, which is entirely MMIO based.
GICv3 uses some system registers that are only available at EL2, and KVM
needs some information contained in these registers before being able to
get initialized.
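
For reference, the read that goes wrong here is in vgic_v3_probe() -
roughly, quoting from memory:

    ich_vtr_el2 = kvm_call_hyp(__vgic_v3_get_ich_vtr_el2);
    /* ICH_VTR_EL2 is EL2-only; with the stub vectors still installed,
       this hyp call returns junk instead of the number of LRs */
    vgic->nr_lr = (ich_vtr_el2 & 0xf) + 1;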

>> Indeed, this is completely broken (I just reproduced the issue on a
>> model). I wish this kind of details had been checked earlier, but thanks
>> for pointing it out.
>>
>>> Whats the best way to fix this?
>>> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
>>> - Fold the VGIC init stuff back into hardware_enable()?
>>
>> None of that works - kvm_arch_hardware_enable() is called once per CPU,
>> while vgic_hyp_init() can only be called once. Also,
>> kvm_arch_hardware_enable() is called from interrupt context, and I
>> wouldn't feel comfortable starting probing DT and allocating stuff from
>> there.
> 
> Do you think so?
> How about the fixup! patch attached below?
> The point is that, like Ashwin's first idea, we initialize cpus temporarily
> before kvm_vgic_hyp_init() and then soon reset cpus again. Thus,
> kvm cpu hotplug will still continue to work as before.
> Now that cpu_init_hyp_mode() is revived as exactly the same as Marc's
> original code, the change will not be a big jump.

This seems quite complicated:
- init EL2 on  all CPUs
- do some initialization
- tear all CPUs EL2 down
- let KVM drive the vectors being set or not

My questions are: why do we need to do this on *all* cpus? Can't that
work on a single one?

Also, the simple fact that we were able to get some junk value is a sign
that something is amiss. I'd expect a splat of some sort, because we now
have a possibility of doing things in the wrong context.

> 
> If kvm_hyp_call() in vgic_v3_probe()/kvm_vgic_hyp_init() is a *problem*,
> I hope this should work. Actually I confirmed that, with this fixup! patch,
> we could run a kvm guest and also successfully executed kexec on model w/gic-v3.
> 
> My only concern is the following kernel message I saw when kexec shut down
> the kernel:
> (Please note that I was running one kvm guest (pid=961) here.)
> 
> ===
> sh-4.3# ./kexec -d -e
> kexec version: 15.11.16.11.06-g41e52e2
> arch_process_options:112: command_line: (null)
> arch_process_options:114: initrd: (null)
> arch_process_options:115: dtb: (null)
> arch_process_options:117: port: 0x0
> kvm: exiting hardware virtualization
> kvm [961]: Unsupported exception type: 6248304    <== this message

That makes me feel very uncomfortable. It looks like we've exited a
guest with some horrible value in X0. How is that even possible?

This deserves to be investigated.

Thanks,

	M.
Will Deacon Dec. 11, 2015, 4:31 p.m. UTC | #40
On Thu, Dec 10, 2015 at 10:31:48AM -0800, Geoff Levand wrote:
> On Thu, 2015-12-03 at 13:58 +0000, Marc Zyngier wrote:
> > Indeed, this is completely broken (I just reproduced the issue on a
> > model).
> 
> There are users out there that want kexec/kdump support.  I think we
> should remove this patch from the series, replace it with a temporary
> Kconfig conditional that allows only one of KVM or kexec to be enabled,
> and merge kexec and kdump support in for 4.5.  This will satisfy users
> who need kexec but not KVM, like kexec based bootloaders.  Once this
> KVM hot plug patch is fixed we then merge it, either for 4.5 or 4.6,
> depending on the timing.
> 
> I'll prepare and post a v13 series that does the above.

I'm not really keen on merging this based on a "temporary" Kconfig change
with a promise to have it resolved in the future. Given how long this
series has been floating around already, I'm not filled with confidence
at the prospect of waiting for a fixup patch after it's been merged.

Please address the outstanding review comments before reposting.

Will
Shanker Donthineni Dec. 11, 2015, 6 p.m. UTC | #41
On 12/11/2015 10:28 AM, Marc Zyngier wrote:
> On 11/12/15 08:06, AKASHI Takahiro wrote:
>> Ashwin, Marc,
>>
>> On 12/03/2015 10:58 PM, Marc Zyngier wrote:
>>> On 02/12/15 22:40, Ashwin Chaugule wrote:
>>>> Hello,
>>>>
>>>> On 24 November 2015 at 17:25, Geoff Levand <geoff@infradead.org> wrote:
>>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>
>>>>> The current kvm implementation on arm64 does cpu-specific initialization
>>>>> at system boot, and has no way to gracefully shutdown a core in terms of
>>>>> kvm. This prevents, especially, kexec from rebooting the system on a boot
>>>>> core in EL2.
>>>>>
>>>>> This patch adds a cpu tear-down function and also puts an existing cpu-init
>>>>> code into a separate function, kvm_arch_hardware_disable() and
>>>>> kvm_arch_hardware_enable() respectively.
>>>>> We don't need arm64-specific cpu hotplug hook any more.
>>>>>
>>>>> Since this patch modifies common part of code between arm and arm64, one
>>>>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
>>>>> compiling errors.
>>>>>
>>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>> ---
>>>>>    arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>>>>    arch/arm/include/asm/kvm_mmu.h    |  1 +
>>>>>    arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>>>>>    arch/arm/kvm/mmu.c                |  5 +++
>>>>>    arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>>>>    arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>>>>    arch/arm64/include/asm/virt.h     |  9 +++++
>>>>>    arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>>>>    arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>>>>    9 files changed, 138 insertions(+), 48 deletions(-)
>>>> [..]
>>>>
>>>>>
>>>>>    static struct notifier_block hyp_init_cpu_pm_nb = {
>>>>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>>>>           }
>>>>>
>>>>>           /*
>>>>> -        * Execute the init code on each CPU.
>>>>> -        */
>>>>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>>>>> -
>>>>> -       /*
>>>>>            * Init HYP view of VGIC
>>>>>            */
>>>>>           err = kvm_vgic_hyp_init();
>>>> With this flow, the cpu_init_hyp_mode() is called only at VM guest
>>>> creation, but vgic_hyp_init() is called at bootup. On a system with
>>>> GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
>>>> (to get the number of LRs), because we're not reading it from EL2
>>>> anymore.
>> Thank you for pointing this out.
>> Recently I tested my kdump code on hikey, and as hikey(hi6220) has gic-400,
>> I didn't notice this problem.
> Because GIC-400 is a GICv2 implementation, which is entirely MMIO based.
> GICv3 uses some system registers that are only available at EL2, and KVM
> needs some information contained in these registers before being able to
> get initialized.
>
>>> Indeed, this is completely broken (I just reproduced the issue on a
>>> model). I wish this kind of details had been checked earlier, but thanks
>>> for pointing it out.
>>>
>>>> Whats the best way to fix this?
>>>> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
>>>> - Fold the VGIC init stuff back into hardware_enable()?
>>> None of that works - kvm_arch_hardware_enable() is called once per CPU,
>>> while vgic_hyp_init() can only be called once. Also,
>>> kvm_arch_hardware_enable() is called from interrupt context, and I
>>> wouldn't feel comfortable starting probing DT and allocating stuff from
>>> there.
>> Do you think so?
>> How about the fixup! patch attached below?
>> The point is that, like Ashwin's first idea, we initialize cpus temporarily
>> before kvm_vgic_hyp_init() and then soon reset cpus again. Thus,
>> kvm cpu hotplug will still continue to work as before.
>> Now that cpu_init_hyp_mode() is revived as exactly the same as Marc's
>> original code, the change will not be a big jump.
> This seems quite complicated:
> - init EL2 on  all CPUs
> - do some initialization
> - tear all CPUs EL2 down
> - let KVM drive the vectors being set or not
>
> My questions are: why do we need to do this on *all* cpus? Can't that
> work on a single one?
>   

Single CPU EL2 initialization should be fine as long as no kernel
preemption happens between the EL2 init and the kvm_vgic_hyp_init()
execution. The function init_hyp_mode() is called by do_basic_setup()
with preemption enabled.

I don't have deeper knowledge of how the scheduler behaves during kernel
boot, but initializing all CPUs definitely helps if preemption happens
after the EL2 init and before the ICH_VTR_EL2 register is read in
kvm_vgic_hyp_init().

> Also, the simple fact that we were able to get some junk value is a sign
> that something is amiss. I'd expect a splat of some sort, because we now
> have a possibility of doing things in the wrong context.
>
>> If kvm_hyp_call() in vgic_v3_probe()/kvm_vgic_hyp_init() is a *problem*,
>> I hope this should work. Actually I confirmed that, with this fixup! patch,
>> we could run a kvm guest and also successfully executed kexec on model w/gic-v3.
>>
>> My only concern is the following kernel message I saw when kexec shut down
>> the kernel:
>> (Please note that I was running one kvm guest (pid=961) here.)
>>
>> ===
>> sh-4.3# ./kexec -d -e
>> kexec version: 15.11.16.11.06-g41e52e2
>> arch_process_options:112: command_line: (null)
>> arch_process_options:114: initrd: (null)
>> arch_process_options:115: dtb: (null)
>> arch_process_options:117: port: 0x0
>> kvm: exiting hardware virtualization
>> kvm [961]: Unsupported exception type: 6248304    <== this message
> That makes me feel very uncomfortable. It looks like we've exited a
> guest with some horrible value in X0. How is that even possible?
>
> This deserves to be investigated.
>
> Thanks,
>
> 	M.
Marc Zyngier Dec. 11, 2015, 6:11 p.m. UTC | #42
On 11/12/15 18:00, Shanker Donthineni wrote:
> 
> 
> On 12/11/2015 10:28 AM, Marc Zyngier wrote:
>> On 11/12/15 08:06, AKASHI Takahiro wrote:
>>> Ashwin, Marc,
>>>
>>> On 12/03/2015 10:58 PM, Marc Zyngier wrote:
>>>> On 02/12/15 22:40, Ashwin Chaugule wrote:
>>>>> Hello,
>>>>>
>>>>> On 24 November 2015 at 17:25, Geoff Levand <geoff@infradead.org> wrote:
>>>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>>
>>>>>> The current kvm implementation on arm64 does cpu-specific initialization
>>>>>> at system boot, and has no way to gracefully shutdown a core in terms of
>>>>>> kvm. This prevents, especially, kexec from rebooting the system on a boot
>>>>>> core in EL2.
>>>>>>
>>>>>> This patch adds a cpu tear-down function and also puts an existing cpu-init
>>>>>> code into a separate function, kvm_arch_hardware_disable() and
>>>>>> kvm_arch_hardware_enable() respectively.
>>>>>> We don't need arm64-specific cpu hotplug hook any more.
>>>>>>
>>>>>> Since this patch modifies common part of code between arm and arm64, one
>>>>>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
>>>>>> compiling errors.
>>>>>>
>>>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>> ---
>>>>>>    arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>>>>>    arch/arm/include/asm/kvm_mmu.h    |  1 +
>>>>>>    arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>>>>>>    arch/arm/kvm/mmu.c                |  5 +++
>>>>>>    arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>>>>>    arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>>>>>    arch/arm64/include/asm/virt.h     |  9 +++++
>>>>>>    arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>>>>>    arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>>>>>    9 files changed, 138 insertions(+), 48 deletions(-)
>>>>> [..]
>>>>>
>>>>>>
>>>>>>    static struct notifier_block hyp_init_cpu_pm_nb = {
>>>>>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>>>>>           }
>>>>>>
>>>>>>           /*
>>>>>> -        * Execute the init code on each CPU.
>>>>>> -        */
>>>>>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>>>>>> -
>>>>>> -       /*
>>>>>>            * Init HYP view of VGIC
>>>>>>            */
>>>>>>           err = kvm_vgic_hyp_init();
>>>>> With this flow, the cpu_init_hyp_mode() is called only at VM guest
>>>>> creation, but vgic_hyp_init() is called at bootup. On a system with
>>>>> GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
>>>>> (to get the number of LRs), because we're not reading it from EL2
>>>>> anymore.
>>> Thank you for pointing this out.
>>> Recently I tested my kdump code on hikey, and as hikey(hi6220) has gic-400,
>>> I didn't notice this problem.
>> Because GIC-400 is a GICv2 implementation, which is entirely MMIO based.
>> GICv3 uses some system registers that are only available at EL2, and KVM
>> needs some information contained in these registers before being able to
>> get initialized.
>>
>>>> Indeed, this is completely broken (I just reproduced the issue on a
>>>> model). I wish this kind of details had been checked earlier, but thanks
>>>> for pointing it out.
>>>>
>>>>> Whats the best way to fix this?
>>>>> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
>>>>> - Fold the VGIC init stuff back into hardware_enable()?
>>>> None of that works - kvm_arch_hardware_enable() is called once per CPU,
>>>> while vgic_hyp_init() can only be called once. Also,
>>>> kvm_arch_hardware_enable() is called from interrupt context, and I
>>>> wouldn't feel comfortable starting probing DT and allocating stuff from
>>>> there.
>>> Do you think so?
>>> How about the fixup! patch attached below?
>>> The point is that, like Ashwin's first idea, we initialize cpus temporarily
>>> before kvm_vgic_hyp_init() and then soon reset cpus again. Thus,
>>> kvm cpu hotplug will still continue to work as before.
>>> Now that cpu_init_hyp_mode() is revived as exactly the same as Marc's
>>> original code, the change will not be a big jump.
>> This seems quite complicated:
>> - init EL2 on  all CPUs
>> - do some initialization
>> - tear all CPUs EL2 down
>> - let KVM drive the vectors being set or not
>>
>> My questions are: why do we need to do this on *all* cpus? Can't that
>> work on a single one?
>>   
> 
> Single CPU EL2 initialization should be fine as long as no kernel 
> preemption happens in between init EL2  and  kvm_vgic_hyp_init()
> execution. The function init_hyp_mode() is called by do_basic_setup()
> with preemption enabled.

Indeed. So far, we never needed this since we were executing this code
with interrupts disabled.

> I don't have deeper knowledge of how scheduler is handled during the
>  kernel boot time, but initializing all CPUs definitely helps if
> preemption happens before reading ICH_VTR_EL2 register and after
> kvm_vgic_hyp_init().

What is wrong with wrapping the critical path with
preempt_{enable,disable}()?
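
Something like the below (completely untested, and assuming only the
hyp call itself needs to run on the CPU that was just initialized):

    preempt_disable();
    cpu_init_hyp_mode(NULL);        /* install the KVM vectors here */
    ich_vtr_el2 = kvm_call_hyp(__vgic_v3_get_ich_vtr_el2);
    cpu_reset_hyp_mode(NULL);       /* back to the hyp stub */
    preempt_enable();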

Thanks,

	M.
Shanker Donthineni Dec. 11, 2015, 7:11 p.m. UTC | #43
On 12/11/2015 12:11 PM, Marc Zyngier wrote:
> On 11/12/15 18:00, Shanker Donthineni wrote:
>>
>> On 12/11/2015 10:28 AM, Marc Zyngier wrote:
>>> On 11/12/15 08:06, AKASHI Takahiro wrote:
>>>> Ashwin, Marc,
>>>>
>>>> On 12/03/2015 10:58 PM, Marc Zyngier wrote:
>>>>> On 02/12/15 22:40, Ashwin Chaugule wrote:
>>>>>> Hello,
>>>>>>
>>>>>> On 24 November 2015 at 17:25, Geoff Levand <geoff@infradead.org> wrote:
>>>>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>>>
>>>>>>> The current kvm implementation on arm64 does cpu-specific initialization
>>>>>>> at system boot, and has no way to gracefully shutdown a core in terms of
>>>>>>> kvm. This prevents, especially, kexec from rebooting the system on a boot
>>>>>>> core in EL2.
>>>>>>>
>>>>>>> This patch adds a cpu tear-down function and also puts an existing cpu-init
>>>>>>> code into a separate function, kvm_arch_hardware_disable() and
>>>>>>> kvm_arch_hardware_enable() respectively.
>>>>>>> We don't need arm64-specific cpu hotplug hook any more.
>>>>>>>
>>>>>>> Since this patch modifies common part of code between arm and arm64, one
>>>>>>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
>>>>>>> compiling errors.
>>>>>>>
>>>>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>>> ---
>>>>>>>     arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>>>>>>     arch/arm/include/asm/kvm_mmu.h    |  1 +
>>>>>>>     arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>>>>>>>     arch/arm/kvm/mmu.c                |  5 +++
>>>>>>>     arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>>>>>>     arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>>>>>>     arch/arm64/include/asm/virt.h     |  9 +++++
>>>>>>>     arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>>>>>>     arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>>>>>>     9 files changed, 138 insertions(+), 48 deletions(-)
>>>>>> [..]
>>>>>>
>>>>>>>     static struct notifier_block hyp_init_cpu_pm_nb = {
>>>>>>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>>>>>>            }
>>>>>>>
>>>>>>>            /*
>>>>>>> -        * Execute the init code on each CPU.
>>>>>>> -        */
>>>>>>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>>>>>>> -
>>>>>>> -       /*
>>>>>>>             * Init HYP view of VGIC
>>>>>>>             */
>>>>>>>            err = kvm_vgic_hyp_init();
>>>>>> With this flow, the cpu_init_hyp_mode() is called only at VM guest
>>>>>> creation, but vgic_hyp_init() is called at bootup. On a system with
>>>>>> GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
>>>>>> (to get the number of LRs), because we're not reading it from EL2
>>>>>> anymore.
>>>> Thank you for pointing this out.
>>>> Recently I tested my kdump code on hikey, and as hikey(hi6220) has gic-400,
>>>> I didn't notice this problem.
>>> Because GIC-400 is a GICv2 implementation, which is entirely MMIO based.
>>> GICv3 uses some system registers that are only available at EL2, and KVM
>>> needs some information contained in these registers before being able to
>>> get initialized.
>>>
>>>>> Indeed, this is completely broken (I just reproduced the issue on a
>>>>> model). I wish this kind of details had been checked earlier, but thanks
>>>>> for pointing it out.
>>>>>
>>>>>> Whats the best way to fix this?
>>>>>> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
>>>>>> - Fold the VGIC init stuff back into hardware_enable()?
>>>>> None of that works - kvm_arch_hardware_enable() is called once per CPU,
>>>>> while vgic_hyp_init() can only be called once. Also,
>>>>> kvm_arch_hardware_enable() is called from interrupt context, and I
>>>>> wouldn't feel comfortable starting probing DT and allocating stuff from
>>>>> there.
>>>> Do you think so?
>>>> How about the fixup! patch attached below?
>>>> The point is that, like Ashwin's first idea, we initialize cpus temporarily
>>>> before kvm_vgic_hyp_init() and then soon reset cpus again. Thus,
>>>> kvm cpu hotplug will still continue to work as before.
>>>> Now that cpu_init_hyp_mode() is revived as exactly the same as Marc's
>>>> original code, the change will not be a big jump.
>>> This seems quite complicated:
>>> - init EL2 on  all CPUs
>>> - do some initialization
>>> - tear all CPUs EL2 down
>>> - let KVM drive the vectors being set or not
>>>
>>> My questions are: why do we need to do this on *all* cpus? Can't that
>>> work on a single one?
>>>    
>> Single CPU EL2 initialization should be fine as long as no kernel
>> preemption happens in between init EL2  and  kvm_vgic_hyp_init()
>> execution. The function init_hyp_mode() is called by do_basic_setup()
>> with preemption enabled.
> Indeed. So far, we never needed this since we were executing this code
> with interrupts disabled.
>
>> I don't have deeper knowledge of how scheduler is handled during the
>>   kernel boot time, but initializing all CPUs definitely helps if
>> preemption happens before reading ICH_VTR_EL2 register and after
>> kvm_vgic_hyp_init().
> What is wrong with wrapping the critical path with
> preempt_{enable,disable}()?

I agree with you, there's nothing wrong with implementing your
suggestion. It will fix the problem.

> Thanks,
>
> 	M.
Ashwin Chaugule Dec. 11, 2015, 8:13 p.m. UTC | #44
On 11 December 2015 at 11:28, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 11/12/15 08:06, AKASHI Takahiro wrote:
>> On 12/03/2015 10:58 PM, Marc Zyngier wrote:
>>> On 02/12/15 22:40, Ashwin Chaugule wrote:
>>>> On 24 November 2015 at 17:25, Geoff Levand <geoff@infradead.org> wrote:
>>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>

[..]

>> How about the fixup! patch attached below?
>> The point is that, like Ashwin's first idea, we initialize cpus temporarily
>> before kvm_vgic_hyp_init() and then soon reset cpus again. Thus,
>> kvm cpu hotplug will still continue to work as before.
>> Now that cpu_init_hyp_mode() is revived as exactly the same as Marc's
>> original code, the change will not be a big jump.
>
> This seems quite complicated:
> - init EL2 on  all CPUs
> - do some initialization
> - tear all CPUs EL2 down
> - let KVM drive the vectors being set or not
>
> My questions are: why do we need to do this on *all* cpus? Can't that
> work on a single one?
>
> Also, the simple fact that we were able to get some junk value is a sign
> that something is amiss. I'd expect a splat of some sort, because we now
> have a possibility of doing things in the wrong context.

Yea. I had the same expectation too. Probably some leftover state from
the hyp stub at bootup kept us lucky.

>
>>
>> If kvm_hyp_call() in vgic_v3_probe()/kvm_vgic_hyp_init() is a *problem*,
>> I hope this should work. Actually I confirmed that, with this fixup! patch,
>> we could run a kvm guest and also successfully executed kexec on model w/gic-v3.
>>
>> My only concern is the following kernel message I saw when kexec shut down
>> the kernel:
>> (Please note that I was running one kvm guest (pid=961) here.)
>>
>> ===
>> sh-4.3# ./kexec -d -e
>> kexec version: 15.11.16.11.06-g41e52e2
>> arch_process_options:112: command_line: (null)
>> arch_process_options:114: initrd: (null)
>> arch_process_options:115: dtb: (null)
>> arch_process_options:117: port: 0x0
>> kvm: exiting hardware virtualization
>> kvm [961]: Unsupported exception type: 6248304    <== this message
>
> That makes me feel very uncomfortable. It looks like we've exited a
> guest with some horrible value in X0. How is that even possible?
>
> This deserves to be investigated.

Did this exception trigger as a result of the fixup? For giggles, what
happens if you don't tear down the EL2 setup after reading the number
of LRs? Are we missing another site where we need to set up and
tear down?

Ashwin
Marc Zyngier Dec. 14, 2015, 5:33 p.m. UTC | #45
On 14/12/15 07:33, AKASHI Takahiro wrote:
> Marc,
> 
> On 12/12/2015 01:28 AM, Marc Zyngier wrote:
>> On 11/12/15 08:06, AKASHI Takahiro wrote:
>>> Ashwin, Marc,
>>>
>>> On 12/03/2015 10:58 PM, Marc Zyngier wrote:
>>>> On 02/12/15 22:40, Ashwin Chaugule wrote:
>>>>> Hello,
>>>>>
>>>>> On 24 November 2015 at 17:25, Geoff Levand <geoff@infradead.org> wrote:
>>>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>>
>>>>>> The current kvm implementation on arm64 does cpu-specific initialization
>>>>>> at system boot, and has no way to gracefully shutdown a core in terms of
>>>>>> kvm. This prevents, especially, kexec from rebooting the system on a boot
>>>>>> core in EL2.
>>>>>>
>>>>>> This patch adds a cpu tear-down function and also puts an existing cpu-init
>>>>>> code into a separate function, kvm_arch_hardware_disable() and
>>>>>> kvm_arch_hardware_enable() respectively.
>>>>>> We don't need arm64-specific cpu hotplug hook any more.
>>>>>>
>>>>>> Since this patch modifies common part of code between arm and arm64, one
>>>>>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
>>>>>> compiling errors.
>>>>>>
>>>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>> ---
>>>>>>    arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>>>>>    arch/arm/include/asm/kvm_mmu.h    |  1 +
>>>>>>    arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>>>>>>    arch/arm/kvm/mmu.c                |  5 +++
>>>>>>    arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>>>>>    arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>>>>>    arch/arm64/include/asm/virt.h     |  9 +++++
>>>>>>    arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>>>>>    arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>>>>>    9 files changed, 138 insertions(+), 48 deletions(-)
>>>>>
>>>>> [..]
>>>>>
>>>>>>
>>>>>>
>>>>>>    static struct notifier_block hyp_init_cpu_pm_nb = {
>>>>>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>>>>>           }
>>>>>>
>>>>>>           /*
>>>>>> -        * Execute the init code on each CPU.
>>>>>> -        */
>>>>>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>>>>>> -
>>>>>> -       /*
>>>>>>            * Init HYP view of VGIC
>>>>>>            */
>>>>>>           err = kvm_vgic_hyp_init();
>>>>>
>>>>> With this flow, the cpu_init_hyp_mode() is called only at VM guest
>>>>> creation, but vgic_hyp_init() is called at bootup. On a system with
>>>>> GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
>>>>> (to get the number of LRs), because we're not reading it from EL2
>>>>> anymore.
>>>
>>> Thank you for pointing this out.
>>> Recently I tested my kdump code on hikey, and as hikey(hi6220) has gic-400,
>>> I didn't notice this problem.
>>
>> Because GIC-400 is a GICv2 implementation, which is entirely MMIO based.
>> GICv3 uses some system registers that are only available at EL2, and KVM
>> needs some information contained in these registers before being able to
>> get initialized.
> 
> I see.
> 
>>>> Indeed, this is completely broken (I just reproduced the issue on a
>>>> model). I wish this kind of details had been checked earlier, but thanks
>>>> for pointing it out.
>>>>
>>>>> Whats the best way to fix this?
>>>>> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
>>>>> - Fold the VGIC init stuff back into hardware_enable()?
>>>>
>>>> None of that works - kvm_arch_hardware_enable() is called once per CPU,
>>>> while vgic_hyp_init() can only be called once. Also,
>>>> kvm_arch_hardware_enable() is called from interrupt context, and I
>>>> wouldn't feel comfortable starting probing DT and allocating stuff from
>>>> there.
>>>
>>> Do you think so?
>>> How about the fixup! patch attached below?
>>> The point is that, like Ashwin's first idea, we initialize cpus temporarily
>>> before kvm_vgic_hyp_init() and then soon reset cpus again. Thus,
>>> kvm cpu hotplug will still continue to work as before.
>>> Now that cpu_init_hyp_mode() is revived as exactly the same as Marc's
>>> original code, the change will not be a big jump.
>>
>> This seems quite complicated:
>> - init EL2 on  all CPUs
>> - do some initialization
>> - tear all CPUs EL2 down
>> - let KVM drive the vectors being set or not
>>
>> My questions are: why do we need to do this on *all* cpus? Can't that
>> work on a single one?
> 
> I did initialize all the cpus partly because using preempt_enable/disable
> looked a bit ugly and partly because we may, in the future, do additional
> per-cpu initialization in kvm_vgic_hyp_init() and/or kvm_timer_hyp_init().
> But if you're comfortable with preempt_*() stuff, I don't care.
> 
> 
>> Also, the simple fact that we were able to get some junk value is a sign
>> that something is amiss. I'd expect a splat of some sort, because we now
>> have a possibility of doing things in the wrong context.
>>
>>>
>>> If kvm_hyp_call() in vgic_v3_probe()/kvm_vgic_hyp_init() is a *problem*,
>>> I hope this should work. Actually I confirmed that, with this fixup! patch,
>>> we could run a kvm guest and also successfully executed kexec on model w/gic-v3.
>>>
>>> My only concern is the following kernel message I saw when kexec shut down
>>> the kernel:
>>> (Please note that I was running one kvm guest (pid=961) here.)
>>>
>>> ===
>>> sh-4.3# ./kexec -d -e
>>> kexec version: 15.11.16.11.06-g41e52e2
>>> arch_process_options:112: command_line: (null)
>>> arch_process_options:114: initrd: (null)
>>> arch_process_options:115: dtb: (null)
>>> arch_process_options:117: port: 0x0
>>> kvm: exiting hardware virtualization
>>> kvm [961]: Unsupported exception type: 6248304    <== this message
>>
>> That makes me feel very uncomfortable. It looks like we've exited a
>> guest with some horrible value in X0. How is that even possible?
>>
>> This deserves to be investigated.
> 
> I guess the problem is that the cpu tear-down function is called even if a kvm
> guest is still running in kvm_arch_vcpu_ioctl_run().
> So adding a check for whether the cpu has been initialized in every iteration of
> kvm_arch_vcpu_ioctl_run() will, if necessary, terminate a guest safely without
> entering guest mode. Since this check is done while interrupts are disabled, it
> won't interfere with kvm_arch_hardware_disable() called via IPI.
> See the attached fixup patch.
> 
> Again, I verified the code on model.
> 
> Thanks,
> -Takahiro AKASHI
> 
>> Thanks,
>>
>> 	M.
>>
> 
> ----8<----
>  From 77f273ba5e0c3dfcf75a5a8d1da8035cc390250c Mon Sep 17 00:00:00 2001
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> Date: Fri, 11 Dec 2015 13:43:35 +0900
> Subject: [PATCH] fixup! arm64: kvm: allows kvm cpu hotplug
> 
> ---
>   arch/arm/kvm/arm.c |   45 ++++++++++++++++++++++++++++++++++-----------
>   1 file changed, 34 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 518c3c7..d7e86fb 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -573,7 +573,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>   		/*
>   		 * Re-check atomic conditions
>   		 */
> -		if (signal_pending(current)) {
> +		if (__hyp_get_vectors() == hyp_default_vectors) {
> +			/* cpu has been torn down */
> +			ret = -ENOEXEC;
> +			run->exit_reason = KVM_EXIT_SHUTDOWN;


That feels completely overkill (and very slow). Why don't you maintain a
per-cpu variable containing the CPU states, which will avoid calling
__hyp_get_vectors() all the time? You should be able to reuse that
construct everywhere.
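
Something like this (an untested sketch reusing the helpers from this
patch; the flag name is only a suggestion):

    static DEFINE_PER_CPU(bool, kvm_arm_hardware_enabled);

    int kvm_arch_hardware_enable(void)
    {
            if (!__this_cpu_read(kvm_arm_hardware_enabled)) {
                    cpu_init_hyp_mode(NULL);
                    __this_cpu_write(kvm_arm_hardware_enabled, true);
            }
            return 0;
    }

    void kvm_arch_hardware_disable(void)
    {
            if (__this_cpu_read(kvm_arm_hardware_enabled)) {
                    cpu_reset_hyp_mode(NULL);
                    __this_cpu_write(kvm_arm_hardware_enabled, false);
            }
    }

The run loop then only needs to test the per-cpu flag instead of
trapping to EL2 on each entry.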

Also, I'm not sure about KVM_EXIT_SHUTDOWN. This looks very x86 specific
(called on triple fault). KVM_EXIT_FAIL_ENTRY looks more appropriate,
and the hardware_entry_failure_reason field should be populated (and
documented).

Thanks,

	M.
Yang Shi Dec. 14, 2015, 6 p.m. UTC | #46
On 12/11/2015 12:09 AM, AKASHI Takahiro wrote:
> Yang,
>
> On 12/11/2015 03:44 AM, Shi, Yang wrote:
>> On 12/3/2015 5:58 AM, Marc Zyngier wrote:
>>> On 02/12/15 22:40, Ashwin Chaugule wrote:
>>>> Hello,
>>>>
>>>> On 24 November 2015 at 17:25, Geoff Levand <geoff@infradead.org> wrote:
>>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>
>>>>> The current kvm implementation on arm64 does cpu-specific
>>>>> initialization
>>>>> at system boot, and has no way to gracefully shutdown a core in
>>>>> terms of
>>>>> kvm. This prevents, especially, kexec from rebooting the system on
>>>>> a boot
>>>>> core in EL2.
>>>>>
>>>>> This patch adds a cpu tear-down function and also puts an existing
>>>>> cpu-init
>>>>> code into a separate function, kvm_arch_hardware_disable() and
>>>>> kvm_arch_hardware_enable() respectively.
>>>>> We don't need arm64-specific cpu hotplug hook any more.
>>>>>
>>>>> Since this patch modifies common part of code between arm and
>>>>> arm64, one
>>>>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
>>>>> compiling errors.
>>>>>
>>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>> ---
>>>>>   arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>>>>   arch/arm/include/asm/kvm_mmu.h    |  1 +
>>>>>   arch/arm/kvm/arm.c                | 79
>>>>> ++++++++++++++++++---------------------
>>>>>   arch/arm/kvm/mmu.c                |  5 +++
>>>>>   arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>>>>   arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>>>>   arch/arm64/include/asm/virt.h     |  9 +++++
>>>>>   arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>>>>   arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>>>>   9 files changed, 138 insertions(+), 48 deletions(-)
>>>>
>>>> [..]
>>>>
>>>>>
>>>>>
>>>>>   static struct notifier_block hyp_init_cpu_pm_nb = {
>>>>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>>>>          }
>>>>>
>>>>>          /*
>>>>> -        * Execute the init code on each CPU.
>>>>> -        */
>>>>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>>>>> -
>>>>> -       /*
>>>>>           * Init HYP view of VGIC
>>>>>           */
>>>>>          err = kvm_vgic_hyp_init();
>>>>
>>>> With this flow, the cpu_init_hyp_mode() is called only at VM guest
>>>> creation, but vgic_hyp_init() is called at bootup. On a system with
>>>> GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
>>>> (to get the number of LRs), because we're not reading it from EL2
>>>> anymore.
>>>
>>> Indeed, this is completely broken (I just reproduced the issue on a
>>> model). I wish this kind of details had been checked earlier, but thanks
>>> for pointing it out.
>>
>> Sorry for chiming in late. I did run into something similar when I
>> tested some earlier version patches with Akashi San.
>>
>> The guest bootup just hangs with 4.1 kernel on my LS2085 board which
>> has GICv3. Not sure if this is the same issue. But,
>> my bisect did point to this patch.
>
> Can you please try my fixup! patch to determine whether it is the same
> issue or not?

I wish I could help this time; unfortunately, my LS2085 board had a
hardware failure, so I lost access to it. Once I get a replacement or
have it fixed, I will give it a try.

Regards,
Yang

>
> Thanks
> -Takahiro AKASHI
>
>
>> I used to think it is my kernel's problem (4.1 sounds a little bit
>> old) since Akashi San can't reproduce this on his
>> Hikey board.
>>
>> Thanks,
>> Yang
>>
>>>
>>>> Whats the best way to fix this?
>>>> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable
>>>> later?
>>>> - Fold the VGIC init stuff back into hardware_enable()?
>>>
>>> None of that works - kvm_arch_hardware_enable() is called once per CPU,
>>> while vgic_hyp_init() can only be called once. Also,
>>> kvm_arch_hardware_enable() is called from interrupt context, and I
>>> wouldn't feel comfortable starting probing DT and allocating stuff from
>>> there.
>>>
>>>> - Read the VGIC number of LRs from the hyp stub?
>>>
>>> That's may UNDEF if called in the wrong context. Also, this defeats the
>>> point of stubs, which is just to install the vectors for the hypervisor.
>>>
>>>> - ..
>>>
>>> Yeah, quite.
>>>
>>> Geoff, Takahiro?
>>>
>>>     M.
>>>
>>
AKASHI Takahiro Dec. 15, 2015, 7:51 a.m. UTC | #47
On 12/15/2015 02:33 AM, Marc Zyngier wrote:
> On 14/12/15 07:33, AKASHI Takahiro wrote:
>> Marc,
>>
>> On 12/12/2015 01:28 AM, Marc Zyngier wrote:
>>> On 11/12/15 08:06, AKASHI Takahiro wrote:
>>>> Ashwin, Marc,
>>>>
>>>> On 12/03/2015 10:58 PM, Marc Zyngier wrote:
>>>>> On 02/12/15 22:40, Ashwin Chaugule wrote:
>>>>>> Hello,
>>>>>>
>>>>>> On 24 November 2015 at 17:25, Geoff Levand <geoff@infradead.org> wrote:
>>>>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>>>
>>>>>>> The current kvm implementation on arm64 does cpu-specific initialization
>>>>>>> at system boot, and has no way to gracefully shutdown a core in terms of
>>>>>>> kvm. This prevents, especially, kexec from rebooting the system on a boot
>>>>>>> core in EL2.
>>>>>>>
>>>>>>> This patch adds a cpu tear-down function and also puts an existing cpu-init
>>>>>>> code into a separate function, kvm_arch_hardware_disable() and
>>>>>>> kvm_arch_hardware_enable() respectively.
>>>>>>> We don't need arm64-specific cpu hotplug hook any more.
>>>>>>>
>>>>>>> Since this patch modifies common part of code between arm and arm64, one
>>>>>>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
>>>>>>> compiling errors.
>>>>>>>
>>>>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>>> ---
>>>>>>>     arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>>>>>>     arch/arm/include/asm/kvm_mmu.h    |  1 +
>>>>>>>     arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>>>>>>>     arch/arm/kvm/mmu.c                |  5 +++
>>>>>>>     arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>>>>>>     arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>>>>>>     arch/arm64/include/asm/virt.h     |  9 +++++
>>>>>>>     arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>>>>>>     arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>>>>>>     9 files changed, 138 insertions(+), 48 deletions(-)
>>>>>>
>>>>>> [..]
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>     static struct notifier_block hyp_init_cpu_pm_nb = {
>>>>>>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>>>>>>            }
>>>>>>>
>>>>>>>            /*
>>>>>>> -        * Execute the init code on each CPU.
>>>>>>> -        */
>>>>>>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>>>>>>> -
>>>>>>> -       /*
>>>>>>>             * Init HYP view of VGIC
>>>>>>>             */
>>>>>>>            err = kvm_vgic_hyp_init();
>>>>>>
>>>>>> With this flow, the cpu_init_hyp_mode() is called only at VM guest
>>>>>> creation, but vgic_hyp_init() is called at bootup. On a system with
>>>>>> GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
>>>>>> (to get the number of LRs), because we're not reading it from EL2
>>>>>> anymore.
>>>>
>>>> Thank you for pointing this out.
>>>> Recently I tested my kdump code on hikey, and as hikey(hi6220) has gic-400,
>>>> I didn't notice this problem.
>>>
>>> Because GIC-400 is a GICv2 implementation, which is entirely MMIO based.
>>> GICv3 uses some system registers that are only available at EL2, and KVM
>>> needs some information contained in these registers before being able to
>>> get initialized.
>>
>> I see.
>>
>>>>> Indeed, this is completely broken (I just reproduced the issue on a
>>>>> model). I wish this kind of details had been checked earlier, but thanks
>>>>> for pointing it out.
>>>>>
>>>>>> Whats the best way to fix this?
>>>>>> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
>>>>>> - Fold the VGIC init stuff back into hardware_enable()?
>>>>>
>>>>> None of that works - kvm_arch_hardware_enable() is called once per CPU,
>>>>> while vgic_hyp_init() can only be called once. Also,
>>>>> kvm_arch_hardware_enable() is called from interrupt context, and I
>>>>> wouldn't feel comfortable starting probing DT and allocating stuff from
>>>>> there.
>>>>
>>>> Do you think so?
>>>> How about the fixup! patch attached below?
>>>> The point is that, like Ashwin's first idea, we initialize cpus temporarily
>>>> before kvm_vgic_hyp_init() and then soon reset cpus again. Thus,
>>>> kvm cpu hotplug will still continue to work as before.
>>>> Now that cpu_init_hyp_mode() is revived as exactly the same as Marc's
>>>> original code, the change will not be a big jump.
>>>
>>> This seems quite complicated:
>>> - init EL2 on  all CPUs
>>> - do some initialization
>>> - tear all CPUs EL2 down
>>> - let KVM drive the vectors being set or not
>>>
>>> My questions are: why do we need to do this on *all* cpus? Can't that
>>> work on a single one?
>>
>> I did initialize all the cpus partly because using preempt_enable/disable
>> looked a bit ugly and partly because we may, in the future, do additional
>> per-cpu initialization in kvm_vgic_hyp_init() and/or kvm_timer_hyp_init().
>>> But if you're comfortable with preempt_*() stuff, I don't care.
>>
>>
>>> Also, the simple fact that we were able to get some junk value is a sign
>>> that something is amiss. I'd expect a splat of some sort, because we now
>>> have a possibility of doing things in the wrong context.
>>>
>>>>
>>>> If kvm_hyp_call() in vgic_v3_probe()/kvm_vgic_hyp_init() is a *problem*,
>>>> I hope this should work. Actually I confirmed that, with this fixup! patch,
>>>> we could run a kvm guest and also successfully executed kexec on model w/gic-v3.
>>>>
>>>> My only concern is the following kernel message I saw when kexec shut down
>>>> the kernel:
>>>> (Please note that I was running one kvm guest (pid=961) here.)
>>>>
>>>> ===
>>>> sh-4.3# ./kexec -d -e
>>>> kexec version: 15.11.16.11.06-g41e52e2
>>>> arch_process_options:112: command_line: (null)
>>>> arch_process_options:114: initrd: (null)
>>>> arch_process_options:115: dtb: (null)
>>>> arch_process_options:117: port: 0x0
>>>> kvm: exiting hardware virtualization
>>>> kvm [961]: Unsupported exception type: 6248304    <== this message
>>>
>>> That makes me feel very uncomfortable. It looks like we've exited a
>>> guest with some horrible value in X0. How is that even possible?
>>>
>>> This deserves to be investigated.
>>
>> I guess the problem is that the cpu tear-down function is called even if a kvm
>> guest is still running in kvm_arch_vcpu_ioctl_run().
>> So adding a check for whether the cpu has been initialized in every iteration of
>> kvm_arch_vcpu_ioctl_run() will, if necessary, terminate a guest safely without
>> entering guest mode. Since this check is done while interrupts are disabled, it
>> won't interfere with kvm_arch_hardware_disable() called via IPI.
>> See the attached fixup patch.
>>
>> Again, I verified the code on model.
>>
>> Thanks,
>> -Takahiro AKASHI
>>
>>> Thanks,
>>>
>>> 	M.
>>>
>>
>> ----8<----
>>   From 77f273ba5e0c3dfcf75a5a8d1da8035cc390250c Mon Sep 17 00:00:00 2001
>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> Date: Fri, 11 Dec 2015 13:43:35 +0900
>> Subject: [PATCH] fixup! arm64: kvm: allows kvm cpu hotplug
>>
>> ---
>>    arch/arm/kvm/arm.c |   45 ++++++++++++++++++++++++++++++++++-----------
>>    1 file changed, 34 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>> index 518c3c7..d7e86fb 100644
>> --- a/arch/arm/kvm/arm.c
>> +++ b/arch/arm/kvm/arm.c
>> @@ -573,7 +573,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>    		/*
>>    		 * Re-check atomic conditions
>>    		 */
>> -		if (signal_pending(current)) {
>> +		if (__hyp_get_vectors() == hyp_default_vectors) {
>> +			/* cpu has been torn down */
>> +			ret = -ENOEXEC;
>> +			run->exit_reason = KVM_EXIT_SHUTDOWN;
>
>
> That feels completely overkill (and very slow). Why don't you maintain a
> per-cpu variable containing the CPU states, which will avoid calling
> __hyp_get_vectors() all the time? You should be able to reuse that
> construct everywhere.

OK. Since I have introduced a per-cpu variable, kvm_arm_hardware_enabled, for
the cpuidle issue, we will be able to re-use it.

> Also, I'm not sure about KVM_EXIT_SHUTDOWN. This looks very x86 specific
> (called on triple fault).

No, I don't think so.
Looking at kvm_cpu_exec() in kvm-all.c of qemu, KVM_EXIT_SHUTDOWN
is handled in a generic way and results in a reset request.
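
If I read it right, the handling there is simply:

    case KVM_EXIT_SHUTDOWN:
        qemu_system_reset_request();
        ret = EXCP_INTERRUPT;
        break;
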
On the other hand, KVM_EXIT_FAIL_ENTRY seems more arch-specific.
In addition, if kvm_vcpu_ioctl() returns a negative value, run->exit_reason
will never be examined.
So I think
    ret -> 0
    run->exit_reason -> KVM_EXIT_SHUTDOWN
or just
    ret -> -ENOEXEC
is the best.

Either way, a guest will have no good chance to gracefully shut itself down
because we're kexec'ing (without waiting for threads' termination).

-Takahiro AKASHI

> KVM_EXIT_FAIL_ENTRY looks more appropriate,
> and the hardware_entry_failure_reason field should be populated (and
> documented).
>
> Thanks,
>
> 	M.
>
Marc Zyngier Dec. 15, 2015, 8:45 a.m. UTC | #48
On 15/12/15 07:51, AKASHI Takahiro wrote:
> On 12/15/2015 02:33 AM, Marc Zyngier wrote:
>> On 14/12/15 07:33, AKASHI Takahiro wrote:
>>> Marc,
>>>
>>> On 12/12/2015 01:28 AM, Marc Zyngier wrote:
>>>> On 11/12/15 08:06, AKASHI Takahiro wrote:
>>>>> Ashwin, Marc,
>>>>>
>>>>> On 12/03/2015 10:58 PM, Marc Zyngier wrote:
>>>>>> On 02/12/15 22:40, Ashwin Chaugule wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> On 24 November 2015 at 17:25, Geoff Levand <geoff@infradead.org> wrote:
>>>>>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>>>>
>>>>>>>> The current kvm implementation on arm64 does cpu-specific initialization
>>>>>>>> at system boot, and has no way to gracefully shutdown a core in terms of
>>>>>>>> kvm. This prevents, especially, kexec from rebooting the system on a boot
>>>>>>>> core in EL2.
>>>>>>>>
>>>>>>>> This patch adds a cpu tear-down function and also puts an existing cpu-init
>>>>>>>> code into a separate function, kvm_arch_hardware_disable() and
>>>>>>>> kvm_arch_hardware_enable() respectively.
>>>>>>>> We don't need arm64-specific cpu hotplug hook any more.
>>>>>>>>
>>>>>>>> Since this patch modifies common part of code between arm and arm64, one
>>>>>>>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
>>>>>>>> compiling errors.
>>>>>>>>
>>>>>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>>>> ---
>>>>>>>>     arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>>>>>>>     arch/arm/include/asm/kvm_mmu.h    |  1 +
>>>>>>>>     arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>>>>>>>>     arch/arm/kvm/mmu.c                |  5 +++
>>>>>>>>     arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>>>>>>>     arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>>>>>>>     arch/arm64/include/asm/virt.h     |  9 +++++
>>>>>>>>     arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>>>>>>>     arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>>>>>>>     9 files changed, 138 insertions(+), 48 deletions(-)
>>>>>>>
>>>>>>> [..]
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>     static struct notifier_block hyp_init_cpu_pm_nb = {
>>>>>>>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>>>>>>>            }
>>>>>>>>
>>>>>>>>            /*
>>>>>>>> -        * Execute the init code on each CPU.
>>>>>>>> -        */
>>>>>>>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>>>>>>>> -
>>>>>>>> -       /*
>>>>>>>>             * Init HYP view of VGIC
>>>>>>>>             */
>>>>>>>>            err = kvm_vgic_hyp_init();
>>>>>>>
>>>>>>> With this flow, the cpu_init_hyp_mode() is called only at VM guest
>>>>>>> creation, but vgic_hyp_init() is called at bootup. On a system with
>>>>>>> GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
>>>>>>> (to get the number of LRs), because we're not reading it from EL2
>>>>>>> anymore.
>>>>>
>>>>> Thank you for pointing this out.
>>>>> Recently I tested my kdump code on hikey, and as hikey(hi6220) has gic-400,
>>>>> I didn't notice this problem.
>>>>
>>>> Because GIC-400 is a GICv2 implementation, which is entirely MMIO based.
>>>> GICv3 uses some system registers that are only available at EL2, and KVM
>>>> needs some information contained in these registers before being able to
>>>> get initialized.
>>>
>>> I see.
>>>
>>>>>> Indeed, this is completely broken (I just reproduced the issue on a
>>>>>> model). I wish this kind of details had been checked earlier, but thanks
>>>>>> for pointing it out.
>>>>>>
>>>>>>> Whats the best way to fix this?
>>>>>>> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
>>>>>>> - Fold the VGIC init stuff back into hardware_enable()?
>>>>>>
>>>>>> None of that works - kvm_arch_hardware_enable() is called once per CPU,
>>>>>> while vgic_hyp_init() can only be called once. Also,
>>>>>> kvm_arch_hardware_enable() is called from interrupt context, and I
>>>>>> wouldn't feel comfortable starting probing DT and allocating stuff from
>>>>>> there.
>>>>>
>>>>> Do you think so?
>>>>> How about the fixup! patch attached below?
>>>>> The point is that, like Ashwin's first idea, we initialize cpus temporarily
>>>>> before kvm_vgic_hyp_init() and then soon reset cpus again. Thus,
>>>>> kvm cpu hotplug will still continue to work as before.
>>>>> Now that cpu_init_hyp_mode() is revived as exactly the same as Marc's
>>>>> original code, the change will not be a big jump.
>>>>
>>>> This seems quite complicated:
>>>> - init EL2 on  all CPUs
>>>> - do some initialization
>>>> - tear all CPUs EL2 down
>>>> - let KVM drive the vectors being set or not
>>>>
>>>> My questions are: why do we need to do this on *all* cpus? Can't that
>>>> work on a single one?
>>>
>>> I did initialize all the cpus partly because using preempt_enable/disable
>>> looked a bit ugly and partly because we may, in the future, do additional
>>> per-cpu initialization in kvm_vgic_hyp_init() and/or kvm_timer_hyp_init().
>>> But if you're comfortable with preempt_*() stuff, I don't care.
>>>
>>>
>>>> Also, the simple fact that we were able to get some junk value is a sign
>>>> that something is amiss. I'd expect a splat of some sort, because we now
>>>> have a possibility of doing things in the wrong context.
>>>>
>>>>>
>>>>> If kvm_hyp_call() in vgic_v3_probe()/kvm_vgic_hyp_init() is a *problem*,
>>>>> I hope this should work. Actually I confirmed that, with this fixup! patch,
>>>>> we could run a kvm guest and also successfully executed kexec on model w/gic-v3.
>>>>>
>>>>> My only concern is the following kernel message I saw when kexec shut down
>>>>> the kernel:
>>>>> (Please note that I was running one kvm guest (pid=961) here.)
>>>>>
>>>>> ===
>>>>> sh-4.3# ./kexec -d -e
>>>>> kexec version: 15.11.16.11.06-g41e52e2
>>>>> arch_process_options:112: command_line: (null)
>>>>> arch_process_options:114: initrd: (null)
>>>>> arch_process_options:115: dtb: (null)
>>>>> arch_process_options:117: port: 0x0
>>>>> kvm: exiting hardware virtualization
>>>>> kvm [961]: Unsupported exception type: 6248304    <== this message
>>>>
>>>> That makes me feel very uncomfortable. It looks like we've exited a
>>>> guest with some horrible value in X0. How is that even possible?
>>>>
>>>> This deserves to be investigated.
>>>
>>> I guess the problem is that the cpu tear-down function is called even if a kvm
>>> guest is still running in kvm_arch_vcpu_ioctl_run().
>>> So adding a check for whether the cpu has been initialized in every iteration of
>>> kvm_arch_vcpu_ioctl_run() will, if necessary, terminate a guest safely without
>>> entering guest mode. Since this check is done while interrupts are disabled, it
>>> won't interfere with kvm_arch_hardware_disable() called via IPI.
>>> See the attached fixup patch.
>>>
>>> Again, I verified the code on model.
>>>
>>> Thanks,
>>> -Takahiro AKASHI
>>>
>>>> Thanks,
>>>>
>>>> 	M.
>>>>
>>>
>>> ----8<----
>>>   From 77f273ba5e0c3dfcf75a5a8d1da8035cc390250c Mon Sep 17 00:00:00 2001
>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>> Date: Fri, 11 Dec 2015 13:43:35 +0900
>>> Subject: [PATCH] fixup! arm64: kvm: allows kvm cpu hotplug
>>>
>>> ---
>>>    arch/arm/kvm/arm.c |   45 ++++++++++++++++++++++++++++++++++-----------
>>>    1 file changed, 34 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>>> index 518c3c7..d7e86fb 100644
>>> --- a/arch/arm/kvm/arm.c
>>> +++ b/arch/arm/kvm/arm.c
>>> @@ -573,7 +573,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>>    		/*
>>>    		 * Re-check atomic conditions
>>>    		 */
>>> -		if (signal_pending(current)) {
>>> +		if (__hyp_get_vectors() == hyp_default_vectors) {
>>> +			/* cpu has been torn down */
>>> +			ret = -ENOEXEC;
>>> +			run->exit_reason = KVM_EXIT_SHUTDOWN;
>>
>>
>> That feels completely overkill (and very slow). Why don't you maintain a
>> per-cpu variable containing the CPU states, which will avoid calling
>> __hyp_get_vectors() all the time? You should be able to reuse that
>> construct everywhere.
> 
> OK. Since I have introduced a per-cpu variable, kvm_arm_hardware_enabled, for
> the cpuidle issue, we will be able to re-use it.
> 
>> Also, I'm not sure about KVM_EXIT_SHUTDOWN. This looks very x86 specific
>> (called on triple fault).
> 
> No, I don't think so.

maz@approximate:~/Work/arm-platforms$ git grep KVM_EXIT_SHUTDOWN
arch/x86/kvm/svm.c:     kvm_run->exit_reason = KVM_EXIT_SHUTDOWN;
arch/x86/kvm/vmx.c:     vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
arch/x86/kvm/x86.c:                     vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
include/uapi/linux/kvm.h:#define KVM_EXIT_SHUTDOWN         8

And that's it. No other architecture ever generates this, and this is
an undocumented API. So I'm not going to let that in until someone actually
defines what this thing means.

> Looking at kvm_cpu_exec() in kvm-all.c of qemu, KVM_EXIT_SHUTDOWN
> is handled in a generic way and results in a reset request.

Which is not what we want. We want to indicate that the guest couldn't
be entered. This is not due to a guest doing a triple fault (which is
the way an x86 system gets rebooted).

> On the other hand, KVM_EXIT_FAIL_ENTRY seems more arch-specific.

Certainly arch specific, but actually extremely accurate. You couldn't
enter the guest, and you describe the reason in an architecture-specific
fashion. This is also the only exit code that describes this exact case
we're talking about here.

> In addition, if kvm_vcpu_ioctl() returns a negative value, run->exit_reason
> will never be examined.
> So I think
>     ret -> 0
>     run->exit_reason -> KVM_EXIT_SHUTDOWN

ret = 0
run->exit_reason = KVM_EXIT_FAIL_ENTRY;
run->hardware_entry_failure_reason = (u64)-ENOEXEC;

> or just
>     ret -> -ENOEXEC
> is the best.
> 
> Either way, a guest will have no good chance to gracefully shut itself down
> because we're kexec'ing (without waiting for threads' termination).

Well, at least userspace gets a chance - and should kexec fail, we have
a chance to recover.
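
In the vcpu run loop, that boils down to something like (again untested,
reusing the per-cpu flag idea, in the same spot as the signal_pending()
re-check, where ret <= 0 makes the loop bail out):

    if (!__this_cpu_read(kvm_arm_hardware_enabled)) {
            /* this CPU was torn down behind our back */
            run->exit_reason = KVM_EXIT_FAIL_ENTRY;
            run->fail_entry.hardware_entry_failure_reason = (u64)-ENOEXEC;
            ret = 0;
    }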

Thanks,

	M.
AKASHI Takahiro Dec. 15, 2015, 8:48 a.m. UTC | #49
Will,

On 12/12/2015 01:31 AM, Will Deacon wrote:
> On Thu, Dec 10, 2015 at 10:31:48AM -0800, Geoff Levand wrote:
>> On Thu, 2015-12-03 at 13:58 +0000, Marc Zyngier wrote:
>>> Indeed, this is completely broken (I just reproduced the issue on a
>>> model).
>>
>> There are users out there that want kexec/kdump support.  I think we
>> should remove this patch from the series, replace it with a temporary
>> Kconfig conditional that allows only one of KVM or kexec to be enabled,
>> and merge kexec and kdump support in for 4.5.  This will satisfy users
>> who need kexec but not KVM, like kexec based bootloaders.  Once this
>> KVM hot plug patch is fixed we then merge it, either for 4.5 or 4.6,
>> depending on the timing.
>>
>> I'll prepare and post a v13 series that does the above.
>
> I'm not really keen on merging this based on a "temporary" Kconfig change
> with a promise to have it resolved in the future. Given how long this
> series has been floating around already, I'm not filled with confidence
> at the prospect of waiting for a fixup patch after it's been merged.
>
> Please address the outstanding review comments before reposting.

I'm working hard on addressing Marc's comments on kvm cpu hotplug, so could
you please review the other *main* parts of the kexec/kdump patches?
Or may I take 'no comments' as a good sign of your acceptance?

Thanks,
-Takahiro AKASHI


> Will
>
Marc Zyngier Dec. 15, 2015, 10:13 a.m. UTC | #50
On 15/12/15 09:51, AKASHI Takahiro wrote:
> On 12/15/2015 05:45 PM, Marc Zyngier wrote:
>> On 15/12/15 07:51, AKASHI Takahiro wrote:
>>> On 12/15/2015 02:33 AM, Marc Zyngier wrote:
>>>> On 14/12/15 07:33, AKASHI Takahiro wrote:
>>>>> Marc,
>>>>>
>>>>> On 12/12/2015 01:28 AM, Marc Zyngier wrote:
>>>>>> On 11/12/15 08:06, AKASHI Takahiro wrote:
>>>>>>> Ashwin, Marc,
>>>>>>>
>>>>>>> On 12/03/2015 10:58 PM, Marc Zyngier wrote:
>>>>>>>> On 02/12/15 22:40, Ashwin Chaugule wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> On 24 November 2015 at 17:25, Geoff Levand <geoff@infradead.org> wrote:
>>>>>>>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>>>>>>
>>>>>>>>>> The current kvm implementation on arm64 does cpu-specific initialization
>>>>>>>>>> at system boot, and has no way to gracefully shut down a core in terms of
>>>>>>>>>> kvm. In particular, this prevents kexec from rebooting the system on a boot
>>>>>>>>>> core in EL2.
>>>>>>>>>>
>>>>>>>>>> This patch adds a cpu tear-down function and also puts an existing cpu-init
>>>>>>>>>> code into a separate function, kvm_arch_hardware_disable() and
>>>>>>>>>> kvm_arch_hardware_enable() respectively.
>>>>>>>>>> We don't need arm64-specific cpu hotplug hook any more.
>>>>>>>>>>
>>>>>>>>>> Since this patch modifies common part of code between arm and arm64, one
>>>>>>>>>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
>>>>>>>>>> compiling errors.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>>>>>>> ---
>>>>>>>>>>      arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>>>>>>>>>      arch/arm/include/asm/kvm_mmu.h    |  1 +
>>>>>>>>>>      arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>>>>>>>>>>      arch/arm/kvm/mmu.c                |  5 +++
>>>>>>>>>>      arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>>>>>>>>>      arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>>>>>>>>>      arch/arm64/include/asm/virt.h     |  9 +++++
>>>>>>>>>>      arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>>>>>>>>>      arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>>>>>>>>>      9 files changed, 138 insertions(+), 48 deletions(-)
>>>>>>>>>
>>>>>>>>> [..]
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>      static struct notifier_block hyp_init_cpu_pm_nb = {
>>>>>>>>>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>>>>>>>>>             }
>>>>>>>>>>
>>>>>>>>>>             /*
>>>>>>>>>> -        * Execute the init code on each CPU.
>>>>>>>>>> -        */
>>>>>>>>>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>>>>>>>>>> -
>>>>>>>>>> -       /*
>>>>>>>>>>              * Init HYP view of VGIC
>>>>>>>>>>              */
>>>>>>>>>>             err = kvm_vgic_hyp_init();
>>>>>>>>>
>>>>>>>>> With this flow, the cpu_init_hyp_mode() is called only at VM guest
>>>>>>>>> creation, but vgic_hyp_init() is called at bootup. On a system with
>>>>>>>>> GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
>>>>>>>>> (to get the number of LRs), because we're not reading it from EL2
>>>>>>>>> anymore.
>>>>>>>
>>>>>>> Thank you for pointing this out.
>>>>>>> Recently I tested my kdump code on hikey, and as hikey(hi6220) has gic-400,
>>>>>>> I didn't notice this problem.
>>>>>>
>>>>>> Because GIC-400 is a GICv2 implementation, which is entirely MMIO based.
>>>>>> GICv3 uses some system registers that are only available at EL2, and KVM
>>>>>> needs some information contained in these registers before being able to
>>>>>> get initialized.
>>>>>
>>>>> I see.
>>>>>
>>>>>>>> Indeed, this is completely broken (I just reproduced the issue on a
>>>>>>>> model). I wish this kind of details had been checked earlier, but thanks
>>>>>>>> for pointing it out.
>>>>>>>>
>>>>>>>>> What's the best way to fix this?
>>>>>>>>> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
>>>>>>>>> - Fold the VGIC init stuff back into hardware_enable()?
>>>>>>>>
>>>>>>>> None of that works - kvm_arch_hardware_enable() is called once per CPU,
>>>>>>>> while vgic_hyp_init() can only be called once. Also,
>>>>>>>> kvm_arch_hardware_enable() is called from interrupt context, and I
>>>>>>>> wouldn't feel comfortable starting to probe DT and allocating stuff from
>>>>>>>> there.
>>>>>>>
>>>>>>> Do you think so?
>>>>>>> How about the fixup! patch attached below?
>>>>>>> The point is that, like Ashwin's first idea, we initialize cpus temporarily
>>>>>>> before kvm_vgic_hyp_init() and then soon reset cpus again. Thus,
>>>>>>> kvm cpu hotplug will still continue to work as before.
>>>>>>> Now that cpu_init_hyp_mode() is revived as exactly the same as Marc's
>>>>>>> original code, the change will not be a big jump.
>>>>>>
>>>>>> This seems quite complicated:
>>>>>> - init EL2 on  all CPUs
>>>>>> - do some initialization
>>>>>> - tear all CPUs EL2 down
>>>>>> - let KVM drive the vectors being set or not
>>>>>>
>>>>>> My questions are: why do we need to do this on *all* cpus? Can't that
>>>>>> work on a single one?
>>>>>
>>>>> I did initialize all the cpus partly because using preempt_enable/disable
>>>>> looked a bit ugly and partly because we may, in the future, do additional
>>>>> per-cpu initialization in kvm_vgic_hyp_init() and/or kvm_timer_hyp_init().
>>>>> But if you're comfortable with preempt_*() stuff, I don't care.
>>>>>
>>>>>
>>>>>> Also, the simple fact that we were able to get some junk value is a sign
>>>>>> that something is amiss. I'd expect a splat of some sort, because we now
>>>>>> have a possibility of doing things in the wrong context.
>>>>>>
>>>>>>>
>>>>>>> If kvm_hyp_call() in vgic_v3_probe()/kvm_vgic_hyp_init() is a *problem*,
>>>>>>> I hope this should work. Actually I confirmed that, with this fixup! patch,
>>>>>>> we could run a kvm guest and also successfully execute kexec on a model w/ gic-v3.
>>>>>>>
>>>>>>> My only concern is the following kernel message I saw when kexec shut down
>>>>>>> the kernel:
>>>>>>> (Please note that I was running one kvm guest (pid=961) here.)
>>>>>>>
>>>>>>> ===
>>>>>>> sh-4.3# ./kexec -d -e
>>>>>>> kexec version: 15.11.16.11.06-g41e52e2
>>>>>>> arch_process_options:112: command_line: (null)
>>>>>>> arch_process_options:114: initrd: (null)
>>>>>>> arch_process_options:115: dtb: (null)
>>>>>>> arch_process_options:117: port: 0x0
>>>>>>> kvm: exiting hardware virtualization
>>>>>>> kvm [961]: Unsupported exception type: 6248304    <== this message
>>>>>>
>>>>>> That makes me feel very uncomfortable. It looks like we've exited a
>>>>>> guest with some horrible value in X0. How is that even possible?
>>>>>>
>>>>>> This deserves to be investigated.
>>>>>
>>>>> I guess the problem is that the cpu tear-down function is called even if a kvm guest
>>>>> is still running in kvm_arch_vcpu_ioctl_run().
>>>>> So adding a check for whether the cpu has been initialized in every iteration of
>>>>> kvm_arch_vcpu_ioctl_run() will, if necessary, terminate a guest safely without
>>>>> entering guest mode. Since this check is done while interrupts are disabled, it
>>>>> won't interfere with kvm_arch_hardware_disable() called via IPI.
>>>>> See the attached fixup patch.
>>>>>
>>>>> Again, I verified the code on model.
>>>>>
>>>>> Thanks,
>>>>> -Takahiro AKASHI
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> 	M.
>>>>>>
>>>>>
>>>>> ----8<----
>>>>>    From 77f273ba5e0c3dfcf75a5a8d1da8035cc390250c Mon Sep 17 00:00:00 2001
>>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>> Date: Fri, 11 Dec 2015 13:43:35 +0900
>>>>> Subject: [PATCH] fixup! arm64: kvm: allows kvm cpu hotplug
>>>>>
>>>>> ---
>>>>>     arch/arm/kvm/arm.c |   45 ++++++++++++++++++++++++++++++++++-----------
>>>>>     1 file changed, 34 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>>>>> index 518c3c7..d7e86fb 100644
>>>>> --- a/arch/arm/kvm/arm.c
>>>>> +++ b/arch/arm/kvm/arm.c
>>>>> @@ -573,7 +573,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>>>>     		/*
>>>>>     		 * Re-check atomic conditions
>>>>>     		 */
>>>>> -		if (signal_pending(current)) {
>>>>> +		if (__hyp_get_vectors() == hyp_default_vectors) {
>>>>> +			/* cpu has been torn down */
>>>>> +			ret = -ENOEXEC;
>>>>> +			run->exit_reason = KVM_EXIT_SHUTDOWN;
>>>>
>>>>
>>>> That feels completely overkill (and very slow). Why don't you maintain a
>>>> per-cpu variable containing the CPU states, which will avoid calling
>>>> __hyp_get_vectors() all the time? You should be able to reuse that
>>>> construct everywhere.
>>>
>>> OK. Since I have introduced a per-cpu variable, kvm_arm_hardware_enabled, for the
>>> cpuidle issue, we will be able to re-use it.
>>>
>>>> Also, I'm not sure about KVM_EXIT_SHUTDOWN. This looks very x86 specific
>>>> (called on triple fault).
>>>
>>> No, I don't think so.
>>
>> maz@approximate:~/Work/arm-platforms$ git grep KVM_EXIT_SHUTDOWN
>> arch/x86/kvm/svm.c:     kvm_run->exit_reason = KVM_EXIT_SHUTDOWN;
>> arch/x86/kvm/vmx.c:     vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
>> arch/x86/kvm/x86.c:                     vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
>> include/uapi/linux/kvm.h:#define KVM_EXIT_SHUTDOWN         8
>>
>> And that's it. No other architecture ever generates this, and this is
>> an undocumented API. So I'm not going to let that in until someone actually
>> defines what this thing means.
>>
>>> Looking at kvm_cpu_exec() in kvm-all.c of qemu, KVM_EXIT_SHUTDOWN
>>> is handled in a generic way and results in a reset request.
>>
>> Which is not what we want. We want to indicate that the guest couldn't
>> be entered. This is not due to a guest doing a triple fault (which is
>> the way an x86 system gets rebooted).
>>
>>> On the other hand, KVM_EXIT_FAIL_ENTRY seems more arch-specific.
>>
>> Certainly arch specific, but actually extremely accurate. You couldn't
>> enter the guest, and you describe the reason in an architecture-specific
>> fashion. This is also the only exit code that describes this exact case
>> we're talking about here.
>>
>>> In addition, if kvm_vcpu_ioctl() returns a negative value, run->exit_reason
>>> will never be examined.
>>> So I think
>>>      ret -> 0
>>>      run->exit_reason -> KVM_EXIT_SHUTDOWN
>>
>> ret = 0
>> run->exit_reason = KVM_EXIT_FAIL_ENTRY;
>> run->hardware_entry_failure_reason = (u64)-ENOEXEC;
> 
> OK.
> 
>>> or just
>>>      ret -> -ENOEXEC
>>> is the best.
>>>
>>> Either way, a guest will have no good chance to gracefully shut itself down
>>> because we're kexec'ing (without waiting for threads' termination).
>>
>> Well, at least userspace gets a chance - and should kexec fail, we have
>> a chance to recover.
> 
> Well, the current kexec implementation (on arm64) never fails
> except at a very early stage :)
> 
> So please review the attached fixup patch, again.
> 
> Thanks,
> -Takahiro AKASHI
> 
> 
>> Thanks,
>>
>> 	M.
>>
> 
> ----8<----
>  From ec6c07fe80d6ba96855468f61daffa9b91cf5622 Mon Sep 17 00:00:00 2001
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> Date: Fri, 11 Dec 2015 13:43:35 +0900
> Subject: [PATCH] fixup! arm64: kvm: allows kvm cpu hotplug
> 
> ---
>   arch/arm/kvm/arm.c |   62 +++++++++++++++++++++++++++++++++++-----------------
>   1 file changed, 42 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 518c3c7..05eaa35 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -573,7 +573,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>   		/*
>   		 * Re-check atomic conditions
>   		 */
> -		if (signal_pending(current)) {
> +		if (!__this_cpu_read(kvm_arm_hardware_enabled)) {

if (unlikely(...))

> +			/* cpu has been torn down */
> +			ret = 0;
> +			run->exit_reason = KVM_EXIT_FAIL_ENTRY;
> +			run->fail_entry.hardware_entry_failure_reason
> +					= (u64)-ENOEXEC;
> +		} else if (signal_pending(current)) {
>   			ret = -EINTR;
>   			run->exit_reason = KVM_EXIT_INTR;
>   		}
> @@ -950,7 +956,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
>   	}
>   }
> 
> -int kvm_arch_hardware_enable(void)
> +static void cpu_init_hyp_mode(void)
>   {
>   	phys_addr_t boot_pgd_ptr;
>   	phys_addr_t pgd_ptr;
> @@ -958,9 +964,6 @@ int kvm_arch_hardware_enable(void)
>   	unsigned long stack_page;
>   	unsigned long vector_ptr;
> 
> -	if (__hyp_get_vectors() != hyp_default_vectors)
> -		return 0;
> -
>   	/* Switch from the HYP stub to our own HYP init vector */
>   	__hyp_set_vectors(kvm_get_idmap_vector());
> 
> @@ -973,24 +976,38 @@ int kvm_arch_hardware_enable(void)
>   	__cpu_init_hyp_mode(boot_pgd_ptr, pgd_ptr, hyp_stack_ptr, vector_ptr);
> 
>   	kvm_arm_init_debug();
> -
> -	return 0;
>   }
> 
> -void kvm_arch_hardware_disable(void)
> +static void cpu_reset_hyp_mode(void)
>   {
>   	phys_addr_t boot_pgd_ptr;
>   	phys_addr_t phys_idmap_start;
> 
> -	if (__hyp_get_vectors() == hyp_default_vectors)
> -		return;
> -
>   	boot_pgd_ptr = kvm_mmu_get_boot_httbr();
>   	phys_idmap_start = kvm_get_idmap_start();
> 
>   	__cpu_reset_hyp_mode(boot_pgd_ptr, phys_idmap_start);
>   }
> 
> +int kvm_arch_hardware_enable(void)
> +{
> +	if (!__this_cpu_read(kvm_arm_hardware_enabled)) {
> +		cpu_init_hyp_mode();
> +		__this_cpu_write(kvm_arm_hardware_enabled, 1);
> +	}
> +
> +	return 0;
> +}
> +
> +void kvm_arch_hardware_disable(void)
> +{
> +	if (!__this_cpu_read(kvm_arm_hardware_enabled))
> +		return;
> +
> +	cpu_reset_hyp_mode();
> +	__this_cpu_write(kvm_arm_hardware_enabled, 0);
> +}

Are you guaranteed that these two functions are called from a context
where both preemption and interrupts are disabled? If not, you cannot
use the __this_cpu* operations, and you must stick with this_cpu*.
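
To illustrate the distinction with the per-cpu flag from this series (a
sketch only, kernel context assumed):

	/* __this_cpu_* is only valid when the caller has already pinned
	 * the task, e.g. by disabling preemption: */
	preempt_disable();
	__this_cpu_write(kvm_arm_hardware_enabled, 1);
	preempt_enable();

	/* this_cpu_* is safe from any context, because the access cannot
	 * be torn by a migration to another CPU: */
	this_cpu_write(kvm_arm_hardware_enabled, 1);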

> +
>   #ifdef CONFIG_CPU_PM
>   static int hyp_init_cpu_pm_notifier(struct notifier_block *self,
>   				    unsigned long cmd,
> @@ -998,19 +1015,13 @@ static int hyp_init_cpu_pm_notifier(struct notifier_block *self,
>   {
>   	switch (cmd) {
>   	case CPU_PM_ENTER:
> -		if (__hyp_get_vectors() != hyp_default_vectors)
> -			__this_cpu_write(kvm_arm_hardware_enabled, 1);
> -		else
> -			__this_cpu_write(kvm_arm_hardware_enabled, 0);
> -		/*
> -		 * don't call kvm_arch_hardware_disable() in case of
> -		 * CPU_PM_ENTER because it does't actually save any state.
> -		 */
> +		if (__this_cpu_read(kvm_arm_hardware_enabled))
> +			cpu_reset_hyp_mode();
> 
>   		return NOTIFY_OK;
>   	case CPU_PM_EXIT:
>   		if (__this_cpu_read(kvm_arm_hardware_enabled))
> -			kvm_arch_hardware_enable();
> +			cpu_init_hyp_mode();

It would be good to have a comment as to why we're not directly calling
hardware_enable/disable.
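
One possible wording for that comment, sketched from the flag-preservation
rationale discussed in this thread:

	/*
	 * Call the low-level cpu_init_hyp_mode()/cpu_reset_hyp_mode()
	 * helpers directly, rather than kvm_arch_hardware_enable() and
	 * kvm_arch_hardware_disable(), so that the per-cpu
	 * kvm_arm_hardware_enabled flag is left untouched across the PM
	 * transition and EL2 is re-initialized on resume only if it was
	 * enabled before suspend.
	 */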

> 
>   		return NOTIFY_OK;
> 
> @@ -1114,9 +1125,20 @@ static int init_hyp_mode(void)
>   	}
> 
>   	/*
> +	 * Init this CPU temporarily to execute kvm_hyp_call()
> +	 * during kvm_vgic_hyp_init().
> +	 */
> +	preempt_disable();
> +	cpu_init_hyp_mode();
> +
> +	/*
>   	 * Init HYP view of VGIC
>   	 */
>   	err = kvm_vgic_hyp_init();
> +
> +	cpu_reset_hyp_mode();
> +	preempt_enable();
> +
>   	if (err)
>   		goto out_free_context;
> 

Thanks,

	M.
Will Deacon Dec. 15, 2015, 5:05 p.m. UTC | #51
On Tue, Nov 24, 2015 at 10:25:34PM +0000, Geoff Levand wrote:
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> 
> We should try our best in the case of kdump.
> So even if not all secondary cpus have shut down, we do kdump anyway.
> ---
>  arch/arm64/kernel/machine_kexec.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> index d2d7e90..482aae7 100644
> --- a/arch/arm64/kernel/machine_kexec.c
> +++ b/arch/arm64/kernel/machine_kexec.c
> @@ -148,7 +148,13 @@ void machine_kexec(struct kimage *kimage)
>  	phys_addr_t reboot_code_buffer_phys;
>  	void *reboot_code_buffer;
>  
> -	BUG_ON(num_online_cpus() > 1);
> +	if (num_online_cpus() > 1) {
> +		if (in_crash_kexec)
> +			pr_warn("kdump might fail because %d cpus are still online\n",
> +					num_online_cpus());
> +		else
> +			BUG();
> +	}

Can you just rewrite the existing BUG_ON as:

  BUG_ON(num_online_cpus() > 1 && !WARN_ON(in_crash_kexec));

?
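
For readers unfamiliar with the idiom: WARN_ON(cond) evaluates to cond
after it warns, so the one-liner above expands roughly to this sketch:

	if (num_online_cpus() > 1) {
		if (in_crash_kexec)
			WARN_ON(1);	/* kdump: warn, then carry on */
		else
			BUG();		/* kexec: secondaries must be offline */
	}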

Will
Will Deacon Dec. 15, 2015, 5:15 p.m. UTC | #52
On Tue, Nov 24, 2015 at 10:25:34PM +0000, Geoff Levand wrote:
> To aid in debugging kexec problems or when adding new functionality to kexec, add
> a new routine kexec_image_info() and several inline pr_devel statements.

We don't currently have any pr_devel invocations under arm64. Can you
either remove these or change them to pr_debug?

> Signed-off-by: Geoff Levand <geoff@infradead.org>
> ---
>  arch/arm64/kernel/machine_kexec.c | 63 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 63 insertions(+)
> 
> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> index 8b990b8..da28a26 100644
> --- a/arch/arm64/kernel/machine_kexec.c
> +++ b/arch/arm64/kernel/machine_kexec.c
> @@ -25,6 +25,48 @@ extern const unsigned long arm64_relocate_new_kernel_size;
>  
>  static unsigned long kimage_start;
>  
> +/**
> + * kexec_is_dtb - Helper routine to check the device tree header signature.
> + */
> +static bool kexec_is_dtb(const void *dtb)
> +{
> +	__be32 magic;
> +
> +	return get_user(magic, (__be32 *)dtb) ? false :
> +		(be32_to_cpu(magic) == OF_DT_HEADER);

Isn't there an fdt32_to_cpu helper for this already? I also can't help
> but think this would be more readable if you didn't bother with the
ternary if form.
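
Something like this non-ternary sketch, keeping get_user() and the be32
handling from the original, would do:

	static bool kexec_is_dtb(const void *dtb)
	{
		__be32 magic;

		if (get_user(magic, (__be32 *)dtb))
			return false;

		return be32_to_cpu(magic) == OF_DT_HEADER;
	}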

Will
Will Deacon Dec. 15, 2015, 5:17 p.m. UTC | #53
On Tue, Nov 24, 2015 at 10:25:34PM +0000, Geoff Levand wrote:
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> 
> This patch adds arch specific descriptions about kdump usage on arm64
> to kdump.txt.
> 
> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> ---
>  Documentation/kdump/kdump.txt | 32 +++++++++++++++++++++++++++++++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> index bc4bd5a..7d3fad0 100644
> --- a/Documentation/kdump/kdump.txt
> +++ b/Documentation/kdump/kdump.txt
> @@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
>  a remote system.
>  
>  Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
> -s390x and arm architectures.
> +s390x, arm and arm64 architectures.
>  
>  When the system kernel boots, it reserves a small section of memory for
>  the dump-capture kernel. This ensures that ongoing Direct Memory Access
> @@ -249,6 +249,29 @@ Dump-capture kernel config options (Arch Dependent, arm)
>  
>      AUTO_ZRELADDR=y
>  
> +Dump-capture kernel config options (Arch Dependent, arm64)
> +----------------------------------------------------------
> +
> +1) Disable symmetric multi-processing support under "Processor type and
> +   features":
> +
> +   CONFIG_SMP=n
> +
> +   (If CONFIG_SMP=y, then specify maxcpus=1 on the kernel command line
> +   when loading the dump-capture kernel, see section "Load the Dump-capture
> +   Kernel".)

SMP is now forced on, so please update this text.

> +2) The maximum memory size on the dump-capture kernel must be limited by
> +   specifying:
> +
> +   mem=X[MG]
> +
> +   where X should be less than or equal to the size in "crashkernel="
> +   boot parameter. Kexec-tools will automatically add this.
> +
> +3) Currently, kvm will not be enabled on the dump-capture kernel even
> +   if it is configured.
> +
>  Extended crashkernel syntax
>  ===========================
>  
> @@ -312,6 +335,8 @@ Boot into System Kernel
>     any space below the alignment point may be overwritten by the dump-capture kernel,
>     which means it is possible that the vmcore is not that precise as expected.
>  
> +   On arm64, use "crashkernel=Y[@X]".  Note that the start address of
> +   the kernel, X if explicitly specified, must be aligned to 2MiB (0x200000).
>  
>  Load the Dump-capture Kernel
>  ============================
> @@ -334,6 +359,8 @@ For s390x:
>  	- Use image or bzImage
>  For arm:
>  	- Use zImage
> +For arm64:
> +	- Use vmlinux

What? Why doesn't Image work?

Will
Will Deacon Dec. 15, 2015, 5:29 p.m. UTC | #54
On Tue, Nov 24, 2015 at 10:25:34PM +0000, Geoff Levand wrote:
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> 
> On primary kernel, the memory region used by crash dump kernel must be
> specified by "crashkernel=" boot parameter. reserve_crashkernel()
> will allocate and reserve the region for later use.
> 
> User space tools will be able to find the region marked as "Crash kernel"
> in /proc/iomem.
> 
> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> ---
>  arch/arm64/kernel/setup.c |  7 +++++-
>  arch/arm64/mm/init.c      | 54 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 60 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index 8119479..f9fffc9 100644
> --- a/arch/arm64/kernel/setup.c
> +++ b/arch/arm64/kernel/setup.c
> @@ -31,7 +31,6 @@
>  #include <linux/screen_info.h>
>  #include <linux/init.h>
>  #include <linux/kexec.h>
> -#include <linux/crash_dump.h>
>  #include <linux/root_dev.h>
>  #include <linux/cpu.h>
>  #include <linux/interrupt.h>
> @@ -221,6 +220,12 @@ static void __init request_standard_resources(void)
>  		    kernel_data.end <= res->end)
>  			request_resource(res, &kernel_data);
>  	}
> +
> +#ifdef CONFIG_KEXEC
> +	/* User space tools will find "Crash kernel" region in /proc/iomem. */
> +	if (crashk_res.end)
> +		insert_resource(&iomem_resource, &crashk_res);
> +#endif

> Why can't this bit be moved into reserve_crashkernel(), at the end?

>  }
>  
>  #ifdef CONFIG_BLK_DEV_INITRD
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 17bf39a..24f0a1c 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -34,6 +34,7 @@
>  #include <linux/dma-contiguous.h>
>  #include <linux/efi.h>
>  #include <linux/swiotlb.h>
> +#include <linux/kexec.h>
>  
>  #include <asm/fixmap.h>
>  #include <asm/memory.h>
> @@ -66,6 +67,55 @@ static int __init early_initrd(char *p)
>  early_param("initrd", early_initrd);
>  #endif
>  
> +#ifdef CONFIG_KEXEC
> +/*
> + * reserve_crashkernel() - reserves memory for crash kernel
> + *
> + * This function reserves memory area given in "crashkernel=" kernel command
> + * line parameter. The memory reserved is used by dump capture kernel when
> + * primary kernel is crashing.
> + */
> +static void __init reserve_crashkernel(void)
> +{
> +	unsigned long long crash_size = 0, crash_base = 0;
> +	int ret;
> +
> +	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> +				&crash_size, &crash_base);
> +	if (ret)
> +		return;

What's the point in ret?

> +
> +	if (crash_base == 0) {
> +		crash_base = memblock_alloc(crash_size, 1 << 21);

What is this magic number? How about SZ_2M and a comment mentioning the
arm64 boot protocol?

> +		if (crash_base == 0) {
> +			pr_warn("Unable to allocate crashkernel (size:%llx)\n",
> +				crash_size);
> +			return;
> +		}
> +	} else {
> +		/* User specifies base address explicitly. */
> +		if (!memblock_is_region_memory(crash_base, crash_size) ||
> +			memblock_is_region_reserved(crash_base, crash_size)) {
> +			pr_warn("crashkernel has wrong address or size\n");
> +			return;
> +		}
> +
> +		if (crash_base & ((1 << 21) - 1)) {

IS_ALIGNED
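
i.e., together with the SZ_2M suggestion above, something like this sketch:

	if (!IS_ALIGNED(crash_base, SZ_2M)) {
		pr_warn("crashkernel base address is not 2MB aligned\n");
		return;
	}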

> +			pr_warn("crashkernel base address is not 2MB aligned\n");
> +			return;
> +		}
> +
> +		memblock_reserve(crash_base, crash_size);
> +	}
> +
> +	pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
> +		crash_size >> 20, crash_base >> 20);
> +
> +	crashk_res.start = crash_base;
> +	crashk_res.end = crash_base + crash_size - 1;
> +}

#else

static void __init reserve_crashkernel(void) { }

Will
Will Deacon Dec. 15, 2015, 5:45 p.m. UTC | #55
On Tue, Nov 24, 2015 at 10:25:34PM +0000, Geoff Levand wrote:
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> 
> On crash dump kernel, all the information about primary kernel's core
> image is available in elf core header specified by "elfcorehdr=" boot
> parameter. reserve_elfcorehdr() will set aside the region to avoid any
> corruption by crash dump kernel.
> 
> Crash dump kernel will access the system memory of primary kernel via
> copy_oldmem_page(), which reads one page by ioremap'ing it since it does
> not reside in linear mapping on crash dump kernel.
> Please note that we should add "mem=X[MG]" boot parameter to limit the
> memory size and avoid the following assertion at ioremap():
> 	if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr))))
> 		return NULL;
> when accessing any pages beyond the usable memory of the crash dump kernel.
> 
> We also need our own elfcorehdr_read() here since the weak definition of
> elfcorehdr_read() utilizes copy_oldmem_page() and will hit the assertion
> above on arm64.
> 
> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> ---
>  arch/arm64/Kconfig             | 12 +++++++
>  arch/arm64/kernel/Makefile     |  1 +
>  arch/arm64/kernel/crash_dump.c | 71 ++++++++++++++++++++++++++++++++++++++++++
>  arch/arm64/mm/init.c           | 29 +++++++++++++++++
>  4 files changed, 113 insertions(+)
>  create mode 100644 arch/arm64/kernel/crash_dump.c
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index c23fd77..4bac7dc 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -545,6 +545,18 @@ config KEXEC
>  	  but it is independent of the system firmware.   And like a reboot
>  	  you can start any kernel with it, not just Linux.
>  
> +config CRASH_DUMP
> +	bool "Build kdump crash kernel"
> +	help
> +	  Generate crash dump after being started by kexec. This should
> +	  be normally only set in special crash dump kernels which are
> +	  loaded in the main kernel with kexec-tools into a specially
> +	  reserved region and then later executed after a crash by
> +	  kdump/kexec. The crash dump kernel must be compiled to a
> +	  memory address not used by the main kernel.

What does this even mean? How do I "compile to a memory address not used
by the main kernel"?

> diff --git a/arch/arm64/kernel/crash_dump.c b/arch/arm64/kernel/crash_dump.c
> new file mode 100644
> index 0000000..3d86c0a
> --- /dev/null
> +++ b/arch/arm64/kernel/crash_dump.c
> @@ -0,0 +1,71 @@
> +/*
> + * Routines for doing kexec-based kdump
> + *
> + * Copyright (C) 2014 Linaro Limited
> + * Author: AKASHI Takahiro <takahiro.akashi@linaro.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/crash_dump.h>
> +#include <linux/errno.h>
> +#include <linux/io.h>
> +#include <linux/memblock.h>
> +#include <linux/uaccess.h>
> +#include <asm/memory.h>
> +
> +/**
> + * copy_oldmem_page() - copy one page from old kernel memory
> + * @pfn: page frame number to be copied
> + * @buf: buffer where the copied page is placed
> + * @csize: number of bytes to copy
> + * @offset: offset in bytes into the page
> + * @userbuf: if set, @buf is in a user address space
> + *
> + * This function copies one page from old kernel memory into buffer pointed by
> + * @buf. If @buf is in userspace, set @userbuf to %1. Returns number of bytes
> + * copied or negative error in case of failure.
> + */
> +ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
> +			 size_t csize, unsigned long offset,
> +			 int userbuf)
> +{
> +	void *vaddr;
> +
> +	if (!csize)
> +		return 0;
> +
> +	vaddr = ioremap_cache(pfn << PAGE_SHIFT, PAGE_SIZE);

pfn_to_page

> +	if (!vaddr)
> +		return -ENOMEM;
> +
> +	if (userbuf) {
> +		if (copy_to_user(buf, vaddr + offset, csize)) {
> +			iounmap(vaddr);
> +			return -EFAULT;
> +		}
> +	} else {
> +		memcpy(buf, vaddr + offset, csize);
> +	}
> +
> +	iounmap(vaddr);
> +
> +	return csize;
> +}
> +
> +/**
> + * elfcorehdr_read - read from ELF core header
> + * @buf: buffer where the data is placed
> + * @csize: number of bytes to read
> + * @ppos: address in the memory
> + *
> + * This function reads @count bytes from elf core header which exists
> + * on crash dump kernel's memory.
> + */
> +ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
> +{
> +	memcpy(buf, phys_to_virt((phys_addr_t)*ppos), count);
> +	return count;
> +}

I know you say that we have to override this function so that we don't
hit the pfn_valid warning in ioremap, but what guarantees that the ELF
header of the crashed kernel is actually mapped in our linear mapping?

> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 24f0a1c..52a1469 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -35,6 +35,7 @@
>  #include <linux/efi.h>
>  #include <linux/swiotlb.h>
>  #include <linux/kexec.h>
> +#include <linux/crash_dump.h>
>  
>  #include <asm/fixmap.h>
>  #include <asm/memory.h>
> @@ -116,6 +117,31 @@ static void __init reserve_crashkernel(void)
>  }
>  #endif /* CONFIG_KEXEC */
>  
> +#ifdef CONFIG_CRASH_DUMP
> +/*
> + * reserve_elfcorehdr() - reserves memory for elf core header
> + *
> + * This function reserves elf core header given in "elfcorehdr=" kernel
> + * command line parameter. This region contains all the information about
> + * primary kernel's core image and is used by a dump capture kernel to
> + * access the system memory on primary kernel.
> + */
> +static void __init reserve_elfcorehdr(void)
> +{
> +	if (!elfcorehdr_size)
> +		return;
> +
> +	if (memblock_is_region_reserved(elfcorehdr_addr, elfcorehdr_size)) {
> +		pr_warn("elfcorehdr is overlapped\n");
> +		return;
> +	}
> +
> +	memblock_reserve(elfcorehdr_addr, elfcorehdr_size);
> +
> +	pr_info("Reserving %lldKB of memory at %lldMB for elfcorehdr\n",
> +		elfcorehdr_size >> 10, elfcorehdr_addr >> 20);

I'd have thought it would be more useful to print the address as an
address rather than a size.

> +}

Similar #else trick here.

Will
Will Deacon Dec. 15, 2015, 6:29 p.m. UTC | #56
On Tue, Nov 24, 2015 at 10:25:34PM +0000, Geoff Levand wrote:
> Add three new files, kexec.h, machine_kexec.c and relocate_kernel.S to the
> arm64 architecture that add support for the kexec re-boot mechanism
> (CONFIG_KEXEC) on arm64 platforms.
> 
> Signed-off-by: Geoff Levand <geoff@infradead.org>
> ---
>  arch/arm64/Kconfig                  |  10 +++
>  arch/arm64/include/asm/kexec.h      |  48 ++++++++++++
>  arch/arm64/kernel/Makefile          |   2 +
>  arch/arm64/kernel/machine_kexec.c   | 152 ++++++++++++++++++++++++++++++++++++
>  arch/arm64/kernel/relocate_kernel.S | 131 +++++++++++++++++++++++++++++++
>  include/uapi/linux/kexec.h          |   1 +
>  6 files changed, 344 insertions(+)
>  create mode 100644 arch/arm64/include/asm/kexec.h
>  create mode 100644 arch/arm64/kernel/machine_kexec.c
>  create mode 100644 arch/arm64/kernel/relocate_kernel.S
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 9ac16a4..c23fd77 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -535,6 +535,16 @@ config SECCOMP
>  	  and the task is only allowed to execute a few safe syscalls
>  	  defined by each seccomp mode.
>  
> +config KEXEC
> +	depends on PM_SLEEP_SMP
> +	select KEXEC_CORE
> +	bool "kexec system call"
> +	---help---
> +	  kexec is a system call that implements the ability to shutdown your
> +	  current kernel, and to start another kernel.  It is like a reboot
> +	  but it is independent of the system firmware.   And like a reboot
> +	  you can start any kernel with it, not just Linux.
> +
>  config XEN_DOM0
>  	def_bool y
>  	depends on XEN
> diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
> new file mode 100644
> index 0000000..46d63cd
> --- /dev/null
> +++ b/arch/arm64/include/asm/kexec.h
> @@ -0,0 +1,48 @@
> +/*
> + * kexec for arm64
> + *
> + * Copyright (C) Linaro.
> + * Copyright (C) Huawei Futurewei Technologies.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#if !defined(_ARM64_KEXEC_H)
> +#define _ARM64_KEXEC_H

Please keep to the style used elsewhere in the arch headers.

> +
> +/* Maximum physical address we can use pages from */
> +
> +#define KEXEC_SOURCE_MEMORY_LIMIT (-1UL)
> +
> +/* Maximum address we can reach in physical address mode */
> +
> +#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
> +
> +/* Maximum address we can use for the control code buffer */
> +
> +#define KEXEC_CONTROL_MEMORY_LIMIT (-1UL)
> +
> +#define KEXEC_CONTROL_PAGE_SIZE	4096

Does this work on kernels configured with 64k pages? It looks like the
kexec core code will end up using order-0 pages, so I worry that we'll
actually put down 64k and potentially confuse a 4k crash kernel, for
example.

> +#define KEXEC_ARCH KEXEC_ARCH_ARM64
> +
> +#if !defined(__ASSEMBLY__)

#ifndef

> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> new file mode 100644
> index 0000000..8b990b8
> --- /dev/null
> +++ b/arch/arm64/kernel/machine_kexec.c
> @@ -0,0 +1,152 @@
> +/*
> + * kexec for arm64
> + *
> + * Copyright (C) Linaro.
> + * Copyright (C) Huawei Futurewei Technologies.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/kexec.h>
> +#include <linux/of_fdt.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +
> +#include <asm/cacheflush.h>
> +#include <asm/system_misc.h>
> +
> +#include "cpu-reset.h"
> +
> +/* Global variables for the arm64_relocate_new_kernel routine. */
> +extern const unsigned char arm64_relocate_new_kernel[];
> +extern const unsigned long arm64_relocate_new_kernel_size;
> +
> +static unsigned long kimage_start;
> +
> +void machine_kexec_cleanup(struct kimage *kimage)
> +{
> +	/* Empty routine needed to avoid build errors. */
> +}
> +
> +/**
> + * machine_kexec_prepare - Prepare for a kexec reboot.
> + *
> + * Called from the core kexec code when a kernel image is loaded.
> + */
> +int machine_kexec_prepare(struct kimage *kimage)
> +{
> +	kimage_start = kimage->start;
> +	return 0;
> +}
> +
> +/**
> + * kexec_list_flush - Helper to flush the kimage list to PoC.
> + */
> +static void kexec_list_flush(unsigned long kimage_head)
> +{
> +	unsigned long *entry;
> +
> +	for (entry = &kimage_head; ; entry++) {
> +		unsigned int flag = *entry & IND_FLAGS;
> +		void *addr = phys_to_virt(*entry & PAGE_MASK);
> +
> +		switch (flag) {
> +		case IND_INDIRECTION:
> +			entry = (unsigned long *)addr - 1;
> +			__flush_dcache_area(addr, PAGE_SIZE);
> +			break;
> +		case IND_DESTINATION:
> +			break;
> +		case IND_SOURCE:
> +			__flush_dcache_area(addr, PAGE_SIZE);
> +			break;
> +		case IND_DONE:
> +			return;
> +		default:
> +			BUG();
> +		}
> +	}
> +}
> +
> +/**
> + * kexec_segment_flush - Helper to flush the kimage segments to PoC.
> + */
> +static void kexec_segment_flush(const struct kimage *kimage)
> +{
> +	unsigned long i;
> +
> +	pr_devel("%s:\n", __func__);
> +
> +	for (i = 0; i < kimage->nr_segments; i++) {
> +		pr_devel("  segment[%lu]: %016lx - %016lx, %lx bytes, %lu pages\n",
> +			i,
> +			kimage->segment[i].mem,
> +			kimage->segment[i].mem + kimage->segment[i].memsz,
> +			kimage->segment[i].memsz,
> +			kimage->segment[i].memsz /  PAGE_SIZE);
> +
> +		__flush_dcache_area(phys_to_virt(kimage->segment[i].mem),
> +			kimage->segment[i].memsz);
> +	}
> +}
> +
> +/**
> + * machine_kexec - Do the kexec reboot.
> + *
> + * Called from the core kexec code for a sys_reboot with LINUX_REBOOT_CMD_KEXEC.
> + */
> +void machine_kexec(struct kimage *kimage)
> +{
> +	phys_addr_t reboot_code_buffer_phys;
> +	void *reboot_code_buffer;
> +
> +	BUG_ON(num_online_cpus() > 1);
> +
> +	reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
> +	reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);
> +
> +	/*
> +	 * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
> +	 * after the kernel is shut down.
> +	 */
> +	memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
> +		arm64_relocate_new_kernel_size);

At which point does the I-cache get invalidated for this?

> +
> +	/* Flush the reboot_code_buffer in preparation for its execution. */
> +	__flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
> +
> +	/* Flush the new image. */
> +	kexec_segment_flush(kimage);
> +
> +	/* Flush the kimage list. */
> +	kexec_list_flush(kimage->head);
> +
> +	pr_info("Bye!\n");
> +
> +	/* Disable all DAIF exceptions. */
> +	asm volatile ("msr daifset, #0xf" : : : "memory");

Can we not use our helpers for this?

> +
> +	setup_mm_for_reboot();
> +
> +	/*
> +	 * cpu_soft_restart will shutdown the MMU, disable data caches, then
> +	 * transfer control to the reboot_code_buffer which contains a copy of
> +	 * the arm64_relocate_new_kernel routine.  arm64_relocate_new_kernel
> +	 * uses physical addressing to relocate the new image to its final
> +	 * position and transfers control to the image entry point when the
> +	 * relocation is complete.
> +	 */
> +
> +	cpu_soft_restart(virt_to_phys(cpu_reset),
> +		is_hyp_mode_available(),
> +		reboot_code_buffer_phys, kimage->head, kimage_start);
> +
> +	BUG(); /* Should never get here. */
> +}
> +
> +void machine_crash_shutdown(struct pt_regs *regs)
> +{
> +	/* Empty routine needed to avoid build errors. */
> +}
> diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
> new file mode 100644
> index 0000000..71cab0e
> --- /dev/null
> +++ b/arch/arm64/kernel/relocate_kernel.S
> @@ -0,0 +1,131 @@
> +/*
> + * kexec for arm64
> + *
> + * Copyright (C) Linaro.
> + * Copyright (C) Huawei Futurewei Technologies.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/kexec.h>
> +
> +#include <asm/assembler.h>
> +#include <asm/kexec.h>
> +#include <asm/kvm_arm.h>
> +#include <asm/page.h>
> +
> +/*
> + * arm64_relocate_new_kernel - Put a 2nd stage image in place and boot it.
> + *
> > + * The memory that the old kernel occupies may be overwritten when copying the
> + * new image to its final location.  To assure that the
> + * arm64_relocate_new_kernel routine which does that copy is not overwritten,
> + * all code and data needed by arm64_relocate_new_kernel must be between the
> + * symbols arm64_relocate_new_kernel and arm64_relocate_new_kernel_end.  The
> + * machine_kexec() routine will copy arm64_relocate_new_kernel to the kexec
> + * control_code_page, a special page which has been set up to be preserved
> + * during the copy operation.
> + */
> +.globl arm64_relocate_new_kernel
> +arm64_relocate_new_kernel:
> +
> +	/* Setup the list loop variables. */
> +	mov	x18, x0				/* x18 = kimage_head */
> +	mov	x17, x1				/* x17 = kimage_start */
> +	dcache_line_size x16, x0		/* x16 = dcache line size */

Why is this needed?

> +	mov	x15, xzr			/* x15 = segment start */
> +	mov	x14, xzr			/* x14 = entry ptr */
> +	mov	x13, xzr			/* x13 = copy dest */
> +
> +	/* Clear the sctlr_el2 flags. */
> +	mrs	x0, CurrentEL
> +	cmp	x0, #CurrentEL_EL2
> +	b.ne	1f
> +	mrs	x0, sctlr_el2
> +	ldr	x1, =SCTLR_EL2_FLAGS

If we're using literal pools, we probably want a .ltorg directive somewhere.

> +	bic	x0, x0, x1
> +	msr	sctlr_el2, x0
> +	isb
> +1:
> +
> +	/* Check if the new image needs relocation. */
> +	cbz	x18, .Ldone
> +	tbnz	x18, IND_DONE_BIT, .Ldone
> +
> +.Lloop:
> +	and	x12, x18, PAGE_MASK		/* x12 = addr */
> +
> +	/* Test the entry flags. */
> +.Ltest_source:
> +	tbz	x18, IND_SOURCE_BIT, .Ltest_indirection
> +
> +	mov x20, x13				/*  x20 = copy dest */
> +	mov x21, x12				/*  x21 = copy src */

Weird indentation.

> +	/* Copy page. */
> +1:	ldp	x22, x23, [x21]
> +	ldp	x24, x25, [x21, #16]
> +	ldp	x26, x27, [x21, #32]
> +	ldp	x28, x29, [x21, #48]
> +	add	x21, x21, #64
> +	stnp	x22, x23, [x20]
> +	stnp	x24, x25, [x20, #16]
> +	stnp	x26, x27, [x20, #32]
> +	stnp	x28, x29, [x20, #48]
> +	add	x20, x20, #64
> +	tst	x21, #(PAGE_SIZE - 1)
> +	b.ne	1b

We should macroise this, to save on duplication of a common routine.
You also need to address the caching issues that Mark raised separately.

> +	/* dest += PAGE_SIZE */
> +	add	x13, x13, PAGE_SIZE
> +	b	.Lnext
> +
> +.Ltest_indirection:
> +	tbz	x18, IND_INDIRECTION_BIT, .Ltest_destination
> +
> +	/* ptr = addr */
> +	mov	x14, x12
> +	b	.Lnext
> +
> +.Ltest_destination:
> +	tbz	x18, IND_DESTINATION_BIT, .Lnext
> +
> +	mov	x15, x12
> +
> +	/* dest = addr */
> +	mov	x13, x12
> +
> +.Lnext:
> +	/* entry = *ptr++ */
> +	ldr	x18, [x14], #8
> +
> +	/* while (!(entry & DONE)) */
> +	tbz	x18, IND_DONE_BIT, .Lloop
> +
> +.Ldone:
> +	dsb	sy
> +	ic	ialluis

I don't think this needs to be inner-shareable, and these dsbs can probably
be non-shareable too.

> +	dsb	sy
> +	isb
> +
> +	/* Start new image. */
> +	mov	x0, xzr
> +	mov	x1, xzr
> +	mov	x2, xzr
> +	mov	x3, xzr
> +	br	x17
> +
> +.align 3	/* To keep the 64-bit values below naturally aligned. */
> +
> +.Lcopy_end:
> +.org	KEXEC_CONTROL_PAGE_SIZE
> +
> +/*
> + * arm64_relocate_new_kernel_size - Number of bytes to copy to the
> + * control_code_page.
> + */
> +.globl arm64_relocate_new_kernel_size
> +arm64_relocate_new_kernel_size:
> +	.quad	.Lcopy_end - arm64_relocate_new_kernel
> diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
> index 99048e5..ccec467 100644
> --- a/include/uapi/linux/kexec.h
> +++ b/include/uapi/linux/kexec.h
> @@ -39,6 +39,7 @@
>  #define KEXEC_ARCH_SH      (42 << 16)
>  #define KEXEC_ARCH_MIPS_LE (10 << 16)
>  #define KEXEC_ARCH_MIPS    ( 8 << 16)
> +#define KEXEC_ARCH_ARM64   (183 << 16)

This should probably be called KEXEC_ARCH_AARCH64 for consistency with
the ELF machine name.

Will
Geoff Levand Dec. 16, 2015, 12:14 a.m. UTC | #57
Hi Will,

I'll post a v12.4 of this patch that addresses your comments.

On Tue, 2015-12-15 at 18:29 +0000, Will Deacon wrote:
> > +#if !defined(_ARM64_KEXEC_H)
> > +#define _ARM64_KEXEC_H
> 
> Please keep to the style used elsewhere in the arch headers.

OK.

> > +
> > +#define KEXEC_CONTROL_PAGE_SIZE	4096
> 
> Does this work on kernels configured with 64k pages? It looks like the
> kexec core code will end up using order-0 pages, so I worry that we'll
> actually put down 64k and potentially confuse a 4k crash kernel, for
> example.

KEXEC_CONTROL_PAGE_SIZE just tells the core kexec code how big
we need control_code_buffer to be.  That buffer is only used by
the arch code of the first stage kernel.  With 64k pages the buffer
will be a full page, but we'll only use the first 4k of it. 

> > +#define KEXEC_ARCH KEXEC_ARCH_ARM64
> > +
> > +#if !defined(__ASSEMBLY__)
> 
> #ifndef

OK.

> > + * machine_kexec - Do the kexec reboot.
> > + *
> > + * Called from the core kexec code for a sys_reboot with LINUX_REBOOT_CMD_KEXEC.
> > + */
> > +void machine_kexec(struct kimage *kimage)
> > +{
> > +	phys_addr_t reboot_code_buffer_phys;
> > +	void *reboot_code_buffer;
> > +
> > +	BUG_ON(num_online_cpus() > 1);
> > +
> > +	reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
> > +	reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);
> > +
> > +	/*
> > +	 * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
> > +	 * after the kernel is shut down.
> > +	 */
> > +	memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
> > +		arm64_relocate_new_kernel_size);
> 
> At which point does the I-cache get invalidated for this?

I'll add a call to flush_icache_range() for reboot_code_buffer.  I
think that should do it.

> > +
> > +	/* Flush the reboot_code_buffer in preparation for its execution. */
> > +	__flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
> > +
> > +	/* Flush the new image. */
> > +	kexec_segment_flush(kimage);
> > +
> > +	/* Flush the kimage list. */
> > +	kexec_list_flush(kimage->head);
> > +
> > +	pr_info("Bye!\n");
> > +
> > +	/* Disable all DAIF exceptions. */
> > +	asm volatile ("msr daifset, #0xf" : : : "memory");
> 
> Can we not use our helpers for this?

Mark Rutland had commented that calling daifset four times
through the different macros took considerable time, and
recommended a single call here.

Would you prefer a new macro for irqflags.h, maybe

  #define local_daif_disable() asm("msr daifset, #0xf" : : : "memory")?
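
With such a macro (name still only a proposal), the call site would reduce
to a one-liner; sketch:

	/* in machine_kexec(), replacing the open-coded asm: */
	local_daif_disable();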

> > +/*
> > + * arm64_relocate_new_kernel - Put a 2nd stage image in place and boot it.
> > + *
> > + * The memory that the old kernel occupies may be overwritten when copying the
> > + * new image to its final location.  To assure that the
> > + * arm64_relocate_new_kernel routine which does that copy is not overwritten,
> > + * all code and data needed by arm64_relocate_new_kernel must be between the
> > + * symbols arm64_relocate_new_kernel and arm64_relocate_new_kernel_end.  The
> > + * machine_kexec() routine will copy arm64_relocate_new_kernel to the kexec
> > + * control_code_page, a special page which has been set up to be preserved
> > + * during the copy operation.
> > + */
> > +.globl arm64_relocate_new_kernel
> > +arm64_relocate_new_kernel:
> > +
> > +	/* Setup the list loop variables. */
> > +	mov	x18, x0				/* x18 = kimage_head */
> > +	mov	x17, x1				/* x17 = kimage_start */
> > +	dcache_line_size x16, x0		/* x16 = dcache line size */
> 
> Why is this needed?

This was left over from the previous version where we
invalidated pages while copying them.  I've since added
that invalidation back, so this is again needed.

> 
> > +	mov	x15, xzr			/* x15 = segment start */
> > +	mov	x14, xzr			/* x14 = entry ptr */
> > +	mov	x13, xzr			/* x13 = copy dest */
> > +
> > +	/* Clear the sctlr_el2 flags. */
> > +	mrs	x0, CurrentEL
> > +	cmp	x0, #CurrentEL_EL2
> > +	b.ne	1f
> > +	mrs	x0, sctlr_el2
> > +	ldr	x1, =SCTLR_EL2_FLAGS
> 
> If we're using literal pools, we probably want a .ltorg directive somewhere.

I've added one in at the end of the arm64_relocate_new_kernel
code.

> > +	bic	x0, x0, x1
> > +	msr	sctlr_el2, x0
> > +	isb
> > +1:
> > +
> > +	/* Check if the new image needs relocation. */
> > +	cbz	x18, .Ldone
> > +	tbnz	x18, IND_DONE_BIT, .Ldone
> > +
> > +.Lloop:
> > +	and	x12, x18, PAGE_MASK		/* x12 = addr */
> > +
> > +	/* Test the entry flags. */
> > +.Ltest_source:
> > +	tbz	x18, IND_SOURCE_BIT, .Ltest_indirection
> > +
> > +	mov x20, x13				/*  x20 = copy dest */
> > +	mov x21, x12				/*  x21 = copy src */
> 
> Weird indentation.

Fixed.

> > +	/* Copy page. */
> > +1:	ldp	x22, x23, [x21]
> > +	ldp	x24, x25, [x21, #16]
> > +	ldp	x26, x27, [x21, #32]
> > +	ldp	x28, x29, [x21, #48]
> > +	add	x21, x21, #64
> > +	stnp	x22, x23, [x20]
> > +	stnp	x24, x25, [x20, #16]
> > +	stnp	x26, x27, [x20, #32]
> > +	stnp	x28, x29, [x20, #48]
> > +	add	x20, x20, #64
> > +	tst	x21, #(PAGE_SIZE - 1)
> > +	b.ne	1b
> 
> We should macroise this, to save on duplication of a common routine.

So something like this in assembler.h?

+/*
+ * copy_page - copy src to dest using temp registers t1-t8
+ */
+	.macro copy_page dest:req src:req t1:req t2:req t3:req t4:req t5:req t6:req t7:req t8:req
+1:	ldp	\t1, \t2, [\src]
+	ldp	\t3, \t4, [\src, #16]
+	ldp	\t5, \t6, [\src, #32]
+	ldp	\t7, \t8, [\src, #48]
+	add	\src, \src, #64
+	stnp	\t1, \t2, [\dest]
+	stnp	\t3, \t4, [\dest, #16]
+	stnp	\t5, \t6, [\dest, #32]
+	stnp	\t7, \t8, [\dest, #48]
+	add	\dest, \dest, #64
+	tst	\src, #(PAGE_SIZE - 1)
+	b.ne	1b
+	.endm
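
(For reference, a call site matching the loop above would then read
something like "copy_page x20, x21, x22, x23, x24, x25, x26, x27, x28, x29",
i.e. dest, src, then the eight scratch registers; hypothetical until the
macro actually lands.)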

> You also need to address the caching issues that Mark raised separately.

Cache maintenance has been fixed (reintroduced) in the current code.

> 
> > +	/* dest += PAGE_SIZE */
> > +	add	x13, x13, PAGE_SIZE
> > +	b	.Lnext
> > +
> > +.Ltest_indirection:
> > +	tbz	x18, IND_INDIRECTION_BIT, .Ltest_destination
> > +
> > +	/* ptr = addr */
> > +	mov	x14, x12
> > +	b	.Lnext
> > +
> > +.Ltest_destination:
> > +	tbz	x18, IND_DESTINATION_BIT, .Lnext
> > +
> > +	mov	x15, x12
> > +
> > +	/* dest = addr */
> > +	mov	x13, x12
> > +
> > +.Lnext:
> > +	/* entry = *ptr++ */
> > +	ldr	x18, [x14], #8
> > +
> > +	/* while (!(entry & DONE)) */
> > +	tbz	x18, IND_DONE_BIT, .Lloop
> > +
> > +.Ldone:
> > +	dsb	sy
> > +	ic	ialluis
> 
> I don't think this needs to be inner-shareable, and these dsbs can probably
> be non-shareable too.

OK.

> > +	dsb	sy
> > +	isb
> > +
> > +	/* Start new image. */
> > +	mov	x0, xzr
> > +	mov	x1, xzr
> > +	mov	x2, xzr
> > +	mov	x3, xzr
> > +	br	x17
> > +
> > +.align 3	/* To keep the 64-bit values below naturally aligned. */
> > +
> > +.Lcopy_end:
> > +.org	KEXEC_CONTROL_PAGE_SIZE
> > +
> > +/*
> > + * arm64_relocate_new_kernel_size - Number of bytes to copy to the
> > + * control_code_page.
> > + */
> > +.globl arm64_relocate_new_kernel_size
> > +arm64_relocate_new_kernel_size:
> > +	.quad	.Lcopy_end - arm64_relocate_new_kernel
> > diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
> > index 99048e5..ccec467 100644
> > --- a/include/uapi/linux/kexec.h
> > +++ b/include/uapi/linux/kexec.h
> > @@ -39,6 +39,7 @@
> >  #define KEXEC_ARCH_SH      (42 << 16)
> >  #define KEXEC_ARCH_MIPS_LE (10 << 16)
> >  #define KEXEC_ARCH_MIPS    ( 8 << 16)
> > +#define KEXEC_ARCH_ARM64   (183 << 16)
> 
> This should probably be called KEXEC_ARCH_AARCH64 for consistency with
> the ELF machine name.

OK.

-Geoff
Geoff Levand Dec. 16, 2015, 12:45 a.m. UTC | #58
On Tue, 2015-12-15 at 17:15 +0000, Will Deacon wrote:
> On Tue, Nov 24, 2015 at 10:25:34PM +0000, Geoff Levand wrote:
> > To aid in debugging kexec problems or when adding new functionality to kexec, add
> > a new routine kexec_image_info() and several inline pr_devel statements.
> 
> We don't currently have any pr_devel invocations under arm64. Can you
> either remove these or change them to pr_debug?

OK.

> > Signed-off-by: Geoff Levand <geoff@infradead.org>
> > ---
> >  arch/arm64/kernel/machine_kexec.c | 63 +++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 63 insertions(+)
> > 
> > diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> > index 8b990b8..da28a26 100644
> > --- a/arch/arm64/kernel/machine_kexec.c
> > +++ b/arch/arm64/kernel/machine_kexec.c
> > @@ -25,6 +25,48 @@ extern const unsigned long arm64_relocate_new_kernel_size;
> >  
> >  static unsigned long kimage_start;
> >  
> > +/**
> > + * kexec_is_dtb - Helper routine to check the device tree header signature.
> > + */
> > +static bool kexec_is_dtb(const void *dtb)
> > +{
> > +	__be32 magic;
> > +
> > +	return get_user(magic, (__be32 *)dtb) ? false :
> > +		(be32_to_cpu(magic) == OF_DT_HEADER);
> 
> Isn't there an fdt32_to_cpu helper for this already? I also can't help
> but think this would be more readable if you didn't bother with the
> ternary if form.

Sure, I'll clean it up.

-Geoff
AKASHI Takahiro Dec. 16, 2015, 5:19 a.m. UTC | #59
On 12/16/2015 02:29 AM, Will Deacon wrote:
> On Tue, Nov 24, 2015 at 10:25:34PM +0000, Geoff Levand wrote:
>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>
>> On primary kernel, the memory region used by crash dump kernel must be
>> specified by "crashkernel=" boot parameter. reserve_crashkernel()
>> will allocate and reserve the region for later use.
>>
>> User space tools will be able to find the region marked as "Crash kernel"
>> in /proc/iomem.
>>
>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> ---
>>   arch/arm64/kernel/setup.c |  7 +++++-
>>   arch/arm64/mm/init.c      | 54 +++++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 60 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
>> index 8119479..f9fffc9 100644
>> --- a/arch/arm64/kernel/setup.c
>> +++ b/arch/arm64/kernel/setup.c
>> @@ -31,7 +31,6 @@
>>   #include <linux/screen_info.h>
>>   #include <linux/init.h>
>>   #include <linux/kexec.h>
>> -#include <linux/crash_dump.h>
>>   #include <linux/root_dev.h>
>>   #include <linux/cpu.h>
>>   #include <linux/interrupt.h>
>> @@ -221,6 +220,12 @@ static void __init request_standard_resources(void)
>>   		    kernel_data.end <= res->end)
>>   			request_resource(res, &kernel_data);
>>   	}
>> +
>> +#ifdef CONFIG_KEXEC
>> +	/* User space tools will find "Crash kernel" region in /proc/iomem. */
>> +	if (crashk_res.end)
>> +		insert_resource(&iomem_resource, &crashk_res);
>> +#endif
>
> Why can't this bit be moved into reserve_crashkernel(), at the end?

Because
- I want to call reserve_crashkernel() as early as possible (so before
   request_standard_resources()), and
- if inserting "Crash kernel" region is done before request_standard_resources(),
   request_resource() for "System RAM" region which contains "Crash kernel" will
   fail and the entry will be lost from /proc/iomem.


>>   }
>>
>>   #ifdef CONFIG_BLK_DEV_INITRD
>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>> index 17bf39a..24f0a1c 100644
>> --- a/arch/arm64/mm/init.c
>> +++ b/arch/arm64/mm/init.c
>> @@ -34,6 +34,7 @@
>>   #include <linux/dma-contiguous.h>
>>   #include <linux/efi.h>
>>   #include <linux/swiotlb.h>
>> +#include <linux/kexec.h>
>>
>>   #include <asm/fixmap.h>
>>   #include <asm/memory.h>
>> @@ -66,6 +67,55 @@ static int __init early_initrd(char *p)
>>   early_param("initrd", early_initrd);
>>   #endif
>>
>> +#ifdef CONFIG_KEXEC
>> +/*
>> + * reserve_crashkernel() - reserves memory for crash kernel
>> + *
>> + * This function reserves memory area given in "crashkernel=" kernel command
>> + * line parameter. The memory reserved is used by dump capture kernel when
>> + * primary kernel is crashing.
>> + */
>> +static void __init reserve_crashkernel(void)
>> +{
>> +	unsigned long long crash_size = 0, crash_base = 0;
>> +	int ret;
>> +
>> +	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
>> +				&crash_size, &crash_base);
>> +	if (ret)
>> +		return;
>
> What's the point in ret?

parse_crashkernel() will return -EINVAL if
- the boot command line doesn't contain a "crashkernel=" parameter, or
- the "crashkernel=" parameter has an invalid syntax or value.
Otherwise it returns 0.

So just returning should be OK; but in case the syntax is valid and yet the
resulting crash_size ends up 0, I will add a check like:
     if (!crash_size)
         return;
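
So the amended early-return would look roughly like this (a sketch,
assuming the return convention described above):

	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
				&crash_size, &crash_base);
	if (ret || !crash_size)
		return;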

>> +
>> +	if (crash_base == 0) {
>> +		crash_base = memblock_alloc(crash_size, 1 << 21);
>
> What is this magic number? How about SZ_2M and a comment mentioning the
> arm64 boot protocol?

Right. I will do so.

>> +		if (crash_base == 0) {
>> +			pr_warn("Unable to allocate crashkernel (size:%llx)\n",
>> +				crash_size);
>> +			return;
>> +		}
>> +	} else {
>> +		/* User specifies base address explicitly. */
>> +		if (!memblock_is_region_memory(crash_base, crash_size) ||
>> +			memblock_is_region_reserved(crash_base, crash_size)) {
>> +			pr_warn("crashkernel has wrong address or size\n");
>> +			return;
>> +		}
>> +
>> +		if (crash_base & ((1 << 21) - 1)) {
>
> IS_ALIGNED

OK.

>> +			pr_warn("crashkernel base address is not 2MB aligned\n");
>> +			return;
>> +		}
>> +
>> +		memblock_reserve(crash_base, crash_size);
>> +	}
>> +
>> +	pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
>> +		crash_size >> 20, crash_base >> 20);
>> +
>> +	crashk_res.start = crash_base;
>> +	crashk_res.end = crash_base + crash_size - 1;
>> +}
>
> #else
>
> static void __init reserve_crashkernel(void) { }

OK.

Thanks,
-Takahiro AKASHI

> Will
>
AKASHI Takahiro Dec. 16, 2015, 5:41 a.m. UTC | #60
On 12/16/2015 02:45 AM, Will Deacon wrote:
> On Tue, Nov 24, 2015 at 10:25:34PM +0000, Geoff Levand wrote:
>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>
>> On crash dump kernel, all the information about primary kernel's core
>> image is available in elf core header specified by "elfcorehdr=" boot
>> parameter. reserve_elfcorehdr() will set aside the region to avoid any
>> corruption by crash dump kernel.
>>
>> Crash dump kernel will access the system memory of primary kernel via
>> copy_oldmem_page(), which reads one page by ioremap'ing it since it does
>> not reside in linear mapping on crash dump kernel.
>> Please note that we should add "mem=X[MG]" boot parameter to limit the
>> memory size and avoid the following assertion at ioremap():
>> 	if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr))))
>> 		return NULL;
>> when accessing any pages beyond the usable memories of crash dump kernel.
>>
>> We also need our own elfcorehdr_read() here since the weak definition of
>> elfcorehdr_read() utilizes copy_oldmem_page() and will hit the assertion
>> above on arm64.
>>
>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> ---
>>   arch/arm64/Kconfig             | 12 +++++++
>>   arch/arm64/kernel/Makefile     |  1 +
>>   arch/arm64/kernel/crash_dump.c | 71 ++++++++++++++++++++++++++++++++++++++++++
>>   arch/arm64/mm/init.c           | 29 +++++++++++++++++
>>   4 files changed, 113 insertions(+)
>>   create mode 100644 arch/arm64/kernel/crash_dump.c
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index c23fd77..4bac7dc 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -545,6 +545,18 @@ config KEXEC
>>   	  but it is independent of the system firmware.   And like a reboot
>>   	  you can start any kernel with it, not just Linux.
>>
>> +config CRASH_DUMP
>> +	bool "Build kdump crash kernel"
>> +	help
>> +	  Generate crash dump after being started by kexec. This should
>> +	  be normally only set in special crash dump kernels which are
>> +	  loaded in the main kernel with kexec-tools into a specially
>> +	  reserved region and then later executed after a crash by
>> +	  kdump/kexec. The crash dump kernel must be compiled to a
>> +	  memory address not used by the main kernel.
>
> What does this even mean? How do I "compile to a memory address not used
> by the main kernel"?

Well, it's just a copy from arm, but right, it's ambiguous.
I will remove that text.

>> diff --git a/arch/arm64/kernel/crash_dump.c b/arch/arm64/kernel/crash_dump.c
>> new file mode 100644
>> index 0000000..3d86c0a
>> --- /dev/null
>> +++ b/arch/arm64/kernel/crash_dump.c
>> @@ -0,0 +1,71 @@
>> +/*
>> + * Routines for doing kexec-based kdump
>> + *
>> + * Copyright (C) 2014 Linaro Limited
>> + * Author: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + */
>> +
>> +#include <linux/crash_dump.h>
>> +#include <linux/errno.h>
>> +#include <linux/io.h>
>> +#include <linux/memblock.h>
>> +#include <linux/uaccess.h>
>> +#include <asm/memory.h>
>> +
>> +/**
>> + * copy_oldmem_page() - copy one page from old kernel memory
>> + * @pfn: page frame number to be copied
>> + * @buf: buffer where the copied page is placed
>> + * @csize: number of bytes to copy
>> + * @offset: offset in bytes into the page
>> + * @userbuf: if set, @buf is in a user address space
>> + *
>> + * This function copies one page from old kernel memory into buffer pointed by
>> + * @buf. If @buf is in userspace, set @userbuf to %1. Returns number of bytes
>> + * copied or negative error in case of failure.
>> + */
>> +ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
>> +			 size_t csize, unsigned long offset,
>> +			 int userbuf)
>> +{
>> +	void *vaddr;
>> +
>> +	if (!csize)
>> +		return 0;
>> +
>> +	vaddr = ioremap_cache(pfn << PAGE_SHIFT, PAGE_SIZE);
>
> pfn_to_page

Maybe __pfn_to_phys()?
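
i.e., perhaps (a sketch, assuming __pfn_to_phys() is usable here):

     vaddr = ioremap_cache(__pfn_to_phys(pfn), PAGE_SIZE);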

>> +	if (!vaddr)
>> +		return -ENOMEM;
>> +
>> +	if (userbuf) {
>> +		if (copy_to_user(buf, vaddr + offset, csize)) {
>> +			iounmap(vaddr);
>> +			return -EFAULT;
>> +		}
>> +	} else {
>> +		memcpy(buf, vaddr + offset, csize);
>> +	}
>> +
>> +	iounmap(vaddr);
>> +
>> +	return csize;
>> +}
>> +
>> +/**
>> + * elfcorehdr_read - read from ELF core header
>> + * @buf: buffer where the data is placed
>> + * @csize: number of bytes to read
>> + * @ppos: address in the memory
>> + *
>> + * This function reads @count bytes from elf core header which exists
>> + * on crash dump kernel's memory.
>> + */
>> +ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
>> +{
>> +	memcpy(buf, phys_to_virt((phys_addr_t)*ppos), count);
>> +	return count;
>> +}
>
> I know you say that we have to override this function so that we don't
> hit the pfn_valid warning in ioremap, but what guarantees that the ELF
> header of the crashed kernel is actually mapped in our linear mapping?

Well, in fact, it depends on kexec-tools.
In the current implementation for arm64, the elf core header is allocated
within the usable memory of the crash dump kernel.

Should we add some check here?
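
If so, one possible (hypothetical) guard would be to fail the read when the
header is not covered by the crash kernel's memmap, e.g.:

     ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
     {
             /* Hypothetical sanity check, not in the posted patch. */
             if (!pfn_valid(__phys_to_pfn((phys_addr_t)*ppos)))
                     return -EINVAL;

             memcpy(buf, phys_to_virt((phys_addr_t)*ppos), count);
             return count;
     }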

>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>> index 24f0a1c..52a1469 100644
>> --- a/arch/arm64/mm/init.c
>> +++ b/arch/arm64/mm/init.c
>> @@ -35,6 +35,7 @@
>>   #include <linux/efi.h>
>>   #include <linux/swiotlb.h>
>>   #include <linux/kexec.h>
>> +#include <linux/crash_dump.h>
>>
>>   #include <asm/fixmap.h>
>>   #include <asm/memory.h>
>> @@ -116,6 +117,31 @@ static void __init reserve_crashkernel(void)
>>   }
>>   #endif /* CONFIG_KEXEC */
>>
>> +#ifdef CONFIG_CRASH_DUMP
>> +/*
>> + * reserve_elfcorehdr() - reserves memory for elf core header
>> + *
>> + * This function reserves elf core header given in "elfcorehdr=" kernel
>> + * command line parameter. This region contains all the information about
>> + * primary kernel's core image and is used by a dump capture kernel to
>> + * access the system memory on primary kernel.
>> + */
>> +static void __init reserve_elfcorehdr(void)
>> +{
>> +	if (!elfcorehdr_size)
>> +		return;
>> +
>> +	if (memblock_is_region_reserved(elfcorehdr_addr, elfcorehdr_size)) {
>> +		pr_warn("elfcorehdr is overlapped\n");
>> +		return;
>> +	}
>> +
>> +	memblock_reserve(elfcorehdr_addr, elfcorehdr_size);
>> +
>> +	pr_info("Reserving %lldKB of memory at %lldMB for elfcorehdr\n",
>> +		elfcorehdr_size >> 10, elfcorehdr_addr >> 20);
>
> I'd have thought it would be more useful to print the address as an
> address rather than a size.

Yeah, I totally agree, but all the other archs, including x86 and arm,
print the address in "%lldMB" format.
If you like, I can fix it.
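
For illustration, an address-first format could be (just a sketch):

     pr_info("Reserving %lldKB of memory at 0x%llx for elfcorehdr\n",
             elfcorehdr_size >> 10, elfcorehdr_addr);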

>> +}
>
> Similar #else trick here.

Sure.

Thanks,
-Takahiro AKASHI

> Will
>
AKASHI Takahiro Dec. 16, 2015, 5:48 a.m. UTC | #61
On 12/16/2015 02:17 AM, Will Deacon wrote:
> On Tue, Nov 24, 2015 at 10:25:34PM +0000, Geoff Levand wrote:
>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>
>> This patch adds arch specific descriptions about kdump usage on arm64
>> to kdump.txt.
>>
>> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> ---
>>   Documentation/kdump/kdump.txt | 32 +++++++++++++++++++++++++++++++-
>>   1 file changed, 31 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
>> index bc4bd5a..7d3fad0 100644
>> --- a/Documentation/kdump/kdump.txt
>> +++ b/Documentation/kdump/kdump.txt
>> @@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
>>   a remote system.
>>
>>   Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
>> -s390x and arm architectures.
>> +s390x, arm and arm64 architectures.
>>
>>   When the system kernel boots, it reserves a small section of memory for
>>   the dump-capture kernel. This ensures that ongoing Direct Memory Access
>> @@ -249,6 +249,29 @@ Dump-capture kernel config options (Arch Dependent, arm)
>>
>>       AUTO_ZRELADDR=y
>>
>> +Dump-capture kernel config options (Arch Dependent, arm64)
>> +----------------------------------------------------------
>> +
>> +1) Disable symmetric multi-processing support under "Processor type and
>> +   features":
>> +
>> +   CONFIG_SMP=n
>> +
>> +   (If CONFIG_SMP=y, then specify maxcpus=1 on the kernel command line
>> +   when loading the dump-capture kernel, see section "Load the Dump-capture
>> +   Kernel".)
>
> SMP is now forced on, so please update this text.

OK, I will fix it.

>> +2) The maximum memory size on the dump-capture kernel must be limited by
>> +   specifying:
>> +
>> +   mem=X[MG]
>> +
>> +   where X should be less than or equal to the size in "crashkernel="
>> +   boot parameter. Kexec-tools will automatically add this.
>> +
>> +3) Currently, kvm will not be enabled on the dump-capture kernel even
>> +   if it is configured.
>> +
>>   Extended crashkernel syntax
>>   ===========================
>>
>> @@ -312,6 +335,8 @@ Boot into System Kernel
>>      any space below the alignment point may be overwritten by the dump-capture kernel,
>>      which means it is possible that the vmcore is not that precise as expected.
>>
>> +   On arm64, use "crashkernel=Y[@X]".  Note that the start address of
>> +   the kernel, X if explicitly specified, must be aligned to 2MiB (0x200000).
>>
>>   Load the Dump-capture Kernel
>>   ============================
>> @@ -334,6 +359,8 @@ For s390x:
>>   	- Use image or bzImage
>>   For arm:
>>   	- Use zImage
>> +For arm64:
>> +	- Use vmlinux
>
> What? Why doesn't Image work?

Actually it depends on kexec-tools, not the kernel.
Initially Geoff implemented vmlinux support only, then
recently Pratyush (RedHat) gave us a patch for "Image" support.

So now Image should work.

-Takahiro AKASHI

> Will
>
AKASHI Takahiro Dec. 16, 2015, 5:51 a.m. UTC | #62
On 12/16/2015 02:05 AM, Will Deacon wrote:
> On Tue, Nov 24, 2015 at 10:25:34PM +0000, Geoff Levand wrote:
>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>
>> We should do our best in the case of kdump.
>> So even if not all secondary cpus have shut down, we do kdump anyway.
>> ---
>>   arch/arm64/kernel/machine_kexec.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
>> index d2d7e90..482aae7 100644
>> --- a/arch/arm64/kernel/machine_kexec.c
>> +++ b/arch/arm64/kernel/machine_kexec.c
>> @@ -148,7 +148,13 @@ void machine_kexec(struct kimage *kimage)
>>   	phys_addr_t reboot_code_buffer_phys;
>>   	void *reboot_code_buffer;
>>
>> -	BUG_ON(num_online_cpus() > 1);
>> +	if (num_online_cpus() > 1) {
>> +		if (in_crash_kexec)
>> +			pr_warn("kdump might fail because %d cpus are still online\n",
>> +					num_online_cpus());
>> +		else
>> +			BUG();
>> +	}
>
> Can you just rewrite the existing BUG_ON as:
>
>    BUG_ON(num_online_cpus() && !WARN_ON(in_crash_kexec));
>
> ?

Yes, but with the "> 1" kept:
      BUG_ON((num_online_cpus() > 1) && !WARN_ON(in_crash_kexec));

(WARN_ON() returns the truth value of its condition, so on the kdump path this
warns about the still-online CPUs and skips BUG(), while a regular kexec with
secondary CPUs online still hits it.)

Thanks,
-Takahiro AKASHI

> Will
>
Pratyush Anand Dec. 16, 2015, 7:18 a.m. UTC | #63
On 15/12/2015:04:14:30 PM, Geoff Levand wrote:
> On Tue, 2015-12-15 at 18:29 +0000, Will Deacon wrote:
> > > +	 * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
> > > +	 * after the kernel is shut down.
> > > +	 */
> > > +	memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
> > > +		arm64_relocate_new_kernel_size);
> > 
> > At which point does the I-cache get invalidated for this?
> 
> I'll add a call to flush_icache_range() for reboot_code_buffer.  I
> think that should do it.

We execute arm64_relocate_new_kernel() code with I-cache disabled. So, do we
really need to invalidate I-cache?

~Pratyush
Pratyush Anand Dec. 16, 2015, 7:36 a.m. UTC | #64
On 16/12/2015:02:19:26 PM, AKASHI Takahiro wrote:
> On 12/16/2015 02:29 AM, Will Deacon wrote:
> >#else
> >
> >static void __init reserve_crashkernel(void) {};
> 
> OK.

Then maybe you do not need to protect the call to reserve_crashkernel() with
CONFIG_KEXEC_CORE either.
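
i.e., with the empty stub in place, the call site (sketch; assuming it sits in
arm64_memblock_init()) reduces to a plain:

     reserve_crashkernel();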

~Pratyush
James Morse Dec. 16, 2015, 9:30 a.m. UTC | #65
On 16/12/15 07:18, Pratyush Anand wrote:
> On 15/12/2015:04:14:30 PM, Geoff Levand wrote:
>> On Tue, 2015-12-15 at 18:29 +0000, Will Deacon wrote:
>>>> +	 * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
>>>> +	 * after the kernel is shut down.
>>>> +	 */
>>>> +	memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
>>>> +		arm64_relocate_new_kernel_size);
>>>
>>> At which point does the I-cache get invalidated for this?
>>
>> I'll add a call to flush_icache_range() for reboot_code_buffer.  I
>> think that should do it.
> 
> We execute arm64_relocate_new_kernel() code with I-cache disabled. So, do we
> really need to invalidate I-cache?

I got bitten by this, see Mark's earlier reply[0]:

Mark Rutland wrote:
> The SCTLR_ELx.I only affects the attributes that the I-cache uses to
> fetch with, not whether it is enabled (it cannot be disabled
> architecturally).
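
So the copy does need to be followed by something like this sketch in
machine_kexec():

	flush_icache_range((unsigned long)reboot_code_buffer,
			   (unsigned long)reboot_code_buffer +
			   arm64_relocate_new_kernel_size);

before we branch to the relocated code.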



James


[0]
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/382101.html
Pratyush Anand Dec. 16, 2015, 10:32 a.m. UTC | #66
On 16/12/2015:09:30:34 AM, James Morse wrote:
> On 16/12/15 07:18, Pratyush Anand wrote:
> > We execute arm64_relocate_new_kernel() code with I-cache disabled. So, do we
> > really need to invalidate I-cache?
> 
> I got bitten by this, see Mark's earlier reply[0]:
> 
> Mark Rutland wrote:
> > The SCTLR_ELx.I only affects the attributes that the I-cache uses to
> > fetch with, not whether it is enabled (it cannot be disabled
> > architecturally).

Thanks, James, for pointing that out. I had missed it.

~Pratyush