[v15,00/20] arm64 kexec kernel patches v15

Message ID cover.1457977401.git.geoff@infradead.org
State New

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/geoff/linux-kexec.git kexec-v15

Message

Geoff Levand March 14, 2016, 5:47 p.m. UTC
This series adds the core support for kexec re-boot and kdump on ARM64.  This
version of the series combines Takahiro's kdump patches with my kexec patches.
Please consider all patches for inclusion.

To load a second stage kernel and execute a kexec re-boot, or to work with kdump
on ARM64 systems, a series of patches to kexec-tools [2], which has not yet been
merged upstream, is needed.  Please update to the latest if you have been using
an older version.

To examine vmcore (/proc/vmcore), you should use
  - gdb v7.7 or later
  - crash v7.1.1 or later

[1]  https://git.kernel.org/cgit/linux/kernel/git/geoff/linux-kexec.git
[2]  https://git.kernel.org/cgit/linux/kernel/git/geoff/kexec-tools.git

Changes for v15 (Mar 14, 2016, 22m):

  o Rebase to Linux-4.5.
  o Remove DEBUG conditional in 'Add pr_debug output'.

Changes for v14 (Mar 4, 2016, 22m):

  o Rebase to Linux-4.5-rc6.
  o Rename setup_mm_for_reboot to cpu_install_idmap.
  o kdump: leave non-boot cpus online at crash dump
    As we don't have to make non-boot (crashed) cpus offline at crash dump,
    this patch adds a variant of smp_send_stop().
  o kdump: use a new device-tree property, "linux,elfcorehdr", instead of
    traditional "elfcorehdr=" kernel parameter
  o limit memory regions based on DT property, "usable-memory", instead of
    "mem=" kernel parameter
  o kdump: fix a build error when !CONFIG_KEXEC_CORE
  o kvm: use generic kvm_call_hyp() interface instead of kvm_cpu_reset()
  o kvm: initialize only a primary cpu at init_hyp_mode()

Changes for v13 (Jan 15, 2016, 20m):

  o Rebase to Linux-4.4.
  o Remove align directive from cpu_reset.c.
  o Use inline C wrapper for cpu_soft_restart.
  o Revert the new image d-cache flush changes of v10.
  o Add SCTLR cleanup patch.
  o Change pr_devel to pr_debug.
  o Call flush_icache_range() for reboot_code_buffer.
  o Add .ltorg directive to arm64_relocate_new_kernel.
  o Make new asm macro copy_page.
  o Change cache maintenance from inner-shareable to non-shareable.
  o Rename KEXEC_ARCH_ARM64 to KEXEC_ARCH_AARCH64.

  o arm64: kvm: allows kvm cpu hotplug
    - remove some garbage code from kvm_host.h
  o arm64: kdump: reserve memory for crash dump kernel
    - change CONFIG_KEXEC to CONFIG_KEXEC_CORE
    - don't panic on crash kernel alloc failure
      (thanks to Mark Salter, RH)
  o arm64: kdump: implement machine_crash_shutdown()
    - change "boot/non-boot cpu" to "crashing/non-crashing cpu"
    - introduce is_in_crash_kexec() for readability
    - re-introduce machine_kexec_mask_interrupts(), as arch/arm has,
      to discard unexpected interrupts
    - call crash_save_cpu() before making cpus offline to avoid a possible race
      (thanks to Pratyush Anand/Mark Salter, RH)
  o arm64: kdump: update a kernel doc
    - clarify that we support "Image" format as well as vmlinux in kdump.txt
  o arm64: kdump: relax BUG_ON() if more than one cpus are still active
    - change a warning message at the failure of shooting down non-crashing cpus

Changes for v12 (Nov 24, 2015, 18m):

  o No changes, rebase to Linux-4.4-rc2.

Changes for v11 (Nov 6, 2015, 18m):

  o Rebase to Linux-4.3.
  o Move the new image d-cache flush from arm64_relocate_new_kernel to machine_kexec.
  o Pass values to arm64_relocate_new_kernel in registers, not in global variables.
  o Fixups to setting the sctlr_el1 and sctlr_el2 flags.

Changes for v10 (Oct 18, 2015, 17m):

  o Rebase to Linux-4.3-rc6.
  o Move tcr_set_idmap_t0sz to assembler.h.
  o Add back simplified cpu_reset routines.
  o Combine kexec + kdump patches.

Changes for v9 (Apr 7, 2015, 11m):

  o Use new upstream flag IND_FLAGS.

Changes for v8 (Mar 19, 2015, 10m):

  o Rebase to Linux-4.0-rc4.
  o Re-boot using purgatory only.

Changes for v7 (Jan 16, 2015, 8m):

  o Rebase to Linux-3.19-rc4.
  o Change from ESR_EL2_ to ESR_ELx_.
  o Remove work-arounds for EFI systems.
  
Changes for v6 (Dec 2, 2014, 7m):

  o Rebase to Linux-3.18-rc2

Changes for v5 (Nov 16, 2014, 6m):

Changes for v4 (Oct 3, 2014, 5m):

Changes for v3 (Sept 23, 2014, 4m):

Changes for v2 (Sep 9, 2014, 4m):

  o Rebase to Linux-3.17-rc4.
  o Move macros from proc-macros.S to assembler.h.
  o Convert hcalls to use ISS field.
  o Add new hcall HVC_CALL_FUNC.
  o Add EL2 switch to soft_restart.

First submission v1 (May 13, 2014):

  o Based on Linux-3.15-rc4.

-Geoff

The following changes since commit b562e44f507e863c6792946e4e1b1449fbbac85d:

  Linux 4.5 (2016-03-13 21:28:54 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/geoff/linux-kexec.git kexec-v15

for you to fetch changes up to 906d185a6451d3349669bcf4b080129dfaa7359f:

  arm64: kdump: update a kernel doc (2016-03-14 10:29:13 -0700)

----------------------------------------------------------------
AKASHI Takahiro (7):
      arm64: kvm: allows kvm cpu hotplug
      arm64: kdump: reserve memory for crash dump kernel
      arm64: limit memory regions based on DT property, usable-memory
      arm64: kdump: implement machine_crash_shutdown()
      arm64: kdump: add kdump support
      arm64: kdump: enable kdump in the arm64 defconfig
      arm64: kdump: update a kernel doc

Geoff Levand (11):
      arm64: Fold proc-macros.S into assembler.h
      arm64: Cleanup SCTLR flags
      arm64: Convert hcalls to use HVC immediate value
      arm64: Add new hcall HVC_CALL_FUNC
      arm64: Add new asm macro copy_page
      arm64: Add back cpu_reset routines
      Revert "arm64: mm: remove unused cpu_set_idmap_tcr_t0sz function"
      Revert "arm64: remove dead code"
      arm64/kexec: Add core kexec support
      arm64/kexec: Enable kexec in the arm64 defconfig
      arm64/kexec: Add pr_debug output

James Morse (2):
      arm64: kernel: Include _AC definition in page.h
      arm64: Promote KERNEL_START/KERNEL_END definitions to a header file

 Documentation/kdump/kdump.txt        |  15 +-
 arch/arm/include/asm/kvm_host.h      |  10 +-
 arch/arm/include/asm/kvm_mmu.h       |   1 +
 arch/arm/kvm/arm.c                   | 104 +++++++++-----
 arch/arm/kvm/mmu.c                   |   5 +
 arch/arm64/Kconfig                   |  21 +++
 arch/arm64/configs/defconfig         |   2 +
 arch/arm64/include/asm/assembler.h   | 101 ++++++++++++-
 arch/arm64/include/asm/hardirq.h     |   2 +-
 arch/arm64/include/asm/kexec.h       |  82 +++++++++++
 arch/arm64/include/asm/kvm_arm.h     |  11 --
 arch/arm64/include/asm/kvm_asm.h     |   1 +
 arch/arm64/include/asm/kvm_host.h    |  12 +-
 arch/arm64/include/asm/kvm_mmu.h     |   1 +
 arch/arm64/include/asm/memory.h      |   3 +
 arch/arm64/include/asm/mmu.h         |   1 +
 arch/arm64/include/asm/mmu_context.h |  35 +++--
 arch/arm64/include/asm/page.h        |   2 +
 arch/arm64/include/asm/smp.h         |   5 +
 arch/arm64/include/asm/sysreg.h      |  19 ++-
 arch/arm64/include/asm/virt.h        |  40 ++++++
 arch/arm64/kernel/Makefile           |   3 +
 arch/arm64/kernel/cpu-reset.S        |  57 ++++++++
 arch/arm64/kernel/cpu-reset.h        |  29 ++++
 arch/arm64/kernel/crash_dump.c       |  71 +++++++++
 arch/arm64/kernel/head.S             |   3 -
 arch/arm64/kernel/hyp-stub.S         |  49 +++++--
 arch/arm64/kernel/machine_kexec.c    | 270 +++++++++++++++++++++++++++++++++++
 arch/arm64/kernel/relocate_kernel.S  | 131 +++++++++++++++++
 arch/arm64/kernel/setup.c            |   7 +-
 arch/arm64/kernel/smp.c              |  56 ++++++++
 arch/arm64/kvm/hyp-init.S            |  47 +++++-
 arch/arm64/kvm/hyp.S                 |   3 +-
 arch/arm64/kvm/hyp/hyp-entry.S       |   9 +-
 arch/arm64/kvm/reset.c               |  14 ++
 arch/arm64/mm/cache.S                |   2 -
 arch/arm64/mm/init.c                 | 151 ++++++++++++++++++++
 arch/arm64/mm/mmu.c                  |  11 ++
 arch/arm64/mm/proc-macros.S          |  98 -------------
 arch/arm64/mm/proc.S                 |   3 -
 include/uapi/linux/kexec.h           |   1 +
 41 files changed, 1291 insertions(+), 197 deletions(-)
 create mode 100644 arch/arm64/include/asm/kexec.h
 create mode 100644 arch/arm64/kernel/cpu-reset.S
 create mode 100644 arch/arm64/kernel/cpu-reset.h
 create mode 100644 arch/arm64/kernel/crash_dump.c
 create mode 100644 arch/arm64/kernel/machine_kexec.c
 create mode 100644 arch/arm64/kernel/relocate_kernel.S
 delete mode 100644 arch/arm64/mm/proc-macros.S

Comments

Dave Martin March 15, 2016, 1:50 p.m. UTC | #1
Hi

On Mon, Mar 14, 2016 at 05:48:00PM +0000, Geoff Levand wrote:
> The existing arm64 hcall implementations are limited in that they only
> allow for two distinct hcalls; with the x0 register either zero or not
> zero.  Also, the API of the hyp-stub exception vector routines and the
> KVM exception vector routines differ; hyp-stub uses a non-zero value in
> x0 to implement __hyp_set_vectors, whereas KVM uses it to implement
> kvm_call_hyp.
> 
> To allow for additional hcalls to be defined and to make the arm64 hcall
> API more consistent across exception vector routines, change the hcall
> implementations to use the 16 bit immediate value of the HVC instruction
> to specify the hcall type.

I'm a bit concerned about namespace pollution on the HVC immediate here.
Existing users tend to allocate a single "random" number to identify the
API -- Xen and Jailhouse do this for example.

If we start using the HVC immediate to select functions, not just APIs,
the space is going to fill up a lot faster if we have to multiplex
multiple APIs through it.

(We don't currently seem to multiplex APIs much here, except that we
do use HVC for PSCI calls from the guest, and it could be used for
additional paravirtualised services in the future).

> Define three new preprocessor macros HVC_CALL_HYP, HVC_GET_VECTORS, and
> HVC_SET_VECTORS to be used as hcall type specifiers and convert the
> existing __hyp_get_vectors(), __hyp_set_vectors() and kvm_call_hyp()
> routines to use these new macros when executing an HVC call.  Also,
> change the corresponding hyp-stub and KVM el1_sync exception vector
> routines to use these new macros.

It would also be preferable to keep the 32-bit and 64-bit APIs the same;
we should avoid having them different unless there's a clinching
technical reason...


There may be some historical context for this that I'm missing...

Cheers
---Dave

> 
> Signed-off-by: Geoff Levand <geoff@infradead.org>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>  arch/arm64/include/asm/virt.h  | 27 +++++++++++++++++++++++++++
>  arch/arm64/kernel/hyp-stub.S   | 32 +++++++++++++++++++++-----------
>  arch/arm64/kvm/hyp.S           |  3 ++-
>  arch/arm64/kvm/hyp/hyp-entry.S |  9 ++++++---
>  4 files changed, 56 insertions(+), 15 deletions(-)
> 

[...]
Geoff Levand March 15, 2016, 6:15 p.m. UTC | #2
Hi,

On Tue, 2016-03-15 at 13:50 +0000, Dave Martin wrote:
> On Mon, Mar 14, 2016 at 05:48:00PM +0000, Geoff Levand wrote:
> > The existing arm64 hcall implementations are limited in that they only
> > allow for two distinct hcalls; with the x0 register either zero or not
> > zero.  Also, the API of the hyp-stub exception vector routines and the
> > KVM exception vector routines differ; hyp-stub uses a non-zero value in
> > x0 to implement __hyp_set_vectors, whereas KVM uses it to implement
> > kvm_call_hyp.
> > 
> > To allow for additional hcalls to be defined and to make the arm64 hcall
> > API more consistent across exception vector routines, change the hcall
> > implementations to use the 16 bit immediate value of the HVC instruction
> > to specify the hcall type.
> 
> I'm a bit concerned about namespace pollution on the HVC immediate here.
> Existing users tend to allocate a single "random" number to identify the
> API -- Xen and Jailhouse do this for example.
> 
> If we start using the HVC immediate to select functions, not just APIs,
> the space is going to fill up a lot faster if we have to multiplex
> multiple APIs through it.

This was discussed, and it was concluded that we have 16 bits to fill up,
and that is enough.  Functions can still be multiplexed through a
single HVC immediate if the user chooses to do so.

> 
> (We don't currently seem to multiplex APIs much here, except that we
> do use HVC for PSCI calls from the guest, and it could be used for
> additional paravirtualised services in the future).
> 
> > Define three new preprocessor macros HVC_CALL_HYP, HVC_GET_VECTORS, and
> > HVC_SET_VECTORS to be used as hcall type specifiers and convert the
> > existing __hyp_get_vectors(), __hyp_set_vectors() and kvm_call_hyp()
> > routines to use these new macros when executing an HVC call.  Also,
> > change the corresponding hyp-stub and KVM el1_sync exception vector
> > routines to use these new macros.
> 
> It would also be preferable to keep the 32-bit and 64-bit APIs the same;
> we should avoid having them different unless there's a clinching
> technical reason...

Please expand on why you see it as preferable.  What problems do
you see?

-Geoff
Dave Martin March 16, 2016, 1:50 p.m. UTC | #3
On Tue, Mar 15, 2016 at 11:15:10AM -0700, Geoff Levand wrote:
> Hi,
> 
> On Tue, 2016-03-15 at 13:50 +0000, Dave Martin wrote:
> > On Mon, Mar 14, 2016 at 05:48:00PM +0000, Geoff Levand wrote:
> > > The existing arm64 hcall implementations are limited in that they only
> > > allow for two distinct hcalls; with the x0 register either zero or not
> > > zero.  Also, the API of the hyp-stub exception vector routines and the
> > > KVM exception vector routines differ; hyp-stub uses a non-zero value in
> > > x0 to implement __hyp_set_vectors, whereas KVM uses it to implement
> > > kvm_call_hyp.
> > > 
> > > To allow for additional hcalls to be defined and to make the arm64 hcall
> > > API more consistent across exception vector routines, change the hcall
> > > implementations to use the 16 bit immediate value of the HVC instruction
> > > to specify the hcall type.
> > 
> > I'm a bit concerned about namespace pollution on the HVC immediate here.
> > Existing users tend to allocate a single "random" number to identify the
> > API -- Xen and Jailhouse do this for example.
> > 
> > If we start using the HVC immediate to select functions, not just APIs,
> > the space is going to fill up a lot faster if we have to multiplex
> > multiple APIs through it.
> 
> This was discussed, and it was concluded that we have 16 bits to fill up,
> and that is enough.  Functions can still be multiplexed through a

Enough for what?

> single HVC immediate if the user chooses to do so.

But KVM can't?

The HVC #imm space doesn't seem to be managed, which implies that
discovery and/or renumbering mechanisms would be needed if we end up
wanting to mux multiple ABIs through there.  The tighter limitation
on immediate size, and the need for code patching if translation of
HVC numbers is needed, mean that this can be harder when using the HVC
immediate for demux rather than an ordinary register.

Currently, the only other ABI muxed through HVC is PSCI, but it
already looks like there is a potential collision -- HVC #0 from EL1 is
already KVM_CALL_HYP or a PSCI call, and we rely on knowing whether
the call came from the host or guest to demux it properly.

This kind of problem is likely to proliferate over time.

> > (We don't currently seem to multiplex APIs much here, except that we
> > do use HVC for PSCI calls from the guest, and it could be used for
> > additional paravirtualised services in the future).
> > 
> > > Define three new preprocessor macros HVC_CALL_HYP, HVC_GET_VECTORS, and
> > > HVC_SET_VECTORS to be used as hcall type specifiers and convert the
> > > existing __hyp_get_vectors(), __hyp_set_vectors() and kvm_call_hyp()
> > > routines to use these new macros when executing an HVC call.  Also,
> > > change the corresponding hyp-stub and KVM el1_sync exception vector
> > > routines to use these new macros.
> > 
> > It would also be preferable to keep the 32-bit and 64-bit APIs the same;
> > we should avoid having them different unless there's a clinching
> > technical reason...
> 
> Please expand on why you see it as preferable.  What problems do
> you see?

Fragmentation avoidance is the main argument I see.  The architectural
constraints and the problem to be solved are basically the same between
32- and 64-bit, AFAICT.

Cheers
---Dave
Marc Zyngier March 16, 2016, 2:09 p.m. UTC | #4
On 16/03/16 13:50, Dave Martin wrote:
> On Tue, Mar 15, 2016 at 11:15:10AM -0700, Geoff Levand wrote:
>> Hi,
>>
>> On Tue, 2016-03-15 at 13:50 +0000, Dave Martin wrote:
>>> On Mon, Mar 14, 2016 at 05:48:00PM +0000, Geoff Levand wrote:
>>>> The existing arm64 hcall implementations are limited in that they only
>>>> allow for two distinct hcalls; with the x0 register either zero or not
>>>> zero.  Also, the API of the hyp-stub exception vector routines and the
>>>> KVM exception vector routines differ; hyp-stub uses a non-zero value in
>>>> x0 to implement __hyp_set_vectors, whereas KVM uses it to implement
>>>> kvm_call_hyp.
>>>>
>>>> To allow for additional hcalls to be defined and to make the arm64 hcall
>>>> API more consistent across exception vector routines, change the hcall
>>>> implementations to use the 16 bit immediate value of the HVC instruction
>>>> to specify the hcall type.
>>>
>>> I'm a bit concerned about namespace pollution on the HVC immediate here.
>>> Existing users tend to allocate a single "random" number to identify the
>>> API -- Xen and Jailhouse do this for example.
>>>
>>> If we start using the HVC immediate to select functions, not just APIs,
>>> the space is going to fill up a lot faster if we have to multiplex
>>> multiple APIs through it.
>>
>> This was discussed, and it was concluded that we have 16 bits to fill up,
>> and that is enough.  Functions can still be multiplexed through a
> 
> Enough for what?
> 
>> single HVC immediate if the user chooses to do so.
> 
> But KVM can't?
> 
> The HVC #imm space doesn't seem to be managed, which implies that
> discovery and/or renumbering mechanisms would be needed if we end up
> wanting to mux multiple ABIs through there.  The tighter limitation
> on immediate size, and the need for code patching if translation of
> HVC numbers is needed, mean that this can be harder when using the HVC
> immediate for demux rather than an ordinary register.
> 
> Currently, the only other ABI muxed through HVC is PSCI, but it
> already looks like there is a potential collision -- HVC #0 from EL1 is
> already KVM_CALL_HYP or a PSCI call, and we rely on knowing whether
> the call came from the host or guest to demux it properly.
> 
> This kind of problem is likely to proliferate over time.
> 
>>> (We don't currently seem to multiplex APIs much here, except that we
>>> do use HVC for PSCI calls from the guest, and it could be used for
>>> additional paravirtualised services in the future).
>>>
>>>> Define three new preprocessor macros HVC_CALL_HYP, HVC_GET_VECTORS, and
>>>> HVC_SET_VECTORS to be used as hcall type specifiers and convert the
>>>> existing __hyp_get_vectors(), __hyp_set_vectors() and kvm_call_hyp()
>>>> routines to use these new macros when executing an HVC call.  Also,
>>>> change the corresponding hyp-stub and KVM el1_sync exception vector
>>>> routines to use these new macros.
>>>
>>> It would also be preferable to keep the 32-bit and 64-bit APIs the same;
>>> we should avoid having them different unless there's a clinching
>>> technical reason...
>>
>> Please expand on why you see it as preferable.  What problems do
>> you see?
> 
> Fragmentation avoidance is the main argument I see.  The architectural
> constraints and the problem to be solved are basically the same between
> 32- and 64-bit, AFAICT.

+1. I never quite understood why we went from a single HVC immediate + a
register indicating the operation to a proliferation of immediate values
(and still the need for a register to indicate the operation in most cases).

This seems to go in a direction that is diametrically opposite to the
"normal" ARM way. That doesn't make it an invalid approach, but
uniformity with other APIs (PSCI for example) and the 32bit KVM code
seems a highly desirable feature (given that I'll end up maintaining
that code).

Thanks,

	M.
Geoff Levand March 17, 2016, 4:47 p.m. UTC | #5
Hi Marc,

On Wed, 2016-03-16 at 14:09 +0000, Marc Zyngier wrote:
> This seems to go in a direction that is diametrically opposite to the
> "normal" ARM way. That doesn't make it an invalid approach, but
> uniformity with other APIs (PSCI for example) and the 32bit KVM code
> seems a highly desirable feature (given that I'll end up maintaining
> that code).

We need a way to get the CPU back to the exception level it had on
entry to the kernel, and this hcall change is part of my proposed
solution.  If you could outline something you think would be a better
fit, I'll take that and work on an implementation of it.

-Geoff
James Morse March 18, 2016, 6:08 p.m. UTC | #6
Hi!

On 14/03/16 17:48, Geoff Levand wrote:
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> 
> Primary kernel calls machine_crash_shutdown() to shut down non-boot cpus
> and save registers' status in per-cpu ELF notes before starting crash
> dump kernel. See kernel_kexec().
> Even if not all secondary cpus have shut down, we do kdump anyway.
> 
> As we don't have to make non-boot(crashed) cpus offline (to preserve
> correct status of cpus at crash dump) before shutting down, this patch
> also adds a variant of smp_send_stop().
> 
> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>

> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index b1adc51..76402c6cd 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -701,6 +705,28 @@ static void ipi_cpu_stop(unsigned int cpu)
>  		cpu_relax();
>  }
>  
> +static atomic_t waiting_for_crash_ipi;
> +
> +static void ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs)
> +{
> +	crash_save_cpu(regs, cpu);
> +
> +	raw_spin_lock(&stop_lock);
> +	pr_debug("CPU%u: stopping\n", cpu);
> +	raw_spin_unlock(&stop_lock);
> +
> +	atomic_dec(&waiting_for_crash_ipi);
> +
> +	local_irq_disable();

Aren't irqs already disabled here? - or is this a 'just make sure'....


> +
> +	if (cpu_ops[cpu]->cpu_die)
> +		cpu_ops[cpu]->cpu_die(cpu);
> +
> +	/* just in case */
> +	while (1)
> +		wfi();
> +}
> +
>  /*
>   * Main handler for inter-processor interrupts
>   */
> @@ -731,6 +757,12 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
>  		irq_exit();
>  		break;
>  
> +	case IPI_CPU_CRASH_STOP:
> +		irq_enter();
> +		ipi_cpu_crash_stop(cpu, regs);
> +		irq_exit();

This made me jump: irq_exit() may end up in __do_softirq() (with irqs turned
back on!) ... but these lines are impossible to reach. Maybe get the compiler to
enforce this with an unreachable() instead?


> +		break;
> +
>  #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
>  	case IPI_TIMER:
>  		irq_enter();
> @@ -791,6 +823,30 @@ void smp_send_stop(void)
>  		pr_warning("SMP: failed to stop secondary CPUs\n");
>  }
>  
> +void smp_send_crash_stop(void)
> +{
> +	cpumask_t mask;
> +	unsigned long timeout;
> +
> +	if (num_online_cpus() == 1)
> +		return;
> +
> +	cpumask_copy(&mask, cpu_online_mask);
> +	cpumask_clear_cpu(smp_processor_id(), &mask);
> +
> +	atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
> +
> +	smp_cross_call(&mask, IPI_CPU_CRASH_STOP);
> +
> +	/* Wait up to one second for other CPUs to stop */
> +	timeout = USEC_PER_SEC;
> +	while ((atomic_read(&waiting_for_crash_ipi) > 0) && timeout--)
> +		udelay(1);
> +
> +	if (atomic_read(&waiting_for_crash_ipi) > 0)
> +		pr_warn("SMP: failed to stop secondary CPUs\n");
> +}
> +
>  /*
>   * not supported here
>   */
> 


Thanks,

James
James Morse March 18, 2016, 6:08 p.m. UTC | #7
Hi!

On 14/03/16 17:48, Geoff Levand wrote:
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> 
> On the startup of primary kernel, the memory region used by crash dump
> kernel must be specified by "crashkernel=" kernel parameter.
> reserve_crashkernel() will allocate and reserve the region for later use.
> 
> User space tools, like kexec-tools, will be able to find that region marked
> as "Crash kernel" in /proc/iomem.

[NB: Re-ordered diff hunks ]

> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 7802f21..4edf181 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -171,6 +229,8 @@ void __init arm64_memblock_init(void)
>  		memblock_reserve(__virt_to_phys(initrd_start), initrd_end - initrd_start);
>  #endif
>
> +	reserve_crashkernel();
> +
>  	early_init_fdt_scan_reserved_mem();
>
>  	/* 4GB maximum for 32-bit only capable devices */
>


This is 'nit' territory, but if you were to make this:
>	if (IS_ENABLED(CONFIG_KEXEC_CORE))
>		reserve_crashkernel();

then the #ifdefs around reserve_crashkernel() can go, and the compiler will work
out that this static function can be optimised out. It also means the
compiler performs its checks, even if CONFIG_KEXEC_CORE isn't selected. The same
trick can be applied in patch 18 (around reserve_elfcorehdr()).


> @@ -66,6 +67,63 @@ static int __init early_initrd(char *p)
>  early_param("initrd", early_initrd);
>  #endif
>  
> +#ifdef CONFIG_KEXEC_CORE
> +/*
> + * reserve_crashkernel() - reserves memory for crash kernel
> + *
> + * This function reserves memory area given in "crashkernel=" kernel command
> + * line parameter. The memory reserved is used by dump capture kernel when
> + * primary kernel is crashing.
> + */
> +static void __init reserve_crashkernel(void)
> +{
> +	unsigned long long crash_size = 0, crash_base = 0;
> +	int ret;
> +
> +	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> +				&crash_size, &crash_base);
> +	if (ret)
> +		return;
> +
> +	if (crash_base == 0) {
> +		crash_base = memblock_find_in_range(0,
> +				MEMBLOCK_ALLOC_ACCESSIBLE, crash_size, 1 << 21);
> +		if (crash_base == 0) {
> +			pr_warn("Unable to allocate crashkernel (size:%llx)\n",
> +				crash_size);
> +			return;
> +		}
> +		memblock_reserve(crash_base, crash_size);
> +
> +	} else {
> +		/* User specifies base address explicitly. */
> +		if (!memblock_is_region_memory(crash_base, crash_size) ||
> +			memblock_is_region_reserved(crash_base, crash_size)) {
> +			pr_warn("crashkernel has wrong address or size\n");
> +			return;
> +		}
> +
> +		if (crash_base & ((1 << 21) - 1)) {
> +			pr_warn("crashkernel base address is not 2MB aligned\n");
> +			return;
> +		}
> +
> +		memblock_reserve(crash_base, crash_size);
> +	}
> +
> +	pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
> +		crash_size >> 20, crash_base >> 20);
> +
> +	crashk_res.start = crash_base;
> +	crashk_res.end = crash_base + crash_size - 1;
> +}
> +#else
> +static void __init reserve_crashkernel(void)
> +{
> +	;
> +}
> +#endif /* CONFIG_KEXEC_CORE */
> +
>  /*
>   * Return the maximum physical address for ZONE_DMA (DMA_BIT_MASK(32)). It
>   * currently assumes that for memory starting above 4G, 32-bit devices will


Thanks,

James
James Morse March 21, 2016, 1:29 p.m. UTC | #8
Hi!

On 18/03/16 18:08, James Morse wrote:
> On 14/03/16 17:48, Geoff Levand wrote:
>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>
>> Primary kernel calls machine_crash_shutdown() to shut down non-boot cpus
>> and save registers' status in per-cpu ELF notes before starting crash
>> dump kernel. See kernel_kexec().
>> Even if not all secondary cpus have shut down, we do kdump anyway.
>>
>> As we don't have to make non-boot(crashed) cpus offline (to preserve
>> correct status of cpus at crash dump) before shutting down, this patch
>> also adds a variant of smp_send_stop().

>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>> index b1adc51..76402c6cd 100644
>> --- a/arch/arm64/kernel/smp.c
>> +++ b/arch/arm64/kernel/smp.c
>> @@ -701,6 +705,28 @@ static void ipi_cpu_stop(unsigned int cpu)
>>  		cpu_relax();
>>  }
>>  
>> +static atomic_t waiting_for_crash_ipi;
>> +
>> +static void ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs)
>> +{
>> +	crash_save_cpu(regs, cpu);
>> +
>> +	raw_spin_lock(&stop_lock);
>> +	pr_debug("CPU%u: stopping\n", cpu);
>> +	raw_spin_unlock(&stop_lock);
>> +
>> +	atomic_dec(&waiting_for_crash_ipi);
>> +
>> +	local_irq_disable();
>> +
>> +	if (cpu_ops[cpu]->cpu_die)
>> +		cpu_ops[cpu]->cpu_die(cpu);
>> +
>> +	/* just in case */
>> +	while (1)
>> +		wfi();

Having thought about this some more: I don't think spinning like this is safe.
We need to spin with the MMU turned off, otherwise this core will pollute the
kdump kernel with TLB entries from the old page tables. Suzuki added code to
catch this happening with cpu hotplug (grep CPU_STUCK_IN_KERNEL in
arm64/for-next/core), but that won't help here. If 'CPU_STUCK_IN_KERNEL' was set
by a core, I don't think we can kexec/kdump for this reason.

Something like cpu_die() for spin-table is needed, naively I think it should
turn the MMU off, and jump back into the secondary_holding_pen, but the core
would still be stuck in the kernel, and the memory addresses associated with
secondary_holding_pen can't be re-used. (which is fine for kdump, but not kexec)


Thanks,

James
AKASHI Takahiro March 31, 2016, 7:19 a.m. UTC | #9
On Fri, Mar 18, 2016 at 06:08:30PM +0000, James Morse wrote:
> Hi!
> 
> On 14/03/16 17:48, Geoff Levand wrote:
> > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > 
> > On the startup of primary kernel, the memory region used by crash dump
> > kernel must be specified by "crashkernel=" kernel parameter.
> > reserve_crashkernel() will allocate and reserve the region for later use.
> > 
> > User space tools, like kexec-tools, will be able to find that region marked
> > as "Crash kernel" in /proc/iomem.
> 
> [NB: Re-ordered diff hunks ]
> 
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index 7802f21..4edf181 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -171,6 +229,8 @@ void __init arm64_memblock_init(void)
> >  		memblock_reserve(__virt_to_phys(initrd_start), initrd_end - initrd_start);
> >  #endif
> >
> > +	reserve_crashkernel();
> > +
> >  	early_init_fdt_scan_reserved_mem();
> >
> >  	/* 4GB maximum for 32-bit only capable devices */
> >
> 
> 
> This is 'nit' territory, but if you were to make this:
> >	if (IS_ENABLED(CONFIG_KEXEC_CORE))
> >		reserve_crashkernel();
> 
> then the #ifdefs around reserve_crashkernel() can go, and the compiler will work
> out that this static function can be optimised out. It also means the
> compiler performs its checks, even if CONFIG_KEXEC_CORE isn't selected. The same
> trick can be applied in patch 18 (around reserve_elfcorehdr()).

It will also make the code more understandable.

Thanks,
-Takahiro AKASHI

> > @@ -66,6 +67,63 @@ static int __init early_initrd(char *p)
> >  early_param("initrd", early_initrd);
> >  #endif
> >  
> > +#ifdef CONFIG_KEXEC_CORE
> > +/*
> > + * reserve_crashkernel() - reserves memory for crash kernel
> > + *
> > + * This function reserves memory area given in "crashkernel=" kernel command
> > + * line parameter. The memory reserved is used by dump capture kernel when
> > + * primary kernel is crashing.
> > + */
> > +static void __init reserve_crashkernel(void)
> > +{
> > +	unsigned long long crash_size = 0, crash_base = 0;
> > +	int ret;
> > +
> > +	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> > +				&crash_size, &crash_base);
> > +	if (ret)
> > +		return;
> > +
> > +	if (crash_base == 0) {
> > +		crash_base = memblock_find_in_range(0,
> > +				MEMBLOCK_ALLOC_ACCESSIBLE, crash_size, 1 << 21);
> > +		if (crash_base == 0) {
> > +			pr_warn("Unable to allocate crashkernel (size:%llx)\n",
> > +				crash_size);
> > +			return;
> > +		}
> > +		memblock_reserve(crash_base, crash_size);
> > +
> > +	} else {
> > +		/* User specifies base address explicitly. */
> > +		if (!memblock_is_region_memory(crash_base, crash_size) ||
> > +			memblock_is_region_reserved(crash_base, crash_size)) {
> > +			pr_warn("crashkernel has wrong address or size\n");
> > +			return;
> > +		}
> > +
> > +		if (crash_base & ((1 << 21) - 1)) {
> > +			pr_warn("crashkernel base address is not 2MB aligned\n");
> > +			return;
> > +		}
> > +
> > +		memblock_reserve(crash_base, crash_size);
> > +	}
> > +
> > +	pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
> > +		crash_size >> 20, crash_base >> 20);
> > +
> > +	crashk_res.start = crash_base;
> > +	crashk_res.end = crash_base + crash_size - 1;
> > +}
> > +#else
> > +static void __init reserve_crashkernel(void)
> > +{
> > +	;
> > +}
> > +#endif /* CONFIG_KEXEC_CORE */
> > +
> >  /*
> >   * Return the maximum physical address for ZONE_DMA (DMA_BIT_MASK(32)). It
> >   * currently assumes that for memory starting above 4G, 32-bit devices will
> 
> 
> Thanks,
> 
> James
>
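James's IS_ENABLED() suggestion works because the macro expands to a compile-time 0 or 1, so the guarded call remains an ordinary C expression that the compiler parses and type-checks, and dead-code elimination then drops the unused static function. A minimal userspace sketch of the idiom (demo names only, not the kernel's actual IS_ENABLED() macro):

```c
#include <assert.h>

/* Stand-in for the kernel config option: flip to 1 to enable the
 * feature.  (Hypothetical demo; the real IS_ENABLED() is more
 * elaborate and works on CONFIG_* symbols directly.) */
#define DEMO_CONFIG_KEXEC_CORE 0

static int crashkernel_reserved;

/* Always visible to and checked by the compiler, even when the
 * feature is off; the optimiser deletes it because its only caller
 * is a statically-dead branch.  No #else stub is needed. */
static void reserve_crashkernel_demo(void)
{
	crashkernel_reserved = 1;
}

int arm64_memblock_init_demo(void)
{
	if (DEMO_CONFIG_KEXEC_CORE)
		reserve_crashkernel_demo();
	return crashkernel_reserved;
}
```

With the demo config set to 0 the function above still compiles (so type errors are caught) but no reservation happens.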
AKASHI Takahiro March 31, 2016, 7:46 a.m. UTC | #10
On Fri, Mar 18, 2016 at 06:08:15PM +0000, James Morse wrote:
> Hi!
> 
> On 14/03/16 17:48, Geoff Levand wrote:
> > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > 
> > Primary kernel calls machine_crash_shutdown() to shut down non-boot cpus
> > and save registers' status in per-cpu ELF notes before starting crash
> > dump kernel. See kernel_kexec().
> > Even if not all secondary cpus have shut down, we do kdump anyway.
> > 
> > As we don't have to make non-boot(crashed) cpus offline (to preserve
> > correct status of cpus at crash dump) before shutting down, this patch
> > also adds a variant of smp_send_stop().
> > 
> > Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
> 
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index b1adc51..76402c6cd 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -701,6 +705,28 @@ static void ipi_cpu_stop(unsigned int cpu)
> >  		cpu_relax();
> >  }
> >  
> > +static atomic_t waiting_for_crash_ipi;
> > +
> > +static void ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs)
> > +{
> > +	crash_save_cpu(regs, cpu);
> > +
> > +	raw_spin_lock(&stop_lock);
> > +	pr_debug("CPU%u: stopping\n", cpu);
> > +	raw_spin_unlock(&stop_lock);
> > +
> > +	atomic_dec(&waiting_for_crash_ipi);
> > +
> > +	local_irq_disable();
> 
> Aren't irqs already disabled here? - or is this a 'just make sure'....

Well, it also exists in ipi_cpu_stop() ...

> > +
> > +	if (cpu_ops[cpu]->cpu_die)
> > +		cpu_ops[cpu]->cpu_die(cpu);
> > +
> > +	/* just in case */
> > +	while (1)
> > +		wfi();
> > +}
> > +
> >  /*
> >   * Main handler for inter-processor interrupts
> >   */
> > @@ -731,6 +757,12 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
> >  		irq_exit();
> >  		break;
> >  
> > +	case IPI_CPU_CRASH_STOP:
> > +		irq_enter();
> > +		ipi_cpu_crash_stop(cpu, regs);
> > +		irq_exit();
> 
> This made me jump: irq_exit() may end up in the __do_softirq() (with irqs turned
> back on!) ... but these lines are impossible to reach. Maybe get the compiler to
> enforce this with an unreachable() instead?

I'm not sure how effective unreachable() is here, but OK I will add it.

Thanks,
-Takahiro AKASHI

> > +		break;
> > +
> >  #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
> >  	case IPI_TIMER:
> >  		irq_enter();
> > @@ -791,6 +823,30 @@ void smp_send_stop(void)
> >  		pr_warning("SMP: failed to stop secondary CPUs\n");
> >  }
> >  
> > +void smp_send_crash_stop(void)
> > +{
> > +	cpumask_t mask;
> > +	unsigned long timeout;
> > +
> > +	if (num_online_cpus() == 1)
> > +		return;
> > +
> > +	cpumask_copy(&mask, cpu_online_mask);
> > +	cpumask_clear_cpu(smp_processor_id(), &mask);
> > +
> > +	atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
> > +
> > +	smp_cross_call(&mask, IPI_CPU_CRASH_STOP);
> > +
> > +	/* Wait up to one second for other CPUs to stop */
> > +	timeout = USEC_PER_SEC;
> > +	while ((atomic_read(&waiting_for_crash_ipi) > 0) && timeout--)
> > +		udelay(1);
> > +
> > +	if (atomic_read(&waiting_for_crash_ipi) > 0)
> > +		pr_warn("SMP: failed to stop secondary CPUs\n");
> > +}
> > +
> >  /*
> >   * not supported here
> >   */
> > 
> 
> 
> Thanks,
> 
> James
> 
>
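For reference, the countdown-and-bounded-wait handshake that smp_send_crash_stop() uses above can be sketched in userspace with C11 atomics and pthreads (a hypothetical demo in which threads stand in for IPI'd CPUs; names are illustrative):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Sketch of the pattern: the initiator sets a counter to the number
 * of other "CPUs", each responder decrements it once when it has
 * stopped, and the initiator polls with a bounded busy-wait rather
 * than blocking forever -- mirroring the 1-second udelay() loop in
 * the patch. */

static atomic_int waiting;

static void *responder(void *arg)
{
	(void)arg;
	atomic_fetch_sub(&waiting, 1);	/* "I have stopped" */
	return NULL;
}

int crash_stop_demo(int nr_others)	/* nr_others <= 8 */
{
	pthread_t t[8];
	int i, timeout = 1000000;	/* ~1s in 1us steps */

	atomic_store(&waiting, nr_others);
	for (i = 0; i < nr_others; i++)
		pthread_create(&t[i], NULL, responder, NULL);

	while (atomic_load(&waiting) > 0 && timeout--)
		usleep(1);

	for (i = 0; i < nr_others; i++)
		pthread_join(t[i], NULL);

	return atomic_load(&waiting);	/* 0 on success */
}
```

In the kernel there is of course no join step; the timeout is the only bound, which is why the warning path exists.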
AKASHI Takahiro March 31, 2016, 7:57 a.m. UTC | #11
On Mon, Mar 21, 2016 at 01:29:28PM +0000, James Morse wrote:
> Hi!
> 
> On 18/03/16 18:08, James Morse wrote:
> > On 14/03/16 17:48, Geoff Levand wrote:
> >> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> >>
> >> Primary kernel calls machine_crash_shutdown() to shut down non-boot cpus
> >> and save registers' status in per-cpu ELF notes before starting crash
> >> dump kernel. See kernel_kexec().
> >> Even if not all secondary cpus have shut down, we do kdump anyway.
> >>
> >> As we don't have to make non-boot(crashed) cpus offline (to preserve
> >> correct status of cpus at crash dump) before shutting down, this patch
> >> also adds a variant of smp_send_stop().
> 
> >> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> >> index b1adc51..76402c6cd 100644
> >> --- a/arch/arm64/kernel/smp.c
> >> +++ b/arch/arm64/kernel/smp.c
> >> @@ -701,6 +705,28 @@ static void ipi_cpu_stop(unsigned int cpu)
> >>  		cpu_relax();
> >>  }
> >>  
> >> +static atomic_t waiting_for_crash_ipi;
> >> +
> >> +static void ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs)
> >> +{
> >> +	crash_save_cpu(regs, cpu);
> >> +
> >> +	raw_spin_lock(&stop_lock);
> >> +	pr_debug("CPU%u: stopping\n", cpu);
> >> +	raw_spin_unlock(&stop_lock);
> >> +
> >> +	atomic_dec(&waiting_for_crash_ipi);
> >> +
> >> +	local_irq_disable();
> >> +
> >> +	if (cpu_ops[cpu]->cpu_die)
> >> +		cpu_ops[cpu]->cpu_die(cpu);
> >> +
> >> +	/* just in case */
> >> +	while (1)
> >> +		wfi();
> 
> Having thought about this some more: I don't think spinning like this is safe.
> We need to spin with the MMU turned off, otherwise this core will pollute the
> kdump kernel with TLB entries from the old page tables.

I think that wfi() will never wake up since local interrupts are disabled
here. So how can it pollute the kdump kernel?

> Suzuki added code to
> catch this happening with cpu hotplug (grep CPU_STUCK_IN_KERNEL in
> arm64/for-next/core), but that won't help here. If 'CPU_STUCK_IN_KERNEL' was set
> by a core, I don't think we can kexec/kdump for this reason.

I will need to look into Suzuki's code.

> Something like cpu_die() for spin-table is needed, naively I think it should
> turn the MMU off, and jump back into the secondary_holding_pen, but the core
> would still be stuck in the kernel, and the memory addresses associated with
> secondary_holding_pen can't be re-used. (which is fine for kdump, but not kexec)

Please note that the code is exercised only in the kdump case,
through machine_crash_shutdown().

Thanks,
-Takahiro AKASHI

> 
> Thanks,
> 
> James
>
Marc Zyngier March 31, 2016, 8:12 a.m. UTC | #12
On 31/03/16 08:57, AKASHI Takahiro wrote:
> On Mon, Mar 21, 2016 at 01:29:28PM +0000, James Morse wrote:
>> Hi!
>>
>> On 18/03/16 18:08, James Morse wrote:
>>> On 14/03/16 17:48, Geoff Levand wrote:
>>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>>
>>>> Primary kernel calls machine_crash_shutdown() to shut down non-boot cpus
>>>> and save registers' status in per-cpu ELF notes before starting crash
>>>> dump kernel. See kernel_kexec().
>>>> Even if not all secondary cpus have shut down, we do kdump anyway.
>>>>
>>>> As we don't have to make non-boot(crashed) cpus offline (to preserve
>>>> correct status of cpus at crash dump) before shutting down, this patch
>>>> also adds a variant of smp_send_stop().
>>
>>>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>>>> index b1adc51..76402c6cd 100644
>>>> --- a/arch/arm64/kernel/smp.c
>>>> +++ b/arch/arm64/kernel/smp.c
>>>> @@ -701,6 +705,28 @@ static void ipi_cpu_stop(unsigned int cpu)
>>>>  		cpu_relax();
>>>>  }
>>>>  
>>>> +static atomic_t waiting_for_crash_ipi;
>>>> +
>>>> +static void ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs)
>>>> +{
>>>> +	crash_save_cpu(regs, cpu);
>>>> +
>>>> +	raw_spin_lock(&stop_lock);
>>>> +	pr_debug("CPU%u: stopping\n", cpu);
>>>> +	raw_spin_unlock(&stop_lock);
>>>> +
>>>> +	atomic_dec(&waiting_for_crash_ipi);
>>>> +
>>>> +	local_irq_disable();
>>>> +
>>>> +	if (cpu_ops[cpu]->cpu_die)
>>>> +		cpu_ops[cpu]->cpu_die(cpu);
>>>> +
>>>> +	/* just in case */
>>>> +	while (1)
>>>> +		wfi();
>>
>> Having thought about this some more: I don't think spinning like this is safe.
>> We need to spin with the MMU turned off, otherwise this core will pollute the
>> kdump kernel with TLB entries from the old page tables.
> 
> I think that wfi() will never wake up since local interrupts are disabled
> here. So how can it pollute the kdump kernel?

Having interrupts disabled doesn't prevent an exit from WFI. Quite the
opposite, actually. It is designed to wake-up the core when something
happens on the external interface.

Thanks,

	M.
Mark Rutland March 31, 2016, 10:10 a.m. UTC | #13
On Thu, Mar 31, 2016 at 09:12:32AM +0100, Marc Zyngier wrote:
> On 31/03/16 08:57, AKASHI Takahiro wrote:
> > On Mon, Mar 21, 2016 at 01:29:28PM +0000, James Morse wrote:
> >> On 18/03/16 18:08, James Morse wrote:
> >>> On 14/03/16 17:48, Geoff Levand wrote:
> >>>> +	/* just in case */
> >>>> +	while (1)
> >>>> +		wfi();
> >>
> >> Having thought about this some more: I don't think spinning like this is safe.
> >> We need to spin with the MMU turned off, otherwise this core will pollute the
> >> kdump kernel with TLB entries from the old page tables.
> > 
> > I think that wfi() will never wake up since local interrupts are disabled
> > here. So how can it pollute the kdump kernel?
> 
> Having interrupts disabled doesn't prevent an exit from WFI. Quite the
> opposite, actually. It is designed to wake-up the core when something
> happens on the external interface.

Further, WFI is a hint, and may simply act as a NOP. 

The ARM ARM calls this out (see "D1.17.2" Wait For Interrupt in ARM DDI
0487A.i):

	Because the architecture permits a PE to leave the low-power
	state for any reason, it is permissible for a PE to treat WFI as
> 	a NOP, but this is not recommended for lowest power operation.

Thanks,
Mark.
James Morse March 31, 2016, 10:22 a.m. UTC | #14
On 31/03/16 08:46, AKASHI Takahiro wrote:
> James Morse wrote:
> > This made me jump: irq_exit() may end up in the __do_softirq() (with irqs turned
> > back on!) ... but these lines are impossible to reach. Maybe get the compiler to
> > enforce this with an unreachable() instead?
>
> I'm not sure how effective unreachable() is here, but OK I will add it.

I thought '__builtin_unreachable()' would generate a warning if it was
reachable, but from [0] it looks like it just suppresses warnings.

You're right, it won't help.

Thanks,


James

[0] https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
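As [0] notes, __builtin_unreachable() is a promise to the compiler, not a checked assertion: reaching it at runtime is undefined behaviour, and its visible effect is to suppress warnings about paths the compiler now assumes are dead. A small sketch of that effect (assuming GCC or Clang):

```c
/* The two conditions below cover every int, but without the builtin
 * -Wreturn-type would still warn that control can reach the end of a
 * non-void function.  The builtin silences that warning; it does NOT
 * make the compiler verify the spot is actually unreachable. */
int sign(int x)
{
	if (x >= 0)
		return 1;
	if (x < 0)
		return -1;
	__builtin_unreachable();
}
```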
Dave Young April 1, 2016, 1:59 a.m. UTC | #15
On 03/14/16 at 05:47pm, Geoff Levand wrote:
> This series adds the core support for kexec re-boot and kdump on ARM64.  This
> version of the series combines Takahiro's kdump patches with my kexec patches.
> Please consider all patches for inclusion.
> 
> To load a second stage kernel and execute a kexec re-boot or to work with kdump
> on ARM64 systems a series of patches to kexec-tools [2], which have not yet been
> merged upstream, are needed.  Please update to the latest if you have been using
> an older version.
> 
> To examine vmcore (/proc/vmcore), you should use
>   - gdb v7.7 or later
>   - crash v7.1.1 or later
> 
> [1]  https://git.kernel.org/cgit/linux/kernel/git/geoff/linux-kexec.git
> [2]  https://git.kernel.org/cgit/linux/kernel/git/geoff/kexec-tools.git
> 

Geoff, for easier review maybe you can send the kexec patches first, then AKASHI Takahiro
can send the kdump patches as a standalone patchset?

Thanks
Dave
AKASHI Takahiro April 1, 2016, 6:16 a.m. UTC | #16
On 03/31/2016 02:19 PM, AKASHI Takahiro wrote:
> On Fri, Mar 18, 2016 at 06:08:30PM +0000, James Morse wrote:
>> Hi!
>>
>> On 14/03/16 17:48, Geoff Levand wrote:
>>> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>>>
>>> On the startup of primary kernel, the memory region used by crash dump
>>> kernel must be specified by "crashkernel=" kernel parameter.
>>> reserve_crashkernel() will allocate and reserve the region for later use.
>>>
>>> User space tools, like kexec-tools, will be able to find that region marked
>>> as "Crash kernel" in /proc/iomem.
>>
>> [NB: Re-ordered diff hunks ]
>>
>>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>>> index 7802f21..4edf181 100644
>>> --- a/arch/arm64/mm/init.c
>>> +++ b/arch/arm64/mm/init.c
>>> @@ -171,6 +229,8 @@ void __init arm64_memblock_init(void)
>>>   		memblock_reserve(__virt_to_phys(initrd_start), initrd_end - initrd_start);
>>>   #endif
>>>
>>> +	reserve_crashkernel();
>>> +
>>>   	early_init_fdt_scan_reserved_mem();
>>>
>>>   	/* 4GB maximum for 32-bit only capable devices */
>>>
>>
>>
>> This is 'nit' territory, but if you were to make this:
>>> 	if (IS_ENABLED(CONFIG_KEXEC_CORE))
>>> 		reserve_crashkernel();
>>
>> then the #ifdefs around reserve_crashkernel() can go, and the compiler will work
>> out that this static function can be optimised out. It also means the
>> compiler performs its checks, even if CONFIG_KEXEC_CORE isn't selected. The same
>> trick can be applied in patch 18 (around reserve_elfcorehdr()).
> 
> It will also make the code more understandable.

I tried this change, but
unfortunately reserve_crashkernel() calls an external function, parse_crashkernel(),
which is declared only if CONFIG_KEXEC_CORE. So we will see a build error with
!CONFIG_KEXEC_CORE if we take the #ifdef out around reserve_crashkernel().

-Takahiro AKASHI


> Thanks,
> -Takahiro AKASHI
> 
>>> @@ -66,6 +67,63 @@ static int __init early_initrd(char *p)
>>>   early_param("initrd", early_initrd);
>>>   #endif
>>>   
>>> +#ifdef CONFIG_KEXEC_CORE
>>> +/*
>>> + * reserve_crashkernel() - reserves memory for crash kernel
>>> + *
>>> + * This function reserves memory area given in "crashkernel=" kernel command
>>> + * line parameter. The memory reserved is used by dump capture kernel when
>>> + * primary kernel is crashing.
>>> + */
>>> +static void __init reserve_crashkernel(void)
>>> +{
>>> +	unsigned long long crash_size = 0, crash_base = 0;
>>> +	int ret;
>>> +
>>> +	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
>>> +				&crash_size, &crash_base);
>>> +	if (ret)
>>> +		return;
>>> +
>>> +	if (crash_base == 0) {
>>> +		crash_base = memblock_find_in_range(0,
>>> +				MEMBLOCK_ALLOC_ACCESSIBLE, crash_size, 1 << 21);
>>> +		if (crash_base == 0) {
>>> +			pr_warn("Unable to allocate crashkernel (size:%llx)\n",
>>> +				crash_size);
>>> +			return;
>>> +		}
>>> +		memblock_reserve(crash_base, crash_size);
>>> +
>>> +	} else {
>>> +		/* User specifies base address explicitly. */
>>> +		if (!memblock_is_region_memory(crash_base, crash_size) ||
>>> +			memblock_is_region_reserved(crash_base, crash_size)) {
>>> +			pr_warn("crashkernel has wrong address or size\n");
>>> +			return;
>>> +		}
>>> +
>>> +		if (crash_base & ((1 << 21) - 1)) {
>>> +			pr_warn("crashkernel base address is not 2MB aligned\n");
>>> +			return;
>>> +		}
>>> +
>>> +		memblock_reserve(crash_base, crash_size);
>>> +	}
>>> +
>>> +	pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
>>> +		crash_size >> 20, crash_base >> 20);
>>> +
>>> +	crashk_res.start = crash_base;
>>> +	crashk_res.end = crash_base + crash_size - 1;
>>> +}
>>> +#else
>>> +static void __init reserve_crashkernel(void)
>>> +{
>>> +	;
>>> +}
>>> +#endif /* CONFIG_KEXEC_CORE */
>>> +
>>>   /*
>>>    * Return the maximum physical address for ZONE_DMA (DMA_BIT_MASK(32)). It
>>>    * currently assumes that for memory starting above 4G, 32-bit devices will
>>
>>
>> Thanks,
>>
>> James
>>
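The build failure described above is the usual reason such helpers grow a static inline stub in their header, so callers compile in both configurations and the #ifdef around the caller can still go away. A userspace sketch of that idiom (demo names only, not the kernel's actual declarations):

```c
#define DEMO_CONFIG_KEXEC_CORE 0	/* flip to 1 for the real path */

#if DEMO_CONFIG_KEXEC_CORE
int parse_crashkernel_demo(const char *cmdline, unsigned long long *size)
{
	*size = 0x1000000;	/* pretend we parsed "crashkernel=16M" */
	return 0;
}
#else
/* Static inline stub: callers still compile, and the optimiser
 * removes the dead reservation path behind it. */
static inline int parse_crashkernel_demo(const char *cmdline,
					 unsigned long long *size)
{
	return -1;
}
#endif

int try_reserve_demo(void)
{
	unsigned long long size;

	if (parse_crashkernel_demo("crashkernel=16M", &size))
		return 0;	/* nothing to reserve */
	return 1;
}
```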
AKASHI Takahiro April 1, 2016, 8:45 a.m. UTC | #17
On Thu, Mar 31, 2016 at 11:10:38AM +0100, Mark Rutland wrote:
> On Thu, Mar 31, 2016 at 09:12:32AM +0100, Marc Zyngier wrote:
> > On 31/03/16 08:57, AKASHI Takahiro wrote:
> > > On Mon, Mar 21, 2016 at 01:29:28PM +0000, James Morse wrote:
> > >> On 18/03/16 18:08, James Morse wrote:
> > >>> On 14/03/16 17:48, Geoff Levand wrote:
> > >>>> +	/* just in case */
> > >>>> +	while (1)
> > >>>> +		wfi();
> > >>
> > >> Having thought about this some more: I don't think spinning like this is safe.
> > >> We need to spin with the MMU turned off, otherwise this core will pollute the
> > >> kdump kernel with TLB entries from the old page tables.
> > > 
> > > I think that wfi() will never wake up since local interrupts are disabled
> > > here. So how can it pollute the kdump kernel?
> > 
> > Having interrupts disabled doesn't prevent an exit from WFI. Quite the
> > opposite, actually. It is designed to wake-up the core when something
> > happens on the external interface.
> 
> Further, WFI is a hint, and may simply act as a NOP. 

Ah, OK. But even so, none of the interrupt handlers (nor any other code) will be
executed after the cpu wakes up, so the memory won't be polluted.
Or am I still missing something?

-Takahiro AKASHI
 
> The ARM ARM calls this out (see "D1.17.2" Wait For Interrupt in ARM DDI
> 0487A.i):
> 
> 	Because the architecture permits a PE to leave the low-power
> 	state for any reason, it is permissible for a PE to treat WFI as
> 	a NOP , but this is not recommended for lowest power operation.
> 
> Thanks,
> Mark.
Mark Rutland April 1, 2016, 9:36 a.m. UTC | #18
On Fri, Apr 01, 2016 at 05:45:09PM +0900, AKASHI Takahiro wrote:
> On Thu, Mar 31, 2016 at 11:10:38AM +0100, Mark Rutland wrote:
> > On Thu, Mar 31, 2016 at 09:12:32AM +0100, Marc Zyngier wrote:
> > > On 31/03/16 08:57, AKASHI Takahiro wrote:
> > > > On Mon, Mar 21, 2016 at 01:29:28PM +0000, James Morse wrote:
> > > >> On 18/03/16 18:08, James Morse wrote:
> > > >>> On 14/03/16 17:48, Geoff Levand wrote:
> > > >>>> +	/* just in case */
> > > >>>> +	while (1)
> > > >>>> +		wfi();
> > > >>
> > > >> Having thought about this some more: I don't think spinning like this is safe.
> > > >> We need to spin with the MMU turned off, otherwise this core will pollute the
> > > >> kdump kernel with TLB entries from the old page tables.
> > > > 
> > > > I think that wfi() will never wake up since local interrupts are disabled
> > > > here. So how can it pollute the kdump kernel?
> > > 
> > > Having interrupts disabled doesn't prevent an exit from WFI. Quite the
> > > opposite, actually. It is designed to wake-up the core when something
> > > happens on the external interface.
> > 
> > Further, WFI is a hint, and may simply act as a NOP. 
> 
> Ah, OK. But even so, none of interrupt handlers (nor other code) will
> be executed after cpu wakes up, and the memory won't be polluted.

The code comprising the while(1) loop will be executed, and TLB walks,
speculative fetches, etc may occur regardless.

We don't share TLB entries between cores (we don't have support for
ARMv8.2's CnP), and the kdump kernel should be running in a carveout from
main memory (which IIUC is not mapped by the original kernel). So
normally, this would not be a problem.

However, if there is a problem with the page tables (e.g. entries
erroneously point into a PA range the kdump kernel is using), then
unavoidable background memory traffic from the CPU may cause problems
for the kdump kernel.

Thanks,
Mark.
Geoff Levand April 1, 2016, 6:39 p.m. UTC | #19
Hi,

On Fri, 2016-04-01 at 09:59 +0800, Dave Young wrote:
> Geoff, for easier to review maybe you can send kexec patches first
> then AKASHI Takahiro
> can send the kdump patches as a standalone patchset?

Marc Zyngier specifically asked for an integrated set
of patches for easier review.  I will keep it that way
for now.

-Geoff
AKASHI Takahiro April 4, 2016, 9:27 a.m. UTC | #20
Mark,

On Fri, Apr 01, 2016 at 10:36:35AM +0100, Mark Rutland wrote:
> On Fri, Apr 01, 2016 at 05:45:09PM +0900, AKASHI Takahiro wrote:
> > On Thu, Mar 31, 2016 at 11:10:38AM +0100, Mark Rutland wrote:
> > > On Thu, Mar 31, 2016 at 09:12:32AM +0100, Marc Zyngier wrote:
> > > > On 31/03/16 08:57, AKASHI Takahiro wrote:
> > > > > On Mon, Mar 21, 2016 at 01:29:28PM +0000, James Morse wrote:
> > > > >> On 18/03/16 18:08, James Morse wrote:
> > > > >>> On 14/03/16 17:48, Geoff Levand wrote:
> > > > >>>> +	/* just in case */
> > > > >>>> +	while (1)
> > > > >>>> +		wfi();
> > > > >>
> > > > >> Having thought about this some more: I don't think spinning like this is safe.
> > > > >> We need to spin with the MMU turned off, otherwise this core will pollute the
> > > > >> kdump kernel with TLB entries from the old page tables.
> > > > > 
> > > > > I think that wfi() will never wake up since local interrupts are disabled
> > > > > here. So how can it pollute the kdump kernel?
> > > > 
> > > > Having interrupts disabled doesn't prevent an exit from WFI. Quite the
> > > > opposite, actually. It is designed to wake-up the core when something
> > > > happens on the external interface.
> > > 
> > > Further, WFI is a hint, and may simply act as a NOP. 
> > 
> > Ah, OK. But even so, none of interrupt handlers (nor other code) will
> > be executed after cpu wakes up, and the memory won't be polluted.
> 
> The code comprising the while(1) loop will be executed, and TLB walks,
> speculative fetches, etc may occur regardless.
> 
> We don't share TLB entries between cores (we don't have support for
> ARMv8.2s CnP), and the kdump kernel should be running in a carveout from
> main memory (which IIUC is not mapped by the original kernel). So
> normally, this would not be a problem.

In fact, the memory region for crash kernel will be just memblock_reserve()'d.
We can't do memblock_remove() here because that region is expected to
exist as part of system memory.
(For details, see kimage_load_crash_segment() in kexec_core.c.)
Should we remove it from kernel mapping?
 
> However, if there is a problem with the page tables (e.g. entries
> erroneously point into a PA range the kdump kernel is using), then
> unavoidable background memory traffic from the CPU may cause problems
> for the kdump kernel.

But the traffic generated by the cpu will concentrate on the area around
the while loop, ipi_cpu_crash_stop(), in the *crashed* kernel's memory.
(I'm not sure about speculative behaviors.)

So I don't think it will hurt the kdump kernel.

Thanks,
-Takahiro AKASHI

 
> Thanks,
> Mark.
Dave Young May 17, 2016, 5:42 a.m. UTC | #21
Marc, this has been discussed for a long time. I think the kdump code can be
logically split from this series so that we can get the kexec
part done first?

On 04/01/16 at 11:39am, Geoff Levand wrote:
> Hi,
> 
> On Fri, 2016-04-01 at 09:59 +0800, Dave Young wrote:
> > Geoff, for easier to review maybe you can send kexec patches first
> > then AKASHI Takahiro
> > can send the kdump patches as a standalone patchset?
> 
> Marc Zyngier specifically asked for an integrated set
> of patches for easier review.  I will keep it that way
> for now.
> 
> -Geoff
Marc Zyngier May 17, 2016, 8:06 a.m. UTC | #22
Hi Dave,

On 17/05/16 06:42, Dave Young wrote:
> Marc, it has been discussed for long time, I think kdump code can be
> logically splitted from this series so that we can get the kexec
> done first? 

The basis for kexec is already on its way to mainline, as part of the
hibernate series that James has put together.

I'd expect someone to:

1) rebase the remaining kexec/kdump patches based on the current
arm64/for-next/core,
2) post these patches when -rc1 gets cut in two weeks from now,
3) address the review comments in a timely manner,
4) update the series once a week so that we see some actual progress

If this happens, I can't see any reason why this code wouldn't get
merged. But keeping kexec and kdump together is not what has prevented
the code from getting merged until now.

Thanks,

	M.
> 
> On 04/01/16 at 11:39am, Geoff Levand wrote:
>> Hi,
>>
>> On Fri, 2016-04-01 at 09:59 +0800, Dave Young wrote:
>>> Geoff, for easier to review maybe you can send kexec patches first
>>> then AKASHI Takahiro
>>> can send the kdump patches as a standalone patchset?
>>
>> Marc Zyngier specifically asked for an integrated set
>> of patches for easier review.  I will keep it that way
>> for now.
>>
>> -Geoff
>
AKASHI Takahiro May 17, 2016, 9:07 a.m. UTC | #23
Marc, Dave

On Tue, May 17, 2016 at 09:06:54AM +0100, Marc Zyngier wrote:
> Hi Dave,
> 
> On 17/05/16 06:42, Dave Young wrote:
> > Marc, it has been discussed for long time, I think kdump code can be
> > logically splitted from this series so that we can get the kexec
> > done first? 
> 
> The basis for kexec are already on their way to mainline, as part of the
> hibernate series that James has put together.
> 
> I'd expect someone to:
> 
> 1) rebase the remaining kexec/kdump patches based on the current
> arm64/for-next/core,

Done, as of last week.
The only issue that I have now is support for KASLR, but this is not
a kernel issue; it's a tooling issue. I'm now talking with Dave Anderson (Red Hat)
about how we should fix it.

> 2) post these patches when -rc1 gets cut in two weeks from now,

Yes, we are planning to do so.

> 3) address the review comments in a timely manner,

I'd also expect the code to be reviewed in a timely manner.
In the v4.6 cycle, we did not get any comments (nor acks) on the kexec/kdump-specific
part of the patches. If this happens again, it will make the turnaround of our
re-spins even longer.

> 4) update the series once a week so that we see some actual progress

Yes if we have updates.

> If this happens, I can't see any reason why this code wouldn't get
> merged. But keeping kexec and kdump together is not what has prevented
> the code from getting merged until now.

Yeah, agree.

Thanks,
-Takahiro AKASHI

> 
> Thanks,
> 
> 	M.
> > 
> > On 04/01/16 at 11:39am, Geoff Levand wrote:
> >> Hi,
> >>
> >> On Fri, 2016-04-01 at 09:59 +0800, Dave Young wrote:
> >>> Geoff, for easier to review maybe you can send kexec patches first
> >>> then AKASHI Takahiro
> >>> can send the kdump patches as a standalone patchset?
> >>
> >> Marc Zyngier specifically asked for an integrated set
> >> of patches for easier review.  I will keep it that way
> >> for now.
> >>
> >> -Geoff
> > 
> 
> 
> -- 
> Jazz is not dead. It just smells funny...
Dave Young May 18, 2016, 2:09 a.m. UTC | #24
Hi Marc

On 05/17/16 at 09:06am, Marc Zyngier wrote:
> Hi Dave,
> 
> On 17/05/16 06:42, Dave Young wrote:
> > Marc, it has been discussed for long time, I think kdump code can be
> > logically splitted from this series so that we can get the kexec
> > done first? 
> 
> The basis for kexec are already on their way to mainline, as part of the
> hibernate series that James has put together.
> 
> I'd expect someone to:
> 
> 1) rebase the remaining kexec/kdump patches based on the current
> arm64/for-next/core,
> 2) post these patches when -rc1 gets cut in two weeks from now,
> 3) address the review comments in a timely manner,
> 4) update the series once a week so that we see some actual progress
> 
> If this happens, I can't see any reason why this code wouldn't get
> merged. But keeping kexec and kdump together is not what has prevented
> the code from getting merged until now.

Cool, thanks a lot for the clarification about the situation.

Dave