
[00/16] arm64 kexec kernel patches v10

Message ID cover.1445297709.git.geoff@infradead.org
State New

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/geoff/linux-kexec.git kexec-v10

Message

Geoff Levand Oct. 19, 2015, 11:38 p.m. UTC
Hi All,

This series adds the core support for kexec re-boot and kdump on ARM64.  This
v10 of the series combines Takahiro's kdump patches with my kexec patches.

To load a second stage kernel and execute a kexec re-boot, or to work with kdump,
on ARM64 systems, a series of patches to kexec-tools [2] is needed; these have
not yet been merged upstream.

I have tested kexec with the ARM Foundation model, and Takahiro has reported
that kdump is working on the 96boards HiKey developer board.  Kexec on EFI
systems works correctly.  More ACPI + kexec testing is needed.

Patch 1 here moves the macros from proc-macros.S to asm/assembler.h so that the
dcache_line_size macro it defines can be used by kexec's relocate kernel
routine.
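As a rough illustration of what that macro computes: dcache_line_size reads CTR_EL0 and turns the DminLine field (bits [19:16], log2 of the number of 4-byte words in the smallest data cache line) into a byte count. The C model below is only a sketch; dcache_line_size_from_ctr() and dcache_lines_in_range() are hypothetical helpers that take the register value as a plain integer, whereas the real macro is assembly that reads the register with mrs.

```c
#include <assert.h>
#include <stdint.h>

#define CTR_DMINLINE_SHIFT 16      /* CTR_EL0.DminLine is bits [19:16] */
#define CTR_DMINLINE_MASK  0xfULL

/* Hypothetical model of dcache_line_size: bytes = 4 << DminLine. */
static unsigned int dcache_line_size_from_ctr(uint64_t ctr_el0)
{
	unsigned int dminline =
		(unsigned int)((ctr_el0 >> CTR_DMINLINE_SHIFT) & CTR_DMINLINE_MASK);

	return 4u << dminline;     /* 4 bytes per word */
}

/* Number of lines a by-line cache maintenance loop touches for [start, start + len). */
static uint64_t dcache_lines_in_range(uint64_t start, uint64_t len, unsigned int line)
{
	uint64_t first, end;

	assert(line != 0 && (line & (line - 1)) == 0);  /* must be a power of two */
	if (len == 0)
		return 0;
	first = start & ~(uint64_t)(line - 1);          /* round start down to a line */
	end = start + len;
	return (end - first + line - 1) / line;         /* round up to whole lines */
}
```

Stepping a cache maintenance loop by this size is why relocate_kernel.S needs access to the macro.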

Patches 2 & 3 rework the ARM64 hcall mechanism to give the CPU reset routines
the ability to switch exception levels from EL1 to EL2 for kernels that were
entered in EL2.
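As the patch titles suggest, the reworked hcalls select the call through the HVC instruction's 16-bit immediate. When an HVC from AArch64 traps to EL2, ESR_EL2 reports exception class 0x16 in bits [31:26] and the immediate in ISS bits [15:0], so the handler can dispatch on it. The C model below sketches that decode; the HVC_EXAMPLE_* numbers and decode_hcall() are placeholders, not the values or code defined by the series (which adds names such as HVC_CALL_FUNC).

```c
#include <assert.h>
#include <stdint.h>

#define ESR_EC_SHIFT   26
#define ESR_EC_MASK    0x3fu
#define ESR_EC_HVC64   0x16u      /* HVC executed in AArch64 state */
#define ESR_IMM16_MASK 0xffffu    /* HVC #imm16 is reported in ISS[15:0] */

/* Hypothetical hcall numbers, for illustration only. */
enum hcall_example {
	HVC_EXAMPLE_UNKNOWN     = -1,
	HVC_EXAMPLE_GET_VECTORS = 0,
	HVC_EXAMPLE_SET_VECTORS = 1,
	HVC_EXAMPLE_CALL_FUNC   = 2,
};

/* Return the hcall selected by the HVC immediate, or UNKNOWN. */
static int decode_hcall(uint32_t esr_el2)
{
	uint32_t ec = (esr_el2 >> ESR_EC_SHIFT) & ESR_EC_MASK;
	uint32_t imm = esr_el2 & ESR_IMM16_MASK;

	if (ec != ESR_EC_HVC64)
		return HVC_EXAMPLE_UNKNOWN;     /* trap was not an HVC */

	switch (imm) {
	case HVC_EXAMPLE_GET_VECTORS:
	case HVC_EXAMPLE_SET_VECTORS:
	case HVC_EXAMPLE_CALL_FUNC:
		return (int)imm;                /* the immediate selects the call */
	default:
		return HVC_EXAMPLE_UNKNOWN;     /* unrecognized immediate */
	}
}
```

Encoding the call number in the instruction leaves the general-purpose registers free for arguments, which is presumably what a call such as HVC_CALL_FUNC needs.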

Patch 4 allows KVM to handle a CPU reset.

Patches 5-7 add back the ARM64 CPU reset support that was recently removed from
the kernel.

Patches 8-10 add the actual kexec support.

Patches 11-16 add kdump support.

Please consider all patches for inclusion.

[1]  https://git.kernel.org/cgit/linux/kernel/git/geoff/linux-kexec.git
[2]  https://git.kernel.org/cgit/linux/kernel/git/geoff/kexec-tools.git

-Geoff

The following changes since commit 7379047d5585187d1288486d4627873170d0005a:

  Linux 4.3-rc6 (2015-10-18 16:08:42 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/geoff/linux-kexec.git kexec-v10

for you to fetch changes up to 4cf0c03d6cd1cb4826bb5df679fbcdaf80be0b1c:

  arm64: kdump: relax BUG_ON() if more than one cpus are still active (2015-10-19 15:51:52 -0700)

----------------------------------------------------------------
AKASHI Takahiro (7):
      arm64: kvm: allows kvm cpu hotplug
      arm64: kdump: reserve memory for crash dump kernel
      arm64: kdump: implement machine_crash_shutdown()
      arm64: kdump: add kdump support
      arm64: kdump: update a kernel doc
      arm64: kdump: enable kdump in the arm64 defconfig
      arm64: kdump: relax BUG_ON() if more than one cpus are still active

Geoff Levand (9):
      arm64: Fold proc-macros.S into assembler.h
      arm64: Convert hcalls to use HVC immediate value
      arm64: Add new hcall HVC_CALL_FUNC
      arm64: Add back cpu_reset routines
      arm64: Add EL2 switch to cpu_reset
      Revert "arm64: remove dead code"
      arm64/kexec: Add core kexec support
      arm64/kexec: Add pr_devel output
      arm64/kexec: Enable kexec in the arm64 defconfig

 Documentation/kdump/kdump.txt       |  29 ++++-
 arch/arm/include/asm/kvm_host.h     |  10 +-
 arch/arm/include/asm/kvm_mmu.h      |   1 +
 arch/arm/kvm/arm.c                  |  58 +++------
 arch/arm/kvm/mmu.c                  |   5 +
 arch/arm64/Kconfig                  |  22 ++++
 arch/arm64/configs/defconfig        |   2 +
 arch/arm64/include/asm/assembler.h  |  48 +++++++-
 arch/arm64/include/asm/kexec.h      |  80 ++++++++++++
 arch/arm64/include/asm/kvm_host.h   |  16 ++-
 arch/arm64/include/asm/kvm_mmu.h    |   1 +
 arch/arm64/include/asm/mmu.h        |   1 +
 arch/arm64/include/asm/virt.h       |  49 ++++++++
 arch/arm64/kernel/Makefile          |   3 +
 arch/arm64/kernel/cpu-reset.S       |  76 ++++++++++++
 arch/arm64/kernel/cpu-reset.h       |  20 +++
 arch/arm64/kernel/crash_dump.c      |  71 +++++++++++
 arch/arm64/kernel/head.S            |   1 -
 arch/arm64/kernel/hyp-stub.S        |  43 +++++--
 arch/arm64/kernel/machine_kexec.c   | 237 ++++++++++++++++++++++++++++++++++++
 arch/arm64/kernel/relocate_kernel.S | 163 +++++++++++++++++++++++++
 arch/arm64/kernel/setup.c           |   7 +-
 arch/arm64/kernel/smp.c             |  16 ++-
 arch/arm64/kvm/hyp-init.S           |  34 +++++-
 arch/arm64/kvm/hyp.S                |  44 +++++--
 arch/arm64/mm/cache.S               |   2 -
 arch/arm64/mm/init.c                |  83 +++++++++++++
 arch/arm64/mm/mmu.c                 |  11 ++
 arch/arm64/mm/proc-macros.S         |  64 ----------
 arch/arm64/mm/proc.S                |   3 -
 include/uapi/linux/kexec.h          |   1 +
 31 files changed, 1063 insertions(+), 138 deletions(-)
 create mode 100644 arch/arm64/include/asm/kexec.h
 create mode 100644 arch/arm64/kernel/cpu-reset.S
 create mode 100644 arch/arm64/kernel/cpu-reset.h
 create mode 100644 arch/arm64/kernel/crash_dump.c
 create mode 100644 arch/arm64/kernel/machine_kexec.c
 create mode 100644 arch/arm64/kernel/relocate_kernel.S
 delete mode 100644 arch/arm64/mm/proc-macros.S

Comments

Pratyush Anand Oct. 20, 2015, 8:56 a.m. UTC | #1
Hi Geoff,

Thanks for the patches.

On 19/10/2015:11:38:53 PM, Geoff Levand wrote:
> +static void soft_restart(unsigned long addr)
> +{
> +	setup_mm_for_reboot();
> +	cpu_soft_restart(virt_to_phys(cpu_reset), addr,
> +		is_hyp_mode_available());

So now we do not flush the cache for any memory region. Shouldn't we still flush
at least the kernel and purgatory segments?

kexec-tools loads a new kernel and purgatory executable. Some of those bits
might still be only in the D-cache, and we disable the D-cache before control is
passed to the purgatory binary. Purgatory and some initial part of the kernel code
are executed with the D-cache disabled. So we might land in a situation where the
correct code is not executed while the D-cache is disabled, no?

~Pratyush
Geoff Levand Oct. 20, 2015, 5:19 p.m. UTC | #2
Hi,

On Tue, 2015-10-20 at 14:26 +0530, Pratyush Anand wrote:
> On 19/10/2015:11:38:53 PM, Geoff Levand wrote:
> > +static void soft_restart(unsigned long addr)
> > +{
> > +	setup_mm_for_reboot();
> > +	cpu_soft_restart(virt_to_phys(cpu_reset), addr,
> > +		is_hyp_mode_available());
> 
> So now we do not flush the cache for any memory region. Shouldn't we still flush
> at least the kernel and purgatory segments?

Relevant pages of the kexec list are flushed in the code following the comment
'Invalidate dest page to PoC' of the arm64_relocate_new_kernel routine:

 The dcache is turned off
 The page is invalidated to PoC
 The new page is written

-Geoff
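Geoff's three steps can be modeled in ordinary C. The kexec list is a chain of page addresses whose low bits carry the IND_* flags from include/linux/kexec.h (IND_DESTINATION 0x1, IND_INDIRECTION 0x2, IND_DONE 0x4, IND_SOURCE 0x8). The sketch below walks such a list in user space; the 64-byte simulated page, relocate_pages() and the invalidate_to_poc() hook (standing in for the per-line cache invalidate) are all assumptions, and the real routine is assembly that runs with the D-cache already off.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SIM_PAGE_SIZE   64u                          /* assumption: tiny simulated page */
#define SIM_PAGE_MASK   (~(uintptr_t)(SIM_PAGE_SIZE - 1))
#define IND_DESTINATION 0x1u
#define IND_INDIRECTION 0x2u
#define IND_DONE        0x4u
#define IND_SOURCE      0x8u

static unsigned int invalidated_pages;               /* counts hook calls */

/* Hypothetical stand-in for "invalidate dest page to PoC". */
static void invalidate_to_poc(void *page)
{
	(void)page;
	invalidated_pages++;
}

/* Walk a kexec-style entry list; 'head' must be an IND_INDIRECTION entry. */
static void relocate_pages(uintptr_t head)
{
	uintptr_t entry = head;
	const uintptr_t *ptr = NULL;
	unsigned char *dest = NULL;

	assert(head & IND_INDIRECTION);
	while (!(entry & IND_DONE)) {
		uintptr_t addr = entry & SIM_PAGE_MASK;

		if (entry & IND_DESTINATION) {
			dest = (unsigned char *)addr;            /* where pages go next */
		} else if (entry & IND_INDIRECTION) {
			ptr = (const uintptr_t *)addr;           /* next page of entries */
		} else if (entry & IND_SOURCE) {
			assert(dest != NULL);
			invalidate_to_poc(dest);                 /* invalidate dest page */
			memcpy(dest, (const void *)addr, SIM_PAGE_SIZE); /* write new page */
			dest += SIM_PAGE_SIZE;
		}
		entry = *ptr++;
	}
}
```

The ordering matters because a dirty line left in the cache for a destination page could otherwise be written back later over data stored with the cache off.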
Dave Young Oct. 22, 2015, 3:25 a.m. UTC | #3
Hi, AKASHI,

On 10/19/15 at 11:38pm, Geoff Levand wrote:
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> 
> On crash dump kernel, all the information about primary kernel's core
> image is available in elf core header specified by "elfcorehdr=" boot
> parameter. reserve_elfcorehdr() will set aside the region to avoid any
> corruption by crash dump kernel.
> 
> Crash dump kernel will access the system memory of primary kernel via
> copy_oldmem_page(), which reads one page by ioremap'ing it since it does
> not reside in linear mapping on crash dump kernel.
> Please note that we should add "mem=X[MG]" boot parameter to limit the
> memory size and avoid the following assertion at ioremap():
> 	if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr))))
> 		return NULL;
> when accessing any pages beyond the usable memories of crash dump kernel.

How does kexec-tools pass usable memory ranges to the kernel? Using the dtb?
Passing an extra mem=X sounds odd in the design. The kdump kernel should get
usable ranges and handle the limit itself instead of depending on an external
kernel parameter.

Thanks
Dave
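The reason for the mem= requirement follows from the quoted check: on arm64, ioremap() refuses a physical address whose pfn is already valid RAM for the running kernel. If the crash dump kernel believes it owns all of the primary kernel's RAM, copy_oldmem_page() cannot ioremap those pages; limiting its view with mem= keeps them outside pfn_valid(). The model below illustrates this with a table of RAM ranges. sim_pfn_valid() and sim_ioremap_allowed() are hypothetical names, 4 KiB pages are assumed, and the addresses in the usage note are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_PAGE_SHIFT 12                   /* assumption: 4 KiB pages */

struct mem_range { uint64_t base, size; };  /* physical range in bytes */

/* Hypothetical stand-in for pfn_valid(): does this kernel manage the pfn as RAM? */
static bool sim_pfn_valid(const struct mem_range *ram, size_t n, uint64_t pfn)
{
	uint64_t phys = pfn << SIM_PAGE_SHIFT;

	for (size_t i = 0; i < n; i++)
		if (phys >= ram[i].base && phys - ram[i].base < ram[i].size)
			return true;
	return false;
}

/* Mirrors the quoted WARN_ON(): refuse to ioremap a page that is already valid RAM. */
static bool sim_ioremap_allowed(const struct mem_range *ram, size_t n, uint64_t phys)
{
	return !sim_pfn_valid(ram, n, phys >> SIM_PAGE_SHIFT);
}
```

For example, if the crash kernel sees all 2 GiB at 0x80000000 as RAM, a request for 0x80001000 is refused; if its RAM is limited to 0xb0000000-0xbfffffff, the same request is allowed.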
AKASHI Takahiro Oct. 22, 2015, 4:29 a.m. UTC | #4
Hi Dave,

Thank you for your comment.

On 10/22/2015 12:25 PM, Dave Young wrote:
> Hi, AKASHI,
>
> How does kexec-tools pass usable memory ranges to the kernel? Using the dtb?
> Passing an extra mem=X sounds odd in the design. The kdump kernel should get
> usable ranges and handle the limit itself instead of depending on an external
> kernel parameter.

Well, regarding "depending on an external kernel parameter,"
- this limitation ("mem=") is compatible with the arm(32) implementation,
   although it is not clearly described in the kernel's Documentation/kdump/kdump.txt.
- the "elfcorehdr" kernel parameter is mandatory on x86 as well as on arm/arm64.
   The parameter is explicitly generated and added by kexec-tools.

Am I missing your point?

Thanks,
-Takahiro AKASHI

> Thanks
> Dave
>
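For reference, elfcorehdr= takes the form elfcorehdr=[size[KMG]@]offset[KMG] (see the kernel's kernel-parameters documentation), and mem= uses the same K/M/G suffix convention. The kernel parses both with memparse(); the standalone sketch below imitates that parsing in user space. parse_size() and parse_elfcorehdr() are hypothetical helpers, and only the K, M and G suffixes are handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* memparse()-style number: strtoull() with base 0, then an optional K/M/G suffix. */
static uint64_t parse_size(const char *s, char **end)
{
	uint64_t v = strtoull(s, end, 0);

	switch (**end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;                    /* consume the suffix */
		break;
	default:
		break;
	}
	return v;
}

/*
 * Hypothetical parser for the value of elfcorehdr=[size[KMG]@]offset[KMG].
 * Returns false if no number could be parsed at all.
 */
static bool parse_elfcorehdr(const char *arg, uint64_t *addr, uint64_t *size)
{
	char *end;

	assert(arg && addr && size);
	*size = 0;                           /* stays 0 unless "size@" is given */
	*addr = parse_size(arg, &end);
	if (*end == '@') {
		*size = *addr;               /* the first number was the size */
		*addr = parse_size(end + 1, &end);
	}
	return end > arg;
}
```

For example, "0x1000@0xbff00000" yields a 4 KiB header at 0xbff00000, and a bare "0xbff00000" gives the address with the size left at 0.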
Dave Young Oct. 22, 2015, 5:15 a.m. UTC | #5
On 10/22/15 at 01:29pm, AKASHI Takahiro wrote:
> Hi Dave,
> 
> Thank you for your comment.
> 
> Well, regarding "depending on an external kernel parameter,"
> - this limitation ("mem=") is compatible with the arm(32) implementation,
>   although it is not clearly described in the kernel's Documentation/kdump/kdump.txt.
> - the "elfcorehdr" kernel parameter is mandatory on x86 as well as on arm/arm64.
>   The parameter is explicitly generated and added by kexec-tools.
> 
> Am I missing your point?

ARM previously used the ATAG_MEM tag to pass the memory the kernel uses; with
dtb, Booting.txt says: "The boot loader must pass at a minimum the size and
location of the system memory".

arm64 booting.txt does mention the dtb, but without the above sentence.

So if you are using the dtb to pass memory, I think the extra mem= should not
be necessary unless there are other limitations that prevent the dtb from being
used.

One thing I'm confused about: mem= only passes the memory size, so where do you
pass the start address? What if there are multiple sections, such as some
reserved ranges the 2nd kernel also needs?

Thanks
Dave
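Dave's point about the missing start address can be illustrated by how a size-only limit behaves. A minimal sketch, assuming mem= behaves like memblock_enforce_memory_limit(): memory is kept from the lowest address upward until the requested total is reached and everything above is dropped, so only a size is needed. apply_mem_limit() and struct region are hypothetical names, and the regions must be sorted by address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct region { uint64_t base, size; };   /* sorted by base, non-overlapping */

/*
 * Hypothetical model of a size-only memory limit (limit in bytes): keep
 * regions in ascending order until 'limit' bytes are accumulated, trim
 * the region that crosses the limit, and drop the rest.
 * Returns the new number of regions.
 */
static size_t apply_mem_limit(struct region *r, size_t n, uint64_t limit)
{
	uint64_t remaining = limit;

	assert(n == 0 || r != NULL);
	for (size_t i = 0; i < n; i++) {
		if (remaining == 0)
			return i;                 /* everything from here on is dropped */
		if (r[i].size > remaining)
			r[i].size = remaining;    /* trim the crossing region */
		remaining -= r[i].size;
	}
	return n;
}
```

With a 1 GiB bank at 0x80000000 and a 2 GiB bank at 0x880000000, a 1.5 GiB limit keeps the first bank whole and 512 MiB of the second; the start of RAM is simply wherever the lowest bank begins.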
AKASHI Takahiro Oct. 22, 2015, 9:57 a.m. UTC | #6
(added Ard to Cc.)

On 10/22/2015 02:15 PM, Dave Young wrote:
> ARM previously used the ATAG_MEM tag to pass the memory the kernel uses; with
> dtb, Booting.txt says: "The boot loader must pass at a minimum the size and
> location of the system memory".
>
> arm64 booting.txt does mention the dtb, but without the above sentence.
>
> So if you are using the dtb to pass memory, I think the extra mem= should not
> be necessary unless there are other limitations that prevent the dtb from being
> used.

I would expect comments from the arm64 maintainers here.

In my old implementation, I added "usablemem" attributes, along with "reg", to
"memory" nodes in the dtb to specify the usable memory region on the crash dump kernel.

But I removed this feature partly because, on a UEFI system, UEFI might pass
no memory information in the dtb.

> One thing I'm confused about: mem= only passes the memory size, so where do you
> pass the start address?

In the current arm64 implementation, any regions below the start address are
ignored as system RAM.

> What if there are multiple sections, such as some
> reserved ranges the 2nd kernel also needs?

My patch uses only a single contiguous region of memory as system RAM.
One exception I have noticed is UEFI's runtime data, which is ioremap'ed separately.

Please let me know if there is any other case that should be supported.

Thanks,
-Takahiro AKASHI

> Thanks
> Dave
>
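The "usablemem" idea can be pictured as clipping each memory node's reg ranges to the window the crash dump kernel may use. The sketch below does that intersection; struct range, clip_to_usable() and the single usable window are assumptions for illustration and do not reflect the exact property format of the earlier patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range { uint64_t base, size; };

/*
 * Hypothetical helper: intersect each "reg" range with one usable window
 * [win.base, win.base + win.size) and write the non-empty pieces to 'out'.
 * Returns how many ranges were written (at most n).
 */
static size_t clip_to_usable(const struct range *reg, size_t n,
			     struct range win, struct range *out)
{
	uint64_t wend = win.base + win.size;
	size_t count = 0;

	assert(n == 0 || (reg != NULL && out != NULL));
	for (size_t i = 0; i < n; i++) {
		uint64_t rend = reg[i].base + reg[i].size;
		uint64_t start = reg[i].base > win.base ? reg[i].base : win.base;
		uint64_t end = rend < wend ? rend : wend;

		if (start < end) {                /* keep only the overlap */
			out[count].base = start;
			out[count].size = end - start;
			count++;
		}
	}
	return count;
}
```

Everything outside the window, including memory below it, stays invisible to the crash dump kernel, which matches the single contiguous region described above.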
Pratyush Anand Oct. 23, 2015, 7:29 a.m. UTC | #7
On 20/10/2015:10:19:25 AM, Geoff Levand wrote:
> Hi,
> 
> On Tue, 2015-10-20 at 14:26 +0530, Pratyush Anand wrote:
> > On 19/10/2015:11:38:53 PM, Geoff Levand wrote:
> > > +static void soft_restart(unsigned long addr)
> > > +{
> > > +	setup_mm_for_reboot();
> > > +	cpu_soft_restart(virt_to_phys(cpu_reset), addr,
> > > +		is_hyp_mode_available());
> > 
> > So now we do not flush the cache for any memory region. Shouldn't we still flush
> > at least the kernel and purgatory segments?
> 
> Relevant pages of the kexec list are flushed in the code following the comment
> 'Invalidate dest page to PoC' of the arm64_relocate_new_kernel routine:
> 
>  The dcache is turned off
>  The page is invalidated to PoC
>  The new page is written

Thanks for clarifying it.

I tested your kexec-v10.2 on Mustang.

Tested-by: Pratyush Anand <panand@redhat.com>
Dave Young Oct. 23, 2015, 9:50 a.m. UTC | #8
On 10/22/15 at 06:57pm, AKASHI Takahiro wrote:
> (added Ard to Cc.)
> 
> I would expect comments from the arm64 maintainers here.
> 
> In my old implementation, I added "usablemem" attributes, along with "reg", to
> "memory" nodes in the dtb to specify the usable memory region on the crash dump kernel.
> 
> But I removed this feature partly because, on a UEFI system, UEFI might pass
> no memory information in the dtb.

If that is the case, there must be somewhere else one can pass memory information
to the kernel; should booting.txt be updated?

kexec, as a boot loader, needs to use the same method as the 1st kernel's boot loader.

> 
> >One thing I'm confused about: mem= only passes the memory size, so where do you
> >pass the start address?
> 
> In the current arm64 implementation, any regions below the start address are
> ignored as system RAM.
> 
> >What if there are multiple sections, such as some
> >reserved ranges the 2nd kernel also needs?
> 
> My patch uses only a single contiguous region of memory as system RAM.
> One exception I have noticed is UEFI's runtime data, which is ioremap'ed separately.
> 
> Please let me know if there is any other case that should be supported.

For example, the ELF headers range: you reserve it in the kdump kernel code,
but kexec-tools can do that early if it can provide all memory info to the 2nd
kernel. The same goes for marking all the memory ranges the 1st kernel used as reserved.

Thanks
Dave
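Dave's alternative, having kexec-tools hand the second kernel a complete memory map with ranges such as the ELF headers already marked reserved, comes down to carving reserved ranges out of usable ones before boot. The sketch below subtracts one reserved range from one usable range, leaving zero, one or two pieces; struct span and subtract_span() are hypothetical and are not taken from kexec-tools.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct span { uint64_t start, end; };   /* half-open [start, end) */

/* Hypothetical helper: remove 'rsv' from 'avail'; returns pieces written to out[0..1]. */
static size_t subtract_span(struct span avail, struct span rsv, struct span out[2])
{
	size_t n = 0;

	assert(avail.start <= avail.end && rsv.start <= rsv.end);
	if (rsv.end <= avail.start || rsv.start >= avail.end) {
		out[n++] = avail;                                   /* no overlap */
		return n;
	}
	if (rsv.start > avail.start)
		out[n++] = (struct span){ avail.start, rsv.start }; /* piece below */
	if (rsv.end < avail.end)
		out[n++] = (struct span){ rsv.end, avail.end };     /* piece above */
	return n;
}
```

Applied once per reserved range, this yields the map the second kernel would receive, so it would not need to reserve those ranges itself.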
AKASHI Takahiro Oct. 29, 2015, 5:55 a.m. UTC | #9
Dave,

On 10/23/2015 06:50 PM, Dave Young wrote:
> For example, the ELF headers range: you reserve it in the kdump kernel code,
> but kexec-tools can do that early if it can provide all memory info to the 2nd
> kernel. The same goes for marking all the memory ranges the 1st kernel used as reserved.

It seems to me that the issue you mentioned here is totally independent
of the "mem=" issue, isn't it?
(And "elfcorehdr=" is a common way for the crash dump kernel to know the region.)

-Takahiro AKASHI

> Thanks
> Dave
>
Dave Young Oct. 29, 2015, 6:40 a.m. UTC | #10
Hi, AKASHI

On 10/29/15 at 02:55pm, AKASHI Takahiro wrote:
> Dave,
> 
> On 10/23/2015 06:50 PM, Dave Young wrote:
> >For example, the ELF headers range: you reserve it in the kdump kernel code,
> >but kexec-tools can do that early if it can provide all memory info to the 2nd
> >kernel. The same goes for marking all the memory ranges the 1st kernel used as reserved.
> 
> It seems to me that the issue you mentioned here is totally independent
> of the "mem=" issue, isn't it?
> (And "elfcorehdr=" is a common way for the crash dump kernel to know the region.)

Hmm, I was not talking about elfcorehdr=; I mean the code that reserves the
memory ranges elfcorehdr is using.

Thanks
Dave

> 
> -Takahiro AKASHI
> 
> >Thanks
> >Dave
> >
AKASHI Takahiro Oct. 29, 2015, 6:53 a.m. UTC | #11
On 10/29/2015 03:40 PM, Dave Young wrote:
> Hi, AKASHI
>
> On 10/29/15 at 02:55pm, AKASHI Takahiro wrote:
>> It seems to me that the issue you mentioned here is totally independent
>> of the "mem=" issue, isn't it?
>> (And "elfcorehdr=" is a common way for the crash dump kernel to know the region.)
>
> Hmm, I was not talking about elfcorehdr=; I mean the code that reserves the
> memory ranges elfcorehdr is using.

So how does it relate to the "mem=" issue?

-Takahiro AKASHI

> Thanks
> Dave
>
>>
>> -Takahiro AKASHI
>>
>>> Thanks
>>> Dave
>>>
Dave Young Oct. 29, 2015, 7:01 a.m. UTC | #12
On 10/29/15 at 03:53pm, AKASHI Takahiro wrote:
> On 10/29/2015 03:40 PM, Dave Young wrote:
> >Hi, AKASHI
> >
> >On 10/29/15 at 02:55pm, AKASHI Takahiro wrote:
> >>Dave,
> >>
> >>On 10/23/2015 06:50 PM, Dave Young wrote:
> >>>On 10/22/15 at 06:57pm, AKASHI Takahiro wrote:
> >>>>(added Ard to Cc.)
> >>>>
> >>>>On 10/22/2015 02:15 PM, Dave Young wrote:
> >>>>>On 10/22/15 at 01:29pm, AKASHI Takahiro wrote:
> >>>>>>Hi Dave,
> >>>>>>
> >>>>>>Thank you for your comment.
> >>>>>>
> >>>>>>On 10/22/2015 12:25 PM, Dave Young wrote:
> >>>>>>>Hi, AKASHI,
> >>>>>>>
> >>>>>>>On 10/19/15 at 11:38pm, Geoff Levand wrote:
> >>>>>>>>From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> >>>>>>>>
> >>>>>>>>On the crash dump kernel, all the information about the primary kernel's
> >>>>>>>>core image is available in the elf core header specified by the
> >>>>>>>>"elfcorehdr=" boot parameter. reserve_elfcorehdr() will set aside the
> >>>>>>>>region to avoid any corruption by the crash dump kernel.
> >>>>>>>>
> >>>>>>>>The crash dump kernel will access the system memory of the primary kernel
> >>>>>>>>via copy_oldmem_page(), which reads one page by ioremap'ing it, since it
> >>>>>>>>does not reside in the linear mapping of the crash dump kernel.
> >>>>>>>>Please note that we should add a "mem=X[MG]" boot parameter to limit the
> >>>>>>>>memory size and avoid the following assertion in ioremap():
> >>>>>>>>	if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr))))
> >>>>>>>>		return NULL;
> >>>>>>>>when accessing any pages beyond the usable memory of the crash dump kernel.
> >>>>>>>
> >>>>>>>How does kexec-tools pass usable memory ranges to the kernel? Using dtb?
> >>>>>>>Passing an extra mem=X sounds odd in the design. The kdump kernel should get
> >>>>>>>usable ranges and handle the limit better than depending on an external
> >>>>>>>kernel param.
> >>>>>>
> >>>>>>Well, regarding "depending on an external kernel param,"
> >>>>>>- this limitation ("mem=") is compatible with the arm(32) implementation, although
> >>>>>>   it is not clearly described in kernel's Documentation/kdump/kdump.txt.
> >>>>>>- "elfcorehdr" kernel parameter is mandatory on x86 as well as on arm/arm64.
> >>>>>>   The parameter is explicitly generated and added by kexec-tools.
> >>>>>>
> >>>>>>Am I missing your point?
> >>>>>
> >>>>>Arm previously used the atag_mem tag to describe the memory the kernel uses;
> >>>>>with dtb, Booting.txt says: "The boot loader must pass at a minimum the size
> >>>>>and location of the system memory".
> >>>>>
> >>>>>The arm64 booting.txt does mention dtb, but without the above sentence.
> >>>>>
> >>>>>So if you are using dtb to pass memory, I think the extra mem= should not be
> >>>>>necessary unless there are other limitations that prevent dtb from being used.
> >>>>
> >>>>I would expect comments from arm64 maintainers here.
> >>>>
> >>>>In my old implementation, I added "usablemem" attributes, along with "reg," to
> >>>>"memory" nodes in dtb to specify the usable memory region on crash dump kernel.
> >>>>
> >>>>But I removed this feature partly because, on a uefi system, uefi might pass
> >>>>no memory information in dtb.
> >>>
> >>>If this is the case, there must be some other way to pass memory information
> >>>to the kernel; should booting.txt be updated?
> >>>
> >>>kexec, as a boot loader, needs to use the same method as the 1st kernel's boot loader.
> >>>
> >>>>
> >>>>>One thing I'm confused about is that mem= only passes the memory size; where
> >>>>>do you pass the start addresses?
> >>>>
> >>>>In the current arm64 implementation, any regions below the start address will
> >>>>be ignored as system ram.
> >>>>
> >>>>>What if there are multiple sections, such as some reserved
> >>>>>ranges the 2nd kernel also needs?
> >>>>
> >>>>My patch utilizes only a single contiguous region of memory as system ram.
> >>>>One exception that I notice is uefi's runtime data. They will be ioremap'ed separately.
> >>>>
> >>>>Please let me know if there is any other case that should be supported.
> >>>
> >>>For example, the elf headers range: you reserve it in the kdump kernel code,
> >>>but kexec-tools could do that earlier if it could provide all memory info to
> >>>the 2nd kernel. Ditto for marking all the memory ranges the 1st kernel used
> >>>as reserved.
> >>
> >>It seems to me that the issue you mentioned here is totally independent
> >>from "mem=" issue, isn't it?
> >>(and "elfcorehdr=" is a common way for crash dump kernel to know the region.)
> >
> >Hmm, I was not talking about elfcorehdr=; I mean the code that reserves the
> >memory ranges elfcorehdr is using.
> 
> So how does it relate to "mem=" issue?

It is just an example of something kexec could pass to the kernel, along with the
usable mem range, via some interface like the dtb blob or another interface.

> 
> -Takahiro AKASHI
> 
> >Thanks
> >Dave
> >
> >>
> >>-Takahiro AKASHI
> >>
> >>>Thanks
> >>>Dave
> >>>
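To make Dave's alternative concrete: a sketch that assumes kexec-tools hands the dump kernel an explicit list of usable ranges, plus ranges that must stay reserved (such as the one holding the ELF core header), rather than a single mem= size. struct phys_range and may_allocate() are hypothetical; nothing like them exists in the patches under discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical [base, base + size) physical range. */
struct phys_range {
	uint64_t base;
	uint64_t size;
};

static bool range_contains(const struct phys_range *r, uint64_t addr)
{
	return addr >= r->base && addr - r->base < r->size;
}

static bool in_any(const struct phys_range *v, size_t n, uint64_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (range_contains(&v[i], addr))
			return true;
	return false;
}

/* The dump kernel may allocate from an address only if it lies in one of
 * the usable ranges and in none of the reserved ones. */
static bool may_allocate(const struct phys_range *usable, size_t nusable,
			 const struct phys_range *reserved, size_t nreserved,
			 uint64_t addr)
{
	return in_any(usable, nusable, addr) &&
	       !in_any(reserved, nreserved, addr);
}
```

Because the reserved ranges would travel with the usable ones, the dump kernel would not need separate code to carve out the elfcorehdr region, which appears to be Dave's point.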
James Morse Oct. 30, 2015, 4:29 p.m. UTC | #13
Hi Geoff,

On 20/10/15 00:38, Geoff Levand wrote:
> Add three new files, kexec.h, machine_kexec.c and relocate_kernel.S to the
> arm64 architecture that add support for the kexec re-boot mechanism
> (CONFIG_KEXEC) on arm64 platforms.
> 
> Signed-off-by: Geoff Levand <geoff@infradead.org>
> ---
>  arch/arm64/Kconfig                  |  10 +++
>  arch/arm64/include/asm/kexec.h      |  48 +++++++++++
>  arch/arm64/kernel/Makefile          |   2 +
>  arch/arm64/kernel/cpu-reset.S       |   2 +-
>  arch/arm64/kernel/machine_kexec.c   | 141 +++++++++++++++++++++++++++++++
>  arch/arm64/kernel/relocate_kernel.S | 163 ++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/kexec.h          |   1 +
>  7 files changed, 366 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/include/asm/kexec.h
>  create mode 100644 arch/arm64/kernel/machine_kexec.c
>  create mode 100644 arch/arm64/kernel/relocate_kernel.S
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 07d1811..73e8e31 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -491,6 +491,16 @@ config SECCOMP
>  	  and the task is only allowed to execute a few safe syscalls
>  	  defined by each seccomp mode.
>  
> +config KEXEC
> +	depends on (!SMP || PM_SLEEP_SMP)

Commit 4b3dc9679cf7 got rid of '!SMP'.


> +	select KEXEC_CORE
> +	bool "kexec system call"
> +	---help---
> +	  kexec is a system call that implements the ability to shutdown your
> +	  current kernel, and to start another kernel.  It is like a reboot
> +	  but it is independent of the system firmware.   And like a reboot
> +	  you can start any kernel with it, not just Linux.
> +
>  config XEN_DOM0
>  	def_bool y
>  	depends on XEN
> diff --git a/arch/arm64/kernel/cpu-reset.S b/arch/arm64/kernel/cpu-reset.S
> index ffc9e385e..7cc7f56 100644
> --- a/arch/arm64/kernel/cpu-reset.S
> +++ b/arch/arm64/kernel/cpu-reset.S
> @@ -3,7 +3,7 @@
>   *
>   * Copyright (C) 2001 Deep Blue Solutions Ltd.
>   * Copyright (C) 2012 ARM Ltd.
> - * Copyright (C) 2015 Huawei Futurewei Technologies.
> + * Copyright (C) Huawei Futurewei Technologies.

Move this hunk into the patch that adds the file?


>   *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License version 2 as
> diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
> new file mode 100644
> index 0000000..7b07a16
> --- /dev/null
> +++ b/arch/arm64/kernel/relocate_kernel.S
> @@ -0,0 +1,163 @@
> +/*
> + * kexec for arm64
> + *
> + * Copyright (C) Linaro.
> + * Copyright (C) Huawei Futurewei Technologies.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/kexec.h>
> +
> +#include <asm/assembler.h>
> +#include <asm/kexec.h>
> +#include <asm/memory.h>
> +#include <asm/page.h>
> +
> +
> +/*
> + * arm64_relocate_new_kernel - Put a 2nd stage kernel image in place and boot it.
> + *
> + * The memory that the old kernel occupies may be overwritten when copying the
> + * new image to its final location.  To assure that the
> + * arm64_relocate_new_kernel routine which does that copy is not overwritten,
> + * all code and data needed by arm64_relocate_new_kernel must be between the
> + * symbols arm64_relocate_new_kernel and arm64_relocate_new_kernel_end.  The
> + * machine_kexec() routine will copy arm64_relocate_new_kernel to the kexec
> + * control_code_page, a special page which has been set up to be preserved
> + * during the copy operation.
> + */
> +.globl arm64_relocate_new_kernel
> +arm64_relocate_new_kernel:
> +
> +	/* Setup the list loop variables. */
> +	ldr	x18, .Lkimage_head		/* x18 = list entry */
> +	dcache_line_size x17, x0		/* x17 = dcache line size */
> +	mov	x16, xzr			/* x16 = segment start */
> +	mov	x15, xzr			/* x15 = entry ptr */
> +	mov	x14, xzr			/* x14 = copy dest */
> +
> +	/* Check if the new image needs relocation. */
> +	cbz	x18, .Ldone
> +	tbnz	x18, IND_DONE_BIT, .Ldone
> +
> +.Lloop:
> +	and	x13, x18, PAGE_MASK		/* x13 = addr */
> +
> +	/* Test the entry flags. */
> +.Ltest_source:
> +	tbz	x18, IND_SOURCE_BIT, .Ltest_indirection
> +
> +	mov x20, x14				/*  x20 = copy dest */
> +	mov x21, x13				/*  x21 = copy src */
> +
> +	/* Invalidate dest page to PoC. */
> +	mov	x0, x20
> +	add	x19, x0, #PAGE_SIZE
> +	sub	x1, x17, #1
> +	bic	x0, x0, x1
> +1:	dc	ivac, x0
> +	add	x0, x0, x17
> +	cmp	x0, x19
> +	b.lo	1b
> +	dsb	sy

If I've followed all this through properly:

With KVM - mmu+caches are configured, but then disabled by 'kvm: allows kvm
cpu hotplug'. This 'arm64_relocate_new_kernel' function then runs at EL2
with M=0, C=0, I=0.

Without KVM - when there is no user of EL2, the mmu+caches are left in
whatever state the bootloader (or efi stub) left them in. From
Documentation/arm64/booting.txt:
> Instruction cache may be on or off.
and
> System caches which respect the architected cache maintenance by VA
> operations must be configured and may be enabled.

So 'arm64_relocate_new_kernel' function could run at EL2 with M=0, C=?, I=?.

I think this means you can't guarantee anything you are copying below
actually makes it through the caches - booting secondary processors may get
stale values.

The EFI stub disables the M and C bits when booted at EL2 with uefi - but
it leaves the instruction cache enabled. You only clean the
reboot_code_buffer from the data cache, so there may be stale values in the
instruction cache.

I think you need to disable the i-cache at EL1. If you jump to EL2, I think
you need to disable the I/C bits there too - as you can't rely on the code
in 'kvm: allows kvm cpu hotplug' to do this in a non-kvm case.

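The 'Invalidate dest page to PoC' loop quoted above issues one dc ivac per D-cache line of the destination page. A host-side sketch of that arithmetic, assuming dcache_line_size decodes CTR_EL0.DminLine (bits [19:16], log2 of the line size in 4-byte words) as asm/assembler.h does; the helper names are hypothetical and no real cache is touched.

```c
#include <assert.h>
#include <stdint.h>

/* Decode the smallest D-cache line size from a CTR_EL0 value:
 * DminLine (bits [19:16]) is log2 of the line size in 4-byte words. */
static uint64_t dcache_line_size_from_ctr(uint64_t ctr_el0)
{
	return 4ULL << ((ctr_el0 >> 16) & 0xf);
}

/* Count the "dc ivac" operations the loop issues for [start, start + size):
 * the end is computed first, then start is rounded down to a line boundary
 * and advanced one line at a time while it is below the end. */
static uint64_t count_dc_ops(uint64_t start, uint64_t size, uint64_t line)
{
	uint64_t end = start + size;
	uint64_t addr = start & ~(line - 1);
	uint64_t ops = 0;

	assert(line != 0 && (line & (line - 1)) == 0); /* power of two */
	for (; addr < end; addr += line)
		ops++;
	return ops;
}
```

A page-aligned 4 KiB destination with 64-byte lines needs 64 operations; because the end is computed before the start is rounded down, an unaligned start still covers every line it touches.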

> +
> +	/* Copy page. */
> +1:	ldp	x22, x23, [x21]
> +	ldp	x24, x25, [x21, #16]
> +	ldp	x26, x27, [x21, #32]
> +	ldp	x28, x29, [x21, #48]
> +	add	x21, x21, #64
> +	stnp	x22, x23, [x20]
> +	stnp	x24, x25, [x20, #16]
> +	stnp	x26, x27, [x20, #32]
> +	stnp	x28, x29, [x20, #48]
> +	add	x20, x20, #64
> +	tst	x21, #(PAGE_SIZE - 1)
> +	b.ne	1b
> +
> +	/* dest += PAGE_SIZE */
> +	add	x14, x14, PAGE_SIZE
> +	b	.Lnext
> +
> +.Ltest_indirection:
> +	tbz	x18, IND_INDIRECTION_BIT, .Ltest_destination
> +
> +	/* ptr = addr */
> +	mov	x15, x13
> +	b	.Lnext
> +
> +.Ltest_destination:
> +	tbz	x18, IND_DESTINATION_BIT, .Lnext
> +
> +	mov	x16, x13
> +
> +	/* dest = addr */
> +	mov	x14, x13
> +
> +.Lnext:
> +	/* entry = *ptr++ */
> +	ldr	x18, [x15], #8
> +
> +	/* while (!(entry & DONE)) */
> +	tbz	x18, IND_DONE_BIT, .Lloop
> +
> +.Ldone:
> +	dsb	sy
> +	isb
> +	ic	ialluis
> +	dsb	sy

Why the second dsb?


> +	isb
> +
> +	/* Start new image. */
> +	ldr	x4, .Lkimage_start
> +	mov	x0, xzr
> +	mov	x1, xzr
> +	mov	x2, xzr
> +	mov	x3, xzr

Once the kexec'd kernel is booting, I get:
> WARNING: x1-x3 nonzero in violation of boot protocol:
>         x1: 0000000080008000
>         x2: 0000000000000020
>         x3: 0000000000000020
> This indicates a broken bootloader or old kernel

Presumably this 'kimage_start' isn't pointing to the new kernel, but the
purgatory code, (which comes from user-space?). (If so what are these xzr-s
for?)

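For reference, the arm64 boot protocol in Documentation/arm64/booting.txt requires x0 to hold the physical address of the dtb and x1-x3 to be zero at kernel entry, and the new kernel warns during early setup when they are not. A user-space sketch of roughly that check; check_boot_args() is illustrative and not a copy of the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Boot protocol sketch: x0 = physical address of the dtb, x1-x3 = 0.
 * Returns true if the registers handed over at entry are compliant;
 * otherwise prints a warning in the style of the one quoted above. */
static bool check_boot_args(const uint64_t boot_args[4])
{
	if (boot_args[1] || boot_args[2] || boot_args[3]) {
		fprintf(stderr,
			"WARNING: x1-x3 nonzero in violation of boot protocol:\n"
			"\tx1: %016llx\n\tx2: %016llx\n\tx3: %016llx\n",
			(unsigned long long)boot_args[1],
			(unsigned long long)boot_args[2],
			(unsigned long long)boot_args[3]);
		return false;
	}
	return true;
}
```

Fed the x1-x3 values from the log above, the check fails, which is consistent with the entry point receiving registers that were not zeroed on the way in.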

> +	br	x4
> +
> +.align 3	/* To keep the 64-bit values below naturally aligned. */
> +
> +/* The machine_kexec routine sets these variables via offsets from
> + * arm64_relocate_new_kernel.
> + */
> +
> +/*
> + * .Lkimage_start - Copy of image->start, the entry point of the new
> + * image.
> + */
> +.Lkimage_start:
> +	.quad	0x0
> +
> +/*
> + * .Lkimage_head - Copy of image->head, the list of kimage entries.
> + */
> +.Lkimage_head:
> +	.quad	0x0
> +

I assume these .quad-s are used because you can't pass the values in via
registers - due to the complicated soft_restart(). Given you are the only
user, couldn't you simplify it to do all the disabling in
arm64_relocate_new_kernel?

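A sketch of the pattern James is asking about, assuming (as he does) that the values cannot be handed over in registers: the caller copies the relocation code into the control page and then patches kimage->start and kimage->head into its .quad slots at the exported offsets (the role played by arm64_kexec_kimage_start_offset and arm64_kexec_kimage_head_offset). fill_control_page() is hypothetical, and the real machine_kexec() may order these steps differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the position-independent relocation code into the control page,
 * then patch the two 64-bit slots it reads at run time. */
static void fill_control_page(uint8_t *control_page,
			      const uint8_t *code, size_t code_size,
			      size_t start_offset, size_t head_offset,
			      uint64_t kimage_start, uint64_t kimage_head)
{
	/* Both slots must lie inside the copied code. */
	assert(start_offset + sizeof(kimage_start) <= code_size);
	assert(head_offset + sizeof(kimage_head) <= code_size);

	memcpy(control_page, code, code_size);
	/* memcpy keeps the sketch free of alignment and aliasing concerns. */
	memcpy(control_page + start_offset, &kimage_start, sizeof(kimage_start));
	memcpy(control_page + head_offset, &kimage_head, sizeof(kimage_head));
}
```

Passing the values in registers instead would remove the need for the exported offsets, which is the simplification James suggests.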

> +.Lcopy_end:
> +.org	KEXEC_CONTROL_PAGE_SIZE
> +
> +/*
> + * arm64_relocate_new_kernel_size - Number of bytes to copy to the control_code_page.
> + */
> +.globl arm64_relocate_new_kernel_size
> +arm64_relocate_new_kernel_size:
> +	.quad	.Lcopy_end - arm64_relocate_new_kernel
> +
> +/*
> + * arm64_kexec_kimage_start_offset - Offset for writing .Lkimage_start.
> + */
> +.globl arm64_kexec_kimage_start_offset
> +arm64_kexec_kimage_start_offset:
> +	.quad	.Lkimage_start - arm64_relocate_new_kernel
> +
> +/*
> + * arm64_kexec_kimage_head_offset - Offset for writing .Lkimage_head.
> + */
> +.globl arm64_kexec_kimage_head_offset
> +arm64_kexec_kimage_head_offset:
> +	.quad	.Lkimage_head - arm64_relocate_new_kernel

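The list walk in arm64_relocate_new_kernel can be modeled in C to make the control flow easier to follow. This sketch assumes the entry flag values from include/linux/kexec.h (IND_DESTINATION = 0x1, IND_INDIRECTION = 0x2, IND_DONE = 0x4, IND_SOURCE = 0x8) and a 4 KiB page; relocate_sketch() and copy_page_64() are hypothetical names, memcpy() stands in for the ldp/stnp pairs, and all cache maintenance is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed to match include/linux/kexec.h. Addresses are page aligned,
 * so the flags live in the low bits of each entry. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))
#define IND_DESTINATION  (1UL << 0)
#define IND_INDIRECTION  (1UL << 1)
#define IND_DONE         (1UL << 2)
#define IND_SOURCE       (1UL << 3)

typedef uintptr_t kimage_entry_t;

/* Copy one page 64 bytes at a time, stopping when src reaches a page
 * boundary, like the "tst x21, #(PAGE_SIZE - 1)" loop. */
static void copy_page_64(uint8_t *dest, const uint8_t *src)
{
	do {
		memcpy(dest, src, 64);
		dest += 64;
		src += 64;
	} while ((uintptr_t)src & (SKETCH_PAGE_SIZE - 1));
}

/* Walk the kimage entry list starting at head; returns pages copied.
 * Like the assembly (x15 starts at 0), this assumes head is an
 * IND_INDIRECTION entry, which is how the kexec core builds the list. */
static unsigned long relocate_sketch(kimage_entry_t head)
{
	kimage_entry_t entry = head;
	kimage_entry_t *ptr = NULL;
	uint8_t *dest = NULL;
	unsigned long copied = 0;

	if (entry == 0 || (entry & IND_DONE))
		return 0;

	do {
		uintptr_t addr = entry & SKETCH_PAGE_MASK;

		if (entry & IND_SOURCE) {
			copy_page_64(dest, (const uint8_t *)addr);
			dest += SKETCH_PAGE_SIZE;	/* dest += PAGE_SIZE */
			copied++;
		} else if (entry & IND_INDIRECTION) {
			ptr = (kimage_entry_t *)addr;	/* ptr = addr */
		} else if (entry & IND_DESTINATION) {
			dest = (uint8_t *)addr;		/* dest = addr */
		}
		entry = *ptr++;				/* entry = *ptr++ */
	} while (!(entry & IND_DONE));

	return copied;
}
```

As in the assembly, flags are tested in the order SOURCE, INDIRECTION, DESTINATION, and the walk stops at the first entry with IND_DONE set.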

From 'kexec -e' to the first messages from the new kernel takes ~1 minute
on Juno. Did you see a similar delay? Or should I go looking for what I've
configured wrong!?

(Copying code with the mmu+caches on, then cleaning to PoC was noticeably
faster for hibernate)


I've used this series for kexec-ing between 4K and 64K page_size kernels on
Juno.

Tested-By: James Morse <james.morse@arm.com>



Thanks!

James
Mark Rutland Oct. 30, 2015, 4:54 p.m. UTC | #14
Hi,

> If I've followed all this through properly:
> 
> With KVM - mmu+caches are configured, but then disabled by 'kvm: allows kvm
> cpu hotplug'. This 'arm64_relocate_new_kernel' function then runs at EL2
> with M=0, C=0, I=0.
> 
> Without KVM - when there is no user of EL2, the mmu+caches are left in
> whatever state the bootloader (or efi stub) left them in. From
> Documentation/arm64/booting.txt:
> > Instruction cache may be on or off.
> and
> > System caches which respect the architected cache maintenance by VA
> > operations must be configured and may be enabled.
> 
> So 'arm64_relocate_new_kernel' function could run at EL2 with M=0, C=?, I=?.
> 
> I think this means you can't guarantee anything you are copying below
> actually makes it through the caches - booting secondary processors may get
> stale values.
> 
> The EFI stub disables the M and C bits when booted at EL2 with uefi - but
> it leaves the instruction cache enabled. You only clean the
> reboot_code_buffer from the data cache, so there may be stale values in the
> instruction cache.
> 
> I think you need to disable the i-cache at EL1. If you jump to EL2, I think
> you need to disable the I/C bits there too - as you can't rely on the code
> in 'kvm: allows kvm cpu hotplug' to do this in a non-kvm case.

The SCTLR_ELx.I only affects the attributes that the I-cache uses to
fetch with, not whether it is enabled (it cannot be disabled
architecturally).

It's not necessary to clear the I bit so long as the appropriate
maintenance has occurred, though I believe that when the I bit is set
instruction fetches may allocate in unified levels of cache, so
additional consideration is required for that case.

> > +	/* Copy page. */
> > +1:	ldp	x22, x23, [x21]
> > +	ldp	x24, x25, [x21, #16]
> > +	ldp	x26, x27, [x21, #32]
> > +	ldp	x28, x29, [x21, #48]
> > +	add	x21, x21, #64
> > +	stnp	x22, x23, [x20]
> > +	stnp	x24, x25, [x20, #16]
> > +	stnp	x26, x27, [x20, #32]
> > +	stnp	x28, x29, [x20, #48]
> > +	add	x20, x20, #64
> > +	tst	x21, #(PAGE_SIZE - 1)
> > +	b.ne	1b
> > +
> > +	/* dest += PAGE_SIZE */
> > +	add	x14, x14, PAGE_SIZE
> > +	b	.Lnext
> > +
> > +.Ltest_indirection:
> > +	tbz	x18, IND_INDIRECTION_BIT, .Ltest_destination
> > +
> > +	/* ptr = addr */
> > +	mov	x15, x13
> > +	b	.Lnext
> > +
> > +.Ltest_destination:
> > +	tbz	x18, IND_DESTINATION_BIT, .Lnext
> > +
> > +	mov	x16, x13
> > +
> > +	/* dest = addr */
> > +	mov	x14, x13
> > +
> > +.Lnext:
> > +	/* entry = *ptr++ */
> > +	ldr	x18, [x15], #8
> > +
> > +	/* while (!(entry & DONE)) */
> > +	tbz	x18, IND_DONE_BIT, .Lloop
> > +
> > +.Ldone:
> > +	dsb	sy
> > +	isb
> > +	ic	ialluis
> > +	dsb	sy
> 
> Why the second dsb?
> 
> 
> > +	isb

The first DSB ensures that the copied data is observable by the
I-caches.

The first ISB is unnecessary.

The second DSB ensures that the I-cache maintenance is completed.

The second ISB ensures that the I-cache maintenance is complete w.r.t.
the current instruction stream. There could be instructions in the
pipeline fetched from the I-cache prior to invalidation which need to be
cleared.

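A user-space analogue of the ordering Mark describes: after new instructions are written through the data side, GCC and Clang's __builtin___clear_cache(begin, end) expands on AArch64 to roughly "clean D-cache to PoU, DSB, invalidate I-cache, DSB, ISB", and to nothing on x86, where instruction fetch is coherent with data stores. install_code() is a hypothetical helper; this only illustrates the requirement and is not the EL2 sequence used in relocate_kernel.S.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy instruction bytes into dst and make them visible to instruction
 * fetch. The builtin emits the cache maintenance and barriers needed on
 * architectures with non-coherent instruction caches. */
static void install_code(char *dst, const char *code, size_t len)
{
	memcpy(dst, code, len);
	__builtin___clear_cache(dst, dst + len);
}
```

Skipping the maintenance step is the user-space equivalent of dropping the DSB/IC/DSB/ISB sequence: the data is in memory, but the instruction stream may still see stale contents.
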
Thanks,
Mark.
Pratyush Anand Nov. 2, 2015, 9:26 a.m. UTC | #15
Hi James,

On 30/10/2015:04:29:01 PM, James Morse wrote:
> 
> From 'kexec -e' to the first messages from the new kernel takes ~1 minute
> on Juno. Did you see a similar delay? Or should I go looking for what I've
> configured wrong!?

I had similar issues with Mustang, where it was taking more than 2 minutes.

Can you please try my kexec-tools repo [1], where I have patches to enable the
D-cache for sha verification? Your feedback might help to get these patches upstream.

Thanks
~Pratyush

[1] https://github.com/pratyushanand/kexec-tools.git : master
Geoff Levand Nov. 3, 2015, 12:30 a.m. UTC | #16
Hi James,

On Fri, 2015-10-30 at 16:29 +0000, James Morse wrote:
> On 20/10/15 00:38, Geoff Levand wrote:
> > +config KEXEC
> > +	depends on (!SMP || PM_SLEEP_SMP)
> 
> Commit 4b3dc9679cf7 got rid of '!SMP'.

Fixed for v11.

> > - * Copyright (C) 2015 Huawei Futurewei Technologies.
> > + * Copyright (C) Huawei Futurewei Technologies.
> 
> Move this hunk into the patch that adds the file?

Was fixed in v10.2.
 

> > +++ b/arch/arm64/kernel/relocate_kernel.S

> If I've followed all this through properly:
> 
> With KVM - mmu+caches are configured, but then disabled by 'kvm: allows kvm
> cpu hotplug'. This 'arm64_relocate_new_kernel' function then runs at EL2
> with M=0, C=0, I=0.
> 
> Without KVM - when there is no user of EL2, the mmu+caches are left in
> whatever state the bootloader (or efi stub) left them in. From
> Documentation/arm64/booting.txt:
> > Instruction cache may be on or off.
> and
> > System caches which respect the architected cache maintenance by VA
> > operations must be configured and may be enabled.
> 
> So 'arm64_relocate_new_kernel' function could run at EL2 with M=0, C=?, I=?.
> 
> I think this means you can't guarantee anything you are copying below
> actually makes it through the caches - booting secondary processors may get
> stale values.
> 
> The EFI stub disables the M and C bits when booted at EL2 with uefi - but
> it leaves the instruction cache enabled. You only clean the
> reboot_code_buffer from the data cache, so there may be stale values in the
> instruction cache.
> 
> I think you need to disable the i-cache at EL1. If you jump to EL2, I think
> you need to disable the I/C bits there too - as you can't rely on the code
> in 'kvm: allows kvm cpu hotplug' to do this in a non-kvm case.

For consistency across all code paths, we could put in something like this:

+       /* Clear SCTLR_ELx_FLAGS. */
+       mrs     x0, CurrentEL
+       cmp     x0, #CurrentEL_EL2
+       b.ne    1f
+       mrs     x0, sctlr_el2
+       ldr     x1, =SCTLR_EL2_FLAGS
+       bic     x0, x0, x1
+       msr     sctlr_el2, x0
+       isb
+       b       2f
+1:     mrs     x0, sctlr_el1
+       ldr     x1, =SCTLR_EL2_FLAGS
+       bic     x0, x0, x1
+       msr     sctlr_el1, x0
+       isb
+2:

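The bit manipulation in the snippet above can be checked on the host. This sketch assumes SCTLR_EL2_FLAGS covers M (bit 0), A (bit 1), C (bit 2), SA (bit 3) and I (bit 12); those bits occupy the same positions in SCTLR_EL1 and SCTLR_EL2, which is presumably why one mask serves both paths. CurrentEL reports the exception level in bits [3:2], so the EL2 value compared against is 2 << 2. SKETCH_SCTLR_FLAGS is a hypothetical stand-in; check asm/sysreg.h for the real definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCTLR_M   (1ULL << 0)   /* MMU enable */
#define SCTLR_A   (1ULL << 1)   /* alignment check enable */
#define SCTLR_C   (1ULL << 2)   /* data access cacheability */
#define SCTLR_SA  (1ULL << 3)   /* SP alignment check enable */
#define SCTLR_I   (1ULL << 12)  /* instruction access cacheability */

/* Hypothetical stand-in for SCTLR_EL2_FLAGS. */
#define SKETCH_SCTLR_FLAGS (SCTLR_M | SCTLR_A | SCTLR_C | SCTLR_SA | SCTLR_I)

/* CurrentEL holds the exception level in bits [3:2]. */
#define SKETCH_CURRENTEL_EL2 (2ULL << 2)

/* Same comparison as "cmp x0, #CurrentEL_EL2". */
static bool running_at_el2(uint64_t current_el)
{
	return current_el == SKETCH_CURRENTEL_EL2;
}

/* Equivalent of the "bic x0, x0, x1" step, for SCTLR_EL1 or SCTLR_EL2. */
static uint64_t clear_sctlr_flags(uint64_t sctlr)
{
	return sctlr & ~SKETCH_SCTLR_FLAGS;
}
```

For example, clear_sctlr_flags(0x30c5183d) yields 0x30c50830: bits 0-3 and 12 are cleared and everything else is preserved.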


> > +.Ldone:
> > +	dsb	sy
> > +	isb
> > +	ic	ialluis
> > +	dsb	sy
> 
> Why the second dsb?

I removed the first isb as Mark suggested.


> 
> > +	isb
> > +
> > +	/* Start new image. */
> > +	ldr	x4, .Lkimage_start
> > +	mov	x0, xzr
> > +	mov	x1, xzr
> > +	mov	x2, xzr
> > +	mov	x3, xzr
> 
> Once the kexec'd kernel is booting, I get:
> > WARNING: x1-x3 nonzero in violation of boot protocol:
> >         x1: 0000000080008000
> >         x2: 0000000000000020
> >         x3: 0000000000000020
> > This indicates a broken bootloader or old kernel
> 
> Presumably this 'kimage_start' isn't pointing to the new kernel, but the
> purgatory code, (which comes from user-space?). (If so what are these xzr-s
> for?)

The warning was from the arm64 purgatory in kexec-tools, now fixed.

We don't need to zero the registers anymore. At one time I had
an option where the kernel found the dtb section and jumped
directly to the new image, as the 32-bit arm kernel does.

> > +/* The machine_kexec routine sets these variables via offsets from
> > + * arm64_relocate_new_kernel.
> > + */
> > +
> > +/*
> > + * .Lkimage_start - Copy of image->start, the entry point of the new
> > + * image.
> > + */
> > +.Lkimage_start:
> > +	.quad	0x0
> > +
> > +/*
> > + * .Lkimage_head - Copy of image->head, the list of kimage entries.
> > + */
> > +.Lkimage_head:
> > +	.quad	0x0
> > +
> 
> I assume these .quad-s are used because you can't pass the values in via
> registers - due to the complicated soft_restart(). Given you are the only
> user, couldn't you simplify it to do all the disabling in
> arm64_relocate_new_kernel?

I moved some things from cpu_reset to arm64_relocate_new_kernel, but
from what Takahiro has said, to support a modular kvm some of the CPU
shutdown code will be shared.  Maybe we can look into simplifying things
once work on modular kvm is started.


> 
> From 'kexec -e' to the first messages from the new kernel takes ~1 minute
> on Juno. Did you see a similar delay? Or should I go looking for what I've
> configured wrong!?

As Pratyush has mentioned this is most likely due to the dcaches
being disabled.

> (Copying code with the mmu+caches on, then cleaning to PoC was noticeably
> faster for hibernate)
> 
> 
> I've used this series for kexec-ing between 4K and 64K page_size kernels on
> Juno.

Thanks for testing.

-Geoff