
roms: Flush icache when writing roms to guest memory

Message ID 1386768216-33686-1-git-send-email-agraf@suse.de
State New

Commit Message

Alexander Graf Dec. 11, 2013, 1:23 p.m. UTC
We use the rom infrastructure to write firmware and/or initial kernel
blobs into guest address space. So we're essentially the layer before
the first code that gets executed inside the guest.

The guest expects that its data and instruction cache view of the world
is 100% consistent when it initially boots. This works just fine on
initial rom population for the first boot.

However, when we reboot and then repopulate the rom region there could
be old code still stuck in the instruction cache, giving the guest an
inconsistent view of the world when we're using kvm.

So we need to invalidate the icache every time we write a rom into guest
address space. We do not need to do this for every DMA since the guest
expects it has to flush the icache manually in that case.

This fixes random reboot issues on e5500 (booke ppc) for me.

Signed-off-by: Alexander Graf <agraf@suse.de>
---
 exec.c           |  8 ++++++++
 hw/core/loader.c | 10 ++++++++++
 2 files changed, 18 insertions(+)

Comments

Paolo Bonzini Dec. 11, 2013, 1:27 p.m. UTC | #1
On 11/12/2013 14:23, Alexander Graf wrote:
> +            if (kvm_enabled()) {
> +                /*
> +                 * The guest may want to directly execute from the rom region,
> +                 * so we better invalidate its icache
> +                 */
> +                flush_icache_range((uintptr_t)ptr, (uintptr_t)ptr + l);
> +            }

Shouldn't KVM itself do that when a memslot is registered?  There should
be no reason for non-TCG QEMU to flush the icache.

Paolo
Alexander Graf Dec. 11, 2013, 1:35 p.m. UTC | #2
On 11.12.2013, at 14:27, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 11/12/2013 14:23, Alexander Graf wrote:
>> +            if (kvm_enabled()) {
>> +                /*
>> +                 * The guest may want to directly execute from the rom region,
>> +                 * so we better invalidate its icache
>> +                 */
>> +                flush_icache_range((uintptr_t)ptr, (uintptr_t)ptr + l);
>> +            }
> 
> Shouldn't KVM itself do that when a memslot is registered?  There should
> be no reason for non-TCG QEMU to flush the icache.

How would KVM know when things changed inside of a memory region? It's up to user space to manage the contents of a memory region, no?

Alex
Peter Maydell Dec. 11, 2013, 1:56 p.m. UTC | #3
On 11 December 2013 13:23, Alexander Graf <agraf@suse.de> wrote:
> The guest expects that its data and instruction cache view of the world
> is 100% consistent when it initially boots. This works just fine on
> initial rom population for the first boot.
>
> However, when we reboot and then repopulate the rom region there could
> be old code still stuck in the instruction cache, giving the guest an
> inconsistent view of the world when we're using kvm.
>
> So we need to invalidate the icache every time we write a rom into guest
> address space. We do not need to do this for every DMA since the guest
> expects it has to flush the icache manually in that case.

> @@ -2033,6 +2034,13 @@ void cpu_physical_memory_write_rom(hwaddr addr,
>              ptr = qemu_get_ram_ptr(addr1);
>              memcpy(ptr, buf, l);
>              invalidate_and_set_dirty(addr1, l);
> +            if (kvm_enabled()) {
> +                /*
> +                 * The guest may want to directly execute from the rom region,
> +                 * so we better invalidate its icache
> +                 */
> +                flush_icache_range((uintptr_t)ptr, (uintptr_t)ptr + l);
> +            }

I bet these aren't the only places where code gets written
to guest memory. Also are you sure flush_icache_range()
works correctly when multiple threads (multiple vCPUs,
potentially executing on different host CPUs) are involved? The
TCG case only needs to care about "this thread writes code
to memory that it will itself later execute", not any kind of
cross-host-CPU flushing.

There was a huge thread on kvmarm earlier this year
https://lists.cs.columbia.edu/pipermail/kvmarm/2013-August/006716.html
about a similar sort of issue, and I think the conclusion was that
the kernel basically had to deal with the problem itself [though
the thread is rather confusing...]. I've cc'd Marc Z in the hope
he remembers the ARM specific detail...

thanks
-- PMM
Paolo Bonzini Dec. 11, 2013, 2:03 p.m. UTC | #4
On 11/12/2013 14:35, Alexander Graf wrote:
>>> >> +            if (kvm_enabled()) {
>>> >> +                /*
>>> >> +                 * The guest may want to directly execute from the rom region,
>>> >> +                 * so we better invalidate its icache
>>> >> +                 */
>>> >> +                flush_icache_range((uintptr_t)ptr, (uintptr_t)ptr + l);
>>> >> +            }
>> > 
>> > Shouldn't KVM itself do that when a memslot is registered?  There should
>> > be no reason for non-TCG QEMU to flush the icache.
> How would KVM know when things changed inside of a memory region? It's up to user space to manage the contents of a memory region, no?

Yeah, that is true.  BTW, shouldn't the same happen when you do migration?

I'd prefer the above snippet to be replaced by a function in
kvm-stub.c/kvm-all.c (kvm_flush_icache_range).
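
A minimal sketch of what such a wrapper could look like, assuming the flush_icache_range() primitive from qemu/cache-utils.h (which the patch already uses) is the right host-side operation; names and placement are only illustrative, not code from the posted patch:

    /* kvm-all.c (sketch): flush the host icache for a range of guest RAM
     * that was written behind the vCPUs' backs, e.g. by the rom loader. */
    void kvm_flush_icache_range(void *start, size_t len)
    {
        if (!kvm_enabled()) {
            return;
        }
        flush_icache_range((uintptr_t)start, (uintptr_t)start + len);
    }

    /* kvm-stub.c (sketch): no-op for builds without KVM support. */
    void kvm_flush_icache_range(void *start, size_t len)
    {
    }

Callers in exec.c and hw/core/loader.c could then drop the open-coded kvm_enabled() check and simply call kvm_flush_icache_range(ptr, l).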

I wonder if there would be a reason to add a KVM_FLUSH_ICACHE ioctl
though.  Could a virtually-indexed/virtually-tagged icache require
flushing by guest address instead of host address?

Paolo
Peter Maydell Dec. 11, 2013, 2:07 p.m. UTC | #5
On 11 December 2013 13:35, Alexander Graf <agraf@suse.de> wrote:
> How would KVM know when things changed inside of a memory region?
> It's up to user space to manage the contents of a memory region, no?

If the architecture spec says that a freshly reset physical CPU has
coherent icache and dcache, then resetting the vCPU should also
ensure the icache and dcache are coherent, so one way to solve
this would be just to make sure that vcpu reset did the right thing.

-- PMM
Alexander Graf Dec. 11, 2013, 2:17 p.m. UTC | #6
On 11.12.2013, at 15:07, Peter Maydell <peter.maydell@linaro.org> wrote:

> On 11 December 2013 13:35, Alexander Graf <agraf@suse.de> wrote:
>> How would KVM know when things changed inside of a memory region?
>> It's up to user space to manage the contents of a memory region, no?
> 
> If the architecture spec says that a freshly reset physical CPU has
> coherent icache and dcache, then resetting the vCPU should also
> ensure the icache and dcache are coherent, so one way to solve
> this would be just to make sure that vcpu reset did the right thing.

Well, this really is a simplified view of the world.

On real hardware the system boots up with caches disabled. Firmware is then responsible for enabling caches and flushing things as it goes. Firmware loads the kernel into ram, flushing the icache on those regions it wrote to along the way. The kernel boots and every time it faults in a page, it flushes caches for that page.

So really the problem is that we're skipping the "cache disabled firmware" step. With this patch, we're simulating a bootloader's behavior when writing a blob into guest memory. Since that's really what we are trying to behave like - a bootloader.


Alex
Mihai Caraman Dec. 11, 2013, 2:18 p.m. UTC | #7
> -----Original Message-----
> From: Peter Maydell [mailto:peter.maydell@linaro.org]
> Sent: Wednesday, December 11, 2013 4:07 PM
> To: Alexander Graf
> Cc: Paolo Bonzini; Vlad Bogdan-BOGVLAD1; QEMU Developers; qemu-ppc@nongnu.org; Sethi Varun-B16395; Wood Scott-B07421; Caraman Mihai Claudiu-B02008
> Subject: Re: [Qemu-devel] [PATCH] roms: Flush icache when writing roms to guest memory
>
> On 11 December 2013 13:35, Alexander Graf <agraf@suse.de> wrote:
> > How would KVM know when things changed inside of a memory region?
> > It's up to user space to manage the contents of a memory region, no?
>
> If the architecture spec says that a freshly reset physical CPU has
> coherent icache and dcache, then resetting the vCPU should also
> ensure the icache and dcache are coherent, so one way to solve
> this would be just to make sure that vcpu reset did the right thing.

This is not related to reset operation. Freescale e500 core family
does not assure the coherency between data and instruction cache.
This is an extract from reference manual:

'When a processor modifies any memory location that can contain an
instruction, software must ensure that the instruction cache is made
consistent with data memory and that the modifications are made visible
to the instruction fetching mechanism. This must be done even if the
cache is disabled or if the page is marked caching-inhibited.'

So it's the loader's duty to synchronize the instruction cache.
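
For reference, the step the manual is alluding to is the usual dcbst/sync/icbi/isync loop over the written range. A rough, illustrative sketch (not code from this patch; the 32-byte cache block size is only an assumption, the real size is implementation-specific):

    /* Illustrative PPC-only sketch of the architecturally defined
     * self-modifying-code sequence: push the new data out of the dcache,
     * then invalidate any stale icache lines covering it. */
    static void sync_icache_range(void *start, size_t len)
    {
        const uintptr_t block = 32;     /* assumed cache block size */
        uintptr_t first = (uintptr_t)start & ~(block - 1);
        uintptr_t end = (uintptr_t)start + len;
        uintptr_t p;

        for (p = first; p < end; p += block) {
            asm volatile("dcbst 0,%0" : : "r"(p) : "memory"); /* write dcache back */
        }
        asm volatile("sync" : : : "memory");                  /* wait for the writebacks */
        for (p = first; p < end; p += block) {
            asm volatile("icbi 0,%0" : : "r"(p) : "memory");  /* kill stale icache lines */
        }
        asm volatile("sync; isync" : : : "memory");           /* drop prefetched insns */
    }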

-Mike
Alexander Graf Dec. 11, 2013, 2:20 p.m. UTC | #8
On 11.12.2013, at 15:03, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 11/12/2013 14:35, Alexander Graf wrote:
>>>>>> +            if (kvm_enabled()) {
>>>>>> +                /*
>>>>>> +                 * The guest may want to directly execute from the rom region,
>>>>>> +                 * so we better invalidate its icache
>>>>>> +                 */
>>>>>> +                flush_icache_range((uintptr_t)ptr, (uintptr_t)ptr + l);
>>>>>> +            }
>>>> 
>>>> Shouldn't KVM itself do that when a memslot is registered?  There should
>>>> be no reason for non-TCG QEMU to flush the icache.
>> How would KVM know when things changed inside of a memory region? It's up to user space to manage the contents of a memory region, no?
> 
> Yeah, that is true.  BTW, shouldn't the same happen when you do migration?

Fortunately no, because migration always happens on a clean slate, so the icache is not yet populated for the regions that the guest's memory gets written to :).

> I'd prefer the above snippet to be replaced by a function in
> kvm-stub.c/kvm-all.c (kvm_flush_icache_range).

That makes sense.

> I wonder if there would be a reason to add a KVM_FLUSH_ICACHE ioctl
> though.  Could a virtually-indexed/virtually-tagged icache require
> flushing by guest address instead of host address?

No PPC platform I care about has vi/vt icache. I don't know if ARM has any - but I'd prefer to keep this as simple as possible for as long as we can. Newer POWER chips even just do cache snooping and don't need all this manual cache synchronization nonsense anymore.


Alex
Peter Maydell Dec. 11, 2013, 2:25 p.m. UTC | #9
On 11 December 2013 14:18, mihai.caraman@freescale.com
<mihai.caraman@freescale.com> wrote:
>> From: Peter Maydell [mailto:peter.maydell@linaro.org]
>> If the architecture spec says that a freshly reset physical CPU has
>> coherent icache and dcache, then resetting the vCPU should also
>> ensure the icache and dcache are coherent, so one way to solve
>> this would be just to make sure that vcpu reset did the right thing.
>
> This is not related to reset operation. Freescale e500 core family
> does not assure the coherency between data and instruction cache.
> This is an extract from reference manual:
>
> 'When a processor modifies any memory location that can contain an
> instruction, software must ensure that the instruction cache is made
> consistent with data memory and that the modifications are made visible
> to the instruction fetching mechanism. This must be done even if the
> cache is disabled or if the page is marked caching-inhibited.'
>
> So it's the loader's duty to synchronize the instruction cache.

But these are (emulated) ROMs, not an emulated bootloader.
They ought to work like actual ROMs: QEMU as the emulator
of the system/devices provides the contents of physical address
space; KVM as the emulator of the CPU provides a CPU which
doesn't start up executing from rubbish in its icache. (This matches
how a real physical CPU executes its first instruction by really
going out to the ROM, not by looking at its cache.)

thanks
-- PMM
Mihai Caraman Dec. 11, 2013, 2:27 p.m. UTC | #10
> On 11.12.2013, at 16:15, Alexander Graf <agraf@suse.de> wrote:
>
> Well, this really is a simplified view of the world.
> 
> On real hardware the system boots up with caches disabled. Firmware is
> then responsible for enabling caches and flushing things as it goes.
> Firmware loads the kernel into ram, flushing the icache on those regions
> it wrote to along the way. The kernel boots and every time it faults in a
> page, it flushes caches for that page.
> 
> So really the problem is that we're skipping the "cache disabled
> firmware" step. With this patch, we're simulating a bootloader's behavior
> when writing a blob into guest memory. Since that's really what we are
> trying to behave like - a bootloader.

Cache synchronization is required by self-modifying code, not just bootloaders.

-Mike
Alexander Graf Dec. 11, 2013, 2:31 p.m. UTC | #11
On 11.12.2013, at 15:25, Peter Maydell <peter.maydell@linaro.org> wrote:

> On 11 December 2013 14:18, mihai.caraman@freescale.com
> <mihai.caraman@freescale.com> wrote:
>>> From: Peter Maydell [mailto:peter.maydell@linaro.org]
>>> If the architecture spec says that a freshly reset physical CPU has
>>> coherent icache and dcache, then resetting the vCPU should also
>>> ensure the icache and dcache are coherent, so one way to solve
>>> this would be just to make sure that vcpu reset did the right thing.
>> 
>> This is not related to reset operation. Freescale e500 core family
>> does not assure the coherency between data and instruction cache.
>> This is an extract from reference manual:
>> 
>> 'When a processor modifies any memory location that can contain an
>> instruction, software must ensure that the instruction cache is made
>> consistent with data memory and that the modifications are made visible
>> to the instruction fetching mechanism. This must be done even if the
>> cache is disabled or if the page is marked caching-inhibited.'
>> 
>> So it's the loader's duty to synchronize the instruction cache.
> 
> But these are (emulated) ROMs, not an emulated bootloader.
> They ought to work like actual ROMs: QEMU as the emulator

No, they don't. Real ROMs lie in cache-inhibited memory and are only copied/shadowed into RAM by firmware. We don't do that with QEMU.

> of the system/devices provides the contents of physical address
> space; KVM as the emulator of the CPU provides a CPU which
> doesn't start up executing from rubbish in its icache. (This matches
> how a real physical CPU executes its first instruction by really
> going out to the ROM, not by looking at its cache.)

KVM can't even execute from real ROM (MMIO) regions.


Alex
Mihai Caraman Dec. 11, 2013, 2:58 p.m. UTC | #12
On 11.12.2013, at 15:07, Peter Maydell <peter.maydell@linaro.org> wrote:
> But these are (emulated) ROMs, not an emulated bootloader.
> They ought to work like actual ROMs: QEMU as the emulator
> of the system/devices provides the contents of physical address
> space; KVM as the emulator of the CPU provides a CPU which
> doesn't start up executing from rubbish in its icache. (This matches
> how a real physical CPU executes its first instruction by really
> going out to the ROM, not by looking at its cache.)

For the ppce500 machine, QEMU calls load_uimage2()/load_elf(), effectively
loading the image at address 0 instead of handling it as a raw blob.
We do not yet run a bootloader inside the VM.

-Mike
Scott Wood Dec. 13, 2013, 7:18 p.m. UTC | #13
On Wed, 2013-12-11 at 13:56 +0000, Peter Maydell wrote:
> On 11 December 2013 13:23, Alexander Graf <agraf@suse.de> wrote:
> > The guest expects that its data and instruction cache view of the world
> > is 100% consistent when it initially boots. This works just fine on
> > initial rom population for the first boot.
> >
> > However, when we reboot and then repopulate the rom region there could
> > be old code still stuck in the instruction cache, giving the guest an
> > inconsistent view of the world when we're using kvm.
> >
> > So we need to invalidate the icache every time we write a rom into guest
> > address space. We do not need to do this for every DMA since the guest
> > expects it has to flush the icache manually in that case.
> 
> > @@ -2033,6 +2034,13 @@ void cpu_physical_memory_write_rom(hwaddr addr,
> >              ptr = qemu_get_ram_ptr(addr1);
> >              memcpy(ptr, buf, l);
> >              invalidate_and_set_dirty(addr1, l);
> > +            if (kvm_enabled()) {
> > +                /*
> > +                 * The guest may want to directly execute from the rom region,
> > +                 * so we better invalidate its icache
> > +                 */
> > +                flush_icache_range((uintptr_t)ptr, (uintptr_t)ptr + l);
> > +            }
> 
> I bet these aren't the only places where code gets written
> to guest memory. Also are you sure flush_icache_range()
> works correctly when multiple threads (multiple vCPUs,
> potentially executing on different host CPUs) are involved?

On PPC these cache operations broadcast, and are the architecturally
defined way of doing self-modifying code.

>  The TCG case only needs to care about "this thread writes code
> to memory that it will itself later execute", not any kind of
> cross-host-CPU flushing.

Can't the TCG thread get migrated between CPUs?

> There was a huge thread on kvmarm earlier this year
> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-August/006716.html
> about a similar sort of issue, and I think the conclusion was that
> the kernel basically had to deal with the problem itself [though
> the thread is rather confusing...]. I've cc'd Marc Z in the hope
> he remembers the ARM specific detail...

Hmm, a good point is raised in that thread regarding what happens if a
page is swapped out and then back in:
https://lists.cs.columbia.edu/pipermail/kvmarm/2013-August/006738.html

I think the usual mechanism PPC booke uses to handle this is currently
not effective with KVM because we get the pages via __get_user_pages_fast()
rather than by enabling execute permission in an ISI (see the
second instance of set_pte_filter() in arch/powerpc/mm/pgtable.c).  Even
if we fix that to invoke the cache cleaning code when KVM acquires a
page, though, QEMU would still need to flush if it modifies/loads code
on a page that may already be marked in the kernel as having been
cleaned.

-Scott
Paolo Bonzini Dec. 14, 2013, 10:58 a.m. UTC | #14
On 13/12/2013 20:18, Scott Wood wrote:
>> Also are you sure flush_icache_range()
>> works correctly when multiple threads (multiple vCPUs,
>> potentially executing on different host CPUs) are involved?
> 
> On PPC these cache operations broadcast, and are the architecturally
> defined way of doing self-modifying code.

I expect that to be the same on any cache-coherent system.

On a VIVT cache with shadow paging, some kernel collaboration may be
necessary because you have to flush using guest addresses rather than
host addresses (or alternatively you have to flush a whole context id).
But we can fix the problem when it happens.

Paolo
Peter Maydell Dec. 14, 2013, 11:08 a.m. UTC | #15
On 14 December 2013 10:58, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 13/12/2013 20:18, Scott Wood wrote:
>>> Also are you sure flush_icache_range()
>>> works correctly when multiple threads (multiple vCPUs,
>>> potentially executing on different host CPUs) are involved?
>>
>> On PPC these cache operations broadcast, and are the architecturally
>> defined way of doing self-modifying code.
>
> I expect that to be the same on any cache-coherent system.

On ARM you have the choice of "clean/invalidate sufficient to
run code on this CPU" vs "sufficient to run code on any CPU"
(the latter obviously is a more expensive operation). As it happens
I've checked through and the syscall/gcc primitive we use is
doing the latter not the former. But I did have to check.
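
For context, the GCC/clang builtin __builtin___clear_cache() is one such primitive: it flushes the instruction cache for a given address range. A tiny illustrative use (the function and parameter names are made up for the example, and which exact mechanism QEMU used on each host is not shown here):

    #include <stddef.h>

    /* Illustration only: after writing code into buf, ask the compiler
     * runtime to make it visible to instruction fetch. */
    static void make_code_executable(void *buf, size_t len)
    {
        __builtin___clear_cache((char *)buf, (char *)buf + len);
    }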

I think SPARC is also OK (the manual defines the 'flush'
insn as working for all cpus). Haven't checked others.

thanks
-- PMM

Patch

diff --git a/exec.c b/exec.c
index f4b9ef2..cc63eb6 100644
--- a/exec.c
+++ b/exec.c
@@ -50,6 +50,7 @@ 
 #include "translate-all.h"
 
 #include "exec/memory-internal.h"
+#include "qemu/cache-utils.h"
 
 //#define DEBUG_SUBPAGE
 
@@ -2033,6 +2034,13 @@  void cpu_physical_memory_write_rom(hwaddr addr,
             ptr = qemu_get_ram_ptr(addr1);
             memcpy(ptr, buf, l);
             invalidate_and_set_dirty(addr1, l);
+            if (kvm_enabled()) {
+                /*
+                 * The guest may want to directly execute from the rom region,
+                 * so we better invalidate its icache
+                 */
+                flush_icache_range((uintptr_t)ptr, (uintptr_t)ptr + l);
+            }
         }
         len -= l;
         buf += l;
diff --git a/hw/core/loader.c b/hw/core/loader.c
index 60d2ebd..4f809f3 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -51,6 +51,7 @@ 
 #include "hw/nvram/fw_cfg.h"
 #include "exec/memory.h"
 #include "exec/address-spaces.h"
+#include "qemu/cache-utils.h"
 
 #include <zlib.h>
 
@@ -619,6 +620,7 @@  static void *rom_set_mr(Rom *rom, Object *owner, const char *name)
 
     data = memory_region_get_ram_ptr(rom->mr);
     memcpy(data, rom->data, rom->datasize);
+    flush_icache_range((uintptr_t)data, (uintptr_t)data + rom->datasize);
 
     return data;
 }
@@ -777,6 +779,14 @@  static void rom_reset(void *unused)
         if (rom->mr) {
             void *host = memory_region_get_ram_ptr(rom->mr);
             memcpy(host, rom->data, rom->datasize);
+            if (kvm_enabled()) {
+                /*
+                 * The guest may want to directly execute from the rom region,
+                 * so we better invalidate its icache
+                 */
+                flush_icache_range((uintptr_t)host,
+                                   (uintptr_t)host + rom->datasize);
+            }
         } else {
             cpu_physical_memory_write_rom(rom->addr, rom->data, rom->datasize);
         }