| Message ID | 20161024225959.7168.20626.stgit@gimli.home |
| --- | --- |
| State | New |
On 25/10/2016 01:00, Alex Williamson wrote:
> With a vfio assigned device we lay down a base MemoryRegion registered
> as an IO region, giving us read & write accessors. If the region
> supports mmap, we lay down a higher priority sub-region MemoryRegion
> on top of the base layer initialized as a RAM device pointer to the
> mmap. Finally, if we have any quirks for the device (i.e. address
> ranges that need additional virtualization support), we put another IO
> sub-region on top of the mmap MemoryRegion. When this is flattened,
> we now potentially have sub-page mmap MemoryRegions exposed which
> cannot be directly mapped through KVM.
>
> This is as expected, but a subtle detail of this is that we end up
> with two different access mechanisms through QEMU. If we disable the
> mmap MemoryRegion, we make use of the IO MemoryRegion and service
> accesses using pread and pwrite to the vfio device file descriptor.
> If the mmap MemoryRegion is enabled and results in one of these
> sub-page gaps, QEMU handles the access as RAM, using memcpy to the
> mmap. Using either pread/pwrite or the mmap directly should be
> correct, but using memcpy causes us problems. I expect that not only
> does memcpy not necessarily honor the original width and alignment in
> performing a copy, but it potentially also uses processor instructions
> not intended for MMIO spaces. It turns out that this has been a
> problem for Realtek NIC assignment, which has such a quirk that
> creates a sub-page mmap MemoryRegion access.
>
> To resolve this, we disable memory_access_is_direct() for ram_device
> regions since QEMU assumes that it can use memcpy for those regions.
> Instead we access through MemoryRegionOps, which replaces the memcpy
> with simple dereferences of standard sizes to the host memory. This
> also allows us to eliminate the ram_device bool from the MemoryRegion
> structure since we can simply test the ops pointer.
>
> With this patch we attempt to provide unrestricted access to the RAM
> device, allowing byte through qword access as well as unaligned
> access. The assumption here is that accesses initiated by the VM are
> driven by a device specific driver, which knows the device
> capabilities. If unaligned accesses are not supported by the device,
> we don't want them to work in a VM by performing multiple aligned
> accesses to compose the unaligned access. A down-side of this
> philosophy is that the xp command from the monitor attempts to use
> the largest available access width, unaware of the underlying
> device. Using memcpy had this same restriction, but at least now an
> operator can dump individual registers, even if blocks of device
> memory may result in access widths beyond the capabilities of a
> given device (RTL NICs only support up to dword).
>
> Reported-by: Thorsten Kohfeldt <thorsten.kohfeldt@gmx.de>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  include/exec/memory.h |    7 +++--
>  memory.c              |   70 ++++++++++++++++++++++++++++++++++++++++++++++++-
>  trace-events          |    2 +
>  3 files changed, 74 insertions(+), 5 deletions(-)
>
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index a75b8c3..2d4a287 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -209,7 +209,6 @@ struct MemoryRegion {
>      void (*destructor)(MemoryRegion *mr);
>      uint64_t align;
>      bool terminates;
> -    bool ram_device;
>      bool enabled;
>      bool warning_printed; /* For reservations */
>      uint8_t vga_logging_count;
> @@ -1480,9 +1479,11 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
>  static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
>  {
>      if (is_write) {
> -        return memory_region_is_ram(mr) && !mr->readonly;
> +        return memory_region_is_ram(mr) &&
> +               !mr->readonly && !memory_region_is_ram_device(mr);
>      } else {
> -        return memory_region_is_ram(mr) || memory_region_is_romd(mr);
> +        return (memory_region_is_ram(mr) && !memory_region_is_ram_device(mr)) ||
> +               memory_region_is_romd(mr);

Oops, I hadn't thought of this. This is a bit slower, especially so if
you have to compare the ops pointer. It's probably best to keep the
mr->ram_device flag, so that later we can place all five relevant flags
(ram, readonly, ram_device, rom_device, romd_mode) into the same word
and play games with bits, like the following:

    // write case
    mr->flags == MR_RAM

    // read cases
    (mr->flags & ~MR_READONLY) == MR_RAM ||
    (mr->flags & ~MR_READONLY) == (MR_ROM_DEVICE | MR_ROMD_MODE)

(the latter in turn can be optimized to a range check if the flags are
arranged cleverly).

The patch is okay with mr->ram_device left in place, feel free to merge
it through your own VFIO tree.

Paolo

>      }
>  }
>
> diff --git a/memory.c b/memory.c
> index 7ffcff1..d07f785 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1128,6 +1128,71 @@ const MemoryRegionOps unassigned_mem_ops = {
>      .endianness = DEVICE_NATIVE_ENDIAN,
>  };
>
> +static uint64_t memory_region_ram_device_read(void *opaque,
> +                                              hwaddr addr, unsigned size)
> +{
> +    MemoryRegion *mr = opaque;
> +    uint64_t data = (uint64_t)~0;
> +
> +    switch (size) {
> +    case 1:
> +        data = *(uint8_t *)(mr->ram_block->host + addr);
> +        break;
> +    case 2:
> +        data = *(uint16_t *)(mr->ram_block->host + addr);
> +        break;
> +    case 4:
> +        data = *(uint32_t *)(mr->ram_block->host + addr);
> +        break;
> +    case 8:
> +        data = *(uint64_t *)(mr->ram_block->host + addr);
> +        break;
> +    }
> +
> +    trace_memory_region_ram_device_read(get_cpu_index(), mr, addr, data, size);
> +
> +    return data;
> +}
> +
> +static void memory_region_ram_device_write(void *opaque, hwaddr addr,
> +                                           uint64_t data, unsigned size)
> +{
> +    MemoryRegion *mr = opaque;
> +
> +    trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, size);
> +
> +    switch (size) {
> +    case 1:
> +        *(uint8_t *)(mr->ram_block->host + addr) = (uint8_t)data;
> +        break;
> +    case 2:
> +        *(uint16_t *)(mr->ram_block->host + addr) = (uint16_t)data;
> +        break;
> +    case 4:
> +        *(uint32_t *)(mr->ram_block->host + addr) = (uint32_t)data;
> +        break;
> +    case 8:
> +        *(uint64_t *)(mr->ram_block->host + addr) = data;
> +        break;
> +    }
> +}
> +
> +static const MemoryRegionOps ram_device_mem_ops = {
> +    .read = memory_region_ram_device_read,
> +    .write = memory_region_ram_device_write,
> +    .endianness = DEVICE_NATIVE_ENDIAN,
> +    .valid = {
> +        .min_access_size = 1,
> +        .max_access_size = 8,
> +        .unaligned = true,
> +    },
> +    .impl = {
> +        .min_access_size = 1,
> +        .max_access_size = 8,
> +        .unaligned = true,
> +    },
> +};
> +
>  bool memory_region_access_valid(MemoryRegion *mr,
>                                  hwaddr addr,
>                                  unsigned size,
> @@ -1362,7 +1427,8 @@ void memory_region_init_ram_device_ptr(MemoryRegion *mr,
>                                         void *ptr)
>  {
>      memory_region_init_ram_ptr(mr, owner, name, size, ptr);
> -    mr->ram_device = true;
> +    mr->ops = &ram_device_mem_ops;
> +    mr->opaque = mr;
>  }
>
>  void memory_region_init_alias(MemoryRegion *mr,
> @@ -1498,7 +1564,7 @@ const char *memory_region_name(const MemoryRegion *mr)
>
>  bool memory_region_is_ram_device(MemoryRegion *mr)
>  {
> -    return mr->ram_device;
> +    return mr->ops == &ram_device_mem_ops;
>  }
>
>  uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
> diff --git a/trace-events b/trace-events
> index 8ecded5..f74e1d3 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -121,6 +121,8 @@ memory_region_subpage_read(int cpu_index, void *mr, uint64_t offset, uint64_t va
>  memory_region_subpage_write(int cpu_index, void *mr, uint64_t offset, uint64_t value, unsigned size) "cpu %d mr %p offset %#"PRIx64" value %#"PRIx64" size %u"
>  memory_region_tb_read(int cpu_index, uint64_t addr, uint64_t value, unsigned size) "cpu %d addr %#"PRIx64" value %#"PRIx64" size %u"
>  memory_region_tb_write(int cpu_index, uint64_t addr, uint64_t value, unsigned size) "cpu %d addr %#"PRIx64" value %#"PRIx64" size %u"
> +memory_region_ram_device_read(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr %#"PRIx64" value %#"PRIx64" size %u"
> +memory_region_ram_device_write(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr %#"PRIx64" value %#"PRIx64" size %u"
>
>  ### Guest events, keep at bottom
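Paolo's flag-packing idea above can be sketched in isolation as follows. The `MR_*` names, bit positions, and the standalone helper are hypothetical placeholders (this is not QEMU code); the point is that once the five booleans live in one word, both fast-path tests collapse to a single integer compare:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding of the five relevant MemoryRegion booleans
 * (ram, readonly, ram_device, rom_device, romd_mode) in one word. */
enum {
    MR_RAM        = 1u << 0,
    MR_READONLY   = 1u << 1,
    MR_RAM_DEVICE = 1u << 2,
    MR_ROM_DEVICE = 1u << 3,
    MR_ROMD_MODE  = 1u << 4,
};

static bool access_is_direct(uint32_t flags, bool is_write)
{
    if (is_write) {
        /* Write case: plain writable RAM only, nothing else set. */
        return flags == MR_RAM;
    }
    /* Read cases: RAM (readonly or not), or a ROM device in romd mode.
     * Masking off MR_READONLY makes each case one equality test. */
    return (flags & ~(uint32_t)MR_READONLY) == MR_RAM ||
           (flags & ~(uint32_t)MR_READONLY) == (MR_ROM_DEVICE | MR_ROMD_MODE);
}
```

As Paolo notes, with a clever bit layout the two read-side equality tests could further become a single range check.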
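The heart of the new `ram_device_mem_ops` accessors is a width dispatch to a single naturally-sized load or store, never a memcpy. A minimal standalone sketch of the read side, under assumed simplifications (a plain host buffer standing in for the mmap'd BAR, stdint types instead of QEMU's `hwaddr`, no tracing):

```c
#include <stdint.h>

/* Sketch of the patch's ram_device read path: exactly one
 * naturally-sized load per guest access, so a 4-byte guest read
 * becomes one 32-bit host load rather than whatever widths memcpy
 * happens to pick. Unsupported sizes return all-ones, matching the
 * accessor's default. */
static uint64_t ram_device_read(const uint8_t *host, uint64_t addr,
                                unsigned size)
{
    uint64_t data = (uint64_t)~0;

    switch (size) {
    case 1:
        data = *(const uint8_t *)(host + addr);
        break;
    case 2:
        data = *(const uint16_t *)(host + addr);
        break;
    case 4:
        data = *(const uint32_t *)(host + addr);
        break;
    case 8:
        data = *(const uint64_t *)(host + addr);
        break;
    }
    return data;
}
```

The write side is the symmetric store; in the patch both callbacks also emit trace events, and the ops declare byte through qword sizes with unaligned access allowed in both `.valid` and `.impl`, so QEMU's core never splits or widens an access on the device's behalf.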