diff mbox series

hw/misc: Add a virtual pci device to dynamically attach memory to QEMU

Message ID 20210926021614.76933-1-david.dai@montage-tech.com
State New
Series hw/misc: Add a virtual pci device to dynamically attach memory to QEMU

Commit Message

david.dai Sept. 26, 2021, 2:16 a.m. UTC
Add a virtual PCI device to QEMU. The PCI device is used to dynamically attach
memory to a VM, so a driver in the guest can request host memory on the fly
without help from virtualization management software such as libvirt. The
attached memory is isolated from system RAM and can be used for heterogeneous
memory management in virtualization. Multiple VMs can dynamically share the same
computing device memory without memory overcommit.
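
As documented in docs/devel/dynamic_mdev.rst in this series, the device is
created with a command line along these lines (size is the BAR size, align the
attach granularity, mem-path the host backend memory device; the sizes below
are only examples):

    -device dyanmic-mdevice,size=0x200000000,align=0x40000000,mem-path=/dev/mdev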

Signed-off-by: David Dai <david.dai@montage-tech.com>
---
 docs/devel/dynamic_mdev.rst | 122 ++++++++++
 hw/misc/Kconfig             |   5 +
 hw/misc/dynamic_mdev.c      | 456 ++++++++++++++++++++++++++++++++++++
 hw/misc/meson.build         |   1 +
 4 files changed, 584 insertions(+)
 create mode 100644 docs/devel/dynamic_mdev.rst
 create mode 100644 hw/misc/dynamic_mdev.c

--
2.27.0

Comments

Stefan Hajnoczi Sept. 27, 2021, 8:27 a.m. UTC | #1
On Sun, Sep 26, 2021 at 10:16:14AM +0800, David Dai wrote:
> Add a virtual pci to QEMU, the pci device is used to dynamically attach memory
> to VM, so driver in guest can apply host memory in fly without virtualization
> management software's help, such as libvirt/manager. The attached memory is
> isolated from System RAM, it can be used in heterogeneous memory management for
> virtualization. Multiple VMs dynamically share same computing device memory
> without memory overcommit.
> 
> Signed-off-by: David Dai <david.dai@montage-tech.com>

CCing David Hildenbrand (virtio-balloon and virtio-mem) and Igor
Mammedov (host memory backend).

> ---
>  docs/devel/dynamic_mdev.rst | 122 ++++++++++
>  hw/misc/Kconfig             |   5 +
>  hw/misc/dynamic_mdev.c      | 456 ++++++++++++++++++++++++++++++++++++
>  hw/misc/meson.build         |   1 +
>  4 files changed, 584 insertions(+)
>  create mode 100644 docs/devel/dynamic_mdev.rst
>  create mode 100644 hw/misc/dynamic_mdev.c
> 
> diff --git a/docs/devel/dynamic_mdev.rst b/docs/devel/dynamic_mdev.rst
> new file mode 100644
> index 0000000000..8e2edb6600
> --- /dev/null
> +++ b/docs/devel/dynamic_mdev.rst
> @@ -0,0 +1,122 @@
> +Motivation:
> +In heterogeneous computing system, accelorator generally exposes its device

s/accelorator/accelerator/

(There are missing articles and small grammar tweaks that could be made,
but I'm skipping the English language stuff for now.)

> +memory to host via PCIe and CXL.mem(Compute Express Link) to share memory
> +between host and device, and these memory generally are uniformly managed by
> +host, they are called HDM (host managed device memory), further SVA (share
> +virtual address) can be achieved on this base. One computing device may be used

Is this Shared Virtual Addressing (SVA) (also known as Shared Virtual
Memory)? If yes, please use the exact name ("Shared Virtual Addressing",
not "share virtual address") so that's clear and the reader can easily
find out more information through a web search.

> +by multiple virtual machines if it supports SRIOV, to efficiently use device
> +memory in virtualization, each VM allocates device memory on-demand without
> +overcommit, but how to dynamically attach host memory resource to VM. A virtual

I cannot parse this sentence. Can you rephrase it and/or split it into
multiple sentences?

> +PCI device, dynamic_mdev, is introduced to achieve this target. dynamic_mdev

I suggest calling it "memdev" instead of "mdev" to prevent confusion
with VFIO mdev.

> +has a big bar space which size can be assigned by user when creating VM, the
> +bar doesn't have backend memory at initialization stage, later driver in guest
> +triggers QEMU to map host memory to the bar space. how much memory, when and
> +where memory will be mapped to are determined by guest driver, after device
> +memory has been attached to the virtual PCI bar, application in guest can
> +access device memory by the virtual PCI bar. Memory allocation and negotiation
> +are left to guest driver and memory backend implementation. dynamic_mdev is a
> +mechanism which provides significant benefits to heterogeneous memory
> +virtualization.

David and Igor: please review this design. I'm not familiar enough with
the various memory hotplug and ballooning mechanisms to give feedback on
this.

> +Implementation:
> +dynamic_mdev device has two bars, bar0 and bar2. bar0 is a 32-bit register bar
> +which used to host control register for control and message communication, Bar2
> +is a 64-bit mmio bar, which is used to attach host memory to, the bar size can
> +be assigned via parameter when creating VM. Host memory is attached to this bar
> +via mmap API.
> +
> +
> +          VM1                           VM2
> + -----------------------        ----------------------
> +|      application      |      |     application      |
> +|                       |      |                      |
> +|-----------------------|      |----------------------|
> +|     guest driver      |      |     guest driver     |
> +|   |--------------|    |      |   | -------------|   |
> +|   | pci mem bar  |    |      |   | pci mem bar  |   |
> + ---|--------------|-----       ---|--------------|---
> +     ----   ---                     --   ------
> +    |    | |   |                   |  | |      |
> +     ----   ---                     --   ------
> +            \                      /
> +             \                    /
> +              \                  /
> +               \                /
> +                |              |
> +                V              V
> + --------------- /dev/mdev.mmap ------------------------
> +|     --   --   --   --   --   --                       |
> +|    |  | |  | |  | |  | |  | |  |  <-----free_mem_list |
> +|     --   --   --   --   --   --                       |
> +|                                                       |
> +|                       HDM(host managed device memory )|
> + -------------------------------------------------------
> +
> +1. Create device:
> +-device dyanmic-mdevice,size=0x200000000,align=0x40000000,mem-path=/dev/mdev
> +
> +size: bar space size
> +aglin: alignment of dynamical attached memory
> +mem-path: host backend memory device
> +
> +
> +2. Registers to control dynamical memory attach
> +All register is placed in bar0
> +
> +        INT_MASK     =     0, /* RW */
> +        INT_STATUS   =     4, /* RW: write 1 clear */
> +        DOOR_BELL    =     8, /*
> +                               * RW: trigger device to act
> +                               *  31        15        0
> +                               *  --------------------
> +                               * |en|xxxxxxxx|  cmd   |
> +                               *  --------------------
> +                               */
> +
> +        /* RO: 4k, 2M, 1G aglign for memory size */
> +        MEM_ALIGN   =      12,
> +
> +        /* RO: offset in memory bar shows bar space has had ram map */
> +        HW_OFFSET    =     16,
> +
> +        /* RW: size of dynamical attached memory */
> +        MEM_SIZE     =     24,
> +
> +        /* RW: offset in host mdev, which dynamical attached memory from  */
> +        MEM_OFFSET   =     32,
> +
> +3. To trigger QEMU to attach a memory, guest driver makes following operation:
> +
> +        /* memory size */
> +        writeq(size, reg_base + 0x18);
> +
> +        /* backend file offset */
> +        writeq(offset, reg_base + 0x20);
> +
> +        /* trigger device to map memory from host */
> +        writel(0x80000001, reg_base + 0x8);
> +
> +        /* wait for reply from backend */
> +        wait_for_completion(&attach_cmp);
> +
> +4. QEMU implementation
> +dynamic_mdev utilizes QEMU's memory model to dynamically add memory region to
> +memory container, the model is described at qemu/docs/devel/memory.rst
> +The below steps will describe the whole flow:
> +   1> create a virtual PCI device
> +   2> pci register bar with memory region container, which only define bar size
> +   3> guest driver requests memory via register interaction, and it tells QEMU
> +      about memory size, backend memory offset, and so on
> +   4> QEMU receives request from guest driver, then apply host memory from
> +      backend file via mmap(), QEMU use the allocated RAM to create a memory
> +      region through memory_region_init_ram(), and attach this memory region to
> +      bar container through calling memory_region_add_subregion_overlap(). After
> +      that KVM build gap->hpa mapping
> +   5> QEMU sends MSI to guest driver that dynamical memory attach completed
> +You could refer to source code for more detail.
> +
> +
> +Backend memory device
> +Backend device can be a stardard share memory file with standard mmap() support
> +It also may be a specific char device with mmap() implementation.
> +In a word, how to implement this device is user responsibility.
> diff --git a/hw/misc/Kconfig b/hw/misc/Kconfig
> index 507058d8bf..f03263cc1e 100644
> --- a/hw/misc/Kconfig
> +++ b/hw/misc/Kconfig
> @@ -67,6 +67,11 @@ config IVSHMEM_DEVICE
>      default y if PCI_DEVICES
>      depends on PCI && LINUX && IVSHMEM && MSI_NONBROKEN
> 
> +config DYNAMIC_MDEV
> +    bool
> +    default y if PCI_DEVICES
> +    depends on PCI && LINUX && MSI_NONBROKEN
> +
>  config ECCMEMCTL
>      bool
>      select ECC
> diff --git a/hw/misc/dynamic_mdev.c b/hw/misc/dynamic_mdev.c
> new file mode 100644
> index 0000000000..8a56a6157b
> --- /dev/null
> +++ b/hw/misc/dynamic_mdev.c
> @@ -0,0 +1,456 @@
> +/*
> + * Dynamical memory attached PCI device
> + *
> + * Copyright Montage, Corp. 2014
> + *
> + * Authors:
> + *  David Dai <david.dai@montage-tech.com>
> + *  Changguo Du <changguo.du@montage-tech.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/units.h"
> +#include "hw/pci/pci.h"
> +#include "hw/hw.h"
> +#include "hw/qdev-properties.h"
> +#include "hw/qdev-properties-system.h"
> +#include "hw/pci/msi.h"
> +#include "qemu/module.h"
> +#include "qom/object_interfaces.h"
> +#include "qapi/visitor.h"
> +#include "qom/object.h"
> +#include "qemu/error-report.h"
> +
> +#define PCI_VENDOR_ID_DMDEV   0x1b00
> +#define PCI_DEVICE_ID_DMDEV   0x1110
> +#define DYNAMIC_MDEV_BAR_SIZE 0x1000
> +
> +#define INTERRUPT_MEMORY_ATTACH_SUCCESS           (1 << 0)
> +#define INTERRUPT_MEMORY_DEATTACH_SUCCESS         (1 << 1)
> +#define INTERRUPT_MEMORY_ATTACH_NOMEM             (1 << 2)
> +#define INTERRUPT_MEMORY_ATTACH_ALIGN_ERR         (1 << 3)
> +#define INTERRUPT_ACCESS_NOT_MAPPED_ADDR          (1 << 4)
> +
> +#define DYNAMIC_CMD_ENABLE               (0x80000000)
> +#define DYNAMIC_CMD_MASK                 (0xffff)
> +#define DYNAMIC_CMD_MEM_ATTACH           (0x1)
> +#define DYNAMIC_CMD_MEM_DEATTACH         (0x2)
> +
> +#define DYNAMIC_MDEV_DEBUG               1
> +
> +#define DYNAMIC_MDEV_DPRINTF(fmt, ...)                          \
> +    do {                                                        \
> +        if (DYNAMIC_MDEV_DEBUG) {                               \
> +            printf("QEMU: " fmt, ## __VA_ARGS__);               \
> +        }                                                       \
> +    } while (0)
> +
> +#define TYPE_DYNAMIC_MDEV "dyanmic-mdevice"
> +
> +typedef struct DmdevState DmdevState;
> +DECLARE_INSTANCE_CHECKER(DmdevState, DYNAMIC_MDEV,
> +                         TYPE_DYNAMIC_MDEV)
> +
> +struct DmdevState {
> +    /*< private >*/
> +    PCIDevice parent_obj;
> +    /*< public >*/
> +
> +    /* registers */
> +    uint32_t mask;
> +    uint32_t status;
> +    uint32_t align;
> +    uint64_t size;
> +    uint64_t hw_offset;
> +    uint64_t mem_offset;
> +
> +    /* mdev name */
> +    char *devname;
> +    int fd;
> +
> +    /* memory bar size */
> +    uint64_t bsize;
> +
> +    /* BAR 0 (registers) */
> +    MemoryRegion dmdev_mmio;
> +
> +    /* BAR 2 (memory bar for daynamical memory attach) */
> +    MemoryRegion dmdev_mem;
> +};
> +
> +/* registers for the dynamical memory device */
> +enum dmdev_registers {
> +    INT_MASK     =     0, /* RW */
> +    INT_STATUS   =     4, /* RW: write 1 clear */
> +    DOOR_BELL    =     8, /*
> +                           * RW: trigger device to act
> +                           *  31        15        0
> +                           *  --------------------
> +                           * |en|xxxxxxxx|  cmd   |
> +                           *  --------------------
> +                           */
> +
> +    /* RO: 4k, 2M, 1G aglign for memory size */
> +    MEM_ALIGN   =     12,
> +
> +    /* RO: offset in memory bar shows bar space has had ram map */
> +    HW_OFFSET    =    16,
> +
> +    /* RW: size of dynamical attached memory */
> +    MEM_SIZE     =    24,
> +
> +    /* RW: offset in host mdev, where dynamical attached memory from  */
> +    MEM_OFFSET   =    32,
> +
> +};
> +
> +static void dmdev_mem_attach(DmdevState *s)
> +{
> +    void *ptr;
> +    struct MemoryRegion *mr;
> +    uint64_t size = s->size;
> +    uint64_t align = s->align;
> +    uint64_t hwaddr = s->hw_offset;
> +    uint64_t offset = s->mem_offset;
> +    PCIDevice *pdev = PCI_DEVICE(s);
> +
> +    DYNAMIC_MDEV_DPRINTF("%s:size =0x%lx,align=0x%lx,hwaddr=0x%lx,\
> +        offset=0x%lx\n", __func__, size, align, hwaddr, offset);
> +
> +    if (size % align || hwaddr % align) {
> +        error_report("%s size doesn't align, size =0x%lx, \
> +                align=0x%lx, hwaddr=0x%lx\n", __func__, size, align, hwaddr);
> +        s->status |= INTERRUPT_MEMORY_ATTACH_ALIGN_ERR;
> +        msi_notify(pdev, 0);
> +        return;
> +    }
> +
> +    ptr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, s->fd, offset);
> +    if (ptr == MAP_FAILED) {
> +        error_report("Can't map memory err(%d)", errno);
> +        s->status |= INTERRUPT_MEMORY_ATTACH_ALIGN_ERR;
> +        msi_notify(pdev, 0);
> +        return;
> +    }
> +
> +    mr = g_new0(MemoryRegion, 1);
> +    memory_region_init_ram_ptr(mr, OBJECT(PCI_DEVICE(s)),
> +                            "dynamic_mdev", size, ptr);
> +    memory_region_add_subregion_overlap(&s->dmdev_mem, hwaddr, mr, 1);
> +
> +    s->hw_offset += size;
> +
> +    s->status |= INTERRUPT_MEMORY_ATTACH_SUCCESS;
> +    msi_notify(pdev, 0);
> +
> +    DYNAMIC_MDEV_DPRINTF("%s msi_notify success ptr=%p\n", __func__, ptr);
> +    return;
> +}
> +
> +static void dmdev_mem_deattach(DmdevState *s)
> +{
> +    struct MemoryRegion *mr = &s->dmdev_mem;
> +    struct MemoryRegion *subregion;
> +    void *host;
> +    PCIDevice *pdev = PCI_DEVICE(s);
> +
> +    memory_region_transaction_begin();
> +    while (!QTAILQ_EMPTY(&mr->subregions)) {
> +        subregion = QTAILQ_FIRST(&mr->subregions);
> +        memory_region_del_subregion(mr, subregion);
> +        host = memory_region_get_ram_ptr(subregion);
> +        munmap(host, memory_region_size(subregion));
> +        DYNAMIC_MDEV_DPRINTF("%s:host=%p,size=0x%lx\n",
> +                    __func__, host,  memory_region_size(subregion));
> +    }
> +
> +    memory_region_transaction_commit();
> +
> +    s->hw_offset = 0;
> +
> +    s->status |= INTERRUPT_MEMORY_DEATTACH_SUCCESS;
> +    msi_notify(pdev, 0);
> +
> +    return;
> +}
> +
> +static void dmdev_doorbell_handle(DmdevState *s,  uint64_t val)
> +{
> +    if (!(val & DYNAMIC_CMD_ENABLE)) {
> +        return;
> +    }
> +
> +    switch (val & DYNAMIC_CMD_MASK) {
> +
> +    case DYNAMIC_CMD_MEM_ATTACH:
> +        dmdev_mem_attach(s);
> +        break;
> +
> +    case DYNAMIC_CMD_MEM_DEATTACH:
> +        dmdev_mem_deattach(s);
> +        break;
> +
> +    default:
> +        break;
> +    }
> +
> +    return;
> +}
> +
> +static void dmdev_mmio_write(void *opaque, hwaddr addr,
> +                        uint64_t val, unsigned size)
> +{
> +    DmdevState *s = opaque;
> +
> +    DYNAMIC_MDEV_DPRINTF("%s write addr=0x%lx, val=0x%lx, size=0x%x\n",
> +                __func__, addr, val, size);
> +
> +    switch (addr) {
> +    case INT_MASK:
> +        s->mask = val;
> +        return;
> +
> +    case INT_STATUS:
> +        return;
> +
> +    case DOOR_BELL:
> +        dmdev_doorbell_handle(s, val);
> +        return;
> +
> +    case MEM_ALIGN:
> +        return;
> +
> +    case HW_OFFSET:
> +        /* read only */
> +        return;
> +
> +    case HW_OFFSET + 4:
> +        /* read only */
> +        return;
> +
> +    case MEM_SIZE:
> +        if (size == 4) {
> +            s->size &= ~(0xffffffff);
> +            val &= 0xffffffff;
> +            s->size |= val;
> +        } else { /* 64-bit */
> +            s->size = val;
> +        }
> +        return;
> +
> +    case MEM_SIZE + 4:
> +        s->size &= 0xffffffff;
> +
> +        s->size |= val << 32;
> +        return;
> +
> +    case MEM_OFFSET:
> +        if (size == 4) {
> +            s->mem_offset &= ~(0xffffffff);
> +            val &= 0xffffffff;
> +            s->mem_offset |= val;
> +        } else { /* 64-bit */
> +            s->mem_offset = val;
> +        }
> +        return;
> +
> +    case MEM_OFFSET + 4:
> +        s->mem_offset &= 0xffffffff;
> +
> +        s->mem_offset |= val << 32;
> +        return;
> +
> +    default:
> +        DYNAMIC_MDEV_DPRINTF("default 0x%lx\n", val);
> +    }
> +
> +    return;
> +}
> +
> +static uint64_t dmdev_mmio_read(void *opaque, hwaddr addr,
> +                        unsigned size)
> +{
> +    DmdevState *s = opaque;
> +    unsigned int value;
> +
> +    DYNAMIC_MDEV_DPRINTF("%s read addr=0x%lx, size=0x%x\n",
> +                         __func__, addr, size);
> +    switch (addr) {
> +    case INT_MASK:
> +        /* mask: read-write */
> +        return s->mask;
> +
> +    case INT_STATUS:
> +        /* status: read-clear */
> +        value = s->status;
> +        s->status = 0;
> +        return value;
> +
> +    case DOOR_BELL:
> +        /* doorbell: write-only */
> +        return 0;
> +
> +    case MEM_ALIGN:
> +        /* align: read-only */
> +        return s->align;
> +
> +    case HW_OFFSET:
> +        if (size == 4) {
> +            return s->hw_offset & 0xffffffff;
> +        } else { /* 64-bit */
> +            return s->hw_offset;
> +        }
> +
> +    case HW_OFFSET + 4:
> +        return s->hw_offset >> 32;
> +
> +    case MEM_SIZE:
> +        if (size == 4) {
> +            return s->size & 0xffffffff;
> +        } else { /* 64-bit */
> +            return s->size;
> +        }
> +
> +    case MEM_SIZE + 4:
> +        return s->size >> 32;
> +
> +    case MEM_OFFSET:
> +        if (size == 4) {
> +            return s->mem_offset & 0xffffffff;
> +        } else { /* 64-bit */
> +            return s->mem_offset;
> +        }
> +
> +    case MEM_OFFSET + 4:
> +        return s->mem_offset >> 32;
> +
> +    default:
> +        DYNAMIC_MDEV_DPRINTF("default read err address 0x%lx\n", addr);
> +
> +    }
> +
> +    return 0;
> +}
> +
> +static const MemoryRegionOps dmdev_mmio_ops = {
> +    .read = dmdev_mmio_read,
> +    .write = dmdev_mmio_write,
> +    .endianness = DEVICE_NATIVE_ENDIAN,
> +    .impl = {
> +        .min_access_size = 4,
> +        .max_access_size = 8,
> +    },
> +};
> +
> +static void dmdev_reset(DeviceState *d)
> +{
> +    DmdevState *s = DYNAMIC_MDEV(d);
> +
> +    s->status = 0;
> +    s->mask = 0;
> +    s->hw_offset = 0;
> +    dmdev_mem_deattach(s);
> +}
> +
> +static void dmdev_realize(PCIDevice *dev, Error **errp)
> +{
> +    DmdevState *s = DYNAMIC_MDEV(dev);
> +    int status;
> +
> +    Error *err = NULL;
> +    uint8_t *pci_conf;
> +
> +    pci_conf = dev->config;
> +    pci_conf[PCI_COMMAND] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
> +
> +    /* init msi */
> +    status = msi_init(dev, 0, 1, true, false, &err);
> +    if (status) {
> +        error_report("msi_init %d failed", status);
> +        return;
> +    }
> +
> +    memory_region_init_io(&s->dmdev_mmio, OBJECT(s), &dmdev_mmio_ops, s,
> +                          "dmdev-mmio", DYNAMIC_MDEV_BAR_SIZE);
> +
> +    /* region for registers*/
> +    pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,
> +                     &s->dmdev_mmio);
> +
> +    /* initialize a memory region container */
> +    memory_region_init(&s->dmdev_mem, OBJECT(s),
> +                       "dmdev-mem", s->bsize);
> +
> +    pci_register_bar(PCI_DEVICE(s), 2,
> +                    PCI_BASE_ADDRESS_SPACE_MEMORY |
> +                    PCI_BASE_ADDRESS_MEM_PREFETCH |
> +                    PCI_BASE_ADDRESS_MEM_TYPE_64,
> +                    &s->dmdev_mem);
> +
> +    if (s->devname) {
> +        s->fd = open(s->devname, O_RDWR, 0x0777);
> +    } else {
> +        s->fd = -1;
> +    }
> +
> +    s->hw_offset = 0;
> +
> +    DYNAMIC_MDEV_DPRINTF("open file %s %s\n",
> +            s->devname, s->fd < 0 ? "failed" : "success");
> +}
> +
> +static void dmdev_exit(PCIDevice *dev)
> +{
> +    DmdevState *s = DYNAMIC_MDEV(dev);
> +
> +    msi_uninit(dev);
> +    dmdev_mem_deattach(s);
> +    DYNAMIC_MDEV_DPRINTF("%s\n", __func__);
> +
> +}
> +
> +static Property dmdev_properties[] = {
> +    DEFINE_PROP_UINT64("size", DmdevState, bsize, 0x40000000),
> +    DEFINE_PROP_UINT32("align", DmdevState, align, 0x40000000),
> +    DEFINE_PROP_STRING("mem-path", DmdevState, devname),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void dmdev_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
> +
> +    k->realize = dmdev_realize;
> +    k->exit = dmdev_exit;
> +    k->vendor_id = PCI_VENDOR_ID_DMDEV;
> +    k->device_id = PCI_DEVICE_ID_DMDEV;
> +    k->class_id = PCI_CLASS_MEMORY_RAM;
> +    k->revision = 1;
> +    dc->reset = dmdev_reset;
> +    device_class_set_props(dc, dmdev_properties);
> +    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> +    dc->desc = "pci device to dynamically attach memory";
> +}
> +
> +static const TypeInfo dmdev_info = {
> +    .name          = TYPE_DYNAMIC_MDEV,
> +    .parent        = TYPE_PCI_DEVICE,
> +    .instance_size = sizeof(DmdevState),
> +    .class_init    = dmdev_class_init,
> +    .interfaces    = (InterfaceInfo[]) {
> +        { INTERFACE_PCIE_DEVICE },
> +        { },
> +    },
> +};
> +
> +static void dmdev_register_types(void)
> +{
> +    type_register_static(&dmdev_info);
> +}
> +
> +type_init(dmdev_register_types)
> diff --git a/hw/misc/meson.build b/hw/misc/meson.build
> index a53b849a5a..38f6701a4b 100644
> --- a/hw/misc/meson.build
> +++ b/hw/misc/meson.build
> @@ -124,3 +124,4 @@ specific_ss.add(when: 'CONFIG_MIPS_CPS', if_true: files('mips_cmgcr.c', 'mips_cp
>  specific_ss.add(when: 'CONFIG_MIPS_ITU', if_true: files('mips_itu.c'))
> 
>  specific_ss.add(when: 'CONFIG_SBSA_REF', if_true: files('sbsa_ec.c'))
> +specific_ss.add(when: 'CONFIG_DYNAMIC_MDEV', if_true: files('dynamic_mdev.c'))
> --
> 2.27.0
> 
>
David Hildenbrand Sept. 27, 2021, 9:07 a.m. UTC | #2
On 27.09.21 10:27, Stefan Hajnoczi wrote:
> On Sun, Sep 26, 2021 at 10:16:14AM +0800, David Dai wrote:
>> Add a virtual pci to QEMU, the pci device is used to dynamically attach memory
>> to VM, so driver in guest can apply host memory in fly without virtualization
>> management software's help, such as libvirt/manager. The attached memory is

We do have virtio-mem to dynamically attach memory to a VM. It could be
extended by a mechanism for the VM to request more/less memory; that's
already a planned feature. But yeah, virtio-mem memory is exposed as
ordinary system RAM, not via a BAR to be managed mostly by user space.

>> isolated from System RAM, it can be used in heterogeneous memory management for
>> virtualization. Multiple VMs dynamically share same computing device memory
>> without memory overcommit.

This sounds a lot like MemExpand/MemLego ... am I right that this is the 
original design? I recall that VMs share a memory region and dynamically 
agree upon which part of the memory region a VM uses. I further recall 
that there were malloc() hooks that would dynamically allocate such 
memory in user space from the shared memory region.

I can see some use cases for it, although the shared memory design isn't 
what you typically want in most VM environments.
david.dai Sept. 27, 2021, 12:17 p.m. UTC | #3
On Mon, Sep 27, 2021 at 10:27:06AM +0200, Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Sun, Sep 26, 2021 at 10:16:14AM +0800, David Dai wrote:
> > Add a virtual pci to QEMU, the pci device is used to dynamically attach memory
> > to VM, so driver in guest can apply host memory in fly without virtualization
> > management software's help, such as libvirt/manager. The attached memory is
> > isolated from System RAM, it can be used in heterogeneous memory management for
> > virtualization. Multiple VMs dynamically share same computing device memory
> > without memory overcommit.
> > 
> > Signed-off-by: David Dai <david.dai@montage-tech.com>
> 
> CCing David Hildenbrand (virtio-balloon and virtio-mem) and Igor
> Mammedov (host memory backend).
> 
> > ---
> >  docs/devel/dynamic_mdev.rst | 122 ++++++++++
> >  hw/misc/Kconfig             |   5 +
> >  hw/misc/dynamic_mdev.c      | 456 ++++++++++++++++++++++++++++++++++++
> >  hw/misc/meson.build         |   1 +
> >  4 files changed, 584 insertions(+)
> >  create mode 100644 docs/devel/dynamic_mdev.rst
> >  create mode 100644 hw/misc/dynamic_mdev.c
> > 
> > diff --git a/docs/devel/dynamic_mdev.rst b/docs/devel/dynamic_mdev.rst
> > new file mode 100644
> > index 0000000000..8e2edb6600
> > --- /dev/null
> > +++ b/docs/devel/dynamic_mdev.rst
> > @@ -0,0 +1,122 @@
> > +Motivation:
> > +In heterogeneous computing system, accelorator generally exposes its device
> 
> s/accelorator/accelerator/
> 
> (There are missing articles and small grammar tweaks that could be made,
> but I'm skipping the English language stuff for now.)
> 

Thank you for your review.

> > +memory to host via PCIe and CXL.mem(Compute Express Link) to share memory
> > +between host and device, and these memory generally are uniformly managed by
> > +host, they are called HDM (host managed device memory), further SVA (share
> > +virtual address) can be achieved on this base. One computing device may be used
> 
> Is this Shared Virtual Addressing (SVA) (also known as Shared Virtual
> Memory)? If yes, please use the exact name ("Shared Virtual Addressing",
> not "share virtual address") so that's clear and the reader can easily
> find out more information through a web search.
>
 
Yes, you are right.

> > +by multiple virtual machines if it supports SRIOV, to efficiently use device
> > +memory in virtualization, each VM allocates device memory on-demand without
> > +overcommit, but how to dynamically attach host memory resource to VM. A virtual
> 
> I cannot parse this sentence. Can you rephrase it and/or split it into
> multiple sentences?
> 
> > +PCI device, dynamic_mdev, is introduced to achieve this target. dynamic_mdev
> 
> I suggest calling it "memdev" instead of "mdev" to prevent confusion
> with VFIO mdev.
>

I agree with your suggestion.
I will make changes according to your comments in the new patch.

> [... remainder of quoted message snipped; no further inline replies ...]
david.dai Sept. 27, 2021, 12:28 p.m. UTC | #4
On Mon, Sep 27, 2021 at 11:07:43AM +0200, David Hildenbrand (david@redhat.com) wrote:
> 
> On 27.09.21 10:27, Stefan Hajnoczi wrote:
> > On Sun, Sep 26, 2021 at 10:16:14AM +0800, David Dai wrote:
> > > Add a virtual pci to QEMU, the pci device is used to dynamically attach memory
> > > to VM, so driver in guest can apply host memory in fly without virtualization
> > > management software's help, such as libvirt/manager. The attached memory is
> 
> We do have virtio-mem to dynamically attach memory to a VM. It could be
> extended by a mechanism for the VM to request more/less memory, that's
> already a planned feature. But yeah, virito-mem memory is exposed as
> ordinary system RAM, not only via a BAR to mostly be managed by user space
> completely.
>

I wish virtio-mem could solve our problem, but it is a dynamic allocation mechanism
for system RAM in virtualization. In heterogeneous computing environments, the
attached memory usually comes from a computing device and should be managed
separately; we don't want the Linux MM to control it.
 
> > > isolated from System RAM, it can be used in heterogeneous memory management for
> > > virtualization. Multiple VMs dynamically share same computing device memory
> > > without memory overcommit.
> 
> This sounds a lot like MemExpand/MemLego ... am I right that this is the
> original design? I recall that VMs share a memory region and dynamically
> agree upon which part of the memory region a VM uses. I further recall that
> there were malloc() hooks that would dynamically allocate such memory in
> user space from the shared memory region.
>

Thank you for telling me about MemExpand/MemLego, I have carefully read the paper.
Some of its ideas are the same as in this patch, such as the software model and
stack, but it may have a security risk in that the whole shared memory is visible
to all VMs.
-----------------------
     application
-----------------------
memory management driver
-----------------------
     pci driver
-----------------------
   virtual pci device
-----------------------

> I can see some use cases for it, although the shared memory design isn't
> what you typically want in most VM environments.
>

The original design for this patch is to share a computing device among multiple
VMs. Each VM runs a computing application (for example, an OpenCL application),
and our computing device can support a few applications in parallel. In addition,
it supports SVM (shared virtual memory) via IOMMU/ATS/PASID/PRI. The device exposes
its memory to the host via a PCIe BAR or CXL.mem, the host constructs a memory pool
to uniformly manage the device memory, and then attaches device memory to a VM via
a virtual PCI device. But we don't know how much memory should be assigned when
creating the VM, so we want memory to be attached to the VM on demand. The driver
in the guest triggers the memory attach, not external virtualization management
software. So the original requirements are:
1> The managed memory comes from the device and should be isolated from system RAM
2> The memory can be dynamically attached to the VM on the fly
3> The attached memory supports SVM and DMA operations with an IOMMU

Thank you very much. 


Best Regards,
David Dai

> -- 
> Thanks,
> 
> David / dhildenb
> 
>
David Hildenbrand Sept. 29, 2021, 9:30 a.m. UTC | #5
On 27.09.21 14:28, david.dai wrote:
> On Mon, Sep 27, 2021 at 11:07:43AM +0200, David Hildenbrand (david@redhat.com) wrote:
>>
>> On 27.09.21 10:27, Stefan Hajnoczi wrote:
>>> On Sun, Sep 26, 2021 at 10:16:14AM +0800, David Dai wrote:
>>>> Add a virtual pci to QEMU, the pci device is used to dynamically attach memory
>>>> to VM, so driver in guest can apply host memory in fly without virtualization
>>>> management software's help, such as libvirt/manager. The attached memory is
>>
>> We do have virtio-mem to dynamically attach memory to a VM. It could be
>> extended by a mechanism for the VM to request more/less memory, that's
>> already a planned feature. But yeah, virito-mem memory is exposed as
>> ordinary system RAM, not only via a BAR to mostly be managed by user space
>> completely.

There is a virtio-pmem spec proposal to expose the memory region via a
PCI BAR. We could do something similar for virtio-mem; however, we would
have to wire that new model up differently in QEMU (it would no longer be
a "memory device" like a DIMM then).

>>
> 
> I wish virtio-mem can solve our problem, but it is a dynamic allocation mechanism
> for system RAM in virtualization. In heterogeneous computing environments, the
> attached memory usually comes from computing device, it should be managed separately.
> we doesn't hope Linux MM controls it.

If that heterogeneous memory has a dedicated node (which usually
is the case IIRC), and you let it be managed by the Linux kernel
(dax/kmem), you can bind the memory backend of virtio-mem to that
special NUMA node. All memory managed by that virtio-mem device would
then come from that heterogeneous memory.

You could then further use a separate NUMA node for that virtio-mem
device inside the VM. To the VM it would look like system memory
with different performance characteristics. That would work for some
use cases I guess, but I'm not sure for which it wouldn't (I assume you can tell :) ).
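
For example, a rough sketch (node ids and sizes are made up, only the
option names are real QEMU ones):

    qemu-system-x86_64 -m 4G,maxmem=36G \
        -smp sockets=2,cores=2 \
        -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
        -object memory-backend-ram,id=hdm0,size=32G,host-nodes=2,policy=bind \
        -device virtio-mem-pci,id=vmem0,memdev=hdm0,node=1,requested-size=0 \
        ...

Here host-nodes=2 would be the host NUMA node backed by the heterogeneous
memory (via dax/kmem), and node=1 the guest NUMA node dedicated to it.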

We could even write an alternative virtio-mem mode, where the memory managed
by the device isn't exposed to the buddy but made available to user space in
some different way.

>   
>>>> isolated from System RAM, it can be used in heterogeneous memory management for
>>>> virtualization. Multiple VMs dynamically share same computing device memory
>>>> without memory overcommit.
>>
>> This sounds a lot like MemExpand/MemLego ... am I right that this is the
>> original design? I recall that VMs share a memory region and dynamically
>> agree upon which part of the memory region a VM uses. I further recall that
>> there were malloc() hooks that would dynamically allocate such memory in
>> user space from the shared memory region.
>>
> 
> Thank you for telling me about Memexpand/MemLego, I have carefully read the paper.
> some ideas from it are same as this patch, such as software model and stack, but
> it may have a security risk that whole shared memory is visible to all VMs.

How will you make sure that not all shared memory can be accessed by the 
other VMs? IOW, emulate !shared memory on shared memory?

> -----------------------
>       application
> -----------------------
> memory management driver
> -----------------------
>       pci driver
> -----------------------
>     virtual pci device
> -----------------------
> 
>> I can see some use cases for it, although the shared memory design isn't
>> what you typically want in most VM environments.
>>
> 
> The original design for this patch is to share a computing device among multipile
> VMs. Each VM runs a computing application(for example, OpenCL application)
> Our computing device can support a few applications in parallel. In addition, it
> supports SVM(shared virtual memory) via IOMMU/ATS/PASID/PRI. Device exposes its
> memory to host vis PCIe bar or CXL.mem, host constructs memory pool to uniformly
> manage device memory, then attach device memory to VM via a virtual PCI device.

How exactly is that memory pool created/managed? Simply dax/kmem,
handling it via the buddy in a special NUMA node?

> but we don't know how much memory should be assigned when creating VM, so we hope
> memory is attached to VM on-demand. driver in guest triggers memory attaching, but
> not outside virtualization management software. so the original requirements are:
> 1> The managed memory comes from device, it should be isolated from system RAM
> 2> The memory can be dynamically attached to VM in fly
> 3> The attached memory supports SVM and DMA operation with IOMMU
> 
> Thank you very much.

Thanks for the info. If virtio-mem is not applicable and cannot be 
modified for this use case, would it make sense to create a new virtio 
device type?
david.dai Sept. 30, 2021, 9:40 a.m. UTC | #6
On Wed, Sep 29, 2021 at 11:30:53AM +0200, David Hildenbrand (david@redhat.com) wrote: 
> 
> On 27.09.21 14:28, david.dai wrote:
> > On Mon, Sep 27, 2021 at 11:07:43AM +0200, David Hildenbrand (david@redhat.com) wrote:
> > > 
> > > On 27.09.21 10:27, Stefan Hajnoczi wrote:
> > > > On Sun, Sep 26, 2021 at 10:16:14AM +0800, David Dai wrote:
> > > > > Add a virtual pci to QEMU, the pci device is used to dynamically attach memory
> > > > > to VM, so driver in guest can apply host memory in fly without virtualization
> > > > > management software's help, such as libvirt/manager. The attached memory is
> > > 
> > > We do have virtio-mem to dynamically attach memory to a VM. It could be
> > > extended by a mechanism for the VM to request more/less memory, that's
> > > already a planned feature. But yeah, virito-mem memory is exposed as
> > > ordinary system RAM, not only via a BAR to mostly be managed by user space
> > > completely.
> 
> There is a virtio-pmem spec proposal to expose the memory region via a PCI
> BAR. We could do something similar for virtio-mem, however, we would have to
> wire that new model up differently in QEMU (it's no longer a "memory device"
> like a DIMM then).
> 
> > > 
> > 
> > I wish virtio-mem can solve our problem, but it is a dynamic allocation mechanism
> > for system RAM in virtualization. In heterogeneous computing environments, the
> > attached memory usually comes from computing device, it should be managed separately.
> > we doesn't hope Linux MM controls it.
> 
> If that heterogeneous memory would have a dedicated node (which usually is
> the case IIRC) , and you let it manage by the Linux kernel (dax/kmem), you
> can bind the memory backend of virtio-mem to that special NUMA node. So all
> memory managed by that virtio-mem device would come from that heterogeneous
> memory.
> 

Yes, CXL type 2 and type 3 devices expose memory to the host as a dedicated node.
The node is marked as soft-reserved memory, and dax/kmem can take over the node to
create a dax device. This dax device can be regarded as the memory backend of virtio-mem.

I'm not sure whether a dax device can be opened by multiple VMs or host applications.

> You could then further use a separate NUMA node for that virtio-mem device
> inside the VM. But to the VM it would look like System memory with different
> performance characteristics. That would work fore some use cases I guess,
> but not sure for which not (I assume you can tell :) ).
> 

If the NUMA node in the guest can be dynamically expanded by virtio-mem, that may
work well for us, because we will develop our own memory management driver to manage
the device memory.
   
> We could even write an alternative virtio-mem mode, where device manage
> isn't exposed to the buddy but using some different way to user space.
> 
> > > > > isolated from System RAM, it can be used in heterogeneous memory management for
> > > > > virtualization. Multiple VMs dynamically share same computing device memory
> > > > > without memory overcommit.
> > > 
> > > This sounds a lot like MemExpand/MemLego ... am I right that this is the
> > > original design? I recall that VMs share a memory region and dynamically
> > > agree upon which part of the memory region a VM uses. I further recall that
> > > there were malloc() hooks that would dynamically allocate such memory in
> > > user space from the shared memory region.
> > > 
> > 
> > Thank you for telling me about Memexpand/MemLego, I have carefully read the paper.
> > some ideas from it are same as this patch, such as software model and stack, but
> > it may have a security risk that whole shared memory is visible to all VMs.
> 
> How will you make sure that not all shared memory can be accessed by the
> other VMs? IOW, emulate !shared memory on shared memory?
> 
> > -----------------------
> >       application
> > -----------------------
> > memory management driver
> > -----------------------
> >       pci driver
> > -----------------------
> >     virtual pci device
> > -----------------------
> > 
> > > I can see some use cases for it, although the shared memory design isn't
> > > what you typically want in most VM environments.
> > > 
> > 
> > The original design for this patch is to share a computing device among multipile
> > VMs. Each VM runs a computing application(for example, OpenCL application)
> > Our computing device can support a few applications in parallel. In addition, it
> > supports SVM(shared virtual memory) via IOMMU/ATS/PASID/PRI. Device exposes its
> > memory to host vis PCIe bar or CXL.mem, host constructs memory pool to uniformly
> > manage device memory, then attach device memory to VM via a virtual PCI device.
> 
> How exactly is that memory pool created/managed? Simply dax/kmem and
> handling it via the buddy in a special NUMA node.
>

We develop an MM driver in the host and the guest to manage the reserved memory (the
NUMA node you mentioned). The MM driver is similar to the buddy system: it also uses
its own page structures to manage physical memory, and it offers mmap() to host
applications or the VM. The device driver adds memory regions to the MM driver.

We don't use dax/kmem because we need to control the key software modules to reduce
risk; we may add new features to the driver over time.

> > but we don't know how much memory should be assigned when creating VM, so we hope
> > memory is attached to VM on-demand. driver in guest triggers memory attaching, but
> > not outside virtualization management software. so the original requirements are:
> > 1> The managed memory comes from device, it should be isolated from system RAM
> > 2> The memory can be dynamically attached to VM in fly
> > 3> The attached memory supports SVM and DMA operation with IOMMU
> > 
> > Thank you very much.
> 
> Thanks for the info. If virtio-mem is not applicable and cannot be modified
> for this use case, would it make sense to create a new virtio device type?
> 

We already have the MM driver in the host and the guest; now we need a way to
dynamically attach memory to the guest and join both ends together. This patch is a
self-contained device that doesn't impact QEMU's stability.

A new virtio device type is a good idea for me. I need some time to understand the
virtio spec and virtio-mem, then I may send another proposal, such as:
[RFC] hw/virtio: Add virtio-memdev to dynamically attach memory to QEMU


Thanks,
David Dai

> 
> -- 
> Thanks,
> 
> David / dhildenb
> 
>
David Hildenbrand Sept. 30, 2021, 10:33 a.m. UTC | #7
On 30.09.21 11:40, david.dai wrote:
> On Wed, Sep 29, 2021 at 11:30:53AM +0200, David Hildenbrand (david@redhat.com) wrote:
>>
>> On 27.09.21 14:28, david.dai wrote:
>>> On Mon, Sep 27, 2021 at 11:07:43AM +0200, David Hildenbrand (david@redhat.com) wrote:
>>>>
>>>> On 27.09.21 10:27, Stefan Hajnoczi wrote:
>>>>> On Sun, Sep 26, 2021 at 10:16:14AM +0800, David Dai wrote:
>>>>>> Add a virtual pci to QEMU, the pci device is used to dynamically attach memory
>>>>>> to VM, so driver in guest can apply host memory in fly without virtualization
>>>>>> management software's help, such as libvirt/manager. The attached memory is
>>>>
>>>> We do have virtio-mem to dynamically attach memory to a VM. It could be
>>>> extended by a mechanism for the VM to request more/less memory, that's
>>>> already a planned feature. But yeah, virito-mem memory is exposed as
>>>> ordinary system RAM, not only via a BAR to mostly be managed by user space
>>>> completely.
>>
>> There is a virtio-pmem spec proposal to expose the memory region via a PCI
>> BAR. We could do something similar for virtio-mem, however, we would have to
>> wire that new model up differently in QEMU (it's no longer a "memory device"
>> like a DIMM then).
>>
>>>>
>>>
>>> I wish virtio-mem can solve our problem, but it is a dynamic allocation mechanism
>>> for system RAM in virtualization. In heterogeneous computing environments, the
>>> attached memory usually comes from computing device, it should be managed separately.
>>> we doesn't hope Linux MM controls it.
>>
>> If that heterogeneous memory would have a dedicated node (which usually is
>> the case IIRC) , and you let it manage by the Linux kernel (dax/kmem), you
>> can bind the memory backend of virtio-mem to that special NUMA node. So all
>> memory managed by that virtio-mem device would come from that heterogeneous
>> memory.
>>
> 
> Yes, CXL type 2, 3 devices expose memory to host as a dedicated node, the node
> is marked as soft_reserved_memory, dax/kmem can take over the node to create a
> dax devcie. This dax device can be regarded as the memory backend of virtio-mem
> 
> I don't sure whether a dax device can be open by multiple VMs or host applications.

virtio-mem currently relies on having a single sparse memory region 
(anonymous mmap, mmapped file, mmapped huge pages, mmapped shmem) per VM. 
Although we can share memory with other processes, sharing with other 
VMs is not intended. Instead of actually mmapping parts dynamically 
(which can be quite expensive), virtio-mem relies on punching holes into 
the backend and dynamically allocating memory/file blocks/... on access.

So the easy way to make it work is:

a) Exposing the CXL memory to the buddy via dax/kmem, resulting in device 
memory getting managed by the buddy on a separate NUMA node.
b) (optional) allocate huge pages on that separate NUMA node.
c) Use an ordinary memory-backend-ram or memory-backend-memfd (for huge 
pages), *binding* the memory backend to that special NUMA node.

This will dynamically allocate memory from that special NUMA node, 
resulting in the virtio-mem device completely being backed by that 
device memory, being able to dynamically resize the memory allocation.


Exposing an actual devdax to the virtio-mem device, shared by multiple 
VMs isn't really what we want and won't work without major design 
changes. Also, I'm not so sure it's a very clean design: exposing memory 
belonging to other VMs to unrelated QEMU processes. This sounds like a 
serious security hole: if you managed to escalate to the QEMU process 
from inside the VM, you can access unrelated VM memory quite happily. 
You want an abstraction in-between, that makes sure each VM/QEMU process 
only sees private memory: for example, the buddy via dax/kmem.
david.dai Oct. 9, 2021, 9:42 a.m. UTC | #8
On Thu, Sep 30, 2021 at 12:33:30PM +0200, David Hildenbrand (david@redhat.com) wrote:
> 
> 
> On 30.09.21 11:40, david.dai wrote:
> > On Wed, Sep 29, 2021 at 11:30:53AM +0200, David Hildenbrand (david@redhat.com) wrote:
> > > 
> > > On 27.09.21 14:28, david.dai wrote:
> > > > On Mon, Sep 27, 2021 at 11:07:43AM +0200, David Hildenbrand (david@redhat.com) wrote:
> > > > > 
> > > > > On 27.09.21 10:27, Stefan Hajnoczi wrote:
> > > > > > On Sun, Sep 26, 2021 at 10:16:14AM +0800, David Dai wrote:
> > > > > > > Add a virtual pci to QEMU, the pci device is used to dynamically attach memory
> > > > > > > to VM, so driver in guest can apply host memory in fly without virtualization
> > > > > > > management software's help, such as libvirt/manager. The attached memory is
> > > > > 
> > > > > We do have virtio-mem to dynamically attach memory to a VM. It could be
> > > > > extended by a mechanism for the VM to request more/less memory, that's
> > > > > already a planned feature. But yeah, virito-mem memory is exposed as
> > > > > ordinary system RAM, not only via a BAR to mostly be managed by user space
> > > > > completely.
> > > 
> > > There is a virtio-pmem spec proposal to expose the memory region via a PCI
> > > BAR. We could do something similar for virtio-mem, however, we would have to
> > > wire that new model up differently in QEMU (it's no longer a "memory device"
> > > like a DIMM then).
> > > 
> > > > > 
> > > > 
> > > > I wish virtio-mem can solve our problem, but it is a dynamic allocation mechanism
> > > > for system RAM in virtualization. In heterogeneous computing environments, the
> > > > attached memory usually comes from computing device, it should be managed separately.
> > > > we doesn't hope Linux MM controls it.
> > > 
> > > If that heterogeneous memory would have a dedicated node (which usually is
> > > the case IIRC) , and you let it manage by the Linux kernel (dax/kmem), you
> > > can bind the memory backend of virtio-mem to that special NUMA node. So all
> > > memory managed by that virtio-mem device would come from that heterogeneous
> > > memory.
> > > 
> > 
> > Yes, CXL type 2, 3 devices expose memory to host as a dedicated node, the node
> > is marked as soft_reserved_memory, dax/kmem can take over the node to create a
> > dax devcie. This dax device can be regarded as the memory backend of virtio-mem
> > 
> > I don't sure whether a dax device can be open by multiple VMs or host applications.
> 
> virito-mem currently relies on having a single sparse memory region (anon
> mmap, mmaped file, mmaped huge pages, mmap shmem) per VM. Although we can
> share memory with other processes, sharing with other VMs is not intended.
> Instead of actually mmaping parts dynamically (which can be quite
> expensive), virtio-mem relies on punching holes into the backend and
> dynamically allocating memory/file blocks/... on access.
> 
> So the easy way to make it work is:
> 
> a) Exposing the CXL memory to the buddy via dax/kmem, esulting in device
> memory getting managed by the buddy on a separate NUMA node.
>

The Linux kernel buddy system? How do we guarantee that other applications don't
allocate memory from it?

>
> b) (optional) allocate huge pages on that separate NUMA node.
> c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages),
> *bidning* the memory backend to that special NUMA node.
>
 
"-object memory-backend/device-ram or memory-device-memfd, id=mem0, size=768G"
How to bind backend memory to NUMA node

>
> This will dynamically allocate memory from that special NUMA node, resulting
> in the virtio-mem device completely being backed by that device memory,
> being able to dynamically resize the memory allocation.
> 
> 
> Exposing an actual devdax to the virtio-mem device, shared by multiple VMs
> isn't really what we want and won't work without major design changes. Also,
> I'm not so sure it's a very clean design: exposing memory belonging to other
> VMs to unrelated QEMU processes. This sounds like a serious security hole:
> if you managed to escalate to the QEMU process from inside the VM, you can
> access unrelated VM memory quite happily. You want an abstraction
> in-between, that makes sure each VM/QEMU process only sees private memory:
> for example, the buddy via dax/kmem.
> 
Hi David,
Thanks for your suggestion, and sorry for my delayed reply due to my long vacation.
How does the current virtio-mem dynamically attach memory to the guest? Via page faults?

Thanks,
David 


> -- 
> Thanks,
> 
> David / dhildenb
> 
>
David Hildenbrand Oct. 11, 2021, 7:43 a.m. UTC | #9
>> virito-mem currently relies on having a single sparse memory region (anon
>> mmap, mmaped file, mmaped huge pages, mmap shmem) per VM. Although we can
>> share memory with other processes, sharing with other VMs is not intended.
>> Instead of actually mmaping parts dynamically (which can be quite
>> expensive), virtio-mem relies on punching holes into the backend and
>> dynamically allocating memory/file blocks/... on access.
>>
>> So the easy way to make it work is:
>>
>> a) Exposing the CXL memory to the buddy via dax/kmem, esulting in device
>> memory getting managed by the buddy on a separate NUMA node.
>>
> 
> Linux kernel buddy system? how to guarantee other applications don't apply memory
> from it

Excellent question. Usually, you would online the memory to ZONE_MOVABLE,
such that even if some other allocation ended up there, it could
get migrated somewhere else.

For example, "daxctl reconfigure-device" tries doing that as default:

https://pmem.io/ndctl/daxctl-reconfigure-device.html
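
For instance, a hedged example following that man page (dax0.0 is a
placeholder for the actual dax device name):

    daxctl reconfigure-device --mode=system-ram dax0.0

This hands the dax device's memory to the kernel (dax/kmem), which by
default tries to online it to ZONE_MOVABLE on its own NUMA node.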

However, I agree that we might actually want to tell the system to not
use this CPU-less node as fallback for other allocations, and that we
might not want to swap out such memory etc.


But, in the end all that virtio-mem needs to work in the hypervisor is

a) A sparse memmap (anonymous RAM, memfd, file)
b) A way to populate memory within that sparse memmap (e.g., on fault,
    using madvise(MADV_POPULATE_WRITE), fallocate())
c) A way to discard memory (madvise(MADV_DONTNEED),
    fallocate(FALLOC_FL_PUNCH_HOLE))

So instead of using anonymous memory+mbind, you can also mmap a sparse file
and rely on populate-on-demand. One alternative for your use case would be
to create a DAX filesystem on that CXL memory (IIRC that should work) and
simply provide virtio-mem with a sparse file located on that filesystem.

Of course, you can also use some other mechanism as you might have in
your approach, as long as it supports a,b,c.
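
As a minimal sketch (not from the patch, purely illustrative) of the a/b/c
primitives on an anonymous sparse mapping; MADV_POPULATE_WRITE assumes
Linux >= 5.14:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MADV_POPULATE_WRITE
    #define MADV_POPULATE_WRITE 23      /* Linux >= 5.14 */
    #endif

    int main(void)
    {
        size_t region_size = 1UL << 30; /* 1 GiB sparse region */
        size_t block_size  = 2UL << 20; /* 2 MiB "memory block" */

        /* a) large sparse mapping, nothing populated yet */
        void *region = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (region == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* b) populate the first block up front instead of faulting on access */
        if (madvise(region, block_size, MADV_POPULATE_WRITE)) {
            perror("madvise(MADV_POPULATE_WRITE)");
        }

        /* c) discard the block again, freeing the backing memory */
        if (madvise(region, block_size, MADV_DONTNEED)) {
            perror("madvise(MADV_DONTNEED)");
        }

        munmap(region, region_size);
        return 0;
    }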

> 
>>
>> b) (optional) allocate huge pages on that separate NUMA node.
>> c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages),
>> *bidning* the memory backend to that special NUMA node.
>>
>   
> "-object memory-backend/device-ram or memory-device-memfd, id=mem0, size=768G"
> How to bind backend memory to NUMA node
> 

I think the syntax is "policy=bind,host-nodes=X"

whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1", for "5"
"host-nodes=0x20", etc.

>>
>> This will dynamically allocate memory from that special NUMA node, resulting
>> in the virtio-mem device completely being backed by that device memory,
>> being able to dynamically resize the memory allocation.
>>
>>
>> Exposing an actual devdax to the virtio-mem device, shared by multiple VMs
>> isn't really what we want and won't work without major design changes. Also,
>> I'm not so sure it's a very clean design: exposing memory belonging to other
>> VMs to unrelated QEMU processes. This sounds like a serious security hole:
>> if you managed to escalate to the QEMU process from inside the VM, you can
>> access unrelated VM memory quite happily. You want an abstraction
>> in-between, that makes sure each VM/QEMU process only sees private memory:
>> for example, the buddy via dax/kmem.
>>
> Hi David
> Thanks for your suggestion, also sorry for my delayed reply due to my long vacation.
> How does current virtio-mem dynamically attach memory to guest, via page fault?

Essentially you have a large sparse mmap. Within that mmap, memory is
populated on demand. Instead of mmap/munmap, you perform a single large
mmap and then dynamically populate or discard memory.

Right now, memory is populated via page faults on access. This is
sub-optimal when dealing with limited resources (i.e., hugetlbfs,
file blocks) and you might run out of backend memory.

I'm working on a "prealloc" mode, which will preallocate/populate memory
necessary for exposing the next block of memory to the VM, and which
fails gracefully if preallocation/population fails in the case of such
limited resources.

The patch resides on:
	https://github.com/davidhildenbrand/qemu/tree/virtio-mem-next

commit ded0e302c14ae1b68bdce9059dcca344e0a5f5f0
Author: David Hildenbrand <david@redhat.com>
Date:   Mon Aug 2 19:51:36 2021 +0200

     virtio-mem: support "prealloc=on" option
     
     Especially for hugetlb, but also for file-based memory backends, we'd
     like to be able to prealloc memory, especially to make user errors less
     severe: crashing the VM when there are not sufficient huge pages around.
     
     A common option for hugetlb will be using "reserve=off,prealloc=off" for
     the memory backend and "prealloc=on" for the virtio-mem device. This
     way, no huge pages will be reserved for the process, but we can recover
     if there are no actual huge pages when plugging memory.
     
     Signed-off-by: David Hildenbrand <david@redhat.com>
david.dai Oct. 13, 2021, 8:13 a.m. UTC | #10
On Mon, Oct 11, 2021 at 09:43:53AM +0200, David Hildenbrand (david@redhat.com) wrote:
> 
> 
> 
> > > virito-mem currently relies on having a single sparse memory region (anon
> > > mmap, mmaped file, mmaped huge pages, mmap shmem) per VM. Although we can
> > > share memory with other processes, sharing with other VMs is not intended.
> > > Instead of actually mmaping parts dynamically (which can be quite
> > > expensive), virtio-mem relies on punching holes into the backend and
> > > dynamically allocating memory/file blocks/... on access.
> > > 
> > > So the easy way to make it work is:
> > > 
> > > a) Exposing the CXL memory to the buddy via dax/kmem, esulting in device
> > > memory getting managed by the buddy on a separate NUMA node.
> > > 
> > 
> > Linux kernel buddy system? how to guarantee other applications don't apply memory
> > from it
> 
> Excellent question. Usually, you would online the memory to ZONE_MOVABLE,
> such that even if some other allocation ended up there, that it could
> get migrated somewhere else.
> 
> For example, "daxctl reconfigure-device" tries doing that as default:
> 
> https://pmem.io/ndctl/daxctl-reconfigure-device.html
> 
> However, I agree that we might actually want to tell the system to not
> use this CPU-less node as fallback for other allocations, and that we
> might not want to swap out such memory etc.
> 
> 
> But, in the end all that virtio-mem needs to work in the hypervisor is
> 
> a) A sparse memmap (anonymous RAM, memfd, file)
> b) A way to populate memory within that sparse memmap (e.g., on fault,
>    using madvise(MADV_POPULATE_WRITE), fallocate())
> c) A way to discard memory (madvise(MADV_DONTNEED),
>    fallocate(FALLOC_FL_PUNCH_HOLE))
> 
> So instead of using anonymous memory+mbind, you can also mmap a sparse file
> and rely on populate-on-demand. One alternative for your use case would be
> to create a DAX  filesystem on that CXL memory (IIRC that should work) and
> simply providing virtio-mem with a sparse file located on that filesystem.
> 
> Of course, you can also use some other mechanism as you might have in
> your approach, as long as it supports a,b,c.
> 
> > 
> > > 
> > > b) (optional) allocate huge pages on that separate NUMA node.
> > > c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages),
> > > *bidning* the memory backend to that special NUMA node.
> > > 
> > "-object memory-backend/device-ram or memory-device-memfd, id=mem0, size=768G"
> > How to bind backend memory to NUMA node
> > 
> 
> I think the syntax is "policy=bind,host-nodes=X"
> 
> whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1" for "5"
> "host-nodes=0x20" etc.
> 
> > > 
> > > This will dynamically allocate memory from that special NUMA node, resulting
> > > in the virtio-mem device completely being backed by that device memory,
> > > being able to dynamically resize the memory allocation.
> > > 
> > > 
> > > Exposing an actual devdax to the virtio-mem device, shared by multiple VMs
> > > isn't really what we want and won't work without major design changes. Also,
> > > I'm not so sure it's a very clean design: exposing memory belonging to other
> > > VMs to unrelated QEMU processes. This sounds like a serious security hole:
> > > if you managed to escalate to the QEMU process from inside the VM, you can
> > > access unrelated VM memory quite happily. You want an abstraction
> > > in-between, that makes sure each VM/QEMU process only sees private memory:
> > > for example, the buddy via dax/kmem.
> > > 
> > Hi David
> > Thanks for your suggestion, also sorry for my delayed reply due to my long vacation.
> > How does current virtio-mem dynamically attach memory to guest, via page fault?
> 
> Essentially you have a large sparse mmap. Withing that mmap, memory is
> populated on demand. Instead if mmap/munmap you perform a single large
> mmap and then dynamically populate memory/discard memory.
> 
> Right now, memory is populated via page faults on access. This is
> sub-optimal when dealing with limited resources (i.e., hugetlbfs,
> file blocks) and you might run out of backend memory.
> 
> I'm working on a "prealloc" mode, which will preallocate/populate memory
> necessary for exposing the next block of memory to the VM, and which
> fails gracefully if preallocation/population fails in the case of such
> limited resources.
> 
> The patch resides on:
> 	https://github.com/davidhildenbrand/qemu/tree/virtio-mem-next
> 
> commit ded0e302c14ae1b68bdce9059dcca344e0a5f5f0
> Author: David Hildenbrand <david@redhat.com>
> Date:   Mon Aug 2 19:51:36 2021 +0200
> 
>     virtio-mem: support "prealloc=on" option
>     Especially for hugetlb, but also for file-based memory backends, we'd
>     like to be able to prealloc memory, especially to make user errors less
>     severe: crashing the VM when there are not sufficient huge pages around.
>     A common option for hugetlb will be using "reserve=off,prealloc=off" for
>     the memory backend and "prealloc=on" for the virtio-mem device. This
>     way, no huge pages will be reserved for the process, but we can recover
>     if there are no actual huge pages when plugging memory.
>     Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

Hi David,

After reading the virtio-mem code, I understand what you have expressed. Please allow me
to describe my understanding of virtio-mem, so that we have an aligned view.

Virtio-mem:
 The virtio-mem device initializes and reserves a memory area (GPA); later dynamic memory
 growing/shrinking will not exceed this scope. memory-backend-ram maps anonymous memory
 over the whole area, but no RAM is attached yet because Linux delays the allocation.
 When the virtio-mem driver wants to dynamically add memory to the guest, it first requests
 a region from the reserved memory area, then notifies the virtio-mem device to record the
 information (the virtio-mem device doesn't perform a real memory allocation). After
 receiving the response from the virtio-mem device, the virtio-mem driver onlines the
 requested region and adds it to the Linux page allocator. The real RAM allocation happens
 via page faults when a guest CPU accesses it. Memory shrinking is achieved via madvise().

Questions:
1. In heterogeneous computing, memory may be accessed by CPUs on both the host side and
   the device side, so delayed memory allocation is not suitable. Host software (for
   instance, an OpenCL application) may allocate a buffer for the computing device to
   place its result in.
2. We hope to build our own page allocator in the host kernel, so it can offer a
   customized mmap() method to build the va->pa mapping in the MMU and IOMMU.
3. Some potential requirements also require our driver to manage memory, so that the page
   size granularity can be controlled to fit a small device IOTLB cache.
   CXL has a bias mode for HDM (host-managed device memory); it needs the physical address
   to switch bias mode between host access and device access. These points tell us that
   having the driver manage memory is mandatory.

My opinion:
 I hope this patch can enter the QEMU main tree; it is a self-contained virtual device which doesn't impact QEMU stability.
 It is a mechanism to dynamically attach memory to the guest: virtio-mem does it via page faults, while this patch creates a new memory region.
 In addition, users have plenty of room to customize the frontend and backend implementation.
 It can be regarded as sample code and may give other people more ideas and help.

Thanks,
David
David Hildenbrand Oct. 13, 2021, 8:33 a.m. UTC | #11
On 13.10.21 10:13, david.dai wrote:
> On Mon, Oct 11, 2021 at 09:43:53AM +0200, David Hildenbrand (david@redhat.com) wrote:
>>
>>
>>
>>>> virito-mem currently relies on having a single sparse memory region (anon
>>>> mmap, mmaped file, mmaped huge pages, mmap shmem) per VM. Although we can
>>>> share memory with other processes, sharing with other VMs is not intended.
>>>> Instead of actually mmaping parts dynamically (which can be quite
>>>> expensive), virtio-mem relies on punching holes into the backend and
>>>> dynamically allocating memory/file blocks/... on access.
>>>>
>>>> So the easy way to make it work is:
>>>>
>>>> a) Exposing the CXL memory to the buddy via dax/kmem, esulting in device
>>>> memory getting managed by the buddy on a separate NUMA node.
>>>>
>>>
>>> Linux kernel buddy system? how to guarantee other applications don't apply memory
>>> from it
>>
>> Excellent question. Usually, you would online the memory to ZONE_MOVABLE,
>> such that even if some other allocation ended up there, that it could
>> get migrated somewhere else.
>>
>> For example, "daxctl reconfigure-device" tries doing that as default:
>>
>> https://pmem.io/ndctl/daxctl-reconfigure-device.html
>>
>> However, I agree that we might actually want to tell the system to not
>> use this CPU-less node as fallback for other allocations, and that we
>> might not want to swap out such memory etc.
>>
>>
>> But, in the end all that virtio-mem needs to work in the hypervisor is
>>
>> a) A sparse memmap (anonymous RAM, memfd, file)
>> b) A way to populate memory within that sparse memmap (e.g., on fault,
>>     using madvise(MADV_POPULATE_WRITE), fallocate())
>> c) A way to discard memory (madvise(MADV_DONTNEED),
>>     fallocate(FALLOC_FL_PUNCH_HOLE))
>>
>> So instead of using anonymous memory+mbind, you can also mmap a sparse file
>> and rely on populate-on-demand. One alternative for your use case would be
>> to create a DAX  filesystem on that CXL memory (IIRC that should work) and
>> simply providing virtio-mem with a sparse file located on that filesystem.
>>
>> Of course, you can also use some other mechanism as you might have in
>> your approach, as long as it supports a,b,c.
>>
>>>
>>>>
>>>> b) (optional) allocate huge pages on that separate NUMA node.
>>>> c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages),
>>>> *bidning* the memory backend to that special NUMA node.
>>>>
>>> "-object memory-backend/device-ram or memory-device-memfd, id=mem0, size=768G"
>>> How to bind backend memory to NUMA node
>>>
>>
>> I think the syntax is "policy=bind,host-nodes=X"
>>
>> whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1" for "5"
>> "host-nodes=0x20" etc.
>>
>>>>
>>>> This will dynamically allocate memory from that special NUMA node, resulting
>>>> in the virtio-mem device completely being backed by that device memory,
>>>> being able to dynamically resize the memory allocation.
>>>>
>>>>
>>>> Exposing an actual devdax to the virtio-mem device, shared by multiple VMs
>>>> isn't really what we want and won't work without major design changes. Also,
>>>> I'm not so sure it's a very clean design: exposing memory belonging to other
>>>> VMs to unrelated QEMU processes. This sounds like a serious security hole:
>>>> if you managed to escalate to the QEMU process from inside the VM, you can
>>>> access unrelated VM memory quite happily. You want an abstraction
>>>> in-between, that makes sure each VM/QEMU process only sees private memory:
>>>> for example, the buddy via dax/kmem.
>>>>
>>> Hi David
>>> Thanks for your suggestion, also sorry for my delayed reply due to my long vacation.
>>> How does current virtio-mem dynamically attach memory to guest, via page fault?
>>
>> Essentially you have a large sparse mmap. Withing that mmap, memory is
>> populated on demand. Instead if mmap/munmap you perform a single large
>> mmap and then dynamically populate memory/discard memory.
>>
>> Right now, memory is populated via page faults on access. This is
>> sub-optimal when dealing with limited resources (i.e., hugetlbfs,
>> file blocks) and you might run out of backend memory.
>>
>> I'm working on a "prealloc" mode, which will preallocate/populate memory
>> necessary for exposing the next block of memory to the VM, and which
>> fails gracefully if preallocation/population fails in the case of such
>> limited resources.
>>
>> The patch resides on:
>> 	https://github.com/davidhildenbrand/qemu/tree/virtio-mem-next
>>
>> commit ded0e302c14ae1b68bdce9059dcca344e0a5f5f0
>> Author: David Hildenbrand <david@redhat.com>
>> Date:   Mon Aug 2 19:51:36 2021 +0200
>>
>>      virtio-mem: support "prealloc=on" option
>>      Especially for hugetlb, but also for file-based memory backends, we'd
>>      like to be able to prealloc memory, especially to make user errors less
>>      severe: crashing the VM when there are not sufficient huge pages around.
>>      A common option for hugetlb will be using "reserve=off,prealloc=off" for
>>      the memory backend and "prealloc=on" for the virtio-mem device. This
>>      way, no huge pages will be reserved for the process, but we can recover
>>      if there are no actual huge pages when plugging memory.
>>      Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>>
>> -- 
>> Thanks,
>>
>> David / dhildenb
>>
> 
> Hi David,
> 
> After read virtio-mem code, I understand what you have expressed, please allow me to describe
> my understanding to virtio-mem, so that we have a aligned view.
> 
> Virtio-mem:
>   Virtio-mem device initializes and reserved a memory area(GPA), later memory dynamically
>   growing/shrinking will not exceed this scope, memory-backend-ram has mapped anon. memory
>   to the whole area, but no ram is attached because Linux have a policy to delay allocation.

Right, but it can also be any sparse file (memory-backend-memfd, 
memory-backend-file).

>   When virtio-mem driver apply to dynamically add memory to guest, it first request a region
>   from the reserved memory area, then notify virtio-mem device to record the information
>   (virtio-mem device doesn't make real memory allocation). After received response from

In the upcoming prealloc=on mode I referenced, the allocation will 
happen before the guest is notified about success and starts using the 
memory.

With vfio/mdev support, the allocation already happens today, when 
vfio/mdev is notified about the populated memory ranges (see 
RamDiscardManager). That's essentially what makes virtio-mem device 
passthrough work.

>   virtio-mem deivce, virtio-mem driver will online the requested region and add it to Linux
>   page allocator. Real ram allocation will happen via page fault when guest cpu access it.
>   Memory shrink will be achieved by madvise()

Right, but you could write a custom virtio-mem driver that pools this 
memory differently.

Memory shrinking in the hypervisor is done using either 
madvise(MADV_DONTNEED) or fallocate(FALLOC_FL_PUNCH_HOLE).
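
A minimal hedged sketch of the hole-punching variant on a memfd-backed
region (names and sizes are illustrative only):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t region_size = 1UL << 30; /* sparse backing file */
        size_t block_size  = 2UL << 20; /* one "memory block" */

        int fd = memfd_create("guest-mem", 0);
        if (fd < 0 || ftruncate(fd, region_size)) {
            perror("memfd_create/ftruncate");
            return 1;
        }

        void *region = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        if (region == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* discard one block: release the backing pages/file blocks */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      0, block_size)) {
            perror("fallocate(FALLOC_FL_PUNCH_HOLE)");
        }

        munmap(region, region_size);
        close(fd);
        return 0;
    }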

> 
> Questions:
> 1. heterogeneous computing, memory may be accessed by CPUs on host side and device side.
>     Memory delayed allocation is not suitable. Host software(for instance, OpenCL) may
>     allocate a buffer to computing device to place the computing result in.

That already works with virtio-mem and vfio/mdev via the 
RamDiscardManager infrastructure introduced recently. With 
"prealloc=on", the delayed memory allocation can also be avoided without 
vfio/mdev.

> 2. we hope build ourselves page allocator in host kernel, so it can offer customized mmap()
>     method to build va->pa mapping in MMU and IOMMU.

Theoretically, you can wire up pretty much any driver in QEMU like 
vfio/mdev via the RamDiscardManager. From there, you can issue whatever 
syscall you need to populate memory when plugging new memory blocks. All 
you need to support is a sparse mmap and a way to populate/discard 
memory. Populate/discard could be wired up in the QEMU virtio-mem code as 
you need it.

> 3. some potential requirements also require our driver to manage memory, so that page size
>     granularity can be controlled to fit small device iotlb cache.
>     CXL has bias mode for HDM(host managed device memory), it needs physical address to make
>     bias mode switch between host access and device access. These tell us driver manage memory
>     is mandatory.

I think if you write your driver in a certain way and wire it up in QEMU 
virtio-mem accordingly (e.g., using a new memory-backend-whatever), that 
shouldn't be an issue.
david.dai Oct. 15, 2021, 9:10 a.m. UTC | #12
On Wed, Oct 13, 2021 at 10:33:39AM +0200, David Hildenbrand (david@redhat.com) wrote:
> 
> On 13.10.21 10:13, david.dai wrote:
> > On Mon, Oct 11, 2021 at 09:43:53AM +0200, David Hildenbrand (david@redhat.com) wrote:
> > > 
> > > 
> > > 
> > > > > virito-mem currently relies on having a single sparse memory region (anon
> > > > > mmap, mmaped file, mmaped huge pages, mmap shmem) per VM. Although we can
> > > > > share memory with other processes, sharing with other VMs is not intended.
> > > > > Instead of actually mmaping parts dynamically (which can be quite
> > > > > expensive), virtio-mem relies on punching holes into the backend and
> > > > > dynamically allocating memory/file blocks/... on access.
> > > > > 
> > > > > So the easy way to make it work is:
> > > > > 
> > > > > a) Exposing the CXL memory to the buddy via dax/kmem, esulting in device
> > > > > memory getting managed by the buddy on a separate NUMA node.
> > > > > 
> > > > 
> > > > Linux kernel buddy system? how to guarantee other applications don't apply memory
> > > > from it
> > > 
> > > Excellent question. Usually, you would online the memory to ZONE_MOVABLE,
> > > such that even if some other allocation ended up there, that it could
> > > get migrated somewhere else.
> > > 
> > > For example, "daxctl reconfigure-device" tries doing that as default:
> > > 
> > > https://pmem.io/ndctl/daxctl-reconfigure-device.html
> > > 
> > > However, I agree that we might actually want to tell the system to not
> > > use this CPU-less node as fallback for other allocations, and that we
> > > might not want to swap out such memory etc.
> > > 
> > > 
> > > But, in the end all that virtio-mem needs to work in the hypervisor is
> > > 
> > > a) A sparse memmap (anonymous RAM, memfd, file)
> > > b) A way to populate memory within that sparse memmap (e.g., on fault,
> > >     using madvise(MADV_POPULATE_WRITE), fallocate())
> > > c) A way to discard memory (madvise(MADV_DONTNEED),
> > >     fallocate(FALLOC_FL_PUNCH_HOLE))
> > > 
> > > So instead of using anonymous memory+mbind, you can also mmap a sparse file
> > > and rely on populate-on-demand. One alternative for your use case would be
> > > to create a DAX  filesystem on that CXL memory (IIRC that should work) and
> > > simply providing virtio-mem with a sparse file located on that filesystem.
> > > 
> > > Of course, you can also use some other mechanism as you might have in
> > > your approach, as long as it supports a,b,c.
> > > 
> > > > 
> > > > > 
> > > > > b) (optional) allocate huge pages on that separate NUMA node.
> > > > > c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages),
> > > > > *bidning* the memory backend to that special NUMA node.
> > > > > 
> > > > "-object memory-backend/device-ram or memory-device-memfd, id=mem0, size=768G"
> > > > How to bind backend memory to NUMA node
> > > > 
> > > 
> > > I think the syntax is "policy=bind,host-nodes=X"
> > > 
> > > whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1" for "5"
> > > "host-nodes=0x20" etc.
> > > 
> > > > > 
> > > > > This will dynamically allocate memory from that special NUMA node, resulting
> > > > > in the virtio-mem device completely being backed by that device memory,
> > > > > being able to dynamically resize the memory allocation.
> > > > > 
> > > > > 
> > > > > Exposing an actual devdax to the virtio-mem device, shared by multiple VMs
> > > > > isn't really what we want and won't work without major design changes. Also,
> > > > > I'm not so sure it's a very clean design: exposing memory belonging to other
> > > > > VMs to unrelated QEMU processes. This sounds like a serious security hole:
> > > > > if you managed to escalate to the QEMU process from inside the VM, you can
> > > > > access unrelated VM memory quite happily. You want an abstraction
> > > > > in-between, that makes sure each VM/QEMU process only sees private memory:
> > > > > for example, the buddy via dax/kmem.
> > > > > 
> > > > Hi David
> > > > Thanks for your suggestion, also sorry for my delayed reply due to my long vacation.
> > > > How does current virtio-mem dynamically attach memory to guest, via page fault?
> > > 
> > > Essentially you have a large sparse mmap. Withing that mmap, memory is
> > > populated on demand. Instead if mmap/munmap you perform a single large
> > > mmap and then dynamically populate memory/discard memory.
> > > 
> > > Right now, memory is populated via page faults on access. This is
> > > sub-optimal when dealing with limited resources (i.e., hugetlbfs,
> > > file blocks) and you might run out of backend memory.
> > > 
> > > I'm working on a "prealloc" mode, which will preallocate/populate memory
> > > necessary for exposing the next block of memory to the VM, and which
> > > fails gracefully if preallocation/population fails in the case of such
> > > limited resources.
> > > 
> > > The patch resides on:
> > > 	https://github.com/davidhildenbrand/qemu/tree/virtio-mem-next
> > > 
> > > commit ded0e302c14ae1b68bdce9059dcca344e0a5f5f0
> > > Author: David Hildenbrand <david@redhat.com>
> > > Date:   Mon Aug 2 19:51:36 2021 +0200
> > > 
> > >      virtio-mem: support "prealloc=on" option
> > >      Especially for hugetlb, but also for file-based memory backends, we'd
> > >      like to be able to prealloc memory, especially to make user errors less
> > >      severe: crashing the VM when there are not sufficient huge pages around.
> > >      A common option for hugetlb will be using "reserve=off,prealloc=off" for
> > >      the memory backend and "prealloc=on" for the virtio-mem device. This
> > >      way, no huge pages will be reserved for the process, but we can recover
> > >      if there are no actual huge pages when plugging memory.
> > >      Signed-off-by: David Hildenbrand <david@redhat.com>
> > > 
> > > 
> > > -- 
> > > Thanks,
> > > 
> > > David / dhildenb
> > > 
> > 
> > Hi David,
> > 
> > After read virtio-mem code, I understand what you have expressed, please allow me to describe
> > my understanding to virtio-mem, so that we have a aligned view.
> > 
> > Virtio-mem:
> >   Virtio-mem device initializes and reserved a memory area(GPA), later memory dynamically
> >   growing/shrinking will not exceed this scope, memory-backend-ram has mapped anon. memory
> >   to the whole area, but no ram is attached because Linux have a policy to delay allocation.
> 
> Right, but it can also be any sparse file (memory-backend-memfd,
> memory-backend-file).
> 
> >   When virtio-mem driver apply to dynamically add memory to guest, it first request a region
> >   from the reserved memory area, then notify virtio-mem device to record the information
> >   (virtio-mem device doesn't make real memory allocation). After received response from
> 
> In the upcoming prealloc=on mode I referenced, the allocation will happen
> before the guest is notified about success and starts using the memory.
> 
> With vfio/mdev support, the allocation will happen nowadays already, when
> vfio/mdev is notified about the populated memory ranges (see
> RamDiscardManager). That's essentially what makes virtio-mem device
> passthrough work.
> 
> >   virtio-mem deivce, virtio-mem driver will online the requested region and add it to Linux
> >   page allocator. Real ram allocation will happen via page fault when guest cpu access it.
> >   Memory shrink will be achieved by madvise()
> 
> Right, but you could write a custom virtio-mem driver that pools this memory
> differently.
> 
> Memory shrinking in the hypervisor is either done using madvise(DONMTNEED)
> or fallocate(FALLOC_FL_PUNCH_HOLE)
> 
> > 
> > Questions:
> > 1. heterogeneous computing, memory may be accessed by CPUs on host side and device side.
> >     Memory delayed allocation is not suitable. Host software(for instance, OpenCL) may
> >     allocate a buffer to computing device to place the computing result in.
> 
> That works already with virtio-mem with vfio/mdev via the RamDiscardManager
> infrastructure introduced recently. With "prealloc=on", the delayed memory
> allocation can also be avoided without vfio/mdev.
> 
> > 2. we hope build ourselves page allocator in host kernel, so it can offer customized mmap()
> >     method to build va->pa mapping in MMU and IOMMU.
> 
> Theoretically, you can wire up pretty much any driver in QEMU like vfio/mdev
> via the RamDiscardManager. From there, you can issue whatever syscall you
> need to popualte memory when plugging new memory blocks. All you need to
> support is a sparse mmap and a way to populate/discard memory.
> Populate/discard could be wired up in QEMU virtio-mem code as you need it.
> 
> > 3. some potential requirements also require our driver to manage memory, so that page size
> >     granularity can be controlled to fit small device iotlb cache.
> >     CXL has bias mode for HDM(host managed device memory), it needs physical address to make
> >     bias mode switch between host access and device access. These tell us driver manage memory
> >     is mandatory.
> 
> I think if you write your driver in a certain way and wire it up in QEMU
> virtio-mem accordingly (e.g., using a new memory-backend-whatever), that
> shouldn't be an issue.
>

Thanks a lot, let me give it a try.
 
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 
>
David Hildenbrand Oct. 15, 2021, 9:27 a.m. UTC | #13
On 15.10.21 11:10, david.dai wrote:
> On Wed, Oct 13, 2021 at 10:33:39AM +0200, David Hildenbrand (david@redhat.com) wrote:
>>
>> On 13.10.21 10:13, david.dai wrote:
>>> On Mon, Oct 11, 2021 at 09:43:53AM +0200, David Hildenbrand (david@redhat.com) wrote:
>>>>
>>>>
>>>>
>>>>>> virito-mem currently relies on having a single sparse memory region (anon
>>>>>> mmap, mmaped file, mmaped huge pages, mmap shmem) per VM. Although we can
>>>>>> share memory with other processes, sharing with other VMs is not intended.
>>>>>> Instead of actually mmaping parts dynamically (which can be quite
>>>>>> expensive), virtio-mem relies on punching holes into the backend and
>>>>>> dynamically allocating memory/file blocks/... on access.
>>>>>>
>>>>>> So the easy way to make it work is:
>>>>>>
>>>>>> a) Exposing the CXL memory to the buddy via dax/kmem, esulting in device
>>>>>> memory getting managed by the buddy on a separate NUMA node.
>>>>>>
>>>>>
>>>>> Linux kernel buddy system? how to guarantee other applications don't apply memory
>>>>> from it
>>>>
>>>> Excellent question. Usually, you would online the memory to ZONE_MOVABLE,
>>>> such that even if some other allocation ended up there, that it could
>>>> get migrated somewhere else.
>>>>
>>>> For example, "daxctl reconfigure-device" tries doing that as default:
>>>>
>>>> https://pmem.io/ndctl/daxctl-reconfigure-device.html
>>>>
>>>> However, I agree that we might actually want to tell the system to not
>>>> use this CPU-less node as fallback for other allocations, and that we
>>>> might not want to swap out such memory etc.
>>>>
>>>>
>>>> But, in the end all that virtio-mem needs to work in the hypervisor is
>>>>
>>>> a) A sparse memmap (anonymous RAM, memfd, file)
>>>> b) A way to populate memory within that sparse memmap (e.g., on fault,
>>>>     using madvise(MADV_POPULATE_WRITE), fallocate())
>>>> c) A way to discard memory (madvise(MADV_DONTNEED),
>>>>     fallocate(FALLOC_FL_PUNCH_HOLE))
>>>>
>>>> So instead of using anonymous memory+mbind, you can also mmap a sparse file
>>>> and rely on populate-on-demand. One alternative for your use case would be
>>>> to create a DAX  filesystem on that CXL memory (IIRC that should work) and
>>>> simply providing virtio-mem with a sparse file located on that filesystem.
>>>>
>>>> Of course, you can also use some other mechanism as you might have in
>>>> your approach, as long as it supports a,b,c.
>>>>
>>>>>
>>>>>>
>>>>>> b) (optional) allocate huge pages on that separate NUMA node.
>>>>>> c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages),
>>>>>> *bidning* the memory backend to that special NUMA node.
>>>>>>
>>>>> "-object memory-backend/device-ram or memory-device-memfd, id=mem0, size=768G"
>>>>> How to bind backend memory to NUMA node
>>>>>
>>>>
>>>> I think the syntax is "policy=bind,host-nodes=X"
>>>>
>>>> whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1" for "5"
>>>> "host-nodes=0x20" etc.
>>>>
>>>>>>
>>>>>> This will dynamically allocate memory from that special NUMA node, resulting
>>>>>> in the virtio-mem device completely being backed by that device memory,
>>>>>> being able to dynamically resize the memory allocation.
>>>>>>
>>>>>>
>>>>>> Exposing an actual devdax to the virtio-mem device, shared by multiple VMs
>>>>>> isn't really what we want and won't work without major design changes. Also,
>>>>>> I'm not so sure it's a very clean design: exposing memory belonging to other
>>>>>> VMs to unrelated QEMU processes. This sounds like a serious security hole:
>>>>>> if you managed to escalate to the QEMU process from inside the VM, you can
>>>>>> access unrelated VM memory quite happily. You want an abstraction
>>>>>> in-between, that makes sure each VM/QEMU process only sees private memory:
>>>>>> for example, the buddy via dax/kmem.
>>>>>>
>>>>> Hi David
>>>>> Thanks for your suggestion, also sorry for my delayed reply due to my long vacation.
>>>>> How does current virtio-mem dynamically attach memory to guest, via page fault?
>>>>
>>>> Essentially you have a large sparse mmap. Withing that mmap, memory is
>>>> populated on demand. Instead if mmap/munmap you perform a single large
>>>> mmap and then dynamically populate memory/discard memory.
>>>>
>>>> Right now, memory is populated via page faults on access. This is
>>>> sub-optimal when dealing with limited resources (i.e., hugetlbfs,
>>>> file blocks) and you might run out of backend memory.
>>>>
>>>> I'm working on a "prealloc" mode, which will preallocate/populate memory
>>>> necessary for exposing the next block of memory to the VM, and which
>>>> fails gracefully if preallocation/population fails in the case of such
>>>> limited resources.
>>>>
>>>> The patch resides on:
>>>> 	https://github.com/davidhildenbrand/qemu/tree/virtio-mem-next
>>>>
>>>> commit ded0e302c14ae1b68bdce9059dcca344e0a5f5f0
>>>> Author: David Hildenbrand <david@redhat.com>
>>>> Date:   Mon Aug 2 19:51:36 2021 +0200
>>>>
>>>>      virtio-mem: support "prealloc=on" option
>>>>      Especially for hugetlb, but also for file-based memory backends, we'd
>>>>      like to be able to prealloc memory, especially to make user errors less
>>>>      severe: crashing the VM when there are not sufficient huge pages around.
>>>>      A common option for hugetlb will be using "reserve=off,prealloc=off" for
>>>>      the memory backend and "prealloc=on" for the virtio-mem device. This
>>>>      way, no huge pages will be reserved for the process, but we can recover
>>>>      if there are no actual huge pages when plugging memory.
>>>>      Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>
>>>>
>>>> -- 
>>>> Thanks,
>>>>
>>>> David / dhildenb
>>>>
>>>
>>> Hi David,
>>>
>>> After read virtio-mem code, I understand what you have expressed, please allow me to describe
>>> my understanding to virtio-mem, so that we have a aligned view.
>>>
>>> Virtio-mem:
>>>   Virtio-mem device initializes and reserved a memory area(GPA), later memory dynamically
>>>   growing/shrinking will not exceed this scope, memory-backend-ram has mapped anon. memory
>>>   to the whole area, but no ram is attached because Linux have a policy to delay allocation.
>>
>> Right, but it can also be any sparse file (memory-backend-memfd,
>> memory-backend-file).
>>
>>>   When virtio-mem driver apply to dynamically add memory to guest, it first request a region
>>>   from the reserved memory area, then notify virtio-mem device to record the information
>>>   (virtio-mem device doesn't make real memory allocation). After received response from
>>
>> In the upcoming prealloc=on mode I referenced, the allocation will happen
>> before the guest is notified about success and starts using the memory.
>>
>> With vfio/mdev support, the allocation will happen nowadays already, when
>> vfio/mdev is notified about the populated memory ranges (see
>> RamDiscardManager). That's essentially what makes virtio-mem device
>> passthrough work.
>>
>>>   virtio-mem deivce, virtio-mem driver will online the requested region and add it to Linux
>>>   page allocator. Real ram allocation will happen via page fault when guest cpu access it.
>>>   Memory shrink will be achieved by madvise()
>>
>> Right, but you could write a custom virtio-mem driver that pools this memory
>> differently.
>>
>> Memory shrinking in the hypervisor is either done using madvise(DONMTNEED)
>> or fallocate(FALLOC_FL_PUNCH_HOLE)
>>
>>>
>>> Questions:
>>> 1. heterogeneous computing, memory may be accessed by CPUs on host side and device side.
>>>     Memory delayed allocation is not suitable. Host software(for instance, OpenCL) may
>>>     allocate a buffer to computing device to place the computing result in.
>>
>> That works already with virtio-mem with vfio/mdev via the RamDiscardManager
>> infrastructure introduced recently. With "prealloc=on", the delayed memory
>> allocation can also be avoided without vfio/mdev.
>>
>>> 2. we hope build ourselves page allocator in host kernel, so it can offer customized mmap()
>>>     method to build va->pa mapping in MMU and IOMMU.
>>
>> Theoretically, you can wire up pretty much any driver in QEMU like vfio/mdev
>> via the RamDiscardManager. From there, you can issue whatever syscall you
>> need to popualte memory when plugging new memory blocks. All you need to
>> support is a sparse mmap and a way to populate/discard memory.
>> Populate/discard could be wired up in QEMU virtio-mem code as you need it.
>>
>>> 3. some potential requirements also require our driver to manage memory, so that page size
>>>     granularity can be controlled to fit small device iotlb cache.
>>>     CXL has bias mode for HDM(host managed device memory), it needs physical address to make
>>>     bias mode switch between host access and device access. These tell us driver manage memory
>>>     is mandatory.
>>
>> I think if you write your driver in a certain way and wire it up in QEMU
>> virtio-mem accordingly (e.g., using a new memory-backend-whatever), that
>> shouldn't be an issue.
>>
> 
> Thanks a lot, so let me have a try.

Let me know if you need some help or run into issues! Further, if we
need spec extensions to handle some additional requirements, that's also
not really an issue.

I certainly don't want to force you to use virtio-mem by any means. However,
a "virtual pci device to dynamically attach memory to QEMU" is essentially
what virtio-mem does :). As it's already compatible with vfio/mdev
and will soon have full support for dealing with limited resources
(preallocation support, VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE), it feels
like a good fit for your use case as well, although some details are
left to be figured out.

(also, virtio-mem solved a lot of the issues related to guest memory
dumping, VM snapshotting/migration, and how to make it consumable by
upper layers like libvirt -- so you would get that for almost free as well)
david.dai Oct. 15, 2021, 9:57 a.m. UTC | #14
On Fri, Oct 15, 2021 at 11:27:02AM +0200, David Hildenbrand (david@redhat.com) wrote:
> 
> 
> On 15.10.21 11:10, david.dai wrote:
> > On Wed, Oct 13, 2021 at 10:33:39AM +0200, David Hildenbrand (david@redhat.com) wrote:
> >>
> >> On 13.10.21 10:13, david.dai wrote:
> >>> On Mon, Oct 11, 2021 at 09:43:53AM +0200, David Hildenbrand (david@redhat.com) wrote:
> >>>>
> >>>>
> >>>>
> >>>>>> virito-mem currently relies on having a single sparse memory region (anon
> >>>>>> mmap, mmaped file, mmaped huge pages, mmap shmem) per VM. Although we can
> >>>>>> share memory with other processes, sharing with other VMs is not intended.
> >>>>>> Instead of actually mmaping parts dynamically (which can be quite
> >>>>>> expensive), virtio-mem relies on punching holes into the backend and
> >>>>>> dynamically allocating memory/file blocks/... on access.
> >>>>>>
> >>>>>> So the easy way to make it work is:
> >>>>>>
> >>>>>> a) Exposing the CXL memory to the buddy via dax/kmem, esulting in device
> >>>>>> memory getting managed by the buddy on a separate NUMA node.
> >>>>>>
> >>>>>
> >>>>> Linux kernel buddy system? How do we guarantee that other applications don't
> >>>>> allocate memory from it?
> >>>>
> >>>> Excellent question. Usually, you would online the memory to ZONE_MOVABLE,
> >>>> such that even if some other allocation ended up there, it could
> >>>> get migrated somewhere else.
> >>>>
> >>>> For example, "daxctl reconfigure-device" tries doing that as default:
> >>>>
> >>>> https://pmem.io/ndctl/daxctl-reconfigure-device.html
> >>>>
> >>>> However, I agree that we might actually want to tell the system to not
> >>>> use this CPU-less node as fallback for other allocations, and that we
> >>>> might not want to swap out such memory etc.
> >>>>
> >>>>
> >>>> But, in the end all that virtio-mem needs to work in the hypervisor is
> >>>>
> >>>> a) A sparse memmap (anonymous RAM, memfd, file)
> >>>> b) A way to populate memory within that sparse memmap (e.g., on fault,
> >>>>     using madvise(MADV_POPULATE_WRITE), fallocate())
> >>>> c) A way to discard memory (madvise(MADV_DONTNEED),
> >>>>     fallocate(FALLOC_FL_PUNCH_HOLE))
> >>>>
> >>>> So instead of using anonymous memory+mbind, you can also mmap a sparse file
> >>>> and rely on populate-on-demand. One alternative for your use case would be
> >>>> to create a DAX  filesystem on that CXL memory (IIRC that should work) and
> >>>> simply providing virtio-mem with a sparse file located on that filesystem.
> >>>>
> >>>> Of course, you can also use some other mechanism as you might have in
> >>>> your approach, as long as it supports a,b,c.
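> >>>>
> >>>> A rough, untested sketch of a/b/c (backend_fd, block_off, block_size are
> >>>> just placeholder names):
> >>>>
> >>>>     /* a) one large sparse mapping, nothing populated yet */
> >>>>     void *base = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
> >>>>                       MAP_SHARED, backend_fd, 0);
> >>>>
> >>>>     /* b) populate a block when it gets plugged */
> >>>>     madvise((char *)base + block_off, block_size, MADV_POPULATE_WRITE);
> >>>>
> >>>>     /* c) discard it again on unplug */
> >>>>     madvise((char *)base + block_off, block_size, MADV_DONTNEED);
> >>>>     /* or fallocate(backend_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> >>>>      *              block_off, block_size) to free the file blocks */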
> >>>>
> >>>>>
> >>>>>>
> >>>>>> b) (optional) allocate huge pages on that separate NUMA node.
> >>>>>> c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages),
> >>>>>> *binding* the memory backend to that special NUMA node.
> >>>>>>
> >>>>> "-object memory-backend/device-ram or memory-device-memfd, id=mem0, size=768G"
> >>>>> How do we bind the backend memory to a NUMA node?
> >>>>>
> >>>>
> >>>> I think the syntax is "policy=bind,host-nodes=X"
> >>>>
> >>>> whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1" for "5"
> >>>> "host-nodes=0x20" etc.
> >>>>
> >>>>>>
> >>>>>> This will dynamically allocate memory from that special NUMA node, resulting
> >>>>>> in the virtio-mem device completely being backed by that device memory,
> >>>>>> being able to dynamically resize the memory allocation.
> >>>>>>
> >>>>>>
> >>>>>> Exposing an actual devdax to the virtio-mem device, shared by multiple VMs
> >>>>>> isn't really what we want and won't work without major design changes. Also,
> >>>>>> I'm not so sure it's a very clean design: exposing memory belonging to other
> >>>>>> VMs to unrelated QEMU processes. This sounds like a serious security hole:
> >>>>>> if you managed to escalate to the QEMU process from inside the VM, you can
> >>>>>> access unrelated VM memory quite happily. You want an abstraction
> >>>>>> in-between, that makes sure each VM/QEMU process only sees private memory:
> >>>>>> for example, the buddy via dax/kmem.
> >>>>>>
> >>>>> Hi David
> >>>>> Thanks for your suggestion, also sorry for my delayed reply due to my long vacation.
> >>>>> How does current virtio-mem dynamically attach memory to guest, via page fault?
> >>>>
> >>>> Essentially you have a large sparse mmap. Within that mmap, memory is
> >>>> populated on demand. Instead of mmap/munmap you perform a single large
> >>>> mmap and then dynamically populate/discard memory.
> >>>>
> >>>> Right now, memory is populated via page faults on access. This is
> >>>> sub-optimal when dealing with limited resources (i.e., hugetlbfs,
> >>>> file blocks) and you might run out of backend memory.
> >>>>
> >>>> I'm working on a "prealloc" mode, which will preallocate/populate memory
> >>>> necessary for exposing the next block of memory to the VM, and which
> >>>> fails gracefully if preallocation/population fails in the case of such
> >>>> limited resources.
> >>>>
> >>>> The patch resides on:
> >>>> 	https://github.com/davidhildenbrand/qemu/tree/virtio-mem-next
> >>>>
> >>>> commit ded0e302c14ae1b68bdce9059dcca344e0a5f5f0
> >>>> Author: David Hildenbrand <david@redhat.com>
> >>>> Date:   Mon Aug 2 19:51:36 2021 +0200
> >>>>
> >>>>      virtio-mem: support "prealloc=on" option
> >>>>      Especially for hugetlb, but also for file-based memory backends, we'd
> >>>>      like to be able to prealloc memory, especially to make user errors less
> >>>>      severe: crashing the VM when there are not sufficient huge pages around.
> >>>>      A common option for hugetlb will be using "reserve=off,prealloc=off" for
> >>>>      the memory backend and "prealloc=on" for the virtio-mem device. This
> >>>>      way, no huge pages will be reserved for the process, but we can recover
> >>>>      if there are no actual huge pages when plugging memory.
> >>>>      Signed-off-by: David Hildenbrand <david@redhat.com>
> >>>>
> >>>>
> >>>> -- 
> >>>> Thanks,
> >>>>
> >>>> David / dhildenb
> >>>>
> >>>
> >>> Hi David,
> >>>
> >>> After reading the virtio-mem code, I understand what you have expressed. Please allow me to
> >>> describe my understanding of virtio-mem, so that we have an aligned view.
> >>>
> >>> Virtio-mem:
> >>>   The virtio-mem device initializes and reserves a memory area (GPA); later dynamic memory
> >>>   growing/shrinking will not exceed this scope. memory-backend-ram maps anonymous memory
> >>>   over the whole area, but no RAM is attached yet because Linux delays the actual allocation.
> >>
> >> Right, but it can also be any sparse file (memory-backend-memfd,
> >> memory-backend-file).
> >>
> >>>   When the virtio-mem driver wants to dynamically add memory to the guest, it first requests
> >>>   a region from the reserved memory area, then notifies the virtio-mem device to record the
> >>>   information (the virtio-mem device doesn't do a real memory allocation). After receiving the
> >>
> >> In the upcoming prealloc=on mode I referenced, the allocation will happen
> >> before the guest is notified about success and starts using the memory.
> >>
> >> With vfio/mdev support, the allocation will happen nowadays already, when
> >> vfio/mdev is notified about the populated memory ranges (see
> >> RamDiscardManager). That's essentially what makes virtio-mem device
> >> passthrough work.
> >>
> >>>   response from the virtio-mem device, the virtio-mem driver will online the requested region
> >>>   and add it to the Linux page allocator. Real RAM allocation happens via page fault when a
> >>>   guest CPU accesses it. Memory shrinking is achieved by madvise().
> >>
> >> Right, but you could write a custom virtio-mem driver that pools this memory
> >> differently.
> >>
> >> Memory shrinking in the hypervisor is either done using madvise(MADV_DONTNEED)
> >> or fallocate(FALLOC_FL_PUNCH_HOLE)
> >>
> >>>
> >>> Questions:
> >>> 1. In heterogeneous computing, memory may be accessed by CPUs on both the host side and the
> >>>     device side, so delayed memory allocation is not suitable. Host software (for instance,
> >>>     OpenCL) may allocate a buffer for the computing device to place the computing result in.
> >>
> >> That works already with virtio-mem with vfio/mdev via the RamDiscardManager
> >> infrastructure introduced recently. With "prealloc=on", the delayed memory
> >> allocation can also be avoided without vfio/mdev.
> >>
> >>> 2. we hope to build our own page allocator in the host kernel, so it can offer a customized
> >>>     mmap() method to build va->pa mappings in the MMU and IOMMU.
> >>
> >> Theoretically, you can wire up pretty much any driver in QEMU like vfio/mdev
> >> via the RamDiscardManager. From there, you can issue whatever syscall you
> >> need to populate memory when plugging new memory blocks. All you need to
> >> support is a sparse mmap and a way to populate/discard memory.
> >> Populate/discard could be wired up in QEMU virtio-mem code as you need it.
> >>
> >>> 3. some potential requirements also require our driver to manage memory, so that page-size
> >>>     granularity can be controlled to fit a small device IOTLB cache.
> >>>     CXL has a bias mode for HDM (host-managed device memory); it needs the physical address
> >>>     to switch bias mode between host access and device access. This tells us that having the
> >>>     driver manage memory is mandatory.
> >>
> >> I think if you write your driver in a certain way and wire it up in QEMU
> >> virtio-mem accordingly (e.g., using a new memory-backend-whatever), that
> >> shouldn't be an issue.
> >>
> > 
> > Thanks a lot, so let me have a try.
> 
> Let me know if you need some help or run into issues! Further, if we
> need spec extensions to handle some additional requirements, that's also
> not really an issue.
> 
> I certainly don't want to force you to use virtio-mem by any means. However
> "virtual pci device to dynamically attach memory to QEMU" is essentially
> what virtio-mem does :) .  As it's already compatible with vfio/mdev
> and soon has full support for dealing with limited resources
> (preallocation support, VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE), it feels
> like a good fit for your use case as well, although some details are
> left to be figured out.
> 
> (also, virtio-mem solved a lot of the issues related to guest memory
> dumping, VM snapshotting/migration, and how to make it consumable by
> upper layers like libvirt -- so you would get that for almost free as well)
> 
>

Yes, if virtio-mem satisfies our requirements, of course we will employ it.
If any questions come up, I will contact you for help.

Thanks,
David
diff mbox series

Patch

diff --git a/docs/devel/dynamic_mdev.rst b/docs/devel/dynamic_mdev.rst
new file mode 100644
index 0000000000..8e2edb6600
--- /dev/null
+++ b/docs/devel/dynamic_mdev.rst
@@ -0,0 +1,122 @@ 
+Motivation:
+In a heterogeneous computing system, an accelerator generally exposes its
+device memory to the host via PCIe or CXL.mem (Compute Express Link) so that
+memory can be shared between host and device. Such memory is usually uniformly
+managed by the host and is called HDM (host-managed device memory); on this
+basis, SVA (Shared Virtual Addressing) can be achieved. One computing device
+may be used by multiple virtual machines if it supports SR-IOV. To use device
+memory efficiently in virtualization, each VM allocates device memory on
+demand without overcommit, which raises the question of how to dynamically
+attach host memory resources to a VM. A virtual PCI device, dynamic_mdev, is
+introduced to achieve this. dynamic_mdev has a large BAR whose size can be
+assigned by the user when creating the VM. The BAR has no backend memory at
+initialization; later, the driver in the guest triggers QEMU to map host
+memory into the BAR space. How much memory is mapped, when, and where it is
+mapped are determined by the guest driver. After device memory has been
+attached to the virtual PCI BAR, applications in the guest can access the
+device memory through that BAR. Memory allocation and negotiation are left to
+the guest driver and the memory backend implementation. dynamic_mdev is a
+mechanism that provides significant benefits for heterogeneous memory
+virtualization.
+
+Implementation:
+The dynamic_mdev device has two BARs, BAR0 and BAR2. BAR0 is a 32-bit register
+BAR that hosts the control registers used for control and message
+communication. BAR2 is a 64-bit MMIO BAR to which host memory is attached; its
+size can be assigned via a parameter when creating the VM. Host memory is
+attached to this BAR via the mmap API.
+
+
+          VM1                           VM2
+ -----------------------        ----------------------
+|      application      |      |     application      |
+|                       |      |                      |
+|-----------------------|      |----------------------|
+|     guest driver      |      |     guest driver     |
+|   |--------------|    |      |   | -------------|   |
+|   | pci mem bar  |    |      |   | pci mem bar  |   |
+ ---|--------------|-----       ---|--------------|---
+     ----   ---                     --   ------
+    |    | |   |                   |  | |      |
+     ----   ---                     --   ------
+            \                      /
+             \                    /
+              \                  /
+               \                /
+                |              |
+                V              V
+ --------------- /dev/mdev.mmap ------------------------
+|     --   --   --   --   --   --                       |
+|    |  | |  | |  | |  | |  | |  |  <-----free_mem_list |
+|     --   --   --   --   --   --                       |
+|                                                       |
+|                       HDM(host managed device memory )|
+ -------------------------------------------------------
+
+1. Create device:
+-device dynamic-mdevice,size=0x200000000,align=0x40000000,mem-path=/dev/mdev
+
+size: BAR space size
+align: alignment of dynamically attached memory
+mem-path: host backend memory device
+
+
+2. Registers to control dynamic memory attach
+All registers are placed in BAR0
+
+        INT_MASK     =     0, /* RW */
+        INT_STATUS   =     4, /* RW: write 1 clear */
+        DOOR_BELL    =     8, /*
+                               * RW: trigger device to act
+                               *  31        15        0
+                               *  --------------------
+                               * |en|xxxxxxxx|  cmd   |
+                               *  --------------------
+                               */
+
+        /* RO: 4K, 2M or 1G alignment for the memory size */
+        MEM_ALIGN   =      12,
+
+        /* RO: offset in the memory BAR up to which RAM has been mapped */
+        HW_OFFSET    =     16,
+
+        /* RW: size of the dynamically attached memory */
+        MEM_SIZE     =     24,
+
+        /* RW: offset in the host mdev from which the memory is attached */
+        MEM_OFFSET   =     32,
+
+3. To trigger QEMU to attach memory, the guest driver performs the following operations:
+
+        /* memory size */
+        writeq(size, reg_base + 0x18);
+
+        /* backend file offset */
+        writeq(offset, reg_base + 0x20);
+
+        /* trigger device to map memory from host */
+        writel(0x80000001, reg_base + 0x8);
+
+        /* wait for reply from backend */
+        wait_for_completion(&attach_cmp);
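+
+   As an illustration only (not part of the guest driver in this patch;
+   reg_base and attach_cmp are the same placeholder names used above), the MSI
+   handler on the guest side could look like:
+
+        static irqreturn_t dmdev_irq_handler(int irq, void *data)
+        {
+                /* INT_STATUS (offset 4) is read-to-clear */
+                u32 status = readl(reg_base + 0x4);
+
+                if (status & (1 << 0))  /* memory attach success */
+                        complete(&attach_cmp);
+
+                return IRQ_HANDLED;
+        }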
+
+4. QEMU implementation
+dynamic_mdev utilizes QEMU's memory model to dynamically add memory regions to
+a memory region container; the model is described in qemu/docs/devel/memory.rst.
+The steps below describe the whole flow:
+   1> create a virtual PCI device
+   2> register a PCI BAR backed by a memory region container, which only
+      defines the BAR size
+   3> the guest driver requests memory via register interaction, telling QEMU
+      the memory size, backend memory offset, and so on
+   4> QEMU receives the request from the guest driver, then maps host memory
+      from the backend file via mmap(). QEMU uses the mapped RAM to create a
+      memory region through memory_region_init_ram_ptr(), and attaches this
+      memory region to the BAR container by calling
+      memory_region_add_subregion_overlap(). After that KVM builds the
+      gpa->hpa mapping
+   5> QEMU sends an MSI to the guest driver to signal that the dynamic memory
+      attach has completed
+You can refer to the source code below for more detail.
+
+
+Backend memory device
+The backend device can be a standard shared-memory file with ordinary mmap()
+support. It may also be a specific char device with its own mmap()
+implementation. In short, how to implement this device is the user's
+responsibility.
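+
+As a minimal example (not part of this patch), a plain shared-memory file
+created on the host can serve as the backend and be passed via mem-path:
+
+        /* creates /dev/shm/mdev, 8 GiB, sparse; error handling trimmed */
+        #include <fcntl.h>
+        #include <sys/mman.h>
+        #include <unistd.h>
+
+        int main(void)
+        {
+            int fd = shm_open("/mdev", O_CREAT | O_RDWR, 0600);
+
+            if (fd < 0) {
+                return 1;
+            }
+            if (ftruncate(fd, 8ULL << 30) < 0) {
+                return 1;
+            }
+            close(fd);
+            return 0;
+        }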
diff --git a/hw/misc/Kconfig b/hw/misc/Kconfig
index 507058d8bf..f03263cc1e 100644
--- a/hw/misc/Kconfig
+++ b/hw/misc/Kconfig
@@ -67,6 +67,11 @@  config IVSHMEM_DEVICE
     default y if PCI_DEVICES
     depends on PCI && LINUX && IVSHMEM && MSI_NONBROKEN

+config DYNAMIC_MDEV
+    bool
+    default y if PCI_DEVICES
+    depends on PCI && LINUX && MSI_NONBROKEN
+
 config ECCMEMCTL
     bool
     select ECC
diff --git a/hw/misc/dynamic_mdev.c b/hw/misc/dynamic_mdev.c
new file mode 100644
index 0000000000..8a56a6157b
--- /dev/null
+++ b/hw/misc/dynamic_mdev.c
@@ -0,0 +1,456 @@ 
+/*
+ * PCI device for dynamically attached memory
+ *
+ * Copyright Montage, Corp. 2014
+ *
+ * Authors:
+ *  David Dai <david.dai@montage-tech.com>
+ *  Changguo Du <changguo.du@montage-tech.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/units.h"
+#include "hw/pci/pci.h"
+#include "hw/hw.h"
+#include "hw/qdev-properties.h"
+#include "hw/qdev-properties-system.h"
+#include "hw/pci/msi.h"
+#include "qemu/module.h"
+#include "qom/object_interfaces.h"
+#include "qapi/visitor.h"
+#include "qom/object.h"
+#include "qemu/error-report.h"
+#include "qapi/error.h"
+
+#define PCI_VENDOR_ID_DMDEV   0x1b00
+#define PCI_DEVICE_ID_DMDEV   0x1110
+#define DYNAMIC_MDEV_BAR_SIZE 0x1000
+
+#define INTERRUPT_MEMORY_ATTACH_SUCCESS           (1 << 0)
+#define INTERRUPT_MEMORY_DEATTACH_SUCCESS         (1 << 1)
+#define INTERRUPT_MEMORY_ATTACH_NOMEM             (1 << 2)
+#define INTERRUPT_MEMORY_ATTACH_ALIGN_ERR         (1 << 3)
+#define INTERRUPT_ACCESS_NOT_MAPPED_ADDR          (1 << 4)
+
+#define DYNAMIC_CMD_ENABLE               (0x80000000)
+#define DYNAMIC_CMD_MASK                 (0xffff)
+#define DYNAMIC_CMD_MEM_ATTACH           (0x1)
+#define DYNAMIC_CMD_MEM_DEATTACH         (0x2)
+
+#define DYNAMIC_MDEV_DEBUG               1
+
+#define DYNAMIC_MDEV_DPRINTF(fmt, ...)                          \
+    do {                                                        \
+        if (DYNAMIC_MDEV_DEBUG) {                               \
+            printf("QEMU: " fmt, ## __VA_ARGS__);               \
+        }                                                       \
+    } while (0)
+
+#define TYPE_DYNAMIC_MDEV "dynamic-mdevice"
+
+typedef struct DmdevState DmdevState;
+DECLARE_INSTANCE_CHECKER(DmdevState, DYNAMIC_MDEV,
+                         TYPE_DYNAMIC_MDEV)
+
+struct DmdevState {
+    /*< private >*/
+    PCIDevice parent_obj;
+    /*< public >*/
+
+    /* registers */
+    uint32_t mask;
+    uint32_t status;
+    uint32_t align;
+    uint64_t size;
+    uint64_t hw_offset;
+    uint64_t mem_offset;
+
+    /* mdev name */
+    char *devname;
+    int fd;
+
+    /* memory bar size */
+    uint64_t bsize;
+
+    /* BAR 0 (registers) */
+    MemoryRegion dmdev_mmio;
+
+    /* BAR 2 (memory BAR for dynamic memory attach) */
+    MemoryRegion dmdev_mem;
+};
+
+/* registers for the dynamic memory device */
+enum dmdev_registers {
+    INT_MASK     =     0, /* RW */
+    INT_STATUS   =     4, /* RW: write 1 clear */
+    DOOR_BELL    =     8, /*
+                           * RW: trigger device to act
+                           *  31        15        0
+                           *  --------------------
+                           * |en|xxxxxxxx|  cmd   |
+                           *  --------------------
+                           */
+
+    /* RO: 4K, 2M or 1G alignment for the memory size */
+    MEM_ALIGN   =     12,
+
+    /* RO: offset in the memory BAR up to which RAM has been mapped */
+    HW_OFFSET    =    16,
+
+    /* RW: size of the dynamically attached memory */
+    MEM_SIZE     =    24,
+
+    /* RW: offset in the host mdev from which the memory is attached */
+    MEM_OFFSET   =    32,
+
+};
+
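+/*
+ * Guest requested a memory attach: mmap() a chunk of the backend file and add
+ * it to the memory BAR container as a new RAM subregion, then report the
+ * result to the guest via MSI.
+ */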
+static void dmdev_mem_attach(DmdevState *s)
+{
+    void *ptr;
+    struct MemoryRegion *mr;
+    uint64_t size = s->size;
+    uint64_t align = s->align;
+    uint64_t hwaddr = s->hw_offset;
+    uint64_t offset = s->mem_offset;
+    PCIDevice *pdev = PCI_DEVICE(s);
+
+    DYNAMIC_MDEV_DPRINTF("%s: size=0x%lx, align=0x%lx, hwaddr=0x%lx, "
+                         "offset=0x%lx\n",
+                         __func__, size, align, hwaddr, offset);
+
+    if (size % align || hwaddr % align) {
+        error_report("%s: size is not aligned, size=0x%lx, align=0x%lx, "
+                     "hwaddr=0x%lx", __func__, size, align, hwaddr);
+        s->status |= INTERRUPT_MEMORY_ATTACH_ALIGN_ERR;
+        msi_notify(pdev, 0);
+        return;
+    }
+
+    ptr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, s->fd, offset);
+    if (ptr == MAP_FAILED) {
+        error_report("Can't map memory, err(%d)", errno);
+        s->status |= INTERRUPT_MEMORY_ATTACH_NOMEM;
+        msi_notify(pdev, 0);
+        return;
+    }
+
+    mr = g_new0(MemoryRegion, 1);
+    memory_region_init_ram_ptr(mr, OBJECT(PCI_DEVICE(s)),
+                            "dynamic_mdev", size, ptr);
+    memory_region_add_subregion_overlap(&s->dmdev_mem, hwaddr, mr, 1);
+
+    s->hw_offset += size;
+
+    s->status |= INTERRUPT_MEMORY_ATTACH_SUCCESS;
+    msi_notify(pdev, 0);
+
+    DYNAMIC_MDEV_DPRINTF("%s msi_notify success ptr=%p\n", __func__, ptr);
+    return;
+}
+
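+/*
+ * Detach everything: remove every RAM subregion from the memory BAR container,
+ * munmap() the backing host memory and notify the guest via MSI.
+ */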
+static void dmdev_mem_deattach(DmdevState *s)
+{
+    struct MemoryRegion *mr = &s->dmdev_mem;
+    struct MemoryRegion *subregion;
+    void *host;
+    PCIDevice *pdev = PCI_DEVICE(s);
+
+    memory_region_transaction_begin();
+    while (!QTAILQ_EMPTY(&mr->subregions)) {
+        subregion = QTAILQ_FIRST(&mr->subregions);
+        memory_region_del_subregion(mr, subregion);
+        host = memory_region_get_ram_ptr(subregion);
+        munmap(host, memory_region_size(subregion));
+        DYNAMIC_MDEV_DPRINTF("%s:host=%p,size=0x%lx\n",
+                    __func__, host,  memory_region_size(subregion));
+    }
+
+    memory_region_transaction_commit();
+
+    s->hw_offset = 0;
+
+    s->status |= INTERRUPT_MEMORY_DEATTACH_SUCCESS;
+    msi_notify(pdev, 0);
+
+    return;
+}
+
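+/* Doorbell write: bit 31 enables the command in the low 16 bits. */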
+static void dmdev_doorbell_handle(DmdevState *s,  uint64_t val)
+{
+    if (!(val & DYNAMIC_CMD_ENABLE)) {
+        return;
+    }
+
+    switch (val & DYNAMIC_CMD_MASK) {
+
+    case DYNAMIC_CMD_MEM_ATTACH:
+        dmdev_mem_attach(s);
+        break;
+
+    case DYNAMIC_CMD_MEM_DEATTACH:
+        dmdev_mem_deattach(s);
+        break;
+
+    default:
+        break;
+    }
+
+    return;
+}
+
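+/*
+ * BAR0 register writes. The 64-bit registers (MEM_SIZE, MEM_OFFSET) may also
+ * be written as two 32-bit halves, hence the explicit "+ 4" cases.
+ */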
+static void dmdev_mmio_write(void *opaque, hwaddr addr,
+                        uint64_t val, unsigned size)
+{
+    DmdevState *s = opaque;
+
+    DYNAMIC_MDEV_DPRINTF("%s write addr=0x%lx, val=0x%lx, size=0x%x\n",
+                __func__, addr, val, size);
+
+    switch (addr) {
+    case INT_MASK:
+        s->mask = val;
+        return;
+
+    case INT_STATUS:
+        return;
+
+    case DOOR_BELL:
+        dmdev_doorbell_handle(s, val);
+        return;
+
+    case MEM_ALIGN:
+        return;
+
+    case HW_OFFSET:
+        /* read only */
+        return;
+
+    case HW_OFFSET + 4:
+        /* read only */
+        return;
+
+    case MEM_SIZE:
+        if (size == 4) {
+            s->size &= ~(0xffffffff);
+            val &= 0xffffffff;
+            s->size |= val;
+        } else { /* 64-bit */
+            s->size = val;
+        }
+        return;
+
+    case MEM_SIZE + 4:
+        s->size &= 0xffffffff;
+
+        s->size |= val << 32;
+        return;
+
+    case MEM_OFFSET:
+        if (size == 4) {
+            s->mem_offset &= ~(0xffffffff);
+            val &= 0xffffffff;
+            s->mem_offset |= val;
+        } else { /* 64-bit */
+            s->mem_offset = val;
+        }
+        return;
+
+    case MEM_OFFSET + 4:
+        s->mem_offset &= 0xffffffff;
+
+        s->mem_offset |= val << 32;
+        return;
+
+    default:
+        DYNAMIC_MDEV_DPRINTF("default 0x%lx\n", val);
+    }
+
+    return;
+}
+
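+/* BAR0 register reads; INT_STATUS is cleared on read. */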
+static uint64_t dmdev_mmio_read(void *opaque, hwaddr addr,
+                        unsigned size)
+{
+    DmdevState *s = opaque;
+    unsigned int value;
+
+    DYNAMIC_MDEV_DPRINTF("%s read addr=0x%lx, size=0x%x\n",
+                         __func__, addr, size);
+    switch (addr) {
+    case INT_MASK:
+        /* mask: read-write */
+        return s->mask;
+
+    case INT_STATUS:
+        /* status: read-clear */
+        value = s->status;
+        s->status = 0;
+        return value;
+
+    case DOOR_BELL:
+        /* doorbell: write-only */
+        return 0;
+
+    case MEM_ALIGN:
+        /* align: read-only */
+        return s->align;
+
+    case HW_OFFSET:
+        if (size == 4) {
+            return s->hw_offset & 0xffffffff;
+        } else { /* 64-bit */
+            return s->hw_offset;
+        }
+
+    case HW_OFFSET + 4:
+        return s->hw_offset >> 32;
+
+    case MEM_SIZE:
+        if (size == 4) {
+            return s->size & 0xffffffff;
+        } else { /* 64-bit */
+            return s->size;
+        }
+
+    case MEM_SIZE + 4:
+        return s->size >> 32;
+
+    case MEM_OFFSET:
+        if (size == 4) {
+            return s->mem_offset & 0xffffffff;
+        } else { /* 64-bit */
+            return s->mem_offset;
+        }
+
+    case MEM_OFFSET + 4:
+        return s->mem_offset >> 32;
+
+    default:
+        DYNAMIC_MDEV_DPRINTF("default read err address 0x%lx\n", addr);
+
+    }
+
+    return 0;
+}
+
+static const MemoryRegionOps dmdev_mmio_ops = {
+    .read = dmdev_mmio_read,
+    .write = dmdev_mmio_write,
+    .endianness = DEVICE_NATIVE_ENDIAN,
+    .impl = {
+        .min_access_size = 4,
+        .max_access_size = 8,
+    },
+};
+
+static void dmdev_reset(DeviceState *d)
+{
+    DmdevState *s = DYNAMIC_MDEV(d);
+
+    s->status = 0;
+    s->mask = 0;
+    s->hw_offset = 0;
+    dmdev_mem_deattach(s);
+}
+
+static void dmdev_realize(PCIDevice *dev, Error **errp)
+{
+    DmdevState *s = DYNAMIC_MDEV(dev);
+    int status;
+
+    Error *err = NULL;
+    uint8_t *pci_conf;
+
+    pci_conf = dev->config;
+    pci_conf[PCI_COMMAND] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
+
+    /* init msi */
+    status = msi_init(dev, 0, 1, true, false, &err);
+    if (status) {
+        error_propagate(errp, err);
+        return;
+    }
+
+    memory_region_init_io(&s->dmdev_mmio, OBJECT(s), &dmdev_mmio_ops, s,
+                          "dmdev-mmio", DYNAMIC_MDEV_BAR_SIZE);
+
+    /* region for registers*/
+    pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,
+                     &s->dmdev_mmio);
+
+    /* initialize a memory region container */
+    memory_region_init(&s->dmdev_mem, OBJECT(s),
+                       "dmdev-mem", s->bsize);
+
+    pci_register_bar(PCI_DEVICE(s), 2,
+                    PCI_BASE_ADDRESS_SPACE_MEMORY |
+                    PCI_BASE_ADDRESS_MEM_PREFETCH |
+                    PCI_BASE_ADDRESS_MEM_TYPE_64,
+                    &s->dmdev_mem);
+
+    if (s->devname) {
+        s->fd = open(s->devname, O_RDWR);
+        DYNAMIC_MDEV_DPRINTF("open file %s %s\n",
+                s->devname, s->fd < 0 ? "failed" : "success");
+    } else {
+        s->fd = -1;
+    }
+
+    s->hw_offset = 0;
+}
+
+static void dmdev_exit(PCIDevice *dev)
+{
+    DmdevState *s = DYNAMIC_MDEV(dev);
+
+    msi_uninit(dev);
+    dmdev_mem_deattach(s);
+    DYNAMIC_MDEV_DPRINTF("%s\n", __func__);
+
+}
+
+static Property dmdev_properties[] = {
+    DEFINE_PROP_UINT64("size", DmdevState, bsize, 0x40000000),
+    DEFINE_PROP_UINT32("align", DmdevState, align, 0x40000000),
+    DEFINE_PROP_STRING("mem-path", DmdevState, devname),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void dmdev_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+
+    k->realize = dmdev_realize;
+    k->exit = dmdev_exit;
+    k->vendor_id = PCI_VENDOR_ID_DMDEV;
+    k->device_id = PCI_DEVICE_ID_DMDEV;
+    k->class_id = PCI_CLASS_MEMORY_RAM;
+    k->revision = 1;
+    dc->reset = dmdev_reset;
+    device_class_set_props(dc, dmdev_properties);
+    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+    dc->desc = "pci device to dynamically attach memory";
+}
+
+static const TypeInfo dmdev_info = {
+    .name          = TYPE_DYNAMIC_MDEV,
+    .parent        = TYPE_PCI_DEVICE,
+    .instance_size = sizeof(DmdevState),
+    .class_init    = dmdev_class_init,
+    .interfaces    = (InterfaceInfo[]) {
+        { INTERFACE_PCIE_DEVICE },
+        { },
+    },
+};
+
+static void dmdev_register_types(void)
+{
+    type_register_static(&dmdev_info);
+}
+
+type_init(dmdev_register_types)
diff --git a/hw/misc/meson.build b/hw/misc/meson.build
index a53b849a5a..38f6701a4b 100644
--- a/hw/misc/meson.build
+++ b/hw/misc/meson.build
@@ -124,3 +124,4 @@  specific_ss.add(when: 'CONFIG_MIPS_CPS', if_true: files('mips_cmgcr.c', 'mips_cp
 specific_ss.add(when: 'CONFIG_MIPS_ITU', if_true: files('mips_itu.c'))

 specific_ss.add(when: 'CONFIG_SBSA_REF', if_true: files('sbsa_ec.c'))
+specific_ss.add(when: 'CONFIG_DYNAMIC_MDEV', if_true: files('dynamic_mdev.c'))