
[RFC,1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'

Message ID 20240101075315.43167-2-horenchuang@bytedance.com
State New
Series Introduce HostMemType for 'memory-backend-*'

Commit Message

Ho-Ren (Jack) Chuang Jan. 1, 2024, 7:53 a.m. UTC
Introduce a new configuration option 'host-mem-type=' in the
'-object memory-backend-ram', allowing users to specify
from which type of memory to allocate.

Users can specify 'cxlram' as an argument, and QEMU will then
automatically locate CXL RAM NUMA nodes and use them as the backend memory.
For example:
	-object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
	-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
	-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
	-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
	-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \

In v1, we plan to move most of the implementations to util and break down
this patch into different smaller patches.

Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
 backends/hostmem.c       | 184 +++++++++++++++++++++++++++++++++++++++
 include/sysemu/hostmem.h |   1 +
 qapi/common.json         |  19 ++++
 qapi/qom.json            |   1 +
 qemu-options.hx          |   2 +-
 5 files changed, 206 insertions(+), 1 deletion(-)

Comments

Philippe Mathieu-Daudé Jan. 2, 2024, 10:29 a.m. UTC | #1
Hi Jack,

On 1/1/24 08:53, Ho-Ren (Jack) Chuang wrote:
> Introduce a new configuration option 'host-mem-type=' in the
> '-object memory-backend-ram', allowing users to specify
> from which type of memory to allocate.
> 
> Users can specify 'cxlram' as an argument, and QEMU will then
> automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> For example:
> 	-object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
> 	-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> 	-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> 	-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> 	-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
> 
> In v1, we plan to move most of the implementations to util and break down
> this patch into different smaller patches.
> 
> Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
> Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
> ---
>   backends/hostmem.c       | 184 +++++++++++++++++++++++++++++++++++++++
>   include/sysemu/hostmem.h |   1 +
>   qapi/common.json         |  19 ++++
>   qapi/qom.json            |   1 +
>   qemu-options.hx          |   2 +-
>   5 files changed, 206 insertions(+), 1 deletion(-)
> 
> diff --git a/backends/hostmem.c b/backends/hostmem.c


> +#define CXL_DEVICE_PATH "/sys/bus/cxl/devices/"
> +#define REGION_PATH_LEN 307
> +#define DAX_REGION_PATH_LEN 563
> +#define DAX_PATH_LEN 819
> +#define TARGET_FILE_PATH_LEN 831

How do you get these numbers?
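
These constants look like hand-computed worst-case lengths for the concatenated
sysfs path components. Purely as an illustrative sketch (not part of the patch),
the same lookup can be done with no magic lengths at all by letting GLib size the
strings; read_target_node() is a hypothetical helper name:

/*
 * Illustrative only, not from the patch: build the sysfs path with GLib
 * helpers so no fixed-size buffers (and no magic lengths) are needed.
 * Assumes the usual qemu/osdep.h includes (GLib, <stdlib.h>).
 */
static int read_target_node(const char *dax_dir)
{
    g_autofree char *path = g_strdup_printf("%s/target_node", dax_dir);
    g_autofree char *contents = NULL;

    if (!g_file_get_contents(path, &contents, NULL, NULL)) {
        return -1;
    }
    /* the file holds -1 while no NUMA node is assigned */
    return atoi(g_strstrip(contents));
}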

> diff --git a/qapi/common.json b/qapi/common.json
> index 6fed9cde1a..591fd73291 100644
> --- a/qapi/common.json
> +++ b/qapi/common.json
> @@ -167,6 +167,25 @@
>   { 'enum': 'HostMemPolicy',
>     'data': [ 'default', 'preferred', 'bind', 'interleave' ] }
>   
> +##
> +# @HostMemType:
> +#
> +# Automatically find a backend memory type on host.

This description is not clear (to me).

> +# Can be further extened to support other types such as cxlpmem, hbm.

Typo "extended" although I'm not sure it is helpful to mention it.

> +#
> +# @none: do nothing (default).

"do nothing" is confusing here, I'd drop it.

> +#
> +# @cxlram: a CXL RAM backend on host.
> +#
> +# Note: HostMemType and HostMemPolicy/host-nodes cannot be set at the same
> +# time. HostMemType is used to automatically bind with one kind of
> +# host memory types.
> +#
> +# Since: 8.3

9.0

> +##
> +{ 'enum': 'HostMemType',
> +  'data': [ 'none', 'cxlram' ] }
> +
>   ##
>   # @NetFilterDirection:
>   #
> diff --git a/qapi/qom.json b/qapi/qom.json
> index 95516ba325..fa3bc29708 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -626,6 +626,7 @@
>               '*host-nodes': ['uint16'],
>               '*merge': 'bool',
>               '*policy': 'HostMemPolicy',
> +            '*host-mem-type': 'HostMemType',

Missing documentation in MemoryBackendProperties.

>               '*prealloc': 'bool',
>               '*prealloc-threads': 'uint32',
>               '*prealloc-context': 'str',
David Hildenbrand Jan. 2, 2024, 1:03 p.m. UTC | #2
On 01.01.24 08:53, Ho-Ren (Jack) Chuang wrote:
> Introduce a new configuration option 'host-mem-type=' in the
> '-object memory-backend-ram', allowing users to specify
> from which type of memory to allocate.
> 
> Users can specify 'cxlram' as an argument, and QEMU will then
> automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> For example:
> 	-object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
> 	-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> 	-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> 	-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> 	-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
> 

You can achieve the exact same thing already simply by using memory 
policies and detecting the node(s) before calling QEMU, no?

There has to be a good reason to add such a shortcut into QEMU, and it 
should be spelled out here.
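
For reference, "detecting the node(s) before calling QEMU" can be a few lines of
host-side code. A minimal sketch, assuming the CXL RAM is already exposed through
kmem dax devices whose target_node attribute names the backing NUMA node (paths
may differ on other setups):

/* Minimal host-side sketch: print the NUMA node behind each dax device.
 * Assumes /sys/bus/dax/devices/daxX.Y/target_node exists for the CXL RAM. */
#include <glob.h>
#include <stdio.h>

int main(void)
{
    glob_t g;

    if (glob("/sys/bus/dax/devices/dax*/target_node", 0, NULL, &g) == 0) {
        for (size_t i = 0; i < g.gl_pathc; i++) {
            FILE *f = fopen(g.gl_pathv[i], "r");
            int node = -1;

            if (f && fscanf(f, "%d", &node) == 1 && node >= 0) {
                printf("%s -> node %d\n", g.gl_pathv[i], node);
            }
            if (f) {
                fclose(f);
            }
        }
        globfree(&g);
    }
    return 0;
}

The resulting node id can then be fed to host-nodes=...,policy=bind, which is the
existing mechanism being referred to here.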
Gregory Price Jan. 3, 2024, 9:56 p.m. UTC | #3
On Sun, Dec 31, 2023 at 11:53:15PM -0800, Ho-Ren (Jack) Chuang wrote:
> Introduce a new configuration option 'host-mem-type=' in the
> '-object memory-backend-ram', allowing users to specify
> from which type of memory to allocate.
> 
> Users can specify 'cxlram' as an argument, and QEMU will then
> automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> For example:
> 	-object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \

Stupid questions:

Why not just use `host-nodes` and pass in the numa node you want to
allocate from? Why should QEMU be made "CXL-aware" in the sense that
QEMU is responsible for figuring out what host node has CXL memory?

This feels like an "upper level software" operation (orchestration), rather
than something qemu should internally understand.

> 	-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> 	-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> 	-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \

For a variety of performance reasons, this will not work the way you
want it to.  You are essentially telling QEMU to map the vmem0 into a
virtual cxl device, and now any memory accesses to that memory region
will end up going through the cxl-type3 device logic - which is an IO
path from the perspective of QEMU.

You may want to double check that your tests aren't using swap space in
the guest, because I found in my testing that linux would prefer to use
swap rather than attempt to use virtual cxl-type3 device memory because
of how god-awful slow it is (even if it is backed by DRAM).


Additionally, this configuration will not (should not) presently work
with VMX enabled.  Unless I missed some other update, QEMU w/ CXL memory
presently crashes when VMX is enabled for not-yet-debugged reasons.

Another possibility: You mapped this memory-backend into another numa
node explicitly and never onlined the memory via cxlcli.  I've done
this, and it works, but it's a "hidden feature" that probably should
not exist / be supported.



If I understand the goal here, it's to pass CXL-hosted DRAM through to
the guest in a way that the system can manage it according to its
performance attributes.

You're better off just using the `host-nodes` field of host-memory
and passing bandwidth/latency attributes through `-numa hmat-lb`

In that scenario, the guest software doesn't even need to know CXL
exists at all, it can just read the attributes of the numa node
that QEMU created for it.

In the future to deal with proper dynamic capacity, we may need to
consider a different backend object altogether that allows sparse
allocations, and a virtual cxl device which pre-allocates the CFMW
can at least be registered to manage it.  I'm not quite sure how that
looks just yet.

For example: 1-socket, 4 CPU QEMU instance w/ 4GB on a cpu-node and 4GB
on a cpuless node.

qemu-system-x86_64 \
-nographic \
-accel kvm \
-machine type=q35,hmat=on \
-drive file=./myvm.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 8G,slots=4,maxmem=16G \
-smp cpus=4 \
-object memory-backend-ram,size=4G,id=ram-node0,numa=X \ <-- extend here
-object memory-backend-ram,size=4G,id=ram-node1,numa=Y \ <-- extend here
-numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \          <-- cpu node
-numa node,initiator=0,nodeid=1,memdev=ram-node1 \       <-- cpuless node
-netdev bridge,id=hn0,br=virbr0 \
-device virtio-net-pci,netdev=hn0,id=nic1,mac=52:54:00:12:34:77 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880

[root@fedora ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 3965 MB
node 0 free: 3611 MB
node 1 cpus:
node 1 size: 3986 MB
node 1 free: 3960 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

[root@fedora initiators]# cd /sys/devices/system/node/node1/access0/initiators
node0  read_bandwidth  read_latency  write_bandwidth  write_latency
[root@fedora initiators]# cat read_bandwidth
5
[root@fedora initiators]# cat read_latency
20
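
For completeness, guest software can read those same attributes programmatically;
a minimal standalone sketch, hard-coding node1 and access class 0 from the example
above:

/* Minimal sketch: read the HMAT-derived attributes QEMU exposed for node 1,
 * using only the sysfs files shown above (node1/access0 are assumptions
 * taken from this example). */
#include <stdio.h>

static long read_attr(const char *name)
{
    char path[128];
    long val = -1;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/node/node1/access0/initiators/%s", name);
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &val) != 1) {
            val = -1;
        }
        fclose(f);
    }
    return val;
}

int main(void)
{
    printf("node1: read_latency=%ld ns, read_bandwidth=%ld MB/s\n",
           read_attr("read_latency"), read_attr("read_bandwidth"));
    return 0;
}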


~Gregory
Hao Xiang Jan. 6, 2024, 12:45 a.m. UTC | #4
On Tue, Jan 2, 2024 at 5:04 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.01.24 08:53, Ho-Ren (Jack) Chuang wrote:
> > Introduce a new configuration option 'host-mem-type=' in the
> > '-object memory-backend-ram', allowing users to specify
> > from which type of memory to allocate.
> >
> > Users can specify 'cxlram' as an argument, and QEMU will then
> > automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> > For example:
> >       -object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
> >       -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> >       -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> >       -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> >       -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
> >
>
> You can achieve the exact same thing already simply by using memory
> policies and detecting the node(s) before calling QEMU, no?

Yes, I agree this can be done with a memory policy that binds to the CXL
memory numa nodes on the host.

>
> There has to be a good reason to add such a shortcut into QEMU, and it
> should be spelled out here.

So our end goal here is to enable CXL memory in the guest VM and have
the guest kernel place the CXL memory in the correct memory
tier (slow tier) of the Linux kernel's tiered memory system.

Here is what we observed:
* The kernel tiered memory system relies on calculating the memory
attributes (read/write latency, bandwidth from ACPI) for fast vs slow
tier.
* The kernel tiered memory system has two paths to recognize a memory
tier: 1) in mm subsystem init (memory_tier_init), and 2) in the kmem
driver device probe (dev_dax_kmem_probe). Since the ACPI subsystem is
initialized after mm, reading the memory attributes from ACPI can only
be done in 2). CXL memory has to be presented as a devdax device,
which can then be probed by the kmem driver in the guest and
recognized as the slow tier.

We do see that QEMU has this option "-numa hmat-lb" to set the memory
attributes per VM's numa node. The problem is that setting the memory
attributes per numa node means that the numa node is created during
guest kernel initialization. A CXL devdax device can only be created
post kernel initialization and new numa nodes are created for the CXL
devdax devices. The guest kernel is not reading the memory attributes
from "-numa hmat-lb" for the CXL devdax devices.

So we thought if we create an explicit CXL memory backend, and
associate that with the virtual CXL type-3 frontend, we can pass the
CXL memory attributes from the host into the guest VM and avoid using
memory policy and "-numa hmat-lb", thus simplifying the configuration.
We are still figuring out exactly how to pass the memory attributes
from the CXL backend into the VM. There is probably a solution to
"-numa hmat-lb" for the CXL devdax devices as well and we are also
looking into it.

>
> --
> Cheers,
>
> David / dhildenb
>
Hao Xiang Jan. 6, 2024, 5:59 a.m. UTC | #5
On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
>
> On Sun, Dec 31, 2023 at 11:53:15PM -0800, Ho-Ren (Jack) Chuang wrote:
> > Introduce a new configuration option 'host-mem-type=' in the
> > '-object memory-backend-ram', allowing users to specify
> > from which type of memory to allocate.
> >
> > Users can specify 'cxlram' as an argument, and QEMU will then
> > automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> > For example:
> >       -object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
>
> Stupid questions:
>
> Why not just use `host-nodes` and pass in the numa node you want to
> allocate from? Why should QEMU be made "CXL-aware" in the sense that
> QEMU is responsible for figuring out what host node has CXL memory?
>
> This feels like an "upper level software" operation (orchestration), rather
> than something qemu should internally understand.

I don't have a "big picture" and I am learning. Maybe we proposed
something not useful :-) I replied to the same question on a fork of
this thread.

>
> >       -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> >       -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> >       -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
>
> For a variety of performance reasons, this will not work the way you
> want it to.  You are essentially telling QEMU to map the vmem0 into a
> virtual cxl device, and now any memory accesses to that memory region
> will end up going through the cxl-type3 device logic - which is an IO
> path from the perspective of QEMU.

I didn't understand exactly how the virtual cxl-type3 device works. I
thought it would go with the same "guest virtual address ->  guest
physical address -> host physical address" translation totally done by
CPU. But if it is going through an emulation path handled by virtual
cxl-type3, I agree the performance would be bad. Do you know why
accessing memory on a virtual cxl-type3 device can't go with the
nested page table translation?

>
> You may want to double check that your tests aren't using swap space in
> the guest, because I found in my testing that linux would prefer to use
> swap rather than attempt to use virtual cxl-type3 device memory because
> of how god-awful slow it is (even if it is backed by DRAM).

We didn't enable swap in our current test setup. I think there was a
kernel change making the mm page reclamation path use cxl memory
instead of swap if you enable memory tiering. Did you try that? Swap
is persistent storage. I would be very surprised if virtual cxl is
actually slower than swap.

>
>
> Additionally, this configuration will not (should not) presently work
> with VMX enabled.  Unless I missed some other update, QEMU w/ CXL memory
> presently crashes when VMX is enabled for not-yet-debugged reasons.

When we had a discussion with Intel, they told us to not use the KVM
option in QEMU while using virtual cxl type3 device. That's probably
related to the issue you described here? We enabled KVM though but
haven't seen the crash yet.

>
> Another possibility: You mapped this memory-backend into another numa
> node explicitly and never onlined the memory via cxlcli.  I've done
> this, and it works, but it's a "hidden feature" that probably should
> not exist / be supported.

I thought general purpose memory nodes are onlined by default?

>
>
>
> If I understand the goal here, it's to pass CXL-hosted DRAM through to
> the guest in a way that the system can manage it according to its
> performance attributes.

Yes.

>
> You're better off just using the `host-nodes` field of host-memory
> and passing bandwidth/latency attributes through `-numa hmat-lb`

We tried this but it doesn't work from end to end right now. I
described the issue in another fork of this thread.

>
> In that scenario, the guest software doesn't even need to know CXL
> exists at all, it can just read the attributes of the numa node
> that QEMU created for it.

We thought about this before. But the current kernel implementation
requires a devdax device to be probed and recognized as a slow tier
(by reading the memory attributes). I don't think this can be done via
the path you described. Have you tried this before?

>
> In the future to deal with proper dynamic capacity, we may need to
> consider a different backend object altogether that allows sparse
> allocations, and a virtual cxl device which pre-allocates the CFMW
> can at least be registered to manage it.  I'm not quite sure how that
> looks just yet.

Are we talking about CXL memory pooling?

>
> For example: 1-socket, 4 CPU QEMU instance w/ 4GB on a cpu-node and 4GB
> on a cpuless node.
>
> qemu-system-x86_64 \
> -nographic \
> -accel kvm \
> -machine type=q35,hmat=on \
> -drive file=./myvm.qcow2,format=qcow2,index=0,media=disk,id=hd \
> -m 8G,slots=4,maxmem=16G \
> -smp cpus=4 \
> -object memory-backend-ram,size=4G,id=ram-node0,numa=X \ <-- extend here
> -object memory-backend-ram,size=4G,id=ram-node1,numa=Y \ <-- extend here
> -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \          <-- cpu node
> -numa node,initiator=0,nodeid=1,memdev=ram-node1 \       <-- cpuless node
> -netdev bridge,id=hn0,br=virbr0 \
> -device virtio-net-pci,netdev=hn0,id=nic1,mac=52:54:00:12:34:77 \
> -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>
> [root@fedora ~]# numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3
> node 0 size: 3965 MB
> node 0 free: 3611 MB
> node 1 cpus:
> node 1 size: 3986 MB
> node 1 free: 3960 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> [root@fedora initiators]# cd /sys/devices/system/node/node1/access0/initiators
> node0  read_bandwidth  read_latency  write_bandwidth  write_latency
> [root@fedora initiators]# cat read_bandwidth
> 5
> [root@fedora initiators]# cat read_latency
> 20
>
>
> ~Gregory
Gregory Price Jan. 8, 2024, 5:15 p.m. UTC | #6
On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
> >
> > For a variety of performance reasons, this will not work the way you
> > want it to.  You are essentially telling QEMU to map the vmem0 into a
> > virtual cxl device, and now any memory accesses to that memory region
> > will end up going through the cxl-type3 device logic - which is an IO
> > path from the perspective of QEMU.
> 
> I didn't understand exactly how the virtual cxl-type3 device works. I
> thought it would go with the same "guest virtual address ->  guest
> physical address -> host physical address" translation totally done by
> CPU. But if it is going through an emulation path handled by virtual
> cxl-type3, I agree the performance would be bad. Do you know why
> accessing memory on a virtual cxl-type3 device can't go with the
> nested page table translation?
>

Because a byte-access on CXL memory can have checks on it that must be
emulated by the virtual device, and because there are caching
implications that have to be emulated as well.

The cxl device you are using is an emulated CXL device - not a
virtualization interface.  Nuanced difference:  the emulated device has
to emulate *everything* that CXL device does.

What you want is passthrough / managed access to a real device -
virtualization.  This is not the way to accomplish that.  A better way
to accomplish that is to simply pass the memory through as a static numa
node as I described.

> 
> When we had a discussion with Intel, they told us to not use the KVM
> option in QEMU while using virtual cxl type3 device. That's probably
> related to the issue you described here? We enabled KVM though but
> haven't seen the crash yet.
>

The crash really only happens, IIRC, if code ends up hosted in that
memory.  I forget the exact scenario, but the working theory is it has
to do with the way instruction caches are managed with KVM and this
device.

> >
> > You're better off just using the `host-nodes` field of host-memory
> > and passing bandwidth/latency attributes through `-numa hmat-lb`
> 
> We tried this but it doesn't work from end to end right now. I
> described the issue in another fork of this thread.
>
> >
> > In that scenario, the guest software doesn't even need to know CXL
> > exists at all, it can just read the attributes of the numa node
> > that QEMU created for it.
> 
> We thought about this before. But the current kernel implementation
> requires a devdax device to be probed and recognized as a slow tier
> (by reading the memory attributes). I don't think this can be done via
> the path you described. Have you tried this before?
>

Right, because the memory tiering component lumps the nodes together.

Better idea:  Fix the memory tiering component

I cc'd you on another patch line that is discussing something relevant
to this.

https://lore.kernel.org/linux-mm/87fs00njft.fsf@yhuang6-desk2.ccr.corp.intel.com/T/#m32d58f8cc607aec942995994a41b17ff711519c8

The point is: There's no need for this to be a dax device at all, there
is no need for the guest to even know what is providing the memory, or
for the guest to have any management access to the memory.  It just
wants the memory and the ability to tier it.

So we should fix the memory tiering component to work with this
workflow.

~Gregory
Hao Xiang Jan. 8, 2024, 10:47 p.m. UTC | #7
On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.price@memverge.com> wrote:
>
> On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
> > >
> > > For a variety of performance reasons, this will not work the way you
> > > want it to.  You are essentially telling QEMU to map the vmem0 into a
> > > virtual cxl device, and now any memory accesses to that memory region
> > > will end up going through the cxl-type3 device logic - which is an IO
> > > path from the perspective of QEMU.
> >
> > I didn't understand exactly how the virtual cxl-type3 device works. I
> > thought it would go with the same "guest virtual address ->  guest
> > physical address -> host physical address" translation totally done by
> > CPU. But if it is going through an emulation path handled by virtual
> > cxl-type3, I agree the performance would be bad. Do you know why
> > accessing memory on a virtual cxl-type3 device can't go with the
> > nested page table translation?
> >
>
> Because a byte-access on CXL memory can have checks on it that must be
> emulated by the virtual device, and because there are caching
> implications that have to be emulated as well.

Interesting. Now that I see the cxl_type3_read/cxl_type3_write. If the
CXL memory data path goes through them, the performance would be
pretty problematic. We have actually run Intel's Memory Latency
Checker benchmark from inside a guest VM with both system-DRAM and
virtual CXL-type3 configured. The idle latency on the virtual CXL
memory is 2X of system DRAM, which is on-par with the benchmark
running from a physical host. I need to debug this more to understand
why the latency is actually much better than I would expect now.

>
> The cxl device you are using is an emulated CXL device - not a
> virtualization interface.  Nuanced difference:  the emulated device has
> to emulate *everything* that CXL device does.
>
> What you want is passthrough / managed access to a real device -
> virtualization.  This is not the way to accomplish that.  A better way
> to accomplish that is to simply pass the memory through as a static numa
> node as I described.

That would work, too. But I think a kernel change is required to
establish the correct memory tiering if we go this routine.

>
> >
> > When we had a discussion with Intel, they told us to not use the KVM
> > option in QEMU while using virtual cxl type3 device. That's probably
> > related to the issue you described here? We enabled KVM though but
> > haven't seen the crash yet.
> >
>
> The crash really only happens, IIRC, if code ends up hosted in that
> memory.  I forget the exact scenario, but the working theory is it has
> to do with the way instruction caches are managed with KVM and this
> device.
>
> > >
> > > You're better off just using the `host-nodes` field of host-memory
> > > and passing bandwidth/latency attributes through `-numa hmat-lb`
> >
> > We tried this but it doesn't work from end to end right now. I
> > described the issue in another fork of this thread.
> >
> > >
> > > In that scenario, the guest software doesn't even need to know CXL
> > > exists at all, it can just read the attributes of the numa node
> > > that QEMU created for it.
> >
> > We thought about this before. But the current kernel implementation
> > requires a devdax device to be probed and recognized as a slow tier
> > (by reading the memory attributes). I don't think this can be done via
> > the path you described. Have you tried this before?
> >
>
> Right, because the memory tiering component lumps the nodes together.
>
> Better idea:  Fix the memory tiering component
>
> I cc'd you on another patch line that is discussing something relevant
> to this.
>
> https://lore.kernel.org/linux-mm/87fs00njft.fsf@yhuang6-desk2.ccr.corp.intel.com/T/#m32d58f8cc607aec942995994a41b17ff711519c8
>
> The point is: There's no need for this to be a dax device at all, there
> is no need for the guest to even know what is providing the memory, or
> for the guest to have any management access to the memory.  It just
> wants the memory and the ability to tier it.
>
> So we should fix the memory tiering component to work with this
> workflow.

Agreed. We really don't need the devdax device at all. I thought that
choice was made due to the memory tiering concept being started with
pmem ... Let's continue this part of the discussion on the above
thread.

>
> ~Gregory
Hao Xiang Jan. 9, 2024, 1:05 a.m. UTC | #8
On Mon, Jan 8, 2024 at 2:47 PM Hao Xiang <hao.xiang@bytedance.com> wrote:
>
> On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.price@memverge.com> wrote:
> >
> > On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
> > > >
> > > > For a variety of performance reasons, this will not work the way you
> > > > want it to.  You are essentially telling QEMU to map the vmem0 into a
> > > > virtual cxl device, and now any memory accesses to that memory region
> > > > will end up going through the cxl-type3 device logic - which is an IO
> > > > path from the perspective of QEMU.
> > >
> > > I didn't understand exactly how the virtual cxl-type3 device works. I
> > > thought it would go with the same "guest virtual address ->  guest
> > > physical address -> host physical address" translation totally done by
> > > CPU. But if it is going through an emulation path handled by virtual
> > > cxl-type3, I agree the performance would be bad. Do you know why
> > > accessing memory on a virtual cxl-type3 device can't go with the
> > > nested page table translation?
> > >
> >
> > Because a byte-access on CXL memory can have checks on it that must be
> > emulated by the virtual device, and because there are caching
> > implications that have to be emulated as well.
>
> Interesting. Now that I see the cxl_type3_read/cxl_type3_write. If the
> CXL memory data path goes through them, the performance would be
> pretty problematic. We have actually run Intel's Memory Latency
> Checker benchmark from inside a guest VM with both system-DRAM and
> virtual CXL-type3 configured. The idle latency on the virtual CXL
> memory is 2X of system DRAM, which is on-par with the benchmark
> running from a physical host. I need to debug this more to understand
> why the latency is actually much better than I would expect now.

So we double checked on benchmark testing. What we see is that running
Intel Memory Latency Checker from a guest VM with virtual CXL memory
VS from a physical host with CXL1.1 memory expander has the same
latency.

From guest VM: local socket system-DRAM latency is 117.0ns, local
socket CXL-DRAM latency is 269.4ns
From physical host: local socket system-DRAM latency is 113.6ns ,
local socket CXL-DRAM latency is 267.5ns

I also set debugger breakpoints on cxl_type3_read/cxl_type3_write
while running the benchmark testing but those two functions are not
ever hit. We used the virtual CXL configuration while launching QEMU
but the CXL memory is present as a separate NUMA node and we are not
creating devdax devices. Does that make any difference?

>
> >
> > The cxl device you are using is an emulated CXL device - not a
> > virtualization interface.  Nuanced difference:  the emulated device has
> > to emulate *everything* that CXL device does.
> >
> > What you want is passthrough / managed access to a real device -
> > virtualization.  This is not the way to accomplish that.  A better way
> > to accomplish that is to simply pass the memory through as a static numa
> > node as I described.
>
> That would work, too. But I think a kernel change is required to
> establish the correct memory tiering if we go this routine.
>
> >
> > >
> > > When we had a discussion with Intel, they told us to not use the KVM
> > > option in QEMU while using virtual cxl type3 device. That's probably
> > > related to the issue you described here? We enabled KVM though but
> > > haven't seen the crash yet.
> > >
> >
> > The crash really only happens, IIRC, if code ends up hosted in that
> > memory.  I forget the exact scenario, but the working theory is it has
> > to do with the way instruction caches are managed with KVM and this
> > device.
> >
> > > >
> > > > You're better off just using the `host-nodes` field of host-memory
> > > > and passing bandwidth/latency attributes through `-numa hmat-lb`
> > >
> > > We tried this but it doesn't work from end to end right now. I
> > > described the issue in another fork of this thread.
> > >
> > > >
> > > > In that scenario, the guest software doesn't even need to know CXL
> > > > exists at all, it can just read the attributes of the numa node
> > > > that QEMU created for it.
> > >
> > > We thought about this before. But the current kernel implementation
> > > requires a devdax device to be probed and recognized as a slow tier
> > > (by reading the memory attributes). I don't think this can be done via
> > > the path you described. Have you tried this before?
> > >
> >
> > Right, because the memory tiering component lumps the nodes together.
> >
> > Better idea:  Fix the memory tiering component
> >
> > I cc'd you on another patch line that is discussing something relevant
> > to this.
> >
> > https://lore.kernel.org/linux-mm/87fs00njft.fsf@yhuang6-desk2.ccr.corp.intel.com/T/#m32d58f8cc607aec942995994a41b17ff711519c8
> >
> > The point is: There's no need for this to be a dax device at all, there
> > is no need for the guest to even know what is providing the memory, or
> > for the guest to have any management access to the memory.  It just
> > wants the memory and the ability to tier it.
> >
> > So we should fix the memory tiering component to work with this
> > workflow.
>
> Agreed. We really don't need the devdax device at all. I thought that
> choice was made due to the memory tiering concept being started with
> pmem ... Let's continue this part of the discussion on the above
> thread.
>
> >
> > ~Gregory
Gregory Price Jan. 9, 2024, 1:13 a.m. UTC | #9
On Mon, Jan 08, 2024 at 05:05:38PM -0800, Hao Xiang wrote:
> On Mon, Jan 8, 2024 at 2:47 PM Hao Xiang <hao.xiang@bytedance.com> wrote:
> >
> > On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.price@memverge.com> wrote:
> > >
> > > On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > > > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
> > > > >
> > > > > For a variety of performance reasons, this will not work the way you
> > > > > want it to.  You are essentially telling QEMU to map the vmem0 into a
> > > > > virtual cxl device, and now any memory accesses to that memory region
> > > > > will end up going through the cxl-type3 device logic - which is an IO
> > > > > path from the perspective of QEMU.
> > > >
> > > > I didn't understand exactly how the virtual cxl-type3 device works. I
> > > > thought it would go with the same "guest virtual address ->  guest
> > > > physical address -> host physical address" translation totally done by
> > > > CPU. But if it is going through an emulation path handled by virtual
> > > > cxl-type3, I agree the performance would be bad. Do you know why
> > > > accessing memory on a virtual cxl-type3 device can't go with the
> > > > nested page table translation?
> > > >
> > >
> > > Because a byte-access on CXL memory can have checks on it that must be
> > > emulated by the virtual device, and because there are caching
> > > implications that have to be emulated as well.
> >
> > Interesting. Now that I see the cxl_type3_read/cxl_type3_write. If the
> > CXL memory data path goes through them, the performance would be
> > pretty problematic. We have actually run Intel's Memory Latency
> > Checker benchmark from inside a guest VM with both system-DRAM and
> > virtual CXL-type3 configured. The idle latency on the virtual CXL
> > memory is 2X of system DRAM, which is on-par with the benchmark
> > running from a physical host. I need to debug this more to understand
> > why the latency is actually much better than I would expect now.
> 
> So we double checked on benchmark testing. What we see is that running
> Intel Memory Latency Checker from a guest VM with virtual CXL memory
> VS from a physical host with CXL1.1 memory expander has the same
> latency.
> 
> From guest VM: local socket system-DRAM latency is 117.0ns, local
> socket CXL-DRAM latency is 269.4ns
> From physical host: local socket system-DRAM latency is 113.6ns ,
> local socket CXL-DRAM latency is 267.5ns
> 
> I also set debugger breakpoints on cxl_type3_read/cxl_type3_write
> while running the benchmark testing but those two functions are not
> ever hit. We used the virtual CXL configuration while launching QEMU
> but the CXL memory is present as a separate NUMA node and we are not
> creating devdax devices. Does that make any difference?
> 

Could you possibly share your full QEMU configuration and what OS/kernel
you are running inside the guest?

The only thing I'm surprised by is that the numa node appears without
requiring the driver to generate the NUMA node.  It's possible I missed
a QEMU update that allows this.

~Gregory
Hao Xiang Jan. 9, 2024, 7:33 p.m. UTC | #10
On Mon, Jan 8, 2024 at 5:13 PM Gregory Price <gregory.price@memverge.com> wrote:
>
> On Mon, Jan 08, 2024 at 05:05:38PM -0800, Hao Xiang wrote:
> > On Mon, Jan 8, 2024 at 2:47 PM Hao Xiang <hao.xiang@bytedance.com> wrote:
> > >
> > > On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.price@memverge.com> wrote:
> > > >
> > > > On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > > > > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
> > > > > >
> > > > > > For a variety of performance reasons, this will not work the way you
> > > > > > want it to.  You are essentially telling QEMU to map the vmem0 into a
> > > > > > virtual cxl device, and now any memory accesses to that memory region
> > > > > > will end up going through the cxl-type3 device logic - which is an IO
> > > > > > path from the perspective of QEMU.
> > > > >
> > > > > I didn't understand exactly how the virtual cxl-type3 device works. I
> > > > > thought it would go with the same "guest virtual address ->  guest
> > > > > physical address -> host physical address" translation totally done by
> > > > > CPU. But if it is going through an emulation path handled by virtual
> > > > > cxl-type3, I agree the performance would be bad. Do you know why
> > > > > accessing memory on a virtual cxl-type3 device can't go with the
> > > > > nested page table translation?
> > > > >
> > > >
> > > > Because a byte-access on CXL memory can have checks on it that must be
> > > > emulated by the virtual device, and because there are caching
> > > > implications that have to be emulated as well.
> > >
> > > Interesting. Now that I see the cxl_type3_read/cxl_type3_write. If the
> > > CXL memory data path goes through them, the performance would be
> > > pretty problematic. We have actually run Intel's Memory Latency
> > > Checker benchmark from inside a guest VM with both system-DRAM and
> > > virtual CXL-type3 configured. The idle latency on the virtual CXL
> > > memory is 2X of system DRAM, which is on-par with the benchmark
> > > running from a physical host. I need to debug this more to understand
> > > why the latency is actually much better than I would expect now.
> >
> > So we double checked on benchmark testing. What we see is that running
> > Intel Memory Latency Checker from a guest VM with virtual CXL memory
> > VS from a physical host with CXL1.1 memory expander has the same
> > latency.
> >
> > From guest VM: local socket system-DRAM latency is 117.0ns, local
> > socket CXL-DRAM latency is 269.4ns
> > From physical host: local socket system-DRAM latency is 113.6ns ,
> > local socket CXL-DRAM latency is 267.5ns
> >
> > I also set debugger breakpoints on cxl_type3_read/cxl_type3_write
> > while running the benchmark testing but those two functions are not
> > ever hit. We used the virtual CXL configuration while launching QEMU
> > but the CXL memory is present as a separate NUMA node and we are not
> > creating devdax devices. Does that make any difference?
> >
>
> Could you possibly share your full QEMU configuration and what OS/kernel
> you are running inside the guest?

Sounds like the technical details are explained on the other thread.
From what I understand now, if we don't go through a complex CXL
setup, it wouldn't go through the emulation path.

Here is our exact setup. Guest runs Linux kernel 6.6rc2

taskset --cpu-list 0-47,96-143 \
numactl -N 0 -m 0 ${QEMU} \
-M q35,cxl=on,hmat=on \
-m 64G \
-smp 8,sockets=1,cores=8,threads=1 \
-object memory-backend-ram,id=ram0,size=45G \
-numa node,memdev=ram0,cpus=0-7,nodeid=0 \
-msg timestamp=on -L /usr/share/seabios \
-enable-kvm \
-object memory-backend-ram,id=vmem0,size=19G,host-nodes=${HOST_CXL_NODE},policy=bind \
-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
-numa node,memdev=vmem0,nodeid=1 \
-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
-numa dist,src=0,dst=0,val=10 \
-numa dist,src=0,dst=1,val=14 \
-numa dist,src=1,dst=0,val=14 \
-numa dist,src=1,dst=1,val=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=read-latency,latency=91 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=read-latency,latency=100 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=write-latency,latency=91 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=write-latency,latency=100 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=read-bandwidth,bandwidth=262100M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=read-bandwidth,bandwidth=30000M \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=write-bandwidth,bandwidth=176100M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=write-bandwidth,bandwidth=30000M \
-drive file="${DISK_IMG}",format=qcow2 \
-device pci-bridge,chassis_nr=3,id=pci.3,bus=pcie.0,addr=0xd \
-netdev tap,id=vm-sk-tap22,ifname=tap22,script=/usr/local/etc/qemu-ifup,downscript=no \
-device virtio-net-pci,netdev=vm-sk-tap22,id=net0,mac=02:11:17:01:7e:33,bus=pci.3,addr=0x1,bootindex=3 \
-serial mon:stdio

>
> The only thing I'm surprised by is that the numa node appears without
> requiring the driver to generate the NUMA node.  It's possible I missed
> a QEMU update that allows this.
>
> ~Gregory
Gregory Price Jan. 9, 2024, 7:57 p.m. UTC | #11
On Tue, Jan 09, 2024 at 11:33:04AM -0800, Hao Xiang wrote:
> On Mon, Jan 8, 2024 at 5:13 PM Gregory Price <gregory.price@memverge.com> wrote:
> 
> Sounds like the technical details are explained on the other thread.
> From what I understand now, if we don't go through a complex CXL
> setup, it wouldn't go through the emulation path.
> 
> Here is our exact setup. Guest runs Linux kernel 6.6rc2
> 
> taskset --cpu-list 0-47,96-143 \
> numactl -N 0 -m 0 ${QEMU} \
> -M q35,cxl=on,hmat=on \
> -m 64G \
> -smp 8,sockets=1,cores=8,threads=1 \
> -object memory-backend-ram,id=ram0,size=45G \
> -numa node,memdev=ram0,cpus=0-7,nodeid=0 \
> -msg timestamp=on -L /usr/share/seabios \
> -enable-kvm \
> -object memory-backend-ram,id=vmem0,size=19G,host-nodes=${HOST_CXL_NODE},policy=bind \
> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> -numa node,memdev=vmem0,nodeid=1 \
> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k

:] you did what i thought you did

-numa node,memdev=vmem0,nodeid=1

"""
Another possibility: You mapped this memory-backend into another numa
node explicitly and never onlined the memory via cxlcli.  I've done
this, and it works, but it's a "hidden feature" that probably should
not exist / be supported.
"""

You're mapping vmem0 into an explicit numa node *and* into the type3
device.  You don't need to do both - and technically this shouldn't be
allowed.

With this configuration, you can go through the cxl-cli setup process
for the CXL device, you'll find that you can create *another* node
(node 2 in this case) that maps to the same memory you mapped to node1..


You can drop the cxl device objects in here and the memory will still
come up the way you want it to.

If you drop this line:

-numa node,memdev=vmem0,nodeid=1

You have to use the CXL driver to instantiate the dax device and the
numa node, and at *that* point you will see the read/write functions
being called.

~Gregory
Hao Xiang Jan. 9, 2024, 9:27 p.m. UTC | #12
On Tue, Jan 9, 2024 at 11:58 AM Gregory Price
<gregory.price@memverge.com> wrote:
>
> On Tue, Jan 09, 2024 at 11:33:04AM -0800, Hao Xiang wrote:
> > On Mon, Jan 8, 2024 at 5:13 PM Gregory Price <gregory.price@memverge.com> wrote:
> >
> > Sounds like the technical details are explained on the other thread.
> > From what I understand now, if we don't go through a complex CXL
> > setup, it wouldn't go through the emulation path.
> >
> > Here is our exact setup. Guest runs Linux kernel 6.6rc2
> >
> > taskset --cpu-list 0-47,96-143 \
> > numactl -N 0 -m 0 ${QEMU} \
> > -M q35,cxl=on,hmat=on \
> > -m 64G \
> > -smp 8,sockets=1,cores=8,threads=1 \
> > -object memory-backend-ram,id=ram0,size=45G \
> > -numa node,memdev=ram0,cpus=0-7,nodeid=0 \
> > -msg timestamp=on -L /usr/share/seabios \
> > -enable-kvm \
> > -object memory-backend-ram,id=vmem0,size=19G,host-nodes=${HOST_CXL_NODE},policy=bind \
> > -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> > -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> > -numa node,memdev=vmem0,nodeid=1 \
> > -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k
>
> :] you did what i thought you did
>
> -numa node,memdev=vmem0,nodeid=1
>
> """
> Another possibility: You mapped this memory-backend into another numa
> node explicitly and never onlined the memory via cxlcli.  I've done
> this, and it works, but it's a "hidden feature" that probably should
> not exist / be supported.
> """
>
> You're mapping vmem0 into an explicit numa node *and* into the type3
> device.  You don't need to do both - and technically this shouldn't be
> allowed.
>
> With this configuration, you can go through the cxl-cli setup process
> for the CXL device, you'll find that you can create *another* node
> (node 2 in this case) that maps to the same memory you mapped to node1..
>
>
> You can drop the cxl device objects in here and the memory will still
> come up the way you want it to.
>
> If you drop this line:
>
> -numa node,memdev=vmem0,nodeid=1

We tried this as well and it works after going through the cxlcli
process and creating the devdax device. The problem is that without the
"nodeid=1" configuration, we cannot connect with the explicit per numa
node latency/bandwidth configuration "-numa hmat-lb". I glanced at the
code in hw/numa.c, parse_numa_hmat_lb() looks like the one passing the
lb information to VM's hmat.

From what I understand so far, the guest kernel will dynamically
create a numa node after a cxl devdax device is created. That means 1) we
don't know the numa node until after VM boot, and 2) QEMU can only
statically pass the lb information to the VM at boot time. How do we
connect these two things?

Assuming that the same issue applies to a physical server with CXL.
Were you able to see a host kernel getting the correct lb information
for a CXL devdax device?

>
> You have to use the CXL driver to instantiate the dax device and the
> numa node, and at *that* point you will see the read/write functions
> being called.
>
> ~Gregory
Gregory Price Jan. 9, 2024, 10:13 p.m. UTC | #13
On Tue, Jan 09, 2024 at 01:27:28PM -0800, Hao Xiang wrote:
> On Tue, Jan 9, 2024 at 11:58 AM Gregory Price
> <gregory.price@memverge.com> wrote:
> >
> > If you drop this line:
> >
> > -numa node,memdev=vmem0,nodeid=1
> 
> We tried this as well and it works after going through the cxlcli
> process and creating the devdax device. The problem is that without the
> "nodeid=1" configuration, we cannot connect with the explicit per numa
> node latency/bandwidth configuration "-numa hmat-lb". I glanced at the
> code in hw/numa.c, parse_numa_hmat_lb() looks like the one passing the
> lb information to VM's hmat.
>

Yeah, this is what Jonathan was saying - right now there isn't a good
way (in QEMU) to pass the hmat/cdat stuff down through the device.
Needs to be plumbed out.

In the meantime: You should just straight up drop the cxl device from
your QEMU config.  It doesn't actually get you anything.

> From what I understand so far, the guest kernel will dynamically
> create a numa node after a cxl devdax device is created. That means 1) we
> don't know the numa node until after VM boot, and 2) QEMU can only
> statically pass the lb information to the VM at boot time. How do we
> connect these two things?

during boot, the kernel discovers all the memory regions exposed to
bios. In this qemu configuration you have defined:

region 0: CPU + DRAM node
region 1: DRAM only node
region 2: CXL Fixed Memory Window (the last line of the cxl stuff)

The kernel reads this information on boot and reserves 1 numa node for
each of these regions.

The kernel then automatically brings up regions 0 and 1 in nodes 0 and 1
respectively.

Node2 sits dormant until you go through the cxl-cli startup sequence.


What you're asking for is for the QEMU team to plumb hmat/cdat
information down through the type3 device.  I *think* that is presently
possible with a custom CDAT file - but Jonathan probably has more
details on that.  You'll have to go digging for answers on that one.


Now - even if you did that - the current state of the cxl-type3 device
is *not what you want* because your memory accesses will be routed
through the read/write functions in the emulated device.

What Jonathan and I discussed on the other thread is how you might go
about slimming this down to allow pass-through of the memory without the
need for all the fluff.  This is a non-trivial refactor of the existing
device, so i would not expect that any time soon.

At the end of the day, quickest way to get-there-from-here is to just
drop the cxl related lines from your QEMU config, and keep everything
else.

> 
> Assuming that the same issue applies to a physical server with CXL.
> Were you able to see a host kernel getting the correct lb information
> for a CXL devdax device?
> 

Yes, if you bring up a CXL device via cxl-cli on real hardware, the
subsequent numa node ends up in the "lower tier" of the memory-tiering
infrastructure.
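
One quick way to verify which tier a node landed in, as a sketch that assumes the
memory-tiering sysfs ABI added around v6.1
(/sys/devices/virtual/memory_tiering/memory_tierN/nodelist; worth checking against
Documentation/ABI before relying on it):

/* Sketch: list the nodes in each memory tier.  The sysfs path is an
 * assumption based on the memory-tiering ABI (~v6.1); verify it first. */
#include <glob.h>
#include <stdio.h>

int main(void)
{
    glob_t g;

    if (glob("/sys/devices/virtual/memory_tiering/memory_tier*/nodelist",
             0, NULL, &g) == 0) {
        for (size_t i = 0; i < g.gl_pathc; i++) {
            char nodes[64] = "";
            FILE *f = fopen(g.gl_pathv[i], "r");

            if (f && fgets(nodes, sizeof(nodes), f)) {
                printf("%s: %s", g.gl_pathv[i], nodes);
            }
            if (f) {
                fclose(f);
            }
        }
        globfree(&g);
    }
    return 0;
}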

~Gregory
Hao Xiang Jan. 9, 2024, 11:55 p.m. UTC | #14
On Tue, Jan 9, 2024 at 2:13 PM Gregory Price <gregory.price@memverge.com> wrote:
>
> On Tue, Jan 09, 2024 at 01:27:28PM -0800, Hao Xiang wrote:
> > On Tue, Jan 9, 2024 at 11:58 AM Gregory Price
> > <gregory.price@memverge.com> wrote:
> > >
> > > If you drop this line:
> > >
> > > -numa node,memdev=vmem0,nodeid=1
> >
> > We tried this as well and it works after going through the cxlcli
> > process and created the devdax device. The problem is that without the
> > "nodeid=1" configuration, we cannot connect with the explicit per numa
> > node latency/bandwidth configuration "-numa hmat-lb". I glanced at the
> > code in hw/numa.c, parse_numa_hmat_lb() looks like the one passing the
> > lb information to VM's hmat.
> >
>
> Yeah, this is what Jonathan was saying - right now there isn't a good
> way (in QEMU) to pass the hmat/cdat stuff down through the device.
> Needs to be plumbed out.
>
> In the meantime: You should just straight up drop the cxl device from
> your QEMU config.  It doesn't actually get you anything.
>
> > From what I understand so far, the guest kernel will dynamically
> > create a numa node after a cxl devdax device is created. That means 1) we
> > don't know the numa node until after VM boot, and 2) QEMU can only
> > statically pass the lb information to the VM at boot time. How do we
> > connect these two things?
>
> during boot, the kernel discovers all the memory regions exposed to
> bios. In this qemu configuration you have defined:
>
> region 0: CPU + DRAM node
> region 1: DRAM only node
> region 2: CXL Fixed Memory Window (the last line of the cxl stuff)
>
> The kernel reads this information on boot and reserves 1 numa node for
> each of these regions.
>
> The kernel then automatically brings up regions 0 and 1 in nodes 0 and 1
> respectively.
>
> Node2 sits dormant until you go through the cxl-cli startup sequence.
>
>
> What you're asking for is for the QEMU team to plumb hmat/cdat
> information down through the type3 device.  I *think* that is presently
> possible with a custom CDAT file - but Jonathan probably has more
> details on that.  You'll have to go digging for answers on that one.

I think this is exactly what I was looking for. When we started with
the idea of having an explicit CXL memory backend, we wanted to
1) Bind a virtual CXL device to an actual CXL memory node on host.
2) Pass the latency/bandwidth information from the CXL backend into
the virtual CXL device.
I didn't have a concrete idea of how to do 2)
With the discussion here, I learned that the information is passed
from CDAT. Just looked into the virtual CXL code and found that
ct3_build_cdat_entries_for_mr() is the function that builds this
information. But the latency and bandwidth there are currently
hard-coded. I think it makes sense to have an explicit CXL memory
backend where QEMU can query the CXL memory attributes from the host
and pass that information from the CXL backend into the virtual CXL
type-3 device.
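
A hedged sketch of that direction, for illustration only: HostMemAttrs and
host_node_read_attr() are hypothetical names, and the plumbing from the backend
into ct3_build_cdat_entries_for_mr() is not shown. It simply reads the host node's
access0 attributes from sysfs so a backend could hand them to the device in place
of the hard-coded CDAT values:

/*
 * Hypothetical helper, not existing QEMU code.  Assumes qemu/osdep.h.
 * Reads the host's HMAT-derived attributes for one NUMA node; a backend
 * could cache these in something like HostMemAttrs for the type-3 device
 * to consume when it builds its CDAT entries.
 */
typedef struct HostMemAttrs {
    uint64_t read_latency_ns;     /* sysfs reports latency in nanoseconds */
    uint64_t read_bandwidth_mbs;  /* and bandwidth in MB/s */
} HostMemAttrs;

static bool host_node_read_attr(int node, const char *name, uint64_t *out)
{
    g_autofree char *path = g_strdup_printf(
        "/sys/devices/system/node/node%d/access0/initiators/%s", node, name);
    g_autofree char *contents = NULL;

    if (!g_file_get_contents(path, &contents, NULL, NULL)) {
        return false;
    }
    *out = g_ascii_strtoull(contents, NULL, 10);
    return true;
}

Whether those numbers should be collected by the backend or by the device's CDAT
handling is exactly the open question in this part of the thread.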

>
>
> Now - even if you did that - the current state of the cxl-type3 device
> is *not what you want* because your memory accesses will be routed
> through the read/write functions in the emulated device.
>
> What Jonathan and I discussed on the other thread is how you might go
> about slimming this down to allow pass-through of the memory without the
> need for all the fluff.  This is a non-trivial refactor of the existing
> device, so i would not expect that any time soon.
>
> At the end of the day, quickest way to get-there-from-here is to just
> drop the cxl related lines from your QEMU config, and keep everything
> else.

Agreed. We need the kernel to be capable of reading the memory
attributes from HMAT and establishing the correct memory tier for
system-DRAM (on a CPUless numa node). Currently system-DRAM is assumed
to always be fast tier.

>
> >
> > Assuming that the same issue applies to a physical server with CXL.
> > Were you able to see a host kernel getting the correct lb information
> > for a CXL devdax device?
> >
>
> Yes, if you bring up a CXL device via cxl-cli on real hardware, the
> subsequent numa node ends up in the "lower tier" of the memory-tiering
> infrastructure.
>
> ~Gregory
Markus Armbruster Jan. 12, 2024, 3:32 p.m. UTC | #15
QAPI schema sanity review only.

"Ho-Ren (Jack) Chuang" <horenchuang@bytedance.com> writes:

> Introduce a new configuration option 'host-mem-type=' in the
> '-object memory-backend-ram', allowing users to specify
> from which type of memory to allocate.
>
> Users can specify 'cxlram' as an argument, and QEMU will then
> automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> For example:
> 	-object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
> 	-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> 	-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> 	-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> 	-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
>
> In v1, we plan to move most of the implementations to util and break down
> this patch into different smaller patches.
>
> Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
> Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>

[...]

> diff --git a/qapi/common.json b/qapi/common.json
> index 6fed9cde1a..591fd73291 100644
> --- a/qapi/common.json
> +++ b/qapi/common.json
> @@ -167,6 +167,25 @@
>  { 'enum': 'HostMemPolicy',
>    'data': [ 'default', 'preferred', 'bind', 'interleave' ] }
>  
> +##
> +# @HostMemType:
> +#
> +# Automatically find a backend memory type on host.

Types don't do, they are.  Suggest something like

   # Host memory types

> +# Can be further extened to support other types such as cxlpmem, hbm.

Doc comments are reference documentation for users of QMP.  This
sentence doesn't fit there.

> +#
> +# @none: do nothing (default).

Again, types don't do, they are.

(default) makes no sense here.  Types don't have defaults, optional
arguments do.  In this case, @host-mem-type below.

What kind of memory does this setting request?

> +#
> +# @cxlram: a CXL RAM backend on host.

Suggest to spell it cxl-ram.

> +#
> +# Note: HostMemType and HostMemPolicy/host-nodes cannot be set at the same
> +# time. HostMemType is used to automatically bind with one kind of
> +# host memory types.

I feel this note doesn't belong here.  Add it to the users of
HostMemType and HostMemPolicy instead.

> +#
> +# Since: 8.3
> +##
> +{ 'enum': 'HostMemType',
> +  'data': [ 'none', 'cxlram' ] }
> +

I guess you're adding this to common.json and not qom.json so it's next
to HostMemPolicy.  Okay.

>  ##
>  # @NetFilterDirection:
>  #
> diff --git a/qapi/qom.json b/qapi/qom.json
> index 95516ba325..fa3bc29708 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -626,6 +626,7 @@
>              '*host-nodes': ['uint16'],
>              '*merge': 'bool',
>              '*policy': 'HostMemPolicy',
> +            '*host-mem-type': 'HostMemType',
>              '*prealloc': 'bool',
>              '*prealloc-threads': 'uint32',
>              '*prealloc-context': 'str',

Missing: doc comment update.

[...]

Patch

diff --git a/backends/hostmem.c b/backends/hostmem.c
index 747e7838c0..3bede13879 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -44,6 +44,133 @@  host_memory_backend_get_name(HostMemoryBackend *backend)
     return object_get_canonical_path(OBJECT(backend));
 }
 
+#define FILE_LINE_LEN 256
+static int
+is_valid_node(const char *path) {
+    FILE *file = fopen(path, "r");
+    if (file == NULL) {
+        return -1;
+    }
+
+    char line[FILE_LINE_LEN];
+    if (fgets(line, sizeof(line), file) != NULL) {
+        int target_node = atoi(line);
+
+        if (target_node >= 0) {
+            fclose(file);
+            return target_node;
+        }
+    }
+
+    fclose(file);
+    return -1;
+}
+
+static int
+is_directory(const char *path) {
+    struct stat path_stat;
+    stat(path, &path_stat);
+    return S_ISDIR(path_stat.st_mode);
+}
+
+static int
+is_symlink(const char *path) {
+    struct stat path_stat;
+    if (lstat(path, &path_stat) == -1) {
+        return 0;
+    }
+    return S_ISLNK(path_stat.st_mode);
+}
+
+#define CXL_DEVICE_PATH "/sys/bus/cxl/devices/"
+#define REGION_PATH_LEN 307
+#define DAX_REGION_PATH_LEN 563
+#define DAX_PATH_LEN 819
+#define TARGET_FILE_PATH_LEN 831
+/*
+ * return: the number of valid NUMA node IDs found
+ */
+static int
+host_memory_backend_get_cxlram_nodes(int *valid_cxlram_nodes) {
+    DIR *base_dir = NULL, *region_dir = NULL, *dax_region_dir = NULL;
+    const char *base_dir_path = CXL_DEVICE_PATH;
+    struct dirent *entry;
+    int valid_node = 0, ret = 0;
+
+    base_dir = opendir(base_dir_path);
+    if (base_dir == NULL) {
+        return valid_node;
+    }
+
+    while ((entry = readdir(base_dir)) != NULL) {
+        char region_path[REGION_PATH_LEN];
+
+        ret = snprintf(region_path, sizeof(region_path), "%s%s",
+                                            base_dir_path, entry->d_name);
+        if (ret < 0 ||
+            !is_symlink(region_path) ||
+            strncmp(entry->d_name, "region", ARRAY_SIZE("region") - 1)) {
+            continue;
+        }
+
+        region_dir = opendir(region_path);
+        if (region_dir == NULL) {
+            goto region_exit;
+        }
+
+        while ((entry = readdir(region_dir)) != NULL) {
+            char dax_region_path[DAX_REGION_PATH_LEN];
+
+            ret = snprintf(dax_region_path, sizeof(dax_region_path), "%s/%s",
+                                                    region_path, entry->d_name);
+            if (ret < 0 ||
+                !is_directory(dax_region_path) ||
+                strncmp(entry->d_name, "dax_region",
+                            ARRAY_SIZE("dax_region") - 1)) {
+
+                continue;
+            }
+
+            dax_region_dir = opendir(dax_region_path);
+            if (dax_region_dir == NULL) {
+                goto dax_region_exit;
+            }
+
+            while ((entry = readdir(dax_region_dir)) != NULL) {
+                int target_node;
+                char dax_path[DAX_PATH_LEN];
+                char target_file_path[TARGET_FILE_PATH_LEN];
+                ret = snprintf(dax_path, sizeof(dax_path), "%s/%s",
+                                            dax_region_path, entry->d_name);
+                if (ret < 0 ||
+                    !is_directory(dax_path) ||
+                    strncmp(entry->d_name, "dax", ARRAY_SIZE("dax") - 1)) {
+                    continue;
+                }
+
+                ret = snprintf(target_file_path, sizeof(target_file_path),
+                                                    "%s/target_node", dax_path);
+                if (ret < 0) {
+                    continue;
+                }
+
+                target_node = is_valid_node(target_file_path);
+                if (target_node >= 0) {
+                    valid_cxlram_nodes[valid_node] = target_node;
+                    valid_node++;
+                }
+            }
+        }
+    }
+
+    closedir(dax_region_dir);
+dax_region_exit:
+    closedir(region_dir);
+region_exit:
+    closedir(base_dir);
+    return valid_node;
+}
+
 static void
 host_memory_backend_get_size(Object *obj, Visitor *v, const char *name,
                              void *opaque, Error **errp)
@@ -117,6 +244,12 @@  host_memory_backend_set_host_nodes(Object *obj, Visitor *v, const char *name,
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
     uint16List *l, *host_nodes = NULL;
 
+    if (backend->host_mem_type == HOST_MEM_TYPE_CXLRAM) {
+        error_setg(errp,
+            "'host-mem-type=' and 'host-nodes='/'policy=' are incompatible");
+        return;
+    }
+
     visit_type_uint16List(v, name, &host_nodes, errp);
 
     for (l = host_nodes; l; l = l->next) {
@@ -150,6 +283,11 @@  host_memory_backend_set_policy(Object *obj, int policy, Error **errp)
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
     backend->policy = policy;
 
+    if (backend->host_mem_type == HOST_MEM_TYPE_CXLRAM) {
+        error_setg(errp,
+            "'host-mem-type=' and 'host-nodes='/'policy=' are incompatible");
+    }
+
 #ifndef CONFIG_NUMA
     if (policy != HOST_MEM_POLICY_DEFAULT) {
         error_setg(errp, "NUMA policies are not supported by this QEMU");
@@ -157,6 +295,46 @@  host_memory_backend_set_policy(Object *obj, int policy, Error **errp)
 #endif
 }
 
+static int
+host_memory_backend_get_host_mem_type(Object *obj, Error **errp G_GNUC_UNUSED)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+    return backend->host_mem_type;
+}
+
+static void
+host_memory_backend_set_host_mem_type(Object *obj, int host_mem_type, Error **errp)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+    backend->host_mem_type = host_mem_type;
+
+#ifndef CONFIG_NUMA
+    error_setg(errp, "NUMA node host memory types are not supported by this QEMU");
+#else
+    int i, valid_cxlram_nodes[MAX_NODES];
+
+    if (backend->policy > 0 ||
+        !bitmap_empty(backend->host_nodes, MAX_NODES)) {
+        error_setg(errp,
+            "'host-mem-type=' and 'host-nodes='/'policy=' are incompatible");
+        return;
+    }
+
+    if (host_memory_backend_get_cxlram_nodes(valid_cxlram_nodes) > 0) {
+        for (i = 0; i < MAX_NODES; i++) {
+            if (valid_cxlram_nodes[i] < 0) {
+                break;
+            }
+            bitmap_set(backend->host_nodes, valid_cxlram_nodes[i], 1);
+        }
+    } else {
+        error_setg(errp, "Cannot find CXL RAM on host");
+        return;
+    }
+    backend->policy = HOST_MEM_POLICY_BIND;
+#endif
+}
+
 static bool host_memory_backend_get_merge(Object *obj, Error **errp)
 {
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
@@ -536,6 +714,12 @@  host_memory_backend_class_init(ObjectClass *oc, void *data)
         host_memory_backend_get_share, host_memory_backend_set_share);
     object_class_property_set_description(oc, "share",
         "Mark the memory as private to QEMU or shared");
+    object_class_property_add_enum(oc, "host-mem-type", "HostMemType",
+        &HostMemType_lookup,
+        host_memory_backend_get_host_mem_type,
+        host_memory_backend_set_host_mem_type);
+    object_class_property_set_description(oc, "host-mem-type",
+        "Set the backend host memory type");
 #ifdef CONFIG_LINUX
     object_class_property_add_bool(oc, "reserve",
         host_memory_backend_get_reserve, host_memory_backend_set_reserve);
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index 39326f1d4f..afeb9b71d1 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -70,6 +70,7 @@  struct HostMemoryBackend {
     ThreadContext *prealloc_context;
     DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
     HostMemPolicy policy;
+    HostMemType host_mem_type;
 
     MemoryRegion mr;
 };
diff --git a/qapi/common.json b/qapi/common.json
index 6fed9cde1a..591fd73291 100644
--- a/qapi/common.json
+++ b/qapi/common.json
@@ -167,6 +167,25 @@ 
 { 'enum': 'HostMemPolicy',
   'data': [ 'default', 'preferred', 'bind', 'interleave' ] }
 
+##
+# @HostMemType:
+#
+# Automatically find a backend memory type on host.
+# Can be further extended to support other types such as cxlpmem, hbm.
+#
+# @none: do nothing (default).
+#
+# @cxlram: a CXL RAM backend on host.
+#
+# Note: HostMemType and HostMemPolicy/host-nodes cannot be set at the same
+# time. HostMemType is used to automatically bind with one kind of
+# host memory types.
+#
+# Since: 8.3
+##
+{ 'enum': 'HostMemType',
+  'data': [ 'none', 'cxlram' ] }
+
 ##
 # @NetFilterDirection:
 #
diff --git a/qapi/qom.json b/qapi/qom.json
index 95516ba325..fa3bc29708 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -626,6 +626,7 @@ 
             '*host-nodes': ['uint16'],
             '*merge': 'bool',
             '*policy': 'HostMemPolicy',
+            '*host-mem-type': 'HostMemType',
             '*prealloc': 'bool',
             '*prealloc-threads': 'uint32',
             '*prealloc-context': 'str',
diff --git a/qemu-options.hx b/qemu-options.hx
index b66570ae00..39074c1aa0 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -5211,7 +5211,7 @@  SRST
         (``share=off``). For this use case, we need writable RAM instead
         of ROM, and want to also set ``rom=off``.
 
-    ``-object memory-backend-ram,id=id,merge=on|off,dump=on|off,share=on|off,prealloc=on|off,size=size,host-nodes=host-nodes,policy=default|preferred|bind|interleave``
+    ``-object memory-backend-ram,id=id,merge=on|off,dump=on|off,share=on|off,prealloc=on|off,size=size,host-mem-type=cxlram,host-nodes=host-nodes,policy=default|preferred|bind|interleave``
         Creates a memory backend object, which can be used to back the
         guest RAM. Memory backend objects offer more control than the
         ``-m`` option that is traditionally used to define guest RAM.