
[RFC] cxl: Multi-headed device design

Message ID ZBpe6btfLuuAS35g@memverge.com
State New

Commit Message

Gregory Price March 22, 2023, 1:50 a.m. UTC
Originally I was planning to kick this off with a patch set, but I've
decided my current prototype does not fit the extensibility requirements
to go from SLD to MH-SLD to MH-MLD.


So instead I'd like to kick off by just discussing the data structures
and laugh/cry a bit about some of the frustrating ambiguities for MH-SLDs
when it comes to the specification.

I apologize for the sheer length of this email, but it really is just
that complex.


=============================================================
 What does the specification say about Multi-headed Devices? 
=============================================================

Defining each relevant component according to the specification:

>
> VCS - Virtual CXL Switch
> * Includes entities within the physical switch belonging to a
>   single VH. It is identified using the VCS ID.
> 
> 
> VH - Virtual Hierarchy.
> * Everything from the CXL RP down.
> 
> 
> LD - Logical Device
> * Entity that represents a CXL Endpoint that is bound to a VCS.
>   An SLD device contains one LD.  An MLD contains multiple LDs.
> 
> 
> SLD - Single Logical Device
> * That's it, that's the definition.
> 
> 
> MLD - Multi Logical Device
> * Multi-Logical Device. CXL component that contains multiple LDs,
>   out of which one LD is reserved for configuration via the FM API,
>   and each remaining LD is suitable for assignment to a different
>   host. Currently MLDs are architected only for Type 3 LDs.
> 
> 
> MH-SLD - Multi-Headed SLD
> * CXL component that contains multiple CXL ports, each presenting an
>   SLD. The ports must correctly operate when connected to any
>   combination of common or different hosts.
> 
> 
> MH-MLD - Multi-Headed MLD
> * CXL component that contains multiple CXL ports, each presenting an MLD
>   or SLD. The ports must correctly operate when connected to any
>   combination of common or different hosts. The FM-API is used to
>   configure each LD as well as the overall MH-MLD.
> 
>   MH-MLDs are considered a specialized type of MLD and, as such, are
>   subject to all functional and behavioral requirements of MLDs.
> 

Ambiguity #1:

* An SLD contains 1 Logical Device.
* An MH-SLD presents multiple SLDs, one per head.

Ergo an MH-SLD contains multiple LDs which makes it an MLD according to the
definition of LD, but not according to the definition of MLD, or MH-MLD.

Now is the winter of my discontent.

The specification says this about MH-SLDs in other sections:

> 2.4.3 Pooled and Shared FAM
> 
> LD-FAM includes several device variants.
> 
> A multi-headed Single Logical Device (MH-SLD) exposes multiple LDs, each with
> a dedicated link.
> 
>
> 2.5 Multi-Headed Device
> 
> There are two types of Multi-Headed Devices that are distinguished by how
> they present themselves on each head:
> *  MH-SLD, which present SLDs on all heads
> *  MH-MLD, which may present MLDs on any of their heads
>
>
> Management of heads in Multi-Headed Devices follows the model defined for
> the device presented by that head:
> *  Heads that present SLDs may support the port management and control
>     features that are available for SLDs
> *  Heads that present MLDs may support the port management and control
>    features that are available for MLDs
>

I want to make very close note of this.  SLDs are managed like SLDs,
MLDs are managed like MLDs.  MH-SLDs, according to this, should be
managed like SLDs from the perspective of each host.

That's pretty straightforward.

>
> Management of memory resources in Multi-Headed Devices follows the model
> defined for MLD components because both MH-SLDs and MH-MLDs must support
> the isolation of memory resources, state, context, and management on a
> per-LD basis.  LDs within the device are mapped to a single head.
> 
> *  In MH-SLDs, there is a 1:1 mapping between heads and LDs.
> *  In MH-MLDs, multiple LDs are mapped to at most one head.
> 
> 
> Multi-Headed Devices expose a dedicated Component Command Interface (CCI),
> the LD Pool CCI, for management of all LDs within the device. The LD Pool
> CCI may be exposed as an MCTP-based CCI or can be accessed via the Tunnel
> Management Command command through a head’s Mailbox CCI, as detailed in
> Section 7.6.7.3.1.

2.5.1 continues on to describe "LD Management in MH-MLDs" but just ignores
that MH-SLDs (may) exist.  That's frustrating to say the least, but I
suppose we can gather from context that MH-SLDs *MAY NOT* have LD
management controls.

Let's see if that assumption holds.

> 7.6.7.3 MLD Port Command Set
>
> 7.6.7.3.1 Tunnel Management Command (Opcode 5300h)

The referenced section at the end of 2.5 seems to also suggest that
MH-SLDs do not (or don't have to?) implement the tunnel management
command set.  It sends us to the MLD command set, and SLDs don't get
managed like MLDs - ergo it's not relevant?

The final mention of MH-SLDs is in section 9.13.3:

> 9.13.3 Dynamic Capacity Device
> ...
>  MH-SLD or MH-MLD based DCD shall forcefully release shared Dynamic
>  Capacity associated with all associated hosts upon a Conventional Reset
>  of a head.
>

From this we can gather that the specification foresaw someone making a
memory pool from an MH-SLD... but without LD management. We can fill in
some blanks and assume that if someone wanted to, they could make a
shared memory device and implement pooling via software controls.

That'd be a neat bodge/hack.  But that's not important right now.


Finally, we look at what the mailbox command-set requirements are for
multi-headed devices:

> 7.6.7.5 Multi-Headed Device Command Set
> The Multi-Headed device command set includes commands for querying the
> Head-to-LD mapping in a Multi-Headed device. Support for this command
> set is required on the LD Pool CCI of a Multi-Headed device.
>

Ambiguity #2: Ok, now we're not sure whether an MH-SLD is supposed to
expose an LD Pool CCI or not.  Also, is an MH-SLD supposed to show up
as something other than just an SLD?  This is really confusing.

Going back to the MLD Port Command set, we see

> Valid targets for the tunneled commands include switch MLD Ports,
> valid LDs within an MLD, and the LD Pool CCI in a Multi-Headed device.

Whatever the case, there's only a single command in the MHD command set:

> 7.6.7.5.1 Get Multi-Headed Info (Opcode 5500h)

This command is pretty straightforward: it just tells you what the head
to LD mapping is for each of the LDs in the device.  Presumably this is
what gets modified by the FM-APIs when LDs are attached to VCS ports.

For the simplest MH-SLD device, these fields would be immutable, and
there would be a single LD for each head, where head_id == ld_id.
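
To make the mapping concrete, here is a minimal C sketch of how a
head-to-LD map could be queried.  It is not taken from the spec or from
QEMU: the helper names are made up, and it assumes ldmap[ld_id] holds
the head that LD is currently mapped to.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: collect the LD IDs assigned to one head.
 * Assumes ldmap[ld_id] holds the head that LD is mapped to.
 * Writes at most out_cap IDs to out; returns the total number of matches. */
static size_t mhd_lds_for_head(const uint8_t *ldmap, size_t nr_lds,
                               uint8_t head, uint8_t *out, size_t out_cap)
{
    size_t found = 0;

    for (size_t ld = 0; ld < nr_lds; ld++) {
        if (ldmap[ld] != head) {
            continue;
        }
        if (found < out_cap) {
            out[found] = (uint8_t)ld;
        }
        found++;
    }
    return found;
}

/* In the simplest MH-SLD every head owns exactly the LD with its own ID. */
static int mhd_is_identity_map(const uint8_t *ldmap, size_t nr_lds)
{
    for (size_t ld = 0; ld < nr_lds; ld++) {
        if (ldmap[ld] != ld) {
            return 0;
        }
    }
    return 1;
}
```

For a four-head MH-SLD with ldmap = {0, 1, 2, 3}, each head gets back
exactly one LD, and the identity check holds.  For an MLD-style map
such as {0, 0, 1, 1}, head 1 gets back LDs 2 and 3.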



So summarizing, what I took away from this was the following:

In the simplest form of MH-SLD, there is neither a switch nor LD
management.  So, presumably, we don't HAVE to implement the
MHD commands to say we "have MH-SLD support".


========
 Design
========

Ok... that's a lot to break down.  Here's what I think the roadmap
toward multi-headed multi-logical device support should look like:

1. SLD - we have this.  This is struct CXLType3Dev

2. MH-SLD No Switch, No Pool CCI.

3. MH-SLD w/ Pool CCI  (Implementing only Get Multi-Headed Info)

4. MH-SLD w/ Switch (Implementing remap-ability of LD to Head)

5. MH-MLD - the whole kit and kaboodle.


Let's talk about what the first MH-SLD might look like.


=================================
2. MH-SLD No Switch, No Pool CCI.
=================================

1. The device has a "memory pool" that "backs" each Logical Device, and
   the specification does not limit whether this memory is discrete
   or may be shared between heads.

   In QEMU, we can represent this with a shared or file memory backend:

-object memory-backend-file,id=mem0,mem-path=/tmp/mem0,size=4G,share=true


2. Each QEMU instance has a discrete SLD that amounts to its own private
   CXLType3Dev.  However, each "Head" maps back to the same common
   memory backend:

-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0


And that's it.  In fact, you can do this now, no changes needed!


But it's also not very useful.  You can only use the memory in devdax
mode, since it's a shared memory region. You could already do this via
the /dev/shm interface, so it's not even new functionality.

In theory you could build a pooling service in software-only on top of
memory blocks. That's an exercise left to the reader.


================================================================
3. MH-SLD w/ Pool CCI  (Implementing only Get Multi-Headed Info)
================================================================

This is a little more complicated: we have our first bit of shared
state.  Originally I had considered a shared memory region in
CXLType3Dev, but this is a backwards abstraction (an MH-SLD contains
multiple SLDs, an SLD does not contain an MHD State).

./cxl_mhd_init 4 $shmid1
-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd=true,mhd_head=0,mhd_shmid=$shmid1

./cxl_mhd_init would simply set up the nr_heads/nr_lds fields and such
and set ldmap[0-3] to the values [0-3].  i.e. the head-to-ld mappings
are static (head_id==ld_id).



But like I said, this is a backwards abstraction, so realistically we
should flip this around such that we have the following:

struct CXLMHD_SharedState {
	uint8_t nr_heads;
	uint8_t nr_lds;
	uint8_t ldmap[];
};

struct CXLMH_SLD {
	uint32_t headid;
	uint32_t shmid;
	struct CXLMHD_SharedState *state;
	struct CXLType3Dev sld;
};

The shared state would be instantiated the same way as above.
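
As a rough sketch of what that initialization could look like, under
the assumption that the init tool has already attached the region (for
example with shmat()) and just needs to fill it in.  The function names
mhd_state_size and mhd_state_init are hypothetical and are not the
actual cxl_mhd_init code:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct CXLMHD_SharedState {
    uint8_t nr_heads;
    uint8_t nr_lds;
    uint8_t ldmap[];
};

/* Bytes needed for a state block tracking nr_lds logical devices. */
static size_t mhd_state_size(uint8_t nr_lds)
{
    return sizeof(struct CXLMHD_SharedState) + (size_t)nr_lds;
}

/* Initialise a static MH-SLD mapping (one LD per head, head_id == ld_id)
 * in caller-provided memory, e.g. a region returned by shmat().
 * Returns the state pointer, or NULL if the region is too small. */
static struct CXLMHD_SharedState *mhd_state_init(void *region, size_t len,
                                                 uint8_t nr_heads)
{
    struct CXLMHD_SharedState *st = region;

    if (region == NULL || nr_heads == 0 || len < mhd_state_size(nr_heads)) {
        return NULL;
    }
    memset(region, 0, mhd_state_size(nr_heads));
    st->nr_heads = nr_heads;
    st->nr_lds = nr_heads;
    for (uint8_t i = 0; i < nr_heads; i++) {
        st->ldmap[i] = i;
    }
    return st;
}
```

Because ldmap is a flexible array member, the region has to be sized
for the header plus one byte per LD, which is why the size helper is
split out.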

With this we'd basically just create a new memory device:

hw/mem/cxl_mh_sld.c


This is pretty straightforward - we just expose some of cxl_type3.c
functions in order to instantiate the device accordingly, the rest of it
just becomes passthrough because... it's just a cxl_type3.c device.


This ultimately manifests as:

shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`

./cxl_mhd_init 4 $shmid1

-device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=$shmid1


Note: This is the patch set I'm working towards, but I presume there
might be some (strong) opinions, so I didn't want to get too far into
development before posting this.


==============================================================
4. MH-SLD w/ Switch (Implementing LD management in an SLD)
==============================================================

Is it even rational to try to build such a device?

MH-SLDs have a 1-to-1 mapping of Head:Logical Device.

Presumably each SLD maps the entirety of the "pooled" memory,
but the specification does not state that is true.  You could, for
example, set up each Logical Device to map to a particular portion of the
shared/pooled memory area:

-object memory-backend-file,id=mem0,mem-path=/tmp/mem0,size=4G,share=true

QEMU #1
-device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=shmid,dpa_base=0,dpa_limit=1G

QEMU #2
-device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=1,mhd_shmid=shmid,dpa_base=1G,dpa_limit=1G

... and so on.

At least in theory, this would involve implementing something that
changes which SLD is mapped to a QEMU instance - but functionally this
is just changing the base and limit of each SLD.
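
A minimal sketch of that base/limit idea in C, assuming dpa_limit is
the size of the window (as the two command lines above suggest) and
that each head addresses its window starting from 0.  The struct and
function names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-head window into the shared backend. */
struct sld_window {
    uint64_t base;  /* offset of the window in the backend (dpa_base) */
    uint64_t size;  /* length of the window in bytes (dpa_limit above) */
};

/* Translate a head-local DPA into a backend offset.
 * Returns false if [dpa, dpa + len) is not fully inside the window. */
static bool sld_translate(const struct sld_window *w, uint64_t dpa,
                          uint64_t len, uint64_t *backend_off)
{
    /* Written to avoid overflow in dpa + len. */
    if (len == 0 || dpa >= w->size || len > w->size - dpa) {
        return false;
    }
    *backend_off = w->base + dpa;
    return true;
}
```

Remapping an SLD to a different part of the pool then amounts to
rewriting base and size for that head; the hard part is coordinating
that change across instances, which this sketch does not attempt.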

It's interesting from a functional testing perspective, there's a bunch
of CCI/Tunnel commands that could be implemented, and presumably this
would require a separate process to manage/serialize appropriately.

=======================================
5. MH-MLD - the whole kit and kaboodle.
=======================================

If we implemented MH-SLD w/ Switching, then presumably it's just one step
further to create an MLD:

struct CXLMH_MLD {
        uint32_t headid;
        uint32_t shmid;
        struct CXLMHD_SharedState *state;
        struct CXLType3Dev ldmap[];
};

But I'm greatly oversimplifying here.  It's much more expressive to
describe an MLD in terms of a multi-tiered switch in the QEMU topology,
similar to what can be done right now:

-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12 \
-device cxl-rp,id=rp0,port=0,bus=cxl.0,chassis=0,slot=0 \
-device cxl-rp,id=rp1,port=1,bus=cxl.0,chassis=0,slot=1 \
-device cxl-upstream,bus=rp0,id=us0 \
-device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
-device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \
-device cxl-type3,bus=swport0,volatile-memdev=mem0,id=cxl-mem0 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k


But in order to make this multi-headed, some amount of this state would need
to be encapsulated in a shared memory region (or would it? I don't know, I
haven't finished this thought experiment yet).


=====
 FIN 
=====

I realize this was long.  If you made it to the end of this email,
thank you for reading my TED talk.  I greatly appreciate any comments,
even if it's just "You've gone too deep, Gregory." ;]

Regards,
~Gregory

Comments

Gregory Price May 16, 2023, 6:20 a.m. UTC | #1
On Mon, May 15, 2023 at 05:18:07PM +0100, Jonathan Cameron wrote:
> On Tue, 21 Mar 2023 21:50:33 -0400
> Gregory Price <gregory.price@memverge.com> wrote:
> 
> > 
> > Ambiguity #1:
> > 
> > * An SLD contains 1 Logical Device.
> > * An MH-SLD presents multiple SLDs, one per head.
> > 
> > Ergo an MH-SLD contains multiple LDs which makes it an MLD according to the
> > definition of LD, but not according to the definition of MLD, or MH-MLD.
> 
> I'd go with 'sort of'.  SLD is a presentation of a device to a host.
> It can be a normal single headed MLD that has been plugged directly into a host.
> 
> So for extra fun points you can have one MH-MLD that has some ports connected
> to switches and other directly to hosts. Thus it can present as SLD on some
> upstream ports and as MLD on others.
>

I suppose this section of the email was really to just point out that
what constitutes a "multi-headed", "logical", and "multi-logical"
device is rather confusing from just reading the spec.  Since writing
this, I've kind of settled on:

MH-* - anything with multiple heads, regardless of how it works
SLD - one LD per head, but LD does not imply any particular command set
MLD - multiple LD's per head, but those LD's may only attach to one head
DCD - anything can technically be a DCD if it implements the commands

Trying to figure out, from the spec, what commands an MH-SLD should
implement to be "spec compliant" was my frustration.  It's somewhat
clear now that the answer is "Technically nothing... unless it's an MLD".

> > I want to make very close note of this.  SLDs are managed like SLDs,
> > MLDs are managed like MLDs.  MH-SLDs, according to this, should be
> > managed like SLDs from the perspective of each host.
> 
> True, but an MH-MLD device connected directly to a host will also 
> be managed (at some level anyway) as an SLD on that particular port.
>

The ambiguous part is... what commands relate specifically to an SLD?
The spec isn't really written that way, and the answer is that an SLD is
more of a lack of other functionality (specifically MLD functionality),
rather than its own set of functionality.

i.e. an SLD does not require an FM-Owned LD for management, but an MHD,
MLD, and DCD all do (at least in theory).

> > 
> > 2.5.1 continues on to describe "LD Management in MH-MLDs" but just ignores
> > that MH-SLDs (may) exist.  That's frustrating to say the least, but I
> > suppose we can gather from context that MH-SLD's *MAY NOT* have LD
> > management controls.
> 
> Hmm. In theory you could have an MH-SLD that used a config from flash or similar
> but that would be odd.  We need some level of dynamic control to make these
> devices useful.  Doesn't mean the spec should exclude dumb devices, but
> we shouldn't concentrate on them for emulation.
> 
> One possible usecase would be a device that always shares all its memory on
> all ports. Yuk.
> 

I can say that the earliest forms of MH-SLD, and certainly pre-DCD ones,
are likely to present all memory on all ports, and potentially provide
some custom commands to help hosts enforce exclusivity.

It's beyond the spec, but this can actually be emulated today with the
MH-SLD setup I describe below.  Certainly I expected a yuk factor to
proposing it, but I think the reality is on the path to 3.0 and DCD
devices we should at least entertain that someone will probably do this
with real hardware.

> > For the simplest MH-SLD device, these fields would be immutable, and
> > there would be a single LD for each head, where head_id == ld_id.
> 
> Agreed.
> 
> > 
> > So summarizing, what I took away from this was the following:
> > 
> > In the simplest form of MH-SLD, there is neither a switch, nor is
> > there LD management.  So, presumably, we don't HAVE to implement the
> > MHD commands to say we "have MH-SLD support".
> 
> Whilst theoretically possible - I don't think such a device is interesting.
> Minimum I'd want to see is something with multiple upstream SLD ports
> and a management LD with appropriate interface to poke it.
> 
>
> The MLD side of things is interesting only once we support MLDs in general
> in QEMU CXL emulation and even then they are near invisible to a host
> and are more interesting for emulating fabric management.
> 
> What you may want to do is take Fan's work on DCD and look at doing
> a simple MH-SLD device that uses same cheat of just using QMP commands
> to do the configuration.  That's an intermediate step to us getting
> the FM-API and similar commands implemented.
> 

I actually think it's a good step to go from MH-SLD to MH-SLD+DCD while
not having to worry about the complexity of MLD and switches.

(I have not gotten the chance to review the DCD patch set yet, it's on
my list for after ISC'23, I presume this is what has been done).

My thoughts would be that you would have something like the following:

-device ct3d,... etc etc
-device cxl-dcd,type3-backend=mem0,manager=true

the manager would be the owner of the FM-Owned LD, and therefore the
system responsible for managing requests for memory.

How we pass those messages between instances is then an exercise for the
reader.


What I have been doing is just creating a shared memory region with
ipcmk and using a separate program to initialize that shared state before
launching the guests.  I'll talk about this a little further down.


> > 
> > ... snip ...
> > 
> > 3. MH-SLD w/ Pool CCI  (Implementing only Get Multi-Headed Info)
> 
> I'd do this + DCD.
> 

I concur, and it's what I was looking into next.

I think your other notes on MH-* with switches is really where I was
left scratching my head.

When I look at Switch/MLD functionality vs DCD, I have a gut feeling the
vast majority of early device vendors are going to skip right over
switches and MLD setups and go directly to MH-SLD+DCD.

> > =================================
> > 2. MH-SLD No Switch, No Pool CCI.
> > =================================
> > 
> > But it's also not very useful.  You can only use the memory in devdax
> > mode, since it's a shared memory region. You could already do this via
> > the /dev/shm interface, so it's not even new functionality.
> > 
> > In theory you could build a pooling service in software-only on top of
> > memory blocks. That's an exercise left to the reader.
> 
> Yeah. Let's not do this step.
> 

Too late :].  It was useful as a learning exercise, but it's definitely
not upstream quality.  I may post it for the sake of the playground, but
I too would recommend against this method of pooling in the long term.

I made a proto-DCD command set that was reachable from each memdev
character device, and exposed it to every qemu instance as part of ct3d
(I'm still learning the QEMU ecosystem, so was easier to bodge it in
than make a new device and link it up).

Then I created a shared memory region with ipcmk, and implemented a
simple mutex in the space, as well as all the record keeping needed to
manage sections/extents.

> > shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`
> > ./cxl_mhd_init 4 $shmid1
> > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd=true,mhd_head=0,mhd_shmid=$1
> > 
> > ./cxl_mhd_init would simply setup the nr_heads/lds field and such
> > and set ldmap[0-3] to the values [0-3].  i.e. the head-to-ld mappings
> > are static (head_id==ld_id).
> > ... snip ...
> >
> > shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`
> > ./cxl_mhd_init 4 $shmid1
> > -device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=shmid

The last step was a few extra lines in the read/write functions to
ensure accesses to "Valid addresses" that "Aren't allocated" produce
errors.

At this point, each guest is basically capable of using the device to do
the coordination for you by simply calling the allocate/deallocate
functions.

And that's it, you've got pooling.  Each guest sees the full extent of
the entire device, but must ask the device for access to a given
section, and the section can be translated into a memory block number
under the given numa node.
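
For illustration only, here is a small C sketch of the kind of record
keeping described above: a per-section owner table with allocate,
release, and an access check that rejects valid-but-unallocated
addresses.  This is not the prototype's code.  All names, the fixed
64-section table, and the 0xFF "free" marker are assumptions, and the
caller is assumed to hold the shared mutex around each call.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_MAX_SECTIONS 64
#define SECTION_FREE 0xFF   /* hypothetical marker: no head owns the section */

/* Hypothetical record keeping living in the shared region.  Callers are
 * assumed to hold the shared mutex around every function below. */
struct pool_table {
    uint64_t section_size;              /* bytes per section */
    uint32_t nr_sections;               /* sections actually in use */
    uint8_t  owner[POOL_MAX_SECTIONS];  /* owning head, or SECTION_FREE */
};

static void pool_init(struct pool_table *t, uint64_t section_size,
                      uint32_t nr_sections)
{
    if (nr_sections > POOL_MAX_SECTIONS) {
        nr_sections = POOL_MAX_SECTIONS;
    }
    t->section_size = section_size;
    t->nr_sections = nr_sections;
    for (uint32_t i = 0; i < POOL_MAX_SECTIONS; i++) {
        t->owner[i] = SECTION_FREE;
    }
}

/* Give the first free section to head; returns its index or -1 if full. */
static int pool_alloc(struct pool_table *t, uint8_t head)
{
    for (uint32_t i = 0; i < t->nr_sections; i++) {
        if (t->owner[i] == SECTION_FREE) {
            t->owner[i] = head;
            return (int)i;
        }
    }
    return -1;
}

/* Release a section; only the owning head may release it. */
static bool pool_release(struct pool_table *t, uint8_t head, uint32_t section)
{
    if (section >= t->nr_sections || t->owner[section] != head) {
        return false;
    }
    t->owner[section] = SECTION_FREE;
    return true;
}

/* Read/write path check: is this DPA inside a section owned by head? */
static bool pool_access_ok(const struct pool_table *t, uint8_t head,
                           uint64_t dpa)
{
    if (t->section_size == 0) {
        return false;
    }
    uint64_t section = dpa / t->section_size;
    return section < t->nr_sections && t->owner[section] == head;
}
```

With 128MB sections, the section index returned by the allocator lines
up with a memory block number only if the guest's block size is also
128MB and no interleave is enabled; that is an assumption of this
sketch, not something the device can enforce.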


Ok, now let's talk about why this is bad and why you shouldn't do it
this way:

* Technically a number of BIOS/hardware interleave features can
  bite you pretty hard when making the assumption that memory blocks are
  physically contiguous hardware addresses. However, that assumption
  holds if you simply don't turn those options on, so it might be useful
  as an early-adopter platform.


* The security posture of a device like this is bad.  It requires each
  attached host to clear the memory before releasing it.  There isn't
  really a good way to do this in numa-mode, so you would have to
  implement custom firmware commands to ensure it happens, and that
  means custom drivers blah blah blah - not great.

  Basically you're trusting each host to play nice.  Not great.
  But potentially useful for early adopters regardless.


* General compatibility and being in-spec - this design requires a
  number of non-spec extensions, so just generally not recommended,
  certainly not here in QEMU.

> 
> A few different moving parts are needed and I think we'd end up with something that
> looks like
> 
> -device cxl-mhd,volatile-memdev=mem0,id=backend
> -device cxl-mhd-sld,mhd=backend,bus=rp0,mhd-head=0,id=dev1,tunnel=true
> -device cxl-mhd-sld,mhd=backend,bus=rp1,mhd-head=1,id=dev2
> 
> dev1 provides the tunneling interface, but the actual implementation of
> the pool CCI and actual memory mappings is in the backend. Note that backend
> might be proxy to an external process, or a client/server approach between multiple
> QEMU instances.

I've hummed and hawed over external process vs another QEMU instance and I
still haven't come to a satisfying answer here.  It feels extremely
heavy-handed to use an entirely separate QEMU instance just for this,
but there's nothing to say you can't just host it in one of the
head-attached instances.

I basically skipped this and allowed each instance to send the command
themselves, but serialized it with a mutex.  That way each instance can
operate cleanly without directly coordinating with each other.  I could
see a vendor implementing it this way on early devices.

I don't have a good answer for this yet, but maybe once I review the DCD
patch set I'll have more opinions.

> 
> or squish some parts and make a more extensible type3 device and have.
> 
> -device cxl-type3,volatile-memdev=mem0,bus=rp0,mhd-head=0,id=dev1,mhd-main=true
> -device cxl-type3,mhd=dev1,bus=rp1,mhd-head=1,id=dev2
> 

I originally went this route, but the downside of this is "What happens
when the main dies and has to restart".  There's all kinds of
badness associated with that.  It's why I moved the shared state into a
separately created ipcmk region.

> 
> To my mind there are a series of steps and questions here.
> 
> Which 'hotplug model'.
> 1) LD model for moving capacity
>   - If doing LD model, do MLDs and configurable switches first. Needed as a step along the
>     path anyway.  Deal with all the mess that brings and come back to MHD - as you note it
>     only makes sense with a switch in the path, so MLDs are a subset of the functionality anyway.
> 
> 2) DCD model for moving capacity
>   - MH-SLD with a pool CCI used to do DCD operations on the LDs.  Extension of
>     what Fan Ni is looking at.  He's making an SLD pretend to be a device
>     where DCD makes sense - whilst still using the CXL type 3 device. We probably shouldn't
>     do that without figuring out how to do an MHD-SLD - or at least a head that we intend
>     to hang this new stuff off - potentially just using the existing type 3 device with
>     more parameters as one of the MH-SLD heads that doesn't have the control interface and
>     different parameters if it does have the tunnel to the Pool CCI.
> 

Personally I think we should focus on the DCD model.  In fact, I think
we're already very close to this, as my personal prototype showed this
can work fairly cleanly, and I imagine I'll have a quick MHD patch set
once I get the chance to review the DCD patch set.

If I'm being honest, the more I look at the LD model, the less I
like it, but I understand that's how scale is going to be achieved.  I
don't know if focusing on that design right now is going to produce
adoption in the short term, since we're not likely to see those devices
for a few years.

MH-SLD+DCD is likely to show up much sooner, so I will target that.

~Gregory
Gregory Price May 29, 2023, 6:13 p.m. UTC | #2
On Wed, May 17, 2023 at 03:18:59PM +0100, Jonathan Cameron wrote:
> > 
> > i.e. an SLD does not require an FM-Owned LD for management, but an MHD,
> > MLD, and DCD all do (at least in theory).
> 
> DCD 'might' though I don't think anything in the spec rules that you 'must'
> control the SLD/MLD via the FM-API, it's just a spec provided option.
> From our point of view we don't want to get more creative so lets assume
> it does.
> 
> I can't immediately think of reason for a single head SLD to have an FM owned
> LD, though it may well have an MCTP CCI for querying stuff about it from an FM.
> 

Before I go running off into the woods, it seems like it would be simple
enough to simply make an FM-LD "device" which simply links a mhXXX device
and implements its own Mailbox CCI.

Maybe not "realistic", but to my mind this appears as a separate
character device in /dev/cxl/*. Maybe the realism here doesn't matter,
since we're just implementing for the sake of testing.  This is just a
straightforward way to pipe a DCD request into the device and trigger
DCD event log entries.

As commented earlier, this is done as a QEMU-fed event.  If that's
sufficient, a hack like this feels like it would be at least mildly
cleaner and easier to test against.


Example: consider a user wanting to issue a DCD command to add capacity.

Real world: this would be some out of band communication, and eventually
this results in a DCD command to the device that results in a
capacity-event showing up in the log. Maybe it happens over TCP and
drills down to a Redfish event that talks to the BMC that issues a
command over etc etc MCTP emulations, etc.

With a simplistic /dev/cxl/memX-fmld device a user can simply issue these
commands without all that, and the effect is the same.

On the QEMU side you get something like:

-device cxl-type3,volatile-memdev=mem0,bus=rp0,mhd-head=0,id=mem0,mhd-main=true
-device cxl-mhsld,type3=mem0,bus=rp0,headid=0,id=mhsld1,shmid=XXXXX
-device cxl-fmld,mhsld=mhsld1,bus=rp1,id=mem0-fmld,shmid=YYYYY

on the Linux side you get:
/dev/cxl/mem0
/dev/cxl/mem0-fmld

In this example, the shmid for mhsld is a shared memory region created
with ipcmk that implements the shared state (basically section bitmap
tracking and the actual plumbing for DCD, etc). This limits the emulation
of the mhsld to a single host for now, but that seems sufficient.

The shmid for cxl-fmld implements any shared state for the fmld,
including a mutex, that allows all hosts attached to the mhsld to have
access to the fmld.  This may or may not be realistic, but it would
allow all head-attached hosts to send DCD commands over its own local
fabric, rather than going out of band.

This gets us to the point where, at a minimum, each host can issue its
own DCD commands to add capacity to itself.  That's step 1.

Step 2 is allow Host A to issue a DCD command to add capacity to Host B.

I suppose this could be done via a background thread that waits on a
message to show up in the shared memory region?

Being somewhat unfamiliar with QEMU, is it kosher to start background
threads that just wait on events like this, or is that generally frowned
upon?  If done this way, it would simplify the creation and startup
sequence at least.
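
To make the question concrete, this is roughly the shape of message
slot I have in mind, sketched in C11.  Everything here is hypothetical
(names, layout, a single slot).  It assumes senders are serialized by
the existing shared mutex, and it says nothing about how the polling
side would be hooked into a QEMU thread or event loop.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum { MBOX_EMPTY = 0, MBOX_FULL = 1 };

/* Hypothetical single-slot mailbox placed in the shared region. */
struct mhd_mailbox {
    atomic_uint state;       /* MBOX_EMPTY or MBOX_FULL */
    uint8_t     target_head; /* head that should receive the capacity */
    uint64_t    dpa;         /* start of the extent being added */
    uint64_t    len;         /* length of the extent in bytes */
};

/* Sender (Host A): publish a request if the slot is empty.
 * Senders are assumed to be serialized by the shared mutex. */
static bool mbox_post(struct mhd_mailbox *mb, uint8_t target,
                      uint64_t dpa, uint64_t len)
{
    if (atomic_load_explicit(&mb->state, memory_order_acquire) != MBOX_EMPTY) {
        return false;
    }
    mb->target_head = target;
    mb->dpa = dpa;
    mb->len = len;
    /* Release: payload writes become visible before the slot reads FULL. */
    atomic_store_explicit(&mb->state, MBOX_FULL, memory_order_release);
    return true;
}

/* Receiver (Host B): called from whatever waits on the region.
 * Consumes the request only if it is addressed to my_head. */
static bool mbox_poll(struct mhd_mailbox *mb, uint8_t my_head,
                      uint64_t *dpa, uint64_t *len)
{
    if (atomic_load_explicit(&mb->state, memory_order_acquire) != MBOX_FULL) {
        return false;
    }
    if (mb->target_head != my_head) {
        return false;
    }
    *dpa = mb->dpa;
    *len = mb->len;
    atomic_store_explicit(&mb->state, MBOX_EMPTY, memory_order_release);
    return true;
}
```

The receiver would then turn the consumed request into the local DCD
event log entry.  A real version would want more than one slot and
some way to block instead of polling.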

~Gregory

Patch

diff --git a/include/hw/cxl/cxl_device.h b/include/hw/cxl/cxl_device.h
index 7b72345079..1a9f2708e1 100644
--- a/include/hw/cxl/cxl_device.h
+++ b/include/hw/cxl/cxl_device.h
@@ -356,16 +356,6 @@  typedef struct CXLPoison {
 typedef QLIST_HEAD(, CXLPoison) CXLPoisonList;
 #define CXL_POISON_LIST_LIMIT 256

+struct CXLMHDState {
+    uint8_t nr_heads;
+    uint8_t nr_lds;
+    uint8_t ldmap[];
+};
+
 struct CXLType3Dev {
     /* Private */
     PCIDevice parent_obj;
@@ -377,15 +367,6 @@  struct CXLType3Dev {
     HostMemoryBackend *lsa;
     uint64_t sn;

+
+    /* Multi-headed device settings */
+    struct {
+        bool active;
+        uint32_t headid;
+        uint32_t shmid;
+        struct CXLMHDState *state;
+    } mhd;
+


The way you would instantiate this would be via a separate process
that initializes the shared memory region:

shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`