diff mbox series

hw/nvme: reattach subsystem namespaces on hotplug

Message ID 20210909094308.122038-1-hare@suse.de
State New
Headers show
Series hw/nvme: reattach subsystem namespaces on hotplug | expand

Commit Message

Hannes Reinecke Sept. 9, 2021, 9:43 a.m. UTC
With commit 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
namespaces get moved from the controller to the subsystem if one
is specified.
That keeps the namespaces alive after a controller hot-unplug, but
after a controller hotplug we have to reconnect the namespaces
from the subsystem to the controller.

Fixes: 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
Cc: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 hw/nvme/subsys.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Klaus Jensen Sept. 9, 2021, 10:47 a.m. UTC | #1
On Sep  9 11:43, Hannes Reinecke wrote:
> With commit 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
> namespaces get moved from the controller to the subsystem if one
> is specified.
> That keeps the namespaces alive after a controller hot-unplug, but
> after a controller hotplug we have to reconnect the namespaces
> from the subsystem to the controller.
> 
> Fixes: 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
> Cc: Klaus Jensen <k.jensen@samsung.com>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>  hw/nvme/subsys.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
> index 93c35950d6..a9404f2b5e 100644
> --- a/hw/nvme/subsys.c
> +++ b/hw/nvme/subsys.c
> @@ -14,7 +14,7 @@
>  int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
>  {
>      NvmeSubsystem *subsys = n->subsys;
> -    int cntlid;
> +    int cntlid, nsid;
>  
>      for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) {
>          if (!subsys->ctrls[cntlid]) {
> @@ -29,12 +29,18 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
>  
>      subsys->ctrls[cntlid] = n;
>  
> +    for (nsid = 0; nsid < ARRAY_SIZE(subsys->namespaces); nsid++) {
> +        if (subsys->namespaces[nsid]) {
> +            nvme_attach_ns(n, subsys->namespaces[nsid]);
> +        }

Thanks Hannes! I like it, keeping things simple.

But we should only attach namespaces that have the shared property or
have ns->attached == 0. Non-shared namespaces may already be attached to
another controller in the subsystem.

However...

The spec says that "The attach and detach operations are persistent
across all reset events.". This means that we should track those events
in the subsystem and only reattach namespaces that were attached at the
time of the "reset" event. Currently, we don't have anything mapping
that state. But the device already has to take some liberties with
regard to stuff that is considered persistent by the spec (SMART log
etc.) since we do not have any way to store stuff persistently across
qemu invocations, so I think the above is an acceptable compromise.

A potential (as good as it gets) fix would be to keep a map/list of
"persistently" attached controllers on the namespaces and re-attach
according to that when we see that controller joining the subsystem
again. CNTLID would be the obvious choice for the key here, but problem
is that we cant really use it since we assign it sequentially from the
subsystem, which now looks like a pretty bad choice. CNTLID should have
been a required property of the nvme device when subsystems are
involved. Maybe we can fix up the CNTLID assignment to take the serial
into account (we know that is defined and *should* be unique) and not
reuse CNTLIDs. This limits the subsystem to NVME_MAX_CONTROLLERS unique
controllers, but I think that is fair enough.

Sigh. Need to think this through.

Bottomline I think I'm partial to just accepting your patch (with the
addition of taking the shared property into account) and documenting the
limitation wrt. persistency of attach/detach events. No matter how
spec-compliant we do it on a live system, we still break compliance
across QEMU invocations.
Hannes Reinecke Sept. 9, 2021, 11:37 a.m. UTC | #2
On 9/9/21 12:47 PM, Klaus Jensen wrote:
> On Sep  9 11:43, Hannes Reinecke wrote:
>> With commit 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
>> namespaces get moved from the controller to the subsystem if one
>> is specified.
>> That keeps the namespaces alive after a controller hot-unplug, but
>> after a controller hotplug we have to reconnect the namespaces
>> from the subsystem to the controller.
>>
>> Fixes: 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
>> Cc: Klaus Jensen <k.jensen@samsung.com>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>  hw/nvme/subsys.c | 8 +++++++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
>> index 93c35950d6..a9404f2b5e 100644
>> --- a/hw/nvme/subsys.c
>> +++ b/hw/nvme/subsys.c
>> @@ -14,7 +14,7 @@
>>  int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
>>  {
>>      NvmeSubsystem *subsys = n->subsys;
>> -    int cntlid;
>> +    int cntlid, nsid;
>>  
>>      for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) {
>>          if (!subsys->ctrls[cntlid]) {
>> @@ -29,12 +29,18 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
>>  
>>      subsys->ctrls[cntlid] = n;
>>  
>> +    for (nsid = 0; nsid < ARRAY_SIZE(subsys->namespaces); nsid++) {
>> +        if (subsys->namespaces[nsid]) {
>> +            nvme_attach_ns(n, subsys->namespaces[nsid]);
>> +        }
> 
> Thanks Hannes! I like it, keeping things simple.
> 
> But we should only attach namespaces that have the shared property or
> have ns->attached == 0. Non-shared namespaces may already be attached to
> another controller in the subsystem.
> 

Well ... I tried to avoid that subject, but as you brought it up:
There is a slightly tricky issue in fabrics, in that the 'controller' is
independent from the 'connection'.
The 'shared' bit in the CMIC setting indicates that the subsystem may
have more than one _controller_. It doesn't talk about how many
_connections_ a controller may support; that then is the realm of
dynamic or static controllers, which we don't talk about :-).
Sufficient to say, linux only implements the dynamic controller model,
so every connection will be to a different controller.
But you have been warned :-)

However, the 'CMIC' setting is independent on the 'NMIC' setting (ie the
'shared' bit in the namespace).
Which leads to the interesting question on how exactly should one handle
non-shared namespaces in subsystems for which there are multiple
controllers. Especially as the NSID space is per subsystem, so each
controller will be able to figure out if there are blanked-out namespaces.
So to make this a sane setup I would propose to set the 'shared' option
automatically whenever the controller has the 'subsys' option set.
And actually, I would ditch the 'shared' option completely, and make it
dependent on the 'subsys' setting for the controller.
Much like we treat the 'CMIC' setting nowadays.
That avoids lots of confusions, and also make the implementation _way_
easier.

> However...
> 
> The spec says that "The attach and detach operations are persistent
> across all reset events.". This means that we should track those events
> in the subsystem and only reattach namespaces that were attached at the
> time of the "reset" event. Currently, we don't have anything mapping
> that state. But the device already has to take some liberties with
> regard to stuff that is considered persistent by the spec (SMART log
> etc.) since we do not have any way to store stuff persistently across
> qemu invocations, so I think the above is an acceptable compromise.
> 
Careful. 'attach' and 'detach' are MI (management interface) operations.
If and how many namespaces are visible to any given controllers is
actually independent on that; and, in fact, controllers might not even
implement 'attach' or 'detach'.
But I do agree that we don't handle the 'reset' state properly.

> A potential (as good as it gets) fix would be to keep a map/list of
> "persistently" attached controllers on the namespaces and re-attach
> according to that when we see that controller joining the subsystem
> again. CNTLID would be the obvious choice for the key here, but problem
> is that we cant really use it since we assign it sequentially from the
> subsystem, which now looks like a pretty bad choice. CNTLID should have
> been a required property of the nvme device when subsystems are
> involved. Maybe we can fix up the CNTLID assignment to take the serial
> into account (we know that is defined and *should* be unique) and not
> reuse CNTLIDs. This limits the subsystem to NVME_MAX_CONTROLLERS unique
> controllers, but I think that is fair enough.
> 
> Sigh. Need to think this through.
> 
Well, actually there is an easy way out. I do agree that we need to make
the 'cntlid' a property of the controller. And once it's set we can
track it properly, eg by having per-cntlid nsid lists in the subsystem.
But if it's not set we can claim that we'll be allocating a new
controller across reboots (which is actually what we're doing), making
us spec compliant again :-)

Cheers,

Hannes
Klaus Jensen Sept. 23, 2021, 8:09 p.m. UTC | #3
On Sep  9 13:37, Hannes Reinecke wrote:
> On 9/9/21 12:47 PM, Klaus Jensen wrote:
> > On Sep  9 11:43, Hannes Reinecke wrote:
> >> With commit 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
> >> namespaces get moved from the controller to the subsystem if one
> >> is specified.
> >> That keeps the namespaces alive after a controller hot-unplug, but
> >> after a controller hotplug we have to reconnect the namespaces
> >> from the subsystem to the controller.
> >>
> >> Fixes: 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
> >> Cc: Klaus Jensen <k.jensen@samsung.com>
> >> Signed-off-by: Hannes Reinecke <hare@suse.de>
> >> ---
> >>  hw/nvme/subsys.c | 8 +++++++-
> >>  1 file changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
> >> index 93c35950d6..a9404f2b5e 100644
> >> --- a/hw/nvme/subsys.c
> >> +++ b/hw/nvme/subsys.c
> >> @@ -14,7 +14,7 @@
> >>  int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
> >>  {
> >>      NvmeSubsystem *subsys = n->subsys;
> >> -    int cntlid;
> >> +    int cntlid, nsid;
> >>  
> >>      for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) {
> >>          if (!subsys->ctrls[cntlid]) {
> >> @@ -29,12 +29,18 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
> >>  
> >>      subsys->ctrls[cntlid] = n;
> >>  
> >> +    for (nsid = 0; nsid < ARRAY_SIZE(subsys->namespaces); nsid++) {
> >> +        if (subsys->namespaces[nsid]) {
> >> +            nvme_attach_ns(n, subsys->namespaces[nsid]);
> >> +        }
> > 
> > Thanks Hannes! I like it, keeping things simple.
> > 
> > But we should only attach namespaces that have the shared property or
> > have ns->attached == 0. Non-shared namespaces may already be attached to
> > another controller in the subsystem.
> > 
> 
> Well ... I tried to avoid that subject, but as you brought it up:
> There is a slightly tricky issue in fabrics, in that the 'controller' is
> independent from the 'connection'.
> The 'shared' bit in the CMIC setting indicates that the subsystem may
> have more than one _controller_. It doesn't talk about how many
> _connections_ a controller may support; that then is the realm of
> dynamic or static controllers, which we don't talk about :-).
> Sufficient to say, linux only implements the dynamic controller model,
> so every connection will be to a different controller.
> But you have been warned :-)
> 
> However, the 'CMIC' setting is independent on the 'NMIC' setting (ie the
> 'shared' bit in the namespace).
> Which leads to the interesting question on how exactly should one handle
> non-shared namespaces in subsystems for which there are multiple
> controllers. Especially as the NSID space is per subsystem, so each
> controller will be able to figure out if there are blanked-out namespaces.
> So to make this a sane setup I would propose to set the 'shared' option
> automatically whenever the controller has the 'subsys' option set.
> And actually, I would ditch the 'shared' option completely, and make it
> dependent on the 'subsys' setting for the controller.
> Much like we treat the 'CMIC' setting nowadays.
> That avoids lots of confusions, and also make the implementation _way_
> easier.
> 

I see your point. Unfortunately, there is no easy way to ditch shared=
now. But I think it'd be good enough to attach to shared automatically,
but not to "shared=off".

I've already ditched the shared parameter on my RFC refactor series and
always having the namespaces shared.

> > However...
> > 
> > The spec says that "The attach and detach operations are persistent
> > across all reset events.". This means that we should track those events
> > in the subsystem and only reattach namespaces that were attached at the
> > time of the "reset" event. Currently, we don't have anything mapping
> > that state. But the device already has to take some liberties with
> > regard to stuff that is considered persistent by the spec (SMART log
> > etc.) since we do not have any way to store stuff persistently across
> > qemu invocations, so I think the above is an acceptable compromise.
> > 
> Careful. 'attach' and 'detach' are MI (management interface) operations.
> If and how many namespaces are visible to any given controllers is
> actually independent on that; and, in fact, controllers might not even
> implement 'attach' or 'detach'.
> But I do agree that we don't handle the 'reset' state properly.
> 

The Namespace Attachment command has nothing to do with MI? And the QEMU
controller does support attaching and detaching namespaces.

> > A potential (as good as it gets) fix would be to keep a map/list of
> > "persistently" attached controllers on the namespaces and re-attach
> > according to that when we see that controller joining the subsystem
> > again. CNTLID would be the obvious choice for the key here, but problem
> > is that we cant really use it since we assign it sequentially from the
> > subsystem, which now looks like a pretty bad choice. CNTLID should have
> > been a required property of the nvme device when subsystems are
> > involved. Maybe we can fix up the CNTLID assignment to take the serial
> > into account (we know that is defined and *should* be unique) and not
> > reuse CNTLIDs. This limits the subsystem to NVME_MAX_CONTROLLERS unique
> > controllers, but I think that is fair enough.
> > 
> > Sigh. Need to think this through.
> > 
> Well, actually there is an easy way out. I do agree that we need to make
> the 'cntlid' a property of the controller. And once it's set we can
> track it properly, eg by having per-cntlid nsid lists in the subsystem.
> But if it's not set we can claim that we'll be allocating a new
> controller across reboots (which is actually what we're doing), making
> us spec compliant again :-)
> 

That is a decent solution, but we still can't track it across reboots. I
think we should just with your patch (but for now, only automatically
attach shared namespaces).

Would that be acceptable to you? It would require your to add shared=on
on your single-namespace test set up since unfortunately, shared=on is
not the default. However, we can consider changing that default in 6.2.
Hannes Reinecke Sept. 24, 2021, 6:05 a.m. UTC | #4
On 9/23/21 10:09 PM, Klaus Jensen wrote:
> On Sep  9 13:37, Hannes Reinecke wrote:
>> On 9/9/21 12:47 PM, Klaus Jensen wrote:
>>> On Sep  9 11:43, Hannes Reinecke wrote:
>>>> With commit 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
>>>> namespaces get moved from the controller to the subsystem if one
>>>> is specified.
>>>> That keeps the namespaces alive after a controller hot-unplug, but
>>>> after a controller hotplug we have to reconnect the namespaces
>>>> from the subsystem to the controller.
>>>>
>>>> Fixes: 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
>>>> Cc: Klaus Jensen <k.jensen@samsung.com>
>>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>>> ---
>>>>   hw/nvme/subsys.c | 8 +++++++-
>>>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
>>>> index 93c35950d6..a9404f2b5e 100644
>>>> --- a/hw/nvme/subsys.c
>>>> +++ b/hw/nvme/subsys.c
>>>> @@ -14,7 +14,7 @@
>>>>   int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
>>>>   {
>>>>       NvmeSubsystem *subsys = n->subsys;
>>>> -    int cntlid;
>>>> +    int cntlid, nsid;
>>>>   
>>>>       for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) {
>>>>           if (!subsys->ctrls[cntlid]) {
>>>> @@ -29,12 +29,18 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
>>>>   
>>>>       subsys->ctrls[cntlid] = n;
>>>>   
>>>> +    for (nsid = 0; nsid < ARRAY_SIZE(subsys->namespaces); nsid++) {
>>>> +        if (subsys->namespaces[nsid]) {
>>>> +            nvme_attach_ns(n, subsys->namespaces[nsid]);
>>>> +        }
>>>
>>> Thanks Hannes! I like it, keeping things simple.
>>>
>>> But we should only attach namespaces that have the shared property or
>>> have ns->attached == 0. Non-shared namespaces may already be attached to
>>> another controller in the subsystem.
>>>
>>
>> Well ... I tried to avoid that subject, but as you brought it up:
>> There is a slightly tricky issue in fabrics, in that the 'controller' is
>> independent from the 'connection'.
>> The 'shared' bit in the CMIC setting indicates that the subsystem may
>> have more than one _controller_. It doesn't talk about how many
>> _connections_ a controller may support; that then is the realm of
>> dynamic or static controllers, which we don't talk about :-).
>> Sufficient to say, linux only implements the dynamic controller model,
>> so every connection will be to a different controller.
>> But you have been warned :-)
>>
>> However, the 'CMIC' setting is independent on the 'NMIC' setting (ie the
>> 'shared' bit in the namespace).
>> Which leads to the interesting question on how exactly should one handle
>> non-shared namespaces in subsystems for which there are multiple
>> controllers. Especially as the NSID space is per subsystem, so each
>> controller will be able to figure out if there are blanked-out namespaces.
>> So to make this a sane setup I would propose to set the 'shared' option
>> automatically whenever the controller has the 'subsys' option set.
>> And actually, I would ditch the 'shared' option completely, and make it
>> dependent on the 'subsys' setting for the controller.
>> Much like we treat the 'CMIC' setting nowadays.
>> That avoids lots of confusions, and also make the implementation _way_
>> easier.
>>
> 
> I see your point. Unfortunately, there is no easy way to ditch shared=
> now. But I think it'd be good enough to attach to shared automatically,
> but not to "shared=off".
> 
> I've already ditched the shared parameter on my RFC refactor series and
> always having the namespaces shared.
> 

Okay.

>>> However...
>>>
>>> The spec says that "The attach and detach operations are persistent
>>> across all reset events.". This means that we should track those events
>>> in the subsystem and only reattach namespaces that were attached at the
>>> time of the "reset" event. Currently, we don't have anything mapping
>>> that state. But the device already has to take some liberties with
>>> regard to stuff that is considered persistent by the spec (SMART log
>>> etc.) since we do not have any way to store stuff persistently across
>>> qemu invocations, so I think the above is an acceptable compromise.
>>>
>> Careful. 'attach' and 'detach' are MI (management interface) operations.
>> If and how many namespaces are visible to any given controllers is
>> actually independent on that; and, in fact, controllers might not even
>> implement 'attach' or 'detach'.
>> But I do agree that we don't handle the 'reset' state properly.
>>
> 
> The Namespace Attachment command has nothing to do with MI? And the QEMU
> controller does support attaching and detaching namespaces.
> 

No, you got me wrong. Whether a not a namespace is attached is 
independent of any 'Namespace attachment' command.
Case in point: the subsystem will be starting up with namespace already 
attached, even though no-one issued any namespace attachment command.

>>> A potential (as good as it gets) fix would be to keep a map/list of
>>> "persistently" attached controllers on the namespaces and re-attach
>>> according to that when we see that controller joining the subsystem
>>> again. CNTLID would be the obvious choice for the key here, but problem
>>> is that we cant really use it since we assign it sequentially from the
>>> subsystem, which now looks like a pretty bad choice. CNTLID should have
>>> been a required property of the nvme device when subsystems are
>>> involved. Maybe we can fix up the CNTLID assignment to take the serial
>>> into account (we know that is defined and *should* be unique) and not
>>> reuse CNTLIDs. This limits the subsystem to NVME_MAX_CONTROLLERS unique
>>> controllers, but I think that is fair enough.
>>>
>>> Sigh. Need to think this through.
>>>
>> Well, actually there is an easy way out. I do agree that we need to make
>> the 'cntlid' a property of the controller. And once it's set we can
>> track it properly, eg by having per-cntlid nsid lists in the subsystem.
>> But if it's not set we can claim that we'll be allocating a new
>> controller across reboots (which is actually what we're doing), making
>> us spec compliant again :-)
>>
> 
> That is a decent solution, but we still can't track it across reboots. I
> think we should just with your patch (but for now, only automatically
> attach shared namespaces).
> 
But that's the point.
We don't _need_ to track it.
We only need to track it when it's specified via a (yet-to-be-added) 
cntlid parameter, but then it's trivial.
If it's not specified we will allocate a new one and don't need to track 
stuff. That's even permitted by the NVMe spec; v2.0 section 3.1.1 has:

While allocation of static controllers to hosts are expected to be 
durable (so that hosts can expect to form associations to the same 
controllers repeatedly (e.g., after each host reboot)), the NVM 
subsystem may remove the host allocation of a controller that is not in 
use at any time for implementation specific reasons
(e.g., controller resource reclamation, subsystem reconfiguration).

> Would that be acceptable to you? It would require your to add shared=on
> on your single-namespace test set up since unfortunately, shared=on is
> not the default. However, we can consider changing that default in 6.2.
> 
Yes, for an interim solution it's okay.
The important bit is to get hotplug up and running again for NVMe; we do 
have some testcases for patches in upstream linux which I would like to 
forward to our QA Team.

Cheers

Hannes
Klaus Jensen Sept. 24, 2021, 7:29 a.m. UTC | #5
On Sep 24 08:05, Hannes Reinecke wrote:
> On 9/23/21 10:09 PM, Klaus Jensen wrote:
> > On Sep  9 13:37, Hannes Reinecke wrote:
> > > On 9/9/21 12:47 PM, Klaus Jensen wrote:
> > > > On Sep  9 11:43, Hannes Reinecke wrote:
> > > > > With commit 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
> > > > > namespaces get moved from the controller to the subsystem if one
> > > > > is specified.
> > > > > That keeps the namespaces alive after a controller hot-unplug, but
> > > > > after a controller hotplug we have to reconnect the namespaces
> > > > > from the subsystem to the controller.
> > > > > 
> > > > > Fixes: 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
> > > > > Cc: Klaus Jensen <k.jensen@samsung.com>
> > > > > Signed-off-by: Hannes Reinecke <hare@suse.de>
> > > > > ---
> > > > >   hw/nvme/subsys.c | 8 +++++++-
> > > > >   1 file changed, 7 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
> > > > > index 93c35950d6..a9404f2b5e 100644
> > > > > --- a/hw/nvme/subsys.c
> > > > > +++ b/hw/nvme/subsys.c
> > > > > @@ -14,7 +14,7 @@
> > > > >   int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
> > > > >   {
> > > > >       NvmeSubsystem *subsys = n->subsys;
> > > > > -    int cntlid;
> > > > > +    int cntlid, nsid;
> > > > >       for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) {
> > > > >           if (!subsys->ctrls[cntlid]) {
> > > > > @@ -29,12 +29,18 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
> > > > >       subsys->ctrls[cntlid] = n;
> > > > > +    for (nsid = 0; nsid < ARRAY_SIZE(subsys->namespaces); nsid++) {
> > > > > +        if (subsys->namespaces[nsid]) {
> > > > > +            nvme_attach_ns(n, subsys->namespaces[nsid]);
> > > > > +        }
> > > > 
> > > > Thanks Hannes! I like it, keeping things simple.
> > > > 
> > > > But we should only attach namespaces that have the shared property or
> > > > have ns->attached == 0. Non-shared namespaces may already be attached to
> > > > another controller in the subsystem.
> > > > 
> > > 
> > > Well ... I tried to avoid that subject, but as you brought it up:
> > > There is a slightly tricky issue in fabrics, in that the 'controller' is
> > > independent from the 'connection'.
> > > The 'shared' bit in the CMIC setting indicates that the subsystem may
> > > have more than one _controller_. It doesn't talk about how many
> > > _connections_ a controller may support; that then is the realm of
> > > dynamic or static controllers, which we don't talk about :-).
> > > Sufficient to say, linux only implements the dynamic controller model,
> > > so every connection will be to a different controller.
> > > But you have been warned :-)
> > > 
> > > However, the 'CMIC' setting is independent on the 'NMIC' setting (ie the
> > > 'shared' bit in the namespace).
> > > Which leads to the interesting question on how exactly should one handle
> > > non-shared namespaces in subsystems for which there are multiple
> > > controllers. Especially as the NSID space is per subsystem, so each
> > > controller will be able to figure out if there are blanked-out namespaces.
> > > So to make this a sane setup I would propose to set the 'shared' option
> > > automatically whenever the controller has the 'subsys' option set.
> > > And actually, I would ditch the 'shared' option completely, and make it
> > > dependent on the 'subsys' setting for the controller.
> > > Much like we treat the 'CMIC' setting nowadays.
> > > That avoids lots of confusions, and also make the implementation _way_
> > > easier.
> > > 
> > 
> > I see your point. Unfortunately, there is no easy way to ditch shared=
> > now. But I think it'd be good enough to attach to shared automatically,
> > but not to "shared=off".
> > 
> > I've already ditched the shared parameter on my RFC refactor series and
> > always having the namespaces shared.
> > 
> 
> Okay.
> 
> > > > However...
> > > > 
> > > > The spec says that "The attach and detach operations are persistent
> > > > across all reset events.". This means that we should track those events
> > > > in the subsystem and only reattach namespaces that were attached at the
> > > > time of the "reset" event. Currently, we don't have anything mapping
> > > > that state. But the device already has to take some liberties with
> > > > regard to stuff that is considered persistent by the spec (SMART log
> > > > etc.) since we do not have any way to store stuff persistently across
> > > > qemu invocations, so I think the above is an acceptable compromise.
> > > > 
> > > Careful. 'attach' and 'detach' are MI (management interface) operations.
> > > If and how many namespaces are visible to any given controllers is
> > > actually independent on that; and, in fact, controllers might not even
> > > implement 'attach' or 'detach'.
> > > But I do agree that we don't handle the 'reset' state properly.
> > > 
> > 
> > The Namespace Attachment command has nothing to do with MI? And the QEMU
> > controller does support attaching and detaching namespaces.
> > 
> 
> No, you got me wrong. Whether a not a namespace is attached is independent
> of any 'Namespace attachment' command.
> Case in point: the subsystem will be starting up with namespace already
> attached, even though no-one issued any namespace attachment command.
> 

Right, understood.

> > > > A potential (as good as it gets) fix would be to keep a map/list of
> > > > "persistently" attached controllers on the namespaces and re-attach
> > > > according to that when we see that controller joining the subsystem
> > > > again. CNTLID would be the obvious choice for the key here, but problem
> > > > is that we cant really use it since we assign it sequentially from the
> > > > subsystem, which now looks like a pretty bad choice. CNTLID should have
> > > > been a required property of the nvme device when subsystems are
> > > > involved. Maybe we can fix up the CNTLID assignment to take the serial
> > > > into account (we know that is defined and *should* be unique) and not
> > > > reuse CNTLIDs. This limits the subsystem to NVME_MAX_CONTROLLERS unique
> > > > controllers, but I think that is fair enough.
> > > > 
> > > > Sigh. Need to think this through.
> > > > 
> > > Well, actually there is an easy way out. I do agree that we need to make
> > > the 'cntlid' a property of the controller. And once it's set we can
> > > track it properly, eg by having per-cntlid nsid lists in the subsystem.
> > > But if it's not set we can claim that we'll be allocating a new
> > > controller across reboots (which is actually what we're doing), making
> > > us spec compliant again :-)
> > > 
> > 
> > That is a decent solution, but we still can't track it across reboots. I
> > think we should just with your patch (but for now, only automatically
> > attach shared namespaces).
> > 
> But that's the point.
> We don't _need_ to track it.
> We only need to track it when it's specified via a (yet-to-be-added) cntlid
> parameter, but then it's trivial.
> If it's not specified we will allocate a new one and don't need to track
> stuff. That's even permitted by the NVMe spec; v2.0 section 3.1.1 has:
> 
> While allocation of static controllers to hosts are expected to be durable
> (so that hosts can expect to form associations to the same controllers
> repeatedly (e.g., after each host reboot)), the NVM subsystem may remove the
> host allocation of a controller that is not in use at any time for
> implementation specific reasons
> (e.g., controller resource reclamation, subsystem reconfiguration).
> 

Awesome. Thanks for pointing me towards that wording.

> > Would that be acceptable to you? It would require your to add shared=on
> > on your single-namespace test set up since unfortunately, shared=on is
> > not the default. However, we can consider changing that default in 6.2.
> > 
> Yes, for an interim solution it's okay.
> The important bit is to get hotplug up and running again for NVMe; we do
> have some testcases for patches in upstream linux which I would like to
> forward to our QA Team.
> 

Cool. I've sent out a series with your patch and an additional patch
that changes the default for 'shared'.
diff mbox series

Patch

diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 93c35950d6..a9404f2b5e 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -14,7 +14,7 @@ 
 int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
 {
     NvmeSubsystem *subsys = n->subsys;
-    int cntlid;
+    int cntlid, nsid;
 
     for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) {
         if (!subsys->ctrls[cntlid]) {
@@ -29,12 +29,18 @@  int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
 
     subsys->ctrls[cntlid] = n;
 
+    for (nsid = 0; nsid < ARRAY_SIZE(subsys->namespaces); nsid++) {
+        if (subsys->namespaces[nsid]) {
+            nvme_attach_ns(n, subsys->namespaces[nsid]);
+        }
+    }
     return cntlid;
 }
 
 void nvme_subsys_unregister_ctrl(NvmeSubsystem *subsys, NvmeCtrl *n)
 {
     subsys->ctrls[n->cntlid] = NULL;
+    n->cntlid = -1;
 }
 
 static void nvme_subsys_setup(NvmeSubsystem *subsys)