diff mbox series

[v2,1/1] spapr.c: always pulse guest IRQ in spapr_core_unplug_request()

Message ID 20210118193035.2089474-2-danielhb413@gmail.com
State New
Headers show
Series pseries: allow CPU unplug after failed attempt | expand

Commit Message

Daniel Henrique Barboza Jan. 18, 2021, 7:30 p.m. UTC
Commit 47c8c915b162 fixed a problem where multiple spapr_drc_detach()
requests were breaking QEMU. The solution was to just spapr_drc_detach()
once, and use spapr_drc_unplug_requested() to filter whether we
already detached it or not. The commit also tied the hotplug request
to the guest in the same condition.

Turns out that there is a reliable way for a CPU hotunplug to fail. If
a guest with one CPU hotplugs a CPU1, then offline CPU0s via
'echo 0 > /sys/devices/system/cpu/cpu0/online', then attempts to
hotunplug CPU1, the kernel will refuse it because it's the last online
CPU of the system. Given that we're pulsing the IRQ only in the first try,
in a failed attempt, all other CPU1 hotunplug attempts will fail, regardless
of the online state of CPU1 in the kernel, because we're simply not letting
the guest know that we want to hotunplug the device.

Let's move spapr_hotplug_req_remove_by_index() back out of the
"if (!spapr_drc_unplug_requested(drc))" conditional, allowing for multiple
'device_del' requests to the same CPU core to reach the guest, in case
the CPU core didn't fully hotunplugged previously. Granted, this is not
optimal because this can allow for multiple hotplug events queueing in the
guest, like it was already possible before 47c8c915b162.

Other possible alternatives would be:

- check if the given CPU is the last online CPU in the guest before attempting
to hotunplug. This can be done by checking 'cs->halted' and msr states of
the core. Problem is, this approach will fail if the guest offlines/onlines
a CPU while we're in the middle of the unplug request, given that we can't
control whether the CPU core states will change in the kernel. This option
sure makes it harder to allow a hotunplug failure to happen, but will never
be enough to fully avoid it;

- let the user handled it. In this case, we would advise the user to reboot the
guest and the CPU will be removed during machine reset.

None of the alternatives are clear winnners, so this patch goes for the approach
makes the IRQ queue of the guest prone to multiple hotunplug requests for the
same CPU, but at least the user can successfully hotunplug the CPU after a failed
attempt, without the need of guest reboot.

Reported-by: Xujun Ma <xuma@redhat.com>
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1911414
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Comments

Greg Kurz Jan. 19, 2021, 9:50 a.m. UTC | #1
On Mon, 18 Jan 2021 16:30:35 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> Commit 47c8c915b162 fixed a problem where multiple spapr_drc_detach()
> requests were breaking QEMU. The solution was to just spapr_drc_detach()
> once, and use spapr_drc_unplug_requested() to filter whether we
> already detached it or not. The commit also tied the hotplug request
> to the guest in the same condition.
> 
> Turns out that there is a reliable way for a CPU hotunplug to fail. If
> a guest with one CPU hotplugs a CPU1, then offline CPU0s via
> 'echo 0 > /sys/devices/system/cpu/cpu0/online', then attempts to
> hotunplug CPU1, the kernel will refuse it because it's the last online
> CPU of the system. Given that we're pulsing the IRQ only in the first try,
> in a failed attempt, all other CPU1 hotunplug attempts will fail, regardless
> of the online state of CPU1 in the kernel, because we're simply not letting
> the guest know that we want to hotunplug the device.
> 
> Let's move spapr_hotplug_req_remove_by_index() back out of the
> "if (!spapr_drc_unplug_requested(drc))" conditional, allowing for multiple
> 'device_del' requests to the same CPU core to reach the guest, in case
> the CPU core didn't fully hotunplugged previously. Granted, this is not
> optimal because this can allow for multiple hotplug events queueing in the
> guest, like it was already possible before 47c8c915b162.
> 
> Other possible alternatives would be:
> 
> - check if the given CPU is the last online CPU in the guest before attempting
> to hotunplug. This can be done by checking 'cs->halted' and msr states of
> the core. Problem is, this approach will fail if the guest offlines/onlines
> a CPU while we're in the middle of the unplug request, given that we can't
> control whether the CPU core states will change in the kernel. This option
> sure makes it harder to allow a hotunplug failure to happen, but will never
> be enough to fully avoid it;
> 
> - let the user handled it. In this case, we would advise the user to reboot the
> guest and the CPU will be removed during machine reset.
> 

This is actually the only viable option since there's no way for the guest to
report an unplug request failure to QEMU. And this isn't specific to CPUs, eg.
Linux can also block unplug requests for DIMM devices if some LMB doesn't belong
to ZONE_MOVABLE. The solution for this is to tell linux to always put hot-plugged
memory in ZONE_MOVABLE.

Could something similar be done for CPUs ? For example, forbidding the off-lining
of CPU0 at the linux level : this would ensure all cores, except the one that has
CPU0, are always hot-unpluggable.

> None of the alternatives are clear winnners, so this patch goes for the approach
> makes the IRQ queue of the guest prone to multiple hotunplug requests for the
> same CPU, but at least the user can successfully hotunplug the CPU after a failed
> attempt, without the need of guest reboot.
> 
> Reported-by: Xujun Ma <xuma@redhat.com>
> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1911414
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index e9e3816cd3..e2f12ce413 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -3737,8 +3737,17 @@ void spapr_core_unplug_request(HotplugHandler *hotplug_dev, DeviceState *dev,
>  
>      if (!spapr_drc_unplug_requested(drc)) {
>          spapr_drc_detach(drc);
> -        spapr_hotplug_req_remove_by_index(drc);
>      }
> +
> +    /*
> +     * spapr_hotplug_req_remove_by_index is left unguarded, out of the
> +     * "!spapr_drc_unplug_requested" check, to allow for multiple IRQ
> +     * pulses removing the same CPU core. Otherwise, in an failed hotunplug
> +     * attempt (e.g. the kernel will refuse to remove the last online CPU
> +     * core), we will never attempt it again because unplug_requested will
> +     * still be 'true' in that case.
> +     */
> +    spapr_hotplug_req_remove_by_index(drc);

This not only fire the IRQ again, it also enqueues a new event... have
you tried hammering QEMU with CPU hot-plug/unplug requests. I seem to
remember that the troubles fixed by 47c8c915b162 had more to do with
the DRC state machine than the hot-plug event itself, but posting the
same event several times during a regular hot-unplug sequence could
maybe cause subtle bugs as well.

Honestly, this is still a band-aid : it doesn't fix anything, it just
gives an alternative solution to reboot when someone has done something
silly. I'd rather not loosen our sanity checks for such a corner case.

On the other side, the at-least-one-cpu thing is a linux limitation.
It seems fair that linux should provide a way to mitigate the effects.
Like suggested above, this could be the ability to elect an individual
vCPU to be always on-line. Since QEMU refuses the hotplug of the boot
core, QEMU could maybe tell the guest to elect CPU0 through some
flag in the DT, like we've done recently for LMBs.

>  }
>  
>  int spapr_core_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
Daniel Henrique Barboza Jan. 19, 2021, 1:11 p.m. UTC | #2
On 1/19/21 6:50 AM, Greg Kurz wrote:
> On Mon, 18 Jan 2021 16:30:35 -0300
> Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
> 
>> Commit 47c8c915b162 fixed a problem where multiple spapr_drc_detach()
>> requests were breaking QEMU. The solution was to just spapr_drc_detach()
>> once, and use spapr_drc_unplug_requested() to filter whether we
>> already detached it or not. The commit also tied the hotplug request
>> to the guest in the same condition.
>>
>> Turns out that there is a reliable way for a CPU hotunplug to fail. If
>> a guest with one CPU hotplugs a CPU1, then offline CPU0s via
>> 'echo 0 > /sys/devices/system/cpu/cpu0/online', then attempts to
>> hotunplug CPU1, the kernel will refuse it because it's the last online
>> CPU of the system. Given that we're pulsing the IRQ only in the first try,
>> in a failed attempt, all other CPU1 hotunplug attempts will fail, regardless
>> of the online state of CPU1 in the kernel, because we're simply not letting
>> the guest know that we want to hotunplug the device.
>>
>> Let's move spapr_hotplug_req_remove_by_index() back out of the
>> "if (!spapr_drc_unplug_requested(drc))" conditional, allowing for multiple
>> 'device_del' requests to the same CPU core to reach the guest, in case
>> the CPU core didn't fully hotunplugged previously. Granted, this is not
>> optimal because this can allow for multiple hotplug events queueing in the
>> guest, like it was already possible before 47c8c915b162.
>>
>> Other possible alternatives would be:
>>
>> - check if the given CPU is the last online CPU in the guest before attempting
>> to hotunplug. This can be done by checking 'cs->halted' and msr states of
>> the core. Problem is, this approach will fail if the guest offlines/onlines
>> a CPU while we're in the middle of the unplug request, given that we can't
>> control whether the CPU core states will change in the kernel. This option
>> sure makes it harder to allow a hotunplug failure to happen, but will never
>> be enough to fully avoid it;
>>
>> - let the user handled it. In this case, we would advise the user to reboot the
>> guest and the CPU will be removed during machine reset.
>>
> 
> This is actually the only viable option since there's no way for the guest to
> report an unplug request failure to QEMU. And this isn't specific to CPUs, eg.
> Linux can also block unplug requests for DIMM devices if some LMB doesn't belong
> to ZONE_MOVABLE. The solution for this is to tell linux to always put hot-plugged
> memory in ZONE_MOVABLE.

Indeed, the lack of a 'not going to comply with this unplug request' logic in
PAPR is the source of a lot of bugs and code in QEMU. As I said to you offline,
I had an old idea of proposing a PAPR change to add this mechanic, and then
QEMU can be aware of unplug requests that are denied by the kernel right before
all the DRC state transition occurs. This will demand some changes, but the net
result is more predictability in all unplug operations.

Note that the new LMB DT attribte doesn't fix the problems for DRCs. It just makes
it harder to occur. There's nothing holding the kernel back from refusing LMB unplug
requests in stress conditions (e.g. a stress-ng workload overloading all the RAM).

> 
> Could something similar be done for CPUs ? For example, forbidding the off-lining
> of CPU0 at the linux level : this would ensure all cores, except the one that has
> CPU0, are always hot-unpluggable.

I don't think so. LMBs has a strong case for that extra DT flag because the OS can
expand the kernel space to the hotplugged DIMMs and so on. As you said in the v1
review, PAPR already mentions that offlining all VCPUs should just terminate the
partition, so why do we need a flag to forbid CPU offlining?

We're dealing with a kernel restriction (always need an online CPU) that PAPR doesn't
predicts in the spec. Not only that, but it's also a dynamic restriction. The OS can
deny hotunplug of any of the available CPUs, as long as it is the last online. This
puts QEMU into this weird spot we are now: we can't guarantee that the vcpu state
will remain in the guest during the unplug request.

x86 for instance forbids offlining CPU0. I'm not sure if it's an ACPI restriction or
kernel design choice. But this would fix the issue we're having here. Of all the
available choices for a fix, patching the pseries kernel to forbid offlining CPU0
looks a very sane one in my estimation.


> 
>> None of the alternatives are clear winnners, so this patch goes for the approach
>> makes the IRQ queue of the guest prone to multiple hotunplug requests for the
>> same CPU, but at least the user can successfully hotunplug the CPU after a failed
>> attempt, without the need of guest reboot.
>>
>> Reported-by: Xujun Ma <xuma@redhat.com>
>> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1911414
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> ---
>>   hw/ppc/spapr.c | 11 ++++++++++-
>>   1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index e9e3816cd3..e2f12ce413 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -3737,8 +3737,17 @@ void spapr_core_unplug_request(HotplugHandler *hotplug_dev, DeviceState *dev,
>>   
>>       if (!spapr_drc_unplug_requested(drc)) {
>>           spapr_drc_detach(drc);
>> -        spapr_hotplug_req_remove_by_index(drc);
>>       }
>> +
>> +    /*
>> +     * spapr_hotplug_req_remove_by_index is left unguarded, out of the
>> +     * "!spapr_drc_unplug_requested" check, to allow for multiple IRQ
>> +     * pulses removing the same CPU core. Otherwise, in an failed hotunplug
>> +     * attempt (e.g. the kernel will refuse to remove the last online CPU
>> +     * core), we will never attempt it again because unplug_requested will
>> +     * still be 'true' in that case.
>> +     */
>> +    spapr_hotplug_req_remove_by_index(drc);
> 
> This not only fire the IRQ again, it also enqueues a new event... have
> you tried hammering QEMU with CPU hot-plug/unplug requests. I seem to
> remember that the troubles fixed by 47c8c915b162 had more to do with
> the DRC state machine than the hot-plug event itself, but posting the
> same event several times during a regular hot-unplug sequence could
> maybe cause subtle bugs as well.

I tried hammering the guest with lots of IRQ events. Aside from the dmesg
flood in the guest kernel everything went fine. No guest breakage of any
sort.

It is also worth noticing that the unplug_request code is protected by
BQL. It is not possible to spam the guest queue out of control.

> 
> Honestly, this is still a band-aid : it doesn't fix anything, it just
> gives an alternative solution to reboot when someone has done something
> silly. I'd rather not loosen our sanity checks for such a corner case.

It doesn't fix indeed. But documenting that the user shouldn't offline CPU0
in the kernel because QEMU can't handle unplug properly isn't fixing anything
either.

Thing is, our spapr code is too optimistic, and not justifiable in my opinion,
in all unplug code for all devices. It assumes that everything will go as
intended, but in reality we're hoping it will work. Otherwise, we're going to
ask the user to reboot the guest so the unplug can happen in CAS.

This change I'm making makes the CPU unplug code way more pessimistic, and it
can open a way for device_del abuse via an IRQ spam. But in the end it will
alleviate he situation not only of this particular bug, but all "insert a random
reason why the kernel denied a CPU hotunplug, and now I can't unplug the
CPU anymore" bugs in the future as well. It gives the user the choice to try
again if something didn't go as planned. As far as I'm concerned, we should
fire up spapr_hotplug_req_remove_by_index() from all unplug_request() calls
of all devices regardless of unplug_request being true or not.

And if we're concerned about abuse of device_del spam, we can document that
device_del can't be spammed in a 'unplug until it works' fashion. I'd rather
advice the user to use device_del properly than to tell the user "if you offline
CPUs and enter this situation, you'll need to reboot the guest".

> 
> On the other side, the at-least-one-cpu thing is a linux limitation.
> It seems fair that linux should provide a way to mitigate the effects.
> Like suggested above, this could be the ability to elect an individual
> vCPU to be always on-line. Since QEMU refuses the hotplug of the boot
> core, QEMU could maybe tell the guest to elect CPU0 through some
> flag in the DT, like we've done recently for LMBs.


As I said above, I think that patching the kernel to avoid CPU0 offlining looks
like a viable idea.


I still believe that this patch should go through regardless though. I'm convinced
that we can't be optimistic about the CPU unplug process as a whole, and advising
the user to not spam the guest IRQ is a better option than documenting that we
need to reset the machine to handle the unplug, because the user did something
that the kernel fully allows, but we can't deal with and won't even allow further
attempts at it. I mean, what are we? Windows?



Thanks,


DHB




> 
>>   }
>>   
>>   int spapr_core_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>
Greg Kurz Jan. 19, 2021, 4:36 p.m. UTC | #3
On Tue, 19 Jan 2021 10:11:19 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> 
> 
> On 1/19/21 6:50 AM, Greg Kurz wrote:
> > On Mon, 18 Jan 2021 16:30:35 -0300
> > Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
> > 
> >> Commit 47c8c915b162 fixed a problem where multiple spapr_drc_detach()
> >> requests were breaking QEMU. The solution was to just spapr_drc_detach()
> >> once, and use spapr_drc_unplug_requested() to filter whether we
> >> already detached it or not. The commit also tied the hotplug request
> >> to the guest in the same condition.
> >>
> >> Turns out that there is a reliable way for a CPU hotunplug to fail. If
> >> a guest with one CPU hotplugs a CPU1, then offline CPU0s via
> >> 'echo 0 > /sys/devices/system/cpu/cpu0/online', then attempts to
> >> hotunplug CPU1, the kernel will refuse it because it's the last online
> >> CPU of the system. Given that we're pulsing the IRQ only in the first try,
> >> in a failed attempt, all other CPU1 hotunplug attempts will fail, regardless
> >> of the online state of CPU1 in the kernel, because we're simply not letting
> >> the guest know that we want to hotunplug the device.
> >>
> >> Let's move spapr_hotplug_req_remove_by_index() back out of the
> >> "if (!spapr_drc_unplug_requested(drc))" conditional, allowing for multiple
> >> 'device_del' requests to the same CPU core to reach the guest, in case
> >> the CPU core didn't fully hotunplugged previously. Granted, this is not
> >> optimal because this can allow for multiple hotplug events queueing in the
> >> guest, like it was already possible before 47c8c915b162.
> >>
> >> Other possible alternatives would be:
> >>
> >> - check if the given CPU is the last online CPU in the guest before attempting
> >> to hotunplug. This can be done by checking 'cs->halted' and msr states of
> >> the core. Problem is, this approach will fail if the guest offlines/onlines
> >> a CPU while we're in the middle of the unplug request, given that we can't
> >> control whether the CPU core states will change in the kernel. This option
> >> sure makes it harder to allow a hotunplug failure to happen, but will never
> >> be enough to fully avoid it;
> >>
> >> - let the user handled it. In this case, we would advise the user to reboot the
> >> guest and the CPU will be removed during machine reset.
> >>
> > 
> > This is actually the only viable option since there's no way for the guest to
> > report an unplug request failure to QEMU. And this isn't specific to CPUs, eg.
> > Linux can also block unplug requests for DIMM devices if some LMB doesn't belong
> > to ZONE_MOVABLE. The solution for this is to tell linux to always put hot-plugged
> > memory in ZONE_MOVABLE.
> 
> Indeed, the lack of a 'not going to comply with this unplug request' logic in
> PAPR is the source of a lot of bugs and code in QEMU. As I said to you offline,
> I had an old idea of proposing a PAPR change to add this mechanic, and then
> QEMU can be aware of unplug requests that are denied by the kernel right before
> all the DRC state transition occurs. This will demand some changes, but the net
> result is more predictability in all unplug operations.
> 

Yes, this could probably help if PowerVM also consider this change as
valuable. Not sure on the impact this would cause on the linux and QEMU
code though.

> Note that the new LMB DT attribte doesn't fix the problems for DRCs. It just makes
> it harder to occur. There's nothing holding the kernel back from refusing LMB unplug
> requests in stress conditions (e.g. a stress-ng workload overloading all the RAM).
> 

Indeed. The use of 'movable_node' itself can be an issue as it can cause OOM
more easily for kernel allocations that cannot go to ZONE_MOVABLE.

> > 
> > Could something similar be done for CPUs ? For example, forbidding the off-lining
> > of CPU0 at the linux level : this would ensure all cores, except the one that has
> > CPU0, are always hot-unpluggable.
> 
> I don't think so. LMBs has a strong case for that extra DT flag because the OS can
> expand the kernel space to the hotplugged DIMMs and so on. As you said in the v1
> review, PAPR already mentions that offlining all VCPUs should just terminate the
> partition, so why do we need a flag to forbid CPU offlining?
> 

Problem is that CPU teardown doesn't even allow to offline the last
CPU. I cannot think of a way to bypass that cleanly and get the last
CPU of linux guest to call RTAS "stop-self" anyway... I don't think
we can implement this _feature_ of PAPR. So we need something else.

> We're dealing with a kernel restriction (always need an online CPU) that PAPR doesn't
> predicts in the spec. Not only that, but it's also a dynamic restriction. The OS can
> deny hotunplug of any of the available CPUs, as long as it is the last online. This
> puts QEMU into this weird spot we are now: we can't guarantee that the vcpu state
> will remain in the guest during the unplug request.
> 
> x86 for instance forbids offlining CPU0. I'm not sure if it's an ACPI restriction or
> kernel design choice. But this would fix the issue we're having here. Of all the
> available choices for a fix, patching the pseries kernel to forbid offlining CPU0
> looks a very sane one in my estimation.
> 

AFAICT the limitation comes from ACPI. POWER doesn't have such a restriction,
on powernv and pseries, so we would introduce it artificially as a workaround
to avoid the issue this patch is trying to address. I understand from the above
that adding a new flag in PAPR for QEMU to control this behavior has poor chances
of success. So this could be hardcoded to CPU0.

> 
> > 
> >> None of the alternatives are clear winnners, so this patch goes for the approach
> >> makes the IRQ queue of the guest prone to multiple hotunplug requests for the
> >> same CPU, but at least the user can successfully hotunplug the CPU after a failed
> >> attempt, without the need of guest reboot.
> >>
> >> Reported-by: Xujun Ma <xuma@redhat.com>
> >> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1911414
> >> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> >> ---
> >>   hw/ppc/spapr.c | 11 ++++++++++-
> >>   1 file changed, 10 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >> index e9e3816cd3..e2f12ce413 100644
> >> --- a/hw/ppc/spapr.c
> >> +++ b/hw/ppc/spapr.c
> >> @@ -3737,8 +3737,17 @@ void spapr_core_unplug_request(HotplugHandler *hotplug_dev, DeviceState *dev,
> >>   
> >>       if (!spapr_drc_unplug_requested(drc)) {
> >>           spapr_drc_detach(drc);
> >> -        spapr_hotplug_req_remove_by_index(drc);
> >>       }
> >> +
> >> +    /*
> >> +     * spapr_hotplug_req_remove_by_index is left unguarded, out of the
> >> +     * "!spapr_drc_unplug_requested" check, to allow for multiple IRQ
> >> +     * pulses removing the same CPU core. Otherwise, in an failed hotunplug
> >> +     * attempt (e.g. the kernel will refuse to remove the last online CPU
> >> +     * core), we will never attempt it again because unplug_requested will
> >> +     * still be 'true' in that case.
> >> +     */
> >> +    spapr_hotplug_req_remove_by_index(drc);
> > 
> > This not only fire the IRQ again, it also enqueues a new event... have
> > you tried hammering QEMU with CPU hot-plug/unplug requests. I seem to
> > remember that the troubles fixed by 47c8c915b162 had more to do with
> > the DRC state machine than the hot-plug event itself, but posting the
> > same event several times during a regular hot-unplug sequence could
> > maybe cause subtle bugs as well.
> 
> I tried hammering the guest with lots of IRQ events. Aside from the dmesg
> flood in the guest kernel everything went fine. No guest breakage of any
> sort.
> 
> It is also worth noticing that the unplug_request code is protected by
> BQL. It is not possible to spam the guest queue out of control.
> 

True.

> > 
> > Honestly, this is still a band-aid : it doesn't fix anything, it just
> > gives an alternative solution to reboot when someone has done something
> > silly. I'd rather not loosen our sanity checks for such a corner case.
> 
> It doesn't fix indeed. But documenting that the user shouldn't offline CPU0
> in the kernel because QEMU can't handle unplug properly isn't fixing anything
> either.
> 

Indeed but users know that they're walking on quicksand at least.

> Thing is, our spapr code is too optimistic, and not justifiable in my opinion,

PAPR is "too much something" in a lot of aspects and this certainly
had an impact on the code. ;-)

> in all unplug code for all devices. It assumes that everything will go as
> intended, but in reality we're hoping it will work. Otherwise, we're going to
> ask the user to reboot the guest so the unplug can happen in CAS.
> 
> This change I'm making makes the CPU unplug code way more pessimistic, and it
> can open a way for device_del abuse via an IRQ spam. But in the end it will
> alleviate he situation not only of this particular bug, but all "insert a random
> reason why the kernel denied a CPU hotunplug, and now I can't unplug the
> CPU anymore" bugs in the future as well. It gives the user the choice to try
> again if something didn't go as planned. As far as I'm concerned, we should
> fire up spapr_hotplug_req_remove_by_index() from all unplug_request() calls
> of all devices regardless of unplug_request being true or not.
> 

I agree that if we go for this approach, we should fix all our unplug_request()
handlers to be consistent. BTW, what do other archs do in similar situation ?


> And if we're concerned about abuse of device_del spam, we can document that
> device_del can't be spammed in a 'unplug until it works' fashion. I'd rather
> advice the user to use device_del properly than to tell the user "if you offline
> CPUs and enter this situation, you'll need to reboot the guest".
> 

It all boils down to what "use device_del properly" does mean exactly.

> > 
> > On the other side, the at-least-one-cpu thing is a linux limitation.
> > It seems fair that linux should provide a way to mitigate the effects.
> > Like suggested above, this could be the ability to elect an individual
> > vCPU to be always on-line. Since QEMU refuses the hotplug of the boot
> > core, QEMU could maybe tell the guest to elect CPU0 through some
> > flag in the DT, like we've done recently for LMBs.
> 
> 
> As I said above, I think that patching the kernel to avoid CPU0 offlining looks
> like a viable idea.
> 

My only concern is that I don't have a clear idea on CPU offlining use
cases that could break with such a change.

> 
> I still believe that this patch should go through regardless though. I'm convinced
> that we can't be optimistic about the CPU unplug process as a whole, and advising
> the user to not spam the guest IRQ is a better option than documenting that we
> need to reset the machine to handle the unplug, because the user did something
> that the kernel fully allows, but we can't deal with and won't even allow further
> attempts at it. I mean, what are we? Windows?
> 

David, any insight here ?

> 
> 
> Thanks,
> 
> 
> DHB
> 
> 
> 
> 
> > 
> >>   }
> >>   
> >>   int spapr_core_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
> >
diff mbox series

Patch

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index e9e3816cd3..e2f12ce413 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -3737,8 +3737,17 @@  void spapr_core_unplug_request(HotplugHandler *hotplug_dev, DeviceState *dev,
 
     if (!spapr_drc_unplug_requested(drc)) {
         spapr_drc_detach(drc);
-        spapr_hotplug_req_remove_by_index(drc);
     }
+
+    /*
+     * spapr_hotplug_req_remove_by_index is left unguarded, out of the
+     * "!spapr_drc_unplug_requested" check, to allow for multiple IRQ
+     * pulses removing the same CPU core. Otherwise, in an failed hotunplug
+     * attempt (e.g. the kernel will refuse to remove the last online CPU
+     * core), we will never attempt it again because unplug_requested will
+     * still be 'true' in that case.
+     */
+    spapr_hotplug_req_remove_by_index(drc);
 }
 
 int spapr_core_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,