Message ID | 20200804032937.7235-1-mdroth@linux.vnet.ibm.com (mailing list archive)
---|---
State | Superseded
Series | powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
Context | Check | Description
---|---|---
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch powerpc/merge (14fd53d1e5ee7350564cac75e336f8c0dea13bc9) |
snowpatch_ozlabs/build-ppc64le | warning | Upstream build failed, couldn't test patch |
snowpatch_ozlabs/build-ppc64be | warning | Upstream build failed, couldn't test patch |
snowpatch_ozlabs/build-ppc64e | warning | Upstream build failed, couldn't test patch |
snowpatch_ozlabs/build-pmac32 | warning | Upstream build failed, couldn't test patch |
snowpatch_ozlabs/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 15 lines checked |
snowpatch_ozlabs/needsstable | warning | Please consider tagging this patch for stable! |
Hi Mike,

There is a bit of history to this code, but not in a good way :)

Michael Roth <mdroth@linux.vnet.ibm.com> writes:
> For a power9 KVM guest with XIVE enabled, running a test loop
> where we hotplug 384 vcpus and then unplug them, the following traces
> can be seen (generally within a few loops) either from the unplugged
> vcpu:
>
> [ 1767.353447] cpu 65 (hwid 65) Ready to die...
> [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
> [ 1767.952311] list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
...
>
> At that point the worker thread assumes the unplugged CPU is in some
> unknown/dead state and proceeds with the cleanup, causing the race with
> the XIVE cleanup code executed by the unplugged CPU.
>
> Fix this by inserting an msleep() after each RTAS call to avoid

We previously had an msleep(), but it was removed:

  b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")

> pseries_cpu_die() returning prematurely, and double the number of
> attempts so we wait at least a total of 5 seconds. While this isn't an
> ideal solution, it is similar to how we dealt with a similar issue for
> cede_offline mode in the past (940ce422a3).

Thiago tried to fix this previously but there was a bit of discussion
that didn't quite resolve:

  https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauerman@linux.ibm.com/

Spinning forever seems like a bad idea, but as has been demonstrated at
least twice now, continuing when we don't know the state of the other
CPU can lead to straight up crashes.

So I think I'm persuaded that it's preferable to have the kernel stuck
spinning rather than oopsing.

I'm 50/50 on whether we should have a cond_resched() in the loop. My
first instinct is no, if we're stuck here for 20s a stack trace would be
good. But then we will probably hit that on some big and/or heavily
loaded machine.

So possibly we should call cond_resched() but have some custom logic in
the loop to print a warning if we are stuck for more than some
sufficiently long amount of time.

> Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588

This is not public.

I tend to trim Bugzilla links from the change log, because I'm not
convinced they will last forever, but it is good to have them in the
mail archive.

cheers

> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Cedric Le Goater <clg@kaod.org>
> Cc: Greg Kurz <groug@kaod.org>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
> ---
>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index c6e0d8abf75e..3cb172758052 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -111,13 +111,12 @@ static void pseries_cpu_die(unsigned int cpu)
>  	int cpu_status = 1;
>  	unsigned int pcpu = get_hard_smp_processor_id(cpu);
>
> -	for (tries = 0; tries < 25; tries++) {
> +	for (tries = 0; tries < 50; tries++) {
>  		cpu_status = smp_query_cpu_stopped(pcpu);
>  		if (cpu_status == QCSS_STOPPED ||
>  		    cpu_status == QCSS_HARDWARE_ERROR)
>  			break;
> -		cpu_relax();
> -
> +		msleep(100);
>  	}
>
>  	if (cpu_status != 0) {
> --
> 2.17.1
On Tue, 04 Aug 2020 23:35:10 +1000
Michael Ellerman <mpe@ellerman.id.au> wrote:

> Hi Mike,
>
> There is a bit of history to this code, but not in a good way :)
>
> Michael Roth <mdroth@linux.vnet.ibm.com> writes:
> > For a power9 KVM guest with XIVE enabled, running a test loop
> > where we hotplug 384 vcpus and then unplug them, the following traces
> > can be seen (generally within a few loops) either from the unplugged
> > vcpu:
> >
> > [ 1767.353447] cpu 65 (hwid 65) Ready to die...
> > [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
> > [ 1767.952311] list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
> ...
> >
> > At that point the worker thread assumes the unplugged CPU is in some
> > unknown/dead state and proceeds with the cleanup, causing the race with
> > the XIVE cleanup code executed by the unplugged CPU.
> >
> > Fix this by inserting an msleep() after each RTAS call to avoid
>
> We previously had an msleep(), but it was removed:
>
>   b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
>

Ah, I hadn't seen that one...

> > pseries_cpu_die() returning prematurely, and double the number of
> > attempts so we wait at least a total of 5 seconds. While this isn't an
> > ideal solution, it is similar to how we dealt with a similar issue for
> > cede_offline mode in the past (940ce422a3).
>
> Thiago tried to fix this previously but there was a bit of discussion
> that didn't quite resolve:
>
>   https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauerman@linux.ibm.com/
>

Yeah it appears that the motivation at the time was to make the
"Querying DEAD?" messages disappear and to avoid potentially concurrent
calls to rtas-stop-self, which is prohibited by PAPR... not fixing
actual crashes.

> Spinning forever seems like a bad idea, but as has been demonstrated at
> least twice now, continuing when we don't know the state of the other
> CPU can lead to straight up crashes.
>
> So I think I'm persuaded that it's preferable to have the kernel stuck
> spinning rather than oopsing.
>

+1

> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> first instinct is no, if we're stuck here for 20s a stack trace would be
> good. But then we will probably hit that on some big and/or heavily
> loaded machine.
>
> So possibly we should call cond_resched() but have some custom logic in
> the loop to print a warning if we are stuck for more than some
> sufficiently long amount of time.
>

How long should that be ?

> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
>
> This is not public.
>

I'll have a look at changing that.

> I tend to trim Bugzilla links from the change log, because I'm not
> convinced they will last forever, but it is good to have them in the
> mail archive.
>
> cheers
>

Cheers,

--
Greg

> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: Cedric Le Goater <clg@kaod.org>
> > Cc: Greg Kurz <groug@kaod.org>
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/platforms/pseries/hotplug-cpu.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > index c6e0d8abf75e..3cb172758052 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > @@ -111,13 +111,12 @@ static void pseries_cpu_die(unsigned int cpu)
> >  	int cpu_status = 1;
> >  	unsigned int pcpu = get_hard_smp_processor_id(cpu);
> >
> > -	for (tries = 0; tries < 25; tries++) {
> > +	for (tries = 0; tries < 50; tries++) {
> >  		cpu_status = smp_query_cpu_stopped(pcpu);
> >  		if (cpu_status == QCSS_STOPPED ||
> >  		    cpu_status == QCSS_HARDWARE_ERROR)
> >  			break;
> > -		cpu_relax();
> > -
> > +		msleep(100);
> >  	}
> >
> >  	if (cpu_status != 0) {
> > --
> > 2.17.1
Greg Kurz <groug@kaod.org> writes:
> On Tue, 04 Aug 2020 23:35:10 +1000
> Michael Ellerman <mpe@ellerman.id.au> wrote:
>> There is a bit of history to this code, but not in a good way :)
>>
>> Michael Roth <mdroth@linux.vnet.ibm.com> writes:
>> > For a power9 KVM guest with XIVE enabled, running a test loop
>> > where we hotplug 384 vcpus and then unplug them, the following traces
>> > can be seen (generally within a few loops) either from the unplugged
>> > vcpu:
>> >
>> > [ 1767.353447] cpu 65 (hwid 65) Ready to die...
>> > [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
>> > [ 1767.952311] list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
>> ...
>> >
>> > At that point the worker thread assumes the unplugged CPU is in some
>> > unknown/dead state and proceeds with the cleanup, causing the race with
>> > the XIVE cleanup code executed by the unplugged CPU.
>> >
>> > Fix this by inserting an msleep() after each RTAS call to avoid
>>
>> We previously had an msleep(), but it was removed:
>>
>>   b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
>
> Ah, I hadn't seen that one...
>
>> > pseries_cpu_die() returning prematurely, and double the number of
>> > attempts so we wait at least a total of 5 seconds. While this isn't an
>> > ideal solution, it is similar to how we dealt with a similar issue for
>> > cede_offline mode in the past (940ce422a3).
>>
>> Thiago tried to fix this previously but there was a bit of discussion
>> that didn't quite resolve:
>>
>>   https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauerman@linux.ibm.com/
>
> Yeah it appears that the motivation at the time was to make the
> "Querying DEAD?" messages disappear and to avoid potentially concurrent
> calls to rtas-stop-self, which is prohibited by PAPR... not fixing
> actual crashes.

I'm pretty sure at one point we were triggering crashes *in* RTAS via
this path, I think that got resolved.

>> Spinning forever seems like a bad idea, but as has been demonstrated at
>> least twice now, continuing when we don't know the state of the other
>> CPU can lead to straight up crashes.
>>
>> So I think I'm persuaded that it's preferable to have the kernel stuck
>> spinning rather than oopsing.
>>
>
> +1
>
>> I'm 50/50 on whether we should have a cond_resched() in the loop. My
>> first instinct is no, if we're stuck here for 20s a stack trace would be
>> good. But then we will probably hit that on some big and/or heavily
>> loaded machine.
>>
>> So possibly we should call cond_resched() but have some custom logic in
>> the loop to print a warning if we are stuck for more than some
>> sufficiently long amount of time.
>
> How long should that be ?

Yeah good question.

I guess step one would be seeing how long it can take on the 384 vcpu
machine. And we can probably test on some other big machines.

Hopefully Nathan can give us some idea of how long he's seen it take on
large systems? I know he was concerned about the 20s timeout of the
softlockup detector.

Maybe a minute or two?

>> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
>> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
>>
>> This is not public.
>
> I'll have a look at changing that.

Thanks.

cheers
Michael Ellerman <mpe@ellerman.id.au> writes:
> Greg Kurz <groug@kaod.org> writes:
>> On Tue, 04 Aug 2020 23:35:10 +1000
>> Michael Ellerman <mpe@ellerman.id.au> wrote:
>>> There is a bit of history to this code, but not in a good way :)
>>>
>>> Michael Roth <mdroth@linux.vnet.ibm.com> writes:
>>> > For a power9 KVM guest with XIVE enabled, running a test loop
>>> > where we hotplug 384 vcpus and then unplug them, the following traces
>>> > can be seen (generally within a few loops) either from the unplugged
>>> > vcpu:
>>> >
>>> > [ 1767.353447] cpu 65 (hwid 65) Ready to die...
>>> > [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
>>> > [ 1767.952311] list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
>>> ...
>>> >
>>> > At that point the worker thread assumes the unplugged CPU is in some
>>> > unknown/dead state and proceeds with the cleanup, causing the race with
>>> > the XIVE cleanup code executed by the unplugged CPU.
>>> >
>>> > Fix this by inserting an msleep() after each RTAS call to avoid
>>>
>>> We previously had an msleep(), but it was removed:
>>>
>>>   b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
>>
>> Ah, I hadn't seen that one...
>>
>>> > pseries_cpu_die() returning prematurely, and double the number of
>>> > attempts so we wait at least a total of 5 seconds. While this isn't an
>>> > ideal solution, it is similar to how we dealt with a similar issue for
>>> > cede_offline mode in the past (940ce422a3).
>>>
>>> Thiago tried to fix this previously but there was a bit of discussion
>>> that didn't quite resolve:
>>>
>>>   https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauerman@linux.ibm.com/
>>
>> Yeah it appears that the motivation at the time was to make the
>> "Querying DEAD?" messages disappear and to avoid potentially concurrent
>> calls to rtas-stop-self, which is prohibited by PAPR... not fixing
>> actual crashes.
>
> I'm pretty sure at one point we were triggering crashes *in* RTAS via
> this path, I think that got resolved.

Yes, pHyp's RTAS now tolerates concurrent calls to stop-self.

The original bug that was reported when I worked on this ended in an
RTAS crash because of this timeout. The crash was fixed then.

--
Thiago Jung Bauermann
IBM Linux Technology Center
Quoting Michael Ellerman (2020-08-04 22:07:08)
> Greg Kurz <groug@kaod.org> writes:
> > On Tue, 04 Aug 2020 23:35:10 +1000
> > Michael Ellerman <mpe@ellerman.id.au> wrote:
> >> There is a bit of history to this code, but not in a good way :)
> >>
> >> Michael Roth <mdroth@linux.vnet.ibm.com> writes:
> >> > For a power9 KVM guest with XIVE enabled, running a test loop
> >> > where we hotplug 384 vcpus and then unplug them, the following traces
> >> > can be seen (generally within a few loops) either from the unplugged
> >> > vcpu:
> >> >
> >> > [ 1767.353447] cpu 65 (hwid 65) Ready to die...
> >> > [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
> >> > [ 1767.952311] list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
> >> ...
> >> >
> >> > At that point the worker thread assumes the unplugged CPU is in some
> >> > unknown/dead state and proceeds with the cleanup, causing the race with
> >> > the XIVE cleanup code executed by the unplugged CPU.
> >> >
> >> > Fix this by inserting an msleep() after each RTAS call to avoid
> >>
> >> We previously had an msleep(), but it was removed:
> >>
> >>   b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
> >
> > Ah, I hadn't seen that one...
> >
> >> > pseries_cpu_die() returning prematurely, and double the number of
> >> > attempts so we wait at least a total of 5 seconds. While this isn't an
> >> > ideal solution, it is similar to how we dealt with a similar issue for
> >> > cede_offline mode in the past (940ce422a3).
> >>
> >> Thiago tried to fix this previously but there was a bit of discussion
> >> that didn't quite resolve:
> >>
> >>   https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauerman@linux.ibm.com/
> >
> > Yeah it appears that the motivation at the time was to make the
> > "Querying DEAD?" messages disappear and to avoid potentially concurrent
> > calls to rtas-stop-self, which is prohibited by PAPR... not fixing
> > actual crashes.
>
> I'm pretty sure at one point we were triggering crashes *in* RTAS via
> this path, I think that got resolved.
>
> >> Spinning forever seems like a bad idea, but as has been demonstrated at
> >> least twice now, continuing when we don't know the state of the other
> >> CPU can lead to straight up crashes.
> >>
> >> So I think I'm persuaded that it's preferable to have the kernel stuck
> >> spinning rather than oopsing.
> >>
> >
> > +1
> >
> >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> >> first instinct is no, if we're stuck here for 20s a stack trace would be
> >> good. But then we will probably hit that on some big and/or heavily
> >> loaded machine.
> >>
> >> So possibly we should call cond_resched() but have some custom logic in
> >> the loop to print a warning if we are stuck for more than some
> >> sufficiently long amount of time.
> >
> > How long should that be ?
>
> Yeah good question.
>
> I guess step one would be seeing how long it can take on the 384 vcpu
> machine. And we can probably test on some other big machines.
>
> Hopefully Nathan can give us some idea of how long he's seen it take on
> large systems? I know he was concerned about the 20s timeout of the
> softlockup detector.
>
> Maybe a minute or two?

Hmm, so I took a stab at this where I called cond_resched() after
every 5 seconds of polling and printed a warning at the same time (FWIW
that doesn't seem to trigger any warnings on a loaded 96-core mihawk
system using KVM running the 384vcpu unplug loop)

But it sounds like that's not quite what you had in mind. How frequently
do you think we should call cond_resched()? Maybe after 25 iterations
of polling smp_query_cpu_stopped() to keep original behavior somewhat
similar?

I'll let the current patch run on the mihawk system overnight in the
meantime so we at least have that data point, but would be good to
know what things look like on a large pHyp machine.

Thanks!

> >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> >>
> >> This is not public.
> >
> > I'll have a look at changing that.
>
> Thanks.
>
> cheers
Quoting Michael Roth (2020-08-04 23:37:32)
> Quoting Michael Ellerman (2020-08-04 22:07:08)
> > Greg Kurz <groug@kaod.org> writes:
> > > On Tue, 04 Aug 2020 23:35:10 +1000
> > > Michael Ellerman <mpe@ellerman.id.au> wrote:
> > >> Spinning forever seems like a bad idea, but as has been demonstrated at
> > >> least twice now, continuing when we don't know the state of the other
> > >> CPU can lead to straight up crashes.
> > >>
> > >> So I think I'm persuaded that it's preferable to have the kernel stuck
> > >> spinning rather than oopsing.
> > >>
> > >
> > > +1
> > >
> > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> > >> first instinct is no, if we're stuck here for 20s a stack trace would be
> > >> good. But then we will probably hit that on some big and/or heavily
> > >> loaded machine.
> > >>
> > >> So possibly we should call cond_resched() but have some custom logic in
> > >> the loop to print a warning if we are stuck for more than some
> > >> sufficiently long amount of time.
> > >
> > > How long should that be ?
> >
> > Yeah good question.
> >
> > I guess step one would be seeing how long it can take on the 384 vcpu
> > machine. And we can probably test on some other big machines.
> >
> > Hopefully Nathan can give us some idea of how long he's seen it take on
> > large systems? I know he was concerned about the 20s timeout of the
> > softlockup detector.
> >
> > Maybe a minute or two?
>
> Hmm, so I took a stab at this where I called cond_resched() after
> every 5 seconds of polling and printed a warning at the same time (FWIW
> that doesn't seem to trigger any warnings on a loaded 96-core mihawk
> system using KVM running the 384vcpu unplug loop)
>
> But it sounds like that's not quite what you had in mind. How frequently
> do you think we should call cond_resched()? Maybe after 25 iterations
> of polling smp_query_cpu_stopped() to keep original behavior somewhat
> similar?
>
> I'll let the current patch run on the mihawk system overnight in the
> meantime so we at least have that data point, but would be good to
> know what things look like on a large pHyp machine.

At one point I did manage to get the system in a state where unplug
operations were taking 1-2s, but still not enough to trigger any
5s warning, and I wasn't able to reproduce that in subsequent runs.

I also tried reworking the patch so that we print a warning and
cond_resched() after 200 ms to make sure that path gets executed, but
only managed to trigger the warning twice after a few hours.

So, if we print a warning after a couple minutes, that seems pretty
conservative as far as avoiding spurious warnings. And if we
cond_resched() after 25 loops of polling (~0.1 ms in the cases
that caused the original crash), that would avoid most of the
default RCU/lockup warnings.

But having a second timeout to trigger the cond_resched() after some
set interval like 2s seems more deterministic since we're less
susceptible to longer delays due to things like the RTAS calls
contending for QEMU's global mutex in the KVM case.

> Thanks!
>
> > >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> > >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> > >>
> > >> This is not public.
> > >
> > > I'll have a look at changing that.
> >
> > Thanks.
> >
> > cheers
Quoting Michael Roth (2020-08-05 17:29:28)
> Quoting Michael Roth (2020-08-04 23:37:32)
> > Quoting Michael Ellerman (2020-08-04 22:07:08)
> > > Greg Kurz <groug@kaod.org> writes:
> > > > On Tue, 04 Aug 2020 23:35:10 +1000
> > > > Michael Ellerman <mpe@ellerman.id.au> wrote:
> > > >> Spinning forever seems like a bad idea, but as has been demonstrated at
> > > >> least twice now, continuing when we don't know the state of the other
> > > >> CPU can lead to straight up crashes.
> > > >>
> > > >> So I think I'm persuaded that it's preferable to have the kernel stuck
> > > >> spinning rather than oopsing.
> > > >>
> > > >
> > > > +1
> > > >
> > > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> > > >> first instinct is no, if we're stuck here for 20s a stack trace would be
> > > >> good. But then we will probably hit that on some big and/or heavily
> > > >> loaded machine.
> > > >>
> > > >> So possibly we should call cond_resched() but have some custom logic in
> > > >> the loop to print a warning if we are stuck for more than some
> > > >> sufficiently long amount of time.
> > > >
> > > > How long should that be ?
> > >
> > > Yeah good question.
> > >
> > > I guess step one would be seeing how long it can take on the 384 vcpu
> > > machine. And we can probably test on some other big machines.
> > >
> > > Hopefully Nathan can give us some idea of how long he's seen it take on
> > > large systems? I know he was concerned about the 20s timeout of the
> > > softlockup detector.
> > >
> > > Maybe a minute or two?
> >
> > Hmm, so I took a stab at this where I called cond_resched() after
> > every 5 seconds of polling and printed a warning at the same time (FWIW
> > that doesn't seem to trigger any warnings on a loaded 96-core mihawk
> > system using KVM running the 384vcpu unplug loop)
> >
> > But it sounds like that's not quite what you had in mind. How frequently
> > do you think we should call cond_resched()? Maybe after 25 iterations
> > of polling smp_query_cpu_stopped() to keep original behavior somewhat
> > similar?
> >
> > I'll let the current patch run on the mihawk system overnight in the
> > meantime so we at least have that data point, but would be good to
> > know what things look like on a large pHyp machine.
>
> At one point I did manage to get the system in a state where unplug
> operations were taking 1-2s, but still not enough to trigger any
> 5s warning, and I wasn't able to reproduce that in subsequent runs.
>
> I also tried reworking the patch so that we print a warning and
> cond_resched() after 200 ms to make sure that path gets executed, but
> only managed to trigger the warning twice after a few hours.
>
> So, if we print a warning after a couple minutes, that seems pretty
> conservative as far as avoiding spurious warnings. And if we
> cond_resched() after 25 loops of polling (~0.1 ms in the cases

~0.1 seconds I mean

> that caused the original crash), that would avoid most of the
> default RCU/lockup warnings.
>
> But having a second timeout to trigger the cond_resched() after some
> set interval like 2s seems more deterministic since we're less
> susceptible to longer delays due to things like the RTAS calls
> contending for QEMU's global mutex in the KVM case.
>
> > Thanks!
> >
> > > >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> > > >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> > > >>
> > > >> This is not public.
> > > >
> > > > I'll have a look at changing that.
> > >
> > > Thanks.
> > >
> > > cheers
Michael Roth <mdroth@linux.vnet.ibm.com> writes:
> Quoting Michael Roth (2020-08-04 23:37:32)
>> Quoting Michael Ellerman (2020-08-04 22:07:08)
>> > Greg Kurz <groug@kaod.org> writes:
>> > > On Tue, 04 Aug 2020 23:35:10 +1000
>> > > Michael Ellerman <mpe@ellerman.id.au> wrote:
>> > >> Spinning forever seems like a bad idea, but as has been demonstrated at
>> > >> least twice now, continuing when we don't know the state of the other
>> > >> CPU can lead to straight up crashes.
>> > >>
>> > >> So I think I'm persuaded that it's preferable to have the kernel stuck
>> > >> spinning rather than oopsing.
>> > >>
>> > >
>> > > +1
>> > >
>> > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
>> > >> first instinct is no, if we're stuck here for 20s a stack trace would be
>> > >> good. But then we will probably hit that on some big and/or heavily
>> > >> loaded machine.
>> > >>
>> > >> So possibly we should call cond_resched() but have some custom logic in
>> > >> the loop to print a warning if we are stuck for more than some
>> > >> sufficiently long amount of time.
>> > >
>> > > How long should that be ?
>> >
>> > Yeah good question.
>> >
>> > I guess step one would be seeing how long it can take on the 384 vcpu
>> > machine. And we can probably test on some other big machines.
>> >
>> > Hopefully Nathan can give us some idea of how long he's seen it take on
>> > large systems? I know he was concerned about the 20s timeout of the
>> > softlockup detector.
>> >
>> > Maybe a minute or two?
>>
>> Hmm, so I took a stab at this where I called cond_resched() after
>> every 5 seconds of polling and printed a warning at the same time (FWIW
>> that doesn't seem to trigger any warnings on a loaded 96-core mihawk
>> system using KVM running the 384vcpu unplug loop)
>>
>> But it sounds like that's not quite what you had in mind. How frequently
>> do you think we should call cond_resched()? Maybe after 25 iterations
>> of polling smp_query_cpu_stopped() to keep original behavior somewhat
>> similar?

I think we can just call it on every iteration, it should be cheap
compared to an RTAS call.

The concern was just by doing that you effectively prevent the
softlockup detector from reporting you as stuck in that path. Hence the
desire to manually print a warning after ~60s or something.

cheers
Hi everyone,

Michael Ellerman <mpe@ellerman.id.au> writes:
> Greg Kurz <groug@kaod.org> writes:
>> On Tue, 04 Aug 2020 23:35:10 +1000
>> Michael Ellerman <mpe@ellerman.id.au> wrote:
>>> Spinning forever seems like a bad idea, but as has been demonstrated at
>>> least twice now, continuing when we don't know the state of the other
>>> CPU can lead to straight up crashes.
>>>
>>> So I think I'm persuaded that it's preferable to have the kernel stuck
>>> spinning rather than oopsing.
>>>
>>
>> +1
>>
>>> I'm 50/50 on whether we should have a cond_resched() in the loop. My
>>> first instinct is no, if we're stuck here for 20s a stack trace would be
>>> good. But then we will probably hit that on some big and/or heavily
>>> loaded machine.
>>>
>>> So possibly we should call cond_resched() but have some custom logic in
>>> the loop to print a warning if we are stuck for more than some
>>> sufficiently long amount of time.
>>
>> How long should that be ?
>
> Yeah good question.
>
> I guess step one would be seeing how long it can take on the 384 vcpu
> machine. And we can probably test on some other big machines.
>
> Hopefully Nathan can give us some idea of how long he's seen it take on
> large systems? I know he was concerned about the 20s timeout of the
> softlockup detector.

Maybe I'm not quite clear what this is referring to, but I don't think
stop-self/query-stopped-state latency increases with processor count, at
least not on PowerVM. And IIRC I was concerned with the earlier patch's
potential for causing the softlockup watchdog to rightly complain by
polling the stopped state without ever scheduling away.

The fact that smp_query_cpu_stopped() kind of collapses the two distinct
results from the query-cpu-stopped-state RTAS call into one return value
may make it harder than necessary to reason about the questions around
cond_resched() and whether to warn.

Sorry to pull this stunt but I have had some code sitting in a neglected
branch that I think gets the logic around this right.

What we should have is a simple C wrapper for the RTAS call that
reflects the architected inputs and outputs:

================================================================
(-- rtas.c --)

/**
 * rtas_query_cpu_stopped_state() - Call RTAS query-cpu-stopped-state.
 * @hwcpu: Identifies the processor thread to be queried.
 * @status: Pointer to status, valid only on success.
 *
 * Determine whether the given processor thread is in the stopped
 * state. If successful and @status is non-NULL, the thread's status
 * is stored to @status.
 *
 * Return:
 * * 0  - Success
 * * -1 - Hardware error
 * * -2 - Busy, try again later
 */
int rtas_query_cpu_stopped_state(unsigned int hwcpu, unsigned int *status)
{
	unsigned int cpu_status;
	int token;
	int fwrc;

	token = rtas_token("query-cpu-stopped-state");

	fwrc = rtas_call(token, 1, 2, &cpu_status, hwcpu);
	if (fwrc != 0)
		goto out;

	if (status != NULL)
		*status = cpu_status;
out:
	return fwrc;
}
================================================================

And then a utility function that waits for the remote thread to enter
stopped state, with higher-level logic for rescheduling and warning. The
fact that smp_query_cpu_stopped() currently does not handle a -2/busy
status is a bug, fixed below by using rtas_busy_delay(). Note the
justification for the explicit cond_resched() in the outer loop:

================================================================
(-- rtas.h --)

/* query-cpu-stopped-state CPU_status */
#define RTAS_QCSS_STATUS_STOPPED     0
#define RTAS_QCSS_STATUS_IN_PROGRESS 1
#define RTAS_QCSS_STATUS_NOT_STOPPED 2

(-- pseries/hotplug-cpu.c --)

/**
 * wait_for_cpu_stopped() - Wait for a cpu to enter RTAS stopped state.
 */
static void wait_for_cpu_stopped(unsigned int cpu)
{
	unsigned int status;
	unsigned int hwcpu;

	hwcpu = get_hard_smp_processor_id(cpu);

	do {
		int fwrc;

		/*
		 * rtas_busy_delay() will yield only if RTAS returns a
		 * busy status. Since query-cpu-stopped-state can
		 * yield RTAS_QCSS_STATUS_IN_PROGRESS or
		 * RTAS_QCSS_STATUS_NOT_STOPPED for an unbounded
		 * period before the target thread stops, we must take
		 * care to explicitly reschedule while polling.
		 */
		cond_resched();

		do {
			fwrc = rtas_query_cpu_stopped_state(hwcpu, &status);
		} while (rtas_busy_delay(fwrc));

		if (fwrc == 0)
			continue;

		pr_err_ratelimited("query-cpu-stopped-state for "
				   "thread 0x%x returned %d\n", hwcpu, fwrc);
		goto out;
	} while (status == RTAS_QCSS_STATUS_NOT_STOPPED ||
		 status == RTAS_QCSS_STATUS_IN_PROGRESS);

	if (status != RTAS_QCSS_STATUS_STOPPED) {
		pr_err_ratelimited("query-cpu-stopped-state yielded unknown "
				   "status %d for thread 0x%x\n",
				   status, hwcpu);
	}
out:
	return;
}

[...]

static void pseries_cpu_die(unsigned int cpu)
{
	wait_for_cpu_stopped(cpu);
	paca_ptrs[cpu]->cpu_start = 0;
}
================================================================

wait_for_cpu_stopped() should be able to accommodate a time-based
warning if necessary, but speaking as a likely recipient of any bug
reports that would arise here, I'm not convinced of the need and I don't
know what a good value would be. It's relatively easy to sample the
stack of a task that's apparently failing to make progress, plus I
probably would use 'perf probe' or similar to report the inputs and
outputs for the RTAS call.

I'm happy to make this a proper submission after I can clean it up and
retest it, or Michael R. is welcome to appropriate it, assuming it's
acceptable.
Quoting Nathan Lynch (2020-08-07 02:05:09) > Hi everyone, > > Michael Ellerman <mpe@ellerman.id.au> writes: > > Greg Kurz <groug@kaod.org> writes: > >> On Tue, 04 Aug 2020 23:35:10 +1000 > >> Michael Ellerman <mpe@ellerman.id.au> wrote: > >>> Spinning forever seems like a bad idea, but as has been demonstrated at > >>> least twice now, continuing when we don't know the state of the other > >>> CPU can lead to straight up crashes. > >>> > >>> So I think I'm persuaded that it's preferable to have the kernel stuck > >>> spinning rather than oopsing. > >>> > >> > >> +1 > >> > >>> I'm 50/50 on whether we should have a cond_resched() in the loop. My > >>> first instinct is no, if we're stuck here for 20s a stack trace would be > >>> good. But then we will probably hit that on some big and/or heavily > >>> loaded machine. > >>> > >>> So possibly we should call cond_resched() but have some custom logic in > >>> the loop to print a warning if we are stuck for more than some > >>> sufficiently long amount of time. > >> > >> How long should that be ? > > > > Yeah good question. > > > > I guess step one would be seeing how long it can take on the 384 vcpu > > machine. And we can probably test on some other big machines. > > > > Hopefully Nathan can give us some idea of how long he's seen it take on > > large systems? I know he was concerned about the 20s timeout of the > > softlockup detector. > > Maybe I'm not quite clear what this is referring to, but I don't think > stop-self/query-stopped-state latency increases with processor count, at > least not on PowerVM. And IIRC I was concerned with the earlier patch's > potential for causing the softlockup watchdog to rightly complain by > polling the stopped state without ever scheduling away. 
> > The fact that smp_query_cpu_stopped() kind of collapses the two distinct > results from the query-cpu-stopped-state RTAS call into one return value > may make it harder than necessary to reason about the questions around > cond_resched() and whether to warn. > > Sorry to pull this stunt but I have had some code sitting in a neglected > branch that I think gets the logic around this right. > > What we should have is a simple C wrapper for the RTAS call that reflects the > architected inputs and outputs: > > ================================================================ > (-- rtas.c --) > > /** > * rtas_query_cpu_stopped_state() - Call RTAS query-cpu-stopped-state. > * @hwcpu: Identifies the processor thread to be queried. > * @status: Pointer to status, valid only on success. > * > * Determine whether the given processor thread is in the stopped > * state. If successful and @status is non-NULL, the thread's status > * is stored to @status. > * > * Return: > * * 0 - Success > * * -1 - Hardware error > * * -2 - Busy, try again later > */ > int rtas_query_cpu_stopped_state(unsigned int hwcpu, unsigned int *status) > { > unsigned int cpu_status; > int token; > int fwrc; > > token = rtas_token("query-cpu-stopped-state"); > > fwrc = rtas_call(token, 1, 2, &cpu_status, hwcpu); > if (fwrc != 0) > goto out; > > if (status != NULL) > *status = cpu_status; > out: > return fwrc; > } > ================================================================ > > > And then a utility function that waits for the remote thread to enter > stopped state, with higher-level logic for rescheduling and warning. The > fact that smp_query_cpu_stopped() currently does not handle a -2/busy > status is a bug, fixed below by using rtas_busy_delay(). 
Note the > justification for the explicit cond_resched() in the outer loop: > > ================================================================ > (-- rtas.h --) > > /* query-cpu-stopped-state CPU_status */ > #define RTAS_QCSS_STATUS_STOPPED 0 > #define RTAS_QCSS_STATUS_IN_PROGRESS 1 > #define RTAS_QCSS_STATUS_NOT_STOPPED 2 > > (-- pseries/hotplug-cpu.c --) > > /** > * wait_for_cpu_stopped() - Wait for a cpu to enter RTAS stopped state. > */ > static void wait_for_cpu_stopped(unsigned int cpu) > { > unsigned int status; > unsigned int hwcpu; > > hwcpu = get_hard_smp_processor_id(cpu); > > do { > int fwrc; > > /* > * rtas_busy_delay() will yield only if RTAS returns a > * busy status. Since query-cpu-stopped-state can > * yield RTAS_QCSS_STATUS_IN_PROGRESS or > * RTAS_QCSS_STATUS_NOT_STOPPED for an unbounded > * period before the target thread stops, we must take > * care to explicitly reschedule while polling. > */ > cond_resched(); > > do { > fwrc = rtas_query_cpu_stopped_state(hwcpu, &status); > } while (rtas_busy_delay(fwrc)); > > if (fwrc == 0) > continue; > > pr_err_ratelimited("query-cpu-stopped-state for " > "thread 0x%x returned %d\n", > hwcpu, fwrc); > goto out; > > } while (status == RTAS_QCSS_STATUS_NOT_STOPPED || > status == RTAS_QCSS_STATUS_IN_PROGRESS); > > if (status != RTAS_QCSS_STATUS_STOPPED) { > pr_err_ratelimited("query-cpu-stopped-state yielded unknown " > "status %d for thread 0x%x\n", > status, hwcpu); > } > out: > return; > } > > [...] > > static void pseries_cpu_die(unsigned int cpu) > { > wait_for_cpu_stopped(cpu); > paca_ptrs[cpu]->cpu_start = 0; > } > ================================================================ > > wait_for_cpu_stopped() should be able to accommodate a time-based > warning if necessary, but speaking as a likely recipient of any bug > reports that would arise here, I'm not convinced of the need and I > don't know what a good value would be. 
It's relatively easy to sample
> the stack of a task that's apparently failing to make progress, plus I
> probably would use 'perf probe' or similar to report the inputs and
> outputs for the RTAS call.

I think if we make the timeout sufficiently high, like 2 minutes or so,
it wouldn't hurt, and if we did see them it would probably point to an
actual bug. But I don't have a strong feeling either way.

>
> I'm happy to make this a proper submission after I can clean it up and
> retest it, or Michael R. is welcome to appropriate it, assuming it's
> acceptable.
>

I've given it a shot with this patch and it seems to be holding up in
testing. If we don't think the ~2 minute warning message is needed I
can clean it up to post:

https://github.com/mdroth/linux/commit/354b8c97bf0dc1146e36aa72273f5b33fe90d09e

I'd likely break the refactoring patches out to a separate patch under
Nathan's name, since it potentially fixes a separate bug.
Michael Roth <mdroth@linux.vnet.ibm.com> writes: > Quoting Nathan Lynch (2020-08-07 02:05:09) ... >> wait_for_cpu_stopped() should be able to accommodate a time-based >> warning if necessary, but speaking as a likely recipient of any bug >> reports that would arise here, I'm not convinced of the need and I >> don't know what a good value would be. It's relatively easy to sample >> the stack of a task that's apparently failing to make progress, plus I >> probably would use 'perf probe' or similar to report the inputs and >> outputs for the RTAS call. > > I think if we make the timeout sufficiently high like 2 minutes or so > it wouldn't hurt and if we did seem them it would probably point to an > actual bug. But I don't have a strong feeling either way. I think we should print a warning after 2 minutes. It's true that there are fairly easy mechanisms to work out where the thread is stuck, but customers are unlikely to use them. They're just going to report that it's stuck with no further info, and probably reboot the machine before we get a chance to get any further info. Whereas if the kernel prints a warning with a stack trace we at least have that to go on in an initial bug report. >> I'm happy to make this a proper submission after I can clean it up and >> retest it, or Michael R. is welcome to appropriate it, assuming it's >> acceptable. >> > > I've given it a shot with this patch and it seems to be holding up in > testing. If we don't think the ~2 minutes warning message is needed I > can clean it up to post: > > https://github.com/mdroth/linux/commit/354b8c97bf0dc1146e36aa72273f5b33fe90d09e > > I'd likely break the refactoring patches out to a separate patch under > Nathan's name since it fixes a separate bug potentially. While I like Nathan's refactoring, we probably want to do the minimal fix first to ease backporting. Then do the refactoring on top of that. cheers
Michael Ellerman <mpe@ellerman.id.au> writes: > Michael Roth <mdroth@linux.vnet.ibm.com> writes: >> Quoting Nathan Lynch (2020-08-07 02:05:09) > ... >>> wait_for_cpu_stopped() should be able to accommodate a time-based >>> warning if necessary, but speaking as a likely recipient of any bug >>> reports that would arise here, I'm not convinced of the need and I >>> don't know what a good value would be. It's relatively easy to sample >>> the stack of a task that's apparently failing to make progress, plus I >>> probably would use 'perf probe' or similar to report the inputs and >>> outputs for the RTAS call. >> >> I think if we make the timeout sufficiently high like 2 minutes or so >> it wouldn't hurt and if we did seem them it would probably point to an >> actual bug. But I don't have a strong feeling either way. > > I think we should print a warning after 2 minutes. > > It's true that there are fairly easy mechanisms to work out where the > thread is stuck, but customers are unlikely to use them. They're just > going to report that it's stuck with no further info, and probably > reboot the machine before we get a chance to get any further info. > > Whereas if the kernel prints a warning with a stack trace we at least > have that to go on in an initial bug report. > >>> I'm happy to make this a proper submission after I can clean it up and >>> retest it, or Michael R. is welcome to appropriate it, assuming it's >>> acceptable. >>> >> >> I've given it a shot with this patch and it seems to be holding up in >> testing. If we don't think the ~2 minutes warning message is needed I >> can clean it up to post: >> >> https://github.com/mdroth/linux/commit/354b8c97bf0dc1146e36aa72273f5b33fe90d09e >> >> I'd likely break the refactoring patches out to a separate patch under >> Nathan's name since it fixes a separate bug potentially. > > While I like Nathan's refactoring, we probably want to do the minimal > fix first to ease backporting. 
> > Then do the refactoring on top of that. Fair enough, thanks.
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index c6e0d8abf75e..3cb172758052 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -111,13 +111,12 @@ static void pseries_cpu_die(unsigned int cpu) int cpu_status = 1; unsigned int pcpu = get_hard_smp_processor_id(cpu); - for (tries = 0; tries < 25; tries++) { + for (tries = 0; tries < 50; tries++) { cpu_status = smp_query_cpu_stopped(pcpu); if (cpu_status == QCSS_STOPPED || cpu_status == QCSS_HARDWARE_ERROR) break; - cpu_relax(); - + msleep(100); } if (cpu_status != 0) {
For a power9 KVM guest with XIVE enabled, running a test loop where we hotplug 384 vcpus and then unplug them, the following traces can be seen (generally within a few loops) either from the unplugged vcpu: [ 1767.353447] cpu 65 (hwid 65) Ready to die... [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2 [ 1767.952311] list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048 [ 1767.952322] ------------[ cut here ]------------ [ 1767.952323] kernel BUG at lib/list_debug.c:56! [ 1767.952325] Oops: Exception in kernel mode, sig: 5 [#1] [ 1767.952326] LE SMP NR_CPUS=2048 NUMA pSeries [ 1767.952328] Modules linked in: fuse nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat_ipv6 nf_nat_ipv6 nft_chain_route_ipv6 nft_chain_nat_ipv4 nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_chain_route_ipv4 ip6_tables nft_compat ip_set nf_tables nfnetlink uio_pdrv_genirq ip_tables xfs libcrc32c sd_mod sg xts vmx_crypto virtio_net net_failover failover virtio_scsi dm_multipath dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi [ 1767.952352] CPU: 66 PID: 0 Comm: swapper/66 Kdump: loaded Not tainted 4.18.0-221.el8.ppc64le #1 [ 1767.952354] NIP: c0000000007ab50c LR: c0000000007ab508 CTR: 00000000000003ac [ 1767.952355] REGS: c0000009e5a17840 TRAP: 0700 Not tainted (4.18.0-221.el8.ppc64le) [ 1767.952355] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28000842 XER: 20040000 [ 1767.952360] CFAR: c0000000001ffe64 IRQMASK: 1 [ 1767.952360] GPR00: c0000000007ab508 c0000009e5a17ac0 c000000001ac0700 0000000000000054 [ 1767.952360] GPR04: c0000009f056cf90 c0000009f05f5628 c0000009ed40d654 c000000001c90700 [ 1767.952360] GPR08: 0000000000000007 c0000009f0573980 00000009ef2b0000 7562202c38303230 [ 1767.952360] GPR12: 0000000000000000 
c0000007fff6ce80 c00a000002470208 0000000000000000 [ 1767.952360] GPR16: 0000000000000001 c0000009f05fbb00 0000000000000800 c0000009ff3d4980 [ 1767.952360] GPR20: c0000009f05fbb10 5deadbeef0000100 5deadbeef0000200 0000000000187961 [ 1767.952360] GPR24: c0000009e5a17b78 0000000000000000 0000000000000001 ffffffffffffffff [ 1767.952360] GPR28: c00a000002470200 c0000009f05fbb10 c0000009f05fbb10 0000000000000000 [ 1767.952375] NIP [c0000000007ab50c] __list_del_entry_valid+0xac/0x100 [ 1767.952376] LR [c0000000007ab508] __list_del_entry_valid+0xa8/0x100 [ 1767.952377] Call Trace: [ 1767.952378] [c0000009e5a17ac0] [c0000000007ab508] __list_del_entry_valid+0xa8/0x100 (unreliable) [ 1767.952381] [c0000009e5a17b20] [c000000000476e18] free_pcppages_bulk+0x1f8/0x940 [ 1767.952383] [c0000009e5a17c20] [c00000000047a9a0] free_unref_page+0xd0/0x100 [ 1767.952386] [c0000009e5a17c50] [c0000000000bc2a8] xive_spapr_cleanup_queue+0x148/0x1b0 [ 1767.952388] [c0000009e5a17cf0] [c0000000000b96dc] xive_teardown_cpu+0x1bc/0x240 [ 1767.952391] [c0000009e5a17d30] [c00000000010bf28] pseries_mach_cpu_die+0x78/0x2f0 [ 1767.952393] [c0000009e5a17de0] [c00000000005c8d8] cpu_die+0x48/0x70 [ 1767.952394] [c0000009e5a17e00] [c000000000021cf0] arch_cpu_idle_dead+0x20/0x40 [ 1767.952397] [c0000009e5a17e20] [c0000000001b4294] do_idle+0x2f4/0x4c0 [ 1767.952399] [c0000009e5a17ea0] [c0000000001b46a8] cpu_startup_entry+0x38/0x40 [ 1767.952400] [c0000009e5a17ed0] [c00000000005c43c] start_secondary+0x7bc/0x8f0 [ 1767.952403] [c0000009e5a17f90] [c00000000000ac70] start_secondary_prolog+0x10/0x14 or on the worker thread handling the unplug: [ 1538.253044] pseries-hotplug-cpu: Attempting to remove CPU <NULL>, drc index: 1000013a [ 1538.360259] Querying DEAD? cpu 314 (314) shows 2 [ 1538.360736] BUG: Bad page state in process kworker/u768:3 pfn:95de1 [ 1538.360746] cpu 314 (hwid 314) Ready to die... 
[ 1538.360784] page:c00a000002577840 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 [ 1538.361881] flags: 0x5ffffc00000000() [ 1538.361908] raw: 005ffffc00000000 5deadbeef0000100 5deadbeef0000200 0000000000000000 [ 1538.361955] raw: 0000000000000000 0000000000000000 00000000ffffff7f 0000000000000000 [ 1538.362002] page dumped because: nonzero mapcount [ 1538.362033] Modules linked in: kvm xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nft_objref nf_conntrack_tftp tun bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables nft_compat ip_set nf_tables nfnetlink sunrpc xts vmx_crypto ip_tables xfs libcrc32c sd_mod sg virtio_net net_failover virtio_scsi failover dm_mirror dm_region_hash dm_log dm_mod [ 1538.362613] CPU: 0 PID: 548 Comm: kworker/u768:3 Kdump: loaded Not tainted 4.18.0-224.el8.bz1856588.ppc64le #1 [ 1538.362687] Workqueue: pseries hotplug workque pseries_hp_work_fn [ 1538.362725] Call Trace: [ 1538.362743] [c0000009d4adf590] [c000000000e0e0fc] dump_stack+0xb0/0xf4 (unreliable) [ 1538.362789] [c0000009d4adf5d0] [c000000000475dfc] bad_page+0x12c/0x1b0 [ 1538.362827] [c0000009d4adf660] [c0000000004784bc] free_pcppages_bulk+0x5bc/0x940 [ 1538.362871] [c0000009d4adf760] [c000000000478c38] page_alloc_cpu_dead+0x118/0x120 [ 1538.362918] [c0000009d4adf7b0] [c00000000015b898] cpuhp_invoke_callback.constprop.5+0xb8/0x760 [ 1538.362969] [c0000009d4adf820] [c00000000015eee8] _cpu_down+0x188/0x340 [ 1538.363007] [c0000009d4adf890] [c00000000015d75c] cpu_down+0x5c/0xa0 [ 1538.363045] [c0000009d4adf8c0] [c00000000092c544] cpu_subsys_offline+0x24/0x40 [ 1538.363091] [c0000009d4adf8e0] [c0000000009212f0] device_offline+0xf0/0x130 [ 1538.363129] [c0000009d4adf920] [c00000000010aee4] dlpar_offline_cpu+0x1c4/0x2a0 [ 1538.363174] [c0000009d4adf9e0] [c00000000010b2f8] 
dlpar_cpu_remove+0xb8/0x190
[ 1538.363219] [c0000009d4adfa60] [c00000000010b4fc] dlpar_cpu_remove_by_index+0x12c/0x150
[ 1538.363264] [c0000009d4adfaf0] [c00000000010ca24] dlpar_cpu+0x94/0x800
[ 1538.363302] [c0000009d4adfc00] [c000000000102cc8] pseries_hp_work_fn+0x128/0x1e0
[ 1538.363347] [c0000009d4adfc70] [c00000000018aa84] process_one_work+0x304/0x5d0
[ 1538.363394] [c0000009d4adfd10] [c00000000018b5cc] worker_thread+0xcc/0x7a0
[ 1538.363433] [c0000009d4adfdc0] [c00000000019567c] kthread+0x1ac/0x1c0
[ 1538.363469] [c0000009d4adfe30] [c00000000000b7dc] ret_from_kernel_thread+0x5c/0x80

The latter trace is due to the following sequence:

  page_alloc_cpu_dead
    drain_pages
      drain_pages_zone
        free_pcppages_bulk

where drain_pages() in this case is called under the assumption that
the unplugged cpu is no longer executing. To ensure that is the case,
an early call is made to __cpu_die()->pseries_cpu_die(), which runs a
loop that waits for the cpu to reach a halted state by polling its
status via query-cpu-stopped-state RTAS calls. It only polls for 25
iterations before giving up, however, and in the trace above this
results in the following being printed only 0.1 seconds after the
hotplug worker thread begins processing the unplug request:

[ 1538.253044] pseries-hotplug-cpu: Attempting to remove CPU <NULL>, drc index: 1000013a
[ 1538.360259] Querying DEAD? cpu 314 (314) shows 2

At that point the worker thread assumes the unplugged CPU is in some
unknown/dead state and proceeds with the cleanup, causing the race with
the XIVE cleanup code executed by the unplugged CPU.

Fix this by inserting an msleep() after each RTAS call to avoid
pseries_cpu_die() returning prematurely, and double the number of
attempts so we wait at least a total of 5 seconds. While this isn't an
ideal solution, it is similar to how we dealt with an analogous issue
for cede_offline mode in the past (940ce422a3).
Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller") Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588 Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Cedric Le Goater <clg@kaod.org> Cc: Greg Kurz <groug@kaod.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com> --- arch/powerpc/platforms/pseries/hotplug-cpu.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)