Message ID: 20181019010625.25294-1-cota@braap.org
Series: per-CPU locks
On 19/10/2018 03:05, Emilio G. Cota wrote:
> I'm calling this series a v3 because it supersedes the two series
> I previously sent about using atomics for interrupt_request:
> https://lists.gnu.org/archive/html/qemu-devel/2018-09/msg02013.html
> The approach in that series cannot work reliably; using (locked) atomics
> to set interrupt_request but not using (locked) atomics to read it
> can lead to missed updates.

The idea here was that changes to protected fields are all followed by
kick. That may not have been the case, granted, but I wonder if the
plan is unworkable.

Paolo
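The missed-update hazard Emilio describes can be replayed step by step. The sketch below is a single-threaded simulation of one problematic interleaving, written with C11 atomics and purely illustrative names (it is not QEMU code): the reader's check of interrupt_request samples 0, a locked OR from the setter lands just afterwards, and the vCPU then halts based on the stale check, leaving the interrupt pending with the CPU asleep unless the setter also kicks.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative state; in QEMU these would live in CPUState. */
static atomic_int interrupt_request;
static bool halted;

/* Replay, in program order, the interleaving in which the locked-OR
 * setter races with a reader that does not hold any common lock.
 * Returns true if the vCPU ends up halted with an interrupt pending. */
bool simulate_lost_wakeup(void)
{
    atomic_store(&interrupt_request, 0);
    halted = false;

    /* vCPU thread: a cpu_has_work()-style check finds nothing pending. */
    bool has_work = (atomic_load(&interrupt_request) != 0);

    /* I/O thread: posts the interrupt with a locked atomic OR.  The OR
     * itself is fine, but it races with the check that already happened. */
    atomic_fetch_or(&interrupt_request, 1);

    /* vCPU thread: acts on the stale check and goes to sleep.  Nothing
     * will wake it unless the setter also kicks the CPU. */
    if (!has_work) {
        halted = true;
    }
    return halted && atomic_load(&interrupt_request) != 0;
}
```

The simulation always reaches the bad state because the steps are sequenced by hand; in a real run the same outcome needs the unlucky (but possible) thread interleaving.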
On Fri, Oct 19, 2018 at 08:59:24 +0200, Paolo Bonzini wrote:
> On 19/10/2018 03:05, Emilio G. Cota wrote:
> > I'm calling this series a v3 because it supersedes the two series
> > I previously sent about using atomics for interrupt_request:
> > https://lists.gnu.org/archive/html/qemu-devel/2018-09/msg02013.html
> > The approach in that series cannot work reliably; using (locked) atomics
> > to set interrupt_request but not using (locked) atomics to read it
> > can lead to missed updates.
>
> The idea here was that changes to protected fields are all followed by
> kick. That may not have been the case, granted, but I wonder if the
> plan is unworkable.

I suspect that the cpu->interrupt_request+kick mechanism is not the issue,
otherwise master should not work--we do atomic_read(cpu->interrupt_request)
and only if that read != 0 do we take the BQL.

My guess is that the problem is with other reads of cpu->interrupt_request,
e.g. those in cpu_has_work. Currently those reads happen with the
BQL held, and updates to cpu->interrupt_request take the BQL. If we drop
the BQL from the setters to instead use locked atomics (like in the
aforementioned series), those BQL-protected readers might miss updates.

Given that we need a per-CPU lock anyway to remove the BQL from the
CPU loop, extending this lock to protect cpu->interrupt_request is
a simple solution that keeps the current logic and allows for
greater scalability.

Thanks,
Emilio
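The master-branch pattern Emilio refers to — a lock-free fast-path read, taking the BQL only when something is pending — can be sketched as below. All names are illustrative stand-ins, not QEMU's actual functions; the point is that the slow path must re-read under the lock, since the value can change between the unlocked check and the lock acquisition.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative globals; in QEMU the flag is per-CPU and the lock is the BQL. */
static pthread_mutex_t bql = PTHREAD_MUTEX_INITIALIZER;
static atomic_int interrupt_request;
static int interrupts_handled;

/* Setter side: updates to interrupt_request take the BQL. */
void cpu_post_interrupt(int mask)
{
    pthread_mutex_lock(&bql);
    atomic_fetch_or(&interrupt_request, mask);
    pthread_mutex_unlock(&bql);
}

/* Reader side: lock-free fast path, BQL-protected slow path. */
bool cpu_check_interrupts(void)
{
    /* Fast path: an atomic read, no lock taken when nothing is pending. */
    if (atomic_load(&interrupt_request) == 0) {
        return false;
    }
    /* Slow path: take the BQL and re-read, because the flag may have
     * changed between the check above and acquiring the lock. */
    pthread_mutex_lock(&bql);
    int pending = atomic_exchange(&interrupt_request, 0);
    if (pending) {
        interrupts_handled++;    /* stand-in for actually servicing it */
    }
    pthread_mutex_unlock(&bql);
    return pending != 0;
}
```

Extending a per-CPU lock over interrupt_request, as the series does, keeps this check-then-lock shape but replaces the global BQL with a lock private to each vCPU.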
On 19/10/2018 16:50, Emilio G. Cota wrote:
> On Fri, Oct 19, 2018 at 08:59:24 +0200, Paolo Bonzini wrote:
>> On 19/10/2018 03:05, Emilio G. Cota wrote:
>>> I'm calling this series a v3 because it supersedes the two series
>>> I previously sent about using atomics for interrupt_request:
>>> https://lists.gnu.org/archive/html/qemu-devel/2018-09/msg02013.html
>>> The approach in that series cannot work reliably; using (locked) atomics
>>> to set interrupt_request but not using (locked) atomics to read it
>>> can lead to missed updates.
>>
>> The idea here was that changes to protected fields are all followed by
>> kick. That may not have been the case, granted, but I wonder if the
>> plan is unworkable.
>
> I suspect that the cpu->interrupt_request+kick mechanism is not the issue,
> otherwise master should not work--we do atomic_read(cpu->interrupt_request)
> and only if that read != 0 we take the BQL.
>
> My guess is that the problem is with other reads of cpu->interrupt_request,
> e.g. those in cpu_has_work. Currently those reads happen with the
> BQL held, and updates to cpu->interrupt_request take the BQL. If we drop
> the BQL from the setters to instead use locked atomics (like in the
> aforementioned series), those BQL-protected readers might miss updates.

cpu_has_work is only needed to handle the processor's halted state (or
is it?). If it is, OR+kick should work.

> Given that we need a per-CPU lock anyway to remove the BQL from the
> CPU loop, extending this lock to protect cpu->interrupt_request is
> a simple solution that keeps the current logic and allows for
> greater scalability.

Sure, I was just curious what the problem was. KVM uses OR+kick with no
problems.

Paolo
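The OR+kick scheme Paolo mentions can be sketched with a condition variable standing in for the kick. The names are illustrative, not KVM's or QEMU's actual code; the essential property is that the halted vCPU re-checks the flag under the same lock the kicker takes before signaling, so an OR that lands between the check and the sleep cannot be lost.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Illustrative state for one vCPU's halted-wait machinery. */
static pthread_mutex_t halt_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t halt_cond = PTHREAD_COND_INITIALIZER;
static atomic_uint interrupt_request;

/* Setter side: locked OR, then kick.  Taking halt_lock around the
 * signal means the kick cannot slip into the window between the
 * waiter's check and its pthread_cond_wait(). */
void cpu_interrupt(unsigned mask)
{
    atomic_fetch_or(&interrupt_request, mask);   /* OR ... */
    pthread_mutex_lock(&halt_lock);
    pthread_cond_signal(&halt_cond);             /* ... then kick */
    pthread_mutex_unlock(&halt_lock);
}

/* vCPU side: sleep while halted, re-checking the flag under the lock
 * (which also handles spurious wakeups).  Returns the pending mask. */
unsigned cpu_halt_until_interrupt(void)
{
    pthread_mutex_lock(&halt_lock);
    while (atomic_load(&interrupt_request) == 0) {
        pthread_cond_wait(&halt_cond, &halt_lock);
    }
    unsigned pending = atomic_exchange(&interrupt_request, 0);
    pthread_mutex_unlock(&halt_lock);
    return pending;
}
```

With this shape the earlier lost-wakeup interleaving is closed: either the waiter sees the OR before sleeping, or the kicker blocks on halt_lock until the waiter is actually inside pthread_cond_wait().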
On Fri, Oct 19, 2018 at 18:01:18 +0200, Paolo Bonzini wrote:
> On 19/10/2018 16:50, Emilio G. Cota wrote:
> > On Fri, Oct 19, 2018 at 08:59:24 +0200, Paolo Bonzini wrote:
> >> On 19/10/2018 03:05, Emilio G. Cota wrote:
> >>> I'm calling this series a v3 because it supersedes the two series
> >>> I previously sent about using atomics for interrupt_request:
> >>> https://lists.gnu.org/archive/html/qemu-devel/2018-09/msg02013.html
> >>> The approach in that series cannot work reliably; using (locked) atomics
> >>> to set interrupt_request but not using (locked) atomics to read it
> >>> can lead to missed updates.
> >>
> >> The idea here was that changes to protected fields are all followed by
> >> kick. That may not have been the case, granted, but I wonder if the
> >> plan is unworkable.
> >
> > I suspect that the cpu->interrupt_request+kick mechanism is not the issue,
> > otherwise master should not work--we do atomic_read(cpu->interrupt_request)
> > and only if that read != 0 we take the BQL.
> >
> > My guess is that the problem is with other reads of cpu->interrupt_request,
> > e.g. those in cpu_has_work. Currently those reads happen with the
> > BQL held, and updates to cpu->interrupt_request take the BQL. If we drop
> > the BQL from the setters to instead use locked atomics (like in the
> > aforementioned series), those BQL-protected readers might miss updates.
>
> cpu_has_work is only needed to handle the processor's halted state (or
> is it?). If it is, OR+kick should work.
>
> > Given that we need a per-CPU lock anyway to remove the BQL from the
> > CPU loop, extending this lock to protect cpu->interrupt_request is
> > a simple solution that keeps the current logic and allows for
> > greater scalability.
>
> Sure, I was just curious what the problem was. KVM uses OR+kick with no
> problems.

I never found exactly where things break. The hangs happen
pretty early when booting a large (-smp > 16) x86_64 Ubuntu guest.
Booting never completes (ssh unresponsive) if I don't have the
console output (I suspect the console output slows things down
enough to hide some races). I only see a few threads busy:
a couple of vCPU threads, and the I/O thread.

I didn't have time to debug any further, so I moved on
to an alternative approach.

So it is possible that it was my implementation, and not the approach,
that was at fault :-)

Thanks,
E.
On Fri, Oct 19, 2018 at 15:29:32 -0400, Emilio G. Cota wrote:
> On Fri, Oct 19, 2018 at 18:01:18 +0200, Paolo Bonzini wrote:
> > > Given that we need a per-CPU lock anyway to remove the BQL from the
> > > CPU loop, extending this lock to protect cpu->interrupt_request is
> > > a simple solution that keeps the current logic and allows for
> > > greater scalability.
> >
> > Sure, I was just curious what the problem was. KVM uses OR+kick with no
> > problems.
>
> I never found exactly where things break. The hangs happen
> pretty early when booting a large (-smp > 16) x86_64 Ubuntu guest.
> Booting never completes (ssh unresponsive) if I don't have the
> console output (I suspect the console output slows things down
> enough to hide some races). I only see a few threads busy:
> a couple of vCPU threads, and the I/O thread.
>
> I didn't have time to debug any further, so I moved on
> to an alternative approach.
>
> So it is possible that it was my implementation, and not the approach,
> that was at fault :-)

I've just observed a similar hang after adding the "BQL
pushdown" patches on top of this series. So it's likely that the
hangs come from those patches, and not from the work on
cpu->interrupt_request. I just confirmed with the prior
series, and removing the pushdown patches fixes the hangs there
as well.

Thanks,
Emilio
On 20/10/2018 01:46, Emilio G. Cota wrote:
>> So it is possible that it was my implementation, and not the approach,
>> that was at fault :-)
>
> I've just observed a similar hang after adding the "BQL
> pushdown" patches on top of this series. So it's likely that the
> hangs come from those patches, and not from the work on
> cpu->interrupt_request. I just confirmed with the prior
> series, and removing the pushdown patches fixes the hangs there
> as well.

Oh well, not a big deal. You already wrote these patches and I don't
have much time for MTTCG anyway, so I am okay with sticking with them.

Thanks!

Paolo