Message ID: 20181019010625.25294-1-cota@braap.org
Series: per-CPU locks
On 19/10/2018 03:05, Emilio G. Cota wrote:
> I'm calling this series a v3 because it supersedes the two series
> I previously sent about using atomics for interrupt_request:
> https://lists.gnu.org/archive/html/qemu-devel/2018-09/msg02013.html
> The approach in that series cannot work reliably; using (locked) atomics
> to set interrupt_request but not using (locked) atomics to read it
> can lead to missed updates.

The idea here was that changes to protected fields are all followed by
kick. That may not have been the case, granted, but I wonder if the
plan is unworkable.

Paolo
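The missed-update hazard Emilio describes can be replayed step by step. The sketch below is a single-threaded simulation of one problematic interleaving, written with C11 atomics and purely illustrative names (it is not QEMU code): the reader's check of interrupt_request samples 0, a locked OR from the setter lands just afterwards, and the vCPU then halts based on the stale check, leaving the interrupt pending with the CPU asleep unless the setter also kicks.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative state; in QEMU these would live in CPUState. */
static atomic_int interrupt_request;
static bool halted;

/* Replay, in program order, the interleaving in which the locked-OR
 * setter races with a reader that does not hold any common lock.
 * Returns true if the vCPU ends up halted with an interrupt pending. */
bool simulate_lost_wakeup(void)
{
    atomic_store(&interrupt_request, 0);
    halted = false;

    /* vCPU thread: a cpu_has_work()-style check finds nothing pending. */
    bool has_work = (atomic_load(&interrupt_request) != 0);

    /* I/O thread: posts the interrupt with a locked atomic OR.  The OR
     * itself is fine, but it races with the check that already happened. */
    atomic_fetch_or(&interrupt_request, 1);

    /* vCPU thread: acts on the stale check and goes to sleep.  Nothing
     * will wake it unless the setter also kicks the CPU. */
    if (!has_work) {
        halted = true;
    }
    return halted && atomic_load(&interrupt_request) != 0;
}
```

The simulation always reaches the bad state because the steps are sequenced by hand; in a real run the same outcome needs the unlucky (but possible) thread interleaving.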
On Fri, Oct 19, 2018 at 08:59:24 +0200, Paolo Bonzini wrote:
> On 19/10/2018 03:05, Emilio G. Cota wrote:
> > I'm calling this series a v3 because it supersedes the two series
> > I previously sent about using atomics for interrupt_request:
> > https://lists.gnu.org/archive/html/qemu-devel/2018-09/msg02013.html
> > The approach in that series cannot work reliably; using (locked) atomics
> > to set interrupt_request but not using (locked) atomics to read it
> > can lead to missed updates.
>
> The idea here was that changes to protected fields are all followed by
> kick. That may not have been the case, granted, but I wonder if the
> plan is unworkable.

I suspect that the cpu->interrupt_request+kick mechanism is not the issue,
otherwise master should not work--we do atomic_read(cpu->interrupt_request)
and only if that read != 0 do we take the BQL.

My guess is that the problem is with other reads of cpu->interrupt_request,
e.g. those in cpu_has_work. Currently those reads happen with the
BQL held, and updates to cpu->interrupt_request take the BQL. If we drop
the BQL from the setters to instead use locked atomics (like in the
aforementioned series), those BQL-protected readers might miss updates.

Given that we need a per-CPU lock anyway to remove the BQL from the
CPU loop, extending this lock to protect cpu->interrupt_request is
a simple solution that keeps the current logic and allows for
greater scalability.

Thanks,
Emilio
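The master-branch pattern Emilio refers to — a lock-free fast-path read, taking the BQL only when something is pending — can be sketched as below. All names are illustrative stand-ins, not QEMU's actual functions; the point is that the slow path must re-read under the lock, since the value can change between the unlocked check and the lock acquisition.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative globals; in QEMU the flag is per-CPU and the lock is the BQL. */
static pthread_mutex_t bql = PTHREAD_MUTEX_INITIALIZER;
static atomic_int interrupt_request;
static int interrupts_handled;

/* Setter side: updates to interrupt_request take the BQL. */
void cpu_post_interrupt(int mask)
{
    pthread_mutex_lock(&bql);
    atomic_fetch_or(&interrupt_request, mask);
    pthread_mutex_unlock(&bql);
}

/* Reader side: lock-free fast path, BQL-protected slow path. */
bool cpu_check_interrupts(void)
{
    /* Fast path: an atomic read, no lock taken when nothing is pending. */
    if (atomic_load(&interrupt_request) == 0) {
        return false;
    }
    /* Slow path: take the BQL and re-read, because the flag may have
     * changed between the check above and acquiring the lock. */
    pthread_mutex_lock(&bql);
    int pending = atomic_exchange(&interrupt_request, 0);
    if (pending) {
        interrupts_handled++;    /* stand-in for actually servicing it */
    }
    pthread_mutex_unlock(&bql);
    return pending != 0;
}
```

Extending a per-CPU lock over interrupt_request, as the series does, keeps this check-then-lock shape but replaces the global BQL with a lock private to each vCPU.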
On 19/10/2018 16:50, Emilio G. Cota wrote:
> On Fri, Oct 19, 2018 at 08:59:24 +0200, Paolo Bonzini wrote:
>> On 19/10/2018 03:05, Emilio G. Cota wrote:
>>> I'm calling this series a v3 because it supersedes the two series
>>> I previously sent about using atomics for interrupt_request:
>>> https://lists.gnu.org/archive/html/qemu-devel/2018-09/msg02013.html
>>> The approach in that series cannot work reliably; using (locked) atomics
>>> to set interrupt_request but not using (locked) atomics to read it
>>> can lead to missed updates.
>>
>> The idea here was that changes to protected fields are all followed by
>> kick. That may not have been the case, granted, but I wonder if the
>> plan is unworkable.
>
> I suspect that the cpu->interrupt_request+kick mechanism is not the issue,
> otherwise master should not work--we do atomic_read(cpu->interrupt_request)
> and only if that read != 0 we take the BQL.
>
> My guess is that the problem is with other reads of cpu->interrupt_request,
> e.g. those in cpu_has_work. Currently those reads happen with the
> BQL held, and updates to cpu->interrupt_request take the BQL. If we drop
> the BQL from the setters to instead use locked atomics (like in the
> aforementioned series), those BQL-protected readers might miss updates.

cpu_has_work is only needed to handle the processor's halted state (or
is it?). If it is, OR+kick should work.

> Given that we need a per-CPU lock anyway to remove the BQL from the
> CPU loop, extending this lock to protect cpu->interrupt_request is
> a simple solution that keeps the current logic and allows for
> greater scalability.

Sure, I was just curious what the problem was. KVM uses OR+kick with no
problems.

Paolo
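The OR+kick scheme Paolo mentions can be sketched with a condition variable standing in for the kick. The names are illustrative, not KVM's or QEMU's actual code; the essential property is that the halted vCPU re-checks the flag under the same lock the kicker takes before signaling, so an OR that lands between the check and the sleep cannot be lost.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Illustrative state for one vCPU's halted-wait machinery. */
static pthread_mutex_t halt_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t halt_cond = PTHREAD_COND_INITIALIZER;
static atomic_uint interrupt_request;

/* Setter side: locked OR, then kick.  Taking halt_lock around the
 * signal means the kick cannot slip into the window between the
 * waiter's check and its pthread_cond_wait(). */
void cpu_interrupt(unsigned mask)
{
    atomic_fetch_or(&interrupt_request, mask);   /* OR ... */
    pthread_mutex_lock(&halt_lock);
    pthread_cond_signal(&halt_cond);             /* ... then kick */
    pthread_mutex_unlock(&halt_lock);
}

/* vCPU side: sleep while halted, re-checking the flag under the lock
 * (which also handles spurious wakeups).  Returns the pending mask. */
unsigned cpu_halt_until_interrupt(void)
{
    pthread_mutex_lock(&halt_lock);
    while (atomic_load(&interrupt_request) == 0) {
        pthread_cond_wait(&halt_cond, &halt_lock);
    }
    unsigned pending = atomic_exchange(&interrupt_request, 0);
    pthread_mutex_unlock(&halt_lock);
    return pending;
}
```

With this shape the earlier lost-wakeup interleaving is closed: either the waiter sees the OR before sleeping, or the kicker blocks on halt_lock until the waiter is actually inside pthread_cond_wait().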
On Fri, Oct 19, 2018 at 18:01:18 +0200, Paolo Bonzini wrote:
> On 19/10/2018 16:50, Emilio G. Cota wrote:
> > On Fri, Oct 19, 2018 at 08:59:24 +0200, Paolo Bonzini wrote:
> >> On 19/10/2018 03:05, Emilio G. Cota wrote:
> >>> I'm calling this series a v3 because it supersedes the two series
> >>> I previously sent about using atomics for interrupt_request:
> >>> https://lists.gnu.org/archive/html/qemu-devel/2018-09/msg02013.html
> >>> The approach in that series cannot work reliably; using (locked) atomics
> >>> to set interrupt_request but not using (locked) atomics to read it
> >>> can lead to missed updates.
> >>
> >> The idea here was that changes to protected fields are all followed by
> >> kick. That may not have been the case, granted, but I wonder if the
> >> plan is unworkable.
> >
> > I suspect that the cpu->interrupt_request+kick mechanism is not the issue,
> > otherwise master should not work--we do atomic_read(cpu->interrupt_request)
> > and only if that read != 0 we take the BQL.
> >
> > My guess is that the problem is with other reads of cpu->interrupt_request,
> > e.g. those in cpu_has_work. Currently those reads happen with the
> > BQL held, and updates to cpu->interrupt_request take the BQL. If we drop
> > the BQL from the setters to instead use locked atomics (like in the
> > aforementioned series), those BQL-protected readers might miss updates.
>
> cpu_has_work is only needed to handle the processor's halted state (or
> is it?). If it is, OR+kick should work.
>
> > Given that we need a per-CPU lock anyway to remove the BQL from the
> > CPU loop, extending this lock to protect cpu->interrupt_request is
> > a simple solution that keeps the current logic and allows for
> > greater scalability.
>
> Sure, I was just curious what the problem was. KVM uses OR+kick with no
> problems.

I never found exactly where things break. The hangs happen
pretty early when booting a large (-smp > 16) x86_64 Ubuntu guest.
Booting never completes (ssh unresponsive) if I don't have the
console output (I suspect the console output slows things down
enough to hide some races). I only see a few threads busy:
a couple of vCPU threads, and the I/O thread.

I didn't have time to debug any further, so I moved on
to an alternative approach.

So it is possible that it was my implementation, and not the approach,
that was at fault :-)

Thanks,
E.
On Fri, Oct 19, 2018 at 15:29:32 -0400, Emilio G. Cota wrote:
> On Fri, Oct 19, 2018 at 18:01:18 +0200, Paolo Bonzini wrote:
> > > Given that we need a per-CPU lock anyway to remove the BQL from the
> > > CPU loop, extending this lock to protect cpu->interrupt_request is
> > > a simple solution that keeps the current logic and allows for
> > > greater scalability.
> >
> > Sure, I was just curious what the problem was. KVM uses OR+kick with no
> > problems.
>
> I never found exactly where things break. The hangs happen
> pretty early when booting a large (-smp > 16) x86_64 Ubuntu guest.
> Booting never completes (ssh unresponsive) if I don't have the
> console output (I suspect the console output slows things down
> enough to hide some races). I only see a few threads busy:
> a couple of vCPU threads, and the I/O thread.
>
> I didn't have time to debug any further, so I moved on
> to an alternative approach.
>
> So it is possible that it was my implementation, and not the approach,
> that was at fault :-)

I've just observed a similar hang after adding the "BQL
pushdown" patches on top of this series. So it's likely that the
hangs come from those patches, and not from the work on
cpu->interrupt_request. I just confirmed with the prior
series, and removing the pushdown patches fixes the hangs there
as well.

Thanks,
Emilio
On 20/10/2018 01:46, Emilio G. Cota wrote:
>> So it is possible that it was my implementation, and not the approach,
>> that was at fault :-)
>
> I've just observed a similar hang after adding the "BQL
> pushdown" patches on top of this series. So it's likely that the
> hangs come from those patches, and not from the work on
> cpu->interrupt_request. I just confirmed with the prior
> series, and removing the pushdown patches fixes the hangs there
> as well.

Oh well, not a big deal. You already wrote these patches and I don't
have much time for MTTCG anyway, so I am okay with sticking with them.

Thanks!

Paolo