Message ID: 557EE4D2.5010202@hs-rm.de
State: New
On 15 June 2015 at 15:44, Alex Züpke <alexander.zuepke@hs-rm.de> wrote:
> Am 12.06.2015 um 20:03 schrieb Peter Maydell:
>> Probably the best approach would be to have something in
>> arm_cpu_set_irq() which says "if we are CPU X and we've
>> just caused an interrupt to be set for CPU Y, then we
>> should ourselves yield back to the main loop".
>>
>> Something like this, maybe, though I have done no more testing
>> than checking it doesn't actively break kernel booting :-)
>
> Thanks! One more check for "level" is needed to get it to work:

What happens without that? It's reasonable to have it, but extra
cpu_exit()s shouldn't cause a problem beyond being a bit
inefficient...

It would be interesting to know if this helps Linux as well as your
custom OS. (I don't know whether a "CPU #0 polls" approach is bad on
hardware too; the other option would be to have CPU #1 IPI back in the
other direction if 0 needed to wait for a response.)

-- PMM
Am 15.06.2015 um 16:51 schrieb Peter Maydell:
> On 15 June 2015 at 15:44, Alex Züpke <alexander.zuepke@hs-rm.de> wrote:
>> Am 12.06.2015 um 20:03 schrieb Peter Maydell:
>>> Probably the best approach would be to have something in
>>> arm_cpu_set_irq() which says "if we are CPU X and we've
>>> just caused an interrupt to be set for CPU Y, then we
>>> should ourselves yield back to the main loop".
>>>
>>> Something like this, maybe, though I have done no more testing
>>> than checking it doesn't actively break kernel booting :-)
>>
>> Thanks! One more check for "level" is needed to get it to work:
>
> What happens without that? It's reasonable to have it,
> but extra cpu_exit()s shouldn't cause a problem beyond
> being a bit inefficient...

The emulation gets stuck, for whatever reason I don't understand.

I checked if something similar is done on other architectures and
found that a similar level check is used elsewhere, see for example
cpu_request_exit() in hw/ppc/prep.c:

static void cpu_request_exit(void *opaque, int irq, int level)
{
    CPUState *cpu = current_cpu;

    if (cpu && level) {
        cpu_exit(cpu);
    }
}

But probably this is used for something completely unrelated.

> It would be interesting to know if this helps Linux as well
> as your custom OS. (I don't know whether a "CPU #0 polls"
> approach is bad on hardware too; the other option would be
> to have CPU #1 IPI back in the other direction if 0 needed
> to wait for a response.)

IIRC, Linux TLB shootdown on x86 once used such a scheme, but I don't
know if they changed it. I'd say that an IPI+poll pattern is used
quite often in the tricky parts of a kernel, like kernel debugging.

Here's a simple IPI tester sending IPIs from CPU #0 to CPU #1 in an
endless loop. The IPIs are delayed until the timer interrupt triggers
the main loop.

http://www.cs.hs-rm.de/~zuepke/qemu/ipi.elf
3174 bytes, md5sum 8d73890a60cd9b24a4f9139509b580e2

Run testcase:

$ qemu-system-arm -M vexpress-a15 -smp 2 -kernel ipi.elf -nographic

The testcase prints the following on the serial console without the
patch:

+------- CPU 0 came up
|+------ CPU 0 initialization completed
|| +---- CPU 0 timer interrupt, 1 HZ
|| |
vv v
0!1T.T.T.T.T.T.T.
  ^ ^
  | |
  | +-- CPU 1 received an IPI
  +---- CPU 1 came up

Expected testcase output with patch:

0!1T..............<hundreds of dots>.................T...............

So: more dots == more IPIs handled between two timer interrupts "T" ...

Best regards
Alex
On 15 June 2015 at 16:05, Alex Züpke <alexander.zuepke@hs-rm.de> wrote:
> Here's a simple IPI tester sending IPIs from CPU #0 to CPU #1 in an
> endless loop. The IPIs are delayed until the timer interrupt triggers
> the main loop.
>
> http://www.cs.hs-rm.de/~zuepke/qemu/ipi.elf
> 3174 bytes, md5sum 8d73890a60cd9b24a4f9139509b580e2
>
> Run testcase:
> $ qemu-system-arm -M vexpress-a15 -smp 2 -kernel ipi.elf -nographic
>
> The testcase prints the following on the serial console without the
> patch:
>
> +------- CPU 0 came up
> |+------ CPU 0 initialization completed
> || +---- CPU 0 timer interrupt, 1 HZ
> || |
> vv v
> 0!1T.T.T.T.T.T.T.
>   ^ ^
>   | |
>   | +-- CPU 1 received an IPI
>   +---- CPU 1 came up
>
> Expected testcase output with patch:
>
> 0!1T..............<hundreds of dots>.................T...............
>
> So: more dots == more IPIs handled between two timer interrupts "T" ...

For me this test case (without any IPI related patches) just prints
"0!TT" (or sometimes "0!T") and then hangs. The yield test binary does
the same thing.

-- PMM
On 15 June 2015 at 16:05, Alex Züpke <alexander.zuepke@hs-rm.de> wrote:
> Am 15.06.2015 um 16:51 schrieb Peter Maydell:
>> On 15 June 2015 at 15:44, Alex Züpke <alexander.zuepke@hs-rm.de> wrote:
>>> Am 12.06.2015 um 20:03 schrieb Peter Maydell:
>>>> Probably the best approach would be to have something in
>>>> arm_cpu_set_irq() which says "if we are CPU X and we've
>>>> just caused an interrupt to be set for CPU Y, then we
>>>> should ourselves yield back to the main loop".
>>>>
>>>> Something like this, maybe, though I have done no more testing
>>>> than checking it doesn't actively break kernel booting :-)
>>>
>>> Thanks! One more check for "level" is needed to get it to work:
>>
>> What happens without that? It's reasonable to have it,
>> but extra cpu_exit()s shouldn't cause a problem beyond
>> being a bit inefficient...
>
> The emulation gets stuck, for whatever reason I don't understand.

I'm beginning to suspect that your guest code has a race condition in
it, such that if the other CPU runs at a point you weren't expecting
it to then you end up deadlocking or otherwise running into a bug in
your guest.

In particular, I see the emulation getting stuck even without this
patch to arm_cpu_set_irq().

-- PMM
Am 15.06.2015 um 20:58 schrieb Peter Maydell:
> On 15 June 2015 at 16:05, Alex Züpke <alexander.zuepke@hs-rm.de> wrote:
>> Am 15.06.2015 um 16:51 schrieb Peter Maydell:
>>> On 15 June 2015 at 15:44, Alex Züpke <alexander.zuepke@hs-rm.de> wrote:
>>>> Am 12.06.2015 um 20:03 schrieb Peter Maydell:
>>>>> Probably the best approach would be to have something in
>>>>> arm_cpu_set_irq() which says "if we are CPU X and we've
>>>>> just caused an interrupt to be set for CPU Y, then we
>>>>> should ourselves yield back to the main loop".
>>>>>
>>>>> Something like this, maybe, though I have done no more testing
>>>>> than checking it doesn't actively break kernel booting :-)
>>>>
>>>> Thanks! One more check for "level" is needed to get it to work:
>>>
>>> What happens without that? It's reasonable to have it,
>>> but extra cpu_exit()s shouldn't cause a problem beyond
>>> being a bit inefficient...
>>
>> The emulation gets stuck, for whatever reason I don't understand.
>
> I'm beginning to suspect that your guest code has a race
> condition in it, such that if the other CPU runs at a
> point you weren't expecting it to then you end up
> deadlocking or otherwise running into a bug in your guest.
>
> In particular, I see the emulation getting stuck even without
> this patch to arm_cpu_set_irq().
>
> -- PMM

Yes, it's a bug, sorry for that. I removed too much code to get a
simple testcase. It's stuck in the first spinlock where CPU#1 is
waiting for CPU#0 to initialize the rest of the system, and I need to
WFE or YIELD here as well.

But this is showing the original problem again: the emulation gets
stuck spinning on CPU #1 forever, because the main loop doesn't switch
to CPU #0 voluntarily. Just press a key on the console/emulated serial
line to trigger an event to QEMU's main loop, and the testcase should
continue.

Best regards
Alex
On 15 June 2015 at 21:03, Alex Zuepke <alexander.zuepke@hs-rm.de> wrote:
> Am 15.06.2015 um 20:58 schrieb Peter Maydell:
>> I'm beginning to suspect that your guest code has a race
>> condition in it, such that if the other CPU runs at a
>> point you weren't expecting it to then you end up
>> deadlocking or otherwise running into a bug in your guest.
>>
>> In particular, I see the emulation getting stuck even without
>> this patch to arm_cpu_set_irq().
>
> Yes, it's a bug, sorry for that. I removed too much code to get a
> simple testcase. It's stuck in the first spinlock where CPU#1 is
> waiting for CPU#0 to initialize the rest of the system, and I need
> to WFE or YIELD here as well.
>
> But this is showing the original problem again: the emulation gets
> stuck spinning on CPU #1 forever, because the main loop doesn't
> switch to CPU #0 voluntarily. Just press a key on the
> console/emulated serial line to trigger an event to QEMU's main
> loop, and the testcase should continue.

Pressing a key does not unwedge the test case for me.

-- PMM
On 16 June 2015 at 11:33, Peter Maydell <peter.maydell@linaro.org> wrote:
> Pressing a key does not unwedge the test case for me.
Looking at the logs, this seems to be expected given what
the guest code does with CPU #1: (the below is edited logs,
created with a hacky patch I have that annotates the debug
logs with CPU numbers):
CPU #1: Trace 0x7f2d67afa000 [80000100] _start
# we start
CPU #1: Trace 0x7f2d67afc060 [8000041c] main_cpu1
# we correctly figured out we're CPU 1
CPU #1: Trace 0x7f2d67afc220 [80000448] main_cpu1
# we took the branch to 80000448
CPU #1: Trace 0x7f2d67afc220 [80000448] main_cpu1
# 80000448 is a branch-to-self, so here we stay
CPU #1 never bothered to enable its GICC cpu interface,
so it will never receive interrupts and will never get
out of this tight loop.
We get here because CPU #1 has got through main_cpu1
to the point of testing your 'release' variable before
CPU #0 has got through main_cpu0 far enough to set it
to 1, so it still has the zero in it that it has on
system startup. If scheduling happened to mean that
CPU #0 ran further through main_cpu0 before CPU #1
ran, we wouldn't end up in this situation -- you have a
race condition, as I suggested.
The log shows we're sat with CPU#0 fruitlessly looping
on a variable in memory, and CPU#1 in this endless loop.
PS: QEMU doesn't care, but your binary seems to be entirely
devoid of barrier instructions, which is likely to cause
you problems on real hardware.
thanks
-- PMM
Hi Peter,

Am 16.06.2015 um 12:59 schrieb Peter Maydell:
> On 16 June 2015 at 11:33, Peter Maydell <peter.maydell@linaro.org> wrote:
>> Pressing a key does not unwedge the test case for me.
>
> Looking at the logs, this seems to be expected given what
> the guest code does with CPU #1: (the below is edited logs,
> created with a hacky patch I have that annotates the debug
> logs with CPU numbers):
>
> CPU #1: Trace 0x7f2d67afa000 [80000100] _start
> # we start
> CPU #1: Trace 0x7f2d67afc060 [8000041c] main_cpu1
> # we correctly figured out we're CPU 1
> CPU #1: Trace 0x7f2d67afc220 [80000448] main_cpu1
> # we took the branch to 80000448
> CPU #1: Trace 0x7f2d67afc220 [80000448] main_cpu1
> # 80000448 is a branch-to-self, so here we stay
>
> CPU #1 never bothered to enable its GICC cpu interface,
> so it will never receive interrupts and will never get
> out of this tight loop.

Yes. CPU#1 is stuck in the initial spinlock which lacks WFE.

> We get here because CPU #1 has got through main_cpu1
> to the point of testing your 'release' variable before
> CPU #0 has got through main_cpu0 far enough to set it
> to 1, so it still has the zero in it that it has on
> system startup. If scheduling happened to mean that
> CPU #0 ran further through main_cpu0 before CPU #1
> ran, we wouldn't end up in this situation -- you have a
> race condition, as I suggested.
>
> The log shows we're sat with CPU#0 fruitlessly looping
> on a variable in memory, and CPU#1 in this endless loop.

I know that the startup has a race because I removed too much code
from the original project. But the startup is not my problem, it's the
later parts.

I added the WFE to the initial lock. Here are two new tests, both are
now 3178 bytes in size:

http://www.cs.hs-rm.de/~zuepke/qemu/ipi.elf
http://www.cs.hs-rm.de/~zuepke/qemu/ipi_yield.elf

Both start on my machine. The IPI ping-pong starts after the first
timer interrupt after 1s. The problem is that IPIs are delivered only
once a second after the timer interrupts QEMU's main loop.

> PS: QEMU doesn't care, but your binary seems to be entirely
> devoid of barrier instructions, which is likely to cause
> you problems on real hardware.

Yes, I trimmed down my code to the bare minimum to handle IPIs on QEMU
only. It lacks barriers, cache handling and has bogus baudrate
settings.

Something else: Existing ARM CPUs so far do not use hyper-threading,
but have real physical cores. In contrast, QEMU behaves like an
extremely coarse-grained hyper-threaded architecture, so existing
legacy code that was written with physical cores in mind will trigger
timing bugs in synchronization primitives, especially code originally
written for ARM11 MPCore like mine, which lacks WFE/SEV. If we
consider QEMU as a platform to run legacy code, doesn't it make sense
to address these issues?

Best regards
Alex
On 16 June 2015 at 12:11, Alex Züpke <alexander.zuepke@hs-rm.de> wrote:
> But the startup is not my problem, it's the later parts.

But it was my problem because it meant your test case wasn't
functional :-)

> I added the WFE to the initial lock. Here are two new tests, both
> are now 3178 bytes in size:
>
> http://www.cs.hs-rm.de/~zuepke/qemu/ipi.elf
> http://www.cs.hs-rm.de/~zuepke/qemu/ipi_yield.elf
>
> Both start on my machine. The IPI ping-pong starts after the
> first timer interrupt after 1s. The problem is that IPIs are
> delivered only once a second after the timer interrupts QEMU's
> main loop.

Thanks. These test cases work for me, and I can repro the same
behaviour you see.

I intend to investigate why we're not at least timeslicing between the
two CPUs at a faster rate than "when there's another timer interrupt".

> Something else: Existing ARM CPUs so far do not use hyper-threading,
> but have real physical cores. In contrast, QEMU behaves like an
> extremely coarse-grained hyper-threaded architecture, so existing
> legacy code that was written with physical cores in mind will
> trigger timing bugs in synchronization primitives, especially code
> originally written for ARM11 MPCore like mine, which lacks WFE/SEV.
> If we consider QEMU as a platform to run legacy code, doesn't it
> make sense to address these issues?

In general QEMU's approach is more "run correct code reasonably fast"
rather than "run buggy code the same way the hardware would" or
"identify bugs in buggy code". There's certainly scope for heuristics
for making our timeslicing approach less obtrusive, but we need to
understand the underlying behaviour first (and check it doesn't
accidentally slow down other common workloads in the process). In
particular I think the 'do cpu_exit if one CPU triggers an interrupt
on another' approach is probably good, but I need to investigate why
it isn't working on your test programs without that extra 'level &&'
condition first...

thanks
-- PMM
Am 16.06.2015 um 13:53 schrieb Peter Maydell:
> On 16 June 2015 at 12:11, Alex Züpke <alexander.zuepke@hs-rm.de> wrote:
>> But the startup is not my problem, it's the later parts.
>
> But it was my problem because it meant your test case wasn't
> functional :-)
>
>> I added the WFE to the initial lock. Here are two new tests, both
>> are now 3178 bytes in size:
>>
>> http://www.cs.hs-rm.de/~zuepke/qemu/ipi.elf
>> http://www.cs.hs-rm.de/~zuepke/qemu/ipi_yield.elf
>>
>> Both start on my machine. The IPI ping-pong starts after the
>> first timer interrupt after 1s. The problem is that IPIs are
>> delivered only once a second after the timer interrupts QEMU's
>> main loop.
>
> Thanks. These test cases work for me, and I can repro the
> same behaviour you see.

OK, I'm glad that you can trigger my bug.

> I intend to investigate why we're not at least timeslicing
> between the two CPUs at a faster rate than "when there's
> another timer interrupt".

Probably there is no other way of time slicing in QEMU ... every OS
uses some kind of timer interrupt, so it's not necessary. And even
Linux's tickless kernel doesn't run into this issue because it uses
SEV/WFE properly.

>> Something else: Existing ARM CPUs so far do not use hyper-threading,
>> but have real physical cores. In contrast, QEMU behaves like an
>> extremely coarse-grained hyper-threaded architecture, so existing
>> legacy code that was written with physical cores in mind will
>> trigger timing bugs in synchronization primitives, especially code
>> originally written for ARM11 MPCore like mine, which lacks WFE/SEV.
>> If we consider QEMU as a platform to run legacy code, doesn't it
>> make sense to address these issues?
>
> In general QEMU's approach is more "run correct code reasonably
> fast" rather than "run buggy code the same way the hardware
> would" or "identify bugs in buggy code". There's certainly
> scope for heuristics for making our timeslicing approach less
> obtrusive, but we need to understand the underlying behaviour
> first (and check it doesn't accidentally slow down other
> common workloads in the process). In particular I think the
> 'do cpu_exit if one CPU triggers an interrupt on another'
> approach is probably good, but I need to investigate why
> it isn't working on your test programs without that extra
> 'level &&' condition first...

OK, thank you Peter!

Best regards
Alex
On 16 June 2015 at 12:53, Peter Maydell <peter.maydell@linaro.org> wrote:
> In particular I think the
> 'do cpu_exit if one CPU triggers an interrupt on another'
> approach is probably good, but I need to investigate why
> it isn't working on your test programs without that extra
> 'level &&' condition first...

I've figured out what's happening here, and it's an accidental
artefact of our GIC implementation. What happens is:

* cpu 0 does an IPI, which turns into "raise IRQ line on cpu 1"
* arm_cpu_set_irq logic causes us to cpu_exit() cpu 0
* cpu 1 does then run; however pretty early on it does a read
  on the GIC to acknowledge the interrupt
* this causes the function gic_update() to run, which recalculates
  the current state and sets CPU interrupt lines accordingly;
  among other things this results in an unnecessary but harmless
  call to arm_cpu_set_irq(CPU #0, irq, 0)
* without the "level && " clause in the conditional, that causes
  us to cpu_exit() cpu 1
* we then start running cpu 0 again, which is pointless, and
  since there's no further irq traffic we don't yield til 0
  reaches the end of its timeslice

So basically without the level check we do make 0 yield to 1 as it
should, but we then spuriously yield back to 0 again pretty much
immediately.

Next up: see if it gives us a perf improvement on Linux guests...

-- PMM
On 19/06/2015 17:53, Peter Maydell wrote:
> On 16 June 2015 at 12:53, Peter Maydell <peter.maydell@linaro.org> wrote:
>> In particular I think the
>> 'do cpu_exit if one CPU triggers an interrupt on another'
>> approach is probably good, but I need to investigate why
>> it isn't working on your test programs without that extra
>> 'level &&' condition first...
>
> I've figured out what's happening here, and it's an accidental
> artefact of our GIC implementation. What happens is:
>
> * cpu 0 does an IPI, which turns into "raise IRQ line on cpu 1"
> * arm_cpu_set_irq logic causes us to cpu_exit() cpu 0
> * cpu 1 does then run; however pretty early on it does a read
>   on the GIC to acknowledge the interrupt
> * this causes the function gic_update() to run, which recalculates
>   the current state and sets CPU interrupt lines accordingly;
>   among other things this results in an unnecessary but harmless
>   call to arm_cpu_set_irq(CPU #0, irq, 0)
> * without the "level && " clause in the conditional, that causes
>   us to cpu_exit() cpu 1
> * we then start running cpu 0 again, which is pointless, and
>   since there's no further irq traffic we don't yield til 0
>   reaches the end of its timeslice
>
> So basically without the level check we do make 0 yield to 1
> as it should, but we then spuriously yield back to 0 again
> pretty much immediately.
>
> Next up: see if it gives us a perf improvement on Linux guests...
>
> -- PMM

Hi,

Can you send me a complete diff? I might have reproduced the same bug
during MTTCG speed comparison:
http://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg05704.html

The normal boot with "-smp 4" and an SMP 4 guest is slow and becomes a
lot faster when I enable the window (which has timer callbacks and
refreshes the screen regularly).

Thanks,
Fred
On 23 June 2015 at 08:31, Frederic Konrad <fred.konrad@greensocs.com> wrote:
> Can you send me a complete diff?

The link I pointed you at in the other thread has the complete diff:
http://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg03824.html

-- PMM
On 23/06/2015 10:09, Peter Maydell wrote:
> On 23 June 2015 at 08:31, Frederic Konrad <fred.konrad@greensocs.com> wrote:
>> Can you send me a complete diff?
>
> The link I pointed you at in the other thread has the complete diff:
> http://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg03824.html
>
> -- PMM

Oops, sorry, I saw other pieces of the patch in the thread, that's why
I was not sure. It doesn't seem to fix the problem anyway.

Thanks,
Fred
On 23 June 2015 at 08:31, Frederic Konrad <fred.konrad@greensocs.com> wrote:
> The normal boot with "-smp 4" and an SMP 4 guest is slow and becomes
> a lot faster when I enable the window (which has timer callbacks and
> refreshes the screen regularly)

Is it just overall slow, or does it appear to hang? I have an
interesting effect where *with* this patch an -smp 3 or 4 guest boot
seems to hang between "SCSI subsystem initialized" and "Switched to
clocksource arch_sys_counter"...

Weirdly, you can make it recover from that hang if there's a GTK
window and you wave the mouse over it, even if that window is only
showing the QEMU monitor, not a guest graphics window.

thanks
-- PMM
On 23 June 2015 at 19:15, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 23 June 2015 at 08:31, Frederic Konrad <fred.konrad@greensocs.com> wrote:
>> The normal boot with "-smp 4" and an SMP 4 guest is slow and becomes
>> a lot faster when I enable the window (which has timer callbacks and
>> refreshes the screen regularly)
>
> Is it just overall slow, or does it appear to hang? I have an
> interesting effect where *with* this patch an -smp 3 or 4 guest
> boot seems to hang between "SCSI subsystem initialized" and
> "Switched to clocksource arch_sys_counter"...

At least part of what is happening here seems to be that we're falling
into a similar flavour of stall to the original test case: in the SMP
3 setup, CPU #2 is in the busy-loop of Linux's multi_cpu_stop()
function, and it can sit in that loop for an entire second. We should
never let a single CPU grab execution for that long when doing TCG
round-robin...

-- PMM
--- a/target-arm/cpu.c
+++ b/target-arm/cpu.c
@@ -325,6 +325,18 @@ static void arm_cpu_set_irq(void *opaque, int irq, int level)
     default:
         hw_error("arm_cpu_set_irq: Bad interrupt line %d\n", irq);
     }
+
+    /* If we are currently executing code for CPU X, and this
+     * CPU we've just triggered an interrupt on is CPU Y, then
+     * make CPU X yield control back to the main loop at the
+     * end of the TB it's currently executing.
+     * This avoids problems where the interrupt was an IPI
+     * and CPU X would otherwise sit busy looping for the rest
+     * of its timeslice because Y hasn't had a chance to run.
+     */
+    if (level && current_cpu && current_cpu != cs) {
+        cpu_exit(current_cpu);
+    }
 }
 
 static void arm_cpu_kvm_set_irq(void *opaque, int irq, int level)