Message ID | 20181017090750.4378-1-pbonzini@redhat.com |
---|---|
State | New |
Headers | show |
Series | Revert "icount: remove obsolete warp call" | expand |
> From: Paolo Bonzini [mailto:pbonzini@redhat.com] > QEMU running with options "-icount,sleep=off -rtc clock=vm" doesn't > execute emulation at maximum possible speed. > Target virtual clock may run faster or slower than realtime clock by N > times, where N value depends on various unrelated conditions (i.e. random > from the user point of view), or possibly the target hangs completely. > Bisection shows commit 281b2201e4 ("icount: remove obsolete warp call", > 2016-03-15) to be the culprit, revert it. > > Suggested-by: Artem Pisarenko <artem.k.pisarenko@gmail.com> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > --- > cpus.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/cpus.c b/cpus.c > index 361678e459..2872e7e37c 100644 > --- a/cpus.c > +++ b/cpus.c > @@ -1224,6 +1224,10 @@ static void qemu_tcg_rr_wait_io_event(CPUState *cpu) > { > while (all_cpu_threads_idle()) { > stop_tcg_kick_timer(); > + > + /* Start accounting real time to the virtual clock if the CPUs > + are idle. */ > + qemu_clock_warp(QEMU_CLOCK_VIRTUAL); Can't apply, because this function does not exist anymore. > qemu_cond_wait(cpu->halt_cond, &qemu_global_mutex); > } > > -- > 2.17.1 Pavel Dovgalyuk
See my last comment in bug report. This kind of modification, even adapted to changed function name, doesn't solve issue. I thought long time that it does, but once I catched qemu with a hang. And of course, I wasn't able to reproduce it. So it just better hides issue. Take a look at alternative solution from QBox: https://git.greensocs.com/qemu/qbox/commit/a8ed106032e375e715a531d6e93e4d9ec295dbdb I didn't catched fail with it (yet).
On 17/10/2018 11:53, Artem Pisarenko wrote: > See my last comment in bug report. This kind of modification, even > adapted to changed function name, doesn't solve issue. > I thought long time that it does, but once I catched qemu with a hang. > And of course, I wasn't able to reproduce it. So it just better hides issue. > Take a look at alternative solution from > QBox: https://git.greensocs.com/qemu/qbox/commit/a8ed106032e375e715a531d6e93e4d9ec295dbdb > I didn't catched fail with it (yet). Makes sense (and the patch I submitted was stupid). Clement, can you submit the patch from qbox with Signed-off-by? Thanks, Paolo
> From: Paolo Bonzini [mailto:pbonzini@redhat.com] > On 17/10/2018 11:53, Artem Pisarenko wrote: > > See my last comment in bug report. This kind of modification, even > > adapted to changed function name, doesn't solve issue. > > I thought long time that it does, but once I catched qemu with a hang. > > And of course, I wasn't able to reproduce it. So it just better hides issue. > > Take a look at alternative solution from > > QBox: https://git.greensocs.com/qemu/qbox/commit/a8ed106032e375e715a531d6e93e4d9ec295dbdb > > I didn't catched fail with it (yet). Tried to test it, but rr seems to be broken again. I'll try to bisect now. Pavel Dovgalyuk
On 17/10/2018 13:38, Pavel Dovgalyuk wrote: >> From: Paolo Bonzini [mailto:pbonzini@redhat.com] >> On 17/10/2018 11:53, Artem Pisarenko wrote: >>> See my last comment in bug report. This kind of modification, even >>> adapted to changed function name, doesn't solve issue. >>> I thought long time that it does, but once I catched qemu with a hang. >>> And of course, I wasn't able to reproduce it. So it just better hides issue. >>> Take a look at alternative solution from >>> QBox: https://git.greensocs.com/qemu/qbox/commit/a8ed106032e375e715a531d6e93e4d9ec295dbdb >>> I didn't catched fail with it (yet). > > Tried to test it, but rr seems to be broken again. > I'll try to bisect now. Can we add a test that runs with "make check" and covers the basics of record/replay's cpus.c bits? rr is very cool, and we fixed/understood a lot of stuff when getting it ready for inclusion. But now it's constantly broken and every time we change rr we also risk breaking icount. Paolo
> From: Paolo Bonzini [mailto:pbonzini@redhat.com] > On 17/10/2018 13:38, Pavel Dovgalyuk wrote: > >> From: Paolo Bonzini [mailto:pbonzini@redhat.com] > >> On 17/10/2018 11:53, Artem Pisarenko wrote: > >>> See my last comment in bug report. This kind of modification, even > >>> adapted to changed function name, doesn't solve issue. > >>> I thought long time that it does, but once I catched qemu with a hang. > >>> And of course, I wasn't able to reproduce it. So it just better hides issue. > >>> Take a look at alternative solution from > >>> QBox: https://git.greensocs.com/qemu/qbox/commit/a8ed106032e375e715a531d6e93e4d9ec295dbdb > >>> I didn't catched fail with it (yet). > > > > Tried to test it, but rr seems to be broken again. > > I'll try to bisect now. > > Can we add a test that runs with "make check" and covers the basics of > record/replay's cpus.c bits? I thought Alex is trying to create some tests. Alex, am I right? Pavel Dovgalyuk
> From: Paolo Bonzini [mailto:pbonzini@redhat.com] > On 17/10/2018 13:38, Pavel Dovgalyuk wrote: > >> From: Paolo Bonzini [mailto:pbonzini@redhat.com] > >> On 17/10/2018 11:53, Artem Pisarenko wrote: > >>> See my last comment in bug report. This kind of modification, even > >>> adapted to changed function name, doesn't solve issue. > >>> I thought long time that it does, but once I catched qemu with a hang. > >>> And of course, I wasn't able to reproduce it. So it just better hides issue. > >>> Take a look at alternative solution from > >>> QBox: https://git.greensocs.com/qemu/qbox/commit/a8ed106032e375e715a531d6e93e4d9ec295dbdb > >>> I didn't catched fail with it (yet). > > > > Tried to test it, but rr seems to be broken again. > > I'll try to bisect now. > > Can we add a test that runs with "make check" and covers the basics of > record/replay's cpus.c bits? > > rr is very cool, and we fixed/understood a lot of stuff when getting it > ready for inclusion. But now it's constantly broken and every time we > change rr we also risk breaking icount. In addition to some rr bug I encountered the following (non-stable) error message: ERROR:/qemu/work/qemu/accel/tcg/translate-all.c:1346:tb_page_remove: code should not be reached Pavel Dovgalyuk
Pavel Dovgalyuk <dovgaluk@ispras.ru> writes: >> From: Paolo Bonzini [mailto:pbonzini@redhat.com] >> On 17/10/2018 13:38, Pavel Dovgalyuk wrote: >> >> From: Paolo Bonzini [mailto:pbonzini@redhat.com] >> >> On 17/10/2018 11:53, Artem Pisarenko wrote: >> >>> See my last comment in bug report. This kind of modification, even >> >>> adapted to changed function name, doesn't solve issue. >> >>> I thought long time that it does, but once I catched qemu with a hang. >> >>> And of course, I wasn't able to reproduce it. So it just better hides issue. >> >>> Take a look at alternative solution from >> >>> QBox: https://git.greensocs.com/qemu/qbox/commit/a8ed106032e375e715a531d6e93e4d9ec295dbdb >> >>> I didn't catched fail with it (yet). >> > >> > Tried to test it, but rr seems to be broken again. >> > I'll try to bisect now. >> >> Can we add a test that runs with "make check" and covers the basics of >> record/replay's cpus.c bits? > > I thought Alex is trying to create some tests. > Alex, am I right? Not actively at the moment but we need to get there. As rr only works with system emulation we need to update the check-tcg machinery to be able to build some system binaries. A number of the tests/tcg targets already have some system binaries waiting to be ported to the new make system. The next question is what we need to exercise rr? The migration tests have added some very simple ping-pong type "boot sectors" so perhaps putting the source for those into check-tcg we can build them and use it for rr testing? -- Alex Bennée
> From: Paolo Bonzini [mailto:pbonzini@redhat.com] > On 17/10/2018 13:38, Pavel Dovgalyuk wrote: > >> From: Paolo Bonzini [mailto:pbonzini@redhat.com] > >> On 17/10/2018 11:53, Artem Pisarenko wrote: > >>> See my last comment in bug report. This kind of modification, even > >>> adapted to changed function name, doesn't solve issue. > >>> I thought long time that it does, but once I catched qemu with a hang. > >>> And of course, I wasn't able to reproduce it. So it just better hides issue. > >>> Take a look at alternative solution from > >>> QBox: https://git.greensocs.com/qemu/qbox/commit/a8ed106032e375e715a531d6e93e4d9ec295dbdb > >>> I didn't catched fail with it (yet). > > > > Tried to test it, but rr seems to be broken again. > > I'll try to bisect now. > > Can we add a test that runs with "make check" and covers the basics of > record/replay's cpus.c bits? > > rr is very cool, and we fixed/understood a lot of stuff when getting it > ready for inclusion. But now it's constantly broken and every time we > change rr we also risk breaking icount. I found the source of the bug. As QEMU becomes more multi-threaded and non-synchronized, checkpoints move from thread to thread. And the event queue that processed at checkpoints should belong to the same thread in both record and replay executions. Current problem was with the checkpoint for virtual timers. They are processed from different threads: from vCPU and from aio_dispatch function. Therefore the following patch fixes the problem, but I think that this part has to be refactored. There should be nailed-to-thread events that process the event queue. Then checkpoints can become just synchronization events and therefore omitted for empty timer lists, for example. diff --git a/replay/replay-events.c b/replay/replay-events.c index 83a7d81..60e8c21 100644 --- a/replay/replay-events.c +++ b/replay/replay-events.c @@ -205,6 +205,7 @@ void replay_save_events(int checkpoint) { g_assert(replay_mutex_locked()); g_assert(checkpoint != CHECKPOINT_CLOCK_WARP_START); + g_assert(checkpoint != CHECKPOINT_CLOCK_VIRTUAL); while (!QTAILQ_EMPTY(&events_list)) { Event *event = QTAILQ_FIRST(&events_list); replay_save_event(event, checkpoint); diff --git a/replay/replay.c b/replay/replay.c index 93d2573..79b8485 100644 --- a/replay/replay.c +++ b/replay/replay.c @@ -231,7 +231,8 @@ bool replay_checkpoint(ReplayCheckpoint checkpoint) /* This checkpoint belongs to several threads. Processing events from different threads is non-deterministic */ - if (checkpoint != CHECKPOINT_CLOCK_WARP_START) { + if (checkpoint != CHECKPOINT_CLOCK_WARP_START + && checkpoint != CHECKPOINT_CLOCK_VIRTUAL) { replay_save_events(checkpoint); } res = true; Pavel Dovgalyuk
> I found the source of the bug. As QEMU becomes more multi-threaded and non-> synchronized, > checkpoints move from thread to thread. > And the event queue that processed at checkpoints should belong to the same thread > in both record and replay executions. > > Current problem was with the checkpoint for virtual timers. They are processed from different threads: > from vCPU and from aio_dispatch function. What bug you're talking about ? This patch thread started with intention to fix generic bug with icount, which is being encountered in non-rr mode. As far as I understood your words (and patch content) it will not fix that generic bug. Anyway, it's a good news!
On 17/10/2018 15:20, Pavel Dovgalyuk wrote: > I found the source of the bug. As QEMU becomes more multi-threaded > and non-synchronized, checkpoints move from thread to thread. And the > event queue that processed at checkpoints should belong to the same > thread in both record and replay executions. > > Current problem was with the checkpoint for virtual timers. They are > processed from different threads: from vCPU and from aio_dispatch > function. > > Therefore the following patch fixes the problem, but I think that > this part has to be refactored. There should be nailed-to-thread > events that process the event queue. Then checkpoints can become just > synchronization events and therefore omitted for empty timer lists, > for example. Can you add a FIXME comment and submit this as a full patch? Paolo
Hi all, I'm traveling right now, but as soon as I get home, I'll send my patch with the signed-off. Best, Clément On 10/17/18 7:43 AM, Paolo Bonzini wrote: > On 17/10/2018 15:20, Pavel Dovgalyuk wrote: >> I found the source of the bug. As QEMU becomes more multi-threaded >> and non-synchronized, checkpoints move from thread to thread. And the >> event queue that processed at checkpoints should belong to the same >> thread in both record and replay executions. >> >> Current problem was with the checkpoint for virtual timers. They are >> processed from different threads: from vCPU and from aio_dispatch >> function. >> >> Therefore the following patch fixes the problem, but I think that >> this part has to be refactored. There should be nailed-to-thread >> events that process the event queue. Then checkpoints can become just >> synchronization events and therefore omitted for empty timer lists, >> for example. > Can you add a FIXME comment and submit this as a full patch? > > Paolo
I made a fix to ensure that QBox patch doesn’t break the replay. Now I tried it on couple of tests and didn’t see any regression. Pavel Dovgalyuk From: Artem Pisarenko [mailto:artem.k.pisarenko@gmail.com] Sent: Wednesday, October 17, 2018 4:32 PM To: Pavel Dovgalyuk Cc: Paolo Bonzini; Clement Deschamps; qemu-devel@nongnu.org; Pavel Dovgalyuk Subject: Re: [PATCH] Revert "icount: remove obsolete warp call" > I found the source of the bug. As QEMU becomes more multi-threaded and non-> synchronized, > checkpoints move from thread to thread. > And the event queue that processed at checkpoints should belong to the same thread > in both record and replay executions. > > Current problem was with the checkpoint for virtual timers. They are processed from different threads: > from vCPU and from aio_dispatch function. What bug you're talking about ? This patch thread started with intention to fix generic bug with icount, which is being encountered in non-rr mode. As far as I understood your words (and patch content) it will not fix that generic bug. Anyway, it's a good news!
diff --git a/cpus.c b/cpus.c index 361678e459..2872e7e37c 100644 --- a/cpus.c +++ b/cpus.c @@ -1224,6 +1224,10 @@ static void qemu_tcg_rr_wait_io_event(CPUState *cpu) { while (all_cpu_threads_idle()) { stop_tcg_kick_timer(); + + /* Start accounting real time to the virtual clock if the CPUs + are idle. */ + qemu_clock_warp(QEMU_CLOCK_VIRTUAL); qemu_cond_wait(cpu->halt_cond, &qemu_global_mutex); }
QEMU running with options "-icount,sleep=off -rtc clock=vm" doesn't execute emulation at maximum possible speed. Target virtual clock may run faster or slower than realtime clock by N times, where N value depends on various unrelated conditions (i.e. random from the user point of view), or possibly the target hangs completely. Bisection shows commit 281b2201e4 ("icount: remove obsolete warp call", 2016-03-15) to be the culprit, revert it. Suggested-by: Artem Pisarenko <artem.k.pisarenko@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- cpus.c | 4 ++++ 1 file changed, 4 insertions(+)