
[1/5] migration: Fix possible deadloop of ram save process

Message ID 20220920223800.47467-2-peterx@redhat.com
State New
Series migration: Bug fixes (prepare for preempt-full)

Commit Message

Peter Xu Sept. 20, 2022, 10:37 p.m. UTC
When starting the ram saving procedure (especially at the completion phase),
always set last_seen_block to a non-NULL value to make sure we can always
correctly detect the case where "we've migrated all the dirty pages".

Then we'll guarantee both last_seen_block and pss.block will always be valid
before the loop starts.

See the comment in the code for some details.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/ram.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)
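
For context, the exit condition that this patch is guarding lives in
find_dirty_block().  A rough sketch of that check (paraphrased from
migration/ram.c of this period; names follow upstream but this is not the
exact code) looks like:

    static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool *again)
    {
        /* Find the next dirty page in the current ramblock */
        pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);

        if (pss->complete_round && pss->block == rs->last_seen_block &&
            pss->page >= rs->last_page) {
            /* Went a full round back to where we started: nothing left to send */
            *again = false;
            return false;
        }

        /* ... otherwise advance to the next page/ramblock and retry ... */
        *again = true;
        return false;
    }

If rs->last_seen_block is NULL, the pss->block == rs->last_seen_block
comparison can never become true (pss.block always points at a real
ramblock), so the "nothing left to send" exit is never taken and the caller
can keep looping even when there are no dirty pages left.  Initializing
last_seen_block up front, as this patch does, closes that hole.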

Comments

Dr. David Alan Gilbert Sept. 22, 2022, 2:49 p.m. UTC | #1
* Peter Xu (peterx@redhat.com) wrote:
> When starting ram saving procedure (especially at the completion phase),
> always set last_seen_block to non-NULL to make sure we can always correctly
> detect the case where "we've migrated all the dirty pages".
> 
> Then we'll guarantee both last_seen_block and pss.block will be valid
> always before the loop starts.
> 
> See the comment in the code for some details.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Yeh I guess it can currently only happen during restart?

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/ram.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index dc1de9ddbc..1d42414ecc 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2546,14 +2546,22 @@ static int ram_find_and_save_block(RAMState *rs)
>          return pages;
>      }
>  
> +    /*
> +     * Always keep last_seen_block/last_page valid during this procedure,
> +     * because find_dirty_block() relies on these values (e.g., we compare
> +     * last_seen_block with pss.block to see whether we searched all the
> +     * ramblocks) to detect the completion of migration.  Having NULL value
> +     * of last_seen_block can conditionally cause below loop to run forever.
> +     */
> +    if (!rs->last_seen_block) {
> +        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
> +        rs->last_page = 0;
> +    }
> +
>      pss.block = rs->last_seen_block;
>      pss.page = rs->last_page;
>      pss.complete_round = false;
>  
> -    if (!pss.block) {
> -        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
> -    }
> -
>      do {
>          again = true;
>          found = get_queued_page(rs, &pss);
> -- 
> 2.32.0
>
Peter Xu Sept. 22, 2022, 3:25 p.m. UTC | #2
On Thu, Sep 22, 2022 at 03:49:38PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > When starting ram saving procedure (especially at the completion phase),
> > always set last_seen_block to non-NULL to make sure we can always correctly
> > detect the case where "we've migrated all the dirty pages".
> > 
> > Then we'll guarantee both last_seen_block and pss.block will be valid
> > always before the loop starts.
> > 
> > See the comment in the code for some details.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Yeh I guess it can currently only happen during restart?

There're only two places to clear last_seen_block:

ram_state_reset[2683]          rs->last_seen_block = NULL;
ram_postcopy_send_discard_bitmap[2876] rs->last_seen_block = NULL;

And for the reset case, ram_state_reset() is called from:

ram_state_init[2994]           ram_state_reset(*rsp);
ram_state_resume_prepare[3110] ram_state_reset(rs);
ram_save_iterate[3271]         ram_state_reset(rs);

So I think it can happen in at least two places: either (1) when postcopy has
just started (assuming postcopy happens to start right when all dirty pages
have already been migrated?), or (2) when postcopy recovers from a failure.
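
To make the failure mode concrete, the search loop in
ram_find_and_save_block() is roughly the following (paraphrased, not the
exact code):

    do {
        again = true;
        found = get_queued_page(rs, &pss);

        if (!found) {
            /* No queued (postcopy-requested) page, search for a dirty one */
            found = find_dirty_block(rs, &pss, &again);
        }

        if (found) {
            pages = ram_save_host_page(rs, &pss);
        }
    } while (!found && again);

With last_seen_block == NULL and no dirty pages left, find_dirty_block()
keeps returning found == false with again == true (its "searched the whole
round" exit never fires), so this loop spins forever.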

In my case I triggered this deadloop while debugging the other bug fixed by
the next patch, which was postcopy recovery (on tls), but only once.  So
currently I'm still not 100% sure whether this is the same problem, but
logically it could trigger.

I also remember hitting some very rare deadloops before too; maybe they're
the same thing, since I did test recovery a lot.

> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Thanks!
Dr. David Alan Gilbert Sept. 22, 2022, 4:41 p.m. UTC | #3
* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Sep 22, 2022 at 03:49:38PM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > When starting ram saving procedure (especially at the completion phase),
> > > always set last_seen_block to non-NULL to make sure we can always correctly
> > > detect the case where "we've migrated all the dirty pages".
> > > 
> > > Then we'll guarantee both last_seen_block and pss.block will be valid
> > > always before the loop starts.
> > > 
> > > See the comment in the code for some details.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > Yeh I guess it can currently only happen during restart?
> 
> There're only two places to clear last_seen_block:
> 
> ram_state_reset[2683]          rs->last_seen_block = NULL;
> ram_postcopy_send_discard_bitmap[2876] rs->last_seen_block = NULL;
> 
> Where for the reset case:
> 
> ram_state_init[2994]           ram_state_reset(*rsp);
> ram_state_resume_prepare[3110] ram_state_reset(rs);
> ram_save_iterate[3271]         ram_state_reset(rs);
> 
> So I think it can at least happen in two places, either (1) postcopy just
> started (assume when postcopy starts accidentally when all dirty pages were
> migrated?), or (2) postcopy recover from failure.

Oh, (1) is a more general problem then; yeh.

> In my case I triggered this deadloop when I was debugging the other bug
> fixed by the next patch where it was postcopy recovery (on tls), but only
> once..  So currently I'm still not 100% sure whether this is the same
> problem, but logically it could trigger.
> 
> I also remember I used to hit very rare deadloops before too, maybe they're
> the same thing because I did test recovery a lot.

Note: 'deadlock', not 'deadloop'.

Dave

> > 
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Thanks!
> 
> -- 
> Peter Xu
>
Peter Xu Oct. 4, 2022, 2:25 p.m. UTC | #4
On Thu, Sep 22, 2022 at 05:41:30PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Thu, Sep 22, 2022 at 03:49:38PM +0100, Dr. David Alan Gilbert wrote:
> > > * Peter Xu (peterx@redhat.com) wrote:
> > > > When starting ram saving procedure (especially at the completion phase),
> > > > always set last_seen_block to non-NULL to make sure we can always correctly
> > > > detect the case where "we've migrated all the dirty pages".
> > > > 
> > > > Then we'll guarantee both last_seen_block and pss.block will be valid
> > > > always before the loop starts.
> > > > 
> > > > See the comment in the code for some details.
> > > > 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > 
> > > Yeh I guess it can currently only happen during restart?
> > 
> > There're only two places to clear last_seen_block:
> > 
> > ram_state_reset[2683]          rs->last_seen_block = NULL;
> > ram_postcopy_send_discard_bitmap[2876] rs->last_seen_block = NULL;
> > 
> > Where for the reset case:
> > 
> > ram_state_init[2994]           ram_state_reset(*rsp);
> > ram_state_resume_prepare[3110] ram_state_reset(rs);
> > ram_save_iterate[3271]         ram_state_reset(rs);
> > 
> > So I think it can at least happen in two places, either (1) postcopy just
> > started (assume when postcopy starts accidentally when all dirty pages were
> > migrated?), or (2) postcopy recover from failure.
> 
> Oh, (1) is a more general problem then; yeh.
> 
> > In my case I triggered this deadloop when I was debugging the other bug
> > fixed by the next patch where it was postcopy recovery (on tls), but only
> > once..  So currently I'm still not 100% sure whether this is the same
> > problem, but logically it could trigger.
> > 
> > I also remember I used to hit very rare deadloops before too, maybe they're
> > the same thing because I did test recovery a lot.
> 
> Note; 'deadlock' not 'deadloop'.

(Oops I somehow forgot there's still this series pending..)

Here it's not about a lock, or maybe I should add a space ("dead loop")?
Dr. David Alan Gilbert Oct. 4, 2022, 3:02 p.m. UTC | #5
* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Sep 22, 2022 at 05:41:30PM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Thu, Sep 22, 2022 at 03:49:38PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Peter Xu (peterx@redhat.com) wrote:
> > > > > When starting ram saving procedure (especially at the completion phase),
> > > > > always set last_seen_block to non-NULL to make sure we can always correctly
> > > > > detect the case where "we've migrated all the dirty pages".
> > > > > 
> > > > > Then we'll guarantee both last_seen_block and pss.block will be valid
> > > > > always before the loop starts.
> > > > > 
> > > > > See the comment in the code for some details.
> > > > > 
> > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > 
> > > > Yeh I guess it can currently only happen during restart?
> > > 
> > > There're only two places to clear last_seen_block:
> > > 
> > > ram_state_reset[2683]          rs->last_seen_block = NULL;
> > > ram_postcopy_send_discard_bitmap[2876] rs->last_seen_block = NULL;
> > > 
> > > Where for the reset case:
> > > 
> > > ram_state_init[2994]           ram_state_reset(*rsp);
> > > ram_state_resume_prepare[3110] ram_state_reset(rs);
> > > ram_save_iterate[3271]         ram_state_reset(rs);
> > > 
> > > So I think it can at least happen in two places, either (1) postcopy just
> > > started (assume when postcopy starts accidentally when all dirty pages were
> > > migrated?), or (2) postcopy recover from failure.
> > 
> > Oh, (1) is a more general problem then; yeh.
> > 
> > > In my case I triggered this deadloop when I was debugging the other bug
> > > fixed by the next patch where it was postcopy recovery (on tls), but only
> > > once..  So currently I'm still not 100% sure whether this is the same
> > > problem, but logically it could trigger.
> > > 
> > > I also remember I used to hit very rare deadloops before too, maybe they're
> > > the same thing because I did test recovery a lot.
> > 
> > Note; 'deadlock' not 'deadloop'.
> 
> (Oops I somehow forgot there's still this series pending..)
> 
> Here it's not about a lock, or maybe I should add a space ("dead loop")?

So the normal phrases I'm used to are:
  'deadlock' - two threads waiting for each other
  'livelock' - two threads spinning for each other

Dave

> -- 
> Peter Xu
>

Patch

diff --git a/migration/ram.c b/migration/ram.c
index dc1de9ddbc..1d42414ecc 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2546,14 +2546,22 @@  static int ram_find_and_save_block(RAMState *rs)
         return pages;
     }
 
+    /*
+     * Always keep last_seen_block/last_page valid during this procedure,
+     * because find_dirty_block() relies on these values (e.g., we compare
+     * last_seen_block with pss.block to see whether we searched all the
+     * ramblocks) to detect the completion of migration.  Having NULL value
+     * of last_seen_block can conditionally cause below loop to run forever.
+     */
+    if (!rs->last_seen_block) {
+        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
+        rs->last_page = 0;
+    }
+
     pss.block = rs->last_seen_block;
     pss.page = rs->last_page;
     pss.complete_round = false;
 
-    if (!pss.block) {
-        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
-    }
-
     do {
         again = true;
         found = get_queued_page(rs, &pss);