Patchwork Block Migration Assertion in qemu-kvm 1.2.0

login
register
mail settings
Submitter Kevin Wolf
Date Sept. 18, 2012, 10:31 a.m.
Message ID <50584D84.2080802@redhat.com>
Download mbox | patch
Permalink /patch/184672/
State New
Headers show

Comments

Kevin Wolf - Sept. 18, 2012, 10:31 a.m.
Am 18.09.2012 12:28, schrieb Peter Lieven:
> On 09/17/12 22:12, Peter Lieven wrote:
>> On 09/17/12 10:41, Kevin Wolf wrote:
>>> Am 16.09.2012 12:13, schrieb Peter Lieven:
>>>> Hi,
>>>>
>>>> when trying to block migrate a VM from one node to another, the source
>>>> VM crashed with the following assertion:
>>>> block.c:3829: bdrv_set_in_use: Assertion `bs->in_use != in_use' failed.
>>>>
>>>> Is this sth already addresses/known?
>>> Not that I'm aware of, at least.
>>>
>>> Block migration doesn't seem to check whether the device is already in
>>> use, maybe this is the problem. Not sure why it would be in use, though,
>>> and in my quick test it didn't crash.
>>>
>>> So we need some more information: What's you command line, did you do
>>> anything specific in the monitor with block devices, what does the
>>> stacktrace look like, etc.?
>> kevin, it seems that i can very easily force a crash if I cancel a 
>> running block migration.
> if I understand correctly what happens there are aio callbacks coming in 
> after
> blk_mig_cleanup() has been called.
> 
> what is the proper way to detect this in blk_mig_read_cb()?

You could try this, it doesn't detect the situation in
blk_mig_read_cb(), but ensures that all callbacks happen before we do
the actual cleanup (completely untested):
Peter Lieven - Sept. 19, 2012, 5:49 a.m.
On 09/18/12 12:31, Kevin Wolf wrote:
> Am 18.09.2012 12:28, schrieb Peter Lieven:
>> On 09/17/12 22:12, Peter Lieven wrote:
>>> On 09/17/12 10:41, Kevin Wolf wrote:
>>>> Am 16.09.2012 12:13, schrieb Peter Lieven:
>>>>> Hi,
>>>>>
>>>>> when trying to block migrate a VM from one node to another, the source
>>>>> VM crashed with the following assertion:
>>>>> block.c:3829: bdrv_set_in_use: Assertion `bs->in_use != in_use' failed.
>>>>>
>>>>> Is this sth already addresses/known?
>>>> Not that I'm aware of, at least.
>>>>
>>>> Block migration doesn't seem to check whether the device is already in
>>>> use, maybe this is the problem. Not sure why it would be in use, though,
>>>> and in my quick test it didn't crash.
>>>>
>>>> So we need some more information: What's you command line, did you do
>>>> anything specific in the monitor with block devices, what does the
>>>> stacktrace look like, etc.?
>>> kevin, it seems that i can very easily force a crash if I cancel a
>>> running block migration.
>> if I understand correctly what happens there are aio callbacks coming in
>> after
>> blk_mig_cleanup() has been called.
>>
>> what is the proper way to detect this in blk_mig_read_cb()?
> You could try this, it doesn't detect the situation in
> blk_mig_read_cb(), but ensures that all callbacks happen before we do
> the actual cleanup (completely untested):
after testing it for half an hour i can say, it seems to fix the problem.
no segfaults and also no other assertions.

while searching I have seen that the queses blk_list and bmds_list are 
initialized at
qemu startup. wouldn't it be better to initialize them at init_blk_migration
or at least check that they are really empty? i have also seen that 
prev_time_offset
is not initialized.

thank you,
peter

sth like this:

--- qemu-kvm-1.2.0/block-migration.c.orig    2012-09-17 
21:14:44.458429855 +0200
+++ qemu-kvm-1.2.0/block-migration.c    2012-09-17 21:15:40.599736962 +0200
@@ -311,8 +311,12 @@ static void init_blk_migration(QEMUFile
      block_mig_state.prev_progress = -1;
      block_mig_state.bulk_completed = 0;
      block_mig_state.total_time = 0;
+    block_mig_state.prev_time_offset = 0;
      block_mig_state.reads = 0;

+    QSIMPLEQ_INIT(&block_mig_state.bmds_list);
+    QSIMPLEQ_INIT(&block_mig_state.blk_list);
+
      bdrv_iterate(init_blk_migration_it, NULL);
  }

@@ -760,9 +764,6 @@ SaveVMHandlers savevm_block_handlers = {

  void blk_mig_init(void)
  {
-    QSIMPLEQ_INIT(&block_mig_state.bmds_list);
-    QSIMPLEQ_INIT(&block_mig_state.blk_list);
-
      register_savevm_live(NULL, "block", 0, 1, &savevm_block_handlers,
&block_mig_state);
  }

> diff --git a/block-migration.c b/block-migration.c
> index 7def8ab..ed93301 100644
> --- a/block-migration.c
> +++ b/block-migration.c
> @@ -519,6 +519,8 @@ static void blk_mig_cleanup(void)
>       BlkMigDevState *bmds;
>       BlkMigBlock *blk;
>
> +    bdrv_drain_all();
> +
>       set_dirty_tracking(0);
>
>       while ((bmds = QSIMPLEQ_FIRST(&block_mig_state.bmds_list)) != NULL) {
Kevin Wolf - Sept. 25, 2012, 2 p.m.
Am 19.09.2012 07:49, schrieb Peter Lieven:
> On 09/18/12 12:31, Kevin Wolf wrote:
>> Am 18.09.2012 12:28, schrieb Peter Lieven:
>>> On 09/17/12 22:12, Peter Lieven wrote:
>>>> On 09/17/12 10:41, Kevin Wolf wrote:
>>>>> Am 16.09.2012 12:13, schrieb Peter Lieven:
>>>>>> Hi,
>>>>>>
>>>>>> when trying to block migrate a VM from one node to another, the source
>>>>>> VM crashed with the following assertion:
>>>>>> block.c:3829: bdrv_set_in_use: Assertion `bs->in_use != in_use' failed.
>>>>>>
>>>>>> Is this sth already addresses/known?
>>>>> Not that I'm aware of, at least.
>>>>>
>>>>> Block migration doesn't seem to check whether the device is already in
>>>>> use, maybe this is the problem. Not sure why it would be in use, though,
>>>>> and in my quick test it didn't crash.
>>>>>
>>>>> So we need some more information: What's you command line, did you do
>>>>> anything specific in the monitor with block devices, what does the
>>>>> stacktrace look like, etc.?
>>>> kevin, it seems that i can very easily force a crash if I cancel a
>>>> running block migration.
>>> if I understand correctly what happens there are aio callbacks coming in
>>> after
>>> blk_mig_cleanup() has been called.
>>>
>>> what is the proper way to detect this in blk_mig_read_cb()?
>> You could try this, it doesn't detect the situation in
>> blk_mig_read_cb(), but ensures that all callbacks happen before we do
>> the actual cleanup (completely untested):
> after testing it for half an hour i can say, it seems to fix the problem.
> no segfaults and also no other assertions.
> 
> while searching I have seen that the queses blk_list and bmds_list are 
> initialized at
> qemu startup. wouldn't it be better to initialize them at init_blk_migration
> or at least check that they are really empty? i have also seen that 
> prev_time_offset
> is not initialized.

Probably. If you sent this as a proper patch with a SoB, I wouldn't
reject it, but considering that block migration is deprecated anyway, I
won't bother myself as long as there's no real bug.

Kevin

Patch

diff --git a/block-migration.c b/block-migration.c
index 7def8ab..ed93301 100644
--- a/block-migration.c
+++ b/block-migration.c
@@ -519,6 +519,8 @@  static void blk_mig_cleanup(void)
     BlkMigDevState *bmds;
     BlkMigBlock *blk;

+    bdrv_drain_all();
+
     set_dirty_tracking(0);

     while ((bmds = QSIMPLEQ_FIRST(&block_mig_state.bmds_list)) != NULL) {