Patchwork block: init bs->io_base correctly to avoid locking

Submitter Peter Lieven
Date Jan. 10, 2013, 9:21 a.m.
Message ID <50EE8810.7080507@dlhnet.de>
Permalink /patch/210955/
State New

Comments

Peter Lieven - Jan. 10, 2013, 9:21 a.m.
If io_limits are specified during runtime that exceed the number of
operations in flight, bs->io_base is not initialized in the else branch
of bdrv_exceed_io_limits(). The wait time calculated in
bdrv_exceed_{bps,iops}_limits is thus totally wrong and the machine
locks up.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
  block.c |    4 ++++
  1 file changed, 4 insertions(+)
Paolo Bonzini - Jan. 10, 2013, 10:19 a.m.
Il 10/01/2013 10:21, Peter Lieven ha scritto:
> If io_limits are specified during runtime that exceed the number of
> operations in flight
> bs->io_base is not initialized in the else statement in
> bdrv_exceed_io_limits().
> The wait time calculated in bdrv_exceed_{bps,iops}_limits is thus
> totally wrong
> and the machine locks.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  block.c |    4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/block.c b/block.c
> index 4e28c55..309aa85 100644
> --- a/block.c
> +++ b/block.c
> @@ -159,6 +159,10 @@ void bdrv_io_limits_enable(BlockDriverState *bs)
>      bs->slice_start = qemu_get_clock_ns(vm_clock);
>      bs->slice_end   = bs->slice_start + bs->slice_time;
>      memset(&bs->io_base, 0, sizeof(bs->io_base));

Please remove this memset.

> +    bs->io_base.bytes[0] = bs->nr_bytes[0];
> +    bs->io_base.bytes[1] = bs->nr_bytes[1];
> +    bs->io_base.ios[0] = bs->nr_ops[0];
> +    bs->io_base.ios[1] = bs->nr_ops[1];
>      bs->io_limits_enabled = true;
>  }
> 

Also, perhaps you can just call

    bdrv_exceed_io_limits(bs, 0, 0, NULL);

(which also subsumes the setting of slice_time, slice_start, slice_end).

Paolo
Markus Armbruster - Jan. 10, 2013, 10:45 a.m.
Subject line made me go "huh?" until I realized that you mean "Guest
locks up", not "QEMU uses locks to synchronize".  Suggest to add the
preposition :)

Peter Lieven <pl@dlhnet.de> writes:

> If io_limits are specified during runtime that exceed the number of operations in flight
> bs->io_base is not initialized in the else statement in bdrv_exceed_io_limits().

I'm confused.

    if ((bs->slice_start < now)
        && (bs->slice_end > now)) {
        bs->slice_end = now + bs->slice_time;
    } else {
        bs->slice_time  =  5 * BLOCK_IO_SLICE_TIME;
        bs->slice_start = now;
        bs->slice_end   = now + bs->slice_time;

        bs->io_base.bytes[is_write]  = bs->nr_bytes[is_write];
        bs->io_base.bytes[!is_write] = bs->nr_bytes[!is_write];

        bs->io_base.ios[is_write]    = bs->nr_ops[is_write];
        bs->io_base.ios[!is_write]   = bs->nr_ops[!is_write];
    }

It's set in the else.  Could you explain what goes wrong in a bit more
detail?

> The wait time calculated in bdrv_exceed_{bps,iops}_limits is thus totally wrong
> and the machine locks.

Please wrap lines around column 70.

>
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  block.c |    4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/block.c b/block.c
> index 4e28c55..309aa85 100644
> --- a/block.c
> +++ b/block.c
> @@ -159,6 +159,10 @@ void bdrv_io_limits_enable(BlockDriverState *bs)
>      bs->slice_start = qemu_get_clock_ns(vm_clock);
>      bs->slice_end   = bs->slice_start + bs->slice_time;
>      memset(&bs->io_base, 0, sizeof(bs->io_base));
> +    bs->io_base.bytes[0] = bs->nr_bytes[0];
> +    bs->io_base.bytes[1] = bs->nr_bytes[1];
> +    bs->io_base.ios[0] = bs->nr_ops[0];
> +    bs->io_base.ios[1] = bs->nr_ops[1];
>      bs->io_limits_enabled = true;
>  }
Peter Lieven - Jan. 10, 2013, 10:52 a.m.
Am 10.01.2013 um 11:19 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 10/01/2013 10:21, Peter Lieven ha scritto:
>> If io_limits are specified during runtime that exceed the number of
>> operations in flight
>> bs->io_base is not initialized in the else statement in
>> bdrv_exceed_io_limits().
>> The wait time calculated in bdrv_exceed_{bps,iops}_limits is thus
>> totally wrong
>> and the machine locks.
>> 
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> block.c |    4 ++++
>> 1 file changed, 4 insertions(+)
>> 
>> diff --git a/block.c b/block.c
>> index 4e28c55..309aa85 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -159,6 +159,10 @@ void bdrv_io_limits_enable(BlockDriverState *bs)
>>     bs->slice_start = qemu_get_clock_ns(vm_clock);
>>     bs->slice_end   = bs->slice_start + bs->slice_time;
>>     memset(&bs->io_base, 0, sizeof(bs->io_base));
> 
> Please remove this memset.
> 
>> +    bs->io_base.bytes[0] = bs->nr_bytes[0];
>> +    bs->io_base.bytes[1] = bs->nr_bytes[1];
>> +    bs->io_base.ios[0] = bs->nr_ops[0];
>> +    bs->io_base.ios[1] = bs->nr_ops[1];
>>     bs->io_limits_enabled = true;
>> }
>> 
> 
> Also, perhaps you can just call
> 
>    bdrv_exceed_io_limits(bs, 0, 0, NULL);

Won't this segfault if only one of the bps or iops limits is set?
In that case it reads wait before returning false in
bdrv_exceed_{bps,iops}_limits().


> 
> (which also subsumes the setting of slice_time, slice_start, slice_end).
> 
> Paolo
Paolo Bonzini - Jan. 10, 2013, 10:55 a.m.
Il 10/01/2013 11:52, Peter Lieven ha scritto:
>> > Also, perhaps you can just call
>> > 
>> >    bdrv_exceed_io_limits(bs, 0, 0, NULL);
> won`t this segfault if only one of bps or ions limit is set?
> in this case it reads wait before returning false in
> bdrv_exceed_{bps,iops}_limits().

I didn't see any accesses that aren't checked with "if (wait)".  Am I
blind? :)

Paolo
Paolo Bonzini - Jan. 10, 2013, 10:55 a.m.
Il 10/01/2013 11:45, Markus Armbruster ha scritto:
> 
>> > If io_limits are specified during runtime that exceed the number of operations in flight
>> > bs->io_base is not initialized in the else statement in bdrv_exceed_io_limits().
> I'm confused.
> 
>     if ((bs->slice_start < now)
>         && (bs->slice_end > now)) {
>         bs->slice_end = now + bs->slice_time;
>     } else {
>         bs->slice_time  =  5 * BLOCK_IO_SLICE_TIME;
>         bs->slice_start = now;
>         bs->slice_end   = now + bs->slice_time;
> 
>         bs->io_base.bytes[is_write]  = bs->nr_bytes[is_write];
>         bs->io_base.bytes[!is_write] = bs->nr_bytes[!is_write];
> 
>         bs->io_base.ios[is_write]    = bs->nr_ops[is_write];
>         bs->io_base.ios[!is_write]   = bs->nr_ops[!is_write];
>     }

bdrv_io_limits_enable correctly starts a new slice (the first three
lines of the else), but does not set io_base correctly for that slice.
Here is how io_base is used:

    bytes_base  = bs->nr_bytes[is_write] - bs->io_base.bytes[is_write];
    bytes_res   = (unsigned) nb_sectors * BDRV_SECTOR_SIZE;

    if (bytes_base + bytes_res <= bytes_limit) {
        /* no wait */
    } else {
        /* operation needs to be throttled */
    }

As a result, any I/O operations that are triggered between now and
bs->slice_end are incorrectly limited.  If 10 MB of data has been
written since the VM was started, QEMU thinks that 10 MB of data has
been written in this slice.

Paolo
Peter Lieven - Jan. 10, 2013, 10:59 a.m.
Am 10.01.2013 um 11:45 schrieb Markus Armbruster <armbru@redhat.com>:

> Subject line made me go "huh?" until I realized that you mean "Guest
> locks up", not "QEMU uses locks to synchronize".  Suggest to add the
> preposition :)
> 
> Peter Lieven <pl@dlhnet.de> writes:
> 
>> If io_limits are specified during runtime that exceed the number of operations in flight
>> bs->io_base is not initialized in the else statement in bdrv_exceed_io_limits().
> 
> I'm confused.
> 
>    if ((bs->slice_start < now)
>        && (bs->slice_end > now)) {
>        bs->slice_end = now + bs->slice_time;
>    } else {
>        bs->slice_time  =  5 * BLOCK_IO_SLICE_TIME;
>        bs->slice_start = now;
>        bs->slice_end   = now + bs->slice_time;
> 
>        bs->io_base.bytes[is_write]  = bs->nr_bytes[is_write];
>        bs->io_base.bytes[!is_write] = bs->nr_bytes[!is_write];
> 
>        bs->io_base.ios[is_write]    = bs->nr_ops[is_write];
>        bs->io_base.ios[!is_write]   = bs->nr_ops[!is_write];
>    }
> 
> It's set in the else.  Could you explain what goes wrong in a bit more
> detail?

If io_limits are enabled while there are I/Os in flight, we do not enter
the else tree as bs->slice_start and bs->slice_end are initialized
in bdrv_io_limits_enable().

In bdrv_exceed_iops_limits() ios_base is then initialized as

ios_base   = bs->nr_ops[is_write] - bs->io_base.ios[is_write];

ios_base is then a very high value if the machine runs for some time.

wait_time is then also a very high value as it is calculated as:

wait_time = (ios_base + 1) / iops_limit;

I hope my fix is correct. Maybe there is something else wrong, but at
least this fixes the locking up of the guest for me.

My intention is not to limit I/O in general, just to watch it and shape
guests that behave inappropriately.

Peter
Peter Lieven - Jan. 10, 2013, 11:02 a.m.
Am 10.01.2013 um 11:55 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 10/01/2013 11:52, Peter Lieven ha scritto:
>>>> Also, perhaps you can just call
>>>> 
>>>>   bdrv_exceed_io_limits(bs, 0, 0, NULL);
>> won`t this segfault if only one of bps or ions limit is set?
>> in this case it reads wait before returning false in
>> bdrv_exceed_{bps,iops}_limits().
> 
> I didn't see any accesses that aren't checked with "if (wait)".  Am I
> blind? :)

No, I am ;-)

Peter

> 
> Paolo
Paolo Bonzini - Jan. 10, 2013, 11:17 a.m.
Il 10/01/2013 11:59, Peter Lieven ha scritto:
> I hope my fix is correct. Maybe there is sth else wrong, but at least this
> fixes the locking of the guest for me.

Yes, I also believe your fix is correct (only aesthetic remarks).

Paolo


> My intention is not to limit I/O in general, just watch it and shape guests
> that behave inappropriate.
Markus Armbruster - Jan. 10, 2013, 11:52 a.m.
Paolo Bonzini <pbonzini@redhat.com> writes:

> Il 10/01/2013 11:45, Markus Armbruster ha scritto:
>> 
>>> > If io_limits are specified during runtime that exceed the number
>>> > of operations in flight
>>> > bs->io_base is not initialized in the else statement in
>>> > bdrv_exceed_io_limits().
>> I'm confused.
>> 
>>     if ((bs->slice_start < now)
>>         && (bs->slice_end > now)) {
>>         bs->slice_end = now + bs->slice_time;
>>     } else {
>>         bs->slice_time  =  5 * BLOCK_IO_SLICE_TIME;
>>         bs->slice_start = now;
>>         bs->slice_end   = now + bs->slice_time;
>> 
>>         bs->io_base.bytes[is_write]  = bs->nr_bytes[is_write];
>>         bs->io_base.bytes[!is_write] = bs->nr_bytes[!is_write];
>> 
>>         bs->io_base.ios[is_write]    = bs->nr_ops[is_write];
>>         bs->io_base.ios[!is_write]   = bs->nr_ops[!is_write];
>>     }
>
> bdrv_io_limits_enable correctly starts a new slice (the first three
> lines of the else), but does not set io_base correctly for that slice.
> Here is how io_base is used:
>
>     bytes_base  = bs->nr_bytes[is_write] - bs->io_base.bytes[is_write];
>     bytes_res   = (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
>
>     if (bytes_base + bytes_res <= bytes_limit) {
>         /* no wait */
>     } else {
>         /* operation needs to be throttled */
>     }
>
> As a result, any I/O operations that are triggered between now and
> bs->slice_end are incorrectly limited.  If 10 MB of data has been
> written since the VM was started, QEMU thinks that 10 MB of data has
> been written in this slice.

Work that into the commit message, and I'm happy :)
Peter Lieven - Jan. 10, 2013, 11:57 a.m.
Am 10.01.2013 um 12:52 schrieb Markus Armbruster <armbru@redhat.com>:

> Paolo Bonzini <pbonzini@redhat.com> writes:
> 
>> Il 10/01/2013 11:45, Markus Armbruster ha scritto:
>>> 
>>>>> If io_limits are specified during runtime that exceed the number
>>>>> of operations in flight
>>>>> bs->io_base is not initialized in the else statement in
>>>>> bdrv_exceed_io_limits().
>>> I'm confused.
>>> 
>>>    if ((bs->slice_start < now)
>>>        && (bs->slice_end > now)) {
>>>        bs->slice_end = now + bs->slice_time;
>>>    } else {
>>>        bs->slice_time  =  5 * BLOCK_IO_SLICE_TIME;
>>>        bs->slice_start = now;
>>>        bs->slice_end   = now + bs->slice_time;
>>> 
>>>        bs->io_base.bytes[is_write]  = bs->nr_bytes[is_write];
>>>        bs->io_base.bytes[!is_write] = bs->nr_bytes[!is_write];
>>> 
>>>        bs->io_base.ios[is_write]    = bs->nr_ops[is_write];
>>>        bs->io_base.ios[!is_write]   = bs->nr_ops[!is_write];
>>>    }
>> 
>> bdrv_io_limits_enable correctly starts a new slice (the first three
>> lines of the else), but does not set io_base correctly for that slice.
>> Here is how io_base is used:
>> 
>>    bytes_base  = bs->nr_bytes[is_write] - bs->io_base.bytes[is_write];
>>    bytes_res   = (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
>> 
>>    if (bytes_base + bytes_res <= bytes_limit) {
>>        /* no wait */
>>    } else {
>>        /* operation needs to be throttled */
>>    }
>> 
>> As a result, any I/O operations that are triggered between now and
>> bs->slice_end are incorrectly limited.  If 10 MB of data has been
>> written since the VM was started, QEMU thinks that 10 MB of data has
>> been written in this slice.
> 
> Work that into the commit message, and I'm happy :)

Paolo, if you agree I would resubmit the patch (using your description).

I would not directly collapse the code, as it's not obvious what
bdrv_exceed_io_limits(bs, 0, 0, NULL);
is doing. Maybe this could be done in a later patch.

Peter
Paolo Bonzini - Jan. 10, 2013, 12:09 p.m.
Il 10/01/2013 12:57, Peter Lieven ha scritto:
> Paolo, if you agree I would resubmit the patch (using your description).

Using the description is fine.  Also at least remove the memset.

> I would not directly collapse the code to as its not obvious what 
> bdrv_exceed_io_limits(bs, 0, 0, NULL); 
> is doing. Maybe this could be done in a later patch.

You're right that it's not obvious.

But perhaps we do not need to start a slice at all when iolimits are
set.  That is, do

  bs->slice_start = bs->slice_end = bs->slice_time = 0;

or perhaps even nothing at all since bdrv_io_limits_disable should have
written those exact values.

Paolo
Peter Lieven - Jan. 10, 2013, 12:12 p.m.
Am 10.01.2013 um 13:09 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 10/01/2013 12:57, Peter Lieven ha scritto:
>> Paolo, if you agree I would resubmit the patch (using your description).
> 
> Using the description is fine.  Also at least remove the memset.
> 
>> I would not directly collapse the code to as its not obvious what 
>> bdrv_exceed_io_limits(bs, 0, 0, NULL); 
>> is doing. Maybe this could be done in a later patch.
> 
> You're right that it's not obvious.
> 
> But perhaps we do not need to start a slice at all when iolimits are
> set.  That is, do
> 
>  bs->slice_start = bs->slice_end = bs->slice_time = 0;
> 
> or perhaps even nothing at all since bdrv_io_limits_disable should have
> written those exact values.

Or it was set when the BlockDriverState was initialized.

I am not familiar enough with the io limits code to decide if not starting a slice
is also correct.

Peter


> 
> Paolo
Paolo Bonzini - Jan. 10, 2013, 12:15 p.m.
Il 10/01/2013 13:12, Peter Lieven ha scritto:
>> > 
>> > But perhaps we do not need to start a slice at all when iolimits are
>> > set.  That is, do
>> > 
>> >  bs->slice_start = bs->slice_end = bs->slice_time = 0;
>> > 
>> > or perhaps even nothing at all since bdrv_io_limits_disable should have
>> > written those exact values.
> Or it was set when the BlockDriverState was initialized.
> 
> I am not familiar enough with the io limits code to decide if not starting a slice
> is also correct.

I haven't tested it, but if it works, I think it is better.

Think of it this way: it doesn't matter whether the first I/O operation
comes immediately after limits are set, or 10 seconds later.  In the
latter case, bdrv_exceed_io_limits will _already_ start a new slice.  It
is better to be consistent and always delay the start of the slice.

Paolo
Peter Lieven - Jan. 10, 2013, 12:26 p.m.
Am 10.01.2013 um 13:15 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 10/01/2013 13:12, Peter Lieven ha scritto:
>>>> 
>>>> But perhaps we do not need to start a slice at all when iolimits are
>>>> set.  That is, do
>>>> 
>>>> bs->slice_start = bs->slice_end = bs->slice_time = 0;
>>>> 
>>>> or perhaps even nothing at all since bdrv_io_limits_disable should have
>>>> written those exact values.
>> Or it was set when the BlockDriverState was initialized.
>> 
>> I am not familiar enough with the io limits code to decide if not starting a slice
>> is also correct.
> 
> I haven't tested it, but if it works, I think it is better.

I will test and report.

Peter

> 
> Think of it this way: it doesn't matter whether the first I/O operation
> comes immediately after limits are set, or 10 seconds later.  In the
> latter case, bdrv_exceed_io_limits will _already_ start a new slice.  It
> is better to be consistent and always delay the start of the slice.
> 
> Paolo
Peter Lieven - Jan. 10, 2013, 12:58 p.m.
Am 10.01.2013 um 13:15 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 10/01/2013 13:12, Peter Lieven ha scritto:
>>>> 
>>>> But perhaps we do not need to start a slice at all when iolimits are
>>>> set.  That is, do
>>>> 
>>>> bs->slice_start = bs->slice_end = bs->slice_time = 0;
>>>> 
>>>> or perhaps even nothing at all since bdrv_io_limits_disable should have
>>>> written those exact values.
>> Or it was set when the BlockDriverState was initialized.
>> 
>> I am not familiar enough with the io limits code to decide if not starting a slice
>> is also correct.
> 
> I haven't tested it, but if it works, I think it is better.
> 
> Think of it this way: it doesn't matter whether the first I/O operation
> comes immediately after limits are set, or 10 seconds later.  In the
> latter case, bdrv_exceed_io_limits will _already_ start a new slice.  It
> is better to be consistent and always delay the start of the slice.
> 

Seems to be working as well.

Are you happy with:

block: fix initialization in bdrv_io_limits_enable()

Peter
Paolo Bonzini - Jan. 10, 2013, 1:29 p.m.
Il 10/01/2013 13:58, Peter Lieven ha scritto:
> 
> Am 10.01.2013 um 13:15 schrieb Paolo Bonzini <pbonzini@redhat.com>:
> 
>> Il 10/01/2013 13:12, Peter Lieven ha scritto:
>>>>>
>>>>> But perhaps we do not need to start a slice at all when iolimits are
>>>>> set.  That is, do
>>>>>
>>>>> bs->slice_start = bs->slice_end = bs->slice_time = 0;
>>>>>
>>>>> or perhaps even nothing at all since bdrv_io_limits_disable should have
>>>>> written those exact values.
>>> Or it was set when the BlockDriverState was initialized.
>>>
>>> I am not familiar enough with the io limits code to decide if not starting a slice
>>> is also correct.
>>
>> I haven't tested it, but if it works, I think it is better.
>>
>> Think of it this way: it doesn't matter whether the first I/O operation
>> comes immediately after limits are set, or 10 seconds later.  In the
>> latter case, bdrv_exceed_io_limits will _already_ start a new slice.  It
>> is better to be consistent and always delay the start of the slice.
>>
> 
> seems to be working as well.
> 
> are you happy with:
> 
> block: fix initialization in bdrv_io_limits_enable()

Sure.

Paolo

Patch

diff --git a/block.c b/block.c
index 4e28c55..309aa85 100644
--- a/block.c
+++ b/block.c
@@ -159,6 +159,10 @@ void bdrv_io_limits_enable(BlockDriverState *bs)
     bs->slice_start = qemu_get_clock_ns(vm_clock);
     bs->slice_end   = bs->slice_start + bs->slice_time;
     memset(&bs->io_base, 0, sizeof(bs->io_base));
+    bs->io_base.bytes[0] = bs->nr_bytes[0];
+    bs->io_base.bytes[1] = bs->nr_bytes[1];
+    bs->io_base.ios[0] = bs->nr_ops[0];
+    bs->io_base.ios[1] = bs->nr_ops[1];
     bs->io_limits_enabled = true;
 }