
[v5,28/31] block.c: assert BQL lock held in bdrv_co_invalidate_cache

Message ID 20211124064418.3120601-29-eesposit@redhat.com
State New
Series block layer: split block APIs in global state and I/O

Commit Message

Emanuele Giuseppe Esposito Nov. 24, 2021, 6:44 a.m. UTC
bdrv_co_invalidate_cache is special: it is an I/O function,
but uses the block layer permission API, which is GS.

Because of this, we can assert that either the function is
being called with BQL held, and thus can use the permission API,
or make sure that the permission API is not used, by ensuring that
bs (and parents) .open_flags does not contain BDRV_O_INACTIVE.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 block.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

Comments

Hanna Czenczek Dec. 17, 2021, 11:04 a.m. UTC | #1
On 24.11.21 07:44, Emanuele Giuseppe Esposito wrote:
> bdrv_co_invalidate_cache is special: it is an I/O function,

I still don’t believe it is, but well.

(Yes, it is called by a test in an iothread, but I believe we’ve seen 
that the tests simply sometimes test things that shouldn’t be allowed.)

> but uses the block layer permission API, which is GS.
>
> Because of this, we can assert that either the function is
> being called with BQL held, and thus can use the permission API,
> or make sure that the permission API is not used, by ensuring that
> bs (and parents) .open_flags does not contain BDRV_O_INACTIVE.
>
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>   block.c | 26 ++++++++++++++++++++++++++
>   1 file changed, 26 insertions(+)
>
> diff --git a/block.c b/block.c
> index a0309f827d..805974676b 100644
> --- a/block.c
> +++ b/block.c
> @@ -6574,6 +6574,26 @@ void bdrv_init_with_whitelist(void)
>       bdrv_init();
>   }
>   
> +static bool bdrv_is_active(BlockDriverState *bs)
> +{
> +    BdrvChild *parent;
> +
> +    if (bs->open_flags & BDRV_O_INACTIVE) {
> +        return false;
> +    }
> +
> +    QLIST_FOREACH(parent, &bs->parents, next_parent) {
> +        if (parent->klass->parent_is_bds) {
> +            BlockDriverState *parent_bs = parent->opaque;

This looks like a really bad hack to me.  We purposefully have made the 
parent link opaque so that a BDS cannot easily reach its parents.  All 
accesses should go through BdrvChildClass methods.

I also don’t understand why we need to query parents at all.  The only 
fact that determines whether the current BDS will have its permissions 
changed is whether the BDS itself is active or inactive.  Sure, we’ll 
invoke bdrv_co_invalidate_cache() on the parents, too, but then we could 
simply let the assertion fail there.

> +            if (!bdrv_is_active(parent_bs)) {
> +                return false;
> +            }
> +        }
> +    }
> +
> +   return true;
> +}
> +
>   int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp)
>   {
>       BdrvChild *child, *parent;
> @@ -6585,6 +6605,12 @@ int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp)
>           return -ENOMEDIUM;
>       }
>   
> +    /*
> +     * No need to muck with permissions if bs is active.
> +     * TODO: should activation be a separate function?
> +     */
> +    assert(qemu_in_main_thread() || bdrv_is_active(bs));
> +

I don’t understand this, really.  It looks to me like “if you don’t call 
this in the main thread, this better be a no-op”, i.e., you must never 
call this function in an I/O thread if you really want to use it.  I.e. 
what I’d classify as a GS function.

It sounds like this is just a special case for said test, and 
special-casing code for tests sounds like a bad idea.

Hanna

>       QLIST_FOREACH(child, &bs->children, next) {
>           bdrv_co_invalidate_cache(child->bs, &local_err);
>           if (local_err) {
Emanuele Giuseppe Esposito Dec. 17, 2021, 4:38 p.m. UTC | #2
On 17/12/2021 12:04, Hanna Reitz wrote:
> On 24.11.21 07:44, Emanuele Giuseppe Esposito wrote:
>> bdrv_co_invalidate_cache is special: it is an I/O function,
> 
> I still don’t believe it is, but well.
> 
> (Yes, it is called by a test in an iothread, but I believe we’ve seen 
> that the tests simply sometimes test things that shouldn’t be allowed.)
> 
>> but uses the block layer permission API, which is GS.
>>
>> Because of this, we can assert that either the function is
>> being called with BQL held, and thus can use the permission API,
>> or make sure that the permission API is not used, by ensuring that
>> bs (and parents) .open_flags does not contain BDRV_O_INACTIVE.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> ---
>>   block.c | 26 ++++++++++++++++++++++++++
>>   1 file changed, 26 insertions(+)
>>
>> diff --git a/block.c b/block.c
>> index a0309f827d..805974676b 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -6574,6 +6574,26 @@ void bdrv_init_with_whitelist(void)
>>       bdrv_init();
>>   }
>> +static bool bdrv_is_active(BlockDriverState *bs)
>> +{
>> +    BdrvChild *parent;
>> +
>> +    if (bs->open_flags & BDRV_O_INACTIVE) {
>> +        return false;
>> +    }
>> +
>> +    QLIST_FOREACH(parent, &bs->parents, next_parent) {
>> +        if (parent->klass->parent_is_bds) {
>> +            BlockDriverState *parent_bs = parent->opaque;
> 
> This looks like a really bad hack to me.  We purposefully have made the 
> parent link opaque so that a BDS cannot easily reach its parents.  All 
> accesses should go through BdrvChildClass methods.
> 
> I also don’t understand why we need to query parents at all.  The only 
> fact that determines whether the current BDS will have its permissions 
> changed is whether the BDS itself is active or inactive.  Sure, we’ll 
> invoke bdrv_co_invalidate_cache() on the parents, too, but then we could 
> simply let the assertion fail there.
> 
>> +            if (!bdrv_is_active(parent_bs)) {
>> +                return false;
>> +            }
>> +        }
>> +    }
>> +
>> +   return true;
>> +}
>> +
>>   int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, 
>> Error **errp)
>>   {
>>       BdrvChild *child, *parent;
>> @@ -6585,6 +6605,12 @@ int coroutine_fn 
>> bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp)
>>           return -ENOMEDIUM;
>>       }
>> +    /*
>> +     * No need to muck with permissions if bs is active.
>> +     * TODO: should activation be a separate function?
>> +     */
>> +    assert(qemu_in_main_thread() || bdrv_is_active(bs));
>> +
> 
> I don’t understand this, really.  It looks to me like “if you don’t call 
> this in the main thread, this better be a no-op”, i.e., you must never 
> call this function in an I/O thread if you really want to use it.  I.e. 
> what I’d classify as a GS function.
> 
> It sounds like this is just a special case for said test, and 
> special-casing code for tests sounds like a bad idea.

Ok, but trying to leave just the qemu_in_main_thread() assertion makes 
test 307 (./check 307) fail.
I am actually not sure why it fails, but I am sure it is because of 
the assertion, since without it it passes.

I tried with gdb (./check -gdb 307 on one terminal and
gdb -iex "target remote localhost:12345"
in another) but it points me to this below, which I think is the nbd 
server getting the socket closed (because on the other side it crashed), 
and not the actual error.


Thread 1 "qemu-system-x86" received signal SIGPIPE, Broken pipe.
0x00007ffff68af54d in sendmsg () from target:/lib64/libc.so.6
(gdb) bt
#0  0x00007ffff68af54d in sendmsg () from target:/lib64/libc.so.6
#1  0x0000555555c13cc9 in qio_channel_socket_writev (ioc=<optimized 
out>, iov=0x5555569a4870, niov=1, fds=0x0, nfds=<optimized out>, errp=0x0)
     at ../io/channel-socket.c:561
#2  0x0000555555c19b18 in qio_channel_writev_full_all 
(ioc=0x55555763b800, iov=iov@entry=0x7fffe8dffd80, niov=niov@entry=1, 
fds=fds@entry=0x0,
     nfds=nfds@entry=0, errp=errp@entry=0x0) at ../io/channel.c:240
#3  0x0000555555c19bd2 in qio_channel_writev_all (errp=0x0, niov=1, 
iov=0x7fffe8dffd80, ioc=<optimized out>) at ../io/channel.c:220
#4  qio_channel_write_all (ioc=<optimized out>, 
buf=buf@entry=0x7fffe8dffdd0 "", buflen=buflen@entry=20, 
errp=errp@entry=0x0) at ../io/channel.c:330
#5  0x0000555555c27e75 in nbd_write (errp=0x0, size=20, 
buffer=0x7fffe8dffdd0, ioc=<optimized out>) at ../nbd/nbd-internal.h:71
#6  nbd_negotiate_send_rep_len (client=client@entry=0x555556f60930, 
type=type@entry=1, len=len@entry=0, errp=errp@entry=0x0) at 
../nbd/server.c:203
#7  0x0000555555c29db1 in nbd_negotiate_send_rep (errp=0x0, type=1, 
client=0x555556f60930) at ../nbd/server.c:211
--Type <RET> for more, q to quit, c to continue without paging--
#8  nbd_negotiate_options (errp=0x7fffe8dffe88, client=<optimized out>) 
at ../nbd/server.c:1224
#9  nbd_negotiate (errp=0x7fffe8dffe88, client=<optimized out>) at 
../nbd/server.c:1340
#10 nbd_co_client_start (opaque=<optimized out>) at ../nbd/server.c:2715
#11 0x0000555555d70423 in coroutine_trampoline (i0=<optimized out>, 
i1=<optimized out>) at ../util/coroutine-ucontext.c:173
#12 0x00007ffff67f3820 in ?? () from target:/lib64/libc.so.6
#13 0x00007fffffffca80 in ?? ()

Emanuele
> 
> Hanna
> 
>>       QLIST_FOREACH(child, &bs->children, next) {
>>           bdrv_co_invalidate_cache(child->bs, &local_err);
>>           if (local_err) {
>
Emanuele Giuseppe Esposito Dec. 20, 2021, 12:20 p.m. UTC | #3
On 17/12/2021 17:38, Emanuele Giuseppe Esposito wrote:
> 
> 
> On 17/12/2021 12:04, Hanna Reitz wrote:
>> On 24.11.21 07:44, Emanuele Giuseppe Esposito wrote:
>>> bdrv_co_invalidate_cache is special: it is an I/O function,
>>
>> I still don’t believe it is, but well.
>>
>> (Yes, it is called by a test in an iothread, but I believe we’ve seen 
>> that the tests simply sometimes test things that shouldn’t be allowed.)
>>
>>> but uses the block layer permission API, which is GS.
>>>
>>> Because of this, we can assert that either the function is
>>> being called with BQL held, and thus can use the permission API,
>>> or make sure that the permission API is not used, by ensuring that
>>> bs (and parents) .open_flags does not contain BDRV_O_INACTIVE.
>>>
>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>> ---
>>>   block.c | 26 ++++++++++++++++++++++++++
>>>   1 file changed, 26 insertions(+)
>>>
>>> diff --git a/block.c b/block.c
>>> index a0309f827d..805974676b 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -6574,6 +6574,26 @@ void bdrv_init_with_whitelist(void)
>>>       bdrv_init();
>>>   }
>>> +static bool bdrv_is_active(BlockDriverState *bs)
>>> +{
>>> +    BdrvChild *parent;
>>> +
>>> +    if (bs->open_flags & BDRV_O_INACTIVE) {
>>> +        return false;
>>> +    }
>>> +
>>> +    QLIST_FOREACH(parent, &bs->parents, next_parent) {
>>> +        if (parent->klass->parent_is_bds) {
>>> +            BlockDriverState *parent_bs = parent->opaque;
>>
>> This looks like a really bad hack to me.  We purposefully have made 
>> the parent link opaque so that a BDS cannot easily reach its parents.  
>> All accesses should go through BdrvChildClass methods.
>>
>> I also don’t understand why we need to query parents at all.  The only 
>> fact that determines whether the current BDS will have its permissions 
>> changed is whether the BDS itself is active or inactive.  Sure, we’ll 
>> invoke bdrv_co_invalidate_cache() on the parents, too, but then we 
>> could simply let the assertion fail there.
>>
>>> +            if (!bdrv_is_active(parent_bs)) {
>>> +                return false;
>>> +            }
>>> +        }
>>> +    }
>>> +
>>> +   return true;
>>> +}
>>> +
>>>   int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, 
>>> Error **errp)
>>>   {
>>>       BdrvChild *child, *parent;
>>> @@ -6585,6 +6605,12 @@ int coroutine_fn 
>>> bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp)
>>>           return -ENOMEDIUM;
>>>       }
>>> +    /*
>>> +     * No need to muck with permissions if bs is active.
>>> +     * TODO: should activation be a separate function?
>>> +     */
>>> +    assert(qemu_in_main_thread() || bdrv_is_active(bs));
>>> +
>>
>> I don’t understand this, really.  It looks to me like “if you don’t 
>> call this in the main thread, this better be a no-op”, i.e., you must 
>> never call this function in an I/O thread if you really want to use 
>> it.  I.e. what I’d classify as a GS function.
>>
>> It sounds like this is just a special case for said test, and 
>> special-casing code for tests sounds like a bad idea.
> 
> Ok, but trying to leave just the qemu_in_main_thread() assertion makes 
> test 307 (./check 307) fail.
> I am actually not sure on why it fails, but I am sure it is because of 
> the assertion, since without it it passes.
> 
> I tried with gdb (./check -gdb 307 on one terminal and
> gdb -iex "target remote localhost:12345"
> in another) but it points me to this below, which I think is the ndb 
> server getting the socket closed (because on the other side it crashed), 
> and not the actual error.
> 
> 
> Thread 1 "qemu-system-x86" received signal SIGPIPE, Broken pipe.
> 0x00007ffff68af54d in sendmsg () from target:/lib64/libc.so.6
> (gdb) bt
> #0  0x00007ffff68af54d in sendmsg () from target:/lib64/libc.so.6
> #1  0x0000555555c13cc9 in qio_channel_socket_writev (ioc=<optimized 
> out>, iov=0x5555569a4870, niov=1, fds=0x0, nfds=<optimized out>, errp=0x0)
>      at ../io/channel-socket.c:561
> #2  0x0000555555c19b18 in qio_channel_writev_full_all 
> (ioc=0x55555763b800, iov=iov@entry=0x7fffe8dffd80, niov=niov@entry=1, 
> fds=fds@entry=0x0,
>      nfds=nfds@entry=0, errp=errp@entry=0x0) at ../io/channel.c:240
> #3  0x0000555555c19bd2 in qio_channel_writev_all (errp=0x0, niov=1, 
> iov=0x7fffe8dffd80, ioc=<optimized out>) at ../io/channel.c:220
> #4  qio_channel_write_all (ioc=<optimized out>, 
> buf=buf@entry=0x7fffe8dffdd0 "", buflen=buflen@entry=20, 
> errp=errp@entry=0x0) at ../io/channel.c:330
> #5  0x0000555555c27e75 in nbd_write (errp=0x0, size=20, 
> buffer=0x7fffe8dffdd0, ioc=<optimized out>) at ../nbd/nbd-internal.h:71
> #6  nbd_negotiate_send_rep_len (client=client@entry=0x555556f60930, 
> type=type@entry=1, len=len@entry=0, errp=errp@entry=0x0) at 
> ../nbd/server.c:203
> #7  0x0000555555c29db1 in nbd_negotiate_send_rep (errp=0x0, type=1, 
> client=0x555556f60930) at ../nbd/server.c:211
> --Type <RET> for more, q to quit, c to continue without paging--
> #8  nbd_negotiate_options (errp=0x7fffe8dffe88, client=<optimized out>) 
> at ../nbd/server.c:1224
> #9  nbd_negotiate (errp=0x7fffe8dffe88, client=<optimized out>) at 
> ../nbd/server.c:1340
> #10 nbd_co_client_start (opaque=<optimized out>) at ../nbd/server.c:2715
> #11 0x0000555555d70423 in coroutine_trampoline (i0=<optimized out>, 
> i1=<optimized out>) at ../util/coroutine-ucontext.c:173
> #12 0x00007ffff67f3820 in ?? () from target:/lib64/libc.so.6
> #13 0x00007fffffffca80 in ?? ()
> 

Ok the reason for this is line 89 of tests/qemu-iotests/307:

# Should ignore the iothread conflict, but then fail because of the
# permission conflict (and not crash)
vm.qmp_log('block-export-add', id='export1', type='nbd', node_name='null',
                iothread='iothread1', fixed_iothread=False, writable=True)

This calls qmp_block_export_add() and then blk_exp_add(), which calls 
bdrv_invalidate_cache().
Both functions run in the main thread, but bdrv_invalidate_cache() then 
invokes bdrv_co_invalidate_cache() as a coroutine in the AioContext of 
the given bs, i.e. not in the main loop.
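
To sketch the call chain (just how I read it, not a literal trace):

   qmp_block_export_add()             <- main thread, BQL held
     blk_exp_add()                    <- main thread, BQL held
       bdrv_invalidate_cache()        <- main thread, BQL held
         bdrv_co_invalidate_cache()   <- coroutine entered in
                                         bdrv_get_aio_context(bs),
                                         i.e. iothread1, so
                                         qemu_in_main_thread() is false
                                         and the assertion fires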

So Hanna, what should we do here? This seems very similar to the 
discussion in patch 22, i.e. running blockdev-create (in this case 
block-export-add, which seems similar?) involving nodes in I/O threads.

Thank you,
Emanuele
Hanna Czenczek Dec. 23, 2021, 5:11 p.m. UTC | #4
On 20.12.21 13:20, Emanuele Giuseppe Esposito wrote:
>
>
> On 17/12/2021 17:38, Emanuele Giuseppe Esposito wrote:
>>
>>
>> On 17/12/2021 12:04, Hanna Reitz wrote:
>>> On 24.11.21 07:44, Emanuele Giuseppe Esposito wrote:
>>>> bdrv_co_invalidate_cache is special: it is an I/O function,
>>>
>>> I still don’t believe it is, but well.
>>>
>>> (Yes, it is called by a test in an iothread, but I believe we’ve 
>>> seen that the tests simply sometimes test things that shouldn’t be 
>>> allowed.)
>>>
>>>> but uses the block layer permission API, which is GS.
>>>>
>>>> Because of this, we can assert that either the function is
>>>> being called with BQL held, and thus can use the permission API,
>>>> or make sure that the permission API is not used, by ensuring that
>>>> bs (and parents) .open_flags does not contain BDRV_O_INACTIVE.
>>>>
>>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>>> ---
>>>>   block.c | 26 ++++++++++++++++++++++++++
>>>>   1 file changed, 26 insertions(+)
>>>>
>>>> diff --git a/block.c b/block.c
>>>> index a0309f827d..805974676b 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -6574,6 +6574,26 @@ void bdrv_init_with_whitelist(void)
>>>>       bdrv_init();
>>>>   }
>>>> +static bool bdrv_is_active(BlockDriverState *bs)
>>>> +{
>>>> +    BdrvChild *parent;
>>>> +
>>>> +    if (bs->open_flags & BDRV_O_INACTIVE) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    QLIST_FOREACH(parent, &bs->parents, next_parent) {
>>>> +        if (parent->klass->parent_is_bds) {
>>>> +            BlockDriverState *parent_bs = parent->opaque;
>>>
>>> This looks like a really bad hack to me.  We purposefully have made 
>>> the parent link opaque so that a BDS cannot easily reach its 
>>> parents.  All accesses should go through BdrvChildClass methods.
>>>
>>> I also don’t understand why we need to query parents at all. The 
>>> only fact that determines whether the current BDS will have its 
>>> permissions changed is whether the BDS itself is active or 
>>> inactive.  Sure, we’ll invoke bdrv_co_invalidate_cache() on the 
>>> parents, too, but then we could simply let the assertion fail there.
>>>
>>>> +            if (!bdrv_is_active(parent_bs)) {
>>>> +                return false;
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +
>>>> +   return true;
>>>> +}
>>>> +
>>>>   int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, 
>>>> Error **errp)
>>>>   {
>>>>       BdrvChild *child, *parent;
>>>> @@ -6585,6 +6605,12 @@ int coroutine_fn 
>>>> bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp)
>>>>           return -ENOMEDIUM;
>>>>       }
>>>> +    /*
>>>> +     * No need to muck with permissions if bs is active.
>>>> +     * TODO: should activation be a separate function?
>>>> +     */
>>>> +    assert(qemu_in_main_thread() || bdrv_is_active(bs));
>>>> +
>>>
>>> I don’t understand this, really.  It looks to me like “if you don’t 
>>> call this in the main thread, this better be a no-op”, i.e., you 
>>> must never call this function in an I/O thread if you really want to 
>>> use it.  I.e. what I’d classify as a GS function.
>>>
>>> It sounds like this is just a special case for said test, and 
>>> special-casing code for tests sounds like a bad idea.
>>
>> Ok, but trying to leave just the qemu_in_main_thread() assertion 
>> makes test 307 (./check 307) fail.
>> I am actually not sure on why it fails, but I am sure it is because 
>> of the assertion, since without it it passes.
>>
>> I tried with gdb (./check -gdb 307 on one terminal and
>> gdb -iex "target remote localhost:12345"
>> in another) but it points me to this below, which I think is the ndb 
>> server getting the socket closed (because on the other side it 
>> crashed), and not the actual error.
>>
>>
>> Thread 1 "qemu-system-x86" received signal SIGPIPE, Broken pipe.
>> 0x00007ffff68af54d in sendmsg () from target:/lib64/libc.so.6
>> (gdb) bt
>> #0  0x00007ffff68af54d in sendmsg () from target:/lib64/libc.so.6
>> #1  0x0000555555c13cc9 in qio_channel_socket_writev (ioc=<optimized 
>> out>, iov=0x5555569a4870, niov=1, fds=0x0, nfds=<optimized out>, 
>> errp=0x0)
>>      at ../io/channel-socket.c:561
>> #2  0x0000555555c19b18 in qio_channel_writev_full_all 
>> (ioc=0x55555763b800, iov=iov@entry=0x7fffe8dffd80, niov=niov@entry=1, 
>> fds=fds@entry=0x0,
>>      nfds=nfds@entry=0, errp=errp@entry=0x0) at ../io/channel.c:240
>> #3  0x0000555555c19bd2 in qio_channel_writev_all (errp=0x0, niov=1, 
>> iov=0x7fffe8dffd80, ioc=<optimized out>) at ../io/channel.c:220
>> #4  qio_channel_write_all (ioc=<optimized out>, 
>> buf=buf@entry=0x7fffe8dffdd0 "", buflen=buflen@entry=20, 
>> errp=errp@entry=0x0) at ../io/channel.c:330
>> #5  0x0000555555c27e75 in nbd_write (errp=0x0, size=20, 
>> buffer=0x7fffe8dffdd0, ioc=<optimized out>) at ../nbd/nbd-internal.h:71
>> #6  nbd_negotiate_send_rep_len (client=client@entry=0x555556f60930, 
>> type=type@entry=1, len=len@entry=0, errp=errp@entry=0x0) at 
>> ../nbd/server.c:203
>> #7  0x0000555555c29db1 in nbd_negotiate_send_rep (errp=0x0, type=1, 
>> client=0x555556f60930) at ../nbd/server.c:211
>> --Type <RET> for more, q to quit, c to continue without paging--
>> #8  nbd_negotiate_options (errp=0x7fffe8dffe88, client=<optimized 
>> out>) at ../nbd/server.c:1224
>> #9  nbd_negotiate (errp=0x7fffe8dffe88, client=<optimized out>) at 
>> ../nbd/server.c:1340
>> #10 nbd_co_client_start (opaque=<optimized out>) at ../nbd/server.c:2715
>> #11 0x0000555555d70423 in coroutine_trampoline (i0=<optimized out>, 
>> i1=<optimized out>) at ../util/coroutine-ucontext.c:173
>> #12 0x00007ffff67f3820 in ?? () from target:/lib64/libc.so.6
>> #13 0x00007fffffffca80 in ?? ()
>>
>
> Ok the reason for this is line 89 of tests/qemu-iotests/307:
>
> # Should ignore the iothread conflict, but then fail because of the
> # permission conflict (and not crash)
> vm.qmp_log('block-export-add', id='export1', type='nbd', 
> node_name='null',
>                iothread='iothread1', fixed_iothread=False, writable=True)
>
> This calls qmp_block_export_add() and then blk_exp_add(), that calls 
> bdrv_invalidate_cache().
> Both functions are running in the main thread, but then 
> bdrv_invalidate_cache invokes bdrv_co_invalidate_cache() as a 
> coroutine in the AioContext of the given bs, so not the main loop.
>
> So Hanna, what should we do here? This seems very similar to the 
> discussion in patch 22, ie run blockdev-create (in this case 
> block-export-add, which seems similar?) involving nodes in I/O threads.

Honestly, I’m still thinking about this and haven’t come to a 
conclusion.  It doesn’t seem invalid to run bdrv_co_invalidate_cache() 
in the context of the BDS here, but then the assertion that the BDS is 
already active seems kind of just lucky to work.

I plan to look into whether I can reproduce some case where the 
assertion doesn’t hold (thinking of migration with a block device in an 
iothread), and then see what I learn from this. :/

Hanna
Hanna Czenczek Jan. 19, 2022, 3:57 p.m. UTC | #5
On 23.12.21 18:11, Hanna Reitz wrote:
> On 20.12.21 13:20, Emanuele Giuseppe Esposito wrote:
>>
>>
>> On 17/12/2021 17:38, Emanuele Giuseppe Esposito wrote:
>>>
>>>
>>> On 17/12/2021 12:04, Hanna Reitz wrote:
>>>> On 24.11.21 07:44, Emanuele Giuseppe Esposito wrote:
>>>>> bdrv_co_invalidate_cache is special: it is an I/O function,
>>>>
>>>> I still don’t believe it is, but well.
>>>>
>>>> (Yes, it is called by a test in an iothread, but I believe we’ve 
>>>> seen that the tests simply sometimes test things that shouldn’t be 
>>>> allowed.)
>>>>
>>>>> but uses the block layer permission API, which is GS.
>>>>>
>>>>> Because of this, we can assert that either the function is
>>>>> being called with BQL held, and thus can use the permission API,
>>>>> or make sure that the permission API is not used, by ensuring that
>>>>> bs (and parents) .open_flags does not contain BDRV_O_INACTIVE.
>>>>>
>>>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>>>> ---
>>>>>   block.c | 26 ++++++++++++++++++++++++++
>>>>>   1 file changed, 26 insertions(+)
>>>>>
>>>>> diff --git a/block.c b/block.c
>>>>> index a0309f827d..805974676b 100644
>>>>> --- a/block.c
>>>>> +++ b/block.c
>>>>> @@ -6574,6 +6574,26 @@ void bdrv_init_with_whitelist(void)
>>>>>       bdrv_init();
>>>>>   }
>>>>> +static bool bdrv_is_active(BlockDriverState *bs)
>>>>> +{
>>>>> +    BdrvChild *parent;
>>>>> +
>>>>> +    if (bs->open_flags & BDRV_O_INACTIVE) {
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    QLIST_FOREACH(parent, &bs->parents, next_parent) {
>>>>> +        if (parent->klass->parent_is_bds) {
>>>>> +            BlockDriverState *parent_bs = parent->opaque;
>>>>
>>>> This looks like a really bad hack to me.  We purposefully have made 
>>>> the parent link opaque so that a BDS cannot easily reach its 
>>>> parents.  All accesses should go through BdrvChildClass methods.
>>>>
>>>> I also don’t understand why we need to query parents at all. The 
>>>> only fact that determines whether the current BDS will have its 
>>>> permissions changed is whether the BDS itself is active or 
>>>> inactive.  Sure, we’ll invoke bdrv_co_invalidate_cache() on the 
>>>> parents, too, but then we could simply let the assertion fail there.
>>>>
>>>>> +            if (!bdrv_is_active(parent_bs)) {
>>>>> +                return false;
>>>>> +            }
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +   return true;
>>>>> +}
>>>>> +
>>>>>   int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, 
>>>>> Error **errp)
>>>>>   {
>>>>>       BdrvChild *child, *parent;
>>>>> @@ -6585,6 +6605,12 @@ int coroutine_fn 
>>>>> bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp)
>>>>>           return -ENOMEDIUM;
>>>>>       }
>>>>> +    /*
>>>>> +     * No need to muck with permissions if bs is active.
>>>>> +     * TODO: should activation be a separate function?
>>>>> +     */
>>>>> +    assert(qemu_in_main_thread() || bdrv_is_active(bs));
>>>>> +
>>>>
>>>> I don’t understand this, really.  It looks to me like “if you don’t 
>>>> call this in the main thread, this better be a no-op”, i.e., you 
>>>> must never call this function in an I/O thread if you really want 
>>>> to use it.  I.e. what I’d classify as a GS function.
>>>>
>>>> It sounds like this is just a special case for said test, and 
>>>> special-casing code for tests sounds like a bad idea.
>>>
>>> Ok, but trying to leave just the qemu_in_main_thread() assertion 
>>> makes test 307 (./check 307) fail.
>>> I am actually not sure on why it fails, but I am sure it is because 
>>> of the assertion, since without it it passes.
>>>
>>> I tried with gdb (./check -gdb 307 on one terminal and
>>> gdb -iex "target remote localhost:12345"
>>> in another) but it points me to this below, which I think is the ndb 
>>> server getting the socket closed (because on the other side it 
>>> crashed), and not the actual error.
>>>
>>>
>>> Thread 1 "qemu-system-x86" received signal SIGPIPE, Broken pipe.
>>> 0x00007ffff68af54d in sendmsg () from target:/lib64/libc.so.6
>>> (gdb) bt
>>> #0  0x00007ffff68af54d in sendmsg () from target:/lib64/libc.so.6
>>> #1  0x0000555555c13cc9 in qio_channel_socket_writev (ioc=<optimized 
>>> out>, iov=0x5555569a4870, niov=1, fds=0x0, nfds=<optimized out>, 
>>> errp=0x0)
>>>      at ../io/channel-socket.c:561
>>> #2  0x0000555555c19b18 in qio_channel_writev_full_all 
>>> (ioc=0x55555763b800, iov=iov@entry=0x7fffe8dffd80, 
>>> niov=niov@entry=1, fds=fds@entry=0x0,
>>>      nfds=nfds@entry=0, errp=errp@entry=0x0) at ../io/channel.c:240
>>> #3  0x0000555555c19bd2 in qio_channel_writev_all (errp=0x0, niov=1, 
>>> iov=0x7fffe8dffd80, ioc=<optimized out>) at ../io/channel.c:220
>>> #4  qio_channel_write_all (ioc=<optimized out>, 
>>> buf=buf@entry=0x7fffe8dffdd0 "", buflen=buflen@entry=20, 
>>> errp=errp@entry=0x0) at ../io/channel.c:330
>>> #5  0x0000555555c27e75 in nbd_write (errp=0x0, size=20, 
>>> buffer=0x7fffe8dffdd0, ioc=<optimized out>) at ../nbd/nbd-internal.h:71
>>> #6  nbd_negotiate_send_rep_len (client=client@entry=0x555556f60930, 
>>> type=type@entry=1, len=len@entry=0, errp=errp@entry=0x0) at 
>>> ../nbd/server.c:203
>>> #7  0x0000555555c29db1 in nbd_negotiate_send_rep (errp=0x0, type=1, 
>>> client=0x555556f60930) at ../nbd/server.c:211
>>> --Type <RET> for more, q to quit, c to continue without paging--
>>> #8  nbd_negotiate_options (errp=0x7fffe8dffe88, client=<optimized 
>>> out>) at ../nbd/server.c:1224
>>> #9  nbd_negotiate (errp=0x7fffe8dffe88, client=<optimized out>) at 
>>> ../nbd/server.c:1340
>>> #10 nbd_co_client_start (opaque=<optimized out>) at 
>>> ../nbd/server.c:2715
>>> #11 0x0000555555d70423 in coroutine_trampoline (i0=<optimized out>, 
>>> i1=<optimized out>) at ../util/coroutine-ucontext.c:173
>>> #12 0x00007ffff67f3820 in ?? () from target:/lib64/libc.so.6
>>> #13 0x00007fffffffca80 in ?? ()
>>>
>>
>> Ok the reason for this is line 89 of tests/qemu-iotests/307:
>>
>> # Should ignore the iothread conflict, but then fail because of the
>> # permission conflict (and not crash)
>> vm.qmp_log('block-export-add', id='export1', type='nbd', 
>> node_name='null',
>>                iothread='iothread1', fixed_iothread=False, 
>> writable=True)
>>
>> This calls qmp_block_export_add() and then blk_exp_add(), that calls 
>> bdrv_invalidate_cache().
>> Both functions are running in the main thread, but then 
>> bdrv_invalidate_cache invokes bdrv_co_invalidate_cache() as a 
>> coroutine in the AioContext of the given bs, so not the main loop.
>>
>> So Hanna, what should we do here? This seems very similar to the 
>> discussion in patch 22, ie run blockdev-create (in this case 
>> block-export-add, which seems similar?) involving nodes in I/O threads.
>
> Honestly, I’m still thinking about this and haven’t come to a 
> conclusion.  It doesn’t seem invalid to run bdrv_co_invalidate_cache() 
> in the context of the BDS here, but then the assertion that the BDS is 
> already active seems kind of just lucky to work.
>
> I plan to look into whether I can reproduce some case where the 
> assertion doesn’t hold (thinking of migration with a block device in 
> an iothread), and then see what I learn from this. :/

OK, so.  I nearly couldn’t reproduce a case where the assertion doesn’t 
hold, and I was also very close to believing that your condition is 
entirely correct; on migration, we generally inactivate block nodes 
(BlockDriverStates) only after they have been detached from their 
respective block devices, and so they’re in the main context.  On the 
destination, it’s the other way around: We invalidate their caches 
before attaching them to the respective block devices, so they are again 
in the main context.

There are exceptions where we call bdrv_invalidate_cache() on error 
paths and so on, basically just to be sure, while the guest devices are 
connected to block nodes.  But in that case, we have never inactivated 
them (or have already re-activated them), and that’s where the 
`bdrv_is_active()` condition comes in.

But then I wanted to find out what happens when you don’t have a guest 
device, but an NBD export on top of the block node, and this happens:

$ ./qemu-system-x86_64 \
     -incoming defer \
     -qmp stdio \
     -object iothread,id=iothr0 \
     -blockdev null-co,node-name=node0 \
     <<EOF
{"execute": "qmp_capabilities"}
{"execute":"nbd-server-start","arguments":{"addr":{"type":"unix","data":{"path":"/tmp/nbd.sock"}}}}
{"execute":"block-export-add","arguments":{"type":"nbd","id":"exp0","iothread":"iothr0","node-name":"node0","name":"exp0"}}
EOF
{"QMP": {"version": {"qemu": {"micro": 91, "minor": 2, "major": 6}, 
"package": "v6.2.0-rc1-47-g043ab68869"}, "capabilities": ["oob"]}}
{"return": {}}
{"return": {}}
qemu-system-x86_64: ../block.c:6612: bdrv_co_invalidate_cache: Assertion 
`qemu_in_main_thread() || bdrv_is_active(bs)' failed.
[1]    155729 abort (core dumped)  ./qemu-system-x86_64 -incoming defer 
-qmp stdio -object iothread,id=iothr0

The problem here is that blk_exp_add() invalidates the cache after 
moving the node to the target context.  I think we can solve this 
particular problem by simply moving its bdrv_invalidate_cache() call 
before the `if (export->has_iothread)` block above it.
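
Roughly like this (only an ordering sketch, not the literal code of 
blk_exp_add(); ret/new_ctx and the error handling are placeholders):

     /* Activate the node while it is still in the main AioContext, so
      * the permission update runs in the main thread under the BQL */
     ret = bdrv_invalidate_cache(bs, errp);
     if (ret < 0) {
         ...
     }

     if (export->has_iothread) {
         /* Only now move the node to the export's iothread */
         ...
         ret = bdrv_try_set_aio_context(bs, new_ctx, errp);
         ...
     }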

But it does mean the search isn’t over and I’ll need to look a bit 
further still...  At least your assertion condition now makes more sense 
to me.

Hanna
Hanna Czenczek Jan. 19, 2022, 5:44 p.m. UTC | #6
On 19.01.22 16:57, Hanna Reitz wrote:
> On 23.12.21 18:11, Hanna Reitz wrote:
>> On 20.12.21 13:20, Emanuele Giuseppe Esposito wrote:
>>>
>>>
>>> On 17/12/2021 17:38, Emanuele Giuseppe Esposito wrote:
>>>>
>>>>
>>>> On 17/12/2021 12:04, Hanna Reitz wrote:
>>>>> On 24.11.21 07:44, Emanuele Giuseppe Esposito wrote:
>>>>>> bdrv_co_invalidate_cache is special: it is an I/O function,
>>>>>
>>>>> I still don’t believe it is, but well.
>>>>>
>>>>> (Yes, it is called by a test in an iothread, but I believe we’ve 
>>>>> seen that the tests simply sometimes test things that shouldn’t be 
>>>>> allowed.)
>>>>>
>>>>>> but uses the block layer permission API, which is GS.
>>>>>>
>>>>>> Because of this, we can assert that either the function is
>>>>>> being called with BQL held, and thus can use the permission API,
>>>>>> or make sure that the permission API is not used, by ensuring that
>>>>>> bs (and parents) .open_flags does not contain BDRV_O_INACTIVE.
>>>>>>
>>>>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>>>>> ---
>>>>>>   block.c | 26 ++++++++++++++++++++++++++
>>>>>>   1 file changed, 26 insertions(+)
>>>>>>
>>>>>> diff --git a/block.c b/block.c
>>>>>> index a0309f827d..805974676b 100644
>>>>>> --- a/block.c
>>>>>> +++ b/block.c
>>>>>> @@ -6574,6 +6574,26 @@ void bdrv_init_with_whitelist(void)
>>>>>>       bdrv_init();
>>>>>>   }
>>>>>> +static bool bdrv_is_active(BlockDriverState *bs)
>>>>>> +{
>>>>>> +    BdrvChild *parent;
>>>>>> +
>>>>>> +    if (bs->open_flags & BDRV_O_INACTIVE) {
>>>>>> +        return false;
>>>>>> +    }
>>>>>> +
>>>>>> +    QLIST_FOREACH(parent, &bs->parents, next_parent) {
>>>>>> +        if (parent->klass->parent_is_bds) {
>>>>>> +            BlockDriverState *parent_bs = parent->opaque;
>>>>>
>>>>> This looks like a really bad hack to me.  We purposefully have 
>>>>> made the parent link opaque so that a BDS cannot easily reach its 
>>>>> parents.  All accesses should go through BdrvChildClass methods.
>>>>>
>>>>> I also don’t understand why we need to query parents at all. The 
>>>>> only fact that determines whether the current BDS will have its 
>>>>> permissions changed is whether the BDS itself is active or 
>>>>> inactive.  Sure, we’ll invoke bdrv_co_invalidate_cache() on the 
>>>>> parents, too, but then we could simply let the assertion fail there.
>>>>>
>>>>>> +            if (!bdrv_is_active(parent_bs)) {
>>>>>> +                return false;
>>>>>> +            }
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +   return true;
>>>>>> +}
>>>>>> +
>>>>>>   int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, 
>>>>>> Error **errp)
>>>>>>   {
>>>>>>       BdrvChild *child, *parent;
>>>>>> @@ -6585,6 +6605,12 @@ int coroutine_fn 
>>>>>> bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp)
>>>>>>           return -ENOMEDIUM;
>>>>>>       }
>>>>>> +    /*
>>>>>> +     * No need to muck with permissions if bs is active.
>>>>>> +     * TODO: should activation be a separate function?
>>>>>> +     */
>>>>>> +    assert(qemu_in_main_thread() || bdrv_is_active(bs));
>>>>>> +
>>>>>
>>>>> I don’t understand this, really.  It looks to me like “if you 
>>>>> don’t call this in the main thread, this better be a no-op”, i.e., 
>>>>> you must never call this function in an I/O thread if you really 
>>>>> want to use it.  I.e. what I’d classify as a GS function.
>>>>>
>>>>> It sounds like this is just a special case for said test, and 
>>>>> special-casing code for tests sounds like a bad idea.
>>>>
>>>> Ok, but trying to leave just the qemu_in_main_thread() assertion 
>>>> makes test 307 (./check 307) fail.
>>>> I am actually not sure on why it fails, but I am sure it is because 
>>>> of the assertion, since without it it passes.
>>>>
>>>> I tried with gdb (./check -gdb 307 on one terminal and
>>>> gdb -iex "target remote localhost:12345"
>>>> in another) but it points me to this below, which I think is the 
>>>> ndb server getting the socket closed (because on the other side it 
>>>> crashed), and not the actual error.
>>>>
>>>>
>>>> Thread 1 "qemu-system-x86" received signal SIGPIPE, Broken pipe.
>>>> 0x00007ffff68af54d in sendmsg () from target:/lib64/libc.so.6
>>>> (gdb) bt
>>>> #0  0x00007ffff68af54d in sendmsg () from target:/lib64/libc.so.6
>>>> #1  0x0000555555c13cc9 in qio_channel_socket_writev (ioc=<optimized 
>>>> out>, iov=0x5555569a4870, niov=1, fds=0x0, nfds=<optimized out>, 
>>>> errp=0x0)
>>>>      at ../io/channel-socket.c:561
>>>> #2  0x0000555555c19b18 in qio_channel_writev_full_all 
>>>> (ioc=0x55555763b800, iov=iov@entry=0x7fffe8dffd80, 
>>>> niov=niov@entry=1, fds=fds@entry=0x0,
>>>>      nfds=nfds@entry=0, errp=errp@entry=0x0) at ../io/channel.c:240
>>>> #3  0x0000555555c19bd2 in qio_channel_writev_all (errp=0x0, niov=1, 
>>>> iov=0x7fffe8dffd80, ioc=<optimized out>) at ../io/channel.c:220
>>>> #4  qio_channel_write_all (ioc=<optimized out>, 
>>>> buf=buf@entry=0x7fffe8dffdd0 "", buflen=buflen@entry=20, 
>>>> errp=errp@entry=0x0) at ../io/channel.c:330
>>>> #5  0x0000555555c27e75 in nbd_write (errp=0x0, size=20, 
>>>> buffer=0x7fffe8dffdd0, ioc=<optimized out>) at 
>>>> ../nbd/nbd-internal.h:71
>>>> #6  nbd_negotiate_send_rep_len (client=client@entry=0x555556f60930, 
>>>> type=type@entry=1, len=len@entry=0, errp=errp@entry=0x0) at 
>>>> ../nbd/server.c:203
>>>> #7  0x0000555555c29db1 in nbd_negotiate_send_rep (errp=0x0, type=1, 
>>>> client=0x555556f60930) at ../nbd/server.c:211
>>>> --Type <RET> for more, q to quit, c to continue without paging--
>>>> #8  nbd_negotiate_options (errp=0x7fffe8dffe88, client=<optimized 
>>>> out>) at ../nbd/server.c:1224
>>>> #9  nbd_negotiate (errp=0x7fffe8dffe88, client=<optimized out>) at 
>>>> ../nbd/server.c:1340
>>>> #10 nbd_co_client_start (opaque=<optimized out>) at 
>>>> ../nbd/server.c:2715
>>>> #11 0x0000555555d70423 in coroutine_trampoline (i0=<optimized out>, 
>>>> i1=<optimized out>) at ../util/coroutine-ucontext.c:173
>>>> #12 0x00007ffff67f3820 in ?? () from target:/lib64/libc.so.6
>>>> #13 0x00007fffffffca80 in ?? ()
>>>>
>>>
>>> Ok the reason for this is line 89 of tests/qemu-iotests/307:
>>>
>>> # Should ignore the iothread conflict, but then fail because of the
>>> # permission conflict (and not crash)
>>> vm.qmp_log('block-export-add', id='export1', type='nbd', 
>>> node_name='null',
>>>                iothread='iothread1', fixed_iothread=False, 
>>> writable=True)
>>>
>>> This calls qmp_block_export_add() and then blk_exp_add(), that calls 
>>> bdrv_invalidate_cache().
>>> Both functions are running in the main thread, but then 
>>> bdrv_invalidate_cache invokes bdrv_co_invalidate_cache() as a 
>>> coroutine in the AioContext of the given bs, so not the main loop.
>>>
>>> So Hanna, what should we do here? This seems very similar to the 
>>> discussion in patch 22, ie run blockdev-create (in this case 
>>> block-export-add, which seems similar?) involving nodes in I/O threads.
>>
>> Honestly, I’m still thinking about this and haven’t come to a 
>> conclusion.  It doesn’t seem invalid to run 
>> bdrv_co_invalidate_cache() in the context of the BDS here, but then 
>> the assertion that the BDS is already active seems kind of just lucky 
>> to work.
>>
>> I plan to look into whether I can reproduce some case where the 
>> assertion doesn’t hold (thinking of migration with a block device in 
>> an iothread), and then see what I learn from this. :/
>
> OK, so.  I nearly couldn’t reproduce a case where the assertion 
> doesn’t hold, and I was also very close to believing that your 
> condition is entirely correct; on migration, we generally inactivate 
> block nodes (BlockDriverStates) only after they have been detached 
> from their respective block devices, and so they’re in the main 
> context.  On the destination, it’s the other way around: We invalidate 
> their caches before attaching them to the respective block devices, so 
> they are again in the main context.

Correcting myself: It doesn’t have anything to do with 
attaching/detaching; it’s that the guest puts the BDS into the non-main 
context only if the vmstate is “running”.

(virtio_pci_vmstate_change() is invoked on vmstate changes, and 
virtio_blk_class_init() registers virtio_blk_data_plane_{start,stop}() 
as the {start,stop}_ioeventfd callbacks.)

So it’s even better: the devices actually take care to put the BDS nodes 
into the main context while migration is ongoing.

> There are exceptions where we call bdrv_invalidate_cache() on error 
> paths and so on, basically just to be sure, while the guest devices 
> are connected to block nodes.  But in that case, we have never 
> inactivated them (or have already re-activated them), and that’s where 
> the `bdrv_is_active()` condition comes in.
>
> But then I wanted to find out what happens when you don’t have a guest 
> device, but an NBD export on top of the block node, and this happens:
>
> $ ./qemu-system-x86_64 \
>     -incoming defer \
>     -qmp stdio \
>     -object iothread,id=iothr0 \
>     -blockdev null-co,node-name=node0 \
>     <<EOF
> {"execute": "qmp_capabilities"}
> {"execute":"nbd-server-start","arguments":{"addr":{"type":"unix","data":{"path":"/tmp/nbd.sock"}}}} 
>
> {"execute":"block-export-add","arguments":{"type":"nbd","id":"exp0","iothread":"iothr0","node-name":"node0","name":"exp0"}} 
>
> EOF
> {"QMP": {"version": {"qemu": {"micro": 91, "minor": 2, "major": 6}, 
> "package": "v6.2.0-rc1-47-g043ab68869"}, "capabilities": ["oob"]}}
> {"return": {}}
> {"return": {}}
> qemu-system-x86_64: ../block.c:6612: bdrv_co_invalidate_cache: 
> Assertion `qemu_in_main_thread() || bdrv_is_active(bs)' failed.
> [1]    155729 abort (core dumped)  ./qemu-system-x86_64 -incoming 
> defer -qmp stdio -object iothread,id=iothr0
>
> The problem here is that blk_exp_add() invalidates the cache after 
> moving the node to the target context.  I think we can solve this 
> particular problem by simply moving its bdrv_invalidate_cache() call 
> before the `if (export->has_iothread)` block above it.

And that means that this makes even more sense, because if the virtio 
devices take care to keep their block device in the main context while 
migration is ongoing, blk_exp_add() should do the same when it calls a 
“migration function” (bdrv_invalidate_cache()).

> But it does mean the search isn’t over and I’ll need to look a bit 
> further still...

Did look further, couldn’t find anything else interesting.

So I think(TM) that this blk_exp_add() is the only thing that needs fixing.

Hanna
Kevin Wolf Jan. 19, 2022, 6:34 p.m. UTC | #7
Am 20.12.2021 um 13:20 hat Emanuele Giuseppe Esposito geschrieben:
> 
> 
> On 17/12/2021 17:38, Emanuele Giuseppe Esposito wrote:
> > 
> > 
> > On 17/12/2021 12:04, Hanna Reitz wrote:
> > > On 24.11.21 07:44, Emanuele Giuseppe Esposito wrote:
> > > > bdrv_co_invalidate_cache is special: it is an I/O function,
> > > 
> > > I still don’t believe it is, but well.
> > > 
> > > (Yes, it is called by a test in an iothread, but I believe we’ve
> > > seen that the tests simply sometimes test things that shouldn’t be
> > > allowed.)
> > > 
> > > > but uses the block layer permission API, which is GS.
> > > > 
> > > > Because of this, we can assert that either the function is
> > > > being called with BQL held, and thus can use the permission API,
> > > > or make sure that the permission API is not used, by ensuring that
> > > > bs (and parents) .open_flags does not contain BDRV_O_INACTIVE.
> > > > 
> > > > Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> > > > ---
> > > >   block.c | 26 ++++++++++++++++++++++++++
> > > >   1 file changed, 26 insertions(+)
> > > > 
> > > > diff --git a/block.c b/block.c
> > > > index a0309f827d..805974676b 100644
> > > > --- a/block.c
> > > > +++ b/block.c
> > > > @@ -6574,6 +6574,26 @@ void bdrv_init_with_whitelist(void)
> > > >       bdrv_init();
> > > >   }
> > > > +static bool bdrv_is_active(BlockDriverState *bs)
> > > > +{
> > > > +    BdrvChild *parent;
> > > > +
> > > > +    if (bs->open_flags & BDRV_O_INACTIVE) {
> > > > +        return false;
> > > > +    }
> > > > +
> > > > +    QLIST_FOREACH(parent, &bs->parents, next_parent) {
> > > > +        if (parent->klass->parent_is_bds) {
> > > > +            BlockDriverState *parent_bs = parent->opaque;
> > > 
> > > This looks like a really bad hack to me.  We purposefully have made
> > > the parent link opaque so that a BDS cannot easily reach its
> > > parents.  All accesses should go through BdrvChildClass methods.
> > > 
> > > I also don’t understand why we need to query parents at all.  The
> > > only fact that determines whether the current BDS will have its
> > > permissions changed is whether the BDS itself is active or
> > > inactive.  Sure, we’ll invoke bdrv_co_invalidate_cache() on the
> > > parents, too, but then we could simply let the assertion fail there.
> > > 
> > > > +            if (!bdrv_is_active(parent_bs)) {
> > > > +                return false;
> > > > +            }
> > > > +        }
> > > > +    }
> > > > +
> > > > +   return true;
> > > > +}
> > > > +
> > > >   int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState
> > > > *bs, Error **errp)
> > > >   {
> > > >       BdrvChild *child, *parent;
> > > > @@ -6585,6 +6605,12 @@ int coroutine_fn
> > > > bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp)
> > > >           return -ENOMEDIUM;
> > > >       }
> > > > +    /*
> > > > +     * No need to muck with permissions if bs is active.
> > > > +     * TODO: should activation be a separate function?
> > > > +     */
> > > > +    assert(qemu_in_main_thread() || bdrv_is_active(bs));
> > > > +
> > > 
> > > I don’t understand this, really.  It looks to me like “if you don’t
> > > call this in the main thread, this better be a no-op”, i.e., you
> > > must never call this function in an I/O thread if you really want to
> > > use it.  I.e. what I’d classify as a GS function.
> > > 
> > > It sounds like this is just a special case for said test, and
> > > special-casing code for tests sounds like a bad idea.
> > 
> > Ok, but trying to leave just the qemu_in_main_thread() assertion makes
> > test 307 (./check 307) fail.
> > I am actually not sure on why it fails, but I am sure it is because of
> > the assertion, since without it it passes.
> > 
> > I tried with gdb (./check -gdb 307 on one terminal and
> > gdb -iex "target remote localhost:12345"
> > in another) but it points me to this below, which I think is the ndb
> > server getting the socket closed (because on the other side it crashed),
> > and not the actual error.
> > 
> > 
> > Thread 1 "qemu-system-x86" received signal SIGPIPE, Broken pipe.
> > 0x00007ffff68af54d in sendmsg () from target:/lib64/libc.so.6
> > (gdb) bt
> > #0  0x00007ffff68af54d in sendmsg () from target:/lib64/libc.so.6
> > #1  0x0000555555c13cc9 in qio_channel_socket_writev (ioc=<optimized
> > out>, iov=0x5555569a4870, niov=1, fds=0x0, nfds=<optimized out>,
> > errp=0x0)
> >      at ../io/channel-socket.c:561
> > #2  0x0000555555c19b18 in qio_channel_writev_full_all
> > (ioc=0x55555763b800, iov=iov@entry=0x7fffe8dffd80, niov=niov@entry=1,
> > fds=fds@entry=0x0,
> >      nfds=nfds@entry=0, errp=errp@entry=0x0) at ../io/channel.c:240
> > #3  0x0000555555c19bd2 in qio_channel_writev_all (errp=0x0, niov=1,
> > iov=0x7fffe8dffd80, ioc=<optimized out>) at ../io/channel.c:220
> > #4  qio_channel_write_all (ioc=<optimized out>,
> > buf=buf@entry=0x7fffe8dffdd0 "", buflen=buflen@entry=20,
> > errp=errp@entry=0x0) at ../io/channel.c:330
> > #5  0x0000555555c27e75 in nbd_write (errp=0x0, size=20,
> > buffer=0x7fffe8dffdd0, ioc=<optimized out>) at ../nbd/nbd-internal.h:71
> > #6  nbd_negotiate_send_rep_len (client=client@entry=0x555556f60930,
> > type=type@entry=1, len=len@entry=0, errp=errp@entry=0x0) at
> > ../nbd/server.c:203
> > #7  0x0000555555c29db1 in nbd_negotiate_send_rep (errp=0x0, type=1,
> > client=0x555556f60930) at ../nbd/server.c:211
> > --Type <RET> for more, q to quit, c to continue without paging--
> > #8  nbd_negotiate_options (errp=0x7fffe8dffe88, client=<optimized out>)
> > at ../nbd/server.c:1224
> > #9  nbd_negotiate (errp=0x7fffe8dffe88, client=<optimized out>) at
> > ../nbd/server.c:1340
> > #10 nbd_co_client_start (opaque=<optimized out>) at ../nbd/server.c:2715
> > #11 0x0000555555d70423 in coroutine_trampoline (i0=<optimized out>,
> > i1=<optimized out>) at ../util/coroutine-ucontext.c:173
> > #12 0x00007ffff67f3820 in ?? () from target:/lib64/libc.so.6
> > #13 0x00007fffffffca80 in ?? ()
> > 
> 
> Ok the reason for this is line 89 of tests/qemu-iotests/307:
> 
> # Should ignore the iothread conflict, but then fail because of the
> # permission conflict (and not crash)
> vm.qmp_log('block-export-add', id='export1', type='nbd', node_name='null',
>                iothread='iothread1', fixed_iothread=False, writable=True)
> 
> This calls qmp_block_export_add() and then blk_exp_add(), that calls
> bdrv_invalidate_cache().
> Both functions are running in the main thread, but then
> bdrv_invalidate_cache invokes bdrv_co_invalidate_cache() as a coroutine in
> the AioContext of the given bs, so not the main loop.
> 
> So Hanna, what should we do here? This seems very similar to the discussion
> in patch 22, ie run blockdev-create (in this case block-export-add, which
> seems similar?) involving nodes in I/O threads.

My gut feeling is that the bug is that we're entering coroutine context
in the node's iothread too early. I think the only place where we really
need it is the bs->drv->bdrv_co_invalidate_cache() call.

In fact, not only permission code, but also parent->klass->activate()
must be run in the main thread. The only implementation of the callback
is blk_root_activate(), and it calls some permissions functions, too,
but also qemu_add_vm_change_state_handler(), which doesn't look
thread-safe. Unlikely to ever cause problems, and if it does, it won't
be reproducible, but it's still a bug.

So if we go back to a bdrv_invalidate_cache() that does all the graph
manipulations (and asserts that we're in the main loop) and then have a
much smaller bdrv_co_invalidate_cache() that basically just calls into
the driver, would that solve the problem?

Kevin
Paolo Bonzini Jan. 20, 2022, 1:22 p.m. UTC | #8
On 1/19/22 19:34, Kevin Wolf wrote:
> So if we go back to a bdrv_invalidate_cache() that does all the graph
> manipulations (and asserts that we're in the main loop) and then have a
> much smaller bdrv_co_invalidate_cache() that basically just calls into
> the driver, would that solve the problem?

I was going to suggest something similar, and even name the former 
bdrv_activate().  Then bdrv_activate() is a graph manipulation function, 
while bdrv_co_invalidate_cache() is an I/O function.
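
Something like this, I mean (the exact signatures are just a sketch):

   /* Global state (BQL): graph manipulation and permission updates,
    * i.e. what bdrv_invalidate_cache() does today; main thread only. */
   int bdrv_activate(BlockDriverState *bs, Error **errp);

   /* I/O: may run in the node's AioContext and basically just calls
    * into bs->drv->bdrv_co_invalidate_cache(). */
   int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs,
                                             Error **errp);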

Paolo
Kevin Wolf Jan. 20, 2022, 1:48 p.m. UTC | #9
Am 20.01.2022 um 14:22 hat Paolo Bonzini geschrieben:
> On 1/19/22 19:34, Kevin Wolf wrote:
> > So if we go back to a bdrv_invalidate_cache() that does all the graph
> > manipulations (and asserts that we're in the main loop) and then have a
> > much smaller bdrv_co_invalidate_cache() that basically just calls into
> > the driver, would that solve the problem?
> 
> I was going to suggest something similar, and even name the former
> bdrv_activate().  Then bdrv_activate() is a graph manipulation function,
> while bdrv_co_invalidate_cache() is an I/O function.

I like this. The naming inconsistency between inactivate and
invalidate_cache has always bothered me.

Kevin
Emanuele Giuseppe Esposito Jan. 21, 2022, 10:22 a.m. UTC | #10
On 20/01/2022 14:48, Kevin Wolf wrote:
> Am 20.01.2022 um 14:22 hat Paolo Bonzini geschrieben:
>> On 1/19/22 19:34, Kevin Wolf wrote:
>>> So if we go back to a bdrv_invalidate_cache() that does all the graph
>>> manipulations (and asserts that we're in the main loop) and then have a
>>> much smaller bdrv_co_invalidate_cache() that basically just calls into
>>> the driver, would that solve the problem?
>>
>> I was going to suggest something similar, and even name the former
>> bdrv_activate().  Then bdrv_activate() is a graph manipulation function,
>> while bdrv_co_invalidate_cache() is an I/O function.
> 
> I like this. The naming inconsistency between inactivate and
> invalidate_cache has always bothered me.
> 

Ok, I am going to apply this. Thank you for the suggestion!

> Did look further, couldn’t find anything else interesting.
> So I think(TM) that this blk_exp_add() is the only thing that needs fixing.

When splitting the logic between bdrv_activate and 
bdrv_co_invalidate_cache, there is no need to fix blk_exp_add anymore :)

I am going to send v6 by the end of today.

Thank you,
Emanuele

Patch

diff --git a/block.c b/block.c
index a0309f827d..805974676b 100644
--- a/block.c
+++ b/block.c
@@ -6574,6 +6574,26 @@  void bdrv_init_with_whitelist(void)
     bdrv_init();
 }
 
+static bool bdrv_is_active(BlockDriverState *bs)
+{
+    BdrvChild *parent;
+
+    if (bs->open_flags & BDRV_O_INACTIVE) {
+        return false;
+    }
+
+    QLIST_FOREACH(parent, &bs->parents, next_parent) {
+        if (parent->klass->parent_is_bds) {
+            BlockDriverState *parent_bs = parent->opaque;
+            if (!bdrv_is_active(parent_bs)) {
+                return false;
+            }
+        }
+    }
+
+   return true;
+}
+
 int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp)
 {
     BdrvChild *child, *parent;
@@ -6585,6 +6605,12 @@  int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp)
         return -ENOMEDIUM;
     }
 
+    /*
+     * No need to muck with permissions if bs is active.
+     * TODO: should activation be a separate function?
+     */
+    assert(qemu_in_main_thread() || bdrv_is_active(bs));
+
     QLIST_FOREACH(child, &bs->children, next) {
         bdrv_co_invalidate_cache(child->bs, &local_err);
         if (local_err) {