Patchwork [RFC] block: Removed coroutine ownership assumption

login
register
mail settings
Submitter Peter A. G. Crosthwaite
Date June 22, 2012, 6:44 a.m.
Message ID <1340347459-29861-1-git-send-email-peter.crosthwaite@petalogix.com>
Download mbox | patch
Permalink /patch/166523/
State New
Headers show

Comments

Peter A. G. Crosthwaite - June 22, 2012, 6:44 a.m.
The block layer assumes that it is the only user of coroutines -
The qemu_in_coroutine() is used to determine if a function is in one of the
block layers coroutines, which is flawed. I.E. If a client (e.g. a device or
a machine model) of the block layer uses couroutine itself, the block layer
will identify the callers coroutines as its own, and may falsely yield the
calling coroutine (instead of creating its own to yield).

AFAICT, there are no conflicts in the QEMU master here yet, but its kind of an
issue, as anyone who comes along and used coroutines and the block layer
together is going to run into some very obscure and hard to debug race
conditions.

Signed-off-by: Peter A. G. Crosthwaite <peter.crosthwaite@petalogix.com>
---
 block.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)
Kevin Wolf - June 22, 2012, 7:49 a.m.
Am 22.06.2012 08:44, schrieb Peter A. G. Crosthwaite:
> The block layer assumes that it is the only user of coroutines -
> The qemu_in_coroutine() is used to determine if a function is in one of the
> block layers coroutines, which is flawed. I.E. If a client (e.g. a device or
> a machine model) of the block layer uses couroutine itself, the block layer
> will identify the callers coroutines as its own, and may falsely yield the
> calling coroutine (instead of creating its own to yield).
> 
> AFAICT, there are no conflicts in the QEMU master here yet, but its kind of an
> issue, as anyone who comes along and used coroutines and the block layer
> together is going to run into some very obscure and hard to debug race
> conditions.
> 
> Signed-off-by: Peter A. G. Crosthwaite <peter.crosthwaite@petalogix.com>

What does your coroutine caller look like that this is a problem? Does
it make assumptions about the number of yields or anything like that?

The assumption here is not that the block layer owns the coroutine, but
that any code running in a coroutine can yield at any time as long at it
makes sure that eventually the coroutine is reentered. Just like if you
were running in a thread, you certainly wouldn't assume that the block
layer has to create its own thread if it could get preempted, would you?

Can you post some example code that explains the race conditions you're
talking about?

Kevin
Jan Kiszka - June 22, 2012, 7:50 a.m.
On 2012-06-22 08:44, Peter A. G. Crosthwaite wrote:
> The block layer assumes that it is the only user of coroutines -
> The qemu_in_coroutine() is used to determine if a function is in one of the
> block layers coroutines, which is flawed. I.E. If a client (e.g. a device or
> a machine model) of the block layer uses couroutine itself, the block layer
> will identify the callers coroutines as its own, and may falsely yield the
> calling coroutine (instead of creating its own to yield).
> 
> AFAICT, there are no conflicts in the QEMU master here yet, but its kind of an
> issue, as anyone who comes along and used coroutines and the block layer
> together is going to run into some very obscure and hard to debug race
> conditions.

Not sure if I understood the intention yet: Is this supposed to fix an
issue with the current usage of coroutines or to extend their usage
beyond that? In the latter case, please don't do this. We'd rather like
to get rid of them long term.

Jan

> 
> Signed-off-by: Peter A. G. Crosthwaite <peter.crosthwaite@petalogix.com>
> ---
>  block.c |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 0acdcac..b50af15 100644
> --- a/block.c
> +++ b/block.c
> @@ -380,7 +380,7 @@ int bdrv_create(BlockDriver *drv, const char* filename,
>          return -ENOTSUP;
>      }
>  
> -    if (qemu_in_coroutine()) {
> +    if (0) {
>          /* Fast-path if already in coroutine context */
>          bdrv_create_co_entry(&cco);
>      } else {
> @@ -1590,7 +1590,7 @@ static int bdrv_rw_co(BlockDriverState *bs, int64_t sector_num, uint8_t *buf,
>          bdrv_io_limits_disable(bs);
>      }
>  
> -    if (qemu_in_coroutine()) {
> +    if (0) {
>          /* Fast-path if already in coroutine context */
>          bdrv_rw_co_entry(&rwco);
>      } else {
> @@ -3813,7 +3813,7 @@ int bdrv_flush(BlockDriverState *bs)
>          .ret = NOT_DONE,
>      };
>  
> -    if (qemu_in_coroutine()) {
> +    if (0) {
>          /* Fast-path if already in coroutine context */
>          bdrv_flush_co_entry(&rwco);
>      } else {
> @@ -3874,7 +3874,7 @@ int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
>          .ret = NOT_DONE,
>      };
>  
> -    if (qemu_in_coroutine()) {
> +    if (0) {
>          /* Fast-path if already in coroutine context */
>          bdrv_discard_co_entry(&rwco);
>      } else {
>
Peter A. G. Crosthwaite - June 22, 2012, 8 a.m.
On Fri, Jun 22, 2012 at 5:50 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2012-06-22 08:44, Peter A. G. Crosthwaite wrote:
>> The block layer assumes that it is the only user of coroutines -
>> The qemu_in_coroutine() is used to determine if a function is in one of the
>> block layers coroutines, which is flawed. I.E. If a client (e.g. a device or
>> a machine model) of the block layer uses couroutine itself, the block layer
>> will identify the callers coroutines as its own, and may falsely yield the
>> calling coroutine (instead of creating its own to yield).
>>
>> AFAICT, there are no conflicts in the QEMU master here yet, but its kind of an
>> issue, as anyone who comes along and used coroutines and the block layer
>> together is going to run into some very obscure and hard to debug race
>> conditions.
>
> Not sure if I understood the intention yet: Is this supposed to fix an
> issue with the current usage of coroutines or to extend their usage
> beyond that? In the latter case, please don't do this. We'd rather like
> to get rid of them long term.
>

My extended usage, which is under development and not ready for the
list. But are you saying qemu-coroutine is deprecated? If so Ill just
re-impelement my work with threads, mutexes and condition vars, but
coroutines are the most natural way of doing it.

> Jan
>
>>
>> Signed-off-by: Peter A. G. Crosthwaite <peter.crosthwaite@petalogix.com>
>> ---
>>  block.c |    8 ++++----
>>  1 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index 0acdcac..b50af15 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -380,7 +380,7 @@ int bdrv_create(BlockDriver *drv, const char* filename,
>>          return -ENOTSUP;
>>      }
>>
>> -    if (qemu_in_coroutine()) {
>> +    if (0) {
>>          /* Fast-path if already in coroutine context */
>>          bdrv_create_co_entry(&cco);
>>      } else {
>> @@ -1590,7 +1590,7 @@ static int bdrv_rw_co(BlockDriverState *bs, int64_t sector_num, uint8_t *buf,
>>          bdrv_io_limits_disable(bs);
>>      }
>>
>> -    if (qemu_in_coroutine()) {
>> +    if (0) {
>>          /* Fast-path if already in coroutine context */
>>          bdrv_rw_co_entry(&rwco);
>>      } else {
>> @@ -3813,7 +3813,7 @@ int bdrv_flush(BlockDriverState *bs)
>>          .ret = NOT_DONE,
>>      };
>>
>> -    if (qemu_in_coroutine()) {
>> +    if (0) {
>>          /* Fast-path if already in coroutine context */
>>          bdrv_flush_co_entry(&rwco);
>>      } else {
>> @@ -3874,7 +3874,7 @@ int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
>>          .ret = NOT_DONE,
>>      };
>>
>> -    if (qemu_in_coroutine()) {
>> +    if (0) {
>>          /* Fast-path if already in coroutine context */
>>          bdrv_discard_co_entry(&rwco);
>>      } else {
>>
>
> --
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux
>
Kevin Wolf - June 22, 2012, 8:06 a.m.
Am 22.06.2012 10:00, schrieb Peter Crosthwaite:
> On Fri, Jun 22, 2012 at 5:50 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> On 2012-06-22 08:44, Peter A. G. Crosthwaite wrote:
>>> The block layer assumes that it is the only user of coroutines -
>>> The qemu_in_coroutine() is used to determine if a function is in one of the
>>> block layers coroutines, which is flawed. I.E. If a client (e.g. a device or
>>> a machine model) of the block layer uses couroutine itself, the block layer
>>> will identify the callers coroutines as its own, and may falsely yield the
>>> calling coroutine (instead of creating its own to yield).
>>>
>>> AFAICT, there are no conflicts in the QEMU master here yet, but its kind of an
>>> issue, as anyone who comes along and used coroutines and the block layer
>>> together is going to run into some very obscure and hard to debug race
>>> conditions.
>>
>> Not sure if I understood the intention yet: Is this supposed to fix an
>> issue with the current usage of coroutines or to extend their usage
>> beyond that? In the latter case, please don't do this. We'd rather like
>> to get rid of them long term.
>>
> 
> My extended usage, which is under development and not ready for the
> list. But are you saying qemu-coroutine is deprecated? If so Ill just
> re-impelement my work with threads, mutexes and condition vars, but
> coroutines are the most natural way of doing it.

I don't think we're going to drop them as long as they are useful for
someone, and definitely not anytime soon. Possibly at some point, when
the block layer is converted to threads, we could make gthread the only
(or the default) backend, which will impact performance a bit, but works
on more platforms.

Kevin
Peter Maydell - June 22, 2012, 8:16 a.m.
On 22 June 2012 09:00, Peter Crosthwaite
<peter.crosthwaite@petalogix.com> wrote:
> On Fri, Jun 22, 2012 at 5:50 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> Not sure if I understood the intention yet: Is this supposed to fix an
>> issue with the current usage of coroutines or to extend their usage
>> beyond that? In the latter case, please don't do this. We'd rather like
>> to get rid of them long term.

> My extended usage, which is under development and not ready for the
> list. But are you saying qemu-coroutine is deprecated? If so Ill just
> re-impelement my work with threads, mutexes and condition vars, but
> coroutines are the most natural way of doing it.

Basically coroutines are nastily unportable and we've had a set
of problems with them (as witness the fact that we have three
separate backend implementations!). There is supposedly some sort
of migration plan for getting them out of the block layer eventually;
they're a kind of halfway house for avoiding synchronous I/O there
AIUI. I would much prefer not to see any further use of them in new
code. Fundamentally C doesn't support coroutines and it's much better
to just admit that IMHO and use more idiomatic design approaches
instead.

-- PMM
Peter A. G. Crosthwaite - June 22, 2012, 8:20 a.m.
On Fri, Jun 22, 2012 at 5:49 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 22.06.2012 08:44, schrieb Peter A. G. Crosthwaite:
>> The block layer assumes that it is the only user of coroutines -
>> The qemu_in_coroutine() is used to determine if a function is in one of the
>> block layers coroutines, which is flawed. I.E. If a client (e.g. a device or
>> a machine model) of the block layer uses couroutine itself, the block layer
>> will identify the callers coroutines as its own, and may falsely yield the
>> calling coroutine (instead of creating its own to yield).
>>
>> AFAICT, there are no conflicts in the QEMU master here yet, but its kind of an
>> issue, as anyone who comes along and used coroutines and the block layer
>> together is going to run into some very obscure and hard to debug race
>> conditions.
>>
>> Signed-off-by: Peter A. G. Crosthwaite <peter.crosthwaite@petalogix.com>
>
> What does your coroutine caller look like that this is a problem?

Its a machine model that instantiated some block devices concurrently.
Theres some chicken-and-egg issues with the instantiation such that
they have the happen concurrently. One device instantiates a block
device (pflash_cfi_01) from coroutine context. block then check
qemu_in_coroutine() which returns true so it uses my coroutine for its
inner workings, whereas if it were a normal machine model it would
kick of its own coroutine to do its block stuff.

 Does
> it make assumptions about the number of yields or anything like that?

Yes it does, but I thought that was the point of coroutines vs
threads? coroutines you control yield behaviour explicitly whearas
thread you cant?

>
> The assumption here is not that the block layer owns the coroutine, but
> that any code running in a coroutine can yield

Yield to who tho? A coroutine yield transfers control back to the
context that spawned it. For correct functionality, the block layer
should yielf back to whatever context called it no? Which would mean
that the the block layer when yielding should transfer back to
pflash_cfi_01_register. But because it was called from a coroutine in
the first place, it yields control all the way back my machine model
(which then breaks).

at any time as long at it
> makes sure that eventually the coroutine is reentered. Just like if you
> were running in a thread, you certainly wouldn't assume that the block
> layer has to create its own thread if it could get preempted, would you?
>
> Can you post some example code that explains the race conditions you're
> talking about?
>
> Kevin
Peter A. G. Crosthwaite - June 22, 2012, 8:23 a.m.
On Fri, Jun 22, 2012 at 6:16 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 22 June 2012 09:00, Peter Crosthwaite
> <peter.crosthwaite@petalogix.com> wrote:
>> On Fri, Jun 22, 2012 at 5:50 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>> Not sure if I understood the intention yet: Is this supposed to fix an
>>> issue with the current usage of coroutines or to extend their usage
>>> beyond that? In the latter case, please don't do this. We'd rather like
>>> to get rid of them long term.
>
>> My extended usage, which is under development and not ready for the
>> list. But are you saying qemu-coroutine is deprecated? If so Ill just
>> re-impelement my work with threads, mutexes and condition vars, but
>> coroutines are the most natural way of doing it.
>
> Basically coroutines are nastily unportable and we've had a set
> of problems with them (as witness the fact that we have three
> separate backend implementations!). There is supposedly some sort
> of migration plan for getting them out of the block layer eventually;
> they're a kind of halfway house for avoiding synchronous I/O there
> AIUI. I would much prefer not to see any further use of them in new
> code. Fundamentally C doesn't support coroutines and it's much better
> to just admit that IMHO and use more idiomatic design approaches
> instead.
>

Sounds depracted to me :) Can we add some comments to coroutine?

> -- PMM
Peter Maydell - June 22, 2012, 8:31 a.m.
On 22 June 2012 09:20, Peter Crosthwaite
<peter.crosthwaite@petalogix.com> wrote:
> Its a machine model that instantiated some block devices concurrently.
> Theres some chicken-and-egg issues with the instantiation such that
> they have the happen concurrently. One device instantiates a block
> device (pflash_cfi_01) from coroutine context. block then check
> qemu_in_coroutine() which returns true so it uses my coroutine for its
> inner workings, whereas if it were a normal machine model it would
> kick of its own coroutine to do its block stuff.

I think the problem here is trying to instantiate bits of the
machine model inside coroutines -- that just sounds wrong to me.
Model and device instantiate should be a straightforward single
threaded bit of code; if it isn't then something needs to be
straightened out so you don't have your chicken-and-egg problem.

-- PMM
Stefan Hajnoczi - June 22, 2012, 8:33 a.m.
On Fri, Jun 22, 2012 at 9:16 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 22 June 2012 09:00, Peter Crosthwaite
> <peter.crosthwaite@petalogix.com> wrote:
>> On Fri, Jun 22, 2012 at 5:50 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>> Not sure if I understood the intention yet: Is this supposed to fix an
>>> issue with the current usage of coroutines or to extend their usage
>>> beyond that? In the latter case, please don't do this. We'd rather like
>>> to get rid of them long term.
>
>> My extended usage, which is under development and not ready for the
>> list. But are you saying qemu-coroutine is deprecated? If so Ill just
>> re-impelement my work with threads, mutexes and condition vars, but
>> coroutines are the most natural way of doing it.
>
> Basically coroutines are nastily unportable and we've had a set
> of problems with them (as witness the fact that we have three
> separate backend implementations!). There is supposedly some sort
> of migration plan for getting them out of the block layer eventually;
> they're a kind of halfway house for avoiding synchronous I/O there
> AIUI. I would much prefer not to see any further use of them in new
> code. Fundamentally C doesn't support coroutines and it's much better
> to just admit that IMHO and use more idiomatic design approaches
> instead.

If you avoid coroutines and write asynchronous code it will be
*harder* to convert to threads!  IMO it's better to make use of
coroutines for now and convert to threads as soon as the global mutex
has been pushed out into device emulation.

Yes, they require retargeting to different host platforms, but we have that now.

Stefan
Stefan Hajnoczi - June 22, 2012, 8:34 a.m.
On Fri, Jun 22, 2012 at 9:31 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 22 June 2012 09:20, Peter Crosthwaite
> <peter.crosthwaite@petalogix.com> wrote:
>> Its a machine model that instantiated some block devices concurrently.
>> Theres some chicken-and-egg issues with the instantiation such that
>> they have the happen concurrently. One device instantiates a block
>> device (pflash_cfi_01) from coroutine context. block then check
>> qemu_in_coroutine() which returns true so it uses my coroutine for its
>> inner workings, whereas if it were a normal machine model it would
>> kick of its own coroutine to do its block stuff.
>
> I think the problem here is trying to instantiate bits of the
> machine model inside coroutines -- that just sounds wrong to me.
> Model and device instantiate should be a straightforward single
> threaded bit of code; if it isn't then something needs to be
> straightened out so you don't have your chicken-and-egg problem.

Yes, I agree with Peter here.

Stefan
Kevin Wolf - June 22, 2012, 8:45 a.m.
Am 22.06.2012 10:16, schrieb Peter Maydell:
> On 22 June 2012 09:00, Peter Crosthwaite
> <peter.crosthwaite@petalogix.com> wrote:
>> On Fri, Jun 22, 2012 at 5:50 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>> Not sure if I understood the intention yet: Is this supposed to fix an
>>> issue with the current usage of coroutines or to extend their usage
>>> beyond that? In the latter case, please don't do this. We'd rather like
>>> to get rid of them long term.
> 
>> My extended usage, which is under development and not ready for the
>> list. But are you saying qemu-coroutine is deprecated? If so Ill just
>> re-impelement my work with threads, mutexes and condition vars, but
>> coroutines are the most natural way of doing it.
> 
> Basically coroutines are nastily unportable and we've had a set
> of problems with them (as witness the fact that we have three
> separate backend implementations!). There is supposedly some sort
> of migration plan for getting them out of the block layer eventually;
> they're a kind of halfway house for avoiding synchronous I/O there
> AIUI. I would much prefer not to see any further use of them in new
> code. Fundamentally C doesn't support coroutines and it's much better
> to just admit that IMHO and use more idiomatic design approaches
> instead.

Our default coroutine implementation isn't really as portable as we
wish, but while we use this in order to achieve the best performance,
long term we could make gthread the default backend and keep coroutines
where they are useful.

The gthread backend is slower than the ucontext/sigaltstack ones, but it
shouldn't make a difference in performance compared to using threads and
putting exactly the same locking in place open-coded like Peter
mentioned as an alternative approach.

Kevin
Markus Armbruster - June 22, 2012, 8:48 a.m.
Peter Maydell <peter.maydell@linaro.org> writes:

> On 22 June 2012 09:00, Peter Crosthwaite
> <peter.crosthwaite@petalogix.com> wrote:
>> On Fri, Jun 22, 2012 at 5:50 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>> Not sure if I understood the intention yet: Is this supposed to fix an
>>> issue with the current usage of coroutines or to extend their usage
>>> beyond that? In the latter case, please don't do this. We'd rather like
>>> to get rid of them long term.
>
>> My extended usage, which is under development and not ready for the
>> list. But are you saying qemu-coroutine is deprecated? If so Ill just
>> re-impelement my work with threads, mutexes and condition vars, but
>> coroutines are the most natural way of doing it.
>
> Basically coroutines are nastily unportable and we've had a set
> of problems with them (as witness the fact that we have three
> separate backend implementations!). There is supposedly some sort
> of migration plan for getting them out of the block layer eventually;
> they're a kind of halfway house for avoiding synchronous I/O there
> AIUI. I would much prefer not to see any further use of them in new
> code. Fundamentally C doesn't support coroutines and it's much better
> to just admit that IMHO and use more idiomatic design approaches
> instead.

I think you're overstating your case.  People have been doing coroutines
in C for decades, on a huge range of machines.  They wouldn't have done
so if it wasn't worth their while.  Not what I'd call a "fundamental"
problem.

In my opinion, coroutines have been useful for us so far.  Whether they
remain useful, or serve us just as a stepping stone towards general
threads remains to be seen.

As Kevin said, there are no concrete plans to convert the block layer
away from coroutines.
Kevin Wolf - June 22, 2012, 8:53 a.m.
Am 22.06.2012 10:20, schrieb Peter Crosthwaite:
> On Fri, Jun 22, 2012 at 5:49 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 22.06.2012 08:44, schrieb Peter A. G. Crosthwaite:
>>> The block layer assumes that it is the only user of coroutines -
>>> The qemu_in_coroutine() is used to determine if a function is in one of the
>>> block layers coroutines, which is flawed. I.E. If a client (e.g. a device or
>>> a machine model) of the block layer uses couroutine itself, the block layer
>>> will identify the callers coroutines as its own, and may falsely yield the
>>> calling coroutine (instead of creating its own to yield).
>>>
>>> AFAICT, there are no conflicts in the QEMU master here yet, but its kind of an
>>> issue, as anyone who comes along and used coroutines and the block layer
>>> together is going to run into some very obscure and hard to debug race
>>> conditions.
>>>
>>> Signed-off-by: Peter A. G. Crosthwaite <peter.crosthwaite@petalogix.com>
>>
>> What does your coroutine caller look like that this is a problem?
> 
> Its a machine model that instantiated some block devices concurrently.
> Theres some chicken-and-egg issues with the instantiation such that
> they have the happen concurrently. One device instantiates a block
> device (pflash_cfi_01) from coroutine context. block then check
> qemu_in_coroutine() which returns true so it uses my coroutine for its
> inner workings, whereas if it were a normal machine model it would
> kick of its own coroutine to do its block stuff.
> 
>  Does
>> it make assumptions about the number of yields or anything like that?
> 
> Yes it does, but I thought that was the point of coroutines vs
> threads? coroutines you control yield behaviour explicitly whearas
> thread you cant?

Well, I can see your point, although today no code in qemu makes use of
the property that the caller could in theory know where the coroutine
yields. I think it's also dangerous because there is a non-obvious
dependency of the caller on the internals of the coroutine. A simple
innocent looking refactoring of the coroutine might break assumptions
that are made here.

Other code in qemu that uses coroutines only makes use of the fact that
coroutines can only be interrupted at known points, so synchronisation
between multiple coroutines becomes easier.

Kevin
Peter Maydell - June 22, 2012, 9:06 a.m.
On 22 June 2012 09:48, Markus Armbruster <armbru@redhat.com> wrote:
> In my opinion, coroutines have been useful for us so far.  Whether they
> remain useful, or serve us just as a stepping stone towards general
> threads remains to be seen.

From my point of view I've seen a whole pile of problems and not
really any advantages... I particularly think it's a really bad
idea to have a complex and potentially race-condition-prone bit
of infrastructure implemented three different ways rather than
having one implementation used everywhere -- it's just asking
for obscure bugs on the non-x86 hosts.

Really it just breaks the general rule I prefer to follow that
you should write your code in the 'mainstream' of an API/platform;
if you head too close to the shallows you're liable to hit a rock.

-- PMM
Peter A. G. Crosthwaite - June 22, 2012, 10:59 a.m.
On Fri, Jun 22, 2012 at 6:53 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 22.06.2012 10:20, schrieb Peter Crosthwaite:
>> On Fri, Jun 22, 2012 at 5:49 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> Am 22.06.2012 08:44, schrieb Peter A. G. Crosthwaite:
>>>> The block layer assumes that it is the only user of coroutines -
>>>> The qemu_in_coroutine() is used to determine if a function is in one of the
>>>> block layers coroutines, which is flawed. I.E. If a client (e.g. a device or
>>>> a machine model) of the block layer uses couroutine itself, the block layer
>>>> will identify the callers coroutines as its own, and may falsely yield the
>>>> calling coroutine (instead of creating its own to yield).
>>>>
>>>> AFAICT, there are no conflicts in the QEMU master here yet, but its kind of an
>>>> issue, as anyone who comes along and used coroutines and the block layer
>>>> together is going to run into some very obscure and hard to debug race
>>>> conditions.
>>>>
>>>> Signed-off-by: Peter A. G. Crosthwaite <peter.crosthwaite@petalogix.com>
>>>
>>> What does your coroutine caller look like that this is a problem?
>>
>> Its a machine model that instantiated some block devices concurrently.
>> Theres some chicken-and-egg issues with the instantiation such that
>> they have the happen concurrently. One device instantiates a block
>> device (pflash_cfi_01) from coroutine context. block then check
>> qemu_in_coroutine() which returns true so it uses my coroutine for its
>> inner workings, whereas if it were a normal machine model it would
>> kick of its own coroutine to do its block stuff.
>>
>>  Does
>>> it make assumptions about the number of yields or anything like that?
>>
>> Yes it does, but I thought that was the point of coroutines vs
>> threads? coroutines you control yield behaviour explicitly whearas
>> thread you cant?
>
> Well, I can see your point, although today no code in qemu makes use of
> the property that the caller could in theory know where the coroutine
> yields. I think it's also dangerous because there is a non-obvious
> dependency of the caller on the internals of the coroutine. A simple
> innocent looking refactoring of the coroutine might break assumptions
> that are made here.
>
> Other code in qemu that uses coroutines only makes use of the fact that
> coroutines can only be interrupted at known points,

So within the block layer, will the block coroutines yield anywhere or
only at a qemu_coroutine_yield() site? Would the block layer break if
a couroutine could be pre-empted by the OS anywhere?

So lets continue this to an example of multiple clients of
qemu-coroutine. The block layer is one client. It defines three
possible yields points (qemu_co_yield() sites) in block.c. Lets call
them A,B and C. The co-routine policy is that your thread of control
will not be pre-empted until locations  A, B or C are reached (I.E.
coroutines can only be interrupted at known points). Alls fair and it
works.

Now here is where it breaks. I have a device or machine model or
whatever (lets call it foo) that uses co-routines. It decides that it
wants its coroutines to stop at only at points D,E and F for ease of
synchronization. If those coroutines however make calls into to the
block layer, the block layer will think its in its own co-routine and
potentially yield at any of points A,B and C. Thing is, my foo system
doesn't care about the block layer implementation, so it shouldnt have
awareness of the fact it used co-routines too. But because it talks to
block, it can get yielded as any call into the block API which is
awkward considering the point of coroutines is they can only be
interrupted at known points. You have essentially defeated the purpose
or coroutines in the first place. Foo's coroutines are behaving like
preemptible threads (while blocks are behaving like coroutines).

The problem is solved with a simple piece of policy - dont yield
someone elses co-routine. Thats this patch. Note that this is an RFC
intended to start discussion - it needs a real implementation.
Something along the lines of keeping track of the block layer
coroutines and checking if in one of those specifically, rather then a
blanket call to qemu_in_coroutine(). But I have this patch in my
workarounds branch until we figure out a real solution.

Regards,
Peter

 so synchronisation
> between multiple coroutines becomes easier.
>
> Kevin
Markus Armbruster - June 22, 2012, 12:04 p.m.
Peter Maydell <peter.maydell@linaro.org> writes:

> On 22 June 2012 09:48, Markus Armbruster <armbru@redhat.com> wrote:
>> In my opinion, coroutines have been useful for us so far.  Whether they
>> remain useful, or serve us just as a stepping stone towards general
>> threads remains to be seen.
>
>>From my point of view I've seen a whole pile of problems and not
> really any advantages...

Advantages over what?

>                          I particularly think it's a really bad
> idea to have a complex and potentially race-condition-prone bit
> of infrastructure implemented three different ways rather than
> having one implementation used everywhere -- it's just asking
> for obscure bugs on the non-x86 hosts.

Fair point, but it's an implementation problem, not a fundamental
problem with coroutines.  You *can* implement coroutines portably,
e.g. on top of gthread.

But there's a portability / speed tradeoff.  Kevin already explained we
chose speed over portability initially, and that choice is open to
revision.

> Really it just breaks the general rule I prefer to follow that
> you should write your code in the 'mainstream' of an API/platform;
> if you head too close to the shallows you're liable to hit a rock.

It's a good rule.  Like for most rules, there are exceptions.
Peter Maydell - June 22, 2012, 12:30 p.m.
On 22 June 2012 13:04, Markus Armbruster <armbru@redhat.com> wrote:
> Peter Maydell <peter.maydell@linaro.org> writes:
>
>> On 22 June 2012 09:48, Markus Armbruster <armbru@redhat.com> wrote:
>>> In my opinion, coroutines have been useful for us so far.  Whether they
>>> remain useful, or serve us just as a stepping stone towards general
>>> threads remains to be seen.
>>
>>>From my point of view I've seen a whole pile of problems and not
>> really any advantages...
>
> Advantages over what?

Over what we had before we had coroutines. I know there are
advantages, I'm just saying that personally I've been largely
on the downside rather than the upside.

>>                          I particularly think it's a really bad
>> idea to have a complex and potentially race-condition-prone bit
>> of infrastructure implemented three different ways rather than
>> having one implementation used everywhere -- it's just asking
>> for obscure bugs on the non-x86 hosts.
>
> Fair point, but it's an implementation problem, not a fundamental
> problem with coroutines.  You *can* implement coroutines portably,
> e.g. on top of gthread.

If you're implementing them on top of separate threads then
you just have an obfuscated API to threads.

-- PMM
Markus Armbruster - June 22, 2012, 1:36 p.m.
Peter Maydell <peter.maydell@linaro.org> writes:

> On 22 June 2012 13:04, Markus Armbruster <armbru@redhat.com> wrote:
>> Peter Maydell <peter.maydell@linaro.org> writes:
>>
>>> On 22 June 2012 09:48, Markus Armbruster <armbru@redhat.com> wrote:
>>>> In my opinion, coroutines have been useful for us so far.  Whether they
>>>> remain useful, or serve us just as a stepping stone towards general
>>>> threads remains to be seen.
>>>
>>>>From my point of view I've seen a whole pile of problems and not
>>> really any advantages...
>>
>> Advantages over what?
>
> Over what we had before we had coroutines. I know there are
> advantages, I'm just saying that personally I've been largely
> on the downside rather than the upside.

Granted.  Just saying that there's an upside, too :)

>>>                          I particularly think it's a really bad
>>> idea to have a complex and potentially race-condition-prone bit
>>> of infrastructure implemented three different ways rather than
>>> having one implementation used everywhere -- it's just asking
>>> for obscure bugs on the non-x86 hosts.
>>
>> Fair point, but it's an implementation problem, not a fundamental
>> problem with coroutines.  You *can* implement coroutines portably,
>> e.g. on top of gthread.
>
> If you're implementing them on top of separate threads then
> you just have an obfuscated API to threads.

No, what you really have is yet another means to express control flow.
The fact that it can be implemented on top of threads doesn't
necessarily make it an obfuscated API to threads.

Look, a while loop isn't an obfuscated API to continuations, either.
It's less general, but when it's all you need, it's easier to use.

When all you need is coroutines, not having to worry about preemption
makes them easier to use in my experience.
Peter A. G. Crosthwaite - June 27, 2012, 3:05 a.m.
Ping!

Any Further thoughts Kevin?

This thread flew off on a tangent over whether or not coroutines
should be depracated. Seems to be the answer there was no on that
front - coroutines are here to stay.

I still think this thread points out a major flaw in block+coroutines,
regardless of the fact im using it from a machine model. This bug is
going to happen in any coroutine code that touches the block layer
(E.G. what happens if the next developer wants to implement a device
using coroutines?). Yes, without my full series there is no bug today,
but im just trying to save the next developer who decides to use
corourites (whether that be in tree or out of tree) the potentially
several hours of debugging around "why did my coroutine get yielded
randomly". That and of course minimisation of my own mainline diff.

Regards,
Peter

On Fri, Jun 22, 2012 at 8:59 PM, Peter Crosthwaite
<peter.crosthwaite@petalogix.com> wrote:
> On Fri, Jun 22, 2012 at 6:53 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 22.06.2012 10:20, schrieb Peter Crosthwaite:
>>> On Fri, Jun 22, 2012 at 5:49 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>> Am 22.06.2012 08:44, schrieb Peter A. G. Crosthwaite:
>>>>> The block layer assumes that it is the only user of coroutines -
>>>>> The qemu_in_coroutine() is used to determine if a function is in one of the
>>>>> block layers coroutines, which is flawed. I.E. If a client (e.g. a device or
>>>>> a machine model) of the block layer uses couroutine itself, the block layer
>>>>> will identify the callers coroutines as its own, and may falsely yield the
>>>>> calling coroutine (instead of creating its own to yield).
>>>>>
>>>>> AFAICT, there are no conflicts in the QEMU master here yet, but its kind of an
>>>>> issue, as anyone who comes along and used coroutines and the block layer
>>>>> together is going to run into some very obscure and hard to debug race
>>>>> conditions.
>>>>>
>>>>> Signed-off-by: Peter A. G. Crosthwaite <peter.crosthwaite@petalogix.com>
>>>>
>>>> What does your coroutine caller look like that this is a problem?
>>>
>>> Its a machine model that instantiated some block devices concurrently.
>>> Theres some chicken-and-egg issues with the instantiation such that
>>> they have the happen concurrently. One device instantiates a block
>>> device (pflash_cfi_01) from coroutine context. block then check
>>> qemu_in_coroutine() which returns true so it uses my coroutine for its
>>> inner workings, whereas if it were a normal machine model it would
>>> kick of its own coroutine to do its block stuff.
>>>
>>>  Does
>>>> it make assumptions about the number of yields or anything like that?
>>>
>>> Yes it does, but I thought that was the point of coroutines vs
>>> threads? coroutines you control yield behaviour explicitly whearas
>>> thread you cant?
>>
>> Well, I can see your point, although today no code in qemu makes use of
>> the property that the caller could in theory know where the coroutine
>> yields. I think it's also dangerous because there is a non-obvious
>> dependency of the caller on the internals of the coroutine. A simple
>> innocent looking refactoring of the coroutine might break assumptions
>> that are made here.
>>
>> Other code in qemu that uses coroutines only makes use of the fact that
>> coroutines can only be interrupted at known points,
>
> So within the block layer, will the block coroutines yield anywhere or
> only at a qemu_coroutine_yield() site? Would the block layer break if
> a couroutine could be pre-empted by the OS anywhere?
>
> So lets continue this to an example of multiple clients of
> qemu-coroutine. The block layer is one client. It defines three
> possible yields points (qemu_co_yield() sites) in block.c. Lets call
> them A,B and C. The co-routine policy is that your thread of control
> will not be pre-empted until locations  A, B or C are reached (I.E.
> coroutines can only be interrupted at known points). Alls fair and it
> works.
>
> Now here is where it breaks. I have a device or machine model or
> whatever (lets call it foo) that uses co-routines. It decides that it
> wants its coroutines to stop at only at points D,E and F for ease of
> synchronization. If those coroutines however make calls into to the
> block layer, the block layer will think its in its own co-routine and
> potentially yield at any of points A,B and C. Thing is, my foo system
> doesn't care about the block layer implementation, so it shouldnt have
> awareness of the fact it used co-routines too. But because it talks to
> block, it can get yielded as any call into the block API which is
> awkward considering the point of coroutines is they can only be
> interrupted at known points. You have essentially defeated the purpose
> or coroutines in the first place. Foo's coroutines are behaving like
> preemptible threads (while blocks are behaving like coroutines).
>
> The problem is solved with a simple piece of policy - dont yield
> someone elses co-routine. Thats this patch. Note that this is an RFC
> intended to start discussion - it needs a real implementation.
> Something along the lines of keeping track of the block layer
> coroutines and checking if in one of those specifically, rather then a
> blanket call to qemu_in_coroutine(). But I have this patch in my
> workarounds branch until we figure out a real solution.
>
> Regards,
> Peter
>
>  so synchronisation
>> between multiple coroutines becomes easier.
>>
>> Kevin
Stefan Hajnoczi - June 27, 2012, 7:48 a.m.
On Wed, Jun 27, 2012 at 4:05 AM, Peter Crosthwaite
<peter.crosthwaite@petalogix.com> wrote:
> I still think this thread points out a major flaw in block+coroutines,
> regardless of the fact im using it from a machine model. This bug is
> going to happen in any coroutine code that touches the block layer
> (E.G. what happens if the next developer wants to implement a device
> using coroutines?). Yes, without my full series there is no bug today,
> but im just trying to save the next developer who decides to use
> corourites (whether that be in tree or out of tree) the potentially
> several hours of debugging around "why did my coroutine get yielded
> randomly". That and of course minimisation of my own mainline diff.

The if (qemu_is_coroutine()) "fastpath" taken by the block layer today
hopefully won't be around forever.  It's really a shortcut that allows
code originally written with synchronous I/O in mind to work
unmodified in a coroutine.

Really we should get rid of bdrv_read() and friends so that all
callers use either bdrv_aio_*() or bdrv_co_*().  Then all functions
that yield will be marked coroutine_fn.  Then you know for sure that
the function may yield and you cannot rely on it not yielding.

I'd like to see your code though because I still don't understand why
it relies on the exact yield behavior.  Have you pushed it to a public
git repo?

Stefan
Peter Maydell - June 27, 2012, 7:59 a.m.
On 27 June 2012 08:48, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> I'd like to see your code though because I still don't understand why
> it relies on the exact yield behavior.  Have you pushed it to a public
> git repo?

I haven't seen Peter's code either, but his complaint makes sense
to me -- the whole point of coroutines is that you can rely on
the exact yield behaviour, surely.

-- PMM
Stefan Hajnoczi - June 27, 2012, 8:11 a.m.
On Wed, Jun 27, 2012 at 8:59 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 27 June 2012 08:48, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> I'd like to see your code though because I still don't understand why
>> it relies on the exact yield behavior.  Have you pushed it to a public
>> git repo?
>
> I haven't seen Peter's code either, but his complaint makes sense
> to me -- the whole point of coroutines is that you can rely on
> the exact yield behaviour, surely.

Not if you call coroutine_fn functions - these are explicitly marked
as functions that yield.  For example block or net I/O.

The issue here is that there are some block.h functions that are not
marked coroutine_fn but actually yield if running inside coroutine
context.  I think we could get rid of them with a little bit of work.

Stefan
Markus Armbruster - June 27, 2012, 8:33 a.m.
Stefan Hajnoczi <stefanha@gmail.com> writes:

> On Wed, Jun 27, 2012 at 8:59 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
>> On 27 June 2012 08:48, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>> I'd like to see your code though because I still don't understand why
>>> it relies on the exact yield behavior.  Have you pushed it to a public
>>> git repo?
>>
>> I haven't seen Peter's code either, but his complaint makes sense
>> to me -- the whole point of coroutines is that you can rely on
>> the exact yield behaviour, surely.
>
> Not if you call coroutine_fn functions - these are explicitly marked
> as functions that yield.  For example block or net I/O.

I think you two are in violent agreement :)

With coroutines, you can rely on the exact yield behavior.  Of course,
that doesn't do you any good unless you know which functions can yield.
We make that easy by giving such functions distinctive names.

Except when we don't:

> The issue here is that there are some block.h functions that are not
> marked coroutine_fn but actually yield if running inside coroutine
> context.  I think we could get rid of them with a little bit of work.

Sounds like a good idea.
Andreas Färber - June 27, 2012, 9:10 a.m.
Am 22.06.2012 12:59, schrieb Peter Crosthwaite:
> On Fri, Jun 22, 2012 at 6:53 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 22.06.2012 10:20, schrieb Peter Crosthwaite:
>>> [...] I thought that was the point of coroutines vs
>>> threads? coroutines you control yield behaviour explicitly whearas
>>> thread you cant?
>>
>> Well, I can see your point, although today no code in qemu makes use of
>> the property that the caller could in theory know where the coroutine
>> yields. I think it's also dangerous because there is a non-obvious
>> dependency of the caller on the internals of the coroutine. A simple
>> innocent looking refactoring of the coroutine might break assumptions
>> that are made here.
>>
>> Other code in qemu that uses coroutines only makes use of the fact that
>> coroutines can only be interrupted at known points,
> 
> So within the block layer, will the block coroutines yield anywhere or
> only at a qemu_coroutine_yield() site? Would the block layer break if
> a couroutine could be pre-empted by the OS anywhere?
> 
> So lets continue this to an example of multiple clients of
> qemu-coroutine. The block layer is one client. It defines three
> possible yields points (qemu_co_yield() sites) in block.c. Lets call
> them A,B and C. The co-routine policy is that your thread of control
> will not be pre-empted until locations  A, B or C are reached (I.E.
> coroutines can only be interrupted at known points). Alls fair and it
> works.
> 
> Now here is where it breaks. I have a device or machine model or
> whatever (lets call it foo) that uses co-routines. It decides that it
> wants its coroutines to stop at only at points D,E and F for ease of
> synchronization. If those coroutines however make calls into to the
> block layer, the block layer will think its in its own co-routine and
> potentially yield at any of points A,B and C. Thing is, my foo system
> doesn't care about the block layer implementation, so it shouldnt have
> awareness of the fact it used co-routines too. But because it talks to
> block, it can get yielded as any call into the block API which is
> awkward considering the point of coroutines is they can only be
> interrupted at known points. You have essentially defeated the purpose
> or coroutines in the first place. Foo's coroutines are behaving like
> preemptible threads (while blocks are behaving like coroutines).
[snip]

Maybe I'm misunderstanding the discussion, but there's three coroutine
implementations, including one based on GThread. So in that case any
coroutine can be preempted anywhere and there are no such guarantees as
discussed here, are there?

Andreas
Peter Maydell - June 27, 2012, 9:14 a.m.
On 27 June 2012 10:10, Andreas Färber <afaerber@suse.de> wrote:
> Maybe I'm misunderstanding the discussion, but there's three coroutine
> implementations, including one based on GThread. So in that case any
> coroutine can be preempted anywhere and there are no such guarantees as
> discussed here, are there?

No, because the gthread-based implementation carefully ensures
that it's only actually allowing one thread to execute at once:
qemu_coroutine_switch() will make the calling thread block until
some other coroutine switches back to it. Any failure to enforce
the coroutine yield semantics would be a bug (and an indication
of why having three separate backends is a bad idea :-))

-- PMM
Peter A. G. Crosthwaite - June 27, 2012, 9:25 a.m.
On Wed, Jun 27, 2012 at 6:33 PM, Markus Armbruster <armbru@redhat.com> wrote:
> Stefan Hajnoczi <stefanha@gmail.com> writes:
>
>> On Wed, Jun 27, 2012 at 8:59 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
>>> On 27 June 2012 08:48, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>> I'd like to see your code though because I still don't understand why
>>>> it relies on the exact yield behavior.  Have you pushed it to a public
>>>> git repo?
>>>
>>> I haven't seen Peter's code either, but his complaint makes sense
>>> to me -- the whole point of coroutines is that you can rely on
>>> the exact yield behaviour, surely.
>>
>> Not if you call coroutine_fn functions - these are explicitly marked
>> as functions that yield.  For example block or net I/O.

Thats what I mean by "assuming ownership of coroutines". If block
marks half its API as coroutine_fn, then essentially you are saying
block is mutually exclusive with other users of coroutine. Is it
really that hard for the block layer to keep track of its own little
collection of coroutines and just check that it owns the current
context before taking the fast path?

>
> I think you two are in violent agreement :)
>
> With coroutines, you can rely on the exact yield behavior.  Of course,
> that doesn't do you any good unless you know which functions can yield.
> We make that easy by giving such functions distinctive names.

That brings a new problem, best illustrated by example. Suppose I
ensure myself against yielding something like this:

foo(void *opaque) {
   bdrv_foo(...); //this may yield!
   *((int*)opaque)++; //state++
   .... /* some coroutine behaviour */
}

<<from my device>>
int state = 0;
foo_co = qemu_couroutine_new(foo_fn);
qemu_coroutine_enter(foo_co, &state);
while (state < 1) {
    qemu_coroutined_enter(foo_co);
}
... /* the other half of some coroutine behvaiour */

I have insured myself against bdrv_yielding by renentering it if im
not at a valid yield point. But whats really wrong here is the block
layer will be making assumption on re-entry of the coroutine, so I
cant re-enter it witout wildly changing the behaviour of the block
layer. If you adopt this "mark it as coroutine" poilcy, you end up
with a system where half the functions in QEMU:

A: may yield your coroutine
B: if it does yield it, your not allowed to restart it

Making them completely uncallable from coroutine context.

>
> Except when we don't:
>
>> The issue here is that there are some block.h functions that are not
>> marked coroutine_fn but actually yield if running inside coroutine
>> context.  I think we could get rid of them with a little bit of work.
>
> Sounds like a good idea.
Stefan Hajnoczi - June 27, 2012, 10:04 a.m.
On Wed, Jun 27, 2012 at 10:25 AM, Peter Crosthwaite
<peter.crosthwaite@petalogix.com> wrote:
> On Wed, Jun 27, 2012 at 6:33 PM, Markus Armbruster <armbru@redhat.com> wrote:
>> Stefan Hajnoczi <stefanha@gmail.com> writes:
>>
>>> On Wed, Jun 27, 2012 at 8:59 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>> On 27 June 2012 08:48, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>> I'd like to see your code though because I still don't understand why
>>>>> it relies on the exact yield behavior.  Have you pushed it to a public
>>>>> git repo?
>>>>
>>>> I haven't seen Peter's code either, but his complaint makes sense
>>>> to me -- the whole point of coroutines is that you can rely on
>>>> the exact yield behaviour, surely.
>>>
>>> Not if you call coroutine_fn functions - these are explicitly marked
>>> as functions that yield.  For example block or net I/O.
>
> Thats what I mean by "assuming ownership of coroutines". If block
> marks half its API as coroutine_fn, then essentially you are saying
> block is mutually exclusive with other users of coroutine. Is it
> really that hard for the block layer to keep track of its own little
> collection of coroutines and just check that it owns the current
> context before taking the fast path?
>
>>
>> I think you two are in violent agreement :)
>>
>> With coroutines, you can rely on the exact yield behavior.  Of course,
>> that doesn't do you any good unless you know which functions can yield.
>> We make that easy by giving such functions distinctive names.
>
> That brings a new problem, best illustrated by example. Suppose I
> ensure myself against yielding something like this:
>
> foo(void *opaque) {
>   bdrv_foo(...); //this may yield!
>   *((int*)opaque)++; //state++
>   .... /* some coroutine behaviour */
> }
>
> <<from my device>>
> int state = 0;
> foo_co = qemu_couroutine_new(foo_fn);
> qemu_coroutine_enter(foo_co, &state);
> while (state < 1) {
>    qemu_coroutined_enter(foo_co);
> }
> ... /* the other half of some coroutine behvaiour */
>
> I have insured myself against bdrv_yielding by renentering it if im
> not at a valid yield point. But whats really wrong here is the block
> layer will be making assumption on re-entry of the coroutine, so I
> cant re-enter it witout wildly changing the behaviour of the block
> layer. If you adopt this "mark it as coroutine" poilcy, you end up
> with a system where half the functions in QEMU:
>
> A: may yield your coroutine
> B: if it does yield it, your not allowed to restart it
>
> Making them completely uncallable from coroutine context.

Please link to real code, I'm having trouble understanding the example
code and your argument.

Stefan
Peter A. G. Crosthwaite - June 28, 2012, 3:17 a.m.
>> not at a valid yield point. But whats really wrong here is the block
>> layer will be making assumption on re-entry of the coroutine, so I
>> cant re-enter it witout wildly changing the behaviour of the block
>> layer. If you adopt this "mark it as coroutine" poilcy, you end up
>> with a system where half the functions in QEMU:
>>
>> A: may yield your coroutine
>> B: if it does yield it, your not allowed to restart it
>>
>> Making them completely uncallable from coroutine context.
>
> Please link to real code, I'm having trouble understanding the example
> code and your argument.
>

Hi Stefan,

Please see:

git://developer.petalogix.com/public/qemu.git for-stefan/fdt-generic.next

Its a generic machine model that creates an arbitrary machine from a
dtb. Yes, the work has many issues, some already discussed on the
mailing list, but it illustrates the issue with block. Top commit I
add some comments re this discussion. git diff HEAD^ will show you
where to look in the code.

Regards,
Peter

> Stefan
>
Stefan Hajnoczi - June 28, 2012, 1:03 p.m.
On Thu, Jun 28, 2012 at 4:17 AM, Peter Crosthwaite
<peter.crosthwaite@petalogix.com> wrote:
>>> not at a valid yield point. But whats really wrong here is the block
>>> layer will be making assumption on re-entry of the coroutine, so I
>>> cant re-enter it witout wildly changing the behaviour of the block
>>> layer. If you adopt this "mark it as coroutine" poilcy, you end up
>>> with a system where half the functions in QEMU:
>>>
>>> A: may yield your coroutine
>>> B: if it does yield it, your not allowed to restart it
>>>
>>> Making them completely uncallable from coroutine context.
>>
>> Please link to real code, I'm having trouble understanding the example
>> code and your argument.
>>
>
> Hi Stefan,
>
> Please see:
>
> git://developer.petalogix.com/public/qemu.git for-stefan/fdt-generic.next
>
> Its a generic machine model that creates an arbitrary machine from a
> dtb. Yes, the work has many issues, some already discussed on the
> mailing list, but it illustrates the issue with block. Top commit I
> add some comments re this discussion. git diff HEAD^ will show you
> where to look in the code.

Thanks, that is very useful.

Why are you using coroutines to build the machine?  I think the answer
is because you are trying to do parallel init with a wait when a
dependency is hit.

The problem I see is:

FDTMachineInfo *fdt_generic_create_machine()
{
    ...
    while (qemu_co_queue_enter_next(fdti->cq));
}

The problem you have is not that the block layer is yielding.  The
problem is that you need to run aio processing (i.e. the main loop) in
order for that bdrv_read() to make progress.  Please try:

while (qemu_co_queue_enter_next(fdti->cq)) {
    qemu_aio_wait();
}

Stefan
Peter A. G. Crosthwaite - June 29, 2012, 12:50 a.m.
Hi Stefan.

On Thu, Jun 28, 2012 at 11:03 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Thu, Jun 28, 2012 at 4:17 AM, Peter Crosthwaite
> <peter.crosthwaite@petalogix.com> wrote:
>>>> not at a valid yield point. But whats really wrong here is the block
>>>> layer will be making assumption on re-entry of the coroutine, so I
>>>> cant re-enter it witout wildly changing the behaviour of the block
>>>> layer. If you adopt this "mark it as coroutine" poilcy, you end up
>>>> with a system where half the functions in QEMU:
>>>>
>>>> A: may yield your coroutine
>>>> B: if it does yield it, your not allowed to restart it
>>>>
>>>> Making them completely uncallable from coroutine context.
>>>
>>> Please link to real code, I'm having trouble understanding the example
>>> code and your argument.
>>>
>>
>> Hi Stefan,
>>
>> Please see:
>>
>> git://developer.petalogix.com/public/qemu.git for-stefan/fdt-generic.next
>>
>> Its a generic machine model that creates an arbitrary machine from a
>> dtb. Yes, the work has many issues, some already discussed on the
>> mailing list, but it illustrates the issue with block. Top commit I
>> add some comments re this discussion. git diff HEAD^ will show you
>> where to look in the code.
>
> Thanks, that is very useful.
>
> Why are you using coroutines to build the machine?  I think the answer
> is because you are trying to do parallel init with a wait when a
> dependency is hit.

You got it. I could do multi passes of the tree to serialise but it is
messy. When I implement support for phandles as QOM links the number
of passes is no longer finite, so Id have to:

while (!everything_inited) {
    parse_entire_tree_again();
}

Which I dont want to do. Threads are a natural solution this this.
Coroutines are even better as I dont have to worry about synchronising
edits to the device tree (or other things for that matter) - which way
back was my original motivation for converting my thread based
solution to coroutines.

>
> The problem I see is:
>
> FDTMachineInfo *fdt_generic_create_machine()
> {
>    ...
>    while (qemu_co_queue_enter_next(fdti->cq));
> }
>
> The problem you have is not that the block layer is yielding.  The
> problem is that you need to run aio processing

And this is what Im trying to say is wrong. My usage of coroutine has
nothing to do with block, AIO or any other existing client of
coroutine. I shouldnt have to make API calls into things I dont care
about just to make my coroutines work. The QOM people should be able
to come along and rewrite the AIO subsystem tommorow and it shouldnt
affect my coroutines.

Maybe I need to generalise away from block. This problem goes a level
higher to AIO. AIO is assuming it owns all coroutines - I.E. if you
are in coroutines context (qemu_in_coroutine()) then that coroutine
can be pre-emented and queued up in AIO's work queue. Im saying this
is flawed - because it means coroutines are not generic, they are for
AIO use only.

So change subject to "Remove assumption that AIO owns coroutines".

(i.e. the main loop) in
> order for that bdrv_read() to make progress.  Please try:
>
> while (qemu_co_queue_enter_next(fdti->cq)) {
>    qemu_aio_wait();
> }

It still doesnt solve the yield point problem though. My coroutine
will yield at qdev_init() time which is not a valid yield point for my
coroutines.

Thanks for taking the time to look into this,
Regards,
Peter

>
> Stefan
Kevin Wolf - June 29, 2012, 8:24 a.m.
Am 29.06.2012 02:50, schrieb Peter Crosthwaite:
>>
>> The problem I see is:
>>
>> FDTMachineInfo *fdt_generic_create_machine()
>> {
>>    ...
>>    while (qemu_co_queue_enter_next(fdti->cq));
>> }
>>
>> The problem you have is not that the block layer is yielding.  The
>> problem is that you need to run aio processing
> 
> And this is what Im trying to say is wrong. My usage of coroutine has
> nothing to do with block, AIO or any other existing client of
> coroutine. I shouldnt have to make API calls into things I dont care
> about just to make my coroutines work. The QOM people should be able
> to come along and rewrite the AIO subsystem tommorow and it shouldnt
> affect my coroutines.

This isn't really block or AIO specific. What you need to run is
basically a nested main loop. As long as you process the
(aio_)fd_handlers and bottom halves, the block layer will work.
qemu_aio_wait() implements such a nested main loop. I think the only
major thing that is not executed in qemu_aio_wait() is timers (and maybe
that's something we should change).

> Maybe I need to generalise away from block. This problem goes a level
> higher to AIO. AIO is assuming it owns all coroutines - I.E. if you
> are in coroutines context (qemu_in_coroutine()) then that coroutine
> can be pre-emented and queued up in AIO's work queue. Im saying this
> is flawed - because it means coroutines are not generic, they are for
> AIO use only.
> 
> So change subject to "Remove assumption that AIO owns coroutines".

No, the assumption is a completely different one and it has nothing to
do with who "owns" a coroutine. It is about whether you may assume that
a call into a different subsystem doesn't yield. The answer today is no,
the other subsystem may yield. You want to change it to yes. But there's
nothing AIO or block specific about it.

Kevin
Peter A. G. Crosthwaite - June 29, 2012, 8:39 a.m.
On Jun 29, 2012 6:24 PM, "Kevin Wolf" <kwolf@redhat.com> wrote:
>
> Am 29.06.2012 02:50, schrieb Peter Crosthwaite:
> >>
> >> The problem I see is:
> >>
> >> FDTMachineInfo *fdt_generic_create_machine()
> >> {
> >>    ...
> >>    while (qemu_co_queue_enter_next(fdti->cq));
> >> }
> >>
> >> The problem you have is not that the block layer is yielding.  The
> >> problem is that you need to run aio processing
> >
> > And this is what Im trying to say is wrong. My usage of coroutine has
> > nothing to do with block, AIO or any other existing client of
> > coroutine. I shouldnt have to make API calls into things I dont care
> > about just to make my coroutines work. The QOM people should be able
> > to come along and rewrite the AIO subsystem tommorow and it shouldnt
> > affect my coroutines.
>
> This isn't really block or AIO specific. What you need to run is
> basically a nested main loop. As long as you process the
> (aio_)fd_handlers and bottom halves, the block layer will work.
> qemu_aio_wait() implements such a nested main loop. I think the only
> major thing that is not executed in qemu_aio_wait() is timers (and maybe
> that's something we should change).
>
> > Maybe I need to generalise away from block. This problem goes a level
> > higher to AIO. AIO is assuming it owns all coroutines - I.E. if you
> > are in coroutines context (qemu_in_coroutine()) then that coroutine
> > can be pre-emented and queued up in AIO's work queue. Im saying this
> > is flawed - because it means coroutines are not generic, they are for
> > AIO use only.
> >
> > So change subject to "Remove assumption that AIO owns coroutines".
>
> No, the assumption is a completely different one and it has nothing to
> do with who "owns" a coroutine. It is about whether you may assume that
> a call into a different subsystem doesn't yield. The answer today is no,
> the other subsystem may yield. You want to change it to yes.

Then can we change it to yes? If subsystems yield coroutines without checks
then PMM is right, coroutines are just obfuiscated threads...

But there's
> nothing AIO or block specific about it.
>
> Kevin
Markus Armbruster - June 29, 2012, 11:29 a.m.
Peter Crosthwaite <peter.crosthwaite@petalogix.com> writes:

> On Jun 29, 2012 6:24 PM, "Kevin Wolf" <kwolf@redhat.com> wrote:
>>
>> Am 29.06.2012 02:50, schrieb Peter Crosthwaite:
>> >>
>> >> The problem I see is:
>> >>
>> >> FDTMachineInfo *fdt_generic_create_machine()
>> >> {
>> >>    ...
>> >>    while (qemu_co_queue_enter_next(fdti->cq));
>> >> }
>> >>
>> >> The problem you have is not that the block layer is yielding.  The
>> >> problem is that you need to run aio processing
>> >
>> > And this is what Im trying to say is wrong. My usage of coroutine has
>> > nothing to do with block, AIO or any other existing client of
>> > coroutine. I shouldnt have to make API calls into things I dont care
>> > about just to make my coroutines work. The QOM people should be able
>> > to come along and rewrite the AIO subsystem tommorow and it shouldnt
>> > affect my coroutines.
>>
>> This isn't really block or AIO specific. What you need to run is
>> basically a nested main loop. As long as you process the
>> (aio_)fd_handlers and bottom halves, the block layer will work.
>> qemu_aio_wait() implements such a nested main loop. I think the only
>> major thing that is not executed in qemu_aio_wait() is timers (and maybe
>> that's something we should change).
>>
>> > Maybe I need to generalise away from block. This problem goes a level
>> > higher to AIO. AIO is assuming it owns all coroutines - I.E. if you
>> > are in coroutines context (qemu_in_coroutine()) then that coroutine
>> > can be pre-emented and queued up in AIO's work queue. Im saying this
>> > is flawed - because it means coroutines are not generic, they are for
>> > AIO use only.
>> >
>> > So change subject to "Remove assumption that AIO owns coroutines".
>>
>> No, the assumption is a completely different one and it has nothing to
>> do with who "owns" a coroutine. It is about whether you may assume that
>> a call into a different subsystem doesn't yield.

Exactly.

>>                                                  The answer today is no,
>> the other subsystem may yield. You want to change it to yes.
>
> Then can we change it to yes? If subsystems yield coroutines without checks
> then PMM is right, coroutines are just obfuiscated threads...

Unlike threads, coroutines can yield only at designated points.  That's
a fundamental difference.  Even if you had to assume every single
function in every single API may yield.

Whether the set of yield points we have now is convenient for your
application is a fair question.  But it's a "have we gotten the set of
points right" question, no less, no more.  Unless no usable set can
exist, it's not a "does the coroutine concept make sense" question.

I appreciate contributions to APIs with more useful yield semantics, and
reports on problems with current APIs do count there.  I don't mind the
occasional philosophical musings on the merits of control structures
(which is what a coroutine is).  Just keep in mind that musings are
awfully prone to distract from real work, such as helping you with your
actual problem :)
Peter A. G. Crosthwaite - June 29, 2012, 11:51 a.m.
Can anoyne explain to me the consequences of my patch btw? I am
struggling to see how co-routines actually call into the block layer
API (without already being there) such that these fast paths are even
activated?

>>>
>>> No, the assumption is a completely different one and it has nothing to
>>> do with who "owns" a coroutine. It is about whether you may assume that
>>> a call into a different subsystem doesn't yield.
>
> Exactly.
>
>>>                                                  The answer today is no,
>>> the other subsystem may yield. You want to change it to yes.
>>
>> Then can we change it to yes? If subsystems yield coroutines without checks
>> then PMM is right, coroutines are just obfuiscated threads...
>
> Unlike threads, coroutines can yield only at designated points.  That's
> a fundamental difference.  Even if you had to assume every single
> function in every single API may yield.
>
> Whether the set of yield points we have now is convenient for your
> application is a fair question.  But it's a "have we gotten the set of
> points right" question,

So they way you are talking, it sounds like the set of "yield points"
is global. I think this is where it all goes pear shaped. What I am
proposing, is that for some coroutines it will be valid to yield at
A,B, C. Others, valid to yield and D, E and F, yield points should not
be considered a global thing or you end up with this viral taintage
where every function call leaving a subsystem you must assume it could
get coro yielded.

BTW Yielding is one thing, but the elephant in the room here is
resumption of the coroutine. When AIO yields my coroutine I i need to
talk to AIO to get it unyielded (Stefans propsoed edit to my code).
What happens when tommorow something in QOM, or a device model or  is
implemented with coros too? how do I know who yielded my routines and
what API to call re-enter it?

no less, no more.  Unless no usable set can
> exist, it's not a "does the coroutine concept make sense" question.

Coroutines do make sense :) I wouldnt have used them otherwise. My
dispute is with the global yield point concept.

>
> I appreciate contributions to APIs with more useful yield semantics, and
> reports on problems with current APIs do count there.  I don't mind the
> occasional philosophical musings on the merits of control structures
> (which is what a coroutine is).  Just keep in mind that musings are
> awfully prone to distract from real work, such as helping you with your
> actual problem :)

I have plenty of workarounds to the problem and my code is not bugged
anymore, but I don't want a fragile solution to the problem. This
patch is in my tree and will suffice for me and will stay there till
this get resolved.

Regards,
Peter
Stefan Hajnoczi - July 2, 2012, 8:50 a.m.
On Fri, Jun 29, 2012 at 12:51 PM, Peter Crosthwaite
<peter.crosthwaite@petalogix.com> wrote:
> BTW Yielding is one thing, but the elephant in the room here is
> resumption of the coroutine. When AIO yields my coroutine I i need to
> talk to AIO to get it unyielded (Stefans propsoed edit to my code).
> What happens when tommorow something in QOM, or a device model or  is
> implemented with coros too? how do I know who yielded my routines and
> what API to call re-enter it?

Going back to what Kevin said, the qemu_aio_wait() isn't block layer
specific and there will never be a requirement to call any other magic
functions.

QEMU is event-driven and you must pump events.  If you call into
another subsystem, be prepared to pump events so that I/O can make
progress.  It's an assumption that is so central to QEMU architecture
that I don't see it as a problem.

Once the main loop is running then the event loop is taken care of for
you.  But during startup you're in a different environment and need to
pump events yourself.

If we drop bdrv_read()/bdrv_write() this won't change.  You'll have to
call bdrv_co_readv()/bdrv_co_writev() and pump events.

Stefan
Peter A. G. Crosthwaite - July 2, 2012, 8:57 a.m.
On Mon, Jul 2, 2012 at 6:50 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Fri, Jun 29, 2012 at 12:51 PM, Peter Crosthwaite
> <peter.crosthwaite@petalogix.com> wrote:
>> BTW Yielding is one thing, but the elephant in the room here is
>> resumption of the coroutine. When AIO yields my coroutine I i need to
>> talk to AIO to get it unyielded (Stefans propsoed edit to my code).
>> What happens when tommorow something in QOM, or a device model or  is
>> implemented with coros too? how do I know who yielded my routines and
>> what API to call re-enter it?
>
> Going back to what Kevin said, the qemu_aio_wait() isn't block layer
> specific and there will never be a requirement to call any other magic
> functions.
>
> QEMU is event-driven and you must pump events.  If you call into
> another subsystem, be prepared to pump events so that I/O can make
> progress.  It's an assumption that is so central to QEMU architecture
> that I don't see it as a problem.
>
> Once the main loop is running then the event loop is taken care of for
> you.  But during startup you're in a different environment and need to
> pump events yourself.
>
> If we drop bdrv_read()/bdrv_write() this won't change.  You'll have to
> call bdrv_co_readv()/bdrv_co_writev() and pump events.
>

Just tracing bdrv_aio_read(), It bypasses the fastpath logic:

. So a conversion of pflash to bdrv_aio_readv is a possible solution here.

bdrv_aio_read -> bdrv_co_aio_rw_vector :

static BlockDriverAIOCB *bdrv_co_aio_rw_vector (..) {
    Coroutine *co;
    BlockDriverAIOCBCoroutine *acb;

    ...

    co = qemu_coroutine_create(bdrv_co_do_rw);
    qemu_coroutine_enter(co, acb);

    return &acb->common;
}

No conditional on the qemu_coroutine_create. So it will always create
a new coroutine for its work which will solve my problem. All I need
to do is pump events once at the end of machine model creation. Any my
coroutines will never yield or get queued by block/AIO. Sound like a
solution?

Regards,
Peter

> Stefan
Kevin Wolf - July 2, 2012, 9:04 a.m.
Am 02.07.2012 10:57, schrieb Peter Crosthwaite:
> On Mon, Jul 2, 2012 at 6:50 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> On Fri, Jun 29, 2012 at 12:51 PM, Peter Crosthwaite
>> <peter.crosthwaite@petalogix.com> wrote:
>>> BTW Yielding is one thing, but the elephant in the room here is
>>> resumption of the coroutine. When AIO yields my coroutine I i need to
>>> talk to AIO to get it unyielded (Stefans propsoed edit to my code).
>>> What happens when tommorow something in QOM, or a device model or  is
>>> implemented with coros too? how do I know who yielded my routines and
>>> what API to call re-enter it?
>>
>> Going back to what Kevin said, the qemu_aio_wait() isn't block layer
>> specific and there will never be a requirement to call any other magic
>> functions.
>>
>> QEMU is event-driven and you must pump events.  If you call into
>> another subsystem, be prepared to pump events so that I/O can make
>> progress.  It's an assumption that is so central to QEMU architecture
>> that I don't see it as a problem.
>>
>> Once the main loop is running then the event loop is taken care of for
>> you.  But during startup you're in a different environment and need to
>> pump events yourself.
>>
>> If we drop bdrv_read()/bdrv_write() this won't change.  You'll have to
>> call bdrv_co_readv()/bdrv_co_writev() and pump events.
>>
> 
> Just tracing bdrv_aio_read(), It bypasses the fastpath logic:
> 
> . So a conversion of pflash to bdrv_aio_readv is a possible solution here.
> 
> bdrv_aio_read -> bdrv_co_aio_rw_vector :
> 
> static BlockDriverAIOCB *bdrv_co_aio_rw_vector (..) {
>     Coroutine *co;
>     BlockDriverAIOCBCoroutine *acb;
> 
>     ...
> 
>     co = qemu_coroutine_create(bdrv_co_do_rw);
>     qemu_coroutine_enter(co, acb);
> 
>     return &acb->common;
> }
> 
> No conditional on the qemu_coroutine_create. So it will always create
> a new coroutine for its work which will solve my problem. All I need
> to do is pump events once at the end of machine model creation. Any my
> coroutines will never yield or get queued by block/AIO. Sound like a
> solution?

If you don't need the read data in your initialisation code, then yes,
that would work. bdrv_aio_* will always create a new coroutine. I just
assumed that you wanted to use the data right away, and then using the
AIO functions wouldn't have made much sense.

Kevin
Peter A. G. Crosthwaite - July 2, 2012, 9:42 a.m.
On Mon, Jul 2, 2012 at 7:04 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 02.07.2012 10:57, schrieb Peter Crosthwaite:
>> On Mon, Jul 2, 2012 at 6:50 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>> On Fri, Jun 29, 2012 at 12:51 PM, Peter Crosthwaite
>>> <peter.crosthwaite@petalogix.com> wrote:
>>>> BTW Yielding is one thing, but the elephant in the room here is
>>>> resumption of the coroutine. When AIO yields my coroutine I i need to
>>>> talk to AIO to get it unyielded (Stefans propsoed edit to my code).
>>>> What happens when tommorow something in QOM, or a device model or  is
>>>> implemented with coros too? how do I know who yielded my routines and
>>>> what API to call re-enter it?
>>>
>>> Going back to what Kevin said, the qemu_aio_wait() isn't block layer
>>> specific and there will never be a requirement to call any other magic
>>> functions.
>>>
>>> QEMU is event-driven and you must pump events.  If you call into
>>> another subsystem, be prepared to pump events so that I/O can make
>>> progress.  It's an assumption that is so central to QEMU architecture
>>> that I don't see it as a problem.
>>>
>>> Once the main loop is running then the event loop is taken care of for
>>> you.  But during startup you're in a different environment and need to
>>> pump events yourself.
>>>
>>> If we drop bdrv_read()/bdrv_write() this won't change.  You'll have to
>>> call bdrv_co_readv()/bdrv_co_writev() and pump events.
>>>
>>
>> Just tracing bdrv_aio_read(), It bypasses the fastpath logic:
>>
>> . So a conversion of pflash to bdrv_aio_readv is a possible solution here.
>>
>> bdrv_aio_read -> bdrv_co_aio_rw_vector :
>>
>> static BlockDriverAIOCB *bdrv_co_aio_rw_vector (..) {
>>     Coroutine *co;
>>     BlockDriverAIOCBCoroutine *acb;
>>
>>     ...
>>
>>     co = qemu_coroutine_create(bdrv_co_do_rw);
>>     qemu_coroutine_enter(co, acb);
>>
>>     return &acb->common;
>> }
>>
>> No conditional on the qemu_coroutine_create. So it will always create
>> a new coroutine for its work which will solve my problem. All I need
>> to do is pump events once at the end of machine model creation. Any my
>> coroutines will never yield or get queued by block/AIO. Sound like a
>> solution?
>
> If you don't need the read data in your initialisation code,

definately not :) Just as long as the read data is there by the time
the machine goes live.  Whats the current policy with bdrv_read()ing
from init functions anyway? Several devices in qemu have init
functions that read the entire storage into a buffer (then the guest
just talks to the buffer rather than the backing store).

Pflash (pflash_cfi_01.c) is the device that is causing me interference
here and it works exactly like this. If we make the bdrv_read() aio
though, how do you ensure that it has completed before the guest talks
to the device? Will this just happen at the end of machine_init
anyways? Can we put a one liner in the machine init framework that
pumps all AIO events then just mass covert all these bdrv_reads (in
init functions) to bdrv_aio_read with a nop completion callback?

then yes,
> that would work. bdrv_aio_* will always create a new coroutine. I just
> assumed that you wanted to use the data right away, and then using the
> AIO functions wouldn't have made much sense.

You'd get a small performance increase no? Your machine init continues
on while your I/O happens rather than being synchronous so there is
motivation beyond my situation.

Regards,
Peter

>
> Kevin
Kevin Wolf - July 2, 2012, 10:01 a.m.
Am 02.07.2012 11:42, schrieb Peter Crosthwaite:
> On Mon, Jul 2, 2012 at 7:04 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 02.07.2012 10:57, schrieb Peter Crosthwaite:
>>> No conditional on the qemu_coroutine_create. So it will always create
>>> a new coroutine for its work which will solve my problem. All I need
>>> to do is pump events once at the end of machine model creation. Any my
>>> coroutines will never yield or get queued by block/AIO. Sound like a
>>> solution?
>>
>> If you don't need the read data in your initialisation code,
> 
> definately not :) Just as long as the read data is there by the time
> the machine goes live.  Whats the current policy with bdrv_read()ing
> from init functions anyway? Several devices in qemu have init
> functions that read the entire storage into a buffer (then the guest
> just talks to the buffer rather than the backing store).

Reading from block devices during device initialisation breaks
migration, so I'd like to see it go away wherever possible. Reading in
the whole image file doesn't sound like something for which a good
excuse exists, you can do that as well during the first access.

> Pflash (pflash_cfi_01.c) is the device that is causing me interference
> here and it works exactly like this. If we make the bdrv_read() aio
> though, how do you ensure that it has completed before the guest talks
> to the device? Will this just happen at the end of machine_init
> anyways? Can we put a one liner in the machine init framework that
> pumps all AIO events then just mass covert all these bdrv_reads (in
> init functions) to bdrv_aio_read with a nop completion callback?

The initialisation function of the device can wait at its end for all
AIOs to return. I wouldn't want to encourage more block layer use during
the initialisation phase by supporting it in the infrastructure.

> then yes,
>> that would work. bdrv_aio_* will always create a new coroutine. I just
>> assumed that you wanted to use the data right away, and then using the
>> AIO functions wouldn't have made much sense.
> 
> You'd get a small performance increase no? Your machine init continues
> on while your I/O happens rather than being synchronous so there is
> motivation beyond my situation.

Yeah, as long as the next statement isn't "while (!returned)
qemu_aio_wait();", which it is in the common case.

Kevin
Peter Maydell - July 2, 2012, 10:18 a.m.
On 2 July 2012 11:01, Kevin Wolf <kwolf@redhat.com> wrote:
> Reading from block devices during device initialisation breaks
> migration, so I'd like to see it go away wherever possible. Reading in
> the whole image file doesn't sound like something for which a good
> excuse exists, you can do that as well during the first access.

It's much nicer to be able to produce an error message ("file
doesn't exist", "file is too short for this flash device") at
device startup rather than miles later on at first access,
and pulling in a 64K file at startup is a simple implementation.
Why complicate things by adding code for "if this is the first
access then read in the file"?

I would have thought migration would be simpler with a "read
whole file at startup" implementation, because in that case
the required migration state is always "this block of memory",
rather than "sometimes this block of memory and sometimes a
flag which says we haven't initialised the memory and will
need to do a file read".

-- PMM
Peter A. G. Crosthwaite - July 2, 2012, 10:18 a.m.
On Mon, Jul 2, 2012 at 8:01 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 02.07.2012 11:42, schrieb Peter Crosthwaite:
>> On Mon, Jul 2, 2012 at 7:04 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> Am 02.07.2012 10:57, schrieb Peter Crosthwaite:
>>>> No conditional on the qemu_coroutine_create. So it will always create
>>>> a new coroutine for its work which will solve my problem. All I need
>>>> to do is pump events once at the end of machine model creation. Any my
>>>> coroutines will never yield or get queued by block/AIO. Sound like a
>>>> solution?
>>>
>>> If you don't need the read data in your initialisation code,
>>
>> definately not :) Just as long as the read data is there by the time
>> the machine goes live.  Whats the current policy with bdrv_read()ing
>> from init functions anyway? Several devices in qemu have init
>> functions that read the entire storage into a buffer (then the guest
>> just talks to the buffer rather than the backing store).
>
> Reading from block devices during device initialisation breaks
> migration, so I'd like to see it go away wherever possible. Reading in
> the whole image file doesn't sound like something for which a good
> excuse exists,

Makes sense for small devices on embedded platforms. You end up with a
very simple and clean device model. The approach is totally broken for
HDDs but it does make some sense for the little embedded flashes where
you can get away with caching the entire device storage in RAM for the
lifetime of the system.

> you can do that as well during the first access.
>

Kind of painful though to change this for a lot of existing devices.
Its a reasonable policy for new devices, but can we just fix the
init->bdrv_read() calls in place for the moment?

>> Pflash (pflash_cfi_01.c) is the device that is causing me interference
>> here and it works exactly like this. If we make the bdrv_read() aio
>> though, how do you ensure that it has completed before the guest talks
>> to the device? Will this just happen at the end of machine_init
>> anyways? Can we put a one liner in the machine init framework that
>> pumps all AIO events then just mass covert all these bdrv_reads (in
>> init functions) to bdrv_aio_read with a nop completion callback?
>
> The initialisation function of the device can wait at its end for all
> AIOs to return.

You lose the performance increase discussed below however, as instead
of the init function returning to continue on with the machine init,
you block on disk IO.

I wouldn't want to encourage more block layer use during
> the initialisation phase by supporting it in the infrastructure.
>
>> then yes,
>>> that would work. bdrv_aio_* will always create a new coroutine. I just
>>> assumed that you wanted to use the data right away, and then using the
>>> AIO functions wouldn't have made much sense.
>>
>> You'd get a small performance increase no? Your machine init continues
>> on while your I/O happens rather than being synchronous so there is
>> motivation beyond my situation.
>
> Yeah, as long as the next statement isn't "while (!returned)
> qemu_aio_wait();", which it is in the common case.
>
> Kevin

Regards,
Peter
Kevin Wolf - July 2, 2012, 10:44 a.m.
Am 02.07.2012 12:18, schrieb Peter Maydell:
> On 2 July 2012 11:01, Kevin Wolf <kwolf@redhat.com> wrote:
>> Reading from block devices during device initialisation breaks
>> migration, so I'd like to see it go away wherever possible. Reading in
>> the whole image file doesn't sound like something for which a good
>> excuse exists, you can do that as well during the first access.
> 
> It's much nicer to be able to produce an error message ("file
> doesn't exist", "file is too short for this flash device") at
> device startup rather than miles later on at first access,

"file doesn't exist" is an error that occurs for the backend (-drive
if=none), not for the -device, so you shouldn't have to deal with that
at all.

> and pulling in a 64K file at startup is a simple implementation.
> Why complicate things by adding code for "if this is the first
> access then read in the file"?

Because then it works. :-)

Migration works more or less like this:

1. Destination creates device model based on command line
2. RAM is copied live, source keeps running
3. Source stops, device state is transferred
4. Destination starts running the VM

Reading from a block device is meaningful the earliest in step 3,
because at earlier points the guest still runs on the source and can
overwrite the data on the block device. If you're reading in the whole
image, you're doing it in step 1, so your data will be outdated by the
time the migration completes.

And this description doesn't even take things like cache coherency for
image files into consideration.

> I would have thought migration would be simpler with a "read
> whole file at startup" implementation, because in that case
> the required migration state is always "this block of memory",
> rather than "sometimes this block of memory and sometimes a
> flag which says we haven't initialised the memory and will
> need to do a file read".

Oh, so you're actually migrating the whole content of the flash device
instead of using the same file on the migration destination? What's the
advantage over rereading the file on the destination? You still need
access to the file there, don't you?

The approach with migrating the block device content probably works for
your 64k flash, but what about my hard disk?

Kevin
Kevin Wolf - July 2, 2012, 10:52 a.m.
Am 02.07.2012 12:18, schrieb Peter Crosthwaite:
> On Mon, Jul 2, 2012 at 8:01 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 02.07.2012 11:42, schrieb Peter Crosthwaite:
>>> On Mon, Jul 2, 2012 at 7:04 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>> Am 02.07.2012 10:57, schrieb Peter Crosthwaite:
>>>>> No conditional on the qemu_coroutine_create. So it will always create
>>>>> a new coroutine for its work which will solve my problem. All I need
>>>>> to do is pump events once at the end of machine model creation. Any my
>>>>> coroutines will never yield or get queued by block/AIO. Sound like a
>>>>> solution?
>>>>
>>>> If you don't need the read data in your initialisation code,
>>>
>>> definately not :) Just as long as the read data is there by the time
>>> the machine goes live.  Whats the current policy with bdrv_read()ing
>>> from init functions anyway? Several devices in qemu have init
>>> functions that read the entire storage into a buffer (then the guest
>>> just talks to the buffer rather than the backing store).
>>
>> Reading from block devices during device initialisation breaks
>> migration, so I'd like to see it go away wherever possible. Reading in
>> the whole image file doesn't sound like something for which a good
>> excuse exists,
> 
> Makes sense for small devices on embedded platforms. You end up with a
> very simple and clean device model. The approach is totally broken for
> HDDs but it does make some sense for the little embedded flashes where
> you can get away with caching the entire device storage in RAM for the
> lifetime of the system.

It kind of works for read-only devices, it's just ugly there. With
writeable devices it makes the VM unmigratable.

>> you can do that as well during the first access.
>>
> 
> Kind of painful though to change this for a lot of existing devices.
> Its a reasonable policy for new devices, but can we just fix the
> init->bdrv_read() calls in place for the moment?

Sure, but you asked for the policy and the policy is "only with good
reasons". ;-)

>>> Pflash (pflash_cfi_01.c) is the device that is causing me interference
>>> here and it works exactly like this. If we make the bdrv_read() aio
>>> though, how do you ensure that it has completed before the guest talks
>>> to the device? Will this just happen at the end of machine_init
>>> anyways? Can we put a one liner in the machine init framework that
>>> pumps all AIO events then just mass covert all these bdrv_reads (in
>>> init functions) to bdrv_aio_read with a nop completion callback?
>>
>> The initialisation function of the device can wait at its end for all
>> AIOs to return.
> 
> You lose the performance increase discussed below however, as instead
> of the init function returning to continue on with the machine init,
> you block on disk IO.

Do you think it really matters? I guess it might if you have two such
devices that take each a second or so to load, but otherwise there isn't
much to parallelise.

Kevin
Peter A. G. Crosthwaite - July 2, 2012, 10:57 a.m.
On Mon, Jul 2, 2012 at 8:52 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 02.07.2012 12:18, schrieb Peter Crosthwaite:
>> On Mon, Jul 2, 2012 at 8:01 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> Am 02.07.2012 11:42, schrieb Peter Crosthwaite:
>>>> On Mon, Jul 2, 2012 at 7:04 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>> Am 02.07.2012 10:57, schrieb Peter Crosthwaite:
>>>>>> No conditional on the qemu_coroutine_create. So it will always create
>>>>>> a new coroutine for its work which will solve my problem. All I need
>>>>>> to do is pump events once at the end of machine model creation. Any my
>>>>>> coroutines will never yield or get queued by block/AIO. Sound like a
>>>>>> solution?
>>>>>
>>>>> If you don't need the read data in your initialisation code,
>>>>
>>>> definately not :) Just as long as the read data is there by the time
>>>> the machine goes live.  Whats the current policy with bdrv_read()ing
>>>> from init functions anyway? Several devices in qemu have init
>>>> functions that read the entire storage into a buffer (then the guest
>>>> just talks to the buffer rather than the backing store).
>>>
>>> Reading from block devices during device initialisation breaks
>>> migration, so I'd like to see it go away wherever possible. Reading in
>>> the whole image file doesn't sound like something for which a good
>>> excuse exists,
>>
>> Makes sense for small devices on embedded platforms. You end up with a
>> very simple and clean device model. The approach is totally broken for
>> HDDs but it does make some sense for the little embedded flashes where
>> you can get away with caching the entire device storage in RAM for the
>> lifetime of the system.
>
> It kind of works for read-only devices, it's just ugly there. With
> writeable devices it makes the VM unmigratable.
>

Isnt it just a simple case of syncing the buffer with the backing
store on pre_save?

Regards,
Peter

>>> you can do that as well during the first access.
>>>
>>
>> Kind of painful though to change this for a lot of existing devices.
>> Its a reasonable policy for new devices, but can we just fix the
>> init->bdrv_read() calls in place for the moment?
>
> Sure, but you asked for the policy and the policy is "only with good
> reasons". ;-)
>
>>>> Pflash (pflash_cfi_01.c) is the device that is causing me interference
>>>> here and it works exactly like this. If we make the bdrv_read() aio
>>>> though, how do you ensure that it has completed before the guest talks
>>>> to the device? Will this just happen at the end of machine_init
>>>> anyways? Can we put a one liner in the machine init framework that
>>>> pumps all AIO events then just mass covert all these bdrv_reads (in
>>>> init functions) to bdrv_aio_read with a nop completion callback?
>>>
>>> The initialisation function of the device can wait at its end for all
>>> AIOs to return.
>>
>> You lose the performance increase discussed below however, as instead
>> of the init function returning to continue on with the machine init,
>> you block on disk IO.
>
> Do you think it really matters? I guess it might if you have two such
> devices that take each a second or so to load, but otherwise there isn't
> much to parallelise.
>
> Kevin
Peter Maydell - July 2, 2012, 10:59 a.m.
On 2 July 2012 11:44, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 02.07.2012 12:18, schrieb Peter Maydell:
>> Why complicate things by adding code for "if this is the first
>> access then read in the file"?
>
> Because then it works. :-)
>
> Migration works more or less like this:
>
> 1. Destination creates device model based on command line
> 2. RAM is copied live, source keeps running
> 3. Source stops, device state is transferred
> 4. Destination starts running the VM
>
> Reading from a block device is meaningful the earliest in step 3,
> because at earlier points the guest still runs on the source and can
> overwrite the data on the block device. If you're reading in the whole
> image, you're doing it in step 1, so your data will be outdated by the
> time the migration completes.

For pflash_cfi01 migration will work because although we read
the file (slightly pointlessly) in step 1, we will get the correct
contents transferred over in step 2, because we call vmstate_register_ram()
in device init.

We need to migrate flash contents like that anyway, to handle the
case of "no backing file, flash starts empty and gets whatever the
guest writes to it for as long as the guest is alive".

> The approach with migrating the block device content probably works for
> your 64k flash, but what about my hard disk?

I'm not claiming this is a good approach for everything, just for
some things.

-- PMM
Peter A. G. Crosthwaite - July 2, 2012, 11:03 a.m.
On Mon, Jul 2, 2012 at 8:59 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 2 July 2012 11:44, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 02.07.2012 12:18, schrieb Peter Maydell:
>>> Why complicate things by adding code for "if this is the first
>>> access then read in the file"?
>>
>> Because then it works. :-)
>>
>> Migration works more or less like this:
>>
>> 1. Destination creates device model based on command line
>> 2. RAM is copied live, source keeps running
>> 3. Source stops, device state is transferred
>> 4. Destination starts running the VM
>>
>> Reading from a block device is meaningful the earliest in step 3,
>> because at earlier points the guest still runs on the source and can
>> overwrite the data on the block device. If you're reading in the whole
>> image, you're doing it in step 1, so your data will be outdated by the
>> time the migration completes.
>
> For pflash_cfi01 migration will work because although we read
> the file (slightly pointlessly) in step 1, we will get the correct
> contents transferred over in step 2, because we call vmstate_register_ram()
> in device init.
>
> We need to migrate flash contents like that anyway, to handle the
> case of "no backing file,

Yeah these little flashes also tend to support a mode where there is
no backing (bdrv) file at all. If all your doing is testing smoke
testing a driver then you can boot with no -drive args and the device
model knows to just implement the buffer and init it to something
sensible. If there needs to be a device-size buffer to support this
behaviour in non-bdrv mode then in might as well be there for bdrv
mode as well.

Regards,
Peter

 flash starts empty and gets whatever the
> guest writes to it for as long as the guest is alive".
>
>> The approach with migrating the block device content probably works for
>> your 64k flash, but what about my hard disk?
>
> I'm not claiming this is a good approach for everything, just for
> some things.
>
> -- PMM
Kevin Wolf - July 2, 2012, 11:04 a.m.
Am 02.07.2012 12:57, schrieb Peter Crosthwaite:
> On Mon, Jul 2, 2012 at 8:52 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 02.07.2012 12:18, schrieb Peter Crosthwaite:
>>> On Mon, Jul 2, 2012 at 8:01 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>> Am 02.07.2012 11:42, schrieb Peter Crosthwaite:
>>>>> On Mon, Jul 2, 2012 at 7:04 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>> Am 02.07.2012 10:57, schrieb Peter Crosthwaite:
>>>>>>> No conditional on the qemu_coroutine_create. So it will always create
>>>>>>> a new coroutine for its work which will solve my problem. All I need
>>>>>>> to do is pump events once at the end of machine model creation. Any my
>>>>>>> coroutines will never yield or get queued by block/AIO. Sound like a
>>>>>>> solution?
>>>>>>
>>>>>> If you don't need the read data in your initialisation code,
>>>>>
>>>>> definately not :) Just as long as the read data is there by the time
>>>>> the machine goes live.  Whats the current policy with bdrv_read()ing
>>>>> from init functions anyway? Several devices in qemu have init
>>>>> functions that read the entire storage into a buffer (then the guest
>>>>> just talks to the buffer rather than the backing store).
>>>>
>>>> Reading from block devices during device initialisation breaks
>>>> migration, so I'd like to see it go away wherever possible. Reading in
>>>> the whole image file doesn't sound like something for which a good
>>>> excuse exists,
>>>
>>> Makes sense for small devices on embedded platforms. You end up with a
>>> very simple and clean device model. The approach is totally broken for
>>> HDDs but it does make some sense for the little embedded flashes where
>>> you can get away with caching the entire device storage in RAM for the
>>> lifetime of the system.
>>
>> It kind of works for read-only devices, it's just ugly there. With
>> writeable devices it makes the VM unmigratable.
>>
> 
> Isnt it just a simple case of syncing the buffer with the backing
> store on pre_save?

Not if the buffer is only read during initialisation, which has long
happened on the destination when pre_save runs. So you'd need to either
migrate the whole block device content as suggested by PMM (works only
for really small devices) or have a reopen in post_load. Both sounds
rather hackish to me and likely doesn't work for block devices in general.

Kevin
Peter A. G. Crosthwaite - July 2, 2012, 11:12 a.m.
On Mon, Jul 2, 2012 at 8:44 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 02.07.2012 12:18, schrieb Peter Maydell:
>> On 2 July 2012 11:01, Kevin Wolf <kwolf@redhat.com> wrote:
>>> Reading from block devices during device initialisation breaks
>>> migration, so I'd like to see it go away wherever possible. Reading in
>>> the whole image file doesn't sound like something for which a good
>>> excuse exists, you can do that as well during the first access.
>>
>> It's much nicer to be able to produce an error message ("file
>> doesn't exist", "file is too short for this flash device") at
>> device startup rather than miles later on at first access,
>
> "file doesn't exist" is an error that occurs for the backend (-drive
> if=none), not for the -device, so you shouldn't have to deal with that
> at all.
>
>> and pulling in a 64K file at startup is a simple implementation.
>> Why complicate things by adding code for "if this is the first
>> access then read in the file"?
>
> Because then it works. :-)
>
> Migration works more or less like this:
>
> 1. Destination creates device model based on command line
> 2. RAM is copied live, source keeps running
> 3. Source stops, device state is transferred
> 4. Destination starts running the VM
>
> Reading from a block device is meaningful the earliest in step 3,
> because at earlier points the guest still runs on the source and can
> overwrite the data on the block device. If you're reading in the whole
> image, you're doing it in step 1, so your data will be outdated by the
> time the migration completes.
>

I feel like theres a "two birds with one stone" solution here, by
making bdrv_aio_read just yield until step 3? Just an if (..)
somewhere in the bdrv framework that says "while not ready for
migration qemu_coroutine_yield()". You implement the postponed
bdrv_read that you want, but you also get rid of the bdrv_read()s that
everyone hates without having the rewrite all the small flashes with
if-first-read-load-all logic.

Regards,
Peter

> And this description doesn't even take things like cache coherency for
> image files into consideration.
>
>> I would have thought migration would be simpler with a "read
>> whole file at startup" implementation, because in that case
>> the required migration state is always "this block of memory",
>> rather than "sometimes this block of memory and sometimes a
>> flag which says we haven't initialised the memory and will
>> need to do a file read".
>
> Oh, so you're actually migrating the whole content of the flash device
> instead of using the same file on the migration destination? What's the
> advantage over rereading the file on the destination? You still need
> access to the file there, don't you?
>
> The approach with migrating the block device content probably works for
> your 64k flash, but what about my hard disk?
>
> Kevin
Kevin Wolf - July 2, 2012, 11:19 a.m.
Am 02.07.2012 13:12, schrieb Peter Crosthwaite:
> On Mon, Jul 2, 2012 at 8:44 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 02.07.2012 12:18, schrieb Peter Maydell:
>>> On 2 July 2012 11:01, Kevin Wolf <kwolf@redhat.com> wrote:
>>>> Reading from block devices during device initialisation breaks
>>>> migration, so I'd like to see it go away wherever possible. Reading in
>>>> the whole image file doesn't sound like something for which a good
>>>> excuse exists, you can do that as well during the first access.
>>>
>>> It's much nicer to be able to produce an error message ("file
>>> doesn't exist", "file is too short for this flash device") at
>>> device startup rather than miles later on at first access,
>>
>> "file doesn't exist" is an error that occurs for the backend (-drive
>> if=none), not for the -device, so you shouldn't have to deal with that
>> at all.
>>
>>> and pulling in a 64K file at startup is a simple implementation.
>>> Why complicate things by adding code for "if this is the first
>>> access then read in the file"?
>>
>> Because then it works. :-)
>>
>> Migration works more or less like this:
>>
>> 1. Destination creates device model based on command line
>> 2. RAM is copied live, source keeps running
>> 3. Source stops, device state is transferred
>> 4. Destination starts running the VM
>>
>> Reading from a block device is meaningful the earliest in step 3,
>> because at earlier points the guest still runs on the source and can
>> overwrite the data on the block device. If you're reading in the whole
>> image, you're doing it in step 1, so your data will be outdated by the
>> time the migration completes.
>>
> 
> I feel like theres a "two birds with one stone" solution here, by
> making bdrv_aio_read just yield until step 3? Just an if (..)
> somewhere in the bdrv framework that says "while not ready for
> migration qemu_coroutine_yield()". You implement the postponed
> bdrv_read that you want, but you also get rid of the bdrv_read()s that
> everyone hates without having the rewrite all the small flashes with
> if-first-read-load-all logic.

Or we could just have a second "late init" callback into the devices
where such requests could be issued. That would feel a bit cleaner.

Kevin
Peter A. G. Crosthwaite - July 2, 2012, 11:25 a.m.
On Mon, Jul 2, 2012 at 9:19 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 02.07.2012 13:12, schrieb Peter Crosthwaite:
>> On Mon, Jul 2, 2012 at 8:44 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> Am 02.07.2012 12:18, schrieb Peter Maydell:
>>>> On 2 July 2012 11:01, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>> Reading from block devices during device initialisation breaks
>>>>> migration, so I'd like to see it go away wherever possible. Reading in
>>>>> the whole image file doesn't sound like something for which a good
>>>>> excuse exists, you can do that as well during the first access.
>>>>
>>>> It's much nicer to be able to produce an error message ("file
>>>> doesn't exist", "file is too short for this flash device") at
>>>> device startup rather than miles later on at first access,
>>>
>>> "file doesn't exist" is an error that occurs for the backend (-drive
>>> if=none), not for the -device, so you shouldn't have to deal with that
>>> at all.
>>>
>>>> and pulling in a 64K file at startup is a simple implementation.
>>>> Why complicate things by adding code for "if this is the first
>>>> access then read in the file"?
>>>
>>> Because then it works. :-)
>>>
>>> Migration works more or less like this:
>>>
>>> 1. Destination creates device model based on command line
>>> 2. RAM is copied live, source keeps running
>>> 3. Source stops, device state is transferred
>>> 4. Destination starts running the VM
>>>
>>> Reading from a block device is meaningful the earliest in step 3,
>>> because at earlier points the guest still runs on the source and can
>>> overwrite the data on the block device. If you're reading in the whole
>>> image, you're doing it in step 1, so your data will be outdated by the
>>> time the migration completes.
>>>
>>
>> I feel like theres a "two birds with one stone" solution here, by
>> making bdrv_aio_read just yield until step 3? Just an if (..)
>> somewhere in the bdrv framework that says "while not ready for
>> migration qemu_coroutine_yield()". You implement the postponed
>> bdrv_read that you want, but you also get rid of the bdrv_read()s that
>> everyone hates without having the rewrite all the small flashes with
>> if-first-read-load-all logic.
>
> Or we could just have a second "late init" callback into the devices
> where such requests could be issued. That would feel a bit cleaner.
>

I disagree. Then device authors have to take into account all these
rules about what gets inited when. If you have a single fn and the
block layer just kicks off a coroutine that schedules the work for
when it should happen then everything is clean an simple out in device
model land. Hide the complexity from the devices and implement it in
one place in the block layer rather than in N different devices.

Regards,
Peter

> Kevin
Markus Armbruster - July 12, 2012, 1:42 p.m.
Kevin Wolf <kwolf@redhat.com> writes:

> Am 02.07.2012 11:42, schrieb Peter Crosthwaite:
>> On Mon, Jul 2, 2012 at 7:04 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> Am 02.07.2012 10:57, schrieb Peter Crosthwaite:
>>>> No conditional on the qemu_coroutine_create. So it will always create
>>>> a new coroutine for its work which will solve my problem. All I need
>>>> to do is pump events once at the end of machine model creation. Any my
>>>> coroutines will never yield or get queued by block/AIO. Sound like a
>>>> solution?
>>>
>>> If you don't need the read data in your initialisation code,
>> 
>> definately not :) Just as long as the read data is there by the time
>> the machine goes live.  Whats the current policy with bdrv_read()ing
>> from init functions anyway? Several devices in qemu have init
>> functions that read the entire storage into a buffer (then the guest
>> just talks to the buffer rather than the backing store).
>
> Reading from block devices during device initialisation breaks
> migration, so I'd like to see it go away wherever possible. Reading in
> the whole image file doesn't sound like something for which a good
> excuse exists, you can do that as well during the first access.

I just stumbled over another problem case: encrypted images.

Encrypted images specified on the command line get their keys from the
monitor.  -S is forced automatically.  You can then use block_passwd to
set keys.  cont asks for keys still missing.

However, device initialization happens *before* you get access to the
monitor.  Therefore, your init method runs without a key.

It would be nice if the monitor was available before device
initialization.  Patches welcome.

[...]
Peter A. G. Crosthwaite - July 13, 2012, 1:21 a.m.
On Thu, Jul 12, 2012 at 11:42 PM, Markus Armbruster <armbru@redhat.com> wrote:
> Kevin Wolf <kwolf@redhat.com> writes:
>
>> Am 02.07.2012 11:42, schrieb Peter Crosthwaite:
>>> On Mon, Jul 2, 2012 at 7:04 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>> Am 02.07.2012 10:57, schrieb Peter Crosthwaite:
>>>>> No conditional on the qemu_coroutine_create. So it will always create
>>>>> a new coroutine for its work which will solve my problem. All I need
>>>>> to do is pump events once at the end of machine model creation. Any my
>>>>> coroutines will never yield or get queued by block/AIO. Sound like a
>>>>> solution?
>>>>
>>>> If you don't need the read data in your initialisation code,
>>>
>>> definately not :) Just as long as the read data is there by the time
>>> the machine goes live.  Whats the current policy with bdrv_read()ing
>>> from init functions anyway? Several devices in qemu have init
>>> functions that read the entire storage into a buffer (then the guest
>>> just talks to the buffer rather than the backing store).
>>
>> Reading from block devices during device initialisation breaks
>> migration, so I'd like to see it go away wherever possible. Reading in
>> the whole image file doesn't sound like something for which a good
>> excuse exists, you can do that as well during the first access.
>
> I just stumbled over another problem case: encrypted images.
>
> Encrypted images specified on the command line get their keys from the
> monitor.  -S is forced automatically.  You can then use block_passwd to
> set keys.  cont asks for keys still missing.
>
> However, device initialization happens *before* you get access to the
> monitor.  Therefore, your init method runs without a key.
>
> It would be nice if the monitor was available before device
> initialization.  Patches welcome.
>

Hi Markus,

I consolidated this discussion into a new RFC which proposes a few
solutions the the bdrv_read() from init() problem.

http://lists.gnu.org/archive/html/qemu-devel/2012-07/msg00237.html

Are you able to comment on those ideas WRT your latest thoughts?

Regards,
Peter

> [...]
Markus Armbruster - July 13, 2012, 8:33 a.m.
Peter Crosthwaite <peter.crosthwaite@petalogix.com> writes:

> Hi Markus,
>
> I consolidated this discussion into a new RFC which proposes a few
> solutions the the bdrv_read() from init() problem.
>
> http://lists.gnu.org/archive/html/qemu-devel/2012-07/msg00237.html
>
> Are you able to comment on those ideas WRT your latest thoughts?

Done.

Patch

diff --git a/block.c b/block.c
index 0acdcac..b50af15 100644
--- a/block.c
+++ b/block.c
@@ -380,7 +380,7 @@  int bdrv_create(BlockDriver *drv, const char* filename,
         return -ENOTSUP;
     }
 
-    if (qemu_in_coroutine()) {
+    if (0) {
         /* Fast-path if already in coroutine context */
         bdrv_create_co_entry(&cco);
     } else {
@@ -1590,7 +1590,7 @@  static int bdrv_rw_co(BlockDriverState *bs, int64_t sector_num, uint8_t *buf,
         bdrv_io_limits_disable(bs);
     }
 
-    if (qemu_in_coroutine()) {
+    if (0) {
         /* Fast-path if already in coroutine context */
         bdrv_rw_co_entry(&rwco);
     } else {
@@ -3813,7 +3813,7 @@  int bdrv_flush(BlockDriverState *bs)
         .ret = NOT_DONE,
     };
 
-    if (qemu_in_coroutine()) {
+    if (0) {
         /* Fast-path if already in coroutine context */
         bdrv_flush_co_entry(&rwco);
     } else {
@@ -3874,7 +3874,7 @@  int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
         .ret = NOT_DONE,
     };
 
-    if (qemu_in_coroutine()) {
+    if (0) {
         /* Fast-path if already in coroutine context */
         bdrv_discard_co_entry(&rwco);
     } else {