Patchwork [v4,05/16] memory: introduce ref, unref interface for MemoryRegionOps

login
register
mail settings
Submitter pingfank@linux.vnet.ibm.com
Date Oct. 22, 2012, 9:23 a.m.
Message ID <1350897839-29593-6-git-send-email-pingfank@linux.vnet.ibm.com>
Download mbox | patch
Permalink /patch/193197/
State New
Headers show

Comments

pingfank@linux.vnet.ibm.com - Oct. 22, 2012, 9:23 a.m.
This pair of interface help to decide when dispatching, whether
we can pin mr without big lock or not.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 memory.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)
Paolo Bonzini - Oct. 23, 2012, 11:51 a.m.
Il 22/10/2012 11:38, Avi Kivity ha scritto:
>> >  
>> >  typedef struct MemoryRegionOps MemoryRegionOps;
>> >  typedef struct MemoryRegion MemoryRegion;
>> > @@ -66,6 +67,8 @@ struct MemoryRegionOps {
>> >                    target_phys_addr_t addr,
>> >                    uint64_t data,
>> >                    unsigned size);
>> > +    int (*ref)(MemoryRegion *mr);
>> > +    void (*unref)(MemoryRegion *mr);
>> >  
> Why return an int?  Should succeed unconditionally.  Please fold into 7
> (along with 6).

So the stop_machine idea is thrown away?  I really believe we're going
down a rat's nest with reference counting.

Paolo
Avi Kivity - Oct. 23, 2012, 11:55 a.m.
On 10/23/2012 01:51 PM, Paolo Bonzini wrote:
> Il 22/10/2012 11:38, Avi Kivity ha scritto:
>>> >  
>>> >  typedef struct MemoryRegionOps MemoryRegionOps;
>>> >  typedef struct MemoryRegion MemoryRegion;
>>> > @@ -66,6 +67,8 @@ struct MemoryRegionOps {
>>> >                    target_phys_addr_t addr,
>>> >                    uint64_t data,
>>> >                    unsigned size);
>>> > +    int (*ref)(MemoryRegion *mr);
>>> > +    void (*unref)(MemoryRegion *mr);
>>> >  
>> Why return an int?  Should succeed unconditionally.  Please fold into 7
>> (along with 6).
> 
> So the stop_machine idea is thrown away?  

IIRC I convinced myself that it's just as bad.

> I really believe we're going
> down a rat's nest with reference counting.

There will be a lot of teething problems, but the same ideas are used
extensively in the kernel.
Paolo Bonzini - Oct. 23, 2012, 11:57 a.m.
Il 23/10/2012 13:55, Avi Kivity ha scritto:
>> > So the stop_machine idea is thrown away?  
> IIRC I convinced myself that it's just as bad.

It may be just as bad, but it is less code (and less pervasive), which
makes it less painful.

Paolo
Avi Kivity - Oct. 23, 2012, 12:02 p.m.
On 10/23/2012 01:57 PM, Paolo Bonzini wrote:
> Il 23/10/2012 13:55, Avi Kivity ha scritto:
>>> > So the stop_machine idea is thrown away?  
>> IIRC I convinced myself that it's just as bad.
> 
> It may be just as bad, but it is less code (and less pervasive), which
> makes it less painful.

It saves you the ->ref() and ->unref() calls, which are boilerplate, but
not too onerous. All of the device model and subsystem threading work
still needs to be done.
Jan Kiszka - Oct. 23, 2012, 12:04 p.m.
On 2012-10-23 13:55, Avi Kivity wrote:
> On 10/23/2012 01:51 PM, Paolo Bonzini wrote:
>> Il 22/10/2012 11:38, Avi Kivity ha scritto:
>>>>>  
>>>>>  typedef struct MemoryRegionOps MemoryRegionOps;
>>>>>  typedef struct MemoryRegion MemoryRegion;
>>>>> @@ -66,6 +67,8 @@ struct MemoryRegionOps {
>>>>>                    target_phys_addr_t addr,
>>>>>                    uint64_t data,
>>>>>                    unsigned size);
>>>>> +    int (*ref)(MemoryRegion *mr);
>>>>> +    void (*unref)(MemoryRegion *mr);
>>>>>  
>>> Why return an int?  Should succeed unconditionally.  Please fold into 7
>>> (along with 6).
>>
>> So the stop_machine idea is thrown away?  
> 
> IIRC I convinced myself that it's just as bad.

One tricky part with stop machine is that legacy code may trigger it
while holding the BQL, does not expect to lose that lock even for a
brief while, but synchronizing on other threads does require dropping
the lock right now. Maybe an implementation detail, but at least a nasty
one.

Jan
Paolo Bonzini - Oct. 23, 2012, 12:06 p.m.
Il 23/10/2012 14:02, Avi Kivity ha scritto:
> On 10/23/2012 01:57 PM, Paolo Bonzini wrote:
>> Il 23/10/2012 13:55, Avi Kivity ha scritto:
>>>>> So the stop_machine idea is thrown away?  
>>> IIRC I convinced myself that it's just as bad.
>>
>> It may be just as bad, but it is less code (and less pervasive), which
>> makes it less painful.
> 
> It saves you the ->ref() and ->unref() calls, which are boilerplate, but
> not too onerous. All of the device model and subsystem threading work
> still needs to be done.

I'm not worried about saving the ->ref() and ->unref() calls in the
devices.  I'm worried about saving it in timers, bottom halves and
whatnot.  And also I'm not sure whether all callbacks would have
something to ref/unref as they are implemented now.

Paolo
Paolo Bonzini - Oct. 23, 2012, 12:12 p.m.
Il 23/10/2012 14:04, Jan Kiszka ha scritto:
>>> >>
>>> >> So the stop_machine idea is thrown away?  
>> > 
>> > IIRC I convinced myself that it's just as bad.
> One tricky part with stop machine is that legacy code may trigger it
> while holding the BQL, does not expect to lose that lock even for a
> brief while, but synchronizing on other threads does require dropping
> the lock right now. Maybe an implementation detail, but at least a nasty
> one.

But it would only be triggered by hot-unplug, no?  That is already an
asynchronous action, so it is not a problem to delay the actual
stop_machine+qdev_free (and just that part!) to a bottom half or another
place when it is safe to drop the BQL.

Paolo
Avi Kivity - Oct. 23, 2012, 12:15 p.m.
On 10/23/2012 02:06 PM, Paolo Bonzini wrote:
> Il 23/10/2012 14:02, Avi Kivity ha scritto:
>> On 10/23/2012 01:57 PM, Paolo Bonzini wrote:
>>> Il 23/10/2012 13:55, Avi Kivity ha scritto:
>>>>>> So the stop_machine idea is thrown away?  
>>>> IIRC I convinced myself that it's just as bad.
>>>
>>> It may be just as bad, but it is less code (and less pervasive), which
>>> makes it less painful.
>> 
>> It saves you the ->ref() and ->unref() calls, which are boilerplate, but
>> not too onerous. All of the device model and subsystem threading work
>> still needs to be done.
> 
> I'm not worried about saving the ->ref() and ->unref() calls in the
> devices.  I'm worried about saving it in timers, bottom halves and
> whatnot.  And also I'm not sure whether all callbacks would have
> something to ref/unref as they are implemented now.

Hard to say without examples.

Something that bothers be with stop_machine is the reliance on
cancellation.  With timers it's easy, stop_machine, remove the timer,
resume.  But if you have an aio operation in progress that is not
cancellable, you have to wait for that operation to complete.  Refcounts
handle that well, the object stays until completion, then disappears.
Jan Kiszka - Oct. 23, 2012, 12:16 p.m.
On 2012-10-23 14:12, Paolo Bonzini wrote:
> Il 23/10/2012 14:04, Jan Kiszka ha scritto:
>>>>>>
>>>>>> So the stop_machine idea is thrown away?  
>>>>
>>>> IIRC I convinced myself that it's just as bad.
>> One tricky part with stop machine is that legacy code may trigger it
>> while holding the BQL, does not expect to lose that lock even for a
>> brief while, but synchronizing on other threads does require dropping
>> the lock right now. Maybe an implementation detail, but at least a nasty
>> one.
> 
> But it would only be triggered by hot-unplug, no?

Once all code that adds/removes memory regions from within access
handlers is converted. Legacy is biting, not necessarily the pure model.

>  That is already an
> asynchronous action, so it is not a problem to delay the actual
> stop_machine+qdev_free (and just that part!) to a bottom half or another
> place when it is safe to drop the BQL.

Jan
Avi Kivity - Oct. 23, 2012, 12:28 p.m.
On 10/23/2012 02:16 PM, Jan Kiszka wrote:
> On 2012-10-23 14:12, Paolo Bonzini wrote:
>> Il 23/10/2012 14:04, Jan Kiszka ha scritto:
>>>>>>>
>>>>>>> So the stop_machine idea is thrown away?  
>>>>>
>>>>> IIRC I convinced myself that it's just as bad.
>>> One tricky part with stop machine is that legacy code may trigger it
>>> while holding the BQL, does not expect to lose that lock even for a
>>> brief while, but synchronizing on other threads does require dropping
>>> the lock right now. Maybe an implementation detail, but at least a nasty
>>> one.
>> 
>> But it would only be triggered by hot-unplug, no?
> 
> Once all code that adds/removes memory regions from within access
> handlers is converted. 

add/del is fine.  memory_region_destroy() is the problem.  I have
patches queued that fix those problems and add an assert() to make sure
we don't add more.

It's not just memory regions, it's practically anything that can be
removed and that has callbacks.  The two proposals are:

- qomify
- split unplug into isolate+destroy
- let the issuer of the callbacks manage the reference counts

vs

- split unplug into isolate+destroy
- let unplug defer destruction to a bottom half, and stop_machine there
- if we depend on the results [1], add a continuation

[1] Say a monitor command wants to return only after the block device
has been detached from qemu

> Legacy is biting, not necessarily the pure model.
Paolo Bonzini - Oct. 23, 2012, 12:32 p.m.
Il 23/10/2012 14:15, Avi Kivity ha scritto:
> On 10/23/2012 02:06 PM, Paolo Bonzini wrote:
>> Il 23/10/2012 14:02, Avi Kivity ha scritto:
>>> On 10/23/2012 01:57 PM, Paolo Bonzini wrote:
>>>> Il 23/10/2012 13:55, Avi Kivity ha scritto:
>>>>>>> So the stop_machine idea is thrown away?  
>>>>> IIRC I convinced myself that it's just as bad.
>>>>
>>>> It may be just as bad, but it is less code (and less pervasive), which
>>>> makes it less painful.
>>>
>>> It saves you the ->ref() and ->unref() calls, which are boilerplate, but
>>> not too onerous. All of the device model and subsystem threading work
>>> still needs to be done.
>>
>> I'm not worried about saving the ->ref() and ->unref() calls in the
>> devices.  I'm worried about saving it in timers, bottom halves and
>> whatnot.  And also I'm not sure whether all callbacks would have
>> something to ref/unref as they are implemented now.
> 
> Hard to say without examples.
> 
> Something that bothers be with stop_machine is the reliance on
> cancellation.  With timers it's easy, stop_machine, remove the timer,
> resume.  But if you have an aio operation in progress that is not
> cancellable, you have to wait for that operation to complete.  Refcounts
> handle that well, the object stays until completion, then disappears.

Yes, that's the point of doing things asynchronously---you do not need
to do everything within stop_machine, you can start canceling AIO as
soon as the OS sends the hot-unplug request.  Then you only proceed with
stop_machine and freeing device memory when the first part.

In other words, isolate can complete asynchronously.

The good thing is that this is an improvement that can be applied on top
of the current code, which avoids doing too many things at once...

Paolo
Jan Kiszka - Oct. 23, 2012, 12:40 p.m.
On 2012-10-23 14:28, Avi Kivity wrote:
> On 10/23/2012 02:16 PM, Jan Kiszka wrote:
>> On 2012-10-23 14:12, Paolo Bonzini wrote:
>>> Il 23/10/2012 14:04, Jan Kiszka ha scritto:
>>>>>>>>
>>>>>>>> So the stop_machine idea is thrown away?  
>>>>>>
>>>>>> IIRC I convinced myself that it's just as bad.
>>>> One tricky part with stop machine is that legacy code may trigger it
>>>> while holding the BQL, does not expect to lose that lock even for a
>>>> brief while, but synchronizing on other threads does require dropping
>>>> the lock right now. Maybe an implementation detail, but at least a nasty
>>>> one.
>>>
>>> But it would only be triggered by hot-unplug, no?
>>
>> Once all code that adds/removes memory regions from within access
>> handlers is converted. 
> 
> add/del is fine.  memory_region_destroy() is the problem.  I have
> patches queued that fix those problems and add an assert() to make sure
> we don't add more.
> 
> It's not just memory regions, it's practically anything that can be
> removed and that has callbacks.  The two proposals are:
> 
> - qomify
> - split unplug into isolate+destroy
> - let the issuer of the callbacks manage the reference counts

What do you mean with the last one?

> 
> vs
> 
> - split unplug into isolate+destroy
> - let unplug defer destruction to a bottom half, and stop_machine there
> - if we depend on the results [1], add a continuation
> 
> [1] Say a monitor command wants to return only after the block device
> has been detached from qemu

The monitor is likely harmless (as it's pretty confined). But is that
all? Hunting down all (corner) cases will make switching to this model
tricky.

Jan
Avi Kivity - Oct. 23, 2012, 2:37 p.m.
On 10/23/2012 02:40 PM, Jan Kiszka wrote:
> On 2012-10-23 14:28, Avi Kivity wrote:
>> On 10/23/2012 02:16 PM, Jan Kiszka wrote:
>>> On 2012-10-23 14:12, Paolo Bonzini wrote:
>>>> Il 23/10/2012 14:04, Jan Kiszka ha scritto:
>>>>>>>>>
>>>>>>>>> So the stop_machine idea is thrown away?  
>>>>>>>
>>>>>>> IIRC I convinced myself that it's just as bad.
>>>>> One tricky part with stop machine is that legacy code may trigger it
>>>>> while holding the BQL, does not expect to lose that lock even for a
>>>>> brief while, but synchronizing on other threads does require dropping
>>>>> the lock right now. Maybe an implementation detail, but at least a nasty
>>>>> one.
>>>>
>>>> But it would only be triggered by hot-unplug, no?
>>>
>>> Once all code that adds/removes memory regions from within access
>>> handlers is converted. 
>> 
>> add/del is fine.  memory_region_destroy() is the problem.  I have
>> patches queued that fix those problems and add an assert() to make sure
>> we don't add more.
>> 
>> It's not just memory regions, it's practically anything that can be
>> removed and that has callbacks.  The two proposals are:
>> 
>> - qomify
>> - split unplug into isolate+destroy
>> - let the issuer of the callbacks manage the reference counts
> 
> What do you mean with the last one?

Call ref/unref as needed (this patchset).

Here, "the issuer of the callbacks" is the memory core.  For timer
callbacks, it is the timer subsystem.

> 
>> 
>> vs
>> 
>> - split unplug into isolate+destroy
>> - let unplug defer destruction to a bottom half, and stop_machine there
>> - if we depend on the results [1], add a continuation
>> 
>> [1] Say a monitor command wants to return only after the block device
>> has been detached from qemu
> 
> The monitor is likely harmless (as it's pretty confined). But is that
> all? Hunting down all (corner) cases will make switching to this model
> tricky.

That is my feeling as well.  The first model requires more work, but is
complete.  The second model is easier, but we may run into a wall if we
find a case it doesn't cover.
Avi Kivity - Oct. 23, 2012, 2:49 p.m.
On 10/23/2012 02:32 PM, Paolo Bonzini wrote:
> Il 23/10/2012 14:15, Avi Kivity ha scritto:
>> On 10/23/2012 02:06 PM, Paolo Bonzini wrote:
>>> Il 23/10/2012 14:02, Avi Kivity ha scritto:
>>>> On 10/23/2012 01:57 PM, Paolo Bonzini wrote:
>>>>> Il 23/10/2012 13:55, Avi Kivity ha scritto:
>>>>>>>> So the stop_machine idea is thrown away?  
>>>>>> IIRC I convinced myself that it's just as bad.
>>>>>
>>>>> It may be just as bad, but it is less code (and less pervasive), which
>>>>> makes it less painful.
>>>>
>>>> It saves you the ->ref() and ->unref() calls, which are boilerplate, but
>>>> not too onerous. All of the device model and subsystem threading work
>>>> still needs to be done.
>>>
>>> I'm not worried about saving the ->ref() and ->unref() calls in the
>>> devices.  I'm worried about saving it in timers, bottom halves and
>>> whatnot.  And also I'm not sure whether all callbacks would have
>>> something to ref/unref as they are implemented now.
>> 
>> Hard to say without examples.
>> 
>> Something that bothers be with stop_machine is the reliance on
>> cancellation.  With timers it's easy, stop_machine, remove the timer,
>> resume.  But if you have an aio operation in progress that is not
>> cancellable, you have to wait for that operation to complete.  Refcounts
>> handle that well, the object stays until completion, then disappears.
> 
> Yes, that's the point of doing things asynchronously---you do not need
> to do everything within stop_machine, you can start canceling AIO as
> soon as the OS sends the hot-unplug request.  Then you only proceed with
> stop_machine and freeing device memory when the first part.

You cannot always cancel I/O (for example threaded I/O already in progress).

> In other words, isolate can complete asynchronously.

Can it?  I don't think so.

Here's how I see it:

 1. non-malicious guest stops driving device
 2. isolate()
 3. a malicious guest cannot drive the device at this point
 4. some kind of barrier to let the device, or drive activity from a
malicious guest, wind down
 5. destroy()

If you need to report the completion of step 2, it cannot be done
asynchronously.  One example is if the guest drivers the process from an
mmio callback.

We may also want notification after step 4 (or 5); if the device holds
some host resource someone may want to know that it is ready for reuse.

> The good thing is that this is an improvement that can be applied on top
> of the current code, which avoids doing too many things at once...
Paolo Bonzini - Oct. 23, 2012, 3:26 p.m.
Il 23/10/2012 16:49, Avi Kivity ha scritto:
> On 10/23/2012 02:32 PM, Paolo Bonzini wrote:
>> Il 23/10/2012 14:15, Avi Kivity ha scritto:
>>> On 10/23/2012 02:06 PM, Paolo Bonzini wrote:
>>>> Il 23/10/2012 14:02, Avi Kivity ha scritto:
>>>>> On 10/23/2012 01:57 PM, Paolo Bonzini wrote:
>>>>>> Il 23/10/2012 13:55, Avi Kivity ha scritto:
>>>>>>>>> So the stop_machine idea is thrown away?  
>>>>>>> IIRC I convinced myself that it's just as bad.
>>>>>>
>>>>>> It may be just as bad, but it is less code (and less pervasive), which
>>>>>> makes it less painful.
>>>>>
>>>>> It saves you the ->ref() and ->unref() calls, which are boilerplate, but
>>>>> not too onerous. All of the device model and subsystem threading work
>>>>> still needs to be done.
>>>>
>>>> I'm not worried about saving the ->ref() and ->unref() calls in the
>>>> devices.  I'm worried about saving it in timers, bottom halves and
>>>> whatnot.  And also I'm not sure whether all callbacks would have
>>>> something to ref/unref as they are implemented now.
>>>
>>> Hard to say without examples.
>>>
>>> Something that bothers be with stop_machine is the reliance on
>>> cancellation.  With timers it's easy, stop_machine, remove the timer,
>>> resume.  But if you have an aio operation in progress that is not
>>> cancellable, you have to wait for that operation to complete.  Refcounts
>>> handle that well, the object stays until completion, then disappears.
>>
>> Yes, that's the point of doing things asynchronously---you do not need
>> to do everything within stop_machine, you can start canceling AIO as
>> soon as the OS sends the hot-unplug request.  Then you only proceed with
>> stop_machine and freeing device memory when the first part.
> 
> You cannot always cancel I/O (for example threaded I/O already in progress).

Yep, but we try to do this anyway today and nothing changes really.  The
difference is between hotplug never completing and blocking
synchronously, vs. hotplug never completing and not invoking the
asynchronous callback.  I.e. really no difference at all.

>> In other words, isolate can complete asynchronously.
> 
> Can it?  I don't think so.
> 
> Here's how I see it:
> 
>  1. non-malicious guest stops driving device
>  2. isolate()
>  3. a malicious guest cannot drive the device at this point
>  4. some kind of barrier to let the device, or drive activity from a
> malicious guest, wind down
>  5. destroy()
> 
> If you need to report the completion of step 2, it cannot be done
> asynchronously.

In hardware everything is asynchronous anyway.  It will *look*
synchronous, because if CPU#0 is stuck in a synchronous isolate(), and
CPU#1 polls for the outcome, CPU#1 will lock on the BQL held by CPU#0.

But our interfaces had better support asynchronicity, and indeed they
do: after you write to the "eject" register, the "up" will show the
device as present until after destroy is done.  This can be changed to
show the device as present only until after step 4 is done.

> We may also want notification after step 4 (or 5); if the device holds
> some host resource someone may want to know that it is ready for reuse.

I think guest notification should be after (4), while management
notification should be after (5).

Paolo
Avi Kivity - Oct. 23, 2012, 4:09 p.m.
On 10/23/2012 05:26 PM, Paolo Bonzini wrote:
>>> Yes, that's the point of doing things asynchronously---you do not need
>>> to do everything within stop_machine, you can start canceling AIO as
>>> soon as the OS sends the hot-unplug request.  Then you only proceed with
>>> stop_machine and freeing device memory when the first part.
>> 
>> You cannot always cancel I/O (for example threaded I/O already in progress).
> 
> Yep, but we try to do this anyway today and nothing changes really.  The
> difference is between hotplug never completing and blocking
> synchronously, vs. hotplug never completing and not invoking the
> asynchronous callback.  I.e. really no difference at all.

Not cancelling is not the same as not completing; the request will
complete on its own eventually.  The question is whether the programming
model is synchronous or callback based.

> 
>>> In other words, isolate can complete asynchronously.
>> 
>> Can it?  I don't think so.
>> 
>> Here's how I see it:
>> 
>>  1. non-malicious guest stops driving device
>>  2. isolate()
>>  3. a malicious guest cannot drive the device at this point
>>  4. some kind of barrier to let the device, or drive activity from a
>> malicious guest, wind down
>>  5. destroy()
>> 
>> If you need to report the completion of step 2, it cannot be done
>> asynchronously.
> 
> In hardware everything is asynchronous anyway.  It will *look*
> synchronous, because if CPU#0 is stuck in a synchronous isolate(), and
> CPU#1 polls for the outcome, CPU#1 will lock on the BQL held by CPU#0.

That is fine.  isolate() is expensive but it is cpu bound, it does not
involve any I/O (unlike the barrier afterwards, which has to wait on any
I/O which we were not able to cancel).

> But our interfaces had better support asynchronicity, and indeed they
> do: after you write to the "eject" register, the "up" will show the
> device as present until after destroy is done.  This can be changed to
> show the device as present only until after step 4 is done.

Let's say we want to eject the hotplug hardware itself (just as an
example).  With refcounts, the callback that updates "up" will hold on
to to it via refcounts.  With stop_machine(), you need to cancel that
callback, or wait for it somehow, or it can arrive after the
stop_machine() and bite you.

> 
>> We may also want notification after step 4 (or 5); if the device holds
>> some host resource someone may want to know that it is ready for reuse.
> 
> I think guest notification should be after (4), while management
> notification should be after (5).

Yes. After (2) we can return from the eject mmio.
Paolo Bonzini - Oct. 24, 2012, 7:29 a.m.
Il 23/10/2012 18:09, Avi Kivity ha scritto:
>> But our interfaces had better support asynchronicity, and indeed they
>> do: after you write to the "eject" register, the "up" will show the
>> device as present until after destroy is done.  This can be changed to
>> show the device as present only until after step 4 is done.
> 
> Let's say we want to eject the hotplug hardware itself (just as an
> example).  With refcounts, the callback that updates "up" will hold on
> to to it via refcounts.  With stop_machine(), you need to cancel that
> callback, or wait for it somehow, or it can arrive after the
> stop_machine() and bite you.

The callback that updates "up" is for the parent of the hotplug
hardware.  There is nothing that has to be updated in the hotplug
hardware itself.

Updating the "up" register is the final part of isolate(), and runs
before the stop_machine().  The steps above can be further refined like
this:

4a. close all backends (also cancel or complete all pending I/O)
4b. notify parent that we're done
    4ba. parent removes device from its bus
    4bb. parent notifies guest
    4bc. parent schedules stop_machine(qdev_free(child))
5. a bottom half calls stop_machine(qdev_free(child))

If unplugging a whole sub-tree, the parent can notify its own parent at
the end of 4b.  Because the only purpose of stop_machine is to quiesce
subsystems not affected by step 4 (timer+memory, typically),
destructions can be done in any order and even intermixed with
executions of 4b for the parent.

In the beginning the only asynchronous step would be 5.  If the need
arises we can use continuation-passing to make all the preceding steps
asynchronous too.

>>> We may also want notification after step 4 (or 5); if the device holds
>>> some host resource someone may want to know that it is ready for reuse.
>>
>> I think guest notification should be after (4), while management
>> notification should be after (5).
> Yes. After (2) we can return from the eject mmio.

Agreed.

Paolo
Avi Kivity - Oct. 25, 2012, 4:28 p.m.
On 10/24/2012 09:29 AM, Paolo Bonzini wrote:
> Il 23/10/2012 18:09, Avi Kivity ha scritto:
>>> But our interfaces had better support asynchronicity, and indeed they
>>> do: after you write to the "eject" register, the "up" will show the
>>> device as present until after destroy is done.  This can be changed to
>>> show the device as present only until after step 4 is done.
>> 
>> Let's say we want to eject the hotplug hardware itself (just as an
>> example).  With refcounts, the callback that updates "up" will hold on
>> to to it via refcounts.  With stop_machine(), you need to cancel that
>> callback, or wait for it somehow, or it can arrive after the
>> stop_machine() and bite you.
> 
> The callback that updates "up" is for the parent of the hotplug
> hardware.  There is nothing that has to be updated in the hotplug
> hardware itself.

I meant, as an unrealistic example, hot-unplugging the bridge itself.
So we have a callback that updates information in the bridge (up
register state) being called asynchronously.

A more realistic example would be hot-unplug of an HBA, then the block
layer callback comes back to update the device.  So stop_machine() would
need to cancel all I/O and wait for I/O that cannot be cancelled.

> 
> Updating the "up" register is the final part of isolate(), and runs
> before the stop_machine().  The steps above can be further refined like
> this:
> 
> 4a. close all backends (also cancel or complete all pending I/O)

^ long latency

> 4b. notify parent that we're done
>     4ba. parent removes device from its bus
>     4bb. parent notifies guest
>     4bc. parent schedules stop_machine(qdev_free(child))
> 5. a bottom half calls stop_machine(qdev_free(child))
> 
> If unplugging a whole sub-tree, the parent can notify its own parent at
> the end of 4b.  Because the only purpose of stop_machine is to quiesce
> subsystems not affected by step 4 (timer+memory, typically),
> destructions can be done in any order and even intermixed with
> executions of 4b for the parent.
> 
> In the beginning the only asynchronous step would be 5.  If the need
> arises we can use continuation-passing to make all the preceding steps
> asynchronous too.
> 

Maybe my worry about long stop_machine latencies is premature.  Everyone
in the kernel hates it, but the kernel scales a lot more than qemu and
is in a much better place wrt threading.
Paolo Bonzini - Oct. 26, 2012, 3:05 p.m.
----- Messaggio originale -----
> Da: "Avi Kivity" <avi@redhat.com>
> A: "Paolo Bonzini" <pbonzini@redhat.com>
> Cc: "Liu Ping Fan" <pingfank@linux.vnet.ibm.com>, qemu-devel@nongnu.org, "Anthony Liguori" <anthony@codemonkey.ws>,
> "Marcelo Tosatti" <mtosatti@redhat.com>, "Jan Kiszka" <jan.kiszka@siemens.com>, "Stefan Hajnoczi"
> <stefanha@gmail.com>
> Inviato: Giovedì, 25 ottobre 2012 18:28:27
> Oggetto: Re: [patch v4 05/16] memory: introduce ref,unref interface for MemoryRegionOps
> 
> On 10/24/2012 09:29 AM, Paolo Bonzini wrote:
> > Il 23/10/2012 18:09, Avi Kivity ha scritto:
> >>> But our interfaces had better support asynchronicity, and indeed
> >>> they
> >>> do: after you write to the "eject" register, the "up" will show
> >>> the
> >>> device as present until after destroy is done.  This can be
> >>> changed to
> >>> show the device as present only until after step 4 is done.
> >> 
> >> Let's say we want to eject the hotplug hardware itself (just as an
> >> example).  With refcounts, the callback that updates "up" will hold
> >> on to to it via refcounts.  With stop_machine(), you need to cancel
> >> that callback, or wait for it somehow, or it can arrive after the
> >> stop_machine() and bite you.
> > 
> > The callback that updates "up" is for the parent of the hotplug
> > hardware.  There is nothing that has to be updated in the hotplug
> > hardware itself.
> 
> I meant, as an unrealistic example, hot-unplugging the bridge itself.
> So we have a callback that updates information in the bridge (up
> register state) being called asynchronously.
> 
> A more realistic example would be hot-unplug of an HBA, then the block
> layer callback comes back to update the device.  So stop_machine()
> would need to cancel all I/O and wait for I/O that cannot be cancelled.

Cancellation+wait would be triggered by isolate (4a) and it would run
outside stop_machine().  We know that stop_machine() will eventually
run because the guest cannot place more requests for the devices to
process.

At this point we're here:

> > 4a. close all backends (also cancel or complete all pending I/O)
> 
> ^ long latency
> 

but none of this is done in stop_machine().  Once cancellation/wait
finishes, the HBA gives a green-light to the parent, which proceeds
as follows:

> > 4b. notify parent that we're done
> >     4ba. parent removes device from its bus
> >     4bb. parent notifies guest
> >     4bc. parent schedules stop_machine(qdev_free(child))
> > 5. a bottom half calls stop_machine(qdev_free(child))

All we're doing in stop_machine() is really calling the destructor,
which---in an isolate-enabled device---only includes calls to
qemu_del_timer, drive_put_ref, memory_region_destroy and the like.

> Maybe my worry about long stop_machine latencies is premature.
> Everyone in the kernel hates it, but the kernel scales a lot more
> than qemu and is in a much better place wrt threading.

stop_machine may indeed require (or at least warmly suggest) a
conversion to isolate of storage devices, in order to reduce the
latency of the destructor.  We do not have that many though (the
IDE and SCSI buses, and virtio-blk).

Paolo

Patch

diff --git a/memory.h b/memory.h
index bd1bbae..9039411 100644
--- a/memory.h
+++ b/memory.h
@@ -25,6 +25,7 @@ 
 #include "iorange.h"
 #include "ioport.h"
 #include "int128.h"
+#include "qemu/object.h"
 
 typedef struct MemoryRegionOps MemoryRegionOps;
 typedef struct MemoryRegion MemoryRegion;
@@ -66,6 +67,8 @@  struct MemoryRegionOps {
                   target_phys_addr_t addr,
                   uint64_t data,
                   unsigned size);
+    int (*ref)(MemoryRegion *mr);
+    void (*unref)(MemoryRegion *mr);
 
     enum device_endian endianness;
     /* Guest-visible constraints: */