Patchwork [V3,10/11] vcpu: introduce lockmap

Submitter pingfan liu
Date Sept. 11, 2012, 7:51 a.m.
Message ID <1347349912-15611-11-git-send-email-qemulist@gmail.com>
Permalink /patch/183041/
State New

Comments

pingfan liu - Sept. 11, 2012, 7:51 a.m.
From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

The function call chain can end up holding qemu_mutex_lock_iothread
recursively. Introduce a lockmap to record the lock depth.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 cpus.c              |   18 ++++++++++++++++++
 qemu-thread-posix.c |   23 +++++++++++++++++++++++
 qemu-thread-posix.h |    5 +++++
 qemu-thread.h       |    3 +++
 4 files changed, 49 insertions(+), 0 deletions(-)
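[Editorial note: a minimal sketch of the "lockmap" idea described in the commit message, not the patch's actual code; all names are illustrative. A per-thread depth counter makes the big lock safe to take recursively: only the outermost acquire and release touch the underlying mutex.]

```c
#include <pthread.h>
#include <assert.h>

/* Illustrative sketch of a depth-counted big lock, not QEMU's code. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread unsigned int lock_depth;  /* per-thread lock depth */

static void big_lock_acquire(void)
{
    if (lock_depth++ == 0) {
        pthread_mutex_lock(&big_lock);    /* first entry: really lock */
    }
    /* nested entry: just record the depth, no self-deadlock */
}

static void big_lock_release(void)
{
    assert(lock_depth > 0);
    if (--lock_depth == 0) {
        pthread_mutex_unlock(&big_lock);  /* outermost exit: really unlock */
    }
}
```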
Avi Kivity - Sept. 11, 2012, 8:35 a.m.
On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> 
> The func call chain can suffer from recursively hold
> qemu_mutex_lock_iothread. We introduce lockmap to record the
> lock depth.

What is the root cause?  io handlers initiating I/O?

Perhaps we can have a solution that is local to exec.c.
pingfan liu - Sept. 11, 2012, 9:44 a.m.
On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>
>> The func call chain can suffer from recursively hold
>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>> lock depth.
>
> What is the root cause?  io handlers initiating I/O?
>
cpu_physical_memory_rw() can be called in a nested fashion, and when
called, it can be protected by no lock, a device lock, or the big lock.
I think that even without the big lock, the io-dispatcher faces the same
issue. As to the main loop, I have not considered it carefully, but at
least dma-helper will call cpu_physical_memory_rw() with the big lock held.

Regards,
pingfan

> Perhaps we can have a solution that is local to exec.c.
>
>
> --
> error compiling committee.c: too many arguments to function
Avi Kivity - Sept. 11, 2012, 9:54 a.m.
On 09/11/2012 12:44 PM, liu ping fan wrote:
> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>
>>> The func call chain can suffer from recursively hold
>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>> lock depth.
>>
>> What is the root cause?  io handlers initiating I/O?
>>
> cpu_physical_memory_rw() can be called nested, and when called, it can
> be protected by no-lock/device lock/ big-lock.

Then we should look for a solution that is local to exec.c (and the
nested dispatch problem).  I think we can identify and fix all nested
dispatches (converting them to either async dma, or letting the memory
core do a single dispatch without indirection through I/O handlers).

> I think without big lock, io-dispatcher should face the same issue.
> As to main-loop, have not carefully consider, but at least, dma-helper
> will call cpu_physical_memory_rw() with big-lock.

DMA is inherently asynchronous, so we already drop the lock between
initiation and completion; we need to find a way to make it easy to use
the API without taking the lock while the transfer takes place.
Jan Kiszka - Sept. 11, 2012, 9:57 a.m.
On 2012-09-11 11:44, liu ping fan wrote:
> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>
>>> The func call chain can suffer from recursively hold
>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>> lock depth.
>>
>> What is the root cause?  io handlers initiating I/O?
>>
> cpu_physical_memory_rw() can be called nested, and when called, it can
> be protected by no-lock/device lock/ big-lock.
> I think without big lock, io-dispatcher should face the same issue.
> As to main-loop, have not carefully consider, but at least, dma-helper
> will call cpu_physical_memory_rw() with big-lock.

That is our core problem: inconsistent invocation of existing services
/wrt locking. For portio, I was lucky that there is no nesting and I was
able to drop the big lock around all (x86) call sites. But MMIO is way
more tricky due to DMA nesting.

We could try to introduce a different version of cpu_physical_memory_rw,
cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
request can trigger the very same access in a nested fashion, and we
will have to detect this to avoid locking up QEMU (locking up the guest
might be OK).

Jan
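[Editorial note: a sketch of the detection Jan describes, with invented names. A per-thread flag catches an MMIO handler re-entering the dispatcher, e.g. via DMA pointed at its own BAR, and rejects the nested access rather than recursing forever; the guest may hang, but QEMU stays responsive.]

```c
#include <stdbool.h>
#include <stdio.h>

typedef bool (*mmio_handler)(unsigned long addr);

static __thread int dispatch_depth;  /* >0 while inside a dispatch */

static bool mmio_dispatch(unsigned long addr, mmio_handler handler)
{
    if (dispatch_depth > 0) {
        /* nested dispatch detected: bail out instead of locking up QEMU */
        fprintf(stderr, "nested MMIO access at 0x%lx rejected\n", addr);
        return false;
    }
    dispatch_depth++;
    bool ok = handler(addr);   /* handler may try to re-enter the dispatcher */
    dispatch_depth--;
    return ok;
}

/* a well-behaved handler, and one that triggers the very same access */
static bool benign(unsigned long addr)    { (void)addr; return true; }
static bool reentrant(unsigned long addr) { return mmio_dispatch(addr, reentrant); }
```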
Jan Kiszka - Sept. 11, 2012, 10:04 a.m.
On 2012-09-11 11:54, Avi Kivity wrote:
> On 09/11/2012 12:44 PM, liu ping fan wrote:
>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>
>>>> The func call chain can suffer from recursively hold
>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>>> lock depth.
>>>
>>> What is the root cause?  io handlers initiating I/O?
>>>
>> cpu_physical_memory_rw() can be called nested, and when called, it can
>> be protected by no-lock/device lock/ big-lock.
> 
> Then we should look for a solution that is local to exec.c (and the
> nested dispatch problem).  I think we can identify and fix all nested
> dispatches (converting them to either async dma, or letting the memory
> core do a single dispatch without indirection through I/O handlers).
> 
>> I think without big lock, io-dispatcher should face the same issue.
>> As to main-loop, have not carefully consider, but at least, dma-helper
>> will call cpu_physical_memory_rw() with big-lock.
> 
> DMA is inherently asynchronous, so we already drop the lock between
> initiation and completion; we need to find a way to make it easy to use
> the API without taking the lock while the transfer takes place.

We will have to review/rework device models that want to use the new
locking scheme in a way that they can drop their own lock while issuing
DMA. But that is surely non-trivial.

The other option is to keep DMA requests issued by devices synchronous
but let them fail if we are about to lock up. Still requires changes,
but is probably more comprehensible for device model developers.

Jan
Avi Kivity - Sept. 11, 2012, 11:03 a.m.
On 09/11/2012 01:04 PM, Jan Kiszka wrote:

>> DMA is inherently asynchronous, so we already drop the lock between
>> initiation and completion; we need to find a way to make it easy to use
>> the API without taking the lock while the transfer takes place.
> 
> We will have to review/rework device models that want to use the new
> locking scheme in a way that they can drop their own lock while issuing
> DMA. But that is surely non-trivial.

Most DMA today happens without the big qemu lock.  We only need to
convert the paths that actually access memory, these are the block and
network layers (for the respective devices).

> The other option is to keep DMA requests issued by devices synchronous
> but let them fail if we are about to lock up. Still requires changes,
> but is probably more comprehensible for device model developers.

How do you handle failures?
Jan Kiszka - Sept. 11, 2012, 11:08 a.m.
On 2012-09-11 13:03, Avi Kivity wrote:
> On 09/11/2012 01:04 PM, Jan Kiszka wrote:
> 
>>> DMA is inherently asynchronous, so we already drop the lock between
>>> initiation and completion; we need to find a way to make it easy to use
>>> the API without taking the lock while the transfer takes place.
>>
>> We will have to review/rework device models that want to use the new
>> locking scheme in a way that they can drop their own lock while issuing
>> DMA. But that is surely non-trivial.
> 
> Most DMA today happens without the big qemu lock.  We only need to
> convert the paths that actually access memory, these are the block and
> network layers (for the respective devices).

...and the devices themselves, of course.

> 
>> The other option is to keep DMA requests issued by devices synchronous
>> but let them fail if we are about to lock up. Still requires changes,
>> but is probably more comprehensible for device model developers.
> 
> How do you handle failures?

By not sending a network frame or dropping an incoming one, e.g., and
signaling this in a device specific way.

Jan
Avi Kivity - Sept. 11, 2012, 12:20 p.m.
On 09/11/2012 02:08 PM, Jan Kiszka wrote:
> On 2012-09-11 13:03, Avi Kivity wrote:
>> On 09/11/2012 01:04 PM, Jan Kiszka wrote:
>> 
>>>> DMA is inherently asynchronous, so we already drop the lock between
>>>> initiation and completion; we need to find a way to make it easy to use
>>>> the API without taking the lock while the transfer takes place.
>>>
>>> We will have to review/rework device models that want to use the new
>>> locking scheme in a way that they can drop their own lock while issuing
>>> DMA. But that is surely non-trivial.
>> 
>> Most DMA today happens without the big qemu lock.  We only need to
>> convert the paths that actually access memory, these are the block and
>> network layers (for the respective devices).
> 
> ...and the devices themselves, of course.

Right, for descriptors and stuff.  So a guest can set a DMA address to
point at its own BAR, and cause qemu to deadlock.

The problem is not limited to locking.  A guest can also cause a write
to a BAR to initiate a synchronous write with the same address and data,
triggering infinite recursion.

Perhaps one fix is the mythical DMA API, which will provide each DMA
initiator with its own view of memory (a private MemoryRegion root
node).  Even that can be worked around with a pair of devices accessing
each other.

> 
>> 
>>> The other option is to keep DMA requests issued by devices synchronous
>>> but let them fail if we are about to lock up. Still requires changes,
>>> but is probably more comprehensible for device model developers.
>> 
>> How do you handle failures?
> 
> By not sending a network frame or dropping an incoming one, e.g., and
> signaling this in a device specific way.

Doesn't work for block devices.
Avi Kivity - Sept. 11, 2012, 12:24 p.m.
On 09/11/2012 12:57 PM, Jan Kiszka wrote:
> On 2012-09-11 11:44, liu ping fan wrote:
>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>
>>>> The func call chain can suffer from recursively hold
>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>>> lock depth.
>>>
>>> What is the root cause?  io handlers initiating I/O?
>>>
>> cpu_physical_memory_rw() can be called nested, and when called, it can
>> be protected by no-lock/device lock/ big-lock.
>> I think without big lock, io-dispatcher should face the same issue.
>> As to main-loop, have not carefully consider, but at least, dma-helper
>> will call cpu_physical_memory_rw() with big-lock.
> 
> That is our core problem: inconsistent invocation of existing services
> /wrt locking. For portio, I was lucky that there is no nesting and I was
> able to drop the big lock around all (x86) call sites. But MMIO is way
> more tricky due to DMA nesting.

Maybe we need to switch to a continuation style.  Instead of expecting
cpu_physical_memory_rw() to complete synchronously, it becomes an
asynchronous call and you provide it with a completion.  That means
devices which use it are forced to drop the lock in between.  Block and
network clients will be easy to convert since they already use APIs that
drop the lock (except for accessing the descriptors).

> We could try to introduce a different version of cpu_physical_memory_rw,
> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
> request can trigger the very same access in a nested fashion, and we
> will have to detect this to avoid locking up QEMU (locking up the guest
> might be OK).

An async version of c_p_m_rw() will just cause a completion to bounce
around, consuming cpu but not deadlocking anything.  If we can keep a
count of the bounces, we might be able to stall it indefinitely or at
least ratelimit it.
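[Editorial note: a rough sketch of the bounce-counting idea, all names invented. A nested access re-queues the request as a "bounce" instead of recursing with locks held; once the counter passes a threshold the request can be stalled or ratelimited.]

```c
#include <stdbool.h>

#define MAX_BOUNCES 8   /* illustrative threshold */

typedef struct DMARequest DMARequest;
struct DMARequest {
    int bounces;                               /* times re-queued so far */
    void (*complete)(DMARequest *req, bool ok);
};

/* Hypothetical async variant of c_p_m_rw(): never recurses, only bounces. */
static void cpu_physical_memory_rw_async(DMARequest *req)
{
    if (++req->bounces > MAX_BOUNCES) {
        req->complete(req, false);  /* stall/ratelimit point */
        return;
    }
    /* ... perform one dispatch step; a nested access would re-queue the
     * request via cpu_physical_memory_rw_async(req), consuming CPU but
     * never deadlocking ... */
    req->complete(req, true);
}

/* sample completion that records the outcome */
static bool last_ok;
static void note_completion(DMARequest *req, bool ok) { (void)req; last_ok = ok; }
```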
Jan Kiszka - Sept. 11, 2012, 12:25 p.m.
On 2012-09-11 14:20, Avi Kivity wrote:
> On 09/11/2012 02:08 PM, Jan Kiszka wrote:
>> On 2012-09-11 13:03, Avi Kivity wrote:
>>> On 09/11/2012 01:04 PM, Jan Kiszka wrote:
>>>
>>>>> DMA is inherently asynchronous, so we already drop the lock between
>>>>> initiation and completion; we need to find a way to make it easy to use
>>>>> the API without taking the lock while the transfer takes place.
>>>>
>>>> We will have to review/rework device models that want to use the new
>>>> locking scheme in a way that they can drop their own lock while issuing
>>>> DMA. But that is surely non-trivial.
>>>
>>> Most DMA today happens without the big qemu lock.  We only need to
>>> convert the paths that actually access memory, these are the block and
>>> network layers (for the respective devices).
>>
>> ...and the devices themselves, of course.
> 
> Right, for descriptors and stuff.  So a guest can set a DMA address to
> point at its own BAR, and cause qemu to deadlock.
> 
> The problem is not limited to locking.  A guest can also cause a write
> to a BAR to initiate a synchronous write with the same address and data,
> triggering infinite recursion.
> 
> Perhaps one fix is the mythical DMA API, which will provide each DMA
> initiator with its own view of memory (a private MemoryRegion root
> node).  Even that can be worked around with a pair of devices accessing
> each other.

Hmm, don't see an alternative to runtime loop detection yet.

> 
>>
>>>
>>>> The other option is to keep DMA requests issued by devices synchronous
>>>> but let them fail if we are about to lock up. Still requires changes,
>>>> but is probably more comprehensible for device model developers.
>>>
>>> How do you handle failures?
>>
>> By not sending a network frame or dropping an incoming one, e.g., and
>> signaling this in a device specific way.
> 
> Doesn't work for block devices.

Because the block layer API cannot report errors to the devices? What
happens if there is a real I/O error?

Jan
Avi Kivity - Sept. 11, 2012, 12:30 p.m.
On 09/11/2012 03:25 PM, Jan Kiszka wrote:
>>>>
>>>> Most DMA today happens without the big qemu lock.  We only need to
>>>> convert the paths that actually access memory, these are the block and
>>>> network layers (for the respective devices).
>>>
>>> ...and the devices themselves, of course.
>> 
>> Right, for descriptors and stuff.  So a guest can set a DMA address to
>> point at its own BAR, and cause qemu to deadlock.
>> 
>> The problem is not limited to locking.  A guest can also cause a write
>> to a BAR to initiate a synchronous write with the same address and data,
>> triggering infinite recursion.
>> 
>> Perhaps one fix is the mythical DMA API, which will provide each DMA
>> initiator with its own view of memory (a private MemoryRegion root
>> node).  Even that can be worked around with a pair of devices accessing
>> each other.
> 
> Hmm, don't see an alternative to runtime loop detection yet.

See an expensive one on the other branch of this thread.

> 
>> 
>>>
>>>>
>>>>> The other option is to keep DMA requests issued by devices synchronous
>>>>> but let them fail if we are about to lock up. Still requires changes,
>>>>> but is probably more comprehensible for device model developers.
>>>>
>>>> How do you handle failures?
>>>
>>> By not sending a network frame or dropping an incoming one, e.g., and
>>> signaling this in a device specific way.
>> 
>> Doesn't work for block devices.
> 
> Because the block layer API cannot report errors to the devices? What
> happens if there is a real I/O error?

We report real I/O errors.  But if we report a transient error due to
some lock being taken as an I/O error, the guest will take unwarranted
action.

If the errors are not expected in normal operation (we can avoid them if
all DMA is to real RAM) then this is an acceptable solution.  Still it
generates a lot of rarely used code paths and so isn't very good for
security.
Jan Kiszka - Sept. 11, 2012, 12:35 p.m.
On 2012-09-11 14:30, Avi Kivity wrote:
>>>>>> The other option is to keep DMA requests issued by devices synchronous
>>>>>> but let them fail if we are about to lock up. Still requires changes,
>>>>>> but is probably more comprehensible for device model developers.
>>>>>
>>>>> How do you handle failures?
>>>>
>>>> By not sending a network frame or dropping an incoming one, e.g., and
>>>> signaling this in a device specific way.
>>>
>>> Doesn't work for block devices.
>>
>> Because the block layer API cannot report errors to the devices? What
>> happens if there is a real I/O error?
> 
> We report real I/O errors.  But if we report a transient error due to
> some lock being taken as an I/O error, the guest will take unwarranted
> action.
> 
> If the errors are not expected in normal operation (we can avoid them if
> all DMA is to real RAM) then this is an acceptable solution.  Still it
> generates a lot of rarely used code paths and so isn't very good for
> security.

I'm not talking about transient errors. Recursions like this are always
guest configuration errors that would cause real devices to lock up or
timeout as well. This is practically about avoiding that a malicious
guest can lock up QEMU, leaving it inoperative even for management tools.

Jan
Avi Kivity - Sept. 11, 2012, 12:39 p.m.
On 09/11/2012 03:35 PM, Jan Kiszka wrote:
> On 2012-09-11 14:30, Avi Kivity wrote:
>>>>>>> The other option is to keep DMA requests issued by devices synchronous
>>>>>>> but let them fail if we are about to lock up. Still requires changes,
>>>>>>> but is probably more comprehensible for device model developers.
>>>>>>
>>>>>> How do you handle failures?
>>>>>
>>>>> By not sending a network frame or dropping an incoming one, e.g., and
>>>>> signaling this in a device specific way.
>>>>
>>>> Doesn't work for block devices.
>>>
>>> Because the block layer API cannot report errors to the devices? What
>>> happens if there is a real I/O error?
>> 
>> We report real I/O errors.  But if we report a transient error due to
>> some lock being taken as an I/O error, the guest will take unwarranted
>> action.
>> 
>> If the errors are not expected in normal operation (we can avoid them if
>> all DMA is to real RAM) then this is an acceptable solution.  Still it
>> generates a lot of rarely used code paths and so isn't very good for
>> security.
> 
> I'm not talking about transient errors. Recursions like this are always
> guest configuration errors that would cause real devices to lock up or
> timeout as well. This is practically about avoiding that a malicious
> guest can lock up QEMU, leaving it inoperative even for management tools.

Ok.  That's more palatable.  We don't even have to report an error in
that case, we can just perform the operation incorrectly (as I'm sure
real hardware will), log an error, and continue.
Avi Kivity - Sept. 11, 2012, 12:41 p.m.
On 09/11/2012 03:24 PM, Avi Kivity wrote:
> On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>> On 2012-09-11 11:44, liu ping fan wrote:
>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>
>>>>> The func call chain can suffer from recursively hold
>>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>>>> lock depth.
>>>>
>>>> What is the root cause?  io handlers initiating I/O?
>>>>
>>> cpu_physical_memory_rw() can be called nested, and when called, it can
>>> be protected by no-lock/device lock/ big-lock.
>>> I think without big lock, io-dispatcher should face the same issue.
>>> As to main-loop, have not carefully consider, but at least, dma-helper
>>> will call cpu_physical_memory_rw() with big-lock.
>> 
>> That is our core problem: inconsistent invocation of existing services
>> /wrt locking. For portio, I was lucky that there is no nesting and I was
>> able to drop the big lock around all (x86) call sites. But MMIO is way
>> more tricky due to DMA nesting.
> 
> Maybe we need to switch to a continuation style.  Instead of expecting
> cpu_physical_memory_rw() to complete synchronously, it becomes an
> asynchronous call and you provide it with a completion.  That means
> devices which use it are forced to drop the lock in between.  Block and
> network clients will be easy to convert since they already use APIs that
> drop the lock (except for accessing the descriptors).
> 
>> We could try to introduce a different version of cpu_physical_memory_rw,
>> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
>> request can trigger the very same access in a nested fashion, and we
>> will have to detect this to avoid locking up QEMU (locking up the guest
>> might be OK).
> 
> An async version of c_p_m_rw() will just cause a completion to bounce
> around, consuming cpu but not deadlocking anything.  If we can keep a
> count of the bounces, we might be able to stall it indefinitely or at
> least ratelimit it.
> 

Another option is to require all users of c_p_m_rw() and related to use
a coroutine or thread.  That makes the programming easier (but still
required a revalidation after the dropped lock).
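[Editorial note: a sketch of the revalidation requirement mentioned above, with invented names. The coroutine or worker drops the lock around the transfer, so afterwards it must re-check device state: another thread may have reset or reconfigured the device in the meantime.]

```c
#include <pthread.h>
#include <stdbool.h>

struct dev_state {
    int generation;   /* bumped on every reset/reconfiguration */
    bool enabled;
};

static pthread_mutex_t dev_lock = PTHREAD_MUTEX_INITIALIZER;

static bool dma_transfer(struct dev_state *dev)
{
    pthread_mutex_lock(&dev_lock);
    int gen = dev->generation;        /* snapshot state under the lock */
    pthread_mutex_unlock(&dev_lock);

    /* ... c_p_m_rw()-style memory access runs here with no lock held ... */

    pthread_mutex_lock(&dev_lock);
    /* revalidate: discard the result if the device changed underneath us */
    bool still_valid = dev->enabled && dev->generation == gen;
    pthread_mutex_unlock(&dev_lock);
    return still_valid;
}
```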
Marcelo Tosatti - Sept. 11, 2012, 2:54 p.m.
On Tue, Sep 11, 2012 at 03:41:15PM +0300, Avi Kivity wrote:
> On 09/11/2012 03:24 PM, Avi Kivity wrote:
> > On 09/11/2012 12:57 PM, Jan Kiszka wrote:
> >> On 2012-09-11 11:44, liu ping fan wrote:
> >>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
> >>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
> >>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> >>>>>
> >>>>> The func call chain can suffer from recursively hold
> >>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
> >>>>> lock depth.
> >>>>
> >>>> What is the root cause?  io handlers initiating I/O?
> >>>>
> >>> cpu_physical_memory_rw() can be called nested, and when called, it can
> >>> be protected by no-lock/device lock/ big-lock.
> >>> I think without big lock, io-dispatcher should face the same issue.
> >>> As to main-loop, have not carefully consider, but at least, dma-helper
> >>> will call cpu_physical_memory_rw() with big-lock.
> >> 
> >> That is our core problem: inconsistent invocation of existing services
> >> /wrt locking. For portio, I was lucky that there is no nesting and I was
> >> able to drop the big lock around all (x86) call sites. But MMIO is way
> >> more tricky due to DMA nesting.
> > 
> > Maybe we need to switch to a continuation style.  Instead of expecting
> > cpu_physical_memory_rw() to complete synchronously, it becomes an
> > asynchronous call and you provide it with a completion.  That means
> > devices which use it are forced to drop the lock in between.  Block and
> > network clients will be easy to convert since they already use APIs that
> > drop the lock (except for accessing the descriptors).
> > 
> >> We could try to introduce a different version of cpu_physical_memory_rw,
> >> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
> >> request can trigger the very same access in a nested fashion, and we
> >> will have to detect this to avoid locking up QEMU (locking up the guest
> >> might be OK).
> > 
> > An async version of c_p_m_rw() will just cause a completion to bounce
> > around, consuming cpu but not deadlocking anything.  If we can keep a
> > count of the bounces, we might be able to stall it indefinitely or at
> > least ratelimit it.
> > 
> 
> Another option is to require all users of c_p_m_rw() and related to use
> a coroutine or thread.  That makes the programming easier (but still
> required a revalidation after the dropped lock).

Unless i am misunderstanding this thread, the iothread flow section of
http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg04315.html
contains possible solutions for this.

Basically use trylock() and if it fails, then choose a bailout option
(which there are more than one possibilities listed).
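[Editorial note: a sketch of the trylock-and-bail-out pattern from the linked thread, with illustrative names. Attempt the big lock; if it is already held, possibly further up our own call chain, run a bailout path such as deferring the access, instead of blocking.]

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

static bool dispatch_mmio(void (*work)(void), void (*bailout)(void))
{
    if (pthread_mutex_trylock(&big_lock) != 0) {
        bailout();                    /* lock busy: choose a bailout option,
                                       * e.g. queue the access for the iothread */
        return false;
    }
    work();                           /* lock acquired: do the access now */
    pthread_mutex_unlock(&big_lock);
    return true;
}

/* sample work/bailout callbacks counting invocations */
static int done_count, deferred_count;
static void do_work(void) { done_count++; }
static void defer(void)   { deferred_count++; }
```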
pingfan liu - Sept. 13, 2012, 6:55 a.m.
On Tue, Sep 11, 2012 at 10:54 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Tue, Sep 11, 2012 at 03:41:15PM +0300, Avi Kivity wrote:
>> On 09/11/2012 03:24 PM, Avi Kivity wrote:
>> > On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>> >> On 2012-09-11 11:44, liu ping fan wrote:
>> >>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>> >>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>> >>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>> >>>>>
>> >>>>> The func call chain can suffer from recursively hold
>> >>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>> >>>>> lock depth.
>> >>>>
>> >>>> What is the root cause?  io handlers initiating I/O?
>> >>>>
>> >>> cpu_physical_memory_rw() can be called nested, and when called, it can
>> >>> be protected by no-lock/device lock/ big-lock.
>> >>> I think without big lock, io-dispatcher should face the same issue.
>> >>> As to main-loop, have not carefully consider, but at least, dma-helper
>> >>> will call cpu_physical_memory_rw() with big-lock.
>> >>
>> >> That is our core problem: inconsistent invocation of existing services
>> >> /wrt locking. For portio, I was lucky that there is no nesting and I was
>> >> able to drop the big lock around all (x86) call sites. But MMIO is way
>> >> more tricky due to DMA nesting.
>> >
>> > Maybe we need to switch to a continuation style.  Instead of expecting
>> > cpu_physical_memory_rw() to complete synchronously, it becomes an
>> > asynchronous call and you provide it with a completion.  That means
>> > devices which use it are forced to drop the lock in between.  Block and
>> > network clients will be easy to convert since they already use APIs that
>> > drop the lock (except for accessing the descriptors).
>> >
>> >> We could try to introduce a different version of cpu_physical_memory_rw,
>> >> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
>> >> request can trigger the very same access in a nested fashion, and we
>> >> will have to detect this to avoid locking up QEMU (locking up the guest
>> >> might be OK).
>> >
>> > An async version of c_p_m_rw() will just cause a completion to bounce
>> > around, consuming cpu but not deadlocking anything.  If we can keep a
>> > count of the bounces, we might be able to stall it indefinitely or at
>> > least ratelimit it.
>> >
>>
>> Another option is to require all users of c_p_m_rw() and related to use
>> a coroutine or thread.  That makes the programming easier (but still
>> required a revalidation after the dropped lock).
>
> Unless i am misunderstanding this thread, the iothread flow section of
> http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg04315.html
> contains possible solutions for this.
>
> Basically use trylock() and if it fails, then choose a bailout option
> (which there are more than one possibilities listed).
>
I think that in that thread, the original goal of trylock() is to handle
a lock held by another thread, but here we want to resolve nested locking.
Still, yes, there is a similar point: solving it by going async if we do
not allow nested locks.

Regards,
pingfan
>
pingfan liu - Sept. 13, 2012, 6:55 a.m.
On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/11/2012 03:24 PM, Avi Kivity wrote:
>> On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>>> On 2012-09-11 11:44, liu ping fan wrote:
>>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>>
>>>>>> The func call chain can suffer from recursively hold
>>>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>>>>> lock depth.
>>>>>
>>>>> What is the root cause?  io handlers initiating I/O?
>>>>>
>>>> cpu_physical_memory_rw() can be called nested, and when called, it can
>>>> be protected by no-lock/device lock/ big-lock.
>>>> I think without big lock, io-dispatcher should face the same issue.
>>>> As to main-loop, have not carefully consider, but at least, dma-helper
>>>> will call cpu_physical_memory_rw() with big-lock.
>>>
>>> That is our core problem: inconsistent invocation of existing services
>>> /wrt locking. For portio, I was lucky that there is no nesting and I was
>>> able to drop the big lock around all (x86) call sites. But MMIO is way
>>> more tricky due to DMA nesting.
>>
>> Maybe we need to switch to a continuation style.  Instead of expecting
>> cpu_physical_memory_rw() to complete synchronously, it becomes an
>> asynchronous call and you provide it with a completion.  That means
>> devices which use it are forced to drop the lock in between.  Block and
>> network clients will be easy to convert since they already use APIs that
>> drop the lock (except for accessing the descriptors).
>>
>>> We could try to introduce a different version of cpu_physical_memory_rw,
>>> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
>>> request can trigger the very same access in a nested fashion, and we
>>> will have to detect this to avoid locking up QEMU (locking up the guest
>>> might be OK).
>>
>> An async version of c_p_m_rw() will just cause a completion to bounce
>> around, consuming cpu but not deadlocking anything.  If we can keep a
>> count of the bounces, we might be able to stall it indefinitely or at
>> least ratelimit it.
>>
>
> Another option is to require all users of c_p_m_rw() and related to use
> a coroutine or thread.  That makes the programming easier (but still
> required a revalidation after the dropped lock).
>
For the nested cpu_physical_memory_rw(), we change its internals but
keep its synchronous API as it is. (Wrap the current
cpu_physical_memory_rw() into cpu_physical_memory_rw_internal().)


LOCK()  // can be device or big lock or both, depend on caller.
..............
cpu_physical_memory_rw()
{
   UNLOCK() //unlock all the locks
   queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
// cpu_physical_memory_rw_internal can take lock(device,biglock) again
   wait_for_completion(completion)
   LOCK()
}
..................
UNLOCK()

Although cpu_physical_memory_rw_internal() would then be free to take
locks, there is a precondition: we still need to trace the lock stack
taken by cpu_physical_memory_rw(), so that it can return to the caller
correctly. Is that OK?

Regards,
pingfan

> --
> error compiling committee.c: too many arguments to function
Avi Kivity - Sept. 13, 2012, 8:19 a.m.
On 09/13/2012 09:55 AM, liu ping fan wrote:
> On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/11/2012 03:24 PM, Avi Kivity wrote:
>>> On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>>>> On 2012-09-11 11:44, liu ping fan wrote:
>>>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>>>
>>>>>>> The func call chain can suffer from recursively hold
>>>>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>>>>>> lock depth.
>>>>>>
>>>>>> What is the root cause?  io handlers initiating I/O?
>>>>>>
>>>>> cpu_physical_memory_rw() can be called nested, and when called, it can
>>>>> be protected by no-lock/device lock/ big-lock.
>>>>> I think without big lock, io-dispatcher should face the same issue.
>>>>> As to main-loop, have not carefully consider, but at least, dma-helper
>>>>> will call cpu_physical_memory_rw() with big-lock.
>>>>
>>>> That is our core problem: inconsistent invocation of existing services
>>>> /wrt locking. For portio, I was lucky that there is no nesting and I was
>>>> able to drop the big lock around all (x86) call sites. But MMIO is way
>>>> more tricky due to DMA nesting.
>>>
>>> Maybe we need to switch to a continuation style.  Instead of expecting
>>> cpu_physical_memory_rw() to complete synchronously, it becomes an
>>> asynchronous call and you provide it with a completion.  That means
>>> devices which use it are forced to drop the lock in between.  Block and
>>> network clients will be easy to convert since they already use APIs that
>>> drop the lock (except for accessing the descriptors).
>>>
>>>> We could try to introduce a different version of cpu_physical_memory_rw,
>>>> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
>>>> request can trigger the very same access in a nested fashion, and we
>>>> will have to detect this to avoid locking up QEMU (locking up the guest
>>>> might be OK).
>>>
>>> An async version of c_p_m_rw() will just cause a completion to bounce
>>> around, consuming cpu but not deadlocking anything.  If we can keep a
>>> count of the bounces, we might be able to stall it indefinitely or at
>>> least ratelimit it.
>>>
>>
>> Another option is to require all users of c_p_m_rw() and related to use
>> a coroutine or thread.  That makes the programming easier (but still
>> required a revalidation after the dropped lock).
>>
> For the nested cpu_physical_memory_rw(), we change it internal but
> keep it sync API as it is.  (Wrapper current cpu_physical_memory_rw()
> into cpu_physical_memory_rw_internal() )
> 
> 
> LOCK()  // can be device or big lock or both, depend on caller.
> ..............
> cpu_physical_memory_rw()
> {
>    UNLOCK() //unlock all the locks
>    queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
> // cpu_physical_memory_rw_internal can take lock(device,biglock) again
>    wait_for_completion(completion)
>    LOCK()
> }
> ..................
> UNLOCK()
> 

This is dangerous.  The caller expects to hold the lock across the call,
and will not re-validate its state.

> Although cpu_physical_memory_rw_internal() can be freed to use lock,
> but with precondition. -- We still need to trace lock stack taken by
> cpu_physical_memory_rw(), so that it can return to caller correctly.
> Is that OK?

I'm convinced that we need a recursive lock if we don't convert
everything at once.

So how about:

- make bql a recursive lock (use pthreads, don't invent our own)
- change c_p_m_rw() calling convention to "caller may hold the BQL, but
no device lock"

this allows devices to DMA each other.  Unconverted device models will
work as they used to.  Converted device models will need to drop the
device lock, re-acquire it, and revalidate any state.  That will cause
them to become more correct wrt recursive invocation.
pingfan liu - Sept. 17, 2012, 2:24 a.m.
On Thu, Sep 13, 2012 at 4:19 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/13/2012 09:55 AM, liu ping fan wrote:
>> On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity <avi@redhat.com> wrote:
>>> On 09/11/2012 03:24 PM, Avi Kivity wrote:
>>>> On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>>>>> On 2012-09-11 11:44, liu ping fan wrote:
>>>>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>>>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>>>>
>>>>>>>> The func call chain can suffer from recursively hold
>>>>>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>>>>>>> lock depth.
>>>>>>>
>>>>>>> What is the root cause?  io handlers initiating I/O?
>>>>>>>
>>>>>> cpu_physical_memory_rw() can be called nested, and when called, it can
>>>>>> be protected by no-lock/device lock/ big-lock.
>>>>>> I think without big lock, io-dispatcher should face the same issue.
>>>>>> As to main-loop, have not carefully consider, but at least, dma-helper
>>>>>> will call cpu_physical_memory_rw() with big-lock.
>>>>>
>>>>> That is our core problem: inconsistent invocation of existing services
>>>>> /wrt locking. For portio, I was lucky that there is no nesting and I was
>>>>> able to drop the big lock around all (x86) call sites. But MMIO is way
>>>>> more tricky due to DMA nesting.
>>>>
>>>> Maybe we need to switch to a continuation style.  Instead of expecting
>>>> cpu_physical_memory_rw() to complete synchronously, it becomes an
>>>> asynchronous call and you provide it with a completion.  That means
>>>> devices which use it are forced to drop the lock in between.  Block and
>>>> network clients will be easy to convert since they already use APIs that
>>>> drop the lock (except for accessing the descriptors).
>>>>
>>>>> We could try to introduce a different version of cpu_physical_memory_rw,
>>>>> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
>>>>> request can trigger the very same access in a nested fashion, and we
>>>>> will have to detect this to avoid locking up QEMU (locking up the guest
>>>>> might be OK).
>>>>
>>>> An async version of c_p_m_rw() will just cause a completion to bounce
>>>> around, consuming cpu but not deadlocking anything.  If we can keep a
>>>> count of the bounces, we might be able to stall it indefinitely or at
>>>> least ratelimit it.
>>>>
>>>
>>> Another option is to require all users of c_p_m_rw() and related to use
>>> a coroutine or thread.  That makes the programming easier (but still
>>> required a revalidation after the dropped lock).
>>>
>> For the nested cpu_physical_memory_rw(), we change it internal but
>> keep it sync API as it is.  (Wrapper current cpu_physical_memory_rw()
>> into cpu_physical_memory_rw_internal() )
>>
>>
>> LOCK()  // can be device or big lock or both, depend on caller.
>> ..............
>> cpu_physical_memory_rw()
>> {
>>    UNLOCK() //unlock all the locks
>>    queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
>> // cpu_physical_memory_rw_internal can take lock(device,biglock) again
>>    wait_for_completion(completion)
>>    LOCK()
>> }
>> ..................
>> UNLOCK()
>>
>
> This is dangerous.  The caller expects to hold the lock across the call,
> and will not re-validate its state.
>
>> Although cpu_physical_memory_rw_internal() can be freed to use lock,
>> but with precondition. -- We still need to trace lock stack taken by
>> cpu_physical_memory_rw(), so that it can return to caller correctly.
>> Is that OK?
>
> I'm convinced that we need a recursive lock if we don't convert
> everything at once.
>
> So how about:
>
> - make bql a recursive lock (use pthreads, don't invent our own)
> - change c_p_m_rw() calling convention to "caller may hold the BQL, but
> no device lock"
>
> this allows devices to DMA each other.  Unconverted device models will
> work as they used to.  Converted device models will need to drop the
> device lock, re-acquire it, and revalidate any state.  That will cause

I think devices DMAing to each other corner us into unavoidable nested
locking. Also, to avoid deadlock caused by device lock ordering, we
should drop the current device lock before acquiring another one. The
only small divergence is about the "revalidate": do we need to roll
back? Or, for a converted device, we can just tag it with a busy flag,
i.e. check&set the busy flag at the entry of the device's
mmio-dispatch. Then, when the device's lock is re-acquired, the
device's state is intact.

Does anybody have another suggestion?

Thanks and regards,
pingfan
> them to become more correct wrt recursive invocation.
>
>
> --
> error compiling committee.c: too many arguments to function
Peter A. G. Crosthwaite - Sept. 19, 2012, 4:25 a.m.
Ping for PMM,

This is the root cause of your block on the SDHCI series - this is a
discussion on resolving bogus infinite-looping DMA. For current
participants in this discussion, here's our thread on the same topic
over in SD land:

http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html

With the findings here and the acknowledgement that this is a general
problem, we either need to declare this issue out of scope for SDHCI,
or work with these guys (in the immediate future) on the DMA infinite
loop problem flagged here. I don't mind if SDHCI+ADMA is the guinea pig
here, but can we get a decisive plan for going forward with this issue
if it is going to continue to block SDHCI?

Thanks to Igor for identifying the overlap between these discussions.

Regards,
Peter

On Tue, Sep 11, 2012 at 10:39 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/11/2012 03:35 PM, Jan Kiszka wrote:
>> On 2012-09-11 14:30, Avi Kivity wrote:
>>>>>>>> The other option is to keep DMA requests issued by devices synchronous
>>>>>>>> but let them fail if we are about to lock up. Still requires changes,
>>>>>>>> but is probably more comprehensible for device model developers.
>>>>>>>
>>>>>>> How do you handle failures?
>>>>>>
>>>>>> By not sending a network frame or dropping an incoming one, e.g., and
>>>>>> signaling this in a device specific way.
>>>>>
>>>>> Doesn't work for block devices.
>>>>
>>>> Because the block layer API cannot report errors to the devices? What
>>>> happens if there is a real I/O error?
>>>
>>> We report real I/O errors.  But if we report a transient error due to
>>> some lock being taken as an I/O error, the guest will take unwarranted
>>> action.
>>>
>>> If the errors are not expected in normal operation (we can avoid them if
>>> all DMA is to real RAM) then this is an acceptable solution.  Still it
>>> generates a lot of rarely used code paths and so isn't very good for
>>> security.
>>
>> I'm not talking about transient errors. Recursions like this are always
>> guest configuration errors that would cause real devices to lock up or
>> timeout as well. This is practically about avoiding that a malicious
>> guest can lock up QEMU, leaving it inoperative even for management tools.
>
> Ok.  That's more palatable.  We don't even have to report an error in
> that case, we can just perform the operation incorrectly (as I'm sure
> real hardware will), log an error, and continue.
>
>
> --
> error compiling committee.c: too many arguments to function
>
Edgar Iglesias - Sept. 19, 2012, 4:32 a.m.
On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
> Ping for PMM,
> 
> This is the root case of your block on the SDHCI series - this is a
> discussion on resolution to bogus infinite looping DMA. For current
> participants in this discussion, heres our thread on the same topic
> over in SD land:
> 
> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
> 
> With the findings here and acknowledgement that this is a general
> problem, we either need to declare this issue of scope for SDHCI, or
> work with these guys (in the immediate future) on the DMA infinite
> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
> here, but can we get a decisive plan for going forward with this issue
> if it is going to continue to block SDHCI.
> 
> Thanks to Igor for identifying the overlap between these discussions.

Hi,

A couple of dumb questions.

What is the reason for the blocker? That a possible guest DoS is worse
than no functionality at all?

Can't you put the DMA/IO processing into the background?

What's the difference between setting up bad descriptors and writing a
while (1) loop in the guest CPU?

Cheers,
E
Peter A. G. Crosthwaite - Sept. 19, 2012, 4:40 a.m.
On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
<edgar.iglesias@gmail.com> wrote:
> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
>> Ping for PMM,
>>
>> This is the root case of your block on the SDHCI series - this is a
>> discussion on resolution to bogus infinite looping DMA. For current
>> participants in this discussion, heres our thread on the same topic
>> over in SD land:
>>
>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
>>
>> With the findings here and acknowledgement that this is a general
>> problem, we either need to declare this issue of scope for SDHCI, or
>> work with these guys (in the immediate future) on the DMA infinite
>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
>> here, but can we get a decisive plan for going forward with this issue
>> if it is going to continue to block SDHCI.
>>
>> Thanks to Igor for identifying the overlap between these discussions.
>
> Hi,
>
> A couple of dumb questions.
>
> What is the reason for the blocker? that possible guest dos is worse
> than no functionality at all?
>
> Can't you put the DMA/IO processing into the background?

I don't know a way to do asynchronous DMA, unless I am missing
something here. So what happens is we have a device that walks an SG
list synchronously while holding all the locks and stuff being
discussed here. If that SG list loops infinitely, we crash.

>
> what's the diff between setup of bad descriptors and  writing a
> while (1) loop in the guest CPU?
>

A while(1) in the guest doesn't freeze all of QEMU (the monitor still
works and such), whereas the DMA ops going in circles do, as the locks
are held by an infinite loop.

Regards,
peter

> Cheers,
> E
Avi Kivity - Sept. 19, 2012, 7:55 a.m.
On 09/19/2012 07:40 AM, Peter Crosthwaite wrote:
> On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
> <edgar.iglesias@gmail.com> wrote:
>> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
>>> Ping for PMM,
>>>
>>> This is the root case of your block on the SDHCI series - this is a
>>> discussion on resolution to bogus infinite looping DMA. For current
>>> participants in this discussion, heres our thread on the same topic
>>> over in SD land:
>>>
>>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
>>>
>>> With the findings here and acknowledgement that this is a general
>>> problem, we either need to declare this issue of scope for SDHCI, or
>>> work with these guys (in the immediate future) on the DMA infinite
>>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
>>> here, but can we get a decisive plan for going forward with this issue
>>> if it is going to continue to block SDHCI.
>>>
>>> Thanks to Igor for identifying the overlap between these discussions.
>>
>> Hi,
>>
>> A couple of dumb questions.
>>
>> What is the reason for the blocker? that possible guest dos is worse
>> than no functionality at all?
>>
>> Can't you put the DMA/IO processing into the background?
> 
> I dont know a way do do asynchronous DMA unless I am missing something
> here.

You could schedule a bottom half and do the accesses from there.  Solves
nothing though.

> So what happens is we have a device that walks a SG list
> synchronously while holding all the locks and stuff being discussed
> here. If that SG list infinite loops then crash.

Did you mean loop, or recursion into the memory region that initiates DMA?

As this problem already exists I don't think new device models need
block on it.

Maybe a simple fix is:

void my_device_write(...)
{
   if (s->dma_in_progress) {
       /* recursive self-access via DMA: drop it */
       return;
   }
   ...
   s->dma_in_progress = true;
   cpu_physical_memory_rw(...);
   s->dma_in_progress = false;
}

and we could try to figure out how to move this into common code.
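One shape the "common code" version of the guard above might take is a small enter/leave helper pair around a per-device flag. This is a sketch only; `DevState`, `dma_guard_enter`, and the counter field are all invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state carrying a re-entrancy flag. */
typedef struct {
    bool dma_in_progress;
    int mmio_writes_handled;
} DevState;

/* Common-code guard: returns false if a DMA issued by this device is
 * already in flight, i.e. this MMIO access is a recursive self-access
 * and should be dropped (or logged) rather than serviced. */
static bool dma_guard_enter(DevState *s)
{
    if (s->dma_in_progress) {
        return false;
    }
    s->dma_in_progress = true;
    return true;
}

static void dma_guard_leave(DevState *s)
{
    s->dma_in_progress = false;
}

/* Example MMIO handler using the guard. */
void my_device_write(DevState *s)
{
    if (!dma_guard_enter(s)) {
        return;                   /* drop the recursive access */
    }
    s->mmio_writes_handled++;     /* stands in for the real DMA work */
    dma_guard_leave(s);
}
```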
Jan Kiszka - Sept. 19, 2012, 7:57 a.m.
On 2012-09-19 06:40, Peter Crosthwaite wrote:
> On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
> <edgar.iglesias@gmail.com> wrote:
>> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
>>> Ping for PMM,
>>>
>>> This is the root case of your block on the SDHCI series - this is a
>>> discussion on resolution to bogus infinite looping DMA. For current
>>> participants in this discussion, heres our thread on the same topic
>>> over in SD land:
>>>
>>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
>>>
>>> With the findings here and acknowledgement that this is a general
>>> problem, we either need to declare this issue of scope for SDHCI, or
>>> work with these guys (in the immediate future) on the DMA infinite
>>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
>>> here, but can we get a decisive plan for going forward with this issue
>>> if it is going to continue to block SDHCI.
>>>
>>> Thanks to Igor for identifying the overlap between these discussions.
>>
>> Hi,
>>
>> A couple of dumb questions.
>>
>> What is the reason for the blocker? that possible guest dos is worse
>> than no functionality at all?
>>
>> Can't you put the DMA/IO processing into the background?
> 
> I dont know a way do do asynchronous DMA unless I am missing something
> here. So what happens is we have a device that walks a SG list
> synchronously while holding all the locks and stuff being discussed
> here. If that SG list infinite loops then crash.

We could switch to async DMA, but most DMA-capable devices aren't
prepared for rescheduling points in the middle of their code. This will
require quite a bit of code review.

> 
>>
>> what's the diff between setup of bad descriptors and  writing a
>> while (1) loop in the guest CPU?
>>
> 
> While(1) in guest doesnt freeze all of QEMU (monitor still works and
> stuff), wheras the DMA ops going in circles does, as locks are taken
> by an infinite loop.

That's the point. If we loop, we need at least a way to stop it by
issuing a device or system reset. But I tend to think that detecting
loops and failing/ignoring requests is easier to implement.

Jan
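The "detect loops and fail the request" alternative can be sketched as a bounded descriptor walk: cap the number of descriptors processed per request and abort when the cap is hit. The descriptor layout and names below are invented for illustration:

```c
#include <assert.h>

/* Invented SG descriptor: just a link to the next descriptor,
 * with -1 terminating the chain. */
typedef struct {
    int next;   /* index of next descriptor, or -1 */
} SGDesc;

#define SG_MAX_DESCS 1024   /* arbitrary cap; real HW queues are finite */

/* Walk the chain starting at 'head'; return the number of descriptors
 * processed, or -1 if the cap is exceeded, which almost certainly
 * means the guest built a looping chain. */
int sg_walk(const SGDesc *descs, int head)
{
    int count = 0;
    for (int i = head; i != -1; i = descs[i].next) {
        if (++count > SG_MAX_DESCS) {
            return -1;  /* fail the request instead of looping forever */
        }
    }
    return count;
}
```

On -1 the device model can log the error and ignore the request, keeping QEMU (and the monitor) responsive, which is the property the thread is after.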
Avi Kivity - Sept. 19, 2012, 8:01 a.m.
On 09/17/2012 05:24 AM, liu ping fan wrote:
> On Thu, Sep 13, 2012 at 4:19 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/13/2012 09:55 AM, liu ping fan wrote:
>>> On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity <avi@redhat.com> wrote:
>>>> On 09/11/2012 03:24 PM, Avi Kivity wrote:
>>>>> On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>>>>>> On 2012-09-11 11:44, liu ping fan wrote:
>>>>>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>>>>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>>>>>
>>>>>>>>> The func call chain can suffer from recursively hold
>>>>>>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>>>>>>>> lock depth.
>>>>>>>>
>>>>>>>> What is the root cause?  io handlers initiating I/O?
>>>>>>>>
>>>>>>> cpu_physical_memory_rw() can be called nested, and when called, it can
>>>>>>> be protected by no-lock/device lock/ big-lock.
>>>>>>> I think without big lock, io-dispatcher should face the same issue.
>>>>>>> As to main-loop, have not carefully consider, but at least, dma-helper
>>>>>>> will call cpu_physical_memory_rw() with big-lock.
>>>>>>
>>>>>> That is our core problem: inconsistent invocation of existing services
>>>>>> /wrt locking. For portio, I was lucky that there is no nesting and I was
>>>>>> able to drop the big lock around all (x86) call sites. But MMIO is way
>>>>>> more tricky due to DMA nesting.
>>>>>
>>>>> Maybe we need to switch to a continuation style.  Instead of expecting
>>>>> cpu_physical_memory_rw() to complete synchronously, it becomes an
>>>>> asynchronous call and you provide it with a completion.  That means
>>>>> devices which use it are forced to drop the lock in between.  Block and
>>>>> network clients will be easy to convert since they already use APIs that
>>>>> drop the lock (except for accessing the descriptors).
>>>>>
>>>>>> We could try to introduce a different version of cpu_physical_memory_rw,
>>>>>> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
>>>>>> request can trigger the very same access in a nested fashion, and we
>>>>>> will have to detect this to avoid locking up QEMU (locking up the guest
>>>>>> might be OK).
>>>>>
>>>>> An async version of c_p_m_rw() will just cause a completion to bounce
>>>>> around, consuming cpu but not deadlocking anything.  If we can keep a
>>>>> count of the bounces, we might be able to stall it indefinitely or at
>>>>> least ratelimit it.
>>>>>
>>>>
>>>> Another option is to require all users of c_p_m_rw() and related to use
>>>> a coroutine or thread.  That makes the programming easier (but still
>>>> required a revalidation after the dropped lock).
>>>>
>>> For the nested cpu_physical_memory_rw(), we change it internal but
>>> keep it sync API as it is.  (Wrapper current cpu_physical_memory_rw()
>>> into cpu_physical_memory_rw_internal() )
>>>
>>>
>>> LOCK()  // can be device or big lock or both, depend on caller.
>>> ..............
>>> cpu_physical_memory_rw()
>>> {
>>>    UNLOCK() //unlock all the locks
>>>    queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
>>> // cpu_physical_memory_rw_internal can take lock(device,biglock) again
>>>    wait_for_completion(completion)
>>>    LOCK()
>>> }
>>> ..................
>>> UNLOCK()
>>>
>>
>> This is dangerous.  The caller expects to hold the lock across the call,
>> and will not re-validate its state.
>>
>>> Although cpu_physical_memory_rw_internal() can be freed to use lock,
>>> but with precondition. -- We still need to trace lock stack taken by
>>> cpu_physical_memory_rw(), so that it can return to caller correctly.
>>> Is that OK?
>>
>> I'm convinced that we need a recursive lock if we don't convert
>> everything at once.
>>
>> So how about:
>>
>> - make bql a recursive lock (use pthreads, don't invent our own)
>> - change c_p_m_rw() calling convention to "caller may hold the BQL, but
>> no device lock"
>>
>> this allows devices to DMA each other.  Unconverted device models will
>> work as they used to.  Converted device models will need to drop the
>> device lock, re-acquire it, and revalidate any state.  That will cause
> 
> I think we are cornered by devices to DMA each other, and it raises
> the unavoidable nested lock. Also to avoid deadlock caused by device's
> lock sequence, we should drop the current device lock  to acquire
> another one.   The only little  diverge is about the "revalidate", do
> we need to rollback?

It basically means you can't hold contents of device state in local
variables.  You need to read everything again from the device.  That
includes things like DMA enable bits.

(btw: to we respect PCI_COMMAND_MASTER? it seems that we do not.  This
is a trivial example of an iommu, we should get that going).

> Or for converted device, we can just tag it a
> busy flag, that is check&set busy flag at the entry of device's
> mmio-dispatch.  So  when re-acquire device's lock, the device's state
> is intact.

The state can be changed by a parallel access to another register, which
is valid.

> 
> Has anybody other suggestion?
pingfan liu - Sept. 19, 2012, 8:36 a.m.
On Wed, Sep 19, 2012 at 4:01 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/17/2012 05:24 AM, liu ping fan wrote:
>> On Thu, Sep 13, 2012 at 4:19 PM, Avi Kivity <avi@redhat.com> wrote:
>>> On 09/13/2012 09:55 AM, liu ping fan wrote:
>>>> On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity <avi@redhat.com> wrote:
>>>>> On 09/11/2012 03:24 PM, Avi Kivity wrote:
>>>>>> On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>>>>>>> On 2012-09-11 11:44, liu ping fan wrote:
>>>>>>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>>>>>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>>>>>>
>>>>>>>>>> The func call chain can suffer from recursively hold
>>>>>>>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>>>>>>>>> lock depth.
>>>>>>>>>
>>>>>>>>> What is the root cause?  io handlers initiating I/O?
>>>>>>>>>
>>>>>>>> cpu_physical_memory_rw() can be called nested, and when called, it can
>>>>>>>> be protected by no-lock/device lock/ big-lock.
>>>>>>>> I think without big lock, io-dispatcher should face the same issue.
>>>>>>>> As to main-loop, have not carefully consider, but at least, dma-helper
>>>>>>>> will call cpu_physical_memory_rw() with big-lock.
>>>>>>>
>>>>>>> That is our core problem: inconsistent invocation of existing services
>>>>>>> /wrt locking. For portio, I was lucky that there is no nesting and I was
>>>>>>> able to drop the big lock around all (x86) call sites. But MMIO is way
>>>>>>> more tricky due to DMA nesting.
>>>>>>
>>>>>> Maybe we need to switch to a continuation style.  Instead of expecting
>>>>>> cpu_physical_memory_rw() to complete synchronously, it becomes an
>>>>>> asynchronous call and you provide it with a completion.  That means
>>>>>> devices which use it are forced to drop the lock in between.  Block and
>>>>>> network clients will be easy to convert since they already use APIs that
>>>>>> drop the lock (except for accessing the descriptors).
>>>>>>
>>>>>>> We could try to introduce a different version of cpu_physical_memory_rw,
>>>>>>> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
>>>>>>> request can trigger the very same access in a nested fashion, and we
>>>>>>> will have to detect this to avoid locking up QEMU (locking up the guest
>>>>>>> might be OK).
>>>>>>
>>>>>> An async version of c_p_m_rw() will just cause a completion to bounce
>>>>>> around, consuming cpu but not deadlocking anything.  If we can keep a
>>>>>> count of the bounces, we might be able to stall it indefinitely or at
>>>>>> least ratelimit it.
>>>>>>
>>>>>
>>>>> Another option is to require all users of c_p_m_rw() and related to use
>>>>> a coroutine or thread.  That makes the programming easier (but still
>>>>> required a revalidation after the dropped lock).
>>>>>
>>>> For the nested cpu_physical_memory_rw(), we change it internal but
>>>> keep it sync API as it is.  (Wrapper current cpu_physical_memory_rw()
>>>> into cpu_physical_memory_rw_internal() )
>>>>
>>>>
>>>> LOCK()  // can be device or big lock or both, depend on caller.
>>>> ..............
>>>> cpu_physical_memory_rw()
>>>> {
>>>>    UNLOCK() //unlock all the locks
>>>>    queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
>>>> // cpu_physical_memory_rw_internal can take lock(device,biglock) again
>>>>    wait_for_completion(completion)
>>>>    LOCK()
>>>> }
>>>> ..................
>>>> UNLOCK()
>>>>
>>>
>>> This is dangerous.  The caller expects to hold the lock across the call,
>>> and will not re-validate its state.
>>>
>>>> Although cpu_physical_memory_rw_internal() can be freed to use lock,
>>>> but with precondition. -- We still need to trace lock stack taken by
>>>> cpu_physical_memory_rw(), so that it can return to caller correctly.
>>>> Is that OK?
>>>
>>> I'm convinced that we need a recursive lock if we don't convert
>>> everything at once.
>>>
>>> So how about:
>>>
>>> - make bql a recursive lock (use pthreads, don't invent our own)
>>> - change c_p_m_rw() calling convention to "caller may hold the BQL, but
>>> no device lock"
>>>
>>> this allows devices to DMA each other.  Unconverted device models will
>>> work as they used to.  Converted device models will need to drop the
>>> device lock, re-acquire it, and revalidate any state.  That will cause
>>
>> I think we are cornered by devices to DMA each other, and it raises
>> the unavoidable nested lock. Also to avoid deadlock caused by device's
>> lock sequence, we should drop the current device lock  to acquire
>> another one.   The only little  diverge is about the "revalidate", do
>> we need to rollback?
>
> It basically means you can't hold contents of device state in local
> variables.  You need to read everything again from the device.  That
> includes things like DMA enable bits.
>
I think that re-reading everything from the device cannot work.
Suppose the following scenario: the device's state change spans a
series of internal registers (say partA+partB determine partC+partD).
After partA has changed, the device's lock is dropped. At this point
another access to the device will read partA+partB to determine C+D,
but partB has not been correctly updated yet, so C+D may be decided by
a broken context and be wrong.

> (btw: to we respect PCI_COMMAND_MASTER? it seems that we do not.  This
> is a trivial example of an iommu, we should get that going).
>
>> Or for converted device, we can just tag it a
>> busy flag, that is check&set busy flag at the entry of device's
>> mmio-dispatch.  So  when re-acquire device's lock, the device's state
>> is intact.
>
> The state can be changed by a parallel access to another register, which
> is valid.
>
Do you mean that the device can be accessed in parallel? But how? We
use the device's lock.
What my suggestion is:

lock();
if (test_and_set(dev->busy)) {
    unlock();
    return;
}
change device registers;
do other things, including calling c_p_m_rw();  /* here the lock is
dropped, but test_and_set() still guards the device */
clear(dev->busy);
unlock();

So the changing of device registers is protected, and unbreakable.

Regards,
pingfan
>>
>> Has anybody other suggestion?
>
>
> --
> error compiling committee.c: too many arguments to function
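Since the check&set flag above must stay valid across the window where the device lock is dropped around c_p_m_rw(), the entry test wants to be atomic. A sketch of that entry test with C11 atomics (the `dev_*` names are hypothetical):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical device busy flag, per the check&set-at-dispatch-entry
 * idea: the flag stays set across the window where the device lock is
 * dropped around c_p_m_rw(), refusing nested self-access. */
static atomic_bool dev_busy;

/* Returns true if we won the flag and may proceed to touch registers. */
bool dev_try_enter(void)
{
    bool expected = false;
    return atomic_compare_exchange_strong(&dev_busy, &expected, true);
}

void dev_leave(void)
{
    atomic_store(&dev_busy, false);
}
```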
Avi Kivity - Sept. 19, 2012, 9:05 a.m.
On 09/19/2012 11:36 AM, liu ping fan wrote:
>>
>> It basically means you can't hold contents of device state in local
>> variables.  You need to read everything again from the device.  That
>> includes things like DMA enable bits.
>>
> I think that read everything again from the device can not work.
> Suppose the following scene: If the device's state contains the change
> of a series of internal registers (supposing partA+partB;
> partC+partD), after partA changed, the device's lock is broken.  At
> this point, another access to this device, it will work on partA+partB
> to determine C+D, but since partB is not correctly updated yet. So C+D
> may be decided by broken context and be wrong.

That's the guest's problem.  Note it could have happened even before the
change, since the writes to A/B/C/D are unordered wrt the DMA.

> 
>> (btw: to we respect PCI_COMMAND_MASTER? it seems that we do not.  This
>> is a trivial example of an iommu, we should get that going).
>>
>>> Or for converted device, we can just tag it a
>>> busy flag, that is check&set busy flag at the entry of device's
>>> mmio-dispatch.  So  when re-acquire device's lock, the device's state
>>> is intact.
>>
>> The state can be changed by a parallel access to another register, which
>> is valid.
>>
> Do you mean that the device can be accessed in parallel? But how? we
> use device's lock.

Some DMA is asynchronous already, like networking and IDE DMA.  So we
drop the big lock while doing it.  With the change to drop device locks
around c_p_m_rw(), we have that problem everywhere.

> What my suggestion is:
> lock();
> set_and_test(dev->busy);
> if busy
>   unlock and return;
> changing device registers;
> do other things including calling to c_p_m_rw() //here,lock broken,
> but set_and_test() works
> clear(dev->busy);
> unlock();
> 
> So changing device registers is protected, and unbreakable.

But the changes may be legitimate.  Maybe you're writing to a
completely unrelated register from a different vcpu, and now that
write is lost.
Edgar Iglesias - Sept. 19, 2012, 11:46 a.m.
On Wed, Sep 19, 2012 at 10:55:30AM +0300, Avi Kivity wrote:
> On 09/19/2012 07:40 AM, Peter Crosthwaite wrote:
> > On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
> > <edgar.iglesias@gmail.com> wrote:
> >> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
> >>> Ping for PMM,
> >>>
> >>> This is the root case of your block on the SDHCI series - this is a
> >>> discussion on resolution to bogus infinite looping DMA. For current
> >>> participants in this discussion, heres our thread on the same topic
> >>> over in SD land:
> >>>
> >>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
> >>>
> >>> With the findings here and acknowledgement that this is a general
> >>> problem, we either need to declare this issue of scope for SDHCI, or
> >>> work with these guys (in the immediate future) on the DMA infinite
> >>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
> >>> here, but can we get a decisive plan for going forward with this issue
> >>> if it is going to continue to block SDHCI.
> >>>
> >>> Thanks to Igor for identifying the overlap between these discussions.
> >>
> >> Hi,
> >>
> >> A couple of dumb questions.
> >>
> >> What is the reason for the blocker? that possible guest dos is worse
> >> than no functionality at all?
> >>
> >> Can't you put the DMA/IO processing into the background?
> > 
> > I dont know a way do do asynchronous DMA unless I am missing something
> > here.
> 
> You could schedule a bottom half and do the accesses from there.  Solves
> nothing though.
> 
> > So what happens is we have a device that walks a SG list
> > synchronously while holding all the locks and stuff being discussed
> > here. If that SG list infinite loops then crash.
> 
> Did you mean loop, or recursion into the memory region that initiates DMA?

I think we were discussing the loop issue (I hadn't thought of recursion
issues before) :)

Jan's point about resettability was interesting.

Processing a finite number of DMA descriptors and scheduling BHs to
continuously get back and continue processing should be OK, no?

That would avoid the lockups and allow the device to be reset at
any time. Or am I missing something?

IIRC the ETRAX DMA does something like that, but only for receive
DMA rings...

Cheers,
Edgar
Avi Kivity - Sept. 19, 2012, 12:12 p.m.
On 09/19/2012 02:46 PM, Edgar E. Iglesias wrote:
> On Wed, Sep 19, 2012 at 10:55:30AM +0300, Avi Kivity wrote:
>> On 09/19/2012 07:40 AM, Peter Crosthwaite wrote:
>> > On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
>> > <edgar.iglesias@gmail.com> wrote:
>> >> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
>> >>> Ping for PMM,
>> >>>
>> >>> This is the root case of your block on the SDHCI series - this is a
>> >>> discussion on resolution to bogus infinite looping DMA. For current
>> >>> participants in this discussion, heres our thread on the same topic
>> >>> over in SD land:
>> >>>
>> >>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
>> >>>
>> >>> With the findings here and acknowledgement that this is a general
>> >>> problem, we either need to declare this issue of scope for SDHCI, or
>> >>> work with these guys (in the immediate future) on the DMA infinite
>> >>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
>> >>> here, but can we get a decisive plan for going forward with this issue
>> >>> if it is going to continue to block SDHCI.
>> >>>
>> >>> Thanks to Igor for identifying the overlap between these discussions.
>> >>
>> >> Hi,
>> >>
>> >> A couple of dumb questions.
>> >>
>> >> What is the reason for the blocker? that possible guest dos is worse
>> >> than no functionality at all?
>> >>
>> >> Can't you put the DMA/IO processing into the background?
>> > 
>> > I dont know a way do do asynchronous DMA unless I am missing something
>> > here.
>> 
>> You could schedule a bottom half and do the accesses from there.  Solves
>> nothing though.
>> 
>> > So what happens is we have a device that walks a SG list
>> > synchronously while holding all the locks and stuff being discussed
>> > here. If that SG list infinite loops then crash.
>> 
>> Did you mean loop, or recursion into the memory region that initiates DMA?
> 
> I think we were discussing the loop issue (I hadn't thought of recursion
> issues before) :)
> 
> Jan's point of resetability was interesting.
> 
> Processing a finite amount of dma descriptors and scheduling bh's
> to continously get back and continue processing should be OK no?
> 

Yes.

> That would avoid the lockups and allow the device to be reset at
> any time. Or am I missing something?
> 

Not that I can see.  If real hardware can be looped, so can qemu.  I'm
only worried about recursion and deadlocks (while real hardware can
deadlock, we'd prefer to avoid that).
Edgar Iglesias - Sept. 19, 2012, 12:17 p.m.
On Wed, Sep 19, 2012 at 03:12:12PM +0300, Avi Kivity wrote:
> On 09/19/2012 02:46 PM, Edgar E. Iglesias wrote:
> > On Wed, Sep 19, 2012 at 10:55:30AM +0300, Avi Kivity wrote:
> >> On 09/19/2012 07:40 AM, Peter Crosthwaite wrote:
> >> > On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
> >> > <edgar.iglesias@gmail.com> wrote:
> >> >> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
> >> >>> Ping for PMM,
> >> >>>
> >> >>> This is the root case of your block on the SDHCI series - this is a
> >> >>> discussion on resolution to bogus infinite looping DMA. For current
> >> >>> participants in this discussion, heres our thread on the same topic
> >> >>> over in SD land:
> >> >>>
> >> >>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
> >> >>>
> >> >>> With the findings here and acknowledgement that this is a general
> >> >>> problem, we either need to declare this issue of scope for SDHCI, or
> >> >>> work with these guys (in the immediate future) on the DMA infinite
> >> >>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
> >> >>> here, but can we get a decisive plan for going forward with this issue
> >> >>> if it is going to continue to block SDHCI.
> >> >>>
> >> >>> Thanks to Igor for identifying the overlap between these discussions.
> >> >>
> >> >> Hi,
> >> >>
> >> >> A couple of dumb questions.
> >> >>
> >> >> What is the reason for the blocker? that possible guest dos is worse
> >> >> than no functionality at all?
> >> >>
> >> >> Can't you put the DMA/IO processing into the background?
> >> > 
> >> > I dont know a way do do asynchronous DMA unless I am missing something
> >> > here.
> >> 
> >> You could schedule a bottom half and do the accesses from there.  Solves
> >> nothing though.
> >> 
> >> > So what happens is we have a device that walks a SG list
> >> > synchronously while holding all the locks and stuff being discussed
> >> > here. If that SG list infinite loops then crash.
> >> 
> >> Did you mean loop, or recursion into the memory region that initiates DMA?
> > 
> > I think we were discussing the loop issue (I hadn't thought of recursion
> > issues before) :)
> > 
> > Jan's point of resetability was interesting.
> > 
> > Processing a finite amount of dma descriptors and scheduling bh's
> > to continously get back and continue processing should be OK no?
> > 
> 
> Yes.
> 
> > That would avoid the lockups and allow the device to be reset at
> > any time. Or am I missing something?
> > 
> 
> Not that I can see.  If real hardware can be looped, so can qemu.  I'm
> only worried about recursion and deadlocks (while real hardware can
> deadlock, we'd prefer to avoid that).

Agreed, thanks
Mitsyanko Igor - Sept. 19, 2012, 1:01 p.m.
On 09/19/2012 04:12 PM, Avi Kivity wrote:
> On 09/19/2012 02:46 PM, Edgar E. Iglesias wrote:
>> On Wed, Sep 19, 2012 at 10:55:30AM +0300, Avi Kivity wrote:
>>> On 09/19/2012 07:40 AM, Peter Crosthwaite wrote:
>>>> On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
>>>> <edgar.iglesias@gmail.com> wrote:
>>>>> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
>>>>>> Ping for PMM,
>>>>>>
>>>>>> This is the root case of your block on the SDHCI series - this is a
>>>>>> discussion on resolution to bogus infinite looping DMA. For current
>>>>>> participants in this discussion, heres our thread on the same topic
>>>>>> over in SD land:
>>>>>>
>>>>>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
>>>>>>
>>>>>> With the findings here and acknowledgement that this is a general
>>>>>> problem, we either need to declare this issue of scope for SDHCI, or
>>>>>> work with these guys (in the immediate future) on the DMA infinite
>>>>>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
>>>>>> here, but can we get a decisive plan for going forward with this issue
>>>>>> if it is going to continue to block SDHCI.
>>>>>>
>>>>>> Thanks to Igor for identifying the overlap between these discussions.
>>>>> Hi,
>>>>>
>>>>> A couple of dumb questions.
>>>>>
>>>>> What is the reason for the blocker? that possible guest dos is worse
>>>>> than no functionality at all?
>>>>>
>>>>> Can't you put the DMA/IO processing into the background?
>>>> I dont know a way do do asynchronous DMA unless I am missing something
>>>> here.
>>> You could schedule a bottom half and do the accesses from there.  Solves
>>> nothing though.
>>>
>>>> So what happens is we have a device that walks a SG list
>>>> synchronously while holding all the locks and stuff being discussed
>>>> here. If that SG list infinite loops then crash.
>>> Did you mean loop, or recursion into the memory region that initiates DMA?
>> I think we were discussing the loop issue (I hadn't thought of recursion
>> issues before) :)
>>
>> Jan's point of resetability was interesting.
>>
>> Processing a finite amount of dma descriptors and scheduling bh's
>> to continously get back and continue processing should be OK no?
>>
> Yes.

Bottom half idea sounds good.
Also, for this particular device, rather than limiting the number of
descriptor entries processed per bottom half, it seems reasonable to
interrupt the bh handler when we encounter a LINK descriptor, since only
recursive LINKs can generate infinite loops. In that case, a very long
descriptor table without a single LINK entry could potentially hang QEMU
for a long time, but that time will be finite anyway. Long term, this
problem will disappear when/if someone converts the QEMU SD card model to
use async I/O.
>> That would avoid the lockups and allow the device to be reset at
>> any time. Or am I missing something?
>>
> Not that I can see.  If real hardware can be looped, so can qemu.  I'm
> only worried about recursion and deadlocks (while real hardware can
> deadlock, we'd prefer to avoid that).
>
>

So, I think the idea here is that if real hardware can lock up, we
should lock up too, but give the guest CPU a way to abort the locked
operation. For this particular example, the SD card controller can deadlock
on recursive descriptor LINK entries, but the CPU can abort an ongoing
transaction at any time by issuing an ABORT command. And if the guest CPU
busy-waits in an infinite while() loop for a TRANSFER OVER flag to be set,
then it had it coming.
Avi Kivity - Sept. 19, 2012, 1:03 p.m.
On 09/19/2012 04:01 PM, Igor Mitsyanko wrote:
>>> That would avoid the lockups and allow the device to be reset at
>>> any time. Or am I missing something?
>>>
>> Not that I can see.  If real hardware can be looped, so can qemu.  I'm
>> only worried about recursion and deadlocks (while real hardware can
>> deadlock, we'd prefer to avoid that).
>>
>>
> 
> So, I think the idea here is that if real hardware can be locked we
> should lock too, but provide a guest CPU a possibility to abort locked
> operation. For this particular example, SD card controller can deadlock
> on recursive descriptor LINK entries, but CPU can abort ongoing
> transaction at any time by issuing ABORT command. And if guest CPU
> busy-waits in infinite while() loop for a TRANSFER OVER flag to be set,
> then it had it coming.

If real hardware would lock up so should we (in the guest's eyes) but
the qemu monitor should remain responsive.  It is not part of emulated
hardware.
Mitsyanko Igor - Sept. 19, 2012, 1:07 p.m.
On 09/19/2012 11:57 AM, Jan Kiszka wrote:
> On 2012-09-19 06:40, Peter Crosthwaite wrote:
>> On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
>> <edgar.iglesias@gmail.com> wrote:
>>> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
>>>> Ping for PMM,
>>>>
>>>> This is the root case of your block on the SDHCI series - this is a
>>>> discussion on resolution to bogus infinite looping DMA. For current
>>>> participants in this discussion, heres our thread on the same topic
>>>> over in SD land:
>>>>
>>>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
>>>>
>>>> With the findings here and acknowledgement that this is a general
>>>> problem, we either need to declare this issue of scope for SDHCI, or
>>>> work with these guys (in the immediate future) on the DMA infinite
>>>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
>>>> here, but can we get a decisive plan for going forward with this issue
>>>> if it is going to continue to block SDHCI.
>>>>
>>>> Thanks to Igor for identifying the overlap between these discussions.
>>> Hi,
>>>
>>> A couple of dumb questions.
>>>
>>> What is the reason for the blocker? that possible guest dos is worse
>>> than no functionality at all?
>>>
>>> Can't you put the DMA/IO processing into the background?
>> I dont know a way do do asynchronous DMA unless I am missing something
>> here. So what happens is we have a device that walks a SG list
>> synchronously while holding all the locks and stuff being discussed
>> here. If that SG list infinite loops then crash.
> We could switch to async DMA, but most DMA-capable devices aren't
> prepared for rescheduling points in the middle of their code. This will
> require quite a bit of code review.

 From what I understand, we can't switch to async DMA while the SD card 
model doesn't support it. And introducing this support would require 
converting every existing SD card user in QEMU.
I can recall that such a conversion was already suggested on the mailing 
list a while ago, but no one (including me) volunteered to do it :)

>>> what's the diff between setup of bad descriptors and  writing a
>>> while (1) loop in the guest CPU?
>>>
>> While(1) in guest doesnt freeze all of QEMU (monitor still works and
>> stuff), wheras the DMA ops going in circles does, as locks are taken
>> by an infinite loop.
> That's the point. If we loop, we need at least a way to stop it by
> issuing a device or system reset. But I tend to think that detecting
> loops and failing/ignoring requests is easier implementable.
>
> Jan
>
pingfan liu - Sept. 20, 2012, 7:51 a.m.
On Wed, Sep 19, 2012 at 5:05 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/19/2012 11:36 AM, liu ping fan wrote:
>>>
>>> It basically means you can't hold contents of device state in local
>>> variables.  You need to read everything again from the device.  That
>>> includes things like DMA enable bits.
>>>
>> I think that read everything again from the device can not work.
>> Suppose the following scene: If the device's state contains the change
>> of a series of internal registers (supposing partA+partB;
>> partC+partD), after partA changed, the device's lock is broken.  At
>> this point, another access to this device, it will work on partA+partB
>> to determine C+D, but since partB is not correctly updated yet. So C+D
>> may be decided by broken context and be wrong.
>
> That's the guest's problem.  Note it could have happened even before the
> change, since the writes to A/B/C/D are unordered wrt the DMA.
>
Yes, agreed, it is the guest's problem.  So that means that if
ready_of(A+B) is not signaled to the guest, the guest should not launch
operations on (C+D). Right?   But here comes the question: if ready is
not signaled to the guest, how can the guest launch operations on (A+B)
again?
i.e. although the local lock is broken, (A+B) is still intact when we
re-acquire the local lock.  So there is no need to read everything again
from the device.  Wrong?

>>
>>> (btw: to we respect PCI_COMMAND_MASTER? it seems that we do not.  This
>>> is a trivial example of an iommu, we should get that going).
>>>
>>>> Or for converted device, we can just tag it a
>>>> busy flag, that is check&set busy flag at the entry of device's
>>>> mmio-dispatch.  So  when re-acquire device's lock, the device's state
>>>> is intact.
>>>
>>> The state can be changed by a parallel access to another register, which
>>> is valid.
>>>
>> Do you mean that the device can be accessed in parallel? But how? we
>> use device's lock.
>
> Some DMA is asynchronous already, like networking and IDE DMA.  So we
> drop the big lock while doing it.  With the change to drop device locks
> around c_p_m_rw(), we have that problem everywhere.
>
>> What my suggestion is:
>> lock();
>> set_and_test(dev->busy);
>> if busy
>>   unlock and return;
>> changing device registers;
>> do other things including calling to c_p_m_rw() //here,lock broken,
>> but set_and_test() works
>> clear(dev->busy);
>> unlock();
>>
>> So changing device registers is protected, and unbreakable.
>
> But the changes may be legitimate.  Maybe you're writing to a completely
> unrelated register, from a different vcpu, now that write is lost.
>
But I think it will mean finer-grained locks for each group of unrelated
registers, and accordingly, the busy flag should be a bitmap with one
bit per group.

Thanks and regards,
pingfan
>
> --
> error compiling committee.c: too many arguments to function
Avi Kivity - Sept. 20, 2012, 9:15 a.m.
On 09/20/2012 10:51 AM, liu ping fan wrote:
> On Wed, Sep 19, 2012 at 5:05 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/19/2012 11:36 AM, liu ping fan wrote:
>>>>
>>>> It basically means you can't hold contents of device state in local
>>>> variables.  You need to read everything again from the device.  That
>>>> includes things like DMA enable bits.
>>>>
>>> I think that read everything again from the device can not work.
>>> Suppose the following scene: If the device's state contains the change
>>> of a series of internal registers (supposing partA+partB;
>>> partC+partD), after partA changed, the device's lock is broken.  At
>>> this point, another access to this device, it will work on partA+partB
>>> to determine C+D, but since partB is not correctly updated yet. So C+D
>>> may be decided by broken context and be wrong.
>>
>> That's the guest's problem.  Note it could have happened even before the
>> change, since the writes to A/B/C/D are unordered wrt the DMA.
>>
> Yes, agree, it is the guest's problem.  So it means that ready_of(A+B)
> is not signaled to guest, the guest should not launch operations on
> (C+D). Right?   But here comes the question, if ready not signaled to
> guest, how can guest launch operation on (A+B) again?

It may be evil.

> i.e. although local lock is broken, the (A+B) is still intact when
> re-acquire local lock.  So need not to read everything again from the
> device.  Wrong?

The device needs to perform according to its specifications.  If the
specifications allow for this kind of access, we must ensure it works.
If they don't, we must ensure something sane happens, instead of a qemu
crash or exploit.  This means that anything dangerous like pointers must
be revalidated.  To be on the safe side, I recommend revalidating (or
reloading) everything, but it may not be necessary in all cases.

>>> What my suggestion is:
>>> lock();
>>> set_and_test(dev->busy);
>>> if busy
>>>   unlock and return;
>>> changing device registers;
>>> do other things including calling to c_p_m_rw() //here,lock broken,
>>> but set_and_test() works
>>> clear(dev->busy);
>>> unlock();
>>>
>>> So changing device registers is protected, and unbreakable.
>>
>> But the changes may be legitimate.  Maybe you're writing to a completely
>> unrelated register, from a different vcpu, now that write is lost.
>>
> But I think it will mean more-fine locks for each groups of unrelated
> register, and accordingly, the busy should be bitmap for each group.

It's possible.  Let's defer the discussion until a concrete case is
before us.  It may be that different devices will want different
solutions (though there is value in applying one solution everywhere).
pingfan liu - Sept. 21, 2012, 7:27 a.m.
On Thu, Sep 20, 2012 at 5:15 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/20/2012 10:51 AM, liu ping fan wrote:
>> On Wed, Sep 19, 2012 at 5:05 PM, Avi Kivity <avi@redhat.com> wrote:
>>> On 09/19/2012 11:36 AM, liu ping fan wrote:
>>>>>
>>>>> It basically means you can't hold contents of device state in local
>>>>> variables.  You need to read everything again from the device.  That
>>>>> includes things like DMA enable bits.
>>>>>
>>>> I think that read everything again from the device can not work.
>>>> Suppose the following scene: If the device's state contains the change
>>>> of a series of internal registers (supposing partA+partB;
>>>> partC+partD), after partA changed, the device's lock is broken.  At
>>>> this point, another access to this device, it will work on partA+partB
>>>> to determine C+D, but since partB is not correctly updated yet. So C+D
>>>> may be decided by broken context and be wrong.
>>>
>>> That's the guest's problem.  Note it could have happened even before the
>>> change, since the writes to A/B/C/D are unordered wrt the DMA.
>>>
>> Yes, agree, it is the guest's problem.  So it means that ready_of(A+B)
>> is not signaled to guest, the guest should not launch operations on
>> (C+D). Right?   But here comes the question, if ready not signaled to
>> guest, how can guest launch operation on (A+B) again?
>
> It may be evil.
>
>> i.e. although local lock is broken, the (A+B) is still intact when
>> re-acquire local lock.  So need not to read everything again from the
>> device.  Wrong?
>
> The device needs to perform according to its specifications.  If the
> specifications allow for this kind of access, we must ensure it works.
> If they don't, we must ensure something sane happens, instead of a qemu
> crash or exploit.  This means that anything dangerous like pointers must
> be revalidated.  To be on the safe side, I recommend revalidating (or
> reloading) everything, but it may not be necessary in all cases.
>
Yeah, that captures the two points exactly.

>>>> What my suggestion is:
>>>> lock();
>>>> set_and_test(dev->busy);
>>>> if busy
>>>>   unlock and return;
>>>> changing device registers;
>>>> do other things including calling to c_p_m_rw() //here,lock broken,
>>>> but set_and_test() works
>>>> clear(dev->busy);
>>>> unlock();
>>>>
>>>> So changing device registers is protected, and unbreakable.
>>>
>>> But the changes may be legitimate.  Maybe you're writing to a completely
>>> unrelated register, from a different vcpu, now that write is lost.
>>>
>> But I think it will mean more-fine locks for each groups of unrelated
>> register, and accordingly, the busy should be bitmap for each group.
>
> It's possible.  Let's defer the discussion until a concrete case is
> before us.  It may be that different devices will want different
> solutions (though there is value in applying one solution everywhere).
>
Okay, I appreciate the detailed explanation.

Regards,
pingfan
>
> --
> error compiling committee.c: too many arguments to function

Patch

diff --git a/cpus.c b/cpus.c
index 4cd7f85..09f6670 100644
--- a/cpus.c
+++ b/cpus.c
@@ -736,6 +736,8 @@  static void *qemu_kvm_cpu_thread_fn(void *arg)
     int r;
 
     pthread_setspecific(qemu_thread_key, cpu->thread);
+    cpu->thread->lockmap.biglock = 0;
+    qemu_big_lockmap_inc();
     qemu_mutex_lock(&qemu_global_mutex);
     qemu_thread_get_self(cpu->thread);
     env->thread_id = qemu_get_thread_id();
@@ -905,6 +907,14 @@  int qemu_cpu_is_self(void *_env)
 
 void qemu_mutex_lock_iothread(void)
 {
+    unsigned int map;
+
+    if (!qemu_thread_is_self(&io_thread)) {
+        map = qemu_big_lockmap_inc();
+        if (map > 1) {
+            return;
+        }
+    }
     if (!tcg_enabled()) {
         qemu_mutex_lock(&qemu_global_mutex);
     } else {
@@ -920,6 +930,14 @@  void qemu_mutex_lock_iothread(void)
 
 void qemu_mutex_unlock_iothread(void)
 {
+    unsigned int map;
+
+    if (!qemu_thread_is_self(&io_thread)) {
+        map = qemu_big_lockmap_dec();
+        if (map != 0) {
+            return;
+        }
+    }
     qemu_mutex_unlock(&qemu_global_mutex);
 }
 
diff --git a/qemu-thread-posix.c b/qemu-thread-posix.c
index f448fcb..1e07dc2 100644
--- a/qemu-thread-posix.c
+++ b/qemu-thread-posix.c
@@ -17,6 +17,7 @@ 
 #include <signal.h>
 #include <stdint.h>
 #include <string.h>
+#include <glib.h>
 #include "qemu-thread.h"
 
 pthread_key_t qemu_thread_key;
@@ -158,6 +159,28 @@  void qemu_thread_key_create(void)
     pthread_key_create(&qemu_thread_key, NULL);
 }
 
+int16_t qemu_big_lockmap_inc(void)
+{
+    QemuThread *t = pthread_getspecific(qemu_thread_key);
+
+    return ++t->lockmap.biglock;
+}
+
+int16_t qemu_big_lockmap_dec(void)
+{
+    QemuThread *t = pthread_getspecific(qemu_thread_key);
+    g_assert(t->lockmap.biglock > 0);
+
+    return --t->lockmap.biglock;
+}
+
+int16_t qemu_big_lockmap_get(void)
+{
+    QemuThread *t = pthread_getspecific(qemu_thread_key);
+
+    return t->lockmap.biglock;
+}
+
 bool qemu_thread_is_self(QemuThread *thread)
 {
    return pthread_equal(pthread_self(), thread->thread);
diff --git a/qemu-thread-posix.h b/qemu-thread-posix.h
index 2607b1c..8f9506b 100644
--- a/qemu-thread-posix.h
+++ b/qemu-thread-posix.h
@@ -10,8 +10,13 @@  struct QemuCond {
     pthread_cond_t cond;
 };
 
+typedef struct Lockmap {
+    int16_t biglock;
+} Lockmap;
+
 struct QemuThread {
     pthread_t thread;
+    Lockmap lockmap;
 };
 
 extern pthread_key_t qemu_thread_key;
diff --git a/qemu-thread.h b/qemu-thread.h
index 4a6427d..529850b 100644
--- a/qemu-thread.h
+++ b/qemu-thread.h
@@ -47,4 +47,7 @@  bool qemu_thread_is_self(QemuThread *thread);
 void qemu_thread_exit(void *retval);
 
 void qemu_thread_key_create(void);
+int16_t qemu_big_lockmap_inc(void);
+int16_t qemu_big_lockmap_dec(void);
+int16_t qemu_big_lockmap_get(void);
 #endif