Message ID: 1347349912-15611-11-git-send-email-qemulist@gmail.com
State: New
On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>
> The func call chain can suffer from recursively hold
> qemu_mutex_lock_iothread. We introduce lockmap to record the
> lock depth.

What is the root cause? io handlers initiating I/O?

Perhaps we can have a solution that is local to exec.c.
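[Editor's note: the "lockmap" the patch describes can be thought of as a per-thread lock-depth counter in front of the big lock, so that re-entrant callers skip re-acquiring it. A minimal sketch of that idea follows; the names (big_lock_acquire, lockmap_depth) are illustrative, not the actual patch API.]

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread int lockmap_depth;  /* per-thread nesting depth */

/* Take the big lock only on the outermost entry. */
void big_lock_acquire(void)
{
    if (lockmap_depth++ == 0) {
        pthread_mutex_lock(&big_lock);
    }
}

/* Drop it only when the outermost caller unwinds. */
void big_lock_release(void)
{
    assert(lockmap_depth > 0);
    if (--lockmap_depth == 0) {
        pthread_mutex_unlock(&big_lock);
    }
}

int big_lock_depth(void)
{
    return lockmap_depth;
}
```

A nested call chain (e.g. an MMIO handler re-entering cpu_physical_memory_rw()) then passes through big_lock_acquire() without deadlocking, at the cost of hiding the recursion rather than eliminating it, which is exactly what the discussion below pushes back on.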
On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>
>> The func call chain can suffer from recursively hold
>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>> lock depth.
>
> What is the root cause? io handlers initiating I/O?

cpu_physical_memory_rw() can be called nested, and when called, it can
be protected by no lock, a device lock, or the big lock. I think that
without the big lock, the io-dispatcher faces the same issue. As to the
main loop, I have not considered it carefully, but at least dma-helper
will call cpu_physical_memory_rw() with the big lock held.

Regards,
pingfan

> Perhaps we can have a solution that is local to exec.c.
On 09/11/2012 12:44 PM, liu ping fan wrote:
> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>> What is the root cause? io handlers initiating I/O?
>>
> cpu_physical_memory_rw() can be called nested, and when called, it can
> be protected by no-lock/device lock/ big-lock.

Then we should look for a solution that is local to exec.c (and the
nested dispatch problem). I think we can identify and fix all nested
dispatches (converting them to either async dma, or letting the memory
core do a single dispatch without indirection through I/O handlers).

> I think without big lock, io-dispatcher should face the same issue.
> As to main-loop, have not carefully consider, but at least, dma-helper
> will call cpu_physical_memory_rw() with big-lock.

DMA is inherently asynchronous, so we already drop the lock between
initiation and completion; we need to find a way to make it easy to use
the API without taking the lock while the transfer takes place.
On 2012-09-11 11:44, liu ping fan wrote:
> cpu_physical_memory_rw() can be called nested, and when called, it can
> be protected by no-lock/device lock/ big-lock.
> I think without big lock, io-dispatcher should face the same issue.
> As to main-loop, have not carefully consider, but at least, dma-helper
> will call cpu_physical_memory_rw() with big-lock.

That is our core problem: inconsistent invocation of existing services
w.r.t. locking. For portio, I was lucky that there is no nesting and I
was able to drop the big lock around all (x86) call sites. But MMIO is
way more tricky due to DMA nesting.

We could try to introduce a different version of cpu_physical_memory_rw,
cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
request can trigger the very same access in a nested fashion, and we
will have to detect this to avoid locking up QEMU (locking up the guest
might be OK).

Jan
On 2012-09-11 11:54, Avi Kivity wrote:
> Then we should look for a solution that is local to exec.c (and the
> nested dispatch problem). I think we can identify and fix all nested
> dispatches (converting them to either async dma, or letting the memory
> core do a single dispatch without indirection through I/O handlers).
>
> DMA is inherently asynchronous, so we already drop the lock between
> initiation and completion; we need to find a way to make it easy to use
> the API without taking the lock while the transfer takes place.

We will have to review/rework device models that want to use the new
locking scheme so that they can drop their own lock while issuing DMA.
But that is surely non-trivial.

The other option is to keep DMA requests issued by devices synchronous
but let them fail if we are about to lock up. Still requires changes,
but is probably more comprehensible for device model developers.

Jan
On 09/11/2012 01:04 PM, Jan Kiszka wrote:
>> DMA is inherently asynchronous, so we already drop the lock between
>> initiation and completion; we need to find a way to make it easy to use
>> the API without taking the lock while the transfer takes place.
>
> We will have to review/rework device models that want to use the new
> locking scheme in a way that they can drop their own lock while issuing
> DMA. But that is surely non-trivial.

Most DMA today happens without the big qemu lock. We only need to
convert the paths that actually access memory; these are the block and
network layers (for the respective devices).

> The other option is to keep DMA requests issued by devices synchronous
> but let them fail if we are about to lock up. Still requires changes,
> but is probably more comprehensible for device model developers.

How do you handle failures?
On 2012-09-11 13:03, Avi Kivity wrote:
> Most DMA today happens without the big qemu lock. We only need to
> convert the paths that actually access memory, these are the block and
> network layers (for the respective devices).

...and the devices themselves, of course.

>> The other option is to keep DMA requests issued by devices synchronous
>> but let them fail if we are about to lock up. Still requires changes,
>> but is probably more comprehensible for device model developers.
>
> How do you handle failures?

By not sending a network frame, or dropping an incoming one, e.g., and
signaling this in a device-specific way.

Jan
On 09/11/2012 02:08 PM, Jan Kiszka wrote:
>> Most DMA today happens without the big qemu lock. We only need to
>> convert the paths that actually access memory, these are the block and
>> network layers (for the respective devices).
>
> ...and the devices themselves, of course.

Right, for descriptors and stuff. So a guest can set a DMA address to
point at its own BAR, and cause qemu to deadlock.

The problem is not limited to locking. A guest can also cause a write
to a BAR to initiate a synchronous write with the same address and data,
triggering infinite recursion.

Perhaps one fix is the mythical DMA API, which will provide each DMA
initiator with its own view of memory (a private MemoryRegion root
node). Even that can be worked around with a pair of devices accessing
each other.

>> By not sending a network frame or dropping an incoming one, e.g., and
>> signaling this in a device specific way.

Doesn't work for block devices.
On 09/11/2012 12:57 PM, Jan Kiszka wrote:
> That is our core problem: inconsistent invocation of existing services
> /wrt locking. For portio, I was lucky that there is no nesting and I was
> able to drop the big lock around all (x86) call sites. But MMIO is way
> more tricky due to DMA nesting.

Maybe we need to switch to a continuation style. Instead of expecting
cpu_physical_memory_rw() to complete synchronously, it becomes an
asynchronous call and you provide it with a completion. That means
devices which use it are forced to drop the lock in between. Block and
network clients will be easy to convert since they already use APIs that
drop the lock (except for accessing the descriptors).

> We could try to introduce a different version of cpu_physical_memory_rw,
> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
> request can trigger the very same access in a nested fashion, and we
> will have to detect this to avoid locking up QEMU (locking up the guest
> might be OK).

An async version of c_p_m_rw() will just cause a completion to bounce
around, consuming cpu but not deadlocking anything. If we can keep a
count of the bounces, we might be able to stall it indefinitely or at
least ratelimit it.
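[Editor's note: the continuation style Avi describes can be sketched as follows. The request struct, function names, and inline completion are all hypothetical; in QEMU the request would be queued and the completion would run later from the event loop, after the caller has dropped its locks.]

```c
#include <stddef.h>

/* Completion callback: invoked when the transfer finishes. */
typedef void (*dma_completion_fn)(void *opaque, int ret);

struct dma_request {
    unsigned long addr;
    void *buf;
    size_t len;
    int is_write;
    dma_completion_fn complete;
    void *opaque;
};

static int completed_ret = -1;

/* Asynchronous counterpart of cpu_physical_memory_rw(): returns
 * immediately; the caller must not hold locks the completion needs.
 * Here the access is "performed" inline purely for illustration. */
void cpu_physical_memory_rw_async(struct dma_request *req)
{
    /* ... queue or perform the actual memory access ... */
    req->complete(req->opaque, 0);
}

static void my_completion(void *opaque, int ret)
{
    (void)opaque;
    completed_ret = ret;   /* device resumes its state machine here */
}
```

The key property is the one Avi names: a self-referencing access no longer recurses on the stack; it just causes another completion to be queued, which can be counted and rate-limited.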
On 2012-09-11 14:20, Avi Kivity wrote:
> Right, for descriptors and stuff. So a guest can set a DMA address to
> point at its own BAR, and cause qemu to deadlock.
>
> The problem is not limited to locking. A guest can also cause a write
> to a BAR to initiate a synchronous write with the same address and data,
> triggering infinite recursion.
>
> Perhaps one fix is the mythical DMA API, which will provide each DMA
> initiator with its own view of memory (a private MemoryRegion root
> node). Even that can be worked around with a pair of devices accessing
> each other.

Hmm, I don't see an alternative to runtime loop detection yet.

>> By not sending a network frame or dropping an incoming one, e.g., and
>> signaling this in a device specific way.
>
> Doesn't work for block devices.

Because the block layer API cannot report errors to the devices? What
happens if there is a real I/O error?

Jan
On 09/11/2012 03:25 PM, Jan Kiszka wrote:
> Hmm, don't see an alternative to runtime loop detection yet.

See an expensive one on the other branch of this thread.

> Because the block layer API cannot report errors to the devices? What
> happens if there is a real I/O error?

We report real I/O errors. But if we report a transient error due to
some lock being taken as an I/O error, the guest will take unwarranted
action.

If the errors are not expected in normal operation (we can avoid them if
all DMA is to real RAM) then this is an acceptable solution. Still, it
generates a lot of rarely used code paths and so isn't very good for
security.
On 2012-09-11 14:30, Avi Kivity wrote:
> We report real I/O errors. But if we report a transient error due to
> some lock being taken as an I/O error, the guest will take unwarranted
> action.
>
> If the errors are not expected in normal operation (we can avoid them if
> all DMA is to real RAM) then this is an acceptable solution. Still it
> generates a lot of rarely used code paths and so isn't very good for
> security.

I'm not talking about transient errors. Recursions like this are always
guest configuration errors that would cause real devices to lock up or
time out as well. This is practically about avoiding that a malicious
guest can lock up QEMU, leaving it inoperative even for management
tools.

Jan
On 09/11/2012 03:35 PM, Jan Kiszka wrote:
> I'm not talking about transient errors. Recursions like this are always
> guest configuration errors that would cause real devices to lock up or
> timeout as well. This is practically about avoiding that a malicious
> guest can lock up QEMU, leaving it inoperative even for management tools.

Ok. That's more palatable. We don't even have to report an error in
that case; we can just perform the operation incorrectly (as I'm sure
real hardware will), log an error, and continue.
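[Editor's note: the runtime loop detection the two converge on here can be sketched as a bounded dispatch depth: past a limit, log once and complete the access "incorrectly" (reads return 0, writes would be dropped) instead of recursing forever. All names below are illustrative, not QEMU's dispatcher.]

```c
#include <stdio.h>

#define MAX_DISPATCH_DEPTH 8

static __thread int dispatch_depth;  /* current MMIO nesting depth */
static int dropped_accesses;         /* how often we bailed out */

long mmio_dispatch_read(unsigned long addr);

/* Demo device whose MMIO read re-enters the dispatcher, as when a
 * guest points a DMA descriptor at the device's own BAR. */
static long self_referencing_read(unsigned long addr)
{
    return mmio_dispatch_read(addr);
}

long mmio_dispatch_read(unsigned long addr)
{
    long val;

    if (dispatch_depth >= MAX_DISPATCH_DEPTH) {
        /* Guest misconfiguration: complete incorrectly, log, go on. */
        dropped_accesses++;
        fprintf(stderr, "mmio recursion at %#lx, returning 0\n", addr);
        return 0;
    }
    dispatch_depth++;
    val = self_referencing_read(addr);
    dispatch_depth--;
    return val;
}
```

The guest that set up the loop gets garbage, which is acceptable per the discussion above; QEMU itself stays responsive for management tools.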
On 09/11/2012 03:24 PM, Avi Kivity wrote:
> On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>> [...]
>> We could try to introduce a different version of cpu_physical_memory_rw,
>> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
>> request can trigger the very same access in a nested fashion, and we
>> will have to detect this to avoid locking up QEMU (locking up the guest
>> might be OK).
>
> An async version of c_p_m_rw() will just cause a completion to bounce
> around, consuming cpu but not deadlocking anything. If we can keep a
> count of the bounces, we might be able to stall it indefinitely or at
> least ratelimit it.

Another option is to require all users of c_p_m_rw() and related to use
a coroutine or thread. That makes the programming easier (but still
requires a revalidation after the dropped lock).
On Tue, Sep 11, 2012 at 03:41:15PM +0300, Avi Kivity wrote:
> On 09/11/2012 03:24 PM, Avi Kivity wrote:
>> [...]
>> An async version of c_p_m_rw() will just cause a completion to bounce
>> around, consuming cpu but not deadlocking anything. If we can keep a
>> count of the bounces, we might be able to stall it indefinitely or at
>> least ratelimit it.
>
> Another option is to require all users of c_p_m_rw() and related to use
> a coroutine or thread. That makes the programming easier (but still
> required a revalidation after the dropped lock).

Unless I am misunderstanding this thread, the iothread flow section of
http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg04315.html
contains possible solutions for this.

Basically, use trylock() and if it fails, then choose a bailout option
(of which there is more than one possibility listed).
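[Editor's note: the trylock-and-bail-out flow Marcelo points to can be sketched like this. The function name and fallback are hypothetical; the point is that the access path never blocks on a lock that may already be held.]

```c
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* Attempt a DMA access without ever blocking on the big lock.
 * If the lock is contended (or already held by us), bail out and
 * let the caller pick a fallback: retry later, drop the frame, etc. */
int dma_rw_trylock(void)
{
    if (pthread_mutex_trylock(&big_lock) != 0) {
        return -EBUSY;  /* bailout option: report "try again" */
    }
    /* ... perform the memory access under the lock ... */
    pthread_mutex_unlock(&big_lock);
    return 0;
}
```

As pingfan notes in the reply, this addresses contention from another thread; for self-recursion the trylock on a non-recursive mutex also fails rather than deadlocking, which at least turns the lockup into a recoverable error.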
On Tue, Sep 11, 2012 at 10:54 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> [...]
> Unless i am misunderstanding this thread, the iothread flow section of
> http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg04315.html
> contains possible solutions for this.
>
> Basically use trylock() and if it fails, then choose a bailout option
> (which there are more than one possibilities listed).

I think that in that thread the original goal of the trylock() was to
handle a lock held by another thread, while here we want to resolve
nested locking. But yes, there is a similar point: solving it by going
async, if we do not allow nested locks.

Regards,
pingfan
On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity <avi@redhat.com> wrote:
> [...]
> Another option is to require all users of c_p_m_rw() and related to use
> a coroutine or thread. That makes the programming easier (but still
> required a revalidation after the dropped lock).

For the nested cpu_physical_memory_rw(), we change it internally but
keep its sync API as it is. (Wrap the current cpu_physical_memory_rw()
into cpu_physical_memory_rw_internal().)

  LOCK()   // can be device or big lock or both, depending on caller
  ..............
  cpu_physical_memory_rw()
  {
      UNLOCK()   // unlock all the locks
      queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
      // cpu_physical_memory_rw_internal can take locks (device, biglock) again
      wait_for_completion(completion)
      LOCK()
  }
  ..............
  UNLOCK()

Although cpu_physical_memory_rw_internal() would then be free to take
locks, there is a precondition: we still need to trace the lock stack
taken before cpu_physical_memory_rw(), so that it can return to the
caller correctly. Is that OK?

Regards,
pingfan
On 09/13/2012 09:55 AM, liu ping fan wrote:
> For the nested cpu_physical_memory_rw(), we change it internal but
> keep it sync API as it is. (Wrapper current cpu_physical_memory_rw()
> into cpu_physical_memory_rw_internal() )
>
> LOCK() // can be device or big lock or both, depend on caller.
> ..............
> cpu_physical_memory_rw()
> {
>     UNLOCK() //unlock all the locks
>     queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
>     // cpu_physical_memory_rw_internal can take lock(device,biglock) again
>     wait_for_completion(completion)
>     LOCK()
> }
> ..............
> UNLOCK()

This is dangerous. The caller expects to hold the lock across the call,
and will not re-validate its state.

> Although cpu_physical_memory_rw_internal() can be freed to use lock,
> but with precondition. -- We still need to trace lock stack taken by
> cpu_physical_memory_rw(), so that it can return to caller correctly.
> Is that OK?

I'm convinced that we need a recursive lock if we don't convert
everything at once. So how about:

- make bql a recursive lock (use pthreads, don't invent our own)
- change c_p_m_rw() calling convention to "caller may hold the BQL, but
  no device lock"

This allows devices to DMA each other. Unconverted device models will
work as they used to. Converted device models will need to drop the
device lock, re-acquire it, and revalidate any state. That will cause
them to become more correct w.r.t. recursive invocation.
On Thu, Sep 13, 2012 at 4:19 PM, Avi Kivity <avi@redhat.com> wrote: > On 09/13/2012 09:55 AM, liu ping fan wrote: >> On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity <avi@redhat.com> wrote: >>> On 09/11/2012 03:24 PM, Avi Kivity wrote: >>>> On 09/11/2012 12:57 PM, Jan Kiszka wrote: >>>>> On 2012-09-11 11:44, liu ping fan wrote: >>>>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote: >>>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote: >>>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com> >>>>>>>> >>>>>>>> The func call chain can suffer from recursively hold >>>>>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the >>>>>>>> lock depth. >>>>>>> >>>>>>> What is the root cause? io handlers initiating I/O? >>>>>>> >>>>>> cpu_physical_memory_rw() can be called nested, and when called, it can >>>>>> be protected by no-lock/device lock/ big-lock. >>>>>> I think without big lock, io-dispatcher should face the same issue. >>>>>> As to main-loop, have not carefully consider, but at least, dma-helper >>>>>> will call cpu_physical_memory_rw() with big-lock. >>>>> >>>>> That is our core problem: inconsistent invocation of existing services >>>>> /wrt locking. For portio, I was lucky that there is no nesting and I was >>>>> able to drop the big lock around all (x86) call sites. But MMIO is way >>>>> more tricky due to DMA nesting. >>>> >>>> Maybe we need to switch to a continuation style. Instead of expecting >>>> cpu_physical_memory_rw() to complete synchronously, it becomes an >>>> asynchronous call and you provide it with a completion. That means >>>> devices which use it are forced to drop the lock in between. Block and >>>> network clients will be easy to convert since they already use APIs that >>>> drop the lock (except for accessing the descriptors). >>>> >>>>> We could try to introduce a different version of cpu_physical_memory_rw, >>>>> cpu_physical_memory_rw_unlocked. 
But the problem remains that an MMIO >>>>> request can trigger the very same access in a nested fashion, and we >>>>> will have to detect this to avoid locking up QEMU (locking up the guest >>>>> might be OK). >>>> >>>> An async version of c_p_m_rw() will just cause a completion to bounce >>>> around, consuming cpu but not deadlocking anything. If we can keep a >>>> count of the bounces, we might be able to stall it indefinitely or at >>>> least ratelimit it. >>>> >>> >>> Another option is to require all users of c_p_m_rw() and related to use >>> a coroutine or thread. That makes the programming easier (but still >>> required a revalidation after the dropped lock). >>> >> For the nested cpu_physical_memory_rw(), we change it internal but >> keep it sync API as it is. (Wrapper current cpu_physical_memory_rw() >> into cpu_physical_memory_rw_internal() ) >> >> >> LOCK() // can be device or big lock or both, depend on caller. >> .............. >> cpu_physical_memory_rw() >> { >> UNLOCK() //unlock all the locks >> queue_work_on_thread(cpu_physical_memory_rw_internal, completion); >> // cpu_physical_memory_rw_internal can take lock(device,biglock) again >> wait_for_completion(completion) >> LOCK() >> } >> .................. >> UNLOCK() >> > > This is dangerous. The caller expects to hold the lock across the call, > and will not re-validate its state. > >> Although cpu_physical_memory_rw_internal() can be freed to use lock, >> but with precondition. -- We still need to trace lock stack taken by >> cpu_physical_memory_rw(), so that it can return to caller correctly. >> Is that OK? > > I'm convinced that we need a recursive lock if we don't convert > everything at once. > > So how about: > > - make bql a recursive lock (use pthreads, don't invent our own) > - change c_p_m_rw() calling convention to "caller may hold the BQL, but > no device lock" > > this allows devices to DMA each other. Unconverted device models will > work as they used to. 
Converted device models will need to drop the > device lock, re-acquire it, and revalidate any state. That will cause I think we are cornered by devices DMA-ing to each other, which raises unavoidable nested locking. Also, to avoid deadlocks caused by device lock ordering, we should drop the current device lock before acquiring another one. The only small divergence is about the "revalidate": do we need to roll back? Or, for a converted device, we can just tag it with a busy flag, that is, check-and-set the busy flag at the entry of the device's mmio-dispatch. So when we re-acquire the device's lock, the device's state is intact. Does anybody have another suggestion? Thanks and regards, pingfan > them to become more correct wrt recursive invocation. > > > -- > error compiling committee.c: too many arguments to function
Ping for PMM, This is the root cause of your block on the SDHCI series - this is a discussion on resolution to bogus infinite looping DMA. For current participants in this discussion, here's our thread on the same topic over in SD land: http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html With the findings here and the acknowledgement that this is a general problem, we either need to declare this issue out of scope for SDHCI, or work with these guys (in the immediate future) on the DMA infinite loop problem flagged here. I don't mind if SDHCI+ADMA is the guinea pig here, but can we get a decisive plan for going forward with this issue if it is going to continue to block SDHCI? Thanks to Igor for identifying the overlap between these discussions. Regards, Peter On Tue, Sep 11, 2012 at 10:39 PM, Avi Kivity <avi@redhat.com> wrote: > On 09/11/2012 03:35 PM, Jan Kiszka wrote: >> On 2012-09-11 14:30, Avi Kivity wrote: >>>>>>>> The other option is to keep DMA requests issued by devices synchronous >>>>>>>> but let them fail if we are about to lock up. Still requires changes, >>>>>>>> but is probably more comprehensible for device model developers. >>>>>>> >>>>>>> How do you handle failures? >>>>>> >>>>>> By not sending a network frame or dropping an incoming one, e.g., and >>>>>> signaling this in a device specific way. >>>>> >>>>> Doesn't work for block devices. >>>> >>>> Because the block layer API cannot report errors to the devices? What >>>> happens if there is a real I/O error? >>> >>> We report real I/O errors. But if we report a transient error due to >>> some lock being taken as an I/O error, the guest will take unwarranted >>> action. >>> >>> If the errors are not expected in normal operation (we can avoid them if >>> all DMA is to real RAM) then this is an acceptable solution. Still it >>> generates a lot of rarely used code paths and so isn't very good for >>> security. >> >> I'm not talking about transient errors.
Recursions like this are always >> guest configuration errors that would cause real devices to lock up or >> timeout as well. This is practically about avoiding that a malicious >> guest can lock up QEMU, leaving it inoperative even for management tools. > > Ok. That's more palatable. We don't even have to report an error in > that case, we can just perform the operation incorrectly (as I'm sure > real hardware will), log an error, and continue. > > > -- > error compiling committee.c: too many arguments to function >
On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote: > Ping for PMM, > > This is the root case of your block on the SDHCI series - this is a > discussion on resolution to bogus infinite looping DMA. For current > participants in this discussion, heres our thread on the same topic > over in SD land: > > http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html > > With the findings here and acknowledgement that this is a general > problem, we either need to declare this issue of scope for SDHCI, or > work with these guys (in the immediate future) on the DMA infinite > loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig > here, but can we get a decisive plan for going forward with this issue > if it is going to continue to block SDHCI. > > Thanks to Igor for identifying the overlap between these discussions. Hi, A couple of dumb questions. What is the reason for the blocker? that possible guest dos is worse than no functionality at all? Can't you put the DMA/IO processing into the background? what's the diff between setup of bad descriptors and writing a while (1) loop in the guest CPU? Cheers, E
On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias <edgar.iglesias@gmail.com> wrote: > On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote: >> Ping for PMM, >> >> This is the root case of your block on the SDHCI series - this is a >> discussion on resolution to bogus infinite looping DMA. For current >> participants in this discussion, heres our thread on the same topic >> over in SD land: >> >> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html >> >> With the findings here and acknowledgement that this is a general >> problem, we either need to declare this issue of scope for SDHCI, or >> work with these guys (in the immediate future) on the DMA infinite >> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig >> here, but can we get a decisive plan for going forward with this issue >> if it is going to continue to block SDHCI. >> >> Thanks to Igor for identifying the overlap between these discussions. > > Hi, > > A couple of dumb questions. > > What is the reason for the blocker? that possible guest dos is worse > than no functionality at all? > > Can't you put the DMA/IO processing into the background? I don't know a way to do asynchronous DMA unless I am missing something here. So what happens is we have a device that walks an SG list synchronously while holding all the locks and stuff being discussed here. If that SG list infinite loops, QEMU locks up. > > what's the diff between setup of bad descriptors and writing a > while (1) loop in the guest CPU? > While(1) in the guest doesn't freeze all of QEMU (the monitor still works and stuff), whereas the DMA ops going in circles do, as locks are held inside an infinite loop. Regards, peter > Cheers, > E
On 09/19/2012 07:40 AM, Peter Crosthwaite wrote: > On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias > <edgar.iglesias@gmail.com> wrote: >> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote: >>> Ping for PMM, >>> >>> This is the root case of your block on the SDHCI series - this is a >>> discussion on resolution to bogus infinite looping DMA. For current >>> participants in this discussion, heres our thread on the same topic >>> over in SD land: >>> >>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html >>> >>> With the findings here and acknowledgement that this is a general >>> problem, we either need to declare this issue of scope for SDHCI, or >>> work with these guys (in the immediate future) on the DMA infinite >>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig >>> here, but can we get a decisive plan for going forward with this issue >>> if it is going to continue to block SDHCI. >>> >>> Thanks to Igor for identifying the overlap between these discussions. >> >> Hi, >> >> A couple of dumb questions. >> >> What is the reason for the blocker? that possible guest dos is worse >> than no functionality at all? >> >> Can't you put the DMA/IO processing into the background? > > I dont know a way do do asynchronous DMA unless I am missing something > here. You could schedule a bottom half and do the accesses from there. Solves nothing though. > So what happens is we have a device that walks a SG list > synchronously while holding all the locks and stuff being discussed > here. If that SG list infinite loops then crash. Did you mean loop, or recursion into the memory region that initiates DMA? As this problem already exists I don't think new device models need to block on it. Maybe a simple fix is:

void my_device_write(...)
{
    if (s->dma_in_progress) {
        return;
    }
    ...
    s->dma_in_progress = true;
    cpu_physical_memory_rw(...);
    s->dma_in_progress = false;
}

and we could try to figure out how to move this into common code.
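A self-contained version of that guard (a hypothetical device model, not real QEMU code) shows how the flag turns unbounded DMA recursion into a single dropped access:

```c
#include <stdbool.h>

/* Hypothetical device whose MMIO write handler performs DMA that can
 * land back on its own MMIO region (the nested-dispatch case above). */
typedef struct {
    bool dma_in_progress;
    int writes_handled;
} MyDevice;

static void my_device_write(MyDevice *s);

/* Stand-in for cpu_physical_memory_rw(): here the "DMA" always targets
 * the device's own MMIO region, i.e. a worst-case self-referencing
 * SG list. */
static void dma_to_own_mmio(MyDevice *s)
{
    my_device_write(s);
}

static void my_device_write(MyDevice *s)
{
    if (s->dma_in_progress) {
        return;                 /* recursion detected: drop the access */
    }
    s->writes_handled++;
    s->dma_in_progress = true;
    dma_to_own_mmio(s);         /* would recurse forever without the flag */
    s->dma_in_progress = false;
}
```

The recursive inner access simply returns, so only the outermost write is handled and the flag is clear again afterwards.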
On 2012-09-19 06:40, Peter Crosthwaite wrote: > On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias > <edgar.iglesias@gmail.com> wrote: >> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote: >>> Ping for PMM, >>> >>> This is the root case of your block on the SDHCI series - this is a >>> discussion on resolution to bogus infinite looping DMA. For current >>> participants in this discussion, heres our thread on the same topic >>> over in SD land: >>> >>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html >>> >>> With the findings here and acknowledgement that this is a general >>> problem, we either need to declare this issue of scope for SDHCI, or >>> work with these guys (in the immediate future) on the DMA infinite >>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig >>> here, but can we get a decisive plan for going forward with this issue >>> if it is going to continue to block SDHCI. >>> >>> Thanks to Igor for identifying the overlap between these discussions. >> >> Hi, >> >> A couple of dumb questions. >> >> What is the reason for the blocker? that possible guest dos is worse >> than no functionality at all? >> >> Can't you put the DMA/IO processing into the background? > > I dont know a way do do asynchronous DMA unless I am missing something > here. So what happens is we have a device that walks a SG list > synchronously while holding all the locks and stuff being discussed > here. If that SG list infinite loops then crash. We could switch to async DMA, but most DMA-capable devices aren't prepared for rescheduling points in the middle of their code. This will require quite a bit of code review. > >> >> what's the diff between setup of bad descriptors and writing a >> while (1) loop in the guest CPU? >> > > While(1) in guest doesnt freeze all of QEMU (monitor still works and > stuff), wheras the DMA ops going in circles does, as locks are taken > by an infinite loop. That's the point. 
If we loop, we need at least a way to stop it by issuing a device or system reset. But I tend to think that detecting loops and failing/ignoring requests is easier to implement. Jan
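Jan's detect-and-ignore approach can be sketched with a simple nesting counter in the dispatcher; all names here are hypothetical and the bound is an assumption:

```c
#include <stdio.h>

#define MAX_DMA_NESTING 8   /* assumed bound; any small limit works */

static int dispatch_depth;
static int dropped_requests;

/* Hypothetical memory-dispatch wrapper: count nesting and ignore (but
 * log) requests past the bound, so a self-referencing DMA chain cannot
 * lock up QEMU even though the guest's request is performed
 * incorrectly, as real hardware might. */
void dispatch_access(void (*handler)(void))
{
    if (dispatch_depth >= MAX_DMA_NESTING) {
        dropped_requests++;
        fprintf(stderr, "dma nesting bound hit, dropping access\n");
        return;
    }
    dispatch_depth++;
    handler();
    dispatch_depth--;
}

/* A handler that re-enters the dispatcher, as nested DMA would. */
void reentrant_handler(void)
{
    dispatch_access(reentrant_handler);
}
```

The recursion bottoms out at the configured depth and unwinds cleanly instead of deadlocking or overflowing the stack.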
On 09/17/2012 05:24 AM, liu ping fan wrote: > On Thu, Sep 13, 2012 at 4:19 PM, Avi Kivity <avi@redhat.com> wrote: >> On 09/13/2012 09:55 AM, liu ping fan wrote: >>> On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity <avi@redhat.com> wrote: >>>> On 09/11/2012 03:24 PM, Avi Kivity wrote: >>>>> On 09/11/2012 12:57 PM, Jan Kiszka wrote: >>>>>> On 2012-09-11 11:44, liu ping fan wrote: >>>>>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote: >>>>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote: >>>>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com> >>>>>>>>> >>>>>>>>> The func call chain can suffer from recursively hold >>>>>>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the >>>>>>>>> lock depth. >>>>>>>> >>>>>>>> What is the root cause? io handlers initiating I/O? >>>>>>>> >>>>>>> cpu_physical_memory_rw() can be called nested, and when called, it can >>>>>>> be protected by no-lock/device lock/ big-lock. >>>>>>> I think without big lock, io-dispatcher should face the same issue. >>>>>>> As to main-loop, have not carefully consider, but at least, dma-helper >>>>>>> will call cpu_physical_memory_rw() with big-lock. >>>>>> >>>>>> That is our core problem: inconsistent invocation of existing services >>>>>> /wrt locking. For portio, I was lucky that there is no nesting and I was >>>>>> able to drop the big lock around all (x86) call sites. But MMIO is way >>>>>> more tricky due to DMA nesting. >>>>> >>>>> Maybe we need to switch to a continuation style. Instead of expecting >>>>> cpu_physical_memory_rw() to complete synchronously, it becomes an >>>>> asynchronous call and you provide it with a completion. That means >>>>> devices which use it are forced to drop the lock in between. Block and >>>>> network clients will be easy to convert since they already use APIs that >>>>> drop the lock (except for accessing the descriptors). 
>>>>> >>>>>> We could try to introduce a different version of cpu_physical_memory_rw, >>>>>> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO >>>>>> request can trigger the very same access in a nested fashion, and we >>>>>> will have to detect this to avoid locking up QEMU (locking up the guest >>>>>> might be OK). >>>>> >>>>> An async version of c_p_m_rw() will just cause a completion to bounce >>>>> around, consuming cpu but not deadlocking anything. If we can keep a >>>>> count of the bounces, we might be able to stall it indefinitely or at >>>>> least ratelimit it. >>>>> >>>> >>>> Another option is to require all users of c_p_m_rw() and related to use >>>> a coroutine or thread. That makes the programming easier (but still >>>> required a revalidation after the dropped lock). >>>> >>> For the nested cpu_physical_memory_rw(), we change it internal but >>> keep it sync API as it is. (Wrapper current cpu_physical_memory_rw() >>> into cpu_physical_memory_rw_internal() ) >>> >>> >>> LOCK() // can be device or big lock or both, depend on caller. >>> .............. >>> cpu_physical_memory_rw() >>> { >>> UNLOCK() //unlock all the locks >>> queue_work_on_thread(cpu_physical_memory_rw_internal, completion); >>> // cpu_physical_memory_rw_internal can take lock(device,biglock) again >>> wait_for_completion(completion) >>> LOCK() >>> } >>> .................. >>> UNLOCK() >>> >> >> This is dangerous. The caller expects to hold the lock across the call, >> and will not re-validate its state. >> >>> Although cpu_physical_memory_rw_internal() can be freed to use lock, >>> but with precondition. -- We still need to trace lock stack taken by >>> cpu_physical_memory_rw(), so that it can return to caller correctly. >>> Is that OK? >> >> I'm convinced that we need a recursive lock if we don't convert >> everything at once. 
>> >> So how about: >> >> - make bql a recursive lock (use pthreads, don't invent our own) >> - change c_p_m_rw() calling convention to "caller may hold the BQL, but >> no device lock" >> >> this allows devices to DMA each other. Unconverted device models will >> work as they used to. Converted device models will need to drop the >> device lock, re-acquire it, and revalidate any state. That will cause > > I think we are cornered by devices to DMA each other, and it raises > the unavoidable nested lock. Also to avoid deadlock caused by device's > lock sequence, we should drop the current device lock to acquire > another one. The only little diverge is about the "revalidate", do > we need to rollback? It basically means you can't hold contents of device state in local variables. You need to read everything again from the device. That includes things like DMA enable bits. (btw: do we respect PCI_COMMAND_MASTER? it seems that we do not. This is a trivial example of an iommu, we should get that going). > Or for converted device, we can just tag it a > busy flag, that is check&set busy flag at the entry of device's > mmio-dispatch. So when re-acquire device's lock, the device's state > is intact. The state can be changed by a parallel access to another register, which is valid. > > Has anybody other suggestion?
On Wed, Sep 19, 2012 at 4:01 PM, Avi Kivity <avi@redhat.com> wrote: > On 09/17/2012 05:24 AM, liu ping fan wrote: >> On Thu, Sep 13, 2012 at 4:19 PM, Avi Kivity <avi@redhat.com> wrote: >>> On 09/13/2012 09:55 AM, liu ping fan wrote: >>>> On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity <avi@redhat.com> wrote: >>>>> On 09/11/2012 03:24 PM, Avi Kivity wrote: >>>>>> On 09/11/2012 12:57 PM, Jan Kiszka wrote: >>>>>>> On 2012-09-11 11:44, liu ping fan wrote: >>>>>>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote: >>>>>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote: >>>>>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com> >>>>>>>>>> >>>>>>>>>> The func call chain can suffer from recursively hold >>>>>>>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the >>>>>>>>>> lock depth. >>>>>>>>> >>>>>>>>> What is the root cause? io handlers initiating I/O? >>>>>>>>> >>>>>>>> cpu_physical_memory_rw() can be called nested, and when called, it can >>>>>>>> be protected by no-lock/device lock/ big-lock. >>>>>>>> I think without big lock, io-dispatcher should face the same issue. >>>>>>>> As to main-loop, have not carefully consider, but at least, dma-helper >>>>>>>> will call cpu_physical_memory_rw() with big-lock. >>>>>>> >>>>>>> That is our core problem: inconsistent invocation of existing services >>>>>>> /wrt locking. For portio, I was lucky that there is no nesting and I was >>>>>>> able to drop the big lock around all (x86) call sites. But MMIO is way >>>>>>> more tricky due to DMA nesting. >>>>>> >>>>>> Maybe we need to switch to a continuation style. Instead of expecting >>>>>> cpu_physical_memory_rw() to complete synchronously, it becomes an >>>>>> asynchronous call and you provide it with a completion. That means >>>>>> devices which use it are forced to drop the lock in between. Block and >>>>>> network clients will be easy to convert since they already use APIs that >>>>>> drop the lock (except for accessing the descriptors). 
>>>>>> >>>>>>> We could try to introduce a different version of cpu_physical_memory_rw, >>>>>>> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO >>>>>>> request can trigger the very same access in a nested fashion, and we >>>>>>> will have to detect this to avoid locking up QEMU (locking up the guest >>>>>>> might be OK). >>>>>> >>>>>> An async version of c_p_m_rw() will just cause a completion to bounce >>>>>> around, consuming cpu but not deadlocking anything. If we can keep a >>>>>> count of the bounces, we might be able to stall it indefinitely or at >>>>>> least ratelimit it. >>>>>> >>>>> >>>>> Another option is to require all users of c_p_m_rw() and related to use >>>>> a coroutine or thread. That makes the programming easier (but still >>>>> required a revalidation after the dropped lock). >>>>> >>>> For the nested cpu_physical_memory_rw(), we change it internal but >>>> keep it sync API as it is. (Wrapper current cpu_physical_memory_rw() >>>> into cpu_physical_memory_rw_internal() ) >>>> >>>> >>>> LOCK() // can be device or big lock or both, depend on caller. >>>> .............. >>>> cpu_physical_memory_rw() >>>> { >>>> UNLOCK() //unlock all the locks >>>> queue_work_on_thread(cpu_physical_memory_rw_internal, completion); >>>> // cpu_physical_memory_rw_internal can take lock(device,biglock) again >>>> wait_for_completion(completion) >>>> LOCK() >>>> } >>>> .................. >>>> UNLOCK() >>>> >>> >>> This is dangerous. The caller expects to hold the lock across the call, >>> and will not re-validate its state. >>> >>>> Although cpu_physical_memory_rw_internal() can be freed to use lock, >>>> but with precondition. -- We still need to trace lock stack taken by >>>> cpu_physical_memory_rw(), so that it can return to caller correctly. >>>> Is that OK? >>> >>> I'm convinced that we need a recursive lock if we don't convert >>> everything at once. 
>>> >>> So how about: >>> >>> - make bql a recursive lock (use pthreads, don't invent our own) >>> - change c_p_m_rw() calling convention to "caller may hold the BQL, but >>> no device lock" >>> >>> this allows devices to DMA each other. Unconverted device models will >>> work as they used to. Converted device models will need to drop the >>> device lock, re-acquire it, and revalidate any state. That will cause >> >> I think we are cornered by devices to DMA each other, and it raises >> the unavoidable nested lock. Also to avoid deadlock caused by device's >> lock sequence, we should drop the current device lock to acquire >> another one. The only little diverge is about the "revalidate", do >> we need to rollback? > > It basically means you can't hold contents of device state in local > variables. You need to read everything again from the device. That > includes things like DMA enable bits. > I think that reading everything again from the device cannot work. Suppose the following scenario: if the device's state contains the change of a series of internal registers (supposing partA+partB; partC+partD), then after partA changed, the device's lock is broken. At this point, on another access to this device, it will work on partA+partB to determine C+D, but partB is not correctly updated yet, so C+D may be decided by a broken context and be wrong. > (btw: to we respect PCI_COMMAND_MASTER? it seems that we do not. This > is a trivial example of an iommu, we should get that going). > >> Or for converted device, we can just tag it a >> busy flag, that is check&set busy flag at the entry of device's >> mmio-dispatch. So when re-acquire device's lock, the device's state >> is intact. > > The state can be changed by a parallel access to another register, which > is valid. > Do you mean that the device can be accessed in parallel? But how? We use the device's lock.
What I suggest is:

lock();
set_and_test(dev->busy);
if busy, unlock and return;
change device registers;
do other things, including calling c_p_m_rw();  // here the lock is broken, but set_and_test() works
clear(dev->busy);
unlock();

So changing device registers is protected and unbreakable. Regards, pingfan >> >> Has anybody other suggestion? > > -- > error compiling committee.c: too many arguments to function
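Under the assumption that set_and_test() maps to a C11 atomic test-and-set, the proposed busy flag might look like this (hypothetical device, not an actual QEMU model):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch of the check-and-set busy flag proposed above:
 * atomic_flag_test_and_set() plays the role of set_and_test(), so a
 * nested or concurrent mmio-dispatch entry sees busy and bails out,
 * keeping the first caller's register-update sequence intact. */
typedef struct {
    atomic_flag busy;
    int regs_updated;
} Dev;

/* Returns true if the access was handled, false if rejected as busy. */
bool mmio_dispatch(Dev *d)
{
    if (atomic_flag_test_and_set(&d->busy)) {
        return false;           /* already mid-update: reject the access */
    }
    d->regs_updated++;          /* change device registers */
    /* ... call c_p_m_rw() here; the device lock may be dropped, but the
     * busy flag keeps the register update unbreakable ... */
    atomic_flag_clear(&d->busy);
    return true;
}
```

As Avi notes in the follow-up, the cost of this scheme is that a legitimate parallel access to an unrelated register is also rejected.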
On 09/19/2012 11:36 AM, liu ping fan wrote: >> >> It basically means you can't hold contents of device state in local >> variables. You need to read everything again from the device. That >> includes things like DMA enable bits. >> > I think that read everything again from the device can not work. > Suppose the following scene: If the device's state contains the change > of a series of internal registers (supposing partA+partB; > partC+partD), after partA changed, the device's lock is broken. At > this point, another access to this device, it will work on partA+partB > to determine C+D, but since partB is not correctly updated yet. So C+D > may be decided by broken context and be wrong. That's the guest's problem. Note it could have happened even before the change, since the writes to A/B/C/D are unordered wrt the DMA. > >> (btw: to we respect PCI_COMMAND_MASTER? it seems that we do not. This >> is a trivial example of an iommu, we should get that going). >> >>> Or for converted device, we can just tag it a >>> busy flag, that is check&set busy flag at the entry of device's >>> mmio-dispatch. So when re-acquire device's lock, the device's state >>> is intact. >> >> The state can be changed by a parallel access to another register, which >> is valid. >> > Do you mean that the device can be accessed in parallel? But how? we > use device's lock. Some DMA is asynchronous already, like networking and IDE DMA. So we drop the big lock while doing it. With the change to drop device locks around c_p_m_rw(), we have that problem everywhere. > What my suggestion is: > lock(); > set_and_test(dev->busy); > if busy > unlock and return; > changing device registers; > do other things including calling to c_p_m_rw() //here,lock broken, > but set_and_test() works > clear(dev->busy); > unlock(); > > So changing device registers is protected, and unbreakable. But the changes may be legitimate. 
Maybe you're writing to a completely unrelated register from a different vcpu, and now that write is lost.
On Wed, Sep 19, 2012 at 10:55:30AM +0300, Avi Kivity wrote: > On 09/19/2012 07:40 AM, Peter Crosthwaite wrote: > > On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias > > <edgar.iglesias@gmail.com> wrote: > >> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote: > >>> Ping for PMM, > >>> > >>> This is the root case of your block on the SDHCI series - this is a > >>> discussion on resolution to bogus infinite looping DMA. For current > >>> participants in this discussion, heres our thread on the same topic > >>> over in SD land: > >>> > >>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html > >>> > >>> With the findings here and acknowledgement that this is a general > >>> problem, we either need to declare this issue of scope for SDHCI, or > >>> work with these guys (in the immediate future) on the DMA infinite > >>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig > >>> here, but can we get a decisive plan for going forward with this issue > >>> if it is going to continue to block SDHCI. > >>> > >>> Thanks to Igor for identifying the overlap between these discussions. > >> > >> Hi, > >> > >> A couple of dumb questions. > >> > >> What is the reason for the blocker? that possible guest dos is worse > >> than no functionality at all? > >> > >> Can't you put the DMA/IO processing into the background? > > > > I dont know a way do do asynchronous DMA unless I am missing something > > here. > > You could schedule a bottom half and do the accesses from there. Solves > nothing though. > > > So what happens is we have a device that walks a SG list > > synchronously while holding all the locks and stuff being discussed > > here. If that SG list infinite loops then crash. > > Did you mean loop, or recursion into the memory region that initiates DMA? I think we were discussing the loop issue (I hadn't thought of recursion issues before) :) Jan's point of resetability was interesting. 
Processing a finite number of DMA descriptors and scheduling BHs to continuously get back and continue processing should be OK, no? That would avoid the lockups and allow the device to be reset at any time. Or am I missing something? IIRC the etrax DMA does something like that, but only for receive DMA rings... Cheers, Edgar
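Edgar's batching scheme might look like the following sketch (hypothetical names; a real implementation would hook this into QEMU's bottom-half machinery): each invocation processes a bounded number of descriptors and returns, so even a circular descriptor ring leaves the main loop, monitor, and reset path responsive.

```c
#include <stdbool.h>

#define RING_SIZE 4
#define BATCH     8   /* assumed per-invocation bound */

/* Hypothetical DMA descriptor: next is the index of the following
 * descriptor, so a guest can trivially build an infinite chain. */
typedef struct {
    int next;
} Desc;

typedef struct {
    Desc ring[RING_SIZE];
    int cur;
    bool reset_requested;
    int processed;
} DmaCtrl;

/* One bottom-half invocation: process at most BATCH descriptors, then
 * return so the main loop (and a device reset) can run. Returns true
 * if the BH should be rescheduled to continue processing. */
bool dma_bh(DmaCtrl *c)
{
    int budget;
    for (budget = 0; budget < BATCH; budget++) {
        if (c->reset_requested) {
            return false;       /* reset wins: abandon the chain */
        }
        c->processed++;
        c->cur = c->ring[c->cur].next;  /* may loop forever in the ring */
    }
    return true;                /* budget exhausted: reschedule the BH */
}
```

A looping ring just keeps the BH rescheduling itself at bounded cost per pass, and a reset request stops it on the next invocation.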
On 09/19/2012 02:46 PM, Edgar E. Iglesias wrote: > On Wed, Sep 19, 2012 at 10:55:30AM +0300, Avi Kivity wrote: >> On 09/19/2012 07:40 AM, Peter Crosthwaite wrote: >> > On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias >> > <edgar.iglesias@gmail.com> wrote: >> >> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote: >> >>> Ping for PMM, >> >>> >> >>> This is the root case of your block on the SDHCI series - this is a >> >>> discussion on resolution to bogus infinite looping DMA. For current >> >>> participants in this discussion, heres our thread on the same topic >> >>> over in SD land: >> >>> >> >>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html >> >>> >> >>> With the findings here and acknowledgement that this is a general >> >>> problem, we either need to declare this issue of scope for SDHCI, or >> >>> work with these guys (in the immediate future) on the DMA infinite >> >>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig >> >>> here, but can we get a decisive plan for going forward with this issue >> >>> if it is going to continue to block SDHCI. >> >>> >> >>> Thanks to Igor for identifying the overlap between these discussions. >> >> >> >> Hi, >> >> >> >> A couple of dumb questions. >> >> >> >> What is the reason for the blocker? that possible guest dos is worse >> >> than no functionality at all? >> >> >> >> Can't you put the DMA/IO processing into the background? >> > >> > I dont know a way do do asynchronous DMA unless I am missing something >> > here. >> >> You could schedule a bottom half and do the accesses from there. Solves >> nothing though. >> >> > So what happens is we have a device that walks a SG list >> > synchronously while holding all the locks and stuff being discussed >> > here. If that SG list infinite loops then crash. >> >> Did you mean loop, or recursion into the memory region that initiates DMA? 
> > I think we were discussing the loop issue (I hadn't thought of recursion > issues before) :) > > Jan's point of resetability was interesting. > > Processing a finite amount of dma descriptors and scheduling bh's > to continously get back and continue processing should be OK no? > Yes. > That would avoid the lockups and allow the device to be reset at > any time. Or am I missing something? > Not that I can see. If real hardware can be looped, so can QEMU. I'm only worried about recursion and deadlocks (while real hardware can deadlock, we'd prefer to avoid that).
On Wed, Sep 19, 2012 at 03:12:12PM +0300, Avi Kivity wrote: > On 09/19/2012 02:46 PM, Edgar E. Iglesias wrote: > > On Wed, Sep 19, 2012 at 10:55:30AM +0300, Avi Kivity wrote: > >> On 09/19/2012 07:40 AM, Peter Crosthwaite wrote: > >> > On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias > >> > <edgar.iglesias@gmail.com> wrote: > >> >> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote: > >> >>> Ping for PMM, > >> >>> > >> >>> This is the root case of your block on the SDHCI series - this is a > >> >>> discussion on resolution to bogus infinite looping DMA. For current > >> >>> participants in this discussion, heres our thread on the same topic > >> >>> over in SD land: > >> >>> > >> >>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html > >> >>> > >> >>> With the findings here and acknowledgement that this is a general > >> >>> problem, we either need to declare this issue of scope for SDHCI, or > >> >>> work with these guys (in the immediate future) on the DMA infinite > >> >>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig > >> >>> here, but can we get a decisive plan for going forward with this issue > >> >>> if it is going to continue to block SDHCI. > >> >>> > >> >>> Thanks to Igor for identifying the overlap between these discussions. > >> >> > >> >> Hi, > >> >> > >> >> A couple of dumb questions. > >> >> > >> >> What is the reason for the blocker? that possible guest dos is worse > >> >> than no functionality at all? > >> >> > >> >> Can't you put the DMA/IO processing into the background? > >> > > >> > I dont know a way do do asynchronous DMA unless I am missing something > >> > here. > >> > >> You could schedule a bottom half and do the accesses from there. Solves > >> nothing though. > >> > >> > So what happens is we have a device that walks a SG list > >> > synchronously while holding all the locks and stuff being discussed > >> > here. If that SG list infinite loops then crash. 
> >> > >> Did you mean loop, or recursion into the memory region that initiates DMA? > > > > I think we were discussing the loop issue (I hadn't thought of recursion > > issues before) :) > > > > Jan's point of resetability was interesting. > > > > Processing a finite amount of dma descriptors and scheduling bh's > > to continously get back and continue processing should be OK no? > > > > Yes. > > > That would avoid the lockups and allow the device to be reset at > > any time. Or am I missing something? > > > > Not that I can see. If real hardware can be looped, so can qemu. I'm > only worried about recursion and deadlocks (while real hardware can > deadlock, we'd prefer to avoid that). Agreed, thanks
On 09/19/2012 04:12 PM, Avi Kivity wrote: > On 09/19/2012 02:46 PM, Edgar E. Iglesias wrote: >> On Wed, Sep 19, 2012 at 10:55:30AM +0300, Avi Kivity wrote: >>> On 09/19/2012 07:40 AM, Peter Crosthwaite wrote: >>>> On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias >>>> <edgar.iglesias@gmail.com> wrote: >>>>> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote: >>>>>> Ping for PMM, >>>>>> >>>>>> This is the root case of your block on the SDHCI series - this is a >>>>>> discussion on resolution to bogus infinite looping DMA. For current >>>>>> participants in this discussion, heres our thread on the same topic >>>>>> over in SD land: >>>>>> >>>>>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html >>>>>> >>>>>> With the findings here and acknowledgement that this is a general >>>>>> problem, we either need to declare this issue of scope for SDHCI, or >>>>>> work with these guys (in the immediate future) on the DMA infinite >>>>>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig >>>>>> here, but can we get a decisive plan for going forward with this issue >>>>>> if it is going to continue to block SDHCI. >>>>>> >>>>>> Thanks to Igor for identifying the overlap between these discussions. >>>>> Hi, >>>>> >>>>> A couple of dumb questions. >>>>> >>>>> What is the reason for the blocker? that possible guest dos is worse >>>>> than no functionality at all? >>>>> >>>>> Can't you put the DMA/IO processing into the background? >>>> I dont know a way do do asynchronous DMA unless I am missing something >>>> here. >>> You could schedule a bottom half and do the accesses from there. Solves >>> nothing though. >>> >>>> So what happens is we have a device that walks a SG list >>>> synchronously while holding all the locks and stuff being discussed >>>> here. If that SG list infinite loops then crash. >>> Did you mean loop, or recursion into the memory region that initiates DMA? 
>> I think we were discussing the loop issue (I hadn't thought of recursion >> issues before) :) >> >> Jan's point of resetability was interesting. >> >> Processing a finite amount of dma descriptors and scheduling bh's >> to continously get back and continue processing should be OK no? >> > Yes. The bottom half idea sounds good. Also, for this particular device, rather than limiting the amount of processed descriptor entries in the bottom half, it seems reasonable to interrupt the bh handler if we encounter a LINK descriptor, since only recursive LINKs can generate infinite loops. In that case, a very long descriptor table without a single LINK entry could potentially hang QEMU for a long time, but that time will be finite anyway. Long term, this problem will disappear when/if someone converts the QEMU SD card model to use async I/O. >> That would avoid the lockups and allow the device to be reset at >> any time. Or am I missing something? >> > Not that I can see. If real hardware can be looped, so can qemu. I'm > only worried about recursion and deadlocks (while real hardware can > deadlock, we'd prefer to avoid that). > > So, I think the idea here is that if real hardware can be locked we should lock too, but provide a guest CPU a possibility to abort locked operation. For this particular example, SD card controller can deadlock on recursive descriptor LINK entries, but CPU can abort ongoing transaction at any time by issuing ABORT command. And if guest CPU busy-waits in infinite while() loop for a TRANSFER OVER flag to be set, then it had it coming.
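Igor's variant — only yield to a bottom half when a LINK descriptor is seen, since only LINKs can close a cycle — might look like this. The descriptor layout is deliberately simplified and is not the real SDHCI ADMA2 format:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified ADMA-like descriptor: transfer, link back into the table,
 * or end of chain.  Illustrative only. */
enum { DESC_TRAN, DESC_LINK, DESC_END };

typedef struct AdmaDesc {
    int      act;   /* DESC_TRAN, DESC_LINK or DESC_END */
    uint32_t addr;  /* next descriptor index, used by LINK entries */
} AdmaDesc;

/* Walk descriptors until END or a LINK.  Returns the index to resume
 * from (the caller would reschedule a bottom half), or -1 when done. */
static int adma_run(const AdmaDesc *tbl, int idx)
{
    for (;;) {
        const AdmaDesc *d = &tbl[idx];
        switch (d->act) {
        case DESC_TRAN:
            /* ... do the data transfer ... */
            idx++;
            break;
        case DESC_LINK:
            /* Potential cycle: stop here, keep the monitor responsive
             * and the device resettable, continue from d->addr later. */
            return (int)d->addr;
        default: /* DESC_END */
            return -1;
        }
    }
}
```

A LINK-free table still runs to completion in one call (Igor's "finite anyway" case); a recursive LINK merely bounces control back to the scheduler on every pass instead of spinning forever.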
On 09/19/2012 04:01 PM, Igor Mitsyanko wrote: >>> That would avoid the lockups and allow the device to be reset at >>> any time. Or am I missing something? >>> >> Not that I can see. If real hardware can be looped, so can qemu. I'm >> only worried about recursion and deadlocks (while real hardware can >> deadlock, we'd prefer to avoid that). >> >> > > So, I think the idea here is that if real hardware can be locked we > should lock too, but provide a guest CPU a possibility to abort locked > operation. For this particular example, SD card controller can deadlock > on recursive descriptor LINK entries, but CPU can abort ongoing > transaction at any time by issuing ABORT command. And if guest CPU > busy-waits in infinite while() loop for a TRANSFER OVER flag to be set, > then it had it coming. If real hardware would lock up, so should we (in the guest's eyes), but the qemu monitor should remain responsive: it is not part of the emulated hardware.
On 09/19/2012 11:57 AM, Jan Kiszka wrote: > On 2012-09-19 06:40, Peter Crosthwaite wrote: >> On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias >> <edgar.iglesias@gmail.com> wrote: >>> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote: >>>> Ping for PMM, >>>> >>>> This is the root case of your block on the SDHCI series - this is a >>>> discussion on resolution to bogus infinite looping DMA. For current >>>> participants in this discussion, heres our thread on the same topic >>>> over in SD land: >>>> >>>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html >>>> >>>> With the findings here and acknowledgement that this is a general >>>> problem, we either need to declare this issue of scope for SDHCI, or >>>> work with these guys (in the immediate future) on the DMA infinite >>>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig >>>> here, but can we get a decisive plan for going forward with this issue >>>> if it is going to continue to block SDHCI. >>>> >>>> Thanks to Igor for identifying the overlap between these discussions. >>> Hi, >>> >>> A couple of dumb questions. >>> >>> What is the reason for the blocker? that possible guest dos is worse >>> than no functionality at all? >>> >>> Can't you put the DMA/IO processing into the background? >> I dont know a way do do asynchronous DMA unless I am missing something >> here. So what happens is we have a device that walks a SG list >> synchronously while holding all the locks and stuff being discussed >> here. If that SG list infinite loops then crash. > We could switch to async DMA, but most DMA-capable devices aren't > prepared for rescheduling points in the middle of their code. This will > require quite a bit of code review. From what I understand, we can't switch to async DMA while the SD card model doesn't support it, and introducing this support would require converting every existing SD card user in QEMU. 
I can recall that such a conversion was already suggested on the mailing list a while ago, but no one (including me) volunteered to do it :) >>> what's the diff between setup of bad descriptors and writing a >>> while (1) loop in the guest CPU? >>> >> While(1) in guest doesnt freeze all of QEMU (monitor still works and >> stuff), wheras the DMA ops going in circles does, as locks are taken >> by an infinite loop. > That's the point. If we loop, we need at least a way to stop it by > issuing a device or system reset. But I tend to think that detecting > loops and failing/ignoring requests is easier implementable. > > Jan >
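Jan's alternative — detect loops and fail/ignore the request rather than make it interruptible — can be approximated by bounding how many descriptors one request may touch. A sketch (the limit value and error convention are illustrative assumptions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Treat any chain longer than this as malformed/looping.  Real hardware
 * tables are finite, so a generous bound only breaks hostile guests. */
#define DESC_HARD_LIMIT 1024

typedef struct SgDesc {
    uint32_t next;
    bool     last;
} SgDesc;

/* Returns 0 on completion, -1 if the chain looks like an infinite loop
 * (a device would typically latch an error status and stop the DMA). */
static int sg_walk_bounded(const SgDesc *tbl, uint32_t idx)
{
    for (unsigned n = 0; n < DESC_HARD_LIMIT; n++) {
        const SgDesc *d = &tbl[idx];
        /* ... transfer ... */
        if (d->last) {
            return 0;
        }
        idx = d->next;
    }
    return -1;   /* bail out instead of holding the big lock forever */
}
```

Unlike the bottom-half approach, this keeps the walk synchronous, so it is the easier retrofit Jan alludes to, at the cost of diverging from hardware that would genuinely spin.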
On Wed, Sep 19, 2012 at 5:05 PM, Avi Kivity <avi@redhat.com> wrote: > On 09/19/2012 11:36 AM, liu ping fan wrote: >>> >>> It basically means you can't hold contents of device state in local >>> variables. You need to read everything again from the device. That >>> includes things like DMA enable bits. >>> >> I think that read everything again from the device can not work. >> Suppose the following scene: If the device's state contains the change >> of a series of internal registers (supposing partA+partB; >> partC+partD), after partA changed, the device's lock is broken. At >> this point, another access to this device, it will work on partA+partB >> to determine C+D, but since partB is not correctly updated yet. So C+D >> may be decided by broken context and be wrong. > > That's the guest's problem. Note it could have happened even before the > change, since the writes to A/B/C/D are unordered wrt the DMA. > Yes, agree, it is the guest's problem. So it means that ready_of(A+B) is not signaled to guest, the guest should not launch operations on (C+D). Right? But here comes the question, if ready not signaled to guest, how can guest launch operation on (A+B) again? i.e. although local lock is broken, the (A+B) is still intact when re-acquire local lock. So need not to read everything again from the device. Wrong? >> >>> (btw: to we respect PCI_COMMAND_MASTER? it seems that we do not. This >>> is a trivial example of an iommu, we should get that going). >>> >>>> Or for converted device, we can just tag it a >>>> busy flag, that is check&set busy flag at the entry of device's >>>> mmio-dispatch. So when re-acquire device's lock, the device's state >>>> is intact. >>> >>> The state can be changed by a parallel access to another register, which >>> is valid. >>> >> Do you mean that the device can be accessed in parallel? But how? we >> use device's lock. > > Some DMA is asynchronous already, like networking and IDE DMA. So we > drop the big lock while doing it. 
With the change to drop device locks > around c_p_m_rw(), we have that problem everywhere. > >> What my suggestion is: >> lock(); >> set_and_test(dev->busy); >> if busy >> unlock and return; >> changing device registers; >> do other things including calling to c_p_m_rw() //here,lock broken, >> but set_and_test() works >> clear(dev->busy); >> unlock(); >> >> So changing device registers is protected, and unbreakable. > > But the changes may be legitimate. Maybe you're writing to a completely > unrelated register, from a different vcpu, now that write is lost. > But I think it will mean more-fine locks for each groups of unrelated register, and accordingly, the busy should be bitmap for each group. Thanks and regards, pingfan > > -- > error compiling committee.c: too many arguments to function
On 09/20/2012 10:51 AM, liu ping fan wrote: > On Wed, Sep 19, 2012 at 5:05 PM, Avi Kivity <avi@redhat.com> wrote: >> On 09/19/2012 11:36 AM, liu ping fan wrote: >>>> >>>> It basically means you can't hold contents of device state in local >>>> variables. You need to read everything again from the device. That >>>> includes things like DMA enable bits. >>>> >>> I think that read everything again from the device can not work. >>> Suppose the following scene: If the device's state contains the change >>> of a series of internal registers (supposing partA+partB; >>> partC+partD), after partA changed, the device's lock is broken. At >>> this point, another access to this device, it will work on partA+partB >>> to determine C+D, but since partB is not correctly updated yet. So C+D >>> may be decided by broken context and be wrong. >> >> That's the guest's problem. Note it could have happened even before the >> change, since the writes to A/B/C/D are unordered wrt the DMA. >> > Yes, agree, it is the guest's problem. So it means that ready_of(A+B) > is not signaled to guest, the guest should not launch operations on > (C+D). Right? But here comes the question, if ready not signaled to > guest, how can guest launch operation on (A+B) again? It may be evil. > i.e. although local lock is broken, the (A+B) is still intact when > re-acquire local lock. So need not to read everything again from the > device. Wrong? The device needs to perform according to its specifications. If the specifications allow for this kind of access, we must ensure it works. If they don't, we must ensure something sane happens, instead of a qemu crash or exploit. This means that anything dangerous like pointers must be revalidated. To be on the safe side, I recommend revalidating (or reloading) everything, but it may not be necessary in all cases. 
>>> What my suggestion is: >>> lock(); >>> set_and_test(dev->busy); >>> if busy >>> unlock and return; >>> changing device registers; >>> do other things including calling to c_p_m_rw() //here,lock broken, >>> but set_and_test() works >>> clear(dev->busy); >>> unlock(); >>> >>> So changing device registers is protected, and unbreakable. >> >> But the changes may be legitimate. Maybe you're writing to a completely >> unrelated register, from a different vcpu, now that write is lost. >> > But I think it will mean more-fine locks for each groups of unrelated > register, and accordingly, the busy should be bitmap for each group. It's possible. Let's defer the discussion until a concrete case is before us. It may be that different devices will want different solutions (though there is value in applying one solution everywhere).
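Avi's rule — after the lock is dropped around `cpu_physical_memory_rw()`, revalidate or reload anything dangerous from device state — can be illustrated with a toy device. The names and the function-pointer hook standing in for a concurrent vCPU are assumptions for the sketch, not a real QEMU device:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct Dev {
    /* protected by the device lock */
    bool     dma_enabled;
    uint32_t ring_head;
} Dev;

/* Stands in for a concurrent guest access that lands while the device
 * lock is dropped around the memory access. */
static void (*while_unlocked)(Dev *);

/* One DMA step.  Returns the (re-read) ring head, or -1 if the guest
 * disabled DMA while the lock was dropped. */
static int dev_dma_step(Dev *d)
{
    uint32_t head = d->ring_head;   /* snapshot taken under the lock */
    (void)head;                     /* a pre-drop computation would use it */

    /* unlock(); cpu_physical_memory_rw(...); lock(); */
    if (while_unlocked) {
        while_unlocked(d);
    }

    /* Revalidate: the snapshot may be stale now. */
    if (!d->dma_enabled) {
        return -1;                  /* enable bit changed under us */
    }
    return (int)d->ring_head;       /* reload; don't trust 'head' */
}
```

The point is exactly Avi's: local copies of device state taken before the lock drop cannot be trusted afterwards, so pointers and enable bits get re-read, and the worst a racing guest can cause is its own confusion, not a qemu crash.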
On Thu, Sep 20, 2012 at 5:15 PM, Avi Kivity <avi@redhat.com> wrote: > On 09/20/2012 10:51 AM, liu ping fan wrote: >> On Wed, Sep 19, 2012 at 5:05 PM, Avi Kivity <avi@redhat.com> wrote: >>> On 09/19/2012 11:36 AM, liu ping fan wrote: >>>>> >>>>> It basically means you can't hold contents of device state in local >>>>> variables. You need to read everything again from the device. That >>>>> includes things like DMA enable bits. >>>>> >>>> I think that read everything again from the device can not work. >>>> Suppose the following scene: If the device's state contains the change >>>> of a series of internal registers (supposing partA+partB; >>>> partC+partD), after partA changed, the device's lock is broken. At >>>> this point, another access to this device, it will work on partA+partB >>>> to determine C+D, but since partB is not correctly updated yet. So C+D >>>> may be decided by broken context and be wrong. >>> >>> That's the guest's problem. Note it could have happened even before the >>> change, since the writes to A/B/C/D are unordered wrt the DMA. >>> >> Yes, agree, it is the guest's problem. So it means that ready_of(A+B) >> is not signaled to guest, the guest should not launch operations on >> (C+D). Right? But here comes the question, if ready not signaled to >> guest, how can guest launch operation on (A+B) again? > > It may be evil. > >> i.e. although local lock is broken, the (A+B) is still intact when >> re-acquire local lock. So need not to read everything again from the >> device. Wrong? > > The device needs to perform according to its specifications. If the > specifications allow for this kind of access, we must ensure it works. > If they don't, we must ensure something sane happens, instead of a qemu > crash or exploit. This means that anything dangerous like pointers must > be revalidated. To be on the safe side, I recommend revalidating (or > reloading) everything, but it may not be necessary in all cases. > Yeah, catch the two points exactly. 
>>>> What my suggestion is: >>>> lock(); >>>> set_and_test(dev->busy); >>>> if busy >>>> unlock and return; >>>> changing device registers; >>>> do other things including calling to c_p_m_rw() //here,lock broken, >>>> but set_and_test() works >>>> clear(dev->busy); >>>> unlock(); >>>> >>>> So changing device registers is protected, and unbreakable. >>> >>> But the changes may be legitimate. Maybe you're writing to a completely >>> unrelated register, from a different vcpu, now that write is lost. >>> >> But I think it will mean more-fine locks for each groups of unrelated >> register, and accordingly, the busy should be bitmap for each group. > > It's possible. Let's defer the discussion until a concrete case is > before us. It may be that different devices will want different > solutions (though there is value in applying one solution everywhere). > Okay. appreciate for the total detail explanation. Regards, pingfan > > -- > error compiling committee.c: too many arguments to function
diff --git a/cpus.c b/cpus.c
index 4cd7f85..09f6670 100644
--- a/cpus.c
+++ b/cpus.c
@@ -736,6 +736,8 @@ static void *qemu_kvm_cpu_thread_fn(void *arg)
     int r;
 
     pthread_setspecific(qemu_thread_key, cpu->thread);
+    cpu->thread->lockmap.biglock = 0;
+    qemu_big_lockmap_inc();
     qemu_mutex_lock(&qemu_global_mutex);
     qemu_thread_get_self(cpu->thread);
     env->thread_id = qemu_get_thread_id();
@@ -905,6 +907,14 @@ int qemu_cpu_is_self(void *_env)
 
 void qemu_mutex_lock_iothread(void)
 {
+    unsigned int map;
+
+    if (!qemu_thread_is_self(&io_thread)) {
+        map = qemu_big_lockmap_inc();
+        if (map > 1) {
+            return;
+        }
+    }
     if (!tcg_enabled()) {
         qemu_mutex_lock(&qemu_global_mutex);
     } else {
@@ -920,6 +930,14 @@ void qemu_mutex_lock_iothread(void)
 
 void qemu_mutex_unlock_iothread(void)
 {
+    unsigned int map;
+
+    if (!qemu_thread_is_self(&io_thread)) {
+        map = qemu_big_lockmap_dec();
+        if (map != 0) {
+            return;
+        }
+    }
     qemu_mutex_unlock(&qemu_global_mutex);
 }
 
diff --git a/qemu-thread-posix.c b/qemu-thread-posix.c
index f448fcb..1e07dc2 100644
--- a/qemu-thread-posix.c
+++ b/qemu-thread-posix.c
@@ -17,6 +17,7 @@
 #include <signal.h>
 #include <stdint.h>
 #include <string.h>
+#include <glib.h>
 #include "qemu-thread.h"
 
 pthread_key_t qemu_thread_key;
@@ -158,6 +159,28 @@ void qemu_thread_key_create(void)
     pthread_key_create(&qemu_thread_key, NULL);
 }
 
+int16_t qemu_big_lockmap_inc(void)
+{
+    QemuThread *t = pthread_getspecific(qemu_thread_key);
+
+    return ++t->lockmap.biglock;
+}
+
+int16_t qemu_big_lockmap_dec(void)
+{
+    QemuThread *t = pthread_getspecific(qemu_thread_key);
+    g_assert(t->lockmap.biglock > 0);
+
+    return --t->lockmap.biglock;
+}
+
+int16_t qemu_big_lockmap_get(void)
+{
+    QemuThread *t = pthread_getspecific(qemu_thread_key);
+
+    return t->lockmap.biglock;
+}
+
 bool qemu_thread_is_self(QemuThread *thread)
 {
     return pthread_equal(pthread_self(), thread->thread);
diff --git a/qemu-thread-posix.h b/qemu-thread-posix.h
index 2607b1c..8f9506b 100644
--- a/qemu-thread-posix.h
+++ b/qemu-thread-posix.h
@@ -10,8 +10,13 @@ struct QemuCond {
     pthread_cond_t cond;
 };
 
+typedef struct Lockmap {
+    int16_t biglock;
+} Lockmap;
+
 struct QemuThread {
     pthread_t thread;
+    Lockmap lockmap;
 };
 
 extern pthread_key_t qemu_thread_key;
diff --git a/qemu-thread.h b/qemu-thread.h
index 4a6427d..529850b 100644
--- a/qemu-thread.h
+++ b/qemu-thread.h
@@ -47,4 +47,7 @@ bool qemu_thread_is_self(QemuThread *thread);
 void qemu_thread_exit(void *retval);
 void qemu_thread_key_create(void);
 
+int16_t qemu_big_lockmap_inc(void);
+int16_t qemu_big_lockmap_dec(void);
+int16_t qemu_big_lockmap_get(void);
 #endif
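The heart of the patch — take the big lock only on a thread's first acquisition and release it only when the per-thread depth returns to zero — can be reproduced as a standalone sketch, using `__thread` storage in place of the per-QemuThread Lockmap and a plain pthread mutex in place of `qemu_global_mutex` (names are illustrative):

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread int16_t big_lock_depth;   /* the patch's lockmap.biglock */

/* Lock only on the 0 -> 1 transition, so nested calls cannot
 * self-deadlock (the problem qemu_mutex_lock_iothread recursion hits). */
static void big_lock_acquire(void)
{
    if (++big_lock_depth == 1) {
        pthread_mutex_lock(&big_lock);
    }
}

/* Unlock only on the 1 -> 0 transition. */
static void big_lock_release(void)
{
    assert(big_lock_depth > 0);           /* mirrors the patch's g_assert */
    if (--big_lock_depth == 0) {
        pthread_mutex_unlock(&big_lock);
    }
}

static int16_t big_lock_get_depth(void)
{
    return big_lock_depth;
}
```

This is functionally close to a `PTHREAD_MUTEX_RECURSIVE` mutex; the patch keeps its own counter presumably so the depth can be inspected (`qemu_big_lockmap_get`) and so the iothread, which holds the lock by convention, can skip the accounting.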