Inter-VM shared memory PCI device

Message ID 1267833161-25267-2-git-send-email-cam@cs.ualberta.ca
State New

Commit Message

Cam Macdonell March 5, 2010, 11:52 p.m. UTC
Support an inter-vm shared memory device that maps a shared-memory object
as a PCI device in the guest.  This patch also supports interrupts between
guests by communicating over a unix domain socket.  This patch applies to the
qemu-kvm repository.

This device now creates a qemu character device and sends 1-byte messages to
trigger interrupts.  Interrupts are triggered by writing to the "Doorbell" register
on the shared memory PCI device.  The lower 8-bits of the value written to this
register are sent as the 1-byte message, so different meanings of interrupts can
be supported.

Interrupts are supported between multiple VMs by using a shared memory server

-ivshmem <size in MB>,[unix:<path>][file]

Interrupts can also be used between the host and a guest by implementing a
listener on the host that talks to the shared memory server.  The shared memory
server passes file descriptors for the shared memory object and eventfds (our
interrupt mechanism) to the respective qemu instances.

Sample programs, init scripts and the shared memory server are available in a
git repo here:

www.gitorious.org/nahanni
---
 Makefile.target |    3 +
 hw/ivshmem.c    |  561 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/pc.c         |    6 +
 hw/pc.h         |    3 +
 qemu-char.c     |    6 +
 qemu-char.h     |    3 +
 qemu-options.hx |   12 ++
 sysemu.h        |    8 +
 vl.c            |   13 ++
 9 files changed, 615 insertions(+), 0 deletions(-)
 create mode 100644 hw/ivshmem.c

Comments

Paul Brook March 7, 2010, 10:53 p.m. UTC | #1
> Support an inter-vm shared memory device that maps a shared-memory object
> as a PCI device in the guest.  This patch also supports interrupts between
> guest by communicating over a unix domain socket.  This patch applies to
>  the qemu-kvm repository.

No. All new devices should be fully qdev based.

I suspect you've also ignored a load of coherency issues, especially when not 
using KVM. As soon as you have shared memory in more than one host 
thread/process you have to worry about memory barriers.

Paul
Jamie Lokier March 8, 2010, 1:45 a.m. UTC | #2
Paul Brook wrote:
> > Support an inter-vm shared memory device that maps a shared-memory object
> > as a PCI device in the guest.  This patch also supports interrupts between
> > guest by communicating over a unix domain socket.  This patch applies to
> >  the qemu-kvm repository.
> 
> No. All new devices should be fully qdev based.
> 
> I suspect you've also ignored a load of coherency issues, especially when not 
> using KVM. As soon as you have shared memory in more than one host 
> thread/process you have to worry about memory barriers.

Yes. Guest-observable behaviour is likely to be quite different on
different hosts, especially between x86 and non-x86 hosts, which is not
good at all for emulation.

Memory barriers performed by the guest would help, but would not
remove the fact that behaviour would vary between different host types
if a guest doesn't issue them.  I.e. you could accidentally have some
guests working fine for years on x86 hosts, which gain subtle
memory corruption as soon as you run them on a different host.

This is acceptable when recompiling code for different architectures,
but it's asking for trouble with binary guest images which aren't
supposed to depend on host architecture.
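
To make the concern concrete, here is a minimal sketch (not from the patch)
of the kind of explicit barriers a guest application would need when handing
data to another guest through the shared region; the struct layout and the
GCC builtin barrier are illustrative assumptions:

/* Illustrative only: producer/consumer handoff through the shared region
 * with explicit barriers, using the GCC full-barrier builtin. */
struct shared {
    int data;
    volatile int ready;
};

static void producer(struct shared *shm, int value)
{
    shm->data = value;
    __sync_synchronize();   /* make data visible before the flag */
    shm->ready = 1;
}

static int consumer(struct shared *shm)
{
    while (!shm->ready)
        ;                   /* poll the flag */
    __sync_synchronize();   /* don't read data before seeing the flag */
    return shm->data;
}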

However, coherence could be made host-type-independent by the host
mapping and unmapping pages, so that each page is only mapped into one
guest (or guest CPU) at a time.  Just like some clustering filesystems
do to maintain coherence.

-- Jamie
Alexander Graf March 8, 2010, 9:48 a.m. UTC | #3
On 08.03.2010 at 02:45, Jamie Lokier <jamie@shareable.org> wrote:

> Paul Brook wrote:
>>> Support an inter-vm shared memory device that maps a shared-memory  
>>> object
>>> as a PCI device in the guest.  This patch also supports interrupts  
>>> between
>>> guest by communicating over a unix domain socket.  This patch  
>>> applies to
>>> the qemu-kvm repository.
>>
>> No. All new devices should be fully qdev based.
>>
>> I suspect you've also ignored a load of coherency issues,  
>> especially when not
>> using KVM. As soon as you have shared memory in more than one host
>> thread/process you have to worry about memory barriers.
>
> Yes. Guest-observable behaviour is likely to be quite different on
> different hosts, expecially beteen x86 and non-x86 hosts, which is not
> good at all for emulation.
>
> Memory barriers performed by the guest would help, but would not
> remove the fact that behaviour would vary beteen different host types
> if a guest doesn't call them.  I.e. you could accidentally have some
> guests working fine for years on x86 hosts, which gain subtle
> memory corruption as soon as you run them on a different host.
>
> This is acceptable when recompiling code for different architectures,
> but it's asking for trouble with binary guest images which aren't
> supposed to depend on host architecture.
>
> However, coherence could be made host-type-independent by the host
> mapping and unampping pages, so that each page is only mapped into one
> guest (or guest CPU) at a time.  Just like some clustering filesystems
> do to maintain coherence.

Or we could put in some code that tells the guest the host shm  
architecture and only accept x86 on x86 for now. If anyone cares for  
other combinations, they're free to implement them.

Seriously, we're looking at an interface designed for kvm here. Let's  
please keep it as simple and fast as possible for the actual use case,  
not some theoretically possible ones.


Alex
Avi Kivity March 8, 2010, 9:52 a.m. UTC | #4
On 03/08/2010 12:53 AM, Paul Brook wrote:
>> Support an inter-vm shared memory device that maps a shared-memory object
>> as a PCI device in the guest.  This patch also supports interrupts between
>> guest by communicating over a unix domain socket.  This patch applies to
>>   the qemu-kvm repository.
>>      
> No. All new devices should be fully qdev based.
>
> I suspect you've also ignored a load of coherency issues, especially when not
> using KVM. As soon as you have shared memory in more than one host
> thread/process you have to worry about memory barriers.
>    

Shouldn't it be sufficient to require the guest to issue barriers (and 
to ensure tcg honours the barriers, if someone wants this with tcg)?
Jamie Lokier March 8, 2010, 9:54 a.m. UTC | #5
Alexander Graf wrote:
> Or we could put in some code that tells the guest the host shm  
> architecture and only accept x86 on x86 for now. If anyone cares for  
> other combinations, they're free to implement them.
> 
> Seriously, we're looking at an interface designed for kvm here. Let's  
> please keep it as simple and fast as possible for the actual use case,  
> not some theoretically possible ones.

The concern is that a perfectly working guest image running on kvm,
the guest being some OS or app that uses this facility (_not_ a
kvm-only guest driver), is later run on qemu on a different host, and
then mostly works except for some silent data corruption.

That is not a theoretical scenario.

Well, the bit with this driver is theoretical, obviously :-)
But not the bit about moving to a different host.

-- Jamie
Avi Kivity March 8, 2010, 9:56 a.m. UTC | #6
On 03/06/2010 01:52 AM, Cam Macdonell wrote:
> Support an inter-vm shared memory device that maps a shared-memory object
> as a PCI device in the guest.  This patch also supports interrupts between
> guest by communicating over a unix domain socket.  This patch applies to the
> qemu-kvm repository.
>
> This device now creates a qemu character device and sends 1-bytes messages to
> trigger interrupts.  Writes are trigger by writing to the "Doorbell" register
> on the shared memory PCI device.  The lower 8-bits of the value written to this
> register are sent as the 1-byte message so different meanings of interrupts can
> be supported.
>
> Interrupts are supported between multiple VMs by using a shared memory server
>
> -ivshmem<size in MB>,[unix:<path>][file]
>
> Interrupts can also be used between host and guest as well by implementing a
> listener on the host that talks to shared memory server.  The shared memory
> server passes file descriptors for the shared memory object and eventfds (our
> interrupt mechanism) to the respective qemu instances.
>
>    

Can you provide a spec that describes the device?  This would be useful 
for maintaining the code, writing guest drivers, and as a framework for 
review.
Alexander Graf March 8, 2010, 10:57 a.m. UTC | #7
Jamie Lokier wrote:
> Alexander Graf wrote:
>   
>> Or we could put in some code that tells the guest the host shm  
>> architecture and only accept x86 on x86 for now. If anyone cares for  
>> other combinations, they're free to implement them.
>>
>> Seriously, we're looking at an interface designed for kvm here. Let's  
>> please keep it as simple and fast as possible for the actual use case,  
>> not some theoretically possible ones.
>>     
>
> The concern is that a perfectly working guest image running on kvm,
> the guest being some OS or app that uses this facility (_not_ a
> kvm-only guest driver), is later run on qemu on a different host, and
> then mostly works except for some silent data corruption.
>
> That is not a theoretical scenario.
>
> Well, the bit with this driver is theoretical, obviously :-)
> But not the bit about moving to a different host.
>   

I agree. Hence there should be a safety check so people can't corrupt
their data silently.

Alex
Paul Brook March 8, 2010, 1:03 p.m. UTC | #8
> On 03/08/2010 12:53 AM, Paul Brook wrote:
> >> Support an inter-vm shared memory device that maps a shared-memory
> >> object as a PCI device in the guest.  This patch also supports
> >> interrupts between guest by communicating over a unix domain socket. 
> >> This patch applies to the qemu-kvm repository.
> >
> > No. All new devices should be fully qdev based.
> >
> > I suspect you've also ignored a load of coherency issues, especially when
> > not using KVM. As soon as you have shared memory in more than one host
> > thread/process you have to worry about memory barriers.
> 
> Shouldn't it be sufficient to require the guest to issue barriers (and
> to ensure tcg honours the barriers, if someone wants this with tcg)?.

In a cross environment that becomes extremely hairy.  For example the x86 
architecture effectively has an implicit write barrier before every store, and 
an implicit read barrier before every load.

Paul
Paul Brook March 8, 2010, 1:04 p.m. UTC | #9
> However, coherence could be made host-type-independent by the host
> mapping and unampping pages, so that each page is only mapped into one
> guest (or guest CPU) at a time.  Just like some clustering filesystems
> do to maintain coherence.

You're assuming that a TLB flush implies a write barrier, and a TLB miss 
implies a read barrier.  I'd be surprised if this were true in general.

Paul
Avi Kivity March 8, 2010, 1:16 p.m. UTC | #10
On 03/08/2010 03:03 PM, Paul Brook wrote:
>> On 03/08/2010 12:53 AM, Paul Brook wrote:
>>      
>>>> Support an inter-vm shared memory device that maps a shared-memory
>>>> object as a PCI device in the guest.  This patch also supports
>>>> interrupts between guest by communicating over a unix domain socket.
>>>> This patch applies to the qemu-kvm repository.
>>>>          
>>> No. All new devices should be fully qdev based.
>>>
>>> I suspect you've also ignored a load of coherency issues, especially when
>>> not using KVM. As soon as you have shared memory in more than one host
>>> thread/process you have to worry about memory barriers.
>>>        
>> Shouldn't it be sufficient to require the guest to issue barriers (and
>> to ensure tcg honours the barriers, if someone wants this with tcg)?.
>>      
> In a cross environment that becomes extremely hairy.  For example the x86
> architecture effectively has an implicit write barrier before every store, and
> an implicit read barrier before every load.
>    

Ah yes.  For cross tcg environments you can map the memory using mmio 
callbacks instead of directly, and issue the appropriate barriers there.
Cam Macdonell March 8, 2010, 5:57 p.m. UTC | #11
On Mon, Mar 8, 2010 at 2:56 AM, Avi Kivity <avi@redhat.com> wrote:
> On 03/06/2010 01:52 AM, Cam Macdonell wrote:
>>
>> Support an inter-vm shared memory device that maps a shared-memory object
>> as a PCI device in the guest.  This patch also supports interrupts between
>> guest by communicating over a unix domain socket.  This patch applies to
>> the
>> qemu-kvm repository.
>>
>> This device now creates a qemu character device and sends 1-bytes messages
>> to
>> trigger interrupts.  Writes are trigger by writing to the "Doorbell"
>> register
>> on the shared memory PCI device.  The lower 8-bits of the value written to
>> this
>> register are sent as the 1-byte message so different meanings of
>> interrupts can
>> be supported.
>>
>> Interrupts are supported between multiple VMs by using a shared memory
>> server
>>
>> -ivshmem<size in MB>,[unix:<path>][file]
>>
>> Interrupts can also be used between host and guest as well by implementing
>> a
>> listener on the host that talks to shared memory server.  The shared
>> memory
>> server passes file descriptors for the shared memory object and eventfds
>> (our
>> interrupt mechanism) to the respective qemu instances.
>>
>>
>
> Can you provide a spec that describes the device?  This would be useful for
> maintaining the code, writing guest drivers, and as a framework for review.

I'm not sure if you want the Qemu command-line part as part of the
spec here, but I've included it for completeness.

Device Specification for Inter-VM shared memory device
-----------------------------------------------------------------------------------

Qemu Command-line
-------------------------------

The command-line for inter-vm shared memory is as follows:

-ivshmem <size>,[unix:]name

The <size> argument specifies the size of the shared memory object.  The second
option specifies either a unix domain socket (when using the unix: prefix) or a
name for the shared memory object.

If a unix domain socket is specified, the guest will receive the shared memory
object from the shared memory server listening on that socket and will support
interrupts with the other guests using that server.  Each server serves only
one memory object.

If a name is specified on the command line (without 'unix:'), then the guest
will open the POSIX shared memory object with that name (in /dev/shm) and the
specified size.  The guest will NOT support interrupts but the shared memory
object can be shared between multiple guests.
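
For example, two hypothetical invocations (only the -ivshmem part is shown;
the size, socket path and object name are made up):

-ivshmem 256,unix:/tmp/ivshmem_socket    (via the shared memory server, interrupts supported)
-ivshmem 256,foo                         (maps /dev/shm/foo directly, no interrupts)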

The Inter-VM Shared Memory PCI device
-----------------------------------------------------------

BARs

The device supports two BARs.  BAR0 is a 256-byte MMIO region that holds the
device registers, and BAR1 is used to map the shared memory object from the
host.  The size of BAR1 is specified on the command-line and must be a power
of 2.

Registers

BAR0 currently supports 5 registers of 16-bits each.  Registers are used
for synchronization between guests sharing the same memory object when
interrupts are supported (this requires using the shared memory server).

When using interrupts, VMs communicate with a shared memory server that passes
the shared memory object file descriptor using SCM_RIGHTS.  The server assigns
each VM an ID number and sends this ID number to the Qemu process along with a
series of eventfd file descriptors, one per guest using the shared memory
server.  These eventfds will be used to send interrupts between guests.  Each
guest listens on the eventfd corresponding to its ID and may use the others
for sending interrupts to other guests.
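
For readers unfamiliar with SCM_RIGHTS, here is a rough sketch of how one end
might receive a descriptor over the unix domain socket; this is illustrative
only, not the server's or the patch's actual code:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Receive one file descriptor (the shm fd or an eventfd) plus one byte of
 * ordinary data (e.g. an ID) from a connected unix domain socket. */
static int recv_fd(int sock, char *id)
{
    struct iovec iov = { .iov_base = id, .iov_len = 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
    };
    struct cmsghdr *cmsg;
    int fd;

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
    return fd;
}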

enum ivshmem_registers {
    IntrMask = 0,
    IntrStatus = 2,
    Doorbell = 4,
    IVPosition = 6,
    IVLiveList = 8
};

The first two registers are the interrupt mask and status registers.
Interrupts are triggered when a message is received on the guest's eventfd from
another VM.  Writing to the 'Doorbell' register is how synchronization messages
are sent to other VMs.

The IVPosition register is read-only and reports the guest's ID number.  The
IVLiveList register is also read-only and reports a bit vector of currently
live VM IDs.

The Doorbell register is 16-bits, but is treated as two 8-bit values.  The
upper 8-bits are used for the destination VM ID.  The lower 8-bits are the
value which will be written to the destination VM and which the destination
guest's status register will be set to when the interrupt is triggered.  A
value of 255 in the upper 8-bits will trigger a broadcast where the message
will be sent to all other guests.
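
As an illustration of the register usage (not code from the patch), a guest
driver with BAR0 mapped at 'regs' could ring the doorbell as below; the
function name and the mapping are assumptions:

#include <stdint.h>

/* Send the 8-bit 'msg' to guest 'dest' (255 = broadcast) by writing the
 * 16-bit Doorbell register at offset 4 of BAR0. */
static void ivshmem_notify(volatile uint16_t *regs, uint8_t dest, uint8_t msg)
{
    regs[Doorbell / 2] = (uint16_t)((dest << 8) | msg);
}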

Cheers,
Cam

>
> --
> error compiling committee.c: too many arguments to function
>
>
Avi Kivity March 9, 2010, 10:29 a.m. UTC | #12
On 03/08/2010 07:57 PM, Cam Macdonell wrote:
>
>> Can you provide a spec that describes the device?  This would be useful for
>> maintaining the code, writing guest drivers, and as a framework for review.
>>      
> I'm not sure if you want the Qemu command-line part as part of the
> spec here, but I've included for completeness.
>    

I meant something from the guest's point of view, so command line syntax 
is less important.  It should be equally applicable to a real PCI card 
that works with the same driver.

See http://ozlabs.org/~rusty/virtio-spec/ for an example.

> The Inter-VM Shared Memory PCI device
> -----------------------------------------------------------
>
> BARs
>
> The device supports two BARs.  BAR0 is a 256-byte MMIO region to
> support registers
>    

(but might be extended in the future)

> and BAR1 is used to map the shared memory object from the host.  The size of
> BAR1 is specified on the command-line and must be a power of 2 in size.
>
> Registers
>
> BAR0 currently supports 5 registers of 16-bits each.

Suggest making registers 32-bits, friendlier towards non-x86.

>   Registers are used
> for synchronization between guests sharing the same memory object when
> interrupts are supported (this requires using the shared memory server).
>    

How does the driver detect whether interrupts are supported or not?

> When using interrupts, VMs communicate with a shared memory server that passes
> the shared memory object file descriptor using SCM_RIGHTS.  The server assigns
> each VM an ID number and sends this ID number to the Qemu process along with a
> series of eventfd file descriptors, one per guest using the shared memory
> server.  These eventfds will be used to send interrupts between guests.  Each
> guest listens on the eventfd corresponding to their ID and may use the others
> for sending interrupts to other guests.
>
> enum ivshmem_registers {
>      IntrMask = 0,
>      IntrStatus = 2,
>      Doorbell = 4,
>      IVPosition = 6,
>      IVLiveList = 8
> };
>
> The first two registers are the interrupt mask and status registers.
> Interrupts are triggered when a message is received on the guest's eventfd from
> another VM.  Writing to the 'Doorbell' register is how synchronization messages
> are sent to other VMs.
>
> The IVPosition register is read-only and reports the guest's ID number.  The
> IVLiveList register is also read-only and reports a bit vector of currently
> live VM IDs.
>    

That limits the number of guests to 16.

> The Doorbell register is 16-bits, but is treated as two 8-bit values.  The
> upper 8-bits are used for the destination VM ID.  The lower 8-bits are the
> value which will be written to the destination VM and what the guest status
> register will be set to when the interrupt is trigger is the destination guest.
>    

What happens when two interrupts are sent back-to-back to the same 
guest?  Will the first status value be lost?

Also, reading the status register requires a vmexit.  I suggest dropping 
it and requiring the application to manage this information in the 
shared memory area (where it could do proper queueing of multiple messages).
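
A minimal sketch of the kind of application-level queueing meant here,
assuming one single-producer/single-consumer ring per sender/receiver pair in
the shared region; the layout and names are hypothetical:

#include <stdint.h>

#define RING_SIZE 256

struct msg_ring {
    volatile uint32_t head;          /* written only by the sender */
    volatile uint32_t tail;          /* written only by the receiver */
    uint8_t buf[RING_SIZE];
};

static int ring_put(struct msg_ring *r, uint8_t msg)
{
    uint32_t head = r->head;
    if (head - r->tail == RING_SIZE)
        return -1;                   /* full: message would otherwise be lost */
    r->buf[head % RING_SIZE] = msg;
    __sync_synchronize();            /* publish data before moving head */
    r->head = head + 1;
    return 0;
}

static int ring_get(struct msg_ring *r, uint8_t *msg)
{
    uint32_t tail = r->tail;
    if (tail == r->head)
        return -1;                   /* empty */
    __sync_synchronize();            /* see the data the sender published */
    *msg = r->buf[tail % RING_SIZE];
    r->tail = tail + 1;
    return 0;
}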

> A value of 255 in the upper 8-bits will trigger a broadcast where the message
> will be sent to all other guests.
>    

Please consider adding:

- MSI support
- interrupt on a guest attaching/detaching to the shared memory device

With MSI you could also have the doorbell specify both guest ID and 
vector number, which may be useful.

Thanks for this - it definitely makes reviewing easier.
Arnd Bergmann March 9, 2010, 12:49 p.m. UTC | #13
On Monday 08 March 2010, Cam Macdonell wrote:
> enum ivshmem_registers {
>     IntrMask = 0,
>     IntrStatus = 2,
>     Doorbell = 4,
>     IVPosition = 6,
>     IVLiveList = 8
> };
> 
> The first two registers are the interrupt mask and status registers.
> Interrupts are triggered when a message is received on the guest's eventfd from
> another VM.  Writing to the 'Doorbell' register is how synchronization messages
> are sent to other VMs.
> 
> The IVPosition register is read-only and reports the guest's ID number.  The
> IVLiveList register is also read-only and reports a bit vector of currently
> live VM IDs.
> 
> The Doorbell register is 16-bits, but is treated as two 8-bit values.  The
> upper 8-bits are used for the destination VM ID.  The lower 8-bits are the
> value which will be written to the destination VM and what the guest status
> register will be set to when the interrupt is trigger is the destination guest.
> A value of 255 in the upper 8-bits will trigger a broadcast where the message
> will be sent to all other guests.

This means you have at least two intercepts for each message:

1. Sender writes to doorbell
2. Receiver gets interrupted

With optionally two more intercepts in order to avoid interrupting the
receiver every time:

3. Receiver masks interrupt in order to process data
4. Receiver unmasks interrupt when it's done and status is no longer pending

I believe you can do much better than this, you combine status and mask
bits, making this level triggered, and move to a bitmask of all guests:

In order to send an interrupt to another guest, the sender first checks
the bit for the receiver. If it's '1', no need for any intercept, the
receiver will come back anyway. If it's zero, write a '1' bit, which
gets OR'd into the bitmask by the host. The receiver gets interrupted
at a rising edge and just leaves the bit on, until it's done processing,
then turns the bit off by writing a '1' into its own location in the mask.
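
A rough guest-side sketch of this scheme, assuming a hypothetical 32-bit
pending/mask word in which bit N belongs to guest N, reads of the word are
cheap, and writes to it are intercepted by the host as described:

#include <stdint.h>

/* Sender: only trap to the host if the receiver's bit isn't already set;
 * the host ORs the written bit in and interrupts the receiver on the
 * rising edge of its bit. */
static void notify_guest(volatile uint32_t *pending, int dest_id)
{
    if (!(*pending & (1u << dest_id)))
        *pending = 1u << dest_id;
}

/* Receiver: when done processing, clear our own bit by writing a '1'
 * into our own position, re-arming the edge. */
static void ack_self(volatile uint32_t *pending, int my_id)
{
    *pending = 1u << my_id;
}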

	Arnd
Avi Kivity March 9, 2010, 1:03 p.m. UTC | #14
On 03/09/2010 02:49 PM, Arnd Bergmann wrote:
> On Monday 08 March 2010, Cam Macdonell wrote:
>    
>> enum ivshmem_registers {
>>      IntrMask = 0,
>>      IntrStatus = 2,
>>      Doorbell = 4,
>>      IVPosition = 6,
>>      IVLiveList = 8
>> };
>>
>> The first two registers are the interrupt mask and status registers.
>> Interrupts are triggered when a message is received on the guest's eventfd from
>> another VM.  Writing to the 'Doorbell' register is how synchronization messages
>> are sent to other VMs.
>>
>> The IVPosition register is read-only and reports the guest's ID number.  The
>> IVLiveList register is also read-only and reports a bit vector of currently
>> live VM IDs.
>>
>> The Doorbell register is 16-bits, but is treated as two 8-bit values.  The
>> upper 8-bits are used for the destination VM ID.  The lower 8-bits are the
>> value which will be written to the destination VM and what the guest status
>> register will be set to when the interrupt is trigger is the destination guest.
>> A value of 255 in the upper 8-bits will trigger a broadcast where the message
>> will be sent to all other guests.
>>      
> This means you have at least two intercepts for each message:
>
> 1. Sender writes to doorbell
> 2. Receiver gets interrupted
>
> With optionally two more intercepts in order to avoid interrupting the
> receiver every time:
>
> 3. Receiver masks interrupt in order to process data
> 4. Receiver unmasks interrupt when it's done and status is no longer pending
>
> I believe you can do much better than this, you combine status and mask
> bits, making this level triggered, and move to a bitmask of all guests:
>
> In order to send an interrupt to another guest, the sender first checks
> the bit for the receiver. If it's '1', no need for any intercept, the
> receiver will come back anyway. If it's zero, write a '1' bit, which
> gets OR'd into the bitmask by the host. The receiver gets interrupted
> at a raising edge and just leaves the bit on, until it's done processing,
> then turns the bit off by writing a '1' into its own location in the mask.
>    

We could make the masking in RAM, not in registers, like virtio, which
would require no exits.  It would then be part of the application-specific
protocol and out of scope of this spec.
Cam Macdonell March 9, 2010, 3:27 p.m. UTC | #15
On Tue, Mar 9, 2010 at 3:29 AM, Avi Kivity <avi@redhat.com> wrote:
> On 03/08/2010 07:57 PM, Cam Macdonell wrote:
>>
>>> Can you provide a spec that describes the device?  This would be useful
>>> for
>>> maintaining the code, writing guest drivers, and as a framework for
>>> review.
>>>
>>
>> I'm not sure if you want the Qemu command-line part as part of the
>> spec here, but I've included for completeness.
>>
>
> I meant something from the guest's point of view, so command line syntax is
> less important.  It should be equally applicable to a real PCI card that
> works with the same driver.
>
> See http://ozlabs.org/~rusty/virtio-spec/ for an example.
>
>> The Inter-VM Shared Memory PCI device
>> -----------------------------------------------------------
>>
>> BARs
>>
>> The device supports two BARs.  BAR0 is a 256-byte MMIO region to
>> support registers
>>
>
> (but might be extended in the future)
>
>> and BAR1 is used to map the shared memory object from the host.  The size
>> of
>> BAR1 is specified on the command-line and must be a power of 2 in size.
>>
>> Registers
>>
>> BAR0 currently supports 5 registers of 16-bits each.
>
> Suggest making registers 32-bits, friendlier towards non-x86.
>
>>  Registers are used
>> for synchronization between guests sharing the same memory object when
>> interrupts are supported (this requires using the shared memory server).
>>
>
> How does the driver detect whether interrupts are supported or not?

At the moment, the VM ID is set to -1 if interrupts aren't supported,
but that may not be the clearest way to do things.  With UIO is there
a way to detect if the interrupt pin is on?

>
>> When using interrupts, VMs communicate with a shared memory server that
>> passes
>> the shared memory object file descriptor using SCM_RIGHTS.  The server
>> assigns
>> each VM an ID number and sends this ID number to the Qemu process along
>> with a
>> series of eventfd file descriptors, one per guest using the shared memory
>> server.  These eventfds will be used to send interrupts between guests.
>>  Each
>> guest listens on the eventfd corresponding to their ID and may use the
>> others
>> for sending interrupts to other guests.
>>
>> enum ivshmem_registers {
>>     IntrMask = 0,
>>     IntrStatus = 2,
>>     Doorbell = 4,
>>     IVPosition = 6,
>>     IVLiveList = 8
>> };
>>
>> The first two registers are the interrupt mask and status registers.
>> Interrupts are triggered when a message is received on the guest's eventfd
>> from
>> another VM.  Writing to the 'Doorbell' register is how synchronization
>> messages
>> are sent to other VMs.
>>
>> The IVPosition register is read-only and reports the guest's ID number.
>>  The
>> IVLiveList register is also read-only and reports a bit vector of
>> currently
>> live VM IDs.
>>
>
> That limits the number of guests to 16.

True, it could grow to 32 or 64 without difficulty.  We could leave
'liveness' to the user (it could be implemented using the shared memory
region) or via the interrupts that arrive on guest attach/detach as
you suggest below.

>
>> The Doorbell register is 16-bits, but is treated as two 8-bit values.  The
>> upper 8-bits are used for the destination VM ID.  The lower 8-bits are the
>> value which will be written to the destination VM and what the guest
>> status
>> register will be set to when the interrupt is trigger is the destination
>> guest.
>>
>
> What happens when two interrupts are sent back-to-back to the same guest?
>  Will the first status value be lost?

Right now, it would be.  I believe that eventfd has a counting
semaphore option that could prevent loss of status (but limits what
the status could be).  My understanding of uio_pci interrupt handling
is fairly new, but we could have the uio driver store the interrupt
statuses to avoid losing them.
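
For reference, the counting-semaphore behaviour mentioned here is the
EFD_SEMAPHORE flag of eventfd(2); a small illustrative sketch (not from the
patch, names made up):

#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* With EFD_SEMAPHORE, each write() adds to the counter and each read()
 * consumes exactly one notification, so back-to-back interrupts are
 * counted rather than coalesced -- but only the count survives, not the
 * individual status values. */
static int make_notifier(void)
{
    return eventfd(0, EFD_SEMAPHORE);
}

static void notify(int efd)
{
    uint64_t one = 1;
    write(efd, &one, sizeof(one));
}

static void wait_one(int efd)
{
    uint64_t cnt;
    read(efd, &cnt, sizeof(cnt));    /* returns 1 per queued notification */
}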

>
> Also, reading the status register requires a vmexit.  I suggest dropping it
> and requiring the application to manage this information in the shared
> memory area (where it could do proper queueing of multiple messages).
>
>> A value of 255 in the upper 8-bits will trigger a broadcast where the
>> message
>> will be sent to all other guests.
>>
>
> Please consider adding:
>
> - MSI support

Sure, I'll look into it.

> - interrupt on a guest attaching/detaching to the shared memory device

Sure.

>
> With MSI you could also have the doorbell specify both guest ID and vector
> number, which may be useful.
>
> Thanks for this - it definitely makes reviewing easier.
Cam Macdonell March 9, 2010, 4:44 p.m. UTC | #16
On Tue, Mar 9, 2010 at 6:03 AM, Avi Kivity <avi@redhat.com> wrote:
> On 03/09/2010 02:49 PM, Arnd Bergmann wrote:
>>
>> On Monday 08 March 2010, Cam Macdonell wrote:
>>
>>>
>>> enum ivshmem_registers {
>>>     IntrMask = 0,
>>>     IntrStatus = 2,
>>>     Doorbell = 4,
>>>     IVPosition = 6,
>>>     IVLiveList = 8
>>> };
>>>
>>> The first two registers are the interrupt mask and status registers.
>>> Interrupts are triggered when a message is received on the guest's
>>> eventfd from
>>> another VM.  Writing to the 'Doorbell' register is how synchronization
>>> messages
>>> are sent to other VMs.
>>>
>>> The IVPosition register is read-only and reports the guest's ID number.
>>>  The
>>> IVLiveList register is also read-only and reports a bit vector of
>>> currently
>>> live VM IDs.
>>>
>>> The Doorbell register is 16-bits, but is treated as two 8-bit values.
>>>  The
>>> upper 8-bits are used for the destination VM ID.  The lower 8-bits are
>>> the
>>> value which will be written to the destination VM and what the guest
>>> status
>>> register will be set to when the interrupt is trigger is the destination
>>> guest.
>>> A value of 255 in the upper 8-bits will trigger a broadcast where the
>>> message
>>> will be sent to all other guests.
>>>
>>
>> This means you have at least two intercepts for each message:
>>
>> 1. Sender writes to doorbell
>> 2. Receiver gets interrupted
>>
>> With optionally two more intercepts in order to avoid interrupting the
>> receiver every time:
>>
>> 3. Receiver masks interrupt in order to process data
>> 4. Receiver unmasks interrupt when it's done and status is no longer
>> pending
>>
>> I believe you can do much better than this, you combine status and mask
>> bits, making this level triggered, and move to a bitmask of all guests:
>>
>> In order to send an interrupt to another guest, the sender first checks
>> the bit for the receiver. If it's '1', no need for any intercept, the
>> receiver will come back anyway. If it's zero, write a '1' bit, which
>> gets OR'd into the bitmask by the host. The receiver gets interrupted
>> at a raising edge and just leaves the bit on, until it's done processing,
>> then turns the bit off by writing a '1' into its own location in the mask.
>>
>
> We could make the masking in RAM, not in registers, like virtio, which would
> require no exits.  It would then be part of the application specific
> protocol and out of scope of of this spec.
>

This kind of implementation would be possible now since with UIO it's
up to the application whether to mask interrupts or not and what
interrupts mean.  We could leave the interrupt mask register for those
who want that behaviour.  Arnd's idea would remove the need for the
Doorbell and Mask, but we will always need at least one MMIO register
to send whatever interrupts we do send.

Cam
Avi Kivity March 9, 2010, 5:28 p.m. UTC | #17
On 03/09/2010 05:27 PM, Cam Macdonell wrote:
>
>>
>>>   Registers are used
>>> for synchronization between guests sharing the same memory object when
>>> interrupts are supported (this requires using the shared memory server).
>>>
>>>        
>> How does the driver detect whether interrupts are supported or not?
>>      
> At the moment, the VM ID is set to -1 if interrupts aren't supported,
> but that may not be the clearest way to do things.  With UIO is there
> a way to detect if the interrupt pin is on?
>    

I suggest not designing the device to uio.  Make it a good 
guest-independent device, and if uio doesn't fit it, change it.

Why not support interrupts unconditionally?  Is the device useful 
without interrupts?

>>> The Doorbell register is 16-bits, but is treated as two 8-bit values.  The
>>> upper 8-bits are used for the destination VM ID.  The lower 8-bits are the
>>> value which will be written to the destination VM and what the guest
>>> status
>>> register will be set to when the interrupt is trigger is the destination
>>> guest.
>>>
>>>        
>> What happens when two interrupts are sent back-to-back to the same guest?
>>   Will the first status value be lost?
>>      
> Right now, it would be.  I believe that eventfd has a counting
> semaphore option, that could prevent loss of status (but limits what
> the status could be).
>    

It only counts the number of interrupts (and kvm will coalesce them anyway).

> My understanding of uio_pci interrupt handling
> is fairly new, but we could have the uio driver store the interrupt
> statuses to avoid losing them.
>    

There's nowhere to store them if we use ioeventfd/irqfd.  I think it's 
both easier and more efficient to leave this to the application (to 
store into shared memory).
Anthony Liguori March 9, 2010, 5:34 p.m. UTC | #18
On 03/09/2010 11:28 AM, Avi Kivity wrote:
> On 03/09/2010 05:27 PM, Cam Macdonell wrote:
>>
>>>
>>>>   Registers are used
>>>> for synchronization between guests sharing the same memory object when
>>>> interrupts are supported (this requires using the shared memory 
>>>> server).
>>>>
>>> How does the driver detect whether interrupts are supported or not?
>> At the moment, the VM ID is set to -1 if interrupts aren't supported,
>> but that may not be the clearest way to do things.  With UIO is there
>> a way to detect if the interrupt pin is on?
>
> I suggest not designing the device to uio.  Make it a good 
> guest-independent device, and if uio doesn't fit it, change it.

You can always fall back to reading the config space directly.  It's not 
strictly required that you stick to the UIO interface.

> Why not support interrupts unconditionally?  Is the device useful 
> without interrupts?

You can always just have interrupts enabled and not use them if that's 
desired.

Regards,

Anthony Liguori
Cam Macdonell March 9, 2010, 6:34 p.m. UTC | #19
On Tue, Mar 9, 2010 at 10:28 AM, Avi Kivity <avi@redhat.com> wrote:
> On 03/09/2010 05:27 PM, Cam Macdonell wrote:
>>
>>>
>>>>  Registers are used
>>>> for synchronization between guests sharing the same memory object when
>>>> interrupts are supported (this requires using the shared memory server).
>>>>
>>>>
>>>
>>> How does the driver detect whether interrupts are supported or not?
>>>
>>
>> At the moment, the VM ID is set to -1 if interrupts aren't supported,
>> but that may not be the clearest way to do things.  With UIO is there
>> a way to detect if the interrupt pin is on?
>>
>
> I suggest not designing the device to uio.  Make it a good guest-independent
> device, and if uio doesn't fit it, change it.
>
> Why not support interrupts unconditionally?  Is the device useful without
> interrupts?

Currently my patch works with or without the shared memory server.  If
you give the parameter

-ivshmem 256,foo

then this will create (if necessary) and map /dev/shm/foo as the
shared region without interrupt support.  Some users of shared memory
are using it this way.

Going forward we can require the shared memory server and always have
interrupts enabled.

>
>>>> The Doorbell register is 16-bits, but is treated as two 8-bit values.
>>>>  The
>>>> upper 8-bits are used for the destination VM ID.  The lower 8-bits are
>>>> the
>>>> value which will be written to the destination VM and what the guest
>>>> status
>>>> register will be set to when the interrupt is trigger is the destination
>>>> guest.
>>>>
>>>>
>>>
>>> What happens when two interrupts are sent back-to-back to the same guest?
>>>  Will the first status value be lost?
>>>
>>
>> Right now, it would be.  I believe that eventfd has a counting
>> semaphore option, that could prevent loss of status (but limits what
>> the status could be).
>>
>
> It only counts the number of interrupts (and kvm will coalesce them anyway).

Right.

>
>> My understanding of uio_pci interrupt handling
>> is fairly new, but we could have the uio driver store the interrupt
>> statuses to avoid losing them.
>>
>
> There's nowhere to store them if we use ioeventfd/irqfd.  I think it's both
> easier and more efficient to leave this to the application (to store into
> shared memory).

Agreed.

Cam
Jamie Lokier March 9, 2010, 7 p.m. UTC | #20
Paul Brook wrote:
> > However, coherence could be made host-type-independent by the host
> > mapping and unampping pages, so that each page is only mapped into one
> > guest (or guest CPU) at a time.  Just like some clustering filesystems
> > do to maintain coherence.
> 
> You're assuming that a TLB flush implies a write barrier, and a TLB miss 
> implies a read barrier.  I'd be surprised if this were true in general.

The host driver itself can issue full barriers at the same time as it
maps pages on TLB miss, and would probably have to interrupt the
guest's SMP KVM threads to insert a full barrier when broadcasting a
TLB flush on unmap.

-- Jamie
Jamie Lokier March 9, 2010, 8:11 p.m. UTC | #21
Avi Kivity wrote:
> On 03/08/2010 03:03 PM, Paul Brook wrote:
> >>On 03/08/2010 12:53 AM, Paul Brook wrote:
> >>     
> >>>>Support an inter-vm shared memory device that maps a shared-memory
> >>>>object as a PCI device in the guest.  This patch also supports
> >>>>interrupts between guest by communicating over a unix domain socket.
> >>>>This patch applies to the qemu-kvm repository.
> >>>>         
> >>>No. All new devices should be fully qdev based.
> >>>
> >>>I suspect you've also ignored a load of coherency issues, especially when
> >>>not using KVM. As soon as you have shared memory in more than one host
> >>>thread/process you have to worry about memory barriers.
> >>>       
> >>Shouldn't it be sufficient to require the guest to issue barriers (and
> >>to ensure tcg honours the barriers, if someone wants this with tcg)?.
> >>     
> >In a cross environment that becomes extremely hairy.  For example the x86
> >architecture effectively has an implicit write barrier before every store, 
> >and
> >an implicit read barrier before every load.
> >   
> 
> Ah yes.  For cross tcg environments you can map the memory using mmio 
> callbacks instead of directly, and issue the appropriate barriers there.

That makes sense.  It will force an mmio callback for every access to
the shared memory, which is ok for correctness but vastly slower when
running in TCG compared with KVM.

But it's hard to see what else could be done - those implicit write
barriers on x86 have to be emulated somehow.  For TCG without inter-vm
shared memory, those barriers aren't a problem.

Non-random-corruption guest behaviour is paramount, so I hope the
inter-vm device will add those mmio callbacks for the cross-arch case
before it sees much action.  (Strictly, it isn't cross-arch, but
host-has-more-relaxed-implicit-memory-model-than-guest.  I'm assuming
TCG doesn't reorder memory instructions).

-- Jamie
Jamie Lokier March 9, 2010, 8:12 p.m. UTC | #22
Paul Brook wrote:
> > On 03/08/2010 12:53 AM, Paul Brook wrote:
> > >> Support an inter-vm shared memory device that maps a shared-memory
> > >> object as a PCI device in the guest.  This patch also supports
> > >> interrupts between guest by communicating over a unix domain socket. 
> > >> This patch applies to the qemu-kvm repository.
> > >
> > > No. All new devices should be fully qdev based.
> > >
> > > I suspect you've also ignored a load of coherency issues, especially when
> > > not using KVM. As soon as you have shared memory in more than one host
> > > thread/process you have to worry about memory barriers.
> > 
> > Shouldn't it be sufficient to require the guest to issue barriers (and
> > to ensure tcg honours the barriers, if someone wants this with tcg)?.
> 
> In a cross environment that becomes extremely hairy.  For example the x86 
> architecture effectively has an implicit write barrier before every store, and 
> an implicit read barrier before every load.

Btw, x86 doesn't have any implicit barriers due to ordinary loads.
Only stores and atomics have implicit barriers, afaik.

-- Jamie
Anthony Liguori March 9, 2010, 9:41 p.m. UTC | #23
On 03/08/2010 03:54 AM, Jamie Lokier wrote:
> Alexander Graf wrote:
>    
>> Or we could put in some code that tells the guest the host shm
>> architecture and only accept x86 on x86 for now. If anyone cares for
>> other combinations, they're free to implement them.
>>
>> Seriously, we're looking at an interface designed for kvm here. Let's
>> please keep it as simple and fast as possible for the actual use case,
>> not some theoretically possible ones.
>>      
> The concern is that a perfectly working guest image running on kvm,
> the guest being some OS or app that uses this facility (_not_ a
> kvm-only guest driver), is later run on qemu on a different host, and
> then mostly works except for some silent data corruption.
>
> That is not a theoretical scenario.
>    

Hint: no matter what you do, shared memory is a hack that's going to 
lead to subtle failures one way or another.

It's useful to support because it has some interesting academic uses but 
it's not a mechanism that can ever be used for real world purposes.

It's impossible to support save/restore correctly.  It can never be made 
to work with TCG in a safe way.  That's why I've been advocating keeping 
this as simple as humanly possible.  It's just not worth trying to make 
this fancier than it needs to be because it will never be fully correct.

Regards,

Anthony Liguori

> Well, the bit with this driver is theoretical, obviously :-)
> But not the bit about moving to a different host.
>
> -- Jamie
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Anthony Liguori March 9, 2010, 9:44 p.m. UTC | #24
On 03/08/2010 07:16 AM, Avi Kivity wrote:
> On 03/08/2010 03:03 PM, Paul Brook wrote:
>>> On 03/08/2010 12:53 AM, Paul Brook wrote:
>>>>> Support an inter-vm shared memory device that maps a shared-memory
>>>>> object as a PCI device in the guest.  This patch also supports
>>>>> interrupts between guest by communicating over a unix domain socket.
>>>>> This patch applies to the qemu-kvm repository.
>>>> No. All new devices should be fully qdev based.
>>>>
>>>> I suspect you've also ignored a load of coherency issues, 
>>>> especially when
>>>> not using KVM. As soon as you have shared memory in more than one host
>>>> thread/process you have to worry about memory barriers.
>>> Shouldn't it be sufficient to require the guest to issue barriers (and
>>> to ensure tcg honours the barriers, if someone wants this with tcg)?.
>> In a cross environment that becomes extremely hairy.  For example the 
>> x86
>> architecture effectively has an implicit write barrier before every 
>> store, and
>> an implicit read barrier before every load.
>
> Ah yes.  For cross tcg environments you can map the memory using mmio 
> callbacks instead of directly, and issue the appropriate barriers there.

Not good enough unless you want to severely restrict the use of shared 
memory within the guest.

For instance, it's going to be useful to assume that your atomic 
instructions remain atomic.  Crossing architecture boundaries here makes 
these assumptions invalid.  A barrier is not enough.

Shared memory only makes sense when using KVM.  In fact, we should 
actively disable the shared memory device when not using KVM.

Regards,

Anthony Liguori
Paul Brook March 10, 2010, 12:03 a.m. UTC | #25
> > In a cross environment that becomes extremely hairy.  For example the x86
> > architecture effectively has an implicit write barrier before every
> > store, and an implicit read barrier before every load.
> 
> Btw, x86 doesn't have any implicit barriers due to ordinary loads.
> Only stores and atomics have implicit barriers, afaik.

As of March 2009[1] Intel guarantees that memory reads occur in order (they 
may only be reordered relative to writes). It appears AMD do not provide this 
guarantee, which could be an interesting problem for heterogeneous migration..

Paul

[1] The most recent docs I have handy. Up to and including Core-2 Duo.
Cam Macdonell March 10, 2010, 4:38 a.m. UTC | #26
On Tue, Mar 9, 2010 at 5:03 PM, Paul Brook <paul@codesourcery.com> wrote:
>> > In a cross environment that becomes extremely hairy.  For example the x86
>> > architecture effectively has an implicit write barrier before every
>> > store, and an implicit read barrier before every load.
>>
>> Btw, x86 doesn't have any implicit barriers due to ordinary loads.
>> Only stores and atomics have implicit barriers, afaik.
>
> As of March 2009[1] Intel guarantees that memory reads occur in order (they
> may only be reordered relative to writes). It appears AMD do not provide this
> guarantee, which could be an interesting problem for heterogeneous migration..
>
> Paul
>
> [*] The most recent docs I have handy. Up to and including Core-2 Duo.
>

Interesting, but what ordering would cause problems that AMD would do
but Intel wouldn't?  Wouldn't that ordering cause the same problems
for POSIX shared memory in general (regardless of Qemu) on AMD?

I think shared memory breaks migration anyway.

Cam
Avi Kivity March 10, 2010, 9:21 a.m. UTC | #27
On 03/09/2010 08:34 PM, Cam Macdonell wrote:
> On Tue, Mar 9, 2010 at 10:28 AM, Avi Kivity<avi@redhat.com>  wrote:
>    
>> On 03/09/2010 05:27 PM, Cam Macdonell wrote:
>>      
>>>        
>>>>          
>>>>>   Registers are used
>>>>> for synchronization between guests sharing the same memory object when
>>>>> interrupts are supported (this requires using the shared memory server).
>>>>>
>>>>>
>>>>>            
>>>> How does the driver detect whether interrupts are supported or not?
>>>>
>>>>          
>>> At the moment, the VM ID is set to -1 if interrupts aren't supported,
>>> but that may not be the clearest way to do things.  With UIO is there
>>> a way to detect if the interrupt pin is on?
>>>
>>>        
>> I suggest not designing the device to uio.  Make it a good guest-independent
>> device, and if uio doesn't fit it, change it.
>>
>> Why not support interrupts unconditionally?  Is the device useful without
>> interrupts?
>>      
> Currently my patch works with or without the shared memory server.  If
> you give the parameter
>
> -ivshmem 256,foo
>
> then this will create (if necessary) and map /dev/shm/foo as the
> shared region without interrupt support.  Some users of shared memory
> are using it this way.
>
> Going forward we can require the shared memory server and always have
> interrupts enabled.
>    

Can you explain how they synchronize?  Polling?  Using the network?  
Using it as a shared cache?

If it's a reasonable use case it makes sense to keep it.

Another thing comes to mind - a shared memory ID, in case a guest has 
multiple cards.
Avi Kivity March 10, 2010, 9:25 a.m. UTC | #28
On 03/09/2010 11:44 PM, Anthony Liguori wrote:
>> Ah yes.  For cross tcg environments you can map the memory using mmio 
>> callbacks instead of directly, and issue the appropriate barriers there.
>
>
> Not good enough unless you want to severely restrict the use of shared 
> memory within the guest.
>
> For instance, it's going to be useful to assume that you atomic 
> instructions remain atomic.  Crossing architecture boundaries here 
> makes these assumptions invalid.  A barrier is not enough.

You could make the mmio callbacks flow to the shared memory server over 
the unix-domain socket, which would then serialize them.  Still need to 
keep RMWs as single operations.  When the host supports it, implement 
the operation locally (you can't render cmpxchg16b on i386, for example).

> Shared memory only makes sense when using KVM.  In fact, we should 
> actively disable the shared memory device when not using KVM.

Looks like that's the only practical choice.
Avi Kivity March 10, 2010, 9:29 a.m. UTC | #29
On 03/10/2010 06:38 AM, Cam Macdonell wrote:
> On Tue, Mar 9, 2010 at 5:03 PM, Paul Brook<paul@codesourcery.com>  wrote:
>    
>>>> In a cross environment that becomes extremely hairy.  For example the x86
>>>> architecture effectively has an implicit write barrier before every
>>>> store, and an implicit read barrier before every load.
>>>>          
>>> Btw, x86 doesn't have any implicit barriers due to ordinary loads.
>>> Only stores and atomics have implicit barriers, afaik.
>>>        
>> As of March 2009[1] Intel guarantees that memory reads occur in order (they
>> may only be reordered relative to writes). It appears AMD do not provide this
>> guarantee, which could be an interesting problem for heterogeneous migration..
>>
>> Paul
>>
>> [*] The most recent docs I have handy. Up to and including Core-2 Duo.
>>
>>      
> Interesting, but what ordering would cause problems that AMD would do
> but Intel wouldn't?  Wouldn't that ordering cause the same problems
> for POSIX shared memory in general (regardless of Qemu) on AMD?
>    

If some code was written for the Intel guarantees it would break if 
migrated to AMD.  Of course, it would also break if run on AMD in the 
first place.

> I think shared memory breaks migration anyway.
>    

Until someone implements distributed shared memory.
Paul Brook March 10, 2010, 11:13 a.m. UTC | #30
> >> As of March 2009[1] Intel guarantees that memory reads occur in order
> >> (they may only be reordered relative to writes). It appears AMD do not
> >> provide this guarantee, which could be an interesting problem for
> >> heterogeneous migration..
> >
> > Interesting, but what ordering would cause problems that AMD would do
> > but Intel wouldn't?  Wouldn't that ordering cause the same problems
> > for POSIX shared memory in general (regardless of Qemu) on AMD?
> 
> If some code was written for the Intel guarantees it would break if
> migrated to AMD.  Of course, it would also break if run on AMD in the
> first place.

Right. This is independent of shared memory, and is a case where reporting an 
Intel CPUID on an AMD host might get you into trouble.

Paul
Arnd Bergmann March 10, 2010, 2:04 p.m. UTC | #31
On Tuesday 09 March 2010, Cam Macdonell wrote:
> >
> > We could make the masking in RAM, not in registers, like virtio, which would
> > require no exits.  It would then be part of the application specific
> > protocol and out of scope of of this spec.
> >
> 
> This kind of implementation would be possible now since with UIO it's
> up to the application whether to mask interrupts or not and what
> interrupts mean.  We could leave the interrupt mask register for those
> who want that behaviour.  Arnd's idea would remove the need for the
> Doorbell and Mask, but we will always need at least one MMIO register
> to send whatever interrupts we do send.

You'd also have to be very careful if the notification is in RAM to
avoid races between one guest triggering an interrupt and another
guest clearing its interrupt mask.

A totally different option that avoids this whole problem would
be to separate the signalling from the shared memory, making the
PCI shared memory device a trivial device with a single memory BAR,
and using a higher-level concept like a virtio based
serial line for the actual signalling.

	Arnd
Cam Macdonell March 10, 2010, 4:36 p.m. UTC | #32
On Wed, Mar 10, 2010 at 2:21 AM, Avi Kivity <avi@redhat.com> wrote:
> On 03/09/2010 08:34 PM, Cam Macdonell wrote:
>>
>> On Tue, Mar 9, 2010 at 10:28 AM, Avi Kivity<avi@redhat.com>  wrote:
>>
>>>
>>> On 03/09/2010 05:27 PM, Cam Macdonell wrote:
>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>  Registers are used
>>>>>> for synchronization between guests sharing the same memory object when
>>>>>> interrupts are supported (this requires using the shared memory
>>>>>> server).
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> How does the driver detect whether interrupts are supported or not?
>>>>>
>>>>>
>>>>
>>>> At the moment, the VM ID is set to -1 if interrupts aren't supported,
>>>> but that may not be the clearest way to do things.  With UIO is there
>>>> a way to detect if the interrupt pin is on?
>>>>
>>>>
>>>
>>> I suggest not designing the device to uio.  Make it a good
>>> guest-independent
>>> device, and if uio doesn't fit it, change it.
>>>
>>> Why not support interrupts unconditionally?  Is the device useful without
>>> interrupts?
>>>
>>
>> Currently my patch works with or without the shared memory server.  If
>> you give the parameter
>>
>> -ivshmem 256,foo
>>
>> then this will create (if necessary) and map /dev/shm/foo as the
>> shared region without interrupt support.  Some users of shared memory
>> are using it this way.
>>
>> Going forward we can require the shared memory server and always have
>> interrupts enabled.
>>
>
> Can you explain how they synchronize?  Polling?  Using the network?  Using
> it as a shared cache?
>
> If it's a reasonable use case it makes sense to keep it.
>

Do you mean how they synchronize without interrupts?  One project I've
been contacted about uses the shared region directly for
synchronization for simulations running in different VMs that share
data in the memory region.  In my tests spinlocks in the shared region
work between guests.
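
For illustration, such a spinlock (on an x86 host) can be as simple as the
following, with 'lock' pointing into the shared region; this is a sketch
using GCC builtins, not code from the project:

#include <stdint.h>

static void shm_lock(volatile uint32_t *lock)
{
    while (__sync_lock_test_and_set(lock, 1))   /* acquire barrier */
        while (*lock)
            ;                                   /* spin while held */
}

static void shm_unlock(volatile uint32_t *lock)
{
    __sync_lock_release(lock);                  /* release barrier, *lock = 0 */
}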

If we want to keep the serverless implementation, do we need to
support shm_open with -chardev somehow? Something like -chardev
shm,name=foo.  Right now my qdev implementation just passes the name
to the -device option and opens it.

> Another thing comes to mind - a shared memory ID, in case a guest has
> multiple cards.

Sure, a number that can be passed on the command-line and stored in a register?

Cam
Anthony Liguori March 10, 2010, 5:13 p.m. UTC | #33
On 03/10/2010 03:25 AM, Avi Kivity wrote:
> On 03/09/2010 11:44 PM, Anthony Liguori wrote:
>>> Ah yes.  For cross tcg environments you can map the memory using 
>>> mmio callbacks instead of directly, and issue the appropriate 
>>> barriers there.
>>
>>
>> Not good enough unless you want to severely restrict the use of 
>> shared memory within the guest.
>>
>> For instance, it's going to be useful to assume that you atomic 
>> instructions remain atomic.  Crossing architecture boundaries here 
>> makes these assumptions invalid.  A barrier is not enough.
>
> You could make the mmio callbacks flow to the shared memory server 
> over the unix-domain socket, which would then serialize them.  Still 
> need to keep RMWs as single operations.  When the host supports it, 
> implement the operation locally (you can't render cmpxchg16b on i386, 
> for example).

But now you have a requirement that the shmem server runs in lock-step 
with the guest VCPU which has to happen for every single word of data 
transferred.

You're much better off using a bulk-data transfer API that relaxes 
coherency requirements.  IOW, shared memory doesn't make sense for TCG :-)

Regards,

Anthony Liguori
Avi Kivity March 10, 2010, 5:30 p.m. UTC | #34
On 03/10/2010 07:13 PM, Anthony Liguori wrote:
> On 03/10/2010 03:25 AM, Avi Kivity wrote:
>> On 03/09/2010 11:44 PM, Anthony Liguori wrote:
>>>> Ah yes.  For cross tcg environments you can map the memory using 
>>>> mmio callbacks instead of directly, and issue the appropriate 
>>>> barriers there.
>>>
>>>
>>> Not good enough unless you want to severely restrict the use of 
>>> shared memory within the guest.
>>>
>>> For instance, it's going to be useful to assume that you atomic 
>>> instructions remain atomic.  Crossing architecture boundaries here 
>>> makes these assumptions invalid.  A barrier is not enough.
>>
>> You could make the mmio callbacks flow to the shared memory server 
>> over the unix-domain socket, which would then serialize them.  Still 
>> need to keep RMWs as single operations.  When the host supports it, 
>> implement the operation locally (you can't render cmpxchg16b on i386, 
>> for example).
>
> But now you have a requirement that the shmem server runs in lock-step 
> with the guest VCPU, which has to happen for every single word of data 
> transferred.
>

Alternative implementation: expose a futex in a shared memory object and 
use that to serialize access.  Now all accesses happen from vcpu 
context, and as long as there is no contention, should be fast, at least 
relative to tcg.
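As an illustrative sketch only (the helper names are made up, 'w' points
at a word inside the shared object, and error handling is omitted): a
non-private futex keys on the underlying page, so waiters in different
processes mapping the same object end up on the same wait queue.

    #include <stdint.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long futex(uint32_t *uaddr, int op, uint32_t val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    static void shm_acquire(uint32_t *w)   /* take the token (1 -> 0) */
    {
        while (__sync_val_compare_and_swap(w, 1, 0) != 1)
            futex(w, FUTEX_WAIT, 0);       /* sleep while *w == 0 */
    }

    static void shm_release(uint32_t *w)   /* hand the token back */
    {
        __sync_synchronize();              /* order prior writes first */
        __sync_lock_test_and_set(w, 1);
        futex(w, FUTEX_WAKE, 1);           /* wake one waiter */
    }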

> You're much better off using a bulk-data transfer API that relaxes 
> coherency requirements.  IOW, shared memory doesn't make sense for TCG 
> :-)

Rather, tcg doesn't make sense for shared memory smp.  But we knew that 
already.
Paul Brook March 10, 2010, 5:41 p.m. UTC | #35
> > You're much better off using a bulk-data transfer API that relaxes
> > coherency requirements.  IOW, shared memory doesn't make sense for TCG
> 
> Rather, tcg doesn't make sense for shared memory smp.  But we knew that
> already.

I think TCG SMP is a hard, but soluble problem, especially when you're 
running guests used to coping with NUMA.

TCG interacting with third parties via shared memory is probably never going 
to make sense.

Paul
Jamie Lokier March 11, 2010, 3:10 a.m. UTC | #36
Paul Brook wrote:
> > > In a cross environment that becomes extremely hairy.  For example the x86
> > > architecture effectively has an implicit write barrier before every
> > > store, and an implicit read barrier before every load.
> > 
> > Btw, x86 doesn't have any implicit barriers due to ordinary loads.
> > Only stores and atomics have implicit barriers, afaik.
> 
> As of March 2009[1] Intel guarantees that memory reads occur in
> order (they may only be reordered relative to writes). It appears
> AMD do not provide this guarantee, which could be an interesting
> problem for heterogeneous migration..

(Summary: At least on AMD64, it does too, for normal accesses to
naturally aligned addresses in write-back cacheable memory.)

Oh, that's interesting.  Way back when I guess we knew writes were in
order and it wasn't explicit that reads were, hence smp_rmb() using a
locked atomic.
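(For reference, the old-style barrier was something along these lines
on 32-bit x86, before the lfence-based variant; quoted from memory, not
from any particular kernel tree:)

    /* a locked no-op RMW on the stack orders all earlier loads */
    #define old_smp_rmb() \
        asm volatile("lock; addl $0,0(%%esp)" ::: "memory")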

Here is a post by Nick Piggin from 2007 with links to Intel _and_ AMD
documents asserting that reads to cacheable memory are in program order:

    http://lkml.org/lkml/2007/9/28/212
    Subject: [patch] x86: improved memory barrier implementation

Links to documents:

    http://developer.intel.com/products/processor/manuals/318147.pdf
    http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf

The Intel link doesn't work any more, but the AMD one does.

Nick asserts "both manufacturers are committed to in-order loads from
cacheable memory for the x86 architecture".

I have just read the AMD document, and it is in there (but not
completely obviously), in section 7.2.  The implicit load-load and
store-store barriers are only guaranteed for "normal cacheable
accesses on naturally aligned boundaries to WB [write-back cacheable]
memory".  There are also implicit load-store barriers but not
store-load.
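Concretely, the guarantee covers the usual flag-passing pattern with no
explicit barrier, but not anything that depends on store-load ordering.
A toy example (about the hardware rules only, both variables assumed to
live in ordinary WB memory; not a claim about what the C standard gives
you):

    volatile int data, flag;

    void producer(void)         /* store-store: stays ordered on x86 */
    {
        data = 42;
        flag = 1;
    }

    int consumer(void)          /* load-load: stays ordered on x86 */
    {
        while (!flag)
            ;
        return data;            /* observes 42 once flag is seen */
    }

    /* store-load is NOT implicitly ordered: a Dekker-style "store mine,
     * then check theirs" sequence still needs mfence or a locked op. */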

Note that the document covers AMD64; it does not say anything about
their (now old) 32-bit processors.

> [*] The most recent docs I have handy. Up to and including Core-2 Duo.

Are you sure the read ordering applies to 32-bit Intel and AMD CPUs too?

Many years ago, before 64-bit x86 existed, I recall discussions on
LKML where it was made clear that stores were performed in program
order.  If it were known at the time that loads were performed in
program order on 32-bit x86s, I would have expected that to have been
mentioned by someone.

-- Jamie
Nick Piggin March 11, 2010, 4:37 a.m. UTC | #37
On Thu, Mar 11, 2010 at 03:10:47AM +0000, Jamie Lokier wrote:
> Paul Brook wrote:
> > > > In a cross environment that becomes extremely hairy.  For example the x86
> > > > architecture effectively has an implicit write barrier before every
> > > > store, and an implicit read barrier before every load.
> > > 
> > > Btw, x86 doesn't have any implicit barriers due to ordinary loads.
> > > Only stores and atomics have implicit barriers, afaik.
> > 
> > As of March 2009[1] Intel guarantees that memory reads occur in
> > order (they may only be reordered relative to writes). It appears
> > AMD do not provide this guarantee, which could be an interesting
> > problem for heterogeneous migration..
> 
> (Summary: At least on AMD64, it does too, for normal accesses to
> naturally aligned addresses in write-back cacheable memory.)
> 
> Oh, that's interesting.  Way back when I guess we knew writes were in
> order and it wasn't explicit that reads were, hence smp_rmb() using a
> locked atomic.
> 
> Here is a post by Nick Piggin from 2007 with links to Intel _and_ AMD
> documents asserting that reads to cacheable memory are in program order:
> 
>     http://lkml.org/lkml/2007/9/28/212
>     Subject: [patch] x86: improved memory barrier implementation
> 
> Links to documents:
> 
>     http://developer.intel.com/products/processor/manuals/318147.pdf
>     http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
> 
> The Intel link doesn't work any more, but the AMD one does.

It might have been merged into their development manual now.

 
> Nick asserts "both manufacturers are committed to in-order loads from
> cacheable memory for the x86 architecture".

At the time we did ask Intel and AMD engineers. We talked with Andy
Glew from Intel I believe, but I can't recall the AMD contact.
Linus was involved in the discussions as well. We tried to do the
right thing with this.

> I have just read the AMD document, and it is in there (but not
> completely obviously), in section 7.2.  The implicit load-load and
> store-store barriers are only guaranteed for "normal cacheable
> accesses on naturally aligned boundaries to WB [write-back cacheable]
> memory".  There are also implicit load-store barriers but not
> store-load.
> 
> Note that the document covers AMD64; it does not say anything about
> their (now old) 32-bit processors.

Hmm. Well it couldn't hurt to ask again. We've never seen any
problems yet, so I'm rather sure we're in the clear.

> 
> > [*] The most recent docs I have handy. Up to and including Core-2 Duo.
> 
> Are you sure the read ordering applies to 32-bit Intel and AMD CPUs too?
> 
> Many years ago, before 64-bit x86 existed, I recall discussions on
> LKML where it was made clear that stores were performed in program
> order.  If it were known at the time that loads were performed in
> program order on 32-bit x86s, I would have expected that to have been
> mentioned by someone.

The way it was explained to us by the Intel engineer is that they
had implemented only visibly in-order loads, but they wanted to keep
their options open in the future, so they did not want to commit to
in-order loads as an ISA feature.

So when the whitepaper was released we got their blessing to
retroactively apply the rules to previous CPUs.
Avi Kivity March 11, 2010, 6:33 a.m. UTC | #38
On 03/10/2010 07:41 PM, Paul Brook wrote:
>>> You're much better off using a bulk-data transfer API that relaxes
>>> coherency requirements.  IOW, shared memory doesn't make sense for TCG
>>>        
>> Rather, tcg doesn't make sense for shared memory smp.  But we knew that
>> already.
>>      
> I think TCG SMP is a hard, but soluble problem, especially when you're
> running guests used to coping with NUMA.
>    

Do you mean by using a per-cpu tlb?  These kinds of solutions are 
generally slow, but tcg's slowness may mask this out.

> TCG interacting with third parties via shared memory is probably never going
> to make sense.
>    

The third party in this case is qemu.
Avi Kivity March 11, 2010, 6:49 a.m. UTC | #39
On 03/10/2010 06:36 PM, Cam Macdonell wrote:
> On Wed, Mar 10, 2010 at 2:21 AM, Avi Kivity<avi@redhat.com>  wrote:
>    
>> On 03/09/2010 08:34 PM, Cam Macdonell wrote:
>>      
>>> On Tue, Mar 9, 2010 at 10:28 AM, Avi Kivity<avi@redhat.com>    wrote:
>>>
>>>        
>>>> On 03/09/2010 05:27 PM, Cam Macdonell wrote:
>>>>
>>>>          
>>>>>
>>>>>            
>>>>>>
>>>>>>              
>>>>>>>   Registers are used
>>>>>>> for synchronization between guests sharing the same memory object when
>>>>>>> interrupts are supported (this requires using the shared memory
>>>>>>> server).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>                
>>>>>> How does the driver detect whether interrupts are supported or not?
>>>>>>
>>>>>>
>>>>>>              
>>>>> At the moment, the VM ID is set to -1 if interrupts aren't supported,
>>>>> but that may not be the clearest way to do things.  With UIO is there
>>>>> a way to detect if the interrupt pin is on?
>>>>>
>>>>>
>>>>>            
>>>> I suggest not designing the device to uio.  Make it a good
>>>> guest-independent
>>>> device, and if uio doesn't fit it, change it.
>>>>
>>>> Why not support interrupts unconditionally?  Is the device useful without
>>>> interrupts?
>>>>
>>>>          
>>> Currently my patch works with or without the shared memory server.  If
>>> you give the parameter
>>>
>>> -ivshmem 256,foo
>>>
>>> then this will create (if necessary) and map /dev/shm/foo as the
>>> shared region without interrupt support.  Some users of shared memory
>>> are using it this way.
>>>
>>> Going forward we can require the shared memory server and always have
>>> interrupts enabled.
>>>
>>>        
>> Can you explain how they synchronize?  Polling?  Using the network?  Using
>> it as a shared cache?
>>
>> If it's a reasonable use case it makes sense to keep it.
>>
>>      
> Do you mean how they synchronize without interrupts?  One project I've
> been contacted about uses the shared region directly for
> synchronization for simulations running in different VMs that share
> data in the memory region.  In my tests spinlocks in the shared region
> work between guests.
>    

I see.

> If we want to keep the serverless implementation, do we need to
> support shm_open with -chardev somehow? Something like -chardev
> shm,name=foo.  Right now my qdev implementation just passes the name
> to the -device option and opens it.
>    

I think using the file name is fine.

>> Another thing comes to mind - a shared memory ID, in case a guest has
>> multiple cards.
>>      
> Sure, a number that can be passed on the command-line and stored in a register?
>    

Yes.  NICs use the MAC address and storage uses the disk serial number; 
this is the same thing for shared memory.
Avi Kivity March 11, 2010, 6:50 a.m. UTC | #40
On 03/10/2010 04:04 PM, Arnd Bergmann wrote:
> On Tuesday 09 March 2010, Cam Macdonell wrote:
>    
>>> We could make the masking in RAM, not in registers, like virtio, which would
>>> require no exits.  It would then be part of the application specific
>>> protocol and out of scope of of this spec.
>>>
>>>        
>> This kind of implementation would be possible now since with UIO it's
>> up to the application whether to mask interrupts or not and what
>> interrupts mean.  We could leave the interrupt mask register for those
>> who want that behaviour.  Arnd's idea would remove the need for the
>> Doorbell and Mask, but we will always need at least one MMIO register
>> to send whatever interrupts we do send.
>>      
> You'd also have to be very careful if the notification is in RAM to
> avoid races between one guest triggering an interrupt and another
> guest clearing its interrupt mask.
>
> A totally different option that avoids this whole problem would
> be to separate the signalling from the shared memory, making the
> PCI shared memory device a trivial device with a single memory BAR,
> and using something a higher-level concept like a virtio based
> serial line for the actual signalling.
>    

That would be much slower.  The current scheme allows for an 
ioeventfd/irqfd short circuit which allows one guest to interrupt 
another without involving their qemus at all.
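(The primitive underneath is just an 8-byte write on an eventfd; a
minimal standalone illustration of that part, leaving out the actual
KVM ioeventfd/irqfd plumbing and error checking:)

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(void)
    {
        int efd = eventfd(0, 0);
        uint64_t v = 1;

        write(efd, &v, sizeof(v));   /* sender "rings the doorbell" */
        read(efd, &v, sizeof(v));    /* receiver picks up the count */
        printf("got %llu\n", (unsigned long long)v);
        return 0;
    }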
Paul Brook March 11, 2010, 12:21 p.m. UTC | #41
> On 03/10/2010 07:41 PM, Paul Brook wrote:
> >>> You're much better off using a bulk-data transfer API that relaxes
> >>> coherency requirements.  IOW, shared memory doesn't make sense for TCG
> >>
> >> Rather, tcg doesn't make sense for shared memory smp.  But we knew that
> >> already.
> >
> > I think TCG SMP is a hard, but soluble problem, especially when you're
> > running guests used to coping with NUMA.
> 
> Do you mean by using a per-cpu tlb?  These kinds of solutions are
> generally slow, but tcg's slowness may mask this out.

Yes.

> > TCG interacting with third parties via shared memory is probably never
> > going to make sense.
> 
> The third party in this case is qemu.

Maybe. But it's a different instance of qemu, and once this feature exists I 
bet people will use it for other things.

Paul
Arnd Bergmann March 11, 2010, 12:57 p.m. UTC | #42
On Thursday 11 March 2010, Avi Kivity wrote:
> > A totally different option that avoids this whole problem would
> > be to separate the signalling from the shared memory, making the
> > PCI shared memory device a trivial device with a single memory BAR,
> > and using something a higher-level concept like a virtio based
> > serial line for the actual signalling.
> >    
> 
> That would be much slower.  The current scheme allows for an 
> ioeventfd/irqfd short circuit which allows one guest to interrupt 
> another without involving their qemus at all.

Yes, the serial line approach would be much slower, but my point
was that we can do signaling over "something else", which could
well be something building on irqfd.

	Arnd
Avi Kivity March 11, 2010, 1:07 p.m. UTC | #43
On 03/11/2010 02:57 PM, Arnd Bergmann wrote:
> On Thursday 11 March 2010, Avi Kivity wrote:
>    
>>> A totally different option that avoids this whole problem would
>>> be to separate the signalling from the shared memory, making the
>>> PCI shared memory device a trivial device with a single memory BAR,
>>> and using something a higher-level concept like a virtio based
>>> serial line for the actual signalling.
>>>
>>>        
>> That would be much slower.  The current scheme allows for an
>> ioeventfd/irqfd short circuit which allows one guest to interrupt
>> another without involving their qemus at all.
>>      
> Yes, the serial line approach would be much slower, but my point
> was that we can do signaling over "something else", which could
> well be something building on irqfd.
>    

Well, we could, but it seems to make things more complicated?  A card 
with shared memory, and another card with an interrupt interconnect?
Arnd Bergmann March 11, 2010, 2:32 p.m. UTC | #44
On Thursday 11 March 2010, Avi Kivity wrote:
> >> That would be much slower.  The current scheme allows for an
> >> ioeventfd/irqfd short circuit which allows one guest to interrupt
> >> another without involving their qemus at all.
> >>      
> > Yes, the serial line approach would be much slower, but my point
> > was that we can do signaling over "something else", which could
> > well be something building on irqfd.
> 
> Well, we could, but it seems to make things more complicated?  A card 
> with shared memory, and another card with an interrupt interconnect?

Yes, I agree that it's more complicated if you have a specific application
in mind that needs one of each, and most use cases that want shared memory
also need an interrupt mechanism, but it's not always the case:

- You could use ext2 with -o xip on a private mapping of a shared host file
in order to share the page cache. This does not need any interrupts.

- If you have more than two parties sharing the segment, there are different
ways to communicate, e.g. always send an interrupt to all others, or have
dedicated point-to-point connections. There is also some complexity in
trying to cover all possible cases in one driver.

I have to say that I also really like the idea of futex over shared memory,
which could potentially make this all a lot simpler. I don't know how this
would best be implemented on the host though.

	Arnd
malc March 11, 2010, 2:38 p.m. UTC | #45
On Thu, 11 Mar 2010, Nick Piggin wrote:

> On Thu, Mar 11, 2010 at 03:10:47AM +0000, Jamie Lokier wrote:
> > Paul Brook wrote:
> > > > > In a cross environment that becomes extremely hairy.  For example the x86
> > > > > architecture effectively has an implicit write barrier before every
> > > > > store, and an implicit read barrier before every load.
> > > > 
> > > > Btw, x86 doesn't have any implicit barriers due to ordinary loads.
> > > > Only stores and atomics have implicit barriers, afaik.
> > > 
> > > As of March 2009[1] Intel guarantees that memory reads occur in
> > > order (they may only be reordered relative to writes). It appears
> > > AMD do not provide this guarantee, which could be an interesting
> > > problem for heterogeneous migration..
> > 
> > (Summary: At least on AMD64, it does too, for normal accesses to
> > naturally aligned addresses in write-back cacheable memory.)
> > 
> > Oh, that's interesting.  Way back when I guess we knew writes were in
> > order and it wasn't explicit that reads were, hence smp_rmb() using a
> > locked atomic.
> > 
> > Here is a post by Nick Piggin from 2007 with links to Intel _and_ AMD
> > documents asserting that reads to cacheable memory are in program order:
> > 
> >     http://lkml.org/lkml/2007/9/28/212
> >     Subject: [patch] x86: improved memory barrier implementation
> > 
> > Links to documents:
> > 
> >     http://developer.intel.com/products/processor/manuals/318147.pdf
> >     http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
> > 
> > The Intel link doesn't work any more, but the AMD one does.
> 
> It might have been merged into their development manual now.

It was (http://www.intel.com/products/processor/manuals/):

Intel® 64 Architecture Memory Ordering White Paper

This document has been merged into Volume 3A of Intel 64 and IA-32 
Architectures Software Developer's Manual.

[..snip..]
diff mbox

Patch

diff --git a/Makefile.target b/Makefile.target
index 82caf20..921dc74 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -217,6 +217,9 @@  obj-y += pcnet.o
 obj-y += rtl8139.o
 obj-y += e1000.o
 
+# Inter-VM PCI shared memory
+obj-y += ivshmem.o
+
 # Hardware support
 obj-i386-y = ide/core.o ide/qdev.o ide/isa.o ide/pci.o ide/piix.o
 obj-i386-y += pckbd.o $(sound-obj-y) dma.o
diff --git a/hw/ivshmem.c b/hw/ivshmem.c
new file mode 100644
index 0000000..aa88c07
--- /dev/null
+++ b/hw/ivshmem.c
@@ -0,0 +1,561 @@ 
+/*
+ * Inter-VM Shared Memory PCI device.
+ *
+ * Author:
+ *      Cam Macdonell <cam@cs.ualberta.ca>
+ *
+ * Based On: cirrus_vga.c and rtl8139.c
+ *
+ * This code is licensed under the GNU GPL v2.
+ */
+
+#include "hw.h"
+#include "console.h"
+#include "pc.h"
+#include "pci.h"
+#include "sysemu.h"
+
+#include "qemu-common.h"
+#include <sys/mman.h>
+#include <sys/socket.h>
+
+#define PCI_COMMAND_IOACCESS                0x0001
+#define PCI_COMMAND_MEMACCESS               0x0002
+#define PCI_COMMAND_BUSMASTER               0x0004
+
+#define DEBUG_IVSHMEM
+#define MAX_EVENT_FDS 16
+
+#ifdef DEBUG_IVSHMEM
+#define IVSHMEM_DPRINTF(fmt, args...)        \
+    do {printf("IVSHMEM: " fmt, ##args); } while (0)
+#else
+#define IVSHMEM_DPRINTF(fmt, args...)
+#endif
+
+#define BROADCAST_VAL ((1 << 8) - 1)
+
+typedef struct IVShmemState {
+    uint16_t intrmask;
+    uint16_t intrstatus;
+    uint16_t doorbell;
+
+    PCIDevice *pci_dev;
+    CharDriverState * chr;
+    CharDriverState * eventfd_chr;
+    int ivshmem_mmio_io_addr;
+
+    uint8_t *ivshmem_ptr;
+    unsigned long ivshmem_offset;
+    unsigned int ivshmem_size;
+    int shm_fd; /* shared memory file descriptor */
+
+    int eventfds[16]; /* for now we have a limit of 16 inter-connected guests */
+    int eventfd_posn;
+    uint16_t eventfd_bitvec;
+    int num_eventfds;
+} IVShmemState;
+
+typedef struct PCI_IVShmemState {
+    PCIDevice dev;
+    IVShmemState ivshmem_state;
+} PCI_IVShmemState;
+
+typedef struct IVShmemDesc {
+    char * chrdev;
+    int size;
+} IVShmemDesc;
+
+/* registers for the Inter-VM shared memory device */
+enum ivshmem_registers {
+    IntrMask = 0,
+    IntrStatus = 2,
+    Doorbell = 4,
+    IVPosition = 6,
+    IVLiveList = 8
+};
+
+static int num_ivshmem_devices = 0;
+static IVShmemDesc ivshmem_desc;
+
+static void ivshmem_map(PCIDevice *pci_dev, int region_num,
+                    pcibus_t addr, pcibus_t size, int type)
+{
+    PCI_IVShmemState *d = (PCI_IVShmemState *)pci_dev;
+    IVShmemState *s = &d->ivshmem_state;
+
+    IVSHMEM_DPRINTF("addr = %u size = %u\n", (uint32_t)addr, (uint32_t)size);
+    cpu_register_physical_memory(addr, s->ivshmem_size, s->ivshmem_offset);
+
+}
+
+void ivshmem_init(const char * optarg) {
+
+    char * temp;
+    char * ivshmem_sz;
+    int size;
+
+    num_ivshmem_devices++;
+
+    /* currently we only support 1 device */
+    if (num_ivshmem_devices > MAX_IVSHMEM_DEVICES) {
+        return;
+    }
+
+    temp = strdup(optarg);
+
+    ivshmem_sz=strsep(&temp,",");
+
+    if (ivshmem_sz != NULL) {
+        size = atol(ivshmem_sz);
+    } else {
+        size = -1;
+    }
+
+    ivshmem_desc.chrdev = strsep(&temp,"\0");
+
+    if ( size == -1) {
+        ivshmem_desc.size = TARGET_PAGE_SIZE;
+    } else {
+        ivshmem_desc.size = size*1024*1024;
+    }
+
+}
+
+int ivshmem_get_size(void) {
+    return ivshmem_desc.size;
+}
+
+static void broadcast_eventfds(int val, IVShmemState *s)
+{
+
+    int dest = val >> 4;
+    u_int64_t writelong = val & 0xff;
+
+    for (dest = 1; dest < s->num_eventfds; dest++) {
+
+        if (s->eventfds[dest] != -1) {
+            IVSHMEM_DPRINTF("Writing %ld to VM %d\n", writelong, dest);
+            if (write(s->eventfds[dest], &(writelong), 8) != 8)
+                IVSHMEM_DPRINTF("error writing to eventfd\n");
+        }
+
+    }
+
+}
+
+/* accessing registers - based on rtl8139 */
+static void ivshmem_update_irq(IVShmemState *s)
+{
+    int isr;
+    isr = (s->intrstatus & s->intrmask) & 0xffff;
+
+    /* don't print ISR resets */
+    if (isr) {
+        IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
+           isr ? 1 : 0, s->intrstatus, s->intrmask);
+    }
+
+    qemu_set_irq(s->pci_dev->irq[0], (isr != 0));
+}
+
+static void ivshmem_mmio_map(PCIDevice *pci_dev, int region_num,
+                       pcibus_t addr, pcibus_t size, int type)
+{
+    PCI_IVShmemState *d = (PCI_IVShmemState *)pci_dev;
+    IVShmemState *s = &d->ivshmem_state;
+
+    cpu_register_physical_memory(addr + 0, 0x100, s->ivshmem_mmio_io_addr);
+}
+
+static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val)
+{
+    IVSHMEM_DPRINTF("IntrMask write(w) val = 0x%04x\n", val);
+
+    s->intrmask = val;
+
+    ivshmem_update_irq(s);
+}
+
+static uint32_t ivshmem_IntrMask_read(IVShmemState *s)
+{
+    uint32_t ret = s->intrmask;
+
+    IVSHMEM_DPRINTF("intrmask read(w) val = 0x%04x\n", ret);
+
+    return ret;
+}
+
+static void ivshmem_IntrStatus_write(IVShmemState *s, uint32_t val)
+{
+    IVSHMEM_DPRINTF("IntrStatus write(w) val = 0x%04x\n", val);
+
+    s->intrstatus = val;
+
+    ivshmem_update_irq(s);
+    return;
+}
+
+static uint32_t ivshmem_IntrStatus_read(IVShmemState *s)
+{
+    uint32_t ret = s->intrstatus;
+
+    /* reading ISR clears all interrupts */
+    s->intrstatus = 0;
+
+    ivshmem_update_irq(s);
+
+    return ret;
+}
+
+static void ivshmem_io_writew(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVShmemState *s = opaque;
+
+    // 32-bits are written to the address
+    int dest = val >> 8;
+    u_int64_t writelong = val & 0xff;
+
+    IVSHMEM_DPRINTF("writing 0x%x to 0x%lx\n", addr, (unsigned long) opaque);
+
+    addr &= 0xfe;
+
+    switch (addr)
+    {
+        case IntrMask:
+            ivshmem_IntrMask_write(s, val);
+            break;
+
+        case IntrStatus:
+            ivshmem_IntrStatus_write(s, val);
+            break;
+
+        case Doorbell:
+
+            if (dest == BROADCAST_VAL) {
+                broadcast_eventfds(val, s);
+            } else if (dest <= s->num_eventfds) {
+                IVSHMEM_DPRINTF("Writing %ld to VM %d\n", writelong, dest);
+                if (write(s->eventfds[dest], &(writelong), 8) != 8)
+                    IVSHMEM_DPRINTF("error writing to eventfd\n");
+            } else {
+                IVSHMEM_DPRINTF("Invalid %ld to VM %d\n", writelong, dest);
+            }
+
+            break;
+       default:
+            IVSHMEM_DPRINTF("why are we writing 0x%x\n", addr);
+    }
+}
+
+static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVSHMEM_DPRINTF("We shouldn't be writing longs\n");
+}
+
+static void ivshmem_io_writeb(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVSHMEM_DPRINTF("We shouldn't be writing bytes\n");
+}
+
+static uint32_t ivshmem_io_readw(void *opaque, uint8_t addr)
+{
+
+    IVShmemState *s = opaque;
+    uint32_t ret;
+
+    switch (addr)
+    {
+        case IntrMask:
+            ret = ivshmem_IntrMask_read(s);
+            break;
+        case IntrStatus:
+            ret = ivshmem_IntrStatus_read(s);
+            break;
+
+        case IVPosition:
+            /* return my id in the ivshmem list */
+            ret = s->eventfd_posn;
+            break;
+        case IVLiveList:
+            /* return the list of live VMs id for ivshmem */
+            ret = s->eventfd_bitvec;
+            break;
+
+        default:
+            IVSHMEM_DPRINTF("why are we reading 0x%x\n", addr);
+            ret = 0;
+    }
+
+    return ret;
+}
+
+static uint32_t ivshmem_io_readl(void *opaque, uint8_t addr)
+{
+    IVSHMEM_DPRINTF("We shouldn't be reading longs\n");
+    return 0;
+}
+
+static uint32_t ivshmem_io_readb(void *opaque, uint8_t addr)
+{
+    IVSHMEM_DPRINTF("We shouldn't be reading bytes\n");
+
+    return 0;
+}
+
+static void ivshmem_mmio_writeb(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writeb(opaque, addr & 0xFF, val);
+}
+
+static void ivshmem_mmio_writew(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writew(opaque, addr & 0xFF, val);
+}
+
+static void ivshmem_mmio_writel(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writel(opaque, addr & 0xFF, val);
+}
+
+static uint32_t ivshmem_mmio_readb(void *opaque, target_phys_addr_t addr)
+{
+    return ivshmem_io_readb(opaque, addr & 0xFF);
+}
+
+static uint32_t ivshmem_mmio_readw(void *opaque, target_phys_addr_t addr)
+{
+    uint32_t val = ivshmem_io_readw(opaque, addr & 0xFF);
+    return val;
+}
+
+static uint32_t ivshmem_mmio_readl(void *opaque, target_phys_addr_t addr)
+{
+    uint32_t val = ivshmem_io_readl(opaque, addr & 0xFF);
+    return val;
+}
+
+static CPUReadMemoryFunc *ivshmem_mmio_read[3] = {
+    ivshmem_mmio_readb,
+    ivshmem_mmio_readw,
+    ivshmem_mmio_readl,
+};
+
+static CPUWriteMemoryFunc *ivshmem_mmio_write[3] = {
+    ivshmem_mmio_writeb,
+    ivshmem_mmio_writew,
+    ivshmem_mmio_writel,
+};
+
+
+static void ivshmem_receive(void *opaque, const uint8_t *buf, int size)
+{
+    IVShmemState *s = opaque;
+
+    ivshmem_IntrStatus_write(s, *buf);
+
+    IVSHMEM_DPRINTF("ivshmem_receive 0x%02x\n", *buf);
+}
+
+static void ivshmem_event(void *opaque, int event)
+{
+//    IVShmemState *s = opaque;
+    IVSHMEM_DPRINTF("ivshmem_event %d\n", event);
+}
+
+static int ivshmem_can_receive(void * opaque)
+{
+    return 8;
+}
+
+static CharDriverState* create_eventfd_chr_device(void * opaque, int eventfd)
+{
+    // create a event character device based on the passed eventfd
+    IVShmemState *s = opaque;
+    CharDriverState * chr;
+
+    chr = qemu_chr_open_eventfd(eventfd);
+
+    if (chr == NULL) {
+        IVSHMEM_DPRINTF("creating eventfd for eventfd %d failed\n", eventfd);
+        exit(-1);
+    }
+
+    qemu_chr_add_handlers(chr, ivshmem_can_receive, ivshmem_receive,
+                      ivshmem_event, s);
+
+    return chr;
+
+}
+
+static int check_shm_size(IVShmemState *s, int shmemfd) {
+    /* check that the guest isn't going to try and map more memory than the
+     * card server allocated return -1 to indicate error */
+
+    struct stat buf;
+
+    fstat(shmemfd, &buf);
+
+    if (s->ivshmem_size > buf.st_size) {
+        fprintf(stderr, "IVSHMEM ERROR: Requested memory size greater");
+        fprintf(stderr, " than shared object size (%d > %ld)\n",
+                                          s->ivshmem_size, buf.st_size);
+        return -1;
+    } else {
+        return 0;
+    }
+}
+
+static void create_shared_memory_BAR(IVShmemState *s, int fd) {
+
+    s->shm_fd = fd;
+
+    s->ivshmem_offset = qemu_add_file_to_ram(s->shm_fd, s->ivshmem_size,
+             MAP_SHARED);
+
+    s->ivshmem_ptr = qemu_get_ram_ptr(s->ivshmem_offset);
+
+    /* region for shared memory */
+    pci_register_bar(s->pci_dev, 1, ivshmem_desc.size,
+                       PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
+}
+
+static void ivshmem_read(void *opaque, const uint8_t * buf, int flags)
+{
+    IVShmemState *s = opaque;
+    int incoming_fd, tmp_fd;
+    long incoming_posn;
+
+    memcpy(&incoming_posn, buf, sizeof(long));
+    /* pick off s->chr->msgfd and store it, posn should accompany msg */
+    tmp_fd = qemu_chr_get_msgfd(s->chr);
+    IVSHMEM_DPRINTF("posn is %ld, fd is %d\n", incoming_posn, tmp_fd);
+
+    if (tmp_fd == -1) {
+        s->eventfd_posn = incoming_posn;
+        return;
+    }
+
+    /* because of the implementation of get_msgfd, we need a dup */
+    incoming_fd = dup(tmp_fd);
+
+    /* if the position is -1, then it's shared memory fd */
+    if (incoming_posn == -1) {
+
+        s->eventfd_bitvec = 0;
+        s->num_eventfds = 0;
+
+        if (check_shm_size(s, incoming_fd) == -1) {
+            exit(-1);
+        }
+
+        /* creating a BAR in qemu_chr callback may be crazy */
+        create_shared_memory_BAR(s, incoming_fd);
+
+       return;
+    }
+
+    /* this is an eventfd for a particular guest VM */
+    IVSHMEM_DPRINTF("eventfds[%ld] = %d\n", incoming_posn, incoming_fd);
+    s->eventfds[incoming_posn] = incoming_fd;
+
+    /* bitmap to keep track of live VMs */
+    s->eventfd_bitvec |= 1 << incoming_posn;
+
+    /* keep track of the maximum VM ID */
+    if (incoming_posn > s->num_eventfds) {
+        s->num_eventfds = incoming_posn;
+    }
+
+    /* initialize char device for callback on my eventfd */
+    if (incoming_posn == s->eventfd_posn) {
+        s->eventfd_chr = create_eventfd_chr_device(s, s->eventfds[s->eventfd_posn]);
+    }
+
+    return;
+}
+
+int pci_ivshmem_init(PCIBus *bus)
+{
+    PCI_IVShmemState *d;
+    IVShmemState *s;
+    uint8_t *pci_conf;
+
+    d = (PCI_IVShmemState *)pci_register_device(bus, "kvm_ivshmem",
+                                           sizeof(PCI_IVShmemState),
+                                           -1, NULL, NULL);
+    if (!d) {
+        return -1;
+    }
+
+    s = &d->ivshmem_state;
+
+    s->pci_dev = &d->dev;
+    pci_conf = d->dev.config;
+    pci_conf[0x00] = 0xf4; // Qumranet vendor ID 0x5002
+    pci_conf[0x01] = 0x1a;
+    pci_conf[0x02] = 0x10;
+    pci_conf[0x03] = 0x11;
+    pci_conf[0x04] = PCI_COMMAND_IOACCESS | PCI_COMMAND_MEMACCESS;
+    pci_conf[0x0a] = 0x00; // RAM controller
+    pci_conf[0x0b] = 0x05;
+    pci_conf[0x0e] = 0x00; // header_type
+
+    pci_conf[PCI_INTERRUPT_PIN] = 1; // we are going to support interrupts
+
+    s->ivshmem_size = ivshmem_desc.size;
+
+    if (ivshmem_desc.chrdev != NULL) {
+        if (strncmp(ivshmem_desc.chrdev, "unix:", 5) == 0) {
+
+            /* if we get a UNIX socket as the parameter we will talk
+             * to the ivshmem server*/
+            char label[32];
+
+            s->eventfd_posn = -1;
+
+            snprintf(label, 32, "ivshmem_chardev");
+            s->chr = qemu_chr_open(label, ivshmem_desc.chrdev, NULL);
+            if (s->chr == NULL) {
+                fprintf(stderr, "No server listening on %s\n",
+                                                        ivshmem_desc.chrdev);
+                exit(-1);
+            }
+
+            qemu_chr_add_handlers(s->chr, ivshmem_can_receive, ivshmem_read,
+                              ivshmem_event, s);
+
+        } else {
+            /* just map the file, we're not using a server */
+            int fd;
+
+            if ((fd = shm_open(ivshmem_desc.chrdev, O_CREAT|O_RDWR,
+                            S_IRWXU|S_IRWXG|S_IRWXO)) < 0) {
+                fprintf(stderr, "kvm_shmem: could not open shared file\n");
+                exit(-1);
+            }
+
+            /* mmap onto PCI device's memory */
+            if (ftruncate(fd, ivshmem_desc.size) != 0) {
+                fprintf(stderr, "kvm_ivshmem: could not truncate shared file\n");
+            }
+
+            create_shared_memory_BAR(s, fd);
+
+        }
+    }
+
+    IVSHMEM_DPRINTF("shared memory size is = %d\n", ivshmem_desc.size);
+
+    s->ivshmem_mmio_io_addr = cpu_register_io_memory(ivshmem_mmio_read,
+                                    ivshmem_mmio_write, s);
+    /* region for registers*/
+    pci_register_bar(&d->dev, 0, 0x100,
+                           PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_mmio_map);
+
+    return 0;
+}
+
diff --git a/hw/pc.c b/hw/pc.c
index 0aebae9..073402a 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -90,6 +90,8 @@  struct e820_table {
 
 static struct e820_table e820_table;
 
+extern int ivshmem_enabled;
+
 typedef struct isa_irq_state {
     qemu_irq *i8259;
     qemu_irq *ioapic;
@@ -965,6 +967,10 @@  static void pc_init1(ram_addr_t ram_size,
         }
     }
 
+    if (pci_enabled && ivshmem_enabled) {
+        pci_ivshmem_init(pci_bus);
+    }
+
     rtc_state = rtc_init(2000);
 
     qemu_register_boot_set(pc_boot_set, rtc_state);
diff --git a/hw/pc.h b/hw/pc.h
index 88b5841..36b214c 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -167,6 +167,9 @@  void isa_ne2000_init(int base, int irq, NICInfo *nd);
 
 void extboot_init(BlockDriverState *bs);
 
+/* ivshmem.c */
+int pci_ivshmem_init(PCIBus *bus);
+
 int cpu_is_bsp(CPUState *env);
 
 /* e820 types */
diff --git a/qemu-char.c b/qemu-char.c
index 4169492..bff911c 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -2066,6 +2066,12 @@  static void tcp_chr_read(void *opaque)
     }
 }
 
+CharDriverState *qemu_chr_open_eventfd(int eventfd){
+
+    return qemu_chr_open_fd(eventfd, eventfd);
+
+}
+
 static void tcp_chr_connect(void *opaque)
 {
     CharDriverState *chr = opaque;
diff --git a/qemu-char.h b/qemu-char.h
index bcc0766..9a0d2c0 100644
--- a/qemu-char.h
+++ b/qemu-char.h
@@ -93,6 +93,9 @@  void qemu_chr_info_print(Monitor *mon, const QObject *ret_data);
 void qemu_chr_info(Monitor *mon, QObject **ret_data);
 CharDriverState *qemu_chr_find(const char *name);
 
+/* add an eventfd to the qemu devices that are polled */
+CharDriverState *qemu_chr_open_eventfd(int eventfd);
+
 extern int term_escape_char;
 
 /* async I/O support */
diff --git a/qemu-options.hx b/qemu-options.hx
index 9683d09..6a37083 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1679,6 +1679,18 @@  STEXI
 Setup monitor on chardev @var{name}.
 ETEXI
 
+DEF("ivshmem", HAS_ARG, QEMU_OPTION_ivshmem, \
+    "-ivshmem size,unix:file connects to shared memory PCI card server \
+    listening on unix domain socket 'path' of at least 'size' (in MB) and \
+    exposes it as a PCI device in the guest\n")
+STEXI
+@item -ivshmem @var{size},unix:@var{file}
+Connects to a shared memory PCI server at UDS @var{file} that shares a POSIX
+shared object of size @var{size} and creates a PCI device of the same size that
+maps the shared object (received from the server) into the device for guests to
+access.  The server also sends eventfds to the guest to support interrupts.
+ETEXI
+
 DEF("debugcon", HAS_ARG, QEMU_OPTION_debugcon, \
     "-debugcon dev   redirect the debug console to char device 'dev'\n")
 STEXI
diff --git a/sysemu.h b/sysemu.h
index 01405e1..ace4645 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -241,6 +241,14 @@  extern CharDriverState *serial_hds[MAX_SERIAL_PORTS];
 
 extern CharDriverState *parallel_hds[MAX_PARALLEL_PORTS];
 
+/* inter-VM shared memory devices */
+
+#define MAX_IVSHMEM_DEVICES 1
+
+extern CharDriverState * ivshmem_chardev;
+void ivshmem_init(const char * optarg);
+int ivshmem_get_size(void);
+
 #define TFR(expr) do { if ((expr) != -1) break; } while (errno == EINTR)
 
 #ifdef HAS_AUDIO
diff --git a/vl.c b/vl.c
index 8847b70..01ca13b 100644
--- a/vl.c
+++ b/vl.c
@@ -194,6 +194,7 @@  int autostart;
 static int rtc_utc = 1;
 static int rtc_date_offset = -1; /* -1 means no change */
 QEMUClock *rtc_clock;
+int ivshmem_enabled = 0;
 int vga_interface_type = VGA_NONE;
 #ifdef TARGET_SPARC
 int graphic_width = 1024;
@@ -212,6 +213,8 @@  int no_quit = 0;
 CharDriverState *serial_hds[MAX_SERIAL_PORTS];
 CharDriverState *parallel_hds[MAX_PARALLEL_PORTS];
 CharDriverState *virtcon_hds[MAX_VIRTIO_CONSOLES];
+CharDriverState *ivshmem_chardev;
+const char * ivshmem_device;
 #ifdef TARGET_I386
 int win2k_install_hack = 0;
 int rtc_td_hack = 0;
@@ -4953,6 +4956,8 @@  int main(int argc, char **argv, char **envp)
     kernel_cmdline = "";
     cyls = heads = secs = 0;
     translation = BIOS_ATA_TRANSLATION_AUTO;
+    ivshmem_device = NULL;
+    ivshmem_chardev = NULL;
 
     for (i = 0; i < MAX_NODES; i++) {
         node_mem[i] = 0;
@@ -5439,6 +5444,10 @@  int main(int argc, char **argv, char **envp)
                 add_device_config(DEV_PARALLEL, optarg);
                 default_parallel = 0;
                 break;
+            case QEMU_OPTION_ivshmem:
+                ivshmem_device = optarg;
+                ivshmem_enabled = 1;
+                break;
             case QEMU_OPTION_debugcon:
                 add_device_config(DEV_DEBUGCON, optarg);
                 break;
@@ -5956,6 +5965,10 @@  int main(int argc, char **argv, char **envp)
     if (ram_size == 0)
         ram_size = DEFAULT_RAM_SIZE * 1024 * 1024;
 
+    if (ivshmem_enabled) {
+        ivshmem_init(ivshmem_device);
+    }
+
     /* init the dynamic translator */
     cpu_exec_init_all(tb_size * 1024 * 1024);