diff mbox

[RFC,v2] virtio: add virtio-over-PCI driver

Message ID 20090224000002.GA578@ovro.caltech.edu (mailing list archive)
State Not Applicable, archived
Delegated to: Grant Likely
Headers show

Commit Message

Ira Snyder Feb. 24, 2009, midnight UTC
This adds support to Linux for using virtio between two computers linked by
a PCI interface. This allows the use of virtio_net to create a familiar,
fast interface for communication. It should be possible to use other virtio
devices in the future, but this has not been tested.

I have implemented guest support for the Freescale MPC8349EMDS board, which
is capable of running in PCI agent mode (It acts like a PCI card, but is a
complete computer system, running Linux). The driver is trivial to port to
any MPC83xx system.

It was developed to work in a CompactPCI crate of computers, one of which
is a standard x86 system (acting as the host) and many PowerPC systems
(acting as guests).

I have only tested this driver with a single board in my system. The host
is a 1066MHz Pentium3-M, and the guest is a 533MHz PowerPC. I am able
achieve transfer rates of about 150 mbit host->guest and 350 mbit
guest->host. A few tests showed that using an mtu of 4000 provided much
better results than an mtu of 1500. Using an mtu of 64000 significantly
dropped performance. The performance is equivalent to my PCINet driver for
host->guest, and about 20% faster for guest->host transfers.

I have included a short document explaining what I think is the most
complicated part of the driver: using the DMA engine to transfer data. I
hope everything else is readily obvious from the code. Questions are
welcome.

I will not be able to work on this full time for at least a few weeks, so I
would appreciate actual review of this driver.  Nitpicks are fine, I just
won't be able to respond to them quickly.

RFCv1 -> RFCv2:
  * fix major brokenness of host detach_buf()
  * support VIRTIO_NET_F_CSUM
  * support VIRTIO_NET_F_GSO
  * support VIRTIO_NET_F_MRG_RXBUF
  * rewrote DMA transfers to support merged rxbufs
  * added a hack to fix the endianness of virtio_net's metadata
  * lots more performance for guest->host transfers (~40MB/sec)
  * updated documentation
  * allocate 128 feature bits instead of 32

Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
---

Yes, the commit message has too much information. This is an RFC after
all. I fully expect to have to make changes. In fact, I posting this 
more to "get it out there" than anything else, since I have other tasks
that need doing.

I'd appreciate a serious review of the design by the people who have 
been pressuring me to use virtio. I'm very happy to answer any questions
you have.

Thanks to everyone who gave feedback for RFCv1!
Ira

 Documentation/virtio-over-PCI.txt     |   60 +
 arch/powerpc/boot/dts/mpc834x_mds.dts |    7 +
 drivers/virtio/Kconfig                |   22 +
 drivers/virtio/Makefile               |    2 +
 drivers/virtio/vop.h                  |  119 ++
 drivers/virtio/vop_fsl.c              | 2020 +++++++++++++++++++++++++++++++++
 drivers/virtio/vop_host.c             | 1071 +++++++++++++++++
 drivers/virtio/vop_hw.h               |   80 ++
 8 files changed, 3381 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/virtio-over-PCI.txt
 create mode 100644 drivers/virtio/vop.h
 create mode 100644 drivers/virtio/vop_fsl.c
 create mode 100644 drivers/virtio/vop_host.c
 create mode 100644 drivers/virtio/vop_hw.h

Comments

Arnd Bergmann Feb. 26, 2009, 4:15 p.m. UTC | #1
On Tuesday 24 February 2009, Ira Snyder wrote:
> This adds support to Linux for using virtio between two computers linked by
> a PCI interface. This allows the use of virtio_net to create a familiar,
> fast interface for communication. It should be possible to use other virtio
> devices in the future, but this has not been tested.

Wonderful, I like it a lot!

One major aspect that I hope can be improved is the layering
of the driver to make it easier to reuse parts for other
hardware implementations and also for sharing code between
the two sides. Most of my comments below are about this.

A better split I can imagine would be:

1. of_device hardware specific probing, and creation of virtqueues
2. pci hardware specific probing, and detection of virtqueues
3. library with common code, hardware independent
4. library with common code, hardware specific but used by both of_device
   and pci.
5. interface to virtio-net on top of that (symmetric)

> +/* Virtio-over-PCI descriptors: 12 bytes. These can chain together via "next" */
> +struct vop_desc {
> +	/* Address (host physical) */
> +	__le32 addr;
> +	/* Length (bytes) */
> +	__le32 len;
> +	/* Flags */
> +	__le16 flags;
> +	/* Chaining for descriptors */
> +	__le16 next;
> +} __attribute__((packed));

I would drop the "packed" attribute in the structure definitions.
It would imply that only byte accesses are allowed on these
data structures, because the attribute invalidates any assumptions
about alignment. None of your structures require padding, so
the attribute does not have any positive effect either.

> +/* MPC8349EMDS specific get_immrbase() */
> +#include <sysdev/fsl_soc.h>

Do you really need get_immrbase? I would expect that you can find
all the registers you need in the device tree, or exported from
other low-level drivers per subsystem.

immrbase is a concept from the time before our device trees.

> +/*
> + * These are internal use only versions of the structures that
> + * are exported over PCI by this driver
> + *
> + * They are used internally to keep track of the PowerPC queues so that
> + * we don't have to keep flipping endianness all the time
> + */
> +struct vop_loc_desc {
> +	u32 addr;
> +	u32 len;
> +	u16 flags;
> +	u16 next;
> +};
> +
> +struct vop_loc_avail {
> +	u16 index;
> +	u16 ring[VOP_RING_SIZE];
> +};
> +
> +struct vop_loc_used_elem {
> +	u32 id;
> +	u32 len;
> +};
> +
> +struct vop_loc_used {
> +	u16 index;
> +	struct vop_loc_used_elem ring[VOP_RING_SIZE];
> +};

Are you worried about the overhead of having to do byte flips,
or the code complexity? I would guess that the overhead is
near zero, but I'm not sure about the source code complexity.
Generally, I'd expect that you'd be better off just using the
wire-level data structures directly.

> +/*
> + * DMA Resolver state information
> + */
> +struct vop_dma_info {
> +	struct dma_chan *chan;
> +
> +	/* The currently processing avail entry */
> +	u16 loc_avail;
> +	u16 rem_avail;
> +
> +	/* The currently processing used entries */
> +	u16 loc_used;
> +	u16 rem_used;
> +};
> +
> +struct vop_vq {
> +
> +	/* The actual virtqueue itself */
> +	struct virtqueue vq;
> +	struct device *dev;
> +
> +	/* The host ring address */
> +	struct vop_host_ring __iomem *host;
> +
> +	/* The guest ring address */
> +	struct vop_guest_ring *guest;
> +
> +	/* Our own memory descriptors */
> +	struct vop_loc_desc desc[VOP_RING_SIZE];
> +	struct vop_loc_avail avail;
> +	struct vop_loc_used used;
> +	unsigned int flags;
> +
> +	/* Data tokens from add_buf() */
> +	void *data[VOP_RING_SIZE];
> +
> +	unsigned int num_free;	/* number of free descriptors in desc */
> +	unsigned int free_head;	/* start of the free descriptors in desc */
> +	unsigned int num_added;	/* number of entries added to desc */
> +
> +	u16 loc_last_used;	/* the last local used entry processed */
> +	u16 rem_last_used;	/* the current value of remote used_idx */
> +
> +	/* DMA resolver state */
> +	struct vop_dma_info dma;
> +	struct work_struct work;
> +	int (*resolve)(struct vop_vq *vq);
> +
> +	void __iomem *immr;
> +	int kick_val;
> +};

This data structure mixes generic information with fsl-834x specific
members. I think you should try to split this better into a common
part (also common for host and guest) to allow sharing the code
across other low-level implementations:

struct vop_vq {
	struct virtqueue vq;
	struct vop_host_ring __iomem *host;
	struct vop_guest_ring *guest;
	...
};

and in another file:

struct fsl834x_vq {
	struct vop_vq;
	struct fsl834x_vop_regs __iomem *regs; /* instead of immr */
}

If you split the structures this way, the abstraction should
come naturally.

> +/*
> + * This represents a virtio_device for our driver. It follows the memory
> + * layout shown above. It has pointers to all of the host and guest memory
> + * areas that we need to access
> + */
> +struct vop_vdev {
> +
> +	/* The specific virtio device (console, net, blk) */
> +	struct virtio_device vdev;
> +
> +	#define VOP_DEVICE_REGISTERED 1
> +	int status;
> +
> +	/* Start address of local and remote memory */
> +	void *loc;
> +	void __iomem *rem;
> +
> +	/*
> +	 * These are the status, feature, and configuration information
> +	 * for this virtio device. They are exposed in our memory block
> +	 * starting at offset 0.
> +	 */
> +	struct vop_status __iomem *host_status;
> +
> +	/*
> +	 * These are the status, feature, and configuration information
> +	 * for the guest virtio device. They are exposed in the guest
> +	 * memory block starting at offset 0.
> +	 */
> +	struct vop_status *guest_status;
> +
> +	/*
> +	 * These are the virtqueues for the virtio driver running this
> +	 * device to use. The host portions are exposed in our memory block
> +	 * starting at offset 1024. The exposed areas are aligned to 1024 byte
> +	 * boundaries, so they appear at offets 1024, 2048, and 3072
> +	 * respectively.
> +	 */
> +	struct vop_vq virtqueues[3];
> +};

Unfortunately, that structure layout implies an extra pointer level here:

	struct vop_vq *virtqueues[3];

I also wonder if the number of virtqueues should be variable here.

> +struct vop_dev {
> +
> +	struct of_device *op;
> +	struct device *dev;
> +
> +	/* Reset and start */
> +	struct mutex mutex;
> +	struct work_struct reset_work;
> +	struct work_struct start_work;
> +
> +	int irq;
> +
> +	/* Our board control registers */
> +	void __iomem *immr;
> +
> +	/* The guest memory, exposed at PCI BAR1 */
> +	#define VOP_GUEST_MEM_SIZE 16384
> +	void *guest_mem;
> +	dma_addr_t guest_mem_addr;
> +
> +	/* Host memory, given to us by host in OMR0 */
> +	#define VOP_HOST_MEM_SIZE 16384
> +	void __iomem *host_mem;
> +
> +	/* The virtio devices */
> +	struct vop_vdev devices[4];
> +	struct dma_chan *chan;
> +};

This one again is hardware specific, right? If so, it should go
together with what I call fsl834x_vq above.

> +/*----------------------------------------------------------------------------*/
> +/* Local descriptor ring access helpers                                       */
> +/*----------------------------------------------------------------------------*/
> +
> +static void vop_set_desc_addr(struct vop_vq *vq, unsigned int idx, u32 addr)
> +{
> +	vq->desc[idx].addr = addr;
> +}
> +
> +static void vop_set_desc_len(struct vop_vq *vq, unsigned int idx, u32 len)
> +{
> +	vq->desc[idx].len = len;
> +}
> +
> +static void vop_set_desc_flags(struct vop_vq *vq, unsigned int idx, u16 flags)
> +{
> +	vq->desc[idx].flags = flags;
> +}
> +
> +static void vop_set_desc_next(struct vop_vq *vq, unsigned int idx, u16 next)
> +{
> +	vq->desc[idx].next = next;
> +}
> +
> +static u16 vop_get_desc_flags(struct vop_vq *vq, unsigned int idx)
> +{
> +	return vq->desc[idx].flags;
> +}
> +
> +static u16 vop_get_desc_next(struct vop_vq *vq, unsigned int idx)
> +{
> +	return vq->desc[idx].next;
> +}

I don't quite get the point in these accessors. Calling one of these
functions would be longer than open-coding the content.

> +/*----------------------------------------------------------------------------*/
> +/* Scatterlist DMA helpers                                                    */
> +/*----------------------------------------------------------------------------*/
> +
> +/*
> + * This function abuses some of the scatterlist code and implements
> + * dma_map_sg() in such a way that we don't need to keep the scatterlist
> + * around in order to unmap it.
> + *
> + * It is also designed to never merge scatterlist entries, which is
> + * never what we want for virtio.
> + *
> + * When it is time to unmap the buffer, you can use dma_unmap_single() to
> + * unmap each entry in the chain. Get the address, length, and direction
> + * from the descriptors! (keep a local copy for speed)
> + */

Why is that an advantage over dma_unmap_sg?

> +static int vop_dma_map_sg(struct device *dev, struct scatterlist sg[],
> +			  unsigned int out, unsigned int in)
> +{
> +	dma_addr_t addr;
> +	enum dma_data_direction dir;
> +	struct scatterlist *start;
> +	unsigned int i, failure;
> +
> +	start = sg;
> +
> +	for (i = 0; i < out + in; i++) {
> +
> +		/* Check for scatterlist chaining abuse */
> +		BUG_ON(sg == NULL);
> +
> +		dir = (i < out) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
> +		addr = dma_map_single(dev, sg_virt(sg), sg->length, dir);
> +
> +		if (dma_mapping_error(dev, addr))
> +			goto unwind;
> +
> +		sg_dma_address(sg) = addr;
> +		sg = sg_next(sg);
> +	}

I believe this kind of loop can be simplified using for_each_sg().


> +	/* Remap IMMR */
> +	priv->immr = ioremap(get_immrbase(), 0x100000);
> +	if (!priv->immr) {
> +		dev_err(&op->dev, "Unable to remap IMMR registers\n");
> +		ret = -ENOMEM;
> +		goto out_dma_release_channel;
> +	}

As mentioned above, this should be something like an of_iomap(op, ...)

> +struct vop_vq {
> +
> +	/* The actual virtqueue itself */
> +	struct virtqueue vq;
> +
> +	struct device *dev;
> +
> +	/* The host ring address */
> +	struct vop_host_ring *host;
> +
> +	/* The guest ring address */
> +	struct vop_guest_ring __iomem *guest;
> +
> +	/* Local copy of the descriptors for fast access */
> +	struct vop_loc_desc desc[VOP_RING_SIZE];
> +
> +	/* The data token from add_buf() */
> +	void *data[VOP_RING_SIZE];
> +
> +	unsigned int num_free;
> +	unsigned int free_head;
> +	unsigned int num_added;
> +
> +	u16 avail_idx;
> +	u16 last_used_idx;
> +
> +	/* The doorbell to kick() */
> +	unsigned int kick_val;
> +	void __iomem *immr;
> +};

I find it very confusing to have almost-identical data structures by the
same name in two files. Obviously, you have a lot of common code between
the two sides, but rather than making the implementation files *look*
similar, it would be better to focus on splitting out the shared code
into a common file and keep the different/duplicated code small.

> +	switch (index) {
> +	case 0: /* x86 recv virtqueue -- ppc xmit virtqueue */
> +		vq->guest = vdev->rem + 1024;
> +		vq->host  = vdev->loc + 1024;
> +		break;
> +	case 1: /* x86 xmit virtqueue -- ppc recv virtqueue */
> +		vq->guest = vdev->rem + 2048;
> +		vq->host  = vdev->loc + 2048;
> +		break;
> +	default:
> +		dev_err(vq->dev, "unknown virtqueue %d\n", index);
> +		return ERR_PTR(-ENODEV);
> +	}

I'd avoid making assumptions or comments about the architectures.
Rather than "x86" and "ppc", I'd write "local" and "remote".

	Arnd <><
Geert Uytterhoeven Feb. 26, 2009, 4:53 p.m. UTC | #2
On Thu, 26 Feb 2009, Arnd Bergmann wrote:
> On Tuesday 24 February 2009, Ira Snyder wrote:
> > +/* Virtio-over-PCI descriptors: 12 bytes. These can chain together via "next" */
> > +struct vop_desc {
> > +	/* Address (host physical) */
> > +	__le32 addr;
               ^^^^
Only 32-bit? Is this future-proof?

> > +	/* Length (bytes) */
> > +	__le32 len;
> > +	/* Flags */
> > +	__le16 flags;
> > +	/* Chaining for descriptors */
> > +	__le16 next;
> > +} __attribute__((packed));
> 
> I would drop the "packed" attribute in the structure definitions.
> It would imply that only byte accesses are allowed on these
> data structures, because the attribute invalidates any assumptions
> about alignment. None of your structures require padding, so
> the attribute does not have any positive effect either.
> 
> > +/* MPC8349EMDS specific get_immrbase() */
> > +#include <sysdev/fsl_soc.h>
> 
> Do you really need get_immrbase? I would expect that you can find
> all the registers you need in the device tree, or exported from
> other low-level drivers per subsystem.
> 
> immrbase is a concept from the time before our device trees.
> 
> > +/*
> > + * These are internal use only versions of the structures that
> > + * are exported over PCI by this driver
> > + *
> > + * They are used internally to keep track of the PowerPC queues so that
> > + * we don't have to keep flipping endianness all the time
> > + */
> > +struct vop_loc_desc {
> > +	u32 addr;
        ^^^
Same here.

With kind regards,

Geert Uytterhoeven
Software Architect

Sony Techsoft Centre Europe
The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium

Phone:    +32 (0)2 700 8453
Fax:      +32 (0)2 700 8622
E-mail:   Geert.Uytterhoeven@sonycom.com
Internet: http://www.sony-europe.com/

A division of Sony Europe (Belgium) N.V.
VAT BE 0413.825.160 · RPR Brussels
Fortis · BIC GEBABEBB · IBAN BE41293037680010
Ira Snyder Feb. 26, 2009, 8:01 p.m. UTC | #3
On Thu, Feb 26, 2009 at 05:15:27PM +0100, Arnd Bergmann wrote:
> On Tuesday 24 February 2009, Ira Snyder wrote:
> > This adds support to Linux for using virtio between two computers linked by
> > a PCI interface. This allows the use of virtio_net to create a familiar,
> > fast interface for communication. It should be possible to use other virtio
> > devices in the future, but this has not been tested.
> 
> Wonderful, I like it a lot!
> 
> One major aspect that I hope can be improved is the layering
> of the driver to make it easier to reuse parts for other
> hardware implementations and also for sharing code between
> the two sides. Most of my comments below are about this.
> 
> A better split I can imagine would be:
> 
> 1. of_device hardware specific probing, and creation of virtqueues
> 2. pci hardware specific probing, and detection of virtqueues
> 3. library with common code, hardware independent
> 4. library with common code, hardware specific but used by both of_device
>    and pci.
> 5. interface to virtio-net on top of that (symmetric)
> 

I think so too. I was just getting something working, and thought it
would be better to have it "out there" rather than be working on it
forever. I'll try to break things up as I have time.

For the "libraries", would you suggest breaking things into seperate
code files, and using EXPORT_SYMBOL_GPL()? I'm not very familiar with
doing that, I've mostly been writing code within the existing device
driver frameworks. Or do I need export symbol at all? I'm not sure...

> > +/* Virtio-over-PCI descriptors: 12 bytes. These can chain together via "next" */
> > +struct vop_desc {
> > +	/* Address (host physical) */
> > +	__le32 addr;
> > +	/* Length (bytes) */
> > +	__le32 len;
> > +	/* Flags */
> > +	__le16 flags;
> > +	/* Chaining for descriptors */
> > +	__le16 next;
> > +} __attribute__((packed));
> 
> I would drop the "packed" attribute in the structure definitions.
> It would imply that only byte accesses are allowed on these
> data structures, because the attribute invalidates any assumptions
> about alignment. None of your structures require padding, so
> the attribute does not have any positive effect either.
> 

I always thought you were supposed to use packed for data structures
that are external to the system. I purposely designed the structures so
they wouldn't need padding.

I'll drop it and check for any problems.

> > +/* MPC8349EMDS specific get_immrbase() */
> > +#include <sysdev/fsl_soc.h>
> 
> Do you really need get_immrbase? I would expect that you can find
> all the registers you need in the device tree, or exported from
> other low-level drivers per subsystem.
> 
> immrbase is a concept from the time before our device trees.
> 

I mostly don't need it. In fact, the only place I'm using registers not
specific to the messaging unit is in the probe routine, where I setup
the 1GB window into host memory and setting up access to the guest
memory on the PCI bus.

Now, I wouldn't need to access these registers at all if the bootloader
could handle it. I just don't know if it is possible to have Linux not
use some memory that the bootloader allocated, other than with the
mem=XXX trick, which I'm sure wouldn't be acceptable. I've just used
regular RAM so this is portable to my custom board (mpc8349emds based)
and a regular mpc8349emds. I didn't want to change anything board
specific.

I would love to have the bootloader allocate (or reserve somewhere in
the memory map) 16K of RAM, and not be required to allocate it with
dma_alloc_coherent(). It would save me plenty of headaches.

> > +/*
> > + * These are internal use only versions of the structures that
> > + * are exported over PCI by this driver
> > + *
> > + * They are used internally to keep track of the PowerPC queues so that
> > + * we don't have to keep flipping endianness all the time
> > + */
> > +struct vop_loc_desc {
> > +	u32 addr;
> > +	u32 len;
> > +	u16 flags;
> > +	u16 next;
> > +};
> > +
> > +struct vop_loc_avail {
> > +	u16 index;
> > +	u16 ring[VOP_RING_SIZE];
> > +};
> > +
> > +struct vop_loc_used_elem {
> > +	u32 id;
> > +	u32 len;
> > +};
> > +
> > +struct vop_loc_used {
> > +	u16 index;
> > +	struct vop_loc_used_elem ring[VOP_RING_SIZE];
> > +};
> 
> Are you worried about the overhead of having to do byte flips,
> or the code complexity? I would guess that the overhead is
> near zero, but I'm not sure about the source code complexity.
> Generally, I'd expect that you'd be better off just using the
> wire-level data structures directly.
> 

Code complexity only. Also, it was easier to write 80-char lines with
something like:

vop_get_desc(vq, idx, &desc);
if (desc.flags & VOP_DESC_F_NEXT) {
	/* do something */
}

Instead of:
if (le16_to_cpu(vq->desc[idx].flags) & VOP_DESC_F_NEXT) {
	/* do something */
}

Plus, I didn't have to remember how many bits were in each field. I just
thought it made everything simpler to understand. Suggestions?

> > +/*
> > + * DMA Resolver state information
> > + */
> > +struct vop_dma_info {
> > +	struct dma_chan *chan;
> > +
> > +	/* The currently processing avail entry */
> > +	u16 loc_avail;
> > +	u16 rem_avail;
> > +
> > +	/* The currently processing used entries */
> > +	u16 loc_used;
> > +	u16 rem_used;
> > +};
> > +
> > +struct vop_vq {
> > +
> > +	/* The actual virtqueue itself */
> > +	struct virtqueue vq;
> > +	struct device *dev;
> > +
> > +	/* The host ring address */
> > +	struct vop_host_ring __iomem *host;
> > +
> > +	/* The guest ring address */
> > +	struct vop_guest_ring *guest;
> > +
> > +	/* Our own memory descriptors */
> > +	struct vop_loc_desc desc[VOP_RING_SIZE];
> > +	struct vop_loc_avail avail;
> > +	struct vop_loc_used used;
> > +	unsigned int flags;
> > +
> > +	/* Data tokens from add_buf() */
> > +	void *data[VOP_RING_SIZE];
> > +
> > +	unsigned int num_free;	/* number of free descriptors in desc */
> > +	unsigned int free_head;	/* start of the free descriptors in desc */
> > +	unsigned int num_added;	/* number of entries added to desc */
> > +
> > +	u16 loc_last_used;	/* the last local used entry processed */
> > +	u16 rem_last_used;	/* the current value of remote used_idx */
> > +
> > +	/* DMA resolver state */
> > +	struct vop_dma_info dma;
> > +	struct work_struct work;
> > +	int (*resolve)(struct vop_vq *vq);
> > +
> > +	void __iomem *immr;
> > +	int kick_val;
> > +};
> 
> This data structure mixes generic information with fsl-834x specific
> members. I think you should try to split this better into a common
> part (also common for host and guest) to allow sharing the code
> across other low-level implementations:
> 
> struct vop_vq {
> 	struct virtqueue vq;
> 	struct vop_host_ring __iomem *host;
> 	struct vop_guest_ring *guest;
> 	...
> };
> 
> and in another file:
> 
> struct fsl834x_vq {
> 	struct vop_vq;
> 	struct fsl834x_vop_regs __iomem *regs; /* instead of immr */
> }
> 
> If you split the structures this way, the abstraction should
> come naturally.
> 

Looks good to me. I'll work towards this in my next version.

> > +/*
> > + * This represents a virtio_device for our driver. It follows the memory
> > + * layout shown above. It has pointers to all of the host and guest memory
> > + * areas that we need to access
> > + */
> > +struct vop_vdev {
> > +
> > +	/* The specific virtio device (console, net, blk) */
> > +	struct virtio_device vdev;
> > +
> > +	#define VOP_DEVICE_REGISTERED 1
> > +	int status;
> > +
> > +	/* Start address of local and remote memory */
> > +	void *loc;
> > +	void __iomem *rem;
> > +
> > +	/*
> > +	 * These are the status, feature, and configuration information
> > +	 * for this virtio device. They are exposed in our memory block
> > +	 * starting at offset 0.
> > +	 */
> > +	struct vop_status __iomem *host_status;
> > +
> > +	/*
> > +	 * These are the status, feature, and configuration information
> > +	 * for the guest virtio device. They are exposed in the guest
> > +	 * memory block starting at offset 0.
> > +	 */
> > +	struct vop_status *guest_status;
> > +
> > +	/*
> > +	 * These are the virtqueues for the virtio driver running this
> > +	 * device to use. The host portions are exposed in our memory block
> > +	 * starting at offset 1024. The exposed areas are aligned to 1024 byte
> > +	 * boundaries, so they appear at offets 1024, 2048, and 3072
> > +	 * respectively.
> > +	 */
> > +	struct vop_vq virtqueues[3];
> > +};
> 
> Unfortunately, that structure layout implies an extra pointer level here:
> 
> 	struct vop_vq *virtqueues[3];
> 
> I also wonder if the number of virtqueues should be variable here.
> 

I used 3 so they would would align to 1024 byte boundaries within a 4K
page. Then the layout was 16K on the bus, each 4K page is a single
virtio-device, and each 1K block is a single virtqueue. The first 1K is
for virtio-device status and feature bits, etc.

Packing them differently isn't a problem. It was just easier to code
because setting up a window with the correct size is so platform
specific.

> > +struct vop_dev {
> > +
> > +	struct of_device *op;
> > +	struct device *dev;
> > +
> > +	/* Reset and start */
> > +	struct mutex mutex;
> > +	struct work_struct reset_work;
> > +	struct work_struct start_work;
> > +
> > +	int irq;
> > +
> > +	/* Our board control registers */
> > +	void __iomem *immr;
> > +
> > +	/* The guest memory, exposed at PCI BAR1 */
> > +	#define VOP_GUEST_MEM_SIZE 16384
> > +	void *guest_mem;
> > +	dma_addr_t guest_mem_addr;
> > +
> > +	/* Host memory, given to us by host in OMR0 */
> > +	#define VOP_HOST_MEM_SIZE 16384
> > +	void __iomem *host_mem;
> > +
> > +	/* The virtio devices */
> > +	struct vop_vdev devices[4];
> > +	struct dma_chan *chan;
> > +};
> 
> This one again is hardware specific, right? If so, it should go
> together with what I call fsl834x_vq above.
> 

Yeah. It is the per-device private data, setup in the probe() routine.
Some of the data isn't needed in each virtqueue, so I left it seperate.

> > +/*----------------------------------------------------------------------------*/
> > +/* Local descriptor ring access helpers                                       */
> > +/*----------------------------------------------------------------------------*/
> > +
> > +static void vop_set_desc_addr(struct vop_vq *vq, unsigned int idx, u32 addr)
> > +{
> > +	vq->desc[idx].addr = addr;
> > +}
> > +
> > +static void vop_set_desc_len(struct vop_vq *vq, unsigned int idx, u32 len)
> > +{
> > +	vq->desc[idx].len = len;
> > +}
> > +
> > +static void vop_set_desc_flags(struct vop_vq *vq, unsigned int idx, u16 flags)
> > +{
> > +	vq->desc[idx].flags = flags;
> > +}
> > +
> > +static void vop_set_desc_next(struct vop_vq *vq, unsigned int idx, u16 next)
> > +{
> > +	vq->desc[idx].next = next;
> > +}
> > +
> > +static u16 vop_get_desc_flags(struct vop_vq *vq, unsigned int idx)
> > +{
> > +	return vq->desc[idx].flags;
> > +}
> > +
> > +static u16 vop_get_desc_next(struct vop_vq *vq, unsigned int idx)
> > +{
> > +	return vq->desc[idx].next;
> > +}
> 
> I don't quite get the point in these accessors. Calling one of these
> functions would be longer than open-coding the content.
> 

They're leftovers from the host code. I'll get rid of them. It would be
even better if I just returned pointers into the vq->desc array rather
than copying the descriptors out of them.

> > +/*----------------------------------------------------------------------------*/
> > +/* Scatterlist DMA helpers                                                    */
> > +/*----------------------------------------------------------------------------*/
> > +
> > +/*
> > + * This function abuses some of the scatterlist code and implements
> > + * dma_map_sg() in such a way that we don't need to keep the scatterlist
> > + * around in order to unmap it.
> > + *
> > + * It is also designed to never merge scatterlist entries, which is
> > + * never what we want for virtio.
> > + *
> > + * When it is time to unmap the buffer, you can use dma_unmap_single() to
> > + * unmap each entry in the chain. Get the address, length, and direction
> > + * from the descriptors! (keep a local copy for speed)
> > + */
> 
> Why is that an advantage over dma_unmap_sg?
> 

When running dma_map_sg(), the scatterlist code is allowed to alter the
scatterlist to store data it needs for dma_unmap_sg(), along with
merging adjacent buffers, etc.

I don't want any of that behavior. The generic virtio code does not
handle merging of buffers.

Also, all of the generic virtio code allocates its scatterlists on the
stack. This means I cannot save the pointers between add_buf() and
get_buf(). If I used dma_map_sg(), I'd have to allocate memory to copy
the scatterlist, map it, and save the pointer. Later, retrieve the
pointer, unmap it, and free the memory.

This is simpler than all of that.

> > +static int vop_dma_map_sg(struct device *dev, struct scatterlist sg[],
> > +			  unsigned int out, unsigned int in)
> > +{
> > +	dma_addr_t addr;
> > +	enum dma_data_direction dir;
> > +	struct scatterlist *start;
> > +	unsigned int i, failure;
> > +
> > +	start = sg;
> > +
> > +	for (i = 0; i < out + in; i++) {
> > +
> > +		/* Check for scatterlist chaining abuse */
> > +		BUG_ON(sg == NULL);
> > +
> > +		dir = (i < out) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
> > +		addr = dma_map_single(dev, sg_virt(sg), sg->length, dir);
> > +
> > +		if (dma_mapping_error(dev, addr))
> > +			goto unwind;
> > +
> > +		sg_dma_address(sg) = addr;
> > +		sg = sg_next(sg);
> > +	}
> 
> I believe this kind of loop can be simplified using for_each_sg().
> 
> 

Yep, you're right. The scatterlists that the virtio_net driver sent used
to be improperly terminated (they ended early, according to the
scatterlist chaining code). This caused me some null pointer oopses. I
submitted a patch to fix virtio_net, so I can use for_each_sg() now.

I'll make the change for my next version.

> > +	/* Remap IMMR */
> > +	priv->immr = ioremap(get_immrbase(), 0x100000);
> > +	if (!priv->immr) {
> > +		dev_err(&op->dev, "Unable to remap IMMR registers\n");
> > +		ret = -ENOMEM;
> > +		goto out_dma_release_channel;
> > +	}
> 
> As mentioned above, this should be something like an of_iomap(op, ...)
> 

Yep. See my comments above.

> > +struct vop_vq {
> > +
> > +	/* The actual virtqueue itself */
> > +	struct virtqueue vq;
> > +
> > +	struct device *dev;
> > +
> > +	/* The host ring address */
> > +	struct vop_host_ring *host;
> > +
> > +	/* The guest ring address */
> > +	struct vop_guest_ring __iomem *guest;
> > +
> > +	/* Local copy of the descriptors for fast access */
> > +	struct vop_loc_desc desc[VOP_RING_SIZE];
> > +
> > +	/* The data token from add_buf() */
> > +	void *data[VOP_RING_SIZE];
> > +
> > +	unsigned int num_free;
> > +	unsigned int free_head;
> > +	unsigned int num_added;
> > +
> > +	u16 avail_idx;
> > +	u16 last_used_idx;
> > +
> > +	/* The doorbell to kick() */
> > +	unsigned int kick_val;
> > +	void __iomem *immr;
> > +};
> 
> I find it very confusing to have almost-identical data structures by the
> same name in two files. Obviously, you have a lot of common code between
> the two sides, but rather than making the implementation files *look*
> similar, it would be better to focus on splitting out the shared code
> into a common file and keep the different/duplicated code small.
> 

I kept the names the same because they served the same purpose on both
sides. When the sides use common code, they'll mostly go away.

> > +	switch (index) {
> > +	case 0: /* x86 recv virtqueue -- ppc xmit virtqueue */
> > +		vq->guest = vdev->rem + 1024;
> > +		vq->host  = vdev->loc + 1024;
> > +		break;
> > +	case 1: /* x86 xmit virtqueue -- ppc recv virtqueue */
> > +		vq->guest = vdev->rem + 2048;
> > +		vq->host  = vdev->loc + 2048;
> > +		break;
> > +	default:
> > +		dev_err(vq->dev, "unknown virtqueue %d\n", index);
> > +		return ERR_PTR(-ENODEV);
> > +	}
> 
> I'd avoid making assumptions or comments about the architectures.
> Rather than "x86" and "ppc", I'd write "local" and "remote".
> 

Yep, just artifacts of my design here. I'll change it.


Thanks for the review. I really appreciate it.
Ira
Ira Snyder Feb. 26, 2009, 8:25 p.m. UTC | #4
On Thu, Feb 26, 2009 at 05:53:56PM +0100, Geert Uytterhoeven wrote:
> On Thu, 26 Feb 2009, Arnd Bergmann wrote:
> > On Tuesday 24 February 2009, Ira Snyder wrote:
> > > +/* Virtio-over-PCI descriptors: 12 bytes. These can chain together via "next" */
> > > +struct vop_desc {
> > > +	/* Address (host physical) */
> > > +	__le32 addr;
>                ^^^^
> Only 32-bit? Is this future-proof?
> 

Probably not. If I use __le64 instead, how do I write 64 bytes over the
PCI bus? There isn't an iowrite64()/ioread64() anywhere in Linux.

Thanks,
Ira
Arnd Bergmann Feb. 26, 2009, 8:37 p.m. UTC | #5
On Thursday 26 February 2009, Ira Snyder wrote:
> On Thu, Feb 26, 2009 at 05:15:27PM +0100, Arnd Bergmann wrote:
>
> I think so too. I was just getting something working, and thought it
> would be better to have it "out there" rather than be working on it
> forever. I'll try to break things up as I have time.

Ok, perfect!
 
> For the "libraries", would you suggest breaking things into seperate
> code files, and using EXPORT_SYMBOL_GPL()? I'm not very familiar with
> doing that, I've mostly been writing code within the existing device
> driver frameworks. Or do I need export symbol at all? I'm not sure...

You have both options. When you list each file as a separate module
in the Makefile, you use EXPORT_SYMBOL_GPL to mark functions that
get called by dependent modules, but this will work only in one way.

You can also link multiple files together into one module, although
it is less common to link a single source file into multiple modules.

> I always thought you were supposed to use packed for data structures
> that are external to the system. I purposely designed the structures so
> they wouldn't need padding.

That would only make sense for structures that are explicitly unaligned,
like a register layout using

struct my_registers {
	__le16 first;
	__le32 second __attribute__((packed));
	__le16 third;
};

Even here, I'd recommend listing the individual members as packed
rather than the entire struct. Obviously if you layout the members
in a sane way, you don't need either.

> I mostly don't need it. In fact, the only place I'm using registers not
> specific to the messaging unit is in the probe routine, where I setup
> the 1GB window into host memory and setting up access to the guest
> memory on the PCI bus.

You could add the registers you need for this to the "reg" property
of your device, to be mapped with of_iomap.

If the registers for setting up this window don't logically fit
into the same device as the one you already use, the cleanest
solution would be to have another device just for this and then
make a function call into that driver to set up the window.

> Now, I wouldn't need to access these registers at all if the bootloader
> could handle it. I just don't know if it is possible to have Linux not
> use some memory that the bootloader allocated, other than with the
> mem=XXX trick, which I'm sure wouldn't be acceptable. I've just used
> regular RAM so this is portable to my custom board (mpc8349emds based)
> and a regular mpc8349emds. I didn't want to change anything board
> specific.
> 
> I would love to have the bootloader allocate (or reserve somewhere in
> the memory map) 16K of RAM, and not be required to allocate it with
> dma_alloc_coherent(). It would save me plenty of headaches.

I believe you can do that through the "memory" devices in the
device tree, by leaving out a small part of the description of
main memory, at putting it into the "reg" property of your own
device.

> Code complexity only. Also, it was easier to write 80-char lines with
> something like:
> 
> vop_get_desc(vq, idx, &desc);
> if (desc.flags & VOP_DESC_F_NEXT) {
> 	/* do something */
> }
> 
> Instead of:
> if (le16_to_cpu(vq->desc[idx].flags) & VOP_DESC_F_NEXT) {
> 	/* do something */
> }
> 
> Plus, I didn't have to remember how many bits were in each field. I just
> thought it made everything simpler to understand. Suggestions?

hmm, in this particular case, you could change the definition
of VOP_DESC_F_NEXT to

#define VOP_DESC_F_NEXT cpu_to_le16(1)

and then do the code as the even simpler (source and object code wise)

if (vq->desc[idx].flags) & VOP_DESC_F_NEXT)

I'm not sure if you can do something along these lines for the other
cases as well though.

> I used 3 so they would would align to 1024 byte boundaries within a 4K
> page. Then the layout was 16K on the bus, each 4K page is a single
> virtio-device, and each 1K block is a single virtqueue. The first 1K is
> for virtio-device status and feature bits, etc.
> 
> Packing them differently isn't a problem. It was just easier to code
> because setting up a window with the correct size is so platform
> specific.

Ok. I guess the important question is what part of the code makes
this decision. Ideally, the virtio-net glue would instantiate
the device with the right number of queues.

> > > +/*
> > > + * This function abuses some of the scatterlist code and implements
> > > + * dma_map_sg() in such a way that we don't need to keep the scatterlist
> > > + * around in order to unmap it.
> > > + *
> > > + * It is also designed to never merge scatterlist entries, which is
> > > + * never what we want for virtio.
> > > + *
> > > + * When it is time to unmap the buffer, you can use dma_unmap_single() to
> > > + * unmap each entry in the chain. Get the address, length, and direction
> > > + * from the descriptors! (keep a local copy for speed)
> > > + */
> > 
> > Why is that an advantage over dma_unmap_sg?
> > 
> 
> When running dma_map_sg(), the scatterlist code is allowed to alter the
> scatterlist to store data it needs for dma_unmap_sg(), along with
> merging adjacent buffers, etc.
> 
> I don't want any of that behavior. The generic virtio code does not
> handle merging of buffers.
> 
> Also, all of the generic virtio code allocates its scatterlists on the
> stack. This means I cannot save the pointers between add_buf() and
> get_buf(). If I used dma_map_sg(), I'd have to allocate memory to copy
> the scatterlist, map it, and save the pointer. Later, retrieve the
> pointer, unmap it, and free the memory.
> 
> This is simpler than all of that.

Not sure if I'm following, but you seem to have put enough thought
into it ;-)

	Arnd <><
Ira Snyder Feb. 26, 2009, 9:49 p.m. UTC | #6
On Thu, Feb 26, 2009 at 09:37:14PM +0100, Arnd Bergmann wrote:
> On Thursday 26 February 2009, Ira Snyder wrote:
> > On Thu, Feb 26, 2009 at 05:15:27PM +0100, Arnd Bergmann wrote:
> >
> > I think so too. I was just getting something working, and thought it
> > would be better to have it "out there" rather than be working on it
> > forever. I'll try to break things up as I have time.
> 
> Ok, perfect!
>  
> > For the "libraries", would you suggest breaking things into seperate
> > code files, and using EXPORT_SYMBOL_GPL()? I'm not very familiar with
> > doing that, I've mostly been writing code within the existing device
> > driver frameworks. Or do I need export symbol at all? I'm not sure...
> 
> You have both options. When you list each file as a separate module
> in the Makefile, you use EXPORT_SYMBOL_GPL to mark functions that
> get called by dependent modules, but this will work only in one way.
> 
> You can also link multiple files together into one module, although
> it is less common to link a single source file into multiple modules.
> 

Ok. I'm more familiar with the EXPORT_SYMBOL_GPL interface, so I'll do
that. If we decide it sucks later, we'll change it.

> > I always thought you were supposed to use packed for data structures
> > that are external to the system. I purposely designed the structures so
> > they wouldn't need padding.
> 
> That would only make sense for structures that are explicitly unaligned,
> like a register layout using
> 
> struct my_registers {
> 	__le16 first;
> 	__le32 second __attribute__((packed));
> 	__le16 third;
> };
> 
> Even here, I'd recommend listing the individual members as packed
> rather than the entire struct. Obviously if you layout the members
> in a sane way, you don't need either.
> 

Ok. I'll drop the __attribute__((packed)) and make sure there aren't
problems. I don't suspect any, though.

> > I mostly don't need it. In fact, the only place I'm using registers not
> > specific to the messaging unit is in the probe routine, where I setup
> > the 1GB window into host memory and setting up access to the guest
> > memory on the PCI bus.
> 
> You could add the registers you need for this to the "reg" property
> of your device, to be mapped with of_iomap.
> 
> If the registers for setting up this window don't logically fit
> into the same device as the one you already use, the cleanest
> solution would be to have another device just for this and then
> make a function call into that driver to set up the window.
> 

The registers are part of the board control registers. They don't fit at
all in the message unit. Doing this in the bootloader seems like a
logical place, but that would require any testers to flash a new U-Boot
image into their mpc8349emds boards.

The first set of access is used to set up a 1GB region in the memory map
that accesses the host's memory. Any reads/writes to addresses
0x80000000-0xc0000000 actually hit the host's memory.

The last access sets up PCI BAR1 to hit the memory from
dma_alloc_coherent(). The bootloader already sets up the window as 16K,
it just doesn't point it anywhere. Maybe this /should/ go into the
bootloader. Like above, it would require testers to flash a new U-Boot
image into their mpc8349emds boards.

> > Now, I wouldn't need to access these registers at all if the bootloader
> > could handle it. I just don't know if it is possible to have Linux not
> > use some memory that the bootloader allocated, other than with the
> > mem=XXX trick, which I'm sure wouldn't be acceptable. I've just used
> > regular RAM so this is portable to my custom board (mpc8349emds based)
> > and a regular mpc8349emds. I didn't want to change anything board
> > specific.
> > 
> > I would love to have the bootloader allocate (or reserve somewhere in
> > the memory map) 16K of RAM, and not be required to allocate it with
> > dma_alloc_coherent(). It would save me plenty of headaches.
> 
> I believe you can do that through the "memory" devices in the
> device tree, by leaving out a small part of the description of
> main memory, at putting it into the "reg" property of your own
> device.
> 

I'll explore this option. I didn't even know you could do this.  Is a
driver that requires the trick acceptable for mainline inclusion? Just
like setting up the 16K PCI window, this is very platform specific.

This limits the guest driver to systems which are able to change Linux's
view of their memory somehow. Maybe this isn't a problem.

> > Code complexity only. Also, it was easier to write 80-char lines with
> > something like:
> > 
> > vop_get_desc(vq, idx, &desc);
> > if (desc.flags & VOP_DESC_F_NEXT) {
> > 	/* do something */
> > }
> > 
> > Instead of:
> > if (le16_to_cpu(vq->desc[idx].flags) & VOP_DESC_F_NEXT) {
> > 	/* do something */
> > }
> > 
> > Plus, I didn't have to remember how many bits were in each field. I just
> > thought it made everything simpler to understand. Suggestions?
> 
> hmm, in this particular case, you could change the definition
> of VOP_DESC_F_NEXT to
> 
> #define VOP_DESC_F_NEXT cpu_to_le16(1)
> 
> and then do the code as the even simpler (source and object code wise)
> 
> if (vq->desc[idx].flags) & VOP_DESC_F_NEXT)
> 
> I'm not sure if you can do something along these lines for the other
> cases as well though.
> 

That's a good idea. It wouldn't fix the addresses, lengths, and next
fields, though. I'll make the change and see how bad it is, then report
back. It may not be so bad after all.

> > I used 3 so they would would align to 1024 byte boundaries within a 4K
> > page. Then the layout was 16K on the bus, each 4K page is a single
> > virtio-device, and each 1K block is a single virtqueue. The first 1K is
> > for virtio-device status and feature bits, etc.
> > 
> > Packing them differently isn't a problem. It was just easier to code
> > because setting up a window with the correct size is so platform
> > specific.
> 
> Ok. I guess the important question is what part of the code makes
> this decision. Ideally, the virtio-net glue would instantiate
> the device with the right number of queues.
> 

Yeah, virtio doesn't work that way.

The virtio drivers just call find_vq() with a different index for each
queue they want to use. You have no way of knowing how many queues each
virtio driver will want, unless you go read their source code.

virtio-net currently uses 3 queues, but we only support the first two.
The third is optional (for now...), and non-symmetric.

Thanks again,
Ira
Arnd Bergmann Feb. 26, 2009, 10:34 p.m. UTC | #7
On Thursday 26 February 2009, Ira Snyder wrote:
> On Thu, Feb 26, 2009 at 09:37:14PM +0100, Arnd Bergmann wrote:
> 
> The registers are part of the board control registers. They don't fit at
> all in the message unit. Doing this in the bootloader seems like a
> logical place, but that would require any testers to flash a new U-Boot
> image into their mpc8349emds boards.
> 
> The first set of access is used to set up a 1GB region in the memory map
> that accesses the host's memory. Any reads/writes to addresses
> 0x80000000-0xc0000000 actually hit the host's memory.
> 
> The last access sets up PCI BAR1 to hit the memory from
> dma_alloc_coherent(). The bootloader already sets up the window as 16K,
> it just doesn't point it anywhere. Maybe this /should/ go into the
> bootloader. Like above, it would require testers to flash a new U-Boot
> image into their mpc8349emds boards.

Ok, I see.

I guess the best option for doing it in Linux then would be to have
a board control driver (not sure if this already exists) that exports
high-level functions to set up the inbound and outbound windows.

> Yeah, virtio doesn't work that way.
> 
> The virtio drivers just call find_vq() with a different index for each
> queue they want to use. You have no way of knowing how many queues each
> virtio driver will want, unless you go read their source code.
> 
> virtio-net currently uses 3 queues, but we only support the first two.
> The third is optional (for now...), and non-symmetric.

I mean the part of your driver that calls register_virtio_device()
could make the decision, this is the one I was referring to
as virtio_net glue because it is the only part that actually needs
to know about the features etc.

Right now, you just call register_virtio_net from vop_probe(), which
is absolutely appropriate for the specific use case. In the most
general case though, you would have a user interface on one or
both sides that allows a (root) user to trigger the creation of
a virtio_net (or other virtio) device with specific characteristics
such as MAC address or number of virtqueues.

One idea I had earlier was that there could be a special device
with just one virtqueue that is always present and that allows you
do communicate configuration changes regarding the available devices
to the remote VOP driver.

	Arnd <><
Ira Snyder Feb. 26, 2009, 11:17 p.m. UTC | #8
On Thu, Feb 26, 2009 at 11:34:33PM +0100, Arnd Bergmann wrote:
> On Thursday 26 February 2009, Ira Snyder wrote:
> > On Thu, Feb 26, 2009 at 09:37:14PM +0100, Arnd Bergmann wrote:
> > 
> > The registers are part of the board control registers. They don't fit at
> > all in the message unit. Doing this in the bootloader seems like a
> > logical place, but that would require any testers to flash a new U-Boot
> > image into their mpc8349emds boards.
> > 
> > The first set of access is used to set up a 1GB region in the memory map
> > that accesses the host's memory. Any reads/writes to addresses
> > 0x80000000-0xc0000000 actually hit the host's memory.
> > 
> > The last access sets up PCI BAR1 to hit the memory from
> > dma_alloc_coherent(). The bootloader already sets up the window as 16K,
> > it just doesn't point it anywhere. Maybe this /should/ go into the
> > bootloader. Like above, it would require testers to flash a new U-Boot
> > image into their mpc8349emds boards.
> 
> Ok, I see.
> 
> I guess the best option for doing it in Linux then would be to have
> a board control driver (not sure if this already exists) that exports
> high-level functions to set up the inbound and outbound windows.
> 

Nothing like it exists. The OF device tree doesn't even describe these
registers. The code in arch/powerpc/sysdev/fsl_pci.c uses some registers
near these, but it gets their address by masking the low bits off the
addresses from the device tree and adding the offsets of the new
registers. Nasty.

I'll do this for now:
1) Get the message unit registers from my device tree
2) Encapsulate all use of get_immrbase() to a single function

That way it could be easily replaced in the future when something more
suitable comes along.

> > Yeah, virtio doesn't work that way.
> > 
> > The virtio drivers just call find_vq() with a different index for each
> > queue they want to use. You have no way of knowing how many queues each
> > virtio driver will want, unless you go read their source code.
> > 
> > virtio-net currently uses 3 queues, but we only support the first two.
> > The third is optional (for now...), and non-symmetric.
> 
> I mean the part of your driver that calls register_virtio_device()
> could make the decision, this is the one I was referring to
> as virtio_net glue because it is the only part that actually needs
> to know about the features etc.
> 
> Right now, you just call register_virtio_net from vop_probe(), which
> is absolutely appropriate for the specific use case. In the most
> general case though, you would have a user interface on one or
> both sides that allows a (root) user to trigger the creation of
> a virtio_net (or other virtio) device with specific characteristics
> such as MAC address or number of virtqueues.
> 

I didn't think about this at all. This driver could be used to boot a
(guest) system over NFS, so in that case there isn't a userspace running
yet, to allow configuration. This is essentially my use case, though I
haven't implemented it yet.

Also, I hate designing user interfaces :) Any concrete suggestions on
design would be most welcome.

> One idea I had earlier was that there could be a special device
> with just one virtqueue that is always present and that allows you
> do communicate configuration changes regarding the available devices
> to the remote VOP driver.
> 

That's an interesting idea that I didn't consider, either. It wouldn't
have to be fast, just reliable. When you're doing small transfers, the
CPU is just fine.

Ira
Arnd Bergmann Feb. 26, 2009, 11:44 p.m. UTC | #9
On Friday 27 February 2009, Ira Snyder wrote:
> On Thu, Feb 26, 2009 at 11:34:33PM +0100, Arnd Bergmann wrote:
> > I guess the best option for doing it in Linux then would be to have
> > a board control driver (not sure if this already exists) that exports
> > high-level functions to set up the inbound and outbound windows.
> > 
> 
> Nothing like it exists. The OF device tree doesn't even describe these
> registers. The code in arch/powerpc/sysdev/fsl_pci.c uses some registers
> near these, but it gets their address by masking the low bits off the
> addresses from the device tree and adding the offsets of the new
> registers. Nasty.
> 
> I'll do this for now:
> 1) Get the message unit registers from my device tree
> 2) Encapsulate all use of get_immrbase() to a single function
> 
> That way it could be easily replaced in the future when something more
> suitable comes along.

Ok. However, I don't expect this to get fixed magically. Ideally,
you would start a new file for the board control in arch/powerpc/sysdev
and export the function from there, otherwise you do it the way you
suggested.
Then we can tell the fsl_pci and other people to use the same
method and source file to access the board control.
> 
> I didn't think about this at all. This driver could be used to boot a
> (guest) system over NFS, so in that case there isn't a userspace running
> yet, to allow configuration. This is essentially my use case, though I
> haven't implemented it yet.
> 
> Also, I hate designing user interfaces :) Any concrete suggestions on
> design would be most welcome.

Don't worry about it for now, just put all the hardcoded virtio_net
specific stuff into a file separate from the hardware specific
files so that we have a nice kernel level abstraction to build a
user abstraction on top of.

	Arnd <><
Grant Likely April 14, 2009, 8:28 p.m. UTC | #10
On Mon, Feb 23, 2009 at 6:00 PM, Ira Snyder <iws@ovro.caltech.edu> wrote:
> This adds support to Linux for using virtio between two computers linked by
> a PCI interface. This allows the use of virtio_net to create a familiar,
> fast interface for communication. It should be possible to use other virtio
> devices in the future, but this has not been tested.

Hey Ira,

I like this a lot.  I need to do much the same thing on one of my
platforms, so I'm going to use your patch as my starting point.  Have
you made many changes since you posted this version of your patch?
I'd like to collaborate on the development and help to get it
mainlined.

In my case I've got an MPC5200 as the 'host' and a Xilinx Virtex
(ppc440) as the 'client'.  I intend set aside a region of the Xilinx
Virtex's memory space for the shared queues.  I'm starting work on it
now, and I'll provide you with feedback and/or patches as I make
progress.

g.

>
> I have implemented guest support for the Freescale MPC8349EMDS board, which
> is capable of running in PCI agent mode (It acts like a PCI card, but is a
> complete computer system, running Linux). The driver is trivial to port to
> any MPC83xx system.
>
> It was developed to work in a CompactPCI crate of computers, one of which
> is a standard x86 system (acting as the host) and many PowerPC systems
> (acting as guests).
>
> I have only tested this driver with a single board in my system. The host
> is a 1066MHz Pentium3-M, and the guest is a 533MHz PowerPC. I am able
> achieve transfer rates of about 150 mbit host->guest and 350 mbit
> guest->host. A few tests showed that using an mtu of 4000 provided much
> better results than an mtu of 1500. Using an mtu of 64000 significantly
> dropped performance. The performance is equivalent to my PCINet driver for
> host->guest, and about 20% faster for guest->host transfers.
>
> I have included a short document explaining what I think is the most
> complicated part of the driver: using the DMA engine to transfer data. I
> hope everything else is readily obvious from the code. Questions are
> welcome.
>
> I will not be able to work on this full time for at least a few weeks, so I
> would appreciate actual review of this driver.  Nitpicks are fine, I just
> won't be able to respond to them quickly.
>
> RFCv1 -> RFCv2:
>  * fix major brokenness of host detach_buf()
>  * support VIRTIO_NET_F_CSUM
>  * support VIRTIO_NET_F_GSO
>  * support VIRTIO_NET_F_MRG_RXBUF
>  * rewrote DMA transfers to support merged rxbufs
>  * added a hack to fix the endianness of virtio_net's metadata
>  * lots more performance for guest->host transfers (~40MB/sec)
>  * updated documentation
>  * allocate 128 feature bits instead of 32
>
> Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
> ---
>
> Yes, the commit message has too much information. This is an RFC after
> all. I fully expect to have to make changes. In fact, I posting this
> more to "get it out there" than anything else, since I have other tasks
> that need doing.
>
> I'd appreciate a serious review of the design by the people who have
> been pressuring me to use virtio. I'm very happy to answer any questions
> you have.
>
> Thanks to everyone who gave feedback for RFCv1!
> Ira
>
>  Documentation/virtio-over-PCI.txt     |   60 +
>  arch/powerpc/boot/dts/mpc834x_mds.dts |    7 +
>  drivers/virtio/Kconfig                |   22 +
>  drivers/virtio/Makefile               |    2 +
>  drivers/virtio/vop.h                  |  119 ++
>  drivers/virtio/vop_fsl.c              | 2020 +++++++++++++++++++++++++++++++++
>  drivers/virtio/vop_host.c             | 1071 +++++++++++++++++
>  drivers/virtio/vop_hw.h               |   80 ++
>  8 files changed, 3381 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/virtio-over-PCI.txt
>  create mode 100644 drivers/virtio/vop.h
>  create mode 100644 drivers/virtio/vop_fsl.c
>  create mode 100644 drivers/virtio/vop_host.c
>  create mode 100644 drivers/virtio/vop_hw.h
>
> diff --git a/Documentation/virtio-over-PCI.txt b/Documentation/virtio-over-PCI.txt
> new file mode 100644
> index 0000000..e4520d4
> --- /dev/null
> +++ b/Documentation/virtio-over-PCI.txt
> @@ -0,0 +1,60 @@
> +The implementation of virtio-over-PCI was driven with the following goals:
> +* Avoid MMIO reads, try to use only MMIO writes
> +* Use the onboard DMA engine, for speed
> +
> +The implementation also borrows many of the details from the only other
> +implementation, virtio_ring.
> +
> +It succeeds in avoiding all MMIO reads on the critical paths. I did not
> +see any reason to avoid the use of MMIO reads during device probing, since
> +it is not a critical path.
> +
> +=== Avoiding MMIO reads ===
> +To avoid MMIO reads, both the host and guest systems have a copy of the
> +descriptors. Both sides need to read the descriptors after they have been
> +written, but only the host system writes to them. This allows us to keep a
> +local copy for later use.
> +
> +=== Using the DMA engine ===
> +This is the only truly complicated part of the system. Since this
> +implementation was designed for use with virtio_net, it may be biased
> +towards virtio_net's usage of the virtio interface.
> +
> +In merged rxbufs mode, the virtio_net driver provides a receive ring, which
> +it fills with empty PAGE_SIZE buffers. The DMA code sets up transfers
> +directly from the guest transmit queue to the empty packets in the host
> +receive queue. Data transfer in the other direction works in a similar
> +fashion.
> +
> +The guest (PowerPC) system keeps its own local set of descriptors, which are
> +filled by the virtio add_buf() call. Whenever this happens, the avail ring is
> +changed, and therefore we try to transfer data.
> +
> +The algorithm is essentially as follows:
> +1) Check for an available local or remote entry
> +2) Check that the other side has enough room for the packet
> +3) Transfer the chain, joining small packets and splitting large packets
> +4) Move the entries to the used rings, but do not update the used index
> +5) Schedule a DMA callback to happen when the transfer completes
> +6) Start the DMA transfer
> +7) When the DMA finishes, the callback updates the used indices and
> +   triggers any necessary callbacks
> +
> +The algorithm can only handle chains that are to be coalesced together. It
> +puts all data sequentially into the PAGE_SIZE buffers exposed by the
> +receiving side, including both the virtio_net header and packet data.
> +
> +=== Startup Sequence ===
> +There are currently problems in the startup sequence between the host and
> +guest drivers. The current scheme assumes that the guest is up and waiting
> +before the host is ready. I am having a very hard time coming up with a scheme
> +that is perfectly safe, where either side could win the race and be ready
> +first.
> +
> +Even harder is a situation where you would like to use the "network device"
> +from your bootloader to tftp a kernel, then boot Linux. In this case,
> +Linux has no knowledge of where the device descriptors were before it booted.
> +You'd need to stop and re-start the host driver to make sure it re-initializes
> +the new descriptor memory after Linux has booted.
> +
> +This is a definite "needs work" item.
> diff --git a/arch/powerpc/boot/dts/mpc834x_mds.dts b/arch/powerpc/boot/dts/mpc834x_mds.dts
> index d9adba0..5c7617d 100644
> --- a/arch/powerpc/boot/dts/mpc834x_mds.dts
> +++ b/arch/powerpc/boot/dts/mpc834x_mds.dts
> @@ -104,6 +104,13 @@
>                        mode = "cpu";
>                };
>
> +               message-unit@8030 {
> +                       compatible = "fsl,mpc8349-mu";
> +                       reg = <0x8030 0xd0>;
> +                       interrupts = <69 0x8>;
> +                       interrupt-parent = <&ipic>;
> +               };
> +
>                dma@82a8 {
>                        #address-cells = <1>;
>                        #size-cells = <1>;
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 3dd6294..efcf56b 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -33,3 +33,25 @@ config VIRTIO_BALLOON
>
>         If unsure, say M.
>
> +config VIRTIO_OVER_PCI_HOST
> +       tristate "Virtio-over-PCI Host support (EXPERIMENTAL)"
> +       depends on PCI && EXPERIMENTAL
> +       select VIRTIO
> +       ---help---
> +         This driver provides the host support necessary for using virtio
> +         over the PCI bus with a Freescale MPC8349EMDS evaluation board.
> +
> +         If unsure, say N.
> +
> +config VIRTIO_OVER_PCI_FSL
> +       tristate "Virtio-over-PCI Guest support (EXPERIMENTAL)"
> +       depends on MPC834x_MDS && EXPERIMENTAL
> +       select VIRTIO
> +       select DMA_ENGINE
> +       select FSL_DMA
> +       ---help---
> +         This driver provides the guest support necessary for using virtio
> +         over the PCI bus.
> +
> +         If unsure, say N.
> +
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 6738c44..f31afaa 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -2,3 +2,5 @@ obj-$(CONFIG_VIRTIO) += virtio.o
>  obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
>  obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
>  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> +obj-$(CONFIG_VIRTIO_OVER_PCI_HOST) += vop_host.o
> +obj-$(CONFIG_VIRTIO_OVER_PCI_FSL) += vop_fsl.o
> diff --git a/drivers/virtio/vop.h b/drivers/virtio/vop.h
> new file mode 100644
> index 0000000..5f77228
> --- /dev/null
> +++ b/drivers/virtio/vop.h
> @@ -0,0 +1,119 @@
> +/*
> + * Virtio-over-PCI definitions
> + *
> + * Copyright (c) 2009 Ira W. Snyder <iws@ovro.caltech.edu>
> + *
> + * This file is licensed under the terms of the GNU General Public License
> + * version 2. This program is licensed "as is" without any warranty of any
> + * kind, whether express or implied.
> + */
> +
> +#ifndef VOP_H
> +#define VOP_H
> +
> +#include <linux/types.h>
> +
> +/* The number of entries per ring (MUST be a power of two) */
> +#define VOP_RING_SIZE          64
> +
> +/* Marks a buffer as continuing via the next field */
> +#define VOP_DESC_F_NEXT                1
> +/* Marks a buffer as write-only (otherwise read-only) */
> +#define VOP_DESC_F_WRITE       2
> +
> +/* Interrupts should not be generated when adding to avail or used */
> +#define VOP_F_NO_INTERRUPT     1
> +
> +/* Virtio-over-PCI descriptors: 12 bytes. These can chain together via "next" */
> +struct vop_desc {
> +       /* Address (host physical) */
> +       __le32 addr;
> +       /* Length (bytes) */
> +       __le32 len;
> +       /* Flags */
> +       __le16 flags;
> +       /* Chaining for descriptors */
> +       __le16 next;
> +} __attribute__((packed));
> +
> +/* Virtio-over-PCI used descriptor chains: 8 bytes */
> +struct vop_used_elem {
> +       /* Start index of used descriptor chain */
> +       __le32 id;
> +       /* Total length of the descriptor chain which was used (written to) */
> +       __le32 len;
> +} __attribute__((packed));
> +
> +/* The ring in host memory, only written by the guest */
> +/* NOTE: with VOP_RING_SIZE == 64, this is 520 bytes */
> +struct vop_host_ring {
> +       /* The flags, so the guest can indicate that it doesn't want
> +        * interrupts when things are added to the avail ring */
> +       __le16 flags;
> +
> +       /* The index, which points at the next slot where a chain index
> +        * will be added to the used ring */
> +       __le16 used_idx;
> +
> +       /* The used ring */
> +       struct vop_used_elem used[VOP_RING_SIZE];
> +} __attribute__((packed));
> +
> +/* The ring in guest memory, only written by the host */
> +/* NOTE: with VOP_RING_SIZE == 64, this is 904 bytes! */
> +struct vop_guest_ring {
> +       /* The descriptors */
> +       struct vop_desc desc[VOP_RING_SIZE];
> +
> +       /* The flags, so the host can indicate that it doesn't want
> +        * interrupts when things are added to the used ring */
> +       __le16 flags;
> +
> +       /* The index, which points at the next slot where a chain index
> +        * will be added to the avail ring */
> +       __le16 avail_idx;
> +
> +       /* The avail ring */
> +       __le16 avail[VOP_RING_SIZE];
> +} __attribute__((packed));
> +
> +/*
> + * This is the status structure holding the virtio_device status
> + * as well as the feature bits for this device and the configuration
> + * space.
> + *
> + * NOTE: it is for the LOCAL device. This is the slow path, so
> + * NOTE: the mmio reads won't cause any speed problems
> + */
> +struct vop_status {
> +       /* Status bits for the device */
> +       __le32 status;
> +
> +       /* Feature bits for the device (128 bits) */
> +       __le32 features[4];
> +
> +       /* Configuration space (different for each device type) */
> +       u8 config[1004];
> +
> +} __attribute__((packed));
> +
> +/*
> + * Layout in memory
> + *
> + * |--------------------------|
> + * | 0: local device status   |
> + * |--------------------------|
> + * | 1024: host/guest ring 1  |
> + * |--------------------------|
> + * | 2048: host/guest ring 2  |
> + * |--------------------------|
> + * | 3072: host/guest ring 3  |
> + * |--------------------------|
> + *
> + * Now, you have one of these for each virtio device, and
> + * then you're pretty much set. You can expose 16K of memory
> + * out on the bus (on each side) and have 4 virtio devices,
> + * each with a different type, and 3 virtqueues
> + */
> +
> +#endif /* VOP_H */
> diff --git a/drivers/virtio/vop_fsl.c b/drivers/virtio/vop_fsl.c
> new file mode 100644
> index 0000000..7cb3cdd
> --- /dev/null
> +++ b/drivers/virtio/vop_fsl.c
> @@ -0,0 +1,2020 @@
> +/*
> + * Virtio-over-PCI MPC8349EMDS Guest Driver
> + *
> + * Copyright (c) 2009 Ira W. Snyder <iws@ovro.caltech.edu>
> + *
> + * This file is licensed under the terms of the GNU General Public License
> + * version 2. This program is licensed "as is" without any warranty of any
> + * kind, whether express or implied.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/of_platform.h>
> +#include <linux/io.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/virtio.h>
> +#include <linux/virtio_config.h>
> +#include <linux/virtio_net.h>
> +#include <linux/interrupt.h>
> +#include <linux/virtio_net.h>
> +#include <linux/dmaengine.h>
> +#include <linux/workqueue.h>
> +#include <linux/etherdevice.h>
> +
> +/* MPC8349EMDS specific get_immrbase() */
> +#include <sysdev/fsl_soc.h>
> +
> +#include "vop_hw.h"
> +#include "vop.h"
> +
> +/*
> + * These are internal use only versions of the structures that
> + * are exported over PCI by this driver
> + *
> + * They are used internally to keep track of the PowerPC queues so that
> + * we don't have to keep flipping endianness all the time
> + */
> +struct vop_loc_desc {
> +       u32 addr;
> +       u32 len;
> +       u16 flags;
> +       u16 next;
> +};
> +
> +struct vop_loc_avail {
> +       u16 index;
> +       u16 ring[VOP_RING_SIZE];
> +};
> +
> +struct vop_loc_used_elem {
> +       u32 id;
> +       u32 len;
> +};
> +
> +struct vop_loc_used {
> +       u16 index;
> +       struct vop_loc_used_elem ring[VOP_RING_SIZE];
> +};
> +
> +/*
> + * DMA Resolver state information
> + */
> +struct vop_dma_info {
> +       struct dma_chan *chan;
> +
> +       /* The currently processing avail entry */
> +       u16 loc_avail;
> +       u16 rem_avail;
> +
> +       /* The currently processing used entries */
> +       u16 loc_used;
> +       u16 rem_used;
> +};
> +
> +struct vop_vq {
> +
> +       /* The actual virtqueue itself */
> +       struct virtqueue vq;
> +       struct device *dev;
> +
> +       /* The host ring address */
> +       struct vop_host_ring __iomem *host;
> +
> +       /* The guest ring address */
> +       struct vop_guest_ring *guest;
> +
> +       /* Our own memory descriptors */
> +       struct vop_loc_desc desc[VOP_RING_SIZE];
> +       struct vop_loc_avail avail;
> +       struct vop_loc_used used;
> +       unsigned int flags;
> +
> +       /* Data tokens from add_buf() */
> +       void *data[VOP_RING_SIZE];
> +
> +       unsigned int num_free;  /* number of free descriptors in desc */
> +       unsigned int free_head; /* start of the free descriptors in desc */
> +       unsigned int num_added; /* number of entries added to desc */
> +
> +       u16 loc_last_used;      /* the last local used entry processed */
> +       u16 rem_last_used;      /* the current value of remote used_idx */
> +
> +       /* DMA resolver state */
> +       struct vop_dma_info dma;
> +       struct work_struct work;
> +       int (*resolve)(struct vop_vq *vq);
> +
> +       void __iomem *immr;
> +       int kick_val;
> +};
> +
> +/* Convert from a struct virtqueue to a struct vop_vq */
> +#define to_vop_vq(X) container_of(X, struct vop_vq, vq)
> +
> +/*
> + * This represents a virtio_device for our driver. It follows the memory
> + * layout shown above. It has pointers to all of the host and guest memory
> + * areas that we need to access
> + */
> +struct vop_vdev {
> +
> +       /* The specific virtio device (console, net, blk) */
> +       struct virtio_device vdev;
> +
> +       #define VOP_DEVICE_REGISTERED 1
> +       int status;
> +
> +       /* Start address of local and remote memory */
> +       void *loc;
> +       void __iomem *rem;
> +
> +       /*
> +        * These are the status, feature, and configuration information
> +        * for this virtio device. They are exposed in our memory block
> +        * starting at offset 0.
> +        */
> +       struct vop_status __iomem *host_status;
> +
> +       /*
> +        * These are the status, feature, and configuration information
> +        * for the guest virtio device. They are exposed in the guest
> +        * memory block starting at offset 0.
> +        */
> +       struct vop_status *guest_status;
> +
> +       /*
> +        * These are the virtqueues for the virtio driver running this
> +        * device to use. The host portions are exposed in our memory block
> +        * starting at offset 1024. The exposed areas are aligned to 1024 byte
> +        * boundaries, so they appear at offets 1024, 2048, and 3072
> +        * respectively.
> +        */
> +       struct vop_vq virtqueues[3];
> +};
> +
> +#define to_vop_vdev(X) container_of(X, struct vop_vdev, vdev)
> +
> +struct vop_dev {
> +
> +       struct of_device *op;
> +       struct device *dev;
> +
> +       /* Reset and start */
> +       struct mutex mutex;
> +       struct work_struct reset_work;
> +       struct work_struct start_work;
> +
> +       int irq;
> +
> +       /* Our board control registers */
> +       void __iomem *immr;
> +
> +       /* The guest memory, exposed at PCI BAR1 */
> +       #define VOP_GUEST_MEM_SIZE 16384
> +       void *guest_mem;
> +       dma_addr_t guest_mem_addr;
> +
> +       /* Host memory, given to us by host in OMR0 */
> +       #define VOP_HOST_MEM_SIZE 16384
> +       void __iomem *host_mem;
> +
> +       /* The virtio devices */
> +       struct vop_vdev devices[4];
> +       struct dma_chan *chan;
> +};
> +
> +/*
> + * DMA callback information
> + */
> +struct vop_dma_cbinfo {
> +       struct vop_vq *vq;
> +
> +       /* The amount to increment the used rings */
> +       unsigned int loc;
> +       unsigned int rem;
> +};
> +
> +static const char driver_name[] = "vdev";
> +static struct kmem_cache *dma_cache;
> +
> +/*----------------------------------------------------------------------------*/
> +/* Whole-descriptor access helpers                                            */
> +/*----------------------------------------------------------------------------*/
> +
> +/*
> + * Return a copy of a local descriptor in native format for easy use
> + * of all fields
> + *
> + * @vq the virtqueue
> + * @idx the descriptor index
> + * @desc pointer to the structure to copy into
> + */
> +static void vop_loc_desc(struct vop_vq *vq, unsigned int idx,
> +                        struct vop_loc_desc *desc)
> +{
> +       BUG_ON(idx >= VOP_RING_SIZE);
> +       BUG_ON(!desc);
> +
> +       desc->addr  = vq->desc[idx].addr;
> +       desc->len   = vq->desc[idx].len;
> +       desc->flags = vq->desc[idx].flags;
> +       desc->next  = vq->desc[idx].next;
> +}
> +
> +/*
> + * Return a copy of a remote descriptor in native format for easy use
> + * of all fields
> + *
> + * @vq the virtqueue
> + * @idx the descriptor index
> + * @desc pointer to the structure to copy into
> + */
> +static void vop_rem_desc(struct vop_vq *vq, unsigned int idx,
> +                        struct vop_loc_desc *desc)
> +{
> +       BUG_ON(idx >= VOP_RING_SIZE);
> +       BUG_ON(!desc);
> +
> +       desc->addr  = le32_to_cpu(vq->guest->desc[idx].addr);
> +       desc->len   = le32_to_cpu(vq->guest->desc[idx].len);
> +       desc->flags = le16_to_cpu(vq->guest->desc[idx].flags);
> +       desc->next  = le16_to_cpu(vq->guest->desc[idx].next);
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* Local descriptor ring access helpers                                       */
> +/*----------------------------------------------------------------------------*/
> +
> +static void vop_set_desc_addr(struct vop_vq *vq, unsigned int idx, u32 addr)
> +{
> +       vq->desc[idx].addr = addr;
> +}
> +
> +static void vop_set_desc_len(struct vop_vq *vq, unsigned int idx, u32 len)
> +{
> +       vq->desc[idx].len = len;
> +}
> +
> +static void vop_set_desc_flags(struct vop_vq *vq, unsigned int idx, u16 flags)
> +{
> +       vq->desc[idx].flags = flags;
> +}
> +
> +static void vop_set_desc_next(struct vop_vq *vq, unsigned int idx, u16 next)
> +{
> +       vq->desc[idx].next = next;
> +}
> +
> +static u16 vop_get_desc_flags(struct vop_vq *vq, unsigned int idx)
> +{
> +       return vq->desc[idx].flags;
> +}
> +
> +static u16 vop_get_desc_next(struct vop_vq *vq, unsigned int idx)
> +{
> +       return vq->desc[idx].next;
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* Status Helpers                                                             */
> +/*----------------------------------------------------------------------------*/
> +
> +static u32 vop_get_host_status(struct vop_vdev *vdev)
> +{
> +       return ioread32(&vdev->host_status->status);
> +}
> +
> +static u32 vop_get_host_features(struct vop_vdev *vdev)
> +{
> +       return ioread32(&vdev->host_status->features[0]);
> +}
> +
> +static u16 vop_get_host_flags(struct vop_vq *vq)
> +{
> +       return le16_to_cpu(vq->guest->flags);
> +}
> +
> +/*
> + * Set the guest's flags variable (lives in host memory)
> + */
> +static void vop_set_guest_flags(struct vop_vq *vq, u16 flags)
> +{
> +       iowrite16(flags, &vq->host->flags);
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* Remote Ring Debugging Helpers                                              */
> +/*----------------------------------------------------------------------------*/
> +
> +#ifdef DEBUG_DUMP_RINGS
> +static void dump_rem_desc(struct vop_vq *vq)
> +{
> +       struct vop_loc_desc desc;
> +       int i;
> +
> +       dev_dbg(vq->dev, "REM DESC 0xADDRESSX LENGTH 0xFLAG NEXT\n");
> +       for (i = 0; i < VOP_RING_SIZE; i++) {
> +               vop_rem_desc(vq, i, &desc);
> +               dev_dbg(vq->dev, "DESC %.2d: 0x%.8x %.6d 0x%.4x %.2d\n",
> +                               i, desc.addr, desc.len, desc.flags, desc.next);
> +       }
> +}
> +
> +static void dump_rem_avail(struct vop_vq *vq)
> +{
> +       int i;
> +
> +       dev_dbg(vq->dev, "REM AVAIL IDX %.2d\n", le16_to_cpu(vq->guest->avail_idx));
> +       for (i = 0; i < VOP_RING_SIZE; i++) {
> +               dev_dbg(vq->dev, "REM AVAIL %.2d: %.2d\n",
> +                               i, le16_to_cpu(vq->guest->avail[i]));
> +       }
> +}
> +
> +static void dump_rem_used(struct vop_vq *vq)
> +{
> +       int i;
> +
> +       dev_dbg(vq->dev, "REM USED IDX %.2d\n", ioread16(&vq->host->used_idx));
> +       for (i = 0; i < VOP_RING_SIZE; i++) {
> +               dev_dbg(vq->dev, "REM USED %.2d: %.2d %.6d\n", i,
> +                               ioread32(&vq->host->used[i].id),
> +                               ioread32(&vq->host->used[i].len));
> +       }
> +}
> +
> +static void dump_rem_rings(struct vop_vq *vq)
> +{
> +       dump_rem_desc(vq);
> +       dump_rem_avail(vq);
> +       dump_rem_used(vq);
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* Local Ring Debugging Helpers                                               */
> +/*----------------------------------------------------------------------------*/
> +
> +static void dump_loc_desc(struct vop_vq *vq)
> +{
> +       struct vop_loc_desc desc;
> +       int i;
> +
> +       dev_dbg(vq->dev, "LOC DESC 0xADDRESSX LENGTH 0xFLAG NEXT\n");
> +       for (i = 0 ; i < VOP_RING_SIZE; i++) {
> +               vop_loc_desc(vq, i, &desc);
> +               dev_dbg(vq->dev, "DESC %.2d: 0x%.8x %.6d 0x%.4x %.2d\n",
> +                               i, desc.addr, desc.len, desc.flags, desc.next);
> +       }
> +}
> +
> +static void dump_loc_avail(struct vop_vq *vq)
> +{
> +       int i;
> +
> +       dev_dbg(vq->dev, "LOC AVAIL IDX %.2d\n", vq->avail.index);
> +       for (i = 0; i < VOP_RING_SIZE; i++)
> +               dev_dbg(vq->dev, "LOC AVAIL %.2d: %.2d\n", i, vq->avail.ring[i]);
> +}
> +
> +static void dump_loc_used(struct vop_vq *vq)
> +{
> +       int i;
> +
> +       dev_dbg(vq->dev, "LOC USED IDX %.2hu\n", vq->used.index);
> +       for (i = 0; i < VOP_RING_SIZE; i++) {
> +               dev_dbg(vq->dev, "LOC USED %.2d: %.2d %.6d\n", i,
> +                               vq->used.ring[i].id, vq->used.ring[i].len);
> +       }
> +}
> +
> +static void dump_loc_rings(struct vop_vq *vq)
> +{
> +       dump_loc_desc(vq);
> +       dump_loc_avail(vq);
> +       dump_loc_used(vq);
> +}
> +
> +static void debug_dump_rings(struct vop_vq *vq, const char *msg)
> +{
> +       dev_dbg(vq->dev, "\n");
> +       dev_dbg(vq->dev, "%s\n", msg);
> +       dump_loc_rings(vq);
> +       dump_rem_rings(vq);
> +       dev_dbg(vq->dev, "\n");
> +}
> +#else
> +static void debug_dump_rings(struct vop_vq *vq, const char *msg)
> +{
> +       /* Nothing */
> +}
> +#endif
> +
> +/*----------------------------------------------------------------------------*/
> +/* Scatterlist DMA helpers                                                    */
> +/*----------------------------------------------------------------------------*/
> +
> +/*
> + * This function abuses some of the scatterlist code and implements
> + * dma_map_sg() in such a way that we don't need to keep the scatterlist
> + * around in order to unmap it.
> + *
> + * It is also designed to never merge scatterlist entries, which is
> + * never what we want for virtio.
> + *
> + * When it is time to unmap the buffer, you can use dma_unmap_single() to
> + * unmap each entry in the chain. Get the address, length, and direction
> + * from the descriptors! (keep a local copy for speed)
> + */
> +static int vop_dma_map_sg(struct device *dev, struct scatterlist sg[],
> +                         unsigned int out, unsigned int in)
> +{
> +       dma_addr_t addr;
> +       enum dma_data_direction dir;
> +       struct scatterlist *start;
> +       unsigned int i, failure;
> +
> +       start = sg;
> +
> +       for (i = 0; i < out + in; i++) {
> +
> +               /* Check for scatterlist chaining abuse */
> +               BUG_ON(sg == NULL);
> +
> +               dir = (i < out) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
> +               addr = dma_map_single(dev, sg_virt(sg), sg->length, dir);
> +
> +               if (dma_mapping_error(dev, addr))
> +                       goto unwind;
> +
> +               sg_dma_address(sg) = addr;
> +               sg = sg_next(sg);
> +       }
> +
> +       return 0;
> +
> +unwind:
> +       failure = i;
> +       sg = start;
> +
> +       for (i = 0; i < failure; i++) {
> +               dir = (i < out) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
> +               addr = sg_dma_address(sg);
> +
> +               dma_unmap_single(dev, addr, sg->length, dir);
> +               sg = sg_next(sg);
> +       }
> +
> +       return -ENOMEM;
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* DMA Helpers                                                                */
> +/*----------------------------------------------------------------------------*/
> +
> +/*
> + * Transfer data between two physical addresses with DMA
> + *
> + * NOTE: does not automatically unmap the src and dst addresses
> + *
> + * @chan the channel to use
> + * @dst the physical destination address
> + * @src the physical source address
> + * @len the length to transfer (in bytes)
> + * @return a valid cookie, or -ERRNO
> + */
> +static dma_cookie_t dma_async_memcpy_raw_to_raw(struct dma_chan *chan,
> +                                              dma_addr_t dst,
> +                                              dma_addr_t src,
> +                                              size_t len)
> +{
> +       struct dma_device *dev = chan->device;
> +       struct dma_async_tx_descriptor *tx;
> +       enum dma_ctrl_flags flags;
> +       dma_cookie_t cookie;
> +       int cpu;
> +
> +       flags = DMA_COMPL_SKIP_SRC_UNMAP | DMA_COMPL_SKIP_DEST_UNMAP;
> +       tx = dev->device_prep_dma_memcpy(chan, dst, src, len, flags);
> +       if (!tx)
> +               return -ENOMEM;
> +
> +       tx->callback = NULL;
> +       cookie = tx->tx_submit(tx);
> +
> +       cpu = get_cpu();
> +       per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
> +       per_cpu_ptr(chan->local, cpu)->memcpy_count++;
> +       put_cpu();
> +
> +       return cookie;
> +}
> +
> +/*
> + * Trigger an interrupt after all DMA issued up to this point
> + * have been processed
> + *
> + * @chan the channel to use
> + * @callback the function to call (must not sleep)
> + * @data the data to send to the callback
> + *
> + * @return a valid cookie, or -ERRNO
> + */
> +static dma_cookie_t dma_async_interrupt(struct dma_chan *chan,
> +                                       dma_async_tx_callback callback,
> +                                       void *data)
> +{
> +       struct dma_device *dev = chan->device;
> +       struct dma_async_tx_descriptor *tx;
> +
> +       /* Set up the DMA */
> +       tx = dev->device_prep_dma_interrupt(chan, DMA_PREP_INTERRUPT);
> +       if (!tx)
> +               return -ENOMEM;
> +
> +       tx->callback = callback;
> +       tx->callback_param = data;
> +
> +       return tx->tx_submit(tx);
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* DMA Resolver                                                               */
> +/*----------------------------------------------------------------------------*/
> +
> +static void vop_remote_used_changed(struct vop_vq *vq)
> +{
> +       if (!(vop_get_host_flags(vq) & VOP_F_NO_INTERRUPT)) {
> +               dev_dbg(vq->dev, "notifying the host (new buffers in used)\n");
> +               iowrite32(vq->kick_val, vq->immr + ODR_OFFSET);
> +       }
> +}
> +
> +static void vop_local_used_changed(struct vop_vq *vq)
> +{
> +       if (!(vq->flags & VOP_F_NO_INTERRUPT)) {
> +               dev_dbg(vq->dev, "notifying self (new buffers in used)\n");
> +               vq->vq.callback(&vq->vq);
> +       }
> +}
> +
> +/*
> + * DMA callback function for merged rxbufs
> + *
> + * This is called every time a DMA transfer completes, and will update the
> + * indices in the local and remote used rings, then notify both sides that
> + * their used ring has changed
> + *
> + * You must be sure that the data was actually written to the used rings before
> + * this function is called
> + */
> +static void dma_callback(void *data)
> +{
> +       struct vop_dma_cbinfo *cb = data;
> +       struct vop_vq *vq = cb->vq;
> +
> +       dev_dbg(vq->dev, "%s: vq %p loc %d rem %d\n", __func__, vq, cb->loc, cb->rem);
> +
> +       /* Write the local used index */
> +       vq->used.index += cb->loc;
> +
> +       /* Write the remote used index */
> +       vq->rem_last_used += cb->rem;
> +       iowrite16(vq->rem_last_used, &vq->host->used_idx);
> +
> +       /* Make sure the indices are written before triggering callbacks */
> +       wmb();
> +
> +       /* Trigger the local used callback */
> +       dev_dbg(vq->dev, "local used changed, running callback\n");
> +       vop_local_used_changed(vq);
> +
> +       /* Trigger the remote used callback */
> +       dev_dbg(vq->dev, "remote used changed, running callback\n");
> +       vop_remote_used_changed(vq);
> +
> +       /* Free the callback data */
> +       kmem_cache_free(dma_cache, cb);
> +}
> +
> +/*
> + * Take an entry from the local avail ring and add it to the local
> + * used ring with the given length
> + *
> + * NOTE: does not update the used index
> + *
> + * @vq the virtqueue
> + * @avail_idx the index in the avail ring to take the entry from
> + * @used_idx the index in the used ring to put the entry
> + * @used_len the length used
> + */
> +static void vop_loc_avail_to_used(struct vop_vq *vq, unsigned int avail_idx,
> +                                 unsigned int used_idx, u32 used_len)
> +{
> +       u16 id;
> +
> +       /* Make sure the indices are inside the rings */
> +       avail_idx &= (VOP_RING_SIZE - 1);
> +       used_idx  &= (VOP_RING_SIZE - 1);
> +
> +       /* Get the index stored in the avail ring */
> +       id = vq->avail.ring[avail_idx];
> +
> +       /* Copy the index and length to the used ring */
> +       vq->used.ring[used_idx].id = id;
> +       vq->used.ring[used_idx].len = used_len;
> +}
> +
> +/*
> + * Take an entry from the remote avail ring and add it to the remote
> + * used ring with the given length
> + *
> + * NOTE: does not update the used index
> + *
> + * @vq the virtqueue
> + * @avail_idx the index in the avail ring to take the entry from
> + * @used_idx the index in the used ring to put the entry
> + * @used_len the length used
> + */
> +static void vop_rem_avail_to_used(struct vop_vq *vq, unsigned int avail_idx,
> +                                 unsigned int used_idx, u32 used_len)
> +{
> +       u16 id;
> +
> +       /* Make sure the indices are inside the rings */
> +       avail_idx &= (VOP_RING_SIZE - 1);
> +       used_idx  &= (VOP_RING_SIZE - 1);
> +
> +       /* Get the index stored in the avail ring */
> +       id = le16_to_cpu(vq->guest->avail[avail_idx]);
> +
> +       /* Copy the index and length to the used ring */
> +       iowrite32(id, &vq->host->used[used_idx].id);
> +       iowrite32(used_len, &vq->host->used[used_idx].len);
> +}
> +
> +/*
> + * Return the number of entries available in the local avail ring
> + */
> +static unsigned int loc_num_avail(struct vop_vq *vq)
> +{
> +       return vq->avail.index - vq->dma.loc_avail;
> +}
> +
> +/*
> + * Return the number of entries available in the remote avail ring
> + */
> +static unsigned int rem_num_avail(struct vop_vq *vq)
> +{
> +       return le16_to_cpu(vq->guest->avail_idx) - vq->dma.rem_avail;
> +}
> +
> +/*
> + * Return a descriptor id from the local avail ring
> + *
> + * @vq the virtqueue
> + * @idx the index to return the id from
> + */
> +static u16 vop_loc_avail_id(struct vop_vq *vq, unsigned int idx)
> +{
> +       idx &= (VOP_RING_SIZE - 1);
> +       return vq->avail.ring[idx];
> +}
> +
> +/*
> + * Return a descriptor id from the remote avail ring
> + *
> + * @vq the virtqueue
> + * @idx the index to return the id from
> + */
> +static u16 vop_rem_avail_id(struct vop_vq *vq, unsigned int idx)
> +{
> +       idx &= (VOP_RING_SIZE - 1);
> +       return le16_to_cpu(vq->guest->avail[idx]);
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* Extra helpers for mergeable DMA                                            */
> +/*----------------------------------------------------------------------------*/
> +
> +/*
> + * TODO: the number of bytes being transmitted could be added to the avail
> + * TODO: ring, rather than just an index. I'm not sure it would make much
> + * TODO: difference, though.
> + */
> +
> +/*
> + * Calculate the number of bytes used in a local descriptor chain
> + *
> + * @vq the virtqueue
> + * @idx the start descriptor index
> + * @return the number of bytes
> + */
> +static unsigned int loc_num_bytes(struct vop_vq *vq, unsigned int idx)
> +{
> +       struct vop_loc_desc desc;
> +       unsigned int bytes = 0;
> +
> +       while (true) {
> +               vop_loc_desc(vq, idx, &desc);
> +               bytes += desc.len;
> +
> +               if (!(desc.flags & VOP_DESC_F_NEXT))
> +                       break;
> +
> +               idx = desc.next;
> +       }
> +
> +       return bytes;
> +}
> +
> +/*
> + * Calculate the number of bytes used in a remote descriptor chain
> + *
> + * @vq the virtqueue
> + * @idx the start descriptor index
> + * @return the number of bytes
> + */
> +static unsigned int rem_num_bytes(struct vop_vq *vq, unsigned int idx)
> +{
> +       struct vop_loc_desc desc;
> +       unsigned int bytes = 0;
> +
> +       while (true) {
> +               vop_rem_desc(vq, idx, &desc);
> +               bytes += desc.len;
> +
> +               if (!(desc.flags & VOP_DESC_F_NEXT))
> +                       break;
> +
> +               idx = desc.next;
> +       }
> +
> +       return bytes;
> +}
> +
> +/*
> + * Transmit the next local available entry to the remote side, splitting
> + * up the local descriptor as needed
> + *
> + * This routine makes the following assumptions:
> + * 1) The header already has the correct number of buffers set
> + * 2) The available buffers are all PAGE_SIZE
> + */
> +static int vop_dma_xmit(struct vop_vq *vq)
> +{
> +       struct vop_dma_info *dma = &vq->dma;
> +       struct dma_chan *chan = dma->chan;
> +       dma_cookie_t cookie;
> +
> +       unsigned int loc_idx, rem_idx;
> +       struct vop_loc_desc loc, rem;
> +
> +       struct vop_dma_cbinfo *cb;
> +       dma_addr_t src, dst;
> +       size_t len;
> +
> +       unsigned int loc_total = 0;
> +       unsigned int rem_total = 0;
> +       unsigned int bufs_used = 0;
> +
> +       /* Check that there is a local descriptor available */
> +       if (!loc_num_avail(vq)) {
> +               dev_dbg(vq->dev, "No local descriptors available\n");
> +               return -ENOSPC;
> +       }
> +
> +       /* Get the starting entry from each available ring */
> +       loc_idx = vop_loc_avail_id(vq, dma->loc_avail);
> +       rem_idx = vop_rem_avail_id(vq, dma->rem_avail);
> +
> +       dev_dbg(vq->dev, "rem_avail %d loc_num_bytes %d\n", rem_num_avail(vq), loc_num_bytes(vq, loc_idx));
> +
> +       /* Check that there are enough remote buffers available */
> +       if (rem_num_avail(vq) * PAGE_SIZE < loc_num_bytes(vq, loc_idx)) {
> +               dev_dbg(vq->dev, "Insufficient remote descriptors available\n");
> +               return -ENOSPC;
> +       }
> +
> +       /* Allocate DMA callback data */
> +       cb = kmem_cache_alloc(dma_cache, GFP_KERNEL);
> +       if (!cb) {
> +               dev_dbg(vq->dev, "Unable to allocate DMA callback data\n");
> +               return -ENOMEM;
> +       }
> +
> +       /* Load the starting descriptors */
> +       vop_loc_desc(vq, loc_idx, &loc);
> +       vop_rem_desc(vq, rem_idx, &rem);
> +
> +       while (true) {
> +
> +               dst = rem.addr + 0x80000000;
> +               src = loc.addr;
> +               len = min(loc.len, rem.len);
> +
> +               dev_dbg(vq->dev, "DMA xmit dst %.8x src %.8x len %d\n", dst, src, len);
> +               cookie = dma_async_memcpy_raw_to_raw(chan, dst, src, len);
> +               if (dma_submit_error(cookie)) {
> +                       dev_err(vq->dev, "DMA submit error\n");
> +                       goto out_free_cb;
> +               }
> +
> +               loc.len -= len;
> +               rem.len -= len;
> +               loc.addr += len;
> +               rem.addr += len;
> +
> +               loc_total += len;
> +               rem_total += len;
> +
> +               dev_dbg(vq->dev, "loc.len %d rem.len %d\n", loc.len, rem.len);
> +               dev_dbg(vq->dev, "loc.addr %.8x rem.addr %.8x\n", loc.addr, rem.addr);
> +               dev_dbg(vq->dev, "loc_total %d rem_total %d\n", loc_total, rem_total);
> +
> +               if (loc.len == 0) {
> +                       dev_dbg(vq->dev, "local: descriptor depleted, loading next\n");
> +
> +                       if (!(loc.flags & VOP_DESC_F_NEXT)) {
> +                               dev_dbg(vq->dev, "local: no next descriptor, chain finished\n");
> +                               break;
> +                       }
> +
> +                       dev_dbg(vq->dev, "local: fetching next descriptor\n");
> +                       loc_idx = loc.next;
> +                       vop_loc_desc(vq, loc_idx, &loc);
> +               }
> +
> +               if (rem.len == 0) {
> +                       dev_dbg(vq->dev, "remote: descriptor depleted, adding to used\n");
> +                       vop_rem_avail_to_used(vq, dma->rem_avail + bufs_used, dma->rem_used + bufs_used, rem_total);
> +                       bufs_used++;
> +
> +                       dev_dbg(vq->dev, "remote: fetching next descriptor\n");
> +                       rem_idx = vop_rem_avail_id(vq, dma->rem_avail + bufs_used);
> +                       vop_rem_desc(vq, rem_idx, &rem);
> +                       rem_total = 0;
> +               }
> +       }
> +
> +       /* Add the last remote descriptor to the used ring */
> +       BUG_ON(rem_total == 0);
> +       dev_dbg(vq->dev, "adding last remote descriptor to used ring\n");
> +       vop_rem_avail_to_used(vq, dma->rem_avail + bufs_used, dma->rem_used + bufs_used, rem_total);
> +       bufs_used++;
> +
> +       /* Add the local descriptor to the sude ring */
> +       dev_dbg(vq->dev, "adding only local descriptor to used ring\n");
> +       vop_loc_avail_to_used(vq, dma->loc_avail, dma->loc_used, loc_total);
> +
> +       /* Make very sure that everything written to the rings actually happened
> +        * bofer the DMA callback can be triggered */
> +       wmb();
> +
> +       /* Set up the DMA callback information */
> +       cb->vq = vq;
> +       cb->loc = 1;
> +       cb->rem = bufs_used;
> +
> +       dev_dbg(vq->dev, "setup DMA callback vq %p loc %d rem %d\n", vq, 1, bufs_used);
> +
> +       /* Trigger an interrupt when the DMA completes to update the used
> +        * indices and trigger the necessary callbacks */
> +       cookie = dma_async_interrupt(chan, dma_callback, cb);
> +       if (dma_submit_error(cookie)) {
> +               dev_err(vq->dev, "DMA interrupt submit error\n");
> +               goto out_free_cb;
> +       }
> +
> +       /* Everything was successful, so update the DMA resolver's state */
> +       dma->loc_avail++;
> +       dma->rem_avail += bufs_used;
> +       dma->loc_used++;
> +       dma->rem_used += bufs_used;
> +
> +       /* Start the DMA */
> +       dev_dbg(vq->dev, "DMA xmit setup successful, starting\n");
> +       dma_async_memcpy_issue_pending(chan);
> +
> +       return 0;
> +
> +out_free_cb:
> +       kmem_cache_free(dma_cache, cb);
> +       return -ENOMEM;
> +}
> +
> +/*
> + * Receive the next remote available entry to the local side, splitting
> + * up the remote descriptor as needed
> + *
> + * This routine makes the following assumptions:
> + * 1) The header already has the correct number of buffers set
> + * 2) The available buffers are all PAGE_SIZE
> + */
> +static int vop_dma_recv(struct vop_vq *vq)
> +{
> +       struct vop_dma_info *dma = &vq->dma;
> +       struct dma_chan *chan = dma->chan;
> +       dma_cookie_t cookie;
> +
> +       unsigned int loc_idx, rem_idx;
> +       struct vop_loc_desc loc, rem;
> +
> +       struct vop_dma_cbinfo *cb;
> +       dma_addr_t src, dst;
> +       size_t len;
> +
> +       unsigned int loc_total = 0;
> +       unsigned int rem_total = 0;
> +       unsigned int bufs_used = 0;
> +
> +       /* Check that there is a remote descriptor available */
> +       if (!rem_num_avail(vq)) {
> +               dev_dbg(vq->dev, "No remote descriptors available\n");
> +               return -ENOSPC;
> +       }
> +
> +       /* Get the starting entry from each available ring */
> +       loc_idx = vop_loc_avail_id(vq, dma->loc_avail);
> +       rem_idx = vop_rem_avail_id(vq, dma->rem_avail);
> +
> +       /* Check that there are enough local buffers available */
> +       if (loc_num_avail(vq) * PAGE_SIZE < rem_num_bytes(vq, rem_idx)) {
> +               dev_dbg(vq->dev, "Insufficient local descriptors available\n");
> +               return -ENOSPC;
> +       }
> +
> +       /* Allocate DMA callback data */
> +       cb = kmem_cache_alloc(dma_cache, GFP_KERNEL);
> +       if (!cb) {
> +               dev_dbg(vq->dev, "Unable to allocate DMA callback data\n");
> +               return -ENOMEM;
> +       }
> +
> +       /* Load the starting descriptors */
> +       vop_loc_desc(vq, loc_idx, &loc);
> +       vop_rem_desc(vq, rem_idx, &rem);
> +
> +       while (true) {
> +
> +               dst = loc.addr;
> +               src = rem.addr + 0x80000000;
> +               len = min(loc.len, rem.len);
> +
> +               dev_dbg(vq->dev, "DMA recv dst %.8x src %.8x len %d\n", dst, src, len);
> +               cookie = dma_async_memcpy_raw_to_raw(chan, dst, src, len);
> +               if (dma_submit_error(cookie)) {
> +                       dev_err(vq->dev, "DMA submit error\n");
> +                       goto out_free_cb;
> +               }
> +
> +               loc.len -= len;
> +               rem.len -= len;
> +               loc.addr += len;
> +               rem.addr += len;
> +
> +               loc_total += len;
> +               rem_total += len;
> +
> +               if (rem.len == 0) {
> +                       if (!(rem.flags & VOP_DESC_F_NEXT))
> +                               break;
> +
> +                       rem_idx = rem.next;
> +                       vop_rem_desc(vq, rem_idx, &rem);
> +               }
> +
> +               if (loc.len == 0) {
> +                       vop_loc_avail_to_used(vq, dma->loc_avail + bufs_used, dma->loc_used + bufs_used, loc_total);
> +                       bufs_used++;
> +
> +                       loc_idx = vop_loc_avail_id(vq, dma->loc_avail + bufs_used);
> +                       vop_loc_desc(vq, loc_idx, &loc);
> +                       loc_total = 0;
> +               }
> +       }
> +
> +       /* Add the last local descriptor to the used ring */
> +       BUG_ON(loc_total == 0);
> +       vop_loc_avail_to_used(vq, dma->loc_avail + bufs_used, dma->loc_used + bufs_used, loc_total);
> +       bufs_used++;
> +
> +       /* Add the remote descriptor to the used ring */
> +       vop_rem_avail_to_used(vq, dma->rem_avail, dma->rem_used, rem_total);
> +
> +       /* Make very sure that everything written to the rings actually happened
> +        * before the DMA callback can be triggered */
> +       wmb();
> +
> +       /* Set up the DMA callback information */
> +       cb->vq = vq;
> +       cb->loc = bufs_used;
> +       cb->rem = 1;
> +
> +       /* Trigger an interrupt when the DMA completes to update the used
> +        * indices and trigger the necessary callbacks */
> +       cookie = dma_async_interrupt(chan, dma_callback, cb);
> +       if (dma_submit_error(cookie)) {
> +               dev_err(vq->dev, "DMA interrupt submit error\n");
> +               goto out_free_cb;
> +       }
> +
> +       /* Everything was successful, so update the DMA resolver's state */
> +       dma->loc_avail += bufs_used;
> +       dma->rem_avail++;
> +       dma->loc_used += bufs_used;
> +       dma->rem_used++;
> +
> +       /* Start the DMA */
> +       dev_dbg(vq->dev, "DMA recv setup successful, starting\n");
> +       dma_async_memcpy_issue_pending(chan);
> +
> +       return 0;
> +
> +out_free_cb:
> +       kmem_cache_free(dma_cache, cb);
> +       return -ENOMEM;
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* Virtqueue Ops Infrastructure                                               */
> +/*----------------------------------------------------------------------------*/
> +
> +/*
> + * Modify the struct virtio_net_hdr_mrg_rxbuf's num_buffers field to account
> + * for the split that will happen in the DMA xmit routine
> + *
> + * This assumes that both sides have the same PAGE_SIZE
> + */
> +static void vop_fixup_vnet_mrg_hdr(struct scatterlist sg[], unsigned int out)
> +{
> +       struct virtio_net_hdr *hdr;
> +       struct virtio_net_hdr_mrg_rxbuf *mhdr;
> +       unsigned int bytes = 0;
> +
> +       /* There must be a header + data, at the least */
> +       BUG_ON(out < 2);
> +
> +       /* The first entry must be the structure */
> +       BUG_ON(sg->length != sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +
> +       hdr = sg_virt(sg);
> +       mhdr = sg_virt(sg);
> +
> +       /* We merge buffers together, so just count up the number of bytes
> +        * needed, then figure out how many pages that will be */
> +       for (/* none */; out; out--, sg = sg_next(sg))
> +               bytes += sg->length;
> +
> +       /* Of course, nobody ever imagined that we might actually use
> +        * this on machines with different endianness...
> +        *
> +        * We force little-endian for now, since that's what our host is */
> +       mhdr->num_buffers = cpu_to_le16(DIV_ROUND_UP(bytes, PAGE_SIZE));
> +
> +       /* Might as well fix up the other fields while we're at it */
> +       hdr->hdr_len = cpu_to_le16(hdr->hdr_len);
> +       hdr->gso_size = cpu_to_le16(hdr->gso_size);
> +       hdr->csum_start = cpu_to_le16(hdr->csum_start);
> +       hdr->csum_offset = cpu_to_le16(hdr->csum_offset);
> +}
> +
> +/*
> + * Add a buffer to our local descriptors and the local avail ring
> + *
> + * NOTE: there hasn't been any transfer yet, just adding to local
> + * NOTE: rings. The kick() will process any DMA that needs to happen
> + *
> + * @return 0 on success, -ERRNO otherwise
> + */
> +static int vop_add_buf(struct virtqueue *_vq, struct scatterlist sg[],
> +                      unsigned int out, unsigned int in, void *data)
> +{
> +       /* For now, we'll just add the buffers to our local descriptors and
> +        * avail ring */
> +       struct vop_vq *vq = to_vop_vq(_vq);
> +       unsigned int i, avail, head, uninitialized_var(prev);
> +
> +       BUG_ON(data == NULL);
> +       BUG_ON(out + in == 0);
> +
> +       /* Make sure we have space for this to succeed */
> +       if (vq->num_free < out + in) {
> +               dev_dbg(vq->dev, "No free space left: len=%d free=%d\n",
> +                               out + in, vq->num_free);
> +               return -ENOSPC;
> +       }
> +
> +       /* If this is an xmit buffer from virtio_net, fixup the header */
> +       if (out > 1) {
> +               dev_dbg(vq->dev, "Fixing up virtio_net header\n");
> +               vop_fixup_vnet_mrg_hdr(sg, out);
> +       }
> +
> +       head = vq->free_head;
> +
> +       /* DMA map the scatterlist */
> +       if (vop_dma_map_sg(vq->dev, sg, out, in)) {
> +               dev_err(vq->dev, "Failed to DMA map scatterlist\n");
> +               return -ENOMEM;
> +       }
> +
> +       /* We're about to use some buffers from the free list */
> +       vq->num_free -= out + in;
> +
> +       for (i = vq->free_head; out; i = vop_get_desc_next(vq, i), out--) {
> +               vop_set_desc_flags(vq, i, VOP_DESC_F_NEXT);
> +               vop_set_desc_addr(vq, i, sg_dma_address(sg));
> +               vop_set_desc_len(vq, i, sg->length);
> +
> +               prev = i;
> +               sg = sg_next(sg);
> +       }
> +
> +       for (/* none */; in; i = vop_get_desc_next(vq, i), in--) {
> +               vop_set_desc_flags(vq, i, VOP_DESC_F_NEXT | VOP_DESC_F_WRITE);
> +               vop_set_desc_addr(vq, i, sg_dma_address(sg));
> +               vop_set_desc_len(vq, i, sg->length);
> +
> +               prev = i;
> +               sg = sg_next(sg);
> +       }
> +
> +       /* Last one doesn't continue */
> +       vop_set_desc_flags(vq, prev, vop_get_desc_flags(vq, prev) & ~VOP_DESC_F_NEXT);
> +
> +       /* Update the free pointer */
> +       vq->free_head = i;
> +
> +       /* Set token */
> +       vq->data[head] = data;
> +
> +       /* Add an entry for the head of the chain into the avail array, but
> +        * don't update avail->idx until kick() */
> +       avail = (vq->avail.index + vq->num_added++) & (VOP_RING_SIZE - 1);
> +       vq->avail.ring[avail] = head;
> +
> +       dev_dbg(vq->dev, "Added buffer head %i to %p\n", head, vq);
> +       debug_dump_rings(vq, "Added buffer(s), dumping rings");
> +
> +       return 0;
> +}
> +
> +static inline bool loc_more_used(const struct vop_vq *vq)
> +{
> +       return vq->loc_last_used != vq->used.index;
> +}
> +
> +static void detach_buf(struct vop_vq *vq, unsigned int head)
> +{
> +       dma_addr_t addr;
> +       unsigned int idx, len;
> +       enum dma_data_direction dir;
> +       struct vop_loc_desc desc;
> +
> +       /* Clear data pointer */
> +       vq->data[head] = NULL;
> +
> +       /* Put the chain back on the free list, unmapping as we go */
> +       idx = head;
> +       while (true) {
> +               vop_loc_desc(vq, idx, &desc);
> +
> +               addr = desc.addr;
> +               len  = desc.len;
> +               dir  = (desc.flags & VOP_DESC_F_WRITE) ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
> +
> +               /* Unmap the entry */
> +               dma_unmap_single(vq->dev, addr, len, dir);
> +               vq->num_free++;
> +
> +               /* If there is no next descriptor, we're done */
> +               if (!(desc.flags & VOP_DESC_F_NEXT))
> +                       break;
> +
> +               idx = desc.next;
> +       }
> +
> +       vop_set_desc_next(vq, idx, vq->free_head);
> +       vq->free_head = head;
> +}
> +
> +/*
> + * Get a buffer from the used ring
> + *
> + * @return the data token given to add_buf(), or NULL if there
> + *         are no remaining buffers
> + */
> +static void *vop_get_buf(struct virtqueue *_vq, unsigned int *len)
> +{
> +       struct vop_vq *vq = to_vop_vq(_vq);
> +       unsigned int head, used;
> +       void *ret;
> +
> +       if (!loc_more_used(vq)) {
> +               dev_dbg(vq->dev, "No more buffers in queue\n");
> +               return NULL;
> +       }
> +
> +       used = vq->loc_last_used & (VOP_RING_SIZE - 1);
> +       head = vq->used.ring[used].id;
> +       *len = vq->used.ring[used].len;
> +
> +       BUG_ON(head >= VOP_RING_SIZE);
> +       BUG_ON(!vq->data[head]);
> +
> +       /* detach_buf() clears data, save it now */
> +       ret = vq->data[head];
> +       detach_buf(vq, head);
> +
> +       /* Update the last local used_idx */
> +       vq->loc_last_used++;
> +
> +       return ret;
> +}
> +
> +/*
> + * The avail ring changed, so we need to start as much DMA as we can
> + */
> +static void vop_kick(struct virtqueue *_vq)
> +{
> +       struct vop_vq *vq = to_vop_vq(_vq);
> +
> +       dev_dbg(vq->dev, "kick: making %d new buffers available\n", vq->num_added);
> +       vq->avail.index += vq->num_added;
> +       vq->num_added = 0;
> +
> +       /* Run the DMA resolver */
> +       dev_dbg(vq->dev, "kick: using resolver %pS\n", vq->resolve);
> +       schedule_work(&vq->work);
> +}
> +
> +/*
> + * Try to disable callbacks on the used ring (unreliable)
> + */
> +static void vop_disable_cb(struct virtqueue *_vq)
> +{
> +       struct vop_vq *vq = to_vop_vq(_vq);
> +       struct virtio_device *vdev = _vq->vdev;
> +
> +       dev_dbg(&vdev->dev, "disable callbacks\n");
> +       vq->flags = VOP_F_NO_INTERRUPT;
> +#if 0
> +       /*
> +        * FIXME: using this causes the host -> guest transfer rate to
> +        * FIXME: intermittently slow to 1/10th of the normal rate
> +        */
> +       vop_set_guest_flags(vq, vq->flags);
> +#endif
> +}
> +
> +/*
> + * Enable callbacks on changes to the used ring
> + *
> + * @return false if there are more pending buffers
> + *         true otherwise
> + */
> +static bool vop_enable_cb(struct virtqueue *_vq)
> +{
> +       struct vop_vq *vq = to_vop_vq(_vq);
> +
> +       /* We optimistically enable interrupts, then check if there
> +        * was more work to do */
> +       dev_dbg(vq->dev, "enable callbacks\n");
> +       vq->flags = 0;
> +#if 0
> +       /*
> +        * FIXME: using this causes the host -> guest transfer rate to
> +        * FIXME: intermittently slow to 1/10th of the normal rate
> +        */
> +       vop_set_guest_flags(vq, vq->flags);
> +#endif
> +
> +       if (unlikely(loc_more_used(vq)))
> +               return false;
> +
> +       return true;
> +}
> +
> +static struct virtqueue_ops vop_vq_ops = {
> +       .add_buf        = vop_add_buf,
> +       .get_buf        = vop_get_buf,
> +       .kick           = vop_kick,
> +       .disable_cb     = vop_disable_cb,
> +       .enable_cb      = vop_enable_cb,
> +};
> +
> +/*----------------------------------------------------------------------------*/
> +/* Virtio Device Infrastructure                                               */
> +/*----------------------------------------------------------------------------*/
> +
> +/* Read some bytes from the host's configuration area */
> +static void vopc_get(struct virtio_device *_vdev, unsigned offset, void *buf,
> +                    unsigned len)
> +{
> +       struct vop_vdev *vdev = to_vop_vdev(_vdev);
> +       void __iomem *config = vdev->host_status->config;
> +
> +       memcpy_fromio(buf, config + offset, len);
> +}
> +
> +/* Write some bytes to the host's configuration area */
> +static void vopc_set(struct virtio_device *_vdev, unsigned offset,
> +                    const void *buf, unsigned len)
> +{
> +       struct vop_vdev *vdev = to_vop_vdev(_vdev);
> +       void __iomem *config = vdev->host_status->config;
> +
> +       memcpy_toio(config + offset, buf, len);
> +}
> +
> +/* Read your own status bits */
> +static u8 vopc_get_status(struct virtio_device *_vdev)
> +{
> +       struct vop_vdev *vdev = to_vop_vdev(_vdev);
> +       u32 status;
> +
> +       status = le32_to_cpu(vdev->guest_status->status);
> +       dev_dbg(&vdev->vdev.dev, "%s(): -> 0x%.2x\n", __func__, (u8)status);
> +
> +       return (u8)status;
> +}
> +
> +static void vopc_set_status(struct virtio_device *_vdev, u8 status)
> +{
> +       struct vop_vdev *vdev = to_vop_vdev(_vdev);
> +       u32 old_status;
> +
> +       old_status = le32_to_cpu(vdev->guest_status->status);
> +       vdev->guest_status->status = cpu_to_le32(status);
> +
> +       dev_dbg(&vdev->vdev.dev, "%s(): <- 0x%.2x (was 0x%.2x)\n",
> +                       __func__, status, old_status);
> +
> +       /*
> +        * FIXME: we really need to notify the other side when status changes
> +        * FIXME: happen, so that they can take some action
> +        */
> +}
> +
> +static void vopc_reset(struct virtio_device *_vdev)
> +{
> +       struct vop_vdev *vdev = to_vop_vdev(_vdev);
> +
> +       dev_dbg(&vdev->vdev.dev, "%s(): status reset\n", __func__);
> +       vdev->guest_status->status = cpu_to_le32(0);
> +}
> +
> +/* Find the given virtqueue */
> +static struct virtqueue *vopc_find_vq(struct virtio_device *_vdev,
> +                                            unsigned index,
> +                                            void (*cb)(struct virtqueue *vq))
> +{
> +       struct vop_vdev *vdev = to_vop_vdev(_vdev);
> +       struct vop_vq *vq = &vdev->virtqueues[index];
> +       int i;
> +
> +       /* Check that we support the virtqueue at this index */
> +       if (index >= ARRAY_SIZE(vdev->virtqueues)) {
> +               dev_err(&vdev->vdev.dev, "no virtqueue for index %d\n", index);
> +               return ERR_PTR(-ENODEV);
> +       }
> +
> +       /* HACK: we only support virtio_net for now */
> +       if (vdev->vdev.id.device != VIRTIO_ID_NET) {
> +               dev_err(&vdev->vdev.dev, "only virtio_net is supported\n");
> +               return ERR_PTR(-ENODEV);
> +       }
> +
> +       /* Initialize the virtqueue to a clean state */
> +       vq->num_free = VOP_RING_SIZE;
> +       vq->dev = &vdev->vdev.dev;
> +       vq->vq.vq_ops = &vop_vq_ops;
> +
> +       /* Hook up the local virtqueues to the corresponding remote virtqueues */
> +       /* TODO: maybe move this to the setup_virtio_net() function */
> +       switch (index) {
> +       case 0: /* x86 xmit virtqueue, hook to ppc recv virtqueue */
> +               vq->guest = vdev->loc + 2048;
> +               vq->host  = vdev->rem + 2048;
> +               vq->resolve = vop_dma_recv;
> +               vq->kick_val = 0x8;
> +               break;
> +       case 1: /* x86 recv virtqueue, hook to ppc xmit virtqueue */
> +               vq->guest = vdev->loc + 1024;
> +               vq->host  = vdev->rem + 1024;
> +               vq->resolve = vop_dma_xmit;
> +               vq->kick_val = 0x4;
> +               break;
> +       case 2: /* x86 ctrl virtqueue -- ppc ctrl virtqueue */
> +       default:
> +               dev_err(vq->dev, "Unsupported virtqueue\n");
> +               return ERR_PTR(-ENODEV);
> +       }
> +
> +       dev_dbg(vq->dev, "vq %d guest %p host %p\n", index, vq->guest, vq->host);
> +
> +       /* Initialize the descriptor, avail, and used rings */
> +       for (i = 0; i < VOP_RING_SIZE; i++) {
> +               vop_set_desc_addr(vq, i, 0x0);
> +               vop_set_desc_len(vq, i, 0);
> +               vop_set_desc_flags(vq, i, 0);
> +               vop_set_desc_next(vq, i, (i + 1) & (VOP_RING_SIZE - 1));
> +
> +               vq->avail.ring[i] = 0;
> +               vq->used.ring[i].id = 0;
> +               vq->used.ring[i].len = 0;
> +       }
> +
> +       vq->avail.index = 0;
> +       vop_set_guest_flags(vq, 0);
> +
> +       /* This is the guest, the host has already initialized the rings for us */
> +       debug_dump_rings(vq, "found a virtqueue, dumping rings");
> +
> +       vq->vq.callback = cb;
> +       vq->vq.vdev = &vdev->vdev;
> +
> +       return &vq->vq;
> +}
> +
> +static void vopc_del_vq(struct virtqueue *_vq)
> +{
> +       struct vop_vq *vq = to_vop_vq(_vq);
> +       int i;
> +
> +       /* FIXME: make sure that DMA has stopped by this point */
> +
> +       /* Unmap and remove all outstanding descriptors from the ring */
> +       for (i = 0; i < VOP_RING_SIZE; i++) {
> +               if (vq->data[i]) {
> +                       dev_dbg(vq->dev, "cleanup detach buffer at index %d\n", i);
> +                       detach_buf(vq, i);
> +               }
> +       }
> +
> +       debug_dump_rings(vq, "virtqueue destroyed, dumping rings");
> +}
> +
> +/* Read the host's advertised features */
> +static u32 vopc_get_features(struct virtio_device *_vdev)
> +{
> +       struct vop_vdev *vdev = to_vop_vdev(_vdev);
> +       u32 ret;
> +
> +       ret = vop_get_host_features(vdev);
> +       dev_dbg(&vdev->vdev.dev, "%s(): host features 0x%.8x\n", __func__, ret);
> +
> +       return ret;
> +}
> +
> +/* At this point, we've chosen whichever features we can use and
> + * put them into the vdev->features array. We should probably notify
> + * the host at this point, but how will virtio react? */
> +static void vopc_finalize_features(struct virtio_device *_vdev)
> +{
> +       struct vop_vdev *vdev = to_vop_vdev(_vdev);
> +       struct device *dev = &vdev->vdev.dev;
> +
> +       /*
> +        * TODO: notify the other side at this point
> +        */
> +
> +       vdev->guest_status->features[0] = cpu_to_le32(vdev->vdev.features[0]);
> +       dev_dbg(dev, "%s(): final features 0x%.8lx\n", __func__, vdev->vdev.features[0]);
> +}
> +
> +static struct virtio_config_ops vop_config_ops = {
> +       .get                    = vopc_get,
> +       .set                    = vopc_set,
> +       .get_status             = vopc_get_status,
> +       .set_status             = vopc_set_status,
> +       .reset                  = vopc_reset,
> +       .find_vq                = vopc_find_vq,
> +       .del_vq                 = vopc_del_vq,
> +       .get_features           = vopc_get_features,
> +       .finalize_features      = vopc_finalize_features,
> +};
> +
> +/*----------------------------------------------------------------------------*/
> +/* Last-minute device setup code                                              */
> +/*----------------------------------------------------------------------------*/
> +
> +/*
> + * Do the last minute setup for virtio_net, now that the host memory is
> + * valid. This includes setting up pointers to the correct queues so that
> + * we can just start the virtqueues when the driver registers
> + */
> +static void setup_virtio_net(struct vop_vdev *vdev)
> +{
> +       /* TODO: move some of the setup code from find_vq() here */
> +}
> +
> +/*
> + * Do any last minute setup for a device just before starting it
> + *
> + * The host memory is now valid, so you should be setting up any pointers
> + * the device needs to the host memory
> + */
> +static int vop_setup_device(struct vop_dev *priv, int devnum)
> +{
> +       struct vop_vdev *vdev = &priv->devices[devnum];
> +       struct device *dev = priv->dev;
> +
> +       if (devnum >= ARRAY_SIZE(priv->devices)) {
> +               dev_err(dev, "Unknown virtio_device %d\n", devnum);
> +               return -ENODEV;
> +       }
> +
> +       /* Setup the device's pointers to host memory */
> +       vdev->rem = priv->host_mem  + (devnum * 4096);
> +       vdev->host_status = vdev->rem;
> +
> +       switch (devnum) {
> +       case 0: /* virtio_net */
> +               setup_virtio_net(vdev);
> +               break;
> +       default:
> +               dev_err(dev, "Device %d not implemented\n", devnum);
> +               return -ENODEV;
> +       }
> +
> +       return 0;
> +}
> +
> +/*
> + * Initialize and attempt to register a virtio_device
> + *
> + * @priv the driver data
> + * @devnum the virtio_device number (index into priv->devices)
> + */
> +static int vop_start_device(struct vop_dev *priv, int devnum)
> +{
> +       struct vop_vdev *vdev = &priv->devices[devnum];
> +       struct device *dev = priv->dev;
> +       int ret;
> +
> +       /* Check that we know about the device */
> +       if (devnum >= ARRAY_SIZE(priv->devices)) {
> +               dev_err(dev, "Unknown virtio_device %d\n", devnum);
> +               return -ENODEV;
> +       }
> +
> +       vdev->status = 0;
> +
> +       /* Do any last minute device-specific setup now that the
> +        * host memory is valid */
> +       ret = vop_setup_device(priv, devnum);
> +       if (ret) {
> +               dev_err(dev, "Unable to setup device %d\n", devnum);
> +               return ret;
> +       }
> +
> +       /* Register the device with the virtio subsystem */
> +       ret = register_virtio_device(&vdev->vdev);
> +       if (ret) {
> +               dev_err(dev, "Unable to register device %d\n", devnum);
> +               return ret;
> +       }
> +
> +       vdev->status = VOP_DEVICE_REGISTERED;
> +       return 0;
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* Work Functions                                                             */
> +/*----------------------------------------------------------------------------*/
> +
> +/*
> + * Start as much DMA as we can on the given virtqueue
> + *
> + * This is put on the system shared queue, and will start us much DMA as is
> + * available when it is called. This should be triggered when the host adds
> + * things to the avail rings, and when the guest adds things to the internal
> + * avail rings
> + *
> + * Make sure it doesn't sleep for too long, you're on the shared queue
> + */
> +static void vop_dma_work(struct work_struct *work)
> +{
> +       struct vop_vq *vq = container_of(work, struct vop_vq, work);
> +       int ret;
> +
> +       /* Start as many DMA transactions as we can, immediately */
> +       while (true) {
> +               ret = vq->resolve(vq);
> +               if (ret)
> +                       break;
> +       }
> +}
> +
> +/*
> + * Remove all virtio devices immediately
> + *
> + * This will be called by the host to make sure that we are in a stopped
> + * state. It should be callable when everything is already stopped.
> + *
> + * Make sure it doesn't sleep for too long, you're on the shared queue
> + */
> +static void vop_reset_work(struct work_struct *work)
> +{
> +       struct vop_dev *priv = container_of(work, struct vop_dev, reset_work);
> +       struct device *dev = priv->dev;
> +       struct vop_vdev *vdev;
> +       int i;
> +
> +       dev_dbg(dev, "Resetting all virtio devices\n");
> +       mutex_lock(&priv->mutex);
> +
> +       for (i = 0; i < ARRAY_SIZE(priv->devices); i++) {
> +               vdev = &priv->devices[i];
> +
> +               if (vdev->status & VOP_DEVICE_REGISTERED) {
> +                       dev_dbg(dev, "Unregistering virtio_device #%d\n", i);
> +                       unregister_virtio_device(&vdev->vdev);
> +               }
> +
> +               vdev->status &= ~VOP_DEVICE_REGISTERED;
> +       }
> +
> +       if (priv->host_mem) {
> +               iounmap(priv->host_mem);
> +               priv->host_mem = NULL;
> +       }
> +
> +       mutex_unlock(&priv->mutex);
> +}
> +
> +/*
> + * This will map the host's memory, as well as start the devices that the host
> + * requested
> + *
> + * Mailbox registers contents:
> + * IMR0 - the host memory physical address (must be <1GB)
> + * IMR1 - the devices the host wants started
> + */
> +static void vop_start_work(struct work_struct *work)
> +{
> +       struct vop_dev *priv = container_of(work, struct vop_dev, start_work);
> +       struct device *dev = priv->dev;
> +       struct vop_vdev *vdev;
> +       u32 address, devices;
> +       int i;
> +
> +       dev_dbg(dev, "Starting requested virtio devices\n");
> +       mutex_lock(&priv->mutex);
> +
> +       /* Read the requested address and devices from the mailbox registers */
> +       address = ioread32(priv->immr + IMR0_OFFSET);
> +       devices = ioread32(priv->immr + IMR1_OFFSET);
> +
> +       dev_dbg(dev, "address 0x%.8x\n", address);
> +       dev_dbg(dev, "devices 0x%.8x\n", devices);
> +
> +       /* Remap the host's registers */
> +       priv->host_mem = ioremap(address + 0x80000000, VOP_HOST_MEM_SIZE);
> +       if (!priv->host_mem) {
> +               dev_err(dev, "Unable to ioremap host memory\n");
> +               goto out_unlock;
> +       }
> +
> +       /* Start the requested devices */
> +       for (i = 0; i < ARRAY_SIZE(priv->devices); i++) {
> +               vdev = &priv->devices[i];
> +
> +               if (devices & (1 << i)) {
> +                       dev_dbg(dev, "Starting virtio_device #%d\n", i);
> +                       vop_start_device(priv, i);
> +               }
> +       }
> +
> +out_unlock:
> +       mutex_unlock(&priv->mutex);
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* Interrupt Handling                                                         */
> +/*----------------------------------------------------------------------------*/
> +
> +/*
> + * Schedule the work function for a given virtqueue only if the associated
> + * device is up and running. Otherwise, ignore the request
> + *
> + * @priv the private driver data
> + * @dev the virtio_device number in priv->devices[]
> + * @queue the virtqueue in vdev->virtqueues[]
> + */
> +static void schedule_work_if_ready(struct vop_dev *priv, int dev, int queue)
> +{
> +       struct vop_vdev *vdev = &priv->devices[dev];
> +       struct vop_vq *vq = &vdev->virtqueues[queue];
> +
> +       if (vdev->status & VOP_DEVICE_REGISTERED)
> +               schedule_work(&vq->work);
> +}
> +
> +static irqreturn_t vdev_interrupt(int irq, void *dev_id)
> +{
> +       struct vop_dev *priv = dev_id;
> +       struct device *dev = priv->dev;
> +       u32 imisr, idr;
> +
> +       imisr = ioread32(priv->immr + IMISR_OFFSET);
> +       idr   = ioread32(priv->immr + IDR_OFFSET);
> +
> +       dev_dbg(dev, "INTERRUPT idr 0x%.8x\n", idr);
> +
> +       /* Check the status register for doorbell interrupts */
> +       if (!(imisr & 0x8))
> +               return IRQ_NONE;
> +
> +       /* Clear all doorbell interrupts */
> +       iowrite32(idr, priv->immr + IDR_OFFSET);
> +
> +       /* Reset */
> +       if (idr & 0x1)
> +               schedule_work(&priv->reset_work);
> +
> +       /* Start */
> +       if (idr & 0x2)
> +               schedule_work(&priv->start_work);
> +
> +       /* vdev 0 vq 1 kick */
> +       if (idr & 0x4)
> +               schedule_work_if_ready(priv, 0, 1);
> +
> +       /* vdev 0 vq 0 kick */
> +       if (idr & 0x8)
> +               schedule_work_if_ready(priv, 0, 0);
> +
> +       if (idr & 0xfffffff0)
> +               dev_dbg(dev, "INTERRUPT unhandled 0x%.8x\n", idr & 0xfffffff0);
> +
> +       return IRQ_HANDLED;
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* Driver insertion time virtio device initialization                         */
> +/*----------------------------------------------------------------------------*/
> +
> +static void vdev_release(struct device *dev)
> +{
> +       /* TODO: this should probably do something useful */
> +       dev_dbg(dev, "%s: called\n", __func__);
> +}
> +
> +/*
> + * Do any device-specific setup for a virtio device
> + *
> + * This would include things like setting the feature bits for the
> + * device, as well as the device type.
> + *
> + * There is no access to host memory at this point, so don't access it
> + */
> +static void vop_setup_virtio_device(struct vop_dev *priv, int devnum)
> +{
> +       struct vop_vdev *vdev = &priv->devices[devnum];
> +       struct virtio_net_config *config;
> +       unsigned long features = 0;
> +
> +       /* HACK: we only support device #0 (virtio_net) right now */
> +       if (devnum != 0)
> +               return;
> +
> +       /* Generate a random ethernet address for the host to have
> +        *
> +        * This way, we could do something board-specific and get an
> +        * ethernet address that is consistent per-slot
> +        */
> +       config = (struct virtio_net_config *)vdev->guest_status->config;
> +       random_ether_addr(config->mac);
> +       dev_info(priv->dev, "Generated MAC %pM\n", config->mac);
> +
> +       /* Set the feature bits for the device */
> +       set_bit(VIRTIO_NET_F_MAC,       &features);
> +       set_bit(VIRTIO_NET_F_CSUM,      &features);
> +       set_bit(VIRTIO_NET_F_GSO,       &features);
> +       set_bit(VIRTIO_NET_F_MRG_RXBUF, &features);
> +
> +       vdev->guest_status->features[0] = cpu_to_le32(features);
> +       vdev->vdev.id.device = VIRTIO_ID_NET;
> +}
> +
> +/*
> + * Do all of the initialization of all of the virtqueues for a given virtio
> + * device. There is no access to host memory at this point, so don't access it
> + *
> + * @devnum the device number in the priv->devices[] array
> + */
> +static void vop_initialize_virtqueues(struct vop_dev *priv, int devnum)
> +{
> +       struct vop_vdev *vdev = &priv->devices[devnum];
> +       struct vop_vq *vq;
> +       int i;
> +
> +       for (i = 0; i < ARRAY_SIZE(vdev->virtqueues); i++) {
> +               vq = &vdev->virtqueues[i];
> +
> +               memset(vq, 0, sizeof(struct vop_vq));
> +               vq->immr = priv->immr;
> +               vq->dma.chan = priv->chan;
> +               INIT_WORK(&vq->work, vop_dma_work);
> +       }
> +}
> +
> +/*
> + * Do all of the initialization for the virtio devices that is possible without
> + * access to the host memory
> + *
> + * This includes setting up the pointers that you can and setting the feature
> + * bits so that the host can read them before he starts us
> + */
> +static void vop_initialize_devices(struct vop_dev *priv)
> +{
> +       struct device *parent = priv->dev;
> +       struct vop_vdev *vdev;
> +       struct device *vdev_dev;
> +       int i;
> +
> +       for (i = 0; i < ARRAY_SIZE(priv->devices); i++) {
> +               vdev = &priv->devices[i];
> +               vdev_dev = &vdev->vdev.dev;
> +
> +               /* Set up access to the guest memory, host memory isn't valid
> +                * yet, and will have to be set up just before we start */
> +               vdev->loc = priv->guest_mem + (i * 4096);
> +               vdev->guest_status = vdev->loc;
> +
> +               /* Initialize all of the device's virtqueues */
> +               vop_initialize_virtqueues(priv, i);
> +
> +               /* Zero the configuration space */
> +               memset(vdev->guest_status, 0, 1024);
> +
> +               /* Copy parent DMA parameters to this device */
> +               vdev_dev->dma_mask = parent->dma_mask;
> +               vdev_dev->dma_parms = parent->dma_parms;
> +               vdev_dev->coherent_dma_mask = parent->coherent_dma_mask;
> +
> +               vdev_dev->release = &vdev_release;
> +               vdev_dev->parent  = parent;
> +               vdev->vdev.config = &vop_config_ops;
> +
> +               /* Do any device-specific setup */
> +               vop_setup_virtio_device(priv, i);
> +       }
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* OpenFirmware Device Subsystem                                              */
> +/*----------------------------------------------------------------------------*/
> +
> +static int vdev_of_probe(struct of_device *op, const struct of_device_id *match)
> +{
> +       struct vop_dev *priv;
> +       dma_cap_mask_t mask;
> +       int ret;
> +
> +       /* Allocate private data */
> +       priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> +       if (!priv) {
> +               dev_err(&op->dev, "Unable to allocate device private data\n");
> +               ret = -ENOMEM;
> +               goto out_return;
> +       }
> +
> +       dev_set_drvdata(&op->dev, priv);
> +       priv->dev = &op->dev;
> +       mutex_init(&priv->mutex);
> +       INIT_WORK(&priv->reset_work, vop_reset_work);
> +       INIT_WORK(&priv->start_work, vop_start_work);
> +
> +       /* Get a DMA channel */
> +       dma_cap_zero(mask);
> +       dma_cap_set(DMA_MEMCPY, mask);
> +       dma_cap_set(DMA_INTERRUPT, mask);
> +       priv->chan = dma_request_channel(mask, NULL, NULL);
> +       if (!priv->chan) {
> +               dev_err(&op->dev, "Unable to get DMA channel\n");
> +               ret = -ENODEV;
> +               goto out_free_priv;
> +       }
> +
> +       /* Remap IMMR */
> +       priv->immr = ioremap(get_immrbase(), 0x100000);
> +       if (!priv->immr) {
> +               dev_err(&op->dev, "Unable to remap IMMR registers\n");
> +               ret = -ENOMEM;
> +               goto out_dma_release_channel;
> +       }
> +
> +       /* Set up a static 1GB window into host memory */
> +       iowrite32be(LAWAR0_ENABLE | 0x1D, priv->immr + LAWAR0_OFFSET);
> +       iowrite32be(POCMR0_ENABLE | 0xC0000, priv->immr + POCMR0_OFFSET);
> +       iowrite32be(0x0, priv->immr + POTAR0_OFFSET);
> +
> +       /* Allocate guest memory */
> +       priv->guest_mem = dma_alloc_coherent(&op->dev, VOP_GUEST_MEM_SIZE,
> +                                            &priv->guest_mem_addr, GFP_KERNEL);
> +       if (!priv->guest_mem) {
> +               dev_err(&op->dev, "Unable to allocate guest memory\n");
> +               ret = -ENOMEM;
> +               goto out_iounmap_immr;
> +       }
> +
> +       memset(priv->guest_mem, 0, VOP_GUEST_MEM_SIZE);
> +
> +       /* Program BAR1 so that it will hit the guest memory */
> +       iowrite32be(priv->guest_mem_addr >> 12, priv->immr + PITAR0_OFFSET);
> +
> +       /* Initialize all of the virtio devices with their features, etc */
> +       vop_initialize_devices(priv);
> +
> +       /* Disable mailbox interrupts */
> +       iowrite32(0x2 | 0x1, priv->immr + IMIMR_OFFSET);
> +
> +       /* Hook up the irq handler */
> +       priv->irq = irq_of_parse_and_map(op->node, 0);
> +       ret = request_irq(priv->irq, vdev_interrupt, IRQF_SHARED, driver_name, priv);
> +       if (ret)
> +               goto out_free_guest_mem;
> +
> +       dev_info(&op->dev, "Virtio-over-PCI guest driver installed\n");
> +       dev_info(&op->dev, "Physical memory @ 0x%.8x\n", priv->guest_mem_addr);
> +       dev_info(&op->dev, "Descriptor ring size: %d entries\n", VOP_RING_SIZE);
> +       return 0;
> +
> +out_free_guest_mem:
> +       dma_free_coherent(&op->dev, VOP_GUEST_MEM_SIZE, priv->guest_mem,
> +                         priv->guest_mem_addr);
> +out_iounmap_immr:
> +       iounmap(priv->immr);
> +out_dma_release_channel:
> +       dma_release_channel(priv->chan);
> +out_free_priv:
> +       kfree(priv);
> +out_return:
> +       return ret;
> +}
> +
> +static int vdev_of_remove(struct of_device *op)
> +{
> +       struct vop_dev *priv = dev_get_drvdata(&op->dev);
> +
> +       /* Stop the irq handler */
> +       free_irq(priv->irq, priv);
> +
> +       /* Unregister and reset all of the devices */
> +       schedule_work(&priv->reset_work);
> +       flush_scheduled_work();
> +
> +       dma_free_coherent(&op->dev, VOP_GUEST_MEM_SIZE, priv->guest_mem,
> +                         priv->guest_mem_addr);
> +       iounmap(priv->immr);
> +       dma_release_channel(priv->chan);
> +       kfree(priv);
> +
> +       return 0;
> +}
> +
> +static struct of_device_id vdev_of_match[] = {
> +       { .compatible = "fsl,mpc8349-mu", },
> +       {},
> +};
> +
> +static struct of_platform_driver vdev_of_driver = {
> +       .owner          = THIS_MODULE,
> +       .name           = driver_name,
> +       .match_table    = vdev_of_match,
> +       .probe          = vdev_of_probe,
> +       .remove         = vdev_of_remove,
> +};
> +
> +/*----------------------------------------------------------------------------*/
> +/* Module Init / Exit                                                         */
> +/*----------------------------------------------------------------------------*/
> +
> +static int __init vdev_init(void)
> +{
> +       dma_cache = KMEM_CACHE(vop_dma_cbinfo, 0);
> +       if (!dma_cache) {
> +               pr_err("%s: unable to create dma cache\n", driver_name);
> +               return -ENOMEM;
> +       }
> +
> +       return of_register_platform_driver(&vdev_of_driver);
> +}
> +
> +static void __exit vdev_exit(void)
> +{
> +       of_unregister_platform_driver(&vdev_of_driver);
> +       kmem_cache_destroy(dma_cache);
> +}
> +
> +MODULE_AUTHOR("Ira W. Snyder <iws@ovro.caltech.edu>");
> +MODULE_DESCRIPTION("Freescale Virtio-over-PCI Test Driver");
> +MODULE_LICENSE("GPL");
> +
> +module_init(vdev_init);
> +module_exit(vdev_exit);
> diff --git a/drivers/virtio/vop_host.c b/drivers/virtio/vop_host.c
> new file mode 100644
> index 0000000..814fa8a
> --- /dev/null
> +++ b/drivers/virtio/vop_host.c
> @@ -0,0 +1,1071 @@
> +/*
> + * Virtio-over-PCI Host Driver for MPC8349EMDS Guest
> + *
> + * Copyright (c) 2009 Ira W. Snyder <iws@ovro.caltech.edu>
> + *
> + * This file is licensed under the terms of the GNU General Public License
> + * version 2. This program is licensed "as is" without any warranty of any
> + * kind, whether express or implied.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/pci.h>
> +#include <linux/virtio.h>
> +#include <linux/virtio_config.h>
> +#include <linux/virtio_net.h>
> +#include <linux/workqueue.h>
> +#include <linux/interrupt.h>
> +
> +#include <linux/etherdevice.h>
> +
> +#include "vop_hw.h"
> +#include "vop.h"
> +
> +static const char driver_name[] = "vdev";
> +
> +struct vop_loc_desc {
> +       u32 addr;
> +       u32 len;
> +       u16 flags;
> +       u16 next;
> +};
> +
> +struct vop_vq {
> +
> +       /* The actual virtqueue itself */
> +       struct virtqueue vq;
> +
> +       struct device *dev;
> +
> +       /* The host ring address */
> +       struct vop_host_ring *host;
> +
> +       /* The guest ring address */
> +       struct vop_guest_ring __iomem *guest;
> +
> +       /* Local copy of the descriptors for fast access */
> +       struct vop_loc_desc desc[VOP_RING_SIZE];
> +
> +       /* The data token from add_buf() */
> +       void *data[VOP_RING_SIZE];
> +
> +       unsigned int num_free;
> +       unsigned int free_head;
> +       unsigned int num_added;
> +
> +       u16 avail_idx;
> +       u16 last_used_idx;
> +
> +       /* The doorbell to kick() */
> +       unsigned int kick_val;
> +       void __iomem *immr;
> +};
> +
> +/* Convert from a struct virtqueue to a struct vop_vq */
> +#define to_vop_vq(X) container_of(X, struct vop_vq, vq)
> +
> +/*
> + * This represents a virtio_device for our driver. It follows the memory
> + * layout shown above. It has pointers to all of the host and guest memory
> + * areas that we need to access
> + */
> +struct vop_vdev {
> +
> +       /* The specific virtio device (console, net, blk) */
> +       struct virtio_device vdev;
> +
> +       /* Local and remote memory */
> +       void *loc;
> +       void __iomem *rem;
> +
> +       /*
> +        * These are the status, feature, and configuration information
> +        * for this virtio device. They are exposed in our memory block
> +        * starting at offset 0.
> +        */
> +       struct vop_status *host_status;
> +
> +       /*
> +        * These are the status, feature, and configuration information
> +        * for the guest virtio device. They are exposed in the guest
> +        * memory block starting at offset 0.
> +        */
> +       struct vop_status __iomem *guest_status;
> +
> +       /*
> +        * These are the virtqueues for the virtio driver running this
> +        * device to use. The host portions are exposed in our memory block
> +        * starting at offset 1024. The exposed areas are aligned to 1024 byte
> +        * boundaries, so they appear at offets 1024, 2048, and 3072
> +        * respectively.
> +        */
> +       struct vop_vq virtqueues[3];
> +};
> +
> +#define to_vop_vdev(X) container_of(X, struct vop_vdev, vdev)
> +
> +/*
> + * This is information from the PCI subsystem about each MPC8349EMDS board
> + *
> + * It holds information for all of the possible virtio_devices that are
> + * attached to this board.
> + */
> +struct vop_dev {
> +
> +       struct pci_dev *pdev;
> +       struct device *dev;
> +
> +       /* PowerPC memory (PCI BAR0 and BAR1, respectively) */
> +       #define VOP_GUEST_MEM_SIZE 16384
> +       void __iomem *immr;
> +       void __iomem *netregs;
> +
> +       /* Host memory, visible to the PowerPC */
> +       #define VOP_HOST_MEM_SIZE 16384
> +       void *host_mem;
> +       dma_addr_t host_mem_addr;
> +
> +       /* The virtio devices */
> +       struct vop_vdev devices[4];
> +};
> +
> +/*----------------------------------------------------------------------------*/
> +/* Ring Debugging Helpers                                                     */
> +/*----------------------------------------------------------------------------*/
> +
> +#ifdef DEBUG_DUMP_RINGS
> +static void dump_guest_descriptors(struct vop_vq *vq)
> +{
> +       int i;
> +       struct vop_desc __iomem *desc;
> +
> +       pr_debug("DESC BG: 0xADDRESSX LENGTH 0xFLAG 0xNEXT\n");
> +       for (i = 0; i < VOP_RING_SIZE; i++) {
> +               desc = &vq->guest->desc[i];
> +               pr_debug("DESC %.2d: 0x%.8x %.6d 0x%.4x 0x%.4x\n", i,
> +                               ioread32(&desc->addr), ioread32(&desc->len),
> +                               ioread16(&desc->flags), ioread16(&desc->next));
> +       }
> +       pr_debug("DESC ED\n");
> +}
> +
> +static void dump_guest_avail(struct vop_vq *vq)
> +{
> +       int i;
> +
> +       pr_debug("BEGIN AVAIL DUMP\n");
> +       for (i = 0; i < VOP_RING_SIZE; i++)
> +               pr_debug("AVAIL %.2d: 0x%.4x\n", i, ioread16(&vq->guest->avail[i]));
> +       pr_debug("END AVAIL DUMP\n");
> +}
> +
> +static void dump_guest_ring(struct vop_vq *vq)
> +{
> +       pr_debug("BEGIN GUEST RING DUMP\n");
> +       dump_guest_descriptors(vq);
> +       pr_debug("GUEST FLAGS: 0x%.4x\n", ioread16(&vq->guest->flags));
> +       pr_debug("GUEST AVAIL_IDX: %d\n", ioread16(&vq->guest->avail_idx));
> +       dump_guest_avail(vq);
> +       pr_debug("END GUEST RING DUMP\n");
> +}
> +
> +static void dump_host_used(struct vop_vq *vq)
> +{
> +       int i;
> +       struct vop_used_elem *used;
> +
> +       pr_debug("USED BG: 0xIDID LENGTH\n");
> +       for (i = 0; i < VOP_RING_SIZE; i++) {
> +               used = &vq->host->used[i];
> +               pr_debug("USED %.2d: 0x%.4x %.6d\n", i, used->id, used->len);
> +       }
> +       pr_debug("USED ED\n");
> +}
> +
> +static void dump_host_ring(struct vop_vq *vq)
> +{
> +       pr_debug("BEGIN HOST RING DUMP\n");
> +       pr_debug("HOST FLAGS: 0x%.4x\n", vq->host->flags);
> +       pr_debug("HOST USED_IDX: 0x%.2d\n", vq->host->used_idx);
> +       dump_host_used(vq);
> +       pr_debug("END HOST RING DUMP\n");
> +}
> +
> +static void debug_dump_rings(struct vop_vq *vq, const char *msg)
> +{
> +       dev_dbg(vq->dev, "%s\n", msg);
> +       dump_guest_ring(vq);
> +       dump_host_ring(vq);
> +       pr_debug("\n");
> +}
> +#else
> +static void debug_dump_rings(struct vop_vq *vq, const char *msg)
> +{
> +       /* Nothing */
> +}
> +#endif /* DEBUG_DUMP_RINGS */
> +
> +/*----------------------------------------------------------------------------*/
> +/* Ring Access Helpers                                                        */
> +/*----------------------------------------------------------------------------*/
> +
> +static void vop_set_desc_addr(struct vop_vq *vq, unsigned int idx, u32 addr)
> +{
> +       vq->desc[idx].addr = addr;
> +       iowrite32(addr, &vq->guest->desc[idx].addr);
> +}
> +
> +static void vop_set_desc_len(struct vop_vq *vq, unsigned int idx, u32 len)
> +{
> +       vq->desc[idx].len = len;
> +       iowrite32(len, &vq->guest->desc[idx].len);
> +}
> +
> +static void vop_set_desc_flags(struct vop_vq *vq, unsigned int idx, u16 flags)
> +{
> +       vq->desc[idx].flags = flags;
> +       iowrite16(flags, &vq->guest->desc[idx].flags);
> +}
> +
> +static void vop_set_desc_next(struct vop_vq *vq, unsigned int idx, u16 next)
> +{
> +       vq->desc[idx].next = next;
> +       iowrite16(next, &vq->guest->desc[idx].next);
> +}
> +
> +static u32 vop_get_desc_addr(struct vop_vq *vq, unsigned int idx)
> +{
> +       return vq->desc[idx].addr;
> +}
> +
> +static u32 vop_get_desc_len(struct vop_vq *vq, unsigned int idx)
> +{
> +       return vq->desc[idx].len;
> +}
> +
> +static u16 vop_get_desc_flags(struct vop_vq *vq, unsigned int idx)
> +{
> +       return vq->desc[idx].flags;
> +}
> +
> +static u16 vop_get_desc_next(struct vop_vq *vq, unsigned int idx)
> +{
> +       return vq->desc[idx].next;
> +}
> +
> +/*
> + * Add an entry to the available ring at avail_idx pointing to the descriptor
> + * chain at index head
> + *
> + * @vq the virtqueue
> + * @idx the index in the avail ring
> + * @val the value to write
> + */
> +static void vop_set_avail_entry(struct vop_vq *vq, u16 idx, u16 val)
> +{
> +       iowrite16(val, &vq->guest->avail[idx]);
> +}
> +
> +/*
> + * Set the available index so the guest knows about buffers that were added
> + * with vop_set_avail_entry()
> + *
> + * @vq the virtqueue
> + * @idx the new avail_idx that the guest sees
> + */
> +static void vop_set_avail_idx(struct vop_vq *vq, u16 idx)
> +{
> +       iowrite16(idx, &vq->guest->avail_idx);
> +}
> +
> +/*
> + * Set the host's flags (in the guest memory)
> + *
> + * @vq the virtqueue
> + * @flags the new flags that the guest will see
> + */
> +static void vop_set_host_flags(struct vop_vq *vq, u16 flags)
> +{
> +       iowrite16(flags, &vq->guest->flags);
> +}
> +
> +/*
> + * Read the guests flags (in local memory)
> + *
> + * @vq the virtqueue
> + * @return the guest's flags
> + */
> +static u16 vop_get_guest_flags(struct vop_vq *vq)
> +{
> +       return le16_to_cpu(vq->host->flags);
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* Remote status helpers                                                      */
> +/*----------------------------------------------------------------------------*/
> +
> +static u32 vop_get_guest_status(struct vop_vdev *vdev)
> +{
> +       return ioread32(&vdev->guest_status->status);
> +}
> +
> +static u32 vop_get_guest_features(struct vop_vdev *vdev)
> +{
> +       return ioread32(&vdev->guest_status->features[0]);
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* Scatterlist DMA helpers                                                    */
> +/*----------------------------------------------------------------------------*/
> +
> +/*
> + * This function abuses some of the scatterlist code and implements
> + * dma_map_sg() in such a way that we don't need to keep the scatterlist
> + * around in order to unmap it.
> + *
> + * It is also designed to never merge scatterlist entries, which is
> + * never what we want for virtio.
> + *
> + * When it is time to unmap the buffer, you can use dma_unmap_single() to
> + * unmap each entry in the chain. Get the address, length, and direction
> + * from the descriptors! (keep a local copy for speed)
> + */
> +static int vop_dma_map_sg(struct device *dev, struct scatterlist sg[],
> +                         unsigned int out, unsigned int in)
> +{
> +       dma_addr_t addr;
> +       enum dma_data_direction dir;
> +       struct scatterlist *start;
> +       unsigned int i, failure;
> +
> +       start = sg;
> +
> +       for (i = 0; i < out + in; i++) {
> +
> +               /* Check for scatterlist chaining abuse */
> +               BUG_ON(sg == NULL);
> +
> +               dir = (i < out) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
> +               addr = dma_map_single(dev, sg_virt(sg), sg->length, dir);
> +
> +               if (dma_mapping_error(dev, addr))
> +                       goto unwind;
> +
> +               sg_dma_address(sg) = addr;
> +               sg = sg_next(sg);
> +       }
> +
> +       return 0;
> +
> +unwind:
> +       failure = i;
> +       sg = start;
> +
> +       for (i = 0; i < failure; i++) {
> +               dir = (i < out) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
> +               addr = sg_dma_address(sg);
> +
> +               dma_unmap_single(dev, addr, sg->length, dir);
> +               sg = sg_next(sg);
> +       }
> +
> +       return -ENOMEM;
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* struct virtqueue_ops infrastructure                                        */
> +/*----------------------------------------------------------------------------*/
> +
> +/*
> + * Modify the struct virtio_net_hdr_mrg_rxbuf's num_buffers field to account
> + * for the split that will happen in the DMA xmit routine
> + *
> + * This assumes that both sides have the same PAGE_SIZE
> + */
> +static void vop_fixup_vnet_mrg_hdr(struct scatterlist sg[], unsigned int out)
> +{
> +       struct virtio_net_hdr *hdr;
> +       struct virtio_net_hdr_mrg_rxbuf *mhdr;
> +       unsigned int bytes = 0;
> +
> +       /* There must be a header + data, at the least */
> +       BUG_ON(out < 2);
> +
> +       /* The first entry must be the structure */
> +       BUG_ON(sg->length != sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +
> +       hdr = sg_virt(sg);
> +       mhdr = sg_virt(sg);
> +
> +       /* We merge buffers together, so just count up the number of bytes
> +        * needed, then figure out how many pages that will be */
> +       for (/* none */; out; out--, sg = sg_next(sg))
> +               bytes += sg->length;
> +
> +       /* Of course, nobody ever imagined that we might actually use
> +        * this on machines with different endianness...
> +        *
> +        * We force big-endian for now, since that's what our guest is */
> +       mhdr->num_buffers = cpu_to_be16(DIV_ROUND_UP(bytes, PAGE_SIZE));
> +
> +       /* Might as well fix up the other fields while we're at it */
> +       hdr->hdr_len = cpu_to_be16(hdr->hdr_len);
> +       hdr->gso_size = cpu_to_be16(hdr->gso_size);
> +       hdr->csum_start = cpu_to_be16(hdr->csum_start);
> +       hdr->csum_offset = cpu_to_be16(hdr->csum_offset);
> +}
> +
> +static int vop_add_buf(struct virtqueue *_vq, struct scatterlist sg[],
> +                               unsigned int out, unsigned int in, void *data)
> +{
> +       struct vop_vq *vq = to_vop_vq(_vq);
> +       unsigned int i, avail, head, uninitialized_var(prev);
> +
> +       BUG_ON(data == NULL);
> +       BUG_ON(out + in == 0);
> +
> +       /* Make sure we have space for this to succeed */
> +       if (vq->num_free < out + in) {
> +               dev_dbg(vq->dev, "No free space left: len=%d free=%d\n",
> +                               out + in, vq->num_free);
> +               return -ENOSPC;
> +       }
> +
> +       /* If this is an xmit buffer from virtio_net, fixup the header */
> +       if (out > 1) {
> +               dev_dbg(vq->dev, "Fixing up virtio_net header\n");
> +               vop_fixup_vnet_mrg_hdr(sg, out);
> +       }
> +
> +       head = vq->free_head;
> +
> +       /* DMA map the scatterlist */
> +       if (vop_dma_map_sg(vq->dev, sg, out, in)) {
> +               dev_err(vq->dev, "Failed to DMA map scatterlist\n");
> +               return -ENOMEM;
> +       }
> +
> +       /* We're about to use some buffers from the free list */
> +       vq->num_free -= out + in;
> +
> +       for (i = vq->free_head; out; i = vop_get_desc_next(vq, i), out--) {
> +               vop_set_desc_flags(vq, i, VOP_DESC_F_NEXT);
> +               vop_set_desc_addr(vq, i, sg_dma_address(sg));
> +               vop_set_desc_len(vq, i, sg->length);
> +
> +               prev = i;
> +               sg = sg_next(sg);
> +       }
> +
> +       for (/* none */; in; i = vop_get_desc_next(vq, i), in--) {
> +               vop_set_desc_flags(vq, i, VOP_DESC_F_NEXT | VOP_DESC_F_WRITE);
> +               vop_set_desc_addr(vq, i, sg_dma_address(sg));
> +               vop_set_desc_len(vq, i, sg->length);
> +
> +               prev = i;
> +               sg = sg_next(sg);
> +       }
> +
> +       /* Last one doesn't continue */
> +       vop_set_desc_flags(vq, prev, vop_get_desc_flags(vq, prev) & ~VOP_DESC_F_NEXT);
> +
> +       /* Update the free pointer */
> +       vq->free_head = i;
> +
> +       /* Set token */
> +       vq->data[head] = data;
> +
> +       /* Add an entry for the head of the chain into the avail array, but
> +        * don't update avail->idx until kick() */
> +       avail = (vq->avail_idx + vq->num_added++) & (VOP_RING_SIZE - 1);
> +       vop_set_avail_entry(vq, avail, head);
> +
> +       dev_dbg(vq->dev, "Added buffer head %i to %p (num_free %d)\n", head, vq, vq->num_free);
> +       debug_dump_rings(vq, "Added buffer(s), dumping rings");
> +
> +       return 0;
> +}
> +
> +static inline bool more_used(const struct vop_vq *vq)
> +{
> +       return vq->last_used_idx != le16_to_cpu(vq->host->used_idx);
> +}
> +
> +static void detach_buf(struct vop_vq *vq, unsigned int head)
> +{
> +       unsigned int i, len;
> +       dma_addr_t addr;
> +       enum dma_data_direction dir;
> +
> +       /* Clear data pointer */
> +       vq->data[head] = NULL;
> +
> +       /* Put the chain back on the free list, unmapping as we go */
> +       i = head;
> +       while (true) {
> +               addr = vop_get_desc_addr(vq, i);
> +               len = vop_get_desc_len(vq, i);
> +               dir = (vop_get_desc_flags(vq, i) & VOP_DESC_F_WRITE) ?
> +                               DMA_FROM_DEVICE : DMA_TO_DEVICE;
> +
> +               /* Unmap the entry */
> +               dma_unmap_single(vq->dev, addr, len, dir);
> +               vq->num_free++;
> +
> +               /* Check for end-of-chain */
> +               if (!(vop_get_desc_flags(vq, i) & VOP_DESC_F_NEXT))
> +                       break;
> +
> +               i = vop_get_desc_next(vq, i);
> +       }
> +
> +       vop_set_desc_next(vq, i, vq->free_head);
> +       vq->free_head = head;
> +}
> +
> +static void *vop_get_buf(struct virtqueue *_vq, unsigned int *len)
> +{
> +       struct vop_vq *vq = to_vop_vq(_vq);
> +       unsigned int head, used_idx;
> +       void *ret;
> +
> +       if (!more_used(vq)) {
> +               dev_dbg(vq->dev, "No more buffers in queue\n");
> +               return NULL;
> +       }
> +
> +       used_idx = vq->last_used_idx & (VOP_RING_SIZE - 1);
> +       head = le32_to_cpu(vq->host->used[used_idx].id);
> +       *len = le32_to_cpu(vq->host->used[used_idx].len);
> +
> +       dev_dbg(vq->dev, "REMOVE buffer head %i from %p (len %d)\n", head, vq, *len);
> +       debug_dump_rings(vq, "Removing buffer, dumping rings");
> +
> +       BUG_ON(head >= VOP_RING_SIZE);
> +       BUG_ON(!vq->data[head]);
> +
> +       /* detach_buf() clears data, save it now */
> +       ret = vq->data[head];
> +       detach_buf(vq, head);
> +
> +       /* Update the last used_idx we've consumed */
> +       vq->last_used_idx++;
> +       return ret;
> +}
> +
> +static void vop_kick(struct virtqueue *_vq)
> +{
> +       struct vop_vq *vq = to_vop_vq(_vq);
> +
> +       dev_dbg(vq->dev, "making %d new buffers available to guest\n", vq->num_added);
> +       vq->avail_idx += vq->num_added;
> +       vq->num_added = 0;
> +       vop_set_avail_idx(vq, vq->avail_idx);
> +
> +       if (!(vop_get_guest_flags(vq) & VOP_F_NO_INTERRUPT)) {
> +               dev_dbg(vq->dev, "kicking the guest (new buffers in avail)\n");
> +               iowrite32(vq->kick_val, vq->immr + IDR_OFFSET);
> +               debug_dump_rings(vq, "ran a kick, dumping rings");
> +       }
> +}
> +
> +/* Write to the guest's flags register to disable interrupts */
> +static void vop_disable_cb(struct virtqueue *_vq)
> +{
> +       struct vop_vq *vq = to_vop_vq(_vq);
> +
> +       vop_set_host_flags(vq, VOP_F_NO_INTERRUPT);
> +}
> +
> +static bool vop_enable_cb(struct virtqueue *_vq)
> +{
> +       struct vop_vq *vq = to_vop_vq(_vq);
> +
> +       /* We optimistically enable interrupts, then check if
> +        * there was more to do */
> +       vop_set_host_flags(vq, 0);
> +
> +       if (unlikely(more_used(vq)))
> +               return false;
> +
> +       return true;
> +}
> +
> +static struct virtqueue_ops vop_vq_ops = {
> +       .add_buf        = vop_add_buf,
> +       .get_buf        = vop_get_buf,
> +       .kick           = vop_kick,
> +       .disable_cb     = vop_disable_cb,
> +       .enable_cb      = vop_enable_cb,
> +};
> +
> +/*----------------------------------------------------------------------------*/
> +/* struct virtio_device infrastructure                                        */
> +/*----------------------------------------------------------------------------*/
> +
> +/* Get something that the other side wants you to have, from configuration
> + * space. This is used to transfer the MAC address from the guest to the host,
> + * for example. It should be reading something from the guest, in this case */
> +static void vopc_get(struct virtio_device *_vdev, unsigned offset, void *buf,
> +                    unsigned len)
> +{
> +       struct vop_vdev *vdev = to_vop_vdev(_vdev);
> +       void __iomem *config = vdev->guest_status->config;
> +
> +       memcpy_fromio(buf, config + offset, len);
> +}
> +
> +/* Set something in the configuration space (currently unused) */
> +static void vopc_set(struct virtio_device *_vdev, unsigned offset,
> +                    const void *buf, unsigned len)
> +{
> +       struct vop_vdev *vdev = to_vop_vdev(_vdev);
> +       void __iomem *config = vdev->guest_status->config;
> +
> +       memcpy_toio(config + offset, buf, len);
> +}
> +
> +/* Get your own status */
> +static u8 vopc_get_status(struct virtio_device *_vdev)
> +{
> +       struct vop_vdev *vdev = to_vop_vdev(_vdev);
> +       u32 status;
> +
> +       status = le32_to_cpu(vdev->host_status->status);
> +       dev_dbg(&vdev->vdev.dev, "%s(): -> 0x%.2x\n", __func__, (u8)status);
> +
> +       return (u8)status;
> +}
> +
> +/* Set your own status */
> +static void vopc_set_status(struct virtio_device *_vdev, u8 status)
> +{
> +       struct vop_vdev *vdev = to_vop_vdev(_vdev);
> +       u32 old_status;
> +
> +       old_status = le32_to_cpu(vdev->host_status->status);
> +       vdev->host_status->status = cpu_to_le32(status);
> +
> +       dev_dbg(&vdev->vdev.dev, "%s(): <- 0x%.2x (was 0x%.2x)\n",
> +                       __func__, status, old_status);
> +
> +       /*
> +        * FIXME: we really need to notify the other side when status changes
> +        * FIXME: happen, so that they can take some action
> +        */
> +}
> +
> +/* Reset your own status */
> +static void vopc_reset(struct virtio_device *_vdev)
> +{
> +       struct vop_vdev *vdev = to_vop_vdev(_vdev);
> +
> +       dev_dbg(&vdev->vdev.dev, "%s(): status reset\n", __func__);
> +       vdev->host_status->status = cpu_to_le32(0);
> +}
> +
> +static struct virtqueue *vopc_find_vq(struct virtio_device *_vdev,
> +                                            unsigned index,
> +                                            void (*cb)(struct virtqueue *vq))
> +{
> +       struct vop_vdev *vdev = to_vop_vdev(_vdev);
> +       struct vop_vq *vq = &vdev->virtqueues[index];
> +       int i;
> +
> +       /* Check that we support the virtqueue at this index */
> +       if (index >= ARRAY_SIZE(vdev->virtqueues)) {
> +               dev_err(&vdev->vdev.dev, "no virtqueue for index %d\n", index);
> +               return ERR_PTR(-ENODEV);
> +       }
> +
> +       /* HACK: we only support virtio_net for now */
> +       if (vdev->vdev.id.device != VIRTIO_ID_NET) {
> +               dev_err(&vdev->vdev.dev, "only virtio_net is supported\n");
> +               return ERR_PTR(-ENODEV);
> +       }
> +
> +       /* Initialize the virtqueue to a clean state */
> +       vq->num_free = VOP_RING_SIZE;
> +       vq->dev = &vdev->vdev.dev;
> +
> +       switch (index) {
> +       case 0: /* x86 recv virtqueue -- ppc xmit virtqueue */
> +               vq->guest = vdev->rem + 1024;
> +               vq->host  = vdev->loc + 1024;
> +               break;
> +       case 1: /* x86 xmit virtqueue -- ppc recv virtqueue */
> +               vq->guest = vdev->rem + 2048;
> +               vq->host  = vdev->loc + 2048;
> +               break;
> +       default:
> +               dev_err(vq->dev, "unknown virtqueue %d\n", index);
> +               return ERR_PTR(-ENODEV);
> +       }
> +
> +       /* Initialize the descriptor, avail, and used rings */
> +       for (i = 0; i < VOP_RING_SIZE; i++) {
> +               vop_set_desc_addr(vq, i, 0x0);
> +               vop_set_desc_len(vq, i, 0);
> +               vop_set_desc_flags(vq, i, 0);
> +               vop_set_desc_next(vq, i, (i + 1) & (VOP_RING_SIZE - 1));
> +
> +               vop_set_avail_entry(vq, i, 0);
> +               vq->host->used[i].id = cpu_to_le32(0);
> +               vq->host->used[i].len = cpu_to_le32(0);
> +       }
> +
> +       vq->avail_idx = 0;
> +       vop_set_avail_idx(vq, 0);
> +       vop_set_host_flags(vq, 0);
> +
> +       debug_dump_rings(vq, "found a virtqueue, dumping rings");
> +
> +       vq->vq.callback = cb;
> +       vq->vq.vdev = &vdev->vdev;
> +       vq->vq.vq_ops = &vop_vq_ops;
> +
> +       return &vq->vq;
> +}
> +
> +static void vopc_del_vq(struct virtqueue *_vq)
> +{
> +       struct vop_vq *vq = to_vop_vq(_vq);
> +       int i;
> +
> +       /* FIXME: make sure that DMA has stopped by this point */
> +
> +       /* Unmap and remove all outstanding descriptors from the ring */
> +       for (i = 0; i < VOP_RING_SIZE; i++) {
> +               if (vq->data[i]) {
> +                       dev_dbg(vq->dev, "cleanup detach buffer at index %d\n", i);
> +                       detach_buf(vq, i);
> +               }
> +       }
> +
> +       debug_dump_rings(vq, "virtqueue destroyed, dumping rings");
> +}
> +
> +static u32 vopc_get_features(struct virtio_device *_vdev)
> +{
> +       struct vop_vdev *vdev = to_vop_vdev(_vdev);
> +       u32 ret;
> +
> +       ret = vop_get_guest_features(vdev);
> +       dev_info(&vdev->vdev.dev, "%s(): guest features 0x%.8x\n", __func__, ret);
> +
> +       return ret;
> +}
> +
> +static void vopc_finalize_features(struct virtio_device *_vdev)
> +{
> +       struct vop_vdev *vdev = to_vop_vdev(_vdev);
> +
> +       /*
> +        * TODO: notify the other side at this point
> +        */
> +
> +       vdev->host_status->features[0] = cpu_to_le32(vdev->vdev.features[0]);
> +       dev_info(&vdev->vdev.dev, "%s(): final features 0x%.8lx\n", __func__, vdev->vdev.features[0]);
> +}
> +
> +static struct virtio_config_ops vop_config_ops = {
> +       .get                    = vopc_get,
> +       .set                    = vopc_set,
> +       .get_status             = vopc_get_status,
> +       .set_status             = vopc_set_status,
> +       .reset                  = vopc_reset,
> +       .find_vq                = vopc_find_vq,
> +       .del_vq                 = vopc_del_vq,
> +       .get_features           = vopc_get_features,
> +       .finalize_features      = vopc_finalize_features,
> +};
> +
> +/*----------------------------------------------------------------------------*/
> +/* Setup code for virtio devices                                              */
> +/*----------------------------------------------------------------------------*/
> +
> +static void vop_release(struct device *dev)
> +{
> +       dev_dbg(dev, "calling device release\n");
> +}
> +
> +static int setup_virtio_device(struct vop_dev *priv, int devnum)
> +{
> +       struct vop_vdev *vdev = &priv->devices[devnum];
> +       struct device *dev = priv->dev;
> +       int i;
> +
> +       /* Set up the pointers to the guest and host memory areas */
> +       vdev->loc = priv->host_mem + (devnum * 4096);
> +       vdev->rem = priv->netregs  + (devnum * 4096);
> +       dev_dbg(dev, "memory guest 0x%p host 0x%p\n", vdev->rem, vdev->loc);
> +
> +       /* Set up the pointers to the guest and host status areas */
> +       vdev->guest_status = vdev->rem;
> +       vdev->host_status  = vdev->loc;
> +       dev_dbg(dev, "status guest 0x%p host 0x%p\n", vdev->rem, vdev->loc);
> +
> +       /* The find_vq() must set up the correct mappings to virtqueues itself,
> +        * so we cannot do it here */
> +       for (i = 0; i < ARRAY_SIZE(vdev->virtqueues); i++) {
> +               memset(&vdev->virtqueues[i], 0, sizeof(struct vop_vq));
> +               vdev->virtqueues[i].immr = priv->immr;
> +               vdev->virtqueues[i].kick_val = 1 << ((devnum * 4) + i + 2);
> +               dev_dbg(dev, "vq %d cleared, kick %d\n", i, (devnum * 4) + i + 2);
> +       }
> +
> +       /* Zero out the configuration space completely */
> +       memset(vdev->host_status, 0, 1024);
> +
> +       /* Copy the parent DMA parameters to this virtio_device */
> +       vdev->vdev.dev.dma_mask = dev->dma_mask;
> +       vdev->vdev.dev.dma_parms = dev->dma_parms;
> +       vdev->vdev.dev.coherent_dma_mask = dev->coherent_dma_mask;
> +
> +       /* Setup everything except the device type */
> +       vdev->vdev.dev.release = &vop_release;
> +       vdev->vdev.dev.parent  = dev;
> +       vdev->vdev.config      = &vop_config_ops;
> +
> +       return 0;
> +}
> +
> +static int register_virtio_net(struct vop_dev *priv)
> +{
> +       struct vop_vdev *vdev = &priv->devices[0];
> +       struct virtio_net_config *config;
> +       unsigned long features = 0;
> +       int ret;
> +
> +       /* Run the common setup routine */
> +       ret = setup_virtio_device(priv, 0);
> +       if (ret) {
> +               dev_err(priv->dev, "unable to setup virtio_net\n");
> +               return ret;
> +       }
> +
> +       /* Generate a random ethernet address for the other side
> +        *
> +        * This is necessary so we can allow it to give us a consistent
> +        * MAC address for itself, using something board-specific
> +        *
> +        * The feature bits must match for it to work correctly
> +        */
> +       config = (struct virtio_net_config *)vdev->host_status->config;
> +       random_ether_addr(config->mac);
> +       dev_info(priv->dev, "Generated MAC %pM\n", config->mac);
> +
> +       /* Set the feature bits for the device */
> +       set_bit(VIRTIO_NET_F_MAC,       &features);
> +       set_bit(VIRTIO_NET_F_CSUM,      &features);
> +       set_bit(VIRTIO_NET_F_GSO,       &features);
> +       set_bit(VIRTIO_NET_F_MRG_RXBUF, &features);
> +
> +       vdev->host_status->features[0] = cpu_to_le32(features);
> +       vdev->vdev.id.device = VIRTIO_ID_NET;
> +
> +       /* Register the virtio device */
> +       return register_virtio_device(&vdev->vdev);
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* Interrupt Handling                                                         */
> +/*----------------------------------------------------------------------------*/
> +
> +static irqreturn_t vdev_interrupt(int irq, void *dev_id)
> +{
> +       struct vop_dev *priv = dev_id;
> +       struct virtqueue *vq;
> +       u32 omisr, odr;
> +
> +       omisr = ioread32(priv->immr + OMISR_OFFSET);
> +       odr   = ioread32(priv->immr + ODR_OFFSET);
> +
> +       /* Check the status register for doorbell interrupts */
> +       if (!(omisr & 0x8))
> +               return IRQ_NONE;
> +
> +       /* Clear all doorbell interrupts */
> +       iowrite32(odr, priv->immr + ODR_OFFSET);
> +
> +       if (odr & 0x4) {
> +               vq = &priv->devices[0].virtqueues[0].vq;
> +               vq->callback(vq);
> +       }
> +
> +       if (odr & 0x8) {
> +               vq = &priv->devices[0].virtqueues[1].vq;
> +               vq->callback(vq);
> +       }
> +
> +       return IRQ_HANDLED;
> +}
> +
> +/*----------------------------------------------------------------------------*/
> +/* PCI Subsystem                                                              */
> +/*----------------------------------------------------------------------------*/
> +
> +static int vop_probe(struct pci_dev *dev, const struct pci_device_id *id)
> +{
> +       struct vop_dev *priv;
> +       int ret;
> +
> +       priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> +       if (!priv) {
> +               ret = -ENOMEM;
> +               goto out_return;
> +       }
> +
> +       pci_set_drvdata(dev, priv);
> +       priv->dev = &dev->dev;
> +
> +       /* Hardware Initialization */
> +       ret = pci_enable_device(dev);
> +       if (ret)
> +               goto out_kfree_priv;
> +
> +       pci_set_master(dev);
> +       ret = pci_request_regions(dev, driver_name);
> +       if (ret)
> +               goto out_pci_disable_device;
> +
> +       priv->immr = pci_ioremap_bar(dev, 0);
> +       if (!priv->immr) {
> +               ret = -ENOMEM;
> +               goto out_pci_release_regions;
> +       }
> +
> +       priv->netregs = pci_ioremap_bar(dev, 1);
> +       if (!priv->netregs) {
> +               ret = -ENOMEM;
> +               goto out_iounmap_immr;
> +       }
> +
> +       /* The device can only see the lowest 1GB of memory over the bus */
> +       dev->dev.coherent_dma_mask = DMA_BIT_MASK(30);
> +       ret = dma_set_mask(&dev->dev, DMA_BIT_MASK(30));
> +       if (ret) {
> +               dev_err(&dev->dev, "Unable to set DMA mask\n");
> +               goto out_iounmap_netregs;
> +       }
> +
> +       /* Allocate the host memory, for writing by the guest */
> +       priv->host_mem = dma_alloc_coherent(&dev->dev, VOP_HOST_MEM_SIZE,
> +                       &priv->host_mem_addr, GFP_KERNEL);
> +       if (!priv->host_mem) {
> +               dev_err(&dev->dev, "Unable to allocate host memory\n");
> +               ret = -ENOMEM;
> +               goto out_iounmap_netregs;
> +       }
> +
> +       /* We use the guest's mailbox 0 to hold the host memory address */
> +       iowrite32(priv->host_mem_addr, priv->immr + IMR0_OFFSET);
> +
> +       /* Reset all of the devices */
> +       iowrite32(0x1, priv->immr + IDR_OFFSET);
> +
> +       /* Mask all of the MBOX interrupts */
> +       iowrite32(0x1 | 0x2, priv->immr + OMIMR_OFFSET);
> +
> +       /* Setup the virtio_net instance */
> +       ret = register_virtio_net(priv);
> +       if (ret) {
> +               dev_err(&dev->dev, "Unable to register virtio_net\n");
> +               goto out_free_host_mem;
> +       }
> +
> +       /* Hook up the interrupt handler */
> +       ret = request_irq(dev->irq, vdev_interrupt, IRQF_SHARED, driver_name, priv);
> +       if (ret) {
> +               dev_err(&dev->dev, "Unable to register interrupt handler\n");
> +               goto out_unregister_virtio_net;
> +       }
> +
> +       /* Start virtio_net */
> +       iowrite32(0x1, priv->immr + IMR1_OFFSET);
> +       iowrite32(0x2, priv->immr + IDR_OFFSET);
> +
> +       return 0;
> +
> +out_unregister_virtio_net:
> +       unregister_virtio_device(&priv->devices[0].vdev);
> +out_free_host_mem:
> +       dma_free_coherent(&dev->dev, VOP_HOST_MEM_SIZE, priv->host_mem,
> +                       priv->host_mem_addr);
> +out_iounmap_netregs:
> +       iounmap(priv->netregs);
> +out_iounmap_immr:
> +       iounmap(priv->immr);
> +out_pci_release_regions:
> +       pci_release_regions(dev);
> +out_pci_disable_device:
> +       pci_disable_device(dev);
> +out_kfree_priv:
> +       kfree(priv);
> +out_return:
> +       return ret;
> +}
> +
> +static void vop_remove(struct pci_dev *dev)
> +{
> +       struct vop_dev *priv = pci_get_drvdata(dev);
> +
> +       free_irq(dev->irq, priv);
> +
> +       /* Reset everything */
> +       iowrite32(0x1, priv->immr + IDR_OFFSET);
> +
> +       /* Unregister virtio_net */
> +       unregister_virtio_device(&priv->devices[0].vdev);
> +
> +       /* Clear the host memory address from the guest's mailbox 0 */
> +       iowrite32(0x0, priv->immr + IMR0_OFFSET);
> +       iowrite32(0x0, priv->immr + IMR1_OFFSET);
> +
> +       dma_free_coherent(&dev->dev, VOP_HOST_MEM_SIZE, priv->host_mem,
> +                       priv->host_mem_addr);
> +       iounmap(priv->netregs);
> +       iounmap(priv->immr);
> +       pci_release_regions(dev);
> +       pci_disable_device(dev);
> +       kfree(priv);
> +}
> +
> +#define PCI_DEVID_FSL_MPC8349EMDS 0x0080
> +
> +/* The list of devices that this module will support */
> +static struct pci_device_id vop_ids[] = {
> +       { PCI_DEVICE(PCI_VENDOR_ID_FREESCALE, PCI_DEVID_FSL_MPC8349EMDS), },
> +       { 0, }
> +};
> +MODULE_DEVICE_TABLE(pci, vop_ids);
> +
> +static struct pci_driver vop_pci_driver = {
> +       .name     = (char *)driver_name,
> +       .id_table = vop_ids,
> +       .probe    = vop_probe,
> +       .remove   = vop_remove,
> +};
> +
> +/*----------------------------------------------------------------------------*/
> +/* Module Init / Exit                                                         */
> +/*----------------------------------------------------------------------------*/
> +
> +static int __init vop_init(void)
> +{
> +       return pci_register_driver(&vop_pci_driver);
> +}
> +
> +static void __exit vop_exit(void)
> +{
> +       pci_unregister_driver(&vop_pci_driver);
> +}
> +
> +MODULE_AUTHOR("Ira W. Snyder <iws@ovro.caltech.edu>");
> +MODULE_DESCRIPTION("Virtio-PCI-Host Test Driver");
> +MODULE_LICENSE("GPL");
> +
> +module_init(vop_init);
> +module_exit(vop_exit);
> diff --git a/drivers/virtio/vop_hw.h b/drivers/virtio/vop_hw.h
> new file mode 100644
> index 0000000..8a19d3f
> --- /dev/null
> +++ b/drivers/virtio/vop_hw.h
> @@ -0,0 +1,80 @@
> +/*
> + * Register offsets for the MPC8349EMDS Message Unit from the IMMR base address
> + *
> + * Copyright (c) 2008 Ira W. Snyder <iws@ovro.caltech.edu>
> + *
> + * This file is licensed under the terms of the GNU General Public License
> + * version 2. This program is licensed "as is" without any warranty of any
> + * kind, whether express or implied.
> + */
> +
> +#ifndef PCINET_HW_H
> +#define PCINET_HW_H
> +
> +#define SGPRL_OFFSET           0x0100
> +#define SGPRH_OFFSET           0x0104
> +
> +/* mpc8349emds message unit register offsets */
> +#define OMISR_OFFSET           0x8030
> +#define OMIMR_OFFSET           0x8034
> +#define IMR0_OFFSET            0x8050
> +#define IMR1_OFFSET            0x8054
> +#define OMR0_OFFSET            0x8058
> +#define OMR1_OFFSET            0x805C
> +#define ODR_OFFSET             0x8060
> +#define IDR_OFFSET             0x8068
> +#define IMISR_OFFSET           0x8080
> +#define IMIMR_OFFSET           0x8084
> +
> +
> +/* mpc8349emds pci and local access window register offsets */
> +#define LAWAR0_OFFSET          0x0064
> +#define LAWAR0_ENABLE          (1<<31)
> +
> +#define POCMR0_OFFSET          0x8410
> +#define POCMR0_ENABLE          (1<<31)
> +
> +#define POTAR0_OFFSET          0x8400
> +
> +#define LAWAR1_OFFSET          0x006c
> +#define LAWAR1_ENABLE          (1<<31)
> +
> +#define POCMR1_OFFSET          0x8428
> +#define POCMR1_ENABLE          (1<<31)
> +
> +#define POTAR1_OFFSET          0x8418
> +
> +
> +/* mpc8349emds dma controller register offsets */
> +#define DMAMR0_OFFSET          0x8100
> +#define DMASR0_OFFSET          0x8104
> +#define DMASAR0_OFFSET         0x8110
> +#define DMADAR0_OFFSET         0x8118
> +#define DMABCR0_OFFSET         0x8120
> +
> +#define DMA_CHANNEL_BUSY       (1<<2)
> +
> +#define DMA_DIRECT_MODE_SNOOP  (1<<20)
> +#define DMA_CHANNEL_MODE_DIRECT        (1<<2)
> +#define DMA_CHANNEL_START      (1<<0)
> +
> +
> +/* mpc8349emds pci and local access window register offsets */
> +#define LAWAR0_OFFSET          0x0064
> +#define LAWAR0_ENABLE          (1<<31)
> +
> +#define POCMR0_OFFSET          0x8410
> +#define POCMR0_ENABLE          (1<<31)
> +
> +#define POTAR0_OFFSET          0x8400
> +
> +
> +/* mpc8349emds pci and inbound window register offsets */
> +#define PITAR0_OFFSET          0x8568
> +#define PIWAR0_OFFSET          0x8578
> +
> +#define PIWAR0_ENABLED         (1<<31)
> +#define PIWAR0_PREFETCH                (1<<29)
> +#define PIWAR0_IWS_4K          0xb
> +
> +#endif /* PCINET_HW_H */
> --
> 1.5.4.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>
David Hawkins April 14, 2009, 9:23 p.m. UTC | #11
Hi Grant,

> I like this a lot.  I need to do much the same thing on one of my
> platforms, so I'm going to use your patch as my starting point.  Have
> you made many changes since you posted this version of your patch?
> I'd like to collaborate on the development and help to get it
> mainlined.
> 
> In my case I've got an MPC5200 as the 'host' and a Xilinx Virtex
> (ppc440) as the 'client'.  I intend set aside a region of the Xilinx
> Virtex's memory space for the shared queues.  I'm starting work on it
> now, and I'll provide you with feedback and/or patches as I make
> progress.

I'll let Ira update you on the patch status.

If you want someone to chat about the hardware-level interaction,
feel free to chat off-list - assuming of course that no one wants
to hear us talk hardware :)

I selected the MPC8349EA in-part due to its PCI mailboxes, so
that we could implement this type of driver (an interlocked
handshake for flow control of data transfers). The previous
chip I'd used was a PLX PCI9054 Master/Target, and it has
similar registers.

I'm not sure if the Xilinx PCI core, or whatever PCI core you
are using, already has something like the mailboxes implemented,
but if not, it won't take much to code up some logic.
I prefer VHDL myself, but can speak Verilog if forced to :)

Cheers,
Dave
Grant Likely April 14, 2009, 9:45 p.m. UTC | #12
On Tue, Apr 14, 2009 at 3:23 PM, David Hawkins <dwh@ovro.caltech.edu> wrote:
> I'll let Ira update you on the patch status.
>
> If you want someone to chat about the hardware-level interaction,
> feel free to chat off-list - assuming of course that no one wants
> to hear us talk hardware :)
>
> I selected the MPC8349EA in-part due to its PCI mailboxes, so
> that we could implement this type of driver (an interlocked
> handshake for flow control of data transfers). The previous
> chip I'd used was a PLX PCI9054 Master/Target, and it has
> similar registers.
>
> I'm not sure if the Xilinx PCI core, or whatever PCI core you
> are using, already has something like the mailboxes implemented,
> but if not, it won't take much to code up some logic.
> I prefer VHDL myself, but can speak Verilog if forced to :)

Thanks David.  I haven't looked closely at the xilinx pci data sheet
yet, but I don't expect too many issues in this area.  As you say, it
won't take much to code it up.  I'll be poking my VHDL engineer to
make it do what I want it to.  :-)

I'll keep you up to date on my progress.

g.
David Hawkins April 14, 2009, 9:52 p.m. UTC | #13
Hi Grant,

> Thanks David.  I haven't looked closely at the xilinx pci data sheet
> yet, but I don't expect too many issues in this area.  As you say, it
> won't take much to code it up.  I'll be poking my VHDL engineer to
> make it do what I want it to.  :-)

The key aspects of the core will be that it is Master/Target
so that it can take over the PCI bus, and that it has a
DMA engine that can take care of most of the work. In
your case, since you have a DMA controller on the host
(MPC5200) and the target (Xilinx), your driver might end
up having nicer symmetry than our application. The
most efficient implementation will be the one that
uses PCI writes, i.e., MPC5200 DMAs to the Xilinx core,
and the Xilinx core DMAs to the MPC5200. If you use
a PCI Target only core, then the MPC5200 DMA controller
will have to do all the work, and read transfers might
be slightly less efficient.

Our target boards (PowerPC) live in compactPCI backplanes
and talk to x86 boards that do not have DMA controllers.
So the PCI target board DMA controllers are used to
transfer data efficiently to the x86 host (writes)
and less efficiently from the host to the boards
(reads). Our bandwidth requirements are 'to the host',
so we can live with the asymmetry in performance.

> I'll keep you up to date on my progress.

Sounds good.

Cheers,
Dave
Ira Snyder April 14, 2009, 9:53 p.m. UTC | #14
On Tue, Apr 14, 2009 at 02:28:26PM -0600, Grant Likely wrote:
> On Mon, Feb 23, 2009 at 6:00 PM, Ira Snyder <iws@ovro.caltech.edu> wrote:
> > This adds support to Linux for using virtio between two computers linked by
> > a PCI interface. This allows the use of virtio_net to create a familiar,
> > fast interface for communication. It should be possible to use other virtio
> > devices in the future, but this has not been tested.
> 
> Hey Ira,
> 
> I like this a lot.  I need to do much the same thing on one of my
> platforms, so I'm going to use your patch as my starting point.  Have
> you made many changes since you posted this version of your patch?
> I'd like to collaborate on the development and help to get it
> mainlined.
> 

This would be great. I'd really appreciate the help. I haven't had time
to make any changes since I last posted the patch. I started work on
converting all of the usage of struct vop_loc_* to just use the on-wire
structures, but I didn't get very far before other work got in the way.

> In my case I've got an MPC5200 as the 'host' and a Xilinx Virtex
> (ppc440) as the 'client'.  I intend set aside a region of the Xilinx
> Virtex's memory space for the shared queues.  I'm starting work on it
> now, and I'll provide you with feedback and/or patches as I make
> progress.
> 

I'm looking forward to seeing your implementation. If you have any
questions, I'd be happy to attempt to answer them :)

Ira
Grant Likely April 14, 2009, 10:16 p.m. UTC | #15
On Tue, Apr 14, 2009 at 3:52 PM, David Hawkins <dwh@ovro.caltech.edu> wrote:
> Hi Grant,
>
>> Thanks David.  I haven't looked closely at the xilinx pci data sheet
>> yet, but I don't expect too many issues in this area.  As you say, it
>> won't take much to code it up.  I'll be poking my VHDL engineer to
>> make it do what I want it to.  :-)
>
> The key aspects of the core will be that it is Master/Target
> so that it can take over the PCI bus, and that it has a
> DMA engine that can take care of most of the work. In
> your case, since you have a DMA controller on the host
> (MPC5200) and the target (Xilinx), your driver might end
> up having nicer symmetry than our application. The
> most efficient implementation will be the one that
> uses PCI writes, i.e., MPC5200 DMAs to the Xilinx core,
> and the Xilinx core DMAs to the MPC5200.

Hmmm, I hadn't thought about this.  I was intending to use the
Virtex's memory region for all virtio, but if I can allocate memory
regions on both sides of the PCI bus, then that may be best.

> If you use
> a PCI Target only core, then the MPC5200 DMA controller
> will have to do all the work, and read transfers might
> be slightly less efficient.

I'll definitely intend to enable master mode on the Xilinx PCI controller.

> Our target boards (PowerPC) live in compactPCI backplanes
> and talk to x86 boards that do not have DMA controllers.
> So the PCI target board DMA controllers are used to
> transfer data efficiently to the x86 host (writes)
> and less efficiently from the host to the boards
> (reads). Our bandwidth requirements are 'to the host',
> so we can live with the asymmetry in performance.

Fortunately I don't have very high bandwidth requirements for the
first spin, so I have some room to experiment.  :-)

g.
David Hawkins April 14, 2009, 10:27 p.m. UTC | #16
Hi Grant,

> Hmmm, I hadn't thought about this.  I was intending to use the
> Virtex's memory region for all virtio, but if I can allocate memory
> regions on both sides of the PCI bus, then that may be best.

Sounds like you can experiment and see what works best :)

>> If you use
>> a PCI Target only core, then the MPC5200 DMA controller
>> will have to do all the work, and read transfers might
>> be slightly less efficient.
> 
> I'll definitely intend to enable master mode on the Xilinx PCI controller.

Since you understand the lingo, you clearly understand
there are core differences :)

>> Our target boards (PowerPC) live in compactPCI backplanes
>> and talk to x86 boards that do not have DMA controllers.
>> So the PCI target board DMA controllers are used to
>> transfer data efficiently to the x86 host (writes)
>> and less efficiently from the host to the boards
>> (reads). Our bandwidth requirements are 'to the host',
>> so we can live with the asymmetry in performance.
> 
> Fortunately I don't have very high bandwidth requirements for the
> first spin, so I have some room to experiment.  :-)

Yes, in theory you have enough bandwidth ... then a
few features are added, the PCI core is not quite as
fast as advertised, etc etc :)

Cheers,
Dave
Grant Likely April 21, 2009, 6:09 a.m. UTC | #17
On Thu, Feb 26, 2009 at 3:49 PM, Ira Snyder <iws@ovro.caltech.edu> wrote:
> On Thu, Feb 26, 2009 at 09:37:14PM +0100, Arnd Bergmann wrote:
>> If the registers for setting up this window don't logically fit
>> into the same device as the one you already use, the cleanest
>> solution would be to have another device just for this and then
>> make a function call into that driver to set up the window.
>
> The registers are part of the board control registers. They don't fit at
> all in the message unit. Doing this in the bootloader seems like a
> logical place, but that would require any testers to flash a new U-Boot
> image into their mpc8349emds boards.

Alternately, the board platform code (arch/powerpc/platforms/83xx) is
an ideal place for 'fixups'.  ie. to setup things that the firmware
really should be do, but doesn't.

>> > Now, I wouldn't need to access these registers at all if the bootloader
>> > could handle it. I just don't know if it is possible to have Linux not
>> > use some memory that the bootloader allocated, other than with the
>> > mem=XXX trick, which I'm sure wouldn't be acceptable. I've just used
>> > regular RAM so this is portable to my custom board (mpc8349emds based)
>> > and a regular mpc8349emds. I didn't want to change anything board
>> > specific.
>> >
>> > I would love to have the bootloader allocate (or reserve somewhere in
>> > the memory map) 16K of RAM, and not be required to allocate it with
>> > dma_alloc_coherent(). It would save me plenty of headaches.
>>
>> I believe you can do that through the "memory" devices in the
>> device tree, by leaving out a small part of the description of
>> main memory, at putting it into the "reg" property of your own
>> device.
>>
>
> I'll explore this option. I didn't even know you could do this.  Is a
> driver that requires the trick acceptable for mainline inclusion? Just
> like setting up the 16K PCI window, this is very platform specific.

Yup.  You wouldn't even need to write any code to do this.  Just
reduce the memory node's RAM size listed in the .dts file by 16k and
add a 16K region to the reg property for the messaging region.

Speaking of which, the device tree changes should be adding 2 nodes; 1
node to describe the messaging unit, and 1 node to describe the virtio
instance.  The messaging unit is a general purpose piece of hardware,
so it is not appropriate to write a usage-specific device driver that
binds against it.  I'm kind of working on this right now, so I'll show
you what I mean in patch form when I actually get things running.

g.
Grant Likely June 11, 2009, 2:22 p.m. UTC | #18
On Tue, Apr 14, 2009 at 3:53 PM, Ira Snyder<iws@ovro.caltech.edu> wrote:
> On Tue, Apr 14, 2009 at 02:28:26PM -0600, Grant Likely wrote:
>> On Mon, Feb 23, 2009 at 6:00 PM, Ira Snyder <iws@ovro.caltech.edu> wrote:
>> > This adds support to Linux for using virtio between two computers linked by
>> > a PCI interface. This allows the use of virtio_net to create a familiar,
>> > fast interface for communication. It should be possible to use other virtio
>> > devices in the future, but this has not been tested.
>>
>> Hey Ira,
>>
>> I like this a lot.  I need to do much the same thing on one of my
>> platforms, so I'm going to use your patch as my starting point.  Have
>> you made many changes since you posted this version of your patch?
>> I'd like to collaborate on the development and help to get it
>> mainlined.
>>
>
> This would be great. I'd really appreciate the help. I haven't had time
> to make any changes since I last posted the patch. I started work on
> converting all of the usage of struct vop_loc_* to just use the on-wire
> structures, but I didn't get very far before other work got in the way.
>
>> In my case I've got an MPC5200 as the 'host' and a Xilinx Virtex
>> (ppc440) as the 'client'.  I intend set aside a region of the Xilinx
>> Virtex's memory space for the shared queues.  I'm starting work on it
>> now, and I'll provide you with feedback and/or patches as I make
>> progress.
>>
>
> I'm looking forward to seeing your implementation. If you have any
> questions, I'd be happy to attempt to answer them :)

Hey Ira,

I've been slowly hacking on your virtio-over-pci stuff.  I've got an
initial series of cleanup patches which address some of the comments
from this thread.  Before I send them to you, have you made any
changes on your end?  They likely won't apply without changes to the
core code, so I'd like to sync up with you first.

Cheers,
g.
Ira Snyder June 11, 2009, 3:10 p.m. UTC | #19
On Thu, Jun 11, 2009 at 08:22:54AM -0600, Grant Likely wrote:
> On Tue, Apr 14, 2009 at 3:53 PM, Ira Snyder<iws@ovro.caltech.edu> wrote:
> > On Tue, Apr 14, 2009 at 02:28:26PM -0600, Grant Likely wrote:
> >> On Mon, Feb 23, 2009 at 6:00 PM, Ira Snyder <iws@ovro.caltech.edu> wrote:
> >> > This adds support to Linux for using virtio between two computers linked by
> >> > a PCI interface. This allows the use of virtio_net to create a familiar,
> >> > fast interface for communication. It should be possible to use other virtio
> >> > devices in the future, but this has not been tested.
> >>
> >> Hey Ira,
> >>
> >> I like this a lot.  I need to do much the same thing on one of my
> >> platforms, so I'm going to use your patch as my starting point.  Have
> >> you made many changes since you posted this version of your patch?
> >> I'd like to collaborate on the development and help to get it
> >> mainlined.
> >>
> >
> > This would be great. I'd really appreciate the help. I haven't had time
> > to make any changes since I last posted the patch. I started work on
> > converting all of the usage of struct vop_loc_* to just use the on-wire
> > structures, but I didn't get very far before other work got in the way.
> >
> >> In my case I've got an MPC5200 as the 'host' and a Xilinx Virtex
> >> (ppc440) as the 'client'.  I intend set aside a region of the Xilinx
> >> Virtex's memory space for the shared queues.  I'm starting work on it
> >> now, and I'll provide you with feedback and/or patches as I make
> >> progress.
> >>
> >
> > I'm looking forward to seeing your implementation. If you have any
> > questions, I'd be happy to attempt to answer them :)
> 
> Hey Ira,
> 
> I've been slowly hacking on your virtio-over-pci stuff.  I've got an
> initial series of cleanup patches which address some of the comments
> from this thread.  Before I send them to you, have you made any
> changes on your end?  They likely won't apply without changes to the
> core code, so I'd like to sync up with you first.
> 

I haven't made any changes to the code since you've last seen it. I've
been busy with other stuff for quite a while now. I can't wait to see
what you've done :)

At least for the use of the 83xx DMA controller, the DMA_SLAVE mode
patch I posted up about a week ago (to the ppcdev list) could make the
DMA setup much simpler. It hasn't been accepted to mainline yet.

Ira
diff mbox

Patch

diff --git a/Documentation/virtio-over-PCI.txt b/Documentation/virtio-over-PCI.txt
new file mode 100644
index 0000000..e4520d4
--- /dev/null
+++ b/Documentation/virtio-over-PCI.txt
@@ -0,0 +1,60 @@ 
+The implementation of virtio-over-PCI was driven with the following goals:
+* Avoid MMIO reads, try to use only MMIO writes
+* Use the onboard DMA engine, for speed
+
+The implementation also borrows many of the details from the only other
+implementation, virtio_ring.
+
+It succeeds in avoiding all MMIO reads on the critical paths. I did not
+see any reason to avoid the use of MMIO reads during device probing, since
+it is not a critical path.
+
+=== Avoiding MMIO reads ===
+To avoid MMIO reads, both the host and guest systems have a copy of the
+descriptors. Both sides need to read the descriptors after they have been
+written, but only the host system writes to them. This allows us to keep a
+local copy for later use.
+
+=== Using the DMA engine ===
+This is the only truly complicated part of the system. Since this
+implementation was designed for use with virtio_net, it may be biased
+towards virtio_net's usage of the virtio interface.
+
+In merged rxbufs mode, the virtio_net driver provides a receive ring, which
+it fills with empty PAGE_SIZE buffers. The DMA code sets up transfers
+directly from the guest transmit queue to the empty packets in the host
+receive queue. Data transfer in the other direction works in a similar
+fashion.
+
+The guest (PowerPC) system keeps its own local set of descriptors, which are
+filled by the virtio add_buf() call. Whenever this happens, the avail ring is
+changed, and therefore we try to transfer data.
+
+The algorithm is essentially as follows:
+1) Check for an available local or remote entry
+2) Check that the other side has enough room for the packet
+3) Transfer the chain, joining small packets and splitting large packets
+4) Move the entries to the used rings, but do not update the used index
+5) Schedule a DMA callback to happen when the transfer completes
+6) Start the DMA transfer
+7) When the DMA finishes, the callback updates the used indices and
+   triggers any necessary callbacks
+
+The algorithm can only handle chains that are to be coalesced together. It
+puts all data sequentially into the PAGE_SIZE buffers exposed by the
+receiving side, including both the virtio_net header and packet data.
+
+=== Startup Sequence ===
+There are currently problems in the startup sequence between the host and
+guest drivers. The current scheme assumes that the guest is up and waiting
+before the host is ready. I am having a very hard time coming up with a scheme
+that is perfectly safe, where either side could win the race and be ready
+first.
+
+Even harder is a situation where you would like to use the "network device"
+from your bootloader to tftp a kernel, then boot Linux. In this case,
+Linux has no knowledge of where the device descriptors were before it booted.
+You'd need to stop and re-start the host driver to make sure it re-initializes
+the new descriptor memory after Linux has booted.
+
+This is a definite "needs work" item.
diff --git a/arch/powerpc/boot/dts/mpc834x_mds.dts b/arch/powerpc/boot/dts/mpc834x_mds.dts
index d9adba0..5c7617d 100644
--- a/arch/powerpc/boot/dts/mpc834x_mds.dts
+++ b/arch/powerpc/boot/dts/mpc834x_mds.dts
@@ -104,6 +104,13 @@ 
 			mode = "cpu";
 		};
 
+		message-unit@8030 {
+			compatible = "fsl,mpc8349-mu";
+			reg = <0x8030 0xd0>;
+			interrupts = <69 0x8>;
+			interrupt-parent = <&ipic>;
+		};
+
 		dma@82a8 {
 			#address-cells = <1>;
 			#size-cells = <1>;
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 3dd6294..efcf56b 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -33,3 +33,25 @@  config VIRTIO_BALLOON
 
 	 If unsure, say M.
 
+config VIRTIO_OVER_PCI_HOST
+	tristate "Virtio-over-PCI Host support (EXPERIMENTAL)"
+	depends on PCI && EXPERIMENTAL
+	select VIRTIO
+	---help---
+	  This driver provides the host support necessary for using virtio
+	  over the PCI bus with a Freescale MPC8349EMDS evaluation board.
+
+	  If unsure, say N.
+
+config VIRTIO_OVER_PCI_FSL
+	tristate "Virtio-over-PCI Guest support (EXPERIMENTAL)"
+	depends on MPC834x_MDS && EXPERIMENTAL
+	select VIRTIO
+	select DMA_ENGINE
+	select FSL_DMA
+	---help---
+	  This driver provides the guest support necessary for using virtio
+	  over the PCI bus.
+
+	  If unsure, say N.
+
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 6738c44..f31afaa 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -2,3 +2,5 @@  obj-$(CONFIG_VIRTIO) += virtio.o
 obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
 obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
+obj-$(CONFIG_VIRTIO_OVER_PCI_HOST) += vop_host.o
+obj-$(CONFIG_VIRTIO_OVER_PCI_FSL) += vop_fsl.o
diff --git a/drivers/virtio/vop.h b/drivers/virtio/vop.h
new file mode 100644
index 0000000..5f77228
--- /dev/null
+++ b/drivers/virtio/vop.h
@@ -0,0 +1,119 @@ 
+/*
+ * Virtio-over-PCI definitions
+ *
+ * Copyright (c) 2009 Ira W. Snyder <iws@ovro.caltech.edu>
+ *
+ * This file is licensed under the terms of the GNU General Public License
+ * version 2. This program is licensed "as is" without any warranty of any
+ * kind, whether express or implied.
+ */
+
+#ifndef VOP_H
+#define VOP_H
+
+#include <linux/types.h>
+
+/* The number of entries per ring (MUST be a power of two) */
+#define VOP_RING_SIZE		64
+
+/* Marks a buffer as continuing via the next field */
+#define VOP_DESC_F_NEXT		1
+/* Marks a buffer as write-only (otherwise read-only) */
+#define VOP_DESC_F_WRITE	2
+
+/* Interrupts should not be generated when adding to avail or used */
+#define VOP_F_NO_INTERRUPT	1
+
+/* Virtio-over-PCI descriptors: 12 bytes. These can chain together via "next" */
+struct vop_desc {
+	/* Address (host physical) */
+	__le32 addr;
+	/* Length (bytes) */
+	__le32 len;
+	/* Flags */
+	__le16 flags;
+	/* Chaining for descriptors */
+	__le16 next;
+} __attribute__((packed));
+
+/* Virtio-over-PCI used descriptor chains: 8 bytes */
+struct vop_used_elem {
+	/* Start index of used descriptor chain */
+	__le32 id;
+	/* Total length of the descriptor chain which was used (written to) */
+	__le32 len;
+} __attribute__((packed));
+
+/* The ring in host memory, only written by the guest */
+/* NOTE: with VOP_RING_SIZE == 64, this is 520 bytes */
+struct vop_host_ring {
+	/* The flags, so the guest can indicate that it doesn't want
+	 * interrupts when things are added to the avail ring */
+	__le16 flags;
+
+	/* The index, which points at the next slot where a chain index
+	 * will be added to the used ring */
+	__le16 used_idx;
+
+	/* The used ring */
+	struct vop_used_elem used[VOP_RING_SIZE];
+} __attribute__((packed));
+
+/* The ring in guest memory, only written by the host */
+/* NOTE: with VOP_RING_SIZE == 64, this is 904 bytes! */
+struct vop_guest_ring {
+	/* The descriptors */
+	struct vop_desc desc[VOP_RING_SIZE];
+
+	/* The flags, so the host can indicate that it doesn't want
+	 * interrupts when things are added to the used ring */
+	__le16 flags;
+
+	/* The index, which points at the next slot where a chain index
+	 * will be added to the avail ring */
+	__le16 avail_idx;
+
+	/* The avail ring */
+	__le16 avail[VOP_RING_SIZE];
+} __attribute__((packed));
+
+/*
+ * This is the status structure holding the virtio_device status
+ * as well as the feature bits for this device and the configuration
+ * space.
+ *
+ * NOTE: it is for the LOCAL device. This is the slow path, so
+ * NOTE: the mmio reads won't cause any speed problems
+ */
+struct vop_status {
+	/* Status bits for the device */
+	__le32 status;
+
+	/* Feature bits for the device (128 bits) */
+	__le32 features[4];
+
+	/* Configuration space (different for each device type) */
+	u8 config[1004];
+
+} __attribute__((packed));
+
+/*
+ * Layout in memory
+ *
+ * |--------------------------|
+ * | 0: local device status   |
+ * |--------------------------|
+ * | 1024: host/guest ring 1  |
+ * |--------------------------|
+ * | 2048: host/guest ring 2  |
+ * |--------------------------|
+ * | 3072: host/guest ring 3  |
+ * |--------------------------|
+ *
+ * Now, you have one of these for each virtio device, and
+ * then you're pretty much set. You can expose 16K of memory
+ * out on the bus (on each side) and have 4 virtio devices,
+ * each with a different type, and 3 virtqueues
+ */
+
+#endif /* VOP_H */
diff --git a/drivers/virtio/vop_fsl.c b/drivers/virtio/vop_fsl.c
new file mode 100644
index 0000000..7cb3cdd
--- /dev/null
+++ b/drivers/virtio/vop_fsl.c
@@ -0,0 +1,2020 @@ 
+/*
+ * Virtio-over-PCI MPC8349EMDS Guest Driver
+ *
+ * Copyright (c) 2009 Ira W. Snyder <iws@ovro.caltech.edu>
+ *
+ * This file is licensed under the terms of the GNU General Public License
+ * version 2. This program is licensed "as is" without any warranty of any
+ * kind, whether express or implied.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/of_platform.h>
+#include <linux/io.h>
+#include <linux/dma-mapping.h>
+#include <linux/virtio.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_net.h>
+#include <linux/interrupt.h>
+#include <linux/virtio_net.h>
+#include <linux/dmaengine.h>
+#include <linux/workqueue.h>
+#include <linux/etherdevice.h>
+
+/* MPC8349EMDS specific get_immrbase() */
+#include <sysdev/fsl_soc.h>
+
+#include "vop_hw.h"
+#include "vop.h"
+
+/*
+ * These are internal use only versions of the structures that
+ * are exported over PCI by this driver
+ *
+ * They are used internally to keep track of the PowerPC queues so that
+ * we don't have to keep flipping endianness all the time
+ */
+struct vop_loc_desc {
+	u32 addr;
+	u32 len;
+	u16 flags;
+	u16 next;
+};
+
+struct vop_loc_avail {
+	u16 index;
+	u16 ring[VOP_RING_SIZE];
+};
+
+struct vop_loc_used_elem {
+	u32 id;
+	u32 len;
+};
+
+struct vop_loc_used {
+	u16 index;
+	struct vop_loc_used_elem ring[VOP_RING_SIZE];
+};
+
+/*
+ * DMA Resolver state information
+ */
+struct vop_dma_info {
+	struct dma_chan *chan;
+
+	/* The currently processing avail entry */
+	u16 loc_avail;
+	u16 rem_avail;
+
+	/* The currently processing used entries */
+	u16 loc_used;
+	u16 rem_used;
+};
+
+struct vop_vq {
+
+	/* The actual virtqueue itself */
+	struct virtqueue vq;
+	struct device *dev;
+
+	/* The host ring address */
+	struct vop_host_ring __iomem *host;
+
+	/* The guest ring address */
+	struct vop_guest_ring *guest;
+
+	/* Our own memory descriptors */
+	struct vop_loc_desc desc[VOP_RING_SIZE];
+	struct vop_loc_avail avail;
+	struct vop_loc_used used;
+	unsigned int flags;
+
+	/* Data tokens from add_buf() */
+	void *data[VOP_RING_SIZE];
+
+	unsigned int num_free;	/* number of free descriptors in desc */
+	unsigned int free_head;	/* start of the free descriptors in desc */
+	unsigned int num_added;	/* number of entries added to desc */
+
+	u16 loc_last_used;	/* the last local used entry processed */
+	u16 rem_last_used;	/* the current value of remote used_idx */
+
+	/* DMA resolver state */
+	struct vop_dma_info dma;
+	struct work_struct work;
+	int (*resolve)(struct vop_vq *vq);
+
+	void __iomem *immr;
+	int kick_val;
+};
+
+/* Convert from a struct virtqueue to a struct vop_vq */
+#define to_vop_vq(X) container_of(X, struct vop_vq, vq)
+
+/*
+ * This represents a virtio_device for our driver. It follows the memory
+ * layout shown above. It has pointers to all of the host and guest memory
+ * areas that we need to access
+ */
+struct vop_vdev {
+
+	/* The specific virtio device (console, net, blk) */
+	struct virtio_device vdev;
+
+	#define VOP_DEVICE_REGISTERED 1
+	int status;
+
+	/* Start address of local and remote memory */
+	void *loc;
+	void __iomem *rem;
+
+	/*
+	 * These are the status, feature, and configuration information
+	 * for this virtio device. They are exposed in our memory block
+	 * starting at offset 0.
+	 */
+	struct vop_status __iomem *host_status;
+
+	/*
+	 * These are the status, feature, and configuration information
+	 * for the guest virtio device. They are exposed in the guest
+	 * memory block starting at offset 0.
+	 */
+	struct vop_status *guest_status;
+
+	/*
+	 * These are the virtqueues for the virtio driver running this
+	 * device to use. The host portions are exposed in our memory block
+	 * starting at offset 1024. The exposed areas are aligned to 1024 byte
+	 * boundaries, so they appear at offets 1024, 2048, and 3072
+	 * respectively.
+	 */
+	struct vop_vq virtqueues[3];
+};
+
+#define to_vop_vdev(X) container_of(X, struct vop_vdev, vdev)
+
+struct vop_dev {
+
+	struct of_device *op;
+	struct device *dev;
+
+	/* Reset and start */
+	struct mutex mutex;
+	struct work_struct reset_work;
+	struct work_struct start_work;
+
+	int irq;
+
+	/* Our board control registers */
+	void __iomem *immr;
+
+	/* The guest memory, exposed at PCI BAR1 */
+	#define VOP_GUEST_MEM_SIZE 16384
+	void *guest_mem;
+	dma_addr_t guest_mem_addr;
+
+	/* Host memory, given to us by host in OMR0 */
+	#define VOP_HOST_MEM_SIZE 16384
+	void __iomem *host_mem;
+
+	/* The virtio devices */
+	struct vop_vdev devices[4];
+	struct dma_chan *chan;
+};
+
+/*
+ * DMA callback information
+ */
+struct vop_dma_cbinfo {
+	struct vop_vq *vq;
+
+	/* The amount to increment the used rings */
+	unsigned int loc;
+	unsigned int rem;
+};
+
+static const char driver_name[] = "vdev";
+static struct kmem_cache *dma_cache;
+
+/*----------------------------------------------------------------------------*/
+/* Whole-descriptor access helpers                                            */
+/*----------------------------------------------------------------------------*/
+
+/*
+ * Return a copy of a local descriptor in native format for easy use
+ * of all fields
+ *
+ * @vq the virtqueue
+ * @idx the descriptor index
+ * @desc pointer to the structure to copy into
+ */
+static void vop_loc_desc(struct vop_vq *vq, unsigned int idx,
+			 struct vop_loc_desc *desc)
+{
+	BUG_ON(idx >= VOP_RING_SIZE);
+	BUG_ON(!desc);
+
+	desc->addr  = vq->desc[idx].addr;
+	desc->len   = vq->desc[idx].len;
+	desc->flags = vq->desc[idx].flags;
+	desc->next  = vq->desc[idx].next;
+}
+
+/*
+ * Return a copy of a remote descriptor in native format for easy use
+ * of all fields
+ *
+ * @vq the virtqueue
+ * @idx the descriptor index
+ * @desc pointer to the structure to copy into
+ */
+static void vop_rem_desc(struct vop_vq *vq, unsigned int idx,
+			 struct vop_loc_desc *desc)
+{
+	BUG_ON(idx >= VOP_RING_SIZE);
+	BUG_ON(!desc);
+
+	desc->addr  = le32_to_cpu(vq->guest->desc[idx].addr);
+	desc->len   = le32_to_cpu(vq->guest->desc[idx].len);
+	desc->flags = le16_to_cpu(vq->guest->desc[idx].flags);
+	desc->next  = le16_to_cpu(vq->guest->desc[idx].next);
+}
+
+/*----------------------------------------------------------------------------*/
+/* Local descriptor ring access helpers                                       */
+/*----------------------------------------------------------------------------*/
+
+static void vop_set_desc_addr(struct vop_vq *vq, unsigned int idx, u32 addr)
+{
+	vq->desc[idx].addr = addr;
+}
+
+static void vop_set_desc_len(struct vop_vq *vq, unsigned int idx, u32 len)
+{
+	vq->desc[idx].len = len;
+}
+
+static void vop_set_desc_flags(struct vop_vq *vq, unsigned int idx, u16 flags)
+{
+	vq->desc[idx].flags = flags;
+}
+
+static void vop_set_desc_next(struct vop_vq *vq, unsigned int idx, u16 next)
+{
+	vq->desc[idx].next = next;
+}
+
+static u16 vop_get_desc_flags(struct vop_vq *vq, unsigned int idx)
+{
+	return vq->desc[idx].flags;
+}
+
+static u16 vop_get_desc_next(struct vop_vq *vq, unsigned int idx)
+{
+	return vq->desc[idx].next;
+}
+
+/*----------------------------------------------------------------------------*/
+/* Status Helpers                                                             */
+/*----------------------------------------------------------------------------*/
+
+static u32 vop_get_host_status(struct vop_vdev *vdev)
+{
+	return ioread32(&vdev->host_status->status);
+}
+
+static u32 vop_get_host_features(struct vop_vdev *vdev)
+{
+	return ioread32(&vdev->host_status->features[0]);
+}
+
+static u16 vop_get_host_flags(struct vop_vq *vq)
+{
+	return le16_to_cpu(vq->guest->flags);
+}
+
+/*
+ * Set the guest's flags variable (lives in host memory)
+ */
+static void vop_set_guest_flags(struct vop_vq *vq, u16 flags)
+{
+	iowrite16(flags, &vq->host->flags);
+}
+
+/*----------------------------------------------------------------------------*/
+/* Remote Ring Debugging Helpers                                              */
+/*----------------------------------------------------------------------------*/
+
+#ifdef DEBUG_DUMP_RINGS
+static void dump_rem_desc(struct vop_vq *vq)
+{
+	struct vop_loc_desc desc;
+	int i;
+
+	dev_dbg(vq->dev, "REM DESC 0xADDRESSX LENGTH 0xFLAG NEXT\n");
+	for (i = 0; i < VOP_RING_SIZE; i++) {
+		vop_rem_desc(vq, i, &desc);
+		dev_dbg(vq->dev, "DESC %.2d: 0x%.8x %.6d 0x%.4x %.2d\n",
+				i, desc.addr, desc.len, desc.flags, desc.next);
+	}
+}
+
+static void dump_rem_avail(struct vop_vq *vq)
+{
+	int i;
+
+	dev_dbg(vq->dev, "REM AVAIL IDX %.2d\n", le16_to_cpu(vq->guest->avail_idx));
+	for (i = 0; i < VOP_RING_SIZE; i++) {
+		dev_dbg(vq->dev, "REM AVAIL %.2d: %.2d\n",
+				i, le16_to_cpu(vq->guest->avail[i]));
+	}
+}
+
+static void dump_rem_used(struct vop_vq *vq)
+{
+	int i;
+
+	dev_dbg(vq->dev, "REM USED IDX %.2d\n", ioread16(&vq->host->used_idx));
+	for (i = 0; i < VOP_RING_SIZE; i++) {
+		dev_dbg(vq->dev, "REM USED %.2d: %.2d %.6d\n", i,
+				ioread32(&vq->host->used[i].id),
+				ioread32(&vq->host->used[i].len));
+	}
+}
+
+static void dump_rem_rings(struct vop_vq *vq)
+{
+	dump_rem_desc(vq);
+	dump_rem_avail(vq);
+	dump_rem_used(vq);
+}
+
+/*----------------------------------------------------------------------------*/
+/* Local Ring Debugging Helpers                                               */
+/*----------------------------------------------------------------------------*/
+
+static void dump_loc_desc(struct vop_vq *vq)
+{
+	struct vop_loc_desc desc;
+	int i;
+
+	dev_dbg(vq->dev, "LOC DESC 0xADDRESSX LENGTH 0xFLAG NEXT\n");
+	for (i = 0 ; i < VOP_RING_SIZE; i++) {
+		vop_loc_desc(vq, i, &desc);
+		dev_dbg(vq->dev, "DESC %.2d: 0x%.8x %.6d 0x%.4x %.2d\n",
+				i, desc.addr, desc.len, desc.flags, desc.next);
+	}
+}
+
+static void dump_loc_avail(struct vop_vq *vq)
+{
+	int i;
+
+	dev_dbg(vq->dev, "LOC AVAIL IDX %.2d\n", vq->avail.index);
+	for (i = 0; i < VOP_RING_SIZE; i++)
+		dev_dbg(vq->dev, "LOC AVAIL %.2d: %.2d\n", i, vq->avail.ring[i]);
+}
+
+static void dump_loc_used(struct vop_vq *vq)
+{
+	int i;
+
+	dev_dbg(vq->dev, "LOC USED IDX %.2hu\n", vq->used.index);
+	for (i = 0; i < VOP_RING_SIZE; i++) {
+		dev_dbg(vq->dev, "LOC USED %.2d: %.2d %.6d\n", i,
+				vq->used.ring[i].id, vq->used.ring[i].len);
+	}
+}
+
+static void dump_loc_rings(struct vop_vq *vq)
+{
+	dump_loc_desc(vq);
+	dump_loc_avail(vq);
+	dump_loc_used(vq);
+}
+
+static void debug_dump_rings(struct vop_vq *vq, const char *msg)
+{
+	dev_dbg(vq->dev, "\n");
+	dev_dbg(vq->dev, "%s\n", msg);
+	dump_loc_rings(vq);
+	dump_rem_rings(vq);
+	dev_dbg(vq->dev, "\n");
+}
+#else
+static void debug_dump_rings(struct vop_vq *vq, const char *msg)
+{
+	/* Nothing */
+}
+#endif
+
+/*----------------------------------------------------------------------------*/
+/* Scatterlist DMA helpers                                                    */
+/*----------------------------------------------------------------------------*/
+
+/*
+ * This function abuses some of the scatterlist code and implements
+ * dma_map_sg() in such a way that we don't need to keep the scatterlist
+ * around in order to unmap it.
+ *
+ * It is also designed to never merge scatterlist entries, which is
+ * never what we want for virtio.
+ *
+ * When it is time to unmap the buffer, you can use dma_unmap_single() to
+ * unmap each entry in the chain. Get the address, length, and direction
+ * from the descriptors! (keep a local copy for speed)
+ */
+static int vop_dma_map_sg(struct device *dev, struct scatterlist sg[],
+			  unsigned int out, unsigned int in)
+{
+	dma_addr_t addr;
+	enum dma_data_direction dir;
+	struct scatterlist *start;
+	unsigned int i, failure;
+
+	start = sg;
+
+	for (i = 0; i < out + in; i++) {
+
+		/* Check for scatterlist chaining abuse */
+		BUG_ON(sg == NULL);
+
+		dir = (i < out) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
+		addr = dma_map_single(dev, sg_virt(sg), sg->length, dir);
+
+		if (dma_mapping_error(dev, addr))
+			goto unwind;
+
+		sg_dma_address(sg) = addr;
+		sg = sg_next(sg);
+	}
+
+	return 0;
+
+unwind:
+	failure = i;
+	sg = start;
+
+	for (i = 0; i < failure; i++) {
+		dir = (i < out) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
+		addr = sg_dma_address(sg);
+
+		dma_unmap_single(dev, addr, sg->length, dir);
+		sg = sg_next(sg);
+	}
+
+	return -ENOMEM;
+}
+
+/*----------------------------------------------------------------------------*/
+/* DMA Helpers                                                                */
+/*----------------------------------------------------------------------------*/
+
+/*
+ * Transfer data between two physical addresses with DMA
+ *
+ * NOTE: does not automatically unmap the src and dst addresses
+ *
+ * @chan the channel to use
+ * @dst the physical destination address
+ * @src the physical source address
+ * @len the length to transfer (in bytes)
+ * @return a valid cookie, or -ERRNO
+ */
+static dma_cookie_t dma_async_memcpy_raw_to_raw(struct dma_chan *chan,
+					       dma_addr_t dst,
+					       dma_addr_t src,
+					       size_t len)
+{
+	struct dma_device *dev = chan->device;
+	struct dma_async_tx_descriptor *tx;
+	enum dma_ctrl_flags flags;
+	dma_cookie_t cookie;
+	int cpu;
+
+	flags = DMA_COMPL_SKIP_SRC_UNMAP | DMA_COMPL_SKIP_DEST_UNMAP;
+	tx = dev->device_prep_dma_memcpy(chan, dst, src, len, flags);
+	if (!tx)
+		return -ENOMEM;
+
+	tx->callback = NULL;
+	cookie = tx->tx_submit(tx);
+
+	cpu = get_cpu();
+	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
+	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
+	put_cpu();
+
+	return cookie;
+}
+
+/*
+ * Trigger an interrupt after all DMA issued up to this point
+ * have been processed
+ *
+ * @chan the channel to use
+ * @callback the function to call (must not sleep)
+ * @data the data to send to the callback
+ *
+ * @return a valid cookie, or -ERRNO
+ */
+static dma_cookie_t dma_async_interrupt(struct dma_chan *chan,
+					dma_async_tx_callback callback,
+					void *data)
+{
+	struct dma_device *dev = chan->device;
+	struct dma_async_tx_descriptor *tx;
+
+	/* Set up the DMA */
+	tx = dev->device_prep_dma_interrupt(chan, DMA_PREP_INTERRUPT);
+	if (!tx)
+		return -ENOMEM;
+
+	tx->callback = callback;
+	tx->callback_param = data;
+
+	return tx->tx_submit(tx);
+}
+
+/*----------------------------------------------------------------------------*/
+/* DMA Resolver                                                               */
+/*----------------------------------------------------------------------------*/
+
+static void vop_remote_used_changed(struct vop_vq *vq)
+{
+	if (!(vop_get_host_flags(vq) & VOP_F_NO_INTERRUPT)) {
+		dev_dbg(vq->dev, "notifying the host (new buffers in used)\n");
+		iowrite32(vq->kick_val, vq->immr + ODR_OFFSET);
+	}
+}
+
+static void vop_local_used_changed(struct vop_vq *vq)
+{
+	if (!(vq->flags & VOP_F_NO_INTERRUPT)) {
+		dev_dbg(vq->dev, "notifying self (new buffers in used)\n");
+		vq->vq.callback(&vq->vq);
+	}
+}
+
+/*
+ * DMA callback function for merged rxbufs
+ *
+ * This is called every time a DMA transfer completes, and will update the
+ * indices in the local and remote used rings, then notify both sides that
+ * their used ring has changed
+ *
+ * You must be sure that the data was actually written to the used rings before
+ * this function is called
+ */
+static void dma_callback(void *data)
+{
+	struct vop_dma_cbinfo *cb = data;
+	struct vop_vq *vq = cb->vq;
+
+	dev_dbg(vq->dev, "%s: vq %p loc %d rem %d\n", __func__, vq, cb->loc, cb->rem);
+
+	/* Write the local used index */
+	vq->used.index += cb->loc;
+
+	/* Write the remote used index */
+	vq->rem_last_used += cb->rem;
+	iowrite16(vq->rem_last_used, &vq->host->used_idx);
+
+	/* Make sure the indices are written before triggering callbacks */
+	wmb();
+
+	/* Trigger the local used callback */
+	dev_dbg(vq->dev, "local used changed, running callback\n");
+	vop_local_used_changed(vq);
+
+	/* Trigger the remote used callback */
+	dev_dbg(vq->dev, "remote used changed, running callback\n");
+	vop_remote_used_changed(vq);
+
+	/* Free the callback data */
+	kmem_cache_free(dma_cache, cb);
+}
+
+/*
+ * Take an entry from the local avail ring and add it to the local
+ * used ring with the given length
+ *
+ * NOTE: does not update the used index
+ *
+ * @vq the virtqueue
+ * @avail_idx the index in the avail ring to take the entry from
+ * @used_idx the index in the used ring to put the entry
+ * @used_len the length used
+ */
+static void vop_loc_avail_to_used(struct vop_vq *vq, unsigned int avail_idx,
+				  unsigned int used_idx, u32 used_len)
+{
+	u16 id;
+
+	/* Make sure the indices are inside the rings */
+	avail_idx &= (VOP_RING_SIZE - 1);
+	used_idx  &= (VOP_RING_SIZE - 1);
+
+	/* Get the index stored in the avail ring */
+	id = vq->avail.ring[avail_idx];
+
+	/* Copy the index and length to the used ring */
+	vq->used.ring[used_idx].id = id;
+	vq->used.ring[used_idx].len = used_len;
+}
+
+/*
+ * Take an entry from the remote avail ring and add it to the remote
+ * used ring with the given length
+ *
+ * NOTE: does not update the used index
+ *
+ * @vq the virtqueue
+ * @avail_idx the index in the avail ring to take the entry from
+ * @used_idx the index in the used ring to put the entry
+ * @used_len the length used
+ */
+static void vop_rem_avail_to_used(struct vop_vq *vq, unsigned int avail_idx,
+				  unsigned int used_idx, u32 used_len)
+{
+	u16 id;
+
+	/* Make sure the indices are inside the rings */
+	avail_idx &= (VOP_RING_SIZE - 1);
+	used_idx  &= (VOP_RING_SIZE - 1);
+
+	/* Get the index stored in the avail ring */
+	id = le16_to_cpu(vq->guest->avail[avail_idx]);
+
+	/* Copy the index and length to the used ring */
+	iowrite32(id, &vq->host->used[used_idx].id);
+	iowrite32(used_len, &vq->host->used[used_idx].len);
+}
+
+/*
+ * Return the number of entries available in the local avail ring
+ */
+static unsigned int loc_num_avail(struct vop_vq *vq)
+{
+	return vq->avail.index - vq->dma.loc_avail;
+}
+
+/*
+ * Return the number of entries available in the remote avail ring
+ */
+static unsigned int rem_num_avail(struct vop_vq *vq)
+{
+	return le16_to_cpu(vq->guest->avail_idx) - vq->dma.rem_avail;
+}
+
+/*
+ * Return a descriptor id from the local avail ring
+ *
+ * @vq the virtqueue
+ * @idx the index to return the id from
+ */
+static u16 vop_loc_avail_id(struct vop_vq *vq, unsigned int idx)
+{
+	idx &= (VOP_RING_SIZE - 1);
+	return vq->avail.ring[idx];
+}
+
+/*
+ * Return a descriptor id from the remote avail ring
+ *
+ * @vq the virtqueue
+ * @idx the index to return the id from
+ */
+static u16 vop_rem_avail_id(struct vop_vq *vq, unsigned int idx)
+{
+	idx &= (VOP_RING_SIZE - 1);
+	return le16_to_cpu(vq->guest->avail[idx]);
+}
+
+/*----------------------------------------------------------------------------*/
+/* Extra helpers for mergeable DMA                                            */
+/*----------------------------------------------------------------------------*/
+
+/*
+ * TODO: the number of bytes being transmitted could be added to the avail
+ * TODO: ring, rather than just an index. I'm not sure it would make much
+ * TODO: difference, though.
+ */
+
+/*
+ * Calculate the number of bytes used in a local descriptor chain
+ *
+ * @vq the virtqueue
+ * @idx the start descriptor index
+ * @return the number of bytes
+ */
+static unsigned int loc_num_bytes(struct vop_vq *vq, unsigned int idx)
+{
+	struct vop_loc_desc desc;
+	unsigned int bytes = 0;
+
+	while (true) {
+		vop_loc_desc(vq, idx, &desc);
+		bytes += desc.len;
+
+		if (!(desc.flags & VOP_DESC_F_NEXT))
+			break;
+
+		idx = desc.next;
+	}
+
+	return bytes;
+}
+
+/*
+ * Calculate the number of bytes used in a remote descriptor chain
+ *
+ * @vq the virtqueue
+ * @idx the start descriptor index
+ * @return the number of bytes
+ */
+static unsigned int rem_num_bytes(struct vop_vq *vq, unsigned int idx)
+{
+	struct vop_loc_desc desc;
+	unsigned int bytes = 0;
+
+	while (true) {
+		vop_rem_desc(vq, idx, &desc);
+		bytes += desc.len;
+
+		if (!(desc.flags & VOP_DESC_F_NEXT))
+			break;
+
+		idx = desc.next;
+	}
+
+	return bytes;
+}
+
+/*
+ * Transmit the next local available entry to the remote side, splitting
+ * up the local descriptor as needed
+ *
+ * This routine makes the following assumptions:
+ * 1) The header already has the correct number of buffers set
+ * 2) The available buffers are all PAGE_SIZE
+ */
+static int vop_dma_xmit(struct vop_vq *vq)
+{
+	struct vop_dma_info *dma = &vq->dma;
+	struct dma_chan *chan = dma->chan;
+	dma_cookie_t cookie;
+
+	unsigned int loc_idx, rem_idx;
+	struct vop_loc_desc loc, rem;
+
+	struct vop_dma_cbinfo *cb;
+	dma_addr_t src, dst;
+	size_t len;
+
+	unsigned int loc_total = 0;
+	unsigned int rem_total = 0;
+	unsigned int bufs_used = 0;
+
+	/* Check that there is a local descriptor available */
+	if (!loc_num_avail(vq)) {
+		dev_dbg(vq->dev, "No local descriptors available\n");
+		return -ENOSPC;
+	}
+
+	/* Get the starting entry from each available ring */
+	loc_idx = vop_loc_avail_id(vq, dma->loc_avail);
+	rem_idx = vop_rem_avail_id(vq, dma->rem_avail);
+
+	dev_dbg(vq->dev, "rem_avail %d loc_num_bytes %d\n", rem_num_avail(vq), loc_num_bytes(vq, loc_idx));
+
+	/* Check that there are enough remote buffers available */
+	if (rem_num_avail(vq) * PAGE_SIZE < loc_num_bytes(vq, loc_idx)) {
+		dev_dbg(vq->dev, "Insufficient remote descriptors available\n");
+		return -ENOSPC;
+	}
+
+	/* Allocate DMA callback data */
+	cb = kmem_cache_alloc(dma_cache, GFP_KERNEL);
+	if (!cb) {
+		dev_dbg(vq->dev, "Unable to allocate DMA callback data\n");
+		return -ENOMEM;
+	}
+
+	/* Load the starting descriptors */
+	vop_loc_desc(vq, loc_idx, &loc);
+	vop_rem_desc(vq, rem_idx, &rem);
+
+	while (true) {
+
+		dst = rem.addr + 0x80000000;
+		src = loc.addr;
+		len = min(loc.len, rem.len);
+
+		dev_dbg(vq->dev, "DMA xmit dst %.8x src %.8x len %d\n", dst, src, len);
+		cookie = dma_async_memcpy_raw_to_raw(chan, dst, src, len);
+		if (dma_submit_error(cookie)) {
+			dev_err(vq->dev, "DMA submit error\n");
+			goto out_free_cb;
+		}
+
+		loc.len -= len;
+		rem.len -= len;
+		loc.addr += len;
+		rem.addr += len;
+
+		loc_total += len;
+		rem_total += len;
+
+		dev_dbg(vq->dev, "loc.len %d rem.len %d\n", loc.len, rem.len);
+		dev_dbg(vq->dev, "loc.addr %.8x rem.addr %.8x\n", loc.addr, rem.addr);
+		dev_dbg(vq->dev, "loc_total %d rem_total %d\n", loc_total, rem_total);
+
+		if (loc.len == 0) {
+			dev_dbg(vq->dev, "local: descriptor depleted, loading next\n");
+
+			if (!(loc.flags & VOP_DESC_F_NEXT)) {
+				dev_dbg(vq->dev, "local: no next descriptor, chain finished\n");
+				break;
+			}
+
+			dev_dbg(vq->dev, "local: fetching next descriptor\n");
+			loc_idx = loc.next;
+			vop_loc_desc(vq, loc_idx, &loc);
+		}
+
+		if (rem.len == 0) {
+			dev_dbg(vq->dev, "remote: descriptor depleted, adding to used\n");
+			vop_rem_avail_to_used(vq, dma->rem_avail + bufs_used, dma->rem_used + bufs_used, rem_total);
+			bufs_used++;
+
+			dev_dbg(vq->dev, "remote: fetching next descriptor\n");
+			rem_idx = vop_rem_avail_id(vq, dma->rem_avail + bufs_used);
+			vop_rem_desc(vq, rem_idx, &rem);
+			rem_total = 0;
+		}
+	}
+
+	/* Add the last remote descriptor to the used ring */
+	BUG_ON(rem_total == 0);
+	dev_dbg(vq->dev, "adding last remote descriptor to used ring\n");
+	vop_rem_avail_to_used(vq, dma->rem_avail + bufs_used, dma->rem_used + bufs_used, rem_total);
+	bufs_used++;
+
+	/* Add the local descriptor to the sude ring */
+	dev_dbg(vq->dev, "adding only local descriptor to used ring\n");
+	vop_loc_avail_to_used(vq, dma->loc_avail, dma->loc_used, loc_total);
+
+	/* Make very sure that everything written to the rings actually happened
+	 * bofer the DMA callback can be triggered */
+	wmb();
+
+	/* Set up the DMA callback information */
+	cb->vq = vq;
+	cb->loc = 1;
+	cb->rem = bufs_used;
+
+	dev_dbg(vq->dev, "setup DMA callback vq %p loc %d rem %d\n", vq, 1, bufs_used);
+
+	/* Trigger an interrupt when the DMA completes to update the used
+	 * indices and trigger the necessary callbacks */
+	cookie = dma_async_interrupt(chan, dma_callback, cb);
+	if (dma_submit_error(cookie)) {
+		dev_err(vq->dev, "DMA interrupt submit error\n");
+		goto out_free_cb;
+	}
+
+	/* Everything was successful, so update the DMA resolver's state */
+	dma->loc_avail++;
+	dma->rem_avail += bufs_used;
+	dma->loc_used++;
+	dma->rem_used += bufs_used;
+
+	/* Start the DMA */
+	dev_dbg(vq->dev, "DMA xmit setup successful, starting\n");
+	dma_async_memcpy_issue_pending(chan);
+
+	return 0;
+
+out_free_cb:
+	kmem_cache_free(dma_cache, cb);
+	return -ENOMEM;
+}
+
+/*
+ * Receive the next remote available entry to the local side, splitting
+ * up the remote descriptor as needed
+ *
+ * This routine makes the following assumptions:
+ * 1) The header already has the correct number of buffers set
+ * 2) The available buffers are all PAGE_SIZE
+ */
+static int vop_dma_recv(struct vop_vq *vq)
+{
+	struct vop_dma_info *dma = &vq->dma;
+	struct dma_chan *chan = dma->chan;
+	dma_cookie_t cookie;
+
+	unsigned int loc_idx, rem_idx;
+	struct vop_loc_desc loc, rem;
+
+	struct vop_dma_cbinfo *cb;
+	dma_addr_t src, dst;
+	size_t len;
+
+	unsigned int loc_total = 0;
+	unsigned int rem_total = 0;
+	unsigned int bufs_used = 0;
+
+	/* Check that there is a remote descriptor available */
+	if (!rem_num_avail(vq)) {
+		dev_dbg(vq->dev, "No remote descriptors available\n");
+		return -ENOSPC;
+	}
+
+	/* Get the starting entry from each available ring */
+	loc_idx = vop_loc_avail_id(vq, dma->loc_avail);
+	rem_idx = vop_rem_avail_id(vq, dma->rem_avail);
+
+	/* Check that there are enough local buffers available */
+	if (loc_num_avail(vq) * PAGE_SIZE < rem_num_bytes(vq, rem_idx)) {
+		dev_dbg(vq->dev, "Insufficient local descriptors available\n");
+		return -ENOSPC;
+	}
+
+	/* Allocate DMA callback data */
+	cb = kmem_cache_alloc(dma_cache, GFP_KERNEL);
+	if (!cb) {
+		dev_dbg(vq->dev, "Unable to allocate DMA callback data\n");
+		return -ENOMEM;
+	}
+
+	/* Load the starting descriptors */
+	vop_loc_desc(vq, loc_idx, &loc);
+	vop_rem_desc(vq, rem_idx, &rem);
+
+	while (true) {
+
+		dst = loc.addr;
+		src = rem.addr + 0x80000000;
+		len = min(loc.len, rem.len);
+
+		dev_dbg(vq->dev, "DMA recv dst %.8x src %.8x len %d\n", dst, src, len);
+		cookie = dma_async_memcpy_raw_to_raw(chan, dst, src, len);
+		if (dma_submit_error(cookie)) {
+			dev_err(vq->dev, "DMA submit error\n");
+			goto out_free_cb;
+		}
+
+		loc.len -= len;
+		rem.len -= len;
+		loc.addr += len;
+		rem.addr += len;
+
+		loc_total += len;
+		rem_total += len;
+
+		if (rem.len == 0) {
+			if (!(rem.flags & VOP_DESC_F_NEXT))
+				break;
+
+			rem_idx = rem.next;
+			vop_rem_desc(vq, rem_idx, &rem);
+		}
+
+		if (loc.len == 0) {
+			vop_loc_avail_to_used(vq, dma->loc_avail + bufs_used, dma->loc_used + bufs_used, loc_total);
+			bufs_used++;
+
+			loc_idx = vop_loc_avail_id(vq, dma->loc_avail + bufs_used);
+			vop_loc_desc(vq, loc_idx, &loc);
+			loc_total = 0;
+		}
+	}
+
+	/* Add the last local descriptor to the used ring */
+	BUG_ON(loc_total == 0);
+	vop_loc_avail_to_used(vq, dma->loc_avail + bufs_used, dma->loc_used + bufs_used, loc_total);
+	bufs_used++;
+
+	/* Add the remote descriptor to the used ring */
+	vop_rem_avail_to_used(vq, dma->rem_avail, dma->rem_used, rem_total);
+
+	/* Make very sure that everything written to the rings actually happened
+	 * before the DMA callback can be triggered */
+	wmb();
+
+	/* Set up the DMA callback information */
+	cb->vq = vq;
+	cb->loc = bufs_used;
+	cb->rem = 1;
+
+	/* Trigger an interrupt when the DMA completes to update the used
+	 * indices and trigger the necessary callbacks */
+	cookie = dma_async_interrupt(chan, dma_callback, cb);
+	if (dma_submit_error(cookie)) {
+		dev_err(vq->dev, "DMA interrupt submit error\n");
+		goto out_free_cb;
+	}
+
+	/* Everything was successful, so update the DMA resolver's state */
+	dma->loc_avail += bufs_used;
+	dma->rem_avail++;
+	dma->loc_used += bufs_used;
+	dma->rem_used++;
+
+	/* Start the DMA */
+	dev_dbg(vq->dev, "DMA recv setup successful, starting\n");
+	dma_async_memcpy_issue_pending(chan);
+
+	return 0;
+
+out_free_cb:
+	kmem_cache_free(dma_cache, cb);
+	return -ENOMEM;
+}
+
+/*----------------------------------------------------------------------------*/
+/* Virtqueue Ops Infrastructure                                               */
+/*----------------------------------------------------------------------------*/
+
+/*
+ * Modify the struct virtio_net_hdr_mrg_rxbuf's num_buffers field to account
+ * for the split that will happen in the DMA xmit routine
+ *
+ * This assumes that both sides have the same PAGE_SIZE
+ */
+static void vop_fixup_vnet_mrg_hdr(struct scatterlist sg[], unsigned int out)
+{
+	struct virtio_net_hdr *hdr;
+	struct virtio_net_hdr_mrg_rxbuf *mhdr;
+	unsigned int bytes = 0;
+
+	/* There must be a header + data, at the least */
+	BUG_ON(out < 2);
+
+	/* The first entry must be the structure */
+	BUG_ON(sg->length != sizeof(struct virtio_net_hdr_mrg_rxbuf));
+
+	hdr = sg_virt(sg);
+	mhdr = sg_virt(sg);
+
+	/* We merge buffers together, so just count up the number of bytes
+	 * needed, then figure out how many pages that will be */
+	for (/* none */; out; out--, sg = sg_next(sg))
+		bytes += sg->length;
+
+	/* Of course, nobody ever imagined that we might actually use
+	 * this on machines with different endianness...
+	 *
+	 * We force little-endian for now, since that's what our host is */
+	mhdr->num_buffers = cpu_to_le16(DIV_ROUND_UP(bytes, PAGE_SIZE));
+
+	/* Might as well fix up the other fields while we're at it */
+	hdr->hdr_len = cpu_to_le16(hdr->hdr_len);
+	hdr->gso_size = cpu_to_le16(hdr->gso_size);
+	hdr->csum_start = cpu_to_le16(hdr->csum_start);
+	hdr->csum_offset = cpu_to_le16(hdr->csum_offset);
+}
+
+/*
+ * Add a buffer to our local descriptors and the local avail ring
+ *
+ * NOTE: there hasn't been any transfer yet, just adding to local
+ * NOTE: rings. The kick() will process any DMA that needs to happen
+ *
+ * @return 0 on success, -ERRNO otherwise
+ */
+static int vop_add_buf(struct virtqueue *_vq, struct scatterlist sg[],
+		       unsigned int out, unsigned int in, void *data)
+{
+	/* For now, we'll just add the buffers to our local descriptors and
+	 * avail ring */
+	struct vop_vq *vq = to_vop_vq(_vq);
+	unsigned int i, avail, head, uninitialized_var(prev);
+
+	BUG_ON(data == NULL);
+	BUG_ON(out + in == 0);
+
+	/* Make sure we have space for this to succeed */
+	if (vq->num_free < out + in) {
+		dev_dbg(vq->dev, "No free space left: len=%d free=%d\n",
+				out + in, vq->num_free);
+		return -ENOSPC;
+	}
+
+	/* If this is an xmit buffer from virtio_net, fixup the header */
+	if (out > 1) {
+		dev_dbg(vq->dev, "Fixing up virtio_net header\n");
+		vop_fixup_vnet_mrg_hdr(sg, out);
+	}
+
+	head = vq->free_head;
+
+	/* DMA map the scatterlist */
+	if (vop_dma_map_sg(vq->dev, sg, out, in)) {
+		dev_err(vq->dev, "Failed to DMA map scatterlist\n");
+		return -ENOMEM;
+	}
+
+	/* We're about to use some buffers from the free list */
+	vq->num_free -= out + in;
+
+	for (i = vq->free_head; out; i = vop_get_desc_next(vq, i), out--) {
+		vop_set_desc_flags(vq, i, VOP_DESC_F_NEXT);
+		vop_set_desc_addr(vq, i, sg_dma_address(sg));
+		vop_set_desc_len(vq, i, sg->length);
+
+		prev = i;
+		sg = sg_next(sg);
+	}
+
+	for (/* none */; in; i = vop_get_desc_next(vq, i), in--) {
+		vop_set_desc_flags(vq, i, VOP_DESC_F_NEXT | VOP_DESC_F_WRITE);
+		vop_set_desc_addr(vq, i, sg_dma_address(sg));
+		vop_set_desc_len(vq, i, sg->length);
+
+		prev = i;
+		sg = sg_next(sg);
+	}
+
+	/* Last one doesn't continue */
+	vop_set_desc_flags(vq, prev, vop_get_desc_flags(vq, prev) & ~VOP_DESC_F_NEXT);
+
+	/* Update the free pointer */
+	vq->free_head = i;
+
+	/* Set token */
+	vq->data[head] = data;
+
+	/* Add an entry for the head of the chain into the avail array, but
+	 * don't update avail->idx until kick() */
+	avail = (vq->avail.index + vq->num_added++) & (VOP_RING_SIZE - 1);
+	vq->avail.ring[avail] = head;
+
+	dev_dbg(vq->dev, "Added buffer head %i to %p\n", head, vq);
+	debug_dump_rings(vq, "Added buffer(s), dumping rings");
+
+	return 0;
+}
+
+static inline bool loc_more_used(const struct vop_vq *vq)
+{
+	return vq->loc_last_used != vq->used.index;
+}
+
+static void detach_buf(struct vop_vq *vq, unsigned int head)
+{
+	dma_addr_t addr;
+	unsigned int idx, len;
+	enum dma_data_direction dir;
+	struct vop_loc_desc desc;
+
+	/* Clear data pointer */
+	vq->data[head] = NULL;
+
+	/* Put the chain back on the free list, unmapping as we go */
+	idx = head;
+	while (true) {
+		vop_loc_desc(vq, idx, &desc);
+
+		addr = desc.addr;
+		len  = desc.len;
+		dir  = (desc.flags & VOP_DESC_F_WRITE) ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
+
+		/* Unmap the entry */
+		dma_unmap_single(vq->dev, addr, len, dir);
+		vq->num_free++;
+
+		/* If there is no next descriptor, we're done */
+		if (!(desc.flags & VOP_DESC_F_NEXT))
+			break;
+
+		idx = desc.next;
+	}
+
+	vop_set_desc_next(vq, idx, vq->free_head);
+	vq->free_head = head;
+}
+
+/*
+ * Get a buffer from the used ring
+ *
+ * @return the data token given to add_buf(), or NULL if there
+ *         are no remaining buffers
+ */
+static void *vop_get_buf(struct virtqueue *_vq, unsigned int *len)
+{
+	struct vop_vq *vq = to_vop_vq(_vq);
+	unsigned int head, used;
+	void *ret;
+
+	if (!loc_more_used(vq)) {
+		dev_dbg(vq->dev, "No more buffers in queue\n");
+		return NULL;
+	}
+
+	used = vq->loc_last_used & (VOP_RING_SIZE - 1);
+	head = vq->used.ring[used].id;
+	*len = vq->used.ring[used].len;
+
+	BUG_ON(head >= VOP_RING_SIZE);
+	BUG_ON(!vq->data[head]);
+
+	/* detach_buf() clears data, save it now */
+	ret = vq->data[head];
+	detach_buf(vq, head);
+
+	/* Update the last local used_idx */
+	vq->loc_last_used++;
+
+	return ret;
+}
+
+/*
+ * The avail ring changed, so we need to start as much DMA as we can
+ */
+static void vop_kick(struct virtqueue *_vq)
+{
+	struct vop_vq *vq = to_vop_vq(_vq);
+
+	dev_dbg(vq->dev, "kick: making %d new buffers available\n", vq->num_added);
+	vq->avail.index += vq->num_added;
+	vq->num_added = 0;
+
+	/* Run the DMA resolver */
+	dev_dbg(vq->dev, "kick: using resolver %pS\n", vq->resolve);
+	schedule_work(&vq->work);
+}
+
+/*
+ * Try to disable callbacks on the used ring (unreliable)
+ */
+static void vop_disable_cb(struct virtqueue *_vq)
+{
+	struct vop_vq *vq = to_vop_vq(_vq);
+	struct virtio_device *vdev = _vq->vdev;
+
+	dev_dbg(&vdev->dev, "disable callbacks\n");
+	vq->flags = VOP_F_NO_INTERRUPT;
+#if 0
+	/*
+	 * FIXME: using this causes the host -> guest transfer rate to
+	 * FIXME: intermittently slow to 1/10th of the normal rate
+	 */
+	vop_set_guest_flags(vq, vq->flags);
+#endif
+}
+
+/*
+ * Enable callbacks on changes to the used ring
+ *
+ * @return false if there are more pending buffers
+ *         true otherwise
+ */
+static bool vop_enable_cb(struct virtqueue *_vq)
+{
+	struct vop_vq *vq = to_vop_vq(_vq);
+
+	/* We optimistically enable interrupts, then check if there
+	 * was more work to do */
+	dev_dbg(vq->dev, "enable callbacks\n");
+	vq->flags = 0;
+#if 0
+	/*
+	 * FIXME: using this causes the host -> guest transfer rate to
+	 * FIXME: intermittently slow to 1/10th of the normal rate
+	 */
+	vop_set_guest_flags(vq, vq->flags);
+#endif
+
+	if (unlikely(loc_more_used(vq)))
+		return false;
+
+	return true;
+}
+
+static struct virtqueue_ops vop_vq_ops = {
+	.add_buf	= vop_add_buf,
+	.get_buf	= vop_get_buf,
+	.kick		= vop_kick,
+	.disable_cb	= vop_disable_cb,
+	.enable_cb	= vop_enable_cb,
+};
+
+/*----------------------------------------------------------------------------*/
+/* Virtio Device Infrastructure                                               */
+/*----------------------------------------------------------------------------*/
+
+/* Read some bytes from the host's configuration area */
+static void vopc_get(struct virtio_device *_vdev, unsigned offset, void *buf,
+		     unsigned len)
+{
+	struct vop_vdev *vdev = to_vop_vdev(_vdev);
+	void __iomem *config = vdev->host_status->config;
+
+	memcpy_fromio(buf, config + offset, len);
+}
+
+/* Write some bytes to the host's configuration area */
+static void vopc_set(struct virtio_device *_vdev, unsigned offset,
+		     const void *buf, unsigned len)
+{
+	struct vop_vdev *vdev = to_vop_vdev(_vdev);
+	void __iomem *config = vdev->host_status->config;
+
+	memcpy_toio(config + offset, buf, len);
+}
+
+/* Read your own status bits */
+static u8 vopc_get_status(struct virtio_device *_vdev)
+{
+	struct vop_vdev *vdev = to_vop_vdev(_vdev);
+	u32 status;
+
+	status = le32_to_cpu(vdev->guest_status->status);
+	dev_dbg(&vdev->vdev.dev, "%s(): -> 0x%.2x\n", __func__, (u8)status);
+
+	return (u8)status;
+}
+
+static void vopc_set_status(struct virtio_device *_vdev, u8 status)
+{
+	struct vop_vdev *vdev = to_vop_vdev(_vdev);
+	u32 old_status;
+
+	old_status = le32_to_cpu(vdev->guest_status->status);
+	vdev->guest_status->status = cpu_to_le32(status);
+
+	dev_dbg(&vdev->vdev.dev, "%s(): <- 0x%.2x (was 0x%.2x)\n",
+			__func__, status, old_status);
+
+	/*
+	 * FIXME: we really need to notify the other side when status changes
+	 * FIXME: happen, so that they can take some action
+	 */
+}
+
+static void vopc_reset(struct virtio_device *_vdev)
+{
+	struct vop_vdev *vdev = to_vop_vdev(_vdev);
+
+	dev_dbg(&vdev->vdev.dev, "%s(): status reset\n", __func__);
+	vdev->guest_status->status = cpu_to_le32(0);
+}
+
+/* Find the given virtqueue */
+static struct virtqueue *vopc_find_vq(struct virtio_device *_vdev,
+					     unsigned index,
+					     void (*cb)(struct virtqueue *vq))
+{
+	struct vop_vdev *vdev = to_vop_vdev(_vdev);
+	struct vop_vq *vq = &vdev->virtqueues[index];
+	int i;
+
+	/* Check that we support the virtqueue at this index */
+	if (index >= ARRAY_SIZE(vdev->virtqueues)) {
+		dev_err(&vdev->vdev.dev, "no virtqueue for index %d\n", index);
+		return ERR_PTR(-ENODEV);
+	}
+
+	/* HACK: we only support virtio_net for now */
+	if (vdev->vdev.id.device != VIRTIO_ID_NET) {
+		dev_err(&vdev->vdev.dev, "only virtio_net is supported\n");
+		return ERR_PTR(-ENODEV);
+	}
+
+	/* Initialize the virtqueue to a clean state */
+	vq->num_free = VOP_RING_SIZE;
+	vq->dev = &vdev->vdev.dev;
+	vq->vq.vq_ops = &vop_vq_ops;
+
+	/* Hook up the local virtqueues to the corresponding remote virtqueues */
+	/* TODO: maybe move this to the setup_virtio_net() function */
+	switch (index) {
+	case 0: /* x86 xmit virtqueue, hook to ppc recv virtqueue */
+		vq->guest = vdev->loc + 2048;
+		vq->host  = vdev->rem + 2048;
+		vq->resolve = vop_dma_recv;
+		vq->kick_val = 0x8;
+		break;
+	case 1: /* x86 recv virtqueue, hook to ppc xmit virtqueue */
+		vq->guest = vdev->loc + 1024;
+		vq->host  = vdev->rem + 1024;
+		vq->resolve = vop_dma_xmit;
+		vq->kick_val = 0x4;
+		break;
+	case 2: /* x86 ctrl virtqueue -- ppc ctrl virtqueue */
+	default:
+		dev_err(vq->dev, "Unsupported virtqueue\n");
+		return ERR_PTR(-ENODEV);
+	}
+
+	dev_dbg(vq->dev, "vq %d guest %p host %p\n", index, vq->guest, vq->host);
+
+	/* Initialize the descriptor, avail, and used rings */
+	for (i = 0; i < VOP_RING_SIZE; i++) {
+		vop_set_desc_addr(vq, i, 0x0);
+		vop_set_desc_len(vq, i, 0);
+		vop_set_desc_flags(vq, i, 0);
+		vop_set_desc_next(vq, i, (i + 1) & (VOP_RING_SIZE - 1));
+
+		vq->avail.ring[i] = 0;
+		vq->used.ring[i].id = 0;
+		vq->used.ring[i].len = 0;
+	}
+
+	vq->avail.index = 0;
+	vop_set_guest_flags(vq, 0);
+
+	/* This is the guest, the host has already initialized the rings for us */
+	debug_dump_rings(vq, "found a virtqueue, dumping rings");
+
+	vq->vq.callback = cb;
+	vq->vq.vdev = &vdev->vdev;
+
+	return &vq->vq;
+}
+
+static void vopc_del_vq(struct virtqueue *_vq)
+{
+	struct vop_vq *vq = to_vop_vq(_vq);
+	int i;
+
+	/* FIXME: make sure that DMA has stopped by this point */
+
+	/* Unmap and remove all outstanding descriptors from the ring */
+	for (i = 0; i < VOP_RING_SIZE; i++) {
+		if (vq->data[i]) {
+			dev_dbg(vq->dev, "cleanup detach buffer at index %d\n", i);
+			detach_buf(vq, i);
+		}
+	}
+
+	debug_dump_rings(vq, "virtqueue destroyed, dumping rings");
+}
+
+/* Read the host's advertised features */
+static u32 vopc_get_features(struct virtio_device *_vdev)
+{
+	struct vop_vdev *vdev = to_vop_vdev(_vdev);
+	u32 ret;
+
+	ret = vop_get_host_features(vdev);
+	dev_dbg(&vdev->vdev.dev, "%s(): host features 0x%.8x\n", __func__, ret);
+
+	return ret;
+}
+
+/* At this point, we've chosen whichever features we can use and
+ * put them into the vdev->features array. We should probably notify
+ * the host at this point, but how will virtio react? */
+static void vopc_finalize_features(struct virtio_device *_vdev)
+{
+	struct vop_vdev *vdev = to_vop_vdev(_vdev);
+	struct device *dev = &vdev->vdev.dev;
+
+	/*
+	 * TODO: notify the other side at this point
+	 */
+
+	vdev->guest_status->features[0] = cpu_to_le32(vdev->vdev.features[0]);
+	dev_dbg(dev, "%s(): final features 0x%.8lx\n", __func__, vdev->vdev.features[0]);
+}
+
+static struct virtio_config_ops vop_config_ops = {
+	.get			= vopc_get,
+	.set			= vopc_set,
+	.get_status		= vopc_get_status,
+	.set_status		= vopc_set_status,
+	.reset			= vopc_reset,
+	.find_vq		= vopc_find_vq,
+	.del_vq			= vopc_del_vq,
+	.get_features		= vopc_get_features,
+	.finalize_features	= vopc_finalize_features,
+};
+
+/*----------------------------------------------------------------------------*/
+/* Last-minute device setup code                                              */
+/*----------------------------------------------------------------------------*/
+
+/*
+ * Do the last minute setup for virtio_net, now that the host memory is
+ * valid. This includes setting up pointers to the correct queues so that
+ * we can just start the virtqueues when the driver registers
+ */
+static void setup_virtio_net(struct vop_vdev *vdev)
+{
+	/* TODO: move some of the setup code from find_vq() here */
+}
+
+/*
+ * Do any last minute setup for a device just before starting it
+ *
+ * The host memory is now valid, so you should be setting up any pointers
+ * the device needs to the host memory
+ */
+static int vop_setup_device(struct vop_dev *priv, int devnum)
+{
+	struct vop_vdev *vdev = &priv->devices[devnum];
+	struct device *dev = priv->dev;
+
+	if (devnum >= ARRAY_SIZE(priv->devices)) {
+		dev_err(dev, "Unknown virtio_device %d\n", devnum);
+		return -ENODEV;
+	}
+
+	/* Setup the device's pointers to host memory */
+	vdev->rem = priv->host_mem  + (devnum * 4096);
+	vdev->host_status = vdev->rem;
+
+	switch (devnum) {
+	case 0: /* virtio_net */
+		setup_virtio_net(vdev);
+		break;
+	default:
+		dev_err(dev, "Device %d not implemented\n", devnum);
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
+/*
+ * Initialize and attempt to register a virtio_device
+ *
+ * @priv the driver data
+ * @devnum the virtio_device number (index into priv->devices)
+ */
+static int vop_start_device(struct vop_dev *priv, int devnum)
+{
+	struct vop_vdev *vdev = &priv->devices[devnum];
+	struct device *dev = priv->dev;
+	int ret;
+
+	/* Check that we know about the device */
+	if (devnum >= ARRAY_SIZE(priv->devices)) {
+		dev_err(dev, "Unknown virtio_device %d\n", devnum);
+		return -ENODEV;
+	}
+
+	vdev->status = 0;
+
+	/* Do any last minute device-specific setup now that the
+	 * host memory is valid */
+	ret = vop_setup_device(priv, devnum);
+	if (ret) {
+		dev_err(dev, "Unable to setup device %d\n", devnum);
+		return ret;
+	}
+
+	/* Register the device with the virtio subsystem */
+	ret = register_virtio_device(&vdev->vdev);
+	if (ret) {
+		dev_err(dev, "Unable to register device %d\n", devnum);
+		return ret;
+	}
+
+	vdev->status = VOP_DEVICE_REGISTERED;
+	return 0;
+}
+
+/*----------------------------------------------------------------------------*/
+/* Work Functions                                                             */
+/*----------------------------------------------------------------------------*/
+
+/*
+ * Start as much DMA as we can on the given virtqueue
+ *
+ * This is put on the system shared queue, and will start us much DMA as is
+ * available when it is called. This should be triggered when the host adds
+ * things to the avail rings, and when the guest adds things to the internal
+ * avail rings
+ *
+ * Make sure it doesn't sleep for too long, you're on the shared queue
+ */
+static void vop_dma_work(struct work_struct *work)
+{
+	struct vop_vq *vq = container_of(work, struct vop_vq, work);
+	int ret;
+
+	/* Start as many DMA transactions as we can, immediately */
+	while (true) {
+		ret = vq->resolve(vq);
+		if (ret)
+			break;
+	}
+}
+
+/*
+ * Remove all virtio devices immediately
+ *
+ * This will be called by the host to make sure that we are in a stopped
+ * state. It should be callable when everything is already stopped.
+ *
+ * Make sure it doesn't sleep for too long, you're on the shared queue
+ */
+static void vop_reset_work(struct work_struct *work)
+{
+	struct vop_dev *priv = container_of(work, struct vop_dev, reset_work);
+	struct device *dev = priv->dev;
+	struct vop_vdev *vdev;
+	int i;
+
+	dev_dbg(dev, "Resetting all virtio devices\n");
+	mutex_lock(&priv->mutex);
+
+	for (i = 0; i < ARRAY_SIZE(priv->devices); i++) {
+		vdev = &priv->devices[i];
+
+		if (vdev->status & VOP_DEVICE_REGISTERED) {
+			dev_dbg(dev, "Unregistering virtio_device #%d\n", i);
+			unregister_virtio_device(&vdev->vdev);
+		}
+
+		vdev->status &= ~VOP_DEVICE_REGISTERED;
+	}
+
+	if (priv->host_mem) {
+		iounmap(priv->host_mem);
+		priv->host_mem = NULL;
+	}
+
+	mutex_unlock(&priv->mutex);
+}
+
+/*
+ * This will map the host's memory, as well as start the devices that the host
+ * requested
+ *
+ * Mailbox registers contents:
+ * IMR0 - the host memory physical address (must be <1GB)
+ * IMR1 - the devices the host wants started
+ */
+static void vop_start_work(struct work_struct *work)
+{
+	struct vop_dev *priv = container_of(work, struct vop_dev, start_work);
+	struct device *dev = priv->dev;
+	struct vop_vdev *vdev;
+	u32 address, devices;
+	int i;
+
+	dev_dbg(dev, "Starting requested virtio devices\n");
+	mutex_lock(&priv->mutex);
+
+	/* Read the requested address and devices from the mailbox registers */
+	address = ioread32(priv->immr + IMR0_OFFSET);
+	devices = ioread32(priv->immr + IMR1_OFFSET);
+
+	dev_dbg(dev, "address 0x%.8x\n", address);
+	dev_dbg(dev, "devices 0x%.8x\n", devices);
+
+	/* Remap the host's registers */
+	priv->host_mem = ioremap(address + 0x80000000, VOP_HOST_MEM_SIZE);
+	if (!priv->host_mem) {
+		dev_err(dev, "Unable to ioremap host memory\n");
+		goto out_unlock;
+	}
+
+	/* Start the requested devices */
+	for (i = 0; i < ARRAY_SIZE(priv->devices); i++) {
+		vdev = &priv->devices[i];
+
+		if (devices & (1 << i)) {
+			dev_dbg(dev, "Starting virtio_device #%d\n", i);
+			vop_start_device(priv, i);
+		}
+	}
+
+out_unlock:
+	mutex_unlock(&priv->mutex);
+}
+
+/*----------------------------------------------------------------------------*/
+/* Interrupt Handling                                                         */
+/*----------------------------------------------------------------------------*/
+
+/*
+ * Schedule the work function for a given virtqueue only if the associated
+ * device is up and running. Otherwise, ignore the request
+ *
+ * @priv the private driver data
+ * @dev the virtio_device number in priv->devices[]
+ * @queue the virtqueue in vdev->virtqueues[]
+ */
+static void schedule_work_if_ready(struct vop_dev *priv, int dev, int queue)
+{
+	struct vop_vdev *vdev = &priv->devices[dev];
+	struct vop_vq *vq = &vdev->virtqueues[queue];
+
+	if (vdev->status & VOP_DEVICE_REGISTERED)
+		schedule_work(&vq->work);
+}
+
+static irqreturn_t vdev_interrupt(int irq, void *dev_id)
+{
+	struct vop_dev *priv = dev_id;
+	struct device *dev = priv->dev;
+	u32 imisr, idr;
+
+	imisr = ioread32(priv->immr + IMISR_OFFSET);
+	idr   = ioread32(priv->immr + IDR_OFFSET);
+
+	dev_dbg(dev, "INTERRUPT idr 0x%.8x\n", idr);
+
+	/* Check the status register for doorbell interrupts */
+	if (!(imisr & 0x8))
+		return IRQ_NONE;
+
+	/* Clear all doorbell interrupts */
+	iowrite32(idr, priv->immr + IDR_OFFSET);
+
+	/* Reset */
+	if (idr & 0x1)
+		schedule_work(&priv->reset_work);
+
+	/* Start */
+	if (idr & 0x2)
+		schedule_work(&priv->start_work);
+
+	/* vdev 0 vq 1 kick */
+	if (idr & 0x4)
+		schedule_work_if_ready(priv, 0, 1);
+
+	/* vdev 0 vq 0 kick */
+	if (idr & 0x8)
+		schedule_work_if_ready(priv, 0, 0);
+
+	if (idr & 0xfffffff0)
+		dev_dbg(dev, "INTERRUPT unhandled 0x%.8x\n", idr & 0xfffffff0);
+
+	return IRQ_HANDLED;
+}
+
+/*----------------------------------------------------------------------------*/
+/* Driver insertion time virtio device initialization                         */
+/*----------------------------------------------------------------------------*/
+
+static void vdev_release(struct device *dev)
+{
+	/* TODO: this should probably do something useful */
+	dev_dbg(dev, "%s: called\n", __func__);
+}
+
+/*
+ * Do any device-specific setup for a virtio device
+ *
+ * This would include things like setting the feature bits for the
+ * device, as well as the device type.
+ *
+ * There is no access to host memory at this point, so don't access it
+ */
+static void vop_setup_virtio_device(struct vop_dev *priv, int devnum)
+{
+	struct vop_vdev *vdev = &priv->devices[devnum];
+	struct virtio_net_config *config;
+	unsigned long features = 0;
+
+	/* HACK: we only support device #0 (virtio_net) right now */
+	if (devnum != 0)
+		return;
+
+	/* Generate a random ethernet address for the host to have
+	 *
+	 * This way, we could do something board-specific and get an
+	 * ethernet address that is consistent per-slot
+	 */
+	config = (struct virtio_net_config *)vdev->guest_status->config;
+	random_ether_addr(config->mac);
+	dev_info(priv->dev, "Generated MAC %pM\n", config->mac);
+
+	/* Set the feature bits for the device */
+	set_bit(VIRTIO_NET_F_MAC,       &features);
+	set_bit(VIRTIO_NET_F_CSUM,      &features);
+	set_bit(VIRTIO_NET_F_GSO,       &features);
+	set_bit(VIRTIO_NET_F_MRG_RXBUF, &features);
+
+	vdev->guest_status->features[0] = cpu_to_le32(features);
+	vdev->vdev.id.device = VIRTIO_ID_NET;
+}
+
+/*
+ * Do all of the initialization of all of the virtqueues for a given virtio
+ * device. There is no access to host memory at this point, so don't access it
+ *
+ * @devnum the device number in the priv->devices[] array
+ */
+static void vop_initialize_virtqueues(struct vop_dev *priv, int devnum)
+{
+	struct vop_vdev *vdev = &priv->devices[devnum];
+	struct vop_vq *vq;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(vdev->virtqueues); i++) {
+		vq = &vdev->virtqueues[i];
+
+		memset(vq, 0, sizeof(struct vop_vq));
+		vq->immr = priv->immr;
+		vq->dma.chan = priv->chan;
+		INIT_WORK(&vq->work, vop_dma_work);
+	}
+}
+
+/*
+ * Do all of the initialization for the virtio devices that is possible without
+ * access to the host memory
+ *
+ * This includes setting up the pointers that you can and setting the feature
+ * bits so that the host can read them before he starts us
+ */
+static void vop_initialize_devices(struct vop_dev *priv)
+{
+	struct device *parent = priv->dev;
+	struct vop_vdev *vdev;
+	struct device *vdev_dev;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(priv->devices); i++) {
+		vdev = &priv->devices[i];
+		vdev_dev = &vdev->vdev.dev;
+
+		/* Set up access to the guest memory, host memory isn't valid
+		 * yet, and will have to be set up just before we start */
+		vdev->loc = priv->guest_mem + (i * 4096);
+		vdev->guest_status = vdev->loc;
+
+		/* Initialize all of the device's virtqueues */
+		vop_initialize_virtqueues(priv, i);
+
+		/* Zero the configuration space */
+		memset(vdev->guest_status, 0, 1024);
+
+		/* Copy parent DMA parameters to this device */
+		vdev_dev->dma_mask = parent->dma_mask;
+		vdev_dev->dma_parms = parent->dma_parms;
+		vdev_dev->coherent_dma_mask = parent->coherent_dma_mask;
+
+		vdev_dev->release = &vdev_release;
+		vdev_dev->parent  = parent;
+		vdev->vdev.config = &vop_config_ops;
+
+		/* Do any device-specific setup */
+		vop_setup_virtio_device(priv, i);
+	}
+}
+
+/*----------------------------------------------------------------------------*/
+/* OpenFirmware Device Subsystem                                              */
+/*----------------------------------------------------------------------------*/
+
+static int vdev_of_probe(struct of_device *op, const struct of_device_id *match)
+{
+	struct vop_dev *priv;
+	dma_cap_mask_t mask;
+	int ret;
+
+	/* Allocate private data */
+	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	if (!priv) {
+		dev_err(&op->dev, "Unable to allocate device private data\n");
+		ret = -ENOMEM;
+		goto out_return;
+	}
+
+	dev_set_drvdata(&op->dev, priv);
+	priv->dev = &op->dev;
+	mutex_init(&priv->mutex);
+	INIT_WORK(&priv->reset_work, vop_reset_work);
+	INIT_WORK(&priv->start_work, vop_start_work);
+
+	/* Get a DMA channel */
+	dma_cap_zero(mask);
+	dma_cap_set(DMA_MEMCPY, mask);
+	dma_cap_set(DMA_INTERRUPT, mask);
+	priv->chan = dma_request_channel(mask, NULL, NULL);
+	if (!priv->chan) {
+		dev_err(&op->dev, "Unable to get DMA channel\n");
+		ret = -ENODEV;
+		goto out_free_priv;
+	}
+
+	/* Remap IMMR */
+	priv->immr = ioremap(get_immrbase(), 0x100000);
+	if (!priv->immr) {
+		dev_err(&op->dev, "Unable to remap IMMR registers\n");
+		ret = -ENOMEM;
+		goto out_dma_release_channel;
+	}
+
+	/* Set up a static 1GB window into host memory */
+	iowrite32be(LAWAR0_ENABLE | 0x1D, priv->immr + LAWAR0_OFFSET);
+	iowrite32be(POCMR0_ENABLE | 0xC0000, priv->immr + POCMR0_OFFSET);
+	iowrite32be(0x0, priv->immr + POTAR0_OFFSET);
+
+	/* Allocate guest memory */
+	priv->guest_mem = dma_alloc_coherent(&op->dev, VOP_GUEST_MEM_SIZE,
+					     &priv->guest_mem_addr, GFP_KERNEL);
+	if (!priv->guest_mem) {
+		dev_err(&op->dev, "Unable to allocate guest memory\n");
+		ret = -ENOMEM;
+		goto out_iounmap_immr;
+	}
+
+	memset(priv->guest_mem, 0, VOP_GUEST_MEM_SIZE);
+
+	/* Program BAR1 so that it will hit the guest memory */
+	iowrite32be(priv->guest_mem_addr >> 12, priv->immr + PITAR0_OFFSET);
+
+	/* Initialize all of the virtio devices with their features, etc */
+	vop_initialize_devices(priv);
+
+	/* Disable mailbox interrupts */
+	iowrite32(0x2 | 0x1, priv->immr + IMIMR_OFFSET);
+
+	/* Hook up the irq handler */
+	priv->irq = irq_of_parse_and_map(op->node, 0);
+	ret = request_irq(priv->irq, vdev_interrupt, IRQF_SHARED, driver_name, priv);
+	if (ret)
+		goto out_free_guest_mem;
+
+	dev_info(&op->dev, "Virtio-over-PCI guest driver installed\n");
+	dev_info(&op->dev, "Physical memory @ 0x%.8x\n", priv->guest_mem_addr);
+	dev_info(&op->dev, "Descriptor ring size: %d entries\n", VOP_RING_SIZE);
+	return 0;
+
+out_free_guest_mem:
+	dma_free_coherent(&op->dev, VOP_GUEST_MEM_SIZE, priv->guest_mem,
+			  priv->guest_mem_addr);
+out_iounmap_immr:
+	iounmap(priv->immr);
+out_dma_release_channel:
+	dma_release_channel(priv->chan);
+out_free_priv:
+	kfree(priv);
+out_return:
+	return ret;
+}
+
+static int vdev_of_remove(struct of_device *op)
+{
+	struct vop_dev *priv = dev_get_drvdata(&op->dev);
+
+	/* Stop the irq handler */
+	free_irq(priv->irq, priv);
+
+	/* Unregister and reset all of the devices */
+	schedule_work(&priv->reset_work);
+	flush_scheduled_work();
+
+	dma_free_coherent(&op->dev, VOP_GUEST_MEM_SIZE, priv->guest_mem,
+			  priv->guest_mem_addr);
+	iounmap(priv->immr);
+	dma_release_channel(priv->chan);
+	kfree(priv);
+
+	return 0;
+}
+
+static struct of_device_id vdev_of_match[] = {
+	{ .compatible = "fsl,mpc8349-mu", },
+	{},
+};
+
+static struct of_platform_driver vdev_of_driver = {
+	.owner		= THIS_MODULE,
+	.name		= driver_name,
+	.match_table	= vdev_of_match,
+	.probe		= vdev_of_probe,
+	.remove		= vdev_of_remove,
+};
+
+/*----------------------------------------------------------------------------*/
+/* Module Init / Exit                                                         */
+/*----------------------------------------------------------------------------*/
+
+static int __init vdev_init(void)
+{
+	dma_cache = KMEM_CACHE(vop_dma_cbinfo, 0);
+	if (!dma_cache) {
+		pr_err("%s: unable to create dma cache\n", driver_name);
+		return -ENOMEM;
+	}
+
+	return of_register_platform_driver(&vdev_of_driver);
+}
+
+static void __exit vdev_exit(void)
+{
+	of_unregister_platform_driver(&vdev_of_driver);
+	kmem_cache_destroy(dma_cache);
+}
+
+MODULE_AUTHOR("Ira W. Snyder <iws@ovro.caltech.edu>");
+MODULE_DESCRIPTION("Freescale Virtio-over-PCI Test Driver");
+MODULE_LICENSE("GPL");
+
+module_init(vdev_init);
+module_exit(vdev_exit);
diff --git a/drivers/virtio/vop_host.c b/drivers/virtio/vop_host.c
new file mode 100644
index 0000000..814fa8a
--- /dev/null
+++ b/drivers/virtio/vop_host.c
@@ -0,0 +1,1071 @@ 
+/*
+ * Virtio-over-PCI Host Driver for MPC8349EMDS Guest
+ *
+ * Copyright (c) 2009 Ira W. Snyder <iws@ovro.caltech.edu>
+ *
+ * This file is licensed under the terms of the GNU General Public License
+ * version 2. This program is licensed "as is" without any warranty of any
+ * kind, whether express or implied.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/pci.h>
+#include <linux/virtio.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_net.h>
+#include <linux/workqueue.h>
+#include <linux/interrupt.h>
+
+#include <linux/etherdevice.h>
+
+#include "vop_hw.h"
+#include "vop.h"
+
+static const char driver_name[] = "vdev";
+
+struct vop_loc_desc {
+	u32 addr;
+	u32 len;
+	u16 flags;
+	u16 next;
+};
+
+struct vop_vq {
+
+	/* The actual virtqueue itself */
+	struct virtqueue vq;
+
+	struct device *dev;
+
+	/* The host ring address */
+	struct vop_host_ring *host;
+
+	/* The guest ring address */
+	struct vop_guest_ring __iomem *guest;
+
+	/* Local copy of the descriptors for fast access */
+	struct vop_loc_desc desc[VOP_RING_SIZE];
+
+	/* The data token from add_buf() */
+	void *data[VOP_RING_SIZE];
+
+	unsigned int num_free;
+	unsigned int free_head;
+	unsigned int num_added;
+
+	u16 avail_idx;
+	u16 last_used_idx;
+
+	/* The doorbell to kick() */
+	unsigned int kick_val;
+	void __iomem *immr;
+};
+
+/* Convert from a struct virtqueue to a struct vop_vq */
+#define to_vop_vq(X) container_of(X, struct vop_vq, vq)
+
+/*
+ * This represents a virtio_device for our driver. It follows the memory
+ * layout shown above. It has pointers to all of the host and guest memory
+ * areas that we need to access
+ */
+struct vop_vdev {
+
+	/* The specific virtio device (console, net, blk) */
+	struct virtio_device vdev;
+
+	/* Local and remote memory */
+	void *loc;
+	void __iomem *rem;
+
+	/*
+	 * These are the status, feature, and configuration information
+	 * for this virtio device. They are exposed in our memory block
+	 * starting at offset 0.
+	 */
+	struct vop_status *host_status;
+
+	/*
+	 * These are the status, feature, and configuration information
+	 * for the guest virtio device. They are exposed in the guest
+	 * memory block starting at offset 0.
+	 */
+	struct vop_status __iomem *guest_status;
+
+	/*
+	 * These are the virtqueues for the virtio driver running this
+	 * device to use. The host portions are exposed in our memory block
+	 * starting at offset 1024. The exposed areas are aligned to 1024 byte
+	 * boundaries, so they appear at offets 1024, 2048, and 3072
+	 * respectively.
+	 */
+	struct vop_vq virtqueues[3];
+};
+
+#define to_vop_vdev(X) container_of(X, struct vop_vdev, vdev)
+
+/*
+ * This is information from the PCI subsystem about each MPC8349EMDS board
+ *
+ * It holds information for all of the possible virtio_devices that are
+ * attached to this board.
+ */
+struct vop_dev {
+
+	struct pci_dev *pdev;
+	struct device *dev;
+
+	/* PowerPC memory (PCI BAR0 and BAR1, respectively) */
+	#define VOP_GUEST_MEM_SIZE 16384
+	void __iomem *immr;
+	void __iomem *netregs;
+
+	/* Host memory, visible to the PowerPC */
+	#define VOP_HOST_MEM_SIZE 16384
+	void *host_mem;
+	dma_addr_t host_mem_addr;
+
+	/* The virtio devices */
+	struct vop_vdev devices[4];
+};
+
+/*----------------------------------------------------------------------------*/
+/* Ring Debugging Helpers                                                     */
+/*----------------------------------------------------------------------------*/
+
+#ifdef DEBUG_DUMP_RINGS
+static void dump_guest_descriptors(struct vop_vq *vq)
+{
+	int i;
+	struct vop_desc __iomem *desc;
+
+	pr_debug("DESC BG: 0xADDRESSX LENGTH 0xFLAG 0xNEXT\n");
+	for (i = 0; i < VOP_RING_SIZE; i++) {
+		desc = &vq->guest->desc[i];
+		pr_debug("DESC %.2d: 0x%.8x %.6d 0x%.4x 0x%.4x\n", i,
+				ioread32(&desc->addr), ioread32(&desc->len),
+				ioread16(&desc->flags), ioread16(&desc->next));
+	}
+	pr_debug("DESC ED\n");
+}
+
+static void dump_guest_avail(struct vop_vq *vq)
+{
+	int i;
+
+	pr_debug("BEGIN AVAIL DUMP\n");
+	for (i = 0; i < VOP_RING_SIZE; i++)
+		pr_debug("AVAIL %.2d: 0x%.4x\n", i, ioread16(&vq->guest->avail[i]));
+	pr_debug("END AVAIL DUMP\n");
+}
+
+static void dump_guest_ring(struct vop_vq *vq)
+{
+	pr_debug("BEGIN GUEST RING DUMP\n");
+	dump_guest_descriptors(vq);
+	pr_debug("GUEST FLAGS: 0x%.4x\n", ioread16(&vq->guest->flags));
+	pr_debug("GUEST AVAIL_IDX: %d\n", ioread16(&vq->guest->avail_idx));
+	dump_guest_avail(vq);
+	pr_debug("END GUEST RING DUMP\n");
+}
+
+static void dump_host_used(struct vop_vq *vq)
+{
+	int i;
+	struct vop_used_elem *used;
+
+	pr_debug("USED BG: 0xIDID LENGTH\n");
+	for (i = 0; i < VOP_RING_SIZE; i++) {
+		used = &vq->host->used[i];
+		pr_debug("USED %.2d: 0x%.4x %.6d\n", i, used->id, used->len);
+	}
+	pr_debug("USED ED\n");
+}
+
+static void dump_host_ring(struct vop_vq *vq)
+{
+	pr_debug("BEGIN HOST RING DUMP\n");
+	pr_debug("HOST FLAGS: 0x%.4x\n", vq->host->flags);
+	pr_debug("HOST USED_IDX: 0x%.2d\n", vq->host->used_idx);
+	dump_host_used(vq);
+	pr_debug("END HOST RING DUMP\n");
+}
+
+static void debug_dump_rings(struct vop_vq *vq, const char *msg)
+{
+	dev_dbg(vq->dev, "%s\n", msg);
+	dump_guest_ring(vq);
+	dump_host_ring(vq);
+	pr_debug("\n");
+}
+#else
+static void debug_dump_rings(struct vop_vq *vq, const char *msg)
+{
+	/* Nothing */
+}
+#endif /* DEBUG_DUMP_RINGS */
+
+/*----------------------------------------------------------------------------*/
+/* Ring Access Helpers                                                        */
+/*----------------------------------------------------------------------------*/
+
+static void vop_set_desc_addr(struct vop_vq *vq, unsigned int idx, u32 addr)
+{
+	vq->desc[idx].addr = addr;
+	iowrite32(addr, &vq->guest->desc[idx].addr);
+}
+
+static void vop_set_desc_len(struct vop_vq *vq, unsigned int idx, u32 len)
+{
+	vq->desc[idx].len = len;
+	iowrite32(len, &vq->guest->desc[idx].len);
+}
+
+static void vop_set_desc_flags(struct vop_vq *vq, unsigned int idx, u16 flags)
+{
+	vq->desc[idx].flags = flags;
+	iowrite16(flags, &vq->guest->desc[idx].flags);
+}
+
+static void vop_set_desc_next(struct vop_vq *vq, unsigned int idx, u16 next)
+{
+	vq->desc[idx].next = next;
+	iowrite16(next, &vq->guest->desc[idx].next);
+}
+
+static u32 vop_get_desc_addr(struct vop_vq *vq, unsigned int idx)
+{
+	return vq->desc[idx].addr;
+}
+
+static u32 vop_get_desc_len(struct vop_vq *vq, unsigned int idx)
+{
+	return vq->desc[idx].len;
+}
+
+static u16 vop_get_desc_flags(struct vop_vq *vq, unsigned int idx)
+{
+	return vq->desc[idx].flags;
+}
+
+static u16 vop_get_desc_next(struct vop_vq *vq, unsigned int idx)
+{
+	return vq->desc[idx].next;
+}
+
+/*
+ * Add an entry to the available ring at avail_idx pointing to the descriptor
+ * chain at index head
+ *
+ * @vq the virtqueue
+ * @idx the index in the avail ring
+ * @val the value to write
+ */
+static void vop_set_avail_entry(struct vop_vq *vq, u16 idx, u16 val)
+{
+	iowrite16(val, &vq->guest->avail[idx]);
+}
+
+/*
+ * Set the available index so the guest knows about buffers that were added
+ * with vop_set_avail_entry()
+ *
+ * @vq the virtqueue
+ * @idx the new avail_idx that the guest sees
+ */
+static void vop_set_avail_idx(struct vop_vq *vq, u16 idx)
+{
+	iowrite16(idx, &vq->guest->avail_idx);
+}
+
+/*
+ * Set the host's flags (in the guest memory)
+ *
+ * @vq the virtqueue
+ * @flags the new flags that the guest will see
+ */
+static void vop_set_host_flags(struct vop_vq *vq, u16 flags)
+{
+	iowrite16(flags, &vq->guest->flags);
+}
+
+/*
+ * Read the guests flags (in local memory)
+ *
+ * @vq the virtqueue
+ * @return the guest's flags
+ */
+static u16 vop_get_guest_flags(struct vop_vq *vq)
+{
+	return le16_to_cpu(vq->host->flags);
+}
+
+/*----------------------------------------------------------------------------*/
+/* Remote status helpers                                                      */
+/*----------------------------------------------------------------------------*/
+
+static u32 vop_get_guest_status(struct vop_vdev *vdev)
+{
+	return ioread32(&vdev->guest_status->status);
+}
+
+static u32 vop_get_guest_features(struct vop_vdev *vdev)
+{
+	return ioread32(&vdev->guest_status->features[0]);
+}
+
+/*----------------------------------------------------------------------------*/
+/* Scatterlist DMA helpers                                                    */
+/*----------------------------------------------------------------------------*/
+
+/*
+ * This function abuses some of the scatterlist code and implements
+ * dma_map_sg() in such a way that we don't need to keep the scatterlist
+ * around in order to unmap it.
+ *
+ * It is also designed to never merge scatterlist entries, which is
+ * never what we want for virtio.
+ *
+ * When it is time to unmap the buffer, you can use dma_unmap_single() to
+ * unmap each entry in the chain. Get the address, length, and direction
+ * from the descriptors! (keep a local copy for speed)
+ */
+static int vop_dma_map_sg(struct device *dev, struct scatterlist sg[],
+			  unsigned int out, unsigned int in)
+{
+	dma_addr_t addr;
+	enum dma_data_direction dir;
+	struct scatterlist *start;
+	unsigned int i, failure;
+
+	start = sg;
+
+	for (i = 0; i < out + in; i++) {
+
+		/* Check for scatterlist chaining abuse */
+		BUG_ON(sg == NULL);
+
+		dir = (i < out) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
+		addr = dma_map_single(dev, sg_virt(sg), sg->length, dir);
+
+		if (dma_mapping_error(dev, addr))
+			goto unwind;
+
+		sg_dma_address(sg) = addr;
+		sg = sg_next(sg);
+	}
+
+	return 0;
+
+unwind:
+	failure = i;
+	sg = start;
+
+	for (i = 0; i < failure; i++) {
+		dir = (i < out) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
+		addr = sg_dma_address(sg);
+
+		dma_unmap_single(dev, addr, sg->length, dir);
+		sg = sg_next(sg);
+	}
+
+	return -ENOMEM;
+}
+
+/*----------------------------------------------------------------------------*/
+/* struct virtqueue_ops infrastructure                                        */
+/*----------------------------------------------------------------------------*/
+
+/*
+ * Modify the struct virtio_net_hdr_mrg_rxbuf's num_buffers field to account
+ * for the split that will happen in the DMA xmit routine
+ *
+ * This assumes that both sides have the same PAGE_SIZE
+ */
+static void vop_fixup_vnet_mrg_hdr(struct scatterlist sg[], unsigned int out)
+{
+	struct virtio_net_hdr *hdr;
+	struct virtio_net_hdr_mrg_rxbuf *mhdr;
+	unsigned int bytes = 0;
+
+	/* There must be a header + data, at the least */
+	BUG_ON(out < 2);
+
+	/* The first entry must be the structure */
+	BUG_ON(sg->length != sizeof(struct virtio_net_hdr_mrg_rxbuf));
+
+	hdr = sg_virt(sg);
+	mhdr = sg_virt(sg);
+
+	/* We merge buffers together, so just count up the number of bytes
+	 * needed, then figure out how many pages that will be */
+	for (/* none */; out; out--, sg = sg_next(sg))
+		bytes += sg->length;
+
+	/* Of course, nobody ever imagined that we might actually use
+	 * this on machines with different endianness...
+	 *
+	 * We force big-endian for now, since that's what our guest is */
+	mhdr->num_buffers = cpu_to_be16(DIV_ROUND_UP(bytes, PAGE_SIZE));
+
+	/* Might as well fix up the other fields while we're at it */
+	hdr->hdr_len = cpu_to_be16(hdr->hdr_len);
+	hdr->gso_size = cpu_to_be16(hdr->gso_size);
+	hdr->csum_start = cpu_to_be16(hdr->csum_start);
+	hdr->csum_offset = cpu_to_be16(hdr->csum_offset);
+}
+
+static int vop_add_buf(struct virtqueue *_vq, struct scatterlist sg[],
+				unsigned int out, unsigned int in, void *data)
+{
+	struct vop_vq *vq = to_vop_vq(_vq);
+	unsigned int i, avail, head, uninitialized_var(prev);
+
+	BUG_ON(data == NULL);
+	BUG_ON(out + in == 0);
+
+	/* Make sure we have space for this to succeed */
+	if (vq->num_free < out + in) {
+		dev_dbg(vq->dev, "No free space left: len=%d free=%d\n",
+				out + in, vq->num_free);
+		return -ENOSPC;
+	}
+
+	/* If this is an xmit buffer from virtio_net, fixup the header */
+	if (out > 1) {
+		dev_dbg(vq->dev, "Fixing up virtio_net header\n");
+		vop_fixup_vnet_mrg_hdr(sg, out);
+	}
+
+	head = vq->free_head;
+
+	/* DMA map the scatterlist */
+	if (vop_dma_map_sg(vq->dev, sg, out, in)) {
+		dev_err(vq->dev, "Failed to DMA map scatterlist\n");
+		return -ENOMEM;
+	}
+
+	/* We're about to use some buffers from the free list */
+	vq->num_free -= out + in;
+
+	for (i = vq->free_head; out; i = vop_get_desc_next(vq, i), out--) {
+		vop_set_desc_flags(vq, i, VOP_DESC_F_NEXT);
+		vop_set_desc_addr(vq, i, sg_dma_address(sg));
+		vop_set_desc_len(vq, i, sg->length);
+
+		prev = i;
+		sg = sg_next(sg);
+	}
+
+	for (/* none */; in; i = vop_get_desc_next(vq, i), in--) {
+		vop_set_desc_flags(vq, i, VOP_DESC_F_NEXT | VOP_DESC_F_WRITE);
+		vop_set_desc_addr(vq, i, sg_dma_address(sg));
+		vop_set_desc_len(vq, i, sg->length);
+
+		prev = i;
+		sg = sg_next(sg);
+	}
+
+	/* Last one doesn't continue */
+	vop_set_desc_flags(vq, prev, vop_get_desc_flags(vq, prev) & ~VOP_DESC_F_NEXT);
+
+	/* Update the free pointer */
+	vq->free_head = i;
+
+	/* Set token */
+	vq->data[head] = data;
+
+	/* Add an entry for the head of the chain into the avail array, but
+	 * don't update avail->idx until kick() */
+	avail = (vq->avail_idx + vq->num_added++) & (VOP_RING_SIZE - 1);
+	vop_set_avail_entry(vq, avail, head);
+
+	dev_dbg(vq->dev, "Added buffer head %i to %p (num_free %d)\n", head, vq, vq->num_free);
+	debug_dump_rings(vq, "Added buffer(s), dumping rings");
+
+	return 0;
+}
+
+static inline bool more_used(const struct vop_vq *vq)
+{
+	return vq->last_used_idx != le16_to_cpu(vq->host->used_idx);
+}
+
+static void detach_buf(struct vop_vq *vq, unsigned int head)
+{
+	unsigned int i, len;
+	dma_addr_t addr;
+	enum dma_data_direction dir;
+
+	/* Clear data pointer */
+	vq->data[head] = NULL;
+
+	/* Put the chain back on the free list, unmapping as we go */
+	i = head;
+	while (true) {
+		addr = vop_get_desc_addr(vq, i);
+		len = vop_get_desc_len(vq, i);
+		dir = (vop_get_desc_flags(vq, i) & VOP_DESC_F_WRITE) ?
+				DMA_FROM_DEVICE : DMA_TO_DEVICE;
+
+		/* Unmap the entry */
+		dma_unmap_single(vq->dev, addr, len, dir);
+		vq->num_free++;
+
+		/* Check for end-of-chain */
+		if (!(vop_get_desc_flags(vq, i) & VOP_DESC_F_NEXT))
+			break;
+
+		i = vop_get_desc_next(vq, i);
+	}
+
+	vop_set_desc_next(vq, i, vq->free_head);
+	vq->free_head = head;
+}
+
+static void *vop_get_buf(struct virtqueue *_vq, unsigned int *len)
+{
+	struct vop_vq *vq = to_vop_vq(_vq);
+	unsigned int head, used_idx;
+	void *ret;
+
+	if (!more_used(vq)) {
+		dev_dbg(vq->dev, "No more buffers in queue\n");
+		return NULL;
+	}
+
+	used_idx = vq->last_used_idx & (VOP_RING_SIZE - 1);
+	head = le32_to_cpu(vq->host->used[used_idx].id);
+	*len = le32_to_cpu(vq->host->used[used_idx].len);
+
+	dev_dbg(vq->dev, "REMOVE buffer head %i from %p (len %d)\n", head, vq, *len);
+	debug_dump_rings(vq, "Removing buffer, dumping rings");
+
+	BUG_ON(head >= VOP_RING_SIZE);
+	BUG_ON(!vq->data[head]);
+
+	/* detach_buf() clears data, save it now */
+	ret = vq->data[head];
+	detach_buf(vq, head);
+
+	/* Update the last used_idx we've consumed */
+	vq->last_used_idx++;
+	return ret;
+}
+
+static void vop_kick(struct virtqueue *_vq)
+{
+	struct vop_vq *vq = to_vop_vq(_vq);
+
+	dev_dbg(vq->dev, "making %d new buffers available to guest\n", vq->num_added);
+	vq->avail_idx += vq->num_added;
+	vq->num_added = 0;
+	vop_set_avail_idx(vq, vq->avail_idx);
+
+	if (!(vop_get_guest_flags(vq) & VOP_F_NO_INTERRUPT)) {
+		dev_dbg(vq->dev, "kicking the guest (new buffers in avail)\n");
+		iowrite32(vq->kick_val, vq->immr + IDR_OFFSET);
+		debug_dump_rings(vq, "ran a kick, dumping rings");
+	}
+}
+
+/* Write to the guest's flags register to disable interrupts */
+static void vop_disable_cb(struct virtqueue *_vq)
+{
+	struct vop_vq *vq = to_vop_vq(_vq);
+
+	vop_set_host_flags(vq, VOP_F_NO_INTERRUPT);
+}
+
+static bool vop_enable_cb(struct virtqueue *_vq)
+{
+	struct vop_vq *vq = to_vop_vq(_vq);
+
+	/* We optimistically enable interrupts, then check if
+	 * there was more to do */
+	vop_set_host_flags(vq, 0);
+
+	if (unlikely(more_used(vq)))
+		return false;
+
+	return true;
+}
+
+static struct virtqueue_ops vop_vq_ops = {
+	.add_buf	= vop_add_buf,
+	.get_buf	= vop_get_buf,
+	.kick		= vop_kick,
+	.disable_cb	= vop_disable_cb,
+	.enable_cb	= vop_enable_cb,
+};
+
+/*----------------------------------------------------------------------------*/
+/* struct virtio_device infrastructure                                        */
+/*----------------------------------------------------------------------------*/
+
+/* Get something that the other side wants you to have, from configuration
+ * space. This is used to transfer the MAC address from the guest to the host,
+ * for example. It should be reading something from the guest, in this case */
+static void vopc_get(struct virtio_device *_vdev, unsigned offset, void *buf,
+		     unsigned len)
+{
+	struct vop_vdev *vdev = to_vop_vdev(_vdev);
+	void __iomem *config = vdev->guest_status->config;
+
+	memcpy_fromio(buf, config + offset, len);
+}
+
+/* Set something in the configuration space (currently unused) */
+static void vopc_set(struct virtio_device *_vdev, unsigned offset,
+		     const void *buf, unsigned len)
+{
+	struct vop_vdev *vdev = to_vop_vdev(_vdev);
+	void __iomem *config = vdev->guest_status->config;
+
+	memcpy_toio(config + offset, buf, len);
+}
+
+/* Get your own status */
+static u8 vopc_get_status(struct virtio_device *_vdev)
+{
+	struct vop_vdev *vdev = to_vop_vdev(_vdev);
+	u32 status;
+
+	status = le32_to_cpu(vdev->host_status->status);
+	dev_dbg(&vdev->vdev.dev, "%s(): -> 0x%.2x\n", __func__, (u8)status);
+
+	return (u8)status;
+}
+
+/* Set your own status */
+static void vopc_set_status(struct virtio_device *_vdev, u8 status)
+{
+	struct vop_vdev *vdev = to_vop_vdev(_vdev);
+	u32 old_status;
+
+	old_status = le32_to_cpu(vdev->host_status->status);
+	vdev->host_status->status = cpu_to_le32(status);
+
+	dev_dbg(&vdev->vdev.dev, "%s(): <- 0x%.2x (was 0x%.2x)\n",
+			__func__, status, old_status);
+
+	/*
+	 * FIXME: we really need to notify the other side when status changes
+	 * FIXME: happen, so that they can take some action
+	 */
+}
+
+/* Reset your own status */
+static void vopc_reset(struct virtio_device *_vdev)
+{
+	struct vop_vdev *vdev = to_vop_vdev(_vdev);
+
+	dev_dbg(&vdev->vdev.dev, "%s(): status reset\n", __func__);
+	vdev->host_status->status = cpu_to_le32(0);
+}
+
+static struct virtqueue *vopc_find_vq(struct virtio_device *_vdev,
+					     unsigned index,
+					     void (*cb)(struct virtqueue *vq))
+{
+	struct vop_vdev *vdev = to_vop_vdev(_vdev);
+	struct vop_vq *vq = &vdev->virtqueues[index];
+	int i;
+
+	/* Check that we support the virtqueue at this index */
+	if (index >= ARRAY_SIZE(vdev->virtqueues)) {
+		dev_err(&vdev->vdev.dev, "no virtqueue for index %d\n", index);
+		return ERR_PTR(-ENODEV);
+	}
+
+	/* HACK: we only support virtio_net for now */
+	if (vdev->vdev.id.device != VIRTIO_ID_NET) {
+		dev_err(&vdev->vdev.dev, "only virtio_net is supported\n");
+		return ERR_PTR(-ENODEV);
+	}
+
+	/* Initialize the virtqueue to a clean state */
+	vq->num_free = VOP_RING_SIZE;
+	vq->dev = &vdev->vdev.dev;
+
+	switch (index) {
+	case 0: /* x86 recv virtqueue -- ppc xmit virtqueue */
+		vq->guest = vdev->rem + 1024;
+		vq->host  = vdev->loc + 1024;
+		break;
+	case 1: /* x86 xmit virtqueue -- ppc recv virtqueue */
+		vq->guest = vdev->rem + 2048;
+		vq->host  = vdev->loc + 2048;
+		break;
+	default:
+		dev_err(vq->dev, "unknown virtqueue %d\n", index);
+		return ERR_PTR(-ENODEV);
+	}
+
+	/* Initialize the descriptor, avail, and used rings */
+	for (i = 0; i < VOP_RING_SIZE; i++) {
+		vop_set_desc_addr(vq, i, 0x0);
+		vop_set_desc_len(vq, i, 0);
+		vop_set_desc_flags(vq, i, 0);
+		vop_set_desc_next(vq, i, (i + 1) & (VOP_RING_SIZE - 1));
+
+		vop_set_avail_entry(vq, i, 0);
+		vq->host->used[i].id = cpu_to_le32(0);
+		vq->host->used[i].len = cpu_to_le32(0);
+	}
+
+	vq->avail_idx = 0;
+	vop_set_avail_idx(vq, 0);
+	vop_set_host_flags(vq, 0);
+
+	debug_dump_rings(vq, "found a virtqueue, dumping rings");
+
+	vq->vq.callback = cb;
+	vq->vq.vdev = &vdev->vdev;
+	vq->vq.vq_ops = &vop_vq_ops;
+
+	return &vq->vq;
+}
+
+static void vopc_del_vq(struct virtqueue *_vq)
+{
+	struct vop_vq *vq = to_vop_vq(_vq);
+	int i;
+
+	/* FIXME: make sure that DMA has stopped by this point */
+
+	/* Unmap and remove all outstanding descriptors from the ring */
+	for (i = 0; i < VOP_RING_SIZE; i++) {
+		if (vq->data[i]) {
+			dev_dbg(vq->dev, "cleanup detach buffer at index %d\n", i);
+			detach_buf(vq, i);
+		}
+	}
+
+	debug_dump_rings(vq, "virtqueue destroyed, dumping rings");
+}
+
+static u32 vopc_get_features(struct virtio_device *_vdev)
+{
+	struct vop_vdev *vdev = to_vop_vdev(_vdev);
+	u32 ret;
+
+	ret = vop_get_guest_features(vdev);
+	dev_info(&vdev->vdev.dev, "%s(): guest features 0x%.8x\n", __func__, ret);
+
+	return ret;
+}
+
+static void vopc_finalize_features(struct virtio_device *_vdev)
+{
+	struct vop_vdev *vdev = to_vop_vdev(_vdev);
+
+	/*
+	 * TODO: notify the other side at this point
+	 */
+
+	vdev->host_status->features[0] = cpu_to_le32(vdev->vdev.features[0]);
+	dev_info(&vdev->vdev.dev, "%s(): final features 0x%.8lx\n", __func__, vdev->vdev.features[0]);
+}
+
+static struct virtio_config_ops vop_config_ops = {
+	.get			= vopc_get,
+	.set			= vopc_set,
+	.get_status		= vopc_get_status,
+	.set_status		= vopc_set_status,
+	.reset			= vopc_reset,
+	.find_vq		= vopc_find_vq,
+	.del_vq			= vopc_del_vq,
+	.get_features		= vopc_get_features,
+	.finalize_features	= vopc_finalize_features,
+};
+
+/*----------------------------------------------------------------------------*/
+/* Setup code for virtio devices                                              */
+/*----------------------------------------------------------------------------*/
+
+static void vop_release(struct device *dev)
+{
+	dev_dbg(dev, "calling device release\n");
+}
+
+static int setup_virtio_device(struct vop_dev *priv, int devnum)
+{
+	struct vop_vdev *vdev = &priv->devices[devnum];
+	struct device *dev = priv->dev;
+	int i;
+
+	/* Set up the pointers to the guest and host memory areas */
+	vdev->loc = priv->host_mem + (devnum * 4096);
+	vdev->rem = priv->netregs  + (devnum * 4096);
+	dev_dbg(dev, "memory guest 0x%p host 0x%p\n", vdev->rem, vdev->loc);
+
+	/* Set up the pointers to the guest and host status areas */
+	vdev->guest_status = vdev->rem;
+	vdev->host_status  = vdev->loc;
+	dev_dbg(dev, "status guest 0x%p host 0x%p\n", vdev->rem, vdev->loc);
+
+	/* The find_vq() must set up the correct mappings to virtqueues itself,
+	 * so we cannot do it here */
+	for (i = 0; i < ARRAY_SIZE(vdev->virtqueues); i++) {
+		memset(&vdev->virtqueues[i], 0, sizeof(struct vop_vq));
+		vdev->virtqueues[i].immr = priv->immr;
+		vdev->virtqueues[i].kick_val = 1 << ((devnum * 4) + i + 2);
+		dev_dbg(dev, "vq %d cleared, kick %d\n", i, (devnum * 4) + i + 2);
+	}
+
+	/* Zero out the configuration space completely */
+	memset(vdev->host_status, 0, 1024);
+
+	/* Copy the parent DMA parameters to this virtio_device */
+	vdev->vdev.dev.dma_mask = dev->dma_mask;
+	vdev->vdev.dev.dma_parms = dev->dma_parms;
+	vdev->vdev.dev.coherent_dma_mask = dev->coherent_dma_mask;
+
+	/* Setup everything except the device type */
+	vdev->vdev.dev.release = &vop_release;
+	vdev->vdev.dev.parent  = dev;
+	vdev->vdev.config      = &vop_config_ops;
+
+	return 0;
+}
+
+static int register_virtio_net(struct vop_dev *priv)
+{
+	struct vop_vdev *vdev = &priv->devices[0];
+	struct virtio_net_config *config;
+	unsigned long features = 0;
+	int ret;
+
+	/* Run the common setup routine */
+	ret = setup_virtio_device(priv, 0);
+	if (ret) {
+		dev_err(priv->dev, "unable to setup virtio_net\n");
+		return ret;
+	}
+
+	/* Generate a random ethernet address for the other side
+	 *
+	 * This is necessary so we can allow it to give us a consistent
+	 * MAC address for itself, using something board-specific
+	 *
+	 * The feature bits must match for it to work correctly
+	 */
+	config = (struct virtio_net_config *)vdev->host_status->config;
+	random_ether_addr(config->mac);
+	dev_info(priv->dev, "Generated MAC %pM\n", config->mac);
+
+	/* Set the feature bits for the device */
+	set_bit(VIRTIO_NET_F_MAC,       &features);
+	set_bit(VIRTIO_NET_F_CSUM,      &features);
+	set_bit(VIRTIO_NET_F_GSO,       &features);
+	set_bit(VIRTIO_NET_F_MRG_RXBUF, &features);
+
+	vdev->host_status->features[0] = cpu_to_le32(features);
+	vdev->vdev.id.device = VIRTIO_ID_NET;
+
+	/* Register the virtio device */
+	return register_virtio_device(&vdev->vdev);
+}
+
+/*----------------------------------------------------------------------------*/
+/* Interrupt Handling                                                         */
+/*----------------------------------------------------------------------------*/
+
+static irqreturn_t vdev_interrupt(int irq, void *dev_id)
+{
+	struct vop_dev *priv = dev_id;
+	struct virtqueue *vq;
+	u32 omisr, odr;
+
+	omisr = ioread32(priv->immr + OMISR_OFFSET);
+	odr   = ioread32(priv->immr + ODR_OFFSET);
+
+	/* Check the status register for doorbell interrupts */
+	if (!(omisr & 0x8))
+		return IRQ_NONE;
+
+	/* Clear all doorbell interrupts */
+	iowrite32(odr, priv->immr + ODR_OFFSET);
+
+	if (odr & 0x4) {
+		vq = &priv->devices[0].virtqueues[0].vq;
+		vq->callback(vq);
+	}
+
+	if (odr & 0x8) {
+		vq = &priv->devices[0].virtqueues[1].vq;
+		vq->callback(vq);
+	}
+
+	return IRQ_HANDLED;
+}
+
+/*----------------------------------------------------------------------------*/
+/* PCI Subsystem                                                              */
+/*----------------------------------------------------------------------------*/
+
+static int vop_probe(struct pci_dev *dev, const struct pci_device_id *id)
+{
+	struct vop_dev *priv;
+	int ret;
+
+	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	if (!priv) {
+		ret = -ENOMEM;
+		goto out_return;
+	}
+
+	pci_set_drvdata(dev, priv);
+	priv->dev = &dev->dev;
+
+	/* Hardware Initialization */
+	ret = pci_enable_device(dev);
+	if (ret)
+		goto out_kfree_priv;
+
+	pci_set_master(dev);
+	ret = pci_request_regions(dev, driver_name);
+	if (ret)
+		goto out_pci_disable_device;
+
+	priv->immr = pci_ioremap_bar(dev, 0);
+	if (!priv->immr) {
+		ret = -ENOMEM;
+		goto out_pci_release_regions;
+	}
+
+	priv->netregs = pci_ioremap_bar(dev, 1);
+	if (!priv->netregs) {
+		ret = -ENOMEM;
+		goto out_iounmap_immr;
+	}
+
+	/* The device can only see the lowest 1GB of memory over the bus */
+	dev->dev.coherent_dma_mask = DMA_BIT_MASK(30);
+	ret = dma_set_mask(&dev->dev, DMA_BIT_MASK(30));
+	if (ret) {
+		dev_err(&dev->dev, "Unable to set DMA mask\n");
+		goto out_iounmap_netregs;
+	}
+
+	/* Allocate the host memory, for writing by the guest */
+	priv->host_mem = dma_alloc_coherent(&dev->dev, VOP_HOST_MEM_SIZE,
+			&priv->host_mem_addr, GFP_KERNEL);
+	if (!priv->host_mem) {
+		dev_err(&dev->dev, "Unable to allocate host memory\n");
+		ret = -ENOMEM;
+		goto out_iounmap_netregs;
+	}
+
+	/* We use the guest's mailbox 0 to hold the host memory address */
+	iowrite32(priv->host_mem_addr, priv->immr + IMR0_OFFSET);
+
+	/* Reset all of the devices */
+	iowrite32(0x1, priv->immr + IDR_OFFSET);
+
+	/* Mask all of the MBOX interrupts */
+	iowrite32(0x1 | 0x2, priv->immr + OMIMR_OFFSET);
+
+	/* Setup the virtio_net instance */
+	ret = register_virtio_net(priv);
+	if (ret) {
+		dev_err(&dev->dev, "Unable to register virtio_net\n");
+		goto out_free_host_mem;
+	}
+
+	/* Hook up the interrupt handler */
+	ret = request_irq(dev->irq, vdev_interrupt, IRQF_SHARED, driver_name, priv);
+	if (ret) {
+		dev_err(&dev->dev, "Unable to register interrupt handler\n");
+		goto out_unregister_virtio_net;
+	}
+
+	/* Start virtio_net */
+	iowrite32(0x1, priv->immr + IMR1_OFFSET);
+	iowrite32(0x2, priv->immr + IDR_OFFSET);
+
+	return 0;
+
+out_unregister_virtio_net:
+	unregister_virtio_device(&priv->devices[0].vdev);
+out_free_host_mem:
+	dma_free_coherent(&dev->dev, VOP_HOST_MEM_SIZE, priv->host_mem,
+			priv->host_mem_addr);
+out_iounmap_netregs:
+	iounmap(priv->netregs);
+out_iounmap_immr:
+	iounmap(priv->immr);
+out_pci_release_regions:
+	pci_release_regions(dev);
+out_pci_disable_device:
+	pci_disable_device(dev);
+out_kfree_priv:
+	kfree(priv);
+out_return:
+	return ret;
+}
+
+static void vop_remove(struct pci_dev *dev)
+{
+	struct vop_dev *priv = pci_get_drvdata(dev);
+
+	free_irq(dev->irq, priv);
+
+	/* Reset everything */
+	iowrite32(0x1, priv->immr + IDR_OFFSET);
+
+	/* Unregister virtio_net */
+	unregister_virtio_device(&priv->devices[0].vdev);
+
+	/* Clear the host memory address from the guest's mailbox 0 */
+	iowrite32(0x0, priv->immr + IMR0_OFFSET);
+	iowrite32(0x0, priv->immr + IMR1_OFFSET);
+
+	dma_free_coherent(&dev->dev, VOP_HOST_MEM_SIZE, priv->host_mem,
+			priv->host_mem_addr);
+	iounmap(priv->netregs);
+	iounmap(priv->immr);
+	pci_release_regions(dev);
+	pci_disable_device(dev);
+	kfree(priv);
+}
+
+#define PCI_DEVID_FSL_MPC8349EMDS 0x0080
+
+/* The list of devices that this module will support */
+static struct pci_device_id vop_ids[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_FREESCALE, PCI_DEVID_FSL_MPC8349EMDS), },
+	{ 0, }
+};
+MODULE_DEVICE_TABLE(pci, vop_ids);
+
+static struct pci_driver vop_pci_driver = {
+	.name     = (char *)driver_name,
+	.id_table = vop_ids,
+	.probe    = vop_probe,
+	.remove   = vop_remove,
+};
+
+/*----------------------------------------------------------------------------*/
+/* Module Init / Exit                                                         */
+/*----------------------------------------------------------------------------*/
+
+static int __init vop_init(void)
+{
+	return pci_register_driver(&vop_pci_driver);
+}
+
+static void __exit vop_exit(void)
+{
+	pci_unregister_driver(&vop_pci_driver);
+}
+
+MODULE_AUTHOR("Ira W. Snyder <iws@ovro.caltech.edu>");
+MODULE_DESCRIPTION("Virtio-PCI-Host Test Driver");
+MODULE_LICENSE("GPL");
+
+module_init(vop_init);
+module_exit(vop_exit);
diff --git a/drivers/virtio/vop_hw.h b/drivers/virtio/vop_hw.h
new file mode 100644
index 0000000..8a19d3f
--- /dev/null
+++ b/drivers/virtio/vop_hw.h
@@ -0,0 +1,80 @@ 
+/*
+ * Register offsets for the MPC8349EMDS Message Unit from the IMMR base address
+ *
+ * Copyright (c) 2008 Ira W. Snyder <iws@ovro.caltech.edu>
+ *
+ * This file is licensed under the terms of the GNU General Public License
+ * version 2. This program is licensed "as is" without any warranty of any
+ * kind, whether express or implied.
+ */
+
+#ifndef PCINET_HW_H
+#define PCINET_HW_H
+
+#define SGPRL_OFFSET		0x0100
+#define SGPRH_OFFSET		0x0104
+
+/* mpc8349emds message unit register offsets */
+#define OMISR_OFFSET		0x8030
+#define OMIMR_OFFSET		0x8034
+#define IMR0_OFFSET		0x8050
+#define IMR1_OFFSET		0x8054
+#define OMR0_OFFSET		0x8058
+#define OMR1_OFFSET		0x805C
+#define ODR_OFFSET		0x8060
+#define IDR_OFFSET		0x8068
+#define IMISR_OFFSET		0x8080
+#define IMIMR_OFFSET		0x8084
+
+
+/* mpc8349emds pci and local access window register offsets */
+#define LAWAR0_OFFSET		0x0064
+#define LAWAR0_ENABLE		(1<<31)
+
+#define POCMR0_OFFSET		0x8410
+#define POCMR0_ENABLE		(1<<31)
+
+#define POTAR0_OFFSET		0x8400
+
+#define LAWAR1_OFFSET		0x006c
+#define LAWAR1_ENABLE		(1<<31)
+
+#define POCMR1_OFFSET		0x8428
+#define POCMR1_ENABLE		(1<<31)
+
+#define POTAR1_OFFSET		0x8418
+
+
+/* mpc8349emds dma controller register offsets */
+#define DMAMR0_OFFSET		0x8100
+#define DMASR0_OFFSET		0x8104
+#define DMASAR0_OFFSET		0x8110
+#define DMADAR0_OFFSET		0x8118
+#define DMABCR0_OFFSET		0x8120
+
+#define DMA_CHANNEL_BUSY	(1<<2)
+
+#define DMA_DIRECT_MODE_SNOOP	(1<<20)
+#define DMA_CHANNEL_MODE_DIRECT	(1<<2)
+#define DMA_CHANNEL_START	(1<<0)
+
+
+/* mpc8349emds pci and local access window register offsets */
+#define LAWAR0_OFFSET		0x0064
+#define LAWAR0_ENABLE		(1<<31)
+
+#define POCMR0_OFFSET		0x8410
+#define POCMR0_ENABLE		(1<<31)
+
+#define POTAR0_OFFSET		0x8400
+
+
+/* mpc8349emds pci and inbound window register offsets */
+#define PITAR0_OFFSET		0x8568
+#define PIWAR0_OFFSET		0x8578
+
+#define PIWAR0_ENABLED		(1<<31)
+#define PIWAR0_PREFETCH		(1<<29)
+#define PIWAR0_IWS_4K		0xb
+
+#endif /* PCINET_HW_H */