Patchwork [RFC,1/2] USB: add HCD_BOUNCE_BUFFERS host controller driver flag

login
register
mail settings
Submitter Albert Herranz
Date Feb. 3, 2010, 6:30 p.m.
Message ID <1265221840-25634-2-git-send-email-albert_herranz@yahoo.es>
Download mbox | patch
Permalink /patch/44398/
State Changes Requested
Delegated to: Grant Likely
Headers show

Comments

Albert Herranz - Feb. 3, 2010, 6:30 p.m.
The HCD_BOUNCE_BUFFERS USB host controller driver flag can be enabled
to instruct the USB stack to always bounce USB buffers to/from coherent
memory buffers _just_ before/after a host controller transmission.

This setting allows overcoming some platform-specific limitations.

For example, the Nintendo Wii video game console is a NOT_COHERENT_CACHE
platform that is unable to safely perform non-32 bit uncached writes
to RAM because the byte enables are not connected to the bus.
Thus, in that platform, "coherent" DMA buffers cannot be directly used
by the kernel code unless it guarantees that all write accesses
to said buffers are done in 32 bit chunks (which is not the case in the
USB subsystem).

To avoid this unwanted behaviour HCD_BOUNCE_BUFFERS can be enabled at
the HCD controller, causing buffer allocations to be satisfied from
normal memory and, only at the very last moment, before the actual
transfer, buffers get copied to/from their corresponding DMA coherent
bounce buffers.

Note that HCD_LOCAL_MEM doesn't help in solving this problem as in that
case buffers may be allocated from coherent memory in the first place
and thus potentially accessed in non-32 bit chuncks by USB drivers.

Signed-off-by: Albert Herranz <albert_herranz@yahoo.es>
---
 drivers/usb/core/hcd.c |  337 +++++++++++++++++++++++++++++++++++++++--------
 drivers/usb/core/hcd.h |   13 +-
 2 files changed, 286 insertions(+), 64 deletions(-)
Greg KH - Feb. 3, 2010, 7 p.m.
On Wed, Feb 03, 2010 at 07:30:39PM +0100, Albert Herranz wrote:
> +/**
> + * hcd_memcpy32_to_coherent - copy data to a bounce buffer
> + * @dst: destination dma bounce buffer
> + * @src: source buffer
> + * @len: number of bytes to copy
> + *
> + * This function copies @len bytes from @src to @dst in 32 bit chunks.
> + * The caller must guarantee that @dst length is 4 byte aligned and
> + * that @dst length is greater than or equal to @src length.
> + */
> +static void *hcd_memcpy32_to_coherent(void *dst, const void *src, size_t len)

Why isn't there platform-specific functions for this already?  It seems
a bit odd to bury them in the USB hcd core, when I'm sure that other
people need these, if they haven't already created them.

thanks,

greg k-h
Alan Stern - Feb. 3, 2010, 7:40 p.m.
On Wed, 3 Feb 2010, Albert Herranz wrote:

> The HCD_BOUNCE_BUFFERS USB host controller driver flag can be enabled
> to instruct the USB stack to always bounce USB buffers to/from coherent
> memory buffers _just_ before/after a host controller transmission.
> 
> This setting allows overcoming some platform-specific limitations.
> 
> For example, the Nintendo Wii video game console is a NOT_COHERENT_CACHE
> platform that is unable to safely perform non-32 bit uncached writes
> to RAM because the byte enables are not connected to the bus.
> Thus, in that platform, "coherent" DMA buffers cannot be directly used
> by the kernel code unless it guarantees that all write accesses
> to said buffers are done in 32 bit chunks (which is not the case in the
> USB subsystem).
> 
> To avoid this unwanted behaviour HCD_BOUNCE_BUFFERS can be enabled at
> the HCD controller, causing buffer allocations to be satisfied from
> normal memory and, only at the very last moment, before the actual
> transfer, buffers get copied to/from their corresponding DMA coherent
> bounce buffers.
> 
> Note that HCD_LOCAL_MEM doesn't help in solving this problem as in that
> case buffers may be allocated from coherent memory in the first place
> and thus potentially accessed in non-32 bit chuncks by USB drivers.

This description sounds hopelessly confused.  Maybe you're just
misusing the term "coherent".  The patch itself doesn't affect the
coherent DMA mappings anyway; it affects the streaming mappings.  Or to
put it another way, what's the justification for replacing a call to
dma_map_single() with a call to dma_alloc_coherent()?

Since the patch doesn't affect any of the coherent mappings (see for 
example the calls to dma_pool_create() in ehci-mem.c), I don't see how 
it can possibly do what you claim.

> +/**
> + * hcd_memcpy32_to_coherent - copy data to a bounce buffer
> + * @dst: destination dma bounce buffer
> + * @src: source buffer
> + * @len: number of bytes to copy
> + *
> + * This function copies @len bytes from @src to @dst in 32 bit chunks.
> + * The caller must guarantee that @dst length is 4 byte aligned and
> + * that @dst length is greater than or equal to @src length.
> + */
> +static void *hcd_memcpy32_to_coherent(void *dst, const void *src, size_t len)
> +{
> +	u32 *q = dst, *p = (void *)src;
> +	u8 *s;
> +
> +	while (len >= 4) {
> +		*q++ = *p++;
> +		len -= 4;
> +	}
> +	s = (u8 *)p;
> +	switch (len) {
> +	case 3:
> +		*q = s[0] << 24 | s[1] << 16 | s[2] << 8;
> +		break;
> +	case 2:
> +		*q = s[0] << 24 | s[1] << 16;
> +		break;
> +	case 1:
> +		*q = s[0] << 24;
> +		break;
> +	default:
> +		break;
> +	}
> +	return dst;
> +}

What happens if somebody tries to use this code on a little-endian CPU?

>  static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
>  			   gfp_t mem_flags)
>  {
> @@ -1274,53 +1441,80 @@ static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
>  	if (is_root_hub(urb->dev))
>  		return 0;
>  
> -	if (usb_endpoint_xfer_control(&urb->ep->desc)
> -	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> -		if (hcd->self.uses_dma) {
> -			urb->setup_dma = dma_map_single(
> -					hcd->self.controller,
> -					urb->setup_packet,
> -					sizeof(struct usb_ctrlrequest),
> -					DMA_TO_DEVICE);
> -			if (dma_mapping_error(hcd->self.controller,
> -						urb->setup_dma))
> -				return -EAGAIN;
> -		} else if (hcd->driver->flags & HCD_LOCAL_MEM)
> -			ret = hcd_alloc_coherent(
> -					urb->dev->bus, mem_flags,
> +	if (usb_endpoint_xfer_control(&urb->ep->desc)) {
> +		if (hcd->driver->flags & HCD_BOUNCE_BUFFERS) {
> +			if (!(urb->transfer_flags & URB_NO_SETUP_DMA_MAP))
> +				urb->setup_dma = 0;
> +			ret = hcd_bounce_to_coherent(
> +					hcd->self.controller, mem_flags,
>  					&urb->setup_dma,
>  					(void **)&urb->setup_packet,
>  					sizeof(struct usb_ctrlrequest),
>  					DMA_TO_DEVICE);
> +		} else if (!(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> +			if (hcd->self.uses_dma) {
> +				urb->setup_dma = dma_map_single(
> +						hcd->self.controller,
> +						urb->setup_packet,
> +						sizeof(struct usb_ctrlrequest),
> +						DMA_TO_DEVICE);
> +				if (dma_mapping_error(hcd->self.controller,
> +						      urb->setup_dma))
> +					return -EAGAIN;
> +			} else if (hcd->driver->flags & HCD_LOCAL_MEM)
> +				ret = hcd_alloc_coherent(
> +						urb->dev->bus, mem_flags,
> +						&urb->setup_dma,
> +						(void **)&urb->setup_packet,
> +						sizeof(struct usb_ctrlrequest),
> +						DMA_TO_DEVICE);
> +		}
>  	}
>  
>  	dir = usb_urb_dir_in(urb) ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
> -	if (ret == 0 && urb->transfer_buffer_length != 0
> -	    && !(urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP)) {
> -		if (hcd->self.uses_dma) {
> -			urb->transfer_dma = dma_map_single (
> -					hcd->self.controller,
> -					urb->transfer_buffer,
> -					urb->transfer_buffer_length,
> -					dir);
> -			if (dma_mapping_error(hcd->self.controller,
> -						urb->transfer_dma))
> -				return -EAGAIN;
> -		} else if (hcd->driver->flags & HCD_LOCAL_MEM) {
> -			ret = hcd_alloc_coherent(
> -					urb->dev->bus, mem_flags,
> +	if (ret == 0 && urb->transfer_buffer_length != 0) {
> +		if (hcd->driver->flags & HCD_BOUNCE_BUFFERS) {
> +			if (!(urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP))
> +				urb->transfer_dma = 0;
> +			ret = hcd_bounce_to_coherent(
> +					hcd->self.controller, mem_flags,
>  					&urb->transfer_dma,
>  					&urb->transfer_buffer,
>  					urb->transfer_buffer_length,
>  					dir);
>  
> -			if (ret && usb_endpoint_xfer_control(&urb->ep->desc)
> -			    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP))
> -				hcd_free_coherent(urb->dev->bus,
> -					&urb->setup_dma,
> -					(void **)&urb->setup_packet,
> -					sizeof(struct usb_ctrlrequest),
> -					DMA_TO_DEVICE);
> +			if (ret && usb_endpoint_xfer_control(&urb->ep->desc))
> +				hcd_bounce_from_coherent(hcd->self.controller,
> +						&urb->setup_dma,
> +						(void **)&urb->setup_packet,
> +						sizeof(struct usb_ctrlrequest),
> +						DMA_TO_DEVICE);
> +		} else if (!(urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP)) {
> +			if (hcd->self.uses_dma) {
> +				urb->transfer_dma = dma_map_single(
> +						hcd->self.controller,
> +						urb->transfer_buffer,
> +						urb->transfer_buffer_length,
> +						dir);
> +				if (dma_mapping_error(hcd->self.controller,
> +						      urb->transfer_dma))
> +					return -EAGAIN;
> +			} else if (hcd->driver->flags & HCD_LOCAL_MEM) {
> +				ret = hcd_alloc_coherent(
> +						urb->dev->bus, mem_flags,
> +						&urb->transfer_dma,
> +						&urb->transfer_buffer,
> +						urb->transfer_buffer_length,
> +						dir);
> +
> +				if (ret &&
> +				    usb_endpoint_xfer_control(&urb->ep->desc))
> +					hcd_free_coherent(urb->dev->bus,
> +						&urb->setup_dma,
> +						(void **)&urb->setup_packet,
> +						sizeof(struct usb_ctrlrequest),
> +						DMA_TO_DEVICE);
> +			}
>  		}
>  	}
>  	return ret;

It seems that every time somebody comes up with a new kind of 
memory-access restriction, this function grows by a factor of 2.  After 
a few more iterations it will be larger than the rest of the kernel!

There must be a better way to structure the requirements here.

Alan Stern
Albert Herranz - Feb. 4, 2010, 6:23 p.m.
Hi Alan,

Alan Stern wrote:
> This description sounds hopelessly confused.  Maybe you're just
> misusing the term "coherent".  The patch itself doesn't affect the
> coherent DMA mappings anyway; it affects the streaming mappings.  Or to
> put it another way, what's the justification for replacing a call to
> dma_map_single() with a call to dma_alloc_coherent()?
> 
> Since the patch doesn't affect any of the coherent mappings (see for 
> example the calls to dma_pool_create() in ehci-mem.c), I don't see how 
> it can possibly do what you claim.
> 

Thanks for your comments. Let's try to hopefully clarify this a bit.

I've used the term "coherent" as described in Documentation/DMA-API.txt (aka "consistent" as used in PCI-related functions).
I've tried to describe first the limitations of the platform that I'm working on. Basically, one of the annoying things of that platform is that writes to uncached memory (as used in "coherent" memory) can only be reliably performed in 32-bit accesses.

The USB subsystem ends up using "coherent" memory for buffers and/or other structures in different ways.

The "coherent" memory allocated in dma_pool_create() in ehci-mem.c that you report is not a problem at all because it is always accessed in 32-bit chunks (it hasn't been always like that but since commit 3807e26d69b9ad3864fe03224ebebc9610d5802e "USB: EHCI: split ehci_qh into hw and sw parts" this got addressed as a side effect, so I didn't need to post another patch for that).

Other possible interactions with "coherent" memory are those involving buffers used in USB transactions, which may be allocated via the USB subsystem (at usb_buffer_alloc() or when bounced via hcd_alloc_coherent()) or which may come already allocated and ready for use (URB_NO_{SETUP,TRANSFER}_DMA_MAP).

The patch, as posted, allocates normal memory for USB buffers _within_ the USB subsystem and invariably bounces all buffers to new "coherent" buffers.
So, basically, what the patch claims (avoid 32-bit writes for "coherent" memory within the USB subsystem) is "done" (hey, it actually works ;-).

But I think you have raised valid points here :)

If the "coherent" memory is already allocated and passed (as already dma-mapped) to the USB subsystem then there is no gain in bouncing the buffer:
- if a non-32 bit write was done to that "coherent" memory the damage is already done
- if the "coherent" memory was written always in 32-bit accesses then we can just safely use it
So bouncing here should be avoided as it is unneeded.

On the other hand, we can control USB buffers managed by the USB subsystem itself.
That's what the patch "does". It avoids access restrictions to USB buffers by allocating them from normal memory (leaving USB drivers free to access those buffers in whatever bus width they need, as they do today) ... and bouncing them.
The thing here is that it makes no sense to bounce them to "coherent" memory if they can be dma-mapped directly (as you point in your dma_map_single-vs-dma_alloc_coherent comment).

So... that's what RFCs are for :)
I'll take a look again at the patch.

>> +/**
>> + * hcd_memcpy32_to_coherent - copy data to a bounce buffer
>> + * @dst: destination dma bounce buffer
>> + * @src: source buffer
>> + * @len: number of bytes to copy
>> + *
>> + * This function copies @len bytes from @src to @dst in 32 bit chunks.
>> + * The caller must guarantee that @dst length is 4 byte aligned and
>> + * that @dst length is greater than or equal to @src length.
>> + */
>> +static void *hcd_memcpy32_to_coherent(void *dst, const void *src, size_t len)
>> +{
>> +	u32 *q = dst, *p = (void *)src;
>> +	u8 *s;
>> +
>> +	while (len >= 4) {
>> +		*q++ = *p++;
>> +		len -= 4;
>> +	}
>> +	s = (u8 *)p;
>> +	switch (len) {
>> +	case 3:
>> +		*q = s[0] << 24 | s[1] << 16 | s[2] << 8;
>> +		break;
>> +	case 2:
>> +		*q = s[0] << 24 | s[1] << 16;
>> +		break;
>> +	case 1:
>> +		*q = s[0] << 24;
>> +		break;
>> +	default:
>> +		break;
>> +	}
>> +	return dst;
>> +}
> 
> What happens if somebody tries to use this code on a little-endian CPU?
> 

It will fail.


> It seems that every time somebody comes up with a new kind of 
> memory-access restriction, this function grows by a factor of 2.  After 
> a few more iterations it will be larger than the rest of the kernel!
> 
> There must be a better way to structure the requirements here.
> 

Hopefully I didn't miss any of your concerns and managed to explain the problem.

> Alan Stern
> 
> 

Thanks,
Albert
Alan Stern - Feb. 4, 2010, 6:58 p.m.
On Thu, 4 Feb 2010, Albert Herranz wrote:

> Hi Alan,
> 
> Alan Stern wrote:
> > This description sounds hopelessly confused.  Maybe you're just
> > misusing the term "coherent".  The patch itself doesn't affect the
> > coherent DMA mappings anyway; it affects the streaming mappings.  Or to
> > put it another way, what's the justification for replacing a call to
> > dma_map_single() with a call to dma_alloc_coherent()?
> > 
> > Since the patch doesn't affect any of the coherent mappings (see for 
> > example the calls to dma_pool_create() in ehci-mem.c), I don't see how 
> > it can possibly do what you claim.
> > 
> 
> Thanks for your comments. Let's try to hopefully clarify this a bit.
> 
> I've used the term "coherent" as described in Documentation/DMA-API.txt (aka "consistent" as used in PCI-related functions).
> I've tried to describe first the limitations of the platform that I'm working on. Basically, one of the annoying things of that platform is that writes to uncached memory (as used in "coherent" memory) can only be reliably performed in 32-bit accesses.
> 
> The USB subsystem ends up using "coherent" memory for buffers and/or other structures in different ways.
> 
> The "coherent" memory allocated in dma_pool_create() in ehci-mem.c that you report is not a problem at all because it is always accessed in 32-bit chunks (it hasn't been always like that but since commit 3807e26d69b9ad3864fe03224ebebc9610d5802e "USB: EHCI: split ehci_qh into hw and sw parts" this got addressed as a side effect, so I didn't need to post another patch for that).

On a 64-bit processor, some of the accesses will be 64 bits wide
instead of 32.  Does that matter for your purposes?

What about ohci-hcd and uhci-hcd?  They both use non-32-bit accesses to 
structures in coherent memory.

> Other possible interactions with "coherent" memory are those involving buffers used in USB transactions, which may be allocated via the USB subsystem (at usb_buffer_alloc() or when bounced via hcd_alloc_coherent()) or which may come already allocated and ready for use (URB_NO_{SETUP,TRANSFER}_DMA_MAP).

Ah yes, quite correct.  And this indicates that you need to concentrate
on usb_buffer_alloc().  On your system (or rather, whenever the
HCD_NO_COHERENT_MEM flag is set) it should allocate normal memory and
set the DMA pointer to NULL.

Then map_urb_for_dma() should check the urb->setup_dma and
urb->transfer_dma pointers; if a pointer is NULL then the
corresponding urb->URB_NO_SETUP_DMA_MAP or urb->NO_TRANSFER_DMA_MAP
flag should be ignored (and probably should be cleared so as to 
avoid confusing unmap_urb_for_dma()).

> The patch, as posted, allocates normal memory for USB buffers _within_ the USB subsystem and invariably bounces all buffers to new "coherent" buffers.
> So, basically, what the patch claims (avoid 32-bit writes for "coherent" memory within the USB subsystem) is "done" (hey, it actually works ;-).
> 
> But I think you have raised valid points here :)
> 
> If the "coherent" memory is already allocated and passed (as already dma-mapped) to the USB subsystem then there is no gain in bouncing the buffer:
> - if a non-32 bit write was done to that "coherent" memory the damage is already done
> - if the "coherent" memory was written always in 32-bit accesses then we can just safely use it
> So bouncing here should be avoided as it is unneeded.
> 
> On the other hand, we can control USB buffers managed by the USB subsystem itself.
> That's what the patch "does". It avoids access restrictions to USB buffers by allocating them from normal memory (leaving USB drivers free to access those buffers in whatever bus width they need, as they do today) ... and bouncing them.
> The thing here is that it makes no sense to bounce them to "coherent" memory if they can be dma-mapped directly (as you point in your dma_map_single-vs-dma_alloc_coherent comment).
> 
> So... that's what RFCs are for :)

If you do it as described above then the buffers you're worried about
won't be allocated in coherent memory to begin with, so no problems 
will arise.

Alan Stern
Albert Herranz - Feb. 7, 2010, 6:10 p.m.
Alan Stern wrote:
>On a 64-bit processor, some of the accesses will be 64 bits wide
>instead of 32.  Does that matter for your purposes?
>

The wii uses a 32-bit processor, so this is safe in this case.

>What about ohci-hcd and uhci-hcd?  They both use non-32-bit accesses to 
>structures in coherent memory.
>

The wii has no uhci, but has 2 ohci controllers.
For ohci we need a similar approach as done for ehci.

>If you do it as described above then the buffers you're worried about
>won't be allocated in coherent memory to begin with, so no problems 
>will arise.

It turns out that we have more limitations.
The wii has 2 discontiguous memory areas (usually called MEM1 and MEM2). I have checked that the ehci controller doesn't work properly when performing dma to buffers allocated in MEM1 (it corrupts part of the data) but has no problems if the buffers sit within MEM2.
So usb buffers will need to be bounced anyway if they are part of MEM1.

This worked in the original patch as buffers were always bounced to MEM2 buffers. Sigh.

>Alan Stern
>

Thanks,
Albert

PS: Your reply didn't get to me. I looked at the ML and found it (I'm not subscribed). Sorry for the late answer.
Alan Stern - Feb. 7, 2010, 8:26 p.m.
On Sun, 7 Feb 2010, Albert Herranz wrote:

> The wii has no uhci, but has 2 ohci controllers.
> For ohci we need a similar approach as done for ehci.

So you'll need to write a patch splitting up the OHCI data structures 
in the same way the EHCI qh was split up.

> >If you do it as described above then the buffers you're worried about
> >won't be allocated in coherent memory to begin with, so no problems 
> >will arise.
> 
> It turns out that we have more limitations.
> The wii has 2 discontiguous memory areas (usually called MEM1 and MEM2). I have checked that the ehci controller doesn't work properly when performing dma to buffers allocated in MEM1 (it corrupts part of the data) but has no problems if the buffers sit within MEM2.
> So usb buffers will need to be bounced anyway if they are part of MEM1.

This sounds like the sort of restriction that dma_map_single() should 
be capable of handling.

Alan Stern
Albert Herranz - Feb. 7, 2010, 11:38 p.m.
Alan Stern wrote:
> On Sun, 7 Feb 2010, Albert Herranz wrote:
> 
>> The wii has no uhci, but has 2 ohci controllers.
>> For ohci we need a similar approach as done for ehci.
> 
> So you'll need to write a patch splitting up the OHCI data structures 
> in the same way the EHCI qh was split up.
> 

Yes.

>> It turns out that we have more limitations.
>> The wii has 2 discontiguous memory areas (usually called MEM1 and MEM2). I have checked that the ehci controller doesn't work properly when performing dma to buffers allocated in MEM1 (it corrupts part of the data) but has no problems if the buffers sit within MEM2.
>> So usb buffers will need to be bounced anyway if they are part of MEM1.
> 
> This sounds like the sort of restriction that dma_map_single() should 
> be capable of handling.
> 

On powerpc you can have per-device specific dma ops.
I'll work on that direction and create a special dma ops set for devices which need their dma buffers on mem2, and then use those for ehci-hlwd.

> Alan Stern
> 

Thanks,
Albert

Patch

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 80995ef..befca85 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -1260,6 +1260,173 @@  static void hcd_free_coherent(struct usb_bus *bus, dma_addr_t *dma_handle,
 	*dma_handle = 0;
 }
 
+/*
+ * The HCD_BOUNCE_BUFFERS USB host controller driver flag can be enabled
+ * to instruct the USB stack to always bounce USB buffers to/from coherent
+ * memory buffers _just_ before/after a host controller transmission.
+ *
+ * This setting allows overcoming some platform-specific limitations.
+ *
+ * For example, the Nintendo Wii video game console is a NOT_COHERENT_CACHE
+ * platform that is unable to safely perform non-32 bit uncached writes
+ * to RAM because the byte enables are not connected to the bus.
+ * Thus, in that platform, "coherent" DMA buffers cannot be directly used
+ * by the kernel code unless it guarantees that all write accesses
+ * to said buffers are done in 32 bit chunks (which is not the case in the
+ * USB subsystem).
+ *
+ * To avoid this unwanted behaviour HCD_BOUNCE_BUFFERS can be enabled at
+ * the HCD controller, causing buffer allocations to be satisfied from
+ * normal memory and, only at the very last moment, before the actual
+ * transfer, buffers get copied to/from their corresponding DMA coherent
+ * bounce buffers.
+ *
+ * Note that HCD_LOCAL_MEM doesn't help in solving this problem as in that
+ * case buffers may be allocated from coherent memory in the first place
+ * and thus potentially accessed in non-32 bit chuncks by USB drivers.
+ *
+ */
+
+#define HCD_BOUNCE_ALIGN	4
+
+#define hcd_align_up(addr, size)	(((addr)+((size)-1))&(~((size)-1)))
+
+struct hcd_coherent_buffer_ctx {
+	unsigned char *vaddr;
+	dma_addr_t dma_handle;
+};
+
+/**
+ * hcd_memcpy32_to_coherent - copy data to a bounce buffer
+ * @dst: destination dma bounce buffer
+ * @src: source buffer
+ * @len: number of bytes to copy
+ *
+ * This function copies @len bytes from @src to @dst in 32 bit chunks.
+ * The caller must guarantee that @dst length is 4 byte aligned and
+ * that @dst length is greater than or equal to @src length.
+ */
+static void *hcd_memcpy32_to_coherent(void *dst, const void *src, size_t len)
+{
+	u32 *q = dst, *p = (void *)src;
+	u8 *s;
+
+	while (len >= 4) {
+		*q++ = *p++;
+		len -= 4;
+	}
+	s = (u8 *)p;
+	switch (len) {
+	case 3:
+		*q = s[0] << 24 | s[1] << 16 | s[2] << 8;
+		break;
+	case 2:
+		*q = s[0] << 24 | s[1] << 16;
+		break;
+	case 1:
+		*q = s[0] << 24;
+		break;
+	default:
+		break;
+	}
+	return dst;
+}
+
+/**
+ * hcd_memcpy32_from_coherent - copy data from a bounce buffer
+ * @dst: destination buffer
+ * @src: source dma bounce buffer
+ * @len: number of bytes to copy
+ *
+ * This function copies @len bytes from @src to @dst in 32 bit chunks.
+ * The caller must guarantee that @src length is 4 byte aligned and
+ * that @src length is greater than or equal to @dst length.
+ */
+static void *hcd_memcpy32_from_coherent(void *dst, const void *src, size_t len)
+{
+	u32 *q = dst, *p = (void *)src;
+	u32 v;
+	u8 *d;
+
+	while (len >= 4) {
+		*q++ = *p++;
+		len -= 4;
+	}
+	if (len) {
+		d = (u8 *)q;
+		v = p[0];
+		switch (len) {
+		case 3:
+			d[2] = (v >> 8) & 0xff;
+			/* FALL THROUGH */
+		case 2:
+			d[1] = (v >> 16) & 0xff;
+			/* FALL THROUGH */
+		case 1:
+			d[0] = (v >> 24) & 0xff;
+			break;
+		default:
+			break;
+		}
+	}
+	return dst;
+}
+
+static int hcd_bounce_to_coherent(struct device *dev,
+				  gfp_t mem_flags, dma_addr_t *dma_handle,
+				  void **vaddr_handle, size_t size,
+				  enum dma_data_direction dir)
+{
+	struct hcd_coherent_buffer_ctx ctx;
+	unsigned char *vaddr;
+	size_t up_size = hcd_align_up(size + sizeof(ctx), HCD_BOUNCE_ALIGN);
+
+	BUILD_BUG_ON(sizeof(ctx) & (HCD_BOUNCE_ALIGN-1));
+
+	if (!*vaddr_handle) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+
+	ctx.vaddr = *vaddr_handle;
+	ctx.dma_handle = *dma_handle;
+
+	vaddr = dma_alloc_coherent(dev, up_size, dma_handle, mem_flags);
+	if (!vaddr)
+		return -ENOMEM;
+
+	hcd_memcpy32_to_coherent(vaddr + up_size - sizeof(ctx), &ctx,
+				 sizeof(ctx));
+	if (dir == DMA_TO_DEVICE)
+		hcd_memcpy32_to_coherent(vaddr, *vaddr_handle, size);
+
+	*vaddr_handle = vaddr;
+	return 0;
+}
+
+static void hcd_bounce_from_coherent(struct device *dev, dma_addr_t *dma_handle,
+				     void **vaddr_handle, size_t size,
+				     enum dma_data_direction dir)
+{
+	struct hcd_coherent_buffer_ctx ctx;
+	unsigned char *vaddr = *vaddr_handle;
+	size_t up_size = hcd_align_up(size + sizeof(ctx), HCD_BOUNCE_ALIGN);
+
+	if (!*vaddr_handle)
+		return;
+
+	hcd_memcpy32_from_coherent(&ctx, vaddr + up_size - sizeof(ctx),
+				   sizeof(ctx));
+	BUG_ON(!ctx.vaddr);
+	if (dir == DMA_FROM_DEVICE)
+		hcd_memcpy32_from_coherent(ctx.vaddr, vaddr, size);
+
+	dma_free_coherent(dev, up_size, vaddr, *dma_handle);
+
+	*vaddr_handle = ctx.vaddr;
+	*dma_handle = ctx.dma_handle;
+}
+
 static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
 			   gfp_t mem_flags)
 {
@@ -1274,53 +1441,80 @@  static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
 	if (is_root_hub(urb->dev))
 		return 0;
 
-	if (usb_endpoint_xfer_control(&urb->ep->desc)
-	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
-		if (hcd->self.uses_dma) {
-			urb->setup_dma = dma_map_single(
-					hcd->self.controller,
-					urb->setup_packet,
-					sizeof(struct usb_ctrlrequest),
-					DMA_TO_DEVICE);
-			if (dma_mapping_error(hcd->self.controller,
-						urb->setup_dma))
-				return -EAGAIN;
-		} else if (hcd->driver->flags & HCD_LOCAL_MEM)
-			ret = hcd_alloc_coherent(
-					urb->dev->bus, mem_flags,
+	if (usb_endpoint_xfer_control(&urb->ep->desc)) {
+		if (hcd->driver->flags & HCD_BOUNCE_BUFFERS) {
+			if (!(urb->transfer_flags & URB_NO_SETUP_DMA_MAP))
+				urb->setup_dma = 0;
+			ret = hcd_bounce_to_coherent(
+					hcd->self.controller, mem_flags,
 					&urb->setup_dma,
 					(void **)&urb->setup_packet,
 					sizeof(struct usb_ctrlrequest),
 					DMA_TO_DEVICE);
+		} else if (!(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
+			if (hcd->self.uses_dma) {
+				urb->setup_dma = dma_map_single(
+						hcd->self.controller,
+						urb->setup_packet,
+						sizeof(struct usb_ctrlrequest),
+						DMA_TO_DEVICE);
+				if (dma_mapping_error(hcd->self.controller,
+						      urb->setup_dma))
+					return -EAGAIN;
+			} else if (hcd->driver->flags & HCD_LOCAL_MEM)
+				ret = hcd_alloc_coherent(
+						urb->dev->bus, mem_flags,
+						&urb->setup_dma,
+						(void **)&urb->setup_packet,
+						sizeof(struct usb_ctrlrequest),
+						DMA_TO_DEVICE);
+		}
 	}
 
 	dir = usb_urb_dir_in(urb) ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
-	if (ret == 0 && urb->transfer_buffer_length != 0
-	    && !(urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP)) {
-		if (hcd->self.uses_dma) {
-			urb->transfer_dma = dma_map_single (
-					hcd->self.controller,
-					urb->transfer_buffer,
-					urb->transfer_buffer_length,
-					dir);
-			if (dma_mapping_error(hcd->self.controller,
-						urb->transfer_dma))
-				return -EAGAIN;
-		} else if (hcd->driver->flags & HCD_LOCAL_MEM) {
-			ret = hcd_alloc_coherent(
-					urb->dev->bus, mem_flags,
+	if (ret == 0 && urb->transfer_buffer_length != 0) {
+		if (hcd->driver->flags & HCD_BOUNCE_BUFFERS) {
+			if (!(urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP))
+				urb->transfer_dma = 0;
+			ret = hcd_bounce_to_coherent(
+					hcd->self.controller, mem_flags,
 					&urb->transfer_dma,
 					&urb->transfer_buffer,
 					urb->transfer_buffer_length,
 					dir);
 
-			if (ret && usb_endpoint_xfer_control(&urb->ep->desc)
-			    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP))
-				hcd_free_coherent(urb->dev->bus,
-					&urb->setup_dma,
-					(void **)&urb->setup_packet,
-					sizeof(struct usb_ctrlrequest),
-					DMA_TO_DEVICE);
+			if (ret && usb_endpoint_xfer_control(&urb->ep->desc))
+				hcd_bounce_from_coherent(hcd->self.controller,
+						&urb->setup_dma,
+						(void **)&urb->setup_packet,
+						sizeof(struct usb_ctrlrequest),
+						DMA_TO_DEVICE);
+		} else if (!(urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP)) {
+			if (hcd->self.uses_dma) {
+				urb->transfer_dma = dma_map_single(
+						hcd->self.controller,
+						urb->transfer_buffer,
+						urb->transfer_buffer_length,
+						dir);
+				if (dma_mapping_error(hcd->self.controller,
+						      urb->transfer_dma))
+					return -EAGAIN;
+			} else if (hcd->driver->flags & HCD_LOCAL_MEM) {
+				ret = hcd_alloc_coherent(
+						urb->dev->bus, mem_flags,
+						&urb->transfer_dma,
+						&urb->transfer_buffer,
+						urb->transfer_buffer_length,
+						dir);
+
+				if (ret &&
+				    usb_endpoint_xfer_control(&urb->ep->desc))
+					hcd_free_coherent(urb->dev->bus,
+						&urb->setup_dma,
+						(void **)&urb->setup_packet,
+						sizeof(struct usb_ctrlrequest),
+						DMA_TO_DEVICE);
+			}
 		}
 	}
 	return ret;
@@ -1333,32 +1527,49 @@  static void unmap_urb_for_dma(struct usb_hcd *hcd, struct urb *urb)
 	if (is_root_hub(urb->dev))
 		return;
 
-	if (usb_endpoint_xfer_control(&urb->ep->desc)
-	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
-		if (hcd->self.uses_dma)
-			dma_unmap_single(hcd->self.controller, urb->setup_dma,
-					sizeof(struct usb_ctrlrequest),
-					DMA_TO_DEVICE);
-		else if (hcd->driver->flags & HCD_LOCAL_MEM)
-			hcd_free_coherent(urb->dev->bus, &urb->setup_dma,
-					(void **)&urb->setup_packet,
-					sizeof(struct usb_ctrlrequest),
-					DMA_TO_DEVICE);
+	if (usb_endpoint_xfer_control(&urb->ep->desc)) {
+		if (hcd->driver->flags & HCD_BOUNCE_BUFFERS)
+			hcd_bounce_from_coherent(hcd->self.controller,
+						&urb->setup_dma,
+						(void **)&urb->setup_packet,
+						sizeof(struct usb_ctrlrequest),
+						DMA_TO_DEVICE);
+		else if (!(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
+			if (hcd->self.uses_dma)
+				dma_unmap_single(hcd->self.controller,
+						 urb->setup_dma,
+						 sizeof(struct usb_ctrlrequest),
+						 DMA_TO_DEVICE);
+			else if (hcd->driver->flags & HCD_LOCAL_MEM)
+				hcd_free_coherent(urb->dev->bus,
+						&urb->setup_dma,
+						(void **)&urb->setup_packet,
+						sizeof(struct usb_ctrlrequest),
+						DMA_TO_DEVICE);
+		}
 	}
 
 	dir = usb_urb_dir_in(urb) ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
-	if (urb->transfer_buffer_length != 0
-	    && !(urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP)) {
-		if (hcd->self.uses_dma)
-			dma_unmap_single(hcd->self.controller,
-					urb->transfer_dma,
-					urb->transfer_buffer_length,
-					dir);
-		else if (hcd->driver->flags & HCD_LOCAL_MEM)
-			hcd_free_coherent(urb->dev->bus, &urb->transfer_dma,
-					&urb->transfer_buffer,
-					urb->transfer_buffer_length,
-					dir);
+	if (urb->transfer_buffer_length != 0) {
+		if (hcd->driver->flags & HCD_BOUNCE_BUFFERS)
+			hcd_bounce_from_coherent(hcd->self.controller,
+						&urb->transfer_dma,
+						&urb->transfer_buffer,
+						urb->transfer_buffer_length,
+						dir);
+		else if (!(urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP)) {
+			if (hcd->self.uses_dma)
+				dma_unmap_single(hcd->self.controller,
+						urb->transfer_dma,
+						urb->transfer_buffer_length,
+						dir);
+			else if (hcd->driver->flags & HCD_LOCAL_MEM)
+				hcd_free_coherent(urb->dev->bus,
+						&urb->transfer_dma,
+						&urb->transfer_buffer,
+						urb->transfer_buffer_length,
+						dir);
+		}
 	}
 }
 
@@ -2022,6 +2233,16 @@  struct usb_hcd *usb_create_hcd (const struct hc_driver *driver,
 	dev_set_drvdata(dev, hcd);
 	kref_init(&hcd->kref);
 
+	if (driver->flags & HCD_BOUNCE_BUFFERS) {
+		/*
+		 * This will force allocation of HCD buffers in normal kernel
+		 * memory via kmalloc().
+		 * Note that this is fine as we always bounce them later
+		 * to coherent memory.
+		 */
+		dev->dma_mask = NULL;
+	}
+
 	usb_bus_init(&hcd->self);
 	hcd->self.controller = dev;
 	hcd->self.bus_name = bus_name;
diff --git a/drivers/usb/core/hcd.h b/drivers/usb/core/hcd.h
index bbe2b92..e0b8179 100644
--- a/drivers/usb/core/hcd.h
+++ b/drivers/usb/core/hcd.h
@@ -183,12 +183,13 @@  struct hc_driver {
 	irqreturn_t	(*irq) (struct usb_hcd *hcd);
 
 	int	flags;
-#define	HCD_MEMORY	0x0001		/* HC regs use memory (else I/O) */
-#define	HCD_LOCAL_MEM	0x0002		/* HC needs local memory */
-#define	HCD_USB11	0x0010		/* USB 1.1 */
-#define	HCD_USB2	0x0020		/* USB 2.0 */
-#define	HCD_USB3	0x0040		/* USB 3.0 */
-#define	HCD_MASK	0x0070
+#define	HCD_MEMORY		0x0001	/* HC regs use memory (else I/O) */
+#define	HCD_LOCAL_MEM		0x0002	/* HC needs local memory */
+#define HCD_BOUNCE_BUFFERS	0x0004	/* HC needs to bounce buffers */
+#define	HCD_USB11		0x0010	/* USB 1.1 */
+#define	HCD_USB2		0x0020	/* USB 2.0 */
+#define	HCD_USB3		0x0040	/* USB 3.0 */
+#define	HCD_MASK		0x0070
 
 	/* called to init HCD and root hub */
 	int	(*reset) (struct usb_hcd *hcd);