diff mbox

[04/12] usb: Add support for input pipelining

Message ID 1349682696-3046-5-git-send-email-hdegoede@redhat.com
State New
Headers show

Commit Message

Hans de Goede Oct. 8, 2012, 7:51 a.m. UTC
Currently we effectively only do pipelining for output endpoints, since
controllers will stop filling the queue for a packet with stop the queue
on a short read semantics enabled.

This causes large input transfers to get split into multiple packets, which
comes with a serious performance penalty.

This patch makes it possible to do pipelining for input endpoints, by
re-combining packets which belong to 1 transfer at the guest device-driver
level into 1 large packet before sending them to the device, this is mostly
useful as a significant performance improvement for redirection of real USB
devices.

This patch will combine packets together into 1 until a transfer terminating
packet is encountered. A terminating packet is a packet which meets one of
the following conditions:
1) The packet size is *not* a multiple of the endpoint max packet size
2) The packet does *not* have its short-not-ok flag set
3) The packet has its interrupt-on-complete flag set

The short-not-ok flag of the combined packet is that of the terminating packet.
Multiple combined packets may be submitted to the device, if the combined
packets do not have their short-not-ok flag set, enabling true pipelining.

If a combined packet does have its short-not-ok flag set the queue will
wait with submitting further packets to the device until that packet has
completed.

Once enabled in the ehci driver, this improves the speed (MB/s) of a Linux
guest reading from a USB mass storage device by a factor of 1.2 - 1.5.

And the main reason why I started working on this, when reading from a pl2303
USB<->serial converter, it combines the previous 4 packets submitted per
device-driver level read into 1 big read, reducing the number of packets / sec
by a factor 4, and it allows to have multiple reads outstanding. This allows
for much better latency tolerance without the pl2303's internal buffer
overflowing (which was happening at 115200 bps, without serial flow control).

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
---
 hw/usb.h         |  24 ++++++++
 hw/usb/core.c    | 176 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 hw/usb/dev-hub.c |   9 +++
 3 files changed, 205 insertions(+), 4 deletions(-)

Comments

Gerd Hoffmann Oct. 8, 2012, 12:50 p.m. UTC | #1
On 10/08/12 09:51, Hans de Goede wrote:
> Currently we effectively only do pipelining for output endpoints, since
> controllers will stop filling the queue for a packet with stop the queue
> on a short read semantics enabled.

> This causes large input transfers to get split into multiple packets, which
> comes with a serious performance penalty.
> 
> This patch makes it possible to do pipelining for input endpoints, by
> re-combining packets which belong to 1 transfer at the guest device-driver
> level into 1 large packet before sending them to the device, this is mostly
> useful as a significant performance improvement for redirection of real USB
> devices.

This splitting and recombining is too much magic in the core layer for
my taste.

I think the core work flow should stay as-is.  Instead we should add
more meta data to USBPacket (stop-queue-on-short-read bit, maybe
interrupt-on-complete too).  Continue queuing packets even with
stop-queue-on-short-read set.  Maybe add a callback to notify USBDevice
when the host controller is done filling the queue, so USBDevices can
process all queued packets in one go (bottom half would work too
though).  Then expect USBDevices to get things right based on the
USBPacket flags passed on (i.e. have host-linux use
USBFS_URB_BULK_CONTINUATION as needed).

This way usb-host and usbredir will see what is really going on instead
of having the usb core magically fixing up stuff for them.  I think this
will serve us better long-term.  Maybe you are seeing this "data *and*
stall" issue (patch 1) exactly because of all this combining & splitting
logic?

cheers,
  Gerd
Hans de Goede Oct. 8, 2012, 2:50 p.m. UTC | #2
Hi,

On 10/08/2012 02:50 PM, Gerd Hoffmann wrote:
> On 10/08/12 09:51, Hans de Goede wrote:
>> Currently we effectively only do pipelining for output endpoints, since
>> controllers will stop filling the queue for a packet with stop the queue
>> on a short read semantics enabled.
>
>> This causes large input transfers to get split into multiple packets, which
>> comes with a serious performance penalty.
>>
>> This patch makes it possible to do pipelining for input endpoints, by
>> re-combining packets which belong to 1 transfer at the guest device-driver
>> level into 1 large packet before sending them to the device, this is mostly
>> useful as a significant performance improvement for redirection of real USB
>> devices.
>
> This splitting and recombining is too much magic in the core layer for
> my taste.

I disagree, let me try to explain below.

> I think the core work flow should stay as-is.  Instead we should add
> more meta data to USBPacket (stop-queue-on-short-read bit, maybe
> interrupt-on-complete too).

We agree on the meta data :)

> Continue queuing packets even with
> stop-queue-on-short-read set.

This can only happen if we combine them first. We *must* ensure that
we do not read additional data after a short completion and add
that to a packet belonging to the previous short ended transfer, or
if we end up cancelling after the short completion, throwing away that
data, otherwise the guest and device can get out of sync. Which is exactly
what was happening with the uhci code, before I added the SPD tests there.

Another important reason for combining is that in most cases, we only
see short-not-ok flags because the device-driver in the guest has asked
for a single bulk in transfer which does not fit in a single td, so
it gets split, and to make sure that if it ends up short, the problem
of data from another part of the protocol (ie usb mass storage bulk only
transport data versus sense) ending up in tds belonging to the previous
transfer does not happen, the short-not-ok flag gets set on all
packets but the last one. Notice btw that the xhci controller has no
notion of short-not-ok, instead it simply allows for very large
bulk transfers to be submitted in one "atomic" entity.

So far to not cause this mixup of to which protocol phase in-data belongs,
we do not queue up packets to the device past a short-not-ok boundary,
but this means for serial <-> usb converters that instead of doing:

submit 1st 256 byte bulk in
submit 2nd 256 byte bulk in
<wait for completions>
<after a single transfer completion>
resubmit 256 byte bulk in

We do:
submit 1st 64/256 byte bulk in
<wait for completion>
<on full 64 byte completion>
submit 2nd 64/256 byte bulk in
<etc>

Meaning that for receiving 256 bytes we pay 4 times the round trip
time instead of 1 time, and since we do not submit further packets
after a short-not-ok packet, that we don't have a 2nd 256 byte
transfer queued up ready to receive more data while the 1st
one is being processed. Making reading from such a device highly
unreliable if there is any sort of load on the host.

The whole combining "magic" results in a single 256 byte packet, which
does not have its short-not-ok flag set, allowing true pipelining and
having another in transfer queued up to keep the device busy will the
1st one is being processed.

Re-combining the packets into a single transfer, is the *only* way to
ensure that a short completion cannot end up reading some data from
the next protocol phase.

One possible alternative to re-combining would be to pass the
short-not-ok flag along the chain all the way down into usbfs, but
that won't work since the USBFS_URB_SHORT_NOT_OK flag does not
work with XHCI controllers! There has even been discussion about
outright refusing packet submissions with that flag set in newer
Linux kernels, but that would cause regressions for apps which set
it when they don't really need it (and most don't). So since more
and more machines are using XHCI controllers, passing the
short-not-ok flag along is not a valid option.

So lets try to split this discussion into multiple questions:

1) Can we agree that re-combining input packets back into their original
transfers as done at the guest device driver level, is useful and given
the serial <-> usb converters case, even necessary to have ?


2) If we agree on 1), what is the right place to do them ?

To be able to properly combine packets into one larger transfer, an
uncombine them again later, some coordination is needed between the
entity doing the combining / uncombining and the hcd code:

2a) The combining entity needs to know when the hcd is done with
queuing up packets for an ep.

2b) When uncombining it is possible that less packets are actually
used / filled with data, then were combined into one transfer. When
this happens the hcd code will stop processing the queue after the
short packet, and wait for the guest to restart it. When the guest
restarts the queue it will be modified and the un-used packets will
be gone from it. Since the hcd code keeps its own queue, when
when uncombining the hcd code needs to be told to forget about
the unused packets.

Note that currently we have the same problem on a halt of the ep
queue already, and currently we expect the hcd to completely empty
the queue on a stall. I'm planning to change this code path over
to also let the core tell the hcd to forget about the packets it
had queued beyond the stall, as this seems something to do at the
core level rather then reproducing it in every hcd emulation
separately.


Given the necessary close interaction between the hcd code and
the combine / uncombine code + not wanting to reproduce the
same code in both host-linux.c and redirect.c I believe that
the usb core is the right place for this.

 > Maybe add a callback to notify USBDevice
> when the host controller is done filling the queue, so USBDevices can
> process all queued packets in one go (bottom half would work too
> though).

Changing the callback which I added for this to a bh is fine with me,
I think that both from a making clear what is going on when reading
the code, and from a having all the data hot in cache pov, the callback
is better though.

> Then expect USBDevices to get things right based on the
> USBPacket flags passed on (i.e. have host-linux use
> USBFS_URB_BULK_CONTINUATION as needed).

Note that the need for host-linux to use USBFS_URB_BULK_CONTINUATION +
SHORT_NOT_OK flags with older kernels, and to simply no longer split with
newer kernels, already exists! A single ehci td can span 20k, and we
split at 16k, so an input transfer ending short in the first 16k will
already cause the wrong thing to happen (the sense data of usb-ms gets read
as the next 4k), without input pipelining coming into play at all, and with
the xhci emulation we can already get much larger in transfers...

The reason why I disable input pipelining for host-linux for now in my
patchset is because it will make the problem worse, not because the
problem is new.

For the full details on all the magic needed to correctly handle
larger input transfers on usbdevfs with all possible different
kernel versions, please see:
https://github.com/libusbx/libusbx/commit/ede02ba91920f9be787a7f3cd006c5a4b92b5eab

Or just switch to libusb ...

> This way usb-host and usbredir will see what is really going on instead
> of having the usb core magically fixing up stuff for them.

I believe this is based on you thinking simply passing along a short-not-ok
flag all the way down into usbfs will be enough, but it definitely is not
enough, since the XHCI hardware has no notion of short-not-ok, the only
way to do input pipelinig with an XHCI attached device is to re-combine
the split packets into a single large input transfer. And the proper
place for such combining code is in the core IMHO.

> I think this
> will serve us better long-term.  Maybe you are seeing this "data *and*
> stall" issue (patch 1) exactly because of all this combining & splitting
> logic?

I'm not seeing the data and stall condition in practice at all, but while
designing all this I noticed that this can happen. a single ehci td
can be a multiple of the endpoints maxPacketSize and then the endpoint
can respond with maxPacketSize data, maxPacketSize data, stall at which
point we will have both data and a stall condition in one.

And the Linux ehci + usbfs + libusb + usbredir code all handle passing
along both data and a stall as a result for one packet / transfer,
so this is something which eventually we will need to support in
the qemu usb core too. For now I decided on the quick fix in usbredir
to cover the eventuality of this actually happening and added a
warning to see if this actually ever happens (which it seems not to
in my testing done sofar).

Regards,

Hans
Gerd Hoffmann Oct. 9, 2012, 9:21 a.m. UTC | #3
Hi,

> One possible alternative to re-combining would be to pass the
> short-not-ok flag along the chain all the way down into usbfs,

Yes, this is what I have in mind.

> but
> that won't work since the USBFS_URB_SHORT_NOT_OK flag does not
> work with XHCI controllers!

Oops.

> So lets try to split this discussion into multiple questions:
> 
> 1) Can we agree that re-combining input packets back into their original
> transfers as done at the guest device driver level, is useful and given
> the serial <-> usb converters case, even necessary to have ?

Given that the USBFS_URB_SHORT_NOT_OK doesn't work there is no way
around that I guess.

> 2) If we agree on 1), what is the right place to do them ?
> 
> To be able to properly combine packets into one larger transfer, an
> uncombine them again later, some coordination is needed between the
> entity doing the combining / uncombining and the hcd code:
> 
> 2a) The combining entity needs to know when the hcd is done with
> queuing up packets for an ep.

Yes.  We can add a USBDeviceClass method for that.

> 2b) When uncombining it is possible that less packets are actually
> used / filled with data, then were combined into one transfer. When
> this happens the hcd code will stop processing the queue after the
> short packet, and wait for the guest to restart it. When the guest
> restarts the queue it will be modified and the un-used packets will
> be gone from it. Since the hcd code keeps its own queue, when
> when uncombining the hcd code needs to be told to forget about
> the unused packets.

Yes.  We can add a new USBPortOps for that, or reuse
USBPortOps->complete with USBPacket->result == USB_RET_NOT_USED.

> Note that currently we have the same problem on a halt of the ep
> queue already, and currently we expect the hcd to completely empty
> the queue on a stall. I'm planning to change this code path over
> to also let the core tell the hcd to forget about the packets it
> had queued beyond the stall, as this seems something to do at the
> core level rather then reproducing it in every hcd emulation
> separately.

The core can and should do that for packets it owns (USBPacketState ==
USB_PACKET_QUEUED) because they are not (yet) passed to the USBDevice.

Packets owned by USBDevice (USBPacketState == USB_PACKET_ASYNC) must be
handled by the USBDevice itself.

> Given the necessary close interaction between the hcd code and
> the combine / uncombine code + not wanting to reproduce the
> same code in both host-linux.c and redirect.c I believe that
> the usb core is the right place for this.

I think the combining should happen just before submitting the transfer
to usbfs, in usb-host.  I'd like usb-host see what is really going on,
and the usb core not hiding this by magically creating jumbo packets.

Likewise I think usbredir should do the combining on the *server* side,
not in the qemu redir code.

>> Maybe add a callback to notify USBDevice
>> when the host controller is done filling the queue, so USBDevices can
>> process all queued packets in one go (bottom half would work too
>> though).
> 
> Changing the callback which I added for this to a bh is fine with me,
> I think that both from a making clear what is going on when reading
> the code, and from a having all the data hot in cache pov, the callback
> is better though.

Yes, callback is probably better, that is just a little minor detail though.

cheers,
  Gerd
Hans de Goede Oct. 9, 2012, 10:40 a.m. UTC | #4
Hi,

On 10/09/2012 11:21 AM, Gerd Hoffmann wrote:

<snip>

>> So lets try to split this discussion into multiple questions:
>>
>> 1) Can we agree that re-combining input packets back into their original
>> transfers as done at the guest device driver level, is useful and given
>> the serial <-> usb converters case, even necessary to have ?
>
> Given that the USBFS_URB_SHORT_NOT_OK doesn't work there is no way
> around that I guess.
>

Yes, short-not-ok not stopping the queue on a short packet is an
unpleasant surprise, it had all 3 of Alan Stern, Sarah Sharp and me
surprised quite a bit while working on fixing the problem some
people were seeing with redirecting some USB mass storage
devices, when those devices were plugged into an XHCI port.

>> 2) If we agree on 1), what is the right place to do them ?
>>
>> To be able to properly combine packets into one larger transfer, an
>> uncombine them again later, some coordination is needed between the
>> entity doing the combining / uncombining and the hcd code:
>>
>> 2a) The combining entity needs to know when the hcd is done with
>> queuing up packets for an ep.
>
> Yes.  We can add a USBDeviceClass method for that.
>

Ok, this assumes that the combining / uncombining gets moved down
into the usb-device level code. Which clearly has your preference,
but I don't completely agree with, so lets have that discussion first,
once we've decided what to do there, details like this will likely
sort out themselves.

>> 2b) When uncombining it is possible that less packets are actually
>> used / filled with data, then were combined into one transfer. When
>> this happens the hcd code will stop processing the queue after the
>> short packet, and wait for the guest to restart it. When the guest
>> restarts the queue it will be modified and the un-used packets will
>> be gone from it. Since the hcd code keeps its own queue, when
>> when uncombining the hcd code needs to be told to forget about
>> the unused packets.
>
> Yes.  We can add a new USBPortOps for that, or reuse
> USBPortOps->complete with USBPacket->result == USB_RET_NOT_USED.

I had the same thought about using a new result for this, thinking
more about this using a new result code is better, so I'll go go
that.

>> Note that currently we have the same problem on a halt of the ep
>> queue already, and currently we expect the hcd to completely empty
>> the queue on a stall. I'm planning to change this code path over
>> to also let the core tell the hcd to forget about the packets it
>> had queued beyond the stall, as this seems something to do at the
>> core level rather then reproducing it in every hcd emulation
>> separately.
>
> The core can and should do that for packets it owns (USBPacketState ==
> USB_PACKET_QUEUED) because they are not (yet) passed to the USBDevice.
>
> Packets owned by USBDevice (USBPacketState == USB_PACKET_ASYNC) must be
> handled by the USBDevice itself.

Getting offtopic a bit here, but this not how we currently handle
things, currently the hcd code cancels packets after a queue halt,
rather then the device-code itself. But this is another discussion,
so lets save this for later.

>> Given the necessary close interaction between the hcd code and
>> the combine / uncombine code + not wanting to reproduce the
>> same code in both host-linux.c and redirect.c I believe that
>> the usb core is the right place for this.
>
> I think the combining should happen just before submitting the transfer
> to usbfs, in usb-host.  I'd like usb-host see what is really going on,
> and the usb core not hiding this by magically creating jumbo packets.

Note that with my current code linux-host can see it is a combined packet
if it wants to already. But this all really boils down to the next bit:

> Likewise I think usbredir should do the combining on the *server* side,
> not in the qemu redir code.

I think that moving the combining to the server / usb-redir-host side is
*not* a good idea. There is nothing to be gained by doing so, and a lot to
loose:
1) The usb-redir-host has a lot less info to work with, as it does not have
access to the actual tds which lead to the packets being send. The protocol
can be extended to send what we need to do the combining now, but what if
later on it turns out we need more meta data for some corner case ...
2) Doing the combining at the usb-redir-host leads to more packets being send
and received, increasing the protocol overhead
3) Doing the combining at the usb-redir-host means that instead of a single
packet being received back, multiple will be send back, which may not all
be received in one go, causing multiple vm-exits rather then just one.
4) The combining/uncombining logic will be the same for usb-redir or
linux-host redirection, implementing this all twice and then keeping it
in sync is not a good way to spend our time.

Notice that arguments 2 and 3 could be made for combining output packets
too, but what we do their now is nice and KISS, and the possible speed
improvement is not worth the complication IMHO.

So again lets try to split the discussion in answering multiple questions:

1) I believe it is better, and would greatly prefer to, do the combining
on the qemu side rather then on the usb-redir-host side, do you agree?

2) If you agree with 1, then I assume you agree we will want to share
the combining code between host-linux.c (or host-* for that matter) and
redirect.c ?

3) If you agree with 2, then all we need is a place for the shared logic
to live, we could put it in a new file called input-pipeline.c ?

4) Since you would like to keep the core clean, I'm fine with moving the
calling of the combining-functions into the usb-redir device code
(and the same calls can then later be added to host-linux.c).

Regards,

Hans
Gerd Hoffmann Oct. 9, 2012, 1:31 p.m. UTC | #5
Hi,

>>> 2a) The combining entity needs to know when the hcd is done with
>>> queuing up packets for an ep.
>>
>> Yes.  We can add a USBDeviceClass method for that.
> 
> Ok, this assumes that the combining / uncombining gets moved down
> into the usb-device level code. Which clearly has your preference,
> but I don't completely agree with, so lets have that discussion first,
> once we've decided what to do there, details like this will likely
> sort out themselves.

Yes.

>> The core can and should do that for packets it owns (USBPacketState ==
>> USB_PACKET_QUEUED) because they are not (yet) passed to the USBDevice.
>>
>> Packets owned by USBDevice (USBPacketState == USB_PACKET_ASYNC) must be
>> handled by the USBDevice itself.
> 
> Getting offtopic a bit here, but this not how we currently handle
> things, currently the hcd code cancels packets after a queue halt,

Ah, right.  Well, that should continue to work.  USB_RET_NOT_USED would
make the hcd code just free the packet, and anything not-yet freed will
be zapped by the queue halt handling.

> Notice that arguments 2 and 3 could be made for combining output packets
> too, but what we do their now is nice and KISS, and the possible speed
> improvement is not worth the complication IMHO.

Handling in/out eps the same way has its merits too (code sharing).

> So again lets try to split the discussion in answering multiple questions:
> 
> 1) I believe it is better, and would greatly prefer to, do the combining
> on the qemu side rather then on the usb-redir-host side, do you agree?

Your call, you know the bits much better than I do.
Your arguments make sense.

> 2) If you agree with 1, then I assume you agree we will want to share
> the combining code between host-linux.c (or host-* for that matter) and
> redirect.c ?

I'm not sure there is that much to share.

A helper function which takes a USBEndpoint and returns an iovec for all
USBPackets lined up there would probably be useful.  Likewise for one
for completing the packets (takes xfer length + status, then fill
USBPacket->result & call usb_packet_complete for each packet).

But beyond that?

> 3) If you agree with 2, then all we need is a place for the shared logic
> to live, we could put it in a new file called input-pipeline.c ?

Just stick the helpers into hw/usb/core.c?

cheers,
  Gerd
Hans de Goede Oct. 9, 2012, 2:19 p.m. UTC | #6
Hi,

On 10/09/2012 03:31 PM, Gerd Hoffmann wrote:

<snip>

>>> The core can and should do that for packets it owns (USBPacketState ==
>>> USB_PACKET_QUEUED) because they are not (yet) passed to the USBDevice.
>>>
>>> Packets owned by USBDevice (USBPacketState == USB_PACKET_ASYNC) must be
>>> handled by the USBDevice itself.
>>
>> Getting offtopic a bit here, but this not how we currently handle
>> things, currently the hcd code cancels packets after a queue halt,
>
> Ah, right.  Well, that should continue to work.  USB_RET_NOT_USED would
> make the hcd code just free the packet, and anything not-yet freed will
> be zapped by the queue halt handling.
>

Right.

<snip>

>> 2) If you agree with 1, then I assume you agree we will want to share
>> the combining code between host-linux.c (or host-* for that matter) and
>> redirect.c ?
>
> I'm not sure there is that much to share.
>

It is not much, but you rightly called it "magic" before, as it is somewhat
tricky code, so better to write (and debug) it once :)

> A helper function which takes a USBEndpoint and returns an iovec for all
> USBPackets lined up there would probably be useful.  Likewise for one
> for completing the packets (takes xfer length + status, then fill
> USBPacket->result & call usb_packet_complete for each packet).

That is more or less what I had in mind yes :)

> But beyond that?
>
>> 3) If you agree with 2, then all we need is a place for the shared logic
>> to live, we could put it in a new file called input-pipeline.c ?
>
> Just stick the helpers into hw/usb/core.c?

I think core.c is getting a bit crowded, so I'll go with a new file for
the next revision, if you don't like that moving the functions someplace
else is easy :)

I'll start working on a new reworked version of the patchset which hopefully
will be more to your liking :)

Regards,

Hans
diff mbox

Patch

diff --git a/hw/usb.h b/hw/usb.h
index 2ecd8e9..57d200b 100644
--- a/hw/usb.h
+++ b/hw/usb.h
@@ -312,6 +312,15 @@  typedef struct USBPortOps {
      * the packet originated when a hub is involved.
      */
     void (*complete)(USBPort *port, USBPacket *p);
+    /*
+     * Called by the usb-core when a previous queued up packet
+     * (usb_handle_packet returned USB_RET_ASYNC) gets removed from the
+     * queue, without completing or being cancelled.
+     * This happens with input pipelinging when multiple packets are combined
+     * into one large transfer, and that transfer ends with a short read.
+     * Mandatory for controllers which set supports_queuing on their ports.
+     */
+    void (*remove_from_queue)(USBPort *port, USBPacket *p);
 } USBPortOps;
 
 /* USB port on which a device can be connected */
@@ -322,6 +331,7 @@  struct USBPort {
     USBPortOps *ops;
     void *opaque;
     int index; /* internal port index, may be used with the opaque */
+    bool supports_queuing;
     QTAILQ_ENTRY(USBPort) next;
 };
 
@@ -347,7 +357,12 @@  struct USBPacket {
     int result; /* transfer length or USB_RET_* status code */
     /* Internal use by the USB layer.  */
     USBPacketState state;
+    USBPacket *combined_in;
+    QTAILQ_HEAD(, USBPacket) uncombined_packets;
     QTAILQ_ENTRY(USBPacket) queue;
+    /* Flags set by the hcd when queuing is supported */
+    bool short_not_ok;
+    bool int_req;
 };
 
 void usb_packet_init(USBPacket *p);
@@ -387,6 +402,15 @@  int usb_ep_get_max_packet_size(USBDevice *dev, int pid, int ep);
 void usb_ep_set_pipeline(USBDevice *dev, int pid, int ep, bool enabled);
 USBPacket *usb_ep_find_packet_by_id(USBDevice *dev, int pid, int ep,
                                     uint64_t id);
+static inline bool usb_ep_input_pipeline(USBEndpoint *ep)
+{
+    return ep->pid == USB_TOKEN_IN && ep->pipeline;
+}
+static inline bool usb_ep_output_pipeline(USBEndpoint *ep)
+{
+    return ep->pid == USB_TOKEN_OUT && ep->pipeline;
+}
+void usb_ep_process_queue(USBEndpoint *ep);
 
 void usb_attach(USBPort *port);
 void usb_detach(USBPort *port);
diff --git a/hw/usb/core.c b/hw/usb/core.c
index b9f1f7a..54db17b 100644
--- a/hw/usb/core.c
+++ b/hw/usb/core.c
@@ -388,7 +388,8 @@  int usb_handle_packet(USBDevice *dev, USBPacket *p)
         p->ep->halted = false;
     }
 
-    if (QTAILQ_EMPTY(&p->ep->queue) || p->ep->pipeline) {
+    if (!usb_ep_input_pipeline(p->ep) &&
+            (QTAILQ_EMPTY(&p->ep->queue) || usb_ep_output_pipeline(p->ep))) {
         ret = usb_process_one(p);
         if (ret == USB_RET_ASYNC) {
             usb_packet_set_state(p, USB_PACKET_ASYNC);
@@ -398,7 +399,8 @@  int usb_handle_packet(USBDevice *dev, USBPacket *p)
              * When pipelining is enabled usb-devices must always return async,
              * otherwise packets can complete out of order!
              */
-            assert(!p->ep->pipeline || QTAILQ_EMPTY(&p->ep->queue));
+            assert(!usb_ep_output_pipeline(p->ep) ||
+                   QTAILQ_EMPTY(&p->ep->queue));
             if (ret != USB_RET_NAK) {
                 p->result = ret;
                 usb_packet_set_state(p, USB_PACKET_COMPLETE);
@@ -412,6 +414,21 @@  int usb_handle_packet(USBDevice *dev, USBPacket *p)
     return ret;
 }
 
+static void usb_add_packet_to_uncombined(USBPacket *dst, USBPacket *src)
+{
+    qemu_iovec_concat(&dst->iov, &src->iov, 0, src->iov.size);
+    QTAILQ_REMOVE(&dst->ep->queue, src, queue);
+    QTAILQ_INSERT_TAIL(&dst->uncombined_packets, src, queue);
+    src->combined_in = dst;
+}
+
+static void usb_remove_packet_from_uncombined(USBPacket *dst, USBPacket *src)
+{
+    QTAILQ_REMOVE(&dst->uncombined_packets, src, queue);
+    usb_packet_set_state(src, USB_PACKET_COMPLETE);
+    qemu_iovec_destroy(&dst->iov);
+}
+
 static void __usb_packet_complete(USBDevice *dev, USBPacket *p)
 {
     USBEndpoint *ep = p->ep;
@@ -421,11 +438,133 @@  static void __usb_packet_complete(USBDevice *dev, USBPacket *p)
     if (p->result < 0) {
         ep->halted = true;
     }
+
     usb_packet_set_state(p, USB_PACKET_COMPLETE);
+
+    /* If this is a combined packet, split it */
+    if (!QTAILQ_EMPTY(&p->uncombined_packets)) {
+        USBPacket *uncombined, *next;
+        bool need_zero_packet = true;
+        bool free_p = true;
+
+        QTAILQ_FOREACH_SAFE(uncombined, &p->uncombined_packets, queue, next) {
+            if (p->result > 0) {
+                /* Distribute data over uncombined packets */
+                if (p->result >= uncombined->iov.size) {
+                    uncombined->result = uncombined->iov.size;
+                } else {
+                    if (p->short_not_ok) {
+                        ep->halted = true;
+                    }
+                    uncombined->result = p->result;
+                    need_zero_packet = false;
+                }
+                p->result -= uncombined->result;
+                usb_remove_packet_from_uncombined(p, uncombined);
+                dev->port->ops->complete(dev->port, uncombined);
+            } else if (need_zero_packet) {
+                /* Send terminating 0 or error status packet */
+                if (p->short_not_ok) {
+                    ep->halted = true;
+                }
+                uncombined->result = p->result;
+                usb_remove_packet_from_uncombined(p, uncombined);
+                dev->port->ops->complete(dev->port, uncombined);
+                need_zero_packet = false;
+            } else {
+                /* Remove any leftover packets from the queue */
+                dev->port->ops->remove_from_queue(dev->port, uncombined);
+                free_p = false; /* remove_from_queue calls usb_cancel_packet */
+            }
+        }
+        if (free_p) {
+            QTAILQ_REMOVE(&ep->queue, p, queue);
+            g_free(p);
+        }
+        return;
+    }
+
+    if (p->short_not_ok && (p->result < p->iov.size)) {
+        ep->halted = true;
+    }
     QTAILQ_REMOVE(&ep->queue, p, queue);
     dev->port->ops->complete(dev->port, p);
 }
 
+/*
+ * Large input transfers can get split into multiple input packets, this
+ * function recombines them, removing the short_not_ok checks which all but
+ * the last packet of such splits transfers have, thereby allowing input
+ * transfer pipelining (which we cannot do on short_not_ok transfers)
+ */
+void usb_ep_process_queue(USBEndpoint *ep)
+{
+    USBPacket *u, *n, *p, *next, *prev = NULL, *first = NULL;
+    USBPort *port = ep->dev->port;
+
+    if (!usb_ep_input_pipeline(ep)) {
+        return;
+    }
+
+    QTAILQ_FOREACH_SAFE(p, &ep->queue, queue, next) {
+        /* Empty the queue on a halt */
+        if (ep->halted) {
+            if (!QTAILQ_EMPTY(&p->uncombined_packets)) {
+                QTAILQ_FOREACH_SAFE(u, &p->uncombined_packets, queue, n) {
+                    port->ops->remove_from_queue(port, u);
+                }
+            } else {
+                port->ops->remove_from_queue(port, p);
+            }
+            continue;
+        }
+
+        /* Skip packets already submitted to the device */
+        if (p->state == USB_PACKET_ASYNC) {
+            prev = p;
+            continue;
+        }
+        usb_packet_check_state(p, USB_PACKET_QUEUED);
+
+        /*
+         * If the previous (combined) packet has the short_not_ok flag set
+         * stop, as we must not submit packets to the device after a transfer
+         * ending with short_not_ok packet.
+         */
+        if (prev && prev->short_not_ok) {
+            break;
+        }
+
+        if (first) {
+            if (QTAILQ_EMPTY(&first->uncombined_packets)) {
+                USBPacket *new_first = g_new0(USBPacket, 1);
+
+                usb_packet_init(new_first);
+                usb_packet_setup(new_first, first->pid, first->ep, first->id);
+                new_first->state = USB_PACKET_QUEUED;
+
+                usb_add_packet_to_uncombined(new_first, first);
+                QTAILQ_INSERT_BEFORE(p, new_first, queue);
+                first = new_first;
+            }
+            usb_add_packet_to_uncombined(first, p);
+            first->short_not_ok = p->short_not_ok;
+        } else {
+            first = p;
+        }
+
+        /* Is this packet the last one of a (combined) transfer? */
+        if ((p->iov.size % ep->max_packet_size) != 0 || !p->short_not_ok ||
+                p->int_req || next == NULL) {
+            int ret = usb_process_one(first);
+            assert(ret == USB_RET_ASYNC);
+            usb_packet_set_state(first, USB_PACKET_ASYNC);
+            prev = first;
+            first = NULL;
+        }
+    }
+}
+
 /* Notify the controller that an async packet is complete.  This should only
    be called for packets previously deferred by returning USB_RET_ASYNC from
    handle_packet. */
@@ -438,6 +577,11 @@  void usb_packet_complete(USBDevice *dev, USBPacket *p)
     assert(QTAILQ_FIRST(&ep->queue) == p);
     __usb_packet_complete(dev, p);
 
+    if (usb_ep_input_pipeline(ep)) {
+        usb_ep_process_queue(ep);
+        return;
+    }
+
     while (!ep->halted && !QTAILQ_EMPTY(&ep->queue)) {
         p = QTAILQ_FIRST(&ep->queue);
         if (p->state == USB_PACKET_ASYNC) {
@@ -459,19 +603,34 @@  void usb_packet_complete(USBDevice *dev, USBPacket *p)
    completed.  */
 void usb_cancel_packet(USBPacket * p)
 {
-    bool callback = (p->state == USB_PACKET_ASYNC);
+    USBPacket *free_me = NULL;
+    bool callback;
+
     assert(usb_packet_is_inflight(p));
+
+    /* Is this part of a combined packet? */
+    if (p->combined_in) {
+        usb_remove_packet_from_uncombined(p->combined_in, p);
+        if (!QTAILQ_EMPTY(&p->combined_in->uncombined_packets)) {
+            return;
+        }
+        free_me = p = p->combined_in;
+    }
+
+    callback = (p->state == USB_PACKET_ASYNC);
     usb_packet_set_state(p, USB_PACKET_CANCELED);
     QTAILQ_REMOVE(&p->ep->queue, p, queue);
     if (callback) {
         usb_device_cancel_packet(p->ep->dev, p);
     }
+    g_free(free_me);
 }
 
 
 void usb_packet_init(USBPacket *p)
 {
     qemu_iovec_init(&p->iov, 1);
+    QTAILQ_INIT(&p->uncombined_packets);
 }
 
 static const char *usb_packet_state_name(USBPacketState state)
@@ -531,6 +690,9 @@  void usb_packet_setup(USBPacket *p, int pid, USBEndpoint *ep, uint64_t id)
     p->ep = ep;
     p->result = 0;
     p->parameter = 0;
+    p->combined_in = NULL;
+    p->short_not_ok = false;
+    p->int_req = false;
     qemu_iovec_reset(&p->iov);
     usb_packet_set_state(p, USB_PACKET_SETUP);
 }
@@ -724,7 +886,13 @@  int usb_ep_get_max_packet_size(USBDevice *dev, int pid, int ep)
 void usb_ep_set_pipeline(USBDevice *dev, int pid, int ep, bool enabled)
 {
     struct USBEndpoint *uep = usb_ep_get(dev, pid, ep);
-    uep->pipeline = enabled;
+
+    if (pid == USB_TOKEN_IN) {
+        uep->pipeline =
+            enabled && uep->max_packet_size != 0 && dev->port->supports_queuing;
+    } else {
+        uep->pipeline = enabled;
+    }
 }
 
 USBPacket *usb_ep_find_packet_by_id(USBDevice *dev, int pid, int ep,
diff --git a/hw/usb/dev-hub.c b/hw/usb/dev-hub.c
index 8fd30df..d326281 100644
--- a/hw/usb/dev-hub.c
+++ b/hw/usb/dev-hub.c
@@ -222,6 +222,13 @@  static void usb_hub_complete(USBPort *port, USBPacket *packet)
     s->dev.port->ops->complete(s->dev.port, packet);
 }
 
+static void usb_hub_remove_from_queue(USBPort *port, USBPacket *packet)
+{
+    USBHubState *s = port->opaque;
+
+    s->dev.port->ops->remove_from_queue(s->dev.port, packet);
+}
+
 static USBDevice *usb_hub_find_device(USBDevice *dev, uint8_t addr)
 {
     USBHubState *s = DO_UPCAST(USBHubState, dev, dev);
@@ -512,6 +519,7 @@  static USBPortOps usb_hub_port_ops = {
     .child_detach = usb_hub_child_detach,
     .wakeup = usb_hub_wakeup,
     .complete = usb_hub_complete,
+    .remove_from_queue = usb_hub_remove_from_queue,
 };
 
 static int usb_hub_initfn(USBDevice *dev)
@@ -529,6 +537,7 @@  static int usb_hub_initfn(USBDevice *dev)
                           &port->port, s, i, &usb_hub_port_ops,
                           USB_SPEED_MASK_LOW | USB_SPEED_MASK_FULL);
         usb_port_location(&port->port, dev->port, i+1);
+        port->port.supports_queuing = dev->port->supports_queuing;
     }
     usb_hub_handle_reset(dev);
     return 0;