
usb: xhci: Link TRB must not occur with a USB payload burst.

Message ID AE90C24D6B3A694183C094C60CF0A2F6026B73EF@saturn3.aculab.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

David Laight Nov. 7, 2013, 5:20 p.m. UTC
Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
can only occur at a boundary between underlying USB frames (512 bytes for 480M).

If this isn't done the USB frames aren't formatted correctly and, for example,
the USB3 ethernet ax88179_178a card will stop sending (while still receiving)
when running a netperf tcp transmit test with (say) an 8k buffer.

While this change improves things a lot (it runs for 20 minutes rather than
a few seconds), there is still something else wrong.

This should be a candidate for stable: the ax88179_178a driver defaults to
gso and tso enabled, so it passes a lot of fragmented skbs to the USB stack.

Signed-off-by: David Laight <david.laight@aculab.com>
---

Although I've got a USB2 analyser, its trigger and trace stuff is almost
unusable! In any case this fails much faster on USB3 (Intel i7 cpu).






Comments

Sarah Sharp Nov. 8, 2013, 12:18 a.m. UTC | #1
On Thu, Nov 07, 2013 at 05:20:49PM -0000, David Laight wrote:
> 
> Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
> can only occur at a boundary between underlying USB frames (512 bytes for 480M).

Which version of the spec are you looking at?  I'm looking at the
updated version from 08/2012 and I don't see any such requirement.

I do see that section says "A TD Fragment shall not span Transfer Ring
Segments" where a TD fragment is defined as exact multiples of Max Burst
Size * Max Packet Size bytes for the endpoint.  Is that what you mean
about USB frames?

> If this isn't done the USB frames aren't formatted correctly and, for example,
> the USB3 ethernet ax88179_178a card will stop sending (while still receiving)
> when running a netperf tcp transmit test with (say) and 8k buffer.

Which driver does that use?  Is it using the scatter gather list
mechanism for URBs (i.e. urb->sg)?  Or is it submitting multiple URBs
for fragmented buffers?  Or is it submitting isochronous URBs with
multiple frames?  Or is it submitting bulk URBs with one transfer buffer
that crosses the 64KB boundary?

I'm trying to understand what the USB device driver is doing in order to
create multi-TRB TDs that would violate the TD fragment rules.

> While this change improves things a lot (it runs for 20 minutes rather than
> a few seconds), there is still something else wrong.
> 
> This should be a candidate for stable, the ax88179_178a driver defaults to
> gso and tso enabled so it passes a lot of fragmented skb to the USB stack.

I want to understand what the larger issue is here before pushing
bandaid fixes.

I don't think this is the right approach, because prepare_ring() can be
called for isochronous URBs that get mapped to multiple TDs.  The TD
fragment rules only apply to one TD, so failing URB submission because
many isochronous TDs span more than one ring segment doesn't seem like
the right approach.  Inserting no-op TRBs will also result in odd
isochronous behavior, since a no-op TRB means the xHC should not send or
request an isochronous packet for that service interval.

The patch also doesn't address the underlying issue of constructing
multi-TRB TDs so that it fits the TD fragment rules.  Your patch ensures
there are no link TRBs in the middle of a TD, but it doesn't ensure that
the TRBs in the TD are exact multiples of the Max Burst Size * Max
Packet Size bytes for the endpoint.

Instead, new code in count_sg_trbs_needed() should break the TD into
proper TD fragments.  The queueing code for all endpoints would also
have to follow those rules.  This would have to happen while still
respecting the 64KB boundary rule.

The driver would also have to make sure that TD fragments didn't have
link TRBs in them.  That's an issue, since the spec is decidedly unclear
about what exactly comprises a TD fragment.  Is the xHC host greedy, and
will lump all multiples into one TD fragment?  Or will it assume the TD
fragment has ended once it consumes one multiple of Max Burst Size * Max
Packet Size bytes?

This ambiguity means it's basically impossible to figure out whether a
TD with a link TRB in the middle is constructed properly.  The xHCI spec
architect didn't have a good solution to this problem, so I punted and
never implemented TD fragments.  If this is really an issue, it's going
to be pretty complex to solve.

> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
> 
> Although I've got a USB2 analyser its trigger and trace stuff is almost
> unusable! In any case this fails much faster on USB3 (Intel i7 cpu).

I thought you were using a big endian system?

> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> index 5480215..23abc9b 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -2927,8 +2932,43 @@ static int prepare_ring(struct xhci_hcd *xhci, struct xhci_ring *ep_ring,
>  	}
>  
>  	while (1) {
> -		if (room_on_ring(xhci, ep_ring, num_trbs))
> -			break;
> +		if (room_on_ring(xhci, ep_ring, num_trbs)) {
> +			union xhci_trb *trb = ep_ring->enqueue;
> +			unsigned int usable = ep_ring->enq_seg->trbs +
> +					TRBS_PER_SEGMENT - 1 - trb;
> +			u32 nop_cmd;
> +
> +			/*
> +			 * Section 4.11.7.1 TD Fragments states that a link
> +			 * TRB must only occur at the boundary between
> +			 * data bursts (eg 512 bytes for 480M).
> +			 * While it is possible to split a large fragment
> +			 * we don't know the size yet.
> +			 * Simplest solution is to fill the trb before the
> +			 * LINK with nop commands.
> +			 */
> +			if (num_trbs == 1 || num_trbs <= usable || usable == 0)
> +				break;
> +
> +			if (num_trbs >= TRBS_PER_SEGMENT) {
> +				xhci_err(xhci, "Too many fragments %d, max %d\n",
> +						num_trbs, TRBS_PER_SEGMENT - 1);
> +				return -ENOMEM;
> +			}
> +
> +			nop_cmd = cpu_to_le32(TRB_TYPE(TRB_TR_NOOP) |
> +					ep_ring->cycle_state);
> +			for (; !TRB_TYPE_LINK_LE32(trb->link.control); trb++) {
> +				trb->generic.field[0] = 0;
> +				trb->generic.field[1] = 0;
> +				trb->generic.field[2] = 0;
> +				trb->generic.field[3] = nop_cmd;
> +				ep_ring->num_trbs_free--;
> +			}
> +			ep_ring->enqueue = trb;
> +			if (room_on_ring(xhci, ep_ring, num_trbs))
> +				break;
> +		}
>  
>  		if (ep_ring == xhci->cmd_ring) {
>  			xhci_err(xhci, "Do not support expand command ring\n");

Sarah Sharp
David Laight Nov. 8, 2013, 10:47 a.m. UTC | #2
> From: Sarah Sharp [mailto:sarah.a.sharp@linux.intel.com]
> On Thu, Nov 07, 2013 at 05:20:49PM -0000, David Laight wrote:
> >
> > Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
> > can only occur at a boundary between underlying USB frames (512 bytes for 480M).
> 
> Which version of the spec are you looking at?  I'm looking at the
> updated version from 08/2012 and I don't see any such requirement.

The copy I downloaded last week is dated 5/21/10, seems the copy on the
web site hasn't been updated.

> I do see that section says "A TD Fragment shall not span Transfer Ring
> Segments" where a TD fragment is defined as exact multiples of Max Burst
> Size * Max Packet Size bytes for the endpoint.  Is that what you mean
> about USB frames?

That is the clause; it needs some understanding.
Figure 24 shows a correctly aligned link TRB - one where the TRB boundary
is aligned with a packet boundary (512 bytes at 480M, 1k at 5G etc).

> > If this isn't done the USB frames aren't formatted correctly and, for example,
> > the USB3 ethernet ax88179_178a card will stop sending (while still receiving)
> > when running a netperf tcp transmit test with (say) and 8k buffer.
> 
> Which driver does that use?  Is it using the scatter gather list
> mechanism for URBs (i.e. urb->sg)?  Or is it submitting multiple URBs
> for fragmented buffers?  Or is it submitting isochronous URBs with
> multiple frames?  Or is it submitting bulk URBs with one transfer buffer
> that crosses the 64KB boundary?
> 
> I'm trying to understand what the USB device driver is doing in order to
> create multi-TRB TDs that would violate the TD fragment rules.

The usb setup is done by net/usbnet.c; the ax88179_178a driver just adds
a header to the ethernet frame. Since the hardware supports TCP segmentation
offload, the urb->sg list is used: the first fragment seems to be 1578
bytes and is followed by two or three fragments (split on a 0x8000
boundary) making up the rest of the 65234 bytes.
(I haven't yet seen a buffer that crossed a 64k boundary).

I'm running netperf - which is doing a burst of 8k sends into
a TCP socket.

> > While this change improves things a lot (it runs for 20 minutes rather than
> > a few seconds), there is still something else wrong.
> >
> > This should be a candidate for stable, the ax88179_178a driver defaults to
> > gso and tso enabled so it passes a lot of fragmented skb to the USB stack.
> 
> I want to understand what the larger issue is here before pushing
> bandaid fixes.
> 
> I don't think this is the right approach, because prepare_ring() can be
> called for isochronous URBs that get mapped to multiple TDs.  The TD
> fragment rules only apply to one TD, so failing URB submission because
> many isochronous TDs span more than one ring segment doesn't seem like
> the right approach.  Inserting no-op TRBs will also result in odd
> isochronous behavior, since a no-op TRB means the xHC should not send or
> request an isochronous packet for that service interval.

I wasn't aware of that effect of NOP TRB on isoc data.
Three obvious options:
1) Pass an extra parameter indicating that the caller will handle any
   TRB data alignment issues at the link TRB.
2) Add a test for bulk endpoints (ok until the isoc code gets modified
   to support SG data).
3) Copy the LINK TRB instead of adding NOPs (would require the rest of
   the code to use the ring length instead of reading the TRB type).

The second is probably the easiest way to isolate the change.
However there could be problems on an isoc endpoint if the data
crosses a 64k boundary.
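
For option 2 the guard would be something like this (a sketch only;
prepare_ring() doesn't currently see the urb, so the endpoint would have
to be passed in):

static bool ep_needs_nop_padding(struct urb *urb)
{
	/* Only pad out to the link TRB for bulk endpoints; leave isoc
	 * rings alone so a no-op TRB never lands in a service interval. */
	return usb_endpoint_xfer_bulk(&urb->ep->desc);
}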

> The patch also doesn't address the underlying issue of constructing
> multi-TRB TDs so that it fits the TD fragment rules.  Your patch ensures
> there are no link TRBs in the middle of a TD, but it doesn't ensure that
> the TRBs in the TD are exact multiples of the Max Burst Size * Max
> Packet Size bytes for the endpoint.

That doesn't matter - look at figure 25. A TD can be any number of USB
packets and any number of TRBs. The constraint is that a TD cannot
contain a LINK TRB.

> Instead, new code in count_sg_trbs_needed() should break the TD into
> proper TD fragments.  The queueing code for all endpoints would also
> have to follow those rules.  This would have to happen while still
> respecting the 64KB boundary rule.
> 
> The driver would also have to make sure that TD fragments didn't have
> link TRBs in them.  That's an issue, since the spec decidedly unclear
> about what exactly comprises a TD fragment.  Is the xHC host greedy, and
> will lump all multiples into one TD fragment?  Or will it assume the TD
> fragment has ended once it consumes one multiple of Max Burst Size * Max
> Packet Size bytes?

It only matters at a LINK TRB (unless you want to request an interrupt).
The constraint seems to be that the data TRB before a link TRB must end
on an MBP boundary.
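
Stated as code (made-up names, nothing that exists in the driver):

static bool link_trb_allowed_here(unsigned int td_bytes_queued, unsigned int mbp)
{
	/* A link TRB is only safe if the data TRB before it ends on a
	 * Max Burst Payload boundary (mbp = MaxBurstSize * MaxPacketSize). */
	return (td_bytes_queued % mbp) == 0;
}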

> This ambiguity means it's basically impossible to figure out whether a
> TD with a link TRB in the middle is constructed properly.  The xHCI spec
> architect didn't have a good solution to this problem, so I punted and
> never implemented TD fragments.

The problem is that the hardware does implement them :-)

> If this is really an issue, it's going to be pretty complex to solve.

I thought about solutions and then decided it was easiest to skip to
the next ring segment. The data could be scanned for an earlier
boundary that was aligned, but it really didn't seem worth the effort.

If all of the TRBs reference buffers that are larger than the MBP then,
when the LINK is encountered, the previous TRB can be split on its
last MBP boundary; if not, a bounce buffer would have to be used
(and the code doesn't really know if this is a tx or rx transaction).
I've a tx that is split 1576/328/32768/30560 where the 328 could
not be split - so this does happen.
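
A sketch of that split (a hypothetical helper; rounddown() is the kernel
macro, mbp = MaxBurstSize * MaxPacketSize):

/*
 * How many bytes of the TRB preceding the link TRB can be kept so that
 * the link TRB falls on an MBP boundary.  0 means there is no MBP
 * boundary inside this TRB (like the 328 byte fragment above), so a
 * bounce buffer or no-op fill would be needed instead.
 */
static unsigned int split_len_before_link(unsigned int td_bytes_so_far,
					  unsigned int trb_len,
					  unsigned int mbp)
{
	unsigned int boundary = rounddown(td_bytes_so_far + trb_len, mbp);

	return boundary > td_bytes_so_far ? boundary - td_bytes_so_far : 0;
}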

> 
> > Signed-off-by: David Laight <david.laight@aculab.com>
> > ---
> >
> > Although I've got a USB2 analyser its trigger and trace stuff is almost
> > unusable! In any case this fails much faster on USB3 (Intel i7 cpu).
> 
> I thought you were using a big endian system?

No - I was just reading the code and spotted the endianness bug.

	David

> > diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> > index 5480215..23abc9b 100644
> > --- a/drivers/usb/host/xhci-ring.c
> > +++ b/drivers/usb/host/xhci-ring.c
> > @@ -2927,8 +2932,43 @@ static int prepare_ring(struct xhci_hcd *xhci, struct xhci_ring *ep_ring,
> >  	}
> >
> >  	while (1) {
> > -		if (room_on_ring(xhci, ep_ring, num_trbs))
> > -			break;
> > +		if (room_on_ring(xhci, ep_ring, num_trbs)) {
> > +			union xhci_trb *trb = ep_ring->enqueue;
> > +			unsigned int usable = ep_ring->enq_seg->trbs +
> > +					TRBS_PER_SEGMENT - 1 - trb;
> > +			u32 nop_cmd;
> > +
> > +			/*
> > +			 * Section 4.11.7.1 TD Fragments states that a link
> > +			 * TRB must only occur at the boundary between
> > +			 * data bursts (eg 512 bytes for 480M).
> > +			 * While it is possible to split a large fragment
> > +			 * we don't know the size yet.
> > +			 * Simplest solution is to fill the trb before the
> > +			 * LINK with nop commands.
> > +			 */
> > +			if (num_trbs == 1 || num_trbs <= usable || usable == 0)
> > +				break;
> > +
> > +			if (num_trbs >= TRBS_PER_SEGMENT) {
> > +				xhci_err(xhci, "Too many fragments %d, max %d\n",
> > +						num_trbs, TRBS_PER_SEGMENT - 1);
> > +				return -ENOMEM;
> > +			}
> > +
> > +			nop_cmd = cpu_to_le32(TRB_TYPE(TRB_TR_NOOP) |
> > +					ep_ring->cycle_state);
> > +			for (; !TRB_TYPE_LINK_LE32(trb->link.control); trb++) {
> > +				trb->generic.field[0] = 0;
> > +				trb->generic.field[1] = 0;
> > +				trb->generic.field[2] = 0;
> > +				trb->generic.field[3] = nop_cmd;
> > +				ep_ring->num_trbs_free--;
> > +			}
> > +			ep_ring->enqueue = trb;
> > +			if (room_on_ring(xhci, ep_ring, num_trbs))
> > +				break;
> > +		}
> >
> >  		if (ep_ring == xhci->cmd_ring) {
> >  			xhci_err(xhci, "Do not support expand command ring\n");
> 
> Sarah Sharp


David Laight Nov. 8, 2013, 11:07 a.m. UTC | #3
> While this change improves things a lot (it runs for 20 minutes rather than
> a few seconds), there is still something else wrong.

Almost certainly caused by an unrelated local change.

	David



Alan Stern Nov. 8, 2013, 4:45 p.m. UTC | #4
On Thu, 7 Nov 2013, Sarah Sharp wrote:

> On Thu, Nov 07, 2013 at 05:20:49PM -0000, David Laight wrote:
> > 
> > Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
> > can only occur at a boundary between underlying USB frames (512 bytes for 480M).
> 
> Which version of the spec are you looking at?  I'm looking at the
> updated version from 08/2012 and I don't see any such requirement.
> 
> I do see that section says "A TD Fragment shall not span Transfer Ring
> Segments" where a TD fragment is defined as exact multiples of Max Burst
> Size * Max Packet Size bytes for the endpoint.  Is that what you mean
> about USB frames?

...

> The driver would also have to make sure that TD fragments didn't have
> link TRBs in them.  That's an issue, since the spec decidedly unclear
> about what exactly comprises a TD fragment.  Is the xHC host greedy, and
> will lump all multiples into one TD fragment?  Or will it assume the TD
> fragment has ended once it consumes one multiple of Max Burst Size * Max
> Packet Size bytes?
> 
> This ambiguity means it's basically impossible to figure out whether a
> TD with a link TRB in the middle is constructed properly.  The xHCI spec
> architect didn't have a good solution to this problem, so I punted and
> never implemented TD fragments.  If this is really an issue, it's going
> to be pretty complex to solve.

This is something I wanted to ask you about also.

The whole idea of TD fragments is exceedingly cloudy.  The definition
doesn't make sense, and the spec doesn't say what the actual hardware
restrictions are, i.e., what is the underlying reality that the TD
fragment concept wants to express.

Even if you do your best to ignore the whole idea, there still appear
to be certain restrictions you need to obey.  In addition to the
question of where Link TRBs are allowed, you also have to worry about
TDs whose size isn't a multiple of the Max Burst Payload (MBP) size =
MaxBurstSize * MaxPacketSize.

According to my version of the spec (Rev 1.0, dated 5/21/2010), if a TD
is larger than the MBP and its length isn't a multiple of the MBP, then
the last MBP boundary in the TD must also be a TRB boundary.  This 
follows from two statements in section 4.11.7.1:

	If the TD Transfer Size is an even multiple of the MBP then all 
	TD Fragments shall define exact multiples of MBP data bytes.  
	If not, then only the last TD Fragment shall define less than 
	MBP data (or the Residue) bytes.

So if a TD is longer than MBP then it must contain at least two TD
fragments, and the last fragment must consist of the last Residue
bytes (i.e., the bytes beyond the last MBP boundary).

	Each TD Fragment is comprised of one or more TRBs.

Hence the last MBP boundary in the TD, which marks the beginning of the 
last TD fragment, must also be a TRB boundary.

I have no idea whether violating this restriction will cause trouble
for real hardware.

Alan Stern

David Laight Nov. 8, 2013, 5:39 p.m. UTC | #5
> -----Original Message-----
> From: Alan Stern [mailto:stern@rowland.harvard.edu]
> Sent: 08 November 2013 16:46
> To: Sarah Sharp
> Cc: David Laight; netdev@vger.kernel.org; linux-usb@vger.kernel.org
> Subject: Re: [PATCH] usb: xhci: Link TRB must not occur with a USB payload burst.
> 
> On Thu, 7 Nov 2013, Sarah Sharp wrote:
> 
> > On Thu, Nov 07, 2013 at 05:20:49PM -0000, David Laight wrote:
> > >
> > > Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
> > > can only occur at a boundary between underlying USB frames (512 bytes for 480M).
> >
> > Which version of the spec are you looking at?  I'm looking at the
> > updated version from 08/2012 and I don't see any such requirement.
> >
> > I do see that section says "A TD Fragment shall not span Transfer Ring
> > Segments" where a TD fragment is defined as exact multiples of Max Burst
> > Size * Max Packet Size bytes for the endpoint.  Is that what you mean
> > about USB frames?
> 
> ...
> 
> > The driver would also have to make sure that TD fragments didn't have
> > link TRBs in them.  That's an issue, since the spec decidedly unclear
> > about what exactly comprises a TD fragment.  Is the xHC host greedy, and
> > will lump all multiples into one TD fragment?  Or will it assume the TD
> > fragment has ended once it consumes one multiple of Max Burst Size * Max
> > Packet Size bytes?
> >
> > This ambiguity means it's basically impossible to figure out whether a
> > TD with a link TRB in the middle is constructed properly.  The xHCI spec
> > architect didn't have a good solution to this problem, so I punted and
> > never implemented TD fragments.  If this is really an issue, it's going
> > to be pretty complex to solve.
> 
> This is something I wanted to ask you about also.
> 
> The whole idea of TD fragments is exceedingly cloudy.  The definition
> doesn't make sense, and the spec doesn't say what the actual hardware
> restrictions are, i.e., what is the underlying reality that the TD
> fragment concept wants to express.

I think that a TD fragment boundary exists whenever a TRB/dma boundary
lines up with the packet boundary.

> Even if you do your best to ignore the whole idea, there still appear
> to be certain restrictions you need to obey.  In addition to the
> question of where Link TRBs are allowed, you also have to worry about
> TDs whose size isn't a multiple of the Max Burst Payload (MBP) size =
> MaxBurstSize * MaxPacketSize.

I'm not certain what MaxBurstSize and MaxPacketSize are, but the important
number is the MBP. I'm fairly sure that is the value returned by
GET_MAX_PACKET() - 1024 for USB3 bulk endpoints.

> According to my version of the spec (Rev 1.0, dated 5/21/2010), if a TD
> is larger than the MBP and its length isn't a multiple of the MBP, then
> the last MBP boundary in the TD must also be a TRB boundary.  This
> follows from two statements in section 4.11.7.1:
> 
> 	If the TD Transfer Size is an even multiple of the MBP then all
> 	TD Fragments shall define exact multiples of MBP data bytes.
> 	If not, then only the last TD Fragment shall define less than
> 	MBP data (or the Residue) bytes.

No, that doesn't stop there being only 1 TD fragment.
(I think it means an exact multiple of the MBP.)

> So if a TD is longer than MBP then it must contain at least two TD
> fragments, and the last fragment must consist of the last Residue
> bytes (i.e., the bytes beyond the last MBP boundary).
> 
> 	Each TD Fragment is comprised of one or more TRBs.
> 
> Hence the last MBP boundary in the TD, which marks the beginning of the
> last TD fragment, must also be a TRB boundary.
> 
> I have no idea whether violating this restriction will cause trouble
> for real hardware.

Violating the one for LINK TRBs is definitely a problem.

	David



Alan Stern Nov. 8, 2013, 6:12 p.m. UTC | #6
On Fri, 8 Nov 2013, David Laight wrote:

> > The whole idea of TD fragments is exceedingly cloudy.  The definition
> > doesn't make sense, and the spec doesn't say what the actual hardware
> > restrictions are, i.e., what is the underlying reality that the TD
> > fragment concept wants to express.
> 
> I think that a TD fragment boundary exists whenever a TRB/dma boundary
> lines up with the packet boundary.

The spec doesn't say that.

> > Even if you do your best to ignore the whole idea, there still appear
> > to be certain restrictions you need to obey.  In addition to the
> > question of where Link TRBs are allowed, you also have to worry about
> > TDs whose size isn't a multiple of the Max Burst Payload (MBP) size =
> > MaxBurstSize * MaxPacketSize.
> 
> I'm not certain what MaxBurstSize and MaxPacketSize are, but the important
> number is the MBP. I'm fairly sure that is the value returned by
> GET_MAX_PACKET() - 1024 for USB3 bulk endpoints.

GET_MAX_PACKET always returns MaxPacketSize, and for USB-3 bulk
endpoints, MaxPacketSize is always 1024.  MaxBurstSize can be anything
from 1 to 16.
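
In driver terms that works out to something like this (a sketch; the xhci
driver doesn't currently compute an MBP anywhere, and 'ep' is assumed to be
the struct usb_host_endpoint):

	/* MaxPacketSize from the endpoint descriptor: 1024 for SS bulk */
	unsigned int max_packet = usb_endpoint_maxp(&ep->desc);
	/* bMaxBurst is 0..15 in the SuperSpeed companion descriptor,
	 * so the burst size is bMaxBurst + 1, i.e. 1..16 */
	unsigned int max_burst = ep->ss_ep_comp.bMaxBurst + 1;
	unsigned int mbp = max_packet * max_burst;	/* Max Burst Payload */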

> > According to my version of the spec (Rev 1.0, dated 5/21/2010), if a TD
> > is larger than the MBP and its length isn't a multiple of the MBP, then
> > the last MBP boundary in the TD must also be a TRB boundary.  This
> > follows from two statements in section 4.11.7.1:
> > 
> > 	If the TD Transfer Size is an even multiple of the MBP then all
> > 	TD Fragments shall define exact multiples of MBP data bytes.
> > 	If not, then only the last TD Fragment shall define less than
> > 	MBP data (or the Residue) bytes.
> 
> No, that doesn't stop there being only 1 TD fragment.
> (I think it means exact multiple of the MBP.)

Suppose, for example, the MBP is 1024.  If you have a TD with length
1500, and if it had only one fragment, the last (and only) fragment's
length would not be less than the MBP and it would not be an exact
multiple of the MBP.

I agree that the text is not as clear as it should be.

Alan Stern

David Laight Nov. 11, 2013, 5:12 p.m. UTC | #7
> From: Alan Stern [mailto:stern@rowland.harvard.edu]
> On Fri, 8 Nov 2013, David Laight wrote:
> 
...
> GET_MAX_PACKET always returns MaxPacketSize, and for USB-3 bulk
> endpoints, MaxPacketSize is always 1024.  MaxBurstSize can be anything
> from 1 to 16.

I've just read something that explained about bursts.

> > > According to my version of the spec (Rev 1.0, dated 5/21/2010), if a TD
> > > is larger than the MBP and its length isn't a multiple of the MBP, then
> > > the last MBP boundary in the TD must also be a TRB boundary.  This
> > > follows from two statements in section 4.11.7.1:
> > >
> > > 	If the TD Transfer Size is an even multiple of the MBP then all
> > > 	TD Fragments shall define exact multiples of MBP data bytes.
> > > 	If not, then only the last TD Fragment shall define less than
> > > 	MBP data (or the Residue) bytes.
> >
> > No, that doesn't stop there being only 1 TD fragment.
> > (I think it means exact multiple of the MBP.)
> 
> Suppose, for example, the MBP is 1024.  If you have a TD with length
> 1500, and if it had only one fragment, the last (and only) fragment's
> length would not less than the MBP and it would not be an exact
> multiple of the MBP.

That doesn't matter - eg example 2 in figure 25
 
> I agree that the text is not as clear as it should be.

Reading it all again makes me think that a LINK TRB is only
allowed on a burst boundary (which might be 16k bytes).
The only real way to implement that is to ensure that TDs never
contain LINK TRBs.

There are other things that can be done at a TD fragment boundary
(or only once per TD fragment) - but the code doesn't attempt any
of them.

The restriction might be there to simplify the hardware's
retransmission of a TRB burst.

My USB3 ethernet card (ax88179_178a driver) is running a lot more
reliably with these changes.

	David



Alan Stern Nov. 11, 2013, 8:34 p.m. UTC | #8
On Mon, 11 Nov 2013, David Laight wrote:

> > Suppose, for example, the MBP is 1024.  If you have a TD with length
> > 1500, and if it had only one fragment, the last (and only) fragment's
> > length would not less than the MBP and it would not be an exact
> > multiple of the MBP.
> 
> That doesn't matter - eg example 2 in figure 25

You're right.  I do wish the spec had been written more clearly.

> Reading it all again makes me think that a LINK trb is only
> allowed on the burst boundary (which might be 16k bytes).
> The only real way to implement that is to ensure that TD never
> contain LINK TRB.

That's one way to do it.  Or you could allow a Link TRB at an 
intermediate MBP boundary.

It comes down to a question of how often you want the controller to
issue an interrupt.  If a ring segment is 4 KB (one page), then it can
hold 256 TRBs.  With scatter-gather transfers, each SG element
typically refers to something like a 2-page buffer (depends on how
fragmented the memory is).  Therefore a ring segment will describe
somewhere around 512 pages of data, i.e., something like 2 MB.  Since
SuperSpeed is 500 MB/s, you'd end up getting in the vicinity of 250
interrupts every second just because of ring segment crossings.

Using larger ring segments would help.

Alan Stern

David Laight Nov. 12, 2013, 9:43 a.m. UTC | #9
> You're right.  I do wish the spec had been written more clearly.

I've read a lot of hardware specs in my time ...

> > Reading it all again makes me think that a LINK trb is only
> > allowed on the burst boundary (which might be 16k bytes).
> > The only real way to implement that is to ensure that TD never
> > contain LINK TRB.
> 
> That's one way to do it.  Or you could allow a Link TRB at an
> intermediate MBP boundary.

If all the fragments are larger than the MBP (assume 16k) then
that would be relatively easy. However that is very dependent
on the source of the data. It might be true for disk data, but
is unlikely to be true for ethernet data.

For bulk data the link TRB can be forced at a packet boundary
by splitting the TD up - the receiving end won't know the difference.

> It comes down to a question of how often you want the controller to
> issue an interrupt.  If a ring segment is 4 KB (one page), then it can
> hold 256 TRBs.  With scatter-gather transfers, each SG element
> typically refers to something like a 2-page buffer (depends on how
> fragmented the memory is).  Therefore a ring segment will describe
> somewhere around 512 pages of data, i.e., something like 2 MB.  Since
> SuperSpeed is 500 MB/s, you'd end up getting in the vicinity of 250
> interrupts every second just because of ring segment crossings.

250 interrupts/sec is noise. Send/receive 13000 ethernet packets/sec
and then look at the interrupt rate!

There is no necessity for taking an interrupt from every link segment.
OTOH an interrupt is requested for every bulk TD.
I'm sending ethernet data (with TSO) and each TD is just under 64k
mostly made up of 3 or 4 fragments.
The receive side is interrupting for every receive packet.

> Using larger ring segments would help.

The current ring segments contain 64 entries, a strange choice
since they are created with 2 segments.
(The ring expansion code soon doubles that for my ethernet traffic.)

I would change the code to use a single segment (for coding simplicity)
and queue bulk URBs when there isn't enough ring space.
URBs with too many fragments could either be rejected, sent in sections,
or partially linearised (and probably still sent in sections).

	David



Alan Stern Nov. 12, 2013, 4:08 p.m. UTC | #10
On Tue, 12 Nov 2013, David Laight wrote:

> > You're right.  I do wish the spec had been written more clearly.
> 
> I've read a lot of hardware specs in my time ...
> 
> > > Reading it all again makes me think that a LINK trb is only
> > > allowed on the burst boundary (which might be 16k bytes).
> > > The only real way to implement that is to ensure that TD never
> > > contain LINK TRB.
> > 
> > That's one way to do it.  Or you could allow a Link TRB at an
> > intermediate MBP boundary.
> 
> If all the fragments are larger than the MBP (assume 16k) then
> that would be relatively easy. However that is very dependant
> on the source of the data. It might be true for disk data, but
> is unlikely to be true for ethernet data.

I don't quite understand your point.  Are you saying that if all the 
TRBs are very short, you might need more than 64 TRBs to reach a 16-KB 
boundary?

> For bulk data the link TRB can be forced at a packet boundary
> by splitting the TD up - the receiving end won't know the difference.

That won't work.  What happens if you split a TD up into two pieces and 
the first piece receives a short packet?  The host controller will 
automatically move to the start of the second piece.  That's not what 
we want.

> > It comes down to a question of how often you want the controller to
> > issue an interrupt.  If a ring segment is 4 KB (one page), then it can
> > hold 256 TRBs.  With scatter-gather transfers, each SG element
> > typically refers to something like a 2-page buffer (depends on how
> > fragmented the memory is).  Therefore a ring segment will describe
> > somewhere around 512 pages of data, i.e., something like 2 MB.  Since
> > SuperSpeed is 500 MB/s, you'd end up getting in the vicinity of 250
> > interrupts every second just because of ring segment crossings.
> 
> 250 interrupts/sec is noise. Send/receive 13000 ethernet packets/sec
> and then look at the interrupt rate!
> 
> There is no necessity for taking an interrupt from every link segment.

Yes, there is.  The HCD needs to know when the dequeue pointer has
moved beyond the end of the ring segment, so that it can start reusing
the TRB slots in that segment.

Suppose you have queued a bulk URB because there weren't enough free 
TRB slots.  How else would you know when the occupied slots became 
available?

> The current ring segments contain 64 entries, a strange choice
> since they are created with 2 segments.
> (The ring expansion code soon doubles that for my ethernet traffic.)
> 
> I would change the code to use a single segment (for coding simplicity)
> and queue bulk URB when there isn't enough ring space.
> URB with too many fragments could either be rejected, sent in sections,
> or partially linearised (and probably still sent in sections).

Rejecting an URB is not feasible.  Splitting it up into multiple TDs is
not acceptable, as explained above.  Sending it in sections (i.e.,
queueing only some of the TRBs at any time) would work, provided you
got at least two interrupts every time the queue wrapped around (which
suggests you might want at least two ring segments).

Alan Stern

David Laight Nov. 12, 2013, 4:50 p.m. UTC | #11
> > If all the fragments are larger than the MBP (assume 16k) then
> > that would be relatively easy. However that is very dependant
> > on the source of the data. It might be true for disk data, but
> > is unlikely to be true for ethernet data.
> 
> I don't quite understand your point.  Are you saying that if all the
> TRBs are very short, you might need more than 64 TRBs to reach a 16-KB
> boundary?

Since you don't really want to do all the work twice, the sensible
way is to add each input fragment to the ring one at a time.
If you need to cross a link TRB and the last MBP boundary is within
the previous data TRB then you can split the previous data TRB at the
MBP boundary and continue.
If the previous TRB is short then you'd need to go back through the
earlier TRBs until you found the one that contains an MBP boundary,
split it, and write a link TRB in the following slot.
If you are within the first MBP then you'd need to replace the first
TRB of the message with a link TRB.

And yes, if the data is really fragmented you might need a lot of
TRBs for even 1k of data.

> > For bulk data the link TRB can be forced at a packet boundary
> > by splitting the TD up - the receiving end won't know the difference.
> 
> That won't work.  What happens if you split a TD up into two pieces and
> the first piece receives a short packet?  The host controller will
> automatically move to the start of the second piece.  That's not what
> we want.

You can split a bulk TD on a 1k boundary and the target won't know the
difference.

> > There is no necessity for taking an interrupt from every link segment.
> 
> Yes, there is.  The HCD needs to know when the dequeue pointer has
> moved beyond the end of the ring segment, so that it can start reusing
> the TRB slots in that segment.

You already know that because of the interrupts for the data packets
themselves.

> > I would change the code to use a single segment (for coding simplicity)
> > and queue bulk URB when there isn't enough ring space.
> > URB with too many fragments could either be rejected, sent in sections,
> > or partially linearised (and probably still sent in sections).
> 
> Rejecting an URB is not feasible.  Splitting it up into multiple TDs is
> not acceptable, as explained above.  Sending it in sections (i.e.,
> queueing only some of the TRBs at any time) would work, provided you
> got at least two interrupts every time the queue wrapped around (which
> suggests you might want at least two ring segments).

Rejecting badly fragmented URBs is almost certainly ok. You really don't
want the hardware overhead of processing a TRB every few bytes.
This would be especially bad on iommu systems.
Before the ring expansion code was added there was an implicit
limit of (probably) 125 fragments for a URB. Exceeding this limit
wasn't the reason for adding the ring expansion code.

And, as I've pointed out, both bulk and isoc URBs can be split unless
they are using too many fragments for a short piece of data.

The current code refuses to write into a TRB segment until it is
empty - I think that restriction is only there so that it can
add another segment when the space runs out.

	David





Sarah Sharp Nov. 12, 2013, 5:15 p.m. UTC | #12
On Mon, Nov 11, 2013 at 03:34:53PM -0500, Alan Stern wrote:
> On Mon, 11 Nov 2013, David Laight wrote:
> 
> > > Suppose, for example, the MBP is 1024.  If you have a TD with length
> > > 1500, and if it had only one fragment, the last (and only) fragment's
> > > length would not less than the MBP and it would not be an exact
> > > multiple of the MBP.
> > 
> > That doesn't matter - eg example 2 in figure 25
> 
> You're right.  I do wish the spec had been written more clearly.
> 
> > Reading it all again makes me think that a LINK trb is only
> > allowed on the burst boundary (which might be 16k bytes).
> > The only real way to implement that is to ensure that TD never
> > contain LINK TRB.
> 
> That's one way to do it.  Or you could allow a Link TRB at an 
> intermediate MBP boundary.

I like this idea instead.  The xHCI driver should be modified to be able
to handle link TRBs in the middle of the segments (the cancellation code
would have to be touched as well).  We would keep a running count
of the number of bytes left in a TD fragment, as we fill in the TRBs.
If we find the TD fragment would span a link TRB, we backtrack to the
end of the last TD fragment, put in a link TRB, and then continue on the
next segment.
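
Roughly, the bookkeeping would look like this (frag_bytes and the unwind
helper are hypothetical, nothing that exists in the driver today):

	/* bytes queued since the last TD fragment boundary */
	ring->frag_bytes = (ring->frag_bytes + trb_len) % mbp;

	if (TRB_TYPE_LINK_LE32(next_trb->link.control) && ring->frag_bytes) {
		/* The fragment would span the link TRB: back the enqueue
		 * pointer up to the last fragment boundary, write the link
		 * TRB there, and requeue the partial fragment at the top
		 * of the next segment. */
		unwind_to_fragment_boundary(ring);	/* hypothetical */
	}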

> It comes down to a question of how often you want the controller to
> issue an interrupt.  If a ring segment is 4 KB (one page), then it can
> hold 256 TRBs.  With scatter-gather transfers, each SG element
> typically refers to something like a 2-page buffer (depends on how
> fragmented the memory is).  Therefore a ring segment will describe
> somewhere around 512 pages of data, i.e., something like 2 MB.  Since
> SuperSpeed is 500 MB/s, you'd end up getting in the vicinity of 250
> interrupts every second just because of ring segment crossings.

The driver is currently defined to have 64 TRBs per ring segment.  But
that doesn't matter; we don't get an interrupt when a ring segment is
crossed.  Instead we set the interrupt-on-completion flag on the last
TRB of the TD, not on any earlier fragment or link TRB.

> Using larger ring segments would help.

Ring segments have to be physically contiguous, so I'm not sure if we
want to ask for segments that are bigger than a page.  I've already got
a report from someone else about the ring expansion getting out of
control, so I would like to figure that out before we talk about using
even bigger segments.

Finally, it's interesting to note that the USB mass storage driver is
using scatter gather lists just fine without the driver following the TD
fragment rules.  Or at least no one has reported any issues.  I wonder
why it works?

Sarah Sharp
Sarah Sharp Nov. 12, 2013, 5:15 p.m. UTC | #13
On Fri, Nov 08, 2013 at 11:07:41AM -0000, David Laight wrote:
> > While this change improves things a lot (it runs for 20 minutes rather than
> > a few seconds), there is still something else wrong.
> 
> Almost certainly caused by an unrelated local change.

Ok, good, I was concerned there was another issue we were missing.

Sarah Sharp
David Laight Nov. 12, 2013, 5:28 p.m. UTC | #14
> On Fri, Nov 08, 2013 at 11:07:41AM -0000, David Laight wrote:
> > > While this change improves things a lot (it runs for 20 minutes rather than
> > > a few seconds), there is still something else wrong.
> >
> > Almost certainly caused by an unrelated local change.
> 
> Ok, good, I was concerned there was another issue we were missing.

My modified code wasn't setting the td_size properly.
That is fixed and verified in the later patch I sent.

	David



David Laight Nov. 12, 2013, 5:33 p.m. UTC | #15
> > That's one way to do it.  Or you could allow a Link TRB at an
> > intermediate MBP boundary.
> 
> I like this idea instead.  The xHCI driver should be modified to be able
> to handle link TRBs in the middle of the segments (the cancellation code
> would have to be touched as well).  We would keep a running count
> of the number of bytes left in a TD fragment, as we fill in the TRBs.
> If we find the TD fragment would span a link TRB, we backtrack to the
> end of the last TD fragment, put in a link TRB, and then continue on the
> next segment.

I'd do that as a later change.
It needs a fair amount of other stuff fixing first and the current
code is rather broken - and needs a fix for stable.

> > It comes down to a question of how often you want the controller to
> > issue an interrupt.  If a ring segment is 4 KB (one page), then it can
> > hold 256 TRBs.  With scatter-gather transfers, each SG element
> > typically refers to something like a 2-page buffer (depends on how
> > fragmented the memory is).  Therefore a ring segment will describe
> > somewhere around 512 pages of data, i.e., something like 2 MB.  Since
> > SuperSpeed is 500 MB/s, you'd end up getting in the vicinity of 250
> > interrupts every second just because of ring segment crossings.
> 
> The driver is currently defined to have 64 TRBs per ring segment.  But
> that doesn't matter; we don't get an interrupt when a ring segment is
> crossed.  Instead we set the interrupt-on-completion flag on the last
> TRB of the TD, not on any earlier fragment or link TRB.
> 
> > Using larger ring segments would help.
> 
> Ring segments have to be physically contiguous, so I'm not sure if we
> want to ask for segments that are bigger than a page.  I've already got
> a report from someone else about the ring expansion getting out of
> control, so I would like to figure that out before we talk about using
> even bigger segments.

I'd vote for a single segment containing either 128 or 256 TRBs.
That is less than a single page.

> Finally, it's interesting to note that the USB mass storage driver is
> using scatter gather lists just fine without the driver following the TD
> fragment rules.  Or at least no one has reported any issues.  I wonder
> why it works?

I suspect that breaking the rules just generates two TDs.
The mass storage driver is probably almost always presenting
buffer fragments that are page sized and page aligned.
In which case the split is invisible at the target.

	David



Alan Stern Nov. 12, 2013, 7 p.m. UTC | #16
On Tue, 12 Nov 2013, David Laight wrote:

> > > If all the fragments are larger than the MBP (assume 16k) then
> > > that would be relatively easy. However that is very dependant
> > > on the source of the data. It might be true for disk data, but
> > > is unlikely to be true for ethernet data.
> > 
> > I don't quite understand your point.  Are you saying that if all the
> > TRBs are very short, you might need more than 64 TRBs to reach a 16-KB
> > boundary?
> 
> Since you don't really want to do all the work twice, the sensible
> way is to add each input fragment to the ring one a time.
> If you need to cross a link TRB and the last MBP boundary is within
> the previous data TRB then you can split the previous data TRB at the
> MBP boundary and continue.
> If the previous TRB is short then you'd need to go back through the
> earlier TRB until you found the one that contains a TRB boundary,
> split it, and write a link TRB in the following slot.

Right.  You could avoid doing a lot of work twice by using two passes.  
In the first pass, keep track of the number of bytes allotted to each
TRB and the MBP boundaries, without doing anything else.  That way,
you'll know where to put a Link TRB without any need for backtracking.  
Then in the second pass, fill in the actual contents of the TRBs.
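
A sketch of those two passes (plan_trb() and emit_planned_trbs() are made-up
names for illustration):

	/* Pass 1: walk the sg list recording how many bytes each TRB will
	 * carry and where the MBP boundaries fall, so the position of the
	 * link TRB is known before anything is written to the ring. */
	for_each_sg(urb->sg, sg, urb->num_sgs, i)
		plan_trb(&plan, sg_dma_address(sg), sg_dma_len(sg), mbp);

	/* Pass 2: fill in the real TRBs, placing the link TRB at the MBP
	 * boundary chosen in pass 1. */
	emit_planned_trbs(xhci, ep_ring, &plan);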

> If you are within the first MBP then you'd need to replace the first
> TRB of the message with a link TRB.

Unless the first TRB of the message is the first TRB of the ring 
segment.  Then you're in trouble...  Hopefully, real URBs will never be 
that badly fragmented.

> And yes, if the data is really fragmented you might need a lot of
> TRB for even 1k of data.

Until somebody needs more than 16 TRBs for each KB of data, we should 
be okay.

> > > For bulk data the link TRB can be forced at a packet boundary
> > > by splitting the TD up - the receiving end won't know the difference.
> > 
> > That won't work.  What happens if you split a TD up into two pieces and
> > the first piece receives a short packet?  The host controller will
> > automatically move to the start of the second piece.  That's not what
> > we want.
> 
> You can split a bulk TD on a 1k boundary and the target won't know the
> difference.

The target won't know the difference, but the host will.  Here's an
example: Suppose the driver submits two URBs, each for a data-in
transfer of 32 KB.  You split each URB up into two 16-KB TDs; let's 
call them A, B, C, and D (where A and B make up the first URB, and C 
and D make up the second URB).

Now suppose the device terminates the first transfer after only 7200 
bytes, with a short packet.  It then sends a complete 32 KB of data for 
the second transfer.  What will happen in this case?

7200 bytes is somewhere in the middle of TD A.  That TD will terminate 
early.  The host controller will then proceed to TD B.  As a result, 
the next 32 KB of data will be stored in TD B and C -- not in TD C and 
D as it should be.

This example shows why you must not split IN URBs into multiple TDs.

> > > There is no necessity for taking an interrupt from every link segment.
> > 
> > Yes, there is.  The HCD needs to know when the dequeue pointer has
> > moved beyond the end of the ring segment, so that it can start reusing
> > the TRB slots in that segment.
> 
> You already know that because of the interrupts for the data packets
> themselves.

I don't know what you mean by this.  The host controller does not 
generate an interrupt for each data packet.  It generates interrupts
only for TRBs with the IOC bit set (which is generally the last TRB in 
each TD).

Suppose you have only two ring segments, and a driver submits an URB 
which is so fragmented that it requires more TRBs than you have room 
for in those two segments.  When do you want the interrupts to arrive?  
Answer: At each segment crossing.

> > > I would change the code to use a single segment (for coding simplicity)
> > > and queue bulk URB when there isn't enough ring space.
> > > URB with too many fragments could either be rejected, sent in sections,
> > > or partially linearised (and probably still sent in sections).
> > 
> > Rejecting an URB is not feasible.  Splitting it up into multiple TDs is
> > not acceptable, as explained above.  Sending it in sections (i.e.,
> > queueing only some of the TRBs at any time) would work, provided you
> > got at least two interrupts every time the queue wrapped around (which
> > suggests you might want at least two ring segments).
> 
> Rejecting badly fragmented URB is almost certainly ok. You really don't
> want the hardware overhead of processing a TRB every few bytes.
> This would be especially bad on iommu systems.

I disagree.  Sure, it would be bad.  But if you can handle it, you 
should.

> Before the ring expansion code was added there was an implicit
> limit of (probably) 125 fragments for a URB. Exceeding this limit
> wasn't the reason for adding the ring expansion code.
> 
> And, as I've pointed out, both bulk and isoc URB can be split unless
> they are using too many fragments for a short piece of data.
> 
> The current code refuses to write into a TRB segment until it is
> empty - I think that restriction is only there so that it can
> add another segment when the space runs out.

That's right.  The idea was my suggestion.

Alan Stern

Alan Stern Nov. 12, 2013, 7:22 p.m. UTC | #17
On Tue, 12 Nov 2013, Sarah Sharp wrote:

> > It comes down to a question of how often you want the controller to
> > issue an interrupt.  If a ring segment is 4 KB (one page), then it can
> > hold 256 TRBs.  With scatter-gather transfers, each SG element
> > typically refers to something like a 2-page buffer (depends on how
> > fragmented the memory is).  Therefore a ring segment will describe
> > somewhere around 512 pages of data, i.e., something like 2 MB.  Since
> > SuperSpeed is 500 MB/s, you'd end up getting in the vicinity of 250
> > interrupts every second just because of ring segment crossings.
> 
> The driver is currently defined to have 64 TRBs per ring segment.  But

A TRB is 16 bytes, right?  So a page can hold 256 TRBs.  Why use only 
64 per segment?

> that doesn't matter; we don't get an interrupt when a ring segment is
> crossed.  Instead we set the interrupt-on-completion flag on the last
> TRB of the TD, not on any earlier fragment or link TRB.

That's because you don't worry about handling URBs which require too 
many TRBs (i.e., more than are available).  You just add more ring 
segments.  Instead, you could re-use segments on the fly.

For example, suppose you have only two ring segments and you get an URB
which requires enough TRBs to fill up four segments.  You could fill in
the first two segments worth, and get an interrupt when the controller
traverses the Link TRB between them.  At that point you store the third
set of TRBs in the first segment, which is now vacant.  Similarly, when
the second Link TRB is traversed, you fill in the fourth set of TRBs.

> > Using larger ring segments would help.
> 
> Ring segments have to be physically contiguous, so I'm not sure if we
> want to ask for segments that are bigger than a page.  I've already got
> a report from someone else about the ring expansion getting out of
> control, so I would like to figure that out before we talk about using
> even bigger segments.

Maybe you can get away with fewer segments, if they are bigger.

> Finally, it's interesting to note that the USB mass storage driver is
> using scatter gather lists just fine without the driver following the TD
> fragment rules.  Or at least no one has reported any issues.  I wonder
> why it works?

I'd guess this is because the hardware is actually a lot more flexible
than the "No Link TRBs in the middle of a TD fragment" rule.

The whole idea of TD fragments makes no sense to begin with.  What
point is there in grouping packets into MaxBurst-sized collections?  
The hardware does not have to finish one burst before beginning the
next one.  For example, suppose the MaxBurst size is 8.  The host
starts by bursting packets 1-8.  When it receives the ACK for packet 4,
the host could then burst packets 9-12.  It doesn't have to wait for
the ACK to packet 8.  (Unless I have misunderstood the USB-3 spec.)

If the host does this, the burst boundaries won't occur on MBP 
boundaries, and hence won't occur on TD fragment boundaries.  The 
fragment boundaries will be essentially meaningless.

Alan Stern

David Laight Nov. 13, 2013, 9:28 a.m. UTC | #18
> > Since you don't really want to do all the work twice, the sensible
> > way is to add each input fragment to the ring one a time.
> > If you need to cross a link TRB and the last MBP boundary is within
> > the previous data TRB then you can split the previous data TRB at the
> > MBP boundary and continue.
> > If the previous TRB is short then you'd need to go back through the
> > earlier TRB until you found the one that contains a TRB boundary,
> > split it, and write a link TRB in the following slot.
> 
> Right.  You could avoid doing a lot of work twice by using two passes.
> In the first pass, keep track of the number of bytes allotted to each
> TRB and the MBP boundaries, without doing anything else.  That way,
> you'll know where to put a Link TRB without any need for backtracking.
> Then in the second pass, fill in the actual contents of the TRBs.

No - you really don't want to process everything twice.
You could remember the TRB and offset of the last MBP boundary.
 
> > If you are within the first MBP then you'd need to replace the first
> > TRB of the message with a link TRB.
> 
> Unless the first TRB of the message is the first TRB of the ring
> segment.  Then you're in trouble...  Hopefully, real URBs will never be
> that badly fragmented.

I suspect that ones from the network stack might be badly fragmented.
The code needs to at least detect and error them.
I don't think linux skb can be as fragmented as I've seen network
buffers on other systems - 1 byte per buffer chained together to
a maximal sized ethernet packet.
 
> > > > For bulk data the link TRB can be forced at a packet boundary
> > > > by splitting the TD up - the receiving end won't know the difference.
> > >
> > > That won't work.  What happens if you split a TD up into two pieces and
> > > the first piece receives a short packet?  The host controller will
> > > automatically move to the start of the second piece.  That's not what
> > > we want.
> >
> > You can split a bulk TD on a 1k boundary and the target won't know the
> > difference.
> 
> The target won't know the difference, but the host will.  Here's an
> example: Suppose the driver submits two URBs, each for a data-in
> transfer of 32 KB.  You split each URB up into two 16-KB TDs; let's
> call them A, B, C, and D (where A and B make up the first URB, and C
> and D make up the second URB).

I was thinking about OUT transfers; IN ones are unlikely to be badly
fragmented.

> > > > There is no necessity for taking an interrupt from every link segment.
> > >
> > > Yes, there is.  The HCD needs to know when the dequeue pointer has
> > > moved beyond the end of the ring segment, so that it can start reusing
> > > the TRB slots in that segment.
> >
> > You already know that because of the interrupts for the data packets
> > themselves.
> 
> I don't know what you mean by this.  The host controller does not
> generate an interrupt for each data packet.  In generates interrupts
> only for TRBs with the IOC bit set (which is generally the last TRB in
> each TD).
> 
> Suppose you have only two ring segments, and a driver submits an URB
> which is so fragmented that it requires more TRBs than you have room
> for in those two segments.  When do you want the interrupts to arrive?
> Answer: At each segment crossing.

You bounce the original request and fix the driver to submit URB
with fewer fragments.

> > > > I would change the code to use a single segment (for coding simplicity)
> > > > and queue bulk URB when there isn't enough ring space.
> > > > URB with too many fragments could either be rejected, sent in sections,
> > > > or partially linearised (and probably still sent in sections).
> > >
> > > Rejecting an URB is not feasible.  Splitting it up into multiple TDs is
> > > not acceptable, as explained above.  Sending it in sections (i.e.,
> > > queueing only some of the TRBs at any time) would work, provided you
> > > got at least two interrupts every time the queue wrapped around (which
> > > suggests you might want at least two ring segments).
> >
> > Rejecting badly fragmented URB is almost certainly ok. You really don't
> > want the hardware overhead of processing a TRB every few bytes.
> > This would be especially bad on iommu systems.
> 
> I disagree.  Sure, it would be bad.  But if you can handle it, you
> should.

You have to draw the line somewhere.

	David



David Laight Nov. 13, 2013, 9:45 a.m. UTC | #19
> > that doesn't matter; we don't get an interrupt when a ring segment is
> > crossed.  Instead we set the interrupt-on-completion flag on the last
> > TRB of the TD, not on any earlier fragment or link TRB.
> 
> That's because you don't worry about handling URBs which require too
> many TRBs (i.e., more than are available).  You just add more ring
> segments.  Instead, you could re-use segments on the fly.
> 
> For example, suppose you have only two ring segments and you get an URB
> which requires enough TRBs to fill up four segments.  You could fill in
> the first two segments worth, and get an interrupt when the controller
> traverses the Link TRB between them.  At that point you store the third
> set of TRBs in the first segment, which is now vacant.  Similarly, when
> the second Link TRB is traversed, you fill in the fourth set of TRBs.

That isn't going to work.
It might work for very long TDs, but for very fragmented ones the interrupt
rate would be stupid. In any case you'd definitely need enough ring
segments for the TRBs that describe a single 'max packet size' block.

> > Finally, it's interesting to note that the USB mass storage driver is
> > using scatter gather lists just fine without the driver following the TD
> > fragment rules.  Or at least no one has reported any issues.  I wonder
> > why it works?
> 
> I'd guess this is because the hardware is actually a lot more flexible
> than the "No Link TRBs in the middle of a TD fragment" rule.

With the hardware I have (Intel i7 Sandy bridge) Link TRB cannot
be placed at arbitrary boundaries.
I don't know whether the actual restriction is only to packet boundaries.
I don't have a USB3 monitor and it would also require a more contrived
test than I've been doing.

> The whole idea of TD fragments makes no sense to begin with.  What
> point is there in grouping packets into MaxBurst-sized collections?

It probably saved a few logic gates somewhere, either that or it is
a bug in some hardware implementation that got documented in the spec
instead of being fixed :-)

	David



Alan Stern Nov. 13, 2013, 4:20 p.m. UTC | #20
On Wed, 13 Nov 2013, David Laight wrote:

> > > Since you don't really want to do all the work twice, the sensible
> > > way is to add each input fragment to the ring one at a time.
> > > If you need to cross a link TRB and the last MBP boundary is within
> > > the previous data TRB then you can split the previous data TRB at the
> > > MBP boundary and continue.
> > > If the previous TRB is short then you'd need to go back through the
> > > earlier TRB until you found the one that contains a TRB boundary,
> > > split it, and write a link TRB in the following slot.
> > 
> > Right.  You could avoid doing a lot of work twice by using two passes.
> > In the first pass, keep track of the number of bytes allotted to each
> > TRB and the MBP boundaries, without doing anything else.  That way,
> > you'll know where to put a Link TRB without any need for backtracking.
> > Then in the second pass, fill in the actual contents of the TRBs.
> 
> No - you really don't want to process everything twice.

What I described does not process anything twice.  It calculates the 
byte counts on the first pass and everything else on the second pass.

> You could remember the TRB and offset of the last MBP boundary.

Then you really _do_ end up processing the TRBs which follow the last 
MBP boundary twice.  Once when setting up the TD as normal, and then 
again after you realize they need to be moved to a different ring 
segment.

> > Unless the first TRB of the message is the first TRB of the ring
> > segment.  Then you're in trouble...  Hopefully, real URBs will never be
> > that badly fragmented.
> 
> I suspect that ones from the network stack might be badly fragmented.
> The code needs to at least detect and error them.

I'm not so sure about this.  The network stack works okay with other
host controller drivers (like ehci-hcd), which have much stricter
requirements about fragmentation.  They require that the length of each
SG element except the last one must be a multiple of the maxpacket
size.  For high-speed bulk transfers, that means a multiple of 512.
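Spelled out, the check is no more than this (an illustration, not code
taken from ehci-hcd):

static int sg_lengths_ok(const unsigned int *lengths, int nents,
			 unsigned int maxpacket)
{
	int i;

	/* Every SG element except the last must be a whole number of
	 * max-packet-size units (512 bytes for high-speed bulk). */
	for (i = 0; i < nents - 1; i++)
		if (lengths[i] % maxpacket)
			return 0;	/* caller must linearise or bounce */
	return 1;
}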

> > > You can split a bulk TD on a 1k boundary and the target won't know the
> > > difference.
> > 
> > The target won't know the difference, but the host will.  Here's an
> > example: Suppose the driver submits two URBs, each for a data-in
> > transfer of 32 KB.  You split each URB up into two 16-KB TDs; let's
> > call them A, B, C, and D (where A and B make up the first URB, and C
> > and D make up the second URB).
> 
> I was thinking about OUT transfers, IN ones are unlikely to be badly
> fragmented.

Maybe not for the network stack, but OUT and IN end up equally 
fragmented for the storage stack.

> > Suppose you have only two ring segments, and a driver submits an URB
> > which is so fragmented that it requires more TRBs than you have room
> > for in those two segments.  When do you want the interrupts to arrive?
> > Answer: At each segment crossing.
> 
> You bounce the original request and fix the driver to submit URB
> with fewer fragments.

In other words, you want to limit the number of SG elements the driver 
will accept.

Alan Stern

Alan Stern Nov. 13, 2013, 4:27 p.m. UTC | #21
On Wed, 13 Nov 2013, David Laight wrote:

> > > that doesn't matter; we don't get an interrupt when a ring segment is
> > > crossed.  Instead we set the interrupt-on-completion flag on the last
> > > TRB of the TD, not on any earlier fragment or link TRB.
> > 
> > That's because you don't worry about handling URBs which require too
> > many TRBs (i.e., more than are available).  You just add more ring
> > segments.  Instead, you could re-use segments on the fly.
> > 
> > For example, suppose you have only two ring segments and you get an URB
> > which requires enough TRBs to fill up four segments.  You could fill in
> > the first two segments worth, and get an interrupt when the controller
> > traverses the Link TRB between them.  At that point you store the third
> > set of TRBs in the first segment, which is now vacant.  Similarly, when
> > the second Link TRB is traversed, you fill in the fourth set of TRBs.
> 
> That isn't going to work.

Queuing URBs and processing them in pieces was _your_ suggestion.  I
merely repeated it to Sarah and filled in some detail.

> It might work for very long TD, but for very fragmented ones the interrupt
> rate would be stupid.

Not if the individual SG elements are reasonably large -- say at least 
512 bytes.  They have to be that big if they are to work with ehci-hcd, 
so why not also for xhci-hcd?

>  In any case you'd definitely need enough ring
> segments for the TRB that describe a single 'max packet size' block.

That would be two TRBs (for maxpacket = 1024), meaning you'd need at
least one ring segment.  That's not much of a restriction.

> > > Finally, it's interesting to note that the USB mass storage driver is
> > > using scatter gather lists just fine without the driver following the TD
> > > fragment rules.  Or at least no one has reported any issues.  I wonder
> > > why it works?
> > 
> > I'd guess this is because the hardware is actually a lot more flexible
> > than the "No Link TRBs in the middle of a TD fragment" rule.
> 
> With the hardware I have (Intel i7 Sandy bridge) Link TRB cannot
> be placed at arbitrary boundaries.
> I don't know whether the actual restriction is only to packet boundaries.

That's what I had in mind by "more flexible" above.

> I don't have a USB3 monitor and it would also require a more contrived
> test than I've been doing.

You wouldn't need a monitor to check it, but you would need some 
careful tests.

> > The whole idea of TD fragments makes no sense to begin with.  What
> > point is there in grouping packets into MaxBurst-sized collections?
> 
> It probably saved a few logic gates somewhere, either that or it is
> a bug in some hardware implementation that got documented in the spec
> instead of being fixed :-)

I can believe the guess about it reflecting a bug.  Particularly since 
it is so badly written.

Alan Stern

David Laight Nov. 13, 2013, 4:58 p.m. UTC | #22
> -----Original Message-----
> From: Alan Stern [mailto:stern@rowland.harvard.edu]
> Sent: 13 November 2013 16:21
> To: David Laight
> Cc: Sarah Sharp; netdev@vger.kernel.org; linux-usb@vger.kernel.org
> Subject: RE: [PATCH] usb: xhci: Link TRB must not occur with a USB payload burst.
> 
> On Wed, 13 Nov 2013, David Laight wrote:
> 
> > > > Since you don't really want to do all the work twice, the sensible
> > > > way is to add each input fragment to the ring one at a time.
> > > > If you need to cross a link TRB and the last MBP boundary is within
> > > > the previous data TRB then you can split the previous data TRB at the
> > > > MBP boundary and continue.
> > > > If the previous TRB is short then you'd need to go back through the
> > > > earlier TRB until you found the one that contains a TRB boundary,
> > > > split it, and write a link TRB in the following slot.
> > >
> > > Right.  You could avoid doing a lot of work twice by using two passes.
> > > In the first pass, keep track of the number of bytes allotted to each
> > > TRB and the MBP boundaries, without doing anything else.  That way,
> > > you'll know where to put a Link TRB without any need for backtracking.
> > > Then in the second pass, fill in the actual contents of the TRBs.
> >
> > No - you really don't want to process everything twice.
> 
> What I described does not process anything twice.  It calculates the
> byte counts on the first pass and everything else on the second pass.
> 
> > You could remember the TRB and offset of the last MBP boundary.
> 
> Then you really _do_ end up processing the TRB's which follow the last
> MBP boundary twice.  Once when setting up the TD as normal, and then
> again after you realize they need to be moved to a different ring
> segment.

That would only be a few of the TRBs, not all of them.
The cheap option is to work out an upper bound for the number of TRBs
and then write a new LINK TRB if there isn't enough space.
That does limit the maximum number of TRBs in a URB to half the total
ring (less if it has more than 2 fragments).
 
> > > Unless the first TRB of the message is the first TRB of the ring
> > > segment.  Then you're in trouble...  Hopefully, real URBs will never be
> > > that badly fragmented.
> >
> > I suspect that ones from the network stack might be badly fragmented.
> > The code needs to at least detect and error them.
> 
> I'm not so sure about this.  The network stack works okay with other
> host controller drivers (like ehci-hcd), which have much stricter
> requirements about fragmentation.  They require that the length of each
> SG element except the last one must be a multiple of the maxpacket
> size.  For high-speed bulk transfers, that means a multiple of 512.

Yes, for other host controllers the network cards linearise the skb.
It is only for xhci where the buffer fragments from the skb
can be given to the usb controller.
This is probably also only enabled (at the moment) for the ASIX GigE card
- which can also do TCP transmit segmentation offload (TSO).
It is effectively a necessity for TSO since it would otherwise
require 64k blocks of contiguous memory (or some horrid code that
involves a misaligned copy).
The SG code in usbnet was only added for 3.12.

> > > > You can split a bulk TD on a 1k boundary and the target won't know the
> > > > difference.
> > >
> > > The target won't know the difference, but the host will.  Here's an
> > > example: Suppose the driver submits two URBs, each for a data-in
> > > transfer of 32 KB.  You split each URB up into two 16-KB TDs; let's
> > > call them A, B, C, and D (where A and B make up the first URB, and C
> > > and D make up the second URB).
> >
> > I was thinking about OUT transfers, IN ones are unlikely to be badly
> > fragmented.
> 
> Maybe not for the network stack, but OUT and IN end up equally
> fragmented for the storage stack.

But the minimum fragment size is (probably) 4k.
For the network stack an OUT transfer might have a lot (and I mean lots)
of fragments (there may be constraints, and linearising the skb is an option).

> > > Suppose you have only two ring segments, and a driver submits an URB
> > > which is so fragmented that it requires more TRBs than you have room
> > > for in those two segments.  When do you want the interrupts to arrive?
> > > Answer: At each segment crossing.
> >
> > You bounce the original request and fix the driver to submit URB
> > with fewer fragments.
> 
> In other words, you want to limit the number of SG elements the driver
> will accept.

Yes, passing the buck to the higher layer.
If the ring was a single segment of 256 slots you'd have to limit the
number of TRBs to 128. If the alignment is horrid this could be under 64
fragments (because of the 64k boundary restriction).
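The worst case per fragment is easy to bound; purely as a sketch (not
the driver's helper - the 64k limit is just written out here):

#define TRB_BUFF_LIMIT	(1ULL << 16)	/* a TRB buffer must not cross 64k */

static unsigned int trbs_for_fragment(unsigned long long dma_addr,
				      unsigned int len)
{
	unsigned long long first, last;

	if (!len)
		return 0;
	/* one TRB for every 64k region the fragment touches */
	first = dma_addr / TRB_BUFF_LIMIT;
	last = (dma_addr + len - 1) / TRB_BUFF_LIMIT;
	return (unsigned int)(last - first) + 1;
}

For fragments under 64k that is at most 2 TRBs each, which is where the
'under 64 fragments' figure above comes from.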

What I don't know is whether the code to increase the ring size was added
because a single URB contained too many fragments, or because a lot
of URBs were submitted at once - I don't remember anything in the commit
messages about why.

	David



Ben Hutchings Nov. 16, 2013, 1:38 a.m. UTC | #23
On Wed, 2013-11-13 at 16:58 +0000, David Laight wrote:
[...]
> > > > > You can split a bulk TD on a 1k boundary and the target won't know the
> > > > > difference.
> > > >
> > > > The target won't know the difference, but the host will.  Here's an
> > > > example: Suppose the driver submits two URBs, each for a data-in
> > > > transfer of 32 KB.  You split each URB up into two 16-KB TDs; let's
> > > > call them A, B, C, and D (where A and B make up the first URB, and C
> > > > and D make up the second URB).
> > >
> > > I was thinking about OUT transfers, IN ones are unlikely to be badly
> > > fragmented.
> > 
> > Maybe not for the network stack, but OUT and IN end up equally
> > fragmented for the storage stack.
> 
> But the minimum fragment size is (probably) 4k.
> For the network stack an OUT transfer might have a lot (and I mean lots)
> of fragments (there may be constraints, and linearising the skb is a option).
[...]

The maximum number of fragments in the skb is going to be 17 (including
the 'head' area).  (I'm ignoring NETIF_F_FRAGLIST which is not normally
supported by physical device drivers.)

I don't know how many fragments that can end up as, at the USB level.

Ben.
David Laight Nov. 18, 2013, 9:48 a.m. UTC | #24
> > But the minimum fragment size is (probably) 4k.
> > For the network stack an OUT transfer might have a lot (and I mean lots)
> > of fragments (there may be constraints, and linearising the skb is a option).
> [...]
> 
> The maximum number of fragments in the skb is going to be 17 (including
> the 'head' area).  (I'm ignoring NETIF_F_FRAGLIST which is not normally
> supported by physical device drivers.)
> 
> I don't know how many fragments that can end up as, at the USB level.

If you assume that every fragment crosses a 64k boundary that would be 34.
OTOH I've not seen a fragment of a 64k TSO send crossing a 32k
boundary, and I think the 'head' area is constrained to be part of
a single (4k or larger) page.

Isn't there something odd about skb merged by receive offload?
I've not entirely sorted out the full structure of skb.

	David
Ben Hutchings Nov. 18, 2013, 3:03 p.m. UTC | #25
On Mon, 2013-11-18 at 09:48 +0000, David Laight wrote:
> > > But the minimum fragment size is (probably) 4k.
> > > For the network stack an OUT transfer might have a lot (and I mean lots)
> > > of fragments (there may be constraints, and linearising the skb is a option).
> > [...]
> > 
> > The maximum number of fragments in the skb is going to be 17 (including
> > the 'head' area).  (I'm ignoring NETIF_F_FRAGLIST which is not normally
> > supported by physical device drivers.)
> > 
> > I don't know how many fragments that can end up as, at the USB level.
> 
> If you assume that every fragment crosses a 64k boundary that would be 34.
> OTOH I've not seen a fragment of a 64k TSO send crossing a 32k
> boundary, and I think the 'head' area is constrained to be part of
> a single (4k or larger) page.

I don't know that it's possible at the moment, but I wouldn't recommend
relying on that.

> Isn't there something odd about skb merged by receive offload?
> I've not entirely sorted out the full structure of skb.

There has been some work to allow for using both the frags array and
frag list, but a driver will not see such an skb if it does not
advertise the NETIF_F_FRAGLIST feature.

Ben.
David Laight Nov. 18, 2013, 3:41 p.m. UTC | #26
> -----Original Message-----
> From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> Sent: 18 November 2013 15:03
> To: David Laight
> Cc: Alan Stern; Sarah Sharp; netdev@vger.kernel.org; linux-usb@vger.kernel.org
> Subject: Re: [PATCH] usb: xhci: Link TRB must not occur with a USB payload burst.
> 
> On Mon, 2013-11-18 at 09:48 +0000, David Laight wrote:
> > > > But the minimum fragment size is (probably) 4k.
> > > > For the network stack an OUT transfer might have a lot (and I mean lots)
> > > > of fragments (there may be constraints, and linearising the skb is a option).
> > > [...]
> > >
> > > The maximum number of fragments in the skb is going to be 17 (including
> > > the 'head' area).  (I'm ignoring NETIF_F_FRAGLIST which is not normally
> > > supported by physical device drivers.)
> > >
> > > I don't know how many fragments that can end up as, at the USB level.
> >
> > If you assume that every fragment crosses a 64k boundary that would be 34.
> > OTOH I've not seen a fragment of a 64k TSO send crossing a 32k
> > boundary, and I think the 'head' area is constrained to be part of
> > a single (4k or larger) page.
> 
> I don't know that it's possible at the moment, but I wouldn't recommend
> relying on that.

The xhci (USB3) hardware supports SG, but a non-obvious alignment restriction
applies at the end of a ring segment. In effect this means that the number of
fragments mustn't exceed the size of the ring segment.
It would make the xhci driver simpler if excessively fragmented requests
could just be bounced.
Since ring entries are 16 bytes there isn't much reason to not use a 4k
'page' for a ring and have (almost) 256 ring slots.
(The code currently uses multiple ring segments with 63 usable slots.)

Looks like skb are constrained enough that a sensible limit can be applied.

The other likely generator of fragmented requests is the mass storage code.
Most likely for dd(1) with a large block size.

	David
Sarah Sharp Nov. 19, 2013, 10:58 p.m. UTC | #27
On Mon, Nov 18, 2013 at 03:41:00PM -0000, David Laight wrote:
> 
> 
> > -----Original Message-----
> > From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> > Sent: 18 November 2013 15:03
> > To: David Laight
> > Cc: Alan Stern; Sarah Sharp; netdev@vger.kernel.org; linux-usb@vger.kernel.org
> > Subject: Re: [PATCH] usb: xhci: Link TRB must not occur with a USB payload burst.
> > 
> > On Mon, 2013-11-18 at 09:48 +0000, David Laight wrote:
> > > > > But the minimum fragment size is (probably) 4k.
> > > > > For the network stack an OUT transfer might have a lot (and I mean lots)
> > > > > of fragments (there may be constraints, and linearising the skb is a option).
> > > > [...]
> > > >
> > > > The maximum number of fragments in the skb is going to be 17 (including
> > > > the 'head' area).  (I'm ignoring NETIF_F_FRAGLIST which is not normally
> > > > supported by physical device drivers.)
> > > >
> > > > I don't know how many fragments that can end up as, at the USB level.
> > >
> > > If you assume that every fragment crosses a 64k boundary that would be 34.
> > > OTOH I've not seen a fragment of a 64k TSO send crossing a 32k
> > > boundary, and I think the 'head' area is constrained to be part of
> > > a single (4k or larger) page.
> > 
> > I don't know that it's possible at the moment, but I wouldn't recommend
> > relying on that.
> 
> The xhci (USB3) hardware supports SG, but a non-obvious alignment restriction
> applies at the end of a ring segment. In effect this means that the number of
> fragments mustn't exceed the size of the ring segment.
> It would make the xchi driver simpler if excessively fragmented requests
> could just be bounced.
> Since ring entries are 16 bytes there isn't much reason to not use a 4k
> 'page' for a ring and have (almost) 256 ring slots.
> (The code currently uses multiple ring segments with 63 usable slots.)
> 
> Looks like skb are constrained enough that a sensible limit can be applied.
> 
> The other likely generator of fragmented requests is the mass storage code.
> Most likely for dd(1) with a large block size.

The xHCI driver can limit the number of sg-list entries through
hcd->self.sg_tablesize.  It's currently set to ~0, which is "however
many entries you want."  You could set that to the number of TRBs in a
segment (minus one for the link TRB).
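In code terms it would be a one-liner where the driver currently accepts
everything - untested, just to show the shape of it:

	/* currently the driver effectively does: sg_tablesize = ~0 */
	hcd->self.sg_tablesize = TRBS_PER_SEGMENT - 1;

One slot per segment is taken by the link TRB, hence the minus one.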

The usb-storage and uas drivers currently use sg_tablesize.  Could the
network stack be taught to use sg_tablesize as well?

(Also, usb-storage aligns the block sizes to 512K, which explains why
we've never had an issue with TD fragments with that driver.)

Sarah Sharp
Alan Stern Nov. 20, 2013, 3:09 a.m. UTC | #28
On Tue, 19 Nov 2013, Sarah Sharp wrote:

> The xHCI driver can limit the number of sg-list entries through
> hcd->self.sg_tablesize.  It's currently set to ~0, which is "however
> many entries you want.  You could set that to the number of TRBs in a
> segment (minus one for the link TRB).
> 
> The usb-storage and uas drivers currently use sg_tablesize.  Could the
> network stack be taught to use sg_tablesize as well?

The sg_tablesize you're talking about is a field in struct usb_bus
(there's a similar field in struct scsi_host_template).  It's not
relevant to the network stack, since network interfaces aren't USB host
controllers (or SCSI hosts).

Alan Stern

David Laight Nov. 20, 2013, 9:36 a.m. UTC | #29
> From: Alan Stern [mailto:stern@rowland.harvard.edu]
> On Tue, 19 Nov 2013, Sarah Sharp wrote:
> 
> > The xHCI driver can limit the number of sg-list entries through
> > hcd->self.sg_tablesize.  It's currently set to ~0, which is "however
> > many entries you want.  You could set that to the number of TRBs in a
> > segment (minus one for the link TRB).
> >
> > The usb-storage and uas drivers currently use sg_tablesize.  Could the
> > network stack be taught to use sg_tablesize as well?
> 
> The sg_tablesize you're talking about is a field in struct usb_bus
> (there's a similar field in struct scsi_host_template).  It's not
> relevant to the network stack, since network interfaces aren't USB host
> controllers (or SCSI hosts).

Ben said the largest number of fragments from the current network
stack will be 17, and that none of them will cross 32k boundaries.
So the network stack won't send down long SG lists.

	David



David Laight Nov. 20, 2013, 9:46 a.m. UTC | #30
> From: Sarah Sharp
...
> (Also, usb-storage aligns the block sizes to 512K, which explains why
> we've never had an issue with TD fragments with that driver.)

What is a 'block' in that context?
512K sounds more like the value that very long transfers get chopped
up into. With 4k pages that might be 128 fragments.

I'd have thought that the SG list would normally contain references
to a number of memory pages - so each would be 4k (on x86) aligned.
My suspicion is that the xhci controller will generate correct USB3
data provided the link TRB is on a 1k boundary - so such data won't
be a problem.

If a user program does a direct transfer from the block device
(and that is done by locking down the user pages) then the buffer
could have an arbitrary alignment.

	David



Eric Dumazet Nov. 20, 2013, 3:03 p.m. UTC | #31
On Wed, 2013-11-20 at 09:36 +0000, David Laight wrote:

> Ben said the largest number of fragments from the current network
> stack will be 17, and that none of them will cross 32k boundaries.
> So the network stack won't send down long SG lists.

Please note that skb->head itself _might_ cross a 32K or 64K boundary :

skb->head is kmalloc() provided, and SLUB can be tweaked
(slub_max_order) to use very high order pages.



Sarah Sharp Nov. 20, 2013, 4:06 p.m. UTC | #32
On Wed, Nov 20, 2013 at 09:36:11AM -0000, David Laight wrote:
> > From: Alan Stern [mailto:stern@rowland.harvard.edu]
> > On Tue, 19 Nov 2013, Sarah Sharp wrote:
> > 
> > > The xHCI driver can limit the number of sg-list entries through
> > > hcd->self.sg_tablesize.  It's currently set to ~0, which is "however
> > > many entries you want.  You could set that to the number of TRBs in a
> > > segment (minus one for the link TRB).
> > >
> > > The usb-storage and uas drivers currently use sg_tablesize.  Could the
> > > network stack be taught to use sg_tablesize as well?
> > 
> > The sg_tablesize you're talking about is a field in struct usb_bus
> > (there's a similar field in struct scsi_host_template).  It's not
> > relevant to the network stack, since network interfaces aren't USB host
> > controllers (or SCSI hosts).
> 
> Ben said the largest number of fragments from the current network
> stack will be 17, and that none of them will cross 32k boundaries.
> So the network stack won't send down long SG lists.

Ok, so the networking layer should be fine.  However, with the current
patch, if the mass storage driver sends down a scatter-gather list
that's bigger than a ring segment, or needs to be split up so it doesn't
cross 64K boundaries, then the URB submission will fail.  We don't want
that to happen.

At the very least, we should limit hcd->self.sg_tablesize in
drivers/usb/host/xhci.c to 63 (TRBS_PER_SEGMENT - 1).  But we could
still potentially run into the 64K boundary issue in one or maybe all of
those entries.  Would it be crazy to limit the number of entries to half
that (31)?  It may impact performance, but it ensures that SCSI reads
and writes don't randomly fail.  We can always increase the ring segment
size in a later patch.
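The halving is just the worst case where every sg entry crosses one 64K
boundary and so costs two TRBs; as an illustration only:

static unsigned int safe_sg_entries(unsigned int trbs_per_segment)
{
	/* one slot for the link TRB, two TRBs per worst-case entry */
	return (trbs_per_segment - 1) / 2;
}

which gives (64 - 1) / 2 = 31 for the current segment size.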

Sarah Sharp
Alan Stern Nov. 20, 2013, 4:26 p.m. UTC | #33
On Wed, 20 Nov 2013, David Laight wrote:

> > From: Sarah Sharp
> ...
> > (Also, usb-storage aligns the block sizes to 512K, which explains why
> > we've never had an issue with TD fragments with that driver.)
> 
> What is a 'block' in that context?

I think Sarah means that usb-storage requires the block layer to align 
its data buffers to 512-byte boundaries.  (Note: 512 bytes, not 512K.)  
Disk I/O naturally tends to be done in units of the page size, anyway, 
although raw I/O can involve single sectors.

If a user supplies an unaligned buffer, the block layer will set up a 
bounce buffer.

Alan Stern

Sarah Sharp Nov. 20, 2013, 4:50 p.m. UTC | #34
On Wed, Nov 20, 2013 at 09:46:08AM -0000, David Laight wrote:
> > From: Sarah Sharp
> ...
> > (Also, usb-storage aligns the block sizes to 512K, which explains why
> > we've never had an issue with TD fragments with that driver.)
> 
> What is a 'block' in that context?
> 512K sounds more like the value that very long transfers get chopped
> up into. With 4k pages that might be 128 fragments.

Sorry, I meant 512-byte boundaries.  See Alan's comment in
drivers/usb/storage/scsiglue.c:

        /* USB has unusual DMA-alignment requirements: Although the
         * starting address of each scatter-gather element doesn't matter,
         * the length of each element except the last must be divisible
         * by the Bulk maxpacket value.  There's currently no way to
         * express this by block-layer constraints, so we'll cop out
         * and simply require addresses to be aligned at 512-byte
         * boundaries.  This is okay since most block I/O involves
         * hardware sectors that are multiples of 512 bytes in length,
         * and since host controllers up through USB 2.0 have maxpacket
         * values no larger than 512.
         *
         * But it doesn't suffice for Wireless USB, where Bulk maxpacket
         * values can be as large as 2048.  To make that work properly
         * will require changes to the block layer.
         */
        blk_queue_update_dma_alignment(sdev->request_queue, (512 - 1));

> I'd have thought that the SG list would normally contain references
> to a number of memory pages - so each would be 4k (on x86) aligned.
> My suspicion is that the xhci controller will generate correct USB3
> data provided the link TRB is on a 1k boundary - so such data won't
> be a problem.

If the max burst size is less than four, and the scsi layer hands down
4k chunks, then the driver would still work without any modification for
TD fragments, since MBP would be 4k and there would never be a link TRB
in the middle of an MBP.

However, the driver could be in violation of the spec if the burst size
was greater than 4.  I suspect what would happen is the host controller
would read the TD, and do a shorter burst of 4 max-packet-sized 1k
chunks, and then end the burst early.  But I'm not a hardware engineer,
and we can't count on how they designed it.  I'm just trying to figure
out why usb-storage worked for so many years without running into this
issue.

> If a user program does a direct transfer from the block device
> (and that is done by locking down the user pages) then the buffer
> could have an arbitrary alignment.

Sure.  In that case though, limiting the sg_tablesize so that TDs fit
into one ring segment isn't going to help, because the block layer won't
use it.  I guess the transfers will just fail, until we can get a better
fix in.

Sarah Sharp
David Laight Nov. 20, 2013, 5:02 p.m. UTC | #35
> From: Sarah Sharp [mailto:sarah.a.sharp@linux.intel.com]
> On Wed, Nov 20, 2013 at 09:36:11AM -0000, David Laight wrote:
> > > From: Alan Stern [mailto:stern@rowland.harvard.edu]
> > > On Tue, 19 Nov 2013, Sarah Sharp wrote:
> > >
> > > > The xHCI driver can limit the number of sg-list entries through
> > > > hcd->self.sg_tablesize.  It's currently set to ~0, which is "however
> > > > many entries you want.  You could set that to the number of TRBs in a
> > > > segment (minus one for the link TRB).
> > > >
...
> 
> Ok, so the networking layer should be fine.  However, with the current
> patch, if the mass storage driver sends down a scatter-gather list
> that's bigger than a ring segment, or needs to be split up so it doesn't
> cross 64K boundaries, then the URB submission will fail.  We don't want
> that to happen.
> 
> At the very least, we should limit hcd->self.sg_tablesize in
> drivers/usb/host/xhci.c to 63 (TRBS_PER_SEGMENT - 1).  But we could
> still potentially run into the 64K boundary issue in one or maybe all of
> those entries.  Would it be crazy to limit the number of entries to half
> that (31)?  It may impact performance, but it ensures that SCSI reads
> and writes don't randomly fail.  We can always increase the ring segment
> size in a later patch.

Unless there is a limit on the overall length of the transfer you'll always lose.
An aligned 4MB transfer in a contiguous memory buffer requires 64 TRBs.
So it won't fit in a ring segment.

Although I'd guess that such transfers are unusual.
If you bet that crossing 64k boundaries is unusual then 32 is probably
a sane limit.

Fragments over 16k can easily be split between ring segments - so a more
complex check could allow for that - but it won't be quick to write.

	David



David Laight Nov. 20, 2013, 5:08 p.m. UTC | #36
> From: Alan Stern [mailto:stern@rowland.harvard.edu]
> On Wed, 20 Nov 2013, David Laight wrote:
> 
> > > From: Sarah Sharp
> > ...
> > > (Also, usb-storage aligns the block sizes to 512K, which explains why
> > > we've never had an issue with TD fragments with that driver.)
> >
> > What is a 'block' in that context?
> 
> I think Sarah means that usb-storage requires the block layer to align
> its data buffers to 512-byte boundaries.  (Note: 512 bytes, not 512K.)

I did think it might be a typo...

> Disk I/O naturally tends to be done in units of the page size, anyway,
> although raw I/O can involve single sectors.
> 
> If a user supplies an unaligned buffer, the block layer will set up a
> bounce buffer.

Ah, ok, some other systems only do that from byte-misaligned buffers
(for general disk access).
The ehci alignment rules force 512 byte alignment.

That does mean that USB2 mass storage devices (attached to xhci)
will never generate incorrectly aligned link TRBs.

The ax88179_178a driver might still do so.

	David



David Laight Nov. 20, 2013, 5:16 p.m. UTC | #37
> Ok, so the networking layer should be fine.  However, with the current
> patch, if the mass storage driver sends down a scatter-gather list
> that's bigger than a ring segment, or needs to be split up so it doesn't
> cross 64K boundaries, then the URB submission will fail.  We don't want
> that to happen.

My suspicion is that long SG lists are unusual - otherwise the
ring expansion code would have been needed much earlier.

Can anyone remember whether that was needed because of long SG lists
or because of large numbers of outstanding requests?

I've seen it for network cards - but only because usbnet sends
down far too many tx buffers.

	David



Sarah Sharp Nov. 20, 2013, 6:39 p.m. UTC | #38
On Wed, Nov 20, 2013 at 05:16:12PM -0000, David Laight wrote:
> > Ok, so the networking layer should be fine.  However, with the current
> > patch, if the mass storage driver sends down a scatter-gather list
> > that's bigger than a ring segment, or needs to be split up so it doesn't
> > cross 64K boundaries, then the URB submission will fail.  We don't want
> > that to happen.
> 
> My suspicion is that long SG lists are unusual - otherwise the
> ring expansion code would have been needed much earlier.
> 
> Can anyone remember whether that was needed because of long SG lists
> or because of large numbers of outstanding requests?
>
> I've seen it for network cards - but only because usbnet sends
> down far too many tx buffers.

It was added because USB video and audio drivers queue isochronous URBs
with lots of buffers for each service interval.  So basically, lots of
outstanding requests, not long SG lists was the driving motivation to
add the ring expansion.

Sarah Sharp
Paul Zimmerman Nov. 20, 2013, 9:11 p.m. UTC | #39
> From: linux-usb-owner@vger.kernel.org [mailto:linux-usb-owner@vger.kernel.org] On Behalf Of David Laight
> Sent: Wednesday, November 20, 2013 9:16 AM
> 
> > Ok, so the networking layer should be fine.  However, with the current
> > patch, if the mass storage driver sends down a scatter-gather list
> > that's bigger than a ring segment, or needs to be split up so it doesn't
> > cross 64K boundaries, then the URB submission will fail.  We don't want
> > that to happen.
> 
> My suspicion is that long SG lists are unusual - otherwise the
> ring expansion code would have been needed much earlier.
> 
> Can anyone remember whether that was needed because of long SG lists
> or because of large numbers of outstanding requests?
> 
> I've seen it for network cards - but only because usbnet sends
> down far too many tx buffers.

usb-storage limits the maximum transfer size to 120K. That is a max of
31 page-size segments if my math is right. That's probably why mass-storage
never saw a problem.
David Laight Nov. 21, 2013, 9:53 a.m. UTC | #40
> > My suspicion is that long SG lists are unusual - otherwise the
> > ring expansion code would have been needed much earlier.
> >
> > Can anyone remember whether that was needed because of long SG lists
> > or because of large numbers of outstanding requests?
> >
> > I've seen it for network cards - but only because usbnet sends
> > down far too many tx buffers.
> 
> usb-storage limits the maximum transfer size to 120K. That is a max of
> 31 page-size segments if my math is right. That's probably why mass-storage
> never saw a problem.

I found some references to mass storage running out of ring space.

My best guess as to why mass storage has never had a problem with
the alignment at link TRBs is that the effect of getting it wrong
is to split the transfer. Since almost all the transfers are of page
aligned data that will only happen on a 1k boundary - where it is
invisible to the target.

For USB2 targets the link TRB only has to be 512 byte aligned.
The constraint that mass storage applied for the usb2 controllers
is enough to satisfy it.

	David



diff mbox

Patch

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 5480215..23abc9b 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2927,8 +2932,43 @@  static int prepare_ring(struct xhci_hcd *xhci, struct xhci_ring *ep_ring,
 	}
 
 	while (1) {
-		if (room_on_ring(xhci, ep_ring, num_trbs))
-			break;
+		if (room_on_ring(xhci, ep_ring, num_trbs)) {
+			union xhci_trb *trb = ep_ring->enqueue;
+			unsigned int usable = ep_ring->enq_seg->trbs +
+					TRBS_PER_SEGMENT - 1 - trb;
+			u32 nop_cmd;
+
+			/*
+			 * Section 4.11.7.1 TD Fragments states that a link
+			 * TRB must only occur at the boundary between
+			 * data bursts (eg 512 bytes for 480M).
+			 * While it is possible to split a large fragment
+			 * we don't know the size yet.
+			 * Simplest solution is to fill the trb before the
+			 * LINK with nop commands.
+			 */
+			if (num_trbs == 1 || num_trbs <= usable || usable == 0)
+				break;
+
+			if (num_trbs >= TRBS_PER_SEGMENT) {
+				xhci_err(xhci, "Too many fragments %d, max %d\n",
+						num_trbs, TRBS_PER_SEGMENT - 1);
+				return -ENOMEM;
+			}
+
+			nop_cmd = cpu_to_le32(TRB_TYPE(TRB_TR_NOOP) |
+					ep_ring->cycle_state);
+			for (; !TRB_TYPE_LINK_LE32(trb->link.control); trb++) {
+				trb->generic.field[0] = 0;
+				trb->generic.field[1] = 0;
+				trb->generic.field[2] = 0;
+				trb->generic.field[3] = nop_cmd;
+				ep_ring->num_trbs_free--;
+			}
+			ep_ring->enqueue = trb;
+			if (room_on_ring(xhci, ep_ring, num_trbs))
+				break;
+		}
 
 		if (ep_ring == xhci->cmd_ring) {
 			xhci_err(xhci, "Do not support expand command ring\n");