diff mbox series

usb: ehci: Fix "EHCI timed out on TD - token=XXXX" error on ehci-hcd

Message ID 20200226112955.7930-1-lukma@denx.de
State Rejected
Delegated to: Marek Vasut
Headers show
Series usb: ehci: Fix "EHCI timed out on TD - token=XXXX" error on ehci-hcd | expand

Commit Message

Lukasz Majewski Feb. 26, 2020, 11:29 a.m. UTC
This patch aims to improve robustness of 'usb' command operation on the
ehci-hcd IP block as it ports to contemporary U-Boot the patch described
and provided in [1] (originally applicable to U-Boot 2016.05).

According to the fix author - "rayvt" (from [1]):

TD stands for "Queue Element Transfer Descriptor (qTD)", which is a micro-code
instruction for the EHCI USB chip.
The "token" field is detailed information, control, and status of the TD's data
transfer. In particular, the rightmost byte is the status field. 0x80 bit means
the TD is active and the data transfer has not yet completed. 0x08 bit means
there was some sort of data transfer error (XactErr).

If the drive gets a transfer error, it refuses to do any other I/O transfer
until the error is properly cleared and reset. U-boot did not do this, so every
subsequent disk read would get a timeout error because the drive would not send
any data. The fix is to properly clear and reset the USB disk when it gets a
transfer error.

Every read operation starts at the maximum block size. When the USB pendrive is
not able to correctly serve this data read request, the dynamic reduction of IO
size is performed. Up to six tries (with smaller IO block each time) are
attempted.

A related problem is that some drives are slow to come up. Linux handles this by
issuing a spinup command and allowing more time for the drive to respond. The
same idea is applied in this fix.

On TPC70 (i.MX6Q) once per ~10 times (without this patch):

Bus usb@2184200: USB EHCI 1.00
scanning bus usb@2184200 for devices... 2 USB Device(s) found
       scanning usb for storage devices... 1 Storage Device(s) found
EHCI timed out on TD - token=0x1f8c80

Performance impact:

With U-Boot -master in mainline:
16869440 bytes read in 979 ms (16.4 MiB/s)

With this patch:
16869440 bytes read in 1219 ms (13.2 MiB/s)

Links:
[1] - https://forum.doozan.com/read.php?3,35295,35295#msg-35295
[2] - https://www.dropbox.com/s/nrkrd1no63viuu8/uboot-bodhi-2016.05-timeoutTD.patch?dl=0

Signed-off-by: Lukasz Majewski <lukma@denx.de>
[Unfortunately, the original patch [2] did not contain S-o-B from the original
author - "rayvt"]
---

 common/usb.c                |  10 +++-
 common/usb_storage.c        | 106 +++++++++++++++++++++++++++++++++---
 drivers/usb/host/ehci-hcd.c |  55 +++++++++++++++++--
 include/usb.h               |   1 +
 include/usb_defs.h          |   1 +
 5 files changed, 160 insertions(+), 13 deletions(-)

Comments

Tom Rini Feb. 26, 2020, 4:05 p.m. UTC | #1
On Wed, Feb 26, 2020 at 12:29:55PM +0100, Lukasz Majewski wrote:

> This patch aims to improve robustness of 'usb' command operation on the
> ehci-hcd IP block as it ports to contemporary U-Boot the patch described
> and provided in [1] (originally applicable to U-Boot 2016.05).
> 
> According to the fix author - "rayvt" (from [1]):
> 
> TD stands for "Queue Element Transfer Descriptor (qTD)", which is a micro-code
> instruction for the EHCI USB chip.
> The "token" field is detailed information, control, and status of the TD's data
> transfer. In particular, the rightmost byte is the status field. 0x80 bit means
> the TD is active and the data transfer has not yet completed. 0x08 bit means
> there was some sort of data transfer error (XactErr).
> 
> If the drive gets a transfer error, it refuses to do any other I/O transfer
> until the error is properly cleared and reset. U-boot did not do this, so every
> subsequent disk read would get a timeout error because the drive would not send
> any data. The fix is to properly clear and reset the USB disk when it gets a
> transfer error.
> 
> Every read operation starts at the maximum block size. When the USB pendrive is
> not able to correctly serve this data read request, the dynamic reduction of IO
> size is performed. Up to six tries (with smaller IO block each time) are
> attempted.
> 
> A related problem is that some drives are slow to come up. Linux handles this by
> issuing a spinup command and allowing more time for the drive to respond. The
> same idea is applied in this fix.
> 
> On TPC70 (i.MX6Q) once per ~10 times (without this patch):
> 
> Bus usb@2184200: USB EHCI 1.00
> scanning bus usb@2184200 for devices... 2 USB Device(s) found
>        scanning usb for storage devices... 1 Storage Device(s) found
> EHCI timed out on TD - token=0x1f8c80
> 
> Performance impact:
> 
> With U-Boot -master in mainline:
> 16869440 bytes read in 979 ms (16.4 MiB/s)
> 
> With this patch:
> 16869440 bytes read in 1219 ms (13.2 MiB/s)
> 
> Links:
> [1] - https://forum.doozan.com/read.php?3,35295,35295#msg-35295
> [2] - https://www.dropbox.com/s/nrkrd1no63viuu8/uboot-bodhi-2016.05-timeoutTD.patch?dl=0
> 
> Signed-off-by: Lukasz Majewski <lukma@denx.de>
> [Unfortunately, the original patch [2] did not contain S-o-B from the original
> author - "rayvt"]

This problem was seen in Beagleboard xM as well and reported by
Guillaume back in January, so:
Reported-by: Guillaume Gardet <Guillaume.Gardet@arm.com>

And on board, I can confirm the problem also goes away now so:

Tested-by: Tom Rini <trini@konsulko.com>
Marek Vasut Feb. 28, 2020, 11:32 p.m. UTC | #2
On 2/26/20 12:29 PM, Lukasz Majewski wrote:
> This patch aims to improve robustness of 'usb' command operation on the
> ehci-hcd IP block as it ports to contemporary U-Boot the patch described
> and provided in [1] (originally applicable to U-Boot 2016.05).
> 
> According to the fix author - "rayvt" (from [1]):

[...]

> diff --git a/common/usb_storage.c b/common/usb_storage.c
> index 097b6729c1..48c8c2ae64 100644
> --- a/common/usb_storage.c
> +++ b/common/usb_storage.c
> @@ -111,6 +111,18 @@ int usb_stor_get_info(struct usb_device *dev, struct us_data *us,
>  		      struct blk_desc *dev_desc);
>  int usb_storage_probe(struct usb_device *dev, unsigned int ifnum,
>  		      struct us_data *ss);
> +
> +#ifdef CONFIG_USB_EHCI_HCD
> +	/*
> +	 * The U-Boot EHCI driver can handle any transfer length as long as
> +	 * there is enough free heap space left, but the SCSI READ(10) and
> +	 * WRITE(10) commands are limited to 65535 blocks.
> +	 */
> +int usb_max_xfer_blk = 4096;
> +#else
> +int usb_max_xfer_blk = 20;
> +#endif

This all looks horribly wrong and exactly what
7d6fd7f0ba71cd93d94079132f958d9630f27a89 and
02b0e1a36c5bc20174299312556ec4e266872bd6 fixed properly.

All those "dynamic reduction of transfer size" attempts are nonsensical,
the real solution (sadly) is to reduce the transfer size to cater for
the most limited devices and profile/fix the remaining delays in the USB
stack (which should have already been done, see the commits above). This
is also what the Linux USB stack does.

What is the problem you are trying to solve here ?
Tom Rini Feb. 29, 2020, 3:20 a.m. UTC | #3
On Sat, Feb 29, 2020 at 12:32:57AM +0100, Marek Vasut wrote:
> On 2/26/20 12:29 PM, Lukasz Majewski wrote:
> > This patch aims to improve robustness of 'usb' command operation on the
> > ehci-hcd IP block as it ports to contemporary U-Boot the patch described
> > and provided in [1] (originally applicable to U-Boot 2016.05).
> > 
> > According to the fix author - "rayvt" (from [1]):
> 
> [...]
> 
> > diff --git a/common/usb_storage.c b/common/usb_storage.c
> > index 097b6729c1..48c8c2ae64 100644
> > --- a/common/usb_storage.c
> > +++ b/common/usb_storage.c
> > @@ -111,6 +111,18 @@ int usb_stor_get_info(struct usb_device *dev, struct us_data *us,
> >  		      struct blk_desc *dev_desc);
> >  int usb_storage_probe(struct usb_device *dev, unsigned int ifnum,
> >  		      struct us_data *ss);
> > +
> > +#ifdef CONFIG_USB_EHCI_HCD
> > +	/*
> > +	 * The U-Boot EHCI driver can handle any transfer length as long as
> > +	 * there is enough free heap space left, but the SCSI READ(10) and
> > +	 * WRITE(10) commands are limited to 65535 blocks.
> > +	 */
> > +int usb_max_xfer_blk = 4096;
> > +#else
> > +int usb_max_xfer_blk = 20;
> > +#endif
> 
> This all looks horribly wrong and exactly what
> 7d6fd7f0ba71cd93d94079132f958d9630f27a89 and
> 02b0e1a36c5bc20174299312556ec4e266872bd6 fixed properly.
> 
> All those "dynamic reduction of transfer size" attempts are nonsensical,
> the real solution (sadly) is to reduce the transfer size to cater for
> the most limited devices and profile/fix the remaining delays in the USB
> stack (which should have already been done, see the commits above). This
> is also what the Linux USB stack does.
> 
> What is the problem you are trying to solve here ?

Things like the following (omap3_beagle_defconfig):
U-Boot SPL 2020.04-rc3-00009-g9e1d65f36b83 (Feb 28 2020 - 19:08:53 -0500)
Trying to boot from MMC1


U-Boot 2020.04-rc3-00009-g9e1d65f36b83 (Feb 28 2020 - 19:08:53 -0500)

OMAP3630/3730-GP ES1.1, CPU-OPP2, L3-200MHz, Max CPU Clock 800 MHz
Model: TI OMAP3 BeagleBoard
OMAP3 Beagle board + LPDDR/NAND
I2C:   ready
DRAM:  256 MiB
NAND:  0 MiB
MMC:   OMAP SD/MMC: 0
Loading Environment from NAND... *** Warning - readenv() failed, using default environment

Beagle xM Rev A/B
No EEPROM on expansion board
OMAP die ID: 6e5e00211ff00000015739eb08031024
Net:   usb_ether
starting USB...
USB0:   USB EHCI 1.00
scanning bus 0 for devices... EHCI timed out on TD - token=0x80008d80
EHCI timed out on TD - token=0x80008c80
EHCI timed out on TD - token=0x80008c80
EHCI timed out on TD - token=0x80008c80
 ERROR: NOT USB_CONFIG_DESC b8
EHCI timed out on TD - token=0x80008d80
EHCI timed out on TD - token=0x80008c80
EHCI timed out on TD - token=0x80008c80
EHCI timed out on TD - token=0x80008c80
EHCI timed out on TD - token=0x80008c80
EHCI timed out on TD - token=0x80008c80
EHCI timed out on TD - token=0x80008c80
EHCI timed out on TD - token=0x80008d80
EHCI timed out on TD - token=0x80008c80
EHCI timed out on TD - token=0x80008c80
EHCI timed out on TD - token=0x80008d80
EHCI timed out on TD - token=0x80008c80
EHCI timed out on TD - token=0x80008c80
EHCI timed out on TD - token=0x80008d80
EHCI timed out on TD - token=0x80008c80
EHCI timed out on TD - token=0x80008c80
EHCI timed out on TD - token=0x80008d80
3 USB Device(s) found
       scanning usb for ethernet devices... 0 Ethernet Device(s) found
Hit any key to stop autoboot:  2  0 
BeagleBoard # usb tree
USB device tree:
  1  Hub (480 Mb/s, 0mA)
  |  u-boot EHCI Host Controller 
  |
  |+-2  Hub (480 Mb/s, 2mA)
    |
    |+-3  See Interface (480 Mb/s, 0mA)
         ??????????? ??????????? ???????????
       
BeagleBoard # 

Note that the hub and ethernet are on-SBC and not something I'm plugging
in.  Thanks!
Marek Vasut Feb. 29, 2020, 7:20 a.m. UTC | #4
On 2/29/20 4:20 AM, Tom Rini wrote:
> On Sat, Feb 29, 2020 at 12:32:57AM +0100, Marek Vasut wrote:
>> On 2/26/20 12:29 PM, Lukasz Majewski wrote:
>>> This patch aims to improve robustness of 'usb' command operation on the
>>> ehci-hcd IP block as it ports to contemporary U-Boot the patch described
>>> and provided in [1] (originally applicable to U-Boot 2016.05).
>>>
>>> According to the fix author - "rayvt" (from [1]):
>>
>> [...]
>>
>>> diff --git a/common/usb_storage.c b/common/usb_storage.c
>>> index 097b6729c1..48c8c2ae64 100644
>>> --- a/common/usb_storage.c
>>> +++ b/common/usb_storage.c
>>> @@ -111,6 +111,18 @@ int usb_stor_get_info(struct usb_device *dev, struct us_data *us,
>>>  		      struct blk_desc *dev_desc);
>>>  int usb_storage_probe(struct usb_device *dev, unsigned int ifnum,
>>>  		      struct us_data *ss);
>>> +
>>> +#ifdef CONFIG_USB_EHCI_HCD
>>> +	/*
>>> +	 * The U-Boot EHCI driver can handle any transfer length as long as
>>> +	 * there is enough free heap space left, but the SCSI READ(10) and
>>> +	 * WRITE(10) commands are limited to 65535 blocks.
>>> +	 */
>>> +int usb_max_xfer_blk = 4096;
>>> +#else
>>> +int usb_max_xfer_blk = 20;
>>> +#endif
>>
>> This all looks horribly wrong and exactly what
>> 7d6fd7f0ba71cd93d94079132f958d9630f27a89 and
>> 02b0e1a36c5bc20174299312556ec4e266872bd6 fixed properly.
>>
>> All those "dynamic reduction of transfer size" attempts are nonsensical,
>> the real solution (sadly) is to reduce the transfer size to cater for
>> the most limited devices and profile/fix the remaining delays in the USB
>> stack (which should have already been done, see the commits above). This
>> is also what the Linux USB stack does.
>>
>> What is the problem you are trying to solve here ?
> 
> Things like the following (omap3_beagle_defconfig):
> U-Boot SPL 2020.04-rc3-00009-g9e1d65f36b83 (Feb 28 2020 - 19:08:53 -0500)
> Trying to boot from MMC1

[...]

> EHCI timed out on TD - token=0x80008c80
> EHCI timed out on TD - token=0x80008c80
> EHCI timed out on TD - token=0x80008d80
> 3 USB Device(s) found
>        scanning usb for ethernet devices... 0 Ethernet Device(s) found
> Hit any key to stop autoboot:  2  0 
> BeagleBoard # usb tree
> USB device tree:
>   1  Hub (480 Mb/s, 0mA)
>   |  u-boot EHCI Host Controller 
>   |
>   |+-2  Hub (480 Mb/s, 2mA)
>     |
>     |+-3  See Interface (480 Mb/s, 0mA)
>          ??????????? ??????????? ???????????
>        
> BeagleBoard # 
> 
> Note that the hub and ethernet are on-SBC and not something I'm plugging
> in.  Thanks!

The device #3 is a usb mass storage or what is it ?

Can you try and implement usb_get_max_xfer_size for musb and make it
report 240*512 unconditionally (*size = 240*512; return 0;) ? I think
that would "fix" it for you on omap too.
Lukasz Majewski March 1, 2020, 5:19 p.m. UTC | #5
Hi Marek,

> On 2/26/20 12:29 PM, Lukasz Majewski wrote:
> > This patch aims to improve robustness of 'usb' command operation on
> > the ehci-hcd IP block as it ports to contemporary U-Boot the patch
> > described and provided in [1] (originally applicable to U-Boot
> > 2016.05).
> > 
> > According to the fix author - "rayvt" (from [1]):  
> 
> [...]
> 
> > diff --git a/common/usb_storage.c b/common/usb_storage.c
> > index 097b6729c1..48c8c2ae64 100644
> > --- a/common/usb_storage.c
> > +++ b/common/usb_storage.c
> > @@ -111,6 +111,18 @@ int usb_stor_get_info(struct usb_device *dev,
> > struct us_data *us, struct blk_desc *dev_desc);
> >  int usb_storage_probe(struct usb_device *dev, unsigned int ifnum,
> >  		      struct us_data *ss);
> > +
> > +#ifdef CONFIG_USB_EHCI_HCD
> > +	/*
> > +	 * The U-Boot EHCI driver can handle any transfer length
> > as long as
> > +	 * there is enough free heap space left, but the SCSI
> > READ(10) and
> > +	 * WRITE(10) commands are limited to 65535 blocks.
> > +	 */
> > +int usb_max_xfer_blk = 4096;
> > +#else
> > +int usb_max_xfer_blk = 20;
> > +#endif  
> 
> This all looks horribly wrong and exactly what
> 7d6fd7f0ba71cd93d94079132f958d9630f27a89 and
> 02b0e1a36c5bc20174299312556ec4e266872bd680 fixed properly.
> 
> All those "dynamic reduction of transfer size" attempts are
> nonsensical, the real solution (sadly) is to reduce the transfer size
> to cater for the most limited devices and profile/fix the remaining
> delays in the USB stack (which should have already been done, see the
> commits above). This is also what the Linux USB stack does.
> 
> What is the problem you are trying to solve here ?

As it was stated in the commit message:

On TPC70 (i.MX6Q) once per ~10 times (without this patch):

Bus usb@2184200: USB EHCI 1.00
scanning bus usb@2184200 for devices... 2 USB Device(s) found
       scanning usb for storage devices... 1 Storage Device(s) found
EHCI timed out on TD - token=0x1f8c80


The problem is that even with the mentioned fix in:
7d6fd7f0ba71cd93d94079132f958d9630f27a89 from time to time I do see the:

EHCI timed out on TD - token=0x1f8c80

error. The same issue was already reported by Tom.

The description of the fix written by the original author:
https://forum.doozan.com/read.php?3,35295,35295#msg-35295

states that the reduction of the transfer is done just for "spin
up"/"detection" of the pen drive. The issue is that not all pen drive
disks are able to be properly detected in the first place.



Last but not least - the mainline is still broken. The "token=0x1f8c80"
errors pops up from time to time. However, after applying the approach
from this fix - the error is gone (and pass some acceptance tests).

Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@denx.de
Marek Vasut March 1, 2020, 5:35 p.m. UTC | #6
On 3/1/20 6:19 PM, Lukasz Majewski wrote:
> Hi Marek,

Hi,

>> On 2/26/20 12:29 PM, Lukasz Majewski wrote:
>>> This patch aims to improve robustness of 'usb' command operation on
>>> the ehci-hcd IP block as it ports to contemporary U-Boot the patch
>>> described and provided in [1] (originally applicable to U-Boot
>>> 2016.05).
>>>
>>> According to the fix author - "rayvt" (from [1]):  
>>
>> [...]
>>
>>> diff --git a/common/usb_storage.c b/common/usb_storage.c
>>> index 097b6729c1..48c8c2ae64 100644
>>> --- a/common/usb_storage.c
>>> +++ b/common/usb_storage.c
>>> @@ -111,6 +111,18 @@ int usb_stor_get_info(struct usb_device *dev,
>>> struct us_data *us, struct blk_desc *dev_desc);
>>>  int usb_storage_probe(struct usb_device *dev, unsigned int ifnum,
>>>  		      struct us_data *ss);
>>> +
>>> +#ifdef CONFIG_USB_EHCI_HCD
>>> +	/*
>>> +	 * The U-Boot EHCI driver can handle any transfer length
>>> as long as
>>> +	 * there is enough free heap space left, but the SCSI
>>> READ(10) and
>>> +	 * WRITE(10) commands are limited to 65535 blocks.
>>> +	 */
>>> +int usb_max_xfer_blk = 4096;
>>> +#else
>>> +int usb_max_xfer_blk = 20;
>>> +#endif  
>>
>> This all looks horribly wrong and exactly what
>> 7d6fd7f0ba71cd93d94079132f958d9630f27a89 and
>> 02b0e1a36c5bc20174299312556ec4e266872bd680 fixed properly.
>>
>> All those "dynamic reduction of transfer size" attempts are
>> nonsensical, the real solution (sadly) is to reduce the transfer size
>> to cater for the most limited devices and profile/fix the remaining
>> delays in the USB stack (which should have already been done, see the
>> commits above). This is also what the Linux USB stack does.
>>
>> What is the problem you are trying to solve here ?
> 
> As it was stated in the commit message:
> 
> On TPC70 (i.MX6Q) once per ~10 times (without this patch):
> 
> Bus usb@2184200: USB EHCI 1.00
> scanning bus usb@2184200 for devices... 2 USB Device(s) found
>        scanning usb for storage devices... 1 Storage Device(s) found
> EHCI timed out on TD - token=0x1f8c80
> 
> 
> The problem is that even with the mentioned fix in:
> 7d6fd7f0ba71cd93d94079132f958d9630f27a89 from time to time I do see the:
> 
> EHCI timed out on TD - token=0x1f8c80
> 
> error. The same issue was already reported by Tom.

OK. Why do you see the error ? Can you dig in further ?

> The description of the fix written by the original author:
> https://forum.doozan.com/read.php?3,35295,35295#msg-35295
> 
> states that the reduction of the transfer is done just for "spin
> up"/"detection" of the pen drive. The issue is that not all pen drive
> disks are able to be properly detected in the first place.

But the commit message claims this patch reduces the transfer rate by
20% always. So that's a NAK right away, sorry, you need to find a better
solution which doesn't have such an enormous impact.

I'm fine with the spin-up handling, but this auto-detection of transfer
size should be pulled into separate patch and is likely wrong anyway.

So try to separate this into two patches, one for handling the spin up,
another one for this auto-detection of the transfer size. And I think
you will end up needed only the first one.

> Last but not least - the mainline is still broken. The "token=0x1f8c80"
> errors pops up from time to time. However, after applying the approach
> from this fix - the error is gone (and pass some acceptance tests).

It works for me. Do you have a specific device which fails ? Which
device and on which hardware ?

Also, what is "some acceptance test" ?
Lukasz Majewski March 1, 2020, 5:59 p.m. UTC | #7
Hi Marek,

> On 3/1/20 6:19 PM, Lukasz Majewski wrote:
> > Hi Marek,  
> 
> Hi,
> 
> >> On 2/26/20 12:29 PM, Lukasz Majewski wrote:  
> >>> This patch aims to improve robustness of 'usb' command operation
> >>> on the ehci-hcd IP block as it ports to contemporary U-Boot the
> >>> patch described and provided in [1] (originally applicable to
> >>> U-Boot 2016.05).
> >>>
> >>> According to the fix author - "rayvt" (from [1]):    
> >>
> >> [...]
> >>  
> >>> diff --git a/common/usb_storage.c b/common/usb_storage.c
> >>> index 097b6729c1..48c8c2ae64 100644
> >>> --- a/common/usb_storage.c
> >>> +++ b/common/usb_storage.c
> >>> @@ -111,6 +111,18 @@ int usb_stor_get_info(struct usb_device *dev,
> >>> struct us_data *us, struct blk_desc *dev_desc);
> >>>  int usb_storage_probe(struct usb_device *dev, unsigned int ifnum,
> >>>  		      struct us_data *ss);
> >>> +
> >>> +#ifdef CONFIG_USB_EHCI_HCD
> >>> +	/*
> >>> +	 * The U-Boot EHCI driver can handle any transfer length
> >>> as long as
> >>> +	 * there is enough free heap space left, but the SCSI
> >>> READ(10) and
> >>> +	 * WRITE(10) commands are limited to 65535 blocks.
> >>> +	 */
> >>> +int usb_max_xfer_blk = 4096;
> >>> +#else
> >>> +int usb_max_xfer_blk = 20;
> >>> +#endif    
> >>
> >> This all looks horribly wrong and exactly what
> >> 7d6fd7f0ba71cd93d94079132f958d9630f27a89 and
> >> 02b0e1a36c5bc20174299312556ec4e266872bd680 fixed properly.
> >>
> >> All those "dynamic reduction of transfer size" attempts are
> >> nonsensical, the real solution (sadly) is to reduce the transfer
> >> size to cater for the most limited devices and profile/fix the
> >> remaining delays in the USB stack (which should have already been
> >> done, see the commits above). This is also what the Linux USB
> >> stack does.
> >>
> >> What is the problem you are trying to solve here ?  
> > 
> > As it was stated in the commit message:
> > 
> > On TPC70 (i.MX6Q) once per ~10 times (without this patch):
> > 
> > Bus usb@2184200: USB EHCI 1.00
> > scanning bus usb@2184200 for devices... 2 USB Device(s) found
> >        scanning usb for storage devices... 1 Storage Device(s) found
> > EHCI timed out on TD - token=0x1f8c80
> > 
> > 
> > The problem is that even with the mentioned fix in:
> > 7d6fd7f0ba71cd93d94079132f958d9630f27a89 from time to time I do see
> > the:
> > 
> > EHCI timed out on TD - token=0x1f8c80
> > 
> > error. The same issue was already reported by Tom.  
> 
> OK. Why do you see the error ? Can you dig in further ?

It is all explained in the link:
https://forum.doozan.com/read.php?3,35295,35295#msg-35295

> 
> > The description of the fix written by the original author:
> > https://forum.doozan.com/read.php?3,35295,35295#msg-35295
> > 
> > states that the reduction of the transfer is done just for "spin
> > up"/"detection" of the pen drive. The issue is that not all pen
> > drive disks are able to be properly detected in the first place.  
> 
> But the commit message claims this patch reduces the transfer rate by
> 20% always. So that's a NAK right away, sorry, 

Marek, now we do have from time to time NOT functional USB (once per
ten attempts on iMX6Q - tpc70 board). If somebody relies on 'usb start'
for recovery/normal operation - it is bad.

In such a use case - the 20% performance drop is acceptable.

> you need to find a
> better solution which doesn't have such an enormous impact.
> 
> I'm fine with the spin-up handling, but this auto-detection of
> transfer size should be pulled into separate patch and is likely
> wrong anyway.

I cannot agree that is it "likely wrong anyway" - please correct me if
I'm wrong, but up till now we were fiddling with transfer size up till
having "good enough" results.

The above link describes what needs to be done in case of the "EHCI
timed out on TD - token=0x1f8c80" error.

Do we handle it now? In my opinion we don't.

> 
> So try to separate this into two patches, one for handling the spin
> up, another one for this auto-detection of the transfer size. And I
> think you will end up needed only the first one.

This is an open question if only the spin up fixes the error. It would
be great if you could look for the changes introduced in this patch and
compare it with your experience of ehci driver development.

> 
> > Last but not least - the mainline is still broken. The
> > "token=0x1f8c80" errors pops up from time to time. However, after
> > applying the approach from this fix - the error is gone (and pass
> > some acceptance tests).  
> 
> It works for me.

Also works for Tom.

> Do you have a specific device which fails ?

Yes. On my desk - tpc70 with i.MX6Q. And hdc with i.MX53. I can confirm
that from U-Boot 2018.04 up till now the rate of:

"EHCI timed out on TD - token=0x1f8c80" error has been reduced
significantly for their use case (reading ~16MiB of binary from several
pen drives). 

Unfortunately the errors are still present from time to time.

> Which
> device and on which hardware ?
> 
> Also, what is "some acceptance test" ?
> 

The set of automated tests to assess if "usb start" don't break.


Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@denx.de
Marek Vasut March 1, 2020, 6:39 p.m. UTC | #8
On 3/1/20 6:59 PM, Lukasz Majewski wrote:
> Hi Marek,

Hi,

[...]

>>> The description of the fix written by the original author:
>>> https://forum.doozan.com/read.php?3,35295,35295#msg-35295
>>>
>>> states that the reduction of the transfer is done just for "spin
>>> up"/"detection" of the pen drive. The issue is that not all pen
>>> drive disks are able to be properly detected in the first place.  
>>
>> But the commit message claims this patch reduces the transfer rate by
>> 20% always. So that's a NAK right away, sorry, 
> 
> Marek, now we do have from time to time NOT functional USB (once per
> ten attempts on iMX6Q - tpc70 board). If somebody relies on 'usb start'
> for recovery/normal operation - it is bad.
> 
> In such a use case - the 20% performance drop is acceptable.

I'm sure we can do better. 2020.04 is still quite far out.

Can you take a look ?

>> you need to find a
>> better solution which doesn't have such an enormous impact.
>>
>> I'm fine with the spin-up handling, but this auto-detection of
>> transfer size should be pulled into separate patch and is likely
>> wrong anyway.
> 
> I cannot agree that is it "likely wrong anyway" - please correct me if
> I'm wrong, but up till now we were fiddling with transfer size up till
> having "good enough" results.

Correction, read the patches I linked above
(7d6fd7f0ba71cd93d94079132f958d9630f27a89 and esp.
02b0e1a36c5bc20174299312556ec4e266872bd6). Those solve the "we were
fiddling around with transfer size" attempts, because I actually
invested the time to understand the issue with those devices that failed
and I was right about never letting any of those "fiddling around with
transfer size" patches into mainline.

The problem with those devices which needed "fiddling with transfer
size" was that their internal 16bit counters (or 8bit counters) overflew
when a very long transfer was started and thus those devices failed. The
fix there was to transfer less than that amount of data which triggered
the overflow, which on a wast majority of devices is 240 blocks. However
, this led to a massive slowdown, which is unacceptable. A fix for that
slowdown was to avoid turning async schedule off-on between each
transfer, but rather keep it running, that saves A LOT of wasted
transfer time. If you build a long transfer up front, you would never do
that many transfers to notice the delays incurred by turning the async
schedule off/on, but if you do multiple shorter transfers, these delays
add up.

Yes, these issues are convoluted and take time to solve properly, yet
the end result might not have such a performance hit.

> The above link describes what needs to be done in case of the "EHCI
> timed out on TD - token=0x1f8c80" error.
> 
> Do we handle it now? In my opinion we don't.
> 
>>
>> So try to separate this into two patches, one for handling the spin
>> up, another one for this auto-detection of the transfer size. And I
>> think you will end up needed only the first one.
> 
> This is an open question if only the spin up fixes the error. It would
> be great if you could look for the changes introduced in this patch and
> compare it with your experience of ehci driver development.

Surely you can separate those two patches and perform such a test ?

>>> Last but not least - the mainline is still broken. The
>>> "token=0x1f8c80" errors pops up from time to time. However, after
>>> applying the approach from this fix - the error is gone (and pass
>>> some acceptance tests).  
>>
>> It works for me.
> 
> Also works for Tom.
> 
>> Do you have a specific device which fails ?
> 
> Yes. On my desk - tpc70 with i.MX6Q. And hdc with i.MX53. I can confirm
> that from U-Boot 2018.04 up till now the rate of:
> 
> "EHCI timed out on TD - token=0x1f8c80" error has been reduced
> significantly for their use case (reading ~16MiB of binary from several
> pen drives). 

Can you run lsusb (to get IDs) on those pendrives which fail ?
I would like to add them to a list.

> Unfortunately the errors are still present from time to time.
> 
>> Which
>> device and on which hardware ?
>>
>> Also, what is "some acceptance test" ?
>>
> 
> The set of automated tests to assess if "usb start" don't break.

The usb start only reads out the partition table and does not do any
long transfers (above 240 blocks, which usually triggers the failure on
cheap sticks). So if long transfers fail for you, then this is a
separate issue from the 'usb start' spin-up problem. And so, if this
patch makes both work, then this indeed should be two patches.
Tom Rini March 1, 2020, 11:04 p.m. UTC | #9
On Sat, Feb 29, 2020 at 08:20:57AM +0100, Marek Vasut wrote:
> On 2/29/20 4:20 AM, Tom Rini wrote:
> > On Sat, Feb 29, 2020 at 12:32:57AM +0100, Marek Vasut wrote:
> >> On 2/26/20 12:29 PM, Lukasz Majewski wrote:
> >>> This patch aims to improve robustness of 'usb' command operation on the
> >>> ehci-hcd IP block as it ports to contemporary U-Boot the patch described
> >>> and provided in [1] (originally applicable to U-Boot 2016.05).
> >>>
> >>> According to the fix author - "rayvt" (from [1]):
> >>
> >> [...]
> >>
> >>> diff --git a/common/usb_storage.c b/common/usb_storage.c
> >>> index 097b6729c1..48c8c2ae64 100644
> >>> --- a/common/usb_storage.c
> >>> +++ b/common/usb_storage.c
> >>> @@ -111,6 +111,18 @@ int usb_stor_get_info(struct usb_device *dev, struct us_data *us,
> >>>  		      struct blk_desc *dev_desc);
> >>>  int usb_storage_probe(struct usb_device *dev, unsigned int ifnum,
> >>>  		      struct us_data *ss);
> >>> +
> >>> +#ifdef CONFIG_USB_EHCI_HCD
> >>> +	/*
> >>> +	 * The U-Boot EHCI driver can handle any transfer length as long as
> >>> +	 * there is enough free heap space left, but the SCSI READ(10) and
> >>> +	 * WRITE(10) commands are limited to 65535 blocks.
> >>> +	 */
> >>> +int usb_max_xfer_blk = 4096;
> >>> +#else
> >>> +int usb_max_xfer_blk = 20;
> >>> +#endif
> >>
> >> This all looks horribly wrong and exactly what
> >> 7d6fd7f0ba71cd93d94079132f958d9630f27a89 and
> >> 02b0e1a36c5bc20174299312556ec4e266872bd6 fixed properly.
> >>
> >> All those "dynamic reduction of transfer size" attempts are nonsensical,
> >> the real solution (sadly) is to reduce the transfer size to cater for
> >> the most limited devices and profile/fix the remaining delays in the USB
> >> stack (which should have already been done, see the commits above). This
> >> is also what the Linux USB stack does.
> >>
> >> What is the problem you are trying to solve here ?
> > 
> > Things like the following (omap3_beagle_defconfig):
> > U-Boot SPL 2020.04-rc3-00009-g9e1d65f36b83 (Feb 28 2020 - 19:08:53 -0500)
> > Trying to boot from MMC1
> 
> [...]
> 
> > EHCI timed out on TD - token=0x80008c80
> > EHCI timed out on TD - token=0x80008c80
> > EHCI timed out on TD - token=0x80008d80
> > 3 USB Device(s) found
> >        scanning usb for ethernet devices... 0 Ethernet Device(s) found
> > Hit any key to stop autoboot:  2  0 
> > BeagleBoard # usb tree
> > USB device tree:
> >   1  Hub (480 Mb/s, 0mA)
> >   |  u-boot EHCI Host Controller 
> >   |
> >   |+-2  Hub (480 Mb/s, 2mA)
> >     |
> >     |+-3  See Interface (480 Mb/s, 0mA)
> >          ??????????? ??????????? ???????????
> >        
> > BeagleBoard # 
> > 
> > Note that the hub and ethernet are on-SBC and not something I'm plugging
> > in.  Thanks!
> 
> The device #3 is a usb mass storage or what is it ?

It's a usb ethernet device.

> Can you try and implement usb_get_max_xfer_size for musb and make it
> report 240*512 unconditionally (*size = 240*512; return 0;) ? I think
> that would "fix" it for you on omap too.

I'll pencil in some time to try that, thanks.
Marek Vasut March 2, 2020, 12:39 a.m. UTC | #10
On 3/2/20 12:04 AM, Tom Rini wrote:
[...]

>>> 3 USB Device(s) found
>>>        scanning usb for ethernet devices... 0 Ethernet Device(s) found
>>> Hit any key to stop autoboot:  2  0 
>>> BeagleBoard # usb tree
>>> USB device tree:
>>>   1  Hub (480 Mb/s, 0mA)
>>>   |  u-boot EHCI Host Controller 
>>>   |
>>>   |+-2  Hub (480 Mb/s, 2mA)
>>>     |
>>>     |+-3  See Interface (480 Mb/s, 0mA)
>>>          ??????????? ??????????? ???????????
>>>        
>>> BeagleBoard # 
>>>
>>> Note that the hub and ethernet are on-SBC and not something I'm plugging
>>> in.  Thanks!
>>
>> The device #3 is a usb mass storage or what is it ?
> 
> It's a usb ethernet device.
> 
>> Can you try and implement usb_get_max_xfer_size for musb and make it
>> report 240*512 unconditionally (*size = 240*512; return 0;) ? I think
>> that would "fix" it for you on omap too.
> 
> I'll pencil in some time to try that, thanks.

That's not gonna help you with USB ethernet. I recall seeing flakiness
with asix devices, maybe that's what you're running into?

Do you have any further details on that device ? lsusb -vvv would help.
Lukasz Majewski March 2, 2020, 1:01 p.m. UTC | #11
Hi Marek,

> On 3/1/20 6:59 PM, Lukasz Majewski wrote:
> > Hi Marek,  
> 
> Hi,
> 
> [...]
> 
> >>> The description of the fix written by the original author:
> >>> https://forum.doozan.com/read.php?3,35295,35295#msg-35295
> >>>
> >>> states that the reduction of the transfer is done just for "spin
> >>> up"/"detection" of the pen drive. The issue is that not all pen
> >>> drive disks are able to be properly detected in the first place.
> >>>   
> >>
> >> But the commit message claims this patch reduces the transfer rate
> >> by 20% always. So that's a NAK right away, sorry,   
> > 
> > Marek, now we do have from time to time NOT functional USB (once per
> > ten attempts on iMX6Q - tpc70 board). If somebody relies on 'usb
> > start' for recovery/normal operation - it is bad.
> > 
> > In such a use case - the 20% performance drop is acceptable.  
> 
> I'm sure we can do better. 2020.04 is still quite far out.
> 
> Can you take a look ?

There are several issues which are fixed by this patch. 

I will reply to the actual patch set to point them out in the code, as
this may be more readable.

> 
> >> you need to find a
> >> better solution which doesn't have such an enormous impact.
> >>
> >> I'm fine with the spin-up handling, but this auto-detection of
> >> transfer size should be pulled into separate patch and is likely
> >> wrong anyway.  
> > 
> > I cannot agree that is it "likely wrong anyway" - please correct me
> > if I'm wrong, but up till now we were fiddling with transfer size
> > up till having "good enough" results.  
> 
> Correction, read the patches I linked above
> (7d6fd7f0ba71cd93d94079132f958d9630f27a89 

The above sha1 uses the observation (and heuristics) from other
operating systems to deal with usb pendrives, which controllers cannot
handle more than 256 sectors (as they have 8 bit counter).

> and esp.
> 02b0e1a36c5bc20174299312556ec4e266872bd6).

This patch introduces the optimization - instead of setup/disable async
transmissions - it enables it once and then just adds new tokens to be
transmitted.

Is the transmission error (USB_ST_XACTERR 0x80) handled at this point?

> Those solve the "we were
> fiddling around with transfer size" attempts, because I actually
> invested the time to understand the issue with those devices that
> failed and I was right about never letting any of those "fiddling
> around with transfer size" patches into mainline.
> 
> The problem with those devices which needed "fiddling with transfer
> size" was that their internal 16bit counters (or 8bit counters)
> overflew when a very long transfer was started and thus those devices
> failed. The fix there was to transfer less than that amount of data
> which triggered the overflow, which on a wast majority of devices is
> 240 blocks. However , this led to a massive slowdown, which is
> unacceptable. A fix for that slowdown was to avoid turning async
> schedule off-on between each transfer, but rather keep it running,
> that saves A LOT of wasted transfer time. If you build a long
> transfer up front, you would never do that many transfers to notice
> the delays incurred by turning the async schedule off/on, but if you
> do multiple shorter transfers, these delays add up.
> 

Thanks for the explanation.

> Yes, these issues are convoluted and take time to solve properly, yet
> the end result might not have such a performance hit.
> 
> > The above link describes what needs to be done in case of the "EHCI
> > timed out on TD - token=0x1f8c80" error.
> > 
> > Do we handle it now? In my opinion we don't.
> >   
> >>
> >> So try to separate this into two patches, one for handling the spin
> >> up, another one for this auto-detection of the transfer size. And I
> >> think you will end up needed only the first one.  
> > 
> > This is an open question if only the spin up fixes the error. It
> > would be great if you could look for the changes introduced in this
> > patch and compare it with your experience of ehci driver
> > development.  
> 
> Surely you can separate those two patches and perform such a test ?

I would prefer to first go through the actual patch code, before I
spent time on things which may be not necessary.

> 
> >>> Last but not least - the mainline is still broken. The
> >>> "token=0x1f8c80" errors pops up from time to time. However, after
> >>> applying the approach from this fix - the error is gone (and pass
> >>> some acceptance tests).    
> >>
> >> It works for me.  
> > 
> > Also works for Tom.
> >   
> >> Do you have a specific device which fails ?  
> > 
> > Yes. On my desk - tpc70 with i.MX6Q. And hdc with i.MX53. I can
> > confirm that from U-Boot 2018.04 up till now the rate of:
> > 
> > "EHCI timed out on TD - token=0x1f8c80" error has been reduced
> > significantly for their use case (reading ~16MiB of binary from
> > several pen drives).   
> 
> Can you run lsusb (to get IDs) on those pendrives which fail ?
> I would like to add them to a list.

This was working correctly on the beginning, but after some time it
started to show errors in question:

Bus 001 Device 040: ID 0930:6544 Toshiba Corp. TransMemory-Mini /
Kingston DataTraveler 2.0 Stick

This one was causing issues from the very beginning:
Bus 001 Device 041: ID 058f:6387 Alcor Micro Corp. Flash Drive

This one - Verbatim 
Bus 001 Device 042: ID 21c4:8005

Never caused issues - worked reliably even with old U-Boot (on which
errors were observed very often).

> 
> > Unfortunately the errors are still present from time to time.
> >   
> >> Which
> >> device and on which hardware ?
> >>
> >> Also, what is "some acceptance test" ?
> >>  
> > 
> > The set of automated tests to assess if "usb start" don't break.  
> 
> The usb start only reads out the partition table and does not do any
> long transfers (above 240 blocks, which usually triggers the failure
> on cheap sticks). So if long transfers fail for you, then this is a
> separate issue from the 'usb start' spin-up problem. And so, if this
> patch makes both work, then this indeed should be two patches.


Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@denx.de
Lukasz Majewski March 2, 2020, 1:21 p.m. UTC | #12
Hi,

> This patch aims to improve robustness of 'usb' command operation on
> the ehci-hcd IP block as it ports to contemporary U-Boot the patch
> described and provided in [1] (originally applicable to U-Boot
> 2016.05).
> 
> According to the fix author - "rayvt" (from [1]):
> 
> TD stands for "Queue Element Transfer Descriptor (qTD)", which is a
> micro-code instruction for the EHCI USB chip.
> The "token" field is detailed information, control, and status of the
> TD's data transfer. In particular, the rightmost byte is the status
> field. 0x80 bit means the TD is active and the data transfer has not
> yet completed. 0x08 bit means there was some sort of data transfer
> error (XactErr).
	 ^^^^^^^ - this error is NOT handled now in mainline u-boot. Do
	 you have any experience with it?

> 
> If the drive gets a transfer error, it refuses to do any other I/O
> transfer until the error is properly cleared and reset. U-boot did
> not do this, so every subsequent disk read would get a timeout error
> because the drive would not send any data. The fix is to properly
> clear and reset the USB disk when it gets a transfer error.
> 
> Every read operation starts at the maximum block size. When the USB
> pendrive is not able to correctly serve this data read request, the
> dynamic reduction of IO size is performed. Up to six tries (with
> smaller IO block each time) are attempted.
> 
> A related problem is that some drives are slow to come up. Linux
> handles this by issuing a spinup command and allowing more time for
> the drive to respond. The same idea is applied in this fix.
> 
> On TPC70 (i.MX6Q) once per ~10 times (without this patch):
> 
> Bus usb@2184200: USB EHCI 1.00
> scanning bus usb@2184200 for devices... 2 USB Device(s) found
>        scanning usb for storage devices... 1 Storage Device(s) found
> EHCI timed out on TD - token=0x1f8c80

This is how the error gets evident. The detailed explanation is in link
[1].

> 
> Performance impact:
> 
> With U-Boot -master in mainline:
> 16869440 bytes read in 979 ms (16.4 MiB/s)
> 
> With this patch:
> 16869440 bytes read in 1219 ms (13.2 MiB/s)
> 
> Links:
> [1] - https://forum.doozan.com/read.php?3,35295,35295#msg-35295
> [2] -
> https://www.dropbox.com/s/nrkrd1no63viuu8/uboot-bodhi-2016.05-timeoutTD.patch?dl=0
> 
> Signed-off-by: Lukasz Majewski <lukma@denx.de>
> [Unfortunately, the original patch [2] did not contain S-o-B from the
> original author - "rayvt"]
> ---
> 
>  common/usb.c                |  10 +++-
>  common/usb_storage.c        | 106
> +++++++++++++++++++++++++++++++++--- drivers/usb/host/ehci-hcd.c |
> 55 +++++++++++++++++-- include/usb.h               |   1 +
>  include/usb_defs.h          |   1 +
>  5 files changed, 160 insertions(+), 13 deletions(-)
> 
> diff --git a/common/usb.c b/common/usb.c
> index 349e838f1d..305482b5bb 100644
> --- a/common/usb.c
> +++ b/common/usb.c
> @@ -925,14 +925,20 @@ static int get_descriptor_len(struct usb_device
> *dev, int len, int expect_len) __maybe_unused struct
> usb_device_descriptor *desc; ALLOC_CACHE_ALIGN_BUFFER(unsigned char,
> tmpbuf, USB_BUFSIZ); int err;
> +	int retry = 5;
>  
>  	desc = (struct usb_device_descriptor *)tmpbuf;
>  
> +again:
>  	err = usb_get_descriptor(dev, USB_DT_DEVICE, 0, desc, len);
>  	if (err < expect_len) {
>  		if (err < 0) {
> -			printf("unable to get device descriptor
> (error=%d)\n",
> -				err);
> +			printf("unable to get device descriptor
> (error=%d) retry: %d\n",
> +			       err, retry);
> +			mdelay(50);
> +			if (--retry >= 0)
> +				/* Some drives are just slow to wake
> up. */
> +				goto again;

This I think is just to "wait" for the inserted device if it is rather
slow (or probably its memory got degraded over time).

This shouldn't bring any transfer slow down.


>  			return err;
>  		} else {
>  			printf("USB device descriptor short read
> (expected %i, got %i)\n", diff --git a/common/usb_storage.c
> b/common/usb_storage.c index 097b6729c1..48c8c2ae64 100644
> --- a/common/usb_storage.c
> +++ b/common/usb_storage.c
> @@ -111,6 +111,18 @@ int usb_stor_get_info(struct usb_device *dev,
> struct us_data *us, struct blk_desc *dev_desc);
>  int usb_storage_probe(struct usb_device *dev, unsigned int ifnum,
>  		      struct us_data *ss);
> +
> +#ifdef CONFIG_USB_EHCI_HCD
> +	/*
> +	 * The U-Boot EHCI driver can handle any transfer length as
> long as
> +	 * there is enough free heap space left, but the SCSI
> READ(10) and
> +	 * WRITE(10) commands are limited to 65535 blocks.
> +	 */
> +int usb_max_xfer_blk = 4096;

Setting it to magic value - 4096 may be indeed problematic. However, I
don't know what is the 'usual' transfer size.

> +#else
> +int usb_max_xfer_blk = 20;

This is also a magic value - probably took from some old U-Boot (when
default transfer was set to this value).

I do guess that this snippet shall be removed from this patch set.

> +#endif
> +
>  #if CONFIG_IS_ENABLED(BLK)
>  static unsigned long usb_stor_read(struct udevice *dev, lbaint_t
> blknr, lbaint_t blkcnt, void *buffer);
> @@ -729,6 +741,7 @@ static int usb_stor_BBB_transport(struct scsi_cmd
> *srb, struct us_data *us) pipeout = usb_sndbulkpipe(us->pusb_dev,
> us->ep_out); /* DATA phase + error handling */
>  	data_actlen = 0;
> +	mdelay(10);		/* Like linux does. */
>  	/* no data, go immediately to the STATUS phase */
>  	if (srb->datalen == 0)
>  		goto st;
> @@ -740,6 +753,16 @@ static int usb_stor_BBB_transport(struct
> scsi_cmd *srb, struct us_data *us) 
>  	result = usb_bulk_msg(us->pusb_dev, pipe, srb->pdata,
> srb->datalen, &data_actlen, USB_CNTL_TIMEOUT * 5);
> +
> +	/* special handling of XACTERR in DATA phase */
> +	if (result < 0 && (us->pusb_dev->status & USB_ST_XACTERR)) {
> +		debug("XACTERR in data phase - clr, reset, and
> return fail.\n");
> +		usb_stor_BBB_clear_endpt_stall(us, dir_in ?
> +					       us->ep_in :
> us->ep_out);
> +		usb_stor_BBB_reset(us);
> +		return USB_STOR_TRANSPORT_FAILED;
> +	}
> +

Here the case of XACTERR is handled. This is missing now in -master.

>  	/* special handling of STALL in DATA phase */
>  	if ((result < 0) && (us->pusb_dev->status & USB_ST_STALLED))
> { debug("DATA:stall\n");
> @@ -1013,9 +1036,32 @@ static int usb_request_sense(struct scsi_cmd
> *srb, struct us_data *ss) return 0;
>  }
>  
> +/*
> + * This spins up the disk and also consumes the time that the
> + * disk takes to become active and ready to read data.
> + * Some drives (like Western Digital) can take more than 5 seconds.
> + * The delay occurs on the 1st data read from the disk.
> + * Extending the timeout here works better than handling the timeout
> + * as an error on a "real" read.
> + */
> +static int usb_spinup(struct scsi_cmd *srb, struct us_data *ss)
> +{
> +	memset(&srb->cmd[0], 0, 12);
> +	srb->cmd[0] = SCSI_START_STP;
> +	srb->cmd[1] = srb->lun << 5;
> +	srb->cmd[4] = 1;	/* Start spinup. */
> +	srb->datalen = 0;
> +	srb->cmdlen = 6;
> +	ss->pusb_dev->extra_timeout = 9876;
> +	ss->transport(srb, ss);
> +	ss->pusb_dev->extra_timeout = 0;
> +	return 0;
> +}
> +
>  static int usb_test_unit_ready(struct scsi_cmd *srb, struct us_data
> *ss) {
>  	int retries = 10;
> +	int gave_extra_time = 0;
>  
>  	do {
>  		memset(&srb->cmd[0], 0, 12);
> @@ -1038,6 +1084,17 @@ static int usb_test_unit_ready(struct scsi_cmd
> *srb, struct us_data *ss) if ((srb->sense_buf[2] == 0x02) &&
>  		    (srb->sense_buf[12] == 0x3a))
>  			return -1;
> +		/*
> +		 * If the status is "Not Ready - becoming ready",
> give it
> +		 * more time.  Linux issues a spinup command (once)
> and gives
> +		 * it 100 seconds.
> +		 */
> +		if (srb->sense_buf[2] == 0x02 &&
> +		    srb->sense_buf[12] == 0x04 &&
> +		    gave_extra_time == 0) {
> +			retries = 100; /* Allow 10 seconds. */
> +			gave_extra_time = retries;
> +		}

The above code looks like targeting the HDDs, but some degraded pen
drives may need this time as well for proper operation.

>  		mdelay(100);
>  	} while (retries--);
>  
> @@ -1166,22 +1223,57 @@ static unsigned long usb_stor_read(struct
> blk_desc *block_dev, lbaint_t blknr, block_dev->devnum, start, blks,
> buf_addr); 
>  	do {
> -		/* XXX need some comment here */
> +		/*
> +		 * Probably most errors are USB errors, not hard
> disk error.
> +		 * Many disks use a USB chip that is flaky when
> doing large
> +		 * transfers. The workaround is to dynamically
> reduce the
> +		 * transfer size and allow an additional try.
> +		 * This should pick up flaky disks. Linux uses a
> quirks table.
> +		 * We'll use observation:
> +		 * Give it 1 try very large, 1 try large, 2 tries
> medium and 2
> +		 * tries small(ish).
> +		 * On a solid fail (actual disk error - which should
> be rare),
> +		 * this will give us 6 tries max, and only that many
> if the read
> +		 * is quite large.
> +		 * A fail on a very short read obviously doesn't
> have a
> +		 * too-large max_blks.  Timeout due to spinup being
> a case in
> +		 * this point.
> +		 */
>  		retry = 2;
>  		srb->pdata = (unsigned char *)buf_addr;
> -		if (blks > ss->max_xfer_blk)
> -			smallblks = ss->max_xfer_blk;
> +retry_it:
> +		if (blks > usb_max_xfer_blk)
> +			smallblks = usb_max_xfer_blk;
>  		else
>  			smallblks = (unsigned short) blks;
> -retry_it:
> -		if (smallblks == ss->max_xfer_blk)
> +
> +		if (smallblks == usb_max_xfer_blk)
>  			usb_show_progress();
> +
>  		srb->datalen = block_dev->blksz * smallblks;
>  		srb->pdata = (unsigned char *)buf_addr;
>  		if (usb_read_10(srb, ss, start, smallblks)) {
>  			debug("Read ERROR\n");
>  			ss->flags &= ~USB_READY;
>  			usb_request_sense(srb, ss);
> +
> +			/* Dynamically reduce the I/O size. */
> +			if (smallblks > 2047) {
> +				usb_max_xfer_blk = 2047;
> +				++retry;
> +			} else if (smallblks > 512) {
> +				usb_max_xfer_blk = 512;
> +				++retry;
> +			} else if (smallblks > 511) {
> +				usb_max_xfer_blk = 511;
> +				++retry;
> +			} else if (smallblks > 63) {
> +				usb_max_xfer_blk = 63;
> +				retry += 2;
> +			}
> +			debug("step down usb_max_xfer_blk to %d\n",
> +			      usb_max_xfer_blk);
> +

This shall not be needed, as the ehci controller's driver will scatter
the transfers up till 240 blocks max. Am I right?

The above code was supposed to work before the 240 block limitation was
in place?

>  			if (retry--)
>  				goto retry_it;
>  			blkcnt -= blks;
> @@ -1196,8 +1288,7 @@ retry_it:
>  	      start, smallblks, buf_addr);
>  
>  	usb_disable_asynch(0); /* asynch transfer allowed */
> -	if (blkcnt >= ss->max_xfer_blk)
> -		debug("\n");
> +
>  	return blkcnt;
>  }
>  
> @@ -1492,6 +1583,7 @@ int usb_stor_get_info(struct usb_device *dev,
> struct us_data *ss, dev_desc->log2blksz = LOG2(dev_desc->blksz);
>  	dev_desc->type = perq;
>  	debug(" address %d\n", dev_desc->target);
> +	usb_spinup(pccb, ss);

As above - this may be required for some "slower" (i.e. degraded) disks.

>  
>  	return 1;
>  }
> diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c
> index 1cc02052f5..c4041f1e9c 100644
> --- a/drivers/usb/host/ehci-hcd.c
> +++ b/drivers/usb/host/ehci-hcd.c
> @@ -315,7 +315,9 @@ ehci_submit_async(struct usb_device *dev,
> unsigned long pipe, void *buffer, int timeout;
>  	int ret = 0;
>  	struct ehci_ctrl *ctrl = ehci_get_ctrl(dev);
> +	int trynum;
>  
> +	debug("\n***** %s ****\n", __func__);
>  	debug("dev=%p, pipe=%lx, buffer=%p, length=%d, req=%p\n",
> dev, pipe, buffer, length, req);
>  	if (req != NULL)
> @@ -557,6 +559,15 @@ ehci_submit_async(struct usb_device *dev,
> unsigned long pipe, void *buffer, ehci_writel(&ctrl->hcor->or_usbsts,
> (usbsts & 0x3f)); 
>  	/* Enable async. schedule. */
> +	trynum = 1;	/* No more than 2 tries, in case of
> XACTERR. */
> +	/* When the 1st try gets xacterr,
> +	 * 2nd try gets xacterr and often babble and/or halted.
> +	 * 3rd try times out.
> +	 * After the 2nd try, the disk has recovered, so we need to
> clear and
> +	 * reset the USB port, then return fail so the upper layer
> can retry.
> +	 */
> +retry_xacterr:
> +	vtd = &qtd[qtd_counter - 1];
>  	cmd = ehci_readl(&ctrl->hcor->or_usbcmd);
>  	if (!(cmd & CMD_ASE)) {
>  		cmd |= CMD_ASE;
> @@ -572,8 +583,9 @@ ehci_submit_async(struct usb_device *dev,
> unsigned long pipe, void *buffer, 
>  	/* Wait for TDs to be processed. */
>  	ts = get_timer(0);
> -	vtd = &qtd[qtd_counter - 1];
> +
>  	timeout = USB_TIMEOUT_MS(pipe);
> +	timeout += dev->extra_timeout;
>  	do {
>  		/* Invalidate dcache */
>  		invalidate_dcache_range((unsigned
> long)&ctrl->qh_list, @@ -588,9 +600,10 @@ ehci_submit_async(struct
> usb_device *dev, unsigned long pipe, void *buffer, break;
>  		WATCHDOG_RESET();
>  	} while (get_timer(ts) < timeout);
> -	qhtoken = hc32_to_cpu(qh->qh_overlay.qt_token);
>  
> -	ctrl->qh_list.qh_link =
> cpu_to_hc32(virt_to_phys(&ctrl->qh_list) | QH_LINK_TYPE_QH);
> +	debug("took %4lu ms of %4d\n", get_timer(ts), timeout);
> +	ctrl->qh_list.qh_link =
> +		cpu_to_hc32(virt_to_phys(&ctrl->qh_list) |
> QH_LINK_TYPE_QH); flush_dcache_range((unsigned long)&ctrl->qh_list,
>  		ALIGN_END_ADDR(struct QH, &ctrl->qh_list, 1));
>  
> @@ -605,14 +618,48 @@ ehci_submit_async(struct usb_device *dev,
> unsigned long pipe, void *buffer, */
>  	if (buffer != NULL && length > 0)
>  		invalidate_dcache_range((unsigned long)buffer,
> -			ALIGN((unsigned long)buffer + length,
> ARCH_DMA_MINALIGN));
> +			ALIGN((unsigned long)buffer + length,
> +			      ARCH_DMA_MINALIGN));
>  
>  	/* Check that the TD processing happened */
>  	if (QT_TOKEN_GET_STATUS(token) & QT_TOKEN_STATUS_ACTIVE)
>  		printf("EHCI timed out on TD - token=%#x\n", token);
>  
> +	/* Disable async schedule. */
> +	cmd = ehci_readl(&ctrl->hcor->or_usbcmd);
> +	cmd &= ~CMD_ASE;
> +	ehci_writel(&ctrl->hcor->or_usbcmd, cmd);
> +
> +	ret = handshake((uint32_t *)&ctrl->hcor->or_usbsts, STS_ASS,
> 0,
> +			100 * 1000);
> +	if (ret < 0) {
> +		printf("EHCI fail timeout STS_ASS reset\n");
> +		goto fail;
> +	}
> +
> +	qhtoken = hc32_to_cpu(qh->qh_overlay.qt_token);
>  	if (!(QT_TOKEN_GET_STATUS(qhtoken) &
> QT_TOKEN_STATUS_ACTIVE)) { debug("TOKEN=%#x\n", qhtoken);
> +		if (qhtoken & QT_TOKEN_STATUS_XACTERR) {
			      ^^^^^^^^^^^^^^^^^^^^^^^^ - this flag
			      before this patch was also not checked. 

> +			if (--trynum >= 0) {
> +				/*
> +				 * It is necessary to do this,
> otherwise the
> +				 * disk is clagged.
> +				 */
> +				debug("reset the TD and redo,
> because of XACTERR\n");
> +				qhtoken &= ~QT_TOKEN_STATUS_HALTED;
> +				qhtoken |= QT_TOKEN_STATUS_ACTIVE |
> +					QT_TOKEN_CERR(2);
> +				vtd->qt_token = cpu_to_hc32(qhtoken);
> +				qh->qh_overlay.qt_token =
> cpu_to_hc32(qhtoken);
> +				goto retry_xacterr;
> +			}
> +			dev->status = USB_ST_XACTERR;
> +			dev->act_len = length -
> +				QT_TOKEN_GET_TOTALBYTES(qhtoken);

This may solve the issue - as it resets the controller in the case of
transfer error.

> +			goto fail;
> +		}
> +
>  		switch (QT_TOKEN_GET_STATUS(qhtoken) &
>  			~(QT_TOKEN_STATUS_SPLITXSTATE |
> QT_TOKEN_STATUS_PERR)) { case 0:
> diff --git a/include/usb.h b/include/usb.h
> index efb67ea33f..41c8c47f89 100644
> --- a/include/usb.h
> +++ b/include/usb.h
> @@ -140,6 +140,7 @@ struct usb_device {
>  	int act_len;			/* transferred bytes */
>  	int maxchild;			/* Number of ports if
> hub */ int portnr;			/* Port number, 1=first */
> +	int extra_timeout;	/* Add to timeout in
> ehci_submit_async */ #if !CONFIG_IS_ENABLED(DM_USB)
>  	/* parent hub, or NULL if this is the root hub */
>  	struct usb_device *parent;
> diff --git a/include/usb_defs.h b/include/usb_defs.h
> index 6dd2c997f9..572f6ab296 100644
> --- a/include/usb_defs.h
> +++ b/include/usb_defs.h
> @@ -197,6 +197,7 @@
>  #define USB_ST_NAK_REC          0x10	/* NAK Received*/
>  #define USB_ST_CRC_ERR          0x20	/* CRC/timeout Error */
>  #define USB_ST_BIT_ERR          0x40	/* Bitstuff error */
> +#define USB_ST_XACTERR          0x80	/* XACTERR error */
>  #define USB_ST_NOT_PROC         0x80000000L	/* Not yet
> processed */ 
>  




Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@denx.de
Tom Rini March 2, 2020, 5 p.m. UTC | #13
On Mon, Mar 02, 2020 at 01:39:49AM +0100, Marek Vasut wrote:
> On 3/2/20 12:04 AM, Tom Rini wrote:
> [...]
> 
> >>> 3 USB Device(s) found
> >>>        scanning usb for ethernet devices... 0 Ethernet Device(s) found
> >>> Hit any key to stop autoboot:  2  0 
> >>> BeagleBoard # usb tree
> >>> USB device tree:
> >>>   1  Hub (480 Mb/s, 0mA)
> >>>   |  u-boot EHCI Host Controller 
> >>>   |
> >>>   |+-2  Hub (480 Mb/s, 2mA)
> >>>     |
> >>>     |+-3  See Interface (480 Mb/s, 0mA)
> >>>          ??????????? ??????????? ???????????
> >>>        
> >>> BeagleBoard # 
> >>>
> >>> Note that the hub and ethernet are on-SBC and not something I'm plugging
> >>> in.  Thanks!
> >>
> >> The device #3 is a usb mass storage or what is it ?
> > 
> > It's a usb ethernet device.
> > 
> >> Can you try and implement usb_get_max_xfer_size for musb and make it
> >> report 240*512 unconditionally (*size = 240*512; return 0;) ? I think
> >> that would "fix" it for you on omap too.
> > 
> > I'll pencil in some time to try that, thanks.
> 
> That's not gonna help you with USB ethernet. I recall seeing flakiness
> with asix devices, maybe that's what you're running into?
> 
> Do you have any further details on that device ? lsusb -vvv would help.

I can only give:

BeagleBoard # usb info
1: Hub,  USB Revision 2.0
 - u-boot EHCI Host Controller
 - Class: Hub
 - PacketSize: 64  Configurations: 1
 - Vendor: 0x0000  Product 0x0000 Version 1.0
   Configuration: 1
   - Interfaces: 1 Self Powered 0mA
     Interface: 0
     - Alternate Setting 0, Endpoints: 1
     - Class Hub
     - Endpoint 1 In Interrupt MaxPacket 8 Interval 255ms

2: Hub,  USB Revision 2.0
 - Class: Hub
 - PacketSize: 64  Configurations: 1
 - Vendor: 0x0424  Product 0x9514 Version 2.0
   Configuration: 1
   - Interfaces: 1 Self Powered Remote Wakeup 2mA
     Interface: 0
     - Alternate Setting 0, Endpoints: 1
     - Class Hub
     - Endpoint 1 In Interrupt MaxPacket 1 Interval 12ms
     - Endpoint 1 In Interrupt MaxPacket 1 Interval 12ms

3: Vendor specific,  USB Revision 2.0
 - Class: Vendor specific
 - PacketSize: 64  Configurations: 1
 - Vendor: 0x0424  Product 0xec00 Version 2.0
   Configuration: 1
   - Interfaces: 1 Self Powered Remote Wakeup 2mA
     Interface: 0
     - Alternate Setting 0, Endpoints: 3
     - Class Vendor specific
     - Endpoint 1 In Bulk MaxPacket 512
     - Endpoint 2 Out Bulk MaxPacket 512
     - Endpoint 3 In Interrupt MaxPacket 16 Interval 4ms

As the old Fedora core image on the SD card there isn't happy and I
don't have time to dig in to that problem right now.  But please note
that the issue here is a relatively recent regression as Guillaume
reported.
Marek Vasut March 2, 2020, 7:08 p.m. UTC | #14
On 3/2/20 2:01 PM, Lukasz Majewski wrote:
> Hi Marek,

Hi,

>> On 3/1/20 6:59 PM, Lukasz Majewski wrote:
>>> Hi Marek,  
>>
>> Hi,
>>
>> [...]
>>
>>>>> The description of the fix written by the original author:
>>>>> https://forum.doozan.com/read.php?3,35295,35295#msg-35295
>>>>>
>>>>> states that the reduction of the transfer is done just for "spin
>>>>> up"/"detection" of the pen drive. The issue is that not all pen
>>>>> drive disks are able to be properly detected in the first place.
>>>>>   
>>>>
>>>> But the commit message claims this patch reduces the transfer rate
>>>> by 20% always. So that's a NAK right away, sorry,   
>>>
>>> Marek, now we do have from time to time NOT functional USB (once per
>>> ten attempts on iMX6Q - tpc70 board). If somebody relies on 'usb
>>> start' for recovery/normal operation - it is bad.
>>>
>>> In such a use case - the 20% performance drop is acceptable.  
>>
>> I'm sure we can do better. 2020.04 is still quite far out.
>>
>> Can you take a look ?
> 
> There are several issues which are fixed by this patch. 

Then it should be multiple patches ?

> I will reply to the actual patch set to point them out in the code, as
> this may be more readable.
> 
>>
>>>> you need to find a
>>>> better solution which doesn't have such an enormous impact.
>>>>
>>>> I'm fine with the spin-up handling, but this auto-detection of
>>>> transfer size should be pulled into separate patch and is likely
>>>> wrong anyway.  
>>>
>>> I cannot agree that is it "likely wrong anyway" - please correct me
>>> if I'm wrong, but up till now we were fiddling with transfer size
>>> up till having "good enough" results.  
>>
>> Correction, read the patches I linked above
>> (7d6fd7f0ba71cd93d94079132f958d9630f27a89 
> 
> The above sha1 uses the observation (and heuristics) from other
> operating systems to deal with usb pendrives, which controllers cannot
> handle more than 256 sectors (as they have 8 bit counter).

Yes

>> and esp.
>> 02b0e1a36c5bc20174299312556ec4e266872bd6).
> 
> This patch introduces the optimization - instead of setup/disable async
> transmissions - it enables it once and then just adds new tokens to be
> transmitted.

Yes

> Is the transmission error (USB_ST_XACTERR 0x80) handled at this point?

I don't think so.

>> Those solve the "we were
>> fiddling around with transfer size" attempts, because I actually
>> invested the time to understand the issue with those devices that
>> failed and I was right about never letting any of those "fiddling
>> around with transfer size" patches into mainline.
>>
>> The problem with those devices which needed "fiddling with transfer
>> size" was that their internal 16bit counters (or 8bit counters)
>> overflew when a very long transfer was started and thus those devices
>> failed. The fix there was to transfer less than that amount of data
>> which triggered the overflow, which on a wast majority of devices is
>> 240 blocks. However , this led to a massive slowdown, which is
>> unacceptable. A fix for that slowdown was to avoid turning async
>> schedule off-on between each transfer, but rather keep it running,
>> that saves A LOT of wasted transfer time. If you build a long
>> transfer up front, you would never do that many transfers to notice
>> the delays incurred by turning the async schedule off/on, but if you
>> do multiple shorter transfers, these delays add up.
>>
> 
> Thanks for the explanation.
> 
>> Yes, these issues are convoluted and take time to solve properly, yet
>> the end result might not have such a performance hit.
>>
>>> The above link describes what needs to be done in case of the "EHCI
>>> timed out on TD - token=0x1f8c80" error.
>>>
>>> Do we handle it now? In my opinion we don't.
>>>   
>>>>
>>>> So try to separate this into two patches, one for handling the spin
>>>> up, another one for this auto-detection of the transfer size. And I
>>>> think you will end up needed only the first one.  
>>>
>>> This is an open question if only the spin up fixes the error. It
>>> would be great if you could look for the changes introduced in this
>>> patch and compare it with your experience of ehci driver
>>> development.  
>>
>> Surely you can separate those two patches and perform such a test ?
> 
> I would prefer to first go through the actual patch code, before I
> spent time on things which may be not necessary.

How do you suggest we go through a patch which does multiple things,
which have effects on one another, and analyze it properly ?

>>>>> Last but not least - the mainline is still broken. The
>>>>> "token=0x1f8c80" errors pops up from time to time. However, after
>>>>> applying the approach from this fix - the error is gone (and pass
>>>>> some acceptance tests).    
>>>>
>>>> It works for me.  
>>>
>>> Also works for Tom.
>>>   
>>>> Do you have a specific device which fails ?  
>>>
>>> Yes. On my desk - tpc70 with i.MX6Q. And hdc with i.MX53. I can
>>> confirm that from U-Boot 2018.04 up till now the rate of:
>>>
>>> "EHCI timed out on TD - token=0x1f8c80" error has been reduced
>>> significantly for their use case (reading ~16MiB of binary from
>>> several pen drives).   
>>
>> Can you run lsusb (to get IDs) on those pendrives which fail ?
>> I would like to add them to a list.
> 
> This was working correctly on the beginning, but after some time it
> started to show errors in question:
> 
> Bus 001 Device 040: ID 0930:6544 Toshiba Corp. TransMemory-Mini /
> Kingston DataTraveler 2.0 Stick

This one I believe has 16bit counter indeed.

> This one was causing issues from the very beginning:
> Bus 001 Device 041: ID 058f:6387 Alcor Micro Corp. Flash Drive

Silver INTENSO USB stick ? I have that one too (or at least one with
such ID), it was breaking DWC2 controller.

> This one - Verbatim 
> Bus 001 Device 042: ID 21c4:8005

This one is new.

> Never caused issues - worked reliably even with old U-Boot (on which
> errors were observed very often).

Maybe the problems you see have something to do with the async schedule?

[...]
Marek Vasut March 2, 2020, 7:19 p.m. UTC | #15
On 3/2/20 2:21 PM, Lukasz Majewski wrote:
> Hi,

Hi,

>> This patch aims to improve robustness of 'usb' command operation on
>> the ehci-hcd IP block as it ports to contemporary U-Boot the patch
>> described and provided in [1] (originally applicable to U-Boot
>> 2016.05).
>>
>> According to the fix author - "rayvt" (from [1]):
>>
>> TD stands for "Queue Element Transfer Descriptor (qTD)", which is a
>> micro-code instruction for the EHCI USB chip.
>> The "token" field is detailed information, control, and status of the
>> TD's data transfer. In particular, the rightmost byte is the status
>> field. 0x80 bit means the TD is active and the data transfer has not
>> yet completed. 0x08 bit means there was some sort of data transfer
>> error (XactErr).
> 	 ^^^^^^^ - this error is NOT handled now in mainline u-boot. Do
> 	 you have any experience with it?

No, but then I also didn't run into those yet.

>> If the drive gets a transfer error, it refuses to do any other I/O
>> transfer until the error is properly cleared and reset. U-boot did
>> not do this, so every subsequent disk read would get a timeout error
>> because the drive would not send any data. The fix is to properly
>> clear and reset the USB disk when it gets a transfer error.

So basically, what this says is, "we crashed the drive, because the
drive suffered counter overflow, thus we fix it by resetting the drive".
No, the fix is to not crash the drive in the first place.

>> Every read operation starts at the maximum block size. When the USB
>> pendrive is not able to correctly serve this data read request, the
>> dynamic reduction of IO size is performed. Up to six tries (with
>> smaller IO block each time) are attempted.
>>
>> A related problem is that some drives are slow to come up. Linux
>> handles this by issuing a spinup command and allowing more time for
>> the drive to respond. The same idea is applied in this fix.
>>
>> On TPC70 (i.MX6Q) once per ~10 times (without this patch):
>>
>> Bus usb@2184200: USB EHCI 1.00
>> scanning bus usb@2184200 for devices... 2 USB Device(s) found
>>        scanning usb for storage devices... 1 Storage Device(s) found
>> EHCI timed out on TD - token=0x1f8c80
> 
> This is how the error gets evident. The detailed explanation is in link
> [1].

Is there one specific post in that forum, or do I need to read through
the whole multi-page thread ?

I would expect that if you run -- I assume 'usb reset' (the command is
missing above) -- then the bus gets power-cycled, hence the USB device
also gets power-cycled.

>> Performance impact:
>>
>> With U-Boot -master in mainline:
>> 16869440 bytes read in 979 ms (16.4 MiB/s)
>>
>> With this patch:
>> 16869440 bytes read in 1219 ms (13.2 MiB/s)
>>
>> Links:
>> [1] - https://forum.doozan.com/read.php?3,35295,35295#msg-35295
>> [2] -
>> https://www.dropbox.com/s/nrkrd1no63viuu8/uboot-bodhi-2016.05-timeoutTD.patch?dl=0
>>
>> Signed-off-by: Lukasz Majewski <lukma@denx.de>
>> [Unfortunately, the original patch [2] did not contain S-o-B from the
>> original author - "rayvt"]
>> ---
>>
>>  common/usb.c                |  10 +++-
>>  common/usb_storage.c        | 106
>> +++++++++++++++++++++++++++++++++--- drivers/usb/host/ehci-hcd.c |
>> 55 +++++++++++++++++-- include/usb.h               |   1 +
>>  include/usb_defs.h          |   1 +
>>  5 files changed, 160 insertions(+), 13 deletions(-)
>>
>> diff --git a/common/usb.c b/common/usb.c
>> index 349e838f1d..305482b5bb 100644
>> --- a/common/usb.c
>> +++ b/common/usb.c
>> @@ -925,14 +925,20 @@ static int get_descriptor_len(struct usb_device
>> *dev, int len, int expect_len) __maybe_unused struct
>> usb_device_descriptor *desc; ALLOC_CACHE_ALIGN_BUFFER(unsigned char,
>> tmpbuf, USB_BUFSIZ); int err;
>> +	int retry = 5;
>>  
>>  	desc = (struct usb_device_descriptor *)tmpbuf;
>>  
>> +again:
>>  	err = usb_get_descriptor(dev, USB_DT_DEVICE, 0, desc, len);
>>  	if (err < expect_len) {
>>  		if (err < 0) {
>> -			printf("unable to get device descriptor
>> (error=%d)\n",
>> -				err);
>> +			printf("unable to get device descriptor
>> (error=%d) retry: %d\n",
>> +			       err, retry);
>> +			mdelay(50);
>> +			if (--retry >= 0)
>> +				/* Some drives are just slow to wake
>> up. */
>> +				goto again;
> 
> This I think is just to "wait" for the inserted device if it is rather
> slow (or probably its memory got degraded over time).
> 
> This shouldn't bring any transfer slow down.

Then this could be a separate patch ?

>>  			return err;
>>  		} else {
>>  			printf("USB device descriptor short read
>> (expected %i, got %i)\n", diff --git a/common/usb_storage.c
>> b/common/usb_storage.c index 097b6729c1..48c8c2ae64 100644
>> --- a/common/usb_storage.c
>> +++ b/common/usb_storage.c
>> @@ -111,6 +111,18 @@ int usb_stor_get_info(struct usb_device *dev,
>> struct us_data *us, struct blk_desc *dev_desc);
>>  int usb_storage_probe(struct usb_device *dev, unsigned int ifnum,
>>  		      struct us_data *ss);
>> +
>> +#ifdef CONFIG_USB_EHCI_HCD
>> +	/*
>> +	 * The U-Boot EHCI driver can handle any transfer length as
>> long as
>> +	 * there is enough free heap space left, but the SCSI
>> READ(10) and
>> +	 * WRITE(10) commands are limited to 65535 blocks.
>> +	 */
>> +int usb_max_xfer_blk = 4096;
> 
> Setting it to magic value - 4096 may be indeed problematic. However, I
> don't know what is the 'usual' transfer size.

240 blocks , see 7d6fd7f0ba71cd93d94079132f958d9630f27a89 .

>> +#else
>> +int usb_max_xfer_blk = 20;
> 
> This is also a magic value - probably took from some old U-Boot (when
> default transfer was set to this value).
> 
> I do guess that this snippet shall be removed from this patch set.

Probably.

>> +#endif
>> +
>>  #if CONFIG_IS_ENABLED(BLK)
>>  static unsigned long usb_stor_read(struct udevice *dev, lbaint_t
>> blknr, lbaint_t blkcnt, void *buffer);
>> @@ -729,6 +741,7 @@ static int usb_stor_BBB_transport(struct scsi_cmd
>> *srb, struct us_data *us) pipeout = usb_sndbulkpipe(us->pusb_dev,
>> us->ep_out); /* DATA phase + error handling */
>>  	data_actlen = 0;
>> +	mdelay(10);		/* Like linux does. */
>>  	/* no data, go immediately to the STATUS phase */
>>  	if (srb->datalen == 0)
>>  		goto st;
>> @@ -740,6 +753,16 @@ static int usb_stor_BBB_transport(struct
>> scsi_cmd *srb, struct us_data *us) 
>>  	result = usb_bulk_msg(us->pusb_dev, pipe, srb->pdata,
>> srb->datalen, &data_actlen, USB_CNTL_TIMEOUT * 5);
>> +
>> +	/* special handling of XACTERR in DATA phase */
>> +	if (result < 0 && (us->pusb_dev->status & USB_ST_XACTERR)) {
>> +		debug("XACTERR in data phase - clr, reset, and
>> return fail.\n");
>> +		usb_stor_BBB_clear_endpt_stall(us, dir_in ?
>> +					       us->ep_in :
>> us->ep_out);
>> +		usb_stor_BBB_reset(us);
>> +		return USB_STOR_TRANSPORT_FAILED;
>> +	}
>> +
> 
> Here the case of XACTERR is handled. This is missing now in -master.

Surely this could be added as a separate patch ?

>>  	/* special handling of STALL in DATA phase */
>>  	if ((result < 0) && (us->pusb_dev->status & USB_ST_STALLED))
>> { debug("DATA:stall\n");
>> @@ -1013,9 +1036,32 @@ static int usb_request_sense(struct scsi_cmd
>> *srb, struct us_data *ss) return 0;
>>  }
>>  
>> +/*
>> + * This spins up the disk and also consumes the time that the
>> + * disk takes to become active and ready to read data.
>> + * Some drives (like Western Digital) can take more than 5 seconds.
>> + * The delay occurs on the 1st data read from the disk.
>> + * Extending the timeout here works better than handling the timeout
>> + * as an error on a "real" read.
>> + */
>> +static int usb_spinup(struct scsi_cmd *srb, struct us_data *ss)
>> +{
>> +	memset(&srb->cmd[0], 0, 12);
>> +	srb->cmd[0] = SCSI_START_STP;
>> +	srb->cmd[1] = srb->lun << 5;
>> +	srb->cmd[4] = 1;	/* Start spinup. */
>> +	srb->datalen = 0;
>> +	srb->cmdlen = 6;
>> +	ss->pusb_dev->extra_timeout = 9876;
>> +	ss->transport(srb, ss);
>> +	ss->pusb_dev->extra_timeout = 0;
>> +	return 0;
>> +}
>> +
>>  static int usb_test_unit_ready(struct scsi_cmd *srb, struct us_data
>> *ss) {
>>  	int retries = 10;
>> +	int gave_extra_time = 0;
>>  
>>  	do {
>>  		memset(&srb->cmd[0], 0, 12);
>> @@ -1038,6 +1084,17 @@ static int usb_test_unit_ready(struct scsi_cmd
>> *srb, struct us_data *ss) if ((srb->sense_buf[2] == 0x02) &&
>>  		    (srb->sense_buf[12] == 0x3a))
>>  			return -1;
>> +		/*
>> +		 * If the status is "Not Ready - becoming ready",
>> give it
>> +		 * more time.  Linux issues a spinup command (once)
>> and gives
>> +		 * it 100 seconds.
>> +		 */
>> +		if (srb->sense_buf[2] == 0x02 &&
>> +		    srb->sense_buf[12] == 0x04 &&
>> +		    gave_extra_time == 0) {
>> +			retries = 100; /* Allow 10 seconds. */
>> +			gave_extra_time = retries;
>> +		}
> 
> The above code looks like targeting the HDDs, but some degraded pen
> drives may need this time as well for proper operation.

Hard drive spin up handling. This could be a separate patch too.

>>  		mdelay(100);
>>  	} while (retries--);
>>  
>> @@ -1166,22 +1223,57 @@ static unsigned long usb_stor_read(struct
>> blk_desc *block_dev, lbaint_t blknr, block_dev->devnum, start, blks,
>> buf_addr); 
>>  	do {
>> -		/* XXX need some comment here */
>> +		/*
>> +		 * Probably most errors are USB errors, not hard
>> disk error.
>> +		 * Many disks use a USB chip that is flaky when
>> doing large
>> +		 * transfers. The workaround is to dynamically
>> reduce the
>> +		 * transfer size and allow an additional try.
>> +		 * This should pick up flaky disks. Linux uses a
>> quirks table.
>> +		 * We'll use observation:
>> +		 * Give it 1 try very large, 1 try large, 2 tries
>> medium and 2
>> +		 * tries small(ish).
>> +		 * On a solid fail (actual disk error - which should
>> be rare),
>> +		 * this will give us 6 tries max, and only that many
>> if the read
>> +		 * is quite large.
>> +		 * A fail on a very short read obviously doesn't
>> have a
>> +		 * too-large max_blks.  Timeout due to spinup being
>> a case in
>> +		 * this point.
>> +		 */
>>  		retry = 2;
>>  		srb->pdata = (unsigned char *)buf_addr;
>> -		if (blks > ss->max_xfer_blk)
>> -			smallblks = ss->max_xfer_blk;
>> +retry_it:
>> +		if (blks > usb_max_xfer_blk)
>> +			smallblks = usb_max_xfer_blk;
>>  		else
>>  			smallblks = (unsigned short) blks;
>> -retry_it:
>> -		if (smallblks == ss->max_xfer_blk)
>> +
>> +		if (smallblks == usb_max_xfer_blk)
>>  			usb_show_progress();
>> +
>>  		srb->datalen = block_dev->blksz * smallblks;
>>  		srb->pdata = (unsigned char *)buf_addr;
>>  		if (usb_read_10(srb, ss, start, smallblks)) {
>>  			debug("Read ERROR\n");
>>  			ss->flags &= ~USB_READY;
>>  			usb_request_sense(srb, ss);
>> +
>> +			/* Dynamically reduce the I/O size. */
>> +			if (smallblks > 2047) {
>> +				usb_max_xfer_blk = 2047;
>> +				++retry;
>> +			} else if (smallblks > 512) {
>> +				usb_max_xfer_blk = 512;
>> +				++retry;
>> +			} else if (smallblks > 511) {
>> +				usb_max_xfer_blk = 511;
>> +				++retry;
>> +			} else if (smallblks > 63) {
>> +				usb_max_xfer_blk = 63;
>> +				retry += 2;
>> +			}
>> +			debug("step down usb_max_xfer_blk to %d\n",
>> +			      usb_max_xfer_blk);
>> +
> 
> This shall not be needed, as the ehci controller's driver will scatter
> the transfers up till 240 blocks max. Am I right?
> 
> The above code was supposed to work before the 240 block limitation was
> in place?

Correct, this code is wrong.

[...]

>> +	qhtoken = hc32_to_cpu(qh->qh_overlay.qt_token);
>>  	if (!(QT_TOKEN_GET_STATUS(qhtoken) &
>> QT_TOKEN_STATUS_ACTIVE)) { debug("TOKEN=%#x\n", qhtoken);
>> +		if (qhtoken & QT_TOKEN_STATUS_XACTERR) {
> 			      ^^^^^^^^^^^^^^^^^^^^^^^^ - this flag
> 			      before this patch was also not checked. 
> 
>> +			if (--trynum >= 0) {
>> +				/*
>> +				 * It is necessary to do this,
>> otherwise the
>> +				 * disk is clagged.
>> +				 */
>> +				debug("reset the TD and redo,
>> because of XACTERR\n");
>> +				qhtoken &= ~QT_TOKEN_STATUS_HALTED;
>> +				qhtoken |= QT_TOKEN_STATUS_ACTIVE |
>> +					QT_TOKEN_CERR(2);
>> +				vtd->qt_token = cpu_to_hc32(qhtoken);
>> +				qh->qh_overlay.qt_token =
>> cpu_to_hc32(qhtoken);
>> +				goto retry_xacterr;
>> +			}
>> +			dev->status = USB_ST_XACTERR;
>> +			dev->act_len = length -
>> +				QT_TOKEN_GET_TOTALBYTES(qhtoken);
> 
> This may solve the issue - as it resets the controller in the case of
> transfer error.

Do you need to reset the controller ? Or is there some graceful way out,
i.e. a way to recover from the error ?
[...]
Tom Rini March 2, 2020, 7:54 p.m. UTC | #16
On Mon, Mar 02, 2020 at 08:19:22PM +0100, Marek Vasut wrote:
> On 3/2/20 2:21 PM, Lukasz Majewski wrote:
> > Hi,
> 
> Hi,
> 
> >> This patch aims to improve robustness of 'usb' command operation on
> >> the ehci-hcd IP block as it ports to contemporary U-Boot the patch
> >> described and provided in [1] (originally applicable to U-Boot
> >> 2016.05).
> >>
> >> According to the fix author - "rayvt" (from [1]):
> >>
> >> TD stands for "Queue Element Transfer Descriptor (qTD)", which is a
> >> micro-code instruction for the EHCI USB chip.
> >> The "token" field is detailed information, control, and status of the
> >> TD's data transfer. In particular, the rightmost byte is the status
> >> field. 0x80 bit means the TD is active and the data transfer has not
> >> yet completed. 0x08 bit means there was some sort of data transfer
> >> error (XactErr).
> > 	 ^^^^^^^ - this error is NOT handled now in mainline u-boot. Do
> > 	 you have any experience with it?
> 
> No, but then I also didn't run into those yet.
> 
> >> If the drive gets a transfer error, it refuses to do any other I/O
> >> transfer until the error is properly cleared and reset. U-boot did
> >> not do this, so every subsequent disk read would get a timeout error
> >> because the drive would not send any data. The fix is to properly
> >> clear and reset the USB disk when it gets a transfer error.
> 
> So basically, what this says is, "we crashed the drive, because the
> drive suffered counter overflow, thus we fix it by resetting the drive".
> No, the fix is to not crash the drive in the first place.
> 
> >> Every read operation starts at the maximum block size. When the USB
> >> pendrive is not able to correctly serve this data read request, the
> >> dynamic reduction of IO size is performed. Up to six tries (with
> >> smaller IO block each time) are attempted.
> >>
> >> A related problem is that some drives are slow to come up. Linux
> >> handles this by issuing a spinup command and allowing more time for
> >> the drive to respond. The same idea is applied in this fix.
> >>
> >> On TPC70 (i.MX6Q) once per ~10 times (without this patch):
> >>
> >> Bus usb@2184200: USB EHCI 1.00
> >> scanning bus usb@2184200 for devices... 2 USB Device(s) found
> >>        scanning usb for storage devices... 1 Storage Device(s) found
> >> EHCI timed out on TD - token=0x1f8c80
> > 
> > This is how the error gets evident. The detailed explanation is in link
> > [1].
> 
> Is there one specific post in that forum, or do I need to read through
> the whole multi-page thread ?
> 
> I would expect that if you run -- I assume 'usb reset' (the command is
> missing above) -- then the bus gets power-cycled, hence the USB device
> also gets power-cycled.

FWIW, the problem I have does not get fixed via "usb reset".  Ethernet
never shows up again.  I do see right now that I disabled network tests
on the platform a long while ago for unknown reasons, but did comment to
myself elsewhere that network was fairly flaky in U-Boot.
Marek Vasut March 2, 2020, 7:58 p.m. UTC | #17
On 3/2/20 8:54 PM, Tom Rini wrote:
> On Mon, Mar 02, 2020 at 08:19:22PM +0100, Marek Vasut wrote:
>> On 3/2/20 2:21 PM, Lukasz Majewski wrote:
>>> Hi,
>>
>> Hi,
>>
>>>> This patch aims to improve robustness of 'usb' command operation on
>>>> the ehci-hcd IP block as it ports to contemporary U-Boot the patch
>>>> described and provided in [1] (originally applicable to U-Boot
>>>> 2016.05).
>>>>
>>>> According to the fix author - "rayvt" (from [1]):
>>>>
>>>> TD stands for "Queue Element Transfer Descriptor (qTD)", which is a
>>>> micro-code instruction for the EHCI USB chip.
>>>> The "token" field is detailed information, control, and status of the
>>>> TD's data transfer. In particular, the rightmost byte is the status
>>>> field. 0x80 bit means the TD is active and the data transfer has not
>>>> yet completed. 0x08 bit means there was some sort of data transfer
>>>> error (XactErr).
>>> 	 ^^^^^^^ - this error is NOT handled now in mainline u-boot. Do
>>> 	 you have any experience with it?
>>
>> No, but then I also didn't run into those yet.
>>
>>>> If the drive gets a transfer error, it refuses to do any other I/O
>>>> transfer until the error is properly cleared and reset. U-boot did
>>>> not do this, so every subsequent disk read would get a timeout error
>>>> because the drive would not send any data. The fix is to properly
>>>> clear and reset the USB disk when it gets a transfer error.
>>
>> So basically, what this says is, "we crashed the drive, because the
>> drive suffered counter overflow, thus we fix it by resetting the drive".
>> No, the fix is to not crash the drive in the first place.
>>
>>>> Every read operation starts at the maximum block size. When the USB
>>>> pendrive is not able to correctly serve this data read request, the
>>>> dynamic reduction of IO size is performed. Up to six tries (with
>>>> smaller IO block each time) are attempted.
>>>>
>>>> A related problem is that some drives are slow to come up. Linux
>>>> handles this by issuing a spinup command and allowing more time for
>>>> the drive to respond. The same idea is applied in this fix.
>>>>
>>>> On TPC70 (i.MX6Q) once per ~10 times (without this patch):
>>>>
>>>> Bus usb@2184200: USB EHCI 1.00
>>>> scanning bus usb@2184200 for devices... 2 USB Device(s) found
>>>>        scanning usb for storage devices... 1 Storage Device(s) found
>>>> EHCI timed out on TD - token=0x1f8c80
>>>
>>> This is how the error gets evident. The detailed explanation is in link
>>> [1].
>>
>> Is there one specific post in that forum, or do I need to read through
>> the whole multi-page thread ?
>>
>> I would expect that if you run -- I assume 'usb reset' (the command is
>> missing above) -- then the bus gets power-cycled, hence the USB device
>> also gets power-cycled.
> 
> FWIW, the problem I have does not get fixed via "usb reset".  Ethernet
> never shows up again.  I do see right now that I disabled network tests
> on the platform a long while ago for unknown reasons, but did comment to
> myself elsewhere that network was fairly flaky in U-Boot.

There certainly is something with USB ethernet in U-Boot. I noticed the
flakiness myself, but didn't have time to investigate. Patches welcome I
guess ?
Tom Rini March 2, 2020, 7:59 p.m. UTC | #18
On Mon, Mar 02, 2020 at 08:58:26PM +0100, Marek Vasut wrote:
> On 3/2/20 8:54 PM, Tom Rini wrote:
> > On Mon, Mar 02, 2020 at 08:19:22PM +0100, Marek Vasut wrote:
> >> On 3/2/20 2:21 PM, Lukasz Majewski wrote:
> >>> Hi,
> >>
> >> Hi,
> >>
> >>>> This patch aims to improve robustness of 'usb' command operation on
> >>>> the ehci-hcd IP block as it ports to contemporary U-Boot the patch
> >>>> described and provided in [1] (originally applicable to U-Boot
> >>>> 2016.05).
> >>>>
> >>>> According to the fix author - "rayvt" (from [1]):
> >>>>
> >>>> TD stands for "Queue Element Transfer Descriptor (qTD)", which is a
> >>>> micro-code instruction for the EHCI USB chip.
> >>>> The "token" field is detailed information, control, and status of the
> >>>> TD's data transfer. In particular, the rightmost byte is the status
> >>>> field. 0x80 bit means the TD is active and the data transfer has not
> >>>> yet completed. 0x08 bit means there was some sort of data transfer
> >>>> error (XactErr).
> >>> 	 ^^^^^^^ - this error is NOT handled now in mainline u-boot. Do
> >>> 	 you have any experience with it?
> >>
> >> No, but then I also didn't run into those yet.
> >>
> >>>> If the drive gets a transfer error, it refuses to do any other I/O
> >>>> transfer until the error is properly cleared and reset. U-boot did
> >>>> not do this, so every subsequent disk read would get a timeout error
> >>>> because the drive would not send any data. The fix is to properly
> >>>> clear and reset the USB disk when it gets a transfer error.
> >>
> >> So basically, what this says is, "we crashed the drive, because the
> >> drive suffered counter overflow, thus we fix it by resetting the drive".
> >> No, the fix is to not crash the drive in the first place.
> >>
> >>>> Every read operation starts at the maximum block size. When the USB
> >>>> pendrive is not able to correctly serve this data read request, the
> >>>> dynamic reduction of IO size is performed. Up to six tries (with
> >>>> smaller IO block each time) are attempted.
> >>>>
> >>>> A related problem is that some drives are slow to come up. Linux
> >>>> handles this by issuing a spinup command and allowing more time for
> >>>> the drive to respond. The same idea is applied in this fix.
> >>>>
> >>>> On TPC70 (i.MX6Q) once per ~10 times (without this patch):
> >>>>
> >>>> Bus usb@2184200: USB EHCI 1.00
> >>>> scanning bus usb@2184200 for devices... 2 USB Device(s) found
> >>>>        scanning usb for storage devices... 1 Storage Device(s) found
> >>>> EHCI timed out on TD - token=0x1f8c80
> >>>
> >>> This is how the error gets evident. The detailed explanation is in link
> >>> [1].
> >>
> >> Is there one specific post in that forum, or do I need to read through
> >> the whole multi-page thread ?
> >>
> >> I would expect that if you run -- I assume 'usb reset' (the command is
> >> missing above) -- then the bus gets power-cycled, hence the USB device
> >> also gets power-cycled.
> > 
> > FWIW, the problem I have does not get fixed via "usb reset".  Ethernet
> > never shows up again.  I do see right now that I disabled network tests
> > on the platform a long while ago for unknown reasons, but did comment to
> > myself elsewhere that network was fairly flaky in U-Boot.
> 
> There certainly is something with USB ethernet in U-Boot. I noticed the
> flakiness myself, but didn't have time to investigate. Patches welcome I
> guess ?

Looking forward to testing them!
Lukasz Majewski March 2, 2020, 11:25 p.m. UTC | #19
Hi Marek,

> On 3/2/20 2:21 PM, Lukasz Majewski wrote:
> > Hi,  
> 
> Hi,
> 
> >> This patch aims to improve robustness of 'usb' command operation on
> >> the ehci-hcd IP block as it ports to contemporary U-Boot the patch
> >> described and provided in [1] (originally applicable to U-Boot
> >> 2016.05).
> >>
> >> According to the fix author - "rayvt" (from [1]):
> >>
> >> TD stands for "Queue Element Transfer Descriptor (qTD)", which is a
> >> micro-code instruction for the EHCI USB chip.
> >> The "token" field is detailed information, control, and status of
> >> the TD's data transfer. In particular, the rightmost byte is the
> >> status field. 0x80 bit means the TD is active and the data
> >> transfer has not yet completed. 0x08 bit means there was some sort
> >> of data transfer error (XactErr).  
> > 	 ^^^^^^^ - this error is NOT handled now in mainline
> > u-boot. Do you have any experience with it?  
> 
> No, but then I also didn't run into those yet.

Ok.

> 
> >> If the drive gets a transfer error, it refuses to do any other I/O
> >> transfer until the error is properly cleared and reset. U-boot did
> >> not do this, so every subsequent disk read would get a timeout
> >> error because the drive would not send any data. The fix is to
> >> properly clear and reset the USB disk when it gets a transfer
> >> error.  
> 
> So basically, what this says is, "we crashed the drive, because the
> drive suffered counter overflow, thus we fix it by resetting the
> drive". 

The transfer error may be also caused by some timeout from internal
controller when the USB disk is worn out. Counter overflow may not be
the only reason.

> No, the fix is to not crash the drive in the first place.

From 2019.0{14} a lot was done to make "usb" command more robust, but
still some mishaps happen.

> 
> >> Every read operation starts at the maximum block size. When the USB
> >> pendrive is not able to correctly serve this data read request, the
> >> dynamic reduction of IO size is performed. Up to six tries (with
> >> smaller IO block each time) are attempted.
> >>
> >> A related problem is that some drives are slow to come up. Linux
> >> handles this by issuing a spinup command and allowing more time for
> >> the drive to respond. The same idea is applied in this fix.
> >>
> >> On TPC70 (i.MX6Q) once per ~10 times (without this patch):
> >>
> >> Bus usb@2184200: USB EHCI 1.00
> >> scanning bus usb@2184200 for devices... 2 USB Device(s) found
> >>        scanning usb for storage devices... 1 Storage Device(s)
> >> found EHCI timed out on TD - token=0x1f8c80  
> > 
> > This is how the error gets evident. The detailed explanation is in
> > link [1].  
> 
> Is there one specific post in that forum, or do I need to read through
> the whole multi-page thread ?

There is a thread pointed out by the below link which explains the
issue thoroughly:
https://forum.doozan.com/read.php?3,35295,35295#msg-35295

(The first 2 posts from 'rayvt' explains the problem).

> 
> I would expect that if you run -- I assume 'usb reset' (the command is
> missing above) -- then the bus gets power-cycled, hence the USB device
> also gets power-cycled.

Unfortunately, the 'usb reset' is not fixing the issue. The controller
is unresponsive and the only way for recovering is a power cycle.

> 
> >> Performance impact:
> >>
> >> With U-Boot -master in mainline:
> >> 16869440 bytes read in 979 ms (16.4 MiB/s)
> >>
> >> With this patch:
> >> 16869440 bytes read in 1219 ms (13.2 MiB/s)
> >>
> >> Links:
> >> [1] - https://forum.doozan.com/read.php?3,35295,35295#msg-35295
> >> [2] -
> >> https://www.dropbox.com/s/nrkrd1no63viuu8/uboot-bodhi-2016.05-timeoutTD.patch?dl=0
> >>
> >> Signed-off-by: Lukasz Majewski <lukma@denx.de>
> >> [Unfortunately, the original patch [2] did not contain S-o-B from
> >> the original author - "rayvt"]
> >> ---
> >>
> >>  common/usb.c                |  10 +++-
> >>  common/usb_storage.c        | 106
> >> +++++++++++++++++++++++++++++++++--- drivers/usb/host/ehci-hcd.c |
> >> 55 +++++++++++++++++-- include/usb.h               |   1 +
> >>  include/usb_defs.h          |   1 +
> >>  5 files changed, 160 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/common/usb.c b/common/usb.c
> >> index 349e838f1d..305482b5bb 100644
> >> --- a/common/usb.c
> >> +++ b/common/usb.c
> >> @@ -925,14 +925,20 @@ static int get_descriptor_len(struct
> >> usb_device *dev, int len, int expect_len) __maybe_unused struct
> >> usb_device_descriptor *desc; ALLOC_CACHE_ALIGN_BUFFER(unsigned
> >> char, tmpbuf, USB_BUFSIZ); int err;
> >> +	int retry = 5;
> >>  
> >>  	desc = (struct usb_device_descriptor *)tmpbuf;
> >>  
> >> +again:
> >>  	err = usb_get_descriptor(dev, USB_DT_DEVICE, 0, desc,
> >> len); if (err < expect_len) {
> >>  		if (err < 0) {
> >> -			printf("unable to get device descriptor
> >> (error=%d)\n",
> >> -				err);
> >> +			printf("unable to get device descriptor
> >> (error=%d) retry: %d\n",
> >> +			       err, retry);
> >> +			mdelay(50);
> >> +			if (--retry >= 0)
> >> +				/* Some drives are just slow to
> >> wake up. */
> >> +				goto again;  
> > 
> > This I think is just to "wait" for the inserted device if it is
> > rather slow (or probably its memory got degraded over time).
> > 
> > This shouldn't bring any transfer slow down.  
> 
> Then this could be a separate patch ?

Yes. I think so.

> 
> >>  			return err;
> >>  		} else {
> >>  			printf("USB device descriptor short read
> >> (expected %i, got %i)\n", diff --git a/common/usb_storage.c
> >> b/common/usb_storage.c index 097b6729c1..48c8c2ae64 100644
> >> --- a/common/usb_storage.c
> >> +++ b/common/usb_storage.c
> >> @@ -111,6 +111,18 @@ int usb_stor_get_info(struct usb_device *dev,
> >> struct us_data *us, struct blk_desc *dev_desc);
> >>  int usb_storage_probe(struct usb_device *dev, unsigned int ifnum,
> >>  		      struct us_data *ss);
> >> +
> >> +#ifdef CONFIG_USB_EHCI_HCD
> >> +	/*
> >> +	 * The U-Boot EHCI driver can handle any transfer length
> >> as long as
> >> +	 * there is enough free heap space left, but the SCSI
> >> READ(10) and
> >> +	 * WRITE(10) commands are limited to 65535 blocks.
> >> +	 */
> >> +int usb_max_xfer_blk = 4096;  
> > 
> > Setting it to magic value - 4096 may be indeed problematic.
> > However, I don't know what is the 'usual' transfer size.  
> 
> 240 blocks , see 7d6fd7f0ba71cd93d94079132f958d9630f27a89 .

Ok.

> 
> >> +#else
> >> +int usb_max_xfer_blk = 20;  
> > 
> > This is also a magic value - probably took from some old U-Boot
> > (when default transfer was set to this value).
> > 
> > I do guess that this snippet shall be removed from this patch set.  
> 
> Probably.

Ok.

> 
> >> +#endif
> >> +
> >>  #if CONFIG_IS_ENABLED(BLK)
> >>  static unsigned long usb_stor_read(struct udevice *dev, lbaint_t
> >> blknr, lbaint_t blkcnt, void *buffer);
> >> @@ -729,6 +741,7 @@ static int usb_stor_BBB_transport(struct
> >> scsi_cmd *srb, struct us_data *us) pipeout =
> >> usb_sndbulkpipe(us->pusb_dev, us->ep_out); /* DATA phase + error
> >> handling */ data_actlen = 0;
> >> +	mdelay(10);		/* Like linux does. */
> >>  	/* no data, go immediately to the STATUS phase */
> >>  	if (srb->datalen == 0)
> >>  		goto st;
> >> @@ -740,6 +753,16 @@ static int usb_stor_BBB_transport(struct
> >> scsi_cmd *srb, struct us_data *us) 
> >>  	result = usb_bulk_msg(us->pusb_dev, pipe, srb->pdata,
> >> srb->datalen, &data_actlen, USB_CNTL_TIMEOUT * 5);
> >> +
> >> +	/* special handling of XACTERR in DATA phase */
> >> +	if (result < 0 && (us->pusb_dev->status &
> >> USB_ST_XACTERR)) {
> >> +		debug("XACTERR in data phase - clr, reset, and
> >> return fail.\n");
> >> +		usb_stor_BBB_clear_endpt_stall(us, dir_in ?
> >> +					       us->ep_in :
> >> us->ep_out);
> >> +		usb_stor_BBB_reset(us);
> >> +		return USB_STOR_TRANSPORT_FAILED;
> >> +	}
> >> +  
> > 
> > Here the case of XACTERR is handled. This is missing now in
> > -master.  
> 
> Surely this could be added as a separate patch ?

Ok.

> 
> >>  	/* special handling of STALL in DATA phase */
> >>  	if ((result < 0) && (us->pusb_dev->status &
> >> USB_ST_STALLED)) { debug("DATA:stall\n");
> >> @@ -1013,9 +1036,32 @@ static int usb_request_sense(struct scsi_cmd
> >> *srb, struct us_data *ss) return 0;
> >>  }
> >>  
> >> +/*
> >> + * This spins up the disk and also consumes the time that the
> >> + * disk takes to become active and ready to read data.
> >> + * Some drives (like Western Digital) can take more than 5
> >> seconds.
> >> + * The delay occurs on the 1st data read from the disk.
> >> + * Extending the timeout here works better than handling the
> >> timeout
> >> + * as an error on a "real" read.
> >> + */
> >> +static int usb_spinup(struct scsi_cmd *srb, struct us_data *ss)
> >> +{
> >> +	memset(&srb->cmd[0], 0, 12);
> >> +	srb->cmd[0] = SCSI_START_STP;
> >> +	srb->cmd[1] = srb->lun << 5;
> >> +	srb->cmd[4] = 1;	/* Start spinup. */
> >> +	srb->datalen = 0;
> >> +	srb->cmdlen = 6;
> >> +	ss->pusb_dev->extra_timeout = 9876;
> >> +	ss->transport(srb, ss);
> >> +	ss->pusb_dev->extra_timeout = 0;
> >> +	return 0;
> >> +}
> >> +
> >>  static int usb_test_unit_ready(struct scsi_cmd *srb, struct
> >> us_data *ss) {
> >>  	int retries = 10;
> >> +	int gave_extra_time = 0;
> >>  
> >>  	do {
> >>  		memset(&srb->cmd[0], 0, 12);
> >> @@ -1038,6 +1084,17 @@ static int usb_test_unit_ready(struct
> >> scsi_cmd *srb, struct us_data *ss) if ((srb->sense_buf[2] == 0x02)
> >> && (srb->sense_buf[12] == 0x3a))
> >>  			return -1;
> >> +		/*
> >> +		 * If the status is "Not Ready - becoming ready",
> >> give it
> >> +		 * more time.  Linux issues a spinup command
> >> (once) and gives
> >> +		 * it 100 seconds.
> >> +		 */
> >> +		if (srb->sense_buf[2] == 0x02 &&
> >> +		    srb->sense_buf[12] == 0x04 &&
> >> +		    gave_extra_time == 0) {
> >> +			retries = 100; /* Allow 10 seconds. */
> >> +			gave_extra_time = retries;
> >> +		}  
> > 
> > The above code looks like targeting the HDDs, but some degraded pen
> > drives may need this time as well for proper operation.  
> 
> Hard drive spin up handling. This could be a separate patch too.
> 

Ok.

> >>  		mdelay(100);
> >>  	} while (retries--);
> >>  
> >> @@ -1166,22 +1223,57 @@ static unsigned long usb_stor_read(struct
> >> blk_desc *block_dev, lbaint_t blknr, block_dev->devnum, start,
> >> blks, buf_addr); 
> >>  	do {
> >> -		/* XXX need some comment here */
> >> +		/*
> >> +		 * Probably most errors are USB errors, not hard
> >> disk error.
> >> +		 * Many disks use a USB chip that is flaky when
> >> doing large
> >> +		 * transfers. The workaround is to dynamically
> >> reduce the
> >> +		 * transfer size and allow an additional try.
> >> +		 * This should pick up flaky disks. Linux uses a
> >> quirks table.
> >> +		 * We'll use observation:
> >> +		 * Give it 1 try very large, 1 try large, 2 tries
> >> medium and 2
> >> +		 * tries small(ish).
> >> +		 * On a solid fail (actual disk error - which
> >> should be rare),
> >> +		 * this will give us 6 tries max, and only that
> >> many if the read
> >> +		 * is quite large.
> >> +		 * A fail on a very short read obviously doesn't
> >> have a
> >> +		 * too-large max_blks.  Timeout due to spinup
> >> being a case in
> >> +		 * this point.
> >> +		 */
> >>  		retry = 2;
> >>  		srb->pdata = (unsigned char *)buf_addr;
> >> -		if (blks > ss->max_xfer_blk)
> >> -			smallblks = ss->max_xfer_blk;
> >> +retry_it:
> >> +		if (blks > usb_max_xfer_blk)
> >> +			smallblks = usb_max_xfer_blk;
> >>  		else
> >>  			smallblks = (unsigned short) blks;
> >> -retry_it:
> >> -		if (smallblks == ss->max_xfer_blk)
> >> +
> >> +		if (smallblks == usb_max_xfer_blk)
> >>  			usb_show_progress();
> >> +
> >>  		srb->datalen = block_dev->blksz * smallblks;
> >>  		srb->pdata = (unsigned char *)buf_addr;
> >>  		if (usb_read_10(srb, ss, start, smallblks)) {
> >>  			debug("Read ERROR\n");
> >>  			ss->flags &= ~USB_READY;
> >>  			usb_request_sense(srb, ss);
> >> +
> >> +			/* Dynamically reduce the I/O size. */
> >> +			if (smallblks > 2047) {
> >> +				usb_max_xfer_blk = 2047;
> >> +				++retry;
> >> +			} else if (smallblks > 512) {
> >> +				usb_max_xfer_blk = 512;
> >> +				++retry;
> >> +			} else if (smallblks > 511) {
> >> +				usb_max_xfer_blk = 511;
> >> +				++retry;
> >> +			} else if (smallblks > 63) {
> >> +				usb_max_xfer_blk = 63;
> >> +				retry += 2;
> >> +			}
> >> +			debug("step down usb_max_xfer_blk to
> >> %d\n",
> >> +			      usb_max_xfer_blk);
> >> +  
> > 
> > This shall not be needed, as the ehci controller's driver will
> > scatter the transfers up till 240 blocks max. Am I right?
> > 
> > The above code was supposed to work before the 240 block limitation
> > was in place?  
> 
> Correct, this code is wrong.

Ok.

> 
> [...]
> 
> >> +	qhtoken = hc32_to_cpu(qh->qh_overlay.qt_token);
> >>  	if (!(QT_TOKEN_GET_STATUS(qhtoken) &
> >> QT_TOKEN_STATUS_ACTIVE)) { debug("TOKEN=%#x\n", qhtoken);
> >> +		if (qhtoken & QT_TOKEN_STATUS_XACTERR) {  
> > 			      ^^^^^^^^^^^^^^^^^^^^^^^^ - this flag
> > 			      before this patch was also not
> > checked. 
> >> +			if (--trynum >= 0) {
> >> +				/*
> >> +				 * It is necessary to do this,
> >> otherwise the
> >> +				 * disk is clagged.
> >> +				 */
> >> +				debug("reset the TD and redo,
> >> because of XACTERR\n");
> >> +				qhtoken &=
> >> ~QT_TOKEN_STATUS_HALTED;
> >> +				qhtoken |= QT_TOKEN_STATUS_ACTIVE
> >> |
> >> +					QT_TOKEN_CERR(2);
> >> +				vtd->qt_token =
> >> cpu_to_hc32(qhtoken);
> >> +				qh->qh_overlay.qt_token =
> >> cpu_to_hc32(qhtoken);
> >> +				goto retry_xacterr;
> >> +			}
> >> +			dev->status = USB_ST_XACTERR;
> >> +			dev->act_len = length -
> >> +				QT_TOKEN_GET_TOTALBYTES(qhtoken);
> >>  
> > 
> > This may solve the issue - as it resets the controller in the case
> > of transfer error.  
> 
> Do you need to reset the controller ? Or is there some graceful way
> out, i.e. a way to recover from the error ?

The original patch was developed when the async transmission was start
and stopped for each transfer (before your optimization), hence the code
for disabling it.

I need to check if it would be enough to just clear the
QT_TOKEN_STATUS_HALTED and set QT_TOKEN_STATUS_ACTIVE again.

> [...]




Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@denx.de
Marek Vasut March 3, 2020, 12:29 p.m. UTC | #20
On 3/3/20 12:25 AM, Lukasz Majewski wrote:
> Hi Marek,

Hi,

[...]

>>>> Every read operation starts at the maximum block size. When the USB
>>>> pendrive is not able to correctly serve this data read request, the
>>>> dynamic reduction of IO size is performed. Up to six tries (with
>>>> smaller IO block each time) are attempted.
>>>>
>>>> A related problem is that some drives are slow to come up. Linux
>>>> handles this by issuing a spinup command and allowing more time for
>>>> the drive to respond. The same idea is applied in this fix.
>>>>
>>>> On TPC70 (i.MX6Q) once per ~10 times (without this patch):
>>>>
>>>> Bus usb@2184200: USB EHCI 1.00
>>>> scanning bus usb@2184200 for devices... 2 USB Device(s) found
>>>>        scanning usb for storage devices... 1 Storage Device(s)
>>>> found EHCI timed out on TD - token=0x1f8c80  
>>>
>>> This is how the error gets evident. The detailed explanation is in
>>> link [1].  
>>
>> Is there one specific post in that forum, or do I need to read through
>> the whole multi-page thread ?
> 
> There is a thread pointed out by the below link which explains the
> issue thoroughly:
> https://forum.doozan.com/read.php?3,35295,35295#msg-35295
> 
> (The first 2 posts from 'rayvt' explains the problem).

So basically a drive spin-up problem and the problem with stick counter
overflow.

btw QH/qTD are not microcode instructions, they are just plain DMA
descriptors. And the USB controller in Marvell Kirkwood/Orion SoCs is
on-SoC, it's not a discrete chip.

>> I would expect that if you run -- I assume 'usb reset' (the command is
>> missing above) -- then the bus gets power-cycled, hence the USB device
>> also gets power-cycled.
> 
> Unfortunately, the 'usb reset' is not fixing the issue. The controller
> is unresponsive and the only way for recovering is a power cycle.

My guess would be it might be related to the async schedule, so that's
where I would start debugging.

[...]

>>>> +	qhtoken = hc32_to_cpu(qh->qh_overlay.qt_token);
>>>>  	if (!(QT_TOKEN_GET_STATUS(qhtoken) &
>>>> QT_TOKEN_STATUS_ACTIVE)) { debug("TOKEN=%#x\n", qhtoken);
>>>> +		if (qhtoken & QT_TOKEN_STATUS_XACTERR) {  
>>> 			      ^^^^^^^^^^^^^^^^^^^^^^^^ - this flag
>>> 			      before this patch was also not
>>> checked. 
>>>> +			if (--trynum >= 0) {
>>>> +				/*
>>>> +				 * It is necessary to do this,
>>>> otherwise the
>>>> +				 * disk is clagged.
>>>> +				 */
>>>> +				debug("reset the TD and redo,
>>>> because of XACTERR\n");
>>>> +				qhtoken &=
>>>> ~QT_TOKEN_STATUS_HALTED;
>>>> +				qhtoken |= QT_TOKEN_STATUS_ACTIVE
>>>> |
>>>> +					QT_TOKEN_CERR(2);
>>>> +				vtd->qt_token =
>>>> cpu_to_hc32(qhtoken);
>>>> +				qh->qh_overlay.qt_token =
>>>> cpu_to_hc32(qhtoken);
>>>> +				goto retry_xacterr;
>>>> +			}
>>>> +			dev->status = USB_ST_XACTERR;
>>>> +			dev->act_len = length -
>>>> +				QT_TOKEN_GET_TOTALBYTES(qhtoken);
>>>>  
>>>
>>> This may solve the issue - as it resets the controller in the case
>>> of transfer error.  
>>
>> Do you need to reset the controller ? Or is there some graceful way
>> out, i.e. a way to recover from the error ?
> 
> The original patch was developed when the async transmission was start
> and stopped for each transfer (before your optimization), hence the code
> for disabling it.
> 
> I need to check if it would be enough to just clear the
> QT_TOKEN_STATUS_HALTED and set QT_TOKEN_STATUS_ACTIVE again.

OK
Marek Vasut March 14, 2020, 6:16 p.m. UTC | #21
On 3/2/20 6:00 PM, Tom Rini wrote:
> On Mon, Mar 02, 2020 at 01:39:49AM +0100, Marek Vasut wrote:
>> On 3/2/20 12:04 AM, Tom Rini wrote:
>> [...]
>>
>>>>> 3 USB Device(s) found
>>>>>        scanning usb for ethernet devices... 0 Ethernet Device(s) found
>>>>> Hit any key to stop autoboot:  2  0 
>>>>> BeagleBoard # usb tree
>>>>> USB device tree:
>>>>>   1  Hub (480 Mb/s, 0mA)
>>>>>   |  u-boot EHCI Host Controller 
>>>>>   |
>>>>>   |+-2  Hub (480 Mb/s, 2mA)
>>>>>     |
>>>>>     |+-3  See Interface (480 Mb/s, 0mA)
>>>>>          ??????????? ??????????? ???????????
>>>>>        
>>>>> BeagleBoard # 
>>>>>
>>>>> Note that the hub and ethernet are on-SBC and not something I'm plugging
>>>>> in.  Thanks!
>>>>
>>>> The device #3 is a usb mass storage or what is it ?
>>>
>>> It's a usb ethernet device.
>>>
>>>> Can you try and implement usb_get_max_xfer_size for musb and make it
>>>> report 240*512 unconditionally (*size = 240*512; return 0;) ? I think
>>>> that would "fix" it for you on omap too.
>>>
>>> I'll pencil in some time to try that, thanks.
>>
>> That's not gonna help you with USB ethernet. I recall seeing flakiness
>> with asix devices, maybe that's what you're running into?
>>
>> Do you have any further details on that device ? lsusb -vvv would help.
> 
> I can only give:
> 
> BeagleBoard # usb info
> 1: Hub,  USB Revision 2.0
>  - u-boot EHCI Host Controller
>  - Class: Hub
>  - PacketSize: 64  Configurations: 1
>  - Vendor: 0x0000  Product 0x0000 Version 1.0
>    Configuration: 1
>    - Interfaces: 1 Self Powered 0mA
>      Interface: 0
>      - Alternate Setting 0, Endpoints: 1
>      - Class Hub
>      - Endpoint 1 In Interrupt MaxPacket 8 Interval 255ms
> 
> 2: Hub,  USB Revision 2.0
>  - Class: Hub
>  - PacketSize: 64  Configurations: 1
>  - Vendor: 0x0424  Product 0x9514 Version 2.0
>    Configuration: 1
>    - Interfaces: 1 Self Powered Remote Wakeup 2mA
>      Interface: 0
>      - Alternate Setting 0, Endpoints: 1
>      - Class Hub
>      - Endpoint 1 In Interrupt MaxPacket 1 Interval 12ms
>      - Endpoint 1 In Interrupt MaxPacket 1 Interval 12ms
> 
> 3: Vendor specific,  USB Revision 2.0
>  - Class: Vendor specific
>  - PacketSize: 64  Configurations: 1
>  - Vendor: 0x0424  Product 0xec00 Version 2.0
>    Configuration: 1
>    - Interfaces: 1 Self Powered Remote Wakeup 2mA
>      Interface: 0
>      - Alternate Setting 0, Endpoints: 3
>      - Class Vendor specific
>      - Endpoint 1 In Bulk MaxPacket 512
>      - Endpoint 2 Out Bulk MaxPacket 512
>      - Endpoint 3 In Interrupt MaxPacket 16 Interval 4ms
> 
> As the old Fedora core image on the SD card there isn't happy and I
> don't have time to dig in to that problem right now.  But please note
> that the issue here is a relatively recent regression as Guillaume
> reported.

Link would be helpful.

I tried this on R-Car3 EHCI via USB 2.0 hub and I don't see any issue
with detecting either ASIX 0b95:7720 Zoltan Tech ZoWii , asix_usb ,
ethernet OR SMSC 95xx 0454:9e00 ATMES oabrDongle . So I suspect this
must be some limitation of the EHCI on the AM335x somehow ?

Or does it also happen on other SoCs and/or other USB controllers and/or
USB ethernet adapters? I need a way to replicate the issue if I should
debug it.
diff mbox series

Patch

diff --git a/common/usb.c b/common/usb.c
index 349e838f1d..305482b5bb 100644
--- a/common/usb.c
+++ b/common/usb.c
@@ -925,14 +925,20 @@  static int get_descriptor_len(struct usb_device *dev, int len, int expect_len)
 	__maybe_unused struct usb_device_descriptor *desc;
 	ALLOC_CACHE_ALIGN_BUFFER(unsigned char, tmpbuf, USB_BUFSIZ);
 	int err;
+	int retry = 5;
 
 	desc = (struct usb_device_descriptor *)tmpbuf;
 
+again:
 	err = usb_get_descriptor(dev, USB_DT_DEVICE, 0, desc, len);
 	if (err < expect_len) {
 		if (err < 0) {
-			printf("unable to get device descriptor (error=%d)\n",
-				err);
+			printf("unable to get device descriptor (error=%d) retry: %d\n",
+			       err, retry);
+			mdelay(50);
+			if (--retry >= 0)
+				/* Some drives are just slow to wake up. */
+				goto again;
 			return err;
 		} else {
 			printf("USB device descriptor short read (expected %i, got %i)\n",
diff --git a/common/usb_storage.c b/common/usb_storage.c
index 097b6729c1..48c8c2ae64 100644
--- a/common/usb_storage.c
+++ b/common/usb_storage.c
@@ -111,6 +111,18 @@  int usb_stor_get_info(struct usb_device *dev, struct us_data *us,
 		      struct blk_desc *dev_desc);
 int usb_storage_probe(struct usb_device *dev, unsigned int ifnum,
 		      struct us_data *ss);
+
+#ifdef CONFIG_USB_EHCI_HCD
+	/*
+	 * The U-Boot EHCI driver can handle any transfer length as long as
+	 * there is enough free heap space left, but the SCSI READ(10) and
+	 * WRITE(10) commands are limited to 65535 blocks.
+	 */
+int usb_max_xfer_blk = 4096;
+#else
+int usb_max_xfer_blk = 20;
+#endif
+
 #if CONFIG_IS_ENABLED(BLK)
 static unsigned long usb_stor_read(struct udevice *dev, lbaint_t blknr,
 				   lbaint_t blkcnt, void *buffer);
@@ -729,6 +741,7 @@  static int usb_stor_BBB_transport(struct scsi_cmd *srb, struct us_data *us)
 	pipeout = usb_sndbulkpipe(us->pusb_dev, us->ep_out);
 	/* DATA phase + error handling */
 	data_actlen = 0;
+	mdelay(10);		/* Like linux does. */
 	/* no data, go immediately to the STATUS phase */
 	if (srb->datalen == 0)
 		goto st;
@@ -740,6 +753,16 @@  static int usb_stor_BBB_transport(struct scsi_cmd *srb, struct us_data *us)
 
 	result = usb_bulk_msg(us->pusb_dev, pipe, srb->pdata, srb->datalen,
 			      &data_actlen, USB_CNTL_TIMEOUT * 5);
+
+	/* special handling of XACTERR in DATA phase */
+	if (result < 0 && (us->pusb_dev->status & USB_ST_XACTERR)) {
+		debug("XACTERR in data phase - clr, reset, and return fail.\n");
+		usb_stor_BBB_clear_endpt_stall(us, dir_in ?
+					       us->ep_in : us->ep_out);
+		usb_stor_BBB_reset(us);
+		return USB_STOR_TRANSPORT_FAILED;
+	}
+
 	/* special handling of STALL in DATA phase */
 	if ((result < 0) && (us->pusb_dev->status & USB_ST_STALLED)) {
 		debug("DATA:stall\n");
@@ -1013,9 +1036,32 @@  static int usb_request_sense(struct scsi_cmd *srb, struct us_data *ss)
 	return 0;
 }
 
+/*
+ * This spins up the disk and also consumes the time that the
+ * disk takes to become active and ready to read data.
+ * Some drives (like Western Digital) can take more than 5 seconds.
+ * The delay occurs on the 1st data read from the disk.
+ * Extending the timeout here works better than handling the timeout
+ * as an error on a "real" read.
+ */
+static int usb_spinup(struct scsi_cmd *srb, struct us_data *ss)
+{
+	memset(&srb->cmd[0], 0, 12);
+	srb->cmd[0] = SCSI_START_STP;
+	srb->cmd[1] = srb->lun << 5;
+	srb->cmd[4] = 1;	/* Start spinup. */
+	srb->datalen = 0;
+	srb->cmdlen = 6;
+	ss->pusb_dev->extra_timeout = 9876;
+	ss->transport(srb, ss);
+	ss->pusb_dev->extra_timeout = 0;
+	return 0;
+}
+
 static int usb_test_unit_ready(struct scsi_cmd *srb, struct us_data *ss)
 {
 	int retries = 10;
+	int gave_extra_time = 0;
 
 	do {
 		memset(&srb->cmd[0], 0, 12);
@@ -1038,6 +1084,17 @@  static int usb_test_unit_ready(struct scsi_cmd *srb, struct us_data *ss)
 		if ((srb->sense_buf[2] == 0x02) &&
 		    (srb->sense_buf[12] == 0x3a))
 			return -1;
+		/*
+		 * If the status is "Not Ready - becoming ready", give it
+		 * more time.  Linux issues a spinup command (once) and gives
+		 * it 100 seconds.
+		 */
+		if (srb->sense_buf[2] == 0x02 &&
+		    srb->sense_buf[12] == 0x04 &&
+		    gave_extra_time == 0) {
+			retries = 100; /* Allow 10 seconds. */
+			gave_extra_time = retries;
+		}
 		mdelay(100);
 	} while (retries--);
 
@@ -1166,22 +1223,57 @@  static unsigned long usb_stor_read(struct blk_desc *block_dev, lbaint_t blknr,
 	      block_dev->devnum, start, blks, buf_addr);
 
 	do {
-		/* XXX need some comment here */
+		/*
+		 * Probably most errors are USB errors, not hard disk error.
+		 * Many disks use a USB chip that is flaky when doing large
+		 * transfers. The workaround is to dynamically reduce the
+		 * transfer size and allow an additional try.
+		 * This should pick up flaky disks. Linux uses a quirks table.
+		 * We'll use observation:
+		 * Give it 1 try very large, 1 try large, 2 tries medium and 2
+		 * tries small(ish).
+		 * On a solid fail (actual disk error - which should be rare),
+		 * this will give us 6 tries max, and only that many if the read
+		 * is quite large.
+		 * A fail on a very short read obviously doesn't have a
+		 * too-large max_blks.  Timeout due to spinup being a case in
+		 * this point.
+		 */
 		retry = 2;
 		srb->pdata = (unsigned char *)buf_addr;
-		if (blks > ss->max_xfer_blk)
-			smallblks = ss->max_xfer_blk;
+retry_it:
+		if (blks > usb_max_xfer_blk)
+			smallblks = usb_max_xfer_blk;
 		else
 			smallblks = (unsigned short) blks;
-retry_it:
-		if (smallblks == ss->max_xfer_blk)
+
+		if (smallblks == usb_max_xfer_blk)
 			usb_show_progress();
+
 		srb->datalen = block_dev->blksz * smallblks;
 		srb->pdata = (unsigned char *)buf_addr;
 		if (usb_read_10(srb, ss, start, smallblks)) {
 			debug("Read ERROR\n");
 			ss->flags &= ~USB_READY;
 			usb_request_sense(srb, ss);
+
+			/* Dynamically reduce the I/O size. */
+			if (smallblks > 2047) {
+				usb_max_xfer_blk = 2047;
+				++retry;
+			} else if (smallblks > 512) {
+				usb_max_xfer_blk = 512;
+				++retry;
+			} else if (smallblks > 511) {
+				usb_max_xfer_blk = 511;
+				++retry;
+			} else if (smallblks > 63) {
+				usb_max_xfer_blk = 63;
+				retry += 2;
+			}
+			debug("step down usb_max_xfer_blk to %d\n",
+			      usb_max_xfer_blk);
+
 			if (retry--)
 				goto retry_it;
 			blkcnt -= blks;
@@ -1196,8 +1288,7 @@  retry_it:
 	      start, smallblks, buf_addr);
 
 	usb_disable_asynch(0); /* asynch transfer allowed */
-	if (blkcnt >= ss->max_xfer_blk)
-		debug("\n");
+
 	return blkcnt;
 }
 
@@ -1492,6 +1583,7 @@  int usb_stor_get_info(struct usb_device *dev, struct us_data *ss,
 	dev_desc->log2blksz = LOG2(dev_desc->blksz);
 	dev_desc->type = perq;
 	debug(" address %d\n", dev_desc->target);
+	usb_spinup(pccb, ss);
 
 	return 1;
 }
diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c
index 1cc02052f5..c4041f1e9c 100644
--- a/drivers/usb/host/ehci-hcd.c
+++ b/drivers/usb/host/ehci-hcd.c
@@ -315,7 +315,9 @@  ehci_submit_async(struct usb_device *dev, unsigned long pipe, void *buffer,
 	int timeout;
 	int ret = 0;
 	struct ehci_ctrl *ctrl = ehci_get_ctrl(dev);
+	int trynum;
 
+	debug("\n***** %s ****\n", __func__);
 	debug("dev=%p, pipe=%lx, buffer=%p, length=%d, req=%p\n", dev, pipe,
 	      buffer, length, req);
 	if (req != NULL)
@@ -557,6 +559,15 @@  ehci_submit_async(struct usb_device *dev, unsigned long pipe, void *buffer,
 	ehci_writel(&ctrl->hcor->or_usbsts, (usbsts & 0x3f));
 
 	/* Enable async. schedule. */
+	trynum = 1;	/* No more than 2 tries, in case of XACTERR. */
+	/* When the 1st try gets xacterr,
+	 * 2nd try gets xacterr and often babble and/or halted.
+	 * 3rd try times out.
+	 * After the 2nd try, the disk has recovered, so we need to clear and
+	 * reset the USB port, then return fail so the upper layer can retry.
+	 */
+retry_xacterr:
+	vtd = &qtd[qtd_counter - 1];
 	cmd = ehci_readl(&ctrl->hcor->or_usbcmd);
 	if (!(cmd & CMD_ASE)) {
 		cmd |= CMD_ASE;
@@ -572,8 +583,9 @@  ehci_submit_async(struct usb_device *dev, unsigned long pipe, void *buffer,
 
 	/* Wait for TDs to be processed. */
 	ts = get_timer(0);
-	vtd = &qtd[qtd_counter - 1];
+
 	timeout = USB_TIMEOUT_MS(pipe);
+	timeout += dev->extra_timeout;
 	do {
 		/* Invalidate dcache */
 		invalidate_dcache_range((unsigned long)&ctrl->qh_list,
@@ -588,9 +600,10 @@  ehci_submit_async(struct usb_device *dev, unsigned long pipe, void *buffer,
 			break;
 		WATCHDOG_RESET();
 	} while (get_timer(ts) < timeout);
-	qhtoken = hc32_to_cpu(qh->qh_overlay.qt_token);
 
-	ctrl->qh_list.qh_link = cpu_to_hc32(virt_to_phys(&ctrl->qh_list) | QH_LINK_TYPE_QH);
+	debug("took %4lu ms of %4d\n", get_timer(ts), timeout);
+	ctrl->qh_list.qh_link =
+		cpu_to_hc32(virt_to_phys(&ctrl->qh_list) | QH_LINK_TYPE_QH);
 	flush_dcache_range((unsigned long)&ctrl->qh_list,
 		ALIGN_END_ADDR(struct QH, &ctrl->qh_list, 1));
 
@@ -605,14 +618,48 @@  ehci_submit_async(struct usb_device *dev, unsigned long pipe, void *buffer,
 	 */
 	if (buffer != NULL && length > 0)
 		invalidate_dcache_range((unsigned long)buffer,
-			ALIGN((unsigned long)buffer + length, ARCH_DMA_MINALIGN));
+			ALIGN((unsigned long)buffer + length,
+			      ARCH_DMA_MINALIGN));
 
 	/* Check that the TD processing happened */
 	if (QT_TOKEN_GET_STATUS(token) & QT_TOKEN_STATUS_ACTIVE)
 		printf("EHCI timed out on TD - token=%#x\n", token);
 
+	/* Disable async schedule. */
+	cmd = ehci_readl(&ctrl->hcor->or_usbcmd);
+	cmd &= ~CMD_ASE;
+	ehci_writel(&ctrl->hcor->or_usbcmd, cmd);
+
+	ret = handshake((uint32_t *)&ctrl->hcor->or_usbsts, STS_ASS, 0,
+			100 * 1000);
+	if (ret < 0) {
+		printf("EHCI fail timeout STS_ASS reset\n");
+		goto fail;
+	}
+
+	qhtoken = hc32_to_cpu(qh->qh_overlay.qt_token);
 	if (!(QT_TOKEN_GET_STATUS(qhtoken) & QT_TOKEN_STATUS_ACTIVE)) {
 		debug("TOKEN=%#x\n", qhtoken);
+		if (qhtoken & QT_TOKEN_STATUS_XACTERR) {
+			if (--trynum >= 0) {
+				/*
+				 * It is necessary to do this, otherwise the
+				 * disk is clagged.
+				 */
+				debug("reset the TD and redo, because of XACTERR\n");
+				qhtoken &= ~QT_TOKEN_STATUS_HALTED;
+				qhtoken |= QT_TOKEN_STATUS_ACTIVE |
+					QT_TOKEN_CERR(2);
+				vtd->qt_token = cpu_to_hc32(qhtoken);
+				qh->qh_overlay.qt_token = cpu_to_hc32(qhtoken);
+				goto retry_xacterr;
+			}
+			dev->status = USB_ST_XACTERR;
+			dev->act_len = length -
+				QT_TOKEN_GET_TOTALBYTES(qhtoken);
+			goto fail;
+		}
+
 		switch (QT_TOKEN_GET_STATUS(qhtoken) &
 			~(QT_TOKEN_STATUS_SPLITXSTATE | QT_TOKEN_STATUS_PERR)) {
 		case 0:
diff --git a/include/usb.h b/include/usb.h
index efb67ea33f..41c8c47f89 100644
--- a/include/usb.h
+++ b/include/usb.h
@@ -140,6 +140,7 @@  struct usb_device {
 	int act_len;			/* transferred bytes */
 	int maxchild;			/* Number of ports if hub */
 	int portnr;			/* Port number, 1=first */
+	int extra_timeout;	/* Add to timeout in ehci_submit_async */
 #if !CONFIG_IS_ENABLED(DM_USB)
 	/* parent hub, or NULL if this is the root hub */
 	struct usb_device *parent;
diff --git a/include/usb_defs.h b/include/usb_defs.h
index 6dd2c997f9..572f6ab296 100644
--- a/include/usb_defs.h
+++ b/include/usb_defs.h
@@ -197,6 +197,7 @@ 
 #define USB_ST_NAK_REC          0x10	/* NAK Received*/
 #define USB_ST_CRC_ERR          0x20	/* CRC/timeout Error */
 #define USB_ST_BIT_ERR          0x40	/* Bitstuff error */
+#define USB_ST_XACTERR          0x80	/* XACTERR error */
 #define USB_ST_NOT_PROC         0x80000000L	/* Not yet processed */