Patchwork Intermittent SATA failures ("link offline, clearing class 1 to NONE")

login
register
mail settings
Submitter Tejun Heo
Date Dec. 2, 2009, 1:43 a.m.
Message ID <4B15C64E.7090106@kernel.org>
Download mbox | patch
Permalink /patch/39952/
State Not Applicable
Delegated to: David Miller
Headers show

Comments

Tejun Heo - Dec. 2, 2009, 1:43 a.m.
Hello,

Can you please try this one?

Thanks.
Alex Vandiver - Dec. 2, 2009, 5:45 p.m.
At Tue Dec 01 20:43:42 -0500 2009, Tejun Heo wrote:
> Can you please try this one?

Good and bad dmesg output attached.
 - Alex
Tejun Heo - Dec. 3, 2009, 12:19 a.m.
On 12/03/2009 02:45 AM, Alex Vandiver wrote:
> At Tue Dec 01 20:43:42 -0500 2009, Tejun Heo wrote:
>> Can you please try this one?
> 
> Good and bad dmesg output attached.

That's interesting.  The link state seems to oscillates for quite long
time.  Can you please try the followings?

1. Change sata_deb_timing_hotplug to sata_deb_timing_long in
   ata_piix.c::piidx_sidpr_hardreset() and see whether it solves the
   problem?

2. Switch the controller into ahci mode in BIOS if possible and see
   whether the problem goes away.

3. Try kernel parameter "libata.force=nohrst" and see whether it makes
   any difference.

Thanks.
Tejun Heo - Dec. 3, 2009, 12:42 a.m.
On 12/03/2009 09:19 AM, Tejun Heo wrote:
> On 12/03/2009 02:45 AM, Alex Vandiver wrote:
>> At Tue Dec 01 20:43:42 -0500 2009, Tejun Heo wrote:
>>> Can you please try this one?
>>
>> Good and bad dmesg output attached.
> 
> That's interesting.  The link state seems to oscillates for quite long
> time.  Can you please try the followings?
> 
> 1. Change sata_deb_timing_hotplug to sata_deb_timing_long in
>    ata_piix.c::piidx_sidpr_hardreset() and see whether it solves the
>    problem?
> 
> 2. Switch the controller into ahci mode in BIOS if possible and see
>    whether the problem goes away.
> 
> 3. Try kernel parameter "libata.force=nohrst" and see whether it makes
>    any difference.

One more thing to try.

4. You can tell libata to reset the port by running the following command.

   echo "- - -" > /sys/class/scsi_host/hostN/scan

   N would be 0 or 1 on your configuration.  Can you please run the
   above command repeatedly and see whether it triggers a reset
   failure?  If so, the above will be far more convenient way to test
   than rebooting repeatedly.

Thanks.
Alex Vandiver - Dec. 18, 2009, 11:26 p.m.
On 12/03/2009 09:19 AM, Tejun Heo wrote:
> That's interesting.  The link state seems to oscillates for quite long
> time.  Can you please try the followings?

Sorry for taking so long to get back to this.

> 1. Change sata_deb_timing_hotplug to sata_deb_timing_long in
>    ata_piix.c::piidx_sidpr_hardreset() and see whether it solves the
>    problem?

Nope -- in fact, _three_ of the drives failed to come up once, with
this change.
 
> 2. Switch the controller into ahci mode in BIOS if possible and see
>    whether the problem goes away.

Sadly, the BIOS doesn't have an option for this.

> 3. Try kernel parameter "libata.force=nohrst" and see whether it makes
>    any difference.

I wasn't able to replicate the problem with this kernel parameter.
dmesg for such a boot is attached.

At Wed Dec 02 19:42:04 -0500 2009, Tejun Heo wrote:
> 4. You can tell libata to reset the port by running the following command.
> 
>    echo "- - -" > /sys/class/scsi_host/hostN/scan
> 
>    N would be 0 or 1 on your configuration.  Can you please run the
>    above command repeatedly and see whether it triggers a reset
>    failure?  If so, the above will be far more convenient way to test
>    than rebooting repeatedly.

I wasn't able to get this to fail, sadly.  If the boot _had_ failed,
however, this technique did bring the drive online.
 - Alex
Tejun Heo - Dec. 21, 2009, 4:50 a.m.
Hello,

On 12/19/2009 08:26 AM, Alex Vandiver wrote:
> On 12/03/2009 09:19 AM, Tejun Heo wrote:
>> 1. Change sata_deb_timing_hotplug to sata_deb_timing_long in
>>    ata_piix.c::piidx_sidpr_hardreset() and see whether it solves the
>>    problem?
> 
> Nope -- in fact, _three_ of the drives failed to come up once, with
> this change.

Wow, that's unexpected.  Can you please post log of this with the
debugging patch applied?

>> 2. Switch the controller into ahci mode in BIOS if possible and see
>>    whether the problem goes away.
> 
> Sadly, the BIOS doesn't have an option for this.

Ah... that's sad.

>> 3. Try kernel parameter "libata.force=nohrst" and see whether it makes
>>    any difference.
> 
> I wasn't able to replicate the problem with this kernel parameter.
> dmesg for such a boot is attached.

Alright, thanks for verifying.

>> 4. You can tell libata to reset the port by running the following command.
>>
>>    echo "- - -" > /sys/class/scsi_host/hostN/scan
>>
>>    N would be 0 or 1 on your configuration.  Can you please run the
>>    above command repeatedly and see whether it triggers a reset
>>    failure?  If so, the above will be far more convenient way to test
>>    than rebooting repeatedly.
> 
> I wasn't able to get this to fail, sadly.  If the boot _had_ failed,
> however, this technique did bring the drive online.

Hmmm... interesting.  So, the boot time probing is somehow different
from probing initiated by manual scan (they both use hardreset).  Can
you please post log of manual scans with debug patch applied?

Thanks.
Alex Vandiver - Dec. 21, 2009, 7:13 p.m.
At Sun Dec 20 23:50:15 -0500 2009, Tejun Heo wrote:
> Wow, that's unexpected.  Can you please post log of this with the
> debugging patch applied?

I've only been able to replicate this once, and I didn't manage to
save its dmesg, sadly.  I'll see if I can get it to crop up again,
though.

> Hmmm... interesting.  So, the boot time probing is somehow different
> from probing initiated by manual scan (they both use hardreset).  Can
> you please post log of manual scans with debug patch applied?

The output of a manual scan is:

[  255.236938] ata1.00: hard resetting link
[  255.465422] ata1.00: XXX debounce start, SStatus=123
[  257.551239] ata1.00: XXX debounce done, SStatus=123, DET stable for 2090 msecs
[  257.551246] ata1.01: hard resetting link
[  257.780776] ata1.01: XXX debounce start, SStatus=123
[  259.866589] ata1.01: XXX debounce done, SStatus=123, DET stable for 2090 msecs
[  260.026328] ata1.00: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  260.026342] ata1.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  260.087511] ata1.00: configured for UDMA/133
[  260.127736] ata1.01: configured for UDMA/133
[  260.127740] ata1: EH complete

 - Alex
Tejun Heo - Dec. 22, 2009, 12:06 a.m.
Hello,

On 12/22/2009 04:13 AM, Alex Vandiver wrote:
> At Sun Dec 20 23:50:15 -0500 2009, Tejun Heo wrote:
>> Wow, that's unexpected.  Can you please post log of this with the
>> debugging patch applied?
> 
> I've only been able to replicate this once, and I didn't manage to
> save its dmesg, sadly.  I'll see if I can get it to crop up again,
> though.

Any failing trace witfffh long debounce timing would be helpful
because it really shouldn't be failing with long debounce timing.

Thanks.

Patch

diff --git a/drivers/ata/ata_piix.c b/drivers/ata/ata_piix.c
index 9ac4e37..f50bba6 100644
--- a/drivers/ata/ata_piix.c
+++ b/drivers/ata/ata_piix.c
@@ -173,6 +173,8 @@  static int piix_sidpr_scr_read(struct ata_link *link,
 			       unsigned int reg, u32 *val);
 static int piix_sidpr_scr_write(struct ata_link *link,
 				unsigned int reg, u32 val);
+static int piix_sidpr_hardreset(struct ata_link *link, unsigned int *class,
+				unsigned long deadline);
 #ifdef CONFIG_PM
 static int piix_pci_device_suspend(struct pci_dev *pdev, pm_message_t mesg);
 static int piix_pci_device_resume(struct pci_dev *pdev);
@@ -334,9 +336,9 @@  static struct ata_port_operations piix_sata_ops = {
 
 static struct ata_port_operations piix_sidpr_sata_ops = {
 	.inherits		= &piix_sata_ops,
-	.hardreset		= sata_std_hardreset,
 	.scr_read		= piix_sidpr_scr_read,
 	.scr_write		= piix_sidpr_scr_write,
+	.hardreset		= piix_sidpr_hardreset,
 };
 
 static const struct piix_map_db ich5_map_db = {
@@ -962,6 +964,18 @@  static int piix_sidpr_scr_write(struct ata_link *link,
 	return 0;
 }
 
+static int piix_sidpr_hardreset(struct ata_link *link, unsigned int *class,
+				unsigned long deadline)
+{
+	const unsigned long *timing = sata_deb_timing_hotplug;
+	bool online;
+	int rc;
+
+	/* do hardreset */
+	rc = sata_link_hardreset(link, timing, deadline, &online, NULL);
+	return online ? -EAGAIN : rc;
+}
+
 #ifdef CONFIG_PM
 static int piix_broken_suspend(void)
 {
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index dc72690..c501e1e 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -3731,7 +3731,7 @@  int sata_link_debounce(struct ata_link *link, const unsigned long *params,
 	unsigned long interval = params[0];
 	unsigned long duration = params[1];
 	unsigned long last_jiffies, t;
-	u32 last, cur;
+	u32 last, cur, xxx;
 	int rc;
 
 	t = ata_deadline(jiffies, params[2]);
@@ -3740,7 +3740,9 @@  int sata_link_debounce(struct ata_link *link, const unsigned long *params,
 
 	if ((rc = sata_scr_read(link, SCR_STATUS, &cur)))
 		return rc;
+	xxx = cur;
 	cur &= 0xf;
+	ata_link_printk(link, KERN_INFO, "XXX debounce start, SStatus=%x\n", xxx);
 
 	last = cur;
 	last_jiffies = jiffies;
@@ -3749,6 +3751,7 @@  int sata_link_debounce(struct ata_link *link, const unsigned long *params,
 		msleep(interval);
 		if ((rc = sata_scr_read(link, SCR_STATUS, &cur)))
 			return rc;
+		xxx = cur;
 		cur &= 0xf;
 
 		/* DET stable? */
@@ -3756,8 +3759,12 @@  int sata_link_debounce(struct ata_link *link, const unsigned long *params,
 			if (cur == 1 && time_before(jiffies, deadline))
 				continue;
 			if (time_after(jiffies,
-				       ata_deadline(last_jiffies, duration)))
+				       ata_deadline(last_jiffies, duration))) {
+				ata_link_printk(link, KERN_INFO,
+						"XXX debounce done, SStatus=%x, DET stable for %u msecs\n",
+						xxx, jiffies_to_msecs(jiffies - last_jiffies));
 				return 0;
+			}
 			continue;
 		}