Patchwork Intermittent SATA link down SStatus 0

Submitter Tejun Heo
Date July 14, 2010, 12:26 p.m.
Message ID <4C3DACFF.2060504@kernel.org>
Permalink /patch/58884/
State Not Applicable
Delegated to: David Miller

Comments

Tejun Heo - July 14, 2010, 12:26 p.m.
Hello,

On 07/12/2010 06:36 PM, Paul Check wrote:
>> Tejun & Co:
>>
>> I have finally upgraded to the 2.6.34 kernel, and I am still having
>> problems with some of my drives not coming up some of the time (different
>> drives, at different times, never more than one).
>>
>> Here are some lines from /var/log/messages on the most recent boot below.
>> Do you have any suggestions for this? I'm getting tired of having to
>> reconstitute my raid 30-50% of the time, and will try anything to see if
>> it fixes. Note the "link up (unknown)" and "link down" lines. I don't know
>> what should appear, but I have 4 hard drives and one optical drive plugged
>> into 5 of the 6 SATA ports on the board.
...
>> Jul 12 12:25:22 min kernel: [    2.078489] ata2.01: SATA link up <unknown> (SStatus 300 SControl 123)

Hmm, yeah, it seems that SCR access via SIDPR is flakier than the
previous commit accounted for.  There's another thread where a similar
problem is being debugged.  Can you please try the patch from the
thread below?  I'm attaching the patch here too.

  http://thread.gmane.org/gmane.linux.kernel/1005983/focus=46749

Thanks.
Paul Check - July 14, 2010, 5:58 p.m.
I applied your patch to the 2.6.34 kernel, and on my very first reboot had
a drive missing. What information do you want? Just /var/log/messages?

I don't want to keep rebooting forever. Was this patch supposed to fix the
problem, or just give you information to debug? Is the problem even known?

Regards, Paul


Tejun Heo - July 14, 2010, 11:36 p.m.
Hello,

On 07/14/2010 07:58 PM, Paul Check wrote:
> I applied your patch to the 2.6.34 kernel, and on my very first reboot had
> a drive missing. What information do you want? Just /var/log/messages?

The kernel boot log should suffice, i.e. the dmesg output after boot.

> I don't want to keep rebooting forever. Was this patch supposed to fix the
> problem, or just give you information to debug? Is the problem even known?

It is meant to give more debugging information, but it also seems to
make the other reporter's problem go away.  It looks like SCR access
via SIDPR can be flaky depending on timing, but I'm still trying to
determine how to work around it.

Thanks.
Paul Check - July 15, 2010, 12:43 a.m.
Ok, well we didn't have to wait long. Once again one of my drives failed
to appear. This time it was a different drive than the times before.

I have attached the output from dmesg. Can you tell me from this output
what the problem is?

Regards, Paul

Tejun Heo - July 15, 2010, 9:20 a.m.
Hello,

On 07/15/2010 02:43 AM, Paul Check wrote:
> Ok, well we didn't have to wait long. Once again one of my drives failed
> to appear. This time it was a different drive than the times before.
>
> I have attached the output from dmesg. Can you tell me from this output
> what the problem is?

SStatus and SControl are behaving erratically.  I'm not yet sure why
that's happening.

[    2.562157] ata4.01: SATA link up <unknown> (SStatus 300 SControl 123)

Hmm... it looks like SStatus and SControl are swapped here.  Maybe
there's a race condition in the SIDPR code.  I'll look into it a bit
more.

Thanks.

Patch

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 2984e45..ce87bfe 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -3712,7 +3712,7 @@  int sata_link_resume(struct ata_link *link, const unsigned long *params,
 		     unsigned long deadline)
 {
 	int tries = ATA_LINK_RESUME_TRIES;
-	u32 scontrol, serror;
+	u32 scontrol, scontrol1, serror;
 	int rc;
 
 	if ((rc = sata_scr_read(link, SCR_CONTROL, &scontrol)))
@@ -3739,6 +3739,14 @@  int sata_link_resume(struct ata_link *link, const unsigned long *params,
 			return rc;
 	} while ((scontrol & 0xf0f) != 0x300 && --tries);
 
+	/* check once more */
+	msleep(100);
+	if ((rc = sata_scr_read(link, SCR_CONTROL, &scontrol1)))
+		return rc;
+	ata_link_printk(link, KERN_ERR,
+			"XXX SControl after resume = %X %X, tries=%d\n",
+			scontrol, scontrol1, ATA_LINK_RESUME_TRIES - tries + 1);
+
 	if ((scontrol & 0xf0f) != 0x300) {
 		ata_link_printk(link, KERN_ERR,
 				"failed to resume link (SControl %X)\n",
@@ -6007,7 +6015,7 @@  static void async_port_probe(void *data, async_cookie_t cookie)
 
 		ehi->probe_mask |= ATA_ALL_DEVICES;
 		ehi->action |= ATA_EH_RESET | ATA_EH_LPM;
-		ehi->flags |= ATA_EHI_NO_AUTOPSY | ATA_EHI_QUIET;
+		ehi->flags |= ATA_EHI_NO_AUTOPSY/* | ATA_EHI_QUIET*/;
 
 		ap->pflags &= ~ATA_PFLAG_INITIALIZING;
 		ap->pflags |= ATA_PFLAG_LOADING;