Patchwork Drives missing at boot

login
register
mail settings
Submitter Tejun Heo
Date July 7, 2010, 3:48 p.m.
Message ID <4C34A1D5.1090202@kernel.org>
Download mbox | patch
Permalink /patch/58125/
State Not Applicable
Delegated to: David Miller
Headers show

Comments

Tejun Heo - July 7, 2010, 3:48 p.m.
Hello,

On 07/07/2010 05:34 PM, Mark Knecht wrote:
> OK - I don't know if this was you intention but since adding this
> patch I've not had a single drive missing failure. I've cold booted
> about 8 times and warm booted at least 20 times. Every one has come up
> fine. I've even gone so far as to turn off the UPS and sit for 5
> minutes before cold booting. Still nothing fails right now.
> 
> I've had this sort of statistical thing happen before where it hasn't
> failed for days, maybe even weeks, but then it starts failing and
> fails every time for awhile. Over the past few days working with you
> I've never had to reboot more than twice to get you a file. Now I've
> tried 30 times this morning and I've come up with nothing.
> 
> I will continue to watch the machine and send you the failing dmesg
> file whenever I finally get it. For now I can only attach the passing
> file showing the patch is now included.

It seems like SIDPR is a bit more unreliable than the current code can
handle and the added delay and read could have affected the result.
Eh... weird.  Can you please apply the attached patch instead?  The
only difference is it will print out two SControl values instead of
one.  ie. "XXX SControl after resume = AAA BBB, tries=T".  Can you
please try to boot multiple times and see if AAA and BBB differ
anytime?  If that happens, please attach the boot log.  Also, if you
see one with T > 1, please attach that one too.

Thanks.
Mark Knecht - July 7, 2010, 4:15 p.m.
On Wed, Jul 7, 2010 at 8:48 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On 07/07/2010 05:34 PM, Mark Knecht wrote:
>> OK - I don't know if this was you intention but since adding this
>> patch I've not had a single drive missing failure. I've cold booted
>> about 8 times and warm booted at least 20 times. Every one has come up
>> fine. I've even gone so far as to turn off the UPS and sit for 5
>> minutes before cold booting. Still nothing fails right now.
>>
>> I've had this sort of statistical thing happen before where it hasn't
>> failed for days, maybe even weeks, but then it starts failing and
>> fails every time for awhile. Over the past few days working with you
>> I've never had to reboot more than twice to get you a file. Now I've
>> tried 30 times this morning and I've come up with nothing.
>>
>> I will continue to watch the machine and send you the failing dmesg
>> file whenever I finally get it. For now I can only attach the passing
>> file showing the patch is now included.
>
> It seems like SIDPR is a bit more unreliable than the current code can
> handle and the added delay and read could have affected the result.
> Eh... weird.  Can you please apply the attached patch instead?  The
> only difference is it will print out two SControl values instead of
> one.  ie. "XXX SControl after resume = AAA BBB, tries=T".  Can you
> please try to boot multiple times and see if AAA and BBB differ
> anytime?  If that happens, please attach the boot log.  Also, if you
> see one with T > 1, please attach that one too.
>
> Thanks.
>
> --
> tejun
>

Certainly. Is there a way to reverse the previous patch?

c2stable linux # patch -p1 --dry-run <~mark/Downloads/resume-dbg-1.patch
patching file drivers/ata/libata-core.c
Hunk #1 succeeded at 3798 (offset 86 lines).
Hunk #2 succeeded at 3833 with fuzz 2 (offset 94 lines).
Hunk #3 FAILED at 6109.
1 out of 3 hunks FAILED -- saving rejects to file drivers/ata/libata-core.c.rej
c2stable linux #

I assume this is failing because your patch is over the plain kernel,
not the one I've patched?

Thanks,
Mark
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo - July 7, 2010, 4:19 p.m.
On 07/07/2010 06:15 PM, Mark Knecht wrote:
> Certainly. Is there a way to reverse the previous patch?
> 
> c2stable linux # patch -p1 --dry-run <~mark/Downloads/resume-dbg-1.patch
> patching file drivers/ata/libata-core.c
> Hunk #1 succeeded at 3798 (offset 86 lines).
> Hunk #2 succeeded at 3833 with fuzz 2 (offset 94 lines).
> Hunk #3 FAILED at 6109.
> 1 out of 3 hunks FAILED -- saving rejects to file drivers/ata/libata-core.c.rej
> c2stable linux #
> 
> I assume this is failing because your patch is over the plain kernel,
> not the one I've patched?

$ patch -R -p1 < ~mark/Downloads/resume-dbg.patch
$ patch -p1 < ~mark/Downloads/resume-dbg-1.patch
Mark Knecht - July 7, 2010, 4:27 p.m.
On Wed, Jul 7, 2010 at 9:19 AM, Tejun Heo <tj@kernel.org> wrote:
> On 07/07/2010 06:15 PM, Mark Knecht wrote:
>> Certainly. Is there a way to reverse the previous patch?
>>
>> c2stable linux # patch -p1 --dry-run <~mark/Downloads/resume-dbg-1.patch
>> patching file drivers/ata/libata-core.c
>> Hunk #1 succeeded at 3798 (offset 86 lines).
>> Hunk #2 succeeded at 3833 with fuzz 2 (offset 94 lines).
>> Hunk #3 FAILED at 6109.
>> 1 out of 3 hunks FAILED -- saving rejects to file drivers/ata/libata-core.c.rej
>> c2stable linux #
>>
>> I assume this is failing because your patch is over the plain kernel,
>> not the one I've patched?
>
> $ patch -R -p1 < ~mark/Downloads/resume-dbg.patch
> $ patch -p1 < ~mark/Downloads/resume-dbg-1.patch
>
> --
> tejun
>
Thanks. Building the new kernel now. I'll start trying to save the
data you're looking for.

Cheers,
Mark
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Knecht - July 7, 2010, 5:06 p.m.
On Wed, Jul 7, 2010 at 8:48 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On 07/07/2010 05:34 PM, Mark Knecht wrote:
>> OK - I don't know if this was you intention but since adding this
>> patch I've not had a single drive missing failure. I've cold booted
>> about 8 times and warm booted at least 20 times. Every one has come up
>> fine. I've even gone so far as to turn off the UPS and sit for 5
>> minutes before cold booting. Still nothing fails right now.
>>
>> I've had this sort of statistical thing happen before where it hasn't
>> failed for days, maybe even weeks, but then it starts failing and
>> fails every time for awhile. Over the past few days working with you
>> I've never had to reboot more than twice to get you a file. Now I've
>> tried 30 times this morning and I've come up with nothing.
>>
>> I will continue to watch the machine and send you the failing dmesg
>> file whenever I finally get it. For now I can only attach the passing
>> file showing the patch is now included.
>
> It seems like SIDPR is a bit more unreliable than the current code can
> handle and the added delay and read could have affected the result.
> Eh... weird.  Can you please apply the attached patch instead?  The
> only difference is it will print out two SControl values instead of
> one.  ie. "XXX SControl after resume = AAA BBB, tries=T".  Can you
> please try to boot multiple times and see if AAA and BBB differ
> anytime?  If that happens, please attach the boot log.  Also, if you
> see one with T > 1, please attach that one too.
>
> Thanks.
>
> --
> tejun
>

4 warm reboots. All 4 said 300 300. However the 4th one only showed an
extra attempt at running the patch code with and also showed Tries =
2. I'm attaching boot #1 and boot #4 for now. I've saved them all if
you need or just want them.

Please note that in all 4 cases all drives were found. Nothing is
missing in any test yet after adding these either of these patches.

I've not tried cold boots yet. That's next.

Cheers,
Mark
Tejun Heo - July 7, 2010, 5:26 p.m.
Hello,

On 07/07/2010 07:06 PM, Mark Knecht wrote:
> 4 warm reboots. All 4 said 300 300. However the 4th one only showed an
> extra attempt at running the patch code with and also showed Tries =
> 2. I'm attaching boot #1 and boot #4 for now. I've saved them all if
> you need or just want them.
> 
> Please note that in all 4 cases all drives were found. Nothing is
> missing in any test yet after adding these either of these patches.
> 
> I've not tried cold boots yet. That's next.

Hmm... just in case you're being lucky, please keep an eye on it over
several days and report the result.  I think all that's necessary is
slight modification to the resume logic but let's watch a bit first.

Thanks.
Mark Knecht - July 7, 2010, 5:32 p.m.
On Wed, Jul 7, 2010 at 10:26 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On 07/07/2010 07:06 PM, Mark Knecht wrote:
>> 4 warm reboots. All 4 said 300 300. However the 4th one only showed an
>> extra attempt at running the patch code with and also showed Tries =
>> 2. I'm attaching boot #1 and boot #4 for now. I've saved them all if
>> you need or just want them.
>>
>> Please note that in all 4 cases all drives were found. Nothing is
>> missing in any test yet after adding these either of these patches.
>>
>> I've not tried cold boots yet. That's next.
>
> Hmm... just in case you're being lucky, please keep an eye on it over
> several days and report the result.  I think all that's necessary is
> slight modification to the resume logic but let's watch a bit first.
>
> Thanks.
>
> --
> tejun
>
I've tried two cold boots so far. One of them had that same extra
Tries = 2 at the same place. I'll do a couple more just to see what
happens.

I'm happy to watch it as long as it takes and will certainly save any
results if and when a drive isn't found. This problem has run hot and
cold for a few months. Sometimes I go a week with every boot is good.
This last two weeks was finally bad enough to get me to report it.

I also want to investigate actually getting the BIOS AHCI setting
working, but I think I'll leave that alone for afew days so as to not
upset this experiment. Performance on the machine isn't critical right
now, assuming AHCI helps.

I'll get back to you when I've got something new to report.

Cheers,
Mark
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Knecht - July 19, 2010, 7:31 p.m.
On Wed, Jul 7, 2010 at 10:26 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On 07/07/2010 07:06 PM, Mark Knecht wrote:
>> 4 warm reboots. All 4 said 300 300. However the 4th one only showed an
>> extra attempt at running the patch code with and also showed Tries =
>> 2. I'm attaching boot #1 and boot #4 for now. I've saved them all if
>> you need or just want them.
>>
>> Please note that in all 4 cases all drives were found. Nothing is
>> missing in any test yet after adding these either of these patches.
>>
>> I've not tried cold boots yet. That's next.
>
> Hmm... just in case you're being lucky, please keep an eye on it over
> several days and report the result.  I think all that's necessary is
> slight modification to the resume logic but let's watch a bit first.
>
> Thanks.
>
> --
> tejun
>

Tejun,
   With about 10-12 day of testing, 1-2 boots/day, I've not had a
single boot failure since adding the patch. Only twice has it said
tries=2. Every other time it's tries=1. The machine seems to work fine
either way.

Thanks,
Mark
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 2984e45..ce87bfe 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -3712,7 +3712,7 @@  int sata_link_resume(struct ata_link *link, const unsigned long *params,
 		     unsigned long deadline)
 {
 	int tries = ATA_LINK_RESUME_TRIES;
-	u32 scontrol, serror;
+	u32 scontrol, scontrol1, serror;
 	int rc;
 
 	if ((rc = sata_scr_read(link, SCR_CONTROL, &scontrol)))
@@ -3739,6 +3739,14 @@  int sata_link_resume(struct ata_link *link, const unsigned long *params,
 			return rc;
 	} while ((scontrol & 0xf0f) != 0x300 && --tries);
 
+	/* check once more */
+	msleep(100);
+	if ((rc = sata_scr_read(link, SCR_CONTROL, &scontrol1)))
+			return rc;
+	ata_link_printk(link, KERN_ERR,
+			"XXX SControl after resume = %X %X, tries=%d\n",
+			scontrol, scontrol1, ATA_LINK_RESUME_TRIES - tries + 1);
+
 	if ((scontrol & 0xf0f) != 0x300) {
 		ata_link_printk(link, KERN_ERR,
 				"failed to resume link (SControl %X)\n",
@@ -6007,7 +6015,7 @@  static void async_port_probe(void *data, async_cookie_t cookie)
 
 		ehi->probe_mask |= ATA_ALL_DEVICES;
 		ehi->action |= ATA_EH_RESET | ATA_EH_LPM;
-		ehi->flags |= ATA_EHI_NO_AUTOPSY | ATA_EHI_QUIET;
+		ehi->flags |= ATA_EHI_NO_AUTOPSY/* | ATA_EHI_QUIET*/;
 
 		ap->pflags &= ~ATA_PFLAG_INITIALIZING;
 		ap->pflags |= ATA_PFLAG_LOADING;