diff mbox

Drives missing at boot

Message ID 4C44BD42.3030904@kernel.org
State Not Applicable
Delegated to: David Miller
Headers show

Commit Message

Tejun Heo July 19, 2010, 9:01 p.m. UTC
Hello,

On 07/19/2010 09:31 PM, Mark Knecht wrote:
>    With about 10-12 day of testing, 1-2 boots/day, I've not had a
> single boot failure since adding the patch. Only twice has it said
> tries=2. Every other time it's tries=1. The machine seems to work fine
> either way.

Hmmm... can you please test the attached patch instead?  It seems
likely that the root cause is not flakiness of SIDPR but incorrect
locking in libata EH code.

Thanks.

Comments

Paul Check July 20, 2010, 3:14 a.m. UTC | #1
Hey Tejun: I guess this is the same patch that you sent me to fix my issue
with missing drives.  Good news: I've been through about 10 reboots now
and no problems.  Based on my prior experience, I'd say with the old
setup, 10 clean boots in a row was probably less than a 1% event.  So, it
seems that this has fixed my problem.

Thanks!

Regards, Paul

> Hello,
>
> On 07/19/2010 09:31 PM, Mark Knecht wrote:
>>    With about 10-12 day of testing, 1-2 boots/day, I've not had a
>> single boot failure since adding the patch. Only twice has it said
>> tries=2. Every other time it's tries=1. The machine seems to work fine
>> either way.
>
> Hmmm... can you please test the attached patch instead?  It seems
> likely that the root cause is not flakiness of SIDPR but incorrect
> locking in libata EH code.
>
> Thanks.
>
> --
> tejun
>


--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo July 20, 2010, 2:14 p.m. UTC | #2
Helo,

On 07/20/2010 05:14 AM, Paul Check wrote:
> Hey Tejun: I guess this is the same patch that you sent me to fix my issue
> with missing drives.  Good news: I've been through about 10 reboots now
> and no problems.  Based on my prior experience, I'd say with the old
> setup, 10 clean boots in a row was probably less than a 1% event.  So, it
> seems that this has fixed my problem.

Yeap, it's the same one.  I'm forwarding the patch upstream now but,
Mark, please let me know the test result.

Thanks.
Mark Knecht July 20, 2010, 2:53 p.m. UTC | #3
On Tue, Jul 20, 2010 at 7:14 AM, Tejun Heo <tj@kernel.org> wrote:
> Helo,
>
> On 07/20/2010 05:14 AM, Paul Check wrote:
>> Hey Tejun: I guess this is the same patch that you sent me to fix my issue
>> with missing drives.  Good news: I've been through about 10 reboots now
>> and no problems.  Based on my prior experience, I'd say with the old
>> setup, 10 clean boots in a row was probably less than a 1% event.  So, it
>> seems that this has fixed my problem.
>
> Yeap, it's the same one.  I'm forwarding the patch upstream now but,
> Mark, please let me know the test result.
>
> Thanks.
>
> --
> tejun
>

Tejun,
   I'm traveling but back tonight. I'll try it then.

Thanks,
Mark
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Knecht July 20, 2010, 4:16 p.m. UTC | #4
On Tue, Jul 20, 2010 at 7:53 AM, Mark Knecht <markknecht@gmail.com> wrote:
> On Tue, Jul 20, 2010 at 7:14 AM, Tejun Heo <tj@kernel.org> wrote:
>> Helo,
>>
>> On 07/20/2010 05:14 AM, Paul Check wrote:
>>> Hey Tejun: I guess this is the same patch that you sent me to fix my issue
>>> with missing drives.  Good news: I've been through about 10 reboots now
>>> and no problems.  Based on my prior experience, I'd say with the old
>>> setup, 10 clean boots in a row was probably less than a 1% event.  So, it
>>> seems that this has fixed my problem.
>>
>> Yeap, it's the same one.  I'm forwarding the patch upstream now but,
>> Mark, please let me know the test result.
>>
>> Thanks.
>>
>> --
>> tejun
>>
>
> Tejun,
>   I'm traveling but back tonight. I'll try it then.
>
> Thanks,
> Mark
>

OK, I was able to get into the machine remotely for a few minutes. I
think the patch applied correctly and the machine cold booted cleanly.
(My wife powered it up for me.)

I'll do more boots later when you confirm everything looks reasonable.
I don't see any print statement in the patch so I don't know what to
look for. I'm attaching a dmesg file for you to review.

Assuming I did this right then everything seems good so far.

Cheers,
Mark

c2stable linux # patch --verbose -p1 <~mark/Downloads/ata_piix-sidpr-lock.patch
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|diff --git a/drivers/ata/ata_piix.c b/drivers/ata/ata_piix.c
|index 7409f98..3971bc0 100644
|--- a/drivers/ata/ata_piix.c
|+++ b/drivers/ata/ata_piix.c
--------------------------
Patching file drivers/ata/ata_piix.c using Plan A...
Hunk #1 succeeded at 158.
Hunk #2 succeeded at 952.
Hunk #3 succeeded at 968.
Hunk #4 succeeded at 1573.
done
c2stable linux #
Mark Knecht July 20, 2010, 8:52 p.m. UTC | #5
On Mon, Jul 19, 2010 at 8:14 PM, Paul Check <paul@thechecks.ca> wrote:
> Hey Tejun: I guess this is the same patch that you sent me to fix my issue
> with missing drives.  Good news: I've been through about 10 reboots now
> and no problems.  Based on my prior experience, I'd say with the old
> setup, 10 clean boots in a row was probably less than a 1% event.  So, it
> seems that this has fixed my problem.
>
> Thanks!
>
> Regards, Paul
>

Hey Paul. Glad it worked for you as it did for me.

Was your hardware in any way similar to mine?

- Mark
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul Check July 20, 2010, 9:19 p.m. UTC | #6
I missed your hardware. I have an ASUS PT 6 Deluxe V2 board with 4 drives.
I think they are 1TB each, Seagate SATA drives.  P

> On Mon, Jul 19, 2010 at 8:14 PM, Paul Check <paul@thechecks.ca> wrote:
>> Hey Tejun: I guess this is the same patch that you sent me to fix my
>> issue
>> with missing drives.  Good news: I've been through about 10 reboots now
>> and no problems.  Based on my prior experience, I'd say with the old
>> setup, 10 clean boots in a row was probably less than a 1% event.  So,
>> it
>> seems that this has fixed my problem.
>>
>> Thanks!
>>
>> Regards, Paul
>>
>
> Hey Paul. Glad it worked for you as it did for me.
>
> Was your hardware in any way similar to mine?
>
> - Mark
>


--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Knecht July 20, 2010, 9:26 p.m. UTC | #7
On Tue, Jul 20, 2010 at 2:19 PM, Paul Check <paul@thechecks.ca> wrote:
> I missed your hardware. I have an ASUS PT 6 Deluxe V2 board with 4 drives.
> I think they are 1TB each, Seagate SATA drives.  P
>

What processor? I've got the Core i7-980X. I've wondered if this
problem showed up for me because possibly the 12 thread processor
moves through some part of boot setup pretty quickly.

Mine is an Asus Rampage II Extreme with five 500GB drives.

Whatever the reason great thanks to Tejun for fixing it. I keep
wanting to add an 'r' to his last name to get 'Hero'!

Cheers,
Mark
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul Check July 20, 2010, 11:05 p.m. UTC | #8
Oh, yeah, my processor is the i7 975. That's 4 cores 8 threads, right? 
Yes, nice work Tejun, thanks.  I'll have to run custom kernels until this
gets fed into Debian.  P

> On Tue, Jul 20, 2010 at 2:19 PM, Paul Check <paul@thechecks.ca> wrote:
>> I missed your hardware. I have an ASUS PT 6 Deluxe V2 board with 4
>> drives.
>> I think they are 1TB each, Seagate SATA drives.  P
>>
>
> What processor? I've got the Core i7-980X. I've wondered if this
> problem showed up for me because possibly the 12 thread processor
> moves through some part of boot setup pretty quickly.
>
> Mine is an Asus Rampage II Extreme with five 500GB drives.
>
> Whatever the reason great thanks to Tejun for fixing it. I keep
> wanting to add an 'r' to his last name to get 'Hero'!
>
> Cheers,
> Mark
>


--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Knecht July 21, 2010, 8:54 p.m. UTC | #9
On Tue, Jul 20, 2010 at 7:14 AM, Tejun Heo <tj@kernel.org> wrote:
> Helo,
>
> On 07/20/2010 05:14 AM, Paul Check wrote:
>> Hey Tejun: I guess this is the same patch that you sent me to fix my issue
>> with missing drives.  Good news: I've been through about 10 reboots now
>> and no problems.  Based on my prior experience, I'd say with the old
>> setup, 10 clean boots in a row was probably less than a 1% event.  So, it
>> seems that this has fixed my problem.
>
> Yeap, it's the same one.  I'm forwarding the patch upstream now but,
> Mark, please let me know the test result.
>
> Thanks.
>
> --
> tejun
>

Tejun,
   Looks like I had a failure today. First one in weeks and only the
3rd or 4th boot with this newer patch file. One of the two drives
making a RAID0 wasn't found so /dev/md11 (constructed from /dev/sdd
and /dev/sde) couldn't be started. I did a cold reboot and the drive
was found.

   If it matters, and it probably doesn't, the failure came on a boot
which had a scheduled fsck to do of /dev/md5 - my main / drive. I
don't see how that would make a difference but I figure why leave the
info out. That's why the times are so much larger in the dmesg file.
(I think)

   dmesg attached. I patched the Gentoo kernel if it makes a
difference, same as I did with the earlier patch.

mark@c2stable ~ $ uname -a
Linux c2stable 2.6.34-gentoo-r2 #1 SMP PREEMPT Sun Jul 18 14:09:48 PDT
2010 x86_64 Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz GenuineIntel
GNU/Linux
mark@c2stable ~ $

Sorry,
Mark
Paul Check July 21, 2010, 9:22 p.m. UTC | #10
That's unfortunate. FYI I have continued to be trouble free, but my
processor is a bit weaker than Mark's, although I would find it surprising
that this would cause a problem.  Also, FYI Mark, I have 12GB of Corsair
RAM, and have it bumped up to the Intel XMP profile.

P

> On Tue, Jul 20, 2010 at 7:14 AM, Tejun Heo <tj@kernel.org> wrote:
>> Helo,
>>
>> On 07/20/2010 05:14 AM, Paul Check wrote:
>>> Hey Tejun: I guess this is the same patch that you sent me to fix my
>>> issue
>>> with missing drives.  Good news: I've been through about 10 reboots
>>> now
>>> and no problems.  Based on my prior experience, I'd say with the old
>>> setup, 10 clean boots in a row was probably less than a 1% event.  So,
>>> it
>>> seems that this has fixed my problem.
>>
>> Yeap, it's the same one.  I'm forwarding the patch upstream now but,
>> Mark, please let me know the test result.
>>
>> Thanks.
>>
>> --
>> tejun
>>
>
> Tejun,
>    Looks like I had a failure today. First one in weeks and only the
> 3rd or 4th boot with this newer patch file. One of the two drives
> making a RAID0 wasn't found so /dev/md11 (constructed from /dev/sdd
> and /dev/sde) couldn't be started. I did a cold reboot and the drive
> was found.
>
>    If it matters, and it probably doesn't, the failure came on a boot
> which had a scheduled fsck to do of /dev/md5 - my main / drive. I
> don't see how that would make a difference but I figure why leave the
> info out. That's why the times are so much larger in the dmesg file.
> (I think)
>
>    dmesg attached. I patched the Gentoo kernel if it makes a
> difference, same as I did with the earlier patch.
>
> mark@c2stable ~ $ uname -a
> Linux c2stable 2.6.34-gentoo-r2 #1 SMP PREEMPT Sun Jul 18 14:09:48 PDT
> 2010 x86_64 Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz GenuineIntel
> GNU/Linux
> mark@c2stable ~ $
>
> Sorry,
> Mark
>


--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo July 22, 2010, 12:39 p.m. UTC | #11
Hello,

On 07/21/2010 10:54 PM, Mark Knecht wrote:
>    Looks like I had a failure today. First one in weeks and only the
> 3rd or 4th boot with this newer patch file. One of the two drives
> making a RAID0 wasn't found so /dev/md11 (constructed from /dev/sdd
> and /dev/sde) couldn't be started. I did a cold reboot and the drive
> was found.
> 
>    If it matters, and it probably doesn't, the failure came on a boot
> which had a scheduled fsck to do of /dev/md5 - my main / drive. I
> don't see how that would make a difference but I figure why leave the
> info out. That's why the times are so much larger in the dmesg file.
> (I think)
> 
>    dmesg attached. I patched the Gentoo kernel if it makes a
> difference, same as I did with the earlier patch.
> 
> mark@c2stable ~ $ uname -a
> Linux c2stable 2.6.34-gentoo-r2 #1 SMP PREEMPT Sun Jul 18 14:09:48 PDT
> 2010 x86_64 Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz GenuineIntel
> GNU/Linux
> mark@c2stable ~ $

Hmmm... that's weird.  Can you please make sure the patch is actually
applied?  Adding a printk("XXX patch applied!\n") near other changes
usually is easy enough.  Also, can you please apply resume-dbg-1.patch
too and reproduce the failure and post log?

Thanks.
diff mbox

Patch

diff --git a/drivers/ata/ata_piix.c b/drivers/ata/ata_piix.c
index 7409f98..3971bc0 100644
--- a/drivers/ata/ata_piix.c
+++ b/drivers/ata/ata_piix.c
@@ -158,6 +158,7 @@  struct piix_map_db {
 struct piix_host_priv {
 	const int *map;
 	u32 saved_iocfg;
+	spinlock_t sidpr_lock;	/* FIXME: remove once locking in EH is fixed */
 	void __iomem *sidpr;
 };
 
@@ -951,12 +952,15 @@  static int piix_sidpr_scr_read(struct ata_link *link,
 			       unsigned int reg, u32 *val)
 {
 	struct piix_host_priv *hpriv = link->ap->host->private_data;
+	unsigned long flags;
 
 	if (reg >= ARRAY_SIZE(piix_sidx_map))
 		return -EINVAL;
 
+	spin_lock_irqsave(&hpriv->sidpr_lock, flags);
 	piix_sidpr_sel(link, reg);
 	*val = ioread32(hpriv->sidpr + PIIX_SIDPR_DATA);
+	spin_unlock_irqrestore(&hpriv->sidpr_lock, flags);
 	return 0;
 }
 
@@ -964,12 +968,15 @@  static int piix_sidpr_scr_write(struct ata_link *link,
 				unsigned int reg, u32 val)
 {
 	struct piix_host_priv *hpriv = link->ap->host->private_data;
+	unsigned long flags;
 
 	if (reg >= ARRAY_SIZE(piix_sidx_map))
 		return -EINVAL;
 
+	spin_lock_irqsave(&hpriv->sidpr_lock, flags);
 	piix_sidpr_sel(link, reg);
 	iowrite32(val, hpriv->sidpr + PIIX_SIDPR_DATA);
+	spin_unlock_irqrestore(&hpriv->sidpr_lock, flags);
 	return 0;
 }
 
@@ -1566,6 +1573,7 @@  static int __devinit piix_init_one(struct pci_dev *pdev,
 	hpriv = devm_kzalloc(dev, sizeof(*hpriv), GFP_KERNEL);
 	if (!hpriv)
 		return -ENOMEM;
+	spin_lock_init(&hpriv->sidpr_lock);
 
 	/* Save IOCFG, this will be used for cable detection, quirk
 	 * detection and restoration on detach.  This is necessary