Patchwork 2.6.31-rc5 regression: hd don't show up

Submitter Tejun Heo
Date Sept. 5, 2009, 12:12 a.m.
Message ID <4AA1ACF8.7030101@kernel.org>
Permalink /patch/33019/
State Not Applicable
Delegated to: David Miller

Comments

Tejun Heo - Sept. 5, 2009, 12:12 a.m.
Tim Blechmann wrote:
>>>>>>> booting the machine today, one hd is missing again ... bootlog attached
>>>>>> Hmmm... strange.  I don't really see how it could be escaping.  Can
>>>>>> you please apply the attached patch?  It still won't change the
>>>>>> behavior but should be able to catch where it's escaping.
>>>>> attached you find two bootlogs, for a correct boot, and with one hd
>>>>> missing ...
>>>> Heh heh, this is getting a bit embarrassing.  Seems like I wasn't
>>>> looking at the right path.  Can you please try this one too?  If it
>>>> says "XXX D7 pulldown quick exit path" and then succeeds in probing,
>>>> that's the previous failure case so you don't need to keep trying to
>>>> reproduce the problem.
>>> i've attached the two boot logs again ...
>> Okay, it was another wrong guess.  Can you please try this one?
> 
> unfortunately, i haven't been able to capture a bootlog of the failure
> after rebooting about 20 times with yesterday's linus/master.
> once i couldn't boot because the root hd wasn't found, so i don't think
> the issue is solved; it just doesn't show up very frequently ...
> 
> the bootlog of a working system is attached, if i experience another
> issue, i will send you another bootlog. since i am out of town for a few
> days, it may take some time, though ...

Alright, please keep me posted.  Another possibility is that it's
timing related and the PHY goes down briefly post-reset.  I think I've
found the code path but not sure yet and given how many times my hunch
has been wrong on this case, not too confident either.  Anyways, if
it's timing related, too many printks could have thrown it off.  If
you can't reproduce the failure with the previous patch, please try
this one and see whether it prints out "XXX clearing to
ATA_DEV_NONE" on failure.

Thanks.
Tim Blechmann - Sept. 8, 2009, 8:58 p.m.
On 09/05/2009 02:12 AM, Tejun Heo wrote:
> Tim Blechmann wrote:
>>>>>>>> booting the machine today, one hd is missing again ... bootlog attached
>>>>>>> Hmmm... strange.  I don't really see how it could be escaping.  Can
>>>>>>> you please apply the attached patch?  It still won't change the
>>>>>>> behavior but should be able to catch where it's escaping.
>>>>>> attached you find two bootlogs, for a correct boot, and with one hd
>>>>>> missing ...
>>>>> Heh heh, this is getting a bit embarrassing.  Seems like I wasn't
>>>>> looking at the right path.  Can you please try this one too?  If it
>>>>> says "XXX D7 pulldown quick exit path" and then succeeds in probing,
>>>>> that's the previous failure case so you don't need to keep trying to
>>>>> reproduce the problem.
>>>> i've attached the two boot logs again ...
>>> Okay, it was another wrong guess.  Can you please try this one?
>>
>> unfortunately, i haven't been able to capture a bootlog of the failure
>> after rebooting about 20 times with yesterday's linus/master.
>> once i couldn't boot because the root hd wasn't found, so i don't think
>> the issue is solved; it just doesn't show up very frequently ...
>>
>> the bootlog of a working system is attached, if i experience another
>> issue, i will send you another bootlog. since i am out of town for a few
>> days, it may take some time, though ...
> 
> Alright, please keep me posted.  Another possibility is that it's
> timing related and the PHY goes down briefly post-reset.  I think I've
> found the code path but not sure yet and given how many times my hunch
> has been wrong on this case, not too confident either.  Anyways, if
> it's timing related, too many printks could have thrown it off.  If
> you can't reproduce the failure with the previous patch, please try
> this one and see whether it prints out "XXX clearing to
> ATA_DEV_NONE" on failure.

with this patch, i could reproduce it again on the first boot. bootlog
attached.

cheers, tim
Tejun Heo - Sept. 16, 2009, 2:19 a.m.
Hello,

Tim Blechmann wrote:
> with this patch, i could reproduce it again on the first boot. bootlog
> attached.

Thanks a lot for testing.  The offending commit is 816ab897.

commit 816ab89782ac139a8b65147cca990822bb7e8675
Author: Tejun Heo <tj@kernel.org>
Date:   Wed Oct 22 00:31:34 2008 +0900

    libata: set device class to NONE if phys_offline

    Reset methods don't have access to phys link status for slave links
    and may incorrectly indicate device presence causing unnecessary probe
    failures for unoccupied links.  This patch clears device class to NONE
    during post-reset processing if phys link is offline.

    As on/offlineness semantics is strictly defined and used in multiple
    places by the core layer, this won't change behavior for drivers which
    don't use slave links.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

The problem is that I don't really remember why I added this one back
then.  The change is incorrect because an offline phys link should be
dealt with later in the reset logic.  That later handling didn't work
quite as expected at the time, so I ended up adding the above as a
workaround, and the workaround turned out to be wrong.  I'll dig deeper
and find out what the problem was back then.

Thanks.

Patch

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index a04488f..d0d0f88 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -2673,8 +2673,10 @@  int ata_eh_reset(struct ata_link *link, int classify,
 				classes[dev->devno] = ATA_DEV_ATA;
 			else if (lflags & ATA_LFLAG_ASSUME_SEMB)
 				classes[dev->devno] = ATA_DEV_SEMB_UNSUP;
-		} else
+		} else {
+			ata_dev_printk(dev, KERN_INFO, "XXX clearing to ATA_DEV_NONE\n");
 			classes[dev->devno] = ATA_DEV_NONE;
+		}
 	}
 
 	/* record current link speed */