BIOS SATA legacy mode failure

Message ID 522C1AC5.4080105@linux.com
State Not Applicable
Delegated to: David Miller

Commit Message

Levente Kurusa Sept. 8, 2013, 6:35 a.m. UTC
Hi,

I have been testing the Linux kernel on a two-year-old Toshiba NB100
netbook of mine. When I enabled SATA compatibility/legacy mode instead
of AHCI mode in the BIOS, the kernel got stuck. I have pasted the
relevant dmesg output along with a patch that works around it
temporarily. What I suspect to be the cause is that the BIOS sets the
device into IDE mode but still reports it as a SATA device, and hence
libata tries to send ATA commands to it, which obviously makes it go
bad. The patch works around this by adding a new field to ata_device
called exce_cnt, which counts how many exceptions have occurred. After
three exceptions, it automatically disables the device. Also, please
note this is my first ever patch for the kernel :-)

The following dmesg output repeats in an infinite loop:
ata3: lost interrupt (Status 0x50)
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata3.00: failed command: READ DMA
ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
               res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata3.00: status: { DRDY }
ata3: soft resetting link
ata3.00: configured for UDMA/33 (no error)
ata3.00: device reported invalid CHS sector 0
ata3: EH complete
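
As a reading aid for the log above (not part of the original report), the
status byte in the "lost interrupt (Status 0x50)" line can be decoded bit by
bit. The sketch below uses the standard ATA status register bit positions;
the enum and function names are made up for illustration. 0x50 decodes to
DRDY and DSC with BSY and ERR clear, which is at least consistent with a
drive that finished the command while the completion interrupt never reached
the host.

```c
#include <assert.h>
#include <stdio.h>

/* ATA status register bits (standard positions; names are illustrative). */
enum {
    ST_ERR  = 0x01, /* error */
    ST_DRQ  = 0x08, /* data request */
    ST_DSC  = 0x10, /* device seek complete (obsolete in newer specs) */
    ST_DF   = 0x20, /* device fault */
    ST_DRDY = 0x40, /* device ready */
    ST_BSY  = 0x80  /* busy */
};

/*
 * Write a space-separated list of the flags set in `status` into `buf`
 * (most significant bit first) and return `buf`. `len` must be non-zero.
 */
static char *decode_ata_status(unsigned char status, char *buf, size_t len)
{
    static const struct { unsigned char bit; const char *name; } flags[] = {
        { ST_BSY, "BSY" }, { ST_DRDY, "DRDY" }, { ST_DF, "DF" },
        { ST_DSC, "DSC" }, { ST_DRQ, "DRQ" },   { ST_ERR, "ERR" },
    };
    size_t used = 0;

    assert(len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
        if (!(status & flags[i].bit))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? " " : "", flags[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break; /* output truncated; stop appending */
        used += (size_t)n;
    }
    return buf;
}
```

Calling decode_ata_status(0x50, buf, sizeof buf) with a 64-byte buffer fills
it with "DRDY DSC"; the 0x40 in the "res 40/..." line decodes to "DRDY",
matching the "status: { DRDY }" line in the log.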

Patch that fixes the infinite loop:
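diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index f9476fb..eeedf80 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -2437,6 +2437,14 @@ static void ata_eh_link_report(struct ata_link *link)
                             ehc->i.action, frozen, tries_buf);
                 if (desc)
                         ata_dev_err(ehc->i.dev, "%s\n", desc);
+               ehc->i.dev->exce_cnt ++;
+               ata_dev_warn(ehc->i.dev, "Number of exceptions: %d\n", ehc->i.dev->exce_cnt);
+               /**
+                  * The device is failing terribly,
+                 * disable it to prevent damage.
+                 */
+               if(ehc->i.dev->exce_cnt > 2)
+                       ata_dev_disable(ehc->i.dev);
         } else {
                 ata_link_err(link, "exception Emask 0x%x "
                              "SAct 0x%x SErr 0x%x action 0x%x%s%s\n",
diff --git a/include/linux/libata.h b/include/linux/libata.h
index eae7a05..fa52ee6 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -660,7 +660,8 @@ struct ata_device {
         u8                      devslp_timing[ATA_LOG_DEVSLP_SIZE];

         /* error history */
-       int                     spdn_cnt;
+       int                     spdn_cnt; /* Number of speed_downs */
+       int                     exce_cnt; /* Number of exceptions that happened */
         /* ering is CLEAR_END, read comment above CLEAR_END */
         struct ata_ering        ering;
  };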

Regards,
Levente Kurusa
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Robert Hancock Sept. 10, 2013, 4:01 a.m. UTC | #1
On 09/08/2013 12:35 AM, Levente Kurusa wrote:
> Hi,
>
> I have been testing the Linux Kernel on a two-year-old Toshiba NB100
> netbook of mine, however when I enabled SATA compatibility/legacy mode
> instead of AHCI mode in the BIOS, the kernel got stuck. I have pasted
> the relevant dmesg piece along with a patch that fixes it temporarily.
> What I suspect to be the cause is that the BIOS sets the device into
> IDE mode, but it will report it as a SATA device and hence libata tries
> to send ATA commands to it, which obviously makes it go bad. The patch

No, the commands are the same whichever mode the controller is in. The 
problem is presumably something else, like maybe some kind of interrupt 
routing problem when the controller is in legacy mode.

> fixes it, by adding a new field to ata_device called exce_cnt, which
> counts how many exceptions have occurred. After three exceptions, it
> automatically disables the device. Also, please note this is my first
> ever patch for the kernel :-)
>
> The following dmesg is stuck in an infinite loop.
> dmesg:
> ata3: lost interrupt (Status 0x50)
> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> ata3.00: failed command: READ DMA
> ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
>                res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata3.00: status: { DRDY }
> ata3: soft resetting link
> ata3.00: configured for UDMA/33 (no error)
> ata3.00: device reported invalid CHS sector 0
> ata3: EH complete
>
> Patch that fixes the infinite loop:
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index f9476fb..eeedf80 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -2437,6 +2437,14 @@ static void ata_eh_link_report(struct ata_link *link)
>                              ehc->i.action, frozen, tries_buf);
>                  if (desc)
>                          ata_dev_err(ehc->i.dev, "%s\n", desc);
> +               ehc->i.dev->exce_cnt ++;
> +               ata_dev_warn(ehc->i.dev, "Number of exceptions: %d\n", ehc->i.dev->exce_cnt);
> +               /**
> +                  * The device is failing terribly,
> +                 * disable it to prevent damage.
> +                 */
> +               if(ehc->i.dev->exce_cnt > 2)
> +                       ata_dev_disable(ehc->i.dev);
>          } else {
>                  ata_link_err(link, "exception Emask 0x%x "
>                               "SAct 0x%x SErr 0x%x action 0x%x%s%s\n",
> diff --git a/include/linux/libata.h b/include/linux/libata.h
> index eae7a05..fa52ee6 100644
> --- a/include/linux/libata.h
> +++ b/include/linux/libata.h
> @@ -660,7 +660,8 @@ struct ata_device {
>          u8                      devslp_timing[ATA_LOG_DEVSLP_SIZE];
>
>          /* error history */
> -       int                     spdn_cnt;
> +       int                     spdn_cnt; /* Number of speed_downs */
> +       int                     exce_cnt; /* Number of exceptions that happened */
>          /* ering is CLEAR_END, read comment above CLEAR_END */
>          struct ata_ering        ering;
>   };
>

This doesn't seem like a very good fix. It may prevent the apparent 
infinite loop but will just prevent that device from functioning at all. 
It would be better if we could figure out what was actually going wrong.

Levente Kurusa Sept. 14, 2013, 3:09 p.m. UTC | #2
On 2013-09-10 06:01, Robert Hancock wrote:
> On 09/08/2013 12:35 AM, Levente Kurusa wrote:
>> Hi,
>>
>> I have been testing the Linux Kernel on a two-year-old Toshiba NB100
>> netbook of mine, however when I enabled SATA compatibility/legacy mode
>> instead of AHCI mode in the BIOS, the kernel got stuck. I have pasted
>> the relevant dmesg piece along with a patch that fixes it temporarily.
>> What I suspect to be the cause is that the BIOS sets the device into
>> IDE mode, but it will report it as a SATA device and hence libata tries
>> to send ATA commands to it, which obviously makes it go bad. The patch
>
> No, the commands are the same whichever mode the controller is in. The
> problem is presumably something else, like maybe some kind of interrupt
> routing problem when the controller is in legacy mode.
>
Yes, I see now.

>
> This doesn't seem like a very good fix. It may prevent the apparent
> infinite loop but will just prevent that device from functioning at all.
> It would be better if we could figure out what was actually going wrong.
>
>
I have tested with three different computers, all switched to
legacy/IDE/compatibility mode, and none of them had this problem. Of
course, they could also have been set to AHCI mode, and there the
kernel would boot normally. It feels strange, but so far I have only
been able to reproduce the problem with a Toshiba MK8052GSX. On the
topic of my patch, I still don't see why a device that fails so badly
that it reports three exceptions shouldn't be disabled. As in this
case, it could cause infinite loops.
Robert Hancock Sept. 16, 2013, 4:37 a.m. UTC | #3
On Sat, Sep 14, 2013 at 9:09 AM, Levente Kurusa <levex@linux.com> wrote:
> On 2013-09-10 06:01, Robert Hancock wrote:
>
>> On 09/08/2013 12:35 AM, Levente Kurusa wrote:
>>>
>>> Hi,
>>>
>>> I have been testing the Linux Kernel on a two-year-old Toshiba NB100
>>> netbook of mine, however when I enabled SATA compatibility/legacy mode
>>> instead of AHCI mode in the BIOS, the kernel got stuck. I have pasted
>>> the relevant dmesg piece along with a patch that fixes it temporarily.
>>> What I suspect to be the cause is that the BIOS sets the device into
>>> IDE mode, but it will report it as a SATA device and hence libata tries
>>> to send ATA commands to it, which obviously makes it go bad. The patch
>>
>>
>> No, the commands are the same whichever mode the controller is in. The
>> problem is presumably something else, like maybe some kind of interrupt
>> routing problem when the controller is in legacy mode.
>>
> Yes, I see now.
>
>
>>
>> This doesn't seem like a very good fix. It may prevent the apparent
>> infinite loop but will just prevent that device from functioning at all.
>> It would be better if we could figure out what was actually going wrong.
>>
>>
> I have tested the problem with three different computers, all switched
> to legacy/IDE/compatibility mode, and they didn't have this problem. Of
> course, they could have been set to AHCI mode, and there the kernel would
> boot normally. Feels strange, but so far I was only able to reproduce the
> problem with a Toshiba MK8052GSX. On the topic of my patch, I still don't
> see why a device which fails so terribly that it reports 3 exceptions
> shouldn't be disabled. Like in this case, it could cause infinite loops.

The problem is that this could happen in some cases when you wouldn't
want to disable the device, like an error that just happens
sporadically and works on retry, or a device you're trying to recover
data from.
Levente Kurusa Sept. 17, 2013, 4:47 p.m. UTC | #4
On 2013-09-16 06:37, Robert Hancock wrote:
> On Sat, Sep 14, 2013 at 9:09 AM, Levente Kurusa <levex@linux.com> wrote:
>> On 2013-09-10 06:01, Robert Hancock wrote:
>>
>>> On 09/08/2013 12:35 AM, Levente Kurusa wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have been testing the Linux Kernel on a two-year-old Toshiba NB100
>>>> netbook of mine, however when I enabled SATA compatibility/legacy mode
>>>> instead of AHCI mode in the BIOS, the kernel got stuck. I have pasted
>>>> the relevant dmesg piece along with a patch that fixes it temporarily.
>>>> What I suspect to be the cause is that the BIOS sets the device into
>>>> IDE mode, but it will report it as a SATA device and hence libata tries
>>>> to send ATA commands to it, which obviously makes it go bad. The patch
>>>
>>>
>>> No, the commands are the same whichever mode the controller is in. The
>>> problem is presumably something else, like maybe some kind of interrupt
>>> routing problem when the controller is in legacy mode.
>>>
>> Yes, I see now.
>>
>>
>>>
>>> This doesn't seem like a very good fix. It may prevent the apparent
>>> infinite loop but will just prevent that device from functioning at all.
>>> It would be better if we could figure out what was actually going wrong.
>>>
>>>
>> I have tested the problem with three different computers, all switched
>> to legacy/IDE/compatibility mode, and they didn't have this problem. Of
>> course, they could have been set to AHCI mode, and there the kernel would
>> boot normally. Feels strange, but so far I was only able to reproduce the
>> problem with a Toshiba MK8052GSX. On the topic of my patch, I still don't
>> see why a device which fails so terribly that it reports 3 exceptions
>> shouldn't be disabled. Like in this case, it could cause infinite loops.
>
> The problem is that this could happen in some cases when you wouldn't
> want to disable the device, like an error that just happens
> sporadically and works on retry, or a device you're trying to recover
> data from.
>
What do you think if I edit the patch so that exce_cnt is reset to zero
whenever an operation completes successfully? I might as well add a
module_param that sets the maximum value of exce_cnt, with zero as an
option to never disable the device. Please don't get me wrong, I don't
want to force this patch. I just want to learn how all this works, and
in the process try to make it better. :-)
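
The policy proposed above can be sketched in plain userspace C. This is only
an illustration of the counting logic, not kernel code: struct dev_state,
on_exception(), on_success() and max_exceptions are hypothetical names, and
max_exceptions stands in for the suggested module parameter. The default of 3
mirrors the "exce_cnt > 2" check in the original patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-device state; not the real struct ata_device. */
struct dev_state {
    int  exce_cnt;  /* consecutive exceptions since the last success */
    bool disabled;  /* set once the threshold has been reached */
};

/*
 * Hypothetical tunable standing in for the proposed module parameter.
 * 0 means "never disable the device".
 */
static int max_exceptions = 3;

/* Called once per reported exception. */
static void on_exception(struct dev_state *dev)
{
    assert(dev);
    dev->exce_cnt++;
    if (max_exceptions > 0 && dev->exce_cnt >= max_exceptions)
        dev->disabled = true;
}

/* Called when a command completes successfully: forget earlier failures. */
static void on_success(struct dev_state *dev)
{
    assert(dev);
    dev->exce_cnt = 0;
}
```

With the reset in place, a sporadic error followed by a successful retry never
accumulates toward the threshold; only an unbroken run of max_exceptions
failures disables the device. The choice of threshold is still arbitrary.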
Robert Hancock Sept. 18, 2013, 1:35 a.m. UTC | #5
On Tue, Sep 17, 2013 at 10:47 AM, Levente Kurusa <levex@linux.com> wrote:
> On 2013-09-16 06:37, Robert Hancock wrote:
>
>> On Sat, Sep 14, 2013 at 9:09 AM, Levente Kurusa <levex@linux.com> wrote:
>>>
>>> On 2013-09-10 06:01, Robert Hancock wrote:
>>>
>>>> On 09/08/2013 12:35 AM, Levente Kurusa wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have been testing the Linux Kernel on a two-year-old Toshiba NB100
>>>>> netbook of mine, however when I enabled SATA compatibility/legacy mode
>>>>> instead of AHCI mode in the BIOS, the kernel got stuck. I have pasted
>>>>> the relevant dmesg piece along with a patch that fixes it temporarily.
>>>>> What I suspect to be the cause is that the BIOS sets the device into
>>>>> IDE mode, but it will report it as a SATA device and hence libata tries
>>>>> to send ATA commands to it, which obviously makes it go bad. The patch
>>>>
>>>>
>>>>
>>>> No, the commands are the same whichever mode the controller is in. The
>>>> problem is presumably something else, like maybe some kind of interrupt
>>>> routing problem when the controller is in legacy mode.
>>>>
>>> Yes, I see now.
>>>
>>>
>>>>
>>>> This doesn't seem like a very good fix. It may prevent the apparent
>>>> infinite loop but will just prevent that device from functioning at all.
>>>> It would be better if we could figure out what was actually going wrong.
>>>>
>>>>
>>> I have tested the problem with three different computers, all switched
>>> to legacy/IDE/compatibility mode, and they didn't have this problem. Of
>>> course, they could have been set to AHCI mode, and there the kernel would
>>> boot normally. Feels strange, but so far I was only able to reproduce the
>>> problem with a Toshiba MK8052GSX. On the topic of my patch, I still don't
>>> see why a device which fails so terribly that it reports 3 exceptions
>>> shouldn't be disabled. Like in this case, it could cause infinite loops.
>>
>>
>> The problem is that this could happen in some cases when you wouldn't
>> want to disable the device, like an error that just happens
>> sporadically and works on retry, or a device you're trying to recover
>> data from.
>>
> What do you think if I edit the patch in a way, that when an operation
> successfully completes, it resets exce_cnt to zero. Might as well add a
> module_param, which can set the maximum value of exce_cnt, while having zero
> as an option to never disable the device. Please don't think me wrong, I
> don't want to force this patch, I just want to learn how all this works, and
> in the process try to make it better. :-)

That would be better, but I think you're still going to have an issue
with what magic number to pick to avoid disabling devices
inappropriately.

Conceptually, disabling the device doesn't really make sense anyway.
If someone in userspace wants to keep trying to read from that device,
why would you stop them because of some arbitrary judgement? The
kernel itself isn't "locked up" during this process, anything not
blocked on I/O to that device should be able to continue running, so
that process is only hurting itself. If the system fails to boot from
another device due to this, this would likely point out some kind of
problem in userspace or the distro boot process being overly
serialized.
Levente Kurusa Sept. 21, 2013, 7:35 a.m. UTC | #6
On 2013-09-18 03:35, Robert Hancock wrote:
> On Tue, Sep 17, 2013 at 10:47 AM, Levente Kurusa <levex@linux.com> wrote:
>> On 2013-09-16 06:37, Robert Hancock wrote:
>>
>>> On Sat, Sep 14, 2013 at 9:09 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>
>>>> On 2013-09-10 06:01, Robert Hancock wrote:
>>>>
>>>>> On 09/08/2013 12:35 AM, Levente Kurusa wrote:
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have been testing the Linux Kernel on a two-year-old Toshiba NB100
>>>>>> netbook of mine, however when I enabled SATA compatibility/legacy mode
>>>>>> instead of AHCI mode in the BIOS, the kernel got stuck. I have pasted
>>>>>> the relevant dmesg piece along with a patch that fixes it temporarily.
>>>>>> What I suspect to be the cause is that the BIOS sets the device into
>>>>>> IDE mode, but it will report it as a SATA device and hence libata tries
>>>>>> to send ATA commands to it, which obviously makes it go bad. The patch
>>>>>
>>>>>
>>>>>
>>>>> No, the commands are the same whichever mode the controller is in. The
>>>>> problem is presumably something else, like maybe some kind of interrupt
>>>>> routing problem when the controller is in legacy mode.
>>>>>
>>>> Yes, I see now.
>>>>
>>>>
>>>>>
>>>>> This doesn't seem like a very good fix. It may prevent the apparent
>>>>> infinite loop but will just prevent that device from functioning at all.
>>>>> It would be better if we could figure out what was actually going wrong.
>>>>>
>>>>>
>>>> I have tested the problem with three different computers, all switched
>>>> to legacy/IDE/compatibility mode, and they didn't have this problem. Of
>>>> course, they could have been set to AHCI mode, and there the kernel would
>>>> boot normally. Feels strange, but so far I was only able to reproduce the
>>>> problem with a Toshiba MK8052GSX. On the topic of my patch, I still don't
>>>> see why a device which fails so terribly that it reports 3 exceptions
>>>> shouldn't be disabled. Like in this case, it could cause infinite loops.
>>>
>>>
>>> The problem is that this could happen in some cases when you wouldn't
>>> want to disable the device, like an error that just happens
>>> sporadically and works on retry, or a device you're trying to recover
>>> data from.
>>>
>> What do you think if I edit the patch in a way, that when an operation
>> successfully completes, it resets exce_cnt to zero. Might as well add a
>> module_param, which can set the maximum value of exce_cnt, while having zero
>> as an option to never disable the device. Please don't think me wrong, I
>> don't want to force this patch, I just want to learn how all this works, and
>> in the process try to make it better. :-)
>
> That would be better, but I think you're still going to have an issue
> with what magic number to pick to avoid disabling devices
> inappropriately.
>
> Conceptually, disabling the device doesn't really make sense anyway.
> If someone in userspace wants to keep trying to read from that device,
> why would you stop them because of some arbitrary judgement? The
> kernel itself isn't "locked up" during this process, anything not
> blocked on I/O to that device should be able to continue running, so
> that process is only hurting itself. If the system fails to boot from
> another device due to this, this would likely point out some kind of
> problem in userspace or the distro boot process being overly
> serialized.
>

I have been booting with the initramfs from Ubuntu 13.04, and I have
also tried to boot from the Ubuntu install CD. Neither could continue
the boot process. I'm going to spend the weekend trying to figure out
where and why the interrupts don't happen, whether it is a routing
issue or a hardware issue. I highly doubt the latter, given that
Windows XP SP2 was able to boot up without errors.
Robert Hancock Sept. 21, 2013, 5:04 p.m. UTC | #7
On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@linux.com> wrote:
> I have been booting up with the initramfs from ubuntu 13.04,
> and I have also tried to boot with the ubuntu install cd. They couldn't
> continue the boot process. I'm gonna spend the weekend trying to figure
> out where and why the interrupts don't happen. Whether it be a routing
> or a hardware issue, which I highly doubt due to the fact that Windows
> XP SP2 was able to boot up without errors.

Are you able to capture the full dmesg output from a boot attempt, along
with the contents of /proc/interrupts?
Levente Kurusa Sept. 22, 2013, 7:13 a.m. UTC | #8
On 2013-09-21 19:04, Robert Hancock wrote:
> On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@linux.com> wrote:
>> I have been booting up with the initramfs from ubuntu 13.04,
>> and I have also tried to boot with the ubuntu install cd. They couldn't
>> continue the boot process. I'm gonna spend the weekend trying to figure
>> out where and why the interrupts don't happen. Whether it be a routing
>> or a hardware issue, which I highly doubt due to the fact that Windows
>> XP SP2 was able to boot up without errors.
>
> Are you able to get out full dmesg output from a boot attempt and the
> contents of /proc/interrupts?
>
As I said before, I am not able to get to a shell without my 'symptom
cure'. With my patch applied I get the following dmesg output (some of
my debug messages are turned off):
http://pastebin.com/5eb5G3Dx
/proc/interrupts is here:
http://pastebin.com/84CJey2D
After yesterday's research, I have ended up at ata_piix.c. That file
looks like the real culprit, as my netbook's controller is an Intel
ICH7M. The values I am getting from the device are very different from
the ones that are expected.

Things I noticed in dmesg but ignored:
There is a stack dump, because nobody cared about IRQ#20. I ignored
this because it is the EHCI IRQ, and I assumed it has nothing to do with
ATA. The problem is with ata3 (/dev/sdc), while the IRQ happens
with /dev/sda, which works fine.

Things I had not noticed before, but have now:
The ACPI errors at ~0.1329.
Robert Hancock Sept. 25, 2013, 6:31 a.m. UTC | #9
On Sun, Sep 22, 2013 at 1:13 AM, Levente Kurusa <levex@linux.com> wrote:
> On 2013-09-21 19:04, Robert Hancock wrote:
>
>> On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@linux.com> wrote:
>>> I have been booting up with the initramfs from ubuntu 13.04,
>>> and I have also tried to boot with the ubuntu install cd. They couldn't
>>> continue the boot process. I'm gonna spend the weekend trying to figure
>>> out where and why the interrupts don't happen. Whether it be a routing
>>> or a hardware issue, which I highly doubt due to the fact that Windows
>>> XP SP2 was able to boot up without errors.
>>
>>
>> Are you able to get out full dmesg output from a boot attempt and the
>> contents of /proc/interrupts?
>>
> As I said before, I am not able to get to the shell, without my 'symptom
> cure'. With my patch I get the following dmesg output, with
> some of my debug messages turned off:
> http://pastebin.com/5eb5G3Dx
> /proc/interrupts is here:
> http://pastebin.com/84CJey2D
> After yesterday's research, I have come to ata_piix.c . That file looks like
> the real culprit, as my netbook's controller is an Intel ICH7M one,
> The values I am getting from the device are very different than those
> that are expected.
>
> Things I have noticed, but ignored in dmesg:
> There is a stack dump, because nobody cared about IRQ#20. I have ignored
> this because it is the EHCI IRQ, and I suppose it has nothing to do with
> ata. The problem is with ata3 or /dev/sdc, while the IRQ happens
> with /dev/sda, which works fine.

I think it is likely related to the problem. The kernel thinks this
controller is on IRQ 16, but apparently something is raising
unacknowledged interrupts on IRQ 20 and nothing is coming in on IRQ
16. It seems quite likely that the source is actually the ATA controller.

You mentioned that Windows XP was able to work in this mode. I wonder
whether it was using the IOAPIC; if not, the IRQ routing is
different, which might mask the problem. Do you know what IRQ Device
Manager reported for this controller in Windows? And was it using any
IRQs over 15 (which would indicate the IOAPIC was in use)?
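
As a rough aid for that comparison, the sketch below parses a single row
in the style of /proc/interrupts, so the count on the IRQ the kernel
assigned (16) can be compared with the one that is actually firing (20).
It is a simplified userspace helper, not kernel code: parse_irq_row and
irq_implies_ioapic are made-up names, and the parser assumes that the
per-CPU count columns are followed by a chip name that does not start
with a digit.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/* Parse one /proc/interrupts-style row, e.g.
 *   " 16:          0   IO-APIC-fasteoi   ata_piix"
 * On success returns 0, stores the IRQ number in *irq and the sum of
 * all per-CPU count columns in *count. Returns -1 for rows that do not
 * start with a numeric IRQ (the CPU header, "NMI:", "LOC:", ...). */
int parse_irq_row(const char *row, int *irq, unsigned long *count)
{
	char *pos;
	long n = strtol(row, &pos, 10);
	unsigned long total = 0;

	if (pos == row || *pos != ':' || n < 0)
		return -1;
	pos++; /* skip the ':' */

	for (;;) {
		char *next;
		unsigned long v;

		while (*pos == ' ' || *pos == '\t')
			pos++;
		if (!isdigit((unsigned char)*pos))
			break; /* reached the chip name or end of row */
		v = strtoul(pos, &next, 10);
		/* A count column must end at whitespace or end of row. */
		if (*next != ' ' && *next != '\t' && *next != '\n' && *next != '\0')
			break;
		total += v;
		pos = next;
	}

	*irq = (int)n;
	*count = total;
	return 0;
}

/* Rule of thumb from the discussion above: IRQ numbers over 15 are
 * only reachable when the IOAPIC is in use. */
bool irq_implies_ioapic(int irq)
{
	return irq > 15;
}
```

A row for the controller's assigned IRQ whose count stays at zero, next
to a busy row for another IRQ, would be consistent with the misrouting
suspected here.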

>
> Things I have not noticed before, but did now:
> The ACPI errors at ~0.1329
Levente Kurusa Sept. 27, 2013, 1:24 p.m. UTC | #10
On 2013-09-25 08:31, Robert Hancock wrote:
> On Sun, Sep 22, 2013 at 1:13 AM, Levente Kurusa <levex@linux.com> wrote:
>> On 2013-09-21 19:04, Robert Hancock wrote:
>>
>>> On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@linux.com> wrote:
>>>> I have been booting up with the initramfs from ubuntu 13.04,
>>>> and I have also tried to boot with the ubuntu install cd. They couldn't
>>>> continue the boot process. I'm gonna spend the weekend trying to figure
>>>> out where and why the interrupts don't happen. Whether it be a routing
>>>> or a hardware issue, which I highly doubt due to the fact that Windows
>>>> XP SP2 was able to boot up without errors.
>>>
>>>
>>> Are you able to get out full dmesg output from a boot attempt and the
>>> contents of /proc/interrupts?
>>>
>> As I said before, I am not able to get to the shell, without my 'symptom
>> cure'. With my patch I get the following dmesg output, with
>> some of my debug messages turned off:
>> http://pastebin.com/5eb5G3Dx
>> /proc/interrupts is here:
>> http://pastebin.com/84CJey2D
>> After yesterday's research, I have come to ata_piix.c . That file looks like
>> the real culprit, as my netbook's controller is an Intel ICH7M one,
>> The values I am getting from the device are very different than those
>> that are expected.
>>
>> Things I have noticed, but ignored in dmesg:
>> There is a stack dump, because nobody cared about IRQ#20. I have ignored
>> this because it is the EHCI IRQ, and I suppose it has nothing to do with
>> ata. The problem is with ata3 or /dev/sdc, while the IRQ happens
>> with /dev/sda, which works fine.
>
> I think it is likely related to the problem. The kernel thinks this
> controller is on IRQ 16, but apparently something is raising
> un-acknowledged interrupts on IRQ 20 and nothing is coming in on IRQ
> 16. It seems quite likely that this is actually the ATA controller.
>
> You mentioned that Windows XP was able to work in this mode. I wonder
> if it was using the IOAPIC, as if not then the IRQ routing is
> different which might mask the problem. Do you know what IRQ Device
> Manager reported for this controller in Windows? And was it using any
> IRQs over 15 (which would indicate the IOAPIC was in use)?

Hmm, according to Windows XP's Device Manager, this controller
listens on IRQ 20, so Windows is using the I/O APIC.
One question remains: where is the error that maps the
controller to the wrong IRQ?
I have created a simple patch which seems to fix this:
---
@@ -1704,6 +1767,8 @@ static int piix_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
  		hpriv->map = piix_init_sata_map(pdev, port_info,
  					piix_map_db_table[ent->driver_data]);

+	if (pdev->vendor == 0x8086 && pdev->device == 0x27C4)
+		pdev->irq = 20;
  	rc = ata_pci_bmdma_prepare_host(pdev, ppi, &host);
  	if (rc)
  		return rc;

However, I am quite sure that this is not the right way
to solve the problem. Do you have any idea where
the ideal place to implement a fix would be?
According to the ICH7M specs (essentially the
same as ICH6M), we need to check which interrupt pin
the SATA controller is on, then check which IRQ line
that pin is routed to on the I/O APIC, and derive the
IRQ number from those findings.

Specs of ICH7: 
http://www.intel.com/content/dam/doc/datasheet/i-o-controller-hub-7-datasheet.pdf
Device 31 Interrupt Route Register: Chapter 7.1.46
Device 31 Interrupt Pin Register: Chapter 7.1.41

The SATA controller is always Device 31.
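
The two-step lookup described above can be sketched in plain C as
follows. This is an illustration only, not a proposed kernel change:
d31_pin_to_ioapic_irq is a made-up name, and the register layout (one
4-bit field per pin whose low 3 bits select PIRQA#..PIRQH#) and the
mapping of PIRQA#..PIRQH# to IRQ 16..23 in APIC mode are assumptions
that should be checked against the datasheet chapters listed above.

```c
#include <stdint.h>

/*
 * Translate an interrupt pin into an I/O APIC IRQ number.
 *
 * pin:       interrupt pin value for the function, assumed encoding
 *            1 = INTA#, 2 = INTB#, 3 = INTC#, 4 = INTD#
 * route_reg: raw 16-bit value of the Device 31 Interrupt Route register
 *
 * Assumed layout: INTA# route in bits 3:0, INTB# in bits 7:4,
 * INTC# in bits 11:8, INTD# in bits 15:12; the low 3 bits of each
 * field select PIRQA# (0) through PIRQH# (7). In APIC mode the PIRQ
 * lines are assumed to land on IRQ 16..23.
 *
 * Returns the IRQ number, or -1 if pin is not a valid INTx# value.
 */
int d31_pin_to_ioapic_irq(unsigned int pin, uint16_t route_reg)
{
	unsigned int pirq;

	if (pin < 1 || pin > 4)
		return -1;

	pirq = (route_reg >> ((pin - 1) * 4)) & 0x7;
	return 16 + (int)pirq;
}
```

In a driver, the pin would come from the Interrupt Pin register and
route_reg from the chipset configuration space. The sketch takes both as
plain arguments so that the arithmetic can be checked on its own.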
Robert Hancock Sept. 28, 2013, 4:55 a.m. UTC | #11
On Fri, Sep 27, 2013 at 7:24 AM, Levente Kurusa <levex@linux.com> wrote:
> On 2013-09-25 08:31, Robert Hancock wrote:
>
>> On Sun, Sep 22, 2013 at 1:13 AM, Levente Kurusa <levex@linux.com> wrote:
>>>
>>> On 2013-09-21 19:04, Robert Hancock wrote:
>>>
>>>> On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The following dmesg is stuck in an infinite loop.
>>>>>>>>>>> dmesg:
>>>>>>>>>>> ata3: lost interrupt (Status 0x50)
>>>>>>>>>>> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>>>>>>>>>>> ata3.00: failed command: READ DMA
>>>>>>>>>>> ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096
>>>>>>>>>>> in
>>>>>>>>>>>                    res 40/00:00:00:00:00/00:00:00:00:00/00 Emask
>>>>>>>>>>> 0x4
>>>>>>>>>>> (timeout)
>>>>>>>>>>> ata3.00: status: { DRDY }
>>>>>>>>>>> ata3: soft resetting link
>>>>>>>>>>> ata3.00: configured for UDMA/33 (no error)
>>>>>>>>>>> ata3.00: device reported invalid CHS sector 0
>>>>>>>>>>> ata3: EH complete
>>>>>>>>>>>
>>>>>>>>>>> Patch that fixes the infinite loop:
>>>>>>>>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>>>>>>>>> index f9476fb..eeedf80 100644
>>>>>>>>>>> --- a/drivers/ata/libata-eh.c
>>>>>>>>>>> +++ b/drivers/ata/libata-eh.c
>>>>>>>>>>> @@ -2437,6 +2437,14 @@ static void ata_eh_link_report(struct
>>>>>>>>>>> ata_link
>>>>>>>>>>> *link)
>>>>>>>>>>>                                  ehc->i.action, frozen,
>>>>>>>>>>> tries_buf);
>>>>>>>>>>>                      if (desc)
>>>>>>>>>>>                              ata_dev_err(ehc->i.dev, "%s\n",
>>>>>>>>>>> desc);
>>>>>>>>>>> +               ehc->i.dev->exce_cnt ++;
>>>>>>>>>>> +               ata_dev_warn(ehc->i.dev, "Number of exceptions:
>>>>>>>>>>> %d\n",
>>>>>>>>>>> ehc->i.dev->exce_cnt);
>>>>>>>>>>> +               /**
>>>>>>>>>>> +                  * The device is failing terribly,
>>>>>>>>>>> +                 * disable it to prevent damage.
>>>>>>>>>>> +                 */
>>>>>>>>>>> +               if(ehc->i.dev->exce_cnt > 2)
>>>>>>>>>>> +                       ata_dev_disable(ehc->i.dev);
>>>>>>>>>>>              } else {
>>>>>>>>>>>                      ata_link_err(link, "exception Emask 0x%x "
>>>>>>>>>>>                                   "SAct 0x%x SErr 0x%x action
>>>>>>>>>>> 0x%x%s%s\n",
>>>>>>>>>>> diff --git a/include/linux/libata.h b/include/linux/libata.h
>>>>>>>>>>> index eae7a05..fa52ee6 100644
>>>>>>>>>>> --- a/include/linux/libata.h
>>>>>>>>>>> +++ b/include/linux/libata.h
>>>>>>>>>>> @@ -660,7 +660,8 @@ struct ata_device {
>>>>>>>>>>>              u8
>>>>>>>>>>> devslp_timing[ATA_LOG_DEVSLP_SIZE];
>>>>>>>>>>>
>>>>>>>>>>>              /* error history */
>>>>>>>>>>> -       int                     spdn_cnt;
>>>>>>>>>>> +       int                     spdn_cnt; /* Number of
>>>>>>>>>>> speed_downs
>>>>>>>>>>> */
>>>>>>>>>>> +       int                     exce_cnt; /* Number of exceptions
>>>>>>>>>>> that
>>>>>>>>>>> happenned */
>>>>>>>>>>>              /* ering is CLEAR_END, read comment above CLEAR_END
>>>>>>>>>>> */
>>>>>>>>>>>              struct ata_ering        ering;
>>>>>>>>>>>       };
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This doesn't seem like a very good fix. It may prevent the
>>>>>>>>>> apparent
>>>>>>>>>> infinite loop but will just prevent that device from functioning
>>>>>>>>>> at
>>>>>>>>>> all.
>>>>>>>>>> It would be better if we could figure out what was actually going
>>>>>>>>>> wrong.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> I have tested the problem with three different computers, all switched
>>>>>>>>> to legacy/IDE/compatibility mode, and they didn't have this problem. Of
>>>>>>>>> course, they could have been set to AHCI mode, and there the kernel would
>>>>>>>>> boot normally. Feels strange, but so far I was only able to reproduce the
>>>>>>>>> problem with a Toshiba MK8052GSX. On the topic of my patch, I still don't
>>>>>>>>> see why a device which fails so terribly that it reports 3 exceptions
>>>>>>>>> shouldn't be disabled. Like in this case, it could cause infinite loops.
>>>>>>>>
>>>>>>>> The problem is that this could happen in some cases when you wouldn't
>>>>>>>> want to disable the device, like an error that just happens
>>>>>>>> sporadically and works on retry, or a device you're trying to recover
>>>>>>>> data from.
>>>>>>>>
>>>>>>> What do you think if I edit the patch in a way, that when an operation
>>>>>>> successfully completes, it resets exce_cnt to zero. Might as well add a
>>>>>>> module_param, which can set the maximum value of exce_cnt, while having
>>>>>>> zero as an option to never disable the device. Please don't think me
>>>>>>> wrong, I don't want to force this patch, I just want to learn how all
>>>>>>> this works, and in the process try to make it better. :-)
>>>>>>
>>>>>> That would be better, but I think you're still going to have an issue
>>>>>> with what magic number to pick to avoid disabling devices
>>>>>> inappropriately.
>>>>>>
>>>>>> Conceptually, disabling the device doesn't really make sense anyway.
>>>>>> If someone in userspace wants to keep trying to read from that device,
>>>>>> why would you stop them because of some arbitrary judgement? The
>>>>>> kernel itself isn't "locked up" during this process, anything not
>>>>>> blocked on I/O to that device should be able to continue running, so
>>>>>> that process is only hurting itself. If the system fails to boot from
>>>>>> another device due to this, this would likely point out some kind of
>>>>>> problem in userspace or the distro boot process being overly
>>>>>> serialized.
>>>>>>
>>>>>
>>>>> I have been booting up with the initramfs from ubuntu 13.04,
>>>>> and I have also tried to boot with the ubuntu install cd. They couldn't
>>>>> continue the boot process. I'm gonna spend the weekend trying to figure
>>>>> out where and why the interrupts don't happen. Whether it be a routing
>>>>> or a hardware issue, which I highly doubt due to the fact that Windows
>>>>> XP SP2 was able to boot up without errors.
>>>>
>>>>
>>>>
>>>> Are you able to get out full dmesg output from a boot attempt and the
>>>> contents of /proc/interrupts?
>>>>
>>> As I said before, I am not able to get to the shell, without my 'symptom
>>> cure'. With my patch I get the following dmesg output, with
>>> some of my debug messages turned off:
>>> http://pastebin.com/5eb5G3Dx
>>> /proc/interrupts is here:
>>> http://pastebin.com/84CJey2D
>>> After yesterday's research, I have come to ata_piix.c. That file looks like
>>> the real culprit, as my netbook's controller is an Intel ICH7M one.
>>> The values I am getting from the device are very different from those
>>> that are expected.
>>>
>>> Things I have noticed, but ignored in dmesg:
>>> There is a stack dump, because nobody cared about IRQ#20. I have ignored
>>> this because it is the EHCI IRQ, and I suppose it has nothing to do with
>>> ata. The problem is with ata3 or /dev/sdc, while the IRQ happens
>>> with /dev/sda, which works fine.
>>
>>
>> I think it is likely related to the problem. The kernel thinks this
>> controller is on IRQ 16, but apparently something is raising
>> un-acknowledged interrupts on IRQ 20 and nothing is coming in on IRQ
>> 16. It seems quite likely that this is actually the ATA controller.
>>
>> You mentioned that Windows XP was able to work in this mode. I wonder
>> if it was using the IOAPIC, as if not then the IRQ routing is
>> different which might mask the problem. Do you know what IRQ Device
>> Manager reported for this controller in Windows? And was it using any
>> IRQs over 15 (which would indicate the IOAPIC was in use)?
>
>
> Hmm, according to WinXP's Device manager for this controller,
> it listens to IRQ# 20, and therefore it is using the I/O APIC.
> Now, one question remains: where is the error that mismaps the
> controller?
> I have created a simple patch which seems to fix this:
> ---
> @@ -1704,6 +1767,8 @@ static int piix_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>                 hpriv->map = piix_init_sata_map(pdev, port_info,
>                                                 piix_map_db_table[ent->driver_data]);
>
> +       if(pdev->vendor == 0x8086 && pdev->device == 0x27C4)
> +               pdev->irq = 20;
>         rc = ata_pci_bmdma_prepare_host(pdev, ppi, &host);
>         if (rc)
>                 return rc;
>
> However, I am more than sure that this is not the way
> to solve this problem. Do you have any idea on where
> the ideal place would be to implement a fix?
> According to specs of ICH7M, which is essentially the
> same as ICH6M, we need to check on what interrupt pin
> is the SATA controller, and after that check which IRQ line
> is connected to the I/O APIC and decide the IRQ's number
> on those findings.
>
> Specs of ICH7:
> http://www.intel.com/content/dam/doc/datasheet/i-o-controller-hub-7-datasheet.pdf
> Device 31 Interrupt Route Register: Chapter 7.1.46
> Device 31 Interrupt Pin Register: Chapter 7.1.41
>
> The SATA controller is always Device 31.

It would appear that something is messing up with the ACPI IRQ routing
on this machine that's causing us to think the controller is on the
wrong IRQ. CCing the linux-acpi list to see if anyone has some
additional debugging suggestions. I suspect that dumping the DSDT is
likely the first step though. If you can get IASL installed, you can
do something like:

cat /sys/firmware/acpi/tables/DSDT > dsdt.aml
iasl -d dsdt.aml

That should spit out a dsdt.dsl file which would hopefully have the
info needed to figure out what's going on.
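
As a complementary cross-check on which IRQ line is actually firing, the per-IRQ counters in /proc/interrupts can be compared between the line the kernel assigned (16 here) and the line showing unhandled interrupts (20 here). The following is only a rough userspace sketch, not part of any patch in this thread; the helper name irq_total() is made up for illustration, and it assumes the usual layout where each numbered line is "NN:" followed by one counter per CPU.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: sum the per-CPU counters on the line for `irq`
 * in a buffer holding the contents of /proc/interrupts.
 * Returns -1 if no numbered line for that IRQ is found.
 */
long long irq_total(const char *text, int irq)
{
    assert(text != NULL);

    const char *line = text;
    while (*line) {
        char *end;
        long n = strtol(line, &end, 10);

        /* A numbered IRQ line looks like " 16:   123   456   <chip> <name>" */
        if (end != line && *end == ':' && n == irq) {
            long long total = 0;
            const char *p = end + 1;

            for (;;) {
                while (*p == ' ' || *p == '\t')
                    p++;
                /* The first non-numeric column ends the counters. */
                if (!isdigit((unsigned char)*p))
                    break;
                char *q;
                total += strtoll(p, &q, 10);
                p = q;
            }
            return total;
        }

        /* Advance to the next line, if any. */
        const char *nl = strchr(line, '\n');
        if (!nl)
            break;
        line = nl + 1;
    }
    return -1;
}
```

Reading the file into a buffer twice, a few seconds apart during disk activity, and comparing irq_total(buf, 16) with irq_total(buf, 20) would show which line is really receiving the controller's interrupts.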
Levente Kurusa Sept. 28, 2013, 5:46 p.m. UTC | #12
On 2013-09-28 06:55, Robert Hancock wrote:
> It would appear that something is messing up with the ACPI IRQ routing
> on this machine that's causing us to think the controller is on the
> wrong IRQ. CCing the linux-acpi list to see if anyone has some
> additional debugging suggestions. I suspect that dumping the DSDT is
> likely the first step though. If you can get IASL installed, you can
> do something like:
>
> cat /sys/firmware/acpi/tables/DSDT > dsdt.aml
> iasl -d dsdt.aml
>
> That should spit out a dsdt.dsl file which would hopefully have the
> info needed to figure out what's going on.
>

Here is the disassembled DSDT table:
http://pastebin.com/LWNVht9H
The SATA controller is at line 5206.
I also disassembled the SSDT, but nothing interesting was there:
http://pastebin.com/fus5sxU8

I disabled the usage of ACPI for IRQs with acpi=noirq,
and it successfully booted up setting itself to IRQ#3.
This makes me think that this is the BIOS's fault.
I think it would be possible to create a DMI check
and forcibly set the irq to 20 if the DMI matches.
Any comments on this?
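
To make the DMI idea concrete, here is a small userspace model of such a quirk table. It is only an illustration: in the kernel this kind of match would normally be expressed with a struct dmi_system_id table and dmi_check_system(), whereas the struct irq_quirk type, the quirk_irq() helper, and the "TOSHIBA"/"NB100" strings below are assumptions made up for the sketch (the real DMI strings of the machine were not posted).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical quirk entry: DMI substrings to match and the IRQ to force. */
struct irq_quirk {
    const char *sys_vendor;
    const char *product_name;
    int irq;
};

/* Illustrative only: the DMI strings for the affected netbook are assumed. */
static const struct irq_quirk quirks[] = {
    { "TOSHIBA", "NB100", 20 },
};

/*
 * Return the forced IRQ if the machine matches a quirk entry,
 * otherwise return the IRQ that was probed normally.
 * Matching is by substring, as with the kernel's DMI_MATCH().
 */
int quirk_irq(const char *sys_vendor, const char *product_name, int probed_irq)
{
    assert(sys_vendor != NULL && product_name != NULL);

    for (size_t i = 0; i < sizeof(quirks) / sizeof(quirks[0]); i++) {
        if (strstr(sys_vendor, quirks[i].sys_vendor) &&
            strstr(product_name, quirks[i].product_name))
            return quirks[i].irq;
    }
    return probed_irq;
}
```

The obvious weakness of this design is that every affected model has to be listed by hand, so machines with the same firmware behaviour but different DMI strings would still get the wrong IRQ.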
Robert Hancock Sept. 29, 2013, 1:21 a.m. UTC | #13
On Sat, Sep 28, 2013 at 11:46 AM, Levente Kurusa <levex@linux.com> wrote:
> Here is the disassembled DSDT table:
> http://pastebin.com/LWNVht9H
> The SATA controller is at line 5206.
> I also disassembled the SSDT, but nothing interesting was there:
> http://pastebin.com/fus5sxU8
>
> I disabled the usage of ACPI for IRQs with acpi=noirq,
> and it successfully booted up setting itself to IRQ#3.
> This makes me think that this is the BIOS's fault.
> I think it would be possible to create a DMI check
> and forcibly set the irq to 20 if the DMI matches.
> Any comments on this?

The BIOS may be doing something funky, but since Windows apparently
can figure out it's on IRQ 20, Linux presumably should be able to as
well. DMI checks should be the last resort - Windows almost certainly
doesn't have any machine-specific logic here, and it's hard to tell
what other machine models could be affected. With ACPI stuff, we
generally just need to do the same thing Windows does for things to
work reliably, and DMI checks are more of a hack workaround than a
real fix.

I'll try and have a look at the DSDT within the next few days and see
if I can figure anything out, unless someone beats me to it.
Robert Hancock Oct. 1, 2013, 4:25 a.m. UTC | #14
On Sat, Sep 28, 2013 at 7:21 PM, Robert Hancock <hancockrwd@gmail.com> wrote:
> On Sat, Sep 28, 2013 at 11:46 AM, Levente Kurusa <levex@linux.com> wrote:
>> 2013-09-28 06:55 keltezéssel, Robert Hancock írta:
>>
>>> On Fri, Sep 27, 2013 at 7:24 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>
>>>> 2013-09-25 08:31 keltezéssel, Robert Hancock írta:
>>>>
>>>>> On Sun, Sep 22, 2013 at 1:13 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>>>
>>>>>>
>>>>>> 2013-09-21 19:04 keltezéssel, Robert Hancock írta:
>>>>>>
>>>>>>> On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@linux.com>
>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The following dmesg is stuck in an infinite loop.
>>>>>>>>>>>>>> dmesg:
>>>>>>>>>>>>>> ata3: lost interrupt (Status 0x50)
>>>>>>>>>>>>>> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
>>>>>>>>>>>>>> frozen
>>>>>>>>>>>>>> ata3.00: failed command: READ DMA
>>>>>>>>>>>>>> ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>                     res 40/00:00:00:00:00/00:00:00:00:00/00
>>>>>>>>>>>>>> Emask
>>>>>>>>>>>>>> 0x4
>>>>>>>>>>>>>> (timeout)
>>>>>>>>>>>>>> ata3.00: status: { DRDY }
>>>>>>>>>>>>>> ata3: soft resetting link
>>>>>>>>>>>>>> ata3.00: configured for UDMA/33 (no error)
>>>>>>>>>>>>>> ata3.00: device reported invalid CHS sector 0
>>>>>>>>>>>>>> ata3: EH complete
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Patch that fixes the infinite loop:
>>>>>>>>>>>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>> index f9476fb..eeedf80 100644
>>>>>>>>>>>>>> --- a/drivers/ata/libata-eh.c
>>>>>>>>>>>>>> +++ b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>> @@ -2437,6 +2437,14 @@ static void ata_eh_link_report(struct ata_link *link)
>>>>>>>>>>>>>>                                   ehc->i.action, frozen, tries_buf);
>>>>>>>>>>>>>>                       if (desc)
>>>>>>>>>>>>>>                               ata_dev_err(ehc->i.dev, "%s\n", desc);
>>>>>>>>>>>>>> +               ehc->i.dev->exce_cnt ++;
>>>>>>>>>>>>>> +               ata_dev_warn(ehc->i.dev, "Number of exceptions: %d\n", ehc->i.dev->exce_cnt);
>>>>>>>>>>>>>> +               /**
>>>>>>>>>>>>>> +                  * The device is failing terribly,
>>>>>>>>>>>>>> +                 * disable it to prevent damage.
>>>>>>>>>>>>>> +                 */
>>>>>>>>>>>>>> +               if(ehc->i.dev->exce_cnt > 2)
>>>>>>>>>>>>>> +                       ata_dev_disable(ehc->i.dev);
>>>>>>>>>>>>>>               } else {
>>>>>>>>>>>>>>                       ata_link_err(link, "exception Emask 0x%x "
>>>>>>>>>>>>>>                                    "SAct 0x%x SErr 0x%x action 0x%x%s%s\n",
>>>>>>>>>>>>>> diff --git a/include/linux/libata.h b/include/linux/libata.h
>>>>>>>>>>>>>> index eae7a05..fa52ee6 100644
>>>>>>>>>>>>>> --- a/include/linux/libata.h
>>>>>>>>>>>>>> +++ b/include/linux/libata.h
>>>>>>>>>>>>>> @@ -660,7 +660,8 @@ struct ata_device {
>>>>>>>>>>>>>>               u8                      devslp_timing[ATA_LOG_DEVSLP_SIZE];
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>               /* error history */
>>>>>>>>>>>>>> -       int                     spdn_cnt;
>>>>>>>>>>>>>> +       int                     spdn_cnt; /* Number of speed_downs */
>>>>>>>>>>>>>> +       int                     exce_cnt; /* Number of exceptions that happenned */
>>>>>>>>>>>>>>               /* ering is CLEAR_END, read comment above CLEAR_END */
>>>>>>>>>>>>>>               struct ata_ering        ering;
>>>>>>>>>>>>>>        };
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This doesn't seem like a very good fix. It may prevent the
>>>>>>>>>>>>> apparent infinite loop but will just prevent that device from
>>>>>>>>>>>>> functioning at all. It would be better if we could figure out
>>>>>>>>>>>>> what was actually going wrong.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> I have tested the problem with three different computers, all
>>>>>>>>>>>> switched to legacy/IDE/compatibility mode, and they didn't have
>>>>>>>>>>>> this problem. Of course, they could have been set to AHCI mode,
>>>>>>>>>>>> and there the kernel would boot normally. Feels strange, but so
>>>>>>>>>>>> far I was only able to reproduce the problem with a Toshiba
>>>>>>>>>>>> MK8052GSX. On the topic of my patch, I still don't see why a
>>>>>>>>>>>> device which fails so terribly that it reports 3 exceptions
>>>>>>>>>>>> shouldn't be disabled. Like in this case, it could cause
>>>>>>>>>>>> infinite loops.
>>>>>>>>>>>
>>>>>>>>>>> The problem is that this could happen in some cases when you
>>>>>>>>>>> wouldn't want to disable the device, like an error that just
>>>>>>>>>>> happens sporadically and works on retry, or a device you're
>>>>>>>>>>> trying to recover data from.
>>>>>>>>>>>
>>>>>>>>>> What do you think if I edit the patch in a way, that when an
>>>>>>>>>> operation successfully completes, it resets exce_cnt to zero.
>>>>>>>>>> Might as well add a module_param, which can set the maximum value
>>>>>>>>>> of exce_cnt, while having zero as an option to never disable the
>>>>>>>>>> device. Please don't think me wrong, I don't want to force this
>>>>>>>>>> patch, I just want to learn how all this works, and in the
>>>>>>>>>> process try to make it better. :-)
>>>>>>>>>
>>>>>>>>> That would be better, but I think you're still going to have an issue
>>>>>>>>> with what magic number to pick to avoid disabling devices
>>>>>>>>> inappropriately.
>>>>>>>>>
>>>>>>>>> Conceptually, disabling the device doesn't really make sense anyway.
>>>>>>>>> If someone in userspace wants to keep trying to read from that device,
>>>>>>>>> why would you stop them because of some arbitrary judgement? The
>>>>>>>>> kernel itself isn't "locked up" during this process, anything not
>>>>>>>>> blocked on I/O to that device should be able to continue running, so
>>>>>>>>> that process is only hurting itself. If the system fails to boot from
>>>>>>>>> another device due to this, this would likely point out some kind of
>>>>>>>>> problem in userspace or the distro boot process being overly
>>>>>>>>> serialized.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I have been booting up with the initramfs from ubuntu 13.04,
>>>>>>>> and I have also tried to boot with the ubuntu install cd. They
>>>>>>>> couldn't continue the boot process. I'm gonna spend the weekend
>>>>>>>> trying to figure out where and why the interrupts don't happen.
>>>>>>>> Whether it be a routing or a hardware issue, which I highly doubt
>>>>>>>> due to the fact that Windows XP SP2 was able to boot up without
>>>>>>>> errors.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Are you able to get out full dmesg output from a boot attempt and the
>>>>>>> contents of /proc/interrupts?
>>>>>>>
>>>>>> As I said before, I am not able to get to the shell, without my
>>>>>> 'symptom cure'. With my patch I get the following dmesg output, with
>>>>>> some of my debug messages turned off:
>>>>>> http://pastebin.com/5eb5G3Dx
>>>>>> /proc/interrupts is here:
>>>>>> http://pastebin.com/84CJey2D
>>>>>> After yesterday's research, I have come to ata_piix.c . That file looks
>>>>>> like the real culprit, as my netbook's controller is an Intel ICH7M one,
>>>>>> The values I am getting from the device are very different than those
>>>>>> that are expected.
>>>>>>
>>>>>> Things I have noticed, but ignored in dmesg:
>>>>>> There is a stack dump, because nobody cared about IRQ#20. I have
>>>>>> ignored this because it is the EHCI IRQ, and I suppose it has nothing
>>>>>> to do with ata. The problem is with ata3 or /dev/sdc, while the IRQ
>>>>>> happens with /dev/sda, which works fine.
>>>>>
>>>>>
>>>>>
>>>>> I think it is likely related to the problem. The kernel thinks this
>>>>> controller is on IRQ 16, but apparently something is raising
>>>>> un-acknowledged interrupts on IRQ 20 and nothing is coming in on IRQ
>>>>> 16. It seems quite likely that this is actually the ATA controller.
>>>>>
>>>>> You mentioned that Windows XP was able to work in this mode. I wonder
>>>>> if it was using the IOAPIC, as if not then the IRQ routing is
>>>>> different which might mask the problem. Do you know what IRQ Device
>>>>> Manager reported for this controller in Windows? And was it using any
>>>>> IRQs over 15 (which would indicate the IOAPIC was in use)?
>>>>
>>>>
>>>>
>>>> Hmm, according to WinXP's Device manager for this controller,
>>>> it listens to IRQ# 20, and therefore it is using the I/O APIC.
>>>> Now, one question remains where is the error that mismaps
>>>> controller?
>>>> I have created a simple patch which seems to fix this:
>>>> ---
>>>> @@ -1704,6 +1767,8 @@ static int piix_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>>>>                  hpriv->map = piix_init_sata_map(pdev, port_info,
>>>>                                                  piix_map_db_table[ent->driver_data]);
>>>>
>>>> +       if(pdev->vendor == 0x8086 && pdev->device == 0x27C4)
>>>> +               pdev->irq = 20;
>>>>          rc = ata_pci_bmdma_prepare_host(pdev, ppi, &host);
>>>>          if (rc)
>>>>                  return rc;
>>>>
>>>> However, I am more than sure that this is not the way
>>>> to solve this problem. Do you have any idea on where
>>>> the ideal place would be to implement a fix?
>>>> According to specs of ICH7M, which is essentially the
>>>> same as ICH6M, we need to check on what interrupt pin
>>>> is the SATA controller, and after that check which IRQ line
>>>> is connected to the I/O APIC and decide the IRQ's number
>>>> on those findings.
>>>>
>>>> Specs of ICH7:
>>>>
>>>> http://www.intel.com/content/dam/doc/datasheet/i-o-controller-hub-7-datasheet.pdf
>>>> Device 31 Interrupt Route Register: Chapter 7.1.46
>>>> Device 31 Interrupt Pin Register: Chapter 7.1.41
>>>>
>>>> The SATA controller is always Device 31.
>>>
>>>
>>> It would appear that something is messing up with the ACPI IRQ routing
>>> on this machine that's causing us to think the controller is on the
>>> wrong IRQ. CCing the linux-acpi list to see if anyone has some
>>> additional debugging suggestions. I suspect that dumping the DSDT is
>>> likely the first step though. If you can get IASL installed, you can
>>> do something like:
>>>
>>> cat /sys/firmware/acpi/tables/DSDT > dsdt.aml
>>> iasl -d dsdt.aml
>>>
>>> That should spit out a dsdt.dsl file which would hopefully have the
>>> info needed to figure out what's going on.
>>>
>>
>> Here is the disassembled DSDT table:
>> http://pastebin.com/LWNVht9H
>> The SATA controller is at line 5206.
>> I also disassembled the SSDT, but nothing interesting was there:
>> http://pastebin.com/fus5sxU8
>>
>> I disabled the usage of ACPI for IRQs with acpi=noirq,
>> and it successfully booted up setting itself to IRQ#3.
>> This makes me think that this is the BIOS's fault.
>> I think it would be possible to create a DMI check
>> and forcibly set the irq to 20 if the DMI matches.
>> Any comments on this?
>
> The BIOS may be doing something funky, but since Windows apparently
> can figure out it's on IRQ 20, Linux presumably should be able to as
> well. DMI checks should be the last resort - Windows almost certainly
> doesn't have any machine-specific logic here, and it's hard to tell
> what other machine models could be affected. With ACPI stuff, we
> generally just need to do the same thing Windows does for things to
> work reliably, and DMI checks are more of a hack workaround than a
> real fix.
>
> I'll try and have a look at the DSDT within the next few days and see
> if I can figure anything out, unless someone beats me to it.

I haven't gone into too much detail, but one thing I noticed with the
DSDT is that there appear to be some _OSI checks for Windows 2006
(i.e. Vista) that seem to affect various things, including potentially
the PCI IRQ routing table. It's possible that their IRQ routing table
is broken for legacy mode with an ACPI OS supporting Vista (as current
Linux versions do). Could be this slipped through testing if they only
tested AHCI mode with Vista installed.

You can try booting with the kernel parameters

acpi_osi=! acpi_osi="Windows 2001 SP3"

That should make the BIOS think we are Windows XP and bypass the Vista
code path. If that works, then you might want to check for a BIOS
update on this machine.
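
To check whether an override like this actually changes the routing, one
option is to compare which IRQ line the ata_piix driver is registered on
before and after. The following is only a minimal userspace sketch: the
helper name irq_for_driver is hypothetical, and it assumes the caller has
already read the text of /proc/interrupts into a string.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: scan /proc/interrupts-style text and return the
 * number of the first numbered IRQ line that mentions `driver`, or -1 if
 * no such line exists. Non-numeric rows (header, NMI, LOC, ...) are skipped.
 */
int irq_for_driver(const char *text, const char *driver)
{
    size_t dlen = strlen(driver);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);
        const char *p = line;

        /* Skip the leading padding in front of the IRQ number. */
        while (p < line + len && isspace((unsigned char)*p))
            p++;

        if (p < line + len && isdigit((unsigned char)*p)) {
            char *num_end;
            long irq = strtol(p, &num_end, 10);

            if (*num_end == ':') {
                /* Look for the driver name within this line only. */
                const char *q;
                for (q = num_end; q + dlen <= line + len; q++) {
                    if (memcmp(q, driver, dlen) == 0)
                        return (int)irq;
                }
            }
        }
        line = end ? end + 1 : NULL;
    }
    return -1;
}
```

If this still returns 16 for "ata_piix" after booting with the acpi_osi
override, the override did not change the routing the kernel ended up with.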
Levente Kurusa Oct. 11, 2013, 4:07 p.m. UTC | #15
On 2013-10-01 06:25, Robert Hancock wrote:
> 
> I haven't gone into too much detail, but one thing I noticed with the
> DSDT is that there appear to be some _OSI checks for Windows 2006
> (i.e. Vista) that seem to affect various things, including potentially
> the PCI IRQ routing table. It's possible that their IRQ routing table
> is broken for legacy mode with an ACPI OS supporting Vista (as current
> Linux versions do). Could be this slipped through testing if they only
> tested AHCI mode with Vista installed.
> 
> You can try booting with the kernel parameters
> 
> acpi_osi=! acpi_osi="Windows 2001 SP3"
> 
> That should make the BIOS think we are Windows XP and bypass the Vista
> code path. If that works, then you might want to check for a BIOS
> update on this machine.
> 

First of all, sorry for the late reply; I was rather busy.

I tried what you suggested, but unfortunately the problem persists.
This makes me believe that Windows XP does have some kind of DMI check here.
Of course, a BIOS update may solve this, but I would prefer that Linux
also be able to boot with this broken BIOS.

If you are certain that WinXP doesn't use DMI checks,
it could be that WinXP's driver for the ICH7M SATA controller applies
a quirk and sets the IRQ line to 20.
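
Before reaching for a quirk, it may help to see what the controller itself
reports. The sketch below decodes the Interrupt Line (offset 0x3c) and
Interrupt Pin (offset 0x3d) fields of the standard PCI configuration
header. The struct and function names are hypothetical, and it assumes the
caller has read at least the first 64 bytes of the device's config space,
for example from the sysfs "config" file of the SATA function. Note that
Interrupt Line only holds whatever the firmware wrote there; with the
I/O APIC in use the kernel takes its routing from ACPI, so this value is a
hint for comparison, not the authoritative IRQ.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard PCI type-0 configuration header offsets. */
#define CFG_VENDOR_ID      0x00
#define CFG_DEVICE_ID      0x02
#define CFG_INTERRUPT_LINE 0x3c
#define CFG_INTERRUPT_PIN  0x3d
#define CFG_HEADER_SIZE    0x40

/* Hypothetical container for the fields of interest. */
struct pci_irq_info {
    uint16_t vendor; /* e.g. 0x8086 for Intel */
    uint16_t device; /* e.g. 0x27c4, the ID used in the patch above */
    uint8_t  line;   /* Interrupt Line as written by firmware */
    uint8_t  pin;    /* 0 = none, 1..4 = INTA#..INTD# */
};

/* Decode the fields from a raw config-space buffer; returns 0 on success. */
int pci_irq_info_parse(const uint8_t *cfg, size_t len,
                       struct pci_irq_info *out)
{
    if (cfg == NULL || out == NULL || len < CFG_HEADER_SIZE)
        return -1;

    /* Multi-byte config-space fields are little-endian. */
    out->vendor = (uint16_t)(cfg[CFG_VENDOR_ID] |
                             (cfg[CFG_VENDOR_ID + 1] << 8));
    out->device = (uint16_t)(cfg[CFG_DEVICE_ID] |
                             (cfg[CFG_DEVICE_ID + 1] << 8));
    out->line = cfg[CFG_INTERRUPT_LINE];
    out->pin = cfg[CFG_INTERRUPT_PIN];
    return 0;
}

/* Map the Interrupt Pin value to its conventional name. */
const char *pci_pin_name(uint8_t pin)
{
    static const char *const names[] = {
        "none", "INTA#", "INTB#", "INTC#", "INTD#"
    };
    return pin <= 4 ? names[pin] : "reserved";
}
```

Comparing the decoded pin against the routing entry for Device 31 in the
disassembled DSDT would show whether the table or the hardware setup is
the odd one out.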
Robert Hancock Oct. 12, 2013, 2:06 a.m. UTC | #16
On Fri, Oct 11, 2013 at 10:07 AM, Levente Kurusa <levex@linux.com> wrote:
> On 2013-10-01 06:25, Robert Hancock wrote:
>>>>>> Hmm, according to WinXP's Device manager for this controller,
>>>>>> it listens to IRQ# 20, and therefore it is using the I/O APIC.
>>>>>> Now, one question remains where is the error that mismaps
>>>>>> controller?
>>>>>> I have created a simple patch which seems to fix this:
>>>>>> ---
>>>>>> @@ -1704,6 +1767,8 @@ static int piix_init_one(struct pci_dev *pdev,
>>>>>> const
>>>>>> struct pci_device_id *ent)
>>>>>>                  hpriv->map = piix_init_sata_map(pdev, port_info,
>>>>>>
>>>>>> piix_map_db_table[ent->driver_data]);
>>>>>>
>>>>>> +       if(pdev->vendor == 0x8086 && pdev->device == 0x27C4)
>>>>>> +               pdev->irq = 20;
>>>>>>          rc = ata_pci_bmdma_prepare_host(pdev, ppi, &host);
>>>>>>          if (rc)
>>>>>>                  return rc;
>>>>>>
>>>>>> However, I am more than sure that this is not the way
>>>>>> to solve this problem. Do you have any idea on where
>>>>>> the ideal place would be to implement a fix?
>>>>>> According to specs of ICH7M, which is essentially the
>>>>>> same as ICH6M, we need to check on what interrupt pin
>>>>>> is the SATA controller, and after that check which IRQ line
>>>>>> is connected to the I/O APIC and decide the IRQ's number
>>>>>> on those findings.
>>>>>>
>>>>>> Specs of ICH7:
>>>>>>
>>>>>> http://www.intel.com/content/dam/doc/datasheet/i-o-controller-hub-7-datasheet.pdf
>>>>>> Device 31 Interrupt Route Register: Chapter 7.1.46
>>>>>> Device 31 Interrupt Pin Register: Chapter 7.1.41
>>>>>>
>>>>>> The SATA controller is always Device 31.
>>>>>
>>>>>
>>>>> It would appear that something is messing up with the ACPI IRQ routing
>>>>> on this machine that's causing us to think the controller is on the
>>>>> wrong IRQ. CCing the linux-acpi list to see if anyone has some
>>>>> additional debugging suggestions. I suspect that dumping the DSDT is
>>>>> likely the first step though. If you can get IASL installed, you can
>>>>> do something like:
>>>>>
>>>>> cat /sys/firmware/acpi/tables/DSDT > dsdt.aml
>>>>> iasl -d dsdt.aml
>>>>>
>>>>> That should spit out a dsdt.dsl file which would hopefully have the
>>>>> info needed to figure out what's going on.
>>>>>
>>>>
>>>> Here is the disassembled DSDT table:
>>>> http://pastebin.com/LWNVht9H
>>>> The SATA controller is at line 5206.
>>>> I also disassembled the SSDT, but nothing interesting was there:
>>>> http://pastebin.com/fus5sxU8
>>>>
>>>> I disabled the usage of ACPI for IRQs with acpi=noirq,
>>>> and it successfully booted up setting itself to IRQ#3.
>>>> This makes me think that this is the BIOS's fault.
>>>> I think it would be possible to create a DMI check
>>>> and forcibly set the irq to 20 if the DMI matches.
>>>> Any comments on this?
>>>
>>> The BIOS may be doing something funky, but since Windows apparently
>>> can figure out it's on IRQ 20, Linux presumably should be able to as
>>> well. DMI checks should be the last resort - Windows almost certainly
>>> doesn't have any machine-specific logic here, and it's hard to tell
>>> what other machine models could be affected. With ACPI stuff, we
>>> generally just need to do the same thing Windows does for things to
>>> work reliably, and DMI checks are more of a hack workaround than a
>>> real fix.
>>>
>>> I'll try and have a look at the DSDT within the next few days and see
>>> if I can figure anything out, unless someone beats me to it.
>>
>> I haven't gone into too much detail, but one thing I noticed with the
>> DSDT is that there appear to be some _OSI checks for Windows 2006
>> (i.e. Vista) that seem to affect various things, including potentially
>> the PCI IRQ routing table. It's possible that their IRQ routing table
>> is broken for legacy mode with an ACPI OS supporting Vista (as current
>> Linux versions do). Could be this slipped through testing if they only
>> tested AHCI mode with Vista installed.
>>
>> You can try booting with the kernel parameters
>>
>> acpi_osi=! acpi_osi="Windows 2001 SP3"
>>
>> That should make the BIOS think we are Windows XP and bypass the Vista
>> code path. If that works, then you might want to check for a BIOS
>> update on this machine.
>>
>
> First of all, sorry for the late reply. I was kinda busy.
>
> I tried what you suggested but unfortunately the problem persists.
> This makes me believe that Windows XP does have some kind of DMI check here.
> Of course, while a BIOS update may solve this, I would prefer that Linux
> should also be able to boot up with this broken BIOS as well.
>
> If you are certain that WinXP doesn't use DMI checks,
> it could be that WinXP's driver of ICH7M's SATA controller applies
> a quirk and sets that irq line to #20.

Can you post the dmesg output from a bootup attempt with those options?

You may also want to try adding just: acpi_osi=!
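
When testing these options, it can help to confirm which acpi_osi strings
the running kernel actually received, since a bootloader may strip the
quotes around a value that contains spaces. A minimal shell sketch, assuming
GNU grep; the helper name osi_params is made up for illustration, and the
file argument defaults to /proc/cmdline:

```shell
# Hypothetical helper (not part of any kernel tooling): list every acpi_osi=
# option found in a kernel command line file, keeping quoted values intact.
osi_params() {
    # $1: file holding the command line; defaults to the running kernel's.
    grep -oE 'acpi_osi=("[^"]*"|[^ ]*)' "${1:-/proc/cmdline}"
}
```

If the "Windows 2001 SP3" value is printed cut off at the first space, the
string probably did not reach the kernel as intended.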
Levente Kurusa Oct. 12, 2013, 9:29 a.m. UTC | #17
On 2013-10-12 04:06, Robert Hancock wrote:
> On Fri, Oct 11, 2013 at 10:07 AM, Levente Kurusa <levex@linux.com> wrote:
>> On 2013-10-01 06:25, Robert Hancock wrote:
>>> On Sat, Sep 28, 2013 at 7:21 PM, Robert Hancock <hancockrwd@gmail.com> wrote:
>>>> On Sat, Sep 28, 2013 at 11:46 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>> On 2013-09-28 06:55, Robert Hancock wrote:
>>>>>
>>>>>> On Fri, Sep 27, 2013 at 7:24 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>>>>
>>>>>>> On 2013-09-25 08:31, Robert Hancock wrote:
>>>>>>>
>>>>>>>> On Sun, Sep 22, 2013 at 1:13 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2013-09-21 19:04, Robert Hancock wrote:
>>>>>>>>>
>>>>>>>>>> On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@linux.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The following dmesg is stuck in an infinite loop.
>>>>>>>>>>>>>>>>> dmesg:
>>>>>>>>>>>>>>>>> ata3: lost interrupt (Status 0x50)
>>>>>>>>>>>>>>>>> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
>>>>>>>>>>>>>>>>> frozen
>>>>>>>>>>>>>>>>> ata3.00: failed command: READ DMA
>>>>>>>>>>>>>>>>> ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>                     res 40/00:00:00:00:00/00:00:00:00:00/00
>>>>>>>>>>>>>>>>> Emask
>>>>>>>>>>>>>>>>> 0x4
>>>>>>>>>>>>>>>>> (timeout)
>>>>>>>>>>>>>>>>> ata3.00: status: { DRDY }
>>>>>>>>>>>>>>>>> ata3: soft resetting link
>>>>>>>>>>>>>>>>> ata3.00: configured for UDMA/33 (no error)
>>>>>>>>>>>>>>>>> ata3.00: device reported invalid CHS sector 0
>>>>>>>>>>>>>>>>> ata3: EH complete
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Patch that fixes the infinite loop:
>>>>>>>>>>>>>>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>> index f9476fb..eeedf80 100644
>>>>>>>>>>>>>>>>> --- a/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>> +++ b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>> @@ -2437,6 +2437,14 @@ static void ata_eh_link_report(struct
>>>>>>>>>>>>>>>>> ata_link
>>>>>>>>>>>>>>>>> *link)
>>>>>>>>>>>>>>>>>                                   ehc->i.action, frozen,
>>>>>>>>>>>>>>>>> tries_buf);
>>>>>>>>>>>>>>>>>                       if (desc)
>>>>>>>>>>>>>>>>>                               ata_dev_err(ehc->i.dev, "%s\n",
>>>>>>>>>>>>>>>>> desc);
>>>>>>>>>>>>>>>>> +               ehc->i.dev->exce_cnt ++;
>>>>>>>>>>>>>>>>> +               ata_dev_warn(ehc->i.dev, "Number of exceptions:
>>>>>>>>>>>>>>>>> %d\n",
>>>>>>>>>>>>>>>>> ehc->i.dev->exce_cnt);
>>>>>>>>>>>>>>>>> +               /**
>>>>>>>>>>>>>>>>> +                  * The device is failing terribly,
>>>>>>>>>>>>>>>>> +                 * disable it to prevent damage.
>>>>>>>>>>>>>>>>> +                 */
>>>>>>>>>>>>>>>>> +               if(ehc->i.dev->exce_cnt > 2)
>>>>>>>>>>>>>>>>> +                       ata_dev_disable(ehc->i.dev);
>>>>>>>>>>>>>>>>>               } else {
>>>>>>>>>>>>>>>>>                       ata_link_err(link, "exception Emask 0x%x
>>>>>>>>>>>>>>>>> "
>>>>>>>>>>>>>>>>>                                    "SAct 0x%x SErr 0x%x action
>>>>>>>>>>>>>>>>> 0x%x%s%s\n",
>>>>>>>>>>>>>>>>> diff --git a/include/linux/libata.h b/include/linux/libata.h
>>>>>>>>>>>>>>>>> index eae7a05..fa52ee6 100644
>>>>>>>>>>>>>>>>> --- a/include/linux/libata.h
>>>>>>>>>>>>>>>>> +++ b/include/linux/libata.h
>>>>>>>>>>>>>>>>> @@ -660,7 +660,8 @@ struct ata_device {
>>>>>>>>>>>>>>>>>               u8
>>>>>>>>>>>>>>>>> devslp_timing[ATA_LOG_DEVSLP_SIZE];
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>               /* error history */
>>>>>>>>>>>>>>>>> -       int                     spdn_cnt;
>>>>>>>>>>>>>>>>> +       int                     spdn_cnt; /* Number of
>>>>>>>>>>>>>>>>> speed_downs
>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>> +       int                     exce_cnt; /* Number of
>>>>>>>>>>>>>>>>> exceptions
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>> happenned */
>>>>>>>>>>>>>>>>>               /* ering is CLEAR_END, read comment above
>>>>>>>>>>>>>>>>> CLEAR_END
>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>               struct ata_ering        ering;
>>>>>>>>>>>>>>>>>        };
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This doesn't seem like a very good fix. It may prevent the
>>>>>>>>>>>>>>>> apparent
>>>>>>>>>>>>>>>> infinite loop but will just prevent that device from functioning
>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>> all.
>>>>>>>>>>>>>>>> It would be better if we could figure out what was actually
>>>>>>>>>>>>>>>> going
>>>>>>>>>>>>>>>> wrong.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have tested the problem with three different computers, all
>>>>>>>>>>>>>>> switched
>>>>>>>>>>>>>>> to legacy/IDE/compatibility mode, and they didn't have this
>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>> Of
>>>>>>>>>>>>>>> course, they could have been set to AHCI mode, and there the
>>>>>>>>>>>>>>> kernel
>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>> boot normally. Feels strange, but so far I was only able to
>>>>>>>>>>>>>>> reproduce
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> problem with a Toshiba MK8052GSX. On the topic of my patch, I
>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>> see why a device which fails so terribly that it reports 3
>>>>>>>>>>>>>>> exceptions
>>>>>>>>>>>>>>> shouldn't be disabled. Like in this case, it could cause infinite
>>>>>>>>>>>>>>> loops.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The problem is that this could happen in some cases when you
>>>>>>>>>>>>>> wouldn't
>>>>>>>>>>>>>> want to disable the device, like an error that just happens
>>>>>>>>>>>>>> sporadically and works on retry, or a device you're trying to
>>>>>>>>>>>>>> recover
>>>>>>>>>>>>>> data from.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> What do you think if I edit the patch in a way, that when an
>>>>>>>>>>>>> operation
>>>>>>>>>>>>> successfully completes, it resets exce_cnt to zero. Might as well
>>>>>>>>>>>>> add
>>>>>>>>>>>>> a
>>>>>>>>>>>>> module_param, which can set the maximum value of exce_cnt, while
>>>>>>>>>>>>> having
>>>>>>>>>>>>> zero
>>>>>>>>>>>>> as an option to never disable the device. Please don't think me
>>>>>>>>>>>>> wrong,
>>>>>>>>>>>>> I
>>>>>>>>>>>>> don't want to force this patch, I just want to learn how all this
>>>>>>>>>>>>> works,
>>>>>>>>>>>>> and
>>>>>>>>>>>>> in the process try to make it better. :-)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> That would be better, but I think you're still going to have an
>>>>>>>>>>>> issue
>>>>>>>>>>>> with what magic number to pick to avoid disabling devices
>>>>>>>>>>>> inappropriately.
>>>>>>>>>>>>
>>>>>>>>>>>> Conceptually, disabling the device doesn't really make sense anyway.
>>>>>>>>>>>> If someone in userspace wants to keep trying to read from that
>>>>>>>>>>>> device,
>>>>>>>>>>>> why would you stop them because of some arbitrary judgement? The
>>>>>>>>>>>> kernel itself isn't "locked up" during this process, anything not
>>>>>>>>>>>> blocked on I/O to that device should be able to continue running, so
>>>>>>>>>>>> that process is only hurting itself. If the system fails to boot
>>>>>>>>>>>> from
>>>>>>>>>>>> another device due to this, this would likely point out some kind of
>>>>>>>>>>>> problem in userspace or the distro boot process being overly
>>>>>>>>>>>> serialized.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I have been booting up with the initramfs from ubuntu 13.04,
>>>>>>>>>>> and I have also tried to boot with the ubuntu install cd. They
>>>>>>>>>>> couldn't
>>>>>>>>>>> continue the boot process. I'm gonna spend the weekend trying to
>>>>>>>>>>> figure
>>>>>>>>>>> out where and why the interrupts don't happen. Whether it be a
>>>>>>>>>>> routing
>>>>>>>>>>> or a hardware issue, which I highly doubt due to the fact that
>>>>>>>>>>> Windows
>>>>>>>>>>> XP SP2 was able to boot up without errors.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Are you able to get out full dmesg output from a boot attempt and the
>>>>>>>>>> contents of /proc/interrupts?
>>>>>>>>>>
>>>>>>>>> As I said before, I am not able to get to the shell, without my
>>>>>>>>> 'symptom
>>>>>>>>> cure'. With my patch I get the following dmesg output, with
>>>>>>>>> some of my debug messages turned off:
>>>>>>>>> http://pastebin.com/5eb5G3Dx
>>>>>>>>> /proc/interrupts is here:
>>>>>>>>> http://pastebin.com/84CJey2D
>>>>>>>>> After yesterday's research, I have come to ata_piix.c . That file looks
>>>>>>>>> like
>>>>>>>>> the real culprit, as my netbook's controller is an Intel ICH7M one,
>>>>>>>>> The values I am getting from the device are very different than those
>>>>>>>>> that are expected.
>>>>>>>>>
>>>>>>>>> Things I have noticed, but ignored in dmesg:
>>>>>>>>> There is a stack dump, because nobody cared about IRQ#20. I have
>>>>>>>>> ignored
>>>>>>>>> this because it is the EHCI IRQ, and I suppose it has nothing to do
>>>>>>>>> with
>>>>>>>>> ata. The problem is with ata3 or /dev/sdc, while the IRQ happens
>>>>>>>>> with /dev/sda, which works fine.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I think it is likely related to the problem. The kernel thinks this
>>>>>>>> controller is on IRQ 16, but apparently something is raising
>>>>>>>> un-acknowledged interrupts on IRQ 20 and nothing is coming in on IRQ
>>>>>>>> 16. It seems quite likely that this is actually the ATA controller.
>>>>>>>>
>>>>>>>> You mentioned that Windows XP was able to work in this mode. I wonder
>>>>>>>> if it was using the IOAPIC, as if not then the IRQ routing is
>>>>>>>> different which might mask the problem. Do you know what IRQ Device
>>>>>>>> Manager reported for this controller in Windows? And was it using any
>>>>>>>> IRQs over 15 (which would indicate the IOAPIC was in use)?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hmm, according to WinXP's Device manager for this controller,
>>>>>>> it listens to IRQ# 20, and therefore it is using the I/O APIC.
>>>>>>> Now, one question remains where is the error that mismaps
>>>>>>> controller?
>>>>>>> I have created a simple patch which seems to fix this:
>>>>>>> ---
>>>>>>> @@ -1704,6 +1767,8 @@ static int piix_init_one(struct pci_dev *pdev,
>>>>>>> const
>>>>>>> struct pci_device_id *ent)
>>>>>>>                  hpriv->map = piix_init_sata_map(pdev, port_info,
>>>>>>>
>>>>>>> piix_map_db_table[ent->driver_data]);
>>>>>>>
>>>>>>> +       if(pdev->vendor == 0x8086 && pdev->device == 0x27C4)
>>>>>>> +               pdev->irq = 20;
>>>>>>>          rc = ata_pci_bmdma_prepare_host(pdev, ppi, &host);
>>>>>>>          if (rc)
>>>>>>>                  return rc;
>>>>>>>
>>>>>>> However, I am more than sure that this is not the way
>>>>>>> to solve this problem. Do you have any idea on where
>>>>>>> the ideal place would be to implement a fix?
>>>>>>> According to specs of ICH7M, which is essentially the
>>>>>>> same as ICH6M, we need to check on what interrupt pin
>>>>>>> is the SATA controller, and after that check which IRQ line
>>>>>>> is connected to the I/O APIC and decide the IRQ's number
>>>>>>> on those findings.
>>>>>>>
>>>>>>> Specs of ICH7:
>>>>>>>
>>>>>>> http://www.intel.com/content/dam/doc/datasheet/i-o-controller-hub-7-datasheet.pdf
>>>>>>> Device 31 Interrupt Route Register: Chapter 7.1.46
>>>>>>> Device 31 Interrupt Pin Register: Chapter 7.1.41
>>>>>>>
>>>>>>> The SATA controller is always Device 31.
>>>>>>
>>>>>>
>>>>>> It would appear that something is messing up with the ACPI IRQ routing
>>>>>> on this machine that's causing us to think the controller is on the
>>>>>> wrong IRQ. CCing the linux-acpi list to see if anyone has some
>>>>>> additional debugging suggestions. I suspect that dumping the DSDT is
>>>>>> likely the first step though. If you can get IASL installed, you can
>>>>>> do something like:
>>>>>>
>>>>>> cat /sys/firmware/acpi/tables/DSDT > dsdt.aml
>>>>>> iasl -d dsdt.aml
>>>>>>
>>>>>> That should spit out a dsdt.dsl file which would hopefully have the
>>>>>> info needed to figure out what's going on.
>>>>>>
>>>>>
>>>>> Here is the disassembled DSDT table:
>>>>> http://pastebin.com/LWNVht9H
>>>>> The SATA controller is at line 5206.
>>>>> I also disassembled the SSDT, but nothing interesting was there:
>>>>> http://pastebin.com/fus5sxU8
>>>>>
>>>>> I disabled the usage of ACPI for IRQs with acpi=noirq,
>>>>> and it successfully booted up setting itself to IRQ#3.
>>>>> This makes me think that this is the BIOS's fault.
>>>>> I think it would be possible to create a DMI check
>>>>> and forcibly set the irq to 20 if the DMI matches.
>>>>> Any comments on this?
>>>>
>>>> The BIOS may be doing something funky, but since Windows apparently
>>>> can figure out it's on IRQ 20, Linux presumably should be able to as
>>>> well. DMI checks should be the last resort - Windows almost certainly
>>>> doesn't have any machine-specific logic here, and it's hard to tell
>>>> what other machine models could be affected. With ACPI stuff, we
>>>> generally just need to do the same thing Windows does for things to
>>>> work reliably, and DMI checks are more of a hack workaround than a
>>>> real fix.
>>>>
>>>> I'll try and have a look at the DSDT within the next few days and see
>>>> if I can figure anything out, unless someone beats me to it.
>>>
>>> I haven't gone into too much detail, but one thing I noticed with the
>>> DSDT is that there appear to be some _OSI checks for Windows 2006
>>> (i.e. Vista) that seem to affect various things, including potentially
>>> the PCI IRQ routing table. It's possible that their IRQ routing table
>>> is broken for legacy mode with an ACPI OS supporting Vista (as current
>>> Linux versions do). Could be this slipped through testing if they only
>>> tested AHCI mode with Vista installed.
>>>
>>> You can try booting with the kernel parameters
>>>
>>> acpi_osi=! acpi_osi="Windows 2001 SP3"
>>>
>>> That should make the BIOS think we are Windows XP and bypass the Vista
>>> code path. If that works, then you might want to check for a BIOS
>>> update on this machine.
>>>
>>
>> First of all, sorry for the late reply. I was kinda busy.
>>
>> I tried what you suggested but unfortunately the problem persists.
>> This makes me believe that Windows XP does have some kind of DMI check here.
>> Of course, while a BIOS update may solve this, I would prefer that Linux
>> should also be able to boot up with this broken BIOS as well.
>>
>> If you are certain that WinXP doesn't use DMI checks,
>> it could be that WinXP's driver of ICH7M's SATA controller applies
>> a quirk and sets that irq line to #20.
> 
> Can you post the dmesg output from a bootup attempt with those options?
> 
> You may also want to try adding just: acpi_osi=!
> 

None of the three combinations booted successfully.

Here are the dmesg logs for each:

Params: acpi_osi="Windows 2001 SP3"
http://pastebin.com/vF3BSuhc

Params: acpi_osi=! acpi_osi="Windows 2001 SP3"
http://pastebin.com/BuUzc3es

Params: acpi_osi=!
http://pastebin.com/u7uRx8Ru
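
To compare logs like these quickly, one could pull out the IRQ numbers the
kernel reported as unhandled. A rough sketch, assuming each log has been
saved to a local text file and relying on the kernel's usual
"irq N: nobody cared" wording; the function name is hypothetical:

```shell
# Hypothetical helper: print each IRQ number that a saved dmesg log
# reports as unhandled ("irq N: nobody cared"), one per line, deduplicated.
unhandled_irqs() {
    # $1: path to a text file containing dmesg output
    sed -n 's/.*irq \([0-9][0-9]*\): nobody cared.*/\1/p' "$1" | sort -un
}
```

Running it on each of the three logs would show whether IRQ 20 is still
being flagged under every acpi_osi setting.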
Robert Hancock Oct. 13, 2013, 5:57 a.m. UTC | #18
On Sat, Oct 12, 2013 at 3:29 AM, Levente Kurusa <levex@linux.com> wrote:
> On 2013-10-12 04:06, Robert Hancock wrote:
>> On Fri, Oct 11, 2013 at 10:07 AM, Levente Kurusa <levex@linux.com> wrote:
>>> On 2013-10-01 06:25, Robert Hancock wrote:
>>>> On Sat, Sep 28, 2013 at 7:21 PM, Robert Hancock <hancockrwd@gmail.com> wrote:
>>>>> On Sat, Sep 28, 2013 at 11:46 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>>> On 2013-09-28 06:55, Robert Hancock wrote:
>>>>>>
>>>>>>> On Fri, Sep 27, 2013 at 7:24 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>>>>>
>>>>>>>> On 2013-09-25 08:31, Robert Hancock wrote:
>>>>>>>>
>>>>>>>>> On Sun, Sep 22, 2013 at 1:13 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2013-09-21 19:04, Robert Hancock wrote:
>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@linux.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The following dmesg is stuck in an infinite loop.
>>>>>>>>>>>>>>>>>> dmesg:
>>>>>>>>>>>>>>>>>> ata3: lost interrupt (Status 0x50)
>>>>>>>>>>>>>>>>>> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
>>>>>>>>>>>>>>>>>> frozen
>>>>>>>>>>>>>>>>>> ata3.00: failed command: READ DMA
>>>>>>>>>>>>>>>>>> ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096
>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>                     res 40/00:00:00:00:00/00:00:00:00:00/00
>>>>>>>>>>>>>>>>>> Emask
>>>>>>>>>>>>>>>>>> 0x4
>>>>>>>>>>>>>>>>>> (timeout)
>>>>>>>>>>>>>>>>>> ata3.00: status: { DRDY }
>>>>>>>>>>>>>>>>>> ata3: soft resetting link
>>>>>>>>>>>>>>>>>> ata3.00: configured for UDMA/33 (no error)
>>>>>>>>>>>>>>>>>> ata3.00: device reported invalid CHS sector 0
>>>>>>>>>>>>>>>>>> ata3: EH complete
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Patch that fixes the infinite loop:
>>>>>>>>>>>>>>>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>> index f9476fb..eeedf80 100644
>>>>>>>>>>>>>>>>>> --- a/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>> +++ b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>> @@ -2437,6 +2437,14 @@ static void ata_eh_link_report(struct
>>>>>>>>>>>>>>>>>> ata_link
>>>>>>>>>>>>>>>>>> *link)
>>>>>>>>>>>>>>>>>>                                   ehc->i.action, frozen,
>>>>>>>>>>>>>>>>>> tries_buf);
>>>>>>>>>>>>>>>>>>                       if (desc)
>>>>>>>>>>>>>>>>>>                               ata_dev_err(ehc->i.dev, "%s\n",
>>>>>>>>>>>>>>>>>> desc);
>>>>>>>>>>>>>>>>>> +               ehc->i.dev->exce_cnt ++;
>>>>>>>>>>>>>>>>>> +               ata_dev_warn(ehc->i.dev, "Number of exceptions:
>>>>>>>>>>>>>>>>>> %d\n",
>>>>>>>>>>>>>>>>>> ehc->i.dev->exce_cnt);
>>>>>>>>>>>>>>>>>> +               /**
>>>>>>>>>>>>>>>>>> +                  * The device is failing terribly,
>>>>>>>>>>>>>>>>>> +                 * disable it to prevent damage.
>>>>>>>>>>>>>>>>>> +                 */
>>>>>>>>>>>>>>>>>> +               if(ehc->i.dev->exce_cnt > 2)
>>>>>>>>>>>>>>>>>> +                       ata_dev_disable(ehc->i.dev);
>>>>>>>>>>>>>>>>>>               } else {
>>>>>>>>>>>>>>>>>>                       ata_link_err(link, "exception Emask 0x%x
>>>>>>>>>>>>>>>>>> "
>>>>>>>>>>>>>>>>>>                                    "SAct 0x%x SErr 0x%x action
>>>>>>>>>>>>>>>>>> 0x%x%s%s\n",
>>>>>>>>>>>>>>>>>> diff --git a/include/linux/libata.h b/include/linux/libata.h
>>>>>>>>>>>>>>>>>> index eae7a05..fa52ee6 100644
>>>>>>>>>>>>>>>>>> --- a/include/linux/libata.h
>>>>>>>>>>>>>>>>>> +++ b/include/linux/libata.h
>>>>>>>>>>>>>>>>>> @@ -660,7 +660,8 @@ struct ata_device {
>>>>>>>>>>>>>>>>>>               u8
>>>>>>>>>>>>>>>>>> devslp_timing[ATA_LOG_DEVSLP_SIZE];
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>               /* error history */
>>>>>>>>>>>>>>>>>> -       int                     spdn_cnt;
>>>>>>>>>>>>>>>>>> +       int                     spdn_cnt; /* Number of
>>>>>>>>>>>>>>>>>> speed_downs
>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>> +       int                     exce_cnt; /* Number of
>>>>>>>>>>>>>>>>>> exceptions
>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>> happenned */
>>>>>>>>>>>>>>>>>>               /* ering is CLEAR_END, read comment above
>>>>>>>>>>>>>>>>>> CLEAR_END
>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>               struct ata_ering        ering;
>>>>>>>>>>>>>>>>>>        };
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This doesn't seem like a very good fix. It may prevent the
>>>>>>>>>>>>>>>>> apparent
>>>>>>>>>>>>>>>>> infinite loop but will just prevent that device from functioning
>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>> all.
>>>>>>>>>>>>>>>>> It would be better if we could figure out what was actually
>>>>>>>>>>>>>>>>> going
>>>>>>>>>>>>>>>>> wrong.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have tested the problem with three different computers, all
>>>>>>>>>>>>>>>> switched
>>>>>>>>>>>>>>>> to legacy/IDE/compatibility mode, and they didn't have this
>>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>> Of
>>>>>>>>>>>>>>>> course, they could have been set to AHCI mode, and there the
>>>>>>>>>>>>>>>> kernel
>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>> boot normally. Feels strange, but so far I was only able to
>>>>>>>>>>>>>>>> reproduce
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> problem with a Toshiba MK8052GSX. On the topic of my patch, I
>>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>>> see why a device which fails so terribly that it reports 3
>>>>>>>>>>>>>>>> exceptions
>>>>>>>>>>>>>>>> shouldn't be disabled. Like in this case, it could cause infinite
>>>>>>>>>>>>>>>> loops.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The problem is that this could happen in some cases when you
>>>>>>>>>>>>>>> wouldn't
>>>>>>>>>>>>>>> want to disable the device, like an error that just happens
>>>>>>>>>>>>>>> sporadically and works on retry, or a device you're trying to
>>>>>>>>>>>>>>> recover
>>>>>>>>>>>>>>> data from.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What do you think if I edit the patch in a way, that when an
>>>>>>>>>>>>>> operation
>>>>>>>>>>>>>> successfully completes, it resets exce_cnt to zero. Might as well
>>>>>>>>>>>>>> add
>>>>>>>>>>>>>> a
>>>>>>>>>>>>>> module_param, which can set the maximum value of exce_cnt, while
>>>>>>>>>>>>>> having
>>>>>>>>>>>>>> zero
>>>>>>>>>>>>>> as an option to never disable the device. Please don't think me
>>>>>>>>>>>>>> wrong,
>>>>>>>>>>>>>> I
>>>>>>>>>>>>>> don't want to force this patch, I just want to learn how all this
>>>>>>>>>>>>>> works,
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> in the process try to make it better. :-)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> That would be better, but I think you're still going to have an
>>>>>>>>>>>>> issue
>>>>>>>>>>>>> with what magic number to pick to avoid disabling devices
>>>>>>>>>>>>> inappropriately.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Conceptually, disabling the device doesn't really make sense anyway.
>>>>>>>>>>>>> If someone in userspace wants to keep trying to read from that
>>>>>>>>>>>>> device,
>>>>>>>>>>>>> why would you stop them because of some arbitrary judgement? The
>>>>>>>>>>>>> kernel itself isn't "locked up" during this process, anything not
>>>>>>>>>>>>> blocked on I/O to that device should be able to continue running, so
>>>>>>>>>>>>> that process is only hurting itself. If the system fails to boot
>>>>>>>>>>>>> from
>>>>>>>>>>>>> another device due to this, this would likely point out some kind of
>>>>>>>>>>>>> problem in userspace or the distro boot process being overly
>>>>>>>>>>>>> serialized.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I have been booting up with the initramfs from ubuntu 13.04,
>>>>>>>>>>>> and I have also tried to boot with the ubuntu install cd. They
>>>>>>>>>>>> couldn't
>>>>>>>>>>>> continue the boot process. I'm gonna spend the weekend trying to
>>>>>>>>>>>> figure
>>>>>>>>>>>> out where and why the interrupts don't happen. Whether it be a
>>>>>>>>>>>> routing
>>>>>>>>>>>> or a hardware issue, which I highly doubt due to the fact that
>>>>>>>>>>>> Windows
>>>>>>>>>>>> XP SP2 was able to boot up without errors.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Are you able to get out full dmesg output from a boot attempt and the
>>>>>>>>>>> contents of /proc/interrupts?
>>>>>>>>>>>
>>>>>>>>>> As I said before, I am not able to get to the shell, without my
>>>>>>>>>> 'symptom
>>>>>>>>>> cure'. With my patch I get the following dmesg output, with
>>>>>>>>>> some of my debug messages turned off:
>>>>>>>>>> http://pastebin.com/5eb5G3Dx
>>>>>>>>>> /proc/interrupts is here:
>>>>>>>>>> http://pastebin.com/84CJey2D
>>>>>>>>>> After yesterday's research, I have come to ata_piix.c . That file looks
>>>>>>>>>> like
>>>>>>>>>> the real culprit, as my netbook's controller is an Intel ICH7M one,
>>>>>>>>>> The values I am getting from the device are very different than those
>>>>>>>>>> that are expected.
>>>>>>>>>>
>>>>>>>>>> Things I have noticed, but ignored in dmesg:
>>>>>>>>>> There is a stack dump, because nobody cared about IRQ#20. I have
>>>>>>>>>> ignored
>>>>>>>>>> this because it is the EHCI IRQ, and I suppose it has nothing to do
>>>>>>>>>> with
>>>>>>>>>> ata. The problem is with ata3 or /dev/sdc, while the IRQ happens
>>>>>>>>>> with /dev/sda, which works fine.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think it is likely related to the problem. The kernel thinks this
>>>>>>>>> controller is on IRQ 16, but apparently something is raising
>>>>>>>>> un-acknowledged interrupts on IRQ 20 and nothing is coming in on IRQ
>>>>>>>>> 16. It seems quite likely that this is actually the ATA controller.
>>>>>>>>>
>>>>>>>>> You mentioned that Windows XP was able to work in this mode. I wonder
>>>>>>>>> if it was using the IOAPIC, as if not then the IRQ routing is
>>>>>>>>> different which might mask the problem. Do you know what IRQ Device
>>>>>>>>> Manager reported for this controller in Windows? And was it using any
>>>>>>>>> IRQs over 15 (which would indicate the IOAPIC was in use)?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hmm, according to WinXP's Device manager for this controller,
>>>>>>>> it listens to IRQ# 20, and therefore it is using the I/O APIC.
>>>>>>>> Now, one question remains where is the error that mismaps
>>>>>>>> controller?
>>>>>>>> I have created a simple patch which seems to fix this:
>>>>>>>> ---
>>>>>>>> @@ -1704,6 +1767,8 @@ static int piix_init_one(struct pci_dev *pdev,
>>>>>>>> const
>>>>>>>> struct pci_device_id *ent)
>>>>>>>>                  hpriv->map = piix_init_sata_map(pdev, port_info,
>>>>>>>>
>>>>>>>> piix_map_db_table[ent->driver_data]);
>>>>>>>>
>>>>>>>> +       if(pdev->vendor == 0x8086 && pdev->device == 0x27C4)
>>>>>>>> +               pdev->irq = 20;
>>>>>>>>          rc = ata_pci_bmdma_prepare_host(pdev, ppi, &host);
>>>>>>>>          if (rc)
>>>>>>>>                  return rc;
>>>>>>>>
>>>>>>>> However, I am more than sure that this is not the way
>>>>>>>> to solve this problem. Do you have any idea on where
>>>>>>>> the ideal place would be to implement a fix?
>>>>>>>> According to specs of ICH7M, which is essentially the
>>>>>>>> same as ICH6M, we need to check on what interrupt pin
>>>>>>>> is the SATA controller, and after that check which IRQ line
>>>>>>>> is connected to the I/O APIC and decide the IRQ's number
>>>>>>>> on those findings.
>>>>>>>>
>>>>>>>> Specs of ICH7:
>>>>>>>>
>>>>>>>> http://www.intel.com/content/dam/doc/datasheet/i-o-controller-hub-7-datasheet.pdf
>>>>>>>> Device 31 Interrupt Route Register: Chapter 7.1.46
>>>>>>>> Device 31 Interrupt Pin Register: Chapter 7.1.41
>>>>>>>>
>>>>>>>> The SATA controller is always Device 31.
>>>>>>>
>>>>>>>
>>>>>>> It would appear that something is messing up with the ACPI IRQ routing
>>>>>>> on this machine that's causing us to think the controller is on the
>>>>>>> wrong IRQ. CCing the linux-acpi list to see if anyone has some
>>>>>>> additional debugging suggestions. I suspect that dumping the DSDT is
>>>>>>> likely the first step though. If you can get IASL installed, you can
>>>>>>> do something like:
>>>>>>>
>>>>>>> cat /sys/firmware/acpi/tables/DSDT > dsdt.aml
>>>>>>> iasl -d dsdt.aml
>>>>>>>
>>>>>>> That should spit out a dsdt.dsl file which would hopefully have the
>>>>>>> info needed to figure out what's going on.
>>>>>>>
>>>>>>
>>>>>> Here is the disassembled DSDT table:
>>>>>> http://pastebin.com/LWNVht9H
>>>>>> The SATA controller is at line 5206.
>>>>>> I also disassembled the SSDT, but nothing interesting was there:
>>>>>> http://pastebin.com/fus5sxU8
>>>>>>
>>>>>> I disabled the usage of ACPI for IRQs with acpi=noirq,
>>>>>> and it successfully booted up setting itself to IRQ#3.
>>>>>> This makes me think that this is the BIOS's fault.
>>>>>> I think it would be possible to create a DMI check
>>>>>> and forcibly set the irq to 20 if the DMI matches.
>>>>>> Any comments on this?
>>>>>
>>>>> The BIOS may be doing something funky, but since Windows apparently
>>>>> can figure out it's on IRQ 20, Linux presumably should be able to as
>>>>> well. DMI checks should be the last resort - Windows almost certainly
>>>>> doesn't have any machine-specific logic here, and it's hard to tell
>>>>> what other machine models could be affected. With ACPI stuff, we
>>>>> generally just need to do the same thing Windows does for things to
>>>>> work reliably, and DMI checks are more of a hack workaround than a
>>>>> real fix.
>>>>>
>>>>> I'll try and have a look at the DSDT within the next few days and see
>>>>> if I can figure anything out, unless someone beats me to it.
>>>>
>>>> I haven't gone into too much detail, but one thing I noticed with the
>>>> DSDT is that there appear to be some _OSI checks for Windows 2006
>>>> (i.e. Vista) that seem to affect various things, including potentially
>>>> the PCI IRQ routing table. It's possible that their IRQ routing table
>>>> is broken for legacy mode with an ACPI OS supporting Vista (as current
>>>> Linux versions do). Could be this slipped through testing if they only
>>>> tested AHCI mode with Vista installed.
>>>>
>>>> You can try booting with the kernel parameters
>>>>
>>>> acpi_osi=! acpi_osi="Windows 2001 SP3"
>>>>
>>>> That should make the BIOS think we are Windows XP and bypass the Vista
>>>> code path. If that works, then you might want to check for a BIOS
>>>> update on this machine.
>>>>
>>>
>>> First of all, sorry for the late reply. I was kinda busy.
>>>
>>> I tried what you suggested but unfortunately the problem persists.
>>> This makes me believe that Windows XP does have some kind of DMI check here.
>>> Of course, while a BIOS update may solve this, I would prefer that Linux
>>> should also be able to boot up with this broken BIOS as well.
>>>
>>> If you are certain that WinXP doesn't use DMI checks,
>>> it could be that WinXP's driver of ICH7M's SATA controller applies
>>> a quirk and sets that irq line to #20.
>>
>> Can you post the dmesg output from a bootup attempt with those options?
>>
>> You may also want to try adding just: acpi_osi=!
>>
>
> None of the 3 possible combinations managed to boot.
>
> Here are a couple of dmesgs:
>
> Params: acpi_osi="Windows 2001 SP3"
> http://pastebin.com/vF3BSuhc
>
> Params: acpi_osi=! acpi_osi="Windows 2001 SP3"
> http://pastebin.com/BuUzc3es
>
> Params: acpi_osi=!
> http://pastebin.com/u7uRx8Ru

I'm not sure the option is actually taking effect properly. There
should be a message "Disabled all _OSI OS vendors" that shows up in
dmesg with the ! option. Can you try:

acpi_osi="!" acpi_osi="Windows 2001 SP3"

(with the quotes around the ! character).
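
For anyone following along, a quick way to confirm whether the option took
effect is to search the boot log for that message. A minimal shell sketch,
assuming a GNU/Linux userland; osi_override_seen is only a hypothetical
helper name, not an existing tool:

```shell
# Hypothetical helper: reads a kernel log on stdin and succeeds only if
# the "Disabled all _OSI OS vendors" message mentioned above is present.
osi_override_seen() {
    grep -q 'Disabled all _OSI OS vendors'
}

# Example use on the affected machine (reading the log may need root):
#   dmesg | osi_override_seen && echo "override applied" || echo "override not seen"
```

If the message is missing, the "!" most likely never reached the kernel in
the form it expects.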
Levente Kurusa Oct. 13, 2013, 12:02 p.m. UTC | #19
On 2013-10-13 07:57, Robert Hancock wrote:
> On Sat, Oct 12, 2013 at 3:29 AM, Levente Kurusa <levex@linux.com> wrote:
>> On 2013-10-12 04:06, Robert Hancock wrote:
>>> On Fri, Oct 11, 2013 at 10:07 AM, Levente Kurusa <levex@linux.com> wrote:
>>>> On 2013-10-01 06:25, Robert Hancock wrote:
>>>>> On Sat, Sep 28, 2013 at 7:21 PM, Robert Hancock <hancockrwd@gmail.com> wrote:
>>>>>> On Sat, Sep 28, 2013 at 11:46 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>>>> On 2013-09-28 06:55, Robert Hancock wrote:
>>>>>>>
>>>>>>>> On Fri, Sep 27, 2013 at 7:24 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>>>>>>
>>>>>>>>> On 2013-09-25 08:31, Robert Hancock wrote:
>>>>>>>>>
>>>>>>>>>> On Sun, Sep 22, 2013 at 1:13 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 2013-09-21 19:04, Robert Hancock wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@linux.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The following dmesg is stuck in an infinite loop.
>>>>>>>>>>>>>>>>>>> dmesg:
>>>>>>>>>>>>>>>>>>> ata3: lost interrupt (Status 0x50)
>>>>>>>>>>>>>>>>>>> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
>>>>>>>>>>>>>>>>>>> frozen
>>>>>>>>>>>>>>>>>>> ata3.00: failed command: READ DMA
>>>>>>>>>>>>>>>>>>> ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096
>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>                     res 40/00:00:00:00:00/00:00:00:00:00/00
>>>>>>>>>>>>>>>>>>> Emask
>>>>>>>>>>>>>>>>>>> 0x4
>>>>>>>>>>>>>>>>>>> (timeout)
>>>>>>>>>>>>>>>>>>> ata3.00: status: { DRDY }
>>>>>>>>>>>>>>>>>>> ata3: soft resetting link
>>>>>>>>>>>>>>>>>>> ata3.00: configured for UDMA/33 (no error)
>>>>>>>>>>>>>>>>>>> ata3.00: device reported invalid CHS sector 0
>>>>>>>>>>>>>>>>>>> ata3: EH complete
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Patch that fixes the infinite loop:
>>>>>>>>>>>>>>>>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>>> index f9476fb..eeedf80 100644
>>>>>>>>>>>>>>>>>>> --- a/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>>> +++ b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>>> @@ -2437,6 +2437,14 @@ static void ata_eh_link_report(struct
>>>>>>>>>>>>>>>>>>> ata_link
>>>>>>>>>>>>>>>>>>> *link)
>>>>>>>>>>>>>>>>>>>                                   ehc->i.action, frozen,
>>>>>>>>>>>>>>>>>>> tries_buf);
>>>>>>>>>>>>>>>>>>>                       if (desc)
>>>>>>>>>>>>>>>>>>>                               ata_dev_err(ehc->i.dev, "%s\n",
>>>>>>>>>>>>>>>>>>> desc);
>>>>>>>>>>>>>>>>>>> +               ehc->i.dev->exce_cnt ++;
>>>>>>>>>>>>>>>>>>> +               ata_dev_warn(ehc->i.dev, "Number of exceptions:
>>>>>>>>>>>>>>>>>>> %d\n",
>>>>>>>>>>>>>>>>>>> ehc->i.dev->exce_cnt);
>>>>>>>>>>>>>>>>>>> +               /**
>>>>>>>>>>>>>>>>>>> +                  * The device is failing terribly,
>>>>>>>>>>>>>>>>>>> +                 * disable it to prevent damage.
>>>>>>>>>>>>>>>>>>> +                 */
>>>>>>>>>>>>>>>>>>> +               if(ehc->i.dev->exce_cnt > 2)
>>>>>>>>>>>>>>>>>>> +                       ata_dev_disable(ehc->i.dev);
>>>>>>>>>>>>>>>>>>>               } else {
>>>>>>>>>>>>>>>>>>>                       ata_link_err(link, "exception Emask 0x%x
>>>>>>>>>>>>>>>>>>> "
>>>>>>>>>>>>>>>>>>>                                    "SAct 0x%x SErr 0x%x action
>>>>>>>>>>>>>>>>>>> 0x%x%s%s\n",
>>>>>>>>>>>>>>>>>>> diff --git a/include/linux/libata.h b/include/linux/libata.h
>>>>>>>>>>>>>>>>>>> index eae7a05..fa52ee6 100644
>>>>>>>>>>>>>>>>>>> --- a/include/linux/libata.h
>>>>>>>>>>>>>>>>>>> +++ b/include/linux/libata.h
>>>>>>>>>>>>>>>>>>> @@ -660,7 +660,8 @@ struct ata_device {
>>>>>>>>>>>>>>>>>>>               u8
>>>>>>>>>>>>>>>>>>> devslp_timing[ATA_LOG_DEVSLP_SIZE];
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>               /* error history */
>>>>>>>>>>>>>>>>>>> -       int                     spdn_cnt;
>>>>>>>>>>>>>>>>>>> +       int                     spdn_cnt; /* Number of
>>>>>>>>>>>>>>>>>>> speed_downs
>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>> +       int                     exce_cnt; /* Number of
>>>>>>>>>>>>>>>>>>> exceptions
>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> happenned */
>>>>>>>>>>>>>>>>>>>               /* ering is CLEAR_END, read comment above
>>>>>>>>>>>>>>>>>>> CLEAR_END
>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>               struct ata_ering        ering;
>>>>>>>>>>>>>>>>>>>        };
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This doesn't seem like a very good fix. It may prevent the
>>>>>>>>>>>>>>>>>> apparent
>>>>>>>>>>>>>>>>>> infinite loop but will just prevent that device from functioning
>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>> all.
>>>>>>>>>>>>>>>>>> It would be better if we could figure out what was actually
>>>>>>>>>>>>>>>>>> going
>>>>>>>>>>>>>>>>>> wrong.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have tested the problem with three different computers, all
>>>>>>>>>>>>>>>>> switched
>>>>>>>>>>>>>>>>> to legacy/IDE/compatibility mode, and they didn't have this
>>>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>>> Of
>>>>>>>>>>>>>>>>> course, they could have been set to AHCI mode, and there the
>>>>>>>>>>>>>>>>> kernel
>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>> boot normally. Feels strange, but so far I was only able to
>>>>>>>>>>>>>>>>> reproduce
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> problem with a Toshiba MK8052GSX. On the topic of my patch, I
>>>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>>>> see why a device which fails so terribly that it reports 3
>>>>>>>>>>>>>>>>> exceptions
>>>>>>>>>>>>>>>>> shouldn't be disabled. Like in this case, it could cause infinite
>>>>>>>>>>>>>>>>> loops.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The problem is that this could happen in some cases when you
>>>>>>>>>>>>>>>> wouldn't
>>>>>>>>>>>>>>>> want to disable the device, like an error that just happens
>>>>>>>>>>>>>>>> sporadically and works on retry, or a device you're trying to
>>>>>>>>>>>>>>>> recover
>>>>>>>>>>>>>>>> data from.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What do you think if I edit the patch in a way, that when an
>>>>>>>>>>>>>>> operation
>>>>>>>>>>>>>>> successfully completes, it resets exce_cnt to zero. Might as well
>>>>>>>>>>>>>>> add
>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>> module_param, which can set the maximum value of exce_cnt, while
>>>>>>>>>>>>>>> having
>>>>>>>>>>>>>>> zero
>>>>>>>>>>>>>>> as an option to never disable the device. Please don't think me
>>>>>>>>>>>>>>> wrong,
>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>> don't want to force this patch, I just want to learn how all this
>>>>>>>>>>>>>>> works,
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> in the process try to make it better. :-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That would be better, but I think you're still going to have an
>>>>>>>>>>>>>> issue
>>>>>>>>>>>>>> with what magic number to pick to avoid disabling devices
>>>>>>>>>>>>>> inappropriately.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Conceptually, disabling the device doesn't really make sense anyway.
>>>>>>>>>>>>>> If someone in userspace wants to keep trying to read from that
>>>>>>>>>>>>>> device,
>>>>>>>>>>>>>> why would you stop them because of some arbitrary judgement? The
>>>>>>>>>>>>>> kernel itself isn't "locked up" during this process, anything not
>>>>>>>>>>>>>> blocked on I/O to that device should be able to continue running, so
>>>>>>>>>>>>>> that process is only hurting itself. If the system fails to boot
>>>>>>>>>>>>>> from
>>>>>>>>>>>>>> another device due to this, this would likely point out some kind of
>>>>>>>>>>>>>> problem in userspace or the distro boot process being overly
>>>>>>>>>>>>>> serialized.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have been booting up with the initramfs from ubuntu 13.04,
>>>>>>>>>>>>> and I have also tried to boot with the ubuntu install cd. They
>>>>>>>>>>>>> couldn't
>>>>>>>>>>>>> continue the boot process. I'm gonna spend the weekend trying to
>>>>>>>>>>>>> figure
>>>>>>>>>>>>> out where and why the interrupts don't happen. Whether it be a
>>>>>>>>>>>>> routing
>>>>>>>>>>>>> or a hardware issue, which I highly doubt due to the fact that
>>>>>>>>>>>>> Windows
>>>>>>>>>>>>> XP SP2 was able to boot up without errors.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Are you able to get out full dmesg output from a boot attempt and the
>>>>>>>>>>>> contents of /proc/interrupts?
>>>>>>>>>>>>
>>>>>>>>>>> As I said before, I am not able to get to the shell, without my
>>>>>>>>>>> 'symptom
>>>>>>>>>>> cure'. With my patch I get the following dmesg output, with
>>>>>>>>>>> some of my debug messages turned off:
>>>>>>>>>>> http://pastebin.com/5eb5G3Dx
>>>>>>>>>>> /proc/interrupts is here:
>>>>>>>>>>> http://pastebin.com/84CJey2D
>>>>>>>>>>> After yesterday's research, I have come to ata_piix.c . That file looks
>>>>>>>>>>> like
>>>>>>>>>>> the real culprit, as my netbook's controller is an Intel ICH7M one.
>>>>>>>>>>> The values I am getting from the device are very different than those
>>>>>>>>>>> that are expected.
>>>>>>>>>>>
>>>>>>>>>>> Things I have noticed, but ignored in dmesg:
>>>>>>>>>>> There is a stack dump, because nobody cared about IRQ#20. I have
>>>>>>>>>>> ignored
>>>>>>>>>>> this because it is the EHCI IRQ, and I suppose it has nothing to do
>>>>>>>>>>> with
>>>>>>>>>>> ata. The problem is with ata3 or /dev/sdc, while the IRQ happens
>>>>>>>>>>> with /dev/sda, which works fine.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think it is likely related to the problem. The kernel thinks this
>>>>>>>>>> controller is on IRQ 16, but apparently something is raising
>>>>>>>>>> un-acknowledged interrupts on IRQ 20 and nothing is coming in on IRQ
>>>>>>>>>> 16. It seems quite likely that this is actually the ATA controller.
>>>>>>>>>>
>>>>>>>>>> You mentioned that Windows XP was able to work in this mode. I wonder
>>>>>>>>>> if it was using the IOAPIC, as if not then the IRQ routing is
>>>>>>>>>> different which might mask the problem. Do you know what IRQ Device
>>>>>>>>>> Manager reported for this controller in Windows? And was it using any
>>>>>>>>>> IRQs over 15 (which would indicate the IOAPIC was in use)?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hmm, according to WinXP's Device manager for this controller,
>>>>>>>>> it listens to IRQ# 20, and therefore it is using the I/O APIC.
>>>>>>>>> Now one question remains: where is the error that mismaps the
>>>>>>>>> controller?
>>>>>>>>> I have created a simple patch which seems to fix this:
>>>>>>>>> ---
>>>>>>>>> @@ -1704,6 +1767,8 @@ static int piix_init_one(struct pci_dev *pdev,
>>>>>>>>> const
>>>>>>>>> struct pci_device_id *ent)
>>>>>>>>>                  hpriv->map = piix_init_sata_map(pdev, port_info,
>>>>>>>>>
>>>>>>>>> piix_map_db_table[ent->driver_data]);
>>>>>>>>>
>>>>>>>>> +       if(pdev->vendor == 0x8086 && pdev->device == 0x27C4)
>>>>>>>>> +               pdev->irq = 20;
>>>>>>>>>          rc = ata_pci_bmdma_prepare_host(pdev, ppi, &host);
>>>>>>>>>          if (rc)
>>>>>>>>>                  return rc;
>>>>>>>>>
>>>>>>>>> However, I am more than sure that this is not the way
>>>>>>>>> to solve this problem. Do you have any idea on where
>>>>>>>>> the ideal place would be to implement a fix?
>>>>>>>>> According to specs of ICH7M, which is essentially the
>>>>>>>>> same as ICH6M, we need to check on what interrupt pin
>>>>>>>>> is the SATA controller, and after that check which IRQ line
>>>>>>>>> is connected to the I/O APIC and decide the IRQ's number
>>>>>>>>> on those findings.
>>>>>>>>>
>>>>>>>>> Specs of ICH7:
>>>>>>>>>
>>>>>>>>> http://www.intel.com/content/dam/doc/datasheet/i-o-controller-hub-7-datasheet.pdf
>>>>>>>>> Device 31 Interrupt Route Register: Chapter 7.1.46
>>>>>>>>> Device 31 Interrupt Pin Register: Chapter 7.1.41
>>>>>>>>>
>>>>>>>>> The SATA controller is always Device 31.
>>>>>>>>
>>>>>>>>
>>>>>>>> It would appear that something is messing up with the ACPI IRQ routing
>>>>>>>> on this machine that's causing us to think the controller is on the
>>>>>>>> wrong IRQ. CCing the linux-acpi list to see if anyone has some
>>>>>>>> additional debugging suggestions. I suspect that dumping the DSDT is
>>>>>>>> likely the first step though. If you can get IASL installed, you can
>>>>>>>> do something like:
>>>>>>>>
>>>>>>>> cat /sys/firmware/acpi/tables/DSDT > dsdt.aml
>>>>>>>> iasl -d dsdt.aml
>>>>>>>>
>>>>>>>> That should spit out a dsdt.dsl file which would hopefully have the
>>>>>>>> info needed to figure out what's going on.
>>>>>>>>
>>>>>>>
>>>>>>> Here is the disassembled DSDT table:
>>>>>>> http://pastebin.com/LWNVht9H
>>>>>>> The SATA controller is at line 5206.
>>>>>>> I also disassembled the SSDT, but nothing interesting was there:
>>>>>>> http://pastebin.com/fus5sxU8
>>>>>>>
>>>>>>> I disabled the usage of ACPI for IRQs with acpi=noirq,
>>>>>>> and it successfully booted up setting itself to IRQ#3.
>>>>>>> This makes me think that this is the BIOS's fault.
>>>>>>> I think it would be possible to create a DMI check
>>>>>>> and forcibly set the irq to 20 if the DMI matches.
>>>>>>> Any comments on this?
>>>>>>
>>>>>> The BIOS may be doing something funky, but since Windows apparently
>>>>>> can figure out it's on IRQ 20, Linux presumably should be able to as
>>>>>> well. DMI checks should be the last resort - Windows almost certainly
>>>>>> doesn't have any machine-specific logic here, and it's hard to tell
>>>>>> what other machine models could be affected. With ACPI stuff, we
>>>>>> generally just need to do the same thing Windows does for things to
>>>>>> work reliably, and DMI checks are more of a hack workaround than a
>>>>>> real fix.
>>>>>>
>>>>>> I'll try and have a look at the DSDT within the next few days and see
>>>>>> if I can figure anything out, unless someone beats me to it.
>>>>>
>>>>> I haven't gone into too much detail, but one thing I noticed with the
>>>>> DSDT is that there appear to be some _OSI checks for Windows 2006
>>>>> (i.e. Vista) that seem to affect various things, including potentially
>>>>> the PCI IRQ routing table. It's possible that their IRQ routing table
>>>>> is broken for legacy mode with an ACPI OS supporting Vista (as current
>>>>> Linux versions do). Could be this slipped through testing if they only
>>>>> tested AHCI mode with Vista installed.
>>>>>
>>>>> You can try booting with the kernel parameters
>>>>>
>>>>> acpi_osi=! acpi_osi="Windows 2001 SP3"
>>>>>
>>>>> That should make the BIOS think we are Windows XP and bypass the Vista
>>>>> code path. If that works, then you might want to check for a BIOS
>>>>> update on this machine.
>>>>>
>>>>
>>>> First of all, sorry for the late reply. I was kinda busy.
>>>>
>>>> I tried what you suggested but unfortunately the problem persists.
>>>> This makes me believe that Windows XP does have some kind of DMI check here.
>>>> Of course, while a BIOS update may solve this, I would prefer that Linux
>>>> should also be able to boot up with this broken BIOS as well.
>>>>
>>>> If you are certain that WinXP doesn't use DMI checks,
>>>> it could be that WinXP's driver of ICH7M's SATA controller applies
>>>> a quirk and sets that irq line to #20.
>>>
>>> Can you post the dmesg output from a bootup attempt with those options?
>>>
>>> You may also want to try adding just: acpi_osi=!
>>>
>>
>> None of the 3 possible combinations managed to boot.
>>
>> Here are a couple of dmesgs:
>>
>> Params: acpi_osi="Windows 2001 SP3"
>> http://pastebin.com/vF3BSuhc
>>
>> Params: acpi_osi=! acpi_osi="Windows 2001 SP3"
>> http://pastebin.com/BuUzc3es
>>
>> Params: acpi_osi=!
>> http://pastebin.com/u7uRx8Ru
> 
> I'm not sure the option is actually taking effect properly. There
> should be a message "Disabled all _OSI OS vendors" that shows up in
> dmesg with the ! option. Can you try:
> 
> acpi_osi="!" acpi_osi="Windows 2001 SP3"
> 
> (with the quotes around the ! character).
> 

The following command line worked:
acpi_osi= acpi_osi="Windows 2001 SP3"

So it seems that the BIOS is broken. Is there any way to fix this
without resorting to hackish DMI checks?
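
Since the exact quoting of these arguments turned out to matter, it can help
to see which acpi_osi= arguments actually reached the kernel. A small sketch,
assuming a grep that supports -o and -E (GNU and BSD grep do); list_acpi_osi
is a made-up helper name:

```shell
# Hypothetical helper: prints each acpi_osi= argument found in a kernel
# command line read from stdin, one per line. A quoted value such as
# "Windows 2001 SP3" is kept together despite the spaces inside it.
list_acpi_osi() {
    grep -oE 'acpi_osi=("[^"]*"|[^ ]*)'
}

# Example use on a running system:
#   list_acpi_osi < /proc/cmdline
```

Whether the quotes themselves show up in the output depends on how the boot
loader passes the arguments on.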
Robert Hancock Oct. 16, 2013, 12:16 a.m. UTC | #20
On Sun, Oct 13, 2013 at 6:02 AM, Levente Kurusa <levex@linux.com> wrote:
> On 2013-10-13 07:57, Robert Hancock wrote:
>> On Sat, Oct 12, 2013 at 3:29 AM, Levente Kurusa <levex@linux.com> wrote:
>>> On 2013-10-12 04:06, Robert Hancock wrote:
>>>> On Fri, Oct 11, 2013 at 10:07 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>> On 2013-10-01 06:25, Robert Hancock wrote:
>>>>>> On Sat, Sep 28, 2013 at 7:21 PM, Robert Hancock <hancockrwd@gmail.com> wrote:
>>>>>>> On Sat, Sep 28, 2013 at 11:46 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>>>>> On 2013-09-28 06:55, Robert Hancock wrote:
>>>>>>>>
>>>>>>>>> On Fri, Sep 27, 2013 at 7:24 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 2013-09-25 08:31, Robert Hancock wrote:
>>>>>>>>>>
>>>>>>>>>>> On Sun, Sep 22, 2013 at 1:13 AM, Levente Kurusa <levex@linux.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2013-09-21 19:04, Robert Hancock wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@linux.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The following dmesg is stuck in an infinite loop.
>>>>>>>>>>>>>>>>>>>> dmesg:
>>>>>>>>>>>>>>>>>>>> ata3: lost interrupt (Status 0x50)
>>>>>>>>>>>>>>>>>>>> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
>>>>>>>>>>>>>>>>>>>> frozen
>>>>>>>>>>>>>>>>>>>> ata3.00: failed command: READ DMA
>>>>>>>>>>>>>>>>>>>> ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096
>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>                     res 40/00:00:00:00:00/00:00:00:00:00/00
>>>>>>>>>>>>>>>>>>>> Emask
>>>>>>>>>>>>>>>>>>>> 0x4
>>>>>>>>>>>>>>>>>>>> (timeout)
>>>>>>>>>>>>>>>>>>>> ata3.00: status: { DRDY }
>>>>>>>>>>>>>>>>>>>> ata3: soft resetting link
>>>>>>>>>>>>>>>>>>>> ata3.00: configured for UDMA/33 (no error)
>>>>>>>>>>>>>>>>>>>> ata3.00: device reported invalid CHS sector 0
>>>>>>>>>>>>>>>>>>>> ata3: EH complete
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Patch that fixes the infinite loop:
>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>>>> index f9476fb..eeedf80 100644
>>>>>>>>>>>>>>>>>>>> --- a/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>>>> +++ b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>>>> @@ -2437,6 +2437,14 @@ static void ata_eh_link_report(struct
>>>>>>>>>>>>>>>>>>>> ata_link
>>>>>>>>>>>>>>>>>>>> *link)
>>>>>>>>>>>>>>>>>>>>                                   ehc->i.action, frozen,
>>>>>>>>>>>>>>>>>>>> tries_buf);
>>>>>>>>>>>>>>>>>>>>                       if (desc)
>>>>>>>>>>>>>>>>>>>>                               ata_dev_err(ehc->i.dev, "%s\n",
>>>>>>>>>>>>>>>>>>>> desc);
>>>>>>>>>>>>>>>>>>>> +               ehc->i.dev->exce_cnt ++;
>>>>>>>>>>>>>>>>>>>> +               ata_dev_warn(ehc->i.dev, "Number of exceptions:
>>>>>>>>>>>>>>>>>>>> %d\n",
>>>>>>>>>>>>>>>>>>>> ehc->i.dev->exce_cnt);
>>>>>>>>>>>>>>>>>>>> +               /**
>>>>>>>>>>>>>>>>>>>> +                  * The device is failing terribly,
>>>>>>>>>>>>>>>>>>>> +                 * disable it to prevent damage.
>>>>>>>>>>>>>>>>>>>> +                 */
>>>>>>>>>>>>>>>>>>>> +               if(ehc->i.dev->exce_cnt > 2)
>>>>>>>>>>>>>>>>>>>> +                       ata_dev_disable(ehc->i.dev);
>>>>>>>>>>>>>>>>>>>>               } else {
>>>>>>>>>>>>>>>>>>>>                       ata_link_err(link, "exception Emask 0x%x
>>>>>>>>>>>>>>>>>>>> "
>>>>>>>>>>>>>>>>>>>>                                    "SAct 0x%x SErr 0x%x action
>>>>>>>>>>>>>>>>>>>> 0x%x%s%s\n",
>>>>>>>>>>>>>>>>>>>> diff --git a/include/linux/libata.h b/include/linux/libata.h
>>>>>>>>>>>>>>>>>>>> index eae7a05..fa52ee6 100644
>>>>>>>>>>>>>>>>>>>> --- a/include/linux/libata.h
>>>>>>>>>>>>>>>>>>>> +++ b/include/linux/libata.h
>>>>>>>>>>>>>>>>>>>> @@ -660,7 +660,8 @@ struct ata_device {
>>>>>>>>>>>>>>>>>>>>               u8
>>>>>>>>>>>>>>>>>>>> devslp_timing[ATA_LOG_DEVSLP_SIZE];
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>               /* error history */
>>>>>>>>>>>>>>>>>>>> -       int                     spdn_cnt;
>>>>>>>>>>>>>>>>>>>> +       int                     spdn_cnt; /* Number of
>>>>>>>>>>>>>>>>>>>> speed_downs
>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>> +       int                     exce_cnt; /* Number of
>>>>>>>>>>>>>>>>>>>> exceptions
>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>> happenned */
>>>>>>>>>>>>>>>>>>>>               /* ering is CLEAR_END, read comment above
>>>>>>>>>>>>>>>>>>>> CLEAR_END
>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>>               struct ata_ering        ering;
>>>>>>>>>>>>>>>>>>>>        };
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This doesn't seem like a very good fix. It may prevent the
>>>>>>>>>>>>>>>>>>> apparent
>>>>>>>>>>>>>>>>>>> infinite loop but will just prevent that device from functioning
>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>> all.
>>>>>>>>>>>>>>>>>>> It would be better if we could figure out what was actually
>>>>>>>>>>>>>>>>>>> going
>>>>>>>>>>>>>>>>>>> wrong.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have tested the problem with three different computers, all
>>>>>>>>>>>>>>>>>> switched
>>>>>>>>>>>>>>>>>> to legacy/IDE/compatibility mode, and they didn't have this
>>>>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>>>> Of
>>>>>>>>>>>>>>>>>> course, they could have been set to AHCI mode, and there the
>>>>>>>>>>>>>>>>>> kernel
>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>> boot normally. Feels strange, but so far I was only able to
>>>>>>>>>>>>>>>>>> reproduce
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> problem with a Toshiba MK8052GSX. On the topic of my patch, I
>>>>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>>>>> see why a device which fails so terribly that it reports 3
>>>>>>>>>>>>>>>>>> exceptions
>>>>>>>>>>>>>>>>>> shouldn't be disabled. Like in this case, it could cause infinite
>>>>>>>>>>>>>>>>>> loops.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The problem is that this could happen in some cases when you
>>>>>>>>>>>>>>>>> wouldn't
>>>>>>>>>>>>>>>>> want to disable the device, like an error that just happens
>>>>>>>>>>>>>>>>> sporadically and works on retry, or a device you're trying to
>>>>>>>>>>>>>>>>> recover
>>>>>>>>>>>>>>>>> data from.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What do you think if I edit the patch in a way, that when an
>>>>>>>>>>>>>>>> operation
>>>>>>>>>>>>>>>> successfully completes, it resets exce_cnt to zero. Might as well
>>>>>>>>>>>>>>>> add
>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>> module_param, which can set the maximum value of exce_cnt, while
>>>>>>>>>>>>>>>> having
>>>>>>>>>>>>>>>> zero
>>>>>>>>>>>>>>>> as an option to never disable the device. Please don't think me
>>>>>>>>>>>>>>>> wrong,
>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>> don't want to force this patch, I just want to learn how all this
>>>>>>>>>>>>>>>> works,
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> in the process try to make it better. :-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That would be better, but I think you're still going to have an
>>>>>>>>>>>>>>> issue
>>>>>>>>>>>>>>> with what magic number to pick to avoid disabling devices
>>>>>>>>>>>>>>> inappropriately.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Conceptually, disabling the device doesn't really make sense anyway.
>>>>>>>>>>>>>>> If someone in userspace wants to keep trying to read from that
>>>>>>>>>>>>>>> device,
>>>>>>>>>>>>>>> why would you stop them because of some arbitrary judgement? The
>>>>>>>>>>>>>>> kernel itself isn't "locked up" during this process, anything not
>>>>>>>>>>>>>>> blocked on I/O to that device should be able to continue running, so
>>>>>>>>>>>>>>> that process is only hurting itself. If the system fails to boot
>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>> another device due to this, this would likely point out some kind of
>>>>>>>>>>>>>>> problem in userspace or the distro boot process being overly
>>>>>>>>>>>>>>> serialized.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have been booting up with the initramfs from ubuntu 13.04,
>>>>>>>>>>>>>> and I have also tried to boot with the ubuntu install cd. They
>>>>>>>>>>>>>> couldn't
>>>>>>>>>>>>>> continue the boot process. I'm gonna spend the weekend trying to
>>>>>>>>>>>>>> figure
>>>>>>>>>>>>>> out where and why the interrupts don't happen. Whether it be a
>>>>>>>>>>>>>> routing
>>>>>>>>>>>>>> or a hardware issue, which I highly doubt due to the fact that
>>>>>>>>>>>>>> Windows
>>>>>>>>>>>>>> XP SP2 was able to boot up without errors.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are you able to get out full dmesg output from a boot attempt and the
>>>>>>>>>>>>> contents of /proc/interrupts?
>>>>>>>>>>>>>
>>>>>>>>>>>> As I said before, I am not able to get to the shell, without my
>>>>>>>>>>>> 'symptom
>>>>>>>>>>>> cure'. With my patch I get the following dmesg output, with
>>>>>>>>>>>> some of my debug messages turned off:
>>>>>>>>>>>> http://pastebin.com/5eb5G3Dx
>>>>>>>>>>>> /proc/interrupts is here:
>>>>>>>>>>>> http://pastebin.com/84CJey2D
>>>>>>>>>>>> After yesterday's research, I have come to ata_piix.c . That file looks
>>>>>>>>>>>> like
>>>>>>>>>>>> the real culprit, as my netbook's controller is an Intel ICH7M one.
>>>>>>>>>>>> The values I am getting from the device are very different than those
>>>>>>>>>>>> that are expected.
>>>>>>>>>>>>
>>>>>>>>>>>> Things I have noticed, but ignored in dmesg:
>>>>>>>>>>>> There is a stack dump, because nobody cared about IRQ#20. I have
>>>>>>>>>>>> ignored
>>>>>>>>>>>> this because it is the EHCI IRQ, and I suppose it has nothing to do
>>>>>>>>>>>> with
>>>>>>>>>>>> ata. The problem is with ata3 or /dev/sdc, while the IRQ happens
>>>>>>>>>>>> with /dev/sda, which works fine.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think it is likely related to the problem. The kernel thinks this
>>>>>>>>>>> controller is on IRQ 16, but apparently something is raising
>>>>>>>>>>> un-acknowledged interrupts on IRQ 20 and nothing is coming in on IRQ
>>>>>>>>>>> 16. It seems quite likely that this is actually the ATA controller.
>>>>>>>>>>>
>>>>>>>>>>> You mentioned that Windows XP was able to work in this mode. I wonder
>>>>>>>>>>> if it was using the IOAPIC, as if not then the IRQ routing is
>>>>>>>>>>> different which might mask the problem. Do you know what IRQ Device
>>>>>>>>>>> Manager reported for this controller in Windows? And was it using any
>>>>>>>>>>> IRQs over 15 (which would indicate the IOAPIC was in use)?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hmm, according to WinXP's Device Manager, this controller
>>>>>>>>>> listens on IRQ 20, so it is using the I/O APIC.
>>>>>>>>>> Now one question remains: where is the error that maps the
>>>>>>>>>> controller to the wrong IRQ?
>>>>>>>>>> I have created a simple patch which seems to fix this:
>>>>>>>>>> ---
>>>>>>>>>> @@ -1704,6 +1767,8 @@ static int piix_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>>>>>>>>>>                  hpriv->map = piix_init_sata_map(pdev, port_info,
>>>>>>>>>>                                                  piix_map_db_table[ent->driver_data]);
>>>>>>>>>>
>>>>>>>>>> +       if(pdev->vendor == 0x8086 && pdev->device == 0x27C4)
>>>>>>>>>> +               pdev->irq = 20;
>>>>>>>>>>          rc = ata_pci_bmdma_prepare_host(pdev, ppi, &host);
>>>>>>>>>>          if (rc)
>>>>>>>>>>                  return rc;
>>>>>>>>>>
>>>>>>>>>> However, I am more than sure that this is not the way
>>>>>>>>>> to solve this problem. Do you have any idea on where
>>>>>>>>>> the ideal place would be to implement a fix?
>>>>>>>>>> According to the specs of the ICH7M, which is essentially the
>>>>>>>>>> same as the ICH6M, we need to check which interrupt pin the
>>>>>>>>>> SATA controller uses, then check which IRQ line that pin is
>>>>>>>>>> routed to on the I/O APIC, and derive the IRQ number
>>>>>>>>>> from those findings.
>>>>>>>>>>
>>>>>>>>>> Specs of ICH7:
>>>>>>>>>>
>>>>>>>>>> http://www.intel.com/content/dam/doc/datasheet/i-o-controller-hub-7-datasheet.pdf
>>>>>>>>>> Device 31 Interrupt Route Register: Chapter 7.1.46
>>>>>>>>>> Device 31 Interrupt Pin Register: Chapter 7.1.41
>>>>>>>>>>
>>>>>>>>>> The SATA controller is always Device 31.
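
To make that lookup concrete, here is a minimal C sketch of the second
step: turning a route-register value into an I/O APIC IRQ number. The
register layout (one 3-bit field per interrupt pin) and the mapping of
PIRQA#..PIRQH# to IRQ 16..23 are assumptions that should be checked
against the datasheet chapters cited above. The function names are made
up for illustration and are not kernel APIs.

```c
#include <stdint.h>

/*
 * Assumed layout of a 16-bit DxxIR route register (verify against the
 * ICH7 datasheet): one 3-bit field per interrupt pin, at bits 2:0
 * (INTA#), 6:4 (INTB#), 10:8 (INTC#) and 14:12 (INTD#). A field value
 * of 0..7 selects PIRQA#..PIRQH#.
 */
enum intr_pin { PIN_INTA = 1, PIN_INTB, PIN_INTC, PIN_INTD };

/* Return the PIRQ index (0 = PIRQA# ... 7 = PIRQH#), or -1 for a bad pin. */
static int dxxir_pirq(uint16_t dxxir, int pin)
{
    if (pin < PIN_INTA || pin > PIN_INTD)
        return -1;
    return (dxxir >> (4 * (pin - 1))) & 0x7;
}

/* Assumed: in I/O APIC mode, PIRQA#..PIRQH# arrive on IRQ 16..23. */
static int pirq_to_ioapic_irq(int pirq)
{
    if (pirq < 0 || pirq > 7)
        return -1;
    return 16 + pirq;
}
```

Under these assumptions, a route field value of 4 selects PIRQE#, which
lands on IRQ 20, the line Windows reported for this controller.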
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It would appear that something is messing up with the ACPI IRQ routing
>>>>>>>>> on this machine that's causing us to think the controller is on the
>>>>>>>>> wrong IRQ. CCing the linux-acpi list to see if anyone has some
>>>>>>>>> additional debugging suggestions. I suspect that dumping the DSDT is
>>>>>>>>> likely the first step though. If you can get IASL installed, you can
>>>>>>>>> do something like:
>>>>>>>>>
>>>>>>>>> cat /sys/firmware/acpi/tables/DSDT > dsdt.aml
>>>>>>>>> iasl -d dsdt.aml
>>>>>>>>>
>>>>>>>>> That should spit out a dsdt.dsl file which would hopefully have the
>>>>>>>>> info needed to figure out what's going on.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Here is the disassembled DSDT table:
>>>>>>>> http://pastebin.com/LWNVht9H
>>>>>>>> The SATA controller is at line 5206.
>>>>>>>> I also disassembled the SSDT, but nothing interesting was there:
>>>>>>>> http://pastebin.com/fus5sxU8
>>>>>>>>
>>>>>>>> I disabled the usage of ACPI for IRQs with acpi=noirq,
>>>>>>>> and it successfully booted up setting itself to IRQ#3.
>>>>>>>> This makes me think that this is the BIOS's fault.
>>>>>>>> I think it would be possible to create a DMI check
>>>>>>>> and forcibly set the irq to 20 if the DMI matches.
>>>>>>>> Any comments on this?
>>>>>>>
>>>>>>> The BIOS may be doing something funky, but since Windows apparently
>>>>>>> can figure out it's on IRQ 20, Linux presumably should be able to as
>>>>>>> well. DMI checks should be the last resort - Windows almost certainly
>>>>>>> doesn't have any machine-specific logic here, and it's hard to tell
>>>>>>> what other machine models could be affected. With ACPI stuff, we
>>>>>>> generally just need to do the same thing Windows does for things to
>>>>>>> work reliably, and DMI checks are more of a hack workaround than a
>>>>>>> real fix.
>>>>>>>
>>>>>>> I'll try and have a look at the DSDT within the next few days and see
>>>>>>> if I can figure anything out, unless someone beats me to it.
>>>>>>
>>>>>> I haven't gone into too much detail, but one thing I noticed with the
>>>>>> DSDT is that there appear to be some _OSI checks for Windows 2006
>>>>>> (i.e. Vista) that seem to affect various things, including potentially
>>>>>> the PCI IRQ routing table. It's possible that their IRQ routing table
>>>>>> is broken for legacy mode with an ACPI OS supporting Vista (as current
>>>>>> Linux versions do). Could be this slipped through testing if they only
>>>>>> tested AHCI mode with Vista installed.
>>>>>>
>>>>>> You can try booting with the kernel parameters
>>>>>>
>>>>>> acpi_osi=! acpi_osi="Windows 2001 SP3"
>>>>>>
>>>>>> That should make the BIOS think we are Windows XP and bypass the Vista
>>>>>> code path. If that works, then you might want to check for a BIOS
>>>>>> update on this machine.
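
The effect of stacking those two parameters can be pictured with a small
userspace model in C. This is only a sketch: the built-in string list is
illustrative, the helper names are hypothetical, and it models just
three forms ("!" to disable all vendor strings, "!string" to remove one,
"string" to add one). It does not model an empty acpi_osi= value and it
is not the kernel's implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_OSI 8

struct osi_entry {
    const char *name;
    bool enabled;
};

struct osi_list {
    struct osi_entry e[MAX_OSI];
    size_t n;
};

/* Start from a few illustrative built-in strings, all enabled. */
static void osi_init(struct osi_list *l)
{
    static const char *const builtin[] = {
        "Windows 2001", "Windows 2001 SP2", "Windows 2006",
    };
    size_t i;

    l->n = 0;
    for (i = 0; i < sizeof(builtin) / sizeof(builtin[0]); i++) {
        l->e[l->n].name = builtin[i];
        l->e[l->n].enabled = true;
        l->n++;
    }
}

static struct osi_entry *osi_find(struct osi_list *l, const char *name)
{
    size_t i;

    for (i = 0; i < l->n; i++)
        if (strcmp(l->e[i].name, name) == 0)
            return &l->e[i];
    return NULL;
}

/* Apply one acpi_osi= value; returns false only if the list is full. */
static bool osi_apply(struct osi_list *l, const char *arg)
{
    bool enable = arg[0] != '!';
    const char *name = enable ? arg : arg + 1;
    struct osi_entry *ent;
    size_t i;

    if (strcmp(arg, "!") == 0) {
        /* Bare "!": disable every string currently in the list. */
        for (i = 0; i < l->n; i++)
            l->e[i].enabled = false;
        return true;
    }

    ent = osi_find(l, name);
    if (!ent) {
        if (!enable)
            return true;    /* removing an unknown string: nothing to do */
        if (l->n == MAX_OSI)
            return false;
        ent = &l->e[l->n++];
        ent->name = name;
    }
    ent->enabled = enable;
    return true;
}

/* What _OSI(name) would answer in this model. */
static bool osi_query(struct osi_list *l, const char *name)
{
    struct osi_entry *ent = osi_find(l, name);

    return ent && ent->enabled;
}
```

In this model, applying "!" and then "Windows 2001 SP3" leaves a query
for "Windows 2006" answering false, so firmware code guarded by the
Vista check would be skipped.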
>>>>>>
>>>>>
>>>>> First of all, sorry for the late reply. I was kinda busy.
>>>>>
>>>>> I tried what you suggested but unfortunately the problem persists.
>>>>> This makes me believe that Windows XP does have some kind of DMI check here.
>>>>> Of course, while a BIOS update may solve this, I would prefer that Linux
>>>>> also be able to boot with this broken BIOS.
>>>>>
>>>>> If you are certain that WinXP doesn't use DMI checks,
>>>>> it could be that WinXP's driver of ICH7M's SATA controller applies
>>>>> a quirk and sets that irq line to #20.
>>>>
>>>> Can you post the dmesg output from a bootup attempt with those options?
>>>>
>>>> You may also want to try adding just: acpi_osi=!
>>>>
>>>
>>> None of the three combinations booted successfully.
>>>
>>> Here are a couple of dmesgs:
>>>
>>> Params: acpi_osi="Windows 2001 SP3"
>>> http://pastebin.com/vF3BSuhc
>>>
>>> Params: acpi_osi=! acpi_osi="Windows 2001 SP3"
>>> http://pastebin.com/BuUzc3es
>>>
>>> Params: acpi_osi=!
>>> http://pastebin.com/u7uRx8Ru
>>
>> I'm not sure the option is actually taking effect properly. There
>> should be a message "Disabled all _OSI OS vendors" that shows up in
>> dmesg with the ! option. Can you try:
>>
>> acpi_osi="!" acpi_osi="Windows 2001 SP3"
>>
>> (with the quotes around the ! character).
>>
>
> The following command line worked:
> acpi_osi= acpi_osi="Windows 2001 SP3"
>
> So, it seems that the BIOS is broken. Is there any way to fix this,
> without resorting to the hackish DMI checks?

Probably not. Have you checked for a newer BIOS version for this machine?

If there is none, this is likely similar to a number of other systems listed in
acpi_osi_dmi_table in drivers/acpi/blacklist.c, which need to disable
reporting Vista support.
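
As a rough idea of what such a table entry amounts to, here is a
self-contained userspace sketch in C. The struct and function names are
stand-ins invented for illustration, not the kernel's DMI API, and the
vendor and product match strings for this netbook are guesses that would
need to be confirmed against the machine's actual DMI data. Matching is
by substring, which is an assumption about how a non-exact DMI match
behaves.

```c
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for the DMI strings a machine reports. */
struct dmi_info {
    const char *sys_vendor;
    const char *product_name;
};

/* One quirk entry: a label plus the substrings that identify the machine. */
struct osi_quirk {
    const char *ident;
    const char *sys_vendor;
    const char *product_name;
};

/* Hypothetical entry; the match strings are unverified guesses. */
static const struct osi_quirk disable_vista_quirks[] = {
    { "Toshiba NB100", "TOSHIBA", "NB100" },
};

/* Return the first quirk whose substrings all match, or NULL if none do. */
static const struct osi_quirk *find_osi_quirk(const struct dmi_info *dmi)
{
    size_t i;
    size_t n = sizeof(disable_vista_quirks) / sizeof(disable_vista_quirks[0]);

    for (i = 0; i < n; i++) {
        const struct osi_quirk *q = &disable_vista_quirks[i];

        if (strstr(dmi->sys_vendor, q->sys_vendor) &&
            strstr(dmi->product_name, q->product_name))
            return q;
    }
    return NULL;
}
```

A machine that matches would then have the Vista _OSI string disabled at
boot, which is the same effect the manual acpi_osi parameters were used
to test above.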

Patch

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index f9476fb..eeedf80 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -2437,6 +2437,14 @@  static void ata_eh_link_report(struct ata_link *link)
                             ehc->i.action, frozen, tries_buf);
                 if (desc)
                         ata_dev_err(ehc->i.dev, "%s\n", desc);
+               ehc->i.dev->exce_cnt++;
+               ata_dev_warn(ehc->i.dev, "Number of exceptions: %d\n", ehc->i.dev->exce_cnt);
+               /*
+                * The device is failing terribly,
+                * disable it to prevent damage.
+                */
+               if (ehc->i.dev->exce_cnt > 2)
+                       ata_dev_disable(ehc->i.dev);
         } else {
                 ata_link_err(link, "exception Emask 0x%x "
                              "SAct 0x%x SErr 0x%x action 0x%x%s%s\n",
diff --git a/include/linux/libata.h b/include/linux/libata.h
index eae7a05..fa52ee6 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -660,7 +660,8 @@  struct ata_device {
         u8                      devslp_timing[ATA_LOG_DEVSLP_SIZE];

         /* error history */
-       int                     spdn_cnt;
+       int                     spdn_cnt; /* Number of speed_downs */
+       int                     exce_cnt; /* Number of exceptions that happened */
         /* ering is CLEAR_END, read comment above CLEAR_END */
         struct ata_ering        ering;