Message ID | DM8PR11MB5702255A6A92F735D90A4446868B9@DM8PR11MB5702.namprd11.prod.outlook.com |
---|---|
State | New |
Headers | show |
Series | HW power fault defect cause system hang on kernel 5.4.y | expand |
[+cc Stuart, author of 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt race"), Lukas, pciehp expert] On Tue, Nov 02, 2021 at 03:45:00AM +0000, Bao, Joseph wrote: > Hi, dear kernel developer, > > Recently we encounter system hang (dead spinlock) when move to > kernel linux-5.4.y. > > Finally, we use bisect to locate the suspicious commit > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4667358dab9cc07da044d5bc087065545b1000df. 4667358dab9c backported upstream commit 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt race") to v5.4.69 just over a year ago. > Our system has some HW defect, which will wrongly set > PCI_EXP_SLTSTA_PFD high, and this commit will lead to infinite loop > jumping to read_status (no chance to clear status PCI_EXP_SLTSTA_PFD > bit since ctrl is not updated), I know this is our HW defect, but > this commit makes kernel trapped in this isr function and leads to > kernel hang (then the user could not get useful information to show > what's wrong), which I think is not expected behavior, so I would > like to report to you for discussion. I guess this happens because the first time we handle PFD, pciehp_ist() sets "ctrl->power_fault_detected = 1", and when power_fault_detected is set, pciehp_isr() won't clear PFD from PCI_EXP_SLTSTA? It looks like the only place we clear power_fault_detected is in pciehp_power_on_slot(), and I don't think we call that unless we have a presence detect or link status change. It would definitely be nice if we could arrange so this hardware defect didn't cause a kernel hang. I think the diff below is the backport of 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt race"). > diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c > index 356786a3b7f4b..88b996764ff95 100644 > --- a/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=ca767cf0152d18fc299cde85b18d1f46ac21e1ba > +++ b/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=4667358dab9cc07da044d5bc087065545b1000df > @@ -529,7 +529,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > struct controller *ctrl = (struct controller *)dev_id; > struct pci_dev *pdev = ctrl_dev(ctrl); > struct device *parent = pdev->dev.parent; > - u16 status, events; > + u16 status, events = 0; > > /* > * Interrupts only occur in D3hot or shallower and only if enabled > @@ -554,6 +554,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > } > } > > +read_status: > pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status); > if (status == (u16) ~0) { > ctrl_info(ctrl, "%s: no response from device\n", __func__); > @@ -566,24 +567,37 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > * Slot Status contains plain status bits as well as event > * notification bits; right now we only want the event bits. > */ > - events = status & (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD | > - PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC | > - PCI_EXP_SLTSTA_DLLSC); > + status &= PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD | > + PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC | > + PCI_EXP_SLTSTA_DLLSC; > > /* > * If we've already reported a power fault, don't report it again > * until we've done something to handle it. > */ > if (ctrl->power_fault_detected) > - events &= ~PCI_EXP_SLTSTA_PFD; > + status &= ~PCI_EXP_SLTSTA_PFD; > > + events |= status; > if (!events) { > if (parent) > pm_runtime_put(parent); > return IRQ_NONE; > } > > - pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); > + if (status) { > + pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); > + > + /* > + * In MSI mode, all event bits must be zero before the port > + * will send a new interrupt (PCIe Base Spec r5.0 sec 6.7.3.4). > + * So re-read the Slot Status register in case a bit was set > + * between read and write. > + */ > + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) > + goto read_status; > + } > + > ctrl_dbg(ctrl, "pending interrupts %#06x from Slot Status\n", events); > if (parent) > pm_runtime_put(parent);
Hi Lukas/Stuart, Want to follow up with you whether the system hang is expected when HW has a defect keeping PCI_EXP_SLTSTA_PFD always HIGH. Regards Joseph -----Original Message----- From: Bjorn Helgaas <helgaas@kernel.org> Sent: Wednesday, November 3, 2021 6:34 AM To: Bao, Joseph <joseph.bao@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com>; linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Stuart Hayes <stuart.w.hayes@gmail.com>; Lukas Wunner <lukas@wunner.de> Subject: Re: HW power fault defect cause system hang on kernel 5.4.y [+cc Stuart, author of 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt race"), Lukas, pciehp expert] On Tue, Nov 02, 2021 at 03:45:00AM +0000, Bao, Joseph wrote: > Hi, dear kernel developer, > > Recently we encounter system hang (dead spinlock) when move to kernel > linux-5.4.y. > > Finally, we use bisect to locate the suspicious commit > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4667358dab9cc07da044d5bc087065545b1000df. 4667358dab9c backported upstream commit 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt race") to v5.4.69 just over a year ago. > Our system has some HW defect, which will wrongly set > PCI_EXP_SLTSTA_PFD high, and this commit will lead to infinite loop > jumping to read_status (no chance to clear status PCI_EXP_SLTSTA_PFD > bit since ctrl is not updated), I know this is our HW defect, but this > commit makes kernel trapped in this isr function and leads to kernel > hang (then the user could not get useful information to show what's > wrong), which I think is not expected behavior, so I would like to > report to you for discussion. I guess this happens because the first time we handle PFD, pciehp_ist() sets "ctrl->power_fault_detected = 1", and when power_fault_detected is set, pciehp_isr() won't clear PFD from PCI_EXP_SLTSTA? It looks like the only place we clear power_fault_detected is in pciehp_power_on_slot(), and I don't think we call that unless we have a presence detect or link status change. It would definitely be nice if we could arrange so this hardware defect didn't cause a kernel hang. I think the diff below is the backport of 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt race"). > diff --git a/drivers/pci/hotplug/pciehp_hpc.c > b/drivers/pci/hotplug/pciehp_hpc.c > index 356786a3b7f4b..88b996764ff95 100644 > --- > a/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tre > e/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=ca767cf0152d18fc29 > 9cde85b18d1f46ac21e1ba > +++ b/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git > +++ /tree/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=4667358dab > +++ 9cc07da044d5bc087065545b1000df > @@ -529,7 +529,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > struct controller *ctrl = (struct controller *)dev_id; > struct pci_dev *pdev = ctrl_dev(ctrl); > struct device *parent = pdev->dev.parent; > - u16 status, events; > + u16 status, events = 0; > > /* > * Interrupts only occur in D3hot or shallower and only if enabled > @@ -554,6 +554,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > } > } > > +read_status: > pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status); > if (status == (u16) ~0) { > ctrl_info(ctrl, "%s: no response from device\n", __func__); @@ > -566,24 +567,37 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > * Slot Status contains plain status bits as well as event > * notification bits; right now we only want the event bits. > */ > - events = status & (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD | > - PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC | > - PCI_EXP_SLTSTA_DLLSC); > + status &= PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD | > + PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC | > + PCI_EXP_SLTSTA_DLLSC; > > /* > * If we've already reported a power fault, don't report it again > * until we've done something to handle it. > */ > if (ctrl->power_fault_detected) > - events &= ~PCI_EXP_SLTSTA_PFD; > + status &= ~PCI_EXP_SLTSTA_PFD; > > + events |= status; > if (!events) { > if (parent) > pm_runtime_put(parent); > return IRQ_NONE; > } > > - pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); > + if (status) { > + pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); > + > + /* > + * In MSI mode, all event bits must be zero before the port > + * will send a new interrupt (PCIe Base Spec r5.0 sec 6.7.3.4). > + * So re-read the Slot Status register in case a bit was set > + * between read and write. > + */ > + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) > + goto read_status; > + } > + > ctrl_dbg(ctrl, "pending interrupts %#06x from Slot Status\n", events); > if (parent) > pm_runtime_put(parent);
On Tue, Nov 09, 2021 at 07:59:59AM +0000, Bao, Joseph wrote: > Hi Lukas/Stuart, > Want to follow up with you whether the system hang is expected when > HW has a defect keeping PCI_EXP_SLTSTA_PFD always HIGH. A system hang in response to a hardware defect like this is never the expected situation. Worst case we should be able to work around it with a quirk. Far better would be a generic fix that could recognize and deal with the situation even without a quirk. But I don't know the fix yet. I'm just responding to encourage you to keep pestering us and not give up :) In the meantime, it might be worth opening a report at https://bugzilla.kernel.org with a description of how you trigger the problem, and attaching the complete dmesg log and "sudo lspci -vv" output. Bjorn
On 11/9/2021 1:59 AM, Bao, Joseph wrote: > Hi Lukas/Stuart, > Want to follow up with you whether the system hang is expected when HW has a defect keeping PCI_EXP_SLTSTA_PFD always HIGH. > > > Regards > Joseph > It does appear that the code will hang when pciehp_isr sees PFD high and power_fault_detected isn't yet set, if PFD doesn't clear when a 1 is written to it. It will continue to loop trying to clear it, and power_fault_detected won't get set until after it gets through this loop. It wouldn't be hard to modify that code to only attempt to clear each bit once. I wouldn't expect the same event bit to get set twice within this loop, so this might fix it (I did not test). Alternately, a loop counter could be added to prevent it from looping more than some arbitrary number (6?) of times in case of stuck bits. diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 3024d7e85e6a..3e502b4e8ef7 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -594,7 +594,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) struct controller *ctrl = (struct controller *)dev_id; struct pci_dev *pdev = ctrl_dev(ctrl); struct device *parent = pdev->dev.parent; - u16 status, events = 0; + u16 changed, status, events = 0; /* * Interrupts only occur in D3hot or shallower and only if enabled @@ -643,6 +643,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) if (ctrl->power_fault_detected) status &= ~PCI_EXP_SLTSTA_PFD; + changed = status ^ events; events |= status; if (!events) { if (parent) @@ -659,7 +660,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) * So re-read the Slot Status register in case a bit was set * between read and write. */ - if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode && changed) goto read_status; } > -----Original Message----- > From: Bjorn Helgaas <helgaas@kernel.org>> Sent: Wednesday, November 3, 2021 6:34 AM > To: Bao, Joseph <joseph.bao@intel.com> > Cc: Bjorn Helgaas <bhelgaas@google.com>; linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Stuart Hayes <stuart.w.hayes@gmail.com>; Lukas Wunner <lukas@wunner.de> > Subject: Re: HW power fault defect cause system hang on kernel 5.4.y > > [+cc Stuart, author of 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt race"), Lukas, pciehp expert] > > On Tue, Nov 02, 2021 at 03:45:00AM +0000, Bao, Joseph wrote: >> Hi, dear kernel developer, >> >> Recently we encounter system hang (dead spinlock) when move to kernel >> linux-5.4.y. >> >> Finally, we use bisect to locate the suspicious commit >> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4667358dab9cc07da044d5bc087065545b1000df. > > 4667358dab9c backported upstream commit 8edf5332c393 ("PCI: pciehp: > Fix MSI interrupt race") to v5.4.69 just over a year ago. > >> Our system has some HW defect, which will wrongly set >> PCI_EXP_SLTSTA_PFD high, and this commit will lead to infinite loop >> jumping to read_status (no chance to clear status PCI_EXP_SLTSTA_PFD >> bit since ctrl is not updated), I know this is our HW defect, but this >> commit makes kernel trapped in this isr function and leads to kernel >> hang (then the user could not get useful information to show what's >> wrong), which I think is not expected behavior, so I would like to >> report to you for discussion. > > I guess this happens because the first time we handle PFD, > pciehp_ist() sets "ctrl->power_fault_detected = 1", and when power_fault_detected is set, pciehp_isr() won't clear PFD from PCI_EXP_SLTSTA? > > It looks like the only place we clear power_fault_detected is in pciehp_power_on_slot(), and I don't think we call that unless we have a presence detect or link status change. > > It would definitely be nice if we could arrange so this hardware defect didn't cause a kernel hang. > > I think the diff below is the backport of 8edf5332c393 ("PCI: pciehp: > Fix MSI interrupt race"). > >> diff --git a/drivers/pci/hotplug/pciehp_hpc.c >> b/drivers/pci/hotplug/pciehp_hpc.c >> index 356786a3b7f4b..88b996764ff95 100644 >> --- >> a/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tre >> e/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=ca767cf0152d18fc29 >> 9cde85b18d1f46ac21e1ba >> +++ b/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git >> +++ /tree/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=4667358dab >> +++ 9cc07da044d5bc087065545b1000df >> @@ -529,7 +529,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) >> struct controller *ctrl = (struct controller *)dev_id; >> struct pci_dev *pdev = ctrl_dev(ctrl); >> struct device *parent = pdev->dev.parent; >> - u16 status, events; >> + u16 status, events = 0; >> >> /* >> * Interrupts only occur in D3hot or shallower and only if enabled >> @@ -554,6 +554,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) >> } >> } >> >> +read_status: >> pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status); >> if (status == (u16) ~0) { >> ctrl_info(ctrl, "%s: no response from device\n", __func__); @@ >> -566,24 +567,37 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) >> * Slot Status contains plain status bits as well as event >> * notification bits; right now we only want the event bits. >> */ >> - events = status & (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD | >> - PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC | >> - PCI_EXP_SLTSTA_DLLSC); >> + status &= PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD | >> + PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC | >> + PCI_EXP_SLTSTA_DLLSC; >> >> /* >> * If we've already reported a power fault, don't report it again >> * until we've done something to handle it. >> */ >> if (ctrl->power_fault_detected) >> - events &= ~PCI_EXP_SLTSTA_PFD; >> + status &= ~PCI_EXP_SLTSTA_PFD; >> >> + events |= status; >> if (!events) { >> if (parent) >> pm_runtime_put(parent); >> return IRQ_NONE; >> } >> >> - pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); >> + if (status) { >> + pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); >> + >> + /* >> + * In MSI mode, all event bits must be zero before the port >> + * will send a new interrupt (PCIe Base Spec r5.0 sec 6.7.3.4). >> + * So re-read the Slot Status register in case a bit was set >> + * between read and write. >> + */ >> + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) >> + goto read_status; >> + } >> + >> ctrl_dbg(ctrl, "pending interrupts %#06x from Slot Status\n", events); >> if (parent) >> pm_runtime_put(parent);
Hi Stuart, The patch you attached does not work for me, the logic should be ok but I have not figured out why. But a loop counter actually helps mitigate this issue. Thank you very much! diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 88b996764ff9..3d2c336ff740 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -529,7 +529,8 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) struct controller *ctrl = (struct controller *)dev_id; struct pci_dev *pdev = ctrl_dev(ctrl); struct device *parent = pdev->dev.parent; - u16 status, events = 0; + u16 status, events, read_retry_count = 0; + u8 READ_RETRY_MAX = 6; /* * Interrupts only occur in D3hot or shallower and only if enabled @@ -585,7 +586,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) return IRQ_NONE; } - if (status) { + if (status && (read_retry_count < READ_RETRY_MAX)) { pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); /* @@ -594,8 +595,10 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) * So re-read the Slot Status register in case a bit was set * between read and write. */ - if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) { + read_retry_count++; goto read_status; + } } Regards. Joseph Regards. Joseph -----Original Message----- From: stuart hayes <stuart.w.hayes@gmail.com> Sent: Tuesday, November 9, 2021 11:37 PM To: Bao, Joseph <joseph.bao@intel.com>; Bjorn Helgaas <helgaas@kernel.org> Cc: Bjorn Helgaas <bhelgaas@google.com>; linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Lukas Wunner <lukas@wunner.de> Subject: Re: HW power fault defect cause system hang on kernel 5.4.y On 11/9/2021 1:59 AM, Bao, Joseph wrote: > Hi Lukas/Stuart, > Want to follow up with you whether the system hang is expected when HW has a defect keeping PCI_EXP_SLTSTA_PFD always HIGH. > > > Regards > Joseph > It does appear that the code will hang when pciehp_isr sees PFD high and power_fault_detected isn't yet set, if PFD doesn't clear when a 1 is written to it. It will continue to loop trying to clear it, and power_fault_detected won't get set until after it gets through this loop. It wouldn't be hard to modify that code to only attempt to clear each bit once. I wouldn't expect the same event bit to get set twice within this loop, so this might fix it (I did not test). Alternately, a loop counter could be added to prevent it from looping more than some arbitrary number (6?) of times in case of stuck bits. diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 3024d7e85e6a..3e502b4e8ef7 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -594,7 +594,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) struct controller *ctrl = (struct controller *)dev_id; struct pci_dev *pdev = ctrl_dev(ctrl); struct device *parent = pdev->dev.parent; - u16 status, events = 0; + u16 changed, status, events = 0; /* * Interrupts only occur in D3hot or shallower and only if enabled @@ -643,6 +643,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) if (ctrl->power_fault_detected) status &= ~PCI_EXP_SLTSTA_PFD; + changed = status ^ events; events |= status; if (!events) { if (parent) @@ -659,7 +660,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) * So re-read the Slot Status register in case a bit was set * between read and write. */ - if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode && changed) goto read_status; } > -----Original Message----- > From: Bjorn Helgaas <helgaas@kernel.org>> Sent: Wednesday, November 3, > 2021 6:34 AM > To: Bao, Joseph <joseph.bao@intel.com> > Cc: Bjorn Helgaas <bhelgaas@google.com>; linux-pci@vger.kernel.org; > linux-kernel@vger.kernel.org; Stuart Hayes <stuart.w.hayes@gmail.com>; > Lukas Wunner <lukas@wunner.de> > Subject: Re: HW power fault defect cause system hang on kernel 5.4.y > > [+cc Stuart, author of 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt > race"), Lukas, pciehp expert] > > On Tue, Nov 02, 2021 at 03:45:00AM +0000, Bao, Joseph wrote: >> Hi, dear kernel developer, >> >> Recently we encounter system hang (dead spinlock) when move to kernel >> linux-5.4.y. >> >> Finally, we use bisect to locate the suspicious commit >> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4667358dab9cc07da044d5bc087065545b1000df. > > 4667358dab9c backported upstream commit 8edf5332c393 ("PCI: pciehp: > Fix MSI interrupt race") to v5.4.69 just over a year ago. > >> Our system has some HW defect, which will wrongly set >> PCI_EXP_SLTSTA_PFD high, and this commit will lead to infinite loop >> jumping to read_status (no chance to clear status PCI_EXP_SLTSTA_PFD >> bit since ctrl is not updated), I know this is our HW defect, but >> this commit makes kernel trapped in this isr function and leads to >> kernel hang (then the user could not get useful information to show >> what's wrong), which I think is not expected behavior, so I would >> like to report to you for discussion. > > I guess this happens because the first time we handle PFD, > pciehp_ist() sets "ctrl->power_fault_detected = 1", and when power_fault_detected is set, pciehp_isr() won't clear PFD from PCI_EXP_SLTSTA? > > It looks like the only place we clear power_fault_detected is in pciehp_power_on_slot(), and I don't think we call that unless we have a presence detect or link status change. > > It would definitely be nice if we could arrange so this hardware defect didn't cause a kernel hang. > > I think the diff below is the backport of 8edf5332c393 ("PCI: pciehp: > Fix MSI interrupt race"). > >> diff --git a/drivers/pci/hotplug/pciehp_hpc.c >> b/drivers/pci/hotplug/pciehp_hpc.c >> index 356786a3b7f4b..88b996764ff95 100644 >> --- >> a/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tr >> e >> e/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=ca767cf0152d18fc2 >> 9 >> 9cde85b18d1f46ac21e1ba >> +++ b/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.gi >> +++ t >> +++ /tree/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=4667358da >> +++ b >> +++ 9cc07da044d5bc087065545b1000df >> @@ -529,7 +529,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) >> struct controller *ctrl = (struct controller *)dev_id; >> struct pci_dev *pdev = ctrl_dev(ctrl); >> struct device *parent = pdev->dev.parent; >> - u16 status, events; >> + u16 status, events = 0; >> >> /* >> * Interrupts only occur in D3hot or shallower and only if enabled >> @@ -554,6 +554,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) >> } >> } >> >> +read_status: >> pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status); >> if (status == (u16) ~0) { >> ctrl_info(ctrl, "%s: no response from device\n", __func__); @@ >> -566,24 +567,37 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) >> * Slot Status contains plain status bits as well as event >> * notification bits; right now we only want the event bits. >> */ >> - events = status & (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD | >> - PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC | >> - PCI_EXP_SLTSTA_DLLSC); >> + status &= PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD | >> + PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC | >> + PCI_EXP_SLTSTA_DLLSC; >> >> /* >> * If we've already reported a power fault, don't report it again >> * until we've done something to handle it. >> */ >> if (ctrl->power_fault_detected) >> - events &= ~PCI_EXP_SLTSTA_PFD; >> + status &= ~PCI_EXP_SLTSTA_PFD; >> >> + events |= status; >> if (!events) { >> if (parent) >> pm_runtime_put(parent); >> return IRQ_NONE; >> } >> >> - pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); >> + if (status) { >> + pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); >> + >> + /* >> + * In MSI mode, all event bits must be zero before the port >> + * will send a new interrupt (PCIe Base Spec r5.0 sec 6.7.3.4). >> + * So re-read the Slot Status register in case a bit was set >> + * between read and write. >> + */ >> + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) >> + goto read_status; >> + } >> + >> ctrl_dbg(ctrl, "pending interrupts %#06x from Slot Status\n", events); >> if (parent) >> pm_runtime_put(parent);
On 11/10/2021 3:20 AM, Bao, Joseph wrote: > Hi Stuart, > The patch you attached does not work for me, the logic should be ok but I have not figured out why. But a loop counter actually helps mitigate this issue. Thank you very much! > I think I see the error in the logic... the patch I suggested yesterday would see bits that got cleared as changed and cause it to loop forever, when all we want to check for is if any new bits get set in each loop. Maybe try this patch? The only difference is the line "changed = status ^ (events & status);", which hopefully will only compare those status bits that are currently set against those same bits from the previous loop. diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 3024d7e85e6a..bf8fe868a293 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -594,7 +594,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) struct controller *ctrl = (struct controller *)dev_id; struct pci_dev *pdev = ctrl_dev(ctrl); struct device *parent = pdev->dev.parent; - u16 status, events = 0; + u16 changed, status, events = 0; /* * Interrupts only occur in D3hot or shallower and only if enabled @@ -643,6 +643,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) if (ctrl->power_fault_detected) status &= ~PCI_EXP_SLTSTA_PFD; + changed = status ^ (events & status); events |= status; if (!events) { if (parent) @@ -659,7 +660,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) * So re-read the Slot Status register in case a bit was set * between read and write. */ - if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode && changed) goto read_status; } > diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c > index 88b996764ff9..3d2c336ff740 100644 > --- a/drivers/pci/hotplug/pciehp_hpc.c > +++ b/drivers/pci/hotplug/pciehp_hpc.c > @@ -529,7 +529,8 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > struct controller *ctrl = (struct controller *)dev_id; > struct pci_dev *pdev = ctrl_dev(ctrl); > struct device *parent = pdev->dev.parent; > - u16 status, events = 0; > + u16 status, events, read_retry_count = 0; > + u8 READ_RETRY_MAX = 6; > > /* > * Interrupts only occur in D3hot or shallower and only if enabled > @@ -585,7 +586,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > return IRQ_NONE; > } > > - if (status) { > + if (status && (read_retry_count < READ_RETRY_MAX)) { > pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); > > /* > @@ -594,8 +595,10 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > * So re-read the Slot Status register in case a bit was set > * between read and write. > */ > - if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) > + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) { > + read_retry_count++; > goto read_status; > + } > } > Regards. > Joseph > > Regards. > Joseph > > -----Original Message----- > From: stuart hayes <stuart.w.hayes@gmail.com> > Sent: Tuesday, November 9, 2021 11:37 PM > To: Bao, Joseph <joseph.bao@intel.com>; Bjorn Helgaas <helgaas@kernel.org> > Cc: Bjorn Helgaas <bhelgaas@google.com>; linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Lukas Wunner <lukas@wunner.de> > Subject: Re: HW power fault defect cause system hang on kernel 5.4.y > > > > On 11/9/2021 1:59 AM, Bao, Joseph wrote: >> Hi Lukas/Stuart, >> Want to follow up with you whether the system hang is expected when HW has a defect keeping PCI_EXP_SLTSTA_PFD always HIGH. >> >> >> Regards >> Joseph >> > > It does appear that the code will hang when pciehp_isr sees PFD high and power_fault_detected isn't yet set, if PFD doesn't clear when a 1 is written to it. It will continue to loop trying to clear it, and power_fault_detected won't get set until after it gets through this loop. > > It wouldn't be hard to modify that code to only attempt to clear each bit once. I wouldn't expect the same event bit to get set twice within this loop, so this might fix it (I did not test). Alternately, a loop counter could be added to prevent it from looping more than some arbitrary number > (6?) of times in case of stuck bits. > > diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c > index 3024d7e85e6a..3e502b4e8ef7 100644 > --- a/drivers/pci/hotplug/pciehp_hpc.c > +++ b/drivers/pci/hotplug/pciehp_hpc.c > @@ -594,7 +594,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > struct controller *ctrl = (struct controller *)dev_id; > struct pci_dev *pdev = ctrl_dev(ctrl); > struct device *parent = pdev->dev.parent; > - u16 status, events = 0; > + u16 changed, status, events = 0; > > /* > * Interrupts only occur in D3hot or shallower and only if enabled @@ -643,6 +643,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > if (ctrl->power_fault_detected) > status &= ~PCI_EXP_SLTSTA_PFD; > > + changed = status ^ events; > events |= status; > if (!events) { > if (parent) > @@ -659,7 +660,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > * So re-read the Slot Status register in case a bit was set > * between read and write. > */ > - if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) > + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode && changed) > goto read_status; > } > > > >> -----Original Message----- >> From: Bjorn Helgaas <helgaas@kernel.org>> Sent: Wednesday, November 3, >> 2021 6:34 AM >> To: Bao, Joseph <joseph.bao@intel.com> >> Cc: Bjorn Helgaas <bhelgaas@google.com>; linux-pci@vger.kernel.org; >> linux-kernel@vger.kernel.org; Stuart Hayes <stuart.w.hayes@gmail.com>; >> Lukas Wunner <lukas@wunner.de> >> Subject: Re: HW power fault defect cause system hang on kernel 5.4.y >> >> [+cc Stuart, author of 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt >> race"), Lukas, pciehp expert] >> >> On Tue, Nov 02, 2021 at 03:45:00AM +0000, Bao, Joseph wrote: >>> Hi, dear kernel developer, >>> >>> Recently we encounter system hang (dead spinlock) when move to kernel >>> linux-5.4.y. >>> >>> Finally, we use bisect to locate the suspicious commit >>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4667358dab9cc07da044d5bc087065545b1000df. >> >> 4667358dab9c backported upstream commit 8edf5332c393 ("PCI: pciehp: >> Fix MSI interrupt race") to v5.4.69 just over a year ago. >> >>> Our system has some HW defect, which will wrongly set >>> PCI_EXP_SLTSTA_PFD high, and this commit will lead to infinite loop >>> jumping to read_status (no chance to clear status PCI_EXP_SLTSTA_PFD >>> bit since ctrl is not updated), I know this is our HW defect, but >>> this commit makes kernel trapped in this isr function and leads to >>> kernel hang (then the user could not get useful information to show >>> what's wrong), which I think is not expected behavior, so I would >>> like to report to you for discussion. >> >> I guess this happens because the first time we handle PFD, >> pciehp_ist() sets "ctrl->power_fault_detected = 1", and when power_fault_detected is set, pciehp_isr() won't clear PFD from PCI_EXP_SLTSTA? >> >> It looks like the only place we clear power_fault_detected is in pciehp_power_on_slot(), and I don't think we call that unless we have a presence detect or link status change. >> >> It would definitely be nice if we could arrange so this hardware defect didn't cause a kernel hang. >> >> I think the diff below is the backport of 8edf5332c393 ("PCI: pciehp: >> Fix MSI interrupt race"). >> >>> diff --git a/drivers/pci/hotplug/pciehp_hpc.c >>> b/drivers/pci/hotplug/pciehp_hpc.c >>> index 356786a3b7f4b..88b996764ff95 100644 >>> --- >>> a/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tr >>> e >>> e/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=ca767cf0152d18fc2 >>> 9 >>> 9cde85b18d1f46ac21e1ba >>> +++ b/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.gi >>> +++ t >>> +++ /tree/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=4667358da >>> +++ b >>> +++ 9cc07da044d5bc087065545b1000df >>> @@ -529,7 +529,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) >>> struct controller *ctrl = (struct controller *)dev_id; >>> struct pci_dev *pdev = ctrl_dev(ctrl); >>> struct device *parent = pdev->dev.parent; >>> - u16 status, events; >>> + u16 status, events = 0; >>> >>> /* >>> * Interrupts only occur in D3hot or shallower and only if enabled >>> @@ -554,6 +554,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) >>> } >>> } >>> >>> +read_status: >>> pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status); >>> if (status == (u16) ~0) { >>> ctrl_info(ctrl, "%s: no response from device\n", __func__); @@ >>> -566,24 +567,37 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) >>> * Slot Status contains plain status bits as well as event >>> * notification bits; right now we only want the event bits. >>> */ >>> - events = status & (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD | >>> - PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC | >>> - PCI_EXP_SLTSTA_DLLSC); >>> + status &= PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD | >>> + PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC | >>> + PCI_EXP_SLTSTA_DLLSC; >>> >>> /* >>> * If we've already reported a power fault, don't report it again >>> * until we've done something to handle it. >>> */ >>> if (ctrl->power_fault_detected) >>> - events &= ~PCI_EXP_SLTSTA_PFD; >>> + status &= ~PCI_EXP_SLTSTA_PFD; >>> >>> + events |= status; >>> if (!events) { >>> if (parent) >>> pm_runtime_put(parent); >>> return IRQ_NONE; >>> } >>> >>> - pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); >>> + if (status) { >>> + pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); >>> + >>> + /* >>> + * In MSI mode, all event bits must be zero before the port >>> + * will send a new interrupt (PCIe Base Spec r5.0 sec 6.7.3.4). >>> + * So re-read the Slot Status register in case a bit was set >>> + * between read and write. >>> + */ >>> + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) >>> + goto read_status; >>> + } >>> + >>> ctrl_dbg(ctrl, "pending interrupts %#06x from Slot Status\n", events); >>> if (parent) >>> pm_runtime_put(parent);
Hi Stuart, This patch fixes the issue, it does as expected. Thank you! Regards. Joseph -----Original Message----- From: stuart hayes <stuart.w.hayes@gmail.com> Sent: Wednesday, November 10, 2021 11:57 PM To: Bao, Joseph <joseph.bao@intel.com>; Bjorn Helgaas <helgaas@kernel.org> Cc: Bjorn Helgaas <bhelgaas@google.com>; linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Lukas Wunner <lukas@wunner.de> Subject: Re: HW power fault defect cause system hang on kernel 5.4.y On 11/10/2021 3:20 AM, Bao, Joseph wrote: > Hi Stuart, > The patch you attached does not work for me, the logic should be ok but I have not figured out why. But a loop counter actually helps mitigate this issue. Thank you very much! > I think I see the error in the logic... the patch I suggested yesterday would see bits that got cleared as changed and cause it to loop forever, when all we want to check for is if any new bits get set in each loop. Maybe try this patch? The only difference is the line "changed = status ^ (events & status);", which hopefully will only compare those status bits that are currently set against those same bits from the previous loop. diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 3024d7e85e6a..bf8fe868a293 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -594,7 +594,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) struct controller *ctrl = (struct controller *)dev_id; struct pci_dev *pdev = ctrl_dev(ctrl); struct device *parent = pdev->dev.parent; - u16 status, events = 0; + u16 changed, status, events = 0; /* * Interrupts only occur in D3hot or shallower and only if enabled @@ -643,6 +643,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) if (ctrl->power_fault_detected) status &= ~PCI_EXP_SLTSTA_PFD; + changed = status ^ (events & status); events |= status; if (!events) { if (parent) @@ -659,7 +660,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) * So re-read the Slot Status register in case a bit was set * between read and write. */ - if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode && changed) goto read_status; } > diff --git a/drivers/pci/hotplug/pciehp_hpc.c > b/drivers/pci/hotplug/pciehp_hpc.c > index 88b996764ff9..3d2c336ff740 100644 > --- a/drivers/pci/hotplug/pciehp_hpc.c > +++ b/drivers/pci/hotplug/pciehp_hpc.c > @@ -529,7 +529,8 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > struct controller *ctrl = (struct controller *)dev_id; > struct pci_dev *pdev = ctrl_dev(ctrl); > struct device *parent = pdev->dev.parent; > - u16 status, events = 0; > + u16 status, events, read_retry_count = 0; > + u8 READ_RETRY_MAX = 6; > > /* > * Interrupts only occur in D3hot or shallower and only if > enabled @@ -585,7 +586,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > return IRQ_NONE; > } > > - if (status) { > + if (status && (read_retry_count < READ_RETRY_MAX)) { > pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, > events); > > /* > @@ -594,8 +595,10 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > * So re-read the Slot Status register in case a bit was set > * between read and write. > */ > - if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) > + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) { > + read_retry_count++; > goto read_status; > + } > } > Regards. > Joseph > > Regards. > Joseph > > -----Original Message----- > From: stuart hayes <stuart.w.hayes@gmail.com> > Sent: Tuesday, November 9, 2021 11:37 PM > To: Bao, Joseph <joseph.bao@intel.com>; Bjorn Helgaas > <helgaas@kernel.org> > Cc: Bjorn Helgaas <bhelgaas@google.com>; linux-pci@vger.kernel.org; > linux-kernel@vger.kernel.org; Lukas Wunner <lukas@wunner.de> > Subject: Re: HW power fault defect cause system hang on kernel 5.4.y > > > > On 11/9/2021 1:59 AM, Bao, Joseph wrote: >> Hi Lukas/Stuart, >> Want to follow up with you whether the system hang is expected when HW has a defect keeping PCI_EXP_SLTSTA_PFD always HIGH. >> >> >> Regards >> Joseph >> > > It does appear that the code will hang when pciehp_isr sees PFD high and power_fault_detected isn't yet set, if PFD doesn't clear when a 1 is written to it. It will continue to loop trying to clear it, and power_fault_detected won't get set until after it gets through this loop. > > It wouldn't be hard to modify that code to only attempt to clear each > bit once. I wouldn't expect the same event bit to get set twice > within this loop, so this might fix it (I did not test). Alternately, > a loop counter could be added to prevent it from looping more than > some arbitrary number > (6?) of times in case of stuck bits. > > diff --git a/drivers/pci/hotplug/pciehp_hpc.c > b/drivers/pci/hotplug/pciehp_hpc.c > index 3024d7e85e6a..3e502b4e8ef7 100644 > --- a/drivers/pci/hotplug/pciehp_hpc.c > +++ b/drivers/pci/hotplug/pciehp_hpc.c > @@ -594,7 +594,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > struct controller *ctrl = (struct controller *)dev_id; > struct pci_dev *pdev = ctrl_dev(ctrl); > struct device *parent = pdev->dev.parent; > - u16 status, events = 0; > + u16 changed, status, events = 0; > > /* > * Interrupts only occur in D3hot or shallower and only if enabled @@ -643,6 +643,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > if (ctrl->power_fault_detected) > status &= ~PCI_EXP_SLTSTA_PFD; > > + changed = status ^ events; > events |= status; > if (!events) { > if (parent) > @@ -659,7 +660,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > * So re-read the Slot Status register in case a bit was set > * between read and write. > */ > - if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) > + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode && changed) > goto read_status; > } > > > >> -----Original Message----- >> From: Bjorn Helgaas <helgaas@kernel.org>> Sent: Wednesday, November >> 3, >> 2021 6:34 AM >> To: Bao, Joseph <joseph.bao@intel.com> >> Cc: Bjorn Helgaas <bhelgaas@google.com>; linux-pci@vger.kernel.org; >> linux-kernel@vger.kernel.org; Stuart Hayes >> <stuart.w.hayes@gmail.com>; Lukas Wunner <lukas@wunner.de> >> Subject: Re: HW power fault defect cause system hang on kernel 5.4.y >> >> [+cc Stuart, author of 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt >> race"), Lukas, pciehp expert] >> >> On Tue, Nov 02, 2021 at 03:45:00AM +0000, Bao, Joseph wrote: >>> Hi, dear kernel developer, >>> >>> Recently we encounter system hang (dead spinlock) when move to >>> kernel linux-5.4.y. >>> >>> Finally, we use bisect to locate the suspicious commit >>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4667358dab9cc07da044d5bc087065545b1000df. >> >> 4667358dab9c backported upstream commit 8edf5332c393 ("PCI: pciehp: >> Fix MSI interrupt race") to v5.4.69 just over a year ago. >> >>> Our system has some HW defect, which will wrongly set >>> PCI_EXP_SLTSTA_PFD high, and this commit will lead to infinite loop >>> jumping to read_status (no chance to clear status PCI_EXP_SLTSTA_PFD >>> bit since ctrl is not updated), I know this is our HW defect, but >>> this commit makes kernel trapped in this isr function and leads to >>> kernel hang (then the user could not get useful information to show >>> what's wrong), which I think is not expected behavior, so I would >>> like to report to you for discussion. >> >> I guess this happens because the first time we handle PFD, >> pciehp_ist() sets "ctrl->power_fault_detected = 1", and when power_fault_detected is set, pciehp_isr() won't clear PFD from PCI_EXP_SLTSTA? >> >> It looks like the only place we clear power_fault_detected is in pciehp_power_on_slot(), and I don't think we call that unless we have a presence detect or link status change. >> >> It would definitely be nice if we could arrange so this hardware defect didn't cause a kernel hang. >> >> I think the diff below is the backport of 8edf5332c393 ("PCI: pciehp: >> Fix MSI interrupt race"). >> >>> diff --git a/drivers/pci/hotplug/pciehp_hpc.c >>> b/drivers/pci/hotplug/pciehp_hpc.c >>> index 356786a3b7f4b..88b996764ff95 100644 >>> --- >>> a/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/t >>> r >>> e >>> e/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=ca767cf0152d18fc >>> 2 >>> 9 >>> 9cde85b18d1f46ac21e1ba >>> +++ b/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.g >>> +++ i >>> +++ t >>> +++ /tree/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=4667358d >>> +++ a >>> +++ b >>> +++ 9cc07da044d5bc087065545b1000df >>> @@ -529,7 +529,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) >>> struct controller *ctrl = (struct controller *)dev_id; >>> struct pci_dev *pdev = ctrl_dev(ctrl); >>> struct device *parent = pdev->dev.parent; >>> - u16 status, events; >>> + u16 status, events = 0; >>> >>> /* >>> * Interrupts only occur in D3hot or shallower and only if >>> enabled @@ -554,6 +554,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) >>> } >>> } >>> >>> +read_status: >>> pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status); >>> if (status == (u16) ~0) { >>> ctrl_info(ctrl, "%s: no response from device\n", __func__); @@ >>> -566,24 +567,37 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) >>> * Slot Status contains plain status bits as well as event >>> * notification bits; right now we only want the event bits. >>> */ >>> - events = status & (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD | >>> - PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC | >>> - PCI_EXP_SLTSTA_DLLSC); >>> + status &= PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD | >>> + PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC | >>> + PCI_EXP_SLTSTA_DLLSC; >>> >>> /* >>> * If we've already reported a power fault, don't report it again >>> * until we've done something to handle it. >>> */ >>> if (ctrl->power_fault_detected) >>> - events &= ~PCI_EXP_SLTSTA_PFD; >>> + status &= ~PCI_EXP_SLTSTA_PFD; >>> >>> + events |= status; >>> if (!events) { >>> if (parent) >>> pm_runtime_put(parent); >>> return IRQ_NONE; >>> } >>> >>> - pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); >>> + if (status) { >>> + pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); >>> + >>> + /* >>> + * In MSI mode, all event bits must be zero before the port >>> + * will send a new interrupt (PCIe Base Spec r5.0 sec 6.7.3.4). >>> + * So re-read the Slot Status register in case a bit was set >>> + * between read and write. >>> + */ >>> + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) >>> + goto read_status; >>> + } >>> + >>> ctrl_dbg(ctrl, "pending interrupts %#06x from Slot Status\n", events); >>> if (parent) >>> pm_runtime_put(parent);
Hi Bjorn, Thanks for the encouragement! Stuart already helps patch the hang issue, do I still go an open a report at https://bugzilla.kernel.org? Regards Joseph -----Original Message----- From: Bjorn Helgaas <helgaas@kernel.org> Sent: Tuesday, November 9, 2021 11:30 PM To: Bao, Joseph <joseph.bao@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com>; linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Stuart Hayes <stuart.w.hayes@gmail.com>; Lukas Wunner <lukas@wunner.de> Subject: Re: HW power fault defect cause system hang on kernel 5.4.y On Tue, Nov 09, 2021 at 07:59:59AM +0000, Bao, Joseph wrote: > Hi Lukas/Stuart, > Want to follow up with you whether the system hang is expected when HW > has a defect keeping PCI_EXP_SLTSTA_PFD always HIGH. A system hang in response to a hardware defect like this is never the expected situation. Worst case we should be able to work around it with a quirk. Far better would be a generic fix that could recognize and deal with the situation even without a quirk. But I don't know the fix yet. I'm just responding to encourage you to keep pestering us and not give up :) In the meantime, it might be worth opening a report at https://bugzilla.kernel.org with a description of how you trigger the problem, and attaching the complete dmesg log and "sudo lspci -vv" output. Bjorn
[+CC Adding Sasha for visibility] > Thanks for the encouragement! Stuart already helps patch the hang issue, > do I still go an open a report at https://bugzilla.kernel.org? I am not sure what Bjorn would say here, however personally I say that it would be nice if you do open a bug there, if you have a minute, as it might help someone else who would stumble into this issue - a lot of people searches through Kernel's Bugzilla as part of their troubleshooting. Also, as this is a hang in 5.4, and a lot of people uses this kernel, it might warrant a back-port to stable and long-term kernels. Krzysztof
Yeah, other people may encounter a similar issue, I've created https://bugzilla.kernel.org/show_bug.cgi?id=214989 for reference including the patch from Stuart. Regards Joseph -----Original Message----- From: Krzysztof Wilczyński <kw@linux.com> Sent: Friday, November 12, 2021 8:44 AM To: Bao, Joseph <joseph.bao@intel.com> Cc: Bjorn Helgaas <helgaas@kernel.org>; Bjorn Helgaas <bhelgaas@google.com>; linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Stuart Hayes <stuart.w.hayes@gmail.com>; Lukas Wunner <lukas@wunner.de>; Sasha Levin <sashal@kernel.org> Subject: Re: HW power fault defect cause system hang on kernel 5.4.y [+CC Adding Sasha for visibility] > Thanks for the encouragement! Stuart already helps patch the hang > issue, do I still go an open a report at https://bugzilla.kernel.org? I am not sure what Bjorn would say here, however personally I say that it would be nice if you do open a bug there, if you have a minute, as it might help someone else who would stumble into this issue - a lot of people searches through Kernel's Bugzilla as part of their troubleshooting. Also, as this is a hang in 5.4, and a lot of people uses this kernel, it might warrant a back-port to stable and long-term kernels. Krzysztof
Hi Joseph, > Yeah, other people may encounter a similar issue, I've created > https://bugzilla.kernel.org/show_bug.cgi?id=214989 for reference > including the patch from Stuart. Thank you so much! I appreciate that. To add, the Bugzilla link above would need to be included in the commit message for the final patch as per the usual. Krzysztof
On Tue, Nov 02, 2021 at 03:45:00AM +0000, Bao, Joseph wrote: > Recently we encounter system hang (dead spinlock) when move to kernel > linux-5.4.y. > > Finally, we use bisect to locate the suspicious commit https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4667358dab9cc07da044d5bc087065545b1000df. > > Our system has some HW defect, which will wrongly set PCI_EXP_SLTSTA_PFD > high, and this commit will lead to infinite loop jumping to read_status > (no chance to clear status PCI_EXP_SLTSTA_PFD bit since ctrl is not > updated), I know this is our HW defect, but this commit makes kernel > trapped in this isr function and leads to kernel hang (then the user > could not get useful information to show what's wrong), which I think > is not expected behavior, so I would like to report to you for discussion. Thanks a lot for the report and apologies for the breakage and the delay. Below please find a tentative fix. Could you test whether it fixes the issue? I don't think this is a hardware defect. If I'm reading the spec right (PCIe r5.0, sec. 6.7.1.8), the PFD bit is meant to remain set and cannot be cleared until the kernel disables slot power. When a power fault happens, we currently only change the LEDs (Power Indicator Off, Attention Indicator On) and emit a log message. We otherwise leave the slot as is, even though I'd assume that the PCI device in the slot is no longer accessible. I'm wondering whether we should interpret a power fault as surprise removal. Alternatively, we could attempt recovery, i.e. turn slot power off and back on. Similar to what we're doing when an Uncorrectable Error occurs. Do you have an opinion on that? What would be the desired behavior for your users? Thanks, Lukas -- >8 -- Subject: [PATCH] PCI: pciehp: Fix infinite loop in IRQ handler upon power fault The Power Fault Detected bit in the Slot Status register differs from all other hotplug events in that it is sticky: It can only be cleared after turning off slot power. Per PCIe r5.0, sec. 6.7.1.8: If a power controller detects a main power fault on the hot-plug slot, it must automatically set its internal main power fault latch [...]. The main power fault latch is cleared when software turns off power to the hot-plug slot. The stickiness used to cause interrupt storms and infinite loops which were fixed in 2009 by commits 5651c48cfafe ("PCI pciehp: fix power fault interrupt storm problem") and 99f0169c17f3 ("PCI: pciehp: enable software notification on empty slots"). Unfortunately in 2020 the infinite loop issue was inadvertently reintroduced by commit 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt race"): The hardirq handler pciehp_isr() clears the PFD bit until pciehp's power_fault_detected flag is set. That happens in the IRQ thread pciehp_ist(), which never learns of the event because the hardirq handler is stuck in an infinite loop. Fix by setting the power_fault_detected flag already in the hardirq handler. Fixes: 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt race") Link: https://bugzilla.kernel.org/show_bug.cgi?id=214989 Link: https://lore.kernel.org/linux-pci/DM8PR11MB5702255A6A92F735D90A4446868B9@DM8PR11MB5702.namprd11.prod.outlook.com Reported-by: Joseph Bao <joseph.bao@intel.com> Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: stable@vger.kernel.org # v4.19+ Cc: Stuart Hayes <stuart.w.hayes@gmail.com> --- drivers/pci/hotplug/pciehp_hpc.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 6ac5ea5..fac6b8e 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -640,6 +640,8 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) */ if (ctrl->power_fault_detected) status &= ~PCI_EXP_SLTSTA_PFD; + else if (status & PCI_EXP_SLTSTA_PFD) + ctrl->power_fault_detected = true; events |= status; if (!events) { @@ -649,7 +651,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) } if (status) { - pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); + pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, status); /* * In MSI mode, all event bits must be zero before the port @@ -723,8 +725,7 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) } /* Check Power Fault Detected */ - if ((events & PCI_EXP_SLTSTA_PFD) && !ctrl->power_fault_detected) { - ctrl->power_fault_detected = 1; + if (events & PCI_EXP_SLTSTA_PFD) { ctrl_err(ctrl, "Slot(%s): Power fault\n", slot_name(ctrl)); pciehp_set_indicators(ctrl, PCI_EXP_SLTCTL_PWR_IND_OFF, PCI_EXP_SLTCTL_ATTN_IND_ON);
I've tested this tentative patch, and it fixes the issue. And current kernel behavior of only changing the LEDs may come from spec sec 6.7.1.1, which indicates that the adapter should not be removed when encountering a power fault event. Based on the description of sec 6.7.1.8, I think it's the power controller's responsibility that power off the slot -> wait until the power fault status cleared (latched by HW)-> power on the slot again, not the kernel's responsibility. If the kernel reports a power fault log msg, there must be something wrong with the HW, #1 (current case) the power controller circuit has some defect, falsely trigger power fault even the slot has correct voltage. #2 (seen in our factory) The slot has some defect, pulling down the correct voltage, so trigger power fault. In my opinion, only the infinite loop bug fix is necessary for the kernel, neither the cases I listed above should cause a system hang, then we could get power fault log msg/indicators and know something wrong with our HW. Otherwise, if exists the case that power fault is not HW-related, then maybe the kernel should interpret power fault as a surprising removal or do some power cycle retry to recovery, but I do not know such a case yet... Regards Joseph -----Original Message----- From: Lukas Wunner <lukas@wunner.de> Sent: Tuesday, November 16, 2021 3:27 AM To: Bao, Joseph <joseph.bao@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com>; linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Stuart Hayes <stuart.w.hayes@gmail.com>; kw@linux.com Subject: Re: HW power fault defect cause system hang on kernel 5.4.y On Tue, Nov 02, 2021 at 03:45:00AM +0000, Bao, Joseph wrote: > Recently we encounter system hang (dead spinlock) when move to kernel > linux-5.4.y. > > Finally, we use bisect to locate the suspicious commit https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4667358dab9cc07da044d5bc087065545b1000df. > > Our system has some HW defect, which will wrongly set > PCI_EXP_SLTSTA_PFD high, and this commit will lead to infinite loop > jumping to read_status (no chance to clear status PCI_EXP_SLTSTA_PFD > bit since ctrl is not updated), I know this is our HW defect, but this > commit makes kernel trapped in this isr function and leads to kernel > hang (then the user could not get useful information to show what's > wrong), which I think is not expected behavior, so I would like to report to you for discussion. Thanks a lot for the report and apologies for the breakage and the delay. Below please find a tentative fix. Could you test whether it fixes the issue? I don't think this is a hardware defect. If I'm reading the spec right (PCIe r5.0, sec. 6.7.1.8), the PFD bit is meant to remain set and cannot be cleared until the kernel disables slot power. When a power fault happens, we currently only change the LEDs (Power Indicator Off, Attention Indicator On) and emit a log message. We otherwise leave the slot as is, even though I'd assume that the PCI device in the slot is no longer accessible. I'm wondering whether we should interpret a power fault as surprise removal. Alternatively, we could attempt recovery, i.e. turn slot power off and back on. Similar to what we're doing when an Uncorrectable Error occurs. Do you have an opinion on that? What would be the desired behavior for your users? Thanks, Lukas -- >8 -- Subject: [PATCH] PCI: pciehp: Fix infinite loop in IRQ handler upon power fault The Power Fault Detected bit in the Slot Status register differs from all other hotplug events in that it is sticky: It can only be cleared after turning off slot power. Per PCIe r5.0, sec. 6.7.1.8: If a power controller detects a main power fault on the hot-plug slot, it must automatically set its internal main power fault latch [...]. The main power fault latch is cleared when software turns off power to the hot-plug slot. The stickiness used to cause interrupt storms and infinite loops which were fixed in 2009 by commits 5651c48cfafe ("PCI pciehp: fix power fault interrupt storm problem") and 99f0169c17f3 ("PCI: pciehp: enable software notification on empty slots"). Unfortunately in 2020 the infinite loop issue was inadvertently reintroduced by commit 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt race"): The hardirq handler pciehp_isr() clears the PFD bit until pciehp's power_fault_detected flag is set. That happens in the IRQ thread pciehp_ist(), which never learns of the event because the hardirq handler is stuck in an infinite loop. Fix by setting the power_fault_detected flag already in the hardirq handler. Fixes: 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt race") Link: https://bugzilla.kernel.org/show_bug.cgi?id=214989 Link: https://lore.kernel.org/linux-pci/DM8PR11MB5702255A6A92F735D90A4446868B9@DM8PR11MB5702.namprd11.prod.outlook.com Reported-by: Joseph Bao <joseph.bao@intel.com> Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: stable@vger.kernel.org # v4.19+ Cc: Stuart Hayes <stuart.w.hayes@gmail.com> --- drivers/pci/hotplug/pciehp_hpc.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 6ac5ea5..fac6b8e 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -640,6 +640,8 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) */ if (ctrl->power_fault_detected) status &= ~PCI_EXP_SLTSTA_PFD; + else if (status & PCI_EXP_SLTSTA_PFD) + ctrl->power_fault_detected = true; events |= status; if (!events) { @@ -649,7 +651,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) } if (status) { - pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); + pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, status); /* * In MSI mode, all event bits must be zero before the port @@ -723,8 +725,7 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) } /* Check Power Fault Detected */ - if ((events & PCI_EXP_SLTSTA_PFD) && !ctrl->power_fault_detected) { - ctrl->power_fault_detected = 1; + if (events & PCI_EXP_SLTSTA_PFD) { ctrl_err(ctrl, "Slot(%s): Power fault\n", slot_name(ctrl)); pciehp_set_indicators(ctrl, PCI_EXP_SLTCTL_PWR_IND_OFF, PCI_EXP_SLTCTL_ATTN_IND_ON); -- 2.33.0
On Tue, Nov 16, 2021 at 02:42:20AM +0000, Bao, Joseph wrote:
> I've tested this tentative patch, and it fixes the issue.
Bjorn, do you want me to resend the fix as a proper patch
(instead of appended to an e-mail) for visibility?
If so, do you have any change requests that I should fold in?
Link to the fix:
https://lore.kernel.org/linux-pci/20211115192723.GA19161@wunner.de/
Thanks,
Lukas
On Wed, Nov 17, 2021 at 09:00:35AM +0100, Lukas Wunner wrote: > On Tue, Nov 16, 2021 at 02:42:20AM +0000, Bao, Joseph wrote: > > I've tested this tentative patch, and it fixes the issue. > > Bjorn, do you want me to resend the fix as a proper patch > (instead of appended to an e-mail) for visibility? Yes, please; I think it only made it into patchwork as a reply in the email thread, not as the primary patch, so it'll be less likely to get lost if you post it as a regular patch. > If so, do you have any change requests that I should fold in? > > Link to the fix: > https://lore.kernel.org/linux-pci/20211115192723.GA19161@wunner.de/ > > Thanks, > > Lukas
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 356786a3b7f4b..88b996764ff95 100644 --- a/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=ca767cf0152d18fc299cde85b18d1f46ac21e1ba +++ b/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=4667358dab9cc07da044d5bc087065545b1000df @@ -529,7 +529,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) struct controller *ctrl = (struct controller *)dev_id; struct pci_dev *pdev = ctrl_dev(ctrl); struct device *parent = pdev->dev.parent; - u16 status, events; + u16 status, events = 0; /* * Interrupts only occur in D3hot or shallower and only if enabled @@ -554,6 +554,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) } } +read_status: pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status); if (status == (u16) ~0) { ctrl_info(ctrl, "%s: no response from device\n", __func__); @@ -566,24 +567,37 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) * Slot Status contains plain status bits as well as event * notification bits; right now we only want the event bits. */ - events = status & (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD | - PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC | - PCI_EXP_SLTSTA_DLLSC); + status &= PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD | + PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC | + PCI_EXP_SLTSTA_DLLSC; /* * If we've already reported a power fault, don't report it again * until we've done something to handle it. */ if (ctrl->power_fault_detected) - events &= ~PCI_EXP_SLTSTA_PFD; + status &= ~PCI_EXP_SLTSTA_PFD; + events |= status; if (!events) { if (parent) pm_runtime_put(parent); return IRQ_NONE; } - pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); + if (status) { + pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); + + /* + * In MSI mode, all event bits must be zero before the port + * will send a new interrupt (PCIe Base Spec r5.0 sec 6.7.3.4). + * So re-read the Slot Status register in case a bit was set + * between read and write. + */ + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode) + goto read_status; + } + ctrl_dbg(ctrl, "pending interrupts %#06x from Slot Status\n", events); if (parent) pm_runtime_put(parent);