diff mbox series

[v2] PCI: pciehp: Make sure pciehp_isr clears interrupt events

Message ID 20191120222043.53432-1-stuart.w.hayes@gmail.com
State Changes Requested
Delegated to: Bjorn Helgaas
Headers show
Series [v2] PCI: pciehp: Make sure pciehp_isr clears interrupt events | expand

Commit Message

Stuart Hayes Nov. 20, 2019, 10:20 p.m. UTC
Without this patch, a pciehp hotplug port can stop generating interrupts
on hotplug events, so device adds and removals will not be seen.

The pciehp interrupt handler pciehp_isr() will read the slot status
register and then write back to it to clear the bits that caused the
interrupt. If a different interrupt event bit gets set between the read and
the write, pciehp_isr will exit without having cleared all of the interrupt
event bits. If this happens, and the port is using an MSI interrupt where
per-vector masking is not supported, we won't get any more hotplug
interrupts from that device.

That is expected behavior, according to the PCI Express Base Specification
Revision 5.0 Version 1.0, section 6.7.3.4, "Software Notification of Hot-
Plug Events".

Because the "presence detect changed" and "data link layer state changed"
event bits are both getting set at nearly the same time when a device is
added or removed, this is more likely to happen than it might seem. The
issue was found (and can be reproduced rather easily) by connecting and
disconnecting an NVMe storage device on at least one system model.

This issue was found while adding and removing various NVMe storage devices
on an AMD PCIe port (PCI device 0x1022/0x1483).

This patch fixes this issue by modifying pciehp_isr() by looping back and
re-reading the slot status register immediately after writing to it, until
it sees that all of the event status bits have been cleared.

Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com>
---
v2:
  * fixed ctrl_warn() call
  * improved comments
  * added pvm_capable flag and changed pciehp_isr() to loop back only when
    pvm_capable flag not set (suggested by Lukas Wunner)
  
 drivers/pci/hotplug/pciehp.h     |  3 ++
 drivers/pci/hotplug/pciehp_hpc.c | 50 ++++++++++++++++++++++++++++----
 2 files changed, 47 insertions(+), 6 deletions(-)

Comments

Stuart Hayes Jan. 20, 2020, 4:10 p.m. UTC | #1
On 11/20/19 4:20 PM, Stuart Hayes wrote:
> Without this patch, a pciehp hotplug port can stop generating interrupts
> on hotplug events, so device adds and removals will not be seen.
> 
> The pciehp interrupt handler pciehp_isr() will read the slot status
> register and then write back to it to clear the bits that caused the
> interrupt. If a different interrupt event bit gets set between the read and
> the write, pciehp_isr will exit without having cleared all of the interrupt
> event bits. If this happens, and the port is using an MSI interrupt where
> per-vector masking is not supported, we won't get any more hotplug
> interrupts from that device.
> 
> That is expected behavior, according to the PCI Express Base Specification
> Revision 5.0 Version 1.0, section 6.7.3.4, "Software Notification of Hot-
> Plug Events".
> 
> Because the "presence detect changed" and "data link layer state changed"
> event bits are both getting set at nearly the same time when a device is
> added or removed, this is more likely to happen than it might seem. The
> issue was found (and can be reproduced rather easily) by connecting and
> disconnecting an NVMe storage device on at least one system model.
> 
> This issue was found while adding and removing various NVMe storage devices
> on an AMD PCIe port (PCI device 0x1022/0x1483).
> 
> This patch fixes this issue by modifying pciehp_isr() by looping back and
> re-reading the slot status register immediately after writing to it, until
> it sees that all of the event status bits have been cleared.
> 
> Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com>
> ---
> v2:
>   * fixed ctrl_warn() call
>   * improved comments
>   * added pvm_capable flag and changed pciehp_isr() to loop back only when
>     pvm_capable flag not set (suggested by Lukas Wunner)
>   
>  drivers/pci/hotplug/pciehp.h     |  3 ++
>  drivers/pci/hotplug/pciehp_hpc.c | 50 ++++++++++++++++++++++++++++----
>  2 files changed, 47 insertions(+), 6 deletions(-)
> 

Bjorn,

Please let me know if I could do anything to help get this patch accepted.

Thanks!
Stuart
Enzo Matsumiya Jan. 28, 2020, 9:51 p.m. UTC | #2
On Mon, 2020-01-20 at 10:10 -0600, Stuart Hayes wrote:
> On 11/20/19 4:20 PM, Stuart Hayes wrote:
> > Without this patch, a pciehp hotplug port can stop generating
> > interrupts
> > on hotplug events, so device adds and removals will not be seen.
> > 
> > The pciehp interrupt handler pciehp_isr() will read the slot status
> > register and then write back to it to clear the bits that caused
> > the
> > interrupt. If a different interrupt event bit gets set between the
> > read and
> > the write, pciehp_isr will exit without having cleared all of the
> > interrupt
> > event bits. If this happens, and the port is using an MSI interrupt
> > where
> > per-vector masking is not supported, we won't get any more hotplug
> > interrupts from that device.
> > 
> > That is expected behavior, according to the PCI Express Base
> > Specification
> > Revision 5.0 Version 1.0, section 6.7.3.4, "Software Notification
> > of Hot-
> > Plug Events".
> > 
> > Because the "presence detect changed" and "data link layer state
> > changed"
> > event bits are both getting set at nearly the same time when a
> > device is
> > added or removed, this is more likely to happen than it might seem.
> > The
> > issue was found (and can be reproduced rather easily) by connecting
> > and
> > disconnecting an NVMe storage device on at least one system model.
> > 
> > This issue was found while adding and removing various NVMe storage
> > devices
> > on an AMD PCIe port (PCI device 0x1022/0x1483).
> > 
> > This patch fixes this issue by modifying pciehp_isr() by looping
> > back and
> > re-reading the slot status register immediately after writing to
> > it, until
> > it sees that all of the event status bits have been cleared.
> > 
> > Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com>
> > ---
> > v2:
> >   * fixed ctrl_warn() call
> >   * improved comments
> >   * added pvm_capable flag and changed pciehp_isr() to loop back
> > only when
> >     pvm_capable flag not set (suggested by Lukas Wunner)
> >   
> >  drivers/pci/hotplug/pciehp.h     |  3 ++
> >  drivers/pci/hotplug/pciehp_hpc.c | 50
> > ++++++++++++++++++++++++++++----
> >  2 files changed, 47 insertions(+), 6 deletions(-)
> > 
> 
> Bjorn,
> 
> Please let me know if I could do anything to help get this patch
> accepted.
> 
> Thanks!
> Stuart
> 

Hi, can someone please review/accept this patch please? It fixes NVMe
hotplug operations in SLES15-SP1.

Thanks,

Enzo
Bjorn Helgaas Jan. 29, 2020, 12:51 a.m. UTC | #3
[+cc Enzo]

On Wed, Nov 20, 2019 at 05:20:43PM -0500, Stuart Hayes wrote:
> Without this patch, a pciehp hotplug port can stop generating interrupts
> on hotplug events, so device adds and removals will not be seen.
> 
> The pciehp interrupt handler pciehp_isr() will read the slot status
> register and then write back to it to clear the bits that caused the
> interrupt. If a different interrupt event bit gets set between the read and
> the write, pciehp_isr will exit without having cleared all of the interrupt
> event bits. If this happens, and the port is using an MSI interrupt where
> per-vector masking is not supported, we won't get any more hotplug
> interrupts from that device.
> 
> That is expected behavior, according to the PCI Express Base Specification
> Revision 5.0 Version 1.0, section 6.7.3.4, "Software Notification of Hot-
> Plug Events".
> 
> Because the "presence detect changed" and "data link layer state changed"
> event bits are both getting set at nearly the same time when a device is
> added or removed, this is more likely to happen than it might seem. The
> issue was found (and can be reproduced rather easily) by connecting and
> disconnecting an NVMe storage device on at least one system model.
> 
> This issue was found while adding and removing various NVMe storage devices
> on an AMD PCIe port (PCI device 0x1022/0x1483).
> 
> This patch fixes this issue by modifying pciehp_isr() by looping back and
> re-reading the slot status register immediately after writing to it, until
> it sees that all of the event status bits have been cleared.

I see that Lukas took a look at this earlier; I'd really like to have
his reviewed-by, since he's the expert on this code.

> Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com>
> ---
> v2:
>   * fixed ctrl_warn() call
>   * improved comments
>   * added pvm_capable flag and changed pciehp_isr() to loop back only when
>     pvm_capable flag not set (suggested by Lukas Wunner)
>   
>  drivers/pci/hotplug/pciehp.h     |  3 ++
>  drivers/pci/hotplug/pciehp_hpc.c | 50 ++++++++++++++++++++++++++++----
>  2 files changed, 47 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
> index 654c972b8ea0..1804277ec433 100644
> --- a/drivers/pci/hotplug/pciehp.h
> +++ b/drivers/pci/hotplug/pciehp.h
> @@ -47,6 +47,8 @@ extern int pciehp_poll_time;
>   * struct controller - PCIe hotplug controller
>   * @pcie: pointer to the controller's PCIe port service device
>   * @slot_cap: cached copy of the Slot Capabilities register
> + * @pvm_capable: set if per-vector masking is supported (if not set, the ISR
> + *	needs to ensure all event bits cleared)
>   * @slot_ctrl: cached copy of the Slot Control register
>   * @ctrl_lock: serializes writes to the Slot Control register
>   * @cmd_started: jiffies when the Slot Control register was last written;
> @@ -83,6 +85,7 @@ struct controller {
>  	struct pcie_device *pcie;
>  
>  	u32 slot_cap;				/* capabilities and quirks */
> +	unsigned int pvm_capable:1;
>  
>  	u16 slot_ctrl;				/* control register access */
>  	struct mutex ctrl_lock;
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index 1a522c1c4177..4ffd489a5413 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -487,12 +487,21 @@ void pciehp_power_off_slot(struct controller *ctrl)
>  		 PCI_EXP_SLTCTL_PWR_OFF);
>  }
>  
> +/*
> + * We should never need to re-read the slot status register this many times,
> + * because there are only six possible events that could generate this
> + * interrupt.  If we still see events after this many reads, there's a stuck
> + * bit.
> + */
> +#define MAX_ISR_STATUS_READS 6
> +
>  static irqreturn_t pciehp_isr(int irq, void *dev_id)
>  {
>  	struct controller *ctrl = (struct controller *)dev_id;
>  	struct pci_dev *pdev = ctrl_dev(ctrl);
>  	struct device *parent = pdev->dev.parent;
> -	u16 status, events;
> +	u16 status, events = 0;
> +	int status_reads = 0;
>  
>  	/*
>  	 * Interrupts only occur in D3hot or shallower and only if enabled
> @@ -517,6 +526,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
>  		}
>  	}
>  
> +read_status:
>  	pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status);
>  	if (status == (u16) ~0) {
>  		ctrl_info(ctrl, "%s: no response from device\n", __func__);
> @@ -529,24 +539,44 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
>  	 * Slot Status contains plain status bits as well as event
>  	 * notification bits; right now we only want the event bits.
>  	 */
> -	events = status & (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD |
> -			   PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC |
> -			   PCI_EXP_SLTSTA_DLLSC);
> +	status &= (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD |
> +		   PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC |
> +		   PCI_EXP_SLTSTA_DLLSC);
>  
>  	/*
>  	 * If we've already reported a power fault, don't report it again
>  	 * until we've done something to handle it.
>  	 */
>  	if (ctrl->power_fault_detected)
> -		events &= ~PCI_EXP_SLTSTA_PFD;
> +		status &= ~PCI_EXP_SLTSTA_PFD;
>  
> +	events |= status;
>  	if (!events) {
>  		if (parent)
>  			pm_runtime_put(parent);
>  		return IRQ_NONE;
>  	}
>  
> -	pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events);
> +	if (status) {
> +		pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, status);
> +
> +		/*
> +		 * If MSI per-vector masking is not supported by the port,
> +		 * all of the event bits must be zero before the port will
> +		 * send a new interrupt (see PCI Express Base Specification
> +		 * Revision 5.0 Version 1.0, section 6.7.3.4, "Software
> +		 * Notification of Hot-Plug Events"). So in that case, if
> +		 * event bit gets set between the read and the write of
> +		 * PCI_EXP_SLTSTA, we need to loop back and try again.
> +		 */
> +		if (!ctrl->pvm_capable) {

I don't think this per-vector masking check is correct.  It's true
that if the device does not support masking, it must send an interrupt
every time this expression transitions from FALSE to TRUE:

  (INT ENABLE == 1) AND ((event status == 1) AND event enable == 1)

But if the device *does* support per-vector masking, an interrupt
message must be sent every time this expression transitions from FALSE
to TRUE:

  (MSI vector is unmasked) AND (INT ENABLE == 1) AND
    ((event status == 1) AND event enable == 1)

I don't know what whether the MSI vector is masked or not at this
point, and I don't think this code *should* know that.  All it should
know is "we got an interrupt via some mechanism".  Actually, it can't
even know that, because it should always be safe to call an ISR
because the interrupt may be shared with other devices, and this ISR
may be called because a different device interrupted.

So if we assume the vector is unmasked, the situation is the same as
when the device doesn't support per-vector masking.

> +			if (status_reads++ < MAX_ISR_STATUS_READS)
> +				goto read_status;

I don't understand this limit of six.  We clear a status bit above,
but what's to prevent that same bit from being set again after we read
it but before we write it?

> +			ctrl_warn(ctrl, "Hot plug event bit stuck (%x)\n",
> +				  status);
> +		}
> +	}
> +
>  	ctrl_dbg(ctrl, "pending interrupts %#06x from Slot Status\n", events);
>  	if (parent)
>  		pm_runtime_put(parent);
> @@ -812,6 +842,7 @@ struct controller *pcie_init(struct pcie_device *dev)
>  {
>  	struct controller *ctrl;
>  	u32 slot_cap, link_cap;
> +	u16 msiflags;
>  	u8 poweron;
>  	struct pci_dev *pdev = dev->port;
>  	struct pci_bus *subordinate = pdev->subordinate;
> @@ -881,6 +912,13 @@ struct controller *pcie_init(struct pcie_device *dev)
>  		}
>  	}
>  
> +	ctrl->pvm_capable = 1;
> +	if (pdev->msi_enabled) {
> +		pci_read_config_word(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +				     &msiflags);
> +		ctrl->pvm_capable = !!(msiflags & PCI_MSI_FLAGS_MASKBIT);
> +	}
> +
>  	return ctrl;
>  }
>  
> -- 
> 2.18.1
>
Stuart Hayes Jan. 29, 2020, 9:55 p.m. UTC | #4
On 1/28/20 6:51 PM, Bjorn Helgaas wrote:
> [+cc Enzo]
> 
> On Wed, Nov 20, 2019 at 05:20:43PM -0500, Stuart Hayes wrote:
>> Without this patch, a pciehp hotplug port can stop generating interrupts
>> on hotplug events, so device adds and removals will not be seen.
>>
>> The pciehp interrupt handler pciehp_isr() will read the slot status
>> register and then write back to it to clear the bits that caused the
>> interrupt. If a different interrupt event bit gets set between the read and
>> the write, pciehp_isr will exit without having cleared all of the interrupt
>> event bits. If this happens, and the port is using an MSI interrupt where
>> per-vector masking is not supported, we won't get any more hotplug
>> interrupts from that device.
>>
>> That is expected behavior, according to the PCI Express Base Specification
>> Revision 5.0 Version 1.0, section 6.7.3.4, "Software Notification of Hot-
>> Plug Events".
>>
>> Because the "presence detect changed" and "data link layer state changed"
>> event bits are both getting set at nearly the same time when a device is
>> added or removed, this is more likely to happen than it might seem. The
>> issue was found (and can be reproduced rather easily) by connecting and
>> disconnecting an NVMe storage device on at least one system model.
>>
>> This issue was found while adding and removing various NVMe storage devices
>> on an AMD PCIe port (PCI device 0x1022/0x1483).
>>
>> This patch fixes this issue by modifying pciehp_isr() by looping back and
>> re-reading the slot status register immediately after writing to it, until
>> it sees that all of the event status bits have been cleared.
> 
> I see that Lukas took a look at this earlier; I'd really like to have
> his reviewed-by, since he's the expert on this code.
> 
>> Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com>
>> ---
>> v2:
>>   * fixed ctrl_warn() call
>>   * improved comments
>>   * added pvm_capable flag and changed pciehp_isr() to loop back only when
>>     pvm_capable flag not set (suggested by Lukas Wunner)
>>   
>>  drivers/pci/hotplug/pciehp.h     |  3 ++
>>  drivers/pci/hotplug/pciehp_hpc.c | 50 ++++++++++++++++++++++++++++----
>>  2 files changed, 47 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
>> index 654c972b8ea0..1804277ec433 100644
>> --- a/drivers/pci/hotplug/pciehp.h
>> +++ b/drivers/pci/hotplug/pciehp.h
>> @@ -47,6 +47,8 @@ extern int pciehp_poll_time;
>>   * struct controller - PCIe hotplug controller
>>   * @pcie: pointer to the controller's PCIe port service device
>>   * @slot_cap: cached copy of the Slot Capabilities register
>> + * @pvm_capable: set if per-vector masking is supported (if not set, the ISR
>> + *	needs to ensure all event bits cleared)
>>   * @slot_ctrl: cached copy of the Slot Control register
>>   * @ctrl_lock: serializes writes to the Slot Control register
>>   * @cmd_started: jiffies when the Slot Control register was last written;
>> @@ -83,6 +85,7 @@ struct controller {
>>  	struct pcie_device *pcie;
>>  
>>  	u32 slot_cap;				/* capabilities and quirks */
>> +	unsigned int pvm_capable:1;
>>  
>>  	u16 slot_ctrl;				/* control register access */
>>  	struct mutex ctrl_lock;
>> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
>> index 1a522c1c4177..4ffd489a5413 100644
>> --- a/drivers/pci/hotplug/pciehp_hpc.c
>> +++ b/drivers/pci/hotplug/pciehp_hpc.c
>> @@ -487,12 +487,21 @@ void pciehp_power_off_slot(struct controller *ctrl)
>>  		 PCI_EXP_SLTCTL_PWR_OFF);
>>  }
>>  
>> +/*
>> + * We should never need to re-read the slot status register this many times,
>> + * because there are only six possible events that could generate this
>> + * interrupt.  If we still see events after this many reads, there's a stuck
>> + * bit.
>> + */
>> +#define MAX_ISR_STATUS_READS 6
>> +
>>  static irqreturn_t pciehp_isr(int irq, void *dev_id)
>>  {
>>  	struct controller *ctrl = (struct controller *)dev_id;
>>  	struct pci_dev *pdev = ctrl_dev(ctrl);
>>  	struct device *parent = pdev->dev.parent;
>> -	u16 status, events;
>> +	u16 status, events = 0;
>> +	int status_reads = 0;
>>  
>>  	/*
>>  	 * Interrupts only occur in D3hot or shallower and only if enabled
>> @@ -517,6 +526,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
>>  		}
>>  	}
>>  
>> +read_status:
>>  	pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status);
>>  	if (status == (u16) ~0) {
>>  		ctrl_info(ctrl, "%s: no response from device\n", __func__);
>> @@ -529,24 +539,44 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
>>  	 * Slot Status contains plain status bits as well as event
>>  	 * notification bits; right now we only want the event bits.
>>  	 */
>> -	events = status & (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD |
>> -			   PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC |
>> -			   PCI_EXP_SLTSTA_DLLSC);
>> +	status &= (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD |
>> +		   PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC |
>> +		   PCI_EXP_SLTSTA_DLLSC);
>>  
>>  	/*
>>  	 * If we've already reported a power fault, don't report it again
>>  	 * until we've done something to handle it.
>>  	 */
>>  	if (ctrl->power_fault_detected)
>> -		events &= ~PCI_EXP_SLTSTA_PFD;
>> +		status &= ~PCI_EXP_SLTSTA_PFD;
>>  
>> +	events |= status;
>>  	if (!events) {
>>  		if (parent)
>>  			pm_runtime_put(parent);
>>  		return IRQ_NONE;
>>  	}
>>  
>> -	pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events);
>> +	if (status) {
>> +		pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, status);
>> +
>> +		/*
>> +		 * If MSI per-vector masking is not supported by the port,
>> +		 * all of the event bits must be zero before the port will
>> +		 * send a new interrupt (see PCI Express Base Specification
>> +		 * Revision 5.0 Version 1.0, section 6.7.3.4, "Software
>> +		 * Notification of Hot-Plug Events"). So in that case, if
>> +		 * event bit gets set between the read and the write of
>> +		 * PCI_EXP_SLTSTA, we need to loop back and try again.
>> +		 */
>> +		if (!ctrl->pvm_capable) {
> 
> I don't think this per-vector masking check is correct.  It's true
> that if the device does not support masking, it must send an interrupt
> every time this expression transitions from FALSE to TRUE:
> 
>   (INT ENABLE == 1) AND ((event status == 1) AND event enable == 1)
> 
> But if the device *does* support per-vector masking, an interrupt
> message must be sent every time this expression transitions from FALSE
> to TRUE:
> 
>   (MSI vector is unmasked) AND (INT ENABLE == 1) AND
>     ((event status == 1) AND event enable == 1)
> 
> I don't know what whether the MSI vector is masked or not at this
> point, and I don't think this code *should* know that.  All it should
> know is "we got an interrupt via some mechanism".  Actually, it can't
> even know that, because it should always be safe to call an ISR
> because the interrupt may be shared with other devices, and this ISR
> may be called because a different device interrupted.
> 
> So if we assume the vector is unmasked, the situation is the same as
> when the device doesn't support per-vector masking.
> 

I agree.  I am happy to remove the check if per-vector masking is
supported and fix the comment.


>> +			if (status_reads++ < MAX_ISR_STATUS_READS)
>> +				goto read_status;
> 
> I don't understand this limit of six.  We clear a status bit above,
> but what's to prevent that same bit from being set again after we read
> it but before we write it?
> 

I think the nature of the status bits (power fault, link active, etc)
means that they shouldn't be toggling at a high frequency, and there are
only six status bits that can cause this interrupt, which is why I chose
six.  But it is really just an arbitrary number that should be larger
than any non-broken system would ever get to, just to safeguard against
getting stuck in the ISR.

I guess if a raw, unbounced attention button is connected, that one
status bit might toggle repeatedly at a high rate--but, if only a
single bit is toggling, the write to the status register would clear
the status register even if that status bit gets set again before the
status register is read back, so in that case it should still be able
to generate another interrupt after pciehp_isr() exits, and it won't
matter that the loop was aborted after six tries to clear the status.


>> +			ctrl_warn(ctrl, "Hot plug event bit stuck (%x)\n",
>> +				  status);
>> +		}
>> +	}
>> +
>>  	ctrl_dbg(ctrl, "pending interrupts %#06x from Slot Status\n", events);
>>  	if (parent)
>>  		pm_runtime_put(parent);
>> @@ -812,6 +842,7 @@ struct controller *pcie_init(struct pcie_device *dev)
>>  {
>>  	struct controller *ctrl;
>>  	u32 slot_cap, link_cap;
>> +	u16 msiflags;
>>  	u8 poweron;
>>  	struct pci_dev *pdev = dev->port;
>>  	struct pci_bus *subordinate = pdev->subordinate;
>> @@ -881,6 +912,13 @@ struct controller *pcie_init(struct pcie_device *dev)
>>  		}
>>  	}
>>  
>> +	ctrl->pvm_capable = 1;
>> +	if (pdev->msi_enabled) {
>> +		pci_read_config_word(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +				     &msiflags);
>> +		ctrl->pvm_capable = !!(msiflags & PCI_MSI_FLAGS_MASKBIT);
>> +	}
>> +
>>  	return ctrl;
>>  }
>>  
>> -- 
>> 2.18.1
>>
Lukas Wunner Feb. 9, 2020, 12:55 p.m. UTC | #5
On Tue, Jan 28, 2020 at 06:51:51PM -0600, Bjorn Helgaas wrote:
> On Wed, Nov 20, 2019 at 05:20:43PM -0500, Stuart Hayes wrote:
> > -	pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events);
> > +	if (status) {
> > +		pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, status);
> > +
> > +		/*
> > +		 * If MSI per-vector masking is not supported by the port,
> > +		 * all of the event bits must be zero before the port will
> > +		 * send a new interrupt (see PCI Express Base Specification
> > +		 * Revision 5.0 Version 1.0, section 6.7.3.4, "Software
> > +		 * Notification of Hot-Plug Events"). So in that case, if
> > +		 * event bit gets set between the read and the write of
> > +		 * PCI_EXP_SLTSTA, we need to loop back and try again.
> > +		 */
> > +		if (!ctrl->pvm_capable) {
> 
> I don't know what whether the MSI vector is masked or not at this
> point

I think MSIs are handled by handle_edge_irq(), which, unlike
handle_level_irq(), does not mask the IRQ by default.

We could call disable_irq_nosync() / enable_irq() to mask the IRQ and
immediately unmask it after writing to the slot status register if the
interrupt uses MSI / MSI-X (and not INTx), thereby forcing another
interrupt if new bits have been set in the meantime.  But I suspect this
approach may not work if PVM is unsupported.


> I see that Lukas took a look at this earlier; I'd really like to have
> his reviewed-by, since he's the expert on this code.

Hm, should we add an entry for pciehp to MAINTAINERS and list me as R: or M:?

Thanks,

Lukas
Lukas Wunner Feb. 9, 2020, 1:08 p.m. UTC | #6
On Wed, Jan 29, 2020 at 03:55:21PM -0600, Stuart Hayes wrote:
> On 1/28/20 6:51 PM, Bjorn Helgaas wrote:
> > I don't understand this limit of six.  We clear a status bit above,
> > but what's to prevent that same bit from being set again after we read
> > it but before we write it?
> 
> I think the nature of the status bits (power fault, link active, etc)
> means that they shouldn't be toggling at a high frequency, and there are
> only six status bits that can cause this interrupt, which is why I chose
> six.  But it is really just an arbitrary number that should be larger
> than any non-broken system would ever get to, just to safeguard against
> getting stuck in the ISR.

From v4.9 until v4.18 we already had a loop which did what you're
trying to achieve here.  It was added by commit fad214b0aa72
("PCI: pciehp: Process all hotplug events before looking for new ones").

v4.9 is an LTS stable kernel, it's being used a lot and noone ever
complained about the ISR getting stuck.  So it seems safe to me to
drop the limit of six.  It can be added later if issues do get
reported.

I'm sorry that I dropped the loop when refactoring the code for v4.19.
The commit message made it seem that the loop was necessary because we
might otherwise miss events.  However my refactoring made the code *cope*
with missed events, so the loop seemed unnecessary.  I didn't realize
that the loop is also necessary to avoid missing *interrupts* in the
MSI case.

Thanks,

Lukas
Bjorn Helgaas Feb. 10, 2020, 11:21 p.m. UTC | #7
On Sun, Feb 09, 2020 at 01:55:43PM +0100, Lukas Wunner wrote:
> On Tue, Jan 28, 2020 at 06:51:51PM -0600, Bjorn Helgaas wrote:

> > I see that Lukas took a look at this earlier; I'd really like to have
> > his reviewed-by, since he's the expert on this code.
> 
> Hm, should we add an entry for pciehp to MAINTAINERS and list me as R: or M:?

That'd be great.  I would certainly apply a patch like that, but I don't
want to presume by generating it myself.

Bjorn
diff mbox series

Patch

diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index 654c972b8ea0..1804277ec433 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -47,6 +47,8 @@  extern int pciehp_poll_time;
  * struct controller - PCIe hotplug controller
  * @pcie: pointer to the controller's PCIe port service device
  * @slot_cap: cached copy of the Slot Capabilities register
+ * @pvm_capable: set if per-vector masking is supported (if not set, the ISR
+ *	needs to ensure all event bits cleared)
  * @slot_ctrl: cached copy of the Slot Control register
  * @ctrl_lock: serializes writes to the Slot Control register
  * @cmd_started: jiffies when the Slot Control register was last written;
@@ -83,6 +85,7 @@  struct controller {
 	struct pcie_device *pcie;
 
 	u32 slot_cap;				/* capabilities and quirks */
+	unsigned int pvm_capable:1;
 
 	u16 slot_ctrl;				/* control register access */
 	struct mutex ctrl_lock;
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 1a522c1c4177..4ffd489a5413 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -487,12 +487,21 @@  void pciehp_power_off_slot(struct controller *ctrl)
 		 PCI_EXP_SLTCTL_PWR_OFF);
 }
 
+/*
+ * We should never need to re-read the slot status register this many times,
+ * because there are only six possible events that could generate this
+ * interrupt.  If we still see events after this many reads, there's a stuck
+ * bit.
+ */
+#define MAX_ISR_STATUS_READS 6
+
 static irqreturn_t pciehp_isr(int irq, void *dev_id)
 {
 	struct controller *ctrl = (struct controller *)dev_id;
 	struct pci_dev *pdev = ctrl_dev(ctrl);
 	struct device *parent = pdev->dev.parent;
-	u16 status, events;
+	u16 status, events = 0;
+	int status_reads = 0;
 
 	/*
 	 * Interrupts only occur in D3hot or shallower and only if enabled
@@ -517,6 +526,7 @@  static irqreturn_t pciehp_isr(int irq, void *dev_id)
 		}
 	}
 
+read_status:
 	pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status);
 	if (status == (u16) ~0) {
 		ctrl_info(ctrl, "%s: no response from device\n", __func__);
@@ -529,24 +539,44 @@  static irqreturn_t pciehp_isr(int irq, void *dev_id)
 	 * Slot Status contains plain status bits as well as event
 	 * notification bits; right now we only want the event bits.
 	 */
-	events = status & (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD |
-			   PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC |
-			   PCI_EXP_SLTSTA_DLLSC);
+	status &= (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD |
+		   PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC |
+		   PCI_EXP_SLTSTA_DLLSC);
 
 	/*
 	 * If we've already reported a power fault, don't report it again
 	 * until we've done something to handle it.
 	 */
 	if (ctrl->power_fault_detected)
-		events &= ~PCI_EXP_SLTSTA_PFD;
+		status &= ~PCI_EXP_SLTSTA_PFD;
 
+	events |= status;
 	if (!events) {
 		if (parent)
 			pm_runtime_put(parent);
 		return IRQ_NONE;
 	}
 
-	pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events);
+	if (status) {
+		pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, status);
+
+		/*
+		 * If MSI per-vector masking is not supported by the port,
+		 * all of the event bits must be zero before the port will
+		 * send a new interrupt (see PCI Express Base Specification
+		 * Revision 5.0 Version 1.0, section 6.7.3.4, "Software
+		 * Notification of Hot-Plug Events"). So in that case, if
+		 * event bit gets set between the read and the write of
+		 * PCI_EXP_SLTSTA, we need to loop back and try again.
+		 */
+		if (!ctrl->pvm_capable) {
+			if (status_reads++ < MAX_ISR_STATUS_READS)
+				goto read_status;
+			ctrl_warn(ctrl, "Hot plug event bit stuck (%x)\n",
+				  status);
+		}
+	}
+
 	ctrl_dbg(ctrl, "pending interrupts %#06x from Slot Status\n", events);
 	if (parent)
 		pm_runtime_put(parent);
@@ -812,6 +842,7 @@  struct controller *pcie_init(struct pcie_device *dev)
 {
 	struct controller *ctrl;
 	u32 slot_cap, link_cap;
+	u16 msiflags;
 	u8 poweron;
 	struct pci_dev *pdev = dev->port;
 	struct pci_bus *subordinate = pdev->subordinate;
@@ -881,6 +912,13 @@  struct controller *pcie_init(struct pcie_device *dev)
 		}
 	}
 
+	ctrl->pvm_capable = 1;
+	if (pdev->msi_enabled) {
+		pci_read_config_word(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+				     &msiflags);
+		ctrl->pvm_capable = !!(msiflags & PCI_MSI_FLAGS_MASKBIT);
+	}
+
 	return ctrl;
 }