PCI: dwc: Fix interrupt race in when handling MSI

Message ID 20181027000028.21343-1-tpiepho@impinj.com
State New
Delegated to: Lorenzo Pieralisi
Headers show
Series
  • PCI: dwc: Fix interrupt race in when handling MSI
Related show

Commit Message

Trent Piepho Oct. 27, 2018, midnight
This reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status
after it is handled, not before").

This is a very real race that we observed quickly after switching from
4.13 to 4.16.  Using a custom PCI-e endpoint and driver, I was able to
track it to the precise race and verify the fixed behavior, as will be
described below.

This bug was originally fixed in 2013, in commit ca1658921b63 ("PCI:
designware: Fix missing MSI IRQs") The discussion of that commit,
archived in patchwork [1], is informative and worth reading.

The bug was re-added in the '8c934 commit this is reverting, which
appeared in the 4.14 kernel.

Unfortunately, Synopsys appears to consider the operation of this PCI-e
controller secret.  They provide no publicly available docs for it nor
allow the references manuals of SoCs using the controller to publish any
documentation of it.

So I can not say certain this code is correctly using the controller's
features.  However, extensive testing gives me high confidence in the
accuracy of what is described below.

If an MSI is received by the PCI-e controller while the status register
bit for that MSI is set, the PCI-e controller will NOT generate another
interrupt.  In addition, it will NOT queue or otherwise mark the
interrupt as "pending", which means it will NOT generate an interrupt
when the status bit is unmasked.

This gives the following race scenario:

1.  An MSI is received by, and the status bit for the MSI is set in, the
DWC PCI-e controller.
2.  dw_handle_msi_irq() calls a driver's registered interrupt handler
for the MSI received.
3.  At some point, the interrupt handler must decide, correctly, that
there is no more work to do and return.
4.  The hardware generates a new MSI.  As the MSI's status bit is still
set, this new MSI is ignored.
6.  dw_handle_msi_irq() unsets the MSI status bit.

The MSI received at point 4 will never be acted upon.  It occurred after
the driver had finished checking the hardware status for interrupt
conditions to act on.  Since the MSI status was masked, it does not
generated a new IRQ, neither when it was received nor when the MSI is
unmasked.

It seems clear there is an unsolvable race here.

After this patch, the sequence becomes as follows:

1.  An MSI is received and the status bit for the MSI is set in the
DWC PCI-e controller.
2.  dw_handle_msi_irq() clears this MSI status bit.
3.  dw_handle_msi_irq() calls a driver's registered interrupt handler
for the MSI received.
3.  At some point, the interrupt handler must decide, correctly, that
there is no more work to do and return.
4.  The hardware generates a new MSI.  This sets the MSI status bit and
triggers another interrupt in the interrupt controller(s) above the DWC
PCI-e controller.  As the the dwc-pcie handler is not re-entrant, it is
not run again at this time.
6.  dw_handle_msi_irq() finishes.  The MSI status bit remains set.
7.  The IRQ is re-raised and dw_handle_msi_irq() runs again.
8.  dw_handle_msi_irq() invokes the MSI's registered interrupt handler
again as the status bit was still set.

The observant will notice that point 4 present the opportunity for the
SoC's interrupt controller to lose the interrupt in the same manner as
the bug in this driver.  The driver for that interrupt controller will
be written to properly deal with this.  In some cases the hardware
supports an EOI concept, where the 2nd IRQ is masked and internally
queued in the hardware, to be re-raised at EOI in step 7.  In other
cases the IRQ will be unmasked and re-raised at step 4, but the kernel
will see the handler is INPROGRESS and not re-invoke it, and instead set
a PENDING flag, which causes the handler to re-run at step 7.

[1] https://patchwork.kernel.org/patch/3333681/

Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before")
Cc: stable@vger.kernel.org
Cc: Vignesh R <vigneshr@ti.com>
Cc: Faiz Abbas <faiz_abbas@ti.com>
Cc: Jingoo Han <jingoohan1@gmail.com>
Cc: Gustavo Pimentel <gustavo.pimentel@synopsys.com>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Trent Piepho <tpiepho@impinj.com>
---
 drivers/pci/controller/dwc/pcie-designware-host.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Vignesh R Nov. 5, 2018, 10:28 a.m. | #1
On 27/10/18 5:30 AM, Trent Piepho wrote:
> This reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status
> after it is handled, not before").
> 
> This is a very real race that we observed quickly after switching from
> 4.13 to 4.16.  Using a custom PCI-e endpoint and driver, I was able to
> track it to the precise race and verify the fixed behavior, as will be
> described below.
> 
> This bug was originally fixed in 2013, in commit ca1658921b63 ("PCI:
> designware: Fix missing MSI IRQs") The discussion of that commit,
> archived in patchwork [1], is informative and worth reading.
> 
> The bug was re-added in the '8c934 commit this is reverting, which
> appeared in the 4.14 kernel.
> 
> Unfortunately, Synopsys appears to consider the operation of this PCI-e
> controller secret.  They provide no publicly available docs for it nor
> allow the references manuals of SoCs using the controller to publish any
> documentation of it.
> 
> So I can not say certain this code is correctly using the controller's
> features.  However, extensive testing gives me high confidence in the
> accuracy of what is described below.
> 
> If an MSI is received by the PCI-e controller while the status register
> bit for that MSI is set, the PCI-e controller will NOT generate another
> interrupt.  In addition, it will NOT queue or otherwise mark the
> interrupt as "pending", which means it will NOT generate an interrupt
> when the status bit is unmasked.
> 
> This gives the following race scenario:
> 
> 1.  An MSI is received by, and the status bit for the MSI is set in, the
> DWC PCI-e controller.
> 2.  dw_handle_msi_irq() calls a driver's registered interrupt handler
> for the MSI received.
> 3.  At some point, the interrupt handler must decide, correctly, that
> there is no more work to do and return.
> 4.  The hardware generates a new MSI.  As the MSI's status bit is still
> set, this new MSI is ignored.
> 6.  dw_handle_msi_irq() unsets the MSI status bit.
> 
> The MSI received at point 4 will never be acted upon.  It occurred after
> the driver had finished checking the hardware status for interrupt
> conditions to act on.  Since the MSI status was masked, it does not
> generated a new IRQ, neither when it was received nor when the MSI is
> unmasked.
> 
> It seems clear there is an unsolvable race here.
> 
> After this patch, the sequence becomes as follows:
> 
> 1.  An MSI is received and the status bit for the MSI is set in the
> DWC PCI-e controller.
> 2.  dw_handle_msi_irq() clears this MSI status bit.
> 3.  dw_handle_msi_irq() calls a driver's registered interrupt handler
> for the MSI received.
> 3.  At some point, the interrupt handler must decide, correctly, that
> there is no more work to do and return.
> 4.  The hardware generates a new MSI.  This sets the MSI status bit and
> triggers another interrupt in the interrupt controller(s) above the DWC
> PCI-e controller.  As the the dwc-pcie handler is not re-entrant, it is
> not run again at this time.
> 6.  dw_handle_msi_irq() finishes.  The MSI status bit remains set.
> 7.  The IRQ is re-raised and dw_handle_msi_irq() runs again.
> 8.  dw_handle_msi_irq() invokes the MSI's registered interrupt handler
> again as the status bit was still set.
> 
> The observant will notice that point 4 present the opportunity for the
> SoC's interrupt controller to lose the interrupt in the same manner as
> the bug in this driver.  The driver for that interrupt controller will
> be written to properly deal with this.  In some cases the hardware
> supports an EOI concept, where the 2nd IRQ is masked and internally
> queued in the hardware, to be re-raised at EOI in step 7.  In other
> cases the IRQ will be unmasked and re-raised at step 4, but the kernel
> will see the handler is INPROGRESS and not re-invoke it, and instead set
> a PENDING flag, which causes the handler to re-run at step 7.
> 
> [1] https://patchwork.kernel.org/patch/3333681/
> 
> Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before")
> Cc: stable@vger.kernel.org
> Cc: Vignesh R <vigneshr@ti.com>
> Cc: Faiz Abbas <faiz_abbas@ti.com>
> Cc: Jingoo Han <jingoohan1@gmail.com>
> Cc: Gustavo Pimentel <gustavo.pimentel@synopsys.com>
> Cc: Joao Pinto <jpinto@synopsys.com>
> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Signed-off-by: Trent Piepho <tpiepho@impinj.com>
> ---


Thanks for the patch! I have observed similar race as described above
using DWC controller on DRA7xx SoCs[2]. The proposed patch and correct
handling of MSIs within controller driver helps overcome this issue. So,
for this patch:

Reviewed-by: Vignesh R <vigneshr@ti.com>


[2] https://www.spinics.net/lists/linux-pci/msg68996.html

>  drivers/pci/controller/dwc/pcie-designware-host.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/controller/dwc/pcie-designware-host.c b/drivers/pci/controller/dwc/pcie-designware-host.c
> index 29a05759a294..9a3960c95ad3 100644
> --- a/drivers/pci/controller/dwc/pcie-designware-host.c
> +++ b/drivers/pci/controller/dwc/pcie-designware-host.c
> @@ -98,10 +98,10 @@ irqreturn_t dw_handle_msi_irq(struct pcie_port *pp)
>  			irq = irq_find_mapping(pp->irq_domain,
>  					       (i * MAX_MSI_IRQS_PER_CTRL) +
>  					       pos);
> -			generic_handle_irq(irq);
>  			dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_STATUS +
>  						(i * MSI_REG_CTRL_BLOCK_SIZE),
>  					    4, 1 << pos);
> +			generic_handle_irq(irq);
>  			pos++;
>  		}
>  	}
>
Gustavo Pimentel Nov. 5, 2018, 12:08 p.m. | #2
On 27/10/2018 01:00, Trent Piepho wrote:
> This reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status
> after it is handled, not before").
> 
> This is a very real race that we observed quickly after switching from
> 4.13 to 4.16.  Using a custom PCI-e endpoint and driver, I was able to
> track it to the precise race and verify the fixed behavior, as will be
> described below.
> 
> This bug was originally fixed in 2013, in commit ca1658921b63 ("PCI:
> designware: Fix missing MSI IRQs") The discussion of that commit,
> archived in patchwork [1], is informative and worth reading.
> 
> The bug was re-added in the '8c934 commit this is reverting, which
> appeared in the 4.14 kernel.
> 
> Unfortunately, Synopsys appears to consider the operation of this PCI-e
> controller secret.  They provide no publicly available docs for it nor
> allow the references manuals of SoCs using the controller to publish any
> documentation of it.
> 
> So I can not say certain this code is correctly using the controller's
> features.  However, extensive testing gives me high confidence in the
> accuracy of what is described below.
> 
> If an MSI is received by the PCI-e controller while the status register
> bit for that MSI is set, the PCI-e controller will NOT generate another
> interrupt.  In addition, it will NOT queue or otherwise mark the
> interrupt as "pending", which means it will NOT generate an interrupt
> when the status bit is unmasked.
> 
> This gives the following race scenario:
> 
> 1.  An MSI is received by, and the status bit for the MSI is set in, the
> DWC PCI-e controller.
> 2.  dw_handle_msi_irq() calls a driver's registered interrupt handler
> for the MSI received.
> 3.  At some point, the interrupt handler must decide, correctly, that
> there is no more work to do and return.
> 4.  The hardware generates a new MSI.  As the MSI's status bit is still
> set, this new MSI is ignored.
> 6.  dw_handle_msi_irq() unsets the MSI status bit.
> 
> The MSI received at point 4 will never be acted upon.  It occurred after
> the driver had finished checking the hardware status for interrupt
> conditions to act on.  Since the MSI status was masked, it does not
> generated a new IRQ, neither when it was received nor when the MSI is
> unmasked.
> 
> It seems clear there is an unsolvable race here.
> 
> After this patch, the sequence becomes as follows:
> 
> 1.  An MSI is received and the status bit for the MSI is set in the
> DWC PCI-e controller.
> 2.  dw_handle_msi_irq() clears this MSI status bit.
> 3.  dw_handle_msi_irq() calls a driver's registered interrupt handler
> for the MSI received.
> 3.  At some point, the interrupt handler must decide, correctly, that
> there is no more work to do and return.
> 4.  The hardware generates a new MSI.  This sets the MSI status bit and
> triggers another interrupt in the interrupt controller(s) above the DWC
> PCI-e controller.  As the the dwc-pcie handler is not re-entrant, it is
> not run again at this time.
> 6.  dw_handle_msi_irq() finishes.  The MSI status bit remains set.
> 7.  The IRQ is re-raised and dw_handle_msi_irq() runs again.
> 8.  dw_handle_msi_irq() invokes the MSI's registered interrupt handler
> again as the status bit was still set.
> 
> The observant will notice that point 4 present the opportunity for the
> SoC's interrupt controller to lose the interrupt in the same manner as
> the bug in this driver.  The driver for that interrupt controller will
> be written to properly deal with this.  In some cases the hardware
> supports an EOI concept, where the 2nd IRQ is masked and internally
> queued in the hardware, to be re-raised at EOI in step 7.  In other
> cases the IRQ will be unmasked and re-raised at step 4, but the kernel
> will see the handler is INPROGRESS and not re-invoke it, and instead set
> a PENDING flag, which causes the handler to re-run at step 7.
> 
> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.org_patch_3333681_&d=DwIFAw&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=ZglJ52030-QSg4t06j182Uq9rsDcKNSB0P7q7Cx5Ow8&s=sHHaXL5syvT5Y0WGTVohW_3QqpQgnPTPOQ-HxAmvnb0&e=
> 
> Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before")
> Cc: stable@vger.kernel.org
> Cc: Vignesh R <vigneshr@ti.com>
> Cc: Faiz Abbas <faiz_abbas@ti.com>
> Cc: Jingoo Han <jingoohan1@gmail.com>
> Cc: Gustavo Pimentel <gustavo.pimentel@synopsys.com>
> Cc: Joao Pinto <jpinto@synopsys.com>
> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Signed-off-by: Trent Piepho <tpiepho@impinj.com>
> ---
>  drivers/pci/controller/dwc/pcie-designware-host.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/controller/dwc/pcie-designware-host.c b/drivers/pci/controller/dwc/pcie-designware-host.c
> index 29a05759a294..9a3960c95ad3 100644
> --- a/drivers/pci/controller/dwc/pcie-designware-host.c
> +++ b/drivers/pci/controller/dwc/pcie-designware-host.c
> @@ -98,10 +98,10 @@ irqreturn_t dw_handle_msi_irq(struct pcie_port *pp)
>  			irq = irq_find_mapping(pp->irq_domain,
>  					       (i * MAX_MSI_IRQS_PER_CTRL) +
>  					       pos);
> -			generic_handle_irq(irq);
>  			dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_STATUS +
>  						(i * MSI_REG_CTRL_BLOCK_SIZE),
>  					    4, 1 << pos);
> +			generic_handle_irq(irq);

Makes sense.

Acked-by: Gustavo Pimentel <gustavo.pimentel@synopsys.com>

>  			pos++;
>  		}
>  	}
>
Lorenzo Pieralisi Nov. 6, 2018, 2:53 p.m. | #3
[CC Marc]

On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote:
> This reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status
> after it is handled, not before").
> 
> This is a very real race that we observed quickly after switching from
> 4.13 to 4.16.  Using a custom PCI-e endpoint and driver, I was able to
> track it to the precise race and verify the fixed behavior, as will be
> described below.
> 
> This bug was originally fixed in 2013, in commit ca1658921b63 ("PCI:
> designware: Fix missing MSI IRQs") The discussion of that commit,
> archived in patchwork [1], is informative and worth reading.
> 
> The bug was re-added in the '8c934 commit this is reverting, which
> appeared in the 4.14 kernel.
> 
> Unfortunately, Synopsys appears to consider the operation of this PCI-e
> controller secret.  They provide no publicly available docs for it nor
> allow the references manuals of SoCs using the controller to publish any
> documentation of it.
> 
> So I can not say certain this code is correctly using the controller's
> features.  However, extensive testing gives me high confidence in the
> accuracy of what is described below.
> 
> If an MSI is received by the PCI-e controller while the status register
> bit for that MSI is set, the PCI-e controller will NOT generate another
> interrupt.  In addition, it will NOT queue or otherwise mark the
> interrupt as "pending", which means it will NOT generate an interrupt
> when the status bit is unmasked.
> 
> This gives the following race scenario:
> 
> 1.  An MSI is received by, and the status bit for the MSI is set in, the
> DWC PCI-e controller.
> 2.  dw_handle_msi_irq() calls a driver's registered interrupt handler
> for the MSI received.
> 3.  At some point, the interrupt handler must decide, correctly, that
> there is no more work to do and return.
> 4.  The hardware generates a new MSI.  As the MSI's status bit is still
> set, this new MSI is ignored.
> 6.  dw_handle_msi_irq() unsets the MSI status bit.
> 
> The MSI received at point 4 will never be acted upon.  It occurred after
> the driver had finished checking the hardware status for interrupt
> conditions to act on.  Since the MSI status was masked, it does not
> generated a new IRQ, neither when it was received nor when the MSI is
> unmasked.
> 
> It seems clear there is an unsolvable race here.
> 
> After this patch, the sequence becomes as follows:
> 
> 1.  An MSI is received and the status bit for the MSI is set in the
> DWC PCI-e controller.
> 2.  dw_handle_msi_irq() clears this MSI status bit.
> 3.  dw_handle_msi_irq() calls a driver's registered interrupt handler
> for the MSI received.
> 3.  At some point, the interrupt handler must decide, correctly, that
> there is no more work to do and return.
> 4.  The hardware generates a new MSI.  This sets the MSI status bit and
> triggers another interrupt in the interrupt controller(s) above the DWC
> PCI-e controller.  As the the dwc-pcie handler is not re-entrant, it is
> not run again at this time.
> 6.  dw_handle_msi_irq() finishes.  The MSI status bit remains set.
> 7.  The IRQ is re-raised and dw_handle_msi_irq() runs again.
> 8.  dw_handle_msi_irq() invokes the MSI's registered interrupt handler
> again as the status bit was still set.

Not sure why (5) is not used in your lists, I assume because you want
to highlight the race condition with the jump from 4 to 6 (or maybe
you do not like number 5 :), just curious).

> The observant will notice that point 4 present the opportunity for the
> SoC's interrupt controller to lose the interrupt in the same manner as
> the bug in this driver.  The driver for that interrupt controller will
> be written to properly deal with this.  In some cases the hardware
> supports an EOI concept, where the 2nd IRQ is masked and internally
> queued in the hardware, to be re-raised at EOI in step 7.  In other
> cases the IRQ will be unmasked and re-raised at step 4, but the kernel
> will see the handler is INPROGRESS and not re-invoke it, and instead set
> a PENDING flag, which causes the handler to re-run at step 7.
> 
> [1] https://patchwork.kernel.org/patch/3333681/
> 
> Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before")

I have two questions:

- Commit 8c934095fa2f has been in the kernel for a year and no
  regression was reported. It was assumed to fix a problem so before
  reverting it I want to make sure we are not breaking anything else.
- Your reasoning seems correct but I would pick Marc's brain on this
  because I want to understand if what this patch does is what IRQ core
  expects it to do, especially in relation to the IRQ chaining you
  are mentioning.

Lorenzo

> Cc: stable@vger.kernel.org
> Cc: Vignesh R <vigneshr@ti.com>
> Cc: Faiz Abbas <faiz_abbas@ti.com>
> Cc: Jingoo Han <jingoohan1@gmail.com>
> Cc: Gustavo Pimentel <gustavo.pimentel@synopsys.com>
> Cc: Joao Pinto <jpinto@synopsys.com>
> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Signed-off-by: Trent Piepho <tpiepho@impinj.com>
> ---
>  drivers/pci/controller/dwc/pcie-designware-host.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/controller/dwc/pcie-designware-host.c b/drivers/pci/controller/dwc/pcie-designware-host.c
> index 29a05759a294..9a3960c95ad3 100644
> --- a/drivers/pci/controller/dwc/pcie-designware-host.c
> +++ b/drivers/pci/controller/dwc/pcie-designware-host.c
> @@ -98,10 +98,10 @@ irqreturn_t dw_handle_msi_irq(struct pcie_port *pp)
>  			irq = irq_find_mapping(pp->irq_domain,
>  					       (i * MAX_MSI_IRQS_PER_CTRL) +
>  					       pos);
> -			generic_handle_irq(irq);
>  			dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_STATUS +
>  						(i * MSI_REG_CTRL_BLOCK_SIZE),
>  					    4, 1 << pos);
> +			generic_handle_irq(irq);
>  			pos++;
>  		}
>  	}
> -- 
> 2.14.4
>
Marc Zyngier Nov. 6, 2018, 4 p.m. | #4
On 06/11/18 14:53, Lorenzo Pieralisi wrote:
> [CC Marc]
> 
> On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote:
>> This reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status
>> after it is handled, not before").
>>
>> This is a very real race that we observed quickly after switching from
>> 4.13 to 4.16.  Using a custom PCI-e endpoint and driver, I was able to
>> track it to the precise race and verify the fixed behavior, as will be
>> described below.
>>
>> This bug was originally fixed in 2013, in commit ca1658921b63 ("PCI:
>> designware: Fix missing MSI IRQs") The discussion of that commit,
>> archived in patchwork [1], is informative and worth reading.
>>
>> The bug was re-added in the '8c934 commit this is reverting, which
>> appeared in the 4.14 kernel.
>>
>> Unfortunately, Synopsys appears to consider the operation of this PCI-e
>> controller secret.  They provide no publicly available docs for it nor
>> allow the references manuals of SoCs using the controller to publish any
>> documentation of it.
>>
>> So I can not say certain this code is correctly using the controller's
>> features.  However, extensive testing gives me high confidence in the
>> accuracy of what is described below.
>>
>> If an MSI is received by the PCI-e controller while the status register
>> bit for that MSI is set, the PCI-e controller will NOT generate another
>> interrupt.  In addition, it will NOT queue or otherwise mark the
>> interrupt as "pending", which means it will NOT generate an interrupt
>> when the status bit is unmasked.
>>
>> This gives the following race scenario:
>>
>> 1.  An MSI is received by, and the status bit for the MSI is set in, the
>> DWC PCI-e controller.
>> 2.  dw_handle_msi_irq() calls a driver's registered interrupt handler
>> for the MSI received.
>> 3.  At some point, the interrupt handler must decide, correctly, that
>> there is no more work to do and return.
>> 4.  The hardware generates a new MSI.  As the MSI's status bit is still
>> set, this new MSI is ignored.
>> 6.  dw_handle_msi_irq() unsets the MSI status bit.
>>
>> The MSI received at point 4 will never be acted upon.  It occurred after
>> the driver had finished checking the hardware status for interrupt
>> conditions to act on.  Since the MSI status was masked, it does not
>> generated a new IRQ, neither when it was received nor when the MSI is
>> unmasked.
>>
>> It seems clear there is an unsolvable race here.
>>
>> After this patch, the sequence becomes as follows:
>>
>> 1.  An MSI is received and the status bit for the MSI is set in the
>> DWC PCI-e controller.
>> 2.  dw_handle_msi_irq() clears this MSI status bit.
>> 3.  dw_handle_msi_irq() calls a driver's registered interrupt handler
>> for the MSI received.
>> 3.  At some point, the interrupt handler must decide, correctly, that
>> there is no more work to do and return.
>> 4.  The hardware generates a new MSI.  This sets the MSI status bit and
>> triggers another interrupt in the interrupt controller(s) above the DWC
>> PCI-e controller.  As the the dwc-pcie handler is not re-entrant, it is
>> not run again at this time.
>> 6.  dw_handle_msi_irq() finishes.  The MSI status bit remains set.
>> 7.  The IRQ is re-raised and dw_handle_msi_irq() runs again.
>> 8.  dw_handle_msi_irq() invokes the MSI's registered interrupt handler
>> again as the status bit was still set.
> 
> Not sure why (5) is not used in your lists, I assume because you want
> to highlight the race condition with the jump from 4 to 6 (or maybe
> you do not like number 5 :), just curious).
> 
>> The observant will notice that point 4 present the opportunity for the
>> SoC's interrupt controller to lose the interrupt in the same manner as
>> the bug in this driver.  The driver for that interrupt controller will
>> be written to properly deal with this.  In some cases the hardware
>> supports an EOI concept, where the 2nd IRQ is masked and internally
>> queued in the hardware, to be re-raised at EOI in step 7.  In other
>> cases the IRQ will be unmasked and re-raised at step 4, but the kernel
>> will see the handler is INPROGRESS and not re-invoke it, and instead set
>> a PENDING flag, which causes the handler to re-run at step 7.
>>
>> [1] https://patchwork.kernel.org/patch/3333681/
>>
>> Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before")
> 
> I have two questions:
> 
> - Commit 8c934095fa2f has been in the kernel for a year and no
>   regression was reported. It was assumed to fix a problem so before
>   reverting it I want to make sure we are not breaking anything else.
> - Your reasoning seems correct but I would pick Marc's brain on this
>   because I want to understand if what this patch does is what IRQ core
>   expects it to do, especially in relation to the IRQ chaining you
>   are mentioning.

It is hard to decide what the right solution is without understanding
exactly what this particular write actually does. It seems to be some
form of acknowledgement, but I'm only making an educated guess, and some
of the defines suggest that there might be another register for that.

What I'm interested in is the relationship this has with the mask/unmask
callbacks, and whether masking the interrupt before acking it would help.

Gustavo, can you help here?

In any way, moving the action of acknowledging the interrupt to its
right spot in the kernel (dw_pci_bottom_ack) would be a good start.

Thanks,

	M.
Trent Piepho Nov. 6, 2018, 6:59 p.m. | #5
On Tue, 2018-11-06 at 14:53 +0000, Lorenzo Pieralisi wrote:
> 
> 
> Not sure why (5) is not used in your lists, I assume because you want
> to highlight the race condition with the jump from 4 to 6 (or maybe
> you do not like number 5 :), just curious).

It's because I removed a step and I'm used to rst docs where it
renumbers automatically.

> 
> > The observant will notice that point 4 present the opportunity for the
> > SoC's interrupt controller to lose the interrupt in the same manner as
> > the bug in this driver.  The driver for that interrupt controller will
> > be written to properly deal with this.  In some cases the hardware
> > supports an EOI concept, where the 2nd IRQ is masked and internally
> > queued in the hardware, to be re-raised at EOI in step 7.  In other
> > cases the IRQ will be unmasked and re-raised at step 4, but the kernel
> > will see the handler is INPROGRESS and not re-invoke it, and instead set
> > a PENDING flag, which causes the handler to re-run at step 7.
> > 
> > [1] 
> > 
> > Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before")
> 
> I have two questions:
> 
> - Commit 8c934095fa2f has been in the kernel for a year and no
>   regression was reported. It was assumed to fix a problem so before
>   reverting it I want to make sure we are not breaking anything else.

There have been reports.  Vignesh reported this some time ago.  Also
see https://lkml.org/lkml/2018/11/2/55, the author indicated to me that
it was the caused by this problem after I pointed him to my patch.  And
I've reported it now too.  

We only updated to a kernel with the regression a few months ago.  I
think imx6/7 systems don't update their kernels that quickly.  Usually
not running a mainstream distro with automatic updates.  And there are
probably a lot of people using the NXP vendor kernel, which has a ton
of unmerged patches, and so they don't see this regression yet.

> - Your reasoning seems correct but I would pick Marc's brain on this
>   because I want to understand if what this patch does is what IRQ core
>   expects it to do, especially in relation to the IRQ chaining you
>   are mentioning.

I've traced this through the ARM arch code and used our custom PCI
device to generate interrupts in all the race windows.  It does work as
I say, though I can't say this it how it's supposed to work.
Trent Piepho Nov. 6, 2018, 7:40 p.m. | #6
On Tue, 2018-11-06 at 16:00 +0000, Marc Zyngier wrote:
> 
> It is hard to decide what the right solution is without understanding
> exactly what this particular write actually does. It seems to be some
> form of acknowledgement, but I'm only making an educated guess, and some
> of the defines suggest that there might be another register for that.

Unfortunately, there are no docs for this controller.  I've determined
that it sets a bit in this register when an MSI is received.  Once set,
it acts as a mask and the controller will generate no interrupts when
the same MSI is subsequently received.  Writing a 1 to a bit clears
that mask bit, obviously so that each bit can be cleared atomically vs
a non-atomic RMW.  The controller does not queue any MSIs received
while the interrupt was masked.

> 
> What I'm interested in is the relationship this has with the mask/unmask
> callbacks, and whether masking the interrupt before acking it would help.


> 
> Gustavo, can you help here?
> 
> In any way, moving the action of acknowledging the interrupt to its
> right spot in the kernel (dw_pci_bottom_ack) would be a good start.

What about stable kernels that don't have the hierarchical API?
Lorenzo Pieralisi Nov. 7, 2018, 11:07 a.m. | #7
On Tue, Nov 06, 2018 at 07:40:18PM +0000, Trent Piepho wrote:
> On Tue, 2018-11-06 at 16:00 +0000, Marc Zyngier wrote:
> > 
> > It is hard to decide what the right solution is without understanding
> > exactly what this particular write actually does. It seems to be some
> > form of acknowledgement, but I'm only making an educated guess, and some
> > of the defines suggest that there might be another register for that.
> 
> Unfortunately, there are no docs for this controller.  I've determined
> that it sets a bit in this register when an MSI is received.  Once set,
> it acts as a mask and the controller will generate no interrupts when
> the same MSI is subsequently received.  Writing a 1 to a bit clears
> that mask bit, obviously so that each bit can be cleared atomically vs
> a non-atomic RMW.  The controller does not queue any MSIs received
> while the interrupt was masked.

I would like to understand if this is the expected behaviour across
dwc controllers but to understand that, as Marc mentioned, we would
need some input from Synopsys (if that's not customizable option).

Side note: what's the register at offset PCIE_MSI_INTR0_MASK
supposed to do ? It seems unused in the current kernel, with
a macro name hinting at MSI masking so I am asking.

> > What I'm interested in is the relationship this has with the mask/unmask
> > callbacks, and whether masking the interrupt before acking it would help.
> 
> 
> > 
> > Gustavo, can you help here?
> > 
> > In any way, moving the action of acknowledging the interrupt to its
> > right spot in the kernel (dw_pci_bottom_ack) would be a good start.
> 
> What about stable kernels that don't have the hierarchical API?

We will have to backport the patches.

Lorenzo
Gustavo Pimentel Nov. 7, 2018, 12:57 p.m. | #8
On 06/11/2018 16:00, Marc Zyngier wrote:
> On 06/11/18 14:53, Lorenzo Pieralisi wrote:
>> [CC Marc]
>>
>> On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote:
>>> This reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status
>>> after it is handled, not before").
>>>
>>> This is a very real race that we observed quickly after switching from
>>> 4.13 to 4.16.  Using a custom PCI-e endpoint and driver, I was able to
>>> track it to the precise race and verify the fixed behavior, as will be
>>> described below.
>>>
>>> This bug was originally fixed in 2013, in commit ca1658921b63 ("PCI:
>>> designware: Fix missing MSI IRQs") The discussion of that commit,
>>> archived in patchwork [1], is informative and worth reading.
>>>
>>> The bug was re-added in the '8c934 commit this is reverting, which
>>> appeared in the 4.14 kernel.
>>>
>>> Unfortunately, Synopsys appears to consider the operation of this PCI-e
>>> controller secret.  They provide no publicly available docs for it nor
>>> allow the references manuals of SoCs using the controller to publish any
>>> documentation of it.
>>>
>>> So I can not say certain this code is correctly using the controller's
>>> features.  However, extensive testing gives me high confidence in the
>>> accuracy of what is described below.
>>>
>>> If an MSI is received by the PCI-e controller while the status register
>>> bit for that MSI is set, the PCI-e controller will NOT generate another
>>> interrupt.  In addition, it will NOT queue or otherwise mark the
>>> interrupt as "pending", which means it will NOT generate an interrupt
>>> when the status bit is unmasked.
>>>
>>> This gives the following race scenario:
>>>
>>> 1.  An MSI is received by, and the status bit for the MSI is set in, the
>>> DWC PCI-e controller.
>>> 2.  dw_handle_msi_irq() calls a driver's registered interrupt handler
>>> for the MSI received.
>>> 3.  At some point, the interrupt handler must decide, correctly, that
>>> there is no more work to do and return.
>>> 4.  The hardware generates a new MSI.  As the MSI's status bit is still
>>> set, this new MSI is ignored.
>>> 6.  dw_handle_msi_irq() unsets the MSI status bit.
>>>
>>> The MSI received at point 4 will never be acted upon.  It occurred after
>>> the driver had finished checking the hardware status for interrupt
>>> conditions to act on.  Since the MSI status was masked, it does not
>>> generated a new IRQ, neither when it was received nor when the MSI is
>>> unmasked.
>>>
>>> It seems clear there is an unsolvable race here.
>>>
>>> After this patch, the sequence becomes as follows:
>>>
>>> 1.  An MSI is received and the status bit for the MSI is set in the
>>> DWC PCI-e controller.
>>> 2.  dw_handle_msi_irq() clears this MSI status bit.
>>> 3.  dw_handle_msi_irq() calls a driver's registered interrupt handler
>>> for the MSI received.
>>> 3.  At some point, the interrupt handler must decide, correctly, that
>>> there is no more work to do and return.
>>> 4.  The hardware generates a new MSI.  This sets the MSI status bit and
>>> triggers another interrupt in the interrupt controller(s) above the DWC
>>> PCI-e controller.  As the the dwc-pcie handler is not re-entrant, it is
>>> not run again at this time.
>>> 6.  dw_handle_msi_irq() finishes.  The MSI status bit remains set.
>>> 7.  The IRQ is re-raised and dw_handle_msi_irq() runs again.
>>> 8.  dw_handle_msi_irq() invokes the MSI's registered interrupt handler
>>> again as the status bit was still set.
>>
>> Not sure why (5) is not used in your lists, I assume because you want
>> to highlight the race condition with the jump from 4 to 6 (or maybe
>> you do not like number 5 :), just curious).
>>
>>> The observant will notice that point 4 present the opportunity for the
>>> SoC's interrupt controller to lose the interrupt in the same manner as
>>> the bug in this driver.  The driver for that interrupt controller will
>>> be written to properly deal with this.  In some cases the hardware
>>> supports an EOI concept, where the 2nd IRQ is masked and internally
>>> queued in the hardware, to be re-raised at EOI in step 7.  In other
>>> cases the IRQ will be unmasked and re-raised at step 4, but the kernel
>>> will see the handler is INPROGRESS and not re-invoke it, and instead set
>>> a PENDING flag, which causes the handler to re-run at step 7.
>>>
>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.org_patch_3333681_&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=3fxVS1aKAoLsOD0JiRgCC9pnyPfwwNVecSPqFznQ3Kc&s=X9ZYshlrXFwQbZJTzWC3ZC77lveB4ZlqfYqGJiSuqY8&e=
>>>
>>> Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before")
>>
>> I have two questions:
>>
>> - Commit 8c934095fa2f has been in the kernel for a year and no
>>   regression was reported. It was assumed to fix a problem so before
>>   reverting it I want to make sure we are not breaking anything else.
>> - Your reasoning seems correct but I would pick Marc's brain on this
>>   because I want to understand if what this patch does is what IRQ core
>>   expects it to do, especially in relation to the IRQ chaining you
>>   are mentioning.
> 
> It is hard to decide what the right solution is without understanding
> exactly what this particular write actually does. It seems to be some
> form of acknowledgement, but I'm only making an educated guess, and some
> of the defines suggest that there might be another register for that.

This status register indicates whether exists or not a MSI interrupt on that
controller [0..7] to be handle.

In theory, we should clear the interrupt flag only after the interrupt has
actually handled (which can take some time to process on the worst case scenario).

However, the Trent's patch allows to acknowledge the flag and handle the
interrupt later, giving the opportunity to catch a possible new interrupt, which
will be handle by a new call of this function.

> 
> What I'm interested in is the relationship this has with the mask/unmask
> callbacks, and whether masking the interrupt before acking it would help.

Although there is the possibility of mask/unmask the interruptions on the
controller, from what I've seen typically in other dw drivers this is not done.
Probably we don't get much benefit from using it.

Gustavo

> 
> Gustavo, can you help here?
> 
> In any way, moving the action of acknowledging the interrupt to its
> right spot in the kernel (dw_pci_bottom_ack) would be a good start.
> 
> Thanks,
> 
> 	M.
>
Gustavo Pimentel Nov. 7, 2018, 12:58 p.m. | #9
On 07/11/2018 11:07, Lorenzo Pieralisi wrote:
> On Tue, Nov 06, 2018 at 07:40:18PM +0000, Trent Piepho wrote:
>> On Tue, 2018-11-06 at 16:00 +0000, Marc Zyngier wrote:
>>>
>>> It is hard to decide what the right solution is without understanding
>>> exactly what this particular write actually does. It seems to be some
>>> form of acknowledgement, but I'm only making an educated guess, and some
>>> of the defines suggest that there might be another register for that.
>>
>> Unfortunately, there are no docs for this controller.  I've determined
>> that it sets a bit in this register when an MSI is received.  Once set,
>> it acts as a mask and the controller will generate no interrupts when
>> the same MSI is subsequently received.  Writing a 1 to a bit clears
>> that mask bit, obviously so that each bit can be cleared atomically vs
>> a non-atomic RMW.  The controller does not queue any MSIs received
>> while the interrupt was masked.
> 
> I would like to understand if this is the expected behaviour across
> dwc controllers but to understand that, as Marc mentioned, we would
> need some input from Synopsys (if that's not customizable option).
> 
> Side note: what's the register at offset PCIE_MSI_INTR0_MASK
> supposed to do ? It seems unused in the current kernel, with
> a macro name hinting at MSI masking so I am asking.

That's the interrupt mask for the controller #0 offset. Currently it's not used.

Gustavo

> 
>>> What I'm interested in is the relationship this has with the mask/unmask
>>> callbacks, and whether masking the interrupt before acking it would help.
>>
>>
>>>
>>> Gustavo, can you help here?
>>>
>>> In any way, moving the action of acknowledging the interrupt to its
>>> right spot in the kernel (dw_pci_bottom_ack) would be a good start.
>>
>> What about stable kernels that don't have the hierarchical API?
> 
> We will have to backport the patches.
> 
> Lorenzo
>
Trent Piepho Nov. 7, 2018, 6:32 p.m. | #10
On Wed, 2018-11-07 at 12:57 +0000, Gustavo Pimentel wrote:
> On 06/11/2018 16:00, Marc Zyngier wrote:
> > On 06/11/18 14:53, Lorenzo Pieralisi wrote:
> > > On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote:
> > > > 
> > > > This gives the following race scenario:
> > > > 
> > > > 1.  An MSI is received by, and the status bit for the MSI is set in, the
> > > > DWC PCI-e controller.
> > > > 2.  dw_handle_msi_irq() calls a driver's registered interrupt handler
> > > > for the MSI received.
> > > > 3.  At some point, the interrupt handler must decide, correctly, that
> > > > there is no more work to do and return.
> > > > 4.  The hardware generates a new MSI.  As the MSI's status bit is still
> > > > set, this new MSI is ignored.
> > > > 6.  dw_handle_msi_irq() unsets the MSI status bit.
> > > > 
> > > > The MSI received at point 4 will never be acted upon.  It occurred after
> > > > the driver had finished checking the hardware status for interrupt
> > > > conditions to act on.  Since the MSI status was masked, it does not
> > > > generated a new IRQ, neither when it was received nor when the MSI is
> > > > unmasked.
> > > > 

> This status register indicates whether exists or not a MSI interrupt on that
> controller [0..7] to be handle.

While the status for an MSI is set, no new interrupt will be triggered
if another identical MSI is received, correct?

> In theory, we should clear the interrupt flag only after the interrupt has
> actually handled (which can take some time to process on the worst case scenario).

But see above, there is a race if a new MSI arrives while still masked.
 I can see no possible way to solve this in software that does not
involve unmasking the MSI before calling the handler.  To leave the
interrupt masked while calling the handler requires the hardware to
queue an interrupt that arrives while masked.  We have no docs, but the
designware controller doesn't appear to do this in practice.

> However, the Trent's patch allows to acknowledge the flag and handle the
> interrupt later, giving the opportunity to catch a possible new interrupt, which
> will be handle by a new call of this function.
> 
> > 
> > What I'm interested in is the relationship this has with the mask/unmask
> > callbacks, and whether masking the interrupt before acking it would help.
> 
> Although there is the possibility of mask/unmask the interruptions on the
> controller, from what I've seen typically in other dw drivers this is not done.
> Probably we don't get much benefit from using it.
> 
> Gustavo
> 
> > 
> > Gustavo, can you help here?
> > 
> > In any way, moving the action of acknowledging the interrupt to its
> > right spot in the kernel (dw_pci_bottom_ack) would be a good start.
> > 
> > Thanks,
> > 
> > 	M.
> > 
> 
>
Marc Zyngier Nov. 7, 2018, 6:41 p.m. | #11
On 06/11/18 19:40, Trent Piepho wrote:
> On Tue, 2018-11-06 at 16:00 +0000, Marc Zyngier wrote:
>>
>> It is hard to decide what the right solution is without understanding
>> exactly what this particular write actually does. It seems to be some
>> form of acknowledgement, but I'm only making an educated guess, and some
>> of the defines suggest that there might be another register for that.
> 
> Unfortunately, there are no docs for this controller.  I've determined
> that it sets a bit in this register when an MSI is received.  Once set,
> it acts as a mask and the controller will generate no interrupts when
> the same MSI is subsequently received.  Writing a 1 to a bit clears
> that mask bit, obviously so that each bit can be cleared atomically vs
> a non-atomic RMW.  The controller does not queue any MSIs received
> while the interrupt was masked.

I'm not sure this is really a mask, as the controller seems to have
other ways of really masking the interrupt. It looks a lot more like
either an ACK, but I need to understand how this interacts with the
masking register.

And I do not want to guess it. I want actual information from people who
build this IP.

>> What I'm interested in is the relationship this has with the mask/unmask
>> callbacks, and whether masking the interrupt before acking it would help.
> 
> 
>>
>> Gustavo, can you help here?
>>
>> In any way, moving the action of acknowledging the interrupt to its
>> right spot in the kernel (dw_pci_bottom_ack) would be a good start.
> 
> What about stable kernels that don't have the hierarchical API?

My goal is to fix mainline first. Once we have something that works on
mainline, we can look at propagating the fix to other versions. But
mainline always comes first.

Thanks,

	M.
Marc Zyngier Nov. 7, 2018, 6:46 p.m. | #12
On 07/11/18 12:57, Gustavo Pimentel wrote:
> On 06/11/2018 16:00, Marc Zyngier wrote:
>> On 06/11/18 14:53, Lorenzo Pieralisi wrote:
>>> [CC Marc]
>>>
>>> On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote:
>>>> This reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status
>>>> after it is handled, not before").
>>>>
>>>> This is a very real race that we observed quickly after switching from
>>>> 4.13 to 4.16.  Using a custom PCI-e endpoint and driver, I was able to
>>>> track it to the precise race and verify the fixed behavior, as will be
>>>> described below.
>>>>
>>>> This bug was originally fixed in 2013, in commit ca1658921b63 ("PCI:
>>>> designware: Fix missing MSI IRQs") The discussion of that commit,
>>>> archived in patchwork [1], is informative and worth reading.
>>>>
>>>> The bug was re-added in the '8c934 commit this is reverting, which
>>>> appeared in the 4.14 kernel.
>>>>
>>>> Unfortunately, Synopsys appears to consider the operation of this PCI-e
>>>> controller secret.  They provide no publicly available docs for it nor
>>>> allow the references manuals of SoCs using the controller to publish any
>>>> documentation of it.
>>>>
>>>> So I can not say certain this code is correctly using the controller's
>>>> features.  However, extensive testing gives me high confidence in the
>>>> accuracy of what is described below.
>>>>
>>>> If an MSI is received by the PCI-e controller while the status register
>>>> bit for that MSI is set, the PCI-e controller will NOT generate another
>>>> interrupt.  In addition, it will NOT queue or otherwise mark the
>>>> interrupt as "pending", which means it will NOT generate an interrupt
>>>> when the status bit is unmasked.
>>>>
>>>> This gives the following race scenario:
>>>>
>>>> 1.  An MSI is received by, and the status bit for the MSI is set in, the
>>>> DWC PCI-e controller.
>>>> 2.  dw_handle_msi_irq() calls a driver's registered interrupt handler
>>>> for the MSI received.
>>>> 3.  At some point, the interrupt handler must decide, correctly, that
>>>> there is no more work to do and return.
>>>> 4.  The hardware generates a new MSI.  As the MSI's status bit is still
>>>> set, this new MSI is ignored.
>>>> 6.  dw_handle_msi_irq() unsets the MSI status bit.
>>>>
>>>> The MSI received at point 4 will never be acted upon.  It occurred after
>>>> the driver had finished checking the hardware status for interrupt
>>>> conditions to act on.  Since the MSI status was masked, it does not
>>>> generated a new IRQ, neither when it was received nor when the MSI is
>>>> unmasked.
>>>>
>>>> It seems clear there is an unsolvable race here.
>>>>
>>>> After this patch, the sequence becomes as follows:
>>>>
>>>> 1.  An MSI is received and the status bit for the MSI is set in the
>>>> DWC PCI-e controller.
>>>> 2.  dw_handle_msi_irq() clears this MSI status bit.
>>>> 3.  dw_handle_msi_irq() calls a driver's registered interrupt handler
>>>> for the MSI received.
>>>> 3.  At some point, the interrupt handler must decide, correctly, that
>>>> there is no more work to do and return.
>>>> 4.  The hardware generates a new MSI.  This sets the MSI status bit and
>>>> triggers another interrupt in the interrupt controller(s) above the DWC
>>>> PCI-e controller.  As the the dwc-pcie handler is not re-entrant, it is
>>>> not run again at this time.
>>>> 6.  dw_handle_msi_irq() finishes.  The MSI status bit remains set.
>>>> 7.  The IRQ is re-raised and dw_handle_msi_irq() runs again.
>>>> 8.  dw_handle_msi_irq() invokes the MSI's registered interrupt handler
>>>> again as the status bit was still set.
>>>
>>> Not sure why (5) is not used in your lists, I assume because you want
>>> to highlight the race condition with the jump from 4 to 6 (or maybe
>>> you do not like number 5 :), just curious).
>>>
>>>> The observant will notice that point 4 present the opportunity for the
>>>> SoC's interrupt controller to lose the interrupt in the same manner as
>>>> the bug in this driver.  The driver for that interrupt controller will
>>>> be written to properly deal with this.  In some cases the hardware
>>>> supports an EOI concept, where the 2nd IRQ is masked and internally
>>>> queued in the hardware, to be re-raised at EOI in step 7.  In other
>>>> cases the IRQ will be unmasked and re-raised at step 4, but the kernel
>>>> will see the handler is INPROGRESS and not re-invoke it, and instead set
>>>> a PENDING flag, which causes the handler to re-run at step 7.
>>>>
>>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.org_patch_3333681_&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=3fxVS1aKAoLsOD0JiRgCC9pnyPfwwNVecSPqFznQ3Kc&s=X9ZYshlrXFwQbZJTzWC3ZC77lveB4ZlqfYqGJiSuqY8&e=
>>>>
>>>> Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before")
>>>
>>> I have two questions:
>>>
>>> - Commit 8c934095fa2f has been in the kernel for a year and no
>>>   regression was reported. It was assumed to fix a problem so before
>>>   reverting it I want to make sure we are not breaking anything else.
>>> - Your reasoning seems correct but I would pick Marc's brain on this
>>>   because I want to understand if what this patch does is what IRQ core
>>>   expects it to do, especially in relation to the IRQ chaining you
>>>   are mentioning.
>>
>> It is hard to decide what the right solution is without understanding
>> exactly what this particular write actually does. It seems to be some
>> form of acknowledgement, but I'm only making an educated guess, and some
>> of the defines suggest that there might be another register for that.
> 
> This status register indicates whether exists or not a MSI interrupt on that
> controller [0..7] to be handle.

OK. What happen when the interrupt is masked? Does this interrupt appear
in the status register? And what is the effect of writing to that register?

> In theory, we should clear the interrupt flag only after the interrupt has
> actually handled (which can take some time to process on the worst case scenario).

At this stage, we do not care about performance, but correctness. If we
loose interrupts, then the driver is not correct, and we need to address
this first.

> However, the Trent's patch allows to acknowledge the flag and handle the
> interrupt later, giving the opportunity to catch a possible new interrupt, which
> will be handle by a new call of this function.
> 
>>
>> What I'm interested in is the relationship this has with the mask/unmask
>> callbacks, and whether masking the interrupt before acking it would help.
> 
> Although there is the possibility of mask/unmask the interruptions on the
> controller, from what I've seen typically in other dw drivers this is not done.
> Probably we don't get much benefit from using it.

I don't worry care about the drivers. I need to know what the HW does,
and how this maps to the Linux interrupt subsystem. At the moment, this
is just too blurry to be understandable.

Thanks,

	M.
Trent Piepho Nov. 7, 2018, 8:17 p.m. | #13
On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote:
> On 06/11/18 19:40, Trent Piepho wrote:
> > 
> > What about stable kernels that don't have the hierarchical API?
> 
> My goal is to fix mainline first. Once we have something that works on
> mainline, we can look at propagating the fix to other versions. But
> mainline always comes first.

This is a regression that went into 4.14.  Wouldn't the appropriate
action for the stable series be to undo the regression?

The hierarchical design might allow unmasking the irq to be placed in a
better location.  And maybe knowing how the hardware worked would allow
for a design that maintained a separate pending and active state in
hardware, with distinct ack and eoi controls, i.e. allowed an MSI to be
queued while still masked.  But that doesn't seem like the kind of
enhancement that would be backported to stable.
Marc Zyngier Nov. 8, 2018, 9:49 a.m. | #14
On 07/11/18 20:17, Trent Piepho wrote:
> On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote:
>> On 06/11/18 19:40, Trent Piepho wrote:
>>>
>>> What about stable kernels that don't have the hierarchical API?
>>
>> My goal is to fix mainline first. Once we have something that works on
>> mainline, we can look at propagating the fix to other versions. But
>> mainline always comes first.
> 
> This is a regression that went into 4.14.  Wouldn't the appropriate
> action for the stable series be to undo the regression?

This is not how stable works. Stable kernels *only* contain patches that
are backported from mainline, and do not take standalone patch.

Furthermore, your fix is to actually undo someone else's fix. Who is
right? In the absence of any documentation, the answer is "nobody".

> The hierarchical design might allow unmasking the irq to be placed in a
> better location.  And maybe knowing how the hardware worked would allow
> for a design that maintained a separate pending and active state in
> hardware, with distinct ack and eoi controls, i.e. allowed an MSI to be
> queued while still masked.  But that doesn't seem like the kind of
> enhancement that would be backported to stable.
Anything can be backported to stable once we understand the issue. At
the moment, we're just playing games moving stuff around and hope
nothing else will break. That's not a sustainable way of maintaining
this driver. At the moment, the only patch I'm inclined to propose until
we get an actual interrupt handling flow from Synopsys is to mark this
driver as "BROKEN".

Thanks,

	M.
Gustavo Pimentel Nov. 8, 2018, 11:24 a.m. | #15
On 07/11/2018 18:46, Marc Zyngier wrote:
> On 07/11/18 12:57, Gustavo Pimentel wrote:
>> On 06/11/2018 16:00, Marc Zyngier wrote:
>>> On 06/11/18 14:53, Lorenzo Pieralisi wrote:
>>>> [CC Marc]
>>>>
>>>> On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote:
>>>>> This reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status
>>>>> after it is handled, not before").
>>>>>
>>>>> This is a very real race that we observed quickly after switching from
>>>>> 4.13 to 4.16.  Using a custom PCI-e endpoint and driver, I was able to
>>>>> track it to the precise race and verify the fixed behavior, as will be
>>>>> described below.
>>>>>
>>>>> This bug was originally fixed in 2013, in commit ca1658921b63 ("PCI:
>>>>> designware: Fix missing MSI IRQs") The discussion of that commit,
>>>>> archived in patchwork [1], is informative and worth reading.
>>>>>
>>>>> The bug was re-added in the '8c934 commit this is reverting, which
>>>>> appeared in the 4.14 kernel.
>>>>>
>>>>> Unfortunately, Synopsys appears to consider the operation of this PCI-e
>>>>> controller secret.  They provide no publicly available docs for it nor
>>>>> allow the references manuals of SoCs using the controller to publish any
>>>>> documentation of it.
>>>>>
>>>>> So I can not say certain this code is correctly using the controller's
>>>>> features.  However, extensive testing gives me high confidence in the
>>>>> accuracy of what is described below.
>>>>>
>>>>> If an MSI is received by the PCI-e controller while the status register
>>>>> bit for that MSI is set, the PCI-e controller will NOT generate another
>>>>> interrupt.  In addition, it will NOT queue or otherwise mark the
>>>>> interrupt as "pending", which means it will NOT generate an interrupt
>>>>> when the status bit is unmasked.
>>>>>
>>>>> This gives the following race scenario:
>>>>>
>>>>> 1.  An MSI is received by, and the status bit for the MSI is set in, the
>>>>> DWC PCI-e controller.
>>>>> 2.  dw_handle_msi_irq() calls a driver's registered interrupt handler
>>>>> for the MSI received.
>>>>> 3.  At some point, the interrupt handler must decide, correctly, that
>>>>> there is no more work to do and return.
>>>>> 4.  The hardware generates a new MSI.  As the MSI's status bit is still
>>>>> set, this new MSI is ignored.
>>>>> 6.  dw_handle_msi_irq() unsets the MSI status bit.
>>>>>
>>>>> The MSI received at point 4 will never be acted upon.  It occurred after
>>>>> the driver had finished checking the hardware status for interrupt
>>>>> conditions to act on.  Since the MSI status was masked, it does not
>>>>> generated a new IRQ, neither when it was received nor when the MSI is
>>>>> unmasked.
>>>>>
>>>>> It seems clear there is an unsolvable race here.
>>>>>
>>>>> After this patch, the sequence becomes as follows:
>>>>>
>>>>> 1.  An MSI is received and the status bit for the MSI is set in the
>>>>> DWC PCI-e controller.
>>>>> 2.  dw_handle_msi_irq() clears this MSI status bit.
>>>>> 3.  dw_handle_msi_irq() calls a driver's registered interrupt handler
>>>>> for the MSI received.
>>>>> 3.  At some point, the interrupt handler must decide, correctly, that
>>>>> there is no more work to do and return.
>>>>> 4.  The hardware generates a new MSI.  This sets the MSI status bit and
>>>>> triggers another interrupt in the interrupt controller(s) above the DWC
>>>>> PCI-e controller.  As the the dwc-pcie handler is not re-entrant, it is
>>>>> not run again at this time.
>>>>> 6.  dw_handle_msi_irq() finishes.  The MSI status bit remains set.
>>>>> 7.  The IRQ is re-raised and dw_handle_msi_irq() runs again.
>>>>> 8.  dw_handle_msi_irq() invokes the MSI's registered interrupt handler
>>>>> again as the status bit was still set.
>>>>
>>>> Not sure why (5) is not used in your lists, I assume because you want
>>>> to highlight the race condition with the jump from 4 to 6 (or maybe
>>>> you do not like number 5 :), just curious).
>>>>
>>>>> The observant will notice that point 4 present the opportunity for the
>>>>> SoC's interrupt controller to lose the interrupt in the same manner as
>>>>> the bug in this driver.  The driver for that interrupt controller will
>>>>> be written to properly deal with this.  In some cases the hardware
>>>>> supports an EOI concept, where the 2nd IRQ is masked and internally
>>>>> queued in the hardware, to be re-raised at EOI in step 7.  In other
>>>>> cases the IRQ will be unmasked and re-raised at step 4, but the kernel
>>>>> will see the handler is INPROGRESS and not re-invoke it, and instead set
>>>>> a PENDING flag, which causes the handler to re-run at step 7.
>>>>>
>>>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.org_patch_3333681_&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=3fxVS1aKAoLsOD0JiRgCC9pnyPfwwNVecSPqFznQ3Kc&s=X9ZYshlrXFwQbZJTzWC3ZC77lveB4ZlqfYqGJiSuqY8&e=
>>>>>
>>>>> Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before")
>>>>
>>>> I have two questions:
>>>>
>>>> - Commit 8c934095fa2f has been in the kernel for a year and no
>>>>   regression was reported. It was assumed to fix a problem so before
>>>>   reverting it I want to make sure we are not breaking anything else.
>>>> - Your reasoning seems correct but I would pick Marc's brain on this
>>>>   because I want to understand if what this patch does is what IRQ core
>>>>   expects it to do, especially in relation to the IRQ chaining you
>>>>   are mentioning.
>>>
>>> It is hard to decide what the right solution is without understanding
>>> exactly what this particular write actually does. It seems to be some
>>> form of acknowledgement, but I'm only making an educated guess, and some
>>> of the defines suggest that there might be another register for that.
>>
>> This status register indicates whether exists or not a MSI interrupt on that
>> controller [0..7] to be handle.
> 
> OK. What happen when the interrupt is masked? Does this interrupt appear
> in the status register? And what is the effect of writing to that register?

When an MSI is received for a masked interrupt, the corresponding status bit
gets set in the interrupt status register but the controller will not signal it.
As soon as the masked interrupt is unmasked and assuming the status bit is still
set the controller will signal it. Have I explain this clearly?

> 
>> In theory, we should clear the interrupt flag only after the interrupt has
>> actually handled (which can take some time to process on the worst case scenario).
> 
> At this stage, we do not care about performance, but correctness. If we
> loose interrupts, then the driver is not correct, and we need to address
> this first.

Ok, Marc. Let see if I can summarize it. You are suggesting to change the
register from PCIE_MSI_INTR0_ENABLE to PCIE_MSI_INTR0_MASK on
dw_pci_bottom_mask() and dw_pci_bottom_unmask() and clearing the interrupt
status bit inside of dw_pci_bottom_ack() instead of dw_handle_msi_irq(), right?

> 
>> However, the Trent's patch allows to acknowledge the flag and handle the
>> interrupt later, giving the opportunity to catch a possible new interrupt, which
>> will be handle by a new call of this function.
>>
>>>
>>> What I'm interested in is the relationship this has with the mask/unmask
>>> callbacks, and whether masking the interrupt before acking it would help.
>>
>> Although there is the possibility of mask/unmask the interruptions on the
>> controller, from what I've seen typically in other dw drivers this is not done.
>> Probably we don't get much benefit from using it.
> 
> I don't worry care about the drivers. I need to know what the HW does,
> and how this maps to the Linux interrupt subsystem. At the moment, this
> is just too blurry to be understandable.

Ok, let's clarify this then.

> 
> Thanks,
> 
> 	M.
>
Gustavo Pimentel Nov. 8, 2018, 11:46 a.m. | #16
On 07/11/2018 18:32, Trent Piepho wrote:
> On Wed, 2018-11-07 at 12:57 +0000, Gustavo Pimentel wrote:
>> On 06/11/2018 16:00, Marc Zyngier wrote:
>>> On 06/11/18 14:53, Lorenzo Pieralisi wrote:
>>>> On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote:
>>>>>
>>>>> This gives the following race scenario:
>>>>>
>>>>> 1.  An MSI is received by, and the status bit for the MSI is set in, the
>>>>> DWC PCI-e controller.
>>>>> 2.  dw_handle_msi_irq() calls a driver's registered interrupt handler
>>>>> for the MSI received.
>>>>> 3.  At some point, the interrupt handler must decide, correctly, that
>>>>> there is no more work to do and return.
>>>>> 4.  The hardware generates a new MSI.  As the MSI's status bit is still
>>>>> set, this new MSI is ignored.
>>>>> 6.  dw_handle_msi_irq() unsets the MSI status bit.
>>>>>
>>>>> The MSI received at point 4 will never be acted upon.  It occurred after
>>>>> the driver had finished checking the hardware status for interrupt
>>>>> conditions to act on.  Since the MSI status was masked, it does not
>>>>> generated a new IRQ, neither when it was received nor when the MSI is
>>>>> unmasked.
>>>>>
> 
>> This status register indicates whether exists or not a MSI interrupt on that
>> controller [0..7] to be handle.
> 
> While the status for an MSI is set, no new interrupt will be triggered

Yes

> if another identical MSI is received, correct?

You cannot receive another identical MSI till you acknowledge the current one
(This is ensured by the PCI protocol, I guess).

> 
>> In theory, we should clear the interrupt flag only after the interrupt has
>> actually handled (which can take some time to process on the worst case scenario).
> 
> But see above, there is a race if a new MSI arrives while still masked.
>  I can see no possible way to solve this in software that does not
> involve unmasking the MSI before calling the handler.  To leave the
> interrupt masked while calling the handler requires the hardware to
> queue an interrupt that arrives while masked.  We have no docs, but the
> designware controller doesn't appear to do this in practice.

See my reply to Marc about the interrupt masking. Like you said, probably the
solution pass through using interrupt mask/unmask register instead of interrupt
enable/disable register.

Can you do a quick test, since you can easily reproduce the issue? Can you
change register offset on both functions dw_pci_bottom_mask() and
dw_pci_bottom_unmask()?

Basically exchange the PCIE_MSI_INTR0_ENABLE register by PCIE_MSI_INTR0_MASK.

Thanks.
Gustavo

> 
>> However, the Trent's patch allows to acknowledge the flag and handle the
>> interrupt later, giving the opportunity to catch a possible new interrupt, which
>> will be handle by a new call of this function.
>>
>>>
>>> What I'm interested in is the relationship this has with the mask/unmask
>>> callbacks, and whether masking the interrupt before acking it would help.
>>
>> Although there is the possibility of mask/unmask the interruptions on the
>> controller, from what I've seen typically in other dw drivers this is not done.
>> Probably we don't get much benefit from using it.
>>
>> Gustavo
>>
>>>
>>> Gustavo, can you help here?
>>>
>>> In any way, moving the action of acknowledging the interrupt to its
>>> right spot in the kernel (dw_pci_bottom_ack) would be a good start.
>>>
>>> Thanks,
>>>
>>> 	M.
>>>
>>
Trent Piepho Nov. 8, 2018, 7:49 p.m. | #17
On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote:
> On 07/11/18 20:17, Trent Piepho wrote:
> > On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote:
> > > On 06/11/18 19:40, Trent Piepho wrote:
> > > > 
> > > > What about stable kernels that don't have the hierarchical API?
> > > 
> > > My goal is to fix mainline first. Once we have something that works on
> > > mainline, we can look at propagating the fix to other versions. But
> > > mainline always comes first.
> > 
> > This is a regression that went into 4.14.  Wouldn't the appropriate
> > action for the stable series be to undo the regression?
> 
> This is not how stable works. Stable kernels *only* contain patches that
> are backported from mainline, and do not take standalone patch.
> 
> Furthermore, your fix is to actually undo someone else's fix. Who is
> right? In the absence of any documentation, the answer is "nobody".

Little more history to this bug.  The code was originally the way it is
now, but this same bug was fixed in 2013 in https://patchwork.kernel.or
g/patch/3333681/

Then that lasted four years until it was changed Aug 2017 in https://pa
tchwork.kernel.org/patch/9893303/

That lasted just six months until someone tried to revert it, https://p
atchwork.kernel.org/patch/9893303/

Seems pretty clear the way it is now is much worse than the way it was
before, even if the previous design may have had another flaw.  Though
I've yet to see anyone point out something makes the previous design
broken.  Sub-optimal yes, but not broken.

> Anything can be backported to stable once we understand the issue. At
> the moment, we're just playing games moving stuff around and hope
> nothing else will break. That's not a sustainable way of maintaining
> this driver. At the moment, the only patch I'm inclined to propose until
> we get an actual interrupt handling flow from Synopsys is to mark this
> driver as "BROKEN".

It feels like you're using this bug to hold designware hostage in a
broken kernel, and me along with them.  I don't have the documentation,
no one does, there's no way for me to give you want you want.  But I've
got hardware that doesn't work in the mainline kernel.
Trent Piepho Nov. 8, 2018, 8:51 p.m. | #18
On Thu, 2018-11-08 at 11:46 +0000, Gustavo Pimentel wrote:
> On 07/11/2018 18:32, Trent Piepho wrote:
> > On Wed, 2018-11-07 at 12:57 +0000, Gustavo Pimentel wrote:
> > > On 06/11/2018 16:00, Marc Zyngier wrote:
> > > > On 06/11/18 14:53, Lorenzo Pieralisi wrote:
> > > > > On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote:
> > > > > > 
> > > > > > This gives the following race scenario:
> > > > > > 
> > > > > > 1.  An MSI is received by, and the status bit for the MSI is set in, the
> > > > > > DWC PCI-e controller.
> > > > > > 2.  dw_handle_msi_irq() calls a driver's registered interrupt handler
> > > > > > for the MSI received.
> > > > > > 3.  At some point, the interrupt handler must decide, correctly, that
> > > > > > there is no more work to do and return.
> > > > > > 4.  The hardware generates a new MSI.  As the MSI's status bit is still
> > > > > > set, this new MSI is ignored.
> > > > > > 6.  dw_handle_msi_irq() unsets the MSI status bit.
> > > > > > 
> > > > > > The MSI received at point 4 will never be acted upon.  It occurred after
> > > > > > the driver had finished checking the hardware status for interrupt
> > > > > > conditions to act on.  Since the MSI status was masked, it does not
> > > > > > generated a new IRQ, neither when it was received nor when the MSI is
> > > > > > unmasked.
> > > > > > 
> > > This status register indicates whether exists or not a MSI interrupt on that
> > > controller [0..7] to be handle.
> > 
> > While the status for an MSI is set, no new interrupt will be triggered
> 
> Yes
> 
> > if another identical MSI is received, correct?
> 
> You cannot receive another identical MSI till you acknowledge the current one
> (This is ensured by the PCI protocol, I guess).

I don't believe this is a requirement of PCI.  We designed our hardware
to not send another MSI until our hardware's interrupt status register
is read, but we didn't have to do that.

> > > In theory, we should clear the interrupt flag only after the interrupt has
> > > actually handled (which can take some time to process on the worst case scenario).
> > 
> > But see above, there is a race if a new MSI arrives while still masked.
> >  I can see no possible way to solve this in software that does not
> > involve unmasking the MSI before calling the handler.  To leave the
> > interrupt masked while calling the handler requires the hardware to
> > queue an interrupt that arrives while masked.  We have no docs, but the
> > designware controller doesn't appear to do this in practice.
> 
> See my reply to Marc about the interrupt masking. Like you said, probably the
> solution pass through using interrupt mask/unmask register instead of interrupt
> enable/disable register.
> 
> Can you do a quick test, since you can easily reproduce the issue? Can you
> change register offset on both functions dw_pci_bottom_mask() and
> dw_pci_bottom_unmask()?
> 
> Basically exchange the PCIE_MSI_INTR0_ENABLE register by PCIE_MSI_INTR0_MASK.

Of course MSI still need to be enabled to work at all, which happens
once when the driver using the MSI registers a handler.  Masking can be
done via mask register after that.

It is not so easy for me to test on the newest kernel, as imx7d does
not work due to yet more bugs.  I have to port a set of patches to each
new kernel.  A set that does not shrink due to holdups like this.

I understand the new flow would look like this (assume dw controller
MSI interrupt output signal is connected to one of the ARM GIC
interrupt lines, there could be different or more controllers above the
dwc of course (but usually aren't)):

1. MSI arrives, status bit is set in dwc, interrupt raised to GIC.
2. dwc handler runs
3. dwc handler sees status bit is set for a(n) MSI(s)
4. dwc handler sets mask for those MSIs
5. dwc handler clears status bit
6. dwc handler runs driver handler for the received MSI
7. ** an new MSI arrives, racing with 6 **
8. status bit becomes set again, but no interrupt is raised due to mask
9. dwc handler unmasks MSI, which raises the interrupt to GIC because
of new MSI received in 7.
10. The original GIC interrupt is EOI'ed.
11. The interrupt for the dwc is re-raised by the GIC due to 9, and we
go back to 2.

It is very important that 5 be done before 6.  Less so 4 before 5, but
reversing the order there would allow re-raising even if the 2nd MSI
arrived before the driver handler ran, which is not necessary.

I do not see a race in this design and it appears correct to me.  But,
I also do not think there is any immediate improvement due to extra
steps of masking and unmasking the MSI.

The reason is that the GIC interrupt above the dwc is non-reentrant. 
It remains masked (aka active[1]) during this entire process (1 to 10).
 This means every MSI is effectively already masked.  So masking the
active MSI(s) a 2nd time gains nothing besides preventing some extra
edges for a masked interrupt going to the ARM GIC.

In theory, if the GIC interrupt handler was reentrant, then on receipt
of a new MSI we could re-enter the dwc handler on a different CPU and
run the new MSI (a different MSI!) at the same time as the original MSI
handler is still running.

There difference here is that by unmasking in the interrupt in the GIC
before the dwc handler is finished, masking an individual MSI in the
dwc is no longer a 2nd redundant masking.


[1] When I say masked in GIC, I mean the interrupt is in the "active"
or "active and pending" states.  In these states the interrupt will not
be raised to the CPU and can be considered masked.
Lorenzo Pieralisi Nov. 9, 2018, 10:13 a.m. | #19
On Thu, Nov 08, 2018 at 07:49:52PM +0000, Trent Piepho wrote:
> On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote:
> > On 07/11/18 20:17, Trent Piepho wrote:
> > > On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote:
> > > > On 06/11/18 19:40, Trent Piepho wrote:
> > > > > 
> > > > > What about stable kernels that don't have the hierarchical API?
> > > > 
> > > > My goal is to fix mainline first. Once we have something that works on
> > > > mainline, we can look at propagating the fix to other versions. But
> > > > mainline always comes first.
> > > 
> > > This is a regression that went into 4.14.  Wouldn't the appropriate
> > > action for the stable series be to undo the regression?
> > 
> > This is not how stable works. Stable kernels *only* contain patches that
> > are backported from mainline, and do not take standalone patch.
> > 
> > Furthermore, your fix is to actually undo someone else's fix. Who is
> > right? In the absence of any documentation, the answer is "nobody".
> 
> Little more history to this bug.  The code was originally the way it is
> now, but this same bug was fixed in 2013 in https://patchwork.kernel.or
> g/patch/3333681/
> 
> Then that lasted four years until it was changed Aug 2017 in https://pa
> tchwork.kernel.org/patch/9893303/
> 
> That lasted just six months until someone tried to revert it, https://p
> atchwork.kernel.org/patch/9893303/

The last link is the same as the previous one, unless I am missing
something.

> Seems pretty clear the way it is now is much worse than the way it was
> before, even if the previous design may have had another flaw.  Though
> I've yet to see anyone point out something makes the previous design
> broken.  Sub-optimal yes, but not broken.

The way I see it is: either the MSI handling works or it does not.

AFAICS:

8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled,
not before")

was fixing a bug, causing "timeouts on some wireless lan cards", we want
to understand what the problem is, fix it once for all on all DWC
based systems.

> > Anything can be backported to stable once we understand the issue. At
> > the moment, we're just playing games moving stuff around and hope
> > nothing else will break. That's not a sustainable way of maintaining
> > this driver. At the moment, the only patch I'm inclined to propose until
> > we get an actual interrupt handling flow from Synopsys is to mark this
> > driver as "BROKEN".
> 
> It feels like you're using this bug to hold designware hostage in a
> broken kernel, and me along with them.  I don't have the documentation,
> no one does, there's no way for me to give you want you want.  But I've
> got hardware that doesn't work in the mainline kernel.

Nobody is holding anyone hostage here, it is a pretty normal patch
discussion, given the controversial history of fixes you reported
we are just trying to get the whole picture.

There is a bug that ought to be fixed, you are doing the right thing
with the feedback you are providing and DWC maintainers must provide the
information you need to get to the bottom of this, once for all, that's
as simple as that.

Thanks,
Lorenzo
Marc Zyngier Nov. 9, 2018, 11:34 a.m. | #20
On 08/11/18 19:49, Trent Piepho wrote:
> On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote:
>> On 07/11/18 20:17, Trent Piepho wrote:
>>> On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote:
>>>> On 06/11/18 19:40, Trent Piepho wrote:
>>>>>
>>>>> What about stable kernels that don't have the hierarchical API?
>>>>
>>>> My goal is to fix mainline first. Once we have something that works on
>>>> mainline, we can look at propagating the fix to other versions. But
>>>> mainline always comes first.
>>>
>>> This is a regression that went into 4.14.  Wouldn't the appropriate
>>> action for the stable series be to undo the regression?
>>
>> This is not how stable works. Stable kernels *only* contain patches that
>> are backported from mainline, and do not take standalone patch.
>>
>> Furthermore, your fix is to actually undo someone else's fix. Who is
>> right? In the absence of any documentation, the answer is "nobody".
> 
> Little more history to this bug.  The code was originally the way it is
> now, but this same bug was fixed in 2013 in https://patchwork.kernel.or
> g/patch/3333681/
> 
> Then that lasted four years until it was changed Aug 2017 in https://pa
> tchwork.kernel.org/patch/9893303/
> 
> That lasted just six months until someone tried to revert it, https://p
> atchwork.kernel.org/patch/9893303/
> 
> Seems pretty clear the way it is now is much worse than the way it was
> before, even if the previous design may have had another flaw.  Though
> I've yet to see anyone point out something makes the previous design
> broken.  Sub-optimal yes, but not broken.

This is not what was reported by the previous submitter. I guess they
changed this for a reason, no? I'm prepared to admit this is a end-point
driver bug, but we need to understand what is happening and stop
changing this driver randomly.

>> Anything can be backported to stable once we understand the issue. At
>> the moment, we're just playing games moving stuff around and hope
>> nothing else will break. That's not a sustainable way of maintaining
>> this driver. At the moment, the only patch I'm inclined to propose until
>> we get an actual interrupt handling flow from Synopsys is to mark this
>> driver as "BROKEN".
> 
> It feels like you're using this bug to hold designware hostage in a
> broken kernel, and me along with them.  I don't have the documentation,
> no one does, there's no way for me to give you want you want.  But I've
> got hardware that doesn't work in the mainline kernel.

I take it as a personal offence that I'd be holding anything or anyone
hostage. I think I have a long enough track record working with the
Linux kernel not to take any of this nonsense. What's my interest in
keeping anything in this sorry state? Think about it for a minute.

When I'm asking for documentation, I'm not asking you. I perfectly
understood that you don't have any. We need Synopsys to step up and give
us a simple description of what the MSI workflow is. Because no matter
how you randomly mutate this code, it will still be broken until we get
a clear answer from *Synopsys*.

Gustavo, here's one simple ask. Please reply to this email with a step
by step, register by register description of how an MSI must be handled
on this HW. We do need it, and we need it now.

Thanks,

	M.
Vignesh R Nov. 9, 2018, 5:17 p.m. | #21
On 09/11/18 3:43 PM, Lorenzo Pieralisi wrote:
> On Thu, Nov 08, 2018 at 07:49:52PM +0000, Trent Piepho wrote:
>> On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote:
>>> On 07/11/18 20:17, Trent Piepho wrote:
>>>> On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote:
>>>>> On 06/11/18 19:40, Trent Piepho wrote:
>>>>>>
>>>>>> What about stable kernels that don't have the hierarchical API?
>>>>>
>>>>> My goal is to fix mainline first. Once we have something that works on
>>>>> mainline, we can look at propagating the fix to other versions. But
>>>>> mainline always comes first.
>>>>
>>>> This is a regression that went into 4.14.  Wouldn't the appropriate
>>>> action for the stable series be to undo the regression?
>>>
>>> This is not how stable works. Stable kernels *only* contain patches that
>>> are backported from mainline, and do not take standalone patch.
>>>
>>> Furthermore, your fix is to actually undo someone else's fix. Who is
>>> right? In the absence of any documentation, the answer is "nobody".
>>
>> Little more history to this bug.  The code was originally the way it is
>> now, but this same bug was fixed in 2013 in https://patchwork.kernel.or
>> g/patch/3333681/
>>
>> Then that lasted four years until it was changed Aug 2017 in https://pa
>> tchwork.kernel.org/patch/9893303/
>>
>> That lasted just six months until someone tried to revert it, https://p
>> atchwork.kernel.org/patch/9893303/
> 
> The last link is the same as the previous one, unless I am missing
> something.
> 
>> Seems pretty clear the way it is now is much worse than the way it was
>> before, even if the previous design may have had another flaw.  Though
>> I've yet to see anyone point out something makes the previous design
>> broken.  Sub-optimal yes, but not broken.
> 
> The way I see it is: either the MSI handling works or it does not.
> 
> AFAICS:
> 
> 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled,
> not before")
> 
> was fixing a bug, causing "timeouts on some wireless lan cards", we want
> to understand what the problem is, fix it once for all on all DWC
> based systems.
> 

That issue was root caused to be due to a HW errata in dra7xx DWC
wrapper which requires a special way of handling MSI interrupts at
wrapper level. More info in this thread:
https://www.spinics.net/lists/linux-pci/msg70462.html

Unfortunately, commit 8c934095fa2f did not fix WLAN issue in longer
tests and also broke PCIe USB cards. Therefore, it makes sense to revert
8c934095fa2f

I am working on patches fix dra7xx wrapper for WLAN card issue.

Regards
Vignesh

>>> Anything can be backported to stable once we understand the issue. At
>>> the moment, we're just playing games moving stuff around and hope
>>> nothing else will break. That's not a sustainable way of maintaining
>>> this driver. At the moment, the only patch I'm inclined to propose until
>>> we get an actual interrupt handling flow from Synopsys is to mark this
>>> driver as "BROKEN".
>>
>> It feels like you're using this bug to hold designware hostage in a
>> broken kernel, and me along with them.  I don't have the documentation,
>> no one does, there's no way for me to give you want you want.  But I've
>> got hardware that doesn't work in the mainline kernel.
> 
> Nobody is holding anyone hostage here, it is a pretty normal patch
> discussion, given the controversial history of fixes you reported
> we are just trying to get the whole picture.
> 
> There is a bug that ought to be fixed, you are doing the right thing
> with the feedback you are providing and DWC maintainers must provide the
> information you need to get to the bottom of this, once for all, that's
> as simple as that.
> 
> Thanks,
> Lorenzo
>
Trent Piepho Nov. 9, 2018, 6:53 p.m. | #22
On Fri, 2018-11-09 at 11:34 +0000, Marc Zyngier wrote:
> On 08/11/18 19:49, Trent Piepho wrote:
> > On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote:
> > > 
> > Then that lasted four years until it was changed Aug 2017 in https://pa
> > tchwork.kernel.org/patch/9893303/
> > 
> > That lasted just six months until someone tried to revert it, https://p
> > atchwork.kernel.org/patch/9893303/

Should be https://patchwork.kernel.org/patch/10208879/

> > 
> > Seems pretty clear the way it is now is much worse than the way it was
> > before, even if the previous design may have had another flaw.  Though
> > I've yet to see anyone point out something makes the previous design
> > broken.  Sub-optimal yes, but not broken.
> 
> This is not what was reported by the previous submitter. I guess they
> changed this for a reason, no? I'm prepared to admit this is a end-point
> driver bug, but we need to understand what is happening and stop
> changing this driver randomly.

See Vignesh's recent message about the last change.  It was a mistaken
attempt to fix a problem, which it didn't fix, and I think we all agree
it's not right.

Reverting it is not "changing this driver randomly."  And I take that
as a personal offense.  You imply I just applied patches randomly until
something appeared to work.  Maybe you think this is all over my head? 
Far from it.  I traced every part of the interrupt path and thought
through every race.  Hamstrung by lack of docs, I still determined the
behavior that was relevant through empirical analysis.  I only
discovered this was a recent regression and Vignesh's earlier attempt
to revert it after I was done and was trying to determine how the code
got this way in the first place.

> > It feels like you're using this bug to hold designware hostage in a
> > broken kernel, and me along with them.  I don't have the documentation,
> > no one does, there's no way for me to give you want you want.  But I've
> > got hardware that doesn't work in the mainline kernel.
> 
> I take it as a personal offence that I'd be holding anything or anyone
> hostage. I think I have a long enough track record working with the
> Linux kernel not to take any of this nonsense. What's my interest in
> keeping anything in this sorry state? Think about it for a minute.

I'm sorry if you took it that way.  I appreciate that there are still
people who care about fixing things right and don't settle for whatever
the easiest thing is that lets them say they're done, even if that just
leaves time bombs for everyone who comes after.

So I thank you for taking a stand.

But I think it's clear that 8c934095fa2f was a mistake that causes
serious bugs.  That's not a random guess; it's well considered and well
tested.  Not reverting it now isn't helping anyone using stable kernels
with this regression.
Lorenzo Pieralisi Nov. 12, 2018, 4:01 p.m. | #23
On Thu, Nov 08, 2018 at 08:51:38PM +0000, Trent Piepho wrote:
> On Thu, 2018-11-08 at 11:46 +0000, Gustavo Pimentel wrote:
> > On 07/11/2018 18:32, Trent Piepho wrote:
> > > On Wed, 2018-11-07 at 12:57 +0000, Gustavo Pimentel wrote:
> > > > On 06/11/2018 16:00, Marc Zyngier wrote:
> > > > > On 06/11/18 14:53, Lorenzo Pieralisi wrote:
> > > > > > On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote:
> > > > > > > 
> > > > > > > This gives the following race scenario:
> > > > > > > 
> > > > > > > 1.  An MSI is received by, and the status bit for the MSI is set in, the
> > > > > > > DWC PCI-e controller.
> > > > > > > 2.  dw_handle_msi_irq() calls a driver's registered interrupt handler
> > > > > > > for the MSI received.
> > > > > > > 3.  At some point, the interrupt handler must decide, correctly, that
> > > > > > > there is no more work to do and return.
> > > > > > > 4.  The hardware generates a new MSI.  As the MSI's status bit is still
> > > > > > > set, this new MSI is ignored.
> > > > > > > 6.  dw_handle_msi_irq() unsets the MSI status bit.
> > > > > > > 
> > > > > > > The MSI received at point 4 will never be acted upon.  It occurred after
> > > > > > > the driver had finished checking the hardware status for interrupt
> > > > > > > conditions to act on.  Since the MSI status was masked, it does not
> > > > > > > generated a new IRQ, neither when it was received nor when the MSI is
> > > > > > > unmasked.
> > > > > > > 
> > > > This status register indicates whether exists or not a MSI interrupt on that
> > > > controller [0..7] to be handle.
> > > 
> > > While the status for an MSI is set, no new interrupt will be triggered
> > 
> > Yes
> > 
> > > if another identical MSI is received, correct?
> > 
> > You cannot receive another identical MSI till you acknowledge the current one
> > (This is ensured by the PCI protocol, I guess).
> 
> I don't believe this is a requirement of PCI.  We designed our hardware
> to not send another MSI until our hardware's interrupt status register
> is read, but we didn't have to do that.
> 
> > > > In theory, we should clear the interrupt flag only after the interrupt has
> > > > actually handled (which can take some time to process on the worst case scenario).
> > > 
> > > But see above, there is a race if a new MSI arrives while still masked.
> > >  I can see no possible way to solve this in software that does not
> > > involve unmasking the MSI before calling the handler.  To leave the
> > > interrupt masked while calling the handler requires the hardware to
> > > queue an interrupt that arrives while masked.  We have no docs, but the
> > > designware controller doesn't appear to do this in practice.
> > 
> > See my reply to Marc about the interrupt masking. Like you said, probably the
> > solution pass through using interrupt mask/unmask register instead of interrupt
> > enable/disable register.
> > 
> > Can you do a quick test, since you can easily reproduce the issue? Can you
> > change register offset on both functions dw_pci_bottom_mask() and
> > dw_pci_bottom_unmask()?
> > 
> > Basically exchange the PCIE_MSI_INTR0_ENABLE register by PCIE_MSI_INTR0_MASK.
> 
> Of course MSI still need to be enabled to work at all, which happens
> once when the driver using the MSI registers a handler.  Masking can be
> done via mask register after that.

I want to understand from Synopsys maintainers what's the difference
between the ENABLE and MASK registers and *why* the MSI IRQ masking is
currently implemented through the PCIE_MSI_INTR0_ENABLE register instead
of PCIE_MSI_INTR0_MASK.

I am not happy to test and see what's the difference I want to know
what's the difference as-per specifications.

I need a definitive answer on this based on HW specifications before
merging any code.

> It is not so easy for me to test on the newest kernel, as imx7d does
> not work due to yet more bugs.  I have to port a set of patches to each
> new kernel.  A set that does not shrink due to holdups like this.
> 
> I understand the new flow would look like this (assume dw controller
> MSI interrupt output signal is connected to one of the ARM GIC
> interrupt lines, there could be different or more controllers above the
> dwc of course (but usually aren't)):
> 
> 1. MSI arrives, status bit is set in dwc, interrupt raised to GIC.
> 2. dwc handler runs
> 3. dwc handler sees status bit is set for a(n) MSI(s)
> 4. dwc handler sets mask for those MSIs
> 5. dwc handler clears status bit
> 6. dwc handler runs driver handler for the received MSI
> 7. ** an new MSI arrives, racing with 6 **
> 8. status bit becomes set again, but no interrupt is raised due to mask
> 9. dwc handler unmasks MSI, which raises the interrupt to GIC because
> of new MSI received in 7.
> 10. The original GIC interrupt is EOI'ed.
> 11. The interrupt for the dwc is re-raised by the GIC due to 9, and we
> go back to 2.
> 
> It is very important that 5 be done before 6.  Less so 4 before 5, but
> reversing the order there would allow re-raising even if the 2nd MSI
> arrived before the driver handler ran, which is not necessary.
> 
> I do not see a race in this design and it appears correct to me.  But,
> I also do not think there is any immediate improvement due to extra
> steps of masking and unmasking the MSI.
> 
> The reason is that the GIC interrupt above the dwc is non-reentrant. 
> It remains masked (aka active[1]) during this entire process (1 to 10).
>  This means every MSI is effectively already masked.  So masking the
> active MSI(s) a 2nd time gains nothing besides preventing some extra
> edges for a masked interrupt going to the ARM GIC.
> 
> In theory, if the GIC interrupt handler was reentrant, then on receipt
> of a new MSI we could re-enter the dwc handler on a different CPU and
> run the new MSI (a different MSI!) at the same time as the original MSI
> handler is still running.
> 
> There difference here is that by unmasking in the interrupt in the GIC
> before the dwc handler is finished, masking an individual MSI in the
> dwc is no longer a 2nd redundant masking.

There is not (and there must not be) anything in DWC HW that allows
us assume its "interrupt controller" is cascaded to a GIC. It may
correspond to what we are debugging but it is an assumption we must not
make and its driver has to work regardless of what its interrupt
controller is connected to, that's the basis of hierarchical IRQ chips
management, at least from my narrow and (most likely) incomplete
knowledge of the IRQ subsystem.

So, this means that both the masking AND your revert must be
merged in the mainline to make sure we are fixing MSI handling
for good, that's how I sum up your report.

Pending my request above, I am inclined to merge your revert but I would
appreciate testing (tags) from all DWC host bridge maintainers (I am
happy to delay it till -rc3 or -rc4 to that extent), do not take this as
a lack of trust in your testing, it is just that the last thing we want
is asking for reverting the revert given that it is going to stable
kernels and most of all we have to make sure this works for all DWC
derived host bridges.

I *also* expect the patch implementing the MSI masking and I think that
falls within Gustavo's remit.

Thanks,
Lorenzo
Gustavo Pimentel Nov. 12, 2018, 11:45 p.m. | #24
On 08/11/2018 20:51, Trent Piepho wrote:
> On Thu, 2018-11-08 at 11:46 +0000, Gustavo Pimentel wrote:
>> On 07/11/2018 18:32, Trent Piepho wrote:
>>> On Wed, 2018-11-07 at 12:57 +0000, Gustavo Pimentel wrote:
>>>> On 06/11/2018 16:00, Marc Zyngier wrote:
>>>>> On 06/11/18 14:53, Lorenzo Pieralisi wrote:
>>>>>> On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote:
>>>>>>>
>>>>>>> This gives the following race scenario:
>>>>>>>
>>>>>>> 1.  An MSI is received by, and the status bit for the MSI is set in, the
>>>>>>> DWC PCI-e controller.
>>>>>>> 2.  dw_handle_msi_irq() calls a driver's registered interrupt handler
>>>>>>> for the MSI received.
>>>>>>> 3.  At some point, the interrupt handler must decide, correctly, that
>>>>>>> there is no more work to do and return.
>>>>>>> 4.  The hardware generates a new MSI.  As the MSI's status bit is still
>>>>>>> set, this new MSI is ignored.
>>>>>>> 6.  dw_handle_msi_irq() unsets the MSI status bit.
>>>>>>>
>>>>>>> The MSI received at point 4 will never be acted upon.  It occurred after
>>>>>>> the driver had finished checking the hardware status for interrupt
>>>>>>> conditions to act on.  Since the MSI status was masked, it does not
>>>>>>> generated a new IRQ, neither when it was received nor when the MSI is
>>>>>>> unmasked.
>>>>>>>
>>>> This status register indicates whether exists or not a MSI interrupt on that
>>>> controller [0..7] to be handle.
>>>
>>> While the status for an MSI is set, no new interrupt will be triggered
>>
>> Yes
>>
>>> if another identical MSI is received, correct?
>>
>> You cannot receive another identical MSI till you acknowledge the current one
>> (This is ensured by the PCI protocol, I guess).
> 
> I don't believe this is a requirement of PCI.  We designed our hardware
> to not send another MSI until our hardware's interrupt status register
> is read, but we didn't have to do that.
> 
>>>> In theory, we should clear the interrupt flag only after the interrupt has
>>>> actually handled (which can take some time to process on the worst case scenario).
>>>
>>> But see above, there is a race if a new MSI arrives while still masked.
>>>  I can see no possible way to solve this in software that does not
>>> involve unmasking the MSI before calling the handler.  To leave the
>>> interrupt masked while calling the handler requires the hardware to
>>> queue an interrupt that arrives while masked.  We have no docs, but the
>>> designware controller doesn't appear to do this in practice.
>>
>> See my reply to Marc about the interrupt masking. Like you said, probably the
>> solution pass through using interrupt mask/unmask register instead of interrupt
>> enable/disable register.
>>
>> Can you do a quick test, since you can easily reproduce the issue? Can you
>> change register offset on both functions dw_pci_bottom_mask() and
>> dw_pci_bottom_unmask()?
>>
>> Basically exchange the PCIE_MSI_INTR0_ENABLE register by PCIE_MSI_INTR0_MASK.
> 
> Of course MSI still need to be enabled to work at all, which happens
> once when the driver using the MSI registers a handler.  Masking can be
> done via mask register after that.
> 
Correct, I was asking to switch only on the functions mentioned that are called
after the dw_pcie_setup_rc() that enables the interrupts.

> It is not so easy for me to test on the newest kernel, as imx7d does
> not work due to yet more bugs.  I have to port a set of patches to each
> new kernel.  A set that does not shrink due to holdups like this.

Ok, I've to try to replicate this scenario of loss of interruptions so that I
can do something about it. Till now this never happen before.

> 
> I understand the new flow would look like this (assume dw controller
> MSI interrupt output signal is connected to one of the ARM GIC
> interrupt lines, there could be different or more controllers above the
> dwc of course (but usually aren't)):
> 
> 1. MSI arrives, status bit is set in dwc, interrupt raised to GIC.
> 2. dwc handler runs
> 3. dwc handler sees status bit is set for a(n) MSI(s)
> 4. dwc handler sets mask for those MSIs
> 5. dwc handler clears status bit
> 6. dwc handler runs driver handler for the received MSI
> 7. ** an new MSI arrives, racing with 6 **
> 8. status bit becomes set again, but no interrupt is raised due to mask
> 9. dwc handler unmasks MSI, which raises the interrupt to GIC because
> of new MSI received in 7.
> 10. The original GIC interrupt is EOI'ed.
> 11. The interrupt for the dwc is re-raised by the GIC due to 9, and we
> go back to 2.
> 
> It is very important that 5 be done before 6.  Less so 4 before 5, but
> reversing the order there would allow re-raising even if the 2nd MSI
> arrived before the driver handler ran, which is not necessary.
> 
> I do not see a race in this design and it appears correct to me.  But,
> I also do not think there is any immediate improvement due to extra
> steps of masking and unmasking the MSI.
> 
> The reason is that the GIC interrupt above the dwc is non-reentrant. 
> It remains masked (aka active[1]) during this entire process (1 to 10).
>  This means every MSI is effectively already masked.  So masking the
> active MSI(s) a 2nd time gains nothing besides preventing some extra
> edges for a masked interrupt going to the ARM GIC.
> 
> In theory, if the GIC interrupt handler was reentrant, then on receipt
> of a new MSI we could re-enter the dwc handler on a different CPU and
> run the new MSI (a different MSI!) at the same time as the original MSI
> handler is still running.
> 
> There difference here is that by unmasking in the interrupt in the GIC
> before the dwc handler is finished, masking an individual MSI in the
> dwc is no longer a 2nd redundant masking.
> 
> 
> [1] When I say masked in GIC, I mean the interrupt is in the "active"
> or "active and pending" states.  In these states the interrupt will not
> be raised to the CPU and can be considered masked.
>
Gustavo Pimentel Nov. 13, 2018, 12:41 a.m. | #25
On 09/11/2018 11:34, Marc Zyngier wrote:
> On 08/11/18 19:49, Trent Piepho wrote:
>> On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote:
>>> On 07/11/18 20:17, Trent Piepho wrote:
>>>> On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote:
>>>>> On 06/11/18 19:40, Trent Piepho wrote:
>>>>>>
>>>>>> What about stable kernels that don't have the hierarchical API?
>>>>>
>>>>> My goal is to fix mainline first. Once we have something that works on
>>>>> mainline, we can look at propagating the fix to other versions. But
>>>>> mainline always comes first.
>>>>
>>>> This is a regression that went into 4.14.  Wouldn't the appropriate
>>>> action for the stable series be to undo the regression?
>>>
>>> This is not how stable works. Stable kernels *only* contain patches that
>>> are backported from mainline, and do not take standalone patch.
>>>
>>> Furthermore, your fix is to actually undo someone else's fix. Who is
>>> right? In the absence of any documentation, the answer is "nobody".
>>
>> Little more history to this bug.  The code was originally the way it is
>> now, but this same bug was fixed in 2013 in https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.or&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=3nFZMUc2qvOXJht1WuNCDorudHeFiWyeDaO1UbhTXmw&e=
>> g/patch/3333681/
>>
>> Then that lasted four years until it was changed Aug 2017 in https://urldefense.proofpoint.com/v2/url?u=https-3A__pa&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=yfWJAoFIGjLbTnuE_KLzhW7J_pyMD3X1dXm4eHoy7_o&e=
>> tchwork.kernel.org/patch/9893303/
>>
>> That lasted just six months until someone tried to revert it, https://urldefense.proofpoint.com/v2/url?u=https-3A__p&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=q1LGEb9w_CjcwmWVFOM5VWJ17oS9XHd8SsWTsBF8zfU&e=
>> atchwork.kernel.org/patch/9893303/
>>
>> Seems pretty clear the way it is now is much worse than the way it was
>> before, even if the previous design may have had another flaw.  Though
>> I've yet to see anyone point out something makes the previous design
>> broken.  Sub-optimal yes, but not broken.
> 
> This is not what was reported by the previous submitter. I guess they
> changed this for a reason, no? I'm prepared to admit this is a end-point
> driver bug, but we need to understand what is happening and stop
> changing this driver randomly.
> 
>>> Anything can be backported to stable once we understand the issue. At
>>> the moment, we're just playing games moving stuff around and hope
>>> nothing else will break. That's not a sustainable way of maintaining
>>> this driver. At the moment, the only patch I'm inclined to propose until
>>> we get an actual interrupt handling flow from Synopsys is to mark this
>>> driver as "BROKEN".
>>
>> It feels like you're using this bug to hold designware hostage in a
>> broken kernel, and me along with them.  I don't have the documentation,
>> no one does, there's no way for me to give you want you want.  But I've
>> got hardware that doesn't work in the mainline kernel.
> 
> I take it as a personal offence that I'd be holding anything or anyone
> hostage. I think I have a long enough track record working with the
> Linux kernel not to take any of this nonsense. What's my interest in
> keeping anything in this sorry state? Think about it for a minute.
> 
> When I'm asking for documentation, I'm not asking you. I perfectly
> understood that you don't have any. We need Synopsys to step up and give
> us a simple description of what the MSI workflow is. Because no matter
> how you randomly mutate this code, it will still be broken until we get
> a clear answer from *Synopsys*.
> 
> Gustavo, here's one simple ask. Please reply to this email with a step
> by step, register by register description of how an MSI must be handled
> on this HW. We do need it, and we need it now.

Hi Marc, I thought that I did respond to your question on Nov 8. However I will
repeat it and add some extra information to it now.

About the registers:

PCIE_MSI_INTR0_MASK
When an MSI is received for a masked interrupt, the corresponding status bit
gets set in the interrupt status register but the controller will not signal it.
As soon as the masked interrupt is unmasked and assuming the status bit is still
set the controller will signal it.

PCIE_MSI_INTR0_ENABLE
Enables/disables every interrupt. When an MSI is received from a disabled
interrupt, no status bit gets set in MSI controller interrupt status register.

About this new callbacks:

The dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks were added on
Linux kernel v4.17 on PCI: dwc: Move MSI IRQs allocation to IRQ domains
hierarchical API patch [1] and based on what I've seen so far, before this patch
there were no interrupt masking been performed.

Based on the information provided, its very likely that I should use the
PCIE_MSI_INTR0_MASK register instead of PCIE_MSI_INTR0_ENABLE on the
dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks. As soon as I return
from Linux plumbers conference, I will test the system using the
PCIE_MSI_INTR0_MASK register.

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.20-rc2&id=7c5925afbc58c6d6b384e1dc051bb992969bf787

Is there anything else that has not been clarified yet?

Regards,
Gustavo

> 
> Thanks,
> 
> 	M.
>
Gustavo Pimentel Nov. 13, 2018, 1:03 a.m. | #26
On 12/11/2018 16:01, Lorenzo Pieralisi wrote:
> On Thu, Nov 08, 2018 at 08:51:38PM +0000, Trent Piepho wrote:
>> On Thu, 2018-11-08 at 11:46 +0000, Gustavo Pimentel wrote:
>>> On 07/11/2018 18:32, Trent Piepho wrote:
>>>> On Wed, 2018-11-07 at 12:57 +0000, Gustavo Pimentel wrote:
>>>>> On 06/11/2018 16:00, Marc Zyngier wrote:
>>>>>> On 06/11/18 14:53, Lorenzo Pieralisi wrote:
>>>>>>> On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote:
>>>>>>>>
>>>>>>>> This gives the following race scenario:
>>>>>>>>
>>>>>>>> 1.  An MSI is received by, and the status bit for the MSI is set in, the
>>>>>>>> DWC PCI-e controller.
>>>>>>>> 2.  dw_handle_msi_irq() calls a driver's registered interrupt handler
>>>>>>>> for the MSI received.
>>>>>>>> 3.  At some point, the interrupt handler must decide, correctly, that
>>>>>>>> there is no more work to do and return.
>>>>>>>> 4.  The hardware generates a new MSI.  As the MSI's status bit is still
>>>>>>>> set, this new MSI is ignored.
>>>>>>>> 6.  dw_handle_msi_irq() unsets the MSI status bit.
>>>>>>>>
>>>>>>>> The MSI received at point 4 will never be acted upon.  It occurred after
>>>>>>>> the driver had finished checking the hardware status for interrupt
>>>>>>>> conditions to act on.  Since the MSI status was masked, it does not
>>>>>>>> generated a new IRQ, neither when it was received nor when the MSI is
>>>>>>>> unmasked.
>>>>>>>>
>>>>> This status register indicates whether exists or not a MSI interrupt on that
>>>>> controller [0..7] to be handle.
>>>>
>>>> While the status for an MSI is set, no new interrupt will be triggered
>>>
>>> Yes
>>>
>>>> if another identical MSI is received, correct?
>>>
>>> You cannot receive another identical MSI till you acknowledge the current one
>>> (This is ensured by the PCI protocol, I guess).
>>
>> I don't believe this is a requirement of PCI.  We designed our hardware
>> to not send another MSI until our hardware's interrupt status register
>> is read, but we didn't have to do that.
>>
>>>>> In theory, we should clear the interrupt flag only after the interrupt has
>>>>> actually handled (which can take some time to process on the worst case scenario).
>>>>
>>>> But see above, there is a race if a new MSI arrives while still masked.
>>>>  I can see no possible way to solve this in software that does not
>>>> involve unmasking the MSI before calling the handler.  To leave the
>>>> interrupt masked while calling the handler requires the hardware to
>>>> queue an interrupt that arrives while masked.  We have no docs, but the
>>>> designware controller doesn't appear to do this in practice.
>>>
>>> See my reply to Marc about the interrupt masking. Like you said, probably the
>>> solution pass through using interrupt mask/unmask register instead of interrupt
>>> enable/disable register.
>>>
>>> Can you do a quick test, since you can easily reproduce the issue? Can you
>>> change register offset on both functions dw_pci_bottom_mask() and
>>> dw_pci_bottom_unmask()?
>>>
>>> Basically exchange the PCIE_MSI_INTR0_ENABLE register by PCIE_MSI_INTR0_MASK.
>>
>> Of course MSI still need to be enabled to work at all, which happens
>> once when the driver using the MSI registers a handler.  Masking can be
>> done via mask register after that.
> 
> I want to understand from Synopsys maintainers what's the difference
> between the ENABLE and MASK registers and *why* the MSI IRQ masking is
> currently implemented through the PCIE_MSI_INTR0_ENABLE register instead
> of PCIE_MSI_INTR0_MASK.

I've already reply to Marc with that information twice.

> 
> I am not happy to test and see what's the difference I want to know
> what's the difference as-per specifications.

Of course. I only asked if it was possible to test the register change, because
the interruption loss (I got the impression of that) was deterministic on that
platform.

Based on the information provided to Marc, the correct register should be
PCIE_MSI_INTR0_MASK instead of PCIE_MSI_INTR0_ENABLE, however I must replicate
this "interruption loss" (I'm still thinking how...) and test any possible fix
before doing anything else.

> 
> I need a definitive answer on this based on HW specifications before
> merging any code.
> 
>> It is not so easy for me to test on the newest kernel, as imx7d does
>> not work due to yet more bugs.  I have to port a set of patches to each
>> new kernel.  A set that does not shrink due to holdups like this.
>>
>> I understand the new flow would look like this (assume dw controller
>> MSI interrupt output signal is connected to one of the ARM GIC
>> interrupt lines, there could be different or more controllers above the
>> dwc of course (but usually aren't)):
>>
>> 1. MSI arrives, status bit is set in dwc, interrupt raised to GIC.
>> 2. dwc handler runs
>> 3. dwc handler sees status bit is set for a(n) MSI(s)
>> 4. dwc handler sets mask for those MSIs
>> 5. dwc handler clears status bit
>> 6. dwc handler runs driver handler for the received MSI
>> 7. ** an new MSI arrives, racing with 6 **
>> 8. status bit becomes set again, but no interrupt is raised due to mask
>> 9. dwc handler unmasks MSI, which raises the interrupt to GIC because
>> of new MSI received in 7.
>> 10. The original GIC interrupt is EOI'ed.
>> 11. The interrupt for the dwc is re-raised by the GIC due to 9, and we
>> go back to 2.
>>
>> It is very important that 5 be done before 6.  Less so 4 before 5, but
>> reversing the order there would allow re-raising even if the 2nd MSI
>> arrived before the driver handler ran, which is not necessary.
>>
>> I do not see a race in this design and it appears correct to me.  But,
>> I also do not think there is any immediate improvement due to extra
>> steps of masking and unmasking the MSI.
>>
>> The reason is that the GIC interrupt above the dwc is non-reentrant. 
>> It remains masked (aka active[1]) during this entire process (1 to 10).
>>  This means every MSI is effectively already masked.  So masking the
>> active MSI(s) a 2nd time gains nothing besides preventing some extra
>> edges for a masked interrupt going to the ARM GIC.
>>
>> In theory, if the GIC interrupt handler was reentrant, then on receipt
>> of a new MSI we could re-enter the dwc handler on a different CPU and
>> run the new MSI (a different MSI!) at the same time as the original MSI
>> handler is still running.
>>
>> There difference here is that by unmasking in the interrupt in the GIC
>> before the dwc handler is finished, masking an individual MSI in the
>> dwc is no longer a 2nd redundant masking.
> 
> There is not (and there must not be) anything in DWC HW that allows
> us assume its "interrupt controller" is cascaded to a GIC. It may
> correspond to what we are debugging but it is an assumption we must not
> make and its driver has to work regardless of what its interrupt
> controller is connected to, that's the basis of hierarchical IRQ chips
> management, at least from my narrow and (most likely) incomplete
> knowledge of the IRQ subsystem.
> 
> So, this means that both the masking AND your revert must be
> merged in the mainline to make sure we are fixing MSI handling
> for good, that's how I sum up your report.
> 
> Pending my request above, I am inclined to merge your revert but I would
> appreciate testing (tags) from all DWC host bridge maintainers (I am
> happy to delay it till -rc3 or -rc4 to that extent), do not take this as
> a lack of trust in your testing, it is just that the last thing we want
> is asking for reverting the revert given that it is going to stable
> kernels and most of all we have to make sure this works for all DWC
> derived host bridges.

Agree. Any change on this have a big impact on all DWC host and should be tested
by all.

> 
> I *also* expect the patch implementing the MSI masking and I think that
> falls within Gustavo's remit.

Yes.

Regards,
Gustavo

> 
> Thanks,
> Lorenzo
>
Trent Piepho Nov. 13, 2018, 1:18 a.m. | #27
On Tue, 2018-11-13 at 00:41 +0000, Gustavo Pimentel wrote:
> On 09/11/2018 11:34, Marc Zyngier wrote:
> > 
> > Gustavo, here's one simple ask. Please reply to this email with a step
> > by step, register by register description of how an MSI must be handled
> > on this HW. We do need it, and we need it now.
> 
> Hi Marc, I thought that I did respond to your question on Nov 8. However I will
> repeat it and add some extra information to it now.
> 
> About the registers:
> 
> PCIE_MSI_INTR0_MASK
> When an MSI is received for a masked interrupt, the corresponding status bit
> gets set in the interrupt status register but the controller will not signal it.
> As soon as the masked interrupt is unmasked and assuming the status bit is still
> set the controller will signal it.
> 
> PCIE_MSI_INTR0_ENABLE
> Enables/disables every interrupt. When an MSI is received from a disabled
> interrupt, no status bit gets set in MSI controller interrupt status register.

If the MSI is unmasked and the status bit is still set, *not* from a
new MSI but rather from one that arrived before the MSI was masked,
does the controller still signal it?

I would suspect the answer is no: only a new MSI will cause the
interrupt be signaled on unmask.  And then only if the status bit has
not been cleared before the unmask (as you stated already).

But for that to be the case, there must be another bit of interrupt
status (i.e., "pending") that we don't know about yet.  The bit in the
status register alone is not enough.

This pending bit would need to be set on a 0->1 transition of the
status bit.  An MSI arriving while the status bit is already 1 does not
trigger a new interrupt.

> About this new callbacks:
> 
> The dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks were added on
> Linux kernel v4.17 on PCI: dwc: Move MSI IRQs allocation to IRQ domains
> hierarchical API patch [1] and based on what I've seen so far, before this patch
> there were no interrupt masking been performed.

Yes.  Or rather, the status bit being set effectively masks the
interrupt.

> Based on the information provided, its very likely that I should use the
> PCIE_MSI_INTR0_MASK register instead of PCIE_MSI_INTR0_ENABLE on the
> dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks. As soon as I return
> from Linux plumbers conference, I will test the system using the
> PCIE_MSI_INTR0_MASK register.

Using enable to mask the interrupt is broken.  It will allow an MSI to
be lost if it arrives while the MSI in not enabled.  It is impossible
to prevent that from racing against the pci device driver's final check
that no MSI-signaled condition in the hardware is present.
Lorenzo Pieralisi Nov. 13, 2018, 10:36 a.m. | #28
On Tue, Nov 13, 2018 at 01:18:41AM +0000, Trent Piepho wrote:

[...]

> > Based on the information provided, its very likely that I should use the
> > PCIE_MSI_INTR0_MASK register instead of PCIE_MSI_INTR0_ENABLE on the
> > dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks. As soon as I return
> > from Linux plumbers conference, I will test the system using the
> > PCIE_MSI_INTR0_MASK register.
> 
> Using enable to mask the interrupt is broken.  It will allow an MSI to
> be lost if it arrives while the MSI in not enabled.  It is impossible
> to prevent that from racing against the pci device driver's final check
> that no MSI-signaled condition in the hardware is present.

That's exactly why I raised the point. I would like to understand if
this means that this revert is not enough to fix the underlying issue
or we need this revert AND correct mask/unmask operations.

Thanks,
Lorenzo
Marc Zyngier Nov. 13, 2018, 2:40 p.m. | #29
On Tue, 13 Nov 2018 00:41:24 +0000,
Gustavo Pimentel <gustavo.pimentel@synopsys.com> wrote:
> 
> On 09/11/2018 11:34, Marc Zyngier wrote:
> > On 08/11/18 19:49, Trent Piepho wrote:
> >> On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote:
> >>> On 07/11/18 20:17, Trent Piepho wrote:
> >>>> On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote:
> >>>>> On 06/11/18 19:40, Trent Piepho wrote:
> >>>>>>
> >>>>>> What about stable kernels that don't have the hierarchical API?
> >>>>>
> >>>>> My goal is to fix mainline first. Once we have something that works on
> >>>>> mainline, we can look at propagating the fix to other versions. But
> >>>>> mainline always comes first.
> >>>>
> >>>> This is a regression that went into 4.14.  Wouldn't the appropriate
> >>>> action for the stable series be to undo the regression?
> >>>
> >>> This is not how stable works. Stable kernels *only* contain patches that
> >>> are backported from mainline, and do not take standalone patch.
> >>>
> >>> Furthermore, your fix is to actually undo someone else's fix. Who is
> >>> right? In the absence of any documentation, the answer is "nobody".
> >>
> >> Little more history to this bug.  The code was originally the way it is
> >> now, but this same bug was fixed in 2013 in https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.or&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=3nFZMUc2qvOXJht1WuNCDorudHeFiWyeDaO1UbhTXmw&e=
> >> g/patch/3333681/
> >>
> >> Then that lasted four years until it was changed Aug 2017 in https://urldefense.proofpoint.com/v2/url?u=https-3A__pa&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=yfWJAoFIGjLbTnuE_KLzhW7J_pyMD3X1dXm4eHoy7_o&e=
> >> tchwork.kernel.org/patch/9893303/
> >>
> >> That lasted just six months until someone tried to revert it, https://urldefense.proofpoint.com/v2/url?u=https-3A__p&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=q1LGEb9w_CjcwmWVFOM5VWJ17oS9XHd8SsWTsBF8zfU&e=
> >> atchwork.kernel.org/patch/9893303/
> >>
> >> Seems pretty clear the way it is now is much worse than the way it was
> >> before, even if the previous design may have had another flaw.  Though
> >> I've yet to see anyone point out something makes the previous design
> >> broken.  Sub-optimal yes, but not broken.
> > 
> > This is not what was reported by the previous submitter. I guess they
> > changed this for a reason, no? I'm prepared to admit this is a end-point
> > driver bug, but we need to understand what is happening and stop
> > changing this driver randomly.
> > 
> >>> Anything can be backported to stable once we understand the issue. At
> >>> the moment, we're just playing games moving stuff around and hope
> >>> nothing else will break. That's not a sustainable way of maintaining
> >>> this driver. At the moment, the only patch I'm inclined to propose until
> >>> we get an actual interrupt handling flow from Synopsys is to mark this
> >>> driver as "BROKEN".
> >>
> >> It feels like you're using this bug to hold designware hostage in a
> >> broken kernel, and me along with them.  I don't have the documentation,
> >> no one does, there's no way for me to give you want you want.  But I've
> >> got hardware that doesn't work in the mainline kernel.
> > 
> > I take it as a personal offence that I'd be holding anything or anyone
> > hostage. I think I have a long enough track record working with the
> > Linux kernel not to take any of this nonsense. What's my interest in
> > keeping anything in this sorry state? Think about it for a minute.
> > 
> > When I'm asking for documentation, I'm not asking you. I perfectly
> > understood that you don't have any. We need Synopsys to step up and give
> > us a simple description of what the MSI workflow is. Because no matter
> > how you randomly mutate this code, it will still be broken until we get
> > a clear answer from *Synopsys*.
> > 
> > Gustavo, here's one simple ask. Please reply to this email with a step
> > by step, register by register description of how an MSI must be handled
> > on this HW. We do need it, and we need it now.
> 
> Hi Marc, I thought that I did respond to your question on Nov 8. However I will
> repeat it and add some extra information to it now.
> 
> About the registers:
> 
> PCIE_MSI_INTR0_MASK
> When an MSI is received for a masked interrupt, the corresponding status bit
> gets set in the interrupt status register but the controller will not signal it.
> As soon as the masked interrupt is unmasked and assuming the status bit is still
> set the controller will signal it.

So this is *really* a mask.

> PCIE_MSI_INTR0_ENABLE
> Enables/disables every interrupt. When an MSI is received from a disabled
> interrupt, no status bit gets set in MSI controller interrupt status register.

And that's not a mask at all. This allows the interrupt to be simply
lost. It is just a bit worrying that this is exactly what the DWC
driver is using, and faces all kind of interrupt loss when
enabling/disabling interrupts (which do have a mask/unmask semantic --
yes, the language is confusing).


> 
> About this new callbacks:
> 
> The dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks were added on
> Linux kernel v4.17 on PCI: dwc: Move MSI IRQs allocation to IRQ domains
> hierarchical API patch [1] and based on what I've seen so far, before this patch
> there were no interrupt masking been performed.
> 
> Based on the information provided, its very likely that I should use the
> PCIE_MSI_INTR0_MASK register instead of PCIE_MSI_INTR0_ENABLE on the
> dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks. As soon as I return
> from Linux plumbers conference, I will test the system using the
> PCIE_MSI_INTR0_MASK register.

I'm at Plumbers as well, so feel free to grab me and we'll sort this
out. We've wasted enough time on this broken driver.

Thanks,

	M.
Trent Piepho Nov. 13, 2018, 6:55 p.m. | #30
On Tue, 2018-11-13 at 10:36 +0000, Lorenzo Pieralisi wrote:
> On Tue, Nov 13, 2018 at 01:18:41AM +0000, Trent Piepho wrote:
> 
> [...]
> 
> > > Based on the information provided, its very likely that I should
> > > use the
> > > PCIE_MSI_INTR0_MASK register instead of PCIE_MSI_INTR0_ENABLE on
> > > the
> > > dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks. As
> > > soon as I return
> > > from Linux plumbers conference, I will test the system using the
> > > PCIE_MSI_INTR0_MASK register.
> > 
> > Using enable to mask the interrupt is broken.  It will allow an MSI
> > to
> > be lost if it arrives while the MSI in not enabled.  It is
> > impossible
> > to prevent that from racing against the pci device driver's final
> > check
> > that no MSI-signaled condition in the hardware is present.
> 
> That's exactly why I raised the point. I would like to understand if
> this means that this revert is not enough to fix the underlying issue
> or we need this revert AND correct mask/unmask operations.

Looks like _another_ bug added to the driver, this time in 4.17.
Trent Piepho Nov. 14, 2018, 9:29 p.m. | #31
On Mon, 2018-11-12 at 16:01 +0000, Lorenzo Pieralisi wrote:
> On Thu, Nov 08, 2018 at 08:51:38PM +0000, Trent Piepho wrote:
> > 
> > The reason is that the GIC interrupt above the dwc is non-reentrant. 
> > It remains masked (aka active[1]) during this entire process (1 to 10).
> >  This means every MSI is effectively already masked.  So masking the
> > active MSI(s) a 2nd time gains nothing besides preventing some extra
> > edges for a masked interrupt going to the ARM GIC.
> > 
> > In theory, if the GIC interrupt handler was reentrant, then on receipt
> > of a new MSI we could re-enter the dwc handler on a different CPU and
> > run the new MSI (a different MSI!) at the same time as the original MSI
> > handler is still running.
> > 
> > There difference here is that by unmasking in the interrupt in the GIC
> > before the dwc handler is finished, masking an individual MSI in the
> > dwc is no longer a 2nd redundant masking.
> 
> There is not (and there must not be) anything in DWC HW that allows
> us assume its "interrupt controller" is cascaded to a GIC. It may
> correspond to what we are debugging but it is an assumption we must not
> make and its driver has to work regardless of what its interrupt
> controller is connected to, that's the basis of hierarchical IRQ chips
> management, at least from my narrow and (most likely) incomplete
> knowledge of the IRQ subsystem.

If you mean we can't assume the dwc is cascaded from specifically a
GIC, totally agree.

If you mean we can't assume the dwc is cascaded from "something", then
I disagree.  The driver can't possibly be at the top level.  How would
the cpu ever enter it?

What I think really is what matters here is if the the chain the dwc is
cascaded from allows the dwc to be called reentrantly or not.

On my system, the code base I inspected does not allow that.  I don't
know of any written Linux interrupt policy that says this must be the
case.  If it is not the case, then I see other points in the dwc driver
that are potentially racy.

Patch

diff --git a/drivers/pci/controller/dwc/pcie-designware-host.c b/drivers/pci/controller/dwc/pcie-designware-host.c
index 29a05759a294..9a3960c95ad3 100644
--- a/drivers/pci/controller/dwc/pcie-designware-host.c
+++ b/drivers/pci/controller/dwc/pcie-designware-host.c
@@ -98,10 +98,10 @@  irqreturn_t dw_handle_msi_irq(struct pcie_port *pp)
 			irq = irq_find_mapping(pp->irq_domain,
 					       (i * MAX_MSI_IRQS_PER_CTRL) +
 					       pos);
-			generic_handle_irq(irq);
 			dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_STATUS +
 						(i * MSI_REG_CTRL_BLOCK_SIZE),
 					    4, 1 << pos);
+			generic_handle_irq(irq);
 			pos++;
 		}
 	}