Patchwork [v2,2/7] PCI: don't touch enable_cnt in pci_device_shutdown()

Submitter Konstantin Khlebnikov
Date Feb. 4, 2013, 11:55 a.m.
Message ID <20130204115557.5569.9748.stgit@zurg>
Permalink /patch/217887/
State Accepted

Comments

Konstantin Khlebnikov - Feb. 4, 2013, 11:55 a.m.
The comment in commit b566a22c23327f18ce941ffad0ca907e50a53d41
("PCI: disable Bus Master on PCI device shutdown") says:
| Disable Bus Master bit on the device in pci_device_shutdown() to ensure PCI
| devices do not continue to DMA data after shutdown.  This can cause memory
| corruption in case of a kexec where the current kernel shuts down and
| transfers control to a new kernel while a PCI device continues to DMA to
| memory that does not belong to it any more in the new kernel.

It seems pci_clear_master() must be used here instead of pci_disable_device(),
because it disables Bus Master unconditionally and doesn't change enable_cnt.

Matthew Garrett and Alan Cox said (see LKML link below) that clearing bus-master
for all PCI devices may lead to unpredictable consequences: some devices ignore
this bit and continue DMA, and some of them hang after that or crash the whole
system. Probably we should leave only a warning here and disable bus-mastering
for each driver individually in its ->shutdown() callback.

Link: https://lkml.org/lkml/2012/6/6/278
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Khalid Aziz <khalid.aziz@hp.com>
Cc: linux-pci@vger.kernel.org
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
---
 drivers/pci/pci-driver.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Khalid Aziz - Feb. 4, 2013, 10:20 p.m.
On Mon, 2013-02-04 at 15:55 +0400, Konstantin Khlebnikov wrote:
> Matthew Garrett and Alan Cox said (see LKML link below) that clearing bus-master
> for all PCI devices may lead to unpredictable consequences, some devices ignores
> this bit and continues DMA, some of them hang after that or crash whole system.
> Probably we should leave here only warning and disable bus-mastering for each
> driver individually in ->shutdown() callback.

Agreed that the right place for shutting down a PCI device properly and
clearing its Bus Master bit, is the driver shutdown routine, if only all
drivers supplied a shutdown routine. As it is today, there are too many
drivers that do not provide a shutdown routine, ata_piix, Marvell SATA
driver, ATI AGP driver just to name a few among a large number of them.
Yet kexec is expected to work in spite of these drivers especially since
kdump depends on it. So until all PCI drivers supply a shutdown routine,
this is just a band-aid to disable interrupt and Bus Master bit in
pci_device_shutdown(). Most drivers do seem to supply a suspend and
resume function and it was discussed many years ago if it is feasible to
use the suspend() routine for drivers to shut devices down cleanly.
Maybe it is time to revisit that discussion.

>Cc: Khalid Aziz <khalid.aziz@hp.com>

Please update this to khalid@gonehiking.org

--
Khalid


Bjorn Helgaas - Feb. 4, 2013, 11:13 p.m.
On Mon, Feb 4, 2013 at 3:20 PM, Khalid Aziz <khalid@gonehiking.org> wrote:
> On Mon, 2013-02-04 at 15:55 +0400, Konstantin Khlebnikov wrote:
>> Matthew Garrett and Alan Cox said (see LKML link below) that clearing bus-master
>> for all PCI devices may lead to unpredictable consequences, some devices ignores
>> this bit and continues DMA, some of them hang after that or crash whole system.
>> Probably we should leave here only warning and disable bus-mastering for each
>> driver individually in ->shutdown() callback.
>
> Agreed that the right place for shutting down a PCI device properly and
> clearing its Bus Master bit, is the driver shutdown routine, if only all
> drivers supplied a shutdown routine. As it is today, there are too many
> drivers that do not provide a shutdown routine, ata_piix, Marvell SATA
> driver, ATI AGP driver just to name a few among a large number of them.
> Yet kexec is expected to work inspite of these drivers especially since
> kdump depends on it. So until all PCI drivers supply a shutdown routine,
> this is just a band-aid to disable interrupt and Bus Master bit in
> pci_device_shutdown(). Most drivers do seem to supply a suspend and
> resume function and it was discussed many years ago if it is feasible to
> use the suspend() routine for drivers to shut devices down cleanly.
> Maybe it is time to revisit that discussion.

This patch as posted doesn't do anything with IRQs.  It only clears
PCI_COMMAND_MASTER.

I'm open to considering something with IRQs, but I don't understand
exactly what we should do.  In your response to the previous version
(https://lkml.org/lkml/2013/1/28/720) you suggested this:

  pci_clear_master(pci_dev);
  pcibios_disable_device(pci_dev);

Did you figure out specifically why pcibios_disable_device() helps?
Using pcibios_disable_device() doesn't seem like the ideal solution
because on most architectures, it is an empty function with no obvious
connection to IRQs.  On x86 with ACPI, it cleans up some ACPI PCI IRQ
stuff, but as far as I can tell, it doesn't actually touch the PCI
device itself or even the IOAPIC to which it's connected, so I'm not
sure how this would help kexec.

Bjorn
Khalid Aziz - Feb. 5, 2013, 3:28 p.m.
On Mon, 2013-02-04 at 16:13 -0700, Bjorn Helgaas wrote:
> On Mon, Feb 4, 2013 at 3:20 PM, Khalid Aziz <khalid@gonehiking.org> wrote:
> > On Mon, 2013-02-04 at 15:55 +0400, Konstantin Khlebnikov wrote:
> >> Matthew Garrett and Alan Cox said (see LKML link below) that clearing bus-master
> >> for all PCI devices may lead to unpredictable consequences, some devices ignores
> >> this bit and continues DMA, some of them hang after that or crash whole system.
> >> Probably we should leave here only warning and disable bus-mastering for each
> >> driver individually in ->shutdown() callback.
> >
> > Agreed that the right place for shutting down a PCI device properly and
> > clearing its Bus Master bit, is the driver shutdown routine, if only all
> > drivers supplied a shutdown routine. As it is today, there are too many
> > drivers that do not provide a shutdown routine, ata_piix, Marvell SATA
> > driver, ATI AGP driver just to name a few among a large number of them.
> > Yet kexec is expected to work inspite of these drivers especially since
> > kdump depends on it. So until all PCI drivers supply a shutdown routine,
> > this is just a band-aid to disable interrupt and Bus Master bit in
> > pci_device_shutdown(). Most drivers do seem to supply a suspend and
> > resume function and it was discussed many years ago if it is feasible to
> > use the suspend() routine for drivers to shut devices down cleanly.
> > Maybe it is time to revisit that discussion.
> 
> This patch as posted doesn't do anything with IRQs.  It only clears
> PCI_COMMAND_MASTER.
> 
> I'm open to considering something with IRQs, but I don't understand
> exactly what we should do.  In your response to the previous version
> (https://lkml.org/lkml/2013/1/28/720) you suggested this:
> 
>   pci_clear_master(pci_dev);
>   pcibios_disable_device(pci_dev);
> 
> Did you figure out specifically why pcibios_disable_device() helps?
> Using pcibios_disable_device() doesn't seem like the ideal solution
> because on most architectures, it is an empty function with no obvious
> connection to IRQs.  On x86 with ACPI, it cleans up some ACPI PCI IRQ
> stuff, but as far as I can tell, it doesn't actually touch the PCI
> device itself or even the IOAPIC to which it's connected, so I'm not
> sure how this would help kexec.
> 
> Bjorn

Hi Bjorn,

My reading of the code was that pcibios_disable_device() does clear the
interrupt on x86 and ia64. I am not deeply familiar with the ACPI code
and I might be interpreting it incorrectly, so please do correct me if I
am reading it incorrectly. Here is the code sequence I see:

pcibios_disable_device() ->
   pcibios_disable_irq() ->
       acpi_pci_irq_disable() -> 
           acpi_pci_link_free_irq() ->
              acpi_evaluate_object(link->device->handle, "_DIS", NULL, NULL);

My understanding is the evaluation of ACPI _DIS method will disable the
interrupt from the device. Does that sound reasonable?

The problem this code attempts to solve is I/O devices continuing to be
active as we start to boot a kexec'd kernel. That activity can come from
DMA (has been seen with NICs for sure, but can happen from a SATA/IDE
controller as well when a pending read completes). When DMA activity
overwrites a section of memory in use by the new kexec'd kernel, it
takes a lot of work to narrow that memory corruption down. The right way
to quiesce I/O devices is to call the shutdown() function for every active
driver, which pci_device_shutdown() does today. If every driver provided
a proper shutdown() function, we would be done. Since that is not the
case, we need to stop potentially active devices from interfering with
kexec'd kernel. Too many drivers rely upon firmware reinitializing the
device when system is shut down. The two ways I can think of are to stop
DMA by clearing Bus Master bit and turn off the interrupt, which have
been shown to get kexec (and thus kdump) working on machines it didn't
work on before. 

This is a non-trivial problem to solve and I am very open to better
ideas.

--
Khalid

Bjorn Helgaas - Feb. 5, 2013, 7:22 p.m.
On Tue, Feb 5, 2013 at 8:28 AM, Khalid Aziz <khalid@gonehiking.org> wrote:
> On Mon, 2013-02-04 at 16:13 -0700, Bjorn Helgaas wrote:
>> On Mon, Feb 4, 2013 at 3:20 PM, Khalid Aziz <khalid@gonehiking.org> wrote:
>> > On Mon, 2013-02-04 at 15:55 +0400, Konstantin Khlebnikov wrote:
>> >> Matthew Garrett and Alan Cox said (see LKML link below) that clearing bus-master
>> >> for all PCI devices may lead to unpredictable consequences, some devices ignores
>> >> this bit and continues DMA, some of them hang after that or crash whole system.
>> >> Probably we should leave here only warning and disable bus-mastering for each
>> >> driver individually in ->shutdown() callback.
>> >
>> > Agreed that the right place for shutting down a PCI device properly and
>> > clearing its Bus Master bit, is the driver shutdown routine, if only all
>> > drivers supplied a shutdown routine. As it is today, there are too many
>> > drivers that do not provide a shutdown routine, ata_piix, Marvell SATA
>> > driver, ATI AGP driver just to name a few among a large number of them.
>> > Yet kexec is expected to work inspite of these drivers especially since
>> > kdump depends on it. So until all PCI drivers supply a shutdown routine,
>> > this is just a band-aid to disable interrupt and Bus Master bit in
>> > pci_device_shutdown(). Most drivers do seem to supply a suspend and
>> > resume function and it was discussed many years ago if it is feasible to
>> > use the suspend() routine for drivers to shut devices down cleanly.
>> > Maybe it is time to revisit that discussion.
>>
>> This patch as posted doesn't do anything with IRQs.  It only clears
>> PCI_COMMAND_MASTER.
>>
>> I'm open to considering something with IRQs, but I don't understand
>> exactly what we should do.  In your response to the previous version
>> (https://lkml.org/lkml/2013/1/28/720) you suggested this:
>>
>>   pci_clear_master(pci_dev);
>>   pcibios_disable_device(pci_dev);
>>
>> Did you figure out specifically why pcibios_disable_device() helps?
>> Using pcibios_disable_device() doesn't seem like the ideal solution
>> because on most architectures, it is an empty function with no obvious
>> connection to IRQs.  On x86 with ACPI, it cleans up some ACPI PCI IRQ
>> stuff, but as far as I can tell, it doesn't actually touch the PCI
>> device itself or even the IOAPIC to which it's connected, so I'm not
>> sure how this would help kexec.
>
> My reading of the code was that pcibios_disable_device() does clear the
> interrupt on x86 and ia64. I am not deeply familiar with the ACPI code
> and I might be interpreting it incorrectly, so please do correct me if I
> am reading it incorrectly. Here is the code sequence I see:
>
> pcibios_disable_device() ->
>    pcibios_disable_irq() ->
>        acpi_pci_irq_disable() ->
>            acpi_pci_link_free_irq() ->
>               acpi_evaluate_object(link->device->handle, "_DIS", NULL, NULL);
>
> My understanding is the evaluation of ACPI _DIS method will disable the
> interrupt from the device. Does that sound reasonable?

I see the code you're looking at in acpi_pci_link_free_irq(), but we
only evaluate _DIS if link->refcnt == 0, and I don't think refcnt is
ever zero at that point.

refcnt starts out at zero in acpi_pci_link_add() (called when we find
PNP0C0F devices), and it's incremented in acpi_pci_link_allocate_irq()
(called in the pci_enable_device() path), but as far as I can tell,
it's never decremented, so I doubt that _DIS is ever evaluated.
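
The refcnt behavior described above can be modeled in userspace roughly as follows. This is only a sketch of the description in this message, not the kernel's ACPI link code: the model_* names and the dis_evaluated flag are hypothetical, and the "never decremented" behavior is taken from the "as far as I can tell" reading above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an ACPI interrupt link; not the kernel structure. */
struct model_link {
	int refcnt;          /* starts at zero when the link device is found */
	bool dis_evaluated;  /* set if the model would have evaluated _DIS */
};

/* Models the enable path: every allocation takes a reference. */
static void model_link_allocate_irq(struct model_link *link)
{
	assert(link != NULL);
	link->refcnt++;
}

/*
 * Models the free path as described in this message: nothing decrements
 * refcnt, so the refcnt == 0 check fails once the link has been used.
 */
static void model_link_free_irq(struct model_link *link)
{
	assert(link != NULL);
	if (link->refcnt == 0)
		link->dis_evaluated = true;
}
```

In this model, one allocate followed by one free leaves refcnt at 1 and dis_evaluated false, which matches the doubt expressed above that _DIS is ever evaluated.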

If we did evaluate _DIS, it would act on an "interrupt link" device,
not on the PCI device itself.  I guess that could help, but only for
devices connected to such a link device.  For others, I guess we might
be able to accomplish something similar by updating local APIC and/or
IOAPIC config.  I don't think we do that today, at least not in the
pci_disable_device() path, but it might be something interesting to
explore.  There is also the INTx Disable bit, though it's obviously
only on new PCI devices.
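
For reference, both bits mentioned in this thread live in the PCI command register. The sketch below shows the bit arithmetic only, on a plain integer rather than a real device; model_quiesce_command() is a hypothetical helper, not a kernel function, and whether setting INTx Disable at shutdown is safe for every device is not established in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_COMMAND_MASTER       0x004 /* Bus Master enable, bit 2 */
#define PCI_COMMAND_INTX_DISABLE 0x400 /* INTx Disable, bit 10 */

/* Hypothetical helper: command value with DMA and INTx both turned off. */
static uint16_t model_quiesce_command(uint16_t command)
{
	command &= (uint16_t)~PCI_COMMAND_MASTER;  /* stop bus mastering */
	command |= PCI_COMMAND_INTX_DISABLE;       /* mask legacy INTx */

	/* Postcondition of the model: bus mastering is off. */
	assert((command & PCI_COMMAND_MASTER) == 0);
	return command;
}
```

A device that does not implement INTx Disable would simply ignore the second write, which is why it only helps on newer devices.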

> ... The two ways I can think of are to stop
> DMA by clearing Bus Master bit and turn off the interrupt, which have
> been shown to get kexec (and thus kdump) working on machines it didn't
> work on before.

I was just curious if you had actually verified that _DIS was being
evaluated and making a difference here, or if the Bus Master bit was
really the important part.

Bjorn
Khalid Aziz - Feb. 6, 2013, 12:21 a.m.
On Tue, 2013-02-05 at 12:22 -0700, Bjorn Helgaas wrote:
> On Tue, Feb 5, 2013 at 8:28 AM, Khalid Aziz <khalid@gonehiking.org> wrote:
> > On Mon, 2013-02-04 at 16:13 -0700, Bjorn Helgaas wrote:
> >> Did you figure out specifically why pcibios_disable_device() helps?
> >> Using pcibios_disable_device() doesn't seem like the ideal solution
> >> because on most architectures, it is an empty function with no obvious
> >> connection to IRQs.  On x86 with ACPI, it cleans up some ACPI PCI IRQ
> >> stuff, but as far as I can tell, it doesn't actually touch the PCI
> >> device itself or even the IOAPIC to which it's connected, so I'm not
> >> sure how this would help kexec.
> >
> > My reading of the code was that pcibios_disable_device() does clear the
> > interrupt on x86 and ia64. I am not deeply familiar with the ACPI code
> > and I might be interpreting it incorrectly, so please do correct me if I
> > am reading it incorrectly. Here is the code sequence I see:
> >
> > pcibios_disable_device() ->
> >    pcibios_disable_irq() ->
> >        acpi_pci_irq_disable() ->
> >            acpi_pci_link_free_irq() ->
> >               acpi_evaluate_object(link->device->handle, "_DIS", NULL, NULL);
> >
> > My understanding is the evaluation of ACPI _DIS method will disable the
> > interrupt from the device. Does that sound reasonable?
> 
> I see the code you're looking at in acpi_pci_link_free_irq(), but we
> only evaluate _DIS if link->refcnt == 0, and I don't think refcnt is
> ever zero at that point.
> 
> refcnt starts out at zero in acpi_pci_link_add() (called when we find
> PNP0C0F devices), and it's incremented in acpi_pci_link_allocate_irq()
> (called in the pci_enable_device() path), but as far as I can tell,
> it's never decremented, so I doubt that _DIS is ever evaluated.

Ah, that is interesting. I was assuming as we disable PCI devices, the
refcnt would have been decremented and if no one was using the IRQ, we
would evaluate _DIS method and disable the interrupt link.

> 
> If we did evaluate _DIS, it would act on an "interrupt link" device,
> not on the PCI device itself.  

Right, it should be the shutdown() routine for the device driver that
disables interrupts on the device itself. We want to turn interrupts off
one level higher in pci_device_shutdown().

> I guess that could help, but only for
> devices connected to such a link device.  For others, I guess we might
> be able to accomplish something similar by updating local APIC and/or
> IOAPIC config.  I don't think we do that today, at least not in the
> pci_disable_device() path, but it might be something interesting to
> explore.  There is also the INTx Disable bit, though it's obviously
> only on new PCI devices.

Turning interrupts off at the local APIC or IOAPIC level sounds like a more
reliable thing to do.

> 
> > ... The two ways I can think of are to stop
> > DMA by clearing Bus Master bit and turn off the interrupt, which have
> > been shown to get kexec (and thus kdump) working on machines it didn't
> > work on before.
> 
> I was just curious if you had actually verified that _DIS was being
> evaluated and making a difference here, or if the Bus Master bit was
> really the important part.
> 
> Bjorn

I have been able to reproduce the kexec problem caused by active PCI
devices only in limited cases and in those cases it was really the Bus
Master bit that was important. Keeping interrupts from errant devices
from reaching the kernel until the newly kexec'd kernel is initialized is
an additional safety measure.

--
Khalid


Patch

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index f79cbcd..dc5bdce 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -392,7 +392,7 @@  static void pci_device_shutdown(struct device *dev)
 	 * Turn off Bus Master bit on the device to tell it to not
 	 * continue to do DMA
 	 */
-	pci_disable_device(pci_dev);
+	pci_clear_master(pci_dev);
 }
 
 #ifdef CONFIG_PM
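
The difference between the two calls in this one-line change can be illustrated with a small userspace model. This is a sketch under stated assumptions, not kernel code: struct model_pci_dev and the model_* functions are hypothetical, and the refcounted behavior of the disable path is inferred from the changelog's statement that pci_clear_master() acts unconditionally and does not change enable_cnt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND_MASTER 0x4 /* Bus Master enable, bit 2 of the command register */

/* Hypothetical stand-in for struct pci_dev; not the kernel structure. */
struct model_pci_dev {
	int enable_cnt;    /* models the enable reference count */
	uint16_t command;  /* models the PCI command register */
};

/* Models the refcounted path: only the last disable clears Bus Master. */
static void model_disable_device(struct model_pci_dev *dev)
{
	assert(dev != NULL);
	if (--dev->enable_cnt != 0)
		return;
	dev->command &= (uint16_t)~PCI_COMMAND_MASTER;
}

/* Models the unconditional path: enable_cnt is left alone. */
static void model_clear_master(struct model_pci_dev *dev)
{
	assert(dev != NULL);
	dev->command &= (uint16_t)~PCI_COMMAND_MASTER;
}
```

With enable_cnt at 2, the modeled disable path drops the count to 1 and leaves Bus Master set, while the modeled clear-master path turns the bit off without touching the count.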