
[v2] PCI: Reset PCIe devices to stop ongoing DMA

Message ID 51F85BCC.2070103@jp.fujitsu.com
State Not Applicable

Commit Message

Takao Indoh July 31, 2013, 12:35 a.m. UTC
(2013/07/31 0:59), Bjorn Helgaas wrote:
> On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh
> <indou.takao@jp.fujitsu.com> wrote:
>> (2013/07/29 23:17), Bjorn Helgaas wrote:
>>> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
>>>> (2013/07/26 2:00), Bjorn Helgaas wrote:
> 
>>>>> My point about IOMMU and PCI initialization order doesn't go away just
>>>>> because it doesn't fit "kdump policy."  Having system initialization
>>>>> occur in a logical order is far more important than making kdump work.
>>>>
>>>> My next plan is as follows. I think it matches the logical order
>>>> on boot.
>>>>
>>>> drivers/pci/pci.c
>>>> - Add a function to reset a bus, for example, pci_reset_bus(struct pci_bus *bus)
>>>>
>>>> drivers/iommu/intel-iommu.c
>>>> - On initialization, if the IOMMU is already enabled, call this bus reset
>>>>     function before disabling and re-enabling the IOMMU.
>>>
>>> I raised this issue because of arches like sparc that enumerate the
>>> IOMMU before the PCI devices that use it.  In that situation, I think
>>> you're proposing this:
>>>
>>>     panic kernel
>>>       enable IOMMU
>>>       panic
>>>     kdump kernel
>>>       initialize IOMMU (already enabled)
>>>         pci_reset_bus
>>>         disable IOMMU
>>>         enable IOMMU
>>>       enumerate PCI devices
>>>
>>> But the problem is that when you call pci_reset_bus(), you haven't
>>> enumerated the PCI devices, so you don't know what to reset.
>>
>> Right, so my idea is to add the reset code to "intel-iommu.c". intel-iommu
>> initialization assumes that PCI device enumeration is already done, so we
>> can find the target devices from the IOMMU page tables instead of scanning
>> all devices in the PCI tree.
>>
>> Therefore, this idea only applies to intel-iommu. Other architectures need
>> to implement their own reset code.
> 
> That's my point.  I'm opposed to adding code to PCI when it only
> benefits x86 and we know other arches will need a fundamentally
> different design.  I would rather have a design that can work for all
> arches.
> 
> If your implementation is totally implemented under arch/x86 (or in
> intel-iommu.c, I guess), I can't object as much.  However, I think
> that eventually even x86 should enumerate the IOMMUs via ACPI before
> we enumerate PCI devices.
> 
> It's pretty clear that's how BIOS designers expect the OS to work.
> For example, sec 8.7.3 of the Intel Virtualization Technology for
> Directed I/O spec, rev 1.3, shows the expectation that remapping
> hardware (IOMMU) is initialized before discovering the I/O hierarchy
> below a hot-added host bridge.  Obviously you're not talking about a
> hot-add scenario, but I think the same sequence should apply at
> boot-time as well.

Of course I won't add anything x86-specific to the common PCI layer. I've
attached my new patch, though it is not well tested yet.
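The attached patch finds the devices to reset by walking the VT-d root and context tables that the first kernel left programmed, rather than by scanning the PCI tree. A minimal userspace sketch of that walk, under simplifying assumptions: the tables are ordinary in-memory arrays, each 16-byte entry is modeled as two 64-bit words with the Present bit in bit 0 of the low word, and the pointer field of a root entry (bits 63:12) is treated as an index into an array of context tables instead of a physical address. `struct vtd_entry`, `vtd_present()` and `vtd_collect_bdfs()` are hypothetical names, not kernel APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* A 4 KiB root or context table holds 256 entries of 16 bytes each. */
#define VTD_NR_ENTRIES 256

/* Simplified 128-bit root/context entry; bit 0 of lo is Present. */
struct vtd_entry {
	uint64_t lo;
	uint64_t hi;
};

static int vtd_present(const struct vtd_entry *e)
{
	return (int)(e->lo & 1);
}

/*
 * Hypothetical helper: record every present (bus, devfn) pair as
 * (bus << 8) | devfn.  In this simulation, bits 63:12 of a root entry
 * hold an index into ctx_tables rather than a physical address.
 * Returns the number of values written to out (at most max).
 */
static size_t vtd_collect_bdfs(const struct vtd_entry *root,
			       struct vtd_entry (*ctx_tables)[VTD_NR_ENTRIES],
			       size_t nr_ctx, uint16_t *out, size_t max)
{
	size_t n = 0;

	for (int bus = 0; bus < VTD_NR_ENTRIES; bus++) {
		if (!vtd_present(&root[bus]))
			continue;

		size_t idx = (size_t)(root[bus].lo >> 12);
		if (idx >= nr_ctx)
			continue;	/* pointer outside the simulated tables */

		for (int devfn = 0; devfn < VTD_NR_ENTRIES && n < max; devfn++)
			if (vtd_present(&ctx_tables[idx][devfn]))
				out[n++] = (uint16_t)((bus << 8) | devfn);
	}
	return n;
}
```

In the patch itself, each table is mapped with ioremap() and each present entry is looked up with pci_get_domain_bus_and_slot(). That function returns the device with its reference count raised, so a caller would normally release it with pci_dev_put().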

On x86, IOMMU initialization currently runs *after* PCI enumeration, but
what you are saying is that it should be changed so that x86
IOMMU initialization is done *before* PCI enumeration, as on sparc, right?

Hmm, OK, I think I need to post the attached patch to the iommu list and
discuss it, including the current order of x86 IOMMU initialization.

Thanks,
Takao Indoh
---
 drivers/iommu/intel-iommu.c |   55 +++++++++++++++++++++++++++++++++-
 drivers/pci/pci.c           |   53 ++++++++++++++++++++++++++++++++
 include/linux/pci.h         |    1
 3 files changed, 108 insertions(+), 1 deletion(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Alex Williamson July 31, 2013, 3:11 a.m. UTC | #1
On Wed, 2013-07-31 at 09:35 +0900, Takao Indoh wrote:
> (2013/07/31 0:59), Bjorn Helgaas wrote:
> > On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh
> > <indou.takao@jp.fujitsu.com> wrote:
> >> (2013/07/29 23:17), Bjorn Helgaas wrote:
> >>> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
> >>>> (2013/07/26 2:00), Bjorn Helgaas wrote:
> > 
> >>>>> My point about IOMMU and PCI initialization order doesn't go away just
> >>>>> because it doesn't fit "kdump policy."  Having system initialization
> >>>>> occur in a logical order is far more important than making kdump work.
> >>>>
> >>>> My next plan is as follows. I think it matches the logical order
> >>>> on boot.
> >>>>
> >>>> drivers/pci/pci.c
> >>>> - Add a function to reset a bus, for example, pci_reset_bus(struct pci_bus *bus)
> >>>>
> >>>> drivers/iommu/intel-iommu.c
> >>>> - On initialization, if the IOMMU is already enabled, call this bus reset
> >>>>     function before disabling and re-enabling the IOMMU.
> >>>
> >>> I raised this issue because of arches like sparc that enumerate the
> >>> IOMMU before the PCI devices that use it.  In that situation, I think
> >>> you're proposing this:
> >>>
> >>>     panic kernel
> >>>       enable IOMMU
> >>>       panic
> >>>     kdump kernel
> >>>       initialize IOMMU (already enabled)
> >>>         pci_reset_bus
> >>>         disable IOMMU
> >>>         enable IOMMU
> >>>       enumerate PCI devices
> >>>
> >>> But the problem is that when you call pci_reset_bus(), you haven't
> >>> enumerated the PCI devices, so you don't know what to reset.
> >>
> >> Right, so my idea is to add the reset code to "intel-iommu.c". intel-iommu
> >> initialization assumes that PCI device enumeration is already done, so we
> >> can find the target devices from the IOMMU page tables instead of scanning
> >> all devices in the PCI tree.
> >>
> >> Therefore, this idea only applies to intel-iommu. Other architectures need
> >> to implement their own reset code.
> > 
> > That's my point.  I'm opposed to adding code to PCI when it only
> > benefits x86 and we know other arches will need a fundamentally
> > different design.  I would rather have a design that can work for all
> > arches.
> > 
> > If your implementation is totally implemented under arch/x86 (or in
> > intel-iommu.c, I guess), I can't object as much.  However, I think
> > that eventually even x86 should enumerate the IOMMUs via ACPI before
> > we enumerate PCI devices.
> > 
> > It's pretty clear that's how BIOS designers expect the OS to work.
> > For example, sec 8.7.3 of the Intel Virtualization Technology for
> > Directed I/O spec, rev 1.3, shows the expectation that remapping
> > hardware (IOMMU) is initialized before discovering the I/O hierarchy
> > below a hot-added host bridge.  Obviously you're not talking about a
> > hot-add scenario, but I think the same sequence should apply at
> > boot-time as well.
> 
> Of course I won't add anything x86-specific to the common PCI layer. I've
> attached my new patch, though it is not well tested yet.
> 
> On x86, IOMMU initialization currently runs *after* PCI enumeration, but
> what you are saying is that it should be changed so that x86
> IOMMU initialization is done *before* PCI enumeration, as on sparc, right?
> 
> Hmm, OK, I think I need to post the attached patch to the iommu list and
> discuss it, including the current order of x86 IOMMU initialization.
> 
> Thanks,
> Takao Indoh
> ---
>  drivers/iommu/intel-iommu.c |   55 +++++++++++++++++++++++++++++++++-
>  drivers/pci/pci.c           |   53 ++++++++++++++++++++++++++++++++
>  include/linux/pci.h         |    1
>  3 files changed, 108 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index eec0d3e..fb8a546 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3663,6 +3663,56 @@ static struct notifier_block device_nb = {
>  	.notifier_call = device_notifier,
>  };
>  
> +/* Reset PCI device if its entry exists in DMAR table */
> +static void __init iommu_reset_devices(struct intel_iommu *iommu, u16 segment)
> +{
> +	u64 addr;
> +	struct root_entry *root;
> +	struct context_entry *context;
> +	int bus, devfn;
> +	struct pci_dev *dev;
> +
> +	addr = dmar_readq(iommu->reg + DMAR_RTADDR_REG);
> +	if (!addr)
> +		return;
> +
> +	/*
> +	 *  In the case of kdump, ioremap is needed because root-entry table
> +	 *  exists in first kernel's memory area which is not mapped in second
> +	 *  kernel
> +	 */
> +	root = (struct root_entry*)ioremap(addr, PAGE_SIZE);
> +	if (!root)
> +		return;
> +
> +	for (bus=0; bus<ROOT_ENTRY_NR; bus++) {
> +		if (!root_present(&root[bus]))
> +			continue;
> +
> +		context = (struct context_entry *)ioremap(
> +			root[bus].val & VTD_PAGE_MASK, PAGE_SIZE);
> +		if (!context)
> +			continue;
> +
> +		for (devfn=0; devfn<ROOT_ENTRY_NR; devfn++) {
> +			if (!context_present(&context[devfn]))
> +				continue;
> +
> +			dev = pci_get_domain_bus_and_slot(segment, bus, devfn);
> +			if (!dev)
> +				continue;
> +
> +			if (!pci_reset_bus(dev->bus)) /* go to next bus */
> +				break;
> +			else /* Try per-function reset */
> +				pci_reset_function(dev);
> +
> +		}
> +		iounmap(context);
> +	}
> +	iounmap(root);
> +}
> +
>  int __init intel_iommu_init(void)
>  {
>  	int ret = 0;
> @@ -3687,8 +3737,11 @@ int __init intel_iommu_init(void)
>  			continue;
>  
>  		iommu = drhd->iommu;
> -		if (iommu->gcmd & DMA_GCMD_TE)
> +		if (iommu->gcmd & DMA_GCMD_TE) {
> +			if (reset_devices)
> +				iommu_reset_devices(iommu, drhd->segment);
>  			iommu_disable_translation(iommu);
> +		}
>  	}
>  
>  	if (dmar_dev_scope_init() < 0) {
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e37fea6..c595997 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3392,6 +3392,59 @@ int pci_reset_function(struct pci_dev *dev)
>  EXPORT_SYMBOL_GPL(pci_reset_function);
>  
>  /**
> + * pci_reset_bus - reset a PCI bus
> + * @bus: PCI bus to reset
> + *
> + * Returns 0 if the bus was successfully reset or negative if failed.
> + */
> +int pci_reset_bus(struct pci_bus *bus)
> +{
> +	struct pci_dev *pdev;
> +	u16 ctrl;
> +
> +	if (!bus->self)
> +		return -ENOTTY;
> +
> +	list_for_each_entry(pdev, &bus->devices, bus_list)
> +		if (pdev->subordinate)
> +			return -ENOTTY;
> +
> +	/* Save config registers of children */
> +	list_for_each_entry(pdev, &bus->devices, bus_list) {
> +		dev_info(&pdev->dev, "Save state\n");
> +		pci_save_state(pdev);
> +	}
> +
> +	dev_info(&bus->self->dev, "Reset Secondary bus\n");
> +
> +	/* Assert Secondary Bus Reset */
> +	pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
> +	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
> +	pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
> +
> +	/* Read config again to flush previous write */
> +	pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
> +
> +	msleep(2);
> +
> +	/* De-assert Secondary Bus Reset */
> +	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> +	pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
> +
> +	/* Wait for completion */
> +	msleep(1000);


We already have secondary bus reset code in this file, why are we
duplicating it here?  Also, why are these delays different from the
existing code?  I'm also in need of a bus reset interface for when we
assign all of the devices on a bus to userspace and do not have working
function level resets per device.  I'll post my patch series and perhaps
we can collaborate on a pci bus reset interface.  Thanks,

Alex

> +
> +	/* Restore config registers of children */
> +	list_for_each_entry(pdev, &bus->devices, bus_list) {
> +		dev_info(&pdev->dev, "Restore state\n");
> +		pci_restore_state(pdev);
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_reset_bus);
> +
> +/**
>   * pcix_get_max_mmrbc - get PCI-X maximum designed memory read byte count
>   * @dev: PCI device to query
>   *
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 0fd1f15..125fbc6 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -924,6 +924,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
>  int __pci_reset_function(struct pci_dev *dev);
>  int __pci_reset_function_locked(struct pci_dev *dev);
>  int pci_reset_function(struct pci_dev *dev);
> +int pci_reset_bus(struct pci_bus *bus);
>  void pci_update_resource(struct pci_dev *dev, int resno);
>  int __must_check pci_assign_resource(struct pci_dev *dev, int i);
>  int __must_check pci_reassign_resource(struct pci_dev *dev, int i, resource_size_t add_size, resource_size_t align);
> 

Takao Indoh July 31, 2013, 5:50 a.m. UTC | #2
(2013/07/31 12:11), Alex Williamson wrote:
> On Wed, 2013-07-31 at 09:35 +0900, Takao Indoh wrote:
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index e37fea6..c595997 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -3392,6 +3392,59 @@ int pci_reset_function(struct pci_dev *dev)
>>   EXPORT_SYMBOL_GPL(pci_reset_function);
>>   
>>   /**
>> + * pci_reset_bus - reset a PCI bus
>> + * @bus: PCI bus to reset
>> + *
>> + * Returns 0 if the bus was successfully reset or negative if failed.
>> + */
>> +int pci_reset_bus(struct pci_bus *bus)
>> +{
>> +	struct pci_dev *pdev;
>> +	u16 ctrl;
>> +
>> +	if (!bus->self)
>> +		return -ENOTTY;
>> +
>> +	list_for_each_entry(pdev, &bus->devices, bus_list)
>> +		if (pdev->subordinate)
>> +			return -ENOTTY;
>> +
>> +	/* Save config registers of children */
>> +	list_for_each_entry(pdev, &bus->devices, bus_list) {
>> +		dev_info(&pdev->dev, "Save state\n");
>> +		pci_save_state(pdev);
>> +	}
>> +
>> +	dev_info(&bus->self->dev, "Reset Secondary bus\n");
>> +
>> +	/* Assert Secondary Bus Reset */
>> +	pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
>> +	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>> +	pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
>> +
>> +	/* Read config again to flush previous write */
>> +	pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
>> +
>> +	msleep(2);
>> +
>> +	/* De-assert Secondary Bus Reset */
>> +	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
>> +	pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
>> +
>> +	/* Wait for completion */
>> +	msleep(1000);
> 
> 
> We already have secondary bus reset code in this file, why are we
> duplicating it here?  Also, why are these delays different from the
> existing code?  I'm also in need of a bus reset interface for when we
> assign all of the devices on a bus to userspace and do not have working
> function level resets per device.  I'll post my patch series and perhaps
> we can collaborate on a pci bus reset interface.  Thanks,

Good point. Yes, we already have similar functions.

pci_parent_bus_reset()
1. Assert secondary bus reset
2. msleep(100)
3. De-assert secondary bus reset
4. msleep(100)

aer_do_secondary_bus_reset()
1. Assert secondary bus reset
2. msleep(2)
3. De-assert secondary bus reset
4. msleep(200)
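The two existing sequences above, and the one in the patch, differ only in their delays. A minimal sketch of a single delay-parameterized helper that all three could share, assuming a simulated bridge whose config space is a byte array and whose delays are only accumulated rather than slept. `struct fake_bridge`, `cfg_read16()`, `cfg_write16()`, `delay_ms()` and `secondary_bus_reset()` are hypothetical; the register offset and bit value match `PCI_BRIDGE_CONTROL` and `PCI_BRIDGE_CTL_BUS_RESET` in the kernel's pci_regs.h.

```c
#include <stdint.h>

/* Values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_BRIDGE_CONTROL       0x3e
#define PCI_BRIDGE_CTL_BUS_RESET 0x40

/* Hypothetical stand-in for a bridge's config space plus a sleep hook. */
struct fake_bridge {
	uint8_t cfg[256];
	unsigned int slept_ms;      /* total time spent in delays */
	unsigned int reset_pulses;  /* times the reset bit went 0 -> 1 */
};

static uint16_t cfg_read16(struct fake_bridge *b, int off)
{
	return (uint16_t)(b->cfg[off] | (b->cfg[off + 1] << 8));
}

static void cfg_write16(struct fake_bridge *b, int off, uint16_t val)
{
	/* Count each assertion of Secondary Bus Reset. */
	if (!(cfg_read16(b, off) & PCI_BRIDGE_CTL_BUS_RESET) &&
	    (val & PCI_BRIDGE_CTL_BUS_RESET))
		b->reset_pulses++;
	b->cfg[off] = (uint8_t)(val & 0xff);
	b->cfg[off + 1] = (uint8_t)(val >> 8);
}

static void delay_ms(struct fake_bridge *b, unsigned int ms)
{
	b->slept_ms += ms;	/* kernel code would call msleep(ms) here */
}

/*
 * One secondary bus reset, parameterized by how long reset stays
 * asserted and how long to wait after de-asserting it.
 */
static void secondary_bus_reset(struct fake_bridge *b,
				unsigned int assert_ms, unsigned int settle_ms)
{
	uint16_t ctrl = cfg_read16(b, PCI_BRIDGE_CONTROL);

	cfg_write16(b, PCI_BRIDGE_CONTROL,
		    (uint16_t)(ctrl | PCI_BRIDGE_CTL_BUS_RESET));
	(void)cfg_read16(b, PCI_BRIDGE_CONTROL);	/* read back, as the patch does */
	delay_ms(b, assert_ms);

	cfg_write16(b, PCI_BRIDGE_CONTROL,
		    (uint16_t)(ctrl & ~PCI_BRIDGE_CTL_BUS_RESET));
	delay_ms(b, settle_ms);
}
```

With this shape, secondary_bus_reset(b, 100, 100) models the pci_parent_bus_reset() timing, (2, 200) the aer_do_secondary_bus_reset() timing, and (2, 1000) the timing in the patch. In the kernel, the accessors would be pci_read_config_word() and pci_write_config_word().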

To be honest, I wrote my reset code almost a year ago, so I have forgotten
why I kept them separate.

My reset code is derived from aer_do_secondary_bus_reset(). The
difference is the waiting time after the reset: my patch waits 1000 msec.

At first my reset code was almost the same as aer_do_secondary_bus_reset().
But when I tested it, I found that on a certain machine restoring the
config registers failed after the reset, because the 200 msec waiting time
was too short. Then I found the following description in the PCIe spec,
according to which I thought we should wait at least 1000 msec.

6.6.1. Conventional Reset

* The Root Complex and/or system software must allow at least 1.0s
  after a Conventional Reset of a device, before it may determine that a
  device which fails to return a Successful Completion status for a
  valid Configuration Request is a broken device. This period is
  independent of how quickly Link training completes.

  Note: This delay is analogous to the Trhfa parameter specified for
  PCI/PCI-X, and is intended to allow an adequate amount of time for
  devices which require self initialization.

* When attempting a Configuration access to devices on a PCI or PCI-X
  bus segment behind a PCI Express/PCI(-X) Bridge, the timing parameter
  Trhfa must be respected.
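The spec text sets a minimum time to allow before declaring a device broken; it does not require sleeping that long unconditionally. A minimal sketch of a polling alternative to the patch's fixed msleep(1000), assuming a simulated device whose Vendor ID reads return 0xffff until it is ready (real hardware may instead signal Configuration Request Retry Status). `struct fake_dev`, `read_vendor_id()` and `wait_for_device()` are hypothetical, and this is not what the patch does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimum wait from PCIe sec 6.6.1 before declaring a device broken. */
#define CONV_RESET_TIMEOUT_MS 1000

/*
 * Hypothetical device model: Vendor ID reads return 0xffff until
 * ready_after_ms have elapsed since the reset was de-asserted.
 */
struct fake_dev {
	unsigned int now_ms;
	unsigned int ready_after_ms;
	uint16_t vendor_id;
};

static uint16_t read_vendor_id(const struct fake_dev *d)
{
	return d->now_ms >= d->ready_after_ms ? d->vendor_id : 0xffff;
}

/*
 * Poll every poll_ms: return true as soon as the device answers, and
 * return false only after at least CONV_RESET_TIMEOUT_MS have passed.
 */
static bool wait_for_device(struct fake_dev *d, unsigned int poll_ms)
{
	unsigned int waited = 0;

	if (poll_ms == 0)
		poll_ms = 1;	/* avoid spinning without advancing time */

	for (;;) {
		if (read_vendor_id(d) != 0xffff)
			return true;
		if (waited >= CONV_RESET_TIMEOUT_MS)
			return false;
		d->now_ms += poll_ms;	/* stands in for msleep(poll_ms) */
		waited += poll_ms;
	}
}
```

A device that responds after 250 ms is then ready for pci_restore_state() after about 250 ms rather than 1000 ms, while a device that never answers is only given up on once the full 1.0 s budget has elapsed.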

And I saw the patches you posted today; yes, your patch looks helpful for
my purpose :-)

Thanks,
Takao Indoh

Bjorn Helgaas July 31, 2013, 9:08 p.m. UTC | #3
[+cc Rafael, linux-acpi]

On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:

> On x86, IOMMU initialization currently runs *after* PCI enumeration, but
> what you are saying is that it should be changed so that x86
> IOMMU initialization is done *before* PCI enumeration, as on sparc, right?

Yes.  I don't know whether or when that initialization order will ever
be changed, but I do think we should avoid building more
infrastructure that depends on the current order.

Changing the order is a pretty big deal because it's a lot more than
just the IOMMU.  Basically I think we should be enumerating ACPI
devices, including the IOMMU, before PCI devices, but there's a lot of
legacy involved in that area.  Added Rafael in case he has any
thoughts.

> Hmm, OK, I think I need to post the attached patch to the iommu list and
> discuss it, including the current order of x86 IOMMU initialization.
Rafael J. Wysocki July 31, 2013, 9:23 p.m. UTC | #4
On Wednesday, July 31, 2013 03:08:03 PM Bjorn Helgaas wrote:
> [+cc Rafael, linux-acpi]
> 
> On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
> 
> > On x86, IOMMU initialization currently runs *after* PCI enumeration, but
> > what you are saying is that it should be changed so that x86
> > IOMMU initialization is done *before* PCI enumeration, as on sparc, right?
> 
> Yes.  I don't know whether or when that initialization order will ever
> be changed, but I do think we should avoid building more
> infrastructure that depends on the current order.
> 
> Changing the order is a pretty big deal because it's a lot more than
> just the IOMMU.  Basically I think we should be enumerating ACPI
> devices, including the IOMMU, before PCI devices, but there's a lot of
> legacy involved in that area.  Added Rafael in case he has any
> thoughts.

Well, actually, I'm not really familiar with IOMMUs, sorry.

I do think that initializing IOMMU before PCI enumeration would be better,
however.  At least if the ordering should be the same on all architectures,
which I suppose is the case, that's the one I'd choose.

Thanks,
Rafael

Takao Indoh Aug. 1, 2013, 6:34 a.m. UTC | #5
(2013/08/01 6:23), Rafael J. Wysocki wrote:
> On Wednesday, July 31, 2013 03:08:03 PM Bjorn Helgaas wrote:
>> [+cc Rafael, linux-acpi]
>>
>> On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
>>
>>> On x86, IOMMU initialization currently runs *after* PCI enumeration, but
>>> what you are saying is that it should be changed so that x86
>>> IOMMU initialization is done *before* PCI enumeration, as on sparc, right?
>>
>> Yes.  I don't know whether or when that initialization order will ever
>> be changed, but I do think we should avoid building more
>> infrastructure that depends on the current order.
>>
>> Changing the order is a pretty big deal because it's a lot more than
>> just the IOMMU.  Basically I think we should be enumerating ACPI
>> devices, including the IOMMU, before PCI devices, but there's a lot of
>> legacy involved in that area.  Added Rafael in case he has any
>> thoughts.
> 
> Well, actually, I'm not really familiar with IOMMUs, sorry.
> 
> I do think that initializing IOMMU before PCI enumeration would be better,
> however.  At least if the ordering should be the same on all architectures,
> which I suppose is the case, that's the one I'd choose.

OK guys. If the x86 IOMMU maintainer also thinks changing the order is
necessary, maybe I need to give up on resetting devices in the kdump kernel
and consider doing it in the panic kernel.

Either way, I need a bus reset interface to reset devices. Bjorn, could
you review the bus reset patches Alex posted yesterday?

Thanks,
Takao Indoh

Alex Williamson Aug. 1, 2013, 12:42 p.m. UTC | #6
On Thu, 2013-08-01 at 15:34 +0900, Takao Indoh wrote:
> (2013/08/01 6:23), Rafael J. Wysocki wrote:
> > On Wednesday, July 31, 2013 03:08:03 PM Bjorn Helgaas wrote:
> >> [+cc Rafael, linux-acpi]
> >>
> >> On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
> >>
> >>> On x86, IOMMU initialization currently runs *after* PCI enumeration, but
> >>> what you are saying is that it should be changed so that x86
> >>> IOMMU initialization is done *before* PCI enumeration, as on sparc, right?
> >>
> >> Yes.  I don't know whether or when that initialization order will ever
> >> be changed, but I do think we should avoid building more
> >> infrastructure that depends on the current order.
> >>
> >> Changing the order is a pretty big deal because it's a lot more than
> >> just the IOMMU.  Basically I think we should be enumerating ACPI
> >> devices, including the IOMMU, before PCI devices, but there's a lot of
> >> legacy involved in that area.  Added Rafael in case he has any
> >> thoughts.
> > 
> > Well, actually, I'm not really familiar with IOMMUs, sorry.
> > 
> > I do think that initializing IOMMU before PCI enumeration would be better,
> > however.  At least if the ordering should be the same on all architectures,
> > which I suppose is the case, that's the one I'd choose.
> 
> OK guys. If the x86 IOMMU maintainer also thinks changing the order is
> necessary, maybe I need to give up on resetting devices in the kdump kernel
> and consider doing it in the panic kernel.
> 
> Either way, I need a bus reset interface to reset devices. Bjorn, could
> you review the bus reset patches Alex posted yesterday?

I'll post a non-RFC version today; I've made a couple of cleanups, tuned
the delays and rolled in the AER version of secondary bus reset.
Thanks,

Alex

Vivek Goyal Aug. 1, 2013, 1:20 p.m. UTC | #7
On Thu, Aug 01, 2013 at 03:34:06PM +0900, Takao Indoh wrote:
> (2013/08/01 6:23), Rafael J. Wysocki wrote:
> > On Wednesday, July 31, 2013 03:08:03 PM Bjorn Helgaas wrote:
> >> [+cc Rafael, linux-acpi]
> >>
> >> On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
> >>
> >>> On x86, IOMMU initialization currently runs *after* PCI enumeration, but
> >>> what you are saying is that it should be changed so that x86
> >>> IOMMU initialization is done *before* PCI enumeration, as on sparc, right?
> >>
> >> Yes.  I don't know whether or when that initialization order will ever
> >> be changed, but I do think we should avoid building more
> >> infrastructure that depends on the current order.
> >>
> >> Changing the order is a pretty big deal because it's a lot more than
> >> just the IOMMU.  Basically I think we should be enumerating ACPI
> >> devices, including the IOMMU, before PCI devices, but there's a lot of
> >> legacy involved in that area.  Added Rafael in case he has any
> >> thoughts.
> > 
> > Well, actually, I'm not really familiar with IOMMUs, sorry.
> > 
> > I do think that initializing IOMMU before PCI enumeration would be better,
> > however.  At least if the ordering should be the same on all architectures,
> > which I suppose is the case, that's the one I'd choose.
> 
> OK guys. If the x86 IOMMU maintainer also thinks changing the order is
> necessary, maybe I need to give up on resetting devices in the kdump kernel
> and consider doing it in the panic kernel.

I don't think trying to reset all the devices in the panic kernel is
a good idea.

We need to handle the problem at the IOMMU level first, which is
independent of whether devices have been reset or not.

IOW, we should be able to initialize the IOMMU first
and deal with devices which are doing DMA.

I am not against doing device reset, and it most likely is a good thing,
but it should happen in the second kernel and not in the crashed kernel.

Thanks
Vivek

Patch

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index eec0d3e..fb8a546 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3663,6 +3663,56 @@  static struct notifier_block device_nb = {
 	.notifier_call = device_notifier,
 };
 
+/* Reset PCI device if its entry exists in DMAR table */
+static void __init iommu_reset_devices(struct intel_iommu *iommu, u16 segment)
+{
+	u64 addr;
+	struct root_entry *root;
+	struct context_entry *context;
+	int bus, devfn;
+	struct pci_dev *dev;
+
+	addr = dmar_readq(iommu->reg + DMAR_RTADDR_REG);
+	if (!addr)
+		return;
+
+	/*
+	 *  In the case of kdump, ioremap is needed because root-entry table
+	 *  exists in first kernel's memory area which is not mapped in second
+	 *  kernel
+	 */
+	root = (struct root_entry*)ioremap(addr, PAGE_SIZE);
+	if (!root)
+		return;
+
+	for (bus=0; bus<ROOT_ENTRY_NR; bus++) {
+		if (!root_present(&root[bus]))
+			continue;
+
+		context = (struct context_entry *)ioremap(
+			root[bus].val & VTD_PAGE_MASK, PAGE_SIZE);
+		if (!context)
+			continue;
+
+		for (devfn=0; devfn<ROOT_ENTRY_NR; devfn++) {
+			if (!context_present(&context[devfn]))
+				continue;
+
+			dev = pci_get_domain_bus_and_slot(segment, bus, devfn);
+			if (!dev)
+				continue;
+
+			if (!pci_reset_bus(dev->bus)) /* go to next bus */
+				break;
+			else /* Try per-function reset */
+				pci_reset_function(dev);
+
+		}
+		iounmap(context);
+	}
+	iounmap(root);
+}
+
 int __init intel_iommu_init(void)
 {
 	int ret = 0;
@@ -3687,8 +3737,11 @@  int __init intel_iommu_init(void)
 			continue;
 
 		iommu = drhd->iommu;
-		if (iommu->gcmd & DMA_GCMD_TE)
+		if (iommu->gcmd & DMA_GCMD_TE) {
+			if (reset_devices)
+				iommu_reset_devices(iommu, drhd->segment);
 			iommu_disable_translation(iommu);
+		}
 	}
 
 	if (dmar_dev_scope_init() < 0) {
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e37fea6..c595997 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3392,6 +3392,59 @@  int pci_reset_function(struct pci_dev *dev)
 EXPORT_SYMBOL_GPL(pci_reset_function);
 
 /**
+ * pci_reset_bus - reset a PCI bus
+ * @bus: PCI bus to reset
+ *
+ * Returns 0 if the bus was successfully reset or negative if failed.
+ */
+int pci_reset_bus(struct pci_bus *bus)
+{
+	struct pci_dev *pdev;
+	u16 ctrl;
+
+	if (!bus->self)
+		return -ENOTTY;
+
+	list_for_each_entry(pdev, &bus->devices, bus_list)
+		if (pdev->subordinate)
+			return -ENOTTY;
+
+	/* Save config registers of children */
+	list_for_each_entry(pdev, &bus->devices, bus_list) {
+		dev_info(&pdev->dev, "Save state\n");
+		pci_save_state(pdev);
+	}
+
+	dev_info(&bus->self->dev, "Reset Secondary bus\n");
+
+	/* Assert Secondary Bus Reset */
+	pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
+	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
+	pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
+
+	/* Read config again to flush previous write */
+	pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
+
+	msleep(2);
+
+	/* De-assert Secondary Bus Reset */
+	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
+	pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
+
+	/* Wait for completion */
+	msleep(1000);
+
+	/* Restore config registers of children */
+	list_for_each_entry(pdev, &bus->devices, bus_list) {
+		dev_info(&pdev->dev, "Restore state\n");
+		pci_restore_state(pdev);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pci_reset_bus);
+
+/**
  * pcix_get_max_mmrbc - get PCI-X maximum designed memory read byte count
  * @dev: PCI device to query
  *
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 0fd1f15..125fbc6 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -924,6 +924,7 @@  int pcie_set_mps(struct pci_dev *dev, int mps);
 int __pci_reset_function(struct pci_dev *dev);
 int __pci_reset_function_locked(struct pci_dev *dev);
 int pci_reset_function(struct pci_dev *dev);
+int pci_reset_bus(struct pci_bus *bus);
 void pci_update_resource(struct pci_dev *dev, int resno);
 int __must_check pci_assign_resource(struct pci_dev *dev, int i);
 int __must_check pci_reassign_resource(struct pci_dev *dev, int i, resource_size_t add_size, resource_size_t align);