diff mbox

[2/2] pseries/iommu: remove DDW on kexec

Message ID 20130129020357.GB12156@linux.vnet.ibm.com (mailing list archive)
State Accepted, archived
Commit 14b6f00f8a4fdec5ccd45a0710284de301a61628
Delegated to: Benjamin Herrenschmidt
Headers show

Commit Message

Nishanth Aravamudan Jan. 29, 2013, 2:03 a.m. UTC
pseries/iommu: remove DDW on kexec
    
We currently insert a property in the device-tree when we successfully
configure DDW for a given slot. This was meant to be an optimization to
speed up kexec/kdump, so that we don't need to make the RTAS calls again
to re-configured DDW in the new kernel.
    
However, we end up tripping a plpar_tce_stuff failure on kexec/kdump
because we unconditionally parse the ibm,dma-window property for the
node at bus/dev setup time. This property contains the 32-bit DMA window
LIOBN, which is distinct from the DDW window's. We pass that LIOBN (via
iommu_table_init -> iommu_table_clear -> tce_free ->
tce_freemulti_pSeriesLP) to plpar_tce_stuff, which fails because that
32-bit window is no longer present after
25ebc45b93452d0bc60271f178237123c4b26808 ("powerpc/pseries/iommu: remove
default window before attempting DDW manipulation").
    
I believe the simplest, easiest-to-maintain fix is to just change our
initcall to, rather than detecting and updating the new kernel's DDW
knowledge, just remove all DDW configurations. When the drivers
re-initialize, we will set everything back up as it was before.
    
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

Comments

Michael Ellerman Jan. 29, 2013, 10:58 a.m. UTC | #1
On Mon, 2013-01-28 at 18:03 -0800, Nishanth Aravamudan wrote:
> pseries/iommu: remove DDW on kexec
>  ...
>     
> I believe the simplest, easiest-to-maintain fix is to just change our
> initcall to, rather than detecting and updating the new kernel's DDW
> knowledge, just remove all DDW configurations. When the drivers
> re-initialize, we will set everything back up as it was before.

I don't know this code at all, but this sounds like it will also work
for kdump, right? ie. when the original kernel has crashed the 2nd
kernel will tear the DDW down and set it back up.

cheers
Nishanth Aravamudan Jan. 29, 2013, 8:33 p.m. UTC | #2
Hi Michael,

On 29.01.2013 [21:58:28 +1100], Michael Ellerman wrote:
> On Mon, 2013-01-28 at 18:03 -0800, Nishanth Aravamudan wrote:
> > pseries/iommu: remove DDW on kexec
> >  ...
> >     
> > I believe the simplest, easiest-to-maintain fix is to just change our
> > initcall to, rather than detecting and updating the new kernel's DDW
> > knowledge, just remove all DDW configurations. When the drivers
> > re-initialize, we will set everything back up as it was before.
> 
> I don't know this code at all, but this sounds like it will also work
> for kdump, right? ie. when the original kernel has crashed the 2nd
> kernel will tear the DDW down and set it back up.

Yes, my actual test-case (and what was reported as broken) was kdump.
From my relatively vague (but now growing) understanding of that
process, kdump does use kexec under the covers to switch to the crash
kernel, and so does get the same benefit from this change.

Another datapoint, though, is that it might make sense to recommend (and
I'm working on figuring this out for the distros, etc) to use
disable_ddw anyways for the kdump kernel command-line, as DDW isn't
'free' and it's unclear if performance is a huge concern for the crash
kernel (sort of varies with where your storage is, and how much you need
to dump, which for kdump generally doesn't seem like that much?).

Thanks,
Nish
Michael Ellerman Feb. 4, 2013, 3:38 a.m. UTC | #3
On Tue, 2013-01-29 at 12:33 -0800, Nishanth Aravamudan wrote:
> Hi Michael,
> 
> On 29.01.2013 [21:58:28 +1100], Michael Ellerman wrote:
> > On Mon, 2013-01-28 at 18:03 -0800, Nishanth Aravamudan wrote:
> > > pseries/iommu: remove DDW on kexec
> > >  ...
> > >     
> > > I believe the simplest, easiest-to-maintain fix is to just change our
> > > initcall to, rather than detecting and updating the new kernel's DDW
> > > knowledge, just remove all DDW configurations. When the drivers
> > > re-initialize, we will set everything back up as it was before.
> > 
> > I don't know this code at all, but this sounds like it will also work
> > for kdump, right? ie. when the original kernel has crashed the 2nd
> > kernel will tear the DDW down and set it back up.
> 
> Yes, my actual test-case (and what was reported as broken) was kdump.
> From my relatively vague (but now growing) understanding of that
> process, kdump does use kexec under the covers to switch to the crash
> kernel, and so does get the same benefit from this change.

OK good. Yeah kdump is basically equivalent to kexec with one major
difference, which is that before a kexec the first kernel is shutdown
more or less normally - as if it was about to reboot.

kdump on the other hand is invoked in the panic path, so nothing is
shutdown in the first kernel (almost, see ehea). So to accommodate kdump
you want any logic to be in the startup path of the 2nd kernel.

> Another datapoint, though, is that it might make sense to recommend (and
> I'm working on figuring this out for the distros, etc) to use
> disable_ddw anyways for the kdump kernel command-line, as DDW isn't
> 'free' and it's unclear if performance is a huge concern for the crash
> kernel (sort of varies with where your storage is, and how much you need
> to dump, which for kdump generally doesn't seem like that much?).

The kdump kernel /should/ just be creating a crash dump and then
rebooting. That may involve writing all of RAM out to disk/nw, which
could be a lot of I/O. So performance may be a concern, but correctness
is much much much more important.

Some folks have talked about having the kdump kernel just come up and
pretend it is back in production, but that is madness for various
reasons and I don't think anyone ever did it. 

cheers
diff mbox

Patch

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index a8e99f9..1b2a174 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -787,33 +787,68 @@  static u64 find_existing_ddw(struct device_node *pdn)
 	return dma_addr;
 }
 
+static void __restore_default_window(struct eeh_dev *edev,
+						u32 ddw_restore_token)
+{
+	u32 cfg_addr;
+	u64 buid;
+	int ret;
+
+	/*
+	 * Get the config address and phb buid of the PE window.
+	 * Rely on eeh to retrieve this for us.
+	 * Retrieve them from the pci device, not the node with the
+	 * dma-window property
+	 */
+	cfg_addr = edev->config_addr;
+	if (edev->pe_config_addr)
+		cfg_addr = edev->pe_config_addr;
+	buid = edev->phb->buid;
+
+	do {
+		ret = rtas_call(ddw_restore_token, 3, 1, NULL, cfg_addr,
+					BUID_HI(buid), BUID_LO(buid));
+	} while (rtas_busy_delay(ret));
+	pr_info("ibm,reset-pe-dma-windows(%x) %x %x %x returned %d\n",
+		 ddw_restore_token, cfg_addr, BUID_HI(buid), BUID_LO(buid), ret);
+}
+
 static int find_existing_ddw_windows(void)
 {
-	int len;
 	struct device_node *pdn;
-	struct direct_window *window;
 	const struct dynamic_dma_window_prop *direct64;
+	const u32 *ddw_extensions;
 
 	if (!firmware_has_feature(FW_FEATURE_LPAR))
 		return 0;
 
 	for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
-		direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
+		direct64 = of_get_property(pdn, DIRECT64_PROPNAME, NULL);
 		if (!direct64)
 			continue;
 
-		window = kzalloc(sizeof(*window), GFP_KERNEL);
-		if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
-			kfree(window);
-			remove_ddw(pdn);
-			continue;
-		}
+		/*
+		 * We need to ensure the IOMMU table is active when we
+		 * return from the IOMMU setup so that the common code
+		 * can clear the table or find the holes. To that end,
+		 * first, remove any existing DDW configuration.
+		 */
+		remove_ddw(pdn);
 
-		window->device = pdn;
-		window->prop = direct64;
-		spin_lock(&direct_window_list_lock);
-		list_add(&window->list, &direct_window_list);
-		spin_unlock(&direct_window_list_lock);
+		/*
+		 * Second, if we are running on a new enough level of
+		 * firmware where the restore API is present, use it to
+		 * restore the 32-bit window, which was removed in
+		 * create_ddw.
+		 * If the API is not present, then create_ddw couldn't
+		 * have removed the 32-bit window in the first place, so
+		 * removing the DDW configuration should be sufficient.
+		 */
+		ddw_extensions = of_get_property(pdn, "ibm,ddw-extensions",
+									NULL);
+		if (ddw_extensions && ddw_extensions[0] > 0)
+			__restore_default_window(of_node_to_eeh_dev(pdn),
+							ddw_extensions[1]);
 	}
 
 	return 0;
@@ -886,30 +921,7 @@  static int create_ddw(struct pci_dev *dev, const u32 *ddw_avail,
 static void restore_default_window(struct pci_dev *dev,
 					u32 ddw_restore_token)
 {
-	struct eeh_dev *edev;
-	u32 cfg_addr;
-	u64 buid;
-	int ret;
-
-	/*
-	 * Get the config address and phb buid of the PE window.
-	 * Rely on eeh to retrieve this for us.
-	 * Retrieve them from the pci device, not the node with the
-	 * dma-window property
-	 */
-	edev = pci_dev_to_eeh_dev(dev);
-	cfg_addr = edev->config_addr;
-	if (edev->pe_config_addr)
-		cfg_addr = edev->pe_config_addr;
-	buid = edev->phb->buid;
-
-	do {
-		ret = rtas_call(ddw_restore_token, 3, 1, NULL, cfg_addr,
-					BUID_HI(buid), BUID_LO(buid));
-	} while (rtas_busy_delay(ret));
-	dev_info(&dev->dev,
-		"ibm,reset-pe-dma-windows(%x) %x %x %x returned %d\n",
-		 ddw_restore_token, cfg_addr, BUID_HI(buid), BUID_LO(buid), ret);
+	__restore_default_window(pci_dev_to_eeh_dev(dev), ddw_restore_token);
 }
 
 /*