diff mbox

PCI: Clear Bus Master bit only on kexec reboot

Message ID 1385579908-24608-1-git-send-email-khalid.aziz@oracle.com
State Superseded
Headers show

Commit Message

Khalid Aziz Nov. 27, 2013, 7:18 p.m. UTC
Add a flag to tell the PCI subsystem that kernel is shutting down
in prepapration to kexec a kernel. Add code in PCI subsystem to use
this flag to clear Bus Master bit on PCI devices only in case of
kexec reboot. This fixes https://bugzilla.kernel.org/show_bug.cgi?id=63861
and avoids any other issues caused by clearing Bus Master bit on PCI
devices in normal shutdown path. This patch is based on discussion at
http://marc.info/?l=linux-pci&m=138425645204355&w=2

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: stable@vger.kernel.org
---
 drivers/pci/pci-driver.c | 9 ++++++---
 drivers/pci/pci.h        | 3 +++
 kernel/kexec.c           | 4 ++++
 3 files changed, 13 insertions(+), 3 deletions(-)

Comments

Eric W. Biederman Nov. 27, 2013, 7:38 p.m. UTC | #1
Khalid Aziz <khalid.aziz@oracle.com> writes:

> Add a flag to tell the PCI subsystem that kernel is shutting down
> in prepapration to kexec a kernel. Add code in PCI subsystem to use
> this flag to clear Bus Master bit on PCI devices only in case of
> kexec reboot. This fixes https://bugzilla.kernel.org/show_bug.cgi?id=63861
> and avoids any other issues caused by clearing Bus Master bit on PCI
> devices in normal shutdown path. This patch is based on discussion at
> http://marc.info/?l=linux-pci&m=138425645204355&w=2

Scratches head.

Given that most devices already call pci_disable_device which clears the
bus master bit how does this change anything meaningful?

Is is the problem here that most drivers are lazy and have a noop
shutdown method?

Eric


> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: stable@vger.kernel.org
> ---
>  drivers/pci/pci-driver.c | 9 ++++++---
>  drivers/pci/pci.h        | 3 +++
>  kernel/kexec.c           | 4 ++++
>  3 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 9042fdb..e920195 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -400,10 +400,13 @@ static void pci_device_shutdown(struct device *dev)
>  	pci_msix_shutdown(pci_dev);
>  
>  	/*
> -	 * Turn off Bus Master bit on the device to tell it to not
> -	 * continue to do DMA. Don't touch devices in D3cold or unknown states.
> +	 * If this is a kexec reboot, turn off Bus Master bit on the
> +	 * device to tell it to not continue to do DMA. Don't touch
> +	 * devices in D3cold or unknown states.
> +	 * If it is not a kexec reboot, firmware will hit the PCI
> +	 * devices with big hammer and stop their DMA any way.
>  	 */
> -	if (pci_dev->current_state <= PCI_D3hot)
> +	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
>  		pci_clear_master(pci_dev);
>  }
>  
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 9c91ecc..7d85733 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -9,6 +9,9 @@
>  extern const unsigned char pcix_bus_speed[];
>  extern const unsigned char pcie_link_speed[];
>  
> +/* flag to track if kexec reboot is in progress */
> +extern unsigned long kexec_in_progress;
> +
>  /* Functions internal to the PCI core code */
>  
>  int pci_create_sysfs_dev_files(struct pci_dev *pdev);
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index 490afc0..fd2d63e 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -47,6 +47,9 @@ u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
>  size_t vmcoreinfo_size;
>  size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
>  
> +/* Flag to indicate we are going to kexec a new kernel */
> +unsigned long kexec_in_progress = 0;
> +
>  /* Location of the reserved area for the crash kernel */
>  struct resource crashk_res = {
>  	.name  = "Crash kernel",
> @@ -1675,6 +1678,7 @@ int kernel_kexec(void)
>  	} else
>  #endif
>  	{
> +		kexec_in_progress = 1;
>  		kernel_restart_prepare(NULL);
>  		printk(KERN_EMERG "Starting new kernel\n");
>  		machine_shutdown();
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg Kroah-Hartman Nov. 27, 2013, 7:39 p.m. UTC | #2
On Wed, Nov 27, 2013 at 12:18:28PM -0700, Khalid Aziz wrote:
> +/* flag to track if kexec reboot is in progress */
> +extern unsigned long kexec_in_progress;

"unsigned long" for a "flag"?

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Khalid Aziz Nov. 27, 2013, 7:48 p.m. UTC | #3
On 11/27/2013 12:24 PM, Matthew Garrett wrote:
> On Wed, Nov 27, 2013 at 12:18:28PM -0700, Khalid Aziz wrote:
>
>> +/* flag to track if kexec reboot is in progress */
>> +extern unsigned long kexec_in_progress;
>
> Adding this to pci.h seems a little odd. We may want to use it somewhere
> else at some point. Add it to kexec.h instead?
>

I debated between pci.h and kexec.h but pci-driver.c does not include 
kexec.h and I didn't want to include a whole new file. Now I see another 
problem with adding that extern declaration to pci.h - if CONFIG_KEXEC 
is not set, build will fail. I should add #ifdef CONFIG_KEXEC to the 
code in pci-driver.c as well. Time for v2.

--
Khalid
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matthew Garrett Nov. 27, 2013, 7:53 p.m. UTC | #4
On Wed, Nov 27, 2013 at 12:48:42PM -0700, Khalid Aziz wrote:

> I debated between pci.h and kexec.h but pci-driver.c does not
> include kexec.h and I didn't want to include a whole new file. Now I
> see another problem with adding that extern declaration to pci.h -
> if CONFIG_KEXEC is not set, build will fail. I should add #ifdef
> CONFIG_KEXEC to the code in pci-driver.c as well. Time for v2.

You're making the behaviour of the pci code conditional on whether we're 
in kexec or not, so I think adding kexec.h is perfectly reasonable.
Khalid Aziz Nov. 27, 2013, 7:59 p.m. UTC | #5
On 11/27/2013 12:38 PM, ebiederm@xmission.com wrote:
> Khalid Aziz <khalid.aziz@oracle.com> writes:
>
>> Add a flag to tell the PCI subsystem that kernel is shutting down
>> in prepapration to kexec a kernel. Add code in PCI subsystem to use
>> this flag to clear Bus Master bit on PCI devices only in case of
>> kexec reboot. This fixes https://bugzilla.kernel.org/show_bug.cgi?id=63861
>> and avoids any other issues caused by clearing Bus Master bit on PCI
>> devices in normal shutdown path. This patch is based on discussion at
>> http://marc.info/?l=linux-pci&m=138425645204355&w=2
>
> Scratches head.
>
> Given that most devices already call pci_disable_device which clears the
> bus master bit how does this change anything meaningful?
>
> Is is the problem here that most drivers are lazy and have a noop
> shutdown method?

Yes, that is exactly the problem.

--
Khalid

>
> Eric
>
>
>> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
>> Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
>> Cc: stable@vger.kernel.org
>> ---
>>   drivers/pci/pci-driver.c | 9 ++++++---
>>   drivers/pci/pci.h        | 3 +++
>>   kernel/kexec.c           | 4 ++++
>>   3 files changed, 13 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
>> index 9042fdb..e920195 100644
>> --- a/drivers/pci/pci-driver.c
>> +++ b/drivers/pci/pci-driver.c
>> @@ -400,10 +400,13 @@ static void pci_device_shutdown(struct device *dev)
>>   	pci_msix_shutdown(pci_dev);
>>
>>   	/*
>> -	 * Turn off Bus Master bit on the device to tell it to not
>> -	 * continue to do DMA. Don't touch devices in D3cold or unknown states.
>> +	 * If this is a kexec reboot, turn off Bus Master bit on the
>> +	 * device to tell it to not continue to do DMA. Don't touch
>> +	 * devices in D3cold or unknown states.
>> +	 * If it is not a kexec reboot, firmware will hit the PCI
>> +	 * devices with big hammer and stop their DMA any way.
>>   	 */
>> -	if (pci_dev->current_state <= PCI_D3hot)
>> +	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
>>   		pci_clear_master(pci_dev);
>>   }
>>
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 9c91ecc..7d85733 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -9,6 +9,9 @@
>>   extern const unsigned char pcix_bus_speed[];
>>   extern const unsigned char pcie_link_speed[];
>>
>> +/* flag to track if kexec reboot is in progress */
>> +extern unsigned long kexec_in_progress;
>> +
>>   /* Functions internal to the PCI core code */
>>
>>   int pci_create_sysfs_dev_files(struct pci_dev *pdev);
>> diff --git a/kernel/kexec.c b/kernel/kexec.c
>> index 490afc0..fd2d63e 100644
>> --- a/kernel/kexec.c
>> +++ b/kernel/kexec.c
>> @@ -47,6 +47,9 @@ u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
>>   size_t vmcoreinfo_size;
>>   size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
>>
>> +/* Flag to indicate we are going to kexec a new kernel */
>> +unsigned long kexec_in_progress = 0;
>> +
>>   /* Location of the reserved area for the crash kernel */
>>   struct resource crashk_res = {
>>   	.name  = "Crash kernel",
>> @@ -1675,6 +1678,7 @@ int kernel_kexec(void)
>>   	} else
>>   #endif
>>   	{
>> +		kexec_in_progress = 1;
>>   		kernel_restart_prepare(NULL);
>>   		printk(KERN_EMERG "Starting new kernel\n");
>>   		machine_shutdown();

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg Kroah-Hartman Nov. 27, 2013, 9:22 p.m. UTC | #6
On Wed, Nov 27, 2013 at 12:59:40PM -0700, Khalid Aziz wrote:
> On 11/27/2013 12:38 PM, ebiederm@xmission.com wrote:
> >Khalid Aziz <khalid.aziz@oracle.com> writes:
> >
> >>Add a flag to tell the PCI subsystem that kernel is shutting down
> >>in prepapration to kexec a kernel. Add code in PCI subsystem to use
> >>this flag to clear Bus Master bit on PCI devices only in case of
> >>kexec reboot. This fixes https://bugzilla.kernel.org/show_bug.cgi?id=63861
> >>and avoids any other issues caused by clearing Bus Master bit on PCI
> >>devices in normal shutdown path. This patch is based on discussion at
> >>http://marc.info/?l=linux-pci&m=138425645204355&w=2
> >
> >Scratches head.
> >
> >Given that most devices already call pci_disable_device which clears the
> >bus master bit how does this change anything meaningful?
> >
> >Is is the problem here that most drivers are lazy and have a noop
> >shutdown method?
> 
> Yes, that is exactly the problem.

Then fix the drivers please.  It's not as if you don't have access to
the source for them all...

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matthew Garrett Nov. 27, 2013, 9:53 p.m. UTC | #7
On Wed, Nov 27, 2013 at 01:22:27PM -0800, Greg KH wrote:

> Then fix the drivers please.  It's not as if you don't have access to
> the source for them all...

Define "fix". It's clearly wrong to disable busmastering at shutdown on 
some devices, otherwise we wouldn't be having this discussion at all.
Greg Kroah-Hartman Nov. 27, 2013, 10:01 p.m. UTC | #8
On Wed, Nov 27, 2013 at 09:53:09PM +0000, Matthew Garrett wrote:
> On Wed, Nov 27, 2013 at 01:22:27PM -0800, Greg KH wrote:
> 
> > Then fix the drivers please.  It's not as if you don't have access to
> > the source for them all...
> 
> Define "fix". It's clearly wrong to disable busmastering at shutdown on 
> some devices, otherwise we wouldn't be having this discussion at all.

I thought it was only "wrong" to disable this on multi-function devices,
which is why some drivers didn't do it.  Otherwise, how would it be any
different to have the global setting?

Anyway, I really don't care either way, but this seems like something
that the drivers should be doing.  What suddenly changed that caused
this problem to occur that hasn't happened in the years prior to now
that drives this to be a stable-kernel issue?

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matthew Garrett Nov. 27, 2013, 10:07 p.m. UTC | #9
On Wed, Nov 27, 2013 at 02:01:06PM -0800, Greg KH wrote:
> On Wed, Nov 27, 2013 at 09:53:09PM +0000, Matthew Garrett wrote:
> > On Wed, Nov 27, 2013 at 01:22:27PM -0800, Greg KH wrote:
> > 
> > > Then fix the drivers please.  It's not as if you don't have access to
> > > the source for them all...
> > 
> > Define "fix". It's clearly wrong to disable busmastering at shutdown on 
> > some devices, otherwise we wouldn't be having this discussion at all.
> 
> I thought it was only "wrong" to disable this on multi-function devices,
> which is why some drivers didn't do it.  Otherwise, how would it be any
> different to have the global setting?

kexec doesn't jump into the firmware, so firmware assumptions about the 
state of the busmaster bit don't matter.

> Anyway, I really don't care either way, but this seems like something
> that the drivers should be doing.  What suddenly changed that caused
> this problem to occur that hasn't happened in the years prior to now
> that drives this to be a stable-kernel issue?

We started clearing the busmaster bit on all devices on shutdown in 
3.something in order to ensure that DMA wasn't occuring while we were 
in the process of performing a kexec. Some machines freeze on shutdown 
as a result. This patch reverts back to the original behaviour on real 
shutdown, while still avoiding the "This PCI device scribbled over my 
new kernel" kexec case.
Khalid Aziz Nov. 27, 2013, 10:18 p.m. UTC | #10
On 11/27/2013 03:07 PM, Matthew Garrett wrote:
> On Wed, Nov 27, 2013 at 02:01:06PM -0800, Greg KH wrote:
>> Anyway, I really don't care either way, but this seems like something
>> that the drivers should be doing.  What suddenly changed that caused
>> this problem to occur that hasn't happened in the years prior to now
>> that drives this to be a stable-kernel issue?
>
> We started clearing the busmaster bit on all devices on shutdown in
> 3.something in order to ensure that DMA wasn't occuring while we were
> in the process of performing a kexec. Some machines freeze on shutdown
> as a result. This patch reverts back to the original behaviour on real
> shutdown, while still avoiding the "This PCI device scribbled over my
> new kernel" kexec case.
>

Thanks for explaining this, Matthew. That was my reasoning exactly for 
why this patch should apply to stable. It fixes a real problem some 
users are experiencing. Commit log contains the URL to bugzilla entry 
for the problem.

--
Khalid
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Cox Nov. 28, 2013, 2:15 p.m. UTC | #11
> Anyway, I really don't care either way, but this seems like something
> that the drivers should be doing.  What suddenly changed that caused
> this problem to occur that hasn't happened in the years prior to now
> that drives this to be a stable-kernel issue?

When this first went in I pointed out it was an utterly stupid idea.
Since it went in lots of machines haven't rebooted properly or powered
off right.

There are two problems

1. Clearing the busmaster bit is not well defined behaviour. It even
freezes some hardware.

2. Lots of PC class hardware has firmware which believes that it can
access the hardware as it goes to reboot or poweroff and that someone
won't have shut it down.

Like it or not the firmware expected behaviour for such things is "what
Windows did". The expected PC behaviour is subtle and magic - eg the fact
D3 on some IDE disk controllers is terminal until power cycled because
Windows didn't do that or that the BIOS goes off and chats to the disks
in reboot without assuming the disk controller is completely
uninitialized.

kexec is a special cornercase and handling is as such (knowing it will
bother to re-init appropriate devices) is very different to the current
broken behaviour for PC systems.

Alan
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 9042fdb..e920195 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -400,10 +400,13 @@  static void pci_device_shutdown(struct device *dev)
 	pci_msix_shutdown(pci_dev);
 
 	/*
-	 * Turn off Bus Master bit on the device to tell it to not
-	 * continue to do DMA. Don't touch devices in D3cold or unknown states.
+	 * If this is a kexec reboot, turn off Bus Master bit on the
+	 * device to tell it to not continue to do DMA. Don't touch
+	 * devices in D3cold or unknown states.
+	 * If it is not a kexec reboot, firmware will hit the PCI
+	 * devices with big hammer and stop their DMA any way.
 	 */
-	if (pci_dev->current_state <= PCI_D3hot)
+	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
 		pci_clear_master(pci_dev);
 }
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9c91ecc..7d85733 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -9,6 +9,9 @@ 
 extern const unsigned char pcix_bus_speed[];
 extern const unsigned char pcie_link_speed[];
 
+/* flag to track if kexec reboot is in progress */
+extern unsigned long kexec_in_progress;
+
 /* Functions internal to the PCI core code */
 
 int pci_create_sysfs_dev_files(struct pci_dev *pdev);
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 490afc0..fd2d63e 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -47,6 +47,9 @@  u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
 size_t vmcoreinfo_size;
 size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
 
+/* Flag to indicate we are going to kexec a new kernel */
+unsigned long kexec_in_progress = 0;
+
 /* Location of the reserved area for the crash kernel */
 struct resource crashk_res = {
 	.name  = "Crash kernel",
@@ -1675,6 +1678,7 @@  int kernel_kexec(void)
 	} else
 #endif
 	{
+		kexec_in_progress = 1;
 		kernel_restart_prepare(NULL);
 		printk(KERN_EMERG "Starting new kernel\n");
 		machine_shutdown();