mbox series

[0/4] PCI/AER: Use-after-free fix

Message ID 20180409220444.6632-1-keith.busch@intel.com
Headers show
Series PCI/AER: Use-after-free fix | expand

Message

Keith Busch April 9, 2018, 10:04 p.m. UTC
AER error handling walks the PCI topology below a root port, saving
pointers of the pci_dev structs affected by the error along the way.

At the same time, the pcie hotplug driver could be freeing those very
same structures, causing the AER driver to reference freed memory.

This series fixes this by synchroniziing the aer driver with the pci
hotplug driver during. The final patch is the one that ultimately provides
the locking by having AER lock the same pci lock rescan/remove mutex as
the pciehp driver. The first three patches are prepping to make it safe
for the aer bottom half handler to hold that lock.

Keith Busch (4):
  PCI/AER: Remove unused parameters
  PCI/AER: Replace struct pcie_device with pci_dev
  PCI/AER: Reference count aer structures
  PCI/AER: Lock pci topology when scanning errors

 drivers/pci/pcie/aer/aerdrv.c      | 28 +++++++++++++++++++++-------
 drivers/pci/pcie/aer/aerdrv.h      |  9 +++------
 drivers/pci/pcie/aer/aerdrv_core.c | 38 +++++++++++++++++---------------------
 3 files changed, 41 insertions(+), 34 deletions(-)

Comments

Dongdong Liu April 10, 2018, 1:15 p.m. UTC | #1
在 2018/4/10 6:04, Keith Busch 写道:
> AER error handling walks the PCI topology below a root port, saving
> pointers of the pci_dev structs affected by the error along the way.
>
> At the same time, the pcie hotplug driver could be freeing those very
> same structures, causing the AER driver to reference freed memory.

I also have met such issue.  The details see the below link.
https://www.spinics.net/lists/linux-pci/msg70614.html
It seems do reset_link() will trigger hotplug driver to remove/rescan device
when met AER fatal error.
do_recovery()
    --- reset_link()
	--- pci_reset_secondary_bus() //then trigger link down/up.
So if the root port support hotplug, it will remove/rescan device.
I still have a question, since AER driver have already done recovery,
it seems no need to trigger hotplug to remove and rescan the device.

Thanks,
Dongdong

>
> This series fixes this by synchroniziing the aer driver with the pci
> hotplug driver during. The final patch is the one that ultimately provides
> the locking by having AER lock the same pci lock rescan/remove mutex as
> the pciehp driver. The first three patches are prepping to make it safe
> for the aer bottom half handler to hold that lock.
>
> Keith Busch (4):
>   PCI/AER: Remove unused parameters
>   PCI/AER: Replace struct pcie_device with pci_dev
>   PCI/AER: Reference count aer structures
>   PCI/AER: Lock pci topology when scanning errors
>
>  drivers/pci/pcie/aer/aerdrv.c      | 28 +++++++++++++++++++++-------
>  drivers/pci/pcie/aer/aerdrv.h      |  9 +++------
>  drivers/pci/pcie/aer/aerdrv_core.c | 38 +++++++++++++++++---------------------
>  3 files changed, 41 insertions(+), 34 deletions(-)
>
Scott Bauer April 12, 2018, 4:47 p.m. UTC | #2
On Thu, Apr 12, 2018 at 05:06:05PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> From: Keith Busch [mailto:keith.busch@intel.com] 
> 
> > AER error handling walks the PCI topology below a root port, saving pointers of the pci_dev structs affected by the error along the way.
> 
> Hi Keith,
> 
> I've been trying to do an ABA test to confirm that your change eliminates the use-after-free issue we've seen. The race seems to be quite elusive, so I can't reliably reproduce it. Your changes have not been forgotten; I have them staged for further testing.
> 
> Alex


If you need help triggering the race you can add a sleep/microsleep here:

aer_isr_one_error() between the find_source_device and process err device:

sbauer@sbauer-Z170X-UD5:~/nvme_code/upstream_jens/linux-block$ git diff drivers/pci/pcie/aer/aerdrv_core.c
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index a4bfea52e7d4..5ca0c07b1d05 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -22,6 +22,7 @@
 #include <linux/delay.h>
 #include <linux/slab.h>
 #include <linux/kfifo.h>
+#include <linux/delay.h>
 #include "aerdrv.h"
 
 #define        PCI_EXP_AER_FLAGS       (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
@@ -740,8 +741,10 @@ static void aer_isr_one_error(struct pcie_device *p_device,
 
                aer_print_port_info(p_device->port, e_info);
 
-               if (find_source_device(p_device->port, e_info))
+               if (find_source_device(p_device->port, e_info)) {
+                       msleep(350);
                        aer_process_err_devices(p_device, e_info);
+               }
        }
 
        if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
@@ -759,8 +762,10 @@ static void aer_isr_one_error(struct pcie_device *p_device,
 
                aer_print_port_info(p_device->port, e_info);
 
-               if (find_source_device(p_device->port, e_info))
+               if (find_source_device(p_device->port, e_info)) {
+                       msleep(350);
                        aer_process_err_devices(p_device, e_info);
+               }
        }
 }
Alex_Gagniuc@Dellteam.com April 12, 2018, 5:06 p.m. UTC | #3
From: Keith Busch [mailto:keith.busch@intel.com] 

> AER error handling walks the PCI topology below a root port, saving pointers of the pci_dev structs affected by the error along the way.

Hi Keith,

I've been trying to do an ABA test to confirm that your change eliminates the use-after-free issue we've seen. The race seems to be quite elusive, so I can't reliably reproduce it. Your changes have not been forgotten; I have them staged for further testing.

Alex
Keith Busch April 12, 2018, 5:10 p.m. UTC | #4
On Thu, Apr 12, 2018 at 05:06:05PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> From: Keith Busch [mailto:keith.busch@intel.com] 
> 
> > AER error handling walks the PCI topology below a root port, saving
> > pointers of the pci_dev structs affected by the error along the way.
> 
> Hi Keith,
> 
> I've been trying to do an ABA test to confirm that your change
> eliminates the use-after-free issue we've seen. The race seems to be
> quite elusive, so I can't reliably reproduce it. Your changes have not
> been forgotten; I have them staged for further testing.
> 
> Alex

No worries. This is essentially just a "safe" version of the patch that
you had tested, so if that one was successful, this one should be too.
Alex_Gagniuc@Dellteam.com April 13, 2018, 2:49 p.m. UTC | #5
I got the cold chills when I realized you called for a delay of 350ms. It's because 350ms is around the delay I've observed to be caused by FFS.
First run KASANed with the extra delay, so hopefully, I'll have more cement test results by EOB today.

Alex

-----Original Message-----
From: Scott Bauer [mailto:scott.bauer@intel.com] 
Sent: Thursday, April 12, 2018 11:47 AM
To: Gagniuc, Alexandru - Dell Team
Cc: keith.busch@intel.com; linux-pci@vger.kernel.org; bhelgaas@google.com
Subject: Re: [PATCH 0/4] PCI/AER: Use-after-free fix

On Thu, Apr 12, 2018 at 05:06:05PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> From: Keith Busch [mailto:keith.busch@intel.com]
> 
> > AER error handling walks the PCI topology below a root port, saving pointers of the pci_dev structs affected by the error along the way.
> 
> Hi Keith,
> 
> I've been trying to do an ABA test to confirm that your change eliminates the use-after-free issue we've seen. The race seems to be quite elusive, so I can't reliably reproduce it. Your changes have not been forgotten; I have them staged for further testing.
> 
> Alex


If you need help triggering the race you can add a sleep/microsleep here:

aer_isr_one_error() between the find_source_device and process err device:

sbauer@sbauer-Z170X-UD5:~/nvme_code/upstream_jens/linux-block$ git diff drivers/pci/pcie/aer/aerdrv_core.c
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index a4bfea52e7d4..5ca0c07b1d05 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -22,6 +22,7 @@
 #include <linux/delay.h>
 #include <linux/slab.h>
 #include <linux/kfifo.h>
+#include <linux/delay.h>
 #include "aerdrv.h"
 
 #define        PCI_EXP_AER_FLAGS       (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
@@ -740,8 +741,10 @@ static void aer_isr_one_error(struct pcie_device *p_device,
 
                aer_print_port_info(p_device->port, e_info);
 
-               if (find_source_device(p_device->port, e_info))
+               if (find_source_device(p_device->port, e_info)) {
+                       msleep(350);
                        aer_process_err_devices(p_device, e_info);
+               }
        }
 
        if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) { @@ -759,8 +762,10 @@ static void aer_isr_one_error(struct pcie_device *p_device,
 
                aer_print_port_info(p_device->port, e_info);
 
-               if (find_source_device(p_device->port, e_info))
+               if (find_source_device(p_device->port, e_info)) {
+                       msleep(350);
                        aer_process_err_devices(p_device, e_info);
+               }
        }
 }
Alex_Gagniuc@Dellteam.com April 16, 2018, 7:49 p.m. UTC | #6
From: Scott Bauer [mailto:scott.bauer@intel.com] 
On Thu, Apr 12, 2018 at 05:06:05PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> From: Keith Busch [mailto:keith.busch@intel.com]
> 
> > AER error handling walks the PCI topology below a root port, saving pointers of the pci_dev structs affected by the error along the way.
> 
> Hi Keith,
> 
> I've been trying to do an ABA test to confirm that your change eliminates the use-after-free issue we've seen. The race seems to be quite elusive, so I can't reliably reproduce it. Your changes have not been forgotten; I have them staged for further testing.
> 
> Alex


> If you need help triggering the race you can add a sleep/microsleep here:

Tested-by: Alexandru Gagniuc
Bjorn Helgaas June 5, 2018, 10:11 p.m. UTC | #7
On Mon, Apr 09, 2018 at 04:04:40PM -0600, Keith Busch wrote:
> AER error handling walks the PCI topology below a root port, saving
> pointers of the pci_dev structs affected by the error along the way.
> 
> At the same time, the pcie hotplug driver could be freeing those very
> same structures, causing the AER driver to reference freed memory.
> 
> This series fixes this by synchroniziing the aer driver with the pci
> hotplug driver during. The final patch is the one that ultimately provides
> the locking by having AER lock the same pci lock rescan/remove mutex as
> the pciehp driver. The first three patches are prepping to make it safe
> for the aer bottom half handler to hold that lock.
> 
> Keith Busch (4):
>   PCI/AER: Remove unused parameters
>   PCI/AER: Replace struct pcie_device with pci_dev
>   PCI/AER: Reference count aer structures
>   PCI/AER: Lock pci topology when scanning errors
> 
>  drivers/pci/pcie/aer/aerdrv.c      | 28 +++++++++++++++++++++-------
>  drivers/pci/pcie/aer/aerdrv.h      |  9 +++------
>  drivers/pci/pcie/aer/aerdrv_core.c | 38 +++++++++++++++++---------------------
>  3 files changed, 41 insertions(+), 34 deletions(-)

I applied the first two to pci/aer for v4.18, thanks!

I think the last two might need some rework to adapt after Oza's patches.