Message ID | 20230919212959.1939749-1-jesse.brandeburg@intel.com |
---|---|
State | Changes Requested |
Headers | show |
Series | [iwl-net,v1] ice: reset first in crash dump kernels | expand |
Dear Jesse, Thank you for your patch. Am 19.09.23 um 23:29 schrieb Jesse Brandeburg: > When booting into the crash dump kernels there are cases where upon > enabling the device, the system under test will panic or machine check. > > One such test is to > - load ice driver > $ modprobe ice > - enable SR-IOV (2 VFs) > $ echo 2 > /sys/class/net/eth0/device/sriov_num_vfs > - crash > echo c > /proc/sysrq-trigger Above you prepended a $. > - load ice driver (or happens automatically) > modprobe ice > - crash during pcim_enable_device() > > Avoid this problem by issuing a FLR to the device via PCIe config space > on the crash kernel, to clear out any outstanding transactions and stop > all queues and interrupts. Restore config space afterword because the afterw*a*rd > driver won't load successfully otherwise. Excuse my ignorance, could you please add, what the crashdump kernel does differently from the “normal” kernel, so this special handling is needed? > Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> > Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> > --- > drivers/net/ethernet/intel/ice/ice_main.c | 15 +++++++++++++++ > 1 file changed, 15 insertions(+) > > diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c > index c8286adae946..6550c46e4e36 100644 > --- a/drivers/net/ethernet/intel/ice/ice_main.c > +++ b/drivers/net/ethernet/intel/ice/ice_main.c > @@ -6,6 +6,7 @@ > #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > > #include <generated/utsrelease.h> > +#include <linux/crash_dump.h> > #include "ice.h" > #include "ice_base.h" > #include "ice_lib.h" > @@ -5014,6 +5015,20 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent) > return -EINVAL; > } > > + /* when under a kdump kernel initiate a reset before enabling the > + * device in order to clear out any pending DMA transactions. These > + * transactions can cause some systems to machine check when doing > + * the pcim_enable_device() below. > + */ > + if (is_kdump_kernel()) { > + pci_save_state(pdev); > + pci_clear_master(pdev); > + err = pcie_flr(pdev); > + if (err) > + return err; > + pci_restore_state(pdev); > + } > + Should this be added to the common PCI code? Maybe loop the PCI subsystem folks in? > /* this driver uses devres, see > * Documentation/driver-api/driver-model/devres.rst > */ Kind regards, Paul
On 9/19/2023 10:18 PM, Paul Menzel wrote: > Dear Jesse, > > > Thank you for your patch. > > Am 19.09.23 um 23:29 schrieb Jesse Brandeburg: >> When booting into the crash dump kernels there are cases where upon >> enabling the device, the system under test will panic or machine check. >> >> One such test is to >> - load ice driver >> $ modprobe ice >> - enable SR-IOV (2 VFs) >> $ echo 2 > /sys/class/net/eth0/device/sriov_num_vfs >> - crash >> echo c > /proc/sysrq-trigger > > Above you prepended a $. Fixed in v2. > >> - load ice driver (or happens automatically) >> modprobe ice >> - crash during pcim_enable_device() >> >> Avoid this problem by issuing a FLR to the device via PCIe config space >> on the crash kernel, to clear out any outstanding transactions and stop >> all queues and interrupts. Restore config space afterword because the > > afterw*a*rd Fixed in v2. > >> driver won't load successfully otherwise. > > Excuse my ignorance, could you please add, what the crashdump kernel > does differently from the “normal” kernel, so this special handling is > needed? I added more description in the v2 commit message, I hope that helps. In summary: the crashdump kernel is starting up on "dirty" state of hardware, due to the surprise crash of the previously running kernel that had running devices when it "panicked" > >> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> >> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> >> --- >> drivers/net/ethernet/intel/ice/ice_main.c | 15 +++++++++++++++ >> 1 file changed, 15 insertions(+) >> >> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c >> b/drivers/net/ethernet/intel/ice/ice_main.c >> index c8286adae946..6550c46e4e36 100644 >> --- a/drivers/net/ethernet/intel/ice/ice_main.c >> +++ b/drivers/net/ethernet/intel/ice/ice_main.c >> @@ -6,6 +6,7 @@ >> #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt >> #include <generated/utsrelease.h> >> +#include <linux/crash_dump.h> >> #include "ice.h" >> #include "ice_base.h" >> #include "ice_lib.h" >> @@ -5014,6 +5015,20 @@ ice_probe(struct pci_dev *pdev, const struct >> pci_device_id __always_unused *ent) >> return -EINVAL; >> } >> + /* when under a kdump kernel initiate a reset before enabling the >> + * device in order to clear out any pending DMA transactions. These >> + * transactions can cause some systems to machine check when doing >> + * the pcim_enable_device() below. >> + */ >> + if (is_kdump_kernel()) { >> + pci_save_state(pdev); >> + pci_clear_master(pdev); >> + err = pcie_flr(pdev); >> + if (err) >> + return err; >> + pci_restore_state(pdev); >> + } >> + > > Should this be added to the common PCI code? Maybe loop the PCI > subsystem folks in? Ok, I'll cc: linux-pci when I send v2. > >> /* this driver uses devres, see >> * Documentation/driver-api/driver-model/devres.rst >> */ > > > Kind regards, > > Paul
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index c8286adae946..6550c46e4e36 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -6,6 +6,7 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <generated/utsrelease.h> +#include <linux/crash_dump.h> #include "ice.h" #include "ice_base.h" #include "ice_lib.h" @@ -5014,6 +5015,20 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent) return -EINVAL; } + /* when under a kdump kernel initiate a reset before enabling the + * device in order to clear out any pending DMA transactions. These + * transactions can cause some systems to machine check when doing + * the pcim_enable_device() below. + */ + if (is_kdump_kernel()) { + pci_save_state(pdev); + pci_clear_master(pdev); + err = pcie_flr(pdev); + if (err) + return err; + pci_restore_state(pdev); + } + /* this driver uses devres, see * Documentation/driver-api/driver-model/devres.rst */