Message ID | 1505831139-6053-1-git-send-email-joserz@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | Rejected |
Series | powerpc/eeh: Disable EEH stack dump by default |
Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com> writes:

> Today, each EEH causes a stack dump to be printed in the logs. In
> production environment it's not quite necessary. Thus, this patch

I'm unconvinced. A production environment is exactly where you don't
want to be getting an EEH, and so if you *do* then every bit of
information is helpful.

> For example, instead of the following:
>
> [ 131.778661] EEH: Frozen PHB#2-PE#fd detected
> [ 131.778672] EEH: PE location: N/A, PHB location: N/A
> [ 131.778677] CPU: 21 PID: 10098 Comm: lspci Not tainted ...
> [ 131.778680] Call Trace:
> [ 131.778686] [c0000003a140bab0] [c000000000beb58c] dump_stack+...
> <snip ~10 lines>
> [ 131.778770] EEH: Detected PCI bus error on PHB#2-PE#fd
> [ 131.778775] EEH: This PCI device has failed 1 times in the last hour
> ...
>
> we will have this by default:
>
> [12777.175880] EEH: Frozen PHB#2-PE#fd detected
> [12777.175893] EEH: PE location: N/A, PHB location: N/A
> [12777.175922] EEH: Detected PCI bus error on PHB#2-PE#fd
> [12777.175931] EEH: This PCI device has failed 2 times in the last hour

*What* PCI device?

How am I supposed to know what device/driver just failed? If I had the
stack trace I could probably at least work it out based on the driver
involved.

cheers
On 20/09/17 00:25, Jose Ricardo Ziviani wrote:
> Today, each EEH causes a stack dump to be printed in the logs. In
> production environment it's not quite necessary. Thus, this patch
> adds a new command line argument in order to enable the stack
> dump for debugging purposes.
>
> For example, instead of the following:
>
> [ 131.778661] EEH: Frozen PHB#2-PE#fd detected
> [ 131.778672] EEH: PE location: N/A, PHB location: N/A
> [ 131.778677] CPU: 21 PID: 10098 Comm: lspci Not tainted ...
> [ 131.778680] Call Trace:
> [ 131.778686] [c0000003a140bab0] [c000000000beb58c] dump_stack+...
> <snip ~10 lines>
> [ 131.778770] EEH: Detected PCI bus error on PHB#2-PE#fd
> [ 131.778775] EEH: This PCI device has failed 1 times in the last hour
> ...
>
> we will have this by default:
>
> [12777.175880] EEH: Frozen PHB#2-PE#fd detected
> [12777.175893] EEH: PE location: N/A, PHB location: N/A
> [12777.175922] EEH: Detected PCI bus error on PHB#2-PE#fd
> [12777.175931] EEH: This PCI device has failed 2 times in the last hour
> ...
>
> Signed-off-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>

As someone who's had to debug far too many EEH-related bugs, I'd really
prefer if this remained as is.

Andrew

> ---
>  arch/powerpc/kernel/eeh.c | 26 +++++++++++++++++++++++---
>  1 file changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 9e81678..4336c3b1 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -157,6 +157,19 @@ static int __init eeh_setup(char *str)
>  __setup("eeh=", eeh_setup);
>
>  /*
> + * It's not necessary to dump the stack trace when an EEH occours
> + * in the production environment. For debugging, the command line
> + * option "enable_eeh_stacktrace" brings the stack dump back
> + */
> +static bool eeh_show_stacktrace;
> +static int __init enable_eeh_stacktrace(char *p)
> +{
> +	eeh_show_stacktrace = true;
> +	return 0;
> +}
> +early_param("enable_eeh_stacktrace", enable_eeh_stacktrace);
> +
> +/*
>   * This routine captures assorted PCI configuration space data
>   * for the indicated PCI device, and puts them into a buffer
>   * for RTAS error logging.
> @@ -407,7 +420,10 @@ static int eeh_phb_check_failure(struct eeh_pe *pe)
>
>  	pr_err("EEH: PHB#%x failure detected, location: %s\n",
>  		phb_pe->phb->global_number, eeh_pe_loc_get(phb_pe));
> -	dump_stack();
> +
> +	if (eeh_show_stacktrace)
> +		dump_stack();
> +
>  	eeh_send_failure_event(phb_pe);
>
>  	return 1;
> @@ -504,7 +520,9 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
>  				eeh_driver_name(dev), eeh_pci_name(dev));
>  			printk(KERN_ERR "EEH: Might be infinite loop in %s driver\n",
>  				eeh_driver_name(dev));
> -			dump_stack();
> +
> +			if (eeh_show_stacktrace)
> +				dump_stack();
>  		}
>  		goto dn_unlock;
>  	}
> @@ -572,7 +590,9 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
>  		pe->phb->global_number, pe->addr);
>  	pr_err("EEH: PE location: %s, PHB location: %s\n",
>  		eeh_pe_loc_get(pe), eeh_pe_loc_get(phb_pe));
> -	dump_stack();
> +
> +	if (eeh_show_stacktrace)
> +		dump_stack();
>
>  	eeh_send_failure_event(pe);
>
>
On Wed, Sep 20, 2017 at 02:47:08PM +1000, Michael Ellerman wrote:
> Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com> writes:
>
> > Today, each EEH causes a stack dump to be printed in the logs. In
> > production environment it's not quite necessary. Thus, this patch
>
> I'm unconvinced. A production environment is exactly where you don't
> want to be getting an EEH, and so if you *do* then every bit of
> information is helpful.
>
> > For example, instead of the following:
> >
> > [ 131.778661] EEH: Frozen PHB#2-PE#fd detected
> > [ 131.778672] EEH: PE location: N/A, PHB location: N/A
> > [ 131.778677] CPU: 21 PID: 10098 Comm: lspci Not tainted ...
> > [ 131.778680] Call Trace:
> > [ 131.778686] [c0000003a140bab0] [c000000000beb58c] dump_stack+...
> > <snip ~10 lines>
> > [ 131.778770] EEH: Detected PCI bus error on PHB#2-PE#fd
> > [ 131.778775] EEH: This PCI device has failed 1 times in the last hour
> > ...
> >
> > we will have this by default:
> >
> > [12777.175880] EEH: Frozen PHB#2-PE#fd detected
> > [12777.175893] EEH: PE location: N/A, PHB location: N/A
> > [12777.175922] EEH: Detected PCI bus error on PHB#2-PE#fd
> > [12777.175931] EEH: This PCI device has failed 2 times in the last hour
>
> *What* PCI device?
>
> How am I supposed to know what device/driver just failed? If I had the
> stack trace I could probably at least work it out based on the driver
> involved.
>
> cheers
>

Thank you guys! More people told me it's important to keep it as is.
Please, disregard this patch.
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 9e81678..4336c3b1 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -157,6 +157,19 @@ static int __init eeh_setup(char *str)
 __setup("eeh=", eeh_setup);
 
 /*
+ * It's not necessary to dump the stack trace when an EEH occours
+ * in the production environment. For debugging, the command line
+ * option "enable_eeh_stacktrace" brings the stack dump back
+ */
+static bool eeh_show_stacktrace;
+static int __init enable_eeh_stacktrace(char *p)
+{
+	eeh_show_stacktrace = true;
+	return 0;
+}
+early_param("enable_eeh_stacktrace", enable_eeh_stacktrace);
+
+/*
  * This routine captures assorted PCI configuration space data
  * for the indicated PCI device, and puts them into a buffer
  * for RTAS error logging.
@@ -407,7 +420,10 @@ static int eeh_phb_check_failure(struct eeh_pe *pe)
 
 	pr_err("EEH: PHB#%x failure detected, location: %s\n",
 		phb_pe->phb->global_number, eeh_pe_loc_get(phb_pe));
-	dump_stack();
+
+	if (eeh_show_stacktrace)
+		dump_stack();
+
 	eeh_send_failure_event(phb_pe);
 
 	return 1;
@@ -504,7 +520,9 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 				eeh_driver_name(dev), eeh_pci_name(dev));
 			printk(KERN_ERR "EEH: Might be infinite loop in %s driver\n",
 				eeh_driver_name(dev));
-			dump_stack();
+
+			if (eeh_show_stacktrace)
+				dump_stack();
 		}
 		goto dn_unlock;
 	}
@@ -572,7 +590,9 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 		pe->phb->global_number, pe->addr);
 	pr_err("EEH: PE location: %s, PHB location: %s\n",
 		eeh_pe_loc_get(pe), eeh_pe_loc_get(phb_pe));
-	dump_stack();
+
+	if (eeh_show_stacktrace)
+		dump_stack();
 
 	eeh_send_failure_event(pe);
 
Today, each EEH causes a stack dump to be printed in the logs. In
production environment it's not quite necessary. Thus, this patch
adds a new command line argument in order to enable the stack
dump for debugging purposes.

For example, instead of the following:

[ 131.778661] EEH: Frozen PHB#2-PE#fd detected
[ 131.778672] EEH: PE location: N/A, PHB location: N/A
[ 131.778677] CPU: 21 PID: 10098 Comm: lspci Not tainted ...
[ 131.778680] Call Trace:
[ 131.778686] [c0000003a140bab0] [c000000000beb58c] dump_stack+...
<snip ~10 lines>
[ 131.778770] EEH: Detected PCI bus error on PHB#2-PE#fd
[ 131.778775] EEH: This PCI device has failed 1 times in the last hour
...

we will have this by default:

[12777.175880] EEH: Frozen PHB#2-PE#fd detected
[12777.175893] EEH: PE location: N/A, PHB location: N/A
[12777.175922] EEH: Detected PCI bus error on PHB#2-PE#fd
[12777.175931] EEH: This PCI device has failed 2 times in the last hour
...

Signed-off-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/eeh.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)
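
For readers who want to experiment with the behaviour this patch proposed without building a kernel, the flag-gated dump can be approximated in userspace C. This is only a rough sketch under stated assumptions: `parse_cmdline()` and `report_failure()` are hypothetical stand-ins for `early_param()` handling and the EEH failure path, the "Call Trace" line is a placeholder rather than a real stack dump, and only the bare flag form (no `=value`) is recognised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Off by default, mirroring eeh_show_stacktrace in the patch. */
static bool show_stacktrace;

/*
 * Hypothetical stand-in for early_param() handling: scan a
 * space-separated command line for the bare flag and, if it is
 * present as a whole word, turn the stack dump back on.
 */
static void parse_cmdline(const char *cmdline)
{
	const char *flag = "enable_eeh_stacktrace";
	size_t len = strlen(flag);
	const char *p = cmdline;

	while ((p = strstr(p, flag)) != NULL) {
		bool starts = (p == cmdline) || (p[-1] == ' ');
		bool ends = (p[len] == '\0') || (p[len] == ' ');

		if (starts && ends) {
			show_stacktrace = true;
			return;
		}
		p += len;
	}
}

/*
 * Hypothetical stand-in for the failure path: always log the
 * "Frozen" message, and add the (placeholder) trace only when the
 * flag was given. Returns the number of lines written.
 */
static int report_failure(FILE *out, unsigned int phb, unsigned int pe)
{
	int lines = 0;

	assert(out != NULL);

	fprintf(out, "EEH: Frozen PHB#%x-PE#%x detected\n", phb, pe);
	lines++;

	if (show_stacktrace) {
		fprintf(out, "Call Trace: (placeholder for dump_stack output)\n");
		lines++;
	}

	return lines;
}
```

Calling `report_failure(stdout, 2, 0xfd)` before and after `parse_cmdline("quiet enable_eeh_stacktrace")` shows the trade-off the reviewers objected to: with the default setting, the only context left in the log is the PHB and PE numbers.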