From patchwork Tue Feb 28 07:02:20 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Vaibhav Jain
X-Patchwork-Id: 733347
From: Vaibhav Jain
To: linuxppc-dev@lists.ozlabs.org, Russell Currey, frederic.barrat@fr.ibm.com
Subject: [RFC 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp to update freeze_count
Date: Tue, 28 Feb 2017 12:32:20 +0530
X-Mailer: git-send-email 2.9.3
In-Reply-To: <20170228070222.21126-1-vaibhav@linux.vnet.ibm.com>
References: <20170228070222.21126-1-vaibhav@linux.vnet.ibm.com>
Message-Id: <20170228070222.21126-2-vaibhav@linux.vnet.ibm.com>
List-Id: Linux on PowerPC Developers Mail List
Cc: Philippe Bergheaud, Vaibhav Jain, Gavin Shan, Ian Munsie, Andrew Donnellan, Gregory Kurz, Christophe Lombard

Introduce a new function, eeh_pe_update_freeze_counter(), that replaces
the existing eeh_pe_update_time_stamp(). Along with tstamp, the new
function also manages freeze_count, so that it tracks the number of
times the PE froze in the last hour. If freeze_count exceeds
eeh_max_freezes, it returns an error (-ENOTRECOVERABLE) to indicate that
the PE should be permanently disabled.

This patch should not introduce any behavioral change.

Signed-off-by: Vaibhav Jain
---
 arch/powerpc/include/asm/eeh.h   |  2 +-
 arch/powerpc/kernel/eeh_driver.c | 20 +++-------------
 arch/powerpc/kernel/eeh_pe.c     | 50 ++++++++++++++++++++++++++--------------
 3 files changed, 37 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 8e37b71..68806be 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -265,7 +265,7 @@ struct eeh_pe *eeh_phb_pe_get(struct pci_controller *phb);
 struct eeh_pe *eeh_pe_get(struct eeh_dev *edev);
 int eeh_add_to_parent_pe(struct eeh_dev *edev);
 int eeh_rmv_from_parent_pe(struct eeh_dev *edev);
-void eeh_pe_update_time_stamp(struct eeh_pe *pe);
+int eeh_pe_update_freeze_counter(struct eeh_pe *pe);
 void *eeh_pe_traverse(struct eeh_pe *root,
 		eeh_traverse_func fn, void *flag);
 void *eeh_pe_dev_traverse(struct eeh_pe *root,
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index b948871..326b4e4 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -739,10 +739,9 @@ static void eeh_handle_normal_event(struct eeh_pe *pe)
 		return;
 	}
 
-	eeh_pe_update_time_stamp(pe);
-	pe->freeze_count++;
-	if (pe->freeze_count > eeh_max_freezes)
-		goto excess_failures;
+	/* Update freeze counters and see if we have tripped max-freeze limit */
+	if (eeh_pe_update_freeze_counter(pe) < 0)
+		goto perm_error;
 	pr_warn("EEH: This PCI device has failed %d times in the last hour\n",
 		pe->freeze_count);
 
@@ -872,19 +871,6 @@ static void eeh_handle_normal_event(struct eeh_pe *pe)
 
 	return;
 
-excess_failures:
-	/*
-	 * About 90% of all real-life EEH failures in the field
-	 * are due to poorly seated PCI cards. Only 10% or so are
-	 * due to actual, failed cards.
-	 */
-	pr_err("EEH: PHB#%x-PE#%x has failed %d times in the\n"
-	       "last hour and has been permanently disabled.\n"
-	       "Please try reseating or replacing it.\n",
-	       pe->phb->global_number, pe->addr,
-	       pe->freeze_count);
-	goto perm_error;
-
 hard_fail:
 	pr_err("EEH: Unable to recover from failure from PHB#%x-PE#%x.\n"
 	       "Please try reseating or replacing it\n",
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index cc4b206..cf70a8b 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -504,30 +504,46 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
 }
 
 /**
- * eeh_pe_update_time_stamp - Update PE's frozen time stamp
+ * eeh_pe_update_freeze_counter - Update PE's frozen time stamp
+ * and freeze counter
  * @pe: EEH PE
  *
- * We have time stamp for each PE to trace its time of getting
- * frozen in last hour. The function should be called to update
- * the time stamp on first error of the specific PE. On the other
- * handle, we needn't account for errors happened in last hour.
+ * We keep a freeze counter and a time stamp for each PE to track
+ * the number of times the PE was frozen in the last hour. This
+ * function updates the PE's freeze counter and returns an error if
+ * it exceeds eeh_max_freezes. The function should be called once
+ * every time a specific PE freezes.
  */
-void eeh_pe_update_time_stamp(struct eeh_pe *pe)
+int eeh_pe_update_freeze_counter(struct eeh_pe *pe)
 {
 	struct timeval tstamp;
 
-	if (!pe) return;
+	if (!pe)
+		return -EINVAL;
 
-	if (pe->freeze_count <= 0) {
-		pe->freeze_count = 0;
-		do_gettimeofday(&pe->tstamp);
-	} else {
-		do_gettimeofday(&tstamp);
-		if (tstamp.tv_sec - pe->tstamp.tv_sec > 3600) {
-			pe->tstamp = tstamp;
-			pe->freeze_count = 0;
-		}
-	}
+	do_gettimeofday(&tstamp);
+	if (pe->freeze_count <= 0 || tstamp.tv_sec - pe->tstamp.tv_sec > 3600) {
+		pe->tstamp = tstamp;
+		pe->freeze_count = 1;
+
+	} else if (pe->freeze_count >= eeh_max_freezes) {
+		pe->freeze_count++;
+		/*
+		 * About 90% of all real-life EEH failures in the field
+		 * are due to poorly seated PCI cards. Only 10% or so are
+		 * due to actual, failed cards.
+		 */
+		pr_err("EEH: PHB#%x-PE#%x has failed %d times in the\n"
+		       "last hour and has been permanently disabled.\n"
+		       "Please try reseating or replacing it.\n",
+		       pe->phb->global_number, pe->addr,
+		       pe->freeze_count);
+		return -ENOTRECOVERABLE;
+
+	} else
+		pe->freeze_count++;
+
+	return 0;
 }
 
 /**
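
For review purposes, the counting rule in eeh_pe_update_freeze_counter()
can be exercised outside the kernel. The following is only a rough
userspace sketch, not part of the patch: struct pe_model,
model_update_freeze_counter() and model_selfcheck() are hypothetical names,
the current time and the freeze limit are passed in as plain arguments
instead of coming from do_gettimeofday() and eeh_max_freezes, and the
pr_err() message is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Fallback for non-Linux hosts; 131 is the Linux value (assumption). */
#ifndef ENOTRECOVERABLE
#define ENOTRECOVERABLE 131
#endif

#define FREEZE_WINDOW_SECS 3600	/* one hour, as in the patch */

/* Hypothetical stand-in for the two struct eeh_pe fields involved. */
struct pe_model {
	long tstamp_sec;	/* start of the current one-hour window */
	int freeze_count;	/* freezes seen inside that window */
};

/*
 * Mirrors the branch structure of the patched function:
 *  - first freeze, or window expired: restart the window with count 1
 *  - limit already reached: count the freeze and report an error
 *  - otherwise: just count the freeze
 */
int model_update_freeze_counter(struct pe_model *pe, long now_sec,
				int max_freezes)
{
	if (!pe)
		return -EINVAL;

	if (pe->freeze_count <= 0 ||
	    now_sec - pe->tstamp_sec > FREEZE_WINDOW_SECS) {
		pe->tstamp_sec = now_sec;
		pe->freeze_count = 1;
	} else if (pe->freeze_count >= max_freezes) {
		pe->freeze_count++;
		return -ENOTRECOVERABLE;
	} else {
		pe->freeze_count++;
	}

	return 0;
}

/* Walks one PE through the limit and then past the one-hour window. */
void model_selfcheck(void)
{
	struct pe_model pe = { 0, 0 };
	int i;

	/* Five freezes within the window are accepted with a limit of 5. */
	for (i = 1; i <= 5; i++)
		assert(model_update_freeze_counter(&pe, 100 + i, 5) == 0);
	assert(pe.freeze_count == 5);

	/* The sixth freeze in the same window trips the limit. */
	assert(model_update_freeze_counter(&pe, 200, 5) == -ENOTRECOVERABLE);
	assert(pe.freeze_count == 6);

	/* More than 3600s after the window opened (t=101): count restarts. */
	assert(model_update_freeze_counter(&pe, 101 + 3601, 5) == 0);
	assert(pe.freeze_count == 1);
}
```

With a limit of 5, the sketch accepts five freezes inside one window and
fails the sixth. That matches the old sequence in
eeh_handle_normal_event(), which incremented freeze_count and then tested
freeze_count > eeh_max_freezes.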