From patchwork Tue Jul 17 17:50:07 2012 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Brad Figg X-Patchwork-Id: 171521 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from chlorine.canonical.com (chlorine.canonical.com [91.189.94.204]) by ozlabs.org (Postfix) with ESMTP id 9D03E2C009B for ; Wed, 18 Jul 2012 03:50:16 +1000 (EST) Received: from localhost ([127.0.0.1] helo=chlorine.canonical.com) by chlorine.canonical.com with esmtp (Exim 4.71) (envelope-from ) id 1SrBuS-0008BO-RB; Tue, 17 Jul 2012 17:50:08 +0000 Received: from youngberry.canonical.com ([91.189.89.112]) by chlorine.canonical.com with esmtp (Exim 4.71) (envelope-from ) id 1SrBuQ-0008B2-AW for kernel-team@lists.ubuntu.com; Tue, 17 Jul 2012 17:50:06 +0000 Received: from static-50-53-107-235.bvtn.or.frontiernet.net ([50.53.107.235] helo=localhost) by youngberry.canonical.com with esmtpsa (TLS1.0:DHE_RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1SrBuP-0006Nf-Po for kernel-team@lists.ubuntu.com; Tue, 17 Jul 2012 17:50:06 +0000 From: Brad Figg To: kernel-team@lists.ubuntu.com Subject: [PATCH] x86, mce, therm_throt: Don't report power limit and package level thermal throttle events in mcelog Date: Tue, 17 Jul 2012 10:50:07 -0700 Message-Id: <1342547407-17409-2-git-send-email-brad.figg@canonical.com> X-Mailer: git-send-email 1.7.9.5 In-Reply-To: <1342547407-17409-1-git-send-email-brad.figg@canonical.com> References: <1342547407-17409-1-git-send-email-brad.figg@canonical.com> X-BeenThere: kernel-team@lists.ubuntu.com X-Mailman-Version: 2.1.13 Precedence: list List-Id: Kernel team discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Sender: kernel-team-bounces@lists.ubuntu.com Errors-To: kernel-team-bounces@lists.ubuntu.com From: Fenghua Yu BugLink: http://bugs.launchpad.net/bugs/1025752 Thermal throttle and power limit events are not defined as MCE errors in x86 architecture and should not generate MCE errors in mcelog. Current kernel generates fake software defined MCE errors for these events. This may confuse users because they may think the machine has real MCE errors while actually only thermal throttle or power limit events happen. To make it worse, buggy firmware on some platforms may falsely generate the events. Therefore, kernel reports MCE errors which users think as real hardware errors. Although the firmware bugs should be fixed, on the other hand, kernel should not report MCE errors either. So mcelog is not a good mechanism to report these events. To report the events, we count them in respective counters (core_power_limit_count, package_power_limit_count, core_throttle_count, and package_throttle_count) in /sys/devices/system/cpu/cpu#/thermal_throttle/. Users can check the counters for each event on each CPU. Please note that all CPU's on one package report duplicate counters. It's user application's responsibity to retrieve a package level counter for one package. This patch doesn't report package level power limit, core level power limit, and package level thermal throttle events in mcelog. When the events happen, only report them in respective counters in sysfs. Since core level thermal throttle has been legacy code in kernel for a while and users accepted it as MCE error in mcelog, core level thermal throttle is still reported in mcelog. In the mean time, the event is counted in a counter in sysfs as well. Signed-off-by: Fenghua Yu Acked-by: Borislav Petkov Acked-by: Tony Luck Link: http://lkml.kernel.org/r/20111215001945.GA21009@linux-os.sc.intel.com Signed-off-by: H. Peter Anvin (cherry picked from commit 29e9bf1841e4f9df13b4992a716fece7087dd237) Signed-off-by: Brad Figg --- arch/x86/kernel/cpu/mcheck/therm_throt.c | 29 +++++++---------------------- 1 file changed, 7 insertions(+), 22 deletions(-) diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c index 27c6251..99cd9d2 100644 --- a/arch/x86/kernel/cpu/mcheck/therm_throt.c +++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c @@ -322,17 +322,6 @@ device_initcall(thermal_throttle_init_device); #endif /* CONFIG_SYSFS */ -/* - * Set up the most two significant bit to notify mce log that this thermal - * event type. - * This is a temp solution. May be changed in the future with mce log - * infrasture. - */ -#define CORE_THROTTLED (0) -#define CORE_POWER_LIMIT ((__u64)1 << 62) -#define PACKAGE_THROTTLED ((__u64)2 << 62) -#define PACKAGE_POWER_LIMIT ((__u64)3 << 62) - static void notify_thresholds(__u64 msr_val) { /* check whether the interrupt handler is defined; @@ -362,27 +351,23 @@ static void intel_thermal_interrupt(void) if (therm_throt_process(msr_val & THERM_STATUS_PROCHOT, THERMAL_THROTTLING_EVENT, CORE_LEVEL) != 0) - mce_log_therm_throt_event(CORE_THROTTLED | msr_val); + mce_log_therm_throt_event(msr_val); if (this_cpu_has(X86_FEATURE_PLN)) - if (therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT, + therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT, POWER_LIMIT_EVENT, - CORE_LEVEL) != 0) - mce_log_therm_throt_event(CORE_POWER_LIMIT | msr_val); + CORE_LEVEL); if (this_cpu_has(X86_FEATURE_PTS)) { rdmsrl(MSR_IA32_PACKAGE_THERM_STATUS, msr_val); - if (therm_throt_process(msr_val & PACKAGE_THERM_STATUS_PROCHOT, + therm_throt_process(msr_val & PACKAGE_THERM_STATUS_PROCHOT, THERMAL_THROTTLING_EVENT, - PACKAGE_LEVEL) != 0) - mce_log_therm_throt_event(PACKAGE_THROTTLED | msr_val); + PACKAGE_LEVEL); if (this_cpu_has(X86_FEATURE_PLN)) - if (therm_throt_process(msr_val & + therm_throt_process(msr_val & PACKAGE_THERM_STATUS_POWER_LIMIT, POWER_LIMIT_EVENT, - PACKAGE_LEVEL) != 0) - mce_log_therm_throt_event(PACKAGE_POWER_LIMIT - | msr_val); + PACKAGE_LEVEL); } }