From patchwork Wed Jul 26 05:59:12 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kai-Heng Feng X-Patchwork-Id: 1812964 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=lists.ubuntu.com (client-ip=91.189.94.19; helo=huckleberry.canonical.com; envelope-from=kernel-team-bounces@lists.ubuntu.com; receiver=) Authentication-Results: legolas.ozlabs.org; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=canonical.com header.i=@canonical.com header.a=rsa-sha256 header.s=20210705 header.b=TtmpGeU7; dkim-atps=neutral Received: from huckleberry.canonical.com (huckleberry.canonical.com [91.189.94.19]) (using TLSv1.2 with cipher ECDHE-ECDSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4R9jt70dwBz1yZv for ; Wed, 26 Jul 2023 16:00:14 +1000 (AEST) Received: from localhost ([127.0.0.1] helo=huckleberry.canonical.com) by huckleberry.canonical.com with esmtp (Exim 4.86_2) (envelope-from ) id 1qOXZ9-0003Wd-LH; Wed, 26 Jul 2023 06:00:07 +0000 Received: from smtp-relay-canonical-1.internal ([10.131.114.174] helo=smtp-relay-canonical-1.canonical.com) by huckleberry.canonical.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1qOXZ7-0003WH-RS for kernel-team@lists.ubuntu.com; Wed, 26 Jul 2023 06:00:05 +0000 Received: from localhost.localdomain (unknown [10.101.196.174]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by smtp-relay-canonical-1.canonical.com (Postfix) with ESMTPSA id 4B43F3F16F for ; Wed, 26 Jul 2023 06:00:03 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=canonical.com; s=20210705; t=1690351205; bh=E4RfIg6J3pCXVsZfqcn/35Ptbo3JMqtfnuZYYuz3rP4=; h=From:To:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=TtmpGeU7sFhY7T4W0JH+a5374/8tr9tW8L/Ls1kvSsNZEFa8qoOGdOD6LO3VBgWPY ie9pysK+R9iv1lShi5g67BB3T7ben5Of6xdaVlffgHswdJ5pkfkzc1gdXG/iCp5ISJ CQ1raZxaxV5EA/wQxqJbNLmqkXoppidhoIDjW0aeFklqUCFWNSG+1K64tlVqKeH8qc LDiFATql79OnzEsL2C+C3/QzgiEcIQkRnb5yK5ED2k+S4kchq6f/KG8eE6+AMZEIYH iMOO4dxpLUy1I5A7JRYls6PqkFZ4W7/LU/ahas0uxs5Zj4Km7au1oz3DnUEzCcN0op tyYqTsOxsn/8A== From: Kai-Heng Feng To: kernel-team@lists.ubuntu.com Subject: [L] [PATCH 1/6] EDAC/skx_common: Enable EDAC support for the "near" memory Date: Wed, 26 Jul 2023 13:59:12 +0800 Message-Id: <20230726055917.3291633-2-kai.heng.feng@canonical.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230726055917.3291633-1-kai.heng.feng@canonical.com> References: <20230726055917.3291633-1-kai.heng.feng@canonical.com> MIME-Version: 1.0 X-BeenThere: kernel-team@lists.ubuntu.com X-Mailman-Version: 2.1.20 Precedence: list List-Id: Kernel team discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: kernel-team-bounces@lists.ubuntu.com Sender: "kernel-team" From: Qiuxu Zhuo BugLink: https://bugs.launchpad.net/bugs/2028746 The current {skx,i10nm}_edac miss the EDAC support to decode errors from the 1st level memory (the fast "near" memory as cache) of the 2-level memory system. Introduce a helper function skx_error_in_mem() to check whether errors are from memory at the beginning of skx_mce_check_error(). As long as the errors are from memory (either the 1-level memory system or the 2-level memory system), decode the errors. Reported-and-tested-by: Youquan Song Signed-off-by: Qiuxu Zhuo Signed-off-by: Tony Luck Link: https://lore.kernel.org/all/20230113032802.41752-1-qiuxu.zhuo@intel.com (cherry picked from commit 6e8746cb735166eaf3ceb086b31bb0431f5e3532) Signed-off-by: Kai-Heng Feng --- drivers/edac/skx_common.c | 18 ++++++++++++------ drivers/edac/skx_common.h | 24 ++++++++++++++++++++++++ 2 files changed, 36 insertions(+), 6 deletions(-) diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c index f0f8e98f6efb..5fdcb6d5b96b 100644 --- a/drivers/edac/skx_common.c +++ b/drivers/edac/skx_common.c @@ -632,12 +632,18 @@ static bool skx_error_in_1st_level_mem(const struct mce *m) if (!skx_mem_cfg_2lm) return false; - errcode = GET_BITFIELD(m->status, 0, 15); + errcode = GET_BITFIELD(m->status, 0, 15) & MCACOD_MEM_ERR_MASK; - if ((errcode & 0xef80) != 0x280) - return false; + return errcode == MCACOD_EXT_MEM_ERR; +} - return true; +static bool skx_error_in_mem(const struct mce *m) +{ + u32 errcode; + + errcode = GET_BITFIELD(m->status, 0, 15) & MCACOD_MEM_ERR_MASK; + + return (errcode == MCACOD_MEM_CTL_ERR || errcode == MCACOD_EXT_MEM_ERR); } int skx_mce_check_error(struct notifier_block *nb, unsigned long val, @@ -651,8 +657,8 @@ int skx_mce_check_error(struct notifier_block *nb, unsigned long val, if (mce->kflags & MCE_HANDLED_CEC) return NOTIFY_DONE; - /* ignore unless this is memory related with an address */ - if ((mce->status & 0xefff) >> 7 != 1 || !(mce->status & MCI_STATUS_ADDRV)) + /* Ignore unless this is memory related with an address */ + if (!skx_error_in_mem(mce) || !(mce->status & MCI_STATUS_ADDRV)) return NOTIFY_DONE; memset(&res, 0, sizeof(res)); diff --git a/drivers/edac/skx_common.h b/drivers/edac/skx_common.h index 0cbadd3d2cd3..312032657264 100644 --- a/drivers/edac/skx_common.h +++ b/drivers/edac/skx_common.h @@ -56,6 +56,30 @@ #define MCI_MISC_ECC_MODE(m) (((m) >> 59) & 15) #define MCI_MISC_ECC_DDRT 8 /* read from DDRT */ +/* + * According to Intel Architecture spec vol 3B, + * Table 15-10 "IA32_MCi_Status [15:0] Compound Error Code Encoding" + * memory errors should fit one of these masks: + * 000f 0000 1mmm cccc (binary) + * 000f 0010 1mmm cccc (binary) [RAM used as cache] + * where: + * f = Correction Report Filtering Bit. If 1, subsequent errors + * won't be shown + * mmm = error type + * cccc = channel + */ +#define MCACOD_MEM_ERR_MASK 0xef80 +/* + * Errors from either the memory of the 1-level memory system or the + * 2nd level memory (the slow "far" memory) of the 2-level memory system. + */ +#define MCACOD_MEM_CTL_ERR 0x80 +/* + * Errors from the 1st level memory (the fast "near" memory as cache) + * of the 2-level memory system. + */ +#define MCACOD_EXT_MEM_ERR 0x280 + /* * Each cpu socket contains some pci devices that provide global * information, and also some that are local to each of the two