Patch Detail
get:
Show a patch.
patch:
Update a patch.
put:
Update a patch.
GET /api/1.2/patches/2233125/?format=api
{ "id": 2233125, "url": "http://patchwork.ozlabs.org/api/1.2/patches/2233125/?format=api", "web_url": "http://patchwork.ozlabs.org/project/linux-pci/patch/20260505173029.2718246-12-terry.bowman@amd.com/", "project": { "id": 28, "url": "http://patchwork.ozlabs.org/api/1.2/projects/28/?format=api", "name": "Linux PCI development", "link_name": "linux-pci", "list_id": "linux-pci.vger.kernel.org", "list_email": "linux-pci@vger.kernel.org", "web_url": null, "scm_url": null, "webscm_url": null, "list_archive_url": "", "list_archive_url_format": "", "commit_url_format": "" }, "msgid": "<20260505173029.2718246-12-terry.bowman@amd.com>", "list_archive_url": null, "date": "2026-05-05T17:30:29", "name": "[v17,11/11] Documentation: cxl: Document CXL protocol error handling", "commit_ref": null, "pull_url": null, "state": "new", "archived": false, "hash": "70b5168643eaf26fc8578f2251192827f5873749", "submitter": { "id": 82124, "url": "http://patchwork.ozlabs.org/api/1.2/people/82124/?format=api", "name": "Bowman, Terry", "email": "Terry.Bowman@amd.com" }, "delegate": null, "mbox": "http://patchwork.ozlabs.org/project/linux-pci/patch/20260505173029.2718246-12-terry.bowman@amd.com/mbox/", "series": [ { "id": 502875, "url": "http://patchwork.ozlabs.org/api/1.2/series/502875/?format=api", "web_url": "http://patchwork.ozlabs.org/project/linux-pci/list/?series=502875", "date": "2026-05-05T17:30:19", "name": "Enable CXL PCIe Port Protocol Error handling and logging", "version": 17, "mbox": "http://patchwork.ozlabs.org/series/502875/mbox/" } ], "comments": "http://patchwork.ozlabs.org/api/patches/2233125/comments/", "check": "pending", "checks": "http://patchwork.ozlabs.org/api/patches/2233125/checks/", "tags": {}, "related": [], "headers": { "Return-Path": "\n <linux-pci+bounces-53775-incoming=patchwork.ozlabs.org@vger.kernel.org>", "X-Original-To": [ "incoming@patchwork.ozlabs.org", "linux-pci@vger.kernel.org" ], "Delivered-To": "patchwork-incoming@legolas.ozlabs.org", 
"Authentication-Results": [ "legolas.ozlabs.org;\n\tdkim=pass (1024-bit key;\n unprotected) header.d=amd.com header.i=@amd.com header.a=rsa-sha256\n header.s=selector1 header.b=NhoD/pRt;\n\tdkim-atps=neutral", "legolas.ozlabs.org;\n spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org\n (client-ip=2600:3c09:e001:a7::12fc:5321; helo=sto.lore.kernel.org;\n envelope-from=linux-pci+bounces-53775-incoming=patchwork.ozlabs.org@vger.kernel.org;\n receiver=patchwork.ozlabs.org)", "smtp.subspace.kernel.org;\n\tdkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com\n header.b=\"NhoD/pRt\"", "smtp.subspace.kernel.org;\n arc=fail smtp.client-ip=40.93.195.23", "smtp.subspace.kernel.org;\n dmarc=pass (p=quarantine dis=none) header.from=amd.com", "smtp.subspace.kernel.org;\n spf=fail smtp.mailfrom=amd.com" ], "Received": [ "from sto.lore.kernel.org (sto.lore.kernel.org\n [IPv6:2600:3c09:e001:a7::12fc:5321])\n\t(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)\n\t key-exchange x25519 server-signature ECDSA (secp384r1) server-digest SHA384)\n\t(No client certificate requested)\n\tby legolas.ozlabs.org (Postfix) with ESMTPS id 4g95Fr1HPQz1yJ0\n\tfor <incoming@patchwork.ozlabs.org>; Wed, 06 May 2026 03:33:20 +1000 (AEST)", "from smtp.subspace.kernel.org (conduit.subspace.kernel.org\n [100.90.174.1])\n\tby sto.lore.kernel.org (Postfix) with ESMTP id AD84F301FF33\n\tfor <incoming@patchwork.ozlabs.org>; Tue, 5 May 2026 17:33:11 +0000 (UTC)", "from localhost.localdomain (localhost.localdomain [127.0.0.1])\n\tby smtp.subspace.kernel.org (Postfix) with ESMTP id 381E14A2E26;\n\tTue, 5 May 2026 17:33:08 +0000 (UTC)", "from SN4PR2101CU001.outbound.protection.outlook.com\n (mail-southcentralusazon11012023.outbound.protection.outlook.com\n [40.93.195.23])\n\t(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))\n\t(No client certificate requested)\n\tby smtp.subspace.kernel.org (Postfix) with ESMTPS id 185C838E5F9;\n\tTue, 5 May 2026 17:33:05 +0000 
(UTC)", "from DS7PR03CA0099.namprd03.prod.outlook.com (2603:10b6:5:3b7::14)\n by CH2PR12MB4120.namprd12.prod.outlook.com (2603:10b6:610:7b::13) with\n Microsoft SMTP Server (version=TLS1_2,\n cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9870.25; Tue, 5 May\n 2026 17:32:56 +0000", "from DM2PEPF00003FC4.namprd04.prod.outlook.com\n (2603:10b6:5:3b7:cafe::db) by DS7PR03CA0099.outlook.office365.com\n (2603:10b6:5:3b7::14) with Microsoft SMTP Server (version=TLS1_3,\n cipher=TLS_AES_256_GCM_SHA384) id 15.20.9870.25 via Frontend Transport; Tue,\n 5 May 2026 17:32:56 +0000", "from satlexmb07.amd.com (165.204.84.17) by\n DM2PEPF00003FC4.mail.protection.outlook.com (10.167.23.22) with Microsoft\n SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id\n 15.20.9891.9 via Frontend Transport; Tue, 5 May 2026 17:32:55 +0000", "from ethanolx7ea3host.amd.com (10.180.168.240) by satlexmb07.amd.com\n (10.181.42.216) with Microsoft SMTP Server (version=TLS1_2,\n cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.2562.17; Tue, 5 May\n 2026 12:32:54 -0500" ], "ARC-Seal": [ "i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;\n\tt=1778002388; cv=fail;\n b=TYwind9COIQqia4uojFaQT4MJ7t7Ks32NnGdttZPNtZspkJfjqdRjUE5T9BlmFYt3L0g5OMViC5uh6gtS8l6cH8zCRFpsKcfqE5hEyYaHNKpVysGSmLnrH9tnkHV+jJsw9WqxnEXztGVjGEiB6J8Nv8E8dicJCRB8e3Cmx7QiCQ=", "i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none;\n b=RkGm0h7fXhDN1R+EcXbhB1DFSYWwCFleGLf3NCR/r49yrdgvxRgiyDJ8M9on9ZrlBPp3uuDS5a258jXuzRJbzxeBjWD85jD9TRSA/W1m1iteXk0PBRMlRiaM63RVY8POPwJo/f0GhtSR4mBOpg+lBlnYl3huNsOPtKDfGQ5gFo54ClOcKsKqOxvSnfK3EU3f3pf9hsTPnJtojzljQ1tvQUJ+drgxoNi7bExaeIYzNMvRKtMBFa6lWpLVihQwr70DLFnYzWlp+FNTjJs1qbUCMocyA84PhgdUwcK1LMzOFdCR7J+SUkk11ieQ0K/zOvhG4dfec9UmOciJMKOvrEO/fA==" ], "ARC-Message-Signature": [ "i=2; a=rsa-sha256; d=subspace.kernel.org;\n\ts=arc-20240116; t=1778002388; 
c=relaxed/simple;\n\tbh=kKUGFKbhKHjNOknXaHeutQh+mJSrXYf2VvDkSAcS2MA=;\n\th=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References:\n\t MIME-Version:Content-Type;\n b=SXhPoscKZonnPgJGaQXsZFd3mvlByUK8UAffEMzfDNc9qgdtH0H9nZJ8U0fZBDExfNfN4qjEUj5pueRftoZKkOYDMAImidB+Aqx0WqcFKh8HLloAk3y6XQvLcDz1gL9QyiuSsQXOFcThWkp1Kpv8bhYmovfdTv82PuERqCz1K/E=", "i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com;\n s=arcselector10001;\n h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1;\n bh=87h5VvFDLPcUf0akGajedE6q2tJmRieN7ll/0XADI78=;\n b=DHFPvxW20Ct+wZts3zA/R1TO5SVKJy/cjrOpEH7yZYuF9RPtoSXHruRbBJbTl12+VOZsk4zaCgvGKHK4UPinIujHmvC50Jp630XbEhZjsaJREKS4BNi78Q1Xh7ED5oKtNwgIpKcwo65fMLtzo4oU75Nyt7ru9racmWFrvDWhl5ItO+BdJAk1N7eXhTdKSE3FknXQaOKXVYozTH6U7mhd2ihtRtcZSDRwNSJzGOsJWyzEetPQtPyNxpLvvH4tUj9/EJC/02JTr2Hf8R2a1L2kv1UxE4rdrcEYixKdQdmw+FydBoscwQWJLgEGSOGVbfrElP51GphxfdZyhO14AR5BWA==" ], "ARC-Authentication-Results": [ "i=2; smtp.subspace.kernel.org;\n dmarc=pass (p=quarantine dis=none) header.from=amd.com;\n spf=fail smtp.mailfrom=amd.com;\n dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com\n header.b=NhoD/pRt; arc=fail smtp.client-ip=40.93.195.23", "i=1; mx.microsoft.com 1; spf=pass (sender ip is\n 165.204.84.17) smtp.rcpttodomain=stgolabs.net smtp.mailfrom=amd.com;\n dmarc=pass (p=quarantine sp=quarantine pct=100) action=none\n header.from=amd.com; dkim=none (message not signed); arc=none (0)" ], "DKIM-Signature": "v=1; a=rsa-sha256; c=relaxed/relaxed; d=amd.com; s=selector1;\n h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck;\n bh=87h5VvFDLPcUf0akGajedE6q2tJmRieN7ll/0XADI78=;\n b=NhoD/pRtoswUEySheIqU3xics7mdY//YNzCdnyBZkGTqV7kOldeVdGIvtW4fZJj9D3aYfj89sxkiMdbDm7P+pI2uoapA1Q/fx50JAAlfInI2iXp1gkAci27stdqw4QhSKWhMc6OIjwdv85hLEXXFoOgh3reE3W8uVaC7bQqiF8o=", "X-MS-Exchange-Authentication-Results": "spf=pass (sender IP 
is 165.204.84.17)\n smtp.mailfrom=amd.com; dkim=none (message not signed)\n header.d=none;dmarc=pass action=none header.from=amd.com;", "Received-SPF": "Pass (protection.outlook.com: domain of amd.com designates\n 165.204.84.17 as permitted sender) receiver=protection.outlook.com;\n client-ip=165.204.84.17; helo=satlexmb07.amd.com; pr=C", "From": "Terry Bowman <terry.bowman@amd.com>", "To": "<dave@stgolabs.net>, <jic23@kernel.org>, <dave.jiang@intel.com>,\n\t<alison.schofield@intel.com>, <djbw@kernel.org>, <bhelgaas@google.com>,\n\t<shiju.jose@huawei.com>, <ming.li@zohomail.com>,\n\t<Smita.KoralahalliChannabasappa@amd.com>, <rrichter@amd.com>,\n\t<dan.carpenter@linaro.org>, <PradeepVineshReddy.Kodamati@amd.com>,\n\t<lukas@wunner.de>, <Benjamin.Cheatham@amd.com>,\n\t<sathyanarayanan.kuppuswamy@linux.intel.com>, <vishal.l.verma@intel.com>,\n\t<alucerop@amd.com>, <ira.weiny@intel.com>, <corbet@lwn.net>,\n\t<rafael@kernel.org>, <xueshuai@linux.alibaba.com>,\n\t<linux-cxl@vger.kernel.org>", "CC": "<linux-kernel@vger.kernel.org>, <linux-pci@vger.kernel.org>,\n\t<linux-acpi@vger.kernel.org>, <linux-doc@vger.kernel.org>,\n\t<terry.bowman@amd.com>", "Subject": "[PATCH v17 11/11] Documentation: cxl: Document CXL protocol error\n handling", "Date": "Tue, 5 May 2026 12:30:29 -0500", "Message-ID": "<20260505173029.2718246-12-terry.bowman@amd.com>", "X-Mailer": "git-send-email 2.34.1", "In-Reply-To": "<20260505173029.2718246-1-terry.bowman@amd.com>", "References": "<20260505173029.2718246-1-terry.bowman@amd.com>", "Precedence": "bulk", "X-Mailing-List": "linux-pci@vger.kernel.org", "List-Id": "<linux-pci.vger.kernel.org>", "List-Subscribe": "<mailto:linux-pci+subscribe@vger.kernel.org>", "List-Unsubscribe": "<mailto:linux-pci+unsubscribe@vger.kernel.org>", "MIME-Version": "1.0", "Content-Transfer-Encoding": "7bit", "Content-Type": "text/plain", "X-ClientProxiedBy": "satlexmb08.amd.com (10.181.42.217) To satlexmb07.amd.com\n (10.181.42.216)", "X-EOPAttributedMessage": "0", 
"X-MS-PublicTrafficType": "Email", "X-MS-TrafficTypeDiagnostic": "DM2PEPF00003FC4:EE_|CH2PR12MB4120:EE_", "X-MS-Office365-Filtering-Correlation-Id": "d3d82984-1d3b-414f-8ecf-08deaacc5803", "X-MS-Exchange-SenderADCheck": "1", "X-MS-Exchange-AntiSpam-Relay": "0", "X-Microsoft-Antispam": "\n\tBCL:0;ARA:13230040|7416014|82310400026|376014|36860700016|1800799024|921020|22082099003|56012099003|18002099003;", "X-Microsoft-Antispam-Message-Info": "\n\tOvWS6REqSZC9SwIJlNEaxlRHQosTOd/USdpr+XT7PeeMyT64InyRJi+ulxU7XXO2W6+wVF5lGhVB5vEvhKd+yJoWIrB+OCqDwq/mtihDIBudFTttYIcbDs5GpdlhXFmz3YBmw1a8d3wQRWS46tHKMhqVijbFL+yJjuP+fp+syj1xEla5uEwKHOppFXNvfe/ccP1zqP++qtAwAh26oJY0yHlsaVNLzaClSWY/D3SwNC8en2UhhZeuSvEKWBE0kYkvKKAiSxZg7SjVp7O72YgkGZYbuSH5VEfcA0JhEknxBuFU+dbCkiUECpXewaKLnB8IAgW1w7qesNoU//soB4d5pOoCnaMgVcK1QrI9SHhmuzLtlw/scluFRKB1hJk6w1H3IBBsKr2hgzigy6ElUuHXSb8ivapy1I8Pa72bsFBHEoAAh8VOAmq0GWwyGQLXMxUVxg/dQcy6xiGlfUAnDu6OQ7YIqfXs9oLEXaAnwjD8jGrFH6qTQwBR/0+2J2xjAr/oyEoakiIXc3wNGlrtZtbZsv//p2rEClQ7u+MSVyfZIk6naMc227CW+0fwBrUC+Y/RUMcUJlTyGttpbT629ApImQjrZ+vBosYvFM2jmR3ZRK4wucmlvJdk4slaMsT2Ri+ostNsm1BszspqIsHC0sP0fuRHkq3iIKFKFCOoTf8pBiWiNTsuBm7i5G7hyF6Rgg4ECcSMZ74V49xBQO5qFCcugKy/j5yKbRBhvyUWuzG4fGwVlbS6YMbXdaOD3+uiR3vIs2LGNSYGpHc3DAIuOmMyig==", "X-Forefront-Antispam-Report": "\n\tCIP:165.204.84.17;CTRY:US;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:satlexmb07.amd.com;PTR:InfoDomainNonexistent;CAT:NONE;SFS:(13230040)(7416014)(82310400026)(376014)(36860700016)(1800799024)(921020)(22082099003)(56012099003)(18002099003);DIR:OUT;SFP:1101;", "X-MS-Exchange-AntiSpam-MessageData-ChunkCount": "1", "X-MS-Exchange-AntiSpam-MessageData-0": 
"\n\tjC1kONZPDnXQrh4e8gSJbIrC/OP4Jp9yby901hqj79y7ATlfCb68iSUf1cRS6pzVI0NXfyYGyL9ngkqAPg+y426x4nPB/UtwEl24sstVf/xR7nMwx+/Zn/EjniW6NstKMVGBBqcCMIY5InE9bJ+f2RWxMe5+X9vQdMjgxo+TaVcUQM2L0Hez0/vTX7aHiU1SQLmWVKZyiyTd193NOLVsHMNArrwFxYHbkpZJ4RpSEyLeiYwzyzmgsm2EcZwhD6+AKMDgmZwQldseN2Uvode2gsAihsrUUa47lNwVt0fMMtq9Gv3DB+3v5VGeCNfYt44Sapkz1LZySuuVU8xazHzGYw3yoy/lZ4jMV0PrL4cmFOhrNVH4GZt4eUzsc9MgbtkVy8M8+f2an14Guv9P2FJSxCtns97RhvmLMliatbn8LX378CBug5bMFtWc0ZA+utyy", "X-OriginatorOrg": "amd.com", "X-MS-Exchange-CrossTenant-OriginalArrivalTime": "05 May 2026 17:32:55.9082\n (UTC)", "X-MS-Exchange-CrossTenant-Network-Message-Id": "\n d3d82984-1d3b-414f-8ecf-08deaacc5803", "X-MS-Exchange-CrossTenant-Id": "3dd8961f-e488-4e60-8e11-a82d994e183d", "X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp": "\n TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[satlexmb07.amd.com]", "X-MS-Exchange-CrossTenant-AuthSource": "\n\tDM2PEPF00003FC4.namprd04.prod.outlook.com", "X-MS-Exchange-CrossTenant-AuthAs": "Anonymous", "X-MS-Exchange-CrossTenant-FromEntityHeader": "HybridOnPrem", "X-MS-Exchange-Transport-CrossTenantHeadersStamped": "CH2PR12MB4120" }, "content": "Add Documentation/driver-api/cxl/linux/protocol-error-handling.rst\ndescribing the end-to-end CXL protocol error path: AER ingress, the\nAER-CXL kfifo handoff, the cxl_core consumer worker, RCD/RCH special\ncases, severity policy, trace events, and a source code map.\n\nThis documents the architecture introduced by the preceding patches in\nthis series.\n\nThis was generated by claude-opus-4.7.\n\nAssisted-by: Claude:claude-opus-4.7\nSigned-off-by: Terry Bowman <terry.bowman@amd.com>\n---\n Documentation/driver-api/cxl/index.rst | 1 +\n .../cxl/linux/protocol-error-handling.rst | 440 ++++++++++++++++++\n 2 files changed, 441 insertions(+)\n create mode 100644 Documentation/driver-api/cxl/linux/protocol-error-handling.rst", "diff": "diff --git a/Documentation/driver-api/cxl/index.rst 
b/Documentation/driver-api/cxl/index.rst\nindex 3dfae1d310ca..6861b2e5726a 100644\n--- a/Documentation/driver-api/cxl/index.rst\n+++ b/Documentation/driver-api/cxl/index.rst\n@@ -42,6 +42,7 @@ that have impacts on each other. The docs here break up configurations steps.\n linux/dax-driver\n linux/memory-hotplug\n linux/access-coordinates\n+ linux/protocol-error-handling\n \n .. toctree::\n :maxdepth: 2\ndiff --git a/Documentation/driver-api/cxl/linux/protocol-error-handling.rst b/Documentation/driver-api/cxl/linux/protocol-error-handling.rst\nnew file mode 100644\nindex 000000000000..4d6f33f0ed31\n--- /dev/null\n+++ b/Documentation/driver-api/cxl/linux/protocol-error-handling.rst\n@@ -0,0 +1,440 @@\n+.. SPDX-License-Identifier: GPL-2.0\n+\n+==============================\n+CXL Protocol Error Handling\n+==============================\n+\n+This document describes how the kernel detects, classifies, dispatches,\n+logs, and recovers from CXL protocol errors signaled through the PCIe\n+Advanced Error Reporting (AER) interface. It covers both Virtual\n+Hierarchy (VH) topologies (Root Ports, Upstream/Downstream Switch\n+Ports, and Endpoints) and Restricted CXL Host (RCH) topologies\n+(Root Complex Event Collectors driving Restricted CXL Devices).\n+\n+It is intended for kernel developers maintaining or extending\n+``drivers/pci/pcie/aer*.c``, ``drivers/cxl/core/ras.c``, and the\n+related plumbing in ``include/linux/aer.h``.\n+\n+\n+Background\n+==========\n+\n+A CXL device reports protocol-layer failures (CXL.cachemem RAS) as\n+PCIe AER **Internal Errors**: ``PCI_ERR_COR_INTERNAL`` for correctable\n+events and ``PCI_ERR_UNC_INTN`` for uncorrectable events. 
From the AER\n+core's point of view these look like ordinary PCIe AER messages, but\n+their semantics are CXL-specific: the actual fault information lives\n+in CXL RAS capability registers, not in the PCIe AER status registers.\n+\n+Historically, native CXL.cachemem RAS handling was implemented only\n+for CXL Endpoints and for RCH Downstream Ports. CXL Root Ports,\n+Upstream Switch Ports, and Downstream Switch Ports were not covered.\n+This left the kernel unable to log or react to protocol errors\n+signaled by switch components.\n+\n+The unified CXL protocol error path closes that gap by routing every\n+CXL Internal Error through a single producer/consumer pipeline shared\n+by all CXL device types.\n+\n+\n+Architecture overview\n+=====================\n+\n+CXL protocol error handling is implemented as a distinct error plane\n+layered on top of the existing PCIe AER infrastructure. The two planes\n+are kept separate:\n+\n+* The **PCIe AER plane** continues to handle native PCIe errors\n+ (Receiver overflows, malformed TLPs, completion timeouts, and so\n+ on). This is unchanged.\n+\n+* The **CXL protocol error plane** owns CXL Internal Errors. The AER\n+ core forwards them to ``cxl_core`` via a dedicated kfifo; ``cxl_core``\n+ then dispatches to CE/UE handlers and drives the recovery and\n+ panic policy.\n+\n+The boundary between the two planes is ``is_cxl_error()`` in\n+``drivers/pci/pcie/aer_cxl_vh.c``, which inspects ``info->is_cxl``\n+(set from ``pcie_is_cxl()``) together with the PCIe device type and\n+the AER status word. When ``is_cxl_error()`` returns true the event\n+is enqueued into the AER-CXL kfifo; otherwise the event flows through\n+``pci_aer_handle_error()`` as before.\n+\n+The pipeline has three layers:\n+\n+1. **Producer** (``aer_cxl_vh.c``, ``aer_cxl_rch.c``) - runs in AER\n+ IRQ/threaded context, classifies, clears the AER CE status, and\n+ enqueues ``struct cxl_proto_err_work_data``.\n+2. 
**Queue** - the AER-CXL kfifo plus a backing ``struct work_struct``.\n+3. **Consumer** (``cxl_core/ras.c``) - workqueue-context worker that\n+ resolves the CXL Port topology and dispatches to CE/UE handlers.\n+\n+\n+Topologies\n+==========\n+\n+Two topologies are supported, and both feed the same kfifo.\n+\n+Virtual Hierarchy (VH)\n+----------------------\n+\n+A standard CXL VH consists of a CXL Root Port (RP), an optional CXL\n+Upstream Switch Port (USP), one or more CXL Downstream Switch Ports\n+(DSPs), and CXL Endpoints (EPs) attached to the DSPs. Each component\n+is a regular PCIe device with a CXL DVSEC and a CXL RAS capability,\n+and it raises Internal Errors directly to the AER subsystem via the\n+RP's MSI/MSI-X interrupt.\n+\n+The VH producer is ``cxl_forward_error()`` in\n+``drivers/pci/pcie/aer_cxl_vh.c``.\n+\n+Restricted CXL Host (RCH)\n+-------------------------\n+\n+In the RCH topology, a Root Complex Event Collector (RCEC) aggregates\n+errors from one or more Restricted CXL Devices (RCDs) attached as\n+Root Complex Integrated Endpoints. The RCEC delivers the AER\n+interrupt; the AER driver iterates the RCDs beneath it.\n+\n+The RCH producer is ``cxl_rch_handle_error_iter()`` in\n+``drivers/pci/pcie/aer_cxl_rch.c``. For each RCD it finds, it calls\n+``cxl_forward_error()`` (the same producer helper used by the VH\n+path), so RCH events end up in the same AER-CXL kfifo as VH events.\n+\n+\n+End-to-end flow\n+===============\n+\n+The diagram below shows the full path from an AER interrupt through\n+producer classification, kfifo handoff, and consumer dispatch.\n+\n+.. 
code-block:: text\n+\n+ +-------------------------------------------------------------------------+\n+ | CXL Internal Error Packet Flow |\n+ | From PCIe AER Interrupt to CXL Protocol Error Handling and Logging |\n+ +-------------------------------------------------------------------------+\n+\n+ CXL device (RP / USP / DSP / EP / RCD) raises AER Internal Error\n+ (correctable PCI_ERR_COR_INTERNAL or uncorrectable PCI_ERR_UNC_INTN)\n+ |\n+ v\n+ +-------------------------------------------------------------+\n+ | PCIe Root Port AER MSI/MSI-X interrupt fires |\n+ +-------------------------------------------------------------+\n+ |\n+ ============= drivers/pci/pcie/aer.c (AER core) =============\n+ |\n+ v\n+ +---------------------------------+\n+ | aer_irq() / aer_isr() | (top + threaded handler)\n+ +---------------------------------+\n+ |\n+ v\n+ +---------------------------------+\n+ | aer_isr_one_error() |\n+ | aer_isr_one_error_type() |\n+ +---------------------------------+\n+ |\n+ v\n+ +------------------------------------------+\n+ | aer_get_device_error_info() |\n+ | - reads PCI_ERR_COR_STATUS |\n+ | - reads PCI_ERR_UNCOR_STATUS (*if RP/ |\n+ | RCEC/DSP, or non-fatal severity) |\n+ | - sets info->is_cxl = pcie_is_cxl(dev) |\n+ +------------------------------------------+\n+ |\n+ v\n+ +---------------------------------+\n+ | handle_error_source(dev, info) |\n+ +---------------------------------+\n+ | |\n+ | is_cxl_error() +---> pci_aer_handle_error()\n+ | (CXL device + Internal) (native PCIe AER path,\n+ v not covered here)\n+ +-------------------------------------------------------------+\n+ | Topology dispatch within AER core: |\n+ | |\n+ | - VH topology (RP / USP / DSP / EP) |\n+ | -> drivers/pci/pcie/aer_cxl_vh.c |\n+ | |\n+ | - RCH topology (RCEC iterates RCDs under it) |\n+ | -> drivers/pci/pcie/aer_cxl_rch.c |\n+ +-------------------------------------------------------------+\n+ | |\n+ | VH path RCH path (RCEC AER)\n+ v v\n+ ============= aer_cxl_vh.c (VH 
============= aer_cxl_rch.c (RCH\n+ producer) ============= producer) ==========\n+ | |\n+ v v\n+ +-----------------------------+ +-------------------------------+\n+ | cxl_forward_error(pdev,info)| | cxl_rch_handle_error_iter() |\n+ | - if AER_CORRECTABLE: | | - iterate each RCD pdev |\n+ | clear PCI_ERR_COR_STATUS| | beneath the RCEC |\n+ | - pci_dev_get(pdev) | | - call cxl_forward_error() |\n+ | - build cxl_proto_err_ | | for each RCD |\n+ | work_data | | (same producer helper as |\n+ | { pdev, severity } | | the VH path uses) |\n+ | - kfifo_in_spinlocked(...) | +-------------------------------+\n+ | - schedule_work(...) | |\n+ +-----------------------------+ |\n+ | |\n+ +-----------------+---------------------------+\n+ |\n+ v\n+ +--------------------------+\n+ | AER-CXL kfifo |\n+ | (work_struct) |\n+ +--------------------------+\n+ |\n+ v\n+ ============= drivers/cxl/core/ras.c (consumer worker) =======\n+ |\n+ v\n+ +-------------------------------------------------------------+\n+ | cxl_proto_err_work_fn() (workqueue handler) |\n+ | for_each_cxl_proto_err(&wd, __cxl_proto_err_work_fn) |\n+ +-------------------------------------------------------------+\n+ |\n+ v\n+ +-------------------------------------------------------------+\n+ | __cxl_proto_err_work_fn(wd) |\n+ | port = find_cxl_port_by_dev(&pdev->dev, &dport) |\n+ | cxl_handle_proto_error(pdev, port, dport, severity) |\n+ | pci_dev_put(pdev) |\n+ +-------------------------------------------------------------+\n+ |\n+ v\n+ +-------------------------------------------------------------+\n+ | cxl_handle_proto_error() |\n+ +-------------------------------------------------------------+\n+ | |\n+ pci_pcie_type == pci_pcie_type !=\n+ PCI_EXP_TYPE_RC_END PCI_EXP_TYPE_RC_END\n+ (RCD Endpoint) (VH: RP/USP/DSP/EP)\n+ | |\n+ v |\n+ +-------------------------------------+ |\n+ | cxl_handle_rdport_errors(pdev) | |\n+ | - process RCH Downstream Port's | |\n+ | RAS register block first | |\n+ | - cxl_handle_cor_ras() 
for CE | |\n+ | - cxl_handle_ras() for UE | |\n+ | (log only; does NOT panic) | |\n+ +-------------------------------------+ |\n+ | |\n+ +--------------------+-----------------------+\n+ |\n+ v\n+ +-----------------------------+\n+ | severity == AER_CORRECTABLE |\n+ +-----------------------------+\n+ | |\n+ yes no\n+ v v\n+ +----------------------+ +-------------------------+\n+ | cxl_handle_cor_ras() | | cxl_do_recovery() |\n+ | - emit cxl_aer_ | | (described below) |\n+ | correctable_ | +-------------------------+\n+ | error trace |\n+ | pcie_clear_device_ |\n+ | status() |\n+ +----------------------+\n+\n+ +-------------------------------+\n+ | cxl_do_recovery() |\n+ | if pci_dev_is_disconnected: |\n+ | panic(\"CXL cachemem err.\") |\n+ | |\n+ | ue = cxl_handle_ras() |\n+ | -> emit |\n+ | cxl_aer_uncorrectable_ |\n+ | error trace event |\n+ | |\n+ | if (ue): |\n+ | panic(\"CXL cachemem err.\") |\n+ | |\n+ | pcie_clear_device_status() |\n+ | pci_aer_clear_nonfatal_status|\n+ | pci_aer_clear_fatal_status |\n+ +-------------------------------+\n+\n+\n+Severity policy\n+===============\n+\n+The kernel's response to a CXL protocol error depends on the AER\n+severity reported by the device and on the result of inspecting the\n+CXL RAS registers.\n+\n+Correctable Error (CE)\n+----------------------\n+\n+* The AER driver clears ``PCI_ERR_COR_STATUS`` in the producer\n+ (``cxl_forward_error()``) before enqueue, so the device is\n+ acknowledged even if the consumer drops the event.\n+* The consumer's ``cxl_handle_cor_ras()`` reads and clears the CXL\n+ RAS correctable status and emits a ``cxl_aer_correctable_error``\n+ trace event.\n+* No recovery action is taken.\n+\n+Uncorrectable Error (UE), non-fatal\n+-----------------------------------\n+\n+* The producer enqueues the event without clearing the AER UCE\n+ status.\n+* The consumer enters ``cxl_do_recovery()``.\n+* ``cxl_handle_ras()`` reads the CXL RAS uncorrectable status and\n+ emits a 
``cxl_aer_uncorrectable_error`` trace event.\n+* If ``cxl_handle_ras()`` returns true (a CXL RAS UE bit was set),\n+ the kernel panics with ``\"CXL cachemem error.\"``. CXL.cachemem\n+ traffic cannot be safely recovered in software once corruption is\n+ observed; continuing risks silent data loss across all devices in\n+ an interleaved HDM region.\n+* If ``cxl_handle_ras()`` returns false (no CXL RAS bit set, i.e.\n+ the AER UCE was a PCIe-side issue rather than a CXL.cachemem\n+ issue), the AER UCE status is cleared and execution continues.\n+\n+Uncorrectable Error (UE), fatal\n+-------------------------------\n+\n+Fatal severity follows the same recovery path as non-fatal in\n+``cxl_do_recovery()``, with one important caveat: the AER core only\n+reads ``PCI_ERR_UNCOR_STATUS`` for Root Ports, RCECs, Downstream\n+Ports, or non-fatal severities (see ``aer_get_device_error_info()``\n+in ``drivers/pci/pcie/aer.c``). For a fatal UE signaled by an\n+upstream component, PCI config reads to the source device are\n+expected to fail, so ``UNCOR_STATUS`` is never retrieved and\n+``info->status`` stays zero.\n+\n+The practical consequence: a fatal UE on an Upstream Switch Port or\n+Endpoint is **not** classified as a CXL error by ``is_cxl_error()``.\n+It falls through to ``pci_aer_handle_error()`` and is processed by\n+the standard AER recovery flow. Only the generic AER trace event\n+emitted by the AER core (``aer_event``) appears; the CXL-specific\n+``cxl_aer_uncorrectable_error`` event is not emitted on this path.\n+\n+Disconnect during recovery\n+--------------------------\n+\n+``cxl_do_recovery()`` checks ``pci_dev_is_disconnected(pdev)`` before\n+touching the RAS registers. 
A device disconnecting during an\n+uncorrectable error event is itself unrecoverable, particularly when\n+the device backs an interleaved HDM region; in that case the kernel\n+panics directly rather than returning ``~0u`` from the readl() and\n+masking the cause.\n+\n+\n+RCD/RCH special cases\n+=====================\n+\n+RCD Endpoint flow\n+-----------------\n+\n+When ``cxl_handle_proto_error()`` sees ``pci_pcie_type(pdev) ==\n+PCI_EXP_TYPE_RC_END`` (i.e. an RCD Endpoint), it calls\n+``cxl_handle_rdport_errors()`` first. This processes the RAS state\n+of the RCH Downstream Port that hosts the RCD before falling through\n+to the common CE/UE dispatch on the RCD Endpoint itself.\n+\n+The RCH Downstream Port's RAS UE is **logged only**: it emits the\n+trace event but does not panic. The panic decision is taken on the\n+RCD Endpoint's own RAS in ``cxl_do_recovery()``.\n+\n+This split mirrors the structure of an RCH topology: the RCH dport\n+is functionally a CXL infrastructure component (similar to a switch\n+port), while the RCD itself is the actual CXL.cachemem source whose\n+corruption drives the recovery decision.\n+\n+RCH ingress aggregation\n+-----------------------\n+\n+RCH errors do not arrive on a per-RCD interrupt. The RCEC is the AER\n+source, and the AER driver drives ``cxl_rch_handle_error_iter()`` to\n+walk each RCD beneath it and forward an event per RCD through the\n+shared kfifo. 
From the consumer's point of view, RCH-originated\n+events are indistinguishable from VH events.\n+\n+\n+Trace events\n+============\n+\n+Two unified trace events are emitted from ``cxl_handle_cor_ras()``\n+and ``cxl_handle_ras()`` and are used by every CXL device type and\n+both topologies:\n+\n+* ``cxl_aer_correctable_error`` - emitted when a CXL RAS CE bit is\n+ set; carries the human-readable status string.\n+* ``cxl_aer_uncorrectable_error`` - emitted when a CXL RAS UE bit is\n+ set; carries both the current status and the first-error pointer.\n+\n+Common fields:\n+\n+* ``device=<PCI BDF>`` - the source device (always a PCI BDF, even\n+ for RCH paths where the trace was historically a memdev name).\n+* ``host=<bridge>`` - the parent host bridge or PCI host BDF.\n+* ``serial=<u64>`` - the device serial from ``pci_get_dsn()``.\n+\n+The ``device`` field replaces the older ``memdev`` field that earlier\n+revisions emitted on Endpoint events. Userspace consumers\n+(rasdaemon's ``ras-cxl-handler.c``) need a corresponding update to\n+read the new field name.\n+\n+\n+Source code map\n+===============\n+\n+============================================ ==============================\n+File Role\n+============================================ ==============================\n+``drivers/pci/pcie/aer.c`` AER core; receives the IRQ,\n+ builds ``aer_err_info``,\n+ dispatches to either the CXL\n+ path (``is_cxl_error()``) or\n+ ``pci_aer_handle_error()``.\n+``drivers/pci/pcie/aer_cxl_vh.c`` VH producer; provides\n+ ``is_cxl_error()``,\n+ ``cxl_forward_error()``, the\n+ AER-CXL kfifo, and the\n+ consumer registration\n+ helpers.\n+``drivers/pci/pcie/aer_cxl_rch.c`` RCH producer; iterates RCDs\n+ under an RCEC and forwards\n+ each via\n+ ``cxl_forward_error()``.\n+``drivers/cxl/core/ras.c`` Consumer; defines\n+ ``cxl_proto_err_work_fn()``,\n+ ``cxl_handle_proto_error()``,\n+ ``cxl_handle_rdport_errors()``,\n+ ``cxl_do_recovery()``,\n+ ``cxl_handle_cor_ras()`` and\n+ 
``cxl_handle_ras()``.\n+``include/linux/aer.h`` Public declarations:\n+ ``struct cxl_proto_err_work_data``,\n+ ``cxl_proto_err_fn_t``,\n+ ``cxl_register_proto_err_work()``\n+ and ``for_each_cxl_proto_err()``.\n+============================================ ==============================\n+\n+\n+Limitations and future work\n+===========================\n+\n+* **USP/EP fatal UCE is not classified as CXL.** As described under\n+ `Severity policy`_, the AER core never retrieves\n+ ``PCI_ERR_UNCOR_STATUS`` in this scenario, so ``is_cxl_error()``\n+ cannot tag the event as CXL. The event is handled by the AER path\n+ only. Resolving this requires either an AER-core change to attempt\n+ a config read with link-validity gating, or a separate CXL-side\n+ notification mechanism for upstream-signaled fatal events.\n+* **User-defined status masks** are not yet supported. All CE and UE\n+ status bits are reported as they appear in the RAS register.\n+* **Port traversing in cxl_do_recovery()** is not yet implemented; a\n+ CXL UE today is reported and acted on at the source device only,\n+ not propagated to ancestor ports.\n+* The RCH producer (``aer_cxl_rch.c``) currently lives under\n+ ``drivers/pci/pcie/`` for historical reasons. Moving it to\n+ ``drivers/cxl/core/ras_rch.c`` is on the roadmap.\n+\n", "prefixes": [ "v17", "11/11" ] }
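A response like the one above can be fetched and reduced to its key fields with a short script. The following is a minimal sketch, not part of Patchwork itself: the helper names (fetch_patch, files_touched, summarize_patch) are hypothetical, the API root is copied from the "url" field in the response, and it assumes the server returns plain JSON when asked with an "Accept: application/json" header instead of "?format=api".

```python
import json
import urllib.request

# API root copied from the "url" field of the response above.
API_ROOT = "http://patchwork.ozlabs.org/api/1.2"


def fetch_patch(patch_id, timeout=10.0):
    """GET /patches/<id>/ and decode the JSON body (hypothetical helper)."""
    req = urllib.request.Request(
        f"{API_ROOT}/patches/{patch_id}/",
        # Assumption: the server honours this header and returns JSON.
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def files_touched(diff):
    """List the post-image paths named on the 'diff --git' lines."""
    files = []
    for line in diff.splitlines():
        if line.startswith("diff --git "):
            # Line shape: "diff --git a/<path> b/<path>"; keep the b/ side.
            _, sep, path = line.partition(" b/")
            if sep:
                files.append(path)
    return files


def summarize_patch(patch):
    """Pick out the fields a reviewer usually checks first."""
    series = patch.get("series") or [{}]
    return {
        "id": patch["id"],
        "name": patch["name"],
        "state": patch["state"],
        "check": patch["check"],
        "submitter": patch["submitter"]["email"],
        "series_version": series[0].get("version"),
        "touches": files_touched(patch.get("diff") or ""),
    }
```

Calling summarize_patch(fetch_patch(2233125)) would perform the network request; summarize_patch() can also be applied to a response that was saved to disk and loaded with json.load().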