From patchwork Wed Feb 14 18:07:35 2024
X-Patchwork-Submitter: Philip Cox
X-Patchwork-Id: 1899041
From: Philip Cox
To: kernel-team@lists.ubuntu.com
Subject: [SRU][jammy][PATCH 1/1] percpu-internal/pcpu_chunk: re-layout
 pcpu_chunk structure to reduce false sharing
Date: Wed, 14 Feb 2024 13:07:35 -0500
Message-Id: <20240214180735.702533-2-philip.cox@canonical.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240214180735.702533-1-philip.cox@canonical.com>
References: <20240214180735.702533-1-philip.cox@canonical.com>

From: Yu Ma

BugLink: https://bugs.launchpad.net/bugs/2053152

When running the UnixBench/Execl throughput case, false sharing is
observed due to frequent reads of base_addr and writes to free_bytes
and chunk_md. UnixBench/Execl represents a class of workload in which
bash scripts are spawned frequently to run short jobs. Such a workload
calls execl frequently, and execl calls mm_init to initialize the
mm_struct of the process. mm_init calls __percpu_counter_init to set
up the percpu counters, which in turn calls pcpu_alloc, reading the
base_addr of a pcpu_chunk for memory allocation. Inside pcpu_alloc,
pcpu_alloc_area allocates memory from the specified chunk and updates
free_bytes and chunk_md to record the remaining free bytes and other
metadata for that chunk. Correspondingly, pcpu_free_area updates the
same two members when freeing memory. The call trace from perf is as
below:

  +   57.15%  0.01%  execl  [kernel.kallsyms]  [k] __percpu_counter_init
  +   57.13%  0.91%  execl  [kernel.kallsyms]  [k] pcpu_alloc
  -   55.27% 54.51%  execl  [kernel.kallsyms]  [k] osq_lock
     - 53.54% 0x654278696e552f34
          main
          __execve
          entry_SYSCALL_64_after_hwframe
          do_syscall_64
          __x64_sys_execve
          do_execveat_common.isra.47
          alloc_bprm
          mm_init
          __percpu_counter_init
          pcpu_alloc
        - __mutex_lock.isra.17

In the current pcpu_chunk layout, `base_addr' shares a cache line with
`free_bytes' and `chunk_md', occupying the last 8 bytes of that line.
This patch moves `bound_map' up into `base_addr''s old slot, so that
`base_addr' starts a new cache line. With this change, on an Intel
Sapphire Rapids 112C/224T platform, based on v6.4-rc4, the score with
160 parallel copies improves by 24%.

The pcpu_chunk struct is a backing data structure per chunk, so the
additional memory should not be dramatic. A chunk covers ballpark
between 64KB and 512KB of memory, depending on config and boot-time
parameters, so I believe the additional memory used here is nominal
at best.

Working the numbers on my desktop:
  Percpu: 58624 kB
  28 cores -> ~2.1MB of percpu memory.
  At say ~128KB per chunk -> 33 chunks, generously 40 chunks.
Adding alignment might bump the chunk size by ~64 bytes, so in total
~2KB of overhead? I believe we can do a little better and avoid eating
the full padding, so likely less than that.

[dennis@kernel.org: changelog details]
Link: https://lkml.kernel.org/r/20230610030730.110074-1-yu.ma@intel.com
Signed-off-by: Yu Ma
Reviewed-by: Tim Chen
Acked-by: Dennis Zhou
Cc: Dan Williams
Cc: Dave Hansen
Cc: Liam R. Howlett
Cc: Shakeel Butt
Signed-off-by: Andrew Morton
(cherry picked from commit 3a6358c0dbe6a286a4f4504ba392a6039a9fbd12)
Signed-off-by: Philip Cox
---
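A note for reviewers: below is a minimal user-space analogue of the
re-layout, a sketch with hypothetical struct and member names (the
kernel patch uses ____cacheline_aligned_in_smp rather than C11
alignas, but the effect on the layout is the same): the frequently
written members stay together on one cache line, and the read-mostly
member is forced onto its own line so writes no longer invalidate it.

	#include <stdalign.h>
	#include <stddef.h>
	#include <stdio.h>

	#define CACHELINE 64	/* assumed x86-64 cache-line size */

	struct chunk_analogue {			/* hypothetical, not the kernel struct */
		int free_bytes;			/* written on every alloc/free */
		long chunk_md;			/* written on every alloc/free */
		unsigned long *bound_map;	/* moved up to fill the hot line */
		/* read-mostly member starts a fresh cache line */
		alignas(CACHELINE) void *base_addr;
	};

	int main(void)
	{
		/* prints a multiple of 64: base_addr no longer shares a
		 * cache line with the written members above it */
		printf("base_addr offset: %zu\n",
		       offsetof(struct chunk_analogue, base_addr));
		return 0;
	}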
 mm/percpu-internal.h | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 639662c20c82..0bc4c2eac808 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -40,10 +40,17 @@ struct pcpu_chunk {
 	struct list_head	list;		/* linked to pcpu_slot lists */
 	int			free_bytes;	/* free bytes in the chunk */
 	struct pcpu_block_md	chunk_md;
-	void			*base_addr;	/* base address of this chunk */
+	unsigned long		*bound_map;	/* boundary map */
+
+	/*
+	 * base_addr is the base address of this chunk.
+	 * To reduce false sharing, current layout is optimized to make sure
+	 * base_addr locate in the different cacheline with free_bytes and
+	 * chunk_md.
+	 */
+	void			*base_addr ____cacheline_aligned_in_smp;
 
 	unsigned long		*alloc_map;	/* allocation map */
-	unsigned long		*bound_map;	/* boundary map */
 	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
 
 	void			*data;		/* chunk data */
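As a sanity check of the overhead arithmetic in the changelog, here is
that estimate replayed as a tiny program; every input is one of the
changelog's ballpark figures, not a measurement:

	#include <stdio.h>

	int main(void)
	{
		long percpu_kb = 58624;	/* "Percpu:" figure quoted above */
		int cores = 28;
		int chunks = 40;	/* "generously 40 chunks" */
		int pad = 64;		/* up to one cache line added per chunk */

		/* ~2093 kB per core, i.e. ~2.1MB as in the changelog */
		printf("percpu per core: ~%ld kB\n", percpu_kb / cores);
		/* 2560 bytes, in line with the "~2KB of overhead" ballpark */
		printf("worst-case padding: %d bytes\n", chunks * pad);
		return 0;
	}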