From patchwork Thu Feb 15 18:07:08 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Philip Cox
X-Patchwork-Id: 1899672
From: Philip Cox
To: kernel-team@lists.ubuntu.com
Subject: [PATCH 1/1] sched/fair: Ratelimit update to tg->load_avg
Date: Thu, 15 Feb 2024 13:07:08 -0500
Message-Id: <20240215180708.891086-2-philip.cox@canonical.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240215180708.891086-1-philip.cox@canonical.com>
References: <20240215180708.891086-1-philip.cox@canonical.com>
List-Id: Kernel team discussions

From: Aaron Lu

BugLink: https://bugs.launchpad.net/bugs/2053251

When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, update_cfs_group() and
update_load_avg() at times show noticeable overhead on a
2sockets/112cores/224cpus Intel Sapphire Rapids (SPR):

    13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
    10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg

Annotate shows the cycles are mostly spent on accessing tg->load_avg,
with update_load_avg() being the write side and update_cfs_group() being
the read side. tg->load_avg is per task group, and when different tasks
of the same task group run on different CPUs and frequently access
tg->load_avg, it can be heavily contended.

E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
Sapphire Rapids, during a 5s window there are 14 million wakeups and
11 million migrations. With each migration, the task's load transfers
from the source cfs_rq to the target cfs_rq, and each change involves an
update to tg->load_avg. Since the workload can trigger this many wakeups
and migrations, the accesses (both read and write) to tg->load_avg can be
unbounded. As a result, the two functions mentioned above show noticeable
overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
during a 5s window there are 21 million wakeups and 14 million
migrations; update_cfs_group() costs ~25% and update_load_avg() costs
~16%.

Reduce the overhead by limiting updates to tg->load_avg to at most once
per ms. The update frequency is a tradeoff between tracking accuracy and
overhead. 1ms is chosen because the PELT window is roughly 1ms and it
delivered good results in the tests that I've done. After this change,
the cost of accessing tg->load_avg is greatly reduced and performance
improves. Detailed test results below.

==============================
postgres_sysbench on SPR:
25%       base:   42382±19.8%    patch:   50174±9.5%    (noise)
50%       base:   67626±1.3%     patch:   67365±3.1%    (noise)
75%       base:  100216±1.2%     patch:  112470±0.1%    +12.2%
100%      base:   93671±0.4%     patch:  113563±0.2%    +21.2%

==============================
hackbench on ICL:
group=1   base:  114912±5.2%     patch:  117857±2.5%    (noise)
group=4   base:  359902±1.6%     patch:  361685±2.7%    (noise)
group=8   base:  461070±0.8%     patch:  491713±0.3%    +6.6%
group=16  base:  309032±5.0%     patch:  378337±1.3%    +22.4%

=============================
hackbench on SPR:
group=1   base:  100768±2.9%     patch:  103134±2.9%    (noise)
group=4   base:  413830±12.5%    patch:  378660±16.6%   (noise)
group=8   base:  436124±0.6%     patch:  490787±3.2%    +12.5%
group=16  base:  457730±3.2%     patch:  680452±1.3%    +48.8%

============================
netperf/udp_rr on ICL
25%       base:  114413±0.1%     patch:  115111±0.0%    +0.6%
50%       base:   86803±0.5%     patch:   86611±0.0%    (noise)
75%       base:   35959±5.3%     patch:   49801±0.6%    +38.5%
100%      base:   61951±6.4%     patch:   70224±0.8%    +13.4%

===========================
netperf/udp_rr on SPR
25%       base:  104954±1.3%     patch:  107312±2.8%    (noise)
50%       base:   55394±4.6%     patch:   54940±7.4%    (noise)
75%       base:   13779±3.1%     patch:   36105±1.1%    +162%
100%      base:    9703±3.7%     patch:   28011±0.2%    +189%

==============================================
netperf/tcp_stream on ICL (all in noise range)
25%       base:   43092±0.1%     patch:   42891±0.5%
50%       base:   19278±14.9%    patch:   22369±7.2%
75%       base:   16822±3.0%     patch:   17086±2.3%
100%      base:   18216±0.6%     patch:   18078±2.9%

===============================================
netperf/tcp_stream on SPR (all in noise range)
25%       base:   34491±0.3%     patch:   34886±0.5%
50%       base:   19278±14.9%    patch:   22369±7.2%
75%       base:   16822±3.0%     patch:   17086±2.3%
100%      base:   18216±0.6%     patch:   18078±2.9%

Reported-by: Nitin Tekchandani
Suggested-by: Vincent Guittot
Signed-off-by: Aaron Lu
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Reviewed-by: Vincent Guittot
Reviewed-by: Mathieu Desnoyers
Reviewed-by: David Vernet
Tested-by: Mathieu Desnoyers
Tested-by: Swapnil Sapkal
Link: https://lkml.kernel.org/r/20230912065808.2530-2-aaron.lu@intel.com
(cherry picked from commit 1528c661c24b407e92194426b0adbb43de859ce0)
Signed-off-by: Philip Cox
---
 kernel/sched/fair.c  | 13 ++++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0541b5de927b..9c3d49af3be3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3373,7 +3373,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	long delta;
+	u64 now;
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -3381,9 +3382,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
+	/*
+	 * For migration heavy workloads, access to tg->load_avg can be
+	 * unbound. Limit the update rate to at most once per ms.
+	 */
+	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
+		return;
+
+	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->last_update_tg_load_avg = now;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bb2e1264e086..0bf67bec4b16 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -575,6 +575,7 @@ struct cfs_rq {
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	u64			last_update_tg_load_avg;
 	unsigned long		tg_load_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;
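
For anyone who wants to poke at the rate-limiting logic outside the
kernel, here is a minimal userspace sketch of the idea in the fair.c hunk
above. It is an illustration under stated assumptions, not kernel code:
the toy_* names and struct layouts are hypothetical stand-ins, the
timestamp is passed in by the caller instead of being read from
sched_clock_cpu(), the root_task_group check is left out, and a plain
long replaces the atomic_long_t, so it models only how often the shared
counter is written, not the cache-line contention itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NSEC_PER_MSEC 1000000ULL

/* Hypothetical stand-in for struct task_group; not the kernel layout. */
struct toy_tg {
	long load_avg;	/* group-wide sum (an atomic_long_t in the kernel) */
};

/* Hypothetical stand-in for the cfs_rq fields the patch touches. */
struct toy_cfs_rq {
	struct toy_tg *tg;
	long load_avg;				/* current local load */
	long tg_load_avg_contrib;		/* value last published to tg */
	uint64_t last_update_tg_load_avg;	/* time of last publish, in ns */
};

/*
 * Publish the local load to the group sum.
 * Returns 1 if the shared counter was written, 0 if the call was skipped.
 */
static int toy_update_tg_load_avg(struct toy_cfs_rq *cfs_rq, uint64_t now)
{
	long delta;

	/* Rate limit: at most one publish per millisecond. */
	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return 0;

	delta = cfs_rq->load_avg - cfs_rq->tg_load_avg_contrib;

	/* Skip changes no larger than 1/64 of the published value. */
	if (labs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		cfs_rq->tg->load_avg += delta;
		cfs_rq->tg_load_avg_contrib = cfs_rq->load_avg;
		cfs_rq->last_update_tg_load_avg = now;
		return 1;
	}
	return 0;
}

/*
 * Call the helper every 100us from t=1ms to t=6ms while the local load
 * keeps growing, and return how many calls wrote the shared counter.
 */
int toy_count_publishes(void)
{
	struct toy_tg tg = { 0 };
	struct toy_cfs_rq rq = { .tg = &tg };
	int writes = 0;
	uint64_t now;

	for (now = NSEC_PER_MSEC; now <= 6 * NSEC_PER_MSEC;
	     now += NSEC_PER_MSEC / 10) {
		rq.load_avg += 100;
		writes += toy_update_tg_load_avg(&rq, now);
	}

	/* The loop ends on a publish, so the group sum matches the contrib. */
	assert(tg.load_avg == rq.tg_load_avg_contrib);
	return writes;
}

#ifdef TOY_DEMO_MAIN
int main(void)
{
	printf("publishes: %d\n", toy_count_publishes());
	return 0;
}
#endif
```

Building with -DTOY_DEMO_MAIN gives a small standalone program. Note that,
as in the patch, the timestamp is only refreshed when a publish actually
happens, so a call that passes the time check but fails the 1/64
threshold does not restart the 1ms window.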