From patchwork Tue Feb 27 14:07:09 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Philip Cox
X-Patchwork-Id: 1905113
From: Philip Cox
To: kernel-team@lists.ubuntu.com
Subject: [SRU][mantic:linux][PATCH 1/1] sched/fair: Ratelimit update to
 tg->load_avg
Date: Tue, 27 Feb 2024 09:07:09 -0500
Message-Id: <20240227140709.1744449-2-philip.cox@canonical.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240227140709.1744449-1-philip.cox@canonical.com>
References: <20240227140709.1744449-1-philip.cox@canonical.com>
List-Id: Kernel team discussions

From: Aaron Lu

BugLink: https://bugs.launchpad.net/bugs/2053251

When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, it is observed that at times
update_cfs_group() and update_load_avg() show noticeable overhead on a
2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR):

    13.75%  13.74%  [kernel.vmlinux]  [k] update_cfs_group
    10.63%  10.04%  [kernel.vmlinux]  [k] update_load_avg

Annotation shows the cycles are mostly spent on accessing tg->load_avg,
with update_load_avg() being the write side and update_cfs_group() being
the read side. tg->load_avg is per task group, so when different tasks
of the same task group running on different CPUs frequently access
tg->load_avg, it can be heavily contended.

E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU Intel
Sapphire Rapids, during a 5s window the wakeup count is 14 million and
the migration count is 11 million. With each migration, the task's load
is transferred from the source cfs_rq to the target cfs_rq, and each
transfer involves an update to tg->load_avg. Since the workload can
trigger this many wakeups and migrations, the accesses (both read and
write) to tg->load_avg are effectively unbounded. As a result, the two
mentioned functions show noticeable overhead. With
netperf/nr_client=nr_cpu/UDP_RR, the problem is worse: during a 5s
window, the wakeup count is 21 million and the migration count is
14 million; update_cfs_group() costs ~25% and update_load_avg() costs
~16%.

Reduce the overhead by limiting updates to tg->load_avg to at most once
per ms. The update frequency is a tradeoff between tracking accuracy and
overhead. 1ms is chosen because the PELT window is roughly 1ms and it
delivered good results in the tests that I've done. After this change,
the cost of accessing tg->load_avg is greatly reduced and performance
improved. Detailed test results below.
==============================
postgres_sysbench on SPR:
 25% base:  42382±19.8%  patch:  50174±9.5%   (noise)
 50% base:  67626±1.3%   patch:  67365±3.1%   (noise)
 75% base: 100216±1.2%   patch: 112470±0.1%   +12.2%
100% base:  93671±0.4%   patch: 113563±0.2%   +21.2%

==============================
hackbench on ICL:
group=1  base: 114912±5.2%  patch: 117857±2.5%  (noise)
group=4  base: 359902±1.6%  patch: 361685±2.7%  (noise)
group=8  base: 461070±0.8%  patch: 491713±0.3%  +6.6%
group=16 base: 309032±5.0%  patch: 378337±1.3%  +22.4%

=============================
hackbench on SPR:
group=1  base: 100768±2.9%   patch: 103134±2.9%   (noise)
group=4  base: 413830±12.5%  patch: 378660±16.6%  (noise)
group=8  base: 436124±0.6%   patch: 490787±3.2%   +12.5%
group=16 base: 457730±3.2%   patch: 680452±1.3%   +48.8%

============================
netperf/udp_rr on ICL
 25% base: 114413±0.1%  patch: 115111±0.0%  +0.6%
 50% base:  86803±0.5%  patch:  86611±0.0%  (noise)
 75% base:  35959±5.3%  patch:  49801±0.6%  +38.5%
100% base:  61951±6.4%  patch:  70224±0.8%  +13.4%

===========================
netperf/udp_rr on SPR
 25% base: 104954±1.3%  patch: 107312±2.8%  (noise)
 50% base:  55394±4.6%  patch:  54940±7.4%  (noise)
 75% base:  13779±3.1%  patch:  36105±1.1%  +162%
100% base:   9703±3.7%  patch:  28011±0.2%  +189%

==============================================
netperf/tcp_stream on ICL (all in noise range)
 25% base: 43092±0.1%   patch: 42891±0.5%
 50% base: 19278±14.9%  patch: 22369±7.2%
 75% base: 16822±3.0%   patch: 17086±2.3%
100% base: 18216±0.6%   patch: 18078±2.9%

===============================================
netperf/tcp_stream on SPR (all in noise range)
 25% base: 34491±0.3%   patch: 34886±0.5%
 50% base: 19278±14.9%  patch: 22369±7.2%
 75% base: 16822±3.0%   patch: 17086±2.3%
100% base: 18216±0.6%   patch: 18078±2.9%

Reported-by: Nitin Tekchandani
Suggested-by: Vincent Guittot
Signed-off-by: Aaron Lu
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Reviewed-by: Vincent Guittot
Reviewed-by: Mathieu Desnoyers
Reviewed-by: David Vernet
Tested-by: Mathieu Desnoyers
Tested-by: Swapnil Sapkal
Link: https://lkml.kernel.org/r/20230912065808.2530-2-aaron.lu@intel.com
(cherry picked from commit 1528c661c24b407e92194426b0adbb43de859ce0)
Signed-off-by: Philip Cox
---
 kernel/sched/fair.c  | 13 ++++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e1b54dc2a21..f9a50eae6471 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	long delta;
+	u64 now;
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
+	/*
+	 * For migration heavy workloads, access to tg->load_avg can be
+	 * unbound. Limit the update rate to at most once per ms.
+	 */
+	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
+		return;
+
+	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->last_update_tg_load_avg = now;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e93e006a942b..8cb74bed9e55 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -591,6 +591,7 @@ struct cfs_rq {
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	u64			last_update_tg_load_avg;
 	unsigned long		tg_load_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;
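
For readers who want to see the ratelimiting pattern in isolation, the
following is a minimal standalone C sketch of the same idea. Everything
here is a userspace stand-in, not part of the patch: local_state,
update_shared_load() and the clock_gettime()-based clock are illustrative
replacements for cfs_rq, update_tg_load_avg() and sched_clock_cpu().

/*
 * Sketch: many callers may request an update of a shared counter, but a
 * per-caller timestamp gate allows at most one atomic update per ms.
 *
 * Build: cc -std=c11 -O2 ratelimit_sketch.c
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NSEC_PER_MSEC 1000000ULL

/* Stand-in for the heavily contended tg->load_avg. */
static atomic_long shared_load;

/* Stand-in for the per-cfs_rq bookkeeping fields. */
struct local_state {
	long contrib;		/* last value published to shared_load */
	uint64_t last_update;	/* time of last publication, in ns     */
};

/* Userspace stand-in for the kernel's sched_clock_cpu(). */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

static void update_shared_load(struct local_state *ls, long load)
{
	uint64_t now = now_ns();
	long delta;

	/* The ratelimit gate: at most one publication per ms. */
	if (now - ls->last_update < NSEC_PER_MSEC)
		return;

	delta = load - ls->contrib;

	/*
	 * Only publish significant changes (> 1/64 of the last published
	 * contribution), and only then refresh the timestamp -- mirroring
	 * how the patch sets last_update_tg_load_avg inside the
	 * abs(delta) branch.
	 */
	if (labs(delta) > ls->contrib / 64) {
		atomic_fetch_add(&shared_load, delta);
		ls->contrib = load;
		ls->last_update = now;
	}
}

int main(void)
{
	struct local_state ls = { 0, 0 };
	long i;

	/* Hammer the update path; the gate turns millions of calls into
	 * roughly one atomic RMW per millisecond of runtime. */
	for (i = 1; i <= 10000000; i++)
		update_shared_load(&ls, i);

	printf("shared_load = %ld\n", atomic_load(&shared_load));
	return 0;
}

Note that, as in the patch, the timestamp is refreshed only when a delta
is actually folded into the shared value, so a run of insignificant
deltas does not keep postponing the next real update.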