From patchwork Fri Aug 20 19:00:31 2021
X-Patchwork-Submitter: Tim Gardner
X-Patchwork-Id: 1519155
From: Tim Gardner <tim.gardner@canonical.com>
To: kernel-team@lists.ubuntu.com
Subject: [PATCH][focal/linux-azure] Drivers: hv: vmbus: Fix duplicate CPU assignments within a device
Date: Fri, 20 Aug 2021 13:00:31 -0600
Message-Id: <20210820190032.8956-2-tim.gardner@canonical.com>
X-Mailer: git-send-email 2.33.0
In-Reply-To: <20210820190032.8956-1-tim.gardner@canonical.com>
References: <20210820190032.8956-1-tim.gardner@canonical.com>

From: Haiyang Zhang <haiyangz@microsoft.com>

BugLink: https://bugs.launchpad.net/bugs/1937078

The vmbus module uses a rotational algorithm to assign target CPUs to
a device's channels. Depending on the timing of different devices'
channel offers, different channels of a device may be assigned to the
same CPU. For example, on a VM with 2 CPUs, if the channels of NIC A
and NIC B are offered in the following order, NIC A will have both
channels on CPU0 and NIC B will have both channels on CPU1 -- see
below. This kind of assignment causes RSS load that is spread across
different channels to end up on the same CPU.

Timing of channel offers:
NIC A channel 0
NIC B channel 0
NIC A channel 1
NIC B channel 1

VMBUS ID 14: Class_ID = {f8615163-df3e-46c5-913f-f2d2f965ed0e} - Synthetic network adapter
	Device_ID = {cab064cd-1f31-47d5-a8b4-9d57e320cccd}
	Sysfs path: /sys/bus/vmbus/devices/cab064cd-1f31-47d5-a8b4-9d57e320cccd
	Rel_ID=14, target_cpu=0
	Rel_ID=17, target_cpu=0

VMBUS ID 16: Class_ID = {f8615163-df3e-46c5-913f-f2d2f965ed0e} - Synthetic network adapter
	Device_ID = {244225ca-743e-4020-a17d-d7baa13d6cea}
	Sysfs path: /sys/bus/vmbus/devices/244225ca-743e-4020-a17d-d7baa13d6cea
	Rel_ID=16, target_cpu=1
	Rel_ID=18, target_cpu=1

Update the vmbus CPU assignment algorithm to avoid duplicate CPU
assignments within a device.

The new algorithm iterates num_online_cpus + 1 times. The existing
rotational algorithm for finding the "next NUMA & CPU" is kept, but
if the resulting CPU is already used by the same device, the next CPU
is tried. In the last iteration, the channel is assigned to the next
available CPU as in the existing algorithm. This case is not normally
expected, because during device probe we limit the number of channels
of a device to <= the number of online CPUs.
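To see the old behavior in isolation, here is a minimal userspace
model (not the kernel code), assuming 2 online CPUs, the interleaved
offer order above, and an illustrative global next_cpu counter in
place of the kernel's per-NUMA-node rotation state:

/*
 * Sketch of the old behavior: a single rotation cursor with no
 * per-device state. With offers interleaved as A0, B0, A1, B1 on
 * 2 CPUs, both NIC A channels land on CPU0 and both NIC B channels
 * on CPU1.
 */
#include <stdio.h>

#define NUM_CPUS 2

int main(void)
{
	const char *offers[] = { "NIC A ch0", "NIC B ch0",
				 "NIC A ch1", "NIC B ch1" };
	unsigned int next_cpu = 0;	/* global round-robin cursor */

	for (int i = 0; i < 4; i++)
		printf("%s -> CPU%u\n", offers[i], next_cpu++ % NUM_CPUS);
	/* Prints: NIC A ch0 -> CPU0, NIC B ch0 -> CPU1,
	 *         NIC A ch1 -> CPU0, NIC B ch1 -> CPU1 */
	return 0;
}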
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Michael Kelley
Tested-by: Michael Kelley
Link: https://lore.kernel.org/r/1626459673-17420-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Wei Liu
(backported from commit 7c9ff3deeee61b253715dcf968a6307af148c9b2)
[rtg - serious surgery on init_vp_index()]
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
---
 drivers/hv/channel_mgmt.c | 139 ++++++++++++++++++--------------------
 1 file changed, 67 insertions(+), 72 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 2b5d20965b1bf..8f83ab617c5fd 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -445,7 +445,9 @@ static void vmbus_add_channel_work(struct work_struct *work)
 
 	dev_type = hv_get_dev_type(newchannel);
 
+	mutex_lock(&vmbus_connection.channel_mutex);
 	init_vp_index(newchannel, dev_type);
+	mutex_unlock(&vmbus_connection.channel_mutex);
 
 	if (newchannel->target_cpu != get_cpu()) {
 		put_cpu();
@@ -553,26 +555,29 @@ static void vmbus_process_offer(struct vmbus_channel *newchannel)
 
 	mutex_lock(&vmbus_connection.channel_mutex);
 
-	/* Remember the channels that should be cleaned up upon suspend. */
-	if (is_hvsock_channel(newchannel) || is_sub_channel(newchannel))
-		atomic_inc(&vmbus_connection.nr_chan_close_on_suspend);
-
-	/*
-	 * Now that we have acquired the channel_mutex,
-	 * we can release the potentially racing rescind thread.
-	 */
-	atomic_dec(&vmbus_connection.offer_in_progress);
-
 	list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
 		if (guid_equal(&channel->offermsg.offer.if_type,
 			       &newchannel->offermsg.offer.if_type) &&
 		    guid_equal(&channel->offermsg.offer.if_instance,
 			       &newchannel->offermsg.offer.if_instance)) {
 			fnew = false;
+			newchannel->primary_channel = channel;
 			break;
 		}
 	}
 
+	init_vp_index(newchannel, hv_get_dev_type(newchannel));
+
+	/* Remember the channels that should be cleaned up upon suspend. */
+	if (is_hvsock_channel(newchannel) || is_sub_channel(newchannel))
+		atomic_inc(&vmbus_connection.nr_chan_close_on_suspend);
+
+	/*
+	 * Now that we have acquired the channel_mutex,
+	 * we can release the potentially racing rescind thread.
+	 */
+	atomic_dec(&vmbus_connection.offer_in_progress);
+
 	if (fnew)
 		list_add_tail(&newchannel->listentry,
 			      &vmbus_connection.chn_list);
@@ -593,7 +598,6 @@ static void vmbus_process_offer(struct vmbus_channel *newchannel)
 		/*
 		 * Process the sub-channel.
 		 */
-		newchannel->primary_channel = channel;
 		spin_lock_irqsave(&channel->lock, flags);
 		list_add_tail(&newchannel->sc_list, &channel->sc_list);
 		spin_unlock_irqrestore(&channel->lock, flags);
@@ -628,6 +632,30 @@ static void vmbus_process_offer(struct vmbus_channel *newchannel)
 
 	queue_work(wq, &newchannel->add_channel_work);
 }
+/*
+ * Check if CPUs used by other channels of the same device.
+ * It should only be called by init_vp_index().
+ */
+static bool hv_cpuself_used(u32 cpu, struct vmbus_channel *chn)
+{
+	struct vmbus_channel *primary = chn->primary_channel;
+	struct vmbus_channel *sc;
+
+	lockdep_assert_held(&vmbus_connection.channel_mutex);
+
+	if (!primary)
+		return false;
+
+	if (primary->target_cpu == cpu)
+		return true;
+
+	list_for_each_entry(sc, &primary->sc_list, sc_list)
+		if (sc != chn && sc->target_cpu == cpu)
+			return true;
+
+	return false;
+}
+
 /*
  * We use this state to statically distribute the channel interrupt load.
  */
@@ -651,12 +679,13 @@ static DEFINE_SPINLOCK(bind_channel_to_cpu_lock);
  */
 static void init_vp_index(struct vmbus_channel *channel, u16 dev_type)
 {
-	u32 cur_cpu;
 	bool perf_chn = vmbus_devs[dev_type].perf_device;
 	struct vmbus_channel *primary = channel->primary_channel;
-	int next_node;
+	u32 i, ncpu = num_online_cpus();
 	cpumask_var_t available_mask;
 	struct cpumask *alloced_mask;
+	u32 target_cpu;
+	int numa_node;
 
 	if ((vmbus_proto_version == VERSION_WS2008) ||
 	    (vmbus_proto_version == VERSION_WIN7) || (!perf_chn) ||
@@ -676,40 +705,38 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type)
 
 	spin_lock(&bind_channel_to_cpu_lock);
 
-	/*
-	 * Based on the channel affinity policy, we will assign the NUMA
-	 * nodes.
-	 */
-
-	if ((channel->affinity_policy == HV_BALANCED) || (!primary)) {
+	for (i = 1; i <= ncpu + 1; i++) {
 		while (true) {
-			next_node = next_numa_node_id++;
-			if (next_node == nr_node_ids) {
-				next_node = next_numa_node_id = 0;
+			numa_node = next_numa_node_id++;
+			if (numa_node == nr_node_ids) {
+				next_numa_node_id = 0;
 				continue;
 			}
-			if (cpumask_empty(cpumask_of_node(next_node)))
+			if (cpumask_empty(cpumask_of_node(numa_node)))
 				continue;
 			break;
 		}
-		channel->numa_node = next_node;
-		primary = channel;
-	}
-	alloced_mask = &hv_context.hv_numa_map[primary->numa_node];
+		alloced_mask = &hv_context.hv_numa_map[numa_node];
 
-	if (cpumask_weight(alloced_mask) ==
-	    cpumask_weight(cpumask_of_node(primary->numa_node))) {
-		/*
-		 * We have cycled through all the CPUs in the node;
-		 * reset the alloced map.
-		 */
-		cpumask_clear(alloced_mask);
-	}
+		if (cpumask_weight(alloced_mask) ==
+		    cpumask_weight(cpumask_of_node(numa_node))) {
+			/*
+			 * We have cycled through all the CPUs in the node;
+			 * reset the alloced map.
+			 */
+			cpumask_clear(alloced_mask);
+		}
 
-	cpumask_xor(available_mask, alloced_mask,
-		    cpumask_of_node(primary->numa_node));
+		cpumask_xor(available_mask, alloced_mask,
+			    cpumask_of_node(numa_node));
 
-	cur_cpu = -1;
+		target_cpu = cpumask_first(available_mask);
+		cpumask_set_cpu(target_cpu, alloced_mask);
+
+		if (channel->offermsg.offer.sub_channel_index >= ncpu ||
+		    i > ncpu || !hv_cpuself_used(target_cpu, channel))
+			break;
+	}
 
 	if (primary->affinity_policy == HV_LOCALIZED) {
 		/*
@@ -723,40 +750,8 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type)
 			cpumask_clear(&primary->alloced_cpus_in_node);
 	}
 
-	while (true) {
-		cur_cpu = cpumask_next(cur_cpu, available_mask);
-		if (cur_cpu >= nr_cpu_ids) {
-			cur_cpu = -1;
-			cpumask_copy(available_mask,
-				     cpumask_of_node(primary->numa_node));
-			continue;
-		}
-
-		if (primary->affinity_policy == HV_LOCALIZED) {
-			/*
-			 * NOTE: in the case of sub-channel, we clear the
-			 * sub-channel related bit(s) in
-			 * primary->alloced_cpus_in_node in
-			 * hv_process_channel_removal(), so when we
-			 * reload drivers like hv_netvsc in SMP guest, here
-			 * we're able to re-allocate
-			 * bit from primary->alloced_cpus_in_node.
-			 */
-			if (!cpumask_test_cpu(cur_cpu,
-					      &primary->alloced_cpus_in_node)) {
-				cpumask_set_cpu(cur_cpu,
-						&primary->alloced_cpus_in_node);
-				cpumask_set_cpu(cur_cpu, alloced_mask);
-				break;
-			}
-		} else {
-			cpumask_set_cpu(cur_cpu, alloced_mask);
-			break;
-		}
-	}
-
-	channel->target_cpu = cur_cpu;
-	channel->target_vp = hv_cpu_number_to_vp_number(cur_cpu);
+	channel->target_cpu = target_cpu;
+	channel->target_vp = hv_cpu_number_to_vp_number(target_cpu);
 
 	spin_unlock(&bind_channel_to_cpu_lock);
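For reference, a minimal standalone sketch of the selection loop added
above, under simplifying assumptions: a fictional struct chan with
plain arrays replaces the kernel's cpumasks and vmbus_channel lists,
and the NUMA-node rotation is collapsed into a single counter. It
mirrors hv_cpuself_used() and the (ncpu + 1)-iteration retry logic,
but it is not the kernel code.

#include <stdbool.h>
#include <stddef.h>

struct chan {
	unsigned int target_cpu;
	struct chan *primary;		/* NULL for a primary channel */
	struct chan *siblings[8];	/* other channels of the device */
	size_t nr_siblings;
};

/* Mirrors hv_cpuself_used(): is this CPU taken by a sibling channel? */
static bool cpu_self_used(unsigned int cpu, const struct chan *ch)
{
	const struct chan *primary = ch->primary;

	if (!primary)
		return false;
	if (primary->target_cpu == cpu)
		return true;
	for (size_t i = 0; i < primary->nr_siblings; i++)
		if (primary->siblings[i] != ch &&
		    primary->siblings[i]->target_cpu == cpu)
			return true;
	return false;
}

/*
 * Rotate as before, but retry when the candidate CPU is already used
 * by the same device; the (ncpu + 1)-th pass accepts any CPU.
 */
static unsigned int pick_cpu(const struct chan *ch, unsigned int ncpu,
			     unsigned int *next_cpu)
{
	unsigned int cpu = 0;

	for (unsigned int i = 1; i <= ncpu + 1; i++) {
		cpu = (*next_cpu)++ % ncpu;	/* existing rotation */
		if (i > ncpu || !cpu_self_used(cpu, ch))
			break;
	}
	return cpu;
}

int main(void)
{
	/* NIC A's two channels on a 2-CPU guest; NIC B's primary offer
	 * consumes one rotation slot in between (order: A0, B0, A1). */
	struct chan a0 = { .primary = NULL };
	struct chan a1 = { .primary = &a0 };
	unsigned int next_cpu = 0;

	a0.target_cpu = pick_cpu(&a0, 2, &next_cpu);	/* CPU0 */
	next_cpu++;					/* B0 takes CPU1 */
	a1.target_cpu = pick_cpu(&a1, 2, &next_cpu);	/* rotation offers
							 * CPU0 again, but a0
							 * uses it, so CPU1 */
	return 0;
}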