From patchwork Fri Jun 4 21:18:52 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kevin Traynor X-Patchwork-Id: 1488115 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=openvswitch.org (client-ip=2605:bc80:3010::136; helo=smtp3.osuosl.org; envelope-from=ovs-dev-bounces@openvswitch.org; receiver=) Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=DUOUS2dr; dkim-atps=neutral Received: from smtp3.osuosl.org (smtp3.osuosl.org [IPv6:2605:bc80:3010::136]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4FxbHs6d28z9sRK for ; Sat, 5 Jun 2021 07:19:37 +1000 (AEST) Received: from localhost (localhost [127.0.0.1]) by smtp3.osuosl.org (Postfix) with ESMTP id 4025F606D9; Fri, 4 Jun 2021 21:19:34 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Received: from smtp3.osuosl.org ([127.0.0.1]) by localhost (smtp3.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 6pHFLKih3cei; Fri, 4 Jun 2021 21:19:30 +0000 (UTC) Received: from lists.linuxfoundation.org (lf-lists.osuosl.org [140.211.9.56]) by smtp3.osuosl.org (Postfix) with ESMTP id C135D60754; Fri, 4 Jun 2021 21:19:27 +0000 (UTC) Received: from lf-lists.osuosl.org (localhost [127.0.0.1]) by lists.linuxfoundation.org (Postfix) with ESMTP id 1A8B7C0025; Fri, 4 Jun 2021 21:19:27 +0000 (UTC) X-Original-To: dev@openvswitch.org Delivered-To: ovs-dev@lists.linuxfoundation.org Received: from smtp4.osuosl.org (smtp4.osuosl.org [IPv6:2605:bc80:3010::137]) by lists.linuxfoundation.org (Postfix) with ESMTP id 62FC2C0025 for ; Fri, 4 Jun 2021 21:19:26 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by smtp4.osuosl.org (Postfix) with ESMTP id 4393440695 for ; Fri, 4 Jun 2021 21:19:26 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Authentication-Results: smtp4.osuosl.org (amavisd-new); dkim=pass (1024-bit key) header.d=redhat.com Received: from smtp4.osuosl.org ([127.0.0.1]) by localhost (smtp4.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id mzgvbpnxK9BX for ; Fri, 4 Jun 2021 21:19:22 +0000 (UTC) X-Greylist: domain auto-whitelisted by SQLgrey-1.8.0 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [216.205.24.124]) by smtp4.osuosl.org (Postfix) with ESMTPS id D370040660 for ; Fri, 4 Jun 2021 21:19:21 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1622841560; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=kHgyuCR4UzvEnRmxAp6kfeVRxwLsLvx4DIl8FudltlA=; b=DUOUS2drdQm7CxXzsuNi1Lqd7I636cvLC10Astcl4KcbFqPGaW+HRedZeXrWFHtQHCfSW2 OMeIqimyoiDzzkby0cAACWRiihttJTN5C+XoGhvyRtMR1i8Uk9hdTGLVQbr65M4XRaRAnI hkV10IuAmIHO0p3QWc7eoZMXV/Qpmeg= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-481-j2Xp3-eHNaSV_rjsK5Iblg-1; Fri, 04 Jun 2021 
17:19:19 -0400 X-MC-Unique: j2Xp3-eHNaSV_rjsK5Iblg-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 17B75801107; Fri, 4 Jun 2021 21:19:18 +0000 (UTC) Received: from rh.redhat.com (ovpn-114-242.ams2.redhat.com [10.36.114.242]) by smtp.corp.redhat.com (Postfix) with ESMTP id AE45A10023AC; Fri, 4 Jun 2021 21:19:16 +0000 (UTC) From: Kevin Traynor To: dev@openvswitch.org Date: Fri, 4 Jun 2021 22:18:52 +0100 Message-Id: <20210604211856.915563-2-ktraynor@redhat.com> In-Reply-To: <20210604211856.915563-1-ktraynor@redhat.com> References: <20210604211856.915563-1-ktraynor@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 Authentication-Results: relay.mimecast.com; auth=pass smtp.auth=CUSA124A263 smtp.mailfrom=ktraynor@redhat.com X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Cc: david.marchand@redhat.com Subject: [ovs-dev] [PATCH 1/5] dpif-netdev: Rework rxq scheduling code. X-BeenThere: ovs-dev@openvswitch.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: ovs-dev-bounces@openvswitch.org Sender: "dev" This reworks the current rxq scheduling code to break it into more generic and reusable pieces. The behaviour does not change from a user perspective, except the logs are updated to be more consistent. From an implementation view, there are some changes with mind to adding functionality and reuse in later patches. The high level reusable functions added in this patch are: - Generate a list of current numas and pmds - Perform rxq scheduling into the list - Effect the rxq scheduling assignments so they are used The rxq scheduling is updated to handle both pinned and non-pinned rxqs in the same call. Signed-off-by: Kevin Traynor --- lib/dpif-netdev.c | 538 ++++++++++++++++++++++++++++++++++++++-------- tests/pmd.at | 2 +- 2 files changed, 446 insertions(+), 94 deletions(-) diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 650e67ab3..57d23e112 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -5006,4 +5006,211 @@ rr_numa_list_destroy(struct rr_numa_list *rr) } +struct sched_numa_list { + struct hmap numas; /* Contains 'struct sched_numa'. */ +}; + +/* Meta data for out-of-place pmd rxq assignments. */ +struct sched_pmd { + /* Associated PMD thread. */ + struct dp_netdev_pmd_thread *pmd; + uint64_t pmd_proc_cycles; + struct dp_netdev_rxq **rxqs; + unsigned n_rxq; + bool isolated; +}; + +struct sched_numa { + struct hmap_node node; + int numa_id; + /* PMDs on numa node. */ + struct sched_pmd *pmds; + /* Num of PMDs on numa node. */ + unsigned n_pmds; + /* Num of isolated PMDs on numa node. */ + unsigned n_iso; + int rr_cur_index; + bool rr_idx_inc; +}; + +static size_t +sched_numa_list_count(struct sched_numa_list *numa_list) +{ + return hmap_count(&numa_list->numas); +} + +static struct sched_numa * +sched_numa_list_next(struct sched_numa_list *numa_list, + const struct sched_numa *numa) +{ + struct hmap_node *node = NULL; + + if (numa) { + node = hmap_next(&numa_list->numas, &numa->node); + } + if (!node) { + node = hmap_first(&numa_list->numas); + } + + return (node) ? 
CONTAINER_OF(node, struct sched_numa, node) : NULL; +} + +static struct sched_numa * +sched_numa_list_lookup(struct sched_numa_list * numa_list, int numa_id) +{ + struct sched_numa *numa; + + HMAP_FOR_EACH_WITH_HASH (numa, node, hash_int(numa_id, 0), + &numa_list->numas) { + if (numa->numa_id == numa_id) { + return numa; + } + } + return NULL; +} + +/* Populate numas and pmds on those numas */ +static void +sched_numa_list_populate(struct sched_numa_list *numa_list, + struct dp_netdev *dp) +{ + struct dp_netdev_pmd_thread *pmd; + hmap_init(&numa_list->numas); + + /* For each pmd on this datapath. */ + CMAP_FOR_EACH (pmd, node, &dp->poll_threads) { + struct sched_numa *numa; + struct sched_pmd *sched_pmd; + if (pmd->core_id == NON_PMD_CORE_ID) { + continue; + } + + /* Get the numa of the PMD. */ + numa = sched_numa_list_lookup(numa_list, pmd->numa_id); + /* Create a new numa node for it if not already created */ + if (!numa) { + numa = xzalloc(sizeof *numa); + numa->numa_id = pmd->numa_id; + hmap_insert(&numa_list->numas, &numa->node, + hash_int(pmd->numa_id, 0)); + } + + /* Create a sched_pmd on this numa for the pmd. */ + numa->n_pmds++; + numa->pmds = xrealloc(numa->pmds, numa->n_pmds * sizeof *numa->pmds); + sched_pmd = &numa->pmds[numa->n_pmds - 1]; + memset(sched_pmd ,0, sizeof *sched_pmd); + sched_pmd->pmd = pmd; + /* At least one pmd is present so initialize curr_idx and idx_inc. */ + numa->rr_cur_index = 0; + numa->rr_idx_inc = true; + } +} + +static void +sched_numa_list_free_entries(struct sched_numa_list *numa_list) +{ + struct sched_numa *numa; + + HMAP_FOR_EACH_POP (numa, node, &numa_list->numas) { + for (unsigned i = 0; i < numa->n_pmds; i++) { + struct sched_pmd *sched_pmd; + + sched_pmd = &numa->pmds[i]; + sched_pmd->n_rxq = 0; + free(sched_pmd->rxqs); + } + numa->n_pmds = 0; + free(numa->pmds); + } + hmap_destroy(&numa_list->numas); +} + +static struct sched_pmd * +find_sched_pmd_by_pmd(struct sched_numa_list *numa_list, + struct dp_netdev_pmd_thread *pmd) +{ + struct sched_numa *numa; + + HMAP_FOR_EACH (numa, node, &numa_list->numas) { + for (unsigned i = 0; i < numa->n_pmds; i++) { + struct sched_pmd *sched_pmd; + + sched_pmd = &numa->pmds[i]; + if (pmd == sched_pmd->pmd) { + return sched_pmd; + } + } + } + return NULL; +} + +static struct sched_numa * +sched_numa_list_find_numa(struct sched_numa_list *numa_list, + struct sched_pmd *sched_pmd) +{ + struct sched_numa *numa; + + HMAP_FOR_EACH (numa, node, &numa_list->numas) { + for (unsigned i = 0; i < numa->n_pmds; i++) { + struct sched_pmd *numa_sched_pmd; + + numa_sched_pmd = &numa->pmds[i]; + if (numa_sched_pmd == sched_pmd) { + return numa; + } + } + } + return NULL; +} + +static void +sched_add_rxq_to_sched_pmd(struct sched_pmd *sched_pmd, + struct dp_netdev_rxq *rxq, uint64_t cycles) +{ + /* As sched_pmd is allocated outside this fn. better to not assume + * rxq is initialized to NULL. 
*/ + if (sched_pmd->n_rxq == 0) { + sched_pmd->rxqs = xmalloc(sizeof *sched_pmd->rxqs); + } else { + sched_pmd->rxqs = xrealloc(sched_pmd->rxqs, (sched_pmd->n_rxq + 1) * + sizeof *sched_pmd->rxqs); + } + + sched_pmd->rxqs[sched_pmd->n_rxq++] = rxq; + sched_pmd->pmd_proc_cycles += cycles; +} + +static void +sched_numa_list_put_in_place(struct sched_numa_list *numa_list) +{ + struct sched_numa *numa; + + /* For each numa */ + HMAP_FOR_EACH (numa, node, &numa_list->numas) { + /* For each pmd */ + for (int i = 0; i < numa->n_pmds; i++) { + struct sched_pmd *sched_pmd; + + sched_pmd = &numa->pmds[i]; + sched_pmd->pmd->isolated = sched_pmd->isolated; + /* For each rxq. */ + for (unsigned k = 0; k < sched_pmd->n_rxq; k++) { + /* Store the new pmd from the out of place sched_numa_list + * struct to the dp_netdev_rxq struct */ + sched_pmd->rxqs[k]->pmd = sched_pmd->pmd; + } + } + } +} + +static unsigned +sched_get_numa_pmd_noniso(struct sched_numa *numa) +{ + if (numa->n_pmds > numa->n_iso) { + return numa->n_pmds - numa->n_iso; + } + return 0; +} + /* Sort Rx Queues by the processing cycles they are consuming. */ static int @@ -5037,22 +5244,106 @@ compare_rxq_cycles(const void *a, const void *b) } -/* Assign pmds to queues. If 'pinned' is true, assign pmds to pinned - * queues and marks the pmds as isolated. Otherwise, assign non isolated - * pmds to unpinned queues. +/* + * Returns the next pmd from the numa node. * - * The function doesn't touch the pmd threads, it just stores the assignment - * in the 'pmd' member of each rxq. */ + * If 'updown' is 'true' it will alternate between selecting the next pmd in + * either an up or down walk, switching between up/down when the first or last + * core is reached. e.g. 1,2,3,3,2,1,1,2... + * + * If 'updown' is 'false' it will select the next pmd wrapping around when + * last core reached. e.g. 1,2,3,1,2,3,1,2... + */ +static struct sched_pmd * +get_rr_pmd(struct sched_numa *numa, bool updown) +{ + int numa_idx = numa->rr_cur_index; + + if (numa->rr_idx_inc == true) { + /* Incrementing through list of pmds. */ + if (numa->rr_cur_index == numa->n_pmds - 1) { + /* Reached the last pmd. */ + if (updown) { + numa->rr_idx_inc = false; + } else { + numa->rr_cur_index = 0; + } + } else { + numa->rr_cur_index++; + } + } else { + /* Decrementing through list of pmds. */ + if (numa->rr_cur_index == 0) { + /* Reached the first pmd. */ + numa->rr_idx_inc = true; + } else { + numa->rr_cur_index--; + } + } + return &numa->pmds[numa_idx]; +} + +static struct sched_pmd * +get_available_rr_pmd(struct sched_numa *numa, bool updown) +{ + struct sched_pmd *sched_pmd = NULL; + + /* get_rr_pmd() may return duplicate PMDs before all PMDs have been + * returned depending on updown. Extend the number of call to ensure all + * PMDs can be checked. 
*/ + for (unsigned i = 0; i < numa->n_pmds * 2; i++) { + sched_pmd = get_rr_pmd(numa, updown); + if (!sched_pmd->isolated) { + break; + } + sched_pmd = NULL; + } + return sched_pmd; +} + +static struct sched_pmd * +get_next_pmd(struct sched_numa *numa, bool algo) +{ + return get_available_rr_pmd(numa, algo); +} + +static const char * +get_assignment_type_string(bool algo) +{ + if (algo == false) { + return "roundrobin"; + } + return "cycles"; +} + +#define MAX_RXQ_CYC_STRLEN (INT_STRLEN(uint64_t) + 40) + +static bool +get_rxq_cyc_log(char *a, bool algo, uint64_t cycles) +{ + int ret = 0; + + if (algo) { + ret = snprintf(a, MAX_RXQ_CYC_STRLEN, + " (measured processing cycles %"PRIu64").", + cycles); + } + return ret > 0; +} + static void -rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex) +sched_numa_list_schedule(struct sched_numa_list *numa_list, + struct dp_netdev *dp, + bool algo, + enum vlog_level level) + OVS_REQUIRES(dp->port_mutex) { struct dp_netdev_port *port; - struct rr_numa_list rr; - struct rr_numa *non_local_numa = NULL; struct dp_netdev_rxq ** rxqs = NULL; - int n_rxqs = 0; - struct rr_numa *numa = NULL; - int numa_id; - bool assign_cyc = dp->pmd_rxq_assign_cyc; + struct sched_numa *last_cross_numa; + unsigned n_rxqs = 0; + bool start_logged = false; + size_t n_numa; + /* For each port. */ HMAP_FOR_EACH (port, node, &dp->ports) { if (!netdev_is_pmd(port->netdev)) { @@ -5060,48 +5351,68 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex) } + /* For each rxq on the port. */ for (int qid = 0; qid < port->n_rxq; qid++) { - struct dp_netdev_rxq *q = &port->rxqs[qid]; + struct dp_netdev_rxq *rxq = &port->rxqs[qid]; - if (pinned && q->core_id != OVS_CORE_UNSPEC) { - struct dp_netdev_pmd_thread *pmd; + rxqs = xrealloc(rxqs, (n_rxqs + 1) * sizeof *rxqs); + rxqs[n_rxqs++] = rxq; - pmd = dp_netdev_get_pmd(dp, q->core_id); - if (!pmd) { - VLOG_WARN("There is no PMD thread on core %d. Queue " - "%d on port \'%s\' will not be polled.", - q->core_id, qid, netdev_get_name(port->netdev)); - } else { - q->pmd = pmd; - pmd->isolated = true; - VLOG_INFO("Core %d on numa node %d assigned port \'%s\' " - "rx queue %d.", pmd->core_id, pmd->numa_id, - netdev_rxq_get_name(q->rx), - netdev_rxq_get_queue_id(q->rx)); - dp_netdev_pmd_unref(pmd); - } - } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) { + if (algo == true) { uint64_t cycle_hist = 0; - if (n_rxqs == 0) { - rxqs = xmalloc(sizeof *rxqs); - } else { - rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1)); + /* Sum the queue intervals and store the cycle history. */ + for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) { + cycle_hist += dp_netdev_rxq_get_intrvl_cycles(rxq, i); } + dp_netdev_rxq_set_cycles(rxq, RXQ_CYCLES_PROC_HIST, + cycle_hist); + } - if (assign_cyc) { - /* Sum the queue intervals and store the cycle history. */ - for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) { - cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i); - } - dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST, - cycle_hist); + /* Check if this rxq is pinned. */ + if (rxq->core_id != OVS_CORE_UNSPEC) { + struct sched_pmd *sched_pmd = NULL; + struct dp_netdev_pmd_thread *pmd; + struct sched_numa *numa; + uint64_t proc_cycles; + char rxq_cyc_log[MAX_RXQ_CYC_STRLEN]; + + + /* This rxq should be pinned, pin it now. */ + pmd = dp_netdev_get_pmd(dp, rxq->core_id); + sched_pmd = find_sched_pmd_by_pmd(numa_list, pmd); + if (!sched_pmd) { + /* Cannot find the PMD. Cannot pin this rxq. 
*/ + VLOG(level == VLL_DBG ? VLL_DBG : VLL_WARN, + "Core %2u cannot be pinned with " + "port \'%s\' rx queue %d. Use pmd-cpu-mask to " + "enable a pmd on core %u.", + rxq->core_id, + netdev_rxq_get_name(rxq->rx), + netdev_rxq_get_queue_id(rxq->rx), + rxq->core_id); + continue; + } + /* Mark PMD as isolated if not done already. */ + if (sched_pmd->isolated == false) { + sched_pmd->isolated = true; + numa = sched_numa_list_find_numa(numa_list, + sched_pmd); + numa->n_iso++; } - /* Store the queue. */ - rxqs[n_rxqs++] = q; + proc_cycles = dp_netdev_rxq_get_cycles(rxq, + RXQ_CYCLES_PROC_HIST); + VLOG(level, "Core %2u on numa node %d is pinned with " + "port \'%s\' rx queue %d.%s", + sched_pmd->pmd->core_id, sched_pmd->pmd->numa_id, + netdev_rxq_get_name(rxq->rx), + netdev_rxq_get_queue_id(rxq->rx), + get_rxq_cyc_log(rxq_cyc_log, algo, proc_cycles) + ? rxq_cyc_log : ""); + sched_add_rxq_to_sched_pmd(sched_pmd, rxq, proc_cycles); } } } - if (n_rxqs > 1 && assign_cyc) { + if (n_rxqs > 1 && algo) { /* Sort the queues in order of the processing cycles * they consumed during their last pmd interval. */ @@ -5109,54 +5420,100 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex) } - rr_numa_list_populate(dp, &rr); - /* Assign the sorted queues to pmds in round robin. */ - for (int i = 0; i < n_rxqs; i++) { - numa_id = netdev_get_numa_id(rxqs[i]->port->netdev); - numa = rr_numa_list_lookup(&rr, numa_id); - if (!numa) { - /* There are no pmds on the queue's local NUMA node. - Round robin on the NUMA nodes that do have pmds. */ - non_local_numa = rr_numa_list_next(&rr, non_local_numa); - if (!non_local_numa) { - VLOG_ERR("There is no available (non-isolated) pmd " - "thread for port \'%s\' queue %d. This queue " - "will not be polled. Is pmd-cpu-mask set to " - "zero? Or are all PMDs isolated to other " - "queues?", netdev_rxq_get_name(rxqs[i]->rx), - netdev_rxq_get_queue_id(rxqs[i]->rx)); - continue; + last_cross_numa = NULL; + n_numa = sched_numa_list_count(numa_list); + for (unsigned i = 0; i < n_rxqs; i++) { + struct dp_netdev_rxq *rxq = rxqs[i]; + struct sched_pmd *sched_pmd; + struct sched_numa *numa; + int numa_id; + uint64_t proc_cycles; + char rxq_cyc_log[MAX_RXQ_CYC_STRLEN]; + + if (rxq->core_id != OVS_CORE_UNSPEC) { + continue; + } + + if (start_logged == false && level != VLL_DBG) { + VLOG(level, "Performing pmd to rx queue assignment using %s " + "algorithm.", get_assignment_type_string(algo)); + start_logged = true; + } + + /* Store the cycles for this rxq as we will log these later. */ + proc_cycles = dp_netdev_rxq_get_cycles(rxq, RXQ_CYCLES_PROC_HIST); + /* Select the numa that should be used for this rxq. */ + numa_id = netdev_get_numa_id(rxq->port->netdev); + numa = sched_numa_list_lookup(numa_list, numa_id); + + /* Ensure that there is at least one non-isolated pmd on that numa. */ + if (numa && !sched_get_numa_pmd_noniso(numa)) { + numa = NULL; + } + + if (!numa || !sched_get_numa_pmd_noniso(numa)) { + /* Find any numa with available pmds. */ + for (int k = 0; k < n_numa; k++) { + numa = sched_numa_list_next(numa_list, last_cross_numa); + if (sched_get_numa_pmd_noniso(numa)) { + break; + } + last_cross_numa = numa; + numa = NULL; } - rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa, assign_cyc); - VLOG_WARN("There's no available (non-isolated) pmd thread " - "on numa node %d. Queue %d on port \'%s\' will " - "be assigned to the pmd on core %d " - "(numa node %d). 
Expect reduced performance.", - numa_id, netdev_rxq_get_queue_id(rxqs[i]->rx), - netdev_rxq_get_name(rxqs[i]->rx), - rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id); - } else { - rxqs[i]->pmd = rr_numa_get_pmd(numa, assign_cyc); - if (assign_cyc) { - VLOG_INFO("Core %d on numa node %d assigned port \'%s\' " - "rx queue %d " - "(measured processing cycles %"PRIu64").", - rxqs[i]->pmd->core_id, numa_id, - netdev_rxq_get_name(rxqs[i]->rx), - netdev_rxq_get_queue_id(rxqs[i]->rx), - dp_netdev_rxq_get_cycles(rxqs[i], - RXQ_CYCLES_PROC_HIST)); - } else { - VLOG_INFO("Core %d on numa node %d assigned port \'%s\' " - "rx queue %d.", rxqs[i]->pmd->core_id, numa_id, - netdev_rxq_get_name(rxqs[i]->rx), - netdev_rxq_get_queue_id(rxqs[i]->rx)); + } + if (numa && numa->numa_id != numa_id) { + VLOG(level, "There's no available (non-isolated) pmd thread " + "on numa node %d. Port \'%s\' rx queue %d will " + "be assigned to a pmd on numa node %d. " + "This may lead to reduced performance.", + numa_id, netdev_rxq_get_name(rxq->rx), + netdev_rxq_get_queue_id(rxq->rx), numa->numa_id); + } + + sched_pmd = NULL; + if (numa) { + /* Select the PMD that should be used for this rxq. */ + sched_pmd = get_next_pmd(numa, algo); + if (sched_pmd) { + VLOG(level, "Core %2u on numa node %d assigned port \'%s\' " + "rx queue %d.%s", + sched_pmd->pmd->core_id, sched_pmd->pmd->numa_id, + netdev_rxq_get_name(rxq->rx), + netdev_rxq_get_queue_id(rxq->rx), + get_rxq_cyc_log(rxq_cyc_log, algo, proc_cycles) + ? rxq_cyc_log : ""); + sched_add_rxq_to_sched_pmd(sched_pmd, rxq, proc_cycles); } } + if (!sched_pmd) { + VLOG(level == VLL_DBG ? level : VLL_WARN, + "No non-isolated pmd on any numa available for " + "port \'%s\' rx queue %d.%s " + "This rx queue will not be polled.", + netdev_rxq_get_name(rxq->rx), + netdev_rxq_get_queue_id(rxq->rx), + get_rxq_cyc_log(rxq_cyc_log, algo, proc_cycles) + ? rxq_cyc_log : ""); + } } - - rr_numa_list_destroy(&rr); free(rxqs); } +static void +rxq_scheduling(struct dp_netdev *dp) OVS_REQUIRES(dp->port_mutex) +{ + struct sched_numa_list *numa_list; + bool algo = dp->pmd_rxq_assign_cyc; + + numa_list = xzalloc(sizeof *numa_list); + + sched_numa_list_populate(numa_list, dp); + sched_numa_list_schedule(numa_list, dp, algo, VLL_INFO); + sched_numa_list_put_in_place(numa_list); + + sched_numa_list_free_entries(numa_list); + free(numa_list); +} + static void reload_affected_pmds(struct dp_netdev *dp) @@ -5406,10 +5763,5 @@ reconfigure_datapath(struct dp_netdev *dp) } } - - /* Add pinned queues and mark pmd threads isolated. */ - rxq_scheduling(dp, true); - - /* Add non-pinned queues. */ - rxq_scheduling(dp, false); + rxq_scheduling(dp); /* Step 5: Remove queues not compliant with new scheduling. 
*/ diff --git a/tests/pmd.at b/tests/pmd.at index cc5371d5a..78105bf45 100644 --- a/tests/pmd.at +++ b/tests/pmd.at @@ -580,5 +580,5 @@ p1 3 0 2 ]) -OVS_VSWITCHD_STOP(["/dpif_netdev|WARN|There is no PMD thread on core/d"]) +OVS_VSWITCHD_STOP(["/cannot be pinned with port/d"]) AT_CLEANUP From patchwork Fri Jun 4 21:18:53 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kevin Traynor X-Patchwork-Id: 1488116 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=openvswitch.org (client-ip=140.211.166.138; helo=smtp1.osuosl.org; envelope-from=ovs-dev-bounces@openvswitch.org; receiver=) Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=TgmqYVgO; dkim-atps=neutral Received: from smtp1.osuosl.org (smtp1.osuosl.org [140.211.166.138]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4FxbHt1BXfz9sT6 for ; Sat, 5 Jun 2021 07:19:38 +1000 (AEST) Received: from localhost (localhost [127.0.0.1]) by smtp1.osuosl.org (Postfix) with ESMTP id A106F83DCB; Fri, 4 Jun 2021 21:19:35 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Received: from smtp1.osuosl.org ([127.0.0.1]) by localhost (smtp1.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id UkQMhhbia9Ck; Fri, 4 Jun 2021 21:19:31 +0000 (UTC) Received: from lists.linuxfoundation.org (lf-lists.osuosl.org [140.211.9.56]) by smtp1.osuosl.org (Postfix) with ESMTP id 3584C83D4F; Fri, 4 Jun 2021 21:19:29 +0000 (UTC) Received: from lf-lists.osuosl.org (localhost [127.0.0.1]) by lists.linuxfoundation.org (Postfix) with ESMTP id 39A8BC002C; Fri, 4 Jun 2021 21:19:28 +0000 (UTC) X-Original-To: dev@openvswitch.org Delivered-To: ovs-dev@lists.linuxfoundation.org Received: from smtp3.osuosl.org (smtp3.osuosl.org [IPv6:2605:bc80:3010::136]) by lists.linuxfoundation.org (Postfix) with ESMTP id 49478C0027 for ; Fri, 4 Jun 2021 21:19:27 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by smtp3.osuosl.org (Postfix) with ESMTP id 1EE646072F for ; Fri, 4 Jun 2021 21:19:27 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Authentication-Results: smtp3.osuosl.org (amavisd-new); dkim=pass (1024-bit key) header.d=redhat.com Received: from smtp3.osuosl.org ([127.0.0.1]) by localhost (smtp3.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id GB-O1NQoqmzs for ; Fri, 4 Jun 2021 21:19:25 +0000 (UTC) X-Greylist: domain auto-whitelisted by SQLgrey-1.8.0 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [216.205.24.124]) by smtp3.osuosl.org (Postfix) with ESMTPS id AB21C60664 for ; Fri, 4 Jun 2021 21:19:25 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1622841564; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=PwS2tPxTgsZjURS65IQCWnhQsAF4ZJXCcyJHdFLWsPE=; b=TgmqYVgOdT1M7m2+gOqibXLakqGJgxiyRXM813xErV4nBk4TPzu1msuTGbWeiKVJ21vJhW 
Nt7o4ztQ5sqxj1X9M6m/5iUQpe7k56KgTT/yO67/hjoBWngnRy7zgMFSxCmRuTi8aXh5XK hfcOTKGz0arjkPCGqam8ofO4EJ/sYNU= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-219-AeuiJNsSOU210qbxkYY6iQ-1; Fri, 04 Jun 2021 17:19:20 -0400 X-MC-Unique: AeuiJNsSOU210qbxkYY6iQ-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id BC622501E0; Fri, 4 Jun 2021 21:19:19 +0000 (UTC) Received: from rh.redhat.com (ovpn-114-242.ams2.redhat.com [10.36.114.242]) by smtp.corp.redhat.com (Postfix) with ESMTP id 65BD710023AC; Fri, 4 Jun 2021 21:19:18 +0000 (UTC) From: Kevin Traynor To: dev@openvswitch.org Date: Fri, 4 Jun 2021 22:18:53 +0100 Message-Id: <20210604211856.915563-3-ktraynor@redhat.com> In-Reply-To: <20210604211856.915563-1-ktraynor@redhat.com> References: <20210604211856.915563-1-ktraynor@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 Authentication-Results: relay.mimecast.com; auth=pass smtp.auth=CUSA124A263 smtp.mailfrom=ktraynor@redhat.com X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Cc: david.marchand@redhat.com Subject: [ovs-dev] [PATCH 2/5] dpif-netdev: Make PMD auto load balance use common rxq scheduling. X-BeenThere: ovs-dev@openvswitch.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: ovs-dev-bounces@openvswitch.org Sender: "dev" PMD auto load balance had its own separate implementation of the rxq scheduling that it used for dry runs. This was done because previously the rxq scheduling was not reusable for a dry run. Apart from the code duplication (which is a good enough reason to replace it alone) this meant that if any further rxq scheduling changes or assignment types were added they would also have to be duplicated in the auto load balance code too. This patch replaces the current PMD auto load balance rxq scheduling code to reuse the common rxq scheduling code. The behaviour does not change from a user perspective, except the logs are updated to be more consistent. As the dry run will compare the pmd load variances for current and estimated assignments, new functions are added to populate the current assignments and calculate variance on the rxq scheduling data structs. Now that the new rxq scheduling data structures are being used in PMD auto load balance, the older rr_* data structs and associated functions can be removed. 
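As a rough standalone illustration of the dry run described above (this is not code from the patch; the PMD busy percentages below are made up), the comparison amounts to computing the variance of the per-PMD busy percentages for the current and the estimated assignments and checking the relative improvement against the configured threshold (dp->pmd_alb.rebalance_improve_thresh):

/* Illustrative sketch only: compare the load variance of a current and an
 * estimated assignment, as the dry run above does.  Each array element is
 * one non-isolated PMD's busy percentage over the measurement interval. */
#include <inttypes.h>
#include <stdio.h>

static uint64_t
variance(const uint64_t a[], int n)
{
    uint64_t sum = 0, sqdiff_sum = 0, mean;

    if (n <= 0) {
        return 0;
    }
    for (int i = 0; i < n; i++) {
        sum += a[i];
    }
    mean = sum / n;
    for (int i = 0; i < n; i++) {
        uint64_t d = a[i] > mean ? a[i] - mean : mean - a[i];

        sqdiff_sum += d * d;
    }
    return sqdiff_sum / n;
}

int
main(void)
{
    uint64_t current[]  = { 90, 30, 10 };  /* made-up current PMD loads */
    uint64_t estimate[] = { 45, 45, 40 };  /* made-up dry-run PMD loads */
    uint64_t cur_var = variance(current, 3);
    uint64_t est_var = variance(estimate, 3);
    uint64_t improvement = 0;

    if (est_var < cur_var) {
        improvement = (cur_var - est_var) * 100 / cur_var;
    }
    printf("current %"PRIu64" estimated %"PRIu64" improvement %"PRIu64"%%\n",
           cur_var, est_var, improvement);
    /* A reassignment is only requested if 'improvement' meets the
     * rebalance improvement threshold. */
    return 0;
}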
Signed-off-by: Kevin Traynor --- lib/dpif-netdev.c | 511 +++++++++++++++------------------------------- 1 file changed, 164 insertions(+), 347 deletions(-) diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 57d23e112..eaa4e9733 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -4872,138 +4872,4 @@ port_reconfigure(struct dp_netdev_port *port) } -struct rr_numa_list { - struct hmap numas; /* Contains 'struct rr_numa' */ -}; - -struct rr_numa { - struct hmap_node node; - - int numa_id; - - /* Non isolated pmds on numa node 'numa_id' */ - struct dp_netdev_pmd_thread **pmds; - int n_pmds; - - int cur_index; - bool idx_inc; -}; - -static size_t -rr_numa_list_count(struct rr_numa_list *rr) -{ - return hmap_count(&rr->numas); -} - -static struct rr_numa * -rr_numa_list_lookup(struct rr_numa_list *rr, int numa_id) -{ - struct rr_numa *numa; - - HMAP_FOR_EACH_WITH_HASH (numa, node, hash_int(numa_id, 0), &rr->numas) { - if (numa->numa_id == numa_id) { - return numa; - } - } - - return NULL; -} - -/* Returns the next node in numa list following 'numa' in round-robin fashion. - * Returns first node if 'numa' is a null pointer or the last node in 'rr'. - * Returns NULL if 'rr' numa list is empty. */ -static struct rr_numa * -rr_numa_list_next(struct rr_numa_list *rr, const struct rr_numa *numa) -{ - struct hmap_node *node = NULL; - - if (numa) { - node = hmap_next(&rr->numas, &numa->node); - } - if (!node) { - node = hmap_first(&rr->numas); - } - - return (node) ? CONTAINER_OF(node, struct rr_numa, node) : NULL; -} - -static void -rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr) -{ - struct dp_netdev_pmd_thread *pmd; - struct rr_numa *numa; - - hmap_init(&rr->numas); - - CMAP_FOR_EACH (pmd, node, &dp->poll_threads) { - if (pmd->core_id == NON_PMD_CORE_ID || pmd->isolated) { - continue; - } - - numa = rr_numa_list_lookup(rr, pmd->numa_id); - if (!numa) { - numa = xzalloc(sizeof *numa); - numa->numa_id = pmd->numa_id; - hmap_insert(&rr->numas, &numa->node, hash_int(pmd->numa_id, 0)); - } - numa->n_pmds++; - numa->pmds = xrealloc(numa->pmds, numa->n_pmds * sizeof *numa->pmds); - numa->pmds[numa->n_pmds - 1] = pmd; - /* At least one pmd so initialise curr_idx and idx_inc. */ - numa->cur_index = 0; - numa->idx_inc = true; - } -} - -/* - * Returns the next pmd from the numa node. - * - * If 'updown' is 'true' it will alternate between selecting the next pmd in - * either an up or down walk, switching between up/down when the first or last - * core is reached. e.g. 1,2,3,3,2,1,1,2... - * - * If 'updown' is 'false' it will select the next pmd wrapping around when last - * core reached. e.g. 1,2,3,1,2,3,1,2... - */ -static struct dp_netdev_pmd_thread * -rr_numa_get_pmd(struct rr_numa *numa, bool updown) -{ - int numa_idx = numa->cur_index; - - if (numa->idx_inc == true) { - /* Incrementing through list of pmds. */ - if (numa->cur_index == numa->n_pmds-1) { - /* Reached the last pmd. */ - if (updown) { - numa->idx_inc = false; - } else { - numa->cur_index = 0; - } - } else { - numa->cur_index++; - } - } else { - /* Decrementing through list of pmds. */ - if (numa->cur_index == 0) { - /* Reached the first pmd. 
*/ - numa->idx_inc = true; - } else { - numa->cur_index--; - } - } - return numa->pmds[numa_idx]; -} - -static void -rr_numa_list_destroy(struct rr_numa_list *rr) -{ - struct rr_numa *numa; - - HMAP_FOR_EACH_POP (numa, node, &rr->numas) { - free(numa->pmds); - free(numa); - } - hmap_destroy(&rr->numas); -} - struct sched_numa_list { struct hmap numas; /* Contains 'struct sched_numa'. */ @@ -5033,4 +4899,6 @@ struct sched_numa { }; +static uint64_t variance(uint64_t a[], int n); + static size_t sched_numa_list_count(struct sched_numa_list *numa_list) @@ -5181,4 +5049,36 @@ sched_add_rxq_to_sched_pmd(struct sched_pmd *sched_pmd, } +static void +sched_numa_list_assignments(struct sched_numa_list *numa_list, + struct dp_netdev *dp) +{ + struct dp_netdev_port *port; + + /* For each port. */ + HMAP_FOR_EACH (port, node, &dp->ports) { + if (!netdev_is_pmd(port->netdev)) { + continue; + } + /* For each rxq on the port. */ + for (unsigned qid = 0; qid < port->n_rxq; qid++) { + struct dp_netdev_rxq *rxq = &port->rxqs[qid]; + struct sched_pmd *sched_pmd; + uint64_t proc_cycles = 0; + + for (int i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) { + proc_cycles += dp_netdev_rxq_get_intrvl_cycles(rxq, i); + } + + sched_pmd = find_sched_pmd_by_pmd(numa_list, rxq->pmd); + if (sched_pmd) { + if (rxq->core_id != OVS_CORE_UNSPEC) { + sched_pmd->isolated = true; + } + sched_add_rxq_to_sched_pmd(sched_pmd, rxq, proc_cycles); + } + } + } +} + static void sched_numa_list_put_in_place(struct sched_numa_list *numa_list) @@ -5204,4 +5104,31 @@ sched_numa_list_put_in_place(struct sched_numa_list *numa_list) } +static bool +sched_numa_list_cross_numa_polling(struct sched_numa_list *numa_list) +{ + struct sched_numa *numa; + + /* For each numa */ + HMAP_FOR_EACH (numa, node, &numa_list->numas) { + /* For each pmd */ + for (int i = 0; i < numa->n_pmds; i++) { + struct sched_pmd *sched_pmd; + + sched_pmd = &numa->pmds[i]; + /* For each rxq. */ + for (unsigned k = 0; k < sched_pmd->n_rxq; k++) { + struct dp_netdev_rxq *rxq = sched_pmd->rxqs[k]; + + if (!sched_pmd->isolated && + rxq->pmd->numa_id != + netdev_get_numa_id(rxq->port->netdev)) { + return true; + } + } + } + } + return false; +} + static unsigned sched_get_numa_pmd_noniso(struct sched_numa *numa) @@ -5516,4 +5443,105 @@ rxq_scheduling(struct dp_netdev *dp) OVS_REQUIRES(dp->port_mutex) } +static uint64_t +sched_numa_list_variance(struct sched_numa_list *numa_list) +{ + struct sched_numa *numa; + uint64_t *percent_busy = NULL; + unsigned total_pmds = 0; + int n_proc = 0; + uint64_t var; + + HMAP_FOR_EACH (numa, node, &numa_list->numas) { + total_pmds += numa->n_pmds; + percent_busy = xrealloc(percent_busy, + total_pmds * sizeof *percent_busy); + + for (unsigned i = 0; i < numa->n_pmds; i++) { + struct sched_pmd *sched_pmd; + uint64_t total_cycles = 0; + + sched_pmd = &numa->pmds[i]; + /* Exclude isolated PMDs from variance calculations. */ + if (sched_pmd->isolated == true) { + continue; + } + /* Get the total pmd cycles for an interval. */ + atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles); + + if (total_cycles) { + /* Estimate the cycles to cover all intervals. 
*/ + total_cycles *= PMD_RXQ_INTERVAL_MAX; + percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100) + / total_cycles; + } else { + percent_busy[n_proc++] = 0; + } + } + } + var = variance(percent_busy, n_proc); + free(percent_busy); + return var; +} + +static bool +pmd_rebalance_dry_run(struct dp_netdev *dp) + OVS_REQUIRES(dp->port_mutex) +{ + struct sched_numa_list *numa_list_cur; + struct sched_numa_list *numa_list_est; + bool thresh_met = false; + uint64_t current, estimate; + uint64_t improvement = 0; + + VLOG_DBG("PMD auto load balance performing dry run."); + + /* Populate current assignments. */ + numa_list_cur = xzalloc(sizeof *numa_list_cur); + sched_numa_list_populate(numa_list_cur, dp); + sched_numa_list_assignments(numa_list_cur, dp); + + /* Populate estimated assignments. */ + numa_list_est = xzalloc(sizeof *numa_list_est); + sched_numa_list_populate(numa_list_est, dp); + sched_numa_list_schedule(numa_list_est, dp, + dp->pmd_rxq_assign_cyc, VLL_DBG); + + /* Check if cross-numa polling, there is only one numa with PMDs. */ + if (!sched_numa_list_cross_numa_polling(numa_list_est) || + sched_numa_list_count(numa_list_est) == 1) { + + /* Calculate variances. */ + current = sched_numa_list_variance(numa_list_cur); + estimate = sched_numa_list_variance(numa_list_est); + + if (estimate < current) { + improvement = ((current - estimate) * 100) / current; + } + VLOG_DBG("Current variance %"PRIu64" Estimated variance %"PRIu64"", + current, estimate); + VLOG_DBG("Variance improvement %"PRIu64"%%", improvement); + + if (improvement >= dp->pmd_alb.rebalance_improve_thresh) { + thresh_met = true; + VLOG_DBG("PMD load variance improvement threshold %u%% " + "is met", dp->pmd_alb.rebalance_improve_thresh); + } else { + VLOG_DBG("PMD load variance improvement threshold %u%% is not met", + dp->pmd_alb.rebalance_improve_thresh); + } + } else { + VLOG_DBG("PMD auto load balance detected cross-numa polling with " + "multiple numa nodes. Unable to accurately estimate."); + } + + sched_numa_list_free_entries(numa_list_cur); + sched_numa_list_free_entries(numa_list_est); + + free(numa_list_cur); + free(numa_list_est); + + return thresh_met; +} + static void reload_affected_pmds(struct dp_netdev *dp) @@ -5897,215 +5925,4 @@ variance(uint64_t a[], int n) } - -/* Returns the variance in the PMDs usage as part of dry run of rxqs - * assignment to PMDs. */ -static bool -get_dry_run_variance(struct dp_netdev *dp, uint32_t *core_list, - uint32_t num_pmds, uint64_t *predicted_variance) - OVS_REQUIRES(dp->port_mutex) -{ - struct dp_netdev_port *port; - struct dp_netdev_pmd_thread *pmd; - struct dp_netdev_rxq **rxqs = NULL; - struct rr_numa *numa = NULL; - struct rr_numa_list rr; - int n_rxqs = 0; - bool ret = false; - uint64_t *pmd_usage; - - if (!predicted_variance) { - return ret; - } - - pmd_usage = xcalloc(num_pmds, sizeof(uint64_t)); - - HMAP_FOR_EACH (port, node, &dp->ports) { - if (!netdev_is_pmd(port->netdev)) { - continue; - } - - for (int qid = 0; qid < port->n_rxq; qid++) { - struct dp_netdev_rxq *q = &port->rxqs[qid]; - uint64_t cycle_hist = 0; - - if (q->pmd->isolated) { - continue; - } - - if (n_rxqs == 0) { - rxqs = xmalloc(sizeof *rxqs); - } else { - rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1)); - } - - /* Sum the queue intervals and store the cycle history. */ - for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) { - cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i); - } - dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST, - cycle_hist); - /* Store the queue. 
*/ - rxqs[n_rxqs++] = q; - } - } - if (n_rxqs > 1) { - /* Sort the queues in order of the processing cycles - * they consumed during their last pmd interval. */ - qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles); - } - rr_numa_list_populate(dp, &rr); - - for (int i = 0; i < n_rxqs; i++) { - int numa_id = netdev_get_numa_id(rxqs[i]->port->netdev); - numa = rr_numa_list_lookup(&rr, numa_id); - /* If there is no available pmd on the local numa but there is only one - * numa for cross-numa polling, we can estimate the dry run. */ - if (!numa && rr_numa_list_count(&rr) == 1) { - numa = rr_numa_list_next(&rr, NULL); - } - if (!numa) { - VLOG_DBG("PMD auto lb dry run: " - "There's no available (non-isolated) PMD thread on NUMA " - "node %d for port '%s' and there are PMD threads on more " - "than one NUMA node available for cross-NUMA polling. " - "Aborting.", numa_id, netdev_rxq_get_name(rxqs[i]->rx)); - goto cleanup; - } - - pmd = rr_numa_get_pmd(numa, true); - VLOG_DBG("PMD auto lb dry run. Predicted: Core %d on numa node %d " - "to be assigned port \'%s\' rx queue %d " - "(measured processing cycles %"PRIu64").", - pmd->core_id, numa_id, - netdev_rxq_get_name(rxqs[i]->rx), - netdev_rxq_get_queue_id(rxqs[i]->rx), - dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_HIST)); - - for (int id = 0; id < num_pmds; id++) { - if (pmd->core_id == core_list[id]) { - /* Add the processing cycles of rxq to pmd polling it. */ - pmd_usage[id] += dp_netdev_rxq_get_cycles(rxqs[i], - RXQ_CYCLES_PROC_HIST); - } - } - } - - CMAP_FOR_EACH (pmd, node, &dp->poll_threads) { - uint64_t total_cycles = 0; - - if ((pmd->core_id == NON_PMD_CORE_ID) || pmd->isolated) { - continue; - } - - /* Get the total pmd cycles for an interval. */ - atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles); - /* Estimate the cycles to cover all intervals. */ - total_cycles *= PMD_RXQ_INTERVAL_MAX; - for (int id = 0; id < num_pmds; id++) { - if (pmd->core_id == core_list[id]) { - if (pmd_usage[id]) { - pmd_usage[id] = (pmd_usage[id] * 100) / total_cycles; - } - VLOG_DBG("PMD auto lb dry run. Predicted: Core %d, " - "usage %"PRIu64"", pmd->core_id, pmd_usage[id]); - } - } - } - *predicted_variance = variance(pmd_usage, num_pmds); - ret = true; - -cleanup: - rr_numa_list_destroy(&rr); - free(rxqs); - free(pmd_usage); - return ret; -} - -/* Does the dry run of Rxq assignment to PMDs and returns true if it gives - * better distribution of load on PMDs. */ -static bool -pmd_rebalance_dry_run(struct dp_netdev *dp) - OVS_REQUIRES(dp->port_mutex) -{ - struct dp_netdev_pmd_thread *pmd; - uint64_t *curr_pmd_usage; - - uint64_t curr_variance; - uint64_t new_variance; - uint64_t improvement = 0; - uint32_t num_pmds; - uint32_t *pmd_corelist; - struct rxq_poll *poll; - bool ret; - - num_pmds = cmap_count(&dp->poll_threads); - - if (num_pmds > 1) { - curr_pmd_usage = xcalloc(num_pmds, sizeof(uint64_t)); - pmd_corelist = xcalloc(num_pmds, sizeof(uint32_t)); - } else { - return false; - } - - num_pmds = 0; - CMAP_FOR_EACH (pmd, node, &dp->poll_threads) { - uint64_t total_cycles = 0; - uint64_t total_proc = 0; - - if ((pmd->core_id == NON_PMD_CORE_ID) || pmd->isolated) { - continue; - } - - /* Get the total pmd cycles for an interval. */ - atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles); - /* Estimate the cycles to cover all intervals. 
*/ - total_cycles *= PMD_RXQ_INTERVAL_MAX; - - ovs_mutex_lock(&pmd->port_mutex); - HMAP_FOR_EACH (poll, node, &pmd->poll_list) { - for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) { - total_proc += dp_netdev_rxq_get_intrvl_cycles(poll->rxq, i); - } - } - ovs_mutex_unlock(&pmd->port_mutex); - - if (total_proc) { - curr_pmd_usage[num_pmds] = (total_proc * 100) / total_cycles; - } - - VLOG_DBG("PMD auto lb dry run. Current: Core %d, usage %"PRIu64"", - pmd->core_id, curr_pmd_usage[num_pmds]); - - if (atomic_count_get(&pmd->pmd_overloaded)) { - atomic_count_set(&pmd->pmd_overloaded, 0); - } - - pmd_corelist[num_pmds] = pmd->core_id; - num_pmds++; - } - - curr_variance = variance(curr_pmd_usage, num_pmds); - ret = get_dry_run_variance(dp, pmd_corelist, num_pmds, &new_variance); - - if (ret) { - VLOG_DBG("PMD auto lb dry run. Current PMD variance: %"PRIu64"," - " Predicted PMD variance: %"PRIu64"", - curr_variance, new_variance); - - if (new_variance < curr_variance) { - improvement = - ((curr_variance - new_variance) * 100) / curr_variance; - } - if (improvement < dp->pmd_alb.rebalance_improve_thresh) { - ret = false; - } - } - - free(curr_pmd_usage); - free(pmd_corelist); - return ret; -} - - /* Return true if needs to revalidate datapath flows. */ static bool @@ -6183,6 +6000,6 @@ dpif_netdev_run(struct dpif *dpif) !ports_require_restart(dp) && pmd_rebalance_dry_run(dp)) { - VLOG_INFO("PMD auto lb dry run." - " requesting datapath reconfigure."); + VLOG_INFO("PMD auto load balance dry run. " + "Requesting datapath reconfigure."); dp_netdev_request_reconfigure(dp); } From patchwork Fri Jun 4 21:18:54 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kevin Traynor X-Patchwork-Id: 1488117 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=openvswitch.org (client-ip=140.211.166.138; helo=smtp1.osuosl.org; envelope-from=ovs-dev-bounces@openvswitch.org; receiver=) Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=S/3hTGoZ; dkim-atps=neutral Received: from smtp1.osuosl.org (smtp1.osuosl.org [140.211.166.138]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4FxbHw3Q1Xz9sRK for ; Sat, 5 Jun 2021 07:19:40 +1000 (AEST) Received: from localhost (localhost [127.0.0.1]) by smtp1.osuosl.org (Postfix) with ESMTP id C78B283CC0; Fri, 4 Jun 2021 21:19:37 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Received: from smtp1.osuosl.org ([127.0.0.1]) by localhost (smtp1.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 8QOGcGG5731c; Fri, 4 Jun 2021 21:19:34 +0000 (UTC) Received: from lists.linuxfoundation.org (lf-lists.osuosl.org [140.211.9.56]) by smtp1.osuosl.org (Postfix) with ESMTP id 40FAC83D6F; Fri, 4 Jun 2021 21:19:30 +0000 (UTC) Received: from lf-lists.osuosl.org (localhost [127.0.0.1]) by lists.linuxfoundation.org (Postfix) with ESMTP id CF2C3C002F; Fri, 4 Jun 2021 21:19:28 +0000 (UTC) X-Original-To: dev@openvswitch.org Delivered-To: ovs-dev@lists.linuxfoundation.org Received: from smtp4.osuosl.org (smtp4.osuosl.org [140.211.166.137]) by 
lists.linuxfoundation.org (Postfix) with ESMTP id A0846C0019 for ; Fri, 4 Jun 2021 21:19:27 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by smtp4.osuosl.org (Postfix) with ESMTP id 74A7F40660 for ; Fri, 4 Jun 2021 21:19:26 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Authentication-Results: smtp4.osuosl.org (amavisd-new); dkim=pass (1024-bit key) header.d=redhat.com Received: from smtp4.osuosl.org ([127.0.0.1]) by localhost (smtp4.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 9IERg4EYQTDt for ; Fri, 4 Jun 2021 21:19:25 +0000 (UTC) X-Greylist: domain auto-whitelisted by SQLgrey-1.8.0 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [216.205.24.124]) by smtp4.osuosl.org (Postfix) with ESMTPS id D181F4068A for ; Fri, 4 Jun 2021 21:19:24 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1622841563; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=N/sh9a4yMuxpLbZGXNVxobcy+ALJt3TwKq/mHpeUbFU=; b=S/3hTGoZbH0quASKH4HCvq7XL1Kgu3syoUgrFic9Wizgm9YvHoWdSdeZdNGetofsYPrLdN +h0651t9988LJsMHdR7zLpAvk/rNKkjwRwM+dwfCM9fnzT5l67KBYAA1kZmw5BaoErjdru CJoc1jJFq0z8Sc4mc8MiIrLLdBWU3ys= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-15-URzEFg8QMj-O1O7DTmX3eA-1; Fri, 04 Jun 2021 17:19:22 -0400 X-MC-Unique: URzEFg8QMj-O1O7DTmX3eA-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 44361180FD66; Fri, 4 Jun 2021 21:19:21 +0000 (UTC) Received: from rh.redhat.com (ovpn-114-242.ams2.redhat.com [10.36.114.242]) by smtp.corp.redhat.com (Postfix) with ESMTP id 1678A100EB3D; Fri, 4 Jun 2021 21:19:19 +0000 (UTC) From: Kevin Traynor To: dev@openvswitch.org Date: Fri, 4 Jun 2021 22:18:54 +0100 Message-Id: <20210604211856.915563-4-ktraynor@redhat.com> In-Reply-To: <20210604211856.915563-1-ktraynor@redhat.com> References: <20210604211856.915563-1-ktraynor@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 Authentication-Results: relay.mimecast.com; auth=pass smtp.auth=CUSA124A263 smtp.mailfrom=ktraynor@redhat.com X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Cc: david.marchand@redhat.com Subject: [ovs-dev] [PATCH 3/5] dpif-netdev: Add group rxq scheduling assignment type. X-BeenThere: ovs-dev@openvswitch.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: ovs-dev-bounces@openvswitch.org Sender: "dev" Add an rxq scheduling option that allows rxqs to be grouped on a pmd based purely on their load. The current default 'cycles' assignment sorts rxqs by measured processing load and then assigns them to a list of round robin PMDs. This helps to keep the rxqs that require most processing on different cores but as it selects the PMDs in round robin order, it equally distributes rxqs to PMDs. 'cycles' assignment has the advantage in that it separates the most loaded rxqs from being on the same core but maintains the rxqs being spread across a broad range of PMDs to mitigate against changes to traffic pattern. 
'cycles' assignment has the disadvantage that in order to make the trade off between optimising for current traffic load and mitigating against future changes, it tries to assign and equal amount of rxqs per PMD in a round robin manner and this can lead to less than optimal balance of the processing load. Now that PMD auto load balance can help mitigate with future changes in traffic patterns, a 'group' assignment can be used to assign rxqs based on their measured cycles and the estimated running total of the PMDs. In this case, there is no restriction about keeping equal number of rxqs per PMD as it is purely load based. This means that one PMD may have a group of low load rxqs assigned to it while another PMD has one high load rxq assigned to it, as that is the best balance of their measured loads across the PMDs. Signed-off-by: Kevin Traynor --- Documentation/topics/dpdk/pmd.rst | 26 ++++++ lib/dpif-netdev.c | 141 +++++++++++++++++++++++++----- vswitchd/vswitch.xml | 5 +- 3 files changed, 148 insertions(+), 24 deletions(-) diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst index e481e7941..d1c45cdfb 100644 --- a/Documentation/topics/dpdk/pmd.rst +++ b/Documentation/topics/dpdk/pmd.rst @@ -137,4 +137,30 @@ The Rx queues will be assigned to the cores in the following order:: Core 8: Q3 (60%) | Q0 (30%) +``group`` assignment is similar to ``cycles`` in that the Rxqs will be +ordered by their measured processing cycles before being assigned to PMDs. +It differs from ``cycles`` in that it uses a running estimate of the cycles +that will be on each PMD to select the PMD with the lowest load for each Rxq. + +This means that there can be a group of low traffic Rxqs on one PMD, while a +high traffic Rxq may have a PMD to itself. Where ``cycles`` kept as close to +the same number of Rxqs per PMD as possible, with ``group`` this restriction is +removed for a better balance of the workload across PMDs. + +For example, where there are five Rx queues and three cores - 3, 7, and 8 - +available and the measured usage of core cycles per Rx queue over the last +interval is seen to be: + +- Queue #0: 10% +- Queue #1: 80% +- Queue #3: 50% +- Queue #4: 70% +- Queue #5: 10% + +The Rx queues will be assigned to the cores in the following order:: + + Core 3: Q1 (80%) | + Core 7: Q4 (70%) | + Core 8: Q3 (50%) | Q0 (10%) | Q5 (10%) + Alternatively, ``roundrobin`` assignment can be used, where the Rxqs are assigned to PMDs in a round-robined fashion. This algorithm was used by diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index eaa4e9733..61e0a516f 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -306,4 +306,11 @@ struct pmd_auto_lb { }; +enum sched_assignment_type { + SCHED_ROUNDROBIN, + SCHED_CYCLES, /* Default.*/ + SCHED_GROUP, + SCHED_MAX +}; + /* Datapath based on the network device interface from netdev.h. * @@ -367,5 +374,5 @@ struct dp_netdev { struct ovs_mutex tx_qid_pool_mutex; /* Use measured cycles for rxq to pmd assignment. 
*/ - bool pmd_rxq_assign_cyc; + enum sched_assignment_type pmd_rxq_assign_cyc; /* Protects the access of the 'struct dp_netdev_pmd_thread' @@ -1799,5 +1806,5 @@ create_dp_netdev(const char *name, const struct dpif_class *class, cmap_init(&dp->poll_threads); - dp->pmd_rxq_assign_cyc = true; + dp->pmd_rxq_assign_cyc = SCHED_CYCLES; ovs_mutex_init(&dp->tx_qid_pool_mutex); @@ -4223,5 +4230,5 @@ set_pmd_auto_lb(struct dp_netdev *dp, bool always_log) bool enable_alb = false; bool multi_rxq = false; - bool pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc; + enum sched_assignment_type pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc; /* Ensure that there is at least 2 non-isolated PMDs and @@ -4242,6 +4249,6 @@ set_pmd_auto_lb(struct dp_netdev *dp, bool always_log) } - /* Enable auto LB if it is requested and cycle based assignment is true. */ - enable_alb = enable_alb && pmd_rxq_assign_cyc && + /* Enable auto LB if requested and not using roundrobin assignment. */ + enable_alb = enable_alb && pmd_rxq_assign_cyc != SCHED_ROUNDROBIN && pmd_alb->auto_lb_requested; @@ -4284,4 +4291,5 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config) uint8_t rebalance_improve; bool log_autolb = false; + enum sched_assignment_type pmd_rxq_assign_cyc; tx_flush_interval = smap_get_int(other_config, "tx-flush-interval", @@ -4342,9 +4350,15 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config) } - bool pmd_rxq_assign_cyc = !strcmp(pmd_rxq_assign, "cycles"); - if (!pmd_rxq_assign_cyc && strcmp(pmd_rxq_assign, "roundrobin")) { - VLOG_WARN("Unsupported Rxq to PMD assignment mode in pmd-rxq-assign. " - "Defaulting to 'cycles'."); - pmd_rxq_assign_cyc = true; + if (!strcmp(pmd_rxq_assign, "roundrobin")) { + pmd_rxq_assign_cyc = SCHED_ROUNDROBIN; + } else if (!strcmp(pmd_rxq_assign, "cycles")) { + pmd_rxq_assign_cyc = SCHED_CYCLES; + } else if (!strcmp(pmd_rxq_assign, "group")) { + pmd_rxq_assign_cyc = SCHED_GROUP; + } else { + /* default */ + VLOG_WARN("Unsupported rx queue to PMD assignment mode in " + "pmd-rxq-assign. 
Defaulting to 'cycles'."); + pmd_rxq_assign_cyc = SCHED_CYCLES; pmd_rxq_assign = "cycles"; } @@ -5171,4 +5185,61 @@ compare_rxq_cycles(const void *a, const void *b) } +static struct sched_pmd * +get_lowest_num_rxq_pmd(struct sched_numa *numa) +{ + struct sched_pmd *lowest_rxqs_sched_pmd = NULL; + unsigned lowest_rxqs = UINT_MAX; + + /* find the pmd with lowest number of rxqs */ + for (unsigned i = 0; i < numa->n_pmds; i++) { + struct sched_pmd *sched_pmd; + unsigned num_rxqs; + + sched_pmd = &numa->pmds[i]; + num_rxqs = sched_pmd->n_rxq; + if (sched_pmd->isolated) { + continue; + } + + /* If this current load is higher we can go to the next one */ + if (num_rxqs > lowest_rxqs) { + continue; + } + if (num_rxqs < lowest_rxqs) { + lowest_rxqs = num_rxqs; + lowest_rxqs_sched_pmd = sched_pmd; + } + } + return lowest_rxqs_sched_pmd; +} + +static struct sched_pmd * +get_lowest_proc_pmd(struct sched_numa *numa) +{ + struct sched_pmd *lowest_loaded_sched_pmd = NULL; + uint64_t lowest_load = UINT64_MAX; + + /* find the pmd with the lowest load */ + for (unsigned i = 0; i < numa->n_pmds; i++) { + struct sched_pmd *sched_pmd; + uint64_t pmd_load; + + sched_pmd = &numa->pmds[i]; + if (sched_pmd->isolated) { + continue; + } + pmd_load = sched_pmd->pmd_proc_cycles; + /* If this current load is higher we can go to the next one */ + if (pmd_load > lowest_load) { + continue; + } + if (pmd_load < lowest_load) { + lowest_load = pmd_load; + lowest_loaded_sched_pmd = sched_pmd; + } + } + return lowest_loaded_sched_pmd; +} + /* * Returns the next pmd from the numa node. @@ -5229,16 +5300,40 @@ get_available_rr_pmd(struct sched_numa *numa, bool updown) static struct sched_pmd * -get_next_pmd(struct sched_numa *numa, bool algo) +get_next_pmd(struct sched_numa *numa, enum sched_assignment_type algo, + bool has_proc) { - return get_available_rr_pmd(numa, algo); + if (algo == SCHED_GROUP) { + struct sched_pmd *sched_pmd = NULL; + + /* Check if the rxq has associated cycles. This is handled differently + * as adding an zero cycles rxq to a PMD will mean that the lowest + * core would not change on a subsequent call and all zero rxqs would + * be assigned to the same PMD. */ + if (has_proc) { + sched_pmd = get_lowest_proc_pmd(numa); + } else { + sched_pmd = get_lowest_num_rxq_pmd(numa); + } + /* If there is a pmd selected, return it now. */ + if (sched_pmd) { + return sched_pmd; + } + } + + /* By default or as a last resort, just RR the PMDs. */ + return get_available_rr_pmd(numa, algo == SCHED_CYCLES ? 
true : false); } static const char * -get_assignment_type_string(bool algo) +get_assignment_type_string(enum sched_assignment_type algo) { - if (algo == false) { - return "roundrobin"; + switch (algo) { + case SCHED_ROUNDROBIN: return "roundrobin"; + case SCHED_CYCLES: return "cycles"; + case SCHED_GROUP: return "group"; + case SCHED_MAX: + /* fall through */ + default: return "Unknown"; } - return "cycles"; } @@ -5246,9 +5341,9 @@ get_assignment_type_string(bool algo) static bool -get_rxq_cyc_log(char *a, bool algo, uint64_t cycles) +get_rxq_cyc_log(char *a, enum sched_assignment_type algo, uint64_t cycles) { int ret = 0; - if (algo) { + if (algo != SCHED_ROUNDROBIN) { ret = snprintf(a, MAX_RXQ_CYC_STRLEN, " (measured processing cycles %"PRIu64").", @@ -5261,5 +5356,5 @@ static void sched_numa_list_schedule(struct sched_numa_list *numa_list, struct dp_netdev *dp, - bool algo, + enum sched_assignment_type algo, enum vlog_level level) OVS_REQUIRES(dp->port_mutex) @@ -5285,5 +5380,5 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list, rxqs[n_rxqs++] = rxq; - if (algo == true) { + if (algo != SCHED_ROUNDROBIN) { uint64_t cycle_hist = 0; @@ -5341,5 +5436,5 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list, } - if (n_rxqs > 1 && algo) { + if (n_rxqs > 1 && algo != SCHED_ROUNDROBIN) { /* Sort the queues in order of the processing cycles * they consumed during their last pmd interval. */ @@ -5401,5 +5496,5 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list, if (numa) { /* Select the PMD that should be used for this rxq. */ - sched_pmd = get_next_pmd(numa, algo); + sched_pmd = get_next_pmd(numa, algo, proc_cycles ? true : false); if (sched_pmd) { VLOG(level, "Core %2u on numa node %d assigned port \'%s\' " @@ -5431,5 +5526,5 @@ rxq_scheduling(struct dp_netdev *dp) OVS_REQUIRES(dp->port_mutex) { struct sched_numa_list *numa_list; - bool algo = dp->pmd_rxq_assign_cyc; + enum sched_assignment_type algo = dp->pmd_rxq_assign_cyc; numa_list = xzalloc(sizeof *numa_list); diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index 4597a215d..14cb8a2c6 100644 --- a/vswitchd/vswitch.xml +++ b/vswitchd/vswitch.xml @@ -520,5 +520,5 @@ + "enum": ["set", ["cycles", "roundrobin", "group"]]}'>

Specifies how RX queues will be automatically assigned to CPU cores. @@ -530,4 +530,7 @@

roundrobin
Rxqs will be round-robined across CPU cores.
+
group
+
Rxqs will be sorted by order of measured processing cycles + before being assigned to the CPU cores with the lowest estimated load.

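As an aside on the group assignment type documented above, the selection heuristic can be summarised in a small standalone C sketch. This is not the patch code itself: the toy_pmd/toy_select_pmd names are hypothetical, and the sketch ignores the NUMA bookkeeping and round-robin fallback that the real sched_numa/sched_pmd code performs.

    #include <stddef.h>
    #include <stdint.h>

    /* Minimal model of a PMD as seen by the scheduler. */
    struct toy_pmd {
        uint64_t proc_cycles;   /* Measured processing cycles (load). */
        unsigned n_rxq;         /* Rxqs already assigned to this PMD. */
        int isolated;           /* Isolated PMDs are never considered. */
    };

    /* Pick a PMD for one rxq: lowest measured load when the rxq has
     * measured cycles, otherwise lowest rxq count (so zero-cycle rxqs
     * still spread out instead of piling onto one core).  Returns the
     * chosen index, or -1 if every PMD is isolated; the real code
     * would then fall back to round-robin. */
    static int
    toy_select_pmd(const struct toy_pmd *pmds, size_t n, int has_proc_cycles)
    {
        int best = -1;

        for (size_t i = 0; i < n; i++) {
            if (pmds[i].isolated) {
                continue;
            }
            if (best < 0
                || (has_proc_cycles
                    ? pmds[i].proc_cycles < pmds[best].proc_cycles
                    : pmds[i].n_rxq < pmds[best].n_rxq)) {
                best = (int) i;
            }
        }
        return best;
    }

Once the series is applied, the new mode is selected in the same way as the existing assignment types:

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group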
From patchwork Fri Jun 4 21:18:55 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kevin Traynor X-Patchwork-Id: 1488118 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=openvswitch.org (client-ip=2605:bc80:3010::137; helo=smtp4.osuosl.org; envelope-from=ovs-dev-bounces@openvswitch.org; receiver=) Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=NrwAeQeG; dkim-atps=neutral Received: from smtp4.osuosl.org (smtp4.osuosl.org [IPv6:2605:bc80:3010::137]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4FxbHz695cz9sRK for ; Sat, 5 Jun 2021 07:19:43 +1000 (AEST) Received: from localhost (localhost [127.0.0.1]) by smtp4.osuosl.org (Postfix) with ESMTP id 6CC9F4160C; Fri, 4 Jun 2021 21:19:41 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Received: from smtp4.osuosl.org ([127.0.0.1]) by localhost (smtp4.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id QdgzT1go_NLA; Fri, 4 Jun 2021 21:19:37 +0000 (UTC) Received: from lists.linuxfoundation.org (lf-lists.osuosl.org [IPv6:2605:bc80:3010:104::8cd3:938]) by smtp4.osuosl.org (Postfix) with ESMTP id F1E5B4068A; Fri, 4 Jun 2021 21:19:31 +0000 (UTC) Received: from lf-lists.osuosl.org (localhost [127.0.0.1]) by lists.linuxfoundation.org (Postfix) with ESMTP id E2DD7C0027; Fri, 4 Jun 2021 21:19:31 +0000 (UTC) X-Original-To: dev@openvswitch.org Delivered-To: ovs-dev@lists.linuxfoundation.org Received: from smtp2.osuosl.org (smtp2.osuosl.org [140.211.166.133]) by lists.linuxfoundation.org (Postfix) with ESMTP id 99B98C0024 for ; Fri, 4 Jun 2021 21:19:30 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by smtp2.osuosl.org (Postfix) with ESMTP id 6B505405FE for ; Fri, 4 Jun 2021 21:19:30 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Authentication-Results: smtp2.osuosl.org (amavisd-new); dkim=pass (1024-bit key) header.d=redhat.com Received: from smtp2.osuosl.org ([127.0.0.1]) by localhost (smtp2.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id V24aKvUDK08V for ; Fri, 4 Jun 2021 21:19:26 +0000 (UTC) X-Greylist: domain auto-whitelisted by SQLgrey-1.8.0 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [216.205.24.124]) by smtp2.osuosl.org (Postfix) with ESMTPS id 4DC70401E8 for ; Fri, 4 Jun 2021 21:19:26 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1622841565; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=jL183FywPBqzE6vIcLMcOFmsX5D0keiFcyTRiKCA9S0=; b=NrwAeQeGbITbe9nW9TyxU0c4fHoKQY9Qt3fXzVOU8hyjTTZixluk5T7xj7cbyTjMazeSxL EzZbVETYaoGf2R4OtfXM3DPE3Vyfrews6HlcgoG8S5AgTdyVx2dzRI0Yu9Oy66DqkAvDhV KcCG+20KXN2CNKuTb4n0lmv7U3k8paQ= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-163-7oAg0zJvP9ySwm4BMkGrxw-1; Fri, 04 
Jun 2021 17:19:23 -0400 X-MC-Unique: 7oAg0zJvP9ySwm4BMkGrxw-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id D2DC9180FD66; Fri, 4 Jun 2021 21:19:22 +0000 (UTC) Received: from rh.redhat.com (ovpn-114-242.ams2.redhat.com [10.36.114.242]) by smtp.corp.redhat.com (Postfix) with ESMTP id 92125101E24F; Fri, 4 Jun 2021 21:19:21 +0000 (UTC) From: Kevin Traynor To: dev@openvswitch.org Date: Fri, 4 Jun 2021 22:18:55 +0100 Message-Id: <20210604211856.915563-5-ktraynor@redhat.com> In-Reply-To: <20210604211856.915563-1-ktraynor@redhat.com> References: <20210604211856.915563-1-ktraynor@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 Authentication-Results: relay.mimecast.com; auth=pass smtp.auth=CUSA124A263 smtp.mailfrom=ktraynor@redhat.com X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Cc: david.marchand@redhat.com Subject: [ovs-dev] [PATCH 4/5] dpif-netdev: Assign PMD for failed pinned rxqs. X-BeenThere: ovs-dev@openvswitch.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: ovs-dev-bounces@openvswitch.org Sender: "dev" Previously, if pmd-rxq-affinity was used to pin an rxq to a core that was not in pmd-cpu-mask the rxq was not polled and the user received a warning. Now that pinned and non-pinned rxqs are assigned to PMDs in a common call to rxq scheduling, if an invalid core is selected in pmd-rxq-affinity the rxq can be assigned an available PMD (if any). A warning will still be logged as the requested core could not be used. Signed-off-by: Kevin Traynor --- Documentation/topics/dpdk/pmd.rst | 6 +++--- lib/dpif-netdev.c | 30 ++++++++++++++++++++++++++++-- tests/pmd.at | 5 ++++- 3 files changed, 35 insertions(+), 6 deletions(-) diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst index d1c45cdfb..29ba53954 100644 --- a/Documentation/topics/dpdk/pmd.rst +++ b/Documentation/topics/dpdk/pmd.rst @@ -108,7 +108,7 @@ means that this thread will only poll the *pinned* Rx queues. If there are no *non-isolated* PMD threads, *non-pinned* RX queues will not - be polled. Also, if the provided ```` is not available (e.g. the - ```` is not in ``pmd-cpu-mask``), the RX queue will not be polled - by any PMD thread. + be polled. If the provided ```` is not available (e.g. the + ```` is not in ``pmd-cpu-mask``), the RX queue will be assigned to + a *non-isolated* PMD, that will remain *non-isolated*. 
If ``pmd-rxq-affinity`` is not set for Rx queues, they will be assigned to PMDs diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 61e0a516f..377573233 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -5027,4 +5027,25 @@ find_sched_pmd_by_pmd(struct sched_numa_list *numa_list, } +static struct sched_pmd * +find_sched_pmd_by_rxq(struct sched_numa_list *numa_list, + struct dp_netdev_rxq *rxq) +{ + struct sched_numa *numa; + + HMAP_FOR_EACH (numa, node, &numa_list->numas) { + for (unsigned i = 0; i < numa->n_pmds; i++) { + struct sched_pmd *sched_pmd; + + sched_pmd = &numa->pmds[i]; + for (int k = 0; k < sched_pmd->n_rxq; k++) { + if (sched_pmd->rxqs[k] == rxq) { + return sched_pmd; + } + } + } + } + return NULL; +} + static struct sched_numa * sched_numa_list_find_numa(struct sched_numa_list *numa_list, @@ -5408,5 +5429,6 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list, "Core %2u cannot be pinned with " "port \'%s\' rx queue %d. Use pmd-cpu-mask to " - "enable a pmd on core %u.", + "enable a pmd on core %u. An alternative core " + "will be assigned.", rxq->core_id, netdev_rxq_get_name(rxq->rx), @@ -5453,5 +5475,9 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list, if (rxq->core_id != OVS_CORE_UNSPEC) { - continue; + /* This rxq should have been pinned, check it was. */ + sched_pmd = find_sched_pmd_by_rxq(numa_list, rxq); + if (sched_pmd && sched_pmd->pmd->core_id == rxq->core_id) { + continue; + } } diff --git a/tests/pmd.at b/tests/pmd.at index 78105bf45..55977632a 100644 --- a/tests/pmd.at +++ b/tests/pmd.at @@ -552,7 +552,10 @@ AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=6]) dnl We removed the cores requested by some queues from pmd-cpu-mask. -dnl Those queues will not be polled. +dnl Those queues will be polled by remaining non-isolated pmds. 
AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show], [0], [dnl +p1 0 0 1 +p1 1 0 1 p1 2 0 2 +p1 3 0 1 ]) From patchwork Fri Jun 4 21:18:56 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kevin Traynor X-Patchwork-Id: 1488119 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=openvswitch.org (client-ip=140.211.166.133; helo=smtp2.osuosl.org; envelope-from=ovs-dev-bounces@openvswitch.org; receiver=) Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=ZCq1hAGp; dkim-atps=neutral Received: from smtp2.osuosl.org (smtp2.osuosl.org [140.211.166.133]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4FxbJ546j2z9sW4 for ; Sat, 5 Jun 2021 07:19:49 +1000 (AEST) Received: from localhost (localhost [127.0.0.1]) by smtp2.osuosl.org (Postfix) with ESMTP id C092E41DD9; Fri, 4 Jun 2021 21:19:47 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Received: from smtp2.osuosl.org ([127.0.0.1]) by localhost (smtp2.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 4sfPmQajOrhD; Fri, 4 Jun 2021 21:19:44 +0000 (UTC) Received: from lists.linuxfoundation.org (lf-lists.osuosl.org [140.211.9.56]) by smtp2.osuosl.org (Postfix) with ESMTP id 9CBBB41DC2; Fri, 4 Jun 2021 21:19:34 +0000 (UTC) Received: from lf-lists.osuosl.org (localhost [127.0.0.1]) by lists.linuxfoundation.org (Postfix) with ESMTP id 80571C0019; Fri, 4 Jun 2021 21:19:34 +0000 (UTC) X-Original-To: dev@openvswitch.org Delivered-To: ovs-dev@lists.linuxfoundation.org Received: from smtp4.osuosl.org (smtp4.osuosl.org [IPv6:2605:bc80:3010::137]) by lists.linuxfoundation.org (Postfix) with ESMTP id 8769BC000D for ; Fri, 4 Jun 2021 21:19:33 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by smtp4.osuosl.org (Postfix) with ESMTP id 66B5D415CB for ; Fri, 4 Jun 2021 21:19:33 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Authentication-Results: smtp4.osuosl.org (amavisd-new); dkim=pass (1024-bit key) header.d=redhat.com Received: from smtp4.osuosl.org ([127.0.0.1]) by localhost (smtp4.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id X_ROHpFBWsWb for ; Fri, 4 Jun 2021 21:19:32 +0000 (UTC) X-Greylist: domain auto-whitelisted by SQLgrey-1.8.0 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by smtp4.osuosl.org (Postfix) with ESMTPS id 7380E406B2 for ; Fri, 4 Jun 2021 21:19:29 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1622841568; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=UOmSiDiSe5+kN6Ig4IIsZDXysXZbB+kEEXOJFlgD3Bc=; b=ZCq1hAGpiAShX/rAKcvxK+ZRb8FW8SlqvvIL1KIAxXwvvCe8jVl87/il/AQ5cGOU9IeEXj XM2xo6bCKuoRHd22R+Sa0swt2eVqs0Y6c1oKMTeJ8l8k31LJA+eBfM84bMr1yldNPRCk+Z TJkxcQQETkJwOBTq7ckDqXr5R7jHBM8= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) 
(Using TLS) by relay.mimecast.com with ESMTP id us-mta-42-QtVG2nhFPa2U0un5JwqTpQ-1; Fri, 04 Jun 2021 17:19:25 -0400 X-MC-Unique: QtVG2nhFPa2U0un5JwqTpQ-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 5A56F501F5; Fri, 4 Jun 2021 21:19:24 +0000 (UTC) Received: from rh.redhat.com (ovpn-114-242.ams2.redhat.com [10.36.114.242]) by smtp.corp.redhat.com (Postfix) with ESMTP id 2B6EF10023AC; Fri, 4 Jun 2021 21:19:23 +0000 (UTC) From: Kevin Traynor To: dev@openvswitch.org Date: Fri, 4 Jun 2021 22:18:56 +0100 Message-Id: <20210604211856.915563-6-ktraynor@redhat.com> In-Reply-To: <20210604211856.915563-1-ktraynor@redhat.com> References: <20210604211856.915563-1-ktraynor@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 Authentication-Results: relay.mimecast.com; auth=pass smtp.auth=CUSA124A263 smtp.mailfrom=ktraynor@redhat.com X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Cc: david.marchand@redhat.com Subject: [ovs-dev] [PATCH 5/5] dpif-netdev: Allow pin rxq and non-isolate PMD. X-BeenThere: ovs-dev@openvswitch.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: ovs-dev-bounces@openvswitch.org Sender: "dev" Pinning an rxq to a PMD with pmd-rxq-affinity may be done for various reasons such as reserving a full PMD for an rxq, or to ensure that multiple rxqs from a port are handled on different PMDs. Previously pmd-rxq-affinity always isolated the PMD so no other rxqs could be assigned to it by OVS. There may be cases where there is unused cycles on those pmds and the user would like other rxqs to also be able to be assigned to it be OVS. Add an option to pin the rxq and non-isolate. The default behaviour is unchanged, which is pin and isolate. In order to pin and non-isolate: ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false Note this is available only with group assignment type. Signed-off-by: Kevin Traynor --- Documentation/topics/dpdk/pmd.rst | 9 ++++++-- lib/dpif-netdev.c | 37 +++++++++++++++++++++++++------ vswitchd/vswitch.xml | 19 ++++++++++++++++ 3 files changed, 56 insertions(+), 9 deletions(-) diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst index 29ba53954..a24a59430 100644 --- a/Documentation/topics/dpdk/pmd.rst +++ b/Documentation/topics/dpdk/pmd.rst @@ -102,6 +102,11 @@ like so: - Queue #3 pinned to core 8 -PMD threads on cores where Rx queues are *pinned* will become *isolated*. This -means that this thread will only poll the *pinned* Rx queues. +PMD threads on cores where Rx queues are *pinned* will become *isolated* by +default. This means that this thread will only poll the *pinned* Rx queues. + +If using ``pmd-rxq-assign=group`` PMD threads with *pinned* Rxqs can be +*non-isolated* by setting:: + + $ ovs-vsctl set Open_vSwitch . other_config:pmd-isolate=false .. warning:: diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 377573233..cf592a23e 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -375,4 +375,5 @@ struct dp_netdev { /* Use measured cycles for rxq to pmd assignment. 
*/ enum sched_assignment_type pmd_rxq_assign_cyc; + bool pmd_iso; /* Protects the access of the 'struct dp_netdev_pmd_thread' @@ -4370,4 +4371,22 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config) } + bool pmd_iso = smap_get_bool(other_config, "pmd-rxq-isolate", true); + + if (pmd_rxq_assign_cyc != SCHED_GROUP && pmd_iso == false) { + /* Invalid combination*/ + VLOG_WARN("pmd-rxq-isolate can only be set false " + "when using pmd-rxq-assign=group"); + pmd_iso = true; + } + if (dp->pmd_iso != pmd_iso) { + dp->pmd_iso = pmd_iso; + if (pmd_iso) { + VLOG_INFO("pmd-rxq-affinity isolates PMD core"); + } else { + VLOG_INFO("pmd-rxq-affinity does not isolate PMD core"); + } + dp_netdev_request_reconfigure(dp); + } + struct pmd_auto_lb *pmd_alb = &dp->pmd_alb; bool cur_rebalance_requested = pmd_alb->auto_lb_requested; @@ -5107,5 +5126,5 @@ sched_numa_list_assignments(struct sched_numa_list *numa_list, sched_pmd = find_sched_pmd_by_pmd(numa_list, rxq->pmd); if (sched_pmd) { - if (rxq->core_id != OVS_CORE_UNSPEC) { + if (rxq->core_id != OVS_CORE_UNSPEC && dp->pmd_iso) { sched_pmd->isolated = true; } @@ -5417,4 +5436,5 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list, struct dp_netdev_pmd_thread *pmd; struct sched_numa *numa; + bool iso = dp->pmd_iso; uint64_t proc_cycles; char rxq_cyc_log[MAX_RXQ_CYC_STRLEN]; @@ -5437,10 +5457,13 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list, continue; } - /* Mark PMD as isolated if not done already. */ - if (sched_pmd->isolated == false) { - sched_pmd->isolated = true; - numa = sched_numa_list_find_numa(numa_list, - sched_pmd); - numa->n_iso++; + /* Check if isolating PMDs with pinned rxqs.*/ + if (iso) { + /* Mark PMD as isolated if not done already. */ + if (sched_pmd->isolated == false) { + sched_pmd->isolated = true; + numa = sched_numa_list_find_numa(numa_list, + sched_pmd); + numa->n_iso++; + } } proc_cycles = dp_netdev_rxq_get_cycles(rxq, diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index 14cb8a2c6..dca334961 100644 --- a/vswitchd/vswitch.xml +++ b/vswitchd/vswitch.xml @@ -545,4 +545,23 @@
+ +

+ Specifies whether a CPU core will be isolated after being pinned with + an Rx queue. +

+ Set this value to false to non-isolate a CPU core after + it is pinned with an Rxq using pmd-rxq-affinity. This + will allow OVS to assign other Rxqs to that CPU core. +

+

+ The default value is true. +

+

+ This can only be false when pmd-rxq-assign + is set to group. +

+
+
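Putting the last two patches together, a minimal end-to-end sketch of pinning without isolation might look as follows. The port name dpdk0 and the queue/core numbers are illustrative assumptions (core 8 is taken to be in pmd-cpu-mask); the keys are the ones read by dpif_netdev_set_config() and documented above:

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group
    $ ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:8"
    $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false

With these settings, queue 0 of dpdk0 is pinned to the PMD on core 8, but that PMD stays non-isolated, so the scheduler may still place other rxqs on it; setting pmd-rxq-isolate back to true restores the previous pin-and-isolate behaviour.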