From patchwork Thu Jul 8 14:02:31 2021
X-Patchwork-Submitter: "Ferriter, Cian"
X-Patchwork-Id: 1502306
From: Cian Ferriter
To: ovs-dev@openvswitch.org
Cc: i.maximets@ovn.org, fbl@sysclose.org
Date: Thu, 8 Jul 2021 15:02:31 +0100
Message-Id: <20210708140240.61172-2-cian.ferriter@intel.com>
In-Reply-To: <20210708140240.61172-1-cian.ferriter@intel.com>
References: <20210708140240.61172-1-cian.ferriter@intel.com>
Subject: [ovs-dev] [v15 01/10] dpif-netdev: Refactor to multiple header files.

From: Harry van Haaren

Split the very large file dpif-netdev.c and the data structures it contains
into multiple header files. Each header file is responsible for the data
structures of its component. This logical split allows better reuse and
modularity of the code, and reduces the very large dpif-netdev.c file to
something more manageable.
Due to dependencies between components, it is not possible to move
components at a finer granularity than this patch does. To explain the
dependencies better, e.g.:

DPCLS has no deps (from the dpif-netdev.c file)
FLOW depends on DPCLS (struct dpcls_rule)
DFC depends on DPCLS (netdev_flow_key) and FLOW (netdev_flow_key)
THREAD depends on DFC (struct dfc_cache)

DFC_PROC depends on THREAD (struct pmd_thread)

DPCLS lookup.h/c require only DPCLS
DPCLS implementations require only dpif-netdev-lookup.h.
- This change was made in the 2.12 release with function pointers.
- This commit only refactors the name to "private-dpcls.h".

netdev_flow_key_equal_mf() is renamed to emc_flow_key_equal_mf().

Functions specific to dpcls are renamed from the netdev_* namespace to
the dpcls_* namespace, as they are only used by dpcls code.

'inline' is added to dp_netdev_flow_hash() when its definition is moved,
to fix a compiler error.

One valid checkpatch issue with the use of the
EMC_FOR_EACH_POS_WITH_HASH() macro was fixed.

Signed-off-by: Harry van Haaren
Co-authored-by: Cian Ferriter
Signed-off-by: Cian Ferriter
Acked-by: Flavio Leitner

---
Cc: Gaetan Rivet
Cc: Sriharsha Basavapatna

v15:
- Added Flavio's Acked-by tag.

v14:
- Make some functions in lib/dpif-netdev-private-dfc.c private as they
  aren't used in other files.
- Fix the order of includes to match what is laid out in
  coding-style.rst.

v13:
- Add NEWS item in this commit rather than later.
- Add the lib/dpif-netdev-private-dfc.c file and move non-fast-path
  dfc-related functions there.
- Squash the commit which renames functions specific to dpcls from the
  netdev_* namespace to the dpcls_* namespace (as they are only used by
  dpcls code) into this commit.
- Minor fixes from review comments.
---
 NEWS                                   |   1 +
 lib/automake.mk                        |   5 +
 lib/dpif-netdev-lookup-autovalidator.c |   1 -
 lib/dpif-netdev-lookup-avx512-gather.c |   1 -
 lib/dpif-netdev-lookup-generic.c       |   1 -
 lib/dpif-netdev-lookup.h               |   2 +-
 lib/dpif-netdev-private-dfc.c          | 110 +++++
 lib/dpif-netdev-private-dfc.h          | 164 ++++++++
 lib/dpif-netdev-private-dpcls.h        | 128 ++++++
 lib/dpif-netdev-private-flow.h         | 163 ++++++++
 lib/dpif-netdev-private-thread.h       | 206 ++++++++++
 lib/dpif-netdev-private.h              | 100 +----
 lib/dpif-netdev.c                      | 539 +------------------------
 13 files changed, 801 insertions(+), 620 deletions(-)
 create mode 100644 lib/dpif-netdev-private-dfc.c
 create mode 100644 lib/dpif-netdev-private-dfc.h
 create mode 100644 lib/dpif-netdev-private-dpcls.h
 create mode 100644 lib/dpif-netdev-private-flow.h
 create mode 100644 lib/dpif-netdev-private-thread.h

diff --git a/NEWS b/NEWS
index dddd57fc2..d7b278cab 100644
--- a/NEWS
+++ b/NEWS
@@ -17,6 +17,7 @@ Post-v2.15.0
       cases, e.g if all PMD threads are running on the same NUMA node.
     * Userspace datapath now supports up to 2^18 meters.
     * Added support for systems with non-contiguous NUMA nodes and core ids.
+    * Refactor lib/dpif-netdev.c to multiple header files.
   - ovs-ctl:
     * New option '--no-record-hostname' to disable hostname configuration
       in ovsdb on startup.
diff --git a/lib/automake.mk b/lib/automake.mk
index 1980bbeef..8690bfb7a 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -111,6 +111,11 @@ lib_libopenvswitch_la_SOURCES = \
 	lib/dpif-netdev-lookup-generic.c \
 	lib/dpif-netdev.c \
 	lib/dpif-netdev.h \
+	lib/dpif-netdev-private-dfc.c \
+	lib/dpif-netdev-private-dfc.h \
+	lib/dpif-netdev-private-dpcls.h \
+	lib/dpif-netdev-private-flow.h \
+	lib/dpif-netdev-private-thread.h \
 	lib/dpif-netdev-private.h \
 	lib/dpif-netdev-perf.c \
 	lib/dpif-netdev-perf.h \
diff --git a/lib/dpif-netdev-lookup-autovalidator.c b/lib/dpif-netdev-lookup-autovalidator.c
index 97b59fdd0..475e1ab1e 100644
--- a/lib/dpif-netdev-lookup-autovalidator.c
+++ b/lib/dpif-netdev-lookup-autovalidator.c
@@ -17,7 +17,6 @@
 #include <config.h>
 #include "dpif-netdev.h"
 #include "dpif-netdev-lookup.h"
-#include "dpif-netdev-private.h"
 #include "openvswitch/vlog.h"
 
 VLOG_DEFINE_THIS_MODULE(dpif_lookup_autovalidator);
diff --git a/lib/dpif-netdev-lookup-avx512-gather.c b/lib/dpif-netdev-lookup-avx512-gather.c
index 5e3634249..8fc1cdfa5 100644
--- a/lib/dpif-netdev-lookup-avx512-gather.c
+++ b/lib/dpif-netdev-lookup-avx512-gather.c
@@ -21,7 +21,6 @@
 
 #include "dpif-netdev.h"
 #include "dpif-netdev-lookup.h"
-#include "dpif-netdev-private.h"
 #include "cmap.h"
 #include "flow.h"
 #include "pvector.h"
diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
index b1a0cfc36..e3b6be4b6 100644
--- a/lib/dpif-netdev-lookup-generic.c
+++ b/lib/dpif-netdev-lookup-generic.c
@@ -17,7 +17,6 @@
 #include <config.h>
 #include "dpif-netdev.h"
-#include "dpif-netdev-private.h"
 #include "dpif-netdev-lookup.h"
 
 #include "bitmap.h"
diff --git a/lib/dpif-netdev-lookup.h b/lib/dpif-netdev-lookup.h
index bd72aa29b..59f51faa0 100644
--- a/lib/dpif-netdev-lookup.h
+++ b/lib/dpif-netdev-lookup.h
@@ -19,7 +19,7 @@
 #include <config.h>
 #include "dpif-netdev.h"
-#include "dpif-netdev-private.h"
+#include "dpif-netdev-private-dpcls.h"
 
 /* Function to perform a probe for the subtable bit fingerprint.
  * Returns NULL if not valid, or a valid function pointer to call for this
diff --git a/lib/dpif-netdev-private-dfc.c b/lib/dpif-netdev-private-dfc.c
new file mode 100644
index 000000000..1d53fafff
--- /dev/null
+++ b/lib/dpif-netdev-private-dfc.c
@@ -0,0 +1,110 @@
+/*
+ * Copyright (c) 2008, 2009, 2010, 2011, 2012, 2013, 2015 Nicira, Inc.
+ * Copyright (c) 2019, 2020, 2021 Intel Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+
+#include "dpif-netdev-private-dfc.h"
+
+static void
+emc_clear_entry(struct emc_entry *ce)
+{
+    if (ce->flow) {
+        dp_netdev_flow_unref(ce->flow);
+        ce->flow = NULL;
+    }
+}
+
+static void
+smc_clear_entry(struct smc_bucket *b, int idx)
+{
+    b->flow_idx[idx] = UINT16_MAX;
+}
+
+static void
+emc_cache_init(struct emc_cache *flow_cache)
+{
+    int i;
+
+    flow_cache->sweep_idx = 0;
+    for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
+        flow_cache->entries[i].flow = NULL;
+        flow_cache->entries[i].key.hash = 0;
+        flow_cache->entries[i].key.len = sizeof(struct miniflow);
+        flowmap_init(&flow_cache->entries[i].key.mf.map);
+    }
+}
+
+static void
+smc_cache_init(struct smc_cache *smc_cache)
+{
+    int i, j;
+    for (i = 0; i < SMC_BUCKET_CNT; i++) {
+        for (j = 0; j < SMC_ENTRY_PER_BUCKET; j++) {
+            smc_cache->buckets[i].flow_idx[j] = UINT16_MAX;
+        }
+    }
+}
+
+void
+dfc_cache_init(struct dfc_cache *flow_cache)
+{
+    emc_cache_init(&flow_cache->emc_cache);
+    smc_cache_init(&flow_cache->smc_cache);
+}
+
+static void
+emc_cache_uninit(struct emc_cache *flow_cache)
+{
+    int i;
+
+    for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
+        emc_clear_entry(&flow_cache->entries[i]);
+    }
+}
+
+static void
+smc_cache_uninit(struct smc_cache *smc)
+{
+    int i, j;
+
+    for (i = 0; i < SMC_BUCKET_CNT; i++) {
+        for (j = 0; j < SMC_ENTRY_PER_BUCKET; j++) {
+            smc_clear_entry(&(smc->buckets[i]), j);
+        }
+    }
+}
+
+void
+dfc_cache_uninit(struct dfc_cache *flow_cache)
+{
+    smc_cache_uninit(&flow_cache->smc_cache);
+    emc_cache_uninit(&flow_cache->emc_cache);
+}
+
+/* Check and clear dead flow references slowly (one entry at each
+ * invocation). */
+void
+emc_cache_slow_sweep(struct emc_cache *flow_cache)
+{
+    struct emc_entry *entry = &flow_cache->entries[flow_cache->sweep_idx];
+
+    if (!emc_entry_alive(entry)) {
+        emc_clear_entry(entry);
+    }
+    flow_cache->sweep_idx = (flow_cache->sweep_idx + 1) & EM_FLOW_HASH_MASK;
+}
diff --git a/lib/dpif-netdev-private-dfc.h b/lib/dpif-netdev-private-dfc.h
new file mode 100644
index 000000000..6f1570355
--- /dev/null
+++ b/lib/dpif-netdev-private-dfc.h
@@ -0,0 +1,164 @@
+/*
+ * Copyright (c) 2008, 2009, 2010, 2011, 2012, 2013, 2015 Nicira, Inc.
+ * Copyright (c) 2019, 2020, 2021 Intel Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef DPIF_NETDEV_PRIVATE_DFC_H
+#define DPIF_NETDEV_PRIVATE_DFC_H 1
+
+#include "dpif.h"
+#include "dpif-netdev-private-dpcls.h"
+#include "dpif-netdev-private-flow.h"
+
+#include <stdbool.h>
+#include <stdint.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* EMC cache and SMC cache compose the datapath flow cache (DFC)
+ *
+ * Exact match cache for frequently used flows
+ *
+ * The cache uses a 32-bit hash of the packet (which can be the RSS hash) to
+ * search its entries for a miniflow that matches exactly the miniflow of the
+ * packet. It stores the 'dpcls_rule' (rule) that matches the miniflow.
+ *
+ * A cache entry holds a reference to its 'dp_netdev_flow'.
+ *
+ * A miniflow with a given hash can be in one of EM_FLOW_HASH_SEGS different
+ * entries. The 32-bit hash is split into EM_FLOW_HASH_SEGS values (each of
+ * them is EM_FLOW_HASH_SHIFT bits wide and the remainder is thrown away). Each
+ * value is the index of a cache entry where the miniflow could be.
+ *
+ *
+ * Signature match cache (SMC)
+ *
+ * This cache stores a 16-bit signature for each flow without storing keys, and
+ * stores the corresponding 16-bit flow_table index to the 'dp_netdev_flow'.
+ * Each flow thus occupies 32bit which is much more memory efficient than EMC.
+ * SMC uses a set-associative design that each bucket contains
+ * SMC_ENTRY_PER_BUCKET number of entries.
+ * Since 16-bit flow_table index is used, if there are more than 2^16
+ * dp_netdev_flow, SMC will miss them that cannot be indexed by a 16-bit value.
+ *
+ *
+ * Thread-safety
+ * =============
+ *
+ * Each pmd_thread has its own private exact match cache.
+ * If dp_netdev_input is not called from a pmd thread, a mutex is used.
+ */
+
+#define EM_FLOW_HASH_SHIFT 13
+#define EM_FLOW_HASH_ENTRIES (1u << EM_FLOW_HASH_SHIFT)
+#define EM_FLOW_HASH_MASK (EM_FLOW_HASH_ENTRIES - 1)
+#define EM_FLOW_HASH_SEGS 2
+
+/* SMC uses a set-associative design. A bucket contains a set of entries that
+ * a flow item can occupy. For now, it uses one hash function rather than two
+ * as for the EMC design. */
+#define SMC_ENTRY_PER_BUCKET 4
+#define SMC_ENTRIES (1u << 20)
+#define SMC_BUCKET_CNT (SMC_ENTRIES / SMC_ENTRY_PER_BUCKET)
+#define SMC_MASK (SMC_BUCKET_CNT - 1)
+
+/* Default EMC insert probability is 1 / DEFAULT_EM_FLOW_INSERT_INV_PROB */
+#define DEFAULT_EM_FLOW_INSERT_INV_PROB 100
+#define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX /                     \
+                                    DEFAULT_EM_FLOW_INSERT_INV_PROB)
+
+struct emc_entry {
+    struct dp_netdev_flow *flow;
+    struct netdev_flow_key key;   /* key.hash used for emc hash value. */
+};
+
+struct emc_cache {
+    struct emc_entry entries[EM_FLOW_HASH_ENTRIES];
+    int sweep_idx;                /* For emc_cache_slow_sweep(). */
+};
+
+struct smc_bucket {
+    uint16_t sig[SMC_ENTRY_PER_BUCKET];
+    uint16_t flow_idx[SMC_ENTRY_PER_BUCKET];
+};
+
+/* Signature match cache, differentiate from EMC cache */
+struct smc_cache {
+    struct smc_bucket buckets[SMC_BUCKET_CNT];
+};
+
+struct dfc_cache {
+    struct emc_cache emc_cache;
+    struct smc_cache smc_cache;
+};
+
+/* Iterate in the exact match cache through every entry that might contain a
+ * miniflow with hash 'HASH'. */
+#define EMC_FOR_EACH_POS_WITH_HASH(EMC, CURRENT_ENTRY, HASH)                 \
+    for (uint32_t i__ = 0, srch_hash__ = (HASH);                             \
+         (CURRENT_ENTRY) = &(EMC)->entries[srch_hash__ & EM_FLOW_HASH_MASK], \
+         i__ < EM_FLOW_HASH_SEGS;                                            \
+         i__++, srch_hash__ >>= EM_FLOW_HASH_SHIFT)
+
+void dfc_cache_init(struct dfc_cache *flow_cache);
+
+void dfc_cache_uninit(struct dfc_cache *flow_cache);
+
+/* Check and clear dead flow references slowly (one entry at each
+ * invocation). */
+void emc_cache_slow_sweep(struct emc_cache *flow_cache);
+
+static inline bool
+emc_entry_alive(struct emc_entry *ce)
+{
+    return ce->flow && !ce->flow->dead;
+}
+
+/* Used to compare 'netdev_flow_key' in the exact match cache to a miniflow.
+ * The maps are compared bitwise, so both 'key->mf' and 'mf' must have been
+ * generated by miniflow_extract. */
+static inline bool
+emc_flow_key_equal_mf(const struct netdev_flow_key *key,
+                      const struct miniflow *mf)
+{
+    return !memcmp(&key->mf, mf, key->len);
+}
+
+static inline struct dp_netdev_flow *
+emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key)
+{
+    struct emc_entry *current_entry;
+
+    EMC_FOR_EACH_POS_WITH_HASH (cache, current_entry, key->hash) {
+        if (current_entry->key.hash == key->hash
+            && emc_entry_alive(current_entry)
+            && emc_flow_key_equal_mf(&current_entry->key, &key->mf)) {
+
+            /* We found the entry with the 'key->mf' miniflow */
+            return current_entry->flow;
+        }
+    }
+
+    return NULL;
+}
+
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* dpif-netdev-private-dfc.h */
diff --git a/lib/dpif-netdev-private-dpcls.h b/lib/dpif-netdev-private-dpcls.h
new file mode 100644
index 000000000..dc22431a3
--- /dev/null
+++ b/lib/dpif-netdev-private-dpcls.h
@@ -0,0 +1,128 @@
+/*
+ * Copyright (c) 2008, 2009, 2010, 2011, 2012, 2013, 2015 Nicira, Inc.
+ * Copyright (c) 2019, 2020, 2021 Intel Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef DPIF_NETDEV_PRIVATE_DPCLS_H
+#define DPIF_NETDEV_PRIVATE_DPCLS_H 1
+
+#include "dpif.h"
+
+#include <stdbool.h>
+#include <stdint.h>
+
+#include "cmap.h"
+#include "openvswitch/thread.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Forward declaration for lookup_func typedef. */
+struct dpcls_subtable;
+struct dpcls_rule;
+
+/* Must be public as it is instantiated in subtable struct below.
+ */
+struct netdev_flow_key {
+    uint32_t hash;       /* Hash function differs for different users. */
+    uint32_t len;        /* Length of the following miniflow (incl. map). */
+    struct miniflow mf;
+    uint64_t buf[FLOW_MAX_PACKET_U64S];
+};
+
+/* A rule to be inserted to the classifier. */
+struct dpcls_rule {
+    struct cmap_node cmap_node;   /* Within struct dpcls_subtable 'rules'. */
+    struct netdev_flow_key *mask; /* Subtable's mask. */
+    struct netdev_flow_key flow;  /* Matching key. */
+    /* 'flow' must be the last field, additional space is allocated here. */
+};
+
+/* Lookup function for a subtable in the dpcls. This function is called
+ * by each subtable with an array of packets, and a bitmask of packets to
+ * perform the lookup on. Using a function pointer gives flexibility to
+ * optimize the lookup function based on subtable properties and the
+ * CPU instruction set available at runtime.
+ */
+typedef
+uint32_t (*dpcls_subtable_lookup_func)(struct dpcls_subtable *subtable,
+                                       uint32_t keys_map,
+                                       const struct netdev_flow_key *keys[],
+                                       struct dpcls_rule **rules);
+
+/* A set of rules that all have the same fields wildcarded. */
+struct dpcls_subtable {
+    /* The fields are only used by writers. */
+    struct cmap_node cmap_node OVS_GUARDED; /* Within dpcls 'subtables_map'. */
+
+    /* These fields are accessed by readers. */
+    struct cmap rules;           /* Contains "struct dpcls_rule"s. */
+    uint32_t hit_cnt;            /* Number of match hits in subtable in current
+                                    optimization interval. */
+
+    /* Miniflow fingerprint that the subtable matches on. The miniflow "bits"
+     * are used to select the actual dpcls lookup implementation at subtable
+     * creation time.
+     */
+    uint8_t mf_bits_set_unit0;
+    uint8_t mf_bits_set_unit1;
+
+    /* The lookup function to use for this subtable. If there is a known
+     * property of the subtable (eg: only 3 bits of miniflow metadata is
+     * used for the lookup) then this can point at an optimized version of
+     * the lookup function for this particular subtable. */
+    dpcls_subtable_lookup_func lookup_func;
+
+    /* Caches the masks to match a packet to, reducing runtime calculations. */
+    uint64_t *mf_masks;
+
+    struct netdev_flow_key mask; /* Wildcards for fields (const). */
+    /* 'mask' must be the last field, additional space is allocated here. */
+};
+
+/* Iterate through netdev_flow_key TNL u64 values specified by 'FLOWMAP'. */
+#define NETDEV_FLOW_KEY_FOR_EACH_IN_FLOWMAP(VALUE, KEY, FLOWMAP)   \
+    MINIFLOW_FOR_EACH_IN_FLOWMAP (VALUE, &(KEY)->mf, FLOWMAP)
+
+/* Generates a mask for each bit set in the subtable's miniflow. */
+void
+dpcls_flow_key_gen_masks(const struct netdev_flow_key *tbl, uint64_t *mf_masks,
+                         const uint32_t mf_bits_u0, const uint32_t mf_bits_u1);
+
+/* Matches a dpcls rule against the incoming packet in 'target' */
+bool dpcls_rule_matches_key(const struct dpcls_rule *rule,
+                            const struct netdev_flow_key *target);
+
+static inline uint32_t
+dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet,
+                                         const struct miniflow *mf)
+{
+    uint32_t hash;
+
+    if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
+        hash = dp_packet_get_rss_hash(packet);
+    } else {
+        hash = miniflow_hash_5tuple(mf, 0);
+        dp_packet_set_rss_hash(packet, hash);
+    }
+
+    return hash;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* dpif-netdev-private-dpcls.h */
diff --git a/lib/dpif-netdev-private-flow.h b/lib/dpif-netdev-private-flow.h
new file mode 100644
index 000000000..303066067
--- /dev/null
+++ b/lib/dpif-netdev-private-flow.h
@@ -0,0 +1,163 @@
+/*
+ * Copyright (c) 2008, 2009, 2010, 2011, 2012, 2013, 2015 Nicira, Inc.
+ * Copyright (c) 2019, 2020, 2021 Intel Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef DPIF_NETDEV_PRIVATE_FLOW_H
+#define DPIF_NETDEV_PRIVATE_FLOW_H 1
+
+#include "dpif.h"
+#include "dpif-netdev-private-dpcls.h"
+
+#include <stdbool.h>
+#include <stdint.h>
+
+#include "cmap.h"
+#include "openvswitch/thread.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Contained by struct dp_netdev_flow's 'stats' member. */
+struct dp_netdev_flow_stats {
+    atomic_llong used;             /* Last used time, in monotonic msecs. */
+    atomic_ullong packet_count;    /* Number of packets matched. */
+    atomic_ullong byte_count;      /* Number of bytes matched. */
+    atomic_uint16_t tcp_flags;     /* Bitwise-OR of seen tcp_flags values. */
+};
+
+/* Contained by struct dp_netdev_flow's 'last_attrs' member. */
+struct dp_netdev_flow_attrs {
+    atomic_bool offloaded;         /* True if flow is offloaded to HW. */
+    ATOMIC(const char *) dp_layer; /* DP layer the flow is handled in. */
+};
+
+/* A flow in 'dp_netdev_pmd_thread's 'flow_table'.
+ *
+ *
+ * Thread-safety
+ * =============
+ *
+ * Except near the beginning or ending of its lifespan, rule 'rule' belongs to
+ * its pmd thread's classifier. The text below calls this classifier 'cls'.
+ *
+ * Motivation
+ * ----------
+ *
+ * The thread safety rules described here for "struct dp_netdev_flow" are
+ * motivated by two goals:
+ *
+ *    - Prevent threads that read members of "struct dp_netdev_flow" from
+ *      reading bad data due to changes by some thread concurrently modifying
+ *      those members.
+ *
+ *    - Prevent two threads making changes to members of a given "struct
+ *      dp_netdev_flow" from interfering with each other.
+ *
+ *
+ * Rules
+ * -----
+ *
+ * A flow 'flow' may be accessed without a risk of being freed during an RCU
+ * grace period. Code that needs to hold onto a flow for a while
+ * should try incrementing 'flow->ref_cnt' with dp_netdev_flow_ref().
+ *
+ * 'flow->ref_cnt' protects 'flow' from being freed. It doesn't protect the
+ * flow from being deleted from 'cls' and it doesn't protect members of 'flow'
+ * from modification.
+ *
+ * Some members, marked 'const', are immutable. Accessing other members
+ * requires synchronization, as noted in more detail below.
+ */
+struct dp_netdev_flow {
+    const struct flow flow;      /* Unmasked flow that created this entry. */
+    /* Hash table index by unmasked flow. */
+    const struct cmap_node node; /* In owning dp_netdev_pmd_thread's */
+                                 /* 'flow_table'. */
+    const struct cmap_node mark_node; /* In owning flow_mark's mark_to_flow */
+    const ovs_u128 ufid;         /* Unique flow identifier. */
+    const ovs_u128 mega_ufid;    /* Unique mega flow identifier. */
+    const unsigned pmd_id;       /* The 'core_id' of pmd thread owning this */
+                                 /* flow. */
+
+    /* Number of references.
+     * The classifier owns one reference.
+     * Any thread trying to keep a rule from being freed should hold its own
+     * reference. */
+    struct ovs_refcount ref_cnt;
+
+    bool dead;
+    uint32_t mark;               /* Unique flow mark assigned to a flow */
+
+    /* Statistics. */
+    struct dp_netdev_flow_stats stats;
+
+    /* Statistics and attributes received from the netdev offload provider. */
+    atomic_int netdev_flow_get_result;
+    struct dp_netdev_flow_stats last_stats;
+    struct dp_netdev_flow_attrs last_attrs;
+
+    /* Actions. */
+    OVSRCU_TYPE(struct dp_netdev_actions *) actions;
+
+    /* While processing a group of input packets, the datapath uses the next
+     * member to store a pointer to the output batch for the flow. It is
+     * reset after the batch has been sent out (See dp_netdev_queue_batches(),
+     * packet_batch_per_flow_init() and packet_batch_per_flow_execute()). */
+    struct packet_batch_per_flow *batch;
+
+    /* Packet classification. */
+    char *dp_extra_info;         /* String to return in a flow dump/get. */
+    struct dpcls_rule cr;        /* In owning dp_netdev's 'cls'. */
+    /* 'cr' must be the last member. */
+};
+
+static inline uint32_t
+dp_netdev_flow_hash(const ovs_u128 *ufid)
+{
+    return ufid->u32[0];
+}
+
+/* Given the number of bits set in miniflow's maps, returns the size of the
+ * 'netdev_flow_key.mf' */
+static inline size_t
+netdev_flow_key_size(size_t flow_u64s)
+{
+    return sizeof(struct miniflow) + MINIFLOW_VALUES_SIZE(flow_u64s);
+}
+
+/* forward declaration required for EMC to unref flows */
+void dp_netdev_flow_unref(struct dp_netdev_flow *);
+
+/* A set of datapath actions within a "struct dp_netdev_flow".
+ *
+ *
+ * Thread-safety
+ * =============
+ *
+ * A struct dp_netdev_actions 'actions' is protected with RCU. */
+struct dp_netdev_actions {
+    /* These members are immutable: they do not change during the struct's
+     * lifetime. */
+    unsigned int size;          /* Size of 'actions', in bytes. */
+    struct nlattr actions[];    /* Sequence of OVS_ACTION_ATTR_* attributes. */
+};
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* dpif-netdev-private-flow.h */
diff --git a/lib/dpif-netdev-private-thread.h b/lib/dpif-netdev-private-thread.h
new file mode 100644
index 000000000..91f3753d1
--- /dev/null
+++ b/lib/dpif-netdev-private-thread.h
@@ -0,0 +1,206 @@
+/*
+ * Copyright (c) 2008, 2009, 2010, 2011, 2012, 2013, 2015 Nicira, Inc.
+ * Copyright (c) 2019, 2020, 2021 Intel Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef DPIF_NETDEV_PRIVATE_THREAD_H
+#define DPIF_NETDEV_PRIVATE_THREAD_H 1
+
+#include "dpif.h"
+#include "dpif-netdev-perf.h"
+#include "dpif-netdev-private-dfc.h"
+
+#include <stdbool.h>
+#include <stdint.h>
+
+#include "cmap.h"
+#include "openvswitch/thread.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* PMD Thread Structures */
+
+/* A set of properties for the current processing loop that is not directly
+ * associated with the pmd thread itself, but with the packets being
+ * processed or the short-term system configuration (for example, time).
+ * Contained by struct dp_netdev_pmd_thread's 'ctx' member. */
+struct dp_netdev_pmd_thread_ctx {
+    /* Latest measured time. See 'pmd_thread_ctx_time_update()'. */
+    long long now;
+    /* RX queue from which last packet was received. */
+    struct dp_netdev_rxq *last_rxq;
+    /* EMC insertion probability context for the current processing cycle. */
+    uint32_t emc_insert_min;
+};
+
+/* PMD: Poll modes drivers. PMD accesses devices via polling to eliminate
+ * the performance overhead of interrupt processing. Therefore netdev can
+ * not implement rx-wait for these devices. dpif-netdev needs to poll
+ * these device to check for recv buffer. pmd-thread does polling for
+ * devices assigned to itself.
+ *
+ * DPDK used PMD for accessing NIC.
+ *
+ * Note, instance with cpu core id NON_PMD_CORE_ID will be reserved for
+ * I/O of all non-pmd threads. There will be no actual thread created
+ * for the instance.
+ *
+ * Each struct has its own flow cache and classifier per managed ingress port.
+ * For packets received on ingress port, a look up is done on corresponding PMD
+ * thread's flow cache and in case of a miss, lookup is performed in the
+ * corresponding classifier of port. Packets are executed with the found
+ * actions in either case.
+ * */ +struct dp_netdev_pmd_thread { + struct dp_netdev *dp; + struct ovs_refcount ref_cnt; /* Every reference must be refcount'ed. */ + struct cmap_node node; /* In 'dp->poll_threads'. */ + + /* Per thread exact-match cache. Note, the instance for cpu core + * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly + * need to be protected by 'non_pmd_mutex'. Every other instance + * will only be accessed by its own pmd thread. */ + OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct dfc_cache flow_cache; + + /* Flow-Table and classifiers + * + * Writers of 'flow_table' must take the 'flow_mutex'. Corresponding + * changes to 'classifiers' must be made while still holding the + * 'flow_mutex'. + */ + struct ovs_mutex flow_mutex; + struct cmap flow_table OVS_GUARDED; /* Flow table. */ + + /* One classifier per in_port polled by the pmd */ + struct cmap classifiers; + /* Periodically sort subtable vectors according to hit frequencies */ + long long int next_optimization; + /* End of the next time interval for which processing cycles + are stored for each polled rxq. */ + long long int rxq_next_cycle_store; + + /* Last interval timestamp. */ + uint64_t intrvl_tsc_prev; + /* Last interval cycles. */ + atomic_ullong intrvl_cycles; + + /* Current context of the PMD thread. */ + struct dp_netdev_pmd_thread_ctx ctx; + + struct seq *reload_seq; + uint64_t last_reload_seq; + + /* These are atomic variables used as a synchronization and configuration + * points for thread reload/exit. + * + * 'reload' atomic is the main one and it's used as a memory + * synchronization point for all other knobs and data. + * + * For a thread that requests PMD reload: + * + * * All changes that should be visible to the PMD thread must be made + * before setting the 'reload'. These changes could use any memory + * ordering model including 'relaxed'. + * * Setting the 'reload' atomic should occur in the same thread where + * all other PMD configuration options updated. 
+     * * Setting the 'reload' atomic should be done with 'release' memory
+     *   ordering model or stricter. This will guarantee that all previous
+     *   changes (including non-atomic and 'relaxed') will be visible to
+     *   the PMD thread.
+     * * To check that reload is done, thread should poll the 'reload' atomic
+     *   to become 'false'. Polling should be done with 'acquire' memory
+     *   ordering model or stricter. This ensures that PMD thread completed
+     *   the reload process.
+     *
+     * For the PMD thread:
+     *
+     * * PMD thread should read 'reload' atomic with 'acquire' memory
+     *   ordering model or stricter. This will guarantee that all changes
+     *   made before setting the 'reload' in the requesting thread will be
+     *   visible to the PMD thread.
+     * * All other configuration data could be read with any memory
+     *   ordering model (including non-atomic and 'relaxed') but *only after*
+     *   reading the 'reload' atomic set to 'true'.
+     * * When the PMD reload done, PMD should (optionally) set all the below
+     *   knobs except the 'reload' to their default ('false') values and
+     *   (mandatory), as the last step, set the 'reload' to 'false' using
+     *   'release' memory ordering model or stricter. This will inform the
+     *   requesting thread that PMD has completed a reload cycle.
+     */
+    atomic_bool reload;             /* Do we need to reload ports? */
+    atomic_bool wait_for_reload;    /* Can we busy wait for the next reload? */
+    atomic_bool reload_tx_qid;      /* Do we need to reload static_tx_qid? */
+    atomic_bool exit;               /* For terminating the pmd thread. */
+
+    pthread_t thread;
+    unsigned core_id;               /* CPU core id of this pmd thread. */
+    int numa_id;                    /* numa node id of this pmd thread. */
+    bool isolated;
+
+    /* Queue id used by this pmd thread to send packets on all netdevs if
+     * XPS disabled for this netdev. All static_tx_qid's are unique and less
+     * than 'cmap_count(dp->poll_threads)'. */
+    uint32_t static_tx_qid;
+
+    /* Number of filled output batches.
 */
+    int n_output_batches;
+
+    struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and 'tx_ports'. */
+    /* List of rx queues to poll. */
+    struct hmap poll_list OVS_GUARDED;
+    /* Map of 'tx_port's used for transmission. Written by the main thread,
+     * read by the pmd thread. */
+    struct hmap tx_ports OVS_GUARDED;
+
+    struct ovs_mutex bond_mutex;    /* Protects updates of 'tx_bonds'. */
+    /* Map of 'tx_bond's used for transmission. Written by the main thread
+     * and read by the pmd thread. */
+    struct cmap tx_bonds;
+
+    /* These are thread-local copies of 'tx_ports'. One contains only tunnel
+     * ports (that support push_tunnel/pop_tunnel), the other contains ports
+     * with at least one txq (that support send). A port can be in both.
+     *
+     * There are two separate maps to make sure that we don't try to execute
+     * OUTPUT on a device which has 0 txqs or PUSH/POP on a non-tunnel device.
+     *
+     * The instances for cpu core NON_PMD_CORE_ID can be accessed by multiple
+     * threads, and thusly need to be protected by 'non_pmd_mutex'. Every
+     * other instance will only be accessed by its own pmd thread. */
+    struct hmap tnl_port_cache;
+    struct hmap send_port_cache;
+
+    /* Keep track of detailed PMD performance statistics. */
+    struct pmd_perf_stats perf_stats;
+
+    /* Stats from previous iteration used by automatic pmd
+     * load balance logic. */
+    uint64_t prev_stats[PMD_N_STATS];
+    atomic_count pmd_overloaded;
+
+    /* Set to true if the pmd thread needs to be reloaded. */
+    bool need_reload;
+
+    /* Next time when PMD should try RCU quiescing.
 */
+    long long next_rcu_quiesce;
+};
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* dpif-netdev-private-thread.h */
diff --git a/lib/dpif-netdev-private.h b/lib/dpif-netdev-private.h
index 4fda1220b..d7b6fd7ec 100644
--- a/lib/dpif-netdev-private.h
+++ b/lib/dpif-netdev-private.h
@@ -18,95 +18,17 @@
 #ifndef DPIF_NETDEV_PRIVATE_H
 #define DPIF_NETDEV_PRIVATE_H 1
 
-#include <stdbool.h>
-#include <stdint.h>
-
-#include "dpif.h"
-#include "cmap.h"
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-
-/* Forward declaration for lookup_func typedef. */
-struct dpcls_subtable;
-struct dpcls_rule;
-
-/* Must be public as it is instantiated in subtable struct below. */
-struct netdev_flow_key {
-    uint32_t hash;       /* Hash function differs for different users. */
-    uint32_t len;        /* Length of the following miniflow (incl. map). */
-    struct miniflow mf;
-    uint64_t buf[FLOW_MAX_PACKET_U64S];
-};
-
-/* A rule to be inserted to the classifier. */
-struct dpcls_rule {
-    struct cmap_node cmap_node;   /* Within struct dpcls_subtable 'rules'. */
-    struct netdev_flow_key *mask; /* Subtable's mask. */
-    struct netdev_flow_key flow;  /* Matching key. */
-    /* 'flow' must be the last field, additional space is allocated here. */
-};
-
-/* Lookup function for a subtable in the dpcls. This function is called
- * by each subtable with an array of packets, and a bitmask of packets to
- * perform the lookup on. Using a function pointer gives flexibility to
- * optimize the lookup function based on subtable properties and the
- * CPU instruction set available at runtime.
+/* This header includes the various dpif-netdev components' header
+ * files in the appropriate order. Unfortunately there is a strict
+ * requirement in the include order due to dependences between components.
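The 'reload' protocol documented in the struct comment above amounts to a release/acquire handshake. Below is a minimal sketch of that handshake using plain C11 `<stdatomic.h>` rather than the OVS `ovs-atomic.h` wrappers; the `struct pmd` and its fields here are illustrative stand-ins, not the actual OVS types.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for dp_netdev_pmd_thread. */
struct pmd {
    atomic_bool reload;
    int n_rxqs;            /* Plain (non-atomic) configuration data. */
};

/* Requesting thread: publish configuration first, then raise 'reload'
 * with release ordering, so the PMD sees the config once it sees the
 * flag.  Completion is detected by polling with acquire ordering. */
static void
request_reload(struct pmd *pmd, int new_n_rxqs)
{
    pmd->n_rxqs = new_n_rxqs;   /* Any ordering is fine before the store. */
    atomic_store_explicit(&pmd->reload, true, memory_order_release);

    while (atomic_load_explicit(&pmd->reload, memory_order_acquire)) {
        sched_yield();          /* Wait until the PMD clears the flag. */
    }
}

/* PMD thread: read the flag with acquire ordering; only after that is
 * it safe to read the plain configuration fields.  Clearing the flag
 * last, with release ordering, signals completion. */
static void
pmd_poll_iteration(struct pmd *pmd)
{
    if (atomic_load_explicit(&pmd->reload, memory_order_acquire)) {
        int n = pmd->n_rxqs;    /* Safe: ordered after the acquire load. */
        (void) n;               /* ... apply the new configuration ... */
        atomic_store_explicit(&pmd->reload, false, memory_order_release);
    }
}
```

The release store pairs with the acquire load, which is exactly the "release or stricter" / "acquire or stricter" requirement the comment spells out for both sides.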
+ * E.g:
+ *  DFC/EMC/SMC requires the netdev_flow_key struct
+ *  PMD thread requires DFC_flow struct
+ * */
-typedef
-uint32_t (*dpcls_subtable_lookup_func)(struct dpcls_subtable *subtable,
-                                       uint32_t keys_map,
-                                       const struct netdev_flow_key *keys[],
-                                       struct dpcls_rule **rules);
-
-/* A set of rules that all have the same fields wildcarded. */
-struct dpcls_subtable {
-    /* The fields are only used by writers. */
-    struct cmap_node cmap_node OVS_GUARDED; /* Within dpcls 'subtables_map'. */
-
-    /* These fields are accessed by readers. */
-    struct cmap rules;           /* Contains "struct dpcls_rule"s. */
-    uint32_t hit_cnt;            /* Number of match hits in subtable in current
-                                    optimization interval. */
-
-    /* Miniflow fingerprint that the subtable matches on. The miniflow "bits"
-     * are used to select the actual dpcls lookup implementation at subtable
-     * creation time.
-     */
-    uint8_t mf_bits_set_unit0;
-    uint8_t mf_bits_set_unit1;
-
-    /* The lookup function to use for this subtable. If there is a known
-     * property of the subtable (eg: only 3 bits of miniflow metadata is
-     * used for the lookup) then this can point at an optimized version of
-     * the lookup function for this particular subtable. */
-    dpcls_subtable_lookup_func lookup_func;
-
-    /* Caches the masks to match a packet to, reducing runtime calculations. */
-    uint64_t *mf_masks;
-
-    struct netdev_flow_key mask; /* Wildcards for fields (const). */
-    /* 'mask' must be the last field, additional space is allocated here. */
-};
-
-/* Iterate through netdev_flow_key TNL u64 values specified by 'FLOWMAP'. */
-#define NETDEV_FLOW_KEY_FOR_EACH_IN_FLOWMAP(VALUE, KEY, FLOWMAP)   \
-    MINIFLOW_FOR_EACH_IN_FLOWMAP (VALUE, &(KEY)->mf, FLOWMAP)
-
-/* Generates a mask for each bit set in the subtable's miniflow.
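The `dpcls_subtable_lookup_func` comment above describes binding a lookup implementation to each subtable once, at creation time, based on the subtable's properties and the CPU's ISA. A simplified sketch of that selection pattern follows; the names and the `(5, 1)` specialization are hypothetical, not the actual dpcls API.

```c
#include <stdint.h>

struct subtable;    /* Hypothetical, simplified stand-in for dpcls_subtable. */

typedef uint32_t (*lookup_fn)(struct subtable *, uint32_t keys_map);

struct subtable {
    uint8_t bits_u0;        /* Miniflow fingerprint of the subtable's mask, */
    uint8_t bits_u1;        /* mirroring mf_bits_set_unit0/1. */
    lookup_fn lookup;       /* Chosen once, at subtable creation time. */
};

/* Generic fallback that works for any mask. */
static uint32_t
lookup_generic(struct subtable *t, uint32_t keys_map)
{
    (void) t;
    return keys_map;        /* Pretend every candidate packet matched. */
}

/* Specialized variant usable only when the mask fits (5, 1) bits; in the
 * real code this slot might hold e.g. an AVX-512 implementation. */
static uint32_t
lookup_5_1(struct subtable *t, uint32_t keys_map)
{
    (void) t;
    return keys_map;
}

/* Pick the best implementation for this subtable's properties once, so
 * the per-packet hot path is a single indirect call. */
static void
subtable_init_lookup(struct subtable *t)
{
    t->lookup = (t->bits_u0 == 5 && t->bits_u1 == 1)
                ? lookup_5_1 : lookup_generic;
}
```

The per-packet cost of the flexibility is one indirect call; the specialization decision itself is paid only once per subtable.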
*/ -void -netdev_flow_key_gen_masks(const struct netdev_flow_key *tbl, - uint64_t *mf_masks, - const uint32_t mf_bits_u0, - const uint32_t mf_bits_u1); - -/* Matches a dpcls rule against the incoming packet in 'target' */ -bool dpcls_rule_matches_key(const struct dpcls_rule *rule, - const struct netdev_flow_key *target); - -#ifdef __cplusplus -} -#endif +#include "dpif-netdev-private-flow.h" +#include "dpif-netdev-private-dpcls.h" +#include "dpif-netdev-private-dfc.h" +#include "dpif-netdev-private-thread.h" #endif /* netdev-private.h */ diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 26218ad72..b9fb84681 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -17,6 +17,7 @@ #include #include "dpif-netdev.h" #include "dpif-netdev-private.h" +#include "dpif-netdev-private-dfc.h" #include #include @@ -142,90 +143,6 @@ static struct odp_support dp_netdev_support = { .ct_orig_tuple6 = true, }; -/* EMC cache and SMC cache compose the datapath flow cache (DFC) - * - * Exact match cache for frequently used flows - * - * The cache uses a 32-bit hash of the packet (which can be the RSS hash) to - * search its entries for a miniflow that matches exactly the miniflow of the - * packet. It stores the 'dpcls_rule' (rule) that matches the miniflow. - * - * A cache entry holds a reference to its 'dp_netdev_flow'. - * - * A miniflow with a given hash can be in one of EM_FLOW_HASH_SEGS different - * entries. The 32-bit hash is split into EM_FLOW_HASH_SEGS values (each of - * them is EM_FLOW_HASH_SHIFT bits wide and the remainder is thrown away). Each - * value is the index of a cache entry where the miniflow could be. - * - * - * Signature match cache (SMC) - * - * This cache stores a 16-bit signature for each flow without storing keys, and - * stores the corresponding 16-bit flow_table index to the 'dp_netdev_flow'. - * Each flow thus occupies 32bit which is much more memory efficient than EMC. 
- * SMC uses a set-associative design that each bucket contains - * SMC_ENTRY_PER_BUCKET number of entries. - * Since 16-bit flow_table index is used, if there are more than 2^16 - * dp_netdev_flow, SMC will miss them that cannot be indexed by a 16-bit value. - * - * - * Thread-safety - * ============= - * - * Each pmd_thread has its own private exact match cache. - * If dp_netdev_input is not called from a pmd thread, a mutex is used. - */ - -#define EM_FLOW_HASH_SHIFT 13 -#define EM_FLOW_HASH_ENTRIES (1u << EM_FLOW_HASH_SHIFT) -#define EM_FLOW_HASH_MASK (EM_FLOW_HASH_ENTRIES - 1) -#define EM_FLOW_HASH_SEGS 2 - -/* SMC uses a set-associative design. A bucket contains a set of entries that - * a flow item can occupy. For now, it uses one hash function rather than two - * as for the EMC design. */ -#define SMC_ENTRY_PER_BUCKET 4 -#define SMC_ENTRIES (1u << 20) -#define SMC_BUCKET_CNT (SMC_ENTRIES / SMC_ENTRY_PER_BUCKET) -#define SMC_MASK (SMC_BUCKET_CNT - 1) - -/* Default EMC insert probability is 1 / DEFAULT_EM_FLOW_INSERT_INV_PROB */ -#define DEFAULT_EM_FLOW_INSERT_INV_PROB 100 -#define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX / \ - DEFAULT_EM_FLOW_INSERT_INV_PROB) - -struct emc_entry { - struct dp_netdev_flow *flow; - struct netdev_flow_key key; /* key.hash used for emc hash value. */ -}; - -struct emc_cache { - struct emc_entry entries[EM_FLOW_HASH_ENTRIES]; - int sweep_idx; /* For emc_cache_slow_sweep(). */ -}; - -struct smc_bucket { - uint16_t sig[SMC_ENTRY_PER_BUCKET]; - uint16_t flow_idx[SMC_ENTRY_PER_BUCKET]; -}; - -/* Signature match cache, differentiate from EMC cache */ -struct smc_cache { - struct smc_bucket buckets[SMC_BUCKET_CNT]; -}; - -struct dfc_cache { - struct emc_cache emc_cache; - struct smc_cache smc_cache; -}; - -/* Iterate in the exact match cache through every entry that might contain a - * miniflow with hash 'HASH'. 
*/ -#define EMC_FOR_EACH_POS_WITH_HASH(EMC, CURRENT_ENTRY, HASH) \ - for (uint32_t i__ = 0, srch_hash__ = (HASH); \ - (CURRENT_ENTRY) = &(EMC)->entries[srch_hash__ & EM_FLOW_HASH_MASK], \ - i__ < EM_FLOW_HASH_SEGS; \ - i__++, srch_hash__ >>= EM_FLOW_HASH_SHIFT) /* Simple non-wildcarding single-priority classifier. */ @@ -478,119 +395,10 @@ struct dp_netdev_port { char *rxq_affinity_list; /* Requested affinity of rx queues. */ }; -/* Contained by struct dp_netdev_flow's 'stats' member. */ -struct dp_netdev_flow_stats { - atomic_llong used; /* Last used time, in monotonic msecs. */ - atomic_ullong packet_count; /* Number of packets matched. */ - atomic_ullong byte_count; /* Number of bytes matched. */ - atomic_uint16_t tcp_flags; /* Bitwise-OR of seen tcp_flags values. */ -}; - -/* Contained by struct dp_netdev_flow's 'last_attrs' member. */ -struct dp_netdev_flow_attrs { - atomic_bool offloaded; /* True if flow is offloaded to HW. */ - ATOMIC(const char *) dp_layer; /* DP layer the flow is handled in. */ -}; - -/* A flow in 'dp_netdev_pmd_thread's 'flow_table'. - * - * - * Thread-safety - * ============= - * - * Except near the beginning or ending of its lifespan, rule 'rule' belongs to - * its pmd thread's classifier. The text below calls this classifier 'cls'. - * - * Motivation - * ---------- - * - * The thread safety rules described here for "struct dp_netdev_flow" are - * motivated by two goals: - * - * - Prevent threads that read members of "struct dp_netdev_flow" from - * reading bad data due to changes by some thread concurrently modifying - * those members. - * - * - Prevent two threads making changes to members of a given "struct - * dp_netdev_flow" from interfering with each other. - * - * - * Rules - * ----- - * - * A flow 'flow' may be accessed without a risk of being freed during an RCU - * grace period. Code that needs to hold onto a flow for a while - * should try incrementing 'flow->ref_cnt' with dp_netdev_flow_ref(). 
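The EMC/SMC geometry described in the comment above can be made concrete with a little index arithmetic. The sketch below reuses the same constants; the EMC slot derivation mirrors the `EMC_FOR_EACH_POS_WITH_HASH` macro, while the exact SMC signature derivation (`hash >> 16`) is an assumption based on the "16-bit signature" description rather than quoted code.

```c
#include <stdint.h>

/* Constants mirroring the EMC/SMC geometry in the text. */
#define EM_FLOW_HASH_SHIFT 13
#define EM_FLOW_HASH_ENTRIES (1u << EM_FLOW_HASH_SHIFT)
#define EM_FLOW_HASH_MASK (EM_FLOW_HASH_ENTRIES - 1)
#define EM_FLOW_HASH_SEGS 2

#define SMC_ENTRY_PER_BUCKET 4
#define SMC_ENTRIES (1u << 20)
#define SMC_BUCKET_CNT (SMC_ENTRIES / SMC_ENTRY_PER_BUCKET)
#define SMC_MASK (SMC_BUCKET_CNT - 1)

/* Candidate EMC slots for a 32-bit packet hash: each of the
 * EM_FLOW_HASH_SEGS segments of EM_FLOW_HASH_SHIFT bits names one slot;
 * the remaining high bits are thrown away. */
static void
emc_candidate_slots(uint32_t hash, uint32_t slots[EM_FLOW_HASH_SEGS])
{
    for (int i = 0; i < EM_FLOW_HASH_SEGS; i++) {
        slots[i] = hash & EM_FLOW_HASH_MASK;
        hash >>= EM_FLOW_HASH_SHIFT;
    }
}

/* SMC placement: low bits select one of SMC_BUCKET_CNT buckets... */
static uint32_t
smc_bucket_of(uint32_t hash)
{
    return hash & SMC_MASK;
}

/* ...and a 16-bit signature disambiguates the SMC_ENTRY_PER_BUCKET
 * entries within the bucket, without storing full keys. */
static uint16_t
smc_sig_of(uint32_t hash)
{
    return hash >> 16;
}
```

This is why an SMC entry is only 32 bits (signature plus flow-table index) and why flows beyond the 2^16th `dp_netdev_flow` cannot be indexed.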
- * - * 'flow->ref_cnt' protects 'flow' from being freed. It doesn't protect the - * flow from being deleted from 'cls' and it doesn't protect members of 'flow' - * from modification. - * - * Some members, marked 'const', are immutable. Accessing other members - * requires synchronization, as noted in more detail below. - */ -struct dp_netdev_flow { - const struct flow flow; /* Unmasked flow that created this entry. */ - /* Hash table index by unmasked flow. */ - const struct cmap_node node; /* In owning dp_netdev_pmd_thread's */ - /* 'flow_table'. */ - const struct cmap_node mark_node; /* In owning flow_mark's mark_to_flow */ - const ovs_u128 ufid; /* Unique flow identifier. */ - const ovs_u128 mega_ufid; /* Unique mega flow identifier. */ - const unsigned pmd_id; /* The 'core_id' of pmd thread owning this */ - /* flow. */ - - /* Number of references. - * The classifier owns one reference. - * Any thread trying to keep a rule from being freed should hold its own - * reference. */ - struct ovs_refcount ref_cnt; - - bool dead; - uint32_t mark; /* Unique flow mark assigned to a flow */ - - /* Statistics. */ - struct dp_netdev_flow_stats stats; - - /* Statistics and attributes received from the netdev offload provider. */ - atomic_int netdev_flow_get_result; - struct dp_netdev_flow_stats last_stats; - struct dp_netdev_flow_attrs last_attrs; - - /* Actions. */ - OVSRCU_TYPE(struct dp_netdev_actions *) actions; - - /* While processing a group of input packets, the datapath uses the next - * member to store a pointer to the output batch for the flow. It is - * reset after the batch has been sent out (See dp_netdev_queue_batches(), - * packet_batch_per_flow_init() and packet_batch_per_flow_execute()). */ - struct packet_batch_per_flow *batch; - - /* Packet classification. */ - char *dp_extra_info; /* String to return in a flow dump/get. */ - struct dpcls_rule cr; /* In owning dp_netdev's 'cls'. */ - /* 'cr' must be the last member. 
*/ -}; - -static void dp_netdev_flow_unref(struct dp_netdev_flow *); static bool dp_netdev_flow_ref(struct dp_netdev_flow *); static int dpif_netdev_flow_from_nlattrs(const struct nlattr *, uint32_t, struct flow *, bool); -/* A set of datapath actions within a "struct dp_netdev_flow". - * - * - * Thread-safety - * ============= - * - * A struct dp_netdev_actions 'actions' is protected with RCU. */ -struct dp_netdev_actions { - /* These members are immutable: they do not change during the struct's - * lifetime. */ - unsigned int size; /* Size of 'actions', in bytes. */ - struct nlattr actions[]; /* Sequence of OVS_ACTION_ATTR_* attributes. */ -}; - struct dp_netdev_actions *dp_netdev_actions_create(const struct nlattr *, size_t); struct dp_netdev_actions *dp_netdev_flow_get_actions( @@ -637,171 +445,6 @@ struct tx_bond { struct member_entry member_buckets[BOND_BUCKETS]; }; -/* A set of properties for the current processing loop that is not directly - * associated with the pmd thread itself, but with the packets being - * processed or the short-term system configuration (for example, time). - * Contained by struct dp_netdev_pmd_thread's 'ctx' member. */ -struct dp_netdev_pmd_thread_ctx { - /* Latest measured time. See 'pmd_thread_ctx_time_update()'. */ - long long now; - /* RX queue from which last packet was received. */ - struct dp_netdev_rxq *last_rxq; - /* EMC insertion probability context for the current processing cycle. */ - uint32_t emc_insert_min; -}; - -/* PMD: Poll modes drivers. PMD accesses devices via polling to eliminate - * the performance overhead of interrupt processing. Therefore netdev can - * not implement rx-wait for these devices. dpif-netdev needs to poll - * these device to check for recv buffer. pmd-thread does polling for - * devices assigned to itself. - * - * DPDK used PMD for accessing NIC. - * - * Note, instance with cpu core id NON_PMD_CORE_ID will be reserved for - * I/O of all non-pmd threads. 
There will be no actual thread created - * for the instance. - * - * Each struct has its own flow cache and classifier per managed ingress port. - * For packets received on ingress port, a look up is done on corresponding PMD - * thread's flow cache and in case of a miss, lookup is performed in the - * corresponding classifier of port. Packets are executed with the found - * actions in either case. - * */ -struct dp_netdev_pmd_thread { - struct dp_netdev *dp; - struct ovs_refcount ref_cnt; /* Every reference must be refcount'ed. */ - struct cmap_node node; /* In 'dp->poll_threads'. */ - - /* Per thread exact-match cache. Note, the instance for cpu core - * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly - * need to be protected by 'non_pmd_mutex'. Every other instance - * will only be accessed by its own pmd thread. */ - OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct dfc_cache flow_cache; - - /* Flow-Table and classifiers - * - * Writers of 'flow_table' must take the 'flow_mutex'. Corresponding - * changes to 'classifiers' must be made while still holding the - * 'flow_mutex'. - */ - struct ovs_mutex flow_mutex; - struct cmap flow_table OVS_GUARDED; /* Flow table. */ - - /* One classifier per in_port polled by the pmd */ - struct cmap classifiers; - /* Periodically sort subtable vectors according to hit frequencies */ - long long int next_optimization; - /* End of the next time interval for which processing cycles - are stored for each polled rxq. */ - long long int rxq_next_cycle_store; - - /* Last interval timestamp. */ - uint64_t intrvl_tsc_prev; - /* Last interval cycles. */ - atomic_ullong intrvl_cycles; - - /* Current context of the PMD thread. */ - struct dp_netdev_pmd_thread_ctx ctx; - - struct seq *reload_seq; - uint64_t last_reload_seq; - - /* These are atomic variables used as a synchronization and configuration - * points for thread reload/exit. 
- * - * 'reload' atomic is the main one and it's used as a memory - * synchronization point for all other knobs and data. - * - * For a thread that requests PMD reload: - * - * * All changes that should be visible to the PMD thread must be made - * before setting the 'reload'. These changes could use any memory - * ordering model including 'relaxed'. - * * Setting the 'reload' atomic should occur in the same thread where - * all other PMD configuration options updated. - * * Setting the 'reload' atomic should be done with 'release' memory - * ordering model or stricter. This will guarantee that all previous - * changes (including non-atomic and 'relaxed') will be visible to - * the PMD thread. - * * To check that reload is done, thread should poll the 'reload' atomic - * to become 'false'. Polling should be done with 'acquire' memory - * ordering model or stricter. This ensures that PMD thread completed - * the reload process. - * - * For the PMD thread: - * - * * PMD thread should read 'reload' atomic with 'acquire' memory - * ordering model or stricter. This will guarantee that all changes - * made before setting the 'reload' in the requesting thread will be - * visible to the PMD thread. - * * All other configuration data could be read with any memory - * ordering model (including non-atomic and 'relaxed') but *only after* - * reading the 'reload' atomic set to 'true'. - * * When the PMD reload done, PMD should (optionally) set all the below - * knobs except the 'reload' to their default ('false') values and - * (mandatory), as the last step, set the 'reload' to 'false' using - * 'release' memory ordering model or stricter. This will inform the - * requesting thread that PMD has completed a reload cycle. - */ - atomic_bool reload; /* Do we need to reload ports? */ - atomic_bool wait_for_reload; /* Can we busy wait for the next reload? */ - atomic_bool reload_tx_qid; /* Do we need to reload static_tx_qid? */ - atomic_bool exit; /* For terminating the pmd thread. 
*/ - - pthread_t thread; - unsigned core_id; /* CPU core id of this pmd thread. */ - int numa_id; /* numa node id of this pmd thread. */ - bool isolated; - - /* Queue id used by this pmd thread to send packets on all netdevs if - * XPS disabled for this netdev. All static_tx_qid's are unique and less - * than 'cmap_count(dp->poll_threads)'. */ - uint32_t static_tx_qid; - - /* Number of filled output batches. */ - int n_output_batches; - - struct ovs_mutex port_mutex; /* Mutex for 'poll_list' and 'tx_ports'. */ - /* List of rx queues to poll. */ - struct hmap poll_list OVS_GUARDED; - /* Map of 'tx_port's used for transmission. Written by the main thread, - * read by the pmd thread. */ - struct hmap tx_ports OVS_GUARDED; - - struct ovs_mutex bond_mutex; /* Protects updates of 'tx_bonds'. */ - /* Map of 'tx_bond's used for transmission. Written by the main thread - * and read by the pmd thread. */ - struct cmap tx_bonds; - - /* These are thread-local copies of 'tx_ports'. One contains only tunnel - * ports (that support push_tunnel/pop_tunnel), the other contains ports - * with at least one txq (that support send). A port can be in both. - * - * There are two separate maps to make sure that we don't try to execute - * OUTPUT on a device which has 0 txqs or PUSH/POP on a non-tunnel device. - * - * The instances for cpu core NON_PMD_CORE_ID can be accessed by multiple - * threads, and thusly need to be protected by 'non_pmd_mutex'. Every - * other instance will only be accessed by its own pmd thread. */ - struct hmap tnl_port_cache; - struct hmap send_port_cache; - - /* Keep track of detailed PMD performance statistics. */ - struct pmd_perf_stats perf_stats; - - /* Stats from previous iteration used by automatic pmd - * load balance logic. */ - uint64_t prev_stats[PMD_N_STATS]; - atomic_count pmd_overloaded; - - /* Set to true if the pmd thread needs to be reloaded. */ - bool need_reload; - - /* Next time when PMD should try RCU quiescing. 
*/ - long long next_rcu_quiesce; -}; - /* Interface to netdev-based datapath. */ struct dpif_netdev { struct dpif dpif; @@ -906,90 +549,12 @@ static inline struct dpcls * dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd, odp_port_t in_port); -static inline bool emc_entry_alive(struct emc_entry *ce); -static void emc_clear_entry(struct emc_entry *ce); -static void smc_clear_entry(struct smc_bucket *b, int idx); - static void dp_netdev_request_reconfigure(struct dp_netdev *dp); static inline bool pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd); static void queue_netdev_flow_del(struct dp_netdev_pmd_thread *pmd, struct dp_netdev_flow *flow); -static void -emc_cache_init(struct emc_cache *flow_cache) -{ - int i; - - flow_cache->sweep_idx = 0; - for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) { - flow_cache->entries[i].flow = NULL; - flow_cache->entries[i].key.hash = 0; - flow_cache->entries[i].key.len = sizeof(struct miniflow); - flowmap_init(&flow_cache->entries[i].key.mf.map); - } -} - -static void -smc_cache_init(struct smc_cache *smc_cache) -{ - int i, j; - for (i = 0; i < SMC_BUCKET_CNT; i++) { - for (j = 0; j < SMC_ENTRY_PER_BUCKET; j++) { - smc_cache->buckets[i].flow_idx[j] = UINT16_MAX; - } - } -} - -static void -dfc_cache_init(struct dfc_cache *flow_cache) -{ - emc_cache_init(&flow_cache->emc_cache); - smc_cache_init(&flow_cache->smc_cache); -} - -static void -emc_cache_uninit(struct emc_cache *flow_cache) -{ - int i; - - for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) { - emc_clear_entry(&flow_cache->entries[i]); - } -} - -static void -smc_cache_uninit(struct smc_cache *smc) -{ - int i, j; - - for (i = 0; i < SMC_BUCKET_CNT; i++) { - for (j = 0; j < SMC_ENTRY_PER_BUCKET; j++) { - smc_clear_entry(&(smc->buckets[i]), j); - } - } -} - -static void -dfc_cache_uninit(struct dfc_cache *flow_cache) -{ - smc_cache_uninit(&flow_cache->smc_cache); - emc_cache_uninit(&flow_cache->emc_cache); -} - -/* Check and clear dead flow 
references slowly (one entry at each - * invocation). */ -static void -emc_cache_slow_sweep(struct emc_cache *flow_cache) -{ - struct emc_entry *entry = &flow_cache->entries[flow_cache->sweep_idx]; - - if (!emc_entry_alive(entry)) { - emc_clear_entry(entry); - } - flow_cache->sweep_idx = (flow_cache->sweep_idx + 1) & EM_FLOW_HASH_MASK; -} - /* Updates the time in PMD threads context and should be called in three cases: * * 1. PMD structure initialization: @@ -2363,19 +1928,13 @@ dp_netdev_flow_free(struct dp_netdev_flow *flow) free(flow); } -static void dp_netdev_flow_unref(struct dp_netdev_flow *flow) +void dp_netdev_flow_unref(struct dp_netdev_flow *flow) { if (ovs_refcount_unref_relaxed(&flow->ref_cnt) == 1) { ovsrcu_postpone(dp_netdev_flow_free, flow); } } -static uint32_t -dp_netdev_flow_hash(const ovs_u128 *ufid) -{ - return ufid->u32[0]; -} - static inline struct dpcls * dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd, odp_port_t in_port) @@ -2995,14 +2554,6 @@ static bool dp_netdev_flow_ref(struct dp_netdev_flow *flow) * single memcmp(). * - These functions can be inlined by the compiler. */ -/* Given the number of bits set in miniflow's maps, returns the size of the - * 'netdev_flow_key.mf' */ -static inline size_t -netdev_flow_key_size(size_t flow_u64s) -{ - return sizeof(struct miniflow) + MINIFLOW_VALUES_SIZE(flow_u64s); -} - static inline bool netdev_flow_key_equal(const struct netdev_flow_key *a, const struct netdev_flow_key *b) @@ -3011,16 +2562,6 @@ netdev_flow_key_equal(const struct netdev_flow_key *a, return a->hash == b->hash && !memcmp(&a->mf, &b->mf, a->len); } -/* Used to compare 'netdev_flow_key' in the exact match cache to a miniflow. - * The maps are compared bitwise, so both 'key->mf' and 'mf' must have been - * generated by miniflow_extract. 
*/ -static inline bool -netdev_flow_key_equal_mf(const struct netdev_flow_key *key, - const struct miniflow *mf) -{ - return !memcmp(&key->mf, mf, key->len); -} - static inline void netdev_flow_key_clone(struct netdev_flow_key *dst, const struct netdev_flow_key *src) @@ -3087,21 +2628,6 @@ netdev_flow_key_init_masked(struct netdev_flow_key *dst, (dst_u64 - miniflow_get_values(&dst->mf)) * 8); } -static inline bool -emc_entry_alive(struct emc_entry *ce) -{ - return ce->flow && !ce->flow->dead; -} - -static void -emc_clear_entry(struct emc_entry *ce) -{ - if (ce->flow) { - dp_netdev_flow_unref(ce->flow); - ce->flow = NULL; - } -} - static inline void emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow, const struct netdev_flow_key *key) @@ -3167,24 +2693,6 @@ emc_probabilistic_insert(struct dp_netdev_pmd_thread *pmd, } } -static inline struct dp_netdev_flow * -emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key) -{ - struct emc_entry *current_entry; - - EMC_FOR_EACH_POS_WITH_HASH(cache, current_entry, key->hash) { - if (current_entry->key.hash == key->hash - && emc_entry_alive(current_entry) - && netdev_flow_key_equal_mf(¤t_entry->key, &key->mf)) { - - /* We found the entry with the 'key->mf' miniflow */ - return current_entry->flow; - } - } - - return NULL; -} - static inline const struct cmap_node * smc_entry_get(struct dp_netdev_pmd_thread *pmd, const uint32_t hash) { @@ -3205,12 +2713,6 @@ smc_entry_get(struct dp_netdev_pmd_thread *pmd, const uint32_t hash) return NULL; } -static void -smc_clear_entry(struct smc_bucket *b, int idx) -{ - b->flow_idx[idx] = UINT16_MAX; -} - /* Insert the flow_table index into SMC. Insertion may fail when 1) SMC is * turned off, 2) the flow_table index is larger than uint16_t can handle. 
* If there is already an SMC entry having same signature, the index will be @@ -6898,22 +6400,6 @@ dp_netdev_upcall(struct dp_netdev_pmd_thread *pmd, struct dp_packet *packet_, actions, wc, put_actions, dp->upcall_aux); } -static inline uint32_t -dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet, - const struct miniflow *mf) -{ - uint32_t hash; - - if (OVS_LIKELY(dp_packet_rss_valid(packet))) { - hash = dp_packet_get_rss_hash(packet); - } else { - hash = miniflow_hash_5tuple(mf, 0); - dp_packet_set_rss_hash(packet, hash); - } - - return hash; -} - static inline uint32_t dpif_netdev_packet_get_rss_hash(struct dp_packet *packet, const struct miniflow *mf) @@ -8761,7 +8247,7 @@ dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask) subtable->mf_bits_set_unit0 = unit0; subtable->mf_bits_set_unit1 = unit1; subtable->mf_masks = xmalloc(sizeof(uint64_t) * (unit0 + unit1)); - netdev_flow_key_gen_masks(mask, subtable->mf_masks, unit0, unit1); + dpcls_flow_key_gen_masks(mask, subtable->mf_masks, unit0, unit1); /* Get the preferred subtable search function for this (u0,u1) subtable. * The function is guaranteed to always return a valid implementation, and @@ -8936,11 +8422,10 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule) } } -/* Inner loop for mask generation of a unit, see netdev_flow_key_gen_masks. */ +/* Inner loop for mask generation of a unit, see dpcls_flow_key_gen_masks. 
*/ static inline void -netdev_flow_key_gen_mask_unit(uint64_t iter, - const uint64_t count, - uint64_t *mf_masks) +dpcls_flow_key_gen_mask_unit(uint64_t iter, const uint64_t count, + uint64_t *mf_masks) { int i; for (i = 0; i < count; i++) { @@ -8961,16 +8446,16 @@ netdev_flow_key_gen_mask_unit(uint64_t iter, * @param mf_bits_unit0 Number of bits set in unit0 of the miniflow */ void -netdev_flow_key_gen_masks(const struct netdev_flow_key *tbl, - uint64_t *mf_masks, - const uint32_t mf_bits_u0, - const uint32_t mf_bits_u1) +dpcls_flow_key_gen_masks(const struct netdev_flow_key *tbl, + uint64_t *mf_masks, + const uint32_t mf_bits_u0, + const uint32_t mf_bits_u1) { uint64_t iter_u0 = tbl->mf.map.bits[0]; uint64_t iter_u1 = tbl->mf.map.bits[1]; - netdev_flow_key_gen_mask_unit(iter_u0, mf_bits_u0, &mf_masks[0]); - netdev_flow_key_gen_mask_unit(iter_u1, mf_bits_u1, &mf_masks[mf_bits_u0]); + dpcls_flow_key_gen_mask_unit(iter_u0, mf_bits_u0, &mf_masks[0]); + dpcls_flow_key_gen_mask_unit(iter_u1, mf_bits_u1, &mf_masks[mf_bits_u0]); } /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit From patchwork Thu Jul 8 14:02:32 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Ferriter, Cian" X-Patchwork-Id: 1502304 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=openvswitch.org (client-ip=2605:bc80:3010::136; helo=smtp3.osuosl.org; envelope-from=ovs-dev-bounces@openvswitch.org; receiver=) Received: from smtp3.osuosl.org (smtp3.osuosl.org [IPv6:2605:bc80:3010::136]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4GLJ0K2FFlz9sWq for ; Fri, 9 Jul 2021 00:02:57 +1000 (AEST) Received: 
Received: from silpixa00399779.ir.intel.com (HELO
silpixa00399779.ger.corp.intel.com) ([10.237.222.105]) by fmsmga002.fm.intel.com with ESMTP; 08 Jul 2021 07:02:47 -0700 From: Cian Ferriter To: ovs-dev@openvswitch.org Date: Thu, 8 Jul 2021 15:02:32 +0100 Message-Id: <20210708140240.61172-3-cian.ferriter@intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: <20210708140240.61172-1-cian.ferriter@intel.com> References: <20210708140240.61172-1-cian.ferriter@intel.com> MIME-Version: 1.0 Cc: i.maximets@ovn.org, fbl@sysclose.org Subject: [ovs-dev] [v15 02/10] dpif-netdev: Add function pointer for netdev input. X-BeenThere: ovs-dev@openvswitch.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: ovs-dev-bounces@openvswitch.org Sender: "dev" From: Harry van Haaren This commit adds a function pointer to the pmd thread data structure, giving the pmd thread flexibility in its dpif-input function choice. This allows choosing of the implementation based on ISA capabilities of the runtime CPU, leading to optimizations and higher performance. Signed-off-by: Harry van Haaren Co-authored-by: Cian Ferriter Signed-off-by: Cian Ferriter Acked-by: Flavio Leitner --- v15: - Added Flavio's Acked-by tag. v14: - Add ATOMIC macro to netdev_input_func function pointer in struct dp_netdev_pmd_thread. v13: - Minor code refactor to address review comments. --- lib/dpif-netdev-private-thread.h | 10 ++++++++++ lib/dpif-netdev.c | 7 ++++++- 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/lib/dpif-netdev-private-thread.h b/lib/dpif-netdev-private-thread.h index 91f3753d1..d38a7a2c3 100644 --- a/lib/dpif-netdev-private-thread.h +++ b/lib/dpif-netdev-private-thread.h @@ -47,6 +47,13 @@ struct dp_netdev_pmd_thread_ctx { uint32_t emc_insert_min; }; +/* Forward declaration for typedef. 
*/ +struct dp_netdev_pmd_thread; + +typedef void (*dp_netdev_input_func)(struct dp_netdev_pmd_thread *pmd, + struct dp_packet_batch *packets, + odp_port_t port_no); + /* PMD: Poll modes drivers. PMD accesses devices via polling to eliminate * the performance overhead of interrupt processing. Therefore netdev can * not implement rx-wait for these devices. dpif-netdev needs to poll @@ -101,6 +108,9 @@ struct dp_netdev_pmd_thread { /* Current context of the PMD thread. */ struct dp_netdev_pmd_thread_ctx ctx; + /* Function pointer to call for dp_netdev_input() functionality. */ + ATOMIC(dp_netdev_input_func) netdev_input_func; + struct seq *reload_seq; uint64_t last_reload_seq; diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index b9fb84681..b89fcd276 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -4286,8 +4286,9 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd, } } } + /* Process packet batch. */ - dp_netdev_input(pmd, &batch, port_no); + pmd->netdev_input_func(pmd, &batch, port_no); /* Assign processing cycles to rx queue. */ cycles = cycle_timer_stop(&pmd->perf_stats, &timer); @@ -6088,6 +6089,10 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp, hmap_init(&pmd->tnl_port_cache); hmap_init(&pmd->send_port_cache); cmap_init(&pmd->tx_bonds); + + /* Initialize the DPIF function pointer to the default scalar version. */ + pmd->netdev_input_func = dp_netdev_input; + /* init the 'flow_cache' since there is no * actual thread created for NON_PMD_CORE_ID. 
*/ if (core_id == NON_PMD_CORE_ID) {

From patchwork Thu Jul 8 14:02:33 2021
X-Patchwork-Submitter: "Ferriter, Cian"
X-Patchwork-Id: 1502307
From: Cian Ferriter
To: ovs-dev@openvswitch.org
Date: Thu, 8 Jul 2021 15:02:33 +0100
Message-Id: <20210708140240.61172-4-cian.ferriter@intel.com>
In-Reply-To: <20210708140240.61172-1-cian.ferriter@intel.com>
References: <20210708140240.61172-1-cian.ferriter@intel.com>
Cc: i.maximets@ovn.org, fbl@sysclose.org, Kumar Amber
Subject: [ovs-dev] [v15 03/10] dpif-avx512: Add ISA implementation of dpif.

From: Harry van Haaren

This commit adds the AVX512 implementation of the DPIF functionality, specifically the dp_netdev_input_outer_avx512 function. This function handles only the outer processing path (no recirculations), and is optimized to use the AVX512 ISA for packet batching and other DPIF work.
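The dispatch mechanism this implementation plugs into — a per-PMD function pointer selected at runtime, with the scalar path as fallback when the optimized function returns non-zero — can be sketched in plain C. This is a hypothetical, self-contained illustration: the names `pkt_batch`, `pmd_thread`, `input_scalar`, `input_avx512`, and `pmd_configure` are invented for the example and are not the OVS API.

```c
#include <stdint.h>

/* Hypothetical sketch of per-PMD function-pointer dispatch. */

struct pkt_batch { int count; };

typedef int32_t (*input_func)(struct pkt_batch *batch);

/* Scalar implementation: always able to handle the batch. */
static int32_t
input_scalar(struct pkt_batch *batch)
{
    (void) batch;
    return 0;
}

/* Optimized implementation: may return non-zero to decline a batch,
 * in which case the caller falls back to the scalar path. */
static int32_t
input_avx512(struct pkt_batch *batch)
{
    return batch->count > 32 ? -1 : 0;
}

struct pmd_thread {
    input_func netdev_input_func;   /* Chosen once, called per batch. */
};

static void
pmd_configure(struct pmd_thread *pmd, int cpu_has_avx512)
{
    /* Default to scalar; upgrade if the runtime CPU supports the ISA. */
    pmd->netdev_input_func = cpu_has_avx512 ? input_avx512 : input_scalar;
}

static int32_t
process_batch(struct pmd_thread *pmd, struct pkt_batch *batch)
{
    int32_t ret = pmd->netdev_input_func(batch);
    if (ret) {
        ret = input_scalar(batch);  /* Optimized path declined: fall back. */
    }
    return ret;
}
```

The indirect call costs one pointer load per batch (not per packet), which is why the selection happens at configure time rather than in the per-packet loop.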
Sparse is not able to handle the AVX512 intrinsics, causing compile-time failures, so Sparse checking is disabled for this file.

Signed-off-by: Harry van Haaren
Co-authored-by: Cian Ferriter
Signed-off-by: Cian Ferriter
Co-authored-by: Kumar Amber
Signed-off-by: Kumar Amber
Acked-by: Flavio Leitner

---
v15:
- Added Flavio's Acked-by tag.
- Fix minor spelling mistakes and formatting.
- Fix an issue with prefetching packets ahead in the AVX512 DPIF with a batch size of 1.

v14:
- Fix the order of includes to match what is laid out in coding-style.rst.
- Update the PHWOL implementation to match what's used in the scalar DPIF. The scalar DPIF PHWOL implementation changed since v13.
- Use raw_ctz() to wrap __builtin_ctz(). This should fix Windows build errors.
- Remove an unnecessary if (!f) check.
- Introduce a hwol_emc_smc_missmask variable to save the lookup state before the DPCLS lookup. This fixes an issue where the DPCLS lookup would modify hwol_emc_smc_hitmask before the EMC and SMC inserts could use it.
- Move the dpcls_lookup prototype from lib/dpif-netdev-private-thread.h to lib/dpif-netdev-private-dpcls.h.
- Fix a comment.
- Move the addition of *netdev_input_func_userdata to struct dp_netdev_pmd_thread to this patch.
- Remove the dp_netdev_input_outer_avx512() prototype from lib/dpif-netdev-private-thread.h since it already has a prototype in lib/dpif-netdev-private-dpif.h.
- Prefetch 2 packets ahead when processing in the AVX512 DPIF. This was found to perform best when testing.
- Other minor rework from Flavio's review.

v13:
- Squash the "Add HWOL support" commit into this commit.
- Add a NEWS item about this feature here rather than in a later commit.
- Add #define NUM_U64_IN_ZMM_REG 8.
- Add a comment describing the operation of the while loop handling HWOL->EMC->SMC lookups in dp_netdev_input_outer_avx512().
- Add EMC and SMC batch insert functions for better handling of EMC and SMC in the AVX512 DPIF.
- Minor code refactor to address review comments.
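The changelog above refers to a while loop that drives the HWOL->EMC->SMC lookups from a bitmask of outstanding packets. That iteration pattern can be shown as a minimal, standalone C sketch; it uses portable stand-ins for OVS's raw_ctz() and the BMI `_blsr_u64()` intrinsic, and `visit_set_bits()` is an invented name for illustration only:

```c
#include <stdint.h>

/* Sketch of the bitmask-driven packet loop: each set bit in the mask is one
 * packet that still needs a lookup. ctz32() finds the index of the lowest
 * set bit (as raw_ctz() does) and blsr32() clears that bit (as _blsr_u64()
 * does), so the loop runs exactly once per outstanding packet. */

static uint32_t
ctz32(uint32_t x)
{
    return (uint32_t) __builtin_ctz(x);  /* Index of the lowest set bit. */
}

static uint32_t
blsr32(uint32_t x)
{
    return x & (x - 1);                  /* Clear the lowest set bit. */
}

/* Record the order in which set bits of 'mask' are visited; return count.
 * In the real loop, the body would do the HWOL->MFEX->EMC->SMC work for
 * packet 'i' and set a hitmask bit on a cache hit instead. */
static int
visit_set_bits(uint32_t mask, uint32_t *out)
{
    int n = 0;
    uint32_t iter = mask;

    while (iter) {
        uint32_t i = ctz32(iter);
        iter = blsr32(iter);
        out[n++] = i;
    }
    return n;
}
```

With mask 0b1100 this visits packet indexes 2 then 3, matching the `iter = 1100` walkthrough in the patch's own code comment.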
--- NEWS | 2 + lib/automake.mk | 5 +- lib/dpif-netdev-avx512.c | 339 +++++++++++++++++++++++++++++++ lib/dpif-netdev-private-dfc.h | 25 +++ lib/dpif-netdev-private-dpcls.h | 7 + lib/dpif-netdev-private-dpif.h | 32 +++ lib/dpif-netdev-private-thread.h | 15 +- lib/dpif-netdev-private.h | 21 +- lib/dpif-netdev.c | 105 ++++++++-- 9 files changed, 533 insertions(+), 18 deletions(-) create mode 100644 lib/dpif-netdev-avx512.c create mode 100644 lib/dpif-netdev-private-dpif.h diff --git a/NEWS b/NEWS index d7b278cab..349718178 100644 --- a/NEWS +++ b/NEWS @@ -18,6 +18,8 @@ Post-v2.15.0 * Userspace datapath now supports up to 2^18 meters. * Added support for systems with non-contiguous NUMA nodes and core ids. * Refactor lib/dpif-netdev.c to multiple header files. + * Add avx512 implementation of dpif which can process non recirculated + packets. It supports partial HWOL, EMC, SMC and DPCLS lookups. - ovs-ctl: * New option '--no-record-hostname' to disable hostname configuration in ovsdb on startup. diff --git a/lib/automake.mk b/lib/automake.mk index 8690bfb7a..432b98e62 100644 --- a/lib/automake.mk +++ b/lib/automake.mk @@ -33,11 +33,13 @@ lib_libopenvswitchavx512_la_CFLAGS = \ -mavx512f \ -mavx512bw \ -mavx512dq \ + -mbmi \ -mbmi2 \ -fPIC \ $(AM_CFLAGS) lib_libopenvswitchavx512_la_SOURCES = \ - lib/dpif-netdev-lookup-avx512-gather.c + lib/dpif-netdev-lookup-avx512-gather.c \ + lib/dpif-netdev-avx512.c lib_libopenvswitchavx512_la_LDFLAGS = \ -static endif @@ -114,6 +116,7 @@ lib_libopenvswitch_la_SOURCES = \ lib/dpif-netdev-private-dfc.c \ lib/dpif-netdev-private-dfc.h \ lib/dpif-netdev-private-dpcls.h \ + lib/dpif-netdev-private-dpif.h \ lib/dpif-netdev-private-flow.h \ lib/dpif-netdev-private-thread.h \ lib/dpif-netdev-private.h \ diff --git a/lib/dpif-netdev-avx512.c b/lib/dpif-netdev-avx512.c new file mode 100644 index 000000000..f59c1bbe0 --- /dev/null +++ b/lib/dpif-netdev-avx512.c @@ -0,0 +1,339 @@ +/* + * Copyright (c) 2021 Intel Corporation. 
+ * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#ifdef __x86_64__ +/* Sparse cannot handle the AVX512 instructions. */ +#if !defined(__CHECKER__) + +#include <config.h> + +#include "dpif-netdev.h" +#include "dpif-netdev-perf.h" +#include "dpif-netdev-private.h" + +#include <immintrin.h> + +#include "dp-packet.h" +#include "netdev.h" +#include "netdev-offload.h" + +/* Each AVX512 register (zmm register in assembly notation) can contain up to + * 512 bits, which is equivalent to 8 uint64_t variables. This is the maximum + * number of miniflow blocks that can be processed in a single pass of the + * AVX512 code at a time. + */ +#define NUM_U64_IN_ZMM_REG (8) + +/* Structure to contain per-packet metadata that must be attributed to the + * dp netdev flow. This is unfortunate to have to track per packet, however + * it's a bit awkward to maintain them in a performant way. This structure + * helps to keep two variables on a single cache line per packet. + */ +struct pkt_flow_meta { + uint16_t bytes; + uint16_t tcp_flags; +}; + +/* Structure of heap allocated memory for DPIF internals.
*/ +struct dpif_userdata { + OVS_ALIGNED_VAR(CACHE_LINE_SIZE) + struct netdev_flow_key keys[NETDEV_MAX_BURST]; + OVS_ALIGNED_VAR(CACHE_LINE_SIZE) + struct netdev_flow_key *key_ptrs[NETDEV_MAX_BURST]; + OVS_ALIGNED_VAR(CACHE_LINE_SIZE) + struct pkt_flow_meta pkt_meta[NETDEV_MAX_BURST]; +}; + +int32_t +dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd, + struct dp_packet_batch *packets, + odp_port_t in_port) +{ + /* Allocate DPIF userdata. */ + if (OVS_UNLIKELY(!pmd->netdev_input_func_userdata)) { + pmd->netdev_input_func_userdata = + xmalloc_pagealign(sizeof(struct dpif_userdata)); + } + + struct dpif_userdata *ud = pmd->netdev_input_func_userdata; + struct netdev_flow_key *keys = ud->keys; + struct netdev_flow_key **key_ptrs = ud->key_ptrs; + struct pkt_flow_meta *pkt_meta = ud->pkt_meta; + + /* The AVX512 DPIF implementation handles rules in a way that is optimized + * for reducing data-movement between HWOL/EMC/SMC and DPCLS. This is + * achieved by separating the rule arrays. Bitmasks are kept for each + * packet, indicating if it matched in the HWOL/EMC/SMC array or DPCLS + * array. Later the two arrays are merged by AVX-512 expand instructions. + */ + + /* Stores the computed output: a rule pointer for each packet. */ + /* Used initially for HWOL/EMC/SMC. */ + struct dpcls_rule *rules[NETDEV_MAX_BURST]; + /* Used for DPCLS. */ + struct dpcls_rule *dpcls_rules[NETDEV_MAX_BURST]; + + uint32_t dpcls_key_idx = 0; + + for (uint32_t i = 0; i < NETDEV_MAX_BURST; i += NUM_U64_IN_ZMM_REG) { + _mm512_storeu_si512(&rules[i], _mm512_setzero_si512()); + _mm512_storeu_si512(&dpcls_rules[i], _mm512_setzero_si512()); + } + + const size_t batch_size = dp_packet_batch_size(packets); + + /* Prefetch 2 packets ahead when processing. This was found to perform best + * through testing. 
*/ + const uint32_t prefetch_ahead = 2; + const uint32_t initial_prefetch = MIN(prefetch_ahead, batch_size); + for (int i = 0; i < initial_prefetch; i++) { + struct dp_packet *packet = packets->packets[i]; + OVS_PREFETCH(dp_packet_data(packet)); + pkt_metadata_prefetch_init(&packet->md); + } + + /* Check if EMC or SMC are enabled. */ + struct dfc_cache *cache = &pmd->flow_cache; + const uint32_t hwol_enabled = netdev_is_flow_api_enabled(); + const uint32_t emc_enabled = pmd->ctx.emc_insert_min != 0; + const uint32_t smc_enabled = pmd->ctx.smc_enable_db; + + uint32_t emc_hits = 0; + uint32_t smc_hits = 0; + + /* A 1 bit in this mask indicates a hit, so no DPCLS lookup on the pkt. */ + uint32_t hwol_emc_smc_hitmask = 0; + uint32_t smc_hitmask = 0; + + /* The below while loop is based on the 'iter' variable which has a number + * of bits set representing packets that we want to process + * (HWOL->MFEX->EMC->SMC). As each packet is processed, we clear (set to 0) + * the bit representing that packet using '_blsr_u64()'. The + * 'raw_ctz()' will give us the correct index into the 'packets', + * 'pkt_meta', 'keys' and 'rules' arrays. + * + * For one iteration of the while loop, here's some pseudocode as an + * example where 'iter' is represented in binary: + * + * while (iter) { // iter = 1100 + * uint32_t i = raw_ctz(iter); // i = 2 + * iter = _blsr_u64(iter); // iter = 1000 + * // do all processing (HWOL->MFEX->EMC->SMC) + * } + */ + uint32_t lookup_pkts_bitmask = (1ULL << batch_size) - 1; + uint32_t iter = lookup_pkts_bitmask; + while (iter) { + uint32_t i = raw_ctz(iter); + iter = _blsr_u64(iter); + + if (i + prefetch_ahead < batch_size) { + struct dp_packet **dp_packets = packets->packets; + /* Prefetch next packet data and metadata. */ + OVS_PREFETCH(dp_packet_data(dp_packets[i + prefetch_ahead])); + pkt_metadata_prefetch_init(&dp_packets[i + prefetch_ahead]->md); + } + + /* Get packet pointer from bitmask and packet md. 
*/ + struct dp_packet *packet = packets->packets[i]; + pkt_metadata_init(&packet->md, in_port); + + struct dp_netdev_flow *f = NULL; + + /* Check for a partial hardware offload match. */ + if (hwol_enabled) { + if (OVS_UNLIKELY(dp_netdev_hw_flow(pmd, in_port, packet, &f))) { + /* Packet restoration failed and it was dropped, do not + * continue processing. */ + continue; + } + if (f) { + rules[i] = &f->cr; + pkt_meta[i].tcp_flags = parse_tcp_flags(packet); + pkt_meta[i].bytes = dp_packet_size(packet); + hwol_emc_smc_hitmask |= (1 << i); + continue; + } + } + + /* Do miniflow extract into keys. */ + struct netdev_flow_key *key = &keys[i]; + miniflow_extract(packet, &key->mf); + + /* Cache TCP and byte values for all packets. */ + pkt_meta[i].bytes = dp_packet_size(packet); + pkt_meta[i].tcp_flags = miniflow_get_tcp_flags(&key->mf); + + key->len = netdev_flow_key_size(miniflow_n_values(&key->mf)); + key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet, &key->mf); + + if (emc_enabled) { + f = emc_lookup(&cache->emc_cache, key); + + if (f) { + rules[i] = &f->cr; + emc_hits++; + hwol_emc_smc_hitmask |= (1 << i); + continue; + } + } + + if (smc_enabled) { + f = smc_lookup_single(pmd, packet, key); + if (f) { + rules[i] = &f->cr; + smc_hits++; + smc_hitmask |= (1 << i); + continue; + } + } + + /* The flow pointer was not found in HWOL/EMC/SMC, so add it to the + * dpcls input keys array for batch lookup later. + */ + key_ptrs[dpcls_key_idx] = &keys[i]; + dpcls_key_idx++; + } + + hwol_emc_smc_hitmask |= smc_hitmask; + uint32_t hwol_emc_smc_missmask = ~hwol_emc_smc_hitmask; + + /* DPCLS handles any packets missed by HWOL/EMC/SMC. It operates on the + * key_ptrs[] for input miniflows to match, storing results in the + * dpcls_rules[] array. 
+ */ + if (dpcls_key_idx > 0) { + struct dpcls *cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port); + if (OVS_UNLIKELY(!cls)) { + return -1; + } + bool any_miss = + !dpcls_lookup(cls, (const struct netdev_flow_key **) key_ptrs, + dpcls_rules, dpcls_key_idx, NULL); + if (OVS_UNLIKELY(any_miss)) { + return -1; + } + + /* Merge DPCLS rules and HWOL/EMC/SMC rules. */ + uint32_t dpcls_idx = 0; + for (int i = 0; i < NETDEV_MAX_BURST; i += NUM_U64_IN_ZMM_REG) { + /* Indexing here is somewhat complicated due to DPCLS output rule + * load index depending on the hitmask of HWOL/EMC/SMC. More + * packets from HWOL/EMC/SMC bitmask means less DPCLS rules are + * used. + */ + __m512i v_cache_rules = _mm512_loadu_si512(&rules[i]); + __m512i v_merged_rules = + _mm512_mask_expandloadu_epi64(v_cache_rules, + ~hwol_emc_smc_hitmask, + &dpcls_rules[dpcls_idx]); + _mm512_storeu_si512(&rules[i], v_merged_rules); + + /* Update DPCLS load index and bitmask for HWOL/EMC/SMC hits. + * There are NUM_U64_IN_ZMM_REG output pointers per register, + * subtract the HWOL/EMC/SMC lanes equals the number of DPCLS rules + * consumed. + */ + uint32_t hitmask_FF = (hwol_emc_smc_hitmask & 0xFF); + dpcls_idx += NUM_U64_IN_ZMM_REG - __builtin_popcountll(hitmask_FF); + hwol_emc_smc_hitmask = + (hwol_emc_smc_hitmask >> NUM_U64_IN_ZMM_REG); + } + } + + /* At this point we have a 1:1 pkt to rules mapping, so update EMC/SMC + * if required. + */ + /* Insert SMC and DPCLS hits into EMC. */ + if (emc_enabled) { + uint32_t emc_insert_mask = smc_hitmask | hwol_emc_smc_missmask; + emc_insert_mask &= lookup_pkts_bitmask; + emc_probabilistic_insert_batch(pmd, keys, &rules[0], emc_insert_mask); + } + /* Insert DPCLS hits into SMC. */ + if (smc_enabled) { + uint32_t smc_insert_mask = hwol_emc_smc_missmask; + smc_insert_mask &= lookup_pkts_bitmask; + smc_insert_batch(pmd, keys, &rules[0], smc_insert_mask); + } + + /* At this point we don't return error anymore, so commit stats here. 
*/ + pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_RECV, batch_size); + pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT, emc_hits); + pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT, smc_hits); + pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT, + dpcls_key_idx); + pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_LOOKUP, + dpcls_key_idx); + + /* Initialize the "Action Batch" for each flow handled below. */ + struct dp_packet_batch action_batch; + action_batch.trunc = 0; + + while (lookup_pkts_bitmask) { + uint32_t rule_pkt_idx = raw_ctz(lookup_pkts_bitmask); + uint64_t needle = (uintptr_t) rules[rule_pkt_idx]; + + /* Parallel compare NUM_U64_IN_ZMM_REG flow* 's to the needle, create a + * bitmask. + */ + uint32_t batch_bitmask = 0; + for (uint32_t j = 0; j < NETDEV_MAX_BURST; j += NUM_U64_IN_ZMM_REG) { + /* Pre-calculate store addr. */ + uint32_t num_pkts_in_batch = __builtin_popcountll(batch_bitmask); + void *store_addr = &action_batch.packets[num_pkts_in_batch]; + + /* Search for identical flow* in burst, update bitmask. */ + __m512i v_needle = _mm512_set1_epi64(needle); + __m512i v_hay = _mm512_loadu_si512(&rules[j]); + __mmask8 k_cmp_bits = _mm512_cmpeq_epi64_mask(v_needle, v_hay); + uint32_t cmp_bits = k_cmp_bits; + batch_bitmask |= cmp_bits << j; + + /* Compress and store the batched packets. */ + struct dp_packet **packets_ptrs = &packets->packets[j]; + __m512i v_pkt_ptrs = _mm512_loadu_si512(packets_ptrs); + _mm512_mask_compressstoreu_epi64(store_addr, cmp_bits, v_pkt_ptrs); + } + + /* Strip all packets in this batch from the lookup_pkts_bitmask. */ + lookup_pkts_bitmask &= (~batch_bitmask); + action_batch.count = __builtin_popcountll(batch_bitmask); + + /* Loop over all packets in this batch, to gather the byte and tcp_flag + * values, and pass them to the execute function. It would be nice to + * optimize this away, however it is not easy to refactor in dpif. 
+ */ + uint32_t bytes = 0; + uint16_t tcp_flags = 0; + uint32_t bitmask_iter = batch_bitmask; + for (int i = 0; i < action_batch.count; i++) { + uint32_t idx = raw_ctz(bitmask_iter); + bitmask_iter = _blsr_u64(bitmask_iter); + + bytes += pkt_meta[idx].bytes; + tcp_flags |= pkt_meta[idx].tcp_flags; + } + + dp_netdev_batch_execute(pmd, &action_batch, rules[rule_pkt_idx], + bytes, tcp_flags); + } + + return 0; +} + +#endif +#endif diff --git a/lib/dpif-netdev-private-dfc.h b/lib/dpif-netdev-private-dfc.h index 6f1570355..92092ebec 100644 --- a/lib/dpif-netdev-private-dfc.h +++ b/lib/dpif-netdev-private-dfc.h @@ -81,6 +81,14 @@ extern "C" { #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX / \ DEFAULT_EM_FLOW_INSERT_INV_PROB) +/* Forward declaration for SMC function prototype that requires access to + * 'struct dp_netdev_pmd_thread'. */ +struct dp_netdev_pmd_thread; + +/* Forward declaration for EMC and SMC batch insert function prototypes that + * require access to 'struct dpcls_rule'. */ +struct dpcls_rule; + struct emc_entry { struct dp_netdev_flow *flow; struct netdev_flow_key key; /* key.hash used for emc hash value. */ @@ -156,6 +164,23 @@ emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key) return NULL; } +/* Insert a batch of keys/flows into the EMC and SMC caches. 
*/ +void +emc_probabilistic_insert_batch(struct dp_netdev_pmd_thread *pmd, + const struct netdev_flow_key *keys, + struct dpcls_rule **rules, + uint32_t emc_insert_mask); + +void +smc_insert_batch(struct dp_netdev_pmd_thread *pmd, + const struct netdev_flow_key *keys, + struct dpcls_rule **rules, + uint32_t smc_insert_mask); + +struct dp_netdev_flow * +smc_lookup_single(struct dp_netdev_pmd_thread *pmd, + struct dp_packet *packet, + struct netdev_flow_key *key); #ifdef __cplusplus } diff --git a/lib/dpif-netdev-private-dpcls.h b/lib/dpif-netdev-private-dpcls.h index dc22431a3..7c4a840cb 100644 --- a/lib/dpif-netdev-private-dpcls.h +++ b/lib/dpif-netdev-private-dpcls.h @@ -33,6 +33,7 @@ extern "C" { /* Forward declaration for lookup_func typedef. */ struct dpcls_subtable; struct dpcls_rule; +struct dpcls; /* Must be public as it is instantiated in subtable struct below. */ struct netdev_flow_key { @@ -121,6 +122,12 @@ dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet, return hash; } +/* Allow other implementations to call dpcls_lookup() for subtable search. */ +bool +dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[], + struct dpcls_rule **rules, const size_t cnt, + int *num_lookups_p); + #ifdef __cplusplus } #endif diff --git a/lib/dpif-netdev-private-dpif.h b/lib/dpif-netdev-private-dpif.h new file mode 100644 index 000000000..bbd719b22 --- /dev/null +++ b/lib/dpif-netdev-private-dpif.h @@ -0,0 +1,32 @@ +/* + * Copyright (c) 2021 Intel Corporation. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#ifndef DPIF_NETDEV_PRIVATE_DPIF_H +#define DPIF_NETDEV_PRIVATE_DPIF_H 1 + +#include "openvswitch/types.h" + +/* Forward declarations to avoid including files. */ +struct dp_netdev_pmd_thread; +struct dp_packet_batch; + +/* Available DPIF implementations below. */ +int32_t +dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd, + struct dp_packet_batch *packets, + odp_port_t in_port); + +#endif /* netdev-private.h */ diff --git a/lib/dpif-netdev-private-thread.h b/lib/dpif-netdev-private-thread.h index d38a7a2c3..63b99220b 100644 --- a/lib/dpif-netdev-private-thread.h +++ b/lib/dpif-netdev-private-thread.h @@ -21,6 +21,7 @@ #include "dpif.h" #include "dpif-netdev-perf.h" #include "dpif-netdev-private-dfc.h" +#include "dpif-netdev-private-dpif.h" #include <stdbool.h> #include <stdint.h> @@ -45,14 +46,19 @@ struct dp_netdev_pmd_thread_ctx { struct dp_netdev_rxq *last_rxq; /* EMC insertion probability context for the current processing cycle. */ uint32_t emc_insert_min; + /* Enable the SMC cache from ovsdb config. */ + bool smc_enable_db; }; /* Forward declaration for typedef. */ struct dp_netdev_pmd_thread; -typedef void (*dp_netdev_input_func)(struct dp_netdev_pmd_thread *pmd, - struct dp_packet_batch *packets, - odp_port_t port_no); +/* Typedef for DPIF functions. + * Returns a bitmask of packets to handle, possibly including upcall/misses. + */ +typedef int32_t (*dp_netdev_input_func)(struct dp_netdev_pmd_thread *pmd, + struct dp_packet_batch *packets, + odp_port_t port_no); /* PMD: Poll modes drivers. PMD accesses devices via polling to eliminate * the performance overhead of interrupt processing. Therefore netdev can @@ -111,6 +117,9 @@ struct dp_netdev_pmd_thread { /* Function pointer to call for dp_netdev_input() functionality. */ ATOMIC(dp_netdev_input_func) netdev_input_func; + /* Pointer for per-DPIF implementation scratch space.
*/ + void *netdev_input_func_userdata; + struct seq *reload_seq; uint64_t last_reload_seq; diff --git a/lib/dpif-netdev-private.h b/lib/dpif-netdev-private.h index d7b6fd7ec..4593649bd 100644 --- a/lib/dpif-netdev-private.h +++ b/lib/dpif-netdev-private.h @@ -31,4 +31,23 @@ #include "dpif-netdev-private-dfc.h" #include "dpif-netdev-private-thread.h" -#endif /* netdev-private.h */ +/* Allow other implementations to lookup the DPCLS instances. */ +struct dpcls * +dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd, + odp_port_t in_port); + +/* Allow other implementations to execute actions on a batch. */ +void +dp_netdev_batch_execute(struct dp_netdev_pmd_thread *pmd, + struct dp_packet_batch *packets, + struct dpcls_rule *rule, + uint32_t bytes, + uint16_t tcp_flags); + +int +dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd, + odp_port_t port_no, + struct dp_packet *packet, + struct dp_netdev_flow **flow); + +#endif /* dpif-netdev-private.h */ diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index b89fcd276..6e006da9e 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -182,10 +182,6 @@ static uint32_t dpcls_subtable_lookup_reprobe(struct dpcls *cls); static void dpcls_insert(struct dpcls *, struct dpcls_rule *, const struct netdev_flow_key *mask); static void dpcls_remove(struct dpcls *, struct dpcls_rule *); -static bool dpcls_lookup(struct dpcls *cls, - const struct netdev_flow_key *keys[], - struct dpcls_rule **rules, size_t cnt, - int *num_lookups_p); /* Set of supported meter flags */ #define DP_SUPPORTED_METER_FLAGS_MASK \ @@ -473,7 +469,7 @@ static void dp_netdev_execute_actions(struct dp_netdev_pmd_thread *pmd, const struct flow *flow, const struct nlattr *actions, size_t actions_len); -static void dp_netdev_input(struct dp_netdev_pmd_thread *, +static int32_t dp_netdev_input(struct dp_netdev_pmd_thread *, struct dp_packet_batch *, odp_port_t port_no); static void dp_netdev_recirculate(struct dp_netdev_pmd_thread *, struct 
dp_packet_batch *); @@ -545,7 +541,7 @@ dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread *pmd, bool purge); static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd, struct tx_port *tx); -static inline struct dpcls * +inline struct dpcls * dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd, odp_port_t in_port); @@ -1935,7 +1931,7 @@ void dp_netdev_flow_unref(struct dp_netdev_flow *flow) } } -static inline struct dpcls * +inline struct dpcls * dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd, odp_port_t in_port) { @@ -2767,13 +2763,46 @@ smc_insert(struct dp_netdev_pmd_thread *pmd, bucket->flow_idx[i] = index; } +inline void +emc_probabilistic_insert_batch(struct dp_netdev_pmd_thread *pmd, + const struct netdev_flow_key *keys, + struct dpcls_rule **rules, + uint32_t emc_insert_mask) +{ + while (emc_insert_mask) { + uint32_t i = raw_ctz(emc_insert_mask); + emc_insert_mask &= emc_insert_mask - 1; + /* Get the require parameters for EMC/SMC from the rule */ + struct dp_netdev_flow *flow = dp_netdev_flow_cast(rules[i]); + /* Insert the key into EMC/SMC. */ + emc_probabilistic_insert(pmd, &keys[i], flow); + } +} + +inline void +smc_insert_batch(struct dp_netdev_pmd_thread *pmd, + const struct netdev_flow_key *keys, + struct dpcls_rule **rules, + uint32_t smc_insert_mask) +{ + while (smc_insert_mask) { + uint32_t i = raw_ctz(smc_insert_mask); + smc_insert_mask &= smc_insert_mask - 1; + /* Get the require parameters for EMC/SMC from the rule */ + struct dp_netdev_flow *flow = dp_netdev_flow_cast(rules[i]); + uint32_t hash = dp_netdev_flow_hash(&flow->ufid); + /* Insert the key into EMC/SMC. 
*/ + smc_insert(pmd, &keys[i], hash); + } +} + static struct dp_netdev_flow * dp_netdev_pmd_lookup_flow(struct dp_netdev_pmd_thread *pmd, const struct netdev_flow_key *key, int *lookup_num_p) { struct dpcls *cls; - struct dpcls_rule *rule; + struct dpcls_rule *rule = NULL; odp_port_t in_port = u32_to_odp(MINIFLOW_GET_U32(&key->mf, in_port.odp_port)); struct dp_netdev_flow *netdev_flow = NULL; @@ -4288,7 +4317,10 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd, } /* Process packet batch. */ - pmd->netdev_input_func(pmd, &batch, port_no); + int ret = pmd->netdev_input_func(pmd, &batch, port_no); + if (ret) { + dp_netdev_input(pmd, &batch, port_no); + } /* Assign processing cycles to rx queue. */ cycles = cycle_timer_stop(&pmd->perf_stats, &timer); @@ -5306,6 +5338,8 @@ dpif_netdev_run(struct dpif *dpif) non_pmd->ctx.emc_insert_min = 0; } + non_pmd->ctx.smc_enable_db = dp->smc_enable_db; + for (i = 0; i < port->n_rxq; i++) { if (!netdev_rxq_enabled(port->rxqs[i].rx)) { @@ -5577,6 +5611,8 @@ reload: pmd->ctx.emc_insert_min = 0; } + pmd->ctx.smc_enable_db = pmd->dp->smc_enable_db; + process_packets = dp_netdev_process_rxq_port(pmd, poll_list[i].rxq, poll_list[i].port_no); @@ -6474,6 +6510,24 @@ packet_batch_per_flow_execute(struct packet_batch_per_flow *batch, actions->actions, actions->size); } +void +dp_netdev_batch_execute(struct dp_netdev_pmd_thread *pmd, + struct dp_packet_batch *packets, + struct dpcls_rule *rule, + uint32_t bytes, + uint16_t tcp_flags) +{ + /* Gets action* from the rule. 
*/ + struct dp_netdev_flow *flow = dp_netdev_flow_cast(rule); + struct dp_netdev_actions *actions = dp_netdev_flow_get_actions(flow); + + dp_netdev_flow_used(flow, dp_packet_batch_size(packets), bytes, + tcp_flags, pmd->ctx.now / 1000); + const uint32_t steal = 1; + dp_netdev_execute_actions(pmd, packets, steal, &flow->flow, + actions->actions, actions->size); +} + static inline void dp_netdev_queue_batches(struct dp_packet *pkt, struct dp_netdev_flow *flow, uint16_t tcp_flags, @@ -6578,10 +6632,34 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd, pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT, n_smc_hit); } +struct dp_netdev_flow * +smc_lookup_single(struct dp_netdev_pmd_thread *pmd, + struct dp_packet *packet, + struct netdev_flow_key *key) +{ + const struct cmap_node *flow_node = smc_entry_get(pmd, key->hash); + + if (OVS_LIKELY(flow_node != NULL)) { + struct dp_netdev_flow *flow = NULL; + + CMAP_NODE_FOR_EACH (flow, node, flow_node) { + /* Since we dont have per-port megaflow to check the port + * number, we need to verify that the input ports match. 
*/ + if (OVS_LIKELY(dpcls_rule_matches_key(&flow->cr, key) && + flow->flow.in_port.odp_port == packet->md.in_port.odp_port)) { + + return (void *) flow; + } + } + } + + return NULL; +} + static struct tx_port * pmd_send_port_cache_lookup( const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no); -static inline int +inline int dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no, struct dp_packet *packet, @@ -7020,12 +7098,13 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd, } } -static void +static int32_t dp_netdev_input(struct dp_netdev_pmd_thread *pmd, struct dp_packet_batch *packets, odp_port_t port_no) { dp_netdev_input__(pmd, packets, false, port_no); + return 0; } static void @@ -8465,7 +8544,7 @@ dpcls_flow_key_gen_masks(const struct netdev_flow_key *tbl, /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit * in 'mask' the values in 'key' and 'target' are the same. */ -bool +inline bool dpcls_rule_matches_key(const struct dpcls_rule *rule, const struct netdev_flow_key *target) { @@ -8491,7 +8570,7 @@ dpcls_rule_matches_key(const struct dpcls_rule *rule, * priorities, instead returning any rule which matches the flow. * * Returns true if all miniflows found a corresponding rule. 
*/ -static bool +bool dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[], struct dpcls_rule **rules, const size_t cnt, int *num_lookups_p)

From patchwork Thu Jul 8 14:02:34 2021
X-Patchwork-Submitter: "Ferriter, Cian"
X-Patchwork-Id: 1502309
From: Cian Ferriter
To: ovs-dev@openvswitch.org
Date: Thu, 8 Jul 2021 15:02:34 +0100
Message-Id: <20210708140240.61172-5-cian.ferriter@intel.com>
In-Reply-To: <20210708140240.61172-1-cian.ferriter@intel.com>
Cc: i.maximets@ovn.org, fbl@sysclose.org
Subject: [ovs-dev] [v15 04/10] dpif-netdev: Add command to switch dpif implementation.

From: Harry van Haaren

This commit adds a new command to allow the user to switch the active DPIF implementation at runtime.
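The probe-before-switch contract this commit describes can be sketched outside the OVS tree as follows. Note the names here are illustrative only, and `__builtin_cpu_supports` is a GCC/Clang x86 builtin standing in for OVS's `dpdk_get_cpu_has_isa()`:

```c
#include <errno.h>

/* Stand-in for dpdk_get_cpu_has_isa("x86_64", "avx512f"); the builtin is
 * x86-only under GCC/Clang, so guard it for other targets. */
static int cpu_has_avx512f(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    return __builtin_cpu_supports("avx512f");
#else
    return 0;
#endif
}

/* Probe contract used by the series: return 0 when the implementation may
 * be selected, -ENOTSUP when the running CPU lacks the required ISA. */
static int avx512_probe(void)
{
    return cpu_has_avx512f() ? 0 : -ENOTSUP;
}
```

A switch request would first call the probe and refuse the change on any non-zero return, which is exactly what the unixctl handler below does with `dp_netdev_impl_set_default_by_name()`.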
A probe function is executed before switching the DPIF implementation, to ensure the CPU is capable of running the required ISA. For example, the command below will switch to the AVX512 enabled DPIF, assuming that the runtime CPU is capable of running AVX512 instructions:

    $ ovs-appctl dpif-netdev/dpif-impl-set dpif_avx512

A new configuration flag is added to allow selection of the default DPIF. This is useful for running the unit tests against the available DPIF implementations, without modifying each unit test.

The design of the testing & validation for ISA optimized DPIF implementations is based around the work already upstream for DPCLS. Note however that a DPCLS lookup has no state or side-effects, allowing the auto-validator implementation to perform multiple lookups and provide consistent statistic counters. The DPIF component does have state, so running two implementations in parallel and comparing output is not a valid testing method, as there are changes in DPIF statistic counters (side effects). As a result, the DPIF is tested directly against the unit tests.

Signed-off-by: Harry van Haaren
Co-authored-by: Cian Ferriter
Signed-off-by: Cian Ferriter
Acked-by: Flavio Leitner

---
v15:
- Address Flavio's comments from the v14 review.
- Move dp_netdev_impl_set_default_by_name() below dp_netdev_impl_get_by_name() since it relies on that function, and it is no longer prototyped in the .h file.

v14:
- Change command name to dpif-impl-set.
- Fix the order of includes to what is laid out in coding-style.rst.
- Use bool not int to capture the return value of dpdk_get_cpu_has_isa().
- Use an enum to index the DPIF impls array.
- Hide more of the dpif impl details from lib/dpif-netdev.c.
- Fix the comment on the *dp_netdev_input_func() typedef.
- Rename dp_netdev_input_func func to input_func.
- Remove the datapath or dp argument from the dpif-impl-set CMD.
- Set the DPIF function pointer atomically.

v13:
- Add Docs items about the switch DPIF command here rather than in later commit.
- Document operation in manpages as well as rST. - Minor code refactoring to address review comments. --- Documentation/topics/dpdk/bridge.rst | 34 ++++++++ acinclude.m4 | 15 ++++ configure.ac | 1 + lib/automake.mk | 1 + lib/dpif-netdev-avx512.c | 14 +++ lib/dpif-netdev-private-dpif.c | 124 +++++++++++++++++++++++++++ lib/dpif-netdev-private-dpif.h | 41 +++++++++ lib/dpif-netdev-private-thread.h | 10 --- lib/dpif-netdev-unixctl.man | 3 + lib/dpif-netdev.c | 74 ++++++++++++++-- 10 files changed, 302 insertions(+), 15 deletions(-) create mode 100644 lib/dpif-netdev-private-dpif.c diff --git a/Documentation/topics/dpdk/bridge.rst b/Documentation/topics/dpdk/bridge.rst index 526d5c959..06d1f943c 100644 --- a/Documentation/topics/dpdk/bridge.rst +++ b/Documentation/topics/dpdk/bridge.rst @@ -214,3 +214,37 @@ implementation :: Compile OVS in debug mode to have `ovs_assert` statements error out if there is a mis-match in the DPCLS lookup implementation. + +Datapath Interface Performance +------------------------------ + +The datapath interface (DPIF) or dp_netdev_input() is responsible for taking +packets through the major components of the userspace datapath; such as +miniflow_extract, EMC, SMC and DPCLS lookups, and a lot of the performance +stats associated with the datapath. + +Just like with the SIMD DPCLS feature above, SIMD can be applied to the DPIF to +improve performance. + +By default, dpif_scalar is used. The DPIF implementation can be selected by +name :: + + $ ovs-appctl dpif-netdev/dpif-impl-set dpif_avx512 + DPIF implementation set to dpif_avx512. + + $ ovs-appctl dpif-netdev/dpif-impl-set dpif_scalar + DPIF implementation set to dpif_scalar. + +Running Unit Tests with AVX512 DPIF +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Since the AVX512 DPIF is disabled by default, a compile time option is +available in order to test it with the OVS unit test suite. 
When building with +a CPU that supports AVX512, use the following configure option :: + + $ ./configure --enable-dpif-default-avx512 + +The following line should be seen in the configure output when the above option +is used :: + + checking whether DPIF AVX512 is default implementation... yes diff --git a/acinclude.m4 b/acinclude.m4 index 18c52f63a..343303447 100644 --- a/acinclude.m4 +++ b/acinclude.m4 @@ -30,6 +30,21 @@ AC_DEFUN([OVS_CHECK_DPCLS_AUTOVALIDATOR], [ fi ]) +dnl Set OVS DPIF default implementation at configure time for running the unit +dnl tests on the whole codebase without modifying tests per DPIF impl +AC_DEFUN([OVS_CHECK_DPIF_AVX512_DEFAULT], [ + AC_ARG_ENABLE([dpif-default-avx512], + [AC_HELP_STRING([--enable-dpif-default-avx512], [Enable DPIF AVX512 implementation as default.])], + [dpifavx512=yes],[dpifavx512=no]) + AC_MSG_CHECKING([whether DPIF AVX512 is default implementation]) + if test "$dpifavx512" != yes; then + AC_MSG_RESULT([no]) + else + OVS_CFLAGS="$OVS_CFLAGS -DDPIF_AVX512_DEFAULT" + AC_MSG_RESULT([yes]) + fi +]) + dnl OVS_ENABLE_WERROR AC_DEFUN([OVS_ENABLE_WERROR], [AC_ARG_ENABLE( diff --git a/configure.ac b/configure.ac index c077034d4..e45685a6c 100644 --- a/configure.ac +++ b/configure.ac @@ -185,6 +185,7 @@ OVS_ENABLE_WERROR OVS_ENABLE_SPARSE OVS_CTAGS_IDENTIFIERS OVS_CHECK_DPCLS_AUTOVALIDATOR +OVS_CHECK_DPIF_AVX512_DEFAULT OVS_CHECK_BINUTILS_AVX512 AC_ARG_VAR(KARCH, [Kernel Architecture String]) diff --git a/lib/automake.mk b/lib/automake.mk index 432b98e62..3c9523c1a 100644 --- a/lib/automake.mk +++ b/lib/automake.mk @@ -116,6 +116,7 @@ lib_libopenvswitch_la_SOURCES = \ lib/dpif-netdev-private-dfc.c \ lib/dpif-netdev-private-dfc.h \ lib/dpif-netdev-private-dpcls.h \ + lib/dpif-netdev-private-dpif.c \ lib/dpif-netdev-private-dpif.h \ lib/dpif-netdev-private-flow.h \ lib/dpif-netdev-private-thread.h \ diff --git a/lib/dpif-netdev-avx512.c b/lib/dpif-netdev-avx512.c index f59c1bbe0..1ae66ca6c 100644 --- 
a/lib/dpif-netdev-avx512.c +++ b/lib/dpif-netdev-avx512.c @@ -24,6 +24,7 @@ #include "dpif-netdev-perf.h" #include "dpif-netdev-private.h" +#include #include #include "dp-packet.h" @@ -57,6 +58,19 @@ struct dpif_userdata { struct pkt_flow_meta pkt_meta[NETDEV_MAX_BURST]; }; +int32_t +dp_netdev_input_outer_avx512_probe(void) +{ + bool avx512f_available = dpdk_get_cpu_has_isa("x86_64", "avx512f"); + bool bmi2_available = dpdk_get_cpu_has_isa("x86_64", "bmi2"); + + if (!avx512f_available || !bmi2_available) { + return -ENOTSUP; + } + + return 0; +} + int32_t dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd, struct dp_packet_batch *packets, diff --git a/lib/dpif-netdev-private-dpif.c b/lib/dpif-netdev-private-dpif.c new file mode 100644 index 000000000..a05a82fa1 --- /dev/null +++ b/lib/dpif-netdev-private-dpif.c @@ -0,0 +1,124 @@ +/* + * Copyright (c) 2021 Intel Corporation. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include + +#include "dpif-netdev-private-dpif.h" +#include "dpif-netdev-private-thread.h" + +#include +#include + +#include "openvswitch/dynamic-string.h" +#include "openvswitch/vlog.h" +#include "util.h" + +VLOG_DEFINE_THIS_MODULE(dpif_netdev_impl); + +enum dpif_netdev_impl_info_idx { + DPIF_NETDEV_IMPL_SCALAR, + DPIF_NETDEV_IMPL_AVX512 +}; + +/* Actual list of implementations goes here. */ +static struct dpif_netdev_impl_info_t dpif_impls[] = { + /* The default scalar C code implementation. 
*/ + [DPIF_NETDEV_IMPL_SCALAR] = { .input_func = dp_netdev_input, + .probe = NULL, + .name = "dpif_scalar", }, + +#if (__x86_64__ && HAVE_AVX512F && HAVE_LD_AVX512_GOOD && __SSE4_2__) + /* Only available on x86_64 bit builds with SSE 4.2 used for OVS core. */ + [DPIF_NETDEV_IMPL_AVX512] = { .input_func = dp_netdev_input_outer_avx512, + .probe = dp_netdev_input_outer_avx512_probe, + .name = "dpif_avx512", }, +#endif +}; + +static dp_netdev_input_func default_dpif_func; + +dp_netdev_input_func +dp_netdev_impl_get_default(void) +{ + /* For the first call, this will be NULL. Compute the compile time default. + */ + if (!default_dpif_func) { + int dpif_idx = DPIF_NETDEV_IMPL_SCALAR; + +/* Configure-time overriding to run test suite on all implementations. */ +#if (__x86_64__ && HAVE_AVX512F && HAVE_LD_AVX512_GOOD && __SSE4_2__) +#ifdef DPIF_AVX512_DEFAULT + dp_netdev_input_func_probe probe; + + /* Check if the compiled default is compatible. */ + probe = dpif_impls[DPIF_NETDEV_IMPL_AVX512].probe; + if (!probe || !probe()) { + dpif_idx = DPIF_NETDEV_IMPL_AVX512; + } +#endif +#endif + + VLOG_INFO("Default DPIF implementation is %s.\n", + dpif_impls[dpif_idx].name); + default_dpif_func = dpif_impls[dpif_idx].input_func; + } + + return default_dpif_func; +} + +/* This function checks all available DPIF implementations, and selects the + * returns the function pointer to the one requested by "name". + */ +static int32_t +dp_netdev_impl_get_by_name(const char *name, dp_netdev_input_func *out_func) +{ + ovs_assert(name); + ovs_assert(out_func); + + uint32_t i; + + for (i = 0; i < ARRAY_SIZE(dpif_impls); i++) { + if (strcmp(dpif_impls[i].name, name) == 0) { + /* Probe function is optional - so check it is set before exec. 
*/ + if (dpif_impls[i].probe) { + int probe_err = dpif_impls[i].probe(); + if (probe_err) { + *out_func = NULL; + return probe_err; + } + } + *out_func = dpif_impls[i].input_func; + return 0; + } + } + + return -EINVAL; +} + +int32_t +dp_netdev_impl_set_default_by_name(const char *name) +{ + dp_netdev_input_func new_default; + + int32_t err = dp_netdev_impl_get_by_name(name, &new_default); + + if (!err) { + default_dpif_func = new_default; + } + + return err; + +} diff --git a/lib/dpif-netdev-private-dpif.h b/lib/dpif-netdev-private-dpif.h index bbd719b22..7880647ad 100644 --- a/lib/dpif-netdev-private-dpif.h +++ b/lib/dpif-netdev-private-dpif.h @@ -23,7 +23,48 @@ struct dp_netdev_pmd_thread; struct dp_packet_batch; +/* Typedef for DPIF functions. + * Returns whether all packets were processed successfully. + */ +typedef int32_t (*dp_netdev_input_func)(struct dp_netdev_pmd_thread *pmd, + struct dp_packet_batch *packets, + odp_port_t port_no); + +/* Probe a DPIF implementation. This allows the implementation to validate CPU + * ISA availability. Returns -ENOTSUP if not available, returns 0 if valid to + * use. + */ +typedef int32_t (*dp_netdev_input_func_probe)(void); + +/* Structure describing each available DPIF implementation. */ +struct dpif_netdev_impl_info_t { + /* Function pointer to execute to have this DPIF implementation run. */ + dp_netdev_input_func input_func; + /* Function pointer to execute to check the CPU ISA is available to run. If + * not necessary, it must be set to NULL which implies that it is always + * valid to use. */ + dp_netdev_input_func_probe probe; + /* Name used to select this DPIF implementation. */ + const char *name; +}; + +/* Returns the default DPIF which is first ./configure selected, but can be + * overridden at runtime. */ +dp_netdev_input_func dp_netdev_impl_get_default(void); + +/* Overrides the default DPIF with the user set DPIF. 
*/ +int32_t dp_netdev_impl_set_default_by_name(const char *name); + /* Available DPIF implementations below. */ +int32_t +dp_netdev_input(struct dp_netdev_pmd_thread *pmd, + struct dp_packet_batch *packets, + odp_port_t in_port); + +/* AVX512 enabled DPIF implementation and probe functions. */ +int32_t +dp_netdev_input_outer_avx512_probe(void); + int32_t dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd, struct dp_packet_batch *packets, diff --git a/lib/dpif-netdev-private-thread.h b/lib/dpif-netdev-private-thread.h index 63b99220b..ba79c4a0a 100644 --- a/lib/dpif-netdev-private-thread.h +++ b/lib/dpif-netdev-private-thread.h @@ -50,16 +50,6 @@ struct dp_netdev_pmd_thread_ctx { bool smc_enable_db; }; -/* Forward declaration for typedef. */ -struct dp_netdev_pmd_thread; - -/* Typedef for DPIF functions. - * Returns a bitmask of packets to handle, possibly including upcall/misses. - */ -typedef int32_t (*dp_netdev_input_func)(struct dp_netdev_pmd_thread *pmd, - struct dp_packet_batch *packets, - odp_port_t port_no); - /* PMD: Poll modes drivers. PMD accesses devices via polling to eliminate * the performance overhead of interrupt processing. Therefore netdev can * not implement rx-wait for these devices. dpif-netdev needs to poll diff --git a/lib/dpif-netdev-unixctl.man b/lib/dpif-netdev-unixctl.man index 858d491df..76cc949f9 100644 --- a/lib/dpif-netdev-unixctl.man +++ b/lib/dpif-netdev-unixctl.man @@ -226,3 +226,6 @@ recirculation (only in balance-tcp mode). When this is the case, the above command prints the load-balancing information of the bonds configured in datapath \fIdp\fR showing the interface associated with each bucket (hash). +. +.IP "\fBdpif-netdev/dpif-impl-set\fR \fIdpif_impl\fR" +Sets the DPIF to be used to \fIdpif_impl\fR. By default "dpif_scalar" is used. 
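The selection logic added in lib/dpif-netdev-private-dpif.c above amounts to a lookup over a table of {input_func, probe, name} entries, running the optional probe before handing back the function pointer. A minimal self-contained sketch of that pattern (toy names and toy functions, not the real OVS symbols):

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

typedef int (*input_func)(int n_packets);
typedef int (*probe_func)(void);

struct impl_info {
    input_func func;
    probe_func probe;      /* NULL means always valid to use. */
    const char *name;
};

/* Toy implementations standing in for dp_netdev_input() and the AVX512
 * variant; the "wide" probe pretends the required ISA is missing. */
static int scalar_input(int n) { return n; }
static int wide_input(int n)   { return n > 0 ? n : 0; }
static int wide_probe(void)    { return -ENOTSUP; }

static const struct impl_info impls[] = {
    { scalar_input, NULL,       "scalar" },
    { wide_input,   wide_probe, "wide"   },
};

/* Same contract as dp_netdev_impl_get_by_name(): run the optional probe,
 * return 0 and the function pointer on success, a negative errno code on
 * probe failure, -EINVAL for an unknown name. */
static int impl_get_by_name(const char *name, input_func *out_func)
{
    for (size_t i = 0; i < sizeof impls / sizeof impls[0]; i++) {
        if (!strcmp(impls[i].name, name)) {
            if (impls[i].probe) {
                int err = impls[i].probe();
                if (err) {
                    *out_func = NULL;
                    return err;
                }
            }
            *out_func = impls[i].func;
            return 0;
        }
    }
    return -EINVAL;
}
```

Switching the default then reduces to storing the returned pointer, which the patch does atomically into each PMD thread's `netdev_input_func`.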
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 6e006da9e..8ef518994 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -469,8 +469,6 @@ static void dp_netdev_execute_actions(struct dp_netdev_pmd_thread *pmd, const struct flow *flow, const struct nlattr *actions, size_t actions_len); -static int32_t dp_netdev_input(struct dp_netdev_pmd_thread *, - struct dp_packet_batch *, odp_port_t port_no); static void dp_netdev_recirculate(struct dp_netdev_pmd_thread *, struct dp_packet_batch *); @@ -967,6 +965,66 @@ dpif_netdev_subtable_lookup_set(struct unixctl_conn *conn, int argc OVS_UNUSED, ds_destroy(&reply); } +static void +dpif_netdev_impl_set(struct unixctl_conn *conn, int argc OVS_UNUSED, + const char *argv[], void *aux OVS_UNUSED) +{ + /* This function requires just one parameter, the DPIF name. */ + const char *dpif_name = argv[1]; + struct shash_node *node; + + static const char *error_description[2] = { + "Unknown DPIF implementation", + "CPU doesn't support the required instruction for", + }; + + ovs_mutex_lock(&dp_netdev_mutex); + int32_t err = dp_netdev_impl_set_default_by_name(dpif_name); + + if (err) { + struct ds reply = DS_EMPTY_INITIALIZER; + ds_put_format(&reply, "DPIF implementation not available: %s %s.\n", + error_description[ (err == -ENOTSUP) ], dpif_name); + const char *reply_str = ds_cstr(&reply); + unixctl_command_reply_error(conn, reply_str); + VLOG_ERR("%s", reply_str); + ds_destroy(&reply); + ovs_mutex_unlock(&dp_netdev_mutex); + return; + } + + SHASH_FOR_EACH (node, &dp_netdevs) { + struct dp_netdev *dp = node->data; + + /* Get PMD threads list, required to get DPCLS instances. */ + size_t n; + struct dp_netdev_pmd_thread **pmd_list; + sorted_poll_thread_list(dp, &pmd_list, &n); + + for (size_t i = 0; i < n; i++) { + struct dp_netdev_pmd_thread *pmd = pmd_list[i]; + if (pmd->core_id == NON_PMD_CORE_ID) { + continue; + } + + /* Initialize DPIF function pointer to the newly configured + * default. 
*/ + dp_netdev_input_func default_func = dp_netdev_impl_get_default(); + atomic_uintptr_t *pmd_func = (void *) &pmd->netdev_input_func; + atomic_store_relaxed(pmd_func, (uintptr_t) default_func); + }; + } + ovs_mutex_unlock(&dp_netdev_mutex); + + /* Reply with success to command. */ + struct ds reply = DS_EMPTY_INITIALIZER; + ds_put_format(&reply, "DPIF implementation set to %s.\n", dpif_name); + const char *reply_str = ds_cstr(&reply); + unixctl_command_reply(conn, reply_str); + VLOG_INFO("%s", reply_str); + ds_destroy(&reply); +} + static void dpif_netdev_pmd_rebalance(struct unixctl_conn *conn, int argc, const char *argv[], void *aux OVS_UNUSED) @@ -1189,6 +1247,10 @@ dpif_netdev_init(void) unixctl_command_register("dpif-netdev/subtable-lookup-prio-get", "", 0, 0, dpif_netdev_subtable_lookup_get, NULL); + unixctl_command_register("dpif-netdev/dpif-impl-set", + "dpif_implementation_name", + 1, 1, dpif_netdev_impl_set, + NULL); return 0; } @@ -6126,8 +6188,10 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp, hmap_init(&pmd->send_port_cache); cmap_init(&pmd->tx_bonds); - /* Initialize the DPIF function pointer to the default scalar version. */ - pmd->netdev_input_func = dp_netdev_input; + /* Initialize DPIF function pointer to the default configured version. */ + dp_netdev_input_func default_func = dp_netdev_impl_get_default(); + atomic_uintptr_t *pmd_func = (void *) &pmd->netdev_input_func; + atomic_init(pmd_func, (uintptr_t) default_func); /* init the 'flow_cache' since there is no * actual thread created for NON_PMD_CORE_ID. 
*/ @@ -7098,7 +7162,7 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd, } } -static int32_t +int32_t dp_netdev_input(struct dp_netdev_pmd_thread *pmd, struct dp_packet_batch *packets, odp_port_t port_no)

From patchwork Thu Jul 8 14:02:35 2021
X-Patchwork-Submitter: "Ferriter, Cian"
X-Patchwork-Id: 1502310
From: Cian Ferriter
To: ovs-dev@openvswitch.org
Date: Thu, 8 Jul 2021 15:02:35 +0100
Message-Id: <20210708140240.61172-6-cian.ferriter@intel.com>
In-Reply-To: <20210708140240.61172-1-cian.ferriter@intel.com>
Cc: i.maximets@ovn.org, fbl@sysclose.org
Subject: [ovs-dev] [v15 05/10] dpif-netdev: Add command to get dpif implementations.

From: Harry van Haaren

This commit adds a new command to retrieve the list of available DPIF implementations.
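The per-PMD reporting this command performs reduces to comparing each thread's active input-function pointer against the table entries. A hedged sketch of that matching (the struct and names are illustrative; the real code aliases `pmd->netdev_input_func` through an `atomic_uintptr_t`, as the dpif-impl-set patch shows):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*input_func)(int n_packets);

static int scalar_input(int n) { return n; }
static int avx512_input(int n) { return n > 0 ? n : 0; }

/* Each PMD thread holds its active DPIF as a pointer that the main thread
 * updates atomically; readers load it the same way. */
struct pmd_thread {
    unsigned core_id;
    _Atomic uintptr_t netdev_input_func;
};

/* Count how many PMD threads currently dispatch through 'impl', the way
 * dp_netdev_impl_get() builds its "(pmds: ...)" listing per table entry. */
static size_t count_threads_using(input_func impl,
                                  struct pmd_thread *pmds, size_t n)
{
    size_t users = 0;

    for (size_t i = 0; i < n; i++) {
        input_func active = (input_func) atomic_load_explicit(
            &pmds[i].netdev_input_func, memory_order_relaxed);
        if (active == impl) {
            users++;
        }
    }
    return users;
}
```

An entry matched by no thread would be reported as "(pmds: none)", as in the bridge.rst example below.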
This can be used to check what implementations of the DPIF are available in any given OVS binary. It also returns which implementations are in use by the OVS PMD threads.

Usage:

    $ ovs-appctl dpif-netdev/dpif-impl-get

Signed-off-by: Harry van Haaren
Co-authored-by: Cian Ferriter
Signed-off-by: Cian Ferriter
Acked-by: Flavio Leitner

---
v15:
- Address Flavio's comments from the v14 review.

v14:
- Rename command to dpif-impl-get.
- Hide more of the dpif impl details from lib/dpif-netdev.c. Pass a dynamic_string to return the dpif-impl-get CMD output.
- Add information about which DPIF impl is currently in use by each PMD thread.

v13:
- Add NEWS item about DPIF get and set commands here rather than in a later commit.
- Add documentation items about DPIF set commands here rather than in a later commit.

---
 Documentation/topics/dpdk/bridge.rst | 8 +++++++
 NEWS | 1 +
 lib/dpif-netdev-private-dpif.c | 31 ++++++++++++++++++++++++++++
 lib/dpif-netdev-private-dpif.h | 6 ++++++
 lib/dpif-netdev-unixctl.man | 3 +++
 lib/dpif-netdev.c | 26 +++++++++++++++++++++++
 6 files changed, 75 insertions(+)

diff --git a/Documentation/topics/dpdk/bridge.rst b/Documentation/topics/dpdk/bridge.rst index 06d1f943c..2d0850836 100644 --- a/Documentation/topics/dpdk/bridge.rst +++ b/Documentation/topics/dpdk/bridge.rst @@ -226,6 +226,14 @@ stats associated with the datapath. Just like with the SIMD DPCLS feature above, SIMD can be applied to the DPIF to improve performance. +OVS provides multiple implementations of the DPIF. The available +implementations can be listed with the following command :: + + $ ovs-appctl dpif-netdev/dpif-impl-get + Available DPIF implementations: + dpif_scalar (pmds: none) + dpif_avx512 (pmds: 1,2,6,7) + By default, dpif_scalar is used. The DPIF implementation can be selected by name :: diff --git a/NEWS b/NEWS index 349718178..9e34027bf 100644 --- a/NEWS +++ b/NEWS @@ -20,6 +20,7 @@ Post-v2.15.0 * Refactor lib/dpif-netdev.c to multiple header files.
* Add avx512 implementation of dpif which can process non recirculated packets. It supports partial HWOL, EMC, SMC and DPCLS lookups. + * Add commands to get and set the dpif implementations. - ovs-ctl: * New option '--no-record-hostname' to disable hostname configuration in ovsdb on startup. diff --git a/lib/dpif-netdev-private-dpif.c b/lib/dpif-netdev-private-dpif.c index a05a82fa1..84d4ec156 100644 --- a/lib/dpif-netdev-private-dpif.c +++ b/lib/dpif-netdev-private-dpif.c @@ -79,6 +79,37 @@ dp_netdev_impl_get_default(void) return default_dpif_func; } +void +dp_netdev_impl_get(struct ds *reply, struct dp_netdev_pmd_thread **pmd_list, + size_t n) +{ + /* Add all dpif functions to reply string. */ + ds_put_cstr(reply, "Available DPIF implementations:\n"); + + for (uint32_t i = 0; i < ARRAY_SIZE(dpif_impls); i++) { + ds_put_format(reply, " %s (pmds: ", dpif_impls[i].name); + + for (size_t j = 0; j < n; j++) { + struct dp_netdev_pmd_thread *pmd = pmd_list[j]; + if (pmd->core_id == NON_PMD_CORE_ID) { + continue; + } + + if (pmd->netdev_input_func == dpif_impls[i].input_func) { + ds_put_format(reply, "%u,", pmd->core_id); + } + } + + ds_chomp(reply, ','); + + if (ds_last(reply) == ' ') { + ds_put_cstr(reply, "none"); + } + + ds_put_cstr(reply, ")\n"); + } +} + /* This function checks all available DPIF implementations, and selects the * returns the function pointer to the one requested by "name". */ diff --git a/lib/dpif-netdev-private-dpif.h b/lib/dpif-netdev-private-dpif.h index 7880647ad..0da639c55 100644 --- a/lib/dpif-netdev-private-dpif.h +++ b/lib/dpif-netdev-private-dpif.h @@ -22,6 +22,7 @@ /* Forward declarations to avoid including files. */ struct dp_netdev_pmd_thread; struct dp_packet_batch; +struct ds; /* Typedef for DPIF functions. * Returns whether all packets were processed successfully. @@ -48,6 +49,11 @@ struct dpif_netdev_impl_info_t { const char *name; }; +/* This function returns all available implementations to the caller. 
*/ +void +dp_netdev_impl_get(struct ds *reply, struct dp_netdev_pmd_thread **pmd_list, + size_t n); + /* Returns the default DPIF which is first ./configure selected, but can be * overridden at runtime. */ dp_netdev_input_func dp_netdev_impl_get_default(void); diff --git a/lib/dpif-netdev-unixctl.man b/lib/dpif-netdev-unixctl.man index 76cc949f9..5f9256215 100644 --- a/lib/dpif-netdev-unixctl.man +++ b/lib/dpif-netdev-unixctl.man @@ -227,5 +227,8 @@ When this is the case, the above command prints the load-balancing information of the bonds configured in datapath \fIdp\fR showing the interface associated with each bucket (hash). . +.IP "\fBdpif-netdev/dpif-impl-get\fR +Lists the DPIF implementations that are available. +. .IP "\fBdpif-netdev/dpif-impl-set\fR \fIdpif_impl\fR" Sets the DPIF to be used to \fIdpif_impl\fR. By default "dpif_scalar" is used. diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 8ef518994..8dfbdef15 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -965,6 +965,29 @@ dpif_netdev_subtable_lookup_set(struct unixctl_conn *conn, int argc OVS_UNUSED, ds_destroy(&reply); } +static void +dpif_netdev_impl_get(struct unixctl_conn *conn, int argc OVS_UNUSED, + const char *argv[] OVS_UNUSED, void *aux OVS_UNUSED) +{ + struct ds reply = DS_EMPTY_INITIALIZER; + struct shash_node *node; + + ovs_mutex_lock(&dp_netdev_mutex); + SHASH_FOR_EACH (node, &dp_netdevs) { + struct dp_netdev_pmd_thread **pmd_list; + struct dp_netdev *dp = node->data; + size_t n; + + /* Get PMD threads list, required to get the DPIF impl used by each PMD + * thread. 
*/ + sorted_poll_thread_list(dp, &pmd_list, &n); + dp_netdev_impl_get(&reply, pmd_list, n); + } + ovs_mutex_unlock(&dp_netdev_mutex); + unixctl_command_reply(conn, ds_cstr(&reply)); + ds_destroy(&reply); +} + static void dpif_netdev_impl_set(struct unixctl_conn *conn, int argc OVS_UNUSED, const char *argv[], void *aux OVS_UNUSED) @@ -1251,6 +1274,9 @@ dpif_netdev_init(void) "dpif_implementation_name", 1, 1, dpif_netdev_impl_set, NULL); + unixctl_command_register("dpif-netdev/dpif-impl-get", "", + 0, 0, dpif_netdev_impl_get, + NULL); return 0; }

From patchwork Thu Jul 8 14:02:36 2021
X-Patchwork-Submitter: "Ferriter, Cian"
X-Patchwork-Id: 1502308
From: Cian Ferriter
To: ovs-dev@openvswitch.org
Date: Thu, 8 Jul 2021 15:02:36 +0100
Message-Id: <20210708140240.61172-7-cian.ferriter@intel.com>
In-Reply-To: <20210708140240.61172-1-cian.ferriter@intel.com>
Cc: i.maximets@ovn.org, fbl@sysclose.org
Subject: [ovs-dev] [v15 06/10] dpif-netdev: Add a partial HWOL PMD statistic.

It is possible for packets traversing the userspace datapath to match a flow before hitting on EMC by using a mark ID provided by a NIC. Add a PMD statistic for this hit.

Signed-off-by: Cian Ferriter
Acked-by: Flavio Leitner
---
Cc: Gaetan Rivet
Cc: Sriharsha Basavapatna
v14: - Added Flavio's Acked-by tag.
v13: - Minor refactoring to address review comments. - Update manpages to reflect the new format of the pmd-perf-show command.
---
NEWS | 2 ++ lib/dpif-netdev-avx512.c | 3 +++ lib/dpif-netdev-perf.c | 3 +++ lib/dpif-netdev-perf.h | 1 + lib/dpif-netdev-unixctl.man | 1 + lib/dpif-netdev.c | 9 +++++++-- tests/pmd.at | 6 ++++-- 7 files changed, 21 insertions(+), 4 deletions(-) diff --git a/NEWS b/NEWS index 9e34027bf..d39b0ddf9 100644 --- a/NEWS +++ b/NEWS @@ -21,6 +21,8 @@ Post-v2.15.0 * Add avx512 implementation of dpif which can process non recirculated packets. It supports partial HWOL, EMC, SMC and DPCLS lookups. * Add commands to get and set the dpif implementations. + * Add a partial HWOL PMD statistic counting hits similar to existing + EMC/SMC/DPCLS stats. - ovs-ctl: * New option '--no-record-hostname' to disable hostname configuration in ovsdb on startup. diff --git a/lib/dpif-netdev-avx512.c b/lib/dpif-netdev-avx512.c index 1ae66ca6c..6f9aa8284 100644 --- a/lib/dpif-netdev-avx512.c +++ b/lib/dpif-netdev-avx512.c @@ -127,6 +127,7 @@ dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd, uint32_t emc_hits = 0; uint32_t smc_hits = 0; + uint32_t phwol_hits = 0; /* A 1 bit in this mask indicates a hit, so no DPCLS lookup on the pkt.
*/ uint32_t hwol_emc_smc_hitmask = 0; @@ -178,6 +179,7 @@ dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd, rules[i] = &f->cr; pkt_meta[i].tcp_flags = parse_tcp_flags(packet); pkt_meta[i].bytes = dp_packet_size(packet); + phwol_hits++; hwol_emc_smc_hitmask |= (1 << i); continue; } @@ -286,6 +288,7 @@ dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd, /* At this point we don't return error anymore, so commit stats here. */ pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_RECV, batch_size); + pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_PHWOL_HIT, phwol_hits); pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT, emc_hits); pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT, smc_hits); pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT, diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index 9560e7c3c..7103a2d4d 100644 --- a/lib/dpif-netdev-perf.c +++ b/lib/dpif-netdev-perf.c @@ -246,6 +246,7 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s, ds_put_format(str, " Rx packets: %12"PRIu64" (%.0f Kpps, %.0f cycles/pkt)\n" " Datapath passes: %12"PRIu64" (%.2f passes/pkt)\n" + " - PHWOL hits: %12"PRIu64" (%5.1f %%)\n" " - EMC hits: %12"PRIu64" (%5.1f %%)\n" " - SMC hits: %12"PRIu64" (%5.1f %%)\n" " - Megaflow hits: %12"PRIu64" (%5.1f %%, %.2f " @@ -255,6 +256,8 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s, rx_packets, (rx_packets / duration) / 1000, 1.0 * stats[PMD_CYCLES_ITER_BUSY] / rx_packets, passes, rx_packets ? 1.0 * passes / rx_packets : 0, + stats[PMD_STAT_PHWOL_HIT], + 100.0 * stats[PMD_STAT_PHWOL_HIT] / passes, stats[PMD_STAT_EXACT_HIT], 100.0 * stats[PMD_STAT_EXACT_HIT] / passes, stats[PMD_STAT_SMC_HIT], diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h index 72645b6b3..8b1a52387 100644 --- a/lib/dpif-netdev-perf.h +++ b/lib/dpif-netdev-perf.h @@ -56,6 +56,7 @@ extern "C" { /* Set of counter types maintained in pmd_perf_stats. 
*/ enum pmd_stat_type { + PMD_STAT_PHWOL_HIT, /* Packets that had a partial HWOL hit (phwol). */ PMD_STAT_EXACT_HIT, /* Packets that had an exact match (emc). */ PMD_STAT_SMC_HIT, /* Packets that had a sig match hit (SMC). */ PMD_STAT_MASKED_HIT, /* Packets that matched in the flow table. */ diff --git a/lib/dpif-netdev-unixctl.man b/lib/dpif-netdev-unixctl.man index 5f9256215..83ce4f1c5 100644 --- a/lib/dpif-netdev-unixctl.man +++ b/lib/dpif-netdev-unixctl.man @@ -135,6 +135,7 @@ pmd thread numa_id 0 core_id 1: - busy iterations: 86009 ( 84.1 % of used cycles) Rx packets: 2399607 (2381 Kpps, 848 cycles/pkt) Datapath passes: 3599415 (1.50 passes/pkt) + - PHWOL hits: 0 ( 0.0 %) - EMC hits: 336472 ( 9.3 %) - SMC hits: 0 ( 0.0 %) - Megaflow hits: 3262943 ( 90.7 %, 1.00 subtbl lookups/hit) diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 8dfbdef15..7d36b71fb 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -646,6 +646,7 @@ pmd_info_show_stats(struct ds *reply, " packets received: %"PRIu64"\n" " packet recirculations: %"PRIu64"\n" " avg. datapath passes per packet: %.02f\n" + " phwol hits: %"PRIu64"\n" " emc hits: %"PRIu64"\n" " smc hits: %"PRIu64"\n" " megaflow hits: %"PRIu64"\n" @@ -654,7 +655,8 @@ pmd_info_show_stats(struct ds *reply, " miss with failed upcall: %"PRIu64"\n" " avg. 
packets per output batch: %.02f\n", total_packets, stats[PMD_STAT_RECIRC], - passes_per_pkt, stats[PMD_STAT_EXACT_HIT], + passes_per_pkt, stats[PMD_STAT_PHWOL_HIT], + stats[PMD_STAT_EXACT_HIT], stats[PMD_STAT_SMC_HIT], stats[PMD_STAT_MASKED_HIT], lookups_per_hit, stats[PMD_STAT_MISS], stats[PMD_STAT_LOST], @@ -1683,6 +1685,7 @@ dpif_netdev_get_stats(const struct dpif *dpif, struct dpif_dp_stats *stats) CMAP_FOR_EACH (pmd, node, &dp->poll_threads) { stats->n_flows += cmap_count(&pmd->flow_table); pmd_perf_read_counters(&pmd->perf_stats, pmd_stats); + stats->n_hit += pmd_stats[PMD_STAT_PHWOL_HIT]; stats->n_hit += pmd_stats[PMD_STAT_EXACT_HIT]; stats->n_hit += pmd_stats[PMD_STAT_SMC_HIT]; stats->n_hit += pmd_stats[PMD_STAT_MASKED_HIT]; @@ -6805,7 +6808,7 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd, bool md_is_valid, odp_port_t port_no) { struct netdev_flow_key *key = &keys[0]; - size_t n_missed = 0, n_emc_hit = 0; + size_t n_missed = 0, n_emc_hit = 0, n_phwol_hit = 0; struct dfc_cache *cache = &pmd->flow_cache; struct dp_packet *packet; const size_t cnt = dp_packet_batch_size(packets_); @@ -6850,6 +6853,7 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd, } if (OVS_LIKELY(flow)) { tcp_flags = parse_tcp_flags(packet); + n_phwol_hit++; if (OVS_LIKELY(batch_enable)) { dp_netdev_queue_batches(packet, flow, tcp_flags, batches, n_batches); @@ -6912,6 +6916,7 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd, /* Count of packets which are not flow batched. 
*/ *n_flows = map_cnt; + pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_PHWOL_HIT, n_phwol_hit); pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT, n_emc_hit); if (!smc_enable_db) { diff --git a/tests/pmd.at b/tests/pmd.at index 9c5824c55..46d9ede5e 100644 --- a/tests/pmd.at +++ b/tests/pmd.at @@ -202,11 +202,12 @@ dummy@ovs-dummy: hit:0 missed:0 p0 7/1: (dummy-pmd: configured_rx_queues=4, configured_tx_queues=, requested_rx_queues=4, requested_tx_queues=) ]) -AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 9], [0], [dnl +AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 10], [0], [dnl pmd thread numa_id core_id : packets received: 0 packet recirculations: 0 avg. datapath passes per packet: 0.00 + phwol hits: 0 emc hits: 0 smc hits: 0 megaflow hits: 0 @@ -233,11 +234,12 @@ AT_CHECK([cat ovs-vswitchd.log | filter_flow_install | strip_xout], [0], [dnl recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(src=50:54:00:00:00:77,dst=50:54:00:00:01:78),eth_type(0x0800),ipv4(frag=no), actions: ]) -AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 9], [0], [dnl +AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 10], [0], [dnl pmd thread numa_id core_id : packets received: 20 packet recirculations: 0 avg. 
datapath passes per packet: 1.00 + phwol hits: 0 emc hits: 19 smc hits: 0 megaflow hits: 0

From patchwork Thu Jul 8 14:02:37 2021
X-Patchwork-Submitter: "Ferriter, Cian"
X-Patchwork-Id: 1502311
From: Cian Ferriter
To: ovs-dev@openvswitch.org
Date: Thu, 8 Jul 2021 15:02:37 +0100
Message-Id: <20210708140240.61172-8-cian.ferriter@intel.com>
In-Reply-To: <20210708140240.61172-1-cian.ferriter@intel.com>
Cc: i.maximets@ovn.org, fbl@sysclose.org
Subject: [ovs-dev] [v15 07/10] dpif-netdev/dpcls-avx512: Enable 16 block processing.

From: Harry van Haaren

This commit implements larger subtable searches in avx512.
A limitation of the previous implementation was that up to 8 blocks of miniflow data could be matched on (so a subtable with 8 blocks was handled in avx512, but 9 blocks or more would fall back to scalar/generic). This limitation is removed in this patch, where up to 16 blocks of subtable data can be matched on.

From an implementation perspective, the key to enabling 16 blocks over 8 blocks was to do the bitmask calculation up front, and then use the pre-calculated bitmasks for 2x passes of the "blocks gather" routine. The bitmasks need to be shifted for k-mask usage in the upper (8-15) block range, but that is relatively trivial. This also helps in case expanding to 24 blocks is desired in future.

The implementation of the 2nd iteration to handle > 8 blocks is behind a conditional branch which checks the total number of bits. This helps the specialized versions of the function that have a miniflow fingerprint of 8 blocks or fewer, as the code can be statically stripped out of those functions. Specialized functions that do require more than 8 blocks have the branch removed and unconditionally execute the 2nd blocks gather routine. Lastly, the _any() flavour keeps the conditional branch; the branch predictor may occasionally mispredict, but over a burst it will likely get most packets correct (particularly towards the middle and end of a burst).

The code has been run with unit tests under autovalidation and passes all cases, and unit test coverage has been checked to ensure the 16 block code paths are executed.

Signed-off-by: Harry van Haaren
Acked-by: Flavio Leitner
---
v14: - Added Flavio's Acked-by tag.
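The two-pass scheme this commit message describes can be modelled in scalar C. The sketch below is illustrative only: match_lanes8() and match_blocks16() are hypothetical stand-ins for the AVX512 gather and compare passes, not OVS functions, and each loop iteration models one u64 lane of a ZMM register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one 8-lane gather/compare pass: returns a hit
 * bitmask for the lanes selected by lane_mask. */
static uint32_t
match_lanes8(const uint64_t *blocks, const uint64_t *key, uint32_t lane_mask)
{
    uint32_t hits = 0;
    for (int i = 0; i < 8; i++) {
        if ((lane_mask & (1u << i)) && blocks[i] == key[i]) {
            hits |= 1u << i;
        }
    }
    return hits;
}

/* Match up to 16 blocks by running the 8-lane routine twice, shifting the
 * pre-calculated bitmask for the upper (8-15) block range. The second pass
 * sits behind a branch on the total block count, mirroring the commit. */
static int
match_blocks16(const uint64_t *blocks, const uint64_t *key,
               uint32_t mf_bits_total)
{
    const uint32_t lane_mask = (1u << mf_bits_total) - 1;
    uint32_t res_mask = match_lanes8(blocks, key, lane_mask & 0xff);

    if (mf_bits_total > 8) {
        res_mask |= match_lanes8(&blocks[8], &key[8], lane_mask >> 8) << 8;
    }
    return res_mask == lane_mask;
}
```

With a 9-block fingerprint, lane_mask >> 8 leaves a single bit for the upper pass, analogous to the way the real code shifts its k-masks for the upper block range.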
v13: - Improve function comment including variable usage (Ian) - Comment scope bracket usage (Ian) --- NEWS | 1 + lib/dpif-netdev-lookup-avx512-gather.c | 218 ++++++++++++++++++------- 2 files changed, 162 insertions(+), 57 deletions(-) diff --git a/NEWS b/NEWS index d39b0ddf9..c2e7538c5 100644 --- a/NEWS +++ b/NEWS @@ -23,6 +23,7 @@ Post-v2.15.0 * Add commands to get and set the dpif implementations. * Add a partial HWOL PMD statistic counting hits similar to existing EMC/SMC/DPCLS stats. + * Enable AVX512 optimized DPCLS to search subtables with larger miniflows. - ovs-ctl: * New option '--no-record-hostname' to disable hostname configuration in ovsdb on startup. diff --git a/lib/dpif-netdev-lookup-avx512-gather.c b/lib/dpif-netdev-lookup-avx512-gather.c index 8fc1cdfa5..f1b320bb6 100644 --- a/lib/dpif-netdev-lookup-avx512-gather.c +++ b/lib/dpif-netdev-lookup-avx512-gather.c @@ -34,7 +34,21 @@ * AVX512 code at a time. */ #define NUM_U64_IN_ZMM_REG (8) -#define BLOCKS_CACHE_SIZE (NETDEV_MAX_BURST * NUM_U64_IN_ZMM_REG) + +/* This implementation of AVX512 gather allows up to 16 blocks of MF data to be + * present in the blocks_cache, hence the multiply by 2 in the blocks count. + */ +#define MF_BLOCKS_PER_PACKET (NUM_U64_IN_ZMM_REG * 2) + +/* Blocks cache size is the maximum number of miniflow blocks that this + * implementation of lookup can handle. + */ +#define BLOCKS_CACHE_SIZE (NETDEV_MAX_BURST * MF_BLOCKS_PER_PACKET) + +/* The gather instruction can handle a scale for the size of the items to + * gather. For uint64_t data, this scale is 8. 
+ */ +#define GATHER_SCALE_8 (8) VLOG_DEFINE_THIS_MODULE(dpif_lookup_avx512_gather); @@ -69,22 +83,98 @@ netdev_rule_matches_key(const struct dpcls_rule *rule, { const uint64_t *keyp = miniflow_get_values(&rule->flow.mf); const uint64_t *maskp = miniflow_get_values(&rule->mask->mf); - const uint32_t lane_mask = (1 << mf_bits_total) - 1; + const uint32_t lane_mask = (1ULL << mf_bits_total) - 1; /* Always load a full cache line from blocks_cache. Other loads must be * trimmed to the amount of data required for mf_bits_total blocks. */ - __m512i v_blocks = _mm512_loadu_si512(&block_cache[0]); - __m512i v_mask = _mm512_maskz_loadu_epi64(lane_mask, &maskp[0]); - __m512i v_key = _mm512_maskz_loadu_epi64(lane_mask, &keyp[0]); + uint32_t res_mask; - __m512i v_data = _mm512_and_si512(v_blocks, v_mask); - uint32_t res_mask = _mm512_mask_cmpeq_epi64_mask(lane_mask, v_data, v_key); + /* To avoid a loop, we have two iterations of a block of code here. + * Note the scope brackets { } are used to avoid accidental variable usage + * in the second iteration. + */ + { + __m512i v_blocks = _mm512_loadu_si512(&block_cache[0]); + __m512i v_mask = _mm512_maskz_loadu_epi64(lane_mask, &maskp[0]); + __m512i v_key = _mm512_maskz_loadu_epi64(lane_mask, &keyp[0]); + __m512i v_data = _mm512_and_si512(v_blocks, v_mask); + res_mask = _mm512_mask_cmpeq_epi64_mask(lane_mask, v_data, v_key); + } + + if (mf_bits_total > 8) { + uint32_t lane_mask_gt8 = lane_mask >> 8; + __m512i v_blocks = _mm512_loadu_si512(&block_cache[8]); + __m512i v_mask = _mm512_maskz_loadu_epi64(lane_mask_gt8, &maskp[8]); + __m512i v_key = _mm512_maskz_loadu_epi64(lane_mask_gt8, &keyp[8]); + __m512i v_data = _mm512_and_si512(v_blocks, v_mask); + uint32_t c = _mm512_mask_cmpeq_epi64_mask(lane_mask_gt8, v_data, + v_key); + res_mask |= (c << 8); + } - /* returns 1 assuming result of SIMD compare is all blocks. */ + /* Returns 1 assuming result of SIMD compare is all blocks matching. 
*/ return res_mask == lane_mask; } +/* Takes u0 and u1 inputs, and gathers the next 8 blocks to be stored + * contiguously into the blocks cache. Note that the pointers and bitmasks + * passed into this function must be incremented for handling next 8 blocks. + * + * Register contents on entry: + * v_u0: register with all u64 lanes filled with u0 bits. + * v_u1: register with all u64 lanes filled with u1 bits. + * pkt_blocks: pointer to packet blocks. + * tbl_blocks: pointer to table blocks. + * tbl_mf_masks: pointer to miniflow bitmasks for this subtable. + * u1_bcast_msk: bitmask of lanes where u1 bits are used. + * pkt_mf_u0_pop: population count of bits in u0 of the packet. + * zero_mask: bitmask of lanes to zero as packet doesn't have mf bits set. + * u64_lanes_mask: bitmask of lanes to process. + */ +static inline ALWAYS_INLINE __m512i +avx512_blocks_gather(__m512i v_u0, + __m512i v_u1, + const uint64_t *pkt_blocks, + const void *tbl_blocks, + const void *tbl_mf_masks, + __mmask64 u1_bcast_msk, + const uint64_t pkt_mf_u0_pop, + __mmask64 zero_mask, + __mmask64 u64_lanes_mask) +{ + /* Suggest to compiler to load tbl blocks ahead of gather(). */ + __m512i v_tbl_blocks = _mm512_maskz_loadu_epi64(u64_lanes_mask, + tbl_blocks); + + /* Blend u0 and u1 bits together for these 8 blocks. */ + __m512i v_pkt_bits = _mm512_mask_blend_epi64(u1_bcast_msk, v_u0, v_u1); + + /* Load pre-created tbl miniflow bitmasks, bitwise AND with them. */ + __m512i v_tbl_masks = _mm512_maskz_loadu_epi64(u64_lanes_mask, + tbl_mf_masks); + __m512i v_masks = _mm512_and_si512(v_pkt_bits, v_tbl_masks); + + /* Manual AVX512 popcount for u64 lanes. */ + __m512i v_popcnts = _mm512_popcnt_epi64_manual(v_masks); + + /* Add popcounts and offset for u1 bits. */ + __m512i v_idx_u0_offset = _mm512_maskz_set1_epi64(u1_bcast_msk, + pkt_mf_u0_pop); + __m512i v_indexes = _mm512_add_epi64(v_popcnts, v_idx_u0_offset); + + /* Gather u64 blocks from packet miniflow. 
*/ + __m512i v_zeros = _mm512_setzero_si512(); + __m512i v_blocks = _mm512_mask_i64gather_epi64(v_zeros, u64_lanes_mask, + v_indexes, pkt_blocks, + GATHER_SCALE_8); + + /* Mask pkt blocks with subtable blocks, k-mask to zero lanes. */ + __m512i v_masked_blocks = _mm512_maskz_and_epi64(zero_mask, v_blocks, + v_tbl_blocks); + return v_masked_blocks; +} + static inline uint32_t ALWAYS_INLINE avx512_lookup_impl(struct dpcls_subtable *subtable, uint32_t keys_map, @@ -94,76 +184,86 @@ avx512_lookup_impl(struct dpcls_subtable *subtable, const uint32_t bit_count_u1) { OVS_ALIGNED_VAR(CACHE_LINE_SIZE)uint64_t block_cache[BLOCKS_CACHE_SIZE]; - - const uint32_t bit_count_total = bit_count_u0 + bit_count_u1; - int i; uint32_t hashes[NETDEV_MAX_BURST]; + const uint32_t n_pkts = __builtin_popcountll(keys_map); ovs_assert(NETDEV_MAX_BURST >= n_pkts); + const uint32_t bit_count_total = bit_count_u0 + bit_count_u1; + const uint64_t bit_count_total_mask = (1ULL << bit_count_total) - 1; + const uint64_t tbl_u0 = subtable->mask.mf.map.bits[0]; const uint64_t tbl_u1 = subtable->mask.mf.map.bits[1]; - /* Load subtable blocks for masking later. */ const uint64_t *tbl_blocks = miniflow_get_values(&subtable->mask.mf); - const __m512i v_tbl_blocks = _mm512_loadu_si512(&tbl_blocks[0]); - - /* Load pre-created subtable masks for each block in subtable. */ - const __mmask8 bit_count_total_mask = (1 << bit_count_total) - 1; - const __m512i v_mf_masks = _mm512_maskz_loadu_epi64(bit_count_total_mask, - subtable->mf_masks); + const uint64_t *tbl_mf_masks = subtable->mf_masks; + int i; ULLONG_FOR_EACH_1 (i, keys_map) { + /* Create mask register with packet-specific u0 offset. + * Note that as 16 blocks can be handled in total, the width of the + * mask register must be >=16. + */ const uint64_t pkt_mf_u0_bits = keys[i]->mf.map.bits[0]; const uint64_t pkt_mf_u0_pop = __builtin_popcountll(pkt_mf_u0_bits); - - /* Pre-create register with *PER PACKET* u0 offset. 
*/ - const __mmask8 u1_bcast_mask = (UINT8_MAX << bit_count_u0); - const __m512i v_idx_u0_offset = _mm512_maskz_set1_epi64(u1_bcast_mask, - pkt_mf_u0_pop); + const __mmask64 u1_bcast_mask = (UINT64_MAX << bit_count_u0); /* Broadcast u0, u1 bitmasks to 8x u64 lanes. */ - __m512i v_u0 = _mm512_set1_epi64(pkt_mf_u0_bits); - __m512i v_pkt_bits = _mm512_mask_set1_epi64(v_u0, u1_bcast_mask, - keys[i]->mf.map.bits[1]); - - /* Bitmask by pre-created masks. */ - __m512i v_masks = _mm512_and_si512(v_pkt_bits, v_mf_masks); - - /* Manual AVX512 popcount for u64 lanes. */ - __m512i v_popcnts = _mm512_popcnt_epi64_manual(v_masks); - - /* Offset popcounts for u1 with pre-created offset register. */ - __m512i v_indexes = _mm512_add_epi64(v_popcnts, v_idx_u0_offset); - - /* Gather u64 blocks from packet miniflow. */ - const __m512i v_zeros = _mm512_setzero_si512(); - const void *pkt_data = miniflow_get_values(&keys[i]->mf); - __m512i v_all_blocks = _mm512_mask_i64gather_epi64(v_zeros, - bit_count_total_mask, v_indexes, - pkt_data, 8); + __m512i v_u0 = _mm512_set1_epi64(keys[i]->mf.map.bits[0]); + __m512i v_u1 = _mm512_set1_epi64(keys[i]->mf.map.bits[1]); /* Zero out bits that pkt doesn't have: * - 2x pext() to extract bits from packet miniflow as needed by TBL * - Shift u1 over by bit_count of u0, OR to create zero bitmask */ - uint64_t u0_to_zero = _pext_u64(keys[i]->mf.map.bits[0], tbl_u0); - uint64_t u1_to_zero = _pext_u64(keys[i]->mf.map.bits[1], tbl_u1); - uint64_t zero_mask = (u1_to_zero << bit_count_u0) | u0_to_zero; - - /* Mask blocks using AND with subtable blocks, use k-mask to zero - * where lanes as required for this packet. 
- */ - __m512i v_masked_blocks = _mm512_maskz_and_epi64(zero_mask, - v_all_blocks, v_tbl_blocks); + uint64_t u0_to_zero = _pext_u64(keys[i]->mf.map.bits[0], tbl_u0); + uint64_t u1_to_zero = _pext_u64(keys[i]->mf.map.bits[1], tbl_u1); + const uint64_t zero_mask_wip = (u1_to_zero << bit_count_u0) | + u0_to_zero; + const uint64_t zero_mask = zero_mask_wip & bit_count_total_mask; + + /* Get ptr to packet data blocks. */ + const uint64_t *pkt_blocks = miniflow_get_values(&keys[i]->mf); + + /* Store first 8 blocks cache, full cache line aligned. */ + __m512i v_blocks = avx512_blocks_gather(v_u0, v_u1, + &pkt_blocks[0], + &tbl_blocks[0], + &tbl_mf_masks[0], + u1_bcast_mask, + pkt_mf_u0_pop, + zero_mask, + bit_count_total_mask); + _mm512_storeu_si512(&block_cache[i * MF_BLOCKS_PER_PACKET], v_blocks); + + if (bit_count_total > 8) { + /* Shift masks over by 8. + * Pkt blocks pointer remains 0, it is incremented by popcount. + * Move tbl and mf masks pointers forward. + * Increase offsets by 8. + * Re-run same gather code. + */ + uint64_t zero_mask_gt8 = (zero_mask >> 8); + uint64_t u1_bcast_mask_gt8 = (u1_bcast_mask >> 8); + uint64_t bit_count_gt8_mask = bit_count_total_mask >> 8; + + __m512i v_blocks_gt8 = avx512_blocks_gather(v_u0, v_u1, + &pkt_blocks[0], + &tbl_blocks[8], + &tbl_mf_masks[8], + u1_bcast_mask_gt8, + pkt_mf_u0_pop, + zero_mask_gt8, + bit_count_gt8_mask); + _mm512_storeu_si512(&block_cache[(i * MF_BLOCKS_PER_PACKET) + 8], + v_blocks_gt8); + } - /* Store to blocks cache, full cache line aligned. */ - _mm512_storeu_si512(&block_cache[i * 8], v_masked_blocks); } /* Hash the now linearized blocks of packet metadata. 
*/ ULLONG_FOR_EACH_1 (i, keys_map) { - uint64_t *block_ptr = &block_cache[i * 8]; + uint64_t *block_ptr = &block_cache[i * MF_BLOCKS_PER_PACKET]; uint32_t hash = hash_add_words64(0, block_ptr, bit_count_total); hashes[i] = hash_finish(hash, bit_count_total * 8); } @@ -183,7 +283,7 @@ avx512_lookup_impl(struct dpcls_subtable *subtable, struct dpcls_rule *rule; CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) { - const uint32_t cidx = i * 8; + const uint32_t cidx = i * MF_BLOCKS_PER_PACKET; uint32_t match = netdev_rule_matches_key(rule, bit_count_total, &block_cache[cidx]); if (OVS_LIKELY(match)) { @@ -220,7 +320,7 @@ DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 0) /* Check if a specialized function is valid for the required subtable. */ #define CHECK_LOOKUP_FUNCTION(U0, U1) \ - ovs_assert((U0 + U1) <= NUM_U64_IN_ZMM_REG); \ + ovs_assert((U0 + U1) <= (NUM_U64_IN_ZMM_REG * 2)); \ if (!f && u0_bits == U0 && u1_bits == U1) { \ f = dpcls_avx512_gather_mf_##U0##_##U1; \ } @@ -250,7 +350,11 @@ dpcls_subtable_avx512_gather_probe(uint32_t u0_bits, uint32_t u1_bits) CHECK_LOOKUP_FUNCTION(4, 1); CHECK_LOOKUP_FUNCTION(4, 0); - if (!f && (u0_bits + u1_bits) < NUM_U64_IN_ZMM_REG) { + /* Check if the _any looping version of the code can perform this miniflow + * lookup. Performance gain may be less pronounced due to non-specialized + * hashing, however there is usually a good performance win overall. 
+ */ + if (!f && (u0_bits + u1_bits) < (NUM_U64_IN_ZMM_REG * 2)) { f = dpcls_avx512_gather_mf_any; VLOG_INFO("Using avx512_gather_mf_any for subtable (%d,%d)\n", u0_bits, u1_bits);

From patchwork Thu Jul 8 14:02:38 2021
X-Patchwork-Submitter: "Ferriter, Cian"
X-Patchwork-Id: 1502314
From: Cian Ferriter
To: ovs-dev@openvswitch.org
Date: Thu, 8 Jul 2021 15:02:38 +0100
Message-Id: <20210708140240.61172-9-cian.ferriter@intel.com>
In-Reply-To: <20210708140240.61172-1-cian.ferriter@intel.com>
Cc: i.maximets@ovn.org, fbl@sysclose.org
Subject: [ovs-dev] [v15 08/10] dpif-netdev/dpcls: Specialize more subtable signatures.

From: Harry van Haaren

This commit adds more subtables to be specialized.
The traffic pattern being matched here is VXLAN traffic, whose subtables
commonly have (5,3), (9,1) and (9,4) miniflow fingerprints.

Signed-off-by: Harry van Haaren
Acked-by: Flavio Leitner

---
v14: - Added Flavio's Acked-by tag.
v8:  - Add NEWS entry.
---
 NEWS                                   | 2 ++
 lib/dpif-netdev-lookup-avx512-gather.c | 6 ++++++
 lib/dpif-netdev-lookup-generic.c       | 6 ++++++
 3 files changed, 14 insertions(+)

diff --git a/NEWS b/NEWS
index c2e7538c5..7ad3463dd 100644
--- a/NEWS
+++ b/NEWS
@@ -24,6 +24,8 @@ Post-v2.15.0
      * Add a partial HWOL PMD statistic counting hits similar to existing
        EMC/SMC/DPCLS stats.
      * Enable AVX512 optimized DPCLS to search subtables with larger miniflows.
+     * Add more specialized DPCLS subtables to cover common rules, enhancing
+       the lookup performance.
    - ovs-ctl:
      * New option '--no-record-hostname' to disable hostname configuration
        in ovsdb on startup.

diff --git a/lib/dpif-netdev-lookup-avx512-gather.c b/lib/dpif-netdev-lookup-avx512-gather.c
index f1b320bb6..0b51ef9dc 100644
--- a/lib/dpif-netdev-lookup-avx512-gather.c
+++ b/lib/dpif-netdev-lookup-avx512-gather.c
@@ -314,6 +314,9 @@ avx512_lookup_impl(struct dpcls_subtable *subtable,
         return avx512_lookup_impl(subtable, keys_map, keys, rules, U0, U1); \
     } \

+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(9, 4)
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(9, 1)
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 3)
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 1)
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1)
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 0)
@@ -346,6 +349,9 @@ dpcls_subtable_avx512_gather_probe(uint32_t u0_bits, uint32_t u1_bits)
         return NULL;
     }

+    CHECK_LOOKUP_FUNCTION(9, 4);
+    CHECK_LOOKUP_FUNCTION(9, 1);
+    CHECK_LOOKUP_FUNCTION(5, 3);
     CHECK_LOOKUP_FUNCTION(5, 1);
     CHECK_LOOKUP_FUNCTION(4, 1);
     CHECK_LOOKUP_FUNCTION(4, 0);

diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
index e3b6be4b6..6c74ac3a1 100644
--- a/lib/dpif-netdev-lookup-generic.c
+++ b/lib/dpif-netdev-lookup-generic.c
@@ -282,6 +282,9 @@
dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
         return lookup_generic_impl(subtable, keys_map, keys, rules, U0, U1); \
     } \

+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(9, 4)
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(9, 1)
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 3)
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 1)
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1)
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 0)
@@ -303,6 +306,9 @@ dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
 {
     dpcls_subtable_lookup_func f = NULL;

+    CHECK_LOOKUP_FUNCTION(9, 4);
+    CHECK_LOOKUP_FUNCTION(9, 1);
+    CHECK_LOOKUP_FUNCTION(5, 3);
     CHECK_LOOKUP_FUNCTION(5, 1);
     CHECK_LOOKUP_FUNCTION(4, 1);
     CHECK_LOOKUP_FUNCTION(4, 0);

From patchwork Thu Jul 8 14:02:39 2021
X-Patchwork-Submitter: "Ferriter, Cian"
X-Patchwork-Id: 1502313
From: Cian Ferriter
To: ovs-dev@openvswitch.org
Date: Thu, 8 Jul 2021 15:02:39 +0100
Message-Id: <20210708140240.61172-10-cian.ferriter@intel.com>
In-Reply-To: <20210708140240.61172-1-cian.ferriter@intel.com>
References: <20210708140240.61172-1-cian.ferriter@intel.com>
Cc: i.maximets@ovn.org, fbl@sysclose.org
Subject: [ovs-dev] [v15 09/10] dpdk: Cache result of CPU ISA checks.

From: Harry van Haaren

As a small optimization, this patch caches the result of a CPU ISA check
from DPDK. Particularly when running the DPCLS autovalidator (which
repeatedly probes subtables), this reduces the number of CPU ISA lookups
made at the DPDK level. By caching them at the OVS (dpdk.c) level, the ISA
checks are still performed at runtime on the CPU that executes them, but
subsequent checks for the same ISA feature become much cheaper.

Signed-off-by: Harry van Haaren
Co-authored-by: Cian Ferriter
Signed-off-by: Cian Ferriter
Acked-by: Flavio Leitner

---
v14: - Added Flavio's Acked-by tag.
---
 lib/dpdk.c | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/lib/dpdk.c b/lib/dpdk.c
index 0c910092c..8807de54a 100644
--- a/lib/dpdk.c
+++ b/lib/dpdk.c
@@ -665,13 +665,33 @@ print_dpdk_version(void)
     puts(rte_version());
 }

+/* Avoid calling rte_cpu_get_flag_enabled() excessively, by caching the
+ * result of the call for each CPU flag in a static variable. To avoid
+ * allocating large numbers of static variables, use a uint8 as a bitfield.
+ * Note the macro must only return if the ISA check is done and available.
+ */
+#define ISA_CHECK_DONE_BIT (1 << 0)
+#define ISA_AVAILABLE_BIT  (1 << 1)
+
 #define CHECK_CPU_FEATURE(feature, name_str, RTE_CPUFLAG)                   \
     do {                                                                    \
         if (strncmp(feature, name_str, strlen(name_str)) == 0) {            \
-            int has_isa = rte_cpu_get_flag_enabled(RTE_CPUFLAG);            \
-            VLOG_DBG("CPU flag %s, available %s\n", name_str,               \
-                     has_isa ?
"yes" : "no"); \ - return true; \ + static uint8_t isa_check_##RTE_CPUFLAG; \ + int check = isa_check_##RTE_CPUFLAG & ISA_CHECK_DONE_BIT; \ + if (OVS_UNLIKELY(!check)) { \ + int has_isa = rte_cpu_get_flag_enabled(RTE_CPUFLAG); \ + VLOG_DBG("CPU flag %s, available %s\n", \ + name_str, has_isa ? "yes" : "no"); \ + isa_check_##RTE_CPUFLAG = ISA_CHECK_DONE_BIT; \ + if (has_isa) { \ + isa_check_##RTE_CPUFLAG |= ISA_AVAILABLE_BIT; \ + } \ + } \ + if (isa_check_##RTE_CPUFLAG & ISA_AVAILABLE_BIT) { \ + return true; \ + } else { \ + return false; \ + } \ } \ } while (0) From patchwork Thu Jul 8 14:02:40 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Ferriter, Cian" X-Patchwork-Id: 1502315 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=openvswitch.org (client-ip=140.211.166.137; helo=smtp4.osuosl.org; envelope-from=ovs-dev-bounces@openvswitch.org; receiver=) Received: from smtp4.osuosl.org (smtp4.osuosl.org [140.211.166.137]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4GLJ1D15tzz9sRf for ; Fri, 9 Jul 2021 00:03:44 +1000 (AEST) Received: from localhost (localhost [127.0.0.1]) by smtp4.osuosl.org (Postfix) with ESMTP id 5A1824229A; Thu, 8 Jul 2021 14:03:42 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Received: from smtp4.osuosl.org ([127.0.0.1]) by localhost (smtp4.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id ElVXsyxNEhZk; Thu, 8 Jul 2021 14:03:38 +0000 (UTC) Received: from lists.linuxfoundation.org (lf-lists.osuosl.org [IPv6:2605:bc80:3010:104::8cd3:938]) by smtp4.osuosl.org (Postfix) with ESMTPS id B7D25421E9; Thu, 8 Jul 2021 14:03:37 +0000 (UTC) Received: from 
From: Cian Ferriter
To: ovs-dev@openvswitch.org
Date: Thu, 8 Jul 2021 15:02:40 +0100
Message-Id: <20210708140240.61172-11-cian.ferriter@intel.com>
In-Reply-To: <20210708140240.61172-1-cian.ferriter@intel.com>
References: <20210708140240.61172-1-cian.ferriter@intel.com>
Cc: i.maximets@ovn.org, fbl@sysclose.org
Subject: [ovs-dev] [v15 10/10] dpcls-avx512: Enable avx512 vector popcount instruction.
From: Harry van Haaren

This commit enables the AVX512-VPOPCNTDQ vector popcount instruction.
This instruction is not available on every CPU that supports the AVX512-F
foundation ISA, hence it is enabled only when the additional VPOPCNTDQ ISA
check passes.

The vector popcount instruction is used instead of the AVX512 popcount
emulation code present in the AVX512-optimized DPCLS today. It provides
higher performance in the SIMD miniflow processing, as that requires a
popcount to calculate the miniflow block indexes.

Signed-off-by: Harry van Haaren
Acked-by: Flavio Leitner

---
v14: - Added Flavio's Acked-by tag.
v13: - Rebased and improved comment on use_vpop variable (Ian)
---
 NEWS                                   |  3 +
 lib/dpdk.c                             |  1 +
 lib/dpif-netdev-lookup-avx512-gather.c | 85 ++++++++++++++++++++------
 3 files changed, 71 insertions(+), 18 deletions(-)

diff --git a/NEWS b/NEWS
index 7ad3463dd..6944a3037 100644
--- a/NEWS
+++ b/NEWS
@@ -26,6 +26,9 @@ Post-v2.15.0
      * Enable AVX512 optimized DPCLS to search subtables with larger miniflows.
      * Add more specialized DPCLS subtables to cover common rules, enhancing
        the lookup performance.
+     * Enable the AVX512 DPCLS implementation to use VPOPCNT instruction if the
+       CPU supports it. This enhances performance by using the native vpopcount
+       instructions, instead of the emulated version of vpopcount.
    - ovs-ctl:
      * New option '--no-record-hostname' to disable hostname configuration
        in ovsdb on startup.

diff --git a/lib/dpdk.c b/lib/dpdk.c
index 8807de54a..9de2af58e 100644
--- a/lib/dpdk.c
+++ b/lib/dpdk.c
@@ -706,6 +706,7 @@ dpdk_get_cpu_has_isa(const char *arch, const char *feature)
 #if __x86_64__
     /* CPU flags only defined for the architecture that support it.
 */
     CHECK_CPU_FEATURE(feature, "avx512f", RTE_CPUFLAG_AVX512F);
+    CHECK_CPU_FEATURE(feature, "avx512vpopcntdq", RTE_CPUFLAG_AVX512VPOPCNTDQ);
     CHECK_CPU_FEATURE(feature, "bmi2", RTE_CPUFLAG_BMI2);
 #endif

diff --git a/lib/dpif-netdev-lookup-avx512-gather.c b/lib/dpif-netdev-lookup-avx512-gather.c
index 0b51ef9dc..bc359dc4a 100644
--- a/lib/dpif-netdev-lookup-avx512-gather.c
+++ b/lib/dpif-netdev-lookup-avx512-gather.c
@@ -53,6 +53,15 @@ VLOG_DEFINE_THIS_MODULE(dpif_lookup_avx512_gather);

+/* Wrapper function required to enable ISA. */
+static inline __m512i
+__attribute__((__target__("avx512vpopcntdq")))
+_mm512_popcnt_epi64_wrapper(__m512i v_in)
+{
+    return _mm512_popcnt_epi64(v_in);
+}
+
 static inline __m512i
 _mm512_popcnt_epi64_manual(__m512i v_in)
 {
@@ -131,6 +140,7 @@ netdev_rule_matches_key(const struct dpcls_rule *rule,
  * pkt_mf_u0_pop: population count of bits in u0 of the packet.
  * zero_mask: bitmask of lanes to zero as packet doesn't have mf bits set.
  * u64_lanes_mask: bitmask of lanes to process.
+ * use_vpop: compile-time constant indicating if VPOPCNT instruction allowed.
  */
 static inline ALWAYS_INLINE __m512i
 avx512_blocks_gather(__m512i v_u0,
@@ -141,7 +151,8 @@ avx512_blocks_gather(__m512i v_u0,
                      __mmask64 u1_bcast_msk,
                      const uint64_t pkt_mf_u0_pop,
                      __mmask64 zero_mask,
-                     __mmask64 u64_lanes_mask)
+                     __mmask64 u64_lanes_mask,
+                     const uint32_t use_vpop)
 {
     /* Suggest to compiler to load tbl blocks ahead of gather(). */
     __m512i v_tbl_blocks = _mm512_maskz_loadu_epi64(u64_lanes_mask,
@@ -155,8 +166,15 @@ avx512_blocks_gather(__m512i v_u0,
                                                  tbl_mf_masks);
     __m512i v_masks = _mm512_and_si512(v_pkt_bits, v_tbl_masks);

-    /* Manual AVX512 popcount for u64 lanes. */
-    __m512i v_popcnts = _mm512_popcnt_epi64_manual(v_masks);
+    /* Calculate AVX512 popcount for u64 lanes using the native instruction
+     * if available, or using emulation if not available.
+     */
+    __m512i v_popcnts;
+    if (use_vpop) {
+        v_popcnts = _mm512_popcnt_epi64_wrapper(v_masks);
+    } else {
+        v_popcnts = _mm512_popcnt_epi64_manual(v_masks);
+    }

     /* Add popcounts and offset for u1 bits. */
     __m512i v_idx_u0_offset = _mm512_maskz_set1_epi64(u1_bcast_msk,
@@ -181,7 +199,8 @@ avx512_lookup_impl(struct dpcls_subtable *subtable,
                    const struct netdev_flow_key *keys[],
                    struct dpcls_rule **rules,
                    const uint32_t bit_count_u0,
-                   const uint32_t bit_count_u1)
+                   const uint32_t bit_count_u1,
+                   const uint32_t use_vpop)
 {
     OVS_ALIGNED_VAR(CACHE_LINE_SIZE)uint64_t block_cache[BLOCKS_CACHE_SIZE];
     uint32_t hashes[NETDEV_MAX_BURST];
@@ -233,7 +252,8 @@ avx512_lookup_impl(struct dpcls_subtable *subtable,
                                            u1_bcast_mask,
                                            pkt_mf_u0_pop,
                                            zero_mask,
-                                           bit_count_total_mask);
+                                           bit_count_total_mask,
+                                           use_vpop);
         _mm512_storeu_si512(&block_cache[i * MF_BLOCKS_PER_PACKET], v_blocks);

         if (bit_count_total > 8) {
@@ -254,7 +274,8 @@ avx512_lookup_impl(struct dpcls_subtable *subtable,
                                                    u1_bcast_mask_gt8,
                                                    pkt_mf_u0_pop,
                                                    zero_mask_gt8,
-                                                   bit_count_gt8_mask);
+                                                   bit_count_gt8_mask,
+                                                   use_vpop);
             _mm512_storeu_si512(&block_cache[(i * MF_BLOCKS_PER_PACKET) + 8],
                                 v_blocks_gt8);
         }
@@ -303,7 +324,11 @@ avx512_lookup_impl(struct dpcls_subtable *subtable,
     return found_map;
 }

-/* Expand out specialized functions with U0 and U1 bit attributes. */
+/* Expand out specialized functions with U0 and U1 bit attributes. As the
+ * AVX512 vpopcnt instruction is not supported on all AVX512 capable CPUs,
+ * create two functions for each miniflow signature. This allows the runtime
+ * CPU detection in probe() to select the ideal implementation.
+ */
 #define DECLARE_OPTIMIZED_LOOKUP_FUNCTION(U0, U1)                             \
     static uint32_t                                                           \
     dpcls_avx512_gather_mf_##U0##_##U1(struct dpcls_subtable *subtable,       \
@@ -311,7 +336,20 @@ avx512_lookup_impl(struct dpcls_subtable *subtable,
                                        const struct netdev_flow_key *keys[],  \
                                        struct dpcls_rule **rules)             \
     {                                                                         \
-        return avx512_lookup_impl(subtable, keys_map, keys, rules, U0, U1);   \
+        const uint32_t use_vpop = 0;                                          \
+        return avx512_lookup_impl(subtable, keys_map, keys, rules,            \
+                                  U0, U1, use_vpop);                          \
+    }                                                                         \
+                                                                              \
+    static uint32_t __attribute__((__target__("avx512vpopcntdq")))            \
+    dpcls_avx512_gather_mf_##U0##_##U1##_vpop(struct dpcls_subtable *subtable,\
+                                       uint32_t keys_map,                     \
+                                       const struct netdev_flow_key *keys[],  \
+                                       struct dpcls_rule **rules)             \
+    {                                                                         \
+        const uint32_t use_vpop = 1;                                          \
+        return avx512_lookup_impl(subtable, keys_map, keys, rules,            \
+                                  U0, U1, use_vpop);                          \
     }                                                                         \

 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(9, 4)
@@ -321,11 +359,18 @@ DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 1)
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1)
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 0)

-/* Check if a specialized function is valid for the required subtable. */
-#define CHECK_LOOKUP_FUNCTION(U0, U1)                                         \
+/* Check if a specialized function is valid for the required subtable.
+ * The use_vpop variable is used to decide if the VPOPCNT instruction can be
+ * used or not.
+ */
+#define CHECK_LOOKUP_FUNCTION(U0, U1, use_vpop)                               \
     ovs_assert((U0 + U1) <= (NUM_U64_IN_ZMM_REG * 2));                        \
     if (!f && u0_bits == U0 && u1_bits == U1) {                               \
-        f = dpcls_avx512_gather_mf_##U0##_##U1;                               \
+        if (use_vpop) {                                                       \
+            f = dpcls_avx512_gather_mf_##U0##_##U1##_vpop;                    \
+        } else {                                                              \
+            f = dpcls_avx512_gather_mf_##U0##_##U1;                           \
+        }                                                                     \
     }

 static uint32_t
@@ -333,9 +378,11 @@ dpcls_avx512_gather_mf_any(struct dpcls_subtable *subtable, uint32_t keys_map,
                            const struct netdev_flow_key *keys[],
                            struct dpcls_rule **rules)
 {
+    const uint32_t use_vpop = 0;
     return avx512_lookup_impl(subtable, keys_map, keys, rules,
                               subtable->mf_bits_set_unit0,
-                              subtable->mf_bits_set_unit1);
+                              subtable->mf_bits_set_unit1,
+                              use_vpop);
 }

 dpcls_subtable_lookup_func
@@ -349,12 +396,14 @@ dpcls_subtable_avx512_gather_probe(uint32_t u0_bits, uint32_t u1_bits)
         return NULL;
     }

-    CHECK_LOOKUP_FUNCTION(9, 4);
-    CHECK_LOOKUP_FUNCTION(9, 1);
-    CHECK_LOOKUP_FUNCTION(5, 3);
-    CHECK_LOOKUP_FUNCTION(5, 1);
-    CHECK_LOOKUP_FUNCTION(4, 1);
-    CHECK_LOOKUP_FUNCTION(4, 0);
+    int use_vpop = dpdk_get_cpu_has_isa("x86_64", "avx512vpopcntdq");
+
+    CHECK_LOOKUP_FUNCTION(9, 4, use_vpop);
+    CHECK_LOOKUP_FUNCTION(9, 1, use_vpop);
+    CHECK_LOOKUP_FUNCTION(5, 3, use_vpop);
+    CHECK_LOOKUP_FUNCTION(5, 1, use_vpop);
+    CHECK_LOOKUP_FUNCTION(4, 1, use_vpop);
+    CHECK_LOOKUP_FUNCTION(4, 0, use_vpop);

     /* Check if the _any looping version of the code can perform this miniflow
      * lookup. Performance gain may be less pronounced due to non-specialized