From patchwork Tue Jul  9 12:34:39 2019
From: Harry van Haaren <harry.van.haaren@intel.com>
To: dev@openvswitch.org
Cc: i.maximets@samsung.com
Date: Tue, 9 Jul 2019 13:34:39 +0100
Message-Id: <20190709123440.45519-5-harry.van.haaren@intel.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20190709123440.45519-1-harry.van.haaren@intel.com>
References: <20190508151321.39231-1-harry.van.haaren@intel.com>
 <20190709123440.45519-1-harry.van.haaren@intel.com>
Subject: [ovs-dev] [PATCH v10 4/5] dpif-netdev: refactor generic
 implementation

This commit refactors the generic implementation. The goal of this
refactor is to simplify the code to enable "specialization" of the
functions at compile time. Given compile-time constants, the compiler is
able to unroll loops and create optimized code sequences due to
compile-time knowledge of the loop-trip counts. In order to enable these
compiler optimizations, we must refactor the code to pass the loop-trip
counts to functions as compile-time constants.
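
To illustrate the idea (a standalone sketch with hypothetical names, not
code from this patch): an always-inlined implementation that takes its
loop-trip count as a parameter can be specialized by thin wrappers that
pass a literal count, while a generic wrapper keeps the runtime path.

    #include <stdint.h>

    /* Force inlining so each caller's 'count' propagates as a constant. */
    static inline __attribute__((always_inline)) uint64_t
    accumulate_impl(const uint64_t *blocks, uint32_t count)
    {
        uint64_t acc = 0;
        for (uint32_t i = 0; i < count; i++) {
            acc += blocks[i];
        }
        return acc;
    }

    /* Specialized: 'count' is the literal 4, so the compiler can fully
     * unroll the loop. */
    uint64_t accumulate_4(const uint64_t *blocks)
    {
        return accumulate_impl(blocks, 4);
    }

    /* Generic: 'count' is only known at runtime, so the loop remains. */
    uint64_t accumulate_n(const uint64_t *blocks, uint32_t count)
    {
        return accumulate_impl(blocks, count);
    }
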
This patch allows the number of miniflow-bits set per "unit" in the
miniflow to be passed around as a function argument. Note that this patch
does NOT yet take advantage of doing so; it is only a refactor to enable
that in the next patches.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Tested-by: Malvika Gupta

---
v10:
- Rebase updates from previous patches
- Fix whitespace indentation of func params
- Removed restrict keyword, as Windows CI fails when it is used (Ian)
- Fix integer 0 used to set NULL pointer (Ilya)
- Postpone free() call on cls->blocks_scratch (Ilya)
- Fix indentation of a function

v9:
- Use count_1bits in favour of __builtin_popcount (Ilya)
- Use ALWAYS_INLINE instead of __attribute__ syntax (Ilya)

v8:
- Rework block_cache and mf_masks to avoid variable-length arrays due to
  compiler issues. Provisioning for the worst case is not a good solution
  due to the magnitude of over-provisioning required.
- Rework netdev_flatten function, removing an unused parameter

---
 lib/dpif-netdev-lookup-generic.c | 239 ++++++++++++++++++++++++-------
 lib/dpif-netdev.c                |  77 +++++++++-
 lib/dpif-netdev.h                |  18 +++
 3 files changed, 281 insertions(+), 53 deletions(-)

diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
index d49d4b570..432d8782e 100644
--- a/lib/dpif-netdev-lookup-generic.c
+++ b/lib/dpif-netdev-lookup-generic.c
@@ -29,67 +29,204 @@
 #include "packets.h"
 #include "pvector.h"
 
-/* Returns a hash value for the bits of 'key' where there are 1-bits in
- * 'mask'. */
-static inline uint32_t
-netdev_flow_key_hash_in_mask(const struct netdev_flow_key *key,
-                             const struct netdev_flow_key *mask)
+VLOG_DEFINE_THIS_MODULE(dpif_lookup_generic);
+
+/* netdev_flow_key_flatten_unit:
+ * Given a packet, table and mf_masks, this function iterates over each bit
+ * set in the subtable, and calculates the appropriate metadata to store in
+ * blocks_scratch[].
+ *
+ * The contents of blocks_scratch[] can be used for hashing, and later for
+ * verifying whether a rule matches the given packet.
+ */
+static inline void
+netdev_flow_key_flatten_unit(const uint64_t *pkt_blocks,
+                             const uint64_t *tbl_blocks,
+                             const uint64_t *mf_masks,
+                             uint64_t *blocks_scratch,
+                             const uint64_t pkt_mf_bits,
+                             const uint32_t count)
 {
-    const uint64_t *p = miniflow_get_values(&mask->mf);
-    uint32_t hash = 0;
-    uint64_t value;
+    uint32_t i;
+    for (i = 0; i < count; i++) {
+        uint64_t mf_mask = mf_masks[i];
+        /* Calculate the block index for the packet metadata. */
+        uint64_t idx_bits = mf_mask & pkt_mf_bits;
+        const uint32_t pkt_idx = count_1bits(idx_bits);
 
-    NETDEV_FLOW_KEY_FOR_EACH_IN_FLOWMAP (value, key, mask->mf.map) {
-        hash = hash_add64(hash, value & *p);
-        p++;
+        /* Check if the packet has the subtable miniflow bit set. If yes, the
+         * block at the above pkt_idx will be stored, otherwise it is masked
+         * out to zero.
+         */
+        uint64_t pkt_has_mf_bit = (mf_mask + 1) & pkt_mf_bits;
+        uint64_t no_bit = ((!pkt_has_mf_bit) > 0) - 1;
+
+        /* Mask the packet block by the table block, and mask to zero if the
+         * packet doesn't actually contain this block of metadata.
+         */
+        blocks_scratch[i] = pkt_blocks[pkt_idx] & tbl_blocks[i] & no_bit;
     }
+}
+
+/* netdev_flow_key_flatten:
+ * This function takes a packet and a subtable, and writes an array of
+ * uint64_t blocks. The blocks contain the metadata that the subtable matches
+ * on, in the same order as the subtable, allowing linear iteration over the
+ * blocks.
+ *
+ * To calculate the block contents, the netdev_flow_key_flatten_unit function
+ * is called twice, once for each "unit" of the miniflow. This call can be
+ * inlined by the compiler for performance.
+ *
+ * Note that the u0_count and u1_count variables can be compile-time
+ * constants, allowing the loop in the inlined flatten_unit() function to be
+ * unrolled at compile time, or possibly removed entirely. The compile-time
+ * optimizations enabled by this design improve performance.
+ */
+static inline void
+netdev_flow_key_flatten(const struct netdev_flow_key *key,
+                        const struct netdev_flow_key *mask,
+                        const uint64_t *mf_masks,
+                        uint64_t *blocks_scratch,
+                        const uint32_t u0_count,
+                        const uint32_t u1_count)
+{
+    /* Load mask from subtable, mask with packet mf, popcount to get idx. */
+    const uint64_t *pkt_blocks = miniflow_get_values(&key->mf);
+    const uint64_t *tbl_blocks = miniflow_get_values(&mask->mf);
+
+    /* Packet miniflow bits to be masked by the pre-calculated mf_masks. */
+    const uint64_t pkt_bits_u0 = key->mf.map.bits[0];
+    const uint32_t pkt_bits_u0_pop = count_1bits(pkt_bits_u0);
+    const uint64_t pkt_bits_u1 = key->mf.map.bits[1];
 
-    return hash_finish(hash, (p - miniflow_get_values(&mask->mf)) * 8);
+    /* Unit 0 flattening. */
+    netdev_flow_key_flatten_unit(&pkt_blocks[0],
+                                 &tbl_blocks[0],
+                                 &mf_masks[0],
+                                 &blocks_scratch[0],
+                                 pkt_bits_u0,
+                                 u0_count);
+
+    /* Unit 1 flattening:
+     * Move the pointers forward in the arrays based on u0 offsets. NOTE:
+     * 1) The pkt blocks are indexed by the actual popcount of u0, which is
+     *    NOT always the same as the number of bits set in the subtable.
+     * 2) mf_masks, tbl_blocks and blocks_scratch are all "flat" arrays, so
+     *    the index is always u0_count.
+     */
+    netdev_flow_key_flatten_unit(&pkt_blocks[pkt_bits_u0_pop],
+                                 &tbl_blocks[u0_count],
+                                 &mf_masks[u0_count],
+                                 &blocks_scratch[u0_count],
+                                 pkt_bits_u1,
+                                 u1_count);
+}
+
+static inline uint64_t
+netdev_rule_matches_key(const struct dpcls_rule *rule,
+                        const uint32_t mf_bits_total,
+                        const uint64_t *blocks_scratch)
+{
+    const uint64_t *keyp = miniflow_get_values(&rule->flow.mf);
+    const uint64_t *maskp = miniflow_get_values(&rule->mask->mf);
+
+    uint64_t not_match = 0;
+    for (int i = 0; i < mf_bits_total; i++) {
+        not_match |= (blocks_scratch[i] & maskp[i]) != keyp[i];
+    }
+
+    /* Invert the result to express a match as 1. */
+    return !not_match;
 }
 
+/* Const-propagation version of the function: note that the mf bits total and
+ * u0 counts are explicitly passed in here, while they are also available at
+ * runtime from the subtable pointer. By making them compile-time constants,
+ * we enable the compiler to unroll loops and flatten out code sequences based
+ * on knowledge of the mf_bits_* values. This results in improved performance.
+ */
+static inline uint32_t ALWAYS_INLINE
+lookup_generic_impl(struct dpcls_subtable *subtable,
+                    uint64_t *blocks_scratch,
+                    uint32_t keys_map,
+                    const struct netdev_flow_key *keys[],
+                    struct dpcls_rule **rules,
+                    const uint32_t bit_count_u0,
+                    const uint32_t bit_count_u1)
+{
+    const uint32_t n_pkts = count_1bits(keys_map);
+    ovs_assert(NETDEV_MAX_BURST >= n_pkts);
+    uint32_t hashes[NETDEV_MAX_BURST];
+
+    const uint32_t bit_count_total = bit_count_u0 + bit_count_u1;
+    uint64_t *mf_masks = subtable->mf_masks;
+    int i;
+
+    /* Flatten the packet metadata into blocks_scratch[] using the
+     * subtable. */
+    ULLONG_FOR_EACH_1(i, keys_map) {
+        netdev_flow_key_flatten(keys[i],
+                                &subtable->mask,
+                                mf_masks,
+                                &blocks_scratch[i * bit_count_total],
+                                bit_count_u0,
+                                bit_count_u1);
+    }
+
+    /* Hash the now-linearized blocks of packet metadata. */
+    ULLONG_FOR_EACH_1(i, keys_map) {
+        uint32_t hash = 0;
+        uint32_t i_off = i * bit_count_total;
+        for (int h = 0; h < bit_count_total; h++) {
+            hash = hash_add64(hash, blocks_scratch[i_off + h]);
+        }
+        hashes[i] = hash_finish(hash, bit_count_total * 8);
+    }
+
+    /* Lookup: this returns a bitmask of packets where the hash table had
+     * an entry for the given hash key. Presence of a hash key does not
+     * guarantee matching the key, as there can be hash collisions.
+     */
+    uint32_t found_map;
+    const struct cmap_node *nodes[NETDEV_MAX_BURST];
+    found_map = cmap_find_batch(&subtable->rules, keys_map, hashes, nodes);
+
+    /* Verify that the packet actually matched the rule. If not, a hash
+     * collision has taken place, so continue searching with the next node.
+     */
+    ULLONG_FOR_EACH_1(i, found_map) {
+        struct dpcls_rule *rule;
+
+        CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
+            const uint32_t cidx = i * bit_count_total;
+            uint32_t match = netdev_rule_matches_key(rule, bit_count_total,
+                                                     &blocks_scratch[cidx]);
+
+            if (OVS_LIKELY(match)) {
+                rules[i] = rule;
+                subtable->hit_cnt++;
+                goto next;
+            }
+        }
+
+        /* None of the found rules was a match. Clear the i-th bit to
+         * search for this key in the next subtable. */
+        ULLONG_SET0(found_map, i);
+    next:
+        ; /* Keep Sparse happy. */
+    }
+
+    return found_map;
+}
+
+/* Generic lookup - uses the runtime-provided mf bits. */
 uint32_t
 dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
+                              uint64_t *blocks_scratch,
                               uint32_t keys_map,
                               const struct netdev_flow_key *keys[],
                               struct dpcls_rule **rules)
 {
-    int i;
-    /* Compute hashes for the remaining keys.  Each search-key is
-     * masked with the subtable's mask to avoid hashing the wildcarded
-     * bits. */
-    uint32_t hashes[NETDEV_MAX_BURST];
-    ULLONG_FOR_EACH_1(i, keys_map) {
-        hashes[i] = netdev_flow_key_hash_in_mask(keys[i],
-                                                 &subtable->mask);
-    }
-
-    /* Lookup. */
-    const struct cmap_node *nodes[NETDEV_MAX_BURST];
-    uint32_t found_map =
-        cmap_find_batch(&subtable->rules, keys_map, hashes, nodes);
-    /* Check results.  When the i-th bit of found_map is set, it means
-     * that a set of nodes with a matching hash value was found for the
-     * i-th search-key.  Due to possible hash collisions we need to check
-     * which of the found rules, if any, really matches our masked
-     * search-key. */
-    ULLONG_FOR_EACH_1(i, found_map) {
-        struct dpcls_rule *rule;
-
-        CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
-            if (OVS_LIKELY(dpcls_rule_matches_key(rule, keys[i]))) {
-                rules[i] = rule;
-                /* Even at 20 Mpps the 32-bit hit_cnt cannot wrap
-                 * within one second optimization interval. */
-                subtable->hit_cnt++;
-                goto next;
-            }
-        }
-        /* None of the found rules was a match.  Reset the i-th bit to
-         * keep searching this key in the next subtable. */
-        ULLONG_SET0(found_map, i); /* Did not match. */
-    next:
-        ; /* Keep Sparse happy. */
-    }
-
-    return found_map;
+    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys,
+                               rules, subtable->mf_bits_set_unit0,
+                               subtable->mf_bits_set_unit1);
 }
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 190cc8918..03ab5267a 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -233,6 +233,15 @@ struct dpcls {
     odp_port_t in_port;
     struct cmap subtables_map;
     struct pvector subtables;
+
+    /* Region of memory for this DPCLS instance to use as scratch.
+     * The size is guaranteed to be large enough to hold all the blocks
+     * required for the subtables to match on. This allows each dpcls lookup
+     * to flatten the packet miniflows into this blocks_scratch area, without
+     * using variable-length arrays. This region is allocated on subtable
+     * create, and will be resized as required if a larger subtable is
+     * added. */
+    uint64_t *blocks_scratch;
+    uint32_t blocks_scratch_size;
 };
 
 /* Data structure to keep packet order till fastpath processing. */
@@ -7567,6 +7576,7 @@ static void
 dpcls_subtable_destroy_cb(struct dpcls_subtable *subtable)
 {
     cmap_destroy(&subtable->rules);
+    ovsrcu_postpone(free, subtable->mf_masks);
     ovsrcu_postpone(free, subtable);
 }
 
@@ -7577,6 +7587,8 @@ dpcls_init(struct dpcls *cls)
 {
     cmap_init(&cls->subtables_map);
     pvector_init(&cls->subtables);
+    cls->blocks_scratch = NULL;
+    cls->blocks_scratch_size = 0;
 }
 
 static void
@@ -7604,6 +7616,7 @@ dpcls_destroy(struct dpcls *cls)
         }
         cmap_destroy(&cls->subtables_map);
         pvector_destroy(&cls->subtables);
+        ovsrcu_postpone(free, cls->blocks_scratch);
     }
 }
 
@@ -7619,7 +7632,28 @@ dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
     subtable->hit_cnt = 0;
     netdev_flow_key_clone(&subtable->mask, mask);
 
-    /* Decide which hash/lookup/verify function to use. */
+    /* The count of bits in the mask defines the space required for masks.
+     * Then call gen_masks() to create the appropriate masks, avoiding the
+     * cost of doing the calculations at runtime. */
+    uint32_t unit0 = count_1bits(mask->mf.map.bits[0]);
+    uint32_t unit1 = count_1bits(mask->mf.map.bits[1]);
+    subtable->mf_bits_set_unit0 = unit0;
+    subtable->mf_bits_set_unit1 = unit1;
+
+    subtable->mf_masks = xmalloc(sizeof(uint64_t) * (unit0 + unit1));
+    netdev_flow_key_gen_masks(mask, subtable->mf_masks, unit0, unit1);
+
+    /* Allocate the blocks scratch space only if this subtable requires more
+     * space than is currently allocated. */
+    const uint32_t blocks_required_per_pkt = unit0 + unit1;
+    if (cls->blocks_scratch_size < blocks_required_per_pkt) {
+        free(cls->blocks_scratch);
+        cls->blocks_scratch = xmalloc(sizeof(uint64_t) * NETDEV_MAX_BURST *
+                                      blocks_required_per_pkt);
+        cls->blocks_scratch_size = blocks_required_per_pkt;
+    }
+
+    /* Assign the generic lookup - this works with any miniflow fingerprint.
+     */
     subtable->lookup_func = dpcls_subtable_lookup_generic;
 
     cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
@@ -7764,6 +7798,43 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
     }
 }
 
+/* Inner loop for mask generation of a unit, see netdev_flow_key_gen_masks. */
+static inline void
+netdev_flow_key_gen_mask_unit(uint64_t iter,
+                              const uint64_t count,
+                              uint64_t *mf_masks)
+{
+    int i;
+    for (i = 0; i < count; i++) {
+        uint64_t lowest_bit = (iter & -iter);
+        iter &= ~lowest_bit;
+        mf_masks[i] = (lowest_bit - 1);
+    }
+    /* Checks that the count has covered all bits in the iter bitmap. */
+    ovs_assert(iter == 0);
+}
+
+/* Generate a mask for each block in the miniflow, based on the bits set. This
+ * allows easily masking packets with the generated array, replacing the need
+ * to calculate the masks at runtime.
+ * @param tbl        The table to generate the mf_masks for
+ * @param mf_masks   Pointer to a u64 array of at least (mf_bits_u0 +
+ *                   mf_bits_u1) in size
+ * @param mf_bits_u0 Number of bits set in unit0 of the miniflow
+ * @param mf_bits_u1 Number of bits set in unit1 of the miniflow
+ */
+void
+netdev_flow_key_gen_masks(const struct netdev_flow_key *tbl,
+                          uint64_t *mf_masks,
+                          const uint32_t mf_bits_u0,
+                          const uint32_t mf_bits_u1)
+{
+    uint64_t iter_u0 = tbl->mf.map.bits[0];
+    uint64_t iter_u1 = tbl->mf.map.bits[1];
+
+    netdev_flow_key_gen_mask_unit(iter_u0, mf_bits_u0, &mf_masks[0]);
+    netdev_flow_key_gen_mask_unit(iter_u1, mf_bits_u1, &mf_masks[mf_bits_u0]);
+}
+
 /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
  * in 'mask' the values in 'key' and 'target' are the same. */
 bool
@@ -7804,6 +7875,7 @@ dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[],
     BUILD_ASSERT_DECL(MAP_BITS >= NETDEV_MAX_BURST);
 
     struct dpcls_subtable *subtable;
+    uint64_t *blocks_scratch = cls->blocks_scratch;
 
     uint32_t keys_map = TYPE_MAXIMUM(uint32_t); /* Set all bits. */
@@ -7824,7 +7896,8 @@ dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[],
      * non-overlapping. */
     PVECTOR_FOR_EACH (subtable, &cls->subtables) {
         /* Call the subtable specific lookup function. */
-        found_map = subtable->lookup_func(subtable, keys_map, keys, rules);
+        found_map = subtable->lookup_func(subtable, blocks_scratch, keys_map,
+                                          keys, rules);
 
         /* Count the number of subtables searched for this packet match. This
          * estimates the "spread" of subtables looked at per matched packet. */
diff --git a/lib/dpif-netdev.h b/lib/dpif-netdev.h
index 3c3cc65ef..51ed7e43e 100644
--- a/lib/dpif-netdev.h
+++ b/lib/dpif-netdev.h
@@ -68,6 +68,7 @@ struct dpcls_rule {
  */
 typedef
 uint32_t (*dpcls_subtable_lookup_func)(struct dpcls_subtable *subtable,
+                                       uint64_t *blocks_scratch,
                                        uint32_t keys_map,
                                        const struct netdev_flow_key *keys[],
                                        struct dpcls_rule **rules);
@@ -75,6 +76,7 @@ uint32_t (*dpcls_subtable_lookup_func)(struct dpcls_subtable *subtable,
 /* Prototype for generic lookup func, using same code path as before. */
 uint32_t
 dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
+                              uint64_t *blocks_scratch,
                               uint32_t keys_map,
                               const struct netdev_flow_key *keys[],
                               struct dpcls_rule **rules);
@@ -95,8 +97,18 @@ struct dpcls_subtable {
      * subtable matches on. The miniflow "bits" are used to select the actual
      * dpcls lookup implementation at subtable creation time. */
+    uint8_t mf_bits_set_unit0;
+    uint8_t mf_bits_set_unit1;
+
+    /* The lookup function to use for this subtable. If there is a known
+     * property of the subtable (e.g. only 3 bits of miniflow metadata are
+     * used for the lookup) then this can point at an optimized version of
+     * the lookup function for this particular subtable. */
     dpcls_subtable_lookup_func lookup_func;
 
+    /* Caches the masks to match a packet to, reducing runtime
+     * calculations. */
+    uint64_t *mf_masks;
+
     struct netdev_flow_key mask; /* Wildcards for fields (const). */
     /* 'mask' must be the last field, additional space is allocated here. */
 };
@@ -105,6 +117,12 @@ struct dpcls_subtable {
 #define NETDEV_FLOW_KEY_FOR_EACH_IN_FLOWMAP(VALUE, KEY, FLOWMAP)   \
     MINIFLOW_FOR_EACH_IN_FLOWMAP (VALUE, &(KEY)->mf, FLOWMAP)
 
+void
+netdev_flow_key_gen_masks(const struct netdev_flow_key *tbl,
+                          uint64_t *mf_masks,
+                          const uint32_t mf_bits_u0,
+                          const uint32_t mf_bits_u1);
+
 bool dpcls_rule_matches_key(const struct dpcls_rule *rule,
                             const struct netdev_flow_key *target);
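
For readers following the mf_masks logic: the trick used by
netdev_flow_key_gen_masks() and netdev_flow_key_flatten_unit() above can
be demonstrated in isolation. The sketch below uses hypothetical names and
the GCC/Clang __builtin_popcountll in place of OVS's count_1bits; it is an
illustration, not code from the patch.

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* For each bit set in 'map' (lowest first), store a mask with 1s in all
     * positions BELOW that bit; mirrors netdev_flow_key_gen_mask_unit. */
    static void
    gen_mask_unit(uint64_t map, uint64_t *masks, uint32_t count)
    {
        for (uint32_t i = 0; i < count; i++) {
            uint64_t lowest_bit = map & -map; /* Isolate lowest set bit. */
            map &= ~lowest_bit;               /* Clear it for next round. */
            masks[i] = lowest_bit - 1;        /* All bits below it. */
        }
        assert(map == 0);
    }

    int
    main(void)
    {
        /* A toy subtable that matches on miniflow bits 1, 4 and 6. */
        uint64_t masks[3];
        gen_mask_unit(0x52, masks, 3); /* Binary 0101 0010: bits 1, 4, 6. */

        /* A packet with bits 1, 3 and 6 set stores its blocks densely, so
         * the block for bit 6 sits at index popcount(bits below 6) = 2. */
        uint64_t pkt_map = 0x4a; /* Binary 0100 1010: bits 1, 3, 6. */
        printf("block index for bit 6: %d\n",
               __builtin_popcountll(masks[2] & pkt_map)); /* Prints 2. */

        /* (mask + 1) isolates the bit itself: non-zero means the packet
         * actually carries this field, like pkt_has_mf_bit above. */
        printf("bit 6 present: %d\n",
               (masks[2] + 1) & pkt_map ? 1 : 0);         /* Prints 1. */
        return 0;
    }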