From patchwork Thu Apr 11 12:56:04 2019
X-Patchwork-Submitter: "Van Haaren, Harry"
X-Patchwork-Id: 1084038
From: Harry van Haaren
To: ovs-dev@openvswitch.org
Cc: i.maximets@samsung.com
Date: Thu, 11 Apr 2019 13:56:04 +0100
Message-Id: <20190411125604.70050-6-harry.van.haaren@intel.com>
In-Reply-To: <20190411125604.70050-1-harry.van.haaren@intel.com>
References: <20190411125604.70050-1-harry.van.haaren@intel.com>
Subject: [ovs-dev] [PATCH v7 5/5] dpif-netdev: add specialized generic
 scalar functions

This commit adds a number of specialized functions that handle common
miniflow fingerprints. This enables compiler optimization, resulting
in higher performance.

Below is a quick description of how this optimization actually works.

"Specialized functions" are "instances" of the generic implementation,
but the compiler is given extra context when compiling. In the case of
iterating miniflow data structures, the most interesting value for
enabling compile-time optimizations is the loop trip count per unit.
In order to create a specialized function, there is a generic
implementation, which uses a for() loop without the compiler knowing
the loop trip count at compile time. The loop trip count is passed in
as an argument to the function:

uint32_t miniflow_impl_generic(struct miniflow *mf, uint32_t loop_count)
{
    for (uint32_t i = 0; i < loop_count; i++) {
        /* do work */
    }
}

In order to "specialize" the function, we call the generic
implementation with hard-coded numbers - these are compile-time
constants!

uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t loop_count)
{
    /* Use a hard-coded constant for compile-time constant propagation. */
    return miniflow_impl_generic(mf, 5);
}

Given the compiler is aware of the loop trip count at compile time, it
can perform an optimization known as "constant propagation". Combined
with inlining of the miniflow_impl_generic() function, the compiler is
now able to unroll the loop 5x *at compile time* and produce "flat"
code.

The last step to using the specialized functions is to use a function
pointer to choose the specialized (or generic) implementation. The
function pointer is selected at subtable creation time, when the
miniflow fingerprint of the subtable is known. This technique is known
as "multiple dispatch" in some literature, as it uses multiple items of
information (the miniflow bit counts) to select the dispatch function.
By pointing the function pointer at the optimized implementation, OvS
benefits from the compile-time optimizations at runtime.

Signed-off-by: Harry van Haaren
---
 lib/dpif-netdev-lookup-generic.c | 77 ++++++++++++++++++++++++++++----
 lib/dpif-netdev.c                |  5 ++-
 lib/dpif-netdev.h                |  8 ++++
 3 files changed, 81 insertions(+), 9 deletions(-)

diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
index 770ef70d3..5393fa836 100644
--- a/lib/dpif-netdev-lookup-generic.c
+++ b/lib/dpif-netdev-lookup-generic.c
@@ -100,11 +100,11 @@ netdev_flow_key_flatten(const struct netdev_flow_key * restrict key,
 
     /* Unit 0 flattening */
     netdev_flow_key_flatten_unit(&pkt_blocks[0],
-                                &tbl_blocks[0],
-                                &mf_masks[0],
-                                &block_cache[0],
-                                pkt_bits_u0,
-                                u0_count);
+                                 &tbl_blocks[0],
+                                 &mf_masks[0],
+                                 &block_cache[0],
+                                 pkt_bits_u0,
+                                 u0_count);
 
     /* Unit 1 flattening:
      * Move the pointers forward in the arrays based on u0 offsets, NOTE:
@@ -275,7 +275,68 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
                               uint32_t keys_map,
                               const struct netdev_flow_key *keys[],
                               struct dpcls_rule **rules)
 {
-    return lookup_generic_impl(subtable, keys_map, keys, rules,
-                               subtable->mf_bits_set_unit0,
-                               subtable->mf_bits_set_unit1);
+    /* Here the runtime subtable->mf_bits counts are used, which forces the
+     * compiler to iterate normal for() loops. Due to this limitation in the
+     * compiler's available optimizations, this function has lower performance
+     * than the specialized functions below.
+     */
+    return lookup_generic_impl(subtable, keys_map, keys, rules,
+                               subtable->mf_bits_set_unit0,
+                               subtable->mf_bits_set_unit1);
+}
+
+static uint32_t
+dpcls_subtable_lookup_mf_u0w5_u1w1(struct dpcls_subtable *subtable,
+                                   uint32_t keys_map,
+                                   const struct netdev_flow_key *keys[],
+                                   struct dpcls_rule **rules)
+{
+    /* Hard-coded bit counts enable compile-time loop unrolling, and the
+     * generation of optimized code sequences from the unrolled loop.
+     */
+    return lookup_generic_impl(subtable, keys_map, keys, rules, 5, 1);
+}
+
+static uint32_t
+dpcls_subtable_lookup_mf_u0w4_u1w1(struct dpcls_subtable *subtable,
+                                   uint32_t keys_map,
+                                   const struct netdev_flow_key *keys[],
+                                   struct dpcls_rule **rules)
+{
+    return lookup_generic_impl(subtable, keys_map, keys, rules, 4, 1);
+}
+
+static uint32_t
+dpcls_subtable_lookup_mf_u0w4_u1w0(struct dpcls_subtable *subtable,
+                                   uint32_t keys_map,
+                                   const struct netdev_flow_key *keys[],
+                                   struct dpcls_rule **rules)
+{
+    return lookup_generic_impl(subtable, keys_map, keys, rules, 4, 0);
+}
+
+/* Probe function to look up an available specialized function.
+ * If capable of handling the requested miniflow fingerprint, this function
+ * returns the most optimal implementation for that fingerprint.
+ * @retval FunctionAddress A valid function to handle the miniflow bit pattern
+ * @retval NULL The requested miniflow fingerprint is not supported here
+ */
+dpcls_subtable_lookup_func
+dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
+{
+    dpcls_subtable_lookup_func f = NULL;
+
+    if (u0_bits == 5 && u1_bits == 1) {
+        f = dpcls_subtable_lookup_mf_u0w5_u1w1;
+    } else if (u0_bits == 4 && u1_bits == 1) {
+        f = dpcls_subtable_lookup_mf_u0w4_u1w1;
+    } else if (u0_bits == 4 && u1_bits == 0) {
+        f = dpcls_subtable_lookup_mf_u0w4_u1w0;
+    }
+
+    if (f) {
+        VLOG_INFO("Subtable using Generic Optimized for u0 %d, u1 %d",
+                  u0_bits, u1_bits);
+    }
+
+    return f;
+}
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 3bc826079..4c9586cc2 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -7596,7 +7596,10 @@ dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
     subtable->mf_bits_set_unit0 = unit0;
     subtable->mf_bits_set_unit1 = unit1;
 
-    subtable->lookup_func = dpcls_subtable_lookup_generic;
+    subtable->lookup_func = dpcls_subtable_generic_probe(unit0, unit1);
+    if (!subtable->lookup_func) {
+        subtable->lookup_func = dpcls_subtable_lookup_generic;
+    }
 
     cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
     /* Add the new subtable at the end of the pvector (with no hits yet) */
diff --git a/lib/dpif-netdev.h b/lib/dpif-netdev.h
index 0911fa93c..948461855 100644
--- a/lib/dpif-netdev.h
+++ b/lib/dpif-netdev.h
@@ -69,6 +69,14 @@ typedef uint32_t (*dpcls_subtable_lookup_func)(struct dpcls_subtable *subtable,
                                        uint32_t keys_map,
                                        const struct netdev_flow_key *keys[],
                                        struct dpcls_rule **rules);
+/* Probe function to select a specialized version of the generic lookup
+ * implementation. This provides a performance benefit due to compile-time
+ * optimizations such as loop unrolling. These are enabled by the
+ * compile-time constants in the specialized function implementations.
+ */
+dpcls_subtable_lookup_func
+dpcls_subtable_generic_probe(uint32_t u0_bit_count, uint32_t u1_bit_count);
+
 /* Prototype for generic lookup func, using same code path as before */
 uint32_t
 dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
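
For reference, the specialization pattern used by this patch can be
demonstrated outside of OvS. Below is a minimal, self-contained C
sketch; all names in it (struct table, sum_impl(), sum_probe(), etc.)
are hypothetical illustrations and are not part of this patch:

    #include <stdint.h>
    #include <stdio.h>

    struct table;
    typedef uint32_t (*sum_func)(const struct table *tbl);

    struct table {
        const uint32_t *blocks;
        uint32_t count;   /* runtime loop trip count */
        sum_func sum;     /* implementation chosen once, at setup time */
    };

    /* Generic implementation: trip count is a runtime argument, so the
     * compiler cannot unroll the loop here. */
    static inline uint32_t
    sum_impl(const struct table *tbl, uint32_t count)
    {
        uint32_t total = 0;
        for (uint32_t i = 0; i < count; i++) {
            total += tbl->blocks[i];
        }
        return total;
    }

    /* Generic wrapper: reads the trip count at runtime. */
    static uint32_t
    sum_generic(const struct table *tbl)
    {
        return sum_impl(tbl, tbl->count);
    }

    /* Specialized instance: the hard-coded 5 is a compile-time constant,
     * so after inlining sum_impl() the compiler can constant-propagate
     * the trip count and fully unroll the loop into "flat" code. */
    static uint32_t
    sum_count5(const struct table *tbl)
    {
        return sum_impl(tbl, 5);
    }

    /* Probe: return the specialized instance when one matches the
     * fingerprint, otherwise fall back to the generic implementation. */
    static sum_func
    sum_probe(uint32_t count)
    {
        return count == 5 ? sum_count5 : sum_generic;
    }

    int
    main(void)
    {
        uint32_t data[5] = {1, 2, 3, 4, 5};
        struct table tbl = { data, 5, NULL };

        tbl.sum = sum_probe(tbl.count);  /* dispatch selected once */
        printf("%u\n", tbl.sum(&tbl));   /* prints 15 */
        return 0;
    }

Compiled with optimizations (e.g. gcc -O2), sum_count5() typically
reduces to a loop-free sequence of adds, while sum_generic() retains
the loop. The probe mirrors how dpcls_subtable_generic_probe() above
falls back to dpcls_subtable_lookup_generic() when no specialized
instance matches the subtable's miniflow bit counts.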