
[ovs-dev] dpcls: cache miniflow blocks during lookup

Message ID 1527849927-208691-1-git-send-email-harry.van.haaren@intel.com
State Changes Requested
Delegated to: Ian Stokes
Series: [ovs-dev] dpcls: cache miniflow blocks during lookup

Commit Message

Van Haaren, Harry June 1, 2018, 10:45 a.m. UTC
Lookup consists of 3 main stages:
1) hashing of the packet miniflow based on the subtable mf bits
2) lookup of that hash in the cmap
3) verification of the rule based on the rule mf bits

Before this commit, the iteration of stages 1) and 3) was
totally independent, and the work done was duplicated. The
reason is that the bits in the subtable and the rule are
identical.

This commit optimizes away this duplicate work by caching
the data values retrieved from the packet miniflow, storing
them in a linear array that matches the subtable mf blocks.

Once the cmap lookup has completed, the *same* miniflow
must be iterated again, and the cached miniflow block data
is re-used. This avoids iterating the packet miniflow a
second time, reducing the overhead of rule verification.

Performance:
This patch was tested using VSPerf with the phy2phy_cont test,
varying the number of flows, and with EMC enabled/disabled.
Results are based on initial testing here - please verify
with your use case and report back :)

Flows | EMC | Master mpps | +Patch mpps | Perf Gain
----------------------------------------------------
  4K  |  1  |    ~15.1    |    ~15.2    | + ~0.6 %
  4K  |  0  |    ~14.6    |    ~15.9    | + ~8.2 %
  1M  |  1  |    ~13.5    |    ~14.4    | + ~6.5 %
  1M  |  0  |    ~14.6    |    ~15.8    | + ~8.2 %

The roughly neutral performance with EMC enabled and a low
flow count (4K) is because DPCLS is not often hit, as EMC
will match many of the flows. Where EMC is disabled, we see
a consistent performance improvement, as DPCLS is used for
each packet match.

Input and feedback welcomed,

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/dpif-netdev.c | 60 ++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 46 insertions(+), 14 deletions(-)

Comments

Wang, Yipeng1 June 18, 2018, 9:12 p.m. UTC | #1
Hi Harry,

This is a good observation, and there is definitely some duplicated calculation there.

Actually, when we were designing our "CD" distributor structure, we found this overhead, and we tried to cache the first hash value in "CD" to bypass the first hash calculation entirely. But that solution turned out to consume too much CPU cache.

Your solution seems effective. For now it looks good to me. Just make sure the claim that "the bits in the subtable and the rule are identical" is correct. :)

Thanks
Yipeng


>-----Original Message-----
>From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-bounces@openvswitch.org] On Behalf Of Harry van Haaren
>Sent: Friday, June 1, 2018 3:45 AM
>To: ovs-dev@openvswitch.org
>Subject: [ovs-dev] [PATCH] dpcls: cache miniflow blocks during lookup
<SNIP>
Eelco Chaudron July 9, 2018, 12:35 p.m. UTC | #2
On 1 Jun 2018, at 12:45, Harry van Haaren wrote:

<SNIP>
> Input and feedback welcomed,

Hi Harry,

As I promised you to do some tests, here are my findings…

For the PVP test, which runs at wire speed (XL710 @ 40G), I do not see
much improvement. Note that the "Number of flows" are traffic flows; the
number of installed OpenFlow rules is double, i.e., to and from the VM.
For details on the test see https://github.com/chaudron/ovs_perf

+-----------------+--------+--------+--------+--------+--------+--------+--------+
| Packet size /   |   64   |  128   |  256   |  512   |  768   |  1024  |  1514  |
| Number of flows |        |        |        |        |        |        |        |
+-----------------+--------+--------+--------+--------+--------+--------+--------+
|              10 | -1,36% | -0,18% | -2,33% | -1,87% | -0,32% |  2,19% |  0,37% |
|            1000 |  0,16% |  0,06% | -0,14% | -0,02% | -0,75% | -0,05% | -0,20% |
|           10000 | -0,12% |  0,07% |  0,14% |  1,14% |  1,49% |  0,29% |  1,33% |
|          100000 | -0,84% |  0,51% |  0,92% |  0,87% |  1,40% |  1,82% |  1,22% |
|         1000000 |  0,28% |  3,85% | -0,27% |  6,46% | -0,31% |  4,06% | -2,48% |
+-----------------+--------+--------+--------+--------+--------+--------+--------+

However, using a zero packet loss test (like VSPerf) I do see similar
numbers as above:

+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+
|                    |     64      |     128     |     256     |     512     |    1024     |    1518     |
+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+
| rx pkts/sec DELTA  | 9,58%       | 6,02%       | 3,66%       | 8,96%       | 5,02%       | -0,10%      |
| rx pkts/sec MASTER | 1397142,077 | 1348985,6   | 1356478,542 | 1254775,104 | 1077042,154 | 768962,1308 |
| rx pkts/sec PATCH  | 1545202,762 | 1438959,781 | 1407962,838 | 1378228,358 | 1133955,096 | 768201,7808 |
+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+

Note: This is a test that uses the NORMAL rule, with 100 MAC address
pairs and 1M constant flows, plus an additional 200K flows changing
every second. It tries to simulate a mobile NFV scenario.


I’ll try to actually review the patch later this week if time permits, 
as I’m preparing for PTO.

Cheers,


Eelco



<SNIP>

Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 818b184..ba03040 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2185,7 +2185,8 @@  netdev_flow_key_init_masked(struct netdev_flow_key *dst,
  * 'mask'. */
 static inline uint32_t
 netdev_flow_key_hash_in_mask(const struct netdev_flow_key *key,
-                             const struct netdev_flow_key *mask)
+                             const struct netdev_flow_key *mask,
+                             uint64_t *blocks)
 {
     const uint64_t *p = miniflow_get_values(&mask->mf);
     uint32_t hash = 0;
@@ -2193,6 +2194,7 @@  netdev_flow_key_hash_in_mask(const struct netdev_flow_key *key,
 
     NETDEV_FLOW_KEY_FOR_EACH_IN_FLOWMAP(value, key, mask->mf.map) {
         hash = hash_add64(hash, value & *p++);
+        *blocks++ = value;
     }
 
     return hash_finish(hash, (p - miniflow_get_values(&mask->mf)) * 8);
@@ -6203,6 +6205,9 @@  dpif_dummy_register(enum dummy_level level)
 
 /* A set of rules that all have the same fields wildcarded. */
 struct dpcls_subtable {
+    /* Number of bits set in the miniflow, i.e., blocks used in the hash. */
+    uint32_t num_mf_bits;
+
     /* The fields are only used by writers. */
     struct cmap_node cmap_node OVS_GUARDED; /* Within dpcls 'subtables_map'. */
 
@@ -6263,11 +6268,17 @@  dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
     cmap_init(&subtable->rules);
     subtable->hit_cnt = 0;
     netdev_flow_key_clone(&subtable->mask, mask);
+
+    /* Count the number of bits set in this subtable's miniflow map. */
+    subtable->num_mf_bits = __builtin_popcountll(mask->mf.map.bits[0]) +
+                            __builtin_popcountll(mask->mf.map.bits[1]);
+
     cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
     /* Add the new subtable at the end of the pvector (with no hits yet) */
     pvector_insert(&cls->subtables, subtable, 0);
-    VLOG_DBG("Creating %"PRIuSIZE". subtable %p for in_port %d",
-             cmap_count(&cls->subtables_map), subtable, cls->in_port);
+    VLOG_DBG("Creating %"PRIuSIZE". subtable %p for in_port %d, mf bits %d",
+             cmap_count(&cls->subtables_map), subtable, cls->in_port,
+             subtable->num_mf_bits);
     pvector_publish(&cls->subtables);
 
     return subtable;
@@ -6287,7 +6298,6 @@  dpcls_find_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
     return dpcls_create_subtable(cls, mask);
 }
 
-
 /* Periodically sort the dpcls subtable vectors according to hit counts */
 static void
 dpcls_sort_subtable_vector(struct dpcls *cls)
@@ -6375,18 +6385,21 @@  dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
     }
 }
 
-/* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
- * in 'mask' the values in 'key' and 'target' are the same. */
+/* Returns true if the packet represented by 'pkt_blocks' matches 'rule'
+ * after 'pkt_blocks' has been masked by the subtable mask.
+ */
 static inline bool
 dpcls_rule_matches_key(const struct dpcls_rule *rule,
-                       const struct netdev_flow_key *target)
+                       const uint64_t *pkt_blocks,
+                       const uint32_t mf_bits)
 {
-    const uint64_t *keyp = miniflow_get_values(&rule->flow.mf);
-    const uint64_t *maskp = miniflow_get_values(&rule->mask->mf);
-    uint64_t value;
+    const uint64_t *tbl_blocks_mask = miniflow_get_values(&rule->mask->mf);
+    const uint64_t *rule_block_values = miniflow_get_values(&rule->flow.mf);
 
-    NETDEV_FLOW_KEY_FOR_EACH_IN_FLOWMAP(value, target, rule->flow.mf.map) {
-        if (OVS_UNLIKELY((value & *maskp++) != *keyp++)) {
+    int i;
+    for (i = 0; i < mf_bits; i++) {
+        const uint64_t masked_data = pkt_blocks[i] & tbl_blocks_mask[i];
+        if (masked_data != rule_block_values[i]) {
             return false;
         }
     }
@@ -6436,16 +6449,31 @@  dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
      * search-key against each subtable, but when a match is found for a
      * search-key, the search for that key can stop because the rules are
      * non-overlapping. */
+
+    /* Calculate the number of blocks that the miniflow *could* take.  This
+     * reserves one uint64_t on the stack for each potential miniflow block.
+     * It is over-provisioned for any real-world use case, but guarantees
+     * there is space for all miniflow values.
+     */
+#define MF_BLOCKS_MAX (MAP_T_BITS * FLOWMAP_UNITS)
+    uint64_t blocks_pkts[NETDEV_MAX_BURST * MF_BLOCKS_MAX];
+
     PVECTOR_FOR_EACH (subtable, &cls->subtables) {
         int i;
 
+        /* Number of miniflow blocks cached per packet for this subtable. */
+        const uint32_t mf_bits = subtable->num_mf_bits;
+
         /* Compute hashes for the remaining keys.  Each search-key is
          * masked with the subtable's mask to avoid hashing the wildcarded
          * bits. */
         ULLONG_FOR_EACH_1(i, keys_map) {
+            const uint32_t block_idx = i * MF_BLOCKS_MAX;
             hashes[i] = netdev_flow_key_hash_in_mask(&keys[i],
-                                                     &subtable->mask);
+                                                     &subtable->mask,
+                                                     &blocks_pkts[block_idx]);
         }
+
         /* Lookup. */
         found_map = cmap_find_batch(&subtable->rules, keys_map, hashes, nodes);
         /* Check results.  When the i-th bit of found_map is set, it means
@@ -6457,7 +6485,11 @@  dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
             struct dpcls_rule *rule;
 
             CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
-                if (OVS_LIKELY(dpcls_rule_matches_key(rule, &keys[i]))) {
+                const uint64_t *blocks = &blocks_pkts[i * MF_BLOCKS_MAX];
+                int rule_matches = dpcls_rule_matches_key(rule, blocks,
+                                                          mf_bits);
+
+                if (OVS_LIKELY(rule_matches)) {
                     rules[i] = rule;
                     /* Even at 20 Mpps the 32-bit hit_cnt cannot wrap
                      * within one second optimization interval. */