[ovs-dev,RFC,1/4] dpif-netdev: Refactor datapath flow cache

Message ID 1516299624-452546-2-git-send-email-yipeng1.wang@intel.com
State Superseded
Delegated to: Ian Stokes
Headers show
Series
  • dpif-netdev: Combine CD and DFC patch for datapath refactor
Related show

Commit Message

Wang, Yipeng1 Jan. 18, 2018, 6:20 p.m.
From: Jan Scheurich <jan.scheurich@ericsson.com>

This is a rebase of Jan's previous patch
[PATCH] dpif-netdev: Refactor datapath flow cache
https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341066.html

So far the netdev datapath uses an 8K EMC to speed up the lookup of
frequently used flows by comparing the parsed packet headers against
the miniflow of a cached flow, using 13 bits of the packet RSS hash
as index. The EMC is too small for many applications with 100K or more
parallel packet flows, so that EMC thrashing actually degrades performance.
Furthermore, the size of struct miniflow and the flow copying cost
prevent us from making it much larger.

At the same time, the lookup cost of the megaflow classifier (DPCLS)
increases as the number of frequently hit subtables grows with the
complexity of the pipeline and the number of recirculations.

To close the performance gap for many parallel flows, this patch
introduces the datapath flow cache (DFC) with 1M entries as a lookup
stage between EMC and DPCLS. It directly maps 20 bits of the RSS hash
to a pointer to the last-hit megaflow entry and performs a masked
comparison of the packet flow with the megaflow key to confirm the
hit. This avoids the costly DPCLS lookup even for a very large number
of parallel flows, with a small memory overhead.
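The scheme above can be sketched in a few lines of C. This is only an illustration of the indexing and masked-comparison idea; the names, the single-u64 flow key, and the struct layout are simplified stand-ins, not the patch's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DFC_MASK_LEN 20                    /* bits of RSS hash used as index */
#define DFC_ENTRIES  (1u << DFC_MASK_LEN)  /* 1M entries */
#define DFC_MASK     (DFC_ENTRIES - 1)

/* Toy stand-in for a megaflow entry: key plus wildcard mask. */
struct toy_megaflow {
    uint64_t key;   /* megaflow key */
    uint64_t mask;  /* 1-bits are the bits that must match */
};

/* Map 20 bits of the RSS hash directly to a cache slot. */
static inline uint32_t dfc_index(uint32_t rss_hash)
{
    return rss_hash & DFC_MASK;
}

/* Masked comparison confirming a candidate hit: the packet's flow key
 * must equal the megaflow key on every bit covered by the mask. */
static inline bool dfc_confirm(const struct toy_megaflow *mf, uint64_t pkt_key)
{
    return mf && (pkt_key & mf->mask) == (mf->key & mf->mask);
}
```

The point of the confirmation step is that many hash values share one slot, so a hit is only trusted after the masked compare succeeds.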

Due to the large size of the DFC and the low risk of DFC thrashing, any
DPCLS hit immediately inserts an entry in the DFC so that subsequent
packets are sped up. The DFC thus also accelerates short-lived flows.

To further accelerate the lookup of a few elephant flows, every DFC hit
triggers a probabilistic EMC insertion of the flow. As the DFC entry is
already in place, the default EMC insertion probability can be reduced to
1/1000 to minimize EMC thrashing should there still be many fat flows.
The inverse EMC insertion probability remains configurable.
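The inverse-probability mechanism can be sketched as follows; function names here are illustrative (OVS's actual random source is random_uint32(), and the threshold constant in the tree is DEFAULT_EM_FLOW_INSERT_MIN):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Threshold for an inverse insertion probability: with inv_prob = 1000
 * the comparison below succeeds for roughly 1 in 1000 uniform 32-bit
 * random draws. A zero inverse probability disables insertion. */
static inline uint32_t emc_insert_min(uint32_t inv_prob)
{
    return inv_prob ? UINT32_MAX / inv_prob : 0;
}

/* 'rnd' stands in for a fresh uniform 32-bit random value. */
static inline bool emc_should_insert(uint32_t rnd, uint32_t inv_prob)
{
    uint32_t min = emc_insert_min(inv_prob);
    return min && rnd <= min;
}
```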

The EMC implementation is simplified by removing the possibility to
store a flow in two slots, as there is no particular reason why two
flows should systematically collide (the RSS hash is not symmetric).
The maximum size of the EMC flow key is limited to 256 bytes to reduce
the memory footprint. This should be sufficient to hold most real life
packet flow keys. Larger flows are not installed in the EMC.

The pmd-stats-show command is enhanced to show both EMC and DFC hits
separately.

The sweep speed for cleaning up obsolete EMC and DFC flow entries and
freeing dead megaflow entries is increased. With a typical PMD cycle
duration of 100 us under load and checking one DFC entry per cycle, the
DFC sweep should normally complete within 100 s.
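As a back-of-envelope check of that figure (assuming one DFC entry swept per 100 us PMD iteration and 2^20 entries, and ignoring the EMC sweep interleaved with it):

```c
#include <assert.h>
#include <stdint.h>

/* Time for one full sweep pass: entries * per-iteration period. */
static inline double dfc_sweep_seconds(uint32_t entries, double iter_us)
{
    return (double) entries * iter_us / 1e6;
}
```

2^20 entries at 100 us each works out to roughly 105 seconds, i.e. on the order of the 100 s quoted above.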

In PVP performance tests with an L3 pipeline over VXLAN we determined the
optimal EMC size to be 16K entries, which gives a uniform speedup over the
master branch across the full range of parallel flows. The measurement
below is for 64-byte packets; the average number of subtable lookups
per DPCLS hit in this pipeline is 1.0, i.e. the acceleration already starts
with a single busy mask. Tests with many visited subtables should show a
strong increase of the gain through DFC.

Flows   master  DFC+EMC  Gain
        [Mpps]  [Mpps]
------------------------------
8       4.45    4.62     3.8%
100     4.17    4.47     7.2%
1000    3.88    4.34    12.0%
2000    3.54    4.17    17.8%
5000    3.01    3.82    27.0%
10000   2.75    3.63    31.9%
20000   2.64    3.50    32.8%
50000   2.60    3.33    28.1%
100000  2.59    3.23    24.7%
500000  2.59    3.16    21.9%

Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
---
 lib/dpif-netdev.c | 350 ++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 235 insertions(+), 115 deletions(-)

Comments

Bodireddy, Bhanuprakash Feb. 20, 2018, 9:09 p.m. | #1
Hi Yipeng,

Thanks for the RFC series. This patch series needs to be rebased.
I applied it on an older commit to do initial testing. Some comments below.

I see that the DFC cache is implemented along similar lines to the EMC cache, except that it
holds a million entries and uses more bits of the RSS hash to index into the cache. I agree that
DPCLS lookup is expensive and consumes 30% of total cycles in some test cases, and the DFC
cache will definitely reduce some pain there.

On the memory footprint:

On master,
    EMC entry size = 592 bytes
    8k entries = ~4MB.

With this patch,
    EMC entry size = 256 bytes
    16k entries = ~4MB.

I like the above reduction in flow key size, keeping the entry size a multiple of a cache line while still keeping the overall EMC size at ~4MB with more EMC entries.

However, my concern is the DFC cache size. As the DFC cache has a million entries and consumes ~12MB per PMD thread, it might not fit into the L3 cache. Also note that on newer platforms the L3 cache is shrinking while L2 is slightly increased (e.g. Skylake has only 1MB L2 and 19MB L3 cache).
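The ~12MB figure can be reproduced from the entry layouts in the patch. The structs below are illustrative stand-ins for the patch's dfc_entry (a single megaflow pointer) and emc_entry (pointer plus 248-byte key), assuming 64-bit pointers:

```c
#include <assert.h>
#include <stdint.h>

struct dp_netdev_flow;   /* opaque here; we only store pointers to it */

struct toy_dfc_entry { struct dp_netdev_flow *flow; };                /*   8 B */
struct toy_emc_entry { struct dp_netdev_flow *flow; char key[248]; }; /* 256 B */

#define TOY_DFC_ENTRIES (1u << 20)   /* 1M DFC slots  */
#define TOY_EMC_ENTRIES (1u << 14)   /* 16K EMC slots */

/* Total per-PMD flow cache footprint: DFC array plus embedded EMC. */
static inline uint64_t toy_flow_cache_bytes(void)
{
    return (uint64_t) TOY_DFC_ENTRIES * sizeof(struct toy_dfc_entry)
         + (uint64_t) TOY_EMC_ENTRIES * sizeof(struct toy_emc_entry);
}
```

That is 8 MiB for the DFC pointer array plus 4 MiB for the EMC, i.e. 12 MiB per PMD thread.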

Despite the memory footprint, I still think the DFC cache improves switching performance, as it is a lot less expensive than invoking dpcls_lookup(); the latter involves a more expensive hash computation and subtable traversal. It would be nice to see more testing with real VNFs to confirm that this patch doesn't cause cache thrashing or suffer from memory bottlenecks.

Some more comments below.

>This is a rebase of Jan's previous patch [PATCH] dpif-netdev: Refactor
>datapath flow cache https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>November/341066.html
>
>So far the netdev datapath uses an 8K EMC to speed up the lookup of
>frequently used flows by comparing the parsed packet headers against the
>miniflow of a cached flow, using 13 bits of the packet RSS hash as index. The
>EMC is too small for many applications with 100K or more parallel packet flows
>so that EMC thrashing actually degrades performance.
>Furthermore, the size of struct miniflow and the flow copying cost prevent us
>from making it much larger.
>
>At the same time the lookup cost of the megaflow classifier (DPCLS) is
>increasing as the number of frequently hit subtables grows with the
>complexity of pipeline and the number of recirculations.
>
>To close the performance gap for many parallel flows, this patch introduces
>the datapath flow cache (DFC) with 1M entries as lookup stage between EMC
>and DPCLS. It directly maps 20 bits of the RSS hash to a pointer to the last hit
>megaflow entry and performs a masked comparison of the packet flow with
>the megaflow key to confirm the hit. This avoids the costly DPCLS lookup even
>for very large number of parallel flows with a small memory overhead.
>
>Due to the large size of the DFC and the low risk of DFC thrashing, any DPCLS hit
>immediately inserts an entry in the DFC so that subsequent packets are
>sped up. The DFC thus also accelerates short-lived flows.
>
>To further accelerate the lookup of few elephant flows, every DFC hit triggers
>a probabilistic EMC insertion of the flow. As the DFC entry is already in place
>the default EMC insertion probability can be reduced to
>1/1000 to minimize EMC thrashing should there still be many fat flows.
>The inverse EMC insertion probability remains configurable.
>
>The EMC implementation is simplified by removing the possibility to store a
>flow in two slots, as there is no particular reason why two flows should
>systematically collide (the RSS hash is not symmetric).

[BHANU]
I am not sure it is a good idea to simplify the EMC by making it 1-way associative instead of the current 2-way associative implementation.
I prefer to leave the current approach as-is unless we have strong data proving otherwise.
This comment applies to the code changes below w.r.t. EMC lookup and insert.
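For reference, the 2-way scheme on master derives EM_FLOW_HASH_SEGS = 2 candidate slots from disjoint 13-bit segments of the packet hash, so a flow colliding in one slot can still land in the other. A toy version of that indexing (names here are illustrative, constants match master's EMC):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_HASH_SHIFT 13                         /* EM_FLOW_HASH_SHIFT */
#define TOY_HASH_MASK  ((1u << TOY_HASH_SHIFT) - 1)
#define TOY_HASH_SEGS  2                          /* EM_FLOW_HASH_SEGS  */

/* Fill pos[] with the candidate EMC slots for 'hash', one per segment. */
static inline void emc_candidate_slots(uint32_t hash,
                                       uint32_t pos[TOY_HASH_SEGS])
{
    for (int i = 0; i < TOY_HASH_SEGS; i++) {
        pos[i] = hash & TOY_HASH_MASK;
        hash >>= TOY_HASH_SHIFT;
    }
}
```

The 1-way variant in this patch keeps only pos[0], which is what makes systematic collisions between two hot flows possible.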

>The maximum size of the EMC flow key is limited to 256 bytes to reduce the
>memory footprint. This should be sufficient to hold most real life packet flow
>keys. Larger flows are not installed in the EMC.

+1 

>
>The pmd-stats-show command is enhanced to show both EMC and DFC hits
>separately.
>
>The sweep speed for cleaning up obsolete EMC and DFC flow entries and
>freeing dead megaflow entries is increased. With a typical PMD cycle duration
>of 100us under load and checking one DFC entry per cycle, the DFC sweep
>should normally complete within 100s.
>
>In PVP performance tests with an L3 pipeline over VXLAN we determined the
>optimal EMC size to be 16K entries to obtain a uniform speedup compared to
>the master branch over the full range of parallel flows. The measurement
>below is for 64 byte packets and the average number of subtable lookups per
>DPCLS hit in this pipeline is 1.0, i.e. the acceleration already starts for a single
>busy mask. Tests with many visited subtables should show a strong increase
>of the gain through DFC.
>
>Flows   master  DFC+EMC  Gain
>        [Mpps]  [Mpps]
>------------------------------
>8       4.45    4.62     3.8%
>100     4.17    4.47     7.2%
>1000    3.88    4.34    12.0%
>2000    3.54    4.17    17.8%
>5000    3.01    3.82    27.0%
>10000   2.75    3.63    31.9%
>20000   2.64    3.50    32.8%
>50000   2.60    3.33    28.1%
>100000  2.59    3.23    24.7%
>500000  2.59    3.16    21.9%
>
>Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
>---
> lib/dpif-netdev.c | 350 ++++++++++++++++++++++++++++++++++++------------------
> 1 file changed, 235 insertions(+), 115 deletions(-)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>index c7d157a..b9f4b6d 100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -128,19 +128,19 @@ struct netdev_flow_key {
>     uint64_t buf[FLOW_MAX_PACKET_U64S];
> };
>
>-/* Exact match cache for frequently used flows
>+/* Datapath flow cache (DFC) for frequently used flows
>  *
>- * The cache uses a 32-bit hash of the packet (which can be the RSS hash) to
>- * search its entries for a miniflow that matches exactly the miniflow of the
>- * packet. It stores the 'dpcls_rule' (rule) that matches the miniflow.
>+ * The cache uses the 32-bit hash of the packet (which can be the RSS hash) to
>+ * directly look up a pointer to the matching megaflow. To check for a match
>+ * the packet's flow key is compared against the key and mask of the megaflow.
>  *
>- * A cache entry holds a reference to its 'dp_netdev_flow'.
>- *
>- * A miniflow with a given hash can be in one of EM_FLOW_HASH_SEGS different
>- * entries. The 32-bit hash is split into EM_FLOW_HASH_SEGS values (each of
>- * them is EM_FLOW_HASH_SHIFT bits wide and the remainder is thrown away). Each
>- * value is the index of a cache entry where the miniflow could be.
>+ * For even faster lookup, the most frequently used packet flows are also
>+ * inserted into a small exact match cache (EMC). The EMC uses a part of the
>+ * packet hash to look up a miniflow that matches exactly the miniflow of the
>+ * packet. The matching EMC also returns a reference to the megaflow.
>  *
>+ * Flows are promoted from the DFC to the EMC through probabilistic insertion
>+ * after successful DFC lookup with minor probability to favor elephant flows.
>  *
>  * Thread-safety
>  * =============
>@@ -149,33 +149,38 @@ struct netdev_flow_key {
>  * If dp_netdev_input is not called from a pmd thread, a mutex is used.
>  */
>
>-#define EM_FLOW_HASH_SHIFT 13
>-#define EM_FLOW_HASH_ENTRIES (1u << EM_FLOW_HASH_SHIFT)
>-#define EM_FLOW_HASH_MASK (EM_FLOW_HASH_ENTRIES - 1)
>-#define EM_FLOW_HASH_SEGS 2
>+#define DFC_MASK_LEN 20
>+#define DFC_ENTRIES (1u << DFC_MASK_LEN)
>+#define DFC_MASK (DFC_ENTRIES - 1)
>+#define EMC_MASK_LEN 14
>+#define EMC_ENTRIES (1u << EMC_MASK_LEN)
>+#define EMC_MASK (EMC_ENTRIES - 1)
>
> /* Default EMC insert probability is 1 / DEFAULT_EM_FLOW_INSERT_INV_PROB */
>-#define DEFAULT_EM_FLOW_INSERT_INV_PROB 100
>+#define DEFAULT_EM_FLOW_INSERT_INV_PROB 1000
> #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX /                     \
>                                     DEFAULT_EM_FLOW_INSERT_INV_PROB)
>
> struct emc_entry {
>     struct dp_netdev_flow *flow;
>-    struct netdev_flow_key key;   /* key.hash used for emc hash value. */
>+    char key[248];   /* Holds struct netdev_flow_key of limited size. */
> };
>
> struct emc_cache {
>-    struct emc_entry entries[EM_FLOW_HASH_ENTRIES];
>-    int sweep_idx;                /* For emc_cache_slow_sweep(). */
>+    struct emc_entry entries[EMC_ENTRIES];
>+    int sweep_idx;
>+};
>+
>+struct dfc_entry {
>+    struct dp_netdev_flow *flow;
>+};
>+
>+struct dfc_cache {
>+    struct emc_cache emc_cache;
>+    struct dfc_entry entries[DFC_ENTRIES];
>+    int sweep_idx;
> };
>
>-/* Iterate in the exact match cache through every entry that might contain a
>- * miniflow with hash 'HASH'. */
>-#define EMC_FOR_EACH_POS_WITH_HASH(EMC, CURRENT_ENTRY, HASH)                 \
>-    for (uint32_t i__ = 0, srch_hash__ = (HASH);                             \
>-         (CURRENT_ENTRY) = &(EMC)->entries[srch_hash__ & EM_FLOW_HASH_MASK], \
>-         i__ < EM_FLOW_HASH_SEGS;                                            \
>-         i__++, srch_hash__ >>= EM_FLOW_HASH_SHIFT)
> 

> /* Simple non-wildcarding single-priority classifier. */
>
>@@ -215,7 +220,8 @@ static bool dpcls_lookup(struct dpcls *cls,
>                          const struct netdev_flow_key keys[],
>                          struct dpcls_rule **rules, size_t cnt,
>                          int *num_lookups_p);
>-
>+static bool dpcls_rule_matches_key(const struct dpcls_rule *rule,
>+                            const struct netdev_flow_key *target);
> /* Set of supported meter flags */
> #define DP_SUPPORTED_METER_FLAGS_MASK \
>     (OFPMF13_STATS | OFPMF13_PKTPS | OFPMF13_KBPS | OFPMF13_BURST)
>@@ -333,6 +339,7 @@ static struct dp_netdev_port *dp_netdev_lookup_port(const struct dp_netdev *dp,
>
> enum dp_stat_type {
>     DP_STAT_EXACT_HIT,          /* Packets that had an exact match (emc). */
>+    DP_STAT_DFC_HIT,            /* Packets that had a flow cache hit (dfc). */
>     DP_STAT_MASKED_HIT,         /* Packets that matched in the flow table. */
>     DP_STAT_MISS,               /* Packets that did not match. */
>     DP_STAT_LOST,               /* Packets not passed up to the client. */
>@@ -576,7 +583,7 @@ struct dp_netdev_pmd_thread {
>      * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
>      * need to be protected by 'non_pmd_mutex'.  Every other instance
>      * will only be accessed by its own pmd thread. */
>-    struct emc_cache flow_cache;
>+    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct dfc_cache flow_cache;
>
>     /* Flow-Table and classifiers
>      *
>@@ -742,46 +749,59 @@ dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread *pmd,
> static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
>                                       struct tx_port *tx);
>
>-static inline bool emc_entry_alive(struct emc_entry *ce);
>+static inline bool dfc_entry_alive(struct dfc_entry *ce);
> static void emc_clear_entry(struct emc_entry *ce);
>+static void dfc_clear_entry(struct dfc_entry *ce);
>
> static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
>
> static void
>-emc_cache_init(struct emc_cache *flow_cache)
>+emc_cache_init(struct emc_cache *emc)
> {
>     int i;
>
>-    flow_cache->sweep_idx = 0;
>+    for (i = 0; i < ARRAY_SIZE(emc->entries); i++) {
>+        emc->entries[i].flow = NULL;
>+        struct netdev_flow_key *key =
>+                (struct netdev_flow_key *) emc->entries[i].key;
>+        key->hash = 0;
>+        key->len = sizeof(struct miniflow);
>+        flowmap_init(&key->mf.map);
>+    }
>+    emc->sweep_idx = 0;
>+}
>+
>+static void
>+dfc_cache_init(struct dfc_cache *flow_cache)
>+{
>+    int i;
>+
>+    emc_cache_init(&flow_cache->emc_cache);
>     for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
>         flow_cache->entries[i].flow = NULL;
>-        flow_cache->entries[i].key.hash = 0;
>-        flow_cache->entries[i].key.len = sizeof(struct miniflow);
>-        flowmap_init(&flow_cache->entries[i].key.mf.map);
>     }
>+    flow_cache->sweep_idx = 0;
> }
>
> static void
>-emc_cache_uninit(struct emc_cache *flow_cache)
>+emc_cache_uninit(struct emc_cache *emc)
> {
>     int i;
>
>-    for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
>-        emc_clear_entry(&flow_cache->entries[i]);
>+    for (i = 0; i < ARRAY_SIZE(emc->entries); i++) {
>+        emc_clear_entry(&emc->entries[i]);
>     }
> }
>
>-/* Check and clear dead flow references slowly (one entry at each
>- * invocation).  */
> static void
>-emc_cache_slow_sweep(struct emc_cache *flow_cache)
>+dfc_cache_uninit(struct dfc_cache *flow_cache)
> {
>-    struct emc_entry *entry = &flow_cache->entries[flow_cache->sweep_idx];
>+    int i;
>
>-    if (!emc_entry_alive(entry)) {
>-        emc_clear_entry(entry);
>+    for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
>+        dfc_clear_entry(&flow_cache->entries[i]);
>     }
>-    flow_cache->sweep_idx = (flow_cache->sweep_idx + 1) & EM_FLOW_HASH_MASK;
>+    emc_cache_uninit(&flow_cache->emc_cache);
> }
>
> /* Updates the time in PMD threads context and should be called in three cases:
>@@ -858,8 +878,8 @@ pmd_info_show_stats(struct ds *reply,
>     }
>
>     /* Sum of all the matched and not matched packets gives the total.  */
>-    total_packets = stats[DP_STAT_EXACT_HIT] + stats[DP_STAT_MASKED_HIT]
>-                    + stats[DP_STAT_MISS];
>+    total_packets = stats[DP_STAT_EXACT_HIT] + stats[DP_STAT_DFC_HIT]
>+                    + stats[DP_STAT_MASKED_HIT] + stats[DP_STAT_MISS];
>
>     for (i = 0; i < PMD_N_CYCLES; i++) {
>         if (cycles[i] > pmd->cycles_zero[i]) {
>@@ -892,11 +912,14 @@ pmd_info_show_stats(struct ds *reply,
>     }
>
>     ds_put_format(reply,
>-                  "\temc hits:%llu\n\tmegaflow hits:%llu\n"
>+                  "\temc hits:%llu\n\tdfc hits:%llu\n"
>+                  "\tmegaflow hits:%llu\n"
>                   "\tavg. subtable lookups per hit:%.2f\n"
>                   "\tmiss:%llu\n\tlost:%llu\n"
>                   "\tavg. packets per output batch: %.2f\n",
>-                  stats[DP_STAT_EXACT_HIT], stats[DP_STAT_MASKED_HIT],
>+                  stats[DP_STAT_EXACT_HIT],
>+                  stats[DP_STAT_DFC_HIT],
>+                  stats[DP_STAT_MASKED_HIT],
>                   lookups_per_hit, stats[DP_STAT_MISS], stats[DP_STAT_LOST],
>                   packets_per_batch);
>
>@@ -1521,6 +1544,8 @@ dpif_netdev_get_stats(const struct dpif *dpif, struct dpif_dp_stats *stats)
>         stats->n_hit += n;
>         atomic_read_relaxed(&pmd->stats.n[DP_STAT_EXACT_HIT], &n);
>         stats->n_hit += n;
>+        atomic_read_relaxed(&pmd->stats.n[DP_STAT_DFC_HIT], &n);
>+        stats->n_hit += n;
>         atomic_read_relaxed(&pmd->stats.n[DP_STAT_MISS], &n);
>         stats->n_missed += n;
>         atomic_read_relaxed(&pmd->stats.n[DP_STAT_LOST], &n);
>@@ -2140,6 +2165,16 @@ netdev_flow_key_hash_in_mask(const struct netdev_flow_key *key,
>     return hash_finish(hash, (p - miniflow_get_values(&mask->mf)) * 8);
> }
>
>+/*
>+ * Datapath Flow Cache and EMC implementation
>+ */
>+
>+static inline struct emc_entry *
>+emc_entry_get(struct emc_cache *emc, const uint32_t hash)
>+{
>+    return &emc->entries[hash & EMC_MASK];
>+}
>+
> static inline bool
> emc_entry_alive(struct emc_entry *ce)
> {
>@@ -2156,9 +2191,15 @@ emc_clear_entry(struct emc_entry *ce)
> }
>
> static inline void
>-emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow,
>-                 const struct netdev_flow_key *key)
>+emc_change_entry(struct emc_entry *ce, const struct netdev_flow_key *key,
>+                 struct dp_netdev_flow *flow)
> {
>+    /* We only store small enough flows in the EMC. */
>+    size_t key_size = offsetof(struct netdev_flow_key, mf) + key->len;
>+    if (key_size > sizeof(ce->key)) {
>+        return;
>+    }
>+
>     if (ce->flow != flow) {
>         if (ce->flow) {
>             dp_netdev_flow_unref(ce->flow);
>@@ -2170,73 +2211,148 @@ emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow,
>             ce->flow = NULL;
>         }
>     }
>-    if (key) {
>-        netdev_flow_key_clone(&ce->key, key);
>-    }
>+    netdev_flow_key_clone((struct netdev_flow_key *) ce->key, key);
> }
>
> static inline void
>-emc_insert(struct emc_cache *cache, const struct netdev_flow_key *key,
>-           struct dp_netdev_flow *flow)
>+emc_probabilistic_insert(struct emc_cache *emc,
>+                         const struct netdev_flow_key *key,
>+                         struct dp_netdev_flow *flow)
> {
>-    struct emc_entry *to_be_replaced = NULL;
>-    struct emc_entry *current_entry;
>+    const uint32_t threshold = UINT32_MAX/1000;
>
>-    EMC_FOR_EACH_POS_WITH_HASH(cache, current_entry, key->hash) {
>-        if (netdev_flow_key_equal(&current_entry->key, key)) {
>-            /* We found the entry with the 'mf' miniflow */
>-            emc_change_entry(current_entry, flow, NULL);
>-            return;
>+    if (random_uint32() <= threshold) {
>+
>+        struct emc_entry *current_entry = emc_entry_get(emc, key->hash);
>+        emc_change_entry(current_entry, key, flow);
>+    }
>+}
>+
>+static inline struct dp_netdev_flow *
>+emc_lookup(struct emc_cache *emc, const struct netdev_flow_key *key)
>+{
>+    struct emc_entry *current_entry = emc_entry_get(emc, key->hash);
>+    struct netdev_flow_key *current_key =
>+            (struct netdev_flow_key *) current_entry->key;
>+
>+    if (current_key->hash == key->hash
>+        && emc_entry_alive(current_entry)
>+        && netdev_flow_key_equal_mf(current_key, &key->mf)) {
>+
>+        /* We found the entry with the 'key->mf' miniflow */
>+        return current_entry->flow;
>+    }
>+    return NULL;
>+}
>+
>+static inline struct dfc_entry *
>+dfc_entry_get(struct dfc_cache *cache, const uint32_t hash)
>+{
>+    return &cache->entries[hash & DFC_MASK];
>+}
>+
>+static inline bool
>+dfc_entry_alive(struct dfc_entry *ce)
>+{
>+    return ce->flow && !ce->flow->dead;
>+}
>+
>+static void
>+dfc_clear_entry(struct dfc_entry *ce)
>+{
>+    if (ce->flow) {
>+        dp_netdev_flow_unref(ce->flow);
>+        ce->flow = NULL;
>+    }
>+}
>+
>+static inline void
>+dfc_change_entry(struct dfc_entry *ce, struct dp_netdev_flow *flow)
>+{
>+    if (ce->flow != flow) {
>+        if (ce->flow) {
>+            dp_netdev_flow_unref(ce->flow);
>         }
>
>-        /* Replacement policy: put the flow in an empty (not alive) entry, or
>-         * in the first entry where it can be */
>-        if (!to_be_replaced
>-            || (emc_entry_alive(to_be_replaced)
>-                && !emc_entry_alive(current_entry))
>-            || current_entry->key.hash < to_be_replaced->key.hash) {
>-            to_be_replaced = current_entry;
>+        if (dp_netdev_flow_ref(flow)) {
>+            ce->flow = flow;
>+        } else {
>+            ce->flow = NULL;
>         }
>     }
>-    /* We didn't find the miniflow in the cache.
>-     * The 'to_be_replaced' entry is where the new flow will be stored */
>-
>-    emc_change_entry(to_be_replaced, flow, key);
> }
>
> static inline void
>-emc_probabilistic_insert(struct dp_netdev_pmd_thread *pmd,
>-                         const struct netdev_flow_key *key,
>-                         struct dp_netdev_flow *flow)
>+dfc_insert(struct dp_netdev_pmd_thread *pmd,
>+           const struct netdev_flow_key *key,
>+           struct dp_netdev_flow *flow)
> {
>-    /* Insert an entry into the EMC based on probability value 'min'. By
>-     * default the value is UINT32_MAX / 100 which yields an insertion
>-     * probability of 1/100 ie. 1% */
>+    struct dfc_cache *cache = &pmd->flow_cache;
>+    struct dfc_entry *current_entry;
>+
>+    current_entry = dfc_entry_get(cache, key->hash);
>+    dfc_change_entry(current_entry, flow);
>+}
>+
>+static inline struct dp_netdev_flow *
>+dfc_lookup(struct dfc_cache *cache, const struct netdev_flow_key *key,
>+           bool *exact_match)
>+{
>+    struct dp_netdev_flow *flow;
>
>-    uint32_t min;
>-    atomic_read_relaxed(&pmd->dp->emc_insert_min, &min);
>+    /* Try an EMC lookup first. */
>+    flow = emc_lookup(&cache->emc_cache, key);
>+    if (flow) {
>+        *exact_match = true;
>+        return flow;
>+    }
>+
>+    /* EMC lookup not successful: try DFC lookup. */
>+    struct dfc_entry *current_entry = dfc_entry_get(cache, key->hash);
>+    flow = current_entry->flow;
>
>-    if (min && random_uint32() <= min) {
>-        emc_insert(&pmd->flow_cache, key, flow);
>+    if (dfc_entry_alive(current_entry) &&
>+        dpcls_rule_matches_key(&flow->cr, key)) {
>+
>+        /* Found a match in DFC. Insert into EMC for subsequent lookups.
>+         * We use probabilistic insertion here so that mainly elephant
>+         * flows enter EMC. */
>+        emc_probabilistic_insert(&cache->emc_cache, key, flow);
>+        *exact_match = false;
>+        return flow;
>+    } else {
>+
>+        /* No match. Need to go to DPCLS lookup. */
>+        return NULL;
>     }
> }
>
>-static inline struct dp_netdev_flow *
>-emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key)
>+/* Check and clear dead flow references slowly (one entry at each
>+ * invocation).  */
>+static void
>+emc_slow_sweep(struct emc_cache *emc)
> {
>-    struct emc_entry *current_entry;
>+    struct emc_entry *entry = &emc->entries[emc->sweep_idx];
>
>-    EMC_FOR_EACH_POS_WITH_HASH(cache, current_entry, key->hash) {
>-        if (current_entry->key.hash == key->hash
>-            && emc_entry_alive(current_entry)
>-            && netdev_flow_key_equal_mf(&current_entry->key, &key->mf)) {
>+    if (!emc_entry_alive(entry)) {
>+        emc_clear_entry(entry);
>+    }
>+    emc->sweep_idx = (emc->sweep_idx + 1) & EMC_MASK;
>+}
>
>-            /* We found the entry with the 'key->mf' miniflow */
>-            return current_entry->flow;
>-        }
>+static void
>+dfc_slow_sweep(struct dfc_cache *cache)
>+{
>+    /* Sweep the EMC so that both finish in the same time. */
>+    if ((cache->sweep_idx & (DFC_ENTRIES/EMC_ENTRIES - 1)) == 0) {
>+        emc_slow_sweep(&cache->emc_cache);
>     }
>
>-    return NULL;
>+    struct dfc_entry *entry = &cache->entries[cache->sweep_idx];
>+    if (!dfc_entry_alive(entry)) {
>+        dfc_clear_entry(entry);
>+    }
>+    cache->sweep_idx = (cache->sweep_idx + 1) & DFC_MASK;
> }
>
> static struct dp_netdev_flow *
>@@ -4135,7 +4251,7 @@ pmd_thread_main(void *f_)
>     ovs_numa_thread_setaffinity_core(pmd->core_id);
>     dpdk_set_lcore_id(pmd->core_id);
>     poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
>-    emc_cache_init(&pmd->flow_cache);
>+    dfc_cache_init(&pmd->flow_cache);
> reload:
>     pmd_alloc_static_tx_qid(pmd);
>
>@@ -4166,8 +4282,7 @@ reload:
>         }
>
>         if (lc++ > 1024) {
>-            bool reload;
>-
>+            dfc_slow_sweep(&pmd->flow_cache);
[BHANU] I need to better understand the usage of RCUs, but I am wondering why the sweep function
isn't under the !ovsrcu_try_quiesce() condition below?

>             lc = 0;
>
>             coverage_try_clear();
>@@ -4175,9 +4290,9 @@ reload:
>              * iteration, if there were no received packets. */
>             pmd_thread_ctx_time_update(pmd);
>             dp_netdev_pmd_try_optimize(pmd, poll_list, poll_cnt);
>-            if (!ovsrcu_try_quiesce()) {
>-                emc_cache_slow_sweep(&pmd->flow_cache);
>-            }
>+            ovsrcu_try_quiesce();
>+
>+            bool reload;
>
>             atomic_read_relaxed(&pmd->reload, &reload);
>             if (reload) {
>@@ -4200,7 +4315,7 @@ reload:
>         goto reload;
>     }
>
>-    emc_cache_uninit(&pmd->flow_cache);
>+    dfc_cache_uninit(&pmd->flow_cache);
>     free(poll_list);
>     pmd_free_cached_ports(pmd);
>     return NULL;
>@@ -4635,7 +4750,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>     /* init the 'flow_cache' since there is no
>      * actual thread created for NON_PMD_CORE_ID. */
>     if (core_id == NON_PMD_CORE_ID) {
>-        emc_cache_init(&pmd->flow_cache);
>+        dfc_cache_init(&pmd->flow_cache);
>         pmd_alloc_static_tx_qid(pmd);
>     }
>     cmap_insert(&dp->poll_threads, CONST_CAST(struct cmap_node *, &pmd->node),
>@@ -4677,7 +4792,7 @@ dp_netdev_del_pmd(struct dp_netdev *dp, struct dp_netdev_pmd_thread *pmd)
>      * but extra cleanup is necessary */
>     if (pmd->core_id == NON_PMD_CORE_ID) {
>         ovs_mutex_lock(&dp->non_pmd_mutex);
>-        emc_cache_uninit(&pmd->flow_cache);
>+        dfc_cache_uninit(&pmd->flow_cache);
>         pmd_free_cached_ports(pmd);
>         pmd_free_static_tx_qid(pmd);
>         ovs_mutex_unlock(&dp->non_pmd_mutex);
>@@ -4987,7 +5102,7 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>     packet_batch_per_flow_update(batch, pkt, mf);
> }
>
>-/* Try to process all ('cnt') the 'packets' using only the exact match cache
>+/* Try to process all ('cnt') the 'packets' using only the PMD flow cache
>  * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
>  * miniflow is copied into 'keys' and the packet pointer is moved at the
>  * beginning of the 'packets' array.
>@@ -5002,19 +5117,20 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>  * will be ignored.
>  */
> static inline size_t
>-emc_processing(struct dp_netdev_pmd_thread *pmd,
>+dfc_processing(struct dp_netdev_pmd_thread *pmd,
>                struct dp_packet_batch *packets_,
>                struct netdev_flow_key *keys,
>                struct packet_batch_per_flow batches[], size_t *n_batches,
>                bool md_is_valid, odp_port_t port_no)
> {
>-    struct emc_cache *flow_cache = &pmd->flow_cache;
>+    struct dfc_cache *flow_cache = &pmd->flow_cache;
>     struct netdev_flow_key *key = &keys[0];
>-    size_t n_missed = 0, n_dropped = 0;
>+    size_t n_missed = 0, n_dfc_hit = 0, n_emc_hit = 0;
>     struct dp_packet *packet;
>     const size_t cnt = dp_packet_batch_size(packets_);
>     uint32_t cur_min;
>     int i;
>+    bool exact_match;
>
>     atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
>
>@@ -5023,7 +5139,6 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>
>         if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
>             dp_packet_delete(packet);
>-            n_dropped++;
>             continue;
>         }
>
>@@ -5039,7 +5154,7 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>         }
>         miniflow_extract(packet, &key->mf);
>         key->len = 0; /* Not computed yet. */
>-        /* If EMC is disabled skip hash computation and emc_lookup */
>+        /* If DFC is disabled skip hash computation and DFC lookup */
[BHANU] Why would DFC ever be disabled by the user? This was done earlier for EMC because, in the EMC-saturation case, there was a penalty for doing EMC insertion
and emc_lookup() (which involved a costly memcmp()), and disabling EMC showed some 10% advantage for some use cases.

This might not apply to DFC, and I assume DFC shouldn't be a configurable option?

>         if (cur_min) {
>             if (!md_is_valid) {
>                 key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
>@@ -5047,11 +5162,16 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>             } else {
>                 key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
>             }
>-            flow = emc_lookup(flow_cache, key);
>+            flow = dfc_lookup(flow_cache, key, &exact_match);
>         } else {
>             flow = NULL;
>         }
>         if (OVS_LIKELY(flow)) {
>+            if (exact_match) {
>+                n_emc_hit++;
>+            } else {
>+                n_dfc_hit++;
>+            }
>             dp_netdev_queue_batches(packet, flow, &key->mf, batches,
>                                     n_batches);
>         } else {
>@@ -5065,8 +5185,8 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>         }
>     }
>
>-    dp_netdev_count_packet(pmd, DP_STAT_EXACT_HIT,
>-                           cnt - n_dropped - n_missed);
>+    dp_netdev_count_packet(pmd, DP_STAT_EXACT_HIT, n_emc_hit);
>+    dp_netdev_count_packet(pmd, DP_STAT_DFC_HIT, n_dfc_hit);
>
>     return dp_packet_batch_size(packets_);
> }
>@@ -5135,7 +5255,7 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>                                              add_actions->size);
>         }
>         ovs_mutex_unlock(&pmd->flow_mutex);
>-        emc_probabilistic_insert(pmd, key, netdev_flow);
>+        dfc_insert(pmd, key, netdev_flow);
>     }
> }
>
>@@ -5227,7 +5347,7 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>
>         flow = dp_netdev_flow_cast(rules[i]);
>
>-        emc_probabilistic_insert(pmd, &keys[i], flow);
>+        dfc_insert(pmd, &keys[i], flow);
>         dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
>     }
>
>@@ -5259,7 +5379,7 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>     odp_port_t in_port;
>
>     n_batches = 0;
>-    emc_processing(pmd, packets, keys, batches, &n_batches,
>+    dfc_processing(pmd, packets, keys, batches, &n_batches,
>                             md_is_valid, port_no);
>     if (!dp_packet_batch_is_empty(packets)) {
>         /* Get ingress port from first packet's metadata. */
>@@ -6195,7 +6315,7 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
>
> /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
>  * in 'mask' the values in 'key' and 'target' are the same. */
>-static inline bool
>+static bool
> dpcls_rule_matches_key(const struct dpcls_rule *rule,
>                        const struct netdev_flow_key *target)
> {
>--
>2.7.4
>
>_______________________________________________
>dev mailing list
>dev@openvswitch.org
>https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Wang, Yipeng1 Feb. 26, 2018, 6:35 p.m. | #2
Bhanu, thanks for the comments. Please see my replies inline.

And Jan, please feel free to add more.

>-----Original Message-----
>From: Bodireddy, Bhanuprakash
>Sent: Tuesday, February 20, 2018 1:10 PM
>To: Wang, Yipeng1 <yipeng1.wang@intel.com>; dev@openvswitch.org; jan.scheurich@ericsson.com
>Cc: Tai, Charlie <charlie.tai@intel.com>
>Subject: RE: [ovs-dev] [RFC 1/4] dpif-netdev: Refactor datapath flow cache
>
>Hi Yipeng,
>
>Thanks for the RFC series. This patch series need to be rebased.
>I applied this on an older commit to do initial testing. Some comments below.
>
>I see that the DFC cache is implemented along similar lines to the EMC cache, except that it holds
>a million entries and uses more bits of the RSS hash to index into the cache.
[Wang, Yipeng] 
Conceptually, DFC/CD is much more memory-efficient than the EMC since it does not store the full key.

>DPCLS lookup is expensive, consuming 30% of total cycles in some test cases, and the DFC
>cache will definitely reduce some pain there.
>
>On the memory foot print:
>
>On Master,
>    EMC  entry size = 592 bytes
>               8k entries = ~4MB.
>
>With this patch,
>     EMC entry size = 256 bytes
>              16k entries = ~4MB.
>
>I like the above reduction in flow key size, keeping the entry size a multiple of the cache line and still keeping the overall EMC size at
>~4MB with more EMC entries.
>
>However, my concern is the DFC cache size. As the DFC cache holds a million entries and consumes ~12 MB per PMD thread, it might not fit into
>the L3 cache. Also note that in newer platforms the L3 cache is shrinking and L2 is slightly larger (e.g. Skylake has only 1MB L2 and 19MB L3
>cache).
>
[Wang, Yipeng] 
Yes, this is also our concern. The following (4/4) patch introduces indirect table for this reason. 

>In spite of the memory footprint, I still think the DFC cache improves switching performance, as it is a lot less expensive than invoking
>dpcls_lookup(): the latter involves more expensive hash computation and subtable traversal. It would be nice if more testing were
>done with real VNFs to confirm that this patch doesn't cause cache thrashing or suffer from memory bottlenecks.
>
[Wang, Yipeng] 
I don't have a real VNF benchmarking setup, but would it be useful if we tested with a synthetic cache-pirating application to emulate the effect?
 
>
>[BHANU]
>I am not sure it is a good idea to simplify the EMC by using 1-way associativity instead of the current 2-way associative implementation.
>I prefer to leave the current approach as-is unless we have strong data proving otherwise.
>This comment applies to below code changes w.r.t to EMC lookup and insert.
>
[Wang, Yipeng] 
I have synthetic tests showing the simpler EMC works much better, and other sets of synthetic tests showing the 2-way EMC works much better.
It eventually depends on the use case.  We had some discussion in this thread: https://mail.openvswitch.org/pipermail/ovs-dev/2017-December/342197.html

>>The maximum size of the EMC flow key is limited to 256 bytes to reduce the
>>memory footprint. This should be sufficient to hold most real life packet flow
>>keys. Larger flows are not installed in the EMC.
>
>+1
>
>>
>> reload:
>>     pmd_alloc_static_tx_qid(pmd);
>>
>>@@ -4166,8 +4282,7 @@ reload:
>>         }
>>
>>         if (lc++ > 1024) {
>>-            bool reload;
>>-
>>+            dfc_slow_sweep(&pmd->flow_cache);
>[BHANU] I need to better understand the usage of RCU. But I am wondering why the sweep function
>isn't under the !rcu_try_quiesce() condition below?

[Wang, Yipeng] 
The condition checks whether this PMD was able to quiesce once in this iteration. Sweeping evicts EMC entries and registers RCU callbacks to free megaflow entries later. So putting this function under the condition means that we sweep, and register new callbacks, only if this PMD successfully quiesced this round.
Functionally I think it should be fine to put the sweep either inside or outside the condition. But if it is outside, callbacks may accumulate and never be invoked, since this PMD might never manage to enter the quiesced state. So I think we may want to change the code to keep the sweep under the condition.
Others may know this better; please comment.
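A rough sketch of what keeping the sweep under the condition would look like in the patched pmd_thread_main() loop (pseudocode against this patch, not a tested change):

```c
if (lc++ > 1024) {
    lc = 0;
    coverage_try_clear();
    pmd_thread_ctx_time_update(pmd);
    dp_netdev_pmd_try_optimize(pmd, poll_list, poll_cnt);
    if (!ovsrcu_try_quiesce()) {
        /* Sweep (and register new RCU callbacks) only in rounds where
         * this PMD actually managed to quiesce, so callbacks cannot
         * accumulate unbounded on a PMD that never quiesces. */
        dfc_slow_sweep(&pmd->flow_cache);
    }
    /* ... reload check as before ... */
}
```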
>
>>@@ -5039,7 +5154,7 @@ emc_processing(struct dp_netdev_pmd_thread
>>*pmd,
>>         }
>>         miniflow_extract(packet, &key->mf);
>>         key->len = 0; /* Not computed yet. */
>>-        /* If EMC is disabled skip hash computation and emc_lookup */
>>+        /* If DFC is disabled skip hash computation and DFC lookup */
>[BHANU] Why would DFC ever be disabled by user? This was done earlier for EMC as in EMC saturation case, there was penalty doing
>emc insertion
>and emc_lookup() (that involved costly memcmp()) and disabling EMC showed some 10% advantage for some use cases.
>
>This might not apply for DFC and I assume DFC shouldn't be configurable option?

[Wang, Yipeng] 
Yes, this part of the code should be tweaked a bit if we want DFC to be always on.



Thanks

Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index c7d157a..b9f4b6d 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -128,19 +128,19 @@  struct netdev_flow_key {
     uint64_t buf[FLOW_MAX_PACKET_U64S];
 };
 
-/* Exact match cache for frequently used flows
+/* Datapath flow cache (DFC) for frequently used flows
  *
- * The cache uses a 32-bit hash of the packet (which can be the RSS hash) to
- * search its entries for a miniflow that matches exactly the miniflow of the
- * packet. It stores the 'dpcls_rule' (rule) that matches the miniflow.
+ * The cache uses the 32-bit hash of the packet (which can be the RSS hash) to
+ * directly look up a pointer to the matching megaflow. To check for a match
+ * the packet's flow key is compared against the key and mask of the megaflow.
  *
- * A cache entry holds a reference to its 'dp_netdev_flow'.
- *
- * A miniflow with a given hash can be in one of EM_FLOW_HASH_SEGS different
- * entries. The 32-bit hash is split into EM_FLOW_HASH_SEGS values (each of
- * them is EM_FLOW_HASH_SHIFT bits wide and the remainder is thrown away). Each
- * value is the index of a cache entry where the miniflow could be.
+ * For even faster lookup, the most frequently used packet flows are also
+ * inserted into a small exact match cache (EMC). The EMC uses a part of the
+ * packet hash to look up a miniflow that matches exactly the miniflow of the
+ * packet. The matching EMC also returns a reference to the megaflow.
  *
+ * Flows are promoted from the DFC to the EMC through probabilistic insertion
+ * after successful DFC lookup with minor probability to favor elephant flows.
  *
  * Thread-safety
  * =============
@@ -149,33 +149,38 @@  struct netdev_flow_key {
  * If dp_netdev_input is not called from a pmd thread, a mutex is used.
  */
 
-#define EM_FLOW_HASH_SHIFT 13
-#define EM_FLOW_HASH_ENTRIES (1u << EM_FLOW_HASH_SHIFT)
-#define EM_FLOW_HASH_MASK (EM_FLOW_HASH_ENTRIES - 1)
-#define EM_FLOW_HASH_SEGS 2
+#define DFC_MASK_LEN 20
+#define DFC_ENTRIES (1u << DFC_MASK_LEN)
+#define DFC_MASK (DFC_ENTRIES - 1)
+#define EMC_MASK_LEN 14
+#define EMC_ENTRIES (1u << EMC_MASK_LEN)
+#define EMC_MASK (EMC_ENTRIES - 1)
 
 /* Default EMC insert probability is 1 / DEFAULT_EM_FLOW_INSERT_INV_PROB */
-#define DEFAULT_EM_FLOW_INSERT_INV_PROB 100
+#define DEFAULT_EM_FLOW_INSERT_INV_PROB 1000
 #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX /                     \
                                     DEFAULT_EM_FLOW_INSERT_INV_PROB)
 
 struct emc_entry {
     struct dp_netdev_flow *flow;
-    struct netdev_flow_key key;   /* key.hash used for emc hash value. */
+    char key[248];   /* Holds struct netdev_flow_key of limited size. */
 };
 
 struct emc_cache {
-    struct emc_entry entries[EM_FLOW_HASH_ENTRIES];
-    int sweep_idx;                /* For emc_cache_slow_sweep(). */
+    struct emc_entry entries[EMC_ENTRIES];
+    int sweep_idx;
+};
+
+struct dfc_entry {
+    struct dp_netdev_flow *flow;
+};
+
+struct dfc_cache {
+    struct emc_cache emc_cache;
+    struct dfc_entry entries[DFC_ENTRIES];
+    int sweep_idx;
 };
 
-/* Iterate in the exact match cache through every entry that might contain a
- * miniflow with hash 'HASH'. */
-#define EMC_FOR_EACH_POS_WITH_HASH(EMC, CURRENT_ENTRY, HASH)                 \
-    for (uint32_t i__ = 0, srch_hash__ = (HASH);                             \
-         (CURRENT_ENTRY) = &(EMC)->entries[srch_hash__ & EM_FLOW_HASH_MASK], \
-         i__ < EM_FLOW_HASH_SEGS;                                            \
-         i__++, srch_hash__ >>= EM_FLOW_HASH_SHIFT)
 
 /* Simple non-wildcarding single-priority classifier. */
 
@@ -215,7 +220,8 @@  static bool dpcls_lookup(struct dpcls *cls,
                          const struct netdev_flow_key keys[],
                          struct dpcls_rule **rules, size_t cnt,
                          int *num_lookups_p);
-
+static bool dpcls_rule_matches_key(const struct dpcls_rule *rule,
+                            const struct netdev_flow_key *target);
 /* Set of supported meter flags */
 #define DP_SUPPORTED_METER_FLAGS_MASK \
     (OFPMF13_STATS | OFPMF13_PKTPS | OFPMF13_KBPS | OFPMF13_BURST)
@@ -333,6 +339,7 @@  static struct dp_netdev_port *dp_netdev_lookup_port(const struct dp_netdev *dp,
 
 enum dp_stat_type {
     DP_STAT_EXACT_HIT,          /* Packets that had an exact match (emc). */
+    DP_STAT_DFC_HIT,            /* Packets that had a flow cache hit (dfc). */
     DP_STAT_MASKED_HIT,         /* Packets that matched in the flow table. */
     DP_STAT_MISS,               /* Packets that did not match. */
     DP_STAT_LOST,               /* Packets not passed up to the client. */
@@ -576,7 +583,7 @@  struct dp_netdev_pmd_thread {
      * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
      * need to be protected by 'non_pmd_mutex'.  Every other instance
      * will only be accessed by its own pmd thread. */
-    struct emc_cache flow_cache;
+    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct dfc_cache flow_cache;
 
     /* Flow-Table and classifiers
      *
@@ -742,46 +749,59 @@  dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread *pmd,
 static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
                                       struct tx_port *tx);
 
-static inline bool emc_entry_alive(struct emc_entry *ce);
+static inline bool dfc_entry_alive(struct dfc_entry *ce);
 static void emc_clear_entry(struct emc_entry *ce);
+static void dfc_clear_entry(struct dfc_entry *ce);
 
 static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
 
 static void
-emc_cache_init(struct emc_cache *flow_cache)
+emc_cache_init(struct emc_cache *emc)
 {
     int i;
 
-    flow_cache->sweep_idx = 0;
+    for (i = 0; i < ARRAY_SIZE(emc->entries); i++) {
+        emc->entries[i].flow = NULL;
+        struct netdev_flow_key *key =
+                (struct netdev_flow_key *) emc->entries[i].key;
+        key->hash = 0;
+        key->len = sizeof(struct miniflow);
+        flowmap_init(&key->mf.map);
+    }
+    emc->sweep_idx = 0;
+}
+
+static void
+dfc_cache_init(struct dfc_cache *flow_cache)
+{
+    int i;
+
+    emc_cache_init(&flow_cache->emc_cache);
     for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
         flow_cache->entries[i].flow = NULL;
-        flow_cache->entries[i].key.hash = 0;
-        flow_cache->entries[i].key.len = sizeof(struct miniflow);
-        flowmap_init(&flow_cache->entries[i].key.mf.map);
     }
+    flow_cache->sweep_idx = 0;
 }
 
 static void
-emc_cache_uninit(struct emc_cache *flow_cache)
+emc_cache_uninit(struct emc_cache *emc)
 {
     int i;
 
-    for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
-        emc_clear_entry(&flow_cache->entries[i]);
+    for (i = 0; i < ARRAY_SIZE(emc->entries); i++) {
+        emc_clear_entry(&emc->entries[i]);
     }
 }
 
-/* Check and clear dead flow references slowly (one entry at each
- * invocation).  */
 static void
-emc_cache_slow_sweep(struct emc_cache *flow_cache)
+dfc_cache_uninit(struct dfc_cache *flow_cache)
 {
-    struct emc_entry *entry = &flow_cache->entries[flow_cache->sweep_idx];
+    int i;
 
-    if (!emc_entry_alive(entry)) {
-        emc_clear_entry(entry);
+    for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
+        dfc_clear_entry(&flow_cache->entries[i]);
     }
-    flow_cache->sweep_idx = (flow_cache->sweep_idx + 1) & EM_FLOW_HASH_MASK;
+    emc_cache_uninit(&flow_cache->emc_cache);
 }
 
 /* Updates the time in PMD threads context and should be called in three cases:
@@ -858,8 +878,8 @@  pmd_info_show_stats(struct ds *reply,
     }
 
     /* Sum of all the matched and not matched packets gives the total.  */
-    total_packets = stats[DP_STAT_EXACT_HIT] + stats[DP_STAT_MASKED_HIT]
-                    + stats[DP_STAT_MISS];
+    total_packets = stats[DP_STAT_EXACT_HIT] + stats[DP_STAT_DFC_HIT]
+                    + stats[DP_STAT_MASKED_HIT] + stats[DP_STAT_MISS];
 
     for (i = 0; i < PMD_N_CYCLES; i++) {
         if (cycles[i] > pmd->cycles_zero[i]) {
@@ -892,11 +912,14 @@  pmd_info_show_stats(struct ds *reply,
     }
 
     ds_put_format(reply,
-                  "\temc hits:%llu\n\tmegaflow hits:%llu\n"
+                  "\temc hits:%llu\n\tdfc hits:%llu\n"
+                  "\tmegaflow hits:%llu\n"
                   "\tavg. subtable lookups per hit:%.2f\n"
                   "\tmiss:%llu\n\tlost:%llu\n"
                   "\tavg. packets per output batch: %.2f\n",
-                  stats[DP_STAT_EXACT_HIT], stats[DP_STAT_MASKED_HIT],
+                  stats[DP_STAT_EXACT_HIT],
+                  stats[DP_STAT_DFC_HIT],
+                  stats[DP_STAT_MASKED_HIT],
                   lookups_per_hit, stats[DP_STAT_MISS], stats[DP_STAT_LOST],
                   packets_per_batch);
 
@@ -1521,6 +1544,8 @@  dpif_netdev_get_stats(const struct dpif *dpif, struct dpif_dp_stats *stats)
         stats->n_hit += n;
         atomic_read_relaxed(&pmd->stats.n[DP_STAT_EXACT_HIT], &n);
         stats->n_hit += n;
+        atomic_read_relaxed(&pmd->stats.n[DP_STAT_DFC_HIT], &n);
+        stats->n_hit += n;
         atomic_read_relaxed(&pmd->stats.n[DP_STAT_MISS], &n);
         stats->n_missed += n;
         atomic_read_relaxed(&pmd->stats.n[DP_STAT_LOST], &n);
@@ -2140,6 +2165,16 @@  netdev_flow_key_hash_in_mask(const struct netdev_flow_key *key,
     return hash_finish(hash, (p - miniflow_get_values(&mask->mf)) * 8);
 }
 
+/*
+ * Datapath Flow Cache and EMC implementation
+ */
+
+static inline struct emc_entry *
+emc_entry_get(struct emc_cache *emc, const uint32_t hash)
+{
+    return &emc->entries[hash & EMC_MASK];
+}
+
 static inline bool
 emc_entry_alive(struct emc_entry *ce)
 {
@@ -2156,9 +2191,15 @@  emc_clear_entry(struct emc_entry *ce)
 }
 
 static inline void
-emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow,
-                 const struct netdev_flow_key *key)
+emc_change_entry(struct emc_entry *ce, const struct netdev_flow_key *key,
+                 struct dp_netdev_flow *flow)
 {
+    /* We only store small enough flows in the EMC. */
+    size_t key_size = offsetof(struct netdev_flow_key, mf) + key->len;
+    if (key_size > sizeof(ce->key)) {
+        return;
+    }
+
     if (ce->flow != flow) {
         if (ce->flow) {
             dp_netdev_flow_unref(ce->flow);
@@ -2170,73 +2211,148 @@  emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow,
             ce->flow = NULL;
         }
     }
-    if (key) {
-        netdev_flow_key_clone(&ce->key, key);
-    }
+    netdev_flow_key_clone((struct netdev_flow_key *) ce->key, key);
 }
 
 static inline void
-emc_insert(struct emc_cache *cache, const struct netdev_flow_key *key,
-           struct dp_netdev_flow *flow)
+emc_probabilistic_insert(struct emc_cache *emc,
+                         const struct netdev_flow_key *key,
+                         struct dp_netdev_flow *flow)
 {
-    struct emc_entry *to_be_replaced = NULL;
-    struct emc_entry *current_entry;
+    const uint32_t threshold = UINT32_MAX/1000;
 
-    EMC_FOR_EACH_POS_WITH_HASH(cache, current_entry, key->hash) {
-        if (netdev_flow_key_equal(&current_entry->key, key)) {
-            /* We found the entry with the 'mf' miniflow */
-            emc_change_entry(current_entry, flow, NULL);
-            return;
+    if (random_uint32() <= threshold) {
+
+        struct emc_entry *current_entry = emc_entry_get(emc, key->hash);
+        emc_change_entry(current_entry, key, flow);
+    }
+}
+
+static inline struct dp_netdev_flow *
+emc_lookup(struct emc_cache *emc, const struct netdev_flow_key *key)
+{
+    struct emc_entry *current_entry = emc_entry_get(emc, key->hash);
+    struct netdev_flow_key *current_key =
+            (struct netdev_flow_key *) current_entry->key;
+
+    if (current_key->hash == key->hash
+        && emc_entry_alive(current_entry)
+        && netdev_flow_key_equal_mf(current_key, &key->mf)) {
+
+        /* We found the entry with the 'key->mf' miniflow */
+        return current_entry->flow;
+    }
+    return NULL;
+}
+
+static inline struct dfc_entry *
+dfc_entry_get(struct dfc_cache *cache, const uint32_t hash)
+{
+    return &cache->entries[hash & DFC_MASK];
+}
+
+static inline bool
+dfc_entry_alive(struct dfc_entry *ce)
+{
+    return ce->flow && !ce->flow->dead;
+}
+
+static void
+dfc_clear_entry(struct dfc_entry *ce)
+{
+    if (ce->flow) {
+        dp_netdev_flow_unref(ce->flow);
+        ce->flow = NULL;
+    }
+}
+
+static inline void
+dfc_change_entry(struct dfc_entry *ce, struct dp_netdev_flow *flow)
+{
+    if (ce->flow != flow) {
+        if (ce->flow) {
+            dp_netdev_flow_unref(ce->flow);
         }
 
-        /* Replacement policy: put the flow in an empty (not alive) entry, or
-         * in the first entry where it can be */
-        if (!to_be_replaced
-            || (emc_entry_alive(to_be_replaced)
-                && !emc_entry_alive(current_entry))
-            || current_entry->key.hash < to_be_replaced->key.hash) {
-            to_be_replaced = current_entry;
+        if (dp_netdev_flow_ref(flow)) {
+            ce->flow = flow;
+        } else {
+            ce->flow = NULL;
         }
     }
-    /* We didn't find the miniflow in the cache.
-     * The 'to_be_replaced' entry is where the new flow will be stored */
-
-    emc_change_entry(to_be_replaced, flow, key);
 }
 
 static inline void
-emc_probabilistic_insert(struct dp_netdev_pmd_thread *pmd,
-                         const struct netdev_flow_key *key,
-                         struct dp_netdev_flow *flow)
+dfc_insert(struct dp_netdev_pmd_thread *pmd,
+           const struct netdev_flow_key *key,
+           struct dp_netdev_flow *flow)
 {
-    /* Insert an entry into the EMC based on probability value 'min'. By
-     * default the value is UINT32_MAX / 100 which yields an insertion
-     * probability of 1/100 ie. 1% */
+    struct dfc_cache *cache = &pmd->flow_cache;
+    struct dfc_entry *current_entry;
+
+    current_entry = dfc_entry_get(cache, key->hash);
+    dfc_change_entry(current_entry, flow);
+}
+
+static inline struct dp_netdev_flow *
+dfc_lookup(struct dfc_cache *cache, const struct netdev_flow_key *key,
+           bool *exact_match)
+{
+    struct dp_netdev_flow *flow;
 
-    uint32_t min;
-    atomic_read_relaxed(&pmd->dp->emc_insert_min, &min);
+    /* Try an EMC lookup first. */
+    flow = emc_lookup(&cache->emc_cache, key);
+    if (flow) {
+        *exact_match = true;
+        return flow;
+    }
+
+    /* EMC lookup not successful: try DFC lookup. */
+    struct dfc_entry *current_entry = dfc_entry_get(cache, key->hash);
+    flow = current_entry->flow;
 
-    if (min && random_uint32() <= min) {
-        emc_insert(&pmd->flow_cache, key, flow);
+    if (dfc_entry_alive(current_entry) &&
+        dpcls_rule_matches_key(&flow->cr, key)) {
+
+        /* Found a match in DFC. Insert into EMC for subsequent lookups.
+         * We use probabilistic insertion here so that mainly elephant
+         * flows enter EMC. */
+        emc_probabilistic_insert(&cache->emc_cache, key, flow);
+        *exact_match = false;
+        return flow;
+    } else {
+
+        /* No match. Need to go to DPCLS lookup. */
+        return NULL;
     }
 }
 
-static inline struct dp_netdev_flow *
-emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key)
+/* Check and clear dead flow references slowly (one entry at each
+ * invocation).  */
+static void
+emc_slow_sweep(struct emc_cache *emc)
 {
-    struct emc_entry *current_entry;
+    struct emc_entry *entry = &emc->entries[emc->sweep_idx];
 
-    EMC_FOR_EACH_POS_WITH_HASH(cache, current_entry, key->hash) {
-        if (current_entry->key.hash == key->hash
-            && emc_entry_alive(current_entry)
-            && netdev_flow_key_equal_mf(&current_entry->key, &key->mf)) {
+    if (!emc_entry_alive(entry)) {
+        emc_clear_entry(entry);
+    }
+    emc->sweep_idx = (emc->sweep_idx + 1) & EMC_MASK;
+}
 
-            /* We found the entry with the 'key->mf' miniflow */
-            return current_entry->flow;
-        }
+static void
+dfc_slow_sweep(struct dfc_cache *cache)
+{
+    /* Sweep the EMC so that both finish in the same time. */
+    if ((cache->sweep_idx & (DFC_ENTRIES/EMC_ENTRIES - 1)) == 0) {
+        emc_slow_sweep(&cache->emc_cache);
     }
 
-    return NULL;
+    struct dfc_entry *entry = &cache->entries[cache->sweep_idx];
+    if (!dfc_entry_alive(entry)) {
+        dfc_clear_entry(entry);
+    }
+    cache->sweep_idx = (cache->sweep_idx + 1) & DFC_MASK;
 }
 
 static struct dp_netdev_flow *
@@ -4135,7 +4251,7 @@  pmd_thread_main(void *f_)
     ovs_numa_thread_setaffinity_core(pmd->core_id);
     dpdk_set_lcore_id(pmd->core_id);
     poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
-    emc_cache_init(&pmd->flow_cache);
+    dfc_cache_init(&pmd->flow_cache);
 reload:
     pmd_alloc_static_tx_qid(pmd);
 
@@ -4166,8 +4282,7 @@  reload:
         }
 
         if (lc++ > 1024) {
-            bool reload;
-
+            dfc_slow_sweep(&pmd->flow_cache);
             lc = 0;
 
             coverage_try_clear();
@@ -4175,9 +4290,9 @@  reload:
              * iteration, if there were no received packets. */
             pmd_thread_ctx_time_update(pmd);
             dp_netdev_pmd_try_optimize(pmd, poll_list, poll_cnt);
-            if (!ovsrcu_try_quiesce()) {
-                emc_cache_slow_sweep(&pmd->flow_cache);
-            }
+            ovsrcu_try_quiesce();
+
+            bool reload;
 
             atomic_read_relaxed(&pmd->reload, &reload);
             if (reload) {
@@ -4200,7 +4315,7 @@  reload:
         goto reload;
     }
 
-    emc_cache_uninit(&pmd->flow_cache);
+    dfc_cache_uninit(&pmd->flow_cache);
     free(poll_list);
     pmd_free_cached_ports(pmd);
     return NULL;
@@ -4635,7 +4750,7 @@  dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
     /* init the 'flow_cache' since there is no
      * actual thread created for NON_PMD_CORE_ID. */
     if (core_id == NON_PMD_CORE_ID) {
-        emc_cache_init(&pmd->flow_cache);
+        dfc_cache_init(&pmd->flow_cache);
         pmd_alloc_static_tx_qid(pmd);
     }
     cmap_insert(&dp->poll_threads, CONST_CAST(struct cmap_node *, &pmd->node),
@@ -4677,7 +4792,7 @@  dp_netdev_del_pmd(struct dp_netdev *dp, struct dp_netdev_pmd_thread *pmd)
      * but extra cleanup is necessary */
     if (pmd->core_id == NON_PMD_CORE_ID) {
         ovs_mutex_lock(&dp->non_pmd_mutex);
-        emc_cache_uninit(&pmd->flow_cache);
+        dfc_cache_uninit(&pmd->flow_cache);
         pmd_free_cached_ports(pmd);
         pmd_free_static_tx_qid(pmd);
         ovs_mutex_unlock(&dp->non_pmd_mutex);
@@ -4987,7 +5102,7 @@  dp_netdev_queue_batches(struct dp_packet *pkt,
     packet_batch_per_flow_update(batch, pkt, mf);
 }
 
-/* Try to process all ('cnt') the 'packets' using only the exact match cache
+/* Try to process all ('cnt') the 'packets' using only the PMD flow cache
  * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
  * miniflow is copied into 'keys' and the packet pointer is moved at the
  * beginning of the 'packets' array.
@@ -5002,19 +5117,20 @@  dp_netdev_queue_batches(struct dp_packet *pkt,
  * will be ignored.
  */
 static inline size_t
-emc_processing(struct dp_netdev_pmd_thread *pmd,
+dfc_processing(struct dp_netdev_pmd_thread *pmd,
                struct dp_packet_batch *packets_,
                struct netdev_flow_key *keys,
                struct packet_batch_per_flow batches[], size_t *n_batches,
                bool md_is_valid, odp_port_t port_no)
 {
-    struct emc_cache *flow_cache = &pmd->flow_cache;
+    struct dfc_cache *flow_cache = &pmd->flow_cache;
     struct netdev_flow_key *key = &keys[0];
-    size_t n_missed = 0, n_dropped = 0;
+    size_t n_missed = 0, n_dfc_hit = 0, n_emc_hit = 0;
     struct dp_packet *packet;
     const size_t cnt = dp_packet_batch_size(packets_);
     uint32_t cur_min;
     int i;
+    bool exact_match;
 
     atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
 
@@ -5023,7 +5139,6 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
 
         if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
             dp_packet_delete(packet);
-            n_dropped++;
             continue;
         }
 
@@ -5039,7 +5154,7 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
         }
         miniflow_extract(packet, &key->mf);
         key->len = 0; /* Not computed yet. */
-        /* If EMC is disabled skip hash computation and emc_lookup */
+        /* If DFC is disabled skip hash computation and DFC lookup */
         if (cur_min) {
             if (!md_is_valid) {
                 key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
@@ -5047,11 +5162,16 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
             } else {
                 key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
             }
-            flow = emc_lookup(flow_cache, key);
+            flow = dfc_lookup(flow_cache, key, &exact_match);
         } else {
             flow = NULL;
         }
         if (OVS_LIKELY(flow)) {
+            if (exact_match) {
+                n_emc_hit++;
+            } else {
+                n_dfc_hit++;
+            }
             dp_netdev_queue_batches(packet, flow, &key->mf, batches,
                                     n_batches);
         } else {
@@ -5065,8 +5185,8 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
         }
     }
 
-    dp_netdev_count_packet(pmd, DP_STAT_EXACT_HIT,
-                           cnt - n_dropped - n_missed);
+    dp_netdev_count_packet(pmd, DP_STAT_EXACT_HIT, n_emc_hit);
+    dp_netdev_count_packet(pmd, DP_STAT_DFC_HIT, n_dfc_hit);
 
     return dp_packet_batch_size(packets_);
 }
@@ -5135,7 +5255,7 @@  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
                                              add_actions->size);
         }
         ovs_mutex_unlock(&pmd->flow_mutex);
-        emc_probabilistic_insert(pmd, key, netdev_flow);
+        dfc_insert(pmd, key, netdev_flow);
     }
 }
 
@@ -5227,7 +5347,7 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
 
         flow = dp_netdev_flow_cast(rules[i]);
 
-        emc_probabilistic_insert(pmd, &keys[i], flow);
+        dfc_insert(pmd, &keys[i], flow);
         dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
     }
 
@@ -5259,7 +5379,7 @@  dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
     odp_port_t in_port;
 
     n_batches = 0;
-    emc_processing(pmd, packets, keys, batches, &n_batches,
+    dfc_processing(pmd, packets, keys, batches, &n_batches,
                             md_is_valid, port_no);
     if (!dp_packet_batch_is_empty(packets)) {
         /* Get ingress port from first packet's metadata. */
@@ -6195,7 +6315,7 @@  dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
 
 /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
  * in 'mask' the values in 'key' and 'target' are the same. */
-static inline bool
+static bool
 dpcls_rule_matches_key(const struct dpcls_rule *rule,
                        const struct netdev_flow_key *target)
 {